Operations and Tuning Guide¶
This guide focuses on practical tuning for stable long-running fault injection tests.
Baseline Workflow¶
Measure baseline behavior without fault policies.
Enable one policy at a time.
Compare throughput, error rate, and latency deltas.
flowchart TD
Base["Baseline run<br/>no policy"] --> Add["Enable one policy"]
Add --> Measure["Measure throughput/error/latency"]
Measure --> Tune{"Within expected bounds?"}
Tune -->|No| Adjust["Adjust parameters<br/>(latency/jitter/loss/rate)"]
Adjust --> Measure
Tune -->|Yes| Stress["Run stress profile<br/>(smoke or long)"]
Stress --> Checklist["Finalize with operational checklist"]
Diagram focus: iterative tuning loop from baseline to stress validation.
Recommended scripts:
examples/12_perf_baseline.py for baseline vs policy throughput.
tests/integration/test_stress.py --mode smoke for fast validation.
sh tests_long.sh for long runs in a separate path.
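The comparison step of the workflow can be sketched as a small helper that turns two run summaries into deltas. This is an illustrative sketch only: `RunMetrics` and `compare` are hypothetical names, not part of the library or of `examples/12_perf_baseline.py`, and the sample numbers are made up to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """Summary of one run (hypothetical structure, not a library type)."""
    requests: int
    errors: int
    duration_s: float
    p95_ms: float

    @property
    def throughput(self) -> float:
        # Requests completed per second over the whole run.
        return self.requests / self.duration_s

    @property
    def error_rate(self) -> float:
        # Fraction of requests that failed; 0.0 for an empty run.
        return self.errors / self.requests if self.requests else 0.0


def compare(baseline: RunMetrics, policy: RunMetrics) -> dict[str, float]:
    """Return policy-minus-baseline deltas for throughput, errors, and latency."""
    return {
        "throughput_delta_pct": 100.0
        * (policy.throughput - baseline.throughput)
        / baseline.throughput,
        "error_rate_delta": policy.error_rate - baseline.error_rate,
        "p95_delta_ms": policy.p95_ms - baseline.p95_ms,
    }


# Illustrative numbers only: one baseline run and one run with a single policy enabled.
baseline = RunMetrics(requests=10_000, errors=10, duration_s=10.0, p95_ms=12.0)
with_policy = RunMetrics(requests=8_000, errors=40, duration_s=10.0, p95_ms=45.0)
deltas = compare(baseline, with_policy)
```

Keeping the comparison to one policy at a time makes each delta attributable to a single parameter change.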
Policy Tuning¶
Latency and Jitter¶
Start with small values (latency="20ms".."50ms", jitter="2ms".."10ms").
Increase gradually and track p95/p99 response time impact.
Use directional config when needed:
uplink={...} to affect client-to-server path.
downlink={...} to affect server-to-client path.
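Tracking p95/p99 impact needs a consistent percentile definition across runs. The sketch below uses the nearest-rank method on raw response times; `percentile` and `tail_impact` are hypothetical helper names, and the choice of nearest-rank over interpolation is an assumption, not something this guide prescribes.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def tail_impact(baseline_ms: Sequence[float], faulted_ms: Sequence[float]) -> dict[str, float]:
    """Difference in p95 and p99 response time between a faulted run and the baseline."""
    return {
        f"p{pct}_delta_ms": percentile(faulted_ms, pct) - percentile(baseline_ms, pct)
        for pct in (95, 99)
    }


# Illustrative data: 100 baseline samples and the same samples shifted by 20 ms.
baseline_ms = [float(i) for i in range(1, 101)]
faulted_ms = [value + 20.0 for value in baseline_ms]
impact = tail_impact(baseline_ms, faulted_ms)
```

With a pure latency shift the tail deltas should be close to the configured latency; a much larger p99 delta usually points at jitter or queueing effects rather than the fixed delay.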
Packet Reorder¶
Begin with low probability (packet_reorder={"prob": "5%", "window": 4}) and bounded window.
Keep window small first (2..8) to avoid queue pressure.
Use max_delay (e.g., "50ms") only when simulating delayed release behavior.
Packet Duplicate and Loss¶
Combine carefully:
packet_duplicate magnifies traffic volume.
packet_loss and burst_loss reduce delivery ratio.
Validate expected side effects with protocol-specific tests (TCP vs UDP).
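To set expectations before combining duplication and loss, a back-of-the-envelope model helps. The sketch below assumes independent per-packet probabilities and that duplication is applied before loss; both are modelling assumptions, not documented behavior of the library, and burst loss is deliberately left out because it is not independent per packet. `expected_effects` is a hypothetical helper name.

```python
def expected_effects(duplicate_prob: float, loss_prob: float) -> dict[str, float]:
    """Expected per-packet effects under an independent duplicate-then-lose model."""
    d, l = duplicate_prob, loss_prob
    # Each original packet is sent once, plus one extra copy with probability d.
    traffic_multiplier = 1.0 + d
    # Every copy on the wire survives independently with probability (1 - l).
    delivered_copies = traffic_multiplier * (1.0 - l)
    # An original is delivered if at least one of its copies survives:
    # single copy with probability (1 - d), two copies with probability d.
    unique_delivery_ratio = (1.0 - d) * (1.0 - l) + d * (1.0 - l ** 2)
    return {
        "traffic_multiplier": traffic_multiplier,
        "delivered_copies_per_packet": delivered_copies,
        "unique_delivery_ratio": unique_delivery_ratio,
    }


# Example: 10% duplication combined with 5% loss.
effects = expected_effects(duplicate_prob=0.10, loss_prob=0.05)
```

Under this model duplication partially masks loss for datagram traffic, which is one reason to check UDP and TCP behavior separately.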
DNS Faults¶
When using the @dns() decorator:
delay to simulate slow resolver behavior.
timeout to simulate timeout (EAI_AGAIN-style behavior).
nxdomain to simulate name-not-found responses.
When using register_policy(), pass a dns dict with the same keys (delay, timeout, nxdomain).
Prefer isolated DNS scenarios first, then combine with transport-level faults.
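When asserting on DNS faults in tests, it helps to classify what the application actually sees. The sketch below uses only the standard `socket` module; it assumes that an injected timeout surfaces as `socket.gaierror` with `EAI_AGAIN` and an injected NXDOMAIN as `EAI_NONAME`, which matches common resolver behavior but is not confirmed for this library. `classify_dns_failure` and `resolve_outcome` are hypothetical helper names.

```python
import socket
from typing import Callable


def classify_dns_failure(exc: socket.gaierror) -> str:
    """Map a resolver error to the fault type it most likely corresponds to."""
    if exc.errno == socket.EAI_AGAIN:
        return "timeout"   # temporary failure, the EAI_AGAIN-style case
    if exc.errno == socket.EAI_NONAME:
        return "nxdomain"  # name not found
    return "other"


def resolve_outcome(host: str, resolver: Callable = socket.getaddrinfo) -> str:
    """Resolve host and report 'ok' or the classified failure."""
    try:
        resolver(host, None)
    except socket.gaierror as exc:
        return classify_dns_failure(exc)
    return "ok"
```

A test for an isolated DNS scenario can then assert on the returned label, for example `resolve_outcome("service.internal") == "nxdomain"`, where the host name is a placeholder.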
Target Rules¶
Prefer explicit protocol + port when possible to reduce accidental matches.
Use priority to define deterministic resolution when rules overlap.
Keep rule sets focused; large broad CIDRs should have lower priority than exact host rules.
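The interaction between overlapping rules and priority can be illustrated with a toy resolver. This is not the library's matching code: the `Rule` fields, the `resolve` function, and the convention that a larger `priority` value wins are all assumptions made for the illustration.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Rule:
    """Hypothetical target rule: CIDR plus optional protocol and port filters."""
    name: str
    cidr: str
    priority: int
    protocol: Optional[str] = None  # None matches any protocol
    port: Optional[int] = None      # None matches any port

    def matches(self, host: str, protocol: str, port: int) -> bool:
        if self.protocol is not None and self.protocol != protocol:
            return False
        if self.port is not None and self.port != port:
            return False
        return ipaddress.ip_address(host) in ipaddress.ip_network(self.cidr)


def resolve(rules: Sequence[Rule], host: str, protocol: str, port: int) -> Optional[Rule]:
    """Pick the matching rule with the highest priority (assumed: larger wins)."""
    candidates = [rule for rule in rules if rule.matches(host, protocol, port)]
    return max(candidates, key=lambda rule: rule.priority, default=None)


rules = [
    Rule(name="broad-internal", cidr="10.0.0.0/8", priority=10),
    Rule(name="exact-db", cidr="10.1.2.3/32", priority=100, protocol="tcp", port=5432),
]
```

Here the exact host rule wins for `10.1.2.3` on TCP port 5432, while any other address in `10.0.0.0/8` falls through to the broad rule.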
Stress Run Profiles¶
Smoke profile:
Short duration and low worker count.
Use in local iteration and pre-commit checks.
Long profile:
--mode long with sustained concurrency.
Use --max-rss-delta-kb <value> to enforce a memory growth ceiling.
Use for memory stability and long-tail latency checks.
Current calibrated defaults:
STRESS_MAX_ERROR_RATE=0.02 (2%), STRESS_MAX_RSS_DELTA_KB=131072 (128 MiB).
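The two defaults translate into a simple pass/fail check on a finished run. The sketch below mirrors the documented values as module constants; how the real stress test reads them (environment, CLI flag, or constant) is not shown here, and `check_stress_result` is a hypothetical helper, not the test suite's own function.

```python
STRESS_MAX_ERROR_RATE = 0.02        # documented default
STRESS_MAX_RSS_DELTA_KB = 131072    # documented default, 128 MiB


def check_stress_result(
    total_requests: int,
    errors: int,
    rss_start_kb: int,
    rss_end_kb: int,
    max_error_rate: float = STRESS_MAX_ERROR_RATE,
    max_rss_delta_kb: int = STRESS_MAX_RSS_DELTA_KB,
) -> list[str]:
    """Return a list of threshold violations; an empty list means the run passed."""
    failures = []
    error_rate = errors / total_requests if total_requests else 0.0
    if error_rate > max_error_rate:
        failures.append(f"error rate {error_rate:.4f} exceeds {max_error_rate}")
    rss_delta_kb = rss_end_kb - rss_start_kb
    if rss_delta_kb > max_rss_delta_kb:
        failures.append(f"RSS grew by {rss_delta_kb} KB, ceiling is {max_rss_delta_kb} KB")
    return failures
```

Returning all violations at once, rather than stopping at the first, makes long-run reports easier to triage.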
Operational Checklist¶
sh lint.sh (check mode) - requires uv
sh build.sh
sh tests.sh
Optional: run sh tests_long.sh and store results with timestamp.
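The checklist can also be driven from a small script that runs each step in order and writes the long-run output to a timestamped file. This is a sketch under assumptions: the `results` directory, the log file naming, and the helper names are hypothetical; only the four `sh` commands come from the checklist above.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

# Required checklist steps, in the order listed above.
CHECKLIST = [["sh", "lint.sh"], ["sh", "build.sh"], ["sh", "tests.sh"]]


def result_path(directory: str, now: Optional[datetime] = None) -> Path:
    """Build a timestamped log path (UTC) for one long run; naming is illustrative."""
    now = now or datetime.now(timezone.utc)
    return Path(directory) / f"tests_long_{now:%Y%m%dT%H%M%SZ}.log"


def run_checklist() -> None:
    """Run the required steps and stop at the first failing command."""
    for command in CHECKLIST:
        subprocess.run(command, check=True)


def run_long(directory: str = "results") -> Path:
    """Run the optional long tests and capture their output in a timestamped log."""
    path = result_path(directory)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as log:
        subprocess.run(
            ["sh", "tests_long.sh"], stdout=log, stderr=subprocess.STDOUT, check=True
        )
    return path
```

Calling `run_checklist()` followed by `run_long()` from the repository root reproduces the checklist; UTC timestamps keep logs from different machines sortable.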