# Operations and Tuning Guide This guide focuses on practical tuning for stable long-running fault injection tests. ## Baseline Workflow 1. Measure baseline behavior without fault policies. 2. Enable one policy at a time. 3. Compare throughput, error rate, and latency deltas. ```mermaid flowchart TD Base["Baseline run
no policy"] --> Add["Enable one policy"] Add --> Measure["Measure throughput/error/latency"] Measure --> Tune{"Within expected bounds?"} Tune -->|No| Adjust["Adjust parameters
(latency/jitter/loss/rate)"] Adjust --> Measure Tune -->|Yes| Stress["Run stress profile
(smoke or long)"] Stress --> Checklist["Finalize with operational checklist"] ``` Diagram focus: iterative tuning loop from baseline to stress validation. Recommended scripts: - `examples/12_perf_baseline.py` for baseline vs policy throughput. - `tests/integration/test_stress.py --mode smoke` for fast validation. - `sh tests_long.sh` for long runs in a separate path. ## Policy Tuning ### Latency and Jitter - Start with small values (`latency="20ms".."50ms"`, `jitter="2ms".."10ms"`). - Increase gradually and track `p95/p99` response time impact. - Use directional config when needed: - `uplink={...}` to affect client-to-server path. - `downlink={...}` to affect server-to-client path. ### Packet Reorder - Begin with low probability (`packet_reorder={"prob": "5%", "window": 4}`) and bounded window. - Keep `window` small first (`2..8`) to avoid queue pressure. - Use `max_delay` (e.g., `"50ms"`) only when simulating delayed release behavior. ### Packet Duplicate and Loss - Combine carefully: - `packet_duplicate` magnifies traffic volume. - `packet_loss` and `burst_loss` reduce delivery ratio. - Validate expected side effects with protocol-specific tests (TCP vs UDP). ### DNS Faults When using the `@dns()` decorator: - `delay` to simulate slow resolver behavior. - `timeout` to simulate timeout (`EAI_AGAIN` style behavior). - `nxdomain` to simulate name-not-found responses. When using `register_policy()`, pass a `dns` dict with the same keys (`delay`, `timeout`, `nxdomain`). Prefer isolated DNS scenarios first, then combine with transport-level faults. ### Target Rules - Prefer explicit protocol + port when possible to reduce accidental matches. - Use `priority` to define deterministic resolution when rules overlap. - Keep rule sets focused; large broad CIDRs should have lower priority than exact host rules. ## Stress Run Profiles - Smoke profile: - Short duration and low worker count. - Use in local iteration and pre-commit checks. - Long profile: - `--mode long` with sustained concurrency. - Use `--max-rss-delta-kb ` to enforce a memory growth ceiling. - Use for memory stability and long-tail latency checks. - Current calibrated defaults: `STRESS_MAX_ERROR_RATE=0.02`, `STRESS_MAX_RSS_DELTA_KB=131072`. ## Operational Checklist 1. `sh lint.sh` (check mode) - requires `uv` 2. `sh build.sh` 3. `sh tests.sh` 4. Optional: run `sh tests_long.sh` and store results with timestamp.