# Operations and Tuning Guide
This guide focuses on practical tuning for stable long-running fault injection tests.
## Baseline Workflow
1. Measure baseline behavior without fault policies.
2. Enable one policy at a time.
3. Compare throughput, error rate, and latency deltas.
```mermaid
flowchart TD
Base["Baseline run
no policy"] --> Add["Enable one policy"]
Add --> Measure["Measure throughput/error/latency"]
Measure --> Tune{"Within expected bounds?"}
Tune -->|No| Adjust["Adjust parameters
(latency/jitter/loss/rate)"]
Adjust --> Measure
Tune -->|Yes| Stress["Run stress profile
(smoke or long)"]
Stress --> Checklist["Finalize with operational checklist"]
```
Diagram focus: iterative tuning loop from baseline to stress validation.
Recommended scripts:
- `examples/12_perf_baseline.py` for baseline vs policy throughput.
- `tests/integration/test_stress.py --mode smoke` for fast validation.
- `sh tests_long.sh` for long runs in a separate path.
## Policy Tuning
### Latency and Jitter
- Start with small values (`latency="20ms".."50ms"`, `jitter="2ms".."10ms"`).
- Increase gradually and track `p95/p99` response time impact.
- Use directional config when needed:
- `uplink={...}` to affect client-to-server path.
- `downlink={...}` to affect server-to-client path.
### Packet Reorder
- Begin with low probability (`packet_reorder={"prob": "5%", "window": 4}`) and bounded window.
- Keep `window` small first (`2..8`) to avoid queue pressure.
- Use `max_delay` (e.g., `"50ms"`) only when simulating delayed release behavior.
### Packet Duplicate and Loss
- Combine carefully:
- `packet_duplicate` magnifies traffic volume.
- `packet_loss` and `burst_loss` reduce delivery ratio.
- Validate expected side effects with protocol-specific tests (TCP vs UDP).
### DNS Faults
When using the `@dns()` decorator:
- `delay` to simulate slow resolver behavior.
- `timeout` to simulate timeout (`EAI_AGAIN` style behavior).
- `nxdomain` to simulate name-not-found responses.
When using `register_policy()`, pass a `dns` dict with the same keys (`delay`, `timeout`, `nxdomain`).
Prefer isolated DNS scenarios first, then combine with transport-level faults.
### Target Rules
- Prefer explicit protocol + port when possible to reduce accidental matches.
- Use `priority` to define deterministic resolution when rules overlap.
- Keep rule sets focused; large broad CIDRs should have lower priority than exact host rules.
## Stress Run Profiles
- Smoke profile:
- Short duration and low worker count.
- Use in local iteration and pre-commit checks.
- Long profile:
- `--mode long` with sustained concurrency.
- Use `--max-rss-delta-kb ` to enforce a memory growth ceiling.
- Use for memory stability and long-tail latency checks.
- Current calibrated defaults: `STRESS_MAX_ERROR_RATE=0.02`, `STRESS_MAX_RSS_DELTA_KB=131072`.
## Operational Checklist
1. `sh lint.sh` (check mode) - requires `uv`
2. `sh build.sh`
3. `sh tests.sh`
4. Optional: run `sh tests_long.sh` and store results with timestamp.