# Operations and Tuning Guide

This guide focuses on practical tuning for stable long-running fault injection tests.

## Baseline Workflow

1. Measure baseline behavior without fault policies.
2. Enable one policy at a time.
3. Compare throughput, error rate, and latency deltas.

```mermaid
flowchart TD
    Base["Baseline run<br/>no policy"] --> Add["Enable one policy"]
    Add --> Measure["Measure throughput/error/latency"]
    Measure --> Tune{"Within expected bounds?"}
    Tune -->|No| Adjust["Adjust parameters<br/>(latency/jitter/loss/rate)"]
    Adjust --> Measure
    Tune -->|Yes| Stress["Run stress profile<br/>(smoke or long)"]
    Stress --> Checklist["Finalize with operational checklist"]
```

Diagram focus: iterative tuning loop from baseline to stress validation.

Recommended scripts:
- `examples/12_perf_baseline.py` for baseline vs policy throughput.
- `tests/integration/test_stress.py --mode smoke` for fast validation.
- `sh tests_long.sh` for long runs in a separate path.

## Policy Tuning

### Latency and Jitter

- Start with small values (`latency="20ms".."50ms"`, `jitter="2ms".."10ms"`).
- Increase gradually and track `p95/p99` response time impact.
- Use directional config when needed:
  - `uplink={...}` to affect client-to-server path.
  - `downlink={...}` to affect server-to-client path.

### Packet Reorder

- Begin with low probability (`packet_reorder={"prob": "5%", "window": 4}`) and bounded window.
- Keep `window` small first (`2..8`) to avoid queue pressure.
- Use `max_delay` (e.g., `"50ms"`) only when simulating delayed release behavior.

### Packet Duplicate and Loss

- Combine carefully:
  - `packet_duplicate` magnifies traffic volume.
  - `packet_loss` and `burst_loss` reduce delivery ratio.
- Validate expected side effects with protocol-specific tests (TCP vs UDP).

### DNS Faults

When using the `@dns()` decorator:
- `delay` to simulate slow resolver behavior.
- `timeout` to simulate timeout (`EAI_AGAIN` style behavior).
- `nxdomain` to simulate name-not-found responses.

When using `register_policy()`, pass a `dns` dict with the same keys (`delay`, `timeout`, `nxdomain`).

Prefer isolated DNS scenarios first, then combine with transport-level faults.

### Target Rules

- Prefer explicit protocol + port when possible to reduce accidental matches.
- Use `priority` to define deterministic resolution when rules overlap.
- Keep rule sets focused; large broad CIDRs should have lower priority than exact host rules.

## Stress Run Profiles

- Smoke profile:
  - Short duration and low worker count.
  - Use in local iteration and pre-commit checks.
- Long profile:
  - `--mode long` with sustained concurrency.
  - Use `--max-rss-delta-kb <value>` to enforce a memory growth ceiling.
  - Use for memory stability and long-tail latency checks.
  - Current calibrated defaults: `STRESS_MAX_ERROR_RATE=0.02`, `STRESS_MAX_RSS_DELTA_KB=131072`.

## Operational Checklist

1. `sh lint.sh` (check mode) - requires `uv`
2. `sh build.sh`
3. `sh tests.sh`
4. Optional: run `sh tests_long.sh` and store results with timestamp.