Operations and Tuning Guide

This guide focuses on practical tuning for stable long-running fault injection tests.

Baseline Workflow

  1. Measure baseline behavior without fault policies.

  2. Enable one policy at a time.

  3. Compare throughput, error rate, and latency deltas.

    flowchart TD
        Base["Baseline run<br/>no policy"] --> Add["Enable one policy"]
        Add --> Measure["Measure throughput/error/latency"]
        Measure --> Tune{"Within expected bounds?"}
        Tune -->|No| Adjust["Adjust parameters<br/>(latency/jitter/loss/rate)"]
        Adjust --> Measure
        Tune -->|Yes| Stress["Run stress profile<br/>(smoke or long)"]
        Stress --> Checklist["Finalize with operational checklist"]

Diagram focus: iterative tuning loop from baseline to stress validation.
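
The comparison in step 3 can be sketched as a small helper. This is an illustrative sketch only: RunMetrics, its field names, and compare_runs are hypothetical and are not part of the project's API or of examples/12_perf_baseline.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """Summary of one run. Field names are illustrative, not the project's schema."""
    throughput_rps: float   # requests per second
    error_rate: float       # fraction of failed requests, 0.0..1.0
    p95_latency_ms: float   # 95th percentile latency in milliseconds


def compare_runs(baseline: RunMetrics, policy: RunMetrics) -> dict[str, float]:
    """Return the three deltas from step 3 (policy run relative to baseline)."""
    if baseline.throughput_rps <= 0:
        raise ValueError("baseline throughput must be positive")
    return {
        # Relative change: -0.25 means throughput dropped by 25%.
        "throughput_change": (policy.throughput_rps - baseline.throughput_rps)
        / baseline.throughput_rps,
        # Absolute change in the failure fraction.
        "error_rate_delta": policy.error_rate - baseline.error_rate,
        # Absolute change in milliseconds.
        "p95_latency_delta_ms": policy.p95_latency_ms - baseline.p95_latency_ms,
    }


# Hypothetical numbers for a baseline run and a run with one policy enabled.
baseline = RunMetrics(throughput_rps=1000.0, error_rate=0.0, p95_latency_ms=12.0)
with_latency = RunMetrics(throughput_rps=750.0, error_rate=0.01, p95_latency_ms=45.0)
deltas = compare_runs(baseline, with_latency)
```

Because only one policy is enabled per run, each delta can be attributed to that policy before the next one is added.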

Recommended scripts:

  • examples/12_perf_baseline.py for baseline vs policy throughput.

  • tests/integration/test_stress.py --mode smoke for fast validation.

  • sh tests_long.sh for long runs in a separate path.

Policy Tuning

Latency and Jitter

  • Start with small values: latency between "20ms" and "50ms", jitter between "2ms" and "10ms".

  • Increase gradually and track p95/p99 response time impact.

  • Use directional config when needed:

    • uplink={...} to affect client-to-server path.

    • downlink={...} to affect server-to-client path.
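
To track p95/p99 impact while raising latency and jitter, the percentiles can be computed directly from recorded response times. The helper below is a minimal sketch using the nearest-rank definition; the function name and the sample data are assumptions for illustration, and the project's own scripts may compute percentiles differently.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the sample that covers pct% of the data.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical response times in milliseconds: 1.0, 2.0, ..., 100.0.
latencies_ms = [float(v) for v in range(1, 101)]
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

Comparing p95 and p99 against the baseline run after each increase shows when tail latency starts to grow faster than the configured latency itself.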

Packet Reorder

  • Begin with a low probability and a bounded window, for example packet_reorder={"prob": "5%", "window": 4}.

  • Keep the window small at first (2..8) to avoid queue pressure.

  • Use max_delay (e.g., "50ms") only when simulating delayed release behavior.
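
The effect of prob and window can be pictured with a toy model. The sketch below is not the library's reorder algorithm; it assumes a simple hold-back scheme in which a selected packet is released up to window - 1 slots late, and simulate_reorder is a hypothetical name. Under that assumption no packet moves more than window - 1 positions from its original place, which is why a small window keeps queue pressure low.

```python
import random


def simulate_reorder(packets: list[int], prob: float, window: int, seed: int = 0) -> list[int]:
    """Toy hold-back reorder: each packet is delayed by 1..window-1 slots with probability prob."""
    if window < 2:
        raise ValueError("window must be at least 2")
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must be in [0, 1]")
    rng = random.Random(seed)  # seeded so the run is repeatable
    keyed = []
    for index, packet in enumerate(packets):
        delay = rng.randint(1, window - 1) if rng.random() < prob else 0
        # Release slot first, original index second to break ties in arrival order.
        keyed.append((index + delay, index, packet))
    keyed.sort()
    return [packet for _, _, packet in keyed]


sent = list(range(200))
received = simulate_reorder(sent, prob=0.05, window=4)
# Packet values equal their original positions, so this is the largest shift observed.
max_shift = max(abs(pos - pkt) for pos, pkt in enumerate(received))
```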

Packet Duplicate and Loss

  • Combine carefully:

    • packet_duplicate magnifies traffic volume.

    • packet_loss and burst_loss reduce delivery ratio.

  • Validate expected side effects with protocol-specific tests (TCP vs UDP).
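
A rough estimate of the combined effect helps set expectations before running protocol-specific tests. The sketch below assumes loss and duplication are applied independently to each packet, which may not match how the policies are actually sequenced; expected_delivery_ratio is a hypothetical helper and burst_loss is not modeled.

```python
def expected_delivery_ratio(loss_prob: float, duplicate_prob: float) -> float:
    """Expected packets delivered per packet sent, assuming independent loss and duplication."""
    for name, value in (("loss_prob", loss_prob), ("duplicate_prob", duplicate_prob)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    # Each packet yields (1 + duplicate_prob) copies on average,
    # and each copy survives with probability (1 - loss_prob).
    return (1.0 - loss_prob) * (1.0 + duplicate_prob)


# Hypothetical settings: 10% loss combined with 5% duplication.
ratio = expected_delivery_ratio(loss_prob=0.10, duplicate_prob=0.05)
```

A ratio below 1.0 means net loss dominates; above 1.0 means the receiver sees more packets than were sent. TCP hides both effects behind retransmission and deduplication, while UDP exposes them directly to the application.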

DNS Faults

When using the @dns() decorator, use these options:

  • delay to simulate slow resolver behavior.

  • timeout to simulate a resolver timeout (EAI_AGAIN-style behavior).

  • nxdomain to simulate name-not-found responses.

When using register_policy(), pass a dns dict with the same keys (delay, timeout, nxdomain).

Prefer isolated DNS scenarios first, then combine with transport-level faults.
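
On the application side, an isolated DNS scenario is easiest to verify by checking which resolver error the code under test sees. The sketch below uses only the standard socket module and assumes the injected faults surface as socket.gaierror with the usual EAI_* codes; classify_dns_failure and resolve are hypothetical helpers, not part of this project.

```python
import socket


def classify_dns_failure(exc: socket.gaierror) -> str:
    """Map a resolver error to the fault category it most likely corresponds to."""
    if exc.errno == socket.EAI_AGAIN:
        return "transient"   # temporary failure, e.g. a resolver timeout
    if exc.errno == socket.EAI_NONAME:
        return "not_found"   # name does not resolve, e.g. NXDOMAIN
    return "other"


def resolve(host: str, port: int = 443) -> str:
    """Resolve a host and report either success or the failure category."""
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror as exc:
        return classify_dns_failure(exc)
    return "resolved"
```

A test for the timeout fault would then expect "transient", and a test for nxdomain would expect "not_found". Retry logic usually applies only to the transient case.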

Target Rules

  • Prefer explicit protocol + port when possible to reduce accidental matches.

  • Use priority to define deterministic resolution when rules overlap.

  • Keep rule sets focused; broad CIDR ranges should have lower priority than exact host rules.
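
The interaction between specificity and priority can be sketched with a small resolver. Everything here is hypothetical: the Rule class, its fields, and resolve_rule are illustrative and do not reflect the library's rule schema, and the sketch assumes that a larger priority number wins.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """Illustrative target rule; None for protocol or port means 'match any'."""
    name: str
    cidr: str
    priority: int
    protocol: Optional[str] = None
    port: Optional[int] = None

    def matches(self, host: str, protocol: str, port: int) -> bool:
        if self.protocol is not None and self.protocol != protocol:
            return False
        if self.port is not None and self.port != port:
            return False
        return ipaddress.ip_address(host) in ipaddress.ip_network(self.cidr)


def resolve_rule(rules: list[Rule], host: str, protocol: str, port: int) -> Optional[Rule]:
    """Pick the highest-priority matching rule (assumption: larger number wins)."""
    candidates = [rule for rule in rules if rule.matches(host, protocol, port)]
    return max(candidates, key=lambda rule: rule.priority, default=None)


rules = [
    Rule(name="broad-net", cidr="10.0.0.0/8", priority=10),
    Rule(name="exact-db", cidr="10.0.0.5/32", priority=100, protocol="tcp", port=5432),
]
```

With these rules, TCP traffic to 10.0.0.5:5432 matches both entries and resolves to exact-db because of its higher priority, while other traffic inside 10.0.0.0/8 falls through to broad-net.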

Stress Run Profiles

  • Smoke profile:

    • Short duration and low worker count.

    • Use in local iteration and pre-commit checks.

  • Long profile:

    • --mode long with sustained concurrency.

    • Use --max-rss-delta-kb <value> to enforce a memory growth ceiling.

    • Use for memory stability and long-tail latency checks.

    • Current calibrated defaults: STRESS_MAX_ERROR_RATE=0.02, STRESS_MAX_RSS_DELTA_KB=131072.
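
The two calibrated defaults can be read as pass/fail gates on a finished run. The check below is a sketch of that idea, not the logic in tests/integration/test_stress.py; check_stress_result and its parameters are hypothetical, and it assumes a run exactly at a limit still passes.

```python
MAX_ERROR_RATE = 0.02        # mirrors STRESS_MAX_ERROR_RATE
MAX_RSS_DELTA_KB = 131072    # mirrors STRESS_MAX_RSS_DELTA_KB (128 MiB)


def check_stress_result(
    total_requests: int,
    failed_requests: int,
    rss_start_kb: int,
    rss_end_kb: int,
    max_error_rate: float = MAX_ERROR_RATE,
    max_rss_delta_kb: int = MAX_RSS_DELTA_KB,
) -> list[str]:
    """Return a list of threshold violations; an empty list means the run passed."""
    violations = []
    error_rate = failed_requests / total_requests if total_requests else 0.0
    if error_rate > max_error_rate:
        violations.append(f"error rate {error_rate:.4f} exceeds {max_error_rate}")
    rss_delta_kb = rss_end_kb - rss_start_kb
    if rss_delta_kb > max_rss_delta_kb:
        violations.append(f"RSS grew by {rss_delta_kb} KB, limit is {max_rss_delta_kb} KB")
    return violations


# Hypothetical long run: 1.5% errors and 100000 KB of memory growth.
result = check_stress_result(
    total_requests=100_000, failed_requests=1_500,
    rss_start_kb=50_000, rss_end_kb=150_000,
)
```

Reporting every violation instead of stopping at the first makes it easier to see whether a long run failed on errors, memory growth, or both.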

Operational Checklist

  1. sh lint.sh (check mode) - requires uv

  2. sh build.sh

  3. sh tests.sh

  4. Optional: run sh tests_long.sh and store the results with a timestamp.