# Architecture This document describes the current `faultcore` architecture using a runtime-stage execution graph (`R0..R8`). ## Design Goals - Keep network fault logic in `faultcore_network`. - Keep `faultcore_interceptor` as a thin Linux syscall adapter (`LD_PRELOAD` boundary). - Keep Python API ergonomic while preserving a stable SHM contract. - Keep layer execution deterministic and testable. ## High-Level Layout ![Faultcore interceptor flow](assets/faultcore-flow.svg) - `src/faultcore/`: Python API, decorators, policy registry, SHM writer. - `faultcore_network/`: runtime-stage engine, SHM contract/runtime, socket metadata helpers, interceptor bridge. - `faultcore_interceptor/`: syscall hooks and original libc dispatch (`dlsym`/`RTLD_NEXT`). - `tests/`: Python unit/integration test suites. - Rust tests live in each Rust crate and are executed with `cargo test --manifest-path ...`. ### Module Layout Diagram ```mermaid flowchart LR P["src/faultcore
Python API + SHM writer"] N["faultcore_network
Runtime stages + SHM runtime + bridge"] I["faultcore_interceptor
LD_PRELOAD hooks + libc dispatch"] T["tests
Python + Rust coverage"] P -->|"writes policy rows"| N I -->|"delegates decisions"| N T -->|"validates"| P T -->|"validates"| N T -->|"validates"| I ``` Diagram focus: ownership boundaries and primary call/data edges. ## Runtime Flow 1. Python decorator/policy writes a `FaultcoreConfig` row into SHM (`tid` slot). 2. Interceptor hook is called (`send`, `recv`, `connect`, `sendto`, `recvfrom`, `getaddrinfo`). 3. Interceptor asks `faultcore_network::interceptor_bridge` for effective runtime config. 4. `ChaosEngine` runs runtime stages with operation-specific applicability (`R0..R8`). 5. `InterceptorRuntime` maps `LayerDecision` to syscall directives (`sleep`, return value, `errno`). 6. Interceptor applies directive and delegates to original libc function if needed. ### Runtime Sequence Diagram ```mermaid sequenceDiagram participant Py as Python Decorator/Policy participant SHM as SHM Slot (tid) participant Hook as Interceptor Hook participant Bridge as interceptor_bridge participant Engine as ChaosEngine (R0..R8) participant RT as InterceptorRuntime participant Libc as Original libc syscall Py->>SHM: write FaultcoreConfig Hook->>Bridge: read effective config Bridge->>Engine: build PacketContext + run stages Engine-->>RT: LayerDecision RT-->>Hook: syscall directive (sleep/drop/errno/continue) alt Continue path Hook->>Libc: call original syscall Libc-->>Hook: return rc/errno else Terminal path Hook-->>Hook: apply terminal behavior end ``` Diagram focus: runtime interaction order from Python write to syscall result. ## Module Responsibilities ### Python package (`src/faultcore`) - `__init__.py`: public API surface, stable export list. - `decorator.py`: decorator behavior, policy registration/validation, context wiring. - `shm_writer.py`: SHM open/write semantics and binary writes for policy fields. ### `faultcore_network/src` - `lib.rs`: public composition and re-exports. - `layers/mod.rs`: shared layer contracts (`Layer`, `LayerDecision`, `PacketContext`, `LayerStage`). - `layers/r1_session_guard.rs`: session-level budget pre-checks (`max_bytes_tx/rx`, `max_ops`, `max_duration_ms`) with terminal actions. - `layers/r2_chaos_base.rs`: latency, packet loss, burst loss, correlated loss, reorder/duplicate trigger decisions. - `layers/r3_flow_control.rs`: bandwidth/token-bucket shaping. - `layers/r4_timing_variation.rs`: jitter/routing variance. - `layers/r5_transport_faults.rs`: connect/recv timeouts and transport-level error injection. - `layers/r6_resolver_faults.rs`: DNS delay/timeout/NXDOMAIN decisions. - `layers/r7_payload_transform.rs`: deterministic payload mutation decisions and buffer-aware mutation for stream operations. - `chaos_engine.rs`: runtime-stage orchestration and stage metrics. - `runtime.rs`: interceptor runtime state (non-blocking delay tracking, reorder queues) and decision-to-directive mapping. - `shm_contract.rs`: SHM binary schema/constants/validation (`FaultcoreConfig`, offsets, limits). - `shm_runtime.rs`: SHM mapping, stable reads, `tid`/`fd` assignment helpers. - `socket_runtime.rs`: socket metadata extraction (`protocol`, peer/addr endpoint, monotonic clock). - `interceptor_bridge.rs`: single façade used by interceptor to fetch effective runtime config and bind/clear fd policy. - `setpriority_compat.rs`: optional compatibility shim for legacy `setpriority` control path. - `observability.rs`: metrics collection and reporting for fault decisions. - `record_replay.rs`: deterministic decision capture and replay functionality. ### `faultcore_interceptor/src` - `lib.rs`: only hook-facing concerns: - resolve and call original libc symbols; - recursion guard; - hook entrypoints; - call into `faultcore_network` for config + decision mapping; - apply syscall-level behavior. ## Runtime Stage Semantics - Stage order is canonical: `R0 -> R1 -> R2 -> R3 -> R4 -> R5 -> R6 -> R7 -> R8`. - Main pipeline accepts terminal outcomes (`Drop`, `TimeoutMs`, `ConnectionErrorKind`, `NxDomain`) and short-circuits. - Delay decisions are accumulated. - R7 can emit `LayerDecision::Mutate(Vec)` for payload transformation in stream paths. - Reorder and duplicate are post-routing stream behaviors, handled outside main pipeline. - DNS path evaluates resolver layer behavior and skips non-DNS effects by layer applicability. - Gilbert-Elliott correlated-loss state is tracked per FD in `L1Chaos` to avoid cross-flow coupling. ### R7 Payload Mutation Semantics - Applies only to stream operations (`Send`, `Recv`). - Never applies to `Connect` or `DnsLookup`. - Directional targeting: - `both` - `uplink_only` - `downlink_only` - Selection gates: - probability (`payload_mutation_prob_ppm`) - packet cadence (`payload_mutation_every_n_packets`) - size bounds (`payload_mutation_min_size`/`payload_mutation_max_size`) - Supported mutation primitives: - truncate - corrupt_bytes - inject_bytes - replace_pattern - corrupt_encoding - swap_bytes - `dry_run` preserves bytes while still emitting mutation decision for observability. ### Stream Hook Ordering with Mutation - Uplink (`send`/`sendto`): evaluate decision, mutate before syscall, then reorder/duplicate operate on mutated bytes. - Downlink (`recv`/`recvfrom`): execute syscall first, then mutate only the received span before exposing to caller. - Non-blocking reorder paths that return `EAGAIN` do not apply mutation in that call. ### Fault Decision State Diagram ```mermaid stateDiagram-v2 [*] --> Evaluate Evaluate --> Evaluate: AddDelay Evaluate --> Continue: No terminal decision Evaluate --> Drop: Drop Evaluate --> Timeout: TimeoutMs Evaluate --> ConnErr: ConnectionErrorKind Evaluate --> NxDomain: NxDomain Continue --> [*] Drop --> [*] Timeout --> [*] ConnErr --> [*] NxDomain --> [*] ``` Diagram focus: accumulated delay vs terminal short-circuit decisions. ### Reorder Matrix - `send`: supported (`LayerDecision::StageReorder` + per-FD staging queue) - `sendto`: supported (`LayerDecision::StageReorder` + per-FD staging queue) - `recv`: supported (blocking and non-blocking, using dedicated recv pending queue) - `recvfrom`: supported (blocking and non-blocking, using dedicated recv pending queue) ## Ownership Boundaries - Business/network fault behavior belongs to `faultcore_network`. - Linux interception details belong to `faultcore_interceptor`. - SHM binary contract belongs to `faultcore_network::shm_contract` and must be reflected by Python writer. ### FD/TID Ownership Model - Socket creation/binding (`socket` hook) binds an `fd` to the current thread slot via `bind_fd_to_current_thread`. - Runtime config resolution for stream hooks prefers the `fd` owner slot (`get_tid_slot_for_fd`) when present. - If the owner slot cannot be resolved to a valid config row, resolution falls back to current thread `tid` config. - This makes cross-thread `fd` handoff deterministic: the bound owner policy stays authoritative unless explicitly cleared/rebound. - FD aliasing hooks (`dup`, `dup2`, `dup3`, `accept`, `accept4`) clone the owner binding from source/listener FD to the new FD. ## Change Rules - SHM layout changes must update: - `faultcore_network/src/shm_contract.rs` - `src/faultcore/shm_writer.py` - `tests/unit/test_shm_contract.py` - `docs/shm_protocol.md` - New network fault behavior should be implemented as a layer concern first, then mapped through `LayerDecision`. - Interceptor should not gain new policy interpretation logic.