Architecture¶
This document describes the current faultcore architecture using a runtime-stage execution graph (R0..R8).
Design Goals¶
Keep network fault logic in
faultcore_network.Keep
faultcore_interceptoras a thin Linux syscall adapter (LD_PRELOADboundary).Keep Python API ergonomic while preserving a stable SHM contract.
Keep layer execution deterministic and testable.
High-Level Layout¶
src/faultcore/: Python API, decorators, policy registry, SHM writer.faultcore_network/: runtime-stage engine, SHM contract/runtime, socket metadata helpers, interceptor bridge.faultcore_interceptor/: syscall hooks and original libc dispatch (dlsym/RTLD_NEXT).tests/: Python unit/integration test suites.Rust tests live in each Rust crate and are executed with
cargo test --manifest-path ....
Module Layout Diagram¶
flowchart LR
P["src/faultcore<br/>Python API + SHM writer"]
N["faultcore_network<br/>Runtime stages + SHM runtime + bridge"]
I["faultcore_interceptor<br/>LD_PRELOAD hooks + libc dispatch"]
T["tests<br/>Python + Rust coverage"]
P -->|"writes policy rows"| N
I -->|"delegates decisions"| N
T -->|"validates"| P
T -->|"validates"| N
T -->|"validates"| I
Diagram focus: ownership boundaries and primary call/data edges.
Runtime Flow¶
Python decorator/policy writes a
FaultcoreConfigrow into SHM (tidslot).Interceptor hook is called (
send,recv,connect,sendto,recvfrom,getaddrinfo).Interceptor asks
faultcore_network::interceptor_bridgefor effective runtime config.ChaosEngineruns runtime stages with operation-specific applicability (R0..R8).InterceptorRuntimemapsLayerDecisionto syscall directives (sleep, return value,errno).Interceptor applies directive and delegates to original libc function if needed.
Runtime Sequence Diagram¶
sequenceDiagram
participant Py as Python Decorator/Policy
participant SHM as SHM Slot (tid)
participant Hook as Interceptor Hook
participant Bridge as interceptor_bridge
participant Engine as ChaosEngine (R0..R8)
participant RT as InterceptorRuntime
participant Libc as Original libc syscall
Py->>SHM: write FaultcoreConfig
Hook->>Bridge: read effective config
Bridge->>Engine: build PacketContext + run stages
Engine-->>RT: LayerDecision
RT-->>Hook: syscall directive (sleep/drop/errno/continue)
alt Continue path
Hook->>Libc: call original syscall
Libc-->>Hook: return rc/errno
else Terminal path
Hook-->>Hook: apply terminal behavior
end
Diagram focus: runtime interaction order from Python write to syscall result.
Module Responsibilities¶
Python package (src/faultcore)¶
__init__.py: public API surface, stable export list.decorator.py: decorator behavior, policy registration/validation, context wiring.shm_writer.py: SHM open/write semantics and binary writes for policy fields.
faultcore_network/src¶
lib.rs: public composition and re-exports.layers/mod.rs: shared layer contracts (Layer,LayerDecision,PacketContext,LayerStage).layers/r1_session_guard.rs: session-level budget pre-checks (max_bytes_tx/rx,max_ops,max_duration_ms) with terminal actions.layers/r2_chaos_base.rs: latency, packet loss, burst loss, correlated loss, reorder/duplicate trigger decisions.layers/r3_flow_control.rs: bandwidth/token-bucket shaping.layers/r4_timing_variation.rs: jitter/routing variance.layers/r5_transport_faults.rs: connect/recv timeouts and transport-level error injection.layers/r6_resolver_faults.rs: DNS delay/timeout/NXDOMAIN decisions.layers/r7_payload_transform.rs: deterministic payload mutation decisions and buffer-aware mutation for stream operations.chaos_engine.rs: runtime-stage orchestration and stage metrics.runtime.rs: interceptor runtime state (non-blocking delay tracking, reorder queues) and decision-to-directive mapping.shm_contract.rs: SHM binary schema/constants/validation (FaultcoreConfig, offsets, limits).shm_runtime.rs: SHM mapping, stable reads,tid/fdassignment helpers.socket_runtime.rs: socket metadata extraction (protocol, peer/addr endpoint, monotonic clock).interceptor_bridge.rs: single façade used by interceptor to fetch effective runtime config and bind/clear fd policy.setpriority_compat.rs: optional compatibility shim for legacysetprioritycontrol path.observability.rs: metrics collection and reporting for fault decisions.record_replay.rs: deterministic decision capture and replay functionality.
faultcore_interceptor/src¶
lib.rs: only hook-facing concerns:resolve and call original libc symbols;
recursion guard;
hook entrypoints;
call into
faultcore_networkfor config + decision mapping;apply syscall-level behavior.
Runtime Stage Semantics¶
Stage order is canonical:
R0 -> R1 -> R2 -> R3 -> R4 -> R5 -> R6 -> R7 -> R8.Main pipeline accepts terminal outcomes (
Drop,TimeoutMs,ConnectionErrorKind,NxDomain) and short-circuits.Delay decisions are accumulated.
R7 can emit
LayerDecision::Mutate(Vec<Mutation>)for payload transformation in stream paths.Reorder and duplicate are post-routing stream behaviors, handled outside main pipeline.
DNS path evaluates resolver layer behavior and skips non-DNS effects by layer applicability.
Gilbert-Elliott correlated-loss state is tracked per FD in
L1Chaosto avoid cross-flow coupling.
R7 Payload Mutation Semantics¶
Applies only to stream operations (
Send,Recv).Never applies to
ConnectorDnsLookup.Directional targeting:
bothuplink_onlydownlink_only
Selection gates:
probability (
payload_mutation_prob_ppm)packet cadence (
payload_mutation_every_n_packets)size bounds (
payload_mutation_min_size/payload_mutation_max_size)
Supported mutation primitives:
truncate
corrupt_bytes
inject_bytes
replace_pattern
corrupt_encoding
swap_bytes
dry_runpreserves bytes while still emitting mutation decision for observability.
Stream Hook Ordering with Mutation¶
Uplink (
send/sendto): evaluate decision, mutate before syscall, then reorder/duplicate operate on mutated bytes.Downlink (
recv/recvfrom): execute syscall first, then mutate only the received span before exposing to caller.Non-blocking reorder paths that return
EAGAINdo not apply mutation in that call.
Fault Decision State Diagram¶
stateDiagram-v2
[*] --> Evaluate
Evaluate --> Evaluate: AddDelay
Evaluate --> Continue: No terminal decision
Evaluate --> Drop: Drop
Evaluate --> Timeout: TimeoutMs
Evaluate --> ConnErr: ConnectionErrorKind
Evaluate --> NxDomain: NxDomain
Continue --> [*]
Drop --> [*]
Timeout --> [*]
ConnErr --> [*]
NxDomain --> [*]
Diagram focus: accumulated delay vs terminal short-circuit decisions.
Reorder Matrix¶
send: supported (LayerDecision::StageReorder+ per-FD staging queue)sendto: supported (LayerDecision::StageReorder+ per-FD staging queue)recv: supported (blocking and non-blocking, using dedicated recv pending queue)recvfrom: supported (blocking and non-blocking, using dedicated recv pending queue)
Ownership Boundaries¶
Business/network fault behavior belongs to
faultcore_network.Linux interception details belong to
faultcore_interceptor.SHM binary contract belongs to
faultcore_network::shm_contractand must be reflected by Python writer.
FD/TID Ownership Model¶
Socket creation/binding (
sockethook) binds anfdto the current thread slot viabind_fd_to_current_thread.Runtime config resolution for stream hooks prefers the
fdowner slot (get_tid_slot_for_fd) when present.If the owner slot cannot be resolved to a valid config row, resolution falls back to current thread
tidconfig.This makes cross-thread
fdhandoff deterministic: the bound owner policy stays authoritative unless explicitly cleared/rebound.FD aliasing hooks (
dup,dup2,dup3,accept,accept4) clone the owner binding from source/listener FD to the new FD.
Change Rules¶
SHM layout changes must update:
faultcore_network/src/shm_contract.rssrc/faultcore/shm_writer.pytests/unit/test_shm_contract.pydocs/shm_protocol.md
New network fault behavior should be implemented as a layer concern first, then mapped through
LayerDecision.Interceptor should not gain new policy interpretation logic.