Data Flow Tracer
Builds a complete, annotated map of how data enters a system, transforms step by step, and exits — across any framework, language, or paradigm.
Overview
The Data Flow Tracer follows data, not code structure. Code is organized by module, class, and function — but data doesn't respect those boundaries. This agent traces the data's actual path, crossing whatever boundaries it crosses, through PyTorch, NumPy, TensorFlow, JAX, C++ pipelines, ROS nodes, or any combination.
| Property | Details |
|---|---|
| Tools | Read, Grep, Glob (read-only) |
| Auto-Dispatch | Yes — when data pipeline or preprocessing changes |
| Trigger | Data pipeline changes, environment integration, model input/output modifications |
Input-to-Output Tracing
The tracer's primary job is complete end-to-end tracing:
- True source identification — not "the dataloader" but specifically "an HDF5 file containing joint angles as float64 arrays of shape `(N, T, 7)`"
- Every transformation documented — function/operation applied, input shape/dtype/range, output shape/dtype/range, what changed semantically, file path and line number
- True sink identification — traces all the way to the final destination: loss function, action sent to motor, logged metric, saved checkpoint
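The trace entries described above can be sketched as plain records. This is a minimal illustration, not the tracer's actual implementation; `trace_step` and the normalization pipeline are hypothetical, but the recorded fields (shape, dtype, range per stage) mirror the list above.

```python
import numpy as np

def trace_step(name, fn, x):
    """Apply fn and record a trace entry: shape, dtype, and
    value range before and after the transformation."""
    y = fn(x)
    return y, {
        "op": name,
        "in": {"shape": x.shape, "dtype": str(x.dtype),
               "range": (float(x.min()), float(x.max()))},
        "out": {"shape": y.shape, "dtype": str(y.dtype),
                "range": (float(y.min()), float(y.max()))},
    }

# Source: joint angles as float64 arrays of shape (N, T, 7)
data = np.random.default_rng(0).uniform(-3.14, 3.14, size=(8, 100, 7))

trace = []
data, entry = trace_step("normalize to [-1, 1]", lambda x: x / 3.14, data)
trace.append(entry)
data, entry = trace_step("cast to float32", lambda x: x.astype(np.float32), data)
trace.append(entry)

for e in trace:
    print(e["op"], e["in"]["dtype"], "->", e["out"]["dtype"])
```

A real trace would also carry file path and line number for each `op`, as the list above specifies.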
Semantic Annotation
Shapes and dtypes are not enough. The tracer tracks meaning:
- Axis labels — every axis labeled: `(batch, time, joints, xyz, channels, envs, agents, ...)`. Never leaves an axis unlabeled.
- Units — radians vs degrees, meters vs millimeters, normalized `[-1, 1]` vs raw sensor range
- Coordinate frames — world frame, body frame, camera frame, end-effector frame. A `(3,)` vector in the wrong frame is a bug no shape check catches.
- Value range constraints — quaternions should be unit norm, probabilities should sum to 1, joint angles should be within limits
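Two of the checks above can be expressed as small runtime assertions. This is a sketch under assumptions: both helper functions are hypothetical, and the degrees-vs-radians test is only a heuristic.

```python
import numpy as np

def check_unit_quaternion(q, tol=1e-5):
    """Value-range constraint: quaternions must be unit norm."""
    norms = np.linalg.norm(q, axis=-1)
    return bool(np.all(np.abs(norms - 1.0) < tol))

def looks_like_degrees(angles):
    """Unit heuristic: joint angles with magnitude far beyond
    2*pi are probably degrees masquerading as radians."""
    return bool(np.max(np.abs(angles)) > 2 * np.pi)

q = np.array([0.0, 0.0, 0.0, 1.0])
assert check_unit_quaternion(q)

deg = np.array([0.0, 45.0, 90.0])   # degrees sneaking in as "radians"
assert looks_like_degrees(deg)
```

Coordinate-frame mismatches have no comparable numeric signature, which is why the tracer annotates frames explicitly rather than trying to detect them from values.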
Branching and Merging
Real pipelines aren't linear. The tracer tracks:
- Fan-out — where one tensor feeds multiple downstream consumers (e.g., observation goes to both policy and value networks)
- Fan-in — where multiple tensors merge. Verifies that the concat axis and ordering are correct.
- Residual/skip connections — traces what gets added back, verifies shapes and semantics match at the addition point
- Conditional paths — if data flows through different branches based on mode (train vs eval, sim vs real), traces both paths
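Fan-out and fan-in can be made concrete with a small sketch. The observation/policy/value split and the proprioception/vision merge below are hypothetical examples, not a prescribed architecture; the point is the concat-axis and ordering checks.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fan-out: one observation tensor feeds two downstream consumers.
obs = rng.standard_normal((32, 10))   # (batch, obs_dim)
policy_in = obs                       # both consume the same array
value_in = obs

# Fan-in: proprioception and vision features merge before the policy.
proprio = np.zeros((32, 7))           # (batch, joints)
vision = np.ones((32, 64))            # (batch, channels)

# The concat axis must be the feature axis, and downstream code that
# slices the merged tensor assumes the ordering [proprio, vision].
merged = np.concatenate([proprio, vision], axis=-1)   # (batch, 71)
assert merged.shape == (32, 71)
assert np.array_equal(merged[:, :7], proprio)   # ordering preserved
```

Concatenating on the wrong axis here fails loudly because the feature dims differ; the dangerous case is when both axes happen to line up and the error is silent.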
Boundary Analysis
Special attention at boundaries where data crosses between systems:
| Boundary | What to Verify |
|---|---|
| Data loading → preprocessing | File formats parsed correctly? Dtypes preserved or silently cast? |
| Preprocessing → model input | Does the model expect the exact format that preprocessing produces? |
| Model output → postprocessing | Outputs denormalized/decoded correctly? |
| Software → hardware | Actions clipped, scaled, and in correct units before actuators? |
| Between processes/nodes | Serialization/deserialization preserves data correctly? |
| Between frameworks | NumPy/PyTorch/JAX conversions handle memory layouts, dtypes, device placement? |
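One boundary hazard from the table, sketched concretely: serialization between processes. The example assumes a JSON wire format (as one might use between loosely coupled nodes); JSON carries no dtype, so a float32 array silently comes back as float64.

```python
import json
import numpy as np

# Boundary: an action serialized as JSON between two processes/nodes.
action = np.array([0.1, -0.2, 0.3], dtype=np.float32)

wire = json.dumps(action.tolist())        # serialize
restored = np.array(json.loads(wire))     # deserialize

assert restored.dtype == np.float64       # dtype was NOT preserved
# Values survive (float32 widens exactly to float64), but a downstream
# float32 consumer now gets a silent cast or a dtype-mismatch error.
assert np.array_equal(restored, action.astype(np.float64))
```

Binary formats with an explicit dtype (or an explicit cast on deserialization) close this gap; the tracer's job is to flag that the round trip changed the dtype at all.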
Mutation Tracking
Tracks where data gets modified in place:
- In-place operations (`x *= 2`, `x[idx] = val`, `x.fill_()`)
- Buffer updates (running mean/var in BatchNorm, replay buffer overwrites)
- Shared references where modifying one variable silently modifies another
- Flags in-place mutation that could affect upstream consumers expecting original data
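The shared-reference hazard above is easy to reproduce with NumPy, where both plain assignment and slicing alias the same buffer. A minimal sketch:

```python
import numpy as np

obs = np.arange(6, dtype=np.float32).reshape(2, 3)
logged = obs          # shared reference, not a copy
view = obs[:, :2]     # a slice is a view into the same buffer

obs *= 2.0            # in-place: every alias sees the change

assert logged[1, 2] == 10.0   # the "logged" data was silently mutated
assert view[0, 1] == 2.0      # so was the view

safe = obs.copy()     # a defensive copy breaks the aliasing
obs.fill(0.0)
assert safe[1, 2] == 10.0     # the copy is unaffected by the later fill
```

This is exactly the pattern the tracer flags: an upstream consumer holding `logged` expects the original values, but an in-place op downstream rewrote them.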
Missing Transformations
Flags where important transformations are absent:
- Raw sensor data used without normalization
- Model outputs used without denormalization
- Missing clipping on actions before hardware execution
- Missing dtype conversion (float64 data entering a float32 model — silent precision loss)
- Missing device transfer (CPU tensor where GPU expected, or vice versa)
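Two of these missing transformations (clipping before hardware, explicit dtype conversion) are commonly fixed with a single guard at the boundary. A sketch assuming hypothetical actuator limits and a hypothetical `to_hardware` helper:

```python
import numpy as np

ACTION_LOW, ACTION_HIGH = -1.0, 1.0   # hypothetical actuator limits

def to_hardware(action):
    """Guard the software -> hardware boundary: cast and clip
    before anything reaches the actuators."""
    a = np.asarray(action, dtype=np.float32)   # explicit, not silent, cast
    return np.clip(a, ACTION_LOW, ACTION_HIGH)

raw = np.array([0.5, 1.7, -2.3], dtype=np.float64)  # float64 from a planner
cmd = to_hardware(raw)

assert cmd.dtype == np.float32
assert cmd.tolist() == [0.5, 1.0, -1.0]
```

The tracer does not insert such guards itself; it flags the paths where raw values reach a sink without one.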
Output Format
The tracer produces a structured report containing:
- Overview — what the pipeline does, input source, output destination
- Flow Diagram — ASCII/text diagram showing high-level flow with branches and merges
- Detailed Trace — every stage with input/output shape, dtype, axes, range, units, and notes
- Boundary Crossings — what crosses each boundary and any concerns
- Issues Found — location, severity, concrete consequence
- Assumptions Made — things that couldn't be verified from static analysis, with suggested runtime checks
Guiding Principles
- No gaps in the trace — a gap is a finding, not something to skip.
- Read the code, don't infer from names — a function called `normalize()` might do anything.
- The bugs are in the transitions — most data flow bugs are in the hand-offs between stages.
- Units and frames matter — a perfectly shaped tensor in the wrong coordinate frame will produce plausible-looking but wrong behavior.
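The "don't infer from names" principle can be shown in two lines. The `normalize` function here is a deliberately misleading hypothetical: its name suggests scaling into `[-1, 1]`, but it actually standardizes.

```python
import numpy as np

def normalize(x):
    # Despite the name, this standardizes (zero mean, unit variance);
    # it does NOT scale into [-1, 1].
    return (x - x.mean()) / x.std()

x = np.array([0.0, 5.0, 10.0])
y = normalize(x)
assert abs(float(y.mean())) < 1e-12   # zero mean, as standardization gives
assert float(y.max()) > 1.0           # but values fall outside [-1, 1]
```

A tracer that trusted the name would record a `[-1, 1]` range here and propagate that wrong annotation downstream; reading the body is the only reliable source.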