Summary
Interpretable RL is challenging because we rarely understand how hidden policy signals drive behavior. In this experiment, we apply sparse transcoders to compress the hidden activations of a REINFORCE policy trained on CartPole into 2, 4, and 8 dimensions, and find that a 4-dimensional bottleneck strikes the best balance between sparsity and informativeness. To do this, we train small encoder/decoder modules on each hidden layer and then adjust one feature during rollout. That single adjustment raises average return from 261 to 323 without any retraining. To our knowledge, this is one of the first end-to-end sparse-transcoder + causal-intervention pipelines in RL. Beyond CartPole, the method suggests a general way to interpret policy internals and steer behavior in more complex environments.
Introduction
Interpreting the internal reasoning of reinforcement‐learning (RL) agents remains a pressing challenge. Correlation‐based methods (e.g., saliency maps) can highlight which input features are associated with certain actions, but they do not reveal causal mechanisms inside the policy network. In parallel, the field of mechanistic interpretability (especially for large language models) has recently exploded with tools that trace “circuits” across transformer MLP sublayers, revealing how specific neurons or feature combinations cause particular outputs.
Anthropic’s May 29, 2025 open‐source release of circuit-tracing tools provides utilities for generating attribution graphs on transformer models, making it easier to visualize and perturb causal pathways in large language models (LLMs). At the same time, Jacob Dunefsky’s transcoder-circuits repository demonstrates that transcoders (small encoder/decoder modules that approximate an MLP sublayer through a sparse latent code) can decompose activations into sparse, interpretable features and be used to reverse-engineer fine‐grained circuits in GPT-2 and other LLMs.
Inspired by these advances, we ask: Can we build a similarly transparent, causal pipeline for a simple RL agent such as CartPole?
In this blog post, we:
- Train a CartPole policy (MLP with hidden layers of size 128 and 64).
- Fit three sparse transcoders (latent_dim = 2, 4, 8) on each hidden layer’s activations, imposing an L1 penalty to encourage sparsity and interpretability.
- Inspect the latent features of each transcoder, showing how increasing latent dimension yields progressively more detailed features.
- Causally intervene on the 4-dim transcoder (the “sweet spot” between too few and too many dims), adding decoder columns to the policy’s hidden activations and measuring how the agent’s performance and failure modes change.
- Contextualize our pipeline within broader interpretability work, including Anthropic’s attribution graphs and Jacob Dunefsky’s demonstration that transcoders produce more interpretable features than standard sparse autoencoders.
By providing a small‐scale but complete pipeline, we hope to offer a reproducible blueprint for interpretable RL that can be extended to more complex environments. We encourage readers to experiment further using our Colab notebook.
Related Work
1. Mechanistic Interpretability & Circuit Tracing
Anthropic’s open‐source circuit‐tracing tools (May 29, 2025) enable researchers to generate attribution graphs for transformers, revealing how feature activations propagate across MLP layers and influence logits or downstream tasks. Their demo notebooks and the circuit-tracer library show how to trace multi‐step reasoning circuits in Llama-3.2-1B and Gemma-2-2B using transcoders trained on MLP sublayers. These transcoders re-express high‐dimensional hidden activations as sparse codes, allowing interpretable features to be extracted and tracked.
Similarly, Nanda (Oct 2023) used sparse autoencoders (SAEs) to study superposition in the MLP layer of a one-layer language model, finding that a sizable fraction of latent features were monosemantic (each feature had a single clear interpretation). This work laid the groundwork for the idea that sparse bottlenecks can uncover human‐readable circuits in large networks.
2. Transcoders vs. Sparse Autoencoders
A recent paper, “Transcoders Beat Sparse Autoencoders for Interpretability” (Jan 31, 2025), systematically compared transcoders (trained to reconstruct an MLP sublayer’s output from its input) to traditional SAEs (trained to reconstruct activations from themselves) and found that transcoders yield significantly more interpretable features under the same sparsity budget. By focusing on input→output reconstruction, transcoders encourage the latent code to capture the features that the layer’s transformation actually uses.
“Transcoders Find Interpretable LLM Feature Circuits” first proposed this approach, demonstrating that transcoders trained on GPT-2-small’s MLP sublayers uncovered circuits corresponding to “greater‐than comparison,” “demographic bias,” and “semantic partitioning.” Our work adapts this concept to the RL domain, training very small (2-, 4-, and 8-dim) transcoders on the MLP layers inside a CartPole policy.
3. Interpretability in Reinforcement Learning
While interpretability in vision and language has advanced rapidly, RL remains relatively under‐explored. Most RL interpretability efforts focus on input‐level saliency (e.g., visualizing which pixels in a game frame matter to a DQN) but not on internal hidden activations. By contrast, our pipeline uses transcoders to compress hidden‐layer activations into truly human‐readable features, then performs causal interventions to confirm their role. To our knowledge, this is one of the first end‐to‐end demonstrations of sparse transcoder + intervention in an RL policy.
Methodology
1. CartPole Policy Training
We train a policy network on OpenAI Gym’s CartPole-v1. The state consists of cart position, cart velocity, pole angle, and pole angular velocity. Our policy is an MLP:
- Layer 1 (fc1): Linear 4→128, ReLU.
- Layer 2 (fc2): Linear 128→64, ReLU.
- Output (fc3): Linear 64→2 (logits for “move left” vs. “move right”).
We train via REINFORCE (policy gradient) until the average episode return reaches at least 200 over 100 validation episodes. Training details:
- Learning rate: 1e-3 (Adam)
- Discount factor γ = 0.99
- Batch size: per-episode updates (no replay)
- Number of episodes: ~2000 until convergence (average return ≈ 200)
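As a rough illustration of the per-episode update, the sketch below computes the discounted returns that REINFORCE weights each log-probability by. The helper names are hypothetical, and whether the notebook also normalizes returns or subtracts a baseline is not stated here, so neither is included.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every timestep of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_loss_terms(log_probs: List[float], returns: List[float]) -> List[float]:
    """Per-step terms -log pi(a_t | s_t) * G_t; their sum is the episode loss."""
    return [-lp * g for lp, g in zip(log_probs, returns)]


# CartPole gives a reward of 1 per surviving step, so a 3-step episode looks like:
example_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.99)
```

In a PyTorch training loop the same terms would be built from tensors so that gradients flow through the log-probabilities.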
2. Collecting Hidden Activations & Raw States
After policy convergence, we run 50 episodes (≈ 10,000 timesteps) and record:
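- Raw state $s_t = [x, \dot{x}, \theta, \dot{\theta}]$.
- Layer1 input: $s_t$ (4-dim).
- Layer1 output: $h_1$ (128-dim).
- Layer2 input: $h_1$ (128-dim).
- Layer2 output: $h_2$ (64-dim).
We store ~20k samples of these (input, output) pairs. In PyTorch, we attach forward hooks: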
# layer1_data, layer2_data, current_state, and MAX_SAMPLES_PER_LAYER are
# assumed to be defined before the hooks are registered.

# Hook for the first hidden layer (fc1)
def hook_fc1(module, input_tuple, output):
    if len(layer1_data["inputs"]) < MAX_SAMPLES_PER_LAYER:
        # input_tuple is a tuple, e.g. (tensor_of_shape_[1, obs_dim],)
        inp = input_tuple[0].detach().cpu()
        outp = output.detach().cpu()
        layer1_data["inputs"].append(inp.squeeze(0))    # shape [obs_dim]
        layer1_data["outputs"].append(outp.squeeze(0))  # shape [hidden1]
        layer1_data["states"].append(current_state["value"].copy())

# Hook for the second hidden layer (fc2)
def hook_fc2(module, input_tuple, output):
    if len(layer2_data["inputs"]) < MAX_SAMPLES_PER_LAYER:
        inp = input_tuple[0].detach().cpu()
        outp = output.detach().cpu()
        layer2_data["inputs"].append(inp.squeeze(0))    # shape [hidden1]
        layer2_data["outputs"].append(outp.squeeze(0))  # shape [hidden2]
        layer2_data["states"].append(current_state["value"].copy())
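For readers who want to try the hook pattern in isolation, here is a minimal, self-contained sketch. `TinyPolicy`, `make_hook`, and the value of `MAX_SAMPLES_PER_LAYER` are assumptions made for illustration, and random tensors stand in for environment observations. One thing to keep in mind: a hook registered on an `nn.Linear` sees that layer's pre-ReLU output, so hook a module that includes the ReLU if post-activation values are wanted.

```python
import torch
import torch.nn as nn

MAX_SAMPLES_PER_LAYER = 20_000  # assumed cap; the post only says "~20k samples"


class TinyPolicy(nn.Module):
    """Hypothetical stand-in for the 4 -> 128 -> 64 -> 2 policy."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 2)

    def forward(self, x):
        h1 = torch.relu(self.fc1(x))
        h2 = torch.relu(self.fc2(h1))
        return self.fc3(h2)


def make_hook(store):
    """Build a forward hook that appends (input, output) pairs to `store`."""
    def hook(module, input_tuple, output):
        if len(store["inputs"]) < MAX_SAMPLES_PER_LAYER:
            store["inputs"].append(input_tuple[0].detach().cpu().squeeze(0))
            store["outputs"].append(output.detach().cpu().squeeze(0))
    return hook


policy = TinyPolicy()
layer1_data = {"inputs": [], "outputs": []}
layer2_data = {"inputs": [], "outputs": []}

# register_forward_hook returns a handle that can later detach the hook.
handles = [
    policy.fc1.register_forward_hook(make_hook(layer1_data)),
    policy.fc2.register_forward_hook(make_hook(layer2_data)),
]

with torch.no_grad():
    for _ in range(5):
        policy(torch.randn(1, 4))  # random states, not real CartPole observations

# Remove the hooks so later rollouts are not slowed down or recorded.
for handle in handles:
    handle.remove()
```

Using a closure per layer avoids duplicating the hook body, at the cost of not recording the raw environment state, which the hooks above capture through `current_state`.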
3. Training 2-dim, 4-dim, and 8-dim Transcoders
For each hidden layer $\ell \in \{1, 2\}$, we train three separate transcoders with latent_dim ∈ {2, 4, 8}. Each transcoder is defined as:
import torch.nn as nn

class SingleLayerTranscoder(nn.Module):
    def __init__(self, input_dim, latent_dim, output_dim):
        super().__init__()
        # Encoder: compress input_dim → latent_dim
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, latent_dim),
            nn.ReLU()  # ensure non-negative codes
        )
        # Decoder: reconstruct latent_dim → output_dim
        self.decoder = nn.Linear(latent_dim, output_dim)

    def forward(self, x):
        """
        x: [batch_size, input_dim]
        Returns:
            code: [batch_size, latent_dim]
            recon: [batch_size, output_dim]
        """
        code = self.encoder(x)
        recon = self.decoder(code)
        return code, recon
- For layer1: in_dim = 4 (raw state), out_dim = 128 (layer1 output)
- For layer2: in_dim = 128 (layer1 output), out_dim = 64 (layer2 output)
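We minimize:
$$\mathcal{L} = \lVert \hat{h} - h \rVert_2^2 + \lambda \lVert c \rVert_1$$
where $\hat{h}$ is the decoder’s reconstruction of the original hidden activation $h$, and $c$ is the latent code. We set:
- $\lambda$, the L1 coefficient (to find a balance between sparsity and reconstruction fidelity)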
- Epochs = 50, batch_size = 256, learning_rate = 1e-3 (Adam)
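The training procedure can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's exact code: `train_transcoder` is a hypothetical helper, the default `l1_coeff` is a placeholder value, the reconstruction and sparsity terms use mean reductions, and the demo data is synthetic.

```python
import torch
import torch.nn as nn


class SingleLayerTranscoder(nn.Module):
    """Same architecture as above: Linear + ReLU encoder, linear decoder."""

    def __init__(self, input_dim, latent_dim, output_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Linear(latent_dim, output_dim)

    def forward(self, x):
        code = self.encoder(x)
        return code, self.decoder(code)


def train_transcoder(inputs, targets, latent_dim, l1_coeff=1e-3,
                     epochs=50, batch_size=256, lr=1e-3):
    """Fit a transcoder mapping layer inputs to layer outputs.

    l1_coeff is an assumed placeholder for the L1 coefficient.
    Returns the trained model and the mean loss per epoch.
    """
    model = SingleLayerTranscoder(inputs.shape[1], latent_dim, targets.shape[1])
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    n = inputs.shape[0]
    history = []
    for _ in range(epochs):
        perm = torch.randperm(n)  # reshuffle samples every epoch
        epoch_loss = 0.0
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            code, recon = model(inputs[idx])
            recon_loss = nn.functional.mse_loss(recon, targets[idx])
            sparsity = code.abs().mean()  # L1 penalty on the latent code
            loss = recon_loss + l1_coeff * sparsity
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item() * idx.numel()
        history.append(epoch_loss / n)
    return model, history


# Synthetic stand-in for (layer1 input, layer1 output) pairs.
torch.manual_seed(0)
fake_states = torch.randn(512, 4)
fake_hidden = torch.relu(fake_states @ torch.randn(4, 128))
model, history = train_transcoder(fake_states, fake_hidden, latent_dim=4, epochs=20)
```

Calling the same helper with latent_dim set to 2, 4, and 8 on each layer's recorded pairs would produce one model per (layer, latent size) combination.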
Thus, across both layers we end up with six trained transcoder checkpoints (three per layer):
transcoder_layer1_2dim.pth
transcoder_layer1_4dim.pth
transcoder_layer1_8dim.pth
transcoder_layer2_2dim.pth
transcoder_layer2_4dim.pth
transcoder_layer2_8dim.pth
[Figure: training-loss curves for all six transcoders]
Inspecting Transcoder Features
Having trained sparse transcoders at latent dimensions 2, 4, and 8 for each hidden layer, we observed that the 4-dimensional bottleneck yielded the most succinct and semantically distinct channels. In this section, we highlight a few representative neurons from the 4-dim transcoders (two from Layer 1 and two from Layer 2) that illustrate how these small codes capture human-readable features. Each neuron’s description draws on its top-triggering states and its Pearson correlations with the four raw state variables.
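The per-neuron summaries in this section (activation statistics, top-triggering states, Pearson correlations) can be computed along these lines. This is a sketch with a hypothetical `describe_feature` helper and synthetic data; the real inputs would be the transcoder codes and the raw states recorded alongside them.

```python
import numpy as np

STATE_NAMES = ["cart_pos", "cart_vel", "pole_angle", "angle_vel"]


def describe_feature(codes, states, j, top_k=5):
    """Summarize latent channel j.

    codes:  [num_samples, latent_dim] transcoder codes
    states: [num_samples, 4] raw states aligned with the codes
    """
    act = codes[:, j]
    top_idx = np.argsort(-act)[:top_k]  # indices of the largest activations
    pearson = {
        name: float(np.corrcoef(act, states[:, i])[0, 1])
        for i, name in enumerate(STATE_NAMES)
    }
    return {
        "max": float(act.max()),
        "mean": float(act.mean()),
        "std": float(act.std()),
        "top_states": states[top_idx],
        "top_activations": act[top_idx],
        "pearson": pearson,
    }


# Synthetic check: a channel that is just ReLU(cart_vel) should correlate
# strongly with cart_vel and weakly with everything else.
rng = np.random.default_rng(0)
demo_states = rng.normal(size=(1000, 4))
demo_codes = np.maximum(0.0, demo_states[:, [1]])
report = describe_feature(demo_codes, demo_states, j=0)
```

Note that `np.corrcoef` returns NaN for a channel that never varies, so dormant channels need to be filtered out before reading the correlations.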
Layer 1 (4-Dim Transcoder)
Neuron 0: “Cart Zipping Right”
- Activation statistics:
- Max = 2.4866 , Mean = 0.5814 , Std = 0.2897
- Top‐triggering states (cart_pos, cart_vel, pole_angle, angle_vel):
- [+2.387,+3.133,+0.095,–0.007] (activation 2.4866)
- [+2.258,+3.136,+0.103,–0.071] (activation 2.4067)
- [+1.940,+2.950,+0.127,+0.030] (activation 2.3476)
- [+2.128,+3.139,+0.113,–0.142] (activation 2.3237)
- [+2.321,+3.330,+0.101,–0.330] (activation 2.2886)
- Pearson correlations:
- cart_pos: +0.6887 | cart_vel: +0.8063 | pole_angle: +0.2075 | angle_vel: +0.1037
Interpretation: Neuron 0 fires most strongly when the cart is near the right boundary and moving very fast to the right (cart_pos ≈ +1.9 to +2.4, cart_vel ≈ +3.0 to +3.3). Its strong positive correlations with both cart_pos and cart_vel are consistent with this: it is, effectively, a “cart-zipping-right” detector. Pole angle and angular velocity play only minor roles in its activation. In the top-triggering states, where the cart is past about +1.9 with a large rightward velocity, Neuron 0’s code value rises above roughly 2.3, signaling that the policy should correct sharply to avoid sliding off the edge.
Neuron 2: “Pole Tilted Right on Left Side”
- Activation statistics:
- Max = 2.0946 , Mean = 1.0530 , Std = 0.3015
- Top‐triggering states (cart_pos, cart_vel, pole_angle, angle_vel):
- [–2.037,–1.252,+0.205,+0.618] (activation 2.0946)
- [–1.590,–0.530,+0.204,+0.507] (activation 2.0507)
- [–1.893,–0.947,+0.205,+0.421] (activation 2.0504)
- [–1.620,–0.735,+0.207,+0.647] (activation 2.0468)
- [–1.870,–1.139,+0.192,+0.647] (activation 2.0392)
- Pearson correlations:
- cart_pos: –0.8164 | cart_vel: –0.0363 | pole_angle: +0.8276 | angle_vel: +0.3130
Interpretation: Neuron 2 specializes in detecting when the pole is significantly tilted right (pole_angle ≈ +0.20 rad) while the cart is far to the left (cart_pos ≈ –1.6 to –2.0). Its strongly negative correlation with cart_pos (–0.8164) and strongly positive correlation with pole_angle (+0.8276) support that reading: “pole tilted right on the left side of the track.” In these situations, the policy often needs to push right strongly, because the combination of “left-edge cart” and “right-tilted pole” is precarious.
Layer 2 (4-Dim Transcoder)
Neuron 0: “Pole Swinging Left”
- Activation statistics:
- Max = 22.8307 , Mean = 4.1357 , Std = 3.2498
- Top‐triggering states (cart_pos, cart_vel, pole_angle, angle_vel):
- [–1.962,–1.439,+0.175,+0.730] (activation 22.8307)
- [–2.037,–1.252,+0.205,+0.618] (activation 21.5898)
- [–1.870,–1.139,+0.192,+0.647] (activation 21.0805)
- [–1.909,–1.435,+0.155,+0.629] (activation 20.9296)
- [–1.753,–1.288,+0.176,+0.596] (activation 20.0147)
- Pearson correlations:
- cart_pos: –0.5537 | cart_vel: –0.3620 | pole_angle: +0.5612 | angle_vel: +0.8708
Interpretation: Neuron 0 in Layer 2 is overwhelmingly an “angle_vel > 0” detector, i.e. “pole swinging right.” Its very high positive correlation with angle_vel (+0.8708) supports this. Although its top states show the cart often on the left (cart_pos ≈ –1.7 to –2.0), that is secondary: Neuron 0 fires whenever the pole’s angular velocity is strongly positive, largely regardless of cart position. In the top-triggering states, where angle_vel is around +0.6 or higher, this neuron’s activation climbs to 20 or more, indicating a vigorous rightward pole swing.
Neuron 2: “Pole Tilted Right, Near‐Static”
- Activation statistics:
- Max = 12.3788 , Mean = 2.6520 , Std = 1.4613
- Top‐triggering states (cart_pos, cart_vel, pole_angle, angle_vel):
- [–1.433,+0.009,+0.209,+0.022] (activation 9.4760)
- [–1.566,–0.328,+0.193,+0.037] (activation 9.2072)
- [–1.249,+0.185,+0.196,–0.016] (activation 9.0402)
- [–1.408,–0.173,+0.185,+0.008] (activation 8.9129)
- [–1.088,+0.518,+0.207,–0.012] (activation 8.8388)
- Pearson correlations:
- cart_pos: –0.5421 | cart_vel: +0.2089 | pole_angle: +0.7822 | angle_vel: –0.1868
Interpretation: Neuron 2 at Layer 2 activates when the pole is noticeably tilted right (pole_angle ≈ +0.2 rad) but with very small angular velocity (|angle_vel| < 0.04 in the top-triggering states). Its strong positive correlation with pole_angle (+0.7822) and slight negative correlation with angle_vel (–0.1868) show that it encodes “tilt without motion,” a subtler signal than “swing.” This neuron thus picks out states where the pole is precariously leaning right but not yet swinging, providing an early warning cue to the policy.
Discussion of Selected Features
- “Cart Zipping Right” (Layer 1 Neuron 0)
- “Pole Tilted Right on Left Side” (Layer 1 Neuron 2)
- “Pole Swinging Left” (Layer 2 Neuron 0)
- “Pole Tilted Right, Near-Static” (Layer 2 Neuron 2)
Granularity vs. Redundancy Across Latent Sizes
- 2-Dim Bottleneck:
- 4-Dim Bottleneck:
- 8-Dim Bottleneck:
Hence, the 4-dim transcoder represents a “sweet spot” that is rich enough to disentangle all major policy cues, but small enough to stay sparse and interpretable. In practice, we focus subsequent causal‐intervention experiments on these 4-dim channels, knowing that each corresponds to a clear, actionable feature in the CartPole policy’s reasoning.
Causal Interventions (4-Dim Transcoder)
Although we trained transcoders at latent_dims = {2, 4, 8}, we chose to perform causal interventions only on the 4-dim transcoder because it produced the cleanest, most interpretable feature set without unnecessary dormant channels. Below are the details of each intervention experiment.
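One simple way to check for dormant channels is to measure how often each latent channel is active across the recorded samples. The sketch below does that; the function names and both thresholds are assumptions chosen for illustration, and the demo codes are synthetic.

```python
import numpy as np


def channel_activity(codes, eps=1e-6):
    """Fraction of samples on which each latent channel exceeds eps."""
    return (codes > eps).mean(axis=0)


def dormant_channels(codes, min_rate=0.01, eps=1e-6):
    """Indices of channels active on fewer than min_rate of the samples."""
    rates = channel_activity(codes, eps)
    return [int(j) for j in np.where(rates < min_rate)[0]]


# Synthetic 8-dim codes in which channels 3 and 6 never fire.
rng = np.random.default_rng(0)
demo_codes = rng.random((200, 8))
demo_codes[:, [3, 6]] = 0.0
dead = dormant_channels(demo_codes)
```

Running this on the codes from the 2-, 4-, and 8-dim transcoders gives a quick, quantitative view of how many channels each bottleneck actually uses.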
1. Intervention Mechanism
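To force transcoder dimension $j$ at layer $\ell$, we:
- Compute the policy’s normal hidden activation $h_\ell$.
- Extract the decoder column $d_j$ from transcoder.decoder.weight[:, j].
- Add $\alpha\, d_j$ to $h_\ell$, effectively setting $c_j \leftarrow c_j + \alpha$ in the transcoder’s latent code.
- Continue the forward pass with $\tilde{h}_\ell = h_\ell + \alpha\, d_j$. Subsequent layers use $\tilde{h}_\ell$ as input.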
For example, in PyTorch:
import torch.nn.functional as F

# Method of the policy nn.Module (fc1, fc2, fc3 are its Linear layers).
def forward(self, x, intervene_on=None):
    """
    If intervene_on is provided, it must be a dict:
        {'layer': 'layer1' or 'layer2',
         'decoder_col': torch.Tensor of shape [hidden_dim],
         'scale': float}
    We add (scale * decoder_col) to the hidden activation of the specified layer.
    """
    # Layer 1 forward
    z1 = self.fc1(x)   # [batch_size, hidden1]
    h1 = F.relu(z1)    # policy's real layer1 output

    # Intervention on layer1?
    if intervene_on is not None and intervene_on["layer"] == "layer1":
        boost_vec = intervene_on["decoder_col"].view(1, -1) * intervene_on["scale"]
        h1 = h1 + boost_vec

    # Layer 2 forward
    z2 = self.fc2(h1)  # [batch_size, hidden2]
    h2 = F.relu(z2)    # policy's real layer2 output

    # Intervention on layer2?
    if intervene_on is not None and intervene_on["layer"] == "layer2":
        boost_vec2 = intervene_on["decoder_col"].view(1, -1) * intervene_on["scale"]
        h2 = h2 + boost_vec2

    logits = self.fc3(h2)  # action logits
    return logits
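To see why adding a scaled decoder column corresponds to shifting one latent code, the sketch below checks the identity on a stand-in decoder: because the decoder is linear, decoding a code whose j-th entry is raised by alpha equals the original decoding plus alpha times column j of the weight matrix. The decoder here is randomly initialized and the variable names are illustrative; in the pipeline it would be `transcoder.decoder` loaded from a checkpoint.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, hidden_dim = 4, 128
decoder = nn.Linear(latent_dim, hidden_dim)  # stand-in for transcoder.decoder

j, alpha = 2, 2.5
# nn.Linear stores weight as [out_features, in_features], so column j is
# the hidden-space direction written by latent channel j.
decoder_col = decoder.weight[:, j].detach()  # shape [hidden_dim]

code = torch.rand(1, latent_dim)
shifted = code.clone()
shifted[0, j] += alpha

with torch.no_grad():
    decoded_shifted = decoder(shifted)
    decoded_plus_column = decoder(code) + alpha * decoder_col.view(1, -1)

max_diff = (decoded_shifted - decoded_plus_column).abs().max().item()

# The dict format expected by the forward() method above.
intervene_on = {"layer": "layer1", "decoder_col": decoder_col, "scale": alpha}
```

The identity is exact for the transcoder's own reconstruction. In the policy, the vector is added to the true activation instead, so the effect is an approximation, and because the addition happens after the ReLU, a negative scale can push hidden units below zero.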
2. Experiment Settings
For each (layer, neuron) pair in the 4-dim transcoder, we run:
- NUM_EPISODES = 50 baseline (no intervention) episodes.
- NUM_EPISODES = 50 intervention episodes with the scale α taken from [1, -1, 2.5, -2.5], so that the feature is reliably forced on (α > 0) or off (α < 0).
- We record:
3. Experiment Results
Below is an interpretation of what happens when we force each transcoder neuron on or off (i.e. use ±1 or ±2.5) in light of “what that neuron actually represents.” In each case, “forcing” a positive α makes the policy act as if that feature is strongly present, while a negative α makes it act as if the feature is strongly absent (or even reversed). We compare everything back to the baseline (avg = 261.3).
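The comparison against the baseline can be organized with a small harness like the one below. It is only a sketch: `compare_to_baseline` is a hypothetical helper, and `run_episode` is a caller-supplied function that rolls out one episode with the given α (or no intervention when α is None) and returns the episode return. The toy `fake_episode` exists purely to exercise the code and does not reflect any measured result.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional

ALPHAS = [1.0, -1.0, 2.5, -2.5]
NUM_EPISODES = 50


def compare_to_baseline(
    run_episode: Callable[[Optional[float]], float],
    alphas: List[float] = ALPHAS,
    num_episodes: int = NUM_EPISODES,
) -> Dict[str, Dict[str, float]]:
    """Average return per intervention scale, plus the change vs. baseline."""
    baseline_mean = mean(run_episode(None) for _ in range(num_episodes))
    table = {"baseline": {"mean_return": baseline_mean, "delta": 0.0}}
    for alpha in alphas:
        alpha_mean = mean(run_episode(alpha) for _ in range(num_episodes))
        table[f"alpha={alpha:+g}"] = {
            "mean_return": alpha_mean,
            "delta": alpha_mean - baseline_mean,
        }
    return table


def fake_episode(alpha: Optional[float]) -> float:
    """Toy stand-in for a rollout; NOT real CartPole data."""
    return 100.0 if alpha is None else 100.0 + 10.0 * alpha


table = compare_to_baseline(fake_episode)
```

With a real rollout function, it is worth also tracking the spread of returns and the failure mode of each episode, since a mean alone can hide a change in how the agent fails.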
Layer 1, Neuron 0 (“Cart Zipping Right”):
Recall that this neuron’s positive code spikes whenever the cart is already far to the right (x ≈ +2.3 → +2.4) and moving quickly right (v ≈ +3.0). In normal operation, whenever Neuron 0 > ≈2 it tells the policy: “The cart is about to slide off the right edge, correct left now.”
- Positive α (= +1, +2.5): Forcing “Cart Zipping Right”
- Negative α (= −1, −2.5): Forcing “Anti–Cart Zipping Right” (i.e. “Cart Zipping Left”)
Layer 1, Neuron 2 (“Pole Tilted Right on Left Side”)
Neuron 2 fires when (cart ≈ –1.6 to –2.0) AND (pole angle ≈ +0.20) i.e. “the cart is far left but the pole is already tilting right.” In that scenario, the correct action is usually to push right (to catch the pole) but also be careful not to slide off left.
- Positive α (= +1, +2.5): Forcing “Pole Tilted Right on Left Side”
- Negative α (= −1, −2.5): Forcing “Anti–(Pole Tilted Right on Left Side)” (i.e. “Pole Tilted Left on Left Side” or “No Tilt”)
Layer 2, Neuron 0 (“Pole Swinging Left”)
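Neuron 0 at Layer 2 fires nearly any time θ̇ > 0 (the pole’s angular velocity is strongly positive). Under the sign convention used above, where a positive pole angle means tilted right, that is a rightward swing, so “Pole Swinging Left” is best read as the feature’s name and not as the direction of motion. When it spikes, the policy says: “Stop it, you must push right next.”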
- Positive α (= +1): Forcing “Pole Swinging Left” Weakly
- Positive α (= +2.5): Forcing “Pole Swinging Left” Strongly
- Negative α (= −1): Forcing “Anti–(Pole Swinging Left)” (i.e. “Pole Swinging Right” or “No Swing Left”)
- Negative α (= −2.5): Forcing “Strong Anti–(Pole Swinging Left)” (i.e. “Pole Swinging Right”)
Layer 2, Neuron 2 (“Pole Tilted Right, Near‐Static”)
Neuron 2 fires when θ ≈ +0.20 but θ̇ ≈ 0—“pole is leaning right but not yet swinging.” In practice, that is an early warning signal: the pole is about to tip right, so a small left push is needed before it begins to fall hard.
- Positive α (= +1, +2.5): Forcing “Pole Tilted Right, Near‐Static”
- Negative α (= −1): Forcing “Anti–(Pole Tilted Right)” (i.e. “Pole Tilted Left”)
- Negative α (= −2.5): Forcing “Strong Anti–(Pole Tilted Right)”
Discussion
1. Why Multiple Latent Dimensions Matter
- 2-Dim vs. 4-Dim vs. 8-Dim Transcoders
- Novelty & Impact
2. Performance Improvement via Targeted Intervention on Layer 2, Neuron 2
- Key Finding:
- Generalization to More Complex Systems:
- Practical Implications for Future Research:
3. Practical Recommendations for Further Experiments
- Choose the Right Latent Dimensionality
- Tune the L1 Sparsity Penalty
- Perform More Causal Interventions
4. Limitations & Future Work
- Single Environment
- More Complex Architectures
- Quantitative Interpretability Metrics
Conclusion
In this research post, we presented a proof-of-concept pipeline for interpreting RL policies via sparse transcoders and causal interventions:
- Train a CartPole MLP policy (4 → 128 → 64 → 2) until it reliably solves the task.
- Collect hidden‐layer activations (128-dim and 64-dim) and raw states (4-dim).
- Train 2, 4, and 8-dim transcoders on each hidden layer, imposing an L1 penalty to encourage sparsity.
- Inspect the latent codes and label each dimension in human‐readable terms (e.g., “pole-left swing,” “cart-left & pole-right”).
- Causally intervene on the 4-dim transcoder (the most interpretable) by adding decoder columns to the policy’s hidden activations, demonstrating that each feature genuinely drives policy failures (pole falls or cart‐outs).
By providing clear, reproducible code and a concise methodology, we hope this work serves as a blueprint for interpretable RL. Whether you are an RL novice or a seasoned mechanistic-interpretability researcher, this pipeline can be adapted to more complex tasks, deeper networks, or real-world robots. We encourage readers to clone our Colab notebook and experiment with their own environments. Push the limits of what a 2, 4, or 8 dim transcoder can reveal about an RL agent’s inner workings.
References
- Open-sourcing circuit tracing tools (Anthropic, May 29, 2025).
- Open Source Replication & Commentary on Anthropic’s Dictionary Learning (Neel Nanda, Oct 23, 2023).
- Transcoder-circuits: reverse-engineering LLM circuits with transcoders (Jacob Dunefsky, GitHub).
- Transcoders Find Interpretable LLM Feature Circuits (arXiv 2406.11944v1, Jun 2024).
- Mechanistic Interpretability in Transformers (Ayyuce Kizrak’s repo).
- Transcoders Enable Fine-Grained Interpretable Circuit Analysis (Alignment Forum, Jul 2023).
- Transcoders Beat Sparse Autoencoders for Interpretability (arXiv 2501.18823, Jan 31, 2025).