Summary
Interpretable RL is challenging because we rarely understand how hidden policy signals drive behavior. In this experiment, we apply sparse transcoders to compress the hidden activations of a REINFORCE policy trained on CartPole into 2, 4, and 8 dimensions, and find that a 4-dimensional bottleneck strikes the best balance between sparsity and informativeness. To do this, we train small encoder/decoder modules on each hidden layer and then tweak a single feature during rollout. That one adjustment raises average return from 261 to 323 without any retraining. To our knowledge, this is among the first end-to-end sparse-transcoder + causal-intervention pipelines in RL. Beyond CartPole, our method offers a general way to interpret policy internals and steer behavior in more complex environments.
Introduction
Interpreting the internal reasoning of reinforcement-learning (RL) agents remains a pressing challenge. Correlation-based methods (e.g., saliency maps) can highlight which input features are associated with certain actions, but they do not reveal causal mechanisms inside the policy network. In parallel, the field of mechanistic interpretability (especially for large language models) has recently exploded with tools that trace "circuits" across transformer MLP sublayers, revealing how specific neurons or feature combinations cause particular outputs.
Anthropic's May 29, 2025 open-source release of circuit-tracing tools provides utilities for generating attribution graphs on transformer models, making it easier to visualize and perturb causal pathways in large language models (LLMs). At the same time, Jacob Dunefsky's transcoder-circuits repository demonstrates that transcoders (small encoder/decoder modules that map an MLP sublayer's input to its output through a sparse latent code) can decompose activations into sparse, interpretable features and be used to reverse-engineer fine-grained circuits in GPT-2 and other LLMs.
Inspired by these advances, we ask: Can we build a similarly transparent, causal pipeline for a simple RL agent such as CartPole?
In this blog post, we:
- Train a CartPole policy (MLP with hidden layers of size 128 and 64).
- Fit three sparse transcoders (latent_dim = 2, 4, 8) on each hidden layer's activations, imposing an L1 penalty to encourage sparsity and interpretability.
- Inspect the latent features of each transcoder, showing how increasing latent dimension yields progressively more detailed features.
- Causally intervene on the 4-dim transcoder (the "sweet spot" between too few and too many dims), adding scaled decoder columns to the policy's hidden activations and measuring how the agent's performance and failure modes change.
- Contextualize our pipeline within broader interpretability work, including Anthropic's attribution graphs and Jacob Dunefsky's demonstration that transcoders produce more interpretable features than standard sparse autoencoders.
By providing a small-scale but complete pipeline, we hope to offer a reproducible blueprint for interpretable RL that can be extended to more complex environments. We encourage readers to experiment further using our Colab notebook.
Related Work
1. Mechanistic Interpretability & Circuit Tracing
Anthropic's open-source circuit-tracing tools (May 29, 2025) enable researchers to generate attribution graphs for transformers, revealing how feature activations propagate across MLP layers and influence logits or downstream tasks. Their demo notebooks and the circuit-tracer library show how to trace multi-step reasoning circuits in Llama-3.2 (1B) and Gemma-2 (2B) by training transcoders on MLP sublayers. These transcoders compress high-dimensional hidden activations into small sparse codes, allowing interpretable features to be extracted and tracked.
Similarly, Nanda et al. (Oct 2023) used sparse autoencoders (SAEs) to interpret superposition in a small language model's MLP layer, finding that a sizable fraction of latent features were monosemantic (each feature had a single clear interpretation). Their work laid the groundwork for the idea that sparse bottlenecks can uncover human-readable circuits in large networks.
2. Transcoders vs. Sparse Autoencoders
A recent paper, "Transcoders Beat Sparse Autoencoders for Interpretability" (Jan 31, 2025), systematically compared transcoders (trained to reconstruct an MLP sublayer's output from its input) to traditional SAEs (trained to reconstruct activations from themselves) and found that transcoders yield significantly more interpretable features under the same sparsity budget. By focusing on input-output reconstruction, transcoders encourage the latent code to capture precisely those features causally used by the layer's transformation.
"Transcoders Find Interpretable LLM Feature Circuits" first proposed this approach, demonstrating that a small 16-dim transcoder on GPT-2-small's MLP sublayers uncovered circuits corresponding to "greater-than comparison," "demographic bias," and "semantic partitioning." Our work adapts this concept to the RL domain, training even smaller (2-, 4-, and 8-dim) transcoders on the MLPs inside a CartPole policy.
3. Interpretability in Reinforcement Learning
While interpretability in vision and language has advanced rapidly, RL remains relatively under-explored. Most RL interpretability efforts focus on input-level saliency (e.g., visualizing which pixels in a game frame matter to a DQN) rather than on internal hidden activations. By contrast, our pipeline uses transcoders to compress hidden-layer activations into a small number of human-readable features, then performs causal interventions to confirm their role. To our knowledge, this is one of the first end-to-end demonstrations of sparse transcoders + interventions in an RL policy.
Methodology
1. CartPole Policy Training
We train a policy network on OpenAI Gym's CartPole-v1. The state consists of cart position, cart velocity, pole angle, and pole angular velocity. Our policy is an MLP:
- Layer 1 (fc1): Linear 4 → 128, ReLU.
- Layer 2 (fc2): Linear 128 → 64, ReLU.
- Output (fc3): Linear 64 → 2 (logits for "move left" vs. "move right").
We train via REINFORCE (policy gradient) until the average episode return is ≥ 200 over 100 validation episodes. Training details (a minimal training sketch follows the list below):
- Learning rate: 1e-3 (Adam)
- Discount factor γ = 0.99
- Batch size: per-episode updates (no replay)
- Number of episodes: ~2000 until convergence (average return ≈ 200)
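For readers who want to reproduce this setup, here is a minimal sketch of the policy network and the per-episode REINFORCE update. The class and helper names (PolicyNet, run_episode_and_update) are illustrative rather than the exact training script, and the snippet assumes the classic gym API:

import gym
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """4 -> 128 -> 64 -> 2 MLP policy, as described above."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 2)

    def forward(self, x):
        h1 = F.relu(self.fc1(x))
        h2 = F.relu(self.fc2(h1))
        return self.fc3(h2)  # action logits

env = gym.make("CartPole-v1")
policy = PolicyNet()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def run_episode_and_update():
    """Sample one episode, then apply a per-episode REINFORCE update."""
    state = env.reset()  # classic gym API; gymnasium returns (obs, info) instead
    log_probs, rewards, done = [], [], False
    while not done:
        logits = policy(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)
    # Discounted returns, computed backwards through the episode
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize for variance reduction
    loss = -(torch.cat(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)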
2. Collecting Hidden Activations & Raw States
After policy convergence, we run 50 episodes (≈ 10,000 timesteps) and record:
- Raw state: (cart_pos, cart_vel, pole_angle, angle_vel).
- Layer 1 input: the raw 4-dim state.
- Layer 1 output: the 128-dim fc1 activation.
- Layer 2 input: the 128-dim Layer 1 output.
- Layer 2 output: the 64-dim fc2 activation.
We store ~20k samples of (state, layer input, layer output) per layer. In PyTorch, we attach forward hooks:
# Buffers filled by the hooks (~20k samples per layer); current_state is updated by the rollout loop
MAX_SAMPLES_PER_LAYER = 20000
layer1_data = {"inputs": [], "outputs": [], "states": []}
layer2_data = {"inputs": [], "outputs": [], "states": []}
current_state = {"value": None}

# Hook for the first hidden layer (fc1)
def hook_fc1(module, input_tuple, output):
    if len(layer1_data["inputs"]) < MAX_SAMPLES_PER_LAYER:
        # input_tuple is a tuple, e.g. (tensor_of_shape_[1, obs_dim],)
        inp = input_tuple[0].detach().cpu()
        outp = output.detach().cpu()
        layer1_data["inputs"].append(inp.squeeze(0))    # shape [obs_dim]
        layer1_data["outputs"].append(outp.squeeze(0))  # shape [hidden1]
        layer1_data["states"].append(current_state["value"].copy())

# Hook for the second hidden layer (fc2)
def hook_fc2(module, input_tuple, output):
    if len(layer2_data["inputs"]) < MAX_SAMPLES_PER_LAYER:
        inp = input_tuple[0].detach().cpu()
        outp = output.detach().cpu()
        layer2_data["inputs"].append(inp.squeeze(0))    # shape [hidden1]
        layer2_data["outputs"].append(outp.squeeze(0))  # shape [hidden2]
        layer2_data["states"].append(current_state["value"].copy())

# Register the hooks on the trained policy
policy.fc1.register_forward_hook(hook_fc1)
policy.fc2.register_forward_hook(hook_fc2)
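Because the hooks only fire when the policy runs forward, data collection is simply a matter of rolling out the trained policy while keeping current_state up to date before each forward pass. A minimal sketch of this loop (again assuming the classic gym API; the variable names match the hooks above):

import numpy as np
import torch

def collect_activations(policy, env, num_episodes=50):
    """Roll out the trained policy; the registered hooks record activations as a side effect."""
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            current_state["value"] = np.asarray(state, dtype=np.float32)
            with torch.no_grad():
                logits = policy(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
            action = int(torch.argmax(logits, dim=-1))  # greedy rollout of the trained policy
            state, _, done, _ = env.step(action)

collect_activations(policy, env)

# Stack the recorded samples into training tensors for the transcoders
layer1_inputs  = torch.stack(layer1_data["inputs"])    # [N, 4]
layer1_outputs = torch.stack(layer1_data["outputs"])   # [N, 128]
layer2_inputs  = torch.stack(layer2_data["inputs"])    # [N, 128]
layer2_outputs = torch.stack(layer2_data["outputs"])   # [N, 64]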
3. Training 2-dim, 4-dim, and 8-dim Transcoders
For each hidden layer (fc1 and fc2), we train three separate transcoders with latent_dim ∈ {2, 4, 8}. Each transcoder is defined as:
import torch.nn as nn

class SingleLayerTranscoder(nn.Module):
    def __init__(self, input_dim, latent_dim, output_dim):
        super().__init__()
        # Encoder: compress input_dim -> latent_dim
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, latent_dim),
            nn.ReLU()  # ensure non-negative codes
        )
        # Decoder: reconstruct latent_dim -> output_dim
        self.decoder = nn.Linear(latent_dim, output_dim)

    def forward(self, x):
        """
        x: [batch_size, input_dim]
        Returns:
            code:  [batch_size, latent_dim]
            recon: [batch_size, output_dim]
        """
        code = self.encoder(x)
        recon = self.decoder(code)
        return code, recon
- For layer1: in_dim = 4 (raw state), out_dim = 128 (layer1 output).
- For layer2: in_dim = 128 (layer1 output), out_dim = 64 (layer2 output).
We minimize
L = ||recon − h||² + λ · ||code||₁,
where recon is the decoder's reconstruction of the original hidden activation h, and code is the latent code. We set:
- λ = 1e-5 (to balance sparsity and reconstruction fidelity)
- Epochs = 50, batch_size = 256, learning_rate = 1e-3 (Adam)
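A minimal training loop implementing this objective for one transcoder is sketched below; it reuses the SingleLayerTranscoder class and the stacked activation tensors from the collection step (the function name and DataLoader setup are illustrative):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_transcoder(inputs, targets, latent_dim, l1_lambda=1e-5,
                     epochs=50, batch_size=256, lr=1e-3):
    """inputs: [N, input_dim] layer inputs; targets: [N, output_dim] layer outputs."""
    model = SingleLayerTranscoder(inputs.shape[1], latent_dim, targets.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(inputs, targets), batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            code, recon = model(x)
            recon_loss = nn.functional.mse_loss(recon, y)  # || recon - h ||^2
            sparsity_loss = code.abs().mean()              # L1 penalty on the latent code
            loss = recon_loss + l1_lambda * sparsity_loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# e.g. the 4-dim transcoder for layer 1
transcoder_l1_4 = train_transcoder(layer1_inputs, layer1_outputs, latent_dim=4)
torch.save(transcoder_l1_4.state_dict(), "transcoder_layer1_4dim.pth")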
Thus, across the two layers we end up with six trained transcoder checkpoints:
transcoder_layer1_2dim.pth
transcoder_layer1_4dim.pth
transcoder_layer1_8dim.pth
transcoder_layer2_2dim.pth
transcoder_layer2_4dim.pth
transcoder_layer2_8dim.pth
Here are the loss graphs for all 6 transcoders:
[Figure: training-loss curves for the six transcoders.]
Inspecting Transcoder Features
Having trained sparse transcoders at latent dimensions 2, 4, and 8 for each hidden layer, we observed that the 4-dimensional bottleneck yielded the most succinct and semantically distinct channels. In this section, we highlight a few representative neurons from the 4-dim transcoders (two from Layer 1 and two from Layer 2) that illustrate how these small codes capture human-readable features. Each neuron's description draws on its top-triggering states and its Pearson correlations with the four raw state variables.
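All of the per-neuron statistics reported below (max/mean/std, top-triggering states, and Pearson correlations) can be recomputed from the recorded codes with a few lines of NumPy; a minimal sketch, assuming the trained 4-dim Layer-1 transcoder and the collected data from earlier:

import numpy as np
import torch

def inspect_features(transcoder, layer_inputs, raw_states, top_k=5):
    """Print activation stats, top-triggering states, and state correlations per latent dim."""
    state_names = ["cart_pos", "cart_vel", "pole_angle", "angle_vel"]
    with torch.no_grad():
        codes, _ = transcoder(layer_inputs)        # [N, latent_dim]
    codes = codes.numpy()
    states = np.asarray(raw_states)                # [N, 4]
    for j in range(codes.shape[1]):
        c = codes[:, j]
        print(f"Neuron {j}: max={c.max():.4f}, mean={c.mean():.4f}, std={c.std():.4f}")
        for i in np.argsort(-c)[:top_k]:
            print("  top state:", np.round(states[i], 3), f"(activation {c[i]:.4f})")
        for k, name in enumerate(state_names):
            r = np.corrcoef(c, states[:, k])[0, 1]  # Pearson correlation
            print(f"  corr with {name}: {r:+.4f}")

inspect_features(transcoder_l1_4, layer1_inputs, layer1_data["states"])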
Layer 1 (4-Dim Transcoder)
Neuron 0: "Cart Zipping Right"
- Activation statistics:
- Max = 2.4866, Mean = 0.5814, Std = 0.2897
- Top-triggering states (cart_pos, cart_vel, pole_angle, angle_vel):
- [+2.387, +3.133, +0.095, -0.007] (activation 2.4866)
- [+2.258, +3.136, +0.103, -0.071] (activation 2.4067)
- [+1.940, +2.950, +0.127, +0.030] (activation 2.3476)
- [+2.128, +3.139, +0.113, -0.142] (activation 2.3237)
- [+2.321, +3.330, +0.101, -0.330] (activation 2.2886)
- Pearson correlations:
- cart_pos: +0.6887 | cart_vel: +0.8063 | pole_angle: +0.2075 | angle_vel: +0.1037
Interpretation: Neuron 0 fires almost exclusively when the cart is near the right boundary and moving extremely fast to the right (cart_pos ≈ +2.3 to +2.4, cart_vel ≈ +3.0). Its strong positive correlations with both cart_pos and cart_vel confirm this: it is, effectively, a "cart-zipping-right" detector. Pole angle and angular velocity play only minor roles in its activation. Whenever the cart crosses +2.3 with a large rightward velocity, Neuron 0's code value rises above 2.3, signaling that the policy should correct sharply to avoid sliding off the edge.
Neuron 2: "Pole Tilted Right on Left Side"
- Activation statistics:
- Max = 2.0946, Mean = 1.0530, Std = 0.3015
- Top-triggering states (cart_pos, cart_vel, pole_angle, angle_vel):
- [-2.037, -1.252, +0.205, +0.618] (activation 2.0946)
- [-1.590, -0.530, +0.204, +0.507] (activation 2.0507)
- [-1.893, -0.947, +0.205, +0.421] (activation 2.0504)
- [-1.620, -0.735, +0.207, +0.647] (activation 2.0468)
- [-1.870, -1.139, +0.192, +0.647] (activation 2.0392)
- Pearson correlations:
- cart_pos: -0.8164 | cart_vel: -0.0363 | pole_angle: +0.8276 | angle_vel: +0.3130
Interpretation: Neuron 2 specializes in detecting when the pole is significantly tilted right (pole_angle ≈ +0.20 rad) while the cart is far to the left (cart_pos ≈ -1.6 to -2.0). Its very high negative correlation with cart_pos (-0.8164) and strong positive correlation with pole_angle (+0.8276) confirm that semantics: "pole leaning right on the left side of the track." In these situations, the policy often needs to push right strongly, because the combination of "left-edge cart" and "right-tilted pole" is precarious.
Layer 2 (4-Dim Transcoder)
Neuron 0: "Pole Swinging Left"
- Activation statistics:
- Max = 22.8307, Mean = 4.1357, Std = 3.2498
- Top-triggering states (cart_pos, cart_vel, pole_angle, angle_vel):
- [-1.962, -1.439, +0.175, +0.730] (activation 22.8307)
- [-2.037, -1.252, +0.205, +0.618] (activation 21.5898)
- [-1.870, -1.139, +0.192, +0.647] (activation 21.0805)
- [-1.909, -1.435, +0.155, +0.629] (activation 20.9296)
- [-1.753, -1.288, +0.176, +0.596] (activation 20.0147)
- Pearson correlations:
- cart_pos: -0.5537 | cart_vel: -0.3620 | pole_angle: +0.5612 | angle_vel: +0.8708
Interpretation: Neuron 0 in Layer 2 is overwhelmingly an "angle_vel > 0" detector, i.e., "pole swinging left." Its very high positive correlation with angle_vel (0.8708) confirms this. Although its top states show the cart often on the left (cart_pos ≈ -1.7 to -2.0), that is secondary: Neuron 0 fires whenever the pole's angular velocity is strongly positive, regardless of cart position. As soon as angle_vel exceeds ~0.6, this neuron's activation climbs above 20, indicating a vigorous pole swing.
Neuron 2: "Pole Tilted Right, Near-Static"
- Activation statistics:
- Max = 12.3788, Mean = 2.6520, Std = 1.4613
- Top-triggering states (cart_pos, cart_vel, pole_angle, angle_vel):
- [-1.433, +0.009, +0.209, +0.022] (activation 9.4760)
- [-1.566, -0.328, +0.193, +0.037] (activation 9.2072)
- [-1.249, +0.185, +0.196, -0.016] (activation 9.0402)
- [-1.408, -0.173, +0.185, +0.008] (activation 8.9129)
- [-1.088, +0.518, +0.207, -0.012] (activation 8.8388)
- Pearson correlations:
- cart_pos: -0.5421 | cart_vel: +0.2089 | pole_angle: +0.7822 | angle_vel: -0.1868
Interpretation: Neuron 2 at Layer 2 activates when the pole is noticeably tilted right (pole_angle ≈ +0.20) but with very small angular velocity (angle_vel ≈ 0). Its strong positive correlation with pole_angle (+0.7822) and slight negative correlation with angle_vel (-0.1868) show that it encodes "tilt without motion," a subtler signal than "swing." This neuron thus picks out states where the pole is precariously leaning right but not yet swinging, providing an early-warning cue to the policy.
Discussion of Selected Features
- "Cart Zipping Right" (Layer 1 Neuron 0)
- By focusing on extremely large positive cart positions (x ≈ +2.3 to +2.4) and very high rightward velocity (cart_vel ≈ +3.0), this neuron isolates one of the most dangerous states for the policy: the cart is barreling off the right edge. Its high activations signal the policy to initiate a strong leftward correction. Without this channel, the agent might fail to correct early enough when the cart approaches x = ±2.4.
- "Pole Tilted Right on Left Side" (Layer 1 Neuron 2)
- By combining a strong negative correlation with cart_pos (-0.8164) and a strong positive correlation with pole_angle (+0.8276), this neuron encodes the precise scenario "pole leaning right (θ > 0) while the cart is far left." In that configuration, the policy must push right urgently to prevent the pole from crossing the +12° failure boundary, yet also avoid sliding off the left edge. This feature is central to how the model balances the tradeoff between cart position and pole angle.
- "Pole Swinging Left" (Layer 2 Neuron 0)
- A near-perfect indicator of angle_vel > 0, regardless of cart position, this neuron allows the policy's deeper layer to monitor the pole's angular velocity independently of other factors. When the pole is swinging left, the agent typically needs to correct by pushing right. By the time this neuron's activation surpasses ~15, the pole is rotating left too fast, so the policy treats it as a high-priority cue.
- "Pole Tilted Right, Near-Static" (Layer 2 Neuron 2)
- Distinct from "swing," this channel catches when the pole is tilted right (θ ≈ 0.20) but not yet swinging (θ̇ ≈ 0). It effectively warns the policy, "the pole is leaning perilously, but hasn't begun to fall fast yet." That subtle signal can help the agent decide that a gentle push is sufficient, avoiding unnecessary oscillations that would occur if it treated every tilt as an imminent swing.
Granularity vs. Redundancy Across Latent Sizes
- 2-Dim Bottleneck:
- Layer 1 yielded two broad features ("pole swing right with cart left" and "cart at right boundary"). Layer 2 yielded only "pole swing right." While useful, these two channels are too coarse: they lump together distinct configurations (e.g., "pole tilt vs. swing" or "high vs. moderate cart_vel").
- 4-Dim Bottleneck:
- Layer 1 cleanly separates "cart zipping right," "pole swing right on left," "pole tilt right on left," and "cart zipping left." Layer 2 isolates "pole swing left," "pole tilt right near-static," and "pole swing right." This set of features proved the most semantically rich and minimally redundant.
- 8-Dim Bottleneck:
- While technically yielding eight active channels, only ~3-4 of them represent unique semantics per layer; the rest are minor variations or splits of the same underlying feature. For example, Layer 1's "pole swing right on left" appears across multiple neurons with slight differences in thresholds, but no fundamentally new feature emerges beyond what 4 dims already captured.
Hence, the 4-dim transcoder represents a "sweet spot": rich enough to disentangle all major policy cues, but small enough to stay sparse and interpretable. In practice, we focus the subsequent causal-intervention experiments on these 4-dim channels, knowing that each corresponds to a clear, actionable feature in the CartPole policy's reasoning.
Causal Interventions (4-Dim Transcoder)
Although we trained transcoders at latent_dims = {2, 4, 8}, we chose to perform causal interventions only on the 4-dim transcoder because it produced the cleanest, most interpretable feature set without unnecessary dormant channels. Below are the details of each intervention experiment.
1. Intervention Mechanism
To force transcoder dimension j at a given layer, we:
- Compute the policy's normal hidden activation h at that layer.
- Extract the decoder column d_j = transcoder.decoder.weight[:, j].
- Add α · d_j to the activation, effectively setting h' = h + α · d_j.
- Continue the forward pass: subsequent layers use h' as input.
For example, in PyTorch:
# Modified forward() of the CartPole policy network (F is torch.nn.functional)
def forward(self, x, intervene_on=None):
    """
    If intervene_on is provided, it must be a dict:
        {'layer': 'layer1' or 'layer2',
         'decoder_col': torch.Tensor of shape [hidden_dim],
         'scale': float}
    We add (scale * decoder_col) to the hidden activation of the specified layer.
    """
    # Layer 1 forward
    z1 = self.fc1(x)        # [batch_size, hidden1]
    h1 = F.relu(z1)         # policy's real layer1 output

    # Intervention on layer1?
    if intervene_on is not None and intervene_on["layer"] == "layer1":
        boost_vec = intervene_on["decoder_col"].view(1, -1) * intervene_on["scale"]
        h1 = h1 + boost_vec

    # Layer 2 forward
    z2 = self.fc2(h1)       # [batch_size, hidden2]
    h2 = F.relu(z2)         # policy's real layer2 output

    # Intervention on layer2?
    if intervene_on is not None and intervene_on["layer"] == "layer2":
        boost_vec2 = intervene_on["decoder_col"].view(1, -1) * intervene_on["scale"]
        h2 = h2 + boost_vec2

    logits = self.fc3(h2)   # action logits
    return logits
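As a usage example, the boost vector for a given (layer, neuron) pair is just the corresponding decoder column scaled by α. Here, transcoder_l2_4 stands for the 4-dim Layer-2 transcoder trained earlier (the variable names are illustrative):

import torch

# Intervention on Layer 2, Neuron 2 with scale alpha = -1
alpha = -1.0
decoder_col = transcoder_l2_4.decoder.weight[:, 2].detach()    # shape [64] = hidden2
intervention = {"layer": "layer2", "decoder_col": decoder_col, "scale": alpha}

state_tensor = torch.as_tensor(env.reset(), dtype=torch.float32).unsqueeze(0)
logits_base = policy(state_tensor)                             # normal forward pass
logits_int  = policy(state_tensor, intervene_on=intervention)  # intervened forward pass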
2. Experiment Settings
For each (layer, neuron) pair in the 4-dim transcoder, we run:
- NUM_EPISODES = 50 baseline (no intervention) episodes.
- NUM_EPISODES = 50 intervention episodes, with the intervention scale α set to each of [1, -1, 2.5, -2.5] so that it reliably triggers (or suppresses) the feature.
- We record (see the evaluation-loop sketch after this list):
- Total episode rewards for the baseline and intervention runs.
- Failure modes: "left" (pole angle ≥ +0.209 rad), "right" (pole angle ≤ -0.209 rad), "cart_out" (|cart_pos| > 2.4), or "none" (reached 500 timesteps).
- Sample trajectories: for the first episode in each condition, we record the state trajectory to plot in a 2×1 figure.
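A sketch of the evaluation loop behind these numbers is shown below. It assumes the intervention-aware forward() above, the classic gym API, and the failure-mode thresholds listed in the settings; the helper name evaluate() is illustrative:

import torch

ANGLE_LIMIT, POS_LIMIT, MAX_STEPS = 0.209, 2.4, 500

def evaluate(policy, env, intervene_on=None, num_episodes=50):
    """Return per-episode total rewards and failure modes, with an optional intervention."""
    rewards, fail_modes = [], []
    for _ in range(num_episodes):
        state, done, total, steps = env.reset(), False, 0.0, 0
        while not done and steps < MAX_STEPS:
            with torch.no_grad():
                logits = policy(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0),
                                intervene_on=intervene_on)
            state, r, done, _ = env.step(int(torch.argmax(logits, dim=-1)))
            total += r
            steps += 1
        cart_pos, _, pole_angle, _ = state
        if steps >= MAX_STEPS:
            fail_modes.append("none")          # survived the full episode
        elif pole_angle >= ANGLE_LIMIT:
            fail_modes.append("left")          # labels follow the convention above
        elif pole_angle <= -ANGLE_LIMIT:
            fail_modes.append("right")
        else:
            fail_modes.append("cart_out")      # |cart_pos| > POS_LIMIT
        rewards.append(total)
    return rewards, fail_modes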
3. Experiment Results
Below is an interpretation of what happens when we force each transcoder neuron on or off (i.e., use α = ±1 or ±2.5), in light of what that neuron actually represents. In each case, forcing a positive α makes the policy act as if that feature is strongly present, while a negative α makes it act as if the feature is strongly absent (or even reversed). We compare everything back to the baseline (avg = 261.3).
Layer 1, Neuron 0 ("Cart Zipping Right"):
Recall that this neuron's positive code spikes whenever the cart is already far to the right (x ≈ +2.3 to +2.4) and moving quickly right (v ≈ +3.0). In normal operation, whenever Neuron 0 > ~2 it tells the policy: "The cart is about to slide off the right edge; correct left now."
| scale | baseline_avg | int_avg | base_fail_left | int_fail_left | base_fail_right | int_fail_right | base_fail_none | int_fail_none |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 261.3 | 148.92 | 0 | 0 | 36 | 0 | 7 | 0 |
| 2.5 | 261.3 | 144.36 | 0 | 0 | 36 | 0 | 7 | 0 |
| -1 | 261.3 | 29.38 | 0 | 0 | 36 | 50 | 7 | 0 |
| -2.5 | 261.3 | 11.26 | 0 | 0 | 36 | 50 | 7 | 0 |
- Positive α (= +1, +2.5): Forcing "Cart Zipping Right"
- By adding a positive decoder column, we trick the network into believing "the cart is already barreling right," even if it isn't.
- In response, the policy continually over-corrects left, trying to keep the cart "away from the right edge." As a result:
- avg reward drops from 261 → 148 (α=1) or 144 (α=2.5).
- No episode ever reaches 500 timesteps, because the policy oscillates or mis-balances. Both int_fail_right=0 and int_fail_none=0 indicate it never fails by pole angle and never survives to the maximum. The dominant failure is "cart_out," meaning the cart eventually goes off the left edge instead.
- In short, if you force "cart is zipping right" when it's not, the policy constantly over-steers left, causing it to crash quickly.
- Negative α (= -1, -2.5): Forcing "Anti-Cart Zipping Right" (i.e., "Cart Zipping Left")
- A negative α effectively flips the sign of that feature, so the agent "thinks" the cart is zipping left. It then (wrongly) pushes right, even when the cart is centered.
- The result is catastrophic:
- avg reward plummets to 29 (α=-1) or 11 (α=-2.5).
- int_fail_right=50 shows that in every one of the 50 episodes, the pole fails on the right side. (No episode ever reaches "none" or "left" because the intervention forces it to push right immediately.)
- In other words, faking "cart zipping left" makes the policy push right continuously, and it immediately drives the cart off the right edge (or lets the pole fall right).

Layer 1, Neuron 2 ("Pole Tilted Right on Left Side")
Neuron 2 fires when (cart_pos ≈ -1.6 to -2.0) AND (pole_angle ≈ +0.20), i.e., "the cart is far left but the pole is already tilting right." In that scenario, the correct action is usually to push right (to catch the pole) while being careful not to slide off the left edge.
| scale | baseline_avg | int_avg | base_fail_left | int_fail_left | base_fail_right | int_fail_right | base_fail_none | int_fail_none |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 261.3 | 14.76 | 0 | 0 | 36 | 50 | 7 | 0 |
| 2.5 | 261.3 | 9.38 | 0 | 0 | 36 | 50 | 7 | 0 |
| -1 | 261.3 | 148.34 | 0 | 0 | 36 | 0 | 7 | 0 |
| -2.5 | 261.3 | 186.76 | 0 | 0 | 36 | 0 | 7 | 2 |
- Positive α (= +1, +2.5): Forcing "Pole Tilted Right on Left Side"
- We cheat the policy into believing "the pole is already leaning too far right while on the left." The correct reaction in that situation is a strong right-push to keep the pole from falling.
- As a result:
- avg reward crashes to 14.8 (α=+1) or 9.4 (α=+2.5).
- int_fail_right=50 ("pole falls right" in all 50 episodes), because the agent constantly tries to push right (thinking it must counter that tilt), but ends up tipping the pole right or sliding off.
- In short, faking that combination (cart left + pole right) drives the agent to push right nonstop, which quickly leads to failure on the right side.
- Negative α (= -1, -2.5): Forcing "Anti-(Pole Tilted Right on Left Side)" (i.e., "Pole Tilted Left on Left Side" or "No Tilt")
- A negative α makes the policy think "the pole is tilted left," or at least "not tilted right while on the left edge." In practice, that reduces how often it over-corrects right.
- Effects:
- α=-1 → int_avg ≈ 148.3 (still worse than baseline, but not a catastrophic failure). int_fail_right=0, so it no longer crashes with "pole right." Instead, it mostly ends by "cart_out" or "pole left" because it is under-reacting.
- α=-2.5 → int_avg ≈ 186.8, with int_fail_none=2 (two episodes actually survive to 500 timesteps). It improves over α=-1 because it is even less triggered by "pole-right-on-left."
- Thus, pushing Neuron 2 negative partially "calms down" the policy's over-reaction to the left-edge tilt; performance recovers substantially (186.8 is still below the 261 baseline, but far from catastrophic).

Layer 2, Neuron 0 ("Pole Swinging Left")
Neuron 0 at Layer 2 fires nearly any time θ̇ > 0 (the pole's angular velocity is strongly positive, i.e., "pole is swinging left"). When that code spikes, it tells the policy: "push right next."
| scale | baseline_avg | int_avg | base_fail_left | int_fail_left | base_fail_right | int_fail_right | base_fail_none | int_fail_none |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 261.3 | 178.44 | 0 | 0 | 36 | 43 | 7 | 0 |
| 2.5 | 261.3 | 132.26 | 0 | 4 | 36 | 0 | 7 | 0 |
| -1 | 261.3 | 201.58 | 0 | 0 | 36 | 0 | 7 | 0 |
| -2.5 | 261.3 | 51.7 | 0 | 0 | 36 | 50 | 7 | 0 |
- Positive α (= +1): Forcing "Pole Swinging Left" Weakly
- The agent is tricked into believing "the pole is swinging left" more often than it really is, so it pushes right prematurely.
- Consequences:
- int_avg drops to 178.4.
- int_fail_right=43: out of 50 episodes, 43 end by the pole falling to the right (because the agent often pushes right too soon, letting the pole swing the other way).
- It is less catastrophic than some Layer 1 cases, because the policy still has a chance to recover, but overall performance suffers.
- Positive α (= +2.5): Forcing "Pole Swinging Left" Strongly
- Now the policy is even more certain "the pole is swinging left," so it pushes right almost constantly.
- Effects:
- int_avg plummets to 132.3.
- int_fail_left=4: in a few episodes the pole actually ends up falling left, because pushing right nonstop overshoots and lets it flip over to the left.
- In other words, over-triggering "swing left" floods the policy with a right-push bias, making it oscillate badly.
- Negative α (= -1): Forcing "Anti-(Pole Swinging Left)" (i.e., "Pole Swinging Right" or "No Swing Left")
- Now the agent underestimates how much the pole is swinging left, so it may wait longer before pushing right.
- Result:
- int_avg = 201.6, a modest drop from 261.
- int_fail_right=0: it doesn't crash with the pole falling right because it is not over-compensating to the right. Instead, most failures become "cart_out" or "pole left."
- So pushing this neuron negative "dampens" how aggressively the agent counters a left swing, yielding somewhat better performance than α=+1 or +2.5, though still worse than baseline.
- Negative α (= -2.5): Forcing "Strong Anti-(Pole Swinging Left)" (i.e., "Pole Swinging Right")
- The policy is now convinced "the pole is swinging right" almost all the time, so it pushes left prematurely.
- Results:
- int_avg crashes to 51.7, because it basically pushes left constantly.
- int_fail_right=50: all episodes end by the pole falling right, ironically, since the pole eventually swings back faster than the policy reacts.
- Concretely, flipping "swing left" too far in the negative direction makes the policy push left nonstop, which is just as disastrous.
Layer 2, Neuron 2 ("Pole Tilted Right, Near-Static")
Neuron 2 fires when θ ≈ +0.20 but θ̇ ≈ 0, i.e., "the pole is leaning right but not yet swinging." In practice, that is an early warning signal: the pole is about to tip right, so a small left push is needed before it begins to fall hard.
| scale | baseline_avg | int_avg | base_fail_left | base_fail_right | base_fail_none | int_fail_left | int_fail_right | int_fail_none |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 261.3 | 245.32 | 0 | 36 | 7 | 0 | 40 | 3 |
| 2.5 | 261.3 | 215.3 | 0 | 36 | 7 | 0 | 39 | 3 |
| -1 | 261.3 | 322.74 | 0 | 36 | 7 | 0 | 19 | 10 |
| -2.5 | 261.3 | 230.86 | 0 | 36 | 7 | 0 | 0 | 0 |
- Positive α (= +1, +2.5): Forcing "Pole Tilted Right, Near-Static"
- We trick the policy into always "seeing a slight right-lean" (θ ≈ +0.20) even when the pole is more vertical. In principle, the policy's correct reaction is a gentle left push.
- Outcomes:
- α=+1 → int_avg = 245.3 (just a mild drop). int_fail_right = 40, int_fail_none = 3, so a few episodes even survive to 500 timesteps.
- α=+2.5 → int_avg = 215.3 (worse), int_fail_right = 39, int_fail_none = 3.
- Interpretation: constantly believing "the pole is leaning right but not yet swinging" causes the agent to gently push left most of the time. That sometimes helps the pole stay centered (hence a handful of "none" runs), but it often under-reacts when the pole actually does swing. Hence performance falls modestly.
- Negative α (= -1): Forcing "Anti-(Pole Tilted Right)" (i.e., "Pole Tilted Left")
- Now the agent thinks "the pole is leaning left" even when it is upright. The correct reaction would be to push right.
- Result:
- int_avg jumps to 322.7 (well above baseline).
- int_fail_right = 19, int_fail_none = 10: more episodes survive the full 500 timesteps.
- Why does this improve so much? By faking a slight left-lean, the agent pushes right preemptively. This often pre-counters minor right-leans without overdoing it, so it ends up staying balanced far longer. In essence, "pushing slightly right all the time" is a decent heuristic for CartPole, and forcing this neuron negatively guides the policy to behave like that.
- Negative α (= -2.5): Forcing "Strong Anti-(Pole Tilted Right)"
- The policy is convinced "the pole is really leaning left" all the time, so it pushes right even more vigorously.
- Effect:
- int_avg = 230.9, a drop versus the α=-1 case but only slightly worse than α=+1.
- int_fail_right = 0, int_fail_none = 0: now the failures are mostly "cart_out" or "pole left," because the agent pushes right so strongly that it eventually either slides off the right or over-corrects.
- In other words, over-pushing right (by forcing this feature too negative) eventually backfires, but it is still better than forcing "pole leaning right."
Discussion
1. Why Multiple Latent Dimensions Matter
- 2-Dim vs. 4-Dim vs. 8-Dim Transcoders
- The 2-dim transcoder at each layer captures only the two most dominant features (e.g., "pole-left swing" and "cart-left & pole-right"). While these explain major behavior, they lack granularity.
- The 4-dim transcoder further disentangles those features into finer sub-categories (e.g., it separates "cart-left & upright" from "cart-left & pole-right"). This extra granularity enables more targeted interventions (e.g., forcing a "cart-left & pole-right" feature that leads to distinct failure modes).
- The 8-dim transcoder learns even more latent channels, but many remain dormant or become multisemantic. Only three or four effectively capture distinct semantics. Adding more dims beyond 4 yields diminishing interpretability returns but increases model complexity.
- Novelty & Impact
- Training multiple transcoders at different latent sizes and comparing their features is, to our knowledge, a novel contribution in RL interpretability. It shows that there is a "sweet spot" (in our case, 4 dims) where features are both separable and sparse.
2. Performance Improvement via Targeted Intervention on Layer 2, Neuron 2
- Key Finding:
- Generalization to More Complex Systems:
- Identifying the "right nudge" in larger networks.
- Building a corrective controller on top of a policy.
- Extending to multi-agent or hierarchical settings.
- Practical Implications for Future Research:
- Automated discovery of "performance-boosting" directions.
- Adaptive scaling based on state context.
- Incorporating interventions into model-based or meta-RL pipelines.
During our causal-intervention experiments, we discovered that gently pushing Layer 2, Neuron 2 (the "Pole Tilted Right, Near-Static" channel) in its negative direction (α = -1) actually improved overall CartPole performance, raising average returns from 261 to ≈ 323. In other words, by subtly biasing the policy to believe "the pole is leaning left," we induced a small, persistent rightward corrective pressure that preempted many future failures.
In high-dimensional or deep RL policies, a handful of latent neurons may each correspond to recognizable sub-behaviors (e.g., "cart drifting," "pole rotating fast," "agent approaching obstacle"). Once such features are isolated via sparse transcoders, researchers can scan for the single dimension whose moderate intervention raises average returns or reduces classically prevalent failure modes (e.g., "slipping," "stalling," "oscillation").
Instead of hard-coding corrective heuristics (like "always push right when θ > 0.1"), one can automatically discover a small linear combination of latent channels whose enforced presence or absence improves robustness. This could become a lightweight "safety wrapper": continuously monitor the transcoder codes, and whenever a harmful circuit (e.g., "agent is about to collide") approaches activation, apply a small boost along the opposite feature.
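One way such a safety wrapper could look in the notation of this post is sketched below; the monitored channel, threshold, and boost size are illustrative assumptions rather than tuned values:

import torch
import torch.nn.functional as F

class SafetyWrapper:
    """Monitor one Layer-2 transcoder channel each step; when it spikes, add a small opposing boost."""
    def __init__(self, policy, transcoder, neuron, threshold, alpha):
        self.policy, self.transcoder = policy, transcoder
        self.neuron, self.threshold, self.alpha = neuron, threshold, alpha
        self.decoder_col = transcoder.decoder.weight[:, neuron].detach()

    def act(self, state):
        x = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        with torch.no_grad():
            h1 = F.relu(self.policy.fc1(x))               # layer-1 activation
            code, _ = self.transcoder(h1)                 # Layer-2 transcoder reads the layer-1 output
            intervene_on = None
            if code[0, self.neuron] > self.threshold:     # the flagged circuit is about to fire
                intervene_on = {"layer": "layer2",
                                "decoder_col": self.decoder_col,
                                "scale": -self.alpha}     # boost along the opposite direction
            logits = self.policy(x, intervene_on=intervene_on)
        return int(torch.argmax(logits, dim=-1))

# e.g. wrapped = SafetyWrapper(policy, transcoder_l2_4, neuron=2, threshold=5.0, alpha=1.0)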
In hierarchical RL (e.g., options-based architectures), each option's network can be transcoded separately. By intervening on the "pre-failure" units in a high-level sub-policy, one could stabilize its outputs over a longer time horizon, producing smoother option transfers or fewer oscillatory behaviors.
We manually inspected Layer 2, Neuron 2 and tested a small negative α to find an improvement. In larger systems, it would be valuable to automatically scan every transcoder channel by testing small ±1 perturbations on a validation set. The channel whose negative (or positive) direction consistently raises mean return becomes a candidate for a permanent "right-nudge" controller.
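Such a scan could be automated with a loop over layers, channels, and signs, reusing the evaluate() helper sketched earlier; the scale grid and episode count below are illustrative:

import numpy as np

def scan_channels(policy, env, transcoders, scales=(+1.0, -1.0), num_episodes=20):
    """Try a small boost on every transcoder channel and rank channels by mean return."""
    results = []
    for layer_name, transcoder in transcoders.items():   # e.g. {"layer1": ..., "layer2": ...}
        latent_dim = transcoder.decoder.weight.shape[1]
        for j in range(latent_dim):
            col = transcoder.decoder.weight[:, j].detach()
            for s in scales:
                rewards, _ = evaluate(policy, env,
                                      intervene_on={"layer": layer_name, "decoder_col": col, "scale": s},
                                      num_episodes=num_episodes)
                results.append((layer_name, j, s, float(np.mean(rewards))))
    return sorted(results, key=lambda r: -r[3])   # best "right-nudge" candidates first

ranking = scan_channels(policy, env, {"layer1": transcoder_l1_4, "layer2": transcoder_l2_4})
print(ranking[:5])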
Rather than always using α = -1, one could learn a lightweight gating function that computes a context-dependent α(s) based on the raw state or low-dim features. For example, if the pole is already at θ ≈ 0.05, apply α = -0.5; if θ ≈ 0.15, apply α = -1.0; if θ ≈ -0.05, apply no intervention. This would act as a continuous safety-critic layer that fine-tunes corrective strength.
In model-based RL, one could simulate rollouts while artificially injecting "best-nudge" codes at planning time, then select actions that maximize expected downstream value even after that nudge. In meta-RL, the transcoder itself could adapt its latent codes through experience, learning which channels to suppress or amplify to handle rare failure states.
3. Practical Recommendations for Further Experiments
- Choose the Right Latent Dimensionality
- Start small (e.g., 2 dims) to capture the most salient features. If the features remain too coarse, increase to 4 dims. If multiple latent dims remain dormant (max activation ≈ 0), you can reduce dims again.
- In our experiments, 4 dims at each layer provided a robust tradeoff: distinct, sparse features that were all meaningfully active.
- Tune the L1 Sparsity Penalty
- We used λ = 1e-5. A larger λ of 1e-4 forced almost all dims to zero in our experiments, leaving only one or two active features. A smaller λ (e.g., 1e-6) makes the codes dense, which reduces interpretability.
- Visualize histograms of the latent activations to choose a λ that yields a handful of nonzero channels.
- Perform More Causal Interventions
- Always verify features causally by forcing them on/off and observing policy failures or successes. Correlation alone can be misleading: two features might correlate but not be causal.
- For each feature j, sweep the intervention scale α and plot "Avg Reward vs. α" to find the "activation threshold" at which policy behavior sharply changes (see the sketch after this list).
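A sketch of this sweep, again reusing the evaluate() helper from the intervention experiments (plotting details and grids are illustrative):

import numpy as np
import matplotlib.pyplot as plt

def sweep_alpha(policy, env, transcoder, layer_name, neuron, alphas, num_episodes=20):
    """Sweep the intervention scale for one feature and plot average return vs. alpha."""
    col = transcoder.decoder.weight[:, neuron].detach()
    means = []
    for a in alphas:
        rewards, _ = evaluate(policy, env,
                              intervene_on={"layer": layer_name, "decoder_col": col, "scale": a},
                              num_episodes=num_episodes)
        means.append(np.mean(rewards))
    plt.plot(alphas, means, marker="o")
    plt.xlabel("intervention scale alpha")
    plt.ylabel("average reward")
    plt.title(f"{layer_name}, neuron {neuron}")
    plt.show()

sweep_alpha(policy, env, transcoder_l2_4, "layer2", neuron=2, alphas=np.linspace(-2.5, 2.5, 11))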
4. Limitations & Future Work
- Single Environment
- CartPole is a classical toy problem. We plan to extend this pipeline to other environments such as LunarLander, and eventually continuous-action tasks (e.g., Pendulum, MuJoCo, even robotic systems).
- More Complex Architectures
- Modern RL often uses convolutional networks (e.g., Atari). We must test whether transcoders on convolutional hidden layers yield easily interpretable features (e.g., "ball moving left" in Pong).
- Quantitative Interpretability Metrics
- Develop a metric such as "fraction of variance explained by the top-K features" or "linearity" (how well a linear threshold on one latent dim classifies a semantic event), so we can automatically compare transcoder performance across latent dims.
Conclusion
In this research post, we presented a proof-of-concept pipeline for interpreting RL policies via sparse transcoders and causal interventions:
- Train a CartPole MLP policy (4 → 128 → 64 → 2) until it reliably solves the task.
- Collect hidden-layer activations (128-dim and 64-dim) and raw states (4-dim).
- Train 2, 4, and 8-dim transcoders on each hidden layer, imposing an L1 penalty to encourage sparsity.
- Inspect the latent codes and label each dimension in human-readable terms (e.g., "pole-left swing," "cart-left & pole-right").
- Causally intervene on the 4-dim transcoder (the most interpretable) by adding decoder columns to the policy's hidden activations, demonstrating that each feature genuinely drives policy failures (pole falls or cart-outs).
By providing clear, reproducible code and a concise methodology, we hope this work serves as a blueprint for interpretable RL. Whether you are an RL novice or a seasoned mechanistic-interpretability researcher, this pipeline can be adapted to more complex tasks, deeper networks, or real-world robots. We encourage readers to clone our Colab notebook, experiment with their own environments, and push the limits of what a 2-, 4-, or 8-dim transcoder can reveal about an RL agent's inner workings.
References
- Open-sourcing circuit tracing tools (Anthropic, May 29, 2025). https://www.anthropic.com/research/open-source-circuit-tracing
- Open Source Replication & Commentary on Anthropic's Dictionary Learning (Neel Nanda, Oct 23, 2023). https://www.lesswrong.com/posts/fKuugaxt2XLTkASkk/open-source-replication-and-commentary-on-anthropic-s
- transcoder_circuits: reverse-engineering LLM circuits with transcoders (Jacob Dunefsky, GitHub). https://github.com/jacobdunefsky/transcoder_circuits
- Transcoders Find Interpretable LLM Feature Circuits (arXiv:2406.11944, Jun 2024). https://arxiv.org/abs/2406.11944
- Mechanistic Interpretability in Transformers (Ayyuce Kizrak, GitHub). https://github.com/ayyucekizrak/Mechanistic-Interpretability
- Transcoders Enable Fine-Grained Interpretable Circuit Analysis (Alignment Forum, Jul 2023). https://www.alignmentforum.org/posts/YmkjnWtZGLbHRbzrP/transcoders-enable-fine-grained-interpretable-circuit
- Transcoders Beat Sparse Autoencoders for Interpretability (arXiv:2501.18823, Jan 31, 2025).