
Interpretable RL via Multiscale Transcoders: Sparse Bottlenecks at 2, 4, and 8 Dimensions

Summary

Interpretable RL is challenging because we rarely understand how hidden policy signals drive behavior. In this experiment, we apply sparse transcoders to compress the hidden activations of a REINFORCE policy trained on CartPole into 2, 4, and 8 dimensions. To do this, we train small encoder/decoder modules on each hidden layer, and find that a 4-dimensional bottleneck strikes the best balance between sparsity and informativeness. We then shift one feature during rollout: that single adjustment raises average return from 261 to 323 (over 50 episodes) without any retraining. To our knowledge, this is one of the first end-to-end sparse-transcoder + causal-intervention pipelines in RL. Beyond CartPole, our method suggests a general way to interpret policy internals and steer behavior in more complex environments.

Introduction

Interpreting the internal reasoning of reinforcement‐learning (RL) agents remains a pressing challenge. Correlation‐based methods (e.g., saliency maps) can highlight which input features are associated with certain actions, but they do not reveal causal mechanisms inside the policy network. In parallel, the field of mechanistic interpretability (especially for large language models) has recently exploded with tools that trace “circuits” across transformer MLP sublayers, revealing how specific neurons or feature combinations cause particular outputs.

Anthropic’s May 29, 2025 open-source release of circuit-tracing tools provides utilities for generating attribution graphs on transformer models, making it easier to visualize and perturb causal pathways in large language models (LLMs). At the same time, Jacob Dunefsky’s transcoder-circuits repository demonstrates that transcoders (encoder/decoder modules trained to reproduce an MLP sublayer’s output from its input through a sparse latent code) can decompose activations into sparse, interpretable features and be used to reverse-engineer fine-grained circuits in GPT-2 and other LLMs.

Inspired by these advances, we ask: Can we build a similarly transparent, causal pipeline for a simple RL agent such as CartPole?

In this blog post, we:

Train a CartPole policy (MLP with hidden layers of size 128 and 64).
Fit three sparse transcoders (latent_dim = 2, 4, 8) on each hidden layer’s activations, imposing an L1 penalty to encourage sparsity and interpretability.
Inspect the latent features of each transcoder, showing how increasing latent dimension yields progressively more detailed features.
Causally intervene on the 4-dim transcoder (the “sweet spot” between too few and too many dims), adding decoder columns to the policy’s hidden activations and measuring how the agent’s performance and failure modes change.
Contextualize our pipeline within broader interpretability work, including Anthropic’s attribution graphs and Jacob Dunefsky’s demonstration that transcoders produce more interpretable features than standard sparse autoencoders.

By providing a small‐scale but complete pipeline, we hope to offer a reproducible blueprint for interpretable RL that can be extended to more complex environments. We encourage readers to experiment further using our colab notebook.

Related Work

1. Mechanistic Interpretability & Circuit Tracing

Anthropic’s open-source circuit-tracing tools (May 29, 2025) enable researchers to generate attribution graphs for transformers, revealing how feature activations propagate across MLP layers and influence logits or downstream tasks. Their demo notebooks and the circuit-tracer library show how to trace multi-step reasoning circuits in Gemma-2 (2B) and Llama-3.2 (1B) using transcoders trained on MLP sublayers. These transcoders re-express high-dimensional hidden activations as sparse codes, allowing interpretable features to be extracted and tracked.

Similarly, Nanda et al. (Oct 2023) used sparse autoencoders (SAEs) to interpret superposition in GPT-1’s MLP layers, finding that a sizable fraction of latent features were monosemantic (each feature had a single clear interpretation). Their work laid the groundwork for the idea that sparse bottlenecks can uncover human‐readable circuits in large networks.

2. Transcoders vs. Sparse Autoencoders

A recent paper, “Transcoders Beat Sparse Autoencoders for Interpretability” (Jan 31, 2025), systematically compared transcoders (trained to reconstruct an MLP sublayer’s output from its input) to traditional SAEs (trained to reconstruct activations from themselves) and found that transcoders yield significantly more interpretable features under the same sparsity budget. By focusing on input→output reconstruction, transcoders encourage the latent code to capture the features that the layer’s transformation actually uses.

“Transcoders Find Interpretable LLM Feature Circuits” first proposed this approach, demonstrating that a small 16-dim transcoder on GPT-2-small’s MLP sublayers uncovered circuits corresponding to “greater‐than comparison,” “demographic bias,” and “semantic partitioning”. Our work adapts this concept to the RL domain, training even smaller (2, 4, 8 dim) transcoders on MLPs inside a CartPole policy.

3. Interpretability in Reinforcement Learning

While interpretability in vision and language has advanced rapidly, RL remains relatively under‐explored. Most RL interpretability efforts focus on input‐level saliency (e.g., visualizing which pixels in a game frame matter to a DQN) but not on internal hidden activations. By contrast, our pipeline uses transcoders to compress hidden‐layer activations into truly human‐readable features, then performs causal interventions to confirm their role. To our knowledge, this is one of the first end‐to‐end demonstrations of sparse transcoder + intervention in an RL policy.

Methodology

1. CartPole Policy Training

We train a policy network $\pi_\theta$ on OpenAI Gym’s CartPole-v1. The state $s = [x, \dot x, \theta, \dot\theta]$ consists of cart position, cart velocity, pole angle, and pole angular velocity. Our policy is an MLP:

Layer 1 (fc1): Linear 4→128, ReLU.
Layer 2 (fc2): Linear 128→64, ReLU.
Output (fc3): Linear 64→2 (logits for “move left” vs. “move right”).

We train via REINFORCE (policy gradient) until the average episode return reaches at least 200 over 100 validation episodes. Training details:

Learning rate: 1e-3 (Adam)
Discount factor γ = 0.99
Batch size: per-episode updates (no replay)
Number of episodes: ~2000 until convergence (average return ≈ 200)
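The per-episode REINFORCE update described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's training code: the function names `discounted_returns` and `reinforce_loss` are hypothetical, and the optional return normalization is a common variance-reduction trick that the post does not say it used.

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every timestep of one episode."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return torch.tensor(returns, dtype=torch.float32)

def reinforce_loss(log_probs, rewards, gamma=0.99, normalize=True):
    """Policy-gradient loss for one episode: -sum_t log pi(a_t | s_t) * G_t.

    log_probs: list of 0-dim tensors, one per timestep, still attached to the graph.
    """
    returns = discounted_returns(rewards, gamma)
    if normalize and len(rewards) > 1:
        # Assumed variance-reduction step; not stated in the post.
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return -(torch.stack(log_probs) * returns).sum()
```

With per-episode updates, one would call `reinforce_loss` once at the end of each episode, backpropagate, and take a single Adam step at learning rate 1e-3.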

2. Collecting Hidden Activations & Raw States

After policy convergence, we run 50 episodes (≈ 10,000 timesteps) and record:

Raw state: $s = [x, \dot x, \theta, \dot\theta]$.
Layer 1 pre-activation: $z^{(1)} = W_{1} s + b_{1}$.
Layer 1 output: $h^{(1)} = \mathrm{ReLU}(z^{(1)}) \in \mathbb{R}^{128}$.
Layer 2 pre-activation: $z^{(2)} = W_{2} h^{(1)} + b_{2}$.
Layer 2 output: $h^{(2)} = \mathrm{ReLU}(z^{(2)}) \in \mathbb{R}^{64}$.

We store ~20k samples of $(z^{(1)}_i,\,h^{(1)}_i,\,h^{(2)}_i,\,s_i)$ . In PyTorch, we attach forward hooks:

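# Buffers filled by the hooks below, one dict per hidden layer
layer1_data = {"inputs": [], "outputs": [], "states": []}
layer2_data = {"inputs": [], "outputs": [], "states": []}
MAX_SAMPLES_PER_LAYER = 10_000   # assumed cap; the post does not state the value
current_state = {"value": None}  # the rollout loop stores the latest observation here

# Hook for the first hidden layer (fc1).
# Note: `output` is whatever the hooked module returns; for a bare nn.Linear
# that is the pre-ReLU value, so apply ReLU afterwards if h^(1) is wanted.
def hook_fc1(module, input_tuple, output):
    if len(layer1_data["inputs"]) < MAX_SAMPLES_PER_LAYER:
        # input_tuple is a tuple, e.g. (tensor_of_shape_[1, obs_dim],)
        inp = input_tuple[0].detach().cpu()
        outp = output.detach().cpu()
        layer1_data["inputs"].append(inp.squeeze(0))    # shape [obs_dim]
        layer1_data["outputs"].append(outp.squeeze(0))  # shape [hidden1]
        layer1_data["states"].append(current_state["value"].copy())

# Hook for the second hidden layer (fc2)
def hook_fc2(module, input_tuple, output):
    if len(layer2_data["inputs"]) < MAX_SAMPLES_PER_LAYER:
        inp = input_tuple[0].detach().cpu()
        outp = output.detach().cpu()
        layer2_data["inputs"].append(inp.squeeze(0))    # shape [hidden1]
        layer2_data["outputs"].append(outp.squeeze(0))  # shape [hidden2]
        layer2_data["states"].append(current_state["value"].copy())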
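Hooks like these only fire once they are registered on a module. The following sketch shows the registration and removal pattern on a toy layer; the `fc1` here is a randomly initialised stand-in, not the trained policy, and `capture_hook` and `captured` are hypothetical names. It also makes explicit that a forward hook on an `nn.Linear` sees the pre-ReLU output.

```python
import torch
import torch.nn as nn

# Toy stand-in for the policy's first layer (4 -> 128); not the trained policy.
fc1 = nn.Linear(4, 128)
captured = {"inputs": [], "outputs": []}

def capture_hook(module, input_tuple, output):
    # Store detached CPU copies with the batch dimension squeezed away.
    captured["inputs"].append(input_tuple[0].detach().cpu().squeeze(0))
    captured["outputs"].append(output.detach().cpu().squeeze(0))

handle = fc1.register_forward_hook(capture_hook)

state = torch.zeros(1, 4)  # placeholder observation with batch size 1
with torch.no_grad():
    z1 = fc1(state)        # the hook fires during this call

handle.remove()            # detach the hook once collection is done

# The hook on nn.Linear recorded the pre-ReLU value; ReLU gives h^(1).
h1 = torch.relu(captured["outputs"][0])
```

If the post-ReLU activation is the intended transcoder target, applying `torch.relu` to the stored outputs (as in the last line) is one way to obtain it.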

3. Training 2-dim, 4-dim, and 8-dim Transcoders

For each hidden layer $L \in \{\text{layer1},\,\text{layer2}\}$ , we train three separate transcoders with latent_dim ∈ {2, 4, 8}. Each transcoder $T^{(L)}_{k}$ is defined as:

import torch.nn as nn

class SingleLayerTranscoder(nn.Module):
    def __init__(self, input_dim, latent_dim, output_dim):
        super().__init__()
        # Encoder: compress input_dim → latent_dim
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, latent_dim),
            nn.ReLU()  # ensure non-negative codes
        )
        # Decoder: reconstruct latent_dim → output_dim
        self.decoder = nn.Linear(latent_dim, output_dim)

    def forward(self, x):
        """
        x: [batch_size, input_dim]
        Returns:
          code: [batch_size, latent_dim]
          recon: [batch_size, output_dim]
        """
        code = self.encoder(x)
        recon = self.decoder(code)
        return code, recon

For layer1:

in_dim = 4 (raw state)
out_dim = 128 (layer1 output)

For layer2:

in_dim = 128 (layer1 output)
out_dim = 64 (layer2 output)

We minimize:

$$\mathcal{L} \;=\; \frac{1}{N}\sum_{i=1}^N \Bigl(\bigl\|\hat{h}_i - h_i\bigr\|_2^2 \;+\; \lambda\,\|c_i\|_1\Bigr),$$

where $\hat{h}_i$ is the decoder’s reconstruction of the original hidden activation $h_i$ , and $c_i\in\mathbb{R}^k$ is the latent code. We set:

$\lambda = 1\times10^{-5}$ (to find a balance between sparsity and reconstruction fidelity)
Epochs = 50, batch_size = 256, learning_rate = 1e-3 (Adam)
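A training loop matching this objective might look like the sketch below. It is illustrative rather than the notebook's code: `TinyTranscoder` is a compact stand-in that mirrors `SingleLayerTranscoder`, `train_transcoder` is a hypothetical helper, and the data at the bottom is synthetic (random states and a random ReLU layer), used only so the example runs on its own.

```python
import torch
import torch.nn as nn

class TinyTranscoder(nn.Module):
    """Compact stand-in mirroring SingleLayerTranscoder (returns code, recon)."""
    def __init__(self, input_dim, latent_dim, output_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Linear(latent_dim, output_dim)

    def forward(self, x):
        code = self.encoder(x)
        return code, self.decoder(code)

def train_transcoder(transcoder, inputs, targets, l1_lambda=1e-5,
                     epochs=50, batch_size=256, lr=1e-3):
    """Minimize mean squared L2 reconstruction error + l1_lambda * mean L1 norm of the code."""
    optimizer = torch.optim.Adam(transcoder.parameters(), lr=lr)
    n = inputs.shape[0]
    history = []
    for _ in range(epochs):
        perm = torch.randperm(n)  # reshuffle samples every epoch
        epoch_loss = 0.0
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            code, recon = transcoder(inputs[idx])
            mse = ((recon - targets[idx]) ** 2).sum(dim=1).mean()
            l1 = code.abs().sum(dim=1).mean()
            loss = mse + l1_lambda * l1
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item() * idx.numel()
        history.append(epoch_loss / n)  # per-sample average loss for the epoch
    return history

# Synthetic stand-in data (NOT the recorded CartPole activations).
torch.manual_seed(0)
states = torch.randn(512, 4)
targets = torch.relu(states @ torch.randn(4, 128))
model = TinyTranscoder(input_dim=4, latent_dim=4, output_dim=128)
history = train_transcoder(model, states, targets, epochs=5)
```

The returned `history` holds one averaged loss value per epoch, which is the quantity one would plot in loss curves like the figures that follow.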

Thus, across the two layers we end up with six trained transcoder checkpoints (three per layer):

transcoder_layer1_2dim.pth
transcoder_layer1_4dim.pth
transcoder_layer1_8dim.pth
transcoder_layer2_2dim.pth
transcoder_layer2_4dim.pth
transcoder_layer2_8dim.pth

Here are the loss graphs for all 6 transcoders:

Figure 1: Reconstruction loss (MSE + L1) vs. epoch for all 2-, 4-, and 8-dim transcoders on layer 1

Figure 2: Reconstruction loss (MSE + L1) vs. epoch for all 2-, 4-, and 8-dim transcoders on layer 2

Inspecting Transcoder Features

Having trained sparse transcoders at latent dimensions 2, 4, and 8 for each hidden layer, we observed that the 4-dimensional bottleneck yielded the most succinct and semantically distinct channels. In this section, we highlight a few representative neurons from the 4-dim transcoders (two from Layer 1 and two from Layer 2) that illustrate how these small codes capture human-readable features. Each neuron’s description draws on its top-triggering states and Pearson correlations with the four raw state variables $[\,\text{cart\_pos},\,\text{cart\_vel},\,\text{pole\_angle},\,\text{angle\_vel}\,]$.
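The per-neuron summaries below (activation statistics, top-triggering states, Pearson correlations) can be computed along these lines. This is a sketch with a hypothetical helper name, `summarize_feature`, and the usage example runs on synthetic data rather than the recorded rollouts.

```python
import numpy as np

STATE_NAMES = ["cart_pos", "cart_vel", "pole_angle", "angle_vel"]

def summarize_feature(codes, states, neuron, top_k=5):
    """Summarize one latent neuron.

    codes:  [N, latent_dim] transcoder codes
    states: [N, 4] raw states recorded at the same timesteps
    """
    act = codes[:, neuron]
    top_idx = np.argsort(act)[::-1][:top_k]  # indices of the largest activations
    pearson = {}
    for d, name in enumerate(STATE_NAMES):
        # np.corrcoef returns a 2x2 matrix; entry [0, 1] is Pearson's r.
        # (r is NaN if the neuron's activation is constant.)
        pearson[name] = float(np.corrcoef(act, states[:, d])[0, 1])
    return {
        "max": float(act.max()),
        "mean": float(act.mean()),
        "std": float(act.std()),
        "top_states": states[top_idx],
        "top_activations": act[top_idx],
        "pearson": pearson,
    }

# Synthetic example: neuron 1 is an exact linear function of pole_angle.
rng = np.random.default_rng(0)
states = rng.normal(size=(1000, 4))
codes = np.stack([np.maximum(states[:, 1], 0.0), 2.0 * states[:, 2] + 1.0], axis=1)
report = summarize_feature(codes, states, neuron=1)
```

Because correlations are computed against one state variable at a time, a neuron that responds to a conjunction (for example cart position and pole angle together) shows up as two moderately strong correlations rather than one, which is why the top-triggering states are worth reading alongside the coefficients.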

Layer 1 (4-Dim Transcoder)

Neuron 0: “Cart Zipping Right”

Activation statistics:

Max = 2.4866 , Mean = 0.5814 , Std = 0.2897

Top‐triggering states (cart_pos, cart_vel, pole_angle, angle_vel):

[+2.387,+3.133,+0.095,–0.007] (activation 2.4866)
[+2.258,+3.136,+0.103,–0.071] (activation 2.4067)
[+1.940,+2.950,+0.127,+0.030] (activation 2.3476)
[+2.128,+3.139,+0.113,–0.142] (activation 2.3237)
[+2.321,+3.330,+0.101,–0.330] (activation 2.2886)

Pearson correlations:

cart_pos: +0.6887 | cart_vel: +0.8063 | pole_angle: +0.2075 | angle_vel: +0.1037

Interpretation:
Neuron 0 fires most strongly when the cart is near the right boundary and moving fast to the right (cart_pos ≈ +1.9 to +2.4 and cart_vel ≈ +3.0 to +3.3 in the top-triggering states). Its strong positive correlations with both cart_pos and cart_vel are consistent with this: it is, effectively, a “cart-zipping-right” detector. Pole angle and angular velocity play only minor roles in its activation. In these extreme states, Neuron 0’s code value reaches roughly 2.3 to 2.5, which plausibly signals that the policy should correct sharply to avoid sliding off the edge.

Neuron 2: “Pole Tilted Right on Left Side”

Activation statistics:

Max = 2.0946 , Mean = 1.0530 , Std = 0.3015

Top‐triggering states (cart_pos, cart_vel, pole_angle, angle_vel):

[–2.037,–1.252,+0.205,+0.618] (activation 2.0946)
[–1.590,–0.530,+0.204,+0.507] (activation 2.0507)
[–1.893,–0.947,+0.205,+0.421] (activation 2.0504)
[–1.620,–0.735,+0.207,+0.647] (activation 2.0468)
[–1.870,–1.139,+0.192,+0.647] (activation 2.0392)

Pearson correlations:

cart_pos: –0.8164 | cart_vel: –0.0363 | pole_angle: +0.8276 | angle_vel: +0.3130

Interpretation:
Neuron 2 specializes in detecting when the pole is significantly tilted right (pole_angle ≈ +0.20 rad) while the cart is far to the left (cart_pos ≈ –1.6 to –2.0). Its very high negative correlation with cart_pos (–0.8164) and strong positive correlation with pole_angle (+0.8276) confirm that semantic: “pole swinging right on the left side of the track.” In these situations, the policy often needs to push right strongly, because the combination of “left-edge cart” and “right-tilted pole” is precarious.

Layer 2 (4-Dim Transcoder)

Neuron 0: “Pole Swinging Left”

Activation statistics:

Max = 22.8307 , Mean = 4.1357 , Std = 3.2498

Top‐triggering states (cart_pos, cart_vel, pole_angle, angle_vel):

[–1.962,–1.439,+0.175,+0.730] (activation 22.8307)
[–2.037,–1.252,+0.205,+0.618] (activation 21.5898)
[–1.870,–1.139,+0.192,+0.647] (activation 21.0805)
[–1.909,–1.435,+0.155,+0.629] (activation 20.9296)
[–1.753,–1.288,+0.176,+0.596] (activation 20.0147)

Pearson correlations:

cart_pos: –0.5537 | cart_vel: –0.3620 | pole_angle: +0.5612 | angle_vel: +0.8708

Interpretation:
Neuron 0 in Layer 2 is primarily an “angle_vel > 0” detector. With positive angles denoting a rightward tilt, that corresponds to the pole swinging right, so the “Pole Swinging Left” label is at odds with the statistics. Its strong positive correlation with angle_vel (+0.8708) supports this reading. Its top states also show the cart on the left (cart_pos ≈ –1.7 to –2.0, correlation –0.5537), but that dependence is weaker. In the top-triggering states, $\dot\theta$ is around +0.6 to +0.7 and the activation climbs above 20, indicating a vigorous rightward pole swing.

Neuron 2: “Pole Tilted Right, Near‐Static”

Activation statistics:

Max = 12.3788 , Mean = 2.6520 , Std = 1.4613

Top‐triggering states (cart_pos, cart_vel, pole_angle, angle_vel):

[–1.433,+0.009,+0.209,+0.022] (activation 9.4760)
[-1.566,–0.328,+0.193,+0.037] (activation 9.2072)
[–1.249,+0.185,+0.196,–0.016] (activation 9.0402)
[–1.408,–0.173,+0.185,+0.008] (activation 8.9129)
[–1.088,+0.518,+0.207,–0.012] (activation 8.8388)

Pearson correlations:

cart_pos: –0.5421 | cart_vel: +0.2089 | pole_angle: +0.7822 | angle_vel: –0.1868

Interpretation:
Neuron 2 at Layer 2 activates when the pole is noticeably tilted right ($\theta \approx +0.20$) but with very small angular velocity ($\dot\theta \approx 0$). Its strong positive correlation with pole_angle (+0.7822) and slight negative correlation with angle_vel (–0.1868) show that it encodes “tilt without motion,” a subtler signal than “swing.” This neuron thus picks out states where the pole is precariously leaning right but not yet swinging, providing an early warning cue to the policy.

Discussion of Selected Features

“Cart Zipping Right” (Layer 1 Neuron 0)

By focusing on extremely large positive cart positions (x ≈ +2.3 to +2.4) and very high rightward velocity (cart_vel ≈ +3.0), this neuron isolates one of the most dangerous states for the policy: the cart is barreling off the right edge. Its high activations signal the policy to initiate a strong leftward correction. Without this channel, the agent might fail to correct early enough when the cart approaches the right boundary at x = +2.4.

“Pole Tilted Right on Left Side” (Layer 1 Neuron 2)

By combining a strong negative correlation with cart_pos (–0.8164) and a strong positive correlation with pole_angle (+0.8276), this neuron encodes the precise scenario “pole leaning right (θ>0) while the cart is far left.” In that configuration, the policy must push right urgently to prevent the pole from crossing +12° (failure boundary), yet also avoid sliding off the left edge. This feature is central to how the model balances the tradeoff between cart position and pole angle.

“Pole Swinging Left” (Layer 2 Neuron 0)

A strong indicator of $\dot\theta > 0$ (Pearson r ≈ +0.87), this neuron lets the policy’s deeper layer track the pole’s angular velocity largely independently of other factors. Note that the “Pole Swinging Left” label is at odds with the sign of this correlation: with positive angles denoting a rightward tilt, $\dot\theta > 0$ is a rightward swing. In that case the agent typically needs to push the cart right to get back under the pole. By the time this neuron surpasses ~15 in activation, the pole is rotating fast enough that the policy treats it as a high-priority cue.

“Pole Tilted Right, Near-Static” (Layer 2 Neuron 2)

Distinct from “swing,” this channel catches when the pole is tilted right ($\theta \approx 0.20$) but not yet swinging ($\dot\theta \approx 0$). It effectively warns the policy, “the pole is leaning perilously, but hasn’t begun to fall fast yet.” That subtle signal can help the agent decide a gentle push is sufficient, avoiding unnecessary oscillations that would occur if it treated every tilt as an imminent swing.

Granularity vs. Redundancy Across Latent Sizes

2-Dim Bottleneck:

Layer 1 yielded two broad features (“pole swing right with cart left” and “cart at right boundary”). Layer 2 yielded only “pole swing right.” While useful, these two channels are too coarse: they lump together distinct configurations (e.g., “pole tilt vs. swing” or “high vs. moderate cart_vel”).

4-Dim Bottleneck:

Layer 1 cleanly separates “cart zipping right,” “pole swing right on left,” “pole tilt right on left,” and “cart zipping left.” Layer 2 isolates “pole swing left,” “pole tilt right near-static,” and “pole swing right.” This compact set of features (up to four per layer) proved the most semantically rich and minimally redundant.

8-Dim Bottleneck:

While technically yielding eight active channels, only ~3–4 of them represent unique semantics per layer; the rest are minor variations or splits of the same underlying feature. For example, Layer 1’s “pole swing right on left” appears across multiple neurons with slight differences in thresholds, but no fundamentally new feature emerges beyond what 4 dims already captured.

Hence, the 4-dim transcoder represents a “sweet spot” that is rich enough to disentangle all major policy cues, but small enough to stay sparse and interpretable. In practice, we focus subsequent causal‐intervention experiments on these 4-dim channels, knowing that each corresponds to a clear, actionable feature in the CartPole policy’s reasoning.
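The redundancy judgment above was made by inspecting features. One possible quantitative proxy, which is an assumption of this sketch and not something the post reports using, is the cosine similarity between decoder columns: two latent dimensions that write nearly the same direction into the hidden layer are likely splits of one feature. The helper name `redundant_pairs` and the 0.9 threshold are hypothetical.

```python
import torch
import torch.nn.functional as F

def redundant_pairs(decoder_weight, threshold=0.9):
    """Find latent dimensions whose decoder columns point in nearly the same direction.

    decoder_weight: [output_dim, latent_dim], the layout of nn.Linear(latent_dim, output_dim).weight
    Returns a list of (i, j, cosine) with i < j and cosine > threshold.
    """
    cols = F.normalize(decoder_weight.detach().T, dim=1)  # [latent_dim, output_dim], unit rows
    sim = cols @ cols.T                                   # pairwise cosine similarities
    pairs = []
    k = sim.shape[0]
    for i in range(k):
        for j in range(i + 1, k):
            if sim[i, j].item() > threshold:
                pairs.append((i, j, sim[i, j].item()))
    return pairs

# Toy decoder: latent 1 is a rescaled copy of latent 0, latent 2 is orthogonal.
toy_weight = torch.tensor([[1.0, 2.0, 0.0],
                           [0.0, 0.0, 1.0]])
pairs = redundant_pairs(toy_weight)
```

Running this on an 8-dim decoder's `weight` would give a rough count of near-duplicate channels to compare against the qualitative estimate of three or four unique features per layer.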

Causal Interventions (4-Dim Transcoder)

Although we trained transcoders at latent_dims = {2, 4, 8}, we chose to perform causal interventions only on the 4-dim transcoder because it produced the cleanest, most interpretable feature set without unnecessary dormant channels. Below are the details of each intervention experiment.

1. Intervention Mechanism

To force a transcoder dimension $j$ at layer $L$ , we:

Compute the policy’s normal hidden activation $h^{(L)}\in\mathbb{R}^D$ .
Extract the decoder column $w_j \in\mathbb{R}^D$ from transcoder.decoder.weight[:, j].
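Add $\alpha\,w_j$ to it, which shifts the hidden activation as if $c_j$ had been increased by $\alpha$ (the decoder is linear, so this is a shift of the code rather than pinning it to a fixed value).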
Continue the forward pass: $\tilde h^{(L)} = h^{(L)} + \alpha\,w_j$ . Subsequent layers use $\tilde h^{(L)}$ as input.

For example, in PyTorch:

import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):  # class name is illustrative; layer sizes follow the Methodology section
    def __init__(self, obs_dim=4, hidden1=128, hidden2=64, n_actions=2):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, hidden1)
        self.fc2 = nn.Linear(hidden1, hidden2)
        self.fc3 = nn.Linear(hidden2, n_actions)

    def forward(self, x, intervene_on=None):
        """
        If intervene_on is provided, it must be a dict:
          {'layer': 'layer1' or 'layer2',
           'decoder_col': torch.Tensor of shape [hidden_dim],
           'scale': float}
        We add (scale * decoder_col) to the hidden activation of the specified layer.
        """
        # Layer 1 forward
        z1 = self.fc1(x)       # [batch_size, hidden1]
        h1 = F.relu(z1)        # policy's real layer1 output

        # Intervention on layer1? The shift is broadcast over the batch.
        if intervene_on is not None and intervene_on["layer"] == "layer1":
            boost_vec = intervene_on["decoder_col"].view(1, -1) * intervene_on["scale"]
            h1 = h1 + boost_vec

        # Layer 2 forward
        z2 = self.fc2(h1)      # [batch_size, hidden2]
        h2 = F.relu(z2)        # policy's real layer2 output

        # Intervention on layer2?
        if intervene_on is not None and intervene_on["layer"] == "layer2":
            boost_vec2 = intervene_on["decoder_col"].view(1, -1) * intervene_on["scale"]
            h2 = h2 + boost_vec2

        logits = self.fc3(h2)  # action logits
        return logits
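To make the mechanism concrete, the sketch below checks numerically that adding $\alpha\,w_j$ to a decoded activation is the same as decoding a code whose $j$-th entry was increased by $\alpha$. The decoder here is a randomly initialised stand-in with the Layer 2, 4-dim shapes, not a trained transcoder, and the values of `j` and `alpha` are arbitrary illustrations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, hidden_dim = 4, 64               # shapes of the layer-2, 4-dim setting
decoder = nn.Linear(latent_dim, hidden_dim)  # random stand-in for transcoder.decoder

j, alpha = 2, -1.0
w_j = decoder.weight[:, j].detach()          # decoder column for latent j, shape [hidden_dim]

code = torch.rand(1, latent_dim)             # some non-negative code
shifted = code.clone()
shifted[0, j] += alpha                       # increase latent j by alpha

with torch.no_grad():
    delta = decoder(shifted) - decoder(code) # change in the reconstruction

# The dict format expected by the forward pass's `intervene_on` argument.
intervene_on = {"layer": "layer2", "decoder_col": w_j, "scale": alpha}
```

The identity `delta == alpha * w_j` holds because the decoder is affine. It describes the effect in the transcoder's reconstruction space; the policy's true activation is shifted by the same vector, but downstream ReLUs can still respond nonlinearly to that shift.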

2. Experiment Settings

For each (layer, neuron) pair in the 4-dim transcoder, we run:

NUM_EPISODES = 50 baseline (no intervention) episodes.
NUM_EPISODES = 50 intervention episodes for each $\alpha \in \{1, -1, 2.5, -2.5\}$, with magnitudes chosen so that the shift reliably activates or suppresses the feature.
We record:

Total rewards $\{R_b\}$ and $\{R_i\}$ .
Failure modes: “left” (pole angle ≥ +0.209 rad), “right” (pole angle ≤ −0.209 rad), “cart_out” (|cart_pos| > 2.4), or “none” (reached 500 timesteps).
Sample trajectories: For the first episode in each condition, record $(θ_t)_{t=1}^T$ and $(x_t)_{t=1}^T$ to plot in a 2×1 figure.
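The failure-mode bookkeeping can be sketched as a small pure function. The labels and thresholds follow the definitions listed above exactly as stated; the function names, the order in which conditions are checked, and the `"unknown"` fallback are assumptions of this sketch.

```python
def classify_failure(final_state, steps, max_steps=500,
                     angle_limit=0.209, pos_limit=2.4):
    """Label how an episode ended, using the definitions given in the text.

    final_state: (cart_pos, cart_vel, pole_angle, angle_vel) at the last timestep
    steps:       number of timesteps the episode lasted
    """
    cart_pos, _, pole_angle, _ = final_state
    if pole_angle >= angle_limit:
        return "left"
    if pole_angle <= -angle_limit:
        return "right"
    if abs(cart_pos) > pos_limit:
        return "cart_out"
    if steps >= max_steps:
        return "none"
    return "unknown"  # assumed fallback; not a category in the post

def summarize(returns, labels):
    """Average return plus a count of each failure label over a batch of episodes."""
    counts = {k: labels.count(k) for k in ("left", "right", "cart_out", "none")}
    return {"avg_return": sum(returns) / len(returns), **counts}
```

Since the result tables only list the left, right, and none counts, the number of `"cart_out"` episodes in a row can be recovered as 50 minus the other three.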

3. Experiment Results

Below is an interpretation of what happens when we force each transcoder neuron on or off (i.e. use ±1 or ±2.5) in light of “what that neuron actually represents.” In each case, “forcing” a positive α makes the policy act as if that feature is strongly present, while a negative α makes it act as if the feature is strongly absent (or even reversed). We compare everything back to the baseline (avg = 261.3).

Layer 1, Neuron 0 (“Cart Zipping Right”):

Recall that this neuron’s positive code spikes whenever the cart is already far to the right (x ≈ +2.3 → +2.4) and moving quickly right (v ≈ +3.0). In normal operation, whenever Neuron 0 > ≈2 it tells the policy: “The cart is about to slide off the right edge, correct left now.”

scale	baseline_avg	int_avg	base_fail_left	int_fail_left	base_fail_right	int_fail_right	base_fail_none	int_fail_none
1	261.3	148.92	0	0	36	0	7	0
2.5	261.3	144.36	0	0	36	0	7	0
-1	261.3	29.38	0	0	36	50	7	0
-2.5	261.3	11.26	0	0	36	50	7	0

Positive α (= +1, +2.5): Forcing “Cart Zipping Right”

By adding a positive decoder column, we trick the network into believing “the cart is already barreling right,” even if it isn’t.
In response, the policy continually over‐corrects left, trying to keep the cart “away from the right edge.” As a result:

avg reward drops from 261 → 148 (α=1) or 144 (α=2.5).
No episode ever reaches 500 timesteps, because the policy oscillates or mis-balances. int_fail_left=0, int_fail_right=0, and int_fail_none=0 together indicate that no episode fails by pole angle or survives to the maximum, so every episode ends by “cart_out”; given the leftward over-correction, the cart presumably goes off the left edge.

In short, if you force “cart is zipping right” when it’s not, the policy constantly over‐steers left, causing it to crash quickly.

Negative α (= −1, −2.5): Forcing “Anti–Cart Zipping Right” (i.e. “Cart Zipping Left”)

A negative α effectively flips the sign of that feature so the agent “thinks” the cart is zipping left. Then it (wrongly) pushes right, even when the cart is centered.
The result is catastrophic:

avg reward plummets to 29 (α=−1) or 11 (α=−2.5).
int_fail_right=50 shows that in every one of the 50 episodes, the pole-angle failure occurs on the right side. (No episode ever ends as “none” or “left” because the policy is forced to push right immediately.)

In other words, faking “cart zipping left” makes the policy push right continuously, and the pole falls right after roughly 10 to 30 steps on average (all 50 episodes are pole-angle failures, none are “cart_out”).

Layer 1, Neuron 2 (“Pole Tilted Right on Left Side”)

Neuron 2 fires when (cart ≈ –1.6 to –2.0) AND (pole angle ≈ +0.20) i.e. “the cart is far left but the pole is already tilting right.” In that scenario, the correct action is usually to push right (to catch the pole) but also be careful not to slide off left.

scale	baseline_avg	int_avg	base_fail_left	int_fail_left	base_fail_right	int_fail_right	base_fail_none	int_fail_none
1	261.3	14.76	0	0	36	50	7	0
2.5	261.3	9.38	0	0	36	50	7	0
-1	261.3	148.34	0	0	36	0	7	0
-2.5	261.3	186.76	0	0	36	0	7	2

Positive α (= +1, +2.5): Forcing “Pole Tilted Right on Left Side”

We cheat the policy into believing “the pole is already leaning too far right while on the left.” The correct reaction in that situation is a strong right‐push to keep the pole from falling.
As a result:

avg reward crashes to 14.8 (α=+1) or 9.4 (α=+2.5).
int_fail_right=50 (“pole falls right” in all 50 episodes), because the agent constantly tries to push right (thinking it must counter that tilt) and ends up tipping the pole over; no episode ends by “cart_out.”

In short, faking that combination (cart left + pole right) drives the agent to push right nonstop, which quickly leads to failure on the right side.

Negative α (= −1, −2.5): Forcing “Anti–(Pole Tilted Right on Left Side)” (i.e. “Pole Tilted Left on Left Side” or “No Tilt”)

A negative α makes the policy think “the pole is tilted left” or at least “not tilted right while on the left edge.” In practice, that reduces how often it over‐corrects right.
Effects:

α=−1 → int_avg ≈ 148.3 (still worse than baseline, but it doesn’t fail catastrophically). int_fail_right=0, so it no longer crashes “pole right.” Since int_fail_left and int_fail_none are also 0, all 50 episodes end by “cart_out,” presumably because it’s under-reacting.
α=−2.5 → int_avg ≈ 186.8, with int_fail_none=2 (two episodes actually survive to 500 timesteps). It improves compared to α=−1, because now it’s even less triggered by “pole-right-on-left.”

Thus, pushing Neuron 2 negative partially “calms down” the policy’s over-reaction to the left-edge tilt. Performance is still below baseline (186.8 vs. 261), but far less damaged than under positive α.

Layer 2, Neuron 0 (“Pole Swinging Left”)

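Neuron 0 at Layer 2 fires whenever the pole’s angular velocity is strongly positive (θ̇ > 0). With positive angles denoting a rightward tilt, as in the feature descriptions above, this is a rightward swing, so the “Pole Swinging Left” label and the left/right wording in this subsection should be read with that caveat. When the neuron spikes, the policy says: “Stop it, you must push right next.”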

scale	baseline_avg	int_avg	base_fail_left	int_fail_left	base_fail_right	int_fail_right	base_fail_none	int_fail_none
1	261.3	178.44	0	0	36	43	7	0
2.5	261.3	132.26	0	4	36	0	7	0
-1	261.3	201.58	0	0	36	0	7	0
-2.5	261.3	51.7	0	0	36	50	7	0

Positive α (= +1): Forcing “Pole Swinging Left” Weakly

The agent is tricked into believing “the pole is swinging left” more often than it really is, so it pushes right prematurely.
Consequences:

int_avg drops to 178.4.
int_fail_right=43: out of 50 episodes, 43 end by the pole falling to the right (because the agent often pushes right too soon, letting the pole swing the other way).

It is less catastrophic than some Layer 1 cases, because the policy still has a chance to recover, but overall performance suffers.

Positive α (= +2.5): Forcing “Pole Swinging Left” Strongly

Now the policy is even more certain “the pole is swinging left,” so it pushes right almost constantly.
Effects:

int_avg plummets to 132.3.
int_fail_left=4: in a few episodes the pole actually ends up falling left because by pushing right nonstop you overshoot and let it flip left.

In other words, over‐triggering “swing left” floods the policy with a right‐push bias, making it oscillate badly.

Negative α (= −1): Forcing “Anti–(Pole Swinging Left)” (i.e. “Pole Swinging Right” or “No Swing Left”)

Now the agent underestimates whenever the pole is swinging left; it may wait longer before pushing right.
Result:

int_avg = 201.6, a modest drop from 261.
int_fail_right=0: it doesn’t crash with the pole falling right because it’s not over-compensating to the right. With int_fail_left=0 and int_fail_none=0 as well, all 50 episodes end by “cart_out.”

So pushing this neuron negative “dampens” how aggressively the agent counters a left swing, yielding somewhat better performance than α=+1 or +2.5, though still worse than baseline.

Negative α (= −2.5): Forcing “Strong Anti–(Pole Swinging Left)” (i.e. “Pole Swinging Right”)

The policy is now convinced “the pole is swinging right” almost all the time, so it pushes left prematurely.
Results:

int_avg crashes to 51.7, because it basically pushes left constantly.
int_fail_right=50: all episodes end by the pole falling right, ironically, since the pole eventually swings back faster than the policy reacts.

Concretely, flipping “swing left” too far in the negative direction makes the policy push left nonstop, which is just as disastrous.

Layer 2, Neuron 2 (“Pole Tilted Right, Near‐Static”)

Neuron 2 fires when θ ≈ +0.20 but θ̇ ≈ 0—“pole is leaning right but not yet swinging.” In practice, that is an early warning signal: the pole is about to tip right, so a small left push is needed before it begins to fall hard.

scale	baseline_avg	int_avg	base_fail_left	int_fail_left	base_fail_right	int_fail_right	base_fail_none	int_fail_none
1	261.3	245.32	0	0	36	40	7	3
2.5	261.3	215.3	0	0	36	39	7	3
-1	261.3	322.74	0	0	36	19	7	10
-2.5	261.3	230.86	0	0	36	0	7	0

Positive α (= +1, +2.5): Forcing “Pole Tilted Right, Near‐Static”

We trick the policy into always “seeing a slight right‐lean” (θ ≈ +0.20) even when the pole is more vertical. In principle, the policy’s correct reaction is a gentle left push.
Outcomes:

α=+1 → int_avg = 245.3 (just a mild drop). int_fail_right = 40, int_fail_none = 3 → a few episodes even survive to 500 timesteps.
α=+2.5 → int_avg = 215.3 (worse), int_fail_right = 39, int_fail_none = 3.

Interpretation: “Constantly believing the pole is leaning right but not yet swinging” causes the agent to gently push left most of the time. That sometimes helps the pole stay centered (hence a handful of “none” runs), but often it under‐reacts when the pole actually does swing. Hence performance falls modestly.

Negative α (= −1): Forcing “Anti–(Pole Tilted Right)” (i.e. “Pole Tilted Left”)

Now the agent thinks “the pole is leaning left” even when it’s upright. The correct reaction would be to push right.
Result:

int_avg jumps to 322.7 (well above baseline).
int_fail_right = 19, int_fail_none = 10: more episodes survive the full 500 timesteps.

Why does this improve so much? By faking a slight left‐lean, the agent pushes right preemptively. This often pre‐counters minor right‐leans without overdoing it. So it ends up staying balanced far longer. In essence, “pushing slightly right all the time” is a decent heuristic for CartPole, and forcing this neuron negatively guides the policy to behave like that.

Negative α (= −2.5): Forcing “Strong Anti–(Pole Tilted Right)”

The policy is convinced “the pole is really leaning left” all the time, so it pushes right even more vigorously.
Effect:

int_avg = 230.9, a clear drop versus the α=−1 case (322.7) and somewhat worse than α=+1 (245.3).
int_fail_right = 0, int_fail_none = 0: with int_fail_left = 0 as well, every episode now ends by “cart_out,” because the agent is pushing right so strongly that the cart eventually slides off the track.

In other words, over-pushing right (by forcing this too negative) eventually backfires, although it still outperforms the strong “pole leaning right” intervention at α=+2.5 (215.3).

Discussion

1. Why Multiple Latent Dimensions Matter

2-Dim vs. 4-Dim vs. 8-Dim Transcoders

The 2-dim transcoder at each layer captures only the two most dominant features (e.g., “pole-left swing” and “cart-left & pole-right”). While these explain major behavior, they lack granularity.
The 4-dim transcoder further disentangles those two features into finer sub‐categories (e.g., separates “cart-left & upright” vs. “cart-left & pole-right”). This extra granularity enables more targeted interventions (e.g., forcing a “cart-left & pole-right” feature that leads to distinct failure modes).
The 8-dim transcoder learns even more latent channels, but many remain dormant or become polysemantic. Only three or four effectively capture distinct semantics. Adding more dims beyond 4 yields diminishing interpretability returns but increases model complexity.

Novelty & Impact

Training multiple transcoders at different latent sizes and comparing their features is, to our knowledge, a novel contribution in RL interpretability. It shows that there is a “sweet spot” (in our case, 4 dims) where features are both separable and sparse.

2. Performance Improvement via Targeted Intervention on Layer 2, Neuron 2

Key Finding:

During our causal‐intervention experiments, we discovered that gently pushing Layer 2, Neuron 2 (the “Pole Tilted Right, Near‐Static” channel) into its negative direction (α = −1) actually improved overall CartPole performance, raising average returns from 261 → ≈ 323. In other words, by subtly biasing the policy to believe “the pole is leaning left,” we induced a small, persistent rightward corrective pressure that preempted many future failures.

Generalization to More Complex Systems:

Identifying the “right nudge” in larger networks.

In high‐dimensional or deep RL policies, a handful of latent neurons may each correspond to recognizable sub‐behaviors (e.g., “cart drifting,” “pole rotating fast,” “agent approaching obstacle”). Once such features are isolated via sparse transcoders, researchers can scan for the single dimension whose moderate intervention raises average returns or reduces common failure modes (e.g., “slipping,” “stalling,” “oscillation”).

Building a corrective controller on top of a policy.

Instead of hard‐coding corrective heuristics (like “always push right when θ > 0.1”), one can automatically discover a small linear combination of latent channels whose enforced presence or absence improves robustness. This could become a lightweight “safety‐wrapper”: continuously monitor the transcoder's latent codes, and whenever a harmful circuit (e.g., “agent is about to collide”) approaches activation, apply a small boost along the opposite feature.
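
One way such a wrapper might look is sketched below. Everything here is hypothetical: the function `safety_wrap`, the ReLU encoder form, the threshold, and the choice of a single "counter" channel are assumptions made for illustration, and the tiny hand-written matrices exist only so the behavior is easy to follow.

```python
import numpy as np

def safety_wrap(hidden, W_enc, b_enc, W_dec, harm_ch, counter_ch,
                threshold=0.5, boost=1.0):
    """Monitor one latent channel and nudge the activation if it fires.

    Assumes a ReLU encoder: codes = relu(W_enc @ hidden + b_enc).
    Returns the (possibly nudged) hidden vector and the latent codes.
    """
    codes = np.maximum(0.0, W_enc @ hidden + b_enc)
    if codes[harm_ch] > threshold:
        # Harmful channel is active: add the counter-feature's decoder column.
        hidden = hidden + boost * W_dec[:, counter_ch]
    return hidden, codes

# Toy 3-dim hidden layer with a 2-dim latent code (hand-written, not trained).
W_enc = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
b_enc = np.zeros(2)
W_dec = np.array([[0.0, 1.0],
                  [0.0, 0.0],
                  [1.0, 0.0]])

risky, _ = safety_wrap(np.array([1.0, 0.0, 0.0]), W_enc, b_enc, W_dec,
                       harm_ch=0, counter_ch=1)
calm, _ = safety_wrap(np.array([0.2, 0.0, 0.0]), W_enc, b_enc, W_dec,
                      harm_ch=0, counter_ch=1)
```

Only the first call crosses the threshold, so only `risky` is modified; `calm` passes through unchanged.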

Extending to multi‐agent or hierarchical settings.

In hierarchical RL (e.g., options‐based architectures), each option’s network can be transcoded separately. By intervening on the “pre‐failure” units in a high‐level sub‐policy, one could stabilize its outputs for a longer time horizon, thereby producing smoother option transfers or fewer oscillatory behaviors.

Practical Implications for Future Research:

Automated discovery of “performance‐boosting” directions.

We manually inspected Layer 2, Neuron 2 and tested a small negative α to find an improvement. In larger systems, it would be valuable to automatically scan each transcoder channel by testing small perturbations (α = ±1) over a set of validation episodes. The channel whose negative (or positive) direction consistently raises mean return becomes a candidate for a permanent “right‐nudge” controller.
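
The scan described above could be organized roughly as follows. The callback `evaluate(channel, alpha)` is a hypothetical interface that would run validation episodes with the given intervention and return the mean return; `toy_evaluate` is a purely synthetic stand-in with made-up numbers, included only so the sketch runs.

```python
def scan_channels(evaluate, n_channels, alphas=(-1.0, 1.0)):
    """Try every (channel, alpha) pair and rank them by gain over baseline.

    evaluate(channel, alpha) -> mean return over validation episodes.
    evaluate(None, 0.0) must return the un-intervened baseline.
    """
    baseline = evaluate(None, 0.0)
    results = []
    for ch in range(n_channels):
        for a in alphas:
            results.append((evaluate(ch, a) - baseline, ch, a))
    results.sort(reverse=True)  # largest gain first
    return baseline, results

def toy_evaluate(channel, alpha):
    # Synthetic numbers only: channel 2 helps when pushed negative.
    if channel is None:
        return 100.0
    if channel == 2:
        return 100.0 - 10.0 * alpha
    return 100.0 - 5.0 * abs(alpha)

baseline, ranked = scan_channels(toy_evaluate, n_channels=4)
best_gain, best_channel, best_alpha = ranked[0]
```

In practice each `evaluate` call should average over enough episodes (and seeds) that a gain is distinguishable from noise before a channel is promoted to a permanent nudge.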

Adaptive scaling based on state‐context.

Rather than always using α = −1, one could learn a lightweight gating function that computes a context‐dependent α(s) based on the raw state or low‐dim features. For example, if the pole is at θ ≈ 0.05, apply α = −0.5; if θ ≈ 0.15, apply α = −1.0; if θ ≈ −0.05, apply no intervention. This would act as a continuous safety‐critic layer that fine‐tunes corrective strength.
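
To make the example concrete, here is a hand-specified (not learned) gate that linearly interpolates between the values quoted above. The anchor at θ = 0 with α = 0 and the clamping outside the listed range are our assumptions; a learned gate would replace this lookup with a small trained model.

```python
import numpy as np

# (theta, alpha) anchor points; the last two come from the example in the text,
# the first (theta = 0 -> alpha = 0) is an assumed starting point.
THETA_PTS = [0.0, 0.05, 0.15]
ALPHA_PTS = [0.0, -0.5, -1.0]

def alpha_gate(theta):
    """Context-dependent intervention strength alpha(s) from pole angle theta.

    np.interp clamps outside the anchor range, so theta < 0 gives 0.0
    (no intervention) and theta > 0.15 gives -1.0.
    """
    return float(np.interp(theta, THETA_PTS, ALPHA_PTS))

strengths = [alpha_gate(t) for t in (-0.05, 0.05, 0.10, 0.15)]
```

The resulting α would then be passed to the same decoder-column intervention used for the fixed α = −1 case.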

Incorporating into model‐based or meta‐RL pipelines.

In model‐based RL, one could simulate rollouts while artificially injecting “best‐nudge” codes at planning time, then select actions that maximize expected downstream value even after that nudge. In meta‐RL, the transcoder itself could adapt its latent codes through experience, learning which channels to suppress or amplify to handle rare failure states.

3. Practical Recommendations for Further Experiments

Choose the Right Latent Dimensionality

Start small (e.g., 2 dims) to capture the most salient features. If features remain too coarse, increase to 4 dims. If multiple latent dims remain dormant (max activation ≈ 0), you can reduce dims again.
In our experiments, 4 dims at each layer provided a robust tradeoff: distinct, sparse features that were all meaningfully active.
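
A quick way to check for dormant dims is sketched below, assuming the latent codes for a dataset of states are stacked into an array of shape (n_samples, latent_dim). The helper name `dormant_dims` and the tolerance are illustrative choices.

```python
import numpy as np

def dormant_dims(codes, tol=1e-6):
    """Indices of latent dims whose max |activation| over the dataset is <= tol."""
    max_abs = np.max(np.abs(codes), axis=0)
    return np.flatnonzero(max_abs <= tol).tolist()

# Toy codes: 2 samples, 4 latent dims; dims 0 and 2 never activate.
codes = np.array([[0.0, 1.2, 0.0, 0.3],
                  [0.0, 0.0, 0.0, 0.7]])
print(dormant_dims(codes))  # [0, 2]
```

If several dims come back dormant, that is the signal to drop to a smaller latent size.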

Tune the L1 Sparsity Penalty

We used $\lambda = 10^{-5}$. In our experiments, a larger $\lambda$ of $10^{-4}$ forced almost all dims to zero, leaving only one or two active features. A smaller $\lambda$ (e.g., $10^{-6}$) makes the codes dense, which reduces interpretability.
Visualize histograms of latent activations to choose $\lambda$ that yields a handful of nonzero channels.
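
For reference, a sparsity-penalized objective of this kind can be written as below. This is a sketch under assumptions: the exact reduction (mean vs. sum) and the choice of reconstruction target may differ from our notebook, and `activation_frequency` is a hypothetical helper that summarizes what the histograms show, namely how often each channel is nonzero.

```python
import numpy as np

def transcoder_loss(pred, target, codes, lam=1e-5):
    """Mean squared reconstruction error plus an L1 penalty on the latent codes."""
    mse = np.mean((pred - target) ** 2)
    l1 = np.mean(np.sum(np.abs(codes), axis=1))  # per-sample L1 norm, averaged
    return mse + lam * l1

def activation_frequency(codes, tol=1e-6):
    """Fraction of samples on which each latent channel is nonzero."""
    return np.mean(np.abs(codes) > tol, axis=0)

# Toy batch: perfect reconstruction, so only the sparsity term contributes.
codes = np.array([[0.0, 1.2, 0.0, 0.3],
                  [0.0, 0.0, 0.0, 0.7]])
target = np.zeros((2, 3))
loss = transcoder_loss(target, target, codes, lam=1e-5)
freq = activation_frequency(codes)
```

A useful target when tuning $\lambda$ is a frequency vector in which a handful of channels fire on a meaningful fraction of states and none fire on all of them.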

Perform More Causal Interventions

Always verify features causally by forcing them on/off and observing policy failures or successes. Correlation alone can be misleading: two features might correlate but not be causal.
For each feature $j$, sweep $\alpha \in \{-2, -1, -0.5, 0.5, 1, 2\}$ and plot “Avg Reward vs. $\alpha$” to find the “activation threshold” at which policy behavior sharply changes.
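
The sweep itself is simple to script. In the sketch below, `evaluate(channel, alpha)` is again a hypothetical callback returning average reward under the intervention, and `toy_evaluate` uses synthetic numbers only; `sharpest_change` is one possible way to locate the threshold, by finding the adjacent pair of α values with the largest reward change.

```python
ALPHAS = (-2, -1, -0.5, 0.5, 1, 2)

def sweep_alpha(evaluate, channel, alphas=ALPHAS):
    """Return [(alpha, avg_reward), ...] for one latent channel."""
    return [(a, evaluate(channel, a)) for a in alphas]

def sharpest_change(curve):
    """Return (|delta reward|, alpha_lo, alpha_hi) for the largest adjacent jump."""
    jumps = [(abs(r1 - r0), a0, a1)
             for (a0, r0), (a1, r1) in zip(curve, curve[1:])]
    return max(jumps)

def toy_evaluate(channel, alpha):
    # Synthetic stand-in: performance collapses once alpha drops below -1.5.
    return 200.0 if alpha > -1.5 else 50.0

curve = sweep_alpha(toy_evaluate, channel=2)
jump = sharpest_change(curve)
```

The `curve` list can be passed directly to a plotting library to produce the “Avg Reward vs. $\alpha$” figure.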

4. Limitations & Future Work

Single Environment

CartPole is a classical toy problem. We plan to extend this pipeline to other environments such as LunarLander, and eventually continuous-action tasks (e.g., Pendulum, MuJoCo, even robotic systems).

More Complex Architectures

Modern RL often uses convolutional networks (e.g., Atari). We must test whether transcoders on convolutional hidden layers yield easily interpretable features (e.g., “ball moving left” in Pong).

Quantitative Interpretability Metrics

Develop a metric such as “fraction of variance explained by top-K features” or “linearity” (how well a linear threshold on one latent dim classifies a semantic event), so we can automatically compare transcoder performance across latent dims.
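
Both candidate metrics can be prototyped in a few lines. The definitions below are one possible reading, not an established standard: “top-K” is taken to mean the K channels with the largest mean absolute activation, and the threshold metric tries every observed value as a cut point in either polarity. All function and variable names are illustrative.

```python
import numpy as np

def fve_top_k(codes, W_dec, target, k, bias=0.0):
    """Fraction of variance in `target` explained using only the k most active channels.

    codes : (n, latent_dim), W_dec : (hidden_dim, latent_dim), target : (n, hidden_dim)
    """
    top = np.argsort(np.mean(np.abs(codes), axis=0))[-k:]
    recon = codes[:, top] @ W_dec[:, top].T + bias
    resid = np.sum((target - recon) ** 2)
    total = np.sum((target - target.mean(axis=0)) ** 2)
    return float(1.0 - resid / total)

def threshold_accuracy(latent, labels):
    """Best accuracy of a single threshold on one latent dim (either polarity)."""
    latent = np.asarray(latent, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best = 0.0
    for t in np.unique(latent):
        acc = float(np.mean((latent >= t) == labels))
        best = max(best, acc, 1.0 - acc)
    return best

rng = np.random.default_rng(0)
codes = np.abs(rng.normal(size=(50, 4)))
W_dec = rng.normal(size=(8, 4))
target = codes @ W_dec.T                     # exactly linear in the codes
fve_all = fve_top_k(codes, W_dec, target, k=4)
acc = threshold_accuracy([0.0, 0.1, 0.9, 1.0], [0, 0, 1, 1])
```

With all four channels the toy target is reconstructed exactly, and the toy latent separates the two label groups perfectly; real transcoders would score lower on both, which is what makes the numbers comparable across latent sizes.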

Conclusion

In this research post, we presented a proof-of-concept pipeline for interpreting RL policies via sparse transcoders and causal interventions:

Train a CartPole MLP policy (4 → 128 → 64 → 2) until it reliably solves the task.
Collect hidden‐layer activations (128-dim and 64-dim) and raw states (4-dim).
Train 2, 4, and 8-dim transcoders on each hidden layer, imposing an L1 penalty to encourage sparsity.
Inspect the latent codes and label each dimension in human‐readable terms (e.g., “pole-left swing,” “cart-left & pole-right”).
Causally intervene on the 4-dim transcoder (the most interpretable) by adding decoder columns to the policy’s hidden activations, demonstrating that each feature genuinely drives policy failures (pole falls or cart‐outs).

By providing clear, reproducible code and a concise methodology, we hope this work serves as a blueprint for interpretable RL. Whether you are an RL novice or a seasoned mechanistic-interpretability researcher, this pipeline can be adapted to more complex tasks, deeper networks, or real-world robots. We encourage readers to clone our Colab notebook, experiment with their own environments, and push the limits of what a 2-, 4-, or 8-dim transcoder can reveal about an RL agent’s inner workings.

References

Open-sourcing circuit tracing tools (Anthropic, May 29, 2025).

https://www.anthropic.com/research/open-source-circuit-tracing

Open Source Replication & Commentary on Anthropic’s Dictionary Learning (Neel Nanda, Oct 23, 2023).

https://www.lesswrong.com/posts/fKuugaxt2XLTkASkk/open-source-replication-and-commentary-on-anthropic-s

Transcoder-circuits: reverse-engineering LLM circuits with transcoders (Jacob Dunefsky, GitHub).

https://github.com/jacobdunefsky/transcoder_circuits

Transcoders Find Interpretable LLM Feature Circuits (ArXiv 2406.11944v1, Jun 2024).

https://arxiv.org/abs/2406.11944

Mechanistic Interpretability in Transformers (Ayyuce Kizrak’s repo).

https://github.com/ayyucekizrak/Mechanistic-Interpretability

Transcoders Enable Fine-Grained Interpretable Circuit Analysis (Alignment Forum, Jul 2023).

https://www.alignmentforum.org/posts/YmkjnWtZGLbHRbzrP/transcoders-enable-fine-grained-interpretable-circuit

Transcoders Beat Sparse Autoencoders for Interpretability (ArXiv 2501.18823, Jan 31, 2025).

https://arxiv.org/abs/2501.18823