
Deep Reinforcement Learning: The Actor-Critic Method


Remember that frustrating hovering drone from the last post? The one that learned to descend toward the platform, pass through it, and then just… hang out below it forever? Yeah, me too. I spent an entire afternoon watching it hover there, accumulating negative rewards like a slow-motion crash, and I couldn’t even be mad because technically it was doing exactly what I told it to do.

The fundamental problem was that my reward function could only see the current state, not the trajectory. When I rewarded it for being close to the platform, it couldn’t tell the difference between a drone making progress toward landing and a drone that had already passed through the platform and was now exploiting the reward structure from below. The reward function r(s') just looked at where the drone was, not how it got there or where it was going. (This will become a recurring theme, by the way. Reward engineering haunts me in my sleep at this point.)

But here’s where things get interesting. While I was staring at my drone hovering below the platform for what felt like the hundredth time, I kept thinking: why am I waiting for the entire episode to finish before learning anything? REINFORCE made me collect a full trajectory, watch the drone crash (or occasionally land), compute all the returns, and then update the policy. What if we could just… learn after every single step? Like, get immediate feedback as the drone flies? Wouldn’t that be way more efficient?

That’s Actor-Critic. And spoiler alert: it works way better than I expected. Well, after I fixed three major bugs, rewrote my reward function twice, spent two days thinking PyTorch was broken (it wasn’t, I was just using it wrong), and finally understood why my discount factor was making terminal rewards completely invisible. But we’ll get to all of that.

In this post, I’m going to walk you through my entire journey implementing Actor-Critic methods for the drone landing task. You’ll see the successes, the frustrating failures, and the debugging marathons. Here’s what we’re covering:

Basic Actor-Critic with TD error, which got me to 68% success rate and converged twice as fast as REINFORCE. This part worked surprisingly well once I fixed the moving target bug (more on that nightmare later).

My attempt at Generalized Advantage Estimation (GAE), which completely failed. I spent three entire days debugging why my critic values were exploding to thousands, tried every fix I could think of, and eventually just… gave up and moved on. Sometimes you need to know when to pivot. (I’m still a bit salty about this one, honestly.)

Proximal Policy Optimization (PPO), which finally gave me stable, robust performance and taught me why the entire RL industry just uses this by default. Turns out when OpenAI says “this is the thing,” they’re probably right.

But more importantly, you’ll learn about the three critical bugs that nearly derailed everything. These aren’t small “oops, typo” bugs. These are “stare at training curves for six hours, wondering if you fundamentally misunderstand neural networks” bugs:

  1. The moving target problem that made my critic loss oscillate forever because I didn’t detach the TD target (this one made me question my entire understanding of backpropagation)
  2. The gamma value was too low, which made the landing reward worth roughly 0.00007 after discounting, so my agent just learned to crash immediately because why bother trying? (I printed the actual discounted values and laughed, then cried)
  3. The reward exploits where my drone learned to zoom past the platform at maximum speed, collect distance rewards on the way, and crash far away because that was somehow better than landing. This taught me that 90% of RL really is reward engineering, and the other 90% is debugging why your reward engineering didn’t work. (Yes, I know that’s 180%. That’s how much work it is.)

Let’s dive in. Grab some coffee, you’re going to need it. All the code can be found in my GitHub repository.

What is Actor-Critic?

REINFORCE had one fundamental problem: we had to wait. Wait for the drone to crash. Wait for the episode to end. Wait to compute the full return. Then, and only then, could we update the policy. One learning signal per episode. For a 150-step trajectory, that’s one update after watching 150 actions play out.

I ran REINFORCE for 1200 iterations (6 hours on my machine) to hit 55% success rate. And the whole time I kept thinking: this feels wasteful. Why can’t I learn during the episode?

Actor-Critic fixes this with a simple idea: train a second neural network (the “critic”) to estimate future returns for any state. Then use those estimates to update the policy after every single step. No more waiting for episodes to finish. Just continuous learning as the drone flies.

The result? 68% success rate in 600 iterations (3 hours). Half the time. Better performance. Same hardware.

How it works: Two networks collaborate in real-time.

The Actor (π(a|s)): Same policy network from REINFORCE. Takes the current state and outputs action probabilities. This is the network that actually controls the drone.

The Critic (V(s)): New network. Takes the current state and estimates “how good is this state?” It outputs a single value representing expected future rewards. Not tied to any specific action, just evaluates states.

Here’s the clever part: the critic provides immediate feedback. The actor takes an action, the environment updates, and the critic immediately evaluates whether that moved us to a better or worse state. The actor learns from this signal and adjusts. The critic simultaneously learns to make better predictions. Both networks improve together as episodes unfold.


In code, they look like this:

import torch
import torch.nn as nn


class DroneGamerBoi(nn.Module):
    """The Actor: outputs action probabilities"""
    def __init__(self, state_dim=15):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128), nn.LayerNorm(128), nn.ReLU(),
            nn.Linear(128, 128), nn.LayerNorm(128), nn.ReLU(),
            nn.Linear(128, 64), nn.LayerNorm(64), nn.ReLU(),
            nn.Linear(64, 3),  # Three independent thrusters
            nn.Sigmoid()
        )

    def forward(self, state):
        return self.network(state)  # Output: probabilities for each thruster


class DroneTeacherBoi(nn.Module):
    """The Critic: outputs state value estimate"""
    def __init__(self, state_dim=15):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128), nn.LayerNorm(128), nn.ReLU(),
            nn.Linear(128, 128), nn.LayerNorm(128), nn.ReLU(),
            nn.Linear(128, 64), nn.LayerNorm(64), nn.ReLU(),
            nn.Linear(64, 1)  # Single value: V(s)
        )

    def forward(self, state):
        return self.network(state)  # Output: scalar value estimate

Notice the critic network is almost identical to the actor, except the final layer outputs a single value (how good is this state?) instead of action probabilities.
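To make the shapes concrete, here’s a quick sanity check you can run (it assumes the two classes above, the 15-dimensional state, and a batch of 6 parallel games, matching the setup described later):

actor = DroneGamerBoi(state_dim=15)
critic = DroneTeacherBoi(state_dim=15)

states = torch.randn(6, 15)       # 6 parallel games, 15 state features each
print(actor(states).shape)        # torch.Size([6, 3]) -> one probability per thruster
print(critic(states).shape)       # torch.Size([6, 1]) -> one value estimate per state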

The Bootstrapping Trick

Okay, here’s where it gets clever. In REINFORCE, we needed the full return to update the policy:

\[ G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-t} r_T \]

We had to wait until the episode ended to know all the rewards. But what if… we didn’t? What if we just estimated the future using our critic network?

Instead of computing the actual return, we estimate it:

\[ G_t \approx r_t + \gamma V(s_{t+1}) \]

This is called bootstrapping. The critic “bootstraps” its own value estimate to approximate the full return. We use its prediction of “how good will the next state be?” to estimate the return right now.
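To see the difference with some made-up numbers: suppose a short 4-step episode gives rewards of -1, -1, -1, +500 with gamma = 0.99. REINFORCE waits for all four rewards and discounts them back; Actor-Critic takes the first reward and lets the critic guess the rest:

gamma = 0.99
rewards = [-1.0, -1.0, -1.0, 500.0]                     # hypothetical 4-step episode

# REINFORCE-style: wait for the whole episode, then discount everything back to t=0
G0 = sum(gamma**k * r for k, r in enumerate(rewards))   # ≈ 482.2

# Actor-Critic-style: after one step, bootstrap the rest with the critic's estimate
V_next = 480.0                                          # critic's (possibly wrong) guess for s_1
G0_bootstrap = rewards[0] + gamma * V_next              # ≈ 474.2

If the critic’s guess is decent, the two numbers are close, and the bootstrapped one is available after a single step.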

(Image: an illustration of bootstrapping vs. TD learning.)

Why does this help?

Lower variance. We’re not waiting for the actual random sequence of future rewards. We’re using an estimate based on what we’ve learned about states in general. That estimate is biased (the critic might be wrong!), but it’s far less noisy than any single episode outcome.

Online learning. We can update immediately at every step. No need to finish the episode first. As soon as the drone takes one action, we know the immediate reward, and we can estimate what comes next, so we can learn.

Better sample efficiency. In REINFORCE with 6 parallel games, each drone learns once per episode completion. In Actor-Critic with 6 parallel games, each drone learns at every step (about 150 steps per episode). That’s 150x more learning signals per episode!

Of course, there’s a trade-off: we introduce bias. If our critic is wrong (and it will be, especially early in training), our agent learns from incorrect estimates. But the critic doesn’t need to be perfect. It just needs to be less noisy than a single episode outcome. As the critic gradually improves, the actor learns from better feedback. They bootstrap each other upward. In practice, the variance reduction is so powerful that it’s worth accepting the small bias.

TD Error: The New Advantage

Now we need to answer: how much better or worse was this action than expected?

In REINFORCE, we had the advantage: actual return minus baseline. The baseline was a global average. But we can do much better. Instead of a global baseline, we use the critic’s state-specific estimate.

The TD (Temporal Difference) error is our new advantage:

\[ \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \]

In plain terms:

  • \(r_t + \gamma V(s_{t+1})\) = TD target. The immediate reward plus our estimate of the next state’s value.
  • \(V(s_t)\) = Our prediction for the current state.
  • \(\delta_t\) = The difference. Did we do better or worse than expected?

If \(\delta_t > 0\), we did better than expected → reinforce that action.

If \(\delta_t < 0\), we did worse than expected → decrease that action’s probability.

If \(\delta_t \approx 0\), we were spot on → action was about average.

This is way more informative than REINFORCE’s global baseline. The signal is now state-specific. The drone in a tricky spin might get -10 reward and that’s actually pretty good (usually gets -50 there). But if it’s hovering peacefully over the platform, -10 is terrible. The critic knows the difference. The TD error captures that.
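To put made-up numbers on that: say the critic estimates \(V(s_t) = -40\) for the tricky spin, the drone takes an action, collects \(r_t = -10\), and the critic values the resulting state at \(V(s_{t+1}) = -25\). With \(\gamma = 0.99\):

\[ \delta_t = -10 + 0.99 \cdot (-25) - (-40) = +5.25 \]

A positive TD error, so the action gets reinforced even though the raw reward was negative.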

Here’s how this flows through the training loop (simplified; here I’m sampling each thruster as an independent on/off Bernoulli decision over the actor’s sigmoid outputs):

# 1. Take one action in each parallel game
probs = actor(state)                               # per-thruster probabilities
dist = torch.distributions.Bernoulli(probs)
action = dist.sample()
next_state, reward = env.step(action)

# 2. Get value estimates (the TD target is treated as a constant; see Bug #1 below)
value_current = critic(state).squeeze(-1)
with torch.no_grad():
    value_next = critic(next_state).squeeze(-1)

# 3. Compute TD error (our advantage)
td_error = reward + gamma * value_next - value_current

# 4. Update the critic: it should have predicted better.
#    Minimizing the squared TD error pushes V(s) toward the target r + gamma * V(s').
critic_loss = (td_error ** 2).mean()
critic_optimizer.zero_grad()
critic_loss.backward()
critic_optimizer.step()

# 5. Update the actor: reinforce or discourage based on the TD error
#    (same policy gradient as REINFORCE, but with the TD error instead of full returns)
actor_loss = -(dist.log_prob(action).sum(dim=-1) * td_error.detach()).mean()
actor_optimizer.zero_grad()
actor_loss.backward()
actor_optimizer.step()

Notice we’re updating both networks per step, not per episode. That’s the online learning magic.

One more comparison to make this crystal clear:

Method         What We Learn From    Timing               Baseline
REINFORCE      Full return G_t       After episode ends   Global average of all returns
Actor-Critic   TD error δ_t          After every step     State-specific V(s_t)

The second one is more precise, more informative, and arrives much sooner.


This is why Actor-Critic converged in 600 iterations on my machine while REINFORCE needed 1200. Same reward function, same environment, same drone. But getting feedback after every step instead of every 150 steps? That’s a 150x information advantage per iteration.

The Three Bugs: A Debugging Odyssey

Alright, I’m about to tell you about three bugs that nearly broke me. Not “oops, off-by-one error” broken. I mean the kind of broken where you stare at training curves for six hours, seriously question whether you understand backpropagation, debug your code five times, and then spend another two hours reading academic papers to convince yourself you’re not insane.

These bugs are subtle enough that even experienced RL practitioners have to be careful. The good news: once you understand them, they become obvious. The bad news: you have to understand them first, and I learned the hard way.

Bug #1: The Moving Target Problem

The Setup

I implemented Actor-Critic exactly as it seemed logical. I have two networks. One predicts actions, one predicts values. Simple, right? I wrote out the TD error computation:

# Compute value estimates
values = critic(batch_data['states'])
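# ❌ WRONG: gradients will flow through this TD target (next_values is not detached)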
next_values = critic(batch_data['next_states'])

# Compute TD targets and errors
td_targets = rewards + gamma * next_values * (1 - dones)
td_errors = td_targets - values

# Critic loss
critic_loss = (td_errors ** 2).mean()

# Backward pass
critic_loss.backward()

This looked completely reasonable to me. We compute what we expected (values), we compute what we should have gotten (td_targets), we measure the error, and we update. Standard supervised learning stuff.

The Symptom: Nothing Works

I trained for 200 iterations and the critic loss was… stuck. It hovered in the 500-1000 range, oscillating wildly like a sine wave with no downward trend. I checked my reward function. Looked fine. I checked the critic network. Standard architecture, nothing weird. I checked the TD error values themselves. They were bouncing around between -50 and +50, which seemed reasonable given the reward scale.

But the loss refused to converge.

I spent two days on this. I added dropout, thinking maybe overfitting. (Wrong problem, didn’t help.) I reduced the learning rate from 1e-3 to 1e-4, thinking maybe the optimizer was overshooting. (Nope, just learned slower while oscillating.) I checked if my environment was returning NaNs. (It wasn’t.) I even wondered if PyTorch’s autograd had a bug. (Spoiler: PyTorch is fine, I was the bug.)

The Breakthrough

I was reading the Actor-Critic chapter in Sutton & Barto (again, for the fifth time) when something caught my eye. The pseudocode had a line about “computing the next value estimate.” And I thought: wait, when I compute next_values = critic(next_states), what happens to those gradients during backprop?

And then my brain went click. Oh no. The target is moving as we try to optimize toward it. This is called the moving target problem.

Why This Breaks Everything

When we compute next_values = critic(next_states) without detaching, PyTorch’s autograd flows gradients through BOTH V(s) and V(s’). That means we’re updating the prediction AND the target simultaneously—the critic chases a target that moves every time it updates. The gradient becomes:

\[ \frac{\partial L}{\partial \theta} = 2 \cdot \left( r + \gamma V(s') - V(s) \right) \cdot \left( \gamma \frac{\partial V(s')}{\partial \theta} - \frac{\partial V(s)}{\partial \theta} \right) \]

That γ · ∂V(s')/∂θ term is the problem—we’re telling the critic to change the target, not just the prediction. The loss oscillates forever.

The Fix (Finally)

I needed to treat the TD target as a fixed constant. In PyTorch, that means detaching the gradients:

# ✅ CORRECT
values = critic(batch_data['states'])

with torch.no_grad():  # CRITICAL LINE
    next_values = critic(batch_data['next_states'])

td_targets = rewards + gamma * next_values * (1 - dones)
td_errors = td_targets - values

critic_loss = (td_errors ** 2).mean()
critic_loss.backward()

The torch.no_grad() context manager says: “Compute these next values, but don’t remember how you computed them. For gradient purposes, treat this as a constant.” Now during the backward pass:

\[ \frac{\partial L}{\partial \theta} = 2 \cdot \left( r + \gamma V(s') - V(s) \right) \cdot \left( - \frac{\partial V(s)}{\partial \theta} \right) \]

That problematic term is gone! Now we’re only updating V(s), the prediction, to match the fixed target r + γV(s’). This is exactly what we want.

The TD target becomes what it should be: a fixed label, like the ground truth in supervised learning. We’re no longer trying to hit a moving target. We’re just trying to predict something stable.

I changed exactly one line. The critic loss went from oscillating chaotically around 500-1000 to decreasing smoothly: 500 → 250 → 100 → 35 → 8 over 200 iterations. This bug is insidious because the code looks completely reasonable—but always detach your TD targets.
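If you prefer, the same fix can be written with .detach() instead of the context manager; both stop gradients from flowing into the target:

values = critic(batch_data['states'])
next_values = critic(batch_data['next_states']).detach()   # same effect as torch.no_grad()

td_targets = rewards + gamma * next_values * (1 - dones)
critic_loss = ((td_targets - values) ** 2).mean()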

Bug #2: Gamma Too Low (Invisible Rewards)

The Setup

Alright, Bug #1 was subtle. This bug is embarrassingly obvious in retrospect. But you know what? Sometimes the most obvious mistakes are the easiest to miss because you don’t expect the problem to be that simple.

I fixed the moving target bug and suddenly the critic loss started converging. Fantastic! I felt like a real engineer for a moment there. But then I ran the agent for a full training iteration and… nothing. Absolutely nothing improved. The drone would take a few random moves and then immediately crash into the ground or fly off the screen. No learning. No improvement. No signs of life.

Actually, wait. The critic was learning. The loss was going down. But the drone wasn’t getting better. That seemed backwards. Why would the critic learn to predict values if the agent wasn’t learning anything from those values?

The Discovery

I printed the TD targets and they were all negative—ranging from -5 to -30. No sign of the +500 landing reward. Then I did the math: with 150-step episodes and gamma=0.90:

\[ 500 \times 0.90^{150} \approx 0.00007 \]

The landing reward had been discounted into oblivion. The agent learned to crash immediately because trying to land was literally invisible to the value function.

The discount factor γ controls the effective horizon (≈ 1/(1-γ)). With gamma = 0.90, that’s only about 10 steps, far too short for episodes that run 100-300 steps. With gamma = 0.99, the horizon stretches to about 100 steps, which actually covers the trajectory.
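A quick sanity check you can run before training (the +500 landing reward and 150-step horizon are the numbers from my setup):

for gamma in (0.90, 0.99, 0.999):
    print(gamma, 500 * gamma ** 150)

# 0.90  ->  ~0.00007   (the landing is invisible)
# 0.99  ->  ~110.7     (clearly visible)
# 0.999 ->  ~430.3     (almost undiscounted)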

The fix: change gamma from 0.90 to 0.99.

The Impact

I changed gamma from 0.90 to 0.99. Same network, same rewards, same everything else.

Result: Iteration 5, the drone moved toward the platform. Iteration 50, it slowed when approaching. Iteration 100, first landing. By iteration 600, 68% success rate.

One parameter change, completely different agent behavior. The terminal reward went from invisible to crystal clear. Always check: effective horizon (1/(1-γ)) should match your episode length.

Bug #3: Reward Exploits (The Arms Race)

At this point, I’d fixed both the moving target problem and the gamma issue. My agent was actually learning! It approached the platform, slowed down occasionally, and even landed sometimes. I was genuinely excited. Then I started watching the failures more carefully, and something weird happened.

After fixing bugs #1 and #2, the agent learned two new exploits:

Zoom-past: Accelerate toward the platform at maximum speed, overshoot, crash far away. Net reward: -140 (approach rewards +60, crash penalty -200). Better than crashing immediately (-300), but not landing.

Hovering: Get close to the platform and vibrate in place with tiny movements (speed 0.01-0.02) to farm approach rewards indefinitely while avoiding crash penalties.

Why This Happens: The Fundamental Problem

Here’s the thing that bothered me: My reward function could only see the current state, not the trajectory.

The reward function is r(s', a): given the next state and the action I just took, compute my reward. It has no memory. It can’t tell the difference between:

  1. A drone making genuine progress toward landing: approaching from above with controlled, purposeful descent
  2. A drone farming the reward structure: hovering with meaningless micro-movements

Both scenarios might have:

  • distance_to_platform < 0.3 (close to target)
  • speed > 0 (technically moving)
  • velocity_alignment > 0 (pointed in the right direction)

The agent isn’t dumb. It’s doing exactly what I told it to do—maximize the scalar rewards I’m feeding it. The problem is that the rewards don’t actually encode landing, they encode proximity and movement. And proximity without landing is exploitable.

This is the core insight of reward hacking: the agent will find loopholes in your reward specification, not because it’s clever, but because you under-specified the task.

The Fix: Reward State Transitions, Not Snapshots

The fix: reward based on state transitions r(s, s'), not just current state r(s'). Instead of asking “Is distance < 0.3?”, ask “Did we get closer (distance_delta > 0) AND move fast enough to mean it (speed ≥ 0.15)?”

def calc_reward(state: DroneState, prev_state: DroneState = None) -> dict:
    rewards = {}
    if prev_state is not None:
        # Progress made since the last step, not raw proximity
        distance_delta = prev_state.distance_to_platform - state.distance_to_platform
        speed = state.speed
        velocity_toward_platform = calculate_alignment(state)  # cosine similarity

        MIN_MEANINGFUL_SPEED = 0.15

        if speed >= MIN_MEANINGFUL_SPEED and velocity_toward_platform > 0.1:
            # Moving decisively toward the platform: reward the progress made
            speed_multiplier = 1.0 + speed * 2.0
            rewards['approach'] = distance_delta * 15.0 * speed_multiplier
        elif speed < 0.05:
            # Vibrating in place to farm proximity: penalize it
            rewards['hovering_penalty'] = -1.0

    return rewards
Key changes: (1) Reward distance_delta (progress), not proximity, (2) MIN_SPEED threshold blocks hovering, (3) Speed multiplier encourages decisive action.

To use this, track prev_state in your training loop and pass it to calc_reward(next_state, prev_state).
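As a rough sketch of what that looks like in the rollout loop (the env.reset()/env.step() signature and the select_action helper here are stand-ins for whatever your environment and actor-sampling code actually look like):

state = env.reset()
done = False
while not done:
    action = select_action(state)                    # hypothetical helper: sample from the actor's outputs
    next_state, done = env.step(action)
    reward_terms = calc_reward(next_state, prev_state=state)
    reward = sum(reward_terms.values())              # collapse the reward dict into one scalar
    state = next_state                               # this step's state is the next step's prev_state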

90% of RL is reward engineering. The other 90% is debugging your reward engineering. Rewards are a specification of the objective, and the agent will find every loophole.

Basic Actor-Critic Results

I have to admit, when I fixed the third bug (that velocity-magnitude-weighted reward function) and launched a fresh training run with all three fixes in place, I was skeptical. I’d spent so much time chasing my tail with these algorithms that I half expected Actor-Critic to hit some new, creative failure mode I hadn’t anticipated. But something surprising happened: it just… worked.

And I mean really worked. Better than REINFORCE, in fact—noticeably better. After hundreds of hours debugging REINFORCE’s reward hacking, I was expecting Actor-Critic to at least match its performance. Instead, it blew past it.

Why This Beats REINFORCE (And Why That Matters):

Actor-Critic’s online updates create a feedback loop that REINFORCE can’t match. Every single step, the critic whispers in the actor’s ear: “Hey, that state is good” or “That state is bad.” It’s not a global baseline like REINFORCE uses. It’s state-specific evaluation that gets better and better as the critic learns.

This is why the convergence is 2x faster. This is why the final success rate is 13 percentage points higher (68% vs. 55%). This is why the learning curves are so clean.

And all of it hinged on three things: detaching the TD target, using the right discount factor, and tracking state transitions in the reward function. No new algorithm tricks needed. Just correct implementation.

What’s Next: Pushing Beyond Actor-Critic

With Actor-Critic working reasonably well, you may have noticed that the policy consistently lands the drone on the left side of the platform, and that the movements are still slightly jittery. To address this, I’m working on covering Proximal Policy Optimization (PPO), which is supposed to help by “making the learning process more stable”. The good news is that this is the method the researchers at OpenAI have used to train their flagship “GPT” models.

References

Foundational RL Papers

  1. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.

Actor-Critic Methods

  1. Konda, V. R., & Tsitsiklis, J. N. (2003). “On Actor-Critic Algorithms.” SIAM Journal on Control and Optimization, 42(4), 1143-1166.
    • Theoretical foundations of Actor-Critic with convergence proofs
  2. Mnih, V., Badia, A. P., Mirza, M., et al. (2016). “Asynchronous Methods for Deep Reinforcement Learning.” International Conference on Machine Learning.

Temporal Difference Learning

  1. Sutton, R. S. (1988). “Learning to Predict by the Methods of Temporal Differences.” Machine Learning, 3(1), 9-44.
    • Original TD learning paper

Previous Posts in This Series

  1. Jumle, V. (2025). “Deep Reinforcement Learning: 0 to 100 – Policy Gradients (REINFORCE).”

Code Repository & Implementation

  1. Jumle, V. (2025). “Reinforcement Learning 101: Delivery Drone Landing.”

All images in this article are either AI-generated (using Gemini or Sora), personally made by me, or screenshots & plots that I made, unless specified otherwise.


