
Reinforcement Learning: Agents, Environments, and Rewards in Practice

Trial-and-Error Learning for Sequential Decision Making

Abstract Algorithms · 15 min read

TLDR: Reinforcement Learning trains agents to make sequences of decisions by learning from rewards and penalties. Unlike supervised learning, RL learns through trial and error rather than labeled examples. Use it for sequential decision problems where the optimal action depends on future consequences — robotics, game AI, recommendation systems, and RLHF training for LLMs.


πŸ“– The Trial-and-Error Learning Problem: When Labels Don't Exist

You're training a robot arm, but there's no labeled dataset. It needs to figure out through trial and error how to grip objects without breaking them, learning from thousands of attempts where it either succeeds or drops the object. Traditional supervised learning can't solve this because there's no "correct grip pressure" label for each object — the robot must discover the optimal policy through exploration.

Reinforcement Learning (RL) is machine learning for sequential decision-making problems. An agent takes actions in an environment, receives rewards or penalties, and learns a policy (strategy) to maximize cumulative reward over time. Unlike supervised learning, which learns from input-output pairs, RL learns from the consequences of actions.

| Learning Type | Data Source | Goal | Example |
| --- | --- | --- | --- |
| Supervised | Labeled examples | Map inputs to outputs | Email spam classification |
| Unsupervised | Unlabeled patterns | Find hidden structure | Customer segmentation |
| Reinforcement | Action-reward feedback | Maximize cumulative reward | Robot navigation |

The key insight: RL optimizes for long-term outcomes, not immediate correctness. A chess move that looks bad in isolation might be part of a winning strategy — RL agents learn to balance immediate rewards with future potential.
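This trade-off is usually formalized as the discounted return G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ..., where γ < 1 shrinks rewards the further away they are. A quick sketch with made-up reward numbers:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_k gamma^k * r_{t+k} by folding from the last reward back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A sacrifice now (-1) followed by a big payoff later (+10)
# still scores well: G = -1 + 0.9*0 + 0.81*10 = 7.1
print(discounted_return([-1, 0, 10]))
```

With γ close to 1 the agent is far-sighted; with γ near 0 it becomes myopic and chases immediate reward.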


πŸ” RL Components: The Agent-Environment Feedback Loop

Reinforcement Learning involves five core components interacting in a continuous loop:

  • Agent: The learner/decision-maker (e.g., chess AI, trading algorithm, robot controller)
  • Environment: The external system the agent interacts with (e.g., chessboard, stock market, physical world)
  • State: The current situation or context (e.g., board position, portfolio balance, sensor readings)
  • Action: What the agent can do (e.g., move piece, buy/sell stock, move joint)
  • Reward: Feedback signal indicating how good the action was (e.g., +1 for win, profit/loss, task completion)

graph TD
    A[Agent] -->|Action| E[Environment]
    E -->|State, Reward| A
    E -->|Next State| E

    subgraph "RL Components"
        A
        E
        S[State Space]
        R[Reward Function]
        P[Policy π]
    end

    style A fill:#e1f5fe
    style E fill:#f3e5f5
    style S fill:#e8f5e8
    style R fill:#fff3e0
    style P fill:#fce4ec

At each time step:

  1. Agent observes the current state s_t
  2. Agent selects an action a_t based on its policy π(a|s)
  3. Environment transitions to new state s_{t+1} and returns reward r_t
  4. Agent updates its policy to improve future decisions

Policy π: The strategy that maps states to actions. This is what the agent learns — which action to take in each situation to maximize long-term reward.

Value Function V(s): Expected cumulative reward starting from state s following policy π. This helps the agent evaluate how "good" different states are.
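The loop described above reduces to a small interface contract between agent and environment. A minimal sketch with a toy coin-guessing environment (CoinFlipEnv and the random policy are hypothetical, just to show the step signature):

```python
import random

class CoinFlipEnv:
    """Toy environment: guess 0 or 1; reward +1 for a correct guess."""
    def reset(self):
        self.hidden = random.randint(0, 1)
        return 0  # a single dummy state

    def step(self, action):
        reward = 1.0 if action == self.hidden else 0.0
        self.hidden = random.randint(0, 1)  # environment transitions
        return 0, reward, False             # next_state, reward, terminal

env = CoinFlipEnv()
state = env.reset()
total = 0.0
for t in range(100):                  # the agent-environment loop
    action = random.randint(0, 1)     # policy pi(a|s): here purely random
    state, reward, done = env.step(action)
    total += reward                   # cumulative reward the agent tries to maximize
```

A learning agent would replace the random action choice with a policy it updates from the observed rewards.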


βš™οΈ How Q-Learning Discovers Optimal Policies Through Trial and Error

The core challenge in RL is the credit assignment problem: if the agent wins a game after 50 moves, which moves were actually good? Q-Learning solves this by learning the Q-function — the expected reward for taking action a in state s and following the optimal policy afterward.

Q-Learning Algorithm:

  1. Initialize Q-table Q(s,a) with small random values
  2. For each episode:
    • Start in initial state s
    • While not terminal:
      • Choose action a using ε-greedy policy
      • Execute action, observe reward r and next state s'
      • Update: Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]
      • Set s ← s'

Key Parameters:

  • α (learning rate): How much to update Q-values each step (typically 0.1-0.5)
  • γ (discount factor): How much to value future rewards vs immediate rewards (typically 0.9-0.99)
  • ε (exploration rate): Probability of taking a random action vs the best known action
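In practice ε usually starts high and decays toward a small floor, so the agent explores heavily at first but never stops exploring entirely. A sketch of the common multiplicative schedule (the values are typical defaults, not prescriptions):

```python
# Multiplicative epsilon decay with a floor
epsilon, eps_min, eps_decay = 1.0, 0.01, 0.995
schedule = []
for episode in range(1000):
    schedule.append(epsilon)
    epsilon = max(eps_min, epsilon * eps_decay)  # decay, but keep a minimum of exploration

print(schedule[0], schedule[-1])  # 1.0 0.01: fully random at first, near-greedy later
```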

The Bellman Equation captures the core insight: the value of being in state s equals the immediate reward plus the discounted value of the best next state:

Q(s,a) = r + γ max Q(s',a')

This means optimal decisions require looking ahead β€” sometimes accepting lower immediate rewards for better long-term outcomes.
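Plugging concrete (made-up) numbers into the update rule shows how a high-value next state pulls the current estimate upward:

```python
# One Q-learning update with alpha = 0.5, gamma = 0.9
alpha, gamma = 0.5, 0.9
q_sa = 2.0          # current estimate Q(s, a)
reward = 1.0        # observed immediate reward r
max_q_next = 4.0    # best available estimate max_a' Q(s', a')

td_target = reward + gamma * max_q_next  # 1 + 0.9 * 4 = 4.6
td_error = td_target - q_sa              # 4.6 - 2.0 = 2.6
q_sa += alpha * td_error                 # 2.0 + 0.5 * 2.6 = 3.3
```

The estimate moves only halfway (α = 0.5) toward the target, which averages out noise across many updates.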


🧠 Deep Dive: How Modern RL Algorithms Scale Beyond Tabular Methods

The Internals

Tabular Q-Learning stores Q-values in a lookup table, but this breaks down with large state spaces. A chess game has ~10^43 possible positions — impossible to store in memory.

Deep Q-Networks (DQN) replace the Q-table with a neural network that approximates Q(s,a). The network takes state s as input and outputs Q-values for all possible actions:

# DQN architecture sketch: state vector in, one Q-value per action out
state_input = tf.keras.Input(shape=(state_size,))
x = tf.keras.layers.Dense(64, activation='relu')(state_input)
x = tf.keras.layers.Dense(64, activation='relu')(x)
q_values = tf.keras.layers.Dense(action_size, activation='linear')(x)
model = tf.keras.Model(state_input, q_values)

Experience Replay: DQN stores (state, action, reward, next_state) transitions in a replay buffer and samples random batches for training. This breaks the correlation between consecutive experiences and improves learning stability.

Target Network: DQN uses a separate target network for computing expected future rewards, updated periodically. This prevents the "moving target" problem where both current and target Q-values change simultaneously.

Performance Analysis

Time Complexity: Tabular Q-learning is O(1) per update but requires O(|S| × |A|) memory. DQN is O(n) per forward pass where n is network size, but can handle infinite state spaces.

Sample Efficiency: Q-learning is sample-efficient for small problems but struggles with sparse rewards. Policy gradient methods like REINFORCE directly optimize the policy but require more samples:

# REINFORCE update rule (TF2 style; log_probs for the taken actions must be
# computed under the tape so gradients flow into the assumed policy_model)
with tf.GradientTape() as tape:
    policy_loss = -tf.reduce_mean(log_probs * returns)
grads = tape.gradient(policy_loss, policy_model.trainable_variables)
optimizer.apply_gradients(zip(grads, policy_model.trainable_variables))

Exploration vs Exploitation: ε-greedy balances exploration (random actions) with exploitation (greedy actions). Advanced methods use Upper Confidence Bounds (UCB) or Thompson Sampling for smarter exploration.
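As a sketch, UCB1 adds a confidence bonus to each arm's estimated mean, so rarely tried actions stay attractive until the data rules them out. A toy three-armed bandit (the arm probabilities are invented for illustration):

```python
import math
import random

random.seed(0)
true_means = [0.2, 0.5, 0.8]  # hypothetical Bernoulli reward probabilities
counts = [0, 0, 0]
values = [0.0, 0.0, 0.0]

for t in range(1, 2001):
    if 0 in counts:  # play each arm once before applying the bonus formula
        arm = counts.index(0)
    else:
        # UCB1 score: empirical mean + sqrt(2 ln t / n_arm)
        arm = max(range(3), key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean update

# The best arm (index 2) ends up pulled far more often than the others
print(counts)
```

Unlike ε-greedy, the bonus shrinks as an arm accumulates pulls, so exploration concentrates exactly where uncertainty remains.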


πŸ“Š Visualizing the RL Learning Process

sequenceDiagram
    participant A as Agent
    participant E as Environment
    participant M as Memory/Replay Buffer
    participant N as Neural Network

    loop Training Episode
        A->>E: action_t
        E->>A: state_{t+1}, reward_t
        A->>M: store (s_t, a_t, r_t, s_{t+1})

        Note over A: Sample mini-batch from M
        A->>N: Update Q-network
        N->>A: Updated Q-values

        Note over A: ε-greedy action selection
    end

    Note over A,N: Policy improves over time

This diagram shows how modern RL agents learn:

  1. Agent takes action in environment
  2. Environment returns new state and reward
  3. Experience stored in replay buffer
  4. Neural network trained on sampled experiences
  5. Updated policy used for next decision

The key insight: learning happens offline from stored experiences, not just from immediate feedback.


🌍 Real-World Applications: From Atari to RLHF

Case Study 1: Game AI Mastery

AlphaGo/AlphaZero used RL to master Go, chess, and shogi without human game data. The agent learned by playing millions of games against itself, starting with random moves and gradually discovering winning strategies.

Input: Board position (19×19 grid for Go)
Process: Monte Carlo Tree Search guided by neural network value estimates
Output: Move selection that maximizes win probability

Scaling insight: Self-play generates infinite training data, but requires massive computational resources (Google used 5,000 TPUs for AlphaZero training).

Case Study 2: LLM RLHF Training

Reinforcement Learning from Human Feedback (RLHF) trains language models to be helpful and harmless. ChatGPT uses RLHF to learn human preferences for response quality.

Input: Prompt + multiple response candidates
Process: Human raters rank responses; RL optimizes for higher-rated outputs
Output: Language model that aligns with human preferences

Operational challenge: Human preferences are noisy and inconsistent. Production systems use Proximal Policy Optimization (PPO) to prevent the policy from deviating too far from the base model.
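The PPO constraint mentioned above comes from clipping the probability ratio between the new and old policy. A numpy sketch of the clipped surrogate loss (the inputs are illustrative, not from any real RLHF run):

```python
import numpy as np

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective, returned as a loss to minimize."""
    ratio = np.exp(new_logp - old_logp)  # pi_new(a|s) / pi_old(a|s)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Pessimistic (elementwise min) objective, averaged, then negated
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))

# A ratio of 2.0 with positive advantage is clipped at 1.2, so the update
# gets no extra credit for moving further from the old policy
loss = ppo_clip_loss(np.log([2.0]), np.log([1.0]), np.array([1.0]))
```

RLHF pipelines typically also add a KL penalty against the frozen base model on top of this clipping to keep outputs fluent.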


βš–οΈ Trade-offs & Failure Modes: When RL Goes Wrong

Performance vs Sample Efficiency

RL typically requires orders of magnitude more data than supervised learning. DQN needed 200 million frames to master Atari games — equivalent to 924 hours of gameplay. Modern methods like Rainbow DQN achieve the same performance with 10x fewer samples through algorithmic improvements.

Exploration vs Exploitation Dilemma

Pure exploitation (always choose best known action) gets stuck in local optima. Pure exploration (random actions) never leverages learned knowledge. The optimal balance depends on problem structure:

  • High-stakes environments: Conservative exploitation (financial trading)
  • Safe exploration: Cautious exploration with safety constraints (robotics)
  • Multi-armed bandits: UCB provides theoretical guarantees for exploration-exploitation balance

Common Failure Modes

Reward Hacking: Agent finds unexpected ways to maximize reward that don't align with intended behavior. A cleaning robot might create messes to appear more useful.

Catastrophic Forgetting: Neural network forgets previously learned skills when learning new tasks. Requires techniques like Elastic Weight Consolidation or Progressive Neural Networks.

Distribution Shift: Performance degrades when deployment environment differs from training. RL agents can be brittle to changes in state representation or reward structure.


🧭 Decision Guide: When to Choose RL Over Supervised Learning

| Situation | Recommendation |
| --- | --- |
| Use RL when | Sequential decisions with delayed rewards; no labeled training data; need to balance exploration/exploitation; environment provides a reward signal |
| Avoid RL when | Single-step prediction tasks; abundant labeled data available; real-time inference required (RL training is slow); safety-critical applications with no tolerance for exploration failures |
| Alternatives | Supervised learning for classification/regression; imitation learning when expert demonstrations are available; evolutionary algorithms for black-box optimization |
| Edge cases | Sparse rewards need reward shaping; continuous action spaces require policy gradients; multi-agent environments need specialized algorithms (MADDPG) |

Rule of thumb: Use RL when the optimal action depends on future consequences, not just current context. Supervised learning optimizes immediate accuracy; RL optimizes long-term outcomes.


πŸ› οΈ Gymnasium: The OpenAI Standard for RL Environments

Gymnasium (formerly OpenAI Gym) provides standardized environments for RL research and development. It defines a common interface that works with any RL algorithm:

import gymnasium as gym
import numpy as np
from collections import defaultdict

# Create CartPole environment
env = gym.make("CartPole-v1", render_mode="human")

# Simple Q-learning for discrete state approximation
class QLearningAgent:
    def __init__(self, action_size, learning_rate=0.1, discount=0.95, epsilon=0.1):
        self.q_table = defaultdict(lambda: np.zeros(action_size))
        self.lr = learning_rate
        self.gamma = discount
        self.epsilon = epsilon

    def discretize_state(self, state):
        """Convert continuous state to discrete bins for Q-table"""
        # CartPole state: [position, velocity, angle, angular_velocity]
        bins = [
            np.digitize(state[0], np.linspace(-2.4, 2.4, 10)),
            np.digitize(state[1], np.linspace(-2, 2, 10)), 
            np.digitize(state[2], np.linspace(-0.2, 0.2, 10)),
            np.digitize(state[3], np.linspace(-2, 2, 10))
        ]
        return tuple(bins)

    def choose_action(self, state):
        """ε-greedy action selection"""
        discrete_state = self.discretize_state(state)
        if np.random.random() < self.epsilon:
            return env.action_space.sample()  # Random action
        return np.argmax(self.q_table[discrete_state])  # Best known action

    def update(self, state, action, reward, next_state, done):
        """Q-learning update rule"""
        discrete_state = self.discretize_state(state)
        discrete_next_state = self.discretize_state(next_state)

        # Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]
        current_q = self.q_table[discrete_state][action]
        if done:
            target_q = reward
        else:
            target_q = reward + self.gamma * np.max(self.q_table[discrete_next_state])

        self.q_table[discrete_state][action] += self.lr * (target_q - current_q)

# Training loop
agent = QLearningAgent(action_size=env.action_space.n)
episodes = 1000
scores = []

for episode in range(episodes):
    state, info = env.reset()
    total_reward = 0

    while True:
        action = agent.choose_action(state)
        next_state, reward, terminated, truncated, info = env.step(action)

        # Custom reward shaping for CartPole:
        # bonus for staying upright and near the center of the track
        angle_reward = 1.0 - abs(next_state[2]) / 0.2  # 1 when upright, 0 at the angle limit
        position_reward = 1.0 - abs(next_state[0]) / 2.4  # 1 at center, 0 at the track edge
        shaped_reward = reward + 0.1 * (angle_reward + position_reward)

        agent.update(state, action, shaped_reward, next_state, terminated or truncated)
        state = next_state
        total_reward += reward

        if terminated or truncated:
            break

    scores.append(total_reward)

    # Decay exploration rate
    if agent.epsilon > 0.01:
        agent.epsilon *= 0.995

    if episode % 100 == 0:
        avg_score = np.mean(scores[-100:])
        print(f"Episode {episode}, Average Score: {avg_score:.2f}, ε: {agent.epsilon:.3f}")

env.close()

Why Gymnasium Matters: It standardizes the RL interface across different environments — from simple control tasks like CartPole to complex simulations like MuJoCo robotics. The same algorithm can work on any Gymnasium-compatible environment with minimal code changes.

Environment Categories:

  • Classic Control: CartPole, MountainCar (simple physics)
  • Atari: Video games with pixel observations
  • MuJoCo: Continuous control robotics simulation
  • Custom: Easy to wrap any environment with the Gymnasium API

πŸ§ͺ Practical Examples: Deep Q-Network for Continuous State Spaces

Example 1: DQN Implementation

import gymnasium as gym
import tensorflow as tf
import numpy as np
from collections import deque
import random

class DQNAgent:
    def __init__(self, state_size, action_size, learning_rate=0.001):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=10000)  # Experience replay buffer
        self.epsilon = 1.0  # Exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = learning_rate
        self.gamma = 0.95  # Discount factor

        # Neural network for Q-value approximation
        self.q_network = self._build_model()
        self.target_network = self._build_model()
        self.update_target_network()

    def _build_model(self):
        """Build neural network to approximate Q-values"""
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, input_dim=self.state_size, activation='relu'),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dense(32, activation='relu'),
            tf.keras.layers.Dense(self.action_size, activation='linear')
        ])
        model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        """Store experience in replay buffer"""
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        """ε-greedy action selection"""
        if np.random.random() <= self.epsilon:
            return random.randrange(self.action_size)

        q_values = self.q_network.predict(state.reshape(1, -1), verbose=0)
        return np.argmax(q_values[0])

    def replay(self, batch_size=32):
        """Train the model on a batch of experiences"""
        if len(self.memory) < batch_size:
            return

        batch = random.sample(self.memory, batch_size)
        states = np.array([e[0] for e in batch])
        actions = np.array([e[1] for e in batch])
        rewards = np.array([e[2] for e in batch])
        next_states = np.array([e[3] for e in batch])
        dones = np.array([e[4] for e in batch])

        # Current Q-values
        current_q_values = self.q_network.predict(states, verbose=0)

        # Target Q-values using target network
        next_q_values = self.target_network.predict(next_states, verbose=0)
        max_next_q_values = np.max(next_q_values, axis=1)

        # Bellman equation: Q(s,a) = r + γ * max(Q(s',a'))
        target_q_values = rewards + (self.gamma * max_next_q_values * ~dones)

        # Update only the Q-values for the actions taken
        for i in range(batch_size):
            current_q_values[i][actions[i]] = target_q_values[i]

        # Train the network
        self.q_network.fit(states, current_q_values, epochs=1, verbose=0)

        # Decay exploration rate
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def update_target_network(self):
        """Copy weights from main network to target network"""
        self.target_network.set_weights(self.q_network.get_weights())

# Training with DQN
env = gym.make("CartPole-v1")
agent = DQNAgent(state_size=4, action_size=2)

episodes = 500
scores = []

for episode in range(episodes):
    state, _ = env.reset()
    total_reward = 0

    for time_step in range(500):
        action = agent.act(state)
        next_state, reward, terminated, truncated, _ = env.step(action)

        # Store experience
        agent.remember(state, action, reward, next_state, terminated or truncated)
        state = next_state
        total_reward += reward

        if terminated or truncated:
            break

    scores.append(total_reward)

    # Train the agent
    agent.replay()

    # Update target network periodically
    if episode % 10 == 0:
        agent.update_target_network()

    if episode % 50 == 0:
        avg_score = np.mean(scores[-50:])
        print(f"Episode {episode}, Average Score: {avg_score:.2f}, ε: {agent.epsilon:.3f}")

env.close()

Key Differences from Q-Learning:

  • Function approximation: Neural network replaces lookup table
  • Experience replay: Training on stored experiences breaks temporal correlation
  • Target network: Stable target for Bellman updates prevents divergence

πŸ“š Lessons Learned: Common Pitfalls and Best Practices

Reward Engineering Is Critical: The reward function defines what the agent optimizes for. Poorly designed rewards lead to reward hacking — optimizing the metric without achieving the intended behavior. Always test rewards on edge cases and consider unintended consequences.

Exploration Never Ends: Unlike supervised learning, where more data always helps, RL agents can get worse if they stop exploring. Production RL systems need continuous exploration mechanisms, such as ε-greedy decay with a minimum exploration floor.

Sample Efficiency Matters in Production: RL is data-hungry. DQN needs millions of samples for simple tasks. Use techniques like transfer learning, reward shaping, and curriculum learning to reduce sample complexity. For real-world deployment, consider imitation learning to bootstrap from expert demonstrations.

Environment Assumptions Are Fragile: RL agents are sensitive to changes in state representation, reward structure, and environment dynamics. Production systems need robust monitoring and adaptation mechanisms.

Don't Ignore the Basics: Advanced algorithms like PPO and A3C get attention, but simple Q-learning often works well for discrete problems. Start simple and add complexity only when needed.


πŸ“Œ Summary & Key Takeaways

  • RL learns through trial and error: Agents discover optimal policies by exploring environments and learning from reward feedback, unlike supervised learning which requires labeled examples
  • Sequential decision-making is the key insight: RL optimizes for long-term cumulative reward, making it suitable for problems where immediate actions affect future outcomes
  • The agent-environment loop drives learning: State → Action → Reward → Next State creates a feedback cycle that enables policy improvement over time
  • Q-learning builds a map of action values: The Q-function estimates expected future reward for each state-action pair, enabling optimal decision-making
  • Deep RL scales to complex environments: Neural networks can approximate value functions for infinite state spaces, but require careful engineering (experience replay, target networks)
  • Applications span games to language models: From AlphaGo mastering Go to RLHF training ChatGPT, RL excels when the optimal strategy emerges from experience rather than examples
  • Choose RL when consequences matter more than immediate accuracy: If the optimal action depends on future outcomes and you can tolerate exploration failures, RL is the right approach

Remember: RL is powerful but data-hungry — use it when the problem truly requires sequential decision-making and you can afford the exploration cost.


πŸ“ Practice Quiz

  1. What is the main difference between supervised learning and reinforcement learning?

    • A) Supervised learning uses neural networks, RL uses decision trees
    • B) Supervised learning learns from labeled examples, RL learns from trial-and-error feedback
    • C) Supervised learning is for classification, RL is for regression

    Correct Answer: B
  2. In the Q-learning update rule Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)], what does the discount factor γ control?

    • A) How fast the agent learns from each experience
    • B) How much the agent explores vs exploits current knowledge
    • C) How much future rewards are valued compared to immediate rewards

    Correct Answer: C
  3. You're building an RL agent for stock trading, but it keeps making extremely risky trades that occasionally pay off big. What's the most likely problem?

    • A) Learning rate is too high
    • B) Discount factor is too low (focuses only on immediate rewards)
    • C) Not enough training data

    Correct Answer: B
  4. A robotics company wants to train a robot arm to assemble electronics components. They have expert human demonstrations of the assembly process. Should they use pure RL, supervised learning, or a hybrid approach? Justify your recommendation considering sample efficiency, safety, and performance requirements.

Open-ended answer: A hybrid approach would be optimal. Start with imitation learning (supervised) on the human demonstrations to bootstrap a reasonable policy quickly and safely. Then use RL fine-tuning in simulation to optimize beyond human performance while maintaining safety constraints. Pure RL would be too sample-inefficient and potentially unsafe for physical robots, while pure supervised learning couldn't adapt to variations or optimize for task-specific objectives like speed or precision.


Written by Abstract Algorithms (@abstractalgorithms)