Reinforcement Learning: Agents, Environments, and Rewards in Practice
Trial-and-Error Learning for Sequential Decision Making
TLDR: Reinforcement Learning trains agents to make sequences of decisions by learning from rewards and penalties. Unlike supervised learning, RL learns through trial and error rather than from labeled examples. Use it for sequential decision problems where the optimal action depends on future consequences: robotics, game AI, recommendation systems, and RLHF training for LLMs.
The Trial-and-Error Learning Problem: When Labels Don't Exist
You're training a robot arm, but there's no labeled dataset. It needs to figure out through trial and error how to grip objects without breaking them, learning from thousands of attempts where it either succeeds or drops the object. Traditional supervised learning can't solve this because there's no "correct grip pressure" label for each object; the robot must discover the optimal policy through exploration.
Reinforcement Learning (RL) is machine learning for sequential decision-making problems. An agent takes actions in an environment, receives rewards or penalties, and learns a policy (strategy) to maximize cumulative reward over time. Unlike supervised learning, which learns from input-output pairs, RL learns from the consequences of actions.
| Learning Type | Data Source | Goal | Example |
|---|---|---|---|
| Supervised | Labeled examples | Map inputs to outputs | Email spam classification |
| Unsupervised | Unlabeled patterns | Find hidden structure | Customer segmentation |
| Reinforcement | Action-reward feedback | Maximize cumulative reward | Robot navigation |
The key insight: RL optimizes for long-term outcomes, not immediate correctness. A chess move that looks bad in isolation might be part of a winning strategy; RL agents learn to balance immediate rewards with future potential.
RL Components: The Agent-Environment Feedback Loop
Reinforcement Learning involves five core components interacting in a continuous loop:

- Agent: The learner/decision-maker (e.g., chess AI, trading algorithm, robot controller)
- Environment: The external system the agent interacts with (e.g., chessboard, stock market, physical world)
- State: The current situation or context (e.g., board position, portfolio balance, sensor readings)
- Action: What the agent can do (e.g., move a piece, buy/sell a stock, move a joint)
- Reward: Feedback signal indicating how good the action was (e.g., +1 for a win, profit/loss, task completion)
```mermaid
graph TD
    A[Agent] -->|Action| E[Environment]
    E -->|State, Reward| A
    E -->|Next State| E
    subgraph "RL Components"
        A
        E
        S[State Space]
        R[Reward Function]
        P[Policy π]
    end
    style A fill:#e1f5fe
    style E fill:#f3e5f5
    style S fill:#e8f5e8
    style R fill:#fff3e0
    style P fill:#fce4ec
```
At each time step:
1. Agent observes the current state s_t
2. Agent selects an action a_t based on its policy π(a|s)
3. Environment transitions to new state s_{t+1} and returns reward r_t
4. Agent updates its policy to improve future decisions
Policy π: The strategy that maps states to actions. This is what the agent learns: which action to take in each situation to maximize long-term reward.
Value Function V(s): Expected cumulative reward starting from state s and following policy π. This helps the agent evaluate how "good" different states are.
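The discounted return that V(s) estimates can be computed directly for a finite reward sequence. A minimal stdlib-only sketch (the function name `discounted_return` is chosen here for illustration, not from any library):

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_0 + γ·r_1 + γ²·r_2 + ..., accumulated right-to-left."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three +1 rewards: 1 + 0.9 + 0.81 ≈ 2.71
print(discounted_return([1, 1, 1], gamma=0.9))
```

Shrinking γ toward 0 makes the agent myopic; γ near 1 weights distant rewards almost as much as immediate ones.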
How Q-Learning Discovers Optimal Policies Through Trial and Error
The core challenge in RL is the credit assignment problem: if the agent wins a game after 50 moves, which moves were actually good? Q-Learning addresses this by learning the Q-function, the expected cumulative reward for taking action a in state s and following the optimal policy afterward.
Q-Learning Algorithm:
1. Initialize Q-table Q(s,a) with small random values
2. For each episode:
   - Start in initial state s
   - While not terminal:
     - Choose action a using the ε-greedy policy
     - Execute the action; observe reward r and next state s'
     - Update: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
     - Set s ← s'
Key Parameters:
- α (learning rate): How much to update Q-values each step (typically 0.1-0.5)
- γ (discount factor): How much to value future rewards vs immediate rewards (0.9-0.99)
- ε (exploration rate): Probability of taking a random action vs the best known action
The Bellman Equation captures the core insight: the value of taking action a in state s equals the immediate reward plus the discounted value of the best action in the next state:
Q(s,a) = r + γ max_a' Q(s',a')
This means optimal decisions require looking ahead: sometimes accepting lower immediate rewards for better long-term outcomes.
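A single update from the rule above can be traced numerically. This toy sketch (dict-based Q-table, names chosen for illustration) assumes a second state s1 that already has a known high-value action:

```python
def q_update(q, s, a, r, s_next, alpha=0.5, gamma=0.9, terminal=False):
    """Move Q(s,a) a step of size alpha toward the target r + gamma * max Q(s',·)."""
    target = r if terminal else r + gamma * max(q[s_next].values())
    q[s][a] += alpha * (target - q[s][a])
    return q[s][a]

q = {"s0": {"left": 0.0, "right": 0.0},
     "s1": {"left": 0.0, "right": 10.0}}
# Target = 1 + 0.9 * 10 = 10, so Q(s0, right) moves from 0 to 0.5 * 10 = 5
print(q_update(q, "s0", "right", 1.0, "s1"))  # 5.0
```

Note how the action in s0 partially credits s1's future value: exactly the lookahead the Bellman equation describes.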
Deep Dive: How Modern RL Algorithms Scale Beyond Tabular Methods
The Internals
Tabular Q-Learning stores Q-values in a lookup table, but this breaks down with large state spaces. Chess has on the order of 10^43 possible positions, far too many to store in memory.
Deep Q-Networks (DQN) replace the Q-table with a neural network that approximates Q(s,a). The network takes state s as input and outputs Q-values for all possible actions:
```python
# DQN architecture: state in, one Q-value per action out
state_input = tf.keras.Input(shape=(state_size,))
x = tf.keras.layers.Dense(64, activation='relu')(state_input)
x = tf.keras.layers.Dense(64, activation='relu')(x)
q_values = tf.keras.layers.Dense(action_size, activation='linear')(x)
model = tf.keras.Model(state_input, q_values)
```
Experience Replay: DQN stores (state, action, reward, next_state) transitions in a replay buffer and samples random batches for training. This breaks the correlation between consecutive experiences and improves learning stability.
Target Network: DQN uses a separate target network for computing expected future rewards, updated periodically. This prevents the "moving target" problem where both current and target Q-values change simultaneously.
Performance Analysis
Time Complexity: Tabular Q-learning is O(1) per update but requires O(|S| Γ |A|) memory. DQN is O(n) per forward pass where n is network size, but can handle infinite state spaces.
Sample Efficiency: Q-learning is sample-efficient for small problems but struggles with sparse rewards. Policy gradient methods like REINFORCE directly optimize the policy but require more samples:
```python
# REINFORCE update rule (sketch; policy_model is the policy network)
with tf.GradientTape() as tape:
    log_probs = policy_model.log_prob(actions, states)  # hypothetical helper
    policy_loss = -tf.reduce_mean(log_probs * returns)
grads = tape.gradient(policy_loss, policy_model.trainable_variables)
optimizer.apply_gradients(zip(grads, policy_model.trainable_variables))
```
Exploration vs Exploitation: ε-greedy balances exploration (random actions) with exploitation (greedy actions). Advanced methods use Upper Confidence Bounds (UCB) or Thompson Sampling for smarter exploration.
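The UCB idea can be shown on a toy multi-armed bandit: each arm's score is its average reward plus a bonus that shrinks as the arm is tried more, so under-explored arms get revisited. A stdlib-only sketch (the arm means are made up for the demo):

```python
import math
import random

def ucb_select(counts, values, t, c=2.0):
    """Pick the arm maximizing mean reward + exploration bonus; untried arms first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [values[a] + math.sqrt(c * math.log(t) / counts[a])
              for a in range(len(counts))]
    return max(range(len(counts)), key=lambda a: scores[a])

random.seed(0)
true_means = [0.2, 0.5, 0.8]  # hypothetical Bernoulli arms
counts, values = [0, 0, 0], [0.0, 0.0, 0.0]
for t in range(1, 2001):
    arm = ucb_select(counts, values, t)
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
print(counts)  # the best arm (index 2) accumulates the most pulls
```

Unlike ε-greedy, the bonus term gives a principled reason to revisit uncertain arms rather than exploring uniformly at random.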
Visualizing the RL Learning Process
```mermaid
sequenceDiagram
    participant A as Agent
    participant E as Environment
    participant M as Memory/Replay Buffer
    participant N as Neural Network
    loop Training Episode
        A->>E: action_t
        E->>A: state_{t+1}, reward_t
        A->>M: store (s_t, a_t, r_t, s_{t+1})
        Note over A: Sample mini-batch from M
        A->>N: Update Q-network
        N->>A: Updated Q-values
        Note over A: ε-greedy action selection
    end
    Note over A,N: Policy improves over time
```
This diagram shows how modern RL agents learn:
- Agent takes action in environment
- Environment returns new state and reward
- Experience stored in replay buffer
- Neural network trained on sampled experiences
- Updated policy used for next decision
The key insight: learning happens offline from stored experiences, not just from immediate feedback.
Real-World Applications: From Atari to RLHF
Case Study 1: Game AI Mastery
AlphaGo/AlphaZero used RL to master Go, chess, and shogi without human game data. The agent learned by playing millions of games against itself, starting with random moves and gradually discovering winning strategies.
- Input: Board position (19×19 grid for Go)
- Process: Monte Carlo Tree Search guided by neural network value estimates
- Output: Move selection that maximizes win probability
Scaling insight: Self-play generates effectively unlimited training data but requires massive computational resources (DeepMind used 5,000 TPUs for AlphaZero training).
Case Study 2: LLM RLHF Training
Reinforcement Learning from Human Feedback (RLHF) trains language models to be helpful and harmless. ChatGPT uses RLHF to learn human preferences for response quality.
- Input: Prompt + multiple response candidates
- Process: Human raters rank responses; RL optimizes for higher-rated outputs
- Output: Language model that aligns with human preferences
Operational challenge: Human preferences are noisy and inconsistent. Production systems use Proximal Policy Optimization (PPO) to prevent the policy from deviating too far from the base model.
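The "don't deviate too far" mechanism is PPO's clipped surrogate objective: the probability ratio between the new and old policy is clipped to [1-ε, 1+ε] before being multiplied by the advantage, which removes the incentive for large policy jumps. A per-sample sketch in plain Python (the full PPO loss also includes value and entropy terms, omitted here):

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO-clip surrogate for one sample: caps how much one update can move the policy."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    # We maximize the surrogate, so return its negation as a loss
    return -min(ratio * advantage, clipped * advantage)

print(ppo_clip_loss(1.5, 1.0))   # ratio clipped to 1.2 -> loss -1.2
print(ppo_clip_loss(1.5, -1.0))  # negative advantage is not clipped here -> loss 1.5
```

The min over the raw and clipped terms makes the bound pessimistic: the update never benefits from pushing the ratio outside the trust region.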
Trade-offs & Failure Modes: When RL Goes Wrong
Performance vs Sample Efficiency
RL typically requires orders of magnitude more data than supervised learning. DQN needed 200 million frames to master Atari games, the equivalent of 924 hours of gameplay. Modern methods like Rainbow DQN achieve the same performance with roughly 10x fewer samples through algorithmic improvements.
Exploration vs Exploitation Dilemma
Pure exploitation (always choose best known action) gets stuck in local optima. Pure exploration (random actions) never leverages learned knowledge. The optimal balance depends on problem structure:
- High-stakes environments: Conservative exploitation (financial trading)
- Safety-critical systems: Cautious exploration under explicit safety constraints (robotics)
- Multi-armed bandits: UCB provides theoretical guarantees for exploration-exploitation balance
Common Failure Modes
Reward Hacking: Agent finds unexpected ways to maximize reward that don't align with intended behavior. A cleaning robot might create messes to appear more useful.
Catastrophic Forgetting: Neural network forgets previously learned skills when learning new tasks. Requires techniques like Elastic Weight Consolidation or Progressive Neural Networks.
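Elastic Weight Consolidation's core idea fits in a few lines: add a quadratic penalty that anchors each parameter to its old-task value, weighted by an importance estimate (the diagonal Fisher information). A toy sketch over a flat parameter list (the values are invented for the demo):

```python
def ewc_penalty(params, old_params, fisher, lam=1.0):
    """(λ/2) Σ F_i (θ_i - θ*_i)²: important weights (large F_i) are pulled back hardest."""
    return (lam / 2) * sum(f * (p - p0) ** 2
                           for p, p0, f in zip(params, old_params, fisher))

# First parameter drifted by 1.0 with importance 4; second did not move
print(ewc_penalty([1.0, 2.0], [0.0, 2.0], [4.0, 4.0]))  # 0.5 * 4 * 1 = 2.0
```

Training on the new task then minimizes task_loss + this penalty, so weights the old task barely used stay free to change while important ones resist overwriting.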
Distribution Shift: Performance degrades when deployment environment differs from training. RL agents can be brittle to changes in state representation or reward structure.
Decision Guide: When to Choose RL Over Supervised Learning
| Situation | Recommendation |
|---|---|
| Use RL when | Sequential decisions with delayed rewards, no labeled training data, need to balance exploration/exploitation, environment provides a reward signal |
| Avoid RL when | Single-step prediction tasks, abundant labeled data available, real-time inference required (RL training is slow), safety-critical applications with no tolerance for exploration failures |
| Alternative | Supervised learning for classification/regression, imitation learning when expert demonstrations are available, evolutionary algorithms for black-box optimization |
| Edge cases | Sparse rewards need reward shaping, continuous action spaces require policy gradients, multi-agent environments need specialized algorithms (MADDPG) |
Rule of thumb: Use RL when the optimal action depends on future consequences, not just current context. Supervised learning optimizes immediate accuracy; RL optimizes long-term outcomes.
Gymnasium: The Standard Interface for RL Environments
Gymnasium (formerly OpenAI Gym) provides standardized environments for RL research and development. It defines a common interface that works with any RL algorithm:
```python
import gymnasium as gym
import numpy as np
from collections import defaultdict

# Create CartPole environment
# (render_mode="human" opens a window; omit it for faster training)
env = gym.make("CartPole-v1", render_mode="human")

# Simple Q-learning with a discretized approximation of the continuous state
class QLearningAgent:
    def __init__(self, action_size, learning_rate=0.1, discount=0.95, epsilon=0.1):
        self.q_table = defaultdict(lambda: np.zeros(action_size))
        self.lr = learning_rate
        self.gamma = discount
        self.epsilon = epsilon

    def discretize_state(self, state):
        """Convert continuous state to discrete bins for the Q-table"""
        # CartPole state: [position, velocity, angle, angular_velocity]
        bins = [
            np.digitize(state[0], np.linspace(-2.4, 2.4, 10)),
            np.digitize(state[1], np.linspace(-2, 2, 10)),
            np.digitize(state[2], np.linspace(-0.2, 0.2, 10)),
            np.digitize(state[3], np.linspace(-2, 2, 10)),
        ]
        return tuple(bins)

    def choose_action(self, state):
        """ε-greedy action selection"""
        discrete_state = self.discretize_state(state)
        if np.random.random() < self.epsilon:
            return env.action_space.sample()  # Random action (explore)
        return np.argmax(self.q_table[discrete_state])  # Best known action (exploit)

    def update(self, state, action, reward, next_state, done):
        """Q-learning update rule"""
        discrete_state = self.discretize_state(state)
        discrete_next_state = self.discretize_state(next_state)
        # Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]
        current_q = self.q_table[discrete_state][action]
        if done:
            target_q = reward
        else:
            target_q = reward + self.gamma * np.max(self.q_table[discrete_next_state])
        self.q_table[discrete_state][action] += self.lr * (target_q - current_q)

# Training loop
agent = QLearningAgent(action_size=env.action_space.n)
episodes = 1000
scores = []

for episode in range(episodes):
    state, info = env.reset()
    total_reward = 0
    while True:
        action = agent.choose_action(state)
        next_state, reward, terminated, truncated, info = env.step(action)

        # Custom reward shaping for CartPole:
        # reward staying upright and near the center of the track
        angle_reward = 1.0 - abs(next_state[2]) / 0.2
        position_reward = 1.0 - abs(next_state[0]) / 2.4
        shaped_reward = reward + 0.1 * (angle_reward + position_reward)

        agent.update(state, action, shaped_reward, next_state, terminated or truncated)
        state = next_state
        total_reward += reward
        if terminated or truncated:
            break

    scores.append(total_reward)

    # Decay exploration rate
    if agent.epsilon > 0.01:
        agent.epsilon *= 0.995

    if episode % 100 == 0:
        avg_score = np.mean(scores[-100:])
        print(f"Episode {episode}, Average Score: {avg_score:.2f}, ε: {agent.epsilon:.3f}")

env.close()
```
Why Gymnasium Matters: It standardizes the RL interface across environments, from simple control tasks like CartPole to complex robotics simulations like MuJoCo. The same algorithm can work on any Gymnasium-compatible environment with minimal code changes.
Environment Categories:
- Classic Control: CartPole, MountainCar (simple physics)
- Atari: Video games with pixel observations
- MuJoCo: Continuous control robotics simulation
- Custom: Easy to wrap any environment with the Gymnasium API
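The interface contract can be seen without installing anything: a toy environment that mimics the Gymnasium API shape, where reset returns (observation, info) and step returns (observation, reward, terminated, truncated, info). CountdownEnv is invented for this demo; it is not a real Gymnasium environment:

```python
class CountdownEnv:
    """Toy environment exposing the Gymnasium-style API (no gymnasium needed).
    State is a counter; action 1 decrements it, action 0 does nothing.
    The episode terminates at 0 and truncates after 20 steps."""
    def reset(self, seed=None):
        self.state, self.steps = 5, 0
        return self.state, {}  # (observation, info)

    def step(self, action):
        self.steps += 1
        if action == 1:
            self.state -= 1
        terminated = self.state == 0
        truncated = self.steps >= 20
        reward = 1.0 if terminated else 0.0
        return self.state, reward, terminated, truncated, {}

env = CountdownEnv()
obs, info = env.reset()
total = 0.0
while True:
    obs, reward, terminated, truncated, info = env.step(1)  # always decrement
    total += reward
    if terminated or truncated:
        break
print(obs, total)  # 0 1.0
```

Any agent written against this five-tuple step signature will also run against real Gymnasium environments, which is exactly the portability claim above.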
Practical Examples: Deep Q-Network for Continuous State Spaces
Example 1: DQN Implementation
```python
import gymnasium as gym
import tensorflow as tf
import numpy as np
from collections import deque
import random

class DQNAgent:
    def __init__(self, state_size, action_size, learning_rate=0.001):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=10000)  # Experience replay buffer
        self.epsilon = 1.0                 # Exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = learning_rate
        self.gamma = 0.95                  # Discount factor
        # Neural networks for Q-value approximation
        self.q_network = self._build_model()
        self.target_network = self._build_model()
        self.update_target_network()

    def _build_model(self):
        """Build neural network to approximate Q-values"""
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, input_shape=(self.state_size,), activation='relu'),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dense(32, activation='relu'),
            tf.keras.layers.Dense(self.action_size, activation='linear'),
        ])
        model.compile(loss='mse',
                      optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        """Store experience in the replay buffer"""
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        """ε-greedy action selection"""
        if np.random.random() <= self.epsilon:
            return random.randrange(self.action_size)
        q_values = self.q_network.predict(state.reshape(1, -1), verbose=0)
        return np.argmax(q_values[0])

    def replay(self, batch_size=32):
        """Train the model on a batch of experiences"""
        if len(self.memory) < batch_size:
            return
        batch = random.sample(self.memory, batch_size)
        states = np.array([e[0] for e in batch])
        actions = np.array([e[1] for e in batch])
        rewards = np.array([e[2] for e in batch])
        next_states = np.array([e[3] for e in batch])
        dones = np.array([e[4] for e in batch])

        # Current Q-values
        current_q_values = self.q_network.predict(states, verbose=0)
        # Target Q-values using the target network
        next_q_values = self.target_network.predict(next_states, verbose=0)
        max_next_q_values = np.max(next_q_values, axis=1)
        # Bellman target: r + γ max Q(s',a'), with no bootstrap on terminal states
        target_q_values = rewards + self.gamma * max_next_q_values * ~dones

        # Update only the Q-values for the actions taken
        for i in range(batch_size):
            current_q_values[i][actions[i]] = target_q_values[i]

        # Train the network
        self.q_network.fit(states, current_q_values, epochs=1, verbose=0)

        # Decay exploration rate
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def update_target_network(self):
        """Copy weights from the main network to the target network"""
        self.target_network.set_weights(self.q_network.get_weights())

# Training with DQN
env = gym.make("CartPole-v1")
agent = DQNAgent(state_size=4, action_size=2)
episodes = 500
scores = []

for episode in range(episodes):
    state, _ = env.reset()
    total_reward = 0
    for time_step in range(500):
        action = agent.act(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        # Store experience
        agent.remember(state, action, reward, next_state, terminated or truncated)
        state = next_state
        total_reward += reward
        if terminated or truncated:
            break

    scores.append(total_reward)
    # Train the agent once per episode
    agent.replay()

    # Update target network periodically
    if episode % 10 == 0:
        agent.update_target_network()

    if episode % 50 == 0:
        avg_score = np.mean(scores[-50:])
        print(f"Episode {episode}, Average Score: {avg_score:.2f}, ε: {agent.epsilon:.3f}")

env.close()
```
Key Differences from Q-Learning:
- Function approximation: Neural network replaces lookup table
- Experience replay: Training on stored experiences breaks temporal correlation
- Target network: Stable target for Bellman updates prevents divergence
Lessons Learned: Common Pitfalls and Best Practices
Reward Engineering Is Critical: The reward function defines what the agent optimizes for. Poorly designed rewards lead to reward hacking: optimizing the metric without achieving the intended behavior. Always test rewards on edge cases and consider unintended consequences.
Exploration Never Ends: Unlike supervised learning, where more data always helps, RL agents can get worse if they stop exploring. Production RL systems need continuous exploration mechanisms, such as ε-greedy decay with a minimum exploration floor.
Sample Efficiency Matters in Production: RL is data-hungry. DQN needs millions of samples for simple tasks. Use techniques like transfer learning, reward shaping, and curriculum learning to reduce sample complexity. For real-world deployment, consider imitation learning to bootstrap from expert demonstrations.
Environment Assumptions Are Fragile: RL agents are sensitive to changes in state representation, reward structure, and environment dynamics. Production systems need robust monitoring and adaptation mechanisms.
Don't Ignore the Basics: Advanced algorithms like PPO and A3C get attention, but simple Q-learning often works well for discrete problems. Start simple and add complexity only when needed.
Summary & Key Takeaways
- RL learns through trial and error: Agents discover optimal policies by exploring environments and learning from reward feedback, unlike supervised learning which requires labeled examples
- Sequential decision-making is the key insight: RL optimizes for long-term cumulative reward, making it suitable for problems where immediate actions affect future outcomes
- The agent-environment loop drives learning: State β Action β Reward β Next State creates a feedback cycle that enables policy improvement over time
- Q-learning builds a map of action values: The Q-function estimates expected future reward for each state-action pair, enabling optimal decision-making
- Deep RL scales to complex environments: Neural networks can approximate value functions for infinite state spaces, but require careful engineering (experience replay, target networks)
- Applications span games to language models: From AlphaGo mastering Go to RLHF training ChatGPT, RL excels when the optimal strategy emerges from experience rather than examples
- Choose RL when consequences matter more than immediate accuracy: If the optimal action depends on future outcomes and you can tolerate exploration failures, RL is the right approach
Remember: RL is powerful but data-hungry; use it when the problem truly requires sequential decision-making and you can afford the exploration cost.
Practice Quiz
What is the main difference between supervised learning and reinforcement learning?
- A) Supervised learning uses neural networks, RL uses decision trees
- B) Supervised learning learns from labeled examples, RL learns from trial-and-error feedback
- C) Supervised learning is for classification, RL is for regression

Correct Answer: B
In the Q-learning update rule Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)], what does the discount factor γ control?
- A) How fast the agent learns from each experience
- B) How much the agent explores vs exploits current knowledge
- C) How much future rewards are valued compared to immediate rewards

Correct Answer: C
You're building an RL agent for stock trading, but it keeps making extremely risky trades that occasionally pay off big. What's the most likely problem?
- A) Learning rate is too high
- B) Discount factor is too low (focuses only on immediate rewards)
- C) Not enough training data

Correct Answer: B
A robotics company wants to train a robot arm to assemble electronics components. They have expert human demonstrations of the assembly process. Should they use pure RL, supervised learning, or a hybrid approach? Justify your recommendation considering sample efficiency, safety, and performance requirements.
Open-ended answer: A hybrid approach would be optimal. Start with imitation learning (supervised) on the human demonstrations to bootstrap a reasonable policy quickly and safely. Then use RL fine-tuning in simulation to optimize beyond human performance while maintaining safety constraints. Pure RL would be too sample-inefficient and potentially unsafe for physical robots, while pure supervised learning couldn't adapt to variations or optimize for task-specific objectives like speed or precision.
Written by
Abstract Algorithms
@abstractalgorithms