Reinforcement Learning: Agents, Environments, and Rewards in Practice
Trial-and-Error Learning for Sequential Decision Making
TLDR: Reinforcement Learning trains agents to make sequences of decisions by learning from rewards and penalties. Unlike supervised learning, RL learns through trial and error rather than from labeled examples. Use it for sequential decision problems where the optimal action depends on future consequences: robotics, game AI, recommendation systems, and RLHF training for LLMs.
The Trial-and-Error Learning Problem: When Labels Don't Exist
You're training a robot arm, but there's no labeled dataset. It needs to figure out through trial and error how to grip objects without breaking them, learning from thousands of attempts where it either succeeds or drops the object. Traditional supervised learning can't solve this because there's no "correct grip pressure" label for each object; the robot must discover the optimal policy through exploration.
Reinforcement Learning (RL) is machine learning for sequential decision-making problems. An agent takes actions in an environment, receives rewards or penalties, and learns a policy (strategy) to maximize cumulative reward over time. Unlike supervised learning, which learns from input-output pairs, RL learns from the consequences of actions.
| Learning Type | Data Source | Goal | Example |
|---|---|---|---|
| Supervised | Labeled examples | Map inputs to outputs | Email spam classification |
| Unsupervised | Unlabeled patterns | Find hidden structure | Customer segmentation |
| Reinforcement | Action-reward feedback | Maximize cumulative reward | Robot navigation |
The key insight: RL optimizes for long-term outcomes, not immediate correctness. A chess move that looks bad in isolation might be part of a winning strategy; RL agents learn to balance immediate rewards with future potential.
RL Components: The Agent-Environment Feedback Loop
Reinforcement Learning involves five core components interacting in a continuous loop:

- Agent: The learner/decision-maker (e.g., chess AI, trading algorithm, robot controller)
- Environment: The external system the agent interacts with (e.g., chessboard, stock market, physical world)
- State: The current situation or context (e.g., board position, portfolio balance, sensor readings)
- Action: What the agent can do (e.g., move a piece, buy/sell a stock, move a joint)
- Reward: Feedback signal indicating how good the action was (e.g., +1 for a win, profit/loss, task completion)
```mermaid
graph TD
    A[Agent] -->|Action| E[Environment]
    E -->|State, Reward| A
    E -->|Next State| E
    subgraph "RL Components"
        A
        E
        S[State Space]
        R[Reward Function]
        P[Policy π]
    end
    style A fill:#e1f5fe
    style E fill:#f3e5f5
    style S fill:#e8f5e8
    style R fill:#fff3e0
    style P fill:#fce4ec
```
At each time step:
1. Agent observes the current state s_t
2. Agent selects an action a_t based on its policy π(a|s)
3. Environment transitions to new state s_{t+1} and returns reward r_t
4. Agent updates its policy to improve future decisions
Policy π: The strategy that maps states to actions. This is what the agent learns: which action to take in each situation to maximize long-term reward.
Value Function V(s): Expected cumulative reward starting from state s and following policy π. This helps the agent evaluate how "good" different states are.
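The discounted return that V(s) estimates can be computed directly for a finite reward sequence. A minimal stdlib-only sketch (the function name `discounted_return` is chosen here for illustration, not from any library):

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_0 + γ·r_1 + γ²·r_2 + ..., accumulated right-to-left."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three +1 rewards: 1 + 0.9 + 0.81 ≈ 2.71
print(discounted_return([1, 1, 1], gamma=0.9))
```

Shrinking γ toward 0 makes the agent myopic; γ near 1 weights distant rewards almost as much as immediate ones.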
How Q-Learning Discovers Optimal Policies Through Trial and Error
The core challenge in RL is the credit assignment problem: if the agent wins a game after 50 moves, which moves were actually good? Q-Learning addresses this by learning the Q-function, the expected cumulative reward for taking action a in state s and following the optimal policy afterward.
Q-Learning Algorithm:
1. Initialize Q-table Q(s,a) with small random values
2. For each episode:
   - Start in initial state s
   - While not terminal:
     - Choose action a using the ε-greedy policy
     - Execute the action; observe reward r and next state s'
     - Update: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
     - Set s ← s'
Key Parameters:
- α (learning rate): How much to update Q-values each step (typically 0.1-0.5)
- γ (discount factor): How much to value future rewards vs immediate rewards (0.9-0.99)
- ε (exploration rate): Probability of taking a random action vs the best known action
The Bellman Equation captures the core insight: the value of taking action a in state s equals the immediate reward plus the discounted value of the best action in the next state:
Q(s,a) = r + γ max_a' Q(s',a')
This means optimal decisions require looking ahead: sometimes accepting lower immediate rewards for better long-term outcomes.
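A single update from the rule above can be traced numerically. This toy sketch (dict-based Q-table, names chosen for illustration) assumes a second state s1 that already has a known high-value action:

```python
def q_update(q, s, a, r, s_next, alpha=0.5, gamma=0.9, terminal=False):
    """Move Q(s,a) a step of size alpha toward the target r + gamma * max Q(s',·)."""
    target = r if terminal else r + gamma * max(q[s_next].values())
    q[s][a] += alpha * (target - q[s][a])
    return q[s][a]

q = {"s0": {"left": 0.0, "right": 0.0},
     "s1": {"left": 0.0, "right": 10.0}}
# Target = 1 + 0.9 * 10 = 10, so Q(s0, right) moves from 0 to 0.5 * 10 = 5
print(q_update(q, "s0", "right", 1.0, "s1"))  # 5.0
```

Note how the action in s0 partially credits s1's future value: exactly the lookahead the Bellman equation describes.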
Deep Dive: How Modern RL Algorithms Scale Beyond Tabular Methods
The Internals
Tabular Q-Learning stores Q-values in a lookup table, but this breaks down with large state spaces. Chess has on the order of 10^43 possible positions, far too many to store in memory.
Deep Q-Networks (DQN) replace the Q-table with a neural network that approximates Q(s,a). The network takes state s as input and outputs Q-values for all possible actions:
```python
# DQN architecture: state in, one Q-value per action out
state_input = tf.keras.Input(shape=(state_size,))
x = tf.keras.layers.Dense(64, activation='relu')(state_input)
x = tf.keras.layers.Dense(64, activation='relu')(x)
q_values = tf.keras.layers.Dense(action_size, activation='linear')(x)
model = tf.keras.Model(state_input, q_values)
```
Experience Replay: DQN stores (state, action, reward, next_state) transitions in a replay buffer and samples random batches for training. This breaks the correlation between consecutive experiences and improves learning stability.
Target Network: DQN uses a separate target network for computing expected future rewards, updated periodically. This prevents the "moving target" problem where both current and target Q-values change simultaneously.
Performance Analysis
Time Complexity: Tabular Q-learning is O(1) per update but requires O(|S| Γ |A|) memory. DQN is O(n) per forward pass where n is network size, but can handle infinite state spaces.
Sample Efficiency: Q-learning is sample-efficient for small problems but struggles with sparse rewards. Policy gradient methods like REINFORCE directly optimize the policy but require more samples:
```python
# REINFORCE update rule (sketch; policy_model is the policy network)
with tf.GradientTape() as tape:
    log_probs = policy_model.log_prob(actions, states)  # hypothetical helper
    policy_loss = -tf.reduce_mean(log_probs * returns)
grads = tape.gradient(policy_loss, policy_model.trainable_variables)
optimizer.apply_gradients(zip(grads, policy_model.trainable_variables))
```
Exploration vs Exploitation: ε-greedy balances exploration (random actions) with exploitation (greedy actions). Advanced methods use Upper Confidence Bounds (UCB) or Thompson Sampling for smarter exploration.
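The UCB idea can be shown on a toy multi-armed bandit: each arm's score is its average reward plus a bonus that shrinks as the arm is tried more, so under-explored arms get revisited. A stdlib-only sketch (the arm means are made up for the demo):

```python
import math
import random

def ucb_select(counts, values, t, c=2.0):
    """Pick the arm maximizing mean reward + exploration bonus; untried arms first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [values[a] + math.sqrt(c * math.log(t) / counts[a])
              for a in range(len(counts))]
    return max(range(len(counts)), key=lambda a: scores[a])

random.seed(0)
true_means = [0.2, 0.5, 0.8]  # hypothetical Bernoulli arms
counts, values = [0, 0, 0], [0.0, 0.0, 0.0]
for t in range(1, 2001):
    arm = ucb_select(counts, values, t)
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
print(counts)  # the best arm (index 2) accumulates the most pulls
```

Unlike ε-greedy, the bonus term gives a principled reason to revisit uncertain arms rather than exploring uniformly at random.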
Visualizing the RL Learning Process
```mermaid
sequenceDiagram
    participant A as Agent
    participant E as Environment
    participant M as Memory/Replay Buffer
    participant N as Neural Network
    loop Training Episode
        A->>E: action_t
        E->>A: state_{t+1}, reward_t
        A->>M: store (s_t, a_t, r_t, s_{t+1})
        Note over A: Sample mini-batch from M
        A->>N: Update Q-network
        N->>A: Updated Q-values
        Note over A: ε-greedy action selection
    end
    Note over A,N: Policy improves over time
```
This diagram shows how modern RL agents learn:
- Agent takes action in environment
- Environment returns new state and reward
- Experience stored in replay buffer
- Neural network trained on sampled experiences
- Updated policy used for next decision
The key insight: learning happens offline from stored experiences, not just from immediate feedback.
Real-World Applications: From Atari to RLHF
Case Study 1: Game AI Mastery
AlphaGo/AlphaZero used RL to master Go, chess, and shogi without human game data. The agent learned by playing millions of games against itself, starting with random moves and gradually discovering winning strategies.
- Input: Board position (19×19 grid for Go)
- Process: Monte Carlo Tree Search guided by neural network value estimates
- Output: Move selection that maximizes win probability
Scaling insight: Self-play generates effectively unlimited training data but requires massive computational resources (DeepMind used 5,000 TPUs for AlphaZero training).
Case Study 2: LLM RLHF Training
Reinforcement Learning from Human Feedback (RLHF) trains language models to be helpful and harmless. ChatGPT uses RLHF to learn human preferences for response quality.
- Input: Prompt + multiple response candidates
- Process: Human raters rank responses; RL optimizes for higher-rated outputs
- Output: Language model that aligns with human preferences
Operational challenge: Human preferences are noisy and inconsistent. Production systems use Proximal Policy Optimization (PPO) to prevent the policy from deviating too far from the base model.
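The "don't deviate too far" mechanism is PPO's clipped surrogate objective: the probability ratio between the new and old policy is clipped to [1-ε, 1+ε] before being multiplied by the advantage, which removes the incentive for large policy jumps. A per-sample sketch in plain Python (the full PPO loss also includes value and entropy terms, omitted here):

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO-clip surrogate for one sample: caps how much one update can move the policy."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    # We maximize the surrogate, so return its negation as a loss
    return -min(ratio * advantage, clipped * advantage)

print(ppo_clip_loss(1.5, 1.0))   # ratio clipped to 1.2 -> loss -1.2
print(ppo_clip_loss(1.5, -1.0))  # negative advantage is not clipped here -> loss 1.5
```

The min over the raw and clipped terms makes the bound pessimistic: the update never benefits from pushing the ratio outside the trust region.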
Trade-offs & Failure Modes: When RL Goes Wrong
Performance vs Sample Efficiency
RL typically requires orders of magnitude more data than supervised learning. DQN needed 200 million frames to master Atari games, the equivalent of 924 hours of gameplay. Modern methods like Rainbow DQN achieve the same performance with roughly 10x fewer samples through algorithmic improvements.
Exploration vs Exploitation Dilemma
Pure exploitation (always choose best known action) gets stuck in local optima. Pure exploration (random actions) never leverages learned knowledge. The optimal balance depends on problem structure:
- High-stakes environments: Conservative exploitation (financial trading)
- Safety-critical systems: Cautious exploration under explicit safety constraints (robotics)
- Multi-armed bandits: UCB provides theoretical guarantees for exploration-exploitation balance
Common Failure Modes
Reward Hacking: Agent finds unexpected ways to maximize reward that don't align with intended behavior. A cleaning robot might create messes to appear more useful.
Catastrophic Forgetting: Neural network forgets previously learned skills when learning new tasks. Requires techniques like Elastic Weight Consolidation or Progressive Neural Networks.
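Elastic Weight Consolidation's core idea fits in a few lines: add a quadratic penalty that anchors each parameter to its old-task value, weighted by an importance estimate (the diagonal Fisher information). A toy sketch over a flat parameter list (the values are invented for the demo):

```python
def ewc_penalty(params, old_params, fisher, lam=1.0):
    """(λ/2) Σ F_i (θ_i - θ*_i)²: important weights (large F_i) are pulled back hardest."""
    return (lam / 2) * sum(f * (p - p0) ** 2
                           for p, p0, f in zip(params, old_params, fisher))

# First parameter drifted by 1.0 with importance 4; second did not move
print(ewc_penalty([1.0, 2.0], [0.0, 2.0], [4.0, 4.0]))  # 0.5 * 4 * 1 = 2.0
```

Training on the new task then minimizes task_loss + this penalty, so weights the old task barely used stay free to change while important ones resist overwriting.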
Distribution Shift: Performance degrades when deployment environment differs from training. RL agents can be brittle to changes in state representation or reward structure.
Decision Guide: When to Choose RL Over Supervised Learning
| Situation | Recommendation |
|---|---|
| Use RL when | Sequential decisions with delayed rewards, no labeled training data, need to balance exploration/exploitation, environment provides a reward signal |
| Avoid RL when | Single-step prediction tasks, abundant labeled data available, real-time inference required (RL training is slow), safety-critical applications with no tolerance for exploration failures |
| Alternative | Supervised learning for classification/regression, imitation learning when expert demonstrations are available, evolutionary algorithms for black-box optimization |
| Edge cases | Sparse rewards need reward shaping, continuous action spaces require policy gradients, multi-agent environments need specialized algorithms (MADDPG) |
Rule of thumb: Use RL when the optimal action depends on future consequences, not just current context. Supervised learning optimizes immediate accuracy; RL optimizes long-term outcomes.
Gymnasium: The Standard Interface for RL Environments
Gymnasium (formerly OpenAI Gym) provides standardized environments for RL research and development. It defines a common interface that works with any RL algorithm:
```python
import gymnasium as gym
import numpy as np
from collections import defaultdict

# Create CartPole environment
# (render_mode="human" opens a window; omit it for faster training)
env = gym.make("CartPole-v1", render_mode="human")

# Simple Q-learning with a discretized approximation of the continuous state
class QLearningAgent:
    def __init__(self, action_size, learning_rate=0.1, discount=0.95, epsilon=0.1):
        self.q_table = defaultdict(lambda: np.zeros(action_size))
        self.lr = learning_rate
        self.gamma = discount
        self.epsilon = epsilon

    def discretize_state(self, state):
        """Convert continuous state to discrete bins for the Q-table"""
        # CartPole state: [position, velocity, angle, angular_velocity]
        bins = [
            np.digitize(state[0], np.linspace(-2.4, 2.4, 10)),
            np.digitize(state[1], np.linspace(-2, 2, 10)),
            np.digitize(state[2], np.linspace(-0.2, 0.2, 10)),
            np.digitize(state[3], np.linspace(-2, 2, 10)),
        ]
        return tuple(bins)

    def choose_action(self, state):
        """ε-greedy action selection"""
        discrete_state = self.discretize_state(state)
        if np.random.random() < self.epsilon:
            return env.action_space.sample()  # Random action (explore)
        return np.argmax(self.q_table[discrete_state])  # Best known action (exploit)

    def update(self, state, action, reward, next_state, done):
        """Q-learning update rule"""
        discrete_state = self.discretize_state(state)
        discrete_next_state = self.discretize_state(next_state)
        # Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]
        current_q = self.q_table[discrete_state][action]
        if done:
            target_q = reward
        else:
            target_q = reward + self.gamma * np.max(self.q_table[discrete_next_state])
        self.q_table[discrete_state][action] += self.lr * (target_q - current_q)

# Training loop
agent = QLearningAgent(action_size=env.action_space.n)
episodes = 1000
scores = []

for episode in range(episodes):
    state, info = env.reset()
    total_reward = 0
    while True:
        action = agent.choose_action(state)
        next_state, reward, terminated, truncated, info = env.step(action)

        # Custom reward shaping for CartPole:
        # reward staying upright and near the center of the track
        angle_reward = 1.0 - abs(next_state[2]) / 0.2
        position_reward = 1.0 - abs(next_state[0]) / 2.4
        shaped_reward = reward + 0.1 * (angle_reward + position_reward)

        agent.update(state, action, shaped_reward, next_state, terminated or truncated)
        state = next_state
        total_reward += reward
        if terminated or truncated:
            break

    scores.append(total_reward)

    # Decay exploration rate
    if agent.epsilon > 0.01:
        agent.epsilon *= 0.995

    if episode % 100 == 0:
        avg_score = np.mean(scores[-100:])
        print(f"Episode {episode}, Average Score: {avg_score:.2f}, ε: {agent.epsilon:.3f}")

env.close()
```
Why Gymnasium Matters: It standardizes the RL interface across environments, from simple control tasks like CartPole to complex robotics simulations like MuJoCo. The same algorithm can work on any Gymnasium-compatible environment with minimal code changes.
Environment Categories:
- Classic Control: CartPole, MountainCar (simple physics)
- Atari: Video games with pixel observations
- MuJoCo: Continuous control robotics simulation
- Custom: Easy to wrap any environment with the Gymnasium API
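The interface contract can be seen without installing anything: a toy environment that mimics the Gymnasium API shape, where reset returns (observation, info) and step returns (observation, reward, terminated, truncated, info). CountdownEnv is invented for this demo; it is not a real Gymnasium environment:

```python
class CountdownEnv:
    """Toy environment exposing the Gymnasium-style API (no gymnasium needed).
    State is a counter; action 1 decrements it, action 0 does nothing.
    The episode terminates at 0 and truncates after 20 steps."""
    def reset(self, seed=None):
        self.state, self.steps = 5, 0
        return self.state, {}  # (observation, info)

    def step(self, action):
        self.steps += 1
        if action == 1:
            self.state -= 1
        terminated = self.state == 0
        truncated = self.steps >= 20
        reward = 1.0 if terminated else 0.0
        return self.state, reward, terminated, truncated, {}

env = CountdownEnv()
obs, info = env.reset()
total = 0.0
while True:
    obs, reward, terminated, truncated, info = env.step(1)  # always decrement
    total += reward
    if terminated or truncated:
        break
print(obs, total)  # 0 1.0
```

Any agent written against this five-tuple step signature will also run against real Gymnasium environments, which is exactly the portability claim above.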
Practical Examples: Deep Q-Network for Continuous State Spaces
Example 1: DQN Implementation
```python
import gymnasium as gym
import tensorflow as tf
import numpy as np
from collections import deque
import random

class DQNAgent:
    def __init__(self, state_size, action_size, learning_rate=0.001):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=10000)  # Experience replay buffer
        self.epsilon = 1.0                 # Exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = learning_rate
        self.gamma = 0.95                  # Discount factor
        # Neural networks for Q-value approximation
        self.q_network = self._build_model()
        self.target_network = self._build_model()
        self.update_target_network()

    def _build_model(self):
        """Build neural network to approximate Q-values"""
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, input_shape=(self.state_size,), activation='relu'),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dense(32, activation='relu'),
            tf.keras.layers.Dense(self.action_size, activation='linear'),
        ])
        model.compile(loss='mse',
                      optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        """Store experience in the replay buffer"""
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        """ε-greedy action selection"""
        if np.random.random() <= self.epsilon:
            return random.randrange(self.action_size)
        q_values = self.q_network.predict(state.reshape(1, -1), verbose=0)
        return np.argmax(q_values[0])

    def replay(self, batch_size=32):
        """Train the model on a batch of experiences"""
        if len(self.memory) < batch_size:
            return
        batch = random.sample(self.memory, batch_size)
        states = np.array([e[0] for e in batch])
        actions = np.array([e[1] for e in batch])
        rewards = np.array([e[2] for e in batch])
        next_states = np.array([e[3] for e in batch])
        dones = np.array([e[4] for e in batch])

        # Current Q-values
        current_q_values = self.q_network.predict(states, verbose=0)
        # Target Q-values using the target network
        next_q_values = self.target_network.predict(next_states, verbose=0)
        max_next_q_values = np.max(next_q_values, axis=1)
        # Bellman target: r + γ max Q(s',a'), with no bootstrap on terminal states
        target_q_values = rewards + self.gamma * max_next_q_values * ~dones

        # Update only the Q-values for the actions taken
        for i in range(batch_size):
            current_q_values[i][actions[i]] = target_q_values[i]

        # Train the network
        self.q_network.fit(states, current_q_values, epochs=1, verbose=0)

        # Decay exploration rate
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def update_target_network(self):
        """Copy weights from the main network to the target network"""
        self.target_network.set_weights(self.q_network.get_weights())

# Training with DQN
env = gym.make("CartPole-v1")
agent = DQNAgent(state_size=4, action_size=2)
episodes = 500
scores = []

for episode in range(episodes):
    state, _ = env.reset()
    total_reward = 0
    for time_step in range(500):
        action = agent.act(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        # Store experience
        agent.remember(state, action, reward, next_state, terminated or truncated)
        state = next_state
        total_reward += reward
        if terminated or truncated:
            break

    scores.append(total_reward)
    # Train the agent once per episode
    agent.replay()

    # Update target network periodically
    if episode % 10 == 0:
        agent.update_target_network()

    if episode % 50 == 0:
        avg_score = np.mean(scores[-50:])
        print(f"Episode {episode}, Average Score: {avg_score:.2f}, ε: {agent.epsilon:.3f}")

env.close()
```
Key Differences from Q-Learning:
- Function approximation: Neural network replaces lookup table
- Experience replay: Training on stored experiences breaks temporal correlation
- Target network: Stable target for Bellman updates prevents divergence
Lessons Learned: Common Pitfalls and Best Practices
Reward Engineering Is Critical: The reward function defines what the agent optimizes for. Poorly designed rewards lead to reward hacking: optimizing the metric without achieving the intended behavior. Always test rewards on edge cases and consider unintended consequences.
Exploration Never Ends: Unlike supervised learning, where more data always helps, RL agents can get worse if they stop exploring. Production RL systems need continuous exploration mechanisms, such as ε-greedy decay with a minimum exploration floor.
Sample Efficiency Matters in Production: RL is data-hungry. DQN needs millions of samples for simple tasks. Use techniques like transfer learning, reward shaping, and curriculum learning to reduce sample complexity. For real-world deployment, consider imitation learning to bootstrap from expert demonstrations.
Environment Assumptions Are Fragile: RL agents are sensitive to changes in state representation, reward structure, and environment dynamics. Production systems need robust monitoring and adaptation mechanisms.
Don't Ignore the Basics: Advanced algorithms like PPO and A3C get attention, but simple Q-learning often works well for discrete problems. Start simple and add complexity only when needed.
Summary & Key Takeaways
- RL learns through trial and error: Agents discover optimal policies by exploring environments and learning from reward feedback, unlike supervised learning which requires labeled examples
- Sequential decision-making is the key insight: RL optimizes for long-term cumulative reward, making it suitable for problems where immediate actions affect future outcomes
- The agent-environment loop drives learning: State β Action β Reward β Next State creates a feedback cycle that enables policy improvement over time
- Q-learning builds a map of action values: The Q-function estimates expected future reward for each state-action pair, enabling optimal decision-making
- Deep RL scales to complex environments: Neural networks can approximate value functions for infinite state spaces, but require careful engineering (experience replay, target networks)
- Applications span games to language models: From AlphaGo mastering Go to RLHF training ChatGPT, RL excels when the optimal strategy emerges from experience rather than examples
- Choose RL when consequences matter more than immediate accuracy: If the optimal action depends on future outcomes and you can tolerate exploration failures, RL is the right approach
Remember: RL is powerful but data-hungry; use it when the problem truly requires sequential decision-making and you can afford the exploration cost.
Practice Quiz
What is the main difference between supervised learning and reinforcement learning?
- A) Supervised learning uses neural networks, RL uses decision trees
- B) Supervised learning learns from labeled examples, RL learns from trial-and-error feedback
- C) Supervised learning is for classification, RL is for regression

Correct Answer: B
In the Q-learning update rule Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)], what does the discount factor γ control?
- A) How fast the agent learns from each experience
- B) How much the agent explores vs exploits current knowledge
- C) How much future rewards are valued compared to immediate rewards

Correct Answer: C
You're building an RL agent for stock trading, but it keeps making extremely risky trades that occasionally pay off big. What's the most likely problem?
- A) Learning rate is too high
- B) Discount factor is too low (focuses only on immediate rewards)
- C) Not enough training data

Correct Answer: B
A robotics company wants to train a robot arm to assemble electronics components. They have expert human demonstrations of the assembly process. Should they use pure RL, supervised learning, or a hybrid approach? Justify your recommendation considering sample efficiency, safety, and performance requirements.
Open-ended answer: A hybrid approach would be optimal. Start with imitation learning (supervised) on the human demonstrations to bootstrap a reasonable policy quickly and safely. Then use RL fine-tuning in simulation to optimize beyond human performance while maintaining safety constraints. Pure RL would be too sample-inefficient and potentially unsafe for physical robots, while pure supervised learning couldn't adapt to variations or optimize for task-specific objectives like speed or precision.
Written by
Abstract Algorithms
@abstractalgorithms