
Types of LLM Quantization: By Timing, Scope, and Mapping

PTQ, QAT, INT8, INT4, and NF4 explained through timing, scope, and mapping choices.

Abstract Algorithms · 16 min read

Intermediate

For developers with some experience. Builds on fundamentals.

Estimated read time: 16 min

AI-assisted content.

TLDR: There is no single "best" LLM quantization. You classify and choose quantization along three axes: when you quantize (timing), what you quantize (scope), and how values are encoded (mapping). In practice, most teams start with weight quantization, then add activation quantization when needed.


📖 Quantization Is a Design Space, Not One Switch

Deploying LLaMA-3-70B at full fp16 precision requires approximately 140GB of VRAM: two A100 80GB GPUs at roughly $8–12K/month in cloud GPU rental. At Q4_K_M quantization (roughly 4-bit weights), the same model shrinks to about 40GB and fits on a single 48GB workstation GPU, or split across two 24GB consumer cards. The measured quality difference on most standard benchmarks? Under 2%.

That is the practical upside, but "just quantize it to 4-bit" is not a strategy; it is a gamble. Teams that apply the wrong quantization type to the wrong model component regularly discover quality regressions in production that never appeared in offline evals.

Here is a concrete picture of the memory trade-off:

| Model | Precision | VRAM required | Approximate hardware |
|---|---|---|---|
| LLaMA-3-70B | fp16 | ~140 GB | 2× A100 80GB |
| LLaMA-3-70B | INT8 | ~70 GB | 1× A100 80GB |
| LLaMA-3-70B | Q4_K_M | ~40 GB | 1× 48GB GPU (e.g., A6000) or 2× RTX 3090 |
| LLaMA-3-8B | Q4_K_M | ~4.5 GB | Consumer laptop GPU |
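These numbers can be sanity-checked with simple arithmetic. The sketch below assumes a flat bits-per-weight count and a ~10% overhead factor for scales, zero-points, and runtime buffers; real formats such as Q4_K_M use slightly more than 4 bits per weight, so treat the output as an estimate.

```python
def estimate_weight_vram_gb(n_params_billion: float, bits_per_weight: float,
                            overhead: float = 1.1) -> float:
    """Rough weight-memory estimate: params * bits / 8, scaled by an
    assumed ~10% overhead for quantization metadata and buffers."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

# 70B at fp16 -> ~154 GB with overhead; at 4-bit -> ~38.5 GB
print(round(estimate_weight_vram_gb(70, 16), 1))
print(round(estimate_weight_vram_gb(70, 4), 1))
```

The same function explains why an 8B model at 4 bits lands in laptop territory: 8 × 4 / 8 ≈ 4 GB before overhead.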

The right question is not "should I quantize?" but "which quantization type fits my latency, memory, and quality budget?"

This post uses a taxonomy approach. Instead of memorizing tool names, classify every method by:

  • By Timing: when low precision enters the pipeline.
  • By Scope: which tensors or components are quantized.
  • By Mapping: how float values are mapped to low-bit representations.

| If your main pain is... | Your first axis to optimize |
|---|---|
| Model does not fit GPU/edge memory | Scope (start with weights) |
| Cost per token is too high | Scope + Mapping |
| Accuracy regression after PTQ | Timing (move toward QAT) |
| Latency remains high despite smaller model | Mapping + kernel compatibility |

๐Ÿ” Three Classification Axes You Can Apply to Any Quantized LLM

Before selecting a library or hardware backend, use this compact classifier:

| Axis | Core question | Common options | Typical impact |
|---|---|---|---|
| Timing | When do we apply quantization? | PTQ, QAT | Accuracy retention vs implementation effort |
| Scope | What model parts are quantized? | Weights-only, weights+activations, KV cache | Memory and throughput |
| Mapping | How are floats represented in low bits? | Symmetric, asymmetric, non-uniform (NF4) | Error profile and hardware efficiency |

Use timing for lifecycle decisions, scope for memory/bandwidth impact, and mapping for error behavior.

📊 PTQ vs QAT Taxonomy

flowchart LR
    subgraph PTQ["Post-Training Quantization (PTQ)"]
        P1["Trained FP16 Model"]
        P2["Calibration Dataset (representative prompts)"]
        P3["Quantize Weights (layer by layer)"]
        P4["Deploy 4-bit / INT8 Model"]
        P1 --> P2 --> P3 --> P4
    end

    subgraph QAT["Quantization-Aware Training (QAT)"]
        Q1["FP16 Model or Adapter"]
        Q2["Simulate Low-Precision During Fine-Tuning"]
        Q3["Weights Adapt to Quantization Noise"]
        Q4["Deploy with Better Quality Retention"]
        Q1 --> Q2 --> Q3 --> Q4
    end

    PTQ -->|"Quality drops?"| QAT

This flowchart compares the two primary quantization timing strategies side by side. PTQ starts from an already-trained FP16 model, runs a small calibration dataset to determine quantization parameters layer by layer, and produces a 4-bit or INT8 model ready for deployment, requiring no gradient updates. QAT instead embeds simulated low-precision arithmetic into the fine-tuning loop so that weights adapt to quantization noise before deployment, yielding better quality retention at the same bit width. The arrow from PTQ to QAT signals the recommended workflow: try PTQ first for speed and simplicity, then escalate to QAT only if the quality drop is unacceptable.

📊 Scope: Per-Tensor vs Per-Channel vs Per-Token

flowchart TD
    Quant["Quantization Scope Decision"]
    WeightOnly["Weights-Only (most linear layers)"]
    WeightAct["Weights + Activations (higher throughput)"]
    KVCache["+ KV Cache (long-context memory)"]
    Mixed["Selective / Mixed (sensitive layers = BF16)"]

    Quant --> WeightOnly
    WeightOnly -->|"Need more memory saving"| WeightAct
    WeightAct -->|"Long-context prompts"| KVCache
    WeightOnly -->|"Quality sensitive layers"| Mixed

    subgraph Granularity["Quantization Granularity"]
        PerTensor["Per-Tensor (one scale for whole tensor)"]
        PerChannel["Per-Channel (scale per output channel)"]
        PerToken["Per-Token (scale per token, activations)"]
    end

    WeightOnly --> Granularity

This flowchart maps the quantization scope decision tree, starting from the default weights-only path and escalating to wider coverage as memory or quality constraints demand. Weights-only quantization covers most linear layers; adding activations increases throughput at higher compression; appending KV-cache quantization targets long-context memory pressure. The Granularity subgraph shows how scale granularity (per-tensor, per-channel, or per-token) trades implementation overhead for accuracy preservation at each scope level. The key takeaway is that scope is a dial: start narrow and widen only when benchmarks show you must.


โš™๏ธ By Timing: PTQ vs QAT

Timing answers when quantization appears during the model lifecycle.

Post-Training Quantization (PTQ)

PTQ quantizes an already trained model. You do not retrain from scratch.

  • Fastest path to deployment.
  • Good first step for most LLM serving workloads.

PTQ can be static (calibrated once) or dynamic (activation scale computed at runtime in some setups).
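As a concrete illustration of the dynamic variant, PyTorch ships a dynamic quantization API that converts linear layers to INT8 weights while computing activation scales at runtime. This sketch applies it to a toy MLP stand-in, not a real LLM:

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer MLP block (not a real LLM).
model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))

# Dynamic PTQ: weights are quantized to INT8 ahead of time; activation
# scales are computed on the fly at inference, so no calibration set is needed.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(qmodel(x).shape)  # torch.Size([1, 512])
```

Static PTQ, by contrast, fixes activation scales from a calibration pass, which is what the LLM-focused libraries later in this post do at the weight level.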

Quantization-Aware Training (QAT)

QAT simulates low-precision behavior during fine-tuning/training so weights adapt to quantization noise.

  • Better quality retention when PTQ degrades important tasks.
  • Requires a cleaner data and eval pipeline.

| Timing type | Best when | Main risk | Typical owner |
|---|---|---|---|
| PTQ | You need speed and lower infra cost now | Quality drops on sensitive tasks | Inference/platform team |
| QAT | PTQ quality is below product threshold | Extra tuning cycles and GPU cost | Model + platform collaboration |

For most teams: PTQ first, QAT only when validation says PTQ is not enough.


โš™๏ธ By Scope: Which Parts of the LLM Get Quantized

Scope determines where quantization is applied in the model and runtime path.

| Scope option | What is quantized | Memory gain | Accuracy risk | Notes |
|---|---|---|---|---|
| Weights-only | Model parameters | High | Low to medium | Most common first step |
| Weights + activations | Parameters + runtime activations | Higher | Medium | Better throughput potential |
| Weights + activations + KV cache | Adds cache compression | Very high | Medium to high | Long-context quality needs careful testing |
| Selective/mixed scope | Some layers kept high precision | Medium to high | Lower | Practical compromise |

Common pattern: quantize most linear layer weights first, keep sensitive heads in higher precision, then add activations only after quality baselines pass.
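One way to express this selective pattern with bitsandbytes is the `llm_int8_skip_modules` option on `BitsAndBytesConfig`, which excludes named modules from quantization. The module name below (`lm_head`) is a common choice but model-specific; check `model.named_modules()` for your architecture.

```python
from transformers import BitsAndBytesConfig

# Mixed-scope sketch: quantize most linear layers to INT8 but keep named
# sensitive modules (here the output head) in higher precision.
cfg = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["lm_head"],
)
# Pass `quantization_config=cfg` to AutoModelForCausalLM.from_pretrained(...)
```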


โš™๏ธ By Mapping: How Float Values Become Low-Bit Values

Mapping defines the numeric transformation from float tensors to low-bit formats.

Symmetric mapping

Values are centered around zero, typically with a single scale.

  • Simpler and often faster.
  • Works well when tensor distributions are roughly zero-centered.

Asymmetric mapping

Uses scale plus zero-point, allowing shifted ranges.

  • Better fit for non-zero-centered distributions.
  • Slightly more metadata/handling complexity.

Non-uniform mapping (example: NF4)

Not all quantization levels are equally spaced.

  • Better alignment with weight distributions in some LLMs.
  • Common in 4-bit weight quantization pipelines.

| Mapping type | Formula style | Hardware friendliness | Typical use |
|---|---|---|---|
| Symmetric | q = round(x / s) | High | INT8 weight or activation paths |
| Asymmetric | q = round(x / s) + z | High | INT8 with offset-friendly runtimes |
| Non-uniform | Codebook/learned bins | Medium | 4-bit LLM weight quantization |

If two methods use the same bit width but different mapping, quality can differ significantly.
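A small NumPy sketch makes the difference concrete: on a deliberately non-zero-centered distribution, asymmetric mapping reconstructs with lower error than symmetric mapping at the same bit width. The function names and the 8-bit setting are illustrative, not any library's API.

```python
import numpy as np

def symmetric_q(x, bits=8):
    s = np.max(np.abs(x)) / (2 ** (bits - 1) - 1)   # one scale, zero-centered
    q = np.clip(np.round(x / s), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q * s                                     # dequantized values

def asymmetric_q(x, bits=8):
    s = (x.max() - x.min()) / (2 ** bits - 1)        # scale over the true range
    z = np.round(-x.min() / s)                       # zero-point shifts the range
    q = np.clip(np.round(x / s) + z, 0, 2 ** bits - 1)
    return (q - z) * s

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=10_000)      # non-zero-centered tensor
err_sym = np.mean((x - symmetric_q(x)) ** 2)
err_asym = np.mean((x - asymmetric_q(x)) ** 2)
print(err_asym < err_sym)  # True: asymmetric fits the shifted range better
```

Non-uniform mappings like NF4 go one step further and place the levels themselves unevenly, via a codebook, to match the bell-shaped weight distributions common in LLMs.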


⚙️ Weight Quantization vs Activation Quantization

These are the two most widely discussed practical approaches.

Weight quantization

Weight quantization compresses model parameters (usually linear layers).

Why it is popular:

  • Big memory savings with manageable quality impact.
  • Often enough to move from "cannot deploy" to "production feasible."

Typical setup:

  • 8-bit (safer) or 4-bit (more aggressive) weights.
  • Per-channel or group-wise scales.
  • Optional selective high precision for sensitive layers.
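Group-wise scaling can be sketched in a few lines of NumPy. The group size of 64 and the symmetric INT4 range are illustrative choices, not a specific library's defaults; real pipelines commonly use group sizes of 32 to 128.

```python
import numpy as np

def groupwise_int4(w, group=64):
    """Group-wise symmetric 4-bit quantize/dequantize: one scale per
    `group` consecutive weights, so outliers only distort their own group."""
    flat = w.reshape(-1, group)
    scales = np.max(np.abs(flat), axis=1, keepdims=True) / 7  # int4 range [-8, 7]
    q = np.clip(np.round(flat / scales), -8, 7)
    return (q * scales).reshape(w.shape)

rng = np.random.default_rng(1)
w = rng.normal(size=(256, 256)).astype(np.float32)
w_hat = groupwise_int4(w)
print(float(np.abs(w - w_hat).max()) < 0.5)  # per-group scales bound the error
```

The finer the group, the tighter the error bound, at the cost of storing more scale metadata.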

Activation quantization

Activation quantization compresses intermediate runtime tensors produced during inference.

Why teams use it:

  • Further reduces bandwidth and memory traffic.

Why teams delay it:

  • More sensitive to input distribution shifts.
  • Requires strong eval coverage (long context, tool calling, domain prompts).

| Approach | Biggest benefit | Biggest challenge | Good default order |
|---|---|---|---|
| Weight quantization | Large memory reduction | Layer sensitivity at low bits | Start here |
| Activation quantization | Extra speed and memory gains | Runtime distribution sensitivity | Add second |

In short: weight quantization is the baseline optimization; activation quantization is the scaling optimization.


🧠 Deep Dive: Why Timing, Scope, and Mapping Interact Inside the Runtime

The internals

At inference time, quantized LLMs run through three hidden mechanisms:

  1. Tensor representation changes: weights/activations stored in low-bit format with scales.
  2. Kernel path changes: runtime chooses quantized GEMM kernels if available.
  3. Rescaling/dequantization points: outputs are rescaled at specific boundaries.

Small scope or mapping changes can push execution to a different kernel path, so "smaller model" does not always mean lower latency.

Mathematical model (lightweight)

A common affine quantization mapping is:

$$ q = \text{clip}\left(\text{round}\left(\frac{x}{s}\right) + z, q_{\min}, q_{\max}\right) $$

$$ \hat{x} = s \cdot (q - z) $$

The reconstruction error is:

$$ e = x - \hat{x} $$

Your deployment goal is to keep e small enough that task-level metrics stay within budget.
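The two formulas above translate directly into code. This sketch checks that, with a well-chosen scale and zero-point, the reconstruction error stays within half a quantization step (the scale and zero-point values are illustrative):

```python
import numpy as np

def quantize(x, s, z, qmin, qmax):
    return np.clip(np.round(x / s) + z, qmin, qmax)   # q = clip(round(x/s) + z)

def dequantize(q, s, z):
    return s * (q - z)                                # x_hat = s * (q - z)

x = np.linspace(-1.0, 2.0, 7)
s, z = 3.0 / 255, 85                                  # INT8 range [0, 255] covers [-1, 2]
q = quantize(x, s, z, 0, 255)
e = x - dequantize(q, s, z)
print(np.max(np.abs(e)) <= s / 2 + 1e-9)              # True: rounding error <= s/2
```

When values fall outside the representable range, the clip term adds clipping error on top of rounding error, which is why scale selection (calibration) matters.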

Performance analysis

For decoder-only LLMs, per-layer compute class stays similar to the unquantized path, but constants change:

  • Time complexity trend: operation class stays similar, but lower precision often improves throughput through lower memory transfer.
  • Space complexity trend: parameter memory roughly scales with bit width (FP16 to INT8 is about 2x smaller; FP16 to 4-bit is about 4x smaller before metadata overhead).
  • Bottlenecks: memory bandwidth, unsupported kernels, and dequantization overhead can limit gains.

| Change | Usually improves | Can regress when |
|---|---|---|
| Lower weight precision | Memory footprint, model fit | Kernel path is not optimized |
| Activation quantization | Throughput, memory traffic | Calibration misses production distribution |
| More aggressive mapping | Compression ratio | Quantization error hurts key tasks |

🔬 Internals

Post-training quantization (PTQ) maps FP32/BF16 weights to lower-bit representations using a calibration dataset to determine optimal scale and zero-point per layer. Quantization-aware training (QAT) simulates quantization noise during forward passes using straight-through estimators, allowing gradients to flow through the discretization step. Weight-only quantization (e.g., GPTQ) quantizes weights but keeps activations in FP16; activation quantization (e.g., SmoothQuant) quantizes both, requiring per-channel rescaling to handle outlier activations.
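The straight-through estimator mentioned above can be sketched in PyTorch: the forward pass sees discretized values, while gradients bypass the non-differentiable rounding as if quantization were the identity. This is a minimal illustration, not a production QAT recipe.

```python
import torch

def fake_quant_ste(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """QAT-style fake quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1                        # int4: qmax = 7
    s = x.detach().abs().max() / qmax                 # per-tensor symmetric scale
    x_q = torch.clamp(torch.round(x / s), -qmax - 1, qmax) * s
    return x + (x_q - x).detach()                     # forward = x_q, grad flows via x

w = torch.randn(8, requires_grad=True)
loss = fake_quant_ste(w).sum()
loss.backward()
print(torch.allclose(w.grad, torch.ones_like(w)))     # True: identity gradient
```

Because the backward pass ignores the rounding step, training can keep adjusting weights that the forward pass snaps to discrete levels.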

⚡ Performance Analysis

INT8 weight-only quantization (LLM.int8()) cuts memory in half with <0.5% accuracy loss on most benchmarks. INT4 PTQ (GPTQ/AWQ) achieves 4× compression with ~1–2% accuracy degradation. QAT INT4 can approach FP16 quality but requires 10–20% additional training compute, justified only for edge deployment where inference cost dominates. Activation quantization with SmoothQuant enables INT8 inference on 175B models at 1.56× throughput improvement over FP16.

📊 Visualizing a Quantization Strategy Flow

flowchart TD
    A["Start with FP16 or BF16 LLM"] --> B["Choose Timing Axis"]
    B --> C{"PTQ or QAT?"}
    C -->|PTQ| D["Select Scope: Weights Only"]
    C -->|QAT| E["Train with Quantization Simulation"]
    D --> F["Select Mapping: Symmetric / Asymmetric / NF4"]
    E --> F
    F --> G["Benchmark Memory, Latency, Quality"]
    G --> H{"Targets met?"}
    H -- No --> I["Expand Scope or Adjust Mapping"]
    I --> G
    H -- Yes --> J["Canary Deploy + Fallback"]
    J --> K["Production Rollout"]

This flowchart shows the complete quantization strategy selection process from a pre-trained model to production deployment. After choosing between PTQ, which flows immediately into scope and mapping selection, and QAT, which requires a fine-tuning pass with simulated quantization noise, both paths converge at a benchmarking gate covering memory, latency, and quality. If targets are not met, the loop expands scope or adjusts the numerical mapping; once satisfied, a canary deployment validates stability before full rollout. The key takeaway is that quantization is iterative: commit to production only after the benchmark gate passes.


๐ŸŒ Real-World Applications: Input, Process, Output

Case study 1: Customer support assistant on shared GPUs

| Stage | Details |
|---|---|
| Input | Multilingual support prompts, medium context, strict p95 latency |
| Process | PTQ + INT8 weights first, then selective activation quantization on stable layers |
| Output | Lower memory usage, better concurrency, acceptable quality drift |

Case study 2: Long-context legal assistant

| Stage | Details |
|---|---|
| Input | Long-context legal prompts with domain terms |
| Process | Weight-only 4-bit in most layers; output head and selected attention blocks kept in BF16 |
| Output | Model fits target hardware, but long-context eval required extra iteration |

Both cases succeed by sequencing decisions across timing, scope, and mapping instead of maximizing compression immediately.


โš–๏ธ Trade-offs & Failure Modes: Trade-offs and Failure Modes You Should Expect

| Trade-off or failure mode | What it looks like in production | Mitigation |
|---|---|---|
| Memory saved, quality drops | Answers remain fluent but become less accurate | Add task-specific eval thresholds before rollout |
| Low-bit model, no latency gain | Smaller model but unchanged p95 | Validate backend kernel support early |
| Activation drift | Good offline metrics, bad real-traffic performance | Use representative calibration and shadow traffic |
| Over-quantized sensitive layers | Hallucinations or format breakage in structured tasks | Keep selective layers in higher precision |
| Aggressive scope change | Improvements on average, poor long-tail reliability | Canary release with automated rollback |

Intermediate-level rule: do not ship quantization based on memory metrics alone. Always include task-quality and tail-latency checks.


🧭 Decision Guide: Choosing by Constraint

| Situation | Recommendation |
|---|---|
| Use when | Start with PTQ + weight quantization (INT8 or safe 4-bit) when memory and cost are immediate problems. |
| Avoid when | Avoid activation quantization as the first move if you do not have production-like calibration/eval data. |
| Alternative | Use mixed precision: quantize most layers, keep sensitive modules in higher precision. |
| Edge cases | For long context, tool use, or strict JSON output, run dedicated eval suites before full rollout. |

If deployment is blocked by memory, optimize scope first. If quality fails after PTQ, revisit timing (QAT). If two same-bit methods differ, inspect mapping and kernel support.
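The guide above can be condensed into a simple lookup; the keys and labels are this post's taxonomy, not any library's API.

```python
def first_axis_to_optimize(pain: str) -> str:
    """Map a deployment pain point to the quantization axis to tune first."""
    guide = {
        "model_does_not_fit": "scope: start with weight-only quantization",
        "cost_per_token_too_high": "scope + mapping: wider coverage, lower bits",
        "accuracy_drop_after_ptq": "timing: move toward QAT",
        "latency_still_high": "mapping + kernel compatibility",
    }
    return guide.get(pain, "profile first, then pick an axis")

print(first_axis_to_optimize("accuracy_drop_after_ptq"))
```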


🧪 Practical Examples: Weight-First and Activation-Aware Paths

This example demonstrates the most common production starting point: a weight-first path using 4-bit NF4 loading. The activation-aware path, which extends quantization to activations and the KV cache for long-context workloads, is the next logical escalation when memory pressure remains after weight-only quantization. As you read the code, focus on where quantization is applied in the pipeline: which layers are quantized, what data type is used for compute, and which components remain in higher precision.

Example 1: Weight quantization with 4-bit loading

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_cfg,
    device_map="auto",
)

prompt = "List three trade-offs of LLM quantization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=80)

print(tokenizer.decode(output[0], skip_special_tokens=True))

What this demonstrates: a weight-first quantization strategy that is widely used for fast prototyping and production pilots.

Activation quantization in full LLM stacks is backend-dependent. Apply it after a weight-only baseline passes, then validate with production-like prompts, long-context tests, and structured-output checks.


๐Ÿ› ๏ธ AutoGPTQ, AutoAWQ, and bitsandbytes: Quantization Libraries in Practice

bitsandbytes (the bnb library by Tim Dettmers) integrates directly with the HuggingFace transformers from_pretrained() API to load models in INT8 or NF4 (4-bit) precision without a separate quantization step. It is the fastest path from a HuggingFace model card to a quantized inference session.

AutoGPTQ implements the GPTQ algorithm (layer-wise weight quantization guided by approximate second-order, Hessian-based information) for aggressive 4-bit quantization with better quality retention than naive round-to-nearest. AutoAWQ implements the AWQ (Activation-Aware Weight Quantization) algorithm, which identifies the roughly 1% of weight channels most important to output quality and protects them during quantization.

# ── bitsandbytes: NF4 (4-bit NormalFloat) via HuggingFace BitsAndBytesConfig ──
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # non-uniform 4-bit: better weight distribution fit
    bnb_4bit_use_double_quant=True,      # quantize the scale factors too (~0.4-bit savings)
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_bnb = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_cfg,
    device_map="auto",
)

# ── AutoGPTQ: GPTQ 4-bit from a pre-quantized model checkpoint ────────────────
from auto_gptq import AutoGPTQForCausalLM

model_gptq = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",
    model_basename="model",
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,    # set True for faster inference on supported GPUs
)

# ── AutoAWQ: AWQ 4-bit from a pre-quantized model checkpoint ──────────────────
from awq import AutoAWQForCausalLM

model_awq = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    fuse_layers=True,    # fuse attention layers for ~20% throughput improvement
    trust_remote_code=False,
    safetensors=True,
)

# ── Comparison: same prompt, three quantization backends ─────────────────────
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
prompt = "Explain the difference between PTQ and QAT in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

for name, model in [("bnb-nf4", model_bnb), ("gptq", model_gptq), ("awq", model_awq)]:
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=60)
    print(f"\n[{name}]", tokenizer.decode(out[0], skip_special_tokens=True))

| Library | Algorithm | Timing | Best for | Quality vs. NF4 |
|---|---|---|---|---|
| bitsandbytes | NF4 / INT8 | PTQ (load-time) | Fast prototyping, LoRA fine-tuning | Baseline |
| AutoGPTQ | GPTQ | PTQ (offline, calibration) | Inference-only deployments needing small model size | +0.5–1% on coding/math |
| AutoAWQ | AWQ | PTQ (offline, activation-aware) | Low-bit deployment with quality-critical tasks | +0.5–1.5% on reasoning |

fuse_layers=True in AutoAWQ fuses attention and MLP submodules into optimized CUDA kernels, so the activation-aware advantage materialises as runtime throughput rather than just model-size reduction.

For a full deep-dive on AutoGPTQ calibration pipelines, AWQ activation channel analysis, and bitsandbytes QLoRA fine-tuning workflows, a dedicated follow-up post is planned.

📚 Lessons Learned from Quantization Projects

  • Classify decisions by timing, scope, and mapping before choosing tools.
  • Weight quantization is usually the highest-ROI first step.
  • Activation quantization can unlock additional speed, but calibration quality becomes critical.
  • Same bit width does not mean same quality; mapping and granularity matter.
  • Kernel compatibility can dominate real latency outcomes.
  • Selective precision is often better than aggressive all-layer quantization.

📌 TLDR: Summary & Key Takeaways

  • LLM quantization is best understood as a 3-axis design space: timing, scope, and mapping.
  • By Timing: PTQ is fast and practical; QAT helps recover quality when PTQ is insufficient.
  • By Scope: start with weights, then add activations if needed.
  • By Mapping: symmetric, asymmetric, and non-uniform mappings create different error behavior.
  • Weight quantization is the most common production entry point.
  • Activation quantization is powerful but requires stronger evaluation discipline.
  • Production success depends on joint optimization of memory, latency, and task-level quality.

One-liner: The best quantization strategy is the one that meets your product SLA with the smallest quality compromise, not the one with the lowest bit count.


Written by Abstract Algorithms (@abstractalgorithms)

Exploring the fascinating world of algorithms, data structures, and software engineering through clear explanations and practical examples.

© 2026 Abstract Algorithms. All rights reserved.

Powered by Hashnode