Types of LLM Quantization: By Timing, Scope, and Mapping
PTQ, QAT, INT8, INT4, and NF4 explained through timing, scope, and mapping choices.
Abstract Algorithms
Intermediate
For developers with some experience. Builds on fundamentals.
Estimated read time: 16 min
AI-assisted content. This post may have been written or enhanced with AI tools. Please verify critical information independently.
TLDR: There is no single "best" LLM quantization. You classify and choose quantization along three axes: when you quantize (timing), what you quantize (scope), and how values are encoded (mapping). In practice, most teams start with weight quantization, then add activation quantization when needed.
Quantization Is a Design Space, Not One Switch
Deploying LLaMA-3-70B at full fp16 precision requires approximately 140GB of VRAM: two A100 80GB GPUs at roughly $8–12K/month in cloud GPU rental. At Q4_K_M quantization (4-bit weights), the same model shrinks to roughly 35–40GB and fits on a single 48GB GPU such as an RTX A6000 or L40S. The measured quality difference on most standard benchmarks? Under 2%.
That is the practical upside, but "just quantize it to 4-bit" is not a strategy; it is a gamble. Teams that apply the wrong quantization type to the wrong model component regularly discover quality regressions in production that never appeared in offline evals.
Here is a concrete picture of the memory trade-off:
| Model | Precision | VRAM Required | Approximate Hardware |
| --- | --- | --- | --- |
| LLaMA-3-70B | fp16 | ~140 GB | 2× A100 80GB |
| LLaMA-3-70B | INT8 | ~70 GB | 1× A100 80GB |
| LLaMA-3-70B | Q4_K_M | ~35–40 GB | 1× 48GB GPU (RTX A6000, L40S) |
| LLaMA-3-8B | Q4_K_M | ~4.5 GB | Consumer laptop GPU |
The right question is not "should I quantize?" but "which quantization type fits my latency, memory, and quality budget?"
This post uses a taxonomy approach. Instead of memorizing tool names, classify every method by:
- By Timing: when low precision enters the pipeline.
- By Scope: which tensors or components are quantized.
- By Mapping: how float values are mapped to low-bit representations.
| If your main pain is... | Your first axis to optimize |
| --- | --- |
| Model does not fit GPU/edge memory | Scope (start with weights) |
| Cost per token is too high | Scope + Mapping |
| Accuracy regression after PTQ | Timing (move toward QAT) |
| Latency remains high despite smaller model | Mapping + kernel compatibility |
Three Classification Axes You Can Apply to Any Quantized LLM
Before selecting a library or hardware backend, use this compact classifier:
| Axis | Core question | Common options | Typical impact |
| --- | --- | --- | --- |
| Timing | When do we apply quantization? | PTQ, QAT | Accuracy retention vs implementation effort |
| Scope | What model parts are quantized? | Weights-only, weights+activations, KV cache | Memory and throughput |
| Mapping | How are floats represented in low bits? | Symmetric, asymmetric, non-uniform (NF4) | Error profile and hardware efficiency |
Use timing for lifecycle decisions, scope for memory/bandwidth impact, and mapping for error behavior.
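As a mental model, the three axes can be captured in a tiny config object, so each quantization decision is recorded explicitly rather than implied by a tool name. The names here are illustrative, not any library's API:

```python
from dataclasses import dataclass

# Hypothetical helper: one explicit choice per axis. Recording the plan this
# way makes "what did we actually ship?" answerable in code review.

@dataclass(frozen=True)
class QuantPlan:
    timing: str   # "ptq" or "qat"
    scope: str    # "weights", "weights+activations", "weights+activations+kv"
    mapping: str  # "symmetric", "asymmetric", "nf4"

    def describe(self) -> str:
        return f"{self.timing.upper()} / {self.scope} / {self.mapping}"

# A typical first production step: PTQ, weights-only, non-uniform 4-bit.
baseline = QuantPlan(timing="ptq", scope="weights", mapping="nf4")
print(baseline.describe())  # PTQ / weights / nf4
```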
PTQ vs QAT Taxonomy
```mermaid
flowchart LR
    subgraph PTQ["Post-Training Quantization (PTQ)"]
        P1["Trained FP16 Model"]
        P2["Calibration Dataset (representative prompts)"]
        P3["Quantize Weights (layer by layer)"]
        P4["Deploy 4-bit / INT8 Model"]
        P1 --> P2 --> P3 --> P4
    end
    subgraph QAT["Quantization-Aware Training (QAT)"]
        Q1["FP16 Model or Adapter"]
        Q2["Simulate Low-Precision During Fine-Tuning"]
        Q3["Weights Adapt to Quantization Noise"]
        Q4["Deploy with Better Quality Retention"]
        Q1 --> Q2 --> Q3 --> Q4
    end
    PTQ -->|"Quality drops?"| QAT
```
This flowchart compares the two primary quantization timing strategies side by side. PTQ starts from an already-trained FP16 model, runs a small calibration dataset to determine quantization parameters layer by layer, and produces a 4-bit or INT8 model ready for deployment, requiring no gradient updates. QAT instead embeds simulated low-precision arithmetic into the fine-tuning loop so that weights adapt to quantization noise before deployment, yielding better quality retention at the same bit width. The arrow from PTQ to QAT signals the recommended workflow: try PTQ first for speed and simplicity, then escalate to QAT only if the quality drop is unacceptable.
Scope: Per-Tensor vs Per-Channel vs Per-Token
```mermaid
flowchart TD
    Quant["Quantization Scope Decision"]
    WeightOnly["Weights-Only (most linear layers)"]
    WeightAct["Weights + Activations (higher throughput)"]
    KVCache["+ KV Cache (long-context memory)"]
    Mixed["Selective / Mixed (sensitive layers = BF16)"]
    Quant --> WeightOnly
    WeightOnly -->|"Need more memory saving"| WeightAct
    WeightAct -->|"Long-context prompts"| KVCache
    WeightOnly -->|"Quality sensitive layers"| Mixed
    subgraph Granularity["Quantization Granularity"]
        PerTensor["Per-Tensor (one scale for whole tensor)"]
        PerChannel["Per-Channel (scale per output channel)"]
        PerToken["Per-Token (scale per token, activations)"]
    end
    WeightOnly --> Granularity
```
This flowchart maps the quantization scope decision tree, starting from the default weights-only path and escalating to wider coverage as memory or quality constraints demand. Weights-only quantization covers most linear layers; adding activations increases throughput at higher compression; appending KV-cache quantization targets long-context memory pressure. The Granularity subgraph shows how scale granularity (per-tensor, per-channel, or per-token) trades implementation overhead for accuracy preservation at each scope level. The key takeaway is that scope is a dial: start narrow and widen only when benchmarks show you must.
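The granularity trade-off above can be measured directly. A minimal NumPy sketch (illustrative, not a production kernel) quantizes the same weight matrix with one scale for the whole tensor versus one scale per output row:

```python
import numpy as np

# Sketch: symmetric INT8 quantization at two granularities. A per-channel
# scale (one per output row) tracks each row's own range, so rows with
# small magnitudes lose less precision than under a single per-tensor scale.

def quantize_symmetric(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64)).astype(np.float32)
w[0] *= 20.0  # one outlier row inflates the whole-tensor range

# Per-tensor: one scale for everything.
s_tensor = np.abs(w).max() / 127.0
err_tensor = np.abs(w - quantize_symmetric(w, s_tensor) * s_tensor).mean()

# Per-channel: one scale per output row.
s_chan = np.abs(w).max(axis=1, keepdims=True) / 127.0
err_chan = np.abs(w - quantize_symmetric(w, s_chan) * s_chan).mean()

print(f"per-tensor mean abs error:  {err_tensor:.5f}")
print(f"per-channel mean abs error: {err_chan:.5f}")
```

With the outlier row present, the per-channel error is substantially lower at the same bit width, which is why per-channel (and group-wise) scales dominate in practice.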
By Timing: PTQ vs QAT
Timing answers when quantization appears during the model lifecycle.
Post-Training Quantization (PTQ)
PTQ quantizes an already trained model. You do not retrain from scratch.
- Fastest path to deployment.
- Good first step for most LLM serving workloads.
PTQ can be static (calibrated once) or dynamic (activation scale computed at runtime in some setups).
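The dynamic variant can be tried in a few lines with PyTorch's built-in dynamic quantization, shown here on a toy module rather than an LLM: weights are converted to INT8 once, while activation scales are computed at runtime per batch.

```python
import torch
import torch.nn as nn

# Minimal dynamic PTQ sketch on a toy model (not an LLM). quantize_dynamic
# replaces the listed module types with INT8-weight versions that compute
# activation scales on the fly, so no calibration dataset is needed.

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(2, 64)
with torch.inference_mode():
    out = quantized(x)

print(out.shape)  # torch.Size([2, 8])
```

Static PTQ for LLMs follows the same spirit but precomputes activation scales from a calibration set, which is what the library workflows later in this post do.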
Quantization-Aware Training (QAT)
QAT simulates low-precision behavior during fine-tuning/training so weights adapt to quantization noise.
- Better quality retention when PTQ degrades important tasks.
- Requires a cleaner data and eval pipeline.
| Timing type | Best when | Main risk | Typical owner |
| --- | --- | --- | --- |
| PTQ | You need speed and lower infra cost now | Quality drops on sensitive tasks | Inference/platform team |
| QAT | PTQ quality is below product threshold | Extra tuning cycles and GPU cost | Model + platform collaboration |
For most teams: PTQ first, QAT only when validation says PTQ is not enough.
By Scope: Which Parts of the LLM Get Quantized
Scope determines where quantization is applied in the model and runtime path.
| Scope option | What is quantized | Memory gain | Accuracy risk | Notes |
| --- | --- | --- | --- | --- |
| Weights-only | Model parameters | High | Low to medium | Most common first step |
| Weights + activations | Parameters + runtime activations | Higher | Medium | Better throughput potential |
| Weights + activations + KV cache | Adds cache compression | Very high | Medium to high | Long-context quality needs careful testing |
| Selective/mixed scope | Some layers kept high precision | Medium to high | Lower | Practical compromise |
Common pattern: quantize most linear layer weights first, keep sensitive heads in higher precision, then add activations only after quality baselines pass.
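This pattern can be sketched as a simple name-based precision plan. The module names below mirror common decoder layouts but are illustrative; in the HuggingFace + bitsandbytes stack, the same idea maps to options like llm_int8_skip_modules on BitsAndBytesConfig.

```python
# Sketch of the selective-precision pattern: decide per-module precision by
# name, keeping quality-sensitive modules (embeddings, output head) in BF16.
# Names are illustrative; real names depend on the architecture.

SENSITIVE = ("lm_head", "embed_tokens")  # keep these in BF16

def plan_precision(module_names):
    return {
        name: ("bf16" if any(s in name for s in SENSITIVE) else "int4")
        for name in module_names
    }

names = [
    "model.embed_tokens",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.gate_proj",
    "lm_head",
]
plan = plan_precision(names)
for name, prec in plan.items():
    print(f"{name}: {prec}")
```

The design point is that the exclusion list is an explicit, reviewable artifact, so "which layers stayed high precision" survives into code review and incident postmortems.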
By Mapping: How Float Values Become Low-Bit Values
Mapping defines the numeric transformation from float tensors to low-bit formats.
Symmetric mapping
Values are centered around zero, typically with a single scale.
- Simpler and often faster.
- Works well when tensor distributions are roughly zero-centered.
Asymmetric mapping
Uses scale plus zero-point, allowing shifted ranges.
- Better fit for non-zero-centered distributions.
- Slightly more metadata/handling complexity.
Non-uniform mapping (example: NF4)
Not all quantization levels are equally spaced.
- Better alignment with weight distributions in some LLMs.
- Common in 4-bit weight quantization pipelines.
| Mapping type | Formula style | Hardware friendliness | Typical use |
| --- | --- | --- | --- |
| Symmetric | q = round(x / s) | High | INT8 weight or activation paths |
| Asymmetric | q = round(x / s) + z | High | INT8 with offset-friendly runtimes |
| Non-uniform | Codebook/learned bins | Medium | 4-bit LLM weight quantization |
If two methods use the same bit width but different mapping, quality can differ significantly.
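A quick experiment makes the mapping difference concrete. On a shifted, all-positive tensor (think post-ReLU activations), asymmetric INT8 wastes none of its 256 levels on an empty negative range:

```python
import numpy as np

# Sketch comparing symmetric and asymmetric INT8 mapping on a tensor whose
# values are NOT centered at zero. The asymmetric zero-point shifts the
# grid so all levels cover the occupied range.

def quant_symmetric(x, bits=8):
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(x).max() / qmax
    q = np.clip(np.round(x / s), -qmax - 1, qmax)
    return q * s

def quant_asymmetric(x, bits=8):
    qmin, qmax = 0, 2 ** bits - 1
    s = (x.max() - x.min()) / (qmax - qmin)
    z = np.round(-x.min() / s)
    q = np.clip(np.round(x / s) + z, qmin, qmax)
    return (q - z) * s

rng = np.random.default_rng(2)
x = np.abs(rng.normal(size=10_000)) + 3.0  # shifted, all-positive values

err_sym = np.abs(x - quant_symmetric(x)).mean()
err_asym = np.abs(x - quant_asymmetric(x)).mean()
print(f"symmetric mean abs error:  {err_sym:.6f}")
print(f"asymmetric mean abs error: {err_asym:.6f}")
```

Same bit width, different mapping, measurably different error, which is exactly the point of the table above.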
Popular Approaches: Weight Quantization and Activation Quantization
These are the two most widely discussed practical approaches.
Weight quantization
Weight quantization compresses model parameters (usually linear layers).
Why it is popular:
- Big memory savings with manageable quality impact.
- Often enough to move from "cannot deploy" to "production feasible."
Typical setup:
- 8-bit (safer) or 4-bit (more aggressive) weights.
- Per-channel or group-wise scales.
- Optional selective high precision for sensitive layers.
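The group-wise scales in that typical setup can be sketched in a few lines, assuming symmetric 4-bit levels and a flat weight vector (illustrative, not a production kernel):

```python
import numpy as np

# Sketch of group-wise 4-bit symmetric quantization: each contiguous group
# of `group_size` weights gets its own scale, bounding the range any single
# outlier can distort. Group sizes of 64-128 are common in 4-bit schemes.

def quantize_groupwise(w_flat, group_size=64, qmax=7):
    groups = w_flat.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0  # guard against all-zero groups
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax)
    return q.astype(np.int8), scales

def dequantize_groupwise(q, scales):
    return (q * scales).reshape(-1)

rng = np.random.default_rng(1)
w = rng.normal(scale=0.02, size=4096).astype(np.float32)
q, s = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, s)
print(f"mean abs reconstruction error: {np.abs(w - w_hat).mean():.6f}")
```

Note that the scales themselves are metadata that must be stored, which is why real 4-bit formats cost slightly more than 4 bits per weight.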
Activation quantization
Activation quantization compresses intermediate runtime tensors produced during inference.
Why teams use it:
- Further reduces bandwidth and memory traffic.
Why teams delay it:
- More sensitive to input distribution shifts.
- Requires strong eval coverage (long context, tool calling, domain prompts).
| Approach | Biggest benefit | Biggest challenge | Good default order |
| --- | --- | --- | --- |
| Weight quantization | Large memory reduction | Layer sensitivity at low bits | Start here |
| Activation quantization | Extra speed and memory gains | Runtime distribution sensitivity | Add second |
In short: weight quantization is the baseline optimization; activation quantization is the scaling optimization.
Deep Dive: Why Timing, Scope, and Mapping Interact Inside the Runtime
The internals
At inference time, quantized LLMs run through three hidden mechanisms:
- Tensor representation changes: weights/activations stored in low-bit format with scales.
- Kernel path changes: runtime chooses quantized GEMM kernels if available.
- Rescaling/dequantization points: outputs are rescaled at specific boundaries.
Small scope or mapping changes can push execution to a different kernel path, so "smaller model" does not always mean lower latency.
Mathematical model (lightweight)
A common affine quantization mapping is:
$$ q = \text{clip}\left(\text{round}\left(\frac{x}{s}\right) + z, q_{\min}, q_{\max}\right) $$
$$ \hat{x} = s \cdot (q - z) $$
The reconstruction error is:
$$ e = x - \hat{x} $$
Your deployment goal is to keep $e$ small enough that task-level metrics stay within budget.
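The affine mapping above can be checked numerically: inside the clipping range, the reconstruction error is bounded by half the scale. A small sketch with arbitrary example parameters:

```python
import numpy as np

# Direct implementation of the affine mapping from the formulas above.
# Within the clip range, |e| <= s/2: the only unavoidable cost is rounding
# to the nearest quantization level. Parameters here are illustrative.

def affine_quantize(x, s, z, qmin, qmax):
    return np.clip(np.round(x / s) + z, qmin, qmax)

def affine_dequantize(q, s, z):
    return s * (q - z)

s, z, qmin, qmax = 0.05, 10, 0, 255
x = np.linspace(-0.4, 12.0, 1000)  # stays inside the clip range
q = affine_quantize(x, s, z, qmin, qmax)
x_hat = affine_dequantize(q, s, z)
e = x - x_hat
print(f"max |e| = {np.abs(e).max():.4f} (bound s/2 = {s / 2})")
```

Values outside the representable range are clipped, where the error can be arbitrarily large; choosing $s$ and $z$ is exactly the trade between clipping error and rounding error.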
Performance analysis
For decoder-only LLMs, per-layer compute class stays similar to the unquantized path, but constants change:
- Time complexity trend: operation class stays similar, but lower precision often improves throughput through lower memory transfer.
- Space complexity trend: parameter memory roughly scales with bit width (FP16 to INT8 is about 2x smaller; FP16 to 4-bit is about 4x smaller before metadata overhead).
- Bottlenecks: memory bandwidth, unsupported kernels, and dequantization overhead can limit gains.
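The bit-width scaling in the space-complexity trend is simple enough to verify with back-of-envelope arithmetic (parameter memory only; scale/zero-point metadata adds roughly 0.1-0.5 bits per weight depending on group size):

```python
# Back-of-envelope parameter memory for a 70B-parameter model at several
# bit widths. Metadata overhead is deliberately ignored here.

PARAMS = 70e9

def weight_memory_gb(bits_per_weight):
    return PARAMS * bits_per_weight / 8 / 1e9

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label:>5}: {weight_memory_gb(bits):6.1f} GB")
# FP16:  140.0 GB, INT8: 70.0 GB, INT4: 35.0 GB
```

These numbers match the VRAM table at the top of the post, before KV cache, activations, and metadata are added on top.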
| Change | Usually improves | Can regress when |
| --- | --- | --- |
| Lower weight precision | Memory footprint, model fit | Kernel path is not optimized |
| Activation quantization | Throughput, memory traffic | Calibration misses production distribution |
| More aggressive mapping | Compression ratio | Quantization error hurts key tasks |
Internals
Post-training quantization (PTQ) maps FP32/BF16 weights to lower-bit representations using a calibration dataset to determine optimal scale and zero-point per layer. Quantization-aware training (QAT) simulates quantization noise during forward passes using straight-through estimators, allowing gradients to flow through the discretization step. Weight-only quantization (e.g., GPTQ) quantizes weights but keeps activations in FP16; activation quantization (e.g., SmoothQuant) quantizes both, requiring per-channel rescaling to handle outlier activations.
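The straight-through estimator mentioned above can be sketched as a custom autograd function. This is the simplest form, passing gradients through everywhere; production STEs typically zero gradients for values outside the clip range.

```python
import torch

# Sketch of a straight-through estimator (STE) fake-quant: the forward pass
# rounds to the quantization grid, the backward pass pretends rounding was
# the identity so gradients still reach the underlying FP weights.

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale, qmin, qmax):
        q = torch.clamp(torch.round(x / scale), qmin, qmax)
        return q * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None, None, None  # identity gradient w.r.t. x

w = torch.randn(8, requires_grad=True)
y = FakeQuantSTE.apply(w, 0.1, -8, 7)   # simulate 4-bit-style grid
y.sum().backward()
print(torch.allclose(w.grad, torch.ones_like(w)))  # True: gradient passed through
```

This is the mechanism that lets QAT train through a non-differentiable rounding step.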
Performance Analysis
INT8 weight-only quantization (LLM.int8()) cuts memory in half with <0.5% accuracy loss on most benchmarks. INT4 PTQ (GPTQ/AWQ) achieves 4× compression with ~1–2% accuracy degradation. QAT INT4 can approach FP16 quality but requires 10–20% additional training compute, justified mainly for edge deployment where inference cost dominates. Activation quantization with SmoothQuant enables INT8 inference on 175B models at 1.56× throughput improvement over FP16.
Visualizing a Quantization Strategy Flow
```mermaid
flowchart TD
    A[Start with FP16 or BF16 LLM] --> B[Choose Timing Axis]
    B --> C{PTQ or QAT?}
    C -->|PTQ| D[Select Scope: Weights Only]
    C -->|QAT| E[Train with Quantization Simulation]
    D --> F[Select Mapping: Symmetric Asymmetric NF4]
    E --> F
    F --> G[Benchmark Memory Latency Quality]
    G --> H{Targets met?}
    H -- No --> I[Expand Scope or Adjust Mapping]
    I --> G
    H -- Yes --> J[Canary Deploy + Fallback]
    J --> K[Production Rollout]
```
This flowchart shows the complete quantization strategy selection process from a pre-trained model to production deployment. After choosing between PTQ (which flows immediately into scope and mapping selection) and QAT (which requires a fine-tuning pass with simulated quantization noise), both paths converge at a benchmarking gate covering memory, latency, and quality. If targets are not met, the loop expands scope or adjusts the numerical mapping; once satisfied, a canary deployment validates stability before full rollout. The key takeaway is that quantization is iterative: commit to production only after the benchmark gate passes.
Real-World Applications: Input, Process, Output
Case study 1: Customer support assistant on shared GPUs
| Stage | Details |
| --- | --- |
| Input | Multilingual support prompts, medium context, strict p95 latency |
| Process | PTQ + INT8 weights first, then selective activation quantization on stable layers |
| Output | Lower memory usage, better concurrency, acceptable quality drift |
Case study 2: On-prem legal drafting assistant with long context
| Stage | Details |
| --- | --- |
| Input | Long-context legal prompts with domain terms |
| Process | Weight-only 4-bit in most layers, output head and selected attention blocks kept BF16 |
| Output | Model fits target hardware, but long-context eval required extra iteration |
Both cases succeed by sequencing decisions across timing, scope, and mapping instead of maximizing compression immediately.
Trade-offs and Failure Modes You Should Expect
| Trade-off or failure mode | What it looks like in production | Mitigation |
| --- | --- | --- |
| Memory saved, quality drops | Answers remain fluent but become less accurate | Add task-specific eval thresholds before rollout |
| Low-bit model, no latency gain | Smaller model but unchanged p95 | Validate backend kernel support early |
| Activation drift | Good offline metrics, bad real traffic performance | Use representative calibration and shadow traffic |
| Over-quantized sensitive layers | Hallucinations or format breakage in structured tasks | Keep selective layers in higher precision |
| Aggressive scope change | Improvements on average, poor long-tail reliability | Canary release with automated rollback |
Intermediate-level rule: do not ship quantization based on memory metrics alone. Always include task-quality and tail-latency checks.
Decision Guide: Choosing by Constraint
| Situation | Recommendation |
| --- | --- |
| Use when | Start with PTQ + weight quantization (INT8 or safe 4-bit) when memory and cost are immediate problems. |
| Avoid when | Avoid activation quantization as the first move if you do not have production-like calibration/eval data. |
| Alternative | Use mixed precision: quantize most layers, keep sensitive modules higher precision. |
| Edge cases | For long context, tool use, or strict JSON output, run dedicated eval suites before full rollout. |
If deployment is blocked by memory, optimize scope first. If quality fails after PTQ, revisit timing (QAT). If two same-bit methods differ, inspect mapping and kernel support.
Practical Examples: Weight-First and Activation-Aware Paths
This example demonstrates the lowest-risk production starting point: a weight-first path using 4-bit NF4 loading. Activation-aware extensions, including KV-cache quantization for long-context workloads, are the next logical escalation when memory pressure remains after weight-only quantization. As you read, focus on where quantization is applied in the pipeline: which layers are quantized, what data type is used for compute, and which components remain in higher precision.
Example 1: Weight quantization with 4-bit loading
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_cfg,
    device_map="auto",
)

prompt = "List three trade-offs of LLM quantization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
What this demonstrates: a weight-first quantization strategy that is widely used for fast prototyping and production pilots.
Activation quantization in full LLM stacks is backend-dependent. Apply it after a weight-only baseline passes, then validate with production-like prompts, long-context tests, and structured-output checks.
AutoGPTQ, AutoAWQ, and bitsandbytes: Quantization Libraries in Practice
bitsandbytes (the bnb library by Tim Dettmers) integrates directly with the HuggingFace transformers from_pretrained() API to load models in INT8 or NF4 (4-bit) precision without a separate quantization step; it is the fastest path from a HuggingFace model card to a quantized inference session.
AutoGPTQ implements the GPTQ algorithm (layer-wise weight quantization guided by approximate second-order, Hessian-based information) for aggressive 4-bit quantization with better quality retention than naive round-to-nearest. AutoAWQ implements the AWQ (Activation-Aware Weight Quantization) algorithm, which identifies and protects the small fraction (roughly 1%) of weight channels most important to output quality.
```python
# -- bitsandbytes: NF4 (4-bit NormalFloat) via HuggingFace BitsAndBytesConfig --
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # non-uniform 4-bit: better weight distribution fit
    bnb_4bit_use_double_quant=True,        # quantize the scale factors too (~0.4-bit savings)
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_bnb = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_cfg,
    device_map="auto",
)

# -- AutoGPTQ: GPTQ 4-bit from a pre-quantized model checkpoint ----------------
from auto_gptq import AutoGPTQForCausalLM

model_gptq = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",
    model_basename="model",
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,                      # set True for faster inference on supported GPUs
)

# -- AutoAWQ: AWQ 4-bit from a pre-quantized model checkpoint ------------------
from awq import AutoAWQForCausalLM

model_awq = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    fuse_layers=True,                      # fuse attention layers for ~20% throughput improvement
    trust_remote_code=False,
    safetensors=True,
)

# -- Comparison: same prompt, three quantization backends ----------------------
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
prompt = "Explain the difference between PTQ and QAT in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

for name, model in [("bnb-nf4", model_bnb), ("gptq", model_gptq), ("awq", model_awq)]:
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=60)
    print(f"\n[{name}]", tokenizer.decode(out[0], skip_special_tokens=True))
```
| Library | Algorithm | Timing | Best for | Quality vs. NF4 |
| --- | --- | --- | --- | --- |
| bitsandbytes | NF4 / INT8 | PTQ (load-time) | Fast prototyping, LoRA fine-tuning | Baseline |
| AutoGPTQ | GPTQ | PTQ (offline, calibration) | Inference-only deployments needing small model size | +0.5–1% on coding/math |
| AutoAWQ | AWQ | PTQ (offline, activation-aware) | Low-bit deployment with quality-critical tasks | +0.5–1.5% on reasoning |
fuse_layers=True in AWQ merges adjacent attention and MLP operations into fused CUDA kernels, turning the smaller model into actual runtime throughput rather than just a size reduction.
For a full deep-dive on AutoGPTQ calibration pipelines, AWQ activation channel analysis, and bitsandbytes QLoRA fine-tuning workflows, a dedicated follow-up post is planned.
Lessons Learned from Quantization Projects
- Classify decisions by timing, scope, and mapping before choosing tools.
- Weight quantization is usually the highest-ROI first step.
- Activation quantization can unlock additional speed, but calibration quality becomes critical.
- Same bit width does not mean same quality; mapping and granularity matter.
- Kernel compatibility can dominate real latency outcomes.
- Selective precision is often better than aggressive all-layer quantization.
TLDR: Summary & Key Takeaways
- LLM quantization is best understood as a 3-axis design space: timing, scope, and mapping.
- By Timing: PTQ is fast and practical; QAT helps recover quality when PTQ is insufficient.
- By Scope: start with weights, then add activations if needed.
- By Mapping: symmetric, asymmetric, and non-uniform mappings create different error behavior.
- Weight quantization is the most common production entry point.
- Activation quantization is powerful but requires stronger evaluation discipline.
- Production success depends on joint optimization of memory, latency, and task-level quality.
One-liner: The best quantization strategy is the one that meets your product SLA with the smallest quality compromise, not the one with the lowest bit count.