Types of LLM Quantization: By Timing, Scope, and Mapping
PTQ, QAT, INT8, INT4, and NF4 explained through timing, scope, and mapping choices.

Abstract Algorithms
Helping engineers master software engineering topics.

TLDR: There is no single "best" LLM quantization. You classify and choose quantization along three axes: when you quantize (timing), what you quantize (scope), and how values are encoded (mapping). In practice, most teams start with weight quantization, then add activation quantization when needed.
Quantization Is a Design Space, Not One Switch
Deploying LLaMA-3-70B at full fp16 precision requires approximately 140GB of VRAM for the weights alone: two A100 80GB GPUs at roughly $8–12K/month in cloud GPU rental. At Q4_K_M quantization (a 4-bit-class format at roughly 4 to 5 bits per weight), the same weights shrink to roughly 35–42GB, which fits on a single 48GB GPU or across two 24GB consumer GPUs. The quality difference on most standard benchmarks is typically reported as small, often under 2%, though it varies by task and model.
That is the practical upside, but "just quantize it to 4-bit" is not a strategy; it is a gamble. Teams that apply the wrong quantization type to the wrong model component regularly discover quality regressions in production that never appeared in offline evals.
Here is a concrete picture of the memory trade-off (weights only, before KV cache and runtime overhead):
| Model | Precision | VRAM Required | Approximate Hardware |
| LLaMA-3-70B | fp16 | ~140 GB | 2× A100 80GB |
| LLaMA-3-70B | INT8 | ~70 GB | 1× A100 80GB |
| LLaMA-3-70B | Q4_K_M | ~35–42 GB | 1× 48GB GPU or 2× RTX 3090 (24GB each) |
| LLaMA-3-8B | Q4_K_M | ~4.5–5 GB | Consumer laptop GPU |
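The weight-memory column follows from simple arithmetic: parameter count times bits per weight, divided by eight. The sketch below reproduces it. The function name, the rounded parameter counts, and the nominal bits-per-weight values are illustrative assumptions; real quantized files run somewhat larger because they also store scale metadata and keep some tensors at higher precision.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Rounded parameter counts and nominal bit widths (assumptions for illustration).
configs = [
    ("70B fp16", 70e9, 16),
    ("70B INT8", 70e9, 8),
    ("70B 4-bit", 70e9, 4),
    ("8B 4.5-bit", 8e9, 4.5),
]

estimates = {label: weight_memory_gb(params, bits) for label, params, bits in configs}
for label, gb in estimates.items():
    print(f"{label:>10}: ~{gb:.1f} GB")
```

These figures cover weights only. Activations, the KV cache, and framework overhead need additional headroom.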
The right question is not "should I quantize?" It is: "Which quantization type fits my latency, memory, and quality budget?"
This post uses a taxonomy approach. Instead of memorizing tool names, classify every method by:
- By Timing: when low precision enters the pipeline.
- By Scope: which tensors or components are quantized.
- By Mapping: how float values are mapped to low-bit representations.
| If your main pain is... | Your first axis to optimize |
| Model does not fit GPU/edge memory | Scope (start with weights) |
| Cost per token is too high | Scope + Mapping |
| Accuracy regression after PTQ | Timing (move toward QAT) |
| Latency remains high despite smaller model | Mapping + kernel compatibility |
Three Classification Axes You Can Apply to Any Quantized LLM
Before selecting a library or hardware backend, use this compact classifier:
| Axis | Core question | Common options | Typical impact |
| Timing | When do we apply quantization? | PTQ, QAT | Accuracy retention vs implementation effort |
| Scope | What model parts are quantized? | Weights-only, weights+activations, KV cache | Memory and throughput |
| Mapping | How are floats represented in low bits? | Symmetric, asymmetric, non-uniform (NF4) | Error profile and hardware efficiency |
Use timing for lifecycle decisions, scope for memory/bandwidth impact, and mapping for error behavior.
PTQ vs QAT Taxonomy
flowchart LR
subgraph PTQ["Post-Training Quantization (PTQ)"]
P1[Trained FP16 Model]
P2["Calibration Dataset (representative prompts)"]
P3["Quantize Weights (layer by layer)"]
P4[Deploy 4-bit / INT8 Model]
P1 --> P2 --> P3 --> P4
end
subgraph QAT["Quantization-Aware Training (QAT)"]
Q1[FP16 Model or Adapter]
Q2[Simulate Low-Precision During Fine-Tuning]
Q3[Weights Adapt to Quantization Noise]
Q4[Deploy with Better Quality Retention]
Q1 --> Q2 --> Q3 --> Q4
end
PTQ -->|"Quality drops?"| QAT
This flowchart compares the two primary quantization timing strategies side by side. PTQ starts from an already-trained FP16 model, runs a small calibration dataset to determine quantization parameters layer by layer, and produces a 4-bit or INT8 model ready for deployment, with no gradient updates required. QAT instead embeds simulated low-precision arithmetic into the fine-tuning loop so that weights adapt to quantization noise before deployment, yielding better quality retention at the same bit width. The arrow from PTQ to QAT signals the recommended workflow: try PTQ first for speed and simplicity, then escalate to QAT only if the quality drop is unacceptable.
Scope: Per-Tensor vs Per-Channel vs Per-Token
flowchart TD
Quant[Quantization Scope Decision]
WeightOnly["Weights-Only (most linear layers)"]
WeightAct["Weights + Activations (higher throughput)"]
KVCache["+ KV Cache (long-context memory)"]
Mixed["Selective / Mixed (sensitive layers = BF16)"]
Quant --> WeightOnly
WeightOnly -->|"Need more memory saving"| WeightAct
WeightAct -->|"Long-context prompts"| KVCache
WeightOnly -->|"Quality sensitive layers"| Mixed
subgraph Granularity[Quantization Granularity]
PerTensor["Per-Tensor (one scale for whole tensor)"]
PerChannel["Per-Channel (scale per output channel)"]
PerToken["Per-Token (scale per token, activations)"]
end
WeightOnly --> Granularity
This flowchart maps the quantization scope decision tree, starting from the default weights-only path and escalating to wider coverage as memory or quality constraints demand. Weights-only quantization covers most linear layers; adding activation quantization can raise throughput because matrix multiplies can then run in low precision; appending KV-cache quantization targets long-context memory pressure. The Granularity subgraph shows how scale granularity (per-tensor, per-channel, or per-token) trades implementation overhead for accuracy preservation at each scope level. The key takeaway is that scope is a dial: start narrow and widen only when benchmarks show you must.
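To see why the KV cache becomes a scope target for long contexts, it helps to estimate its size. The following is a rough sketch, not a measurement: the function name is hypothetical, the model shape (32 layers, 8 KV heads, head dimension 128) is an assumed Llama-3-8B-like configuration, and scale metadata for the quantized cache is ignored.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits, batch_size=1):
    """Bytes needed to cache keys and values for every layer and every token."""
    # The leading 2 accounts for storing both a key and a value per position.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bits / 8


# Assumed Llama-3-8B-like shape: 32 layers, 8 KV heads, head dimension 128.
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for bits in (16, 8, 4):
    gib = kv_cache_bytes(seq_len=8192, bits=bits, **SHAPE) / 2**30
    print(f"{bits:>2}-bit KV cache at 8192 tokens: {gib:.2f} GiB per sequence")
```

The cache grows linearly with sequence length and batch size, so at high concurrency or very long contexts it can rival the quantized weights themselves.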
By Timing: PTQ vs QAT
Timing answers when quantization appears during the model lifecycle.
Post-Training Quantization (PTQ)
PTQ quantizes an already trained model. You do not retrain from scratch.
- Fastest path to deployment.
- Good first step for most LLM serving workloads.
PTQ can be static (calibrated once) or dynamic (activation scale computed at runtime in some setups).
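The static versus dynamic distinction can be sketched in a few lines. This is a toy illustration under stated assumptions: the helper names and the activation values are made up, and symmetric INT8 with an absmax scale is only one of several possible schemes.

```python
def absmax_scale(values, qmax=127):
    """Symmetric scale so that the largest magnitude maps to qmax."""
    return max(abs(v) for v in values) / qmax


def quantize(values, scale, qmax=127):
    """Round to the nearest integer level and clip to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


calibration_batch = [0.2, -0.5, 1.0, -0.8]  # made-up calibration activations
production_batch = [0.3, -0.4, 4.0, -0.6]   # made-up batch with a larger outlier

static_scale = absmax_scale(calibration_batch)   # fixed once, offline
dynamic_scale = absmax_scale(production_batch)   # recomputed at runtime

static_q = quantize(production_batch, static_scale)
dynamic_q = quantize(production_batch, dynamic_scale)

static_recon = [q * static_scale for q in static_q]
dynamic_recon = [q * dynamic_scale for q in dynamic_q]

print("static :", [round(v, 3) for v in static_recon])
print("dynamic:", [round(v, 3) for v in dynamic_recon])
```

With the static scale, the production outlier is clipped to the calibrated range. The dynamic scale preserves it but spends its levels more coarsely on the small values and adds runtime work.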
Quantization-Aware Training (QAT)
QAT simulates low-precision behavior during fine-tuning/training so weights adapt to quantization noise.
- Better quality retention when PTQ degrades important tasks.
- Requires a cleaner data and eval pipeline.
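A minimal sketch of what "simulating low precision during training" can mean, assuming a single scalar weight, a fixed quantization grid, and a hand-written gradient. The names and constants are illustrative; real QAT relies on a framework's autograd and typically learned or calibrated scales.

```python
SCALE, QMIN, QMAX = 0.25, -8, 7  # assumed 4-bit signed grid with step 0.25


def fake_quant(w):
    """Quantize then immediately dequantize, so the forward pass sees rounding error."""
    q = max(QMIN, min(QMAX, round(w / SCALE)))
    return q * SCALE


xs = [1.0, 2.0, 3.0]
ys = [0.7 * x for x in xs]  # toy target: the ideal weight 0.7 is not on the grid

w = 0.0    # latent full-precision weight that the optimizer updates
lr = 0.05
for _ in range(20):
    wq = fake_quant(w)
    # Gradient of the mean squared error with respect to the quantized weight.
    grad_wq = sum(2 * (wq * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    # Straight-through estimator: treat d(wq)/d(w) as 1, ignoring round().
    w -= lr * grad_wq

print(f"latent w = {w:.3f}, deployed weight = {fake_quant(w):.2f}")
```

The latent weight ends up hovering near the rounding boundary between the two grid points closest to the ideal value, which is the kind of adaptation to quantization noise that PTQ cannot perform.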
| Timing type | Best when | Main risk | Typical owner |
| PTQ | You need speed and lower infra cost now | Quality drops on sensitive tasks | Inference/platform team |
| QAT | PTQ quality is below product threshold | Extra tuning cycles and GPU cost | Model + platform collaboration |
For most teams: PTQ first, QAT only when validation says PTQ is not enough.
By Scope: Which Parts of the LLM Get Quantized
Scope determines where quantization is applied in the model and runtime path.
| Scope option | What is quantized | Memory gain | Accuracy risk | Notes |
| Weights-only | Model parameters | High | Low to medium | Most common first step |
| Weights + activations | Parameters + runtime activations | Higher | Medium | Better throughput potential |
| Weights + activations + KV cache | Adds cache compression | Very high | Medium to high | Long-context quality needs careful testing |
| Selective/mixed scope | Some layers kept high precision | Medium to high | Lower | Practical compromise |
Common pattern: quantize most linear layer weights first, keep sensitive heads in higher precision, then add activations only after quality baselines pass.
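The common pattern above amounts to a per-layer precision plan. Here is one possible way to express it, as a sketch only: the layer names, the glob patterns, and the `plan_precision` helper are hypothetical and do not correspond to any specific library's API.

```python
from fnmatch import fnmatch


def plan_precision(layer_names, keep_high_precision=("lm_head", "*.norm*")):
    """Map each layer name to '4bit' unless it matches a keep-in-BF16 pattern."""
    plan = {}
    for name in layer_names:
        keep = any(fnmatch(name, pattern) for pattern in keep_high_precision)
        plan[name] = "bf16" if keep else "4bit"
    return plan


# Hypothetical layer names for illustration only.
layers = ["layers.0.attn.q_proj", "layers.0.mlp.up_proj", "layers.0.norm", "lm_head"]

plan = plan_precision(layers)
for name, precision in plan.items():
    print(f"{name:<24} -> {precision}")
```

Keeping the plan as data makes it easy to widen the scope one layer group at a time and rerun the quality baseline after each change.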
By Mapping: How Float Values Become Low-Bit Values
Mapping defines the numeric transformation from float tensors to low-bit formats.
Symmetric mapping
Values are centered around zero, typically with a single scale.
- Simpler and often faster.
- Works well when tensor distributions are roughly zero-centered.
Asymmetric mapping
Uses scale plus zero-point, allowing shifted ranges.
- Better fit for non-zero-centered distributions.
- Slightly more metadata/handling complexity.
Non-uniform mapping (example: NF4)
Not all quantization levels are equally spaced.
- Better alignment with weight distributions in some LLMs.
- Common in 4-bit weight quantization pipelines.
| Mapping type | Formula style | Hardware friendliness | Typical use |
| Symmetric | q = round(x / s) | High | INT8 weight or activation paths |
| Asymmetric | q = round(x / s) + z | High | INT8 with offset-friendly runtimes |
| Non-uniform | Codebook/learned bins | Medium | 4-bit LLM weight quantization |
If two methods use the same bit width but different mapping, quality can differ significantly.
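The effect of mapping at a fixed bit width can be checked numerically. The sketch below compares two 3-bit codebooks on deterministic, bell-shaped values. The data generation, the helper names, and especially the non-uniform levels are illustrative assumptions; the levels are hand-picked to be denser near zero and are not the actual NF4 table.

```python
from statistics import NormalDist


def nearest_level(x, levels):
    """Return the codebook entry closest to x."""
    return min(levels, key=lambda level: abs(level - x))


def mse(values, levels):
    """Mean squared error after snapping every value to its nearest level."""
    return sum((v - nearest_level(v, levels)) ** 2 for v in values) / len(values)


# Deterministic bell-shaped "weights": evenly spaced quantiles of a normal distribution.
n = 1000
raw = [NormalDist(0.0, 1.0).inv_cdf((i + 0.5) / n) for i in range(n)]
absmax = max(abs(v) for v in raw)
weights = [v / absmax for v in raw]  # normalized to [-1, 1]

# Two 3-bit codebooks (8 levels each).
uniform = [-1 + 2 * k / 7 for k in range(8)]
# Hypothetical non-uniform levels, denser near zero; NOT the real NF4 table.
non_uniform = [-0.65, -0.41, -0.23, -0.075, 0.075, 0.23, 0.41, 0.65]

print(f"uniform     MSE: {mse(weights, uniform):.5f}")
print(f"non-uniform MSE: {mse(weights, non_uniform):.5f}")
```

Both codebooks cost the same number of bits per value. The non-uniform one gives up accuracy on rare large values in exchange for finer resolution where most values sit.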
Popular Approaches: Weight Quantization and Activation Quantization
These are the two most widely discussed practical approaches.
Weight quantization
Weight quantization compresses model parameters (usually linear layers).
Why it is popular:
- Big memory savings with manageable quality impact.
- Often enough to move from "cannot deploy" to "production feasible."
Typical setup:
- 8-bit (safer) or 4-bit (more aggressive) weights.
- Per-channel or group-wise scales.
- Optional selective high precision for sensitive layers.
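The choice of scale granularity in this setup matters most when a tensor contains outliers. A small sketch, assuming symmetric 4-bit quantization with absmax scales and a made-up weight matrix in which one row holds a large outlier; the helper names are illustrative.

```python
def quant_dequant(values, qmax=7):
    """Symmetric round-trip with one absmax scale shared by all given values."""
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def total_sq_error(original, restored):
    return sum((a - b) ** 2 for a, b in zip(original, restored))


# Made-up 2x8 weight matrix: row 0 is small, row 1 contains one large outlier.
W = [
    [0.02, -0.03, 0.01, 0.04, -0.02, 0.03, -0.01, 0.02],
    [0.50, -0.40, 0.30, 6.00, -0.20, 0.10, -0.30, 0.40],
]
flat = [v for row in W for v in row]
group_size = 4

per_tensor = quant_dequant(flat)                             # one scale overall
per_channel = [v for row in W for v in quant_dequant(row)]   # one scale per row
per_group = [
    v
    for row in W
    for start in range(0, len(row), group_size)
    for v in quant_dequant(row[start:start + group_size])    # one scale per 4 weights
]

errors = {
    "per_tensor": total_sq_error(flat, per_tensor),
    "per_channel": total_sq_error(flat, per_channel),
    "per_group": total_sq_error(flat, per_group),
}
for name, err in errors.items():
    print(f"{name:<12} squared error: {err:.4f}")
```

Finer granularity limits how far a single outlier can degrade its neighbors, at the cost of storing more scales.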
Activation quantization
Activation quantization compresses intermediate runtime tensors produced during inference.
Why teams use it:
- Further reduces bandwidth and memory traffic.
Why teams delay it:
- More sensitive to input distribution shifts.
- Requires strong eval coverage (long context, tool calling, domain prompts).
| Approach | Biggest benefit | Biggest challenge | Good default order |
| Weight quantization | Large memory reduction | Layer sensitivity at low bits | Start here |
| Activation quantization | Extra speed and memory gains | Runtime distribution sensitivity | Add second |
In short: weight quantization is the baseline optimization; activation quantization is the scaling optimization.
Deep Dive: Inside the Runtime: Why Timing, Scope, and Mapping Interact
The internals
At inference time, quantized LLMs run through three hidden mechanisms:
- Tensor representation changes: weights/activations stored in low-bit format with scales.
- Kernel path changes: runtime chooses quantized GEMM kernels if available.
- Rescaling/dequantization points: outputs are rescaled at specific boundaries.
Small scope or mapping changes can push execution to a different kernel path, so "smaller model" does not always mean lower latency.
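The three mechanisms can be seen together in a toy dot product: both operands are stored as integers with a scale, the multiply-accumulate runs on integers, and a single rescale happens at the output boundary. This is a pure-Python sketch with made-up values and hypothetical helper names; real runtimes do the same work inside fused GEMM kernels.

```python
def quantize_symmetric(values, qmax=127):
    """Return integer codes and the absmax scale used to produce them."""
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale


x = [0.5, -1.2, 0.8, 2.0]     # made-up activation vector
w = [0.1, 0.4, -0.3, 0.25]    # made-up weight column

qx, sx = quantize_symmetric(x)
qw, sw = quantize_symmetric(w)

# Integer multiply-accumulate (typically accumulated in int32 on real hardware).
acc = sum(a * b for a, b in zip(qx, qw))
# One rescale at the output boundary recovers an approximate float result.
y_quant = acc * sx * sw
y_ref = sum(a * b for a, b in zip(x, w))

print(f"quantized: {y_quant:.4f}  reference: {y_ref:.4f}")
```

If a backend lacks an integer kernel for a given layout, it has to dequantize the operands first and run the float path, which is one way a smaller model ends up no faster.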
Mathematical model (lightweight)
A common affine quantization mapping is:
$$ q = \text{clip}\left(\text{round}\left(\frac{x}{s}\right) + z, q_{\min}, q_{\max}\right) $$
$$ \hat{x} = s \cdot (q - z) $$
The reconstruction error is:
$$ e = x - \hat{x} $$
Your deployment goal is to keep e small enough that task-level metrics stay within budget.
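The equations above translate directly into code. The sketch below assumes an unsigned 8-bit range and derives the scale and zero-point from the observed minimum and maximum; the function names and sample values are illustrative.

```python
def affine_params(x_min, x_max, q_min=0, q_max=255):
    """Scale s and zero-point z so that [x_min, x_max] spans [q_min, q_max]."""
    s = (x_max - x_min) / (q_max - q_min)
    z = round(q_min - x_min / s)
    return s, z


def quantize(x, s, z, q_min=0, q_max=255):
    """q = clip(round(x / s) + z, q_min, q_max)"""
    return max(q_min, min(q_max, round(x / s) + z))


def dequantize(q, s, z):
    """x_hat = s * (q - z)"""
    return s * (q - z)


values = [-0.4, 0.0, 0.95, 2.6]  # made-up tensor values, not centered on zero
s, z = affine_params(min(values), max(values))

errors = [x - dequantize(quantize(x, s, z), s, z) for x in values]
for x, e in zip(values, errors):
    print(f"x = {x:+.2f}  q = {quantize(x, s, z):3d}  error = {e:+.5f}")
```

For values inside the calibrated range, the reconstruction error is bounded by half a step (s / 2). Values outside the range are clipped, and their error can be arbitrarily larger.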
Performance analysis
For decoder-only LLMs, per-layer compute class stays similar to the unquantized path, but constants change:
- Time complexity trend: operation class stays similar, but lower precision often improves throughput through lower memory transfer.
- Space complexity trend: parameter memory roughly scales with bit width (FP16 to INT8 is about 2x smaller; FP16 to 4-bit is about 4x smaller before metadata overhead).
- Bottlenecks: memory bandwidth, unsupported kernels, and dequantization overhead can limit gains.
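The "before metadata overhead" caveat can be quantified. As a rough sketch, assume each group of weights shares one scale; the helper name is hypothetical, and the block sizes (64 weights per 32-bit scale, with an optional second level that stores scales in 8 bits plus one 32-bit constant per 256 scales) are an assumed configuration in the style of NF4 with double quantization.

```python
def effective_bits(weight_bits, group_size, scale_bits):
    """Average storage per weight when each group of weights shares one scale."""
    return weight_bits + scale_bits / group_size


# Assumed setup: 4-bit weights, one 32-bit scale per block of 64 weights.
plain = effective_bits(4, 64, 32)

# Assumed double quantization: scales stored in 8 bits, plus one 32-bit
# second-level constant shared by every 256 scales.
double = 4 + 8 / 64 + 32 / (64 * 256)

print(f"plain 4-bit      : {plain:.3f} bits/weight ({16 / plain:.2f}x vs FP16)")
print(f"with double quant: {double:.3f} bits/weight ({16 / double:.2f}x vs FP16)")
print(f"saving           : {plain - double:.3f} bits/weight")
```

Smaller groups improve accuracy but raise the effective bit width, so the compression ratio in practice sits below the nominal 4x.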
| Change | Usually improves | Can regress when |
| Lower weight precision | Memory footprint, model fit | Kernel path is not optimized |
| Activation quantization | Throughput, memory traffic | Calibration misses production distribution |
| More aggressive mapping | Compression ratio | Quantization error hurts key tasks |
Internals
Post-training quantization (PTQ) maps FP32/BF16 weights to lower-bit representations using a calibration dataset to determine optimal scale and zero-point per layer. Quantization-aware training (QAT) simulates quantization noise during forward passes using straight-through estimators, allowing gradients to flow through the discretization step. Weight-only quantization (e.g., GPTQ) quantizes weights but keeps activations in FP16; activation quantization (e.g., SmoothQuant) quantizes both, requiring per-channel rescaling to handle outlier activations.
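The per-channel rescaling idea can be sketched as follows: divide each activation channel by a factor and multiply the matching weight row by the same factor, so the product is unchanged while the activation range shrinks. This is a simplified illustration in the spirit of SmoothQuant, with made-up matrices, hypothetical helper names, and an assumed migration strength of alpha = 0.5.

```python
def smooth(X, W, alpha=0.5):
    """Per-input-channel smoothing: X[:, j] / s[j] and W[j, :] * s[j]."""
    n_ch = len(W)
    s = []
    for j in range(n_ch):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(v) for v in W[j])
        s.append(act_max ** alpha / w_max ** (1 - alpha))
    X_s = [[row[j] / s[j] for j in range(n_ch)] for row in X]
    W_s = [[v * s[j] for v in W[j]] for j in range(n_ch)]
    return X_s, W_s


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


X = [[0.5, 40.0], [-0.3, -35.0]]   # tokens x channels; channel 1 has outliers
W = [[0.2, -0.1], [0.05, 0.04]]    # channels x outputs

X_s, W_s = smooth(X, W)
before = max(abs(v) for row in X for v in row)
after = max(abs(v) for row in X_s for v in row)
print(f"activation absmax: {before:.2f} -> {after:.2f}")
print("outputs match:", all(
    abs(a - b) < 1e-9
    for r1, r2 in zip(matmul(X, W), matmul(X_s, W_s))
    for a, b in zip(r1, r2)
))
```

The outlier difficulty is shifted from activations, which are hard to quantize, into weights, which tolerate it better; the full method then quantizes both.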
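Performance Analysis
LLM.int8() stores weights in INT8 and keeps outlier feature dimensions in FP16, which roughly halves weight memory with little measured accuracy loss on most benchmarks (often quoted as under 0.5%). Note that it is not a weight-only method: the bulk of each matrix multiply runs in INT8 on both weights and activations. INT4 weight-only PTQ (GPTQ/AWQ) achieves close to 4× weight compression, typically with around 1–2% accuracy degradation. INT4 QAT can come closer to FP16 quality but adds training compute (10–20% is a rough estimate that depends on the setup), so it is usually justified only when PTQ misses the quality bar or when inference cost dominates, as in edge deployment. Activation quantization with SmoothQuant enables INT8 weight-and-activation inference on 175B-scale models, with a reported speedup of up to 1.56× over FP16.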
Visualizing a Quantization Strategy Flow
flowchart TD
A[Start with FP16 or BF16 LLM] --> B[Choose Timing Axis]
B --> C{PTQ or QAT?}
C -->|PTQ| D[Select Scope: Weights Only]
C -->|QAT| E[Train with Quantization Simulation]
D --> F[Select Mapping: Symmetric Asymmetric NF4]
E --> F
F --> G[Benchmark Memory Latency Quality]
G --> H{Targets met?}
H -- No --> I[Expand Scope or Adjust Mapping]
I --> G
H -- Yes --> J[Canary Deploy + Fallback]
J --> K[Production Rollout]
This flowchart shows the complete quantization strategy selection process from a pre-trained model to production deployment. After choosing between PTQ (which flows immediately into scope and mapping selection) and QAT (which requires a fine-tuning pass with simulated quantization noise), both paths converge at a benchmarking gate covering memory, latency, and quality. If targets are not met, the loop expands scope or adjusts the numerical mapping; once satisfied, a canary deployment validates stability before full rollout. The key takeaway is that quantization is iterative: commit to production only after the benchmark gate passes.
Real-World Applications: Input, Process, Output
Case study 1: Customer support assistant on shared GPUs
| Stage | Details |
| Input | Multilingual support prompts, medium context, strict p95 latency |
| Process | PTQ + INT8 weights first, then selective activation quantization on stable layers |
| Output | Lower memory usage, better concurrency, acceptable quality drift |
Case study 2: On-prem legal drafting assistant with long context
| Stage | Details |
| Input | Long-context legal prompts with domain terms |
| Process | Weight-only 4-bit in most layers, output head and selected attention blocks kept BF16 |
| Output | Model fits target hardware, but long-context eval required extra iteration |
Both cases succeed by sequencing decisions across timing, scope, and mapping instead of maximizing compression immediately.
Trade-offs and Failure Modes You Should Expect
| Trade-off or failure mode | What it looks like in production | Mitigation |
| Memory saved, quality drops | Answers remain fluent but become less accurate | Add task-specific eval thresholds before rollout |
| Low-bit model, no latency gain | Smaller model but unchanged p95 | Validate backend kernel support early |
| Activation drift | Good offline metrics, bad real traffic performance | Use representative calibration and shadow traffic |
| Over-quantized sensitive layers | Hallucinations or format breakage in structured tasks | Keep selective layers in higher precision |
| Aggressive scope change | Improvements on average, poor long-tail reliability | Canary release with automated rollback |
Intermediate-level rule: do not ship quantization based on memory metrics alone. Always include task-quality and tail-latency checks.
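One way to make that rule enforceable is a release gate that checks every budget, not just memory. The sketch below is illustrative only: the `Metrics` fields, the thresholds, and all numbers are made-up assumptions, and a real gate would read them from your benchmark and eval pipelines.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    memory_gb: float
    p95_latency_ms: float
    task_accuracy: float  # e.g. pass rate on a task-specific eval set


def gate(candidate, baseline, memory_budget_gb,
         max_latency_regression=0.0, max_accuracy_drop=0.01):
    """Return the list of failed checks; an empty list means the candidate may ship."""
    failures = []
    if candidate.memory_gb > memory_budget_gb:
        failures.append("memory over budget")
    if candidate.p95_latency_ms > baseline.p95_latency_ms * (1 + max_latency_regression):
        failures.append("p95 latency regressed")
    if baseline.task_accuracy - candidate.task_accuracy > max_accuracy_drop:
        failures.append("task accuracy dropped too far")
    return failures


# Made-up numbers: the 4-bit candidate fits in memory but loses too much accuracy.
fp16 = Metrics(memory_gb=16.0, p95_latency_ms=420.0, task_accuracy=0.91)
int4 = Metrics(memory_gb=5.0, p95_latency_ms=390.0, task_accuracy=0.87)

print(gate(int4, fp16, memory_budget_gb=8.0))
```

A candidate that wins on memory and latency but fails the accuracy check is exactly the case that memory-only sign-off would let through.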
Decision Guide: Choosing by Constraint
| Situation | Recommendation |
| Use when | Start with PTQ + weight quantization (INT8 or safe 4-bit) when memory and cost are immediate problems. |
| Avoid when | Avoid activation quantization as the first move if you do not have production-like calibration/eval data. |
| Alternative | Use mixed precision: quantize most layers, keep sensitive modules higher precision. |
| Edge cases | For long context, tool use, or strict JSON output, run dedicated eval suites before full rollout. |
If deployment is blocked by memory, optimize scope first. If quality fails after PTQ, revisit timing (QAT). If two same-bit methods differ, inspect mapping and kernel support.
Practical Examples: Weight-First and Activation-Aware Paths
This section covers the two most common production quantization strategies: a weight-first path using 4-bit NF4 loading, shown as runnable code, and an activation-aware path (which can extend to the KV cache for long-context workloads), covered as guidance because it depends on the serving backend. The weight-first path is the lowest-risk starting point; the activation-aware path is the next logical escalation when memory pressure remains after weight-only quantization. As you read, focus on where quantization is applied in the pipeline: which layers are quantized, what data type is used for compute, and which components remain in higher precision.
Example 1: Weight quantization with 4-bit loading
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

# Weights are stored as 4-bit NF4; matrix multiplies run in bfloat16.
quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Quantization happens at load time; no separate calibration step is needed.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_cfg,
    device_map="auto",
)

prompt = "List three trade-offs of LLM quantization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=80)

print(tokenizer.decode(output[0], skip_special_tokens=True))
What this demonstrates: a weight-first quantization strategy that is widely used for fast prototyping and production pilots.
Activation quantization in full LLM stacks is backend-dependent. Apply it after a weight-only baseline passes, then validate with production-like prompts, long-context tests, and structured-output checks.
AutoGPTQ, AutoAWQ, and bitsandbytes: Quantization Libraries in Practice
bitsandbytes (the bnb library by Tim Dettmers) integrates directly with the HuggingFace transformers from_pretrained() API to load models in INT8 or NF4 (4-bit) precision without a separate quantization step, which makes it the fastest path from a HuggingFace model card to a quantized inference session.
AutoGPTQ implements the GPTQ algorithm (layer-wise weight quantization that uses approximate second-order, Hessian-based information computed from calibration data) for aggressive 4-bit quantization with better quality retention than naive round-to-nearest. AutoAWQ implements the AWQ (Activation-aware Weight Quantization) algorithm, which uses activation statistics to identify the small fraction (around 1%) of weight channels most important to output quality and protects them with per-channel scaling before quantization.
# --- bitsandbytes: NF4 (4-bit NormalFloat) via HuggingFace BitsAndBytesConfig ---
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",       # non-uniform 4-bit: better weight distribution fit
    bnb_4bit_use_double_quant=True,  # also quantize the per-block scales (saves roughly 0.4 bits per weight)
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_bnb = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_cfg,
    device_map="auto",
)

# --- AutoGPTQ: GPTQ 4-bit from a pre-quantized model checkpoint ---
from auto_gptq import AutoGPTQForCausalLM

model_gptq = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",
    model_basename="model",
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,  # set True to use Triton kernels on supported GPUs
)

# --- AutoAWQ: AWQ 4-bit from a pre-quantized model checkpoint ---
from awq import AutoAWQForCausalLM

model_awq = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    fuse_layers=True,  # swap in fused attention/MLP modules for faster inference
    trust_remote_code=False,
    safetensors=True,
)

# --- Comparison: same prompt, three quantization backends ---
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
prompt = "Explain the difference between PTQ and QAT in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

for name, model in [("bnb-nf4", model_bnb), ("gptq", model_gptq), ("awq", model_awq)]:
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=60)
    print(f"\n[{name}]", tokenizer.decode(out[0], skip_special_tokens=True))
| Library | Algorithm | Timing | Best for | Quality vs. NF4 |
| bitsandbytes | NF4 / INT8 | PTQ (load-time) | Fast prototyping, LoRA fine-tuning | Baseline |
| AutoGPTQ | GPTQ | PTQ (offline, calibration) | Inference-only deployments needing small model size | ~+0.5–1% on coding/math (indicative) |
| AutoAWQ | AWQ | PTQ (offline, activation-aware) | Low-bit deployment with quality-critical tasks | ~+0.5–1.5% on reasoning (indicative) |
fuse_layers=True in AutoAWQ replaces attention, MLP, and normalization modules with fused implementations that use optimized CUDA kernels. This is a runtime throughput optimization on top of the model size reduction; it is separate from the activation-aware scaling that AWQ uses to choose quantization parameters.
For a full deep-dive on AutoGPTQ calibration pipelines, AWQ activation channel analysis, and bitsandbytes QLoRA fine-tuning workflows, a dedicated follow-up post is planned.
Lessons Learned from Quantization Projects
- Classify decisions by timing, scope, and mapping before choosing tools.
- Weight quantization is usually the highest-ROI first step.
- Activation quantization can unlock additional speed, but calibration quality becomes critical.
- Same bit width does not mean same quality; mapping and granularity matter.
- Kernel compatibility can dominate real latency outcomes.
- Selective precision is often better than aggressive all-layer quantization.
TLDR: Summary & Key Takeaways
- LLM quantization is best understood as a 3-axis design space: timing, scope, and mapping.
- By Timing: PTQ is fast and practical; QAT helps recover quality when PTQ is insufficient.
- By Scope: start with weights, then add activations if needed.
- By Mapping: symmetric, asymmetric, and non-uniform mappings create different error behavior.
- Weight quantization is the most common production entry point.
- Activation quantization is powerful but requires stronger evaluation discipline.
- Production success depends on joint optimization of memory, latency, and task-level quality.
One-liner: The best quantization strategy is the one that meets your product SLA with the smallest quality compromise, not the one with the lowest bit count.