Topic: inference (4 articles across 2 sub-topics)
Sub-topic (3 articles)

Types of LLM Quantization: By Timing, Scope, and Mapping
TLDR: There is no single "best" LLM quantization. You classify and choose quantization along three axes: when you quantize (timing), what you quantize (scope), and how values are encoded (mapping). In practice, most teams start with weight quantizati...
GPTQ vs AWQ vs NF4: Choosing the Right LLM Quantization Pipeline
TLDR: GPTQ, AWQ, and NF4 all shrink LLMs, but they optimize different constraints. GPTQ focuses on post-training reconstruction error, AWQ protects salient weights for better quality at low bits, and NF4 offers practical 4-bit compression through bit...

LLM Model Quantization: Why, When, and How to Deploy Smaller, Faster Models
TLDR: Quantization converts high-precision model weights and activations (FP16/FP32) into lower-precision formats (INT8 or INT4) so LLMs run with less memory, lower latency, and lower cost. The key is choosing the right quantization method for your a...
Sub-topic (1 article)

Managed API LLMs vs Self-Hosted Models: When to Switch and When Not To
TLDR: Most teams should start with managed LLM APIs because they buy speed, reliability, model quality, and low operational burden. Move to self-hosted or open-weight models only when you have stable workloads, hard privacy or compliance constraints,...
