Topic: inference (4 articles across 2 sub-topics)
Sub-topic (3 articles)

Types of LLM Quantization: By Timing, Scope, and Mapping
TLDR: There is no single "best" LLM quantization. You classify and choose quantization along three axes: when you quantize (timing), what you quantize (scope), and how values are encoded (mapping). In practice, most teams start with weight quantizati...
GPTQ vs AWQ vs NF4: Choosing the Right LLM Quantization Pipeline
TLDR: GPTQ, AWQ, and NF4 all shrink LLMs, but they optimize different constraints. GPTQ focuses on post-training reconstruction error, AWQ protects salient weights for better quality at low bits, and NF4 offers practical 4-bit compression through bit...

LLM Model Quantization: Why, When, and How to Deploy Smaller, Faster Models
TLDR: Quantization converts high-precision model weights and activations (FP16/FP32) into lower-precision formats (INT8 or INT4) so LLMs run with less memory, lower latency, and lower cost. The key is choosing the right quantization method for your a...
Sub-topic (1 article)

Managed API LLMs vs Self-Hosted Models: When to Switch and When Not To
TLDR: Most teams should start with managed LLM APIs because they buy speed, reliability, model quality, and low operational burden. Move to self-hosted or open-weight models only when you have stable workloads, hard privacy or compliance constraints,...
