Inference

Learn Inference as a connected topic across chapters, concepts, simulations, and interview reasoning.

10 Concepts, 8 Articles, 2h 9m

How this topic helps

AI
LLM
Deep Learning
Quantization

Articles

8 matched articles

Article 1: Managed API LLMs vs Self-Hosted Models: When to Switch and When Not To (17 min)
TLDR: Most teams should start with managed LLM APIs because they buy speed, reliability, model quality, and low operational burden. Move to self-hosted or open-weight models only when you have stable ...

Article 2: Types of LLM Quantization: By Timing, Scope, and Mapping (17 min)
TLDR: There is no single "best" LLM quantization. You classify and choose quantization along three axes: when you quantize (timing), what you quantize (scope), and how values are encoded (mapping). In ...

Article 3: GPTQ vs AWQ vs NF4: Choosing the Right LLM Quantization Pipeline (15 min)
TLDR: GPTQ, AWQ, and NF4 all shrink LLMs, but they optimize different constraints. GPTQ focuses on post-training reconstruction error, AWQ protects salient weights for better quality at low bits, and ...

Article 4: LLM Model Quantization: Why, When, and How to Deploy Smaller, Faster Models (13 min)
TLDR: Quantization converts high-precision model weights and activations (FP16/FP32) into lower-precision formats (INT8 or INT4) so LLMs run with less memory, lower latency, and lower cost. The key is ...

Article 5: Sparse Mixture of Experts: How MoE LLMs Do More With Less Compute (27 min)
TLDR: Mixture of Experts (MoE) replaces the single dense Feed-Forward Network (FFN) layer in each Transformer block with N independent expert FFNs plus a learned router. Only the top-K experts activat...

Article 6: MLOps Model Serving and Monitoring Patterns for Production Readiness (13 min)
TLDR: Production ML reliability depends on joining inference serving, data-quality signals, and rollback automation into one operating loop. This dedicated deep dive focuses on the internals, ...
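To make the Article 4 summary concrete, here is a minimal sketch of what "converting high-precision weights into INT8" can look like. It assumes symmetric, per-tensor quantization with round-to-nearest, which is only one of several possible schemes and is not taken from the articles themselves. The function names `quantize_int8` and `dequantize_int8` are hypothetical, chosen for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization (an assumed scheme for illustration).

    Maps each float to an integer in [-127, 127] using a single scale
    derived from the largest absolute value in the tensor.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    # Avoid division by zero for an all-zero (or empty) tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]


# Toy weights standing in for a slice of a model tensor (made-up values).
weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# The round trip is lossy: each value can move by up to half a step (scale / 2).
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as one small integer instead of a 16- or 32-bit float, plus one shared scale for the tensor. The cost is rounding error: small values such as `0.003` can collapse to zero, which is why the choice of scheme matters for quality.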

Page 1 of 2