Abstract Algorithms




Read in sequence

1. Practical LLM Quantization in Colab: A Hugging Face Walkthrough (15 min)
   TLDR: This is a practical, notebook-style quantization guide for Google Colab and Hugging Face. You will quantize real models, run inference, compare memory/latency, and learn when to use 4-bit NF4 vs...

2. A Beginner's Guide to Vector Database Principles (14 min)
   TLDR: A vector database stores meaning as numbers so you can search by intent, not exact keywords. That is why "reset my password" can find "account recovery steps" even if the words are different.

3. LLM Model Quantization: Why, When, and How to Deploy Smaller, Faster Models (13 min)
   TLDR: Quantization converts high-precision model weights and activations (FP16/FP32) into lower-precision formats (INT8 or INT4) so LLMs run with less memory, lower latency, and lower cost. The key is...

4. Dot Product in Machine Learning: The Engine Behind Similarity, Attention, and Neural Networks (22 min)
   TLDR: The dot product multiplies corresponding elements of two vectors and sums the results. In machine learning it does three critical jobs: it scores semantic similarity between embeddings, computes...

5. Sparse Mixture of Experts: How MoE LLMs Do More With Less Compute (27 min)
   TLDR: Mixture of Experts (MoE) replaces the single dense Feed-Forward Network (FFN) layer in each Transformer block with N independent expert FFNs plus a learned router. Only the top-K experts activat...

6. Softmax Function Explained: From Raw Scores to Probabilities (23 min)
   TLDR: Softmax converts a vector of raw scores (logits) into a valid probability distribution by exponentiating each value and dividing by the total. Subtracting the max before exponentiating prevents...

7. Dense LLM Architecture: How Every Parameter Works on Every Token (24 min)
   TLDR: In a dense LLM every single parameter is active for every token in every forward pass: no routing, no selection. A transformer block runs multi-head self-attention (Q, K, V) followed by a feed-...

8. Managed API LLMs vs Self-Hosted Models: When to Switch and When Not To (17 min)
   TLDR: Most teams should start with managed LLM APIs because they buy speed, reliability, model quality, and low operational burden. Move to self-hosted or open-weight models only when you have stable...

9. Types of LLM Quantization: By Timing, Scope, and Mapping (17 min)
   TLDR: There is no single "best" LLM quantization. You classify and choose quantization along three axes: when you quantize (timing), what you quantize (scope), and how values are encoded (mapping). In...

10. GPTQ vs AWQ vs NF4: Choosing the Right LLM Quantization Pipeline (15 min)
    TLDR: GPTQ, AWQ, and NF4 all shrink LLMs, but they optimize different constraints. GPTQ focuses on post-training reconstruction error, AWQ protects salient weights for better quality at low bits, and...

11. Why Embeddings Matter: Solving Key Issues in Data Representation (14 min)
    TLDR: Embeddings convert words (and images, users, products) into dense numerical vectors in a geometric space where semantic similarity = geometric proximity. "King - Man + Woman ≈ Queen" is not magi...

12. How Transformer Architecture Works: A Deep Dive (18 min)
    TLDR: The Transformer is the architecture behind every major LLM (GPT, BERT, Claude, Gemini). Its core innovation is Self-Attention, a mechanism that lets the model weigh relationships between all to...
