Build Karpathy's nanochat from Scratch for Free Using AMD Credits
Part 1 of Building nanochat from Scratch
Subscribe to receive digestible installments on building Karpathy’s nanochat from scratch, where I will be reproducing and training nanochat on:
MI300X
1 GPU - 192 GB VRAM - 20 vCPU - 240 GB RAM
Boot disk: 720 GB NVMe - Scratch disk: 5 TB NVMe
$1.99/GPU/hr
nanochat was trained on an 8xH100 box from e.g. Lambda GPU Cloud. Karpathy has provided “speedrun.sh” and “run1000.sh” for training nanochat d20 (depth = 20 layers, ~561M parameters) and the larger nanochat d32 (depth = 32 layers), respectively; each runs the entire pipeline start to end, based on your budget.
So you might wonder: if you can run the entire pipeline just by launching the .sh files, why take on the burden of figuring out every dependency and the endless loop of debugging?
The answer: complimentary developer credits from AMD. So it won’t even cost $100, and what you walk away with is end-to-end expertise in PyTorch and the repository, enough to train the best ChatGPT you can.
Tag along to be part of the journey and build with me.
File Structure:
.
├── LICENSE
├── README.md
├── dev
│ ├── gen_synthetic_data.py # Example synthetic data for identity
│ ├── generate_logo.html
│ ├── nanochat.png
│ ├── repackage_data_reference.py # Pretraining data shard generation
│ └── runcpu.sh # Small example of how to run on CPU/MPS
├── nanochat
│ ├── __init__.py # empty
│ ├── adamw.py # Distributed AdamW optimizer
│ ├── checkpoint_manager.py # Save/Load model checkpoints
│ ├── common.py # Misc small utilities, quality of life
│ ├── configurator.py # A superior alternative to argparse
│ ├── core_eval.py # Evaluates base model CORE score (DCLM paper)
│ ├── dataloader.py # Tokenizing Distributed Data Loader
│ ├── dataset.py # Download/read utils for pretraining data
│ ├── engine.py # Efficient model inference with KV Cache
│ ├── execution.py # Allows the LLM to execute Python code as tool
│ ├── gpt.py # The GPT nn.Module Transformer
│ ├── logo.svg
│ ├── loss_eval.py # Evaluate bits per byte (instead of loss)
│ ├── muon.py # Distributed Muon optimizer
│ ├── report.py # Utilities for writing the nanochat Report
│ ├── tokenizer.py # BPE Tokenizer wrapper in style of GPT-4
│ └── ui.html # HTML/CSS/JS for nanochat frontend
├── pyproject.toml
├── run1000.sh # Train the ~$800 nanochat d32
├── rustbpe # Custom Rust BPE tokenizer trainer
│ ├── Cargo.lock
│ ├── Cargo.toml
│ ├── README.md # see for why this even exists
│ └── src
│ └── lib.rs
├── scripts
│ ├── base_eval.py # Base model: calculate CORE score
│ ├── base_loss.py # Base model: calculate bits per byte, sample
│ ├── base_train.py # Base model: train
│ ├── chat_cli.py # Chat model (SFT/Mid): talk to over CLI
│ ├── chat_eval.py # Chat model (SFT/Mid): eval tasks
│ ├── chat_rl.py # Chat model (SFT/Mid): reinforcement learning
│ ├── chat_sft.py # Chat model: train SFT
│ ├── chat_web.py # Chat model (SFT/Mid): talk to over WebUI
│ ├── mid_train.py # Chat model: midtraining
│ ├── tok_eval.py # Tokenizer: evaluate compression rate
│ └── tok_train.py # Tokenizer: train it
├── speedrun.sh # Train the ~$100 nanochat d20
├── tasks
│ ├── arc.py # Multiple choice science questions
│ ├── common.py # TaskMixture | TaskSequence
│ ├── customjson.py # Make Task from arbitrary jsonl convos
│ ├── gsm8k.py # 8K Grade School Math questions
│ ├── humaneval.py # Misnomer; Simple Python coding task
│ ├── mmlu.py # Multiple choice questions, broad topics
│ ├── smoltalk.py # Conglomerate dataset of SmolTalk from HF
│ └── spellingbee.py # Task teaching model to spell/count letters
├── tests
│ ├── test_engine.py
│ └── test_rustbpe.py
└── uv.lock
Note that this tree is from the original repository and might change as I build. On Day 1 we will be looking at nanochat/gpt.py, as it is the heart of the project. I strongly suggest completing the playlist Neural Networks: Zero to Hero first, as most of what follows builds on top of that series.
Key things to be discussed moving forward that differ from the original NanoGPT:
LayerNorm replaced with RMSNorm
Absolute positional encoding replaced with rotary positional encoding (RoPE)
Multi-Head Attention replaced with Grouped Query Attention (GQA)
RMSNorm:
The “Attention Is All You Need” paper used LayerNorm with learnable parameters.
In the original Transformer, LayerNorm was applied to the output of each sub-layer (attention and FFN) and before the final output. It had learnable parameters (scale γ and bias β) to adjust the normalized values. It was later replaced with RMSNorm in many modern models, because RMSNorm is simpler and the trade-off in quality is minimal.
Comparing LayerNorm with RMSNorm:
# Input: x = [2.0, 4.0, 6.0]
# LayerNorm:
mean = (2+4+6)/3 = 4.0
std = sqrt(((2-4)² + (4-4)² + (6-4)²)/3) = sqrt(8/3) ≈ 1.63
normalized = (x - 4.0) / 1.63 = [-1.23, 0, 1.23]
final = normalized * gamma + beta # Learnable adjustment
# RMSNorm:
RMS = sqrt((4+16+36)/3) = sqrt(56/3) ≈ 4.32
final = x / 4.32 = [0.46, 0.93, 1.39]  # Simpler!
RMSNorm makes training simpler and faster, and empirically it works just as well for large models. The modified code snippet from nanochat:
def norm(x):
    # RMSNorm without learnable parameters
    return F.rms_norm(x, (x.size(-1),))
Rotary Positional Encoding (RoPE):
First, let’s understand how absolute positional encoding works, using GPT-2 as the example:
After tokenization, each token is converted into an embedding of size 768, and a positional encoding is added to it. The positional encoding is calculated using the formula below:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Where:
pos = position in sequence (0, 1, 2, ...)
i = dimension index (0, 1, 2, ..., d_model/2 - 1)
d_model = embedding dimension (e.g., 768)
Let’s simplify and look at positional encoding of tokens using the example below:
import torch
import math

# Settings from original Transformer
d_model = 4  # Small for example
seq_len = 3

# Generate sinusoidal positional encodings
def get_positional_encoding(pos, d_model):
    pe = torch.zeros(d_model)
    for i in range(d_model // 2):
        denominator = 10000 ** (2 * i / d_model)
        pe[2*i] = math.sin(pos / denominator)    # Even indices
        pe[2*i+1] = math.cos(pos / denominator)  # Odd indices
    return pe

# Position vectors for positions 0, 1, 2
p0 = get_positional_encoding(0, d_model)
p1 = get_positional_encoding(1, d_model)
p2 = get_positional_encoding(2, d_model)

print(f"Position 0 encoding: {p0}")
print(f"Position 1 encoding: {p1}")
print(f"Position 2 encoding: {p2}")
Output:
Position 0 encoding: tensor([0., 1., 0., 1.])
Position 1 encoding: tensor([0.8415, 0.5403, 0.0100, 0.9999])
Position 2 encoding: tensor([ 0.9093, -0.4161, 0.0200, 0.9998])
In other words,
Position 0 encoding: tensor([sin(0·ω_0), cos(0·ω_0), sin(0·ω_1), cos(0·ω_1)])
Position 1 encoding: tensor([sin(1·ω_0), cos(1·ω_0), sin(1·ω_1), cos(1·ω_1)])
Position 2 encoding: tensor([sin(2·ω_0), cos(2·ω_0), sin(2·ω_1), cos(2·ω_1)])
where ω_i = 1/10000^(2i/d_model), so here ω_0 = 1 and ω_1 = 0.01.
Now, we will call the Position 0 encoding p0 (and similarly p1, p2) and look at what actually happens during self attention.
Q = Wq*(Emb + p) + b
K = Wk*(Emb + p) + b
V = Wv*(Emb + p) + b
The formula looks fancy, but under the hood it is just matrix multiplication, scaling, and dot products. The key term to look at here is QK^T.
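As a toy illustration (my own sketch, not nanochat code), the attention score matrix is just QK^T, scaled and softmaxed:

import torch

B, T, d = 1, 3, 4                              # batch, sequence length, head dim
Q = torch.randn(B, T, d)
K = torch.randn(B, T, d)
scores = Q @ K.transpose(-2, -1) / d ** 0.5    # (B, T, T): every q_m · k_n, scaled
weights = scores.softmax(dim=-1)               # attention weights per query position
print(weights.shape)                           # torch.Size([1, 3, 3])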
For simplicity, while looking at self attention we will call the position-augmented vectors q_with_pos and k_with_pos.
# The extra linear weight multiplications are neglected for simplicity
q_with_pos0 = q0 + p0
k_with_pos0 = k0 + p0
q_with_pos1 = q1 + p1
k_with_pos1 = k1 + p1
Let's expand the dot product of q_m and k_n to see what the positional encoding actually contributes:
(q_m + p_m)·(k_n + p_n) = q_m·k_n + q_m·p_n + p_m·k_n + p_m·p_n
                             ↑          ↑          ↑         ↑
                          Content   Content×   Position×   Pure
                            only    Position    Content   Position
If we look at the term p_m·p_n:
PE(m,2i)*PE(n,2i) + PE(m,2i+1)*PE(n,2i+1)
= sin(ω_i*m)*sin(ω_i*n) + cos(ω_i*m)*cos(ω_i*n)
= cos(ω_i*(m - n))  # Trigonometric identity!
The terms q_m·p_n and p_m·k_n, however, mix content and position arbitrarily:
q_m·p_n: “How much does the content of q align with the position of k?”
p_m·k_n: “How much does the position of q align with the content of k?”
These are unnatural relationships that don’t correspond to any linguistic intuition!
The original authors of the Transformer paper knew about the p_m·p_n = cos(relative_position) property, but didn’t realize (in 2017) that adding position vectors creates problematic cross-terms. It took until 2021 (RoPE paper) to find a cleaner solution using rotation instead of addition!
Now with the knowledge of Absolute Positional Encoding, we will be able to appreciate Rotary Position Embedding (RoPE).
def apply_rotary_emb(x, cos, sin):
    assert x.ndim == 4  # (b, h, s, d) for multi-head attention
    d = x.shape[3] // 2
    x1, x2 = x[..., :d], x[..., d:]  # split last dim into two halves
    y1 = x1 * cos + x2 * sin
    y2 = -x1 * sin + x2 * cos
    out = torch.cat([y1, y2], dim=3)  # re-assemble
    out = out.to(x.dtype)
    return out
Do not get intimidated by the code here; once we know the math and intuition behind each term, it becomes clear.
The assert statement (plain Python, not PyTorch-specific) verifies correctness during development by checking that a condition holds. Here, assert x.ndim == 4 confirms the input has the expected (batch, heads, sequence, head_dim) shape.
RoPE applies a rotation to pairs of coordinates, carried out with sin and cos terms built on the same frequency schedule as absolute encoding. Since a 2D rotation needs two elements, the head dimension is split into two halves:
# To carry out rotation using sinusoidal terms
d = x.shape[3] // 2
x1, x2 = x[..., :d], x[..., d:]  # split last dim into two halves
This concept is better understood with geometry than algebra, so let's put on the geometry cap.
import torch
import math

# Original vectors
q = torch.tensor([1.0, 0.0])
k = torch.tensor([0.0, 1.0])

# Rotation angles (m=1, n=2)
m, n = 1, 2
theta = 1.0  # For simplicity
This is what actually happens pair-wise when a tensor is passed into apply_rotary_emb(x, cos, sin).
Now, we will apply self attention and work through the derivation:
# Apply rotations
Original: q = [a, b], k = [c, d]
Positions: m for q, n for k
After rotary embedding at position m:
q_rot = [a*cos(mθ) + b*sin(mθ), -a*sin(mθ) + b*cos(mθ)]
After rotary embedding at position n:
k_rot = [c*cos(nθ) + d*sin(nθ), -c*sin(nθ) + d*cos(nθ)]
Dot product: q_rot·k_rot =
(a*cos(mθ) + b*sin(mθ))*(c*cos(nθ) + d*sin(nθ))
+ (-a*sin(mθ) + b*cos(mθ))*(-c*sin(nθ) + d*cos(nθ))
After simplification using trig identities, this equals:
(a*c + b*d)*cos((m-n)θ) + (b*c - a*d)*sin((m-n)θ)
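We can sanity-check this result numerically; a small sketch of mine that reuses the q, k, m, n, theta values from the snippet above (rotate2d is just an illustrative helper):

import math
import torch

# reusing q, k, m, n, theta from the snippet above
q = torch.tensor([1.0, 0.0])
k = torch.tensor([0.0, 1.0])
m, n, theta = 1, 2, 1.0

def rotate2d(v, angle):
    # same per-pair rotation as apply_rotary_emb: y1 = x1*cos + x2*sin, y2 = -x1*sin + x2*cos
    x1, x2 = v[0].item(), v[1].item()
    c, s = math.cos(angle), math.sin(angle)
    return torch.tensor([x1 * c + x2 * s, -x1 * s + x2 * c])

q_rot = rotate2d(q, m * theta)   # rotate q by its position angle m*theta
k_rot = rotate2d(k, n * theta)   # rotate k by its position angle n*theta
lhs = torch.dot(q_rot, k_rot).item()

a, b = q.tolist()
c, d = k.tolist()
rel = (m - n) * theta
rhs = (a * c + b * d) * math.cos(rel) + (b * c - a * d) * math.sin(rel)
print(lhs, rhs)  # both ≈ 0.8415; the score depends only on the offset (m - n)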
With rotary encoding:
Attention(m,n) = ContentSimilarity × cos(relative_position)
               + ContentRelationship × sin(relative_position)
Content Similarity = (a*c + b*d):
This is just the regular dot product of the two content vectors; it measures how similar the token meanings are.
Intuition: If both vectors point in similar directions in semantic space, this value is large. It’s like asking: “How semantically similar are ‘dog’ and ‘pet’?”
High when tokens share similar meanings (e.g., “king” and “queen”)
Low when tokens are unrelated (e.g., “king” and “pizza”)
Content Relationship = (b*c - a*d):
This is the 2D cross product (or perpendicular dot product); it measures the orthogonal relationship between the two tokens.
Intuition: This captures how the vectors are oriented relative to each other in the 2D plane. It’s largest when vectors are perpendicular, zero when aligned or opposite.
Captures relationships like:
Subject-Verb: “dog” → “barks”
Adjective-Noun: “red” → “apple”
Hypernym-Hyponym: “animal” → “dog”
It’s the “grammatical” or “relational” component
Key insight: Rotary encoding preserves the semantic structure while incorporating position through clean trigonometric modulation, whereas additive encoding creates messy cross-terms that mix content and position arbitrarily.
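Putting the pieces together, here is a hypothetical sketch (mine, not nanochat's gpt.py) of how the cos/sin tables that feed apply_rotary_emb are typically precomputed; precompute_rotary is an illustrative name, and the usage assumes the apply_rotary_emb function shown earlier is defined:

import torch

def precompute_rotary(seq_len, head_dim, base=10000):
    # hypothetical helper: one frequency per rotated pair, same 1/base^(2i/d) schedule
    half = head_dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(half).float() / half))
    pos = torch.arange(seq_len).float()
    angles = torch.outer(pos, inv_freq)          # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    # add leading dims so they broadcast over (batch, heads, seq, half)
    return cos[None, None, :, :], sin[None, None, :, :]

# usage with the apply_rotary_emb defined above, on a dummy query tensor
b, h, s, d = 1, 2, 5, 8
q = torch.randn(b, h, s, d)
cos, sin = precompute_rotary(s, d)
q_rot = apply_rotary_emb(q, cos, sin)            # assumes apply_rotary_emb from above
print(q_rot.shape)                               # torch.Size([1, 2, 5, 8])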
Grouped Query Attention (GQA):
We know large language models are autoregressive; if you look at the self-attention head figure, not all calculations are redone for every incoming token. We store the outputs of the Keys (K) and Values (V) projections, known as the KV Cache, which grows proportionally with the context length. Refer to my previous post, which discusses the KV Cache at length → Pub-1.
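As a rough back-of-the-envelope sketch (illustrative GPT-2-ish numbers, not nanochat's exact config), the per-sequence KV Cache scales with layers × K/V heads × head size × sequence length, so shrinking the number of K/V heads shrinks it directly:

# 2 tensors (K and V) per layer, each of size kv_heads * head_dim per token
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_elem=2):  # bf16
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_elem

# hypothetical config: full MHA (12 K/V heads) vs. 4 shared K/V heads (GQA)
mha = kv_cache_bytes(n_layers=12, n_kv_heads=12, d_head=64, seq_len=2048)
gqa = kv_cache_bytes(n_layers=12, n_kv_heads=4, d_head=64, seq_len=2048)
print(mha / 2**20, gqa / 2**20)  # 72.0 MiB vs 24.0 MiB per sequence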
In standard MHA, each attention head has its own separate Q, K, V projections:
Dimensions (GPT-2 Small):
d_model = 768
n_heads = 12
d_head = d_model / n_heads = 64
Q projection: (batch, seq, 768) × (768, 768) → (batch, seq, 768)
K projection: (batch, seq, 768) × (768, 768) → (batch, seq, 768)
V projection: (batch, seq, 768) × (768, 768) → (batch, seq, 768)
Total params for QKV: 3 × 768 × 768 = 1,769,472
GQA reduces memory by sharing K, V projections across groups of Q heads:
Dimensions for GQA:
d_model = 768
n_heads = 12 (number of Q heads)
n_groups = 4 (number of K, V groups)
group_size = n_heads / n_groups = 3 (Q heads per group)
Q projection: (batch, seq, 768) × (768, 768) → (batch, seq, 768)
K projection: (batch, seq, 768) × (768, 256) → (batch, seq, 256)
V projection: (batch, seq, 768) × (768, 256) → (batch, seq, 256)
Total params for QKV: 768×768 + 2×768×256 = 589,824 + 393,216 = 983,040
You can see the reduction in the size of the KV Cache with GQA: roughly 44% fewer parameters for the QKV projections, with very little reduction in quality (a toy sketch follows below). Also, there is MLA (Multi-Head Latent Attention), proposed by DeepSeek, where the token embeddings are converted into projections using latent projection matrices; that deserves a whole blog of its own, so do subscribe to be part of that implementation too.
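To make these numbers concrete, here is a minimal toy sketch of the GQA projections (my own illustration, not nanochat's actual attention code), sharing 4 K/V heads across 12 query heads:

import torch
import torch.nn as nn
import torch.nn.functional as F

# 12 query heads share 4 K/V heads, so K/V project to 4 * 64 = 256 dims
d_model, n_heads, n_kv_heads, d_head = 768, 12, 4, 64

wq = nn.Linear(d_model, n_heads * d_head, bias=False)     # 768 -> 768
wk = nn.Linear(d_model, n_kv_heads * d_head, bias=False)  # 768 -> 256
wv = nn.Linear(d_model, n_kv_heads * d_head, bias=False)  # 768 -> 256

x = torch.randn(1, 16, d_model)                            # (batch, seq, d_model)
q = wq(x).view(1, 16, n_heads, d_head).transpose(1, 2)     # (1, 12, 16, 64)
k = wk(x).view(1, 16, n_kv_heads, d_head).transpose(1, 2)  # (1, 4, 16, 64)
v = wv(x).view(1, 16, n_kv_heads, d_head).transpose(1, 2)  # (1, 4, 16, 64)

# each group of 3 query heads reuses the same K/V head
k = k.repeat_interleave(n_heads // n_kv_heads, dim=1)      # (1, 12, 16, 64)
v = v.repeat_interleave(n_heads // n_kv_heads, dim=1)
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(y.shape)  # torch.Size([1, 12, 16, 64])

qkv_params = sum(p.numel() for lin in (wq, wk, wv) for p in lin.parameters())
print(qkv_params)  # 983,040 (vs 1,769,472 for full MHA)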
What’s Next? Implementation of nanochat
nanochat.muon.Muon → The Muon optimizer, used for the Transformer's matrix (2D) weights; the model architecture itself lives in nanochat/gpt.py as the GPT nn.Module.
nanochat.muon.DistMuon → The distributed version of the Muon optimizer; it coordinates gradients and parameter updates across multiple GPUs/nodes when training on more than one device.
nanochat.adamw.DistAdamW → A distributed implementation of the AdamW optimizer. It handles sharding the optimizer states and gradients across devices, which is critical for memory efficiency as models grow.
Thanks for reading The Atoms of AI! This post is public so feel free to share it.






