Bonus Lecture | Fundamentals of Programming
Vaswani et al., Google Brain & Google Research
Goal: You should understand WHY this paper changed the world - not every mathematical detail.
Imagine you're translating: "The cat sat on the mat"
The problem: To correctly translate "sat", the computer needs to know WHO is sitting (the cat). But "cat" came EARLIER in the sentence!
Words are processed one after another:
Each word "remembers" the previous ones.
Example: "The cat, who was in the garden yesterday and played with the ball, is tired."
RNNs often forget "The cat" by the time they reach "is"!
Attention mechanism:
"For each word: Look at ALL other words and decide which ones are important."
When translating "sat":
"cat" gets high attention - it's the subject!
📐 The Real Dimensions: In practice, Q, K, V are matrices with shape (sequence_length × d_model), e.g. (512 × 512). Each row = one word's vector. For simplicity, we'll show examples with small numbers.
Why KT (transpose)?
1. Dimension matching: Q is (n×d), K is (n×d). Matrix multiplication requires inner dimensions to match!
• Q (n×d) × K (n×d) → ❌ doesn't work! (d ≠ n)
• Q (n×d) × KT (d×n) → ✓ works! (d = d)
2. Row × Column rule: Each row of Q multiplies with each column of KT (which was a row in K)
3. Result: (n×n) matrix = attention score for every word pair!
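The dimension rule is easy to check directly in NumPy (a tiny sketch; the matrices here are random placeholders, not learned weights):

```python
import numpy as np

n, d = 3, 2                 # 3 words, 2 dimensions
Q = np.random.rand(n, d)    # queries, shape (3, 2)
K = np.random.rand(n, d)    # keys, shape (3, 2)

# Q @ K would fail: inner dimensions (2 and 3) don't match.
scores = Q @ K.T            # (3, 2) x (2, 3) -> (3, 3)
print(scores.shape)         # one score for every word pair
```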
√d = Square root of dimension
d = vector size (e.g. 512). Problem: Large d → huge dot products → Softmax outputs only 0 or 1.
Solution: Divide by √d keeps values in good range for softmax.
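Putting the pieces together, scaled dot-product attention fits in a few lines of NumPy (a minimal sketch without masking or batching):

```python
import numpy as np

def softmax(x):
    # subtract the max for numerical stability
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # similarity scores, scaled by sqrt(d)
    weights = softmax(scores)       # each row sums to 1
    return weights @ V              # weighted mix of value vectors
```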
Goal: Calculate similarity between EVERY Query and EVERY Key
K: Each ROW = one word's Key vector
Shape: 3 × 2
(3 words, 2 dimensions)
KT: Each COLUMN = one word's Key vector
Shape: 2 × 3
(flipped!)
If Q = K (same vectors), then Q × KT calculates dot product of every pair:
Result: 3×3 matrix = similarity score for every Query-Key pair!
Sentence: "I like cats" (3 words, d=2 dimensions)
For "cats" (Q=[1,1]):
Scores: [1, 1, 2]
√2 ≈ 1.41, so: [0.7, 0.7, 1.4]
"cats" pays most attention to itself!
Softmax([0.7, 0.7, 1.4]) ≈ [0.25, 0.25, 0.50]
Output = 0.25×[0.5,0.2] + 0.25×[0.3,0.8] + 0.50×[0.9,0.7] ≈ [0.65, 0.60]
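The mini-example can be verified numerically. The Key vectors below are assumptions chosen so that the query [1,1] produces the scores [1, 1, 2] from the slide:

```python
import numpy as np

q = np.array([1., 1.])                        # query for "cats"
K = np.array([[1., 0.], [0., 1.], [1., 1.]])  # assumed keys -> scores [1, 1, 2]
V = np.array([[0.5, 0.2], [0.3, 0.8], [0.9, 0.7]])

scores = K @ q / np.sqrt(2)                   # scaled: ~[0.71, 0.71, 1.41]
w = np.exp(scores) / np.exp(scores).sum()     # softmax: ~[0.25, 0.25, 0.50]
out = w @ V                                   # ~[0.65, 0.60]
print(np.round(w, 2), np.round(out, 2))
```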
Process all words simultaneously = dramatically faster training on GPUs
Every word can directly "see" EVERY other word - no more forgetting!
More data + more parameters = better results (Scaling Laws)
Title of the paper: "Attention Is All You Need" - No RNNs, no CNNs. ONLY Attention!
Encoder: understands the input
Example: BERT
Decoder: generates the output
Example: GPT
Multi-Head Attention: Multiple "heads" learn different relationships simultaneously!
One head alone is too simple. Different heads learn different things:
Learns subject-verb relationships
"cat" -> "sits"
Learns adjective-noun
"big" -> "cat"
Learns preposition-location
"on" -> "mat"
The original paper uses 8 heads. Large models like GPT-3 use 96 per layer!
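A multi-head sketch: each head projects the input into its own smaller subspace, runs attention there, and the head outputs are concatenated. The projection weights below are random placeholders, not trained parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head(X, num_heads=8, d_model=16):
    d_head = d_model // num_heads
    rng = np.random.default_rng(0)
    outputs = []
    for _ in range(num_heads):
        # each head gets its own (random placeholder) projections
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        w = softmax(Q @ K.T / np.sqrt(d_head))
        outputs.append(w @ V)                 # one head's view, shape (n, d_head)
    return np.concatenate(outputs, axis=-1)   # concatenate back to d_model

X = np.random.default_rng(1).standard_normal((3, 16))  # 3 words, d_model=16
print(multi_head(X).shape)                    # (3, 16)
```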
Problem: Attention processes all words simultaneously - but ORDER matters!
"Dog bites man" vs. "Man bites dog"
Solution: Add a unique "position information" to each word
Sine waves allow the model to learn relative positions!
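The sinusoidal positional encoding from the paper in a few lines: each position gets a unique pattern of sine/cosine values at different frequencies (a sketch assuming an even d_model):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]          # positions 0..seq_len-1
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims: sine
    pe[:, 1::2] = np.cos(angles)               # odd dims: cosine
    return pe

pe = positional_encoding(10, 8)
print(pe.shape)                                # (10, 8): added to the word vectors
```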
| Model | BLEU Score | Training Time |
|---|---|---|
| Previous Best (RNN) | 26.0 | Weeks |
| Transformer (Base) | 27.3 | 12 hours |
| Transformer (Big) | 28.4 | 3.5 days |
Better AND faster - that's rare in research!
GPT-2: 1.5B
GPT-3: 175B
GPT-4: ~1T?
Billions of webpages, books, code...
Thousands of GPUs, millions of dollars for training
Emergent Abilities: Beyond a certain size, models can suddenly do things they weren't explicitly trained for!
Important: LLMs don't "know" anything - they are very good pattern completers!
8 pages that changed the world.
My advice: Use AI tools, but understand their limits. That makes you better programmers!
"The future is already here - it's just not evenly distributed."
- William Gibson
This presentation was created with the support of Claude (Anthropic) - a Transformer-based AI model.