Bonus Lecture | Fundamentals of Programming
Vaswani et al., Google Brain & Google Research
Goal: You should understand WHY this paper changed the world - not every mathematical detail.
Imagine you're translating: "The cat sat on the mat"
The problem: To correctly translate "sat", the computer needs to know WHO is sitting (the cat). But "cat" came EARLIER in the sentence!
Words are processed one after another:
Each word "remembers" the previous ones.
Example: "The cat, who was in the garden yesterday and played with the ball, is tired."
RNNs often forget "The cat" by the time they reach "is"!
Attention mechanism:
"For each word: Look at ALL other words and decide which ones are important."
When translating "sat":
"cat" gets high attention - it's the subject!
📐 The Real Dimensions: In practice, Q, K, V are matrices with shape (sequence_length × d_model), e.g. (512 × 512). Each row = one word's vector. For simplicity, we'll show examples with small numbers.
Why KT (transpose)?
1. Dimension matching: Q is (n×d), K is (n×d). Matrix multiplication requires inner dimensions to match!
• Q (n×d) × K (n×d) → ❌ doesn't work! (d ≠ n)
• Q (n×d) × KT (d×n) → ✓ works! (d = d)
2. Row × Column rule: Each row of Q multiplies with each column of KT (which was a row in K)
3. Result: (n×n) matrix = attention score for every word pair!
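The dimension rule is easy to check directly in NumPy (a tiny sketch; the matrices here are random placeholders, not learned weights):

```python
import numpy as np

n, d = 3, 2                 # 3 words, 2 dimensions
Q = np.random.rand(n, d)    # queries, shape (3, 2)
K = np.random.rand(n, d)    # keys, shape (3, 2)

# Q @ K would fail: inner dimensions (2 and 3) don't match.
scores = Q @ K.T            # (3, 2) x (2, 3) -> (3, 3)
print(scores.shape)         # one score for every word pair
```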
√d = Square root of dimension
d = vector size (e.g. 512). Problem: Large d → huge dot products → Softmax outputs only 0 or 1.
Solution: Divide by √d keeps values in good range for softmax.
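Putting the pieces together, scaled dot-product attention fits in a few lines of NumPy (a minimal sketch without masking or batching):

```python
import numpy as np

def softmax(x):
    # subtract the max for numerical stability
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # similarity scores, scaled by sqrt(d)
    weights = softmax(scores)       # each row sums to 1
    return weights @ V              # weighted mix of value vectors
```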
Goal: Calculate similarity between EVERY Query and EVERY Key
K: Each ROW = one word's Key vector
Shape: 3 × 2
(3 words, 2 dimensions)
KT: Each COLUMN = one word's Key vector
Shape: 2 × 3
(flipped!)
If Q = K (same vectors), then Q × KT calculates dot product of every pair:
Result: 3×3 matrix = similarity score for every Query-Key pair!
Sentence: "I like cats" (3 words, d=2 dimensions)
For "cats" (Q=[1,1]):
Scores: [1, 1, 2]
√2 ≈ 1.41, so: [0.7, 0.7, 1.4]
"cats" pays most attention to itself!
Softmax([0.7, 0.7, 1.4]) ≈ [0.25, 0.25, 0.50]
Output = 0.25×[0.5,0.2] + 0.25×[0.3,0.8] + 0.50×[0.9,0.7] ≈ [0.65, 0.60]
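The mini-example can be verified numerically. The Key vectors below are assumptions chosen so that the query [1,1] produces the scores [1, 1, 2] from the slide:

```python
import numpy as np

q = np.array([1., 1.])                        # query for "cats"
K = np.array([[1., 0.], [0., 1.], [1., 1.]])  # assumed keys -> scores [1, 1, 2]
V = np.array([[0.5, 0.2], [0.3, 0.8], [0.9, 0.7]])

scores = K @ q / np.sqrt(2)                   # scaled: ~[0.71, 0.71, 1.41]
w = np.exp(scores) / np.exp(scores).sum()     # softmax: ~[0.25, 0.25, 0.50]
out = w @ V                                   # ~[0.65, 0.60]
print(np.round(w, 2), np.round(out, 2))
```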
Process all words simultaneously = dramatically faster training on GPUs
Every word can directly "see" EVERY other word - no more forgetting!
More data + more parameters = better results (Scaling Laws)
Title of the paper: "Attention Is All You Need" - No RNNs, no CNNs. ONLY Attention!
Encoder: understands the input
Example: BERT
Decoder: generates the output
Example: GPT
Multi-Head Attention: Multiple "heads" learn different relationships simultaneously!
One head alone is too simple. Different heads learn different things:
Learns subject-verb relationships
"cat" -> "sits"
Learns adjective-noun
"big" -> "cat"
Learns preposition-location
"on" -> "mat"
The original paper uses 8 heads. Large models like GPT-3 use 96 per layer!
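A multi-head sketch: each head projects the input into its own smaller subspace, runs attention there, and the head outputs are concatenated. The projection weights below are random placeholders, not trained parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head(X, num_heads=8, d_model=16):
    d_head = d_model // num_heads
    rng = np.random.default_rng(0)
    outputs = []
    for _ in range(num_heads):
        # each head gets its own (random placeholder) projections
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        w = softmax(Q @ K.T / np.sqrt(d_head))
        outputs.append(w @ V)                 # one head's view, shape (n, d_head)
    return np.concatenate(outputs, axis=-1)   # concatenate back to d_model

X = np.random.default_rng(1).standard_normal((3, 16))  # 3 words, d_model=16
print(multi_head(X).shape)                    # (3, 16)
```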
Problem: Attention processes all words simultaneously - but ORDER matters!
"Dog bites man" vs. "Man bites dog"
Solution: Add a unique "position information" to each word
Sine waves allow the model to learn relative positions!
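The sinusoidal positional encoding from the paper in a few lines: each position gets a unique pattern of sine/cosine values at different frequencies (a sketch assuming an even d_model):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]          # positions 0..seq_len-1
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims: sine
    pe[:, 1::2] = np.cos(angles)               # odd dims: cosine
    return pe

pe = positional_encoding(10, 8)
print(pe.shape)                                # (10, 8): added to the word vectors
```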
| Model | BLEU Score | Training Time |
|---|---|---|
| Previous Best (RNN) | 26.0 | Weeks |
| Transformer (Base) | 27.3 | 12 hours |
| Transformer (Big) | 28.4 | 3.5 days |
Better AND faster - that's rare in research!
GPT-2: 1.5B
GPT-3: 175B
GPT-4: ~1T?
Billions of webpages, books, code...
Thousands of GPUs, millions of dollars for training
Emergent Abilities: Beyond a certain size, models can suddenly do things they weren't explicitly trained for!
Important: LLMs don't "know" anything - they are very good pattern completers!
8 pages that changed the world.
My advice: Use AI tools, but understand their limits. That makes you better programmers!
"The future is already here - it's just not evenly distributed."
- William Gibson
This presentation was created with the support of Claude (Anthropic) - a Transformer-based AI model.