
Attention Is All You Need

The Revolution of the Transformer Architecture

Bonus Lecture | Fundamentals of Programming

"Attention Is All You Need"

Vaswani et al., Google Brain & Google Research

NeurIPS 2017 | Over 100,000 citations

Why should programmers understand this?

The AI Revolution

  • ChatGPT - based on Transformers
  • Claude - based on Transformers
  • GitHub Copilot - based on Transformers
  • DALL-E, Midjourney - use Transformers

For Your Career

  • AI tools are becoming standard
  • Understanding = better usage
  • New job opportunities
  • Critical thinking about AI

Goal: You should understand WHY this paper changed the world - not every mathematical detail.

The Problem: Processing Sequences

How does a computer understand language?

Imagine you're translating: "The cat sat on the mat"

The | cat | sat | on | the | mat

The problem: To correctly translate "sat", the computer needs to know WHO is sitting (the cat). But "cat" came EARLIER in the sentence!

Before 2017: Recurrent Neural Networks (RNNs)

How RNNs work

Words are processed one after another:

The -> cat -> sat -> on -> the -> mat

Each word "remembers" the previous ones.

The Problems

  • Slow: Sequential = no parallelization
  • Forgetful: Long sentences lose context
  • Training: Gradients vanish

Example: "The cat, who was in the garden yesterday and played with the ball, is tired."

RNNs often forget "The cat" by the time they reach "is"!
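The sequential bottleneck can be seen in a few lines of Python. This is a toy sketch, not a real RNN: the tanh update rule is standard, but the random embeddings and weights are placeholder assumptions purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # toy hidden/embedding size
words = ["The", "cat", "sat", "on", "the", "mat"]
embed = {w: rng.normal(size=d) for w in words}   # placeholder embeddings

W_h = rng.normal(size=(d, d)) * 0.1     # hidden-to-hidden weights
W_x = rng.normal(size=(d, d)) * 0.1     # input-to-hidden weights

h = np.zeros(d)                         # hidden state starts empty
for w in words:                         # strictly one word after another:
    h = np.tanh(W_h @ h + W_x @ embed[w])   # step t depends on step t-1
# 'h' must now carry EVERYTHING the model remembers about the sentence
print(h.shape)
```

Note the loop: step t cannot start before step t-1 is done, so nothing can run in parallel, and the whole sentence must be squeezed into one small vector `h`.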

The Revolutionary Idea: Attention

What if we see ALL words at once?

Attention mechanism:

"For each word: Look at ALL other words and decide which ones are important."

When translating "sat":

The
cat
sat
on
the
mat

"cat" gets high attention - it's the subject!

Self-Attention: The Core Idea

Three questions for each word:

  1. Query (Q):
    "What am I looking for?"
  2. Key (K):
    "What do I offer?"
  3. Value (V):
    "What is my content?"

The Process:

  1. Compare my Query with all Keys
  2. Calculate similarity scores
  3. Normalize with Softmax
  4. Weighted sum of Values

📐 The Real Dimensions: In practice, Q, K, V are matrices with shape (sequence_length × d_model), e.g. (512 × 512). Each row = one word's vector. For simplicity, we'll show examples with small numbers.

Attention(Q, K, V) = softmax(Q · Kᵀ / √d) · V

Why Kᵀ (transpose)?

1. Dimension matching: Q is (n×d), K is (n×d). Matrix multiplication requires the inner dimensions to match!

• Q (n×d) × K (n×d) → ❌ doesn't work! (d ≠ n)

• Q (n×d) × Kᵀ (d×n) → ✓ works! (d = d)

2. Row × column rule: Each row of Q multiplies with each column of Kᵀ (which was a row in K)

3. Result: (n×n) matrix = an attention score for every word pair!

√d = Square root of dimension

d = vector size (e.g. 512). Problem: large d → huge dot products → the softmax saturates and outputs only 0s and 1s.

Solution: Dividing by √d keeps the values in a good range for the softmax.
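The whole formula fits in a few lines of NumPy. This is a minimal sketch of scaled dot-product attention; the function and variable names are our own, and the random matrices at the end are just a shape check.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]                            # vector dimension
    scores = Q @ K.T / np.sqrt(d)              # (n×n) similarity matrix
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V                         # weighted sum of Values

# tiny shape check: 3 words, d = 2
rng = np.random.default_rng(1)
Q = rng.normal(size=(3, 2))
K = rng.normal(size=(3, 2))
V = rng.normal(size=(3, 2))
out = attention(Q, K, V)
print(out.shape)   # one output vector per word
```

Every step of the four-step process above maps to one line: compare (`Q @ K.T`), scale (`/ √d`), normalize (`softmax`), and take the weighted sum (`@ V`).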

🔄 Why Do We Need Kᵀ? (Transpose)

Goal: Calculate similarity between EVERY Query and EVERY Key

K (Original)

Each ROW = one word's Key vector

Word 1: [ 1, 2 ] ← row
Word 2: [ 3, 4 ] ← row
Word 3: [ 5, 6 ] ← row

Shape: 3 × 2
(3 words, 2 dimensions)

Kᵀ (Transposed)

Each COLUMN = one word's Key vector

   W1 W2 W3
[ 1, 3, 5 ]
[ 2, 4, 6 ]

Shape: 2 × 3
(flipped!)

💡 The Key Insight: How Q × Kᵀ Works

If Q = K (same vectors), then Q × Kᵀ calculates the dot product of every pair:

Q1=[1,2] · K1=[1,2]: 1×1 + 2×2 = 5
Q1=[1,2] · K2=[3,4]: 1×3 + 2×4 = 11
Q1=[1,2] · K3=[5,6]: 1×5 + 2×6 = 17
... and so on for Q2 and Q3

Result: 3×3 matrix = similarity score for every Query-Key pair!
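The numbers above can be checked directly with NumPy:

```python
import numpy as np

K = np.array([[1, 2],    # Word 1's Key
              [3, 4],    # Word 2's Key
              [5, 6]])   # Word 3's Key
Q = K.copy()             # same vectors, as in the example above

print(K.shape, K.T.shape)    # (3, 2) and (2, 3)
scores = Q @ K.T             # every Query dotted with every Key
print(scores)
# [[ 5 11 17]
#  [11 25 39]
#  [17 39 61]]
```

The first row is exactly the three dot products computed by hand above; because Q = K here, the matrix is symmetric.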

🧮 Attention Calculation - Example

Sentence: "I like cats" (3 words, d=2 dimensions)

Query (Q)

I: [1, 0]
like: [0, 1]
cats: [1, 1]

Key (K)

I: [1, 0]
like: [0, 1]
cats: [1, 1]

Value (V)

I: [0.5, 0.2]
like: [0.3, 0.8]
cats: [0.9, 0.7]

Step 1: Q × Kᵀ

For "cats" (Q=[1,1]):

• with "I" [1,0]: 1×1 + 1×0 = 1
• with "like" [0,1]: 1×0 + 1×1 = 1
• with "cats" [1,1]: 1×1 + 1×1 = 2

Scores: [1, 1, 2]

Step 2: ÷ √d and Softmax

√2 ≈ 1.41, so: [0.71, 0.71, 1.41]

softmax → [0.25, 0.25, 0.50]

"cats" pays most attention to itself!

Step 3: Weighted Sum of Values

Output = 0.25×[0.5,0.2] + 0.25×[0.3,0.8] + 0.50×[0.9,0.7] ≈ [0.65, 0.60]
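The whole worked example can be verified with NumPy (a sketch using our own helper names; the Q, K, V values are the ones from the slide above):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())     # shift for numerical stability
    return e / e.sum()

# Q, K, V from the example above ("I like cats", d = 2)
Q = np.array([[1, 0], [0, 1], [1, 1]], dtype=float)
K = Q.copy()
V = np.array([[0.5, 0.2], [0.3, 0.8], [0.9, 0.7]])

q_cats = Q[2]                            # Query for "cats"
scores = K @ q_cats                      # [1, 1, 2]
weights = softmax(scores / np.sqrt(2))   # scale by √d, then normalize
output = weights @ V                     # weighted sum of Values
print(np.round(weights, 2))              # [0.25 0.25 0.5 ]
print(np.round(output, 2))               # [0.65 0.6 ]
```

Running it reproduces the attention weights and the final output vector for "cats".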

Why is this groundbreaking?

1. Parallelization

Process all words simultaneously = dramatically faster training on GPUs

2. Long Distances

Every word can directly "see" EVERY other word - no more forgetting!

3. Scalability

More data + more parameters = better results (Scaling Laws)

Title of the paper: "Attention Is All You Need" - No RNNs, no CNNs. ONLY Attention!

The Transformer Architecture

Encoder

Understands the input

  • Multi-Head Attention
  • Feed-Forward Network
  • Layer Normalization
  • Residual Connections

Example: BERT

Decoder

Generates the output

  • Masked Self-Attention
  • Cross-Attention (to Encoder)
  • Feed-Forward Network
  • + same normalization

Example: GPT
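The "Masked Self-Attention" in the decoder can be sketched in a few lines. This is a simplified illustration of a causal mask with made-up uniform scores: the −∞ trick is the standard idea, but real implementations differ in detail.

```python
import numpy as np

n = 4                                    # sequence length
scores = np.ones((n, n))                 # pretend attention scores
mask = np.triu(np.ones((n, n)), k=1)     # 1s above the diagonal = "future"
scores[mask == 1] = -np.inf              # future words get -infinity

# softmax row by row: exp(-inf) = 0, so future words get weight 0
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
# word 1 only sees itself, word 2 sees words 1-2, and so on:
# [[1.   0.   0.   0.  ]
#  [0.5  0.5  0.   0.  ]
#  [0.33 0.33 0.33 0.  ]
#  [0.25 0.25 0.25 0.25]]
```

This is why the decoder can generate text left to right: during training, each position is prevented from "cheating" by looking at words that come later.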

Multi-Head Attention: Multiple "heads" learn different relationships simultaneously!

Multi-Head Attention

Why multiple "heads"?

One head alone is too simple. Different heads learn different things:

Head 1

Learns subject-verb relationships

"cat" -> "sits"

Head 2

Learns adjective-noun

"big" -> "cat"

Head 3

Learns preposition-location

"on" -> "mat"

The original paper uses 8 heads. GPT-3 already uses 96!
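One common way to get several heads is to split the model dimension into chunks, run attention on each chunk, and concatenate the results. This is a simplified sketch: real Transformers use learned projection matrices per head, which we omit here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_head(Q, K, V, num_heads):
    # split the last dimension into num_heads chunks
    Qs = np.split(Q, num_heads, axis=-1)
    Ks = np.split(K, num_heads, axis=-1)
    Vs = np.split(V, num_heads, axis=-1)
    # each head attends independently, then the results are concatenated
    return np.concatenate(
        [attention(q, k, v) for q, k, v in zip(Qs, Ks, Vs)], axis=-1)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))            # 6 words, d_model = 8
out = multi_head(X, X, X, num_heads=2)
print(out.shape)                       # same shape as the input
```

Each head sees only its slice of the vectors, so different heads are free to specialize in different relationships, as in the subject-verb / adjective-noun examples above.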

Positional Encoding

The missing puzzle piece

Problem: Attention processes all words simultaneously - but ORDER matters!

"Dog bites man" vs. "Man bites dog"

Solution: Add a unique "position information" to each word

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Sine waves allow the model to learn relative positions!
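The two formulas above translate directly into NumPy (a minimal sketch; the function name and the vectorized layout are our own choices):

```python
import numpy as np

def positional_encoding(max_len, d):
    pos = np.arange(max_len)[:, None]     # positions 0 .. max_len-1
    i = np.arange(d // 2)[None, :]        # index of each sin/cos pair
    angles = pos / 10000 ** (2 * i / d)   # pos / 10000^(2i/d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)          # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)          # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d=16)
print(pe.shape)       # (50, 16): one position vector per word slot
print(pe[0][:4])      # position 0: [0. 1. 0. 1.] (sin 0 = 0, cos 0 = 1)
```

Each position gets a unique pattern of sine and cosine values, which is simply added to the word's embedding so that "Dog bites man" and "Man bites dog" no longer look identical.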

The Results (2017)

Translation English -> German

Model               | BLEU Score | Training Time
Previous Best (RNN) | 26.0       | Weeks
Transformer (Base)  | 27.3       | 12 hours
Transformer (Big)   | 28.4       | 3.5 days

Better AND faster - that's rare in research!

What happened next

2017
Transformer
->
2018
BERT (Google)
->
2018-20
GPT-1, 2, 3
->
2022+
ChatGPT, Claude

Encoder-only (BERT)

  • Text understanding
  • Sentiment analysis
  • Question-answering systems

Decoder-only (GPT)

  • Text generation
  • Chatbots
  • Code generation

The Surprise: Scaling Laws

More = Better (surprisingly reliable)

More Parameters

GPT-2: 1.5B
GPT-3: 175B
GPT-4: ~1T?

More Data

Billions of webpages, books, code...

More Compute

Thousands of GPUs, millions of dollars for training

Emergent Abilities: Beyond a certain size, models can suddenly do things they weren't explicitly trained for!

Practical Applications Today

Language

  • ChatGPT, Claude
  • Translation
  • Summarization
  • Code generation

Images

  • DALL-E, Midjourney
  • Stable Diffusion
  • Vision Transformer
  • Image analysis

And more...

  • Audio (Whisper)
  • Video
  • Protein structure
  • Robotics

Understanding Limitations

Problems

  • Hallucinations: Making up "facts"
  • Context limit: Finite window
  • No real understanding: Statistical patterns
  • Bias: Reflects training data
  • Cost: Enormous energy/money

Best Use

  • As a tool, not an oracle
  • Verify results
  • For drafts, not final versions
  • Build your own knowledge
  • Question critically

Important: LLMs don't "know" anything - they are very good pattern completers!

Summary

What makes Transformers revolutionary?

  1. Attention mechanism instead of sequential processing
  2. Parallelization enables fast training on GPUs
  3. Long distances are modeled directly
  4. Scalability - more parameters = better results
  5. Universal - works for text, images, audio, ...

"Attention Is All You Need"

8 pages that changed the world.

Further Resources

To Read

  • Original Paper: arxiv.org/abs/1706.03762
  • "The Illustrated Transformer" - Jay Alammar
  • 3Blue1Brown - YouTube Videos on Neural Networks

To Try

  • ChatGPT - chat.openai.com
  • Claude - claude.ai
  • Hugging Face - Transformers Library

My advice: Use AI tools, but understand their limits. That makes you better programmers!

Thank You!

Questions about the AI Revolution?

"The future is already here - it's just not evenly distributed."

- William Gibson

This presentation was created with the support of Claude (Anthropic) - a Transformer-based AI model.
