Inside the Transformer: Architecture and Attention Demystified -- A Complete Guide
Introduction: What Are Transformers and Why Should You Care? (Article 4 alternative)
Originally published on Medium.

- Tokenizer: Breaks sentences into pieces (like cutting a pizza into slices)
- Embeddings: Turns words into numbers the computer understands
- Attention: The secret sauce — lets the model focus on what’s important
- Layers: Stack these parts to make the model smarter
# Clone the repository and navigate to it
cd art_hug_04
# Run the setup task (installs Python 3.12.9 and all dependencies)
task setup
# Run all examples
task run
# Run specific examples
task run-attention-mechanism  # Self-attention demos
task run-modern-models        # Architecture comparisons
- transformers: The Hugging Face library (your AI toolkit)
- torch: PyTorch for the mathematical operations
- matplotlib/seaborn: For creating visualizations
- “Transformers” might become [‘Transform’, ‘ers’]
- This helps handle words the model hasn’t seen before!
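To make the idea of splitting a word into known pieces concrete, here is a minimal sketch. It uses a greedy longest-match rule over a hypothetical toy vocabulary; this is a stand-in for illustration only, not the byte-level BPE procedure that RoBERTa's tokenizer actually uses. The function name and the vocabulary are assumptions made up for this example.

```python
def greedy_subword_tokenize(word, vocab):
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: emit a placeholder and move on.
            pieces.append("<unk>")
            start += 1
    return pieces


# Hypothetical toy vocabulary, chosen only for this illustration.
toy_vocab = {"Transform", "ers", "Trans", "form", "er", "s"}
print(greedy_subword_tokenize("Transformers", toy_vocab))  # ['Transform', 'ers']
```

Because unknown words fall back to smaller known pieces, the model rarely has to give up on a word entirely.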
from transformers import AutoTokenizer, AutoModel
import torch


def basic_tokenization_and_embedding():
    """Let's convert text to numbers step by step."""
    # Step 1: Load a pre-trained model (like buying a trained dog vs training one yourself)
    model_name = "roberta-base"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Step 2: Take a sentence
    sentence = "Transformers are amazing!"

    # Step 3: Break it into pieces (tokenize)
    inputs = tokenizer(sentence, return_tensors="pt")
    print("Token IDs:", inputs["input_ids"])
    # Output: tensor([[0, 44929, 32, 2770, 328, 2]])

    # What do these numbers mean? Let's see:
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    print(f"Tokens: {tokens}")
    # Output: ['<s>', 'Transform', 'ers', 'Ġare', 'Ġamazing', '!', '</s>']

    # Step 4: Convert to embeddings (meaningful numbers)
    with torch.no_grad():  # This means "just use, don't train"
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state
    print("Embeddings shape:", embeddings.shape)
    # Output: torch.Size([1, 6, 768])
- Tokenization: “Transformers are amazing!” becomes pieces:
  - <s> = start of sentence (like a capital letter)
  - Transform + ers = the word split into known pieces
  - Ġare = "are" with a space marker (Ġ)
  - </s> = end of sentence (like a period)
- Shape [1, 6, 768] means: 1 sentence, 6 tokens, 768 features per token
- Think of 768 features like describing a person with 768 characteristics
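The three numbers in that shape can be reproduced with a plain lookup table. The sketch below is an assumption-laden toy: the vocabulary size, the 4 features (standing in for 768), the token IDs, and the random values are all made up, and a real model goes on to refine these vectors through its layers.

```python
import random

random.seed(0)

vocab_size = 10   # hypothetical tiny vocabulary
hidden_size = 4   # stands in for the 768 features per token

# One row of numbers per vocabulary entry; real models learn these values.
embedding_table = [[random.uniform(-1, 1) for _ in range(hidden_size)]
                   for _ in range(vocab_size)]

token_ids = [[0, 7, 3, 2]]  # one sentence (batch of 1) with 4 made-up token IDs

# Look up the row for every token ID in every sentence.
embedded = [[embedding_table[tid] for tid in sentence] for sentence in token_ids]

shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)  # (1, 4, 4): 1 sentence, 4 tokens, 4 features per token
```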
import matplotlib.pyplot as plt
import numpy as np


def visualize_embeddings():
    """Show what embeddings look like."""
    # Get embeddings for a few words
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModel.from_pretrained("roberta-base")
    words = ["happy", "sad", "dog", "cat"]
    for word in words:
        inputs = tokenizer(word, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        # Average the embeddings (excluding special tokens)
        embedding = outputs.last_hidden_state[0, 1:-1].mean(dim=0)
        # Show first 10 dimensions as a bar chart
        plt.figure(figsize=(10, 3))
        plt.bar(range(10), embedding[:10].numpy())
        plt.title(f"First 10 embedding dimensions for '{word}'")
        plt.xlabel("Dimension")
        plt.ylabel("Value")
        plt.show()
- “The cat chased the dog”
- “The dog chased the cat”
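These two sentences contain exactly the same words, so a model that ignored order could not tell them apart. A small check using only the standard library illustrates the problem; pairing each word with its index is a deliberately crude stand-in for positional information, not how transformers encode it.

```python
from collections import Counter

a = "The cat chased the dog".lower().split()
b = "The dog chased the cat".lower().split()

same_words = Counter(a) == Counter(b)  # identical bag of words
same_order = a == b                    # but a different sequence
print(same_words, same_order)  # True False

# Tagging each word with its position makes the two sentences distinguishable.
positioned_a = list(enumerate(a))
positioned_b = list(enumerate(b))
print(positioned_a != positioned_b)  # True
```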
import math


def visualize_positional_encoding():
    """Show how positional encoding works."""
    seq_length = 20
    d_model = 64

    # Create positional encoding
    position = torch.arange(seq_length).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
    pe = torch.zeros(seq_length, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # Even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimensions

    # Visualize
    plt.figure(figsize=(10, 6))
    plt.imshow(pe, cmap='RdBu', aspect='auto')
    plt.colorbar()
    plt.xlabel('Embedding Dimension')
    plt.ylabel('Position in Sequence')
    plt.title('Positional Encoding Pattern')
    plt.show()
import torch.nn as nn


class SimpleTransformerBlock(nn.Module):
    """A basic building block of transformers."""

    def __init__(self, d_model):
        super().__init__()
        self.linear = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Residual connection: x + transformation(x)
        # Like having a safety net - if transformation fails, original x is preserved
        output = x + self.linear(x)
        # Normalization: keeps values in reasonable range
        # Like adjusting volume so it's not too loud or quiet
        return self.norm(output)


# Example usage
block = SimpleTransformerBlock(768)
input_tensor = torch.randn(1, 10, 768)  # [batch, sequence, features]
output = block(input_tensor)
print(f"Input shape: {input_tensor.shape}")
print(f"Output shape: {output.shape}")  # Same shape!
- Residual Connection (x + self.linear(x)): Information can flow around problematic transformations
- Layer Normalization: Keeps numbers stable, preventing “explosion” or “vanishing”
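To see what these two steps do to actual numbers, here is a hand-rolled sketch using only the standard library. The input values and the "sub-layer output" are made up, and the helper omits the learnable scale and shift that nn.LayerNorm also applies, so treat it as an approximation of the idea rather than a replacement for the PyTorch layer.

```python
import math


def layer_norm(values, eps=1e-5):
    """Rescale a vector to roughly zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + eps) for v in values]


x = [1.0, 2.0, 3.0, 4.0]                 # made-up token features
transformed = [100.0, -50.0, 25.0, 0.0]  # made-up, badly scaled sub-layer output

residual = [xi + ti for xi, ti in zip(x, transformed)]  # x + transformation(x)
normalized = layer_norm(residual)

print("Before norm:", residual)
print("After norm: ", [round(v, 3) for v in normalized])
```

Even though the sub-layer output spans hundreds of units, the normalized vector ends up centered on zero with a spread of about one.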
class FeedForward(nn.Module):
    """Each token gets processed individually."""

    def __init__(self, d_model=768, d_ff=3072):
        super().__init__()
        # Two linear layers with ReLU in between
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),  # Expand (768 → 3072)
            nn.ReLU(),                 # Add non-linearity
            nn.Dropout(0.1),           # Prevent overfitting
            nn.Linear(d_ff, d_model),  # Contract (3072 → 768)
        )

    def forward(self, x):
        return self.net(x)


# Demonstrate
ff = FeedForward()
tokens = torch.randn(1, 5, 768)  # 5 tokens, 768 dimensions each
output = ff(tokens)
print("Each token processed independently!")
print(f"Input: {tokens.shape} → Output: {output.shape}")
- Query: “I need books about cooking”
- Keys: Book titles on the shelves
- Values: The actual books
- Query from ‘sat’: “Who is doing the sitting?”
- Keys from other words: [‘The’, ‘cat’, ‘on’, ‘the’, ‘mat’]
- Attention focuses on ‘cat’ (high score)
- Value from ‘cat’ enriches understanding of ‘sat’
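The library analogy can be turned into a tiny "soft lookup". In the sketch below every vector is made up by hand so that the query for ‘sat’ resembles the key for ‘cat’; in a trained model these vectors come from learned projections. For brevity the sketch also skips the scaling by the square root of the key dimension.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product: larger means the two vectors point in a similar direction."""
    return sum(a * b for a, b in zip(u, v))


# Made-up 2-dimensional vectors (assumptions for illustration only).
query_sat = [1.0, 0.0]
keys = {"The": [0.0, 1.0], "cat": [3.0, 0.0], "on": [0.0, 0.5], "mat": [0.5, 0.5]}
values = {"The": [0.1, 0.1], "cat": [0.9, 0.2], "on": [0.0, 0.3], "mat": [0.4, 0.8]}

words = list(keys)
weights = softmax([dot(query_sat, keys[w]) for w in words])
# The output for 'sat' is a blend of all values, dominated by the best match.
output = [sum(wt * values[w][i] for wt, w in zip(weights, words)) for i in range(2)]

for w, wt in zip(words, weights):
    print(f"{w}: {wt:.2f}")
```

Unlike a library search that returns one book, attention returns a weighted mix of every value, with most of the weight on the closest match.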
def demonstrate_self_attention():
    """Show self-attention step by step."""
    # Simple example dimensions
    d_model = 64  # Feature size
    seq_len = 5   # Number of words

    # Create Query, Key, Value matrices
    # In reality, these come from learned projections
    Q = torch.randn(seq_len, d_model)  # Queries
    K = torch.randn(seq_len, d_model)  # Keys
    V = torch.randn(seq_len, d_model)  # Values

    # Step 1: Calculate attention scores
    # How well does each query match each key?
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_model)
    print(f"Attention scores shape: {scores.shape}")  # [5, 5]

    # Step 2: Convert to probabilities
    attention_weights = torch.softmax(scores, dim=-1)
    print(f"Each row sums to: {attention_weights.sum(dim=-1)}")  # All 1.0!

    # Step 3: Weighted sum of values
    output = torch.matmul(attention_weights, V)
    print(f"Output shape: {output.shape}")  # [5, 64]

    # Visualize attention pattern
    plt.figure(figsize=(6, 5))
    plt.imshow(attention_weights.numpy(), cmap='Blues', vmin=0, vmax=1)
    plt.colorbar(label='Attention Weight')
    for i in range(seq_len):
        for j in range(seq_len):
            plt.text(j, i, f'{attention_weights[i,j]:.2f}',
                     ha='center', va='center')
    plt.xlabel('Key Position')
    plt.ylabel('Query Position')
    plt.title('Self-Attention Weights')
    plt.show()
- Scores: Dot product measures similarity (like asking “how related are these words?”)
- Softmax: Ensures each word distributes exactly 100% of its attention
- Weighted Sum: Combines information based on attention weights
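The three steps above correspond to the standard scaled dot-product attention formula. Here Q, K, and V are the query, key, and value matrices, and d_k is the dimension of each key vector (equal to d_model in the single-head demo, where the scores are divided by math.sqrt(d_model)).

```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
```

The product Q K^T gives the scores, the row-wise softmax gives the weights, and multiplying by V gives the weighted sum.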
def word_attention_example():
    """Show attention with actual words."""
    words = ["The", "cat", "sat", "on", "mat"]
    seq_len = len(words)

    # Simulate attention weights (in real transformers, these are computed
    # from learned query and key projections)
    # Let's make "sat" pay attention to "cat"
    attention_weights = torch.zeros(seq_len, seq_len)
    attention_weights[2, 1] = 0.8  # "sat" → "cat"
    attention_weights[2, 2] = 0.2  # "sat" → "sat"

    # Make each row sum to 1
    for i in range(seq_len):
        if attention_weights[i].sum() > 0:
            attention_weights[i] = attention_weights[i] / attention_weights[i].sum()
        else:
            attention_weights[i] = torch.ones(seq_len) / seq_len

    # Visualize
    plt.figure(figsize=(8, 6))
    plt.imshow(attention_weights.numpy(), cmap='Blues', vmin=0, vmax=1)
    plt.colorbar(label='Attention Weight')

    # Add labels
    plt.xticks(range(seq_len), words)
    plt.yticks(range(seq_len), words)
    plt.xlabel('Attending To')
    plt.ylabel('Word')
    plt.title('Word-to-Word Attention')

    # Add values
    for i in range(seq_len):
        for j in range(seq_len):
            if attention_weights[i,j] > 0.1:
                plt.text(j, i, f'{attention_weights[i,j]:.1f}',
                         ha='center', va='center', color='white')
    plt.show()
def multi_head_attention_demo():
    """Show how multiple attention heads work together."""
    num_heads = 4
    d_model = 64
    d_k = d_model // num_heads  # 16 dimensions per head

    # Input
    seq_len = 5
    x = torch.randn(seq_len, d_model)

    # Each head processes a portion of the features
    all_heads_output = []
    for head in range(num_heads):
        # Each head looks at different features
        start_idx = head * d_k
        end_idx = (head + 1) * d_k

        # Extract this head's portion
        head_input = x[:, start_idx:end_idx]

        # Simple attention for this head (simplified: no learned projections)
        Q = head_input
        K = head_input
        V = head_input
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
        weights = torch.softmax(scores, dim=-1)
        head_output = torch.matmul(weights, V)
        all_heads_output.append(head_output)
        # end_idx is exclusive, so the last dimension this head uses is end_idx - 1
        print(f"Head {head}: Focusing on dimensions {start_idx}-{end_idx - 1}")

    # Concatenate all heads
    concat_output = torch.cat(all_heads_output, dim=-1)
    print(f"\nFinal shape after concatenating {num_heads} heads: {concat_output.shape}")
- Head 1 might focus on grammar (“who did what”)
- Head 2 might track entities (“which cat, which mat”)
- Head 3 might identify relationships (“sitting on”)
- Head 4 might capture style or tone
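The bookkeeping behind multiple heads is just slicing and re-joining. The sketch below shows only that part, with made-up numbers and hypothetical helper names; real implementations also apply learned projections before the split and another one after the concatenation, which is what lets heads specialize in the ways listed above.

```python
def split_heads(vector, num_heads):
    """Cut one feature vector into equal, contiguous slices, one per head."""
    if len(vector) % num_heads != 0:
        raise ValueError("vector length must be divisible by num_heads")
    d_k = len(vector) // num_heads
    return [vector[h * d_k:(h + 1) * d_k] for h in range(num_heads)]


def concat_heads(heads):
    """Join the per-head slices back into one vector."""
    return [value for head in heads for value in head]


token = [float(i) for i in range(8)]  # made-up token with d_model = 8
heads = split_heads(token, num_heads=4)
print(heads)  # [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]

restored = concat_heads(heads)
print(restored == token)  # True
```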
from transformers import pipeline


def compare_architectures():
    """Show the three transformer types in action."""
    print("=== Three Types of Transformers ===\n")

    # 1. ENCODER-ONLY: The Reader (understands text)
    print("1. Encoder-Only (BERT) - The Careful Reader:")
    classifier = pipeline(
        'sentiment-analysis',
        model='distilbert-base-uncased-finetuned-sst-2-english'
    )
    text = "I love learning about transformers!"
    result = classifier(text)
    print(f"   Input: '{text}'")
    print(f"   Analysis: {result[0]['label']} (confidence: {result[0]['score']:.3f})")
    print("   Use for: Classification, understanding, search\n")

    # 2. DECODER-ONLY: The Writer (generates text)
    print("2. Decoder-Only (GPT) - The Creative Writer:")
    generator = pipeline('text-generation', model='gpt2', max_new_tokens=15)
    prompt = "The future of AI is"
    # max_new_tokens (set above) already limits the output length, so
    # max_length is not passed here; setting both triggers a warning.
    result = generator(prompt, num_return_sequences=1)
    print(f"   Prompt: '{prompt}'")
    print(f"   Generated: '{result[0]['generated_text']}'")
    print("   Use for: Chatbots, story writing, code completion\n")

    # 3. ENCODER-DECODER: The Translator (transforms text)
    print("3. Encoder-Decoder (T5) - The Translator:")
    summarizer = pipeline('summarization', model='t5-small')
    long_text = (
        "Transformers have revolutionized natural language processing "
        "by using self-attention mechanisms. They process entire sequences "
        "at once, understanding context better than previous models."
    )
    summary = summarizer(long_text, max_length=30, min_length=10)
    print(f"   Input: '{long_text[:50]}...'")
    print(f"   Summary: '{summary[0]['summary_text']}'")
    print("   Use for: Translation, summarization, Q&A")
def visualize_attention_masks():
    """Show how different architectures see text."""
    seq_len = 6
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # 1. BERT: Can see everything (bidirectional)
    bert_mask = torch.ones(seq_len, seq_len)
    axes[0].imshow(bert_mask, cmap='Blues', vmin=0, vmax=1)
    axes[0].set_title('BERT (Encoder)\nSees Everything')
    axes[0].set_xlabel('Can see →')
    axes[0].set_ylabel('Token ↓')

    # 2. GPT: Can only see backwards (causal)
    gpt_mask = torch.tril(torch.ones(seq_len, seq_len))
    axes[1].imshow(gpt_mask, cmap='Blues', vmin=0, vmax=1)
    axes[1].set_title('GPT (Decoder)\nSees Only Past')

    # 3. Training: Random masking
    train_mask = (torch.rand(seq_len, seq_len) > 0.15).float()
    axes[2].imshow(train_mask, cmap='Blues', vmin=0, vmax=1)
    axes[2].set_title('Training\nRandom Masking')

    plt.tight_layout()
    plt.show()
- BERT: Every position sees all positions (understanding)
- GPT: Each position only sees previous positions (generation)
- Training: Random masks make models robust
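A mask only matters once it is applied to the attention scores. A common way to do that, sketched here with the standard library and made-up identical scores, is to set disallowed positions to negative infinity before the softmax so they receive exactly zero weight. The helper name is an assumption for this example.

```python
import math


def masked_softmax(scores, allowed):
    """Softmax over one row, giving zero weight to positions that are not allowed."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    largest = max(masked)
    exps = [math.exp(s - largest) for s in masked]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


seq_len = 4
scores = [[1.0] * seq_len for _ in range(seq_len)]  # made-up, identical scores

# Causal (GPT-style) mask: position i may look at positions 0..i only.
causal = [[j <= i for j in range(seq_len)] for i in range(seq_len)]
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, causal)]

for row in weights:
    print([round(w, 2) for w in row])
```

With an all-True mask (the BERT case) every row would instead spread its attention evenly across all four positions.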
def simple_rag_example():
    """Show how RAG works with a simple example."""
    # Step 1: Our knowledge base (imagine this is Wikipedia)
    knowledge_base = [
        "The Eiffel Tower is 330 meters tall and located in Paris.",
        "The Great Wall of China is over 21,000 kilometers long.",
        "The Pyramid of Giza was built around 2560 BCE.",
        "Transformers were introduced in the 2017 'Attention is All You Need' paper.",
        "BERT stands for Bidirectional Encoder Representations from Transformers."
    ]

    # Step 2: User asks a question
    question = "How tall is the Eiffel Tower?"
    print(f"Question: {question}\n")

    # Step 3: Find relevant information (simple keyword matching)
    print("Step 1: Searching knowledge base...")
    relevant_docs = []
    for doc in knowledge_base:
        if "Eiffel Tower" in doc or "tall" in doc:
            relevant_docs.append(doc)
            print(f"Found: {doc}")

    # Step 4: Create a prompt with context
    context = " ".join(relevant_docs)
    prompt = f"""Based on the following information:
{context}
Question: {question}
Answer:"""
    print("\nStep 2: Creating prompt with context...")
    print(prompt)

    # Step 5: Generate answer (using GPT-2)
    print("\nStep 3: Generating answer...")
    generator = pipeline('text-generation', model='gpt2', max_new_tokens=20)
    # max_new_tokens (set above) already limits the output length, so
    # max_length is not passed here; setting both triggers a warning.
    answer = generator(prompt, pad_token_id=50256)
    final_answer = answer[0]['generated_text'].split('Answer:')[-1].strip()
    print(f"\nFinal Answer: {final_answer}")
- Model might hallucinate (make up facts)
- Knowledge is frozen at training time
- Can’t access private documents
- Answers are grounded in real documents
- Knowledge can be updated without retraining
- Can work with your company’s private data
- Provides sources for fact-checking
- Question: “What’s our company’s return policy?”
- Without RAG: makes up a plausible but wrong policy
- With RAG: retrieves actual policy document and quotes it accurately
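The retrieval step does not have to rely on hard-coded keywords. As a rough sketch, documents can be ranked by how many words they share with the question. The helper names below are hypothetical, and word overlap is a simple stand-in; production systems typically compare embedding vectors instead.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda doc: len(q_words & tokenize(doc)), reverse=True)
    return ranked[:top_k]


knowledge_base = [
    "The Eiffel Tower is 330 meters tall and located in Paris.",
    "The Great Wall of China is over 21,000 kilometers long.",
    "BERT stands for Bidirectional Encoder Representations from Transformers.",
]
question = "How tall is the Eiffel Tower?"

best = retrieve(question, knowledge_base)
prompt = f"Based on the following information:\n{best[0]}\nQuestion: {question}\nAnswer:"
print(prompt)
```

Because the retrieved sentence is placed directly in the prompt, the generator can copy the height from it rather than guess.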
with torch.no_grad():
    output = model(input)  # → Saves memory and speeds up inference
- distilbert-base: 67M parameters. Fast, good for simple tasks
- bert-base: 110M parameters. Balanced performance
- bert-large: 340M parameters. Best accuracy, slower
- gpt2: 124M parameters. Good for generation
- gpt2-xl: 1.5B parameters. Better quality, needs more resources
- Out of memory: Use smaller batch sizes or distilled models
- Slow inference: Export to ONNX or use quantization
- Poor results: Check that you’re using the right architecture for the task
- Tokenization issues: Make sure the tokenizer matches the model
- Training instability: Lower the learning rate and use warmup
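A quick back-of-the-envelope estimate helps when deciding whether a model will fit in memory. The sketch below multiplies the parameter counts listed above by the bytes needed per parameter. It covers the weights only; activations, optimizer state, and framework overhead come on top, so read the results as lower bounds.

```python
def weight_memory_mb(num_parameters, bytes_per_parameter):
    """Approximate memory needed just to hold the weights, in megabytes (10**6 bytes)."""
    return num_parameters * bytes_per_parameter / 1e6


# Parameter counts as listed above.
models = {
    "distilbert-base": 67_000_000,
    "bert-base": 110_000_000,
    "bert-large": 340_000_000,
    "gpt2": 124_000_000,
    "gpt2-xl": 1_500_000_000,
}

for name, params in models.items():
    fp32 = weight_memory_mb(params, 4)  # 32-bit floats
    fp16 = weight_memory_mb(params, 2)  # 16-bit floats
    int8 = weight_memory_mb(params, 1)  # 8-bit quantized weights
    print(f"{name}: fp32 ~{fp32:.0f} MB, fp16 ~{fp16:.0f} MB, int8 ~{int8:.0f} MB")
```

For example, bert-base needs roughly 440 MB for its weights in 32-bit precision, which is why quantization and distilled models are the first things to try when memory runs out.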
- Input Text: “Transformers are revolutionary!”
- Tokenization:
- ‘Transformers are revolutionary!’ → [‘Transform’, ‘ers’, ‘are’, ‘revolutionary’, ‘!’]
- [‘Transform’, ‘ers’, …] → [1547, 433, 526, 9823, 256]
- [1547, 433, …] → [[0.23, -0.45, …], [0.67, 0.12, …], …]
- Each token → 768-dimensional vector
- So the model knows word order
- Each word looks at all other words
- ‘revolutionary’ might focus on ‘Transformers’
- Each token is processed individually
- Classification: ‘POSITIVE sentiment’
- Generation: ‘…and changing the world!’
- Translation: ‘Les transformers sont révolutionnaires!’
- Transformers = Tokenizer + Embeddings + Attention + Feed-Forward
- Attention lets every word see every other word (the breakthrough!)
- Three types: Encoder (understand), Decoder (generate), Both (transform)
- Multi-head attention = multiple perspectives for richer understanding
- Position matters: transformers need to know word order
- RAG = Transformers + External Knowledge for better accuracy
- Choose architecture based on task (classification vs generation vs transformation)
# Setup environment
git clone [email protected]:RichardHightower/art_hug_04.git
cd art_hug_04
task setup
# Run examples
task run-attention-mechanism  # See attention in action
task run-modern-models        # Compare architectures
task run                      # Run everything
# Or run individual Python files
python src/attention_mechanism.py
python src/modern_models.py
python src/rag_example.py
- Try the Code: Run the examples and modify them
- Pick a Project: Choose a task (classification, generation, or transformation)
- Select a Model: Use the decision tree to pick the right architecture
- Fine-tune: Adapt a pre-trained model to your specific needs
- Deploy: Use optimization techniques for production