Inside the Transformer: Architecture and Attention Demystified -- A Complete Guide

Introduction: What Are Transformers and Why Should You Care? (Article 4 alternative)

Tokenizer: Breaks sentences into pieces (like cutting a pizza into slices)
Embeddings: Turns words into numbers the computer understands
Attention: The secret sauce — lets the model focus on what’s important
Layers: Stack these parts to make the model smarter

# Clone the repository and navigate to it
cd art_hug_04
# Run the setup task (installs Python 3.12.9 and all dependencies)
task setup
# Run all examples
task run
# Run specific examples
task run-attention-mechanism  # Self-attention demos
task run-modern-models        # Architecture comparisons

transformers: The Hugging Face library (your AI toolkit)
torch: PyTorch for the mathematical operations
matplotlib/seaborn: For creating visualizations
“Transformers” might become [‘Transform’, ‘ers’]
This helps handle words the model hasn’t seen before!
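
To make the subword idea concrete, here is a minimal sketch of greedy longest-match splitting. The tiny vocabulary and the `split_into_subwords` helper are hypothetical and exist only for illustration. Real tokenizers (such as the byte-level BPE used by RoBERTa) learn their vocabulary from data and use a different algorithm, so treat this as an intuition aid rather than how the library works.

```python
# Hypothetical toy vocabulary (an assumption for illustration, not a real model's vocab)
TOY_VOCAB = {"Transform", "ers", "token", "ize", "r", "s", "un", "seen"}

def split_into_subwords(word, vocab=TOY_VOCAB, unknown="[UNK]"):
    """Greedily take the longest vocabulary entry that matches at each position."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it one character at a time
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing in the vocabulary matches here: mark one character as unknown
            pieces.append(unknown)
            start += 1
    return pieces

print(split_into_subwords("Transformers"))  # ['Transform', 'ers']
print(split_into_subwords("tokenizers"))    # ['token', 'ize', 'r', 's']
print(split_into_subwords("unseen"))        # ['un', 'seen']
```

Even though "tokenizers" never appears in the toy vocabulary as a whole word, it can still be represented from smaller known pieces.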

from transformers import AutoTokenizer, AutoModel
import torch

def basic_tokenization_and_embedding():
    """Let's convert text to numbers step by step."""

    # Step 1: Load a pre-trained model (like buying a trained dog vs training one yourself)
    model_name = "roberta-base"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Step 2: Take a sentence
    sentence = "Transformers are amazing!"

    # Step 3: Break it into pieces (tokenize)
    inputs = tokenizer(sentence, return_tensors="pt")
    print("Token IDs:", inputs["input_ids"])
    # Example output: tensor([[0, 44929, 32, 2770, 328, 2]])

    # What do these numbers mean? Let's see:
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    print(f"Tokens: {tokens}")
    # Example output: ['<s>', 'Transform', 'ers', 'Ġare', 'Ġamazing', '!', '</s>']
    # (the exact split and IDs depend on the tokenizer's vocabulary)

    # Step 4: Convert to embeddings (meaningful numbers)
    with torch.no_grad():  # This means "just use, don't train"
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state
        print("Embeddings shape:", embeddings.shape)
        # Example output: torch.Size([1, 6, 768]) -> [batch, tokens, features]

Tokenization: “Transformers are amazing!” becomes pieces:

<s> = start of sentence (like a capital letter)
Transform + ers = the word split into known pieces
Ġare = "are" with a space marker (Ġ)
</s> = end of sentence (like a period)
Shape [1, 6, 768] means: 1 sentence, 6 token positions (special tokens such as <s> and </s> count too), 768 features per token
Think of 768 features like describing a person with 768 characteristics
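
The shape above can be reproduced in miniature with a plain lookup table. This sketch assumes a hypothetical 4-feature embedding table with made-up values (real models learn 768 or more features per token). It only shows how token IDs turn into a [sentences, tokens, features] structure.

```python
# Hypothetical embedding table: token ID -> 4 made-up features (real models learn these)
EMBEDDING_TABLE = {
    0: [0.0, 0.1, 0.0, 0.1],   # stand-in for a start-of-sentence token
    1: [0.5, -0.2, 0.3, 0.9],
    2: [-0.4, 0.8, 0.1, 0.0],
    3: [0.2, 0.2, -0.7, 0.4],  # stand-in for an end-of-sentence token
}

def embed(batch_of_ids, table=EMBEDDING_TABLE):
    """Replace every token ID with its feature vector."""
    return [[table[token_id] for token_id in sentence] for sentence in batch_of_ids]

def shape(nested):
    """Return (sentences, tokens, features) for a rectangular nested list."""
    return (len(nested), len(nested[0]), len(nested[0][0]))

batch = [[0, 1, 2, 3]]   # 1 sentence made of 4 token IDs
embedded = embed(batch)
print(shape(embedded))   # (1, 4, 4)
```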

import matplotlib.pyplot as plt
import numpy as np

def visualize_embeddings():
    """Show what embeddings look like."""
    # Get embeddings for a few words
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModel.from_pretrained("roberta-base")
    words = ["happy", "sad", "dog", "cat"]

    for word in words:
        inputs = tokenizer(word, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
            # Average the embeddings (excluding special tokens)
            embedding = outputs.last_hidden_state[0, 1:-1].mean(dim=0)

        # Show first 10 dimensions as a bar chart
        plt.figure(figsize=(10, 3))
        plt.bar(range(10), embedding[:10].numpy())
        plt.title(f"First 10 embedding dimensions for '{word}'")
        plt.xlabel("Dimension")
        plt.ylabel("Value")
        plt.show()

“The cat chased the dog”
“The dog chased the cat”
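
These two sentences contain exactly the same words, so a model that ignores position cannot tell them apart. Here is a minimal sketch of that problem, using a simple word count as a stand-in for an order-blind model. The helper names `bag_of_words` and `with_positions` are illustrative and not part of any library.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count words while discarding their order."""
    return Counter(sentence.lower().split())

def with_positions(sentence):
    """Pair every word with its index in the sentence."""
    return list(enumerate(sentence.lower().split()))

first = "The cat chased the dog"
second = "The dog chased the cat"

# Same word counts, so an order-blind view treats the sentences as identical
print(bag_of_words(first) == bag_of_words(second))      # True

# Attaching a position to each word makes the difference visible again
print(with_positions(first) == with_positions(second))  # False
```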

import math

def visualize_positional_encoding():
    """Show how positional encoding works."""
    seq_length = 20
    d_model = 64

    # Create positional encoding
    position = torch.arange(seq_length).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) *
                         -(math.log(10000.0) / d_model))
    pe = torch.zeros(seq_length, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # Even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimensions

    # Visualize
    plt.figure(figsize=(10, 6))
    plt.imshow(pe.numpy(), cmap='RdBu', aspect='auto')
    plt.colorbar()
    plt.xlabel('Embedding Dimension')
    plt.ylabel('Position in Sequence')
    plt.title('Positional Encoding Pattern')
    plt.show()
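
For readers who want to see the numbers behind the plot, here is a small pure-Python sketch of the same sinusoidal formula for a single position. It assumes an even `d_model` and uses a tiny size of 8 so the values are easy to read. The function name `positional_encoding` is illustrative.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position: sin on even dims, cos on odd dims."""
    encoding = []
    for i in range(0, d_model, 2):
        # Lower dimensions oscillate quickly, higher dimensions slowly
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))  # even dimension i
        encoding.append(math.cos(angle))  # odd dimension i + 1
    return encoding

pe0 = positional_encoding(0, 8)
pe1 = positional_encoding(1, 8)
print([round(v, 3) for v in pe0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print([round(v, 3) for v in pe1])
```

Position 0 always produces the same alternating 0/1 pattern, and every later position gets its own distinct fingerprint of sine and cosine values.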

import torch.nn as nn

class SimpleTransformerBlock(nn.Module):
    """A basic building block of transformers."""

    def __init__(self, d_model):
        super().__init__()
        self.linear = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Residual connection: x + transformation(x)
        # Like having a safety net - if transformation fails, original x is preserved
        output = x + self.linear(x)
        # Normalization: keeps values in reasonable range
        # Like adjusting volume so it's not too loud or quiet
        return self.norm(output)

# Example usage
block = SimpleTransformerBlock(768)
input_tensor = torch.randn(1, 10, 768)  # [batch, sequence, features]
output = block(input_tensor)
print(f"Input shape: {input_tensor.shape}")
print(f"Output shape: {output.shape}")  # Same shape!

Residual Connection (x + self.linear(x)): Information can flow around problematic transformations
Layer Normalization: Keeps numbers stable, preventing “explosion” or “vanishing”
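
Both ideas can be checked by hand on a four-number vector. The sketch below is a simplified, pure-Python version: `layer_norm` omits the learned scale and shift that `nn.LayerNorm` also applies, and the helper names are illustrative.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]

def residual(x, transformation):
    """Add the input back to the transformed output: x + f(x)."""
    return [xi + fi for xi, fi in zip(x, transformation(x))]

x = [100.0, 200.0, 300.0, 400.0]

# Even if the transformation contributes nothing, the input still flows through
unchanged = residual(x, lambda v: [0.0] * len(v))
print(unchanged)  # [100.0, 200.0, 300.0, 400.0]

# Large raw values are pulled back into a small, centered range
normalized = layer_norm(x)
print([round(v, 3) for v in normalized])  # [-1.342, -0.447, 0.447, 1.342]
```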

class FeedForward(nn.Module):
    """Each token gets processed individually."""

    def __init__(self, d_model=768, d_ff=3072):
        super().__init__()
        # Two linear layers with ReLU in between
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),      # Expand (768 → 3072)
            nn.ReLU(),                     # Add non-linearity
            nn.Dropout(0.1),               # Prevent overfitting
            nn.Linear(d_ff, d_model)       # Contract (3072 → 768)
        )

    def forward(self, x):
        return self.net(x)

# Demonstrate
ff = FeedForward()
tokens = torch.randn(1, 5, 768)  # 5 tokens, 768 dimensions each
output = ff(tokens)
print("Each token processed independently!")
print(f"Input: {tokens.shape} → Output: {output.shape}")

Query: “I need books about cooking”
Keys: Book titles on the shelves
Values: The actual books
Query from ‘sat’: “Who is doing the sitting?”
Keys from other words: [‘The’, ‘cat’, ‘on’, ‘the’, ‘mat’]
Attention focuses on ‘cat’ (high score)
Value from ‘cat’ enriches understanding of ‘sat’
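
The "sat looks for cat" story can be played out with a few hand-picked numbers. In this sketch the 2-D key, value, and query vectors are assumptions chosen so the result is easy to follow. Real models learn them, and they also scale the scores, which is skipped here for simplicity.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product: larger means the two vectors point in a more similar direction."""
    return sum(x * y for x, y in zip(a, b))

# Hand-picked 2-D vectors (assumptions for illustration; real models learn these)
words = ["The", "cat", "on", "the", "mat"]
keys = [[0.1, 0.0], [2.0, 0.5], [0.0, 0.2], [0.1, 0.0], [0.5, 1.0]]
values = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
query_sat = [2.0, 0.0]  # "Who is doing the sitting?"

scores = [dot(query_sat, key) for key in keys]
weights = softmax(scores)
# The enriched representation of "sat" is a weighted mix of all values
enriched = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(2)]

for word, weight in zip(words, weights):
    print(f"{word:>4}: {weight:.2f}")
print("Most attended word:", words[weights.index(max(weights))])  # cat
```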

def demonstrate_self_attention():
    """Show self-attention step by step."""
    # Simple example dimensions
    d_model = 64   # Feature size
    seq_len = 5    # Number of words

    # Create Query, Key, Value matrices
    # In reality, these come from learned projections
    Q = torch.randn(seq_len, d_model)  # Queries
    K = torch.randn(seq_len, d_model)  # Keys
    V = torch.randn(seq_len, d_model)  # Values

    # Step 1: Calculate attention scores
    # How well does each query match each key?
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_model)
    print(f"Attention scores shape: {scores.shape}")  # [5, 5]

    # Step 2: Convert to probabilities
    attention_weights = torch.softmax(scores, dim=-1)
    print(f"Each row sums to: {attention_weights.sum(dim=-1)}")  # All 1.0!

    # Step 3: Weighted sum of values
    output = torch.matmul(attention_weights, V)
    print(f"Output shape: {output.shape}")  # [5, 64]

    # Visualize attention pattern
    plt.figure(figsize=(6, 5))
    plt.imshow(attention_weights.numpy(), cmap='Blues', vmin=0, vmax=1)
    plt.colorbar(label='Attention Weight')
    for i in range(seq_len):
        for j in range(seq_len):
            plt.text(j, i, f'{attention_weights[i,j]:.2f}',
                     ha='center', va='center')
    plt.xlabel('Key Position')
    plt.ylabel('Query Position')
    plt.title('Self-Attention Weights')
    plt.show()

Scores: Dot product measures similarity (like asking “how related are these words?”)
Softmax: Ensures each word distributes exactly 100% of its attention
Weighted Sum: Combines information based on attention weights
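
One detail in the scores step is easy to miss: the division by the square root of the feature size. A small worked example suggests why it helps. The raw scores below are hypothetical, picked so the scaled values come out as round numbers.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw_scores = [8.0, 0.0, 0.0]  # hypothetical unscaled dot products
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]  # [1.0, 0.0, 0.0]

unscaled_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)
print([round(w, 3) for w in unscaled_weights])  # [0.999, 0.0, 0.0]
print([round(w, 3) for w in scaled_weights])    # [0.576, 0.212, 0.212]
```

Without scaling, one position takes almost all of the attention and the others are nearly ignored. With scaling, the strongest match still wins but the other positions keep a meaningful share.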

def word_attention_example():
    """Show attention with actual words."""
    words = ["The", "cat", "sat", "on", "mat"]
    seq_len = len(words)

    # Simulate attention weights (in real transformers, these are learned)
    # Let's make "sat" pay attention to "cat"
    attention_weights = torch.zeros(seq_len, seq_len)
    attention_weights[2, 1] = 0.8  # "sat" → "cat"
    attention_weights[2, 2] = 0.2  # "sat" → "sat"

    # Make each row sum to 1
    for i in range(seq_len):
        if attention_weights[i].sum() > 0:
            attention_weights[i] = attention_weights[i] / attention_weights[i].sum()
        else:
            attention_weights[i] = torch.ones(seq_len) / seq_len

    # Visualize
    plt.figure(figsize=(8, 6))
    plt.imshow(attention_weights.numpy(), cmap='Blues', vmin=0, vmax=1)
    plt.colorbar(label='Attention Weight')

    # Add labels
    plt.xticks(range(seq_len), words)
    plt.yticks(range(seq_len), words)
    plt.xlabel('Attending To')
    plt.ylabel('Word')
    plt.title('Word-to-Word Attention')

    # Add values
    for i in range(seq_len):
        for j in range(seq_len):
            if attention_weights[i,j] > 0.1:
                plt.text(j, i, f'{attention_weights[i,j]:.1f}',
                         ha='center', va='center', color='white')
    plt.show()

def multi_head_attention_demo():
    """Show how multiple attention heads work together."""
    num_heads = 4
    d_model = 64
    d_k = d_model // num_heads  # 16 dimensions per head

    # Input
    seq_len = 5
    x = torch.randn(seq_len, d_model)

    # Each head processes a portion of the features
    all_heads_output = []
    for head in range(num_heads):
        # Each head looks at different features
        start_idx = head * d_k
        end_idx = (head + 1) * d_k  # exclusive upper bound

        # Extract this head's portion
        head_input = x[:, start_idx:end_idx]

        # Simple attention for this head (simplified: no learned projections)
        Q = head_input
        K = head_input
        V = head_input
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
        weights = torch.softmax(scores, dim=-1)
        head_output = torch.matmul(weights, V)
        all_heads_output.append(head_output)
        print(f"Head {head}: Focusing on dimensions {start_idx}-{end_idx - 1}")

    # Concatenate all heads
    concat_output = torch.cat(all_heads_output, dim=-1)
    print(f"\nFinal shape after concatenating {num_heads} heads: {concat_output.shape}")

Head 1 might focus on grammar (“who did what”)
Head 2 might track entities (“which cat, which mat”)
Head 3 might identify relationships (“sitting on”)
Head 4 might capture style or tone
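
However the heads end up specializing, the bookkeeping is the same: each head works on its own slice of the features, and the slices are joined back to the original width afterwards. A minimal sketch of that split-and-merge step follows. `split_heads` and `merge_heads` are illustrative names, and the sketch leaves out the learned projections that real implementations typically apply before splitting.

```python
def split_heads(vector, num_heads):
    """Cut one token's feature vector into equal slices, one per head."""
    if len(vector) % num_heads != 0:
        raise ValueError("feature size must be divisible by the number of heads")
    d_k = len(vector) // num_heads
    return [vector[h * d_k:(h + 1) * d_k] for h in range(num_heads)]

def merge_heads(heads):
    """Concatenate the per-head slices back into one vector."""
    return [value for head in heads for value in head]

token = [float(i) for i in range(8)]     # toy token with 8 features
heads = split_heads(token, num_heads=4)  # 4 heads, 2 features each
print(heads)                        # [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
print(merge_heads(heads) == token)  # True
```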

from transformers import pipeline

def compare_architectures():
    """Show the three transformer types in action."""
    print("=== Three Types of Transformers ===\n")

    # 1. ENCODER-ONLY: The Reader (understands text)
    print("1. Encoder-Only (BERT) - The Careful Reader:")
    classifier = pipeline('sentiment-analysis',
                          model='distilbert-base-uncased-finetuned-sst-2-english')
    text = "I love learning about transformers!"
    result = classifier(text)
    print(f"   Input: '{text}'")
    print(f"   Analysis: {result[0]['label']} (confidence: {result[0]['score']:.3f})")
    print("   Use for: Classification, understanding, search\n")

    # 2. DECODER-ONLY: The Writer (generates text)
    print("2. Decoder-Only (GPT) - The Creative Writer:")
    # max_new_tokens caps the generated length; passing max_length as well would conflict
    generator = pipeline('text-generation', model='gpt2', max_new_tokens=15)
    prompt = "The future of AI is"
    result = generator(prompt, num_return_sequences=1)
    print(f"   Prompt: '{prompt}'")
    print(f"   Generated: '{result[0]['generated_text']}'")
    print("   Use for: Chatbots, story writing, code completion\n")

    # 3. ENCODER-DECODER: The Translator (transforms text)
    print("3. Encoder-Decoder (T5) - The Translator:")
    summarizer = pipeline('summarization', model='t5-small')
    long_text = ("Transformers have revolutionized natural language processing "
                 "by using self-attention mechanisms. They process entire sequences "
                 "at once, understanding context better than previous models.")
    summary = summarizer(long_text, max_length=30, min_length=10)
    print(f"   Input: '{long_text[:50]}...'")
    print(f"   Summary: '{summary[0]['summary_text']}'")
    print("   Use for: Translation, summarization, Q&A")

def visualize_attention_masks():
    """Show how different architectures see text."""
    seq_len = 6
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # 1. BERT: Can see everything (bidirectional)
    bert_mask = torch.ones(seq_len, seq_len)
    axes[0].imshow(bert_mask, cmap='Blues', vmin=0, vmax=1)
    axes[0].set_title('BERT (Encoder)\nSees Everything')
    axes[0].set_xlabel('Can see →')
    axes[0].set_ylabel('Token ↓')

    # 2. GPT: Can only see backwards (causal)
    gpt_mask = torch.tril(torch.ones(seq_len, seq_len))
    axes[1].imshow(gpt_mask, cmap='Blues', vmin=0, vmax=1)
    axes[1].set_title('GPT (Decoder)\nSees Only Past')

    # 3. Training: Random masking
    train_mask = (torch.rand(seq_len, seq_len) > 0.15).float()
    axes[2].imshow(train_mask, cmap='Blues', vmin=0, vmax=1)
    axes[2].set_title('Training\nRandom Masking')

    plt.tight_layout()
    plt.show()

BERT: Every position sees all positions (understanding)
GPT: Each position only sees previous positions (generation)
Training: Randomly hiding positions (BERT masks about 15% of input tokens) forces the model to rely on context, which makes it more robust
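
To see what "only sees previous positions" does to the attention weights, here is a minimal pure-Python sketch of a causal mask applied before softmax. It uses equal raw scores everywhere (an assumption made for clarity) so that the mask alone shapes the result. `masked_softmax` is an illustrative helper.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over one row, ignoring positions where mask is 0."""
    masked = [s if m else float("-inf") for s, m in zip(scores, mask)]
    largest = max(masked)
    exps = [math.exp(s - largest) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

seq_len = 4
# Lower-triangular causal mask: position i may attend to positions 0..i
causal_mask = [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

# Equal raw scores everywhere, so only the mask shapes the result
scores = [[0.0] * seq_len for _ in range(seq_len)]
weights = [masked_softmax(row, mask_row) for row, mask_row in zip(scores, causal_mask)]

for row in weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

The first token can only attend to itself, while the last token spreads its attention over the whole prefix. Future positions always receive exactly zero weight.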

def simple_rag_example():
    """Show how RAG works with a simple example."""
    # Step 1: Our knowledge base (imagine this is Wikipedia)
    knowledge_base = [
        "The Eiffel Tower is 330 meters tall and located in Paris.",
        "The Great Wall of China is over 21,000 kilometers long.",
        "The Pyramid of Giza was built around 2560 BCE.",
        "Transformers were introduced in the 2017 'Attention is All You Need' paper.",
        "BERT stands for Bidirectional Encoder Representations from Transformers."
    ]

    # Step 2: User asks a question
    question = "How tall is the Eiffel Tower?"
    print(f"Question: {question}\n")

    # Step 3: Find relevant information (simple keyword matching)
    print("Step 1: Searching knowledge base...")
    relevant_docs = []
    for doc in knowledge_base:
        if "Eiffel Tower" in doc or "tall" in doc:
            relevant_docs.append(doc)
            print(f"Found: {doc}")

    # Step 4: Create a prompt with context
    context = " ".join(relevant_docs)
    prompt = f"""Based on the following information:
{context}
Question: {question}
Answer:"""
    print("\nStep 2: Creating prompt with context...")
    print(prompt)

    # Step 5: Generate answer (using GPT-2)
    print("\nStep 3: Generating answer...")
    # max_new_tokens caps the answer length; passing max_length as well would conflict
    generator = pipeline('text-generation', model='gpt2', max_new_tokens=20)
    answer = generator(prompt, pad_token_id=50256)
    final_answer = answer[0]['generated_text'].split('Answer:')[-1].strip()
    print(f"\nFinal Answer: {final_answer}")

Model might hallucinate (make up facts)
Knowledge is frozen at training time
Can’t access private documents
Answers are grounded in real documents
Knowledge can be updated without retraining
Can work with your company’s private data
Provides sources for fact-checking
Question: “What’s our company’s return policy?”
Without RAG: makes up a plausible but wrong policy
With RAG: retrieves actual policy document and quotes it accurately
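
The retrieval step is what makes these benefits possible, so it is worth sketching on its own. The version below ranks documents by how many words they share with the question. That is a deliberately crude stand-in (an assumption for illustration) for the embedding-based search a production RAG system would more likely use. `tokenize` and `retrieve` are hypothetical helper names.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=1):
    """Rank documents by the number of words they share with the question."""
    question_words = tokenize(question)
    scored = [(len(question_words & tokenize(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop documents that share no words at all
    return [doc for score, doc in scored[:top_k] if score > 0]

knowledge_base = [
    "The Eiffel Tower is 330 meters tall and located in Paris.",
    "The Great Wall of China is over 21,000 kilometers long.",
    "The Pyramid of Giza was built around 2560 BCE.",
]

print(retrieve("How tall is the Eiffel Tower?", knowledge_base))
# ['The Eiffel Tower is 330 meters tall and located in Paris.']
```

Because the retrieved text is returned as-is, it can be shown to the user as a source, and updating the knowledge base is as simple as editing the list.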

with torch.no_grad():
    outputs = model(**inputs)
# → Saves memory and speeds up inference

distilbert-base: 67M parameters — Fast, good for simple tasks
bert-base: 110M parameters — Balanced performance
bert-large: 340M parameters — Best accuracy, slower
gpt2: 124M parameters — Good for generation
gpt2-xl: 1.5B parameters — Better quality, needs more resources
Out of memory: Use smaller batch sizes or distilled models
Slow inference: Enable ONNX export or use quantization
Poor results: Check if you’re using the right architecture
Tokenization issues: Make sure you load the tokenizer that matches your model
Training instability: Lower learning rate, use warmup
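
The parameter counts above translate fairly directly into memory for the weights. Here is a back-of-the-envelope sketch, assuming 4 bytes per parameter for float32, 2 for float16, and 1 for int8. It covers the weights only: activations, optimizer state, and framework overhead come on top, so real usage will be higher.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_mb(num_parameters, dtype="float32"):
    """Approximate memory for the weights alone, in megabytes (1 MB = 10**6 bytes)."""
    return num_parameters * BYTES_PER_PARAM[dtype] / 1e6

# Parameter counts as listed above
models = {
    "distilbert-base": 67e6,
    "bert-base": 110e6,
    "bert-large": 340e6,
    "gpt2": 124e6,
    "gpt2-xl": 1.5e9,
}

for name, params in models.items():
    fp32 = weight_memory_mb(params)
    int8 = weight_memory_mb(params, "int8")
    print(f"{name:>16}: {fp32:7.0f} MB in float32, {int8:7.0f} MB in int8")
```

This is why quantization and distilled models are the first things to try when memory runs out: going from float32 to int8 cuts the weight footprint to a quarter.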

Input Text: “Transformers are revolutionary!”
Tokenization:

‘Transformers are revolutionary!’ → [‘Transform’, ‘ers’, ‘are’, ‘revolutionary’, ‘!’]
[‘Transform’, ‘ers’, …] → [1547, 433, 526, 9823, 256]
[1547, 433, …] → [[0.23, -0.45, …], [0.67, 0.12, …], …]
Each token → 768-dimensional vector
So model knows word order
Each word looks at all other words
‘revolutionary’ might focus on ‘Transformers’
Each token individually processed
Classification: ‘POSITIVE sentiment’
Generation: ‘…and changing the world!’
Translation: ‘Les transformers sont révolutionnaires!’

Transformers = Tokenizer + Embeddings + Attention + Feed-Forward
Attention lets every word see every other word (the breakthrough!)
Three types: Encoder (understand), Decoder (generate), Both (transform)
Multi-head attention = multiple perspectives for richer understanding
Position matters: transformers need positional information to know word order
RAG = Transformers + External Knowledge for better accuracy
Choose architecture based on task (classification vs generation vs transformation)

# Setup environment
git clone git@github.com:RichardHightower/art_hug_04.git
cd art_hug_04
task setup
# Run examples
task run-attention-mechanism  # See attention in action
task run-modern-models        # Compare architectures
task run                      # Run everything
# Or run individual Python files
python src/attention_mechanism.py
python src/modern_models.py
python src/rag_example.py

Try the Code: Run the examples and modify them
Pick a Project: Choose a task (classification, generation, or transformation)
Select a Model: Use the decision tree to pick the right architecture
Fine-tune: Adapt a pre-trained model to your specific needs
Deploy: Use optimization techniques for production