Inside the Transformer: Architecture and Attention Demystified -- A Complete Guide

Introduction: What Are Transformers and Why Should You Care? (Article 4 alternative)

Rick Hightower

Originally published on Medium.


  • Tokenizer: Breaks sentences into pieces (like cutting a pizza into slices)
  • Embeddings: Turns words into numbers the computer understands
  • Attention: The secret sauce — lets the model focus on what’s important
  • Layers: Stack these parts to make the model smarter
# Clone the repository and navigate to it
cd art_hug_04
# Run the setup task (installs Python 3.12.9 and all dependencies)
task setup
# Run all examples
task run
# Run specific examples
task run-attention-mechanism  # Self-attention demos
task run-modern-models        # Architecture comparisons
  • transformers: The Hugging Face library (your AI toolkit)

  • torch: PyTorch for the mathematical operations

  • matplotlib/seaborn: For creating visualizations

  • “Transformers” might become [‘Transform’, ‘ers’]

  • This helps handle words the model hasn’t seen before!
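
To make the subword idea concrete, here is a minimal sketch of greedy longest-match splitting. The vocabulary `TOY_VOCAB` and the helper `split_into_subwords` are made up for illustration. Real tokenizers, such as the byte-level BPE that RoBERTa uses, learn their vocabulary from data and follow a different algorithm, so treat this as an intuition aid only.

```python
# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of pieces.
TOY_VOCAB = {"Transform", "ers", "are", "amaz", "ing", "!"}

def split_into_subwords(word, vocab=TOY_VOCAB):
    """Greedily split `word` into the longest known pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

print(split_into_subwords("Transformers"))  # ['Transform', 'ers']
print(split_into_subwords("amazing"))       # ['amaz', 'ing']
```

Because unknown text falls back to smaller pieces, the tokenizer never has to give up on a word it has not seen before.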

from transformers import AutoTokenizer, AutoModel
import torch

def basic_tokenization_and_embedding():
    """Let's convert text to numbers step by step."""

    # Step 1: Load a pre-trained model (like buying a trained dog vs training one yourself)
    model_name = "roberta-base"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Step 2: Take a sentence
    sentence = "Transformers are amazing!"

    # Step 3: Break it into pieces (tokenize)
    inputs = tokenizer(sentence, return_tensors="pt")
    print("Token IDs:", inputs["input_ids"])
    # Example output: tensor([[0, 44929, 32, 2770, 328, 2]])

    # What do these numbers mean? Let's see:
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    print(f"Tokens: {tokens}")
    # Illustrative output: ['<s>', 'Transform', 'ers', 'Ġare', 'Ġamazing', '!', '</s>']
    # (the exact split and IDs depend on the tokenizer's vocabulary)

    # Step 4: Convert to embeddings (meaningful numbers)
    with torch.no_grad():  # This means "just use, don't train"
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state
        print("Embeddings shape:", embeddings.shape)
        # Example output: torch.Size([1, 6, 768]), i.e. [batch, tokens, features]
        # The token count here always equals the number of token IDs above.
  1. Tokenization: “Transformers are amazing!” becomes pieces:
  • <s> = start of sentence (like a capital letter)

  • Transform + ers = the word split into known pieces

  • Ġare = "are" with a space marker (Ġ)

  • </s> = end of sentence (like a period)

  2. Embeddings: each token becomes a vector of numbers:
  • Shape [1, 6, 768] means: 1 sentence, 6 tokens, 768 features per token

  • Think of 768 features like describing a person with 768 characteristics
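
At its input, the model starts with a plain lookup table: each token ID selects one row of numbers. The sketch below uses a tiny made-up table (4 features instead of 768, random values instead of learned ones) to show where a three-part shape like [1, 6, 768] comes from. The names `embedding_table` and `embed` are illustrative, not part of any library, and the real `last_hidden_state` also passes through all the transformer layers before it is returned.

```python
import random

random.seed(0)
VOCAB_SIZE = 10   # hypothetical tiny vocabulary
NUM_FEATURES = 4  # stands in for 768

# One row of random numbers per token ID (a real model learns these values).
embedding_table = [[random.uniform(-1, 1) for _ in range(NUM_FEATURES)]
                   for _ in range(VOCAB_SIZE)]

def embed(batch_of_token_ids):
    """Replace every token ID with its row from the table."""
    return [[embedding_table[token_id] for token_id in sentence]
            for sentence in batch_of_token_ids]

batch = [[0, 7, 3, 5, 8, 2]]  # 1 sentence, 6 made-up token IDs
vectors = embed(batch)
shape = (len(vectors), len(vectors[0]), len(vectors[0][0]))
print(shape)  # (1, 6, 4)
```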

import matplotlib.pyplot as plt
import numpy as np

def visualize_embeddings():
    """Show what embeddings look like."""
    # Get embeddings for a few words
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModel.from_pretrained("roberta-base")
    words = ["happy", "sad", "dog", "cat"]

    for word in words:
        inputs = tokenizer(word, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
            # Average the embeddings (excluding special tokens)
            embedding = outputs.last_hidden_state[0, 1:-1].mean(dim=0)

        # Show first 10 dimensions as a bar chart
        plt.figure(figsize=(10, 3))
        plt.bar(range(10), embedding[:10].numpy())
        plt.title(f"First 10 embedding dimensions for '{word}'")
        plt.xlabel("Dimension")
        plt.ylabel("Value")
        plt.show()
  • “The cat chased the dog”
  • “The dog chased the cat”
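
A quick way to see why order information is needed: if we only count words, the two sentences above are indistinguishable. The sketch below is plain Python with no model involved. It only shows that pairing each word with its position restores the difference.

```python
from collections import Counter

a = "The cat chased the dog".lower().split()
b = "The dog chased the cat".lower().split()

# Without positions, both sentences are the same bag of words.
print(Counter(a) == Counter(b))  # True

# Pairing each word with its position makes them distinguishable.
with_positions_a = list(enumerate(a))
with_positions_b = list(enumerate(b))
print(with_positions_a == with_positions_b)      # False
print(with_positions_a[1], with_positions_b[1])  # (1, 'cat') (1, 'dog')
```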
import math

def visualize_positional_encoding():
    """Show how positional encoding works."""
    seq_length = 20
    d_model = 64

    # Create positional encoding
    position = torch.arange(seq_length).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) *
                         -(math.log(10000.0) / d_model))
    pe = torch.zeros(seq_length, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # Even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimensions

    # Visualize
    plt.figure(figsize=(10, 6))
    plt.imshow(pe.numpy(), cmap='RdBu', aspect='auto')
    plt.colorbar()
    plt.xlabel('Embedding Dimension')
    plt.ylabel('Position in Sequence')
    plt.title('Positional Encoding Pattern')
    plt.show()
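
Written as a formula, the code above computes the sinusoidal encoding from the original Transformer paper. Here pos is the position in the sequence, i counts pairs of dimensions, and d_model is the embedding size. The `div_term` in the code is the factor 10000^(-2i/d_model), computed through `exp` and `log`.

```latex
\begin{aligned}
PE_{(pos,\,2i)}   &= \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \\
PE_{(pos,\,2i+1)} &= \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
\end{aligned}
```

Low dimensions oscillate quickly and high dimensions slowly, which is what produces the striped pattern in the plot.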
import torch.nn as nn

class SimpleTransformerBlock(nn.Module):
    """A basic building block of transformers."""

    def __init__(self, d_model):
        super().__init__()
        self.linear = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Residual connection: x + transformation(x)
        # Like having a safety net - if transformation fails, original x is preserved
        output = x + self.linear(x)
        # Normalization: keeps values in reasonable range
        # Like adjusting volume so it's not too loud or quiet
        return self.norm(output)

# Example usage
block = SimpleTransformerBlock(768)
input_tensor = torch.randn(1, 10, 768)  # [batch, sequence, features]
output = block(input_tensor)
print(f"Input shape: {input_tensor.shape}")
print(f"Output shape: {output.shape}")  # Same shape!
  • Residual Connection (x + self.linear(x)): Information can flow around problematic transformations
  • Layer Normalization: Keeps numbers stable, preventing “explosion” or “vanishing”
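
To see both ideas with actual numbers, here is a small sketch in plain Python. The input values are made up, and `layer_norm` is a simplified stand-in for `nn.LayerNorm` that leaves out the learned scale and shift.

```python
import math

def layer_norm(values, eps=1e-5):
    """Rescale one token's features to mean 0 and variance 1 (no learned scale or shift)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + eps) for v in values]

x = [1.0, 2.0, 3.0, 4.0]                # original features
transformed = [10.0, -20.0, 30.0, 5.0]  # pretend output of a sub-layer

# Residual connection: add the input back to the transformed values.
residual = [a + b for a, b in zip(x, transformed)]
normalized = layer_norm(residual)

print(residual)  # [11.0, -18.0, 33.0, 9.0]
print([round(v, 2) for v in normalized])
```

However wild the values in `residual` become, the normalized features end up centered around zero with a spread of about one.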
class FeedForward(nn.Module):
    """Each token gets processed individually."""

    def __init__(self, d_model=768, d_ff=3072):
        super().__init__()
        # Two linear layers with ReLU in between
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),  # Expand (768 → 3072)
            nn.ReLU(),                 # Add non-linearity
            nn.Dropout(0.1),           # Prevent overfitting
            nn.Linear(d_ff, d_model)   # Contract (3072 → 768)
        )

    def forward(self, x):
        return self.net(x)

# Demonstrate
ff = FeedForward()
tokens = torch.randn(1, 5, 768)  # 5 tokens, 768 dimensions each
output = ff(tokens)
print("Each token processed independently!")
print(f"Input: {tokens.shape} → Output: {output.shape}")
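
It can help to count how many numbers this small block actually learns. The arithmetic below assumes the default sizes used above (768 and 3072) and counts one weight matrix plus one bias vector per linear layer.

```python
d_model, d_ff = 768, 3072

# Each nn.Linear(in, out) holds an (in x out) weight matrix plus `out` biases.
expand_params = d_model * d_ff + d_ff
contract_params = d_ff * d_model + d_model
total = expand_params + contract_params

print(f"{total:,}")  # 4,722,432
```

That is roughly 4.7 million parameters for the feed-forward part of a single layer, which is why these layers account for a large share of a transformer's size.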
  • Query: “I need books about cooking”

  • Keys: Book titles on the shelves

  • Values: The actual books

  • Query from ‘sat’: “Who is doing the sitting?”

  • Keys from other words: [‘The’, ‘cat’, ‘on’, ‘the’, ‘mat’]

  • Attention focuses on ‘cat’ (high score)

  • Value from ‘cat’ enriches understanding of ‘sat’
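
Here is the same idea with hand-made numbers, for the single query from 'sat'. All vectors are invented so that 'cat' matches best. In a real model they come from learned projections, and the scores are also divided by the square root of the key size, which this sketch skips.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up 2-dimensional vectors, chosen so that 'cat' matches the query best.
query_sat = [1.0, 0.0]
keys = {"The": [0.1, 0.9], "cat": [3.0, 0.2], "on": [0.2, 0.5],
        "the": [0.1, 0.9], "mat": [1.0, 0.8]}
values = {"The": [0.0, 0.1], "cat": [1.0, 0.0], "on": [0.0, 0.3],
          "the": [0.0, 0.1], "mat": [0.5, 0.5]}

words = list(keys)
scores = [dot(query_sat, keys[w]) for w in words]
weights = softmax(scores)

# The output for 'sat' is the weighted average of all value vectors.
output = [sum(wt * values[w][d] for wt, w in zip(weights, words))
          for d in range(2)]

for w, wt in zip(words, weights):
    print(f"{w:>4}: {wt:.2f}")
print("best match:", words[weights.index(max(weights))])  # cat
```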

def demonstrate_self_attention():
    """Show self-attention step by step."""
    # Simple example dimensions
    d_model = 64  # Feature size
    seq_len = 5   # Number of words

    # Create Query, Key, Value matrices
    # In reality, these come from learned projections
    Q = torch.randn(seq_len, d_model)  # Queries
    K = torch.randn(seq_len, d_model)  # Keys
    V = torch.randn(seq_len, d_model)  # Values

    # Step 1: Calculate attention scores
    # How well does each query match each key?
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_model)
    print(f"Attention scores shape: {scores.shape}")  # [5, 5]

    # Step 2: Convert to probabilities
    attention_weights = torch.softmax(scores, dim=-1)
    print(f"Each row sums to: {attention_weights.sum(dim=-1)}")  # All 1.0!

    # Step 3: Weighted sum of values
    output = torch.matmul(attention_weights, V)
    print(f"Output shape: {output.shape}")  # [5, 64]

    # Visualize attention pattern
    plt.figure(figsize=(6, 5))
    plt.imshow(attention_weights.numpy(), cmap='Blues', vmin=0, vmax=1)
    plt.colorbar(label='Attention Weight')
    for i in range(seq_len):
        for j in range(seq_len):
            plt.text(j, i, f'{attention_weights[i,j]:.2f}',
                     ha='center', va='center')
    plt.xlabel('Key Position')
    plt.ylabel('Query Position')
    plt.title('Self-Attention Weights')
    plt.show()
  1. Scores: Dot product measures similarity (like asking “how related are these words?”)
  2. Softmax: Ensures each word distributes exactly 100% of its attention
  3. Weighted Sum: Combines information based on attention weights
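
The three steps combine into one formula, the scaled dot-product attention from the original Transformer paper. Here d_k is the size of each key vector (the single-head demo above uses d_model in that role):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

Dividing by the square root of d_k keeps the scores from growing with the vector size, so the softmax does not collapse onto a single position.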
def word_attention_example():
    """Show attention with actual words."""
    words = ["The", "cat", "sat", "on", "mat"]
    seq_len = len(words)

    # Simulate attention weights (in real transformers, these are learned)
    # Let's make "sat" pay attention to "cat"
    attention_weights = torch.zeros(seq_len, seq_len)
    attention_weights[2, 1] = 0.8  # "sat" → "cat"
    attention_weights[2, 2] = 0.2  # "sat" → "sat"

    # Make each row sum to 1
    for i in range(seq_len):
        if attention_weights[i].sum() > 0:
            attention_weights[i] = attention_weights[i] / attention_weights[i].sum()
        else:
            attention_weights[i] = torch.ones(seq_len) / seq_len

    # Visualize
    plt.figure(figsize=(8, 6))
    plt.imshow(attention_weights.numpy(), cmap='Blues', vmin=0, vmax=1)
    plt.colorbar(label='Attention Weight')

    # Add labels
    plt.xticks(range(seq_len), words)
    plt.yticks(range(seq_len), words)
    plt.xlabel('Attending To')
    plt.ylabel('Word')
    plt.title('Word-to-Word Attention')

    # Add values
    for i in range(seq_len):
        for j in range(seq_len):
            if attention_weights[i,j] > 0.1:
                plt.text(j, i, f'{attention_weights[i,j]:.1f}',
                         ha='center', va='center', color='white')
    plt.show()
def multi_head_attention_demo():
    """Show how multiple attention heads work together."""
    num_heads = 4
    d_model = 64
    d_k = d_model // num_heads  # 16 dimensions per head

    # Input
    seq_len = 5
    x = torch.randn(seq_len, d_model)

    # Each head processes a portion of the features
    all_heads_output = []
    for head in range(num_heads):
        # Each head looks at different features
        start_idx = head * d_k
        end_idx = (head + 1) * d_k  # exclusive upper bound

        # Extract this head's portion
        head_input = x[:, start_idx:end_idx]

        # Simple attention for this head (simplified: no learned projections)
        Q = head_input
        K = head_input
        V = head_input
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
        weights = torch.softmax(scores, dim=-1)
        head_output = torch.matmul(weights, V)
        all_heads_output.append(head_output)
        print(f"Head {head}: Focusing on dimensions {start_idx}-{end_idx - 1}")

    # Concatenate all heads
    concat_output = torch.cat(all_heads_output, dim=-1)
    print(f"\nFinal shape after concatenating {num_heads} heads: {concat_output.shape}")
  • Head 1 might focus on grammar (“who did what”)
  • Head 2 might track entities (“which cat, which mat”)
  • Head 3 might identify relationships (“sitting on”)
  • Head 4 might capture style or tone
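
The bookkeeping behind multiple heads is just split and concatenate. The sketch below uses tiny made-up sizes (8 features, 4 heads) so the output is readable. It slices one token's features directly, whereas real implementations first apply learned projections for each head, so take it as a picture of the shapes only.

```python
d_model, num_heads = 8, 4          # small numbers so the output is readable
d_k = d_model // num_heads         # 2 features per head

token = list(range(d_model))       # stand-in for one token's feature vector

# Split: head h sees features [h * d_k, (h + 1) * d_k).
heads = [token[h * d_k:(h + 1) * d_k] for h in range(num_heads)]
print(heads)  # [[0, 1], [2, 3], [4, 5], [6, 7]]

# Each head would run its own attention here; this sketch passes values through.
# Concatenate: the head outputs are joined back into one d_model-sized vector.
merged = [value for head in heads for value in head]
print(merged == token)  # True
```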
from transformers import pipeline

def compare_architectures():
    """Show the three transformer types in action."""
    print("=== Three Types of Transformers ===\n")

    # 1. ENCODER-ONLY: The Reader (understands text)
    print("1. Encoder-Only (BERT) - The Careful Reader:")
    classifier = pipeline('sentiment-analysis',
                          model='distilbert-base-uncased-finetuned-sst-2-english')
    text = "I love learning about transformers!"
    result = classifier(text)
    print(f"   Input: '{text}'")
    print(f"   Analysis: {result[0]['label']} (confidence: {result[0]['score']:.3f})")
    print("   Use for: Classification, understanding, search\n")

    # 2. DECODER-ONLY: The Writer (generates text)
    print("2. Decoder-Only (GPT) - The Creative Writer:")
    generator = pipeline('text-generation', model='gpt2', max_new_tokens=15)
    prompt = "The future of AI is"
    # Output length is capped by max_new_tokens above; also passing
    # max_length would conflict with it, so it is left out here.
    result = generator(prompt, num_return_sequences=1)
    print(f"   Prompt: '{prompt}'")
    print(f"   Generated: '{result[0]['generated_text']}'")
    print("   Use for: Chatbots, story writing, code completion\n")

    # 3. ENCODER-DECODER: The Translator (transforms text)
    print("3. Encoder-Decoder (T5) - The Translator:")
    summarizer = pipeline('summarization', model='t5-small')
    long_text = ("Transformers have revolutionized natural language processing "
                 "by using self-attention mechanisms. They process entire sequences "
                 "at once, understanding context better than previous models.")
    summary = summarizer(long_text, max_length=30, min_length=10)
    print(f"   Input: '{long_text[:50]}...'")
    print(f"   Summary: '{summary[0]['summary_text']}'")
    print("   Use for: Translation, summarization, Q&A")
def visualize_attention_masks():
    """Show how different architectures see text."""
    seq_len = 6
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # 1. BERT: Can see everything (bidirectional)
    bert_mask = torch.ones(seq_len, seq_len)
    axes[0].imshow(bert_mask, cmap='Blues', vmin=0, vmax=1)
    axes[0].set_title('BERT (Encoder)\nSees Everything')
    axes[0].set_xlabel('Can see →')
    axes[0].set_ylabel('Token ↓')

    # 2. GPT: Can only see backwards (causal)
    gpt_mask = torch.tril(torch.ones(seq_len, seq_len))
    axes[1].imshow(gpt_mask, cmap='Blues', vmin=0, vmax=1)
    axes[1].set_title('GPT (Decoder)\nSees Only Past')

    # 3. Training: Random masking
    train_mask = (torch.rand(seq_len, seq_len) > 0.15).float()
    axes[2].imshow(train_mask, cmap='Blues', vmin=0, vmax=1)
    axes[2].set_title('Training\nRandom Masking')

    plt.tight_layout()
    plt.show()
  • BERT: Every position sees all positions (understanding)
  • GPT: Each position only sees previous positions (generation)
  • Training: Randomly hiding about 15% of positions (as in the demo) forces the model to rely on context, which makes it more robust
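
The causal (GPT-style) mask can also be seen in the numbers rather than in a picture. In this minimal sketch, blocked positions get a score of negative infinity before the softmax, so their weight becomes exactly zero. The helper `masked_softmax` and the equal scores are made up for illustration.

```python
import math

def masked_softmax(scores, allowed):
    """Softmax over `scores`, giving zero weight where `allowed` is False."""
    # Blocked positions get -inf, so exp() turns them into exactly 0.
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    biggest = max(masked)
    exps = [math.exp(s - biggest) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

seq_len = 4
scores = [[1.0] * seq_len for _ in range(seq_len)]  # made-up equal scores

# Causal (GPT-style) mask: position i may look at positions 0..i only.
causal = [[j <= i for j in range(seq_len)] for i in range(seq_len)]

for i in range(seq_len):
    weights = masked_softmax(scores[i], causal[i])
    print([round(w, 2) for w in weights])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

Each row still sums to 1, but all of the attention is spread over the current and earlier positions only.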
def simple_rag_example():
    """Show how RAG works with a simple example."""
    # Step 1: Our knowledge base (imagine this is Wikipedia)
    knowledge_base = [
        "The Eiffel Tower is 330 meters tall and located in Paris.",
        "The Great Wall of China is over 21,000 kilometers long.",
        "The Pyramid of Giza was built around 2560 BCE.",
        "Transformers were introduced in the 2017 'Attention is All You Need' paper.",
        "BERT stands for Bidirectional Encoder Representations from Transformers."
    ]

    # Step 2: User asks a question
    question = "How tall is the Eiffel Tower?"
    print(f"Question: {question}\n")

    # Step 3: Find relevant information (simple keyword matching)
    print("Step 1: Searching knowledge base...")
    relevant_docs = []
    for doc in knowledge_base:
        if "Eiffel Tower" in doc or "tall" in doc:
            relevant_docs.append(doc)
            print(f"Found: {doc}")

    # Step 4: Create a prompt with context
    context = " ".join(relevant_docs)
    prompt = f"""Based on the following information:
{context}
Question: {question}
Answer:"""
    print("\nStep 2: Creating prompt with context...")
    print(prompt)

    # Step 5: Generate answer (using GPT-2)
    print("\nStep 3: Generating answer...")
    generator = pipeline('text-generation', model='gpt2', max_new_tokens=20)
    # Output length is capped by max_new_tokens above; also passing
    # max_length would conflict with it, so it is left out here.
    answer = generator(prompt, pad_token_id=50256)
    final_answer = answer[0]['generated_text'].split('Answer:')[-1].strip()
    print(f"\nFinal Answer: {final_answer}")
  • Model might hallucinate (make up facts)

  • Knowledge is frozen at training time

  • Can’t access private documents

  • Answers are grounded in real documents

  • Knowledge can be updated without retraining

  • Can work with your company’s private data

  • Provides sources for fact-checking

  • Question: “What’s our company’s return policy?”

  • Without RAG: makes up a plausible but wrong policy

  • With RAG: retrieves actual policy document and quotes it accurately
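
The retrieval half of that example can be sketched without any model at all. The snippet below ranks documents by how many words they share with the question and then builds the prompt. The policy texts, the `tokenize` helper, and the `retrieve` function are all hypothetical. Production systems usually rank with embedding similarity instead of word overlap.

```python
import re

def tokenize(text):
    """Lowercase and split into a set of words (a deliberately crude tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(documents,
                    key=lambda doc: len(question_words & tokenize(doc)),
                    reverse=True)
    return ranked[:top_k]

# Hypothetical policy snippets standing in for a company knowledge base.
documents = [
    "Returns are accepted within 30 days with a receipt.",
    "Shipping is free on orders over 50 dollars.",
    "Support is available by email on weekdays.",
]

question = "What is the return policy for returns without a receipt?"
best = retrieve(question, documents)[0]
prompt = f"Based on the following information:\n{best}\nQuestion: {question}\nAnswer:"
print(best)  # Returns are accepted within 30 days with a receipt.
```

Because the generator only sees the retrieved text, updating the knowledge base changes the answers without retraining anything.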

with torch.no_grad():
    output = model(**inputs)
# → Saves memory and speeds up inference
  • distilbert-base: 67M parameters — Fast, good for simple tasks

  • bert-base: 110M parameters — Balanced performance

  • bert-large: 340M parameters — Best accuracy, slower

  • gpt2: 124M parameters — Good for generation

  • gpt2-xl: 1.5B parameters — Better quality, needs more resources
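
A rough way to turn those parameter counts into hardware needs is to multiply by the bytes per parameter. The estimate below assumes 4 bytes per parameter (float32) and covers the weights only. Activations, optimizer state, and framework overhead come on top, so treat the numbers as lower bounds.

```python
# Parameter counts from the list above.
models = {
    "distilbert-base": 67_000_000,
    "bert-base": 110_000_000,
    "bert-large": 340_000_000,
    "gpt2": 124_000_000,
    "gpt2-xl": 1_500_000_000,
}

def weight_memory_mb(num_params, bytes_per_param=4):
    """Memory for the weights alone; 4 bytes per parameter assumes float32."""
    return num_params * bytes_per_param / 1_000_000

for name, params in models.items():
    print(f"{name:>16}: ~{weight_memory_mb(params):,.0f} MB in float32")
```

Passing `bytes_per_param=2` gives the half-precision figure, which is one reason lower-precision formats help with memory.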

  • Out of memory: Use smaller batch sizes or distilled models

  • Slow inference: Enable ONNX export or use quantization

  • Poor results: Check if you’re using the right architecture

  • Tokenization issues: Make sure the tokenizer matches the model

  • Training instability: Lower learning rate, use warmup
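
For the last tip, warmup simply means starting with a tiny learning rate and ramping up. Here is a minimal sketch of a linear warmup schedule. The peak rate of 2e-5 and the 100 warmup steps are placeholder assumptions, not recommendations from this article, and real training loops usually add a decay phase after the warmup.

```python
def learning_rate(step, peak_lr=2e-5, warmup_steps=100):
    """Linear warmup: ramp from 0 to `peak_lr`, then hold it constant."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr

for step in (0, 50, 100, 500):
    print(step, learning_rate(step))
```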

  1. Input Text: “Transformers are revolutionary!”
  2. Tokenization:
  • ‘Transformers are revolutionary!’ → [‘Transform’, ‘ers’, ‘are’, ‘revolutionary’, ‘!’]

  • [‘Transform’, ‘ers’, …] → [1547, 433, 526, 9823, 256]

  • [1547, 433, …] → [[0.23, -0.45, …], [0.67, 0.12, …], …]

  • Each token → 768-dimensional vector

  • So model knows word order

  • Each word looks at all other words

  • ‘revolutionary’ might focus on ‘Transformers’

  • Each token individually processed

  • Classification: ‘POSITIVE sentiment’

  • Generation: ‘…and changing the world!’

  • Translation: ‘Les transformers sont révolutionnaires!’

  1. Transformers = Tokenizer + Embeddings + Attention + Feed-Forward
  2. Attention lets every word see every other word (the breakthrough!)
  3. Three types: Encoder (understand), Decoder (generate), Both (transform)
  4. Multi-head attention = multiple perspectives for richer understanding
  5. Position matters — transformers need to know word order.
  6. RAG = Transformers + External Knowledge for better accuracy
  7. Choose architecture based on task (classification vs generation vs transformation)
# Setup environment
git clone git@github.com:RichardHightower/art_hug_04.git
cd art_hug_04
task setup
# Run examples
task run-attention-mechanism  # See attention in action
task run-modern-models        # Compare architectures
task run                      # Run everything
# Or run individual Python files
python src/attention_mechanism.py
python src/modern_models.py
python src/rag_example.py
  1. Try the Code: Run the examples and modify them
  2. Pick a Project: Choose a task (classification, generation, or transformation)
  3. Select a Model: Use the decision tree to pick the right architecture
  4. Fine-tune: Adapt a pre-trained model to your specific needs
  5. Deploy: Use optimization techniques for production