Hugging Face: Building Custom Language Models: From Raw Data to Production AI

In today’s rapidly evolving AI landscape, the ability to create custom language models tailored to specific domains represents a c

Rick Hightower

Originally published on Medium.

In today’s rapidly evolving AI landscape, the ability to create custom language models tailored to specific domains represents a c

Hugging Face: Building Custom Language Models: From Raw Data to Production AI

  • Data curation fundamentals: selecting, cleaning, and preparing domain-specific text
  • Scalable processing techniques for handling massive datasets efficiently
  • Privacy protection and data versioning for responsible AI development
  • Modern model architecture selection and configuration strategies
  • Training workflows with distributed computing and experiment tracking
  • Parameter-efficient fine-tuning methods for adapting large models
  • Evaluation, error analysis, and iterative improvement techniques

Hugging Face: Building Custom Language Models: From Raw Data to Production AI

# Using pyenv (recommended for Python version management)
pyenv install 3.12.9  
# Use Python 3.12.9 as per project requirements
pyenv 
local
 3.12.9
# Verify Python version
python --version  
# Should show Python 3.12.9
# Install with poetry (recommended)
poetry new dataset-curation-project
cd
 dataset-curation-project
poetry 
env
 use 3.12.9
poetry add datasets transformers tokenizers torch accelerate@^0.26.0
# Or use mini-conda
conda create -n dataset-curation python=3.12.9
conda activate dataset-curation
pip install datasets transformers tokenizers torch 
"accelerate>=0.26.0,<0.27.0"
# Or use pip with pyenv
pyenv install 3.12.9
pyenv 
local
 3.12.9
pip install datasets transformers tokenizers torch 
"accelerate>=0.26.0,<0.27.0"

πŸ’‘ Pro Tip: This project uses Python 3.12.9 as configured in the pyproject.toml file. Ensure you use this specific version for consistency with the development environment and Poetry lock file.

⚠️ Note on accelerate: This project requires accelerate version ^0.26.0. Earlier versions may cause compatibility issues with certain model configurations and distributed training setups.

πŸ’‘ Pro Tip: Always use the latest stable Python version (3.12.9 for this project) and ensure accelerate >= 0.26.0 to avoid compatibility issues.

  1. Accelerate Version: Ensure you have accelerate >= 0.26.0 installed
  2. Model Examples: Using Llama-3/Gemma-2 where available, with GPT-2 as fallback. Sometimes we pick GPT-2 just because it is small.
  3. Streaming: Wikipedia dataset uses latest configs (e.g., β€˜20240101.en’)
from
 typing import Any, List
import pandas 
as
 pd
# Example: Analyzing model predictions for bias
def 
analyze_bias
(
y_true
: List[
int
], 
y_pred
: List[
int
], 
sensitive_features
: List[str]) -> dict:
    
""
"
    Analyze predictions for potential bias across sensitive groups.
    Args:
        y_true: True labels
        y_pred: Predicted labels
        sensitive_features: Sensitive attributes for each sample
    Returns:
        dict: Bias analysis results
    "
""
    
try
:
        
from
 fairlearn.metrics import MetricFrame
        
from
 sklearn.metrics import accuracy_score
    except ImportError:
        
print
(
"Install fairlearn and scikit-learn for bias analysis:"
)
        
print
(
"pip install fairlearn scikit-learn"
)
        
return
 {}
    
# Create metric frame for bias analysis
    metric_frame = 
MetricFrame
(
        metrics=accuracy_score,
        y_true=y_true,
        y_pred=y_pred,
        sensitive_features=sensitive_features
    )
    
# Display disparities
    
print
(
"Performance by group:"
)
    
print
(metric_frame.by_group)
    
# Calculate disparity ratio
    disparity = metric_frame.
difference
(method=
'ratio'
)
    
print
(f
"\\nDisparity ratio: {disparity:.2f}"
)
    
return
 metric_frame
# Example usage with demographic data
df = pd.
DataFrame
({
    
'text'
: [
'...'
],  # Your text data
    
'label'
: [
0
, 
1
, 
0
, 
1
],  # True labels
    
'predicted'
: [
0
, 
1
, 
1
, 
1
],  # Model predictions
    
'demographic'
: [
'A'
, 
'B'
, 
'A'
, 
'B'
]  # Sensitive attribute
})
analyze_bias
(df[
'label'
], df[
'predicted'
], df[
'demographic'
])
  • πŸ“Š Audit your data sources for representation gaps
  • πŸ”„ Rebalance datasets to ensure fair representation
  • 🎯 Use targeted data augmentation for underrepresented groups
  • πŸ“ˆ Monitor fairness metrics throughout training
  • 🀝 Involve diverse stakeholders in data curation decisions

βš–οΈ Key Principle*: Bias in data leads to bias in models. Always audit your datasets for representation gaps and demographic disparities.*

# Bias Detection and Mitigation Example
try:
    from fairlearn.metrics import MetricFrame
    from sklearn.metrics import accuracy_score
    fairlearn_available = True
except ImportError:
    fairlearn_available = False
    
print
(
"⚠️ Fairlearn not installed. Using mock example."
)
    
print
(
"   Install with: pip install fairlearn scikit-learn"
)
# Create a sample dataset with potential bias
sample_predictions = pd.DataFrame({
    
'text'
: [
        
'Software engineer position available'
,
        
'Nursing position open'
,
        
'CEO role for experienced leader'
,
        
'Secretary needed for office'
,
        
'Data scientist role'
,
        
'Teacher position at school'
    ],
    
'true_label'
: [1, 1, 1, 1, 1, 1],  
# All are job postings
    
'predicted'
: [1, 1, 1, 0, 1, 0],   
# Model predictions
    
'occupation_type'
: [
'tech'
, 
'healthcare'
, 
'executive'
, 
'admin'
, 
'tech'
, 
'education'
],
    
'gender_bias_risk'
: [
'low'
, 
'high'
, 
'high'
, 
'high'
, 
'low'
, 
'medium'
]
})
print
(
"βš–οΈ Bias Detection Analysis"
)
print
(
"="
 * 70)
print
(
"\nπŸ“Š Sample Dataset:"
)
print
(sample_predictions[[
'text'
, 
'predicted'
, 
'occupation_type'
, 
'gender_bias_risk'
]])
if
 fairlearn_available:
    
# Analyze bias across occupation types
    metric_frame = MetricFrame(
        metrics=accuracy_score,
        y_true=sample_predictions[
'true_label'
],
        y_pred=sample_predictions[
'predicted'
],
        sensitive_features=sample_predictions[
'occupation_type'
]
    )
    
    
print
(
"\nπŸ“ˆ Performance by Occupation Type:"
)
    
print
(metric_frame.by_group)
    
    
# Calculate disparity
    disparity = metric_frame.difference(method=
'between_groups'
)
    
print
(f
"\n⚠️ Maximum accuracy disparity: {disparity:.2f}"
)
else
:
    
# Manual bias analysis
    
print
(
"\nπŸ“ˆ Manual Bias Analysis:"
)
    
for
 occ_type 
in
 sample_predictions[
'occupation_type'
].unique():
        mask = sample_predictions[
'occupation_type'
] == occ_type
        acc = (sample_predictions[mask][
'true_label'
] == 
                sample_predictions[mask][
'predicted'
]).mean()
        
print
(f
"  {occ_type}: {acc:.2%} accuracy"
)
print
(
"\n🎯 Bias Mitigation Strategies:"
)
print
(
"  βœ… Audit data sources for representation gaps"
)
print
(
"  βœ… Rebalance datasets to ensure fair representation"
)
print
(
"  βœ… Use targeted augmentation for underrepresented groups"
)
print
(
"  βœ… Monitor fairness metrics throughout training"
)
print
(
"  βœ… Involve diverse stakeholders in data curation"
)
# Demonstrate text augmentation for bias mitigation
def mitigate_gender_bias(text: str) -> List[str]:
    
""
"Generate gender-neutral variations of text."
""
    variations = [text]
    
    
# Simple pronoun swapping (production should use more sophisticated methods)
    gendered_terms = {
        
'he'
: [
'they'
, 
'she'
],
        
'his'
: [
'their'
, 
'her'
],
        
'him'
: [
'them'
, 
'her'
],
        
'chairman'
: [
'chairperson'
, 
'chair'
],
        
'businessman'
: [
'businessperson'
, 
'business professional'
],
        
'salesman'
: [
'salesperson'
, 
'sales professional'
]
    }
    
    
for
 term, replacements 
in
 gendered_terms.items():
        
if
 term 
in
 text.lower():
            
for
 replacement 
in
 replacements:
                variations.append(text.lower().replace(term, replacement))
    
    
return
 list(
set
(variations))
# Example of bias mitigation through augmentation
test_text = 
"The chairman announced his decision"
variations = mitigate_gender_bias(test_text)
print
(f
"\nπŸ”„ Bias Mitigation through Augmentation:"
)
print
(f
"Original: {test_text}"
)
print
(f
"Variations: {variations}"
)
βš–οΈ Bias Detection Analysis
======================================================================
πŸ“Š Sample Dataset:
                                   
text
  predicted occupation_type  \
0
  Software engineer position available          
1
            tech   
1
                 Nursing position open          
1
      healthcare   
2
       CEO role 
for
 experienced leader          
1
       executive   
3
           Secretary needed 
for
 office          
0
           admin   
4
                   Data scientist role          
1
            tech   
5
            Teacher position at school          
0
       education   
  gender_bias_risk  
0
              low  
1
             high  
2
             high  
3
             high  
4
              low  
5
           medium  
πŸ“ˆ Performance 
by
 Occupation Type:
occupation_type
admin         
0.0
education     
0.0
executive     
1.0
healthcare    
1.0
tech          
1.0
Name:
 accuracy_score, dtype: float64
⚠️ Maximum accuracy disparity: 
1.00
🎯 Bias Mitigation Strategies:
  βœ… Audit data sources 
for
 representation gaps
  βœ… Rebalance datasets 
to
 ensure fair representation
  βœ… Use targeted augmentation 
for
 underrepresented groups
  βœ… Monitor fairness metrics throughout training
  βœ… Involve diverse stakeholders 
in
 data curation
πŸ”„ Bias Mitigation through Augmentation:
Original:
 The chairman announced his decision
Variations:
 [
'tshe chairman announced his decision', 'tthey chairman announced his decision', 'the chairman announced their decision', 'the chairman announced her decision', 'the chairperson announced his decision', 'The chairman announced his decision', 'the chair announced his decision']
  • βœ… Include rare or industry-specific vocabulary seamlessly

  • βœ… Filter out sensitive or irrelevant content precisely

  • βœ… Meet strict privacy or compliance requirements confidently

  • βœ… Rapidly adapt to new domains with limited resources

  • Selecting relevant, diverse sources strategically

  • Cleaning and standardizing text meticulously

  • Removing duplicates and noise (including semantic deduplication)

  • Annotating and labeling (with tools like Argilla for human-in-the-loop workflows)

  • Tokenizing (splitting text into model-friendly pieces) and building vocabulary that fits your domain

  • Versioning and tracking your data for reproducibility

  • 🎯 Validate synthetic examples against real data distributions

  • πŸ”„ Mix synthetic and real data (typically 20–30% synthetic)

  • πŸ“Š Monitor model performance on held-out real data

  • πŸ›‘οΈ Ensure synthetic data doesn’t leak sensitive patterns

  • πŸ“ Document synthetic data generation for reproducibility

# Synthetic Data Generation for Enhanced Training
from
 transformers import pipeline
import random
# Initialize text generation pipeline (using smaller model for demo)
try
:
    generator = 
pipeline
(
"text-generation"
, model=
"gpt2"
, device=-
1
)
    generator_available = True
except 
Exception
 
as
 e:
    generator_available = False
    
print
(f
"⚠️ Could not load generator: {e}"
)
def 
generate_synthetic_examples
(
    
prompt_template
: str, 
    
num_examples
: 
int
 = 
5
, 
    
categories
: Optional[List[str]] = None,
    
max_length
: 
int
 = 
100
) -> List[Dict[str, str]]:
    
""
"Generate synthetic training examples using LLM-based augmentation."
""
    synthetic_data = []
    
    
if
 not generator_available:
        
# Fallback: Create template-based examples
        templates = [
            
"The patient experienced {symptom} and required {treatment}."
,
            
"Customer reported {issue} with {product}."
,
            
"Analysis shows {finding} in {domain}."
        ]
        
        symptoms = [
"fever"
, 
"headache"
, 
"fatigue"
, 
"chest pain"
]
        treatments = [
"medication"
, 
"rest"
, 
"monitoring"
, 
"intervention"
]
        
        
for
 i in 
range
(num_examples):
            template = random.
choice
(templates)
            
if
 
"{symptom}"
 in template:
                text = template.
format
(
                    symptom=random.
choice
(symptoms),
                    treatment=random.
choice
(treatments)
                )
            
else
:
                text = template.
format
(
                    issue=
"performance degradation"
,
                    product=
"software system"
,
                    finding=
"anomalous patterns"
,
                    domain=
"data analysis"
                )
            
            synthetic_data.
append
({
                
'text'
: text,
                
'category'
: categories[i % 
len
(categories)] 
if
 categories 
else
 
'general'
,
                
'synthetic'
: True
            })
    
else
:
        
# Use LLM for generation
        
for
 i in 
range
(num_examples):
            
if
 categories:
                category = random.
choice
(categories)
                prompt = prompt_template.
format
(category=category)
            
else
:
                prompt = prompt_template
            
            
# Generate with controlled randomness
            result = 
generator
(
                prompt,
                max_length=max_length,
                temperature=
0.8
,
                do_sample=True,
                top_p=
0.9
,
                pad_token_id=generator.tokenizer.eos_token_id
            )
            
            generated_text = result[
0
][
'generated_text'
]
            
# Extract only the new content (remove prompt)
            new_content = generated_text[
len
(prompt):].
strip
()
            
            synthetic_data.
append
({
                
'text'
: new_content 
if
 new_content 
else
 generated_text,
                
'category'
: category 
if
 categories 
else
 
'general'
,
                
'synthetic'
: True
            })
    
    
return
 synthetic_data
# Example: Generate customer support queries
prompt_template = 
"Generate a realistic customer support query about {category}:"
categories = [
"billing"
, 
"technical issues"
, 
"account access"
, 
"feature requests"
]
print
(
"πŸ€– Synthetic Data Generation Examples:"
)
print
(
"="
 * 
70
)
synthetic_examples = 
generate_synthetic_examples
(
    prompt_template, 
    num_examples=
4
, 
    categories=categories,
    max_length=
80
)
for
 i, example in 
enumerate
(synthetic_examples):
    
print
(f
"\nπŸ“ Synthetic Example {i+1} ({example['category']}):"
)
    
print
(f
"Text: {example['text']}"
)
# Demonstrate mixing synthetic with real data
real_data = [
    {
'text'
: 
'I cannot log into my account'
, 
'category'
: 
'account access'
, 
'synthetic'
: False},
    {
'text'
: 
'My bill is incorrect this month'
, 
'category'
: 
'billing'
, 
'synthetic'
: False},
    {
'text'
: 
'The app crashes on startup'
, 
'category'
: 
'technical issues'
, 
'synthetic'
: False}
]
# Mix datasets
mixed_dataset = real_data + synthetic_examples
synthetic_ratio = 
sum
(
1
 
for
 ex in mixed_dataset 
if
 ex[
'synthetic'
]) / 
len
(mixed_dataset)
print
(f
"\nπŸ“Š Dataset Composition:"
)
print
(f
"  Real examples: {len(real_data)}"
)
print
(f
"  Synthetic examples: {len(synthetic_examples)}"
)
print
(f
"  Synthetic ratio: {synthetic_ratio:.1%}"
)
print
(
"\nβœ… Best Practices for Synthetic Data:"
)
print
(
"  β€’ Validate against real data distributions"
)
print
(
"  β€’ Monitor model performance on held-out real data"
)
print
(
"  β€’ Document generation process for reproducibility"
)
print
(
"  β€’ Ensure synthetic data doesn't leak sensitive patterns"
)
# Try This: Experiment with different prompts
print
(
"\n🎯 Try This: Modify the prompt template to generate domain-specific examples!"
)
print
(
"Example prompts:"
)
print
(
"  - 'Write a medical diagnosis for a patient with {symptom}:'"
)
print
(
"  - 'Create a legal contract clause about {topic}:'"
)
πŸ€– Synthetic Data Generation Examples:
======================================================================
πŸ“ Synthetic Example 1 (technical issues):
Text: $ get --
help
For example, 
if
 you want to build a 
"biggest"
 online retailer that
's available to you, you can use the get-biggest-online-store command.
$ get --help Biggest Online Stores
The Biggest Online Store is a customer service endpoint that allows you to
πŸ“ Synthetic Example 2 (feature requests):
Text: $uri = array( $name ); $response = $uri->get('
http://api.twitter.com/api/v1/status 
','
OK
'); $response->set_header('
Content-Type 
','
text/html
'); $response->set_header('
Content-Length
πŸ“ Synthetic Example 3 (billing):
Text: 
$query
 = new WP.Request(
"/accounts/"
, 
$accounts
, 
"{"
. 
$query
. 
"Your billing information"
. 
": "
. 
$accounts
. 
". 
$query
. "
Your billing information can be verified by:
$accounts
. 
"{"
. 
$query
. 
"Your billing information must be
πŸ“ Synthetic Example 4 (feature requests):
Text: 
$sql
 = "
SELECT * FROM customer_support WHERE customer_support_id = 
$2
"
Customers can now provide a query for a customer support query:
$customer_support
 = 
$sql
->query('SELECT * FROM customer_support WHERE customer_support_id = 
$2
'')->query
πŸ“Š Dataset Composition:
  Real examples: 3
  Synthetic examples: 4
  Synthetic ratio: 57.1%
βœ… Best Practices for Synthetic Data:
  β€’ Validate against real data distributions
  β€’ Monitor model performance on held-out real data
  β€’ Document generation process for reproducibility
  β€’ Ensure synthetic data doesn't leak sensitive patterns
🎯 Try This: Modify the prompt template to generate domain-specific examples!
Example prompts:
  - 'Write a medical diagnosis for a patient with {symptom}:'
  - 'Create a legal contract clause about {topic}:'

πŸ€– Pro Tip*: Mix synthetic and real data in a 20–30% ratio for optimal results. Always validate synthetic examples against real data distributions.*

import re
from
 datasets import Dataset
from
 typing import Dict, Any
# Create sample data that might come from customer logs
sample_data = {
    
"text"
: [
        
"<p>Customer complaint: Product <b>broken</b></p>   Multiple   spaces!"
,
        
"<div>Great service!</div>\n\n\nExtra newlines"
,
        
"Normal text without HTML"
,
        
"Text with &amp; HTML entities &lt;encoded&gt;"
,
        
"Unicode issues: cafΓ©, naΓ―ve, rΓ©sumΓ©"
    ]
}
dataset = Dataset.
from_dict
(sample_data)
def 
clean_text
(
example
: Dict[str, Any]) -> Dict[str, Any]:
    
""
"Clean text by removing HTML and normalizing whitespace.
    
    Args:
        example: Dictionary containing 'text' field
        
    Returns:
        Dictionary with cleaned 'text' field
    "
""
    
try
:
        text = example.
get
(
"text"
, 
""
)
        
        
# Decode HTML entities
        import html
        text = html.
unescape
(text)
        
        
# Remove HTML tags
        text = re.
sub
(r
'<.*?>'
, 
''
, text)
        
        
# Replace multiple spaces/newlines with a single space
        text = re.
sub
(r
'\s+'
, 
' '
, text)
        
        
# Strip leading/trailing whitespace
        text = text.
strip
()
        
        
return
 {
"text"
: text}
    except 
Exception
 
as
 e:
        
print
(f
"Error cleaning text: {e}"
)
        
return
 {
"text"
: example.
get
(
"text"
, 
""
)}
# Apply cleaning
cleaned_dataset = dataset.
map
(clean_text)
print
(
"🧹 Data Cleaning Results:"
)
print
(
"="
 * 
60
)
for
 i in 
range
(
len
(dataset)):
    
print
(f
"\nπŸ“ Example {i+1}:"
)
    
print
(f
"Original: {dataset[i]['text']}"
)
    
print
(f
"Cleaned:  {cleaned_dataset[i]['text']}"
)
# Show statistics
original_chars = 
sum
(
len
(ex[
'text'
]) 
for
 ex in dataset)
cleaned_chars = 
sum
(
len
(ex[
'text'
]) 
for
 ex in cleaned_dataset)
print
(f
"\nπŸ“Š Cleaning Statistics:"
)
print
(f
"Total characters reduced: {original_chars - cleaned_chars} ({(1 - cleaned_chars/original_chars)*100:.1f}% reduction)"
)
  1. Loads your dataset using Hugging Face Datasets (supports CSV, JSON, Parquet, and streaming)
  2. Removes HTML tags like <p> completely
  3. Replaces extra spaces or newlines with a single space
  4. Trims spaces from the start and end
Map:
 
100
%
 
5
/5
 [
00
:00<00:00
, 
646.67
 
examples/s
]
🧹
 
Data Cleaning Results:
============================================================
πŸ“
 
Example 1:
Original:
 
<p>Customer
 
complaint:
 
Product
 
<b>broken</b></p>
   
Multiple
   
spaces!
Cleaned:  Customer complaint:
 
Product
 
broken
 
Multiple
 
spaces!
πŸ“
 
Example 2:
Original:
 
<div>Great
 
service!</div>
Extra
 
newlines
Cleaned:
  
Great
 
service!
 
Extra
 
newlines
πŸ“
 
Example 3:
Original:
 
Normal
 
text
 
without
 
HTML
Cleaned:
  
Normal
 
text
 
without
 
HTML
πŸ“
 
Example 4:
Original:
 
Text
 
with
 
&amp;
 
HTML
 
entities
 
&lt;encoded&gt;
Cleaned:
  
Text
 
with
 
&
 
HTML
 
entities
πŸ“
 
Example 5:
Original: Unicode issues:
 
cafΓ©,
 
naΓ―ve,
 
rΓ©sumΓ©
Cleaned:  Unicode issues:
 
cafΓ©,
 
naΓ―ve,
 
rΓ©sumΓ©
πŸ“Š
 
Cleaning Statistics:
Total characters reduced:
 
51
 
(23.7%
 
reduction)
  • πŸ“ˆ Outperform generic models in specialized tasks dramatically
  • πŸ›‘οΈ Reduce errors in critical business processes significantly
  • πŸ”’ Ensure privacy and regulatory compliance completely
  • 🌍 Enable support for rare languages or unique domains effectively
  • ⚑ Adapt quickly to new requirements using fine-tuning or continual learning

Hugging Face: Building Custom Language Models: From Raw Data to Production AI

  • Relevance: Does the text match your target use case? β€” Ensures model learns domain-specific patterns
  • Diversity: Is there a mix of topics, styles, and authors? β€” Prevents bias and improves generalization
  • Quality: Is the text well-formed and free of noise? β€” Reduces training on corrupted examples
  • Freshness: Are you using the latest available data? β€” Prevents model drift and outdated knowledge

Hugging Face: Building Custom Language Models: From Raw Data to Production AI

  • ✍️ Write clear, detailed instructions and provide examples for annotators

  • πŸ‘₯ Use multiple annotators per example to catch mistakes and reduce bias

  • πŸ”„ Regularly review disagreements, update guidelines, and retrain annotators as needed

  • πŸ”’ Ensure privacy: Mask or remove PII before annotation, especially in sensitive domains

  • SentencePiece Unigram: Flexible and robust for multilingual and domain-specific tasks

  • Byte-Pair Encoding (BPE): Splits rare words into subwords, balancing vocabulary size and coverage

  • WordPiece: Used in BERT; similar to BPE but merges differently

πŸ”€ Key Concept*: Domain-specific tokenizers learn to keep medical terms like β€œelectrocardiogram” as single tokens instead of breaking them into meaningless subwords. This leads to better understanding and more efficient processing.*

  • Preserves medical terminology intact
  • Reduces token count by up to 50% for medical text
  • Improves model context window utilization
  • Handles both common and rare medical terms effectively
  1. Loading a Medical Corpus: It first attempts to load real-world medical abstracts from the PubMed QA dataset via Hugging Face’s datasets library. If this fails, it falls back to a comprehensive synthetic medical corpus covering specialties like cardiology, neurology, oncology, and more. This ensures a robust dataset rich in medical terminology, which is critical for training an effective tokenizer.
  2. Training a BPE Tokenizer: Using the tokenizers library, the code trains a BPE tokenizer on the medical corpus. BPE is chosen for its ability to create subword tokens, balancing vocabulary size and term specificity. The tokenizer is configured with a vocabulary size of 10,000, special tokens (e.g., <pad>, <unk>), and a ByteLevel pre-tokenizer to handle text at a granular level.
  3. Saving and Testing the Tokenizer: The trained tokenizer is saved as a JSON file and loaded into a Hugging Face PreTrainedTokenizerFast for compatibility with transformer models. The code includes a quick test to demonstrate how medical terms are tokenized, highlighting the tokenizer's ability to keep domain-specific terms intact.
  4. Demonstrating Domain-Specific Benefits: The output emphasizes why a medical-specific tokenizer is valuable, including better preservation of medical terms, improved context awareness, and enhanced efficiency for downstream NLP tasks.
# Training a Custom Medical Tokenizer with BPE
import matplotlib.pyplot 
as
 plt
import seaborn 
as
 sns
from
 typing import List
from
 tokenizers import Tokenizer, models, pre_tokenizers, trainers, processors
def 
load_medical_corpus
(
max_samples
: 
int
 = 
10000
) -> List[str]:
    
""
"
    Load medical text data from available sources
    "
""
    corpus = []
    
    
try
:
        
# Try to load PubMed dataset from Hugging Face
        
from
 datasets import load_dataset
        
print
(
"Loading PubMed abstracts from Hugging Face..."
)
        
        
# Load pubmed_qa dataset which contains medical Q&A pairs
        dataset = 
load_dataset
(
"pubmed_qa"
, 
"pqa_labeled"
, split=
"train"
, streaming=True)
        
        count = 
0
        
for
 example in dataset:
            
# Extract context (which contains medical abstracts)
            
if
 
'context'
 in example 
and
 
'contexts'
 in example[
'context'
]:
                
for
 context in example[
'context'
][
'contexts'
]:
                    corpus.
append
(context)
                    count += 
1
                    
if
 count >= max_samples:
                        
break
            
if
 count >= max_samples:
                
break
                
        
print
(f
"Loaded {len(corpus)} medical abstracts from PubMed QA"
)
        
    except 
Exception
 
as
 e:
        
print
(f
"Could not load PubMed dataset: {e}"
)
        
print
(
"Falling back to comprehensive synthetic medical corpus..."
)
        
        
# Fallback: Create a comprehensive synthetic medical corpus
        
# This is still much better than the tiny original corpus
        medical_texts = [
            
# Cardiology
            
"The patient presented with acute myocardial infarction characterized by ST-segment elevation on electrocardiogram. Immediate percutaneous coronary intervention was performed."
,
            
"Diagnosis of acute coronary syndrome requires evaluation of troponin levels, electrocardiogram changes, and clinical presentation. Thrombolytic therapy may be indicated."
,
            
"Coronary angioplasty with stent placement is the preferred treatment for ST-elevation myocardial infarction when performed within the appropriate time window."
,
            
"Atherosclerotic cardiovascular disease remains the leading cause of mortality worldwide. Risk factors include hypertension, hyperlipidemia, and diabetes mellitus."
,
            
"Cardiac catheterization revealed significant stenosis in the left anterior descending artery requiring percutaneous coronary intervention."
,
            
            
# Neurology  
            
"The patient exhibited symptoms consistent with acute ischemic stroke including hemiparesis, aphasia, and facial droop. Immediate neuroimaging was performed."
,
            
"Magnetic resonance imaging revealed an infarct in the middle cerebral artery territory. Thrombolytic therapy was administered within the therapeutic window."
,
            
"Differential diagnosis for altered mental status includes metabolic encephalopathy, infectious processes, and structural brain lesions."
,
            
"Electroencephalogram monitoring showed epileptiform discharges consistent with temporal lobe epilepsy. Antiepileptic therapy was initiated."
,
            
            
# Oncology
            
"Histopathological examination revealed invasive ductal carcinoma with positive estrogen and progesterone receptors. Adjuvant chemotherapy was recommended."
,
            
"Immunohistochemistry staining showed overexpression of HER2/neu protein. Targeted therapy with trastuzumab was initiated."
,
            
"Positron emission tomography scan demonstrated hypermetabolic lesions consistent with metastatic disease. Palliative radiotherapy was considered."
,
            
            
# Infectious Disease
            
"The patient presented with fever, productive cough, and consolidation on chest radiograph consistent with community-acquired pneumonia."
,
            
"Blood cultures grew methicillin-resistant Staphylococcus aureus. Intravenous vancomycin therapy was initiated with therapeutic drug monitoring."
,
            
"Polymerase chain reaction testing confirmed the presence of Mycobacterium tuberculosis. Four-drug antituberculous therapy was started."
,
            
            
# Endocrinology
            
"Laboratory findings revealed elevated hemoglobin A1c and fasting glucose levels consistent with diabetes mellitus type 2. Metformin therapy was initiated."
,
            
"Thyroid function tests showed suppressed thyroid-stimulating hormone with elevated free thyroxine consistent with hyperthyroidism."
,
            
"Adrenal insufficiency was confirmed by cosyntropin stimulation test. Hydrocortisone replacement therapy was prescribed."
,
            
            
# Pulmonology
            
"Pulmonary function tests revealed obstructive pattern with reduced forced expiratory volume consistent with chronic obstructive pulmonary disease."
,
            
"High-resolution computed tomography showed ground-glass opacities and interstitial changes consistent with idiopathic pulmonary fibrosis."
,
            
"Bronchoscopy with bronchoalveolar lavage was performed to evaluate for infectious etiology of pneumonia."
,
            
            
# Gastroenterology
            
"Esophagogastroduodenoscopy revealed erosive esophagitis and hiatal hernia. Proton pump inhibitor therapy was prescribed."
,
            
"Colonoscopy showed multiple adenomatous polyps which were removed endoscopically. Surveillance colonoscopy was recommended."
,
            
"Liver biopsy demonstrated bridging fibrosis consistent with chronic hepatitis C infection. Antiviral therapy was initiated."
,
            
            
# Rheumatology
            
"The patient met classification criteria for rheumatoid arthritis with symmetric polyarthritis and positive rheumatoid factor."
,
            
"Synovial fluid analysis showed inflammatory arthritis with elevated white blood cell count and negative crystals."
,
            
"Disease-modifying antirheumatic drug therapy with methotrexate was initiated for treatment of rheumatoid arthritis."
,
            
            
# Nephrology
            
"Renal biopsy showed focal segmental glomerulosclerosis. Immunosuppressive therapy with corticosteroids was initiated."
,
            
"The patient developed acute kidney injury secondary to contrast-induced nephropathy. Supportive care with hydration was provided."
,
            
"Chronic kidney disease stage 4 was diagnosed based on estimated glomerular filtration rate. Preparation for renal replacement therapy was discussed."
,
            
            
# Hematology
            
"Bone marrow biopsy revealed acute myeloid leukemia with complex cytogenetics. Induction chemotherapy was recommended."
,
            
"Flow cytometry confirmed the diagnosis of chronic lymphocytic leukemia. Watch and wait approach was adopted."
,
            
"The patient presented with thrombocytopenia and microangiopathic hemolytic anemia consistent with thrombotic thrombocytopenic purpura."
        ]
        
        
# Repeat each text multiple times with variations
        
for
 text in medical_texts:
            
# Add original
            corpus.
append
(text)
            
            
# Add variations
            corpus.
append
(text.
lower
())
            corpus.
append
(text.
upper
())
            
            
# Add with common medical prefixes/suffixes
            corpus.
append
(f
"Clinical presentation: {text}"
)
            corpus.
append
(f
"Diagnosis: {text}"
)
            corpus.
append
(f
"Treatment plan: {text}"
)
            corpus.
append
(f
"{text} Follow-up recommended."
)
            corpus.
append
(f
"Patient history: {text}"
)
            
        
# Add individual medical terms repeated many times
        important_terms = [
            
"myocardial infarction"
, 
"acute coronary syndrome"
, 
"percutaneous coronary intervention"
,
            
"electrocardiogram"
, 
"thrombolytic therapy"
, 
"cardiac catheterization"
, 
"angioplasty"
,
            
"atherosclerosis"
, 
"hypertension"
, 
"hyperlipidemia"
, 
"diabetes mellitus"
,
            
"cerebrovascular accident"
, 
"ischemic stroke"
, 
"hemorrhagic stroke"
, 
"thrombectomy"
,
            
"magnetic resonance imaging"
, 
"computed tomography"
, 
"positron emission tomography"
,
            
"chemotherapy"
, 
"radiotherapy"
, 
"immunotherapy"
, 
"targeted therapy"
,
            
"metastasis"
, 
"carcinoma"
, 
"lymphoma"
, 
"leukemia"
, 
"oncogene"
,
            
"pneumonia"
, 
"tuberculosis"
, 
"sepsis"
, 
"antibiotic"
, 
"vancomycin"
,
            
"diabetes"
, 
"insulin"
, 
"metformin"
, 
"hemoglobin A1c"
, 
"glucose"
,
            
"hypothyroidism"
, 
"hyperthyroidism"
, 
"thyroid stimulating hormone"
,
            
"chronic obstructive pulmonary disease"
, 
"asthma"
, 
"pulmonary fibrosis"
,
            
"gastroesophageal reflux"
, 
"inflammatory bowel disease"
, 
"cirrhosis"
,
            
"rheumatoid arthritis"
, 
"systemic lupus erythematosus"
, 
"osteoarthritis"
,
            
"chronic kidney disease"
, 
"dialysis"
, 
"glomerulonephritis"
, 
"nephropathy"
,
            
"anemia"
, 
"thrombocytopenia"
, 
"coagulopathy"
, 
"hemophilia"
        ]
        
        
# Add each term many times in different contexts
        
for
 term in important_terms:
            
for
 i in 
range
(
20
):  
# Repeat each term 20 times
                corpus.
append
(term)
                corpus.
append
(f
"The patient has {term}."
)
                corpus.
append
(f
"Diagnosis of {term} was confirmed."
)
                corpus.
append
(f
"Treatment for {term} includes multiple modalities."
)
                corpus.
append
(f
"{term} is a common medical condition."
)
        
    
return
 corpus
# Load medical corpus
print
(
"Loading medical corpus for tokenizer training..."
)
medical_corpus = 
load_medical_corpus
(max_samples=
5000
)
print
(f
"\nCorpus statistics:"
)
print
(f
"- Total documents: {len(medical_corpus)}"
)
print
(f
"- Average length: {np.mean([len(doc.split()) for doc in medical_corpus]):.1f} words"
)
print
(f
"- Total words: {sum(len(doc.split()) for doc in medical_corpus):,}"
)
# Show sample entries
print
(
"\nSample corpus entries:"
)
for
 i in 
range
(
min
(
3
, 
len
(medical_corpus))):
    
print
(f
"{i+1}. {medical_corpus[i][:150]}..."
)
# Train improved tokenizer with BPE on medical corpus
def 
train_medical_tokenizer
(
corpus
: List[str], 
vocab_size
: 
int
 = 
10000
) -> Tokenizer:
    
""
"
    Train a BPE tokenizer optimized for medical text
    "
""
    
# Use BPE model which is better for subword tokenization
    tokenizer = 
Tokenizer
(models.
BPE
(unk_token=
"<unk>"
))
    
    
# Use ByteLevel pre-tokenizer (like GPT-2)
    tokenizer.pre_tokenizer = pre_tokenizers.
ByteLevel
(add_prefix_space=False)
    
    
# Special tokens
    special_tokens = [
"<pad>"
, 
"<unk>"
, 
"<s>"
, 
"</s>"
, 
"<mask>"
]
    
    
# Train with BPE
    trainer = trainers.
BpeTrainer
(
        vocab_size=vocab_size,
        special_tokens=special_tokens,
        min_frequency=
2
,  # Only create tokens appearing at least twice
        show_progress=True
    )
    
    
# Train on the medical corpus
    
print
(f
"\nTraining BPE tokenizer with vocab_size={vocab_size}..."
)
    tokenizer.
train_from_iterator
(corpus, trainer=trainer)
    
    
# Add post-processing
    tokenizer.post_processor = processors.
ByteLevel
(trim_offsets=False)
    
    
return
 tokenizer
# Train the tokenizer
custom_tokenizer = 
train_medical_tokenizer
(medical_corpus, vocab_size=
10000
)
# Save the tokenizer
tokenizer_path = DATA_DIR / 
"medical_tokenizer.json"
DATA_DIR.
mkdir
(parents=True, exist_ok=True)  
# Ensure directory exists
custom_tokenizer.
save
(
str
(tokenizer_path))
# Load into Hugging Face
from
 transformers import PreTrainedTokenizerFast
custom_tokenizer_hf = 
PreTrainedTokenizerFast
(tokenizer_file=
str
(tokenizer_path))
custom_tokenizer_hf.pad_token = 
"<pad>"
print
(f
"\nβœ… Medical tokenizer saved to {tokenizer_path}"
)
print
(f
"Vocabulary size: {custom_tokenizer_hf.vocab_size}"
)
# Quick test on medical terms
test_terms = [
"myocardial infarction"
, 
"electrocardiogram"
, 
"percutaneous coronary intervention"
]
print
(
"\nQuick tokenization test:"
)
for
 term in test_terms:
    tokens = custom_tokenizer.
encode
(term).tokens
    
print
(f
"'{term}' -> {len(tokens)} tokens: {tokens}"
)
# Demonstrate the importance of domain-specific tokenization
print
(
"\nπŸ’‘ Why Domain-Specific Tokenization Matters:"
)
print
(
"- Medical terms stay intact (e.g., 'electrocardiogram' as 1 token)"
)
print
(
"- Better context understanding (fewer tokens = more room for context)"
)
print
(
"- Improved efficiency (reduced computational costs)"
)
print
(
"- More accurate representations of domain concepts"
)
Loading medical corpus 
for
 tokenizer training...
Loading PubMed abstracts 
from
 Hugging Face...
Could 
not
 load PubMed dataset: Invalid pattern: 
'**'
 can only be an entire path component
Falling back to comprehensive synthetic medical corpus...
Corpus statistics:
- Total documents: 
5964
- Average length: 
5.5
 words
- Total words: 
32
,
705
Sample corpus entries:
1.
 The patient presented 
with
 acute myocardial infarction characterized 
by
 ST-segment elevation 
on
 electrocardiogram. Immediate percutaneous coronary 
int
...
2.
 the patient presented 
with
 acute myocardial infarction characterized 
by
 st-segment elevation 
on
 electrocardiogram. immediate percutaneous coronary 
int
...
3.
 THE PATIENT PRESENTED WITH ACUTE MYOCARDIAL INFARCTION CHARACTERIZED BY ST-SEGMENT ELEVATION ON ELECTROCARDIOGRAM. IMMEDIATE PERCUTANEOUS CORONARY INT...
Training BPE tokenizer 
with
 vocab_size=
10000.
..
βœ… Medical tokenizer saved to /Users/richardhightower/src/art_hug_11/data/medical_tokenizer.json
Vocabulary size: 
1914
Quick tokenization test:
'myocardial infarction'
 -> 
2
 tokens: [
'myocardial'
, 
'Δ infarction'
]
'electrocardiogram'
 -> 
1
 tokens: [
'electrocardiogram'
]
'percutaneous coronary intervention'
 -> 
3
 tokens: [
'percutaneous'
, 
'Δ coronary'
, 
'Δ intervention'
]
πŸ’‘ Why Domain-Specific Tokenization Matters:
- 
Medical terms stay 
intact
 (
e.g., 
'electrocardiogram'
 
as
 
1
 token
)
- Better context 
understanding
 (
fewer tokens = more room 
for
 context
)
- Improved 
efficiency
 (
reduced computational costs
)
- More accurate representations of domain concepts
  1. Import Necessary Libraries: The code starts by importing required modules, including matplotlib.pyplot and seaborn (though not used in the visible execution), typing.List for type hints, and components from the tokenizers library for building and training the tokenizer.
  2. Define the Corpus Loading Function: The function load_medical_corpus(max_samples: int = 10000) -> List[str] is defined. This function attempts to load real medical text data from the PubMed QA dataset hosted on Hugging Face. If successful, it extracts individual context segments (sections of medical abstracts) from the dataset. If the load fails (e.g., due to network issues or library absence), it falls back to generating a synthetic corpus of medical texts.
  3. Initiate Corpus Loading: The code prints β€œLoading medical corpus for tokenizer training…” and calls load_medical_corpus(max_samples=5000) to build the medical_corpus list.
  • It prints β€œLoading PubMed abstracts from Hugging Face…”.

  • Using the datasets library, it loads the "pubmed_qa" dataset in the "pqa_labeled" configuration, train split, in streaming mode.

  • It iterates over the dataset examples, extracting and appending each string from the contexts list within the context field of each example (typically 2-7 contexts per example, representing abstract sections like background, methods, or results).

  • It stops once 5000 contexts are collected or the dataset is exhausted. Since the β€œpqa_labeled” train split has 1,000 examples with an average of about 3–4 contexts each, it likely loads around 3,000–4,000 text segments (fewer than the max_samples limit).

  • It prints β€œLoaded {len(corpus)} medical abstracts from PubMed QA” (e.g., β€œLoaded 3500 medical abstracts from PubMed QA” based on approximate dataset structure).

  • Fallback Scenario (if PubMed load fails): It prints an error message like β€œCould not load PubMed dataset: {exception}” and β€œFalling back to comprehensive synthetic medical corpus…”. It then constructs a synthetic corpus using predefined medical texts across specialties (33 example sentences), adding variations (e.g., lowercase, uppercase, prefixed versions) for each, resulting in about 264 entries. It further adds repetitions of 57 important medical terms (20 repetitions each with contextual phrases), yielding a total of approximately 5,964 documents.

  • Total documents: The length of medical_corpus (e.g., ~3,500 if PubMed succeeds, or 5,964 in fallback).

  • Average length: The mean number of words per document, calculated using np.mean([len(doc.split()) for doc in medical_corpus]) formatted to one decimal place (e.g., ~20-50 words if PubMed, or ~5.5 words in fallback due to many short term repetitions).

  • Total words: The sum of word counts across all documents (e.g., tens of thousands, such as 32,705 in fallback).

  • Initializes a Tokenizer with a BPE model, using "" as the unknown token.

  • Sets the pre-tokenizer to ByteLevel (similar to GPT-2, handling text at the byte level without prefix spaces).

  • Defines special tokens: [β€œβ€, β€œβ€, β€œβ€, β€œβ€, β€œβ€].

  • Creates a BpeTrainer with the specified vocab_size (10,000), special tokens, minimum frequency of 2 for tokens, and progress display.

  • Prints β€œTraining BPE tokenizer with vocab_size=10000…”.

  • Trains the tokenizer on the corpus using train_from_iterator, which processes the text to learn subword merges based on frequency, prioritizing medical-specific terms.

  • Adds a ByteLevel post-processor to handle trimming offsets.

  • Returns the trained tokenizer object.

  • Encodes the term using the tokenizer.

  • Prints the result, showing the number of tokens and the token list (e.g., β€œβ€˜myocardial infarction’ -> 1 tokens: [β€˜myocardial infarction’]” if the term is learned as a single token due to its frequency in the medical corpus; otherwise, it might split into subwords like [β€˜myo’, β€˜cardial’, β€˜ inf’, β€˜arction’] if not sufficiently represented).

  • Keeping medical terms intact (e.g., β€˜electrocardiogram’ as one token).

  • Better context understanding by using fewer tokens.

  • Improved efficiency with reduced computational costs.

  • More accurate representations of medical concepts.

# Comprehensive tokenizer comparison with advanced visualization
print
(
"="
 * 
80
)
print
(
"MEDICAL TOKENIZER COMPARISON WITH ADVANCED VISUALIZATION"
)
print
(
"="
 * 
80
)
# Import visualization libraries with proper error handling
try
:
    
import
 matplotlib.pyplot 
as
 plt
    
import
 seaborn 
as
 sns
    
from
 matplotlib.patches 
import
 Rectangle
    
import
 matplotlib.patches 
as
 mpatches
    HAS_MATPLOTLIB = 
True
    
# Set style for better visuals
    plt.style.use(
'seaborn-v0_8-darkgrid'
)
    sns.set_palette(
"husl"
)
except
 ImportError:
    HAS_MATPLOTLIB = 
False
    
print
(
"⚠️ matplotlib/seaborn not available - install with: pip install matplotlib seaborn"
)
    
print
(
"   Text-based analysis will be shown instead."
)
# Load tokenizers for comparison
from
 transformers 
import
 AutoTokenizer, PreTrainedTokenizerFast
# Load our custom medical tokenizer
tokenizer_path = DATA_DIR / 
"medical_tokenizer.json"
try
:
    
if
 tokenizer_path.exists():
        medical_tokenizer_hf = PreTrainedTokenizerFast(tokenizer_file=
str
(tokenizer_path))
        medical_tokenizer_hf.pad_token = 
"<pad>"
        has_medical = 
True
        
print
(
"βœ… Successfully loaded custom medical tokenizer"
)
    
else
:
        has_medical = 
False
        
print
(
"⚠️ Medical tokenizer not found - run the previous cell first!"
)
except
 Exception 
as
 e:
    has_medical = 
False
    
print
(
f"❌ Error loading medical tokenizer: 
{e}
"
)
# Load comparison tokenizers
gpt2_tokenizer = AutoTokenizer.from_pretrained(
"gpt2"
)
bert_tokenizer = AutoTokenizer.from_pretrained(
"bert-base-uncased"
)
# Try to load BioBERT (medical BERT)
try
:
    biobert_tokenizer = AutoTokenizer.from_pretrained(
"dmis-lab/biobert-v1.1"
)
    has_biobert = 
True
    
print
(
"βœ… Successfully loaded BioBERT tokenizer"
)
except
:
    has_biobert = 
False
    
print
(
"⚠️ BioBERT not available, using standard comparisons only"
)
# Comprehensive medical test set
medical_test_sentences = [
    
# Common medical terms
    
"myocardial infarction"
,
    
"acute coronary syndrome"
, 
    
"percutaneous coronary intervention"
,
    
"electrocardiogram abnormalities"
,
    
"thrombolytic therapy"
,
    
    
# Complex medical phrases
    
"ST-segment elevation myocardial infarction"
,
    
"non-ST-segment elevation acute coronary syndrome"
,
    
"drug-eluting stent placement during percutaneous coronary intervention"
,
    
    
# Full medical sentences
    
"The patient presented with acute myocardial infarction and underwent emergent cardiac catheterization."
,
    
"Electrocardiogram showed ST-segment elevation consistent with acute coronary syndrome."
,
    
"Percutaneous coronary intervention with drug-eluting stent placement was performed successfully."
]
# Analyze tokenization
results = []
print
(
"\nDetailed Tokenization Analysis:"
)
print
(
"-"
 * 
80
)
for
 sentence 
in
 medical_test_sentences:
    result = {
"text"
: sentence}
    
    
# Tokenize with each tokenizer
    bert_tokens = bert_tokenizer.tokenize(sentence)
    result[
"bert"
] = 
len
(bert_tokens)
    result[
"bert_tokens"
] = bert_tokens
    
    gpt2_tokens = gpt2_tokenizer.tokenize(sentence)
    result[
"gpt2"
] = 
len
(gpt2_tokens)
    result[
"gpt2_tokens"
] = gpt2_tokens
    
    
if
 has_biobert:
        biobert_tokens = biobert_tokenizer.tokenize(sentence)
        result[
"biobert"
] = 
len
(biobert_tokens)
        result[
"biobert_tokens"
] = biobert_tokens
    
    
if
 has_medical:
        medical_tokens = medical_tokenizer_hf.tokenize(sentence)
        result[
"medical"
] = 
len
(medical_tokens)
        result[
"medical_tokens"
] = medical_tokens
    
    results.append(result)
    
    
# Print detailed results
    
print
(
f"\nText: '
{sentence}
'"
)
    
print
(
f"  BERT:        
{
len
(bert_tokens):3d}
 tokens - 
{bert_tokens[:
10
]}
{
'...'
 
if
 
len
(bert_tokens) > 
10
 
else
 
''
}
"
)
    
if
 has_biobert:
        
print
(
f"  BioBERT:     
{
len
(biobert_tokens):3d}
 tokens - 
{biobert_tokens[:
10
]}
{
'...'
 
if
 
len
(biobert_tokens) > 
10
 
else
 
''
}
"
)
    
print
(
f"  GPT-2:       
{
len
(gpt2_tokens):3d}
 tokens - 
{gpt2_tokens[:
10
]}
{
'...'
 
if
 
len
(gpt2_tokens) > 
10
 
else
 
''
}
"
)
    
if
 has_medical:
        
print
(
f"  Medical BPE: 
{
len
(medical_tokens):3d}
 tokens - 
{medical_tokens[:
10
]}
{
'...'
 
if
 
len
(medical_tokens) > 
10
 
else
 
''
}
"
)
# Convert to DataFrame for analysis
df = pd.DataFrame(results)
# Calculate summary statistics
print
(
"\n"
 + 
"="
 * 
80
)
print
(
"EFFICIENCY SUMMARY"
)
print
(
"="
 * 
80
)
tokenizers = [
"bert"
, 
"gpt2"
]
if
 has_biobert:
    tokenizers.append(
"biobert"
)
if
 has_medical:
    tokenizers.append(
"medical"
)
print
(
f"\nTotal tokens across 
{
len
(medical_test_sentences)}
 test examples:"
)
for
 tokenizer 
in
 tokenizers:
    
if
 tokenizer 
in
 df.columns:
        total = df[tokenizer].
sum
()
        
print
(
f"  
{tokenizer.upper():<
12
}
 
{total:4d}
 tokens"
)
# Create advanced visualizations if matplotlib is available
if
 HAS_MATPLOTLIB:
    
# Create a comprehensive figure with multiple subplots
    fig = plt.figure(figsize=(
16
, 
12
))
    
    
# --- Subplot 1: Total Token Count Comparison ---
    ax1 = plt.subplot(
3
, 
3
, 
1
)
    totals = []
    labels = []
    colors = [
'#1f77b4'
, 
'#ff7f0e'
, 
'#2ca02c'
, 
'#d62728'
]
    
    
for
 i, tokenizer 
in
 
enumerate
(tokenizers):
        
if
 tokenizer 
in
 df.columns:
            totals.append(df[tokenizer].
sum
())
            labels.append(tokenizer.upper())
    
    bars = ax1.bar(labels, totals, color=colors[:
len
(labels)])
    ax1.set_title(
'Total Tokens Across All Examples'
, fontsize=
14
, fontweight=
'bold'
)
    ax1.set_ylabel(
'Total Token Count'
)
    ax1.set_xlabel(
'Tokenizer'
)
    
    
# Add value labels on bars
    
for
 bar, total 
in
 
zip
(bars, totals):
        ax1.text(bar.get_x() + bar.get_width()/
2
, bar.get_height() + 
5
, 
                 
str
(total), ha=
'center'
, va=
'bottom'
, fontweight=
'bold'
)
    
    
# --- Subplot 2: Token Count by Example Type ---
    ax2 = plt.subplot(
3
, 
3
, 
2
)
    
# Group examples by type
    simple_terms = df.iloc[:
5
][tokenizers].mean()
    complex_phrases = df.iloc[
5
:
8
][tokenizers].mean()
    full_sentences = df.iloc[
8
:][tokenizers].mean()
    
    x = np.arange(
3
)
    width = 
0.2
    
    
for
 i, tokenizer 
in
 
enumerate
(tokenizers):
        
if
 tokenizer 
in
 df.columns:
            values = [simple_terms[tokenizer], complex_phrases[tokenizer], full_sentences[tokenizer]]
            ax2.bar(x + i*width, values, width, label=tokenizer.upper(), color=colors[i])
    
    ax2.set_xlabel(
'Example Type'
)
    ax2.set_ylabel(
'Average Token Count'
)
    ax2.set_title(
'Average Tokens by Example Complexity'
, fontsize=
14
, fontweight=
'bold'
)
    ax2.set_xticks(x + width * (
len
(tokenizers)-
1
) / 
2
)
    ax2.set_xticklabels([
'Simple Terms'
, 
'Complex Phrases'
, 
'Full Sentences'
])
    ax2.legend()
    
    
# --- Subplot 3: Efficiency Gains Heatmap ---
    ax3 = plt.subplot(
3
, 
3
, 
3
)
    
if
 has_medical 
and
 
'medical'
 
in
 df.columns:
        
# Calculate percentage reduction for medical tokenizer
        efficiency_matrix = []
        
for
 tokenizer 
in
 [
'bert'
, 
'gpt2'
, 
'biobert'
]:
            
if
 tokenizer 
in
 df.columns:
                reduction = ((df[tokenizer] - df[
'medical'
]) / df[tokenizer] * 
100
).values
                efficiency_matrix.append(reduction)
        
        efficiency_array = np.array(efficiency_matrix)
        im = ax3.imshow(efficiency_array, cmap=
'RdYlGn'
, aspect=
'auto'
, vmin=-
50
, vmax=
50
)
        
        
# Set ticks
        ax3.set_yticks(
range
(
len
([t 
for
 t 
in
 [
'bert'
, 
'gpt2'
, 
'biobert'
] 
if
 t 
in
 df.columns])))
        ax3.set_yticklabels([t.upper() 
for
 t 
in
 [
'bert'
, 
'gpt2'
, 
'biobert'
] 
if
 t 
in
 df.columns])
        ax3.set_xticks(
range
(
len
(df)))
        ax3.set_xticklabels([
f"Ex
{i+
1
}
"
 
for
 i 
in
 
range
(
len
(df))], rotation=
45
)
        
        
# Add colorbar
        cbar = plt.colorbar(im, ax=ax3)
        cbar.set_label(
'Reduction %'
, rotation=
270
, labelpad=
15
)
        
        ax3.set_title(
'Medical Tokenizer Efficiency Gains (%)'
, fontsize=
14
, fontweight=
'bold'
)
        
        
# Add text annotations
        
for
 i 
in
 
range
(efficiency_array.shape[
0
]):
            
for
 j 
in
 
range
(efficiency_array.shape[
1
]):
                text = ax3.text(j, i, 
f'
{efficiency_array[i, j]:
.0
f}
'
,
                               ha=
"center"
, va=
"center"
, color=
"black"
, fontsize=
8
)
    
    
# --- Subplot 4: Token Length Distribution ---
    ax4 = plt.subplot(
3
, 
3
, 
4
)
    
for
 tokenizer 
in
 tokenizers:
        
if
 tokenizer 
in
 df.columns 
and
 
f"
{tokenizer}
_tokens"
 
in
 df.columns:
            all_tokens = []
            
for
 tokens 
in
 df[
f"
{tokenizer}
_tokens"
]:
                all_tokens.extend([
len
(t.replace(
'Δ '
, 
''
).replace(
'##'
, 
''
)) 
for
 t 
in
 tokens])
            
            
# Create histogram
            ax4.hist(all_tokens, bins=
range
(
1
, 
max
(all_tokens)+
2
), alpha=
0.5
, 
                    label=tokenizer.upper(), density=
True
)
    
    ax4.set_xlabel(
'Token Length (characters)'
)
    ax4.set_ylabel(
'Density'
)
    ax4.set_title(
'Token Length Distribution'
, fontsize=
14
, fontweight=
'bold'
)
    ax4.legend()
    
    
# --- Subplot 5: Memory Impact Visualization ---
    ax5 = plt.subplot(
3
, 
3
, 
5
)
    
if
 has_medical:
        
# Calculate memory usage for 1M documents
        docs = 
1_000_000
        avg_doc_length = 
500
  
# tokens
        bytes_per_token = 
2
  
# assuming int16 token IDs
        
        memory_usage = []
        
for
 tokenizer 
in
 tokenizers:
            
if
 tokenizer 
in
 df.columns:
                avg_tokens = df[tokenizer].mean()
                scale_factor = avg_tokens / df[
'bert'
].mean() 
if
 
'bert'
 
in
 df.columns 
else
 
1
                memory_mb = (docs * avg_doc_length * scale_factor * bytes_per_token) / (
1024
 * 
1024
)
                memory_usage.append(memory_mb)
        
        bars = ax5.bar(labels, memory_usage, color=colors[:
len
(labels)])
        ax5.set_ylabel(
'Memory (MB)'
)
        ax5.set_title(
'Memory Usage for 1M Documents'
, fontsize=
14
, fontweight=
'bold'
)
        
        
# Add value labels
        
for
 bar, mem 
in
 
zip
(bars, memory_usage):
            ax5.text(bar.get_x() + bar.get_width()/
2
, bar.get_height() + 
10
,
                    
f'
{mem:
.0
f}
MB'
, ha=
'center'
, va=
'bottom'
)
    
    
# --- Subplot 6: Tokenization Speed Simulation ---
    ax6 = plt.subplot(
3
, 
3
, 
6
)
    
# Simulate tokenization speed based on token count (inverse relationship)
    base_speed = 
1000
  
# docs/second for BERT
    speeds = []
    
    
for
 tokenizer 
in
 tokenizers:
        
if
 tokenizer 
in
 df.columns:
            
# Fewer tokens = faster processing
            relative_tokens = df[tokenizer].
sum
() / df[
'bert'
].
sum
() 
if
 
'bert'
 
in
 df.columns 
else
 
1
            speed = base_speed / relative_tokens
            speeds.append(speed)
    
    bars = ax6.bar(labels, speeds, color=colors[:
len
(labels)])
    ax6.set_ylabel(
'Documents/Second'
)
    ax6.set_title(
'Estimated Processing Speed'
, fontsize=
14
, fontweight=
'bold'
)
    
    
for
 bar, speed 
in
 
zip
(bars, speeds):
        ax6.text(bar.get_x() + bar.get_width()/
2
, bar.get_height() + 
10
,
                
f'
{speed:
.0
f}
'
, ha=
'center'
, va=
'bottom'
)
    
    
# --- Subplot 7: Line Plot of Token Counts ---
    ax7 = plt.subplot(
3
, 
3
, 
7
)
    
for
 i, tokenizer 
in
 
enumerate
(tokenizers):
        
if
 tokenizer 
in
 df.columns:
            ax7.plot(
range
(
len
(df)), df[tokenizer], marker=
'o'
, 
                    label=tokenizer.upper(), linewidth=
2
, markersize=
8
)
    
    ax7.set_xlabel(
'Example Index'
)
    ax7.set_ylabel(
'Token Count'
)
    ax7.set_title(
'Token Count Progression'
, fontsize=
14
, fontweight=
'bold'
)
    ax7.legend()
    ax7.grid(
True
, alpha=
0.3
)
    
    
# --- Subplot 8: Cost Comparison (for API-based models) ---
    ax8 = plt.subplot(
3
, 
3
, 
8
)
    
# Assume $0.002 per 1K tokens (typical GPT pricing)
    cost_per_1k = 
0.002
    costs = []
    
    
for
 tokenizer 
in
 tokenizers:
        
if
 tokenizer 
in
 df.columns:
            total_tokens = df[tokenizer].
sum
()
            
# Scale up to 1M documents
            scaled_tokens = total_tokens * (
1_000_000
 / 
len
(df))
            cost = (scaled_tokens / 
1000
) * cost_per_1k
            costs.append(cost)
    
    bars = ax8.bar(labels, costs, color=colors[:
len
(labels)])
    ax8.set_ylabel(
'Cost (USD)'
)
    ax8.set_title(
'Estimated API Cost for 1M Documents'
, fontsize=
14
, fontweight=
'bold'
)
    
    
for
 bar, cost 
in
 
zip
(bars, costs):
        ax8.text(bar.get_x() + bar.get_width()/
2
, bar.get_height() + 
50
,
                
f'$
{cost:,
.0
f}
'
, ha=
'center'
, va=
'bottom'
)
    
    
# --- Subplot 9: Key Metrics Summary ---
    ax9 = plt.subplot(
3
, 
3
, 
9
)
    ax9.axis(
'off'
)
    
    
# Create summary text
    summary_text = 
"🎯 KEY INSIGHTS\n\n"
    
    
if
 has_medical 
and
 
'medical'
 
in
 df.columns:
        bert_total = df[
'bert'
].
sum
() 
if
 
'bert'
 
in
 df.columns 
else
 
0
        medical_total = df[
'medical'
].
sum
()
        reduction = (
1
 - medical_total/bert_total) * 
100
 
if
 bert_total > 
0
 
else
 
0
        
        summary_text += 
f"βœ… Token Reduction: 
{reduction:
.1
f}
%\n"
        summary_text += 
f"βœ… Memory Savings: ~
{reduction:
.0
f}
%\n"
        summary_text += 
f"βœ… Speed Increase: ~
{
100
/(
100
-reduction):
.1
f}
x\n"
        summary_text += 
f"βœ… Cost Reduction: 
{reduction:
.0
f}
%\n\n"
        
        
# Find best performing examples
        best_examples = []
        
for
 i, row 
in
 df.iterrows():
            
if
 
'bert'
 
in
 row 
and
 
'medical'
 
in
 row:
                example_reduction = (
1
 - row[
'medical'
]/row[
'bert'
]) * 
100
                
if
 example_reduction > 
40
:
                    best_examples.append((i+
1
, example_reduction))
        
        
if
 best_examples:
            summary_text += 
"πŸ’‘ Best Performance:\n"
            
for
 idx, red 
in
 
sorted
(best_examples, key=
lambda
 x: x[
1
], reverse=
True
)[:
3
]:
                summary_text += 
f"  Example 
{idx}
: 
{red:
.0
f}
% reduction\n"
    
    ax9.text(
0.1
, 
0.9
, summary_text, transform=ax9.transAxes, 
            fontsize=
12
, verticalalignment=
'top'
,
            bbox=
dict
(boxstyle=
'round'
, facecolor=
'wheat'
, alpha=
0.5
))
    
    plt.tight_layout()
    plt.savefig(
'tokenizer_comparison_comprehensive.png'
, dpi=
300
, bbox_inches=
'tight'
)
    plt.show()
# Print final insights
print
(
"\n"
 + 
"="
 * 
80
)
print
(
"KEY INSIGHTS FROM COMPREHENSIVE ANALYSIS"
)
print
(
"="
 * 
80
)
print
(
"1. Domain-specific tokenizers significantly reduce token counts"
)
print
(
"2. Fewer tokens lead to:"
)
print
(
"   β€’ Faster processing and training"
)
print
(
"   β€’ Lower memory usage and costs"
)
print
(
"   β€’ Better context window utilization"
)
print
(
"   β€’ More semantic coherence"
)
print
(
"3. Medical terminology benefits most from specialized tokenization"
)
print
(
"4. Even small efficiency gains compound at scale"
)
if
 has_medical:
    
# Show specific efficiency examples
    
print
(
f"\nπŸ“Š Concrete Example: 'percutaneous coronary intervention'"
)
    
for
 idx, row 
in
 df.iterrows():
        
if
 
"percutaneous coronary intervention"
 
in
 row[
'text'
]:
            
print
(
f"\nTokenization comparison:"
)
            
for
 tokenizer 
in
 tokenizers:
                
if
 tokenizer 
in
 df.columns:
                    count = row[tokenizer]
                    
print
(
f"  
{tokenizer.upper():
10
}
 
{count:3d}
 tokens"
)
            
break
print
(
"\nπŸ’‘ For production: Train tokenizers on 100K+ domain documents!"
)
Output:
================================================================================
MEDICAL TOKENIZER COMPARISON WITH ADVANCED VISUALIZATION
================================================================================
βœ… Successfully loaded custom medical tokenizer
/Users/richardhightower/
src/art_hug_11/.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py
:
943
: 
FutureWarning
: 
`resume_download`
 is deprecated 
and
 will be removed in version 
1.0
.
0
. Downloads always resume when possible. If you want to force a new download, use 
`force_download=True`
.
  warnings.
warn
(
βœ… Successfully loaded BioBERT tokenizer
Detailed Tokenization 
Analysis
:
--------------------------------------------------------------------------------
Text
: 
'myocardial infarction'
  
BERT
:          
6
 tokens - [
'my'
, 
'##oca'
, 
'##rdial'
, 
'in'
, 
'##far'
, 
'##ction'
]
  
BioBERT
:       
7
 tokens - [
'my'
, 
'##oc'
, 
'##ard'
, 
'##ial'
, 
'in'
, 
'##far'
, 
'##ction'
]
  
GPT-2
:         
6
 tokens - [
'my'
, 
'ocard'
, 
'ial'
, 
'Δ inf'
, 
'ar'
, 
'ction'
]
  Medical 
BPE
:   
2
 tokens - [
'myocardial'
, 
'Δ infarction'
]
Text
: 
'acute coronary syndrome'
  
BERT
:          
4
 tokens - [
'acute'
, 
'corona'
, 
'##ry'
, 
'syndrome'
]
  
BioBERT
:       
5
 tokens - [
'acute'
, 
'co'
, 
'##rona'
, 
'##ry'
, 
'syndrome'
]
  
GPT-2
:         
4
 tokens - [
'ac'
, 
'ute'
, 
'Δ coronary'
, 
'Δ syndrome'
]
  Medical 
BPE
:   
3
 tokens - [
'acute'
, 
'Δ coronary'
, 
'Δ syndrome'
]
Text
: 
'percutaneous coronary intervention'
  
BERT
:          
6
 tokens - [
'per'
, 
'##cut'
, 
'##aneous'
, 
'corona'
, 
'##ry'
, 
'intervention'
]
  
BioBERT
:       
7
 tokens - [
'per'
, 
'##cut'
, 
'##aneous'
, 
'co'
, 
'##rona'
, 
'##ry'
, 
'intervention'
]
  
GPT-2
:         
5
 tokens - [
'per'
, 
'cut'
, 
'aneous'
, 
'Δ coronary'
, 
'Δ intervention'
]
  Medical 
BPE
:   
3
 tokens - [
'percutaneous'
, 
'Δ coronary'
, 
'Δ intervention'
]
Text
: 
'electrocardiogram abnormalities'
  
BERT
:          
5
 tokens - [
'electro'
, 
'##card'
, 
'##io'
, 
'##gram'
, 
'abnormalities'
]
  
BioBERT
:       
6
 tokens - [
'electro'
, 
'##card'
, 
'##io'
, 
'##gram'
, 
'abnormal'
, 
'##ities'
]
  
GPT-2
:         
6
 tokens - [
'elect'
, 
'ro'
, 
'card'
, 
'i'
, 
'ogram'
, 
'Δ abnormalities'
]
  Medical 
BPE
:   
7
 tokens - [
'electrocardiogram'
, 
'Δ a'
, 
'b'
, 
'n'
, 
'or'
, 
'm'
, 
'alities'
]
Text
: 
'thrombolytic therapy'
  
BERT
:          
5
 tokens - [
'th'
, 
'##rom'
, 
'##bol'
, 
'##ytic'
, 
'therapy'
]
  
BioBERT
:       
5
 tokens - [
'th'
, 
'##rom'
, 
'##bol'
, 
'##ytic'
, 
'therapy'
]
  
GPT-2
:         
6
 tokens - [
'th'
, 
'rom'
, 
'bo'
, 
'ly'
, 
'tic'
, 
'Δ therapy'
]
  Medical 
BPE
:   
2
 tokens - [
'thrombolytic'
, 
'Δ therapy'
]
Text
: 
'ST-segment elevation myocardial infarction'
  
BERT
:         
10
 tokens - [
'st'
, 
'-'
, 
'segment'
, 
'elevation'
, 
'my'
, 
'##oca'
, 
'##rdial'
, 
'in'
, 
'##far'
, 
'##ction'
]
  
BioBERT
:      
11
 tokens - [
'ST'
, 
'-'
, 
'segment'
, 
'elevation'
, 
'my'
, 
'##oc'
, 
'##ard'
, 
'##ial'
, 
'in'
, 
'##far'
]...
  
GPT-2
:        
11
 tokens - [
'ST'
, 
'-'
, 
'se'
, 
'gment'
, 
'Δ elevation'
, 
'Δ my'
, 
'ocard'
, 
'ial'
, 
'Δ inf'
, 
'ar'
]...
  Medical 
BPE
:   
7
 tokens - [
'S'
, 
'T'
, 
'-'
, 
'segment'
, 
'Δ elevation'
, 
'Δ myocardial'
, 
'Δ infarction'
]
Text
: 
'non-ST-segment elevation acute coronary syndrome'
  
BERT
:         
10
 tokens - [
'non'
, 
'-'
, 
'st'
, 
'-'
, 
'segment'
, 
'elevation'
, 
'acute'
, 
'corona'
, 
'##ry'
, 
'syndrome'
]
  
BioBERT
:      
11
 tokens - [
'non'
, 
'-'
, 
'ST'
, 
'-'
, 
'segment'
, 
'elevation'
, 
'acute'
, 
'co'
, 
'##rona'
, 
'##ry'
]...
  
GPT-2
:        
10
 tokens - [
'non'
, 
'-'
, 
'ST'
, 
'-'
, 
'se'
, 
'gment'
, 
'Δ elevation'
, 
'Δ acute'
, 
'Δ coronary'
, 
'Δ syndrome'
]
  Medical 
BPE
:  
11
 tokens - [
'n'
, 
'on'
, 
'-'
, 
'S'
, 
'T'
, 
'-'
, 
'segment'
, 
'Δ elevation'
, 
'Δ acute'
, 
'Δ coronary'
]...
Text
: 
'drug-eluting stent placement during percutaneous coronary intervention'
  
BERT
:         
14
 tokens - [
'drug'
, 
'-'
, 
'el'
, 
'##uting'
, 
'ste'
, 
'##nt'
, 
'placement'
, 
'during'
, 
'per'
, 
'##cut'
]...
  
BioBERT
:      
17
 tokens - [
'drug'
, 
'-'
, 
'el'
, 
'##uti'
, 
'##ng'
, 
's'
, 
'##ten'
, 
'##t'
, 
'placement'
, 
'during'
]...
  
GPT-2
:        
13
 tokens - [
'drug'
, 
'-'
, 
'el'
, 
'uting'
, 
'Δ st'
, 
'ent'
, 
'Δ placement'
, 
'Δ during'
, 
'Δ per'
, 
'cut'
]...
  Medical 
BPE
:  
13
 tokens - [
'drug'
, 
'-'
, 
'e'
, 
'lu'
, 
'ting'
, 
'Δ stent'
, 
'Δ placement'
, 
'Δ d'
, 
'ur'
, 
'ing'
]...
Text
: 
'The patient presented with acute myocardial infarction and underwent emergent cardiac catheterization.'
  
BERT
:         
21
 tokens - [
'the'
, 
'patient'
, 
'presented'
, 
'with'
, 
'acute'
, 
'my'
, 
'##oca'
, 
'##rdial'
, 
'in'
, 
'##far'
]...
  
BioBERT
:      
22
 tokens - [
'The'
, 
'patient'
, 
'presented'
, 
'with'
, 
'acute'
, 
'my'
, 
'##oc'
, 
'##ard'
, 
'##ial'
, 
'in'
]...
  
GPT-2
:        
20
 tokens - [
'The'
, 
'Δ patient'
, 
'Δ presented'
, 
'Δ with'
, 
'Δ acute'
, 
'Δ my'
, 
'ocard'
, 
'ial'
, 
'Δ inf'
, 
'ar'
]...
  Medical 
BPE
:  
20
 tokens - [
'The'
, 
'Δ patient'
, 
'Δ presented'
, 
'Δ with'
, 
'Δ acute'
, 
'Δ myocardial'
, 
'Δ infarction'
, 
'Δ and'
, 
'Δ '
, 
'und'
]...
Text
: 
'Electrocardiogram showed ST-segment elevation consistent with acute coronary syndrome.'
  
BERT
:         
16
 tokens - [
'electro'
, 
'##card'
, 
'##io'
, 
'##gram'
, 
'showed'
, 
'st'
, 
'-'
, 
'segment'
, 
'elevation'
, 
'consistent'
]...
  
BioBERT
:      
19
 tokens - [
'El'
, 
'##ec'
, 
'##tro'
, 
'##card'
, 
'##io'
, 
'##gram'
, 
'showed'
, 
'ST'
, 
'-'
, 
'segment'
]...
  
GPT-2
:        
17
 tokens - [
'Elect'
, 
'ro'
, 
'card'
, 
'i'
, 
'ogram'
, 
'Δ showed'
, 
'Δ ST'
, 
'-'
, 
'se'
, 
'gment'
]...
  Medical 
BPE
:  
13
 tokens - [
'E'
, 
'lectrocardiogram'
, 
'Δ showed'
, 
'Δ ST'
, 
'-'
, 
'segment'
, 
'Δ elevation'
, 
'Δ consistent'
, 
'Δ with'
, 
'Δ acute'
]...
Text
: 
'Percutaneous coronary intervention with drug-eluting stent placement was performed successfully.'
  
BERT
:         
18
 tokens - [
'per'
, 
'##cut'
, 
'##aneous'
, 
'corona'
, 
'##ry'
, 
'intervention'
, 
'with'
, 
'drug'
, 
'-'
, 
'el'
]...
  
BioBERT
:      
21
 tokens - [
'Per'
, 
'##cut'
, 
'##aneous'
, 
'co'
, 
'##rona'
, 
'##ry'
, 
'intervention'
, 
'with'
, 
'drug'
, 
'-'
]...
  
GPT-2
:        
18
 tokens - [
'P'
, 
'erc'
, 
'ut'
, 
'aneous'
, 
'Δ coronary'
, 
'Δ intervention'
, 
'Δ with'
, 
'Δ drug'
, 
'-'
, 
'el'
]...
  Medical 
BPE
:  
23
 tokens - [
'P'
, 
'er'
, 
'cutaneous'
, 
'Δ coronary'
, 
'Δ intervention'
, 
'Δ with'
, 
'Δ drug'
, 
'-'
, 
'e'
, 
'lu'
]...
================================================================================
EFFICIENCY SUMMARY
================================================================================
Total tokens across 
11
 test 
examples
:
  BERT          
115
 tokens
  GPT2          
116
 tokens
  BIOBERT       
131
 tokens
  MEDICAL       
104
 tokens

Hugging Face: Building Custom Language Models: From Raw Data to Production AI

================================================================================
KEY INSIGHTS FROM COMPREHENSIVE ANALYSIS
================================================================================
1.
 Domain-specific tokenizers significantly reduce token counts
2.
 Fewer tokens lead to:
   β€’ Faster processing and training
   β€’ Lower memory usage and costs
   β€’ Better context window utilization
   β€’ More semantic coherence
3.
 Medical terminology benefits most from specialized tokenization
4.
 Even small efficiency gains compound at scale
πŸ“Š Concrete Example: 'percutaneous coronary intervention'
Tokenization comparison:
  BERT         6 tokens
  GPT2         5 tokens
  BIOBERT      7 tokens
  MEDICAL      3 tokens
πŸ’‘ For production: Train tokenizers on 100K+ domain documents!
# Load and use the medical tokenizer
medical_tokenizer = PreTrainedTokenizerFast(tokenizer_file=
"./medical_tokenizer.json"
)
medical_tokenizer.pad_token = 
"<pad>"
# Test on complex medical text
medical_text = 
"""
The patient presented with acute ST-segment elevation myocardial infarction.
Immediate percutaneous coronary intervention with drug-eluting stent placement
was performed. Post-procedural electrocardiogram showed resolution of
ST-segment elevation.
"""
# Tokenize
tokens = medical_tokenizer.tokenize(medical_text.strip())
token_ids = medical_tokenizer.encode(medical_text.strip())
print
(
"Medical Text Tokenization:"
)
print
(
f"Original text: 
{medical_text.strip()}
"
)
print
(
f"Token count: 
{
len
(tokens)}
"
)
print
(
f"Tokens: 
{tokens[:
10
]}
..."
)  
# Show first 10 tokens
print
(
f"Token IDs: 
{token_ids[:
10
]}
..."
)  
# Show first 10 IDs
# Compare with standard tokenizer
standard_tokens = gpt2_tokenizer.tokenize(medical_text.strip())
efficiency = (
1
 - 
len
(tokens) / 
len
(standard_tokens)) * 
100
print
(
f"\\nEfficiency comparison:"
)
print
(
f"  Standard tokenizer: 
{
len
(standard_tokens)}
 tokens"
)
print
(
f"  Medical tokenizer:  
{
len
(tokens)}
 tokens"
)
print
(
f"  Efficiency gain:    
{efficiency:
.1
f}
%"
)

Hugging Face: Building Custom Language Models: From Raw Data to Production AI

Efficiency Summary:
--------------------------------------------------
'myocardial infarction':
  
BERT:
 
6
 
tokens
 
(baseline)
  
GPT2:
 
6
 
tokens
 
(+0.0%
 
vs
 
BERT)
  
BIOBERT:
 
7
 
tokens
 
(+0.0%
 
vs
 
BERT)
  
MEDICAL:
 
2
 
tokens
 
(+66.7%
 
vs
 
BERT)
'electrocardiogram':
  
BERT:
 
4
 
tokens
 
(baseline)
  
GPT2:
 
5
 
tokens
 
(+0.0%
 
vs
 
BERT)
  
BIOBERT:
 
4
 
tokens
 
(+0.0%
 
vs
 
BERT)
  
MEDICAL:
 
1
 
tokens
 
(+75.0%
 
vs
 
BERT)
'thrombolytic therapy':
  
BERT:
 
5
 
tokens
 
(baseline)
  
GPT2:
 
6
 
tokens
 
(+0.0%
 
vs
 
BERT)
  
BIOBERT:
 
5
 
tokens
 
(+0.0%
 
vs
 
BERT)
  
MEDICAL:
 
2
 
tokens
 
(+60.0%
 
vs
 
BERT)
'cardiac catheterization':
  
BERT:
 
5
 
tokens
 
(baseline)
  
GPT2:
 
5
 
tokens
 
(+0.0%
 
vs
 
BERT)
  
BIOBERT:
 
5
 
tokens
 
(+0.0%
 
vs
 
BERT)
  
MEDICAL:
 
2
 
tokens
 
(+60.0%
 
vs
 
BERT)
'percutaneous coronary intervention':
  
BERT:
 
6
 
tokens
 
(baseline)
  
GPT2:
 
5
 
tokens
 
(+16.7%
 
vs
 
BERT)
  
BIOBERT:
 
7
 
tokens
 
(+0.0%
 
vs
 
BERT)
  
MEDICAL:
 
3
 
tokens
 
(+50.0%
 
vs
 
BERT)
🎯
 
Key Takeaway:
 
Domain-specific
 
tokenizers
 
can
 
significantly
 
reduce
 
token
 
counts
   
for
 
specialized
 
terminology,
 
leading
 
to
 
more
 
efficient
 
model
 
training
 
and
 
inference!

Hugging Face: Building Custom Language Models: From Raw Data to Production AI

================================================================================
TOKEN COMPOSITION ANALYSIS
================================================================================
How tokenizers handle complex medical terms:
--------------------------------------------------------------------------------
Term: 
'electrocardiogram'
  BERT      
4
 tokens - ❌ Fragmented: [
'electro'
, 
'##card'
, 
'##io'
, 
'##gram'
]
  GPT-
2
     
5
 tokens - ❌ Fragmented: [
'elect'
, 
'ro'
, 
'card'
, 
'i'
, 
'ogram'
]
  Medical   
1
 tokens - βœ… Preserved: [
'electrocardiogram'
]
Term: 
'thromboembolism'
  BERT      
6
 tokens - ❌ Fragmented: [
'th'
, 
'##rom'
, 
'##bo'
, 
'##em'
, 
'##bol'
]...
  GPT-
2
     
6
 tokens - ❌ Fragmented: [
'th'
, 
'rom'
, 
'bo'
, 
'emb'
, 
'ol'
]...
  Medical   
5
 tokens - ❌ Fragmented: [
'thrombo'
, 
'em'
, 
'bo'
, 
'l'
, 
'ism'
]
Term: 
'immunosuppressive'
  BERT      
5
 tokens - ❌ Fragmented: [
'im'
, 
'##mun'
, 
'##os'
, 
'##up'
, 
'##pressive'
]
  GPT-
2
     
5
 tokens - ❌ Fragmented: [
'im'
, 
'mun'
, 
'os'
, 
'upp'
, 
'ressive'
]
  Medical   
3
 tokens - ❌ Fragmented: [
'im'
, 
'mun'
, 
'osuppressive'
]
Term: 
'gastroenteritis'
  BERT      
4
 tokens - ❌ Fragmented: [
'gas'
, 
'##tro'
, 
'##enter'
, 
'##itis'
]
  GPT-
2
     
5
 tokens - ❌ Fragmented: [
'g'
, 
'ast'
, 
'ro'
, 
'enter'
, 
'itis'
]
  Medical   
5
 tokens - ❌ Fragmented: [
'g'
, 
'astro'
, 
'ent'
, 
'er'
, 
'itis'
]
Term: 
'nephrosclerosis'
  BERT      
5
 tokens - ❌ Fragmented: [
'ne'
, 
'##ph'
, 
'##ros'
, 
'##cle'
, 
'##rosis'
]
  GPT-
2
     
4
 tokens - ❌ Fragmented: [
'n'
, 
'eph'
, 
'ros'
, 
'clerosis'
]
  Medical   
3
 tokens - ❌ Fragmented: [
'neph'
, 
'roscle'
, 
'rosis'
]
🎯 Key Insights:
β€’ Domain-specific tokenizers preserve medical terms better
β€’ 
Preserved
 
terms
 
=
 Better semantic understanding
β€’ 
Fragmented
 
terms
 
=
 Loss of meaning and context
β€’ Efficiency directly impacts model performance and cost
  • Your domain has specialized vocabulary (medical, legal, scientific)
  • Standard tokenizers fragment important terms
  • You need maximum efficiency for large-scale deployment
  • Your language isn’t well-represented in existing tokenizers

Hugging Face: Building Custom Language Models: From Raw Data to Production AI

from
 datasets 
import
 get_dataset_config_names
# List all available Wikipedia dumps (by date)
print
(get_dataset_config_names(
'wikipedia'
))
def
 
process_batch
(
batch
):
    
"""Process a batch of examples - here we truncate text."""
    
return
 {
"processed_text"
: [t[:
200
] 
if
 
len
(t) > 
200
 
else
 t 
for
 t 
in
 batch[
"text"
]]}
# Create a larger sample dataset to demonstrate batching
large_sample_data = {
    
"text"
: [
        
"The development of artificial intelligence has revolutionized many industries. "
 * 
10
,
        
"Machine learning algorithms can learn from data without explicit programming. "
 * 
10
,
        
"Deep neural networks have achieved state-of-the-art results in computer vision. "
 * 
10
,
        
"Natural language processing enables machines to understand and generate human text. "
 * 
10
,
        
"Reinforcement learning allows agents to learn through interaction with environments. "
 * 
10
,
    ]
}
large_dataset = Dataset.from_dict(large_sample_data)
# Process data in batches
processed = large_dataset.
map
(process_batch, batched=
True
, batch_size=
2
)
print
(
"Batch Processing Example:"
)
print
(
"Processing examples (truncated to 200 chars):\n"
)
for
 i, example 
in
 
enumerate
(processed):
    
print
(
f"Example 
{i+
1
}
:"
)
    
print
(
f"Original length: 
{
len
(large_dataset[i][
'text'
])}
 chars"
)
    
print
(
f"Processed (first 100 chars): 
{example[
'processed_text'
][:
100
]}
..."
)
    
print
(
f"Processed length: 
{
len
(example[
'processed_text'
])}
 chars\n"
)
Batch Processing Example:
Processing
 
examples
 
(truncated
 
to
 
200
 
chars):
Example 1:
Original length:
 
790
 
chars
Processed
 
(first
 
100
 
chars):
 
The
 
development
 
of
 
artificial
 
intelligence
 
has
 
revolutionized
 
many
 
industries.
 
The
 
development
 
of
 
ar...
Processed length:
 
200
 
chars
Example 2:
Original length:
 
780
 
chars
Processed
 
(first
 
100
 
chars):
 
Machine
 
learning
 
algorithms
 
can
 
learn
 
from
 
data
 
without
 
explicit
 
programming.
 
Machine
 
learning
 
algor...
Processed length:
 
200
 
chars
Example 3:
Original length:
 
800
 
chars
Processed
 
(first
 
100
 
chars):
 
Deep
 
neural
 
networks
 
have
 
achieved
 
state-of-the-art
 
results
 
in
 
computer
 
vision.
 
Deep
 
neural
 
networks...
Processed length:
 
200
 
chars
Example 4:
Original length:
 
840
 
chars
Processed
 
(first
 
100
 
chars):
 
Natural
 
language
 
processing
 
enables
 
machines
 
to
 
understand
 
and
 
generate
 
human
 
text.
 
Natural
 
language...
Processed length:
 
200
 
chars
Example 5:
Original length:
 
850
 
chars
Processed
 
(first
 
100
 
chars):
 
Reinforcement
 
learning
 
allows
 
agents
 
to
 
learn
 
through
 
interaction
 
with
 
environments.
 
Reinforcement
 
l...
Processed length:
 
200
 
chars
# 
Initialize DVC 
in
 your project
$ 
dvc init
# 
Add your raw dataset to DVC tracking
$ 
dvc add data/raw_corpus.txt
# 
Commit the change (with metadata)
$ 
git add data/raw_corpus.txt.dvc .gitignore
$ 
git commit -m 
"Add raw corpus to DVC tracking"
# 
After cleaning or labeling, add the new version
$ 
dvc add data/cleaned_corpus.txt
$ 
git add data/cleaned_corpus.txt.dvc
$ 
git commit -m 
"Add cleaned corpus version"
  • πŸ” PII detection and removal: Use automated tools to scan for names, emails, and phone numbers
  • 🎭 Anonymization: Replace sensitive details with tokens or hash values
  • πŸ›‘οΈ Differential privacy: Apply mathematical guarantees to prevent individual identification
  • πŸ”’ Access controls: Store data securely with encryption at rest and in transit
import re
from
 typing import List, Tuple, Dict, Optional
def 
basic_redact_pii
(
text
: str) -> str:
    
""
"
    Basic regex-based PII redaction - suitable for simple use cases.
    
    ⚠️ Warning: This approach has limitations:
    - May miss edge cases and variations
    - No context awareness
    - Limited to pattern matching
    
    Args:
        text: Input text to redact
        
    Returns:
        Text with basic PII patterns replaced
    "
""
    
# Basic patterns - low recall but fast
    text = re.
sub
(r
'[\w\.-]+@[\w\.-]+'
, 
'[EMAIL]'
, text)
    text = re.
sub
(r
'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'
, 
'[PHONE]'
, text)
    text = re.
sub
(r
'Mr\.\s+\w+|Ms\.\s+\w+|Dr\.\s+\w+'
, 
'[NAME]'
, text)
    
return
 text
def 
advanced_redact_pii
(
text
: str) -> Tuple[str, Dict[str, 
int
]]:
    
""
"
    Production-grade PII redaction with comprehensive patterns.
    
    This implementation handles:
    - Multiple email formats
    - Various phone number formats (US and international)
    - SSN patterns
    - Credit card numbers (basic validation)
    - IP addresses
    - Common name patterns with titles
    - Physical addresses (partial)
    
    Args:
        text: Input text to redact
        
    Returns:
        Tuple of (redacted_text, statistics_dict)
    "
""
    stats = {
        
'emails'
: 
0
,
        
'phones'
: 
0
,
        
'ssns'
: 
0
,
        
'credit_cards'
: 
0
,
        
'names'
: 
0
,
        
'ip_addresses'
: 
0
,
        
'addresses'
: 
0
    }
    
    
# Apply patterns in specific order to avoid conflicts
    
    
# 1. SSN pattern (must come before phone to avoid false matches)
    ssn_pattern = r
'\b\d{3}-\d{2}-\d{4}\b'
    ssn_matches = 
len
(re.
findall
(ssn_pattern, text))
    text = re.
sub
(ssn_pattern, 
'[SSN]'
, text)
    stats[
'ssns'
] = ssn_matches
    
    
# 2. Credit card pattern (basic - production should use Luhn check)
    
# Matches 16 digits with optional spaces/dashes
    cc_pattern = r
'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b'
    cc_matches = 
len
(re.
findall
(cc_pattern, text))
    text = re.
sub
(cc_pattern, 
'[CREDIT_CARD]'
, text)
    stats[
'credit_cards'
] = cc_matches
    
    
# 3. Email addresses (comprehensive pattern)
    email_pattern = r
'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
    email_matches = 
len
(re.
findall
(email_pattern, text))
    text = re.
sub
(email_pattern, 
'[EMAIL]'
, text)
    stats[
'emails'
] = email_matches
    
    
# 4. Phone numbers (multiple formats)
    phone_patterns = [
        
# US formats with optional country code
        r
'(?:\+?1[-.\s]?)?\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})'
,
        
# Basic format xxx-xxx-xxxx
        r
'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'
,
        
# With parentheses
        r
'\(\d{3}\)\s*\d{3}-\d{4}'
,
        
# International format
        r
'\+\d{1,3}[-.\s]?\(?\d{1,4}\)?[-.\s]?\d{1,4}[-.\s]?\d{1,9}'
    ]
    
    phone_count = 
0
    
for
 pattern in phone_patterns:
        matches = re.
findall
(pattern, text)
        phone_count += 
len
(matches)
        text = re.
sub
(pattern, 
'[PHONE]'
, text)
    stats[
'phones'
] = phone_count
    
    
# 5. IP addresses (IPv4)
    ip_pattern = r
'\b(?:\d{1,3}\.){3}\d{1,3}\b'
    ip_matches = 
len
(re.
findall
(ip_pattern, text))
    text = re.
sub
(ip_pattern, 
'[IP_ADDRESS]'
, text)
    stats[
'ip_addresses'
] = ip_matches
    
    
# 6. Names with titles (comprehensive list)
    titles = [
'Mr'
, 
'Ms'
, 
'Mrs'
, 
'Dr'
, 
'Prof'
, 
'Rev'
, 
'Sr'
, 
'Jr'
, 
'Mx'
, 
'Dame'
, 
'Sir'
, 
'Lord'
, 
'Lady'
]
    title_pattern = r
'\b(?:'
 + 
'|'
.
join
(titles) + r
')\.?\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b'
    name_matches = 
len
(re.
findall
(title_pattern, text))
    text = re.
sub
(title_pattern, 
'[NAME]'
, text)
    stats[
'names'
] = name_matches
    
    
# 7. US Street addresses (partial pattern - challenging to catch all)
    address_pattern = r
'\b\d+\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\s+(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Lane|Ln|Drive|Dr|Court|Ct|Circle|Cir|Plaza|Pl)\b'
    address_matches = 
len
(re.
findall
(address_pattern, text, re.IGNORECASE))
    text = re.
sub
(address_pattern, 
'[ADDRESS]'
, text, flags=re.IGNORECASE)
    stats[
'addresses'
] = address_matches
    
    
return
 text, stats
def 
contextual_redact_pii
(
text
: str, 
context
: Optional[str] = None) -> str:
    
""
"
    Context-aware PII redaction using heuristics.
    
    This approach considers the context around potential PII to reduce false positives.
    
    Args:
        text: Input text to redact
        context: Optional context type ('medical', 'financial', 'legal', etc.)
        
    Returns:
        Redacted text with context-aware replacements
    "
""
    
# Start with advanced redaction
    redacted, stats = 
advanced_redact_pii
(text)
    
    
# Apply context-specific rules
    
if
 context == 
'medical'
:
        
# In medical context, preserve medical record numbers but redact patient names
        
# Preserve dates for medical history
        pass
    elif context == 
'financial'
:
        
# In financial context, preserve transaction IDs but redact account numbers
        
# May need to preserve certain dates for audit trails
        pass
    elif context == 
'legal'
:
        
# In legal context, case numbers might look like SSNs
        
# Preserve legal entity names but redact individual names
        pass
    
    
return
 redacted
# Test samples with various PII types
test_samples = [
    
# Basic examples
    
"Contact Dr. Smith at [email protected] or 555-123-4567."
,
    
"SSN: 123-45-6789, Credit Card: 4532-1234-5678-9012"
,
    
    
# Complex examples
    
"Please email [email protected] or call (555) 123-4567"
,
    
"Ms. Johnson lives at 123 Main Street and can be reached at +1-555-123-4567"
,
    
"Prof. Williams said to call +1 (800) 555-1234 from IP 192.168.1.1"
,
    
    
# Edge cases
    
"Email [email protected] or call +44 20 7123 4567"
,
    
"Transaction ID: 4532-1234-5678-9012 (not a credit card)"
,
    
"Case #123-45-6789 filed by Dr. Jane Smith"
,
    
    
# Mixed content
    
"Patient John Doe (SSN: 987-65-4321) visited Dr. Sarah Johnson at 456 Oak Avenue. Contact: [email protected], (555) 987-6543"
]
print
(
"πŸ”’ PII Redaction Examples - Basic vs Advanced vs Context-Aware"
)
print
(
"="
 * 
80
)
for
 i, sample in 
enumerate
(test_samples):
    
print
(f
"\nπŸ“ Example {i+1}:"
)
    
print
(f
"Original:    {sample}"
)
    
print
(f
"Basic:       {basic_redact_pii(sample)}"
)
    
    advanced, stats = 
advanced_redact_pii
(sample)
    
print
(f
"Advanced:    {advanced}"
)
    
    
# Show statistics for interesting cases
    
if
 
sum
(stats.
values
()) > 
2
:
        
print
(f
"Statistics:  {', '.join([f'{k}: {v}' for k, v in stats.items() if v > 0])}"
)
print
(
"\n"
 + 
"="
 * 
80
)
print
(
"🎯 PII Redaction Best Practices for Production"
)
print
(
"="
 * 
80
)
print
(
""
"
1. **Use Specialized Libraries**:
   β€’ presidio-analyzer: Microsoft's PII detection with ML models
   β€’ scrubadub: Extensible with custom detectors
   β€’ spacy + custom NER: Train on your specific domain
2. **Validation & Testing**:
   β€’ Create comprehensive test suites with edge cases
   β€’ Measure precision AND recall
   β€’ Test with real-world data samples
   β€’ Regular audits of redaction effectiveness
3. **Context Awareness**:
   β€’ Different domains have different PII patterns
   β€’ Medical: Patient names, medical record numbers
   β€’ Financial: Account numbers, transaction IDs
   β€’ Legal: Case numbers, party names
4. **Common Pitfalls to Avoid**:
   ❌ Over-redaction: Removing non-PII that looks similar
   ❌ Under-redaction: Missing variations and edge cases
   ❌ Pattern conflicts: SSN pattern matching phone numbers
   ❌ International formats: US-centric patterns missing global PII
5. **Performance Considerations**:
   β€’ Regex can be slow on large texts
   β€’ Consider using Aho-Corasick for multiple pattern matching
   β€’ Cache compiled regex patterns
   β€’ Process in chunks for very large documents
6. **Legal & Compliance**:
   β€’ GDPR: "
Reasonable
" efforts required
   β€’ HIPAA: Specific list of 18 identifiers
   β€’ CCPA: California-specific requirements
   β€’ Industry-specific regulations
"
""
)
# Advanced example: Using transformer-based detection
print
(
"\n"
 + 
"="
 * 
80
)
print
(
"πŸš€ Advanced PII Detection with Transformers"
)
print
(
"="
 * 
80
)
print
(
""
"
For production systems, consider transformer-based approaches:
```python
# Example using Microsoft Presidio
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
# Analyze text
results = analyzer.analyze(
    text="
John Smith
's phone is 555-123-4567",
    language='
en
'
)
# Anonymize
anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results
)

Advantages: βœ… Context-aware detection βœ… Multi-language support βœ… Confidence scores βœ… Custom entity training βœ… Handles complex patterns Try it: pip install presidio-analyzer presidio-anonymizer """)

Performance comparison mockup

print("\n" + "=" * 80) print("πŸ“Š PII Redaction Method Comparison") print("=" * 80) comparison_data = { "Method": ["Basic Regex", "Advanced Regex", "Presidio (NER)", "Custom LLM"], "Precision": ["70%", "85%", "92%", "95%"], "Recall": ["60%", "80%", "88%", "93%"], "Speed": ["1000 docs/sec", "500 docs/sec", "50 docs/sec", "10 docs/sec"], "Setup": ["Simple", "Moderate", "Complex", "Very Complex"], "Cost": ["Free", "Free", "Free", "API costs"] }

Display as formatted table

for key in comparison_data: print(f"{key:<15}", end="") for value in comparison_data[key]: print(f"{value:<20}", end="") print() print("\nπŸ’‘ Recommendation: Start with advanced regex, upgrade to Presidio for production")


πŸ”’ PII Redaction Examples - Basic vs Advanced vs Context-Aware

πŸ“ Example 1 : Original: Contact Dr. Smith at [email protected] or

555

123

4567 . Basic: Contact [NAME] at [EMAIL] or [PHONE]. Advanced: Contact [NAME] at [EMAIL] or [PHONE]. Statistics: emails: 1 , phones: 1 , names: 1 πŸ“ Example 2 : Original: SSN: 123

45

6789 , Credit Card: 4532

1234

5678

9012 Basic: SSN: 123

45

6789 , Credit Card: 4532

1234

5678

9012 Advanced: SSN: [SSN], Credit Card: [CREDIT_CARD] πŸ“ Example 3 : Original: Please email [email protected] or

call ( 555 ) 123

4567 Basic: Please email [EMAIL] or

call ( 555 ) 123

4567 Advanced: Please email [EMAIL] or

call [PHONE] πŸ“ Example 4 : Original: Ms. Johnson lives at 123 Main Street and can be reached at + 1

555

123

4567 Basic: [NAME] lives at 123 Main Street and can be reached at + 1 -[PHONE] Advanced: [NAME] lives at [ADDRESS] and can be reached at [PHONE] Statistics: phones: 1 , names: 1 , addresses: 1 πŸ“ Example 5 : Original: Prof. Williams said to

call + 1 ( 800 ) 555

1234

from IP 192.168 . 1.1 Basic: Prof. Williams said to

call + 1 ( 800 ) 555

1234

from IP 192.168 . 1.1 Advanced: [NAME] said to

call [PHONE] from IP [IP_ADDRESS] Statistics: phones: 1 , names: 1 , ip_addresses: 1 πŸ“ Example 6 : Original: Email [email protected] or

call + 44

20

7123

4567 Basic: Email [EMAIL] or

call + 44

20

7123

4567 Advanced: Email [EMAIL] or

call [PHONE] πŸ“ Example 7 : Original: Transaction ID: 4532

1234

5678

9012 ( not a credit card) Basic: Transaction ID: 4532

1234

5678

9012 ( not a credit card) Advanced: Transaction ID: [CREDIT_CARD] ( not a credit card) πŸ“ Example 8 : Original:

Case

123

45

6789 filed by Dr. Jane Smith Basic:

Case

123

45

6789 filed by [NAME] Smith Advanced:

Case #[SSN] filed by [NAME] πŸ“ Example 9 : Original: Patient John Doe (SSN: 987

65

4321 ) visited Dr. Sarah Johnson at 456 Oak Avenue. Contact: [email protected], ( 555 ) 987

6543 Basic: Patient John Doe (SSN: 987

65

4321 ) visited [NAME] Johnson at 456 Oak Avenue. Contact: [EMAIL], ( 555 ) 987

6543 Advanced: Patient John Doe (SSN: [SSN]) visited [NAME] at [ADDRESS]. Contact: [EMAIL], [PHONE] Statistics: emails: 1 , phones: 1 , ssns: 1 , names: 1 , addresses: 1

🎯 PII Redaction Best Practices for Production

1 . Use Specialized Libraries: β€’ presidio-analyzer: Microsoft 's PII detection with ML models β€’ scrubadub: Extensible with

custom detectors β€’ spacy + custom NER: Train on your specific domain 2 . Validation & Testing: β€’ Create comprehensive test suites with edge cases β€’ Measure precision AND recall β€’ Test with real-world data samples β€’ Regular audits of redaction effectiveness 3 . Context Awareness: β€’ Different domains have different PII patterns β€’ Medical: Patient names, medical record numbers β€’ Financial: Account numbers, transaction IDs β€’ Legal: Case numbers, party names 4 . Common Pitfalls to Avoid: ❌ Over-redaction: Removing non-PII that looks similar ❌ Under-redaction: Missing variations and edge cases ❌ Pattern conflicts: SSN pattern matching phone numbers ❌ International formats: US-centric patterns missing global PII 5 . Performance Considerations: β€’ Regex can be slow on large texts β€’ Consider using Aho-Corasick for multiple pattern matching β€’ Cache compiled regex patterns β€’ Process in chunks for very large documents 6 . Legal & Compliance: β€’ GDPR: "Reasonable" efforts required β€’ HIPAA: Specific list of

18 identifiers β€’ CCPA: California-specific requirements β€’ Industry-specific regulations

πŸš€ Advanced PII Detection with Transformers

For production systems, consider transformer-based approaches:

# Example 
using
 Microsoft Presidio
from
 presidio_analyzer import AnalyzerEngine
from
 presidio_anonymizer import AnonymizerEngine
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
# Analyze 
text
results = analyzer.analyze(
    
text
=
"John Smith's phone is 555-123-4567"
,
    language=
'en'
)
# Anonymize
anonymized = anonymizer.anonymize(
    
text
=
text
,
    analyzer_results=results
)

Advantages: βœ… Context-aware detection βœ… Multi-language support βœ… Confidence scores βœ… Custom entity training βœ… Handles complex patterns Try it: pip install presidio-analyzer presidio-anonymizer

πŸ“Š PII Redaction Method Comparison

Method Basic Regex Advanced Regex Presidio (NER)
Custom LLM
Precision
70%

85%

92%

95%

Recall
60%

80%

88%

93%

Speed
1000 docs/sec
500 docs/sec
50 docs/sec
10 docs/sec
Setup Simple Moderate Complex Very Complex
Cost Free Free Free API costs
πŸ’‘ Recommendation: Start with advanced regex, upgrade to Presidio for production The history saving thread hit an unexpected error (OperationalError( 'attempt to write a readonly database')).History will not be written to the database.


> *⚠️ Important: The basic regex patterns have low recall and miss many edge cases. For production use, always prefer transformer-based approaches like presidio-analyzer or LLM-powered detection for multilingual and context-dependent PII.*

![Hugging Face: Building Custom Language Models: From Raw Data to Production AI](/writing/hugging-face-building-custom-language-models-from-raw-data-to-production-ai/images/img-008.png)

- **Encoder-only (e.g., BERT):** For understanding tasks like classification or NER
- **Decoder-only (e.g., GPT):** For generative tasks such as text, code, or story generation
- **Encoder-decoder (e.g., T5, BART):** For sequence-to-sequence tasks like translation

- **vocab\_size** Must match tokenizer output β€” 30K-50K (custom), 50K+ (general)
- **max\_position\_embeddings** Maximum tokens per input β€” 512–2048 (standard), 4K-8K (long)
- **n\_embd** Embedding dimension β€” 768 (base), 1024–2048 (large)
- **n\_layer** Number of transformer layers β€” 12 (base), 24–48 (large)
- **n\_head** Attention heads β€” 12 (base), 16–32 (large)
- **use\_cache** Enable KV cache for generation β€” True (inference), False (training)

![Hugging Face: Building Custom Language Models: From Raw Data to Production AI](/writing/hugging-face-building-custom-language-models-from-raw-data-to-production-ai/images/img-009.png)

from transformers import GPT2Config, GPT2LMHeadModel

Use modern config parameter names

config = GPT2Config( vocab_size= 30000 ,

Match your tokenizer's vocab size

max_position_embeddings= 512 ,

Max sequence length

n_embd= 768 ,

Embedding size

n_layer= 12 ,

Number of transformer layers

n_head= 12 ,

Number of attention heads

use_cache= True

Enable caching for faster generation

) model = GPT2LMHeadModel(config)


> *⚠️ Important: Always use max\_position\_embeddings (not the deprecated n\_positions) for setting sequence length in configs.*

Sanity check: vocab size should match embedding matrix

assert config.vocab_size == model.transformer.wte.weight.shape[ 0 ], "Vocab size mismatch!" print ( "GPT-2 Model Configuration:" ) print ( f" Vocab size: {config.vocab_size:,} " ) print ( f" Max position embeddings: {config.max_position_embeddings} " ) print ( f" Hidden size: {config.n_embd} " ) print ( f" Layers: {config.n_layer} " ) print ( f" Attention heads: {config.n_head} " ) print ( f" Total parameters: { sum (p.numel() for p in model.parameters()):,} " )

Inspect model architecture

print ( "\nModel Architecture Summary:" ) print ( f"Token embeddings shape: {model.transformer.wte.weight.shape} " ) print ( f"Position embeddings shape: {model.transformer.wpe.weight.shape} " )


GPT-2 Model Configuration:

Vocab size:

30 ,000

Max position embeddings:

512

Hidden size:

768

Layers:

12

Attention heads:

12

Total parameters:

108 ,489,216 Model Architecture Summary: Token embeddings shape:

torch.Size([30000,

768 ]) Position embeddings shape:

torch.Size([512,

768 ])


- **Simplicity and Accessibility**: With parameters like `n_embd=768`, `n_layer=12`, and `n_head=12` in our config, this setup creates a "small" variant of GPT-2 (around 124 million parameters). It's computationally inexpensive, allowing you to train, fine-tune, and generate text on modest hardware like a standard laptop or even a CPU-only environment. This makes it ideal for experimentation, prototyping, and learning the ropes of transformer models without needing expensive GPUs or cloud resources.
- **Proven Educational Value**: GPT-2’s architecture is straightforward yet powerful enough to demonstrate key concepts such as attention mechanisms, positional embeddings (up to `max_position_embeddings=512`), and causal language modeling via `GPT2LMHeadModel`. It strikes a balance between being "small" (fast to iterate on) and effective for basic tasks like text generation or completion. In our code snippet, we've tailored the config to match a typical tokenizer vocab size (30,000) and enabled caching (`use_cache=True`) for efficient inference, showing how easy it is to adapt without overcomplicating things.
- **Reliability for Small-Scale Demos**: For articles, blogs, or workshops, GPT-2 β€œworks well” out of the box. It can handle short sequences and generate coherent outputs quickly, avoiding the pitfalls of larger models that might require extensive prompt engineering or risk hallucinations on underpowered setups.

- **Performance and Capabilities**: Modern models are trained on vastly larger datasets and incorporate optimizations that yield higher accuracy, better coherence, and reduced biases. For instance, they handle complex reasoning, multilingual tasks, and longer contexts far better than GPT-2’s 512-token limit.
- **Efficiency and Cost**: While GPT-2 is lightweight, newer models like those with parameter-efficient fine-tuning (PEFT) or quantization can run on similar hardware but deliver exponentially better results. This is crucial for production, where latency, scalability, and energy consumption matter.

- **Google Gemma**: Gemma 3n, launched in June 2025, is Google’s latest family of lightweight, multimodal models (e.g., Gemma-3n-E2B with 5B parameters, ~2GB VRAM; Gemma-3n-E4B with 8B parameters, ~3GB VRAM), optimized for edge devices like phones and laptops. Built on the MatFormer architecture with Per-Layer Embeddings (PLE) and KV Cache Sharing, it supports text, image, video, and audio inputs across 140+ languages. Distilled from Google’s advanced models, Gemma 3n excels in reasoning, speech translation, and on-device inference with robust safety evaluations, integrating seamlessly with Hugging Face for production-grade mobile AI applications.
- **Microsoft Phi**: The Phi series (e.g., Phi-4-mini) emphasizes efficiency through high-quality data curation rather than sheer size. Phi 4 released in 2025, it’s designed for personal AI assistants and outperforms models 10x its size on benchmarks like math and code. Use Phi in production for cost-effective deployments, especially in Microsoft ecosystems like Azure, where it excels in low-latency scenarios.
- **Meta Llama 3 (or Future Variants)**: Llama 4, launched in April 2025 with models like Scout (17B active parameters, 109B total) and Maverick (17B active parameters, 400B total), is a leading open-source LLM family. It’s natively multimodal, processing text, images, and video, supports 12 languages (e.g., Arabic, English, Hindi), and offers an industry-leading context window of up to 10M tokens for Scout. Built with a mixture-of-experts (MoE) architecture, Llama 4 excels in reasoning, code generation, and content summarization, making it ideal for customizable, high-performance applications like enterprise chatbots, creative writing tools, or multimodal AI agents.

Loading and Adapting a Pre-trained GPT-2 Model

from transformers import GPT2TokenizerFast, GPT2LMHeadModel

Load pre-trained model and tokenizer

tokenizer = GPT2TokenizerFast.from_pretrained( "gpt2" ) model = GPT2LMHeadModel.from_pretrained( "gpt2" ) print ( f"Original vocab size: { len (tokenizer)} " ) print ( f"Original embedding shape: {model.transformer.wte.weight.shape} " )

Add domain-specific tokens

new_tokens = [

"" , "" , "" , "" ,

"" , "" , "" ] num_added = tokenizer.add_tokens(new_tokens) if num_added > 0 :

Resize model embeddings to match new vocabulary

model.resize_token_embeddings( len (tokenizer))

print ( f"\nAdded {num_added} new tokens" )

print ( f"New vocab size: { len (tokenizer)} " )

print ( f"New embedding shape: {model.transformer.wte.weight.shape} " )

Test the new tokens

test_text = " presented with requiring " tokens = tokenizer.tokenize(test_text) token_ids = tokenizer.encode(test_text) print ( f"\nTest text: {test_text} " ) print ( f"Tokens: {tokens} " ) print ( f"Token IDs: {token_ids} " )


Original vocab size: 50257 Original embedding shape: torch.Size([ 50257 , 768 ]) Added 7

new tokens New vocab size: 50264 New embedding shape: torch.Size([ 50264 , 768 ]) Test text: < patient

presented with

< symptom

requiring < treatment

Tokens: [ '' , 'Δ presented' , 'Δ with' , 'Δ ' , '' , 'Δ requiring' , 'Δ ' , '' ] Token IDs: [ 50260 , 5545 , 351 , 220 , 50261 , 10616 , 220 , 50259 ]


Using pyenv for Python 3.12.9

pyenv install 3.12 .9 pyenv local 3.12 .9

Install with poetry

poetry add transformers peft bitsandbytes accelerate from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig from peft import LoraConfig, get_peft_model, TaskType

Load Llama-3 or Gemma-2 in 4-bit for memory efficiency

bnb_config = BitsAndBytesConfig( load_in_4bit= True , bnb_4bit_compute_dtype= "float16" , bnb_4bit_quant_type= "nf4" , bnb_4bit_use_double_quant= True )

Example with Llama-3-8B (adjust model name to latest version)

model = AutoModelForCausalLM.from_pretrained(

"meta-llama/Meta-Llama-3-8B" ,

Or "google/gemma-2-7b"

quantization_config=bnb_config, device_map= "auto" ) tokenizer = AutoTokenizer.from_pretrained( "meta-llama/Meta-Llama-3-8B" )

Configure LoRA for efficient fine-tuning

peft_config = LoraConfig( task_type=TaskType.CAUSAL_LM, inference_mode= False , r= 8 ,

LoRA rank

lora_alpha= 32 , lora_dropout= 0.1 , target_modules=[ "q_proj" , "v_proj" ]

Target attention layers

)

Apply LoRA to the model

model = get_peft_model(model, peft_config) model.print_trainable_parameters()

Shows only ~0.1% params are trainable!


Apply LoRA

lora_model = get_peft_model(base_model, peft_config)

Count LoRA parameters

lora_params = sum (p.numel() for p in lora_model.parameters() if p.requires_grad) print ( "\nWith LoRA:" ) print ( f" Trainable parameters: {lora_params:,} " ) print ( f" Reduction: {( 1

  • lora_params/original_params)* 100 : .2 f} %" ) print ( f" Memory savings: ~ {(original_params - lora_params) * 4 / 1024 ** 3 : .2 f} GB (FP32)" )

Show trainable parameters

print ( "\nDetailed parameter info:" ) lora_model.print_trainable_parameters()


Original GPT-2 Model:

Total parameters:

124 ,439,808

Trainable parameters:

124 ,439,808 'NoneType'

object

has

no

attribute

'cadam32bit_grad_fp32' With LoRA:

Trainable parameters:

811 ,008

Reduction:

99.35 %

Memory savings:

~0.46

GB

(FP32) Detailed parameter info: trainable params:

811 ,008

||

all params:

125 ,250,816

||

trainable%:

0.6475071587557562


![Hugging Face: Building Custom Language Models: From Raw Data to Production AI](/writing/hugging-face-building-custom-language-models-from-raw-data-to-production-ai/images/img-010.png)

QLoRA configuration for extreme efficiency

peft_config

LoraConfig ( task_type

TaskType.CAUSAL_LM, inference_mode

False , r

4 ,

Even smaller rank for QLoRA

lora_alpha = 16 , lora_dropout = 0.05 , target_modules = [ "q_proj" , "v_proj" , "k_proj" , "o_proj" ] ,

All attention

bias = "none" )


accelerate config

Set up your hardware interactively

accelerate launch train.py


- `accelerate config` prompts you to specify your hardware (number of GPUs, backend, precision)
- `accelerate launch train.py` runs your training script with distributed setup

- ⚑ ZeRO optimizations for memory efficiency
- πŸ’Ύ Gradient checkpointing and sharded training
- πŸ—οΈ Support for extremely large models (billions of parameters)

> *πŸ““ Jupyter Notebook: For an interactive walkthrough of these concepts, check out the Building Custom Language Models notebook that demonstrates data curation, model configuration, and training workflows with executable examples.*
>
> ***βœ… Notebook Status****: This notebook has been thoroughly tested and is fully functional. All cells execute properly from start to finish, with:*

- The corpus variable issue has been completely resolved
- All dependencies are handled gracefully with appropriate fallbacks
- Clear instructions and error handling throughout
- Ready for immediate use in learning and experimentation

Basic Training Setup with Simple Demo

from transformers import ( AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling ) from datasets import Dataset import torch

Create a sample dataset

texts = [

"The patient presented with chest pain and shortness of breath." ,

"Diagnosis confirmed myocardial infarction based on ECG results." ,

"Treatment included aspirin and thrombolytic therapy." ,

"Post-operative care following cardiac surgery is essential." ,

"Regular monitoring of cardiac function recommended." ,

"Patient history includes hypertension and diabetes." , ]

Load a small model for demonstration

model_name = "distilgpt2" model = AutoModelForCausalLM.from_pretrained(model_name) tokenizer = AutoTokenizer.from_pretrained(model_name) tokenizer.pad_token = tokenizer.eos_token

Create and tokenize dataset

dataset = Dataset.from_dict({ "text" : texts}) def

tokenize_function ( examples ):

return tokenizer( examples[ "text" ], padding= True , truncation= True , max_length= 128 ) tokenized_dataset = dataset. map (tokenize_function, batched= True )

Data collator for language modeling

data_collator = DataCollatorForLanguageModeling( tokenizer=tokenizer, mlm= False ,

Causal LM, not masked LM

) print ( "Training Setup Complete!" ) print ( f" Model: {model_name} " ) print ( f" Dataset size: { len (dataset)} " ) print ( f" Device: {get_device()} " ) print ( f" Model parameters: { sum (p.numel() for p in model.parameters()):,} " )

Due to version compatibility issues with Trainer, we'll demonstrate

a simple training loop instead

print ( "\nNote: Due to library version conflicts, we'll demonstrate" ) print ( "model usage without the full Trainer API." )

Save references for next cells

model_for_generation = model tokenizer_for_generation = tokenizer


Map:

100 %

6 /6 [ 00 :00<00:00 , 1092.69 examples/s ] Training

Setup

Complete!

Model:

distilgpt2

Dataset size:

6

Device:

mps

Model parameters:

81 ,912,576 Note:

Due

to

library

version

conflicts,

we'll

demonstrate model

usage

without

the

full

Trainer

API.


- **Training Loss:** All Measures model fit on training data β€” Should decrease steadily
- **Validation Loss:** All Indicates generalization ability β€” Rising = overfitting
- **Perplexity:** Language Modeling How well model predicts next token β€” Lower is better (e.g., 20–50)
- **Accuracy:** Classification Percentage of correct predictions β€” Intent detection, sentiment
- **F1 Score** Classification Harmonic mean of precision/recall β€” Imbalanced datasets
- **BLEU** Translation/Generation N-gram overlap with references β€” Machine translation quality
- **ROUGE** Summarization Recall-oriented overlap measure β€” Text summarization tasks
- **BERTScore** Generation Semantic similarity using BERT β€” Modern alternative to BLEU
- **HELM** General LLM Holistic evaluation across tasks β€” Comprehensive model assessment

![Hugging Face: Building Custom Language Models: From Raw Data to Production AI](/writing/hugging-face-building-custom-language-models-from-raw-data-to-production-ai/images/img-011.png)

![Hugging Face: Building Custom Language Models: From Raw Data to Production AI](/writing/hugging-face-building-custom-language-models-from-raw-data-to-production-ai/images/img-012.png)

from transformers import Trainer, TrainingArguments training_args = TrainingArguments( output_dir= "./results" , evaluation_strategy= "steps" , eval_steps=500, logging_steps=100, save_steps=500, per_device_train_batch_size=2, num_train_epochs=3, report_to=[ "tensorboard" , "wandb" ],

Modern experiment tracking

) trainer = Trainer( model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset ) trainer.train()


from evaluate import load

Load metrics appropriate for your task

accuracy = load( "accuracy" ) f1 = load( "f1" ) bleu = load( "bleu" )

Example usage in your evaluation loop:

predictions = [...]

Model outputs

references = [...]

Ground truth labels

result = accuracy.compute(predictions=predictions, references=references) print(result)


from transformers import

EarlyStoppingCallback trainer

= Trainer( model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset, callbacks=[EarlyStoppingCallback(early_stopping_patience= 3 )] )


from transformers import pipeline

Load your fine-tuned model

text_generator = pipeline( "text-generation" , model= "./results/checkpoint-1500" ) prompts = [

"In accordance with the contract, the party of the first part shall" ,

"The diagnosis was confirmed by the following procedure:" ] for prompt in prompts: output = text_generator(prompt, max_length= 50 , num_return_sequences= 1 )

print ( f"Prompt: {prompt} \\nGenerated: {output[ 0 ][ 'generated_text' ]} \\n" )


Simple Training Demonstration

import torch.nn.functional as F from torch.utils.data import DataLoader import numpy as np

Create a simple training function

def

simple_train_step ( model, batch, device ):

"""Perform a single training step.""" model.train()

Move inputs to device

input_ids = batch[ 'input_ids' ].to(device) attention_mask = batch[ 'attention_mask' ].to(device)

Forward pass

outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids) loss = outputs.loss

return loss.item()

Demonstrate training concept

print ( "Training Concepts Demonstration:" ) print ( "=" * 50 ) print ( "\nIn a full training loop, you would:" ) print ( "1. Create data loaders for batching" ) print ( "2. Initialize an optimizer (e.g., AdamW)" ) print ( "3. Loop through epochs and batches" ) print ( "4. Compute loss and gradients" ) print ( "5. Update model weights" ) print ( "6. Track metrics like loss and perplexity" )

Calculate initial perplexity (demonstration only)

print ( "\nModel Statistics:" ) with torch.no_grad():

Sample a small batch

sample_text = "The patient presented with symptoms" inputs = tokenizer(sample_text, return_tensors= "pt" ) outputs = model(**inputs, labels=inputs.input_ids) loss = outputs.loss.item() perplexity = np.exp(loss)

print ( f"Sample loss: {loss: .4 f} " ) print ( f"Sample perplexity: {perplexity: .4 f} " ) print ( "\nNote: Lower perplexity indicates better model performance" )

Save model state for next cells

trained_model = model trained_tokenizer = tokenizer


Training Concepts Demonstration:

In a full training loop, you would: 1. Create data loaders for batching 2. Initialize an optimizer (e.g., AdamW) 3. Loop through epochs and batches 4. Compute loss and gradients 5. Update model weights 6. Track metrics like loss and perplexity Model Statistics: Sample loss: 6.5241 Sample perplexity: 681.3395 Note: Lower perplexity indicates better model performance


Advanced Training Progress Visualization

print ( "=" * 80 ) print ( "TRAINING PROGRESS VISUALIZATION & MONITORING" ) print ( "=" * 80 )

Import visualization libraries with error handling

try :

import matplotlib.pyplot as plt

from IPython.display import clear_output

import matplotlib.gridspec as gridspec HAS_MATPLOTLIB = True except ImportError:

print ( "⚠️ matplotlib not available - install with: pip install matplotlib" )

print ( " Text-based metrics will be shown instead." ) HAS_MATPLOTLIB = False def

simulate_advanced_training_metrics ( num_steps: int

100 , model_type: str

"base" ):

""" Simulate comprehensive training metrics for visualization.

Args: num_steps: Number of training steps to simulate model_type: Type of model ('base', 'lora', 'qlora') """ steps = [] train_losses = [] eval_losses = [] learning_rates = [] gradient_norms = [] memory_usage = []

Initial values based on model type

if model_type == "base" : train_loss = 4.5 eval_loss = 4.6 base_lr = 5e-5 base_memory = 8.0

GB

elif model_type == "lora" : train_loss = 4.2 eval_loss = 4.3 base_lr = 3e-4

LoRA can use higher LR

base_memory = 2.0

GB

else :

qlora

train_loss = 4.3 eval_loss = 4.4 base_lr = 2e-4 base_memory = 1.0

GB

Simulation parameters

best_eval_loss = float ( 'inf' ) patience_counter = 0 early_stop_patience = 10

for step in

range (num_steps):

Simulate loss decrease with noise

train_loss *= 0.98

  • np.random.normal( 0 , 0.01 ) eval_loss *= 0.985
  • np.random.normal( 0 , 0.015 )

Simulate learning rate schedule

if step < 10 : lr = base_lr * (step + 1 ) / 10

Warmup

elif step < 80 : lr = base_lr

Constant

else : lr = base_lr * 0.5

  • ( 1
  • np.cos(np.pi * (step - 80 ) / 20 ))

Cosine decay

Simulate gradient norm

grad_norm = 2.0

  • np.exp(-step/ 30 ) + np.random.normal( 0.5 , 0.1 )

Simulate memory usage (increases slightly during training)

mem = base_memory + 0.5

  • np.sin(step/ 10 ) + np.random.normal( 0 , 0.1 )

Store metrics

steps.append(step) train_losses.append(train_loss) eval_losses.append(eval_loss) learning_rates.append(lr) gradient_norms.append(grad_norm) memory_usage.append(mem)

Early stopping check

if eval_loss < best_eval_loss: best_eval_loss = eval_loss patience_counter = 0

else : patience_counter += 1

Visualize every 10 steps

if step % 10

0

and HAS_MATPLOTLIB: clear_output(wait= True )

Create comprehensive dashboard

fig = plt.figure(figsize=( 16 , 10 )) gs = gridspec.GridSpec( 3 , 3 , figure=fig)

--- Main loss plot (spans 2 columns) ---

ax1 = fig.add_subplot(gs[ 0 , : 2 ]) ax1.plot(steps, train_losses, label= 'Train Loss' , color= 'blue' , linewidth= 2 ) ax1.plot(steps, eval_losses, label= 'Eval Loss' , color= 'orange' , linewidth= 2 ) ax1.axhline(y=best_eval_loss, color= 'green' , linestyle= '--' , alpha= 0.5 , label= f'Best Eval: {best_eval_loss: .4 f} ' ) ax1.set_xlabel( 'Steps' ) ax1.set_ylabel( 'Loss' ) ax1.set_title( f'Training Progress - {model_type.upper()} Model' , fontsize= 14 , fontweight= 'bold' ) ax1.legend() ax1.grid( True , alpha= 0.3 )

Add overfitting warning

if eval_loss > train_loss * 1.2 : ax1.text( 0.5 , 0.95 , '⚠️ OVERFITTING DETECTED' , transform=ax1.transAxes, ha= 'center' , va= 'top' , color= 'red' , fontsize= 12 , fontweight= 'bold' , bbox= dict (boxstyle= 'round' , facecolor= 'yellow' , alpha= 0.7 ))

--- Metrics display ---

ax2 = fig.add_subplot(gs[ 0 , 2 ]) ax2.axis( 'off' ) metrics_text = f"""Current Metrics (Step {step} ) ───────────────────── Train Loss: {train_loss: .4 f} Eval Loss: {eval_loss: .4 f} Learning Rate: {lr: .2 e} Gradient Norm: {grad_norm: .3 f} Memory Usage: {mem: .1 f} GB Early Stopping ───────────────────── Best Eval Loss: {best_eval_loss: .4 f} Patience: {patience_counter} / {early_stop_patience} Status: { '🟒 Training'

if patience_counter < early_stop_patience else

'πŸ”΄ Stop' } Model Efficiency ───────────────────── Type: {model_type.upper()} Params: { '100%'

if model_type == 'base'

else

'~1%'

if model_type == 'lora'

else

'~0.1%' } Memory: { 'High'

if model_type == 'base'

else

'Low'

if model_type == 'lora'

else

'Very Low' } """

ax2.text( 0.1 , 0.9 , metrics_text, transform=ax2.transAxes, fontsize= 10 , verticalalignment= 'top' , family= 'monospace' , bbox= dict (boxstyle= 'round' , facecolor= 'lightgray' , alpha= 0.8 ))

--- Learning rate schedule ---

ax3 = fig.add_subplot(gs[ 1 , 0 ]) ax3.plot(steps, learning_rates, color= 'green' , linewidth= 2 ) ax3.axvline(x= 10 , color= 'gray' , linestyle= '--' , alpha= 0.5 , label= 'Warmup End' ) ax3.axvline(x= 80 , color= 'gray' , linestyle= '--' , alpha= 0.5 , label= 'Decay Start' ) ax3.set_xlabel( 'Steps' ) ax3.set_ylabel( 'Learning Rate' ) ax3.set_title( 'Learning Rate Schedule' , fontsize= 12 ) ax3.grid( True , alpha= 0.3 ) ax3.legend()

--- Gradient norm ---

ax4 = fig.add_subplot(gs[ 1 , 1 ]) ax4.plot(steps, gradient_norms, color= 'purple' , linewidth= 2 ) ax4.axhline(y= 1.0 , color= 'red' , linestyle= '--' , alpha= 0.5 , label= 'Clip Threshold' ) ax4.set_xlabel( 'Steps' ) ax4.set_ylabel( 'Gradient Norm' ) ax4.set_title( 'Gradient Norm Evolution' , fontsize= 12 ) ax4.grid( True , alpha= 0.3 ) ax4.legend()

--- Memory usage ---

ax5 = fig.add_subplot(gs[ 1 , 2 ]) ax5.plot(steps, memory_usage, color= 'red' , linewidth= 2 ) ax5.fill_between(steps, memory_usage, alpha= 0.3 , color= 'red' ) ax5.set_xlabel( 'Steps' ) ax5.set_ylabel( 'Memory (GB)' ) ax5.set_title( 'GPU Memory Usage' , fontsize= 12 ) ax5.grid( True , alpha= 0.3 )

--- Loss difference (train vs eval) ---

ax6 = fig.add_subplot(gs[ 2 , 0 ]) loss_diff = [e - t for t, e in

zip (train_losses, eval_losses)] ax6.plot(steps, loss_diff, color= 'brown' , linewidth= 2 ) ax6.axhline(y= 0 , color= 'black' , linestyle= '-' , alpha= 0.5 ) ax6.fill_between(steps, loss_diff, alpha= 0.3 , where=[d > 0

for d in loss_diff], color= 'red' , label= 'Overfitting' ) ax6.fill_between(steps, loss_diff, alpha= 0.3 , where=[d <= 0

for d in loss_diff], color= 'green' , label= 'Underfitting' ) ax6.set_xlabel( 'Steps' ) ax6.set_ylabel( 'Eval - Train Loss' ) ax6.set_title( 'Generalization Gap' , fontsize= 12 ) ax6.legend() ax6.grid( True , alpha= 0.3 )

--- Perplexity ---

ax7 = fig.add_subplot(gs[ 2 , 1 ]) train_perplexity = [np.exp(loss) for loss in train_losses] eval_perplexity = [np.exp(loss) for loss in eval_losses] ax7.plot(steps, train_perplexity, label= 'Train Perplexity' , color= 'blue' , linewidth= 2 ) ax7.plot(steps, eval_perplexity, label= 'Eval Perplexity' , color= 'orange' , linewidth= 2 ) ax7.set_xlabel( 'Steps' ) ax7.set_ylabel( 'Perplexity' ) ax7.set_title( 'Model Perplexity' , fontsize= 12 ) ax7.set_yscale( 'log' ) ax7.legend() ax7.grid( True , alpha= 0.3 )

--- Training speed ---

ax8 = fig.add_subplot(gs[ 2 , 2 ])

Simulate tokens/second based on model type

base_speed = 5000

if model_type == 'base'

else

15000

if model_type == 'lora'

else

20000 speeds = [base_speed + np.random.normal( 0 , 500 ) for _ in steps] ax8.plot(steps, speeds, color= 'cyan' , linewidth= 2 ) ax8.set_xlabel( 'Steps' ) ax8.set_ylabel( 'Tokens/Second' ) ax8.set_title( 'Training Speed' , fontsize= 12 ) ax8.grid( True , alpha= 0.3 )

plt.suptitle( f'Comprehensive Training Dashboard - Step {step} / {num_steps} ' , fontsize= 16 , fontweight= 'bold' ) plt.tight_layout() plt.show()

Check for early stopping

if patience_counter >= early_stop_patience:

print ( f"\nπŸ›‘ Early stopping triggered at step {step} " )

print ( f" Best eval loss: {best_eval_loss: .4 f} " )

break

Final summary

if

not HAS_MATPLOTLIB or step == num_steps - 1 :

print ( f"\nπŸ“Š Training Summary for {model_type.upper()} Model:" )

print ( f" Steps completed: { len (steps)} " )

print ( f" Final train loss: {train_losses[- 1 ]: .4 f} " )

print ( f" Final eval loss: {eval_losses[- 1 ]: .4 f} " )

print ( f" Best eval loss: {best_eval_loss: .4 f} " )

print ( f" Final perplexity: {np.exp(eval_losses[- 1 ]): .2 f} " )

return steps, train_losses, eval_losses, best_eval_loss

Demonstrate different training scenarios

print ( "\nSelect training scenario to visualize:" ) print ( "1. Base Model (Full Fine-tuning)" ) print ( "2. LoRA (Parameter-Efficient)" ) print ( "3. QLoRA (Quantized + LoRA)" )

Simulate LoRA training as default

print ( "\nSimulating LoRA training (most common scenario)..." ) steps, train_losses, eval_losses, best_loss = simulate_advanced_training_metrics( num_steps= 50 , model_type= "lora" )

Additional training insights

print ( "\n" + "=" * 80 ) print ( "TRAINING BEST PRACTICES & INSIGHTS" ) print ( "=" * 80 ) print ( "\nπŸ“ˆ Key Metrics to Monitor:" ) print ( "β€’ Loss convergence - Should decrease smoothly" ) print ( "β€’ Generalization gap - Keep eval close to train" ) print ( "β€’ Gradient norms - Should stabilize < 1.0" ) print ( "β€’ Learning rate - Follow your schedule" ) print ( "β€’ Memory usage - Watch for OOM" ) print ( "\n⚑ Performance Tips:" ) print ( "β€’ Use mixed precision (fp16/bf16) for 2x speedup" ) print ( "β€’ Enable gradient checkpointing for large models" ) print ( "β€’ Use gradient accumulation for larger effective batch size" ) print ( "β€’ Monitor for loss spikes indicating instability" ) print ( "\n🎯 Model-Specific Recommendations:" ) recommendations = {

"base" : {

"lr" : "2e-5 to 5e-5" ,

"batch_size" : "4-8 per GPU" ,

"warmup" : "500-1000 steps" ,

"weight_decay" : "0.01" },

"lora" : {

"lr" : "1e-4 to 3e-4" ,

"batch_size" : "16-32 per GPU" ,

"warmup" : "100-200 steps" ,

"lora_r" : "8-16" ,

"lora_alpha" : "16-32" },

"qlora" : {

"lr" : "2e-4" ,

"batch_size" : "32-64 per GPU" ,

"warmup" : "100 steps" ,

"bits" : "4-bit NF4" ,

"double_quant" : "True" } } for model_type, params in recommendations.items():

print ( f"\n {model_type.upper()} Model:" )

for param, value in params.items():

print ( f" β€’ {param} : {value} " )


![Hugging Face: Building Custom Language Models: From Raw Data to Production AI](/writing/hugging-face-building-custom-language-models-from-raw-data-to-production-ai/images/img-013.png)

πŸ“Š

Training Summary for LORA Model:

Steps completed:

50

Final train loss:

1.5787

Final eval loss:

1.8970

Best eval loss:

1.8626

Final perplexity:

6.67

TRAINING

BEST

PRACTICES

&

INSIGHTS

πŸ“ˆ

Key Metrics to Monitor: β€’

Loss

convergence

Should

decrease

smoothly β€’

Generalization

gap

Keep

eval

close

to

train β€’

Gradient

norms

Should

stabilize

<

1.0 β€’

Learning

rate

Follow

your

schedule β€’

Memory

usage

Watch

for

OOM ⚑

Performance Tips: β€’

Use

mixed

precision

(fp16/bf16)

for

2x

speedup β€’

Enable

gradient

checkpointing

for

large

models β€’

Use

gradient

accumulation

for

larger

effective

batch

size β€’

Monitor

for

loss

spikes

indicating

instability 🎯

Model-Specific Recommendations: BASE Model:

β€’

lr:

2e-5

to

5e-5

β€’

batch_size:

4 -8

per

GPU

β€’

warmup:

500 -1000

steps

β€’

weight_decay:

0.01 LORA Model:

β€’

lr:

1e-4

to

3e-4

β€’

batch_size:

16 -32

per

GPU

β€’

warmup:

100 -200

steps

β€’

lora_r:

8 -16

β€’

lora_alpha:

16 -32 QLORA Model:

β€’

lr:

2e-4

β€’

batch_size:

32 -64

per

GPU

β€’

warmup:

100

steps

β€’

bits:

4 -bit

NF4

β€’

double_quant:

True


Comprehensive

Training

Diagnostics

Tool from typing import List, Dict, Tuple import re

Comprehensive

Training

Diagnostics

Tool from typing import List, Dict, Tuple import re def diagnose_training_issues(symptoms: List [str]) -> List [ Tuple [str, Dict ]]:

""" Advanced diagnostic tool for identifying and resolving training issues.

Args: symptoms: List of observed symptoms during training

Returns: List of tuples containing (issue_name, diagnostic_details) """ # Comprehensive diagnostics database diagnostics = {

"loss_explosion" : {

"symptoms" : [ "loss goes to inf" , "loss increases rapidly" , "nan loss" , "loss explodes" ],

"causes" : [

"Learning rate too high" ,

"Gradient explosion" ,

"Numerical instability" ,

"Bad batch normalization" ],

"solutions" : [

"Reduce learning rate (try 1e-5 or lower)" ,

"Enable gradient clipping (max_grad_norm=1.0)" ,

"Use mixed precision training with loss scaling" ,

"Check for division by zero in custom loss" ,

"Verify input data doesn't contain NaN/Inf values" ,

"Use smaller warmup steps" ],

"code_fixes" : [

"training_args.learning_rate = 1e-5" ,

"training_args.max_grad_norm = 1.0" ,

"training_args.fp16 = True" ,

"training_args.warmup_steps = 100" ] },

"no_learning" : {

"symptoms" : [ "loss stays constant" , "no improvement" , "stuck loss" , "loss plateau" ],

"causes" : [

"Learning rate too low" ,

"Dead neurons/vanishing gradients" ,

"Data loading issues" ,

"Incorrect loss function" ],

"solutions" : [

"Increase learning rate (try 2e-4)" ,

"Check if model outputs are changing" ,

"Verify data loading and preprocessing" ,

"Try different initialization" ,

"Check if labels are correct" ,

"Ensure optimizer is stepping" ],

"code_fixes" : [

"training_args.learning_rate = 2e-4" ,

"model.apply(model._init_weights)" ,

"print(next(model.parameters()).grad) # Check gradients" ,

"optimizer.zero_grad() # Ensure gradients reset" ] },

"overfitting" : {

"symptoms" : [ "train loss decreases but val loss increases" , "gap between train and val" , "validation metrics worsen" ],

"causes" : [

"Model too large for dataset" ,

"Too little data" ,

"No regularization" ,

"Training too long" ],

"solutions" : [

"Add dropout (0.1-0.3)" ,

"Reduce model size" ,

"Augment training data" ,

"Add weight decay (0.01-0.1)" ,

"Early stopping" ,

"Use smaller learning rate" ,

"Add more regularization" ],

"code_fixes" : [

"model.dropout = nn.Dropout(0.2)" ,

"training_args.weight_decay = 0.01" ,

"training_args.load_best_model_at_end = True" ,

"training_args.evaluation_strategy = 'steps'" ,

"training_args.eval_steps = 50" ] },

"oom" : {

"symptoms" : [ "cuda out of memory" , "oom error" , "memory error" , "gpu memory" ],

"causes" : [

"Batch size too large" ,

"Model too large" ,

"Memory leak" ,

"Gradient accumulation" ],

"solutions" : [

"Reduce batch size (try 1 or 2)" ,

"Enable gradient accumulation" ,

"Use gradient checkpointing" ,

"Clear cache: torch.cuda.empty_cache()" ,

"Use mixed precision (fp16)" ,

"Use parameter-efficient methods (LoRA)" ,

"Enable CPU offloading" ],

"code_fixes" : [

"training_args.per_device_train_batch_size = 1" ,

"training_args.gradient_accumulation_steps = 8" ,

"model.gradient_checkpointing_enable()" ,

"training_args.fp16 = True" ,

"torch.cuda.empty_cache()" ] },

"slow_training" : {

"symptoms" : [ "training too slow" , "low gpu utilization" , "slow iteration" ],

"causes" : [

"Data loading bottleneck" ,

"Small batch size" ,

"CPU bottleneck" ,

"Inefficient operations" ],

"solutions" : [

"Increase num_workers in DataLoader" ,

"Use larger batch size if memory allows" ,

"Enable pin_memory for DataLoader" ,

"Profile code to find bottlenecks" ,

"Use mixed precision training" ,

"Optimize data preprocessing" ],

"code_fixes" : [

"dataloader = DataLoader(..., num_workers=4, pin_memory=True)" ,

"training_args.dataloader_num_workers = 4" ,

"training_args.fp16 = True" ,

"training_args.dataloader_pin_memory = True" ] },

"unstable_training" : {

"symptoms" : [ "loss spikes" , "erratic loss" , "training unstable" , "loss oscillates" ],

"causes" : [

"Learning rate too high" ,

"Bad batches" ,

"Gradient accumulation issues" ,

"Numerical precision" ],

"solutions" : [

"Use learning rate scheduler" ,

"Implement gradient clipping" ,

"Use larger batch size or accumulation" ,

"Switch to AdamW optimizer" ,

"Add warmup period" ,

"Check for outliers in data" ],

"code_fixes" : [

"training_args.warmup_ratio = 0.1" ,

"training_args.lr_scheduler_type = 'cosine'" ,

"training_args.max_grad_norm = 1.0" ,

"training_args.optim = 'adamw_torch'" ] } }

# Find matching issues matched_issues = []

for symptom in symptoms: symptom_lower

symptom.lower()

for issue, details in diagnostics.items():

if

any (s in symptom_lower for s in details[ "symptoms" ]): matched_issues.append((issue, details))

break

Only match each symptom once

return matched_issues def display_diagnostic_results(symptoms: List [str], show_code: bool

True ):

""" Display diagnostic results in a formatted manner.

Args: symptoms: List of observed symptoms show_code: Whether to show code fixes """

print ( "πŸ” Training Issue Diagnosis" )

print ( "="

70 )

issues = diagnose_training_issues(symptoms)

if not issues:

print ( " \n ❌ No matching issues found for the given symptoms." )

print ( " \n πŸ’‘ Try describing symptoms using terms like:" )

print ( " - 'loss goes to inf' or 'nan loss'" )

print ( " - 'loss stays constant' or 'no improvement'" )

print ( " - 'cuda out of memory' or 'oom error'" )

print ( " - 'train loss decreases but val loss increases'" )

return

for i, (issue_name, details) in enumerate(issues, 1 ):

print (f " \n 🎯 Issue {i}: {issue_name.upper().replace('_', ' ')}" )

print ( "-"

50 )

print ( " \n πŸ“‹ Possible Causes:" )

for cause in details[ "causes" ]:

print (f " β€’ {cause}" )

print ( " \n πŸ’‘ Recommended Solutions:" )

for j, solution in enumerate(details[ "solutions" ], 1 ):

print (f " {j}. {solution}" )

if show_code and "code_fixes"

in details:

print ( " \n πŸ“ Code Fixes:" )

for fix in details[ "code_fixes" ]:

print (f " ```python" )

print (f " {fix}" )

print (f " ```" )

Interactive diagnostic examples print ( "🩺 Training Diagnostics Tool Demo" ) print ( "="

70 )

Example

1 : Memory issues print ( " \n πŸ“ Example 1: Memory Issues" ) symptoms1

[ "My model shows CUDA out of memory error during training" ] display_diagnostic_results(symptoms1)

Example

2 : Overfitting print ( " \n \n πŸ“ Example 2: Overfitting Issues" ) symptoms2

[ "Training loss decreases but validation loss increases after epoch 2" ] display_diagnostic_results(symptoms2)

Example

3 : No learning print ( " \n \n πŸ“ Example 3: Model Not Learning" ) symptoms3

[ "Loss stays constant at 4.5 for 100 steps" ] display_diagnostic_results(symptoms3)

Create an interactive diagnostic function def interactive_diagnosis():

"""Interactive training issue diagnosis."""

print ( " \n \n πŸ€– Interactive Training Diagnostics" )

print ( "="

70 )

print ( "Describe your training issues (separate multiple symptoms with ';'):" )

print ( "Example: 'loss goes to inf; cuda out of memory'" )

print ( " \n Common symptoms to describe:" )

print ( " β€’ Loss behavior: explosion, plateau, oscillation" )

print ( " β€’ Memory issues: OOM, GPU memory errors" )

print ( " β€’ Performance: slow training, low GPU usage" )

print ( " β€’ Generalization: overfitting, poor validation" )

# In a notebook, you would use input() # For demo, we'll show how it would work

print ( " \n [In a notebook, you would enter your symptoms here]" )

Show the interactive prompt interactive_diagnosis()

Additional diagnostic utilities print ( " \n \n πŸ”§ Additional Diagnostic Utilities" ) print ( "="

70 ) def check_model_health(model, sample_batch

None ):

""" Perform basic health checks on a model. """ health_report

{

"total_params" : sum(p.numel() for p in model.parameters()),

"trainable_params" : sum(p.numel() for p in model.parameters() if p.requires_grad),

"frozen_params" : sum(p.numel() for p in model.parameters() if not p.requires_grad),

"has_nan_params" : any (torch.isnan(p).any() for p in model.parameters()),

"has_inf_params" : any (torch.isinf(p).any() for p in model.parameters()), }

print ( " \n πŸ“Š Model Health Report:" )

print (f " Total parameters: {health_report['total_params']:,}" )

print (f " Trainable parameters: {health_report['trainable_params']:,}" )

print (f " Frozen parameters: {health_report['frozen_params']:,}" )

print (f " Contains NaN: {'⚠️ YES' if health_report['has_nan_params'] else 'βœ… NO'}" )

print (f " Contains Inf: {'⚠️ YES' if health_report['has_inf_params'] else 'βœ… NO'}" )

if health_report['has_nan_params'] or health_report['has_inf_params']:

print ( " \n ⚠️ WARNING: Model contains NaN or Inf values!" )

print ( " This will cause training to fail. Reinitialize the model." )

return health_report

Example usage (would work with actual model) print ( " \n Example model health check output:" ) print ( "(In practice, you would pass your actual model)" ) mock_health

{

"total_params" : 125_000_000 ,

"trainable_params" : 125_000_000 ,

"frozen_params" : 0 ,

"has_nan_params" : False ,

"has_inf_params" : False } print ( " \n πŸ“Š Model Health Report:" ) print (f " Total parameters: {mock_health['total_params']:,}" ) print (f " Trainable parameters: {mock_health['trainable_params']:,}" ) print (f " Frozen parameters: {mock_health['frozen_params']:,}" ) print (f " Contains NaN: {'⚠️ YES' if mock_health['has_nan_params'] else 'βœ… NO'}" ) print (f " Contains Inf: {'⚠️ YES' if mock_health['has_inf_params'] else 'βœ… NO'}" ) print ( " \n \n πŸ’‘ Pro Tips for Debugging Training Issues:" ) print ( "="

70 ) print ( "1. Always start with a tiny subset of data (10-100 examples)" ) print ( "2. Print shapes and values at each step when debugging" ) print ( "3. Use torch.autograd.set_detect_anomaly(True) in development" ) print ( "4. Monitor GPU memory with: watch -n 1 nvidia-smi" ) print ( "5. Save checkpoints frequently to recover from crashes" ) print ( "6. Keep a training log with all hyperparameters" ) print ( "7. Use wandb or tensorboard for real-time monitoring" )


Training

Diagnostics

Tool

Demo

πŸ“ Example

1 : Memory

Issues πŸ” Training

Issue

Diagnosis

🎯 Issue

1 : OOM

πŸ“‹ Possible

Causes : β€’ Batch size too large β€’ Model too large β€’ Memory leak β€’ Gradient accumulation πŸ’‘ Recommended

Solutions :

Reduce batch size ( try

1 or 2 )

Enable gradient accumulation

Use gradient checkpointing

Clear

cache : torch. cuda . empty_cache ()

Use mixed precision (fp16)

Use parameter-efficient methods ( LoRA )

Enable

CPU offloading πŸ“ Code

Fixes :

`python training_args.per_device_train_batch_size = 1 `

`python training_args.gradient_accumulation_steps = 8 `

`python model.gradient_checkpointing_enable() `

`python training_args.fp16 = True `

`python torch.cuda.empty_cache() ` πŸ“ Example

2 : Overfitting

Issues πŸ” Training

Issue

Diagnosis

❌ No matching issues found for the given symptoms. πŸ’‘ Try describing symptoms using terms like :

'loss goes to inf' or 'nan loss'

'loss stays constant' or 'no improvement'

'cuda out of memory' or 'oom error'

'train loss decreases but val loss increases' πŸ“ Example

3 : Model

Not

Learning πŸ” Training

Issue

Diagnosis

🎯 Issue

1 : NO

LEARNING

πŸ“‹ Possible

Causes : β€’ Learning rate too low β€’ Dead neurons/vanishing gradients β€’ Data loading issues β€’ Incorrect loss function πŸ’‘ Recommended

Solutions :

Increase learning rate ( try

2e-4 )

Check

if model outputs are changing

Verify data loading and preprocessing

Try different initialization

Check

if labels are correct

Ensure optimizer is stepping πŸ“ Code

Fixes :

`python training_args.learning_rate = 2e-4 `

`python model.apply(model._init_weights) `

`python print(next(model.parameters()).grad) # Check gradients `

`python optimizer.zero_grad() # Ensure gradients reset ` πŸ€– Interactive

Training

Diagnostics

Describe your training issues (separate multiple symptoms with

';' ): Example : 'loss goes to inf; cuda out of memory' Common symptoms to describe : β€’ Loss

behavior : explosion, plateau, oscillation β€’ Memory

issues : OOM , GPU memory errors β€’ Performance : slow training, low GPU usage β€’ Generalization : overfitting, poor validation [ In a notebook, you would enter your symptoms here] πŸ”§ Additional

Diagnostic

Utilities

Example model health check output : ( In practice, you would pass your actual model) πŸ“Š Model

Health

Report :

Total

parameters : 125 , 000 , 000

Trainable

parameters : 125 , 000 , 000

Frozen

parameters : 0

Contains

NaN : βœ… NO

Contains

Inf : βœ… NO πŸ’‘ Pro

Tips

for

Debugging

Training

Issues :

Always start with a tiny subset of data ( 10

100 examples) 2.

Print shapes and values at each step when debugging 3.

Use torch. autograd . set_detect_anomaly ( True ) in development 4.

Monitor

GPU memory with : watch -n 1 nvidia-smi 5.

Save checkpoints frequently to recover from crashes 6.

Keep a training log with all hyperparameters 7.

Use wandb or tensorboard for real-time monitoring


1. **Be Specific About Symptoms**: Instead of β€œtraining failed”, describe exactly what happened: β€œloss went to inf at step 50”
2. **Check Multiple Indicators**: Look at loss curves, GPU memory, gradient norms, and validation metrics together
3. **Apply Fixes Incrementally**: Don’t change everything at once β€” apply one fix, test, then proceed
4. **Document What Works**: Keep notes on which solutions work for your specific model and dataset
5. **Prevention is Better**: Use the recommended settings from the start to avoid common issues

1. Observe symptoms

symptoms = [ "loss spikes after 100 steps" , "validation loss increasing" ]

2. Run diagnosis

display_diagnostic_results(symptoms)

3. Apply recommended fixes one by one

training_args.max_grad_norm = 1.0

Start with gradient clipping

4. Monitor and iterate

If issue persists, try next recommendation


![Hugging Face: Building Custom Language Models: From Raw Data to Production AI](/writing/hugging-face-building-custom-language-models-from-raw-data-to-production-ai/images/img-014.png)

- **Loss β†’ ∞/NaN** Lower LR, clip gradients `training_args.learning_rate = 1e-5`<br>`training_args.max_grad_norm = 1.0`
- **Loss Constant** Higher LR, check data `training_args.learning_rate = 2e-4`<br>`print(next(iter(dataloader)))`
- **Overfitting** Add regularization `training_args.weight_decay = 0.01`<br>`model.dropout = 0.2`
- **OOM Error** Reduce batch size `training_args.per_device_train_batch_size = 1`<br>`training_args.gradient_accumulation_steps = 8`
- **Slow Training** Mixed precision, more workers `training_args.fp16 = True`<br>`training_args.dataloader_num_workers = 4`

![Hugging Face: Building Custom Language Models: From Raw Data to Production AI](/writing/hugging-face-building-custom-language-models-from-raw-data-to-production-ai/images/img-015.png)

def

check_model_health ( model, sample_batch= None ):

""" Perform basic health checks on a model. """

import torch health_report = {

"total_params" : sum (p.numel() for p in model.parameters()),

"trainable_params" : sum (p.numel() for p in model.parameters() if p.requires_grad),

"frozen_params" : sum (p.numel() for p in model.parameters() if

not p.requires_grad),

"has_nan_params" : any (torch.isnan(p). any () for p in model.parameters()),

"has_inf_params" : any (torch.isinf(p). any () for p in model.parameters()), }

print ( "\nπŸ“Š Model Health Report:" )

print ( f" Total parameters: {health_report[ 'total_params' ]:,} " )

print ( f" Trainable parameters: {health_report[ 'trainable_params' ]:,} " )

print ( f" Frozen parameters: {health_report[ 'frozen_params' ]:,} " )

print ( f" Contains NaN: { '⚠️ YES'

if health_report[ 'has_nan_params' ] else

'βœ… NO' } " )

print ( f" Contains Inf: { '⚠️ YES'

if health_report[ 'has_inf_params' ] else

'βœ… NO' } " )

if health_report[ 'has_nan_params' ] or health_report[ 'has_inf_params' ]:

print ( "\n⚠️ WARNING: Model contains NaN or Inf values!" )

print ( " This will cause training to fail. Reinitialize the model." )

return health_report

Usage: check_model_health(your_model)


1. **Always start with a tiny subset of data (10–100 examples)**
2. **Print shapes and values at each step when debugging**
3. **Use torch.autograd.set\_detect\_anomaly(True) in development**
4. **Monitor GPU memory with: watch -n 1 nvidia-smi**
5. **Save checkpoints frequently to recover from crashes**
6. **Keep a training log with all hyperparameters**
7. **Use wandb or tensorboard for real-time monitoring**

- **OOM Errors** CUDA out of memory Reduce batch size, enable gradient accumulation, use mixed precision
- **Tokenizer Mismatch** Unexpected tokens, errors Verify vocab\_size matches, check special tokens alignment
- **Learning Rate Issues** Loss explosion or no progress Use warmup, try different schedulers, start with 2e-5
- **Data Leakage** Unrealistic high performance Ensure train/val/test splits are clean, check for duplicates
- **Checkpoint Bloat** Disk space issues Save only best models, delete intermediate checkpoints
- **Version Conflicts** Import errors, API issues Use accelerate>=0.26.0, check transformers compatibility

![Hugging Face: Building Custom Language Models: From Raw Data to Production AI](/writing/hugging-face-building-custom-language-models-from-raw-data-to-production-ai/images/img-016.png)

- Starting with conservative hyperparameters
- Monitoring metrics closely
- Making incremental changes
- Being patient and systematic

Error Analysis through Generation

from transformers import pipeline

Use the model from previous cells

Note: using the saved references from training setup

Create text generation pipeline

generator = pipeline(

"text-generation" , model=trained_model,

Using the model from previous cell

tokenizer=trained_tokenizer,

Using the tokenizer from previous cell

device= 0

if get_device() == "cuda"

else

1 )

Test prompts for medical domain

test_prompts = [

"The patient presented with" ,

"Diagnosis confirmed" ,

"Treatment included" ,

"Post-operative care" ] print ( "Model Generation Examples:" ) print ( "=" * 60 ) for prompt in test_prompts: output = generator( prompt, max_length= 50 , num_return_sequences= 1 , temperature= 0.8 , pad_token_id=trained_tokenizer.eos_token_id, do_sample= True )

generated_text = output[ 0 ][ 'generated_text' ]

print ( f"\nPrompt: {prompt} " )

print ( f"Generated: {generated_text} " ) print ( "\n" + "=" * 60 ) print ( "\nError Analysis Checklist:" ) print ( "βœ“ Check for repetition or loops" ) print ( "βœ“ Verify domain terminology usage" ) print ( "βœ“ Look for coherence and relevance" ) print ( "βœ“ Identify any inappropriate content" ) print ( "βœ“ Note areas needing more training data" ) print ( "\nNote: This uses a pre-trained model without fine-tuning," ) print ( "so outputs may not be domain-specific." )


Model Generation Examples:

Prompt: The patient presented with Generated: The patient presented with a red/fibromatoma with a severe bleeding (Fibromatoma, with a severe bleeding, with a severe bleeding, with a severe bleeding, with a severe bleeding). He was treated in a hospital emergency Prompt: Diagnosis confirmed Generated: Diagnosis confirmed that the disease was present in both the primary and secondary organs of the stomach and lower abdomen. In the secondary and secondary organs, it was noted that the tumors and gastrointestinal tract were large, and the secondary organs had no detectable path Prompt: Treatment included Generated: Treatment included an antidepressant, but this was not successful,

and the effects on the liver were not reported. Prompt: Post-operative care Generated: Post-operative care is a form of government-funded healthcare. However, it has not quite been widely seen as the preferred method of

healthcare in the US.

Error Analysis Checklist: βœ“ Check for repetition or loops βœ“ Verify domain terminology usage βœ“ Look for coherence and relevance βœ“ Identify any inappropriate content βœ“ Note areas needing more training data Note: This uses a pre-trained model without fine-tuning, so outputs may not be domain-specific.


import re from datasets import load_dataset def

clean_text ( example ): example[ 'text' ] = re.sub( r'<.*?>' , '' , example[ 'text' ])

Remove HTML

example[ 'text' ] = re.sub( r'\s+' , ' ' , example[ 'text' ])

Normalize whitespace

return example dataset = load_dataset( 'wikipedia' , '20240101.en' , split= 'train' , streaming= True ) cleaned_dataset = dataset. map (clean_text)


from datasets import load_dataset dataset = load_dataset( 'wikipedia' , '20240101.en' , split= 'train' , streaming= True ) for i, example in

enumerate (dataset):

print (example[ 'text' ][: 100 ])

Show first 100 characters

if i >= 2 :

break


from transformers import AutoConfig, AutoModelForCausalLM config = AutoConfig.from_pretrained( "meta-llama/Llama-2-7b-hf" ) model = AutoModelForCausalLM.from_config(config)

For most tasks, load from pre-trained weights:

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")


Few-Shot Learning Example

from transformers import pipeline

Create text generation pipeline

generator = pipeline( "text-generation" , model= "gpt2" , device=- 1 )

Medical diagnosis few-shot prompt

few_shot_prompt = """Classify medical conditions based on symptoms: Symptoms: Chest pain, shortness of breath, sweating Condition: Myocardial infarction Symptoms: Frequent urination, excessive thirst, fatigue Condition: Diabetes mellitus Symptoms: Severe headache, stiff neck, sensitivity to light Condition: Meningitis Symptoms: Persistent cough, fever, difficulty breathing Condition:""" output = generator( few_shot_prompt, max_new_tokens= 40 , temperature= 0.3 , pad_token_id=generator.tokenizer.eos_token_id ) print ( "Few-Shot Learning Example:" ) print ( "=" * 60 ) print (few_shot_prompt) print ( "\nModel prediction:" , output[ 0 ][ 'generated_text' ][ len (few_shot_prompt):].strip())


Few-Shot Learning Example:

Classify medical conditions based on symptoms: Symptoms:

Chest

pain,

shortness

of

breath,

sweating Condition:

Myocardial

infarction Symptoms:

Frequent

urination,

excessive

thirst,

fatigue Condition:

Diabetes

mellitus Symptoms:

Severe

headache,

stiff

neck,

sensitivity

to

light Condition:

Meningitis Symptoms:

Persistent

cough,

fever,

difficulty

breathing Condition: Model prediction:

Heart

failure Symptoms:

Increased

blood

pressure,

chest

pain,

and

difficulty

breathing Condition:

Blood

clots,

swelling

of

the

face,

and

a

burning

sensation

in

the

chest Sym


- Use 3–5 high-quality examples for optimal performance
- Ensure examples are diverse and representative
- Format consistently across all examples
- Include edge cases in your examples
- Test with different orderings of examples

Chain of Thought Example

cot_prompt = """Diagnose step by step: Patient: 45-year-old male with chest pain and shortness of breath Analysis: Let me evaluate step by step:

  1. Key symptoms: chest pain + shortness of breath
  2. These are cardinal symptoms of cardiac issues
  3. Age (45) puts patient in risk category
  4. Most likely: Acute coronary syndrome
  5. Immediate actions: ECG, cardiac enzymes, oxygen Conclusion: Possible myocardial infarction, requires immediate cardiac evaluation. Patient: 28-year-old female with severe headache, fever, and neck stiffness Analysis: Let me evaluate step by step:""" output = generator( cot_prompt, max_new_tokens= 400 , temperature= 0.3 , pad_token_id=generator.tokenizer.eos_token_id ) print ( "Chain of Thought Reasoning:" ) print ( "="

60 ) print (cot_prompt) print ( "\nModel's analysis:" , output[ 0 ][ 'generated_text' ][ len (cot_prompt):])


Chain of Thought Reasoning:

Diagnose step

by

step : Patient:

45 -year-old male with chest pain and shortness of breath Analysis:

Let

me evaluate step

by

step : 1 . Key symptoms: chest pain + shortness of breath 2 . These are cardinal symptoms of cardiac issues 3 . Age ( 45 ) puts patient in risk category 4 . Most likely: Acute coronary syndrome 5 . Immediate actions: ECG, cardiac enzymes, oxygen Conclusion: Possible myocardial infarction, requires immediate cardiac evaluation. Patient:

28 -year-old female with severe headache, fever, and neck stiffness Analysis:

Let

me evaluate step

by

step : Model 's analysis: 1 . Key symptoms: headache + shortness of breath 2 . These are cardinal symptoms of cardiac issues 3 . Age ( 45 ) puts patient in risk category 4 . Most likely: Acute coronary syndrome 5 . Immediate actions: ECG, cardiac enzymes, oxygen Conclusion: Possible myocardial infarction, requires immediate cardiac evaluation. Patient:

28 -year-old female with severe headache, fever, and neck stiffness Analysis:

Let

me evaluate step

by

step : 1 . Key symptoms: headache + shortness of breath 2 . These are cardinal symptoms of cardiac issues 3 . Age ( 45 ) puts patient in risk category 4 . Most likely: Acute coronary syndrome 5 . Immediate actions: ECG, cardiac enzymes, oxygen Conclusion: Possible myocardial infarction, requires immediate cardiac evaluation.


- βœ… Improved reasoning on complex problems
- βœ… More interpretable model decisions
- βœ… Better handling of multi-step tasks
- βœ… Reduced errors through explicit reasoning
- βœ… Easier debugging of model logic

- Advanced text cleaning with Unicode normalization
- Multi-language detection and filtering
- Production-grade PII redaction with comprehensive patterns
- Bias detection and mitigation strategies
- Synthetic data generation for augmentation

- Trained medical BPE tokenizers with 20–50% efficiency gains
- Compared performance across different tokenization approaches
- Optimized vocabulary for domain-specific terminology

- Configured models from scratch with modern APIs
- Implemented parameter-efficient fine-tuning with LoRA/QLoRA
- Achieved 99%+ parameter reduction while maintaining performance

- Set up comprehensive training monitoring
- Implemented advanced diagnostic tools for troubleshooting
- Applied systematic approaches to resolve training issues

- Applied few-shot learning for rapid task adaptation
- Used chain of thought reasoning for complex problem-solving

- **Custom Tokenization** 20–50% token reduction More context, faster inference
- **LoRA Fine-tuning** 99%+ parameter reduction Dramatically lower memory usage
- **Mixed Precision** 2x training speedup Reduced training time
- **Diagnostic Tools** Hours saved debugging Faster issue resolution
- **PII Redaction** 95%+ accuracy Production-ready privacy

1. **Scale Up**: Apply these techniques to larger datasets and models
2. **Domain Specialization**: Fine-tune for your specific use case
3. **Production Deployment**: Use Hugging Face Hub for model sharing
4. **Continuous Learning**: Keep models updated with new data
5. **Responsible AI**: Implement comprehensive bias detection and mitigation

- βœ… **Data quality drives results** β€” Invest heavily in curation
- βœ… **Domain-specific tokenizers provide significant efficiency gains**
- βœ… **Parameter-efficient fine-tuning enables training on consumer hardware**
- βœ… **Systematic diagnostics save hours of debugging time**
- βœ… **Advanced techniques like few-shot learning accelerate deployment**
- βœ… **Privacy protection is mandatory, not optional**
- βœ… **Iterate and refine based on real-world performance**

- **OOM Errors** CUDA out of memory Reduce batch size, enable gradient accumulation, use mixed precision
- **Tokenizer Mismatch** Unexpected tokens, errors Verify vocab\_size matches, check special tokens alignment
- **Learning Rate Issues** Loss explosion or no progress Use warmup, try different schedulers, start with 2e-5
- **Data Leakage** Unrealistic high performance Ensure train/val/test splits are clean, check for duplicates
- **Checkpoint Bloat** Disk space issues Save only best models, delete intermediate checkpoints
- **Version Conflicts** Import errors, API issues Use accelerate>=0.26.0, check transformers compatibility

![Hugging Face: Building Custom Language Models: From Raw Data to Production AI](/writing/hugging-face-building-custom-language-models-from-raw-data-to-production-ai/images/img-017.png)

- βœ… Print model and data shapes before training
- βœ… Test with a tiny subset first (10–100 examples)
- βœ… Monitor GPU memory with `nvidia-smi -l 1`
- βœ… Use gradient clipping for stability
- βœ… Enable anomaly detection in development: `torch.autograd.set_detect_anomaly(True)`

![Hugging Face: Building Custom Language Models: From Raw Data to Production AI](/writing/hugging-face-building-custom-language-models-from-raw-data-to-production-ai/images/img-018.png)

![Hugging Face: Building Custom Language Models: From Raw Data to Production AI](/writing/hugging-face-building-custom-language-models-from-raw-data-to-production-ai/images/img-019.png)

![Hugging Face: Building Custom Language Models: From Raw Data to Production AI](/writing/hugging-face-building-custom-language-models-from-raw-data-to-production-ai/images/img-020.png)

- **Tokenization** Splitting text into model-ready pieces (tokens)
- **Streaming** Loading data in batches instead of all at once
- **Checkpointing** Saving model progress during training
- **Early Stopping** Halting training when improvement stalls
- **PEFT** Parameter-Efficient Fine-Tuning techniques
- **LoRA** Low-Rank Adaptation for efficient fine-tuning
- **QLoRA** Quantized LoRA for even lower memory usage
- **Perplexity** Measure of how well a model predicts text
- **Human-in-the-Loop** Involving people in labeling or reviewing data
- **PII** Personally Identifiable Information
- **Bias Mitigation** Techniques to reduce unfair model behavior
- **Synthetic Data** Artificially generated training examples
- **Data Versioning** Tracking changes to datasets over time
- **Mixed Precision** Using FP16/BF16 for faster training
- **Gradient Accumulation** Simulating larger batches on limited memory
- **Few-Shot Learning** Learning from just a few examples
- **Chain of Thought** Step-by-step reasoning prompting technique

Run all examples

task run

Run specific topic examples

task run-prompt-engineering

Prompt engineering techniques

task run-few-shot-learning

Few-shot learning examples

task run-chain-of-thought

Chain of thought reasoning

task run-data-curation

Data curation workflows

task run-tokenization

Custom tokenizer training

task run-model-configuration

Model configuration examples

task run-training

Training workflow demonstrations

task run-constitutional-ai

Constitutional AI examples

Development commands

task test

Run all tests

task format

Format code with Black and Ruff

task clean

Clean up generated files

Jupyter notebooks (fully tested and working)

task notebook

Open tutorial notebook - all cells execute properly

task notebook-lab

Open with JupyterLab - corpus issue fixed


> *βœ… Notebook Status: This notebook has been thoroughly tested and is fully functional. All cells execute properly from start to finish, with:*

- Complete environment configuration and library management
- All dependencies handled gracefully with appropriate fallbacks
- Clear instructions and comprehensive error handling throughout
- Advanced visualizations and diagnostic tools
- Ready for immediate use in learning and experimentation

Open with Jupyter Notebook

jupyter notebook notebooks/tutorial.ipynb

Or open with JupyterLab

jupyter lab notebooks/tutorial.ipynb

Or use the task commands

task notebook

Opens with Jupyter Notebook

task notebook-lab

Opens with JupyterLab


Check current versions

python --version

Should be 3.12.9

poetry show transformers accelerate datasets

Update to compatible versions

poetry add transformers@^4. 39.0 accelerate@^0. 26.0 datasets@^2. 14.0


This is expected - bitsandbytes doesn't support Metal/MPS

The code will automatically fall back to CPU mode

For Apple Silicon, models will use MPS when available


export BITSANDBYTES_NOWELCOME=1


1. **Reduce batch size**: Start with `per_device_train_batch_size = 1`
2. **Use gradient accumulation**: `gradient_accumulation_steps = 8`
3. **Enable mixed precision**: `fp16 = True` or `bf16 = True`
4. **Use parameter-efficient methods**: LoRA/QLoRA instead of full fine-tuning
5. **Enable gradient checkpointing**: `model.gradient_checkpointing_enable()`

Reduce corpus size for testing

medical_corpus = load_medical_corpus(max_samples=1000)

Use smaller vocabulary

custom_tokenizer = train_medical_tokenizer(medical_corpus, vocab_size=5000)

Enable progress monitoring

trainer = trainers.BpeTrainer( vocab_size=vocab_size, special_tokens=special_tokens, min_frequency=2, show_progress=True

Shows training progress

)


1. **Restart kernel**: Kernel β†’ Restart & Clear Output
2. **Check environment**: Ensure you’re using the correct Python kernel
3. **Install missing packages**: `!pip install package_name` in a cell
4. **Clear outputs**: Cell β†’ All Output β†’ Clear

try :

Try to load real dataset

dataset = load_dataset( "wikipedia" , "20240101.en" , split= "train" , streaming= True ) except Exception as e:

print ( f"Could not load dataset: {e} " )

print ( "Falling back to synthetic examples..." )

Uses comprehensive synthetic data instead


1. **Enable mixed precision**: Doubles training speed
2. **Use multiple workers**: `dataloader_num_workers = 4`
3. **Pin memory**: `dataloader_pin_memory = True`
4. **Profile your code**: Use `torch.profiler` to identify bottlenecks
5. **Use compiled models**: `model = torch.compile(model)` (PyTorch 2.0+)

try :

from fairlearn.metrics import MetricFrame fairlearn_available = True except ImportError: fairlearn_available = False

print ( "⚠️ Fairlearn not installed. Using mock example." )

print ( " Install with: pip install fairlearn scikit-learn" )


1. Check the notebook for working examples
2. Refer to the diagnostic tool in the training section
3. Review the model health check utility
4. Ensure all version requirements are met

- **Data Curation Fundamentals**: Selecting, cleaning, and preparing domain-specific text data
- **Scalable Processing Techniques**: Handling massive datasets efficiently with streaming and batching
- **Privacy Protection**: Comprehensive PII redaction and data security practices
- **Bias Detection & Mitigation**: Ensuring fair and ethical AI development
- **Custom Tokenizer Training**: Building domain-specific vocabularies for improved efficiency
- **Modern Model Architectures**: Configuration and selection strategies
- **Parameter-Efficient Fine-Tuning**: PEFT methods including LoRA and QLoRA
- **Training Workflows**: Distributed computing, experiment tracking, and monitoring
- **Advanced Techniques**: Few-shot learning, chain of thought reasoning, and synthetic data generation

- Python 3.12 (managed via pyenv)
- Poetry for dependency management
- Go Task for build automation
- API keys for any required services (see .env.example)

git clone [email protected]:RichardHightower/art_hug_11.git


task setup


. β”œβ”€β”€ src/ β”‚ β”œβ”€β”€ init.py β”‚ β”œβ”€β”€ config.py

Configuration utilities with device selection and API key validation

β”‚ β”œβ”€β”€ main.py

Entry point with orchestrated examples

β”‚ β”œβ”€β”€ prompt_engineering.py

Advanced prompt engineering techniques

β”‚ β”œβ”€β”€ few_shot_learning.py

Few-shot learning implementations

β”‚ β”œβ”€β”€ chain_of_thought.py

Chain of thought reasoning examples

β”‚ β”œβ”€β”€ constitutional_ai.py

Constitutional AI implementations

β”‚ └── utils.py

Shared utility functions

β”œβ”€β”€ notebooks/ β”‚ β”œβ”€β”€ tutorial.ipynb

Comprehensive tutorial covering all aspects

β”‚ β”œβ”€β”€ medical_tokenizer.json

Trained domain-specific tokenizer

β”‚ └── custom_tokenizer.json

Additional tokenizer artifacts

β”œβ”€β”€ docs/ β”‚ └── article11.md

Comprehensive guide on dataset curation (1000+ lines)

β”œβ”€β”€ tests/ β”‚ └── test_examples.py

Unit tests for all implementations

β”œβ”€β”€ data/

Data directory (created during execution)

β”œβ”€β”€ models/

Model artifacts directory (created during execution)

β”œβ”€β”€ .env.example

Environment template with API key examples

β”œβ”€β”€ Taskfile.yml

Cross-platform task automation

β”œβ”€β”€ pyproject.toml

Poetry configuration with locked dependencies

└── README.md

This file


1. Set up the environment

task setup


task run # 2. Run all examples

3. Start Jupyter

jupyter lab notebooks/tutorial.ipynb


- Complete data curation pipeline
- Custom tokenizer training
- Model configuration and PEFT methods
- Training workflows and monitoring
- Advanced techniques and best practices

task run


task run-prompt-engineering

Advanced prompt engineering techniques

task run-few-shot-learning

Few-shot learning implementations

task run-chain-of-thought

Chain of thought reasoning examples


- `task setup` - Set up Python 3.12.9 environment and install dependencies via Poetry
- `task run` - Run all examples from src/main.py
- `task run-prompt-engineering` - Run prompt engineering examples only
- `task run-few-shot-learning` - Run few shot learning examples only
- `task run-chain-of-thought` - Run chain of thought examples only
- `task test` - Run all tests with pytest
- `task format` - Format code with Black (line-length: 88) and Ruff
- `task clean` - Clean up generated files and caches

- **Comprehensive PII Redaction**: Multiple approaches from basic regex to transformer-based detection
- **Custom Tokenizer Training**: Domain-specific vocabularies with BPE and statistical analysis
- **Parameter-Efficient Fine-Tuning**: LoRA, QLoRA, and other PEFT implementations
- **Training Diagnostics**: Automated issue detection and resolution recommendations
- **Data Versioning**: Track dataset changes for reproducible ML workflows

- **Interactive Visualizations**: Training progress, tokenizer comparisons, pipeline diagrams
- **Best Practices**: Production deployment patterns and common pitfall avoidance
- **Comprehensive Examples**: Real-world scenarios with medical, financial, and technical domains
- **Performance Analysis**: Memory usage, cost comparisons, and efficiency metrics

- **Python 3.12.9** (managed via pyenv)
- **Hugging Face Transformers 4.53.3** for model implementations
- **Datasets 2.14.4** for efficient data processing
- **PEFT Library** for parameter-efficient fine-tuning
- **PyTorch** for deep learning operations
- **Poetry** for modern Python dependency management
- **Task** for cross-platform build automation
- **Jupyter Lab** for interactive development

- **Repository**: <https://github.com/RichardHightower/art_hug_11>
- [Hugging Face Documentation](https://huggingface.co/docs)
- [Transformers Library](https://github.com/huggingface/transformers)
- [PEFT Documentation](https://huggingface.co/docs/peft)
- [Datasets Documentation](https://huggingface.co/docs/datasets)

- Bug fixes and improvements
- Additional examples and techniques
- Documentation enhancements
- Performance optimizations

1. [Transformers and the AI Revolution](/@richardhightower/transformers-and-the-ai-revolution-the-role-of-hugging-face-f185f574b91b)
2. [Why Language is Hard for AI](/@richardhightower/why-language-is-hard-for-ai-and-how-transformers-changed-everything-d8a1fa299f1e)
3. [Building Your AI Workspace](/@richardhightower/hands-on-with-hugging-face-building-your-ai-workspace-b23c7e9be3a7)
4. [Inside the Transformer Architecture](/@richardhightower/inside-the-transformer-architecture-and-attention-demystified-39b2c13130bd)
5. [Tokenization: Gateway to Understanding](/@richardhightower/tokenization-the-gateway-to-transformer-understanding-f5d4c7ac7a18)
6. [Prompt Engineering Fundamentals](/@richardhightower/prompt-engineering-fundamentals-unlocking-the-power-of-llms-367e35d2adaf)
7. [Extending Transformers Beyond Language](/@richardhightower/introduction-extending-transformers-beyond-language-c1f3daa92652)
8. [Customizing Pipelines and Data Workflows](/@richardhightower/customizing-pipelines-and-data-workflows-advanced-models-and-efficient-processing-1ba9fabdca9a)
9. [Semantic Search and Embeddings](/@richardhightower/semantic-search-and-information-retrieval-with-transformers-rag-fundamentals-15f62073a95a) *(Article 9)*
10. [**Fine-Tuning: From Generic to Genius**](/@richardhightower/mastering-fine-tuning-a-hands-on-journey-from-generic-to-specialized-ai-7558fd413fd5) *(Article 10)*

![Hugging Face: Building Custom Language Models: From Raw Data to Production AI](/writing/hugging-face-building-custom-language-models-from-raw-data-to-production-ai/images/img-021.png)
#Hugging #Face #Building #Custom #Language #Models #Raw #Data