
FischGPT Design Specs

Building a GPT-2 from scratch: A journey through transformer architecture, distributed training, and production deployment

Parameters: ~124M
Pretrain Tokens: 45B
SFT Tokens: 25M
Final Val Loss: 1.726
1. Architecture
2. Pretraining
3. Fine-tuning
4. Deployment

Core Architecture

Model Type: GPT-2 Style Decoder
Parameters: ~124M
Layers: 12
Hidden Size: 768

Attention Mechanism

Attention Heads: 12
Head Dimension: 64
Context Length: 1024 tokens
Implementation: Flash Attention

Technical Features

Vocabulary Size: 50,304
Tokenizer: GPT-2 BPE
Activation: GELU (tanh approximation)
Optimizer: AdamW
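
Taken together, the numbers above pin down a GPT-2-style configuration. A minimal sketch in Python (the class and field names follow common GPT-2 reimplementations and are illustrative, not necessarily FischGPT's exact code):

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Values come from the spec above; names are illustrative.
    block_size: int = 1024    # context length
    vocab_size: int = 50304   # GPT-2 BPE's 50,257 tokens padded up to a multiple of 128
    n_layer: int = 12         # transformer blocks
    n_head: int = 12          # attention heads (head dim = 768 / 12 = 64)
    n_embd: int = 768         # hidden size
```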

Implementation Highlights

Custom Components

CausalSelfAttention: Multi-head attention with causal masking
MLP: Feed-forward with GELU activation
Block: Pre-layer normalization
GPT: Complete model with tied embeddings

Advanced Features

Flash Attention: F.scaled_dot_product_attention
Custom weight initialization
Weight tying: Shared input/output embeddings
Professional separation of concerns
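
A condensed sketch of how these components typically fit together, reusing the GPTConfig fields from the sketch above: pre-layer-norm blocks, SDPA-backed causal attention, a tanh-GELU MLP, and tied input/output embeddings. This mirrors the listed component names but is not the project's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.n_head = cfg.n_head
        self.c_attn = nn.Linear(cfg.n_embd, 3 * cfg.n_embd)  # fused q, k, v projection
        self.c_proj = nn.Linear(cfg.n_embd, cfg.n_embd)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim) for scaled_dot_product_attention
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) for t in (q, k, v))
        # Flash Attention kernel via PyTorch SDPA, with causal masking
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class MLP(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.c_fc = nn.Linear(cfg.n_embd, 4 * cfg.n_embd)
        self.gelu = nn.GELU(approximate="tanh")  # GPT-2's tanh-approximate GELU
        self.c_proj = nn.Linear(4 * cfg.n_embd, cfg.n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))

class Block(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.ln_1 = nn.LayerNorm(cfg.n_embd)
        self.attn = CausalSelfAttention(cfg)
        self.ln_2 = nn.LayerNorm(cfg.n_embd)
        self.mlp = MLP(cfg)

    def forward(self, x):
        # pre-layer normalization: normalize before each sub-layer, add the residual after
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

class GPT(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.wte = nn.Embedding(cfg.vocab_size, cfg.n_embd)  # token embeddings
        self.wpe = nn.Embedding(cfg.block_size, cfg.n_embd)  # learned positional embeddings
        self.blocks = nn.ModuleList(Block(cfg) for _ in range(cfg.n_layer))
        self.ln_f = nn.LayerNorm(cfg.n_embd)
        self.lm_head = nn.Linear(cfg.n_embd, cfg.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight  # weight tying: shared input/output embeddings

    def forward(self, idx):
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.wte(idx) + self.wpe(pos)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))  # logits over the vocabulary
```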

Dataset: FineWeb

Tokens/Epoch: 10B
Epochs: 4.5
Steps: 80,000
Batch Size: 524,288 tokens

Training

HellaSwag: 33.3%
Val Loss: 2.9803
Framework: PyTorch DDP on 8x NVIDIA A100
Total Cost: $208.74 ($$$!!!)

Pretraining Configuration

Optimization

Max LR: 1.8e-3
Min LR: 1.8e-4
Warmup Steps: 750
Weight Decay: 0.1
Gradient Clipping: 1.0
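
These values suggest the usual linear-warmup-then-cosine-decay schedule; the cosine shape and the 80,000-step decay horizon are assumptions, since the spec only lists the endpoints:

```python
import math

MAX_LR, MIN_LR = 1.8e-3, 1.8e-4
WARMUP_STEPS, MAX_STEPS = 750, 80_000  # decay horizon assumed equal to the total step count

def get_lr(step: int) -> float:
    # linear warmup from 0 up to MAX_LR
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    # past the decay horizon, hold at MIN_LR
    if step >= MAX_STEPS:
        return MIN_LR
    # cosine decay from MAX_LR down to MIN_LR
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + coeff * (MAX_LR - MIN_LR)
```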

Training Dynamics

Micro Batch: 64
Mixed Precision: bfloat16
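
Combining the optimizer and precision settings, one inner training step might look like the following sketch. `model`, `get_lr`, and `MAX_LR` refer to the sketches above; the AdamW betas and the loop structure are assumptions, not values from the spec:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=MAX_LR, weight_decay=0.1,
                              betas=(0.9, 0.95))  # betas assumed, not stated in the spec

def train_step(step: int, x: torch.Tensor, y: torch.Tensor) -> float:
    optimizer.zero_grad(set_to_none=True)
    # mixed precision: run forward/backward math in bfloat16 on the A100s
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(x)                                    # (B, T, vocab)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping at 1.0
    for group in optimizer.param_groups:                     # warmup + decay schedule
        group["lr"] = get_lr(step)
    optimizer.step()
    return loss.item()
```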

Dataset: OASST1

Tokens: 25M
Epochs: 3
Steps: 20,000
Batch Size: 16,384

Training

Format: Conversational
Val Loss: 1.725750
Framework: PyTorch on NVIDIA A100
Base Model: Pretraining checkpoint

SFT Configuration

Optimization

Max LR: 8e-6
Min LR: 8e-7
Warmup Steps: 600
Special Tokens: <|user|> <|assistant|>
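
With only two chat markers, turning an OASST1 exchange into a training sequence is mostly string assembly plus tokenizer extension. A sketch of one plausible approach using tiktoken's GPT-2 encoding; the special-token IDs and the chat template are assumptions, not the exact FischGPT pipeline:

```python
import tiktoken

# GPT-2 BPE with the two chat markers registered as extra special tokens
base = tiktoken.get_encoding("gpt2")
enc = tiktoken.Encoding(
    name="gpt2_chat",
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    # IDs 50257/50258 sit inside the padded 50,304-entry vocabulary (assumed placement)
    special_tokens={**base._special_tokens, "<|user|>": 50257, "<|assistant|>": 50258},
)

def format_example(user_msg: str, assistant_msg: str) -> list[int]:
    # one conversational turn: user marker, user text, assistant marker, assistant text
    text = f"<|user|>{user_msg}<|assistant|>{assistant_msg}"
    return enc.encode(text, allowed_special={"<|user|>", "<|assistant|>"})

ids = format_example("What is FischGPT?", "A 124M-parameter GPT-2-style model trained from scratch.")
```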

Use Cases

Conversational AI
Code completion
Creative writing
Educational content

Deployment Stack

Repository: kristianfischerai12345/fischgpt-sft
Format: PyTorch + Safetensors
Backend: Express.js + ChromaDB RAG
Hardware: CPU inference

RAG Pipeline

Vector DB: ChromaDB
Embeddings: all-MiniLM-L6-v2
Retrieval: Top-5 similarity
Context: Pre-prompt injection
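
A minimal sketch of this retrieval step with the chromadb and sentence-transformers packages; the collection name, document corpus, and prompt template are placeholders, not FischGPT's actual values:

```python
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()                                      # in-memory client for illustration
collection = client.get_or_create_collection("fischgpt_docs")   # collection name is hypothetical

def add_documents(docs: list[str]) -> None:
    collection.add(
        ids=[f"doc-{i}" for i in range(len(docs))],
        documents=docs,
        embeddings=embedder.encode(docs).tolist(),
    )

def build_prompt(query: str) -> str:
    # top-5 similarity search, then pre-prompt injection of the retrieved context
    hits = collection.query(
        query_embeddings=embedder.encode([query]).tolist(),
        n_results=5,
    )
    context = "\n".join(hits["documents"][0])
    return f"{context}\n\n<|user|>{query}<|assistant|>"
```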

Performance Metrics

Max Context: 1024 tokens
Parameters: ~124M
Final Loss: 1.726
HellaSwag: 33.3%

Enhanced Capabilities

RAG-Augmented Responses

Contextual knowledge retrieval
Source citation capabilities
Domain-specific expertise
Reduced hallucination rate

Generation Pipeline

Query → Embed → Retrieve → Augment → Generate
Temperature: 0.8 (recommended)
Max context: 1024 tokens
Response length: Up to 400 tokens
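
On the serving side, a sketch of CPU generation from the published checkpoint with the recommended sampling settings; loading through transformers assumes the Hub repo ships GPT-2-compatible config and tokenizer files, which the spec does not state:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

REPO = "kristianfischerai12345/fischgpt-sft"
tokenizer = GPT2TokenizerFast.from_pretrained(REPO)
model = GPT2LMHeadModel.from_pretrained(REPO).eval()  # stays on CPU by default

def generate(prompt: str) -> str:
    # leave room for up to 400 generated tokens inside the 1024-token context window
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024 - 400)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            do_sample=True,
            temperature=0.8,      # recommended sampling temperature
            max_new_tokens=400,   # response length cap from the spec
            pad_token_id=tokenizer.eos_token_id,
        )
    # return only the newly generated continuation
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```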

Complete Technical Stack

Machine Learning

  • PyTorch for model training
  • Distributed Data Parallel (DDP)
  • Mixed precision training (bfloat16)
  • Custom tokenization pipeline
  • Gradient accumulation & clipping
  • Learning rate scheduling
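
The DDP setup referenced above is standard torchrun-style bootstrapping; a sketch assuming one process per GPU on the 8x A100 node, reusing GPT/GPTConfig from the architecture sketch:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# launched as: torchrun --standalone --nproc_per_node=8 train.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = GPT(GPTConfig()).cuda(local_rank)     # model/config from the architecture sketch above
model = DDP(model, device_ids=[local_rank])   # gradients are all-reduced across the 8 ranks

# ... training loop as in the pretraining sketch ...

dist.destroy_process_group()
```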

Infrastructure

  • Lambda Labs 8x A100 cluster
  • Hugging Face model hosting
  • Express.js + ChromaDB RAG API
  • Vector embeddings pipeline
  • Git LFS for model storage

RAG & Retrieval

  • ChromaDB vector database
  • all-MiniLM-L6-v2 embeddings
  • Semantic document chunking
  • Cosine similarity search

Frontend

  • Next.js 15 with React 19
  • TypeScript for type safety
  • Tailwind CSS v4 styling
  • shadcn/ui component library
  • Real-time chat interface