Building a GPT-2 from scratch: A journey through transformer architecture, distributed training, and production deployment
From-scratch GPT-2 implementation with Flash Attention
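One common way to get Flash Attention in a from-scratch GPT-2 is PyTorch's `F.scaled_dot_product_attention`, which dispatches to a fused Flash Attention kernel on supported GPUs. The sketch below shows that pattern for a causal self-attention block; the class name and GPT-2-small hyperparameters (`n_embd=768`, `n_head=12`) are illustrative, not taken from this repo.

```python
# Minimal causal self-attention sketch routed through PyTorch's
# scaled_dot_product_attention, which uses a Flash Attention kernel on supported GPUs.
# Class name and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd: int = 768, n_head: int = 12):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)   # fused q, k, v projection
        self.c_proj = nn.Linear(n_embd, n_embd)       # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # Flash Attention path: no explicit (T, T) attention matrix is materialized
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)
```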
Pretrained on 45B tokens from FineWeb using distributed training at ~1.2M tokens/sec for roughly 12 hours
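As a sanity check, 45B tokens at ~1.2M tokens/sec is about 45e9 / 1.2e6 ≈ 37,500 s ≈ 10.4 hours of pure training compute, which fits a ~12-hour wall-clock run once data loading, evaluation, and checkpointing overhead are presumably added. Below is a sketch of the kind of `torchrun`-launched DDP loop with gradient accumulation that reaches this sort of throughput; the toy model, random batches, and hyperparameters are stand-ins for the real GPT-2 and FineWeb dataloader, not the project's actual configuration.

```python
# Sketch of a distributed data-parallel (DDP) training loop, launched with
# `torchrun --nproc_per_node=<num_gpus> train.py`. ToyLM and the random batches
# stand in for the real GPT-2 module and FineWeb dataloader (assumptions).
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

vocab_size, block_size = 50257, 1024             # GPT-2 vocabulary and context length

class ToyLM(nn.Module):                           # stand-in for the full GPT-2 model
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 256)
        self.head = nn.Linear(256, vocab_size)

    def forward(self, idx, targets):
        logits = self.head(self.embed(idx))
        loss = nn.functional.cross_entropy(
            logits.view(-1, vocab_size), targets.view(-1)
        )
        return logits, loss

model = DDP(ToyLM().cuda(local_rank), device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.1)
grad_accum_steps, max_steps = 8, 10               # illustrative values

for step in range(max_steps):
    optimizer.zero_grad(set_to_none=True)
    for micro in range(grad_accum_steps):
        # random tokens stand in for a sharded FineWeb batch
        x = torch.randint(vocab_size, (8, block_size), device="cuda")
        y = torch.randint(vocab_size, (8, block_size), device="cuda")
        # only all-reduce gradients on the last micro-batch of each step
        model.require_backward_grad_sync = (micro == grad_accum_steps - 1)
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            _, loss = model(x, y)
        (loss / grad_accum_steps).backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

dist.destroy_process_group()
```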
Fine-tuned on 25M tokens from OpenAssistant/oasst1 with distributed training at ~1.2M tokens/sec
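OpenAssistant/oasst1 ships as message trees, so fine-tuning data is typically built by joining each assistant reply to its prompter parent. The sketch below does that with the Hugging Face `datasets` library; the chat template is an assumption, not necessarily the formatting used for this run.

```python
# Sketch of turning OpenAssistant/oasst1 message trees into prompt/response training
# pairs. The <|user|>/<|assistant|> template is an assumed formatting choice.
from datasets import load_dataset

ds = load_dataset("OpenAssistant/oasst1", split="train")

# Index messages by id so each assistant reply can be joined to its parent prompt.
by_id = {row["message_id"]: row for row in ds}

pairs = []
for row in ds:
    if row["role"] == "assistant" and row["parent_id"] in by_id:
        parent = by_id[row["parent_id"]]
        if parent["role"] == "prompter":
            pairs.append(
                f"<|user|>\n{parent['text']}\n<|assistant|>\n{row['text']}"
            )

print(len(pairs), "prompt/response pairs")
print(pairs[0][:300])
```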
Deployed with Hugging Face hosting and a ChromaDB RAG pipeline for context retrieval at inference time
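On the retrieval side, the basic pattern is to embed reference documents into a ChromaDB collection, query it with the user's question, and prepend the hits to the prompt before generation. A minimal sketch follows; the collection name, sample documents, and prompt format are placeholders, and the hosted app's actual wiring may differ.

```python
# Minimal sketch of the ChromaDB retrieval step that prepends relevant context to a
# prompt before generation. Documents, collection name, and prompt format are
# illustrative placeholders.
import chromadb

client = chromadb.Client()                        # in-memory; use PersistentClient for disk
collection = client.get_or_create_collection(name="context_docs")

# Index reference documents (ChromaDB embeds them with its default embedding function).
collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "GPT-2 is a decoder-only transformer language model.",
        "FineWeb is a large filtered web-text pretraining corpus.",
    ],
)

def build_prompt(question: str, n_results: int = 2) -> str:
    """Retrieve the most similar documents and prepend them to the user question."""
    hits = collection.query(query_texts=[question], n_results=n_results)
    context = "\n".join(hits["documents"][0])
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_prompt("What kind of model is GPT-2?")
# `prompt` would then be fed to the fine-tuned GPT-2 for generation.
print(prompt)
```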