
rasbt/LLMs-from-scratch

Implement a ChatGPT-like LLM in PyTorch from scratch, step by step


Top Related Projects

  • nanoGPT (38,629 stars): The simplest, fastest repository for training/finetuning medium-sized GPTs.
  • 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
  • GPT-2 (22,814 stars): Code for the paper "Language Models are Unsupervised Multitask Learners"
  • BERT (38,880 stars): TensorFlow code and pre-trained models for BERT
  • DeepSpeed (37,573 stars): DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
  • fairseq (30,829 stars): Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Quick Overview

"LLMs-from-scratch" is a GitHub repository by Sebastian Raschka that provides a comprehensive guide to building Large Language Models (LLMs) from scratch. It offers a step-by-step approach to understanding and implementing LLMs, covering various aspects from basic concepts to advanced techniques.

Pros

  • Detailed explanations and implementations of LLM concepts
  • Hands-on approach with practical code examples
  • Covers a wide range of topics, from basics to advanced techniques
  • Regularly updated with new content and improvements

Cons

  • May be challenging for absolute beginners in machine learning
  • Requires significant computational resources for some advanced models
  • Focuses primarily on PyTorch, which may not suit users of other frameworks
  • Some advanced topics might require additional background knowledge

Code Examples

  1. Tokenization example (using the Hugging Face tokenizers library):
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build a word-level tokenizer with an unknown-token fallback and whitespace splitting
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordLevelTrainer(special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])

# Train on a tiny in-memory corpus, then encode a sample sentence
tokenizer.train_from_iterator(["Hello world", "How are you"], trainer=trainer)
encoded = tokenizer.encode("Hello how are you")
print(encoded.tokens)  # lowercase "how" was never seen during training, so it maps to [UNK]
  2. Simple language model training (see the training-step sketch after these examples):
import torch
import torch.nn as nn

class SimpleLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        # x: (batch, seq_len) token IDs -> per-position logits over the vocabulary
        embedded = self.embedding(x)
        output, _ = self.rnn(embedded)
        return self.fc(output)

model = SimpleLanguageModel(vocab_size=1000, embedding_dim=128, hidden_dim=256)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
  3. Attention mechanism implementation:
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.attn = nn.Linear(hidden_dim * 2, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, hidden, encoder_outputs):
        # hidden: (batch, hidden_dim); encoder_outputs: (batch, src_len, hidden_dim)
        batch_size = encoder_outputs.shape[0]
        src_len = encoder_outputs.shape[1]

        # Score each encoder position against the current hidden state
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        attention = self.v(energy).squeeze(2)

        # Normalized attention weights over the source positions: (batch, src_len)
        return torch.softmax(attention, dim=1)

attention = Attention(hidden_dim=256)
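
To see how these pieces fit together, here is a minimal training-step sketch. It assumes the model, criterion, optimizer, and attention objects from examples 2 and 3 above are in scope, and the batch of random token IDs is purely illustrative:

# Dummy batch: 8 sequences of 17 token IDs; inputs and next-token targets
tokens = torch.randint(0, 1000, (8, 17))
x, y = tokens[:, :-1], tokens[:, 1:]

logits = model(x)                                          # (8, 16, 1000)
loss = criterion(logits.reshape(-1, 1000), y.reshape(-1))  # flatten positions for cross-entropy
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())

# The Attention module returns a (batch, src_len) weight distribution over source positions
weights = attention(torch.randn(8, 256), torch.randn(8, 10, 256))
print(weights.shape, weights.sum(dim=1))  # torch.Size([8, 10]); each row sums to 1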

Getting Started

To get started with the LLMs-from-scratch project:

  1. Clone the repository:

    git clone https://github.com/rasbt/LLMs-from-scratch.git
    cd LLMs-from-scratch
    
  2. Install the required dependencies:

    pip install -r requirements.txt
    
  3. Navigate through the notebooks in the repository, starting with the introductory ones and progressing to more advanced topics.

  4. Run the Jupyter notebooks to interact with the code and experiments:

    jupyter notebook
    

Competitor Comparisons

nanoGPT (38,629 stars): The simplest, fastest repository for training/finetuning medium-sized GPTs.

Pros of nanoGPT

  • More focused on performance and efficiency, suitable for production use
  • Implements advanced features like flash attention and rotary embeddings
  • Provides a complete training pipeline, including data preparation and model evaluation

Cons of nanoGPT

  • Less educational focus, may be harder for beginners to understand
  • Fewer explanatory comments and documentation compared to LLMs-from-scratch
  • More complex codebase with additional optimizations and features

Code Comparison

nanoGPT:

class LayerNorm(nn.Module):
    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, input):
        return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)

LLMs-from-scratch:

class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.alpha = nn.Parameter(torch.ones(d_model))
        self.bias = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.alpha * (x - mean) / (std + self.eps) + self.bias

Both implementations define a LayerNorm module. nanoGPT's version is more concise because it delegates the normalization to PyTorch's built-in F.layer_norm, whereas the LLMs-from-scratch version spells out the mean and standard-deviation computation for clarity.
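
As a quick check of that claim (my own sketch, not code from either repository), the two formulations can be applied to the same tensor. The results are close but not identical, because the manual version uses PyTorch's unbiased standard deviation and adds eps to the standard deviation rather than to the variance:

import torch
import torch.nn.functional as F

x = torch.randn(2, 4, 8)

# Built-in layer norm, as delegated to by nanoGPT's LayerNorm (unit weight, zero bias)
out_builtin = F.layer_norm(x, (8,), eps=1e-5)

# Manual formulation, following the from-scratch version above
mean = x.mean(-1, keepdim=True)
std = x.std(-1, keepdim=True)          # unbiased estimator by default
out_manual = (x - mean) / (std + 1e-5)

print((out_builtin - out_manual).abs().max())  # small, but not exactly zero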

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Pros of transformers

  • Comprehensive library with support for numerous pre-trained models
  • Extensive documentation and community support
  • Seamless integration with popular deep learning frameworks

Cons of transformers

  • Steeper learning curve for beginners
  • Larger codebase and dependencies
  • May be overkill for simple projects or educational purposes

Code comparison

LLMs-from-scratch:

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.d_k = d_model // num_heads
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)

transformers:

class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

GPT-2 (22,814 stars): Code for the paper "Language Models are Unsupervised Multitask Learners"

Pros of GPT-2

  • Fully trained and ready-to-use model
  • Extensive documentation and community support
  • Proven performance in various NLP tasks

Cons of GPT-2

  • Large model size, requiring significant computational resources
  • Less flexibility for customization and experimentation
  • Potential for misuse due to its powerful generation capabilities

Code Comparison

GPT-2:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

LLMs-from-scratch:

import torch
from scratch_llm import LLM

model = LLM(vocab_size=10000, d_model=512, nhead=8, num_layers=6)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

The GPT-2 repository provides a pre-trained model that can be easily loaded and used, while LLMs-from-scratch offers a more educational approach, allowing users to build and train language models from the ground up. GPT-2 is better suited for production use and advanced NLP tasks, whereas LLMs-from-scratch is ideal for learning and experimentation with model architectures and training processes.
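
As a concrete (and hedged) illustration of how the pre-trained GPT-2 model loaded above can be used, the following sketch generates a short continuation with the Hugging Face transformers API; the prompt and sampling parameters are illustrative choices, not values from either repository:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Encode a prompt, sample a continuation, and decode it back to text
inputs = tokenizer("Language models are", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))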

BERT (38,880 stars): TensorFlow code and pre-trained models for BERT

Pros of BERT

  • Highly optimized and production-ready implementation
  • Extensive documentation and pre-trained models available
  • Widely adopted in industry and research communities

Cons of BERT

  • More complex codebase, potentially harder for beginners to understand
  • Focused specifically on BERT architecture, less flexible for exploring other LLM types
  • Requires more computational resources to train and fine-tune

Code Comparison

BERT (TensorFlow):

import tensorflow as tf
from bert import modeling

input_ids = tf.placeholder(tf.int32, shape=[None, max_seq_length])
input_mask = tf.placeholder(tf.int32, shape=[None, max_seq_length])
segment_ids = tf.placeholder(tf.int32, shape=[None, max_seq_length])

model = modeling.BertModel(
    config=bert_config,
    is_training=is_training,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=use_one_hot_embeddings)

LLMs-from-scratch (PyTorch):

import torch
from model import GPT

model = GPT(
    vocab_size=vocab_size,
    n_embd=n_embd,
    n_head=n_head,
    n_layer=n_layer,
    block_size=block_size,
    dropout=dropout
)
logits, loss = model(x, y)

The BERT repository provides a more comprehensive implementation, while LLMs-from-scratch offers a simpler, educational approach to building language models from the ground up.

DeepSpeed (37,573 stars): DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Pros of DeepSpeed

  • Highly optimized for large-scale distributed training
  • Supports various optimization techniques like ZeRO, pipeline parallelism, and 3D parallelism
  • Integrates well with popular deep learning frameworks like PyTorch and Hugging Face

Cons of DeepSpeed

  • Steeper learning curve due to its complexity and advanced features
  • May be overkill for smaller projects or individual researchers
  • Requires more setup and configuration compared to simpler implementations

Code Comparison

LLMs-from-scratch:

class SimpleLLM(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.transformer = TransformerBlock(d_model)

DeepSpeed:

model = MyLargeModel()
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters(),
    config=ds_config
)

LLMs-from-scratch focuses on educational purposes, providing a clear and simple implementation of LLM components. It's ideal for learning and understanding the fundamentals of language models.

DeepSpeed, on the other hand, is designed for production-level, large-scale training of deep learning models. It offers advanced optimization techniques and distributed training capabilities, making it suitable for training massive language models efficiently across multiple GPUs or nodes.
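
To make the deepspeed.initialize call above more concrete, here is a minimal, illustrative sketch of what ds_config might contain. DeepSpeed accepts the configuration either as a dict like this or as a path to a JSON file, and the specific values below are assumptions rather than recommendations:

ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},           # mixed-precision training
    "zero_optimization": {"stage": 2},   # ZeRO stage 2: shard optimizer states and gradients
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 1e-4},
    },
}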

fairseq (30,829 stars): Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Pros of fairseq

  • More comprehensive and feature-rich, supporting a wide range of NLP tasks
  • Highly optimized for performance and scalability
  • Extensive documentation and community support

Cons of fairseq

  • Steeper learning curve due to its complexity
  • Requires more computational resources
  • Less focused on educational purposes or step-by-step implementation

Code Comparison

fairseq:

from fairseq.models.transformer import TransformerModel

model = TransformerModel.from_pretrained('/path/to/model', 'checkpoint.pt')
tokens = model.encode('Hello world!')
output = model.generate(tokens)

LLMs-from-scratch:

from llms_from_scratch import Transformer

model = Transformer(vocab_size=10000, d_model=512, nhead=8)
output = model.forward(input_ids)

fairseq offers a more production-ready approach with pre-trained models and high-level APIs, while LLMs-from-scratch provides a more educational, ground-up implementation. fairseq is better suited for large-scale projects and research, whereas LLMs-from-scratch is ideal for learning and understanding the fundamentals of language models.


README

Build a Large Language Model (From Scratch)

This repository contains the code for developing, pretraining, and finetuning a GPT-like LLM and is the official code repository for the book Build a Large Language Model (From Scratch).




In Build a Large Language Model (From Scratch), you'll learn and understand how large language models (LLMs) work from the inside out by coding them from the ground up, step by step. In this book, I'll guide you through creating your own LLM, explaining each stage with clear text, diagrams, and examples.

The method described in this book for training and developing your own small-but-functional model for educational purposes mirrors the approach used in creating large-scale foundational models such as those behind ChatGPT. In addition, this book includes code for loading the weights of larger pretrained models for finetuning.



To download a copy of this repository, click on the Download ZIP button or execute the following command in your terminal:

git clone --depth 1 https://github.com/rasbt/LLMs-from-scratch.git

(If you downloaded the code bundle from the Manning website, please consider visiting the official code repository on GitHub at https://github.com/rasbt/LLMs-from-scratch for the latest updates.)



Table of Contents

Please note that this README.md file is a Markdown (.md) file. If you have downloaded this code bundle from the Manning website and are viewing it on your local computer, I recommend using a Markdown editor or previewer for proper viewing. If you haven't installed a Markdown editor yet, MarkText is a good free option.

You can alternatively view this and other files on GitHub at https://github.com/rasbt/LLMs-from-scratch in your browser, which renders Markdown automatically.



Tip: If you're seeking guidance on installing Python and Python packages and setting up your code environment, I suggest reading the README.md file located in the setup directory.



CI status: code tests on Linux, Windows, and macOS.


Chapter Title | Main Code (for Quick Access) | All Code + Supplementary
Setup recommendations | - | -
Ch 1: Understanding Large Language Models | No code | -
Ch 2: Working with Text Data | ch02.ipynb, dataloader.ipynb (summary), exercise-solutions.ipynb | ./ch02
Ch 3: Coding Attention Mechanisms | ch03.ipynb, multihead-attention.ipynb (summary), exercise-solutions.ipynb | ./ch03
Ch 4: Implementing a GPT Model from Scratch | ch04.ipynb, gpt.py (summary), exercise-solutions.ipynb | ./ch04
Ch 5: Pretraining on Unlabeled Data | ch05.ipynb, gpt_train.py (summary), gpt_generate.py (summary), exercise-solutions.ipynb | ./ch05
Ch 6: Finetuning for Text Classification | ch06.ipynb, gpt_class_finetune.py, exercise-solutions.ipynb | ./ch06
Ch 7: Finetuning to Follow Instructions | ch07.ipynb, gpt_instruction_finetuning.py (summary), ollama_evaluate.py (summary), exercise-solutions.ipynb | ./ch07
Appendix A: Introduction to PyTorch | code-part1.ipynb, code-part2.ipynb, DDP-script.py, exercise-solutions.ipynb | ./appendix-A
Appendix B: References and Further Reading | No code | -
Appendix C: Exercise Solutions | No code | -
Appendix D: Adding Bells and Whistles to the Training Loop | appendix-D.ipynb | ./appendix-D
Appendix E: Parameter-efficient Finetuning with LoRA | appendix-E.ipynb | ./appendix-E

 

The mental model below summarizes the contents covered in this book.


 

Hardware Requirements

The code in the main chapters of this book is designed to run on conventional laptops within a reasonable timeframe and does not require specialized hardware. This approach ensures that a wide audience can engage with the material. Additionally, the code automatically utilizes GPUs if they are available. (Please see the setup doc for additional recommendations.)
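
A minimal sketch of the device-selection pattern this describes (not a verbatim excerpt from the repository): prefer a CUDA GPU, fall back to Apple-silicon MPS if present, and otherwise use the CPU.

import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model = torch.nn.Linear(10, 10).to(device)  # move any model (and its inputs) to the chosen device
print(f"Using device: {device}")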

 

Bonus Material

Several folders contain optional materials as a bonus for interested readers:


 

Questions, Feedback, and Contributing to This Repository

I welcome all sorts of feedback, best shared via the Manning Forum or GitHub Discussions. Likewise, if you have any questions or just want to bounce ideas off others, please don't hesitate to post these in the forum as well.

Please note that since this repository contains the code corresponding to a print book, I currently cannot accept contributions that would extend the contents of the main chapter code, as it would introduce deviations from the physical book. Keeping it consistent helps ensure a smooth experience for everyone.

 

Citation

If you find this book or code useful for your research, please consider citing it.

Chicago-style citation:

Raschka, Sebastian. Build A Large Language Model (From Scratch). Manning, 2024. ISBN: 978-1633437166.

BibTeX entry:

@book{build-llms-from-scratch-book,
  author       = {Sebastian Raschka},
  title        = {Build A Large Language Model (From Scratch)},
  publisher    = {Manning},
  year         = {2024},
  isbn         = {978-1633437166},
  url          = {https://www.manning.com/books/build-a-large-language-model-from-scratch},
  github       = {https://github.com/rasbt/LLMs-from-scratch}
}