Top Related Projects
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
TensorFlow code and pre-trained models for BERT
An open-source NLP research library, built on PyTorch.
Quick Overview
Megatron-LM is an open-source project by NVIDIA for training large language models efficiently on distributed GPU systems. It focuses on optimizing transformer-based models for scale, supporting various architectures like BERT, GPT, and T5. The library is designed to enable training of models with billions of parameters across multiple GPUs and nodes.
Pros
- Highly optimized for distributed training on NVIDIA GPUs
- Supports multiple model architectures (BERT, GPT, T5)
- Implements efficient parallelism techniques (data, model, and pipeline parallelism)
- Provides tools for efficient checkpointing and model loading
Cons
- Steep learning curve for users not familiar with distributed training
- Primarily focused on NVIDIA hardware, limiting its use on other platforms
- Requires significant computational resources for large-scale training
- Documentation can be sparse for some advanced features
Code Examples
- Initializing a GPT model (legacy Megatron-LM API; exact signatures vary between releases):
from megatron.initialize import initialize_megatron
from megatron import get_args
from megatron.model import GPTModel
initialize_megatron()  # parses command-line arguments and sets up the distributed state
args = get_args()
model = GPTModel(num_tokentypes=0, parallel_output=True)
- Setting up model-parallel and data-parallel process groups:
import torch
from megatron import mpu
torch.distributed.init_process_group(backend="nccl")  # must be initialized before creating groups
mpu.initialize_model_parallel(model_parallel_size=4)  # 4-way model parallelism
- Training loop with mixed precision (Apex AMP; a torch.cuda.amp equivalent follows below):
import torch
from apex import amp
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")  # O2: FP16 with FP32 master weights
for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
    with amp.scale_loss(loss, optimizer) as scaled_loss:  # scale the loss to avoid FP16 underflow
        scaled_loss.backward()
    optimizer.step()
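Apex AMP is in maintenance mode upstream; for reference, here is a minimal equivalent loop using PyTorch's built-in torch.cuda.amp. This is generic PyTorch rather than a Megatron-LM API, and it assumes the same model and dataloader as above.
import torch
scaler = torch.cuda.amp.GradScaler()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for batch in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(batch)
    scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)         # unscales gradients and skips the step on inf/nan
    scaler.update()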
Getting Started
To get started with Megatron-LM:
- Clone the repository:
  git clone https://github.com/NVIDIA/Megatron-LM.git
  cd Megatron-LM
- Install dependencies:
  pip install -r requirements.txt
- Set up your dataset and configuration file.
- Run the training script:
  python pretrain_gpt.py \
      --model-parallel-size 2 \
      --num-layers 24 \
      --hidden-size 1024 \
      --num-attention-heads 16 \
      --batch-size 4 \
      --seq-length 1024 \
      --max-position-embeddings 1024 \
      --train-iters 500000 \
      --lr-decay-iters 320000 \
      --save $CHECKPOINT_PATH \
      --load $CHECKPOINT_PATH \
      --data-path $DATA_PATH \
      --vocab-file gpt2-vocab.json \
      --merge-file gpt2-merges.txt \
      --data-impl mmap \
      --split 949,50,1 \
      --distributed-backend nccl \
      --lr 0.00015 \
      --lr-decay-style cosine \
      --min-lr 1.0e-5 \
      --weight-decay 1e-2 \
      --clip-grad 1.0 \
      --warmup .01 \
      --checkpoint-activations \
      --fp16
Competitor Comparisons
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Pros of DeepSpeed
- More flexible and adaptable to various deep learning frameworks
- Offers a wider range of optimization techniques beyond model parallelism
- Provides easier integration with existing codebases
Cons of DeepSpeed
- May require more setup and configuration for optimal performance
- Less specialized for transformer-based models compared to Megatron-LM
Code Comparison
Megatron-LM:
model = MegatronModule(
    num_layers=args.num_layers,
    hidden_size=args.hidden_size,
    num_attention_heads=args.num_attention_heads,
    vocab_size=args.vocab_size,
    max_position_embeddings=args.max_position_embeddings,
)
DeepSpeed:
model = MyModel(args)
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters()
)
DeepSpeed offers a more generalized approach, allowing users to define their own model architecture and then apply optimization techniques. Megatron-LM provides a more specialized implementation focused on transformer-based models, with built-in support for model parallelism.
Both libraries aim to improve training efficiency for large language models, but DeepSpeed offers a broader set of optimization techniques that can be applied to various deep learning tasks, while Megatron-LM is more focused on transformer architectures and model parallelism.
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
Pros of Transformers
- Extensive model support and easy-to-use API for various NLP tasks
- Active community and frequent updates with new models and features
- Seamless integration with popular deep learning frameworks like PyTorch and TensorFlow
Cons of Transformers
- Less optimized for large-scale distributed training compared to Megatron-LM
- May require more memory and computational resources for very large models
Code Comparison
Transformers:
from transformers import BertModel, BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
Megatron-LM:
import megatron
from megatron import get_args, get_tokenizer, get_model
args = get_args()
tokenizer = get_tokenizer()
model = get_model(args)
Summary
Transformers offers a user-friendly API with broad model support, making it ideal for various NLP tasks and experimentation. Megatron-LM, on the other hand, is optimized for large-scale distributed training of massive language models. While Transformers is more versatile and easier to use, Megatron-LM excels in scenarios requiring efficient training of extremely large models across multiple GPUs or nodes.
An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
Pros of gpt-neox
- More user-friendly and easier to set up for newcomers
- Includes additional features like wandb integration and custom tokenizers
- Actively maintained with frequent updates and community contributions
Cons of gpt-neox
- May have slightly lower performance compared to Megatron-LM in some scenarios
- Less extensive documentation and fewer examples for advanced use cases
Code Comparison
Megatron-LM initialization:
model = MegatronModule(
    init_method=init_method,
    output_layer_init_method=scaled_init_method,
    num_tokentypes=num_tokentypes,
    parallel_output=parallel_output)
gpt-neox initialization:
model = GPTNeoX(
    num_tokentypes=num_tokentypes,
    parallel_output=parallel_output,
    use_cache=use_cache,
    config=config)
Both repositories provide powerful tools for training large language models, but gpt-neox offers a more accessible approach for newcomers and includes additional features. Megatron-LM, on the other hand, may offer slightly better performance and more extensive documentation for advanced users. The code comparison shows similarities in model initialization, with gpt-neox using a more streamlined approach.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Pros of fairseq
- Broader support for various NLP tasks and architectures
- More extensive documentation and examples
- Active community and frequent updates
Cons of fairseq
- Less optimized for large-scale language models
- May require more setup and configuration for specific tasks
Code Comparison
fairseq:
from fairseq.models.transformer import TransformerModel
model = TransformerModel.from_pretrained('/path/to/model')
tokens = model.encode('Hello world!')
output = model.decode(tokens)
Megatron-LM:
from megatron import get_args, get_tokenizer, get_model
from megatron.initialize import initialize_megatron
args = get_args()
tokenizer = get_tokenizer()
model = get_model(args)
tokens = tokenizer.tokenize('Hello world!')
output = model.generate(tokens)
The code snippets demonstrate that fairseq offers a more straightforward API for loading and using pre-trained models, while Megatron-LM requires more setup and initialization steps. However, Megatron-LM's approach allows for greater customization and optimization for large-scale language models.
TensorFlow code and pre-trained models for BERT
Pros of BERT
- Simpler architecture and easier to understand for beginners
- Extensive documentation and community support
- Widely adopted and used in various NLP tasks
Cons of BERT
- Limited scalability for very large language models
- Less efficient for distributed training on multiple GPUs
Code Comparison
BERT:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
loss = outputs.loss
loss.backward()
Megatron-LM:
model = get_model(args)
output = model(tokens, labels, attention_mask)
loss = output['loss']
model.backward(loss)
Key Differences
- Megatron-LM is designed for training large language models with billions of parameters, while BERT is more suitable for smaller models.
- Megatron-LM offers advanced features for distributed training and model parallelism, which are not present in BERT.
- BERT provides pre-trained models and easy fine-tuning capabilities, making it more accessible for various NLP tasks.
- Megatron-LM focuses on performance and scalability, while BERT prioritizes ease of use and widespread adoption.
An open-source NLP research library, built on PyTorch.
Pros of AllenNLP
- More comprehensive and feature-rich NLP toolkit
- Easier to use for researchers and developers new to NLP
- Better documentation and tutorials
Cons of AllenNLP
- Less optimized for large-scale language model training
- May not scale as efficiently on multi-GPU systems
- Fewer options for advanced parallelism techniques
Code Comparison
AllenNLP:
from typing import Iterable

from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer

class MyDatasetReader(DatasetReader):
    def _read(self, file_path: str) -> Iterable[Instance]:
        with open(file_path, "r") as file:
            for line in file:
                yield self.text_to_instance(line.strip())
Megatron-LM:
from megatron import get_args
from megatron import print_rank_0
from megatron import get_tokenizer
from megatron.data.dataset_utils import build_train_valid_test_datasets
args = get_args()
tokenizer = get_tokenizer()
train_dataset, valid_dataset, test_dataset = build_train_valid_test_datasets(
    data_prefix, data_impl, splits_string,
    train_valid_test_num_samples,
    seq_length, seed, skip_warmup)
README
Megatron-LM & Megatron Core
GPU-optimized library for training transformer models at scale
⚡ Quick Start
# 1. Install Megatron Core with required dependencies
pip install megatron-core
pip install --no-build-isolation transformer-engine[pytorch]
# 2. Clone repository for examples
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
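A quick, hedged sanity check after installation (hypothetical check.py; it only uses the standard importlib.metadata API, nothing Megatron-specific):
# check.py - verify that the pip installation is importable
from importlib.metadata import version
import megatron.core  # the import fails here if the installation is broken
print("megatron-core", version("megatron-core"))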
→ Complete Installation Guide - Docker, pip variants (dev, lts, etc.), source installation, and system requirements
Latest News
- NEW! Megatron Bridge - Bidirectional converter for interoperability between Hugging Face and Megatron checkpoints, featuring production-ready recipes for popular models.
- MoE Q3-Q4 2025 Roadmap - Comprehensive roadmap for MoE features including DeepSeek-V3, Qwen3, advanced parallelism strategies, FP8 optimizations, and Blackwell performance enhancements.
- GPT-OSS Implementation - Advanced features including YaRN RoPE scaling, attention sinks, and custom activation functions are being integrated into Megatron Core.
- [2025/06] Megatron MoE Model Zoo - Best practices and optimized configurations for training DeepSeek-V3, Mixtral, and Qwen3 MoE models with performance benchmarking and checkpoint conversion tools.
- [2025/05] Megatron Core v0.11.0 brings new capabilities for multi-data center LLM training (blog).
Previous News
- [2024/07] Megatron Core v0.7 improves scalability and training resiliency and adds support for multimodal training (blog).
- [2024/06] Megatron Core added support for Mamba-based models. Check out our paper An Empirical Study of Mamba-based Language Models and code example.
- [2024/01 Announcement] NVIDIA has released the core capabilities in Megatron-LM into Megatron Core in this repository. Megatron Core expands upon Megatron-LM's GPU-optimized techniques with more cutting-edge innovations on system-level optimizations, featuring composable and modular APIs. Explore the Megatron Core overview below for more details.
Table of Contents
Getting Started
Core Features
Training
Resources
- Examples - Training scripts and tutorials
- Documentation - Official docs
- Roadmaps - Development roadmaps and feature tracking
- Community & Support - Get help and contribute
Megatron Overview
Project Structure
Megatron-LM/
├── megatron/
│   ├── core/                    # Megatron Core (kernels, parallelism, building blocks)
│   │   ├── models/              # Transformer models
│   │   ├── transformer/         # Transformer building blocks
│   │   ├── tensor_parallel/     # Tensor parallelism
│   │   ├── pipeline_parallel/   # Pipeline parallelism
│   │   ├── distributed/         # Distributed training (FSDP, DDP)
│   │   ├── optimizer/           # Optimizers
│   │   ├── datasets/            # Dataset loaders
│   │   ├── inference/           # Inference engines
│   │   └── export/              # Model export (e.g. TensorRT-LLM)
│   ├── training/                # Training scripts
│   ├── inference/               # Inference server
│   ├── legacy/                  # Legacy components
│   └── post_training/           # Post-training (RLHF, etc.)
├── examples/                    # Ready-to-use training examples
├── tools/                       # Utility tools
├── tests/                       # Comprehensive test suite
└── docs/                        # Documentation
Megatron-LM: Reference Implementation
Reference implementation that includes Megatron Core plus everything needed to train models.
Best for:
- Training state-of-the-art foundation models at scale with cutting-edge performance on latest NVIDIA hardware
- Research teams exploring new architectures and training techniques
- Learning distributed training concepts and best practices
- Quick experimentation with proven model configurations
What you get:
- Pre-configured training scripts for GPT, Llama, DeepSeek, Qwen, and more
- End-to-end examples from data prep to evaluation
- Research-focused tools and utilities
Megatron Core: Composable Library
Composable library with GPU-optimized building blocks for custom training frameworks.
Best for:
- Framework developers building on top of modular and optimized components
- Research teams needing custom training loops, optimizers, or data pipelines
- ML engineers requiring fault-tolerant training pipelines
What you get (a minimal usage sketch follows this list):
- Composable transformer building blocks (attention, MLP, etc.)
- Advanced parallelism strategies (TP, PP, DP, EP, CP)
- Pipeline schedules and distributed optimizers
- Mixed precision support (FP16, BF16, FP8)
- GPU-optimized kernels and memory management
- High-performance dataloaders and dataset utilities
- Model architectures (LLaMA, Qwen, GPT, Mixtral, Mamba, etc.)
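To make these building blocks concrete, here is a minimal sketch loosely based on examples/run_simple_mcore_train_loop.py from this repository: it builds a tiny GPT model from a TransformerConfig and a layer spec. Module paths and signatures can shift between Megatron Core releases, so treat it as illustrative rather than canonical.
from megatron.core import parallel_state
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec
from megatron.core.models.gpt.gpt_model import GPTModel
from megatron.core.transformer.transformer_config import TransformerConfig

# Assumes torch.distributed is already initialized (e.g. the script was launched with torchrun).
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=1,
    pipeline_model_parallel_size=1,
)

config = TransformerConfig(
    num_layers=2,
    hidden_size=128,
    num_attention_heads=4,
    use_cpu_initialization=True,
)

model = GPTModel(
    config=config,
    transformer_layer_spec=get_gpt_layer_local_spec(),
    vocab_size=1024,
    max_sequence_length=64,
)
The full example script then drives training with the forward-backward schedule from megatron.core.pipeline_parallel.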
Ecosystem Libraries
Libraries used by Megatron Core:
- Megatron Energon (NEW!) - Multi-modal data loader (text, images, video, audio) with distributed loading and dataset blending
- Transformer Engine - Optimized kernels and FP8 mixed precision support
- Resiliency Extension (NVRx) - Fault tolerant training with failure detection and recovery
Libraries using Megatron Core:
- Megatron Bridge - Training library with bidirectional Hugging Face ↔ Megatron checkpoint conversion, flexible training loops, and production-ready recipes
- NeMo RL - Scalable toolkit for efficient reinforcement learning with RLHF, DPO, and other post-training methods
- NeMo Framework - Enterprise framework with cloud-native support and end-to-end examples
- TensorRT Model Optimizer (ModelOpt) - Model optimization toolkit for quantization, pruning, and distillation
Compatible with: Hugging Face Accelerate, Colossal-AI, DeepSpeed
Installation
🐳 Docker (Recommended)
We strongly recommend using the previous release of the PyTorch NGC Container rather than the latest one, for optimal compatibility with Megatron Core releases and testing. Our releases are always based on the previous month's NGC container, so this ensures compatibility and stability.
This container comes with all dependencies pre-installed with compatible versions and optimized configurations for NVIDIA GPUs:
- PyTorch (latest stable version)
- CUDA, cuDNN, NCCL (latest stable versions)
- Support for FP8 on NVIDIA Hopper, Ada, and Blackwell GPUs
- For best performance, use the NVIDIA Turing GPU architecture generation or later
# Run container with mounted directories
docker run --runtime=nvidia --gpus all -it --rm \
-v /path/to/megatron:/workspace/megatron \
-v /path/to/dataset:/workspace/dataset \
-v /path/to/checkpoints:/workspace/checkpoints \
nvcr.io/nvidia/pytorch:25.04-py3
Pip Installation
Megatron Core offers support for two NGC PyTorch containers:
- dev: Moving head that supports the most recent upstream dependencies
- lts: Long-term support of NGC PyTorch 24.01
Both containers can be combined with mlm, which adds the package dependencies for Megatron-LM on top of Megatron Core.
# Install the latest release with the dev dependencies (tracks recent upstream packages)
pip install megatron-core[dev]
# Install packages for LTS support NGC PyTorch 24.01
pip install megatron-core[lts]
For a version of Megatron Core with only torch, run:
pip install megatron-core
For dependencies required by Megatron-LM, please run:
pip install megatron-core[mlm]
Source Installation
For development or latest features:
For hybrid models, Megatron Core requires mamba. If the pre-built wheel on PyPI does not fit your environment, you can fall back to the install script Megatron Core uses in its CI system. For this, please install uv first:
export UV_VERSION=0.7.2
export PATH="$HOME/.local/bin:$PATH"
curl -LsSf https://astral.sh/uv/${UV_VERSION}/install.sh | sh
export UV_PROJECT_ENVIRONMENT=./venv
export PATH="$UV_PROJECT_ENVIRONMENT/bin:$PATH"
export UV_LINK_MODE=copy
Run the following command to build upstream dependencies from source:
# Clone and install
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
# Optional: checkout specific release
git checkout core_r0.13.0
bash docker/common/install.sh --environment {dev,lts}
System Requirements
Hardware Requirements
- FP8 Support: NVIDIA Hopper, Ada, Blackwell GPUs
- Recommended: NVIDIA Turing architecture or later
Software Requirements
- CUDA/cuDNN/NCCL: Latest stable versions
- PyTorch: Latest stable version
- Transformer Engine: Latest stable version
- Python: 3.12 recommended
Performance Benchmarking
For our latest performance benchmarking results, please refer to NVIDIA NeMo Framework Performance Summary.
Our codebase efficiently trains models from 2B to 462B parameters across thousands of GPUs, achieving up to 47% Model FLOP Utilization (MFU) on H100 clusters.
Benchmark Configuration:
- Vocabulary size: 131,072 tokens
- Sequence length: 4096 tokens
- Model scaling: Varied hidden size, attention heads, and layers to achieve target parameter counts
- Communication optimizations: Fine-grained overlapping with DP (--overlap-grad-reduce, --overlap-param-gather), TP (--tp-comm-overlap), and PP (enabled by default)
Key Results:
- 6144 H100 GPUs: Successfully benchmarked 462B parameter model training
- Superlinear scaling: MFU increases from 41% to 47-48% with model size
- End-to-end measurement: Throughputs include all operations (data loading, optimizer steps, communication, logging)
- Production ready: Full training pipeline with checkpointing and fault tolerance
- Note: Performance results measured without training to convergence
Weak Scaling Results
Our weak scaled results show superlinear scaling (MFU increases from 41% for the smallest model considered to 47-48% for the largest models); this is because larger GEMMs have higher arithmetic intensity and are consequently more efficient to execute.
Strong Scaling Results
We also strong scaled the standard GPT-3 model (our version has slightly more than 175 billion parameters due to larger vocabulary size) from 96 H100 GPUs to 4608 GPUs, using the same batch size of 1152 sequences throughout. Communication becomes more exposed at larger scale, leading to a reduction in MFU from 47% to 42%.
Training
Getting Started
Simple Training Example
# Distributed training example (2 GPUs, mock data)
torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py
Llama-3 Training Example
# 8 GPUs, FP8 precision, mock data
./examples/llama/train_llama3_8b_fp8.sh
Data Preparation
JSONL Data Format
{"text": "Your training text here..."}
{"text": "Another training sample..."}
Basic Preprocessing
python tools/preprocess_data.py \
--input data.jsonl \
--output-prefix processed_data \
--tokenizer-type HuggingFaceTokenizer \
--tokenizer-model /path/to/tokenizer.model \
--workers 8 \
--append-eod
Key Arguments
- --input: Path to the input JSON/JSONL file
- --output-prefix: Prefix for the output binary files (.bin and .idx)
- --tokenizer-type: Tokenizer type (HuggingFaceTokenizer, GPT2BPETokenizer, etc.)
- --tokenizer-model: Path to the tokenizer model file
- --workers: Number of parallel workers for processing
- --append-eod: Add an end-of-document token
Parallelism Strategies
Data Parallelism (DP)
Standard Data Parallel
# Standard DDP - replicate model on each GPU
torchrun --nproc_per_node=8 pretrain_gpt.py \
--data-parallel-sharding-strategy no_shard
Fully Sharded Data Parallel (FSDP)
# Megatron's optimized FSDP (~15% faster than PyTorch FSDP2)
--use-custom-fsdp
# PyTorch FSDP2
--use-torch-fsdp2
# Sharding strategies
--data-parallel-sharding-strategy optim # Shard optimizer states (ZeRO-1)
--data-parallel-sharding-strategy optim_grads # Shard gradients + optimizer (ZeRO-2)
--data-parallel-sharding-strategy optim_grads_params # Shard parameters + gradients + optimizer (ZeRO-3)
Tensor Parallelism (TP)
Split individual model layers across GPUs (a conceptual illustration follows the flags below):
--tensor-model-parallel-size 4 # 4-way tensor parallelism
--sequence-parallel # Enable sequence parallelism (recommended with TP)
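As a conceptual, plain-PyTorch illustration of the idea behind a column-parallel linear layer (this is not the Megatron Core implementation, which shards the weights across ranks and performs the gather with optimized communication):
import torch

tp = 4                                   # pretend tensor-parallel size
x = torch.randn(2, 1024)                 # (batch, hidden)
full_weight = torch.randn(4096, 1024)    # (out_features, in_features)

# Each "rank" holds one shard of the output columns and computes a partial result.
shards = full_weight.chunk(tp, dim=0)
partial_outputs = [x @ w.t() for w in shards]

# Concatenating the partials (an all-gather across ranks in practice) matches the unsplit layer.
y = torch.cat(partial_outputs, dim=-1)
assert torch.allclose(y, x @ full_weight.t(), atol=1e-3)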
Pipeline Parallelism (PP)
Split model depth across GPUs (a toy illustration follows the flags below):
--pipeline-model-parallel-size 8 # 8 pipeline stages
--virtual-pipeline-model-parallel-size 4 # Virtual pipeline for better load balancing
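Similarly, a toy single-process illustration of splitting model depth into stages (the real Megatron Core schedules place each stage on its own rank and overlap micro-batches between stages):
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(8)])
num_stages = 4
per_stage = len(layers) // num_stages
stages = [nn.Sequential(*layers[i * per_stage:(i + 1) * per_stage]) for i in range(num_stages)]

x = torch.randn(2, 64)
for stage in stages:   # each stage would live on a different pipeline rank
    x = stage(x)       # activations are passed rank-to-rank between stages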
Context Parallelism (CP)
Split long sequences across GPUs for handling long contexts:
--context-parallel-size 2 # 2-way context parallelism
--cp-comm-type p2p # Communication: p2p, a2a, allgather, a2a+p2p
--hierarchical-context-parallel-sizes 2 4 # Hierarchical context parallelism
Expert Parallelism (EP)
For Mixture of Experts (MoE) models:
--expert-model-parallel-size 4 # 4-way expert parallelism
--num-experts 8 # 8 experts per MoE layer
--moe-grouped-gemm # Optimize expert computation
Combining Parallelism Strategies
Parallelism Selection Guide
Based on NVIDIA NeMo production configurations (a worked example of how these dimensions compose follows the table):
Model | Size | GPUs | TP | PP | CP | EP | Notes |
---|---|---|---|---|---|---|---|
Llama-3 | 8B | 8 | 1 | 1 | 2 | 1 | CP for long seqlen (8K) |
Llama-3 | 70B | 64 | 4 | 4 | 2 | 1 | TP+PP |
Llama-3.1 | 405B | 1024 | 8 | 8 | 2 | 1 | 3D parallelism for scale |
GPT-3 | 175B | 128-512 | 4 | 8 | 1 | 1 | Large model config |
Mixtral | 8x7B | 64 | 1 | 4 | 1 | 8 | EP for MoE |
Mixtral | 8x22B | 256 | 4 | 4 | 8 | 8 | Combined TP+EP for large MoE |
DeepSeek-V3 | 671B | 1024 | 2 | 16 | 1 | 64 | Large MoE config |
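As a worked example of how these dimensions compose: in Megatron the data-parallel size is derived as the world size divided by TP x PP x CP, and expert parallelism further partitions the data-parallel group for MoE layers. Illustrative arithmetic for the Llama-3.1 405B row above:
# 1024 GPUs with TP=8, PP=8, CP=2
tp, pp, cp, world_size = 8, 8, 2, 1024
dp = world_size // (tp * pp * cp)
print(dp)  # 8 -> the remaining 8-way factor is data parallelism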
MoE-Specific Requirements
Important: When combining Expert Parallelism (EP) with Tensor Parallelism (TP), Sequence Parallelism (SP) must be enabled.
Performance Optimizations
Feature | Flag | Benefit |
---|---|---|
FlashAttention | --attention-backend | Faster attention and lower memory usage |
FP8 Training | --fp8-hybrid | Faster training |
Activation Checkpointing | --recompute-activations | Reduced memory usage |
Data Parallelism Communication Overlap | --overlap-grad-reduce | Faster distributed training |
Distributed Optimizer | --use-distributed-optimizer | Reduced checkpointing time |
→ NVIDIA NeMo Framework Performance Tuning Guide - Comprehensive performance optimization guide covering advanced tuning techniques, communication overlaps, memory optimizations, and profiling options.
FlashAttention
FlashAttention is a fast and memory-efficient attention algorithm. We recommend the default usage, which uses cuDNN for attention via Transformer Engine and provides up to 50% speedups on forward and 84% on backward propagation with FP8 kernels. The flash-attn package is also supported via --use-flash-attn.
Mixed Precision Training
--fp16 # Standard FP16
--bf16 # BFloat16 (recommended for large models)
--fp8-hybrid # FP8 training (Hopper, Ada, and Blackwell GPUs)
Activation Checkpointing and Recomputation
# For limited memory
--recompute-activations
# For extreme memory constraints
--recompute-granularity full \
--recompute-method uniform
Data Parallelism Communication Overlap
--overlap-grad-reduce
--overlap-param-gather
Distributed Optimizer
--use-distributed-optimizer
Roadmaps
Stay up-to-date with our development roadmaps and planned features:
- MoE Q3-Q4 2025 Roadmap - Comprehensive MoE feature development including DeepSeek-V3, Qwen3, advanced parallelism, FP8 optimizations, and Blackwell enhancements
- GPT-OSS Implementation Tracker - Advanced features including YaRN RoPE scaling, attention sinks, and custom activation functions
More roadmap trackers will be added soon.
Community & Support
Getting Help
- Documentation - Official documentation
- Issues - Bug reports and feature requests
Contributing
We ❤️ contributions! Ways to contribute:
- Report bugs - Help us improve reliability
- Suggest features - Shape the future of Megatron Core
- Improve docs - Make Megatron Core more accessible
- Submit PRs - Contribute code improvements
Citation
@article{megatron-lm,
title={Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism},
author={Shoeybi, Mohammad and Patwary, Mostofa and Puri, Raul and LeGresley, Patrick and Casper, Jared and Catanzaro, Bryan},
journal={arXiv preprint arXiv:1909.08053},
year={2019}
}