Top Related Projects
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
💫 Industrial-strength Natural Language Processing (NLP) in Python
A very simple framework for state-of-the-art Natural Language Processing (NLP)
Topic Modelling for Humans
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Models and examples built with TensorFlow
Quick Overview
The graykode/nlp-tutorial repository is a comprehensive collection of Natural Language Processing (NLP) tutorials implemented in PyTorch. It covers a wide range of NLP tasks and models, from basic text classification to advanced transformer architectures. The tutorials are designed to be easy to understand and implement, making it an excellent resource for both beginners and intermediate practitioners in the field of NLP.
Pros
- Covers a wide range of NLP topics and models
- Implementations are in PyTorch, a popular deep learning framework
- Code is well-organized and easy to follow
- Includes both basic and advanced NLP concepts
Cons
- Some tutorials may not be up-to-date with the latest advancements in NLP
- Limited explanations in some sections, which may be challenging for absolute beginners
- Lacks extensive documentation or accompanying blog posts for deeper understanding
- Some advanced topics might require additional background knowledge
Code Examples
- Basic text classification using CNN:
# vocab_size, embedding_dim, sequence_length, and num_classes are hyperparameters assumed to be defined elsewhere
class TextCNN(nn.Module):
    def __init__(self):
        super(TextCNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.conv = nn.Sequential(
            nn.Conv2d(1, 3, (3, embedding_dim)),         # 3 filters of height 3 over the embeddings
            nn.ReLU(),
            nn.MaxPool2d((sequence_length - 3 + 1, 1)),  # max-pool over the remaining positions
        )
        self.fc = nn.Linear(3, num_classes)

    def forward(self, X):
        # X: (batch_size, sequence_length) of token ids
        batch_size = X.shape[0]
        embedding_X = self.embedding(X)         # (batch_size, sequence_length, embedding_dim)
        embedding_X = embedding_X.unsqueeze(1)  # add a channel dimension for Conv2d
        conved = self.conv(embedding_X)         # (batch_size, 3, 1, 1)
        flatten = conved.view(batch_size, -1)   # (batch_size, 3)
        output = self.fc(flatten)               # (batch_size, num_classes)
        return output
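A minimal sketch of exercising the model on dummy data; the hyperparameter values below are illustrative only and not taken from the tutorial:
import torch

vocab_size, embedding_dim, sequence_length, num_classes = 100, 4, 6, 2

model = TextCNN()
dummy_batch = torch.randint(0, vocab_size, (8, sequence_length))  # 8 sequences of token ids
logits = model(dummy_batch)                                       # shape: (8, num_classes)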
- Implementing an attention mechanism:
# n_hidden is a hyperparameter assumed to be defined elsewhere; F is torch.nn.functional
class Attention(nn.Module):
    def __init__(self):
        super(Attention, self).__init__()
        self.attn = nn.Linear(n_hidden * 2, n_hidden)
        self.v = nn.Parameter(torch.randn(n_hidden))

    def forward(self, hidden, encoder_outputs):
        # hidden: (batch, 1, n_hidden); encoder_outputs: (batch, seq_len, n_hidden)
        seq_len = encoder_outputs.size(1)
        hidden = hidden.repeat(1, seq_len, 1)                                     # (batch, seq_len, n_hidden)
        energy = torch.tanh(self.attn(torch.cat([hidden, encoder_outputs], 2)))  # (batch, seq_len, n_hidden)
        attention = torch.sum(self.v * energy, dim=2)                             # (batch, seq_len)
        return F.softmax(attention, dim=1).unsqueeze(1)                           # (batch, 1, seq_len)
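The weights returned above can then be combined with the encoder outputs to form a context vector. A minimal sketch, assuming attention is an instance of the class above and decoder_hidden has shape (batch, 1, n_hidden):
attn_weights = attention(decoder_hidden, encoder_outputs)  # (batch, 1, seq_len)
context = torch.bmm(attn_weights, encoder_outputs)         # (batch, 1, n_hidden)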
- Transformer encoder implementation:
class TransformerEncoder(nn.Module):
    def __init__(self):
        super(TransformerEncoder, self).__init__()
        self.enc_self_attn = MultiHeadAttention()
        self.pos_ffn = PoswiseFeedForwardNet()

    def forward(self, enc_inputs, enc_self_attn_mask):
        enc_outputs, attn = self.enc_self_attn(enc_inputs, enc_inputs, enc_inputs, enc_self_attn_mask)
        enc_outputs = self.pos_ffn(enc_outputs)
        return enc_outputs, attn
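The enc_self_attn_mask argument is typically a padding mask. The helper below is an illustrative sketch (not part of the snippet above), assuming padding token id 0:
def get_attn_pad_mask(seq_q, seq_k, pad_id=0):
    # True where the key position is padding; broadcast across all query positions
    batch_size, len_q = seq_q.size()
    _, len_k = seq_k.size()
    pad_mask = seq_k.eq(pad_id).unsqueeze(1)           # (batch, 1, len_k)
    return pad_mask.expand(batch_size, len_q, len_k)   # (batch, len_q, len_k)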
Getting Started
To get started with the NLP tutorials:
1. Clone the repository:
git clone https://github.com/graykode/nlp-tutorial.git
2. Install the required dependencies:
pip install torch torchtext numpy
3. Navigate to the desired tutorial directory and run the Python script:
cd nlp-tutorial/1-1.NNLM
python NNLM.py
Make sure to have Python 3.6+ and PyTorch installed on your system before running the tutorials.
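If you want to confirm the environment first, a quick check such as the following (optional, not part of the repository) works:
import sys
import torch

print(sys.version)        # expect Python 3.6 or newer
print(torch.__version__)  # expect PyTorch 1.0 or newer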
Competitor Comparisons
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of transformers
- Comprehensive library with state-of-the-art models and architectures
- Extensive documentation and community support
- Regular updates and maintenance
Cons of transformers
- Steeper learning curve for beginners
- Larger codebase and dependencies
- May be overkill for simple NLP tasks
Code Comparison
nlp-tutorial:
class BERT(nn.Module):
    def __init__(self):
        super(BERT, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_pos, d_model)
        self.layers = nn.ModuleList([EncoderLayer() for _ in range(n_layers)])

    def forward(self, x):
        seq_len = x.size(1)
        pos = torch.arange(seq_len, dtype=torch.long)
        pos = pos.unsqueeze(0).expand_as(x)
        embedding = self.embedding(x) + self.pos_emb(pos)
        for layer in self.layers:
            embedding = layer(embedding)
        return embedding
transformers:
from transformers import BertModel, BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
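As a follow-up to the snippet above, the returned object exposes the final hidden states:
last_hidden_state = outputs.last_hidden_state
print(last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size), e.g. torch.Size([1, 8, 768])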
nlp-tutorial takes a hands-on approach to understanding NLP models and is better suited for learning, while transformers provides pre-trained models and easier integration, making it more suitable for production use.
💫 Industrial-strength Natural Language Processing (NLP) in Python
Pros of spaCy
- Production-ready, optimized library for industrial-strength NLP tasks
- Comprehensive documentation and extensive community support
- Offers pre-trained models and easy integration with deep learning frameworks
Cons of spaCy
- Steeper learning curve for beginners compared to simpler tutorials
- Less flexibility for customizing low-level NLP components
- Heavier resource requirements, especially for large language models
Code Comparison
nlp-tutorial:
import torch
import torch.nn as nn
class TextCNN(nn.Module):
    def __init__(self):
        super(TextCNN, self).__init__()
        self.conv = nn.Conv2d(1, 3, (3, word_vec_dim))
spaCy:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
for token in doc:
    print(token.text, token.pos_, token.dep_)
The nlp-tutorial repository provides basic implementations of various NLP models, making it ideal for learning and experimentation. In contrast, spaCy offers a more comprehensive, production-ready solution with pre-trained models and optimized performance. While nlp-tutorial allows for greater customization and understanding of model architectures, spaCy provides a higher-level API that simplifies many NLP tasks but may obscure some lower-level details.
A very simple framework for state-of-the-art Natural Language Processing (NLP)
Pros of flair
- Comprehensive NLP framework with pre-trained models
- Active development and community support
- Extensive documentation and examples
Cons of flair
- Steeper learning curve for beginners
- Larger codebase and dependencies
- May be overkill for simple NLP tasks
Code Comparison
nlp-tutorial:
import torch
import torch.nn as nn
class TextCNN(nn.Module):
    def __init__(self):
        super(TextCNN, self).__init__()
        self.conv = nn.Conv2d(1, 3, (3, 3))
flair:
from flair.data import Sentence
from flair.models import SequenceTagger
tagger = SequenceTagger.load('ner')
sentence = Sentence('John Doe is visiting New York.')
tagger.predict(sentence)
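After prediction, the tagged entities can be read back from the sentence; a short usage sketch assuming the flair objects above:
for entity in sentence.get_spans('ner'):
    print(entity)  # prints each recognized span with its label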
Summary
nlp-tutorial is a collection of simple NLP implementations, ideal for learning and understanding core concepts. It's lightweight and easy to follow but lacks advanced features.
flair is a full-fledged NLP library with state-of-the-art models and extensive functionality. It's more suitable for production use and complex NLP tasks but may be overwhelming for beginners.
Choose nlp-tutorial for educational purposes and quick prototypes, and flair for robust NLP applications and research projects.
Topic Modelling for Humans
Pros of gensim
- Comprehensive library for topic modeling, document indexing, and similarity retrieval
- Efficient implementation of popular algorithms like Word2Vec, Doc2Vec, and LDA
- Scalable and optimized for large datasets
Cons of gensim
- Steeper learning curve for beginners compared to nlp-tutorial
- Less focus on deep learning-based NLP techniques
- May require additional libraries for certain tasks
Code Comparison
nlp-tutorial:
import torch
import torch.nn as nn
class TextCNN(nn.Module):
    def __init__(self):
        super(TextCNN, self).__init__()
        self.conv = nn.Conv2d(1, 3, (3, 2))
gensim:
from gensim.models import Word2Vec
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, min_count=1)
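Continuing the snippet above, the trained vectors can be queried through model.wv:
vector = model.wv["cat"]                        # the learned embedding for "cat" (100-dimensional by default)
similar = model.wv.most_similar("cat", topn=1)  # nearest neighbor in the toy vocabulary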
The nlp-tutorial repository focuses on implementing various NLP models from scratch using PyTorch, providing a hands-on learning experience. In contrast, gensim offers pre-implemented, production-ready algorithms for various NLP tasks, emphasizing efficiency and scalability. While nlp-tutorial is excellent for understanding the inner workings of NLP models, gensim is more suitable for practical applications and large-scale projects.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Pros of fairseq
- More comprehensive and production-ready toolkit for sequence modeling
- Actively maintained by Facebook AI Research with frequent updates
- Supports a wider range of models and tasks, including machine translation and speech recognition
Cons of fairseq
- Steeper learning curve due to its complexity and extensive features
- Requires more computational resources for training and inference
- Less suitable for beginners or those looking for simple NLP implementations
Code Comparison
nlp-tutorial:
class TextCNN(nn.Module):
    def __init__(self):
        super(TextCNN, self).__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(1, num_filters, (size, embedding_size)) for size in filter_sizes
        ])
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)
fairseq:
class TransformerModel(FairseqEncoderDecoderModel):
    def __init__(self, args, encoder, decoder):
        super().__init__(encoder, decoder)
        self.args = args
        self.supports_align_args = True
The nlp-tutorial repository provides simpler, more focused implementations of NLP models, making it ideal for learning and understanding core concepts. In contrast, fairseq offers a more sophisticated and flexible framework for building and training advanced sequence models, but with increased complexity.
Models and examples built with TensorFlow
Pros of models
- Comprehensive collection of official TensorFlow models and examples
- Well-maintained with regular updates and contributions from the TensorFlow team
- Extensive documentation and integration with TensorFlow ecosystem
Cons of models
- Large and complex repository, potentially overwhelming for beginners
- Focuses primarily on TensorFlow, limiting flexibility for other frameworks
- May include more advanced models that require significant computational resources
Code Comparison
nlp-tutorial:
class TextCNN(nn.Module):
    def __init__(self):
        super(TextCNN, self).__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(1, num_filters, (size, embedding_size)) for size in filter_sizes
        ])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)
models:
class TextCNN(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, num_filters, filter_sizes, num_classes):
        super(TextCNN, self).__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.conv_layers = [tf.keras.layers.Conv1D(num_filters, fs, activation='relu') for fs in filter_sizes]
        self.dropout = tf.keras.layers.Dropout(0.5)
        self.fc = tf.keras.layers.Dense(num_classes, activation='softmax')
The nlp-tutorial example uses PyTorch, while models uses TensorFlow/Keras. The models implementation is more verbose but offers greater flexibility in defining model architecture.
README
nlp-tutorial
nlp-tutorial is a tutorial for those who are studying NLP (Natural Language Processing) using PyTorch. Most of the models are implemented in fewer than 100 lines of code (excluding comments and blank lines).
- [08-14-2020] Old TensorFlow v1 code is archived in the archive folder. For beginner readability, only PyTorch version 1.0 or higher is supported.
Curriculum - (Example Purpose)
1. Basic Embedding Model
- 1-1. NNLM(Neural Network Language Model) - Predict Next Word
- Paper - A Neural Probabilistic Language Model(2003)
- Colab - NNLM.ipynb
- 1-2. Word2Vec(Skip-gram) - Embedding Words and Show Graph
- 1-3. FastText(Application Level) - Sentence Classification
- Paper - Bag of Tricks for Efficient Text Classification(2016)
- Colab - FastText.ipynb
2. CNN(Convolutional Neural Network)
- 2-1. TextCNN - Binary Sentiment Classification
3. RNN(Recurrent Neural Network)
- 3-1. TextRNN - Predict Next Step
- Paper - Finding Structure in Time(1990)
- Colab - TextRNN.ipynb
- 3-2. TextLSTM - Autocomplete
- Paper - LONG SHORT-TERM MEMORY(1997)
- Colab - TextLSTM.ipynb
- 3-3. Bi-LSTM - Predict Next Word in Long Sentence
- Colab - Bi_LSTM.ipynb
4. Attention Mechanism
- 4-1. Seq2Seq - Change Word
- 4-2. Seq2Seq with Attention - Translate
- 4-3. Bi-LSTM with Attention - Binary Sentiment Classification
- Colab - Bi_LSTM(Attention).ipynb
5. Model based on Transformer
- 5-1. The Transformer - Translate
- 5-2. BERT - Classification Next Sentence & Predict Masked Tokens
Dependencies
- Python 3.5+
- PyTorch 1.0.0+
Author
- Tae Hwan Jung(Jeff Jung) @graykode
- Author Email : nlkey2022@gmail.com
- Acknowledgements to mojitok for the NLP Research Internship.