nlp-recipes

Natural Language Processing Best Practices & Examples

6,407

919


Top Related Projects

transformers

146,142

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

spaCy

31,840

💫 Industrial-strength Natural Language Processing (NLP) in Python

rasa

20,333

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

fairseq

31,373

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

allennlp

11,843

An open-source NLP research library, built on PyTorch.

flair

14,148

A very simple framework for state-of-the-art Natural Language Processing (NLP)

Quick Overview

Microsoft's NLP Recipes is a comprehensive repository of best practices and examples for Natural Language Processing (NLP) tasks. It provides a collection of Python notebooks and scripts covering various NLP scenarios, from text classification to question answering, aimed at both beginners and experienced practitioners.

Pros

Extensive collection of NLP techniques and use cases
Well-documented notebooks with step-by-step explanations
Integrates popular NLP libraries and frameworks
Regularly updated with new techniques and improvements

Cons

Some examples may require significant computational resources
Primarily focused on Python, limiting options for users of other languages
May overwhelm beginners due to the breadth of content
Some advanced topics might lack in-depth explanations

Code Examples

Text Classification using BERT:

from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits
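
The logits returned above are raw, unnormalized scores. As a minimal sketch of how they are usually turned into a predicted class and a confidence value, the snippet below applies a softmax in plain Python. The example logits are made-up numbers, not real model output, and the helper names are illustrative only.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return (index of the most probable class, its probability)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


# Hypothetical logits for one sentence and two classes (assumption).
example_logits = [-0.3, 1.2]
label, confidence = predict_label(example_logits)
print(label, round(confidence, 3))
```

With a real model, the same idea applies to each row of outputs.logits after converting it to a Python list.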

Named Entity Recognition using spaCy:

import spacy

nlp = spacy.load("en_core_web_sm")
text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
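
The loop above prints one entity per line. A common next step is to group the entities by label. The sketch below does this on hand-written tuples in the same (text, start_char, end_char, label) shape, so it runs without a spaCy model installed. The spans and labels are assumptions chosen for illustration, not guaranteed model output.

```python
from collections import defaultdict


def group_entities(entities):
    """Group (text, start, end, label) tuples by their label."""
    grouped = defaultdict(list)
    for ent_text, start, end, label in entities:
        grouped[label].append((ent_text, start, end))
    return dict(grouped)


text = "Apple is looking at buying U.K. startup for $1 billion"

# Hand-written spans (assumption); a real pipeline would read them from doc.ents.
entities = [
    ("Apple", 0, 5, "ORG"),
    ("U.K.", 27, 31, "GPE"),
    ("$1 billion", 44, 54, "MONEY"),
]

# Check that each character offset really points at the entity text.
for ent_text, start, end, _ in entities:
    assert text[start:end] == ent_text

grouped = group_entities(entities)
for label, spans in grouped.items():
    print(label, [span[0] for span in spans])
```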

Sentiment Analysis using NLTK:

from nltk.sentiment import SentimentIntensityAnalyzer
import nltk

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

text = "I love this product! It's amazing and works perfectly."
sentiment = sia.polarity_scores(text)
print(sentiment)
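
polarity_scores returns a dictionary with neg, neu, pos and compound keys. The sketch below maps the compound score to a coarse label using the conventional VADER cut-offs of +0.05 and -0.05. The score dictionary is a made-up example rather than real analyzer output, and the function name is illustrative.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical scores in the shape returned by polarity_scores (assumption).
example_scores = {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.85}
print(label_from_compound(example_scores["compound"]))
```

The threshold is a parameter because the right cut-off depends on the data; 0.05 is only the commonly used default.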

Getting Started

To get started with the NLP Recipes:

Clone the repository:

git clone https://github.com/microsoft/nlp-recipes.git

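Install dependencies (this assumes a requirements.txt file at the repository root; the repository's Setup Guide is the authoritative reference for environment setup):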

cd nlp-recipes
pip install -r requirements.txt

Open and run the Jupyter notebooks in the examples directory to explore different NLP tasks and techniques.

Competitor Comparisons

transformers

146,142

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

Pros of transformers

Extensive library of pre-trained models for various NLP tasks
Active community and frequent updates
Seamless integration with PyTorch and TensorFlow

Cons of transformers

Steeper learning curve for beginners
Focused primarily on transformer-based models
May require more computational resources for large models

Code Comparison

nlp-recipes:

from utils_nlp.models.transformers.sequence_classification import SequenceClassifier

# train_dataloader and test_dataloader are assumed to be prepared beforehand
classifier = SequenceClassifier(model_name="bert-base-uncased", num_labels=2)
classifier.fit(train_dataloader)
predictions = classifier.predict(test_dataloader)

transformers:

from transformers import BertForSequenceClassification, Trainer

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
trainer = Trainer(model=model, train_dataset=train_dataset)
trainer.train()
predictions = trainer.predict(test_dataset)

Both repositories offer powerful NLP tools, but transformers provides a more comprehensive set of pre-trained models and is more actively maintained. nlp-recipes offers a broader range of NLP tasks and may be more accessible for beginners. The code examples show that both libraries provide high-level APIs for working with transformer models, but transformers offers more flexibility and direct access to model architectures.

spaCy

31,840

💫 Industrial-strength Natural Language Processing (NLP) in Python

Pros of spaCy

Highly optimized and efficient for production use
Comprehensive linguistic features including tokenization, POS tagging, and dependency parsing
Extensive documentation and community support

Cons of spaCy

Less flexible for custom NLP tasks compared to more general-purpose libraries
Steeper learning curve for beginners due to its specialized nature
Limited support for languages other than English (though improving)

Code Comparison

spaCy:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.label_)

nlp-recipes:

from azureml.core import Workspace
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

ws = Workspace.from_config()
compute_target = ComputeTarget.create(ws, "cpu-cluster", AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2"))

The code snippets highlight the different focus areas of the two repositories. The spaCy example runs named entity recognition directly, while the nlp-recipes example shows the Azure Machine Learning compute setup that its notebooks use for training at scale, before any NLP code runs.

rasa

20,333

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Pros of Rasa

Specialized for building conversational AI and chatbots
Provides a complete framework for dialogue management and NLU
Active community and extensive documentation

Cons of Rasa

Steeper learning curve for beginners
Less flexible for general NLP tasks outside of conversational AI
Requires more setup and configuration

Code Comparison

Rasa (intent classification, using the Interpreter API from Rasa 1.x/2.x):

from rasa.nlu.model import Interpreter

interpreter = Interpreter.load("./models/nlu")
result = interpreter.parse("Hello!")
print(result["intent"]["name"])

NLP Recipes (text classification):

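from utils_nlp.models.transformers.sequence_classification import SequenceClassifier

# train_dataloader and test_dataloader are assumed to be built from labeled text beforehand
classifier = SequenceClassifier(model_name="bert-base-uncased", num_labels=2)
classifier.fit(train_dataloader)
predictions = classifier.predict(test_dataloader)
print(predictions)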

Summary

Rasa is a specialized framework for building conversational AI, offering robust tools for dialogue management and NLU. It has a steeper learning curve but provides comprehensive features for chatbot development. NLP Recipes, on the other hand, offers a broader range of NLP tools and is more flexible for general NLP tasks. It's easier to get started with but may require more custom implementation for complex conversational systems.
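
The Rasa snippet above reads result["intent"]["name"] directly. In a conversational system it is common to fall back to a default action when the classifier is unsure. The sketch below shows that check on a hand-written dictionary in the shape of a parse result; the confidence values, the 0.6 cut-off, and the "fallback" intent name are all assumptions for illustration.

```python
def resolve_intent(parse_result, min_confidence=0.6, fallback="fallback"):
    """Return the predicted intent name, or a fallback when confidence is low."""
    intent = parse_result.get("intent") or {}
    name = intent.get("name")
    confidence = intent.get("confidence", 0.0)
    if name is None or confidence < min_confidence:
        return fallback
    return name


# Hypothetical parse results (assumption), not real Rasa output.
confident = {"text": "Hello!", "intent": {"name": "greet", "confidence": 0.93}}
unsure = {"text": "Hmm", "intent": {"name": "greet", "confidence": 0.31}}

print(resolve_intent(confident))
print(resolve_intent(unsure))
```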

fairseq

31,373

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Pros of fairseq

More focused on sequence modeling and neural machine translation
Extensive support for various architectures like Transformer, CNN, and RNN
Highly optimized for performance with CUDA support

Cons of fairseq

Steeper learning curve for beginners
Less comprehensive documentation compared to nlp-recipes
Primarily focused on research-oriented tasks

Code Comparison

fairseq:

from fairseq.models.transformer import TransformerModel
en2de = TransformerModel.from_pretrained(
    '/path/to/checkpoints',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='data-bin/wmt16_en_de_bpe32k'
)

nlp-recipes:

from utils_nlp.models.transformers.sequence_classification import SequenceClassifier
model = SequenceClassifier(
    model_name="bert-base-uncased",
    num_labels=2,
    cache_dir="outputs/"
)

Summary

fairseq is a powerful library for sequence modeling and machine translation, offering high performance and support for various architectures. It's particularly suited for research-oriented tasks and advanced users. nlp-recipes, on the other hand, provides a more accessible entry point for NLP tasks with comprehensive documentation and a broader range of NLP applications. The choice between the two depends on the specific requirements of your project and your level of expertise in NLP.

allennlp

11,843

An open-source NLP research library, built on PyTorch.

Pros of AllenNLP

More comprehensive and feature-rich NLP library
Detailed documentation and tutorials
Active community and frequent updates

Cons of AllenNLP

Steeper learning curve for newcomers
Less focus on end-to-end NLP solutions

Code Comparison

AllenNLP:

from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bert-base-srl-2020.03.24.tar.gz")
result = predictor.predict(sentence="Did Uriah honestly think he could beat the game in under three hours?")

NLP Recipes:

from utils_nlp.models.transformers.sequence_classification import SequenceClassifier

# test_dataloader is assumed to hold the tokenized text to classify
model = SequenceClassifier(model_name="bert-base-uncased", num_labels=2)
output = model.predict(test_dataloader)

Both repositories offer valuable resources for NLP tasks, but they cater to different audiences. AllenNLP provides a more comprehensive toolkit for researchers and advanced practitioners, while NLP Recipes focuses on practical, end-to-end solutions for common NLP tasks. The choice between them depends on the user's specific needs and level of expertise in NLP.

flair

14,148

A very simple framework for state-of-the-art Natural Language Processing (NLP)

Pros of flair

Focused specifically on NLP tasks with a simpler, more intuitive API
Provides pre-trained models for various languages and tasks out-of-the-box
Actively maintained with frequent updates and community support

Cons of flair

Limited scope compared to nlp-recipes' broader range of NLP solutions
Less comprehensive documentation and fewer example notebooks
May require more manual configuration for complex NLP pipelines

Code Comparison

flair:

from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load('ner')
sentence = Sentence('John Doe works at Microsoft.')
tagger.predict(sentence)
print(sentence.to_tagged_string())

nlp-recipes:

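from azureml.core import Workspace
from utils_nlp.models.transformers.named_entity_recognition import TokenClassifier

ws = Workspace.from_config()  # Azure ML workspace used by the notebooks for remote compute
# train_dataloader and test_dataloader are assumed to be prepared beforehand
model = TokenClassifier(model_name="bert-base-cased", num_labels=9, cache_dir="./cache")
model.fit(train_dataloader)
predictions = model.predict(test_dataloader)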

Both repositories offer powerful NLP capabilities, but flair is more focused on specific NLP tasks with an easier-to-use API, while nlp-recipes provides a broader range of NLP solutions integrated with Azure ML. The choice between them depends on the specific project requirements and the desired level of customization and integration with cloud services.
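
Token-level NER models such as the one above predict one tag per token, and num_labels=9 is consistent with a BIO scheme over four entity types plus an "O" tag (an assumption; the snippet does not state the label set). The sketch below shows how such per-token tags can be merged back into entity spans. The tokens and tags are hand-written for illustration.

```python
def bio_to_spans(tokens, tags):
    """Merge per-token BIO tags into (entity text, entity type) pairs."""
    spans = []
    current_tokens, current_type = [], None

    def flush():
        # Emit the entity collected so far, if any.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            # A B- tag, or an I- tag that does not continue the open entity,
            # starts a new entity.
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:
            # "O" closes any open entity.
            flush()
            current_tokens, current_type = [], None
    flush()
    return spans


# Hand-written example (assumption), not real model output.
tokens = ["John", "Doe", "works", "at", "Microsoft", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O"]
print(bio_to_spans(tokens, tags))
```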

README

NLP Best Practices

In recent years, natural language processing (NLP) has seen quick growth in quality and usability, and this has helped to drive business adoption of artificial intelligence (AI) solutions. In the last few years, researchers have been applying newer deep learning methods to NLP. Data scientists started moving from traditional methods to state-of-the-art (SOTA) deep neural network (DNN) algorithms which use language models pretrained on large text corpora.

This repository contains examples and best practices for building NLP systems, provided as Jupyter notebooks and utility functions. The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.

Overview

The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems. The content is based on our past and potential future engagements with customers as well as collaboration with partners, researchers, and the open source community.

We hope that the tools can reduce the "time to market" by orders of magnitude by simplifying the path from defining the business problem to developing the solution. In addition, the example notebooks serve as guidelines and showcase best practices and usage of the tools in a wide variety of languages.

In an era of transfer learning, transformers, and deep architectures, we believe that pretrained models provide a unified solution to many real-world problems and allow handling different tasks and languages easily. We will, therefore, prioritize such models, as they achieve state-of-the-art results on several NLP benchmarks like GLUE and SQuAD leaderboards. The models can be used in a number of applications ranging from simple text classification to sophisticated intelligent chat bots.

Note that for certain kinds of NLP problems, you may not need to build your own models. Instead, pre-built or easily customizable solutions exist which do not require any custom coding or machine learning expertise. We strongly recommend evaluating whether these can sufficiently solve your problem. If these solutions are not applicable, or their accuracy is not sufficient, then resorting to more complex and time-consuming custom approaches may be necessary. The following cognitive services offer simple solutions to address common NLP tasks:

Text Analytics is a set of pre-trained REST APIs which can be called for sentiment analysis, key phrase extraction, language detection, named entity detection, and more. These APIs work out of the box and require minimal expertise in machine learning, but have limited customization capabilities.

QnA Maker is a cloud-based API service that lets you create a conversational question-and-answer layer over your existing data. Use it to build a knowledge base by extracting questions and answers from your semi-structured content, including FAQs, manuals, and documents.

Language Understanding is a SaaS service to train and deploy a model as a REST API given a user-provided training set. You could do Intent Classification as well as Named Entity Extraction by performing simple steps of providing example utterances and labelling them. It supports Active Learning, so your model always keeps learning and improving.

Target Audience

The target audience for this repository includes data scientists and machine learning engineers with varying levels of NLP knowledge, because the content is source-only and targets custom machine learning modelling. The utilities and examples provided are intended to be solution accelerators for real-world NLP problems.

Focus Areas

The repository aims to expand NLP capabilities along three separate dimensions:

Scenarios

We aim to have end-to-end examples of common tasks and scenarios such as text classification, named entity recognition etc.

Algorithms

We aim to support multiple models for each of the supported scenarios. Currently, transformer-based models are supported across most scenarios. We have been working on integrating the transformers package from Hugging Face which allows users to easily load pretrained models and fine-tune them for different tasks.

Languages

We strongly subscribe to the multi-language principles laid down by Emily Bender:

"Natural language is not a synonym for English"
"English isn't generic for language, despite what NLP papers might lead you to believe"
"Always name the language you are working on" (Bender rule)

The repository aims to support non-English languages across all the scenarios. Pre-trained models used in the repository, such as BERT and fastText, support 100+ languages out of the box. Our goal is to provide end-to-end examples in as many languages as possible. We encourage community contributions in this area.

Content

The following is a summary of the commonly used NLP scenarios covered in the repository. Each scenario is demonstrated in one or more Jupyter notebook examples that make use of the core code base of models and repository utilities.

Scenario	Models	Description	Languages
Text Classification	BERT, DistilBERT, XLNet, RoBERTa, ALBERT, XLM	Text classification is a supervised learning method of learning and predicting the category or the class of a document given its text content.	English, Chinese, Hindi, Arabic, German, French, Japanese, Spanish, Dutch
Named Entity Recognition	BERT	Named entity recognition (NER) is the task of classifying words or key phrases of a text into predefined entities of interest.	English
Text Summarization	BERTSumExt, BERTSumAbs, UniLM (s2s-ft), MiniLM	Text summarization is a language generation task of summarizing the input text into a shorter paragraph of text.	English
Entailment	BERT, XLNet, RoBERTa	Textual entailment is the task of classifying the binary relation between two natural-language texts, text and hypothesis, to determine if the text agrees with the hypothesis or not.	English
Question Answering	BiDAF, BERT, XLNet	Question answering (QA) is the task of retrieving or generating a valid answer for a given query in natural language, provided with a passage related to the query.	English
Sentence Similarity	BERT, GenSen	Sentence similarity is the process of computing a similarity score given a pair of text documents.	English
Embeddings	Word2Vec, fastText, GloVe	Embedding is the process of mapping a word or a piece of text to a vector of real numbers in a continuous, usually low-dimensional, space.	English
Sentiment Analysis	Dependency Parser, GloVe	Provides an example of how to train and use Aspect Based Sentiment Analysis with Azure ML and Intel NLP Architect.	English
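
To make the Embeddings and Sentence Similarity rows concrete, the sketch below averages word vectors into a sentence vector and compares two sentences with cosine similarity. This is one simple baseline under stated assumptions, not the method used by the repository's notebooks; the three-dimensional vectors are toy values invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen here for all-zero vectors.
    return dot / (norm_a * norm_b)


def sentence_vector(words, embeddings):
    """Average the vectors of the words that have an embedding."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        raise ValueError("no known words in sentence")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy embeddings (assumption); real ones come from Word2Vec, fastText or GloVe.
toy_embeddings = {
    "cat": [1.0, 0.0, 0.2],
    "dog": [0.9, 0.1, 0.3],
    "car": [0.0, 1.0, 0.1],
}

sim_animals = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_mixed = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])
print(sim_animals > sim_mixed)
```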

Getting Started

When solving NLP problems, it is always good to start with the prebuilt Cognitive Services. When your needs go beyond what the prebuilt services offer and you want custom machine learning methods, you will find this repository very useful. To get started, navigate to the Setup Guide, which lists instructions on how to set up your environment and dependencies.

Azure Machine Learning Service

Azure Machine Learning service is a cloud service used to train, deploy, automate, and manage machine learning models, all at the broad scale that the cloud provides. AzureML is presented in notebooks across different scenarios to enhance the efficiency of developing Natural Language systems at scale and for various AI model development related tasks like:

Accessing Datastores to easily read and write your data in Azure storage services such as blob storage or file share.
Scaling up and out on Azure Machine Learning Compute.
Automated Machine Learning which builds high quality machine learning models by automating model and hyperparameter selection. AutoML explores BERT, BiLSTM, bag-of-words, and word embeddings on the user's dataset to handle text columns.
Tracking experiments and monitoring metrics to enhance the model creation process.
Distributed Training
Hyperparameter tuning
Deploying the trained machine learning model as a web service to Azure Container Instance for development and test, or for low-scale, CPU-based workloads.
Deploying the trained machine learning model as a web service to Azure Kubernetes Service for high-scale production deployments that need autoscaling and fast response times.

To run these notebooks, you will need an Azure subscription; you can also try Azure for free. The notebooks may use other Azure services or products, which are introduced or referenced in the notebooks themselves.
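
Hyperparameter tuning, one of the tasks listed above, can be illustrated without any cloud dependency. The sketch below runs an exhaustive grid search in plain Python. It does not use the Azure ML API, and the objective function is a made-up stand-in for a validation score that would normally come from training and evaluating a model.

```python
import itertools


def grid_search(objective, space):
    """Evaluate every combination in `space` and return the best one."""
    names = list(space)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(space[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    """Stand-in for validation accuracy (assumption); peaks at lr=3e-5, batch=32."""
    lr_penalty = abs(params["learning_rate"] - 3e-5) * 1e4
    batch_penalty = abs(params["batch_size"] - 32) / 100
    return 1.0 - lr_penalty - batch_penalty


search_space = {
    "learning_rate": [1e-5, 2e-5, 3e-5, 5e-5],
    "batch_size": [16, 32, 64],
}

best_params, best_score = grid_search(toy_objective, search_space)
print(best_params, best_score)
```

Grid search is chosen here only because it is deterministic and easy to follow; its cost grows with the product of the option counts, which is why sampling-based strategies are often preferred for larger spaces.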

Contributing

We hope that the open source community will contribute to the content and bring in the latest SOTA algorithms. This project welcomes contributions and suggestions. Before contributing, please see our contribution guidelines.

References

The following is a list of related repositories that we like and think are useful for NLP tasks.

Repository	Description
Transformers	A great PyTorch library from Hugging Face with implementations of popular transformer-based models. We've been using their package extensively in this repo and greatly appreciate their effort.
Azure Machine Learning Notebooks	ML and deep learning examples with Azure Machine Learning.
AzureML-BERT	End-to-end recipes for pre-training and fine-tuning BERT using Azure Machine Learning service.
MASS	MASS: Masked Sequence to Sequence Pre-training for Language Generation.
MT-DNN	Multi-Task Deep Neural Networks for Natural Language Understanding.
UniLM	Unified Language Model Pre-training.
DialoGPT	DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation

Build Status

Build	Branch	Status
Linux CPU	master
Linux CPU	staging
Linux GPU	master
Linux GPU	staging
