
snorkel-team / snorkel

A system for quickly generating training data with weak supervision


Top Related Projects

  • DeepSpeed (34,658 stars): a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
  • AllenNLP (11,728 stars): an open-source NLP research library, built on PyTorch.
  • Faiss (30,390 stars): a library for efficient similarity search and clustering of dense vectors.
  • PyTorch (82,049 stars): tensors and dynamic neural networks in Python with strong GPU acceleration.
  • Transformers: 🤗 state-of-the-art machine learning for Pytorch, TensorFlow, and JAX.

Quick Overview

Snorkel is an open-source Python library for programmatically building and managing training datasets without manual labeling. It allows users to write labeling functions to automatically label data, which can then be used to train machine learning models. Snorkel is particularly useful for tasks where labeled data is scarce or expensive to obtain.

Pros

  • Reduces the need for manual data labeling, saving time and resources
  • Enables the creation of large training datasets quickly and efficiently
  • Provides a flexible framework for incorporating domain expertise into the labeling process
  • Supports integration with popular machine learning libraries like PyTorch and TensorFlow

Cons

  • Learning curve for writing effective labeling functions
  • Performance can be dependent on the quality of labeling functions
  • May require fine-tuning and iteration to achieve optimal results
  • Not suitable for all types of data or machine learning tasks

Code Examples

  1. Creating a labeling function:
from snorkel.labeling import labeling_function

@labeling_function()
def keyword_labeling(x):
    # Return 1 (positive), 0 (negative), or -1 to abstain
    if "positive" in x.text.lower():
        return 1
    elif "negative" in x.text.lower():
        return 0
    return -1
  2. Applying labeling functions to a dataset:
from snorkel.labeling import PandasLFApplier

lfs = [keyword_labeling, another_labeling_function]
applier = PandasLFApplier(lfs)
L_train = applier.apply(df_train)  # df_train is a pandas DataFrame of data points
  3. Training a label model:
from snorkel.labeling.model import LabelModel

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, epochs=500, log_freq=100, seed=123)
  4. Using the label model to make predictions:
preds = label_model.predict(L_test)  # L_test: labeling function outputs on held-out data
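
A useful check before relying on these labels is to summarize how the labeling functions behave. The snippet below is a minimal sketch using Snorkel's LFAnalysis utility, assuming the lfs list and the L_train matrix produced in the examples above.

from snorkel.labeling import LFAnalysis

# Per-LF coverage, overlaps, and conflicts, returned as a pandas DataFrame
LFAnalysis(L=L_train, lfs=lfs).lf_summary()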

Getting Started

To get started with Snorkel, follow these steps:

  1. Install Snorkel:
pip install snorkel
  2. Import necessary modules:
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel
  3. Define labeling functions, apply them to your data, and train a label model:
# Define labeling functions
@labeling_function()
def lf_example(x):
    # Your labeling logic here; return a class label or -1 to abstain
    return -1

# Apply labeling functions to a pandas DataFrame
applier = PandasLFApplier([lf_example])
L_train = applier.apply(df_train)

# Train a label model (cardinality = number of classes)
label_model = LabelModel(cardinality=num_classes)
label_model.fit(L_train)
  4. Use the trained model to make predictions on new data:
preds = label_model.predict(L_test)  # L_test: labeling function outputs on new data
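
The label model's output is typically used to train a downstream classifier. The following is a minimal sketch: it assumes a pandas DataFrame df_train with a text column and uses scikit-learn for the end model (any classifier works); filter_unlabeled_dataframe drops rows on which every labeling function abstained.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from snorkel.labeling import filter_unlabeled_dataframe

# Probabilistic labels from the trained label model
probs_train = label_model.predict_proba(L_train)

# Drop data points where every labeling function abstained
df_filtered, probs_filtered = filter_unlabeled_dataframe(
    X=df_train, y=probs_train, L=L_train
)

# Train any standard classifier on the resulting hard labels
features = CountVectorizer().fit_transform(df_filtered.text)
clf = LogisticRegression().fit(features, probs_filtered.argmax(axis=1))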

Competitor Comparisons

DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Pros of DeepSpeed

  • Focuses on optimizing deep learning training, offering significant speed improvements and reduced memory usage
  • Provides advanced features like ZeRO optimizer and pipeline parallelism for large-scale model training
  • Integrates well with popular deep learning frameworks like PyTorch

Cons of DeepSpeed

  • Steeper learning curve, requiring more in-depth knowledge of distributed training concepts
  • Primarily designed for large-scale models and may be overkill for smaller projects
  • Less versatile in terms of data labeling and weak supervision tasks

Code Comparison

Snorkel (data labeling):

from snorkel.labeling import labeling_function

ABSTAIN, POSITIVE = -1, 1

@labeling_function()
def label_sentiment(x):
    if "great" in x.text.lower():
        return POSITIVE
    return ABSTAIN

DeepSpeed (model training optimization):

import deepspeed

# args, model, and params (model parameters) are assumed to be defined elsewhere
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=params
)

Summary

Snorkel focuses on programmatic data labeling and weak supervision, while DeepSpeed specializes in optimizing deep learning model training. Snorkel is more accessible for data preparation tasks, whereas DeepSpeed excels in improving training efficiency for large-scale models.

AllenNLP

An open-source NLP research library, built on PyTorch.

Pros of AllenNLP

  • More comprehensive NLP toolkit with a wider range of pre-built models and tasks
  • Better documentation and tutorials for ease of use
  • Stronger integration with PyTorch for deep learning tasks

Cons of AllenNLP

  • Steeper learning curve for beginners due to its extensive feature set
  • Less focus on weak supervision and data labeling compared to Snorkel
  • May be overkill for simpler NLP projects or specific labeling tasks

Code Comparison

AllenNLP example (model definition):

from allennlp.models import Model
from torch.nn import Linear

class SimpleClassifier(Model):
    def __init__(self, vocab, embedder, encoder):
        super().__init__(vocab)
        self.embedder = embedder
        self.encoder = encoder
        self.classifier = Linear(encoder.get_output_dim(), vocab.get_vocab_size('labels'))

Snorkel example (labeling function):

ABSTAIN, POSITIVE = -1, 1

@labeling_function()
def keyword_labeling(x):
    return POSITIVE if "great" in x.text.lower() else ABSTAIN

AllenNLP focuses on building and training models, while Snorkel emphasizes labeling functions and weak supervision. AllenNLP provides a more structured approach to model creation, whereas Snorkel offers a flexible way to generate training data through labeling functions.

Faiss

A library for efficient similarity search and clustering of dense vectors.

Pros of Faiss

  • Highly optimized for similarity search and clustering of dense vectors
  • Supports GPU acceleration for faster processing
  • Scales well to very large datasets (billions of vectors)

Cons of Faiss

  • More specialized focus on vector similarity search, less versatile than Snorkel
  • Steeper learning curve for non-experts in similarity search and indexing
  • Limited built-in support for data preprocessing and feature extraction

Code Comparison

Faiss example (vector similarity search):

import faiss
import numpy as np

d = 64  # dimension
nb = 100000  # database size
nq = 10000  # number of queries
xb = np.random.random((nb, d)).astype('float32')
xq = np.random.random((nq, d)).astype('float32')

index = faiss.IndexFlatL2(d)
index.add(xb)
D, I = index.search(xq, k=4)  # search

Snorkel example (labeling function):

from snorkel.labeling import labeling_function

ABSTAIN, POSITIVE = -1, 1

@labeling_function()
def keyword_lookup(x):
    return POSITIVE if "positive" in x.text.lower() else ABSTAIN

Both repositories serve different purposes: Faiss excels in similarity search and clustering, while Snorkel focuses on programmatic labeling and weak supervision. The code examples highlight these differences, with Faiss demonstrating vector indexing and search, and Snorkel showcasing a labeling function for data annotation.

PyTorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Pros of PyTorch

  • Broader scope and applicability in deep learning and machine learning
  • Larger community and ecosystem, with more resources and third-party libraries
  • More flexible and dynamic computational graph, allowing for easier debugging

Cons of PyTorch

  • Steeper learning curve for beginners compared to Snorkel's focused approach
  • Less specialized for weak supervision and programmatic labeling tasks
  • Potentially more complex setup and configuration for specific use cases

Code Comparison

Snorkel (labeling function):

ABSTAIN, POSITIVE = -1, 1

@labeling_function()
def keyword_labeling(x):
    return POSITIVE if "great" in x.text.lower() else ABSTAIN

PyTorch (simple neural network):

import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return torch.sigmoid(self.fc(x))

While Snorkel focuses on programmatic labeling and weak supervision, PyTorch provides a more general-purpose deep learning framework. Snorkel's code emphasizes labeling functions, whereas PyTorch's code typically involves defining neural network architectures and training loops.

Transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Pros of Transformers

  • Broader scope, covering a wide range of NLP tasks and models
  • Larger community and more frequent updates
  • Seamless integration with popular deep learning frameworks

Cons of Transformers

  • Steeper learning curve for beginners
  • Higher computational requirements for many models
  • Less focus on data labeling and weak supervision

Code Comparison

Transformers example:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")[0]
print(f"Label: {result['label']}, Score: {result['score']:.4f}")

Snorkel example:

from snorkel.labeling import labeling_function, PandasLFApplier

ABSTAIN, POSITIVE = -1, 1

@labeling_function()
def label_sentiment(x):
    if "love" in x.text.lower():
        return POSITIVE
    return ABSTAIN

labeler = PandasLFApplier([label_sentiment])
L = labeler.apply(df)

Both libraries serve different purposes in the NLP ecosystem. Transformers focuses on state-of-the-art models for various NLP tasks, while Snorkel emphasizes programmatic labeling and weak supervision techniques. Choose based on your specific needs and project requirements.


README


Programmatically Build and Manage Training Data

Announcement

The Snorkel team is now focusing their efforts on Snorkel Flow, an end-to-end AI application development platform based on the core ideas behind Snorkel—you can check it out here or join us in building it!

The Snorkel project started at Stanford in 2015 with a simple technical bet: that it would increasingly be the training data, not the models, algorithms, or infrastructure, that decided whether a machine learning project succeeded or failed. Given this premise, we set out to explore the radical idea that you could bring mathematical and systems structure to the messy and often entirely manual process of training data creation and management, starting by empowering users to programmatically label, build, and manage training data.

To say that the Snorkel project succeeded and expanded beyond what we had ever expected would be an understatement. The basic goals of a research repo like Snorkel are to provide a minimum viable framework for testing and validating hypotheses. Four years later, we’ve been fortunate to do not just this, but to develop and deploy early versions of Snorkel in partnership with some of the world’s leading organizations like Google, Intel, Stanford Medicine, and many more; author over sixty peer-reviewed publications on our findings around Snorkel and related innovations in weak supervision modeling, data augmentation, multi-task learning, and more; be included in courses at top-tier universities; support production deployments in systems that you’ve likely used in the last few hours; and work with an amazing community of researchers and practitioners from industry, medicine, government, academia, and beyond.

However, we realized increasingly–from conversations with users in weekly office hours, workshops, online discussions, and industry partners–that the Snorkel project was just the very first step. The ideas behind Snorkel change not just how you label training data, but so much of the entire lifecycle and pipeline of building, deploying, and managing ML: how users inject their knowledge; how models are constructed, trained, inspected, versioned, and monitored; how entire pipelines are developed iteratively; and how the full set of stakeholders in any ML deployment, from subject matter experts to ML engineers, are incorporated into the process.

Over the last year, we have been building the platform to support this broader vision: Snorkel Flow, an end-to-end machine learning platform for developing and deploying AI applications. Snorkel Flow incorporates many of the concepts of the Snorkel project with a range of newer techniques around weak supervision modeling, data augmentation, multi-task learning, data slicing and structuring, monitoring and analysis, and more, all of which integrate in a way that is greater than the sum of its parts–and that we believe makes ML truly faster, more flexible, and more practical than ever before.

Moving forward, we will be focusing our efforts on Snorkel Flow. We are extremely grateful for all of you that have contributed to the Snorkel project, and are excited for you to check out our next chapter here.


Getting Started

The quickest way to familiarize yourself with the Snorkel library is to walk through the Get Started page on the Snorkel website, followed by the full-length tutorials in the Snorkel tutorials repository. These tutorials demonstrate a variety of tasks, domains, labeling techniques, and integrations that can serve as templates as you apply Snorkel to your own applications.

Installation

Snorkel requires Python 3.11 or later. To install Snorkel, we recommend using pip:

pip install snorkel

or conda:

conda install snorkel -c conda-forge

For information on installing from source and contributing to Snorkel, see our contributing guidelines.

Details on installing with conda

The following example commands give some more color on installing with conda. These commands assume that your conda installation is Python 3.11, and that you want to use a virtual environment called snorkel-env.

# [OPTIONAL] Create and activate a virtual environment called "snorkel-env"
conda create --yes -n snorkel-env python=3.11
conda activate snorkel-env

# We specify PyTorch here to ensure compatibility, but it may not be necessary.
conda install pytorch -c pytorch
conda install snorkel -c conda-forge

A quick note for Windows users

If you're using Windows, we highly recommend using Docker (you can find an example in our tutorials repo) or the Linux subsystem. We've done limited testing on Windows, so if you want to contribute instructions or improvements, feel free to open a PR!

Discussion

Issues

We use GitHub Issues for posting bugs and feature requests — anything code-related. Just make sure you search for related issues first and use our Issues templates. We may ask for contributions if a prompt fix doesn't fit into the immediate roadmap of the core development team.

Contributions

We welcome contributions from the Snorkel community! This is likely the fastest way to get a change you'd like to see into the library.

Small contributions can be made directly in a pull request (PR). If you would like to contribute a larger feature, we recommend first creating an issue with a proposed design for discussion. For ideas about what to work on, we've labeled specific issues as help wanted.

To set up a development environment for contributing back to Snorkel, see our contributing guidelines. All PRs must pass the continuous integration tests and receive approval from a member of the Snorkel development team before they will be merged.

Community Forum

For broader Q&A, discussions about using Snorkel, tutorial requests, etc., use the Snorkel community forum hosted on Spectrum. We hope this will be a venue for you to interact with other Snorkel users — please don't be shy about posting!

Announcements

To stay up-to-date on Snorkel-related announcements (e.g. version releases, upcoming workshops), subscribe to the Snorkel mailing list. We promise to respect your inboxes — communication will be sparse!

Twitter

Follow us on Twitter @SnorkelAI.