gensim

Topic Modelling for Humans


Top Related Projects

smart_open

3,329

Utils for streaming large files (S3, HDFS, gzip, bz2...)

spaCy

31,840

💫 Industrial-strength Natural Language Processing (NLP) in Python

scikit-learn

62,466

scikit-learn: machine learning in Python

transformers

146,142

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

xgboost

27,179

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

Quick Overview

Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora. It provides efficient multicore implementations of popular algorithms like Latent Semantic Analysis, Latent Dirichlet Allocation, and Word2Vec. Gensim is designed to process raw, unstructured digital texts using unsupervised machine learning algorithms.

Pros

Efficient and scalable implementation for processing large text corpora
Supports various popular topic modeling and word embedding algorithms
Integrates well with NumPy, SciPy, and Pandas
Extensive documentation and active community support

Cons

Steep learning curve for beginners in NLP and topic modeling
Some advanced features may require additional dependencies
Performance can be slower than specialized implementations of specific algorithms
Limited support for deep learning-based NLP models

Code Examples

Loading and preprocessing a corpus:

from gensim import corpora
from gensim.utils import simple_preprocess

texts = [
    "The quick brown fox jumps over the lazy dog",
    "A man a plan a canal panama",
    "I love natural language processing"
]
processed_texts = [simple_preprocess(doc) for doc in texts]
dictionary = corpora.Dictionary(processed_texts)
corpus = [dictionary.doc2bow(text) for text in processed_texts]
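
For readers new to the bag-of-words format, the pure-Python sketch below mimics the shape of what doc2bow returns: a sparse list of (token_id, count) pairs per document. It is a conceptual illustration only, not gensim's implementation; build_vocab and to_bow are hypothetical helper names, and the IDs assigned here need not match those assigned by corpora.Dictionary.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    """Return a sorted sparse vector of (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Already-tokenized toy documents (hypothetical data)
docs = [["fox", "jumps", "fox"], ["lazy", "dog", "fox"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(bows)
```

Only tokens that actually occur in a document appear in its vector, which is why this representation stays small even when the vocabulary is large.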

Training a Latent Dirichlet Allocation (LDA) model:

from gensim.models import LdaMulticore

lda_model = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=3)
topics = lda_model.print_topics()
for topic in topics:
    print(topic)

Creating and using a Word2Vec model:

from gensim.models import Word2Vec

model = Word2Vec(sentences=processed_texts, vector_size=100, window=5, min_count=1, workers=4)
vector = model.wv['fox']
similar_words = model.wv.most_similar('fox', topn=5)
print(similar_words)

Getting Started

To get started with Gensim, first install it using pip:

pip install gensim

Then, import the necessary modules and start processing your text data:

from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess

# Prepare your corpus
texts = ["Your first document here", "Your second document here"]  # add more documents as needed
processed_texts = [simple_preprocess(doc) for doc in texts]

# Create a dictionary and corpus
dictionary = corpora.Dictionary(processed_texts)
corpus = [dictionary.doc2bow(text) for text in processed_texts]

# Train an LDA model
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)

# Print the topics
print(lda_model.print_topics())

This basic example demonstrates how to preprocess text, create a corpus, and train an LDA model using Gensim.

Competitor Comparisons

smart_open

3,329

Utils for streaming large files (S3, HDFS, gzip, bz2...)

Pros of smart_open

Lightweight and focused on I/O operations
Supports a wide range of storage systems (local, S3, GCS, Azure Blob, etc.)
Easy to use and integrate into existing projects

Cons of smart_open

Limited to I/O operations, lacks advanced NLP features
Smaller community and fewer contributors
Less frequent updates and releases

Code Comparison

smart_open:

from smart_open import open

with open('s3://bucket/key.txt', 'w') as fout:
    fout.write('hello world')

gensim:

from gensim.models import Word2Vec

sentences = [['this', 'is', 'a', 'sentence'], ['another', 'sentence']]
model = Word2Vec(sentences, min_count=1)

Summary

smart_open is a specialized library for handling I/O operations across various storage systems, while gensim is a comprehensive NLP library. smart_open excels in simplifying file access, but lacks the advanced text processing capabilities of gensim. gensim offers a wide range of NLP tools and models, but may be overkill for projects that only require simple I/O operations. Choose smart_open for efficient file handling across different storage systems, and gensim for more complex NLP tasks and text processing.

spaCy

31,840

💫 Industrial-strength Natural Language Processing (NLP) in Python

Pros of spaCy

More comprehensive NLP toolkit with advanced features like named entity recognition and dependency parsing
Faster processing speed, especially for large-scale text analysis
Better documentation and more active community support

Cons of spaCy

Steeper learning curve due to its more complex architecture
Less flexible for custom topic modeling compared to Gensim
Larger memory footprint, which can be an issue for resource-constrained environments

Code Comparison

spaCy:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)

Gensim:

from gensim import corpora, models
texts = [["apple", "buy", "startup"], ["uk", "billion", "dollar"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda_model = models.LdaModel(corpus, num_topics=2, id2word=dictionary)

scikit-learn

62,466

scikit-learn: machine learning in Python

Pros of scikit-learn

Broader scope, covering a wide range of machine learning algorithms and techniques
More extensive documentation and community support
Better integration with other scientific Python libraries (NumPy, SciPy, Pandas)

Cons of scikit-learn

Less specialized for natural language processing and topic modeling tasks
May have slower performance for specific text processing operations
Steeper learning curve for users primarily interested in text analysis

Code Comparison

scikit-learn (text classification):

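from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative data: raw text documents and one class label per document
corpus = ["cheap pills online", "meeting agenda attached"]
y = ["spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
clf = MultinomialNB().fit(X, y)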

gensim (topic modeling):

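from gensim import corpora, models

# Illustrative data: one list of tokens per document
texts = [["human", "computer", "interaction"], ["graph", "minors", "survey"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
# id2word maps token IDs back to words so printed topics are readable
lda_model = models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=10)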

Both libraries offer powerful tools for text analysis, but gensim is more specialized for topic modeling and word embeddings, while scikit-learn provides a broader range of machine learning algorithms. The choice between them depends on the specific requirements of your project and your familiarity with each library.

transformers

146,142

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

Pros of transformers

Extensive support for state-of-the-art transformer models
Seamless integration with PyTorch and TensorFlow
Active development and frequent updates

Cons of transformers

Steeper learning curve for beginners
Higher computational requirements for large models
More focused on transformer architectures, less versatile for traditional NLP tasks

Code Comparison

transformers:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

gensim:

from gensim.models import Word2Vec
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, min_count=1)
vector = model.wv['dog']

nltk

14,217

NLTK Source

Pros of NLTK

Comprehensive suite of text processing libraries and educational resources
Extensive documentation and academic support
Broader range of linguistic tools, including parsing and semantic analysis

Cons of NLTK

Slower performance for large-scale text processing tasks
Less focus on modern machine learning techniques for NLP
Steeper learning curve for beginners

Code Comparison

NLTK example:

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

text = "Natural language processing is fascinating."
tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))
filtered_tokens = [w for w in tokens if w.lower() not in stop_words]

Gensim example:

from gensim import corpora, models

texts = [['natural', 'language', 'processing', 'fascinating']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda_model = models.LdaModel(corpus, num_topics=1, id2word=dictionary)

NLTK focuses on linguistic analysis and text preprocessing, while Gensim specializes in topic modeling and document similarity. NLTK provides a wider range of NLP tools, but Gensim offers more efficient implementations for specific tasks like word embeddings and document clustering.

xgboost

27,179

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

Pros of XGBoost

Highly optimized for performance and speed in gradient boosting
Supports distributed computing for large-scale machine learning tasks
Offers built-in cross-validation and early stopping features

Cons of XGBoost

Steeper learning curve compared to Gensim's user-friendly interface
More focused on supervised learning, while Gensim excels in unsupervised tasks
Less suitable for natural language processing tasks than Gensim

Code Comparison

XGBoost (for classification):

import xgboost as xgb

# X_train, y_train, and X_test are assumed to be existing feature and label arrays
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Gensim (for topic modeling):

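from gensim import corpora, models

# texts is assumed to be a list of tokenized documents (one list of tokens each)
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
# id2word maps token IDs back to words so printed topics are readable
lda_model = models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=10)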

XGBoost is primarily used for supervised learning tasks like classification and regression, while Gensim focuses on unsupervised natural language processing tasks such as topic modeling and document similarity. XGBoost offers high performance and scalability for machine learning, whereas Gensim provides tools for working with textual data and semantic analysis.


README

gensim – Topic Modelling in Python

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.

⚠️ Want to help out? Sponsor Gensim ❤️

⚠️ Gensim is in stable maintenance mode: we are not accepting new features, but bug and documentation fixes are still welcome! ⚠️

Features

All algorithms are memory-independent w.r.t. the corpus size (can process input larger than RAM, streamed, out-of-core).
Intuitive interfaces:
- easy to plug in your own input corpus/datastream (trivial streaming API)
- easy to extend with other Vector Space algorithms (trivial transformation API)
Efficient multicore implementations of popular algorithms, such as online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP) or word2vec deep learning.
Distributed computing: can run Latent Semantic Analysis and Latent Dirichlet Allocation on a cluster of computers.
Extensive documentation and Jupyter Notebook tutorials.

If this feature list left you scratching your head, you can first read more about the Vector Space Model and unsupervised document analysis on Wikipedia.
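
As a rough illustration of the Vector Space Model mentioned above: each document becomes a sparse vector of term counts, and similarity is the cosine of the angle between two such vectors. The sketch below is plain Python for illustration only; cosine_similarity is a hypothetical helper and the three toy documents are made up, so none of this is gensim's own code.

```python
import math
from collections import Counter

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as {term: weight} mappings."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Each Counter is a sparse term-count vector for one document.
doc_a = Counter("human computer interaction".split())
doc_b = Counter("computer interaction survey".split())
doc_c = Counter("graph minors trees".split())

sim_ab = cosine_similarity(doc_a, doc_b)  # two of three terms shared
sim_ac = cosine_similarity(doc_a, doc_c)  # no terms shared
print(sim_ab, sim_ac)
```

Documents that share more terms score closer to 1, and documents with no terms in common score 0. Algorithms such as LSA and LDA go further by mapping these raw count vectors into a lower-dimensional topic space before comparing them.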

Installation

This software depends on NumPy, a Python package for scientific computing. Please bear in mind that building NumPy from source (e.g. by installing gensim on a platform which lacks NumPy .whl distribution) is a non-trivial task involving linking NumPy to a BLAS library.
It is recommended to provide a fast one (such as MKL, ATLAS or OpenBLAS) which can improve performance by as much as an order of magnitude. On OSX, NumPy picks up its vecLib BLAS automatically, so you don't need to do anything special.

Install the latest version of gensim:

    pip install --upgrade gensim

Or, if you have instead downloaded and unzipped the source tar.gz package:

    tar -xvzf gensim-X.X.X.tar.gz
    cd gensim-X.X.X/
    pip install .

For alternative modes of installation, see the documentation.

Gensim is being continuously tested under all supported Python versions. Support for Python 2.7 was dropped in gensim 4.0.0; install gensim 3.8.3 if you must use Python 2.7.

How come gensim is so fast and memory efficient? Isn't it pure Python, and isn't Python slow and greedy?

Many scientific algorithms can be expressed in terms of large matrix operations (see the BLAS note above). Gensim taps into these low-level BLAS libraries, by means of its dependency on NumPy. So while gensim-the-top-level-code is pure Python, it actually executes highly optimized Fortran/C under the hood, including multithreading (if your BLAS is so configured).

Memory-wise, gensim makes heavy use of Python's built-in generators and iterators for streamed data processing. Memory efficiency was one of gensim's design goals, and is a central feature of gensim, rather than something bolted on as an afterthought.
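
The streaming idea can be sketched with an ordinary Python iterable that reads one document per line and yields it lazily, so only the current line is held in memory. StreamedCorpus, its whitespace tokenisation, and the temporary demo file are assumptions made for this illustration; in real use the yielded token lists would typically be converted with a gensim Dictionary before being handed to a model.

```python
import os
import tempfile

class StreamedCorpus:
    """Iterate over a text file one document (line) at a time, never loading it whole."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-opening the file on every pass lets the corpus be iterated more than once,
        # which multi-pass training needs; a bare generator would be exhausted
        # after the first pass.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens

# Write a tiny demo file; a real corpus could be far larger than RAM.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Human machine interface\n\nGraph minors survey\n")
    demo_path = tmp.name

corpus = StreamedCorpus(demo_path)
first_pass = list(corpus)
second_pass = list(corpus)  # works because __iter__ restarts from the top of the file
os.remove(demo_path)
print(first_pass)
```

Using a class with __iter__ rather than a one-shot generator function is the key design choice here: it gives a restartable stream while keeping memory use independent of corpus size.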

Documentation

Support

For commercial support, please see Gensim sponsorship.

Ask open-ended questions on the public Gensim Mailing List.

Raise bugs on GitHub, but please make sure you follow the issue template. Issues that are not bugs or fail to provide the requested details will be closed without inspection.

Adopters

Company	Industry	Use of Gensim
RARE Technologies	ML & NLP consulting	Creators of Gensim – this is us!
Amazon	Retail	Document similarity.
National Institutes of Health	Health	Processing grants and publications with word2vec.
Cisco Security	Security	Large-scale fraud detection.
Mindseye	Legal	Similarities in legal documents.
Channel 4	Media	Recommendation engine.
Talentpair	HR	Candidate matching in high-touch recruiting.
Juju	HR	Provide non-obvious related job suggestions.
Tailwind	Media	Post interesting and relevant content to Pinterest.
Issuu	Media	Gensim's LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it's all about.
Search Metrics	Content Marketing	Gensim word2vec used for entity disambiguation in Search Engine Optimisation.
12K Research	Media	Document similarity analysis on media articles.
Stillwater Supercomputing	Hardware	Document comprehension and association with word2vec.
SiteGround	Web hosting	An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA.
Capital One	Finance	Topic modeling for customer complaints exploration.

Citing gensim

When citing gensim in academic papers and theses, please use this BibTeX entry:

@inproceedings{rehurek_lrec,
      title = {{Software Framework for Topic Modelling with Large Corpora}},
      author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka},
      booktitle = {{Proceedings of the LREC 2010 Workshop on New
           Challenges for NLP Frameworks}},
      pages = {45--50},
      year = 2010,
      month = May,
      day = 22,
      publisher = {ELRA},
      address = {Valletta, Malta},
      note={\url{http://is.muni.cz/publication/884893/en}},
      language={English}
}
