Top Related Projects
Utils for streaming large files (S3, HDFS, gzip, bz2...)
💫 Industrial-strength Natural Language Processing (NLP) in Python
scikit-learn: machine learning in Python
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
NLTK Source
Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
Quick Overview
Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora. It provides efficient multicore implementations of popular algorithms like Latent Semantic Analysis, Latent Dirichlet Allocation, and Word2Vec. Gensim is designed to process raw, unstructured digital texts using unsupervised machine learning algorithms.
Pros
- Efficient and scalable implementation for processing large text corpora
- Supports various popular topic modeling and word embedding algorithms
- Built on NumPy and SciPy, so it fits into the scientific Python stack
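- Extensive documentation and Jupyter Notebook tutorials (the project is now in stable maintenance mode, so expect bug and documentation fixes but no new features)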
Cons
- Steep learning curve for beginners in NLP and topic modeling
- Some advanced features may require additional dependencies
- Can be slower than implementations that specialize in a single algorithm
- Limited support for deep learning-based NLP models
Code Examples
- Loading and preprocessing a corpus:
from gensim import corpora
from gensim.utils import simple_preprocess
texts = [
"The quick brown fox jumps over the lazy dog",
"A man a plan a canal panama",
"I love natural language processing"
]
processed_texts = [simple_preprocess(doc) for doc in texts]
dictionary = corpora.Dictionary(processed_texts)
corpus = [dictionary.doc2bow(text) for text in processed_texts]
- Training a Latent Dirichlet Allocation (LDA) model:
from gensim.models import LdaMulticore
lda_model = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=3)
topics = lda_model.print_topics()
for topic in topics:
    print(topic)
- Creating and using a Word2Vec model:
from gensim.models import Word2Vec
model = Word2Vec(sentences=processed_texts, vector_size=100, window=5, min_count=1, workers=4)
vector = model.wv['fox']
similar_words = model.wv.most_similar('fox', topn=5)
print(similar_words)
Getting Started
To get started with Gensim, first install it using pip:
pip install gensim
Then, import the necessary modules and start processing your text data:
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess
# Prepare your corpus
texts = ["Your first document here", "Your second document here"]  # add more documents as needed
processed_texts = [simple_preprocess(doc) for doc in texts]
# Create a dictionary and corpus
dictionary = corpora.Dictionary(processed_texts)
corpus = [dictionary.doc2bow(text) for text in processed_texts]
# Train an LDA model
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)
# Print the topics
print(lda_model.print_topics())
This basic example demonstrates how to preprocess text, create a corpus, and train an LDA model using Gensim.
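To make the "create a corpus" step more concrete, here is a minimal sketch in plain Python of the bag-of-words representation: each document becomes a sorted list of `(token_id, count)` pairs, which is the shape `doc2bow` returns. The `build_vocab` and `to_bow` helpers are hypothetical stand-ins written for illustration, not Gensim's implementation, and the whitespace tokenizer is a simplification of `simple_preprocess`.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return a sorted list of (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

docs = ["a man a plan a canal panama", "i love natural language processing"]
tokenized = [d.lower().split() for d in docs]
vocab = build_vocab(tokenized)
corpus = [to_bow(tokens, vocab) for tokens in tokenized]
print(corpus[0])  # [(0, 3), (1, 1), (2, 1), (3, 1), (4, 1)]
```

The token "a" appears three times in the first document, so its id carries a count of 3. Word order is discarded, which is the trade-off a bag-of-words model makes.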
Competitor Comparisons
smart_open: Utils for streaming large files (S3, HDFS, gzip, bz2...)
Pros of smart_open
- Lightweight and focused on I/O operations
- Supports a wide range of storage systems (local, S3, GCS, Azure Blob, etc.)
- Easy to use and integrate into existing projects
Cons of smart_open
- Limited to I/O operations, lacks advanced NLP features
- Smaller community and fewer contributors
- Less frequent updates and releases
Code Comparison
smart_open:
from smart_open import open
with open('s3://bucket/key.txt', 'w') as fout:
    fout.write('hello world')
gensim:
from gensim.models import Word2Vec
sentences = [['this', 'is', 'a', 'sentence'], ['another', 'sentence']]
model = Word2Vec(sentences, min_count=1)
Summary
smart_open is a specialized library for handling I/O operations across various storage systems, while gensim is a comprehensive NLP library. smart_open excels in simplifying file access, but lacks the advanced text processing capabilities of gensim. gensim offers a wide range of NLP tools and models, but may be overkill for projects that only require simple I/O operations. Choose smart_open for efficient file handling across different storage systems, and gensim for more complex NLP tasks and text processing.
spaCy: 💫 Industrial-strength Natural Language Processing (NLP) in Python
Pros of spaCy
- More comprehensive NLP toolkit with advanced features like named entity recognition and dependency parsing
- Faster processing speed, especially for large-scale text analysis
- Better documentation and more active community support
Cons of spaCy
- Steeper learning curve due to its more complex architecture
- Less flexible for custom topic modeling compared to Gensim
- Larger memory footprint, which can be an issue for resource-constrained environments
Code Comparison
spaCy:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)
Gensim:
from gensim import corpora, models
texts = [["apple", "buy", "startup"], ["uk", "billion", "dollar"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda_model = models.LdaModel(corpus, num_topics=2, id2word=dictionary)
scikit-learn: machine learning in Python
Pros of scikit-learn
- Broader scope, covering a wide range of machine learning algorithms and techniques
- More extensive documentation and community support
- Better integration with other scientific Python libraries (NumPy, SciPy, Pandas)
Cons of scikit-learn
- Less specialized for natural language processing and topic modeling tasks
- May have slower performance for specific text processing operations
- Steeper learning curve for users primarily interested in text analysis
Code Comparison
scikit-learn (text classification):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Assumes `corpus` is a list of raw text strings and `y` holds one class label per document
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
clf = MultinomialNB().fit(X, y)
gensim (topic modeling):
from gensim import corpora, models
# Assumes `texts` is a list of tokenized documents (lists of strings)
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda_model = models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=10)
Both libraries offer powerful tools for text analysis, but gensim is more specialized for topic modeling and word embeddings, while scikit-learn provides a broader range of machine learning algorithms. The choice between them depends on the specific requirements of your project and your familiarity with each library.
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of transformers
- Extensive support for state-of-the-art transformer models
- Seamless integration with PyTorch and TensorFlow
- Active development and frequent updates
Cons of transformers
- Steeper learning curve for beginners
- Higher computational requirements for large models
- More focused on transformer architectures, less versatile for traditional NLP tasks
Code Comparison
transformers:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
gensim:
from gensim.models import Word2Vec
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, min_count=1)
vector = model.wv['dog']
NLTK Source
Pros of NLTK
- Comprehensive suite of text processing libraries and educational resources
- Extensive documentation and academic support
- Broader range of linguistic tools, including parsing and semantic analysis
Cons of NLTK
- Slower performance for large-scale text processing tasks
- Less focus on modern machine learning techniques for NLP
- Steeper learning curve for beginners
Code Comparison
NLTK example:
import nltk
from nltk.tokenize import word_tokenize
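from nltk.corpus import stopwords
# One-time resource downloads; newer NLTK releases may also need 'punkt_tab'
nltk.download('punkt')
nltk.download('stopwords')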
text = "Natural language processing is fascinating."
tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))
filtered_tokens = [w for w in tokens if w.lower() not in stop_words]
Gensim example:
from gensim import corpora, models
texts = [['natural', 'language', 'processing', 'fascinating']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda_model = models.LdaModel(corpus, num_topics=1, id2word=dictionary)
NLTK focuses on linguistic analysis and text preprocessing, while Gensim specializes in topic modeling and document similarity. NLTK provides a wider range of NLP tools, but Gensim offers more efficient implementations for specific tasks like word embeddings and document clustering.
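"Document similarity" here means comparing documents as vectors. The sketch below assumes bag-of-words vectors and cosine similarity, a common measure for this kind of comparison. The `cosine_similarity` and `most_similar` helpers are hypothetical and written in plain Python for illustration. Gensim ships its own similarity index classes aimed at large corpora, which this sketch does not reproduce.

```python
import math

def cosine_similarity(bow_a, bow_b):
    """Cosine similarity between two sparse vectors given as (id, weight) pairs."""
    a, b = dict(bow_a), dict(bow_b)
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

def most_similar(query, corpus):
    """Rank corpus documents by similarity to the query, best match first."""
    scores = [(idx, cosine_similarity(query, doc)) for idx, doc in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy corpus of sparse bag-of-words vectors (made-up ids and counts)
corpus = [
    [(0, 1), (1, 1)],
    [(1, 1), (2, 2)],
    [(3, 1)],
]
query = [(1, 1), (2, 1)]
ranking = most_similar(query, corpus)
print(ranking[0][0])  # 1
```

The second document shares both of the query's tokens and ranks first. The third shares none and scores 0.0.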
XGBoost: Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
Pros of XGBoost
- Highly optimized for performance and speed in gradient boosting
- Supports distributed computing for large-scale machine learning tasks
- Offers built-in cross-validation and early stopping features
Cons of XGBoost
- Steeper learning curve compared to Gensim's user-friendly interface
- More focused on supervised learning, while Gensim excels in unsupervised tasks
- Less suitable for natural language processing tasks than Gensim
Code Comparison
XGBoost (for classification):
import xgboost as xgb
# Assumes X_train, y_train, and X_test are already-prepared feature and label arrays
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Gensim (for topic modeling):
from gensim import corpora, models
# Assumes `texts` is a list of tokenized documents (lists of strings)
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda_model = models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=10)
XGBoost is primarily used for supervised learning tasks like classification and regression, while Gensim focuses on unsupervised natural language processing tasks such as topic modeling and document similarity. XGBoost offers high performance and scalability for machine learning, whereas Gensim provides tools for working with textual data and semantic analysis.
README
gensim – Topic Modelling in Python
Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.
⚠️ Want to help out? Sponsor Gensim ❤️
⚠️ Gensim is in stable maintenance mode: we are not accepting new features, but bug and documentation fixes are still welcome! ⚠️
Features
- All algorithms are memory-independent w.r.t. the corpus size (can process input larger than RAM, streamed, out-of-core),
- Intuitive interfaces
  - easy to plug in your own input corpus/datastream (trivial streaming API)
  - easy to extend with other Vector Space algorithms (trivial transformation API)
- Efficient multicore implementations of popular algorithms, such as online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP) or word2vec deep learning.
- Distributed computing: can run Latent Semantic Analysis and Latent Dirichlet Allocation on a cluster of computers.
- Extensive documentation and Jupyter Notebook tutorials.
If this feature list left you scratching your head, you can first read more about the Vector Space Model and unsupervised document analysis on Wikipedia.
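As a rough illustration of the streaming idea in the feature list, a corpus can be any iterable that yields one sparse document vector at a time, so the full collection never has to sit in memory. The sketch below is plain Python with a hypothetical `StreamedCorpus` class and a hand-written `vocab` mapping. In real use, a Gensim `Dictionary` and its `doc2bow` method would take the place of the manual counting.

```python
import os
import tempfile

class StreamedCorpus:
    """Iterate over a text file one document per line, never loading it whole."""

    def __init__(self, path, vocab):
        self.path = path
        self.vocab = vocab  # token -> integer id

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                counts = {}
                for token in line.lower().split():
                    if token in self.vocab:
                        token_id = self.vocab[token]
                        counts[token_id] = counts.get(token_id, 0) + 1
                # One sparse vector per document: sorted (token_id, count) pairs
                yield sorted(counts.items())

# Write a tiny demo file; in practice this would be a corpus too large for RAM.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("human computer interaction\ncomputer system computer\n")
    demo_path = tmp.name

vocab = {"human": 0, "computer": 1, "interaction": 2, "system": 3}
corpus = StreamedCorpus(demo_path, vocab)
vectors = list(corpus)  # materialized here only so the result can be inspected
print(vectors)  # [[(0, 1), (1, 1), (2, 1)], [(1, 2), (3, 1)]]
os.remove(demo_path)
```

Only one line of the file is held in memory at a time, so memory use depends on the longest document, not on the size of the corpus.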
Installation
This software depends on NumPy, a Python package for
scientific computing. Please bear in mind that building NumPy from source
(e.g. by installing gensim on a platform which lacks NumPy .whl distribution)
is a non-trivial task involving linking NumPy to a BLAS library.
It is recommended to provide a fast one (such as MKL, ATLAS or
OpenBLAS) which can improve performance by as much as an order of
magnitude. On OSX, NumPy picks up its vecLib BLAS automatically,
so you don't need to do anything special.
Install the latest version of gensim:
pip install --upgrade gensim
Or, if you have instead downloaded and unzipped the source tar.gz package:
tar -xvzf gensim-X.X.X.tar.gz
cd gensim-X.X.X/
pip install .
For alternative modes of installation, see the documentation.
Gensim is being continuously tested under all supported Python versions. Support for Python 2.7 was dropped in gensim 4.0.0; install gensim 3.8.3 if you must use Python 2.7.
How come gensim is so fast and memory efficient? Isn't it pure Python, and isn't Python slow and greedy?
Many scientific algorithms can be expressed in terms of large matrix operations (see the BLAS note above). Gensim taps into these low-level BLAS libraries, by means of its dependency on NumPy. So while gensim-the-top-level-code is pure Python, it actually executes highly optimized Fortran/C under the hood, including multithreading (if your BLAS is so configured).
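The point about matrix operations can be sketched with a small comparison. The example below computes the same cosine scores twice, once with explicit Python loops and once as a single matrix-vector product handed to NumPy. The function names and toy data are made up for illustration. Whether the product actually reaches an optimized BLAS routine depends on how NumPy was built.

```python
import numpy as np

def cosine_loop(doc_vectors, query):
    """Pure-Python reference: explicit loops over documents and dimensions."""
    scores = []
    q_norm = sum(x * x for x in query) ** 0.5
    for row in doc_vectors:
        dot = sum(a * b for a, b in zip(row, query))
        r_norm = sum(a * a for a in row) ** 0.5
        scores.append(dot / (r_norm * q_norm))
    return scores

def cosine_matrix(doc_matrix, query):
    """Same computation as one matrix-vector product, delegated to NumPy."""
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    return (doc_matrix @ query) / (doc_norms * np.linalg.norm(query))

# Toy dense document vectors, one row per document
docs = np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 1.0], [2.0, 2.0, 0.0]])
query = np.array([1.0, 1.0, 1.0])

loop_scores = cosine_loop(docs.tolist(), query.tolist())
matrix_scores = cosine_matrix(docs, query)
print(np.allclose(loop_scores, matrix_scores))  # True
```

Both versions give the same numbers. The matrix form moves the inner loops out of the Python interpreter, which is where the speedup described above comes from.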
Memory-wise, gensim makes heavy use of Python's built-in generators and iterators for streamed data processing. Memory efficiency was one of gensim's design goals, and is a central feature of gensim, rather than something bolted on as an afterthought.
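One practical detail of generator-based streaming is that a plain generator can be consumed only once, while an algorithm that makes several passes over the data needs an iterable it can restart. The sketch below contrasts the two. `token_stream` and `RestartableStream` are hypothetical names, and the in-memory list stands in for a file or database read lazily.

```python
def token_stream(lines):
    """One-shot generator: yields token lists lazily, exhausted after one pass."""
    for line in lines:
        yield line.lower().split()

class RestartableStream:
    """Iterable that creates a fresh generator on every pass over the data."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        return token_stream(self.lines)

lines = ["Human machine interface", "Graph minors survey"]

one_shot = token_stream(lines)
first = list(one_shot)
second = list(one_shot)       # already exhausted, so this is empty

restartable = RestartableStream(lines)
pass_one = list(restartable)
pass_two = list(restartable)  # a new generator is created, so the data is there again

print(len(first), len(second), len(pass_one), len(pass_two))  # 2 0 2 2
```

Wrapping the generator in a class with `__iter__` keeps the laziness and still allows repeated passes.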
Documentation
Support
For commercial support, please see Gensim sponsorship.
Ask open-ended questions on the public Gensim Mailing List.
Raise bugs on GitHub but please make sure you follow the issue template. Issues that are not bugs or fail to provide the requested details will be closed without inspection.
Adopters
Company | Industry | Use of Gensim |
---|---|---|
RARE Technologies | ML & NLP consulting | Creators of Gensim – this is us! |
Amazon | Retail | Document similarity. |
National Institutes of Health | Health | Processing grants and publications with word2vec. |
Cisco Security | Security | Large-scale fraud detection. |
Mindseye | Legal | Similarities in legal documents. |
Channel 4 | Media | Recommendation engine. |
Talentpair | HR | Candidate matching in high-touch recruiting. |
Juju | HR | Provide non-obvious related job suggestions. |
Tailwind | Media | Post interesting and relevant content to Pinterest. |
Issuu | Media | Gensim's LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it's all about. |
Search Metrics | Content Marketing | Gensim word2vec used for entity disambiguation in Search Engine Optimisation. |
12K Research | Media | Document similarity analysis on media articles. |
Stillwater Supercomputing | Hardware | Document comprehension and association with word2vec. |
SiteGround | Web hosting | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA. |
Capital One | Finance | Topic modeling for customer complaints exploration. |
Citing gensim
When citing gensim in academic papers and theses, please use this BibTeX entry:
@inproceedings{rehurek_lrec,
title = {{Software Framework for Topic Modelling with Large Corpora}},
author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka},
booktitle = {{Proceedings of the LREC 2010 Workshop on New
Challenges for NLP Frameworks}},
pages = {45--50},
year = 2010,
month = May,
day = 22,
publisher = {ELRA},
address = {Valletta, Malta},
note={\url{http://is.muni.cz/publication/884893/en}},
language={English}
}