
piskvorky / gensim

Topic Modelling for Humans


Top Related Projects

  • smart_open: Utils for streaming large files (S3, HDFS, gzip, bz2...)
  • spaCy: 💫 Industrial-strength Natural Language Processing (NLP) in Python
  • scikit-learn: machine learning in Python
  • 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
  • NLTK: NLTK Source
  • XGBoost: Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

Quick Overview

Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora. It provides efficient multicore implementations of popular algorithms like Latent Semantic Analysis, Latent Dirichlet Allocation, and Word2Vec. Gensim is designed to process raw, unstructured digital texts using unsupervised machine learning algorithms.

Pros

  • Efficient and scalable implementation for processing large text corpora
  • Supports various popular topic modeling and word embedding algorithms
  • Builds on NumPy and SciPy, and converts corpora to and from their dense-array and sparse-matrix formats
  • Extensive documentation and active community support

Cons

  • Steep learning curve for beginners in NLP and topic modeling
  • Some advanced features may require additional dependencies
  • Performance can be slower compared to specialized implementations for specific algorithms
  • Limited support for deep learning-based NLP models

Code Examples

  1. Loading and preprocessing a corpus:
from gensim import corpora
from gensim.utils import simple_preprocess

texts = [
    "The quick brown fox jumps over the lazy dog",
    "A man a plan a canal panama",
    "I love natural language processing"
]
processed_texts = [simple_preprocess(doc) for doc in texts]
dictionary = corpora.Dictionary(processed_texts)
corpus = [dictionary.doc2bow(text) for text in processed_texts]
  2. Training a Latent Dirichlet Allocation (LDA) model:
from gensim.models import LdaMulticore

lda_model = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=3)
topics = lda_model.print_topics()
for topic in topics:
    print(topic)
  3. Creating and using a Word2Vec model:
from gensim.models import Word2Vec

model = Word2Vec(sentences=processed_texts, vector_size=100, window=5, min_count=1, workers=4)
vector = model.wv['fox']
similar_words = model.wv.most_similar('fox', topn=5)
print(similar_words)

Getting Started

To get started with Gensim, first install it using pip:

pip install gensim

Then, import the necessary modules and start processing your text data:

from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess

# Prepare your corpus
texts = ["Your first document here", "Your second document here"]  # add more documents as needed
processed_texts = [simple_preprocess(doc) for doc in texts]

# Create a dictionary and corpus
dictionary = corpora.Dictionary(processed_texts)
corpus = [dictionary.doc2bow(text) for text in processed_texts]

# Train an LDA model
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)

# Print the topics
print(lda_model.print_topics())

This basic example demonstrates how to preprocess text, create a corpus, and train an LDA model using Gensim.

Competitor Comparisons

Utils for streaming large files (S3, HDFS, gzip, bz2...)

Pros of smart_open

  • Lightweight and focused on I/O operations
  • Supports a wide range of storage systems (local, S3, GCS, Azure Blob, etc.)
  • Easy to use and integrate into existing projects

Cons of smart_open

  • Limited to I/O operations, lacks advanced NLP features
  • Smaller community and fewer contributors
  • Less frequent updates and releases

Code Comparison

smart_open:

from smart_open import open

with open('s3://bucket/key.txt', 'w') as fout:
    fout.write('hello world')

gensim:

from gensim.models import Word2Vec

sentences = [['this', 'is', 'a', 'sentence'], ['another', 'sentence']]
model = Word2Vec(sentences, min_count=1)

Summary

smart_open is a specialized library for handling I/O operations across various storage systems, while gensim is a comprehensive NLP library. smart_open excels in simplifying file access, but lacks the advanced text processing capabilities of gensim. gensim offers a wide range of NLP tools and models, but may be overkill for projects that only require simple I/O operations. Choose smart_open for efficient file handling across different storage systems, and gensim for more complex NLP tasks and text processing.

💫 Industrial-strength Natural Language Processing (NLP) in Python

Pros of spaCy

  • More comprehensive NLP toolkit with advanced features like named entity recognition and dependency parsing
  • Faster processing speed, especially for large-scale text analysis
  • Better documentation and more active community support

Cons of spaCy

  • Steeper learning curve due to its more complex architecture
  • Less flexible for custom topic modeling compared to Gensim
  • Larger memory footprint, which can be an issue for resource-constrained environments

Code Comparison

spaCy:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)

Gensim:

from gensim import corpora, models
texts = [["apple", "buy", "startup"], ["uk", "billion", "dollar"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda_model = models.LdaModel(corpus, num_topics=2, id2word=dictionary)

scikit-learn: machine learning in Python

Pros of scikit-learn

  • Broader scope, covering a wide range of machine learning algorithms and techniques
  • More extensive documentation and community support
  • Better integration with other scientific Python libraries (NumPy, SciPy, Pandas)

Cons of scikit-learn

  • Less specialized for natural language processing and topic modeling tasks
  • May have slower performance for specific text processing operations
  • Steeper learning curve for users primarily interested in text analysis

Code Comparison

scikit-learn (text classification):

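from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data for illustration: raw documents and their class labels
corpus = ["great movie", "terrible movie", "great acting"]
y = [1, 0, 1]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
clf = MultinomialNB().fit(X, y)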

gensim (topic modeling):

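from gensim import corpora, models

# Toy tokenized documents for illustration
texts = [["topic", "model", "text"], ["word", "embedding", "text"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda_model = models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=10)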

Both libraries offer powerful tools for text analysis, but gensim is more specialized for topic modeling and word embeddings, while scikit-learn provides a broader range of machine learning algorithms. The choice between them depends on the specific requirements of your project and your familiarity with each library.

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Pros of transformers

  • Extensive support for state-of-the-art transformer models
  • Seamless integration with PyTorch and TensorFlow
  • Active development and frequent updates

Cons of transformers

  • Steeper learning curve for beginners
  • Higher computational requirements for large models
  • More focused on transformer architectures, less versatile for traditional NLP tasks

Code Comparison

transformers:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

gensim:

from gensim.models import Word2Vec
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, min_count=1)
vector = model.wv['dog']

NLTK Source

Pros of NLTK

  • Comprehensive suite of text processing libraries and educational resources
  • Extensive documentation and academic support
  • Broader range of linguistic tools, including parsing and semantic analysis

Cons of NLTK

  • Slower performance for large-scale text processing tasks
  • Less focus on modern machine learning techniques for NLP
  • Steeper learning curve for beginners

Code Comparison

NLTK example:

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

text = "Natural language processing is fascinating."
tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))
filtered_tokens = [w for w in tokens if w.lower() not in stop_words]

Gensim example:

from gensim import corpora, models

texts = [['natural', 'language', 'processing', 'fascinating']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda_model = models.LdaModel(corpus, num_topics=1, id2word=dictionary)

NLTK focuses on linguistic analysis and text preprocessing, while Gensim specializes in topic modeling and document similarity. NLTK provides a wider range of NLP tools, but Gensim offers more efficient implementations for specific tasks like word embeddings and document clustering.

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

Pros of XGBoost

  • Highly optimized for performance and speed in gradient boosting
  • Supports distributed computing for large-scale machine learning tasks
  • Offers built-in cross-validation and early stopping features

Cons of XGBoost

  • Steeper learning curve compared to Gensim's user-friendly interface
  • More focused on supervised learning, while Gensim excels in unsupervised tasks
  • Less suitable for natural language processing tasks than Gensim

Code Comparison

XGBoost (for classification):

import xgboost as xgb
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Gensim (for topic modeling):

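from gensim import corpora, models
# Toy tokenized documents for illustration
texts = [["topic", "model", "text"], ["word", "embedding", "text"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda_model = models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=10)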

XGBoost is primarily used for supervised learning tasks like classification and regression, while Gensim focuses on unsupervised natural language processing tasks such as topic modeling and document similarity. XGBoost offers high performance and scalability for machine learning, whereas Gensim provides tools for working with textual data and semantic analysis.

README

gensim – Topic Modelling in Python

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.

⚠️ Want to help out? Sponsor Gensim ❤️

⚠️ Gensim is in stable maintenance mode: we are not accepting new features, but bug and documentation fixes are still welcome! ⚠️

Features

  • All algorithms are memory-independent w.r.t. the corpus size (can process input larger than RAM, streamed, out-of-core),
  • Intuitive interfaces
    • easy to plug in your own input corpus/datastream (trivial streaming API)
    • easy to extend with other Vector Space algorithms (trivial transformation API)
  • Efficient multicore implementations of popular algorithms, such as online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP) or word2vec deep learning.
  • Distributed computing: can run Latent Semantic Analysis and Latent Dirichlet Allocation on a cluster of computers.
  • Extensive documentation and Jupyter Notebook tutorials.

If this feature list left you scratching your head, you can first read more about the Vector Space Model and unsupervised document analysis on Wikipedia.

Installation

This software depends on NumPy, a Python package for scientific computing. Please bear in mind that building NumPy from source (e.g. by installing gensim on a platform which lacks a NumPy .whl distribution) is a non-trivial task involving linking NumPy to a BLAS library. It is recommended to provide a fast one (such as MKL, ATLAS or OpenBLAS) which can improve performance by as much as an order of magnitude. On OSX, NumPy picks up its vecLib BLAS automatically, so you don’t need to do anything special.

Install the latest version of gensim:

    pip install --upgrade gensim

Or, if you have instead downloaded and unzipped the source tar.gz package:

    tar -xvzf gensim-X.X.X.tar.gz
    cd gensim-X.X.X/
    pip install .

For alternative modes of installation, see the documentation.

Gensim is being continuously tested under all supported Python versions. Support for Python 2.7 was dropped in gensim 4.0.0 – install gensim 3.8.3 if you must use Python 2.7.

How come gensim is so fast and memory efficient? Isn’t it pure Python, and isn’t Python slow and greedy?

Many scientific algorithms can be expressed in terms of large matrix operations (see the BLAS note above). Gensim taps into these low-level BLAS libraries, by means of its dependency on NumPy. So while gensim-the-top-level-code is pure Python, it actually executes highly optimized Fortran/C under the hood, including multithreading (if your BLAS is so configured).

Memory-wise, gensim makes heavy use of Python’s built-in generators and iterators for streamed data processing. Memory efficiency was one of gensim’s design goals, and is a central feature of gensim, rather than something bolted on as an afterthought.

Documentation

Support

For commercial support, please see Gensim sponsorship.

Ask open-ended questions on the public Gensim Mailing List.

Raise bugs on GitHub, but please make sure you follow the issue template. Issues that are not bugs or fail to provide the requested details will be closed without inspection.


Adopters

| Company | Industry | Use of Gensim |
|---|---|---|
| RARE Technologies | ML & NLP consulting | Creators of Gensim – this is us! |
| Amazon | Retail | Document similarity. |
| National Institutes of Health | Health | Processing grants and publications with word2vec. |
| Cisco Security | Security | Large-scale fraud detection. |
| Mindseye | Legal | Similarities in legal documents. |
| Channel 4 | Media | Recommendation engine. |
| Talentpair | HR | Candidate matching in high-touch recruiting. |
| Juju | HR | Provide non-obvious related job suggestions. |
| Tailwind | Media | Post interesting and relevant content to Pinterest. |
| Issuu | Media | Gensim's LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it's all about. |
| Search Metrics | Content Marketing | Gensim word2vec used for entity disambiguation in Search Engine Optimisation. |
| 12K Research | Media | Document similarity analysis on media articles. |
| Stillwater Supercomputing | Hardware | Document comprehension and association with word2vec. |
| SiteGround | Web hosting | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA. |
| Capital One | Finance | Topic modeling for customer complaints exploration. |

Citing gensim

When citing gensim in academic papers and theses, please use this BibTeX entry:

@inproceedings{rehurek_lrec,
      title = {{Software Framework for Topic Modelling with Large Corpora}},
      author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka},
      booktitle = {{Proceedings of the LREC 2010 Workshop on New
           Challenges for NLP Frameworks}},
      pages = {45--50},
      year = 2010,
      month = May,
      day = 22,
      publisher = {ELRA},
      address = {Valletta, Malta},
      note={\url{http://is.muni.cz/publication/884893/en}},
      language={English}
}