Top Related Projects
- pointer-generator: Code for the ACL 2017 paper "Get To The Point: Summarization with Pointer-Generator Networks"
- LASER: Language-Agnostic SEntence Representations
- transformers: 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
- spaCy: 💫 Industrial-strength Natural Language Processing (NLP) in Python
- NLTK: NLTK Source
Quick Overview
Sumy is a Python library and command-line utility for automatic text summarization. It provides multiple summarization algorithms and supports various input formats, making it a versatile tool for extracting key information from large texts.
Pros
- Offers multiple summarization algorithms, including LSA, LexRank, and TextRank
- Supports multiple input formats (plain text and HTML, from strings, files, or URLs)
- Can be used as a Python library or command-line tool
- Easy to install and use
Cons
- Limited to extractive summarization (selects existing sentences)
- May not perform as well as more advanced deep learning-based summarizers
- Documentation could be more comprehensive
- Some algorithms may be slower for very large texts
Code Examples
- Basic usage with TextRank algorithm:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
text = "Your long text here..."
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = TextRankSummarizer()
summary = summarizer(parser.document, sentences_count=3)
for sentence in summary:
    print(sentence)
- Using LexRank with a file input:
from sumy.parsers.html import HtmlParser
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.nlp.tokenizers import Tokenizer
url = "https://example.com/article.html"
parser = HtmlParser.from_url(url, Tokenizer("english"))
summarizer = LexRankSummarizer()
summary = summarizer(parser.document, sentences_count=5)
for sentence in summary:
    print(sentence)
- Comparing multiple summarization methods:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer
text = "Your long text here..."
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizers = [LsaSummarizer(), LuhnSummarizer(), TextRankSummarizer()]
for summarizer in summarizers:
    print(f"\n{summarizer.__class__.__name__} summary:")
    summary = summarizer(parser.document, sentences_count=3)
    for sentence in summary:
        print(sentence)
Getting Started
To get started with Sumy, follow these steps:
1. Install Sumy using pip:
pip install sumy
2. Import the necessary modules:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
3. Create a parser and summarizer:
parser = PlaintextParser.from_string("Your text here", Tokenizer("english"))
summarizer = LsaSummarizer()
4. Generate and print the summary:
summary = summarizer(parser.document, sentences_count=3)
for sentence in summary:
    print(sentence)
Competitor Comparisons
Code for the ACL 2017 paper "Get To The Point: Summarization with Pointer-Generator Networks"
Pros of pointer-generator
- Implements a novel approach combining pointer networks and coverage mechanisms for text summarization
- Designed specifically for abstractive summarization, allowing generation of new phrases
- Provides pre-trained models and datasets for easy experimentation
Cons of pointer-generator
- More complex architecture, potentially requiring more computational resources
- Focused on a single neural, abstractive approach, while sumy offers several extractive summarization algorithms
- Less actively maintained, with fewer recent updates compared to sumy
Code comparison
sumy:
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
LANGUAGE = "english"
SENTENCES_COUNT = 5
url = "https://en.wikipedia.org/wiki/Automatic_summarization"
parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, SENTENCES_COUNT)
pointer-generator:
# Simplified sketch of the repository's TensorFlow training loop
# (hps and vocab come from the repo's own config and vocabulary setup;
# class names may differ from the actual code)
from model import Model
from batcher import Batcher

model = Model(hps, vocab)
batcher = Batcher(hps, vocab)
batch = batcher.next_batch()
output = model.run_train_step(batch)
Key differences
- sumy offers a straightforward API covering several classical summarization algorithms
- pointer-generator focuses on a single neural network-based, abstractive approach
- sumy is lightweight and needs no model training, while pointer-generator requires trained models and more compute
- pointer-generator is more suitable for researchers working on advanced NLP models
Language-Agnostic SEntence Representations
Pros of LASER
- Supports multilingual and cross-lingual text processing for over 90 languages
- Utilizes advanced neural network models for high-quality language embeddings
- Offers pre-trained models for immediate use in various NLP tasks
Cons of LASER
- More complex setup and usage compared to Sumy
- Requires more computational resources due to its advanced models
- Less focused on specific summarization tasks
Code Comparison
Sumy (text summarization):
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
url = "https://example.com/article.html"
parser = HtmlParser.from_url(url, Tokenizer("english"))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, sentences_count=5)
LASER (sentence embeddings):
from laserembeddings import Laser
laser = Laser()
embeddings = laser.embed_sentences(
    ['Hello world', 'Bonjour le monde'],
    lang=['en', 'fr']
)
Key Differences
- Sumy focuses specifically on text summarization, while LASER is a more general-purpose multilingual NLP tool
- Sumy offers various summarization algorithms, whereas LASER provides language-agnostic sentence embeddings
- LASER is better suited for cross-lingual tasks and large-scale multilingual applications
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of transformers
- Extensive library with support for various NLP tasks and models
- Regular updates and active community support
- Seamless integration with PyTorch and TensorFlow
Cons of transformers
- Steeper learning curve due to its complexity
- Higher computational requirements for large models
- May be overkill for simple text summarization tasks
Code comparison
sumy:
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
url = "https://example.com/article.html"
parser = HtmlParser.from_url(url, Tokenizer("english"))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, sentences_count=3)
transformers:
from transformers import pipeline
text = "Your long article text here..."
summarizer = pipeline("summarization")
summary = summarizer(text, max_length=100, min_length=30, do_sample=False)
Key differences
- sumy focuses specifically on text summarization, while transformers covers a wide range of NLP tasks
- transformers leverages pre-trained models, potentially offering better performance for complex summarization tasks
- sumy is more lightweight and may be easier to set up for simple summarization needs
💫 Industrial-strength Natural Language Processing (NLP) in Python
Pros of spaCy
- Comprehensive NLP library with a wide range of features
- Highly optimized for performance, suitable for large-scale processing
- Active development and strong community support
Cons of spaCy
- Steeper learning curve due to its extensive functionality
- Larger memory footprint and longer initial loading time
- May be overkill for simple text summarization tasks
Code Comparison
spaCy (text processing):
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sample sentence.")
for token in doc:
    print(token.text, token.pos_, token.dep_)
Sumy (text summarization):
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
parser = PlaintextParser.from_string("Your text here", Tokenizer("english"))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, 3) # 3 sentences
While spaCy is a comprehensive NLP toolkit, Sumy focuses specifically on text summarization. spaCy offers more advanced language processing capabilities, but Sumy provides a simpler interface for generating summaries. Choose spaCy for complex NLP tasks or when performance is crucial, and Sumy for quick and straightforward text summarization.
NLTK Source
Pros of NLTK
- Comprehensive library with a wide range of NLP tools and functionalities
- Extensive documentation and community support
- Suitable for both research and production environments
Cons of NLTK
- Steeper learning curve due to its extensive feature set
- Can be slower for certain tasks compared to more specialized libraries
- Larger footprint and potentially unnecessary features for simple projects
Code Comparison
NLTK (text summarization):
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.probability import FreqDist
# Tokenize and preprocess text (`text` holds the document to summarize)
sentences = sent_tokenize(text)
words = word_tokenize(text.lower())
words = [word for word in words if word not in stopwords.words('english')]
# Keep sentences containing any of the ten most frequent words
freq = FreqDist(words)
top_words = {word for word, _ in freq.most_common(10)}
summary = ' '.join(sent for sent in sentences
                   if any(word in top_words for word in word_tokenize(sent.lower())))
Sumy (text summarization):
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, 3) # 3 sentences
README
Automatic text summarizer
Simple library and command-line utility for extracting a summary from HTML pages or plain texts. The package also contains a simple evaluation framework for text summaries. Implemented summarization methods are described in the documentation. I also maintain a list of alternative implementations of the summarizers in various programming languages.
Is my natural language supported?
There is a good chance it is. But if not, it is not too hard to add it.
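For example, here is a minimal sketch of summarizing German text; it mirrors the Python API example further below and assumes "german" is among the bundled language names (check the documentation for the exact list):
# Minimal sketch, assuming "german" is a supported language name in sumy
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.utils import get_stop_words

LANGUAGE = "german"
TEXT = "Hier steht Ihr langer deutscher Text ..."

parser = PlaintextParser.from_string(TEXT, Tokenizer(LANGUAGE))
summarizer = LsaSummarizer(Stemmer(LANGUAGE))
summarizer.stop_words = get_stop_words(LANGUAGE)
for sentence in summarizer(parser.document, 2):
    print(sentence)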
Installation
Make sure you have Python 3.6+ and pip installed (Windows, Linux). Then simply run (preferred way):
$ [sudo] pip install sumy
$ [sudo] pip install git+https://github.com/miso-belica/sumy.git # for the fresh version
Usage
Thanks to some good soul out there, the easiest way to try sumy is in your browser at https://huggingface.co/spaces/issam9/sumy_space
Sumy also ships with a command-line utility for quick summarization of documents.
$ sumy lex-rank --length=10 --url=https://en.wikipedia.org/wiki/Automatic_summarization # what's summarization?
$ sumy lex-rank --language=uk --length=30 --url=https://uk.wikipedia.org/wiki/Україна
$ sumy luhn --language=czech --url=https://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy edmundson --language=czech --length=3% --url=https://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy --help # for more info
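The CLI is not limited to URLs. As a rough sketch, a local plain-text file can be summarized like this (the --file and --format options are the ones listed by sumy --help; verify them against your installed version):
$ sumy text-rank --length=5 --file=document.txt --format=plaintext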
Various evaluation methods for a given summarization method can be executed with the commands below:
$ sumy_eval lex-rank reference_summary.txt --url=https://en.wikipedia.org/wiki/Automatic_summarization
$ sumy_eval lsa reference_summary.txt --language=czech --url=https://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy_eval edmundson reference_summary.txt --language=czech --url=https://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy_eval --help # for more info
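The same kind of evaluation can also be done from Python. The sketch below is an assumption about the layout of the sumy.evaluation package (the function names mirror the metrics used by sumy_eval), so check the installed package before relying on it:
# Hypothetical sketch: scoring a generated summary against a reference summary
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.evaluation.rouge import rouge_1, rouge_2  # assumed module path

document = PlaintextParser.from_string("Your long text here...", Tokenizer("english")).document
reference = PlaintextParser.from_string("Your reference summary here...", Tokenizer("english")).document
candidate = LexRankSummarizer()(document, 3)  # generated summary sentences
print("ROUGE-1:", rouge_1(candidate, reference.sentences))
print("ROUGE-2:", rouge_2(candidate, reference.sentences))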
If you don't want to bother with installation, you can try it as a container.
$ docker run --rm misobelica/sumy lex-rank --length=10 --url=https://en.wikipedia.org/wiki/Automatic_summarization
Python API
Or you can use sumy as a library in your project. Create a file named sumy_example.py (don't name it sumy.py, otherwise it would shadow the sumy package itself) with the code below to test it.
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
LANGUAGE = "english"
SENTENCES_COUNT = 10
if __name__ == "__main__":
    url = "https://en.wikipedia.org/wiki/Automatic_summarization"
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    # or for plain text files
    # parser = PlaintextParser.from_file("document.txt", Tokenizer(LANGUAGE))
    # parser = PlaintextParser.from_string("Check this out.", Tokenizer(LANGUAGE))
    stemmer = Stemmer(LANGUAGE)

    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)

    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
Interesting projects using sumy
I found some interesting projects while browsing the internet, and sometimes people wrote me an e-mail with questions, so I was curious how they use sumy :)
- Learning to generate questions from text - https://github.com/adityasarvaiya/Automatic_Question_Generation
- Summarize your video to any duration - https://github.com/aswanthkoleri/VideoMash and similar https://github.com/OpenGenus/vidsum
- Tool for collectively summarizing large discussions - https://github.com/amyxzhang/wikum
- AutoTL;DR bot for Lemmy uses sumy: https://github.com/RikudouSage/LemmyAutoTldrBot