Top Related Projects
- pointer-generator: Code for the ACL 2017 paper "Get To The Point: Summarization with Pointer-Generator Networks"
- LASER: Language-Agnostic SEntence Representations
- transformers: 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
- spaCy: 💫 Industrial-strength Natural Language Processing (NLP) in Python
- NLTK: NLTK Source
Quick Overview
Sumy is a Python library and command-line utility for automatic text summarization. It provides multiple summarization algorithms and supports various input formats, making it a versatile tool for extracting key information from large texts.
Pros
- Offers multiple summarization algorithms, including LSA, LexRank, and TextRank
- Supports multiple input formats (plain text and HTML, from strings, files, or URLs)
- Can be used as a Python library or command-line tool
- Easy to install and use
Cons
- Limited to extractive summarization (selects existing sentences)
- May not perform as well as more advanced deep learning-based summarizers
- Documentation could be more comprehensive
- Some algorithms may be slower for very large texts
Code Examples
- Basic usage with TextRank algorithm:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
text = "Your long text here..."
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = TextRankSummarizer()
summary = summarizer(parser.document, sentences_count=3)
for sentence in summary:
    print(sentence)
- Using LexRank with a file input:
from sumy.parsers.html import HtmlParser
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.nlp.tokenizers import Tokenizer
url = "https://example.com/article.html"
parser = HtmlParser.from_url(url, Tokenizer("english"))
summarizer = LexRankSummarizer()
summary = summarizer(parser.document, sentences_count=5)
for sentence in summary:
    print(sentence)
- Comparing multiple summarization methods:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer
text = "Your long text here..."
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizers = [LsaSummarizer(), LuhnSummarizer(), TextRankSummarizer()]
for summarizer in summarizers:
    print(f"\n{summarizer.__class__.__name__} summary:")
    summary = summarizer(parser.document, sentences_count=3)
    for sentence in summary:
        print(sentence)
Getting Started
To get started with Sumy, follow these steps:
1. Install Sumy using pip:
pip install sumy
2. Import the necessary modules:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
3. Create a parser and summarizer:
parser = PlaintextParser.from_string("Your text here", Tokenizer("english"))
summarizer = LsaSummarizer()
4. Generate and print the summary:
summary = summarizer(parser.document, sentences_count=3)
for sentence in summary:
    print(sentence)
Competitor Comparisons
Code for the ACL 2017 paper "Get To The Point: Summarization with Pointer-Generator Networks"
Pros of pointer-generator
- Implements a novel approach combining pointer networks and coverage mechanisms for text summarization
- Designed specifically for abstractive summarization, allowing generation of new phrases
- Provides pre-trained models and datasets for easy experimentation
Cons of pointer-generator
- More complex architecture, potentially requiring more computational resources
- Focused on a single neural, abstractive approach, while sumy offers several extractive summarization algorithms
- Less actively maintained, with fewer recent updates compared to sumy
Code comparison
sumy:
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
LANGUAGE = "english"
SENTENCES_COUNT = 5
url = "https://en.wikipedia.org/wiki/Automatic_summarization"
parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, SENTENCES_COUNT)
pointer-generator:
# Simplified sketch of the repository's TensorFlow training loop
# (hps and vocab come from the repo's own config and vocabulary setup;
# class names may differ from the actual code)
from model import Model
from batcher import Batcher

model = Model(hps, vocab)
batcher = Batcher(hps, vocab)
batch = batcher.next_batch()
output = model.run_train_step(batch)
Key differences
- sumy offers a straightforward API covering several classical summarization algorithms
- pointer-generator focuses on a single neural network-based, abstractive approach
- sumy is lightweight and needs no model training, while pointer-generator requires trained models and more compute
- pointer-generator is more suitable for researchers working on advanced NLP models
Language-Agnostic SEntence Representations
Pros of LASER
- Supports multilingual and cross-lingual text processing for over 90 languages
- Utilizes advanced neural network models for high-quality language embeddings
- Offers pre-trained models for immediate use in various NLP tasks
Cons of LASER
- More complex setup and usage compared to Sumy
- Requires more computational resources due to its advanced models
- Less focused on specific summarization tasks
Code Comparison
Sumy (text summarization):
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
url = "https://example.com/article.html"
parser = HtmlParser.from_url(url, Tokenizer("english"))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, sentences_count=5)
LASER (sentence embeddings):
from laserembeddings import Laser
laser = Laser()
embeddings = laser.embed_sentences(
    ['Hello world', 'Bonjour le monde'],
    lang=['en', 'fr']
)
Key Differences
- Sumy focuses specifically on text summarization, while LASER is a more general-purpose multilingual NLP tool
- Sumy offers various summarization algorithms, whereas LASER provides language-agnostic sentence embeddings
- LASER is better suited for cross-lingual tasks and large-scale multilingual applications
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of transformers
- Extensive library with support for various NLP tasks and models
- Regular updates and active community support
- Seamless integration with PyTorch and TensorFlow
Cons of transformers
- Steeper learning curve due to its complexity
- Higher computational requirements for large models
- May be overkill for simple text summarization tasks
Code comparison
sumy:
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
url = "https://example.com/article.html"
parser = HtmlParser.from_url(url, Tokenizer("english"))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, sentences_count=3)
transformers:
from transformers import pipeline
text = "Your long article text here..."
summarizer = pipeline("summarization")
summary = summarizer(text, max_length=100, min_length=30, do_sample=False)
Key differences
- sumy focuses specifically on text summarization, while transformers covers a wide range of NLP tasks
- transformers leverages pre-trained models, potentially offering better performance for complex summarization tasks
- sumy is more lightweight and may be easier to set up for simple summarization needs
💫 Industrial-strength Natural Language Processing (NLP) in Python
Pros of spaCy
- Comprehensive NLP library with a wide range of features
- Highly optimized for performance, suitable for large-scale processing
- Active development and strong community support
Cons of spaCy
- Steeper learning curve due to its extensive functionality
- Larger memory footprint and longer initial loading time
- May be overkill for simple text summarization tasks
Code Comparison
spaCy (text processing):
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sample sentence.")
for token in doc:
    print(token.text, token.pos_, token.dep_)
Sumy (text summarization):
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
parser = PlaintextParser.from_string("Your text here", Tokenizer("english"))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, 3) # 3 sentences
While spaCy is a comprehensive NLP toolkit, Sumy focuses specifically on text summarization. spaCy offers more advanced language processing capabilities, but Sumy provides a simpler interface for generating summaries. Choose spaCy for complex NLP tasks or when performance is crucial, and Sumy for quick and straightforward text summarization.
NLTK Source
Pros of NLTK
- Comprehensive library with a wide range of NLP tools and functionalities
- Extensive documentation and community support
- Suitable for both research and production environments
Cons of NLTK
- Steeper learning curve due to its extensive feature set
- Can be slower for certain tasks compared to more specialized libraries
- Larger footprint and potentially unnecessary features for simple projects
Code Comparison
NLTK (text summarization):
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.probability import FreqDist
# Tokenize and preprocess text (`text` holds the document to summarize)
sentences = sent_tokenize(text)
words = word_tokenize(text.lower())
words = [word for word in words if word not in stopwords.words('english')]
# Keep sentences containing any of the ten most frequent words
freq = FreqDist(words)
top_words = {word for word, _ in freq.most_common(10)}
summary = ' '.join(sent for sent in sentences
                   if any(word in top_words for word in word_tokenize(sent.lower())))
Sumy (text summarization):
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, 3) # 3 sentences
README
Automatic text summarizer
Simple library and command-line utility for extracting a summary from HTML pages or plain texts. The package also contains a simple evaluation framework for text summaries. Implemented summarization methods are described in the documentation. I also maintain a list of alternative implementations of the summarizers in various programming languages.
Is my natural language supported?
There is a good chance it is. But if not, it is not too hard to add it.
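For example, here is a minimal sketch of summarizing German text; it mirrors the Python API example further below and assumes "german" is among the bundled language names (check the documentation for the exact list):
# Minimal sketch, assuming "german" is a supported language name in sumy
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.utils import get_stop_words

LANGUAGE = "german"
TEXT = "Hier steht Ihr langer deutscher Text ..."

parser = PlaintextParser.from_string(TEXT, Tokenizer(LANGUAGE))
summarizer = LsaSummarizer(Stemmer(LANGUAGE))
summarizer.stop_words = get_stop_words(LANGUAGE)
for sentence in summarizer(parser.document, 2):
    print(sentence)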
Installation
Make sure you have Python 3.6+ and pip installed (Windows, Linux). Then simply run (preferred way):
$ [sudo] pip install sumy
$ [sudo] pip install git+https://github.com/miso-belica/sumy.git # for the fresh version
Usage
Thanks to some good soul out there, the easiest way to try sumy is in your browser at https://huggingface.co/spaces/issam9/sumy_space
Sumy also ships with a command-line utility for quick summarization of documents.
$ sumy lex-rank --length=10 --url=https://en.wikipedia.org/wiki/Automatic_summarization # what's summarization?
$ sumy lex-rank --language=uk --length=30 --url=https://uk.wikipedia.org/wiki/Україна
$ sumy luhn --language=czech --url=https://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy edmundson --language=czech --length=3% --url=https://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy --help # for more info
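The CLI is not limited to URLs. As a rough sketch, a local plain-text file can be summarized like this (the --file and --format options are the ones listed by sumy --help; verify them against your installed version):
$ sumy text-rank --length=5 --file=document.txt --format=plaintext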
Various evaluation methods for a given summarization method can be executed with the commands below:
$ sumy_eval lex-rank reference_summary.txt --url=https://en.wikipedia.org/wiki/Automatic_summarization
$ sumy_eval lsa reference_summary.txt --language=czech --url=https://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy_eval edmundson reference_summary.txt --language=czech --url=https://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy_eval --help # for more info
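The same kind of evaluation can also be done from Python. The sketch below is an assumption about the layout of the sumy.evaluation package (the function names mirror the metrics used by sumy_eval), so check the installed package before relying on it:
# Hypothetical sketch: scoring a generated summary against a reference summary
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.evaluation.rouge import rouge_1, rouge_2  # assumed module path

document = PlaintextParser.from_string("Your long text here...", Tokenizer("english")).document
reference = PlaintextParser.from_string("Your reference summary here...", Tokenizer("english")).document
candidate = LexRankSummarizer()(document, 3)  # generated summary sentences
print("ROUGE-1:", rouge_1(candidate, reference.sentences))
print("ROUGE-2:", rouge_2(candidate, reference.sentences))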
If you don't want to bother with installation, you can try it as a container.
$ docker run --rm misobelica/sumy lex-rank --length=10 --url=https://en.wikipedia.org/wiki/Automatic_summarization
Python API
Or you can use sumy as a library in your project. Create a file named sumy_example.py (don't name it sumy.py, otherwise it would shadow the sumy package itself) with the code below to test it.
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
LANGUAGE = "english"
SENTENCES_COUNT = 10
if __name__ == "__main__":
    url = "https://en.wikipedia.org/wiki/Automatic_summarization"
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    # or for plain text files
    # parser = PlaintextParser.from_file("document.txt", Tokenizer(LANGUAGE))
    # parser = PlaintextParser.from_string("Check this out.", Tokenizer(LANGUAGE))
    stemmer = Stemmer(LANGUAGE)

    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)

    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
Interesting projects using sumy
I found some interesting projects while browsing the internet, and sometimes people wrote me an e-mail with questions, so I was curious how they use sumy :)
- Learning to generate questions from text - https://github.com/adityasarvaiya/Automatic_Question_Generation
- Summarize your video to any duration - https://github.com/aswanthkoleri/VideoMash and similar https://github.com/OpenGenus/vidsum
- Tool for collectively summarizing large discussions - https://github.com/amyxzhang/wikum
- AutoTL;DR bot for Lemmy uses sumy: https://github.com/RikudouSage/LemmyAutoTldrBot