
miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.


Top Related Projects

  • pointer-generator: Code for the ACL 2017 paper "Get To The Point: Summarization with Pointer-Generator Networks"
  • LASER: Language-Agnostic SEntence Representations
  • transformers: 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
  • spaCy: 💫 Industrial-strength Natural Language Processing (NLP) in Python
  • NLTK: NLTK Source

Quick Overview

Sumy is a Python library and command-line utility for automatic text summarization. It provides multiple summarization algorithms and supports various input formats, making it a versatile tool for extracting key information from large texts.

Pros

  • Offers multiple summarization algorithms, including LSA, LexRank, and TextRank
  • Reads plain text and HTML, from strings, files, or URLs
  • Can be used as a Python library or command-line tool
  • Easy to install and use

Cons

  • Limited to extractive summarization (selects existing sentences)
  • May not perform as well as more advanced deep learning-based summarizers
  • Documentation could be more comprehensive
  • Some algorithms may be slower for very large texts
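
To make "extractive" concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and returns the top sentences unchanged. It is not one of Sumy's algorithms; the function names, the naive regex sentence splitter, and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, sentences_count=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    frequency = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(frequency[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:sentences_count])  # restore document order
    return [sentences[i] for i in chosen]


text = (
    "Cats sleep a lot. Cats chase mice and cats climb trees. "
    "The weather was mild yesterday. Dogs bark."
)
summary = extractive_summary(text, sentences_count=2)
for sentence in summary:
    print(sentence)
```

Every sentence in the result appears verbatim in the input, which is exactly the limitation listed above: an extractive method cannot rephrase or merge sentences.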

Code Examples

  1. Basic usage with TextRank algorithm:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

text = "Your long text here..."
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = TextRankSummarizer()
summary = summarizer(parser.document, sentences_count=3)

for sentence in summary:
    print(sentence)
  2. Using LexRank with an HTML page fetched from a URL:
from sumy.parsers.html import HtmlParser
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.nlp.tokenizers import Tokenizer

url = "https://example.com/article.html"
parser = HtmlParser.from_url(url, Tokenizer("english"))
summarizer = LexRankSummarizer()
summary = summarizer(parser.document, sentences_count=5)

for sentence in summary:
    print(sentence)
  3. Comparing multiple summarization methods:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer

text = "Your long text here..."
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizers = [LsaSummarizer(), LuhnSummarizer(), TextRankSummarizer()]

for summarizer in summarizers:
    print(f"\n{summarizer.__class__.__name__} summary:")
    summary = summarizer(parser.document, sentences_count=3)
    for sentence in summary:
        print(sentence)
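
TextRank and LexRank, used in the examples above, are graph-based methods: sentences are nodes, edges carry a similarity weight, and a PageRank-style iteration ranks the nodes. The sketch below shows that general idea using only the standard library. It is not Sumy's implementation; the Jaccard similarity, the damping factor, the iteration count, and all names are assumptions chosen for brevity.

```python
import re


def tokenize(sentence):
    return set(re.findall(r"[a-z]+", sentence.lower()))


def similarity(a, b):
    """Jaccard overlap between the word sets of two sentences."""
    words_a, words_b = tokenize(a), tokenize(b)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def rank_sentences(sentences, damping=0.85, iterations=50):
    """PageRank-style power iteration over a sentence similarity graph."""
    n = len(sentences)
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of the edge j -> i.
            incoming = sum(
                scores[j] * weights[j][i] / out_sums[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores  # relative scores; they are not normalized to sum to 1


sentences = [
    "Summarization shortens a long document.",
    "Extractive summarization selects sentences from the document.",
    "Graph methods rank sentences by similarity to other sentences.",
    "Bananas are yellow.",
]
scores = rank_sentences(sentences)
best = max(range(len(sentences)), key=scores.__getitem__)
print(sentences[best])
```

A sentence that shares no words with the others receives only the baseline score, so it is the last candidate for the summary.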

Getting Started

To get started with Sumy, follow these steps:

  1. Install Sumy using pip:

    pip install sumy
    
  2. Import the necessary modules:

    from sumy.parsers.plaintext import PlaintextParser
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.summarizers.lsa import LsaSummarizer
    
  3. Create a parser and summarizer:

    parser = PlaintextParser.from_string("Your text here", Tokenizer("english"))
    summarizer = LsaSummarizer()
    
  4. Generate and print the summary:

    summary = summarizer(parser.document, sentences_count=3)
    for sentence in summary:
        print(sentence)
    

Competitor Comparisons

Code for the ACL 2017 paper "Get To The Point: Summarization with Pointer-Generator Networks"

Pros of pointer-generator

  • Implements a novel approach combining pointer networks and coverage mechanisms for text summarization
  • Designed specifically for abstractive summarization, allowing generation of new phrases
  • Provides pre-trained models and datasets for easy experimentation

Cons of pointer-generator

  • More complex architecture, potentially requiring more computational resources
  • Implements a single neural model, while sumy offers several summarization algorithms behind one interface
  • Less actively maintained, with fewer recent updates compared to sumy

Code comparison

sumy:

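from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

LANGUAGE = "english"
SENTENCES_COUNT = 10
url = "https://en.wikipedia.org/wiki/Automatic_summarization"

parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, SENTENCES_COUNT)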

pointer-generator:

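# Schematic only: hps (hyperparameters) and vocab must be constructed first,
# and the repository's actual class names and signatures may differ.
from model import Model
from batcher import Batcher

model = Model(hps, vocab)
batcher = Batcher(hps, vocab)
batch = batcher.next_batch()
output = model.run_train_step(batch)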

Key differences

  • sumy offers a more straightforward API for various summarization techniques
  • pointer-generator focuses on a specific neural network-based approach
  • sumy bundles several extractive algorithms and a simple evaluation framework for summaries
  • pointer-generator is more suitable for researchers working on advanced NLP models

Language-Agnostic SEntence Representations

Pros of LASER

  • Supports multilingual and cross-lingual text processing for over 90 languages
  • Utilizes advanced neural network models for high-quality language embeddings
  • Offers pre-trained models for immediate use in various NLP tasks

Cons of LASER

  • More complex setup and usage compared to Sumy
  • Requires more computational resources due to its advanced models
  • Less focused on specific summarization tasks

Code Comparison

Sumy (text summarization):

from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

url = "https://example.com/article.html"
parser = HtmlParser.from_url(url, Tokenizer("english"))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, sentences_count=5)

LASER (sentence embeddings, via the third-party laserembeddings package):

from laserembeddings import Laser

laser = Laser()
embeddings = laser.embed_sentences(
    ['Hello world', 'Bonjour le monde'],
    lang=['en', 'fr']
)

Key Differences

  • Sumy focuses specifically on text summarization, while LASER is a more general-purpose multilingual NLP tool
  • Sumy offers various summarization algorithms, whereas LASER provides language-agnostic sentence embeddings
  • LASER is better suited for cross-lingual tasks and large-scale multilingual applications
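
Sentence embeddings are typically compared with cosine similarity, which is what makes cross-lingual matching possible: translations of the same sentence should land close together. The sketch below illustrates the comparison with the standard library only. The three-dimensional vectors are made-up stand-ins, not real LASER output, and the variable names are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors standing in for embeddings (invented values for illustration).
hello_en = [0.9, 0.1, 0.2]    # "Hello world"
hello_fr = [0.8, 0.2, 0.1]    # "Bonjour le monde"
unrelated = [-0.1, 0.9, -0.3]  # some unrelated sentence

print(cosine_similarity(hello_en, hello_fr))
print(cosine_similarity(hello_en, unrelated))
```

Real embeddings have far more dimensions, but the comparison step is the same.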

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Pros of transformers

  • Extensive library with support for various NLP tasks and models
  • Regular updates and active community support
  • Seamless integration with PyTorch and TensorFlow

Cons of transformers

  • Steeper learning curve due to its complexity
  • Higher computational requirements for large models
  • May be overkill for simple text summarization tasks

Code comparison

sumy:

from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

url = "https://example.com/article.html"
parser = HtmlParser.from_url(url, Tokenizer("english"))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, sentences_count=3)

transformers:

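from transformers import pipeline

text = "Your long text here..."
summarizer = pipeline("summarization")  # loads a default pre-trained model
result = summarizer(text, max_length=100, min_length=30, do_sample=False)
summary = result[0]["summary_text"]  # the pipeline returns a list of dicts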

Key differences

  • sumy focuses specifically on text summarization, while transformers covers a wide range of NLP tasks
  • transformers leverages pre-trained models, potentially offering better performance for complex summarization tasks
  • sumy is more lightweight and may be easier to set up for simple summarization needs

💫 Industrial-strength Natural Language Processing (NLP) in Python

Pros of spaCy

  • Comprehensive NLP library with a wide range of features
  • Highly optimized for performance, suitable for large-scale processing
  • Active development and strong community support

Cons of spaCy

  • Steeper learning curve due to its extensive functionality
  • Larger memory footprint and longer initial loading time
  • May be overkill for simple text summarization tasks

Code Comparison

spaCy (text processing):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sample sentence.")
for token in doc:
    print(token.text, token.pos_, token.dep_)

Sumy (text summarization):

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

parser = PlaintextParser.from_string("Your text here", Tokenizer("english"))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, 3)  # 3 sentences

While spaCy is a comprehensive NLP toolkit, Sumy focuses specifically on text summarization. spaCy offers more advanced language processing capabilities, but Sumy provides a simpler interface for generating summaries. Choose spaCy for complex NLP tasks or when performance is crucial, and Sumy for quick and straightforward text summarization.


NLTK Source

Pros of NLTK

  • Comprehensive library with a wide range of NLP tools and functionalities
  • Extensive documentation and community support
  • Suitable for both research and production environments

Cons of NLTK

  • Steeper learning curve due to its extensive feature set
  • Can be slower for certain tasks compared to more specialized libraries
  • Larger footprint and potentially unnecessary features for simple projects

Code Comparison

NLTK (text summarization):

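from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.probability import FreqDist

# Assumes the NLTK tokenizer and stop-word data are already downloaded
text = "Your long text here..."

# Tokenize and preprocess text
stop_words = set(stopwords.words('english'))
sentences = sent_tokenize(text)
words = [word for word in word_tokenize(text.lower()) if word.isalpha() and word not in stop_words]

# most_common() returns (word, count) pairs, so keep only the words
freq = FreqDist(words)
top_words = {word for word, _ in freq.most_common(10)}

# Summarize: keep every sentence that mentions at least one top word
summary = ' '.join(sent for sent in sentences if any(word in top_words for word in word_tokenize(sent.lower())))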

Sumy (text summarization):

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, 3)  # 3 sentences


README

Automatic text summarizer


A simple library and command-line utility for extracting summaries from HTML pages or plain text. The package also contains a simple evaluation framework for text summaries. The implemented summarization methods are described in the documentation. I also maintain a list of alternative implementations of the summarizers in various programming languages.

Is my natural language supported?

There is a good chance it is. If not, it is not too hard to add.

Installation

Make sure you have Python 3.6+ and pip (Windows, Linux) installed. Then simply run (the preferred way):

$ [sudo] pip install sumy
$ [sudo] pip install git+https://github.com/miso-belica/sumy.git  # for the fresh version

Usage

Thanks to some good soul out there, the easiest way to try sumy is in your browser at https://huggingface.co/spaces/issam9/sumy_space

Sumy contains a command-line utility for quick summarization of documents.

$ sumy lex-rank --length=10 --url=https://en.wikipedia.org/wiki/Automatic_summarization # what's summarization?
$ sumy lex-rank --language=uk --length=30 --url=https://uk.wikipedia.org/wiki/Україна
$ sumy luhn --language=czech --url=https://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy edmundson --language=czech --length=3% --url=https://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy --help # for more info
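
The commands above pass --length either as an absolute number of sentences (10, 30) or as a percentage of the document (3%). A minimal sketch of how such a value could be turned into a sentence count follows. This is an illustration, not Sumy's own code: the function name and the rounding and clamping rules (round down, keep at least one sentence for a percentage) are assumptions.

```python
def sentences_to_keep(length, total_sentences):
    """Translate a length such as "10" or "3%" into a sentence count."""
    if total_sentences <= 0:
        return 0
    value = str(length).strip()
    if value.endswith("%"):
        percentage = float(value[:-1])
        count = int(total_sentences * percentage / 100)  # round down
        # Assumption: keep at least one sentence even for tiny percentages.
        return max(1, min(count, total_sentences))
    # Absolute count, never more than the document contains.
    return max(0, min(int(value), total_sentences))


for length in ("10", "3%"):
    print(length, "->", sentences_to_keep(length, total_sentences=200))
```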

Various evaluation methods for a given summarization method can be run with the commands below:

$ sumy_eval lex-rank reference_summary.txt --url=https://en.wikipedia.org/wiki/Automatic_summarization
$ sumy_eval lsa reference_summary.txt --language=czech --url=https://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy_eval edmundson reference_summary.txt --language=czech --url=https://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy_eval --help # for more info
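
The text above does not say which measures sumy_eval reports. As one common family, co-selection measures compare the sentences a summarizer picked against a reference summary. The sketch below computes precision, recall, and F-score that way; the function name and the sample sentences are hypothetical, and this should not be read as the exact output of sumy_eval.

```python
def co_selection_scores(evaluated, reference):
    """Precision, recall and F-score of selected sentences against a reference."""
    evaluated_set, reference_set = set(evaluated), set(reference)
    if not evaluated_set or not reference_set:
        raise ValueError("Both summaries need at least one sentence.")
    common = evaluated_set & reference_set
    precision = len(common) / len(evaluated_set)  # how much of the output is right
    recall = len(common) / len(reference_set)     # how much of the reference is covered
    if not common:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


evaluated = ["A.", "B.", "C."]        # sentences chosen by a summarizer
reference = ["B.", "C.", "D.", "E."]  # sentences in the reference summary
precision, recall, f_score = co_selection_scores(evaluated, reference)
print(precision, recall, f_score)
```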

If you don't want to bother with the installation, you can try it as a container.

$ docker run --rm misobelica/sumy lex-rank --length=10 --url=https://en.wikipedia.org/wiki/Automatic_summarization

Python API

Or you can use sumy as a library in your project. Create a file sumy_example.py (don't name it sumy.py) with the code below to test it.

from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words


LANGUAGE = "english"
SENTENCES_COUNT = 10


if __name__ == "__main__":
    url = "https://en.wikipedia.org/wiki/Automatic_summarization"
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    # or for plain text files
    # parser = PlaintextParser.from_file("document.txt", Tokenizer(LANGUAGE))
    # parser = PlaintextParser.from_string("Check this out.", Tokenizer(LANGUAGE))
    stemmer = Stemmer(LANGUAGE)

    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)

    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
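
The example above hands the summarizer a stemmer and a stop-word list. The sketch below shows why that preprocessing matters: dropping function words and collapsing inflected forms lets "ranks" and "ranking" count as the same term. The stop-word set and the suffix-stripping toy_stem function are deliberately tiny, hypothetical stand-ins, far cruder than the stemmers and word lists Sumy ships.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}


def toy_stem(word):
    """Crude suffix stripping, far simpler than a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def content_terms(text):
    """Count stemmed words, ignoring stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(toy_stem(w) for w in words if w not in STOP_WORDS)


terms = content_terms(
    "The summarizer ranks sentences. Ranking a sentence is the goal of summarizers."
)
for term, count in terms.most_common():
    print(term, count)
```

Without these two steps, the most frequent "terms" in most English texts would be words like "the" and "of", which say nothing about the content.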

Interesting projects using sumy

I found some interesting projects while browsing the internet, and sometimes people e-mailed me with questions and I was curious how they use sumy :)