python-readability

Fast Python port of arc90's Readability tool, updated to match the latest readability.js.

Top Related Projects

readability

10,277

A standalone version of the readability lib

newspaper

14,624

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:

python-goose

4,041

Html Content / Article Extractor, web scrapping lib in Python

psycopg2

3,502

PostgreSQL database adapter for the Python programming language

sumy

3,598

Module for automatic summarization of text documents and HTML pages.

Quick Overview

The buriy/python-readability project is a Python library that takes an HTML document, extracts its title and main body content, and strips boilerplate such as navigation and ads, returning cleaned-up HTML. It is based on the original Readability algorithm developed by Arc90.
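To make the idea of boilerplate removal concrete, here is a minimal sketch of one heuristic that Readability-style extractors rely on: blocks whose text is mostly link text are treated as navigation and dropped. This is a simplified illustration using only the standard library, not the library's actual scoring code. The names BlockCollector, main_text, and SAMPLE, the thresholds, and the assumption that block elements are not nested are all hypothetical.

```python
from html.parser import HTMLParser

# Assumption: blocks are flat (no <div> inside another <div> or <p>).
BLOCK_TAGS = {"p", "div", "li"}


class BlockCollector(HTMLParser):
    """Collect (text, link_density) for each non-nested block element."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._chunks = None      # text pieces of the block being read, or None
        self._link_chars = 0     # characters of that text found inside <a> tags
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._chunks, self._link_chars = [], 0
        elif tag == "a":
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag in BLOCK_TAGS and self._chunks is not None:
            # Normalise whitespace before measuring the block.
            text = " ".join("".join(self._chunks).split())
            if text:
                self.blocks.append((text, self._link_chars / len(text)))
            self._chunks = None

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)
            if self._in_link:
                self._link_chars += len(data.strip())


def main_text(html, max_link_density=0.5, min_length=25):
    """Return the text of blocks that are long enough and not mostly links."""
    collector = BlockCollector()
    collector.feed(html)
    return [
        text
        for text, density in collector.blocks
        if density <= max_link_density and len(text) >= min_length
    ]


SAMPLE = (
    '<div><a href="/">Home</a> <a href="/about">About</a> '
    '<a href="/contact">Contact</a></div>'
    "<p>Extractors in this family keep blocks that are mostly running text, "
    "like this paragraph, and drop blocks that are mostly links. "
    '<a href="/more">Read more</a></p>'
)

for block in main_text(SAMPLE):
    print(block)
```

The navigation block is discarded because nearly all of its characters sit inside links, while the paragraph survives. The real library combines several signals of this kind, so treat the thresholds here as placeholders.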

Pros

Accurate Content Extraction: The library is able to accurately identify and extract the main content of a web page, even in the presence of ads, navigation menus, and other distracting elements.
Customizable Behavior: The library exposes options for tuning extraction, such as positive_keywords and negative_keywords, and keep_all_images=True to retain every image instead of only the main one.
Cross-Platform Compatibility: The library is written in Python and can be used on a variety of platforms, including Windows, macOS, and Linux.
Active Development: The project is actively maintained, with regular updates and bug fixes.

Cons

Dependency on External Libraries: The library relies on external libraries, most notably lxml (as the package name readability-lxml indicates), which can add complexity to the installation process.
Limited Scope: The library is primarily focused on extracting the main content of web pages and may not be suitable for more complex web scraping tasks.
Potential Performance Issues: Depending on the size and complexity of the web pages being processed, the library may experience performance issues, especially on larger datasets.
Limited Documentation: The project's documentation could be more comprehensive, making it harder for new users to get started.

Code Examples

Here are a few examples of how to use the buriy/python-readability library. The library works on HTML you have already fetched, so each example downloads the page with requests first:

import requests
from readability import Document

# Extract the title and main content from a web page
url = "https://www.example.com/article"
response = requests.get(url)
doc = Document(response.content)
print(doc.title())
print(doc.summary())

import requests
from readability import Document

# Extract a shortened title and the main content as an HTML fragment
url = "https://www.example.com/article"
response = requests.get(url)
doc = Document(response.content)
print(doc.title())
print(doc.short_title())
# html_partial=True omits the wrapping <html> and <body> tags
print(doc.summary(html_partial=True))

import requests
from readability import Document

# Extract the main content and apply custom configuration
url = "https://www.example.com/article"
response = requests.get(url)
# min_text_length: minimum paragraph length considered as content
# negative_keywords: class/id fragments to penalise during scoring
doc = Document(response.content, min_text_length=100, negative_keywords=["sidebar"])
print(doc.title())
print(doc.summary())

Getting Started

To get started with the buriy/python-readability library, follow these steps:

Install the library using pip:

pip install readability-lxml

Import the Document class from the readability package, along with requests to fetch the page:

import requests
from readability import Document

Fetch the page yourself and pass its HTML to Document, which does not download URLs on its own:

url = "https://www.example.com/article"
response = requests.get(url)
doc = Document(response.content)

Call title() and summary() to get the page title and the cleaned-up HTML of the main content:

print(doc.title())
print(doc.summary())

Optionally, you can customize the behavior of the library by passing keyword arguments to the Document constructor:

doc = Document(response.content, positive_keywords=["article"], negative_keywords=["sidebar"])

For more advanced usage and configuration options, please refer to the project's documentation.

Competitor Comparisons

readability

10,277

A standalone version of the readability lib

Pros of mozilla/readability

Actively maintained by Mozilla, with the latest commit being within the past year.
Written in JavaScript and operates on a DOM document, so it runs directly in the browser and, with a DOM implementation such as jsdom, in Node.js.
Its parse() method returns the title and the cleaned content together in a single result object.

Cons of mozilla/readability

Larger codebase and more complex, which may make it more challenging to understand and contribute to.
Not a Python library: calling it from Python means running a JavaScript runtime alongside your code.
Outside the browser it requires an additional DOM implementation, such as jsdom, which may increase the complexity of the setup process.

Code Comparison

buriy/python-readability:

import requests
from readability import Document

url = "https://example.com/article"
html = requests.get(url).text
doc = Document(html)
content = doc.summary()

mozilla/readability:

const { Readability } = require("@mozilla/readability");
const { JSDOM } = require("jsdom");

// html is the page source as a string, fetched beforehand
const dom = new JSDOM(html, { url: "https://example.com/article" });
const article = new Readability(dom.window.document).parse();
const content = article.content;

The main difference is the language and the input each library expects: mozilla/readability is JavaScript and takes a DOM document, returning title and content from a single parse() call, while buriy/python-readability takes an HTML string and exposes title() and summary() as separate methods on Document.

newspaper

14,624

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:

Pros of Newspaper

Newspaper provides a more comprehensive set of features, including article extraction, image extraction, and text summarization.
Newspaper has better support for handling different types of web pages, including news articles, blogs, and forums.
Newspaper has a more active development community, with more contributors and more frequent updates.

Cons of Newspaper

Newspaper can be slower and more resource-intensive than python-readability, especially for simple web page extraction tasks.
Newspaper has a steeper learning curve, with more complex configuration options and a larger API.
Newspaper may not be as accurate as python-readability for certain types of web pages, particularly those with non-standard layouts.

Code Comparison

python-readability:

import requests
from readability import Document

url = "https://www.example.com/article"
doc = Document(requests.get(url).content)
content = doc.summary()

Newspaper:

from newspaper import Article

url = "https://www.example.com/article"
article = Article(url)
article.download()
article.parse()
content = article.text

python-goose

4,041

Html Content / Article Extractor, web scrapping lib in Python

Pros of Goose

Goose provides a more robust and feature-rich set of tools for extracting content from web pages, including support for video and image extraction.
Goose has a larger and more active community, with more contributors and more frequent updates.
Goose has better support for handling different types of web content, including pages with complex layouts and dynamic content.

Cons of Goose

Goose can be more complex to set up and configure, with more dependencies and a steeper learning curve.
Goose may be overkill for simple use cases where Readability's more lightweight approach is sufficient.
Goose's codebase is larger and more complex, which can make it more difficult to understand and modify.

Code Comparison

Readability:

import requests
from readability import Document

url = 'https://example.com/article'
doc = Document(requests.get(url).content)
content = doc.summary()

Goose:

from goose3 import Goose

url = 'https://example.com/article'
g = Goose()
article = g.extract(url=url)
content = article.cleaned_text

psycopg2

3,502

PostgreSQL database adapter for the Python programming language

Pros of psycopg2

Robust and mature library for interacting with PostgreSQL databases in Python
Supports a wide range of PostgreSQL features, including transactions, prepared statements, and asynchronous operations
Provides a simple and intuitive API for common database operations

Cons of psycopg2

Primarily focused on PostgreSQL, so it may not be the best choice for working with other database systems
Can be more complex to set up and configure compared to some other database libraries

Code Comparison

psycopg2:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser password=mypass")
cur = conn.cursor()
cur.execute("SELECT * FROM mytable WHERE id = %s", (1,))
result = cur.fetchone()
print(result)

python-readability:

import requests
from readability import Document

url = "https://example.com"
doc = Document(requests.get(url).content)
content = doc.summary()
print(content)

sumy

3,598

Module for automatic summarization of text documents and HTML pages.

Pros of Sumy

Sumy provides a wider range of text summarization algorithms, including Luhn, Edmundson, and TextRank, allowing for more flexibility in summarization approaches.
Sumy supports multiple languages, including English, French, German, and Russian, making it more versatile for international use cases.
Sumy has a more active development community, with regular updates and bug fixes.

Cons of Sumy

Sumy has a more complex API and requires more setup and configuration compared to python-readability.
Sumy may have a higher learning curve for users who are primarily interested in simple text extraction and cleaning.
Sumy's performance may be slower than python-readability for certain tasks, especially on larger documents.

Code Comparison

Python-readability:

import requests
from readability import Document

url = "https://example.com"
doc = Document(requests.get(url).content)
print(doc.title())
print(doc.summary())

Sumy:

from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

url = "https://example.com"
parser = HtmlParser.from_url(url, Tokenizer("english"))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, sentences_count=3)

for sentence in summary:
    print(sentence)

README

python-readability

Given an HTML document, extract and clean up the main body text and title.

This is a Python port of a Ruby port of arc90's Readability project.

Installation

Install with pip:

$ pip install readability-lxml

Alternatively, install with conda:

$ conda install -c conda-forge readability-lxml

Usage

>>> import requests
>>> from readability import Document

>>> response = requests.get('http://example.com')
>>> doc = Document(response.content)
>>> doc.title()
'Example Domain'

>>> doc.summary()
"""<html><body><div><body id="readabilityBody">\n<div>\n    <h1>Example Domain</h1>\n
<p>This domain is established to be used for illustrative examples in documents. You may
use this\n    domain in examples without prior coordination or asking for permission.</p>
\n    <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
\n</body>\n</div></body></html>"""
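
Because summary() returns HTML rather than plain text, a common follow-up step is to strip the tags. Below is a minimal standard-library sketch of that step, applied to an abbreviated copy of the output shown above. The names TextExtractor and html_to_text are hypothetical helpers and not part of python-readability.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Accumulate the text nodes of an HTML string, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the runs of spaces and newlines left over from the markup.
    return " ".join("".join(parser.parts).split())


# Abbreviated copy of the summary() output shown in the usage example.
summary_html = (
    '<html><body><div><body id="readabilityBody">\n<div>\n'
    "    <h1>Example Domain</h1>\n"
    '    <p><a href="http://www.iana.org/domains/example">'
    "More information...</a></p>\n"
    "</div>\n</body>\n</div></body></html>"
)

text = html_to_text(summary_html)
print(text)  # Example Domain More information...
```

This approach discards all structure, including paragraph breaks. If headings or paragraphs need to stay separate, walking the tree with an HTML library would be a better fit.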

Change Log

0.8.4 Better CJK support, thanks @cdhigh
0.8.3.1 Support for Python 3.8 - 3.13
0.8.3 We can now save all images via keep_all_images=True (default is to save 1 main image), thanks @botlabsDev
0.8.2 Added article author(s) (thanks @mattblaha)
0.8.1 Fixed processing of non-ascii HTMLs via regexps.
0.8 Replaced XHTML output with HTML5 output in summary() call.
0.7.1 Support for Python 3.7. Fixed a slowdown when processing documents with lots of spaces.
0.7 Improved HTML5 tags handling. Fixed stripping unwanted HTML nodes (only first matching node was removed before).
0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 - 3.6
0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
0.4 Added Videos loading and allowed more images per paragraph
0.3 Added Document.encoding, positive_keywords and negative_keywords

Licensing

This code is licensed under the Apache License 2.0.

Thanks to

Latest readability.js
Ruby port by starrhorne and iterationlabs
Python port by gfxmonk
Decruft effort to move to lxml
"BR to P" fix from readability.js which improves quality for smaller texts
Contributions from GitHub users.
