buriy/python-readability

fast python port of arc90's readability tool, updated to match latest readability.js!

Top Related Projects

  • mozilla/readability: A standalone version of the readability lib
  • Newspaper: News, full-text, and article metadata extraction in Python 3
  • Goose: HTML content / article extractor, web scraping lib in Python
  • psycopg2: PostgreSQL database adapter for the Python programming language
  • Sumy: Module for automatic summarization of text documents and HTML pages

Quick Overview

buriy/python-readability is a Python library that, given an HTML document, extracts and cleans up the main body text and title, stripping boilerplate such as navigation menus and ads. It is a Python port of arc90's Readability, updated to follow readability.js, and is published on PyPI as readability-lxml.

Pros

  • Accurate Content Extraction: The library is able to accurately identify and extract the main content of a web page, even in the presence of ads, navigation menus, and other distracting elements.
  • Customizable Behavior: The Document constructor accepts options such as positive_keywords, negative_keywords, and min_text_length, which let users tune extraction for a particular site.
  • Cross-Platform Compatibility: The library is written in Python and can be used on a variety of platforms, including Windows, macOS, and Linux.
  • Active Development: The project is actively maintained, with regular updates and bug fixes.

Cons

  • Dependency on External Libraries: The library relies on external libraries such as lxml, chardet, and cssselect. lxml includes compiled extensions, which can complicate installation on platforms without prebuilt packages.
  • Limited Scope: The library is primarily focused on extracting the main content of web pages and may not be suitable for more complex web scraping tasks.
  • Potential Performance Issues: Depending on the size and complexity of the web pages being processed, the library may experience performance issues, especially on larger datasets.
  • Limited Documentation: The project's documentation could be more comprehensive, making it harder for new users to get started.

Code Examples

Here are a few examples of how to use the buriy/python-readability library. The import path and method names follow the project's README; the library works on HTML you have already fetched, so requests is used here to download the page.

import requests
from readability import Document

# Extract the title and main content from a web page
url = "https://www.example.com/article"
response = requests.get(url)
doc = Document(response.content)
print(doc.title())
print(doc.summary())

import requests
from readability import Document

# Extract the main content and metadata from a web page
url = "https://www.example.com/article"
doc = Document(requests.get(url).content)
print(doc.title())        # page title
print(doc.short_title())  # shortened form of the title
print(doc.author())       # article author(s), added in 0.8.2
print(doc.summary())      # cleaned-up HTML of the main content

import requests
from readability import Document

# Extract the main content and apply custom configuration
url = "https://www.example.com/article"
doc = Document(
    requests.get(url).content,
    min_text_length=100,
    # Example class/id fragments; adjust them for the target site
    positive_keywords=["article-body"],
    negative_keywords=["sidebar", "comments"],
)
print(doc.title())
print(doc.summary())
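
The min_text_length option used above sets a floor on how long a paragraph must be before it counts as article text. As a rough illustration of why that matters, here is a simplified sketch of Readability-style paragraph scoring: a base point, a point per comma, and a small length bonus. The function name paragraph_score, the exact weights, and the default of 25 are assumptions made for illustration; this is not python-readability's actual code.

```python
def paragraph_score(text, min_text_length=25):
    """Score one paragraph of text; higher means more article-like.

    Simplified illustration only, not python-readability's actual code.
    """
    text = " ".join(text.split())  # collapse runs of whitespace
    if len(text) < min_text_length:
        return 0  # too short to count as article text
    score = 1                          # base point for a long-enough paragraph
    score += text.count(",")           # commas hint at real sentences
    score += min(len(text) // 100, 3)  # up to 3 bonus points for length
    return score


nav = "Home | About | Contact"
body = (
    "Readability-style extractors favour long paragraphs, with commas, "
    "clauses, and full sentences, over short navigation labels."
)
print(paragraph_score(nav), paragraph_score(body))
```

Raising min_text_length makes the filter stricter, which can help on pages with many short captions or link lists, at the cost of dropping genuinely short paragraphs.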

Getting Started

To get started with the buriy/python-readability library, follow these steps:

  1. Install the library using pip (the package is published as readability-lxml). The steps below also use requests to fetch pages:
pip install readability-lxml requests
  2. Import the Document class from the readability package:
import requests
from readability import Document
  3. Fetch the page yourself and pass its HTML to Document. The library does not download URLs for you:
url = "https://www.example.com/article"
response = requests.get(url)
doc = Document(response.content)
  4. Call title() and summary() to get the page title and the cleaned-up HTML of the main content:
print(doc.title())
print(doc.summary())
  5. Optionally, customize the extraction by passing keyword arguments to the Document constructor:
doc = Document(response.content, min_text_length=100, negative_keywords=["sidebar"])

For more advanced usage and configuration options, please refer to the project's documentation.
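
The extracted article comes back as an HTML string, so a common follow-up step is turning it into plain text. The sketch below does that with only the standard library's html.parser; the names html_to_text and _TextCollector are hypothetical helpers, not part of python-readability, and the set of tags treated as block-level is an assumption.

```python
from html.parser import HTMLParser

# Tags after which a word break is inserted (an assumed, non-exhaustive set)
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the pieces, then collapse all runs of whitespace
        return " ".join("".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string as a single line."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><body><div><h1>Example Domain</h1>\n"
    '<p>This domain is for use in <a href="#">examples</a>.</p>'
    "</div></body></html>"
)
print(html_to_text(sample))
```

In practice the input would be the string returned by the library's summary call. Since the library already depends on lxml, lxml.html.fromstring(...).text_content() is an alternative that avoids the custom parser.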

Competitor Comparisons

A standalone version of the readability lib

Pros of mozilla/readability

  • It is the standalone version of the Readability library used for Firefox Reader View, so it is the reference JavaScript implementation rather than a port of it.
  • Works on a DOM document, so it runs in the browser and, with a DOM implementation such as jsdom, in Node.js.
  • parse() returns structured fields such as title, content, and textContent in a single object.

Cons of mozilla/readability

  • It is a JavaScript library, so calling it from a Python project requires a JavaScript runtime.
  • Outside the browser it needs an extra dependency, such as jsdom, to build the DOM it operates on.
  • Larger and more complex codebase, which may make it more challenging to understand and contribute to.

Code Comparison

buriy/python-readability:

import requests
from readability import Document

url = "https://example.com/article"
html = requests.get(url).text
doc = Document(html)
content = doc.summary()

mozilla/readability (Node.js, using jsdom to build the DOM):

const { Readability } = require("@mozilla/readability");
const { JSDOM } = require("jsdom");

const url = "https://example.com/article";
JSDOM.fromURL(url).then((dom) => {
  // parse() returns null when no article content is found
  const article = new Readability(dom.window.document).parse();
  const content = article ? article.content : null;
});

The main difference is the language and the input each library expects. mozilla/readability is JavaScript: it takes a DOM document and returns an object from parse(). buriy/python-readability takes an HTML string or bytes and exposes the results through methods such as title() and summary().

newspaper3k: news, full-text, and article metadata extraction in Python 3.

Pros of Newspaper

  • Newspaper provides a more comprehensive set of features, including article extraction, image extraction, and text summarization.
  • Newspaper has better support for handling different types of web pages, including news articles, blogs, and forums.
  • Newspaper has a more active development community, with more contributors and more frequent updates.

Cons of Newspaper

  • Newspaper can be slower and more resource-intensive than python-readability, especially for simple web page extraction tasks.
  • Newspaper has a steeper learning curve, with more complex configuration options and a larger API.
  • Newspaper may not be as accurate as python-readability for certain types of web pages, particularly those with non-standard layouts.

Code Comparison

python-readability:

import requests
from readability import Document

url = "https://www.example.com/article"
doc = Document(requests.get(url).content)
content = doc.summary()

Newspaper:

from newspaper import Article

url = "https://www.example.com/article"
article = Article(url)
article.download()
article.parse()
content = article.text

HTML content / article extractor, web scraping lib in Python

Pros of Goose

  • Goose provides a more robust and feature-rich set of tools for extracting content from web pages, including support for video and image extraction.
  • Goose has a larger and more active community, with more contributors and more frequent updates.
  • Goose has better support for handling different types of web content, including pages with complex layouts and dynamic content.

Cons of Goose

  • Goose can be more complex to set up and configure, with more dependencies and a steeper learning curve.
  • Goose may be overkill for simple use cases where Readability's more lightweight approach is sufficient.
  • Goose's codebase is larger and more complex, which can make it more difficult to understand and modify.

Code Comparison

Readability:

import requests
from readability import Document

url = 'https://example.com/article'
doc = Document(requests.get(url).content)
content = doc.summary()

Goose:

from goose3 import Goose

url = 'https://example.com/article'
g = Goose()
article = g.extract(url=url)
content = article.cleaned_text

PostgreSQL database adapter for the Python programming language

Pros of psycopg2

  • Robust and mature library for interacting with PostgreSQL databases in Python
  • Supports a wide range of PostgreSQL features, including transactions, prepared statements, and asynchronous operations
  • Provides a simple and intuitive API for common database operations

Cons of psycopg2

  • Primarily focused on PostgreSQL, so it may not be the best choice for working with other database systems
  • Can be more complex to set up and configure compared to some other database libraries

Code Comparison

psycopg2:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser password=mypass")
cur = conn.cursor()
cur.execute("SELECT * FROM mytable WHERE id = %s", (1,))
result = cur.fetchone()
print(result)

python-readability:

import requests
from readability import Document

url = "https://example.com"
doc = Document(requests.get(url).content)
content = doc.summary()
print(content)

Module for automatic summarization of text documents and HTML pages.

Pros of Sumy

  • Sumy provides a wider range of text summarization algorithms, including Luhn, Edmundson, and TextRank, allowing for more flexibility in summarization approaches.
  • Sumy supports multiple languages, including English, French, German, and Russian, making it more versatile for international use cases.
  • Sumy has a more active development community, with regular updates and bug fixes.

Cons of Sumy

  • Sumy has a more complex API and requires more setup and configuration compared to python-readability.
  • Sumy may have a higher learning curve for users who are primarily interested in simple text extraction and cleaning.
  • Sumy's performance may be slower than python-readability for certain tasks, especially on larger documents.

Code Comparison

Python-readability:

import requests
from readability import Document

url = "https://example.com"
doc = Document(requests.get(url).content)
print(doc.title())
print(doc.summary())

Sumy:

from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

url = "https://example.com"
parser = HtmlParser.from_url(url, Tokenizer("english"))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, sentences_count=3)

for sentence in summary:
    print(sentence)

README

.. image:: https://travis-ci.org/buriy/python-readability.svg?branch=master
    :target: https://travis-ci.org/buriy/python-readability

.. image:: https://img.shields.io/pypi/v/readability-lxml.svg
    :target: https://pypi.python.org/pypi/readability-lxml

python-readability

Given an HTML document, extract and clean up the main body text and title.

This is a Python port of a Ruby port of `arc90's Readability project <https://web.archive.org/web/20130519040221/http://www.readability.com/>`__.

Installation

It's easy using pip, just run:

.. code-block:: bash

    $ pip install readability-lxml

As an alternative, you may also use conda to install, just run:

.. code-block:: bash

    $ conda install -c conda-forge readability-lxml

Usage

.. code-block:: python

    >>> import requests
    >>> from readability import Document

    >>> response = requests.get('http://example.com')
    >>> doc = Document(response.content)
    >>> doc.title()
    'Example Domain'

    >>> doc.summary()
    """<html><body><div><body id="readabilityBody">\n<div>\n    <h1>Example Domain</h1>\n
    <p>This domain is established to be used for illustrative examples in documents. You may
    use this\n    domain in examples without prior coordination or asking for permission.</p>
    \n    <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
    \n</body>\n</div></body></html>"""

Change Log

  • 0.8.2 Added article author(s) (thanks @mattblaha)
  • 0.8.1 Fixed processing of non-ascii HTMLs via regexps.
  • 0.8 Replaced XHTML output with HTML5 output in summary() call.
  • 0.7.1 Support for Python 3.7. Fixed a slowdown when processing documents with lots of spaces.
  • 0.7 Improved HTML5 tags handling. Fixed stripping unwanted HTML nodes (only first matching node was removed before).
  • 0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 - 3.6
  • 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
  • 0.4 Added Videos loading and allowed more images per paragraph
  • 0.3 Added Document.encoding, positive_keywords and negative_keywords

Licensing

This code is under the `Apache License 2.0 <http://www.apache.org/licenses/LICENSE-2.0>`__ license.

Thanks to

  • Latest `readability.js <https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js>`__
  • Ruby port by starrhorne and iterationlabs
  • `Python port <https://github.com/gfxmonk/python-readability>`__ by gfxmonk
  • `Decruft effort <https://web.archive.org/web/20110214150709/https://www.minvolai.com/blog/decruft-arc90s-readability-in-python/>`__ to move to lxml
  • "BR to P" fix from readability.js which improves quality for smaller texts
  • Contributions from GitHub users.