python-readability
fast python port of arc90's readability tool, updated to match latest readability.js!
Top Related Projects
A standalone version of the readability lib
newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
Html Content / Article Extractor, web scrapping lib in Python
PostgreSQL database adapter for the Python programming language
Module for automatic summarization of text documents and HTML pages.
Quick Overview
The buriy/python-readability project is a Python library for extracting the main content from web pages: it strips boilerplate such as navigation and ads and returns the remaining content in a clean, readable form. It is based on the original Readability algorithm developed by Arc90.
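To give a feel for the kind of heuristic behind Readability-style extraction, here is a toy sketch using only the standard library. It is not the library's code or its actual scoring formula: the `toy_score` function, its weights, and the sample fragments are all assumptions made for illustration. It rewards blocks with long, comma-rich text and penalises blocks whose text sits mostly inside links.

```python
from html.parser import HTMLParser


class _TextStats(HTMLParser):
    """Count text characters, commas, and characters inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.total = 0      # all text characters (whitespace-trimmed)
        self.in_links = 0   # text characters nested inside <a>...</a>
        self.commas = 0
        self._a_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._a_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._a_depth:
            self._a_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        self.total += len(text)
        self.commas += text.count(",")
        if self._a_depth:
            self.in_links += len(text)


def toy_score(fragment):
    """Hypothetical score: text length and commas, scaled down by link density."""
    stats = _TextStats()
    stats.feed(fragment)
    stats.close()
    if stats.total == 0:
        return 0.0
    link_density = stats.in_links / stats.total
    return (stats.total / 100 + stats.commas) * (1 - link_density)


# Two made-up fragments: a navigation bar and an article paragraph.
nav = '<div><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>'
body = "<div><p>The council met on Tuesday, reviewed the budget, and approved the new library plan.</p></div>"

best = max([nav, body], key=toy_score)
print(best)
```

The navigation fragment scores zero because all of its text is link text, so the article paragraph is selected.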
Pros
- Accurate Content Extraction: The library is able to accurately identify and extract the main content of a web page, even in the presence of ads, navigation menus, and other distracting elements.
- Customizable Behavior: The library provides several configuration options that allow users to fine-tune its behavior to their specific needs.
- Cross-Platform Compatibility: The library is written in Python and can be used on a variety of platforms, including Windows, macOS, and Linux.
- Active Development: The project is actively maintained, with regular updates and bug fixes.
Cons
- Dependency on External Libraries: The library relies on external libraries such as lxml and html5lib, which can add complexity to the installation process.
- Limited Scope: The library is primarily focused on extracting the main content of web pages and may not be suitable for more complex web scraping tasks.
- Potential Performance Issues: Depending on the size and complexity of the web pages being processed, the library may experience performance issues, especially on larger datasets.
- Limited Documentation: The project's documentation could be more comprehensive, making it harder for new users to get started.
Code Examples
Here are a few examples of how to use the buriy/python-readability library. They use the Document class shown in the project's README. The library parses HTML that you supply, so the page is downloaded separately with requests.
import requests
from readability import Document
# Extract the title and main content from a web page
url = "https://www.example.com/article"
response = requests.get(url)
doc = Document(response.content)
print(doc.title())
print(doc.summary())  # cleaned HTML of the main content
import requests
from readability import Document
# Extract the main content along with the full and shortened titles
url = "https://www.example.com/article"
response = requests.get(url)
doc = Document(response.content)
print(doc.title())        # text of the page's <title>
print(doc.short_title())  # shortened form of the title
print(doc.summary())
import requests
from readability import Document
# Extract the main content and apply custom configuration
url = "https://www.example.com/article"
response = requests.get(url)
doc = Document(response.content, min_text_length=100, retry_length=250)
print(doc.title())
print(doc.summary())
Getting Started
To get started with the buriy/python-readability library, follow these steps:
- Install the library using pip (requests is used here only to download the page):
pip install readability-lxml requests
- Import the Document class from the readability module:
from readability import Document
- Download the page and create a Document object from its HTML:
import requests
url = "https://www.example.com/article"
response = requests.get(url)
doc = Document(response.content)
- Call the title() and summary() methods to get the page title and the cleaned HTML of the main content:
print(doc.title())
print(doc.summary())
- Optionally, customize the behavior of the library by passing additional keyword arguments to the Document constructor:
doc = Document(response.content, min_text_length=100, retry_length=250)
For more advanced usage and configuration options, refer to the project's README and source code.
Competitor Comparisons
A standalone version of the readability lib
Pros of mozilla/readability
- Maintained by Mozilla as the standalone version of the library used for Firefox Reader View.
- Works on a DOM document object, so it runs in browsers and, with a DOM implementation such as jsdom, in Node.js.
- Provides more detailed configuration options for customizing the extraction process.
Cons of mozilla/readability
- Larger codebase and more complex, which may make it more challenging to understand and contribute to.
- Performance is not directly comparable with buriy/python-readability, since it runs in a JavaScript engine rather than in Python on top of lxml.
- Written in JavaScript, so using it from Python requires a JavaScript runtime, and under Node.js it also needs a DOM implementation such as jsdom.
Code Comparison
buriy/python-readability:
import requests
from readability import Document
url = "https://example.com/article"
html = requests.get(url).text
doc = Document(html)
content = doc.summary()
mozilla/readability (JavaScript, run under Node.js with jsdom; html holds the page source):
const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');
const dom = new JSDOM(html, { url: "https://example.com/article" });
const article = new Readability(dom.window.document).parse();
const content = article.content;
The main difference in the code is the language and the shape of the result: the parse() method in mozilla/readability returns an article object with fields such as title, content, and textContent, while the Document class in buriy/python-readability exposes title() and summary() methods.
newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
Pros of Newspaper
- Newspaper provides a more comprehensive set of features, including article extraction, image extraction, and text summarization.
- Newspaper has better support for handling different types of web pages, including news articles, blogs, and forums.
- Newspaper has a more active development community, with more contributors and more frequent updates.
Cons of Newspaper
- Newspaper can be slower and more resource-intensive than python-readability, especially for simple web page extraction tasks.
- Newspaper has a steeper learning curve, with more complex configuration options and a larger API.
- Newspaper may not be as accurate as python-readability for certain types of web pages, particularly those with non-standard layouts.
Code Comparison
python-readability:
import requests
from readability import Document
url = "https://www.example.com/article"
doc = Document(requests.get(url).content)
content = doc.summary()
Newspaper:
from newspaper import Article
url = "https://www.example.com/article"
article = Article(url)
article.download()
article.parse()
content = article.text
Html Content / Article Extractor, web scrapping lib in Python
Pros of Goose
- Goose provides a more robust and feature-rich set of tools for extracting content from web pages, including support for video and image extraction.
- Goose has a larger and more active community, with more contributors and more frequent updates.
- Goose has better support for handling different types of web content, including pages with complex layouts and dynamic content.
Cons of Goose
- Goose can be more complex to set up and configure, with more dependencies and a steeper learning curve.
- Goose may be overkill for simple use cases where Readability's more lightweight approach is sufficient.
- Goose's codebase is larger and more complex, which can make it more difficult to understand and modify.
Code Comparison
Readability:
import requests
from readability import Document
url = 'https://example.com/article'
doc = Document(requests.get(url).content)
content = doc.summary()
Goose:
from goose3 import Goose
url = 'https://example.com/article'
g = Goose()
article = g.extract(url=url)
content = article.cleaned_text
PostgreSQL database adapter for the Python programming language
Pros of psycopg2
- Robust and mature library for interacting with PostgreSQL databases in Python
- Supports a wide range of PostgreSQL features, including transactions, prepared statements, and asynchronous operations
- Provides a simple and intuitive API for common database operations
Cons of psycopg2
- Primarily focused on PostgreSQL, so it may not be the best choice for working with other database systems
- Can be more complex to set up and configure compared to some other database libraries
Code Comparison
psycopg2:
import psycopg2
conn = psycopg2.connect("dbname=mydb user=myuser password=mypass")
cur = conn.cursor()
cur.execute("SELECT * FROM mytable WHERE id = %s", (1,))
result = cur.fetchone()
print(result)
python-readability:
import requests
from readability import Document
url = "https://example.com"
doc = Document(requests.get(url).content)
content = doc.summary()
print(content)
Module for automatic summarization of text documents and HTML pages.
Pros of Sumy
- Sumy provides a wider range of text summarization algorithms, including Luhn, Edmundson, and TextRank, allowing for more flexibility in summarization approaches.
- Sumy supports multiple languages, including English, French, German, and Russian, making it more versatile for international use cases.
- Sumy has a more active development community, with regular updates and bug fixes.
Cons of Sumy
- Sumy has a more complex API and requires more setup and configuration compared to python-readability.
- Sumy may have a higher learning curve for users who are primarily interested in simple text extraction and cleaning.
- Sumy's performance may be slower than python-readability for certain tasks, especially on larger documents.
Code Comparison
Python-readability:
import requests
from readability import Document
url = "https://example.com"
doc = Document(requests.get(url).content)
print(doc.title())
print(doc.summary())
Sumy:
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
url = "https://example.com"
parser = HtmlParser.from_url(url, Tokenizer("english"))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, sentences_count=3)
for sentence in summary:
    print(sentence)
README
.. image:: https://travis-ci.org/buriy/python-readability.svg?branch=master
    :target: https://travis-ci.org/buriy/python-readability

.. image:: https://img.shields.io/pypi/v/readability-lxml.svg
    :target: https://pypi.python.org/pypi/readability-lxml

python-readability
Given an HTML document, extract and clean up the main body text and title.
This is a Python port of a Ruby port of `arc90's Readability project <https://web.archive.org/web/20130519040221/http://www.readability.com/>`__.
Installation
It's easy using ``pip``, just run:

.. code-block:: bash

    $ pip install readability-lxml

As an alternative, you may also use conda to install, just run:

.. code-block:: bash

    $ conda install -c conda-forge readability-lxml

Usage
.. code-block:: python

    >>> import requests
    >>> from readability import Document
    >>> response = requests.get('http://example.com')
    >>> doc = Document(response.content)
    >>> doc.title()
    'Example Domain'
    >>> doc.summary()
    """<html><body><div><body id="readabilityBody">\n<div>\n <h1>Example Domain</h1>\n
    <p>This domain is established to be used for illustrative examples in documents. You may
    use this\n domain in examples without prior coordination or asking for permission.</p>
    \n <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
    \n</body>\n</div></body></html>"""

Change Log
- 0.8.2 Added article author(s) (thanks @mattblaha)
- 0.8.1 Fixed processing of non-ascii HTMLs via regexps.
- 0.8 Replaced XHTML output with HTML5 output in summary() call.
- 0.7.1 Support for Python 3.7. Fixed a slowdown when processing documents with lots of spaces.
- 0.7 Improved HTML5 tags handling. Fixed stripping unwanted HTML nodes (only first matching node was removed before).
- 0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 - 3.6
- 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
- 0.4 Added Videos loading and allowed more images per paragraph
- 0.3 Added Document.encoding, positive_keywords and negative_keywords
Licensing
This code is under `the Apache License 2.0 <http://www.apache.org/licenses/LICENSE-2.0>`__ license.
Thanks to
- Latest `readability.js <https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js>`__
- Ruby port by starrhorne and iterationlabs
- `Python port <https://github.com/gfxmonk/python-readability>`__ by gfxmonk
- `Decruft effort <https://web.archive.org/web/20110214150709/https://www.minvolai.com/blog/decruft-arc90s-readability-in-python/>`__ to move to lxml
- "BR to P" fix from readability.js which improves quality for smaller texts
- Github users contributions.