
mozilla/readability

A standalone version of the readability lib


Top Related Projects

  • Parser (5,397 stars): 📜 Extract meaningful content from the chaos of a web page
  • Percollate: A command-line tool to turn web pages into readable PDF, EPUB, HTML, or Markdown docs.
  • Puppeteer (88,205 stars): JavaScript API for Chrome and Firefox
  • jsdom (20,377 stars): A JavaScript implementation of various web standards, for use with Node.js
  • Cheerio (28,388 stars): The fast, flexible, and elegant library for parsing and manipulating HTML and XML.

Quick Overview

The mozilla/readability repository is a JavaScript library that extracts the primary readable content from a web page. It is designed to be used in browser extensions, standalone apps, and server-side applications to provide a clean, readable version of web content, stripping away ads, navigation, and other non-essential elements.

Pros

  • Robust and Reliable: The library has been actively maintained by the Mozilla team for several years and has a proven track record of accurately extracting the main content from a wide range of web pages.
  • Customizable: The library provides various configuration options to fine-tune the content extraction process, allowing developers to adapt it to their specific needs.
  • Cross-Platform: The library is written in JavaScript and can be used in both client-side and server-side environments, making it a versatile tool for a variety of use cases.
  • Open-Source: The project is open-source, allowing developers to contribute, report issues, and customize the library as needed.

Cons

  • Dependency on a DOM Implementation: outside the browser, the library requires an external DOM library such as jsdom, which adds to project size and complexity.
  • Limited Handling of Dynamic Content: The library may struggle with accurately extracting content from web pages that heavily rely on JavaScript-driven dynamic content.
  • Potential for False Positives: In some cases, the library may incorrectly identify non-essential content as the main readable content, requiring additional post-processing.
  • Lack of Comprehensive Documentation: While the project has a README file, the documentation could be more detailed and provide more examples for different use cases.

Code Examples

Here are a few examples of how to use the mozilla/readability library:

  1. Basic Usage:

const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');

async function extractArticle(url) {
  // Global fetch requires Node.js 18+
  const response = await fetch(url);
  const html = await response.text();

  // Passing the page URL lets Readability resolve relative links and images
  const doc = new JSDOM(html, { url }).window.document;
  return new Readability(doc).parse();
}

extractArticle('https://example.com/article').then((article) => {
  console.log(article.content);
});
  2. Customizing the Extraction Process (using options documented in the API reference below):

const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');

async function extractArticle(url) {
  const response = await fetch(url);
  const html = await response.text();

  const doc = new JSDOM(html, { url }).window.document;
  return new Readability(doc, {
    charThreshold: 100,              // accept shorter articles than the default 500 chars
    nbTopCandidates: 5,              // number of top candidates to consider
    keepClasses: false,              // strip classes, except those listed below
    classesToPreserve: ['caption'],  // example class to keep
    disableJSONLD: true              // skip JSON-LD metadata parsing
  }).parse();
}

extractArticle('https://example.com/article').then((article) => {
  console.log(article.content);
});
  3. Handling Dynamic Content:

const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');

async function extractArticle(url) {
  const response = await fetch(url);
  const html = await response.text();

  // jsdom can execute page scripts, but this is unsafe with untrusted
  // input and disabled by default; only enable it for content you trust.
  const dom = new JSDOM(html, {
    url,
    runScripts: 'dangerously',
    resources: 'usable' // allow jsdom to load external scripts
  });

  // Crudely wait for the page's scripts to render content
  await new Promise((resolve) => setTimeout(resolve, 5000));

  return new Readability(dom.window.document).parse();
}

extractArticle('https://example.com/dynamic-article').then((article) => {
  console.log(article.content);
});
  4. Integrating with a Web Framework:

const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');
const express = require('express');

const app = express();

// Use a query parameter: URLs contain slashes and query strings
// that would break an Express path parameter like /article/:url.
app.get('/article', async (req, res) => {
  const url = req.query.url;
  const response = await fetch(url);
  const html = await response.text();

  const doc = new JSDOM(html, { url }).window.document;
  const article = new Readability(doc).parse();

  res.json(article);
});

app.listen(3000, () => {
  console.log('Server listening on port 3000');
});

Getting Started

To get started with the mozilla/readability library, follow these steps:

  1. Install the library: npm install @mozilla/readability
  2. In Node.js, also install a DOM library such as jsdom: npm install jsdom
  3. Create a Readability object from a DOM document and call its parse() method, as in the sketch below.
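
Putting those steps together, here is a minimal end-to-end sketch following the README's basic usage (the HTML and URL are placeholders):

const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');

const dom = new JSDOM('<body><article><h1>Hello</h1><p>Some article text…</p></article></body>', {
  url: 'https://example.com/article' // lets Readability resolve relative URLs
});

const article = new Readability(dom.window.document).parse();
if (article) {
  console.log(article.title);
  console.log(article.textContent);
}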

Competitor Comparisons

Parser

📜 Extract meaningful content from the chaos of a web page

Pros of Parser

  • More comprehensive content extraction, including metadata and additional elements
  • Supports multiple output formats (JSON, HTML, text)
  • Actively maintained with regular updates and improvements

Cons of Parser

  • Larger codebase and dependencies, potentially slower performance
  • May require more setup and configuration for specific use cases
  • Less focused on pure readability extraction compared to Readability

Code Comparison

Readability:

var article = new Readability(document).parse();
console.log(article.content);

Parser:

const Parser = require('@postlight/parser');

Parser.parse('https://example.com').then(result => {
  console.log(result.content);
});

Summary

Readability focuses on extracting clean, readable content from web pages, while Parser offers a more comprehensive solution for content extraction and parsing. Readability is simpler to use and may be faster for basic content extraction, but Parser provides more features and flexibility for advanced use cases. The choice between the two depends on the specific requirements of your project and the level of content parsing and manipulation needed.

Percollate

A command-line tool to turn web pages into readable PDF, EPUB, HTML, or Markdown docs.

Pros of Percollate

  • Offers a command-line interface for easy integration into scripts and workflows
  • Supports multiple output formats including PDF, HTML, and EPUB
  • Provides advanced customization options for styling and layout

Cons of Percollate

  • Less focused on pure readability extraction compared to Readability
  • May require more setup and configuration for basic use cases
  • Smaller community and potentially less frequent updates

Code Comparison

Readability:

const reader = new Readability(document);
const article = reader.parse();

Percollate:

const percollate = require('percollate');

percollate.configure();
percollate.pdf(['https://example.com'], { output: 'output.pdf' });

Summary

Readability focuses on extracting readable content from web pages, while Percollate offers a more comprehensive solution for converting web content to various formats. Readability is simpler to use for basic content extraction, while Percollate provides more flexibility and output options. The choice between the two depends on specific project requirements and desired output formats.

Puppeteer

JavaScript API for Chrome and Firefox

Pros of Puppeteer

  • Offers full browser automation, not just content extraction
  • Supports a wide range of web scraping and testing scenarios
  • Provides a high-level API for controlling Chrome or Chromium

Cons of Puppeteer

  • Heavier resource usage due to full browser automation
  • Steeper learning curve for simple content extraction tasks
  • Requires Node.js runtime environment

Code Comparison

Readability (content extraction):

const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');

const doc = new JSDOM(html);
const reader = new Readability(doc.window.document);
const article = reader.parse();

Puppeteer (web scraping):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  const content = await page.content();
  await browser.close();
})();

While Readability focuses on extracting readable content from HTML, Puppeteer provides a full browser automation solution. Readability is more lightweight and specialized for content extraction, making it easier to use for that specific task. Puppeteer, on the other hand, offers more flexibility and power for complex web scraping and testing scenarios but requires more setup and resources.
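
The two tools can also be combined; a common pattern (a sketch, not an example from either project's documentation) is to let Puppeteer render a JavaScript-heavy page and then hand the resulting HTML to Readability:

const puppeteer = require('puppeteer');
const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Wait until network activity settles so client-side content has rendered
  await page.goto('https://example.com/article', { waitUntil: 'networkidle0' });
  const html = await page.content(); // fully rendered HTML
  await browser.close();

  const doc = new JSDOM(html, { url: 'https://example.com/article' }).window.document;
  const article = new Readability(doc).parse();
  console.log(article && article.title);
})();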

jsdom

A JavaScript implementation of various web standards, for use with Node.js

Pros of jsdom

  • Provides a complete DOM environment for Node.js, allowing for browser-like interactions
  • Supports a wide range of web standards and APIs, including HTML, CSS, and JavaScript
  • Actively maintained with frequent updates and a large community

Cons of jsdom

  • Heavier and more resource-intensive due to its comprehensive feature set
  • May be overkill for simple HTML parsing tasks
  • Slower performance compared to lightweight parsing libraries

Code Comparison

Readability (parsing HTML content):

const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');

const doc = new JSDOM(html);
const reader = new Readability(doc.window.document);
const article = reader.parse();

jsdom (creating a DOM environment):

const jsdom = require("jsdom");
const { JSDOM } = jsdom;

const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
console.log(dom.window.document.querySelector("p").textContent);

Summary

Readability is focused on extracting readable content from web pages, while jsdom provides a full DOM environment for Node.js. Readability is more lightweight and specific to content extraction, whereas jsdom offers broader functionality for web page manipulation and testing. Choose Readability for simple content parsing tasks and jsdom for more complex browser-like interactions in Node.js environments.

Cheerio

The fast, flexible, and elegant library for parsing and manipulating HTML and XML.

Pros of Cheerio

  • More versatile for general HTML parsing and manipulation
  • Lightweight and fast, with a jQuery-like API
  • Extensive documentation and large community support

Cons of Cheerio

  • Not specifically designed for content extraction
  • Requires more manual work to achieve readability-focused results
  • May need additional libraries for advanced text processing

Code Comparison

Cheerio:

const cheerio = require('cheerio');

// `html` is an HTML string obtained elsewhere
const $ = cheerio.load(html);
const title = $('h1').text();
const paragraphs = $('p').map((i, el) => $(el).text()).get();

Readability:

const reader = new Readability(document);
const article = reader.parse();
const { title, content } = article;

Key Differences

Cheerio is a general-purpose HTML parsing library, while Readability focuses specifically on extracting readable content from web pages. Cheerio provides more flexibility for various HTML manipulation tasks, but requires more custom code for content extraction. Readability offers a simpler API for obtaining clean, readable content with less effort.

Cheerio is better suited for projects requiring extensive HTML parsing and manipulation, while Readability excels in scenarios where the primary goal is to extract the main content from web pages for readability purposes.


README

Readability.js

A standalone version of the readability library used for Firefox Reader View.

Installation

Readability is available on npm:

npm install @mozilla/readability

You can then require() it, or for web-based projects, load the Readability.js script from your webpage.

Basic usage

To parse a document, you must create a new Readability object from a DOM document object, and then call the parse() method. Here's an example:

var article = new Readability(document).parse();

If you use Readability in a web browser, you will likely be able to use a document reference from elsewhere (e.g. fetched via XMLHttpRequest, in a same-origin <iframe> you have access to, etc.). In Node.js, you can use an external DOM library.

API Reference

new Readability(document, options)

The options object accepts a number of properties, all optional:

  • debug (boolean, default false): whether to enable logging.
  • maxElemsToParse (number, default 0 i.e. no limit): the maximum number of elements to parse.
  • nbTopCandidates (number, default 5): the number of top candidates to consider when analysing how tight the competition is among candidates.
  • charThreshold (number, default 500): the number of characters an article must have in order to return a result.
  • classesToPreserve (array): a set of classes to preserve on HTML elements when the keepClasses option is set to false.
  • keepClasses (boolean, default false): whether to preserve all classes on HTML elements. When set to false only classes specified in the classesToPreserve array are kept.
  • disableJSONLD (boolean, default false): when extracting page metadata, Readability gives precedence to Schema.org fields specified in the JSON-LD format. Set this option to true to skip JSON-LD parsing.
  • serializer (function, default el => el.innerHTML): controls how the content property returned by the parse() method is produced from the root DOM element. It may be useful to specify the serializer as the identity function (el => el) to obtain a DOM element instead of a string for content if you plan to process it further (see the sketch after this list).
  • allowedVideoRegex (RegExp, default undefined): a regular expression that matches video URLs that should be allowed to be included in the article content. If undefined, the default regex is applied.
  • linkDensityModifier (number, default 0): a number that is added to the base link density threshold during the shadiness checks. This can be used to penalize nodes with a high link density or vice versa.
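
For example, a minimal sketch of the serializer option (charThreshold is lowered only so this tiny document returns a result):

const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');

const dom = new JSDOM('<body><article><p>Hello world</p></article></body>');

const article = new Readability(dom.window.document, {
  charThreshold: 1,        // lowered so this short example still parses
  serializer: (el) => el   // identity function: return the DOM element itself
}).parse();

if (article) {
  // `content` is now a DOM element rather than an HTML string
  console.log(article.content.querySelectorAll('p').length);
}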

parse()

Returns an object containing the following properties:

  • title: article title;
  • content: HTML string of processed article content;
  • textContent: text content of the article, with all the HTML tags removed;
  • length: length of an article, in characters;
  • excerpt: article description, or short excerpt from the content;
  • byline: author metadata;
  • dir: content direction;
  • siteName: name of the site;
  • lang: content language;
  • publishedTime: published time;
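
A small sketch of consuming this result (parse() returns null when no article can be found, so guard for it):

const article = new Readability(document).parse();

if (article) {
  const { title, byline, excerpt, length } = article;
  console.log(title + ' (' + length + ' chars)');
  console.log('By: ' + (byline || 'unknown'));
  console.log(excerpt);
}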

The parse() method works by modifying the DOM. This removes some elements in the web page, which may be undesirable. You can avoid this by passing a clone of the document object to the Readability constructor:

var documentClone = document.cloneNode(true);
var article = new Readability(documentClone).parse();

isProbablyReaderable(document, options)

A quick-and-dirty way of figuring out if it's plausible that the contents of a given document are suitable for processing with Readability. It is likely to produce both false positives and false negatives. The reason it exists is to avoid bogging down a time-sensitive process (like loading and showing the user a webpage) with the complex logic in the core of Readability. Improvements to its logic (while not deteriorating its performance) are very welcome.

The options object accepts a number of properties, all optional:

  • minContentLength (number, default 140): the minimum node content length used to decide if the document is readerable;
  • minScore (number, default 20): the minimum cumulated 'score' used to determine if the document is readerable;
  • visibilityChecker (function, default isNodeVisible): the function used to determine if a node is visible;

The function returns a boolean corresponding to whether or not we suspect Readability.parse() will succeed at returning an article object. Here's an example:

/*
    Only instantiate Readability if we suspect
    the `parse()` method will produce a meaningful result.
*/
if (isProbablyReaderable(document)) {
    let article = new Readability(document).parse();
}

Node.js usage

Since Node.js does not come with its own DOM implementation, we rely on external libraries like jsdom. Here's an example using jsdom to obtain a DOM document object:

var { Readability } = require('@mozilla/readability');
var { JSDOM } = require('jsdom');
var doc = new JSDOM("<body>Look at this cat: <img src='./cat.jpg'></body>", {
  url: "https://www.example.com/the-page-i-got-the-source-from"
});
let reader = new Readability(doc.window.document);
let article = reader.parse();

Remember to pass the page's URI as the url option in the JSDOM constructor (as shown in the example above), so that Readability can convert relative URLs for images, hyperlinks, etc. to their absolute counterparts.

jsdom has the ability to run the scripts included in the HTML and fetch remote resources. For security reasons these are disabled by default, and we strongly recommend you keep them that way.

Security

If you're going to use Readability with untrusted input (whether in HTML or DOM form), we strongly recommend you use a sanitizer library like DOMPurify to avoid script injection when you use the output of Readability. We would also recommend using CSP to add further defense-in-depth restrictions to what you allow the resulting content to do. The Firefox integration of reader mode uses both of these techniques itself. Sanitizing unsafe content out of the input is explicitly not something we aim to do as part of Readability itself - there are other good sanitizer libraries out there, use them!
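
As a sketch of that advice, DOMPurify's documented Node.js setup can be applied to Readability's output like this:

const createDOMPurify = require('dompurify');
const { JSDOM } = require('jsdom');
const { Readability } = require('@mozilla/readability');

// `html` stands in for untrusted input fetched elsewhere
const html = '<body><article><p>Untrusted content…</p></article></body>';

const dom = new JSDOM(html, { url: 'https://example.com/article' });
const article = new Readability(dom.window.document).parse();

if (article) {
  // DOMPurify needs a window object when running outside the browser
  const DOMPurify = createDOMPurify(new JSDOM('').window);
  const safeHtml = DOMPurify.sanitize(article.content);
  console.log(safeHtml);
}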

Contributing

Please see our Contributing document.

License

Copyright (c) 2010 Arc90 Inc

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
