parser

📜 Extract meaningful content from the chaos of a web page



Top Related Projects

readability

9,825

A standalone version of the readability lib

newspaper

14,624

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:

python-readability

2,810

fast python port of arc90's readability tool, updated to match latest readability.js!

python-goose

4,041

Html Content / Article Extractor, web scrapping lib in Python

dragnet

1,268

Just the facts -- web page content extraction

Quick Overview

Postlight Parser is an open-source library that extracts and parses content from web pages. It transforms web pages into clean, usable JSON objects containing the relevant text, images, and metadata. This tool is particularly useful for developers working on content aggregation, readability tools, or any application that needs to extract structured data from web pages.

Pros

Supports multiple languages and character encodings
Extracts not only text but also images, videos, and metadata
Provides a clean JSON output for easy integration into various applications
Offers both a Node.js library and a hosted API for flexibility in usage

Cons

May occasionally struggle with complex or non-standard web page layouts
Requires some configuration for optimal results with certain types of websites
Limited customization options for fine-tuning the parsing process
Dependency on external libraries may lead to potential vulnerabilities if not kept up-to-date

Code Examples

Basic usage to parse a URL:

const { parse } = require('@postlight/parser');

const url = 'https://example.com/article';
parse(url).then((result) => {
  console.log(result);
});

Parsing with custom options:

const { parse } = require('@postlight/parser');

const url = 'https://example.com/article';
const options = {
  headers: { 'User-Agent': 'Custom User Agent' },
  contentType: 'markdown', // one of 'html', 'markdown', or 'text'
};

parse(url, options).then((result) => {
  console.log(result);
});

Handling errors:

const { parse } = require('@postlight/parser');

const url = 'https://example.com/nonexistent';
parse(url)
  .then((result) => {
    console.log(result);
  })
  .catch((error) => {
    console.error('Parsing failed:', error);
  });

Getting Started

To use Postlight Parser in your project, follow these steps:

Install the package:
```
npm install @postlight/parser
```

Import and use the parser in your code:

const { parse } = require('@postlight/parser');

const url = 'https://example.com/article';
parse(url).then((result) => {
  console.log(result.title);
  console.log(result.content);
  console.log(result.author);
  // ... work with other extracted data
});

For more advanced usage and configuration options, refer to the official documentation on the GitHub repository.

Competitor Comparisons

readability

9,825

A standalone version of the readability lib

Pros of Readability

More mature and widely adopted project with a larger community
Integrated into Firefox and other popular applications, indicating reliability
Supports multiple languages and character sets out-of-the-box

Cons of Readability

Primarily focused on content extraction, lacking some advanced features
May require more manual configuration for specific use cases
JavaScript-only implementation, which might limit usage in certain environments

Code Comparison

Parser:

const { parse } = require('@postlight/parser');

// `await` needs an async context in CommonJS, so use the promise directly.
parse('https://example.com').then((result) => console.log(result));

Readability:

const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');

// `html` is the page source as a string; Readability does not fetch it for you.
const doc = new JSDOM(html);
const reader = new Readability(doc.window.document);
const article = reader.parse();

Both libraries aim to extract content from web pages, but Parser offers a more comprehensive set of features, including metadata extraction and support for various content types. Readability focuses primarily on extracting the main content and provides a simpler API.

Parser is more suitable for applications requiring extensive data extraction and processing, while Readability excels in scenarios where clean, readable content is the primary goal. The choice between the two depends on specific project requirements and the desired level of customization.

newspaper

14,624

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:

Pros of Newspaper

More focused on news article extraction and processing
Includes additional features like keyword extraction and summary generation
Better support for non-English languages

Cons of Newspaper

Less actively maintained (last update over 2 years ago)
Limited to web scraping, doesn't support parsing from raw HTML or other formats
Slower performance compared to Parser

Code Comparison

Newspaper:

from newspaper import Article

url = 'http://example.com/article'
article = Article(url)
article.download()
article.parse()

print(article.title)
print(article.text)

Parser:

const { parse } = require('@postlight/parser');

const url = 'http://example.com/article';
parse(url).then((result) => {
  console.log(result.title);
  console.log(result.content);
});

Both libraries offer similar basic functionality for extracting content from web articles. However, Parser provides a more streamlined API and supports multiple input formats, while Newspaper offers additional features specific to news article processing.

Parser is more actively maintained and generally performs faster, making it a better choice for general-purpose content extraction. Newspaper, on the other hand, might be preferable for projects specifically focused on news article analysis, especially those requiring multilingual support or advanced features like keyword extraction and summarization.

python-readability

2,810

fast python port of arc90's readability tool, updated to match latest readability.js!

Pros of python-readability

Lightweight and focused solely on content extraction
Written in Python, making it easy to integrate with Python projects
Actively maintained with recent updates

Cons of python-readability

Limited to content extraction, lacking additional features like metadata parsing
May require additional libraries for full-fledged article parsing
Less comprehensive documentation compared to Parser

Code Comparison

python-readability:

from readability import Document
import requests

response = requests.get('http://example.com')
doc = Document(response.text)
print(doc.summary())

Parser:

const Parser = require('@postlight/parser');

Parser.parse('http://example.com').then(result => console.log(result));

Both repositories aim to extract content from web pages, but Parser offers a more comprehensive solution with additional features like metadata extraction. python-readability is more focused on content extraction and is better suited for Python-based projects, while Parser provides a more robust solution with broader language support and additional functionalities.

python-goose

4,041

Html Content / Article Extractor, web scrapping lib in Python

Pros of python-goose

Specifically designed for article extraction and content analysis
Includes features like image extraction and language detection
Lightweight and focused on a specific use case

Cons of python-goose

Less actively maintained compared to Parser
More limited in scope, primarily focused on article extraction
May require more manual configuration for certain use cases

Code Comparison

python-goose:

from goose3 import Goose

g = Goose()
article = g.extract(url='http://example.com/article')
print(article.cleaned_text)

Parser:

const Parser = require('@postlight/parser');

Parser.parse('http://example.com/article').then(result => console.log(result));

Both libraries aim to extract content from web pages, but they differ in their approach and features. python-goose is more focused on article extraction and content analysis, while Parser offers a broader range of web page parsing capabilities. Parser is more actively maintained and has a larger community, which may lead to better support and more frequent updates. However, python-goose might be a better choice for projects specifically focused on article extraction and analysis, especially if features like image extraction and language detection are important.

dragnet

1,268

Just the facts -- web page content extraction

Pros of Dragnet

Specialized in content extraction from web pages
Uses machine learning techniques for improved accuracy
Supports multiple languages

Cons of Dragnet

Less actively maintained (last update over 2 years ago)
Narrower focus on content extraction
Limited documentation and examples

Code Comparison

Parser:

const { parse } = require('@postlight/parser');

const url = 'http://example.com';
parse(url).then((result) => {
  console.log(result.content);
});

Dragnet:

import requests
from dragnet import extract_content

url = 'http://example.com'
html = requests.get(url).text
content = extract_content(html)

print(content)

Both libraries aim to extract content from web pages, but Parser offers a more comprehensive set of features for article parsing and metadata extraction. Dragnet focuses specifically on content extraction using machine learning techniques.

Parser provides a higher-level API with more built-in functionality, while Dragnet requires additional steps for fetching HTML content. Parser also offers more extensive documentation and examples, making it easier for developers to get started.

However, Dragnet's machine learning approach may provide better accuracy in certain scenarios, especially for complex web pages or non-standard layouts. It also supports multiple languages out of the box, which can be advantageous for projects dealing with multilingual content.


README

Postlight Parser - Extracting content from chaos

Postlight's Parser extracts the bits that humans care about from any URL you give it. That includes article content, titles, authors, published dates, excerpts, lead images, and more.

Postlight Parser powers Postlight Reader, a browser extension that removes ads and distractions, leaving only text and images for a beautiful reading view on any site.

Postlight Parser allows you to easily create custom parsers using simple JavaScript and CSS selectors. This allows you to proactively manage parsing and migration edge cases. There are many examples available along with documentation.

How? Like this.

Installation

# If you're using yarn
yarn add @postlight/parser

# If you're using npm
npm install @postlight/parser

Usage

import Parser from '@postlight/parser';

Parser.parse(url).then(result => console.log(result));

// NOTE: When used in the browser, you can omit the URL argument
// and simply run `Parser.parse()` to parse the current page.

The result looks like this:

{
  "title": "Thunder (mascot)",
  "content": "... <p><b>Thunder</b> is the <a href=\"https://en.wikipedia.org/wiki/Stage_name\">stage name</a> for the...",
  "author": "Wikipedia Contributors",
  "date_published": "2016-09-16T20:56:00.000Z",
  "lead_image_url": null,
  "dek": null,
  "next_page_url": null,
  "url": "https://en.wikipedia.org/wiki/Thunder_(mascot)",
  "domain": "en.wikipedia.org",
  "excerpt": "Thunder Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos",
  "word_count": 4677,
  "direction": "ltr",
  "total_pages": 1,
  "rendered_pages": 1
}

If Parser is unable to find a field, that field will return null.
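Because any field may come back as null, callers usually want fallbacks before rendering. The following is a minimal sketch of one way to do that; the `withDefaults` helper and the fallback strings are illustrative assumptions, not part of Postlight Parser.

```javascript
// Fallbacks for fields that the sample result shows can be null.
// The field names come from the result above; the values are arbitrary.
const DEFAULTS = {
  title: '(untitled)',
  author: '(unknown author)',
  lead_image_url: '',
  excerpt: '',
};

// Return a copy of `result` in which null or missing fields listed in
// `defaults` are replaced by their fallback. Other fields are untouched.
function withDefaults(result, defaults = DEFAULTS) {
  const filled = { ...result };
  for (const [field, fallback] of Object.entries(defaults)) {
    if (filled[field] === null || filled[field] === undefined) {
      filled[field] = fallback;
    }
  }
  return filled;
}
```

A caller could then write `Parser.parse(url).then(withDefaults)` and read `title` or `author` without null checks.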

`parse()` Options

Content Formats

By default, Postlight Parser returns the content field as HTML. However, you can override this behavior by passing in options to the parse function, specifying whether or not to scrape all pages of an article, and what type of output to return (valid values are 'html', 'markdown', and 'text'). For example:

Parser.parse(url, { contentType: 'markdown' }).then(result =>
  console.log(result)
);

This returns the page's content as GitHub-flavored Markdown:

"content": "...**Thunder** is the [stage name](https://en.wikipedia.org/wiki/Stage_name) for the..."
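Since only three `contentType` values are valid, it can help to check the value before making a request. Here is a small sketch of such a guard; `parseAs` is a hypothetical wrapper, and the parser is passed in as an argument so the snippet runs without the package installed.

```javascript
// The three values listed above as valid for `contentType`.
const CONTENT_TYPES = ['html', 'markdown', 'text'];

// `parser` is any object with a parse(url, options) method, for example
// the default export of '@postlight/parser'.
async function parseAs(parser, url, contentType = 'html') {
  if (!CONTENT_TYPES.includes(contentType)) {
    throw new Error(
      `Unsupported contentType "${contentType}"; expected one of ${CONTENT_TYPES.join(', ')}`
    );
  }
  return parser.parse(url, { contentType });
}
```

With the real library this would be called as `parseAs(Parser, url, 'text')`.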

Custom Request Headers

You can include custom headers in requests by passing name-value pairs to the parse function as follows:

Parser.parse(url, {
  headers: {
    Cookie: 'name=value; name2=value2; name3=value3',
    'User-Agent':
      'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1',
  },
}).then(result => console.log(result));
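If cookies are kept as an object rather than a pre-built string, the `Cookie` header can be assembled before calling `parse`. The helpers below are a sketch with hypothetical names (`cookieHeader`, `requestOptions`); they only build the options object and apply no encoding to the values.

```javascript
// Join a plain object of cookie names and values into one Cookie header
// string. Values are used as-is; no encoding or escaping is applied.
function cookieHeader(cookies) {
  return Object.entries(cookies)
    .map(([name, value]) => `${name}=${value}`)
    .join('; ');
}

// Build the options object to pass as the second argument to parse().
function requestOptions(cookies, userAgent) {
  const headers = { Cookie: cookieHeader(cookies) };
  if (userAgent) headers['User-Agent'] = userAgent;
  return { headers };
}
```

The result can be passed straight through, as in `Parser.parse(url, requestOptions(cookies, userAgent))`.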

Pre-fetched HTML

You can use Postlight Parser to parse custom or pre-fetched HTML by passing an HTML string to the parse function as follows:

Parser.parse(url, {
  html:
    '<html><body><article><h1>Thunder (mascot)</h1><p>Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos</p></article></body></html>',
}).then(result => console.log(result));

Note that the URL argument is still supplied, in order to identify the web site and use its custom parser, if it has any, though it will not be used for fetching content.
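A common reason to pass pre-fetched HTML is that the page comes from a cache or from your own HTTP client. The sketch below assumes a hypothetical `loadHtml` function that returns the page source as a string; the parser itself is passed in so the snippet stays self-contained.

```javascript
// `parser` is any object with parse(url, options), such as the default
// export of '@postlight/parser'. `loadHtml` is a hypothetical async
// function returning the page's HTML (from a cache, a file, or a client).
async function parseFromSource(parser, url, loadHtml) {
  const html = await loadHtml(url);
  if (typeof html !== 'string' || html.length === 0) {
    throw new Error(`No HTML available for ${url}`);
  }
  // The URL is still passed so a site-specific parser can be selected.
  return parser.parse(url, { html });
}
```

Failing early on empty HTML keeps a cache miss from being reported as a parsing problem.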

The command-line parser

Postlight Parser also ships with a CLI, meaning you can use it from your command line like so:


# Install Postlight Parser globally
yarn global add @postlight/parser
#   or
npm -g install @postlight/parser

# Then
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source

# Pass optional --format argument to set content type (html|markdown|text)
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --format=markdown

# Pass optional --header.name=value arguments to include custom headers in the request
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --header.Cookie="name=value; name2=value2; name3=value3" --header.User-Agent="Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"

# Pass optional --extend argument to add a custom type to the response
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend credit="p:last-child em"

# Pass optional --extend-list argument to add a custom type with multiple matches
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend-list categories=".meta__tags-list a"

# Get the value of attributes by adding a pipe to --extend or --extend-list
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend-list links=".body a|href"

# Pass optional --add-extractor argument to add a custom extractor at runtime.
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --add-extractor ./src/extractors/fixtures/postlight.com/index.js
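When the CLI is driven from a script, building the argument list programmatically avoids quoting mistakes. This is a sketch only: `cliArgs` is a hypothetical helper, and it emits just the `--format`, `--header.*`, `--extend`, and `--extend-list` flags shown above.

```javascript
// Build the argument list for the `postlight-parser` CLI from a plain
// options object. Selectors may carry an attribute after a pipe,
// e.g. '.body a|href', exactly as on the command line.
function cliArgs(url, { format, headers = {}, extend = {}, extendList = {} } = {}) {
  const args = [url];
  if (format) args.push(`--format=${format}`);
  for (const [name, value] of Object.entries(headers)) {
    args.push(`--header.${name}=${value}`);
  }
  for (const [name, selector] of Object.entries(extend)) {
    args.push('--extend', `${name}=${selector}`);
  }
  for (const [name, selector] of Object.entries(extendList)) {
    args.push('--extend-list', `${name}=${selector}`);
  }
  return args;
}
```

The array can be handed to Node's `child_process.execFile('postlight-parser', args, callback)`, which passes arguments without going through a shell, so selectors containing spaces or pipes need no extra quoting.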

License

Licensed under either of the below, at your preference:

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

Contributing

For details on how to contribute to Postlight Parser, including how to write a custom content extractor for any site, see CONTRIBUTING.md

Unless it is explicitly stated otherwise, any contribution intentionally submitted for inclusion in the work, as defined in the Apache-2.0 license, shall be dual licensed as above without any additional terms or conditions.

🔬 A Labs project from your friends at Postlight. Happy coding!
