postlight/parser

📜 Extract meaningful content from the chaos of a web page

Top Related Projects

A standalone version of the readability lib

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:

fast python port of arc90's readability tool, updated to match latest readability.js!

Html Content / Article Extractor, web scrapping lib in Python

Just the facts -- web page content extraction

Quick Overview

Postlight Parser is an open-source library that extracts and parses content from web pages. It transforms web pages into clean, usable JSON objects containing the relevant text, images, and metadata. This tool is particularly useful for developers working on content aggregation, readability tools, or any application that needs to extract structured data from web pages.

Pros

  • Supports multiple languages and character encodings
  • Extracts not only text but also images, videos, and metadata
  • Provides a clean JSON output for easy integration into various applications
  • Offers both a Node.js library and a command-line interface for flexibility in usage

Cons

  • May occasionally struggle with complex or non-standard web page layouts
  • Requires some configuration for optimal results with certain types of websites
  • Limited customization options for fine-tuning the parsing process
  • Dependency on external libraries may lead to potential vulnerabilities if not kept up-to-date

Code Examples

  1. Basic usage to parse a URL:
const { parse } = require('@postlight/parser');

const url = 'https://example.com/article';
parse(url).then((result) => {
  console.log(result);
});
  2. Parsing with custom options:
const { parse } = require('@postlight/parser');

const url = 'https://example.com/article';
// Options documented in the README: custom request headers and output format
const options = {
  headers: { 'User-Agent': 'Custom User Agent' },
  contentType: 'markdown'
};

parse(url, options).then((result) => {
  console.log(result);
});
  3. Handling errors:
const { parse } = require('@postlight/parser');

const url = 'https://example.com/nonexistent';
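parse(url)
  .then((result) => {
    // Some failures may resolve with an object flagged `error` instead of rejecting
    if (result && result.error) {
      console.error('Parsing failed:', result);
      return;
    }
    console.log(result);
  })
  .catch((error) => {
    console.error('Parsing failed:', error);
  });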

Getting Started

To use Postlight Parser in your project, follow these steps:

  1. Install the package:

    npm install @postlight/parser
    
  2. Import and use the parser in your code:

    const { parse } = require('@postlight/parser');
    
    const url = 'https://example.com/article';
    parse(url).then((result) => {
      console.log(result.title);
      console.log(result.content);
      console.log(result.author);
      // ... work with other extracted data
    });
    
  3. For more advanced usage and configuration options, refer to the official documentation on the GitHub repository.

Competitor Comparisons

A standalone version of the readability lib

Pros of Readability

  • More mature and widely adopted project with a larger community
  • Integrated into Firefox and other popular applications, indicating reliability
  • Supports multiple languages and character sets out-of-the-box

Cons of Readability

  • Primarily focused on content extraction, lacking some advanced features
  • May require more manual configuration for specific use cases
  • JavaScript-only implementation, which might limit usage in certain environments

Code Comparison

Parser:

const { parse } = require('@postlight/parser');

// `await` needs an async context in CommonJS, so use the returned promise
parse('https://example.com').then((result) => {
  console.log(result);
});

Readability:

const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');

// `html` is a string of page markup fetched beforehand
const doc = new JSDOM(html);
const reader = new Readability(doc.window.document);
const article = reader.parse();

Both libraries aim to extract content from web pages, but Parser offers a more comprehensive set of features, including metadata extraction and support for various content types. Readability focuses primarily on extracting the main content and provides a simpler API.

Parser is more suitable for applications requiring extensive data extraction and processing, while Readability excels in scenarios where clean, readable content is the primary goal. The choice between the two depends on specific project requirements and the desired level of customization.

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:

Pros of Newspaper

  • More focused on news article extraction and processing
  • Includes additional features like keyword extraction and summary generation
  • Better support for non-English languages

Cons of Newspaper

  • Less actively maintained (last update over 2 years ago)
  • Built around downloading from a URL; parsing pre-fetched HTML is possible but less prominent in its API
  • Slower performance compared to Parser

Code Comparison

Newspaper:

from newspaper import Article

url = 'http://example.com/article'
article = Article(url)
article.download()
article.parse()

print(article.title)
print(article.text)

Parser:

const { parse } = require('@postlight/parser');

const url = 'http://example.com/article';
parse(url).then((result) => {
  console.log(result.title);
  console.log(result.content);
});

Both libraries offer similar basic functionality for extracting content from web articles. However, Parser provides a more streamlined API and supports multiple input formats, while Newspaper offers additional features specific to news article processing.

Parser is more actively maintained and generally performs faster, making it a better choice for general-purpose content extraction. Newspaper, on the other hand, might be preferable for projects specifically focused on news article analysis, especially those requiring multilingual support or advanced features like keyword extraction and summarization.

fast python port of arc90's readability tool, updated to match latest readability.js!

Pros of python-readability

  • Lightweight and focused solely on content extraction
  • Written in Python, making it easy to integrate with Python projects
  • Actively maintained with recent updates

Cons of python-readability

  • Limited to content extraction, lacking additional features like metadata parsing
  • May require additional libraries for full-fledged article parsing
  • Less comprehensive documentation compared to Parser

Code Comparison

python-readability:

from readability import Document
import requests

response = requests.get('http://example.com')
doc = Document(response.text)
print(doc.summary())

Parser:

const Parser = require('@postlight/parser');

Parser.parse('http://example.com').then(result => console.log(result));

Both repositories aim to extract content from web pages, but Parser offers a more comprehensive solution with additional features like metadata extraction. python-readability is more focused on content extraction and is better suited for Python-based projects, while Parser provides a more robust solution with broader language support and additional functionalities.

Html Content / Article Extractor, web scrapping lib in Python

Pros of python-goose

  • Specifically designed for article extraction and content analysis
  • Includes features like image extraction and language detection
  • Lightweight and focused on a specific use case

Cons of python-goose

  • Less actively maintained compared to Parser
  • More limited in scope, primarily focused on article extraction
  • May require more manual configuration for certain use cases

Code Comparison

python-goose:

from goose3 import Goose

g = Goose()
article = g.extract(url='http://example.com/article')
print(article.cleaned_text)

Parser:

const Parser = require('@postlight/parser');

Parser.parse('http://example.com/article').then(result => console.log(result));

Both libraries aim to extract content from web pages, but they differ in their approach and features. python-goose is more focused on article extraction and content analysis, while Parser offers a broader range of web page parsing capabilities. Parser is more actively maintained and has a larger community, which may lead to better support and more frequent updates. However, python-goose might be a better choice for projects specifically focused on article extraction and analysis, especially if features like image extraction and language detection are important.

Just the facts -- web page content extraction

Pros of Dragnet

  • Specialized in content extraction from web pages
  • Uses machine learning techniques for improved accuracy
  • Supports multiple languages

Cons of Dragnet

  • Less actively maintained (last update over 2 years ago)
  • Narrower focus on content extraction
  • Limited documentation and examples

Code Comparison

Parser:

const { parse } = require('@postlight/parser');

const url = 'http://example.com';
parse(url).then((result) => {
  console.log(result.content);
});

Dragnet:

import requests
from dragnet import extract_content

url = 'http://example.com'
html = requests.get(url).text
content = extract_content(html)

print(content)

Both libraries aim to extract content from web pages, but Parser offers a more comprehensive set of features for article parsing and metadata extraction. Dragnet focuses specifically on content extraction using machine learning techniques.

Parser provides a higher-level API with more built-in functionality, while Dragnet requires additional steps for fetching HTML content. Parser also offers more extensive documentation and examples, making it easier for developers to get started.

However, Dragnet's machine learning approach may provide better accuracy in certain scenarios, especially for complex web pages or non-standard layouts. It also supports multiple languages out of the box, which can be advantageous for projects dealing with multilingual content.

README

Postlight Parser - Extracting content from chaos

Postlight's Parser extracts the bits that humans care about from any URL you give it. That includes article content, titles, authors, published dates, excerpts, lead images, and more.

Postlight Parser powers Postlight Reader, a browser extension that removes ads and distractions, leaving only text and images for a beautiful reading view on any site.

Postlight Parser allows you to easily create custom parsers using simple JavaScript and CSS selectors. This allows you to proactively manage parsing and migration edge cases. There are many examples available along with documentation.

How? Like this.

Installation

# If you're using yarn
yarn add @postlight/parser

# If you're using npm
npm install @postlight/parser

Usage

import Parser from '@postlight/parser';

Parser.parse(url).then(result => console.log(result));

// NOTE: When used in the browser, you can omit the URL argument
// and simply run `Parser.parse()` to parse the current page.

The result looks like this:

{
  "title": "Thunder (mascot)",
  "content": "... <p><b>Thunder</b> is the <a href=\"https://en.wikipedia.org/wiki/Stage_name\">stage name</a> for the...",
  "author": "Wikipedia Contributors",
  "date_published": "2016-09-16T20:56:00.000Z",
  "lead_image_url": null,
  "dek": null,
  "next_page_url": null,
  "url": "https://en.wikipedia.org/wiki/Thunder_(mascot)",
  "domain": "en.wikipedia.org",
  "excerpt": "Thunder Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos",
  "word_count": 4677,
  "direction": "ltr",
  "total_pages": 1,
  "rendered_pages": 1
}

If Parser is unable to find a field, that field will return null.
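
Because any field can come back as null, code that consumes the result usually needs fallbacks. The following is a minimal sketch of one way to do that; `describeArticle` and its fallback strings are hypothetical and not part of Postlight Parser. It only reads field names shown in the sample result above.

```javascript
// Hypothetical helper (not part of Postlight Parser): build a one-line
// summary from a parse result, substituting fallbacks for null fields.
function describeArticle(result) {
  const title = result.title ?? '(untitled)';
  const author = result.author ?? 'unknown author';
  // date_published is an ISO 8601 string when present; keep only the date part.
  const published = result.date_published
    ? new Date(result.date_published).toISOString().slice(0, 10)
    : 'undated';
  const words = result.word_count ?? 0;
  return `${title} by ${author} (${published}, ${words} words)`;
}
```

It could be called as `Parser.parse(url).then(result => console.log(describeArticle(result)))`.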

parse() Options

Content Formats

By default, Postlight Parser returns the content field as HTML. However, you can override this behavior by passing in options to the parse function, specifying whether or not to scrape all pages of an article, and what type of output to return (valid values are 'html', 'markdown', and 'text'). For example:

Parser.parse(url, { contentType: 'markdown' }).then(result =>
  console.log(result)
);

This returns the page's content as GitHub-flavored Markdown:

"content": "...**Thunder** is the [stage name](https://en.wikipedia.org/wiki/Stage_name) for the..."
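
If the content type comes from user input or configuration, it may be worth checking it against the three valid values before calling the parser. This is a sketch only: `buildParseOptions` is a hypothetical helper, and the choice to throw on an unknown value is an assumption, not behavior of the library.

```javascript
// The three content types named in the documentation above.
const VALID_CONTENT_TYPES = ['html', 'markdown', 'text'];

// Hypothetical helper (not part of Postlight Parser): return an options
// object for parse(), rejecting content types the docs do not list.
function buildParseOptions(contentType = 'html') {
  if (!VALID_CONTENT_TYPES.includes(contentType)) {
    throw new RangeError(
      `Unsupported contentType "${contentType}"; expected one of ${VALID_CONTENT_TYPES.join(', ')}`
    );
  }
  return { contentType };
}
```

For example, `Parser.parse(url, buildParseOptions('text'))` would request plain-text content.
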

Custom Request Headers

You can include custom headers in requests by passing name-value pairs to the parse function as follows:

Parser.parse(url, {
  headers: {
    Cookie: 'name=value; name2=value2; name3=value3',
    'User-Agent':
      'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1',
  },
}).then(result => console.log(result));

Pre-fetched HTML

You can use Postlight Parser to parse custom or pre-fetched HTML by passing an HTML string to the parse function as follows:

Parser.parse(url, {
  html:
    '<html><body><article><h1>Thunder (mascot)</h1><p>Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos</p></article></body></html>',
}).then(result => console.log(result));

Note that the URL argument is still supplied, in order to identify the web site and use its custom parser, if it has any, though it will not be used for fetching content.
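
One way to combine your own fetching with the `html` option is sketched below. `parsePrefetched` is a hypothetical helper; it assumes a runtime with a global `fetch` (Node 18+ or a browser) and takes the parse function as an argument so it can be swapped out in tests.

```javascript
// Hypothetical helper (not part of Postlight Parser): fetch the page
// yourself, then hand the markup to the parser via the `html` option.
// `parse` is the parse function you use, e.g. Parser.parse.
// `fetchImpl` defaults to the global fetch (assumed to be available).
async function parsePrefetched(url, parse, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Fetch failed with HTTP status ${response.status}`);
  }
  const html = await response.text();
  // The URL is still passed so a site-specific parser can be selected.
  return parse(url, { html });
}
```

A call might look like `parsePrefetched(url, Parser.parse).then(result => console.log(result))`.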

The command-line parser

Postlight Parser also ships with a CLI, meaning you can use it from your command line like so:

# Install Postlight Parser globally
yarn global add @postlight/parser
#   or
npm -g install @postlight/parser

# Then
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source

# Pass optional --format argument to set content type (html|markdown|text)
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --format=markdown

# Pass optional --header.name=value arguments to include custom headers in the request
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --header.Cookie="name=value; name2=value2; name3=value3" --header.User-Agent="Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"

# Pass optional --extend argument to add a custom type to the response
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend credit="p:last-child em"

# Pass optional --extend-list argument to add a custom type with multiple matches
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend-list categories=".meta__tags-list a"

# Get the value of attributes by adding a pipe to --extend or --extend-list
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend-list links=".body a|href"

# Pass optional --add-extractor argument to add a custom extractor at runtime.
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --add-extractor ./src/extractors/fixtures/postlight.com/index.js
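
The `name=selector|attribute` shape used by `--extend` and `--extend-list` above can be illustrated with a small splitter. This is only a sketch of how such a value might be read; `splitExtendSpec` is a hypothetical function and is not taken from the CLI's source.

```javascript
// Hypothetical illustration (not the CLI's actual code): split a value such
// as 'links=.body a|href' into its name, CSS selector, and optional attribute.
function splitExtendSpec(spec) {
  const eq = spec.indexOf('=');
  if (eq < 1) {
    throw new Error(`Expected "name=selector", got "${spec}"`);
  }
  const name = spec.slice(0, eq);
  const rest = spec.slice(eq + 1);
  // Treat the last pipe, if any, as the selector/attribute separator.
  const pipe = rest.lastIndexOf('|');
  const selector = pipe === -1 ? rest : rest.slice(0, pipe);
  const attribute = pipe === -1 ? null : rest.slice(pipe + 1);
  return { name, selector, attribute };
}
```

Note that this simple split would misread selectors that themselves contain a pipe, such as `[lang|=en]`.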

License

Licensed under either of the following, at your preference:

  • Apache License, Version 2.0
  • MIT License

Contributing

For details on how to contribute to Postlight Parser, including how to write a custom content extractor for any site, see CONTRIBUTING.md.

Unless it is explicitly stated otherwise, any contribution intentionally submitted for inclusion in the work, as defined in the Apache-2.0 license, shall be dual licensed as above without any additional terms or conditions.


🔬 A Labs project from your friends at Postlight. Happy coding!
