percollate

A command-line tool to turn web pages into readable PDF, EPUB, HTML, or Markdown docs.

Quick Overview

Percollate is a command-line tool that converts web pages to PDF, EPUB, HTML, or Markdown files. It aims to create high-quality, readable documents from web content, making it easier to read articles offline or on e-readers.

Pros

Supports multiple output formats (PDF, EPUB, HTML, Markdown)
Offers customization options for layout and styling
Can process multiple URLs in a single command
Preserves important metadata from the original web pages

Cons

Requires Node.js to be installed
May struggle with complex or dynamically loaded web pages
Limited support for non-article content (e.g., interactive elements)
Occasional formatting issues with certain websites

Code Examples

# Convert a single URL to PDF
percollate pdf https://example.com/article

# Bundle multiple URLs into a single EPUB
percollate epub https://example.com/article1 https://example.com/article2

# Convert a URL to HTML with a custom stylesheet
percollate html --style custom.css https://example.com/article

# Override individual styles with an inline CSS snippet
percollate pdf --css "html { font-size: 10pt }" https://example.com/article

# Export each page to its own Markdown file
percollate md --individual https://example.com/article1 https://example.com/article2

Getting Started

To get started with Percollate, follow these steps:

Install Node.js if not already installed
Install Percollate globally:
```
npm install -g percollate
```

Run Percollate with a URL:

percollate pdf https://example.com/article

For more advanced usage, refer to the project's documentation on GitHub.

Competitor Comparisons

html-to-markdown

2,915

⚙️ Convert HTML to Markdown. Even works with entire websites and can be extended through rules.

Pros of html-to-markdown

Pure Go implementation, making it easy to integrate into Go projects
Supports custom rules for HTML-to-Markdown conversion
Actively maintained with regular updates

Cons of html-to-markdown

Limited to HTML-to-Markdown conversion only
Lacks additional features like content extraction or PDF generation
May require more manual configuration for complex conversions

Code Comparison

html-to-markdown:

converter := md.NewConverter("", true, nil)
markdown, err := converter.ConvertString(html)

percollate:

const percollate = require('percollate');
const pdf = await percollate.pdf(url, options);

html-to-markdown focuses solely on HTML-to-Markdown conversion, while percollate offers a broader range of features including content extraction, PDF generation, and EPUB creation. percollate is more suitable for end-users looking for a complete web content processing tool, while html-to-markdown is better suited for developers integrating HTML-to-Markdown conversion into Go applications.

Both projects have their strengths, and the choice between them depends on the specific use case and programming language preferences of the user.

readability

10,277

A standalone version of the readability lib

Pros of Readability

More mature and widely adopted project with extensive browser support
Focuses specifically on content extraction, making it more lightweight
Actively maintained with regular updates and bug fixes

Cons of Readability

Limited to content extraction without additional formatting options
Requires more setup and integration for creating full documents
Less flexible for customizing output formats

Code Comparison

Readability:

var article = new Readability(document).parse();
console.log(article.content);

Percollate:

const html = await percollate.grab(url);
const output = await percollate.process(html, {
  output: 'pdf',
  template: 'article'
});

Summary

Readability is a focused content extraction library, while Percollate is a more comprehensive tool for creating documents from web content. Readability excels in its core functionality and browser support, making it ideal for projects that need reliable content parsing. Percollate offers more features for generating complete documents in various formats, but may be overkill for simpler content extraction tasks. The choice between the two depends on the specific requirements of your project and the desired output format.

puppeteer

92,518

JavaScript API for Chrome and Firefox

Pros of Puppeteer

More comprehensive and feature-rich, offering full browser automation capabilities
Larger community and ecosystem, with extensive documentation and third-party plugins
Supports a wider range of use cases beyond web scraping and PDF generation

Cons of Puppeteer

Heavier and more resource-intensive, requiring a full Chromium instance
Steeper learning curve due to its broader scope and more complex API
Overkill for simple web content extraction and PDF generation tasks

Code Comparison

Percollate (HTML content extraction):

const html = await percollate.grab(url);

Puppeteer (HTML content extraction):

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
const html = await page.content();
await browser.close();

Summary

Percollate is a focused tool for web content extraction and document generation, which makes it simpler for those specific use cases. Note that it relies on Puppeteer itself to render PDFs, so it is not lighter in terms of browser dependencies. Puppeteer is a full-fledged browser automation library with broader capabilities but a more complex API. Choose Percollate for quick content grabbing and PDF creation, and Puppeteer for more advanced web automation tasks.

jsdom

21,145

A JavaScript implementation of various web standards, for use with Node.js

Pros of jsdom

More comprehensive DOM simulation, supporting a wider range of web standards
Larger community and more frequent updates
Can be used for server-side rendering and testing JavaScript applications

Cons of jsdom

Heavier and more resource-intensive
Steeper learning curve due to its extensive API
May be overkill for simple web scraping or content extraction tasks

Code Comparison

jsdom:

const { JSDOM } = require('jsdom');

const dom = new JSDOM(`<p>Hello world</p>`);
console.log(dom.window.document.querySelector("p").textContent);

percollate:

const percollate = require('percollate');

percollate.configure();
percollate.grab('https://example.com')
  .then(content => console.log(content));

Summary

jsdom is a more robust and feature-rich library for simulating a DOM environment, making it ideal for complex web application testing and server-side rendering. However, it may be excessive for simpler tasks like content extraction.

percollate, on the other hand, is specifically designed for web content extraction and conversion, and uses jsdom internally to prepare the DOM. It's less versatile than jsdom but easier to use for its primary use case.

Choose jsdom for comprehensive DOM simulation and testing, and percollate for straightforward web content extraction and conversion to other formats.

cheerio

29,666

The fast, flexible, and elegant library for parsing and manipulating HTML and XML.

Pros of Cheerio

More versatile and widely used for general web scraping and parsing
Lightweight and fast, with a jQuery-like syntax for DOM manipulation
Extensive documentation and large community support

Cons of Cheerio

Focused solely on HTML parsing and manipulation
Lacks built-in PDF generation capabilities
Requires additional libraries for advanced features like content extraction

Code Comparison

Cheerio:

const cheerio = require('cheerio');
const $ = cheerio.load('<h2 class="title">Hello world</h2>');

$('h2.title').text('Hello there!');
$('h2').addClass('welcome');

Percollate:

const percollate = require('percollate');

percollate.configure();
percollate.pdf({
  url: 'https://example.com',
  output: 'example.pdf'
});

Key Differences

Percollate is specifically designed for converting web pages to PDF, EPUB, HTML, or Markdown
Cheerio is a more general-purpose HTML parsing and manipulation tool
Percollate offers out-of-the-box content extraction and PDF generation
Cheerio provides more flexibility for custom DOM manipulation tasks

Use Cases

Choose Cheerio for general web scraping, parsing, and DOM manipulation
Opt for Percollate when focused on converting web pages to clean, readable documents

parser

5,685

📜 Extract meaningful content from the chaos of a web page

Pros of Parser

More comprehensive content extraction, handling a wider range of web content types
Larger community and more frequent updates, indicating better long-term support
Provides additional metadata like author, date, and lead image

Cons of Parser

Heavier and more complex, potentially slower for simple use cases
Requires more setup and configuration for basic tasks
JavaScript-only, limiting use in other programming environments

Code Comparison

Parser:

const { parse } = require('@postlight/parser');

const result = await parse('https://example.com');
console.log(result.content);

Percollate:

const percollate = require('percollate');

percollate.configure();
const html = await percollate.grab('https://example.com');
console.log(html);

Summary

Parser offers more robust content extraction and metadata, making it suitable for complex web scraping tasks. It has a larger community and more frequent updates. However, it's heavier and more complex than Percollate.

Percollate is simpler and more lightweight, focusing on basic HTML content extraction. It's easier to set up for simple tasks but lacks some of the advanced features and metadata extraction capabilities of Parser.

Choose Parser for comprehensive web content extraction with rich metadata, or Percollate for simpler, lightweight HTML grabbing tasks.

README

Percollate is a command-line tool that turns web pages into beautifully formatted PDF, EPUB, HTML or Markdown files.

Sample Output — Sample spread from the generated PDF of a chapter in Dimensions of Colour; rendered here in black & white for a smaller image file size.

Installation
Usage
- Available commands
- Available options
Recipes
How it works
Updating
Limitations
Troubleshooting
Contributing
See also

Installation

percollate is a Node.js command-line tool which you can install globally from npm:

npm install -g percollate

Percollate and its dependencies require Node.js 14.17.0 or later.

Community-maintained packages

There's a packaged version available on Arch User Repository, which you can install using your local AUR helper (yay, pacaur, or similar):

yay -S nodejs-percollate

Some Docker images are available in this tracking issue.

Usage

Run percollate --help for a list of available commands and options.

Percollate is invoked on one or more operands (usually URLs):

percollate <command> [options] url [url]...

The following commands are available:

percollate pdf produces a PDF file;
percollate epub produces an EPUB file;
percollate html produces an HTML file;
percollate md produces a Markdown file.

The operands can be URLs, paths to local files, or the - character which stands for stdin (the standard input).

Available options

Unless otherwise stated, these options apply to all commands.

`-o, --output`

Specify the path of the resulting bundle relative to the current folder.

percollate pdf https://example.com -o my-example.pdf

`-u, --url`

Using the - operand you can read the HTML content from stdin, as fetched by a separate command, such as curl. In this sort of setup, percollate does not know the URL from which the content has been fetched, and relative paths on images, anchors, et cetera won't resolve correctly.

Use the --url option to supply the source's original URL.

curl https://example.com | percollate pdf - --url=https://example.com

`-w, --wait`

By default, percollate processes URLs in parallel. Use the --wait option to process them sequentially instead, with a pause between items. The delay is specified in seconds, and can be zero.

percollate epub --wait=1 url1 url2 url3

`--individual`

By default, percollate bundles all web pages in a single file. Use the --individual flag to export each source to a separate file.

percollate pdf --individual http://example.com/page1 http://example.com/page2

`--template`

Path to a custom HTML template. Applies to pdf, html, and md.

`--style`

Path to a custom CSS stylesheet, relative to the current folder.

`--css`

Additional CSS styles you can pass from the command-line to override styles specified by the default/custom stylesheet.

`--no-amp`

Don't prefer the AMP version of the web page.

`--debug`

Print more detailed information.

`-t, --title`

Provide a title for the bundle.

percollate epub http://example.com/page-1 http://example.com/page-2 --title="Best Of Example"

`-a, --author`

Provide an author for the bundle.

percollate pdf --author="Ella Example" http://example.com

`--cover`

Generate a cover. The option is implicitly enabled when the --title option is provided, or when bundling more than one web page to a single file. Disable this implicit behavior by passing the --no-cover flag.

`--toc`

Generate a hyperlinked table of contents. The option is implicitly enabled when bundling more than one web page to a single file. Disable this implicit behavior by passing the --no-toc flag.

Applies to pdf, html, and md.

`--toc-level=<level>`

By default, the table of contents is a flat list of article titles. With the --toc-level option the table of contents will include headings under each article title (<h2>, <h3>, etc.), up to the specified heading depth. A number between 1 and 6 is expected.

Using --toc-level with a value greater than 1 implies --toc.
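As a rough illustration of the expected range, the following shell sketch validates a heading depth before building the flag. The helper name `toc_level_flag` is hypothetical and not part of percollate; only the `--toc-level` option itself comes from the documentation above.

```shell
# Hypothetical helper (not part of percollate): print a --toc-level flag
# for a heading depth between 1 and 6, or fail for anything else.
toc_level_flag() {
  case "$1" in
    [1-6]) printf '%s\n' "--toc-level=$1" ;;
    *) echo "toc level must be a number between 1 and 6" >&2; return 1 ;;
  esac
}

# Usage (assumes percollate is installed):
#   percollate pdf "$(toc_level_flag 3)" https://example.com/page1 https://example.com/page2
```

Validating the value up front gives a clear error message before any page is fetched.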

`--hyphenate`

Hyphenation is enabled by default for pdf, and disabled for epub, html, and md. You can opt into hyphenation with the --hyphenate flag, or disable it with the --no-hyphenate flag.

See also the Hyphenation and justification recipe.

`--inline`

Embed images inline with the document. Images are fetched and converted to Base64-encoded data URLs.

This option is particularly useful for html to produce self-contained HTML files.
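One rough way to check that an HTML file is self-contained is to count the image tags that still point at a remote URL. The sketch below is an assumption-laden heuristic, not a percollate feature: the helper name `count_remote_images` is hypothetical, and it only detects double-quoted `src` attributes that sit on the same line as their `<img` tag.

```shell
# Hypothetical helper: count <img> tags whose src still points at an
# http(s) URL. A result of 0 suggests the images were inlined as data URLs.
# Only double-quoted src attributes on a single line are detected.
count_remote_images() {
  grep -o -E '<img[^>]*src="https?://' "$1" | wc -l | tr -d ' '
}

# Usage (assumes percollate is installed):
#   percollate html --inline --output=article.html https://example.com
#   count_remote_images article.html
```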

`--md.<option>=<value>`

Pass options to the underlying Markdown stringifier, mdast-util-to-markdown. These are the default Markdown options:

const DEFAULT_MARKDOWN_OPTIONS = {
	fences: true,
	emphasis: '_',
	strong: '_',
	resourceLink: true,
	rule: '-'
};
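For instance, the defaults above use `_` for both emphasis and strong text. A minimal sketch of overriding them with `*` might look like the following; the wrapper name `markdown_with_asterisks` and the `PERCOLLATE_BIN` variable are illustrative assumptions (the latter only exists to allow a dry run), while the `--md.<option>=<value>` syntax is the one documented above.

```shell
# Hypothetical wrapper: produce Markdown that uses "*" instead of "_"
# for emphasis and strong text.
# PERCOLLATE_BIN is an illustrative override used for dry runs.
markdown_with_asterisks() (
  url=$1
  out=$2
  "${PERCOLLATE_BIN:-percollate}" md '--md.emphasis=*' '--md.strong=*' \
    --output="$out" "$url"
)

# Usage (assumes percollate is installed):
#   markdown_with_asterisks https://example.com example.md
```

The values are single-quoted so that the shell does not try to expand `*` as a filename pattern.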

`--unsafe`

Disables some JSDOM validations that may throw an error when parsing invalid HTML pages (See #177).

Recipes

Basic bundling

To turn a single web page into a PDF:

percollate pdf --output=some.pdf https://example.com

To bundle several web pages into a single PDF, specify them as separate arguments to the command:

percollate pdf --output=some.pdf https://example.com/page1 https://example.com/page2

You can use common Unix commands and keep the list of URLs in a newline-delimited text file:

cat urls.txt | xargs percollate pdf --output=some.pdf
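If the list file also contains blank lines or comments, it can be filtered before it reaches xargs. The following is a minimal sketch under that assumption; the helper name `bundle_from_list`, the `#` comment convention, and the `PERCOLLATE_BIN` override are all hypothetical and not part of percollate.

```shell
# Hypothetical helper: bundle the URLs listed in a text file into one PDF,
# skipping blank lines and lines starting with "#".
# PERCOLLATE_BIN is an illustrative override used for dry runs.
bundle_from_list() (
  list=$1
  out=$2
  grep -v -E '^[[:space:]]*(#|$)' "$list" |
    xargs "${PERCOLLATE_BIN:-percollate}" pdf --output="$out"
)

# Usage (assumes percollate is installed):
#   bundle_from_list urls.txt some.pdf
```

Setting `PERCOLLATE_BIN=echo` prints the command's arguments instead of running percollate, which is a cheap way to preview what would be bundled.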

To transform several web pages into individual PDF files at once, use the --individual flag:

percollate pdf --individual https://example.com/page1 https://example.com/page2

If you'd like to fetch the HTML with an external command, you can use - as an operand, which stands for stdin (the standard input):

curl https://example.com/page1 | percollate pdf --url=https://example.com/page1 -

Notice we're using the url option to tell percollate the source of our (now-anonymous) HTML it gets on stdin, so that relative URLs on links and images resolve correctly.
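The same pattern can be wrapped in a small function so that the URL is written only once. This is a sketch rather than a recommended setup: the function name `pdf_via_stdin` and the `FETCH_BIN` / `PERCOLLATE_BIN` overrides are hypothetical, and the fetch command is the place where one might add headers or cookies if the page needs them.

```shell
# Hypothetical helper: fetch a page with an external command and hand the
# HTML to percollate on stdin, passing the original URL along via --url.
# FETCH_BIN and PERCOLLATE_BIN are illustrative overrides for dry runs.
pdf_via_stdin() (
  url=$1
  out=$2
  "${FETCH_BIN:-curl}" -sL "$url" |
    "${PERCOLLATE_BIN:-percollate}" pdf --url="$url" --output="$out" -
)

# Usage (assumes curl and percollate are installed):
#   pdf_via_stdin https://example.com/page1 page1.pdf
```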

The `--css` option

The --css option lets you pass a small snippet of CSS to percollate. Here are some common use-cases:

Custom page size / margins

The default page size is A5 (portrait). You can use the --css option to override it using any supported CSS size:

percollate pdf --css "@page { size: A3 landscape }" http://example.com

Similarly, you can define:

custom margins, e.g. @page { margin: 0 }
the base font size: html { font-size: 10pt }
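Several such overrides can be combined into a single `--css` value. The helper below is a hypothetical convenience (`join_css` is not part of percollate) that simply joins the snippets with spaces, assuming each snippet is a complete CSS rule.

```shell
# Hypothetical helper: join several CSS snippets into one string that can
# be passed as a single --css value.
join_css() {
  printf '%s\n' "$*"
}

# Usage (assumes percollate is installed):
#   percollate pdf \
#     --css "$(join_css '@page { margin: 0 }' 'html { font-size: 10pt }')" \
#     http://example.com
```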

Changing the font stacks

The default stylesheet includes CSS variables for the fonts used in the PDF:

:root {
	--main-font: Palatino, 'Palatino Linotype', 'Times New Roman',
		'Droid Serif', Times, 'Source Serif Pro', serif, 'Apple Color Emoji',
		'Segoe UI Emoji', 'Segoe UI Symbol';
	--alt-font: 'helvetica neue', ubuntu, roboto, noto, 'segoe ui', arial,
		sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol';
	--code-font: Menlo, Consolas, monospace;
}

CSS variable	What it does
`--main-font`	The font stack used for body text
`--alt-font`	Used in headings, captions, et cetera
`--code-font`	Used for code snippets

To override them, use the --css option:

percollate pdf --css ":root { --main-font: 'PT Serif';  --alt-font: Roboto; }" http://example.com

💡 To work correctly, you must have the fonts installed on your machine. Custom web fonts currently require you to use a custom CSS stylesheet / HTML template.

Remove the appended `href`s from hyperlinks

The idea with percollate is to make PDFs that can be printed without losing where the hyperlinks point to. However, for some link-heavy pages, the appended hrefs can become bothersome. You can remove them using:

percollate pdf --css "a:after { display: none }" http://example.com

Hyphenation and justification

Hyphenation is only enabled by default for PDFs, but you can opt in or out of it for any output format with a flag.

When hyphenation is enabled, paragraphs will be justified:

.article__content p {
	text-align: justify;
}

If you prefer left-aligned text:

percollate pdf --css ".article__content p { text-align: left }" http://example.com

The `--style` option

The --style option lets you use your own CSS stylesheet instead of the default one. Here are some common use-cases for this option:

⚠️ TODO add examples here

The `--template` option

The --template option lets you use a custom HTML template for the PDF.

💡 The HTML template is parsed with nunjucks, which is a close JavaScript relative of Twig for PHP, Jinja2 for Python and Liquid for Ruby.

Here are some common use-cases:

Customizing the page header / footer

Puppeteer can print some basic information about the page in the PDF. The following CSS class names are available for the header / footer, into which the appropriate content will be injected:

date: the formatted print date
title: the document title
url: the document location (note: this will print the path of the temporary HTML file, not the original web page URL)
pageNumber: the current page number
totalPages: the total number of pages in the document

👉 See the Chromium source code for details.

You place your header / footer template in a template element in your HTML:

<template class="header-template"> My header </template>

<template class="footer-template">
	<div class="text center">
		<span class="pageNumber"></span>
	</div>
</template>

See the default HTML for example usage.

You can add CSS styles to the header / footer with either the --css option or a separate CSS stylesheet (the --style option).

💡 The header / footer templates do not inherit their styles from the rest of the page (i.e. they are not part of the cascade), so you'll have to write the full CSS you want to apply to them.

An example from the default stylesheet:

.footer-template {
	font-size: 10pt;
	font-weight: bold;
}

Updating

To keep the tool up-to-date, you can run:

npm install -g percollate

Occasionally, an upgrade might not go according to plan; in this case, you can uninstall and re-install percollate:

npm uninstall -g percollate && npm install -g percollate

How it works

All export formats follow a common pipeline:

Fetch the page(s) using node-fetch
If an AMP version of the page exists, use that instead (disable with --no-amp flag)
Enhance the DOM using jsdom
Pass the DOM through mozilla/readability to strip unnecessary elements
Apply the HTML template and the stylesheet to the resulting HTML

Different formats then use different tools to produce the final file.

PDFs are rendered with puppeteer.

EPUBs have external images fetched and bundled together with the HTML of each article. When the --inline option is used, images are instead converted to data URLs and embedded into the HTML.

HTMLs are saved without any further changes. When the --inline option is used, images are converted to data URLs and embedded into the HTML. External images are not otherwise fetched.

Markdown files are produced the same way as HTMLs, then processed with a series of utilities from the unified.js umbrella.

Limitations

Percollate inherits the limitations of two of its main components, Readability and Puppeteer (headless Chrome).

The imperative approach Readability takes will not be perfect in each case, especially on HTML pages with atypical markup; you may occasionally notice that it either leaves in superfluous content, or that it strips out parts of the content. You can confirm the problem against Firefox's Reader View. In this case, consider filing an issue on mozilla/readability.

Using a browser to generate the PDF is a double-edged sword. On the one hand, you get excellent support for web platform features. On the other hand, print CSS as defined by W3C specifications is only partially implemented, and it seems unlikely that support will be improved any time soon. However, even with modest print support, I think Chrome is the best (free) tool for the job.

Troubleshooting

On some Linux machines you'll need to install a few more Chrome dependencies before percollate works correctly. (Thanks to @ptica for sorting it out)

The percollate pdf command supports the --no-sandbox Puppeteer flag, but make sure you're aware of the implications before disabling the sandbox.

Using Firefox to render PDFs

This feature is experimental. Please log an issue if you notice anything wrong.

Starting with percollate 3.x, it's possible to use Firefox Nightly as an alternative browser with which to render PDFs. To make Firefox available to Percollate, use the following install command:

PUPPETEER_PRODUCT=firefox npm install percollate

After installation, percollate pdf commands can be run with the --browser=firefox option.

Limitations of Firefox PDF rendering

At the moment, rendering PDFs with Firefox has the following limitations:

The pages can't have headers and footers, so there are no page numbers.

Contributing

Contributions of all kinds are welcome! See CONTRIBUTING.md for details.
