html-pipeline

HTML processing filters and utilities

2,285

381

Top Related Projects

html-pipeline

2,280

HTML processing filters and utilities

redcarpet

5,060

The safe Markdown parser, reloaded.

markup

5,955

Determines which markup library to use to render a content file (e.g. README) on GitHub

nokogiri

6,208

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.

commonmark.js

1,519

CommonMark parser and renderer in JavaScript

Quick Overview

HTML Pipeline is a Ruby library that provides a framework for processing HTML content through a series of filters or "pipelines." It allows developers to transform, sanitize, and enhance HTML content using a modular and extensible approach. The library is particularly useful for processing user-generated content or implementing custom Markdown-like syntaxes.

Pros

Modular and extensible design, allowing easy addition of custom filters
Comprehensive set of built-in filters for common tasks (e.g., syntax highlighting, auto-linking)
Well-documented and actively maintained
Integrates well with Ruby on Rails and other Ruby web frameworks

Cons

Limited to Ruby ecosystem, not available for other programming languages
May have a learning curve for developers unfamiliar with the pipeline concept
Performance can be impacted when processing large amounts of content through multiple filters
Some advanced customizations may require diving into the source code

Code Examples

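Basic usage with a two-filter pipeline (the examples in this overview use the html-pipeline 2.x API, HTML::Pipeline; the README further down documents the 3.x API, HTMLPipeline):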

require 'html/pipeline'

pipeline = HTML::Pipeline.new([
  HTML::Pipeline::MarkdownFilter,
  HTML::Pipeline::SanitizationFilter
])

result = pipeline.call("# Hello, world!\n\nThis is **bold** text.")
puts result[:output].to_s

Custom filter implementation:

class MyCustomFilter < HTML::Pipeline::Filter
  def call
    doc.search('p').each do |node|
      node['class'] = 'custom-paragraph'
    end
    doc
  end
end

pipeline = HTML::Pipeline.new([
  HTML::Pipeline::MarkdownFilter,
  MyCustomFilter
])

result = pipeline.call("This is a paragraph.")
puts result[:output].to_s

Using context to pass data to filters:

pipeline = HTML::Pipeline.new([
  HTML::Pipeline::MentionFilter
], { base_url: 'https://example.com' })

result = pipeline.call("Hello @username!")
puts result[:output].to_s

Getting Started

To use HTML Pipeline in your Ruby project:

Add the gem to your Gemfile:
```
gem 'html-pipeline'
```
Run bundle install

In your Ruby code:

require 'html/pipeline'

pipeline = HTML::Pipeline.new([
  HTML::Pipeline::MarkdownFilter,
  HTML::Pipeline::SanitizationFilter
])

result = pipeline.call("# Hello, world!")
processed_html = result[:output].to_s

This sets up a basic pipeline that processes Markdown and sanitizes the resulting HTML. You can customize the pipeline by adding or removing filters as needed.

Competitor Comparisons

html-pipeline

2,280

HTML processing filters and utilities

Pros of html-pipeline

More established project with a longer history and larger community
Wider range of built-in filters and extensions
Better documentation and examples

Cons of html-pipeline

Potentially more complex setup for simple use cases
May include unnecessary features for some projects
Slightly larger footprint in terms of dependencies

Code Comparison

html-pipeline:

require 'html/pipeline'

pipeline = HTML::Pipeline.new([
  HTML::Pipeline::MarkdownFilter,
  HTML::Pipeline::SanitizationFilter
])
result = pipeline.call(content)

html-pipeline:

require 'html/pipeline'

pipeline = HTML::Pipeline.new([
  HTML::Pipeline::MarkdownFilter,
  HTML::Pipeline::SanitizationFilter
])
result = pipeline.call(content)

In this case, the code comparison shows no significant differences between the two repositories. Both use the same basic structure for creating and using an HTML pipeline. The main differences would likely be in the available filters and configuration options, which are not apparent in this basic example.

redcarpet

5,060

The safe Markdown parser, reloaded.

Pros of Redcarpet

Faster performance for Markdown parsing
More lightweight and focused solely on Markdown processing
Supports a wider range of Markdown extensions and customizations

Cons of Redcarpet

Limited to Markdown processing, lacking the versatility of HTML Pipeline
Requires more manual setup for additional text processing tasks
Less actively maintained, with fewer recent updates

Code Comparison

Redcarpet:

markdown = Redcarpet::Markdown.new(Redcarpet::Render::HTML, autolink: true, tables: true)
html_output = markdown.render("# Hello, world!")

HTML Pipeline:

pipeline = HTML::Pipeline.new([
  HTML::Pipeline::MarkdownFilter,
  HTML::Pipeline::SanitizationFilter
])
result = pipeline.call("# Hello, world!")
html_output = result[:output].to_s

Summary

Redcarpet is a focused Markdown parser with excellent performance and extensive Markdown support. HTML Pipeline, on the other hand, offers a more versatile approach to text processing, allowing for multiple filters and transformations beyond just Markdown. While Redcarpet excels in pure Markdown scenarios, HTML Pipeline provides greater flexibility for complex text processing workflows.

markup

5,955

Determines which markup library to use to render a content file (e.g. README) on GitHub

Pros of markup

Supports a wider range of markup languages (e.g., Markdown, reStructuredText, Textile)
Designed specifically for GitHub's needs, potentially better integration with GitHub services
Lightweight and focused on rendering various markup formats to HTML

Cons of markup

Less flexible for customization and extension compared to html-pipeline
Fewer built-in features for advanced content processing (e.g., syntax highlighting, mention linking)
May require additional setup for complex content transformation workflows

Code Comparison

markup:

GitHub::Markup.render('README.md', File.read('README.md'))

html-pipeline:

pipeline = HTML::Pipeline.new([
  HTML::Pipeline::MarkdownFilter,
  HTML::Pipeline::SyntaxHighlightFilter
])
result = pipeline.call(text)

Summary

markup is a straightforward tool for rendering various markup formats to HTML, ideal for simple GitHub-centric use cases. html-pipeline offers a more flexible and extensible approach to content processing, allowing for complex transformation pipelines but with a steeper learning curve. Choose markup for basic rendering needs, and html-pipeline for more advanced content processing requirements.

nokogiri

6,208

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.

Pros of Nokogiri

More comprehensive XML/HTML parsing and manipulation capabilities
Faster performance for large documents due to native C extensions
Wider adoption and larger community support

Cons of Nokogiri

Steeper learning curve due to more complex API
Heavier dependency with native extensions, which can complicate installation
Less focused on specific HTML processing tasks compared to HTML Pipeline

Code Comparison

Nokogiri:

require 'nokogiri'
doc = Nokogiri::HTML('<h1>Hello, World!</h1>')
doc.at_css('h1').content = 'Modified Heading'
puts doc.to_html

HTML Pipeline:

require 'html/pipeline'
pipeline = HTML::Pipeline.new([HTML::Pipeline::SanitizationFilter])
result = pipeline.call('<h1>Hello, World!</h1>')
puts result[:output].to_s

Summary

Nokogiri is a powerful and versatile XML/HTML parsing library with excellent performance, while HTML Pipeline focuses on processing HTML through a series of filters. Nokogiri offers more comprehensive manipulation capabilities but comes with a steeper learning curve. HTML Pipeline provides a simpler, more targeted approach to HTML processing tasks, making it easier to use for specific workflows but potentially less flexible for complex parsing needs.

commonmark.js

1,519

CommonMark parser and renderer in JavaScript

Pros of commonmark.js

Focused specifically on CommonMark parsing and rendering
Lightweight and fast performance
Extensive test suite ensuring compliance with CommonMark spec

Cons of commonmark.js

Limited to CommonMark functionality only
Lacks built-in security features for sanitizing HTML output
Requires additional libraries for extended Markdown features

Code Comparison

html-pipeline:

pipeline = HTML::Pipeline.new [
  HTML::Pipeline::MarkdownFilter,
  HTML::Pipeline::SanitizationFilter
]
result = pipeline.call(content)

commonmark.js:

var reader = new commonmark.Parser();
var writer = new commonmark.HtmlRenderer();
var parsed = reader.parse(content);
var result = writer.render(parsed);

Key Differences

html-pipeline is a Ruby-based framework for processing content through multiple filters, including Markdown parsing and HTML sanitization. It offers a flexible pipeline approach for content processing.

commonmark.js is a JavaScript library specifically designed for parsing and rendering CommonMark-compliant Markdown. It focuses on strict adherence to the CommonMark specification.

While html-pipeline provides a more comprehensive content processing solution, commonmark.js excels in lightweight and efficient CommonMark parsing and rendering.

octokit.rb

3,905

Ruby toolkit for the GitHub API

Pros of Octokit.rb

Comprehensive GitHub API wrapper for Ruby
Actively maintained with frequent updates
Extensive documentation and examples

Cons of Octokit.rb

Larger codebase and dependencies
Steeper learning curve for beginners
Focused solely on GitHub API interactions

Code Comparison

HTML-Pipeline:

pipeline = HTML::Pipeline.new [
  HTML::Pipeline::MarkdownFilter,
  HTML::Pipeline::SanitizationFilter
]
result = pipeline.call(content)

Octokit.rb:

client = Octokit::Client.new(access_token: 'your_token')
repo = client.repo('octokit/octokit.rb')
issues = client.issues(repo.full_name)

Summary

HTML-Pipeline is a tool for processing and transforming HTML content, while Octokit.rb is a Ruby toolkit for interacting with the GitHub API. HTML-Pipeline is more focused on content manipulation, whereas Octokit.rb provides comprehensive access to GitHub's features and data. The choice between the two depends on the specific requirements of your project: HTML processing vs. GitHub API integration.

README

HTML-Pipeline

HTML processing filters and utilities. This module is a small framework for defining CSS-based content filters and applying them to user provided content.

Although this project was started at GitHub, they no longer use it. This gem must be considered standalone and independent from GitHub.

Installation

Add this line to your application's Gemfile:

gem 'html-pipeline'

And then execute:

$ bundle

Or install it by yourself as:

$ gem install html-pipeline

Usage

This library provides a handful of chainable HTML filters to transform user content into HTML markup. Each filter does some work, and then hands off the results to the next filter. A pipeline has several kinds of filters available to use:

Multiple TextFilters, which operate on a UTF-8 string
A ConvertFilter, which turns text into HTML (e.g., CommonMark/AsciiDoc -> HTML)
A SanitizationFilter, which removes dangerous/unwanted HTML elements and attributes
Multiple NodeFilters, which operate on a UTF-8 HTML document

You can assemble each sequence into a single pipeline, or choose to call each filter individually.

As an example, suppose we want to transform this CommonMark source text into HTML:

Hey there, @gjtorikian

With the content, we also want to:

change every instance of Hey to Hello
strip undesired HTML
linkify @mention

We can construct a pipeline to do all that like this:

require 'html_pipeline'

# Custom text filter: receives the text and returns the transformed text.
class HelloJohnnyFilter < HTMLPipeline::TextFilter
  def call(text, context: {}, result: {})
    text.gsub("Hey", "Hello")
  end
end

pipeline = HTMLPipeline.new(
  text_filters: [HelloJohnnyFilter.new],
  convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  # note: next line is not needed as sanitization occurs by default;
  # see below for more info
  sanitization_config: HTMLPipeline::SanitizationFilter::DEFAULT_CONFIG,
  node_filters: [HTMLPipeline::NodeFilter::MentionFilter.new]
)
pipeline.call(user_supplied_text) # recommended: can call pipeline over and over

Filters can be custom ones you create (like HelloJohnnyFilter), and HTMLPipeline additionally provides several helpful ones (detailed below). If you only need a single filter, you can call one individually, too:

filter = HTMLPipeline::ConvertFilter::MarkdownFilter.new
filter.call(text)

Filters combine into a sequential pipeline, and each filter hands its output to the next filter's input. Text filters are processed first, then the convert filter, sanitization filter, and finally, the node filters.
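
The stage ordering described above can be illustrated with a toy model in plain Ruby. This is only a sketch of the hand-off between stages: it does not use the html_pipeline gem, and `ToyPipeline` and the three lambdas are hypothetical stand-ins rather than the library's classes.

```ruby
# Toy model of the stage ordering only; every name here is hypothetical.
class ToyPipeline
  def initialize(text_filters: [], convert_filter: nil, sanitizer: nil, node_filters: [])
    @text_filters = text_filters
    @convert_filter = convert_filter
    @sanitizer = sanitizer
    @node_filters = node_filters
  end

  # Runs the stages in order, feeding each stage's output to the next one.
  def call(text)
    stages = [*@text_filters, @convert_filter, @sanitizer, *@node_filters].compact
    stages.reduce(text) { |doc, stage| stage.call(doc) }
  end
end

hello   = ->(text) { text.gsub("Hey", "Hello") }                     # text filter
convert = ->(text) { "<p>#{text}</p>" }                              # stand-in for Markdown conversion
mention = ->(html) { html.gsub(/@(\w+)/, '<a href="/\1">@\1</a>') }  # stand-in for a node filter

pipeline = ToyPipeline.new(text_filters: [hello], convert_filter: convert, node_filters: [mention])
puts pipeline.call("Hey there, @gjtorikian")
# prints: <p>Hello there, <a href="/gjtorikian">@gjtorikian</a></p>
```

Because the text filter runs before conversion, it sees the raw source text, while the mention stage sees HTML.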

Some filters take optional context and/or result hash(es). These are used to pass around arguments and metadata between filters in a pipeline. For example, if you want to disable footnotes in the MarkdownFilter, you can pass an option in the context hash:

context = { markdown: { extensions: { footnotes: false } } }
filter = HTMLPipeline::ConvertFilter::MarkdownFilter.new(context: context)
filter.call("Hi **world**!")

Alternatively, you can construct a pipeline, and pass in a context during the call:

pipeline = HTMLPipeline.new(
  convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  node_filters: [HTMLPipeline::NodeFilter::MentionFilter.new]
)
pipeline.call(user_supplied_text, context: { markdown: { extensions: { footnotes: false } } })

Please refer to the documentation for each filter to understand what configuration options are available.
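
When writing your own filter, options nested in a context hash like the one above can be read with `Hash#dig`. The helper below is a hypothetical sketch (it is not part of the library); it shows one way to fall back to a default without treating an explicit `false` as "missing".

```ruby
# Hypothetical helper, not part of html_pipeline: reads a nested option from
# a context hash and falls back to a default only when the key is absent.
def context_option(context, *path, default:)
  value = context.dig(*path)
  value.nil? ? default : value
end

context = { markdown: { extensions: { footnotes: false } } }

puts context_option(context, :markdown, :extensions, :footnotes, default: true)
# prints: false
puts context_option({}, :markdown, :extensions, :footnotes, default: true)
# prints: true
```

Using `value.nil?` rather than `value || default` matters here, since `false` is a meaningful setting.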

More Examples

Different pipelines can be defined for different parts of an app. Here are a few paraphrased snippets to get you started:

# The context hash is how you pass options between different filters.
# See individual filter source for explanation of options.
context = {
  asset_root: "http://your-domain.com/where/your/images/live/icons",
  base_url: "http://your-domain.com"
}

# Pipeline used for user provided content on the web
MarkdownPipeline = HTMLPipeline.new(
  text_filters: [HTMLPipeline::TextFilter::ImageFilter.new],
  convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  node_filters: [
    HTMLPipeline::NodeFilter::HttpsFilter.new,
    HTMLPipeline::NodeFilter::MentionFilter.new
  ],
  context: context
)

# Pipelines aren't limited to the web. You can use them for email
# processing also.
HtmlEmailPipeline = HTMLPipeline.new(
  text_filters: [
    HTMLPipeline::TextFilter::PlainTextInputFilter.new,
    HTMLPipeline::TextFilter::ImageFilter.new
  ]
)

Filters

TextFilters

TextFilters must define a method named call which is called on the text. @text, @config, and @result are available to use, and any changes made to these ivars are passed on to the next filter.

ImageFilter - converts image url into <img> tag
PlainTextInputFilter - html escape text and wrap the result in a <div>
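
To give a feel for what a text-level transformation such as image linking involves, here is a plain-Ruby sketch. It is not the library's ImageFilter implementation; the `IMAGE_URL` pattern and `linkify_images` method are assumptions made for illustration.

```ruby
# Hypothetical pattern: an http(s) URL ending in a common image extension.
IMAGE_URL = %r{https?://\S+\.(?:png|jpe?g|gif|bmp)}i

# Replaces each bare image URL in the text with an <img> tag.
def linkify_images(text)
  text.gsub(IMAGE_URL) { |url| %(<img src="#{url}" alt="">) }
end

puts linkify_images("Logo: https://example.com/logo.png")
# prints: Logo: <img src="https://example.com/logo.png" alt="">
```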

ConvertFilter

The ConvertFilter takes text and turns it into HTML. @text, @config, and @result are available to use. A ConvertFilter must define a method named call, taking one argument, text. call must return a string representing the new HTML document.

MarkdownFilter - creates HTML from text using Commonmarker

Sanitization

Because the web can be a scary place, HTML is automatically sanitized after the ConvertFilter runs and before the NodeFilters are processed. This is to prevent malicious or unexpected input from entering the pipeline.

The sanitization process takes a configuration hash. See the Selma documentation for more information on how to configure these settings. Note that the sanitization config must be set up correctly for it to work as expected alongside handlers that manipulate HTML.

A default sanitization config is provided by this library (HTMLPipeline::SanitizationFilter::DEFAULT_CONFIG). A sample custom sanitization allowlist might look like this:

ALLOWLIST = {
  elements: ["p", "pre", "code"]
}

pipeline = HTMLPipeline.new \
  text_filters: [
    HTMLPipeline::TextFilter::ImageFilter.new,
  ],
  convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  sanitization_config: ALLOWLIST

result = pipeline.call <<-CODE
This is *great*:

    some_code(:first)

CODE
result[:output].to_s

This would print:

<p>This is great:</p>
<pre><code>some_code(:first)
</code></pre>

Sanitization can be disabled if and only if nil is explicitly passed as the config:

pipeline = HTMLPipeline.new \
  text_filters: [
    HTMLPipeline::TextFilter::ImageFilter.new,
  ],
  convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  sanitization_config: nil

For more examples of customizing the sanitization process to include the tags you want, check out the tests and the FAQ.

NodeFilters

NodeFilters can operate either on HTML elements or text nodes using CSS selectors. Each NodeFilter must define a method named selector which provides an instance of Selma::Selector. If elements are being manipulated, handle_element must be defined, taking one argument, element; if text nodes are being manipulated, handle_text_chunk must be defined, taking one argument, text_chunk. @config and @result are available to use, and any changes made to these ivars are passed on to the next filter.

NodeFilter also has an optional method, after_initialize, which is run after the filter initializes. This can be useful in setting up a fresh custom state for result to start from each time the pipeline is called.

Here's an example NodeFilter that adds a base url to images that are root relative:

require 'uri'

class RootRelativeFilter < HTMLPipeline::NodeFilter

  SELECTOR = Selma::Selector.new(match_element: "img")

  def selector
    SELECTOR
  end

  def handle_element(img)
    return if img['src'].nil?
    src = img['src'].strip
    if src.start_with? '/'
      img["src"] = URI.join(context[:base_url], src).to_s
    end
  end
end
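
The join step in the filter above relies on the standard library's `URI.join`. The sketch below isolates that step so its behavior is easy to check; the `absolutize` helper is a hypothetical name introduced for illustration.

```ruby
require "uri"

# Hypothetical helper mirroring the join step in the filter above.
def absolutize(base_url, src)
  URI.join(base_url, src).to_s
end

puts absolutize("http://your-domain.com", "/images/logo.png")
# prints: http://your-domain.com/images/logo.png

# A root-relative path replaces any path on the base URL.
puts absolutize("http://your-domain.com/app/", "/images/logo.png")
# prints: http://your-domain.com/images/logo.png

# An absolute URL is returned unchanged.
puts absolutize("http://your-domain.com", "https://cdn.example.com/a.png")
# prints: https://cdn.example.com/a.png
```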

For more information on how to write effective NodeFilters, refer to the provided filters and to the documentation of the underlying library, Selma.

AbsoluteSourceFilter: replace relative image urls with fully qualified versions
AssetProxyFilter: replace image links with an encoded link to an asset server
EmojiFilter: converts :<emoji>: to emoji
- (Note: the included MarkdownFilter will already convert emoji)
HttpsFilter: replaces http urls with https versions
ImageMaxWidthFilter: link to full size image for large images
MentionFilter: replace @user mentions with links
SanitizationFilter: allowlist-based sanitization of user markup
SyntaxHighlightFilter: applies syntax highlighting to pre blocks
- (Note: the included MarkdownFilter will already apply highlighting)
TableOfContentsFilter: anchor headings with name attributes and generate Table of Contents html unordered list linking headings
TeamMentionFilter: replace @org/team mentions with links

Dependencies

Since filters can be customized to your heart's content, gem dependencies are not bundled; this project doesn't know which of the default filters you might use, and as such, you must bundle each filter's gem dependencies yourself.

For example, SyntaxHighlightFilter uses rouge to detect and highlight languages; to use the SyntaxHighlightFilter, you must add the following to your Gemfile:

gem "rouge"

Note: See the Gemfile's :test group for any version requirements.

When developing a custom filter, call HTMLPipeline.require_dependency at the start to ensure that the local machine has the necessary dependency. You can also use HTMLPipeline.require_dependencies to provide a list of dependencies to check.
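
The general idea behind such a dependency check can be sketched in plain Ruby. The method below is a hypothetical stand-in, not the library's implementation of `HTMLPipeline.require_dependency`; its name, arguments, and error message are assumptions.

```ruby
# Hypothetical stand-in for a dependency check: tries to load a library and
# raises a LoadError with an actionable message if it is not installed.
def require_optional_dependency(name, requirer)
  require name
  true
rescue LoadError => e
  raise LoadError,
        "Missing dependency '#{name}' for #{requirer}. " \
        "Add `gem \"#{name}\"` to your Gemfile. (#{e.message})"
end

# "json" ships with Ruby, so this succeeds; "DemoFilter" is a made-up filter name.
require_optional_dependency("json", "DemoFilter")
```

Failing at load time with the gem name in the message is usually easier to debug than a NameError raised later, in the middle of a pipeline call.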

On a similar note, you must manually require whichever filters you desire:

require "html_pipeline" # must be included
require "html_pipeline/convert_filter/markdown_filter" # included because you want to use this filter
require "html_pipeline/node_filter/mention_filter" # included because you want to use this filter

Documentation

Full reference documentation can be found here.

Instrumenting

Filters and Pipelines can be set up to be instrumented when called. The pipeline must be set up with an ActiveSupport::Notifications compatible service object and a name. New pipeline objects will default to the HTMLPipeline.default_instrumentation_service object.

# the AS::Notifications-compatible service object
service = ActiveSupport::Notifications

# instrument a specific pipeline
pipeline = HTMLPipeline.new [MarkdownFilter], context
pipeline.setup_instrumentation "MarkdownPipeline", service

# or set default instrumentation service for all new pipelines
HTMLPipeline.default_instrumentation_service = service
pipeline = HTMLPipeline.new [MarkdownFilter], context
pipeline.setup_instrumentation "MarkdownPipeline"

Filters are instrumented when they are run through the pipeline. A call_filter.html_pipeline event is published once any filter finishes; call_text_filters and call_node_filters are published when all of the text and node filters are finished, respectively. The payload should include the filter name. Each filter will trigger its own instrumentation call.

service.subscribe "call_filter.html_pipeline" do |event, start, ending, transaction_id, payload|
  payload[:pipeline] #=> "MarkdownPipeline", set with `setup_instrumentation`
  payload[:filter] #=> "MarkdownFilter"
  payload[:context] #=> context Hash
  payload[:result] #=> instance of result class
  payload[:result][:output] #=> output HTML String
end

The full pipeline is also instrumented:

service.subscribe "call_text_filters.html_pipeline" do |event, start, ending, transaction_id, payload|
  payload[:pipeline] #=> "MarkdownPipeline", set with `setup_instrumentation`
  payload[:filters] #=> ["MarkdownFilter"]
  payload[:doc] #=> HTML String
  payload[:context] #=> context Hash
  payload[:result] #=> instance of result class
  payload[:result][:output] #=> output HTML String
end
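
If you are not using ActiveSupport, a small in-memory object can play the role of the service, for example in tests. The sketch below is hypothetical and assumes that the only method a pipeline needs from the service is `instrument(event_name, payload)` with a block; check the library source before relying on that.

```ruby
# Hypothetical in-memory service. Assumes the caller only uses
# `instrument(event_name, payload) { ... }`.
class RecordingInstrumentationService
  attr_reader :events

  def initialize
    @events = []
  end

  # Times the block, records the event, and returns the block's value.
  def instrument(name, payload = {})
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield payload
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    @events << { name: name, payload: payload, seconds: elapsed }
    result
  end
end

service = RecordingInstrumentationService.new
output = service.instrument("call_filter.html_pipeline", { filter: "MarkdownFilter" }) do |_payload|
  "<p>Hi</p>" # stand-in for the work a filter would do
end

puts output                     # prints: <p>Hi</p>
puts service.events.first[:name] # prints: call_filter.html_pipeline
```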

FAQ

1. Why doesn't my pipeline work when there's no root element in the document?

To make a pipeline work on a plain text document, put the PlainTextInputFilter at the end of your text_filters config. This will wrap the content in a div so the filters have a root element to work with. If you're passing in an HTML fragment, but it doesn't have a root element, you can wrap the content in a div yourself.
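
As a rough illustration of the wrapping step for plain text, the sketch below escapes the text and adds a root div using only the standard library. The `wrap_plain_text` helper is hypothetical and is not the library's PlainTextInputFilter.

```ruby
require "erb"

# Hypothetical helper: HTML-escapes plain text and wraps it in a root <div>.
def wrap_plain_text(text)
  "<div>#{ERB::Util.html_escape(text)}</div>"
end

puts wrap_plain_text("1 < 2 & 3 > 2")
# prints: <div>1 &lt; 2 &amp; 3 &gt; 2</div>
```

Escaping applies to plain text only; an HTML fragment that is already markup should be wrapped without escaping.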

2. How do I customize an allowlist for `SanitizationFilter`s?

HTMLPipeline::SanitizationFilter::DEFAULT_CONFIG is the default allowlist used if no sanitization_config argument is given. The default is a good starting template for you to add additional elements. You can either modify the constant's value, or define your own config and pass that in, such as:

config = HTMLPipeline::SanitizationFilter::DEFAULT_CONFIG.deep_dup
config[:elements] << "iframe" # sure, whatever you want

Contributors

Thanks to all of these contributors.

This project is a member of the OSS Manifesto.
