Top Related Projects
HTML processing filters and utilities
Preflight for HTML email
Quick Overview
Nokogiri is a powerful HTML, XML, SAX, and Reader parser for Ruby. It provides a robust and efficient way to parse, search, and manipulate XML and HTML documents, making it an essential tool for web scraping, data extraction, and document processing tasks in Ruby applications.
Pros
- Fast and memory-efficient parsing of large documents
- Supports multiple parsing methods (DOM, SAX, Reader)
- Extensive CSS3 selector and XPath support for easy document traversal
- Cross-platform compatibility and native extensions for improved performance
Cons
- Installation can be complex on some systems due to native extensions
- Learning curve for advanced features and optimal usage
- Occasional version compatibility issues with Ruby and system libraries
- Limited support for writing and modifying XML/HTML documents compared to parsing
Code Examples
- Parsing an HTML document and extracting data:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('https://example.com'))
title = doc.at_css('title').text
paragraphs = doc.css('p').map(&:text)
- Searching for elements using CSS selectors:
require 'nokogiri'
xml = '<root><item id="1"><name>Item 1</name></item><item id="2"><name>Item 2</name></item></root>'
doc = Nokogiri::XML(xml)
names = doc.css('item name').map(&:text)
item_with_id_2 = doc.at_css('item[id="2"]')
- Building a new XML document with the Builder DSL:
require 'nokogiri'
builder = Nokogiri::XML::Builder.new do |xml|
  xml.root do
    xml.item(id: 1) do
      xml.name "New Item"
    end
  end
end
modified_xml = builder.to_xml
Getting Started
To use Nokogiri in your Ruby project, follow these steps:
- Add Nokogiri to your Gemfile:
gem 'nokogiri'
- Install the gem:
bundle install
- Require Nokogiri in your Ruby file:
require 'nokogiri'
- Start parsing HTML or XML:
doc = Nokogiri::HTML('<html><body><h1>Hello, Nokogiri!</h1></body></html>')
puts doc.at_css('h1').text
Competitor Comparisons
HTML processing filters and utilities
Pros of html-pipeline
- Focused on HTML processing and transformation
- Modular design with customizable filters
- Easier to use for specific HTML-related tasks
Cons of html-pipeline
- Less versatile than Nokogiri for general XML/HTML parsing
- Smaller community and fewer resources available
- More limited in scope and functionality
Code Comparison
Nokogiri example:
require 'nokogiri'
doc = Nokogiri::HTML('<h1>Hello, World!</h1>')
puts doc.at_css('h1').text
html-pipeline example:
require 'html/pipeline'
pipeline = HTML::Pipeline.new([HTML::Pipeline::MarkdownFilter])
result = pipeline.call('# Hello, World!')
puts result[:output].to_s
Key Differences
- Nokogiri is a comprehensive XML/HTML parser and manipulator
- html-pipeline focuses on processing and transforming HTML content
- Nokogiri offers more low-level control over parsing and manipulation
- html-pipeline provides a higher-level abstraction for common HTML tasks
Use Cases
- Choose Nokogiri for general-purpose XML/HTML parsing and manipulation
- Opt for html-pipeline when working specifically with HTML processing pipelines
- Nokogiri is better suited for complex parsing tasks and web scraping
- html-pipeline excels in scenarios involving Markdown conversion and HTML sanitization
Preflight for HTML email
Pros of Premailer
- Specialized for email HTML processing and CSS inlining
- Provides additional features like link rewriting and image URL fixing
- Easier to use for email-specific tasks
Cons of Premailer
- More limited in scope compared to Nokogiri's general-purpose XML/HTML parsing
- Smaller community and fewer contributors
- Less frequent updates and maintenance
Code Comparison
Premailer (CSS inlining):
require 'premailer'
premailer = Premailer.new(html, with_html_string: true)
inline_html = premailer.to_inline_css
Nokogiri (HTML parsing):
require 'nokogiri'
doc = Nokogiri::HTML(html)
doc.css('div.example').each do |div|
  # Manipulate HTML elements
end
Premailer focuses on email-specific tasks like CSS inlining, while Nokogiri provides more general-purpose HTML/XML parsing and manipulation capabilities. Premailer is ideal for email template processing, whereas Nokogiri is better suited for complex HTML/XML parsing and scraping tasks. Choose Premailer for email-specific needs and Nokogiri for broader HTML/XML processing requirements.
README
Nokogiri
Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby. It provides a sensible, easy-to-understand API for reading, writing, modifying, and querying documents. It is fast and standards-compliant by relying on native parsers like libxml2, libgumbo, and xerces.
Guiding Principles
Some guiding principles Nokogiri tries to follow:
- be secure-by-default by treating all documents as untrusted by default
- be a thin-as-reasonable layer on top of the underlying parsers, and don't attempt to fix behavioral differences between the parsers
Features Overview
- DOM Parser for XML, HTML4, and HTML5
- SAX Parser for XML and HTML4
- Push Parser for XML and HTML4
- Document search via XPath 1.0
- Document search via CSS3 selectors, with some jquery-like extensions
- XSD Schema validation
- XSLT transformation
- "Builder" DSL for XML and HTML documents
Support, Getting Help, and Reporting Issues
All official documentation is posted at https://nokogiri.org (the source for which is at https://github.com/sparklemotion/nokogiri.org/, and we welcome contributions).
Reading
Your first stops for learning more about Nokogiri should be:
- API Documentation
- Tutorials
- An excellent community-maintained Cheat Sheet
Ask For Help
There are a few ways to ask exploratory questions:
- The Nokogiri mailing list is active at https://groups.google.com/group/nokogiri-talk
- Open an issue using the "Help Request" template at https://github.com/sparklemotion/nokogiri/issues
- Open a discussion at https://github.com/sparklemotion/nokogiri/discussions
Please do not mail the maintainers at their personal addresses.
Report A Bug
The Nokogiri bug tracker is at https://github.com/sparklemotion/nokogiri/issues
Please use the "Bug Report" or "Installation Difficulties" templates.
Security and Vulnerability Reporting
Please report vulnerabilities at https://hackerone.com/nokogiri
Full information and description of our security policy is in SECURITY.md
Semantic Versioning Policy
Nokogiri follows Semantic Versioning (since 2017 or so).
We bump Major.Minor.Patch versions following this guidance:
Major: (we've never done this)
- Significant backwards-incompatible changes to the public API that would require rewriting existing application code.
- Some examples of backwards-incompatible changes we might someday consider for a Major release are at ROADMAP.md.
Minor:
- Features and bugfixes.
- Updating packaged libraries for non-security-related reasons.
- Dropping support for EOLed Ruby versions. Some folks find this objectionable, but SemVer says this is OK if the public API hasn't changed.
- Backwards-incompatible changes to internal or private methods and constants. These are detailed in the "Changes" section of each changelog entry.
- Removal of deprecated methods or parameters, after a generous transition period; usually when those methods or parameters are rarely-used or dangerous to the user. Essentially, removals that do not justify a major version bump.
Patch:
- Bugfixes.
- Security updates.
- Updating packaged libraries for security-related reasons.
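Because minor releases may drop EOLed Ruby versions or remove deprecated methods while patch releases carry only fixes, the choice of version constraint determines which of these changes an application picks up automatically. The sketch below uses RubyGems' own `Gem::Requirement` to show how two pessimistic constraints behave. The version numbers are hypothetical and chosen only to illustrate the operators.

```ruby
require 'rubygems'

# Hypothetical version numbers, used only to illustrate the constraint operators.
minor_range = Gem::Requirement.new('~> 1.16')    # >= 1.16, < 2.0
patch_range = Gem::Requirement.new('~> 1.16.0')  # >= 1.16.0, < 1.17

candidates = %w[1.16.5 1.17.0 2.0.0].map { |v| Gem::Version.new(v) }

candidates.each do |version|
  puts format('%-7s  ~> 1.16: %-5s  ~> 1.16.0: %s',
              version,
              minor_range.satisfied_by?(version),
              patch_range.satisfied_by?(version))
end
```

The two-segment form accepts new minor releases; the three-segment form restricts updates to patch releases.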
Sponsorship
You can help sponsor the maintainers of this software through one of these organizations:
- github.com/sponsors/flavorjones
- opencollective.com/nokogiri
- tidelift.com/subscription/pkg/rubygems-nokogiri
Installation
Requirements:
- Ruby >= 3.1
- JRuby >= 9.4.0.0
If you are compiling the native extension against a system version of libxml2:
- libxml2 >= 2.9.2 (recommended >= 2.12.0)
Native Gems: Faster, more reliable installation
"Native gems" contain pre-compiled libraries for a specific machine architecture. On supported platforms, this removes the need for compiling the C extension and the packaged libraries, or for system dependencies to exist. This makes installation much faster and more reliable; slow or failed installs are, as you probably know, the biggest headaches for Nokogiri users.
Supported Platforms
Nokogiri ships pre-compiled, "native" gems for the following platforms:
- Linux:
  - x86_64-linux-gnu, aarch64-linux-gnu, and arm-linux-gnu (req: glibc >= 2.29)
  - x86_64-linux-musl, aarch64-linux-musl, and arm-linux-musl
- Darwin/MacOS: x86_64-darwin and arm64-darwin
- Windows: x64-mingw-ucrt
- Java: any platform running JRuby 9.4 or higher
To determine whether your system supports one of these gems, look at the output of bundle platform or ruby -e 'puts Gem::Platform.local.to_s'.
If you're on a supported platform, either gem install or bundle install should install a native gem without any additional action on your part. This installation should only take a few seconds, and your output should look something like:
$ gem install nokogiri
Fetching nokogiri-1.11.0-x86_64-linux.gem
Successfully installed nokogiri-1.11.0-x86_64-linux
1 gem installed
Other Installation Options
Because Nokogiri is a C extension, it requires that you have a C compiler toolchain, Ruby development header files, and some system dependencies installed.
The following may work for you if you have an appropriately-configured system:
gem install nokogiri
If you have any issues, please visit Installing Nokogiri for more complete instructions and troubleshooting.
How To Use Nokogiri
Nokogiri is a large library, and so it's challenging to briefly summarize it. We've tried to provide long, real-world examples at Tutorials.
Parsing and Querying
Here is example usage for parsing and querying a document:
#! /usr/bin/env ruby
require 'nokogiri'
require 'open-uri'
# Fetch and parse HTML document
doc = Nokogiri::HTML(URI.open('https://nokogiri.org/tutorials/installing_nokogiri.html'))
# Search for nodes by css
doc.css('nav ul.menu li a', 'article h2').each do |link|
  puts link.content
end
# Search for nodes by xpath
doc.xpath('//nav//ul//li/a', '//article//h2').each do |link|
  puts link.content
end
# Or mix and match
doc.search('nav ul.menu li a', '//article//h2').each do |link|
  puts link.content
end
Encoding
Strings are always stored as UTF-8 internally. Methods that return text values will always return UTF-8 encoded strings. Methods that return a string containing markup (like to_xml, to_html and inner_html) will return a string encoded like the source document.
WARNING
Some documents declare one encoding, but actually use a different one. In these cases, which encoding should the parser choose?
Data is just a stream of bytes. Humans add meaning to that stream. Any
particular set of bytes could be valid characters in multiple
encodings, so detecting encoding with 100% accuracy is not
possible. libxml2
does its best, but it can't be right all the time.
If you want Nokogiri to handle the document encoding properly, your best bet is to explicitly set the encoding. Here is an example of explicitly setting the encoding to EUC-JP on the parser:
doc = Nokogiri.XML('<foo><bar /></foo>', nil, 'EUC-JP')
Technical Overview
Guiding Principles
As noted above, two guiding principles of the software are:
- be secure-by-default by treating all documents as untrusted by default
- be a thin-as-reasonable layer on top of the underlying parsers, and don't attempt to fix behavioral differences between the parsers
Notably, despite all parsers being standards-compliant, there are behavioral inconsistencies between the parsers used in the CRuby and JRuby implementations, and Nokogiri does not and should not attempt to remove these inconsistencies. Instead, we surface these differences in the test suite when they are important/semantic; or we intentionally write tests to depend only on the important/semantic bits (omitting whitespace from regex matchers on results, for example).
CRuby
The Ruby (a.k.a., CRuby, MRI, YARV) implementation is a C extension that depends on libxml2 and libxslt (which in turn depend on zlib and possibly libiconv).
These dependencies are met by default by Nokogiri's packaged versions of the libxml2 and libxslt source code, but a configuration option --use-system-libraries is provided to allow specification of alternative library locations. See Installing Nokogiri for full documentation.
We provide native gems by pre-compiling libxml2 and libxslt (and potentially zlib and libiconv) and packaging them into the gem file. In this case, no compilation is necessary at installation time, which leads to faster and more reliable installation.
See LICENSE-DEPENDENCIES.md for more information on which dependencies are provided in which native and source gems.
JRuby
The Java (a.k.a. JRuby) implementation is a Java extension that depends primarily on Xerces and NekoHTML for parsing, though additional dependencies are on isorelax, nekodtd, jing, serializer, xalan-j, and xml-apis.
These dependencies are provided by pre-compiled jar files packaged in the java platform gem.
See LICENSE-DEPENDENCIES.md for more information on which dependencies are provided in which native and source gems.
Contributing
See CONTRIBUTING.md for an intro guide to developing Nokogiri.
Code of Conduct
We've adopted the Contributor Covenant code of conduct, which you can read in full in CODE_OF_CONDUCT.md.
License
This project is licensed under the terms of the MIT license.
See this license at LICENSE.md.
Dependencies
Some additional libraries may be distributed with your version of Nokogiri. Please see LICENSE-DEPENDENCIES.md for a discussion of the variations as well as the licenses thereof.
Authors
- Mike Dalessio
- Aaron Patterson
- Yoko Harada
- Akinori MUSHA
- John Shahid
- Karol Bucek
- Sam Ruby
- Craig Barnes
- Stephen Checkoway
- Lars Kanis
- Sergio Arbeo
- Timothy Elliott
- Nobuyoshi Nakada