Top Related Projects
HTML Standard
A JavaScript implementation of various web standards, for use with Node.js
An HTML5 parsing library in pure C99
Lexbor is development of an open source HTML Renderer library. https://lexbor.com
Quick Overview
html5ever is a high-performance HTML5 parser written in Rust. It aims to be fully compliant with the HTML5 specification while providing a fast and memory-efficient parsing solution. The library is designed to be used as a foundation for web browsers, web crawlers, and other HTML processing tools.
Pros
- High performance and memory efficiency due to Rust implementation
- Closely follows the WHATWG HTML parsing specification: passes all html5lib tokenizer tests and most tree builder tests
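- Exposes both a token stream (tokenizer) and a tree builder; the tree builder drives a caller-supplied sink, with a reference DOM available in the separate markup5ever_rcdom crate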
- Actively maintained and part of the Servo browser engine project
Cons
- Steeper learning curve for developers not familiar with Rust
- Limited documentation compared to some other HTML parsing libraries
- May require additional dependencies for certain features
- Not as widely adopted as some other HTML parsing solutions
Code Examples
- Parsing HTML into a DOM tree:
use html5ever::parse_document;
use html5ever::tendril::TendrilSink;
use markup5ever_rcdom::RcDom;
let html = r#"<html><body><h1>Hello, world!</h1></body></html>"#;
let dom = parse_document(RcDom::default(), Default::default())
.from_utf8()
.read_from(&mut html.as_bytes())
.unwrap();
- Tokenizing HTML:
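// Written against html5ever 0.26 (the version pinned under Getting Started); newer releases may differ.
use html5ever::tendril::StrTendril;
use html5ever::tokenizer::{
    BufferQueue, Tag, TagKind, Token, TokenSink, TokenSinkResult, Tokenizer, TokenizerOpts,
};
struct MyTokenSink;
impl TokenSink for MyTokenSink {
    // This sink never hands script handles back to the tokenizer.
    type Handle = ();
    fn process_token(&mut self, token: Token, _line_number: u64) -> TokenSinkResult<()> {
        // A tag token carries its kind (start or end) as a field.
        if let Token::TagToken(Tag { kind: TagKind::StartTag, name, .. }) = token {
            println!("Found start tag: {}", name);
        }
        TokenSinkResult::Continue
    }
}
// The tokenizer reads its input from a queue of string buffers.
let mut input = BufferQueue::new();
input.push_back(StrTendril::from("<div>Hello</div>"));
let mut tokenizer = Tokenizer::new(MyTokenSink, TokenizerOpts::default());
let _ = tokenizer.feed(&mut input);
tokenizer.end();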
- Serializing a DOM tree back to HTML:
use html5ever::serialize::{serialize, SerializeOpts};
use markup5ever_rcdom::SerializableHandle;
let mut out = Vec::new();
serialize(&mut out, &SerializableHandle::from(dom.document), SerializeOpts::default())
.unwrap();
println!("{}", String::from_utf8(out).unwrap());
Getting Started
To use html5ever in your Rust project, add the following to your Cargo.toml:
[dependencies]
html5ever = "0.26"
markup5ever_rcdom = "0.2"
Then, in your Rust code, you can import and use the library:
use html5ever::parse_document;
use html5ever::tendril::TendrilSink;
use markup5ever_rcdom::RcDom;
fn main() {
    let html = r#"<html><body><h1>Hello, html5ever!</h1></body></html>"#;
    let dom = parse_document(RcDom::default(), Default::default())
        .from_utf8()
        .read_from(&mut html.as_bytes())
        .unwrap();
    println!("Parsed HTML successfully!");
}
This example parses a simple HTML string into a DOM tree. You can then traverse or manipulate the DOM as needed for your specific use case.
Competitor Comparisons
HTML Standard
Pros of html
- Official HTML specification repository, providing the most up-to-date and authoritative source for HTML standards
- Extensive documentation and explanations for HTML elements, attributes, and behaviors
- Collaborative platform for discussing and proposing changes to the HTML specification
Cons of html
- Primarily focused on documentation rather than providing a usable HTML parser implementation
- May be overwhelming for developers seeking a practical HTML parsing solution
- Less suitable for direct integration into web browsers or other HTML-processing applications
Code comparison
html5ever:
let dom = parse_document(RcDom::default(), Default::default())
.from_utf8()
.read_from(&mut "<!DOCTYPE html><html><head></head><body>Hello, world!</body></html>"
.as_bytes())
.unwrap();
html:
<!DOCTYPE html>
<html lang="en">
<head><title>Example</title></head>
<body>Hello, world!</body>
</html>
Summary
html5ever is a high-performance HTML parser implemented in Rust, designed for integration into web browsers and other HTML-processing applications. It focuses on providing a practical, efficient parsing solution.
html, on the other hand, is the official HTML specification repository maintained by WHATWG. It serves as the authoritative source for HTML standards and documentation but does not provide a parser implementation.
Developers looking for a robust HTML parser should consider html5ever, while those seeking comprehensive HTML documentation and standards should refer to html.
A JavaScript implementation of various web standards, for use with Node.js
Pros of jsdom
- Written in JavaScript, making it more accessible for web developers
- Provides a complete DOM/HTML implementation for use in Node.js
- Supports a wide range of web standards and APIs
Cons of jsdom
- Generally slower performance compared to native implementations
- May not always perfectly match browser behavior in edge cases
- Larger package size and more dependencies
Code Comparison
jsdom:
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
console.log(dom.window.document.querySelector("p").textContent);
html5ever:
use html5ever::parse_document;
use html5ever::tendril::TendrilSink;
use markup5ever_rcdom::RcDom;
let dom = parse_document(RcDom::default(), Default::default())
.from_utf8().read_from(&mut "<!DOCTYPE html><p>Hello world</p>".as_bytes()).unwrap();
Key Differences
- html5ever is a Rust HTML parser, while jsdom is a full JavaScript DOM implementation
- html5ever focuses on parsing performance, while jsdom aims for API completeness
- jsdom provides a more familiar environment for web developers working in Node.js
- html5ever is better suited for integration into larger Rust-based projects or browsers
An HTML5 parsing library in pure C99
Pros of Gumbo-parser
- Written in C, offering potential performance advantages and easier integration with C/C++ projects
- Smaller codebase, potentially easier to understand and maintain
- Follows the HTML5 parsing specification closely
Cons of Gumbo-parser
- Less actively maintained compared to html5ever
- Limited language support (primarily C/C++); note that html5ever is likewise Rust-only, and its README still lists bindings for other languages as desired
- May have fewer features and less flexibility for complex parsing scenarios
Code Comparison
html5ever (Rust):
let dom = parse_document(RcDom::default(), Default::default())
.from_utf8()
.read_from(&mut "<!DOCTYPE html><html><head></head><body>Hello, world!</body></html>".as_bytes())
.unwrap();
Gumbo-parser (C):
GumboOutput* output = gumbo_parse("<html><head></head><body>Hello, world!</body></html>");
// Use the parsed output
gumbo_destroy_output(&kGumboDefaultOptions, output);
Both parsers aim to provide HTML5-compliant parsing, but they differ in implementation language and target audience. html5ever is the more actively maintained of the two and brings Rust's memory safety, while Gumbo-parser is a small C99 library that integrates easily into C/C++ projects.
Lexbor is development of an open source HTML Renderer library. https://lexbor.com
Pros of lexbor
- Written in C, potentially offering better performance and lower memory usage
- Supports both DOM and SAX parsing methods
- Provides a more comprehensive set of tools for HTML parsing and manipulation
Cons of lexbor
- Less mature project with potentially fewer contributors and community support
- May have a steeper learning curve for developers not familiar with C
- Documentation might be less extensive compared to html5ever
Code Comparison
html5ever (Rust):
let dom = parse_document(RcDom::default(), Default::default())
.from_utf8()
.read_from(&mut "<!DOCTYPE html><html><head></head><body>Hello, world!</body></html>".as_bytes())
.unwrap();
lexbor (C):
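static const lxb_char_t html[] = "<!DOCTYPE html><html><head></head><body>Hello, world!</body></html>";
lxb_html_document_t *document = lxb_html_document_create();
lxb_status_t status = lxb_html_document_parse(document, html, sizeof(html) - 1); /* length without the trailing NUL */
lxb_html_document_destroy(document);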
Both libraries parse a document in a few lines. The lexbor call takes an explicit byte length and the document must be destroyed manually, whereas html5ever reads from any Read source and frees the tree when it goes out of scope. lexbor's C implementation could potentially offer better performance in certain scenarios.
README
html5ever
html5ever is an HTML parser developed as part of the Servo project.
It can parse and serialize HTML according to the WHATWG specs (aka "HTML5"). However, there are some differences in the actual behavior currently, most of which are documented in the bug tracker. html5ever passes all tokenizer tests from html5lib-tests, with most tree builder tests outside of the unimplemented features. The goal is to pass all html5lib tests, while also providing all hooks needed by a production web browser, e.g. document.write.
Note that the HTML syntax is very similar to XML. For correct parsing of XHTML, use an XML parser (that said, many XHTML documents in the wild are serialized in an HTML-compatible form).
html5ever is written in Rust, therefore it avoids the notorious security problems that come along with using C. Being built with Rust also makes the library come with the high-grade performance you would expect from an HTML parser written in C. html5ever is basically a C HTML parser, but without needing a garbage collector or other heavy runtime processes.
Getting started in Rust
Add html5ever as a dependency in your Cargo.toml file:
[dependencies]
html5ever = "0.27"
You should also take a look at examples/html2html.rs, examples/print-rcdom.rs, and the API documentation.
Getting started in other languages
Bindings for Python and other languages are much desired.
Working on html5ever
To fetch the test suite, you need to run
git submodule update --init
Run cargo doc in the repository root to build local documentation under target/doc/.
Details
html5ever uses callbacks to manipulate the DOM, therefore it does not provide any DOM tree representation.
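The callback design can be pictured with a small, self-contained sketch. The `Sink` trait, the `Event` enum, the `drive` function, and `ArenaSink` below are hypothetical names invented for illustration; they are far simpler than html5ever's real `TreeSink` trait and are not part of its API. The point is only that the parser side calls into the sink, and the sink decides how (or whether) to store nodes.

```rust
/// Hypothetical, simplified stand-in for a tree-builder callback interface.
/// (html5ever's real `TreeSink` trait has many more methods.)
trait Sink {
    type Handle: Copy;
    fn document(&self) -> Self::Handle;
    fn create_element(&mut self, name: &str) -> Self::Handle;
    fn append(&mut self, parent: Self::Handle, child: Self::Handle);
    fn append_text(&mut self, parent: Self::Handle, text: &str);
}

/// Pre-tokenized input, standing in for what a real tokenizer would emit.
enum Event<'a> {
    Open(&'a str),
    Text(&'a str),
    Close,
}

/// The "parser": it only tracks the stack of open elements and calls the sink.
fn drive<S: Sink>(sink: &mut S, events: &[Event<'_>]) {
    let mut open = vec![sink.document()];
    for event in events {
        let parent = *open.last().expect("the document handle is never popped");
        match event {
            Event::Open(name) => {
                let node = sink.create_element(name);
                sink.append(parent, node);
                open.push(node);
            }
            Event::Text(text) => sink.append_text(parent, text),
            Event::Close => {
                // Ignore stray closes so the document stays on the stack.
                if open.len() > 1 {
                    open.pop();
                }
            }
        }
    }
}

/// One possible sink: an arena of nodes addressed by index (node 0 is the document).
struct ArenaSink {
    names: Vec<String>,
    children: Vec<Vec<usize>>,
    text: Vec<String>,
}

impl ArenaSink {
    fn new() -> Self {
        ArenaSink {
            names: vec!["#document".to_string()],
            children: vec![Vec::new()],
            text: vec![String::new()],
        }
    }
}

impl Sink for ArenaSink {
    type Handle = usize;
    fn document(&self) -> usize {
        0
    }
    fn create_element(&mut self, name: &str) -> usize {
        self.names.push(name.to_string());
        self.children.push(Vec::new());
        self.text.push(String::new());
        self.names.len() - 1
    }
    fn append(&mut self, parent: usize, child: usize) {
        self.children[parent].push(child);
    }
    fn append_text(&mut self, parent: usize, text: &str) {
        self.text[parent].push_str(text);
    }
}
```

Calling `drive` with a fresh `ArenaSink::new()` and a slice of events fills the arena. A different sink could just as well count elements or stream output without keeping any tree, which is the flexibility the callback approach is meant to provide.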
html5ever exclusively uses UTF-8 to represent strings. In the future it will support other document encodings (and UCS-2 document.write) by converting input.
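Until such conversion exists inside the parser, input in another encoding has to be turned into UTF-8 by the caller. As a minimal sketch using only the standard library, the hypothetical helper `latin1_to_utf8` below handles strict ISO-8859-1, where each byte maps directly to the Unicode code point of the same value. It is an assumption-laden illustration: it does not implement encoding sniffing or the windows-1252 mapping that browsers apply to Latin-1 labels, and a real application would more likely rely on a dedicated encoding crate.

```rust
/// Hypothetical helper (not part of html5ever): transcode strict ISO-8859-1
/// bytes into a UTF-8 `String` that can then be handed to the parser.
fn latin1_to_utf8(bytes: &[u8]) -> String {
    // Every ISO-8859-1 byte value equals its Unicode code point (U+0000..=U+00FF),
    // so converting each byte to a `char` is a complete transcoding step.
    bytes.iter().map(|&b| char::from(b)).collect()
}
```

For example, the Latin-1 bytes `b"caf\xE9"` become the string `café`, whose UTF-8 form encodes the final character as the two bytes `0xC3 0xA9`.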
The code is cross-referenced with the WHATWG syntax spec, and eventually we will have a way to present code and spec side-by-side.
html5ever builds against the official stable releases of Rust, though some optimizations are only supported on nightly releases.