
servo/html5ever

High-performance browser-grade HTML5 parser

Top Related Projects

  • html (7,987): HTML Standard
  • jsdom (20,377): A JavaScript implementation of various web standards, for use with Node.js
  • Gumbo-parser: An HTML5 parsing library in pure C99
  • lexbor (1,580): Lexbor is development of an open source HTML Renderer library. https://lexbor.com

Quick Overview

html5ever is a high-performance HTML5 parser written in Rust. It aims to be fully compliant with the HTML5 specification while providing a fast and memory-efficient parsing solution. The library is designed to be used as a foundation for web browsers, web crawlers, and other HTML processing tools.

Pros

  • High performance and memory efficiency due to Rust implementation
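  • Closely follows the WHATWG HTML parsing specification (passes all html5lib tokenizer tests and most tree builder tests)
  • Supports both tree-building output, through a caller-supplied sink such as RcDom, and raw token stream output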
  • Actively maintained and part of the Servo browser engine project

Cons

  • Steeper learning curve for developers not familiar with Rust
  • Limited documentation compared to some other HTML parsing libraries
  • May require additional dependencies for certain features
  • Not as widely adopted as some other HTML parsing solutions

Code Examples

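  1. Parsing HTML into a DOM tree:
use html5ever::parse_document;
use html5ever::tendril::TendrilSink; // brings `from_utf8` and `read_from` into scope
use markup5ever_rcdom::RcDom;

let html = r#"<html><body><h1>Hello, world!</h1></body></html>"#;
// RcDom is the tree sink; the second argument is the default ParseOpts
let dom = parse_document(RcDom::default(), Default::default())
    .from_utf8()
    .read_from(&mut html.as_bytes())
    .unwrap(); // read_from only fails on I/O errors; malformed HTML is recovered from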
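  2. Tokenizing HTML:
// Written against html5ever 0.26; these signatures may differ in other releases
use html5ever::tendril::StrTendril;
use html5ever::tokenizer::{
    BufferQueue, TagKind, Token, TokenSink, TokenSinkResult, Tokenizer, TokenizerOpts,
};

struct MyTokenSink;

impl TokenSink for MyTokenSink {
    type Handle = ();

    // Called once per token; returning Continue keeps the tokenizer running
    fn process_token(&mut self, token: Token, _line_number: u64) -> TokenSinkResult<()> {
        if let Token::TagToken(tag) = token {
            if tag.kind == TagKind::StartTag {
                println!("Found start tag: {}", tag.name);
            }
        }
        TokenSinkResult::Continue
    }
}

// The tokenizer reads from a queue of input buffers rather than a plain string
let mut input = BufferQueue::new();
input.push_back(StrTendril::from("<div>Hello</div>"));
let mut tokenizer = Tokenizer::new(MyTokenSink, TokenizerOpts::default());
let _ = tokenizer.feed(&mut input);
tokenizer.end();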
  3. Serializing a DOM tree back to HTML:
use html5ever::serialize::{serialize, SerializeOpts};
use markup5ever_rcdom::SerializableHandle;

// `dom` is the RcDom produced by parse_document in the first example
let document: SerializableHandle = dom.document.clone().into();
let mut out = Vec::new();
serialize(&mut out, &document, SerializeOpts::default()).unwrap();
println!("{}", String::from_utf8(out).unwrap());

Getting Started

To use html5ever in your Rust project, add the following to your Cargo.toml:

[dependencies]
html5ever = "0.26"
markup5ever_rcdom = "0.2"

Then, in your Rust code, you can import and use the library:

use html5ever::parse_document;
use html5ever::tendril::TendrilSink;
use markup5ever_rcdom::RcDom;

fn main() {
    let html = r#"<html><body><h1>Hello, html5ever!</h1></body></html>"#;
    let dom = parse_document(RcDom::default(), Default::default())
        .from_utf8()
        .read_from(&mut html.as_bytes())
        .unwrap();
    
    println!("Parsed HTML successfully!");
}

This example parses a simple HTML string into a DOM tree. You can then traverse or manipulate the DOM as needed for your specific use case.

Competitor Comparisons

html (7,987): HTML Standard

Pros of html

  • Official HTML specification repository, providing the most up-to-date and authoritative source for HTML standards
  • Extensive documentation and explanations for HTML elements, attributes, and behaviors
  • Collaborative platform for discussing and proposing changes to the HTML specification

Cons of html

  • Primarily focused on documentation rather than providing a usable HTML parser implementation
  • May be overwhelming for developers seeking a practical HTML parsing solution
  • Less suitable for direct integration into web browsers or other HTML-processing applications

Code Comparison

html5ever:

let dom = parse_document(RcDom::default(), Default::default())
    .from_utf8()
    .read_from(&mut "<!DOCTYPE html><html><head></head><body>Hello, world!</body></html>"
        .as_bytes())
    .unwrap();

html:

<!DOCTYPE html>
<html lang="en">
  <head><title>Example</title></head>
  <body>Hello, world!</body>
</html>

Summary

html5ever is a high-performance HTML parser implemented in Rust, designed for integration into web browsers and other HTML-processing applications. It focuses on providing a practical, efficient parsing solution.

html, on the other hand, is the official HTML specification repository maintained by WHATWG. It serves as the authoritative source for HTML standards and documentation but does not provide a parser implementation.

Developers looking for a robust HTML parser should consider html5ever, while those seeking comprehensive HTML documentation and standards should refer to html.

jsdom (20,377): A JavaScript implementation of various web standards, for use with Node.js

Pros of jsdom

  • Written in JavaScript, making it more accessible for web developers
  • Provides a complete DOM/HTML implementation for use in Node.js
  • Supports a wide range of web standards and APIs

Cons of jsdom

  • Generally slower performance compared to native implementations
  • May not always perfectly match browser behavior in edge cases
  • Larger package size and more dependencies

Code Comparison

jsdom:

const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
console.log(dom.window.document.querySelector("p").textContent);

html5ever:

use html5ever::parse_document;
use html5ever::tendril::TendrilSink;
use markup5ever_rcdom::RcDom;

let dom = parse_document(RcDom::default(), Default::default())
    .from_utf8().read_from(&mut "<!DOCTYPE html><p>Hello world</p>".as_bytes()).unwrap();

Key Differences

  • html5ever is a Rust HTML parser, while jsdom is a full JavaScript DOM implementation
  • html5ever focuses on parsing performance, while jsdom aims for API completeness
  • jsdom provides a more familiar environment for web developers working in Node.js
  • html5ever is better suited for integration into larger Rust-based projects or browsers

Gumbo-parser: An HTML5 parsing library in pure C99

Pros of Gumbo-parser

  • Written in C, offering potential performance advantages and easier integration with C/C++ projects
  • Smaller codebase, potentially easier to understand and maintain
  • Follows the HTML5 parsing specification closely

Cons of Gumbo-parser

  • Less actively maintained compared to html5ever
  • Limited language support (primarily C/C++); html5ever is likewise Rust-only, and its README notes that bindings for other languages are still wanted
  • May have fewer features and less flexibility for complex parsing scenarios

Code Comparison

html5ever (Rust):

let dom = parse_document(RcDom::default(), Default::default())
    .from_utf8()
    .read_from(&mut "<!DOCTYPE html><html><head></head><body>Hello, world!</body></html>".as_bytes())
    .unwrap();

Gumbo-parser (C):

GumboOutput* output = gumbo_parse("<html><head></head><body>Hello, world!</body></html>");
// Use the parsed output
gumbo_destroy_output(&kGumboDefaultOptions, output);

Both parsers aim to provide HTML5-compliant parsing, but they differ in their implementation languages and target audiences. html5ever is more actively maintained, while Gumbo-parser is more lightweight and may be easier to integrate into C/C++ projects.

lexbor (1,580): Lexbor is development of an open source HTML Renderer library. https://lexbor.com

Pros of lexbor

  • Written in C, potentially offering better performance and lower memory usage
  • Supports both DOM and SAX parsing methods
  • Provides a more comprehensive set of tools for HTML parsing and manipulation

Cons of lexbor

  • Less mature project with potentially fewer contributors and community support
  • May have a steeper learning curve for developers not familiar with C
  • Documentation might be less extensive compared to html5ever

Code Comparison

html5ever (Rust):

let dom = parse_document(RcDom::default(), Default::default())
    .from_utf8()
    .read_from(&mut "<!DOCTYPE html><html><head></head><body>Hello, world!</body></html>".as_bytes())
    .unwrap();

lexbor (C):

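static const lxb_char_t html[] = "<!DOCTYPE html><html><head></head><body>Hello, world!</body></html>";
lxb_html_document_t *document = lxb_html_document_create();
// Pass the byte length without the trailing NUL instead of a hand-counted constant
lxb_status_t status = lxb_html_document_parse(document, html, sizeof(html) - 1);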

Both libraries provide straightforward ways to parse HTML, but lexbor's C implementation may be more verbose. html5ever's Rust code might be more readable for developers familiar with modern programming languages. However, lexbor's C implementation could potentially offer better performance in certain scenarios.

README

html5ever

API Documentation

html5ever is an HTML parser developed as part of the Servo project.

It can parse and serialize HTML according to the WHATWG specs (aka "HTML5"). However, its actual behavior currently differs in some respects, most of which are documented in the bug tracker. html5ever passes all tokenizer tests from html5lib-tests and most tree builder tests, apart from those covering unimplemented features. The goal is to pass all html5lib tests, while also providing all hooks needed by a production web browser, e.g. document.write.

Note that the HTML syntax is very similar to XML. For correct parsing of XHTML, use an XML parser (that said, many XHTML documents in the wild are serialized in an HTML-compatible form).

html5ever is written in Rust, so it avoids the notorious memory-safety problems that come with using C. Rust also gives the library the kind of performance you would expect from an HTML parser written in C. In effect, html5ever behaves like a C HTML parser: it needs no garbage collector or other heavy runtime.

Getting started in Rust

Add html5ever as a dependency in your Cargo.toml file:

[dependencies]
html5ever = "0.27"

You should also take a look at examples/html2html.rs, examples/print-rcdom.rs, and the API documentation.

Getting started in other languages

Bindings for Python and other languages are much desired.

Working on html5ever

To fetch the test suite, you need to run

git submodule update --init

Run cargo doc in the repository root to build local documentation under target/doc/.

Details

html5ever uses callbacks to manipulate the DOM, therefore it does not provide any DOM tree representation.
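
The callback design can be pictured with a small, self-contained sketch. In html5ever the real interface is the TreeSink trait, which is considerably larger; the `Event` enum, `Sink` trait, and `drive` function below are hypothetical stand-ins invented for illustration, and they use only the standard library.

```rust
// Hypothetical, simplified illustration of a callback-driven parser.
// None of these names are html5ever APIs.

/// Events a parser might report while walking a document.
enum Event<'a> {
    Open(&'a str),
    Text(&'a str),
    Close,
}

/// The caller implements this trait; the parser never builds a tree itself.
trait Sink {
    fn open(&mut self, name: &str);
    fn text(&mut self, text: &str);
    fn close(&mut self);
}

/// Stand-in for the parser: it only forwards events to the sink.
fn drive<S: Sink>(events: &[Event], sink: &mut S) {
    for event in events {
        match event {
            Event::Open(name) => sink.open(name),
            Event::Text(text) => sink.text(text),
            Event::Close => sink.close(),
        }
    }
}

/// One possible sink: renders an indented outline instead of a DOM.
#[derive(Default)]
struct Outline {
    depth: usize,
    out: String,
}

impl Sink for Outline {
    fn open(&mut self, name: &str) {
        self.out.push_str(&format!("{}<{}>\n", "  ".repeat(self.depth), name));
        self.depth += 1;
    }
    fn text(&mut self, text: &str) {
        self.out.push_str(&format!("{}\"{}\"\n", "  ".repeat(self.depth), text));
    }
    fn close(&mut self) {
        self.depth = self.depth.saturating_sub(1);
    }
}

/// Convenience wrapper: run the events through an `Outline` sink.
fn outline(events: &[Event]) -> String {
    let mut sink = Outline::default();
    drive(events, &mut sink);
    sink.out
}

fn main() {
    let events = [
        Event::Open("body"),
        Event::Open("h1"),
        Event::Text("Hello"),
        Event::Close,
        Event::Close,
    ];
    print!("{}", outline(&events));
}
```

A different sink could build a reference-counted tree from the same callbacks, which is the role RcDom from markup5ever_rcdom plays for html5ever.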

html5ever exclusively uses UTF-8 to represent strings. In the future it will support other document encodings (and UCS-2 document.write) by converting input.
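
Until such conversion is built in, a caller holding bytes in another encoding has to transcode them to UTF-8 before parsing. The following is a minimal sketch for ISO-8859-1 (Latin-1) only, using the standard library; the function name `latin1_to_utf8` is hypothetical, and real-world HTML generally needs a full decoder (for example, one that handles windows-1252), which this does not attempt.

```rust
/// Hypothetical helper: transcode ISO-8859-1 bytes to a UTF-8 `String`.
/// Each Latin-1 byte maps to the Unicode code point with the same value.
fn latin1_to_utf8(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

fn main() {
    // 0xE9 is 'é' in Latin-1, but a lone 0xE9 byte is not valid UTF-8.
    let raw = b"<p>caf\xE9</p>";
    assert!(std::str::from_utf8(raw).is_err());

    // After transcoding, the text is valid UTF-8 and could be handed to the parser.
    let html = latin1_to_utf8(raw);
    println!("{html}");
}
```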

The code is cross-referenced with the WHATWG syntax spec, and eventually we will have a way to present code and spec side-by-side.

html5ever builds against the official stable releases of Rust, though some optimizations are only supported on nightly releases.