html5lib-python

Standards-compliant library for parsing and serializing HTML documents and fragments in Python

1,196

294

Quick Overview

html5lib-python is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers. Because it aims to parse HTML exactly as modern browsers do, including their error recovery, it is well suited to web scraping and to processing HTML documents of uncertain quality.

Pros

Highly accurate HTML parsing, matching browser behavior
Supports both Python 2 and Python 3
Extensive test suite ensuring compatibility with the HTML5 specification
Parses both fragments and full documents

Cons

Slower parsing speed compared to some alternatives like lxml
More complex API compared to simpler libraries like BeautifulSoup
Larger memory footprint due to its comprehensive parsing approach
Limited built-in functionality for DOM manipulation

Code Examples

Parsing an HTML document:

import html5lib

# parse() returns the root <html> element as an xml.etree element by default
document = html5lib.parse("<p>Hello, <b>World!</b></p>")
paragraph = document[1][0]  # <html> -> <body> (after <head>) -> <p>
print(paragraph.text)  # Outputs: Hello,
print(paragraph[0].text)  # Outputs: World!

Parsing an HTML fragment:

import html5lib

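# namespaceHTMLElements=False keeps tag names plain ("p" rather than "{http://www.w3.org/1999/xhtml}p")
fragment = html5lib.parseFragment("<p>Fragment</p>", namespaceHTMLElements=False)
print(fragment[0].tag)  # Outputs: p
print(fragment[0].text)  # Outputs: Fragment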

Serializing HTML:

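import html5lib

document = html5lib.parse("<p>Hello, <b>World!</b></p>")
# html5lib.serialize() walks the tree itself; HTMLSerializer.serialize() expects a tree walker, not a document.
# omit_optional_tags=False keeps the <html>, <head>, and <body> tags in the output.
output = html5lib.serialize(document, omit_optional_tags=False)
print(output)  # Outputs: <html><head></head><body><p>Hello, <b>World!</b></p></body></html>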

Getting Started

To get started with html5lib-python, first install it using pip:

pip install html5lib

Then, you can use it in your Python code:

import html5lib

# Parse an HTML string; namespaceHTMLElements=False lets find() match plain tag names
document = html5lib.parse("<html><body><p>Hello, World!</p></body></html>", namespaceHTMLElements=False)

# Access the parsed document
body = document.find(".//body")
paragraph = body.find(".//p")
print(paragraph.text)  # Outputs: Hello, World!

This example demonstrates how to parse an HTML string and access elements within the parsed document.

Competitor Comparisons

bleach

2,703

Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes

Pros of Bleach

Focused on sanitizing HTML, making it more specialized and potentially easier to use for this specific task
Actively maintained with more recent updates and releases
Smaller codebase, potentially easier to understand and contribute to

Cons of Bleach

Less comprehensive HTML parsing capabilities compared to html5lib-python
More limited in scope, not suitable for full HTML parsing or manipulation
Depends on html5lib-python, which may introduce additional complexity

Code Comparison

Bleach (sanitizing HTML):

import bleach

html = '<script>alert("XSS")</script><p>Safe content</p>'
clean_html = bleach.clean(html, tags={"p"})  # <p> is not in Bleach's default allowed tags
print(clean_html)  # The <script> tag is escaped as text; the allowed <p> element is kept

html5lib-python (parsing HTML):

import html5lib

html = '<p>Some HTML content</p>'
document = html5lib.parse(html)
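# The returned etree element has no tostring() method; use html5lib.serialize() instead
print(html5lib.serialize(document, omit_optional_tags=False))  # Output: <html><head></head><body><p>Some HTML content</p></body></html>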

Both libraries serve different primary purposes. Bleach is focused on sanitizing HTML for security purposes, while html5lib-python is a more comprehensive HTML parsing library. The choice between them depends on the specific requirements of your project.

lxml

2,856

The lxml XML toolkit for Python

Pros of lxml

Significantly faster parsing and processing speed
More comprehensive XML support
Lower memory usage for large documents

Cons of lxml

More complex installation process, especially on Windows
Less forgiving with malformed HTML
Steeper learning curve for beginners

Code Comparison

lxml:

from lxml import etree

root = etree.fromstring("<root><child>Text</child></root>")
child = root.find("child")
print(child.text)

html5lib:

import html5lib

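# html5lib is an HTML parser, so the unknown <root> and <child> tags end up inside the generated <body>
doc = html5lib.parse("<root><child>Text</child></root>", namespaceHTMLElements=False)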
child = doc.find(".//child")
print(child.text)

lxml parses both XML and HTML, while html5lib-python parses HTML only. lxml is generally faster and more feature-rich for XML processing. html5lib-python excels at parsing real-world HTML, especially malformed documents, because it follows the same error-recovery rules as browsers. lxml has a C-based implementation, which contributes to its speed but can make installation trickier. html5lib-python is pure Python, which makes installation easier across platforms. For projects requiring strict XML compliance or high-performance parsing, lxml is often the better choice. For web scraping or working with potentially malformed HTML, html5lib-python might be more appropriate.
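As a minimal sketch of the lenient parsing described above (the input string and variable names are illustrative and not taken from either project's documentation), html5lib fills in the missing structure and closes the unclosed paragraphs the way a browser would:

```python
import html5lib

# Illustrative malformed input: neither <p> is closed and there is no <html> or <body>.
malformed = "<p>one<p>two"

# namespaceHTMLElements=False keeps tag names plain so iter("p") matches.
document = html5lib.parse(malformed, namespaceHTMLElements=False)

# The parser ends the first <p> when the second one starts.
paragraph_texts = [p.text for p in document.iter("p")]
print(paragraph_texts)
```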

requests-html

13,833

Pythonic HTML Parsing for Humans™

Pros of requests-html

Simpler API and easier to use for basic web scraping tasks
Built-in JavaScript rendering support using Pyppeteer
Integrates well with the popular Requests library

Cons of requests-html

Less comprehensive HTML parsing capabilities
Not as well-suited for complex HTML manipulation tasks
Smaller community and fewer contributors compared to html5lib-python

Code Comparison

requests-html:

from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://python.org')
r.html.render()
print(r.html.find('title', first=True).text)

html5lib-python:

import html5lib
with open("document.html", "rb") as f:
    document = html5lib.parse(f, namespaceHTMLElements=False)  # plain tag names so find() matches
print(document.find(".//title").text)

Both libraries offer HTML parsing capabilities, but requests-html provides a more streamlined approach for web scraping tasks, while html5lib-python offers more robust and standards-compliant HTML parsing. requests-html is better suited for quick scraping jobs, while html5lib-python is more appropriate for complex HTML processing and manipulation tasks.

WeasyPrint

7,859

The awesome document factory

Pros of WeasyPrint

Specialized for HTML to PDF conversion, offering more advanced layout and styling options
Supports CSS3 and web fonts, providing better rendering fidelity
Actively maintained with regular updates and improvements

Cons of WeasyPrint

Narrower focus compared to html5lib-python's general-purpose HTML parsing
Steeper learning curve due to its more complex feature set
Heavier dependencies, potentially making installation more challenging

Code Comparison

WeasyPrint example:

from weasyprint import HTML
HTML('https://example.com').write_pdf('example.pdf')

html5lib-python example:

import html5lib
with open("example.html", "rb") as f:
    document = html5lib.parse(f)

WeasyPrint is tailored for HTML-to-PDF conversion, while html5lib-python focuses on parsing HTML. WeasyPrint offers more advanced features for document rendering, but html5lib-python provides a simpler API for general HTML parsing tasks. The choice between them depends on the specific requirements of your project.

html

8,619

HTML Standard

Pros of html

Official specification repository maintained by WHATWG
Comprehensive documentation and explanations of HTML standards
Regularly updated with the latest HTML features and changes

Cons of html

Not a parser or implementation, just a specification
Requires additional tools or libraries to work with HTML programmatically
More complex and verbose for those seeking a simple HTML parsing solution

Code Comparison

html5lib-python:

from html5lib import parse, serialize
document = parse("<p>Hello, <b>world!</b></p>")
print(serialize(document, omit_optional_tags=False))  # etree elements have no tostring() method

html (WHATWG specification):

<!DOCTYPE html>
<html lang="en">
<head><title>Example</title></head>
<body><p>Hello, <b>world!</b></p></body>
</html>

Summary

html5lib-python is a practical HTML parser and serializer for Python, while html (WHATWG) is the official HTML specification repository. html5lib-python is better suited for developers who need to work with HTML programmatically in Python, offering parsing and manipulation capabilities. On the other hand, the WHATWG html repository serves as the authoritative source for HTML standards and is ideal for those seeking in-depth understanding of HTML specifications and staying updated with the latest changes in the language.

README

html5lib

.. image:: https://github.com/html5lib/html5lib-python/actions/workflows/python-tox.yml/badge.svg
    :target: https://github.com/html5lib/html5lib-python/actions/workflows/python-tox.yml

html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.

Usage

Simple usage follows this pattern:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      document = html5lib.parse(f)

or:

.. code-block:: python

  import html5lib
  document = html5lib.parse("<p>Hello World!")

By default, the document will be an xml.etree element instance. Whenever possible, html5lib chooses the accelerated ElementTree implementation (i.e. xml.etree.cElementTree on Python 2.x).
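A short sketch of what that default looks like in practice, assuming Python 3 and the default namespaceHTMLElements=True setting (the constant name XHTML_NS is illustrative). Element tags carry the XHTML namespace, so lookups need to include it:

```python
import xml.etree.ElementTree as ET

import html5lib

# Namespace that html5lib attaches to HTML elements by default.
XHTML_NS = "http://www.w3.org/1999/xhtml"

document = html5lib.parse("<p>Hello World!")

# The return value is the root <html> element of an ElementTree-style tree.
is_etree_element = ET.iselement(document)
root_tag = document.tag

# Searches must use the namespaced tag name.
paragraph = document.find(".//{%s}p" % XHTML_NS)
print(is_etree_element, root_tag, paragraph.text)
```

Passing namespaceHTMLElements=False to html5lib.parse is an alternative when plain tag names are more convenient.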

Two other tree types are supported: xml.dom.minidom and lxml.etree. To use an alternative format, specify the name of a treebuilder:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
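Since lxml is an optional dependency, here is a sketch that relies only on the standard library: the same treebuilder argument selects the xml.dom.minidom tree. The sample markup and variable names are illustrative.

```python
import html5lib

# treebuilder="dom" asks for an xml.dom.minidom document instead of an etree element.
dom_document = html5lib.parse("<p>Hello World!", treebuilder="dom")

# Standard DOM methods are then available on the result.
paragraphs = dom_document.getElementsByTagName("p")
first_text = paragraphs[0].firstChild.data
print(first_text)
```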

When using html5lib with urllib2 (Python 2), pass the charset from the HTTP response to html5lib as follows:

.. code-block:: python

  from contextlib import closing
  from urllib2 import urlopen
  import html5lib

  with closing(urlopen("http://example.com/")) as f:
      document = html5lib.parse(f, transport_encoding=f.info().getparam("charset"))

When using html5lib with urllib.request (Python 3), pass the charset from the HTTP response to html5lib as follows:

.. code-block:: python

  from urllib.request import urlopen
  import html5lib

  with urlopen("http://example.com/") as f:
      document = html5lib.parse(f, transport_encoding=f.info().get_content_charset())

To have more control over the parser, create a parser object explicitly. For instance, to make the parser raise exceptions on parse errors, use:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      parser = html5lib.HTMLParser(strict=True)
      document = parser.parse(f)
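A sketch of how strict mode differs from the default, using an illustrative input that lacks a doctype (which counts as a parse error). The lenient parser records the problem in its errors list, while the strict parser raises html5lib.html5parser.ParseError:

```python
import html5lib
from html5lib.html5parser import ParseError

# Illustrative input: the missing doctype is a parse error.
markup = "<p>No doctype here"

# Default behaviour: recover silently and record the errors.
lenient_parser = html5lib.HTMLParser()
lenient_parser.parse(markup)
recorded_errors = len(lenient_parser.errors)

# Strict behaviour: the first parse error raises an exception.
strict_parser = html5lib.HTMLParser(strict=True)
try:
    strict_parser.parse(markup)
    raised = False
except ParseError as exc:
    raised = True
    print("Parse error:", exc)
```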

When you're instantiating parser objects explicitly, pass a treebuilder class as the tree keyword argument to use an alternative document format:

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  minidom_document = parser.parse("<p>Hello World!")

More documentation is available at https://html5lib.readthedocs.io/.

Installation

html5lib works on CPython 2.7+, CPython 3.5+ and PyPy. To install:

.. code-block:: bash

  $ pip install html5lib

The goal is to support a (non-strict) superset of the versions that `pip supports <https://pip.pypa.io/en/stable/installing/#python-and-os-compatibility>`_.

Optional Dependencies

The following third-party libraries may be used for additional functionality:

- lxml is supported as a tree format (for both building and walking) under CPython (but not PyPy where it is known to cause segfaults);
- genshi has a treewalker (but not builder); and
- chardet can be used as a fallback when character encoding cannot be determined.
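The distinction between building and walking a tree can be illustrated with the built-in etree format, which needs none of the optional libraries. In this sketch (sample markup and variable names are illustrative), html5lib.parse builds the tree and html5lib.getTreeWalker returns a walker that turns it back into a stream of tokens:

```python
import html5lib

# Building: the parser constructs an etree document.
document = html5lib.parse("<p>Hello <b>World!</b></p>")

# Walking: the tree walker yields one token (a dict) per start tag, end tag, or text node.
walker = html5lib.getTreeWalker("etree")
start_tag_names = [
    token["name"] for token in walker(document) if token["type"] == "StartTag"
]
print(start_tag_names)
```

The serializer consumes this kind of token stream, which is why a format with only a treewalker can be serialized but not produced by the parser.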

Bugs

Please report any bugs on the `issue tracker <https://github.com/html5lib/html5lib-python/issues>`_.

Tests

Unit tests require the pytest and mock libraries and can be run using the pytest command in the root directory.

Test data are contained in a separate `html5lib-tests <https://github.com/html5lib/html5lib-tests>`_ repository and included as a submodule, thus for git checkouts they must be initialized::

  $ git submodule init
  $ git submodule update

If you have all compatible Python implementations available on your system, you can run tests on all of them using the tox utility, which can be found on PyPI.

Questions?

Check out `the docs <https://html5lib.readthedocs.io/en/latest/>`_. Still need help? Go to our `GitHub Discussions <https://github.com/html5lib/html5lib-python/discussions>`_.

You can also browse the archives of the `html5lib-discuss mailing list <https://www.mail-archive.com/html5lib-discuss@googlegroups.com/>`_.
