html5lib-python

Standards-compliant library for parsing and serializing HTML documents and fragments in Python

Top Related Projects

2,660

Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes

2,724

The lxml XML toolkit for Python

Pythonic HTML Parsing for Humans™

The awesome document factory

8,228

HTML Standard

Quick Overview

html5lib-python is a pure-Python library for parsing HTML. It is designed to be fully compatible with the WHATWG HTML specification, as is implemented by all major web browsers. This library aims to parse HTML exactly as modern browsers do, making it ideal for web scraping and processing HTML documents.

Pros

  • Highly accurate HTML parsing, matching browser behavior
  • Supports both Python 2 and Python 3
  • Extensive test suite ensuring compatibility with the HTML5 specification
  • Ability to parse fragment and full documents

Cons

  • Slower parsing speed compared to some alternatives like lxml
  • More complex API compared to simpler libraries like BeautifulSoup
  • Larger memory footprint due to its comprehensive parsing approach
  • Limited built-in functionality for DOM manipulation

Code Examples

Parsing an HTML document:

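import html5lib

# Disable namespacing so plain tag names work in ElementTree lookups
document = html5lib.parse("<p>Hello, <b>World!</b></p>", namespaceHTMLElements=False)
paragraph = document.find(".//p")  # parse() returns the root <html> element
print(paragraph.text)  # Outputs: Hello,
print(paragraph.find("b").text)  # Outputs: World!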

Parsing an HTML fragment:

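import html5lib

# Without namespaceHTMLElements=False the tag would be "{http://www.w3.org/1999/xhtml}p"
fragment = html5lib.parseFragment("<p>Fragment</p>", namespaceHTMLElements=False)
print(fragment[0].tag)  # Outputs: p
print(fragment[0].text)  # Outputs: Fragment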

Serializing HTML:

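import html5lib
from html5lib.serializer import HTMLSerializer

document = html5lib.parse("<p>Hello, <b>World!</b></p>")
# The serializer consumes a tree walker over the document, not the document itself
walker = html5lib.getTreeWalker("etree")
# Optional tags (<html>, <head>, <body>, </p>) are dropped unless omit_optional_tags=False
serializer = HTMLSerializer(omit_optional_tags=False)
output = serializer.render(walker(document))
print(output)  # Outputs: <html><head></head><body><p>Hello, <b>World!</b></p></body></html>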

Getting Started

To get started with html5lib-python, first install it using pip:

pip install html5lib

Then, you can use it in your Python code:

import html5lib

# Parse an HTML string (namespacing disabled so plain tag names can be used below)
document = html5lib.parse("<html><body><p>Hello, World!</p></body></html>", namespaceHTMLElements=False)

# Access the parsed document
body = document.find(".//body")
paragraph = body.find(".//p")
print(paragraph.text)  # Outputs: Hello, World!

This example demonstrates how to parse an HTML string and access elements within the parsed document.
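By default, html5lib places HTML elements in the XHTML namespace, so ElementTree lookups with a bare tag name such as "p" find nothing. The following is a minimal sketch of the two usual ways around this; the variable names are illustrative only.

```python
import html5lib

XHTML = "http://www.w3.org/1999/xhtml"
source = "<html><body><p>Hello, World!</p></body></html>"

# Default behaviour: tags are stored as "{namespace}name",
# so a bare "p" does not match anything.
namespaced = html5lib.parse(source)
bare_lookup = namespaced.find(".//p")
print(bare_lookup)  # None

# Option 1: qualify the tag name with the XHTML namespace.
paragraph = namespaced.find(".//{%s}p" % XHTML)
print(paragraph.text)

# Option 2: ask the tree builder not to namespace HTML elements.
plain = html5lib.parse(source, namespaceHTMLElements=False)
print(plain.find(".//p").text)
```

Option 2 keeps lookups short; option 1 is preferable when the tree is passed on to code that expects namespaced elements.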

Competitor Comparisons

2,660

Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes

Pros of Bleach

  • Focused on sanitizing HTML, making it more specialized and potentially easier to use for this specific task
  • Actively maintained with more recent updates and releases
  • Smaller codebase, potentially easier to understand and contribute to

Cons of Bleach

  • Less comprehensive HTML parsing capabilities compared to html5lib-python
  • More limited in scope, not suitable for full HTML parsing or manipulation
  • Built on html5lib-python (Bleach bundles a vendored copy), which may introduce additional complexity

Code Comparison

Bleach (sanitizing HTML):

import bleach

html = '<script>alert("XSS")</script><p>Safe content</p>'
# "p" is not in Bleach's default allowed tags, so allow it explicitly
clean_html = bleach.clean(html, tags={"p"})
print(clean_html)  # The <script> markup is escaped as text; <p>Safe content</p> is kept

html5lib-python (parsing HTML):

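import html5lib

html = '<p>Some HTML content</p>'
document = html5lib.parse(html)
# Elements have no tostring() method; serialize the tree and keep optional tags
print(html5lib.serialize(document, omit_optional_tags=False))  # Output: <html><head></head><body><p>Some HTML content</p></body></html>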

Both libraries serve different primary purposes. Bleach is focused on sanitizing HTML for security purposes, while html5lib-python is a more comprehensive HTML parsing library. The choice between them depends on the specific requirements of your project.

2,724

The lxml XML toolkit for Python

Pros of lxml

  • Significantly faster parsing and processing speed
  • More comprehensive XML support
  • Lower memory usage for large documents

Cons of lxml

  • More complex installation process, especially on Windows
  • Less forgiving with malformed HTML
  • Steeper learning curve for beginners

Code Comparison

lxml:

from lxml import etree

root = etree.fromstring("<root><child>Text</child></root>")
child = root.find("child")
print(child.text)

html5lib:

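import html5lib

# html5lib is an HTML parser: unknown tags become ordinary elements inside <body>
doc = html5lib.parse("<root><child>Text</child></root>", namespaceHTMLElements=False)
child = doc.find(".//child")
print(child.text)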

lxml parses both XML and HTML and is generally faster and more feature-rich for XML processing. html5lib-python parses HTML only, but it handles real-world and malformed documents the way browsers do, which makes it well suited to web scraping. lxml is implemented in C, which contributes to its speed but can make installation trickier; html5lib-python is pure Python, so it installs easily across platforms. For projects requiring strict XML compliance or high-performance parsing, lxml is often the better choice. For potentially malformed HTML, html5lib-python may be more appropriate because it follows the HTML specification's error-recovery rules.

Pythonic HTML Parsing for Humans™

Pros of requests-html

  • Simpler API and easier to use for basic web scraping tasks
  • Built-in JavaScript rendering support using Pyppeteer
  • Integrates well with the popular Requests library

Cons of requests-html

  • Less comprehensive HTML parsing capabilities
  • Not as well-suited for complex HTML manipulation tasks
  • Smaller community and fewer contributors compared to html5lib-python

Code Comparison

requests-html:

from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://python.org')
r.html.render()
print(r.html.find('title', first=True).text)

html5lib-python:

import html5lib
with open("document.html", "rb") as f:
    # Disable namespacing so ".//title" matches
    document = html5lib.parse(f, namespaceHTMLElements=False)
print(document.find(".//title").text)

Both libraries offer HTML parsing capabilities, but requests-html provides a more streamlined approach for web scraping tasks, while html5lib-python offers more robust and standards-compliant HTML parsing. requests-html is better suited for quick scraping jobs, while html5lib-python is more appropriate for complex HTML processing and manipulation tasks.

The awesome document factory

Pros of WeasyPrint

  • Specialized for HTML to PDF conversion, offering more advanced layout and styling options
  • Supports CSS3 and web fonts, providing better rendering fidelity
  • Actively maintained with regular updates and improvements

Cons of WeasyPrint

  • Narrower focus compared to html5lib-python's general-purpose HTML parsing
  • Steeper learning curve due to its more complex feature set
  • Heavier dependencies, potentially making installation more challenging

Code Comparison

WeasyPrint example:

from weasyprint import HTML
HTML('https://example.com').write_pdf('example.pdf')

html5lib-python example:

import html5lib
with open("example.html", "rb") as f:
    document = html5lib.parse(f)

WeasyPrint is tailored for HTML-to-PDF conversion, while html5lib-python focuses on parsing HTML. WeasyPrint offers more advanced features for document rendering, but html5lib-python provides a simpler API for general HTML parsing tasks. The choice between them depends on the specific requirements of your project.

8,228

HTML Standard

Pros of html

  • Official specification repository maintained by WHATWG
  • Comprehensive documentation and explanations of HTML standards
  • Regularly updated with the latest HTML features and changes

Cons of html

  • Not a parser or implementation, just a specification
  • Requires additional tools or libraries to work with HTML programmatically
  • More complex and verbose for those seeking a simple HTML parsing solution

Code comparison

html5lib-python:

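import html5lib
document = html5lib.parse("<p>Hello, <b>world!</b></p>")
# Elements have no tostring() method; serialize the parsed tree instead
print(html5lib.serialize(document, omit_optional_tags=False))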

html (WHATWG specification):

<!DOCTYPE html>
<html lang="en">
<head><title>Example</title></head>
<body><p>Hello, <b>world!</b></p></body>
</html>

Summary

html5lib-python is a practical HTML parser and serializer for Python, while html (WHATWG) is the official HTML specification repository. html5lib-python is better suited for developers who need to work with HTML programmatically in Python, offering parsing and manipulation capabilities. On the other hand, the WHATWG html repository serves as the authoritative source for HTML standards and is ideal for those seeking in-depth understanding of HTML specifications and staying updated with the latest changes in the language.

README

html5lib

.. image:: https://github.com/html5lib/html5lib-python/actions/workflows/python-tox.yml/badge.svg
    :target: https://github.com/html5lib/html5lib-python/actions/workflows/python-tox.yml

html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.

Usage

Simple usage follows this pattern:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      document = html5lib.parse(f)

or:

.. code-block:: python

  import html5lib
  document = html5lib.parse("<p>Hello World!")

By default, the document will be an xml.etree element instance. Whenever possible, html5lib chooses the accelerated ElementTree implementation (i.e. xml.etree.cElementTree on Python 2.x).

Two other tree types are supported: xml.dom.minidom and lxml.etree. To use an alternative format, specify the name of a treebuilder:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
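As a rough illustration of how the tree type changes what comes back, the sketch below parses the same string with the default builder and with the "dom" builder, which needs only the standard library. The variable names are illustrative.

```python
import html5lib

source = "<p>Hello World!"

# Default tree builder: an xml.etree.ElementTree element (the root <html>).
etree_document = html5lib.parse(source)
print(type(etree_document))

# "dom" tree builder: an xml.dom.minidom Document.
dom_document = html5lib.parse(source, treebuilder="dom")
paragraph = dom_document.getElementsByTagName("p")[0]
print(paragraph.firstChild.data)
```

The "lxml" builder is used the same way but requires the optional lxml package to be installed.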

When using with urllib2 (Python 2), the charset from HTTP should be passed into html5lib as follows:

.. code-block:: python

  from contextlib import closing
  from urllib2 import urlopen
  import html5lib

  with closing(urlopen("http://example.com/")) as f:
      document = html5lib.parse(f, transport_encoding=f.info().getparam("charset"))

When using with urllib.request (Python 3), the charset from HTTP should be passed into html5lib as follows:

.. code-block:: python

  from urllib.request import urlopen
  import html5lib

  with urlopen("http://example.com/") as f:
      document = html5lib.parse(f, transport_encoding=f.info().get_content_charset())
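The effect of transport_encoding can be tried without a network connection. This sketch assumes the bytes arrived with a ``charset=utf-8`` HTTP header and contain no ``<meta charset>`` of their own; the sample markup is made up for illustration.

```python
import html5lib

# UTF-8 encoded bytes with no encoding declaration in the markup itself.
raw = "<p>caf\u00e9</p>".encode("utf-8")

# Pass the charset that would normally come from the Content-Type header.
document = html5lib.parse(
    raw,
    transport_encoding="utf-8",
    namespaceHTMLElements=False,  # allows the plain ".//p" lookup below
)
text = document.find(".//p").text
print(text)
```

Without the hint, html5lib has to guess the encoding (via chardet if installed, otherwise a default), so non-ASCII text may be decoded incorrectly.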

To have more control over the parser, create a parser object explicitly. For instance, to make the parser raise exceptions on parse errors, use:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      parser = html5lib.HTMLParser(strict=True)
      document = parser.parse(f)
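To see the difference between the two modes, the sketch below feeds the same input, which lacks a doctype and therefore triggers a parse error, to a default parser and to a strict one. It assumes ``ParseError`` is imported from ``html5lib.html5parser``, where the parser defines it.

```python
import html5lib
from html5lib.html5parser import ParseError

source = "<p>Hello World!"  # no doctype, so the parser reports a parse error

# Default (lenient) mode: errors are collected on the parser object.
lenient = html5lib.HTMLParser()
lenient.parse(source)
print(len(lenient.errors), "parse error(s) recorded")

# Strict mode: the first parse error is raised as an exception.
strict = html5lib.HTMLParser(strict=True)
try:
    strict.parse(source)
    raised = False
except ParseError as exc:
    raised = True
    print("ParseError:", exc)
```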

When you're instantiating parser objects explicitly, pass a treebuilder class as the tree keyword argument to use an alternative document format:

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  minidom_document = parser.parse("<p>Hello World!")

More documentation is available at https://html5lib.readthedocs.io/.

Installation

html5lib works on CPython 2.7+, CPython 3.5+ and PyPy. To install:

.. code-block:: bash

    $ pip install html5lib

The goal is to support a (non-strict) superset of the `versions that pip supports <https://pip.pypa.io/en/stable/installing/#python-and-os-compatibility>`_.

Optional Dependencies

The following third-party libraries may be used for additional functionality:

  • lxml is supported as a tree format (for both building and walking) under CPython (but not PyPy where it is known to cause segfaults);

  • genshi has a treewalker (but not builder); and

  • chardet can be used as a fallback when character encoding cannot be determined.

Bugs

Please report any bugs on the `issue tracker <https://github.com/html5lib/html5lib-python/issues>`_.

Tests

Unit tests require the pytest and mock libraries and can be run using the pytest command in the root directory.

Test data are contained in a separate `html5lib-tests <https://github.com/html5lib/html5lib-tests>`_ repository and included as a submodule, thus for git checkouts they must be initialized::

  $ git submodule init
  $ git submodule update

If you have all compatible Python implementations available on your system, you can run tests on all of them using the tox utility, which can be found on PyPI.

Questions?

Check out `the docs <https://html5lib.readthedocs.io/en/latest/>`_. Still need help? Go to our `GitHub Discussions <https://github.com/html5lib/html5lib-python/discussions>`_.

You can also browse the archives of the `html5lib-discuss mailing list <https://www.mail-archive.com/html5lib-discuss@googlegroups.com/>`_.