html5lib-python
Standards-compliant library for parsing and serializing HTML documents and fragments in Python
Top Related Projects
Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes
The lxml XML toolkit for Python
Pythonic HTML Parsing for Humans™
The awesome document factory
HTML Standard
Quick Overview
html5lib-python is a pure-Python library for parsing HTML. It is designed to be fully compatible with the WHATWG HTML specification, as is implemented by all major web browsers. This library aims to parse HTML exactly as modern browsers do, making it ideal for web scraping and processing HTML documents.
Pros
- Highly accurate HTML parsing, matching browser behavior
- Supports both Python 2 and Python 3
- Extensive test suite ensuring compatibility with the HTML5 specification
- Ability to parse fragment and full documents
Cons
- Slower parsing speed compared to some alternatives like lxml
- More complex API compared to simpler libraries like BeautifulSoup
- Larger memory footprint due to its comprehensive parsing approach
- Limited built-in functionality for DOM manipulation
Code Examples
Parsing an HTML document:
import html5lib
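# namespaceHTMLElements=False keeps tag names plain ("p" instead of "{http://www.w3.org/1999/xhtml}p")
document = html5lib.parse("<p>Hello, <b>World!</b></p>", namespaceHTMLElements=False)
paragraph = document.find("body/p") # parse() returns the <html> element (xml.etree)
print(paragraph.text) # Outputs: Hello,
print(paragraph.find("b").text) # Outputs: World!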
Parsing an HTML fragment:
import html5lib
fragment = html5lib.parseFragment("<p>Fragment</p>", namespaceHTMLElements=False)
print(fragment[0].tag) # Outputs: p
print(fragment[0].text) # Outputs: Fragment
Serializing HTML:
import html5lib
from html5lib.serializer import HTMLSerializer
document = html5lib.parse("<p>Hello, <b>World!</b></p>")
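# The serializer consumes a tree walker stream, not the parsed tree itself
walker = html5lib.getTreeWalker("etree")
serializer = HTMLSerializer(omit_optional_tags=False)
output = serializer.render(walker(document))
print(output) # Outputs: <html><head></head><body><p>Hello, <b>World!</b></p></body></html>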
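As a hedged aside, html5lib also exposes a module-level `html5lib.serialize` helper that combines the tree walker and serializer steps. The sketch below assumes the default `etree` tree type and contrasts the default output, which omits optional tags, with the output when `omit_optional_tags=False` is passed.

```python
import html5lib

document = html5lib.parse("<p>Hello, <b>World!</b></p>")

# Default serializer settings drop tags that the HTML syntax treats as optional.
compact = html5lib.serialize(document)

# Turning that option off keeps every start and end tag.
full = html5lib.serialize(document, omit_optional_tags=False)

print(compact)
print(full)
```

The compact form is still valid HTML; the full form is usually easier to compare against expected markup in tests.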
Getting Started
To get started with html5lib-python, first install it using pip:
pip install html5lib
Then, you can use it in your Python code:
import html5lib
# Parse an HTML string
document = html5lib.parse("<html><body><p>Hello, World!</p></body></html>", namespaceHTMLElements=False)
# Access the parsed document
body = document.find(".//body")
paragraph = body.find(".//p")
print(paragraph.text) # Outputs: Hello, World!
This example demonstrates how to parse an HTML string and access elements within the parsed document.
Competitor Comparisons
Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes
Pros of Bleach
- Focused on sanitizing HTML, making it more specialized and potentially easier to use for this specific task
- Actively maintained with more recent updates and releases
- Smaller codebase, potentially easier to understand and contribute to
Cons of Bleach
- Less comprehensive HTML parsing capabilities compared to html5lib-python
- More limited in scope, not suitable for full HTML parsing or manipulation
- Depends on html5lib-python, which may introduce additional complexity
Code Comparison
Bleach (sanitizing HTML):
import bleach
html = '<script>alert("XSS")</script><p>Safe content</p>'
clean_html = bleach.clean(html)
print(clean_html) # With default settings the <script> and <p> tags are escaped (&lt;script&gt;...), not passed through as markup
html5lib-python (parsing HTML):
import html5lib
html = '<p>Some HTML content</p>'
document = html5lib.parse(html)
print(html5lib.serialize(document, omit_optional_tags=False)) # Output: <html><head></head><body><p>Some HTML content</p></body></html>
Both libraries serve different primary purposes. Bleach is focused on sanitizing HTML for security purposes, while html5lib-python is a more comprehensive HTML parsing library. The choice between them depends on the specific requirements of your project.
The lxml XML toolkit for Python
Pros of lxml
- Significantly faster parsing and processing speed
- More comprehensive XML support
- Lower memory usage for large documents
Cons of lxml
- More complex installation process, especially on Windows
- Less forgiving with malformed HTML
- Steeper learning curve for beginners
Code Comparison
lxml:
from lxml import etree
root = etree.fromstring("<root><child>Text</child></root>")
child = root.find("child")
print(child.text)
html5lib:
import html5lib
doc = html5lib.parse("<root><child>Text</child></root>", namespaceHTMLElements=False) # unknown tags end up inside an implied <body>
child = doc.find(".//child")
print(child.text)
lxml parses both XML and HTML and is generally faster and more feature-rich for XML processing; html5lib-python parses HTML only. html5lib-python excels at real-world HTML, including malformed documents, because it applies the same error-recovery rules as browsers, which makes it well suited to web scraping. lxml is implemented in C, which contributes to its speed but can make installation trickier. html5lib-python is pure Python, so it installs easily across platforms. For projects requiring strict XML compliance or high-performance parsing, lxml is often the better choice. For web scraping or other work with potentially malformed HTML, html5lib-python may be more appropriate.
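To make the point about malformed input concrete, here is a minimal sketch of html5lib repairing markup that has unclosed elements and no document skeleton. The input string is an illustrative assumption; the calls are `html5lib.parse` and `html5lib.serialize`.

```python
import html5lib

# Two unclosed <p> elements and no <html>, <head>, or <body> tags.
malformed = "<p>One<p>Two"

document = html5lib.parse(malformed)

# Keep optional tags so the repaired structure is fully visible.
repaired = html5lib.serialize(document, omit_optional_tags=False)
print(repaired)
# <html><head></head><body><p>One</p><p>Two</p></body></html>
```

The parser closes the first paragraph when the second one starts and supplies the missing wrapper elements, as the HTML parsing algorithm prescribes.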
Pythonic HTML Parsing for Humans™
Pros of requests-html
- Simpler API and easier to use for basic web scraping tasks
- Built-in JavaScript rendering support using Pyppeteer
- Integrates well with the popular Requests library
Cons of requests-html
- Less comprehensive HTML parsing capabilities
- Not as well-suited for complex HTML manipulation tasks
- Smaller community and fewer contributors compared to html5lib-python
Code Comparison
requests-html:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://python.org')
r.html.render()
print(r.html.find('title', first=True).text)
html5lib-python:
import html5lib
with open("document.html", "rb") as f:
    document = html5lib.parse(f, namespaceHTMLElements=False)
print(document.find(".//title").text)
Both libraries offer HTML parsing capabilities, but requests-html provides a more streamlined approach for web scraping tasks, while html5lib-python offers more robust and standards-compliant HTML parsing. requests-html is better suited for quick scraping jobs, while html5lib-python is more appropriate for complex HTML processing and manipulation tasks.
The awesome document factory
Pros of WeasyPrint
- Specialized for HTML to PDF conversion, offering more advanced layout and styling options
- Supports CSS3 and web fonts, providing better rendering fidelity
- Actively maintained with regular updates and improvements
Cons of WeasyPrint
- Narrower focus compared to html5lib-python's general-purpose HTML parsing
- Steeper learning curve due to its more complex feature set
- Heavier dependencies, potentially making installation more challenging
Code Comparison
WeasyPrint example:
from weasyprint import HTML
HTML('https://example.com').write_pdf('example.pdf')
html5lib-python example:
import html5lib
with open("example.html", "rb") as f:
    document = html5lib.parse(f)
WeasyPrint is tailored for HTML-to-PDF conversion, while html5lib-python focuses on parsing HTML. WeasyPrint offers more advanced features for document rendering, but html5lib-python provides a simpler API for general HTML parsing tasks. The choice between them depends on the specific requirements of your project.
HTML Standard
Pros of html
- Official specification repository maintained by WHATWG
- Comprehensive documentation and explanations of HTML standards
- Regularly updated with the latest HTML features and changes
Cons of html
- Not a parser or implementation, just a specification
- Requires additional tools or libraries to work with HTML programmatically
- More complex and verbose for those seeking a simple HTML parsing solution
Code comparison
html5lib-python:
from html5lib import parse, serialize
document = parse("<p>Hello, <b>world!</b></p>")
print(serialize(document, omit_optional_tags=False))
html (WHATWG specification):
<!DOCTYPE html>
<html lang="en">
<head><title>Example</title></head>
<body><p>Hello, <b>world!</b></p></body>
</html>
Summary
html5lib-python is a practical HTML parser and serializer for Python, while html (WHATWG) is the official HTML specification repository. html5lib-python is better suited for developers who need to work with HTML programmatically in Python, offering parsing and manipulation capabilities. On the other hand, the WHATWG html repository serves as the authoritative source for HTML standards and is ideal for those seeking in-depth understanding of HTML specifications and staying updated with the latest changes in the language.
README
html5lib
.. image:: https://github.com/html5lib/html5lib-python/actions/workflows/python-tox.yml/badge.svg
    :target: https://github.com/html5lib/html5lib-python/actions/workflows/python-tox.yml
html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.
Usage
Simple usage follows this pattern:
.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      document = html5lib.parse(f)
or:
.. code-block:: python

  import html5lib
  document = html5lib.parse("<p>Hello World!")
By default, the ``document`` will be an ``xml.etree`` element instance. Whenever possible, html5lib chooses the accelerated ``ElementTree`` implementation (i.e. ``xml.etree.cElementTree`` on Python 2.x).
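One practical consequence of the default ``xml.etree`` output is that element tags carry the XHTML namespace. The sketch below shows both ways of looking up an element; the ``XHTML_NS`` constant is a name introduced here for illustration, while ``namespaceHTMLElements`` is a keyword argument of ``html5lib.parse``.

```python
import html5lib

XHTML_NS = "http://www.w3.org/1999/xhtml"  # illustrative constant, not an html5lib name

# Default: tags are namespaced, so lookups must include the namespace.
namespaced = html5lib.parse("<p>Hello World!")
ns_paragraph = namespaced.find(".//{%s}p" % XHTML_NS)

# Opting out yields plain tag names.
plain = html5lib.parse("<p>Hello World!", namespaceHTMLElements=False)
plain_paragraph = plain.find(".//p")

print(ns_paragraph.tag)
print(plain_paragraph.tag)
```

Forgetting the namespace is a common reason for ``find`` returning ``None`` on an otherwise well-parsed document.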
Two other tree types are supported: ``xml.dom.minidom`` and ``lxml.etree``. To use an alternative format, specify the name of a treebuilder:
.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
When using with ``urllib2`` (Python 2), the charset from HTTP should be passed into html5lib as follows:

.. code-block:: python

  from contextlib import closing
  from urllib2 import urlopen
  import html5lib

  with closing(urlopen("http://example.com/")) as f:
      document = html5lib.parse(f, transport_encoding=f.info().getparam("charset"))
When using with ``urllib.request`` (Python 3), the charset from HTTP should be passed into html5lib as follows:

.. code-block:: python

  from urllib.request import urlopen
  import html5lib

  with urlopen("http://example.com/") as f:
      document = html5lib.parse(f, transport_encoding=f.info().get_content_charset())
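The effect of ``transport_encoding`` can be checked without a network call. The following sketch assumes a byte string standing in for a response body whose HTTP header declared ``windows-1252``; the sample bytes are made up for illustration.

```python
import html5lib

# Bytes as they might arrive from a server that declared windows-1252
# in its Content-Type header; 0xE9 is "é" in that encoding.
raw = b"<p>caf\xe9</p>"

document = html5lib.parse(
    raw,
    transport_encoding="windows-1252",
    namespaceHTMLElements=False,  # plain tag names for the lookup below
)
text = document.find("body/p").text
print(text)
```

The transport encoding only applies to byte input; a ``str`` passed to ``parse`` is already decoded.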
To have more control over the parser, create a parser object explicitly. For instance, to make the parser raise exceptions on parse errors, use:
.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      parser = html5lib.HTMLParser(strict=True)
      document = parser.parse(f)
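A small sketch of the difference between the default and strict modes follows. It assumes the exception class is ``ParseError`` from ``html5lib.html5parser`` and that a non-strict parser records problems on its ``errors`` attribute; the markup is an illustrative input with no doctype.

```python
import html5lib
from html5lib.html5parser import ParseError

markup = "<p>Hello World!"  # no doctype, which the spec counts as a parse error

# Default (lenient) parser: errors are collected, parsing continues.
lenient = html5lib.HTMLParser()
lenient.parse(markup)
error_count = len(lenient.errors)

# Strict parser: the first parse error raises an exception.
strict = html5lib.HTMLParser(strict=True)
try:
    strict.parse(markup)
    raised = False
except ParseError:
    raised = True

print(error_count, raised)
```

Strict mode is mostly useful for validating markup you control; for scraped pages the lenient default is usually the practical choice.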
When you're instantiating parser objects explicitly, pass a treebuilder class as the ``tree`` keyword argument to use an alternative document format:

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  minidom_document = parser.parse("<p>Hello World!")
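Once parsed with the ``dom`` treebuilder, the result can be queried with the standard ``xml.dom.minidom`` API. This is a minimal sketch; the variable names ``paragraph`` and ``text`` are introduced here for illustration.

```python
import html5lib

parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Hello World!")

# Standard DOM calls work on the resulting minidom Document.
paragraph = minidom_document.getElementsByTagName("p")[0]
text = paragraph.firstChild.data  # the text node inside <p>
print(text)
```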
More documentation is available at https://html5lib.readthedocs.io/.
Installation
html5lib works on CPython 2.7+, CPython 3.5+ and PyPy. To install:
.. code-block:: bash

    $ pip install html5lib
The goal is to support a (non-strict) superset of the versions that `pip supports <https://pip.pypa.io/en/stable/installing/#python-and-os-compatibility>`_.
Optional Dependencies
The following third-party libraries may be used for additional functionality:
- ``lxml`` is supported as a tree format (for both building and walking) under CPython (but not PyPy, where it is known to cause segfaults);

- ``genshi`` has a treewalker (but not builder); and

- ``chardet`` can be used as a fallback when character encoding cannot be determined.
Bugs
Please report any bugs on the `issue tracker <https://github.com/html5lib/html5lib-python/issues>`_.
Tests
Unit tests require the ``pytest`` and ``mock`` libraries and can be run using the ``pytest`` command in the root directory.

Test data are contained in a separate `html5lib-tests <https://github.com/html5lib/html5lib-tests>`_ repository and included as a submodule, thus for git checkouts they must be initialized::

  $ git submodule init
  $ git submodule update

If you have all compatible Python implementations available on your system, you can run tests on all of them using the ``tox`` utility, which can be found on PyPI.
Questions?
Check out `the docs <https://html5lib.readthedocs.io/en/latest/>`_. Still need help? Go to our `GitHub Discussions <https://github.com/html5lib/html5lib-python/discussions>`_.

You can also browse the archives of the `html5lib-discuss mailing list <https://www.mail-archive.com/html5lib-discuss@googlegroups.com/>`_.