Top Related Projects
Standards-compliant library for parsing and serializing HTML documents and fragments in Python
HTML Standard
Lexbor is development of an open source HTML Renderer library. https://lexbor.com
High-performance browser-grade HTML5 parser
Quick Overview
Gumbo is an HTML5 parsing library implemented in pure C99. It aims to provide a robust and efficient parsing solution that adheres to the HTML5 specification. Gumbo is designed to be easily integrated into various programming languages and applications.
Pros
- Fast and memory-efficient parsing of HTML documents
- Compliant with the HTML5 specification
- Easy to integrate with other programming languages
- Well documented, though the upstream project has been unmaintained since 2016
Cons
- Limited to HTML parsing only (no XML or other markup languages)
- May require additional wrappers for use in higher-level languages
- Not designed for in-place editing of the parsed document
Code Examples
- Parsing an HTML string:
#include "gumbo.h"

int main(void) {
    const char* html = "<h1>Hello, World!</h1>";
    GumboOutput* output = gumbo_parse(html);
    // Use the parsed output; output->root is the <html> element
    gumbo_destroy_output(&kGumboDefaultOptions, output);
    return 0;
}
- Traversing the parsed tree:
#include <stdio.h>
#include "gumbo.h"

// Recursively print the tag name of every element node.
void traverse_node(GumboNode* node) {
    if (node->type == GUMBO_NODE_ELEMENT) {
        GumboElement* element = &node->v.element;
        printf("Tag: %s\n", gumbo_normalized_tagname(element->tag));
        GumboVector* children = &element->children;
        for (unsigned int i = 0; i < children->length; ++i) {
            traverse_node((GumboNode*)children->data[i]);
        }
    }
}
- Extracting text content:
#include <stdio.h>
#include "gumbo.h"

// Recursively print every text node found below the given node.
void extract_text(GumboNode* node) {
    if (node->type == GUMBO_NODE_TEXT) {
        printf("Text: %s\n", node->v.text.text);
    } else if (node->type == GUMBO_NODE_ELEMENT) {
        GumboVector* children = &node->v.element.children;
        for (unsigned int i = 0; i < children->length; ++i) {
            extract_text((GumboNode*)children->data[i]);
        }
    }
}
Getting Started
- Clone the repository:
git clone https://github.com/google/gumbo-parser.git
- Build and install the library:
cd gumbo-parser
./autogen.sh
./configure
make
sudo make install
- Include the header in your C project:
#include "gumbo.h"
- Link against the installed library when compiling your project:
gcc -o myproject myproject.c -lgumbo
Competitor Comparisons
Standards-compliant library for parsing and serializing HTML documents and fragments in Python
Pros of html5lib-python
- Pure Python implementation, making it easier to install and use across different platforms
- Highly compliant with the HTML5 specification, ensuring accurate parsing of modern web content
- Supports parsing of fragments and streaming input, offering more flexibility in usage
Cons of html5lib-python
- Generally slower performance compared to Gumbo-parser, which is written in C
- Larger memory footprint due to its Python implementation
- May require more frequent updates to keep up with HTML5 specification changes
Code Comparison
html5lib-python:
import html5lib
with open("input.html", "rb") as f:
    document = html5lib.parse(f)
Gumbo-parser (using PyGumbo wrapper):
from pygumbo import parse
with open("input.html", "rb") as f:
    document = parse(f.read())
Both libraries provide similar functionality for parsing HTML, but Gumbo-parser typically offers better performance due to its C implementation. html5lib-python, however, provides a more Pythonic interface and better cross-platform compatibility. The choice between the two depends on specific project requirements, such as parsing accuracy, performance needs, and ease of integration.
HTML Standard
Pros of html
- Comprehensive HTML specification, providing a complete reference for web developers
- Regularly updated to reflect the latest web standards and browser implementations
- Extensive documentation and examples for HTML elements and attributes
Cons of html
- Not a parser implementation, requiring additional tools for HTML parsing
- Large and complex specification, potentially overwhelming for beginners
- Focuses on HTML standards rather than providing a ready-to-use parsing solution
Code Comparison
gumbo-parser:
GumboOutput* output = gumbo_parse("<h1>Hello, World!</h1>");
GumboNode* root = output->root;
// Process the parsed HTML tree
gumbo_destroy_output(&kGumboDefaultOptions, output);
html:
<!-- No direct parsing code, as it's a specification, not a parser -->
<h1>Hello, World!</h1>
<!-- The specification defines how this should be interpreted -->
Summary
gumbo-parser is a lightweight HTML5 parsing library, while html is the official HTML specification. gumbo-parser provides a practical solution for parsing HTML in applications, whereas html serves as the authoritative reference for HTML standards. Developers often use both in conjunction: gumbo-parser for implementation and html for guidance on correct HTML structure and semantics.
Lexbor is development of an open source HTML Renderer library. https://lexbor.com
Pros of lexbor
- Also written in C, with performance as a stated design goal
- Bundles CSS parsing, selector, and encoding modules alongside the HTML parser
- Actively maintained with more recent updates
Cons of lexbor
- Less widely adopted compared to Gumbo parser
- Documentation is not as comprehensive
- Steeper learning curve for developers unfamiliar with the library
Code Comparison
Gumbo parser:
GumboOutput* output = gumbo_parse("<h1>Hello, World!</h1>");
// ... process the output ...
gumbo_destroy_output(&kGumboDefaultOptions, output);
lexbor:
lxb_html_document_t *document = lxb_html_document_create();
lxb_html_document_parse(document, (const lxb_char_t *)"<h1>Hello, World!</h1>", 22);
// ... process the document ...
lxb_html_document_destroy(document);
Both libraries offer HTML parsing capabilities, but lexbor provides a more extensive feature set, including CSS parsing and selector support alongside HTML. Gumbo parser, being a product of Google, has a larger user base and a more established reputation. The choice between the two may depend on specific project requirements, performance needs, and developer familiarity with the libraries.
High-performance browser-grade HTML5 parser
Pros of html5ever
- Written in Rust, so memory safety is checked at compile time
- Designed for high performance and integration with the Servo browser engine
- Supports streaming parsing, allowing processing of partial HTML documents
Cons of html5ever
- Steeper learning curve for developers not familiar with Rust
- Less widespread adoption compared to C-based parsers like Gumbo
- May require additional setup for use in non-Rust projects
Code Comparison
html5ever:
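use html5ever::parse_document;
use html5ever::tendril::TendrilSink;
use markup5ever_rcdom::RcDom;

// `input` is any reader that yields UTF-8 bytes, such as a file or stdin.
let dom = parse_document(RcDom::default(), Default::default())
    .from_utf8()
    .read_from(&mut input)
    .unwrap();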
Gumbo:
GumboOutput* output = gumbo_parse(input);
GumboNode* root = output->root;
Key Differences
- Language: html5ever is written in Rust, while Gumbo is written in C
- Performance: html5ever accepts incremental (streaming) input, while Gumbo parses a complete in-memory buffer in one call
- API: html5ever provides a more modern, type-safe API compared to Gumbo's C interface
- Integration: Gumbo is easier to integrate with C/C++ projects, while html5ever is more suitable for Rust-based applications
- Maintenance: html5ever is actively maintained as part of the Servo project, while Gumbo has been unmaintained since 2016
Both parsers aim to provide conformant HTML5 parsing, but their design philosophies and target use cases differ significantly.
README
Gumbo - A pure-C HTML5 parser.
This project has been unmaintained since 2016 and should not be used.
The original README is available for historical reference.