gumbo-parser

An HTML5 parsing library in pure C99

5,165

662

Top Related Projects

html5lib-python

1,111

Standards-compliant library for parsing and serializing HTML documents and fragments in Python

lexbor

1,580

Lexbor is development of an open source HTML Renderer library. https://lexbor.com

html5ever

2,096

High-performance browser-grade HTML5 parser

Quick Overview

Gumbo is an HTML5 parsing library implemented in pure C99. It aims to provide a robust and efficient parsing solution that adheres to the HTML5 specification. Gumbo is designed to be easily integrated into various programming languages and applications.

Pros

Fast and memory-efficient parsing of HTML documents
Compliant with the HTML5 specification
Easy to integrate with other programming languages
Well-documented API (note that the README marks the project as unmaintained since 2016)

Cons

Limited to HTML parsing only (no XML or other markup languages)
May require additional wrappers for use in higher-level languages
Not designed for in-place editing of the parsed document

Code Examples

Parsing an HTML string:

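#include "gumbo.h"

int main(void) {
    // gumbo_parse expects a NUL-terminated UTF-8 string
    const char* html = "<h1>Hello, World!</h1>";
    GumboOutput* output = gumbo_parse(html);
    // Use the parsed output; output->root is the <html> element
    gumbo_destroy_output(&kGumboDefaultOptions, output);
    return 0;
}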

Traversing the parsed tree:

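#include <stdio.h>
#include "gumbo.h"

// Print the tag name of every element in the subtree rooted at node
void traverse_node(GumboNode* node) {
    if (node->type == GUMBO_NODE_ELEMENT) {
        GumboElement* element = &node->v.element;
        printf("Tag: %s\n", gumbo_normalized_tagname(element->tag));

        // Recurse into the children; non-element nodes are skipped by the check above
        GumboVector* children = &element->children;
        for (unsigned int i = 0; i < children->length; ++i) {
            traverse_node((GumboNode*)children->data[i]);
        }
    }
}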

Extracting text content:

void extract_text(GumboNode* node) {
    if (node->type == GUMBO_NODE_TEXT) {
        printf("Text: %s\n", node->v.text.text);
    } else if (node->type == GUMBO_NODE_ELEMENT) {
        GumboVector* children = &node->v.element.children;
        for (unsigned int i = 0; i < children->length; ++i) {
            extract_text((GumboNode*)children->data[i]);
        }
    }
}

Getting Started

Clone the repository:

git clone https://github.com/google/gumbo-parser.git

Build and install the library:

cd gumbo-parser
./autogen.sh
./configure
make
sudo make install

Include the header in your C project:
```
#include "gumbo.h"
```
Link against the built library when compiling your project:
```
gcc -o myproject myproject.c -lgumbo
```

Competitor Comparisons

html5lib-python

1,111

Standards-compliant library for parsing and serializing HTML documents and fragments in Python

Pros of html5lib-python

Pure Python implementation, making it easier to install and use across different platforms
Highly compliant with the HTML5 specification, ensuring accurate parsing of modern web content
Supports parsing of fragments and streaming input, offering more flexibility in usage

Cons of html5lib-python

Generally slower performance compared to Gumbo-parser, which is written in C
Larger memory footprint due to its Python implementation
May require more frequent updates to keep up with HTML5 specification changes

Code Comparison

html5lib-python:

import html5lib
with open("input.html", "rb") as f:
    document = html5lib.parse(f)

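Gumbo-parser (through a Python wrapper; the `pygumbo` module name and its `parse` function are illustrative and may not match the bindings you install):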

from pygumbo import parse
with open("input.html", "rb") as f:
    document = parse(f.read())

Both libraries provide similar functionality for parsing HTML, but Gumbo-parser typically offers better performance due to its C implementation. html5lib-python, however, provides a more Pythonic interface and better cross-platform compatibility. The choice between the two depends on specific project requirements, such as parsing accuracy, performance needs, and ease of integration.

html

7,987

HTML Standard

Pros of html

Comprehensive HTML specification, providing a complete reference for web developers
Regularly updated to reflect the latest web standards and browser implementations
Extensive documentation and examples for HTML elements and attributes

Cons of html

Not a parser implementation, requiring additional tools for HTML parsing
Large and complex specification, potentially overwhelming for beginners
Focuses on HTML standards rather than providing a ready-to-use parsing solution

Code Comparison

gumbo-parser:

GumboOutput* output = gumbo_parse("<h1>Hello, World!</h1>");
GumboNode* root = output->root;
// Process the parsed HTML tree
gumbo_destroy_output(&kGumboDefaultOptions, output);

html:

<!-- No direct parsing code, as it's a specification, not a parser -->
<h1>Hello, World!</h1>
<!-- The specification defines how this should be interpreted -->

Summary

gumbo-parser is a lightweight HTML5 parsing library, while html is the official HTML specification. gumbo-parser provides a practical solution for parsing HTML in applications, whereas html serves as the authoritative reference for HTML standards. Developers often use both in conjunction: gumbo-parser for implementation and html for guidance on correct HTML structure and semantics.

lexbor

1,580

Lexbor is development of an open source HTML Renderer library. https://lexbor.com

Pros of lexbor

Also written in C, with an emphasis on parsing speed and low memory usage
Ships CSS parsing and selector modules alongside the HTML parser
Actively maintained, whereas Gumbo's README states it has been unmaintained since 2016

Cons of lexbor

Less widely adopted compared to Gumbo parser
Documentation is not as comprehensive
Steeper learning curve for developers unfamiliar with the library

Code Comparison

Gumbo parser:

GumboOutput* output = gumbo_parse("<h1>Hello, World!</h1>");
// ... process the output ...
gumbo_destroy_output(&kGumboDefaultOptions, output);

lexbor:

lxb_html_document_t *document = lxb_html_document_create();
lxb_html_document_parse(document, (const lxb_char_t *)"<h1>Hello, World!</h1>", 22);
// ... process the document ...
lxb_html_document_destroy(document);

Both libraries offer HTML parsing capabilities, but lexbor provides a broader set of modules for working with HTML documents, such as CSS parsing and selectors. Gumbo parser, being a product of Google, has a larger user base and a more established reputation, although it is no longer maintained. The choice between the two may depend on specific project requirements, performance needs, and developer familiarity with the libraries.

html5ever

2,096

High-performance browser-grade HTML5 parser

Pros of html5ever

Written in Rust, offering memory safety guarantees
Designed for high performance and integration with the Servo browser engine
Supports streaming parsing, allowing processing of partial HTML documents

Cons of html5ever

Steeper learning curve for developers not familiar with Rust
Less widespread adoption compared to C-based parsers like Gumbo
May require additional setup for use in non-Rust projects

Code Comparison

html5ever:

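use html5ever::parse_document;
use html5ever::tendril::TendrilSink;
use markup5ever_rcdom::RcDom;
// `input` is any std::io::Read, such as a File or a byte slice
let dom = parse_document(RcDom::default(), Default::default())
    .from_utf8()
    .read_from(&mut input)
    .unwrap();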

Gumbo:

GumboOutput* output = gumbo_parse(input);
GumboNode* root = output->root;

Key Differences

Language: html5ever is written in Rust, while Gumbo is written in C
Performance: html5ever supports streaming input, so a document can be fed to the parser incrementally, while Gumbo parses a complete in-memory buffer
API: html5ever provides a more modern, type-safe API compared to Gumbo's C interface
Integration: Gumbo is easier to integrate with C/C++ projects, while html5ever is more suitable for Rust-based applications
Maintenance: html5ever is actively maintained as part of the Servo project, while Gumbo has been unmaintained since 2016 according to its README

Both parsers aim to provide conformant HTML5 parsing, but their design philosophies and target use cases differ significantly.

README

Gumbo - A pure-C HTML5 parser.

This project has been unmaintained since 2016 and should not be used.

The original README is available for historical reference.
