google/gumbo-parser

An HTML5 parsing library in pure C99

Top Related Projects

  • html5lib-python: Standards-compliant library for parsing and serializing HTML documents and fragments in Python
  • html: HTML Standard
  • lexbor: Lexbor is development of an open source HTML Renderer library. https://lexbor.com
  • html5ever: High-performance browser-grade HTML5 parser

Quick Overview

Gumbo is an HTML5 parsing library implemented in pure C99. It aims to provide a robust and efficient parser that adheres to the HTML5 specification, and its C API is designed to be easy to wrap from other programming languages and applications.

Pros

  • Fast and memory-efficient parsing of HTML documents
  • Compliant with the HTML5 specification
  • Easy to integrate with other programming languages
  • Well-documented

Cons

  • No longer maintained: the README states the project has been unmaintained since 2016 and should not be used
  • Limited to HTML parsing only (no XML or other markup languages)
  • May require additional wrappers for use in higher-level languages
  • Not designed for in-place editing of the parsed document

Code Examples

  1. Parsing an HTML string:

    #include "gumbo.h"

    int main(void) {
        const char* html = "<h1>Hello, World!</h1>";
        // gumbo_parse() takes a NUL-terminated UTF-8 string.
        GumboOutput* output = gumbo_parse(html);
        // Walk the tree starting at output->root here.
        gumbo_destroy_output(&kGumboDefaultOptions, output);
        return 0;
    }

  2. Traversing the parsed tree:

    #include <stdio.h>
    #include "gumbo.h"

    // Prints the tag name of every element, depth-first.
    void traverse_node(GumboNode* node) {
        if (node->type == GUMBO_NODE_ELEMENT) {
            GumboElement* element = &node->v.element;
            printf("Tag: %s\n", gumbo_normalized_tagname(element->tag));

            GumboVector* children = &element->children;
            for (unsigned int i = 0; i < children->length; ++i) {
                traverse_node((GumboNode*)children->data[i]);
            }
        }
    }

  3. Extracting text content:

    #include <stdio.h>
    #include "gumbo.h"

    // Prints every text node, depth-first. Whitespace-only runs are
    // GUMBO_NODE_WHITESPACE nodes and are skipped here.
    void extract_text(GumboNode* node) {
        if (node->type == GUMBO_NODE_TEXT) {
            printf("Text: %s\n", node->v.text.text);
        } else if (node->type == GUMBO_NODE_ELEMENT) {
            GumboVector* children = &node->v.element.children;
            for (unsigned int i = 0; i < children->length; ++i) {
                extract_text((GumboNode*)children->data[i]);
            }
        }
    }
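
The traversal examples above print as they go. Callers often want the text gathered into one buffer instead. The following is a minimal sketch of that recursion which builds without linking Gumbo. The `SketchNode` type and the `collect_text` function are hypothetical stand-ins introduced for illustration; they only mirror the element/text split of Gumbo's node tree and are not part of its API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a parsed node. This is NOT Gumbo's GumboNode;
 * it only mirrors the element/text split so the sketch builds on its own. */
typedef enum { SKETCH_ELEMENT, SKETCH_TEXT } SketchNodeType;

typedef struct SketchNode {
    SketchNodeType type;
    const char* text;                          /* set for SKETCH_TEXT */
    const struct SketchNode* const* children;  /* set for SKETCH_ELEMENT */
    size_t child_count;
} SketchNode;

/* Appends the text of every text node under `node` to `buf`, in document
 * order, starting at offset `used`. Returns the new length. Output is
 * truncated so that it always fits in `cap` bytes including the NUL. */
size_t collect_text(const SketchNode* node, char* buf, size_t cap, size_t used) {
    if (cap == 0 || used >= cap) {
        return used;
    }
    if (node->type == SKETCH_TEXT) {
        size_t n = strlen(node->text);
        size_t room = cap - 1 - used;
        if (n > room) {
            n = room;
        }
        memcpy(buf + used, node->text, n);
        used += n;
    } else {
        for (size_t i = 0; i < node->child_count; ++i) {
            used = collect_text(node->children[i], buf, cap, used);
        }
    }
    buf[used] = '\0';
    return used;
}
```

With Gumbo itself the same shape would apply: the text branch would read node->v.text.text and the element branch would iterate node->v.element.children, as in the examples above.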

Getting Started

  1. Clone the repository:

    git clone https://github.com/google/gumbo-parser.git
    
  2. Build and install the library:

    cd gumbo-parser
    ./autogen.sh
    ./configure
    make
    sudo make install
    
  3. Include the header in your C project:

    #include "gumbo.h"
    
  4. Link against the built library when compiling your project:

    gcc -o myproject myproject.c -lgumbo
    

Competitor Comparisons

Standards-compliant library for parsing and serializing HTML documents and fragments in Python

Pros of html5lib-python

  • Pure Python implementation, making it easier to install and use across different platforms
  • Highly compliant with the HTML5 specification, ensuring accurate parsing of modern web content
  • Supports parsing of fragments and streaming input, offering more flexibility in usage

Cons of html5lib-python

  • Generally slower performance compared to Gumbo-parser, which is written in C
  • Larger memory footprint due to its Python implementation
  • May require more frequent updates to keep up with HTML5 specification changes

Code Comparison

html5lib-python:

import html5lib
with open("input.html", "rb") as f:
    document = html5lib.parse(f)

Gumbo-parser (using PyGumbo wrapper):

from pygumbo import parse
with open("input.html", "rb") as f:
    document = parse(f.read())

Both libraries provide similar functionality for parsing HTML, but Gumbo-parser typically offers better performance due to its C implementation. html5lib-python, however, provides a more Pythonic interface and better cross-platform compatibility. The choice between the two depends on specific project requirements, such as parsing accuracy, performance needs, and ease of integration.

HTML Standard

Pros of html

  • Comprehensive HTML specification, providing a complete reference for web developers
  • Regularly updated to reflect the latest web standards and browser implementations
  • Extensive documentation and examples for HTML elements and attributes

Cons of html

  • Not a parser implementation, requiring additional tools for HTML parsing
  • Large and complex specification, potentially overwhelming for beginners
  • Focuses on HTML standards rather than providing a ready-to-use parsing solution

Code Comparison

gumbo-parser:

GumboOutput* output = gumbo_parse("<h1>Hello, World!</h1>");
GumboNode* root = output->root;
// Process the parsed HTML tree
gumbo_destroy_output(&kGumboDefaultOptions, output);

html:

<!-- No direct parsing code, as it's a specification, not a parser -->
<h1>Hello, World!</h1>
<!-- The specification defines how this should be interpreted -->

Summary

gumbo-parser is a lightweight HTML5 parsing library, while html is the official HTML specification. gumbo-parser provides a practical solution for parsing HTML in applications, whereas html serves as the authoritative reference for HTML standards. Developers often use both in conjunction: gumbo-parser for implementation and html for guidance on correct HTML structure and semantics.

Lexbor is development of an open source HTML Renderer library. https://lexbor.com

Pros of lexbor

  • Also written in C, and potentially faster and lighter on memory than Gumbo
  • Covers more than HTML parsing, with additional modules such as CSS parsing and selectors
  • Actively maintained with more recent updates

Cons of lexbor

  • Less widely adopted compared to Gumbo parser
  • Documentation is not as comprehensive
  • Steeper learning curve for developers unfamiliar with the library

Code Comparison

Gumbo parser:

GumboOutput* output = gumbo_parse("<h1>Hello, World!</h1>");
// ... process the output ...
gumbo_destroy_output(&kGumboDefaultOptions, output);

lexbor:

lxb_html_document_t *document = lxb_html_document_create();
lxb_html_document_parse(document, (const lxb_char_t *)"<h1>Hello, World!</h1>", 22);
// ... process the document ...
lxb_html_document_destroy(document);

Both libraries parse HTML, but lexbor offers a broader feature set around it, such as CSS parsing and selectors. Gumbo, as a Google project, has a larger user base and a more established reputation, though its README states it has been unmaintained since 2016. The choice between the two may depend on specific project requirements, performance needs, and developer familiarity with the libraries.
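
The lexbor call above passes the input length as the literal 22, which has to be kept in sync with the markup by hand. The following is a minimal sketch, assuming the markup is held in a character array, of deriving that length from the data instead. The names `kHtml` and `html_length` are hypothetical and belong to neither library.

```c
#include <stddef.h>

/* Hypothetical input buffer; the array form keeps its size known at
 * compile time. */
static const char kHtml[] = "<h1>Hello, World!</h1>";

/* Length in bytes, excluding the terminating NUL that sizeof counts. */
size_t html_length(void) {
    return sizeof(kHtml) - 1;
}
```

The result can be passed wherever the example hard-codes 22. For strings only known at run time, strlen gives the same value.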

High-performance browser-grade HTML5 parser

Pros of html5ever

  • Written in Rust, offering memory safety and concurrent parsing
  • Designed for high performance and integration with the Servo browser engine
  • Supports streaming parsing, allowing processing of partial HTML documents

Cons of html5ever

  • Steeper learning curve for developers not familiar with Rust
  • Less widespread adoption compared to C-based parsers like Gumbo
  • May require additional setup for use in non-Rust projects

Code Comparison

html5ever:

use html5ever::parse_document;
use html5ever::tendril::TendrilSink;
use markup5ever_rcdom::RcDom;

let dom = parse_document(RcDom::default(), Default::default())
    .from_utf8()
    .read_from(&mut input)
    .unwrap();

Gumbo:

GumboOutput* output = gumbo_parse(input);
GumboNode* root = output->root;
// ... process the tree, then release it ...
gumbo_destroy_output(&kGumboDefaultOptions, output);

Key Differences

  • Language: html5ever is written in Rust, while Gumbo is written in C
  • Performance: html5ever supports streaming input, while Gumbo parses a complete in-memory buffer
  • API: html5ever provides a more modern, type-safe API compared to Gumbo's C interface
  • Integration: Gumbo is easier to integrate with C/C++ projects, while html5ever is more suitable for Rust-based applications
  • Maintenance: html5ever is actively maintained as part of the Servo project, while Gumbo's README states it has been unmaintained since 2016

Both parsers aim to provide conformant HTML5 parsing, but their design philosophies and target use cases differ significantly.

README

Gumbo - A pure-C HTML5 parser.

This project has been unmaintained since 2016 and should not be used.

The original README is available for historical reference.