jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.

Top Related Projects

  • Cheerio: The fast, flexible, and elegant library for parsing and manipulating HTML and XML.
  • Readability: A standalone version of the readability lib
  • Parser: 📜 Extract meaningful content from the chaos of a web page
  • html-minifier: Javascript-based HTML compressor/minifier (with Node.js support)
  • htmlparser2: The fast & forgiving HTML and XML parser

Quick Overview

jsoup is a Java library for parsing, extracting, and manipulating HTML. It turns a page into a DOM tree that can be navigated, searched with CSS selectors, and modified through a compact API.

Pros

  • Powerful HTML Parsing: JSoup can parse HTML from a URL, file, or string, and provides a comprehensive set of methods for navigating and querying the document structure.
  • Flexible Querying: JSoup supports a wide range of CSS and jQuery-like selectors, making it easy to target and extract specific elements from the HTML.
  • Seamless HTML Manipulation: JSoup allows developers to modify the HTML document, including adding, removing, and updating elements and attributes.
  • Robust Error Handling: JSoup is designed to handle malformed or invalid HTML gracefully, making it a reliable choice for working with a wide range of web content.
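
The querying and manipulation points above can be sketched in a few lines. This is a minimal illustration, not an excerpt from the jsoup documentation; the class name, the helper methods, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectAndEdit {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<ul id=menu><li class=item><a href='/a'>A</a></li>"
      + "<li class='item active'><a href='/b'>B</a></li></ul>";

    /** Returns the text of the link inside the list item marked "active". */
    static String activeLinkText(Document doc) {
        Element link = doc.selectFirst("li.item.active > a[href]");
        return link == null ? "" : link.text();
    }

    /** Adds rel="nofollow" to every link and returns how many were changed. */
    static int markNoFollow(Document doc) {
        Elements links = doc.select("a[href]");
        links.attr("rel", "nofollow"); // sets the attribute on every matched element
        return links.size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(activeLinkText(doc));
        System.out.println(markNoFollow(doc));
        System.out.println(doc.body().html());
    }
}
```

The selector combines class, child, and attribute filters in one expression, which is usually shorter than walking the tree by hand.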

Cons

  • Limited to Java: JSoup is a Java-based library, which may limit its adoption in projects that use other programming languages.
  • Performance Overhead: While JSoup is generally fast and efficient, the parsing and manipulation of large HTML documents can still incur some performance overhead, especially on resource-constrained systems.
  • Lack of Asynchronous Support: JSoup is a synchronous library, which may not be ideal for certain use cases that require asynchronous processing of HTML content.
  • Limited XPath Support: jsoup has no required external dependencies and provides its own XPath entry point (selectXpath), but out of the box it covers XPath 1.0 only; projects that need newer XPath versions must supply an alternative XPath implementation.
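
The synchronous API can still be used from asynchronous code by running it on an executor. The sketch below wraps a parse call in a CompletableFuture; the class and method names are hypothetical, and the same pattern would apply to a blocking Jsoup.connect(...).get() call.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.jsoup.Jsoup;

public class AsyncTitle {
    /** Parses the HTML on the given executor and completes with the page title. */
    static CompletableFuture<String> titleAsync(String html, ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> Jsoup.parse(html).title(), pool);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // join() blocks here only to keep the demo short
            String title = titleAsync("<title>Hello</title><p>Body</p>", pool).join();
            System.out.println(title);
        } finally {
            pool.shutdown();
        }
    }
}
```

A dedicated pool is used rather than the common pool because network fetches can block for a long time.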

Code Examples

Here are a few examples of how to use JSoup:

  1. Parsing HTML from a URL:
Document doc = Jsoup.connect("https://www.example.com").get();
String title = doc.title();
Elements paragraphs = doc.select("p");
  2. Modifying HTML Elements:
Document doc = Jsoup.parse("<p>Hello, <b>World</b>!</p>");
Element paragraph = doc.select("p").first();
// <b> is the paragraph's only child element, so it sits at index 0
paragraph.child(0).text("JSoup");
String modifiedHtml = doc.html();
  3. Extracting Data from HTML Tables:
Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)").get();
// selectFirst returns a single Element (or null if nothing matches)
Element table = doc.selectFirst("table.wikitable");
for (Element row : table.select("tr")) {
    Elements cells = row.select("td");
    if (cells.size() == 5) {
        String country = cells.get(0).text();
        int population = Integer.parseInt(cells.get(2).text().replaceAll(",", ""));
        // Process the country and population data
    }
}
  4. Cleaning and Sanitizing HTML:
String dirtyHtml = "<p>Hello, <b>World</b>! <script>alert('XSS attack!');</script></p>";
// Safelist replaces the deprecated Whitelist class (jsoup 1.14.2 and later)
String cleanHtml = Jsoup.clean(dirtyHtml, Safelist.basic());
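
As a self-contained variant of the cleaning example, the sketch below pairs Jsoup.clean with Jsoup.isValid, which reports whether input already conforms to a safelist. It assumes a jsoup version that includes the Safelist class; the class and helper names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanExample {
    /** Strips every tag and attribute not permitted by the basic safelist. */
    static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    /** Reports whether the input already conforms to the basic safelist. */
    static boolean isSafe(String untrusted) {
        return Jsoup.isValid(untrusted, Safelist.basic());
    }

    public static void main(String[] args) {
        String dirty = "<p>Hello, <b>World</b>! <script>alert('XSS attack!');</script></p>";
        System.out.println(isSafe(dirty));   // the script element is not on the safelist
        System.out.println(sanitize(dirty)); // script element and its content are dropped
    }
}
```

Checking with isValid first can be useful when rejected input should be reported to the user instead of being silently rewritten.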

Getting Started

To get started with JSoup, you can add the following dependency to your Java project:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
</dependency>

Once you have the library set up, you can start using JSoup to parse, manipulate, and extract data from HTML documents. The JSoup documentation provides a comprehensive guide on how to use the library, including examples and detailed explanations of the various features and capabilities.

Competitor Comparisons

Pros of Cheerio

  • JavaScript-based, allowing seamless integration with Node.js projects
  • Lightweight and fast, with minimal overhead
  • Familiar jQuery-like syntax for DOM manipulation

Cons of Cheerio

  • Limited to server-side use, not suitable for browser environments
  • Lacks support for some advanced CSS selectors and pseudo-classes
  • May struggle with complex, dynamic web pages that rely heavily on JavaScript

Code Comparison

Jsoup (Java):

Document doc = Jsoup.connect("https://example.com").get();
Elements links = doc.select("a[href]");
for (Element link : links) {
    System.out.println(link.attr("href"));
}

Cheerio (JavaScript):

const cheerio = require('cheerio');
const $ = cheerio.load('<html><body><a href="https://example.com">Link</a></body></html>');
$('a[href]').each((i, elem) => {
    console.log($(elem).attr('href'));
});

Both libraries offer similar functionality for parsing and manipulating HTML, but Jsoup is designed for Java applications, while Cheerio is tailored for JavaScript and Node.js environments. Jsoup provides more robust handling of malformed HTML and better support for CSS selectors, while Cheerio offers a lighter footprint and familiar jQuery-like syntax for web developers already comfortable with JavaScript.

Pros of Readability

  • Focuses specifically on extracting readable content from web pages
  • Provides a more refined and targeted approach for content extraction
  • Offers language-specific adaptations for better results in different locales

Cons of Readability

  • More limited in scope compared to Jsoup's general-purpose HTML parsing
  • May require additional processing for tasks beyond content extraction
  • Less frequent updates and potentially slower development cycle

Code Comparison

Jsoup example:

Document doc = Jsoup.connect("https://example.com").get();
Elements paragraphs = doc.select("p");
for (Element paragraph : paragraphs) {
    System.out.println(paragraph.text());
}

Readability example:

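const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');

// Sample input; in practice this would be the fetched page source
const html = '<html><body><article><h1>Title</h1><p>Some article text.</p></article></body></html>';
const doc = new JSDOM(html);
const reader = new Readability(doc.window.document);
const article = reader.parse(); // null when no article content is found
if (article) console.log(article.textContent);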

Summary

Jsoup is a versatile Java HTML parser with a wide range of capabilities, while Readability is a specialized JavaScript library for extracting readable content from web pages. Jsoup offers more general-purpose functionality, making it suitable for various HTML manipulation tasks. Readability, on the other hand, excels at isolating the main content of a webpage, providing a more focused solution for content extraction and readability improvement.

Pros of Parser

  • Supports multiple languages and formats (HTML, JSON, XML)
  • Provides a unified API for parsing different content types
  • Offers additional features like content extraction and metadata parsing

Cons of Parser

  • Less specialized for HTML parsing compared to jsoup
  • May have a steeper learning curve due to its broader scope
  • Potentially slower performance for simple HTML parsing tasks

Code Comparison

Parser:

const Parser = require('@postlight/parser');

Parser.parse('https://example.com').then(result => {
  console.log(result.title);
  console.log(result.content);
});

jsoup:

Document doc = Jsoup.connect("https://example.com").get();
String title = doc.title();
String body = doc.body().text();
System.out.println(title);
System.out.println(body);

Summary

Parser offers a more versatile solution for parsing various content types, while jsoup specializes in HTML parsing with a simpler API. Parser may be preferred for projects requiring multi-format parsing and content extraction, whereas jsoup is ideal for straightforward HTML manipulation tasks. The choice between the two depends on the specific requirements of your project and the primary content type you'll be working with.

Pros of html-minifier

  • Focuses on HTML minification and optimization
  • Offers extensive configuration options for fine-tuned control
  • Actively maintained with frequent updates

Cons of html-minifier

  • Limited to HTML processing and minification
  • Requires Node.js environment for execution
  • Steeper learning curve due to numerous configuration options

Code Comparison

html-minifier (JavaScript):

const minify = require('html-minifier').minify;
const result = minify('<p>  Hello,   World!  </p>', {
  collapseWhitespace: true,
  removeComments: true
});

jsoup (Java):

Document doc = Jsoup.parse("<p>  Hello,   World!  </p>");
doc.outputSettings().prettyPrint(false);
String result = doc.body().html();

Key Differences

  • jsoup is a Java library for HTML parsing and manipulation, while html-minifier is a JavaScript tool for HTML minification.
  • jsoup provides a broader range of HTML processing capabilities, including parsing, traversing, and modifying HTML documents.
  • html-minifier is specifically designed for reducing HTML file size and optimizing performance.
  • jsoup is more suitable for server-side Java applications, while html-minifier is typically used in front-end build processes or Node.js environments.

Both tools have their strengths and are suited for different use cases, depending on the programming language, environment, and specific requirements of the project.

Pros of htmlparser2

  • Written in JavaScript, making it ideal for Node.js and browser environments
  • Faster parsing speed, especially for large HTML documents
  • Supports streaming, allowing for processing of partial chunks of HTML

Cons of htmlparser2

  • Less feature-rich compared to jsoup's DOM manipulation capabilities
  • Limited CSS selector support out of the box
  • Lacks built-in methods for common tasks like form handling and URL manipulation

Code Comparison

jsoup:

Document doc = Jsoup.connect("https://example.com").get();
Elements links = doc.select("a[href]");
for (Element link : links) {
    System.out.println(link.attr("href"));
}

htmlparser2:

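const { Parser } = require("htmlparser2");

// Sample input; in practice this would be the page source
const html = '<a href="https://example.com">Link</a>';
const parser = new Parser({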
  onopentag: (name, attribs) => {
    if (name === "a" && attribs.href) {
      console.log(attribs.href);
    }
  }
});
parser.write(html);
parser.end();

Both libraries offer HTML parsing capabilities, but jsoup provides a more comprehensive set of features for DOM manipulation and traversal. htmlparser2 excels in performance and is better suited for JavaScript-based projects, while jsoup is the go-to choice for Java developers needing robust HTML handling.

README

jsoup: Java HTML Parser

jsoup is a Java library that makes it easy to work with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS, and xpath selectors.
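
To illustrate the two selector styles mentioned above, the sketch below runs the same query once as a CSS selector and once as an XPath expression. It assumes a jsoup version that provides selectXpath (added around 1.14.3, to the best of my knowledge); the class name and sample markup are hypothetical.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class XPathExample {
    // Hypothetical sample markup: two links with an href, one without.
    static final String HTML =
        "<div><a href='/one'>One</a><a>No link</a><a href='/two'>Two</a></div>";

    /** Collects href values using a CSS attribute selector. */
    static List<String> hrefsByCss(Document doc) {
        return doc.select("a[href]").eachAttr("href");
    }

    /** Collects the same values using an XPath 1.0 expression. */
    static List<String> hrefsByXPath(Document doc) {
        return doc.selectXpath("//a[@href]").eachAttr("href");
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(hrefsByCss(doc));
        System.out.println(hrefsByXPath(doc));
    }
}
```

Both calls return an Elements collection, so the follow-up code is the same whichever selector language is used.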

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers.

  • scrape and parse HTML from a URL, file, or string
  • find and extract data, using DOM traversal or CSS selectors
  • manipulate the HTML elements, attributes, and text
  • clean user-submitted content against a safe-list, to prevent XSS attacks
  • output tidy HTML

jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree.
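
As a small illustration of tag-soup handling, the sketch below parses a fragment with no closing tags and no html, head, or body elements. The class name and input string are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TagSoup {
    /** Parses deliberately sloppy markup: unclosed <p> and <b>, no <html> or <body>. */
    static Document parseSoup() {
        return Jsoup.parse("<p>One<p>Two<b>bold");
    }

    public static void main(String[] args) {
        Document doc = parseSoup();
        // The second <p> implicitly closes the first, as in a browser
        System.out.println(doc.select("p").size());
        // The unclosed <b> ends up inside the second paragraph
        System.out.println(doc.select("p > b").text());
        // The missing html/head/body structure is filled in
        System.out.println(doc.html());
    }
}
```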

See jsoup.org for downloads and the full API documentation.

Example

Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the In the News section into a list of Elements:

Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
log(doc.title());
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
  log("%s\n\t%s", 
    headline.attr("title"), headline.absUrl("href"));
}
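
The absUrl call in the example above resolves a relative link against the document's base URI. A minimal offline sketch of that behavior follows; the class name and the sample link are hypothetical, and the base URI is passed explicitly because the HTML is parsed from a string, not fetched.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsUrlExample {
    // Hypothetical markup containing one relative link.
    static final String HTML = "<a href='/wiki/Java' title='Java'>Java</a>";

    /** Parses the sample with the given base URI and returns the absolute link target. */
    static String absoluteHref(String baseUri) {
        Document doc = Jsoup.parse(HTML, baseUri);
        Element link = doc.selectFirst("a");
        return link.absUrl("href");
    }

    public static void main(String[] args) {
        // With a base URI, the relative href is resolved against it
        System.out.println(absoluteHref("https://en.wikipedia.org/"));
        // With an empty base URI, absUrl cannot resolve and returns an empty string
        System.out.println(absoluteHref("").isEmpty());
    }
}
```

When a page is fetched with Jsoup.connect, the base URI is set from the request URL, so no explicit argument is needed.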

Online sample, full source.

Open source

jsoup is an open source project distributed under the liberal MIT license. The source code is available on GitHub.

Getting started

  1. Download the latest jsoup jar (or add it to your Maven/Gradle build)
  2. Read the cookbook
  3. Enjoy!

Android support

When used in Android projects, core library desugaring with the NIO specification should be enabled to support Java 8+ features.

Development and support

If you have any questions on how to use jsoup, or have ideas for future development, please get in touch via jsoup Discussions.

If you find any issues, please file a bug after checking for duplicates.

The colophon talks about the history of and tools used to build jsoup.

Status

jsoup is in general, stable release.