jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.

11,196

2,239


Top Related Projects

cheerio

29,388

The fast, flexible, and elegant library for parsing and manipulating HTML and XML.

readability

9,825

A standalone version of the readability lib

parser

5,598

📜 Extract meaningful content from the chaos of a web page

html-minifier

5,013

Javascript-based HTML compressor/minifier (with Node.js support)

htmlparser2

4,596

The fast & forgiving HTML and XML parser

Quick Overview

jsoup is a Java library for parsing, querying, and manipulating HTML. It parses real-world HTML into a DOM tree and exposes an API for navigating, searching, and modifying that tree with DOM methods and CSS selectors.

Pros

Powerful HTML Parsing: jsoup can parse HTML from a URL, file, or string, and provides a comprehensive set of methods for navigating and querying the document structure.
Flexible Querying: jsoup supports CSS selectors with a jQuery-like syntax, making it easy to target and extract specific elements from the HTML.
Seamless HTML Manipulation: jsoup lets developers modify the parsed document, including adding, removing, and updating elements and attributes.
Robust Error Handling: jsoup is designed to handle malformed or invalid HTML gracefully, producing a sensible parse tree even from tag soup.
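
The querying and manipulation points above can be illustrated with a minimal sketch. It parses an inline HTML string, so no network access is needed. The class name SelectorSketch, its method names, and the sample markup are illustrative assumptions and not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SelectorSketch {

    // Hypothetical sample markup; the <li> tags are left unclosed on purpose.
    static final String HTML =
        "<ul id='menu'><li><a href='/home'>Home</a><li><a href='/docs' class='ext'>Docs</a></ul>";

    // Collect the href attribute of every element matched by a CSS selector.
    static List<String> hrefs(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element link : doc.select(cssQuery)) {
            result.add(link.attr("href"));
        }
        return result;
    }

    // Set rel="nofollow" on every link in the document and return how many were changed.
    static int markNoFollow(Document doc) {
        int count = 0;
        for (Element link : doc.select("a[href]")) {
            link.attr("rel", "nofollow");
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(hrefs(HTML, "#menu a[href]"));
        System.out.println(hrefs(HTML, "a.ext"));

        Document doc = Jsoup.parse(HTML);
        System.out.println(markNoFollow(doc) + " links updated");
        System.out.println(doc.body().html());
    }
}
```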

Cons

Limited to Java: jsoup is a Java library, so it is only directly usable from Java and other JVM languages.
Performance Overhead: jsoup is generally fast, but by default it builds the whole document as an in-memory DOM, so parsing and manipulating very large HTML documents can be costly in time and memory, especially on resource-constrained systems.
Lack of Asynchronous Support: jsoup's API is synchronous (for example, Jsoup.connect(...).get() blocks until the response arrives), so asynchronous workflows need to run it on a separate thread or executor.
Dependency on External Libraries: the core jsoup jar is self-contained and does not require additional libraries at runtime for parsing or CSS selection. XPath queries are available through the built-in selectXpath method in version 1.14.3 and later, so extra dependencies are rarely a concern in practice.
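
One common way to work around the synchronous API is to run the blocking call on an executor and expose the result as a future. The sketch below is a minimal illustration under that assumption. AsyncParseSketch and titleAsync are hypothetical names, and it parses an inline string rather than fetching a URL so it can run offline.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncParseSketch {

    // Run the blocking parse on the supplied executor and expose the page title as a future.
    static CompletableFuture<String> titleAsync(String html, ExecutorService executor) {
        return CompletableFuture
            .supplyAsync(() -> Jsoup.parse(html), executor)
            .thenApply(Document::title);
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            // join() blocks here only to print the result; real callers would chain further stages.
            String title = titleAsync("<title>Hello</title><p>Body</p>", executor).join();
            System.out.println(title);
        } finally {
            executor.shutdown();
        }
    }
}
```

The same wrapper shape should apply to Jsoup.connect(url).get(), with the caveat that get() throws a checked IOException that has to be handled inside the lambda.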

Code Examples

Here are a few examples of how to use jsoup:

Parsing HTML from a URL:

Document doc = Jsoup.connect("https://www.example.com").get();
String title = doc.title();
Elements paragraphs = doc.select("p");

Modifying HTML Elements:

Document doc = Jsoup.parse("<p>Hello, <b>World</b>!</p>");
Element paragraph = doc.select("p").first();
// <b> is the paragraph's only child element, so it is at index 0 (child(1) would throw)
paragraph.child(0).text("JSoup");
String modifiedHtml = doc.html();

Extracting Data from HTML Tables:

Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)").get();
// first() returns a single Element (or null if nothing matches), not Elements
Element table = doc.select("table.wikitable").first();
if (table != null) {
    for (Element row : table.select("tr")) {
        Elements cells = row.select("td");
        // The expected column count and indexes depend on the page's current layout
        if (cells.size() == 5) {
            String country = cells.get(0).text();
            long population = Long.parseLong(cells.get(2).text().replace(",", ""));
            // Process the country and population data
        }
    }
}
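
Because the example above depends on a live page whose layout can change, the same row-by-row technique can be sketched against inline HTML. The table contents below are made-up sample data, and TableSketch and populations are hypothetical names used only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.LinkedHashMap;
import java.util.Map;

public class TableSketch {

    // Made-up sample data standing in for a real page.
    static final String HTML =
        "<table class='wikitable'>"
      + "<tr><th>Country</th><th>Population</th></tr>"
      + "<tr><td>Atlantis</td><td>1,234,567</td></tr>"
      + "<tr><td>El Dorado</td><td>89,000</td></tr>"
      + "</table>";

    // Map each country name (first cell) to its population (second cell).
    static Map<String, Long> populations(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, Long> result = new LinkedHashMap<>();
        Element table = doc.selectFirst("table.wikitable");
        if (table == null) {
            return result; // no matching table on the page
        }
        for (Element row : table.select("tr")) {
            Elements cells = row.select("td");
            if (cells.size() < 2) {
                continue; // header rows use <th>, so they have no <td> cells
            }
            String digits = cells.get(1).text().replace(",", "");
            result.put(cells.get(0).text(), Long.parseLong(digits));
        }
        return result;
    }

    public static void main(String[] args) {
        populations(HTML).forEach((country, population) ->
            System.out.println(country + ": " + population));
    }
}
```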

Cleaning and Sanitizing HTML:

// Safelist replaces the deprecated Whitelist class (jsoup 1.14.1 and later)
String dirtyHtml = "<p>Hello, <b>World</b>! <script>alert('XSS attack!');</script></p>";
String cleanHtml = Jsoup.clean(dirtyHtml, Safelist.basic());
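
As a slightly fuller sketch of the sanitizing step, the class below wraps Jsoup.clean and Jsoup.isValid with the basic safelist. It assumes jsoup 1.14.1 or later, where Safelist is available. CleanSketch, sanitize, and alreadySafe are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanSketch {

    static final String DIRTY =
        "<p>Hello, <b>World</b>! <script>alert('XSS attack!');</script></p>";

    // Keep only the tags and attributes permitted by the basic safelist.
    static String sanitize(String html) {
        return Jsoup.clean(html, Safelist.basic());
    }

    // Report whether the input would pass through the basic safelist unchanged.
    static boolean alreadySafe(String html) {
        return Jsoup.isValid(html, Safelist.basic());
    }

    public static void main(String[] args) {
        System.out.println(alreadySafe(DIRTY));
        System.out.println(sanitize(DIRTY));
    }
}
```

Checking with isValid first can be useful when an application wants to reject unsafe input outright instead of silently stripping it.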

Getting Started

To get started with jsoup, add the following Maven dependency to your Java project (the version shown is an example; check jsoup.org for the current release):

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
</dependency>

Once the library is on your classpath, you can use jsoup to parse, manipulate, and extract data from HTML documents. The jsoup documentation covers the API in detail, with examples and explanations of each feature.

Competitor Comparisons

cheerio

29,388

The fast, flexible, and elegant library for parsing and manipulating HTML and XML.

Pros of Cheerio

JavaScript-based, allowing seamless integration with Node.js projects
Lightweight and fast, with minimal overhead
Familiar jQuery-like syntax for DOM manipulation

Cons of Cheerio

Limited to server-side use, not suitable for browser environments
Lacks support for some advanced CSS selectors and pseudo-classes
May struggle with complex, dynamic web pages that rely heavily on JavaScript

Code Comparison

Jsoup (Java):

Document doc = Jsoup.connect("https://example.com").get();
Elements links = doc.select("a[href]");
for (Element link : links) {
    System.out.println(link.attr("href"));
}

Cheerio (JavaScript):

const cheerio = require('cheerio');
const $ = cheerio.load('<html><body><a href="https://example.com">Link</a></body></html>');
$('a[href]').each((i, elem) => {
    console.log($(elem).attr('href'));
});

Both libraries offer similar functionality for parsing and manipulating HTML, but Jsoup is designed for Java applications, while Cheerio is tailored for JavaScript and Node.js environments. Jsoup provides more robust handling of malformed HTML and better support for CSS selectors, while Cheerio offers a lighter footprint and familiar jQuery-like syntax for web developers already comfortable with JavaScript.

readability

9,825

A standalone version of the readability lib

Pros of Readability

Focuses specifically on extracting readable content from web pages
Provides a more refined and targeted approach for content extraction
Offers language-specific adaptations for better results in different locales

Cons of Readability

More limited in scope compared to Jsoup's general-purpose HTML parsing
May require additional processing for tasks beyond content extraction
Less frequent updates and potentially slower development cycle

Code Comparison

Jsoup example:

Document doc = Jsoup.connect("https://example.com").get();
Elements paragraphs = doc.select("p");
for (Element paragraph : paragraphs) {
    System.out.println(paragraph.text());
}

Readability example:

const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');

const doc = new JSDOM(html);
const reader = new Readability(doc.window.document);
const article = reader.parse();
console.log(article.textContent);

Summary

Jsoup is a versatile Java HTML parser with a wide range of capabilities, while Readability is a specialized JavaScript library for extracting readable content from web pages. Jsoup offers more general-purpose functionality, making it suitable for various HTML manipulation tasks. Readability, on the other hand, excels at isolating the main content of a webpage, providing a more focused solution for content extraction and readability improvement.

parser

5,598

📜 Extract meaningful content from the chaos of a web page

Pros of Parser

Supports multiple languages and formats (HTML, JSON, XML)
Provides a unified API for parsing different content types
Offers additional features like content extraction and metadata parsing

Cons of Parser

Less specialized for HTML parsing compared to jsoup
May have a steeper learning curve due to its broader scope
Potentially slower performance for simple HTML parsing tasks

Code Comparison

Parser:

const { parse } = require('@postlight/parser');

parse('https://example.com').then(result => {
  console.log(result.title);
  console.log(result.content);
});

jsoup:

Document doc = Jsoup.connect("https://example.com").get();
String title = doc.title();
String body = doc.body().text();
System.out.println(title);
System.out.println(body);

Summary

Parser offers a more versatile solution for parsing various content types, while jsoup specializes in HTML parsing with a simpler API. Parser may be preferred for projects requiring multi-format parsing and content extraction, whereas jsoup is ideal for straightforward HTML manipulation tasks. The choice between the two depends on the specific requirements of your project and the primary content type you'll be working with.

html-minifier

5,013

Javascript-based HTML compressor/minifier (with Node.js support)

Pros of html-minifier

Focuses on HTML minification and optimization
Offers extensive configuration options for fine-tuned control
Actively maintained with frequent updates

Cons of html-minifier

Limited to HTML processing and minification
Requires Node.js environment for execution
Steeper learning curve due to numerous configuration options

Code Comparison

html-minifier (JavaScript):

const minify = require('html-minifier').minify;
const result = minify('<p>  Hello,   World!  </p>', {
  collapseWhitespace: true,
  removeComments: true
});

jsoup (Java):

Document doc = Jsoup.parse("<p>  Hello,   World!  </p>");
doc.outputSettings().prettyPrint(false);
String result = doc.body().html();

Key Differences

jsoup is a Java library for HTML parsing and manipulation, while html-minifier is a JavaScript tool for HTML minification.
jsoup provides a broader range of HTML processing capabilities, including parsing, traversing, and modifying HTML documents.
html-minifier is specifically designed for reducing HTML file size and optimizing performance.
jsoup is more suitable for server-side Java applications, while html-minifier is typically used in front-end build processes or Node.js environments.

Both tools have their strengths and are suited for different use cases, depending on the programming language, environment, and specific requirements of the project.

htmlparser2

4,596

The fast & forgiving HTML and XML parser

Pros of htmlparser2

Written in JavaScript, making it ideal for Node.js and browser environments
Faster parsing speed, especially for large HTML documents
Supports streaming, allowing for processing of partial chunks of HTML

Cons of htmlparser2

Less feature-rich compared to jsoup's DOM manipulation capabilities
Limited CSS selector support out of the box
Lacks built-in methods for common tasks like form handling and URL manipulation

Code Comparison

jsoup:

Document doc = Jsoup.connect("https://example.com").get();
Elements links = doc.select("a[href]");
for (Element link : links) {
    System.out.println(link.attr("href"));
}

htmlparser2:

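const { Parser } = require("htmlparser2");

// `html` is assumed to be a string of HTML obtained elsewhere
const parser = new Parser({
  onopentag: (name, attribs) => {
    if (name === "a" && attribs.href) {
      console.log(attribs.href);
    }
  }
});
parser.write(html);
parser.end();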

Both libraries offer HTML parsing capabilities, but jsoup provides a more comprehensive set of features for DOM manipulation and traversal. htmlparser2 excels in performance and is better suited for JavaScript-based projects, while jsoup is the go-to choice for Java developers needing robust HTML handling.


README

jsoup: Java HTML Parser

jsoup is a Java library that makes it easy to work with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS, and xpath selectors.

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers.

scrape and parse HTML from a URL, file, or string
find and extract data, using DOM traversal or CSS selectors
manipulate the HTML elements, attributes, and text
clean user-submitted content against a safe-list, to prevent XSS attacks
output tidy HTML
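
To illustrate the "find and extract data" item, the sketch below runs the same query once as a CSS selector and once as an XPath expression. It assumes jsoup 1.14.3 or later, where selectXpath is available. XPathSketch and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;

import java.util.List;

public class XPathSketch {

    // Hypothetical markup: two links with an href and one without.
    static final String HTML =
        "<div><a href='/a'>A</a><a>no href</a><a href='/b'>B</a></div>";

    // Text of every <a> that has an href attribute, selected with CSS.
    static List<String> viaCss(String html) {
        return Jsoup.parse(html).select("a[href]").eachText();
    }

    // The same selection expressed as an XPath query.
    static List<String> viaXpath(String html) {
        return Jsoup.parse(html).selectXpath("//a[@href]").eachText();
    }

    public static void main(String[] args) {
        System.out.println(viaCss(HTML));
        System.out.println(viaXpath(HTML));
    }
}
```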

jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree.
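
A small sketch of what "a sensible parse tree" can look like for tag soup: the input below has unclosed p and b tags, and the parser closes them as the HTML parsing rules require. TagSoupSketch and its method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TagSoupSketch {

    // Invalid input: neither <p> is closed, and <b> is never closed either.
    static final String SOUP = "<p>One<p>Two <b>bold";

    // Number of <p> elements the parser recovers from the input.
    static int paragraphCount(String html) {
        return Jsoup.parse(html).select("p").size();
    }

    // Text inside the <b> elements after the parser has closed them.
    static String boldText(String html) {
        return Jsoup.parse(html).select("b").text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SOUP);
        System.out.println(paragraphCount(SOUP) + " paragraphs");
        System.out.println(doc.body().html()); // normalized markup with closing tags added
    }
}
```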

See jsoup.org for downloads and the full API documentation.

Example

Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the In the News section into a list of Elements:

Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
log(doc.title());
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
  log("%s\n\t%s", 
    headline.attr("title"), headline.absUrl("href"));
}
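
The absUrl call in the example above resolves a relative href against the document's base URI. The offline sketch below shows that behavior by passing a base URI to Jsoup.parse explicitly. AbsUrlSketch, resolve, and the sample link are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsUrlSketch {

    // Resolve the first link's href against the given base URI.
    // Returns an empty string when there is no link or no absolute URL can be built.
    static String resolve(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href='/wiki/Main_Page'>Main</a>";
        System.out.println(resolve(html, "https://en.wikipedia.org/"));
        System.out.println(resolve(html, "").isEmpty()); // no base URI, so nothing to resolve against
    }
}
```

When a document is fetched with Jsoup.connect(url).get(), the base URI is set from the fetched URL, which is why the example above does not pass one explicitly.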

Online sample, full source.

Open source

jsoup is an open source project distributed under the liberal MIT license. The source code is available on GitHub.

Getting started

Download the latest jsoup jar (or add it to your Maven/Gradle build)
Read the cookbook
Enjoy!

Android support

When used in Android projects, core library desugaring with the NIO specification should be enabled to support Java 8+ features.

Development and support

If you have any questions on how to use jsoup, or have ideas for future development, please get in touch via jsoup Discussions.

If you find any issues, please file a bug after checking for duplicates.

The colophon talks about the history of and tools used to build jsoup.

Status

jsoup is in general release and is stable.

Author

jsoup was created and is maintained by Jonathan Hedley, its primary author.

jsoup is an open-source project, and many contributors have helped improve it over the years. You can see their contributions and join the development on GitHub.

Citing jsoup

If you use jsoup in research or technical documentation, you can cite it as:

Jonathan Hedley & jsoup contributors. jsoup: Java HTML Parser (2009–present). Available at: https://jsoup.org

@misc{jsoup,
  author = {Jonathan Hedley and jsoup contributors},
  title = {jsoup: Java HTML Parser},
  year = {2025},
  url = {https://jsoup.org}
}
