jsoup
jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
Top Related Projects
The fast, flexible, and elegant library for parsing and manipulating HTML and XML.
A standalone version of the readability lib
📜 Extract meaningful content from the chaos of a web page
Javascript-based HTML compressor/minifier (with Node.js support)
The fast & forgiving HTML and XML parser
Quick Overview
jsoup is a Java library for parsing, querying, and manipulating HTML. It parses a document into a DOM tree and exposes that tree through DOM-style traversal methods and CSS selectors, giving developers a familiar API for navigating, searching, and modifying the document structure.
Pros
- Powerful HTML Parsing: jsoup can parse HTML from a URL, file, or string, and provides a comprehensive set of methods for navigating and querying the document structure.
- Flexible Querying: jsoup supports a wide range of CSS and jQuery-like selectors, making it easy to target and extract specific elements from the HTML.
- Seamless HTML Manipulation: jsoup lets developers modify the parsed document by adding, removing, and updating elements, attributes, and text.
- Robust Error Handling: jsoup is designed to handle malformed or invalid HTML gracefully, which makes it a reliable choice for real-world web content.
Cons
- Limited to Java: jsoup is a Java library, so projects in languages that do not run on the JVM cannot use it directly.
- Performance Overhead: jsoup is generally fast, but by default it builds the whole document as an in-memory DOM, so parsing and manipulating very large HTML documents costs time and memory, especially on resource-constrained systems.
- Lack of Asynchronous Support: jsoup's API is synchronous, so callers that need non-blocking fetching or parsing must arrange their own threading.
- Basic XPath Support: jsoup has no required external dependencies, and XPath queries go through the built-in selectXpath method (available since 1.14.3) rather than a separate library. That support targets XPath 1.0 by default, so heavy XPath work may still call for a dedicated XML toolkit.
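The synchronous-API point in the list above need not be a blocker, because a blocking call can be moved onto a worker thread with the JDK's own CompletableFuture. The sketch below shows one way to do that. It is not a jsoup feature: BlockingFetchSketch and fetchAsync are hypothetical names, and the supplier is a stand-in for a real blocking call such as Jsoup.connect(url).get().title().

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

class BlockingFetchSketch {

    // Runs a blocking task on the given executor and returns a future for its result.
    static <T> CompletableFuture<T> fetchAsync(Supplier<T> blockingTask, ExecutorService pool) {
        return CompletableFuture.supplyAsync(blockingTask, pool);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Stand-in for a blocking jsoup fetch (assumption for illustration).
            // A real supplier would need to wrap the checked IOException,
            // for example in an UncheckedIOException.
            Supplier<String> task = () -> "Example Domain";

            // Further processing is chained without blocking the calling thread.
            CompletableFuture<Integer> titleLength =
                    fetchAsync(task, pool).thenApply(String::length);

            // join() blocks here only so the demo can print the result.
            System.out.println(titleLength.join());
        } finally {
            pool.shutdown();
        }
    }
}
```

A dedicated, bounded executor is used rather than the common pool so that slow network calls cannot starve other asynchronous work in the application.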
Code Examples
Here are a few examples of how to use jsoup:
- Parsing HTML from a URL:
Document doc = Jsoup.connect("https://www.example.com").get();
String title = doc.title();
Elements paragraphs = doc.select("p");
- Modifying HTML Elements:
Document doc = Jsoup.parse("<p>Hello, <b>World</b>!</p>");
Element paragraph = doc.select("p").first();
// child(0) is the <b> element, the paragraph's only child element
paragraph.child(0).text("jsoup");
String modifiedHtml = doc.html();
- Extracting Data from HTML Tables:
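Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)").get();
// first() returns a single Element (or null if nothing matches), not Elements
Element table = doc.select("table.wikitable").first();
if (table != null) {
    for (Element row : table.select("tr")) {
        Elements cells = row.select("td");
        // The column count and indexes depend on the page's current layout
        if (cells.size() == 5) {
            String country = cells.get(0).text();
            // long, because a world-total row would overflow int
            long population = Long.parseLong(cells.get(2).text().replaceAll(",", ""));
            // Process the country and population data
        }
    }
}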
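One detail in the table example deserves care: population cells carry thousands separators and sometimes footnote markers, and the largest values do not fit in an int. The following is a minimal sketch of a more tolerant parser, assuming cell text shaped like "1,425,671,352" or "8,045,311,447[a]". Both inputs are made up for illustration, and PopulationCell and parsePopulation are hypothetical names.

```java
import java.util.OptionalLong;

class PopulationCell {

    // Parses text such as "1,425,671,352" into a long; returns empty if it is not a number.
    static OptionalLong parsePopulation(String cellText) {
        String digits = cellText
                .replaceAll("\\[[^\\]]*\\]", "") // drop footnote markers like [a] or [1]
                .replaceAll("[,\\s]", "");       // drop thousands separators and whitespace
        // Up to 18 digits always fits in a long, so parseLong cannot overflow here.
        if (!digits.matches("\\d{1,18}")) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(Long.parseLong(digits));
    }

    public static void main(String[] args) {
        System.out.println(parsePopulation("1,425,671,352"));
        System.out.println(parsePopulation("8,045,311,447[a]"));
        System.out.println(parsePopulation("n/a"));
    }
}
```

Returning an OptionalLong instead of throwing lets the calling loop skip header or malformed rows without a try/catch around every cell.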
- Cleaning and Sanitizing HTML:
String dirtyHtml = "<p>Hello, <b>World</b>! <script>alert('XSS attack!');</script></p>";
// Safelist replaces the older Whitelist class, which has been removed from recent jsoup releases
String cleanHtml = Jsoup.clean(dirtyHtml, Safelist.basic());
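To make the effect of cleaning easier to check, the sketch below wraps the call in a small helper and also asks jsoup whether the input already conforms to the safe-list. It assumes a jsoup version that provides org.jsoup.safety.Safelist; CleanSketch and sanitize are hypothetical names used only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class CleanSketch {

    // Returns the input with everything outside the basic safe-list stripped.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String dirty = "<p>Hello, <b>World</b>! <script>alert('XSS attack!');</script></p>";
        // isValid reports whether the input already conforms to the safe-list.
        System.out.println(Jsoup.isValid(dirty, Safelist.basic()));
        // The cleaned output keeps the <p> and <b> markup and drops the <script> element.
        System.out.println(sanitize(dirty));
    }
}
```

Safelist.basic() permits simple text formatting tags only; stricter or looser presets can be swapped in depending on how much markup user content is allowed to carry.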
Getting Started
To get started with jsoup in a Maven project, add the following dependency to your pom.xml:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
</dependency>
The version shown may not be the latest; see jsoup.org for current releases. Once the library is on the classpath, you can use jsoup to parse, manipulate, and extract data from HTML documents. The jsoup documentation covers the API in detail, with examples and explanations of each feature.
Competitor Comparisons
The fast, flexible, and elegant library for parsing and manipulating HTML and XML.
Pros of Cheerio
- JavaScript-based, allowing seamless integration with Node.js projects
- Lightweight and fast, with minimal overhead
- Familiar jQuery-like syntax for DOM manipulation
Cons of Cheerio
- Limited to server-side use, not suitable for browser environments
- Lacks support for some advanced CSS selectors and pseudo-classes
- May struggle with complex, dynamic web pages that rely heavily on JavaScript
Code Comparison
Jsoup (Java):
Document doc = Jsoup.connect("https://example.com").get();
Elements links = doc.select("a[href]");
for (Element link : links) {
    System.out.println(link.attr("href"));
}
Cheerio (JavaScript):
const cheerio = require('cheerio');
const $ = cheerio.load('<html><body><a href="https://example.com">Link</a></body></html>');
$('a[href]').each((i, elem) => {
    console.log($(elem).attr('href'));
});
Both libraries offer similar functionality for parsing and manipulating HTML; jsoup targets Java applications, while Cheerio targets JavaScript and Node.js environments. jsoup bundles URL fetching and safe-list-based HTML cleaning alongside parsing, while Cheerio offers a lighter footprint and a jQuery-like syntax that is familiar to developers already comfortable with JavaScript.
A standalone version of the readability lib
Pros of Readability
- Focuses specifically on extracting readable content from web pages
- Provides a more refined and targeted approach for content extraction
- Offers language-specific adaptations for better results in different locales
Cons of Readability
- More limited in scope compared to jsoup's general-purpose HTML parsing
- May require additional processing for tasks beyond content extraction
- Less frequent updates and potentially slower development cycle
Code Comparison
Jsoup example:
Document doc = Jsoup.connect("https://example.com").get();
Elements paragraphs = doc.select("p");
for (Element paragraph : paragraphs) {
    System.out.println(paragraph.text());
}
Readability example:
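const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');
// html holds the page markup as a string (placeholder content here)
const html = '<html><body><article><p>Example article text.</p></article></body></html>';
const doc = new JSDOM(html);
const reader = new Readability(doc.window.document);
const article = reader.parse();
// parse() can return null when no article is detected
if (article) {
    console.log(article.textContent);
}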
Summary
jsoup is a versatile Java HTML parser with a wide range of capabilities, while Readability is a specialized JavaScript library for extracting readable content from web pages. jsoup offers general-purpose functionality that suits many HTML manipulation tasks. Readability, by contrast, excels at isolating the main content of a web page, which makes it a more focused solution for content extraction.
📜 Extract meaningful content from the chaos of a web page
Pros of Parser
- Supports multiple languages and formats (HTML, JSON, XML)
- Provides a unified API for parsing different content types
- Offers additional features like content extraction and metadata parsing
Cons of Parser
- Less specialized for HTML parsing compared to jsoup
- May have a steeper learning curve due to its broader scope
- Potentially slower performance for simple HTML parsing tasks
Code Comparison
Parser:
const Parser = require('@postlight/parser');
Parser.parse('https://example.com').then(result => {
    console.log(result.title);
    console.log(result.content);
});
jsoup:
Document doc = Jsoup.connect("https://example.com").get();
String title = doc.title();
String body = doc.body().text();
System.out.println(title);
System.out.println(body);
Summary
Parser offers a more versatile solution for parsing various content types, while jsoup specializes in HTML parsing with a simpler API. Parser may be preferred for projects requiring multi-format parsing and content extraction, whereas jsoup is ideal for straightforward HTML manipulation tasks. The choice between the two depends on the specific requirements of your project and the primary content type you'll be working with.
Javascript-based HTML compressor/minifier (with Node.js support)
Pros of html-minifier
- Focuses on HTML minification and optimization
- Offers extensive configuration options for fine-tuned control
- Actively maintained with frequent updates
Cons of html-minifier
- Limited to HTML processing and minification
- Requires Node.js environment for execution
- Steeper learning curve due to numerous configuration options
Code Comparison
html-minifier (JavaScript):
const minify = require('html-minifier').minify;
const result = minify('<p> Hello, World! </p>', {
    collapseWhitespace: true,
    removeComments: true
});
jsoup (Java):
Document doc = Jsoup.parse("<p> Hello, World! </p>");
doc.outputSettings().prettyPrint(false);
String result = doc.body().html();
Key Differences
- jsoup is a Java library for HTML parsing and manipulation, while html-minifier is a JavaScript tool for HTML minification.
- jsoup provides a broader range of HTML processing capabilities, including parsing, traversing, and modifying HTML documents.
- html-minifier is specifically designed for reducing HTML file size and optimizing performance.
- jsoup is more suitable for server-side Java applications, while html-minifier is typically used in front-end build processes or Node.js environments.
Both tools have their strengths and are suited for different use cases, depending on the programming language, environment, and specific requirements of the project.
The fast & forgiving HTML and XML parser
Pros of htmlparser2
- Written in JavaScript, making it ideal for Node.js and browser environments
- Faster parsing speed, especially for large HTML documents
- Supports streaming, allowing for processing of partial chunks of HTML
Cons of htmlparser2
- Less feature-rich compared to jsoup's DOM manipulation capabilities
- Limited CSS selector support out of the box
- Lacks built-in methods for common tasks like form handling and URL manipulation
Code Comparison
jsoup:
Document doc = Jsoup.connect("https://example.com").get();
Elements links = doc.select("a[href]");
for (Element link : links) {
    System.out.println(link.attr("href"));
}
htmlparser2:
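const { Parser } = require('htmlparser2');
// html holds the markup to scan (placeholder content here)
const html = '<a href="https://example.com">Link</a>';
const parser = new Parser({
    onopentag: (name, attribs) => {
        if (name === "a" && attribs.href) {
            console.log(attribs.href);
        }
    }
});
parser.write(html);
parser.end();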
Both libraries offer HTML parsing capabilities, but jsoup provides a more comprehensive set of features for DOM manipulation and traversal. htmlparser2 excels in performance and is better suited for JavaScript-based projects, while jsoup is the go-to choice for Java developers needing robust HTML handling.
README
jsoup: Java HTML Parser
jsoup is a Java library that makes it easy to work with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS, and xpath selectors.
jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers.
- scrape and parse HTML from a URL, file, or string
- find and extract data, using DOM traversal or CSS selectors
- manipulate the HTML elements, attributes, and text
- clean user-submitted content against a safe-list, to prevent XSS attacks
- output tidy HTML
jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree.
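The tag-soup behaviour described above can be seen with a small offline sketch. It parses deliberately sloppy markup from a string and queries the result with both a CSS selector and an XPath expression. The markup is invented for illustration, TagSoupSketch and parseSoup are hypothetical names, and selectXpath assumes jsoup 1.14.3 or later.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class TagSoupSketch {

    // Parses deliberately sloppy markup and returns the resulting DOM.
    static Document parseSoup() {
        // Neither <p> is closed and there is no <html>, <head>, or <body>.
        return Jsoup.parse("<title>Soup</title><p>One<p>Two <a href='/x'>link</a>");
    }

    public static void main(String[] args) {
        Document doc = parseSoup();
        Elements paragraphs = doc.select("p");   // CSS selector
        Elements links = doc.selectXpath("//a"); // XPath selector
        System.out.println(doc.title());
        System.out.println(paragraphs.size());
        System.out.println(links.attr("href"));
        // The serialized document shows the structure jsoup filled in.
        System.out.println(doc.html());
    }
}
```

Printing doc.html() at the end is a quick way to inspect the html, head, and body elements and the closing tags that the parser supplied.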
See jsoup.org for downloads and the full API documentation.
Example
Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the In the News section into a list of Elements:
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
log(doc.title());
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
    log("%s\n\t%s",
        headline.attr("title"), headline.absUrl("href"));
}
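The absUrl call in the example resolves a relative href against the document's base URI, which Jsoup.connect sets from the fetched URL. When parsing from a string, the base URI has to be supplied explicitly. The sketch below illustrates that under assumed inputs; AbsUrlSketch and firstAbsoluteLink are hypothetical names, and example.org is a placeholder host.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AbsUrlSketch {

    // Resolves the first link's href against the supplied base URI.
    // Returns an empty string when the markup contains no link.
    static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href='/wiki/Main_Page' title='Main'>Main</a>";
        System.out.println(firstAbsoluteLink(html, "https://example.org/"));
    }
}
```

By contrast, attr("href") returns the attribute exactly as written in the markup, so it is the right choice when the original relative value matters.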
Open source
jsoup is an open source project distributed under the liberal MIT license. The source code is available on GitHub.
Getting started
Android support
When used in Android projects, core library desugaring with the NIO specification should be enabled to support Java 8+ features.
Development and support
If you have any questions on how to use jsoup, or have ideas for future development, please get in touch via jsoup Discussions.
If you find any issues, please file a bug after checking for duplicates.
The colophon talks about the history of and tools used to build jsoup.
Status
jsoup is in general release and is stable.