Top Related Projects
Quick Overview
GoQuery is a Go library that provides a jQuery-like syntax for parsing and manipulating HTML documents. It's built on top of the net/html package and offers a convenient way to extract data from HTML using CSS selectors, similar to how jQuery works in JavaScript.
Pros
- Easy-to-use API with familiar jQuery-like syntax
- Powerful CSS selector support for efficient HTML parsing
- Well-documented and actively maintained
- Good performance for most use cases
Cons
- Limited to HTML parsing (not suitable for XML or other formats)
- May be overkill for simple scraping tasks
- Learning curve for developers not familiar with jQuery concepts
- Some advanced CSS selectors may not be supported
Code Examples
- Basic HTML parsing and element selection:
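res, err := http.Get("http://example.com")
if err != nil {
	log.Fatal(err)
}
defer res.Body.Close()
// NewDocument(url) is deprecated since v1.4.0; parse the response body instead
doc, err := goquery.NewDocumentFromReader(res.Body)
if err != nil {
	log.Fatal(err)
}
// Find all links and print their href attribute
doc.Find("a").Each(func(i int, s *goquery.Selection) {
	href, _ := s.Attr("href")
	fmt.Printf("Link %d: %s\n", i, href)
})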
- Modifying HTML content:
html := `<div class="container"><p>Hello</p></div>`
doc, _ := goquery.NewDocumentFromReader(strings.NewReader(html))
// Add a new element
doc.Find(".container").AppendHtml("<span>World</span>")
// Print the modified HTML
output, _ := doc.Html()
fmt.Println(output)
- Extracting text content:
html := `<div><h1>Title</h1><p>Paragraph 1</p><p>Paragraph 2</p></div>`
doc, _ := goquery.NewDocumentFromReader(strings.NewReader(html))
// Extract text from all paragraphs
doc.Find("p").Each(func(i int, s *goquery.Selection) {
fmt.Printf("Paragraph %d: %s\n", i+1, s.Text())
})
Getting Started
To start using GoQuery, first install it using Go modules:
go get github.com/PuerkitoBio/goquery
Then, import it in your Go code:
import "github.com/PuerkitoBio/goquery"
Here's a simple example to get you started:
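package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// NewDocument(url) is deprecated, so fetch the page with net/http first
	res, err := http.Get("https://example.com")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	doc, err := goquery.NewDocumentFromReader(res.Body)
	if err != nil {
		log.Fatal(err)
	}
	title := doc.Find("title").Text()
	fmt.Printf("Page title: %s\n", title)
}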
This example fetches the HTML content from example.com, parses it, and prints the page title.
Competitor Comparisons
Elegant Scraper and Crawler Framework for Golang
Pros of colly
- Built-in concurrency and rate limiting for efficient crawling
- Simpler API for common web scraping tasks
- Includes features like caching and cookie handling out of the box
Cons of colly
- Less flexible for complex DOM manipulation tasks
- Limited to web scraping use cases, while goquery is more general-purpose
- Steeper learning curve for users familiar with jQuery-like syntax
Code Comparison
goquery example:
res, err := http.Get("http://example.com")
if err != nil {
	log.Fatal(err)
}
defer res.Body.Close()
doc, _ := goquery.NewDocumentFromReader(res.Body)
doc.Find(".post-title").Each(func(i int, s *goquery.Selection) {
	fmt.Println(s.Text())
})
colly example:
c := colly.NewCollector()
c.OnHTML(".post-title", func(e *colly.HTMLElement) {
title := e.Text
fmt.Println(title)
})
c.Visit("http://example.com")
Both libraries can scrape web content. colly provides a more streamlined approach for typical crawling tasks, with concurrency, rate limiting, and caching built in, while goquery offers greater flexibility for complex DOM traversal and manipulation.
Web Scraper in Go, similar to BeautifulSoup
Pros of soup
- Simpler API with fewer methods, making it easier to learn and use
- Lightweight and has fewer dependencies
- Supports both HTML and XML parsing out of the box
Cons of soup
- Less feature-rich compared to goquery
- Smaller community and fewer resources available
- May not handle complex DOM manipulation as efficiently as goquery
Code Comparison
soup:
doc := soup.HTMLParse("<html><body><p>Hello, World!</p></body></html>")
text := doc.Find("p").Text()
goquery:
doc, _ := goquery.NewDocumentFromReader(strings.NewReader("<html><body><p>Hello, World!</p></body></html>"))
text := doc.Find("p").Text()
Both libraries cover basic HTML parsing and element selection. goquery, inspired by jQuery, offers a more extensive set of methods for DOM traversal and manipulation, which suits complex scraping tasks but may be overkill for simple ones. soup takes a more minimalist approach: its smaller API is easier to learn but may be limiting in complex scenarios. The choice between the two depends on the specific requirements of your project.
README
goquery - a little like that j-thing, only in Go
goquery brings a syntax and a set of features similar to jQuery to the Go language. It is based on Go's net/html package and the CSS Selector library cascadia. Since the net/html parser returns nodes, and not a full-featured DOM tree, jQuery's stateful manipulation functions (like height(), css(), detach()) have been left off.
Also, because the net/html parser requires UTF-8 encoding, so does goquery: it is the caller's responsibility to ensure that the source document provides UTF-8 encoded HTML. See the wiki for various options to do this.
Syntax-wise, it is as close as possible to jQuery, with the same function names when possible, and that warm and fuzzy chainable interface. jQuery being the ultra-popular library that it is, I felt that writing a similar HTML-manipulating library was better to follow its API than to start anew (in the same spirit as Go's fmt package), even though some of its methods are less than intuitive (looking at you, index()...).
Table of Contents
Installation
Required Go version:
- Starting with version v1.10.0 of goquery, Go 1.23+ is required due to the use of function-based iterators.
- For v1.9.0 of goquery, Go 1.18+ is required due to the use of generics.
- For previous goquery versions, a Go version of 1.1+ was required because of the net/html dependency.
Ongoing goquery development is tested on the latest 2 versions of Go.
$ go get github.com/PuerkitoBio/goquery
(optional) To run unit tests:
$ cd $GOPATH/src/github.com/PuerkitoBio/goquery
$ go test
(optional) To run benchmarks (warning: it runs for a few minutes):
$ cd $GOPATH/src/github.com/PuerkitoBio/goquery
$ go test -bench=".*"
Changelog
Note that goquery's API is now stable, and will not break.
- 2024-12-26 (v1.10.1) : Update go.mod dependencies.
- 2024-09-06 (v1.10.0) : Add EachIter which provides an iterator that can be used in for..range loops on the *Selection object. goquery now requires Go version 1.23+ (thanks @amikai).
- 2024-09-06 (v1.9.3) : Update go.mod dependencies.
- 2024-04-29 (v1.9.2) : Update go.mod dependencies.
- 2024-02-29 (v1.9.1) : Improve allocation and performance of the Map function and Selection.Map method, better document the cascadia differences (thanks @jwilsson).
- 2024-02-22 (v1.9.0) : Add a generic Map function, goquery now requires Go version 1.18+ (thanks @Fesaa).
- 2023-02-18 (v1.8.1) : Update go.mod dependencies, update CI workflow.
- 2021-10-25 (v1.8.0) : Add Render function to render a Selection to an io.Writer (thanks @anthonygedeon).
- 2021-07-11 (v1.7.1) : Update go.mod dependencies and add dependabot config (thanks @jauderho).
- 2021-06-14 (v1.7.0) : Add Single and SingleMatcher functions to optimize first-match selection (thanks @gdollardollar).
- 2021-01-11 (v1.6.1) : Fix panic when calling {Prepend,Append,Set}Html on a Selection that contains non-Element nodes.
- 2020-10-08 (v1.6.0) : Parse html in context of the container node for all functions that deal with html strings (AfterHtml, AppendHtml, etc.). Thanks to @thiemok and @davidjwilkins for their work on this.
- 2020-02-04 (v1.5.1) : Update module dependencies.
- 2018-11-15 (v1.5.0) : Go module support (thanks @Zaba505).
- 2018-06-07 (v1.4.1) : Add NewDocumentFromReader examples.
- 2018-03-24 (v1.4.0) : Deprecate NewDocument(url) and NewDocumentFromResponse(response).
- 2018-01-28 (v1.3.0) : Add ToEnd constant to Slice until the end of the selection (thanks to @davidjwilkins for raising the issue).
- 2018-01-11 (v1.2.0) : Add AddBack* and deprecate AndSelf (thanks to @davidjwilkins).
- 2017-02-12 (v1.1.0) : Add SetHtml and SetText (thanks to @glebtv).
- 2016-12-29 (v1.0.2) : Optimize allocations for Selection.Text (thanks to @radovskyb).
- 2016-08-28 (v1.0.1) : Optimize performance for large documents.
- 2016-07-27 (v1.0.0) : Tag version 1.0.0.
- 2016-06-15 : Invalid selector strings internally compile to a Matcher implementation that never matches any node (instead of a panic). So for example, doc.Find("~") returns an empty *Selection object.
- 2016-02-02 : Add NodeName utility function similar to the DOM's nodeName property. It returns the tag name of the first element in a selection, and other relevant values of non-element nodes (see doc for details). Add OuterHtml utility function similar to the DOM's outerHTML property (named OuterHtml in small caps for consistency with the existing Html method on the Selection).
- 2015-04-20 : Add AttrOr helper method to return the attribute's value or a default value if absent. Thanks to piotrkowalczuk.
- 2015-02-04 : Add more manipulation functions - Prepend* - thanks again to Andrew Stone.
- 2014-11-28 : Add more manipulation functions - ReplaceWith*, Wrap* and Unwrap - thanks again to Andrew Stone.
- 2014-11-07 : Add manipulation functions (thanks to Andrew Stone) and *Matcher functions, that receive compiled cascadia selectors instead of selector strings, thus avoiding potential panics thrown by goquery via cascadia.MustCompile calls. This results in better performance (selectors can be compiled once and reused) and more idiomatic error handling (you can handle cascadia's compilation errors, instead of recovering from panics, which had been bugging me for a long time). Note that the actual type expected is a Matcher interface, that cascadia.Selector implements. Other matcher implementations could be used.
- 2014-11-06 : Change import paths of net/html to golang.org/x/net/html (see https://groups.google.com/forum/#!topic/golang-nuts/eD8dh3T9yyA). Make sure to update your code to use the new import path too when you call goquery with html.Nodes.
- v0.3.2 : Add NewDocumentFromReader() (thanks jweir) which allows creating a goquery document from an io.Reader.
- v0.3.1 : Add NewDocumentFromResponse() (thanks assassingj) which allows creating a goquery document from an http response.
- v0.3.0 : Add EachWithBreak() which allows to break out of an Each() loop by returning false. This function was added instead of changing the existing Each() to avoid breaking compatibility.
- v0.2.1 : Make go-getable, now that go.net/html is Go1.0-compatible (thanks to @matrixik for pointing this out).
- v0.2.0 : Add support for negative indices in Slice(). BREAKING CHANGE Document.Root is removed, Document is now a Selection itself (a selection of one, the root element, just like Document.Root was before). Add jQuery's Closest() method.
- v0.1.1 : Add benchmarks to use as baseline for refactorings, refactor Next...() and Prev...() methods to use the new html package's linked list features (Next/PrevSibling, FirstChild). Good performance boost (40+% in some cases).
- v0.1.0 : Initial release.
API
goquery exposes two structs, Document and Selection, and the Matcher interface. Unlike jQuery, which is loaded as part of a DOM document, and thus acts on its containing document, goquery doesn't know which HTML document to act upon. So it needs to be told, and that's what the Document type is for. It holds the root document node as the initial Selection value to manipulate.
jQuery often has many variants for the same function (no argument, a selector string argument, a jQuery object argument, a DOM element argument, ...). Instead of exposing the same features in goquery as a single method with variadic empty interface arguments, statically-typed signatures are used following this naming convention:
- When the jQuery equivalent can be called with no argument, it has the same name as jQuery for the no argument signature (e.g.: Prev()), and the version with a selector string argument is called XxxFiltered() (e.g.: PrevFiltered())
- When the jQuery equivalent requires one argument, the same name as jQuery is used for the selector string version (e.g.: Is())
- The signatures accepting a jQuery object as argument are defined in goquery as XxxSelection() and take a *Selection object as argument (e.g.: FilterSelection())
- The signatures accepting a DOM element as argument in jQuery are defined in goquery as XxxNodes() and take a variadic argument of type *html.Node (e.g.: FilterNodes())
- The signatures accepting a function as argument in jQuery are defined in goquery as XxxFunction() and take a function as argument (e.g.: FilterFunction())
- The goquery methods that can be called with a selector string have a corresponding version that take a Matcher interface and are defined as XxxMatcher() (e.g.: IsMatcher())
Utility functions that are not in jQuery but are useful in Go are implemented as functions (that take a *Selection as parameter), to avoid a potential naming clash on the *Selection's methods (reserved for jQuery-equivalent behaviour).
The complete package reference documentation can be found here.
Please note that Cascadia's selectors do not necessarily match all supported selectors of jQuery (Sizzle). See the cascadia project for details. Also, the selectors work more like the DOM's querySelectorAll, than jQuery's matchers - they have no concept of contextual matching (for some concrete examples of what that means, see this ticket). In practice, it doesn't matter very often but it's something worth mentioning. Invalid selector strings compile to a Matcher that fails to match any node. Behaviour of the various functions that take a selector string as argument follows from that fact, e.g. (where ~ is an invalid selector string):
- Find("~") returns an empty selection because the selector string doesn't match anything.
- Add("~") returns a new selection that holds the same nodes as the original selection, because it didn't add any node (selector string didn't match anything).
- ParentsFiltered("~") returns an empty selection because the selector string doesn't match anything.
- ParentsUntil("~") returns all parents of the selection because the selector string didn't match any element to stop before the top element.
Examples
See some tips and tricks in the wiki.
Adapted from example_test.go:
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func ExampleScrape() {
	// Request the HTML page.
	res, err := http.Get("http://metalsucks.net")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	if res.StatusCode != 200 {
		log.Fatalf("status code error: %d %s", res.StatusCode, res.Status)
	}

	// Load the HTML document
	doc, err := goquery.NewDocumentFromReader(res.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Find the review items
	doc.Find(".left-content article .post-title").Each(func(i int, s *goquery.Selection) {
		// For each item found, get the title
		title := s.Find("a").Text()
		fmt.Printf("Review %d: %s\n", i, title)
	})
}

func main() {
	ExampleScrape()
}
Related Projects
- Goq, an HTML deserialization and scraping library based on goquery and struct tags.
- andybalholm/cascadia, the CSS selector library used by goquery.
- suntong/cascadia, a command-line interface to the cascadia CSS selector library, useful to test selectors.
- gocolly/colly, a lightning fast and elegant Scraping Framework
- gnulnx/goperf, a website performance test tool that also fetches static assets.
- MontFerret/ferret, declarative web scraping.
- tacusci/berrycms, a modern simple to use CMS with easy to write plugins
- Dataflow kit, Web Scraping framework for Gophers.
- Geziyor, a fast web crawling & scraping framework for Go. Supports JS rendering.
- Pagser, a simple, easy, extensible, configurable HTML parser to struct based on goquery and struct tags.
- stitcherd, A server for doing server side includes using css selectors and DOM updates.
- goskyr, an easily configurable command-line scraper written in Go.
- goGetJS, a tool for extracting, searching, and saving JavaScript files (with optional headless browser).
- fitter, a tool for selecting values from JSON, XML, HTML and XPath formatted pages.
- seltabl, an orm-like package and supporting language server for extracting values from HTML
Support
There are a number of ways you can support the project:
- Use it, star it, build something with it, spread the word!
- If you do build something open-source or otherwise publicly-visible, let me know so I can add it to the Related Projects section!
- Raise issues to improve the project (note: doc typos and clarifications are issues too!)
- Please search existing issues before opening a new one - it may have already been addressed.
- Pull requests: please discuss new code in an issue first, unless the fix is really trivial.
- Make sure new code is tested.
- Be mindful of existing code - PRs that break existing code have a high probability of being declined, unless it fixes a serious issue.
- Sponsor the developer
- See the Github Sponsor button at the top of the repo on github
- or via BuyMeACoffee.com, below
License
The BSD 3-Clause license, the same as the Go language. Cascadia's license is here.