Top Related Projects
- chromedp: A faster, simpler way to drive browsers supporting the Chrome DevTools Protocol.
- goquery: A little like that j-thing, only in Go.
- rod: A Chrome DevTools Protocol driver for web automation and scraping.
- geziyor: Blazing fast web crawling & scraping framework for Go. Supports JS rendering.
- surf: Stateful programmatic web browsing in Go.
Quick Overview
Colly is a powerful and flexible web scraping framework for Go. It provides a clean API for traversing HTML documents, making HTTP requests, and extracting data from websites. Colly is designed to be fast, efficient, and easy to use for both simple and complex web scraping tasks.
Pros
- Fast and efficient, with support for concurrent scraping
- Easy to use with a clean and intuitive API
- Extensible through middleware and callbacks
- Built-in support for handling common web scraping challenges (e.g., rate limiting, caching; see the rate-limiting sketch under Code Examples)
Cons
- Limited built-in support for JavaScript rendering (requires additional tools for dynamic content)
- May require additional libraries for more complex data processing tasks
- Learning curve for advanced features and customizations
Code Examples
- Basic scraping example:
c := colly.NewCollector()
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    link := e.Attr("href")
    fmt.Printf("Found link: %s\n", link)
})
c.Visit("https://example.com")
This code creates a new collector, sets up a callback for HTML elements with href attributes, and visits a website.
- Extracting specific data:
c := colly.NewCollector()
c.OnHTML(".product", func(e *colly.HTMLElement) {
    name := e.ChildText(".name")
    price := e.ChildText(".price")
    fmt.Printf("Product: %s, Price: %s\n", name, price)
})
c.Visit("https://example.com/products")
This example extracts product names and prices from a hypothetical product listing page.
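- Extracting into structs:
Colly can also map a matched element onto a struct via HTMLElement.Unmarshal, which reads selector struct tags. A minimal sketch reusing the hypothetical .product, .name, and .price selectors from above:
type product struct {
    Name  string `selector:".name"`
    Price string `selector:".price"`
}

c := colly.NewCollector()
c.OnHTML(".product", func(e *colly.HTMLElement) {
    var p product
    // Unmarshal fills the struct fields from the matched element's children.
    if err := e.Unmarshal(&p); err != nil {
        return // skip elements that don't match the expected shape
    }
    fmt.Printf("Product: %s, Price: %s\n", p.Name, p.Price)
})
c.Visit("https://example.com/products")
This keeps extraction logic in one declarative place instead of scattering ChildText calls.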
- Handling pagination:
c := colly.NewCollector()
c.OnHTML("a.next-page", func(e *colly.HTMLElement) {
    nextPage := e.Attr("href")
    c.Visit(e.Request.AbsoluteURL(nextPage))
})
c.OnHTML(".content", func(e *colly.HTMLElement) {
    // Process content on each page
})
c.Visit("https://example.com/page/1")
This code demonstrates how to handle pagination by following "next page" links.
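- Rate limiting:
Following pagination links can hammer a site; Colly's LimitRule throttles per-domain concurrency and delay. A minimal sketch (the domain glob and values are illustrative, and the time and log packages are assumed to be imported):
c := colly.NewCollector()
// Allow at most two concurrent requests per matching domain,
// separated by a fixed one-second delay.
err := c.Limit(&colly.LimitRule{
    DomainGlob:  "*example.com",
    Parallelism: 2,
    Delay:       1 * time.Second,
})
if err != nil {
    log.Fatal(err)
}
This is the built-in rate-limiting support mentioned under Pros.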
Getting Started
To start using Colly, first install it:
go get -u github.com/gocolly/colly/v2
Then, create a new Go file and import Colly:
package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.OnResponse(func(r *colly.Response) {
        fmt.Println("Visited", r.Request.URL)
    })

    c.Visit("https://example.com")
}
This basic example creates a collector, sets up callbacks for requests and responses, and visits a website. Run the program to see the output.
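Failures are worth handling too; Colly's OnError callback fires when a request cannot be completed. A small, optional addition to the example above:
c.OnError(func(r *colly.Response, err error) {
    // Called for network errors and non-2xx responses.
    fmt.Println("Request to", r.Request.URL, "failed:", err)
})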
Competitor Comparisons
chromedp: A faster, simpler way to drive browsers supporting the Chrome DevTools Protocol.
Pros of chromedp
- Full browser automation with JavaScript execution
- Supports complex interactions like clicking, typing, and scrolling
- Ideal for testing web applications and scraping dynamic content
Cons of chromedp
- Heavier resource usage due to running a full browser
- Slower execution compared to lightweight scraping
- More complex setup and configuration required
Code Comparison
chromedp example:
ctx, cancel := chromedp.NewContext(context.Background())
defer cancel()

var res string
err := chromedp.Run(ctx,
    chromedp.Navigate("https://example.com"),
    chromedp.Text("body", &res, chromedp.NodeVisible),
)
colly example:
c := colly.NewCollector()
c.OnHTML("body", func(e *colly.HTMLElement) {
    fmt.Println(e.Text)
})
c.Visit("https://example.com")
Key Differences
- chromedp provides full browser automation, while colly is focused on web scraping
- colly is generally faster and more lightweight for simple scraping tasks
- chromedp offers more advanced interaction capabilities with web pages
- colly has a simpler API and is easier to set up for basic scraping needs
- chromedp is better suited for scenarios requiring JavaScript rendering or complex user interactions
Both tools have their strengths, and the choice depends on the specific requirements of your project. chromedp excels in browser automation and handling dynamic content, while colly is more efficient for straightforward web scraping tasks.
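To illustrate the interaction capabilities mentioned above, here is a minimal chromedp sketch of filling and submitting a search form; the URL and the #query, #submit, and #results selectors are hypothetical:
ctx, cancel := chromedp.NewContext(context.Background())
defer cancel()

var results string
err := chromedp.Run(ctx,
    chromedp.Navigate("https://example.com/search"),
    chromedp.SendKeys("#query", "colly"), // type into the search box
    chromedp.Click("#submit"),            // click the submit button
    chromedp.WaitVisible("#results"),     // wait for JS-rendered results
    chromedp.Text("#results", &results),
)
if err != nil {
    log.Fatal(err)
}
fmt.Println(results)
None of this is possible with Colly alone, since Colly never executes JavaScript.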
goquery: A little like that j-thing, only in Go.
Pros of goquery
- More focused on DOM parsing and manipulation, similar to jQuery
- Better suited for complex HTML document traversal and analysis
- Provides a familiar API for developers with jQuery experience
Cons of goquery
- Less feature-rich for web scraping compared to Colly
- Doesn't include built-in concurrent crawling capabilities
- Requires more manual setup for handling HTTP requests and responses
Code Comparison
goquery:
// NewDocument is deprecated in newer goquery releases in favor of NewDocumentFromReader.
doc, err := goquery.NewDocument("http://example.com")
if err != nil {
    log.Fatal(err)
}
doc.Find(".post-title").Each(func(i int, s *goquery.Selection) {
    title := s.Text()
    fmt.Println(title)
})
Colly:
c := colly.NewCollector()
c.OnHTML(".post-title", func(e *colly.HTMLElement) {
    title := e.Text
    fmt.Println(title)
})
c.Visit("http://example.com")
Both libraries offer ways to scrape web content, but Colly provides a more streamlined API for web scraping tasks, while goquery focuses on DOM manipulation. Colly includes built-in features for handling requests, concurrent scraping, and data extraction, making it more suitable for large-scale web scraping projects. goquery, on the other hand, excels in scenarios where detailed DOM traversal and manipulation are required, offering a jQuery-like experience for Go developers.
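The two also compose: a Colly HTMLElement exposes the matched node as a goquery selection through its DOM field, so jQuery-style traversal is available inside a Colly callback. A minimal sketch (the article selector is hypothetical):
c := colly.NewCollector()
c.OnHTML("article", func(e *colly.HTMLElement) {
    // e.DOM is a *goquery.Selection scoped to the matched element.
    e.DOM.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
        href, _ := s.Attr("href")
        fmt.Println(href)
    })
})
c.Visit("http://example.com")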
rod: A Chrome DevTools Protocol driver for web automation and scraping.
Pros of Rod
- Supports browser automation and JavaScript rendering
- Provides more advanced interaction capabilities (e.g., mouse events, keyboard input)
- Better suited for dynamic web pages and single-page applications
Cons of Rod
- Slower performance due to browser overhead
- Higher resource consumption
- Steeper learning curve for beginners
Code Comparison
Rod example:
page := rod.New().MustConnect().MustPage("https://example.com")
text := page.MustElement("#content").MustText()
fmt.Println(text)
Colly example:
c := colly.NewCollector()
c.OnHTML("#content", func(e *colly.HTMLElement) {
    text := e.Text
    fmt.Println(text)
})
c.Visit("https://example.com")
Rod is more suitable for complex web scraping tasks involving JavaScript-heavy websites and browser interactions. It offers a wider range of capabilities but comes with increased resource usage and complexity.
Colly, on the other hand, is lightweight and faster for simple HTML parsing tasks. It's easier to use for beginners and more efficient for static websites, but lacks advanced browser automation features.
Choose Rod for dynamic web scraping with browser simulation, and Colly for simpler, static HTML parsing tasks where performance is crucial.
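As a sketch of the mouse and keyboard capabilities mentioned above, Rod chains Must helpers for input and clicks; the URL and element selectors here are hypothetical:
page := rod.New().MustConnect().MustPage("https://example.com/login")
page.MustElement("#user").MustInput("alice")      // keyboard input
page.MustElement("#password").MustInput("secret")
page.MustElement("#submit").MustClick()           // mouse click
page.MustWaitLoad()                               // wait for navigation to settle
fmt.Println(page.MustElement("#welcome").MustText())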
geziyor: Blazing fast web crawling & scraping framework for Go. Supports JS rendering.
Pros of Geziyor
- Built-in support for concurrent scraping and distributed crawling
- Includes data processing and exporting capabilities
- Offers a more comprehensive set of features out-of-the-box
Cons of Geziyor
- Less mature and less widely adopted compared to Colly
- Documentation is not as extensive or well-organized
- Fewer third-party extensions and integrations available
Code Comparison
Colly:
c := colly.NewCollector()
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    e.Request.Visit(e.Attr("href"))
})
c.Visit("http://example.com/")
Geziyor:
geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://example.com/"},
    ParseFunc: func(g *geziyor.Geziyor, r *geziyor.Response) {
        r.HTMLDoc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
            g.Get(s.AttrOr("href", ""), g.Opt.ParseFunc)
        })
    },
}).Start()
Both Colly and Geziyor are Go-based web scraping frameworks, but they differ in their approach and feature set. Colly is more lightweight and focused on simplicity, while Geziyor offers more built-in functionality for complex scraping tasks. The choice between them depends on the specific requirements of your project and your familiarity with Go programming.
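Geziyor's built-in exporting mentioned above runs through an Exports channel and pluggable exporters. A hedged sketch following the snippet above (the export package ships with Geziyor, but newer releases move Response into a client subpackage, so types may differ by version):
geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://example.com/"},
    ParseFunc: func(g *geziyor.Geziyor, r *geziyor.Response) {
        r.HTMLDoc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
            // Push structured rows into the export pipeline.
            g.Exports <- map[string]interface{}{
                "text": s.Text(),
                "href": s.AttrOr("href", ""),
            }
        })
    },
    Exporters: []export.Exporter{&export.JSON{}},
}).Start()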
surf: Stateful programmatic web browsing in Go.
Pros of Surf
- Simpler API for basic web browsing tasks
- Built-in support for JavaScript execution
- Easier to handle cookies and user sessions
Cons of Surf
- Less actively maintained (last commit in 2018)
- Fewer features for advanced scraping scenarios
- Smaller community and ecosystem
Code Comparison
Surf:
browser := surf.NewBrowser()
err := browser.Open("https://example.com")
if err != nil {
    panic(err)
}
fmt.Println(browser.Title())
Colly:
c := colly.NewCollector()
c.OnHTML("title", func(e *colly.HTMLElement) {
    fmt.Println(e.Text)
})
c.Visit("https://example.com")
Colly is more focused on web scraping and provides a more extensive set of features for handling complex scraping tasks. It offers better performance and is actively maintained with a larger community.
Surf, on the other hand, provides a simpler API for basic web browsing tasks and includes built-in JavaScript support, which can be useful for certain scenarios. However, its development has been inactive for several years, which may be a concern for long-term project maintenance.
For most modern web scraping projects, Colly is likely the better choice due to its active development, performance, and rich feature set. However, if you need a simple solution for basic web browsing tasks with JavaScript support, Surf might still be worth considering.
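For the cookie and session handling mentioned above, Surf's browser keeps state across requests and exposes a form API; a hedged sketch (the form selector and field names are hypothetical):
browser := surf.NewBrowser()
if err := browser.Open("https://example.com/login"); err != nil {
    panic(err)
}
// Fill and submit the login form; session cookies persist on the browser.
fm, err := browser.Form("form#login")
if err != nil {
    panic(err)
}
fm.Input("username", "alice")
fm.Input("password", "secret")
if err := fm.Submit(); err != nil {
    panic(err)
}
fmt.Println(browser.Title())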
README
Colly
Lightning Fast and Elegant Scraping Framework for Gophers
Colly provides a clean interface to write any kind of crawler/scraper/spider.
With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.
Sponsors
Scrapfly is an enterprise-grade solution providing a Web Scraping API that aims to simplify the scraping process by managing everything: real browser rendering, rotating proxies, and fingerprints (TLS, HTTP, browser) to bypass all major anti-bots. Scrapfly also provides observability through an analytical dashboard that measures success and block rates in detail.
Features
- Clean API
- Fast (>1k requests/sec on a single core)
- Manages request delays and maximum concurrency per domain
- Automatic cookie and session handling
- Sync/async/parallel scraping
- Caching
- Automatic encoding of non-Unicode responses
- Robots.txt support
- Distributed scraping
- Configuration via environment variables
- Extensions
Example
func main() {
    c := colly.NewCollector()

    // Find and visit all links
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.Visit("http://go-colly.org/")
}
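Several of the features above, such as async scraping, per-domain limits, and caching, are switched on through collector options; a minimal sketch (the cache path is illustrative):
c := colly.NewCollector(
    colly.Async(true),               // issue requests asynchronously
    colly.CacheDir("./colly_cache"), // cache responses on disk
)
c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 4})
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    e.Request.Visit(e.Attr("href"))
})
c.Visit("http://go-colly.org/")
c.Wait() // block until all async requests finish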
See examples folder for more detailed examples.
Installation
Add colly to your project by running:
go get github.com/gocolly/colly/v2
The go tool resolves the latest release and pins it in your go.mod file, which ends up with a require directive such as:
module github.com/x/y

go 1.14

require github.com/gocolly/colly/v2 v2.1.0
Bugs
Bugs or suggestions? Visit the issue tracker or join #colly on freenode.
Other Projects Using Colly
Below is a list of public, open source projects that use Colly:
- greenpeace/check-my-pages Scraping script to test the Spanish Greenpeace web archive.
- altsab/gowap Wappalyzer implementation in Go.
- jesuiscamille/goquotes A quotes scraper, making your day a little better!
- jivesearch/jivesearch A search engine that doesn't track you.
- Leagify/colly-draft-prospects A scraper for future NFL Draft prospects.
- lucasepe/go-ps4 Search the PlayStation Store for your favorite PS4 games using the command line.
- yringler/inside-chassidus-scraper Scrapes Rabbi Paltiel's web site for lesson metadata.
- gamedb/gamedb A database of Steam games.
- lawzava/scrape CLI for email scraping from any website.
- eureka101v/WeiboSpiderGo A Sina Weibo (Chinese Twitter) scraper.
- Go-phie/gophie Search, Download and Stream movies from your terminal
- imthaghost/goclone Clone websites to your computer within seconds.
- superiss/spidy Crawl the web and collect expired domains.
- docker-slim/docker-slim Optimize your Docker containers to make them smaller and better.
- seversky/gachifinder An agent for asynchronous scraping, parsing, and writing to storage backends (Elasticsearch for now).
- eval-exec/goodreads Crawl all tags and all pages of quotes from Goodreads.
If you are using Colly in a project please send a pull request to add it to the list.
Contributors
This project exists thanks to all the people who contribute. [Contribute].
Backers
Thank you to all our backers! [Become a backer]
Sponsors
Support this project by becoming a sponsor. Your logo will show up here with a link to your website. [Become a sponsor]
License
Colly is released under the Apache License 2.0; see the repository's LICENSE file for the full text.