anaskhan96/soup

Web Scraper in Go, similar to BeautifulSoup

Top Related Projects

rod (5,549 stars): A Chrome DevTools Protocol driver for web automation and scraping.

goquery (14,154 stars): A little like that j-thing, only in Go.

colly (23,473 stars): Elegant Scraper and Crawler Framework for Golang

Quick Overview

Soup is a Go package inspired by Python's BeautifulSoup library, designed for web scraping and HTML parsing. It provides a simple and intuitive API for extracting data from HTML and XML documents, making it easier for Go developers to parse and navigate through web content.

Pros

  • Easy to use with a straightforward API
  • Supports both HTML and XML parsing
  • Lightweight, with minimal external dependencies
  • Good performance for most scraping tasks

Cons

  • Limited functionality compared to more comprehensive parsing libraries
  • May not handle extremely complex or malformed HTML as well as some alternatives
  • Documentation could be more extensive
  • Not actively maintained (last commit was in 2021)

Code Examples

  1. Fetching a page and parsing it into a document:

doc, err := soup.Get("https://example.com")
if err != nil {
    // Handle error
}
root := soup.HTMLParse(doc)

  2. Finding elements by tag name and accessing attributes:

links := root.FindAll("a")
for _, link := range links {
    href := link.Attrs()["href"]
    fmt.Println(href)
}

  3. Finding elements by class and extracting text:

elements := root.FindAll("div", "class", "content")
for _, element := range elements {
    text := element.Text()
    fmt.Println(text)
}
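
Taken together, the snippets above form a complete program. A minimal runnable sketch, assuming https://example.com as a placeholder target that contains anchor tags and div elements with class "content":

package main

import (
    "fmt"
    "log"

    "github.com/anaskhan96/soup"
)

func main() {
    // Fetch the raw HTML; soup.Get returns it as a string.
    doc, err := soup.Get("https://example.com")
    if err != nil {
        log.Fatal(err)
    }

    // Parse the string into a traversable document.
    root := soup.HTMLParse(doc)

    // Print the href attribute of every anchor tag.
    for _, link := range root.FindAll("a") {
        fmt.Println(link.Attrs()["href"])
    }

    // Print the text of every div with class "content".
    for _, element := range root.FindAll("div", "class", "content") {
        fmt.Println(element.Text())
    }
}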

Getting Started

To use Soup in your Go project, follow these steps:

  1. Install the package:

    go get github.com/anaskhan96/soup
    
  2. Import the package in your Go file:

    import "github.com/anaskhan96/soup"
    
  3. Use the package to parse HTML:

    doc, _ := soup.Get("https://example.com")
    root := soup.HTMLParse(doc)
    title := root.Find("title").Text()
    fmt.Println(title)
    

This will fetch the HTML content from the specified URL, parse it, and print the title of the page.
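
If the default HTTP client is not flexible enough, for example to set a request timeout, the package also provides GetWithClient, which takes a custom *http.Client and, like Get, returns the page HTML string (and an error). A minimal sketch with https://example.com as a placeholder URL (it additionally needs the net/http and time imports):

client := &http.Client{Timeout: 10 * time.Second}
doc, err := soup.GetWithClient("https://example.com", client)
if err != nil {
    // Handle the request error
}
root := soup.HTMLParse(doc)
fmt.Println(root.Find("title").Text())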

Competitor Comparisons

rod (5,549 stars)

A Chrome DevTools Protocol driver for web automation and scraping.

Pros of rod

  • Full browser automation capabilities, not just HTML parsing
  • Supports JavaScript rendering and interaction
  • More powerful for complex web scraping tasks

Cons of rod

  • Heavier resource usage due to browser automation
  • Steeper learning curve for basic scraping tasks
  • More complex setup and dependencies

Code comparison

soup:

resp, _ := soup.Get("https://example.com")
doc := soup.HTMLParse(resp)
links := doc.FindAll("a")
for _, link := range links {
    fmt.Println(link.Attrs()["href"])
}

rod:

page := rod.New().MustConnect().MustPage("https://example.com")
for _, e := range page.MustElements("a") {
    href, _ := e.Attribute("href")
    if href != nil { // Attribute returns nil when the attribute is absent
        fmt.Println(*href)
    }
}

Summary

soup is a lightweight HTML parsing library, ideal for simple scraping tasks. It's easy to use and has minimal dependencies. rod, on the other hand, is a full browser automation tool, offering more power and flexibility for complex web scraping and testing scenarios. While rod provides more features, it comes with increased complexity and resource usage. Choose soup for basic HTML parsing and rod for tasks requiring JavaScript rendering or advanced interaction with web pages.

goquery (14,154 stars)

A little like that j-thing, only in Go.

Pros of goquery

  • More mature and widely used project with a larger community
  • Implements jQuery-like syntax, familiar to web developers
  • Offers more advanced CSS selector support

Cons of goquery

  • Steeper learning curve for those unfamiliar with jQuery
  • Larger codebase and potentially higher memory usage
  • May be overkill for simple scraping tasks

Code Comparison

soup:

doc := soup.HTMLParse(html)
links := doc.FindAll("a")
for _, link := range links {
    fmt.Println(link.Attrs()["href"])
}

goquery:

doc, _ := goquery.NewDocumentFromReader(strings.NewReader(html))
doc.Find("a").Each(func(i int, s *goquery.Selection) {
    href, _ := s.Attr("href")
    fmt.Println(href)
})

Both libraries provide similar functionality for parsing HTML and extracting data. soup offers a simpler API with more straightforward methods, making it a good fit for beginners and quick scraping tasks. goquery, on the other hand, provides a more powerful and flexible approach, especially for complex DOM traversal and manipulation.

The choice between the two depends on the project's requirements, the developer's familiarity with jQuery-like syntax, and the complexity of the HTML parsing tasks at hand.

colly (23,473 stars)

Elegant Scraper and Crawler Framework for Golang

Pros of Colly

  • More feature-rich and powerful, offering advanced capabilities like concurrent scraping and robots.txt handling
  • Better performance and efficiency, especially for large-scale scraping tasks
  • More active development and larger community support

Cons of Colly

  • Steeper learning curve due to its more complex API and advanced features
  • May be overkill for simple scraping tasks where a lightweight solution is preferred

Code Comparison

Soup:

resp, _ := soup.Get("https://example.com")
doc := soup.HTMLParse(resp)
links := doc.FindAll("a")
for _, link := range links {
    fmt.Println(link.Attrs()["href"])
}

Colly:

c := colly.NewCollector()
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    link := e.Attr("href")
    fmt.Println(link)
})
c.Visit("https://example.com")

Summary

Colly is a more powerful and feature-rich web scraping framework, suitable for complex and large-scale scraping tasks. It offers better performance and a wider range of capabilities. However, it may have a steeper learning curve compared to Soup.

Soup, on the other hand, is simpler and more lightweight, making it easier to use for basic scraping tasks. It's a good choice for smaller projects or when quick prototyping is needed.

The choice between the two depends on the specific requirements of your project, the scale of scraping needed, and your familiarity with Go and web scraping concepts.

README

soup

Web Scraper in Go, similar to BeautifulSoup

soup is a small web scraper package for Go, with its interface highly similar to that of BeautifulSoup.

Exported variables and functions implemented so far:

var Headers map[string]string // Set headers as a map of key-value pairs, an alternative to calling Header() individually
var Cookies map[string]string // Set cookies as a map of key-value pairs, an alternative to calling Cookie() individually
func Get(string) (string,error) {} // Takes the url as an argument, returns HTML string
func GetWithClient(string, *http.Client) {} // Takes the url and a custom HTTP client as arguments, returns HTML string
func Post(string, string, interface{}) (string, error) {} // Takes the url, bodyType, and payload as arguments, returns HTML string
func PostForm(string, url.Values) {} // Takes the url and body. bodyType is set to "application/x-www-form-urlencoded"
func Header(string, string) {} // Takes key,value pair to set as headers for the HTTP request made in Get()
func Cookie(string, string) {} // Takes key, value pair to set as cookies to be sent with the HTTP request in Get()
func HTMLParse(string) Root {} // Takes the HTML string as an argument, returns a pointer to the DOM constructed
func Find([]string) Root {} // Element tag, (attribute key-value pair) as argument, pointer to first occurrence returned
func FindAll([]string) []Root {} // Same as Find(), but pointers to all occurrences returned
func FindStrict([]string) Root {} // Element tag, (attribute key-value pair) as argument, pointer to first occurrence with exact matching values returned
func FindAllStrict([]string) []Root {} // Same as FindStrict(), but pointers to all occurrences returned
func FindNextSibling() Root {} // Pointer to the next sibling of the Element in the DOM returned
func FindNextElementSibling() Root {} // Pointer to the next element sibling of the Element in the DOM returned
func FindPrevSibling() Root {} // Pointer to the previous sibling of the Element in the DOM returned
func FindPrevElementSibling() Root {} // Pointer to the previous element sibling of the Element in the DOM returned
func Children() []Root {} // Find all direct children of this DOM element
func Attrs() map[string]string {} // Map returned with all the attributes of the Element as lookup to their respective values
func Text() string {} // Full text inside a non-nested tag returned, first half returned in a nested one
func FullText() string {} // Full text inside a nested/non-nested tag returned
func SetDebug(bool) {} // Sets the debug mode to true or false; false by default
func HTML() {} // HTML returns the HTML code for the specific element
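
For example, the Header and Cookie helpers (or the Headers and Cookies maps) attach extra request data to subsequent calls to Get. A minimal sketch with placeholder values:

soup.Header("User-Agent", "my-scraper/1.0") // placeholder header value
soup.Cookie("session", "abc123")            // placeholder cookie value
resp, err := soup.Get("https://example.com") // placeholder URL
if err != nil {
    // Handle the request error
}
doc := soup.HTMLParse(resp)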

Root is a struct, containing three fields :

  • Pointer containing the pointer to the current html node
  • NodeValue containing the current html node's value, i.e. the tag name for an ElementNode, or the text in case of a TextNode
  • Error containing an error in a struct if one occurs, else nil. A detailed text explanation of the error can be accessed using the Error() function. A field Type in this struct, of type ErrorType, denotes the kind of error that took place, which will be one of the following:
    • ErrUnableToParse
    • ErrElementNotFound
    • ErrNoNextSibling
    • ErrNoPreviousSibling
    • ErrNoNextElementSibling
    • ErrNoPreviousElementSibling
    • ErrCreatingGetRequest
    • ErrInGetRequest
    • ErrReadingResponse
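
A minimal sketch of checking the Error field after a lookup that may fail, assuming doc is a document returned by HTMLParse (the element id used here is a placeholder):

el := doc.Find("div", "id", "does-not-exist")
if el.Error != nil {
    // The lookup failed, e.g. with ErrElementNotFound
    fmt.Println("lookup failed:", el.Error)
}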

Installation

Install the package using the command

go get github.com/anaskhan96/soup

Example

An example is given below that scrapes the "Comics I Enjoy" section (text and links) from xkcd.

More Examples

package main

import (
	"fmt"
	"github.com/anaskhan96/soup"
	"os"
)

func main() {
	resp, err := soup.Get("https://xkcd.com")
	if err != nil {
		os.Exit(1)
	}
	doc := soup.HTMLParse(resp)
	links := doc.Find("div", "id", "comicLinks").FindAll("a")
	for _, link := range links {
		fmt.Println(link.Text(), "| Link :", link.Attrs()["href"])
	}
}
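
To illustrate the difference between Text and FullText noted in the function list above, here is a small self-contained sketch (the HTML string is made up for the example):

snippet := `<p>Hello <b>world</b></p>`
doc := soup.HTMLParse(snippet)
p := doc.Find("p")
fmt.Println(p.Text())     // text up to the first nested tag, i.e. "Hello "
fmt.Println(p.FullText()) // text including nested tags, i.e. "Hello world"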

Contributions

This package was developed in my free time. However, contributions from everybody in the community are welcome to make it a better web scraper. If you think a particular feature or function should be included in the package, feel free to open a new issue or pull request.