
jaeles-project / gospider

Gospider - Fast web spider written in Go


Top Related Projects

  • katana: A next-generation crawling and spidering framework.
  • hakrawler: Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application.
  • waybackurls: Fetch all the URLs that the Wayback Machine knows about for a domain.
  • gau: Fetch known URLs from AlienVault's Open Threat Exchange, the Wayback Machine, and Common Crawl.
  • ffuf: Fast web fuzzer written in Go.
  • gobuster: Directory/File, DNS and VHost busting tool written in Go.

Quick Overview

The gospider project is a fast and efficient web crawler written in Go. It is designed to quickly discover and extract information from web pages, making it a useful tool for security researchers, web developers, and data analysts.

Pros

  • Fast and Efficient: gospider is built using the Go programming language, which is known for its speed and concurrency capabilities, allowing it to crawl web pages quickly and efficiently.
  • Customizable: The project provides a wide range of configuration options, allowing users to tailor the crawling process to their specific needs, such as setting depth limits, filtering URLs, and more.
  • Robust: gospider is designed to handle a variety of web page structures and can extract data from both HTML and JSON-based content.
  • Extensible: The project's modular design makes it easy to extend with custom functionality, such as additional data extraction or processing capabilities.

Cons

  • Limited Functionality: While gospider is a powerful web crawler, it may not provide all the features and functionality that some users might require, such as advanced data analysis or visualization tools.
  • Steep Learning Curve: Configuring and using gospider may require a certain level of technical expertise, which could be a barrier for some users.
  • Potential for Abuse: Like any web crawler, gospider could potentially be used for malicious purposes, such as scraping data without permission or overwhelming web servers with excessive requests.
  • Dependency on Go: The project is written in Go, which may not be the preferred language for all users, and may require additional setup and configuration for those not familiar with the language.

Code Examples

// Example 1: Basic web crawling
package main

import (
    "fmt"
    "github.com/jaeles-project/gospider"
)

func main() {
    spider := gospider.New()
    spider.Start("https://example.com")

    for result := range spider.Results {
        fmt.Println(result.URL)
    }
}

This code illustrates the basic idea: crawl https://example.com and print the URLs of the discovered pages. The gospider.New and Results identifiers used throughout these examples are illustrative; Gospider is primarily driven from the command line, so check the project source for the exact package API before building against it.

// Example 2: Customizing the crawling process
package main

import (
    "fmt"

    "github.com/jaeles-project/gospider"
)

func main() {
    spider := gospider.New()
    spider.Depth = 2
    spider.Concurrency = 10
    spider.Filters = []string{"*.jpg", "*.png"}
    spider.Start("https://example.com")

    for result := range spider.Results {
        // Process the crawled data
        fmt.Println(result.URL)
    }
}

This example shows how to customize the gospider configuration, such as setting the crawling depth, concurrency level, and URL filters.

// Example 3: Extracting data from web pages
package main

import (
    "fmt"
    "github.com/jaeles-project/gospider"
)

func main() {
    spider := gospider.New()
    spider.Extractor = func(result *gospider.Result) {
        fmt.Println("Title:", result.Title)
        fmt.Println("Description:", result.Description)
    }
    spider.Start("https://example.com")
}

This code demonstrates how to use the gospider extractor functionality to extract specific data, such as the title and description, from the crawled web pages.

Getting Started

To get started with gospider, follow these steps:

  1. Install Go on your system if you haven't already. You can download it from the official Go website: https://golang.org/dl/.

  2. Add gospider to your Go module as a dependency (run go mod init first if your project is not yet a module):

    go get github.com/jaeles-project/gospider

    To install only the command-line tool, use go install github.com/jaeles-project/gospider@latest as shown in the README's Installation section below.
  3. Create a new Go file (e.g., main.go) and import the gospider package:

    package main
    
    import (
        "fmt"
        "github.com/jaeles-project/gospider"
    )
    
  4. Initialize a new gospider instance and start the crawling process, as in the sketch below.
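Putting the steps together, a complete main.go might look like the following. This is a minimal sketch that follows the illustrative gospider.New / Start / Results API used in the Code Examples section above; those identifiers are assumptions rather than a verified package API, so check the project source for the current one.

// main.go - minimal crawl-and-print sketch.
// NOTE: gospider.New, Start, and Results mirror the illustrative API used
// earlier on this page; treat them as assumptions, not a verified API.
package main

import (
    "fmt"

    "github.com/jaeles-project/gospider"
)

func main() {
    spider := gospider.New()

    // Run the crawl in the background so results can be read as they arrive.
    go spider.Start("https://example.com")

    // Print every discovered URL until the Results channel is closed.
    for result := range spider.Results {
        fmt.Println(result.URL)
    }
}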

Competitor Comparisons

katana: A next-generation crawling and spidering framework.

Pros of Katana

  • More advanced crawling capabilities, including JavaScript rendering and dynamic content extraction
  • Better performance and scalability for large-scale web crawling tasks
  • Extensive configuration options and customizable output formats

Cons of Katana

  • Steeper learning curve due to more complex configuration options
  • Potentially higher resource consumption for advanced features

Code Comparison

GoSpider:

crawler := gospider.NewCrawler(
    gospider.WithConcurrency(10),
    gospider.WithDepth(3),
    gospider.WithIgnoreRobotsTxt(true),
)

Katana:

crawler, err := katana.New(
    katana.WithConcurrency(10),
    katana.WithMaxDepth(3),
    katana.WithJSRendering(true),
    katana.WithCustomHeaders(map[string]string{"User-Agent": "Katana"}),
)

Both tools are web crawlers written in Go, but Katana offers more advanced features and configuration options. GoSpider is simpler to use and may be sufficient for basic crawling tasks, while Katana is better suited for complex, large-scale web crawling projects that require JavaScript rendering and dynamic content extraction. The code comparison shows that Katana provides more granular control over the crawling process, including JavaScript rendering and custom headers, which are not available in the GoSpider example.

hakrawler: Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application.

Pros of hakrawler

  • Lightweight and fast, with minimal dependencies
  • Supports custom headers and cookies for authenticated crawling
  • Offers flexible output options (JSON, plain text, etc.)

Cons of hakrawler

  • Less feature-rich compared to gospider
  • Limited configuration options for crawl depth and scope
  • May miss some dynamic content that gospider can detect

Code Comparison

hakrawler:

func crawl(url string, depth int, c *colly.Collector) {
    c.Visit(url)
}

gospider:

func crawl(url string, depth int, c *colly.Collector) {
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        c.Visit(e.Request.AbsoluteURL(link))
    })
    c.Visit(url)
}

The code snippets show that gospider implements more advanced crawling logic, including recursive link following, while hakrawler's approach is simpler.

Both tools are useful for web crawling and reconnaissance, but gospider offers more advanced features and customization options. hakrawler excels in simplicity and speed, making it suitable for quick scans. The choice between the two depends on the specific requirements of the task at hand, with gospider being more suitable for comprehensive scans and hakrawler for rapid initial reconnaissance.

waybackurls: Fetch all the URLs that the Wayback Machine knows about for a domain.

Pros of waybackurls

  • Lightweight and focused on a single task: fetching URLs from the Wayback Machine
  • Simple to use with minimal configuration required
  • Can be easily integrated into other tools or scripts

Cons of waybackurls

  • Limited functionality compared to gospider's broader feature set
  • Doesn't perform active crawling or spidering of websites
  • Lacks advanced filtering options for retrieved URLs

Code comparison

waybackurls:

func getWaybackURLs(domain string, results chan<- string) {
    resp, err := http.Get(fmt.Sprintf("http://web.archive.org/cdx/search/cdx?url=%s/*&output=json&fl=original&collapse=urlkey", domain))
    if err != nil {
        return
    }
    defer resp.Body.Close()
    // ... (processing and sending results)
}

gospider:

func (s *Spider) Start() error {
    for _, site := range s.C.Sites {
        go func(site string) {
            s.crawl(site)
        }(site)
    }
    s.wait()
    return nil
}

The code snippets highlight the different approaches: waybackurls focuses on retrieving URLs from the Wayback Machine, while gospider implements a more complex crawling mechanism. gospider offers broader functionality for web crawling and information gathering, whereas waybackurls is a specialized tool for accessing historical URL data.

gau: Fetch known URLs from AlienVault's Open Threat Exchange, the Wayback Machine, and Common Crawl.

Pros of gau

  • Faster execution due to its focus on URL discovery
  • Simpler to use with fewer configuration options
  • Integrates well with other tools in a pipeline

Cons of gau

  • Less comprehensive crawling capabilities
  • Fewer customization options for specific use cases
  • Limited built-in filtering options

Code Comparison

gau:

func main() {
    urls := make(chan string)
    var wg sync.WaitGroup
    for i := 0; i < *threads; i++ {
        wg.Add(1)
        go func() {
            for url := range urls {
                process(url)
            }
            wg.Done()
        }()
    }
}

gospider:

func main() {
    crawler := spider.New(opts)
    crawler.Start()
    for _, result := range crawler.Results {
        fmt.Println(result)
    }
}

gau focuses on URL discovery and uses a simple goroutine-based approach for processing URLs. gospider, on the other hand, provides a more comprehensive crawling solution with a dedicated crawler object and additional features.

Both tools serve different purposes within web crawling and URL discovery. gau is better suited for quick URL enumeration, while gospider offers more advanced crawling capabilities and customization options. The choice between them depends on the specific requirements of the task at hand.

ffuf: Fast web fuzzer written in Go.

Pros of ffuf

  • Faster performance for fuzzing tasks
  • More flexible configuration options
  • Supports multiple output formats (JSON, CSV, etc.)

Cons of ffuf

  • Limited to fuzzing and doesn't offer web crawling capabilities
  • Requires more manual setup for complex scanning scenarios

Code Comparison

ffuf:

func main() {
    flag.Parse()
    if err := ffuf.New().Run(); err != nil {
        fmt.Printf("\n[ERR] %s\n", err)
        os.Exit(1)
    }
}

gospider:

func main() {
    flag.Parse()
    core.Banner()
    if err := core.Run(); err != nil {
        log.Fatal(err)
    }
}

Summary

ffuf is a fast web fuzzer focused on performance and flexibility, while gospider is a more comprehensive web spider and crawler. ffuf excels in targeted fuzzing tasks with various configuration options, but lacks the broader web crawling capabilities of gospider. gospider offers a more all-in-one solution for web reconnaissance but may not match ffuf's speed in specific fuzzing scenarios. The choice between the two depends on the specific requirements of the task at hand.

gobuster: Directory/File, DNS and VHost busting tool written in Go.

Pros of gobuster

  • More focused on directory and DNS enumeration
  • Supports multiple wordlists and file extensions
  • Offers wildcard detection to reduce false positives

Cons of gobuster

  • Less versatile in terms of web crawling capabilities
  • Limited output formats compared to gospider
  • Lacks some advanced features like JavaScript parsing

Code comparison

gospider:

func (s *Spider) Start() error {
    for _, site := range s.C.Sites {
        go func(url string) {
            s.crawl(url)
        }(site)
    }
    return nil
}

gobuster:

func (d *DNSBuster) Run(ctx context.Context) error {
    d.resultChan = make(chan Result)
    d.errorChan = make(chan error)
    d.wildcardChan = make(chan string)
    return d.process(ctx)
}

Key differences

  • gospider is designed for broader web crawling and information gathering
  • gobuster focuses on specific enumeration tasks (directory, DNS, vhost)
  • gospider offers more extensive output options and data extraction
  • gobuster provides better control over enumeration parameters

Both tools are valuable for different aspects of web reconnaissance and penetration testing. gospider excels in comprehensive crawling and data extraction, while gobuster is more specialized for targeted enumeration tasks.


README

GoSpider

GoSpider - Fast web spider written in Go

Painlessly integrate Gospider into your recon workflow? This project was part of the Osmedeus Engine; check out how it was integrated at @OsmedeusEngine.

Installation

Go install

GO111MODULE=on go install github.com/jaeles-project/gospider@latest

Docker

# Clone the repo
git clone https://github.com/jaeles-project/gospider.git
# Build the container
docker build -t gospider:latest gospider
# Run the container
docker run -t gospider -h

Features

  • Fast web crawling
  • Brute force and parse sitemap.xml
  • Parse robots.txt
  • Generate and verify link from JavaScript files
  • Link Finder
  • Find AWS-S3 from response source
  • Find subdomains from response source (the sketch after this list illustrates the kind of pattern matching behind these two features)
  • Get URLs from Wayback Machine, Common Crawl, Virus Total, Alien Vault
  • Format output easy to Grep
  • Support Burp input
  • Crawl multiple sites in parallel
  • Random mobile/web User-Agent
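
To make the "Find AWS-S3 from response source" and "Find subdomains from response source" features concrete, the sketch below shows the general kind of pattern matching a crawler applies to response bodies. The regular expressions and the example.com target are simplified illustrations, not gospider's actual patterns.

package main

import (
    "fmt"
    "regexp"
)

// Simplified, illustrative patterns; gospider's real extraction is more thorough.
var (
    s3Pattern  = regexp.MustCompile(`[a-z0-9.\-]+\.s3(?:[.\-][a-z0-9\-]+)?\.amazonaws\.com`)
    subPattern = regexp.MustCompile(`[a-zA-Z0-9][a-zA-Z0-9\-]*\.example\.com`)
)

func main() {
    // A fake response body standing in for a crawled page.
    body := `<a href="https://assets.example.com/app.js"></a>
<script src="https://my-bucket.s3.amazonaws.com/static/main.js"></script>`

    for _, m := range s3Pattern.FindAllString(body, -1) {
        fmt.Println("[aws-s3]", m)
    }
    for _, m := range subPattern.FindAllString(body, -1) {
        fmt.Println("[subdomains]", m)
    }
}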

Showcases

(asciinema demo recording)

Usage

Fast web spider written in Go - v1.1.5 by @thebl4ckturtle & @j3ssiejjj

Usage:
  gospider [flags]

Flags:
  -s, --site string               Site to crawl
  -S, --sites string              Site list to crawl
  -p, --proxy string              Proxy (Ex: http://127.0.0.1:8080)
  -o, --output string             Output folder
  -u, --user-agent string         User Agent to use
                                  	web: random web user-agent
                                  	mobi: random mobile user-agent
                                  	or you can set your special user-agent (default "web")
      --cookie string             Cookie to use (testA=a; testB=b)
  -H, --header stringArray        Header to use (Use multiple flag to set multiple header)
      --burp string               Load headers and cookie from burp raw http request
      --blacklist string          Blacklist URL Regex
      --whitelist string          Whitelist URL Regex
      --whitelist-domain string   Whitelist Domain
  -t, --threads int               Number of threads (Run sites in parallel) (default 1)
  -c, --concurrent int            The number of the maximum allowed concurrent requests of the matching domains (default 5)
  -d, --depth int                 MaxDepth limits the recursion depth of visited URLs. (Set it to 0 for infinite recursion) (default 1)
  -k, --delay int                 Delay is the duration to wait before creating a new request to the matching domains (second)
  -K, --random-delay int          RandomDelay is the extra randomized duration to wait added to Delay before creating a new request (second)
  -m, --timeout int               Request timeout (second) (default 10)
  -B, --base                      Disable all and only use HTML content
      --js                        Enable linkfinder in javascript file (default true)
      --subs                      Include subdomains
      --sitemap                   Try to crawl sitemap.xml
      --robots                    Try to crawl robots.txt (default true)
  -a, --other-source              Find URLs from 3rd party (Archive.org, CommonCrawl.org, VirusTotal.com, AlienVault.com)
  -w, --include-subs              Include subdomains crawled from 3rd party. Default is main domain
  -r, --include-other-source      Also include other-source's urls (still crawl and request)
      --debug                     Turn on debug mode
      --json                      Enable JSON output
  -v, --verbose                   Turn on verbose
  -l, --length                    Turn on length
  -L, --filter-length             Turn on length filter
  -R, --raw                       Turn on raw
  -q, --quiet                     Suppress all the output and only show URL
      --no-redirect               Disable redirect
      --version                   Check version
  -h, --help                      help for gospider
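
The --json flag above makes the output easy to consume from other programs. The sketch below runs an installed gospider binary from Go and decodes its output; it assumes gospider is on PATH and that, with --json, each output line is a self-contained JSON object (it decodes into a generic map so no particular field names are assumed).

package main

import (
    "bufio"
    "encoding/json"
    "fmt"
    "log"
    "os/exec"
)

func main() {
    // Flags taken from the usage above: single site, depth 1, JSON output.
    cmd := exec.Command("gospider", "-s", "https://example.com", "-d", "1", "--json")

    stdout, err := cmd.StdoutPipe()
    if err != nil {
        log.Fatal(err)
    }
    if err := cmd.Start(); err != nil {
        log.Fatal(err)
    }

    // Assumption: with --json, each line describes one finding as a JSON object.
    scanner := bufio.NewScanner(stdout)
    for scanner.Scan() {
        var record map[string]interface{}
        if err := json.Unmarshal(scanner.Bytes(), &record); err != nil {
            fmt.Println(scanner.Text()) // fall back to the raw line
            continue
        }
        fmt.Printf("%v\n", record)
    }

    if err := cmd.Wait(); err != nil {
        log.Fatal(err)
    }
}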

Example commands

Quiet output

gospider -q -s "https://google.com/"

Run with single site

gospider -s "https://google.com/" -o output -c 10 -d 1

Run with site list

gospider -S sites.txt -o output -c 10 -d 1

Run with 20 sites at the same time with 10 bot each site

gospider -S sites.txt -o output -c 10 -d 1 -t 20

Also get URLs from 3rd party (Archive.org, CommonCrawl.org, VirusTotal.com, AlienVault.com)

gospider -s "https://google.com/" -o output -c 10 -d 1 --other-source

Also get URLs from 3rd party (Archive.org, CommonCrawl.org, VirusTotal.com, AlienVault.com) and include subdomains

gospider -s "https://google.com/" -o output -c 10 -d 1 --other-source --include-subs

Use custom header/cookies

gospider -s "https://google.com/" -o output -c 10 -d 1 --other-source -H "Accept: */*" -H "Test: test" --cookie "testA=a; testB=b"

gospider -s "https://google.com/" -o output -c 10 -d 1 --other-source --burp burp_req.txt
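
For the --burp option, burp_req.txt is a raw HTTP request (for example, one saved out of Burp Suite). A minimal file might look like the following; the host, header, and cookie values are placeholders, and gospider picks up the headers and cookies from it as described in the flag list above.

GET / HTTP/1.1
Host: google.com
User-Agent: Mozilla/5.0
Accept: */*
Cookie: testA=a; testB=b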

Blacklist URL/file extensions.

P/S: gospider blacklists .(jpg|jpeg|gif|css|tif|tiff|png|ttf|woff|woff2|ico) by default

gospider -s "https://google.com/" -o output -c 10 -d 1 --blacklist ".(woff|pdf)"

Show response length and blacklist responses by length.

gospider -s "https://google.com/" -o output -c 10 -d 1 --length --filter-length "6871,24432"   

License

Gospider is made with ♥ by @j3ssiejjj & @thebl4ckturtle and it is released under the MIT license.

Donation

Gospider accepts donations via PayPal.