hakrawler
Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application
Top Related Projects
- waybackurls: Fetch all the URLs that the Wayback Machine knows about for a domain
- Katana: A next-generation crawling and spidering framework
- gospider: Fast web spider written in Go
- gau: Fetch known URLs from AlienVault's Open Threat Exchange, the Wayback Machine, and Common Crawl
- Photon: Incredibly fast crawler designed for OSINT
Quick Overview
Hakrawler is a fast web crawler designed for easy, quick discovery of endpoints and assets within a web application. It's written in Go and can be used for reconnaissance during web application security assessments or bug bounty hunting.
Pros
- Fast and efficient, capable of crawling large websites quickly
- Supports various output formats (JSON, plain text) for easy integration with other tools
- Extracts both URLs and JavaScript file locations from the pages it crawls
- Customizable with options for depth, threads, and domain scope
Cons
- May miss some dynamically generated content or complex JavaScript-based navigation
- Can potentially overload target servers if not used carefully
- Limited built-in filtering options compared to some more comprehensive crawlers
- Requires manual analysis of results for identifying security issues
Getting Started
- Install Go on your system if not already installed.
- Install hakrawler:
go install github.com/hakluke/hakrawler@latest
- Basic usage:
echo https://example.com | hakrawler
- More advanced usage with options:
echo https://example.com | hakrawler -d 3 -t 20 -h "User-Agent: MyCustomCrawler" -insecure
This command crawls https://example.com with a depth of 3, using 20 threads, a custom User-Agent header, and ignoring SSL certificate errors.
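Under the hood these flags map onto settings of the Gocolly library that hakrawler is built on (see the README below). The following is a rough, self-contained sketch of how the depth, thread count, custom header, and TLS options might be wired up with colly; it is illustrative only, not hakrawler's actual source.

package main

import (
    "crypto/tls"
    "fmt"
    "net/http"

    "github.com/gocolly/colly/v2"
)

func main() {
    // Depth and parallelism, mirroring -d 3 and -t 20 from the example above.
    c := colly.NewCollector(
        colly.MaxDepth(3),
        colly.Async(true),
    )
    c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 20})

    // Custom header, mirroring -h "User-Agent: MyCustomCrawler".
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", "MyCustomCrawler")
    })

    // Skip TLS certificate verification, mirroring -insecure.
    c.WithTransport(&http.Transport{
        TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
    })

    // Print every discovered link as an absolute URL.
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        fmt.Println(e.Request.AbsoluteURL(e.Attr("href")))
    })

    c.Visit("https://example.com")
    c.Wait()
}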
Competitor Comparisons
waybackurls: Fetch all the URLs that the Wayback Machine knows about for a domain
Pros of waybackurls
- Simpler and more focused tool, specifically for fetching URLs from the Wayback Machine
- Faster execution for its specific task
- Lightweight with minimal dependencies
Cons of waybackurls
- Limited functionality compared to hakrawler's broader feature set
- Lacks the ability to crawl websites directly
- No built-in filtering or pattern matching capabilities
Code Comparison
waybackurls:
resp, err := http.Get(fmt.Sprintf("http://web.archive.org/cdx/search/cdx?url=%s/*&output=json&fl=original&collapse=urlkey", url))
if err != nil {
    return nil, err
}
defer resp.Body.Close()
hakrawler:
c := colly.NewCollector(
    colly.UserAgent("Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"),
    colly.MaxDepth(depth),
    colly.IgnoreRobotsTxt(),
)
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    link := e.Attr("href")
    absoluteURL := e.Request.AbsoluteURL(link)
    // ... (additional processing)
})
Both tools serve different purposes within the realm of URL discovery and web crawling. waybackurls is a focused tool for retrieving historical URLs from the Wayback Machine, while hakrawler is a more comprehensive web crawler with additional features for discovering and analyzing web content.
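To make the waybackurls side concrete, here is a self-contained sketch of the whole CDX lookup, using the same endpoint and query parameters as the snippet above. The fetchWaybackURLs helper is a name chosen for this example; error handling and pagination are simplified, and the parsing is illustrative rather than waybackurls' actual code.

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// fetchWaybackURLs queries the Wayback Machine CDX API and returns the
// archived URLs it knows about for the given domain.
func fetchWaybackURLs(domain string) ([]string, error) {
    resp, err := http.Get(fmt.Sprintf(
        "http://web.archive.org/cdx/search/cdx?url=%s/*&output=json&fl=original&collapse=urlkey", domain))
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    // With output=json the API returns an array of rows; the first row is a header.
    var rows [][]string
    if err := json.NewDecoder(resp.Body).Decode(&rows); err != nil {
        return nil, err
    }

    var urls []string
    for i, row := range rows {
        if i == 0 || len(row) == 0 {
            continue // skip the header row and any empty rows
        }
        urls = append(urls, row[0])
    }
    return urls, nil
}

func main() {
    urls, err := fetchWaybackURLs("example.com")
    if err != nil {
        panic(err)
    }
    for _, u := range urls {
        fmt.Println(u)
    }
}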
Katana: A next-generation crawling and spidering framework.
Pros of Katana
- More advanced crawling capabilities, including JavaScript rendering and form submission
- Supports multiple output formats (JSON, Markdown, etc.)
- Actively maintained with frequent updates and new features
Cons of Katana
- Higher resource consumption due to more complex functionality
- Steeper learning curve for advanced features
- May be overkill for simple crawling tasks
Code Comparison
Hakrawler (simple usage):
cat urls.txt | hakrawler
Katana (simple usage):
katana -u https://example.com
Key Differences
- Hakrawler is lightweight and focused on speed, while Katana offers more comprehensive crawling features
- Katana provides better support for modern web applications with JavaScript rendering
- Hakrawler is easier to use for basic tasks, while Katana offers more customization options
Use Cases
- Hakrawler: Quick reconnaissance and simple crawling tasks
- Katana: In-depth web application scanning and complex crawling scenarios
Both tools have their merits, and the choice depends on the specific requirements of the task at hand. Hakrawler excels in simplicity and speed, while Katana offers more advanced features for thorough web application analysis.
Gospider - Fast web spider written in Go
Pros of gospider
- More comprehensive crawling capabilities, including JavaScript rendering
- Supports multiple output formats (JSON, Markdown, CSV)
- Includes built-in modules for extracting specific types of information (e.g., subdomains, AWS S3 buckets)
Cons of gospider
- Potentially slower due to more extensive crawling and JavaScript rendering
- May require more system resources for larger scans
- Steeper learning curve due to more advanced features and options
Code comparison
hakrawler:
func crawl(url string, depth int) {
    if depth <= 0 {
        return
    }
    // Crawl logic here
}
gospider:
func (s *Spider) Start() error {
    for _, u := range s.C.URLs {
        s.wg.Add(1)
        go func(url string) {
            defer s.wg.Done()
            s.crawl(url)
        }(u)
    }
    s.wg.Wait()
    return nil
}
Both projects use Go for web crawling, but gospider implements a more complex concurrent crawling mechanism using goroutines and wait groups, while hakrawler uses a simpler recursive approach.
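The goroutine-and-WaitGroup pattern described above is straightforward to reproduce. Below is a minimal, self-contained sketch (not taken from either project) that also bounds concurrency with a buffered channel, similar in spirit to hakrawler's -t thread limit; crawlAll and maxWorkers are names chosen for this example.

package main

import (
    "fmt"
    "net/http"
    "sync"
)

// crawlAll fetches every URL concurrently, but never runs more than
// maxWorkers requests at once.
func crawlAll(urls []string, maxWorkers int) {
    var wg sync.WaitGroup
    sem := make(chan struct{}, maxWorkers) // counting semaphore

    for _, u := range urls {
        wg.Add(1)
        go func(url string) {
            defer wg.Done()
            sem <- struct{}{}        // acquire a slot
            defer func() { <-sem }() // release it when done

            resp, err := http.Get(url)
            if err != nil {
                fmt.Println("error:", err)
                return
            }
            resp.Body.Close()
            fmt.Println(resp.StatusCode, url)
        }(u)
    }
    wg.Wait()
}

func main() {
    crawlAll([]string{"https://example.com", "https://example.org"}, 8)
}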
gau: Fetch known URLs from AlienVault's Open Threat Exchange, the Wayback Machine, and Common Crawl.
Pros of gau
- Faster execution due to concurrent fetching from multiple sources
- Supports more data sources, including Wayback Machine, AlienVault OTX, and Common Crawl
- Provides options for custom output formatting
Cons of gau
- Less flexible in terms of crawling depth and following links
- May produce more noise in results due to its broader data sources
- Lacks built-in filtering options for specific file types or patterns
Code Comparison
hakrawler:
func crawl(url string, depth int, source string) {
    if depth >= *maxDepth {
        return
    }
    // ... (crawling logic)
}
gau:
func getUrls(domain string, providers []string) {
    var wg sync.WaitGroup
    for _, provider := range providers {
        wg.Add(1)
        go func(provider string) {
            defer wg.Done()
            // ... (fetching logic for each provider)
        }(provider)
    }
    wg.Wait()
}
The code snippets highlight the different approaches: hakrawler uses recursive crawling with depth control, while gau focuses on concurrent fetching from multiple providers.
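gau's multi-provider approach boils down to a fan-in pattern: each provider fetches concurrently and the results are merged into one deduplicated set. The sketch below uses stub providers and invented names (provider, fetchAll); it is purely illustrative, not gau's actual code.

package main

import (
    "fmt"
    "sync"
)

// provider is any function that returns known URLs for a domain,
// e.g. a Wayback Machine, OTX, or Common Crawl client (stubbed here).
type provider func(domain string) []string

// fetchAll runs every provider concurrently and merges their results,
// dropping duplicates with a simple set.
func fetchAll(domain string, providers []provider) []string {
    var wg sync.WaitGroup
    results := make(chan string)

    for _, p := range providers {
        wg.Add(1)
        go func(p provider) {
            defer wg.Done()
            for _, u := range p(domain) {
                results <- u
            }
        }(p)
    }

    // Close the channel once every provider has finished.
    go func() {
        wg.Wait()
        close(results)
    }()

    seen := make(map[string]bool)
    var urls []string
    for u := range results {
        if !seen[u] {
            seen[u] = true
            urls = append(urls, u)
        }
    }
    return urls
}

func main() {
    stub := func(domain string) []string {
        return []string{"https://" + domain + "/", "https://" + domain + "/login"}
    }
    for _, u := range fetchAll("example.com", []provider{stub, stub}) {
        fmt.Println(u)
    }
}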
Photon: Incredibly fast crawler designed for OSINT.
Pros of Photon
- More comprehensive crawling capabilities, including JavaScript parsing and DNS enumeration
- Supports multiple output formats (JSON, CSV, TXT)
- Includes additional features like parameter discovery and intelligent error handling
Cons of Photon
- Slower performance compared to hakrawler, especially for large-scale scans
- More complex setup and usage, requiring additional dependencies
- Less frequent updates and maintenance
Code Comparison
Photon:
def photon(url, seeds, level, threads, delay, timeout, cook, headers):
    # ... (initialization code)
    for url in urls:
        # ... (URL processing)
    # ... (output handling)
hakrawler:
func crawl(url string, depth int, source string) {
    // ... (initialization code)
    for _, link := range links {
        // ... (link processing)
    }
    // ... (output handling)
}
Both tools use similar approaches for crawling, but Photon's implementation in Python allows for more flexibility and additional features, while hakrawler's Go implementation focuses on speed and simplicity.
README
Hakrawler
Fast golang web crawler for gathering URLs and JavaScript file locations. This is basically a simple implementation of the awesome Gocolly library.
Example usages
Single URL:
echo https://google.com | hakrawler
Multiple URLs:
cat urls.txt | hakrawler
Timeout for each line of stdin after 5 seconds:
cat urls.txt | hakrawler -timeout 5
Send all requests through a proxy:
cat urls.txt | hakrawler -proxy http://localhost:8080
Include subdomains:
echo https://google.com | hakrawler -subs
Note: a common issue is that the tool returns no URLs. This usually happens when a domain is specified (https://example.com), but it redirects to a subdomain (https://www.example.com). The subdomain is not included in the scope, so no URLs are printed. To overcome this, either specify the final URL in the redirect chain or use the -subs option to include subdomains.
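The scope rule behind this behaviour amounts to a hostname comparison. The helper below (inScope) is an illustration of the idea only, not hakrawler's actual implementation.

package main

import (
    "fmt"
    "net/url"
    "strings"
)

// inScope reports whether candidate belongs to the crawl scope defined by
// target. Without includeSubs, only an exact host match counts, which is why
// a redirect from example.com to www.example.com yields no results.
func inScope(target, candidate string, includeSubs bool) bool {
    t, err := url.Parse(target)
    if err != nil {
        return false
    }
    c, err := url.Parse(candidate)
    if err != nil {
        return false
    }
    if c.Hostname() == t.Hostname() {
        return true
    }
    if includeSubs {
        return strings.HasSuffix(c.Hostname(), "."+t.Hostname())
    }
    return false
}

func main() {
    fmt.Println(inScope("https://example.com", "https://www.example.com/a", false)) // false
    fmt.Println(inScope("https://example.com", "https://www.example.com/a", true))  // true
}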
Example tool chain
Get all subdomains of google, find the ones that respond to http(s), crawl them all.
echo google.com | haktrails subdomains | httpx | hakrawler
Installation
Normal Install
First, you'll need to install go.
Then run this command to download + compile hakrawler:
go install github.com/hakluke/hakrawler@latest
You can now run ~/go/bin/hakrawler. If you'd like to just run hakrawler without the full path, you'll need to export PATH="~/go/bin/:$PATH". You can also add this line to your ~/.bashrc file if you'd like this to persist.
Docker Install (from dockerhub)
echo https://www.google.com | docker run --rm -i hakluke/hakrawler:v2 -subs
Local Docker Install
It's much easier to use the dockerhub method above, but if you'd prefer to run it locally:
git clone https://github.com/hakluke/hakrawler
cd hakrawler
sudo docker build -t hakluke/hakrawler .
sudo docker run --rm -i hakluke/hakrawler --help
Kali Linux: Using apt
Note: This will install an older version of hakrawler without all the features, and it may be buggy. I recommend using one of the other methods.
sudo apt install hakrawler
Then, to run hakrawler:
echo https://www.google.com | hakrawler -subs
Command-line options
Usage of hakrawler:
  -d int
        Depth to crawl. (default 2)
  -dr
        Disable following HTTP redirects.
  -h string
        Custom headers separated by two semi-colons. E.g. -h "Cookie: foo=bar;;Referer: http://example.com/"
  -i    Only crawl inside path
  -insecure
        Disable TLS verification.
  -json
        Output as JSON.
  -proxy string
        Proxy URL. E.g. -proxy http://127.0.0.1:8080
  -s    Show the source of URL based on where it was found. E.g. href, form, script, etc.
  -size int
        Page size limit, in KB. (default -1)
  -subs
        Include subdomains for crawling.
  -t int
        Number of threads to utilise. (default 8)
  -timeout int
        Maximum time to crawl each URL from stdin, in seconds. (default -1)
  -u    Show only unique urls.
  -w    Show at which link the URL is found.
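The -h option above packs several headers into a single string separated by two semi-colons. The sketch below shows one way such a string could be split into header name/value pairs; parseHeaders is a name chosen for this example, and the logic is illustrative only, not hakrawler's exact parsing.

package main

import (
    "fmt"
    "strings"
)

// parseHeaders turns `-h "Cookie: foo=bar;;Referer: http://example.com/"`
// style input into a name -> value map.
func parseHeaders(raw string) map[string]string {
    headers := make(map[string]string)
    for _, h := range strings.Split(raw, ";;") {
        parts := strings.SplitN(h, ":", 2)
        if len(parts) != 2 {
            continue // skip malformed entries
        }
        headers[strings.TrimSpace(parts[0])] = strings.TrimSpace(parts[1])
    }
    return headers
}

func main() {
    for k, v := range parseHeaders("Cookie: foo=bar;;Referer: http://example.com/") {
        fmt.Printf("%s: %s\n", k, v)
    }
}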