hakrawler
Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application
Top Related Projects
- waybackurls: Fetch all the URLs that the Wayback Machine knows about for a domain
- Katana: A next-generation crawling and spidering framework
- gospider: Fast web spider written in Go
- gau: Fetch known URLs from AlienVault's Open Threat Exchange, the Wayback Machine, and Common Crawl
- Photon: Incredibly fast crawler designed for OSINT
Quick Overview
Hakrawler is a fast web crawler designed for easy, quick discovery of endpoints and assets within a web application. It's written in Go and can be used for reconnaissance during web application security assessments or bug bounty hunting.
Pros
- Fast and efficient, capable of crawling large websites quickly
- Supports various output formats (JSON, plain text) for easy integration with other tools
- Extracts both URLs and JavaScript file locations from the pages it crawls
- Customizable with options for depth, threads, and domain scope
Cons
- May miss some dynamically generated content or complex JavaScript-based navigation
- Can potentially overload target servers if not used carefully
- Limited built-in filtering options compared to some more comprehensive crawlers
- Requires manual analysis of results for identifying security issues
Getting Started
- Install Go on your system if not already installed.
- Install hakrawler:
go install github.com/hakluke/hakrawler@latest
- Basic usage:
echo https://example.com | hakrawler
- More advanced usage with options:
echo https://example.com | hakrawler -d 3 -t 20 -h "User-Agent: MyCustomCrawler" -insecure
This command crawls https://example.com with a depth of 3, using 20 threads, a custom User-Agent header, and ignoring SSL certificate errors.
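Under the hood these flags map onto settings of the Gocolly library that hakrawler is built on (see the README below). The following is a rough, self-contained sketch of how the depth, thread count, custom header, and TLS options might be wired up with colly; it is illustrative only, not hakrawler's actual source.

package main

import (
    "crypto/tls"
    "fmt"
    "net/http"

    "github.com/gocolly/colly/v2"
)

func main() {
    // Depth and parallelism, mirroring -d 3 and -t 20 from the example above.
    c := colly.NewCollector(
        colly.MaxDepth(3),
        colly.Async(true),
    )
    c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 20})

    // Custom header, mirroring -h "User-Agent: MyCustomCrawler".
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", "MyCustomCrawler")
    })

    // Skip TLS certificate verification, mirroring -insecure.
    c.WithTransport(&http.Transport{
        TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
    })

    // Print every discovered link as an absolute URL.
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        fmt.Println(e.Request.AbsoluteURL(e.Attr("href")))
    })

    c.Visit("https://example.com")
    c.Wait()
}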
Competitor Comparisons
waybackurls: Fetch all the URLs that the Wayback Machine knows about for a domain
Pros of waybackurls
- Simpler and more focused tool, specifically for fetching URLs from the Wayback Machine
- Faster execution for its specific task
- Lightweight with minimal dependencies
Cons of waybackurls
- Limited functionality compared to hakrawler's broader feature set
- Lacks the ability to crawl websites directly
- No built-in filtering or pattern matching capabilities
Code Comparison
waybackurls:
resp, err := http.Get(fmt.Sprintf("http://web.archive.org/cdx/search/cdx?url=%s/*&output=json&fl=original&collapse=urlkey", url))
if err != nil {
    return nil, err
}
defer resp.Body.Close()
hakrawler:
c := colly.NewCollector(
    colly.UserAgent("Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"),
    colly.MaxDepth(depth),
    colly.IgnoreRobotsTxt(),
)
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    link := e.Attr("href")
    absoluteURL := e.Request.AbsoluteURL(link)
    // ... (additional processing)
})
Both tools serve different purposes within the realm of URL discovery and web crawling. waybackurls is a focused tool for retrieving historical URLs from the Wayback Machine, while hakrawler is a more comprehensive web crawler with additional features for discovering and analyzing web content.
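To make the waybackurls side concrete, here is a self-contained sketch of the whole CDX lookup, using the same endpoint and query parameters as the snippet above. The fetchWaybackURLs helper is a name chosen for this example; error handling and pagination are simplified, and the parsing is illustrative rather than waybackurls' actual code.

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// fetchWaybackURLs queries the Wayback Machine CDX API and returns the
// archived URLs it knows about for the given domain.
func fetchWaybackURLs(domain string) ([]string, error) {
    resp, err := http.Get(fmt.Sprintf(
        "http://web.archive.org/cdx/search/cdx?url=%s/*&output=json&fl=original&collapse=urlkey", domain))
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    // With output=json the API returns an array of rows; the first row is a header.
    var rows [][]string
    if err := json.NewDecoder(resp.Body).Decode(&rows); err != nil {
        return nil, err
    }

    var urls []string
    for i, row := range rows {
        if i == 0 || len(row) == 0 {
            continue // skip the header row and any empty rows
        }
        urls = append(urls, row[0])
    }
    return urls, nil
}

func main() {
    urls, err := fetchWaybackURLs("example.com")
    if err != nil {
        panic(err)
    }
    for _, u := range urls {
        fmt.Println(u)
    }
}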
Katana: A next-generation crawling and spidering framework.
Pros of Katana
- More advanced crawling capabilities, including JavaScript rendering and form submission
- Supports multiple output formats (JSON, Markdown, etc.)
- Actively maintained with frequent updates and new features
Cons of Katana
- Higher resource consumption due to more complex functionality
- Steeper learning curve for advanced features
- May be overkill for simple crawling tasks
Code Comparison
Hakrawler (simple usage):
cat urls.txt | hakrawler
Katana (simple usage):
katana -u https://example.com
Key Differences
- Hakrawler is lightweight and focused on speed, while Katana offers more comprehensive crawling features
- Katana provides better support for modern web applications with JavaScript rendering
- Hakrawler is easier to use for basic tasks, while Katana offers more customization options
Use Cases
- Hakrawler: Quick reconnaissance and simple crawling tasks
- Katana: In-depth web application scanning and complex crawling scenarios
Both tools have their merits, and the choice depends on the specific requirements of the task at hand. Hakrawler excels in simplicity and speed, while Katana offers more advanced features for thorough web application analysis.
Gospider - Fast web spider written in Go
Pros of gospider
- More comprehensive crawling capabilities, including JavaScript rendering
- Supports multiple output formats (JSON, Markdown, CSV)
- Includes built-in modules for extracting specific types of information (e.g., subdomains, AWS S3 buckets)
Cons of gospider
- Potentially slower due to more extensive crawling and JavaScript rendering
- May require more system resources for larger scans
- Steeper learning curve due to more advanced features and options
Code comparison
hakrawler:
func crawl(url string, depth int) {
    if depth <= 0 {
        return
    }
    // Crawl logic here
}
gospider:
func (s *Spider) Start() error {
    for _, u := range s.C.URLs {
        s.wg.Add(1)
        go func(url string) {
            defer s.wg.Done()
            s.crawl(url)
        }(u)
    }
    s.wg.Wait()
    return nil
}
Both projects use Go for web crawling, but gospider implements a more complex concurrent crawling mechanism using goroutines and wait groups, while hakrawler uses a simpler recursive approach.
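The goroutine-and-WaitGroup pattern described above is straightforward to reproduce. Below is a minimal, self-contained sketch (not taken from either project) that also bounds concurrency with a buffered channel, similar in spirit to hakrawler's -t thread limit; crawlAll and maxWorkers are names chosen for this example.

package main

import (
    "fmt"
    "net/http"
    "sync"
)

// crawlAll fetches every URL concurrently, but never runs more than
// maxWorkers requests at once.
func crawlAll(urls []string, maxWorkers int) {
    var wg sync.WaitGroup
    sem := make(chan struct{}, maxWorkers) // counting semaphore

    for _, u := range urls {
        wg.Add(1)
        go func(url string) {
            defer wg.Done()
            sem <- struct{}{}        // acquire a slot
            defer func() { <-sem }() // release it when done

            resp, err := http.Get(url)
            if err != nil {
                fmt.Println("error:", err)
                return
            }
            resp.Body.Close()
            fmt.Println(resp.StatusCode, url)
        }(u)
    }
    wg.Wait()
}

func main() {
    crawlAll([]string{"https://example.com", "https://example.org"}, 8)
}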
gau: Fetch known URLs from AlienVault's Open Threat Exchange, the Wayback Machine, and Common Crawl.
Pros of gau
- Faster execution due to concurrent fetching from multiple sources
- Supports more data sources, including Wayback Machine, AlienVault OTX, and Common Crawl
- Provides options for custom output formatting
Cons of gau
- Less flexible in terms of crawling depth and following links
- May produce more noise in results due to its broader data sources
- Lacks built-in filtering options for specific file types or patterns
Code Comparison
hakrawler:
func crawl(url string, depth int, source string) {
    if depth >= *maxDepth {
        return
    }
    // ... (crawling logic)
}
gau:
func getUrls(domain string, providers []string) {
    var wg sync.WaitGroup
    for _, provider := range providers {
        wg.Add(1)
        go func(provider string) {
            defer wg.Done()
            // ... (fetching logic for each provider)
        }(provider)
    }
    wg.Wait()
}
The code snippets highlight the different approaches: hakrawler uses recursive crawling with depth control, while gau focuses on concurrent fetching from multiple providers.
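gau's multi-provider approach boils down to a fan-in pattern: each provider fetches concurrently and the results are merged into one deduplicated set. The sketch below uses stub providers and invented names (provider, fetchAll); it is purely illustrative, not gau's actual code.

package main

import (
    "fmt"
    "sync"
)

// provider is any function that returns known URLs for a domain,
// e.g. a Wayback Machine, OTX, or Common Crawl client (stubbed here).
type provider func(domain string) []string

// fetchAll runs every provider concurrently and merges their results,
// dropping duplicates with a simple set.
func fetchAll(domain string, providers []provider) []string {
    var wg sync.WaitGroup
    results := make(chan string)

    for _, p := range providers {
        wg.Add(1)
        go func(p provider) {
            defer wg.Done()
            for _, u := range p(domain) {
                results <- u
            }
        }(p)
    }

    // Close the channel once every provider has finished.
    go func() {
        wg.Wait()
        close(results)
    }()

    seen := make(map[string]bool)
    var urls []string
    for u := range results {
        if !seen[u] {
            seen[u] = true
            urls = append(urls, u)
        }
    }
    return urls
}

func main() {
    stub := func(domain string) []string {
        return []string{"https://" + domain + "/", "https://" + domain + "/login"}
    }
    for _, u := range fetchAll("example.com", []provider{stub, stub}) {
        fmt.Println(u)
    }
}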
Photon: Incredibly fast crawler designed for OSINT.
Pros of Photon
- More comprehensive crawling capabilities, including JavaScript parsing and DNS enumeration
- Supports multiple output formats (JSON, CSV, TXT)
- Includes additional features like parameter discovery and intelligent error handling
Cons of Photon
- Slower performance compared to hakrawler, especially for large-scale scans
- More complex setup and usage, requiring additional dependencies
- Less frequent updates and maintenance
Code Comparison
Photon:
def photon(url, seeds, level, threads, delay, timeout, cook, headers):
    # ... (initialization code)
    for url in urls:
        # ... (URL processing)
    # ... (output handling)
hakrawler:
func crawl(url string, depth int, source string) {
    // ... (initialization code)
    for _, link := range links {
        // ... (link processing)
    }
    // ... (output handling)
}
Both tools use similar approaches for crawling, but Photon's implementation in Python allows for more flexibility and additional features, while hakrawler's Go implementation focuses on speed and simplicity.
README
Hakrawler
Fast golang web crawler for gathering URLs and JavaScript file locations. This is basically a simple implementation of the awesome Gocolly library.
Example usages
Single URL:
echo https://google.com | hakrawler
Multiple URLs:
cat urls.txt | hakrawler
Timeout for each line of stdin after 5 seconds:
cat urls.txt | hakrawler -timeout 5
Send all requests through a proxy:
cat urls.txt | hakrawler -proxy http://localhost:8080
Include subdomains:
echo https://google.com | hakrawler -subs
Note: a common issue is that the tool returns no URLs. This usually happens when a domain is specified (https://example.com), but it redirects to a subdomain (https://www.example.com). The subdomain is not included in the scope, so no URLs are printed. To overcome this, either specify the final URL in the redirect chain or use the -subs option to include subdomains.
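The scope rule behind this behaviour amounts to a hostname comparison. The helper below (inScope) is an illustration of the idea only, not hakrawler's actual implementation.

package main

import (
    "fmt"
    "net/url"
    "strings"
)

// inScope reports whether candidate belongs to the crawl scope defined by
// target. Without includeSubs, only an exact host match counts, which is why
// a redirect from example.com to www.example.com yields no results.
func inScope(target, candidate string, includeSubs bool) bool {
    t, err := url.Parse(target)
    if err != nil {
        return false
    }
    c, err := url.Parse(candidate)
    if err != nil {
        return false
    }
    if c.Hostname() == t.Hostname() {
        return true
    }
    if includeSubs {
        return strings.HasSuffix(c.Hostname(), "."+t.Hostname())
    }
    return false
}

func main() {
    fmt.Println(inScope("https://example.com", "https://www.example.com/a", false)) // false
    fmt.Println(inScope("https://example.com", "https://www.example.com/a", true))  // true
}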
Example tool chain
Get all subdomains of google, find the ones that respond to http(s), crawl them all.
echo google.com | haktrails subdomains | httpx | hakrawler
Installation
Normal Install
First, you'll need to install go.
Then run this command to download + compile hakrawler:
go install github.com/hakluke/hakrawler@latest
You can now run ~/go/bin/hakrawler. If you'd like to just run hakrawler without the full path, you'll need to export PATH="~/go/bin/:$PATH". You can also add this line to your ~/.bashrc file if you'd like this to persist.
Docker Install (from dockerhub)
echo https://www.google.com | docker run --rm -i hakluke/hakrawler:v2 -subs
Local Docker Install
It's much easier to use the dockerhub method above, but if you'd prefer to run it locally:
git clone https://github.com/hakluke/hakrawler
cd hakrawler
sudo docker build -t hakluke/hakrawler .
sudo docker run --rm -i hakluke/hakrawler --help
Kali Linux: Using apt
Note: This will install an older version of hakrawler without all the features, and it may be buggy. I recommend using one of the other methods.
sudo apt install hakrawler
Then, to run hakrawler:
echo https://www.google.com | hakrawler -subs
Command-line options
Usage of hakrawler:
  -d int
        Depth to crawl. (default 2)
  -dr
        Disable following HTTP redirects.
  -h string
        Custom headers separated by two semi-colons. E.g. -h "Cookie: foo=bar;;Referer: http://example.com/"
  -i    Only crawl inside path
  -insecure
        Disable TLS verification.
  -json
        Output as JSON.
  -proxy string
        Proxy URL. E.g. -proxy http://127.0.0.1:8080
  -s    Show the source of URL based on where it was found. E.g. href, form, script, etc.
  -size int
        Page size limit, in KB. (default -1)
  -subs
        Include subdomains for crawling.
  -t int
        Number of threads to utilise. (default 8)
  -timeout int
        Maximum time to crawl each URL from stdin, in seconds. (default -1)
  -u    Show only unique urls.
  -w    Show at which link the URL is found.
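The -h option above packs several headers into a single string separated by two semi-colons. The sketch below shows one way such a string could be split into header name/value pairs; parseHeaders is a name chosen for this example, and the logic is illustrative only, not hakrawler's exact parsing.

package main

import (
    "fmt"
    "strings"
)

// parseHeaders turns `-h "Cookie: foo=bar;;Referer: http://example.com/"`
// style input into a name -> value map.
func parseHeaders(raw string) map[string]string {
    headers := make(map[string]string)
    for _, h := range strings.Split(raw, ";;") {
        parts := strings.SplitN(h, ":", 2)
        if len(parts) != 2 {
            continue // skip malformed entries
        }
        headers[strings.TrimSpace(parts[0])] = strings.TrimSpace(parts[1])
    }
    return headers
}

func main() {
    for k, v := range parseHeaders("Cookie: foo=bar;;Referer: http://example.com/") {
        fmt.Printf("%s: %s\n", k, v)
    }
}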