crawler-user-agents
Syntactic patterns of HTTP user-agents used by bots / robots / crawlers / scrapers / spiders. pull-request welcome :star:
Top Related Projects
- octocat/Spoon-Knife: This repo is for demonstration purposes only.
- google/robotstxt: The repository contains Google's robots.txt parser and matcher as a C++ library (compliant to C++11).
- scrapy/scrapy: Scrapy, a fast high-level web crawling & scraping framework for Python.
Quick Overview
The monperrus/crawler-user-agents repository is a collection of user-agent patterns for various web crawlers and bots. This repository can be useful for developers who need to identify and handle requests from web crawlers and bots on their websites or applications.
Pros
- Comprehensive Collection: The repository contains a large and diverse collection of user-agent strings, covering a wide range of web crawlers and bots.
- Regularly Updated: The repository is actively maintained, with new user-agent strings being added regularly to keep up with the evolving landscape of web crawlers and bots.
- Open-Source: The repository is open-source, allowing developers to contribute and expand the collection as needed.
- Easy to Use: The user-agent strings are organized and categorized, making it easy for developers to find the appropriate user-agent string for their use case.
Cons
- Potential Inaccuracies: As the web crawler and bot landscape is constantly changing, the user-agent strings in the repository may not always be up-to-date or accurate.
- Limited Metadata: The repository does not provide much additional metadata or information about the user-agent strings, such as the purpose, origin, or behavior of the crawlers and bots.
- Limited Programmatic Access: The core repository is a single JSON file of regex patterns; programmatic access is provided through separate npm, PyPI, and Go packages rather than a hosted API or query service.
- Potential Legal Concerns: Depending on the use case, the use of web crawler and bot user-agent strings may raise legal or ethical concerns, such as potential violations of terms of service or privacy policies.
Code Examples
The repository itself is a data collection (a single JSON file of patterns) rather than a code library, so it ships no code examples of its own; see Getting Started below for an example of consuming the data.
Getting Started
To use the patterns from the monperrus/crawler-user-agents repository, download the crawler-user-agents.json file from the repository and reference it as needed in your application or website. Each entry's pattern is a regular expression, so you can use the patterns to identify and handle requests from web crawlers and bots, or to block or limit their access to your website.
Here's an example of how you could use the patterns in a Python script to identify requests from web crawlers and bots, given the User-Agent header of an incoming request:
import json
import re

# Load the crawler patterns from the repository's JSON file
with open('crawler-user-agents.json', 'r') as f:
    crawler_patterns = [entry['pattern'] for entry in json.load(f)]

# Example User-Agent header taken from an incoming HTTP request
user_agent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

# Each pattern is a regular expression, so match with re.search rather than comparing exact strings
if any(re.search(pattern, user_agent) for pattern in crawler_patterns):
    print(f'Crawler/bot detected: {user_agent}')
else:
    print('Regular user agent detected')
This is just a simple example, and you may need to adapt the code to your specific use case and requirements. Additionally, you should carefully consider the legal and ethical implications of using web crawler and bot user-agent strings, and ensure that your use of the data is in compliance with any relevant laws, regulations, and terms of service.
Competitor Comparisons
This repo is for demonstration purposes only.
Pros of Spoon-Knife
- Spoon-Knife is a simple and straightforward repository, making it easy for beginners to understand and use.
- The repository has a clear and concise README file, providing helpful information for users.
- Spoon-Knife has a large number of stars and forks, indicating its popularity and widespread use.
Cons of Spoon-Knife
- Spoon-Knife is a very basic repository, with limited functionality compared to monperrus/crawler-user-agents.
- The repository does not provide any advanced features or tools for users, such as the ability to customize user-agent strings.
- Spoon-Knife may not be suitable for more complex projects or use cases that require more advanced functionality.
Code Comparison
Spoon-Knife:
# README.md
This repository is meant to provide an example for the *forking* process. A "fork" is a copy of a repository. Forking a repository allows you to freely experiment with changes without affecting the original project.
# index.html
<html>
<body>
<img src="spoon-knife.jpg" alt="Spoon and Knife">
</body>
</html>
monperrus/crawler-user-agents:
# crawler-user-agents.json (excerpt)
[
  {
    "pattern": "Googlebot/",
    "instances": ["Googlebot/2.1 (+http://www.google.com/bot.html)"],
    "url": "http://www.google.com/bot.html"
  }
]
The repository contains Google's robots.txt parser and matcher as a C++ library (compliant to C++11).
Pros of robotstxt
- Official implementation by Google, ensuring high reliability and adherence to standards
- Supports multiple programming languages (C++, Java, Python)
- Comprehensive documentation and extensive test suite
Cons of robotstxt
- Focused solely on robots.txt parsing and matching
- May be overly complex for simple use cases
- Requires more setup and integration effort
Code Comparison
robotstxt (C++):
#include "robots.h"
bool allowed = robots::IsAllowed(url, user_agent, robots_txt_content);
crawler-user-agents (JSON):
{
  "pattern": "Googlebot/",
  "instances": ["Googlebot/2.1 (+http://www.google.com/bot.html)"],
  "url": "http://www.google.com/bot.html"
}
Summary
robotstxt is a comprehensive solution for parsing and matching robots.txt files, offering multi-language support and Google's official implementation. However, it may be overkill for simpler projects.
crawler-user-agents provides a curated list of user agents in JSON format, making it easier to identify and categorize web crawlers. It's more straightforward but limited in scope compared to robotstxt.
Choose robotstxt for robust robots.txt handling or crawler-user-agents for quick user agent identification and categorization.
Scrapy, a fast high-level web crawling & scraping framework for Python.
Pros of Scrapy
- Scrapy is a mature and feature-rich web scraping framework, with a large and active community.
- It provides a powerful and flexible API for building complex web crawlers and scrapers.
- Scrapy has built-in support for handling various web protocols, data formats, and storage options.
Cons of Scrapy
- Scrapy has a steeper learning curve compared to the simpler Crawler-User-Agents library.
- Scrapy may be overkill for simple web scraping tasks, where Crawler-User-Agents could be a more lightweight and easier-to-use solution.
Code Comparison
Crawler-User-Agents (monperrus/crawler-user-agents):
import crawleruseragents

# is_crawler() tests a User-Agent string against every known crawler pattern
if crawleruseragents.is_crawler("Googlebot/2.1 (+http://www.google.com/bot.html)"):
    print("Crawler detected")
Scrapy (scrapy/scrapy):
import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {'title': response.css('title::text').get()}
README
crawler-user-agents
This repository contains a list of HTTP user-agents used by robots, crawlers, and spiders, in a single JSON file.
- NPM package: https://www.npmjs.com/package/crawler-user-agents
- Go package: https://pkg.go.dev/github.com/monperrus/crawler-user-agents
- PyPi package: https://pypi.org/project/crawler-user-agents/
Each pattern is a regular expression. It should work out-of-the-box with your favorite regex library.
If you use this project in a commercial product, please sponsor it.
Install
Direct download
Download the crawler-user-agents.json file from this repository directly.
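The JSON entries can be used with any regex engine. As a minimal sketch (in Node.js, assuming the downloaded file sits in the working directory), you could test an incoming User-Agent header against every pattern like this:
const fs = require('fs');

// Load the downloaded JSON file: an array of entries with pattern, instances, and url fields
const crawlers = JSON.parse(fs.readFileSync('crawler-user-agents.json', 'utf8'));

// Example User-Agent header from an incoming request
const userAgent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

// Each pattern is a regular expression, so test it instead of comparing strings
const match = crawlers.find(entry => new RegExp(entry.pattern).test(userAgent));
console.log(match ? `Crawler detected: ${match.url}` : 'No crawler pattern matched');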
Javascript
crawler-user-agents is deployed on npmjs.com: https://www.npmjs.com/package/crawler-user-agents
To use it using npm or yarn:
npm install --save crawler-user-agents
# OR
yarn add crawler-user-agents
In Node.js, you can require the package to get an array of crawler user agents.
const crawlers = require('crawler-user-agents');
console.log(crawlers);
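As a usage sketch (assuming the exported array mirrors the JSON entries, each with a pattern field), an illustrative helper that tests an incoming User-Agent header could look like this; isCrawler below is not part of the package's documented API:
const crawlers = require('crawler-user-agents');

// Illustrative helper (not part of the package API): true if any pattern matches
function isCrawler(userAgent) {
  return crawlers.some(entry => new RegExp(entry.pattern).test(userAgent));
}

console.log(isCrawler('Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)')); // true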
Python
Install with pip install crawler-user-agents
Then:
import crawleruseragents
if crawleruseragents.is_crawler("Googlebot/"):
    # do something
or:
import crawleruseragents
indices = crawleruseragents.matching_crawlers("bingbot/2.0")
print("crawlers' indices:", indices)
print(
    "crawler's URL:",
    crawleruseragents.CRAWLER_USER_AGENTS_DATA[indices[0]]["url"]
)
Note that matching_crawlers is much slower than is_crawler if the given User-Agent does indeed match any crawlers.
Go
Go: use this package; it provides the global variable Crawlers (synchronized with crawler-user-agents.json) and the functions IsCrawler and MatchingCrawlers.
Example Go program:
package main

import (
    "fmt"

    "github.com/monperrus/crawler-user-agents"
)

func main() {
    userAgent := "Mozilla/5.0 (compatible; Discordbot/2.0; +https://discordapp.com)"

    isCrawler := agents.IsCrawler(userAgent)
    fmt.Println("isCrawler:", isCrawler)

    indices := agents.MatchingCrawlers(userAgent)
    fmt.Println("crawlers' indices:", indices)
    fmt.Println("crawler's URL:", agents.Crawlers[indices[0]].URL)
}
Output:
isCrawler: true
crawlers' indices: [237]
crawler's URL: https://discordapp.com
Contributing
I do welcome additions contributed as pull requests.
The pull requests should:
- contain a single addition
- specify a discriminant relevant syntactic fragment (for example "totobot" and not "Mozilla/5 totobot v20131212.alpha1")
- contain the pattern (generic regular expression), the discovery date (year/month/day) and the official url of the robot
- result in a valid JSON file (don't forget the comma between items)
Example:
{
  "pattern": "rogerbot",
  "addition_date": "2014/02/28",
  "url": "http://moz.com/help/pro/what-is-rogerbot-",
  "instances": ["rogerbot/2.3 example UA"]
}
License
The list is under an MIT License. The versions prior to Nov 7, 2016 were under a CC-SA license.
Related work
There are a few wrapper libraries that use this data to detect bots:
- Voight-Kampff (Ruby)
- isbot (Ruby)
- crawlers (Clojure)
- isBot (Node.JS)
Other systems for spotting robots, crawlers, and spiders that you may want to consider are:
- Crawler-Detect (PHP)
- BrowserDetector (PHP)
- browscap (JSON files)