
monperrus/crawler-user-agents

Syntactic patterns of HTTP user-agents used by bots / robots / crawlers / scrapers / spiders. pull-request welcome :star:


Top Related Projects

  • Spoon-Knife: This repo is for demonstration purposes only.
  • robotstxt: Google's robots.txt parser and matcher as a C++ library (compliant with C++11).
  • Scrapy: A fast high-level web crawling & scraping framework for Python.

Quick Overview

The monperrus/crawler-user-agents repository is a curated collection of regular-expression patterns (with example user-agent strings) for web crawlers and bots, maintained as a single JSON file. It is useful for developers who need to identify and handle requests from crawlers and bots on their websites or applications.

Pros

  • Comprehensive Collection: The repository contains a large and diverse collection of user-agent strings, covering a wide range of web crawlers and bots.
  • Regularly Updated: The repository is actively maintained, with new user-agent strings being added regularly to keep up with the evolving landscape of web crawlers and bots.
  • Open-Source: The repository is open-source, allowing developers to contribute and expand the collection as needed.
  • Easy to Use: The entries follow a consistent JSON structure (pattern, example instances, URL), making them straightforward to load and match in any language.

Cons

  • Potential Inaccuracies: As the web crawler and bot landscape is constantly changing, the user-agent strings in the repository may not always be up-to-date or accurate.
  • Limited Metadata: Each entry carries only a regex pattern, example instances, usually an addition date, and the bot's official URL; there is no information about a crawler's purpose or behavior.
  • Limited Programmatic Access: The core of the project is a single JSON file; programmatic access goes through thin wrapper packages (npm, pip, Go) rather than a richer API or hosted service.
  • Potential Legal Concerns: Depending on the use case, the use of web crawler and bot user-agent strings may raise legal or ethical concerns, such as potential violations of terms of service or privacy policies.

Code Examples

The repository itself is primarily a data file rather than a code library, but it ships thin wrapper packages for JavaScript, Python, and Go; usage snippets for each appear in the README section below.

Getting Started

To use the data from the monperrus/crawler-user-agents repository, download the crawler-user-agents.json file from the repository and match its regular-expression patterns against the User-Agent headers of incoming requests. For example, you could use the patterns to identify and log requests from web crawlers and bots, or to block or limit their access to your website.

Here's an example of how you could use the patterns in a Python script to identify requests from web crawlers and bots:

import json
import re

# Load the crawler patterns from the repository's JSON file
with open('crawler-user-agents.json', 'r') as f:
    crawler_patterns = [re.compile(entry['pattern']) for entry in json.load(f)]

# Example User-Agent header, as it would arrive on an incoming HTTP request
user_agent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

# A User-Agent counts as a crawler if any pattern matches it
if any(pattern.search(user_agent) for pattern in crawler_patterns):
    print(f'Crawler/bot detected: {user_agent}')
else:
    print('Regular user agent detected')

This is just a simple example, and you may need to adapt the code to your specific use case and requirements. Additionally, you should carefully consider the legal and ethical implications of using web crawler and bot user-agent strings, and ensure that your use of the data is in compliance with any relevant laws, regulations, and terms of service.
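
For instance, as a minimal sketch of a server-side adaptation (assuming a Flask application with crawler-user-agents.json in the working directory; the route, flag name, and log message are illustrative, not part of the project), the same check can run in a request hook so every incoming request is classified once:

import json
import re

from flask import Flask, g, request

app = Flask(__name__)

# Compile all patterns once at startup
with open('crawler-user-agents.json', 'r') as f:
    CRAWLER_PATTERNS = [re.compile(entry['pattern']) for entry in json.load(f)]

@app.before_request
def flag_crawlers():
    # Store the result on the per-request context so any handler can act on it
    ua = request.headers.get('User-Agent', '')
    g.is_crawler = any(p.search(ua) for p in CRAWLER_PATTERNS)

@app.route('/')
def index():
    if g.is_crawler:
        app.logger.info('Crawler visit: %s', request.headers.get('User-Agent'))
    return 'Hello'

Compiling the patterns once at startup keeps the per-request cost to a single pass over the compiled list.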

Competitor Comparisons

Spoon-Knife: This repo is for demonstration purposes only.

Pros of Spoon-Knife

  • Spoon-Knife is a simple and straightforward repository, making it easy for beginners to understand and use.
  • The repository has a clear and concise README file, providing helpful information for users.
  • Spoon-Knife has a large number of stars and forks, indicating its popularity and widespread use.

Cons of Spoon-Knife

  • Spoon-Knife is a very basic repository, with limited functionality compared to monperrus/crawler-user-agents.
  • The repository does not provide any features or tooling relevant to crawler detection or user-agent handling.
  • Spoon-Knife may not be suitable for more complex projects or use cases that require more advanced functionality.

Code Comparison

Spoon-Knife:

# README.md
This repository is meant to provide an example for the *forking* process. A "fork" is a copy of a repository. Forking a repository allows you to freely experiment with changes without affecting the original project.

# index.html
<html>
<body>
    <img src="spoon-knife.jpg" alt="Spoon and Knife">
</body>
</html>

monperrus/crawler-user-agents (crawler-user-agents.json, excerpt):

{
  "pattern": "bingbot",
  "url": "http://www.bing.com/bingbot.htm",
  "instances": ["Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"]
}

robotstxt: The repository contains Google's robots.txt parser and matcher as a C++ library (compliant with C++11).

Pros of robotstxt

  • Official implementation by Google, ensuring high reliability and adherence to standards
  • Supports multiple programming languages (C++, Java, Python)
  • Comprehensive documentation and extensive test suite

Cons of robotstxt

  • Focused solely on robots.txt parsing and matching
  • May be overly complex for simple use cases
  • Requires more setup and integration effort

Code Comparison

robotstxt (C++):

#include "robots.h"

// The matcher lives in the googlebot namespace of the library
googlebot::RobotsMatcher matcher;
bool allowed = matcher.OneAgentAllowedByRobots(robots_txt_content, user_agent, url);

crawler-user-agents (JSON):

{
  "pattern": "Googlebot/",
  "instances": ["Googlebot/2.1 (+http://www.google.com/bot.html)"],
  "url": "http://www.google.com/bot.html"
}

Summary

robotstxt is a comprehensive solution for parsing and matching robots.txt files, offering multi-language support and Google's official implementation. However, it may be overkill for simpler projects.

crawler-user-agents provides a curated list of user agents in JSON format, making it easier to identify and categorize web crawlers. It's more straightforward but limited in scope compared to robotstxt.

Choose robotstxt for robust robots.txt handling or crawler-user-agents for quick user agent identification and categorization.
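
To make that difference in scope concrete, here is a minimal Python sketch that puts the two concerns side by side, using the standard library's urllib.robotparser as a stand-in for robotstxt's C++ matcher and the crawleruseragents pip package described in the README further below; the URLs are illustrative:

import urllib.robotparser

import crawleruseragents

# Concern 1 (what robotstxt answers): may *my* crawler fetch a given URL?
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses the robots.txt file
print("may fetch:", rp.can_fetch("MyCrawler/1.0", "https://example.com/private/page"))

# Concern 2 (what crawler-user-agents answers): is this *visitor* a known crawler?
ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print("is crawler:", crawleruseragents.is_crawler(ua))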

Scrapy: A fast high-level web crawling & scraping framework for Python.

Pros of Scrapy

  • Scrapy is a mature and feature-rich web scraping framework, with a large and active community.
  • It provides a powerful and flexible API for building complex web crawlers and scrapers.
  • Scrapy has built-in support for handling various web protocols, data formats, and storage options.

Cons of Scrapy

  • Scrapy has a steeper learning curve compared to the simpler Crawler-User-Agents library.
  • Scrapy may be overkill for simple web scraping tasks, where Crawler-User-Agents could be a more lightweight and easier-to-use solution.

Code Comparison

Crawler-User-Agents (monperrus/crawler-user-agents):

import crawleruseragents

# Check whether a User-Agent header matches any known crawler pattern
if crawleruseragents.is_crawler("Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"):
    print("crawler detected")

Scrapy (scrapy/scrapy):

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {'title': response.css('title::text').get()}


README

crawler-user-agents

This repository contains a list of HTTP user-agents used by robots, crawlers, and spiders, in a single JSON file.

Each pattern is a regular expression. It should work out-of-the-box with your favorite regex library.

If you use this project in a commercial product, please sponsor it.

Install

Direct download

Download the crawler-user-agents.json file from this repository directly.

Javascript

crawler-user-agents is deployed on npmjs.com: https://www.npmjs.com/package/crawler-user-agents

To install it with npm or yarn:

npm install --save crawler-user-agents
# OR
yarn add crawler-user-agents

In Node.js, you can require the package to get an array of crawler user agents.

const crawlers = require('crawler-user-agents');
console.log(crawlers);

Python

Install with pip install crawler-user-agents

Then:

import crawleruseragents

if crawleruseragents.is_crawler("Googlebot/"):
    # do something
    pass

or:

import crawleruseragents
indices = crawleruseragents.matching_crawlers("bingbot/2.0")
print("crawlers' indices:", indices)
print(
    "crawler's URL:",
    crawleruseragents.CRAWLER_USER_AGENTS_DATA[indices[0]]["url"]
)

Note that matching_crawlers is much slower than is_crawler when the given User-Agent actually matches one or more crawlers.
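
To see the difference on your own data, a quick comparison with timeit (an illustrative sketch, assuming the pip package is installed) might look like this:

import timeit

import crawleruseragents

ua = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

# is_crawler can return as soon as one pattern matches (per the note above)
t_is = timeit.timeit(lambda: crawleruseragents.is_crawler(ua), number=1000)

# matching_crawlers scans all patterns to collect every matching index
t_match = timeit.timeit(lambda: crawleruseragents.matching_crawlers(ua), number=1000)

print(f"is_crawler:        {t_is:.3f}s per 1000 calls")
print(f"matching_crawlers: {t_match:.3f}s per 1000 calls")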

Go

In Go, use this package; it provides the global variable Crawlers (kept synchronized with crawler-user-agents.json) and the functions IsCrawler and MatchingCrawlers.

Example of Go program:

package main

import (
	"fmt"

	"github.com/monperrus/crawler-user-agents"
)

func main() {
	userAgent := "Mozilla/5.0 (compatible; Discordbot/2.0; +https://discordapp.com)"

	isCrawler := agents.IsCrawler(userAgent)
	fmt.Println("isCrawler:", isCrawler)

	indices := agents.MatchingCrawlers(userAgent)
	fmt.Println("crawlers' indices:", indices)
	fmt.Println("crawler's URL:", agents.Crawlers[indices[0]].URL)
}

Output:

isCrawler: true
crawlers' indices: [237]
crawler's URL: https://discordapp.com

Contributing

I do welcome additions contributed as pull requests.

The pull requests should:

  • contain a single addition
  • specify a discriminating, minimal syntactic fragment (for example "totobot", not the full "Mozilla/5 totobot v20131212.alpha1" string)
  • contain the pattern (a generic regular expression), the discovery date (year/month/day), and the official URL of the robot
  • result in a valid JSON file (don't forget the comma between items); a validation sketch follows the example below

Example:

{
  "pattern": "rogerbot",
  "addition_date": "2014/02/28",
  "url": "http://moz.com/help/pro/what-is-rogerbot-",
  "instances" : ["rogerbot/2.3 example UA"]
}
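
As a quick way to check the last two points before opening a pull request (a minimal sketch; the script and file path are not part of the project), you can verify that the file still parses as JSON and that every listed instance is matched by its own pattern:

import json
import re
import sys

# json.load fails immediately if the file is no longer valid JSON (e.g. a missing comma)
with open('crawler-user-agents.json', 'r') as f:
    entries = json.load(f)

ok = True
for entry in entries:
    pattern = re.compile(entry['pattern'])
    for instance in entry.get('instances', []):
        if not pattern.search(instance):
            print(f"{entry['pattern']!r} does not match its instance {instance!r}")
            ok = False

sys.exit(0 if ok else 1)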

License

The list is under an MIT License. The versions prior to Nov 7, 2016 were under a CC-SA license.

Related work

There are a few wrapper libraries that use this data to detect bots, as well as other dedicated systems for spotting robots, crawlers, and spiders that you may want to consider.
