
SpiderClub / haipproxy

:sparkling_heart: Highly available distributed IP proxy pool, powered by Scrapy and Redis


Top Related Projects

Python ProxyPool for web spider

A list of free, public, forward proxy servers. UPDATED DAILY!

Get PROXY List that gets updated everyday

Daily feed of bad IPs (with blacklist hit scores)

Quick Overview

haipproxy is an open-source project that provides a high-availability IP proxy pool. It crawls free proxy IP addresses from a variety of sources, validates them, and maintains a pool of working proxies. The project is designed to be scalable and efficient, making it suitable for large-scale web scraping tasks.

Pros

  • Automated proxy collection and validation
  • Scalable architecture using Redis for storage
  • Supports multiple proxy protocols (HTTP, HTTPS, SOCKS4/5)
  • Customizable scoring system for proxy quality assessment

Cons

  • Requires setup and maintenance of multiple components (Redis, Scrapy, etc.)
  • Documentation is primarily in Chinese, which may be challenging for non-Chinese speakers
  • Relies on free proxy sources, which can be unreliable or change frequently
  • May require frequent updates to maintain effectiveness as proxy sources change

Code Examples

  1. Fetching a proxy from the pool:

    from client.py_cli import ProxyFetcher

    fetcher = ProxyFetcher('https', strategy='greedy')
    proxy = fetcher.get_proxy()
    print(proxy)  # e.g. {'ip': '1.2.3.4', 'port': 8080, 'score': 100}

  2. Using a proxy with the requests library:

    import requests
    from client.py_cli import ProxyFetcher

    fetcher = ProxyFetcher('https', strategy='greedy')
    proxy = fetcher.get_proxy()
    proxies = {
        'http': f'http://{proxy["ip"]}:{proxy["port"]}',
        'https': f'https://{proxy["ip"]}:{proxy["port"]}'
    }

    response = requests.get('https://example.com', proxies=proxies)
    print(response.status_code)

  3. Customizing the proxy selection strategy:

    from client.py_cli import ProxyFetcher

    fetcher = ProxyFetcher('https', strategy='robin')
    proxy1 = fetcher.get_proxy()
    proxy2 = fetcher.get_proxy()
    print(proxy1, proxy2)  # the round-robin strategy returns two different proxies

Getting Started

  1. Clone the repository:

    git clone https://github.com/SpiderClub/haipproxy.git
    cd haipproxy
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Set up Redis and modify config/settings.py with your Redis configuration.

  4. Run the crawler to gather proxies:

    python crawler_booter.py --usage crawler
    
  5. Run the validator to check proxy quality:

    python crawler_booter.py --usage validator
    
  6. Use the client to fetch proxies in your application:

    from client.py_cli import ProxyFetcher
    fetcher = ProxyFetcher('https')
    proxy = fetcher.get_proxy()
    

Competitor Comparisons

Python ProxyPool for web spider

Pros of proxy_pool

  • Simpler setup and usage, with fewer dependencies
  • Supports multiple database backends (Redis, MongoDB, MySQL)
  • Includes a web API for easy integration with other projects

Cons of proxy_pool

  • Less sophisticated proxy validation and scoring system
  • Fewer customization options for proxy sources and validation methods
  • Limited documentation and examples compared to haipproxy

Code Comparison

proxy_pool:

class ProxyCheck(object):
    def __init__(self):
        self.selfip = self.getMyIP()
        self.detect_pool = []
        self.thread_num = 20
        self.detect_queue = Queue()
        self.timeout = 5

haipproxy:

class ProxyValidator:
    def __init__(self, task):
        self.task = task
        self.redis_args = get_redis_args()
        self.pool = get_redis_conn(**self.redis_args)
        self.timeout = 10

Both projects use similar approaches for proxy validation, but haipproxy offers more advanced features and customization options. proxy_pool is easier to set up and use, while haipproxy provides more robust proxy management capabilities. The choice between the two depends on the specific requirements of your project and the level of control you need over the proxy pool.

A list of free, public, forward proxy servers. UPDATED DAILY!

Pros of proxy-list

  • Simple and straightforward list of proxies
  • Regularly updated with new proxy addresses
  • Easy to integrate into existing projects

Cons of proxy-list

  • Limited functionality compared to haipproxy
  • Lacks advanced features like proxy validation and scoring
  • No built-in proxy rotation or management system

Code Comparison

proxy-list:

socks5://1.2.3.4:1080
http://5.6.7.8:8080
https://9.10.11.12:3128

haipproxy:

from haipproxy.client.py_cli import ProxyFetcher

args = dict(host='127.0.0.1', port=6379, password='123456')
fetcher = ProxyFetcher('zhihu', strategy='greedy', redis_args=args)
print(fetcher.get_proxy())

The proxy-list repository provides a simple list of proxy addresses, while haipproxy offers a more comprehensive solution with a Python client for fetching and managing proxies.

haipproxy includes features like proxy validation, scoring, and automatic rotation, making it more suitable for complex scraping projects. However, proxy-list's simplicity can be advantageous for quick integration or when only a basic list of proxies is needed.

proxy-list is easier to use out of the box, but haipproxy provides more control and flexibility for managing proxy pools in larger-scale applications.
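For example, wiring a plain proxy-list entry into requests is left entirely to the caller. The following minimal sketch assumes a local proxies.txt containing one scheme://host:port entry per line, in the format shown above:

import random

import requests

# assumption: proxies.txt holds one scheme://host:port entry per line,
# e.g. "http://5.6.7.8:8080"
with open('proxies.txt') as f:
    entries = [line.strip() for line in f if line.strip()]

# pick one entry at random and route both HTTP and HTTPS traffic through it
entry = random.choice(entries)
proxies = {'http': entry, 'https': entry}

try:
    resp = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10)
    print(resp.status_code, resp.text)
except requests.RequestException as exc:
    print(f'proxy {entry} failed: {exc}')  # free proxies fail often; try another entry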

Get PROXY List that gets updated everyday

Pros of PROXY-List

  • Simple and straightforward list of proxy servers
  • Regularly updated with new proxies
  • Easy to integrate into existing projects

Cons of PROXY-List

  • Lacks advanced features like proxy validation or scoring
  • No built-in proxy rotation or management functionality
  • Limited documentation and usage examples

Code Comparison

PROXY-List:

# Example of reading proxies from PROXY-List
with open('proxy.txt', 'r') as f:
    proxies = f.readlines()

haipproxy:

# Example of using haipproxy
from client.py_cli import ProxyFetcher
args = dict(host='127.0.0.1', port=6379, password='123456')
fetcher = ProxyFetcher('zhihu', strategy='greedy', redis_args=args)
print(fetcher.get_proxy())

haipproxy offers a more comprehensive solution for proxy management, including features like proxy validation, scoring, and automatic rotation. It provides a Redis-based backend for storing and retrieving proxies, making it suitable for larger-scale applications.

PROXY-List, on the other hand, is a simpler option that provides a regularly updated list of proxy servers. It's easier to integrate into existing projects but lacks advanced features and management capabilities.

The choice between the two depends on the specific requirements of your project. If you need a simple list of proxies, PROXY-List might suffice. For more complex proxy management needs, haipproxy offers a more robust solution.
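Because PROXY-List ships no rotation of its own, even simple failover has to be written by the consumer. A hypothetical helper is sketched below; the file name proxy.txt matches the snippet above, and fetch_with_rotation is not part of either project:

from itertools import cycle

import requests

def fetch_with_rotation(url, proxy_file='proxy.txt', max_tries=5):
    """Rotate through a plain proxy list until one request succeeds."""
    with open(proxy_file) as f:
        entries = [line.strip() for line in f if line.strip()]
    pool = cycle(entries)
    for _ in range(max_tries):
        proxy = next(pool)
        if '://' not in proxy:
            proxy = 'http://' + proxy  # add a scheme if the list stores bare host:port entries
        try:
            return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)
        except requests.RequestException:
            continue  # dead proxy, move on to the next one
    raise RuntimeError(f'no working proxy found after {max_tries} attempts')

print(fetch_with_rotation('https://httpbin.org/ip').text)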

Daily feed of bad IPs (with blacklist hit scores)

Pros of ipsum

  • Simpler setup and usage, focused solely on IP blocklists
  • Regularly updated with new malicious IPs
  • Lightweight and easy to integrate into existing security systems

Cons of ipsum

  • Limited functionality compared to haipproxy's proxy harvesting capabilities
  • Less customizable for specific use cases
  • Lacks advanced features like proxy validation and scoring

Code comparison

ipsum:

wget https://raw.githubusercontent.com/stamparm/ipsum/master/ipsum.txt -O /tmp/ipsum.txt
ipset create ipsum hash:net
for ip in $(grep -v "#" /tmp/ipsum.txt | awk '{print $1}'); do ipset add ipsum $ip; done
iptables -I INPUT -m set --match-set ipsum src -j DROP

haipproxy:

from client.py_cli import ProxyFetcher
args = dict(host='127.0.0.1', port=6379, password='123456')
fetcher = ProxyFetcher('zhihu', strategy='greedy', redis_args=args)
print(fetcher.get_proxy())

Summary

ipsum is a straightforward IP blocklist tool, while haipproxy is a more comprehensive proxy harvesting and management system. ipsum is easier to set up and use for basic IP blocking, but haipproxy offers more advanced features for proxy handling and validation. The choice between them depends on the specific requirements of your project and the level of complexity you're willing to manage.

README

Highly Available IP Proxy Pool

README | 中文文档 (Chinese documentation)

All IP resources collected by this project come from the public internet. The vision is to provide large crawler projects with a **highly available, low-latency pool of high-anonymity IP proxies**.

Highlights

  • Rich variety of proxy sources
  • Accurate proxy crawling and extraction
  • Strict and reasonable proxy validation
  • Complete monitoring and strong robustness
  • Flexible architecture that is easy to extend
  • Distributed deployment of each component

Quick Start

Note: please download the code from the release list; the code on the master branch is not guaranteed to run stably.

Standalone Deployment

Server

  • Install Python 3 and Redis. If you run into problems, read the relevant sections of this article.

  • Adjust REDIS_HOST, REDIS_PASSWORD, and other parameters in the project configuration file config/settings.py to match your actual Redis configuration (a settings sketch follows this list).

  • Install scrapy-splash and set SPLASH_URL in the configuration file config/settings.py

  • Install the project dependencies

    pip install -r requirements.txt

  • Start the Scrapy workers, including the proxy IP crawlers and validators

    python crawler_booter.py --usage crawler

    python crawler_booter.py --usage validator

  • Start the *schedulers*, which handle the periodic scheduling of proxy IP crawling and validation

    python scheduler_booter.py --usage crawler

    python scheduler_booter.py --usage validator
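
Only as a rough illustration of the parameters mentioned above (the authoritative list lives in config/settings.py in the repository; the values below are placeholders), the Redis- and Splash-related settings look roughly like this:

    # config/settings.py (illustrative excerpt, placeholder values)
    REDIS_HOST = '127.0.0.1'      # host of the Redis instance shared by all components
    REDIS_PASSWORD = '123456'     # password configured in your redis.conf
    SPLASH_URL = 'http://127.0.0.1:8050'  # scrapy-splash endpoint used for JS-rendered sources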

Client

Users keep asking how to get the list of usable proxy IPs from this project. haipproxy does not serve proxies through an HTTP API; it serves them through concrete clients. Currently a Python client and a language-agnostic squid second-level proxy are supported.

Python client example

from client.py_cli import ProxyFetcher
args = dict(host='127.0.0.1', port=6379, password='123456', db=0)
# 'zhihu' here means fetching IPs from the validated proxy queue associated with 'zhihu'
# the reason is that the same proxy IP can perform very differently on different target sites
fetcher = ProxyFetcher('zhihu', strategy='greedy', redis_args=args)
# get one usable proxy
print(fetcher.get_proxy())
# get the list of usable proxies
print(fetcher.get_proxies()) # or print(fetcher.pool)

A more complete example can be found in examples/zhihu

squid as a second-level proxy

  • Install squid, back up squid's configuration file, and start squid (Ubuntu example):

    sudo apt-get install squid

    sudo sed -i 's/http_access deny all/http_access allow all/g' /etc/squid/squid.conf

    sudo cp /etc/squid/squid.conf /etc/squid/squid.conf.backup

    sudo service squid start

  • Adjust SQUID_BIN_PATH, SQUID_CONF_PATH, SQUID_TEMPLATE_PATH, and other parameters in config/settings.py according to your operating system

  • Start the scheduled squid configuration updater

    sudo python squid_update.py

  • Use squid as a proxy middle layer to request target sites. The default proxy URL is 'http://squid_host:3128'. A Python request example follows:

    import requests
    proxies = {'https': 'http://127.0.0.1:3128'}
    resp = requests.get('https://httpbin.org/ip', proxies=proxies)
    print(resp.text)
    

Docker Deployment

  • Install Docker

  • Install docker-compose

    pip install -U docker-compose

  • Set the SPLASH_URL and REDIS_HOST parameters in settings.py

    # Note: if you are using the code from the master branch, this step can be skipped
    SPLASH_URL = 'http://splash:8050'
    REDIS_HOST = 'redis'
    
  • Start all application components with docker-compose

    docker-compose up

This approach also deploys squid, so you can access the proxy IP pool through squid or through the client, exactly as in a standalone deployment.

Notes

  • This project depends heavily on Redis. Besides message passing and data storage, the IP validation and scheduled task tools also use several Redis data structures. If you need to replace Redis, evaluate the effort yourself.
  • Because of the GFW, some sites can only be accessed and crawled through a censorship-circumventing proxy. If you cannot reach sites outside the wall, set the enable attribute to 0 for the tasks in rules.py whose task_queue is SPIDER_GFW_TASK or SPIDER_AJAX_GFW_TASK, or specify the crawler types common and ajax when starting the crawler:

    python crawler_booter.py --usage crawler common ajax

  • The same proxy IP can perform very differently on different target sites. If the generic proxies do not meet your needs, you can write a proxy IP validator for a specific site (a standalone sketch follows this list).
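
A site-specific check boils down to requesting the target site through a candidate proxy and applying a success criterion tailored to that site. The sketch below only illustrates the idea and does not use the project's internal validator interface; the target URL and the success criterion are assumptions:

    import requests

    def validate_for_site(proxy_url, target='https://www.zhihu.com', timeout=10):
        """Check whether a proxy works for one specific site.

        proxy_url is e.g. 'http://1.2.3.4:8080'. The success criterion here
        (HTTP 200 and a non-trivial body) is a placeholder; a real validator
        would look for site-specific markers in the response.
        """
        proxies = {'http': proxy_url, 'https': proxy_url}
        try:
            resp = requests.get(target, proxies=proxies, timeout=timeout)
        except requests.RequestException:
            return False
        return resp.status_code == 200 and len(resp.text) > 1000

    print(validate_for_site('http://1.2.3.4:8080'))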

Workflow

Performance Test

With haipproxy and the test code deployed in standalone mode and Zhihu as the target site, the measured crawling results are as follows.

The test code can be found in examples/zhihu

Project Monitoring (Optional)

Project monitoring relies mainly on Sentry and Prometheus: instrumentation at key points tracks the project along several dimensions and improves its robustness.

The project uses Sentry as its bug tracing tool; Sentry makes it easy to track the health of the project.

Prometheus + Grafana are used for business monitoring to understand the current state of the project.
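
As a minimal sketch of this kind of instrumentation (the DSN, metric name, and instrumented function are placeholders, not the project's actual metrics):

    import sentry_sdk
    from prometheus_client import Counter, start_http_server

    # report unhandled exceptions to Sentry; the DSN below is a placeholder
    sentry_sdk.init(dsn='https://examplePublicKey@o0.ingest.sentry.io/0')

    # hypothetical business metric: validated proxies, labelled by outcome
    VALIDATED = Counter('proxies_validated_total', 'Validated proxies', ['result'])

    def record_validation(ok: bool) -> None:
        VALIDATED.labels(result='ok' if ok else 'failed').inc()

    if __name__ == '__main__':
        start_http_server(8000)  # expose /metrics for Prometheus to scrape
        record_validation(True)

Grafana can then chart the scraped Prometheus metrics to show the current state of the pool.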

Donate to the Author

Open source is not easy. If this project is useful to you, consider a small donation to support its continued maintenance.

Similar Projects

This project drew on various open-source crawler proxy implementations on GitHub; thanks to their authors for their work. All referenced projects are listed below, in no particular order.

dungproxy

proxyspider

ProxyPool

proxy_pool

ProxyPool

IPProxyTool

IPProxyPool

proxy_list

proxy_pool

ProxyPool

scylla