
SpiderClub / haipproxy

:sparkling_heart: Highly available distributed IP proxy pool, powered by Scrapy and Redis


Top Related Projects

Python ProxyPool for web spider

A list of free, public, forward proxy servers. UPDATED DAILY!

Get PROXY List that gets updated everyday

Daily feed of bad IPs (with blacklist hit scores)

Quick Overview

haipproxy is an open-source project that provides a high-availability IP proxy pool. It crawls free proxy IP addresses from a variety of sources, validates them, and maintains a pool of working proxies. The project is designed to be scalable and efficient, making it suitable for large-scale web scraping tasks.

Pros

  • Automated proxy collection and validation
  • Scalable architecture using Redis for storage
  • Supports multiple proxy protocols (HTTP, HTTPS, SOCKS4/5)
  • Customizable scoring system for proxy quality assessment

Cons

  • Requires setup and maintenance of multiple components (Redis, Scrapy, etc.)
  • Documentation is primarily in Chinese, which may be challenging for non-Chinese speakers
  • Relies on free proxy sources, which can be unreliable or change frequently
  • May require frequent updates to maintain effectiveness as proxy sources change

Code Examples

  1. Fetching a proxy from the pool:

    from client.py_cli import ProxyFetcher

    fetcher = ProxyFetcher('https', strategy='greedy')
    proxy = fetcher.get_proxy()
    print(proxy)  # e.g. {'ip': '1.2.3.4', 'port': 8080, 'score': 100}

  2. Using a proxy with the requests library:

    import requests
    from client.py_cli import ProxyFetcher

    fetcher = ProxyFetcher('https', strategy='greedy')
    proxy = fetcher.get_proxy()
    proxies = {
        'http': f'http://{proxy["ip"]}:{proxy["port"]}',
        'https': f'https://{proxy["ip"]}:{proxy["port"]}'
    }

    response = requests.get('https://example.com', proxies=proxies)
    print(response.status_code)

  3. Customizing the proxy selection strategy:

    from client.py_cli import ProxyFetcher

    fetcher = ProxyFetcher('https', strategy='robin')
    proxy1 = fetcher.get_proxy()
    proxy2 = fetcher.get_proxy()
    print(proxy1, proxy2)  # the round-robin strategy returns two different proxies

Getting Started

  1. Clone the repository:

    git clone https://github.com/SpiderClub/haipproxy.git
    cd haipproxy
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Set up Redis and modify config/settings.py with your Redis configuration.

  4. Run the crawler to gather proxies:

    python crawler_booter.py --usage crawler
    
  5. Run the validator to check proxy quality:

    python crawler_booter.py --usage validator
    
  6. Use the client to fetch proxies in your application:

    from client.py_cli import ProxyFetcher
    fetcher = ProxyFetcher('https')
    proxy = fetcher.get_proxy()
    

Competitor Comparisons

Python ProxyPool for web spider

Pros of proxy_pool

  • Simpler setup and usage, with fewer dependencies
  • Supports multiple database backends (Redis, MongoDB, MySQL)
  • Includes a web API for easy integration with other projects

Cons of proxy_pool

  • Less sophisticated proxy validation and scoring system
  • Fewer customization options for proxy sources and validation methods
  • Limited documentation and examples compared to haipproxy

Code Comparison

proxy_pool:

class ProxyCheck(object):
    def __init__(self):
        self.selfip = self.getMyIP()
        self.detect_pool = []
        self.thread_num = 20
        self.detect_queue = Queue()
        self.timeout = 5

haipproxy:

class ProxyValidator:
    def __init__(self, task):
        self.task = task
        self.redis_args = get_redis_args()
        self.pool = get_redis_conn(**self.redis_args)
        self.timeout = 10

Both projects use similar approaches for proxy validation, but haipproxy offers more advanced features and customization options. proxy_pool is easier to set up and use, while haipproxy provides more robust proxy management capabilities. The choice between the two depends on the specific requirements of your project and the level of control you need over the proxy pool.

A list of free, public, forward proxy servers. UPDATED DAILY!

Pros of proxy-list

  • Simple and straightforward list of proxies
  • Regularly updated with new proxy addresses
  • Easy to integrate into existing projects

Cons of proxy-list

  • Limited functionality compared to haipproxy
  • Lacks advanced features like proxy validation and scoring
  • No built-in proxy rotation or management system

Code Comparison

proxy-list:

socks5://1.2.3.4:1080
http://5.6.7.8:8080
https://9.10.11.12:3128

haipproxy:

from haipproxy.client.py_cli import ProxyFetcher

args = dict(host='127.0.0.1', port=6379, password='123456')
fetcher = ProxyFetcher('zhihu', strategy='greedy', redis_args=args)
print(fetcher.get_proxy())

The proxy-list repository provides a simple list of proxy addresses, while haipproxy offers a more comprehensive solution with a Python client for fetching and managing proxies.

haipproxy includes features like proxy validation, scoring, and automatic rotation, making it more suitable for complex scraping projects. However, proxy-list's simplicity can be advantageous for quick integration or when only a basic list of proxies is needed.

proxy-list is easier to use out of the box, but haipproxy provides more control and flexibility for managing proxy pools in larger-scale applications.
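For example, wiring a plain proxy-list entry into requests is left entirely to the caller. The following minimal sketch assumes a local proxies.txt containing one scheme://host:port entry per line, in the format shown above:

import random

import requests

# assumption: proxies.txt holds one scheme://host:port entry per line,
# e.g. "http://5.6.7.8:8080"
with open('proxies.txt') as f:
    entries = [line.strip() for line in f if line.strip()]

# pick one entry at random and route both HTTP and HTTPS traffic through it
entry = random.choice(entries)
proxies = {'http': entry, 'https': entry}

try:
    resp = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10)
    print(resp.status_code, resp.text)
except requests.RequestException as exc:
    print(f'proxy {entry} failed: {exc}')  # free proxies fail often; try another entry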

Get PROXY List that gets updated everyday

Pros of PROXY-List

  • Simple and straightforward list of proxy servers
  • Regularly updated with new proxies
  • Easy to integrate into existing projects

Cons of PROXY-List

  • Lacks advanced features like proxy validation or scoring
  • No built-in proxy rotation or management functionality
  • Limited documentation and usage examples

Code Comparison

PROXY-List:

# Example of reading proxies from PROXY-List
with open('proxy.txt', 'r') as f:
    proxies = f.readlines()

haipproxy:

# Example of using haipproxy
from client.py_cli import ProxyFetcher
args = dict(host='127.0.0.1', port=6379, password='123456')
fetcher = ProxyFetcher('zhihu', strategy='greedy', redis_args=args)
print(fetcher.get_proxy())

haipproxy offers a more comprehensive solution for proxy management, including features like proxy validation, scoring, and automatic rotation. It provides a Redis-based backend for storing and retrieving proxies, making it suitable for larger-scale applications.

PROXY-List, on the other hand, is a simpler option that provides a regularly updated list of proxy servers. It's easier to integrate into existing projects but lacks advanced features and management capabilities.

The choice between the two depends on the specific requirements of your project. If you need a simple list of proxies, PROXY-List might suffice. For more complex proxy management needs, haipproxy offers a more robust solution.
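Because PROXY-List ships no rotation of its own, even simple failover has to be written by the consumer. A hypothetical helper is sketched below; the file name proxy.txt matches the snippet above, and fetch_with_rotation is not part of either project:

from itertools import cycle

import requests

def fetch_with_rotation(url, proxy_file='proxy.txt', max_tries=5):
    """Rotate through a plain proxy list until one request succeeds."""
    with open(proxy_file) as f:
        entries = [line.strip() for line in f if line.strip()]
    pool = cycle(entries)
    for _ in range(max_tries):
        proxy = next(pool)
        if '://' not in proxy:
            proxy = 'http://' + proxy  # add a scheme if the list stores bare host:port entries
        try:
            return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)
        except requests.RequestException:
            continue  # dead proxy, move on to the next one
    raise RuntimeError(f'no working proxy found after {max_tries} attempts')

print(fetch_with_rotation('https://httpbin.org/ip').text)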

Daily feed of bad IPs (with blacklist hit scores)

Pros of ipsum

  • Simpler setup and usage, focused solely on IP blocklists
  • Regularly updated with new malicious IPs
  • Lightweight and easy to integrate into existing security systems

Cons of ipsum

  • Limited functionality compared to haipproxy's proxy harvesting capabilities
  • Less customizable for specific use cases
  • Lacks advanced features like proxy validation and scoring

Code comparison

ipsum:

wget https://raw.githubusercontent.com/stamparm/ipsum/master/ipsum.txt -O /tmp/ipsum.txt
ipset create ipsum hash:net
for ip in $(grep -v "#" /tmp/ipsum.txt | awk '{print $1}'); do ipset add ipsum $ip; done
iptables -I INPUT -m set --match-set ipsum src -j DROP

haipproxy:

from client.py_cli import ProxyFetcher
args = dict(host='127.0.0.1', port=6379, password='123456')
fetcher = ProxyFetcher('zhihu', strategy='greedy', redis_args=args)
print(fetcher.get_proxy())

Summary

ipsum is a straightforward IP blocklist tool, while haipproxy is a more comprehensive proxy harvesting and management system. ipsum is easier to set up and use for basic IP blocking, but haipproxy offers more advanced features for proxy handling and validation. The choice between them depends on the specific requirements of your project and the level of complexity you're willing to manage.

README

Highly Available IP Proxy Pool

README | 中文文档 (Chinese documentation)

All IP resources collected by this project come from the public internet. The vision is to provide large crawler projects with a **highly available, low-latency pool of high-anonymity IP proxies**.

Highlights

  • Rich variety of proxy sources
  • Accurate proxy crawling and extraction
  • Strict and reasonable proxy validation
  • Complete monitoring and strong robustness
  • Flexible architecture that is easy to extend
  • Distributed deployment of each component

Quick Start

Note: please download the code from the release list; the code on the master branch is not guaranteed to run stably.

Standalone Deployment

Server

  • Install Python 3 and Redis. If you run into problems, read the relevant sections of this article.

  • Adjust REDIS_HOST, REDIS_PASSWORD, and other parameters in the project configuration file config/settings.py to match your actual Redis configuration (a settings sketch follows this list).

  • Install scrapy-splash and set SPLASH_URL in the configuration file config/settings.py

  • Install the project dependencies

    pip install -r requirements.txt

  • Start the Scrapy workers, including the proxy IP crawlers and validators

    python crawler_booter.py --usage crawler

    python crawler_booter.py --usage validator

  • Start the *schedulers*, which handle the periodic scheduling of proxy IP crawling and validation

    python scheduler_booter.py --usage crawler

    python scheduler_booter.py --usage validator
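
Only as a rough illustration of the parameters mentioned above (the authoritative list lives in config/settings.py in the repository; the values below are placeholders), the Redis- and Splash-related settings look roughly like this:

    # config/settings.py (illustrative excerpt, placeholder values)
    REDIS_HOST = '127.0.0.1'      # host of the Redis instance shared by all components
    REDIS_PASSWORD = '123456'     # password configured in your redis.conf
    SPLASH_URL = 'http://127.0.0.1:8050'  # scrapy-splash endpoint used for JS-rendered sources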

Client

Users keep asking how to get the list of usable proxy IPs from this project. haipproxy does not serve proxies through an HTTP API; it serves them through concrete clients. Currently a Python client and a language-agnostic squid second-level proxy are supported.

Python client example

from client.py_cli import ProxyFetcher
args = dict(host='127.0.0.1', port=6379, password='123456', db=0)
# 'zhihu' here means fetching IPs from the validated proxy queue associated with 'zhihu'
# the reason is that the same proxy IP can perform very differently on different target sites
fetcher = ProxyFetcher('zhihu', strategy='greedy', redis_args=args)
# get one usable proxy
print(fetcher.get_proxy())
# get the list of usable proxies
print(fetcher.get_proxies()) # or print(fetcher.pool)

A more complete example can be found in examples/zhihu

squid as a second-level proxy

  • Install squid, back up squid's configuration file, and start squid (Ubuntu example):

    sudo apt-get install squid

    sudo sed -i 's/http_access deny all/http_access allow all/g' /etc/squid/squid.conf

    sudo cp /etc/squid/squid.conf /etc/squid/squid.conf.backup

    sudo service squid start

  • Adjust SQUID_BIN_PATH, SQUID_CONF_PATH, SQUID_TEMPLATE_PATH, and other parameters in config/settings.py according to your operating system

  • Start the scheduled squid configuration updater

    sudo python squid_update.py

  • Use squid as a proxy middle layer to request target sites. The default proxy URL is 'http://squid_host:3128'. A Python request example follows:

    import requests
    proxies = {'https': 'http://127.0.0.1:3128'}
    resp = requests.get('https://httpbin.org/ip', proxies=proxies)
    print(resp.text)
    

Docker Deployment

  • Install Docker

  • Install docker-compose

    pip install -U docker-compose

  • Set the SPLASH_URL and REDIS_HOST parameters in settings.py

    # Note: if you are using the code from the master branch, this step can be skipped
    SPLASH_URL = 'http://splash:8050'
    REDIS_HOST = 'redis'
    
  • Start all application components with docker-compose

    docker-compose up

This approach also deploys squid, so you can access the proxy IP pool through squid or through the client, exactly as in a standalone deployment.

Notes

  • This project depends heavily on Redis. Besides message passing and data storage, the IP validation and scheduled task tools also use several Redis data structures. If you need to replace Redis, evaluate the effort yourself.
  • Because of the GFW, some sites can only be accessed and crawled through a censorship-circumventing proxy. If you cannot reach sites outside the wall, set the enable attribute to 0 for the tasks in rules.py whose task_queue is SPIDER_GFW_TASK or SPIDER_AJAX_GFW_TASK, or specify the crawler types common and ajax when starting the crawler:

    python crawler_booter.py --usage crawler common ajax

  • The same proxy IP can perform very differently on different target sites. If the generic proxies do not meet your needs, you can write a proxy IP validator for a specific site (a standalone sketch follows this list).
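
A site-specific check boils down to requesting the target site through a candidate proxy and applying a success criterion tailored to that site. The sketch below only illustrates the idea and does not use the project's internal validator interface; the target URL and the success criterion are assumptions:

    import requests

    def validate_for_site(proxy_url, target='https://www.zhihu.com', timeout=10):
        """Check whether a proxy works for one specific site.

        proxy_url is e.g. 'http://1.2.3.4:8080'. The success criterion here
        (HTTP 200 and a non-trivial body) is a placeholder; a real validator
        would look for site-specific markers in the response.
        """
        proxies = {'http': proxy_url, 'https': proxy_url}
        try:
            resp = requests.get(target, proxies=proxies, timeout=timeout)
        except requests.RequestException:
            return False
        return resp.status_code == 200 and len(resp.text) > 1000

    print(validate_for_site('http://1.2.3.4:8080'))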

Workflow

Performance Test

With haipproxy and the test code deployed in standalone mode and Zhihu as the target site, the measured crawling results are as follows.

The test code can be found in examples/zhihu

Project Monitoring (Optional)

Project monitoring relies mainly on Sentry and Prometheus: instrumentation at key points tracks the project along several dimensions and improves its robustness.

The project uses Sentry as its bug tracing tool; Sentry makes it easy to track the health of the project.

Prometheus + Grafana are used for business monitoring to understand the current state of the project.
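
As a minimal sketch of this kind of instrumentation (the DSN, metric name, and instrumented function are placeholders, not the project's actual metrics):

    import sentry_sdk
    from prometheus_client import Counter, start_http_server

    # report unhandled exceptions to Sentry; the DSN below is a placeholder
    sentry_sdk.init(dsn='https://examplePublicKey@o0.ingest.sentry.io/0')

    # hypothetical business metric: validated proxies, labelled by outcome
    VALIDATED = Counter('proxies_validated_total', 'Validated proxies', ['result'])

    def record_validation(ok: bool) -> None:
        VALIDATED.labels(result='ok' if ok else 'failed').inc()

    if __name__ == '__main__':
        start_http_server(8000)  # expose /metrics for Prometheus to scrape
        record_validation(True)

Grafana can then chart the scraped Prometheus metrics to show the current state of the pool.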

Donate to the Author

Open source is not easy. If this project is useful to you, consider a small donation to support its continued maintenance.

Similar Projects

This project drew on various open-source crawler proxy implementations on GitHub; thanks to their authors for their work. All referenced projects are listed below, in no particular order.

dungproxy

proxyspider

ProxyPool

proxy_pool

ProxyPool

IPProxyTool

IPProxyPool

proxy_list

proxy_pool

ProxyPool

scylla