haipproxy
:sparkling_heart: Highly available distributed IP proxy pool, powered by Scrapy and Redis
Top Related Projects
Python ProxyPool for web spider
A list of free, public, forward proxy servers. UPDATED DAILY!
Get PROXY List that gets updated everyday
Daily feed of bad IPs (with blacklist hit scores)
Quick Overview
haipproxy is an open-source project that provides a high-performance IP proxy pool. It crawls free proxy IP addresses from various public sources, validates them, and maintains a pool of working proxies. The project is designed to be scalable and efficient, making it suitable for large-scale web scraping tasks.
Pros
- Automated proxy collection and validation
- Scalable architecture using Redis for storage
- Supports multiple proxy protocols (HTTP, HTTPS, SOCKS4/5)
- Customizable scoring system for proxy quality assessment
Cons
- Requires setup and maintenance of multiple components (Redis, Scrapy, etc.)
- Documentation is primarily in Chinese, which may be challenging for non-Chinese speakers
- Relies on free proxy sources, which can be unreliable or change frequently
- May require frequent updates to maintain effectiveness as proxy sources change
Code Examples
- Fetching a proxy from the pool:
from client.py_cli import ProxyFetcher
fetcher = ProxyFetcher('https', strategy='greedy')
proxy = fetcher.get_proxy()
print(proxy) # Output: {'ip': '1.2.3.4', 'port': 8080, 'score': 100}
- Using a proxy with requests library:
import requests
from client.py_cli import ProxyFetcher
fetcher = ProxyFetcher('https', strategy='greedy')
proxy = fetcher.get_proxy()
proxies = {
    'http': f'http://{proxy["ip"]}:{proxy["port"]}',
    'https': f'https://{proxy["ip"]}:{proxy["port"]}'
}
response = requests.get('https://example.com', proxies=proxies)
print(response.status_code)
- Customizing proxy selection strategy:
from client.py_cli import ProxyFetcher
fetcher = ProxyFetcher('https', strategy='robin')
proxy1 = fetcher.get_proxy()
proxy2 = fetcher.get_proxy()
print(proxy1, proxy2) # Output: Two different proxies
Getting Started
- Clone the repository:
  git clone https://github.com/SpiderClub/haipproxy.git
  cd haipproxy
- Install dependencies:
  pip install -r requirements.txt
- Set up Redis and modify config/settings.py with your Redis configuration.
- Run the crawler to gather proxies:
  python crawler_booter.py --usage crawler
- Run the validator to check proxy quality:
  python crawler_booter.py --usage validator
- Use the client to fetch proxies in your application:
  from client.py_cli import ProxyFetcher
  fetcher = ProxyFetcher('https')
  proxy = fetcher.get_proxy()
Competitor Comparisons
Python ProxyPool for web spider
Pros of proxy_pool
- Simpler setup and usage, with fewer dependencies
- Supports multiple database backends (Redis, MongoDB, MySQL)
- Includes a web API for easy integration with other projects
Cons of proxy_pool
- Less sophisticated proxy validation and scoring system
- Fewer customization options for proxy sources and validation methods
- Limited documentation and examples compared to haipproxy
Code Comparison
proxy_pool:
class ProxyCheck(object):
    def __init__(self):
        self.selfip = self.getMyIP()
        self.detect_pool = []
        self.thread_num = 20
        self.detect_queue = Queue()
        self.timeout = 5
haipproxy:
class ProxyValidator:
    def __init__(self, task):
        self.task = task
        self.redis_args = get_redis_args()
        self.pool = get_redis_conn(**self.redis_args)
        self.timeout = 10
Both projects use similar approaches for proxy validation, but haipproxy offers more advanced features and customization options. proxy_pool is easier to set up and use, while haipproxy provides more robust proxy management capabilities. The choice between the two depends on the specific requirements of your project and the level of control you need over the proxy pool.
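To make the comparison concrete, here is a minimal, self-contained sketch of the kind of liveness check both validators build on; it is code from neither project, and the test URL and timeout are assumptions:
import time
import requests

def check_proxy(proxy_url, test_url='https://httpbin.org/ip', timeout=10):
    # Request a known endpoint through the proxy and measure latency.
    start = time.time()
    try:
        resp = requests.get(test_url,
                            proxies={'http': proxy_url, 'https': proxy_url},
                            timeout=timeout)
        resp.raise_for_status()
        return time.time() - start  # latency in seconds
    except requests.RequestException:
        return None  # None means the proxy is unusable

print(check_proxy('http://1.2.3.4:8080'))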
A list of free, public, forward proxy servers. UPDATED DAILY!
Pros of proxy-list
- Simple and straightforward list of proxies
- Regularly updated with new proxy addresses
- Easy to integrate into existing projects
Cons of proxy-list
- Limited functionality compared to haipproxy
- Lacks advanced features like proxy validation and scoring
- No built-in proxy rotation or management system
Code Comparison
proxy-list:
socks5://1.2.3.4:1080
http://5.6.7.8:8080
https://9.10.11.12:3128
haipproxy:
from haipproxy.client.py_cli import ProxyFetcher
args = dict(host='127.0.0.1', port=6379, password='123456')
fetcher = ProxyFetcher('zhihu', strategy='greedy', redis_args=args)
print(fetcher.get_proxy())
The proxy-list repository provides a simple list of proxy addresses, while haipproxy offers a more comprehensive solution with a Python client for fetching and managing proxies.
haipproxy includes features like proxy validation, scoring, and automatic rotation, making it more suitable for complex scraping projects. However, proxy-list's simplicity can be advantageous for quick integration or when only a basic list of proxies is needed.
proxy-list is easier to use out of the box, but haipproxy provides more control and flexibility for managing proxy pools in larger-scale applications.
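As an illustration, a minimal sketch (part of neither project) that turns proxy-list's scheme://host:port lines into a requests-compatible mapping and tries each one:
import requests

# lines in the format proxy-list publishes, e.g. socks5://1.2.3.4:1080
raw_lines = [
    'socks5://1.2.3.4:1080',
    'http://5.6.7.8:8080',
    'https://9.10.11.12:3128',
]

for line in raw_lines:
    proxy = line.strip()
    # route both plain and TLS traffic through the same proxy;
    # socks5:// URLs additionally require the requests[socks] extra
    proxies = {'http': proxy, 'https': proxy}
    try:
        resp = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10)
        print(proxy, resp.status_code)
    except requests.RequestException as exc:
        print(proxy, 'failed:', exc)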
Get PROXY List that gets updated everyday
Pros of PROXY-List
- Simple and straightforward list of proxy servers
- Regularly updated with new proxies
- Easy to integrate into existing projects
Cons of PROXY-List
- Lacks advanced features like proxy validation or scoring
- No built-in proxy rotation or management functionality
- Limited documentation and usage examples
Code Comparison
PROXY-List:
# Example of reading proxies from PROXY-List
with open('proxy.txt', 'r') as f:
proxies = f.readlines()
haipproxy:
# Example of using haipproxy
from client.py_cli import ProxyFetcher
args = dict(host='127.0.0.1', port=6379, password='123456')
fetcher = ProxyFetcher('zhihu', strategy='greedy', redis_args=args)
print(fetcher.get_proxy())
haipproxy offers a more comprehensive solution for proxy management, including features like proxy validation, scoring, and automatic rotation. It provides a Redis-based backend for storing and retrieving proxies, making it suitable for larger-scale applications.
PROXY-List, on the other hand, is a simpler option that provides a regularly updated list of proxy servers. It's easier to integrate into existing projects but lacks advanced features and management capabilities.
The choice between the two depends on the specific requirements of your project. If you need a simple list of proxies, PROXY-List might suffice. For more complex proxy management needs, haipproxy offers a more robust solution.
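To see what the missing rotation means in practice, here is a naive round-robin sketch of the kind one would have to write on top of PROXY-List's proxy.txt, assuming each line is a full proxy URL such as http://1.2.3.4:8080 (no validation or scoring, unlike haipproxy's strategies):
import itertools
import requests

# with a static file such as PROXY-List's proxy.txt, rotation is up to you
with open('proxy.txt') as f:
    proxy_list = [line.strip() for line in f if line.strip()]

rotation = itertools.cycle(proxy_list)  # endless round-robin, no scoring

def fetch(url):
    proxy = next(rotation)  # take the next proxy on every call
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)

print(fetch('https://httpbin.org/ip').text)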
Daily feed of bad IPs (with blacklist hit scores)
Pros of ipsum
- Simpler setup and usage, focused solely on IP blocklists
- Regularly updated with new malicious IPs
- Lightweight and easy to integrate into existing security systems
Cons of ipsum
- Limited functionality compared to haipproxy's proxy harvesting capabilities
- Less customizable for specific use cases
- Lacks advanced features like proxy validation and scoring
Code comparison
ipsum:
wget https://raw.githubusercontent.com/stamparm/ipsum/master/ipsum.txt -O /tmp/ipsum.txt
ipset create ipsum hash:net
for ip in $(grep -v '#' /tmp/ipsum.txt | cut -f 1); do ipset add ipsum $ip; done
iptables -I INPUT -m set --match-set ipsum src -j DROP
haipproxy:
from client.py_cli import ProxyFetcher
args = dict(host='127.0.0.1', port=6379, password='123456')
fetcher = ProxyFetcher('zhihu', strategy='greedy', redis_args=args)
print(fetcher.get_proxy())
Summary
ipsum is a straightforward IP blocklist tool, while haipproxy is a more comprehensive proxy harvesting and management system. ipsum is easier to set up and use for basic IP blocking, but haipproxy offers more advanced features for proxy handling and validation. The choice between them depends on the specific requirements of your project and the level of complexity you're willing to manage.
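For consumers outside iptables, a small Python sketch of reading the feed (assuming, per ipsum's format, that each non-comment line holds an IP followed by its blacklist hit count):
import urllib.request

URL = 'https://raw.githubusercontent.com/stamparm/ipsum/master/ipsum.txt'

blocked = set()
with urllib.request.urlopen(URL) as resp:
    for raw in resp.read().decode().splitlines():
        line = raw.strip()
        if not line or line.startswith('#'):
            continue  # skip header comments
        blocked.add(line.split()[0])  # first column is the IP

print('203.0.113.7 blocked?', '203.0.113.7' in blocked)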
High-Availability IP Proxy Pool
README | Chinese documentation
All of the proxy IP resources this project collects come from the public internet. Its goal is to provide a **highly available, low-latency pool of high-anonymity IP proxies** for large-scale crawler projects.
Project Highlights
- Rich proxy sources
- Precise proxy crawling and extraction
- Strict, sensible proxy validation
- Thorough monitoring and strong robustness
- Flexible architecture that is easy to extend
- Each component can be deployed in a distributed fashion
Quick Start
Note: download the code from the release list; the code on the master branch is not guaranteed to run stably.
Standalone Deployment
Server
- Install Python 3 and Redis. If you run into problems, the relevant sections of this article may help.
- Edit the project configuration file config/settings.py and set REDIS_HOST, REDIS_PASSWORD, and related parameters to match your actual Redis setup (see the sketch after these steps).
- Install scrapy-splash and set SPLASH_URL in config/settings.py.
- Install the project dependencies:
  pip install -r requirements.txt
- Start the scrapy workers, i.e. the proxy IP crawler and the validator:
  python crawler_booter.py --usage crawler
  python crawler_booter.py --usage validator
- Start the schedulers, which periodically schedule proxy IP crawling and validation:
  python scheduler_booter.py --usage crawler
  python scheduler_booter.py --usage validator
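For reference, a minimal sketch of the config/settings.py entries touched above; the values are placeholders, so substitute your own:
# config/settings.py (excerpt; placeholder values)
REDIS_HOST = '127.0.0.1'
REDIS_PASSWORD = '123456'
SPLASH_URL = 'http://127.0.0.1:8050'  # address of your scrapy-splash instance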
Client
A recurring question is how to obtain the usable proxy IPs collected by this project. haipproxy does not expose the proxies through an API; it exposes them through concrete clients. Currently a Python client and a language-agnostic squid second-level proxy are supported.
Python client example
from client.py_cli import ProxyFetcher
args = dict(host='127.0.0.1', port=6379, password='123456', db=0)
# 'zhihu' means: fetch IPs from the validated-proxy queue associated with zhihu;
# queues are per-site because the same proxy IP performs differently
# against different target sites
fetcher = ProxyFetcher('zhihu', strategy='greedy', redis_args=args)
# fetch one usable proxy
print(fetcher.get_proxy())
# fetch the list of usable proxies
print(fetcher.get_proxies())  # or print(fetcher.pool)
For a more complete example, see examples/zhihu.
squid as a second-level proxy
- Install squid, back up its configuration file, and start it (Ubuntu example):
  sudo apt-get install squid
  sudo sed -i 's/http_access deny all/http_access allow all/g' /etc/squid/squid.conf
  sudo cp /etc/squid/squid.conf /etc/squid/squid.conf.backup
  sudo service squid start
- Set SQUID_BIN_PATH, SQUID_CONF_PATH, SQUID_TEMPLATE_PATH, and related parameters in config/settings.py according to your operating system.
- Start the program that periodically refreshes the squid configuration:
  sudo python squid_update.py
- Use squid as a proxy middle layer for requests to the target site; the default proxy URL is 'http://squid_host:3128'. A Python example:
  import requests
  proxies = {'https': 'http://127.0.0.1:3128'}
  resp = requests.get('https://httpbin.org/ip', proxies=proxies)
  print(resp.text)
Docker Deployment
- Install Docker.
- Install docker-compose:
  pip install -U docker-compose
- Set the SPLASH_URL and REDIS_HOST parameters in settings.py:
  # Note: if you are using the code on the master branch, this step can be skipped
  SPLASH_URL = 'http://splash:8050'
  REDIS_HOST = 'redis'
- Start all application components with docker-compose:
  docker-compose up
This approach also deploys squid, so you can consume the proxy IP pool either through squid or through the client, exactly as in the standalone deployment.
Notes
- This project depends heavily on Redis: besides message passing and data storage, the IP validators and the scheduled-task tooling also use several of Redis's data structures. If you need to replace Redis, weigh the effort yourself. (A sketch for inspecting those structures follows this list.)
- Because of the GFW, some sites can only be reached and crawled from outside the firewall. If you cannot access sites beyond the wall, either set the enable attribute to 0 for the tasks in rules.py whose task_queue is SPIDER_GFW_TASK or SPIDER_AJAX_GFW_TASK, or specify the spider types common and ajax when starting the crawler:
  python crawler_booter.py --usage crawler common ajax
- The same proxy IP can perform very differently against different sites. If the general-purpose proxies do not meet your needs, you can write a proxy IP validator for a specific site; a sketch of the idea also follows this list.
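On the first note, a minimal sketch for peeking at the structures haipproxy keeps in Redis; the connection parameters are assumptions (use the values from config/settings.py), and key names vary by version:
import redis

conn = redis.StrictRedis(host='127.0.0.1', port=6379,
                         password='123456', db=0, decode_responses=True)

# list what the pool has actually created, with each key's Redis type
for key in conn.scan_iter(count=100):
    print(key, conn.type(key))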
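And on the last note, a hypothetical per-site validator; this is not one of haipproxy's validator classes, just the shape of the check (the target URL and marker string are assumptions):
import requests

def validate_for_site(proxy_url, target='https://www.zhihu.com',
                      marker='zhihu', timeout=10):
    # the proxy must fetch the target page, and the response must
    # actually look like that site's content rather than a block page
    try:
        resp = requests.get(target,
                            proxies={'http': proxy_url, 'https': proxy_url},
                            timeout=timeout)
        return resp.ok and marker in resp.text.lower()
    except requests.RequestException:
        return False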
Workflow
Benchmark
haipproxy and the test code were deployed in standalone mode, with zhihu as the target request site, and the crawl performance was measured.
The test code is in examples/zhihu.
Project Monitoring (optional)
Monitoring relies mainly on Sentry and Prometheus: instrumentation at key points in the code tracks the project along each dimension, in order to improve its robustness.
The project uses Sentry as its bug-tracing tool; Sentry makes it easy to follow the project's health.
Prometheus + Grafana provide business-level monitoring, giving a view of the project's current state.
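By way of illustration, a minimal standalone sketch of both hooks; this is not the project's actual instrumentation, and the DSN and metric name are placeholders:
import sentry_sdk
from prometheus_client import Counter, start_http_server

# placeholder DSN; use the one from your Sentry project
sentry_sdk.init(dsn='https://examplePublicKey@o0.ingest.sentry.io/0')

# hypothetical metric counting proxies that pass validation
PROXIES_VALIDATED = Counter('proxies_validated_total',
                            'Number of proxies that passed validation')

start_http_server(8000)  # expose /metrics for Prometheus to scrape
PROXIES_VALIDATED.inc()  # call at each instrumentation point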
Donate to the Author
Maintaining open source is not easy. If this project is useful to you, consider a small donation to support its continued maintenance.
Similar Projects
This project drew on the various open-source crawler proxy implementations on GitHub; thanks go to their authors for their work. The projects consulted are listed below, in no particular order.
Python ProxyPool for web spider
A list of free, public, forward proxy servers. UPDATED DAILY!
Get PROXY List that gets updated everyday
Daily feed of bad IPs (with blacklist hit scores)