Python3WebSpider / ProxyPool

An Efficient ProxyPool with Getter, Tester and Server

Top Related Projects

Python ProxyPool for web spider

:sparkling_heart: Highly available distributed IP proxy pool, powered by Scrapy and Redis

Python IP proxy tool with Scrapy crawling: crawls large numbers of free proxy IPs and extracts the usable ones

IPProxyPool proxy pool project, providing proxy IPs

Quick Overview

ProxyPool is an open-source Python project that provides a proxy IP pool with automatic crawling, verification, and API support. It aims to offer a reliable source of proxy IPs for web scraping and other network-related tasks, with features like customizable crawling, proxy validation, and easy integration through a RESTful API.

Pros

  • Automatic proxy crawling and validation, reducing manual effort
  • RESTful API for easy integration with other applications
  • Customizable proxy sources and validation criteria
  • Support for multiple database backends (Redis, MongoDB)

Cons

  • Limited documentation, especially for advanced configurations
  • Potential legal and ethical concerns when using proxies without permission
  • May require frequent updates to maintain effectiveness as proxy sources change
  • Performance can vary depending on the quality of crawled proxies

Code Examples

  1. Fetching a random proxy:

    import requests

    proxy = requests.get("http://localhost:5555/random").text
    print(f"Random proxy: {proxy}")

  2. Getting a proxy count:

    import requests

    count = requests.get("http://localhost:5555/count").text
    print(f"Total proxies: {count}")

  3. Checking if a specific IP is available:

    import requests

    ip = "1.1.1.1"
    result = requests.get(f"http://localhost:5555/{ip}").text
    print(f"Is {ip} available: {result}")

Getting Started

  1. Clone the repository:

    git clone https://github.com/Python3WebSpider/ProxyPool.git
    
  2. Install dependencies:

    cd ProxyPool
    pip install -r requirements.txt
    
  3. Configure the settings in proxypool/setting.py (e.g., database connection, proxy sources)

  4. Run the proxy pool:

    python3 run.py
    
  5. Access the API at http://localhost:5555

Competitor Comparisons

Python ProxyPool for web spider

Pros of proxy_pool

  • More active development with recent updates and contributions
  • Includes a web API for easy integration with other applications
  • Supports multiple database backends (Redis, MongoDB, SQLite)

Cons of proxy_pool

  • Less comprehensive documentation compared to ProxyPool
  • Fewer built-in proxy sources out of the box
  • Slightly more complex setup process

Code Comparison

ProxyPool:

class Crawler(object):
    def crawl(self):
        print('Crawler is working')
        proxy_count = 0
        for callback_label in range(self.crawler_func.__len__()):
            callback = self.crawler_func[callback_label]
            proxies = callback()
            for proxy in proxies:
                proxy = proxy.strip()
                if proxy and self.db.add(proxy):
                    proxy_count += 1
        return proxy_count

proxy_pool:

class ProxyFetcher(object):
    def run(self):
        print('ProxyFetcher is working')
        proxy_count = 0
        for callback_label in self.fetcher_func:
            callback = getattr(self, callback_label)
            proxies = callback()
            for proxy in proxies:
                if proxy.strip():
                    self.db.put(proxy)
                    proxy_count += 1
        return proxy_count

Both projects implement similar functionality for crawling and fetching proxies, but proxy_pool uses a slightly different approach with getattr() to access callback functions. ProxyPool's implementation is more straightforward, while proxy_pool offers more flexibility in method naming and organization.
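
For readers unfamiliar with the pattern, here is a small self-contained sketch of the two dispatch styles; the class and method names are illustrative and belong to neither project.

class ListDispatch:
    """Callbacks stored as bound-method objects in a list (ProxyPool style)."""
    def __init__(self):
        self.crawler_func = [self.fetch_a, self.fetch_b]

    def fetch_a(self):
        return ['1.1.1.1:80']

    def fetch_b(self):
        return ['2.2.2.2:8080']

    def run(self):
        return [proxy for callback in self.crawler_func for proxy in callback()]


class NameDispatch:
    """Callbacks referenced by name and resolved with getattr (proxy_pool style)."""
    fetcher_func = ['fetch_a', 'fetch_b']

    def fetch_a(self):
        return ['1.1.1.1:80']

    def fetch_b(self):
        return ['2.2.2.2:8080']

    def run(self):
        return [proxy for name in self.fetcher_func for proxy in getattr(self, name)()]


print(ListDispatch().run())  # ['1.1.1.1:80', '2.2.2.2:8080']
print(NameDispatch().run())  # ['1.1.1.1:80', '2.2.2.2:8080']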

:sparkling_heart: Highly available distributed IP proxy pool, powered by Scrapy and Redis

Pros of haipproxy

  • More advanced proxy acquisition methods, including crawling from multiple sources and supporting ADSL dial-up
  • Better scalability with distributed architecture using Scrapy and Redis
  • More comprehensive proxy validation and scoring system

Cons of haipproxy

  • More complex setup and configuration due to its distributed nature
  • Steeper learning curve for users unfamiliar with Scrapy and Redis
  • Potentially higher resource requirements for running the full system

Code Comparison

ProxyPool:

def crawl_xicidaili():
    for i in range(1, 3):
        start_url = 'http://www.xicidaili.com/nn/{}'.format(i)
        html = get_page(start_url)
        ip_addresses = re.findall(r'<td>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})</td>', html)
        port_numbers = re.findall(r'<td>(\d+)</td>', html)
        for address, port in zip(ip_addresses, port_numbers):
            yield ':'.join([address, port])

haipproxy:

class XiciSpider(RedisSpider):
    def parse(self, response):
        for sel in response.xpath('//table[@id="ip_list"]/tr[position()>1]'):
            ip = sel.xpath('./td[2]/text()').extract_first()
            port = sel.xpath('./td[3]/text()').extract_first()
            yield Proxy(host=ip, port=port)

The code comparison shows that haipproxy uses Scrapy's more structured approach for crawling, while ProxyPool uses a simpler custom crawling method.

Python IP proxy tool with Scrapy crawling: crawls large numbers of free proxy IPs and extracts the usable ones

Pros of IPProxyTool

  • Supports multiple proxy sources, including free and paid services
  • Includes a web interface for easy management and visualization of proxy data
  • Offers more detailed proxy information, such as response time and anonymity level

Cons of IPProxyTool

  • Less actively maintained, with fewer recent updates
  • More complex setup process compared to ProxyPool
  • Limited documentation, which may make it harder for new users to get started

Code Comparison

IPProxyTool:

class Validator(object):
    def __init__(self):
        self.detect_from = "http://httpbin.org/get"
        self.timeout = 10
        self.valid_proxies = []
        self.invalid_proxies = []

ProxyPool:

class Tester(object):
    def __init__(self):
        self.redis = RedisClient()
        self.proxy_tester = ProxyTester()

    def run(self):
        print('Tester is working')
        try:
            proxies = self.redis.all()
            loop = asyncio.get_event_loop()
            for i in range(0, len(proxies), BATCH_TEST_SIZE):
                test_proxies = proxies[i:i + BATCH_TEST_SIZE]
                tasks = [self.proxy_tester.test(proxy) for proxy in test_proxies]
                loop.run_until_complete(asyncio.wait(tasks))
        except Exception as e:
            print('Tester error', e.args)

Both projects aim to provide a pool of usable proxies, but they differ in their implementation and features. IPProxyTool offers a more comprehensive set of features, including a web interface and support for multiple proxy sources. However, ProxyPool has a simpler setup process and is more actively maintained, making it potentially more suitable for users who prioritize ease of use and ongoing support.

IPProxyPool proxy pool project, providing proxy IPs

Pros of IPProxyPool

  • More comprehensive proxy validation process, including checks for anonymity levels and support for HTTPS
  • Includes a web interface for easy management and visualization of proxy data
  • Supports multiple database backends (Redis, MongoDB, MySQL) for storing proxy information

Cons of IPProxyPool

  • Less actively maintained, with fewer recent updates compared to ProxyPool
  • More complex setup process due to additional dependencies and database requirements
  • Limited documentation, especially for non-Chinese speakers

Code Comparison

IPProxyPool:

def validUsefulProxy(proxy):
    if isinstance(proxy, bytes):
        proxy = proxy.decode('utf8')
    proxies = {"http": "http://{proxy}".format(proxy=proxy)}
    try:
        r = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=30, verify=False)
        if r.status_code == 200 and r.json().get('origin'):
            return True
    except:
        return False

ProxyPool:

async def test_proxy(proxy):
    try:
        async with aiohttp.ClientSession() as session:
            try:
                async with session.get(TEST_URL, proxy=f'http://{proxy}', timeout=15, ssl=False) as response:
                    if response.status == 200:
                        return True
            except:
                return False
    except:
        return False

The code comparison shows that IPProxyPool uses synchronous requests for proxy validation, while ProxyPool utilizes asynchronous HTTP requests, potentially offering better performance for large-scale proxy testing.
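
As an illustration of the asynchronous approach, the following is a minimal sketch (not code from either project) that validates a batch of proxies concurrently with aiohttp; the test URL, timeout, and proxy addresses are placeholders.

import asyncio
import aiohttp

TEST_URL = 'http://httpbin.org/get'  # placeholder test endpoint

async def test_proxy(session, proxy):
    """Return (proxy, ok) after one test request through the proxy."""
    try:
        async with session.get(TEST_URL, proxy=f'http://{proxy}',
                               timeout=aiohttp.ClientTimeout(total=15),
                               ssl=False) as response:
            return proxy, response.status == 200
    except Exception:
        return proxy, False

async def test_batch(proxies):
    async with aiohttp.ClientSession() as session:
        # all requests are issued concurrently instead of one at a time
        return await asyncio.gather(*(test_proxy(session, p) for p in proxies))

if __name__ == '__main__':
    print(asyncio.run(test_batch(['1.1.1.1:80', '2.2.2.2:8080'])))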

README

ProxyPool

A simple and efficient proxy pool that provides the following features:

  • Periodically crawls free proxy sites; simple and easy to extend.
  • Stores proxies in Redis and ranks them by availability.
  • Periodically tests and filters proxies, discarding unusable ones and keeping the usable ones.
  • Provides a proxy API that returns a random proxy that has passed testing.

An explanation of how the proxy pool works can be found in 「如何搭建一个高效的代理池」 ("How to Build an Efficient Proxy Pool"); reading it before using this project is recommended.

Before You Start

This proxy pool is built on various publicly available proxy sources, so its availability is not high: you may find only one or two usable proxies among hundreds or thousands, which makes it unsuitable for direct use in crawling tasks.

If your goal is to finish a crawling task with proxies as quickly as possible, consider using paid proxies or existing proxy resources instead. If your goal is to learn how to build a proxy pool, you can follow this project through the remaining steps.

Recommended paid proxies:

Preparation

First, clone the repository and enter the ProxyPool folder:

git clone https://github.com/Python3WebSpider/ProxyPool.git
cd ProxyPool

Then run it using either of the two approaches below: Docker or the conventional way.

Requirements

The proxy pool can be run in two ways: with Docker (recommended) or the conventional way. The requirements are as follows:

Docker

If you use Docker, you will need the following installed:

  • Docker
  • Docker-Compose

Installation instructions can easily be found online.

Official Docker Hub image: germey/proxypool

Conventional Method

The conventional method requires a Python environment and a Redis environment; specifically:

  • Python>=3.6
  • Redis

Running with Docker

Once Docker and Docker-Compose are installed, a single command is all it takes:

docker-compose up

The output looks something like this:

redis        | 1:M 19 Feb 2020 17:09:43.940 * DB loaded from disk: 0.000 seconds
redis        | 1:M 19 Feb 2020 17:09:43.940 * Ready to accept connections
proxypool    | 2020-02-19 17:09:44,200 CRIT Supervisor is running as root.  Privileges were not dropped because no user is specified in the config file.  If you intend to run as root, you can set user=root in the config file to avoid this message.
proxypool    | 2020-02-19 17:09:44,203 INFO supervisord started with pid 1
proxypool    | 2020-02-19 17:09:45,209 INFO spawned: 'getter' with pid 10
proxypool    | 2020-02-19 17:09:45,212 INFO spawned: 'server' with pid 11
proxypool    | 2020-02-19 17:09:45,216 INFO spawned: 'tester' with pid 12
proxypool    | 2020-02-19 17:09:46,596 INFO success: getter entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
proxypool    | 2020-02-19 17:09:46,596 INFO success: server entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
proxypool    | 2020-02-19 17:09:46,596 INFO success: tester entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

As you can see, Redis, the Getter, the Server, and the Tester have all started successfully.

Now visit http://localhost:5555/random to get a random usable proxy.

If downloads are especially slow, you can edit the Dockerfile and change:

- RUN pip install -r requirements.txt
+ RUN pip install -r requirements.txt -i https://pypi.douban.com/simple

Running the Conventional Way

If you are not using Docker, you can also run the pool after setting up the Python and Redis environments. The steps are as follows.

Installing and Configuring Redis

A locally installed Redis, a Redis started via Docker, or a remote Redis all work, as long as it can be connected to and used normally.

First you need to set a few environment variables; the proxy pool reads these values from the environment.

There are two ways to set the Redis environment variables: either set host, port, and password separately, or set a single connection string. The two methods are shown below.

To set host, port, and password (if the password is empty, set it to an empty string):

export PROXYPOOL_REDIS_HOST='localhost'
export PROXYPOOL_REDIS_PORT=6379
export PROXYPOOL_REDIS_PASSWORD=''
export PROXYPOOL_REDIS_DB=0

Or set only the connection string:

export PROXYPOOL_REDIS_CONNECTION_STRING='redis://localhost'

The connection string must follow the format redis://[:password@]host[:port][/database]. The bracketed parts may be omitted; port defaults to 6379, database defaults to 0, and the password defaults to empty.
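
For example, a connection string with a (hypothetical) password, a non-default port, and database 1 would look like this:

export PROXYPOOL_REDIS_CONNECTION_STRING='redis://:somepassword@localhost:6380/1'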

Use whichever of the two methods you prefer.

Installing Dependencies

It is strongly recommended to create a virtual environment with Conda or virtualenv, using Python 3.6 or later.

Then install the dependencies with pip:

pip3 install -r requirements.txt

Running the Proxy Pool

There are two ways to run the proxy pool: run the Tester, Getter, and Server all together, or run them separately as needed.

In general you can simply run everything:

python3 run.py

This starts the Tester, Getter, and Server; you can then visit http://localhost:5555/random to get a random usable proxy.

Or, once you understand the proxy pool's architecture, you can run the components separately as needed:

python3 run.py --processor getter
python3 run.py --processor tester
python3 run.py --processor server

Here the processor argument specifies whether to run the Tester, the Getter, or the Server.

Usage

Once it is running, you can get a random usable proxy from http://localhost:5555/random.

You can also integrate it programmatically; the example below shows how to fetch a proxy and crawl a page with it:

import requests

proxypool_url = 'http://127.0.0.1:5555/random'
target_url = 'http://httpbin.org/get'

def get_random_proxy():
    """
    get random proxy from proxypool
    :return: proxy
    """
    return requests.get(proxypool_url).text.strip()

def crawl(url, proxy):
    """
    use proxy to crawl page
    :param url: page url
    :param proxy: proxy, such as 8.8.8.8:8888
    :return: html
    """
    proxies = {'http': 'http://' + proxy}
    return requests.get(url, proxies=proxies).text


def main():
    """
    main method, entry point
    :return: none
    """
    proxy = get_random_proxy()
    print('get random proxy', proxy)
    html = crawl(target_url, proxy)
    print(html)

if __name__ == '__main__':
    main()

The output looks like this:

get random proxy 116.196.115.209:8080
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0",
    "X-Amzn-Trace-Id": "Root=1-5e4d7140-662d9053c0a2e513c7278364"
  },
  "origin": "116.196.115.209",
  "url": "https://httpbin.org/get"
}

As you can see, a proxy was obtained successfully, and a request to httpbin.org confirmed that the proxy works.

Configuration Options

Several parameters of the proxy pool can be configured via environment variables.

Switches

  • ENABLE_TESTER: whether the Tester is allowed to start, default true
  • ENABLE_GETTER: whether the Getter is allowed to start, default true
  • ENABLE_SERVER: whether the Server is allowed to start, default true

Environment

  • APP_ENV: runtime environment; can be dev, test, or prod (development, test, production), default dev
  • APP_DEBUG: debug mode, true or false, default true
  • APP_PROD_METHOD: how the app is served in production, default gevent; alternatives are tornado and meinheld (which require installing the tornado or meinheld module respectively)

Redis Connection

  • PROXYPOOL_REDIS_HOST / REDIS_HOST: Redis host; PROXYPOOL_REDIS_HOST overrides the value of REDIS_HOST.
  • PROXYPOOL_REDIS_PORT / REDIS_PORT: Redis port; PROXYPOOL_REDIS_PORT overrides the value of REDIS_PORT.
  • PROXYPOOL_REDIS_PASSWORD / REDIS_PASSWORD: Redis password; PROXYPOOL_REDIS_PASSWORD overrides the value of REDIS_PASSWORD.
  • PROXYPOOL_REDIS_DB / REDIS_DB: Redis database index, e.g. 0 or 1; PROXYPOOL_REDIS_DB overrides the value of REDIS_DB.
  • PROXYPOOL_REDIS_CONNECTION_STRING / REDIS_CONNECTION_STRING: Redis connection string; PROXYPOOL_REDIS_CONNECTION_STRING overrides the value of REDIS_CONNECTION_STRING.
  • PROXYPOOL_REDIS_KEY / REDIS_KEY: name of the Redis key under which proxies are stored; PROXYPOOL_REDIS_KEY overrides the value of REDIS_KEY.
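
The override rule can be pictured with a small sketch; this is only an illustration of the behaviour described above, not the project's actual configuration code:

import os

def read_redis_setting(name, default=None):
    # the PROXYPOOL_-prefixed variable wins whenever both are set
    return os.environ.get(f'PROXYPOOL_{name}', os.environ.get(name, default))

# defaults here are placeholders for illustration only
REDIS_HOST = read_redis_setting('REDIS_HOST', 'localhost')
REDIS_PORT = int(read_redis_setting('REDIS_PORT', 6379))
REDIS_KEY = read_redis_setting('REDIS_KEY', 'proxies')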

Processors

  • CYCLE_TESTER: Tester run cycle, i.e. how often testing runs, default 20 seconds
  • CYCLE_GETTER: Getter run cycle, i.e. how often proxy fetching runs, default 100 seconds
  • TEST_URL: test URL, defaults to Baidu
  • TEST_TIMEOUT: test timeout, default 10 seconds
  • TEST_BATCH: number of proxies tested per batch, default 20
  • TEST_VALID_STATUS: status codes treated as valid during testing
  • API_HOST: host the proxy Server listens on, default 0.0.0.0
  • API_PORT: port the proxy Server listens on, default 5555
  • API_THREADED: whether the proxy Server runs multi-threaded, default true

Logging

  • LOG_DIR: relative path for log files
  • LOG_RUNTIME_FILE: runtime log file name
  • LOG_ERROR_FILE: error log file name
  • LOG_ROTATION: log rotation period or size, default 500MB, see loguru - rotation
  • LOG_RETENTION: log retention period, default 7 days, see loguru - retention
  • ENABLE_LOG_FILE: whether to write log files, default true; if set to false, neither ENABLE_LOG_RUNTIME_FILE nor ENABLE_LOG_ERROR_FILE takes effect
  • ENABLE_LOG_RUNTIME_FILE: whether to write the runtime log file, default true
  • ENABLE_LOG_ERROR_FILE: whether to write the error log file, default true

All of the options above can be set via environment variables before running. For example, to change the test URL and the Redis key name:

export TEST_URL=http://weibo.cn
export REDIS_KEY=proxies:weibo

This builds a proxy pool dedicated to Weibo: every valid proxy in it can be used to crawl Weibo.

If you start the proxy pool with Docker-Compose, specify the environment variables in the docker-compose.yml file, for example:

version: "3"
services:
  redis:
    image: redis:alpine
    container_name: redis
    command: redis-server
    ports:
      - "6379:6379"
    restart: always
  proxypool:
    build: .
    image: "germey/proxypool"
    container_name: proxypool
    ports:
      - "5555:5555"
    restart: always
    environment:
      REDIS_HOST: redis
      TEST_URL: http://weibo.cn
      REDIS_KEY: proxies:weibo

Extending the Proxy Crawlers

The proxy crawlers all live in the proxypool/crawlers folder; at the moment only a limited number of proxy sources are integrated.

To add a crawler, simply create a new Python file in the crawlers folder and declare a class in it.

The convention looks like this:

from pyquery import PyQuery as pq
from proxypool.schemas.proxy import Proxy
from proxypool.crawlers.base import BaseCrawler

BASE_URL = 'http://www.664ip.cn/{page}.html'
MAX_PAGE = 5

class Daili66Crawler(BaseCrawler):
    """
    daili66 crawler, http://www.66ip.cn/1.html
    """
    urls = [BASE_URL.format(page=page) for page in range(1, MAX_PAGE + 1)]

    def parse(self, html):
        """
        parse html file to get proxies
        :return:
        """
        doc = pq(html)
        trs = doc('.containerbox table tr:gt(0)').items()
        for tr in trs:
            host = tr.find('td:nth-child(1)').text()
            port = int(tr.find('td:nth-child(2)').text())
            yield Proxy(host=host, port=port)

You only need to define a crawler class that inherits from BaseCrawler, then define the urls variable and the parse method.

  • The urls variable is the list of proxy-site URLs to crawl; it can be generated programmatically or written out literally.
  • The parse method takes a single argument, html, the HTML of the proxy page. Inside parse you only need to parse that HTML, extract the host and port, build a Proxy object, and yield it.

You do not need to implement the page fetching itself; BaseCrawler already provides a default implementation. To change how pages are fetched, override the crawl method.
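
As a rough illustration, here is a sketch of a crawler that overrides crawl to fetch pages itself, for example to send custom headers; the source URL and page format are hypothetical, and it assumes (as the default implementation does) that crawl yields the Proxy objects produced by parse.

import requests

from proxypool.schemas.proxy import Proxy
from proxypool.crawlers.base import BaseCrawler

BASE_URL = 'http://example.com/free-proxies.txt'  # hypothetical plain-text proxy list


class ExampleTextCrawler(BaseCrawler):
    """hypothetical crawler that fetches pages itself instead of using the default crawl"""
    urls = [BASE_URL]

    def crawl(self):
        for url in self.urls:
            # custom fetching, e.g. with a specific User-Agent and timeout
            html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10).text
            yield from self.parse(html)

    def parse(self, html):
        # assume one "host:port" pair per line (hypothetical page format)
        for line in html.splitlines():
            line = line.strip()
            if line and ':' in line:
                host, port = line.split(':', maxsplit=1)
                yield Proxy(host=host, port=int(port))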

Pull Requests contributing new crawlers are very welcome; they make the pool's proxy sources richer and more robust.

Deployment

This project provides Kubernetes deployment scripts; to deploy to Kubernetes, please refer to kubernetes.

If you are interested in developing this project together, feel free to leave a comment in an Issue. Thank you very much!

LICENSE

MIT