Python3WebSpider / ProxyPool

An Efficient ProxyPool with Getter, Tester and Server

Top Related Projects

Python ProxyPool for web spider

:sparkling_heart: Highly available distributed IP proxy pool, powered by Scrapy and Redis

Python IP proxy tool with Scrapy crawling: crawls large numbers of free proxy IPs and extracts the usable ones

IPProxyPool proxy pool project, providing proxy IPs

Quick Overview

ProxyPool is an open-source Python project that provides a proxy IP pool with automatic crawling, verification, and API support. It aims to offer a reliable source of proxy IPs for web scraping and other network-related tasks, with features like customizable crawling, proxy validation, and easy integration through a RESTful API.

Pros

  • Automatic proxy crawling and validation, reducing manual effort
  • RESTful API for easy integration with other applications
  • Customizable proxy sources and validation criteria
  • Support for multiple database backends (Redis, MongoDB)

Cons

  • Limited documentation, especially for advanced configurations
  • Potential legal and ethical concerns when using proxies without permission
  • May require frequent updates to maintain effectiveness as proxy sources change
  • Performance can vary depending on the quality of crawled proxies

Code Examples

  1. Fetching a random proxy:

    import requests

    proxy = requests.get("http://localhost:5555/random").text
    print(f"Random proxy: {proxy}")

  2. Getting a proxy count:

    import requests

    count = requests.get("http://localhost:5555/count").text
    print(f"Total proxies: {count}")

  3. Checking if a specific IP is available:

    import requests

    ip = "1.1.1.1"
    result = requests.get(f"http://localhost:5555/{ip}").text
    print(f"Is {ip} available: {result}")

Getting Started

  1. Clone the repository:

    git clone https://github.com/Python3WebSpider/ProxyPool.git
    
  2. Install dependencies:

    cd ProxyPool
    pip install -r requirements.txt
    
  3. Configure the settings in proxypool/setting.py (e.g., database connection, proxy sources)

  4. Run the proxy pool:

    python3 run.py
    
  5. Access the API at http://localhost:5555

Competitor Comparisons

Python ProxyPool for web spider

Pros of proxy_pool

  • More active development with recent updates and contributions
  • Includes a web API for easy integration with other applications
  • Supports multiple database backends (Redis, MongoDB, SQLite)

Cons of proxy_pool

  • Less comprehensive documentation compared to ProxyPool
  • Fewer built-in proxy sources out of the box
  • Slightly more complex setup process

Code Comparison

ProxyPool:

class Crawler(object):
    def crawl(self):
        print('Crawler is working')
        proxy_count = 0
        for callback_label in range(self.crawler_func.__len__()):
            callback = self.crawler_func[callback_label]
            proxies = callback()
            for proxy in proxies:
                proxy = proxy.strip()
                if proxy and self.db.add(proxy):
                    proxy_count += 1
        return proxy_count

proxy_pool:

class ProxyFetcher(object):
    def run(self):
        print('ProxyFetcher is working')
        proxy_count = 0
        for callback_label in self.fetcher_func:
            callback = getattr(self, callback_label)
            proxies = callback()
            for proxy in proxies:
                if proxy.strip():
                    self.db.put(proxy)
                    proxy_count += 1
        return proxy_count

Both projects implement similar functionality for crawling and fetching proxies, but proxy_pool uses a slightly different approach with getattr() to access callback functions. ProxyPool's implementation is more straightforward, while proxy_pool offers more flexibility in method naming and organization.
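
For readers unfamiliar with the pattern, here is a small self-contained sketch of the two dispatch styles; the class and method names are illustrative and belong to neither project.

class ListDispatch:
    """Callbacks stored as bound-method objects in a list (ProxyPool style)."""
    def __init__(self):
        self.crawler_func = [self.fetch_a, self.fetch_b]

    def fetch_a(self):
        return ['1.1.1.1:80']

    def fetch_b(self):
        return ['2.2.2.2:8080']

    def run(self):
        return [proxy for callback in self.crawler_func for proxy in callback()]


class NameDispatch:
    """Callbacks referenced by name and resolved with getattr (proxy_pool style)."""
    fetcher_func = ['fetch_a', 'fetch_b']

    def fetch_a(self):
        return ['1.1.1.1:80']

    def fetch_b(self):
        return ['2.2.2.2:8080']

    def run(self):
        return [proxy for name in self.fetcher_func for proxy in getattr(self, name)()]


print(ListDispatch().run())  # ['1.1.1.1:80', '2.2.2.2:8080']
print(NameDispatch().run())  # ['1.1.1.1:80', '2.2.2.2:8080']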

:sparkling_heart: Highly available distributed IP proxy pool, powered by Scrapy and Redis

Pros of haipproxy

  • More advanced proxy acquisition methods, including crawling from multiple sources and supporting ADSL dial-up
  • Better scalability with distributed architecture using Scrapy and Redis
  • More comprehensive proxy validation and scoring system

Cons of haipproxy

  • More complex setup and configuration due to its distributed nature
  • Steeper learning curve for users unfamiliar with Scrapy and Redis
  • Potentially higher resource requirements for running the full system

Code Comparison

ProxyPool:

def crawl_xicidaili():
    for i in range(1, 3):
        start_url = 'http://www.xicidaili.com/nn/{}'.format(i)
        html = get_page(start_url)
        ip_addresses = re.findall(r'<td>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})</td>', html)
        port_numbers = re.findall(r'<td>(\d+)</td>', html)
        for address, port in zip(ip_addresses, port_numbers):
            yield ':'.join([address, port])

haipproxy:

class XiciSpider(RedisSpider):
    def parse(self, response):
        for sel in response.xpath('//table[@id="ip_list"]/tr[position()>1]'):
            ip = sel.xpath('./td[2]/text()').extract_first()
            port = sel.xpath('./td[3]/text()').extract_first()
            yield Proxy(host=ip, port=port)

The code comparison shows that haipproxy uses Scrapy's more structured approach for crawling, while ProxyPool uses a simpler custom crawling method.

Python IP proxy tool with Scrapy crawling: crawls large numbers of free proxy IPs and extracts the usable ones

Pros of IPProxyTool

  • Supports multiple proxy sources, including free and paid services
  • Includes a web interface for easy management and visualization of proxy data
  • Offers more detailed proxy information, such as response time and anonymity level

Cons of IPProxyTool

  • Less actively maintained, with fewer recent updates
  • More complex setup process compared to ProxyPool
  • Limited documentation, which may make it harder for new users to get started

Code Comparison

IPProxyTool:

class Validator(object):
    def __init__(self):
        self.detect_from = "http://httpbin.org/get"
        self.timeout = 10
        self.valid_proxies = []
        self.invalid_proxies = []

ProxyPool:

class Tester(object):
    def __init__(self):
        self.redis = RedisClient()
        self.proxy_tester = ProxyTester()

    def run(self):
        print('Tester is working')
        try:
            proxies = self.redis.all()
            loop = asyncio.get_event_loop()
            for i in range(0, len(proxies), BATCH_TEST_SIZE):
                test_proxies = proxies[i:i + BATCH_TEST_SIZE]
                tasks = [self.proxy_tester.test(proxy) for proxy in test_proxies]
                loop.run_until_complete(asyncio.wait(tasks))
        except Exception as e:
            print('Tester error', e.args)

Both projects aim to provide a pool of usable proxies, but they differ in their implementation and features. IPProxyTool offers a more comprehensive set of features, including a web interface and support for multiple proxy sources. However, ProxyPool has a simpler setup process and is more actively maintained, making it potentially more suitable for users who prioritize ease of use and ongoing support.

IPProxyPool proxy pool project, providing proxy IPs

Pros of IPProxyPool

  • More comprehensive proxy validation process, including checks for anonymity levels and support for HTTPS
  • Includes a web interface for easy management and visualization of proxy data
  • Supports multiple database backends (Redis, MongoDB, MySQL) for storing proxy information

Cons of IPProxyPool

  • Less actively maintained, with fewer recent updates compared to ProxyPool
  • More complex setup process due to additional dependencies and database requirements
  • Limited documentation, especially for non-Chinese speakers

Code Comparison

IPProxyPool:

def validUsefulProxy(proxy):
    if isinstance(proxy, bytes):
        proxy = proxy.decode('utf8')
    proxies = {"http": "http://{proxy}".format(proxy=proxy)}
    try:
        r = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=30, verify=False)
        if r.status_code == 200 and r.json().get('origin'):
            return True
    except:
        return False

ProxyPool:

async def test_proxy(proxy):
    try:
        async with aiohttp.ClientSession() as session:
            try:
                async with session.get(TEST_URL, proxy=f'http://{proxy}', timeout=15, ssl=False) as response:
                    if response.status == 200:
                        return True
            except:
                return False
    except:
        return False

The code comparison shows that IPProxyPool uses synchronous requests for proxy validation, while ProxyPool utilizes asynchronous HTTP requests, potentially offering better performance for large-scale proxy testing.
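
As an illustration of the asynchronous approach, the following is a minimal sketch (not code from either project) that validates a batch of proxies concurrently with aiohttp; the test URL, timeout, and proxy addresses are placeholders.

import asyncio
import aiohttp

TEST_URL = 'http://httpbin.org/get'  # placeholder test endpoint

async def test_proxy(session, proxy):
    """Return (proxy, ok) after one test request through the proxy."""
    try:
        async with session.get(TEST_URL, proxy=f'http://{proxy}',
                               timeout=aiohttp.ClientTimeout(total=15),
                               ssl=False) as response:
            return proxy, response.status == 200
    except Exception:
        return proxy, False

async def test_batch(proxies):
    async with aiohttp.ClientSession() as session:
        # all requests are issued concurrently instead of one at a time
        return await asyncio.gather(*(test_proxy(session, p) for p in proxies))

if __name__ == '__main__':
    print(asyncio.run(test_batch(['1.1.1.1:80', '2.2.2.2:8080'])))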

README

ProxyPool

A simple and efficient proxy pool that provides the following features:

  • Periodically crawls free proxy sites; simple and easy to extend.
  • Stores proxies in Redis and ranks them by availability.
  • Periodically tests and filters proxies, discarding unusable ones and keeping the usable ones.
  • Provides a proxy API that returns a random proxy that has passed testing.

An explanation of how the proxy pool works can be found in 「如何搭建一个高效的代理池」 ("How to Build an Efficient Proxy Pool"); reading it before using this project is recommended.

Before You Start

This proxy pool is built on various publicly available proxy sources, so its availability is not high: you may find only one or two usable proxies among hundreds or thousands, which makes it unsuitable for direct use in crawling tasks.

If your goal is to finish a crawling task with proxies as quickly as possible, consider using paid proxies or existing proxy resources instead. If your goal is to learn how to build a proxy pool, you can follow this project through the remaining steps.

Recommended paid proxies:

Preparation

First, clone the repository and enter the ProxyPool folder:

git clone https://github.com/Python3WebSpider/ProxyPool.git
cd ProxyPool

Then run it using either of the two approaches below: Docker or the conventional way.

Requirements

The proxy pool can be run in two ways: with Docker (recommended) or the conventional way. The requirements are as follows:

Docker

If you use Docker, you will need the following installed:

  • Docker
  • Docker-Compose

Installation instructions can easily be found online.

Official Docker Hub image: germey/proxypool

Conventional Method

The conventional method requires a Python environment and a Redis environment; specifically:

  • Python>=3.6
  • Redis

Running with Docker

Once Docker and Docker-Compose are installed, a single command is all it takes:

docker-compose up

The output looks something like this:

redis        | 1:M 19 Feb 2020 17:09:43.940 * DB loaded from disk: 0.000 seconds
redis        | 1:M 19 Feb 2020 17:09:43.940 * Ready to accept connections
proxypool    | 2020-02-19 17:09:44,200 CRIT Supervisor is running as root.  Privileges were not dropped because no user is specified in the config file.  If you intend to run as root, you can set user=root in the config file to avoid this message.
proxypool    | 2020-02-19 17:09:44,203 INFO supervisord started with pid 1
proxypool    | 2020-02-19 17:09:45,209 INFO spawned: 'getter' with pid 10
proxypool    | 2020-02-19 17:09:45,212 INFO spawned: 'server' with pid 11
proxypool    | 2020-02-19 17:09:45,216 INFO spawned: 'tester' with pid 12
proxypool    | 2020-02-19 17:09:46,596 INFO success: getter entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
proxypool    | 2020-02-19 17:09:46,596 INFO success: server entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
proxypool    | 2020-02-19 17:09:46,596 INFO success: tester entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

As you can see, Redis, the Getter, the Server, and the Tester have all started successfully.

Now visit http://localhost:5555/random to get a random usable proxy.

If downloads are especially slow, you can edit the Dockerfile and change:

- RUN pip install -r requirements.txt
+ RUN pip install -r requirements.txt -i https://pypi.douban.com/simple

Running the Conventional Way

If you are not using Docker, you can also run the pool after setting up the Python and Redis environments. The steps are as follows.

Installing and Configuring Redis

A locally installed Redis, a Redis started via Docker, or a remote Redis all work, as long as it can be connected to and used normally.

First you need to set a few environment variables; the proxy pool reads these values from the environment.

There are two ways to set the Redis environment variables: either set host, port, and password separately, or set a single connection string. The two methods are shown below.

To set host, port, and password (if the password is empty, set it to an empty string):

export PROXYPOOL_REDIS_HOST='localhost'
export PROXYPOOL_REDIS_PORT=6379
export PROXYPOOL_REDIS_PASSWORD=''
export PROXYPOOL_REDIS_DB=0

Or set only the connection string:

export PROXYPOOL_REDIS_CONNECTION_STRING='redis://localhost'

The connection string must follow the format redis://[:password@]host[:port][/database]. The bracketed parts may be omitted; port defaults to 6379, database defaults to 0, and the password defaults to empty.
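
For example, a connection string with a (hypothetical) password, a non-default port, and database 1 would look like this:

export PROXYPOOL_REDIS_CONNECTION_STRING='redis://:somepassword@localhost:6380/1'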

Use whichever of the two methods you prefer.

Installing Dependencies

It is strongly recommended to create a virtual environment with Conda or virtualenv, using Python 3.6 or later.

Then install the dependencies with pip:

pip3 install -r requirements.txt

Running the Proxy Pool

There are two ways to run the proxy pool: run the Tester, Getter, and Server all together, or run them separately as needed.

In general you can simply run everything:

python3 run.py

This starts the Tester, Getter, and Server; you can then visit http://localhost:5555/random to get a random usable proxy.

Or, once you understand the proxy pool's architecture, you can run the components separately as needed:

python3 run.py --processor getter
python3 run.py --processor tester
python3 run.py --processor server

Here the processor argument specifies whether to run the Tester, the Getter, or the Server.

Usage

Once it is running, you can get a random usable proxy from http://localhost:5555/random.

You can also integrate it programmatically; the example below shows how to fetch a proxy and crawl a page with it:

import requests

proxypool_url = 'http://127.0.0.1:5555/random'
target_url = 'http://httpbin.org/get'

def get_random_proxy():
    """
    get random proxy from proxypool
    :return: proxy
    """
    return requests.get(proxypool_url).text.strip()

def crawl(url, proxy):
    """
    use proxy to crawl page
    :param url: page url
    :param proxy: proxy, such as 8.8.8.8:8888
    :return: html
    """
    proxies = {'http': 'http://' + proxy}
    return requests.get(url, proxies=proxies).text


def main():
    """
    main method, entry point
    :return: none
    """
    proxy = get_random_proxy()
    print('get random proxy', proxy)
    html = crawl(target_url, proxy)
    print(html)

if __name__ == '__main__':
    main()

The output looks like this:

get random proxy 116.196.115.209:8080
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0",
    "X-Amzn-Trace-Id": "Root=1-5e4d7140-662d9053c0a2e513c7278364"
  },
  "origin": "116.196.115.209",
  "url": "https://httpbin.org/get"
}

As you can see, a proxy was obtained successfully, and a request to httpbin.org confirmed that the proxy works.

Configuration Options

Several parameters of the proxy pool can be configured via environment variables.

Switches

  • ENABLE_TESTER: whether the Tester is allowed to start, default true
  • ENABLE_GETTER: whether the Getter is allowed to start, default true
  • ENABLE_SERVER: whether the Server is allowed to start, default true

Environment

  • APP_ENV: runtime environment; can be dev, test, or prod (development, test, production), default dev
  • APP_DEBUG: debug mode, true or false, default true
  • APP_PROD_METHOD: how the app is served in production, default gevent; alternatives are tornado and meinheld (which require installing the tornado or meinheld module respectively)

Redis Connection

  • PROXYPOOL_REDIS_HOST / REDIS_HOST: Redis host; PROXYPOOL_REDIS_HOST overrides the value of REDIS_HOST.
  • PROXYPOOL_REDIS_PORT / REDIS_PORT: Redis port; PROXYPOOL_REDIS_PORT overrides the value of REDIS_PORT.
  • PROXYPOOL_REDIS_PASSWORD / REDIS_PASSWORD: Redis password; PROXYPOOL_REDIS_PASSWORD overrides the value of REDIS_PASSWORD.
  • PROXYPOOL_REDIS_DB / REDIS_DB: Redis database index, e.g. 0 or 1; PROXYPOOL_REDIS_DB overrides the value of REDIS_DB.
  • PROXYPOOL_REDIS_CONNECTION_STRING / REDIS_CONNECTION_STRING: Redis connection string; PROXYPOOL_REDIS_CONNECTION_STRING overrides the value of REDIS_CONNECTION_STRING.
  • PROXYPOOL_REDIS_KEY / REDIS_KEY: name of the Redis key under which proxies are stored; PROXYPOOL_REDIS_KEY overrides the value of REDIS_KEY.
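
The override rule can be pictured with a small sketch; this is only an illustration of the behaviour described above, not the project's actual configuration code:

import os

def read_redis_setting(name, default=None):
    # the PROXYPOOL_-prefixed variable wins whenever both are set
    return os.environ.get(f'PROXYPOOL_{name}', os.environ.get(name, default))

# defaults here are placeholders for illustration only
REDIS_HOST = read_redis_setting('REDIS_HOST', 'localhost')
REDIS_PORT = int(read_redis_setting('REDIS_PORT', 6379))
REDIS_KEY = read_redis_setting('REDIS_KEY', 'proxies')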

Processors

  • CYCLE_TESTER: Tester run cycle, i.e. how often testing runs, default 20 seconds
  • CYCLE_GETTER: Getter run cycle, i.e. how often proxy fetching runs, default 100 seconds
  • TEST_URL: test URL, defaults to Baidu
  • TEST_TIMEOUT: test timeout, default 10 seconds
  • TEST_BATCH: number of proxies tested per batch, default 20
  • TEST_VALID_STATUS: status codes treated as valid during testing
  • API_HOST: host the proxy Server listens on, default 0.0.0.0
  • API_PORT: port the proxy Server listens on, default 5555
  • API_THREADED: whether the proxy Server runs multi-threaded, default true

Logging

  • LOG_DIR: relative path for log files
  • LOG_RUNTIME_FILE: runtime log file name
  • LOG_ERROR_FILE: error log file name
  • LOG_ROTATION: log rotation period or size, default 500MB, see loguru - rotation
  • LOG_RETENTION: log retention period, default 7 days, see loguru - retention
  • ENABLE_LOG_FILE: whether to write log files, default true; if set to false, neither ENABLE_LOG_RUNTIME_FILE nor ENABLE_LOG_ERROR_FILE takes effect
  • ENABLE_LOG_RUNTIME_FILE: whether to write the runtime log file, default true
  • ENABLE_LOG_ERROR_FILE: whether to write the error log file, default true

All of the options above can be set via environment variables before running. For example, to change the test URL and the Redis key name:

export TEST_URL=http://weibo.cn
export REDIS_KEY=proxies:weibo

This builds a proxy pool dedicated to Weibo: every valid proxy in it can be used to crawl Weibo.

If you start the proxy pool with Docker-Compose, specify the environment variables in the docker-compose.yml file, for example:

version: "3"
services:
  redis:
    image: redis:alpine
    container_name: redis
    command: redis-server
    ports:
      - "6379:6379"
    restart: always
  proxypool:
    build: .
    image: "germey/proxypool"
    container_name: proxypool
    ports:
      - "5555:5555"
    restart: always
    environment:
      REDIS_HOST: redis
      TEST_URL: http://weibo.cn
      REDIS_KEY: proxies:weibo

Extending the Proxy Crawlers

The proxy crawlers all live in the proxypool/crawlers folder; at the moment only a limited number of proxy sources are integrated.

To add a crawler, simply create a new Python file in the crawlers folder and declare a class in it.

The convention looks like this:

from pyquery import PyQuery as pq
from proxypool.schemas.proxy import Proxy
from proxypool.crawlers.base import BaseCrawler

BASE_URL = 'http://www.664ip.cn/{page}.html'
MAX_PAGE = 5

class Daili66Crawler(BaseCrawler):
    """
    daili66 crawler, http://www.66ip.cn/1.html
    """
    urls = [BASE_URL.format(page=page) for page in range(1, MAX_PAGE + 1)]

    def parse(self, html):
        """
        parse html file to get proxies
        :return:
        """
        doc = pq(html)
        trs = doc('.containerbox table tr:gt(0)').items()
        for tr in trs:
            host = tr.find('td:nth-child(1)').text()
            port = int(tr.find('td:nth-child(2)').text())
            yield Proxy(host=host, port=port)

You only need to define a crawler class that inherits from BaseCrawler, then define the urls variable and the parse method.

  • The urls variable is the list of proxy-site URLs to crawl; it can be generated programmatically or written out literally.
  • The parse method takes a single argument, html, the HTML of the proxy page. Inside parse you only need to parse that HTML, extract the host and port, build a Proxy object, and yield it.

You do not need to implement the page fetching itself; BaseCrawler already provides a default implementation. To change how pages are fetched, override the crawl method.
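
As a rough illustration, here is a sketch of a crawler that overrides crawl to fetch pages itself, for example to send custom headers; the source URL and page format are hypothetical, and it assumes (as the default implementation does) that crawl yields the Proxy objects produced by parse.

import requests

from proxypool.schemas.proxy import Proxy
from proxypool.crawlers.base import BaseCrawler

BASE_URL = 'http://example.com/free-proxies.txt'  # hypothetical plain-text proxy list


class ExampleTextCrawler(BaseCrawler):
    """hypothetical crawler that fetches pages itself instead of using the default crawl"""
    urls = [BASE_URL]

    def crawl(self):
        for url in self.urls:
            # custom fetching, e.g. with a specific User-Agent and timeout
            html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10).text
            yield from self.parse(html)

    def parse(self, html):
        # assume one "host:port" pair per line (hypothetical page format)
        for line in html.splitlines():
            line = line.strip()
            if line and ':' in line:
                host, port = line.split(':', maxsplit=1)
                yield Proxy(host=host, port=int(port))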

Pull Requests contributing new crawlers are very welcome; they make the pool's proxy sources richer and more robust.

Deployment

This project provides Kubernetes deployment scripts; to deploy to Kubernetes, please refer to kubernetes.

If you are interested in developing this project together, feel free to leave a comment in an Issue. Thank you very much!

LICENSE

MIT