jhao104/proxy_pool

Python ProxyPool for web spider

22,577 stars · 5,315 forks · 301 open issues

Top Related Projects

  • haipproxy: :sparkling_heart: Highly available distributed IP proxy pool, powered by Scrapy and Redis
  • ProxyPool: An Efficient ProxyPool with Getter, Tester and Server (4,012 stars)
  • Scylla: Intelligent proxy pool for Humans™ to extract content from the internet and build your own Large Language Models in this new AI era (1,895 stars)
  • ipsum: Daily feed of bad IPs (with blacklist hit scores)

Quick Overview

Proxy_pool is an open-source Python project that provides a simple proxy IP pool. It automatically collects free proxy IPs from the internet, validates them, and offers an API for retrieving usable proxy IPs. The project aims to simplify the process of obtaining and managing proxy IPs for various applications.

Pros

  • Automatic proxy collection and validation
  • Easy-to-use API for retrieving proxy IPs
  • Supports multiple proxy sources and protocols (HTTP, HTTPS, SOCKS4/5)
  • Configurable and extensible architecture

Cons

  • Reliability of free proxies can be inconsistent
  • Limited documentation, especially for advanced usage
  • Potential legal and ethical concerns when using proxy IPs without permission
  • May require frequent maintenance to keep proxy sources up-to-date

Code Examples

  1. Retrieving a random proxy:

import requests

proxy = requests.get("http://127.0.0.1:5010/get/").json()
print(f"Random proxy: {proxy}")

  2. Retrieving a proxy for a specific protocol:

import requests

https_proxy = requests.get("http://127.0.0.1:5010/get/?type=https").json()
print(f"HTTPS proxy: {https_proxy}")

  3. Reporting an invalid proxy so it is removed from the pool:

import requests

requests.get("http://127.0.0.1:5010/delete/?proxy=1.1.1.1:8080")

Getting Started

  1. Clone the repository:

    git clone https://github.com/jhao104/proxy_pool.git
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Modify the setting.py file to configure proxy sources and other settings.

  4. Run the proxy pool:

    python proxyPool.py schedule
    python proxyPool.py server
    
  5. Access the API at http://127.0.0.1:5010 to retrieve proxy IPs (see the usage sketch below).
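
The API returns proxy records as JSON; to actually send traffic through one, pass it to requests via the proxies argument. A minimal sketch, assuming the default server address and that the response carries the proxy in a "proxy" field, as in the spider example later on this page:

import requests

# Fetch one proxy record from the local proxy_pool API.
record = requests.get("http://127.0.0.1:5010/get/").json()
proxy = record.get("proxy")  # e.g. "1.2.3.4:8080"

# Route a request through it; the target URL is only a placeholder.
resp = requests.get(
    "http://www.example.com",
    proxies={"http": "http://{}".format(proxy)},
    timeout=10,
)
print(resp.status_code)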

Competitor Comparisons

haipproxy: :sparkling_heart: Highly available distributed IP proxy pool, powered by Scrapy and Redis

Pros of haipproxy

  • More advanced proxy validation and scoring system
  • Supports multiple proxy sources and protocols (HTTP, HTTPS, Socks4/5)
  • Includes a web interface for easier management

Cons of haipproxy

  • More complex setup and configuration
  • Requires additional dependencies (e.g., Scrapy, Redis)
  • Less frequently updated compared to proxy_pool

Code Comparison

proxy_pool:

def check_proxy(proxy):
    url = "http://www.baidu.com/get_ip.php"
    try:
        r = requests.get(url, proxies={"http": "http://" + proxy}, timeout=10, verify=False)
        if r.status_code == 200:
            return True
    except:
        return False

haipproxy:

def validate_proxy(proxy):
    start = time.time()
    try:
        r = requests.get(self.target_url, proxies={"http": proxy, "https": proxy},
                         timeout=self.timeout, verify=False)
        if r.ok:
            speed = time.time() - start
            return True, speed
    except:
        return False, None

The code comparison shows that haipproxy includes a more sophisticated proxy validation process, measuring response time and supporting both HTTP and HTTPS protocols. proxy_pool's implementation is simpler but less comprehensive.
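
For reference, here is a self-contained sketch of a timing-aware check in the same spirit as the snippets above; the function name, target URL, and timeout are placeholders, not code taken from either project:

import time
import requests

def validate_with_timing(proxy, target_url="http://httpbin.org/ip", timeout=10):
    """Check a 'host:port' proxy and report how long the request took."""
    proxies = {"http": "http://" + proxy, "https": "http://" + proxy}
    start = time.time()
    try:
        r = requests.get(target_url, proxies=proxies, timeout=timeout)
        if r.ok:
            return True, time.time() - start
    except requests.RequestException:
        pass
    return False, None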

An Efficient ProxyPool with Getter, Tester and Server

Pros of ProxyPool

  • More comprehensive documentation, including detailed setup instructions and API usage examples
  • Supports multiple proxy sources and validation methods out of the box
  • Includes a web interface for easy management and monitoring of the proxy pool

Cons of ProxyPool

  • Slightly more complex setup process due to additional dependencies
  • May have higher resource usage due to more extensive features

Code Comparison

proxy_pool:

def check_proxy(proxy):
    url = "http://www.baidu.com/get_ip.php"
    try:
        r = requests.get(url, proxies={"http": "http://" + proxy}, timeout=10, verify=False)
        if r.status_code == 200:
            return True
    except:
        return False

ProxyPool:

def check_proxy(proxy):
    try:
        resp = requests.get(self.test_url, proxies={
            'http': 'http://' + proxy,
            'https': 'https://' + proxy
        }, timeout=self.timeout, verify=False)
        if resp.status_code == 200:
            return True
    except (ProxyError, ConnectTimeout, SSLError, ReadTimeout):
        return False

Both projects aim to provide a pool of usable proxies, but ProxyPool offers a more feature-rich solution with better documentation. However, this comes at the cost of a slightly more complex setup and potentially higher resource usage. The code comparison shows that ProxyPool's proxy checking function is more comprehensive, handling both HTTP and HTTPS proxies and catching specific exceptions.
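
As a hedged illustration of that narrower error handling, the check below catches only the requests exceptions that typically indicate a dead or misbehaving proxy; the function name, test URL, and timeout are assumptions rather than values from either project:

import requests
from requests.exceptions import ProxyError, ConnectTimeout, ReadTimeout, SSLError

def check_proxy_strict(proxy, test_url="http://httpbin.org/ip", timeout=10):
    """Return True only when the proxy answers the test URL with HTTP 200."""
    proxies = {"http": "http://" + proxy, "https": "http://" + proxy}
    try:
        return requests.get(test_url, proxies=proxies, timeout=timeout).status_code == 200
    except (ProxyError, ConnectTimeout, ReadTimeout, SSLError):
        # Failure modes expected from an unusable proxy; anything else propagates.
        return False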


Scylla: Intelligent proxy pool for Humans™ to extract content from the internet and build your own Large Language Models in this new AI era

Pros of Scylla

  • Written in Rust, offering better performance and memory safety
  • Supports both IPv4 and IPv6 proxies
  • Provides a RESTful API for easier integration

Cons of Scylla

  • Less actively maintained (last update over 2 years ago)
  • Fewer built-in proxy sources compared to proxy_pool
  • Limited documentation and community support

Code Comparison

proxy_pool (Python):

def get_proxy():
    return self.db.pop()

def delete_proxy(proxy):
    self.db.delete(proxy)

Scylla (Rust):

pub fn get_proxy(&self) -> Option<Proxy> {
    self.proxies.pop_front()
}

pub fn delete_proxy(&mut self, proxy: &Proxy) {
    self.proxies.retain(|p| p != proxy);
}

Both projects aim to provide a pool of proxies, but they differ in implementation language and features. proxy_pool is written in Python and offers a wider range of proxy sources, making it more flexible for various use cases. It also has more recent updates and a larger community.

Scylla, on the other hand, leverages Rust's performance benefits and provides a RESTful API, which can be advantageous for certain applications. However, its development seems to have slowed down, potentially limiting its long-term viability.

The code comparison shows similar basic functionality for retrieving and deleting proxies, with Scylla's implementation benefiting from Rust's strong typing and memory safety features.


ipsum: Daily feed of bad IPs (with blacklist hit scores)

Pros of ipsum

  • Focuses on IP blocklists for security purposes, providing a more specialized tool
  • Regularly updated with new malicious IP addresses from various sources
  • Lightweight and easy to integrate into existing security systems

Cons of ipsum

  • Limited to IP blocklists, lacking the proxy pool functionality
  • May require additional tools or scripts for implementation in certain use cases
  • Less versatile compared to proxy_pool's broader proxy management features

Code comparison

ipsum:

#!/usr/bin/env python

import re
import socket
import struct
import sys

def addr_to_int(value):
    # Convert a dotted-quad IPv4 address to its unsigned integer form.
    return struct.unpack("!I", socket.inet_aton(value))[0]

proxy_pool:

class ProxyPool(object):
    def __init__(self):
        self.pool = set()

    def add(self, proxy):
        self.pool.add(proxy)

    def remove(self, proxy):
        self.pool.discard(proxy)

The code snippets show that ipsum focuses on IP address manipulation, while proxy_pool manages a set of proxy addresses. This reflects their different purposes: ipsum for IP blocklists and proxy_pool for proxy management.
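
The two ideas can be combined: a pooled proxy's host address can be converted to an integer and looked up against an ipsum-style blocklist before the proxy is used. A small hypothetical sketch (the blocklist contents and the proxy value are made up for illustration):

import socket
import struct

def addr_to_int(value):
    # Same conversion ipsum uses: dotted-quad IPv4 string -> unsigned int.
    return struct.unpack("!I", socket.inet_aton(value))[0]

# Hypothetical blocklist of integer-encoded bad IPs (e.g. loaded from an ipsum feed).
blocklist = {addr_to_int("203.0.113.7")}

proxy = "203.0.113.7:8080"  # a proxy taken from the pool, made up for this example
host = proxy.split(":")[0]
if addr_to_int(host) in blocklist:
    print("proxy host is on the blocklist, skipping it")
else:
    print("proxy host looks clean")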


README

ProxyPool: Web Spider Proxy IP Pool


______                        ______             _
| ___ \_                      | ___ \           | |
| |_/ / \__ __   __  _ __   _ | |_/ /___   ___  | |
|  __/|  _// _ \ \ \/ /| | | ||  __// _ \ / _ \ | |
| |   | | | (_) | >  < \ |_| || |  | (_) | (_) || |___
\_|   |_|  \___/ /_/\_\ \__  |\_|   \___/ \___/ \_____\
                       __ / /
                      /___ /

ProxyPool

A proxy IP pool for web crawlers. Its main functions are to periodically collect free proxies published online, validate them and store them, and to periodically re-validate the stored proxies to keep the pool usable. It can be used through both an API and a CLI. You can also extend the proxy sources to improve the quality and quantity of the pool's IPs.

  • Documentation: document

  • Supported versions:

  • Demo: http://demo.spiderpy.cn (please don't hammer it, thanks)

  • Paid proxy recommendation: luminati-china. BrightData (formerly Luminati) is regarded as the leader of the proxy market, covering 72 million IPs worldwide, most of them real residential IPs, with a very high success rate. Several paid plans are available; if you need high-quality proxy IPs, register, contact their Chinese-speaking support, and apply for a free trial. There is currently a 50% discount promotion. (If you are unsure how to use it, see this tutorial.)

Running the Project

Download the code:
  • git clone
git clone git@github.com:jhao104/proxy_pool.git
  • releases
Download the corresponding zip file from https://github.com/jhao104/proxy_pool/releases
Install dependencies:
pip install -r requirements.txt
Update the configuration:
# setting.py is the project configuration file

# API service settings

HOST = "0.0.0.0"               # listen address
PORT = 5010                    # listen port


# Database settings

DB_CONN = 'redis://:pwd@127.0.0.1:8888/0'


# ProxyFetcher settings

PROXY_FETCHER = [
    "freeProxy01",      # names of the enabled fetch methods; all fetch methods live in fetcher/proxyFetcher.py
    "freeProxy02",
    # ....
]

Start the project:

# Once the requirements above are met, the project can be started via proxyPool.py.
# It consists of two parts: the schedule process and the server (API) process.

# start the scheduler
python proxyPool.py schedule

# start the web API service
python proxyPool.py server

Docker Image

docker pull jhao104/proxy_pool

docker run --env DB_CONN=redis://:password@ip:port/0 -p 5010:5010 jhao104/proxy_pool:latest

docker-compose

Run in the project directory:

docker-compose up -d

Usage

  • API

After the web service starts, with the default configuration the API is served at http://127.0.0.1:5010 (a short example of calling it follows the table):

| api | method | description | params |
| --- | --- | --- | --- |
| / | GET | API introduction | None |
| /get | GET | get a random proxy | optional: ?type=https to filter proxies that support HTTPS |
| /pop | GET | get and delete a proxy | optional: ?type=https to filter proxies that support HTTPS |
| /all | GET | get all proxies | optional: ?type=https to filter proxies that support HTTPS |
| /count | GET | view the number of proxies | None |
| /delete | GET | delete a proxy | ?proxy=host:ip |
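
A minimal sketch of calling a few of these endpoints from Python, assuming the default address and JSON responses as in the /get example below (the /get and /delete calls themselves are shown in the spider example that follows):

import requests

BASE = "http://127.0.0.1:5010"

print(requests.get(BASE + "/count/").text)              # number of proxies in the pool
print(requests.get(BASE + "/all/?type=https").json())   # every proxy that supports HTTPS
print(requests.get(BASE + "/pop/").json())              # take one proxy and remove it from the pool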
  • Use in a spider

  If you want to use it in spider code, you can wrap this API in helper functions, for example:

import requests

def get_proxy():
    return requests.get("http://127.0.0.1:5010/get/").json()

def delete_proxy(proxy):
    requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))

# your spider code

def getHtml():
    # ....
    retry_count = 5
    proxy = get_proxy().get("proxy")
    while retry_count > 0:
        try:
            html = requests.get('http://www.example.com', proxies={"http": "http://{}".format(proxy)})
            # the request went through the proxy
            return html
        except Exception:
            retry_count -= 1
    # remove the proxy from the pool
    delete_proxy(proxy)
    return None

Extending Proxy Sources

  The project ships with several free proxy sources by default, but free proxies are of limited quality, so running it as-is may not yield ideal proxies. For that reason, an extension mechanism for fetching proxies is provided.

  To add a new proxy source:

  • 1. First, add a custom static method for fetching proxies to the ProxyFetcher class. The method must be a generator that yields proxies in host:ip format, for example:

class ProxyFetcher(object):
    # ....

    # custom proxy source fetch method
    @staticmethod
    def freeProxyCustom1():  # any name that does not clash with an existing method

        # fetch proxies from a website, an API, or a database
        # suppose you already have a list of proxies
        proxies = ["x.x.x.x:3128", "x.x.x.x:80"]
        for proxy in proxies:
            yield proxy
        # make sure every proxy is yielded in the correct host:ip format
  • 2. After adding the method, update the PROXY_FETCHER entry in setting.py:

  Add the name of your custom method under PROXY_FETCHER:

PROXY_FETCHER = [
    "freeProxy01",
    "freeProxy02",
    # ....
    "freeProxyCustom1"  # make sure this matches the name of the method you added
]

  The schedule process fetches proxies at regular intervals; on its next run it will automatically detect and call the method you defined.
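
  Before wiring a new source into the scheduler, it can help to run the generator by hand and check that it yields host:ip strings. A quick hypothetical check, assuming the fetcher/proxyFetcher.py module layout mentioned in the setting.py comment above and the freeProxyCustom1 example method:

# standalone sanity check for the custom fetcher (illustrative only)
from fetcher.proxyFetcher import ProxyFetcher

for proxy in ProxyFetcher.freeProxyCustom1():
    host, _, port = proxy.partition(":")
    assert host and port.isdigit(), "expected host:port, got %r" % proxy
    print(proxy)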

Free Proxy Sources

The free proxy sites currently collected are listed below (in no particular order; the table only describes the free proxies they publish; for paid proxy reviews see here):

| Proxy site | Status | Update speed | Availability | Link | Code |
| --- | --- | --- | --- | --- | --- |
| 站大爷 | ✔ | ★ | ** | link | freeProxy01 |
| 66代理 | ✔ | ★ | * | link | freeProxy02 |
| 开心代理 | ✔ | ★ | * | link | freeProxy03 |
| FreeProxyList | ✔ | ★ | * | link | freeProxy04 |
| 快代理 | ✔ | ★ | * | link | freeProxy05 |
| 冰凌代理 | ✔ | ★★★ | * | link | freeProxy06 |
| 云代理 | ✔ | ★ | * | link | freeProxy07 |
| 小幻代理 | ✔ | ★★ | * | link | freeProxy08 |
| 免费代理库 | ✔ | ☆ | * | link | freeProxy09 |
| 89代理 | ✔ | ☆ | * | link | freeProxy10 |
| 稻壳代理 | ✔ | ★★ | *** | link | freeProxy11 |

If you know of other good free proxy sites, please submit them in the issues; support will be considered in a future update.

Feedback

  Feel free to report any problems in Issues; you can also leave a comment on my blog.

  Your feedback makes this project better.

Contributing

  This project is intended only as a basic, general-purpose proxy pool architecture; it does not accept feature-specific additions (particularly good ideas excepted, of course).

  The project is still far from perfect. If you find a bug or want to add a feature, please describe it in Issues and I will do my best to improve it.

  Many thanks to the following contributors for their generous work:

  @kangnwh | @bobobo80 | @halleywj | @newlyedward | @wang-ye | @gladmo | @bernieyangmh | @PythonYXY | @zuijiawoniu | @netAir | @scil | @tangrela | @highroom | @luocaodan | @vc5 | @1again | @obaiyan | @zsbh | @jiannanya | @Jerry12228

Release Notes

changelog

Featured on HelloGitHub