
awolfly9/IPProxyTool

python ip proxy tool scrapy crawl. Scrapes a large number of free proxy IPs and extracts the valid ones for use.


Top Related Projects

Python ProxyPool for web spider

Lists of HTTP, SOCKS4, SOCKS5 proxies with geolocation info. Updated every hour.

Get PROXY List that gets updated everyday


Daily feed of bad IPs (with blacklist hit scores)

Quick Overview

IPProxyTool is an open-source project designed to collect and verify free proxy IP addresses from various sources. It provides a tool for gathering, testing, and managing proxy IPs, which can be useful for web scraping, anonymity, or bypassing geographical restrictions.

Pros

  • Automatically collects and verifies proxy IPs from multiple sources
  • Supports both HTTP and HTTPS proxies
  • Includes a built-in web interface for easy management of proxy IPs
  • Allows customization of proxy sources and verification methods

Cons

  • Last significant update in 2020, so parts may be outdated
  • Limited documentation, especially for non-Chinese speakers
  • May require additional configuration for optimal performance
  • Reliability of free proxy IPs can be inconsistent

Code Examples

The snippets below are illustrative; the module paths, class names, and method signatures are assumptions and may not match the repository's actual layout.

Initialize a proxy crawler:

from ipproxytool.spiders.proxy.xicidaili import XiCiDaiLiSpider

spider = XiCiDaiLiSpider()
spider.start()

Verify a proxy IP:

from ipproxytool.spiders.validator.httpbin import HttpBinSpider

validator = HttpBinSpider()
is_valid = validator.validate_proxy('127.0.0.1', '8080')
print(f"Proxy is valid: {is_valid}")

Retrieve valid proxies from the database:

from ipproxytool.db.mysql import MySQLManager

db = MySQLManager()
valid_proxies = db.get_valid_proxy(count=10)
print(f"Valid proxies: {valid_proxies}")

Getting Started

  1. Clone the repository:

    git clone https://github.com/awolfly9/IPProxyTool.git
    cd IPProxyTool
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Configure the database settings in config.py

  4. Run the main script:

    python ipproxytool.py
    
  5. Access the web interface at http://localhost:8000 to manage and view collected proxies.
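Once proxies are collected, they can be fed to an HTTP client such as requests. A minimal sketch of building a requests-style proxies mapping from an (ip, port) pair; the address below is a placeholder, not a live proxy:

```python
# Build a requests-style proxies mapping from an (ip, port) pair,
# e.g. one returned by the server's select endpoint.
def to_proxies(ip, port):
    """Return a dict usable as requests.get(..., proxies=...)."""
    address = "http://{}:{}".format(ip, port)
    return {"http": address, "https": address}

proxies = to_proxies("123.45.67.89", "8080")
print(proxies["http"])  # http://123.45.67.89:8080
# A real request through the proxy would then look like:
# requests.get("http://httpbin.org/ip", proxies=proxies, timeout=10)
```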

Competitor Comparisons

Python ProxyPool for web spider

Pros of proxy_pool

  • More active development with recent updates
  • Supports multiple database backends (Redis, MongoDB)
  • Includes a RESTful API for easy integration

Cons of proxy_pool

  • Less detailed documentation compared to IPProxyTool
  • Primarily focused on Chinese proxy sources

Code Comparison

IPProxyTool:

class Validator(object):
    def __init__(self, proxies):
        self.proxies = proxies
        self.timeout = 10
        self.threads = 20

proxy_pool:

class ProxyCheck(object):
    def __init__(self):
        self.raw_proxy_queue = Queue()
        self.thread_list = list()
        self.useful_proxy_queue = Queue()

Both projects use object-oriented programming for their core functionality. IPProxyTool's Validator class focuses on validating proxies, while proxy_pool's ProxyCheck class manages proxy queues and threading.

proxy_pool offers more flexibility with database options and includes an API, making it easier to integrate into existing projects. However, IPProxyTool provides more detailed documentation, which can be beneficial for users new to proxy management.

The choice between these tools depends on specific requirements, such as the need for an API, database preferences, and the importance of documentation quality.

Lists of HTTP, SOCKS4, SOCKS5 proxies with geolocation info. Updated every hour.

Pros of proxy-list

  • Regularly updated proxy lists in multiple formats (TXT, JSON)
  • Simple and straightforward to use, with no additional setup required
  • Supports various proxy protocols (HTTP, HTTPS, SOCKS4, SOCKS5)

Cons of proxy-list

  • Lacks proxy validation or testing functionality
  • No built-in tools for proxy scraping or management
  • Limited customization options for users

Code comparison

proxy-list:

# No specific code to show, as it's primarily a collection of proxy lists

IPProxyTool:

from config import *
from sql.sql import SqlManager
from ipproxy import IPProxy

ipproxy = IPProxy()
ipproxy.run()

Summary

proxy-list is a straightforward repository that provides regularly updated proxy lists in various formats, making it easy for users to access and use proxies without additional setup. However, it lacks advanced features like proxy validation or management tools.

IPProxyTool, on the other hand, offers a more comprehensive solution for proxy management, including scraping, validation, and database storage. It requires more setup but provides greater flexibility and control over the proxy collection and validation process.

Choose proxy-list for quick access to proxy lists or IPProxyTool for a more robust proxy management solution.

Get PROXY List that gets updated everyday

Pros of PROXY-List

  • Larger collection of proxy servers, updated more frequently
  • Simpler to use, with ready-to-use proxy lists in various formats
  • Active community and regular contributions

Cons of PROXY-List

  • Less sophisticated proxy validation and testing
  • Lacks advanced features like proxy rotation or API integration
  • No built-in proxy scraping functionality

Code Comparison

PROXY-List typically provides proxy lists in plain text or JSON format:

123.45.67.89:8080
98.76.54.32:3128

IPProxyTool offers more structured output and functionality:

class Proxy(object):
    def __init__(self):
        self.ip = ''
        self.port = ''
        self.country = ''
        self.anonymity = ''
        self.https = ''
        self.speed = ''
        self.source = ''

PROXY-List is more suitable for users who need quick access to a large number of proxies without extensive validation or management features. IPProxyTool is better for those requiring more control over proxy selection, testing, and integration into larger projects.


Daily feed of bad IPs (with blacklist hit scores)

Pros of ipsum

  • Regularly updated with a large list of malicious IP addresses
  • Lightweight and easy to integrate into existing security systems
  • Provides multiple formats for IP lists (plain text, CSV, JSON)

Cons of ipsum

  • Focused solely on malicious IP addresses, not proxy servers
  • Lacks features for testing or verifying IP addresses
  • No built-in proxy management or rotation capabilities

Code comparison

IPProxyTool:

def start(self):
    for p in self.proxy_getter_functions:
        p.start()
    for p in self.proxy_getter_functions:
        p.join()

ipsum:

def update(self):
    for name in self.SOURCES:
        worker = threading.Thread(target=self._retrieve_worker, args=(name,))
        worker.start()
        self._threads.append(worker)

Both projects use threading to retrieve data from multiple sources concurrently. IPProxyTool focuses on gathering and managing proxy servers, while ipsum collects malicious IP addresses from various sources.

IPProxyTool offers more comprehensive proxy management features, including testing and verification. ipsum, on the other hand, provides a simpler solution for maintaining a list of malicious IP addresses, which can be useful for security applications and firewalls.

The choice between these tools depends on the specific use case: IPProxyTool for proxy management and ipsum for maintaining a blocklist of malicious IPs.


README

IPProxyTool

Uses scrapy to crawl proxy websites and collect a large number of free proxy IPs, filters out all usable IPs, and stores them in a database for later use. You can visit my personal site 西瓜 for more of my interesting projects.

Thanks to youngjeff for maintaining this project with me.

Environment

Install Python 3 and a MySQL database.

Dependencies for building the cryptography module:

sudo yum install gcc libffi-devel python-devel openssl-devel
$ pip install -r requirements.txt

Download and Usage

Clone the project locally:

$ git clone https://github.com/awolfly9/IPProxyTool.git

Enter the project directory:

$ cd IPProxyTool

In config.py, set the user and password in database_config to your MySQL username and password:

$ vim config.py
---------------

database_config = {
	'host': 'localhost',
	'port': 3306,
	'user': 'root',
	'password': '123456',
	'charset': 'utf8',
}

MySQL: import the table schema:

$ mysql> create database ipproxy;
Query OK, 1 row affected (0.00 sec)
$ mysql> use ipproxy;
Database changed
$ mysql> source '/your/project/directory/db.sql'

Run the launcher script ipproxytool.py. The crawl, validation, and server scripts can also be run separately; see the project description below.

$ python ipproxytool.py 

An asynchronous validation mode has been added; run it as follows:

$ python ipproxytool.py async

Project Description

Crawling proxy websites

All code for crawling proxy websites lives in proxy

Extending to other proxy websites

1. Create a new script in the proxy directory that inherits from BaseSpider
2. Set name, urls, and headers
3. Override the parse_page method to extract the proxy data
4. Store the data in the database; see ip181 and kuaidaili for examples
5. For especially complex proxy websites, see peuland
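The parse_page method described above typically extracts ip:port pairs from the page HTML. A self-contained sketch of that extraction step, using only the standard library (independent of the project's BaseSpider, which is not reproduced here; real spiders parse site-specific table markup):

```python
import re

# Match "ip:port" occurrences in a proxy-list page. A plain regex over
# the raw HTML is a simplified stand-in for site-specific parsing.
IP_PORT = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):(\d{2,5})")

def parse_page(html):
    """Return a list of (ip, port) string tuples found in the HTML."""
    return IP_PORT.findall(html)

sample = "<td>123.45.67.89:8080</td><td>98.76.54.32:3128</td>"
print(parse_page(sample))  # [('123.45.67.89', '8080'), ('98.76.54.32', '3128')]
```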

Edit run_crawl_proxy.py to import the new crawler and add it to the crawl queue.

run_crawl_proxy.py can also be run on its own to start crawling proxy websites:

$ python run_crawl_proxy.py

Validating proxy IPs

The current validation process:
1. Fetch all the proxy IPs scraped and stored in the previous step from the database
2. Use each proxy IP to send a request to httpbin
3. Judge the proxy's validity, HTTPS support, and anonymity from the response, and store the results in the httpbin table
4. Take proxies from the httpbin table and use them to access a target website, e.g. 豆瓣 (Douban)
5. If the request returns valid data within a reasonable time, the proxy IP is considered valid and is stored in the corresponding table
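The anonymity check in step 3 can be illustrated with httpbin's /get JSON response: if your real IP still reaches the target, the proxy is transparent; if forwarding headers leak through, it is merely anonymous. A hedged sketch (the exact classification rules are an assumption, not the project's code):

```python
def classify_anonymity(real_ip, response):
    """Classify a proxy from a httpbin.org/get JSON response.
    Returns 1 (elite), 2 (anonymous), or 3 (transparent), matching the
    anonymity codes used by the server API. Assumed logic, for illustration."""
    origin = response.get("origin", "")
    headers = response.get("headers", {})
    if real_ip in origin:
        return 3  # transparent: the real IP still reaches the target
    if any(h in headers for h in ("X-Forwarded-For", "Via", "Proxy-Connection")):
        return 2  # anonymous: the proxy reveals that it is a proxy
    return 1      # elite: no trace of the real IP or of proxying

sample = {"origin": "123.45.67.89", "headers": {"Host": "httpbin.org"}}
print(classify_anonymity("98.76.54.32", sample))  # 1
```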

One script per target website; all proxy-validation code lives in validator

Extending validation to other websites

1. Create a new script in the validator directory that inherits from Validator
2. Set name, timeout, urls, and headers
3. Then call the init method; see baidu and douban for examples
4. For especially complex validation, see assetstore

Edit run_validator.py to import the new validator and add it to the validation queue.

run_validator.py can also be run on its own to start validating proxy IPs:

$ python run_validator.py

Server API for proxy IP data

Edit data_port in config.py to change the server port (default 8000), then start the server:

$ python run_server.py

Server endpoints

Select

http://127.0.0.1:8000/select?name=httpbin&anonymity=1&https=yes&order=id&sort=desc&count=100

Parameters

Name        Type    Description                                 Required
name        str     table name                                  yes
anonymity   int     1: elite  2: anonymous  3: transparent      no
https       str     yes for https, no for http                  no
order       str     table field to order by                     no
sort        str     asc ascending, desc descending              no
count       int     number of proxies to return, default 100    no
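The select query string can be assembled with Python's standard library; a sketch using the parameters above (default port 8000 assumed):

```python
from urllib.parse import urlencode

# Build a select URL for the data server using the documented parameters.
params = {
    "name": "httpbin",  # table name
    "anonymity": 1,     # 1: elite
    "https": "yes",
    "order": "id",
    "sort": "desc",
    "count": 100,
}
url = "http://127.0.0.1:8000/select?" + urlencode(params)
print(url)
# http://127.0.0.1:8000/select?name=httpbin&anonymity=1&https=yes&order=id&sort=desc&count=100
```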

Delete

http://127.0.0.1:8000/delete?name=httpbin&ip=27.197.144.181

Parameters

Name    Type    Description         Required
name    str     table name          yes
ip      str     the IP to delete    yes

Insert

http://127.0.0.1:8000/insert?name=httpbin&ip=555.22.22.55&port=335&country=%E4%B8%AD%E5%9B%BD&anonymity=1&https=yes&speed=5&source=100

Parameters

Name        Type    Description                               Required
name        str     table name                                yes
ip          str     IP address                                yes
port        str     port                                      yes
country     str     country                                   no
anonymity   int     1: elite  2: anonymous  3: transparent    no
https       str     yes for https, no for http                no
speed       float   access speed                              no
source      str     IP source                                 no

TODO

References

Project Updates

-----------------------------2020-12-29----------------------------

  1. Fixed previously incorrect path naming
  2. Updated the MySQL table schema

-----------------------------2017-6-23----------------------------
1. python2 -> python3
2. web.py -> flask

-----------------------------2017-5-17----------------------------
1. Added docker support to the system; usage is described below. For more about docker, see the official site http://www.docker.com.

-----------------------------2017-3-30----------------------------
1. Improved and polished the readme
2. Database inserts now support transactions

-----------------------------2017-3-14----------------------------
1. Updated the server API and added sort options
2. Added multiprocess validation of proxy IPs

-----------------------------2017-2-20----------------------------
1. Added more filter options to the server fetch endpoint

-----------------------------2017-2-16----------------------------
1. Validate the anonymity of proxy IPs
2. Validate HTTPS support of proxy IPs
3. Added a concurrency setting for httpbin validation, default 4

Once docker is installed on your system, you can use this program as follows:

Download the program:

git clone https://github.com/awolfly9/IPProxyTool

Then enter the directory:

cd IPProxyTool

Build the image:

docker build -t proxy .

Run the container:

docker run -it proxy

Modify the configuration in config.py according to your needs:

database_config = {
    'host': 'localhost',
    'port': 3306,
    'user': 'root',
    'password': 'root',
    'charset': 'utf8',
}