Top Related Projects
Python ProxyPool for web spider
:sparkling_heart: Highly available distributed IP proxy pool, powered by Scrapy and Redis
Python IP proxy tool using Scrapy crawling: scrapes large numbers of free proxy IPs and extracts the usable ones
IPProxyPool proxy pool project, providing proxy IPs
Quick Overview
ProxyPool is an open-source Python project that provides a proxy IP pool with automatic crawling, verification, and API support. It aims to offer a reliable source of proxy IPs for web scraping and other network-related tasks, with features like customizable crawling, proxy validation, and easy integration through a RESTful API.
Pros
- Automatic proxy crawling and validation, reducing manual effort
- RESTful API for easy integration with other applications
- Customizable proxy sources and validation criteria
- Support for multiple database backends (Redis, MongoDB)
Cons
- Limited documentation, especially for advanced configurations
- Potential legal and ethical concerns when using proxies without permission
- May require frequent updates to maintain effectiveness as proxy sources change
- Performance can vary depending on the quality of crawled proxies
Code Examples
- Fetching a random proxy:
import requests
proxy = requests.get("http://localhost:5555/random").text
print(f"Random proxy: {proxy}")
- Getting a proxy count:
import requests
count = requests.get("http://localhost:5555/count").text
print(f"Total proxies: {count}")
- Checking if a specific IP is available:
import requests
ip = "1.1.1.1"
result = requests.get(f"http://localhost:5555/{ip}").text
print(f"Is {ip} available: {result}")
Getting Started
- Clone the repository:
git clone https://github.com/Python3WebSpider/ProxyPool.git
- Install dependencies:
cd ProxyPool
pip install -r requirements.txt
- Configure the settings in proxypool/setting.py (e.g., database connection, proxy sources)
- Run the proxy pool:
python3 run.py
- Access the API at http://localhost:5555
Competitor Comparisons
Python ProxyPool for web spider
Pros of proxy_pool
- More active development with recent updates and contributions
- Includes a web API for easy integration with other applications
- Supports multiple database backends (Redis, MongoDB, SQLite)
Cons of proxy_pool
- Less comprehensive documentation compared to ProxyPool
- Fewer built-in proxy sources out of the box
- Slightly more complex setup process
Code Comparison
ProxyPool:
class Crawler(object):
    def crawl(self):
        print('Crawler is working')
        proxy_count = 0
        for callback_label in range(self.crawler_func.__len__()):
            callback = self.crawler_func[callback_label]
            proxies = callback()
            for proxy in proxies:
                proxy = proxy.strip()
                if proxy and self.db.add(proxy):
                    proxy_count += 1
        return proxy_count
proxy_pool:
class ProxyFetcher(object):
    def run(self):
        print('ProxyFetcher is working')
        proxy_count = 0
        for callback_label in self.fetcher_func:
            callback = getattr(self, callback_label)
            proxies = callback()
            for proxy in proxies:
                if proxy.strip():
                    self.db.put(proxy)
                    proxy_count += 1
        return proxy_count
Both projects implement similar functionality for crawling and fetching proxies, but proxy_pool uses getattr() to look up callback methods by name. ProxyPool's list-based implementation is more straightforward, while proxy_pool's offers more flexibility in method naming and organization.
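To make the difference concrete, here is a toy illustration of the getattr-based dispatch (hypothetical class and method names, not code from either project):
class Fetcher:
    # proxy_pool style: store method *names* and resolve them at runtime
    fetcher_func = ['fetch_source_a', 'fetch_source_b']

    def fetch_source_a(self):
        return ['1.2.3.4:8080']

    def fetch_source_b(self):
        return ['5.6.7.8:3128']

    def run_all(self):
        proxies = []
        for name in self.fetcher_func:
            callback = getattr(self, name)  # resolve the method by name
            proxies.extend(callback())
        return proxies

print(Fetcher().run_all())  # ['1.2.3.4:8080', '5.6.7.8:3128']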
:sparkling_heart: Highly available distributed IP proxy pool, powered by Scrapy and Redis
Pros of haipproxy
- More advanced proxy acquisition methods, including crawling from multiple sources and supporting ADSL dial-up
- Better scalability with distributed architecture using Scrapy and Redis
- More comprehensive proxy validation and scoring system
Cons of haipproxy
- More complex setup and configuration due to its distributed nature
- Steeper learning curve for users unfamiliar with Scrapy and Redis
- Potentially higher resource requirements for running the full system
Code Comparison
ProxyPool:
def crawl_xicidaili():
    for i in range(1, 3):
        start_url = 'http://www.xicidaili.com/nn/{}'.format(i)
        html = get_page(start_url)
        ip_addresses = re.findall(r'<td>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})</td>', html)
        port_numbers = re.findall(r'<td>(\d+)</td>', html)
        for address, port in zip(ip_addresses, port_numbers):
            yield ':'.join([address, port])
haipproxy:
class XiciSpider(RedisSpider):
    def parse(self, response):
        for sel in response.xpath('//table[@id="ip_list"]/tr[position()>1]'):
            ip = sel.xpath('./td[2]/text()').extract_first()
            port = sel.xpath('./td[3]/text()').extract_first()
            yield Proxy(host=ip, port=port)
The code comparison shows that haipproxy uses Scrapy's more structured approach for crawling, while ProxyPool uses a simpler custom crawling method.
Python IP proxy tool using Scrapy crawling: scrapes large numbers of free proxy IPs and extracts the usable ones
Pros of IPProxyTool
- Supports multiple proxy sources, including free and paid services
- Includes a web interface for easy management and visualization of proxy data
- Offers more detailed proxy information, such as response time and anonymity level
Cons of IPProxyTool
- Less actively maintained, with fewer recent updates
- More complex setup process compared to ProxyPool
- Limited documentation, which may make it harder for new users to get started
Code Comparison
IPProxyTool:
class Validator(object):
    def __init__(self):
        self.detect_from = "http://httpbin.org/get"
        self.timeout = 10
        self.valid_proxies = []
        self.invalid_proxies = []
ProxyPool:
class Tester(object):
    def __init__(self):
        self.redis = RedisClient()
        self.proxy_tester = ProxyTester()

    def run(self):
        print('Tester is working')
        try:
            proxies = self.redis.all()
            loop = asyncio.get_event_loop()
            for i in range(0, len(proxies), BATCH_TEST_SIZE):
                test_proxies = proxies[i:i + BATCH_TEST_SIZE]
                tasks = [self.proxy_tester.test(proxy) for proxy in test_proxies]
                loop.run_until_complete(asyncio.wait(tasks))
        except Exception as e:
            print('Tester error', e.args)
Both projects aim to provide a pool of usable proxies, but they differ in their implementation and features. IPProxyTool offers a more comprehensive set of features, including a web interface and support for multiple proxy sources. However, ProxyPool has a simpler setup process and is more actively maintained, making it potentially more suitable for users who prioritize ease of use and ongoing support.
IPProxyPool代理池项目,提供代理ip
Pros of IPProxyPool
- More comprehensive proxy validation process, including checks for anonymity levels and support for HTTPS
- Includes a web interface for easy management and visualization of proxy data
- Supports multiple database backends (Redis, MongoDB, MySQL) for storing proxy information
Cons of IPProxyPool
- Less actively maintained, with fewer recent updates compared to ProxyPool
- More complex setup process due to additional dependencies and database requirements
- Limited documentation, especially for non-Chinese speakers
Code Comparison
IPProxyPool:
def validUsefulProxy(proxy):
    if isinstance(proxy, bytes):
        proxy = proxy.decode('utf8')
    proxies = {"http": "http://{proxy}".format(proxy=proxy)}
    try:
        r = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=30, verify=False)
        if r.status_code == 200 and r.json().get('origin'):
            return True
    except:
        return False
ProxyPool:
async def test_proxy(proxy):
    try:
        async with aiohttp.ClientSession() as session:
            try:
                async with session.get(TEST_URL, proxy=f'http://{proxy}', timeout=15, ssl=False) as response:
                    if response.status == 200:
                        return True
            except:
                return False
    except:
        return False
The code comparison shows that IPProxyPool uses synchronous requests for proxy validation, while ProxyPool uses asynchronous HTTP requests, which can be substantially faster when testing proxies at scale.
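To illustrate why the asynchronous approach scales better, here is a minimal, self-contained sketch of concurrent batch validation with asyncio and aiohttp (the proxy list, test URL, and timeout are placeholder choices, not values from either project):
import asyncio
import aiohttp

TEST_URL = 'http://httpbin.org/ip'


async def check(session, proxy):
    try:
        async with session.get(TEST_URL, proxy=f'http://{proxy}',
                               timeout=aiohttp.ClientTimeout(total=15)) as resp:
            return proxy, resp.status == 200
    except Exception:
        return proxy, False


async def check_batch(proxies):
    async with aiohttp.ClientSession() as session:
        # All checks run concurrently; total wall time is roughly one timeout,
        # not one timeout per proxy as in the synchronous version
        return await asyncio.gather(*(check(session, p) for p in proxies))


if __name__ == '__main__':
    print(asyncio.run(check_batch(['1.2.3.4:8080', '5.6.7.8:3128'])))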
README
ProxyPool
A simple and efficient proxy pool that provides the following features:
- Periodically crawls free proxy sites; simple and extensible.
- Stores proxies in Redis and ranks them by availability.
- Periodically tests and filters proxies, removing unusable ones and keeping the usable ones.
- Provides a proxy API that serves a random proxy that has passed testing.
For an explanation of how the proxy pool works, see "How to Build an Efficient Proxy Pool"; reading it before use is recommended.
Notes Before Use
This proxy pool is built on various publicly available proxy sources, so availability is not high; you may well find only one or two usable proxies among hundreds or thousands, which makes it unsuitable for use directly in crawling tasks.
If your goal is to complete a crawling task as quickly as possible, consider subscribing to a paid proxy service or using existing proxy resources instead; if your goal is to learn how to build a proxy pool, you can follow this project through the remaining steps.
Recommended paid proxies:
- ADSL dial-up proxies: massive dial-up pool, inside mainland China, high-quality proxies
- Overseas/global proxies: high-quality proxies outside mainland China
- Cellular 4G/5G proxies: highest quality, inside mainland China, anti-risk-control proxies
Preparation
First, clone the repository and enter the ProxyPool folder:
git clone https://github.com/Python3WebSpider/ProxyPool.git
cd ProxyPool
Then run it with either of the two methods below: Docker or the conventional way.
Requirements
The proxy pool can be run in two ways: with Docker (recommended) or conventionally. The requirements are as follows:
Docker
If using Docker, the following must be installed:
- Docker
- Docker-Compose
Search for installation instructions as needed.
Official Docker Hub image: germey/proxypool
Conventional Method
The conventional method requires a Python environment and a Redis environment; specifically:
- Python>=3.6
- Redis
Running with Docker
If Docker and Docker-Compose are installed, a single command is enough to run the pool.
docker-compose up
The output looks similar to this:
redis | 1:M 19 Feb 2020 17:09:43.940 * DB loaded from disk: 0.000 seconds
redis | 1:M 19 Feb 2020 17:09:43.940 * Ready to accept connections
proxypool | 2020-02-19 17:09:44,200 CRIT Supervisor is running as root. Privileges were not dropped because no user is specified in the config file. If you intend to run as root, you can set user=root in the config file to avoid this message.
proxypool | 2020-02-19 17:09:44,203 INFO supervisord started with pid 1
proxypool | 2020-02-19 17:09:45,209 INFO spawned: 'getter' with pid 10
proxypool | 2020-02-19 17:09:45,212 INFO spawned: 'server' with pid 11
proxypool | 2020-02-19 17:09:45,216 INFO spawned: 'tester' with pid 12
proxypool | 2020-02-19 17:09:46,596 INFO success: getter entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
proxypool | 2020-02-19 17:09:46,596 INFO success: server entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
proxypool | 2020-02-19 17:09:46,596 INFO success: tester entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
You can see that Redis, Getter, Server, and Tester have all started successfully.
Now visit http://localhost:5555/random to get a random usable proxy.
If downloads are particularly slow, you can point pip at a mirror by modifying the Dockerfile:
- RUN pip install -r requirements.txt
+ RUN pip install -r requirements.txt -i https://pypi.douban.com/simple
Running Conventionally
If you don't run it with Docker, the pool can also run once the Python and Redis environments are configured; the steps follow.
Installing and Configuring Redis
A local Redis installation, Redis started via Docker, or a remote Redis instance all work, as long as it can be connected to and used normally.
First, you may need to set a few environment variables; the proxy pool reads these values from the environment.
There are two ways to set the Redis environment variables: one is to set host, port, and password separately; the other is to set a single connection string. The two methods are as follows:
Set host, port, and password (if the password is empty, set it to an empty string):
export PROXYPOOL_REDIS_HOST='localhost'
export PROXYPOOL_REDIS_PORT=6379
export PROXYPOOL_REDIS_PASSWORD=''
export PROXYPOOL_REDIS_DB=0
Or set only the connection string:
export PROXYPOOL_REDIS_CONNECTION_STRING='redis://localhost'
The connection string must follow the format redis://[:password@]host[:port][/database]; the bracketed parts may be omitted. port defaults to 6379, database defaults to 0, and the password defaults to empty.
Either of the two settings above will do.
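Before moving on, you can verify that the values are picked up correctly; here is a quick connectivity check (a sketch using the redis-py package; the env var names mirror those above, and the fallback defaults are our assumptions):
import os
import redis  # the redis-py package

conn_string = os.getenv('PROXYPOOL_REDIS_CONNECTION_STRING')
if conn_string:
    client = redis.Redis.from_url(conn_string)
else:
    client = redis.Redis(
        host=os.getenv('PROXYPOOL_REDIS_HOST', 'localhost'),
        port=int(os.getenv('PROXYPOOL_REDIS_PORT', '6379')),
        password=os.getenv('PROXYPOOL_REDIS_PASSWORD') or None,
        db=int(os.getenv('PROXYPOOL_REDIS_DB', '0')),
    )
print(client.ping())  # True means Redis is reachable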
Installing Dependencies
Using Conda or virtualenv to create a virtual environment is strongly recommended; the Python version must be no lower than 3.6.
Then install the dependencies with pip:
pip3 install -r requirements.txt
Running the Proxy Pool
There are two ways to run the proxy pool: run the Tester, Getter, and Server all together, or run them individually as needed.
In general you can run everything:
python3 run.py
This starts the Tester, Getter, and Server; you can then visit http://localhost:5555/random to get a random usable proxy.
Alternatively, once you understand the proxy pool's architecture, you can run the components individually as needed:
python3 run.py --processor getter
python3 run.py --processor tester
python3 run.py --processor server
Here the processor argument specifies whether to run the Tester, Getter, or Server.
Usage
Once it is running, you can get a random usable proxy via http://localhost:5555/random.
This can be integrated programmatically; the following example shows fetching a proxy and using it to crawl a page:
import requests

proxypool_url = 'http://127.0.0.1:5555/random'
target_url = 'http://httpbin.org/get'


def get_random_proxy():
    """
    get random proxy from proxypool
    :return: proxy
    """
    return requests.get(proxypool_url).text.strip()


def crawl(url, proxy):
    """
    use proxy to crawl page
    :param url: page url
    :param proxy: proxy, such as 8.8.8.8:8888
    :return: html
    """
    proxies = {'http': 'http://' + proxy}
    return requests.get(url, proxies=proxies).text


def main():
    """
    main method, entry point
    :return: none
    """
    proxy = get_random_proxy()
    print('get random proxy', proxy)
    html = crawl(target_url, proxy)
    print(html)


if __name__ == '__main__':
    main()
The output looks like this:
get random proxy 116.196.115.209:8080
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0",
    "X-Amzn-Trace-Id": "Root=1-5e4d7140-662d9053c0a2e513c7278364"
  },
  "origin": "116.196.115.209",
  "url": "https://httpbin.org/get"
}
You can see that a proxy was successfully obtained, and the request to httpbin.org verified the proxy's availability.
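Because free proxies fail often (see the notes before use), a retry wrapper helps in practice. The following is a sketch built on top of the example above; the retry count and timeout are arbitrary choices, not project defaults:
import requests

PROXYPOOL_URL = 'http://127.0.0.1:5555/random'


def crawl_with_retries(url, max_retries=5):
    for _ in range(max_retries):
        proxy = requests.get(PROXYPOOL_URL).text.strip()
        try:
            # A timeout keeps a dead proxy from blocking the whole run
            return requests.get(url, proxies={'http': f'http://{proxy}'}, timeout=10).text
        except requests.RequestException:
            continue  # this proxy failed; fetch a fresh one and try again
    raise RuntimeError(f'no working proxy after {max_retries} attempts')


if __name__ == '__main__':
    print(crawl_with_retries('http://httpbin.org/get'))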
Configuration Options
The proxy pool can be configured by setting environment variables.
Switches
- ENABLE_TESTER: whether to start the Tester, default true
- ENABLE_GETTER: whether to start the Getter, default true
- ENABLE_SERVER: whether to start the Server, default true
Environment
- APP_ENV: runtime environment; may be dev, test, or prod (development, testing, or production), default dev
- APP_DEBUG: debug mode; may be true or false, default true
- APP_PROD_METHOD: how the app is served in production, default gevent; alternatives are tornado and meinheld (which require installing the tornado or meinheld module, respectively)
Redis Connection
- PROXYPOOL_REDIS_HOST / REDIS_HOST: Redis host; PROXYPOOL_REDIS_HOST overrides the value of REDIS_HOST.
- PROXYPOOL_REDIS_PORT / REDIS_PORT: Redis port; PROXYPOOL_REDIS_PORT overrides the value of REDIS_PORT.
- PROXYPOOL_REDIS_PASSWORD / REDIS_PASSWORD: Redis password; PROXYPOOL_REDIS_PASSWORD overrides the value of REDIS_PASSWORD.
- PROXYPOOL_REDIS_DB / REDIS_DB: Redis database index, e.g. 0 or 1; PROXYPOOL_REDIS_DB overrides the value of REDIS_DB.
- PROXYPOOL_REDIS_CONNECTION_STRING / REDIS_CONNECTION_STRING: Redis connection string; PROXYPOOL_REDIS_CONNECTION_STRING overrides the value of REDIS_CONNECTION_STRING.
- PROXYPOOL_REDIS_KEY / REDIS_KEY: name of the Redis key under which proxies are stored; PROXYPOOL_REDIS_KEY overrides the value of REDIS_KEY.
Processors
- CYCLE_TESTER: Tester run cycle, i.e. how often a test round runs, default 20 seconds
- CYCLE_GETTER: Getter run cycle, i.e. how often proxy fetching runs, default 100 seconds
- TEST_URL: test URL, defaults to Baidu
- TEST_TIMEOUT: test timeout, default 10 seconds
- TEST_BATCH: batch test size, default 20 proxies
- TEST_VALID_STATUS: HTTP status codes treated as valid during testing
- API_HOST: host the proxy Server binds to, default 0.0.0.0
- API_PORT: port the proxy Server runs on, default 5555
- API_THREADED: whether the proxy Server uses multiple threads, default true
Logging
- LOG_DIR: relative path for log files
- LOG_RUNTIME_FILE: runtime log file name
- LOG_ERROR_FILE: error log file name
- LOG_ROTATION: log rotation period or maximum size, default 500MB; see loguru - rotation
- LOG_RETENTION: log retention period, default 7 days; see loguru - retention
- ENABLE_LOG_FILE: whether to write log files at all, default true; if set to false, neither the runtime log file nor the error log file is generated
- ENABLE_LOG_RUNTIME_FILE: whether to write the runtime log file, default true
- ENABLE_LOG_ERROR_FILE: whether to write the error log file, default true
All of the above can be configured with environment variables, i.e. set the corresponding variables before running. For example, to change the test URL and the Redis key:
export TEST_URL=http://weibo.cn
export REDIS_KEY=proxies:weibo
This builds a proxy pool dedicated to Weibo; all of its valid proxies can crawl Weibo.
If you start the pool with Docker-Compose, specify the environment variables in the docker-compose.yml file, e.g.:
version: "3"
services:
  redis:
    image: redis:alpine
    container_name: redis
    command: redis-server
    ports:
      - "6379:6379"
    restart: always
  proxypool:
    build: .
    image: "germey/proxypool"
    container_name: proxypool
    ports:
      - "5555:5555"
    restart: always
    environment:
      REDIS_HOST: redis
      TEST_URL: http://weibo.cn
      REDIS_KEY: proxies:weibo
Extending the Proxy Crawlers
The proxy crawlers live in the proxypool/crawlers folder; currently only a limited number of proxy sources are integrated.
To add a crawler, simply create a new Python file in the crawlers folder and declare a class in it.
The convention is as follows:
from pyquery import PyQuery as pq
from proxypool.schemas.proxy import Proxy
from proxypool.crawlers.base import BaseCrawler

BASE_URL = 'http://www.664ip.cn/{page}.html'
MAX_PAGE = 5


class Daili66Crawler(BaseCrawler):
    """
    daili66 crawler, http://www.66ip.cn/1.html
    """
    urls = [BASE_URL.format(page=page) for page in range(1, MAX_PAGE + 1)]

    def parse(self, html):
        """
        parse html file to get proxies
        :return:
        """
        doc = pq(html)
        trs = doc('.containerbox table tr:gt(0)').items()
        for tr in trs:
            host = tr.find('td:nth-child(1)').text()
            port = int(tr.find('td:nth-child(2)').text())
            yield Proxy(host=host, port=port)
Here you only need to define a crawler that inherits from BaseCrawler, then define the urls variable and the parse method.
- The urls variable is the list of proxy site URLs to crawl; it can be built programmatically or written out as fixed content.
- The parse method receives a single argument, html, the HTML of the proxy page; inside parse you only need to parse the HTML, extract the host and port, build a Proxy object, and yield it.
Fetching the pages themselves does not need to be implemented, as BaseCrawler already provides a default implementation; to change how pages are fetched, override the crawl method, as in the sketch below.
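For example, a source that serves JSON rather than HTML has nothing for parse to do, so overriding crawl directly can be simpler. This is a hedged sketch, assuming (as the bundled crawlers suggest) that crawl is a generator yielding Proxy objects; the URL and JSON field names are hypothetical:
import requests
from proxypool.schemas.proxy import Proxy
from proxypool.crawlers.base import BaseCrawler


class JsonApiCrawler(BaseCrawler):
    """
    hypothetical crawler for a source that returns JSON such as
    [{"host": "1.2.3.4", "port": 8080}, ...]
    """
    urls = ['http://example.com/api/proxies']  # placeholder URL

    def crawl(self):
        # Bypass the HTML fetch/parse pipeline and read the JSON directly
        for url in self.urls:
            data = requests.get(url, timeout=10).json()
            for item in data:
                yield Proxy(host=item['host'], port=int(item['port']))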
Pull Requests contributing crawlers are very welcome, to make the proxy sources richer and more powerful.
Deployment
This project provides Kubernetes deployment scripts; to deploy to Kubernetes, please refer to kubernetes.
If you are interested in developing this project together, feel free to leave a comment in an Issue. Many thanks!
LICENSE
MIT