Single thread + multi-task asynchronous coroutines

Coroutine object: when a function (a "special function") is defined with async, calling it does not run the body immediately; it returns a coroutine object instead.
Task object: a higher-level wrapper around a coroutine object (a further layer of encapsulation), built from the special function.
A task object must be registered with an event loop object.
A callback can be bound to a task object, which is useful for the data-parsing step of a crawler.
Event loop object: think of it as a container that holds task objects.
When the event loop is started, the task objects stored in it execute asynchronously.
Inside a coroutine, swap blocking calls for their awaitable counterparts:
time.sleep -> asyncio.sleep
requests -> aiohttp
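The first point above, that calling an async-decorated function only produces a coroutine object rather than running its body, can be seen directly in a minimal sketch (the `fetch` name here is just for illustration):

```python
import asyncio

async def fetch():
    return 42

c = fetch()               # nothing has run yet; c is just a coroutine object
print(type(c).__name__)   # coroutine
print(asyncio.run(c))     # the event loop drives the body and prints 42
```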
import asyncio
import time

start_time = time.time()

async def get_request(url):
    await asyncio.sleep(2)
    print(url, 'download complete!')

urls = [
    'www.1.com',
    'www.2.com',
]

task_lst = []  # list of task objects
for url in urls:
    c = get_request(url)             # coroutine object
    task = asyncio.ensure_future(c)  # task object
    # task.add_done_callback(...)    # bind a callback here if needed
    task_lst.append(task)

loop = asyncio.get_event_loop()      # event loop object
loop.run_until_complete(asyncio.wait(task_lst))  # register the tasks and suspend until they finish
print('total time:', time.time() - start_time)
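On Python 3.7+, the same pattern can be written more compactly with asyncio.run and asyncio.gather instead of managing the loop by hand; a sketch equivalent to the loop-based version above:

```python
import asyncio
import time

async def get_request(url):
    await asyncio.sleep(2)  # non-blocking stand-in for network I/O
    return f'{url} done'

async def main():
    urls = ['www.1.com', 'www.2.com']
    # gather schedules all coroutines concurrently and returns results in order
    return await asyncio.gather(*(get_request(u) for u in urls))

start = time.time()
results = asyncio.run(main())
elapsed = time.time() - start
print(results)
print('elapsed:', round(elapsed, 1))  # about 2.0, not 4.0
```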
Thread pool + requests module
# Thread pool
import time
from multiprocessing.dummy import Pool  # thread-based Pool

start_time = time.time()

url_list = [
    'www.1.com',
    'www.2.com',
    'www.3.com',
]

def get_request(url):
    print('downloading...', url)
    time.sleep(2)  # simulated blocking download
    print('download complete!', url)

pool = Pool(3)  # pool of 3 worker threads
pool.map(get_request, url_list)
print('total time:', time.time() - start_time)
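The standard library's concurrent.futures offers the same pattern without the multiprocessing.dummy import; a sketch of the equivalent using ThreadPoolExecutor:

```python
import time
from concurrent.futures import ThreadPoolExecutor

url_list = ['www.1.com', 'www.2.com', 'www.3.com']

def get_request(url):
    time.sleep(2)  # simulated blocking download
    return f'{url} downloaded'

start = time.time()
with ThreadPoolExecutor(max_workers=3) as pool:
    # map blocks until all workers finish and yields results in input order
    results = list(pool.map(get_request, url_list))
elapsed = time.time() - start
print(results)
print('total time:', round(elapsed, 1))  # about 2.0 with 3 workers
```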
Two approaches to improving crawler efficiency
Spin up a Flask server:
from flask import Flask
import time

app = Flask(__name__)

@app.route('/bobo')
def index_bobo():
    time.sleep(2)
    return 'hello bobo!'

@app.route('/jay')
def index_jay():
    time.sleep(2)
    return 'hello jay!'

@app.route('/tom')
def index_tom():
    time.sleep(2)
    return 'hello tom!'

if __name__ == '__main__':
    app.run(threaded=True)  # threaded=True lets the dev server handle concurrent requests
aiohttp module + single-thread multi-task asynchronous coroutines
import asyncio
import aiohttp
import time

start = time.time()

async def get_page(url):
    # A blocking call like requests.get(url=url).text here would
    # serialize the coroutines; use aiohttp instead:
    async with aiohttp.ClientSession() as s:  # create a session object
        async with await s.get(url=url) as response:
            page_text = await response.text()
            print(page_text)
            return page_text

urls = [
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/tom',
]

tasks = []
for url in urls:
    c = get_page(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

end = time.time()
print(end - start)

# Executes asynchronously:
# hello tom!
# hello bobo!
# hello jay!
# 2.0311079025268555
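A real crawler usually also caps how many requests are in flight at once, so it does not hammer the target server; asyncio.Semaphore is one way to do that. A minimal sketch of the idea, using asyncio.sleep as a stand-in for the aiohttp GET so it runs without a server:

```python
import asyncio
import time

async def fetch(url, sem):
    async with sem:             # waits here if the cap is already reached
        await asyncio.sleep(1)  # stand-in for an aiohttp GET
        return url

async def main():
    sem = asyncio.Semaphore(2)  # at most 2 "requests" in flight at once
    urls = [f'www.{i}.com' for i in range(4)]
    return await asyncio.gather(*(fetch(u, sem) for u in urls))

start = time.time()
results = asyncio.run(main())
elapsed = time.time() - start
print(results)
print('elapsed:', round(elapsed))  # 4 tasks, 2 at a time: about 2s
```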
'''
aiohttp: single thread + multi-task asynchronous coroutines,
parsing the data with xpath
'''
import aiohttp
import asyncio
from lxml import etree
import time

start = time.time()

# Special function: sends the request and captures the data.
# Note the async with / await keywords.
async def get_request(url):
    async with aiohttp.ClientSession() as s:
        async with await s.get(url=url) as response:
            page_text = await response.text()
            return page_text  # return the page source

# Callback function: parses the data once the task finishes.
def parse(task):
    page_text = task.result()
    tree = etree.HTML(page_text)
    msg = tree.xpath('/html/body/ul//text()')
    print(msg)

urls = [
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/tom',
]

tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)
    task.add_done_callback(parse)  # bind the callback function
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

end = time.time()
print(end - start)
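Note that the Flask routes shown earlier return bare strings, so the xpath '/html/body/ul//text()' in parse() would come back empty against them; the server in this demo presumably returned HTML containing a ul. A hypothetical response illustrating what the callback extracts (the HTML below is an assumption, not from the original server):

```python
from lxml import etree

# hypothetical HTML the server would need to return for the xpath to match
page_text = '''
<html><body>
  <ul>
    <li>item one</li>
    <li>item two</li>
  </ul>
</body></html>
'''

tree = etree.HTML(page_text)
msg = tree.xpath('/html/body/ul//text()')  # all text nodes under the ul
print([t.strip() for t in msg if t.strip()])
```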
requests module + thread pool
import time
import requests
from multiprocessing.dummy import Pool

start = time.time()

urls = [
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/tom',
]

def get_request(url):
    page_text = requests.get(url=url).text
    print(page_text)
    return page_text

pool = Pool(3)
pool.map(get_request, urls)

end = time.time()
print('total time:', end - start)

# The requests run concurrently:
# hello jay!
# hello bobo!
# hello tom!
# total time: 2.0467123985290527
Summary

Two ways to speed up a crawler:
aiohttp module + single-thread multi-task asynchronous coroutines
requests module + thread pool

Request-sending modules covered: requests, urllib, aiohttp
That concludes this detailed look at how to improve crawler efficiency in Python; for more material on the topic, see the other related articles on 億速云.