This article explains what the spider_Worker node in a Python distributed crawler is. The walkthrough is fairly detailed and should be a useful reference; if you are interested, do read it to the end!
To turn the multi-threaded version into a distributed crawler, the main tool is the cross-platform BaseManager class from multiprocessing.managers. Its job here is to register the task_queue and result_queue as functions and expose them over the network: the Master node listens on a port, the Worker child nodes connect to it, and different hosts can then share and synchronize the two queues through those registered functions. The Master is responsible for handing out tasks and collecting results; each Worker pulls tasks from the task queue, runs them, and sends what it scrapes back to the Master, which stores the results in the database.
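Stripped of the crawling logic, the handshake between the two sides boils down to the pattern below (a minimal Python 2 sketch; the MasterManager/WorkerManager class names and the example URL are placeholders of mine, and the two sides would normally live in separate scripts on separate hosts):

#coding:utf-8
from multiprocessing.managers import BaseManager
from Queue import Queue

# --- Master side: owns the real queue and exposes it under a registered name ---
task_queue = Queue()

class MasterManager(BaseManager):
    pass

MasterManager.register('get_task_queue', callable=lambda: task_queue)
server = MasterManager(address=('127.0.0.1', 500), authkey='sir')
server.start()                         # spawn the server and listen on the port

# --- Worker side: registers the same name, but with no callable ---
class WorkerManager(BaseManager):
    pass

WorkerManager.register('get_task_queue')
client = WorkerManager(address=('127.0.0.1', 500), authkey='sir')
client.connect()                       # connect to the master instead of listening
task = client.get_task_queue()         # a proxy for the queue living in the server
task.put('https://example.com')        # travels over the network to the master

# note: even the master must go through a proxy -- the server process holds its
# own copy of the queue, so reading task_queue directly would miss the item
print server.get_task_queue().qsize()  # prints 1
server.shutdown()

Everything else in the two scripts below is ordinary crawling and bookkeeping around this pattern.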
The spider_Worker node mainly calls a spider() function to process each task, much as in the earlier versions; every link a child node scrapes is passed straight back to the Master. Also note that only one instance of the Master script may run, whereas any number of Worker nodes can run at the same time, working through the task queue in parallel.
spider_Master.py
#coding:utf-8
from multiprocessing.managers import BaseManager
from Queue import Queue
import time
import MySQLdb

page = 2
word = 'inurl:login.action'
output = 'test.txt'
page = (page + 1) * 10          # number of result entries to page through, 10 per page
host = '127.0.0.1'
port = 500
urls = []

class Master():
    def __init__(self):
        # the server must create the two shared queues itself; the worker side does not
        self.task_queue = Queue()
        self.result_queue = Queue()

    def start(self):
        # register get_task_queue/get_result_queue on the network, i.e. expose the two
        # queues; the worker side registers the same names without the callable argument
        BaseManager.register('get_task_queue', callable=lambda: self.task_queue)
        BaseManager.register('get_result_queue', callable=lambda: self.result_queue)
        manager = BaseManager(address=(host, port), authkey='sir')
        manager.start()         # the master calls start() to listen on the port; the worker calls connect()
        # both master and worker must fetch the queues through the manager,
        # not use the two queues created locally above
        task = manager.get_task_queue()
        result = manager.get_result_queue()
        print 'put task'
        for i in range(0, page, 10):
            target = 'https://www.baidu.com/s?wd=%s&pn=%s' % (word, i)
            print 'put task %s' % target
            task.put(target)
        print 'try get result'
        while True:
            try:
                url = result.get(True, 5)    # allow a longer timeout when collecting results
                print url
                urls.append(url)
            except:
                break
        manager.shutdown()

if __name__ == '__main__':
    start = time.time()
    server = Master()
    server.start()
    print 'crawled %s records in total' % len(urls)
    print time.time() - start
    with open(output, 'a') as f:
        for url in urls:        # each entry is a (page, url, title) tuple
            f.write(url[1] + '\n')
    conn = MySQLdb.connect('localhost', 'root', 'root', 'Struct', charset='utf8')
    cursor = conn.cursor()
    for record in urls:
        sql = "insert into s045 values('%s','%s','%s')" % (record[0], record[1], str(record[2]))
        cursor.execute(sql)
    conn.commit()
    conn.close()
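One caveat on the storage step: building the INSERT statement with % string formatting will break as soon as a scraped title contains a quote, and it leaves the table open to SQL injection. A safer variant (a sketch, assuming the same three-column s045 table) hands the values to MySQLdb and lets the driver escape them:

# parameterized insert: the driver escapes every value itself
sql = "insert into s045 values(%s, %s, %s)"
cursor.executemany(sql, urls)   # urls is the list of (page, url, title) tuples
conn.commit()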
spider_Worker.py
#coding:utf-8
import re
import time
import requests
from multiprocessing.managers import BaseManager
from bs4 import BeautifulSoup as bs

host = '127.0.0.1'
port = 500

class Worker():
    def __init__(self):
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}

    def spider(self, target, result):
        # recover the result-page number from the pn= parameter of the search URL
        pn = int(target.split('=')[-1]) / 10 + 1
        html = requests.get(target, headers=self.headers)
        soup = bs(html.text, "lxml")
        res = soup.find_all(name="a", attrs={'class': 'c-showurl'})
        for r in res:
            try:
                h = requests.get(r['href'], headers=self.headers, timeout=3)
                if h.status_code == 200:
                    url = h.url          # follow the redirect to the real URL
                    time.sleep(1)
                    title = re.findall(r'<title>(.*?)</title>', h.content)[0]
                    title = title.decode('utf-8')
                    print 'send spider url:', url
                    result.put((pn, url, title))
                else:
                    continue
            except:
                continue

    def start(self):
        # register the same names as the master, but without callable:
        # the queues are looked up on the master over the network
        BaseManager.register('get_task_queue')
        BaseManager.register('get_result_queue')
        print 'Connect to server %s' % host
        m = BaseManager(address=(host, port), authkey='sir')
        m.connect()          # the worker connects instead of listening
        task = m.get_task_queue()
        result = m.get_result_queue()
        print 'try get queue'
        while True:
            try:
                target = task.get(True, 1)
                print 'run pages %s' % target
                self.spider(target, result)
            except:
                break

if __name__ == '__main__':
    w = Worker()
    w.start()
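To run the whole thing, start spider_Master.py once, then launch spider_Worker.py in as many terminals as you like (or on other machines, after pointing host at the master); every worker drains the same task queue and feeds the same result queue.

Both scripts are written for Python 2. If you need them under Python 3, the changes are mostly mechanical; here is a sketch of the differences (the pymysql swap is an assumption of mine, and any DB-API MySQL driver will do):

# Python 3 equivalents of the Python 2 constructs used above
from queue import Queue                # was: from Queue import Queue
from multiprocessing.managers import BaseManager
# import pymysql as MySQLdb            # the original MySQL-python package is Python 2 only

m = BaseManager(address=('127.0.0.1', 500), authkey=b'sir')  # authkey must be bytes
m.connect()
print('connected')                     # print is a function now
pn = int('20') // 10 + 1               # integer division is spelled // in Python 3
# titles: match against h.text (already str) and drop the .decode('utf-8') call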