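This article walks through a case study of crawling Ajax-loaded data with Python. The content is detailed and logically organized. Many readers are probably not yet familiar with this topic, so the article is shared here for reference; hopefully you will get something out of it. Let's take a look.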
Street-snap (街拍) image URL

keyword: 街拍
pd: atlas
dvpf: pc
aid: 4916
page_num: 1
search_json: {"from_search_id":"20220104115420010212192151532E8188","origin_keyword":"街拍","image_keyword":"街拍"}
rawJSON: 1
search_id: 202201041159040101501341671A4749C4
A pattern emerges: page_num starts at 1 and increases by one for each page, while the other parameters stay the same.
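The pattern above can be sketched as follows: only page_num changes between requests, so every page URL can be generated from one shared parameter dict. The helper name build_url and the reduced parameter set are illustrative assumptions; the real request also carries search_json and search_id, as listed in the captured parameters.

```python
from urllib.parse import urlencode, urlparse, parse_qs

BASE_URL = 'https://so.toutiao.com/search?'


def build_url(page_num):
    """Hypothetical helper: build the search URL for one result page."""
    params = {
        'keyword': '街拍',
        'pd': 'atlas',
        'dvpf': 'pc',
        'aid': '4916',
        'page_num': page_num,  # the only value that changes per page
        'rawJSON': 1,
    }
    # urlencode() percent-encodes the non-ASCII keyword for us
    return BASE_URL + urlencode(params)


if __name__ == '__main__':
    for page in range(1, 4):
        print(build_url(page))
```

Parsing a generated URL back with parse_qs is a quick way to confirm that the keyword survives the encoding round trip and that page_num is the only differing field.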
    import requests
    from urllib.parse import urlencode

    # Headers copied from a browser session; the Cookie belongs to the
    # original session and will likely need to be replaced with your own.
    headers = {
        'Host': 'so.toutiao.com',
        #'Referer': 'https://so.toutiao.com/search?keyword=%E8%A1%97%E6%8B%8D&pd=atlas&dvpf=pc&aid=4916&page_num=0&search_json={%22from_search_id%22:%22202112272022060101510440283EE83D67%22,%22origin_keyword%22:%22%E8%A1%97%E6%8B%8D%22,%22image_keyword%22:%22%E8%A1%97%E6%8B%8D%22}',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'Cookie': 'msToken=S0DFBkZ9hmyLOGYd3_QjhhXgrm38qTyOITnkNb0t_oavfbVxuYV1JZ0tT5hLgswSfmZLFD6c2lONm_5TomUQXVXjen7CIxM2AGwbhHRYKjhg; _S_DPR=1.5; _S_IPAD=0; MONITOR_WEB_ID=7046351002275317255; ttwid=1%7C0YdWalNdIiSpIk3CvvHwV25U8drq3QAj08E8QOApXhs%7C1640607595%7C720e971d353416921df127996ed708931b4ae28a0a8691a5466347697e581ce8; _S_WIN_WH=262_623'
    }

    def get_page(page_num):
        params = {
            'keyword': '街拍',
            'pd': 'atlas',
            'dvpf': 'pc',
            'aid': '4916',
            'page_num': page_num,
            # Raw JSON; urlencode() percent-encodes it (a pre-encoded string would be encoded twice)
            'search_json': '{"from_search_id":"202112272022060101510440283EE83D67","origin_keyword":"街拍","image_keyword":"街拍"}',
            'rawJSON': 1,
            'search_id': '2021122721183101015104402851E3883D'
        }
        url = 'https://so.toutiao.com/search?' + urlencode(params)
        print(url)
        try:
            # The query string is already in url, so params is not passed again
            response = requests.get(url, headers=headers)
            if response.status_code == 200:
                return response.json()
        except (requests.ConnectionError, ValueError):
            # ValueError covers a response body that is not valid JSON
            pass
        return None
    def get_images(json):
        # get_page() may return None, and 'rawData' may be absent
        images = ((json or {}).get('rawData') or {}).get('data') or []
        for image in images:
            link = image.get('img_url')
            if link:
                yield link
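To see what get_images() does with a response, the sketch below runs the same extraction on a small hand-written dict. The nesting (rawData, then data, then img_url) mirrors the keys the function reads; the URLs and the extra 'text' entry are made-up placeholders, not real response data.

```python
def get_images(json):
    """Yield each img_url found under rawData -> data."""
    images = ((json or {}).get('rawData') or {}).get('data') or []
    for image in images:
        link = image.get('img_url')
        if link:
            yield link


# Hypothetical stand-in for the JSON returned by get_page()
sample = {
    'rawData': {
        'data': [
            {'img_url': 'https://example.com/a.jpg'},
            {'img_url': 'https://example.com/b.jpg'},
            {'text': 'entry without an image'},  # skipped: no img_url
        ]
    }
}

links = list(get_images(sample))
print(links)
```

Because get_images() is a generator, nothing is extracted until the caller iterates over it, which is why the sketch wraps the call in list().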
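Next, implement a save_image() method for saving images. Its link argument is one of the image URLs yielded by get_images(). The method requests the link, obtains the image's binary data, and writes it to a file in binary mode under the ./image directory. The file is named after the MD5 digest of its content, so identical images are not saved twice. The relevant code is as follows: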
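    import os
    from hashlib import md5

    def save_image(link):
        data = requests.get(link).content
        os.makedirs('./image', exist_ok=True)  # open() fails if the folder is missing
        # Use the MD5 digest of the data as the file name
        with open(f'./image/{md5(data).hexdigest()}.jpg', 'wb') as f:
            f.write(data)

    def main(page_num):
        json = get_page(page_num)
        if not json:  # request failed or returned no JSON
            return
        for link in get_images(json):
            save_image(link)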
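The de-duplication effect of MD5-based file names can be checked without any network access. In this sketch, md5_name and save_bytes are hypothetical helpers and the byte strings stand in for downloaded image data; identical bytes map to the same file name, so a second write simply overwrites the first.

```python
import os
import tempfile
from hashlib import md5


def md5_name(data):
    """Hypothetical helper: derive a file name from the content itself."""
    return md5(data).hexdigest() + '.jpg'


def save_bytes(data, folder):
    """Write data into folder under its MD5-based name and return the path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, md5_name(data))
    with open(path, 'wb') as f:
        f.write(data)
    return path


with tempfile.TemporaryDirectory() as folder:
    save_bytes(b'image-one', folder)
    save_bytes(b'image-one', folder)  # duplicate content, same file name
    save_bytes(b'image-two', folder)
    saved = sorted(os.listdir(folder))
    print(len(saved))  # 2 files, not 3
```

The trade-off of this naming scheme is that the file name carries no information about the source page, so it suits bulk collection better than cases where the origin of each image matters.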
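Here GROUP_START and GROUP_END define the first and last page numbers. The code also uses a pool from multiprocessing.pool (Pool, which is a pool of worker processes rather than threads) and calls its map() method to run main() once per page, so the pages are downloaded in parallel.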
    if __name__ == '__main__':
        GROUP_START = 1
        GROUP_END = 20
        pool = Pool()
        groups = list(range(GROUP_START, GROUP_END + 1))
        pool.map(main, groups)
        pool.close()
        pool.join()
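Because downloading is mostly waiting on the network, a thread-based pool is a possible alternative to Pool. The sketch below uses multiprocessing.pool.ThreadPool, which offers the same map() interface; fake_main is a hypothetical stand-in for main() so that the example runs without network access, and the pool size of 4 is an arbitrary choice.

```python
from multiprocessing.pool import ThreadPool

GROUP_START = 1
GROUP_END = 20


def fake_main(page_num):
    """Hypothetical stand-in for main(): pretend to process one page."""
    return f'page {page_num} done'


def run_all():
    groups = list(range(GROUP_START, GROUP_END + 1))
    # map() blocks until every page is processed and keeps input order;
    # leaving the with-block then shuts the pool down.
    with ThreadPool(4) as pool:
        return pool.map(fake_main, groups)


if __name__ == '__main__':
    results = run_all()
    print(len(results), results[0], results[-1])
```

Swapping ThreadPool back to Pool keeps the rest of the code unchanged, since both expose map(), close(), and join().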
The complete code is as follows:

    import os
    import requests
    from urllib.parse import urlencode
    from hashlib import md5
    from multiprocessing.pool import Pool

    # Headers copied from a browser session; the Cookie belongs to the
    # original session and will likely need to be replaced with your own.
    headers = {
        'Host': 'so.toutiao.com',
        #'Referer': 'https://so.toutiao.com/search?keyword=%E8%A1%97%E6%8B%8D&pd=atlas&dvpf=pc&aid=4916&page_num=0&search_json={%22from_search_id%22:%22202112272022060101510440283EE83D67%22,%22origin_keyword%22:%22%E8%A1%97%E6%8B%8D%22,%22image_keyword%22:%22%E8%A1%97%E6%8B%8D%22}',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'Cookie': 'msToken=S0DFBkZ9hmyLOGYd3_QjhhXgrm38qTyOITnkNb0t_oavfbVxuYV1JZ0tT5hLgswSfmZLFD6c2lONm_5TomUQXVXjen7CIxM2AGwbhHRYKjhg; _S_DPR=1.5; _S_IPAD=0; MONITOR_WEB_ID=7046351002275317255; ttwid=1%7C0YdWalNdIiSpIk3CvvHwV25U8drq3QAj08E8QOApXhs%7C1640607595%7C720e971d353416921df127996ed708931b4ae28a0a8691a5466347697e581ce8; _S_WIN_WH=262_623'
    }

    def get_page(page_num):
        params = {
            'keyword': '街拍',
            'pd': 'atlas',
            'dvpf': 'pc',
            'aid': '4916',
            'page_num': page_num,
            # Raw JSON; urlencode() percent-encodes it (a pre-encoded string would be encoded twice)
            'search_json': '{"from_search_id":"202112272022060101510440283EE83D67","origin_keyword":"街拍","image_keyword":"街拍"}',
            'rawJSON': 1,
            'search_id': '2021122721183101015104402851E3883D'
        }
        url = 'https://so.toutiao.com/search?' + urlencode(params)
        print(url)
        try:
            # The query string is already in url, so params is not passed again
            response = requests.get(url, headers=headers)
            if response.status_code == 200:
                return response.json()
        except (requests.ConnectionError, ValueError):
            # ValueError covers a response body that is not valid JSON
            pass
        return None

    def get_images(json):
        # get_page() may return None, and 'rawData' may be absent
        images = ((json or {}).get('rawData') or {}).get('data') or []
        for image in images:
            link = image.get('img_url')
            if link:
                yield link

    def save_image(link):
        data = requests.get(link).content
        os.makedirs('./image', exist_ok=True)  # open() fails if the folder is missing
        # Use the MD5 digest of the data as the file name
        with open(f'./image/{md5(data).hexdigest()}.jpg', 'wb') as f:
            f.write(data)

    def main(page_num):
        json = get_page(page_num)
        for link in get_images(json):
            save_image(link)

    if __name__ == '__main__':
        GROUP_START = 1
        GROUP_END = 20
        pool = Pool()
        groups = list(range(GROUP_START, GROUP_END + 1))
        pool.map(main, groups)
        pool.close()
        pool.join()
That concludes this case study of crawling Ajax-loaded data with Python. Thanks for reading!