This article shows how to scrape Douban Books (豆瓣讀書) with selenium + PhantomJS: the goal, why a browser-driven approach was needed, and the complete code.
The goal is to collect the information for every book about Python.
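Testing showed that requests sent from code with a 'User-Agent' header and 'data' payload could not retrieve the relevant information. Some fields came back empty, which raised errors during scraping, so the full data set could not be collected. The preliminary conclusion is that Douban Books has fairly strict anti-scraping measures. Further testing showed that the data can be retrieved by using selenium to simulate a browser request.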
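Whichever way the page is fetched, an empty field should not abort the whole run. The following is a minimal sketch of one way to guard against that, assuming the fields are read from XPath-style result lists (which are empty when nothing matches). The helper name first_or_default and the sample values are hypothetical and do not come from the original code.

```python
def first_or_default(values, default=''):
    """Return the first item of a result list, or a default when the list is empty."""
    return values[0] if values else default


# An XPath query returns a list; a field missing from the page yields [].
score = first_or_default(['9.1'])   # field present: take the first match
missing = first_or_default([])      # field absent: fall back to '' instead of raising IndexError
print(repr(score), repr(missing))
```

Indexing an empty list directly would raise IndexError, so routing every field through a guard like this keeps one incomplete book entry from stopping the scrape.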
# Import the required modules
import time

import pymysql
from lxml import etree
from selenium import webdriver


# Fetch one search-result page with a headless browser
def my_browers(url, page):
    # Get a browser object.
    # Note: webdriver.PhantomJS exists only in Selenium 3.x and earlier;
    # it was removed in Selenium 4.
    browers = webdriver.PhantomJS(
        executable_path=r'd:\Desktop\pythonjs\phantomjs-2.1.1-windows\bin\phantomjs.exe')
    try:
        # Send the request through the browser
        browers.get(url)
        # Pause two seconds: a lower request rate takes longer but is safer
        time.sleep(2)
        # Get the rendered page source
        html = browers.page_source
    finally:
        # Always shut the PhantomJS process down, otherwise one is leaked per page
        browers.quit()
    # Call the page-parsing function
    parse_html(html)


# Parse the page
def parse_html(html):
    # Build an element tree that supports XPath
    html = etree.HTML(html)
    # Get the list of all book entries
    books = html.xpath('//div[contains(@class,"sc-bZQynM")]')
    # Walk through each book and pull out the fields we want
    for book in books:
        # The paths start with './/' so they are evaluated relative to this book;
        # a leading '//' would search the whole page and return the first
        # book's values for every entry.
        pic = book.xpath('.//img/@src')                                  # cover image
        book_name = book.xpath('.//div[@class="title"]/a/text()')        # title
        book_url = book.xpath('.//div[@class="title"]/a/@href')          # detail-page link
        score_book = book.xpath('.//span[@class="rating_nums"]/text()')  # rating
        book_detail = book.xpath('.//div[@class="meta abstract"]/text()')  # publisher info

        # Dictionary holding one book; missing fields are stored as ''
        book_dict = {
            'pic': pic[0] if pic else '',
            'book_name': book_name[0] if book_name else '',
            'book_url': book_url[0] if book_url else '',
            'score_book': score_book[0] if score_book else '',
            'book_detail': book_detail[0] if book_detail else '',
        }
        print(book_dict)
        # Call the database function
        insert_mysql(book_dict)


# Insert into the database
def insert_mysql(book_dict):
    # Connect to the database (PyMySQL 1.0+ accepts keyword arguments only)
    conn = pymysql.connect(host='localhost', user='root', password='root',
                           database='test', charset='utf8')
    try:
        # Create the cursor used to run statements
        with conn.cursor() as cursor:
            # Values are passed as parameters, so quotes or a trailing backslash
            # in a title no longer break the statement and do not have to be
            # stripped from the data first.
            sql = ("insert into python_book (pic,book_name,book_url,score,book_detail) "
                   "VALUES (%s,%s,%s,%s,%s)")
            # Execute and commit
            cursor.execute(sql, (book_dict['pic'], book_dict['book_name'],
                                 book_dict['book_url'], book_dict['score_book'],
                                 book_dict['book_detail']))
        conn.commit()
    finally:
        conn.close()


if __name__ == '__main__':
    for i in range(0, 199):
        print('=================Downloading page {}========================'.format(i + 1))
        # Each result page holds 15 books, so the offset advances by 15
        page = i * 15
        base_url = 'https://book.douban.com/subject_search?search_text=python&cat=1001&start={}'.format(page)
        my_browers(base_url, page)
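Book titles that contain quotes or a trailing backslash are what break an INSERT statement assembled by string formatting. Passing the values as query parameters avoids this without altering the data. The sketch below illustrates the idea with the standard-library sqlite3 module standing in for MySQL, so that it runs without a database server. This is a swapped-in assumption: sqlite3 uses ? placeholders where pymysql uses %s. The table and column names follow the article; the helper name insert_book and the sample record are hypothetical.

```python
import sqlite3


def insert_book(conn, book):
    """Insert one book record, passing values as parameters rather than formatting them into the SQL."""
    sql = ("INSERT INTO python_book (pic, book_name, book_url, score, book_detail) "
           "VALUES (?, ?, ?, ?, ?)")
    conn.execute(sql, (book['pic'], book['book_name'], book['book_url'],
                       book['score_book'], book['book_detail']))
    conn.commit()


# In-memory database used only for this illustration
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE python_book "
             "(pic TEXT, book_name TEXT, book_url TEXT, score TEXT, book_detail TEXT)")

# Hypothetical record whose title contains both quote styles and a trailing backslash
sample = {
    'pic': '',
    'book_name': 'Python\'s "Tricks" \\',
    'book_url': '',
    'score_book': '9.0',
    'book_detail': '',
}
insert_book(conn, sample)

# The title comes back exactly as it went in
stored_name = conn.execute("SELECT book_name FROM python_book").fetchone()[0]
print(stored_name == sample['book_name'])
```

With pymysql the same pattern is cursor.execute(sql, params) with %s placeholders in the statement.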
That covers everything in this article on using selenium + PhantomJS to scrape Douban Books. Thanks for reading, and I hope it is useful.