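I have been getting ready to move recently and was browsing rental listings on various websites until my eyes glazed over. I wondered whether the basic information could be gathered in one place to make it easier to look through, so I used Python to scrape the basic details and put them into Excel, where they are easy to search.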
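1. Extracting information with XPath in lxml
XPath is a language for finding information in XML documents; it can be used to traverse the elements and attributes of an XML document. XPath and regular expressions (re) can do the same job and offer similar functionality, but XPath has clear advantages over re for this task: (1) it can find information in XML; (2) it also supports searching HTML; (3) it navigates by elements and attributes rather than by matching raw text.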
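To make the element-and-attribute navigation concrete, here is a minimal sketch using lxml's `etree.HTML` and `xpath` calls. The HTML fragment, its class names, and its values are invented for illustration and are not the actual Beike markup.

```python
from lxml import etree

# Hypothetical HTML fragment (NOT the real Beike page structure).
SAMPLE_HTML = """
<div class="listing">
  <div class="item">
    <p class="title"><a href="/zufang/1.html"> Community A 2-bedroom </a></p>
    <span><em>5000</em></span>
  </div>
  <div class="item">
    <p class="title"><a href="/zufang/2.html"> Community B 1-bedroom </a></p>
    <span><em>3200</em></span>
  </div>
</div>
"""

# etree.HTML parses (possibly malformed) HTML into an element tree.
tree = etree.HTML(SAMPLE_HTML)

# An absolute query selects every listing block ...
items = tree.xpath('//div[@class="listing"]/div[@class="item"]')

listings = []
for item in items:
    # ... and relative queries (starting with ./) search inside each block.
    title = item.xpath('./p[@class="title"]/a/text()')[0].strip()
    link = item.xpath('./p[@class="title"]/a/@href')[0]   # attribute value
    price = item.xpath('./span/em/text()')[0]             # text node
    listings.append((title, link, price))

print(listings)
```

Note that `@class="item"` compares the whole attribute value, so an element whose class is `item featured` would not match; `contains(@class, "item")` is a looser alternative.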
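2. Saving the information to Excel with the xlsxwriter module
xlsxwriter is a library for producing Excel files; it lets us generate spreadsheets quickly, in bulk, and automatically. It can write data, draw charts, and handle most common Excel operations. Its drawback is that xlsxwriter can only create new files and cannot modify existing ones: if a new file is created with the same name as an existing file, the existing file is overwritten.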
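As a small illustration of the write-only workflow, the sketch below creates a workbook, writes a bold header row and two data rows, and closes the file. The file name `demo.xlsx`, the temporary directory, and the sample rows are assumptions made for this example only.

```python
import os
import tempfile

import xlsxwriter

# Hypothetical sample records, for illustration only.
rows = [
    {"community": "Community A", "price": "5000"},
    {"community": "Community B", "price": "3200"},
]

# Write into a temporary directory so no existing file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")

workbook = xlsxwriter.Workbook(path)       # always creates a new file
worksheet = workbook.add_worksheet()
bold = workbook.add_format({"bold": True})

# Header in row 0, starting at column 0.
worksheet.write_row(0, 0, ["Community", "Price"], bold)

# Data rows start at row 1; row and column indexes are zero-based.
for row_index, row in enumerate(rows, start=1):
    worksheet.write_string(row_index, 0, row["community"])
    worksheet.write_string(row_index, 1, row["price"])

# Nothing is written to disk until the workbook is closed.
workbook.close()
print(path)
```

Because the library cannot open an existing workbook, rerunning a script with the same output path replaces the earlier file; adding a date to the file name is one way to keep older results.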
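3. Scraping approach
Browsing the site shows that the Beike (ke.com) rental listings span 100 pages in total. We can fetch the HTML of each page in turn, extract the fields we need into a dictionary per listing, collect the dictionaries from all pages, and finally write the collected data to Excel.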
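The page-by-page flow can be sketched as a small loop that takes the fetching and parsing steps as parameters. The names `page_url`, `crawl`, `fake_fetch`, and `fake_parse`, as well as the optional `delay`, are hypothetical and exist only so that the sketch runs without network access.

```python
import time


def page_url(page):
    """Build the listing URL for a 1-based page number."""
    return "https://bj.zu.ke.com/zufang/pg{}/#contentList".format(page)


def crawl(pages, fetch, parse, delay=0.0):
    """Fetch each page, parse it, and collect all records in one list."""
    records = []
    for page in pages:
        html = fetch(page_url(page))
        records.extend(parse(html))   # one dict per listing
        if delay:
            time.sleep(delay)         # optional pause between requests
    return records


# Stand-ins for the real download and parsing steps (assumptions).
def fake_fetch(url):
    return "<html>{}</html>".format(url)


def fake_parse(html):
    return [{"source": html}]


records = crawl(range(1, 4), fake_fetch, fake_parse)
print(len(records))
```

For the full site, `range(1, 101)` yields pages 1 through 100, since the upper bound of `range` is exclusive.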
4. Scraper source code
# @Author: Rainbowhhy
# @Date  : 19-6-25 6:35 PM
import requests
import time
from lxml import etree
import xlsxwriter


def get_html(page):
    """Fetch the HTML of one listing page."""
    url = "https://bj.zu.ke.com/zufang/pg{}/#contentList".format(page)
    headers = {
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    }
    response = requests.get(url, headers=headers).text
    return response


def parse_html(htmlcode, data):
    """Parse one page of HTML and append a dict per listing to data."""
    content = etree.HTML(htmlcode)
    # Each child div of the first block under content__article is one listing.
    results = content.xpath('//div[@class="content__article"]/div[1]/div')
    for result in results:
        # The first whitespace-separated token of the title is the community name.
        community = result.xpath(
            './div[1]/p[@class="content__list--item--title twoline"]/a/text()'
        )[0].replace('\n', '').strip().split()[0]
        address = "-".join(result.xpath('./div/p[@class="content__list--item--des"]/a/text()'))
        # The brand (listing source) paragraph is not always present.
        brand = result.xpath('./div/p[@class="content__list--item--brand oneline"]/text()')
        landlord = brand[0].replace('\n', '').strip() if len(brand) > 0 else ""
        postime = result.xpath('./div/p[@class="content__list--item--time oneline"]/text()')[0]
        introduction = ",".join(result.xpath('./div/p[@class="content__list--item--bottom oneline"]/i/text()'))
        price = result.xpath('./div/span/em/text()')[0]
        # Description tokens: area first, layout last, orientation in between.
        description = "".join(result.xpath('./div/p[2]/text()')).replace('\n', '').replace('-', '').strip().split()
        area = description[0]
        count = len(description)
        if count == 6:
            orientation = description[1] + description[2] + description[3] + description[4]
        elif count == 5:
            orientation = description[1] + description[2] + description[3]
        elif count == 4:
            orientation = description[1] + description[2]
        elif count == 3:
            orientation = description[1]
        else:
            orientation = ""
        pattern = description[-1]
        # The floor is the second span in the description, when it exists.
        spans = result.xpath('./div/p[2]/span/text()')
        floor = "".join(spans[1].replace('\n', '').strip().split()).strip() if len(spans) > 1 else ""
        date_time = time.strftime("%Y-%m-%d", time.localtime())
        # Store the fields in a dictionary.
        data_dict = {
            "community": community,
            "address": address,
            "landlord": landlord,
            "postime": postime,
            "introduction": introduction,
            "price": '¥' + price,
            "area": area,
            "orientation": orientation,
            "pattern": pattern,
            "floor": floor,
            "date_time": date_time
        }
        data.append(data_dict)


def excel_storage(response):
    """Write the list of listing dicts to Excel."""
    workbook = xlsxwriter.Workbook('./beikeHouse.xlsx')
    worksheet = workbook.add_worksheet()
    # Bold format for the header row.
    bold_format = workbook.add_format({'bold': True})
    worksheet.write('A1', 'Community', bold_format)
    worksheet.write('B1', 'Address', bold_format)
    worksheet.write('C1', 'Listing source', bold_format)
    worksheet.write('D1', 'Posted', bold_format)
    worksheet.write('E1', 'Notes', bold_format)
    worksheet.write('F1', 'Price', bold_format)
    worksheet.write('G1', 'Area', bold_format)
    worksheet.write('H1', 'Orientation', bold_format)
    worksheet.write('I1', 'Layout', bold_format)
    worksheet.write('J1', 'Floor', bold_format)
    worksheet.write('K1', 'Date viewed', bold_format)
    row = 1
    col = 0
    for item in response:
        worksheet.write_string(row, col + 0, item['community'])
        worksheet.write_string(row, col + 1, item['address'])
        worksheet.write_string(row, col + 2, item['landlord'])
        worksheet.write_string(row, col + 3, item['postime'])
        worksheet.write_string(row, col + 4, item['introduction'])
        worksheet.write_string(row, col + 5, item['price'])
        worksheet.write_string(row, col + 6, item['area'])
        worksheet.write_string(row, col + 7, item['orientation'])
        worksheet.write_string(row, col + 8, item['pattern'])
        worksheet.write_string(row, col + 9, item['floor'])
        worksheet.write_string(row, col + 10, item['date_time'])
        row += 1
    workbook.close()


def main():
    all_datas = []
    # The site has 100 pages in total; range(1, 101) covers pages 1 to 100.
    for page in range(1, 101):
        html = get_html(page)
        parse_html(html, all_datas)
    excel_storage(all_datas)


if __name__ == '__main__':
    main()
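The description handling in the script treats the first token as the area, the last token as the layout, and everything in between as the orientation. One possible way to express that rule without a branch per token count is a slice, sketched below. The helper name `split_description` and the sample tokens are hypothetical; unlike the original branches, the slice also joins the middle tokens when there are more than six.

```python
def split_description(tokens):
    """Split description tokens into (area, orientation, layout).

    Assumes the first token is the area and the last is the layout;
    any tokens in between are concatenated as the orientation.
    """
    if not tokens:
        return "", "", ""
    area = tokens[0]
    layout = tokens[-1]
    orientation = "".join(tokens[1:-1])   # empty when fewer than 3 tokens
    return area, orientation, layout


# Hypothetical token lists, shaped like the whitespace-split description text.
four_tokens = split_description(["68㎡", "南", "北", "2室1厅1卫"])
two_tokens = split_description(["68㎡", "2室1厅1卫"])
print(four_tokens)
print(two_tokens)
```

Guarding against an empty list also avoids the `IndexError` that `description[0]` would raise if a listing has no description text.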
5. Screenshot of the results
[Image: screenshot of the results]