How to Use Python to Browse Liyang's Photography Community

Published: 2022-05-17 13:53:32  Source: 億速云  Views: 134  Author: iii  Category: Development Technology

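Many readers find the techniques behind "How to Use Python to Browse Liyang's Photography Community" hard to follow, so this article summarizes them in detail, step by step. It should serve as a useful reference, so let's walk through it.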

Target site analysis

The list pages of the target site follow this pagination pattern:

http://www.jsly001.com/thread-htm-fid-45-page-{page number}.html
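As a minimal sketch, the list-page URLs for this pattern could be generated as follows. The helper name build_page_urls and the page range 1 to 3 are assumptions for illustration and do not appear in the original code.

```python
# Hypothetical helper: expand the pagination pattern into concrete URLs.
PAGE_URL = "http://www.jsly001.com/thread-htm-fid-45-page-{page}.html"


def build_page_urls(first_page, last_page):
    """Return the list-page URLs for first_page..last_page (inclusive)."""
    return [PAGE_URL.format(page=page) for page in range(first_page, last_page + 1)]


if __name__ == "__main__":
    for url in build_page_urls(1, 3):
        print(url)
```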

The code is written with the threading module, the requests module, and the BeautifulSoup module.

Collection proceeds from the list page to each detail page (list page → detail page).
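Links collected from a list page are often relative, so each one has to be resolved against the site before the detail page can be requested. A minimal sketch using urljoin from the standard library follows; the href values are made-up examples, not taken from the site.

```python
from urllib.parse import urljoin

LIST_PAGE = "http://www.jsly001.com/thread-htm-fid-45-page-1.html"


def to_detail_url(href):
    """Resolve an href found on the list page into an absolute URL."""
    return urljoin(LIST_PAGE, href)


if __name__ == "__main__":
    # Hypothetical hrefs, for illustration only
    print(to_detail_url("read-htm-tid-1.html"))
    print(to_detail_url("http://www.jsly001.com/read-htm-tid-2.html"))
```

Compared with concatenating the domain and the href by hand, urljoin also leaves already-absolute links unchanged.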

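Image collection code for the Liyang photography community

This is a hands-on example: the complete code is shown first, followed by an explanation built around its comments and key functions.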

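The main steps are:

Set the logging output level
Declare a LiYangThread class that inherits from threading.Thread
Initialize the thread object
Have each thread fetch from the shared resource
Call the HTML parsing function
Locate the divider row in the board's topic list, mainly to avoid collecting pinned topics
Parse with lxml
Extract the titles and detail-page URLs
Extract the image URLs
Save the images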

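import random
import threading
import logging
from bs4 import BeautifulSoup
import requests
import lxml  # not called directly; importing it confirms the parser used by BeautifulSoup is installed
logging.basicConfig(level=logging.NOTSET)  # set the logging level; NOTSET lets every level through, including DEBUG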
# Declare a LiYangThread class that inherits from threading.Thread
class LiYangThread(threading.Thread):
    def __init__(self):
        super().__init__()  # initialize the Thread base class
        self._headers = self._get_headers()  # pick a random User-Agent
        self._timeout = 5  # request timeout in seconds

    # Each thread fetches from the shared resource
    def run(self):
        # while True:  # loop here when running multiple threads
        res = None  # stays None if the request fails
        try:
            res = requests.get(url="http://www.jsly001.com/thread-htm-fid-45-page-1.html", headers=self._headers,
                               timeout=self._timeout)  # test run: fetch the first page
        except Exception as e:
            logging.error(e)
        if res is not None:
            html_text = res.text
            self._format_html(html_text)  # call the HTML parsing function

    def _format_html(self, html):
        # Parse with lxml
        soup = BeautifulSoup(html, 'lxml')

        # Locate the divider row in the topic list, mainly to avoid collecting pinned topics
        part_tr = soup.find(attrs={'class': 'bbs_tr4'})

        if part_tr is not None:
            items = part_tr.find_all_next(attrs={"name": "readlink"})  # detail-page links after the divider
        else:
            items = soup.find_all(attrs={"name": "readlink"})
        # Extract each title and its detail-page URL
        data = [(item.text, f'http://www.jsly001.com/{item["href"]}') for item in items]
        # Visit each detail page
        for name, url in data:
            self._get_imgs(name, url)

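    def _get_imgs(self, name, url):
        """Extract the image URLs from a detail page."""
        res = None  # stays None if the request fails
        try:
            res = requests.get(url=url, headers=self._headers, timeout=self._timeout)
        except Exception as e:
            logging.error(e)
        # Image extraction logic
        if res is not None:
            soup = BeautifulSoup(res.text, 'lxml')
            origin_div1 = soup.find(attrs={'class': 'tpc_content'})
            origin_div2 = soup.find(attrs={'class': 'imgList'})
            # Prefer the imgList block; fall back to the post body
            content = origin_div2 if origin_div2 is not None else origin_div1

            if content is not None:
                imgs = content.find_all('img')

                # print([img.get("src") for img in imgs])
                self._save_img(name, imgs)  # save the images

    def _save_img(self, name, imgs):
        """Save the images."""
        safe_name = name.replace("/", "_")  # "/" cannot appear in a file name
        for img in imgs:
            url = img.get("src")
            if not url or not url.startswith("http"):
                continue  # skip missing or relative image URLs
            # Look up the id attribute on the enclosing <span>
            span = img.find_parent('span')
            id_ = span.get("id") if span is not None else None

            res = None  # reset so a failed request never reuses the previous response
            try:
                res = requests.get(url=url, headers=self._headers, timeout=self._timeout)
            except Exception as e:
                logging.error(e)

            if res is not None:
                # Create the imgs folder in the working directory before running
                with open(f'./imgs/{safe_name}_{id_}.jpg', "wb") as f:
                    f.write(res.content)

    def _get_headers(self):
        """Pick a User-Agent at random (the list currently holds one entry)."""
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
        ]
        ua = random.choice(uas)
        headers = {
            "user-agent": ua
        }
        return headers


if __name__ == '__main__':
    my_thread = LiYangThread()
    my_thread.start()  # start() runs run() in a new thread; calling run() directly stays on the main thread
    my_thread.join()  # wait for the crawl to finish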
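The script above creates a single thread and fetches only page 1; the commented-out while True marks where a multi-threaded loop would go. One possible way to share work between threads is a queue.Queue of page URLs. This is a sketch under assumptions: PageWorker, crawl, handle_page, and the thread count are hypothetical names and values, and handle_page stands in for the request-and-parse step.

```python
import queue
import threading


class PageWorker(threading.Thread):
    """Hypothetical worker: pulls page URLs from a shared queue until it is empty."""

    def __init__(self, url_queue, handle_page):
        super().__init__()
        self._url_queue = url_queue
        self._handle_page = handle_page  # stand-in for fetch + parse

    def run(self):
        while True:
            try:
                url = self._url_queue.get_nowait()
            except queue.Empty:
                break  # no work left, let the thread exit
            self._handle_page(url)


def crawl(urls, handle_page, thread_count=3):
    """Distribute urls across thread_count workers and wait for them to finish."""
    url_queue = queue.Queue()
    for url in urls:
        url_queue.put(url)
    workers = [PageWorker(url_queue, handle_page) for _ in range(thread_count)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()


if __name__ == "__main__":
    pages = [f"http://www.jsly001.com/thread-htm-fid-45-page-{n}.html" for n in range(1, 4)]
    crawl(pages, print)  # print stands in for the real page handler
```

Because the queue is filled before any worker starts, an empty queue reliably means the work is done, which keeps the exit condition simple.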

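In this example BeautifulSoup parses the HTML with the lxml parser, which later examples also rely on. Make sure the lxml package is installed before running the code; BeautifulSoup looks the parser up by name, so the explicit import of lxml mainly serves as an early check that it is available.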

Data extraction relies on two functions, soup.find() and soup.find_all(). The code also uses find_parent() to read the id attribute from the image's enclosing span tag.

# Look up the id attribute on the parent tag
id_ = img.find_parent('span').get("id")
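To make the three calls concrete, here is a small sketch run against a made-up HTML fragment. The markup, ids, and image URLs are invented for illustration and only imitate the structure the scraper expects; the html.parser backend is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Invented fragment: images wrapped in <span id="..."> inside a content div
HTML = """
<div class="tpc_content">
  <span id="att_1"><img src="http://example.com/a.jpg"></span>
  <span id="att_2"><img src="http://example.com/b.jpg"></span>
</div>
"""


def extract_images(html):
    """Return (span id, image src) pairs found inside the content div."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find(attrs={"class": "tpc_content"})  # first matching tag, or None
    if content is None:
        return []
    pairs = []
    for img in content.find_all("img"):  # every <img> below the div
        span = img.find_parent("span")  # nearest enclosing <span>, or None
        pairs.append((span.get("id") if span is not None else None, img.get("src")))
    return pairs


if __name__ == "__main__":
    print(extract_images(HTML))
```

Checking for None before calling get() matters because find() and find_parent() both return None when nothing matches.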

DEBUG messages appear while the code runs; adjust the logging output level to control them.
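One way to control the log level is sketched below: raise the root level to INFO and, separately, quiet a noisy library logger. The assumption here is that the DEBUG lines come from urllib3, which requests uses for its connections; adjust the logger name if they come from elsewhere.

```python
import logging

# Root logger: show INFO and above instead of everything
logging.basicConfig(level=logging.INFO)

# Assumed source of the DEBUG noise: the urllib3 logger used underneath requests
logging.getLogger("urllib3").setLevel(logging.WARNING)

logging.debug("hidden: below the INFO threshold")
logging.info("shown: at the INFO threshold")
```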

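That covers "How to Use Python to Browse Liyang's Photography Community". Hopefully the walkthrough has given you a working understanding and proves useful; for more on related topics, follow the 億速云 industry news channel.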
