Python Web Scraping: A Guide to the Beautiful Soup Module

Published: 2020-09-27 00:59:05  Source: 腳本之家  Views: 181  Author: hoxis  Category: Development Technology

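Scraping a web page generally follows these steps:

Choose the URL you want to scrape
Open that URL from Python (urlopen, requests, etc.)
Read the page content (for example with read())
Pass the content to BeautifulSoup
Use BeautifulSoup to select tags and extract the data you need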
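
The steps above can be sketched roughly as follows. This is only a minimal sketch, not the article's own code: the function names fetch_html and extract_title are hypothetical, the built-in html.parser is assumed so that nothing beyond beautifulsoup4 is required, and the network step is defined but not called.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Steps 1-3: open the URL and read the page as text (needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_title(html):
    """Steps 4-5: hand the markup to BeautifulSoup and pick out the <title> text."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.title
    return str(title_tag.string) if title_tag is not None else None


# Offline demonstration with a literal page instead of a real request.
page = "<html><head><title>Example page</title></head><body></body></html>"
print(extract_title(page))
```

In a real run you would call extract_title(fetch_html(url)) with a URL of your choice.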

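As you can see, fetching the page is not the hard part. The hard part is filtering the data, that is, picking out exactly the data you want. This article walks through how to use BeautifulSoup for that.

The official BeautifulSoup site describes it as follows:

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.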

1 Installation

You can install it directly with pip:

$ pip install beautifulsoup4

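Besides the HTML parser in Python's standard library (html.parser), BeautifulSoup supports third-party parsers such as lxml (for both HTML and XML) and html5lib, but their libraries must be installed separately. If none of them is installed, BeautifulSoup falls back to Python's built-in parser. lxml is generally faster and more lenient than the built-in parser, so installing it is recommended.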

$ pip install html5lib
$ pip install lxml
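
To see which parsers are actually usable in your environment, something like the following may help. It is a rough sketch: the helper name available_parsers is hypothetical, and it relies on BeautifulSoup raising bs4.FeatureNotFound when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def available_parsers(candidates=("html.parser", "lxml", "html5lib")):
    """Return the parser names from `candidates` that BeautifulSoup can load."""
    found = []
    for name in candidates:
        try:
            BeautifulSoup("<p>probe</p>", name)
        except FeatureNotFound:
            # The library behind this parser is not installed; skip it.
            continue
        found.append(name)
    return found


print(available_parsers())
```

html.parser ships with Python, so it should always appear in the result.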

2 Basic usage of BeautifulSoup

First, create a string; the rest of the article uses it to demonstrate BeautifulSoup.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Parsing this markup with BeautifulSoup gives a BeautifulSoup object, which can print the document as a consistently indented structure:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, "lxml")
>>> print(soup.prettify())

To save space, the output is not shown here.

Here are a few simple ways to navigate the parsed structure:

>>> soup.title
<title>The Dormouse's story</title>
>>> soup.title.name
'title'
>>> soup.title.string
"The Dormouse's story"
>>> soup.p['class']
['title']
>>> soup.a
<a class="sister" id="link1">Elsie</a>
>>> soup.find_all('a')
[<a class="sister" id="link1">Elsie</a>, <a class="sister" id="link2">Lacie</a>, <a class="sister" id="link3">Tillie</a>]
>>> soup.find(id='link1')
<a class="sister" id="link1">Elsie</a>

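3 Kinds of objects

Beautiful Soup transforms a complex HTML document into a tree of Python objects. Every node is a Python object, and all of them fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

3.1 Tag

Put simply, a Tag corresponds to a tag in the HTML document, such as the title and a tags above. For example: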

<title>The Dormouse's story</title>
  
<a class="sister" id="link1">Elsie</a>

You can get these tags easily by accessing soup with the tag name as an attribute.

>>> print(soup.p)
<p class="title"><b>The Dormouse's story</b></p>
>>> print(soup.title)
<title>The Dormouse's story</title>

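Note that this returns only the first matching tag in the document. Finding all matching tags is covered later.

Every Tag has two important attributes, name and attrs. name is the tag's name, and attrs is a dictionary of all the tag's attributes (class, id, and so on), not just its class.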

>>> print(soup.p.name)
p
>>> print(soup.p.attrs)
{'class': ['title']}
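
Because attrs is an ordinary dictionary, individual attributes can be read with dictionary-style access. The short sketch below uses a made-up one-tag document and the built-in html.parser; the variable name tag is only for illustration.

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<a class="sister big" id="link1">Elsie</a>', "html.parser").a

print(tag.name)         # the tag's name
print(tag.attrs)        # dictionary holding every attribute of the tag
print(tag["id"])        # dictionary-style access to a single attribute
print(tag["class"])     # class is multi-valued, so it comes back as a list
print(tag.get("href"))  # .get() returns None for a missing attribute
```

Using tag["href"] on a tag without that attribute raises KeyError, so .get() is the safer choice when an attribute may be absent.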

3.2 NavigableString

A NavigableString holds the text inside a tag; you get it with, for example, soup.p.string.

>>> print(soup.p.string)
The Dormouse's story

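3.3 BeautifulSoup

The BeautifulSoup object represents the whole document. Most of the time you can treat it as a Tag object; it is a special kind of Tag.

3.4 Comment

A Comment object is a special type of NavigableString. When printed, its content does not include the comment delimiters, so if it is not handled carefully it can cause unexpected trouble in text processing.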

>>> markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
>>> comment_soup = BeautifulSoup(markup, "lxml")  # separate name, so the earlier soup stays usable
>>> comment = comment_soup.b.string
>>> print(comment)
Hey, buddy. Want to buy a used parser?
>>> type(comment)
<class 'bs4.element.Comment'>

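The content of the b tag is actually a comment, but when we output it through .string the comment delimiters have already been removed, so the comment could be mistaken for ordinary text.

To avoid this, first check whether the object's type is bs4.element.Comment before doing anything else with it, such as printing it.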
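
One possible way to write that check is shown below. It is a sketch under assumptions: the helper name visible_string is hypothetical, and the built-in html.parser is used.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment


def visible_string(tag):
    """Return the tag's .string as plain text, or None if it is a comment or absent."""
    value = tag.string
    if value is None or isinstance(value, Comment):
        return None
    return str(value)


commented = BeautifulSoup("<b><!--Hey, buddy.--></b>", "html.parser").b
plain = BeautifulSoup("<b>Hello</b>", "html.parser").b

print(visible_string(commented))  # the comment is filtered out
print(visible_string(plain))      # ordinary text is returned
```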

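4 Searching the document tree

BeautifulSoup also provides many methods for traversing the tree, such as getting child, parent, and sibling nodes, but in practice these are used less often. What we mostly need is to search the tree for our target.

Attribute-style access returns only the first matching tag in the document, for example soup.li. To get all <li> tags you need find_all(), which searches all descendant tags of the current tag and checks whether each one matches the filter. find_all() accepts the following parameters:

find_all(name, attrs, recursive, string, limit, **kwargs)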
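
The difference between attribute-style access and find_all() can be seen in a small sketch like this one. The list markup is made up for illustration, and limit is the find_all() argument that caps the number of results.

```python
from bs4 import BeautifulSoup

html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
list_soup = BeautifulSoup(html, "html.parser")

first_item = list_soup.li                        # attribute access: first <li> only
all_items = list_soup.find_all("li")             # every <li> in the document
first_two = list_soup.find_all("li", limit=2)    # stop after two matches

print(first_item.string)
print([str(li.string) for li in all_items])
print([str(li.string) for li in first_two])
```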

4.1 Searching by name

Finds all tags whose name is name; string objects are ignored automatically.

>>> soup.find_all('b')
[<b>The Dormouse's story</b>]
>>> soup.find_all('a')
[<a class="sister" id="link1">Elsie</a>, <a class="sister" id="link2">Lacie</a>, <a class="sister" id="link3">Tillie</a>]
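
The name argument is not limited to a plain string. As a rough illustration (using a made-up fragment and the built-in html.parser), it can also be a list of names, a compiled regular expression, or True:

```python
import re

from bs4 import BeautifulSoup

fragment = "<body><p><b>bold</b> and <a>link</a></p></body>"
name_soup = BeautifulSoup(fragment, "html.parser")

by_list = name_soup.find_all(["a", "b"])          # tags named a or b
by_regex = name_soup.find_all(re.compile("^b"))   # tag names starting with "b"
every_tag = name_soup.find_all(True)              # every tag, but no text nodes

print([tag.name for tag in by_list])
print([tag.name for tag in by_regex])
print([tag.name for tag in every_tag])
```

Results always come back in document order, regardless of the order of names in the list.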

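4.2 Searching by id

If you pass a keyword argument such as id that is not one of find_all()'s own parameter names, it is treated as a filter on the tag attribute with that name: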

>>> soup.find_all(id='link1')
[<a class="sister" id="link1">Elsie</a>]

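4.3 Searching by attrs

Some tag attributes cannot be used as keyword arguments, such as HTML5 data-* attributes, because their names are not valid Python identifiers. You can still search for them by passing a dictionary to the attrs parameter of find_all().

id is an attribute too, so the same approach works: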

>>> soup.find_all(attrs={'id':'link1'})
[<a class="sister" id="link1">Elsie</a>]
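
For the data-* case mentioned above, a minimal sketch might look like this. The markup and the attribute name data-foo are made up for illustration.

```python
from bs4 import BeautifulSoup

markup = '<div data-foo="value">foo!</div><div data-foo="other">bar</div><div>plain</div>'
data_soup = BeautifulSoup(markup, "html.parser")

# data_soup.find_all(data-foo="value") would be a syntax error,
# so the attribute filter goes into the attrs dictionary instead.
exact = data_soup.find_all(attrs={"data-foo": "value"})

# True matches any tag that has the attribute, whatever its value.
has_attr = data_soup.find_all(attrs={"data-foo": True})

print([tag.get_text() for tag in exact])
print([tag.get_text() for tag in has_attr])
```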

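4.4 Searching by CSS class

Searching for tags by CSS class name is very useful, but class is a reserved word in Python, so using class as a keyword argument is a syntax error. Starting with Beautiful Soup 4.1.2, you can search for tags with a given CSS class through the class_ parameter: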

>>> soup.find_all(class_='sister')
[<a class="sister" id="link1">Elsie</a>, <a class="sister" id="link2">Lacie</a>, <a class="sister" id="link3">Tillie</a>]
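
Since a tag can carry several CSS classes, class_ matches against each class individually. The following sketch uses made-up markup to illustrate this; the attrs dictionary form is shown as an equivalent alternative.

```python
from bs4 import BeautifulSoup

markup = '<p class="body strikeout">old</p><p class="body">new</p>'
class_soup = BeautifulSoup(markup, "html.parser")

strikeout = class_soup.find_all("p", class_="strikeout")     # matches one of several classes
body = class_soup.find_all("p", class_="body")               # matches both paragraphs
via_attrs = class_soup.find_all("p", attrs={"class": "body"})  # same filter without class_

print([p.get_text() for p in strikeout])
print([p.get_text() for p in body])
print([p.get_text() for p in via_attrs])
```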

4.5 The string parameter

The string parameter searches the text content of the document. Like name, it accepts a string, a regular expression, a list, or True.

>>> soup.find_all('a', string='Elsie')
[<a class="sister" id="link1">Elsie</a>]
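
As a rough illustration of the regular-expression form, the sketch below searches a made-up sentence. Note that when string is used without a tag name, the results are the matching text nodes rather than tags.

```python
import re

from bs4 import BeautifulSoup

markup = "<p><a>Elsie</a>, <a>Lacie</a> and <a>Tillie</a></p>"
string_soup = BeautifulSoup(markup, "html.parser")

# Without a tag name: returns the matching strings themselves.
names = string_soup.find_all(string=re.compile("ie$"))

# With a tag name: returns the <a> tags whose .string matches.
tags = string_soup.find_all("a", string=re.compile("^L"))

print([str(s) for s in names])
print([tag.get_text() for tag in tags])
```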

4.6 The recursive parameter

When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. If you only want to search its direct children, pass recursive=False.
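
A small sketch of the difference, using a made-up document in which title is a grandchild of html rather than a direct child:

```python
from bs4 import BeautifulSoup

markup = "<html><head><title>The Dormouse's story</title></head></html>"
tree = BeautifulSoup(markup, "html.parser")

# Default: all descendants of <html> are searched, so <title> is found.
deep = tree.html.find_all("title")

# recursive=False: only direct children of <html> are checked, and
# <title> sits one level deeper inside <head>.
shallow = tree.html.find_all("title", recursive=False)

print(len(deep), len(shallow))
```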

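4.7 The find() method

The only difference from find_all() is that find_all() returns a list of all matches (a one-element list when there is a single match), whereas find() returns the first match directly, or None if nothing matches.

4.8 The get_text() method

If you only want the text contained in a tag, use get_text(), which returns all of the text inside the tag, including the text of its descendants.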

>>> soup.find_all('a', string='Elsie')[0].get_text()
'Elsie'
>>> soup.find_all('a', string='Elsie')[0].string
'Elsie'
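
The two calls above give the same result only because the a tag contains a single string. The sketch below, with made-up markup and a hypothetical helper text_or_default, illustrates where .string and get_text() differ and why the result of find() is worth checking before use.

```python
from bs4 import BeautifulSoup

text_soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
paragraph = text_soup.find("p")

print(paragraph.string)      # None: <p> has more than one child, so .string is ambiguous
print(paragraph.get_text())  # all text inside <p>, including the nested <b>


def text_or_default(soup, name, default=""):
    """Return the text of the first `name` tag, or `default` if there is none."""
    tag = soup.find(name)  # find() returns None when nothing matches
    return tag.get_text() if tag is not None else default


print(text_or_default(text_soup, "b"))
print(text_or_default(text_soup, "table", default="(no table)"))
```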

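That covers the most commonly used features of Beautiful Soup. To learn more, see the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/.

Summary

This article introduced Beautiful Soup through a few small examples, so it should no longer feel unfamiliar. The next article will use Beautiful Soup in a hands-on scraping project, so stay tuned!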

