How to convert HTML to Markdown in Python with the html2text library

Published: 2021-02-26 16:23:37  Source: 億速云  Views: 729  Author: 戴恩恩  Category: Development

This article shows how to use the html2text library in Python to convert HTML to Markdown. It walks through worked code examples that may serve as a reference for study or work.

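What is Python

Python is a cross-platform, interpreted, interactive, object-oriented scripting language. It was originally designed for writing automation scripts; as new versions and features have been added, it has also become common for building standalone and large-scale projects.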

First, install the library:

pip install html2text

Using html2text from the command line

Once installed, the html2text command is available.

Its usage is: html2text [(filename|url) [encoding]]. Running html2text -h lists the options the command supports:

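Option	Description
`--version`	Show the program's version number and exit
`-h, --help`	Show the help message and exit
`--no-wrap-links`	Do not wrap links during conversion
`--ignore-emphasis`	Do not include any formatting for emphasis
`--reference-links`	Use reference-style links instead of inline links
`--ignore-links`	Do not include any formatting for links
`--protect-links`	Protect links from line breaks by surrounding them with angle brackets
`--ignore-images`	Do not include any formatting for images
`--images-to-alt`	Discard image data and keep only the alt text
`--images-with-size`	Write image tags as raw HTML with height and width attributes to retain dimensions
`-g, --google-doc`	Convert a Google Doc that was exported as HTML
`-d, --dash-unordered-list`	Use a dash rather than an asterisk for unordered list items
`-e, --asterisk-emphasis`	Use an asterisk rather than an underscore for emphasized text
`-b BODY_WIDTH, --body-width=BODY_WIDTH`	Number of characters per output line; 0 disables wrapping
`-i LIST_INDENT, --google-list-indent=LIST_INDENT`	Number of pixels Google indents nested lists
`-s, --hide-strikethrough`	Hide strike-through text; only relevant when `-g` is also specified
`--escape-all`	Escape all special characters. The output is less readable but avoids corner-case formatting issues
`--bypass-tables`	Format tables in HTML rather than Markdown syntax
`--single-line-break`	Use a single line break after a block element rather than two. Note: requires `--body-width=0`
`--unicode-snob`	Use Unicode throughout the document
`--no-automatic-links`	Do not use automatic links wherever applicable
`--no-skip-internal-links`	Do not skip internal links
`--links-after-para`	Put links after each paragraph instead of at the end of the document
`--mark-code`	Mark code blocks with `[code]...[/code]`
`--decode-errors=DECODE_ERRORS`	How to handle decoding errors; accepted values are 'ignore', 'strict' and 'replace'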

For example:

# Pass a URL
html2text http://eepurl.com/cK06Gn

# Pass a file name and set the encoding to utf-8
html2text test.html utf-8

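Using html2text in a script

Besides running html2text from the command line, you can also import it as a library in a script.

We use the following HTML text as the example: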

html_content = """
<span ><a href="http://blog.yhat.com/posts/visualize-nba-pipelines.html" rel="external nofollow" target="_blank" >Data Wrangling 101: Using Python to Fetch, Manipulate &amp; Visualize NBA Data</a></span><br>
A tutorial using pandas and a few other packages to build a simple datapipe for getting NBA data. Even though this tutorial is done using NBA data, you don't need to be an NBA fan to follow along. The same concepts and techniques can be applied to any project of your choosing.<br>
"""

Converting the HTML text to Markdown takes a single call:

import html2text
print(html2text.html2text(html_content))

Output:

[Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA
Data](http://blog.yhat.com/posts/visualize-nba-pipelines.html)
A tutorial using pandas and a few other packages to build a simple datapipe
for getting NBA data. Even though this tutorial is done using NBA data, you
don't need to be an NBA fan to follow along. The same concepts and techniques
can be applied to any project of your choosing.

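The options listed above can also be used from a script, by setting them as attributes on an HTML2Text instance:

import html2text
h = html2text.HTML2Text()
print(h.handle(html_content))  # same output as above

Note: the examples below show only the output with a given option set. Unless stated otherwise, the output without that option (the default) is the same as above.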

--ignore-emphasis

With --ignore-emphasis set:

h.ignore_emphasis = True
print(h.handle("<p>hello, this is <em>Ele</em></p>"))

Output:

hello, this is Ele

Without --ignore-emphasis:

h.ignore_emphasis = False  # default
print(h.handle("<p>hello, this is <em>Ele</em></p>"))

Output:

hello, this is _Ele_

--reference-links

h.inline_links = False
print(h.handle(html_content))

Output:

[Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA
Data][16]
A tutorial using pandas and a few other packages to build a simple datapipe
for getting NBA data. Even though this tutorial is done using NBA data, you
don't need to be an NBA fan to follow along. The same concepts and techniques
can be applied to any project of your choosing.
[16]: http://blog.yhat.com/posts/visualize-nba-pipelines.html

--ignore-links

h.ignore_links = True
print(h.handle(html_content))

Output:

Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA Data
A tutorial using pandas and a few other packages to build a simple datapipe
for getting NBA data. Even though this tutorial is done using NBA data, you
don't need to be an NBA fan to follow along. The same concepts and techniques
can be applied to any project of your choosing.

--protect-links

h.protect_links = True
print(h.handle(html_content))

Output:

[Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA
Data](<http://blog.yhat.com/posts/visualize-nba-pipelines.html>)
A tutorial using pandas and a few other packages to build a simple datapipe
for getting NBA data. Even though this tutorial is done using NBA data, you
don't need to be an NBA fan to follow along. The same concepts and techniques
can be applied to any project of your choosing.

--ignore-images

h.ignore_images = True
print(h.handle('<p>This is a img: <img src="https://my.oschina.net/img/hot3.png"  alt="hot3"> ending ...</p>'))

Output:

This is a img: ending ...

--images-to-alt

h.images_to_alt = True
print(h.handle('<p>This is a img: <img src="https://my.oschina.net/img/hot3.png"  alt="hot3"> ending ...</p>'))

Output:

This is a img: hot3 ending ...

--images-with-size

h.images_with_size = True
print(h.handle('<p>This is a img: <img src="https://my.oschina.net/img/hot3.png" height=32px width=32px alt="hot3"> ending ...</p>'))

Output:

This is a img: <img src='https://my.oschina.net/img/hot3.png' width='32px'
height='32px' alt='hot3' /> ending ...

--body-width

h.body_width = 0
print(h.handle(html_content))

Output:

[Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA Data](http://blog.yhat.com/posts/visualize-nba-pipelines.html)
A tutorial using pandas and a few other packages to build a simple datapipe for getting NBA data. Even though this tutorial is done using NBA data, you don't need to be an NBA fan to follow along. The same concepts and techniques can be applied to any project of your choosing.

--mark-code

h.mark_code = True
print(h.handle('<pre class="hljs css"><code class="hljs css">&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-selector-tag"><span class="hljs-selector-tag">rpm</span></span>&nbsp;<span class="hljs-selector-tag"><span class="hljs-selector-tag">-Uvh</span></span>&nbsp;<span class="hljs-selector-tag"><span class="hljs-selector-tag">erlang-solutions-1</span></span><span class="hljs-selector-class"><span class="hljs-selector-class">.0-1</span></span><span class="hljs-selector-class"><span class="hljs-selector-class">.noarch</span></span><span class="hljs-selector-class"><span class="hljs-selector-class">.rpm</span></span></code></pre>'))

Output:

[code]
    rpm -Uvh erlang-solutions-1.0-1.noarch.rpm
[/code]

In this way, you can script a custom, automated HTML-to-Markdown workflow. The following script is an example:

# -*- coding: utf-8 -*-
# Written for Python 3, where the reload(sys)/setdefaultencoding workaround is not needed.
import re
import requests
from lxml import etree
import html2text


# Fetch the first (most recent) issue listed on the archive page
def get_first_issue(url):
  resp = requests.get(url)
  page = etree.HTML(resp.text)
  issue_list = page.xpath("//ul[@id='archive-list']/div[@class='display_archive']/li/a")
  fst_issue = dict(issue_list[0].attrib)  # copy the link's attributes (href, title, ...)
  fst_issue["text"] = issue_list[0].text
  return fst_issue


# Fetch the issue body and convert it to Markdown
def get_issue_md(url):
  resp = requests.get(url)
  page = etree.HTML(resp.text)
  # Alternative selector: '//table[@class="bodyTable"]'
  content = page.xpath("//table[@id='templateBody']")[0]
  h = html2text.HTML2Text()
  h.body_width = 0  # do not wrap lines
  # handle() expects str, so serialize the element as text rather than bytes
  return h.handle(etree.tostring(content, encoding='unicode'))

# Maps the newsletter's English section titles to Chinese Markdown headings
subtitle_mapping = {
  '**From Our Sponsor**': '# 來自贊助商',
  '**News**': '# 新聞',
  '**Articles**,** Tutorials and Talks**': '# 文章，教程和講座',
  '**Books**': '# 書籍',
  '**Interesting Projects, Tools and Libraries**': '# 好玩的項目，工具和庫',
  '**Python Jobs of the Week**': '# 本周的Python工作',
  '**New Releases**': '# 最新發布',
  '**Upcoming Events and Webinars**': '# 近期活動和網絡研討會',
}
def clean_issue(content):
  # Remove 'Share Python Weekly' and everything after it
  # (re.DOTALL lets '.' match newlines so the match runs to the end of the text)
  content = re.sub(r'\*\*Share Python Weekly.*', '', content, flags=re.IGNORECASE | re.DOTALL)
  # Rewrite the section titles as headings
  for k, v in subtitle_mapping.items():
    content = content.replace(k, v)
  return content

# Output template; "原文" means "original article"
tpl_str = """原文：[{title}]({url})
---
{content}
"""
def run():
  issue_list_url = "https://us2.campaign-archive.com/home/?u=e2e180baf855ac797ef407fc7&id=9e26887fc5"
  print("Fetching the latest issue...")
  fst = get_first_issue(issue_list_url)
  #fst = {'href': 'http://eepurl.com/dqpDyL', 'title': 'Python Weekly - Issue 341'}
  print("Done. Extracting the issue content and converting it to Markdown")
  content = get_issue_md(fst['href'])
  print("Cleaning up the issue content")
  content = clean_issue(content)

  print("Cleanup done; writing", fst['title'], "to a file")
  title = fst['title'].replace('- ', '').replace(' ', '_')
  # Open in text mode with an explicit encoding, since the content is str
  with open(title.strip()+'.md', "w", encoding="utf-8") as f:
    f.write(tpl_str.format(title=fst['title'], url=fst['href'], content=content))
  print("All done. File saved to %s.md" % title)

if __name__ == '__main__':
  run()
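
The cleanup step can be tried without any network access. The sketch below runs the same two operations (cutting the trailing share section, then rewriting section titles) on a made-up Markdown sample. The sample text, `heading_map` and `clean` are hypothetical stand-ins, and the pattern uses `re.DOTALL` on the assumption that everything after the share marker should be dropped.

```python
import re

# Hypothetical converted Markdown, standing in for a real newsletter issue.
sample_md = """**News**

Some item.

**Books**

A book.

**Share Python Weekly**

Footer links.
"""

# Illustrative mapping from bold section titles to Markdown headings.
heading_map = {"**News**": "# News", "**Books**": "# Books"}


def clean(content, mapping):
    """Drop the trailing share section, then turn section titles into headings."""
    # With re.DOTALL, '.' also matches newlines, so the match extends to the end.
    content = re.sub(r"\*\*Share Python Weekly.*", "", content,
                     flags=re.IGNORECASE | re.DOTALL)
    for old, new in mapping.items():
        content = content.replace(old, new)
    return content


cleaned = clean(sample_md, heading_map)
print(cleaned)
```

Exact string replacement only works if the title appears in the converted text character for character, which is one reason the script sets `body_width` to 0 before converting.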
