This article explains, step by step, how to deal with font encryption when writing a Python crawler.
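The problem
The phone number is displayed normally on the page.
In the F12 developer tools, however, it looks different, which is a nuisance: the number cannot be read directly.
Fetching the page with the requests library does not return the source we want either, because the font is encrypted.
Viewing the page source shows the same thing, so the question is how to decrypt it.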
Get the real page source
Find the corresponding font file
Parse it
Why webdriver? Because requests cannot get the real page source.
from selenium import webdriver

# --- Chrome configuration
options = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2}  # do not load images
options.add_experimental_option("prefs", prefs)
options.add_argument('--ignore-certificate-errors')  # ignore SSL certificate errors
options.binary_location = r'C:\Program Files\Google\Chrome\Application\chrome.exe'
options.add_argument('--incognito')  # incognito mode
driver = webdriver.Chrome(options=options)
driver.set_page_load_timeout(5)

# --- Window size and position
driver.maximize_window()

# --- Hide the navigator.webdriver flag from detection scripts
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": """
    Object.defineProperty(navigator, 'webdriver', {
        get: () => undefined
    })
"""})
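Find the corresponding font file
The page source declares the font inline as base64, so the next step is to take that string, decode it, and write it out as a file.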
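Before writing the file, it can be worth checking that the decoded payload really is a font. The sketch below decodes only the first 16 characters of the base64 string used in this article and unpacks the 12-byte sfnt header that a TrueType file starts with. The names `B64_PREFIX` and `read_sfnt_header` are made up for illustration.

```python
import base64
import struct

# First 16 characters of the base64 payload shown in this article.
# 16 base64 characters decode to exactly 12 bytes, the size of the sfnt header.
B64_PREFIX = "AAEAAAAKAIAAAwAg"


def read_sfnt_header(b64_prefix: str) -> dict:
    """Decode the sfnt header from the start of a base64-encoded font."""
    raw = base64.b64decode(b64_prefix)
    # Big-endian: uint32 version, then four uint16 fields.
    version, num_tables, search_range, entry_selector, range_shift = struct.unpack(
        ">IHHHH", raw[:12]
    )
    return {
        "is_truetype": version == 0x00010000,  # TrueType outlines use version 1.0
        "num_tables": num_tables,
        "search_range": search_range,
        "entry_selector": entry_selector,
        "range_shift": range_shift,
    }


if __name__ == "__main__":
    print(read_sfnt_header(B64_PREFIX))
```

For the payload in this article, the header reports a TrueType font containing 10 tables.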
# Example
import base64

from fontTools.ttLib import TTFont

# The payload is very long and is abbreviated here; use the full base64 string from the page
b64_code = 'AAEAAAAKAIAAAwAgT1MvMla19RMAAACsAAAAYGNtYXAGQAPOAAABDAAAAa5nbHlm...'

# Decode the base64 string and write it out as a font file
with open('font.ttf', 'wb') as f:
    f.write(base64.decodebytes(b64_code.encode()))

# Load the font and dump it to XML so that it can be inspected
font = TTFont('font.ttf')
font.saveXML('font.xml')
# A simple wrapper
import base64
import re


def w_tff(one_html):
    # Extract the base64 font payload from the page source
    res_tff = re.findall(r';base64,(.*?)"', one_html, re.S)
    if res_tff and len(res_tff) == 1:
        new_res_ttf = res_tff[0]
        # Decode it and save it as a .ttf file
        with open('123_new_ttf.ttf', 'wb') as f:
            f.write(base64.decodebytes(new_res_ttf.encode()))
Read the file and find the mapping inside it: how each of these digits is drawn is stored in the .ttf file.
from fontTools.ttLib import TTFont

# Glyph names used in the font and the characters they stand for
GLYPH_TO_CHAR = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
    "hyphen": "-",
}


def get_num_phone(es_str: str):
    # Load the font and build the mapping
    path = '123_new_ttf.ttf'
    font = TTFont(path)
    # font.saveXML('font.xml')  # dump to an XML file
    # Code point -> glyph name
    bestcmap = font.getBestCmap()
    ss = {}
    for key, value in bestcmap.items():
        keys = hex(key).replace('0x', '')  # decimal to hexadecimal
        ss[keys] = GLYPH_TO_CHAR.get(value, value)

    list_phone = ""
    try:
        # es_str is a run of entities such as '&#x8826f;&#x88271;'
        for item in es_str.split(";"):
            if item:
                new_item = item.replace("&#x", "")
                list_phone += str(ss[new_item])
    except KeyError:
        # A code that is not in the font's mapping
        return None
    if not list_phone or len(list_phone) < 2:
        return None
    return list_phone
<cmap>
  <tableVersion version="0"/>
  <cmap_format_4 platformID="0" platEncID="3" language="0">
    <map code="0x23" name="numbersign"/><!-- NUMBER SIGN -->
    <map code="0x2a" name="asterisk"/><!-- ASTERISK -->
    <map code="0x2b" name="plus"/><!-- PLUS SIGN -->
    <map code="0x2d" name="hyphen"/><!-- HYPHEN-MINUS -->
    <map code="0x2f" name="slash"/><!-- SOLIDUS -->
  </cmap_format_4>
  <cmap_format_0 platformID="1" platEncID="0" language="0">
    <map code="0x23" name="numbersign"/>
    <map code="0x2a" name="asterisk"/>
    <map code="0x2b" name="plus"/>
    <map code="0x2d" name="hyphen"/>
    <map code="0x2f" name="slash"/>
  </cmap_format_0>
  <cmap_format_4 platformID="3" platEncID="1" language="0">
    <map code="0x23" name="numbersign"/><!-- NUMBER SIGN -->
    <map code="0x2a" name="asterisk"/><!-- ASTERISK -->
    <map code="0x2b" name="plus"/><!-- PLUS SIGN -->
    <map code="0x2d" name="hyphen"/><!-- HYPHEN-MINUS -->
    <map code="0x2f" name="slash"/><!-- SOLIDUS -->
  </cmap_format_4>
  <cmap_format_12 platformID="3" platEncID="10" format="12" reserved="0" length="76" language="0" nGroups="5">
    <map code="0x23" name="numbersign"/><!-- NUMBER SIGN -->
    <map code="0x2a" name="asterisk"/><!-- ASTERISK -->
    <map code="0x2b" name="plus"/><!-- PLUS SIGN -->
    <map code="0x2d" name="hyphen"/><!-- HYPHEN-MINUS -->
    <map code="0x2f" name="slash"/><!-- SOLIDUS -->
    <map code="0x880fb" name="zero"/><!-- ???? -->
    <map code="0x880fc" name="one"/><!-- ???? -->
    <map code="0x880fd" name="two"/><!-- ???? -->
    <map code="0x880fe" name="three"/><!-- ???? -->
    <map code="0x880ff" name="four"/><!-- ???? -->
    <map code="0x88100" name="five"/><!-- ???? -->
    <map code="0x88101" name="six"/><!-- ???? -->
    <map code="0x88102" name="seven"/><!-- ???? -->
    <map code="0x88103" name="eight"/><!-- ???? -->
    <map code="0x88104" name="nine"/><!-- ???? -->
  </cmap_format_12>
</cmap>
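Read the .ttf file (and dump it to an XML file; this is only needed the first time, when working out the mapping).
font.getBestCmap() returns the mapping table.
Looking at the cmap section of the XML file, we can see exactly the result we need.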
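Once font.xml has been dumped, the same mapping can also be read back from it with the standard library alone. This is a minimal sketch that uses a shortened copy of the cmap dump above; `cmap_from_xml` is a hypothetical helper name, not part of fontTools.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the <cmap> section from font.xml.
CMAP_XML = """
<cmap>
  <tableVersion version="0"/>
  <cmap_format_12 platformID="3" platEncID="10" format="12" reserved="0" length="76" language="0" nGroups="5">
    <map code="0x2d" name="hyphen"/>
    <map code="0x880fb" name="zero"/>
    <map code="0x880fc" name="one"/>
    <map code="0x88104" name="nine"/>
  </cmap_format_12>
</cmap>
"""


def cmap_from_xml(xml_text: str) -> dict:
    """Collect every <map code=... name=...> entry as {code point: glyph name}."""
    root = ET.fromstring(xml_text.strip())
    mapping = {}
    for node in root.iter("map"):
        # The code attribute is a hex literal such as '0x880fb'.
        mapping[int(node.get("code"), 16)] = node.get("name")
    return mapping


if __name__ == "__main__":
    for code, name in sorted(cmap_from_xml(CMAP_XML).items()):
        print(hex(code), name)
```

Entries repeated across several sub-tables simply overwrite each other here, which is harmless as long as they agree, as they do in the dump above.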
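keys = hex(key).replace('0x', '').replace("&#x", "")
converts each decimal code point to hexadecimal, which gives the mapping table {'23': 'numbersign', '2a': 'asterisk', '2b': 'plus', '2d': '-', '2f': 'slash', '8826e': 0, '8826f': 1, '88270': 2, '88271': 3, '88272': 4, '88273': 5, '88274': 6, '88275': 7, '88276': 8, '88277': 9}
(The code points here differ from those in the XML dump above, presumably because the two were taken from different font files.)
Then match this table, one character at a time, against the values taken from the page.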
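The matching step can be tried without a font file by starting from the mapping table printed above. A minimal sketch: the entity strings passed in are made-up inputs, and `decode_entities` is an illustrative name rather than the function used in this article.

```python
import re

# Mapping table as printed above (hex code -> character), digits and hyphen only.
CODE_TO_CHAR = {
    "2d": "-",
    "8826e": 0, "8826f": 1, "88270": 2, "88271": 3, "88272": 4,
    "88273": 5, "88274": 6, "88275": 7, "88276": 8, "88277": 9,
}


def decode_entities(raw: str):
    """Translate a run of '&#x...;' entities into the characters they display."""
    codes = re.findall(r"&#x([0-9a-fA-F]+);", raw)
    if not codes:
        return None
    try:
        return "".join(str(CODE_TO_CHAR[code.lower()]) for code in codes)
    except KeyError:
        # A code that the current font does not define.
        return None


if __name__ == "__main__":
    print(decode_entities("&#x8826f;&#x88271;&#x88276;"))
```

Returning None for an unknown code keeps a stale mapping from silently producing a wrong number.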
The page source returned by webdriver can be slightly off, so I process it with soup_text = bs4.BeautifulSoup(driver.page_source, 'lxml').text (after import bs4). Everything else is minor detail; this is the basic approach.
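One detail to watch for: text extracted through an HTML parser normally has references such as &#x8826e; already resolved into the characters themselves, so there is no longer anything to split on ";". In that case the lookup can go through ord() instead. The sketch below uses html.unescape from the standard library as a stand-in for the parser, together with the code points from the mapping table above; `decode_text` and the sample string are assumptions for illustration.

```python
import html

# Code points from the mapping table above: 0x8826e is "0" ... 0x88277 is "9".
CODE_TO_DIGIT = {0x8826e + i: str(i) for i in range(10)}


def decode_text(text: str) -> str:
    """Replace every obfuscated character by its digit and keep the rest as-is."""
    return "".join(CODE_TO_DIGIT.get(ord(ch), ch) for ch in text)


if __name__ == "__main__":
    # Made-up fragment of page source; unescape turns each entity into one character.
    resolved = html.unescape("Tel: &#x8826f;&#x88271;&#x88276;")
    print(decode_text(resolved))
```

Characters that are not in the table pass through unchanged, so the function can be run over a whole block of text.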