How to Use the Python Standard Library to Fetch and Rework Search Engine Results

Published: 2021-08-25 15:33:27 | Source: Yisu Cloud | Views: 159 | Author: chen | Category: Programming Languages

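This article explains how to use the Python standard library to fetch a search engine's results and rework them into a page of your own. The steps are short and easy to follow, so let's work through them in order.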

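The keyword I type is passed as a URL parameter to a program on the search engine's server, and that program returns a page with three parts: a top section (logo and search UI), a results section, and a bottom section (copyright notice). What we want is the results section in the middle. The urlopen function in the Python standard library's urllib module returns the whole page as a string. By parsing that string we can extract the results section, and once the extracted string is wrapped in our own header and footer, a rough "search thief" is in place. Here is a first test script.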

[code]
# Search Thief (Python 2)
# creator: Singo
# date: 2007-8-24
import urllib
import re

class SearchThief:
    """the google thief"""
    # bind the two names below at module level so the methods can read them
    global path, targetURL
    path = "pages\\"
    # targetURL = "http://www.google.cn/search?complete=1&hl=zh-CN&q="
    targetURL = "http://www.baidu.com/s?wd="

    def __init__(self, key):
        self.key = key

    def getPage(self):
        webStr = urllib.urlopen(targetURL + self.key).read()  # get the page string from the url
        self.setPageToFile(webStr)

    def setPageToFile(self, webStr):
        reSetStr = re.compile("\r")
        self.key = reSetStr.sub("", self.key)  # strip "\r" from the keyword
        targetFile = file(path + self.key + ".html", "w")  # open the file for writing
        targetFile.write(webStr)
        targetFile.close()
        print "done"

inputKey = raw_input("Enter the keyword to search for --> ")
obj = SearchThief(inputKey)
obj.getPage()
[/code]

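This script only asks the user for a keyword, submits the request to the search engine, and saves the returned page to a directory. It is just a test. A real search thief does not need to save the page at all: it can insert the extracted string into a template prepared in advance and serve the result to the client as a web page, presenting a search engine's results inside a newly built page.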
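The test script appends the raw keyword to the URL and reuses it as a file name, which can go wrong for keywords that contain spaces or non-ASCII characters. The following is a minimal Python 3 sketch of one way to handle both, assuming the same Baidu endpoint and `wd` parameter as the script; `build_search_url` and `safe_file_name` are hypothetical helper names, not part of the original program.

```python
from urllib.parse import urlencode

TARGET_URL = "http://www.baidu.com/s"  # same endpoint as the test script


def build_search_url(keyword, base_url=TARGET_URL, param="wd"):
    """Return the search URL with the keyword percent-encoded."""
    return base_url + "?" + urlencode({param: keyword})


def safe_file_name(keyword):
    """Hypothetical helper: keep letters, digits, '-' and '_' only."""
    cleaned = "".join(
        ch if ch.isalnum() or ch in "-_" else "_" for ch in keyword.strip()
    )
    return (cleaned or "empty") + ".html"


print(build_search_url("python urllib"))   # http://www.baidu.com/s?wd=python+urllib
print(safe_file_name("python urllib\r"))   # python_urllib.html
```

Neither function touches the network, so both can be checked before any request is sent.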

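Look at the source of a Baidu results page: just before the table tag that holds the search results there is a <DIV id=Div> </DIV> tag. The result set sits two lines below it, so we add a method that uses this tag as a marker.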

[b]getResultStr():[/b]
[code]
def getResultStr(self, webStr):
    webStrList = webStr.read().split("\r\n")
    # the result set is two lines below "<DIV id=Div> </DIV>"
    line = webStrList.index("<DIV id=Div> </DIV>") + 2
    resultStr = webStrList[line]
    return resultStr
[/code]
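The marker lookup can be tried offline on a small string. Below is a Python 3 sketch of the same idea, assuming the marker text quoted in the article and a two-line offset; `extract_result` is a hypothetical standalone function. It returns `None` instead of raising an error when the marker is missing, which is what would happen if the page layout changed.

```python
MARKER = "<DIV id=Div> </DIV>"  # marker quoted in the article; a live page may differ


def extract_result(page_text, marker=MARKER, offset=2):
    """Return the line `offset` lines below the marker, or None if not found."""
    lines = [line.strip() for line in page_text.splitlines()]
    try:
        position = lines.index(marker) + offset
    except ValueError:
        return None  # marker missing: the page layout is not what we expect
    if position >= len(lines):
        return None  # marker found, but the page ends too early
    return lines[position]


sample = "<body>\r\n<DIV id=Div> </DIV>\r\n<br>\r\n<table>results</table>\r\n</body>"
print(extract_result(sample))           # <table>results</table>
print(extract_result("<html></html>"))  # None
```

Relying on a fixed line offset is fragile by design: it only works while the upstream markup stays exactly the same.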

Now that we have the result list, we need to place it in a page of our own design. We call this page the template:

[code]
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
<title> SuperSingo Search - %title% </title>
<link href="default/css/global.css" type=text/css rel=stylesheet>
</head>
<body>
<div id="top">
<div id="logo"> <img src="default/images/logo.jpg" /> </div>
<div id="searchUI">
<input type="text" style="width:300px;" />
<input type="submit" value="Search" />
</div>
<div class="clear"/>
</div>
<div id="result_info">
Found &times;&times;&times; records in &times;&times;&times; seconds
</div>
<div id="result"> %result% </div>
<div id="foot">
[/code]

The search results here all come from Baidu. %title% and %result% are placeholders waiting to be replaced, so we add one more method to replace them:

[b]reCreatePage():[/b]
[code]
def reCreatePage(self, resultStr):
    demoStr = urllib.urlopen(demoPage).read()  # get the demo page string
    reTitle = re.compile("%title%")
    # a function replacement inserts the text literally, even if it contains backslashes
    demoStr = reTitle.sub(lambda m: self.key, demoStr)  # set the page title
    reResult = re.compile("%result%")
    demoStr = reResult.sub(lambda m: resultStr, demoStr)  # set the page result
    return demoStr
[/code]

This replaces %title% and %result% in the template with the content we want.
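Because %title% and %result% are fixed strings, plain string replacement is enough, and it avoids the special meaning that `re.sub` gives to backslashes in a replacement string. The Python 3 sketch below illustrates this under one added assumption: the keyword is untrusted user input and is HTML-escaped before it goes into the title. `fill_template` is a hypothetical name.

```python
from html import escape


def fill_template(template, title, result_html):
    """Fill the %title% and %result% placeholders by plain string replacement."""
    # The keyword comes from the user, so escape it before placing it in HTML.
    page = template.replace("%title%", escape(title))
    # The result fragment is already HTML and is inserted unchanged.
    return page.replace("%result%", result_html)


template = "<title>%title%</title><div>%result%</div>"
print(fill_template(template, "a&b", r"<p>C:\temp</p>"))
# <title>a&amp;b</title><div><p>C:\temp</p></div>
```

The backslash in the sample result survives unchanged, which is the behavior wanted when the inserted fragment is arbitrary HTML.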

The complete program:

[code]
# the main programme (Python 2)
# creator: Singo
# date: 2007-8-24
import urllib
import re

class SearchThief:
    """the google thief"""
    # bind the three names below at module level so the methods can read them
    global path, targetURL, demoPage
    path = "pages\\"
    # targetURL = "http://www.google.cn/search?complete=1&hl=zh-CN&q="
    targetURL = "http://www.baidu.com/s?wd="
    demoPage = path + "__demo__.html"

    def __init__(self, key):
        self.key = key

    def getPage(self):
        webStr = urllib.urlopen(targetURL + self.key)  # open the results page for the keyword
        webStr = self.getResultStr(webStr)  # get the result part
        webStr = self.reCreatePage(webStr)  # build a new page from the template
        self.setPageToFile(webStr)

    def getResultStr(self, webStr):
        webStrList = webStr.read().split("\r\n")
        # the result set is two lines below "<DIV id=Div> </DIV>"
        line = webStrList.index("<DIV id=Div> </DIV>") + 2
        resultStr = webStrList[line]
        return resultStr

    def reCreatePage(self, resultStr):
        demoStr = urllib.urlopen(demoPage).read()  # get the demo page string
        reTitle = re.compile("%title%")
        # a function replacement inserts the text literally, even if it contains backslashes
        demoStr = reTitle.sub(lambda m: self.key, demoStr)  # set the page title
        reResult = re.compile("%result%")
        demoStr = reResult.sub(lambda m: resultStr, demoStr)  # set the page result
        return demoStr

    def setPageToFile(self, webStr):
        reSetStr = re.compile("\r")
        self.key = reSetStr.sub("", self.key)  # strip "\r" from the keyword
        targetFile = file(path + self.key + ".html", "w")  # open the file for writing
        targetFile.write(webStr)
        targetFile.close()
        print "done"

inputKey = raw_input("Enter the keyword to search for --> ")
obj = SearchThief(inputKey)
obj.getPage()
[/code]

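This gives us a page in our own style that contains Baidu's search results. Only the title and the result set are replaced here. In the same way we could replace the "百度快照" (Baidu Snapshot) links and regenerate the pagination control, at which point the search thief is essentially complete.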
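The article does not show how the pagination control would be regenerated. One possible shape, as a Python 3 sketch: build the links for our own page from the keyword and a page number. The `/search` path and the `q` and `page` parameter names are hypothetical choices for our own site, not Baidu's parameters, and `build_pager` is an illustrative name.

```python
from html import escape
from urllib.parse import urlencode


def build_pager(keyword, current_page, page_count, base_url="/search"):
    """Return HTML links for pages 1..page_count; the current page is not a link."""
    parts = []
    for page in range(1, page_count + 1):
        if page == current_page:
            parts.append("<strong>%d</strong>" % page)
        else:
            query = urlencode({"q": keyword, "page": page})
            # escape() turns "&" into "&amp;" so the href is valid inside HTML
            href = escape(base_url + "?" + query, quote=True)
            parts.append('<a href="%s">%d</a>' % (href, page))
    return " ".join(parts)


print(build_pager("python", 2, 3))
# <a href="/search?q=python&amp;page=1">1</a> <strong>2</strong> <a href="/search?q=python&amp;page=3">3</a>
```

The returned string could be dropped into the template through an extra placeholder, in the same way as %result%.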

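When the same request is sent to Google with the Python standard library, Google returns a page we do not want: it says we are not authorized to access the content. Google anticipated this kind of use, whereas Baidu imposed no such restriction when this code was written, so fetching Baidu's data succeeds. The same approach can be tried on other search engines, such as yisou and soso.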
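The article does not say why Google refuses the request. One possibility, which is an assumption and not something the source confirms, is that the server inspects request headers such as User-Agent. The Python 3 sketch below only shows how the standard library attaches an explicit header to a request; it does not send anything, and the `SearchThiefDemo/0.1` value and `build_request` name are made up for illustration. Whether a site permits automated requests is a matter for its terms of use.

```python
from urllib.request import Request


def build_request(url, user_agent="SearchThiefDemo/0.1"):
    """Build (but do not send) a request that carries an explicit User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("http://www.baidu.com/s?wd=python")
print(req.full_url)                  # http://www.baidu.com/s?wd=python
print(req.get_header("User-agent"))  # SearchThiefDemo/0.1
```

Note that urllib stores header names in capitalized form, which is why the lookup uses "User-agent". Passing `req` to `urllib.request.urlopen` would perform the actual fetch.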

Thanks for reading. That covers how to use the Python standard library to fetch and rework search engine results. How well the approach works in practice is something you will need to verify yourself.
