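This article shows how to use Python crawler proxies to increase a website's traffic. The full example code is included below and may be useful as a reference.
Once you have a list of free proxies, there is a lot you can do with it: for example, crawl a site without the risk of your own IP being banned, or increase a site's traffic.
Full code: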
#coding:utf-8
# Python 2 script: relies on urllib2, cookielib and print statements.
import urllib2
import cookielib
import json
import random
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


class Spide:
    def __init__(self, proxy_ip, proxy_type, proxy_port, use_proxy=False):
        print 'using the proxy info :', proxy_ip
        self.proxy_ip = proxy_ip
        self.proxy_type = proxy_type
        self.proxy_port = proxy_port
        self.proxy = urllib2.ProxyHandler({proxy_type: proxy_ip + ":" + proxy_port})
        self.usercode = ""
        self.userid = ""
        self.cj = cookielib.LWPCookieJar()
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cj))
        if use_proxy:
            self.opener = urllib2.build_opener(self.proxy)
        urllib2.install_opener(self.opener)

    def add_view(self):
        print '--->start adding view'
        print '--->proxy info', self.proxy_ip
        # Route PhantomJS through the same proxy.
        service_args = [
            '--proxy=' + self.proxy_ip + ':' + self.proxy_port,
            '--proxy-type=' + self.proxy_type,
        ]
        dcap = dict(DesiredCapabilities.PHANTOMJS)
        dcap["phantomjs.page.settings.userAgent"] = (
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
            "(KHTML, like Gecko) Chrome/15.0.87"
        )
        driver = webdriver.PhantomJS(executable_path='/home/bin/phantomjs',
                                     service_args=service_args,
                                     desired_capabilities=dcap)
        try:
            driver.set_page_load_timeout(90)
            driver.get("http://www.503error.com/")
            soup = BeautifulSoup(driver.page_source, 'xml')
            titles = soup.find_all('h2', {'class': 'entry-title'})
            # randint includes both ends, so the upper bound must be len - 1.
            ranCount = random.randint(0, len(titles) - 1)
            print 'random find a link of the website to access , random is :', ranCount
            randomlink = titles[ranCount].a.attrs['href']
            driver.get(randomlink)
        finally:
            # quit() also stops the phantomjs process; close() only closes the window.
            driver.quit()
        print 'finish once'

    def get_proxy(self):
        proxy_info_json = ""
        #first get the proxy info from
        print '-->using the ip' + self.proxy_ip + 'to get the proxyinfo'
        try:
            reqRequest_proxy = urllib2.Request('url2')
            reqRequest_proxy.add_header('Accept', '*/*')
            reqRequest_proxy.add_header('Accept-Language', 'zh-CN,zh;q=0.8')
            reqRequest_proxy.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36')
            reqRequest_proxy.add_header('Content-Type', 'application/x-www-form-urlencoded')
            proxy_info = urllib2.urlopen(reqRequest_proxy).read()
            print proxy_info
            proxy_info_json = json.loads(proxy_info)
            # Unused below; kept from the original, with the missing ":" before the port added.
            return_str = proxy_info_json['protocol'] + ":" + proxy_info_json['ip'] + ":" + proxy_info_json['port']
        except Exception as e:
            print 'proxy have problem', e
        return proxy_info_json

    def get_proxys100(self):
        #first get the proxy info from
        print '-->using the ip' + self.proxy_ip + 'to get the proxyinfo100'
        try:
            reqRequest_proxy = urllib2.Request('url1')
            reqRequest_proxy.add_header('Accept', '*/*')
            reqRequest_proxy.add_header('Accept-Language', 'zh-CN,zh;q=0.8')
            reqRequest_proxy.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36')
            reqRequest_proxy.add_header('Content-Type', 'application/x-www-form-urlencoded')
            proxy_info = urllib2.urlopen(reqRequest_proxy).read()
            return json.loads(proxy_info)
        except Exception as e:
            print 'proxy have problem', e
            # Return an empty list so the caller's for-loop simply does nothing.
            return []


if __name__ == "__main__":
    # The first time, get the proxy info.
    print 'START ADDING VIEW:'
    print 'Getting the new proxy info First time'
    print '---------------------------------------------------------------------------------------------------------'
    for count in range(1):
        test = Spide(proxy_ip='youproxyip', proxy_type='http', proxy_port='3128', use_proxy=False)
        proxy_list = test.get_proxy()
        print '->this is the :', count
        print '->Getting the new proxy info:'
        print '->using the proxy to get proxy list in case forbidden'
        print '->proxy info', proxy_list
        proxy100 = test.get_proxys100()
        for proxy1 in proxy100:
            try:
                print 'proxy1:', proxy1
                Spide1 = Spide(proxy_ip=proxy1['ip'], proxy_type=proxy1['type'],
                               proxy_port=proxy1['port'], use_proxy=True)
                print 'before add view'
                Spide1.add_view()
                print '->sleep 15 s'
                time.sleep(15)
                # Then sleep a random extra amount of time.
                ranTime = random.randint(10, 50)
                print '->sleep random time:', ranTime
                time.sleep(ranTime)
                print '-> getting new proxy '
                #proxy_list = Spide1.get_proxy()
            except Exception as e:
                print '->something wrong ,hahah ,next', e
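A few notes:
The overall flow is: 1 get proxies -> 2 visit the home page -> 3 read the list of blog posts on the home page and visit one at random -> 4 wait a random N seconds -> go back to step 1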
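The flow above can be sketched as a small loop that is independent of the browser and the proxy source. This is only an illustration written for Python 3, not the article's code: `run_cycle`, `pick_random_link` and the `visit` callback are hypothetical names, and the 10 to 50 second wait range is taken from the listing's `random.randint(10,50)`.

```python
import random
import time


def pick_random_link(links, rng=random):
    """Return one link chosen uniformly at random, or None if the list is empty."""
    if not links:
        return None
    # choice() avoids the off-by-one that randint(0, len(links)) would cause.
    return rng.choice(links)


def run_cycle(proxies, visit, min_wait=10, max_wait=50, sleep=time.sleep, rng=random):
    """Call visit(proxy) for each proxy, pausing a random time after each success.

    A proxy whose visit raises is skipped without waiting.
    Returns the number of successful visits.
    """
    visited = 0
    for proxy in proxies:
        try:
            visit(proxy)
        except Exception as exc:
            print('skipping proxy %r: %s' % (proxy, exc))
            continue
        visited += 1
        sleep(rng.randint(min_wait, max_wait))
    return visited
```

Passing `sleep` and `visit` in as parameters is a design choice for this sketch: it lets the loop be exercised with stand-ins instead of real waits and real page loads.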
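1: Change youproxyip to a proxy IP you already have. You can also leave it unset, because use_proxy=False is passed further down; in that case, make sure you can reach the two proxy-listing sites used in the code without going through a proxy.
2: /home/bin/phantomjs is the path where phantomjs is installed; change it to match your own installation.
3: The code contains two methods for getting proxies, and the example uses one of them (please don't flame me for looping when the loop at the bottom runs only once; an earlier version had a real outer loop).
4: The addresses for getting free proxies are not given here; url1 and url2 stand in for the hidden free-proxy sites.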
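Because url1 and url2 are hidden, the exact response format is unknown. The main loop reads `proxy1['ip']`, `proxy1['type']` and `proxy1['port']`, so the sketch below assumes a JSON array of objects with those three keys; that shape is an assumption, and `parse_proxy_list` is a hypothetical helper written for Python 3, not part of the original script.

```python
import json


def parse_proxy_list(payload):
    """Parse a JSON array of proxy records, skipping incomplete entries.

    Assumed record shape (not confirmed by the article):
        {"ip": "...", "type": "http", "port": "3128"}
    """
    records = json.loads(payload)
    if not isinstance(records, list):
        return []
    proxies = []
    for rec in records:
        try:
            ip, ptype, port = rec['ip'], rec['type'], str(rec['port'])
        except (KeyError, TypeError):
            # Missing field or not an object: ignore this record.
            continue
        proxies.append({
            'ip': ip,
            'type': ptype,
            'port': port,  # kept as a string, since the script concatenates it
            'address': '%s:%s' % (ip, port),
        })
    return proxies
```

Validating the records up front means a malformed entry is dropped before a browser is launched for it, rather than failing inside the visit loop.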
That is everything on how to use Python crawler proxies to increase website traffic. Thanks for reading; I hope it was helpful.