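How can Java with Spring and MyBatis be used to crawl the funny GIFs on Toutiao (今日頭條)? Many people without crawler experience get stuck on this, so this article walks through the problem and one way to solve it.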
[Screenshot: the crawled GIFs]
[Screenshot: the database]
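Toutiao is itself essentially a crawler: it scrapes images and text from major sites, aggregates them, and pushes the result to users. The GIFs in particular are a lot of fun. A search online turned up mostly Python implementations. I work on the Java web side and am not very comfortable with regular expressions, so I wanted to see whether I could write a crawler in a way I know better. This crawler is built on Spring and MyBatis and stores the crawled data in MySQL. It uses jsoup to work with HTML tag nodes (neatly avoiding regular expressions) to extract the GIF links from each article, decides the image format from the value of the "Content-Type" response header, and then saves the image locally. It can also crawl the text, such as jokes, with only small changes. This crawler just offers an introductory approach; more fun crawler ideas are left for you to explore.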
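The "Content-Type" step can be isolated into a small helper. The following is only a sketch: the class name ImageTypes and the method extensionFor are made up for illustration and are not part of the project. It assumes the header may be missing or may carry extra parameters after a semicolon.

```java
import java.util.Locale;

// Hypothetical helper, not part of the original project.
class ImageTypes {

    /**
     * Maps a Content-Type header value to a file extension.
     * Returns null when the header is missing or is not one of the
     * three image types the crawler saves.
     */
    static String extensionFor(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Drop optional parameters such as "; charset=..." and normalise case.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        switch (mime) {
            case "image/gif":
                return ".gif";
            case "image/png":
                return ".png";
            case "image/jpeg":
                return ".jpg";
            default:
                return null;
        }
    }
}
```

A caller would skip the download whenever extensionFor returns null, which also covers servers that send no Content-Type header at all.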
Core language: Java;
Core framework: Spring;
Persistence framework: MyBatis;
Database connection pool: Alibaba Druid;
Logging: Log4j;
Dependency (jar) management: Maven; and so on.
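Open the Toutiao home page, click the Funny section, and press F12. Scroll down to load the next page and you will see that the data is fetched by an ajax request to an API, as shown in the figure below:
This is the JSON response; the names of its fields and values are self-explanatory.
Since it is an ajax request, it is easy to handle. After a lot of searching on Baidu and Google, I found that the first three parameters of the ajax request never change. The category parameter selects the section; this example requests the Funny section, so its value is funny. The max_behot_time and max_behot_time_tmp parameters are timestamps: they are 0 on the first request, and afterwards they take the value found under next in the response JSON. The as and cp values are generated by a piece of JavaScript and are really nothing more than an obfuscated timestamp. The JS code is listed further down.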
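To make the parameter layout concrete, here is a minimal sketch of how the request URL is put together. The class FeedUrl and its build method are hypothetical names used only for this illustration; the base address and parameter names are the ones described above.

```java
// Hypothetical helper that only assembles the feed URL; it performs no request.
class FeedUrl {

    // Fixed part of the API address, as used in this article.
    static final String FEED =
            "http://www.toutiao.com/api/pc/feed/?utm_source=toutiao&widen=1";

    /**
     * @param category     section to request, for example "funny"
     * @param maxBehotTime 0 for the first request, afterwards the value
     *                     taken from "next" in the previous response
     * @param as           value produced by the js code
     * @param cp           value produced by the js code
     */
    static String build(String category, long maxBehotTime, String as, String cp) {
        return FEED
                + "&max_behot_time=" + maxBehotTime
                + "&max_behot_time_tmp=" + maxBehotTime
                + "&as=" + as
                + "&cp=" + cp
                + "&category=" + category;
    }
}
```

Both timestamp parameters receive the same value, so a single argument is enough.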
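Once the project is set up, it has the file structure shown in the figure below. If any of it is unfamiliar, a quick search will help.
Without further ado, here is the core code: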
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Date;
import java.util.UUID;

import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;

// Funny (entity) and FunnyMapper (MyBatis mapper) are this project's own classes;
// import them from wherever they live in your project.

public class TouTiaoCrawler {
    // API address of the Funny section
    public static final String FUNNY = "http://www.toutiao.com/api/pc/feed/?utm_source=toutiao&widen=1";
    // Toutiao home page
    public static final String TOUTIAO = "http://www.toutiao.com";
    // Create the Spring context from the "spring-mybatis.xml" configuration file
    static ApplicationContext ac = new ClassPathXmlApplicationContext("spring-mybatis.xml");
    // Fetch the funnyMapper bean from the Spring container by its id
    static FunnyMapper funnyMapper = (FunnyMapper) ac.getBean("funnyMapper");
    // Number of API requests so far
    private static int refreshCount = 0;
    // Timestamp used for the next request
    private static long time = 0;

    public static void main(String[] args) throws InterruptedException {
        System.out.println("----------Getting to work!-----------------");
        while (true) {
            crawler(time);
            // Pause briefly so a failing request does not become a tight retry loop
            Thread.sleep(1000);
        }
    }

    // Pass in a timestamp to fetch the content for that timestamp
    public static void crawler(long hottime) {
        refreshCount++;
        System.out.println("----------Refresh #" + refreshCount + "------request timestamp: " + hottime + "----------");
        String url = FUNNY + "&max_behot_time=" + hottime + "&max_behot_time_tmp=" + hottime;
        JSONObject param = getUrlParam(); // as and cp values produced by the js code
        if (param == null) {
            System.err.println("Could not compute the as and cp parameters");
            return;
        }
        // Section to request: __all__ = recommended, news_hot = hot, funny = funny
        String module = "funny";
        url += "&as=" + param.get("as") + "&cp=" + param.get("cp") + "&category=" + module;
        JSONObject json = getReturnJson(url); // returns null on failure
        if (json == null) {
            System.out.println("----------The returned json list is empty----------");
            return;
        }
        JSONObject next = json.getJSONObject("next");
        if (next != null) {
            time = next.getLongValue("max_behot_time");
        }
        JSONArray data = json.getJSONArray("data");
        if (data == null) {
            return;
        }
        for (int i = 0; i < data.size(); i++) {
            try {
                JSONObject obj = data.getJSONObject(i);
                // getString also works when group_id arrives as a number
                String groupId = obj.getString("group_id");
                // Skip articles that were already crawled
                if (funnyMapper.selectByGroupId(groupId) != null) {
                    System.out.println("----------This article was already crawled!-----------------");
                    continue;
                }
                // Visit the page and get back a document object
                String url1 = TOUTIAO + "/a" + groupId;
                Document document = getArticleInfo(url1);
                if (document == null) {
                    continue; // the failure was already logged in getArticleInfo
                }
                System.out.println("----------Visited article: " + url1 + "-----------------");
                // Store the document as well
                obj.put("document", document.toString());
                // Convert the json object into a Java entity
                Funny funny = JSON.parseObject(obj.toString(), Funny.class);
                // Save to the database
                funny.setBehotTime(new Date());
                funnyMapper.insertSelective(funny);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    // Call the API and return the data as a json object, or null on failure
    public static JSONObject getReturnJson(String url) {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), "UTF-8"))) {
            StringBuilder content = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                content.append(line);
            }
            return JSONObject.parseObject(content.toString());
        } catch (Exception e) {
            System.err.println("Request failed: " + url);
            e.printStackTrace();
        }
        return null;
    }

    // Fetch the article page, download its images, and return the document (null on failure)
    public static Document getArticleInfo(String url) {
        try {
            Document document = Jsoup.connect(url).get();
            Elements article = document.getElementsByClass("article-content");
            if (article.size() > 0) {
                Elements images = article.get(0).getElementsByTag("img");
                for (Element e : images) {
                    // Download the image referenced by the img tag
                    saveToFile(e.attr("src"));
                }
            }
            return document;
        } catch (IOException e) {
            System.err.println("Failed to visit article page: " + url + " cause: " + e.getMessage());
            return null;
        }
    }

    // Run the js to obtain the as and cp parameter values
    public static JSONObject getUrlParam() {
        JSONObject jsonObject = null;
        String jsFileName = "toutiao.js";
        // Read the js file; try-with-resources closes the reader
        try (Reader reader = new FileReader(jsFileName)) {
            ScriptEngineManager manager = new ScriptEngineManager();
            // Returns null when no JavaScript engine is available
            // (the bundled Nashorn engine was removed in JDK 15)
            ScriptEngine engine = manager.getEngineByName("javascript");
            if (engine == null) {
                System.err.println("No JavaScript engine available");
                return null;
            }
            // Evaluate the script
            engine.eval(reader);
            if (engine instanceof Invocable) {
                Invocable invoke = (Invocable) engine;
                Object obj = invoke.invokeFunction("getParam");
                jsonObject = JSONObject.parseObject(obj != null ? obj.toString() : null);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return jsonObject;
    }

    // Download the image at the given url and save it locally
    public static void saveToFile(String destUrl) {
        String uuid = UUID.randomUUID().toString();
        String fileAddress = "d:/imag/" + uuid; // local path; the directory must already exist
        byte[] buf = new byte[1024];
        HttpURLConnection httpUrl = null;
        try {
            httpUrl = (HttpURLConnection) new URL(destUrl).openConnection();
            httpUrl.connect();
            // Comparing from the literal side is safe when the header is missing
            String type = httpUrl.getHeaderField("Content-Type");
            if ("image/gif".equals(type)) {
                fileAddress += ".gif";
            } else if ("image/png".equals(type)) {
                fileAddress += ".png";
            } else if ("image/jpeg".equals(type)) {
                fileAddress += ".jpg";
            } else {
                System.err.println("Unknown image format: " + type);
                return;
            }
            try (BufferedInputStream bis = new BufferedInputStream(httpUrl.getInputStream());
                 FileOutputStream fos = new FileOutputStream(fileAddress)) {
                int size;
                while ((size = bis.read(buf)) != -1) {
                    fos.write(buf, 0, size);
                }
            }
            System.out.println("Image saved! Path: " + fileAddress);
        } catch (IOException | ClassCastException e) {
            e.printStackTrace();
        } finally {
            if (httpUrl != null) {
                httpUrl.disconnect();
            }
        }
    }
}
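The main loop above runs forever and reuses the same timestamp when a request fails. One possible refinement, shown here only as a sketch, is to bound the loop and stop when the cursor no longer advances. FeedPager and its run method are hypothetical names; fetchPage stands in for a call that requests one page and returns the next max_behot_time, or a negative number on failure.

```java
import java.util.function.LongUnaryOperator;

// Hypothetical loop controller, not part of the original project.
class FeedPager {

    /**
     * Requests pages until maxPages is reached, a request fails,
     * or the cursor stops advancing.
     *
     * @param fetchPage   takes the current cursor and returns the next one,
     *                    or a negative number when the request failed
     * @param maxPages    upper bound on the number of requests
     * @param pauseMillis delay between two requests
     * @return the number of pages fetched successfully
     */
    static int run(LongUnaryOperator fetchPage, int maxPages, long pauseMillis) {
        long cursor = 0; // the first request uses 0
        int fetched = 0;
        for (int i = 0; i < maxPages; i++) {
            long next = fetchPage.applyAsLong(cursor);
            if (next < 0 || next == cursor) {
                break; // failure, or the feed returned nothing new
            }
            cursor = next;
            fetched++;
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag
                break;
            }
        }
        return fetched;
    }
}
```

Passing the page fetch in as a function keeps the stopping rules testable without touching the network.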
The js code that produces the as and cp parameters:
function getParam() {
    var asas;
    var cpcp;
    var t = Math.floor((new Date).getTime() / 1e3),
        e = t.toString(16).toUpperCase(),
        i = md5(t).toString().toUpperCase();
    if (8 != e.length) {
        asas = "479BB4B7254C150";
        cpcp = "7E0AC8874BB0985";
    } else {
        for (var n = i.slice(0, 5), o = i.slice(-5), a = "", s = 0; 5 > s; s++) {
            a += n[s] + e[s];
        }
        for (var r = "", c = 0; 5 > c; c++) {
            r += e[c + 3] + o[c];
        }
        asas = "A1" + a + e.slice(-3);
        cpcp = e.slice(0, 3) + r + "E1";
    }
    return '{"as":"' + asas + '","cp":"' + cpcp + '"}';
}

!function(e) {
    "use strict";
    function t(e, t) { var n = (65535 & e) + (65535 & t), r = (e >> 16) + (t >> 16) + (n >> 16); return r << 16 | 65535 & n }
    function n(e, t) { return e << t | e >>> 32 - t }
    function r(e, r, o, i, a, u) { return t(n(t(t(r, e), t(i, u)), a), o) }
    function o(e, t, n, o, i, a, u) { return r(t & n | ~t & o, e, t, i, a, u) }
    function i(e, t, n, o, i, a, u) { return r(t & o | n & ~o, e, t, i, a, u) }
    function a(e, t, n, o, i, a, u) { return r(t ^ n ^ o, e, t, i, a, u) }
    function u(e, t, n, o, i, a, u) { return r(n ^ (t | ~o), e, t, i, a, u) }
    function s(e, n) {
        e[n >> 5] |= 128 << n % 32, e[(n + 64 >>> 9 << 4) + 14] = n;
        var r, s, c, l, f, p = 1732584193, d = -271733879, h = -1732584194, m = 271733878;
        for (r = 0; r < e.length; r += 16)
            s = p, c = d, l = h, f = m,
            p = o(p, d, h, m, e[r], 7, -680876936),
            m = o(m, p, d, h, e[r + 1], 12, -389564586),
            h = o(h, m, p, d, e[r + 2], 17, 606105819),
            d = o(d, h, m, p, e[r + 3], 22, -1044525330),
            p = o(p, d, h, m, e[r + 4], 7, -176418897),
            m = o(m, p, d, h, e[r + 5], 12, 1200080426),
            h = o(h, m, p, d, e[r + 6], 17, -1473231341),
            d = o(d, h, m, p, e[r + 7], 22, -45705983),
            p = o(p, d, h, m, e[r + 8], 7, 1770035416),
            m = o(m, p, d, h, e[r + 9], 12, -1958414417),
            h = o(h, m, p, d, e[r + 10], 17, -42063),
            d = o(d, h, m, p, e[r + 11], 22, -1990404162),
            p = o(p, d, h, m, e[r + 12], 7, 1804603682),
            m = o(m, p, d, h, e[r + 13], 12, -40341101),
            h = o(h, m, p, d, e[r + 14], 17, -1502002290),
            d = o(d, h, m, p, e[r + 15], 22, 1236535329),
            p = i(p, d, h, m, e[r + 1], 5, -165796510),
            m = i(m, p, d, h, e[r + 6], 9, -1069501632),
            h = i(h, m, p, d, e[r + 11], 14, 643717713),
            d = i(d, h, m, p, e[r], 20, -373897302),
            p = i(p, d, h, m, e[r + 5], 5, -701558691),
            m = i(m, p, d, h, e[r + 10], 9, 38016083),
            h = i(h, m, p, d, e[r + 15], 14, -660478335),
            d = i(d, h, m, p, e[r + 4], 20, -405537848),
            p = i(p, d, h, m, e[r + 9], 5, 568446438),
            m = i(m, p, d, h, e[r + 14], 9, -1019803690),
            h = i(h, m, p, d, e[r + 3], 14, -187363961),
            d = i(d, h, m, p, e[r + 8], 20, 1163531501),
            p = i(p, d, h, m, e[r + 13], 5, -1444681467),
            m = i(m, p, d, h, e[r + 2], 9, -51403784),
            h = i(h, m, p, d, e[r + 7], 14, 1735328473),
            d = i(d, h, m, p, e[r + 12], 20, -1926607734),
            p = a(p, d, h, m, e[r + 5], 4, -378558),
            m = a(m, p, d, h, e[r + 8], 11, -2022574463),
            h = a(h, m, p, d, e[r + 11], 16, 1839030562),
            d = a(d, h, m, p, e[r + 14], 23, -35309556),
            p = a(p, d, h, m, e[r + 1], 4, -1530992060),
            m = a(m, p, d, h, e[r + 4], 11, 1272893353),
            h = a(h, m, p, d, e[r + 7], 16, -155497632),
            d = a(d, h, m, p, e[r + 10], 23, -1094730640),
            p = a(p, d, h, m, e[r + 13], 4, 681279174),
            m = a(m, p, d, h, e[r], 11, -358537222),
            h = a(h, m, p, d, e[r + 3], 16, -722521979),
            d = a(d, h, m, p, e[r + 6], 23, 76029189),
            p = a(p, d, h, m, e[r + 9], 4, -640364487),
            m = a(m, p, d, h, e[r + 12], 11, -421815835),
            h = a(h, m, p, d, e[r + 15], 16, 530742520),
            d = a(d, h, m, p, e[r + 2], 23, -995338651),
            p = u(p, d, h, m, e[r], 6, -198630844),
            m = u(m, p, d, h, e[r + 7], 10, 1126891415),
            h = u(h, m, p, d, e[r + 14], 15, -1416354905),
            d = u(d, h, m, p, e[r + 5], 21, -57434055),
            p = u(p, d, h, m, e[r + 12], 6, 1700485571),
            m = u(m, p, d, h, e[r + 3], 10, -1894986606),
            h = u(h, m, p, d, e[r + 10], 15, -1051523),
            d = u(d, h, m, p, e[r + 1], 21, -2054922799),
            p = u(p, d, h, m, e[r + 8], 6, 1873313359),
            m = u(m, p, d, h, e[r + 15], 10, -30611744),
            h = u(h, m, p, d, e[r + 6], 15, -1560198380),
            d = u(d, h, m, p, e[r + 13], 21, 1309151649),
            p = u(p, d, h, m, e[r + 4], 6, -145523070),
            m = u(m, p, d, h, e[r + 11], 10, -1120210379),
            h = u(h, m, p, d, e[r + 2], 15, 718787259),
            d = u(d, h, m, p, e[r + 9], 21, -343485551),
            p = t(p, s), d = t(d, c), h = t(h, l), m = t(m, f);
        return [p, d, h, m]
    }
    function c(e) { var t, n = ""; for (t = 0; t < 32 * e.length; t += 8) n += String.fromCharCode(e[t >> 5] >>> t % 32 & 255); return n }
    function l(e) { var t, n = []; for (n[(e.length >> 2) - 1] = void 0, t = 0; t < n.length; t += 1) n[t] = 0; for (t = 0; t < 8 * e.length; t += 8) n[t >> 5] |= (255 & e.charCodeAt(t / 8)) << t % 32; return n }
    function f(e) { return c(s(l(e), 8 * e.length)) }
    function p(e, t) { var n, r, o = l(e), i = [], a = []; for (i[15] = a[15] = void 0, o.length > 16 && (o = s(o, 8 * e.length)), n = 0; 16 > n; n += 1) i[n] = 909522486 ^ o[n], a[n] = 1549556828 ^ o[n]; return r = s(i.concat(l(t)), 512 + 8 * t.length), c(s(a.concat(r), 640)) }
    function d(e) { var t, n, r = "0123456789abcdef", o = ""; for (n = 0; n < e.length; n += 1) t = e.charCodeAt(n), o += r.charAt(t >>> 4 & 15) + r.charAt(15 & t); return o }
    function h(e) { return unescape(encodeURIComponent(e)) }
    function m(e) { return f(h(e)) }
    function g(e) { return d(m(e)) }
    function v(e, t) { return p(h(e), h(t)) }
    function y(e, t) { return d(v(e, t)) }
    function b(e, t, n) { return t ? n ? v(t, e) : y(t, e) : n ? m(e) : g(e) }
    "function" == typeof define && define.amd ? define("static/js/lib/md5", ["require"], function() { return b }) : "object" == typeof module && module.exports ? module.exports = b : e.md5 = b
}(this)
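Since getParam only mixes the hex form of the current time with the MD5 of its decimal form, the same values can in principle be computed directly in Java, with no script engine involved. The following is a sketch of such a port, not the code the project uses: the class AsCp and its methods are hypothetical names, and it assumes the js behaves exactly as listed above.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical Java port of getParam(); not part of the original project.
class AsCp {

    // Lowercase 32-character hex MD5 of the given string.
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException(ex);
        }
    }

    /** Returns {as, cp} for a Unix time in seconds. */
    static String[] compute(long epochSeconds) {
        String e = Long.toHexString(epochSeconds).toUpperCase(Locale.ROOT);
        String i = md5Hex(Long.toString(epochSeconds)).toUpperCase(Locale.ROOT);
        if (e.length() != 8) {
            // Same fixed fallback values as the js code
            return new String[] {"479BB4B7254C150", "7E0AC8874BB0985"};
        }
        String n = i.substring(0, 5);            // first five MD5 characters
        String o = i.substring(i.length() - 5);  // last five MD5 characters
        StringBuilder a = new StringBuilder();
        StringBuilder r = new StringBuilder();
        for (int s = 0; s < 5; s++) {
            a.append(n.charAt(s)).append(e.charAt(s));
        }
        for (int c = 0; c < 5; c++) {
            r.append(e.charAt(c + 3)).append(o.charAt(c));
        }
        String as = "A1" + a + e.substring(5);   // e.slice(-3) for 8 characters
        String cp = e.substring(0, 3) + r + "E1";
        return new String[] {as, cp};
    }

    public static void main(String[] args) {
        String[] p = compute(System.currentTimeMillis() / 1000);
        System.out.println("as=" + p[0] + " cp=" + p[1]);
    }
}
```

Taking the time as an argument, rather than reading the clock inside compute, makes the result reproducible for a given second.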
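I also found that Toutiao has a simplified version, and after looking into it, it should be even easier to crawl.
Its pages are addressed as p plus the page number, so you can read the links on each page directly and crawl them. There is no need to get the article addresses from a JSON string or to pass any restricting parameters; a few small changes to this project are enough.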