What Happens When You Write the Three Hundred Tang Poems into Elasticsearch

Published: 2021-12-16 18:13:34 | Source: 億速云 (Yisu Cloud) | Views: 118 | Author: 柒染 | Column: Big Data

This article looks at what happens when you write the Three Hundred Tang Poems into Elasticsearch. It is a practical walkthrough, shared in the hope that you take something useful away from it.

1. Hands-on Project

What happens when you write the Three Hundred Tang Poems into Elasticsearch?

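2. Project Description

This is a small project condensed from a real-world one, and it touches almost every topic covered earlier in the series.

Working through it ties those topics together in practice and builds an end-to-end view of requirements analysis, overall design, data modeling, ingest pipelines, choosing between search and aggregation, and visual analysis in Kibana.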

3. Requirements

Data source: https://github.com/xuchunyang/300

Note a bug in the data source: the "id": 178 on line 1753 must be changed by hand to "id": 252.
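A duplicate id like the one described above can be caught before indexing rather than by reading the file line by line. The following is a minimal sketch, assuming the file is a JSON array of objects with an integer "id" field and that ids are meant to run from 1 to the number of poems; the helper names are hypothetical and not part of the original project.

```python
from collections import Counter


def find_duplicate_ids(poems):
    """Return the ids that occur more than once, in ascending order."""
    counts = Counter(poem["id"] for poem in poems)
    return sorted(poem_id for poem_id, n in counts.items() if n > 1)


def find_missing_ids(poems):
    """Return the ids missing from 1..len(poems) (assumes ids start at 1)."""
    seen = {poem["id"] for poem in poems}
    return [i for i in range(1, len(poems) + 1) if i not in seen]


# Tiny illustrative sample, not the real dataset.
sample = [{"id": 1}, {"id": 2}, {"id": 2}, {"id": 4}]
duplicates = find_duplicate_ids(sample)
missing = find_missing_ids(sample)
print("duplicate ids:", duplicates)
print("missing ids:", missing)
```

On the real file, load the array with json.load first. A duplicated id matters because documents are indexed with the poem id as _id, so the second record would silently overwrite the first.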

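3.1 Data requirements

Points to consider:

1) Dictionary selection
2) Analyzer selection
3) Mapping settings
4) Which analysis dimensions the model must support
5) Insert time, set automatically at write time rather than by hand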

3.2 Write requirements

Points to consider:

1) Cleaning of special characters
2) Adding the insert time
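The article does not say which characters need cleaning, so the following is only a sketch under an assumption: "special characters" is taken to mean control or invisible format characters (for example a byte-order mark) and stray whitespace inside the poem text. The function name clean_text is hypothetical.

```python
import re
import unicodedata

# Assumption: whitespace carries no meaning inside a poem's text.
_WHITESPACE = re.compile(r"\s+")


def clean_text(text):
    """Drop control/format characters and all whitespace from a text field."""
    # Unicode categories starting with "C" cover control, format,
    # private-use, surrogate, and unassigned code points.
    visible = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("C")
    )
    return _WHITESPACE.sub("", visible)


raw = "打起黃鶯兒，\u3000莫教枝上啼。\n\ufeff"
cleaned = clean_text(raw)
print(cleaned)
```

Cleaning before indexing also keeps the computed content length honest, since invisible characters would otherwise be counted.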

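3.3 Analysis requirements

Search analysis with the query DSL

1) A "Fei Hua Ling" (飛花令) round: which poems contain 銘, 毅, and 天下 (each matched separately)? How many poems for each?
2) How many poems are by Li Bai (李白)? Sort them by length, shortest first.
3) List the authors of the 10 longest and the 10 shortest poems.

Aggregation analysis and visualization

1) Who has the most poems in the collection? Show the top 10.
2) The shares of five-character quatrains (五言絕句) and seven-character regulated verse (七言律詩), with the author breakdown for each.
3) A ranking of poems that share a title.
4) What word cloud does the tokenized text of the 300 poems produce?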

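4. Requirement Analysis and Design

4.1 Requirement analysis

The guiding principle: design before you code.

A common failing among developers is to start typing code the moment a requirement lands. Whether a new project is simple or complex, first sort out the requirements, work out the logical architecture, and design up front so that you have the overall picture.

The project covers the following core topics:

Elasticsearch data modeling
Elasticsearch bulk writes
Elasticsearch ingest preprocessing
Elasticsearch search
Elasticsearch aggregations
Kibana Visualize
Kibana Dashboard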

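4.2 Logical architecture

A picture is worth a thousand words.

[Figure: logical architecture diagram]

This logical architecture is derived from the requirements. Keep its data flow in mind during development.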

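4.3 Data modeling

This was covered before, but the importance of data modeling bears repeating.

The data model underpins the system and its data, and the system and data in turn underpin the business.

A good data model:

Makes systems easier to integrate and simplifies interfaces.
Cuts data redundancy, saves disk space, and improves transfer efficiency.
Accommodates more kinds of data, so adding a data type does not force changes to the implementation logic.
Opens up more business opportunities and improves business efficiency.
Reduces business risk and lowers business cost.

In Elasticsearch, the core of data modeling is building the mapping.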

Given a raw JSON record:

    {
        "id": 251,
        "contents": "打起黃鶯兒，莫教枝上啼。啼時驚妾夢，不得到遼西。",
        "type": "五言絕句",
        "author": "金昌緒",
        "title": "春怨"
    }

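Our modeling logic is as follows:

Field	Type	Notes
_id		Taken from the auto-increment id in the source data
contents	text & keyword	Analyzed; note that fielddata: true is enabled
type	text & keyword
author	text & keyword
title	text & keyword
timestamp	date	Insert time
cont_length	long	Length of contents, used for sorting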

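Because the text is Chinese, the choice of analyzer matters.

The IK analyzer is again the recommendation here.

On IK dictionaries: the bundled dictionary is incomplete, so supplement it with dictionaries of common Internet terms and domain dictionaries (for example, a poetry dictionary) found online.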

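4.4 High-level design

Batch reading and writing of the raw JSON documents combines the low-level Elasticsearch Python client (elasticsearch-py) with the high-level elasticsearch-dsl API.
Preprocessing is done with an ingest pipeline. It runs as each poem's JSON is written and inserts the timestamp field.
The template and mapping are built in Kibana.
Analyzer: ik_max_word, which splits text at the finest granularity and so yields a more detailed word cloud.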

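5. Implementation

5.1 Preprocessing with an ingest pipeline

Create a pipeline named indexed_at that:

Adds an insert-timestamp field when a document is indexed.
Adds a length field for later sorting.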

    PUT _ingest/pipeline/indexed_at
    {
      "description": "Adds timestamp to documents",
      "processors": [
        {
          "set": {
            "field": "_source.timestamp",
            "value": "{{_ingest.timestamp}}"
          }
        },
        {
          "script": {
            "source": "ctx.cont_length = ctx.contents.length();"
          }
        }
      ]
    }
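To make the effect of the two processors concrete, here is a plain-Python imitation of what the pipeline does to one document. It is only an illustration that runs without a cluster, not how Elasticsearch executes the pipeline; the function name apply_indexed_at is hypothetical.

```python
from datetime import datetime, timezone


def apply_indexed_at(doc, now=None):
    """Return a copy of doc with the two fields the pipeline adds."""
    now = now or datetime.now(timezone.utc)
    enriched = dict(doc)
    # Mirrors the set processor: record the ingest time as an ISO 8601 string.
    enriched["timestamp"] = now.isoformat()
    # Mirrors the script processor: length of the contents string.
    enriched["cont_length"] = len(doc["contents"])
    return enriched


poem = {"contents": "打起黃鶯兒，莫教枝上啼。啼時驚妾夢，不得到遼西。"}
result = apply_indexed_at(poem, now=datetime(2021, 12, 16, tzinfo=timezone.utc))
print(result["timestamp"], result["cont_length"])
```

Note that the length counts punctuation as well as characters. Painless String.length() counts UTF-16 code units while Python's len() counts code points; the two agree for text in the Basic Multilingual Plane, which covers the common CJK characters.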

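5.2 Building the mapping and template

The DSL below builds a template named my_template and then creates the index some_index_01.

The template sets the basic settings, the alias, and the mapping.

The benefits and convenience of templates were covered in detail in earlier chapters.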

    PUT _template/my_template
    {
      "index_patterns": [
        "some_index*"
      ],
      "aliases": {
        "some_index": {}
      },
      "settings": {
        "index.default_pipeline": "indexed_at",
        "number_of_replicas": 1,
        "refresh_interval": "30s"
      },
      "mappings": {
        "properties": {
          "cont_length": {
            "type": "long"
          },
          "author": {
            "type": "text",
            "fields": {
              "field": {
                "type": "keyword"
              }
            },
            "analyzer": "ik_max_word"
          },
          "contents": {
            "type": "text",
            "fields": {
              "field": {
                "type": "keyword"
              }
            },
            "analyzer": "ik_max_word",
            "fielddata": true
          },
          "timestamp": {
            "type": "date"
          },
          "title": {
            "type": "text",
            "fields": {
              "field": {
                "type": "keyword"
              }
            },
            "analyzer": "ik_max_word"
          },
          "type": {
            "type": "text",
            "fields": {
              "field": {
                "type": "keyword"
              }
            },
            "analyzer": "ik_max_word"
          }
        }
      }
    }

    PUT some_index_01

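5.3 Reading and writing the data

The Python code below does this. Note:

Bulk writes perform far better than writing documents one at a time.
For large files in particular, prefer bulk processing.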

    import json

    from elasticsearch import Elasticsearch, helpers

    # Assumption: a local single-node cluster; point this at your own cluster.
    client = Elasticsearch("http://localhost:9200")


    def read_and_write_index():
        # load the whole JSON array of poems
        with open("300.json", encoding="utf8", errors="ignore") as input_file:
            json_array = json.load(input_file)

        # build the list of Elasticsearch docs
        doc_list = []
        for item in json_array:
            try:
                doc_list.append({
                    # the "_id" key sets the document ID explicitly
                    "_id": item["id"],
                    "contents": item["contents"],
                    "type": item["type"],
                    "author": item["author"],
                    "title": item["title"],
                })
            except KeyError as err:
                # a record is missing one of the expected fields
                print("ERROR for num:", item.get("id"), "-- missing field:", err)
                print("Dict docs length:", len(doc_list))

        try:
            print("\nAttempting to index the list of docs using helpers.bulk()")
            # use the helpers library's Bulk API to index the list of docs;
            # writing through the some_index alias triggers the indexed_at
            # default pipeline. doc_type is omitted because mapping types
            # are deprecated since Elasticsearch 7.0.
            resp = helpers.bulk(client, doc_list, index="some_index")
            # resp is a (success_count, errors) tuple
            print("helpers.bulk() RESPONSE:", json.dumps(resp, indent=4))
        except Exception as err:
            # print any errors returned while making the helpers.bulk() API call
            print("Elasticsearch helpers.bulk() ERROR:", err)
            raise SystemExit(1)


    if __name__ == "__main__":
        read_and_write_index()
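For larger files, the bulk helper also accepts any iterable of actions, so the documents can be produced lazily instead of being collected in one list first. The sketch below shows the explicit action format with _index, _id, and _source; the function name generate_actions is hypothetical, and the block deliberately avoids importing the client so that it runs without a cluster.

```python
def generate_actions(poems, index_name="some_index"):
    """Yield one bulk action per poem instead of building a full list."""
    for poem in poems:
        yield {
            "_index": index_name,   # target index or alias
            "_id": poem["id"],      # reuse the poem id as the document id
            "_source": {
                "contents": poem["contents"],
                "type": poem["type"],
                "author": poem["author"],
                "title": poem["title"],
            },
        }


sample = [
    {
        "id": 251,
        "contents": "打起黃鶯兒，莫教枝上啼。啼時驚妾夢，不得到遼西。",
        "type": "五言絕句",
        "author": "金昌緒",
        "title": "春怨",
    }
]
actions = list(generate_actions(sample))
print(actions[0]["_id"], actions[0]["_source"]["title"])
```

With a connected client, the generator can be passed straight to the helper, for example helpers.bulk(client, generate_actions(poems)), which sends the actions in chunks.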

5.4 Data analysis

5.5 Search analysis

5.5.1 Fei Hua Ling round: which poems contain 銘, 毅, and 天下 (each matched separately)? How many poems for each?

    GET some_index/_search
    {
      "query": {
        "match": {
          "contents": "銘"
        }
      }
    }

    GET some_index/_search
    {
      "query": {
        "match": {
          "contents": "毅"
        }
      }
    }

    GET some_index/_search
    {
      "query": {
        "match": {
          "contents": "天下"
        }
      }
    }

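The results:

銘: 0 poems
毅: 1 poem
天下: 114 poems

One cannot help but marvel: the Tang poets, too, held the whole realm (天下) in their hearts and worried for their country and its people.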
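Counts like these can be cross-checked offline against the raw JSON. The following sketch counts poems by plain substring search; this is an assumption-laden sanity check rather than a reproduction of the query, because a match query works on analyzer tokens and may therefore return different numbers. The function name and the two sample records are for illustration only.

```python
def count_poems_containing(poems, keywords):
    """Map each keyword to the number of poems whose contents contain it."""
    return {
        kw: sum(1 for poem in poems if kw in poem["contents"])
        for kw in keywords
    }


# Two records for illustration only; load 300.json for the real counts.
sample = [
    {"contents": "打起黃鶯兒，莫教枝上啼。啼時驚妾夢，不得到遼西。"},
    {"contents": "天下傷心處，勞勞送客亭。"},
]
counts = count_poems_containing(sample, ["銘", "毅", "天下"])
print(counts)
```

If the offline count and the Elasticsearch hit count disagree noticeably, the analyzer's tokenization of the keyword is the first thing to inspect.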

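5.5.2 How many poems are by Li Bai? Sort them by length, shortest first

    POST some_index/_search
    {
      "query": {
        "match_phrase": {
          "author": "李白"
        }
      },
      "sort": [
        {
          "cont_length": {
            "order": "asc"
          }
        }
      ]
    }

    POST some_index/_search
    {
      "aggs": {
        "genres": {
          "terms": {
            "field": "author.field"
          }
        }
      }
    }

The first request sorts ascending by cont_length, so the shortest poem comes first; use "desc" to see the longest first. The second counts poems per author on author.field, the keyword sub-field defined in the template.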

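In the Three Hundred Tang Poems, Li Bai has 33 poems (second only to Du Fu's 39). His longest is "蜀道難", at 353 characters.

Li Bai and Du Fu live up to their titles of Poet Immortal (詩仙) and Poet Sage (詩圣), and both were prolific.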
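The same two questions, poems per author and one author's poems ordered by length, can be answered directly on the raw JSON as a cross-check. This is a small sketch with hypothetical helper names and made-up sample records; it measures length with len() on the contents string, matching the cont_length definition used in the pipeline.

```python
from collections import Counter


def poems_by_author(poems, author):
    """Return one author's poems, shortest contents first."""
    own = [poem for poem in poems if poem["author"] == author]
    return sorted(own, key=lambda poem: len(poem["contents"]))


def author_counts(poems):
    """Return (author, poem_count) pairs, most prolific first."""
    return Counter(poem["author"] for poem in poems).most_common()


# Placeholder records for illustration; contents are dummy strings.
sample = [
    {"author": "李白", "title": "A", "contents": "x" * 30},
    {"author": "杜甫", "title": "B", "contents": "x" * 40},
    {"author": "李白", "title": "C", "contents": "x" * 20},
]
ordered_titles = [poem["title"] for poem in poems_by_author(sample, "李白")]
ranking = author_counts(sample)
print(ordered_titles, ranking)
```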

5.5.3 Authors of the 10 longest and the 10 shortest poems

    POST some_index/_search
    {
      "sort": [
        {
          "cont_length": {
            "order": "desc"
          }
        }
      ]
    }

    POST some_index/_search
    {
      "sort": [
        {
          "cont_length": {
            "order": "asc"
          }
        }
      ]
    }

The longest poem: 白居易 (Bai Juyi), 長恨歌, 960 characters.

The shortest poem: 王維 (Wang Wei), 鹿柴, 24 characters (tied with many others).
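The requests above return whole documents, while the requirement only asks for a list of authors. One possible refinement, sketched here with hypothetical helper names, is to state the size explicitly, restrict _source to the fields of interest, and pull the authors out of the response. The response used below is a hand-written stub in the shape of a search response, filled with the figures quoted in this section.

```python
def length_ranking_body(order, size=10):
    """Build a search body for the longest ("desc") or shortest ("asc") poems."""
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    return {
        "size": size,
        "_source": ["author", "title", "cont_length"],
        "sort": [{"cont_length": {"order": order}}],
    }


def authors_from_hits(response):
    """Extract the author of each hit, keeping the ranking order."""
    return [hit["_source"]["author"] for hit in response["hits"]["hits"]]


# Hand-written stub shaped like a search response, for illustration only.
stub_response = {
    "hits": {
        "hits": [
            {"_source": {"author": "白居易", "title": "長恨歌", "cont_length": 960}},
        ]
    }
}
body = length_ranking_body("desc")
authors = authors_from_hits(stub_response)
print(body["sort"], authors)
```

Because many poems tie at 24 characters, the 10 shortest returned by the ascending sort are only one of several equally valid selections.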

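5.6 Aggregation analysis

The screenshots in this section were produced in Kibana. The details were covered in the earlier material on Kibana visualization.

5.6.1 Who has the most poems in the collection? Top 10

[Figure: Kibana visualization of the top 10 authors by poem count]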
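A ranking like this one corresponds to a terms aggregation on a keyword field. The sketch below builds such a request body in Python; the helper name terms_body and the aggregation name by_term are hypothetical, and the field names assume the keyword sub-field is called field, as in the template from section 5.2. The same helper with min_doc_count set to 2 would list only titles shared by at least two poems, which is what the same-title ranking in 5.6.3 needs.

```python
import json


def terms_body(field, size=10, min_doc_count=1):
    """Build a search body that ranks the values of a keyword field."""
    return {
        "size": 0,  # return only the aggregation, no hits
        "aggs": {
            "by_term": {
                "terms": {
                    "field": field,
                    "size": size,
                    "min_doc_count": min_doc_count,
                }
            }
        },
    }


top_authors = terms_body("author.field")
shared_titles = terms_body("title.field", min_doc_count=2)
print(json.dumps(top_authors, ensure_ascii=False, indent=2))
```

The printed JSON can be pasted after POST some_index/_search in the Kibana console to compare the buckets with the chart.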

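5.6.2 Shares of five-character quatrains and seven-character regulated verse, with the author breakdown for each

[Figure: Kibana visualization of the share of each poem type and of the authors within each type]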
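Outside Kibana, a breakdown of this kind can be approximated with a terms aggregation on the poem type that nests a second terms aggregation on the author, followed by a small calculation that turns bucket counts into percentages. The sketch below assumes the keyword sub-fields are named field as in the template; the names type_author_body and bucket_shares are hypothetical, and the bucket counts are invented for illustration rather than taken from the dataset.

```python
# Request body: poem types, each with its own author breakdown.
type_author_body = {
    "size": 0,
    "aggs": {
        "by_type": {
            "terms": {"field": "type.field"},
            "aggs": {"by_author": {"terms": {"field": "author.field"}}},
        }
    },
}


def bucket_shares(buckets):
    """Convert terms buckets to {key: percentage of the summed doc_count}."""
    total = sum(bucket["doc_count"] for bucket in buckets)
    if total == 0:
        return {}
    return {
        bucket["key"]: round(100.0 * bucket["doc_count"] / total, 1)
        for bucket in buckets
    }


# Invented counts, for illustration only.
example_buckets = [
    {"key": "五言絕句", "doc_count": 1},
    {"key": "七言律詩", "doc_count": 3},
]
shares = bucket_shares(example_buckets)
print(shares)
```

The percentages are relative to the buckets passed in, so filter the bucket list to the two poem types first if only their relative shares are wanted.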

5.6.3 Ranking of poems that share a title

[Figure: Kibana visualization ranking titles shared by more than one poem]

5.6.4 What word cloud does the tokenized text of the 300 poems produce?

[Figure: Kibana tag cloud built from the tokens of the poems]

5.6.5 Overall view

[Figure: Kibana dashboard combining the visualizations]

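That is what happens when you write the Three Hundred Tang Poems into Elasticsearch. Some of these techniques are likely to come up in day-to-day work, and hopefully this article has added to what you know.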
