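Key terms in search score calculation
TF: term frequency, the number of times a term from the analyzed query appears in the searched field of a document
IDF: inverse document frequency, a weight that shrinks as more documents contain the term, so rarer terms contribute more to the score
TFNORM: term frequency normalization, which scales TF by the field length: freq / (freq + k1 * (1 - b + b * dl / avgdl))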
Take the following two documents:
{
  "_index" : "movies",
  "_type" : "_doc",
  "_id" : "321697",
  "_score" : 6.6273837,
  "_source" : {
    "title" : "Steve Jobs"
  }
}
{
  "_index" : "movies",
  "_type" : "_doc",
  "_id" : "23706",
  "_score" : 6.0948296,
  "_source" : {
    "title" : "All About Steve"
  }
}
Suppose we run a match query on title:
GET /movies/_search
{
  "query": {
    "match": {
      "title": "steve"
    }
  }
}
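The scores show that the first document ranks higher than the second. The reasons are:
TF: the term appears the same number of times (once) in the searched field of both documents.
IDF: the term's frequency across the whole document set is the same for both documents.
TFNORM: this is where they differ. In the first document the term makes up 1/2 of the title, in the second document 1/3, so the first document scores higher than the second for this search. This is why ES uses the TFNORM calculation freq / (freq + k1 * (1 - b + b * dl / avgdl)).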
In short, the TF component used by ES combines term frequency normalization with BM25.
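The effect of field length can be checked with a few lines of Python. This is only a minimal sketch of the TFNORM formula quoted above, not Elasticsearch code: the function name `tf_norm` is made up for illustration, k1 = 1.2 and b = 0.75 are the values reported in the explain output later in this article, and avgdl = 2.1474154 is also taken from that output (it is assumed to apply to both titles, which may not hold if they live on different shards).

```python
def tf_norm(freq, dl, avgdl, k1=1.2, b=0.75):
    """BM25 term-frequency component: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


# Average title length, copied from the explain output shown later (assumption:
# the same value is used for both documents).
AVGDL = 2.1474154

# "Steve Jobs" has 2 terms in title, "All About Steve" has 3; "steve" occurs once in each.
tf_steve_jobs = tf_norm(freq=1, dl=2, avgdl=AVGDL)
tf_all_about_steve = tf_norm(freq=1, dl=3, avgdl=AVGDL)

print(f"Steve Jobs:      {tf_steve_jobs:.4f}")
print(f"All About Steve: {tf_all_about_steve:.4f}")
```

With the same freq, the shorter title ends up with the larger TF component, which is the ordering seen in the scores above.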
To see exactly how Elasticsearch computed a score, run the following request:
GET /movies/_search
{
  // similar to an execution plan (EXPLAIN) in MySQL
  "explain": true,
  "query": {
    "match": {
      "title": "steve"
    }
  }
}
Execution result, looking at one of the hits:
{
  "_shard": "[movies][1]",
  "_node": "pqNhgutvQfqcLqLEzIDnbQ",
  "_index": "movies",
  "_type": "_doc",
  "_id": "321697",
  "_score": 6.6273837,
  "_source": {
    "overview": "Set backstage at three iconic product launches and ending in 1998 with the unveiling of the iMac, Steve Jobs takes us behind the scenes of the digital revolution to paint an intimate portrait of the brilliant man at its epicenter.",
    "voteAverage": 6.8,
    "keywords": [
      {
        "id": 5565,
        "name": "biography"
      },
      {
        "id": 6104,
        "name": "computer"
      },
      {
        "id": 15300,
        "name": "father daughter relationship"
      },
      {
        "id": 157935,
        "name": "apple computer"
      },
      {
        "id": 161160,
        "name": "steve jobs"
      },
      {
        "id": 185722,
        "name": "based on true events"
      }
    ],
    "releaseDate": "2015-01-01T00:00:00.000Z",
    "runtime": 122,
    "originalLanguage": "en",
    "title": "Steve Jobs",
    "productionCountries": [
      {
        "iso_3166_1": "US",
        "name": "United States of America"
      }
    ],
    "revenue": 34441873,
    "genres": [
      {
        "id": 18,
        "name": "Drama"
      },
      {
        "id": 36,
        "name": "History"
      }
    ],
    "originalTitle": "Steve Jobs",
    "popularity": 53.670525,
    "tagline": "Can a great man be a good man?",
    "spokenLanguages": [
      {
        "iso_639_1": "en",
        "name": "English"
      }
    ],
    "id": 321697,
    "voteCount": 1573,
    "productionCompanies": [
      {
        "name": "Universal Pictures",
        "id": 33
      },
      {
        "name": "Scott Rudin Productions",
        "id": 258
      },
      {
        "name": "Legendary Pictures",
        "id": 923
      },
      {
        "name": "The Mark Gordon Company",
        "id": 1557
      },
      {
        "name": "Management 360",
        "id": 4220
      },
      {
        "name": "Cloud Eight Films",
        "id": 6708
      }
    ],
    "budget": 30000000,
    "homepage": "http://www.stevejobsthefilm.com",
    "status": "Released"
  }
}
The result now also contains the following extra data (the execution plan):
{
  "_explanation": {
    "value": 6.6273837,
    // weight of the term steve in the title field for the document whose internal Lucene doc ID is 1526
    "description": "weight(title:steve in 1526) [PerFieldSimilarity], result of:",
    "details": [
      {
        // value = boost.value * idf.value * tf.value
        // 6.6273837 = 2.2 * 6.4412656 * 0.46767938
        "value": 6.6273837,
        "description": "score(freq=1.0), product of:",
        "details": [
          {
            "value": 2.2,
            // boost factor: the query boost (1.0 by default) multiplied by (k1 + 1), i.e. 1.0 * (1.2 + 1) = 2.2; k1 can be set through the index similarity settings when the index is created
            "description": "boost",
            "details": []
          },
          {
            "value": 6.4412656,
            "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
            "details": [
              {
                "value": 2,
                "description": "n, number of documents containing term",
                "details": []
              },
              {
                "value": 1567,
                "description": "N, total number of documents with field",
                "details": []
              }
            ]
          },
          {
            "value": 0.46767938,
            "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
            "details": [
              {
                "value": 1,
                "description": "freq, occurrences of term within document",
                "details": []
              },
              // k1 and b are the BM25 parameters used in the denominator (freq + k1 * (1 - b + b * dl / avgdl))
              {
                "value": 1.2,
                "description": "k1, term saturation parameter",
                "details": []
              },
              {
                "value": 0.75,
                "description": "b, length normalization parameter",
                "details": []
              },
              // dl and avgdl are where the length normalization shows up
              {
                "value": 2,
                "description": "dl, length of field",
                "details": []
              },
              {
                "value": 2.1474154,
                "description": "avgdl, average length of field",
                "details": []
              }
            ]
          }
        ]
      }
    ]
  }
}
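The numbers in this explain output can be reproduced by hand. The following is a rough Python sketch that plugs the reported values into the two formulas printed in the `description` fields. The function names (`bm25_idf`, `bm25_tf`, `bm25_score`) are invented for illustration and are not part of any Elasticsearch API, and `log` is assumed to be the natural logarithm.

```python
import math


def bm25_idf(n, N):
    """idf as printed by explain: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq, dl, avgdl, k1, b):
    """tf as printed by explain: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def bm25_score(boost, n, N, freq, dl, avgdl, k1=1.2, b=0.75):
    """Score of a single term: the product boost * idf * tf."""
    return boost * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl, k1, b)


# All inputs below are copied from the explain output above.
idf = bm25_idf(n=2, N=1567)
tf = bm25_tf(freq=1, dl=2, avgdl=2.1474154, k1=1.2, b=0.75)
score = bm25_score(boost=2.2, n=2, N=1567, freq=1, dl=2, avgdl=2.1474154)

print(f"idf={idf:.5f} tf={tf:.5f} score={score:.5f}")
```

Up to floating-point rounding (the explain output uses 32-bit floats), the three values should agree with the idf, tf and top-level value fields shown above.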