A Brief Look at How Elasticsearch Computes Search Scores

Published: 2020-03-03 23:44:32  Source: Internet  Views: 336  Author: baizhihua0809  Category: Development

Key terms in search scoring

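TF: term frequency, the number of times a term from the analyzed query appears in the searched field of a document.
IDF: inverse document frequency, a weight that shrinks as more documents contain the term. ES computes it as log(1 + (N - n + 0.5) / (n + 0.5)), where n is the number of documents containing the term and N is the number of documents that have the field.
TFNORM: term frequency normalization, which scales term frequency by field length so that a match in a short field counts for more than the same match in a long field.
BM25: the ranking algorithm ES uses by default. Its tf component is freq / (freq + k1 * (1 - b + b * dl / avgdl)).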
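
The definitions above can be sketched in a few lines of Python. This is only an illustration of the two formulas that Elasticsearch prints in its explain output, not the Lucene implementation. The function names `bm25_idf` and `bm25_tf` and the sample numbers are made up for this example; k1 = 1.2 and b = 0.75 are the default parameter values.

```python
import math

def bm25_idf(n_docs_with_term: int, n_docs_total: int) -> float:
    """idf = log(1 + (N - n + 0.5) / (n + 0.5)); rarer terms get a larger weight."""
    return math.log(1 + (n_docs_total - n_docs_with_term + 0.5) / (n_docs_with_term + 0.5))

def bm25_tf(freq: float, dl: float, avgdl: float, k1: float = 1.2, b: float = 0.75) -> float:
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

# Illustrative numbers only (not taken from a real index).
rare = bm25_idf(2, 1000)
common = bm25_idf(500, 1000)
print(rare > common)  # a term found in fewer documents gets the larger idf

# With dl == avgdl the length term is 1, so tf = freq / (freq + k1):
# tf grows with freq but saturates below 1.
print([round(bm25_tf(f, dl=10, avgdl=10), 3) for f in (1, 2, 5, 50)])
```

The second print shows why k1 is called the term saturation parameter: repeating a term keeps raising tf, but by ever smaller amounts.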

Consider the following two documents:

{
  "_index" : "movies",
  "_type" : "_doc",
  "_id" : "321697",
  "_score" : 6.6273837,
  "_source" : {
    "title" : "Steve Jobs"
  }
}

{
  "_index" : "movies",
  "_type" : "_doc",
  "_id" : "23706",
  "_score" : 6.0948296,
  "_source" : {
    "title" : "All About Steve"
  }
}

Suppose we run a match query on title:

GET /movies/_search
{
  "query": {
    "match": {
      "title": "steve"
    }
  }
}

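The scores show that the first document ranks above the second. The reasons are as follows:

TF: the term occurs the same number of times (once) in the searched field of both documents.

IDF: this weight depends on how many documents contain the term, not on the individual document, so it is the same for both documents when they are scored against the same statistics.

TFNORM: this is where the two differ. The term makes up 1/2 of the first title and 1/3 of the second, so the first document scores higher than the second document for this query. This is why ES computes tf in the normalized form freq / (freq + k1 * (1 - b + b * dl / avgdl)).

In short, the tf used by ES combines term frequency normalization with BM25.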
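
To make the length effect concrete, the sketch below plugs both titles into the tf formula. It assumes the default k1 = 1.2 and b = 0.75, and it borrows avgdl = 2.1474154 from the explain output for "Steve Jobs" shown later in this article. The second document may be stored on a different shard with different statistics, so only the direction of the comparison is meaningful here, not the exact score of "All About Steve". The helper name `bm25_tf` is made up for this example.

```python
def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    # Term-frequency part of BM25, as printed in the explain output.
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

AVGDL = 2.1474154  # average title length, taken from the explain output

tf_steve_jobs = bm25_tf(freq=1, dl=2, avgdl=AVGDL)       # "Steve Jobs": 2 terms
tf_all_about_steve = bm25_tf(freq=1, dl=3, avgdl=AVGDL)  # "All About Steve": 3 terms

print(f"{tf_steve_jobs:.6f}")
print(f"{tf_all_about_steve:.6f}")
print(tf_steve_jobs > tf_all_about_steve)  # the shorter title gets the larger tf
```

With freq fixed at 1, the only input that changes between the two calls is dl, which is exactly the normalization described above.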

To see exactly how Elasticsearch arrived at a score, add the explain flag to the search request:

GET /movies/_search
{
  // Similar to an execution plan (EXPLAIN) in MySQL
  "explain": true,
  "query": {
    "match": {
      "title": "steve"
    }
  }
}

Here is one hit from the result:

{
    "_shard": "[movies][1]",
    "_node": "pqNhgutvQfqcLqLEzIDnbQ",
    "_index": "movies",
    "_type": "_doc",
    "_id": "321697",
    "_score": 6.6273837,
    "_source": {
        "overview": "Set backstage at three iconic product launches and ending in 1998 with the unveiling of the iMac, Steve Jobs takes us behind the scenes of the digital revolution to paint an intimate portrait of the brilliant man at its epicenter.",
        "voteAverage": 6.8,
        "keywords": [
            {
                "id": 5565,
                "name": "biography"
            },
            {
                "id": 6104,
                "name": "computer"
            },
            {
                "id": 15300,
                "name": "father daughter relationship"
            },
            {
                "id": 157935,
                "name": "apple computer"
            },
            {
                "id": 161160,
                "name": "steve jobs"
            },
            {
                "id": 185722,
                "name": "based on true events"
            }
        ],
        "releaseDate": "2015-01-01T00:00:00.000Z",
        "runtime": 122,
        "originalLanguage": "en",
        "title": "Steve Jobs",
        "productionCountries": [
            {
                "iso_3166_1": "US",
                "name": "United States of America"
            }
        ],
        "revenue": 34441873,
        "genres": [
            {
                "id": 18,
                "name": "Drama"
            },
            {
                "id": 36,
                "name": "History"
            }
        ],
        "originalTitle": "Steve Jobs",
        "popularity": 53.670525,
        "tagline": "Can a great man be a good man?",
        "spokenLanguages": [
            {
                "iso_639_1": "en",
                "name": "English"
            }
        ],
        "id": 321697,
        "voteCount": 1573,
        "productionCompanies": [
            {
                "name": "Universal Pictures",
                "id": 33
            },
            {
                "name": "Scott Rudin Productions",
                "id": 258
            },
            {
                "name": "Legendary Pictures",
                "id": 923
            },
            {
                "name": "The Mark Gordon Company",
                "id": 1557
            },
            {
                "name": "Management 360",
                "id": 4220
            },
            {
                "name": "Cloud Eight Films",
                "id": 6708
            }
        ],
        "budget": 30000000,
        "homepage": "http://www.stevejobsthefilm.com",
        "status": "Released"
    }
}

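Each hit in the result now carries the following additional block (the explanation, which plays the role of an execution plan):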

{
    "_explanation": {
        "value": 6.6273837,
        // Weight of the term "steve" in the title field; "in 1526" is Lucene's internal ID of the document being scored
        "description": "weight(title:steve in 1526) [PerFieldSimilarity], result of:",
        "details": [
            {
                // value = idf.value * tf.value * 2.2
                // 6.6273837 = 6.4412656 * 0.46767938 * 2.2
                "value": 6.6273837,
                "description": "score(freq=1.0), product of:",
                "details": [
                    {
                        "value": 2.2,
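                        // Boost factor. 2.2 here is the default query boost 1.0 multiplied by (k1 + 1); it changes with a query-time boost or a custom k1 in the index similarity settings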
                        "description": "boost",
                        "details": []
                    },
                    {
                        "value": 6.4412656,
                        "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                        "details": [
                            {
                                "value": 2,
                                "description": "n, number of documents containing term",
                                "details": []
                            },
                            {
                                "value": 1567,
                                "description": "N, total number of documents with field",
                                "details": []
                            }
                        ]
                    },
                    {
                        "value": 0.46767938,
                        "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                        "details": [
                            {
                                "value": 1,
                                "description": "freq, occurrences of term within document",
                                "details": []
                            },
                            // k1 and b are the BM25 parameters in freq / (freq + k1 * (1 - b + b * dl / avgdl))
                            {
                                "value": 1.2,
                                "description": "k1, term saturation parameter",
                                "details": []
                            },
                            {
                                "value": 0.75,
                                "description": "b, length normalization parameter",
                                "details": []
                            },
                            // dl and avgdl are where the length normalization comes in
                            {
                                "value": 2,
                                "description": "dl, length of field",
                                "details": []
                            },
                            {
                                "value": 2.1474154,
                                "description": "avgdl, average length of field",
                                "details": []
                            }
                        ]
                    }
                ]
            }
        ]
    }
}
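
As a sanity check, the score can be recomputed from the leaf values of the explanation. The sketch below is one possible way to do that in Python: the helper names `leaves` and `leaf` are made up for this example, and the tree is a trimmed copy of the output above with shortened descriptions, not a live response.

```python
import math

def leaves(node):
    """Yield (name, value) for every leaf of an explain tree.

    The name is the text before the first comma in the description,
    e.g. "n, number of documents containing term" -> "n".
    """
    if not node["details"]:
        yield node["description"].split(",")[0], node["value"]
    for child in node["details"]:
        yield from leaves(child)

def leaf(description, value):
    return {"value": value, "description": description, "details": []}

# Trimmed copy of the explanation above.
explanation = {
    "value": 6.6273837,
    "description": "score(freq=1.0), product of:",
    "details": [
        leaf("boost", 2.2),
        {"value": 6.4412656, "description": "idf, computed as ...", "details": [
            leaf("n, number of documents containing term", 2),
            leaf("N, total number of documents with field", 1567),
        ]},
        {"value": 0.46767938, "description": "tf, computed as ...", "details": [
            leaf("freq, occurrences of term within document", 1),
            leaf("k1, term saturation parameter", 1.2),
            leaf("b, length normalization parameter", 0.75),
            leaf("dl, length of field", 2),
            leaf("avgdl, average length of field", 2.1474154),
        ]},
    ],
}

p = dict(leaves(explanation))
idf = math.log(1 + (p["N"] - p["n"] + 0.5) / (p["n"] + 0.5))
tf = p["freq"] / (p["freq"] + p["k1"] * (1 - p["b"] + p["b"] * p["dl"] / p["avgdl"]))
score = p["boost"] * idf * tf

# Should agree with the reported values up to float32 rounding.
print(round(idf, 4), round(tf, 4), round(score, 4))
```

Elasticsearch reports single-precision values, so the recomputed numbers match the explanation only to about six or seven significant digits.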
