Calling Lucene Through the Java API in Spring Boot: A Worked Example

Published: 2021-07-08 13:41:03 | Source: 億速云 | Author: 小新 | Category: Development Technology

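This article walks through an example of calling Lucene through its Java API in a Spring Boot project. It covers the background concepts in some detail and then works through test cases for building and querying an index.

Lucene began as a subproject of the Apache Software Foundation's Jakarta project (it has since become a top-level Apache project). It is an open-source full-text search toolkit: not a complete search application, but a library that provides a full query engine and indexing engine along with part of a text analysis engine (originally for two Western languages, English and German). Its goal is to give developers a simple, easy-to-use toolkit for adding full-text search to a target system, or for building a complete full-text search engine on top of it.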

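Full-Text Search Overview

Suppose a folder or a disk holds many files (plain-text notes, Word documents, Excel spreadsheets, PDFs), and we want to find the files that contain a given keyword. If we enter "Lucene", for example, every file whose content contains "Lucene" is returned. This is what is meant by full-text search.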

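The natural idea, then, is to build a mapping from keywords to the files that contain them. A diagram borrowed from a slide deck shows clearly how such a mapping works.

Inverted index

[Figure: inverted index diagram]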

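With this mapping in mind, let's look at Lucene's architecture.

The diagram below appears in virtually every introduction to Lucene, and it does sum up the essentials.

[Figure: Lucene architecture diagram]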

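As we can see, using Lucene comes down to two steps:

1 Create the index: IndexWriter indexes the various files and saves the result in the location where the index files are stored.

2 Query the index to find the documents related to a keyword.

Lucene implements this keyword-to-document mapping with the "inverted index" technique.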
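
To make the keyword-to-file mapping concrete, here is a minimal sketch of an inverted index in plain Java. It is an illustration under simplifying assumptions, not Lucene's implementation: the class name InvertedIndexSketch, its methods addFile and search, and the sample file names are all made up for this example, and text is split naively on non-alphanumeric characters.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: maps each lowercased word to the sorted set of file names containing it.
// Illustration only; this is not how Lucene stores its index.
class InvertedIndexSketch {

    private final Map<String, Set<String>> postings = new TreeMap<>();

    // Split the text on non-letter, non-digit characters and record the file under every word.
    void addFile(String fileName, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(fileName);
            }
        }
    }

    // Look up a keyword with a single map lookup instead of scanning every file.
    Set<String> search(String keyword) {
        return postings.getOrDefault(keyword.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addFile("notes.txt", "Lucene is a search library");
        index.addFile("report.txt", "Spark is a compute engine");
        index.addFile("slides.txt", "Indexing files with Lucene");

        System.out.println(index.search("Lucene")); // prints [notes.txt, slides.txt]
        System.out.println(index.search("engine")); // prints [report.txt]
    }
}
```

The cost of building the map is paid once at indexing time; after that, each search is a lookup on the keyword rather than a pass over every file.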

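Lucene's Mathematical Model

Documents, Fields, and Tokens

A document is the atomic unit of indexing and searching in Lucene. A document is a container for one or more fields, and the fields in turn hold the "real" content being searched. A field's value is run through tokenization to produce multiple tokens.

For example, the information about a novel (斗破蒼穹) can be treated as one document. That document contains several fields, such as the title (斗破蒼穹), author, synopsis, and last-updated time. Tokenizing the title field yields one or more tokens (斗, 破, 蒼, 穹).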
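
The document, field, and token relationship can be sketched in a few lines of plain Java. This is only a rough model under stated assumptions: a document is represented as a map from field name to value, the tokenizer emits one token per character (real analyzers are far more sophisticated), and the class name DocumentModelSketch, the tokenize helper, and the placeholder field values are invented for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of document -> fields -> tokens. These are not Lucene's Document and Field classes.
class DocumentModelSketch {

    // Naive tokenizer: one token per Unicode code point, skipping whitespace.
    static List<String> tokenize(String fieldValue) {
        List<String> tokens = new ArrayList<>();
        fieldValue.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    public static void main(String[] args) {
        // One document with several named fields (values other than the title are placeholders).
        Map<String, String> novel = new LinkedHashMap<>();
        novel.put("title", "斗破蒼穹");
        novel.put("author", "author name");
        novel.put("synopsis", "short synopsis");

        System.out.println(novel.keySet());               // prints [title, author, synopsis]
        System.out.println(tokenize(novel.get("title"))); // prints [斗, 破, 蒼, 穹]
    }
}
```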

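Lucene File Structure

Hierarchy

index
An index is stored in a single directory.

segment
An index can contain multiple segments. Segments are independent of one another; adding new documents may create a new segment, and separate segments can be merged into a new one.

document
The document is the basic unit of indexing. Documents are stored across segments, and one segment can hold many documents.

field
A document contains different kinds of information, which can be split into fields and indexed separately.

term
The term is the smallest unit of the index: the data left after lexical analysis and language processing.

Forward information

The containment relationships from index down to term are stored level by level: index-->segment-->document-->field-->term.

Reverse information

The reverse information stores the dictionary's mapping to postings lists: term-->document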
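
The statement that independent segments can be merged into a new one can be illustrated with a small plain-Java sketch. It rests on simplifying assumptions that are not taken from the article: each toy segment numbers its own documents from 0 and holds a term-->document mapping, and merging shifts the second segment's document numbers so they stay unique. The names SegmentMergeSketch, Segment, and merge are hypothetical, and real Lucene segments are sets of files that also track deletions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy segments with term -> document-number postings. Illustration only, not Lucene's format.
class SegmentMergeSketch {

    static final class Segment {
        final int docCount;                        // documents are numbered 0..docCount-1
        final Map<String, List<Integer>> postings; // term -> numbers of documents containing it

        Segment(int docCount, Map<String, List<Integer>> postings) {
            this.docCount = docCount;
            this.postings = postings;
        }
    }

    // Merge two independent segments into a new one. Document numbers from the
    // second segment are shifted by first.docCount so they do not collide.
    static Segment merge(Segment first, Segment second) {
        Map<String, List<Integer>> merged = new TreeMap<>();
        first.postings.forEach((term, docs) ->
                merged.computeIfAbsent(term, k -> new ArrayList<>()).addAll(docs));
        second.postings.forEach((term, docs) -> {
            List<Integer> target = merged.computeIfAbsent(term, k -> new ArrayList<>());
            for (int doc : docs) {
                target.add(doc + first.docCount);
            }
        });
        return new Segment(first.docCount + second.docCount, merged);
    }

    public static void main(String[] args) {
        Segment a = new Segment(2, Map.of("lucene", List.of(0, 1), "index", List.of(1)));
        Segment b = new Segment(1, Map.of("lucene", List.of(0), "spark", List.of(0)));

        Segment m = merge(a, b);
        System.out.println(m.docCount); // prints 3
        System.out.println(m.postings); // prints {index=[1], lucene=[0, 1, 2], spark=[2]}
    }
}
```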

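IndexWriter
One of the most important classes in Lucene. It adds documents to the index and controls parameters of the indexing process.

Analyzer
An analyzer processes the text the search engine encounters. Common ones include StandardAnalyzer, StopAnalyzer, and WhitespaceAnalyzer.

Directory
Where the index is stored. Lucene offers two kinds of location, disk and memory, through the FSDirectory and RAMDirectory classes respectively. Indexes are normally kept on disk.

Document
A Document is one unit to be indexed. Any file you want to index must first be converted into a Document object.

Field
A field: a named piece of content within a Document.

IndexSearcher
Lucene's basic search tool; every search goes through an IndexSearcher.

Query
Lucene supports fuzzy queries, semantic queries, phrase queries, combined queries, and more, through classes such as TermQuery, BooleanQuery, TermRangeQuery, and WildcardQuery. (The RangeQuery class named in older tutorials no longer exists in the 7.x line used here.)

QueryParser
A tool for parsing user input: it scans the string the user typed and produces a Query object.

Hits
Once a search finishes, the results have to be returned and shown to the user; only then has the search served its purpose. Early Lucene versions represented the result set with an instance of the Hits class. That class was removed in Lucene 3.0; in the 7.1.0 release used in this article, IndexSearcher.search returns a TopDocs object whose scoreDocs array holds the matches.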

Test Cases

GitHub Code

I have put the code on GitHub; import the spring-boot-lucene-demo project.

GitHub: spring-boot-lucene-demo

Add Dependencies

<!-- Query parsing against the analyzed index -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-queryparser</artifactId>
  <version>7.1.0</version>
</dependency>

<!-- Highlighting -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-highlighter</artifactId>
  <version>7.1.0</version>
</dependency>

<!-- smartcn Chinese analyzer (SmartChineseAnalyzer); depends on Lucene and its version must match the Lucene version -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-smartcn</artifactId>
  <version>7.1.0</version>
</dependency>

<!-- ik-analyzer Chinese analyzer -->
<dependency>
  <groupId>cn.bestwu</groupId>
  <artifactId>ik-analyzers</artifactId>
  <version>5.1.0</version>
</dependency>

<!-- MMSeg4j analyzer -->
<dependency>
  <groupId>com.chenlb.mmseg4j</groupId>
  <artifactId>mmseg4j-solr</artifactId>
  <version>2.4.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.solr</groupId>
      <artifactId>solr-core</artifactId>
    </exclusion>
  </exclusions>
</dependency>

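Configure Lucene

// Imports needed by the fields and JUnit 4 fixture methods below
// (the fragments in this article live inside a JUnit 4 test class)
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.After;
import org.junit.Before;

private Directory directory;

private IndexReader indexReader;

private IndexSearcher indexSearcher;

@Before
public void setUp() throws IOException {
  // Where the index is stored: indexDir/ under the current working directory
  directory = FSDirectory.open(Paths.get("indexDir/"));

  // Create a reader over the index
  indexReader = DirectoryReader.open(directory);

  // Create a searcher that queries the index through the reader
  indexSearcher = new IndexSearcher(indexReader);
}

@After
public void tearDown() throws Exception {
  // Release the reader and the underlying directory
  indexReader.close();
  directory.close();
}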

/**
 * Runs the query and prints the number of hits followed by each matching document.
 *
 * @param query the query to run
 * @throws IOException if the index cannot be read
 */
public void executeQuery(Query query) throws IOException {

  // Fetch at most the top 100 matches
  TopDocs topDocs = indexSearcher.search(query, 100);

  // Print the number of matching documents
  System.out.println("Total documents found: " + topDocs.totalHits);
  for (ScoreDoc scoreDoc : topDocs.scoreDocs) {

    // Load the stored Document for this hit
    Document document = indexSearcher.doc(scoreDoc.doc);
    System.out.println("id: " + document.get("id"));
    System.out.println("title: " + document.get("title"));
    System.out.println("content: " + document.get("content"));
  }
}

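/**
 * Prints the tokens an analyzer produces for a piece of text.
 * Note that the analyzer is closed when this method returns.
 *
 * @param analyzer the analyzer to try out
 * @param text     the text to tokenize
 * @throws IOException if tokenization fails
 */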
public void printAnalyzerDoc(Analyzer analyzer, String text) throws IOException {

  TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
  CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
  try {
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
      System.out.println(charTermAttribute.toString());
    }
    tokenStream.end();
  } finally {
    tokenStream.close();
    analyzer.close();
  }
}

Create the Index

@Test
public void indexWriterTest() throws IOException {
  long start = System.currentTimeMillis();

  // Where the index is stored: indexDir/ under the current working directory
  Directory directory = FSDirectory.open(Paths.get("indexDir/"));

  // From 6.6 onward a Version is no longer required, and IndexWriterConfig also has a
  // no-argument constructor that uses the default StandardAnalyzer.
  // (Kept here for reference only; the variable is not used below.)
  Version version = Version.LUCENE_7_1_0;

  //Analyzer analyzer = new StandardAnalyzer(); // standard analyzer, suited to English
  //Analyzer analyzer = new SmartChineseAnalyzer(); // Chinese analyzer
  //Analyzer analyzer = new ComplexAnalyzer(); // Chinese analyzer
  //Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer

  Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer

  // Create the index writer configuration
  IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);

  // Create the index writer
  IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

  // Create a Document object to hold the fields to be indexed

  Document doc = new Document();

  int id = 1;

  // Add the fields to doc.
  // IntPoint indexes id for numeric queries; StoredField keeps the value retrievable.
  doc.add(new IntPoint("id", id));
  doc.add(new StringField("title", "Spark", Field.Store.YES));
  doc.add(new TextField("content", "Apache Spark 是專為大規模數據處理而設計的快速通用的計算引擎", Field.Store.YES));
  doc.add(new StoredField("id", id));

  // Save doc to the index
  indexWriter.addDocument(doc);

  indexWriter.commit();
  // Close the writer
  indexWriter.close();

  long end = System.currentTimeMillis();
  System.out.println("Indexing took " + (end - start) + " ms");
}

Response

17:58:14.655 [main] DEBUG org.wltea.analyzer.dic.Dictionary - 加載擴展詞典：ext.dic
17:58:14.660 [main] DEBUG org.wltea.analyzer.dic.Dictionary - 加載擴展停止詞典：stopword.dic
Indexing took 879 ms

Delete Documents

@Test
public void deleteDocumentsTest() throws IOException {
  //Analyzer analyzer = new StandardAnalyzer(); // standard analyzer, suited to English
  //Analyzer analyzer = new SmartChineseAnalyzer(); // Chinese analyzer
  //Analyzer analyzer = new ComplexAnalyzer(); // Chinese analyzer
  //Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer

  Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer

  // Create the index writer configuration
  IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);

  // Create the index writer
  IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

  // Delete the documents whose title field contains the term "Spark".
  // The return value is the operation's sequence number, not the number of documents deleted.
  long seqNo = indexWriter.deleteDocuments(new Term("title", "Spark"));

  // IndexWriter also provides:
  // deleteDocuments(Query... queries): delete every Document matching any of the queries
  // deleteDocuments(Term... terms): delete every Document containing any of the terms
  // deleteAll(): delete all Documents

  // A delete issued through IndexWriter does not take effect immediately. The operation is buffered and is only actually applied when IndexWriter.commit() or IndexWriter.close() is called.

  indexWriter.commit();
  indexWriter.close();

  System.out.println("Deletion complete: " + seqNo);
}

Response

Deletion complete: 1

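Update Documents

/**
 * Update test.
 * An update is really a delete followed by an add.
 *
 * @throws IOException if the index cannot be written
 */
@Test
public void updateDocumentTest() throws IOException {
  //Analyzer analyzer = new StandardAnalyzer(); // standard analyzer, suited to English
  //Analyzer analyzer = new SmartChineseAnalyzer(); // Chinese analyzer
  //Analyzer analyzer = new ComplexAnalyzer(); // Chinese analyzer
  //Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer

  Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer

  // Create the index writer configuration
  IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);

  // Create the index writer
  IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

  Document doc = new Document();

  int id = 1;

  doc.add(new IntPoint("id", id));
  doc.add(new StringField("title", "Spark", Field.Store.YES));
  doc.add(new TextField("content", "Apache Spark 是專為大規模數據處理而設計的快速通用的計算引擎", Field.Store.YES));
  doc.add(new StoredField("id", id));

  // updateDocument deletes every document containing the given term, then adds doc.
  // The key must be an indexed term: "id" is an IntPoint, which creates no terms, so
  // new Term("id", "1") would match nothing and the old document would be left in place.
  // The StringField "title" is used as the key here instead.
  // The return value is a sequence number, not a count of updated documents.
  long seqNo = indexWriter.updateDocument(new Term("title", "Spark"), doc);
  System.out.println("Document updated: " + seqNo);
  // close() also commits the pending changes
  indexWriter.close();
}

Response

Document updated: 1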
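
Two behaviors noted in the tests above, that deletes are buffered until commit and that an update is a delete followed by an add, can be mimicked with a toy writer in plain Java. This is a sketch under stated assumptions and does not reflect IndexWriter's internals: the class BufferedWriterSketch and its methods are hypothetical, documents are reduced to title strings, and the "index" is a plain list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy writer that buffers operations and applies them in order on commit.
// Illustration only; this is not how Lucene's IndexWriter is implemented.
class BufferedWriterSketch {

    private final List<String> committed = new ArrayList<>();               // what searchers can see
    private final List<Consumer<List<String>>> pending = new ArrayList<>(); // buffered operations

    void addDocument(String title) {
        pending.add(docs -> docs.add(title));
    }

    void deleteDocuments(String title) {
        pending.add(docs -> docs.removeIf(title::equals));
    }

    // An update is a delete of the old entry followed by an add of the new one.
    void updateDocument(String oldTitle, String newTitle) {
        deleteDocuments(oldTitle);
        addDocument(newTitle);
    }

    // Apply the buffered operations in the order they were issued.
    void commit() {
        for (Consumer<List<String>> op : pending) {
            op.accept(committed);
        }
        pending.clear();
    }

    List<String> visible() {
        return new ArrayList<>(committed);
    }

    public static void main(String[] args) {
        BufferedWriterSketch writer = new BufferedWriterSketch();
        writer.addDocument("Spark");
        writer.addDocument("Lucene");
        writer.commit();

        writer.deleteDocuments("Spark");
        System.out.println(writer.visible()); // prints [Spark, Lucene] (delete still buffered)
        writer.commit();
        System.out.println(writer.visible()); // prints [Lucene]

        writer.updateDocument("Lucene", "Lucene 7");
        writer.commit();
        System.out.println(writer.visible()); // prints [Lucene 7]
    }
}
```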

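Search by Term

/**
 * Search by term.
 * <p>
 * TermQuery is the simplest and most commonly used Query. It can be thought of as a "term search":
 * the most basic search in a search engine is looking up a single term in the index, and TermQuery does exactly that.
 * In Lucene the term is the basic unit of search; in essence a term is a name/value pair,
 * where the "name" is the field name and the "value" is a keyword contained in that field.
 *
 * @throws IOException if the index cannot be read
 */
@Test
public void termQueryTest() throws IOException {

  String searchField = "title";
  // Build a query that matches documents whose title field contains the exact term "Spark"
  TermQuery query = new TermQuery(new Term(searchField, "Spark"));

  // Run the query and print the matching documents
  executeQuery(query);
}

Response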