How to Use Lucene Grouping Queries

Published: 2021-12-16 09:18:36 | Source: Yisu Cloud | Views: 134 | Author: iii | Category: Big Data

This article explains how to use grouping queries in Lucene. Grouping is a common source of confusion in day-to-day work, so the sections below cover the main parameters, the two-pass implementation, and a complete working example.

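Lucene's grouping support ships in the lucene-grouping module (Maven coordinates org.apache.lucene:lucene-grouping), which provides the group-query functionality.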

Introduction to Grouping

Grouping collects the documents in a Lucene index that share the same value of a given field into groups, and lets you compute statistics per group.

<!-- Grouping queries -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-grouping</artifactId>
    <version>5.5.2</version>
</dependency>

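Grouping in Depth

1. Grouping parameters

groupField: the field to group on (documents that lack groupField are collected into a group whose value is null)
groupSort: how the groups are ordered relative to each other; its sort fields decide the order in which groups are shown
topNGroup: how many groups to return; only groups 0 through topNGroup are computed
groupOffset: index of the first top group to return. For example, with groupOffset = 3 the groups from 3 up to topNGroup are shown; this value is useful for paging
withinGroupSort: how documents are sorted inside each group
maxDocsPerGroup: how many documents are kept per group
withinGroupOffset: index of the first document to show inside each group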
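
The paging use of groupOffset can be sketched in plain Java. This is a minimal illustration, not Lucene code: GroupPaging and its methods are hypothetical names, and pages are assumed to be 1-based. It mirrors what the service code in this article does when it sizes the first-pass collector with groupOffset + topNGroups and then skips the first groupOffset groups.

```java
import java.util.List;

/** Hypothetical helper (not a Lucene class): maps a page number to grouping offsets. */
final class GroupPaging {

    /** Index of the first group on a 1-based page; use this as groupOffset. */
    static int groupOffset(int page, int groupsPerPage) {
        if (page < 1 || groupsPerPage < 1) {
            throw new IllegalArgumentException("page and groupsPerPage must be >= 1");
        }
        return (page - 1) * groupsPerPage;
    }

    /** Number of groups the first pass has to collect so that the page can be filled. */
    static int firstPassSize(int page, int groupsPerPage) {
        return groupOffset(page, groupsPerPage) + groupsPerPage;
    }

    /** Applies the same window to an already sorted list of groups. */
    static <T> List<T> pageOf(List<T> sortedGroups, int page, int groupsPerPage) {
        int from = Math.min(groupOffset(page, groupsPerPage), sortedGroups.size());
        int to = Math.min(from + groupsPerPage, sortedGroups.size());
        return sortedGroups.subList(from, to);
    }
}
```

For page 3 with 10 groups per page, groupOffset is 20 and the first pass has to keep 30 groups. A page past the end of the list simply comes back empty.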

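2. Grouping implementation

Grouping takes two passes (class names as of Lucene 5.x):

Use TermFirstPassGroupingCollector to collect the top groups
Use TermSecondPassGroupingCollector to collect the documents that belong to each of those groups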
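
The idea behind the two passes can be modeled without Lucene. The sketch below is a conceptual simplification, not how the collectors are implemented: TwoPassGroupingSketch and Hit are hypothetical names, hits live in a plain list, group values are assumed non-null, and groups are ranked by the score of their best hit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Conceptual model of two-pass grouping; all names here are hypothetical. */
final class TwoPassGroupingSketch {

    /** One matching document: its id, its group value and its relevance score. */
    static final class Hit {
        final int docId;
        final String group;
        final float score;

        Hit(int docId, String group, float score) {
            this.docId = docId;
            this.group = group;
            this.score = score;
        }
    }

    /** Pass 1: pick the topNGroups group values, ranked by each group's best score. */
    static List<String> firstPass(List<Hit> hits, int topNGroups) {
        Map<String, Float> best = new HashMap<>();
        for (Hit h : hits) {
            Float current = best.get(h.group);
            if (current == null || h.score > current) {
                best.put(h.group, h.score);
            }
        }
        List<String> groups = new ArrayList<>(best.keySet());
        groups.sort((a, b) -> {
            int byScore = Float.compare(best.get(b), best.get(a)); // highest score first
            return byScore != 0 ? byScore : a.compareTo(b);        // stable tie-break
        });
        return new ArrayList<>(groups.subList(0, Math.min(topNGroups, groups.size())));
    }

    /** Pass 2: for the chosen groups only, keep the best maxDocsPerGroup hits. */
    static Map<String, List<Hit>> secondPass(List<Hit> hits, List<String> topGroups, int maxDocsPerGroup) {
        Map<String, List<Hit>> result = new LinkedHashMap<>();
        for (String g : topGroups) {
            result.put(g, new ArrayList<>());
        }
        for (Hit h : hits) {
            List<Hit> bucket = result.get(h.group);
            if (bucket != null) {           // hits of groups outside the top N are ignored
                bucket.add(h);
            }
        }
        for (List<Hit> bucket : result.values()) {
            bucket.sort((a, b) -> Float.compare(b.score, a.score));
            if (bucket.size() > maxDocsPerGroup) {
                bucket.subList(maxDocsPerGroup, bucket.size()).clear();
            }
        }
        return result;
    }
}
```

The second pass needs the output of the first one because it can only discard hits cheaply once the winning groups are known. That is also why the same query is run (or replayed from a cache) twice.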

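3. Grouping queries

Create a Query object for what you want to search; Lucene derives the final query syntax from it. Much as relational databases have SQL, Lucene has its own query syntax. For example, "word:北京長安中西醫結合醫院" matches documents whose word field is "北京長安中西醫結合醫院".

There are two ways to create a query object:

Using the Query subclasses that Lucene provides

Query is an abstract class, and Lucene provides many concrete query types, such as TermQuery for exact term matches and NumericRangeQuery for numeric ranges.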

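Query termQuery = new TermQuery(new Term("word", "北京市鼓樓中醫院"));

// Arguments: 1) field to query  2) minimum  3) maximum  4) include the minimum?  5) include the maximum?
// newLongRange takes Long bounds, so the literals need the L suffix
Query rangeQuery = NumericRangeQuery.newLongRange("size", 0L, 1000L, true, true);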


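Extension: combining queries
Query query1 = NumericRangeQuery.newLongRange("size", 0L, 1000L, true, true);
Query query2 = new TermQuery(new Term("word", "北京市鼓樓中醫院"));

// Add both clauses, then build the BooleanQuery
// (BooleanQuery.Builder is available since Lucene 5.3, where new BooleanQuery() is deprecated)
BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.add(query1, Occur.MUST);
builder.add(query2, Occur.MUST);
BooleanQuery query = builder.build();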

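Notes:
Occur.MUST: the clause must match
Occur.SHOULD: the clause should match, but a document that does not match it can still be returned
Occur.MUST_NOT: the clause must not match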
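
How the three Occur values interact can be approximated with a small model. This is a simplified sketch under stated assumptions, not Lucene's implementation: OccurSketch, Clause and matches are hypothetical names, scoring and FILTER clauses are ignored, and no minimum-should-match is set. Under those assumptions a SHOULD clause is optional once a MUST clause is present, at least one SHOULD clause has to match when there is no MUST clause, and a query made only of MUST_NOT clauses matches nothing.

```java
import java.util.List;
import java.util.function.Predicate;

/** Simplified matching model for MUST / SHOULD / MUST_NOT; names are hypothetical. */
final class OccurSketch {

    enum Occur { MUST, SHOULD, MUST_NOT }

    /** A condition on a document plus the way it takes part in the query. */
    static final class Clause<T> {
        final Predicate<T> test;
        final Occur occur;

        Clause(Predicate<T> test, Occur occur) {
            this.test = test;
            this.occur = occur;
        }
    }

    /** Returns true if the document would be matched by the combined clauses. */
    static <T> boolean matches(T doc, List<Clause<T>> clauses) {
        boolean hasMust = false;
        boolean hasShould = false;
        boolean anyShouldMatched = false;
        for (Clause<T> c : clauses) {
            boolean hit = c.test.test(doc);
            switch (c.occur) {
                case MUST:
                    hasMust = true;
                    if (!hit) return false;   // a failed MUST rejects the document
                    break;
                case MUST_NOT:
                    if (hit) return false;    // a matched MUST_NOT rejects the document
                    break;
                case SHOULD:
                    hasShould = true;
                    anyShouldMatched |= hit;
                    break;
            }
        }
        if (hasMust) {
            return true;                      // SHOULD clauses are optional here
        }
        return hasShould && anyShouldMatched; // without MUST, one SHOULD has to match
    }
}
```

In real queries SHOULD clauses also raise the score of documents that match them, which this model leaves out.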

Parsing a query expression with QueryParser

QueryParser parses a query expression entered by the user into a Query instance.

QueryParser queryParser = new QueryParser("word", new StandardAnalyzer());
Query query = queryParser.parse("北京市鼓樓中醫院");

Grouping example

1. Controller layer

package com.spring.master.lucene.group.controller;

import com.spring.master.lucene.group.service.GroupService;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import javax.annotation.Resource;

/**
 * @author Huan Lee
 * @version 1.0
 * @date 2020-09-14 14:19
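 * @description Work excels through diligence and decays through idleness; deeds succeed through thought and fail through carelessness.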
 */
@RestController
@RequestMapping(value = "/group")
public class GroupController {

    @Resource
    private GroupService groupService;

    /**
     * Create the index documents
     * @return
     */
    @GetMapping(value = "/createIndex")
    public String createIndex() {
        return groupService.createIndex();
    }


    /**
     * Grouped search
     * lucene 5
     * @return
     */
    @GetMapping(value = "/searchGroup5")
    public String searchGroup5(String keyword) {
        groupService.searchGroup5(keyword);
        return "true";
    }

    /**
     * Grouped search
     * lucene 7
     * @return
     */
    @GetMapping(value = "/searchGroup7")
    public String searchGroup7(String keyword) {
        groupService.searchGroup7(keyword);
        return "true";
    }
}


Endpoints:
localhost:2000/spring-master/group/createIndex
localhost:2000/spring-master/group/searchGroup5?keyword=北京長安中西醫結合醫院
localhost:2000/spring-master/group/searchGroup7?keyword=北京長安中西醫結合醫院

2. Service layer

package com.spring.master.lucene.group.impl;

import com.spring.master.global.Identities;
import com.spring.master.lucene.group.constant.SystemConstants;
import com.spring.master.lucene.group.service.GroupService;
import com.spring.master.lucene.suggest.constant.SuggestConstants;
import com.spring.master.lucene.suggest.util.FileUtils;
import com.spring.master.lucene.suggest.vo.DictionaryVO;
import lombok.extern.slf4j.Slf4j;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.queryparser.xml.builders.MatchAllDocsQueryBuilder;
import org.apache.lucene.search.*;
import org.apache.lucene.search.grouping.GroupDocs;
import org.apache.lucene.search.grouping.GroupingSearch;
import org.apache.lucene.search.grouping.SearchGroup;
import org.apache.lucene.search.grouping.TopGroups;
import org.apache.lucene.search.grouping.term.TermAllGroupsCollector;
import org.apache.lucene.search.grouping.term.TermFirstPassGroupingCollector;
import org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;
import org.nlpcn.commons.lang.util.MD5;
import org.springframework.stereotype.Service;

import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Random;

/**
 * @author Huan Lee
 * @version 1.0
 * @date 2020-09-14 18:45
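 * @description Work excels through diligence and decays through idleness; deeds succeed through thought and fail through carelessness.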
 */
@Service
@Slf4j
public class GroupServiceImpl implements GroupService {

    /**
     * Field to group on
     */
    static String groupField = "sourceType";

    /**
     * Standard analyzer
     */
    private static Analyzer analyzer = new StandardAnalyzer();


    @Override
    public String createIndex() {

        try {
            Directory directory = FSDirectory.open(Paths.get(SystemConstants.indexDir));
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
            indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);

            IndexWriter writer = new IndexWriter(directory, indexWriterConfig);

            // Load the DictionaryVO records from the CSV dictionaries
            List<DictionaryVO> diseases = FileUtils.readCsv(SuggestConstants.disease);
            List<DictionaryVO> doctors = FileUtils.readCsv(SuggestConstants.doctor);
            List<DictionaryVO> facultys = FileUtils.readCsv(SuggestConstants.faculty);
            List<DictionaryVO> hospitals = FileUtils.readCsv(SuggestConstants.hospital);
            List<DictionaryVO> drugcatalogues = FileUtils.readCsv(SuggestConstants.drugcatalogue);

            diseases.forEach(disease -> {
                Document doc = new Document();
                // The field used for grouping must be indexed as a SortedDocValuesField
                doc.add(new SortedDocValuesField(groupField, new BytesRef("Disease")));
                doc.add(new StringField("id", Identities.uuid(), Field.Store.YES));
                doc.add(new StringField("sourceType", "Disease", Field.Store.YES));
                doc.add(new TextField("word", disease.getWord(), Field.Store.YES));
                try {
                    writer.addDocument(doc);
                } catch (Exception e) {
                    log.error(e.getMessage());
                }
            });

            hospitals.forEach(hospital -> {
                Document doc = new Document();
                // The field used for grouping must be indexed as a SortedDocValuesField
                doc.add(new SortedDocValuesField(groupField, new BytesRef("Hospital")));
                doc.add(new StringField("id", Identities.uuid(), Field.Store.YES));
                doc.add(new StringField("sourceType", "Hospital", Field.Store.YES));
                doc.add(new TextField("word", hospital.getWord(), Field.Store.YES));
                try {
                    writer.addDocument(doc);
                } catch (Exception e) {
                    log.error(e.getMessage());
                }
            });

            facultys.forEach(faculty -> {
                Document doc = new Document();
                // The field used for grouping must be indexed as a SortedDocValuesField
                doc.add(new SortedDocValuesField(groupField, new BytesRef("Faculty")));
                doc.add(new StringField("id", Identities.uuid(), Field.Store.YES));
                doc.add(new StringField("sourceType", "Faculty", Field.Store.YES));
                doc.add(new TextField("word", faculty.getWord(), Field.Store.YES));
                try {
                    writer.addDocument(doc);
                } catch (Exception e) {
                    log.error(e.getMessage());
                }
            });

            drugcatalogues.forEach(drugcatalogue -> {
                Document doc = new Document();
                // The field used for grouping must be indexed as a SortedDocValuesField
                doc.add(new SortedDocValuesField(groupField, new BytesRef("Drugcatalogue")));
                doc.add(new StringField("id", Identities.uuid(), Field.Store.YES));
                doc.add(new StringField("sourceType", "Drugcatalogue", Field.Store.YES));
                doc.add(new TextField("word", drugcatalogue.getWord(), Field.Store.YES));
                try {
                    writer.addDocument(doc);
                } catch (Exception e) {
                    log.error(e.getMessage());
                }
            });

            doctors.forEach(doctor -> {
                Document doc = new Document();
                // The field used for grouping must be indexed as a SortedDocValuesField
                doc.add(new SortedDocValuesField(groupField, new BytesRef("Doctor")));
                doc.add(new StringField("id", Identities.uuid(), Field.Store.YES));
                doc.add(new StringField("sourceType", "Doctor", Field.Store.YES));
                doc.add(new TextField("word", doctor.getWord(), Field.Store.YES));
                try {
                    writer.addDocument(doc);
                } catch (Exception e) {
                    log.error(e.getMessage());
                }
            });

            // 6 -- this document has no SortedDocValuesField on the group field, so it falls into the null group
            Document doc = new Document();
            doc.add(new StringField("sourceType", "Faculty", Field.Store.YES));
            doc.add(new TextField("word", "高壓氧室", Field.Store.YES));
            doc.add(new StringField("id", "0", Field.Store.YES));
            writer.addDocument(doc);

            writer.commit();
            writer.close();
        } catch (Exception e) {
            log.error(e.getMessage());
        }
        return "true";
    }

    @Override
    public void searchGroup5(String keyword) {

        try {

            Directory directory = FSDirectory.open(Paths.get(SystemConstants.indexDir));

            IndexReader reader = DirectoryReader.open(directory);
            IndexSearcher searcher = new IndexSearcher(reader);
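            // Term query alternative:
//            Query query = new TermQuery(new Term("word", keyword));

            QueryParser queryParser = new QueryParser("word", new StandardAnalyzer());
            Query query = queryParser.parse(keyword);


            // How the groups are sorted relative to each other
            Sort groupSort = Sort.RELEVANCE;

            // Number of groups to return
            int topNGroups = 10;

            // Offset of the first group to return
            int groupOffset = 0;

            // Whether to fill in the sortValues of each SearchGroup
            boolean fillFields = true;

            // groupSort orders the groups, docSort orders the documents inside a group; they are usually the same but need not be
            Sort docSort = groupSort;

            // Offset of the first document inside each group (for in-group paging)
            int docOffset = 0;

            // Number of documents to return per group
            int docsPerGroup = 5;

            // Whether to compute the total number of groups
            boolean requiredTotalGroupCount = true;

            // Whether to cache scores
            boolean cacheScores = true;

            // Cache size for the first pass: 16 MB
            double maxCacheRAMMB = 16.0;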
            // Lucene 5.x API
            TermFirstPassGroupingCollector c1 = new TermFirstPassGroupingCollector("sourceType", groupSort, groupOffset + topNGroups);

            /** Wrap the TermFirstPassGroupingCollector in a CachingCollector so the first pass is cached and documents are not scored twice.
             *  CachingCollector adds caching to whatever collector it wraps.
             */
            CachingCollector cachedCollector = CachingCollector.create(c1, cacheScores, maxCacheRAMMB);

            // Run the first pass
            searcher.search(query, cachedCollector);

            // The first pass only yields the group values and their sort values (here, the score); how many and which documents fall into each group is resolved by the second pass
            Collection<SearchGroup<BytesRef>> topGroups = c1.getTopGroups(groupOffset, fillFields);

            if (topGroups == null) {
                System.out.println("No groups matched ");
                return;
            }

            // Whether to compute the score of each document inside a group
            boolean getScores = true;

            // Whether to compute the maximum score of each group
            boolean getMaxScores = true;

            // To adjust Lucene's scores, subclass TermSecondPassGroupingCollector
            TermSecondPassGroupingCollector c2 = new TermSecondPassGroupingCollector("sourceType", topGroups, groupSort, docSort, docOffset + docsPerGroup, getScores, getMaxScores, fillFields);

            // To count the total number of groups, run a TermAllGroupsCollector alongside the second-pass collector
            TermAllGroupsCollector allGroupsCollector = null;

            Collector secondPassCollector = null;

            // If the total group count is required
            if (requiredTotalGroupCount) {
                allGroupsCollector = new TermAllGroupsCollector("sourceType");
                secondPassCollector = MultiCollector.wrap(c2, allGroupsCollector);
            } else {
                secondPassCollector = c2;
            }

            // If the first pass was cached, replay the cache instead of searching again
            if (cachedCollector.isCached()) {
                cachedCollector.replay(secondPassCollector);
            } else {
                // Otherwise run the second pass as a new search
                searcher.search(query, secondPassCollector);
            }

            // Total number of groups
            int totalGroupCount = 0;

            // Total number of matching documents
            int totalHitCount = 0;

            // Number of matching documents that fall into the returned groups (usually equal to totalHitCount)
            int totalGroupedHitCount = -1;
            if (requiredTotalGroupCount) {
                totalGroupCount = allGroupsCollector.getGroupCount();
            }

            TopGroups<BytesRef> groupsResult = c2.getTopGroups(docOffset);
            // The three values logged below are the overall statistics of the search
            totalHitCount = groupsResult.totalHitCount;
            totalGroupedHitCount = groupsResult.totalGroupedHitCount;

            // Log the total number of groups
            log.info("groupCount: {}", totalGroupCount);
            log.info("groupsResult.totalHitCount: {}", totalHitCount);
            log.info("groupsResult.totalGroupedHitCount: {}", totalGroupedHitCount);
            log.info("************************************");

            int groupIdx = 0;

            // What follows are the second-pass results. If you only need the group values from the first pass and not the documents inside each group, the second pass is unnecessary
            // Iterate over the groups
            for (GroupDocs<BytesRef> groupDocs : groupsResult.groups) {
                groupIdx ++;
                // Decode with utf8ToString(): a BytesRef is a slice (bytes, offset, length), so new String(bytes) could include bytes outside the value
                String groupVL = groupDocs.groupValue == null ? "group field value is null" : groupDocs.groupValue.utf8ToString();

                // Value of the group field; groupIdx is the position of the group
                log.info("group[{}].groupFieldValue: {}", groupIdx, groupVL);
                // Total number of hits in this group
                log.info("group[{}].totalHits: {}", groupIdx, groupDocs.totalHits);

                int docIdx = 0;
                // Iterate over the documents in the group
                for (ScoreDoc scoreDoc : groupDocs.scoreDocs) {
                    docIdx++;
                    // Log the docID and score of each document in the group
                    log.info("group[" + groupIdx + "][" + docIdx + "]{docID:Score}:" + scoreDoc.doc + "/" + scoreDoc.score);
                    // The docID gives access to the whole Document; doc.get(fieldName) returns the value of a stored field
                    // Note: the Document returned by searcher.doc(docID) contains only stored fields, not docValues fields
                    Document doc = searcher.doc(scoreDoc.doc);
                    log.info("group[" + groupIdx + "][" + docIdx + "]{id:word}:" + doc.get("id") + ":" + doc.get("word"));
                }
                log.info("*****************************************");
            }
        } catch (Exception e) {
            log.error(e.getMessage());
        }
    }

    @Override
    public void searchGroup7(String keyword) {

        try {
            Directory directory = FSDirectory.open(Paths.get(SystemConstants.indexDir));
            IndexReader reader = DirectoryReader.open(directory);
            IndexSearcher searcher = new IndexSearcher(reader);

            // Standard analyzer
            Analyzer analyzer = new StandardAnalyzer();

            // Field to group on
            GroupingSearch groupingSearch = new GroupingSearch(groupField);
            // How the groups are sorted
            groupingSearch.setGroupSort(new Sort(SortField.FIELD_SCORE));
            // Whether to fill in the sortValues of each SearchGroup
            groupingSearch.setFillSortFields(true);
            groupingSearch.setCachingInMB(4.0, true);
            groupingSearch.setAllGroups(true);
//            groupingSearch.setAllGroupHeads(true);
            // Maximum number of documents returned per group
            groupingSearch.setGroupDocsLimit(10);

            QueryParser parser = new QueryParser("word", analyzer);
            Query query = parser.parse(keyword);

            TopGroups<BytesRef> result = groupingSearch.search(searcher, query, 0, 1000);

            // Total number of hits
            log.info("total hits: {}", result.totalHitCount);
            // Number of groups returned
            log.info("groups: {}", result.groups.length);

            // Log the results group by group
            for (GroupDocs<BytesRef> groupDocs : result.groups){
                if (groupDocs != null) {
                    if (groupDocs.groupValue != null) {
                        log.info("group: {}", groupDocs.groupValue.utf8ToString());
                    }else{
                        // One document was indexed without a SortedDocValuesField on the group field, so the groupValue of its group is null
                        log.info("group: {}", "unknown");
                    }
                    log.info("documents in group: {}", groupDocs.totalHits);

                    for(ScoreDoc scoreDoc : groupDocs.scoreDocs){
                        log.info("sourceType:{}", searcher.doc(scoreDoc.doc).get("sourceType"));
                        log.info("word:{}", searcher.doc(scoreDoc.doc).get("word"));
                        log.info("*****************************");
                    }
                    System.out.println("=====================================");
                }
            }
        } catch (Exception e) {
            log.error(e.getMessage());
        }
    }
}
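
A note on decoding group values: the group value is a BytesRef, which describes a slice (bytes, offset, length) of a byte array that may be larger than the value itself. That is why the code above reads it with utf8ToString(). The sketch below shows the difference in plain Java. Slice is a hypothetical stand-in for such a slice, not the Lucene class, and the shared backing array is an assumption made for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Illustrates decoding a (bytes, offset, length) slice; names are hypothetical. */
final class BytesSliceSketch {

    /** Minimal stand-in for a byte slice over a possibly shared array. */
    static final class Slice {
        final byte[] bytes;
        final int offset;
        final int length;

        Slice(byte[] bytes, int offset, int length) {
            this.bytes = bytes;
            this.offset = offset;
            this.length = length;
        }
    }

    /** Correct: decodes only the bytes that belong to the slice, as UTF-8. */
    static String decode(Slice s) {
        return new String(s.bytes, s.offset, s.length, StandardCharsets.UTF_8);
    }

    /** Pitfall: decodes the whole backing array and ignores offset and length. */
    static String decodeWholeArray(Slice s) {
        return new String(s.bytes, StandardCharsets.UTF_8);
    }
}
```

With a backing array that holds "DiseaseHospital" and a slice at offset 7 of length 8, decode returns "Hospital" while decodeWholeArray returns "DiseaseHospital".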

3. Util

package com.spring.master.lucene.util;

import com.spring.master.lucene.suggest.vo.DictionaryVO;
import lombok.extern.slf4j.Slf4j;

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

/**
 * @author Huan Lee
 * @version 1.0
 * @date 2020-09-11 09:57
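 * @description Work excels through diligence and decays through idleness; deeds succeed through thought and fail through carelessness.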
 */
@Slf4j
public class FileUtils {

    /**
     * Read a dictionary CSV file
     * @param fileNamePath path of the CSV file
     * @return the parsed dictionary entries (empty if the file cannot be read)
     */
    public static List<DictionaryVO> readCsv(String fileNamePath) {

        List<DictionaryVO> dictionarys = new ArrayList<>();
        // try-with-resources closes the reader even if parsing fails
        try (BufferedReader reader = new BufferedReader(new FileReader(fileNamePath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // A CSV file is comma-separated, so split each line on commas
                String[] item = line.split(",");
                dictionarys.add(new DictionaryVO(item[0], item[1], Long.parseLong(item[2]), Long.parseLong(item[3])));
            }
        } catch (Exception e) {
            e.printStackTrace();
            log.error(e.getMessage());
        }
        return dictionarys;
    }
}

4. Constant

package com.spring.master.lucene.group.constant;

/**
 * @author Huan Lee
 * @version 1.0
 * @date 2020-09-14 14:27
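 * @description Work excels through diligence and decays through idleness; deeds succeed through thought and fail through carelessness.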
 */
public class SystemConstants {

    /**
     * Index directory
     */
    public static final String indexDir = "/Users/lihuan/Documents/projects/git/me/dictionary";
}

That concludes this walkthrough of how to use Lucene grouping queries. Theory sticks best when paired with practice, so try the example against your own index.
