[Lucene] Apache Lucene Full-Text Search Engine Architecture: Building the Index (2) - shanheyongmu - IT Engineer's Digital Notebook

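The previous post gave an overview of full-text search. This post summarizes the first step of full-text search: building the index. The sample program in the previous post already contained fairly complete indexing code, but for the sake of completeness I want to cover index building in four parts: adding documents, deleting documents, updating documents, and boosting document fields. That also makes it easier for me to look things up later. I will focus on deleting documents (because there are two ways to do it) and on field boosting (because it comes up often in practice).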

1. Preparation

Create a new Maven project with the following pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>demo.lucene</groupId>
  <artifactId>Lucene02</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <build/>

  <dependencies>
    <!-- Lucene core -->
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
        <version>6.1.0</version>
    </dependency>
    <!-- Lucene query parser -->
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-queryparser</artifactId>
        <version>6.1.0</version>
    </dependency>
    <!-- Lucene common analyzers -->
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers-common</artifactId>
        <version>6.1.0</version>
    </dependency>

    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
  </dependencies>
</project>

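Lucene 6 requires JDK 1.8. On JDK 1.7, use a Lucene 5.x release such as 5.3.1 instead. Lucene 5.x itself needs Java 7, so it will not run on JDK 1.6.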
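To make the version requirement concrete, here is a small sketch that reads the running JVM's specification version and reports which Lucene line fits it. The class name `JavaVersionCheck` and both helper methods are my own illustration, and the mapping simply restates the note above; it is not an official compatibility table.

```java
// Hypothetical helper (not part of Lucene): maps the running Java version
// to the Lucene line suggested in the note above.
class JavaVersionCheck {

    // "1.7" -> 7, "1.8" -> 8, "11" -> 11 (newer JVMs dropped the "1." prefix)
    static int majorVersion(String specVersion) {
        String[] parts = specVersion.split("\\.");
        int first = Integer.parseInt(parts[0]);
        return (first == 1 && parts.length > 1) ? Integer.parseInt(parts[1]) : first;
    }

    // Assumption: only the versions used in this post are suggested.
    static String recommendedLucene(int javaMajor) {
        if (javaMajor >= 8) {
            return "6.1.0";
        }
        if (javaMajor == 7) {
            return "5.3.1";
        }
        return "older than 5.x";
    }

    public static void main(String[] args) {
        int major = majorVersion(System.getProperty("java.specification.version"));
        System.out.println("Java " + major + " -> Lucene " + recommendedLucene(major));
    }
}
```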

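Because there is a lot to test, create a JUnit test class IndexingTest1.java directly in the project and prepare some test data in it:

public class IndexingTest1 {

    private Directory dir; // where the index is stored

    // test data
    private String[] ids = {"1", "2", "3"}; // identifies each document
    private String[] citys = {"shanghai", "nanjing", "qingdao"};
    private String[] descs = {
        "Shanghai is a bustling city.",
        "Nanjing is a city of culture.",
        "Qingdao is a beautiful city"
    };

}

You can think of this data as three tables in a database: a document id table, a city table, and a city description table. The entries at the same position in the three arrays belong together, so each file contains an id, a city, and a city description. In other words, there are effectively three files, each describing one city. Now let's test and analyze each part.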
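To make the "one record per array position" idea explicit, here is a minimal plain-Java sketch that zips the three parallel arrays into one record per city. `CityRecords` and `toRecords` are names I made up for illustration; nothing here touches Lucene yet.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: position i of each array describes the same city.
class CityRecords {
    static final String[] IDS = {"1", "2", "3"};
    static final String[] CITYS = {"shanghai", "nanjing", "qingdao"};
    static final String[] DESCS = {
        "Shanghai is a bustling city.",
        "Nanjing is a city of culture.",
        "Qingdao is a beautiful city"
    };

    // Builds one field-name -> value map per array position.
    static List<Map<String, String>> toRecords() {
        List<Map<String, String>> records = new ArrayList<>();
        for (int i = 0; i < IDS.length; i++) {
            Map<String, String> record = new LinkedHashMap<>();
            record.put("id", IDS[i]);
            record.put("city", CITYS[i]);
            record.put("descs", DESCS[i]);
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        for (Map<String, String> record : toRecords()) {
            System.out.println(record);
        }
    }
}
```

Each map here plays the role that a Lucene Document will play in the next section.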

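2. Adding documents

Adding documents is really just building the index. First obtain an index writer, then add documents through it; each document is a Lucene Document. Here is the program, continuing in IndexingTest1.java: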

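import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

public class IndexingTest1 {

    private Directory dir; // where the index is stored

    // test data
    private String[] ids = {"1", "2", "3"}; // identifies each document
    private String[] citys = {"shanghai", "nanjing", "qingdao"};
    private String[] descs = {
        "Shanghai is a bustling city.",
        "Nanjing is a city of culture.",
        "Qingdao is a beautiful city"
    };

    // build the index
    @Test
    public void index() throws Exception {
        IndexWriter writer = getWriter(); // obtain the index writer
        for (int i = 0; i < ids.length; i++) {
            Document doc = new Document();
            // StringField: indexed as one un-analyzed token
            doc.add(new StringField("id", ids[i], Field.Store.YES));
            doc.add(new StringField("city", citys[i], Field.Store.YES));
            // TextField: analyzed into tokens; Store.NO means the text itself is not stored
            doc.add(new TextField("descs", descs[i], Field.Store.NO));
            writer.addDocument(doc); // add the document
        }
        writer.close(); // close() commits the pending changes to the index
    }

    // obtain an IndexWriter instance
    private IndexWriter getWriter() throws Exception {
        dir = FSDirectory.open(Paths.get("D:\\lucene2"));
        // standard analyzer: splits the text into lowercase tokens and drops stop words such as "is", "a", "the"
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer); // plug the analyzer into the writer config
        IndexWriter writer = new IndexWriter(dir, config); // create the index writer
        return writer;
    }

}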

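As you can see, the id, the city name, and the city description are different parts of one document, and each of the three becomes a Field so that it can be queried later. Once a document has its fields, it is handed to the writer to be written to the index. In a real application you would first obtain a file, set up some Fields from the information in that file, wrap those Fields in a Document, and pass it to the index writer, much like the code in the previous post.
Now run the index method, and the generated index files appear under D:\lucene2\. We can also write a test method to check how many documents were written: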

public class IndexingTest1 {

    // code above omitted

    /***********  tests start here  ****************/
    // check how many documents were written
    @Test
    public void testIndexWriter() throws Exception {
        IndexWriter writer = getWriter();
        System.out.println("Total documents written: " + writer.numDocs());
        writer.close();
    }
}
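The difference between the StringField and TextField used in index() is easier to see in a toy model. The sketch below is a plain-Java inverted index that I wrote for illustration only: `ToyInvertedIndex`, its methods, and the tiny stop-word list are assumptions of mine, and the tokenizer is a rough stand-in for StandardAnalyzer, not Lucene's actual implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model, not Lucene: maps "field:term" to the ids of the documents containing it.
class ToyInvertedIndex {
    // Assumed stop words; Lucene's real list is longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("is", "a", "of", "the"));

    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Like a StringField: the whole value becomes a single term, unchanged.
    void addExact(int docId, String field, String value) {
        post(docId, field, value);
    }

    // Like a TextField: lowercase, split on non-alphanumerics, drop stop words.
    void addAnalyzed(int docId, String field, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                post(docId, field, token);
            }
        }
    }

    private void post(int docId, String field, String term) {
        postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(docId);
    }

    // Returns the ids of documents whose field contains exactly this term.
    SortedSet<Integer> lookup(String field, String term) {
        SortedSet<Integer> hits = postings.get(field + ":" + term);
        return hits == null ? new TreeSet<Integer>() : hits;
    }

    // Indexes the three city documents from this post, with doc ids 0..2.
    static ToyInvertedIndex build() {
        String[] citys = {"shanghai", "nanjing", "qingdao"};
        String[] descs = {
            "Shanghai is a bustling city.",
            "Nanjing is a city of culture.",
            "Qingdao is a beautiful city"
        };
        ToyInvertedIndex index = new ToyInvertedIndex();
        for (int i = 0; i < citys.length; i++) {
            index.addExact(i, "city", citys[i]);
            index.addAnalyzed(i, "descs", descs[i]);
        }
        return index;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = build();
        System.out.println("descs:city    -> " + index.lookup("descs", "city"));
        System.out.println("descs:is      -> " + index.lookup("descs", "is"));
        System.out.println("city:shanghai -> " + index.lookup("city", "shanghai"));
    }
}
```

In this model the term "city" in descs matches all three documents, the stop word "is" matches none, and the exact field only matches the full, unmodified value.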

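3. Reading documents

Reading documents requires an IndexReader. When opening it, pass in the path where the index lives, which is the path D:\lucene2\ used above. After that the document counts can be read. A quick test: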

public class IndexingTest1 {

    // code above omitted
    // additional imports: org.apache.lucene.index.DirectoryReader, org.apache.lucene.index.IndexReader

    // test reading documents
    @Test
    public void testIndexReader() throws Exception {
        dir = FSDirectory.open(Paths.get("D:\\lucene2"));
        IndexReader reader = DirectoryReader.open(dir);
        System.out.println("Max docs: " + reader.maxDoc());
        System.out.println("Actual docs: " + reader.numDocs());
        reader.close();
    }
}

The test data contains only three documents, so the result is:

Max docs: 3
Actual docs: 3

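4. Deleting documents

I want to spend some time here. There are two ways to delete a document, each with its own characteristics: deleting before a merge and deleting after a merge. Deleting before a merge does not physically remove the document; it only marks the document as deleted. Deleting after a merge means the document has really been removed from the index.
What is each one good for? In a large project with heavy traffic and concurrent access, frequent physical deletes can hurt performance. In that case use delete-before-merge: only mark the document as deleted, and when traffic is low (for example when the CPU is idle) run a job that purges all marked documents in one go. Conversely, if the data volume is small and deletes barely affect performance, just delete directly, i.e. use delete-after-merge. Here is a test for each: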

public class IndexingTest1 {

    // code above omitted
    // additional import: org.apache.lucene.index.Term

    // test deleting a document, before a merge
    @Test
    public void testDeleteBeforeMerge() throws Exception {
        IndexWriter writer = getWriter();
        System.out.println("Docs before delete: " + writer.numDocs());
        writer.deleteDocuments(new Term("id", "1")); // delete the document with id=1
        writer.commit(); // commit the delete; the document is only marked, not physically removed
        System.out.println("Max docs after delete: " + writer.maxDoc());
        System.out.println("Actual docs after delete: " + writer.numDocs());
        writer.close();
    }

    // test deleting a document, after a merge
    @Test
    public void testDeleteAfterMerge() throws Exception {
        IndexWriter writer = getWriter();
        System.out.println("Docs before delete: " + writer.numDocs());
        writer.deleteDocuments(new Term("id", "1")); // delete the document with id=1
        writer.forceMergeDeletes(); // force-merge segments with deletions, physically dropping the deleted documents
        writer.commit(); // commit; the document is now really gone
        System.out.println("Max docs after delete: " + writer.maxDoc());
        System.out.println("Actual docs after delete: " + writer.numDocs());
        writer.close();
    }

}

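One thing to watch when testing: after running the delete-before-merge test, delete all index files in the index path and rerun the index method above to regenerate them before running the delete-after-merge test. Otherwise the document that was already deleted will skew the second test. The results:

Delete before merge:
  Docs before delete: 3
  Max docs after delete: 3
  Actual docs after delete: 2
Delete after merge:
  Docs before delete: 3
  Max docs after delete: 2
  Actual docs after delete: 2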
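The numbers above can be reproduced with a toy model of a single index segment. This is only a sketch under my own assumptions: `ToySegment` and its methods are invented for illustration and greatly simplify what Lucene does internally. The idea is that a delete only clears a "live" bit, while a merge rewrites the segment without the dead documents.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model, not Lucene: one segment holding document ids plus a live-docs bitset.
class ToySegment {
    private List<String> ids = new ArrayList<>();
    private BitSet live = new BitSet();

    void add(String id) {
        live.set(ids.size()); // new documents start out live
        ids.add(id);
    }

    // Delete before merge: only clear the live bit, the slot stays occupied.
    void delete(String id) {
        for (int i = 0; i < ids.size(); i++) {
            if (ids.get(i).equals(id)) {
                live.clear(i);
            }
        }
    }

    // Merge: rewrite the segment keeping only the live documents.
    void mergeDeletes() {
        List<String> kept = new ArrayList<>();
        for (int i = live.nextSetBit(0); i >= 0; i = live.nextSetBit(i + 1)) {
            kept.add(ids.get(i));
        }
        ids = kept;
        live = new BitSet();
        live.set(0, ids.size());
    }

    int maxDoc() {
        return ids.size(); // counts slots, including documents marked as deleted
    }

    int numDocs() {
        return live.cardinality(); // counts live documents only
    }

    public static void main(String[] args) {
        ToySegment segment = new ToySegment();
        segment.add("1");
        segment.add("2");
        segment.add("3");
        segment.delete("1");
        System.out.println("marked: maxDoc=" + segment.maxDoc() + " numDocs=" + segment.numDocs());
        segment.mergeDeletes();
        System.out.println("merged: maxDoc=" + segment.maxDoc() + " numDocs=" + segment.numDocs());
    }
}
```

After the mark-only delete the model reports maxDoc=3 and numDocs=2; after the merge both are 2, matching the two test runs.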

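5. Updating documents

Modifying a document means updating it. The idea is to create a new Document, fill in the same fields as before with the new values, and then replace the original document with it. Here is the test: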

public class IndexingTest1 {

    // code above omitted
    // additional import: org.apache.lucene.index.Term

    // test updating
    @Test
    public void testUpdate() throws Exception {
        IndexWriter writer = getWriter();
        // build the replacement Document
        Document doc = new Document();
        doc.add(new StringField("id", ids[0], Field.Store.YES)); // ids[0] is "1", so the replacement keeps the same id
        doc.add(new StringField("city", "shanghai22", Field.Store.YES));
        doc.add(new TextField("descs", "shanghai update", Field.Store.NO));

        // replace the document whose id is 1 with the new document
        writer.updateDocument(new Term("id", "1"), doc);
        writer.close();
        System.out.println(doc.getField("descs"));
    }
}

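Running it prints indexed,tokenized<descs:shanghai update>. The descs value shown is the description from the newly built document. Note that this line prints the field of the in-memory Document object, so to confirm the change in the index itself, search for id 1 and check that the stored city is now shanghai22.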
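Under the hood, updateDocument(term, doc) deletes every document that contains the term and then adds the new document. The plain-Java sketch below mimics that behavior on a list of maps; `ToyUpdate` and its helpers are hypothetical names of mine, and the model ignores everything else Lucene does (segments, atomicity, analysis).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model, not Lucene: an "index" is just a list of field-name -> value maps.
class ToyUpdate {

    static Map<String, String> doc(String id, String city) {
        Map<String, String> d = new LinkedHashMap<>();
        d.put("id", id);
        d.put("city", city);
        return d;
    }

    // Update = delete every document matching field:value, then add the replacement.
    static void updateDocument(List<Map<String, String>> index,
                               String field, String value,
                               Map<String, String> replacement) {
        index.removeIf(d -> value.equals(d.get(field)));
        index.add(replacement);
    }

    // Runs the update from this section against the three city documents.
    static List<Map<String, String>> demo() {
        List<Map<String, String>> index = new ArrayList<>();
        index.add(doc("1", "shanghai"));
        index.add(doc("2", "nanjing"));
        index.add(doc("3", "qingdao"));
        updateDocument(index, "id", "1", doc("1", "shanghai22"));
        return index;
    }

    public static void main(String[] args) {
        for (Map<String, String> d : demo()) {
            System.out.println(d);
        }
    }
}
```

Because the update is a delete followed by an add, the replacement should carry the same id as the term used to find the old document. If it carried a different id, the old id would simply disappear from the index.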

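6. Boosting document fields

This part deserves some attention. When we search, if the term we are looking for occurs in several documents, Lucene lists them according to its own ranking rules. But what if I want a particular document to come first? In other words, how do I get Lucene to list the matching documents in the order I want?
That may sound abstract, so here is a simple example. Four people, A, B, C, and D, each wrote an article about Java, and every title contains "java". Now I search for articles containing the string "java", but D is the boss, and if the boss's article is among the results I want to read it first. That means the boss's article should be at the top, and this is where I can set a weight in the program.
To simulate this scenario, create a new test class IndexingTest2.java and prepare some sample data: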

public class IndexingTest2 {

    private Directory dir; // where the index is stored

    // test data: four people each wrote one article; Json is the boss
    private String[] ids = {"1", "2", "3", "4"};
    private String[] authors = {"Jack", "Marry", "John", "Json"};
    private String[] positions = {"accounting", "technician", "salesperson", "boss"};
    private String[] titles = {"Java is a good language.", "Java is a cross platform language", "Java powerful", "You should learn java"};
    private String[] contents = {
            "If possible, use the same JRE major version at both index and search time.",
            "When upgrading to a different JRE major version, consider re-indexing. ",
            "Different JRE major versions may implement different versions of Unicode.",
            "For example: with Java 1.4, `LetterTokenizer` will split around the character U+02C6."
    };

}

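As usual, we first have to build an index for this data. The process is the same as adding documents above; the only difference is that a boost is applied while indexing: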

public class IndexingTest2 {

    // code above omitted

    @Test
    public void index() throws Exception { // build the index
        // the writer appends to an existing index by default, so clear D:\lucene2 first if it holds older test data
        dir = FSDirectory.open(Paths.get("D:\\lucene2"));
        IndexWriter writer = getWriter();
        for (int i = 0; i < ids.length; i++) {
            Document doc = new Document();
            doc.add(new StringField("id", ids[i], Field.Store.YES));
            doc.add(new StringField("author", authors[i], Field.Store.YES));
            doc.add(new StringField("position", positions[i], Field.Store.YES));

            // this is the boosting part: boost the title field, because that is the field searched later
            TextField field = new TextField("title", titles[i], Field.Store.YES);
            // boost only if this person's position is boss
            if ("boss".equals(positions[i])) {
                field.setBoost(1.5f); // default is 1; 1.5 raises the weight, a value below 1 lowers it
            }

            doc.add(field);
            doc.add(new TextField("content", contents[i], Field.Store.NO));
            writer.addDocument(doc); // add the document
        }
        writer.close(); // close() commits the pending changes to the index
    }

    // obtain an IndexWriter instance
    private IndexWriter getWriter() throws Exception {
        // standard analyzer: splits the text into lowercase tokens and drops stop words such as "is", "a", "the"
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer); // plug the analyzer into the writer config
        IndexWriter writer = new IndexWriter(dir, config); // create the index writer
        return writer;
    }
}

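As the code shows, to boost a field you simply call setBoost() on that field, after checking whatever condition you have chosen. Note that Field.setBoost() is available in the Lucene 6.1.0 used here; index-time boosts were removed in Lucene 7.0, where boosting is done at query time instead. First run the index method above to build the index, then write a test method to try it out: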

public class IndexingTest2 {

    // code above omitted
    // additional imports: org.apache.lucene.index.DirectoryReader, IndexReader, Term;
    // org.apache.lucene.search.IndexSearcher, Query, ScoreDoc, TermQuery, TopDocs

    // field boost test
    @Test
    public void search() throws Exception {
        dir = FSDirectory.open(Paths.get("D:\\lucene2"));
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher search = new IndexSearcher(reader);
        String searchField = "title"; // the field to search
        String q = "java"; // the term to search for (lowercase, because the analyzer lowercased the indexed tokens)
        Term term = new Term(searchField, q);
        Query query = new TermQuery(term);

        TopDocs hits = search.search(query, 10);
        System.out.println("Query '" + q + "' matched " + hits.totalHits + " documents");
        for (ScoreDoc score : hits.scoreDocs) {
            Document doc = search.doc(score.doc);
            System.out.println(doc.get("author")); // print the author of each hit
        }
        reader.close();
    }
}
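To get a feel for why a boost changes the order of the hits, here is a toy ranking in plain Java. It is only a sketch under loud assumptions: the score is simply the boost divided by the square root of the number of words in the title, which is my own simplification and not the formula Lucene 6 actually uses, so the real order returned by search() may differ. `ToyBoostRanking` and its methods are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, not Lucene's scoring: score = boost / sqrt(number of words in the title).
class ToyBoostRanking {
    static final String[] AUTHORS = {"Jack", "Marry", "John", "Json"};
    static final String[] POSITIONS = {"accounting", "technician", "salesperson", "boss"};
    static final String[] TITLES = {
        "Java is a good language.",
        "Java is a cross platform language",
        "Java powerful",
        "You should learn java"
    };

    // Shorter titles score higher; the boost multiplies the result.
    static double score(String title, float boost) {
        int words = title.trim().split("\\s+").length;
        return boost / Math.sqrt(words);
    }

    // Returns the authors ordered by descending toy score.
    static List<String> rank(float bossBoost) {
        final double[] scores = new double[AUTHORS.length];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < AUTHORS.length; i++) {
            float boost = "boss".equals(POSITIONS[i]) ? bossBoost : 1.0f;
            scores[i] = score(TITLES[i], boost);
            order.add(i);
        }
        order.sort((a, b) -> Double.compare(scores[b], scores[a]));
        List<String> ranked = new ArrayList<>();
        for (int i : order) {
            ranked.add(AUTHORS[i]);
        }
        return ranked;
    }

    public static void main(String[] args) {
        System.out.println("no boost:  " + rank(1.0f));
        System.out.println("boost 1.5: " + rank(1.5f));
    }
}
```

In this toy model John's two-word title ranks first without a boost, while a boost of 1.5 on the boss's title lifts Json to the top, which is the effect the boost in index() is aiming for.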