Java Lucene NGramTokenizer

Note: The question and answers below are from StackOverflow and are provided under the CC BY-SA 4.0 license. If you use or share them, you must attribute the original authors (not me). Original: http://stackoverflow.com/questions/13433670/


Tags: java, lucene, tokenize, n-gram

Asked by CodeKingPlusPlus

I am trying to tokenize strings into ngrams. Strangely, in the documentation for the NGramTokenizer I do not see a method that will return the individual ngrams that were tokenized. In fact, I only see two methods in the NGramTokenizer class that return String objects.

Here is the code that I have:

Reader reader = new StringReader("This is a test string");
NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3);
  1. Where are the ngrams that were tokenized?
  2. How can I get the output in Strings/Words?

I want my output to be like: This, is, a, test, string, This is, is a, a test, test string, This is a, is a test, a test string.

Answered by femtoRgon

I don't think you'll find what you're looking for by hunting for methods that return String. You'll need to deal with Attributes.

It should work something like this:

Reader reader = new StringReader("This is a test string");
NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3);
CharTermAttribute charTermAttribute = gramTokenizer.addAttribute(CharTermAttribute.class);
gramTokenizer.reset();

while (gramTokenizer.incrementToken()) {
    String token = charTermAttribute.toString();
    //Do something
}
gramTokenizer.end();
gramTokenizer.close();
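
Note that NGramTokenizer works at the character level, so the tokens produced by this loop are character n-grams (tokens such as "T", "Th", "Thi", "h", "hi"); the word-level variant further below uses a ShingleFilter instead.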

Be sure to reset() the Tokenizer if it needs to be reused after that, though.
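
As a rough sketch of what reuse could look like with the Lucene 3.x API (where Tokenizer.reset(Reader) rebinds the tokenizer to new input; in 4.x+ this became setReader(Reader)), using a placeholder second input string:

Reader reader2 = new StringReader("another test string");
gramTokenizer.reset(reader2);   // rebind the tokenizer to the new input
gramTokenizer.reset();          // reset the stream state before consuming again

while (gramTokenizer.incrementToken()) {
    String token = charTermAttribute.toString();
    //Do something
}
gramTokenizer.end();
gramTokenizer.close();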



To tokenize groupings of words rather than characters, per the comments:

Reader reader = new StringReader("This is a test string");
TokenStream tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
// Wrap the word tokenizer in a ShingleFilter to build word n-grams (shingles).
// The minimum shingle size must be at least 2; single words are still emitted
// because outputUnigrams defaults to true.
tokenizer = new ShingleFilter(tokenizer, 2, 3);
CharTermAttribute charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);
tokenizer.reset();

while (tokenizer.incrementToken()) {
    String token = charTermAttribute.toString();
    //Do something
}
tokenizer.end();
tokenizer.close();
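
With unigram output left enabled (the default), this chain should yield the word-level output the question asks for: the individual words plus two- and three-word shingles such as "This is", "a test", and "This is a".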

Answered by Amir

For a recent version of Lucene (4.2.1), this is clean code that works. Before executing it, you have to add two jar files to the classpath:

  • lucene-core-4.2.1.jar
  • lucene-analyzers-common-4.2.1.jar

Find these files at http://www.apache.org/dyn/closer.cgi/lucene/java/4.2.1

//LUCENE 4.2.1
Reader reader = new StringReader("This is a test string");
NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3);

CharTermAttribute charTermAttribute = gramTokenizer.addAttribute(CharTermAttribute.class);
gramTokenizer.reset(); // the TokenStream contract requires reset() before consuming tokens

while (gramTokenizer.incrementToken()) {
    String token = charTermAttribute.toString();
    System.out.println(token);
}
gramTokenizer.end();
gramTokenizer.close();

Answered by Pavan Patil

package ngramalgoimpl;
import java.util.*;

public class ngr {

    public static List<String> n_grams(int n, String str) {
        List<String> n_grams = new ArrayList<String>();
        String[] words = str.split(" ");
        for (int i = 0; i < words.length - n + 1; i++)
            n_grams.add(concatination(words, i, i+n));
        return n_grams;
    }
    /* StringBuilder is used to concatenate the words into a single space-separated string. */
    public static String concatination(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append((i > start ? " " : "") + words[i]);
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            for (String ngram : n_grams(n, "This is my car."))
                System.out.println(ngram);
            System.out.println();
        }
    }
}
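
For reference, running the main method above should print the 1-, 2-, and 3-word grams of "This is my car." in turn, with a blank line between each group:

This
is
my
car.

This is
is my
my car.

This is my
is my car.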

Answered by Mark Leighton Fisher

Without creating a test program, I would guess that incrementToken() returns the next token, which will be one of the ngrams.

For example, using ngram lengths of 1-3 with the string 'a b c d', NGramTokenizer could return:

a
a b
a b c
b
b c
b c d
c
c d
d

where 'a', 'a b', etc. are the resulting ngrams.

[Edit]

You might also want to look at Querying lucene tokens without indexing, as it talks about peeking into the token stream.