Java: Generating N-grams from a Sentence

Question

Asked by Preetam Purbia

How can I generate the n-grams of a string like this:

String input = "This is my car.";

I want to generate n-grams from this input:

Input Ngram size = 3

Output should be:

This
is
my
car

This is
is my
my car

This is my
is my car

Please suggest how to implement this in Java, or whether a library is available for it.

I am trying to use this NGramTokenizer, but it produces n-grams of character sequences, and I want n-grams of word sequences.

Answer 1

Accepted answer by Shashikant Kore

You are looking for ShingleFilter.

Update: The link points to version 3.0.2. In newer versions of Lucene, this class may be in a different package.

Answer 2

Answered by aioobe

I believe this would do what you want:

import java.util.*;

public class Test {

    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<String>();
        String[] words = str.split(" ");
        for (int i = 0; i < words.length - n + 1; i++)
            ngrams.add(concat(words, i, i+n));
        return ngrams;
    }

    public static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append((i > start ? " " : "") + words[i]);
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            for (String ngram : ngrams(n, "This is my car."))
                System.out.println(ngram);
            System.out.println();
        }
    }
}

Output:

This
is
my
car.

This is
is my
my car.

This is my
is my car.
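
Note that the output above keeps the trailing period ("car."), whereas the question asks for "car". Here is a minimal sketch of one way to get that, assuming it is acceptable to treat every run of non-word characters as a separator. The class name WordNgrams and the tokenize helper are made up for this example and are not part of the answer above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordNgrams {

    // Splits on runs of non-word characters, so "car." becomes "car".
    // Assumption: punctuation is not needed in the n-grams. Note that this
    // also splits words such as "don't" at the apostrophe.
    static String[] tokenize(String str) {
        // Drop leading separators so split() does not return an empty first token.
        String stripped = str.replaceAll("^\\W+", "");
        if (stripped.isEmpty()) {
            return new String[0];
        }
        return stripped.split("\\W+");
    }

    // Returns all contiguous word n-grams of exactly n words (n is assumed to be >= 1).
    static List<String> ngrams(int n, String str) {
        String[] words = tokenize(str);
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            for (String ngram : ngrams(n, "This is my car.")) {
                System.out.println(ngram);
            }
            System.out.println();
        }
    }
}
```

With the sample sentence this prints the same three groups as above, but with "car" instead of "car.".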

An "on-demand" solution implemented as an Iterator:

class NgramIterator implements Iterator<String> {

    String[] words;
    int pos = 0, n;

    public NgramIterator(int n, String str) {
        this.n = n;
        words = str.split(" ");
    }

    public boolean hasNext() {
        return pos < words.length - n + 1;
    }

    public String next() {
        StringBuilder sb = new StringBuilder();
        for (int i = pos; i < pos + n; i++)
            sb.append((i > pos ? " " : "") + words[i]);
        pos++;
        return sb.toString();
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }
}
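
A usage sketch for the iterator, in case it helps. The iterator is repeated here as a nested class, lightly adapted, so that the example compiles on its own. The NoSuchElementException check and the wrapper class name NgramIteratorDemo are my additions; remove() is left out because the default in Java 8 and later already throws UnsupportedOperationException.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

class NgramIteratorDemo {

    // Same idea as the iterator above: n-grams are built only when requested.
    static class NgramIterator implements Iterator<String> {
        private final String[] words;
        private final int n;
        private int pos = 0;

        NgramIterator(int n, String str) {
            this.n = n;
            this.words = str.split(" ");
        }

        @Override
        public boolean hasNext() {
            return pos + n <= words.length;
        }

        @Override
        public String next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            StringBuilder sb = new StringBuilder();
            for (int i = pos; i < pos + n; i++) {
                if (i > pos) {
                    sb.append(' ');
                }
                sb.append(words[i]);
            }
            pos++;
            return sb.toString();
        }
    }

    public static void main(String[] args) {
        Iterator<String> it = new NgramIterator(2, "This is my car.");
        while (it.hasNext()) {
            System.out.println(it.next()); // This is / is my / my car.
        }
    }
}
```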

Answer 3

Answered by Landei

This code returns an array of all word n-grams of the given length:

public static String[] ngrams(String s, int len) {
    String[] parts = s.split(" ");
    String[] result = new String[parts.length - len + 1];
    for(int i = 0; i < parts.length - len + 1; i++) {
       StringBuilder sb = new StringBuilder();
       for(int k = 0; k < len; k++) {
           if(k > 0) sb.append(' ');
           sb.append(parts[i+k]);
       }
       result[i] = sb.toString();
    }
    return result;
}

E.g.

System.out.println(Arrays.toString(ngrams("This is my car", 2)));
//--> [This is, is my, my car]
System.out.println(Arrays.toString(ngrams("This is my car", 3)));
//--> [This is my, is my car]
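
One thing to watch out for: if len exceeds the number of words by more than one, parts.length - len + 1 is negative and the array allocation throws NegativeArraySizeException. A minimal sketch with a guard follows. The class name SafeNgrams is made up, and returning an empty array for oversized len is my assumption about the desired behaviour.

```java
import java.util.Arrays;

class SafeNgrams {

    // Returns all word n-grams of exactly len words, or an empty array
    // when the sentence has fewer than len words.
    static String[] ngrams(String s, int len) {
        if (len < 1) {
            throw new IllegalArgumentException("len must be at least 1");
        }
        String[] parts = s.split(" ");
        int count = Math.max(0, parts.length - len + 1);
        String[] result = new String[count];
        for (int i = 0; i < count; i++) {
            result[i] = String.join(" ", Arrays.copyOfRange(parts, i, i + len));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(ngrams("This is my car", 2)));
        //--> [This is, is my, my car]
        System.out.println(Arrays.toString(ngrams("This is my car", 6)));
        //--> []
    }
}
```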

Answer 4

Answered by tozCSS

/**
 * Generates all contiguous word n-grams of sizes 1 to maxGramSize.
 *
 * @param str sentence containing at least one word
 * @param maxGramSize maximum n-gram size, at least 1
 * @return list of contiguous word n-grams up to maxGramSize from the sentence
 */
public static List<String> generateNgramsUpto(String str, int maxGramSize) {

    // Split on runs of non-word characters (spaces, punctuation).
    List<String> sentence = Arrays.asList(str.split("\\W+"));

    List<String> ngrams = new ArrayList<String>();
    int ngramSize = 0;
    StringBuilder sb = null;

    //sentence becomes ngrams
    for (ListIterator<String> it = sentence.listIterator(); it.hasNext();) {
        String word = (String) it.next();

        //1- add the word itself
        sb = new StringBuilder(word);
        ngrams.add(word);
        ngramSize=1;
        it.previous();

        //2- insert prevs of the word and add those too
        while(it.hasPrevious() && ngramSize<maxGramSize){
            sb.insert(0,' ');
            sb.insert(0,it.previous());
            ngrams.add(sb.toString());
            ngramSize++;
        }

        //go back to initial position
        while(ngramSize>0){
            ngramSize--;
            it.next();
        }                   
    }
    return ngrams;
}

Call:

long startTime = System.currentTimeMillis();
List<String> ngrams = ToolSet.generateNgramsUpto("This is my car.", 3);
long stopTime = System.currentTimeMillis();
System.out.println("My time = "+(stopTime-startTime)+" ms with ngramsize = "+ngrams.size());
System.out.println(ngrams.toString());

Output:

输出：

My time = 1 ms with ngramsize = 9
[This, is, This is, my, is my, This is my, car, my car, is my car]

Answer 5

Answered by Dung TQ

public static void CreateNgram(ArrayList<String> list, int cutoff) {
    try
    {
        NGramModel ngramModel = new NGramModel();
        POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
        PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
        POSTaggerME tagger = new POSTaggerME(model);
        perfMon.start();
        for(int i = 0; i<list.size(); i++)
        {
            String inputString = list.get(i);
            ObjectStream<String> lineStream = new PlainTextByLineStream(new StringReader(inputString));
            String line;
            while ((line = lineStream.read()) != null) 
            {
                String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE.tokenize(line);
                String[] tags = tagger.tag(whitespaceTokenizerLine);

                POSSample sample = new POSSample(whitespaceTokenizerLine, tags);

                perfMon.incrementCounter();

                String words[] = sample.getSentence();

                if(words.length > 0)
                {
                    for(int k = 2; k< 4; k++)
                    {
                        ngramModel.add(new StringList(words), k, k);
                    }
                }
            }
        }
        ngramModel.cutoff(cutoff, Integer.MAX_VALUE);
        Iterator<StringList> it = ngramModel.iterator();
        while(it.hasNext())
        {
            StringList strList = it.next();
            System.out.println(strList.toString());
        }
        perfMon.stopAndPrintFinalResult();
    }catch(Exception e)
    {
        System.out.println(e.toString());
    }
}

Here is my code to create n-grams. In this case, n = 2 and 3. Word n-grams that occur fewer times than the cutoff value are dropped from the result set. The input is a list of sentences, which are tokenized using OpenNLP.
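
To illustrate the cutoff idea without the OpenNLP dependency, here is a rough sketch in plain Java. It is not the OpenNLP API: it just counts word n-grams in a map and keeps those seen at least cutoff times. The class name NgramCutoff, the method names, and the reading of "cutoff" as a minimum occurrence count are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NgramCutoff {

    // Counts every word n-gram of size minN..maxN across all sentences.
    // LinkedHashMap keeps the n-grams in first-seen order.
    static Map<String, Integer> countNgrams(List<String> sentences, int minN, int maxN) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String sentence : sentences) {
            String trimmed = sentence.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] words = trimmed.split("\\s+");
            for (int n = minN; n <= maxN; n++) {
                for (int i = 0; i + n <= words.length; i++) {
                    String ngram = String.join(" ", Arrays.copyOfRange(words, i, i + n));
                    counts.merge(ngram, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Keeps only the n-grams that occur at least cutoff times.
    static List<String> frequentNgrams(List<String> sentences, int minN, int maxN, int cutoff) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : countNgrams(sentences, minN, maxN).entrySet()) {
            if (e.getValue() >= cutoff) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("this is my car", "this is my house");
        System.out.println(frequentNgrams(sentences, 2, 3, 2));
        // prints [this is, is my, this is my]
    }
}
```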

Answer 6

Answered by M Sach

public static void main(String[] args) {

    String[] words = "This is my car.".split(" ");
    // A stepSize of n produces n-grams of n + 1 words, so 0..2 yields sizes 1..3.
    for (int n = 0; n < 3; n++) {

        List<String> list = ngrams(n, words);
        for (String ngram : list) {
            System.out.println(ngram);
        }
        System.out.println();

    }
}

public static List<String> ngrams(int stepSize, String[] words) {
    List<String> ngrams = new ArrayList<String>();
    for (int i = 0; i < words.length - stepSize; i++) {

        // Join words[i..i + stepSize] with single spaces (no leading space).
        StringBuilder ngram = new StringBuilder(words[i]);
        for (int k = i + 1; k <= i + stepSize; k++) {
            ngram.append(' ').append(words[k]);
        }
        ngrams.add(ngram.toString());

    }
    return ngrams;
}

Answer 7

Answered by Jagesh Maharjan

Check this out:

// Assumes both methods are members of a class named NGram.
public static void main(String[] args) {
    NGram nGram = new NGram();
    String[] tokens = "this is my car".split(" ");
    int i = tokens.length;
    List<String> ngrams = new ArrayList<>();
    while (i >= 1){
        ngrams.addAll(nGram.getNGram(tokens, i, new ArrayList<>()));
        i--;
    }
    System.out.println(ngrams);
}

private List<String> getNGram(String[] tokens, int n, List<String> ngrams) {
    StringBuilder strbldr = new StringBuilder();
    if (tokens.length < n) {
        return ngrams;
    }else {
        for (int i=0; i<n; i++){
            strbldr.append(tokens[i]).append(" ");
        }
        ngrams.add(strbldr.toString().trim());
        String[] newTokens = Arrays.copyOfRange(tokens, 1, tokens.length);
        return getNGram(newTokens, n, ngrams);
    }
}

Simple recursive function, better running time.
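
Each recursive call in the code above copies the remaining tokens with Arrays.copyOfRange. As a possible variation, the same recursion can pass a start index instead, which avoids the copies. This is only a sketch: the class name RecursiveNgrams and the method name collect are made up, and since Java does not eliminate tail calls, very long inputs could still overflow the stack.

```java
import java.util.ArrayList;
import java.util.List;

class RecursiveNgrams {

    // Appends every n-gram of size n starting at index start or later to out.
    static List<String> collect(String[] tokens, int n, int start, List<String> out) {
        if (n < 1 || start + n > tokens.length) {
            return out; // no complete n-gram left
        }
        StringBuilder sb = new StringBuilder(tokens[start]);
        for (int i = start + 1; i < start + n; i++) {
            sb.append(' ').append(tokens[i]);
        }
        out.add(sb.toString());
        return collect(tokens, n, start + 1, out);
    }

    public static void main(String[] args) {
        String[] tokens = "this is my car".split(" ");
        List<String> ngrams = new ArrayList<>();
        // Largest n-grams first, as in the answer above.
        for (int n = tokens.length; n >= 1; n--) {
            ngrams.addAll(collect(tokens, n, 0, new ArrayList<>()));
        }
        System.out.println(ngrams);
        // prints [this is my car, this is my, is my car, this is, is my, my car, this, is, my, car]
    }
}
```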
