Java: counting the number of distinct words

Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under a CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/6454348/

Time: 2020-10-30 15:57:10  Source: igfitidea

count number of distinct words

java text-processing

Asked by mahi

I am trying to count the number of distinct words in the text, using Java.

A word can be a unigram, bigram or trigram noun. I have already extracted these three kinds using the Stanford POS tagger, but I am not able to find the words whose frequency is greater than or equal to one, two, three, four and five, and their counts.

Answered by Wolfcow

I might not be understanding correctly, but if all you need to do is count the number of distinct words in a given text, then, depending on where and how you get the words from the text, you could use a java.util.Scanner and add each word to an ArrayList, skipping any word that is already in the list. The size of the list is then the number of distinct words, something like the example below:

import java.util.ArrayList;
import java.util.Scanner;

public ArrayList<String> makeWordList(){
    Scanner scan = new Scanner(yourTextFileOrOtherTypeOfInput);
    ArrayList<String> listOfWords = new ArrayList<String>();

    while(scan.hasNext()){ //keep reading until the input is exhausted
        String word = scan.next(); //scanner splits on whitespace by default
        if(!listOfWords.contains(word)){ //add the word if it isn't added already
            listOfWords.add(word);
        }
    }

    return listOfWords; //return the list you made of distinct words
}

public int getDistinctWordCount(ArrayList<String> list){
    return list.size();
}

Now, if you actually have to count the number of characters in the word before you add it to the list, you just need to add a check on the length of the word string first. For example:

if(word.length() <= someNumber){
//do whatever you need to
}

Sorry if I'm not understanding the question and just gave some crappy unrelated answer =P but I hope it helps in some way!

If you need to keep track of how often you see the same word, even though you only want to count it once, you could keep the frequency counts in a second list, such that the index of each frequency count is the same as the index of the word in the ArrayList, so you know which word each frequency corresponds to. Better yet, use a HashMap where the key is the distinct word and the value is its frequency (basically the same code as above, but with a HashMap instead of the ArrayList and a variable to count the frequency):

import java.util.HashMap;
import java.util.Scanner;

public HashMap<String, Integer> makeWordList(){
    Scanner scan = new Scanner(yourTextFileOrOtherTypeOfInput);
    HashMap<String, Integer> listOfWords = new HashMap<String, Integer>();
    while(scan.hasNext())
    {
        String word = scan.next(); //scanner splits on whitespace by default
        if(!listOfWords.containsKey(word))
        {                             //add word if it isn't added already
            listOfWords.put(word, 1); //first occurrence of this word
        }
        else
        {
            int countWord = listOfWords.get(word) + 1; //get current count and increment
            listOfWords.put(word, countWord); //put() overwrites the old value for an existing key
        }
    }
    return listOfWords; //return the HashMap you made of distinct words
}

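public int getDistinctWordCount(HashMap<String, Integer> list){
    return list.size();
}

//get the frequency of the given word (0 if the word was never seen)
public int getFrequencyForWord(String word, HashMap<String, Integer> list){
    //get() alone returns null for a missing word, which would throw when unboxed to int
    return list.getOrDefault(word, 0);
}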
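
To tie the frequency map back to the thresholds in the question (frequency greater than or equal to one, two, three, four and five), here is a minimal sketch. It assumes the words arrive as whitespace-separated tokens in a String; the class name FrequencyThresholds and both method names are made up for illustration and are not part of the answer above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

class FrequencyThresholds {

    // Build a word -> frequency map; Scanner splits on whitespace by default.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> frequencies = new HashMap<>();
        try (Scanner scan = new Scanner(text)) {
            while (scan.hasNext()) {
                // merge() stores 1 for a new word, otherwise adds 1 to the stored count
                frequencies.merge(scan.next(), 1, Integer::sum);
            }
        }
        return frequencies;
    }

    // Number of distinct words that occur at least minFrequency times.
    static int distinctWordsWithFrequencyAtLeast(Map<String, Integer> frequencies, int minFrequency) {
        int matches = 0;
        for (int count : frequencies.values()) {
            if (count >= minFrequency) {
                matches++;
            }
        }
        return matches;
    }
}
```

Calling distinctWordsWithFrequencyAtLeast once for each threshold from 1 to 5 gives the five counts the question asks about. For bigrams or trigrams the same map works as long as each n-gram is used as a single key, which this sketch does not do, since Scanner hands back one token at a time.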

Answered by Bozho

You can use a Multiset (HashMultiset in the snippet comes from the Guava library):

  • split the string on space
  • create a new multiset from the result

Something like

String[] words = string.split(" ");
Multiset<String> wordCounts = HashMultiset.create(Arrays.asList(words));
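
If Guava is not on the classpath, roughly the same word-to-count view can be built with plain JDK streams. The sketch below is a stand-in for the Multiset, not Guava's API; the class name StreamWordCounts and the method wordCounts are hypothetical. It also splits on runs of whitespace rather than on a single space, which is a deliberate change from the snippet above.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class StreamWordCounts {

    // Group identical words together and count each group.
    static Map<String, Long> wordCounts(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(word -> !word.isEmpty()) // an empty input splits into one empty string
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }
}
```

The size of the returned map is the number of distinct words, and get(word) gives the frequency of a word that is present.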

Answered by Sagar

There can be many solutions to this problem, but the one that helped me was as simple as the one below:

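import java.util.HashSet;
import java.util.Set;

public static int countDistinctWords(String str){
    Set<String> noOWoInString = new HashSet<String>(); //a Set keeps each word only once
    String[] words = str.split(" ");
    //noOWoInString.addAll(Arrays.asList(words)); //would do the same in one call
    for(String wrd:words){
        noOWoInString.add(wrd);
    }
    return noOWoInString.size();
}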

Thanks, Sagar
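
One caveat with splitting on a single space: consecutive spaces produce empty strings, which the set then counts as a word, and "Cat" and "cat" are treated as different words. A hedged variant follows, assuming you want whitespace runs collapsed and case ignored (both are assumptions on my part, not something the answer states); the class name DistinctWords is made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class DistinctWords {

    static int countDistinctWords(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0; // splitting an empty string would yield one empty "word"
        }
        // Assumption: case-insensitive comparison; drop toLowerCase() to keep case.
        String[] words = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        // Arrays.asList wraps the array as a Collection, so the set can take it in one call.
        Set<String> distinct = new HashSet<>(Arrays.asList(words));
        return distinct.size();
    }
}
```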
