Java: sort words by frequency? (least to greatest)

Question

Asked by user1333781

Does anyone have any idea how to sort a list of words in order of their frequency (least to greatest) using the built-in Collections.sort and the Comparator<String> interface?

I already have a method that gets the count of a certain word in the text file. Now, I just need to create a method that compares the counts of each word and then puts them in a list sorted by the least frequency to the greatest.

Any ideas and tips would be very much appreciated. I'm having trouble getting started on this particular method.

public class Parser implements Comparator<String> {

    public Map<String, Integer> wordCount;

    void parse(String filename) throws IOException {
        File file = new File(filename);
        Scanner scanner = new Scanner(file);

        //mapping of string -> integer (word -> frequency)
        Map<String, Integer> wordCount = new HashMap<String, Integer>();

        //iterates through each word in the text file
        while(scanner.hasNext()) {
            String word = scanner.next();
            if (scanner.next()==null) {
                wordCount.put(word, 1);
            }
            else {
                wordCount.put(word, wordCount.get(word) + 1);;
                }
            }
            scanner.next().replaceAll("[^A-Za-z0-9]"," ");
            scanner.next().toLowerCase();
        }

    public int getCount(String word) {
        return wordCount.get(word);
    }

    public int compare(String w1, String w2) {
        return getCount(w1) - getCount(w2);
    } 

        //this method should return a list of words in order of frequency from least to   greatest
    public List<String> getWordsInOrderOfFrequency() {
        List<Integer> wordsByCount = new ArrayList<Integer>(wordCount.values());
        //this part is unfinished.. the part i'm having trouble sorting the word frequencies
        List<String> result = new ArrayList<String>();


    }
}

Answer 1

Answered by rodion

First of all, your usage of scanner.next() seems incorrect. next() returns the next word and advances to the following one every time you call it, so the following code:

if(scanner.next() == null){ ... }

and also

scanner.next().replaceAll("[^A-Za-z0-9]"," ");
scanner.next().toLowerCase();

will consume and then just throw away words. What you probably want to do is:

String word = scanner.next().replaceAll("[^A-Za-z0-9]"," ").toLowerCase();

at the beginning of your while loop, so that the changes to the word are saved in the word variable and not just thrown away.
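As a minimal sketch of that loop, assuming the input is an in-memory string rather than the asker's file (the names TokenDemo and normalizedWords are made up for illustration): each token is read exactly once, cleaned, and kept. Note one deliberate difference from the line above: this sketch replaces punctuation with the empty string instead of a space, on the assumption that "Hello," and "hello" should end up as the same word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class TokenDemo {
    // Reads each whitespace-separated token once, strips non-alphanumerics, lowercases it.
    static List<String> normalizedWords(String text) {
        List<String> words = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                // Single call to next() per iteration; the result is stored, not discarded.
                String word = scanner.next().replaceAll("[^A-Za-z0-9]", "").toLowerCase();
                if (!word.isEmpty()) {   // a token such as "--" cleans down to nothing
                    words.add(word);
                }
            }
        }
        return words;
    }
}
```

Skipping empty results is a design choice of the sketch; without it, tokens made only of punctuation would be counted under the empty string.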

Secondly, the usage of the wordCount map is slightly broken. What you want is to check whether the word is already in the map to decide what count to set. To do this, instead of checking for scanner.next() == null you should look in the map, for example:

if(!wordCount.containsKey(word)){
  //no count registered for the word yet
  wordCount.put(word, 1);
}else{
  wordCount.put(word, wordCount.get(word) + 1);
}

Alternatively, you can do this:

Integer count = wordCount.get(word);
if(count == null){
  //no count registered for the word yet
  wordCount.put(word, 1);
}else{
  wordCount.put(word, count+1);
}

I would prefer this approach, because it's a bit cleaner and calls get only once per word, whereas the first approach calls containsKey and then get for every word that is already in the map.
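If Java 8 or later is available (an assumption; the answer above predates it), the same counting step can probably be written with Map.merge, which handles the "absent" and "present" cases in one call. The class and method names below are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class MergeCount {
    // Builds a word -> frequency map from any sequence of words.
    static Map<String, Integer> countWords(Iterable<String> words) {
        Map<String, Integer> wordCount = new HashMap<>();
        for (String word : words) {
            // Stores 1 if the word is absent, otherwise stores oldCount + 1.
            wordCount.merge(word, 1, Integer::sum);
        }
        return wordCount;
    }
}
```

For example, countWords(Arrays.asList("a", "b", "a")) yields a map in which "a" maps to 2 and "b" maps to 1.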

Now, to get a list of words in descending order of frequency, you can convert your map to a list first, then apply Collections.sort() as was suggested in this post. Below is a simplified version suited to your needs:

static List<String> getWordInDescendingFreqOrder(Map<String, Integer> wordCount) {

    // Convert map to list of <String,Integer> entries
    List<Map.Entry<String, Integer>> list = 
        new ArrayList<Map.Entry<String, Integer>>(wordCount.entrySet());

    // Sort list by integer values
    Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {
        public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
            // compare o2 to o1, instead of o1 to o2, to get descending freq. order
            return (o2.getValue()).compareTo(o1.getValue());
        }
    });

    // Populate the result into a list
    List<String> result = new ArrayList<String>();
    for (Map.Entry<String, Integer> entry : list) {
        result.add(entry.getKey());
    }
    return result;
}
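The question asked for least to greatest, which is the reverse of the method above. A minimal sketch of the ascending variant, assuming Java 8+ so that Map.Entry.comparingByValue() is available (the class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class AscendingFreq {
    static List<String> wordsByAscendingFrequency(Map<String, Integer> wordCount) {
        // Copy the entries into a list, because a Set cannot be sorted in place.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(wordCount.entrySet());

        // comparingByValue() orders by the Integer counts, smallest first.
        entries.sort(Map.Entry.comparingByValue());

        // Keep only the words, now in ascending order of frequency.
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            result.add(entry.getKey());
        }
        return result;
    }
}
```

With the pre-Java 8 comparator shown above, the same effect should follow from comparing o1 to o2 instead of o2 to o1. Words with equal counts come out in whatever order the map's entry set happened to provide.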

Hope this helps.

Edit: Changed the comparison function as suggested by @dragon66. Thanks.

Answer 2

Answered by UVM

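You can compare and extract ideas from the following. Note that StdIn, StdOut, Merge and Counter are not JDK classes; they appear to come from the Princeton algs4 course library: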

public class FrequencyCount {

    public static void main(String[] args) {

        // read in the words as an array
        String s = StdIn.readAll();
        // s = s.toLowerCase();
        // s = s.replaceAll("[\",!.:;?()']", "");
        String[] words = s.split("\\s+");

        // sort the words
        Merge.sort(words);

        // tabulate frequencies of each word
        Counter[] zipf = new Counter[words.length];
        int M = 0;                                        // number of distinct words
        for (int i = 0; i < words.length; i++) {
            if (i == 0 || !words[i].equals(words[i-1]))   // short-circuiting OR
                zipf[M++] = new Counter(words[i], words.length);
            zipf[M-1].increment();
        }

        // sort by frequency and print
        Merge.sort(zipf, 0, M);                           // sorting a subarray
        for (int j = M-1; j >= 0; j--) {
            StdOut.println(zipf[j]);
        }
    }
}
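The same sort-then-count-runs idea can be sketched with JDK classes only. This is an illustration under assumptions, not the answerer's code: the names RunCount and byFrequency are invented, the text is passed in as a string, and the result is ordered least frequent first (the program above prints most frequent first).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RunCount {
    // Returns the distinct words ordered from least to most frequent.
    static List<String> byFrequency(String text) {
        List<String> result = new ArrayList<>();
        if (text.trim().isEmpty()) {
            return result;                        // split would otherwise yield one empty word
        }
        String[] words = text.trim().split("\\s+");
        Arrays.sort(words);                       // equal words become adjacent

        // Tabulate the length of each run of equal words.
        List<String> distinct = new ArrayList<>();
        List<Integer> counts = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (i == 0 || !words[i].equals(words[i - 1])) {
                distinct.add(words[i]);           // start of a new run
                counts.add(0);
            }
            int last = counts.size() - 1;
            counts.set(last, counts.get(last) + 1);
        }

        // Sort the indices of the distinct words by their counts (stable, so ties stay alphabetical).
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < distinct.size(); i++) order.add(i);
        order.sort((a, b) -> Integer.compare(counts.get(a), counts.get(b)));

        for (int i : order) result.add(distinct.get(i));
        return result;
    }
}
```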

Answer 3

Answered by user unknown

A solution close to your original posting, with corrections and with the sorting suggested by Torious in the comments:

import java.util.*;

public class Parser implements Comparator <String> {

    public Map<String, Integer> wordCount;

    void parse ()
    {
        Scanner scanner = new Scanner (System.in);

        // don't redeclare it here, otherwise the field wordCount would be shadowed
        wordCount = new HashMap<String, Integer> ();

        // iterates through each word read from standard input
        while (scanner.hasNext ()) {
            String word = scanner.next ();
            // operate on the word, not on next and next of next word from Scanner
            // strip punctuation; replacing it with a space would leave "end." and "end" as different keys
            word = word.replaceAll ("[^A-Za-z0-9]", "");
            word = word.toLowerCase ();
            // look into your map:
            if (! wordCount.containsKey (word))
                wordCount.put (word, 1);
            else
                wordCount.put (word, wordCount.get (word) + 1);
        }
    }

    public int getCount (String word) {
        return wordCount.get (word);
    }

    public int compare (String w1, String w2) {
        return getCount (w1) - getCount (w2);
    }

    public List<String> getWordsInOrderOfFrequency () {
        List<String> justWords = new ArrayList<String> (wordCount.keySet());
        Collections.sort (justWords, this);
        return justWords; 
    }

    public static void main (String args []) {
        Parser p = new Parser ();
        p.parse ();
        List<String> ls = p.getWordsInOrderOfFrequency ();
        for (String s: ls) 
            System.out.println (s);
    }
}
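A note on the compare method above: subtracting two counts works here because counts are never negative, but subtraction-based comparators can overflow in general. A hedged alternative, assuming Java 8+ and using invented names (FrequencyOrder, sortedWords), builds the comparator from the map and adds the word itself as a tie-breaker so that equal counts sort deterministically:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class FrequencyOrder {
    static List<String> sortedWords(Map<String, Integer> wordCount) {
        // Primary key: the count looked up in the map (least frequent first).
        Comparator<String> byCount = Comparator.comparing(wordCount::get);
        // Secondary key: the word itself, so ties are broken alphabetically.
        Comparator<String> byCountThenWord = byCount.thenComparing(Comparator.naturalOrder());

        List<String> words = new ArrayList<>(wordCount.keySet());
        Collections.sort(words, byCountThenWord);
        return words;
    }
}
```

The sketch only sorts words that are keys of the map, so the lookup never returns null during the sort.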

Answer 4

Answered by user unknown

rodion's solution is a bit of a generics hell, but mine is no simpler, just different.

In the end, his solution is shorter and better.

At first glance a TreeMap might seem appropriate, but it sorts by key and is of no help for sorting by value, and we can't swap key and value, because we look entries up by the key.

So the next idea is to build a HashMap and use Collections.sort, but that doesn't take a Map, only Lists. A Map does offer entrySet, which produces another Collection, but it is a Set and not a List. That was the point where I took another direction:

I implemented an Iterator: I iterate over the entrySet and only return entries whose value equals the count I am currently looking for, which starts at 1. Entries with any other value are buffered for later use. When the Iterator is exhausted, I look into the buffer, and if it isn't empty, I continue with the buffer's iterator, increment the count I look for, and create a new buffer.

The advantage of an Iterator/Iterable pair is that the values can be obtained with the enhanced for loop.

import java.util.*;

// a short little declaration :) 
public class WordFreq implements Iterator <Map.Entry <String, Integer>>, Iterable <Map.Entry <String, Integer>>
{
    private Map <String, Integer> counter;
    private Iterator <Map.Entry <String, Integer>> it;
    private Set <Map.Entry <String, Integer>> buf;
    private int maxCount = 1; 

    public Iterator <Map.Entry <String, Integer>> iterator () {
        return this;
    }

    // remove () is an optional operation of the Iterator interface; it is not supported here
    public void remove ()
    {
        throw new UnsupportedOperationException ();
    } 

    public boolean hasNext ()
    {
        return it.hasNext () || ! buf.isEmpty ();
    }

    public Map.Entry <String, Integer> next ()
    {
        while (it.hasNext ()) {
            Map.Entry <String, Integer> mesi = it.next ();
            if (mesi.getValue () == maxCount)
                return mesi;
            else
                buf.add (mesi);
        }
        if (buf.isEmpty ())
            return null;
        ++maxCount;
        it = buf.iterator (); 
        buf = new HashSet <Map.Entry <String, Integer>> ();     
        return next ();
    } 

    public WordFreq ()
    {
        it = fill ();
        buf = new HashSet <Map.Entry <String, Integer>> ();
        // The "this" here has to be an Iterable to make the foreach work
        for (Map.Entry <String, Integer> mesi : this)
        {
            System.out.println (mesi.getValue () + ":\t" + mesi.getKey ());
        }
    }

    public Iterator <Map.Entry <String, Integer>> fill ()
    {
        counter = new HashMap <String, Integer> ();
        Scanner sc = new Scanner (System.in);
        while (sc.hasNext ())
        {
            push (sc.next ());
        }
        Set <Map.Entry <String, Integer>> set = counter.entrySet ();
        return set.iterator ();
    }

    public void push (String word)
    {
        Integer i = counter.get (word);
        int n = 1 + ((i != null) ? i : 0); 
        counter.put (word, n);
    }

    public static void main (String args[])
    {
        new WordFreq ();
    }
}

Since my solution reads from stdin, you invoke it with:

cat WordFreq.java | java WordFreq
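For comparison with the iterator above, the TreeMap idea set aside earlier in this answer can probably be made to work if the map is inverted, so that the count becomes the key and each value is the list of words with that count. The following is a sketch under assumptions (Java 8+ for computeIfAbsent; the names CountBuckets and wordsByAscendingFrequency are made up), not a replacement for the answer's code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CountBuckets {
    static List<String> wordsByAscendingFrequency(Map<String, Integer> wordCount) {
        // Invert word -> count into count -> words; a TreeMap keeps its keys (the counts) ascending.
        TreeMap<Integer, List<String>> byCount = new TreeMap<>();
        for (Map.Entry<String, Integer> e : wordCount.entrySet()) {
            byCount.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }

        // Walk the buckets from the smallest count to the largest.
        List<String> result = new ArrayList<>();
        for (List<String> bucket : byCount.values()) {
            result.addAll(bucket);
        }
        return result;
    }
}
```

Like the iterator, this visits the words one count level at a time, but it does so in a single pass over the entries instead of re-scanning a buffer for each count.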
