java 如何统计每个单词出现的次数?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/26282009/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How to count the number of occurrences of each word?
提问by Devon
If I have an article in English, or a novel in English, and I want to count how many times each word appears, what is the fastest algorithm written in Java?
如果我有一篇英文文章,或者一本英文小说,我想计算每个单词出现的次数,用 Java 编写的最快算法是什么?
Some people said you can use a Map<String, Integer>() to do this, but I was wondering: how do I know what the key words are? Every article has different words, so how do you know the "key" words and then add one to their count?
有人说可以用 Map<String, Integer>() 来完成,但我想知道:我怎么知道键是哪些单词?每篇文章的单词都不一样,那怎么确定这些"键",然后给它的计数加一呢?
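As the answers below show, you do not need to know the words in advance: every word encountered while reading the text simply becomes a key in the map. A minimal sketch of that idea, with a made-up method name and assuming the whole article is already in a String:
正如下面的回答所示,并不需要事先知道有哪些单词:读取文本时遇到的每个单词本身就会成为 map 的键。下面是这个思路的一个简单示意,方法名是随意取的,并假设整篇文章已经放在一个 String 里:
import java.util.HashMap;
import java.util.Map;

static Map<String, Integer> countOccurrences(String text) {
    Map<String, Integer> counts = new HashMap<>();
    for (String word : text.split("\\W+")) {
        counts.merge(word, 1, Integer::sum); // a new word becomes a key with count 1; repeats are incremented
    }
    return counts;
}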
回答by yankee
Here is another way to do it with the things that appeared in Java 8:
这是使用 Java 8 中出现的内容的另一种方法:
private void countWords(final Path file) throws IOException {
Arrays.stream(new String(Files.readAllBytes(file), StandardCharsets.UTF_8).split("\\W+"))
.collect(Collectors.groupingBy(Function.<String>identity(), TreeMap::new, counting())).entrySet()
.forEach(System.out::println);
}
So what is it doing?
那么它在做什么呢?
1. It reads the text file completely into memory, to be more precise into a byte array: Files.readAllBytes(file). This method turned up in Java 7 and allows loading files very fast, however at the price that the file will be completely in memory, costing a lot of memory. For speed, however, this is a good approach.
2. The byte[] is converted to a String: new String(Files.readAllBytes(file), StandardCharsets.UTF_8), assuming that the file is UTF-8 encoded. Change it to your own needs. The price is a full memory copy of the already huge piece of data in memory. It may be faster to work with a memory-mapped file instead.
3. The string is split at non-word characters: ...split("\\W+"), which creates an array of strings with all your words.
4. We create a stream from that array: Arrays.stream(...). This by itself does not do very much, but we can do a lot of fun things with the stream.
5. We group all the words together: Collectors.groupingBy(Function.<String>identity(), TreeMap::new, counting()). This means:
   - We want to group the words by the words themselves (identity()). We could also, for example, lowercase the string here first if you want the grouping to be case-insensitive (a variant is sketched after this explanation). This will end up being the key in the map.
   - As the structure for storing the grouped values we want a TreeMap (TreeMap::new). TreeMaps are sorted by their key, so we can easily output in alphabetical order in the end. If you do not need sorting you could also use a HashMap here.
   - As the value for each group we want the number of occurrences of each word (counting()). In the background that means that for each word we add to a group, we increase its counter by one.
6. From step 5 we are left with a Map that maps words to their count. Now we just want to print them. So we access a collection with all the key/value pairs in this map (.entrySet()).
7. Finally the actual printing. We say that each element should be passed to the println method: .forEach(System.out::println). And now you are left with a nice list.
1. 它把整个文本文件读入内存,更准确地说是读入一个字节数组:Files.readAllBytes(file)。这个方法是 Java 7 引入的,可以非常快地加载文件,但代价是整个文件都会放在内存里,消耗大量内存。不过就速度而言,这是一个不错的做法。
2. 把 byte[] 转换为 String:new String(Files.readAllBytes(file), StandardCharsets.UTF_8),这里假设文件是 UTF-8 编码,可按自己的需要修改。代价是要对内存中本已很大的数据再做一次完整拷贝。改用内存映射文件可能会更快。
3. 按非单词字符拆分字符串:...split("\\W+"),得到一个包含所有单词的字符串数组。
4. 用这个数组创建一个流:Arrays.stream(...)。这一步本身没做什么,但有了流我们就可以做很多有趣的事情。
5. 把相同的单词分到一组:Collectors.groupingBy(Function.<String>identity(), TreeMap::new, counting())。它的含义是:
   - 按单词本身分组(identity())。如果希望分组不区分大小写,也可以先把字符串转成小写(这段说明之后附了一个示意)。单词最终会成为 map 中的键。
   - 用 TreeMap 来存放分组结果(TreeMap::new)。TreeMap 按键排序,因此最后可以很容易地按字母顺序输出。如果不需要排序,也可以在这里用 HashMap。
   - 每组的值是该单词出现的次数(counting())。在背后,这意味着每向某个分组添加一个单词,对应的计数器就加一。
6. 经过第 5 步,我们得到一个把单词映射到其出现次数的 Map。现在只需要把它们打印出来,于是取出这个 map 中所有键值对组成的集合(.entrySet())。
7. 最后是实际的打印:把每个元素传给 println 方法(.forEach(System.out::println))。这样你就得到了一份漂亮的清单。
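For example, a case-insensitive variant of the grouping mentioned in step 5 could look roughly like this (a sketch, not part of the original answer; it uses the same imports as the snippet above plus java.util.Locale, and the method name is made up):
例如,第 5 步里提到的不区分大小写的分组,大致可以写成下面这样(仅为示意,并非原回答的一部分;它使用与上面代码相同的 import,外加 java.util.Locale,方法名是随意取的):
private void countWordsIgnoringCase(final Path file) throws IOException {
    // lowercase each word before grouping, so "The" and "the" end up in the same bucket
    Arrays.stream(new String(Files.readAllBytes(file), StandardCharsets.UTF_8).split("\\W+"))
        .collect(Collectors.groupingBy(w -> w.toLowerCase(Locale.ROOT), TreeMap::new, Collectors.counting()))
        .entrySet()
        .forEach(System.out::println);
}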
So how good is this answer? The upside is that it is very short and thus highly expressive. It also gets along with only a single system call hidden behind Files.readAllBytes (or at least a fixed number; I am not sure if this really works with a single system call), and system calls can be a bottleneck. For example, if you are reading a file from a stream, each call to read may trigger a system call. This is significantly reduced by using a BufferedReader which, as the name suggests, buffers; but still, readAllBytes should be fastest. The price for this is that it consumes huge amounts of memory. However, Wikipedia claims that a typical English book has 500 pages with 2,000 characters per page, i.e. roughly 500 × 2,000 = 1,000,000 characters or about 1 Megabyte, which should not be a problem in terms of memory consumption even if you are on a smartphone, a Raspberry Pi or a really, really old computer.
那么这个答案有多好呢?优点是它非常简短,因而表达力很强。它也只需要一次隐藏在 Files.readAllBytes 背后的系统调用(或者至少是固定次数,我不确定是否真的只有一次),而系统调用可能成为瓶颈。例如,如果从流中读取文件,每次 read 调用都可能触发一次系统调用;使用 BufferedReader(顾名思义,它带缓冲)可以显著减少这种情况,但 readAllBytes 应该仍是最快的。代价是它会消耗大量内存。不过维基百科称,一本典型的英文书约有 500 页、每页 2,000 个字符,也就是大约 500 × 2,000 = 1,000,000 个字符,约 1 兆字节,即使在智能手机、树莓派或一台非常老旧的电脑上,这样的内存消耗也不成问题。
This solution does involve some optimizations that were not possible prior to Java 8. For example, the idiom map.put(word, map.get(word) + 1) requires the "word" to be looked up twice in the map, which is an unnecessary waste.
这个解决方案确实用到了一些在 Java 8 之前无法实现的优化。例如,习惯写法 map.put(word, map.get(word) + 1) 需要在 map 中查找两次 "word",这是不必要的浪费。
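For comparison, the two idioms look roughly like this side by side (a sketch; merge is a standard Map method since Java 8):
作为对比,这两种写法大致如下(仅为示意;merge 是 Java 8 起 Map 接口的标准方法):
// classic idiom: two lookups per word, one get and one put
Integer old = wordCounts.get(word);
wordCounts.put(word, old == null ? 1 : old + 1);

// Java 8: a single call that inserts 1 for a new word or adds 1 to the existing count
wordCounts.merge(word, 1, Integer::sum);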
But a simple loop might also be easier for the compiler to optimize and might save a number of method calls. So I wanted to know, and put this to a test. I generated a file using:
但是,一个简单的循环可能更容易为编译器优化,并且可能会节省大量的方法调用。所以我想知道并对此进行测试。我使用以下方法生成了一个文件:
[ -f /tmp/random.txt ] && rm /tmp/random.txt; for i in {1..15}; do head -n 10000 /usr/share/dict/american-english >> /tmp/random.txt; done; perl -MList::Util -e 'print List::Util::shuffle <>' /tmp/random.txt > /tmp/random.tmp; mv /tmp/random.tmp /tmp/random.txt
Which gives me a file of about 1.3 MB, so not that untypical for a book, with most words being repeated 15 times, but in random order so that this does not end up being a branch-prediction test. Then I ran the following tests:
这给了我一个大约 1.3MB 的文件,对一本书来说不算太不典型;其中大多数单词重复出现 15 次,但顺序是随机的,以避免测试最终变成一次分支预测测试。然后我运行了以下测试:
public class WordCountTest {
@Test(dataProvider = "provide_description_testMethod")
public void test(String description, TestMethod testMethod) throws Exception {
long start = System.currentTimeMillis();
for (int i = 0; i < 100_000; i++) {
testMethod.run();
}
System.out.println(description + " took " + (System.currentTimeMillis() - start) / 1000d + "s");
}
@DataProvider
public Object[][] provide_description_testMethod() {
Path path = Paths.get("/tmp/random.txt");
return new Object[][]{
{"classic", (TestMethod)() -> countWordsClassic(path)},
{"mixed", (TestMethod)() -> countWordsMixed(path)},
{"mixed2", (TestMethod)() -> countWordsMixed2(path)},
{"stream", (TestMethod)() -> countWordsStream(path)},
{"stream2", (TestMethod)() -> countWordsStream2(path)},
};
}
private void countWordsClassic(final Path path) throws IOException {
final Map<String, Integer> wordCounts = new HashMap<>();
for (String word : new String(readAllBytes(path), StandardCharsets.UTF_8).split("\\W+")) {
Integer oldCount = wordCounts.get(word);
if (oldCount == null) {
wordCounts.put(word, 1);
} else {
wordCounts.put(word, oldCount + 1);
}
}
}
private void countWordsMixed(final Path path) throws IOException {
final Map<String, Integer> wordCounts = new HashMap<>();
for (String word : new String(readAllBytes(path), StandardCharsets.UTF_8).split("\\W+")) {
wordCounts.merge(word, 1, (oldCount, newCount) -> oldCount + newCount);
}
}
private void countWordsMixed2(final Path path) throws IOException {
final Map<String, Integer> wordCounts = new HashMap<>();
Pattern.compile("\\W+")
.splitAsStream(new String(readAllBytes(path), StandardCharsets.UTF_8))
.forEach(word -> wordCounts.merge(word, 1, (oldCount, newCount) -> oldCount + newCount));
}
private void countWordsStream2(final Path tmpFile) throws IOException {
Pattern.compile("\\W+").splitAsStream(new String(readAllBytes(tmpFile), StandardCharsets.UTF_8))
.collect(Collectors.groupingBy(Function.<String>identity(), HashMap::new, counting()));
}
private void countWordsStream(final Path tmpFile) throws IOException {
Arrays.stream(new String(readAllBytes(tmpFile), StandardCharsets.UTF_8).split("\\W+"))
.collect(Collectors.groupingBy(Function.<String>identity(), HashMap::new, counting()));
}
interface TestMethod {
void run() throws Exception;
}
}
The results were:
结果是:
type length diff
classic 4665s +9%
mixed 4273s +0%
mixed2 4833s +13%
stream 4868s +14%
stream2 5070s +19%
Note that I previously also tested with TreeMaps, but found that the HashMaps were much faster, even if I sorted the output afterwards. Also, I changed the tests above after Tagir Valeev told me in the comments below about the Pattern.splitAsStream() method. Since I got strongly varying results, I let the tests run for quite a while, as you can see by the lengths in seconds above, to get meaningful results.
请注意,我之前也用 TreeMap 做过测试,但发现即使事后再对输出排序,HashMap 也要快得多。另外,在 Tagir Valeev 在下方评论中向我提到 Pattern.splitAsStream() 方法之后,我修改了上面的测试。由于测试结果波动很大,我让测试运行了相当长的时间(从上面以秒为单位的耗时可以看出),以便得到有意义的结果。
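Sorting the output of a HashMap afterwards, as mentioned above, could for instance look like this (a sketch, not the benchmark code itself; wordCounts is assumed to be the HashMap built in the test methods):
上面提到的"之后再对 HashMap 的输出排序",大致可以这样写(仅为示意,并非基准测试本身的代码;假设 wordCounts 是测试方法里构建的那个 HashMap):
wordCounts.entrySet().stream()
    .sorted(Map.Entry.comparingByKey())   // alphabetical order; comparingByValue() would sort by count instead
    .forEach(System.out::println);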
How I judge the results:
我如何判断结果:
The "mixed" approach which does not use streams at all, but uses the "merge" method with callback introduced in Java 8 does improve the performance. This is something I expected because the classic get/put appraoch requires the key to be looked up twice in the HashMap and this is not required anymore with the "merge"-approach.
To my suprise the
Pattern.splitAsStream()
appraoch is actually slower compared toArrays.asStream(....split())
. I did have a look at the source code of both implementations and I noticed that thesplit()
call saves the results in an ArrayList which starts with a size of zero and is enlarged as needed. This requires many copy operations and in the end another copy operation to copy the ArrayList to an array. But "splitAsStream" actually creates an iterator which I thought can be queried as needed avoiding these copy operations completely. I did not quite look through all the source that converts the iterator to a stream object, but it seems to be slow and I don't know why. In the end it theoretically could have to do with CPU memory caches: If exactly the same code is executed over and over again the code will more likely be in the cache then actually running on large function chains, but this is a very wild speculation on my side. It may also be something completely different. HoweversplitAsStream
MIGHThave a better memory footprint, maybe it does not, I did not profile that.The stream approach in general is pretty slow. This is not totally unexpected because quite a number of method invocations take place, including for example something as pointless as
Function.identity
. However I did not expect the difference at this magnitude.
- "混合"方法完全不使用流,而是使用 Java 8 引入的带回调的 "merge" 方法,它确实提高了性能。这在我意料之中,因为经典的 get/put 写法需要在 HashMap 中查找键两次,而 "merge" 写法不再需要这样做。
- 令我惊讶的是,Pattern.splitAsStream() 的写法实际上比 Arrays.stream(....split()) 更慢。我查看了两种实现的源代码,注意到 split() 调用会把结果保存在一个初始大小为零、按需扩容的 ArrayList 中,这需要很多次复制操作,最后还要再复制一次才能把 ArrayList 转成数组。而 "splitAsStream" 实际上创建的是一个迭代器,我原以为它可以按需取值,从而完全避免这些复制操作。我没有完整看完把迭代器转换成流对象的那部分源码,但它看起来就是慢,我也不知道为什么。理论上这最终可能与 CPU 缓存有关:如果完全相同的代码被反复执行,代码更有可能留在缓存中,而不像跑在很长的函数调用链上那样;不过这只是我非常大胆的猜测,也可能完全是别的原因。另外,splitAsStream 也许(MIGHT)有更好的内存占用,也许没有,我没有做相关的分析。
- 流式写法总体上相当慢。这并不完全出乎意料,因为其中发生了相当多的方法调用,比如像 Function.identity 这样毫无意义的调用。不过我没想到差距会有这么大。
As an interesting side note, I find the mixed approach, which was the fastest, also quite easy to read and understand. The call to "merge" does not have the most obvious effect to me, but if you know what this method is doing, it seems the most readable to me, while at the same time the groupingBy command is more difficult for me to understand. I guess one might be tempted to say that groupingBy is so special and highly optimised that it makes sense to use it for performance, but as demonstrated here, this is not the case.
作为一个有趣的补充:我发现速度最快的"混合"方法读起来也相当容易理解。"merge" 这个调用对我来说效果不算一目了然,但只要知道这个方法在做什么,它在我看来就是最易读的;相比之下,groupingBy 这条语句对我来说更难理解。也许有人会说 groupingBy 非常特殊、经过高度优化,为了性能值得使用它,但正如这里所展示的,情况并非如此。
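If the effect of "merge" is not obvious, for the word-count case it behaves roughly like the following (a sketch; Integer::sum is an equivalent, arguably clearer remapping function than the lambda used above):
如果觉得 "merge" 的效果不够直观,在单词计数这个场景里,它的行为大致等价于下面的写法(仅为示意;Integer::sum 是一个与上面的 lambda 等价、但更直白的合并函数):
// wordCounts.merge(word, 1, Integer::sum) behaves roughly like:
Integer oldCount = wordCounts.get(word);
if (oldCount == null) {
    wordCounts.put(word, 1);              // word not seen yet: store the given value 1
} else {
    wordCounts.put(word, oldCount + 1);   // word already present: apply the remapping function to the old count
}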
回答by yunandtidus
Map<String, Integer> countByWords = new HashMap<String, Integer>();
Scanner s = new Scanner(new File("your_file_path"));
while (s.hasNext()) {
String next = s.next();
Integer count = countByWords.get(next);
if (count != null) {
countByWords.put(next, count + 1);
} else {
countByWords.put(next, 1);
}
}
s.close();
this count "I'm" as only one word
这把“我”算作只有一个词
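If you would rather split on non-word characters, as the regex-based answers above do, you could presumably set a delimiter on the Scanner (a sketch; note that "I" and "m" would then be counted separately):
如果希望像上面基于正则的回答那样按非单词字符拆分,也可以给 Scanner 设置分隔符(仅为示意;注意这样 "I" 和 "m" 会被分别计数):
Scanner s = new Scanner(new File("your_file_path"));
s.useDelimiter("\\W+"); // treat any run of non-word characters, including the apostrophe, as a separator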
回答by Grice
General overview of steps:
步骤概述:
- Create a HashMap<String, Integer>.
- Read the file one word at a time. If the word doesn't exist in your HashMap, add it with its count set to 1. If it exists, increment the value by 1.
- Read till the end of the file.
- 创建一个 HashMap<String, Integer>。
- 一次读取一个单词。如果这个单词还不在 HashMap 中,就把它加进去并把计数设为 1;如果已经存在,就把计数加 1。
- 一直读到文件末尾。
This will result in a set of all your words and the count for each word.
这将产生一组所有单词和每个单词的计数。
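A minimal sketch of these steps, assuming whitespace-separated words and using a Scanner as in the answer above (the method name is made up):
按照这些步骤写出的一个简单示意,假设单词以空白分隔,并像上面的回答一样使用 Scanner(方法名是随意取的):
import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

static Map<String, Integer> buildWordCounts(File file) throws FileNotFoundException {
    Map<String, Integer> counts = new HashMap<>();
    try (Scanner scanner = new Scanner(file)) {
        while (scanner.hasNext()) {                        // read one word at a time until end of file
            counts.merge(scanner.next(), 1, Integer::sum); // new words start at 1, known words are incremented
        }
    }
    return counts;
}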
回答by Jared Wadsworth
If I were you, I would use one of the implementations of Map<String, Integer>, like a HashMap. Then, as you loop through each word, if it already exists just increment its count by one, otherwise add it to the map. At the end you can pull out all of the words, or query it based on a specific word to get the count.
如果我是你,我会使用 Map<String, Integer> 的某种实现,比如 HashMap。然后在遍历每个单词时,如果它已经存在,就把计数加一,否则把它加进 map。最后你可以取出所有单词,或者按某个特定的单词查询它的计数。
If order is important to you, you could try a SortedMap<String, Integer> to be able to print them out in alphabetical order.
如果顺序对你很重要,可以试试 SortedMap<String, Integer>,这样就能按字母顺序把它们打印出来。
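A TreeMap is the usual SortedMap implementation, so alphabetical output might look like this (a sketch, not part of the original answer; it assumes wordCounts holds the counts built as above):
TreeMap 是常用的 SortedMap 实现,按字母顺序输出大致可以这样写(仅为示意,并非原回答的一部分;假设 wordCounts 里已经存放了上面统计出的计数):
SortedMap<String, Integer> sorted = new TreeMap<>(wordCounts); // copying into a TreeMap keeps the keys sorted
sorted.forEach((word, count) -> System.out.println(word + " " + count));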
Hope that helps!
希望有帮助!
回答by Markony
It is actually the classic word-count algorithm. Here is the solution:
它实际上是经典的字数统计算法。这是解决方案:
public Map<String, Integer> wordCount(String[] strings) {
Map<String, Integer> map = new HashMap<String, Integer>();
int count = 0;
for (String s:strings) {
if (map.containsKey(s)) {
count = map.get(s);
map.put(s, count + 1);
} else {
map.put(s, 1);
}
}
return map;
}
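A possible way to call it, reusing the file-reading and splitting idea from the first answer above (that part is an assumption, not from this answer):
一种可能的调用方式,沿用上面第一个回答里读文件和拆分的思路(这部分是补充的假设,并非出自这个回答):
String text = new String(Files.readAllBytes(Paths.get("your_file_path")), StandardCharsets.UTF_8);
Map<String, Integer> counts = wordCount(text.split("\\W+"));            // every distinct word ends up as a key
counts.forEach((word, count) -> System.out.println(word + ": " + count));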