Count word frequency of a huge text file (Java)

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) and link to the original: http://stackoverflow.com/questions/14746430/

Tags: java, algorithm, data-structures, text-files, word-count

Asked by vikky.rk

I have a huge text file (larger than the available RAM). I need to count the frequency of all words and output each word with its frequency count to a new file. The result should be sorted in descending order of frequency count.

My Approach:

  1. Sort the given file - external sort.
  2. Count the frequency of each word sequentially over the sorted file, storing the count (along with the word) in another file (see the sketch after this list).
  3. Sort the output file based on frequency count - external sort.
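
For step 2, the payoff of the external sort is that identical words end up on adjacent lines, so a single sequential pass with O(1) state produces the counts. A minimal sketch, assuming one word per line and hypothetical file names sorted-words.txt / counts.txt:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CountSortedWords {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get("sorted-words.txt"));
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("counts.txt")))) {
            String current = null;  // word of the current run of identical lines
            long count = 0;         // length of the current run
            for (String word; (word = in.readLine()) != null; ) {
                if (word.equals(current)) {
                    count++;
                } else {
                    if (current != null) {
                        out.println(count + "\t" + current); // emit the finished run
                    }
                    current = word;
                    count = 1;
                }
            }
            if (current != null) {
                out.println(count + "\t" + current); // flush the final run
            }
        }
    }
}
```

Emitting the count before the word makes step 3 a plain external sort on the leading numeric field.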

I want to know if there are better approaches. I have heard of disk-based hash tables and B+ trees, but I have never tried them before.

Note: I have seen similar questions asked on SO, but none of them address the case where the data is larger than memory.

Edit: Based on the comments, agreed that in practice a dictionary should fit in the memory of today's computers. But let's take a hypothetical dictionary of words that is huge enough not to fit in memory.

Answer by ogzd

I would go with a map-reduce approach:

  1. Distribute your text file across nodes, assuming each node's portion of the text fits into RAM.
  2. Calculate each word's frequency within the node (using a hash table).
  3. Collect the per-node results on a master node and combine them all (as in the sketch below).
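
A concrete instance of this is the classic Hadoop WordCount job: the mapper emits (word, 1) pairs, the combiner pre-sums counts on each node, and the reducer produces the final totals. A sketch, assuming a Hadoop cluster and the org.apache.hadoop MapReduce dependencies are available:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase (also used as the combiner): sum the counts per word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combine locally before shuffling
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that a second job (or a sort of the reducer output) is still needed to order the result by frequency.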

Answer by Sani Singh Huttunen

All the unique words probably fit in memory, so I'd use this approach:

  • Create a dictionary (HashMap<String, Integer>).
  • Read the huge text file line by line.
  • Add new words to the dictionary and set their value to 1.
  • Add 1 to the value of existing words.

After you've parsed the entire huge file:

  • Sort the dictionary by frequency.
  • Write the sorted dictionary, words and frequencies, to a new file.

Remember, though, to normalize the words to either lowercase or uppercase, as in the sketch below.

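A minimal sketch of this answer, assuming hypothetical file names huge.txt / frequencies.txt and lowercasing as the normalization; the file is streamed line by line, so only the dictionary itself is held in memory:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public class WordFrequency {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();

        // Read the huge file line by line; new words get 1, existing words get +1.
        try (Stream<String> lines = Files.lines(Paths.get("huge.txt"))) {
            lines.flatMap(line -> Stream.of(line.toLowerCase().split("\\W+")))
                 .filter(word -> !word.isEmpty())
                 .forEach(word -> counts.merge(word, 1, Integer::sum));
        }

        // Sort the dictionary by frequency (descending) and write it to a new file.
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("frequencies.txt")))) {
            counts.entrySet().stream()
                  .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                  .forEach(e -> out.println(e.getKey() + " " + e.getValue()));
        }
    }
}
```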

Answer by pushy

The best way to achieve this would be to read the file line by line and store the words in a Multimap (e.g. Guava's). If this map exceeds your memory, you could try a key-value store (e.g. Berkeley DB JE, or MapDB). These key-value stores work like a map, but they store their values on disk. I used MapDB for a similar problem and it was blazing fast.

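A sketch of the MapDB variant, assuming the MapDB 3 API (org.mapdb on the classpath) and hypothetical file names; the map is backed by a file on disk, so the dictionary can grow beyond RAM:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Stream;

import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

public class DiskBackedWordCount {
    public static void main(String[] args) throws IOException {
        // The map is stored in wordcount.db on disk rather than on the heap.
        DB db = DBMaker.fileDB("wordcount.db").make();
        try {
            ConcurrentMap<String, Integer> counts = db
                    .hashMap("counts", Serializer.STRING, Serializer.INTEGER)
                    .createOrOpen();

            // Same counting loop as the in-memory version; only the backing store differs.
            try (Stream<String> lines = Files.lines(Paths.get("huge.txt"))) {
                lines.flatMap(line -> Stream.of(line.toLowerCase().split("\\W+")))
                     .filter(word -> !word.isEmpty())
                     .forEach(word -> counts.merge(word, 1, Integer::sum));
            }
        } finally {
            db.close(); // flush and release the on-disk store
        }
    }
}
```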

Answer by Matteo

If the list of unique words and their frequencies fits in memory (not the whole file, just the unique words), you can use a hash table and read the file sequentially (without storing it).

You can then sort the entries of the hash table by the number of occurrences.
