Given a file, find the ten most frequently occurring words as efficiently as possible

Notice: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4495241/

algorithm string

Asked by efficiencyIsBliss

This is apparently an interview question (found it in a collection of interview questions), but even if it's not, it's pretty cool.

We are told to do this efficiently on all complexity measures. I thought of creating a HashMap that maps the words to their frequency. That would be O(n) in time and space complexity, but since there may be lots of words we cannot assume that we can store everything in memory.

I must add that nothing in the question says that the words cannot be stored in memory, but what if that were the case? If that's not the case, then the question does not seem as challenging.
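
For illustration, a minimal sketch of this dictionary-counting idea in C#, assuming the distinct words do fit in memory (the file name "input.txt" is just a placeholder):

    using System;
    using System.IO;
    using System.Linq;
    using System.Collections.Generic;

    class TopTenWords
    {
        static void Main()
        {
            // One pass over the file: O(n) time, O(distinct words) space.
            var counts = new Dictionary<string, int>();
            foreach (var line in File.ReadLines("input.txt"))
            {
                foreach (var word in line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
                {
                    counts.TryGetValue(word, out int c);
                    counts[word] = c + 1;
                }
            }

            // Sort the distinct words by count and keep the ten largest.
            foreach (var pair in counts.OrderByDescending(p => p.Value).Take(10))
                Console.WriteLine($"{pair.Key}: {pair.Value}");
        }
    }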

Answered by Ben Hymanson

Optimizing for my own time:

sort file | uniq -c | sort -nr | head -10

Possibly followed by awk '{print $2}' to eliminate the counts.

Answered by Summer_More_More_Tea

I think the trie data structure is a good choice.

In the trie, each node can record a count representing the frequency of the word formed by the characters on the path from the root to that node.

The time complexity to set up the trie is O(Ln) ~ O(n) (where L is the number of characters in the longest word, which we can treat as a constant). To find the top 10 words, we can traverse the trie, which also costs O(n). So it takes O(n) to solve this problem.
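
A rough sketch of such a counting trie in C# (the per-node child dictionary and the sort-based top-ten selection below are just one possible layout, not something prescribed by the answer):

    using System.Linq;
    using System.Collections.Generic;

    class TrieNode
    {
        public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
        public int Count;   // occurrences of the word ending at this node
    }

    class WordTrie
    {
        private readonly TrieNode root = new TrieNode();

        // O(L) per word, where L is the word length.
        public void Add(string word)
        {
            var node = root;
            foreach (var ch in word)
            {
                if (!node.Children.TryGetValue(ch, out var next))
                    node.Children[ch] = next = new TrieNode();
                node = next;
            }
            node.Count++;
        }

        // Walk the whole trie once and return the ten most frequent words.
        public IEnumerable<(string Word, int Count)> TopTen()
        {
            var all = new List<(string, int)>();
            Collect(root, "", all);
            return all.OrderByDescending(t => t.Item2).Take(10);
        }

        private static void Collect(TrieNode node, string prefix, List<(string, int)> acc)
        {
            if (node.Count > 0) acc.Add((prefix, node.Count));
            foreach (var kv in node.Children)
                Collect(kv.Value, prefix + kv.Key, acc);
        }
    }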

Answered by Alessandro

A complete solution would be something like this:

  1. Do an external sort: O(N log N)
  2. Count the word frequencies in the file: O(N)
  3. (An alternative would be to use a trie, as @Summer_More_More_Tea suggests, to count the frequencies, if you can afford that amount of memory): O(k*N) // for the first two steps
  4. Use a min-heap (see the sketch below):
    • Put the first n elements on the heap
    • For every remaining word, add it to the heap and then delete the new minimum from the heap
    • In the end the heap will contain the n most common words: O(|words| * log(n))

With the Trie the cost would be O(k*N), because the total number of words is generally bigger than the size of the vocabulary. Finally, since k is small for most Western languages, you can assume linear complexity.
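
The min-heap step (4) might look like this in C#; this sketch assumes .NET 6 or later for PriorityQueue, and counts stands for the word-to-frequency map produced by the counting step:

    using System.Collections.Generic;

    static class HeapStep
    {
        // counts: word -> frequency, produced by steps 1-3.
        public static IEnumerable<string> TopN(IReadOnlyDictionary<string, int> counts, int n = 10)
        {
            var heap = new PriorityQueue<string, int>();   // min-heap keyed by frequency
            foreach (var pair in counts)
            {
                heap.Enqueue(pair.Key, pair.Value);
                if (heap.Count > n)
                    heap.Dequeue();        // evict the least frequent of the n + 1 candidates
            }
            while (heap.Count > 0)
                yield return heap.Dequeue();   // the n most frequent words, least frequent first
        }
    }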

Answered by Alessandro

I have done it in C# like this (a sample):

using System;
using System.Linq;   // required by the query expression below

// Keeps the words that occur at least wordFrequency times (note: this filters by a
// threshold rather than selecting the ten most frequent words).
int wordFrequency = 10;
string words = "hello how r u u u u  u  u u  u  u u u  u u u u  u u u ? hello there u u u u ! great to c u there. hello .hello hello hello hello hello .hello hello hello hello hello hello ";

var result = (from word in words.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries)
              group word by word into g
              select new { Word = g.Key, Occurance = g.Count() }).ToList().FindAll(i => i.Occurance >= wordFrequency);
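
To get the ten most frequent words rather than those above a fixed threshold, the same query could instead end with .OrderByDescending(i => i.Occurance).Take(10) in place of the FindAll filter.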

Answered by amol_beast

Let's say we assign a random prime number to each of the 26 letters. Then we scan the file. Whenever we find a word, we calculate its hash value (a formula based on the positions and values of the letters making up the word). If we find this value in the hash table, then we know for sure that we are not encountering it for the first time, and we increment its key value, while maintaining an array of at most 10 entries. But if we encounter a new hash, then we store the file pointer for that hash value and initialize the key to 0.
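
The answer leaves the exact formula open; purely as an illustration (the prime table and the mixing formula below are my own assumptions, not part of the answer), a position-aware hash over per-letter primes might look like this:

    static class PrimeWordHash
    {
        // One arbitrary prime per letter a-z; any fixed assignment would do.
        private static readonly long[] Primes =
        {
             2,  3,  5,  7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
            43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
        };

        // Mix each letter's prime with its position so that anagrams hash differently.
        public static long Hash(string word)
        {
            long h = 0;
            for (int i = 0; i < word.Length; i++)
            {
                int idx = char.ToLowerInvariant(word[i]) - 'a';
                if (idx < 0 || idx > 25) continue;          // ignore non a-z characters
                h = unchecked(h * 131 + Primes[idx] * (i + 1));
            }
            return h;
        }
    }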

Answered by Aly Farahat

I think this is a typical application of counting sort since the sum of occurrences of each word is equal to the total number of words. A hash table with a counting sort should do the job in a time proportional to the number of words.
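
One possible reading of this in C# (a sketch under my interpretation: count with a hash table, then bucket the distinct words by their count, which can never exceed the total number of words, and walk the buckets from the highest count down):

    using System.Collections.Generic;

    static class CountingSortTopTen
    {
        public static List<string> TopTen(IEnumerable<string> words)
        {
            // 1. Hash-table pass: word -> count, O(N).
            var counts = new Dictionary<string, int>();
            int total = 0;
            foreach (var w in words)
            {
                counts.TryGetValue(w, out int c);
                counts[w] = c + 1;
                total++;
            }

            // 2. Counting-sort style buckets: buckets[c] holds the words seen exactly c times.
            //    No count can exceed total, so this stays proportional to the number of words.
            var buckets = new List<string>[total + 1];
            foreach (var pair in counts)
                (buckets[pair.Value] ??= new List<string>()).Add(pair.Key);

            // 3. Walk the buckets from the highest count down until ten words are collected.
            var result = new List<string>();
            for (int c = total; c >= 1 && result.Count < 10; c--)
            {
                if (buckets[c] == null) continue;
                foreach (var w in buckets[c])
                {
                    result.Add(w);
                    if (result.Count == 10) break;
                }
            }
            return result;
        }
    }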

Answered by user470379

You could make a time/space trade-off and go O(n^2) for time and O(1) for (memory) space by counting how many times a word occurs each time you encounter it in a linear pass of the data. If the count is above the top 10 found so far, then keep the word and the count; otherwise ignore it.
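
A sketch of how this trade-off might look in C# (my reading of the answer: re-scan the input once per word and keep only a fixed ten-entry array; the path parameter is a placeholder for the file being processed):

    using System;
    using System.IO;
    using System.Linq;
    using System.Collections.Generic;

    static class QuadraticTopTen
    {
        // Re-reads the file for every word, so time is O(n^2) while the extra
        // memory is only this fixed ten-entry array.
        public static (string Word, int Count)[] TopTen(string path)
        {
            var best = new (string Word, int Count)[10];

            foreach (var word in Words(path))
            {
                if (best.Any(b => b.Word == word)) continue;      // already in the top ten

                int count = Words(path).Count(w => w == word);    // second pass, counts this word

                // Replace the weakest of the current top ten if this word beats it.
                int weakest = 0;
                for (int i = 1; i < best.Length; i++)
                    if (best[i].Count < best[weakest].Count) weakest = i;
                if (count > best[weakest].Count) best[weakest] = (word, count);
            }
            return best.Where(b => b.Word != null).OrderByDescending(b => b.Count).ToArray();
        }

        private static IEnumerable<string> Words(string path)
        {
            foreach (var line in File.ReadLines(path))
                foreach (var w in line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
                    yield return w;
        }
    }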

Answered by EnabrenTane

This link says building a hash and sorting the values is best. I'm inclined to agree. http://www.allinterview.com/showanswers/56657.html

Here is a Bash implementation that does something similar, I think: http://www.commandlinefu.com/commands/view/5994/computes-the-most-frequent-used-words-of-a-text-file

Answered by Sanjit Saluja

Depending on the size of the input data, it may or may not be a good idea to keep a HashMap. Say, for instance, our hash map is too big to fit into main memory. This can cause a very high number of memory transfers, as most hash-map implementations need random access and are not very cache-friendly.

In such cases sorting the input data would be a better solution.

Answered by Amit Bose

    // Here h is assumed to hold the input text (defined elsewhere, e.g. read from the file),
    // and the dictionary below needs using System.Collections.Generic (and System.Linq for
    // the OrderByDescending/Take step mentioned at the end).
    string[] stringList = h.Split(" ".ToCharArray(),
                                  StringSplitOptions.RemoveEmptyEntries);
    int m = stringList.Length;
    var counts = new Dictionary<string, int>();   // word -> occurrence count
    for (int j = 0; j < m; j++)
    {
        int c = 0;
        // Count how many times stringList[j] appears in the whole array (O(m^2) in total).
        for (int k = 0; k < m; k++)
        {
            if (string.Compare(stringList[j], stringList[k]) == 0)
            {
                c = c + 1;
            }
        }
        counts[stringList[j]] = c;
    }
    // The ten most frequent words can then be taken with
    // counts.OrderByDescending(p => p.Value).Take(10).