Elegant ways to count the frequency of words in a file (C++)
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, link the original question, and attribute it to the original authors (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/4888879/
Asked by pintu
What are the elegant and effective ways to count the frequency of each "english" word in a file?
Answered by Nawaz
First of all, I define a letter_only std::locale so as to ignore punctuation coming from the stream, and to read only valid "english" letters from the input stream. That way, the stream will treat the words "ways", "ways." and "ways!" as just the same word "ways", because the stream will ignore punctuation like "." and "!".
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <locale>
#include <map>
#include <string>
#include <vector>

struct letter_only : std::ctype<char>
{
    letter_only() : std::ctype<char>(get_table()) {}
    static std::ctype_base::mask const* get_table()
    {
        // Default every character to whitespace, then mark only the ASCII
        // letters as alpha, so operator>> skips digits and punctuation.
        // (Filling 'A' through 'z' in one go would also classify the
        // punctuation between 'Z' and 'a' as letters, so fill each
        // letter range separately.)
        static std::vector<std::ctype_base::mask>
            rc(std::ctype<char>::table_size, std::ctype_base::space);
        std::fill(&rc['A'], &rc['Z' + 1], std::ctype_base::alpha);
        std::fill(&rc['a'], &rc['z' + 1], std::ctype_base::alpha);
        return &rc[0];
    }
};
Solution 1
int main()
{
    std::map<std::string, int> wordCount;
    std::ifstream input;
    input.imbue(std::locale(std::locale(), new letter_only())); // enable reading only letters!
    input.open("filename.txt");
    std::string word;
    while (input >> word)
    {
        ++wordCount[word];
    }
    for (std::map<std::string, int>::iterator it = wordCount.begin(); it != wordCount.end(); ++it)
    {
        std::cout << it->first << " : " << it->second << std::endl;
    }
}
Solution 2
struct Counter
{
    std::map<std::string, int> wordCount;
    void operator()(const std::string &item) { ++wordCount[item]; }
    operator std::map<std::string, int>() { return wordCount; }
};

int main()
{
    std::ifstream input;
    input.imbue(std::locale(std::locale(), new letter_only())); // enable reading only letters!
    input.open("filename.txt");
    std::istream_iterator<std::string> start(input);
    std::istream_iterator<std::string> end;
    std::map<std::string, int> wordCount = std::for_each(start, end, Counter());
    for (std::map<std::string, int>::iterator it = wordCount.begin(); it != wordCount.end(); ++it)
    {
        std::cout << it->first << " : " << it->second << std::endl;
    }
}
Answered by Chris Koknat
Perl is arguably not so elegant, but very effective.
I posted a solution here: Processing huge text files
In a nutshell,
1) If needed, strip punctuation and convert uppercase to lowercase: perl -pe "s/[^a-zA-Z \t\n']/ /g; tr/A-Z/a-z/" file_raw > file
2) Count the occurrence of each word. Print results sorted first by frequency, and then alphabetically: perl -lane '$h{$_}++ for @F; END{for $w (sort {$h{$b}<=>$h{$a} || $a cmp $b} keys %h) {print "$h{$w}\t$w"}}' file > freq
I ran this code on a 3.3GB text file with 580,000,000 words.
Perl 5.22 completed in under 3 minutes.
Answered by UmmaGumma
Here is a working solution. It should work with real text (including punctuation):
#include <iostream>
#include <fstream>
#include <map>
#include <string>
#include <cctype>

std::string getNextToken(std::istream &in)
{
    std::string ans;
    int c = in.get();                    // int, so EOF is representable
    while (in && !std::isalpha(c))       // skip non-letter characters
        c = in.get();
    while (in && std::isalpha(c))
    {
        ans.push_back(static_cast<char>(std::tolower(c)));
        c = in.get();
    }
    return ans;
}

int main()
{
    std::map<std::string, int> words;
    std::ifstream fin("input.txt");
    std::string s;
    while (!(s = getNextToken(fin)).empty())
        ++words[s];
    for (std::map<std::string, int>::iterator iter = words.begin(); iter != words.end(); ++iter)
        std::cout << iter->first << ' ' << iter->second << std::endl;
}
Edit: my code now calls tolower for every letter.
Answered by Baltasarq
My solution is the following. First, all punctuation symbols are converted to spaces. Then, essentially the same approach shown above is used to extract the words:
#include <fstream>
#include <map>
#include <sstream>
#include <string>

const std::string Symbols = ",;.:-()\t!¡¿?\"[]{}&<>+-*/=#'";
typedef std::map<std::string, unsigned int> WCCollection;

void countWords(const std::string fileName, WCCollection &wcc)
{
    std::ifstream input(fileName.c_str());
    if (input.is_open()) {
        std::string line;
        std::string word;
        while (std::getline(input, line)) {
            // Substitute punctuation symbols with spaces
            // (iterator, not const_iterator: we write through it)
            for (std::string::iterator it = line.begin(); it != line.end(); ++it) {
                if (Symbols.find(*it) != std::string::npos) {
                    *it = ' ';
                }
            }
            // Let std::operator>> separate by spaces
            std::istringstream filter(line);
            while (filter >> word) {
                ++wcc[word];
            }
        }
    }
}
Answered by Adrian McCarthy
1) Decide on exactly what you mean by "an English word". The definition should cover things like whether "able-bodied" is one word or two, how to handle apostrophes ("Don't trust 'em!"), whether capitalization is significant, etc.
2) Create a set of test cases so you can be sure you get all the decisions in step 1 correct.
3) Create a tokenizer that reads the next word (as defined in step 1) from the input and returns it in a standard form. Depending on your definition, this might be a simple state machine, a regular expression, or just relying on <istream>'s extraction operators (e.g., std::cin >> word;). Test your tokenizer with all the test cases from step 2.
4) Choose a data structure for keeping the words and counts. In modern C++, you'd probably end up with something like std::map<std::string, unsigned> or std::unordered_map<std::string, int>.
5) Write a loop that gets the next word from the tokenizer and increments its count in the histogram until there are no more words in the input.
Answered by Fred Nurk
Pseudocode for an algorithm which I believe to be close to what you want:
from collections import defaultdict

counts = defaultdict(int)
for line in file:  # 'file' is an already-open text file
    for word in line.split():
        if any(x.isalpha() for x in word):
            counts[word.upper()] += 1
freq = sorted(((count, word) for word, count in counts.items()), reverse=True)
for count, word in freq:
    print("%d\t%s" % (count, word))
Case-insensitive comparison is handled naïvely and probably combines words you don't want to combine in an absolutely general sense. Be careful of non-ASCII characters in your implementation of the above. False positives may include "1-800-555-TELL", "0xDEADBEEF", and "42 km", depending on what you want. Missed words include "911 emergency services" (I'd probably want that counted as three words).
In short, natural language parsing is hard: you probably can make do with some approximation depending on your actual use case.
Answered by Chirag Tayal
One simpler way is just to count the number of words by counting the spaces in the file, treating a run of more than one consecutive space as a single separator, if you assume only single spaces between words...
Answered by user9178028
#include <fstream>
#include <iostream>
#include <map>
#include <string>
using namespace std;

string mostCommon( string filename ) {
    ifstream input( filename );
    string mostFreqUsedWord;
    string token;
    map< string, int > wordFreq;
    if ( input.is_open() ) {
        while ( input >> token ) {
            ++wordFreq[ token ];
            if ( wordFreq[ token ] > wordFreq[ mostFreqUsedWord ] )
                mostFreqUsedWord = token;
        }
        input.close();
    } else {
        cout << "Unable to open file." << endl;
    }
    return mostFreqUsedWord;
}