C++: Counting the Occurrences of Each Word in a Text File
Original source: http://stackoverflow.com/questions/16867944/
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not this page's maintainer): Stack Overflow
Counting Occurrences of Each Word in a Text File
Asked by user2374842
Given a large text file containing many strings, what is the most efficient way in C++ to read the file and count how many times each word occurs? The file's size is unknown, so I cannot simply use a fixed-size array. There is another catch: each line of the file starts with a category keyword, and the words that follow are the features of that category. I need to be able to count how many times each word occurs within each category.
For example:
colors red blue green yellow orange purple
sky blue high clouds air empty vast big
ocean wet water aquatic blue
colors brown black blue white blue blue
In this example, I need to find that "blue" occurs 4 times within the "colors" category, even though it occurs 6 times in the file as a whole.
Answered by MasterMastic
I would use a stream to read and separate the words (stream extraction splits words on whitespace) and store them in a dictionary (the standard C++ container for this is std::map).
Here is a documented C++ example:
#include <cstdlib>  // EXIT_SUCCESS, EXIT_FAILURE
#include <fstream>  // Used to read from a file.
#include <iostream>
#include <map>      // A map is used to count the words.
#include <string>   // The map's key type.
using namespace std;

// Used to print the map later.
template <class KTy, class Ty>
void PrintMap(const map<KTy, Ty>& m)
{
    // 'typename' is required because const_iterator is a dependent type.
    typedef typename map<KTy, Ty>::const_iterator iterator;
    for (iterator p = m.begin(); p != m.end(); ++p)
        cout << p->first << ": " << p->second << endl;
}

int main(void)
{
    // Backslashes in a string literal must be escaped.
    static const char* fileName = "C:\\MyFile.txt";

    // Stores each word and its count.
    map<string, unsigned int> wordsCount;

    // Begin reading from the file.
    ifstream fileStream(fileName);

    // Check that the file was opened (as it should have been).
    if (!fileStream.is_open())
    {
        // We couldn't open the file. Report the error on the error stream.
        cerr << "Couldn't open the file." << endl;
        return EXIT_FAILURE;
    }

    // Extract one whitespace-separated word at a time. Testing the extraction
    // itself (rather than good()) avoids counting an empty word at end of file.
    string word;
    while (fileStream >> word)
        ++wordsCount[word]; // A word seen for the first time starts from 0.

    // Print the words map.
    PrintMap(wordsCount);
    return EXIT_SUCCESS;
}
Output:
air: 1
aquatic: 1
big: 1
black: 1
blue: 6
brown: 1
clouds: 1
colors: 2
empty: 1
green: 1
high: 1
ocean: 1
orange: 1
purple: 1
red: 1
sky: 1
vast: 1
water: 1
wet: 1
white: 1
yellow: 1
Answered by Filip
Tokenize the words and store them as key-value pairs.
UPDATE: I realized that I had misread the question. The following code should separate and count the words by category:
#include <fstream>
#include <iostream>
#include <map>
#include <string>
using namespace std;

int main()
{
    ifstream file;
    file.open("path\\to\\text\\file"); // Backslashes must be escaped in a string literal.
    if (!file.is_open()) return 1;

    // Maps category -> (word -> count).
    map<string, map<string, int> > categories;

    string s;
    while (getline(file, s)) // Stops cleanly when no line is left.
    {
        // The first word on the line is the category.
        string::size_type pos = s.find_first_of(' ');
        if (pos == string::npos) continue; // No features on this line.
        string category = s.substr(0, pos);
        s.erase(0, pos + 1);

        // Every remaining word is a feature of that category.
        while (!s.empty())
        {
            pos = s.find_first_of(' ');
            if (pos == string::npos)
                pos = s.size();
            string word = s.substr(0, pos);
            if (!word.empty())
                categories[category][word]++;
            s.erase(0, pos + 1);
        }
    }

    for (map<string, map<string, int> >::iterator cit = categories.begin(); cit != categories.end(); ++cit)
    {
        cout << "Category - " << cit->first << endl;
        for (map<string, int>::iterator wit = cit->second.begin(); wit != cit->second.end(); ++wit)
            cout << "\tword: " << wit->first << ",\t" << wit->second << endl;
    }
    return 0;
}
Update 2: Chris asked for a modification of the algorithm:
#include <fstream>
#include <iostream>
#include <map>
#include <string>
using namespace std;

int main()
{
    ifstream file;
    file.open("D:\\Documents\\txt.txt"); // Backslashes must be escaped in a string literal.
    if (!file.is_open()) return 1;

    // Maps word -> count, regardless of category.
    map<string, int> wordCounts;

    string s;
    while (getline(file, s)) // Stops cleanly when no line is left.
    {
        // Skip lines that contain no space (a single word or nothing).
        string::size_type pos = s.find_first_of(' ');
        if (pos == string::npos) continue;

        // Count every word on the line, including the first one.
        while (!s.empty())
        {
            pos = s.find_first_of(' ');
            if (pos == string::npos)
                pos = s.size();
            string word = s.substr(0, pos);
            if (!word.empty())
                wordCounts[word]++;
            s.erase(0, pos + 1);
        }
    }

    for (map<string, int>::iterator wit = wordCounts.begin(); wit != wordCounts.end(); ++wit)
        cout << "word: " << wit->first << "\t" << wit->second << endl;
    return 0;
}
Answered by DavidRR
Here's a solution that achieves your stated objective.
It uses std::map to maintain a count of the number of times that each (category, word) pair occurs.
std::istringstream is used to break the data first into rows, and then into words.
OUTPUT:
(colors, black) => 1
(colors, blue) => 4
(colors, brown) => 1
(colors, green) => 1
(colors, orange) => 1
(colors, purple) => 1
(colors, red) => 1
(colors, white) => 1
(colors, yellow) => 1
(ocean, aquatic) => 1
(ocean, blue) => 1
(ocean, water) => 1
(ocean, wet) => 1
(sky, air) => 1
(sky, big) => 1
(sky, blue) => 1
(sky, clouds) => 1
(sky, empty) => 1
(sky, high) => 1
(sky, vast) => 1
PROGRAM:
#include <iostream> // std::cout, std::endl
#include <map>      // std::map
#include <sstream>  // std::istringstream
#include <string>   // std::string, std::getline
#include <utility>  // std::pair, std::make_pair

int main()
{
    // The data.
    std::string content =
        "colors red blue green yellow orange purple\n"
        "sky blue high clouds air empty vast big\n"
        "ocean wet water aquatic blue\n"
        "colors brown black blue white blue blue\n";

    // Load the data into an in-memory table.
    std::istringstream table(content);
    std::string row;
    std::string category;
    std::string word;
    const char delim = ' ';
    std::map<std::pair<std::string, std::string>, long> category_map;
    std::pair<std::string, std::string> cw_pair;
    long count;

    // Read each row from the in-memory table; the loop ends when no row is left.
    while (std::getline(table, row))
    {
        // Allow the row to be read word-by-word.
        std::istringstream words(row);

        // Get the first word in the row; it is the category.
        std::getline(words, category, delim);

        // Get the remaining words in the row.
        while (std::getline(words, word, delim)) {
            cw_pair = std::make_pair(category, word);

            // Maintain a count of each time a (category, word) pair occurs.
            // operator[] value-initializes a missing entry to 0 before the increment.
            category_map[cw_pair] += 1;
        }
    }

    // Print out each unique (category, word) pair and
    // the number of times that it occurs.
    std::map<std::pair<std::string, std::string>, long>::iterator it;
    for (it = category_map.begin(); it != category_map.end(); ++it) {
        cw_pair = it->first;
        category = cw_pair.first;
        word = cw_pair.second;
        count = it->second;
        std::cout << "(" << category << ", " << word << ") => "
                  << count << std::endl;
    }
}