原文地址: http://stackoverflow.com/questions/29526643/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Counting frequency of words from a .txt file in java
Asked by Kommander Kitten
I am working on a Comp Sci assignment. In the end, the program will determine whether a file is written in English or French. Right now, I'm struggling with the method that counts the frequency of the words that appear in a .txt file.
I have a set of text files in both English and French in their respective folders, labeled 1-20. The method takes a directory (which in this case is "docs/train/eng/" or "docs/train/fre/") and the number of files the program should go through (there are 20 files in each folder). It then reads each file, splits all the words apart (I don't need to worry about capitalization or punctuation), and puts every word in a HashMap along with how many times it appeared in the file (key = word, value = frequency).
This is the code I came up with for the method:
public static HashMap<String, Integer> countWords(String directory, int nFiles) {
    // Declare the HashMap
    HashMap<String, Integer> wordCount = new HashMap();
    // this large 'for' loop will go through each file in the specified directory.
    for (int k = 1; k < nFiles; k++) {
        // Puts together the string that the FileReader will refer to.
        String learn = directory + k + ".txt";
        try {
            FileReader reader = new FileReader(learn);
            BufferedReader br = new BufferedReader(reader);
            // The BufferedReader reads the lines
            String line = br.readLine();
            // Split the line into a String array to loop through
            String[] words = line.split(" ");
            int freq = 0;
            // for loop goes through every word
            for (int i = 0; i < words.length; i++) {
                // Case if the HashMap already contains the key.
                // If so, just increments the value
                if (wordCount.containsKey(words[i])) {
                    wordCount.put(words[i], freq++);
                }
                // Otherwise, puts the word into the HashMap
                else {
                    wordCount.put(words[i], freq++);
                }
            }
            // Catching the file not found error
            // and any other errors
        }
        catch (FileNotFoundException fnfe) {
            System.err.println("File not found.");
        }
        catch (Exception e) {
            System.err.print(e);
        }
    }
    return wordCount;
}
The code compiles. Unfortunately, when I asked it to print the results of all the word counts for the 20 files, it printed this. It's complete gibberish (though the words are definitely there) and is not at all what I need the method to do.
If anyone could help me debug my code, I would greatly appreciate it. I've been at it for ages, conducting test after test and I'm ready to give up.
Answered by jas
I would have expected something more like this. Does it make sense?
if (wordCount.containsKey(words[i])) {
    int n = wordCount.get(words[i]);
    wordCount.put(words[i], ++n);
}
// Otherwise, puts the word into the HashMap
else {
    wordCount.put(words[i], 1);
}
If the word is already in the hashmap, we want to get the current count, add 1 to that, and store the new count for the word in the hashmap.
If the word is not yet in the hashmap, we simply put it in the map with a count of 1 to start with. The next time we see the same word, we'll up the count to 2, etc.
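The same update can also be written with `Map.merge`, which has been in the standard library since Java 8. A minimal sketch; the class name `CountUpdate` and the helper `countAll` are hypothetical names chosen for illustration, not part of the original code:

```java
import java.util.HashMap;
import java.util.Map;

class CountUpdate {
    // Hypothetical helper: counts how often each word occurs in the array.
    static Map<String, Integer> countAll(String[] words) {
        Map<String, Integer> wordCount = new HashMap<>();
        for (String word : words) {
            // Stores 1 if the word is absent, otherwise adds 1 to the stored count.
            wordCount.merge(word, 1, Integer::sum);
        }
        return wordCount;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countAll(new String[] {"le", "chat", "le"});
        System.out.println(counts.get("le"));   // 2
        System.out.println(counts.get("chat")); // 1
    }
}
```

Either form works; `merge` just folds the "contains, get, put" steps into one call.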
Answered by Michael Hobbs
Let me combine all the good answers here.
1) Split up your methods so that each one handles one thing: one to read the files into a String[], one to process the String[], and one to call the first two.
2) When you split, think carefully about how you want to split. As @m0skit0 suggests, you should likely split on \b for this problem.
3) As @jas suggested, you should first check whether your map already has the word. If it does, increment the count; if not, add the word to the map and set its count to 1.
4) To print out the map in the way you likely expect, take a look at the following:
Map<String, Integer> test = new HashMap<>();
for (Map.Entry<String, Integer> entry : test.entrySet()) {
    System.out.println(entry.getKey() + " " + entry.getValue());
}
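Points 1) to 4) could be put together roughly as follows. This is only a sketch under a few assumptions that are not in the original post: the class and method names (`WordFrequency`, `readWords`, `addCounts`) are made up for illustration, the files are assumed to be UTF-8, words are separated on whitespace only, every line of each file is read, and the loop runs up to and including `nFiles`. The file naming (`directory + k + ".txt"`) follows the question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordFrequency {
    // Reads every line of one file and returns all of its words.
    // Assumption: the file is UTF-8 and words are separated by whitespace.
    static List<String> readWords(String fileName) throws IOException {
        List<String> words = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(fileName), StandardCharsets.UTF_8)) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) { // blank lines and leading spaces yield ""
                    words.add(word);
                }
            }
        }
        return words;
    }

    // Adds the given words to the running counts.
    static void addCounts(Map<String, Integer> wordCount, List<String> words) {
        for (String word : words) {
            Integer n = wordCount.get(word);
            wordCount.put(word, n == null ? 1 : n + 1);
        }
    }

    // Calls the two methods above for 1.txt .. nFiles.txt in the directory.
    static Map<String, Integer> countWords(String directory, int nFiles) {
        Map<String, Integer> wordCount = new HashMap<>();
        for (int k = 1; k <= nFiles; k++) {
            try {
                addCounts(wordCount, readWords(directory + k + ".txt"));
            } catch (IOException e) {
                e.printStackTrace(); // report the failure and continue with the next file
            }
        }
        return wordCount;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("docs/train/eng/", 20);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

Keeping the counting in its own method means it can be checked with a hand-written list of words, without touching any files.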
Answered by m0skit0
If you split by space only, then other signs (parentheses, punctuation marks, etc.) will be included in the words. For example, if you split "This phrase, contains... funny stuff" by space, you get: "This", "phrase,", "contains...", "funny" and "stuff".
You can avoid this by splitting by word boundary (\b) instead.
line.split("\\b"); // the backslash must be doubled: "\b" in a Java string literal is a backspace character
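One caveat: splitting on `\b` cuts at every word boundary, so the spaces and punctuation between words come back as tokens of their own and still need to be filtered out. A minimal sketch of an alternative that splits on runs of non-letters instead; the class `SplitDemo` and the helper `lettersOnly` are hypothetical names, and treating every non-letter as a separator is an assumption that may not suit words with apostrophes or hyphens:

```java
import java.util.ArrayList;
import java.util.List;

class SplitDemo {
    // Hypothetical helper: keeps only runs of letters as words.
    static List<String> lettersOnly(String line) {
        List<String> words = new ArrayList<>();
        // \p{L} matches any Unicode letter, so accented French letters stay inside words.
        for (String token : line.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) { // a line that starts with punctuation yields a leading ""
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "This phrase, contains... funny stuff";
        // Splitting on \b also returns the text between the words as tokens.
        for (String token : line.split("\\b")) {
            System.out.println("[" + token + "]");
        }
        System.out.println(lettersOnly(line)); // [This, phrase, contains, funny, stuff]
    }
}
```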
Btw your if and else parts are identical. You're always incrementing freq by one, which doesn't make much sense. If the word is already in the map, you want to get the current frequency, add 1 to it, and update the frequency in the map. If not, you put it in the map with a value of 1.
And pro tip: always print/log the full stacktrace for the exceptions.
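As a small illustration of that tip, `e.printStackTrace()` writes the exception type, its message and the full call chain to standard error, which is far more informative than a fixed "File not found." message. A minimal sketch; the class `StackTraceDemo` and the helper `firstLine` are hypothetical names:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

class StackTraceDemo {
    // Hypothetical helper: returns the first line of a file,
    // or null if the file is empty or cannot be read.
    static String firstLine(String fileName) {
        try (BufferedReader br = new BufferedReader(new FileReader(fileName))) {
            return br.readLine();
        } catch (IOException e) {
            // Prints the exception type, message and call chain to System.err.
            e.printStackTrace();
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(firstLine("docs/train/eng/1.txt"));
    }
}
```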