Java倒排索引程序
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/23449621/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Java Inverted Index program
提问by user3600008
I am writing an inverted index program on java which returns the frequency of terms among multiple documents. I have been able to return the number times a word appears in the entire collection, but I have not been able to return which documents the word appears in. This is the code I have so far:
我正在用 java 编写一个倒排索引程序,它返回多个文档之间的术语频率。我已经能够返回一个单词在整个集合中出现的次数,但我无法返回该单词出现在哪些文档中。 这是我迄今为止的代码:
import java.util.*; // Provides TreeMap, Iterator, Scanner
import java.io.*; // Provides FileReader, FileNotFoundException
public class Run
{
public static void main(String[ ] args)
{
// **THIS CREATES A TREE MAP**
TreeMap<String, Integer> frequencyData = new TreeMap<String, Integer>( );
Map[] mapArray = new Map[5];
mapArray[0] = new HashMap<String, Integer>();
readWordFile(frequencyData);
printAllCounts(frequencyData);
}
public static int getCount(String word, TreeMap<String, Integer> frequencyData)
{
if (frequencyData.containsKey(word))
{ // The word has occurred before, so get its count from the map
return frequencyData.get(word); // Auto-unboxed
}
else
{ // No occurrences of this word
return 0;
}
}
public static void printAllCounts(TreeMap<String, Integer> frequencyData)
{
System.out.println("-----------------------------------------------");
System.out.println(" Occurrences Word");
for(String word : frequencyData.keySet( ))
{
System.out.printf("%15d %s\n", frequencyData.get(word), word);
}
System.out.println("-----------------------------------------------");
}
public static void readWordFile(TreeMap<String, Integer> frequencyData)
{
int total = 0;
Scanner wordFile;
String word; // A word read from the file
Integer count; // The number of occurrences of the word
int counter = 0;
int docs = 0;
//**FOR LOOP TO READ THE DOCUMENTS**
for(int x=0; x<Docs.length; x++)
{ //start of for loop [*
try
{
wordFile = new Scanner(new FileReader(Docs[x]));
}
catch (FileNotFoundException e)
{
System.err.println(e);
return;
}
while (wordFile.hasNext( ))
{
// Read the next word and get rid of the end-of-line marker if needed:
word = wordFile.next( );
// This makes the Word lower case.
word = word.toLowerCase();
word = word.replaceAll("[^a-zA-Z0-9\s]", "");
// Get the current count of this word, add one, and then store the new count:
count = getCount(word, frequencyData) + 1;
frequencyData.put(word, count);
total = total + count;
counter++;
docs = x + 1;
}
} //End of for loop *]
System.out.println("There are " + total + " terms in the collection.");
System.out.println("There are " + counter + " unique terms in the collection.");
System.out.println("There are " + docs + " documents in the collection.");
}
// Array of documents
static String Docs [] = {"words.txt", "words2.txt",};
回答by Alexey Malev
Try adding second map word -> set of document name
like this:
尝试word -> set of document name
像这样添加第二张地图:
Map<String, Set<String>> filenames = new HashMap<String, Set<String>>();
...
word = word.replaceAll("[^a-zA-Z0-9\s]", "");
// Get the current count of this word, add one, and then store the new count:
count = getCount(word, frequencyData) + 1;
frequencyData.put(word, count);
Set<String> filenamesForWord = filenames.get(word);
if (filenamesForWord == null) {
filenamesForWord = new HashSet<String>();
}
filenamesForWord.add(Docs[x]);
filenames.put(word, filenamesForWord);
total = total + count;
counter++;
docs = x + 1;
When you need to get a set of filenames in which you encountered a particular word, you'll just get()
it from the map filenames
. Here is the example that prints out all the file names, in which we have encountered a word:
当您需要获取一组在其中遇到特定单词的文件名时,您只需get()
从 map 中获取即可filenames
。这是打印出所有文件名的示例,其中我们遇到了一个单词:
public static void printAllCounts(TreeMap<String, Integer> frequencyData, Map<String, Set<String>> filenames) {
System.out.println("-----------------------------------------------");
System.out.println(" Occurrences Word");
for(String word : frequencyData.keySet( ))
{
System.out.printf("%15d %s\n", frequencyData.get(word), word);
for (String filename : filenames.get(word)) {
System.out.println(filename);
}
}
System.out.println("-----------------------------------------------");
}
回答by Bobulous
Instead of simply having a Map from word to count, create a Map from each word to a nested Map from document to count. In other words:
不是简单地使用从单词到计数的 Map,而是创建从每个单词到从文档到计数的嵌套 Map 的 Map。换句话说:
Map<String, Map<String, Integer>> wordToDocumentMap;
Then, inside your loop which records the counts, you want to use code which looks like this:
然后,在记录计数的循环中,您希望使用如下所示的代码:
Map<String, Integer> documentToCountMap = wordToDocumentMap.get(currentWord);
if(documentToCountMap == null) {
// This word has not been found anywhere before,
// so create a Map to hold document-map counts.
documentToCountMap = new TreeMap<>();
wordToDocumentMap.put(currentWord, documentToCountMap);
}
Integer currentCount = documentToCountMap.get(currentDocument);
if(currentCount == null) {
// This word has not been found in this document before, so
// set the initial count to zero.
currentCount = 0;
}
documentToCountMap.put(currentDocument, currentCount + 1);
Now you're capturing the counts on a per-word and per-document basis.
现在,您正在捕获每个单词和每个文档的计数。
Once you've completed the analysis and you want to print a summary of the results, you can run through the map like so:
完成分析并希望打印结果摘要后,您可以像这样运行地图:
for(Map.Entry<String, Map<String,Integer>> wordToDocument :
wordToDocumentMap.entrySet()) {
String currentWord = wordToDocument.getKey();
Map<String, Integer> documentToWordCount = wordToDocument.getValue();
for(Map.Entry<String, Integer> documentToFrequency :
documentToWordCount.entrySet()) {
String document = documentToFrequency.getKey();
Integer wordCount = documentToFrequency.getValue();
System.out.println("Word " + currentWord + " found " + wordCount +
" times in document " + document);
}
}
For an explanation of the for-each structure in Java, see this tutorial page.
有关 Java 中 for-each 结构的说明,请参阅本教程页面。
For a good explanation of the features of the Map interface, including the entrySet
method, see this tutorial page.
有关 Map 界面功能的详细说明,包括entrySet
方法,请参阅本教程页面。
回答by Peter
I've put a scanner into the main methode, and the word I search for will return the documents the word occurce in. I also return how many times the word occurs, but I will only get it to be the total of times in all of three documents. And I want it to return how many times it occurs in each document. I want this to be able to calculate tf-idf, if u have a total answer for the whole tf-idf I would appreciate. Cheers
我已经将扫描仪放入主要方法中,我搜索的单词将返回该单词出现的文档。我还返回该单词出现的次数,但我只会得到它的总次数三个文件。我希望它返回它在每个文档中出现的次数。我希望这能够计算 tf-idf,如果您对整个 tf-idf 有一个完整的答案,我将不胜感激。干杯
Here is my code:
这是我的代码:
import java.util.*; // Provides TreeMap, Iterator, Scanner
import java.io.*; // Provides FileReader, FileNotFoundException
public class test2
{
public static void main(String[ ] args)
{
// **THIS CREATES A TREE MAP**
TreeMap<String, Integer> frequencyData = new TreeMap<String, Integer>();
Map<String, Set<String>> filenames = new HashMap<String, Set<String>>();
Map<String, Integer> countByWords = new HashMap<String, Integer>();
Map[] mapArray = new Map[5];
mapArray[0] = new HashMap<String, Integer>();
readWordFile(countByWords, frequencyData, filenames);
printAllCounts(countByWords, frequencyData, filenames);
}
public static int getCount(String word, TreeMap<String, Integer> frequencyData)
{
if (frequencyData.containsKey(word))
{ // The word has occurred before, so get its count from the map
return frequencyData.get(word); // Auto-unboxed
}
else
{ // No occurrences of this word
return 0;
}
}
public static void printAllCounts( Map<String, Integer> countByWords, TreeMap<String, Integer> frequencyData, Map<String, Set<String>> filenames)
{
System.out.println("-----------------------------------------------");
System.out.print("Search for a word: ");
String worde;
int result = 0;
Scanner input = new Scanner(System.in);
worde=input.nextLine();
if(!filenames.containsKey(worde)){
System.out.println("The word does not exist");
}
else{
for(String filename : filenames.get(worde)){
System.out.println(filename);
System.out.println(countByWords.get(worde));
}
}
System.out.println("\n-----------------------------------------------");
}
public static void readWordFile(Map<String, Integer> countByWords ,TreeMap<String, Integer> frequencyData, Map<String, Set<String>> filenames)
{
Scanner wordFile;
String word; // A word read from the file
Integer count; // The number of occurrences of the word
int counter = 0;
int docs = 0;
//**FOR LOOP TO READ THE DOCUMENTS**
for(int x=0; x<Docs.length; x++)
{ //start of for loop [*
try
{
wordFile = new Scanner(new FileReader(Docs[x]));
}
catch (FileNotFoundException e)
{
System.err.println(e);
return;
}
while (wordFile.hasNext( ))
{
// Read the next word and get rid of the end-of-line marker if needed:
word = wordFile.next( );
// This makes the Word lower case.
word = word.toLowerCase();
word = word.replaceAll("[^a-zA-Z0-9\s]", "");
// Get the current count of this word, add one, and then store the new count:
count = countByWords.get(word);
if(count != null){
countByWords.put(word, count + 1);
}
else{
countByWords.put(word, 1);
}
Set<String> filenamesForWord = filenames.get(word);
if (filenamesForWord == null) {
filenamesForWord = new HashSet<String>();
}
filenamesForWord.add(Docs[x]);
filenames.put(word, filenamesForWord);
counter++;
docs = x + 1;
}
} //End of for loop *]
System.out.println("There are " + counter + " terms in the collection.");
System.out.println("There are " + docs + " documents in the collection.");
}
// Array of documents
static String Docs [] = {"Document1.txt", "Document2.txt", "Document3.txt"};
}