Java: counting the number of words in a file

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/4094119/

Time: 2020-08-14 11:32:50  Source: igfitidea

Counting number of words in a file

Tags: java, algorithm, loops, io

Asked by

I'm having a problem counting the number of words in a file. My approach is to count a word whenever I see a space or a newline.

The problem is that if there are multiple blank lines between paragraphs, I end up counting them as words too. If you look at the readFile() method you can see what I am doing.

Could you help me out and guide me in the right direction on how to fix this?

Example input file (including a blank line):

word word word
word word

word word word

Accepted answer by Brian Clements

I would change your approach a bit. First, I would use a BufferedReader to read the file line by line using readLine(). Then split each line on whitespace using String.split("\\s+") and use the length of the resulting array to see how many words are on that line; trim the line first and skip it if it is empty, because splitting an empty string still returns one (empty) element. To get the number of characters, you could look at the length of either each line or each split word (depending on whether you want to count whitespace as characters).
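The line-by-line approach described in this answer could look roughly like the sketch below. The class name LineWordCounter, the count method, and the choice to count only non-whitespace characters are assumptions made for illustration; they are not part of the original answer. The sketch reads from any Reader, so a FileReader can be passed in for a real file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class LineWordCounter {

    /** Returns {words, non-whitespace characters} for the given text source. */
    public static int[] count(Reader source) throws IOException {
        int words = 0;
        int chars = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    continue; // blank line: contributes no words
                }
                // "\\s+" treats a run of spaces or tabs as one separator
                String[] parts = trimmed.split("\\s+");
                words += parts.length;
                for (String part : parts) {
                    chars += part.length();
                }
            }
        }
        return new int[] {words, chars};
    }

    public static void main(String[] args) throws IOException {
        // The question's sample input, including the blank line.
        String sample = "word word word\nword word\n\nword word word\n";
        int[] result = count(new StringReader(sample));
        System.out.println("Words: " + result[0] + ", characters: " + result[1]);
    }
}
```

For the sample input from the question this reports 8 words, because the blank line is skipped instead of being counted.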

Answer by levik

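Just keep a boolean flag that tells you whether the previous character was whitespace, and count a word each time a non-whitespace character follows whitespace (or starts the input):

import java.io.IOException;
import java.io.Reader;

// Counts words by tracking whether the previous character was whitespace.
static int countWords(Reader input) throws IOException {
  boolean prevWhitespace = true; // treat the start of input like whitespace
  int wordCount = 0;
  int ch;
  while ((ch = input.read()) != -1) {
    if (Character.isWhitespace(ch)) {
      prevWhitespace = true;
    } else {
      if (prevWhitespace) {
        wordCount++; // first character of a new word
      }
      prevWhitespace = false;
    }
  }
  return wordCount;
}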

Answer by Gthompson83

Hack solution

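You can read the text file into a String variable, then split the String into an array using a single space as the delimiter: stringVar.split(" "). Note that a single-space delimiter does not split at newlines and yields empty elements for consecutive spaces; stringVar.split("\\s+") treats any run of whitespace as one separator.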

The array length would equal the number of "words" in the file. Of course, this wouldn't give you a count of lines.
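To see how much the delimiter matters, here is a small sketch that applies both kinds of split to the sample input from the question. The class and method names are made up for this illustration, and the text is held in a String literal rather than read from a file.

```java
public class SplitComparison {

    /** Number of array elements produced by splitting on a single space. */
    public static int countBySingleSpace(String text) {
        return text.split(" ").length;
    }

    /** Number of words when any run of whitespace counts as one separator. */
    public static int countByWhitespaceRuns(String text) {
        String trimmed = text.trim();
        // an empty string would still split into one (empty) element
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        // The question's sample input, including the blank line.
        String sample = "word word word\nword word\n\nword word word";
        System.out.println(countBySingleSpace(sample));    // prints 6
        System.out.println(countByWhitespaceRuns(sample)); // prints 8
    }
}
```

Splitting on a single space undercounts here because words separated only by a newline stay glued together in one array element.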

Answer by tanyehzheng

You can use a Scanner with a FileInputStream instead of a BufferedReader with a FileReader. For example:

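import java.io.File;
import java.io.FileInputStream;
import java.util.Scanner;

File file = new File("sample.txt");
// Scanner's default delimiter is whitespace, so each token is one word.
try (Scanner sc = new Scanner(new FileInputStream(file))) {
    int count = 0;
    while (sc.hasNext()) {
        sc.next();
        count++;
    }
    System.out.println("Number of words: " + count);
}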

Answer by fabrizioM

Three steps: consume all the whitespace, check whether it is a line break, then consume all the non-whitespace.

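// Assumes inFile is a java.io.Reader and that numberLines, numberWords
// and numberChars are int counters initialised to 0.
int c = inFile.read();
while (c != -1) {
    // consume whitespace, counting line breaks along the way
    while (c != -1 && Character.isWhitespace(c)) {
        if (c == '\n') { numberLines++; }
        c = inFile.read();
    }
    if (c == -1) { break; }
    // consume the non-whitespace characters of one word
    while (c != -1 && !Character.isWhitespace(c)) {
        numberChars++;
        c = inFile.read();
    }
    numberWords++;
}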

Answer by javasqlsecurity dot com

This is just a thought, but there is one very easy way to do it. If you only need the number of words and not the words themselves, just use WordUtils from Apache Commons Lang:

import org.apache.commons.lang.WordUtils;

public class CountWord {

    public static void main(String[] args) {
        String str = "Just keep a boolean flag around that lets you know if the previous character was whitespace or not pseudocode follows";

        // initials() returns the first letter of every whitespace-separated word
        String initials = WordUtils.initials(str);

        System.out.println(initials);
        // so the number of words in your file will be
        System.out.println(initials.length());
    }
}

Answer by Oso

I think a correct approach would be by means of Regex:

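import java.util.regex.Pattern;
// Stand-in for the text read from the file (the question's sample input).
String fileContent = "word word word\nword word\n\nword word word";
// "\\s+" matches one or more whitespace characters, so blank lines add no words.
String[] words = Pattern.compile("\\s+").split(fileContent.trim());
System.out.println("File has " + words.length + " words");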

Hope it helps. The meaning of \s+ (written "\\s+" in a Java string literal) is explained in the Pattern javadoc.

Answer by narendra kumar botta

import java.io.BufferedReader;
import java.io.FileReader;

public class CountWords {

    public static void main(String[] args) throws Exception {
        System.out.println("Counting Words");
        int count = 0;
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader("c:\\Customer1.txt"))) {
            String line = br.readLine();
            while (line != null) {
                String trimmed = line.trim();
                // skip blank lines: splitting "" would still return one empty element
                if (!trimmed.isEmpty()) {
                    count += trimmed.split("\\s+").length;
                }
                line = br.readLine();
            }
        }
        System.out.println(count);
    }
}

Answer by Yash

File Word-Count

If the words are separated by symbols rather than whitespace, you can split on those symbols and count the resulting words.

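// Needs java.io.File, java.io.FileInputStream and java.util.Scanner.
try (Scanner sc = new Scanner(new FileInputStream(new File("Input.txt")))) {
    int count = 0;
    while (sc.hasNext()) {
        // split each whitespace-delimited token on runs of the listed symbols
        String[] s = sc.next().split("[.@:=#-]+");
        for (int i = 0; i < s.length; i++) {
            if (!s[i].isEmpty()) {
                System.out.println(s[i]);
                count++;
            }
        }
    }
    System.out.println("Word-Count : " + count);
}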

Answer by F.A. Botic

Take a look at my solution here; it should work. The idea is to strip all the unwanted symbols from the words, then separate the words and store them in another variable (I used an ArrayList). By adjusting the "excludedSymbols" variable you can add more symbols that you would like excluded from the words.

// Needs java.io.FileInputStream, java.io.InputStream, java.io.IOException
// and java.util.ArrayList.
public static void countWords() {
    String textFileLocation = "c:\\yourFileLocation";
    String readWords = "";
    ArrayList<String> extractOnlyWordsFromTextFile = new ArrayList<>();
    // excludedSymbols can be extended to whatever you want to exclude from the file
    String[] excludedSymbols = {" ", ",", ".", "/", ":", ";", "<", ">", "\n"};

    // try-with-resources closes the stream even if reading fails
    try (InputStream inputStream = new FileInputStream(textFileLocation)) {
        // read() returns an int so that -1 (end of file) is distinct from byte 0xFF
        int byte1 = inputStream.read();
        while (byte1 != -1) {
            String readByteCharByChar = String.valueOf((char) byte1);
            boolean testIfWord = false;
            for (int i = 0; i < excludedSymbols.length; i++) {
                if (readByteCharByChar.equals(excludedSymbols[i])) {
                    // a separator ends the current word, if there is one
                    if (!readWords.equals("")) {
                        extractOnlyWordsFromTextFile.add(readWords);
                    }
                    readWords = "";
                    testIfWord = true;
                    break;
                }
            }
            if (!testIfWord) {
                readWords += (char) byte1;
            }
            byte1 = inputStream.read();
            if (byte1 == -1 && !readWords.equals("")) {
                extractOnlyWordsFromTextFile.add(readWords);
            }
        }
        System.out.println(extractOnlyWordsFromTextFile);
        System.out.println("The number of words in the chosen text file is: " + extractOnlyWordsFromTextFile.size());
    } catch (IOException ioException) {
        ioException.printStackTrace();
    }
}