Java: counting the number of occurrences of a word in a string

Question

Asked by Doug

I am reading a large text file and need to find out how many times certain words occur, for example the word `the`. I'm doing this line by line; each line is a string.

I need to make sure that I only count legitimate occurrences of `the`: the `the` in `other` should not count. This means I need to use regular expressions in some way. What I have tried so far is this:

numSpace += line.split("[^a-z]the[^a-z]").length;

I realize the regular expression may not be correct at the moment, but I also tried without it, just looking for occurrences of the word `the`, and I get wrong numbers too. I was under the impression that this would split the string into an array, and that the number of pieces in that array would tell me how many times the word is in the string. I would be grateful for any ideas.

Update: Given some ideas, I've come up with this:

numThe += line.split("[^a-zA-Z][Tt]he[^a-zA-Z]", -1).length - 1;

I am still getting some strange numbers, though. I was able to get an accurate general count (without the regular expression); now my issue is with the regexp.

Answer 1

Accepted answer by polygenelubricants

Using `split` to count isn't the most efficient approach, but if you insist on doing that, the proper way is this:

haystack.split(needle, -1).length - 1

If you don't set `limit` to `-1`, `split` defaults to `0`, which removes trailing empty strings and messes up your count.

From the API:

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. [...] If `n` is zero then [...] trailing empty strings will be discarded.

You also need to subtract 1 from the `length` of the array, because `N` occurrences of the delimiter split the string into `N+1` parts.
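To make the effect of the `limit` argument concrete, here is a minimal sketch. The class name `SplitCount` and the helper `countBySplit` are illustrative names, not part of the original answer; the delimiter is a plain comma so the arithmetic is easy to follow.

```java
class SplitCount {
    // Counts matches of the regex `needle` in `haystack` using split.
    // limit = -1 keeps trailing empty strings, so N matches always yield N + 1 parts.
    static int countBySplit(String haystack, String needle) {
        return haystack.split(needle, -1).length - 1;
    }

    public static void main(String[] args) {
        String line = "a,b,,"; // three commas, two of them trailing

        // Default limit (0): the trailing empty strings are dropped, leaving ["a", "b"].
        System.out.println(line.split(",").length - 1); // prints 1

        // limit = -1: all parts are kept, ["a", "b", "", ""].
        System.out.println(countBySplit(line, ","));    // prints 3
    }
}
```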

As for the regex itself (i.e. the `needle`), you can use `\b`, the word boundary anchor, around the `word`. If you allow `word` to contain metacharacters (e.g. to count occurrences of `"$US"`), you may want to `Pattern.quote` it.
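As a rough sketch of that idea, the helper below wraps a quoted word in `\b` anchors. The names `WholeWordPattern`, `wholeWord` and `count` are made up for illustration, and the `CASE_INSENSITIVE` flag is an assumption about what the asker wants. Note that `\b` is defined in terms of word characters, so it suits words made of letters, digits and underscores; a word that starts or ends with a symbol would need a different boundary test.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordPattern {
    // Builds a pattern that matches `word` literally, only between word boundaries.
    static Pattern wholeWord(String word) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
    }

    // Counts how many times the pattern matches in the text.
    static int count(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // "other" and "then" contain "the" but are not counted.
        System.out.println(count(wholeWord("the"), "The other the, then the.")); // prints 3

        // Pattern.quote makes the dot literal, so "axb" is not a match.
        System.out.println(count(wholeWord("a.b"), "a.b axb"));                  // prints 1
    }
}
```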

I've come up with this:
numThe += line.split("[^a-zA-Z][Tt]he[^a-zA-Z]", -1).length - 1;
Though still getting some strange numbers. I was able to get an accurate general count (without the regular expression), now my issue is with the regexp.

Now the issue is that you're not counting a `[Tt]he` that appears as the first or last word, because the regex says it has to be preceded and followed by some character that matches `[^a-zA-Z]` (that is, your match must be of length 5!). You're not allowing for the case where there isn't a character at all!

You can try something like this instead:

"(^|[^a-zA-Z])[Tt]he([^a-zA-Z]|$)"

This isn't the most concise solution, but it works.

Something like this (using negative lookarounds) also works:

"(?<![a-zA-Z])[Tt]he(?![a-zA-Z])"

This has the benefit of matching just `[Tt]he`, without any extra characters around it as in your previous solution. This is relevant if you actually want to process the tokens returned by `split`, because the delimiter in this case isn't "stealing" anything from the tokens.
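The difference between the three patterns can be checked with a small comparison. This is only a sketch: the class `BoundaryComparison` and its `count` helper are illustrative, and the lookaround pattern used here is `(?<![a-zA-Z])[Tt]he(?![a-zA-Z])`, i.e. "not preceded by a letter, not followed by a letter".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryComparison {
    static final String CONSUMING   = "[^a-zA-Z][Tt]he[^a-zA-Z]";
    static final String ALTERNATION = "(^|[^a-zA-Z])[Tt]he([^a-zA-Z]|$)";
    static final String LOOKAROUND  = "(?<![a-zA-Z])[Tt]he(?![a-zA-Z])";

    // Counts non-overlapping matches of `regex` in `text`.
    static int count(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String line = "The cat and the other the";
        System.out.println(count(CONSUMING, line));   // prints 1: first and last word are missed
        System.out.println(count(ALTERNATION, line)); // prints 3
        System.out.println(count(LOOKAROUND, line));  // prints 3

        // Adjacent occurrences share a single space as separator.
        String adjacent = "the the the";
        System.out.println(count(ALTERNATION, adjacent)); // prints 2
        System.out.println(count(LOOKAROUND, adjacent));  // prints 3
    }
}
```

The last two lines show one more reason to prefer lookarounds: because the alternation pattern consumes the separator, the space after one match is no longer available as the "preceding character" of the next, so back-to-back occurrences can be undercounted.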

Non-`split`

Though using `split` to count is rather convenient, it isn't the most efficient approach (e.g. it does all kinds of work to return strings that you then discard). The fact that you're counting line by line, as you said, also means that the pattern has to be recompiled and thrown away for every line.

A more efficient way would be to use the same regex as before and do the usual `Pattern.compile` and `while (matcher.find()) count++;`
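A minimal sketch of that approach follows. The pattern is compiled once and reused for every line; the class name `TheCounter`, the method `countInLine` and the sample lines are assumptions for illustration, and a real program would read the lines from the file instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TheCounter {
    // Compiled once, reused for every line.
    private static final Pattern THE = Pattern.compile("(?<![a-zA-Z])[Tt]he(?![a-zA-Z])");

    // Counts whole-word occurrences of "the"/"The" in a single line.
    static int countInLine(String line) {
        Matcher matcher = THE.matcher(line);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for lines read from the file.
        String[] lines = {
            "The quick fox",
            "jumped over the other dog.",
            "Then the end: the"
        };
        int numThe = 0;
        for (String line : lines) {
            numThe += countInLine(line);
        }
        System.out.println(numThe); // prints 4
    }
}
```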

Answer 2

Answered by codaddict

You can try using the word boundary `\b` in the regex:

\bthe\b

Also, the size of the array returned by `split` will be 1 more than the actual number of occurrences of the word `the` in the string.

Answer 3

Answered by Jeff Beck

Why not run your line through the Java `StringTokenizer`? Then you can break the words up not just on spaces but also on commas and other punctuation. Just run through your tokens and count the occurrences of "the" or any other word you like.

It would be very easy to expand this a bit and build a map that has each word as a key and keeps a count of how often each word is used. You may also want to run each word through a function that stems it, so you can count something more useful than just the raw words.
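A sketch of the tokenizer-plus-map idea might look like the following. The class `TokenCounts`, the method `countWords` and the delimiter set are assumptions chosen for illustration; stemming is left out. `Map.merge` requires Java 8 or later.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class TokenCounts {
    // Characters treated as word separators (an illustrative, incomplete set).
    private static final String DELIMITERS = " \t\n\r,.;:!?()\"";

    // Returns a map from lower-cased word to the number of times it occurs in the line.
    static Map<String, Integer> countWords(String line) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer tokens = new StringTokenizer(line, DELIMITERS);
        while (tokens.hasMoreTokens()) {
            String word = tokens.nextToken().toLowerCase();
            counts.merge(word, 1, Integer::sum); // insert 1 or add 1 to the existing count
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("The cat, the dog; other.");
        System.out.println(counts.get("the")); // prints 2
        System.out.println(counts);            // prints {cat=1, dog=1, other=1, the=2}
    }
}
```

Because the tokenizer yields whole tokens, `other` is counted as its own word and never contributes to the count for `the`.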

Answer 4

Answered by drekka

I think this is an area where unit tests can really help. I had a similar problem some time ago, where I wanted to break a string up in a number of complex ways. Creating a number of tests, each run against a different source string, helped me isolate the regex and quickly see when I got it wrong.

Certainly, if you gave us an example of a test string and the result you expect, it would help us give you better answers.
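As a sketch of what such tests could look like without any test framework, the class below runs a regex against a handful of edge-case strings and fails loudly on the first mismatch. The names `RegexCases`, `count`, `check` and `runAll`, the pattern `\b[Tt]he\b`, and the chosen inputs are all illustrative assumptions; in a real project these would more likely be JUnit test methods.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexCases {
    private static final Pattern THE = Pattern.compile("\\b[Tt]he\\b");

    static int count(String text) {
        Matcher matcher = THE.matcher(text);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    // Throws if the regex does not find the expected number of matches.
    static void check(String input, int expected) {
        int actual = count(input);
        if (actual != expected) {
            throw new AssertionError("\"" + input + "\": expected " + expected + " but got " + actual);
        }
    }

    static void runAll() {
        check("", 0);                  // empty line
        check("the", 1);               // the word alone
        check("The end", 1);           // first word, capitalised
        check("at the", 1);            // last word
        check("other then thesis", 0); // only embedded occurrences
        check("the, the. (the)", 3);   // punctuation as separator
        check("the the the", 3);       // adjacent occurrences
    }

    public static void main(String[] args) {
        runAll();
        System.out.println("all cases passed");
    }
}
```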

Answer 5

Answered by fish

Splitting the strings sounds like a lot of overhead just to find out the number of occurrences in a file. You could use `String.indexOf(String, int)` to step repeatedly through the whole line/file, like this:

int occurrences = 0;
int index = 0;
while (index < s.length() && (index = s.indexOf("the", index)) >= 0) {
    occurrences++;
    index += 3; // skip past this match; 3 is the length of "the"
}
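Note that a plain `indexOf` loop counts every substring match, including the `the` inside `other`. One way to keep the `indexOf` approach while counting only whole words is to inspect the characters on either side of each hit. The following is a sketch under that assumption; `IndexOfCount` and `countWholeWord` are illustrative names, the check uses `Character.isLetter`, and the comparison is case-sensitive.

```java
class IndexOfCount {
    // Counts occurrences of `word` in `s` that are not directly adjacent to a letter.
    static int countWholeWord(String s, String word) {
        if (word.isEmpty()) {
            return 0; // avoid looping forever on an empty search string
        }
        int occurrences = 0;
        int index = 0;
        while ((index = s.indexOf(word, index)) >= 0) {
            int end = index + word.length();
            boolean startOk = index == 0 || !Character.isLetter(s.charAt(index - 1));
            boolean endOk = end == s.length() || !Character.isLetter(s.charAt(end));
            if (startOk && endOk) {
                occurrences++;
                index = end; // continue after the whole word
            } else {
                index++;     // embedded hit: move on by one character
            }
        }
        return occurrences;
    }

    public static void main(String[] args) {
        System.out.println(countWholeWord("the other the, then the", "the")); // prints 3
    }
}
```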

Answer 6

Answered by Fakrudeen

Search for " the " using Boyer-Moore (continuing in the remainder of the string after each hit) and count the number of occurrences?

Answer 7

Answered by narendra kumar botta

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

public class OccurenceOfWords {
    public static void main(String[] args) {
        String file = "c:\\customer1.txt"; // the backslash must be escaped in a Java string literal
        Map<String, Integer> index = new TreeMap<>();

        // try-with-resources (Java 7+) closes the reader automatically
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            String line;
            // read the next line once per iteration of the outer loop, not once per word
            while ((line = br.readLine()) != null) {
                // split on whitespace and common punctuation
                for (String token : line.split("[ \n\t\r:;',.(){}]")) {
                    String word = token.toLowerCase();
                    if (!word.isEmpty()) {
                        Integer occur = index.get(word);
                        index.put(word, occur == null ? 1 : occur + 1);
                    }
                }
            }
        } catch (IOException ex) {
            System.out.println(ex.getMessage());
        }

        // print each word with its count, in alphabetical order
        for (Map.Entry<String, Integer> entry : index.entrySet()) {
            System.out.printf("\n%10s\t%d", entry.getKey(), entry.getValue());
        }
    }
}

Answer 8

Answered by Bahaa Hany

To get the number of occurrences of a specific word, use the code below:

Pattern pattern = Pattern.compile("Thewordyouwant");
Matcher matcher = pattern.matcher(string);
int count = 0;
while (matcher.find()) {
    count++;
}
