Linux: How to find the count of multiple words in a text file?

Question

Asked by Rakesh

I am able to find the number of times a word occurs in a text file; in Linux, for example, we can use

cat filename|grep -c tom

My question is: how do I find the count of multiple words like "tom" and "joe" in a text file?

Answer 1

Accepted answer by Travis Nelson

Since you have a couple of names, regular expressions are the way to go here. At first I thought it was as simple as a grep count on the regular expression for joe or tom, but I found that this does not account for the case where tom and joe are on the same line (or tom and tom, for that matter).

test.txt:

tom is really really cool!  joe for the win!
tom is actually lame.


$ grep -c '\<\(tom\|joe\)\>' test.txt
2

As you can see from the test.txt file, 2 is the wrong answer, so we needed to account for names being on the same line.

I then used grep -o, which prints only the parts of each matching line that match the pattern, one match per output line; this gave the correct matches of tom or joe in the file. I then piped the results into wc -l to count the lines.

$ grep -o '\(joe\|tom\)' test.txt|wc -l
       3

3...the correct answer! Hope this helps
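
Note that the grep -o pattern above has no word boundaries, so it would also count the "tom" inside "tomorrow". A minimal sketch of a whole-word variant follows, assuming a grep that supports -o and -w (GNU and BSD grep do); the helper name count_words is hypothetical and not part of the original answer.

```shell
#!/bin/sh
# count_words FILE WORD... : hypothetical helper for illustration only.
# Prints the total number of whole-word occurrences of the given words in FILE.
count_words() {
    file=$1
    shift
    # Join the remaining arguments into an ERE alternation, e.g. tom|joe
    pattern=$(printf '%s|' "$@")
    pattern=${pattern%|}
    # -o: one match per output line; -w: whole words only; -E: extended regex.
    # tr strips the padding that some wc implementations add.
    grep -o -w -E "$pattern" "$file" | wc -l | tr -d ' '
}
```

Called as `count_words test.txt tom joe`, it counts every standalone tom or joe, including repeats on one line, while skipping words that merely contain them.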

Answer 2

Answered by carlpett

OK, so first split the file into words, then sort and uniq:

tr -cs '[:alnum:]' '\n' < testdata | sort | uniq -c
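
The pipeline above lists every word in the file. To keep only the words of interest, one option is to filter the one-word-per-line stream before counting. This is a sketch under the assumption that exact, case-sensitive matches are wanted; the helper name word_freq is hypothetical.

```shell
#!/bin/sh
# word_freq FILE WORD... : hypothetical helper built on the tr | sort | uniq -c idea.
# Prints "COUNT WORD" for each requested word that occurs in FILE.
word_freq() {
    file=$1
    shift
    # Join the requested words into an ERE alternation, e.g. tom|joe
    pattern=$(printf '%s|' "$@")
    pattern=${pattern%|}
    # Split into one word per line, keep only lines that match a requested
    # word exactly (-x), then count; awk removes uniq's leading padding.
    tr -cs '[:alnum:]' '\n' < "$file" |
        grep -x -E "$pattern" |
        sort |
        uniq -c |
        awk '{print $1, $2}'
}
```

Because grep -x requires the whole line to match, "tomorrow" and "atom" are dropped before the counting step.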

~~You use uniq:~~

~~sort filename | uniq -c~~

Answer 3

Answered by Fredrik Pihl

Use awk:

{for (i=1;i<=NF;i++)
    count[$i]++
}
END {
    for (i in count)
        print count[i], i
}

This will produce a complete word frequency count for the input. Pipe the output to grep to get the desired fields:

awk -f w.awk input | grep -E 'tom|joe'
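
Filtering the full table with grep -E 'tom|joe' also keeps rows for words such as "tomorrow", and fields like "cool!" are counted with their punctuation attached. A sketch that does the selection inside awk instead is shown here; the wrapper name count_selected and the awk variable words are assumptions made for illustration.

```shell
#!/bin/sh
# count_selected FILE "w1 w2 ..." : hypothetical wrapper around a small awk program.
# Prints "COUNT WORD" for each requested word, in the order given (0 if absent).
count_selected() {
    awk -v words="$2" '
        BEGIN {
            n = split(words, list, " ")          # requested words, in order
            for (k = 1; k <= n; k++) want[list[k]] = 1
        }
        {
            for (i = 1; i <= NF; i++) {
                w = $i
                gsub(/[^A-Za-z0-9]/, "", w)      # strip punctuation, e.g. "cool!" -> "cool"
                if (w in want) count[w]++
            }
        }
        END {
            for (k = 1; k <= n; k++) print count[list[k]] + 0, list[k]
        }
    ' "$1"
}
```

For example, `count_selected input "tom joe"` prints one line per requested word, which avoids the second grep pass entirely.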

BTW, you do not need cat in your example; most programs that act as filters can take the filename as a parameter, so it is better to use

grep -c tom filename

If not, there is a strong possibility that people will start throwing the Useless Use of Cat Award at you ;-)

Answer 4

Answered by phoxis

Here is one:

cat txt | tr -s '[:punct:][:space:][:blank:]'| tr '[:punct:][:space:][:blank:]' '\n\n\n' | tr -s '\n' | sort | uniq -c

UPDATE

A shell script solution:

#!/bin/bash

file_name="$2"
string="$1"

if [ $# -ne 2 ]
  then
   echo "Usage: $0 <pattern to search> <file_name>"
   exit 1
fi

if [ ! -f "$file_name" ]
 then
  echo "file \"$file_name\" does not exist, or is not a regular file"
  exit 2
fi

line_no_list=("")
curr_line_indx=1
line_no_indx=0
total_occurance=0

# line_no_list stores pairs: index k holds a line number and index k+1
# holds the number of times the string occurs on that line
while IFS= read -r line
 do
  flag=0
  while [[ "$line" == *"$string"* ]]
   do
    flag=1
    line_no_list[line_no_indx]=$curr_line_indx
    line_no_list[line_no_indx+1]=$((line_no_list[line_no_indx+1]+1))
    total_occurance=$((total_occurance+1))
# remove the first occurrence of "$string" from the line and recheck
    line=${line/"$string"/}
  done
# if we have entered the while loop then increment the
# line index to access the next array pos in the next
# iteration
  if (( flag == 1 ))
   then
    line_no_indx=$((line_no_indx+2))
  fi
  curr_line_indx=$((curr_line_indx+1))
done < "$file_name"


echo -e "\nThe string \"$string\" occurs \"$total_occurance\" times"
echo -e "The string \"$string\" occurs in \"$((line_no_indx/2))\" lines"
echo "[Occurrence # : Line Number : No. of occurrences in this line]: "

for ((i=0; i<line_no_indx; i=i+2))
 do
  echo "$((i/2+1)) : ${line_no_list[i]} : ${line_no_list[i+1]} "
done

echo
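
The per-line report that the script builds by hand can also be sketched with standard tools. This assumes a grep that supports -o together with -n (GNU and BSD grep do) and treats the search string as a fixed string, as the script does; the helper name per_line_counts is hypothetical.

```shell
#!/bin/sh
# per_line_counts STRING FILE : hypothetical helper for illustration only.
# Prints "LINE_NUMBER COUNT" for every line of FILE that contains STRING.
per_line_counts() {
    # -n prefixes each match with its line number, -o prints one match per
    # output line, -F treats STRING as a fixed string rather than a regex.
    grep -n -o -F -e "$1" "$2" |
        cut -d: -f1 |              # keep only the line number
        uniq -c |                  # count matches per line number
        awk '{print $2, $1}'       # reorder to "line count"
}
```

The number of output lines is the number of matching lines, and the sum of the second column is the total number of occurrences.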

Answer 5

Answered by Jan Hudec

The sample you gave does not search for the word "tom". It will also count "atom", "bottom", and many more.
Grep searches for regular expressions. A regular expression that matches the word "tom" or "joe" is
```
\<\(tom\|joe\)\>
```

Answer 6

Answered by Kimvais

You could use a regexp:

cat filename | tr ' ' '\n' | grep -c -e "\(joe\|tom\)"

Answer 7

Answered by Foo Bah

I completely forgot about grep -f:

cat filename | grep -c -f names

AWK solution:

Assuming the names are in a file called names:

cat filename | awk 'NR==FNR {h[NR] = $1; ct[NR] = 0; cnt=NR} NR != FNR {for(i=1;i<=cnt;++i) if(match($0,h[i])!=0) ++ct[i] } END {for(i in h) print h[i], ct[i]}' names -

Note that your original grep doesn't search for words, e.g.

$ echo tomorrow | grep -c tom
1

You need grep -w

Answer 8

Answered by hemflit

gawk -v RS='[^[:alpha:]]+' '{print}' | grep -Ec '^(tom|joe|bob|sue)$'

The gawk program sets the record separator to anything non-alphabetic, so every word ends up on a separate line. Then grep counts the lines that exactly match one of the words you want.

We use gawk because POSIX awk doesn't allow a regex record separator.

For brevity, you can replace '{print}' with 1. Either way, it's an awk program that simply prints out all input records ("Is 1 true? It is? Then do the default action, which is {print}.")

Answer 9

Answered by Jotne

To find all hits in all lines:

echo "tom is really really cool!  joe for the win!
tom is actually lame." | awk '{i+=gsub(/tom|joe/,"")} END {print i}'
3

This will count "tomtom" as 2 hits.
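
If substrings such as "tomtom" or "tomorrow" should not count, the gsub approach can be adapted to whole words. The following is a sketch in portable awk with tom and joe hard-coded as in the example above; the helper name count_hits is hypothetical, and the padding trick is one possible workaround for awk lacking a standard word-boundary operator.

```shell
#!/bin/sh
# count_hits [FILE...] : hypothetical helper for illustration only.
# Counts whole-word occurrences of tom or joe in the given files (or stdin).
count_hits() {
    awk '{
        line = " " $0 " "
        # Turn every non-word character into two spaces, so that adjacent
        # words never have to share a single separating space.
        gsub(/[^A-Za-z0-9_]/, "  ", line)
        # Each hit must now be surrounded by spaces; gsub returns the count.
        n += gsub(/ (tom|joe) /, "", line)
    } END { print n + 0 }' "$@"
}
```

With this variant "tom,joe" still counts as two hits, while "tomtom" counts as none.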
