Original URL: http://stackoverflow.com/questions/7171891/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
Linux: How do I find the count of multiple words in a text file?
Asked by Rakesh
I am able to find the number of times a word occurs in a text file; for example, in Linux we can use
cat filename|grep -c tom
My question is: how do I find the count of multiple words like "tom" and "joe" in a text file?
Accepted answer by Travis Nelson
Since you have a couple of names, regular expressions are the way to go on this one. At first I thought it was as simple as a grep count on the regular expression for joe or tom, but I found that this did not account for the scenario where tom and joe are on the same line (or tom and tom, for that matter).
test.txt:
tom is really really cool! joe for the win!
tom is actually lame.
$ grep -c '\<\(tom\|joe\)\>' test.txt
2
As you can see from the test.txt file, 2 is the wrong answer, so we needed to account for names being on the same line.
I then used grep -o, which prints only the parts of each matching line that match the pattern, one match per line. That gave the correct matches of tom or joe in the file. I then piped the results into wc -l to count the lines.
$ grep -o '\(joe\|tom\)' test.txt|wc -l
3
3...the correct answer! Hope this helps
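If a separate count per name is wanted rather than one total, the same grep -o idea can be combined with sort and uniq -c. The following is only a minimal sketch under some assumptions: the temporary file stands in for test.txt, and the -o and -w options are taken to be available (they are in GNU and BSD grep, but they are not required by POSIX).

```shell
#!/bin/sh
# Hypothetical sample file with the same contents as test.txt above.
tmpfile=$(mktemp)
printf '%s\n' 'tom is really really cool! joe for the win!' 'tom is actually lame.' > "$tmpfile"

# -o prints each match on its own line, -w restricts matches to whole words,
# and each -e adds one alternative pattern.
per_word=$(grep -ow -e tom -e joe "$tmpfile" | sort | uniq -c)
echo "$per_word"

# Total number of whole-word matches; tr strips the padding some wc versions add.
total=$(grep -ow -e tom -e joe "$tmpfile" | wc -l | tr -d ' ')
echo "total: $total"

rm -f "$tmpfile"
```

Using -w here also keeps words such as "atom" or "tomorrow" out of the count, which the plain pattern would include.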
Answer by carlpett
Ok, so first split the file into words, then sort and uniq:
tr -cs '[:alnum:]' '\n' < testdata | sort | uniq -c
You use uniq:
sort filename | uniq -c
Answer by Fredrik Pihl
Use awk:
{for (i=1;i<=NF;i++)
count[$i]++
}
END {
for (i in count)
print count[i], i
}
This will produce a complete word frequency count for the input. Pipe the output to grep to get the desired fields:
awk -f w.awk input | grep -E 'tom|joe'
BTW, you do not need cat in your example; most programs that act as filters can take the filename as a parameter. Hence it's better to use
grep -c tom filename
If not, there is a strong possibility that people will start throwing the Useless Use of Cat Award at you ;-)
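One caveat with filtering the frequency table through grep -E 'tom|joe' is that it also keeps rows for words that merely contain those names, such as "atom". A possible variation, sketched here under assumptions, does the filtering inside awk with exact comparisons. The helper name count_words and the sample text are hypothetical and not part of the answer above.

```shell
#!/bin/sh
# count_words FILE WORD...   (hypothetical helper, not from the answer above)
count_words() {
    file=$1
    shift
    # Turn every run of non-alphanumeric characters into a newline, so each
    # word is on its own line, then tally only the exact words requested.
    tr -cs '[:alnum:]' '\n' < "$file" |
    awk -v words="$*" '
        BEGIN { n = split(words, w, " "); for (i = 1; i <= n; i++) want[w[i]] = 1 }
        ($0 in want) { count[$0]++ }
        END { for (i = 1; i <= n; i++) print count[w[i]] + 0, w[i] }
    '
}

# Assumed sample input for illustration only.
sample=$(mktemp)
printf '%s\n' 'tom met an atom at the bottom.' 'joe, tom and tomtom' > "$sample"
result=$(count_words "$sample" tom joe)
echo "$result"
rm -f "$sample"
```

Words that never occur are reported with a count of 0, because the END block walks the requested list rather than the words seen.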
Answer by phoxis
Here is one:
cat txt | tr -s '[:punct:][:space:][:blank:]'| tr '[:punct:][:space:][:blank:]' '\n\n\n' | tr -s '\n' | sort | uniq -c
UPDATE
A shell script solution:
#!/bin/bash
file_name="$2"
string="$1"
if [ $# -ne 2 ]
then
echo "Usage: $0 <pattern to search> <file_name>"
exit 1
fi
if [ ! -f "$file_name" ]
then
echo "file \"$file_name\" does not exist, or is not a regular file"
exit 2
fi
line_no_list=("")
curr_line_indx=1
line_no_indx=0
total_occurance=0
# line_no_list stores pairs: index k holds a line number and index k+1
# holds the number of times the string occurs on that line
while read line
do
flag=0
while [[ "$line" == *"$string"* ]]
do
flag=1
line_no_list[line_no_indx]=$curr_line_indx
line_no_list[line_no_indx+1]=$((line_no_list[line_no_indx+1]+1))
total_occurance=$((total_occurance+1))
# remove the first occurrence of "$string" from the line and recheck
line=${line/"$string"/}
done
# if we have entered the while loop then increment the
# line index to access the next array pos in the next
# iteration
if (( flag == 1 ))
then
line_no_indx=$((line_no_indx+2))
fi
curr_line_indx=$((curr_line_indx+1))
done < "$file_name"
echo -e "\nThe string \"$string\" occurs \"$total_occurance\" times"
echo -e "The string \"$string\" occurs in \"$((line_no_indx/2))\" lines"
echo "[Occurence # : Line Number : Nos of Occurance in this line]: "
for ((i=0; i<line_no_indx; i=i+2))
do
echo "$((i/2+1)) : ${line_no_list[i]} : ${line_no_list[i+1]} "
done
echo
Answer by Jan Hudec
The sample you gave does not search for the word "tom". It will also count "atom", "bottom", and many more.
Grep searches for regular expressions. A regular expression that matches the word "tom" or "joe" is
\<\(tom\|joe\)\>
Answer by Kimvais
You could use a regexp:
cat filename |tr ' ' '\n' |grep -c -e "\(joe\|tom\)"
Answer by Foo Bah
I completely forgot about grep -f:
cat filename | grep -c -f names
AWK solution:
Assuming the names are in a file called names:
cat filename | awk 'NR==FNR {h[NR] = $1;ct[NR] = 0; cnt=NR} NR !=FNR {for(i=1;i<=cnt;++i) if(match($0,h[i])!=0) ++ct[i] } END {for(i in h) print h[i], ct[i]}' names -
Note that your original grep doesn't search for words, e.g.
$ echo tomorrow | grep -c tom
1
You need grep -w
Answer by hemflit
gawk -vRS='[^[:alpha:]]+' '{print}' | grep -Ec '^(tom|joe|bob|sue)$'
The gawk program sets the record separator to anything non-alphabetic, so every word will end up on a separate line. Then grep counts the lines that exactly match one of the words you want.
We use gawk because POSIX awk doesn't allow a regex record separator.
For brevity, you can replace '{print}' with 1 - either way, it's an Awk program that simply prints out all input records ("is 1 true? it is? then do the default action, which is {print}.")
Answer by Jotne
To find all hits in all lines:
echo "tom is really really cool! joe for the win!
tom is actually lame." | awk '{i+=gsub(/tom|joe/,"")} END {print i}'
3
This will count "tomtom" as 2 hits.