bash: Count how many times each word from a word list appears in a file?
Disclaimer: This page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original author (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/10662645/
Count how many times each word from a word list appears in a file?
Asked by Village
I have a file, list.txt, which contains a list of words. I want to check how many times each word appears in another file, file1.txt, then output the results. A simple output of all of the numbers is sufficient, as I can manually add them to list.txt with a spreadsheet program, but if the script adds the numbers at the end of each line in list.txt, that is even better, e.g.:
bear 3
fish 15
I have tried this, but it does not work:
cat list.txt | grep -c file1.txt
Answered by Todd A. Jacobs
You can do this in a loop that reads a single word at a time from a word-list file, and then counts the instances in a data file. For example:
while read; do
    echo -n "$REPLY "
    fgrep -ow "$REPLY" data.txt | wc -l
done < <(sort -u word_list.txt)
The "secret sauce" consists of:
- using the implicit REPLY variable;
- using process substitution to collect words from the word-list file; and
- ensuring that you are grepping for whole words in the data file.
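To see the effect of whole-word matching, here is a minimal sketch (the sample data.txt contents below are hypothetical, for illustration only):

```shell
# Create a small sample file (hypothetical contents)
printf 'bear fish bear\nfish fish bearskin\n' > data.txt

# -o prints each match on its own line; -w restricts matches to whole words,
# so the "bear" inside "bearskin" is not counted
grep -ow 'bear' data.txt | wc -l    # prints 2
```

Without -w the same command would print 3, because the "bear" in "bearskin" would also match.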
Answered by glenn jackman
This awk method only has to pass through each file once:
awk '
# read the words in list.txt
NR == FNR { count[$1] = 0; next }
# process file1.txt
{
    for (i = 1; i <= NF; i++)
        if ($i in count)
            count[$i]++
}
# output the results
END {
    for (word in count)
        print word, count[word]
}
' list.txt file1.txt
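As a quick sanity check, the script can be exercised with small sample files (the contents below are hypothetical):

```shell
# Hypothetical sample inputs
printf 'bear\nfish\n' > list.txt
printf 'bear fish fish\nfish bear fish\n' > file1.txt

awk '
NR == FNR { count[$1] = 0; next }
{
    for (i = 1; i <= NF; i++)
        if ($i in count)
            count[$i]++
}
END {
    for (word in count)
        print word, count[word]
}
' list.txt file1.txt | sort
```

Note that awk's `for (word in count)` iterates in unspecified order, hence the trailing sort; the expected counts here are bear 2 and fish 4.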
Answered by potong
This might work for you (GNU sed):
tr -s ' ' '\n' < file1.txt |
sort |
uniq -c |
sed -e '1i\s|.*|& 0|' -e 's/\s*\(\S*\)\s\(\S*\)\s*/s|\\<\2\\>.*|\2 \1|/' |
sed -f - list.txt
Explanation:
- Split file1.txt into words
- Sort the words
- Count the words
- Create a sed script to match the words (initially zero out each word)
- Run the above script against list.txt
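The first three stages already produce raw per-word counts; a minimal sketch of that intermediate output (hypothetical file1.txt contents):

```shell
# Hypothetical sample input
printf 'bear fish bear fish fish\n' > file1.txt

# Split into one word per line, sort, then count adjacent duplicates
tr -s ' ' '\n' < file1.txt | sort | uniq -c
```

uniq -c prefixes each line with its count (left-padded with spaces), e.g. "2 bear" and "3 fish" here; the sed stage then rewrites those count/word pairs into substitution commands.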
Answered by Sahil Singh
A single-line command:
cat file1.txt | tr " " "\n" | sort | uniq -c | sort -n -r -k 1 | grep -w -f list.txt
The last part of the command tells grep to read the words to match from a list (the -f option) and then match whole words (-w), i.e. if list.txt contains car, grep should ignore carriage.
However, keep in mind that your notion of a whole word and grep's may differ. For example, although car will not match carriage, it will match car-wash; note that "-" is treated as a word boundary. grep treats anything other than letters, numbers, and underscores as a word boundary, which should not be a problem, as this conforms to the accepted definition of a word in the English language.
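The car-wash behaviour is easy to verify directly (hypothetical sample line):

```shell
# Hypothetical sample input
printf 'car carriage car-wash\n' > file1.txt

# "car" and the "car" in "car-wash" both match ("-" is a word boundary),
# but "carriage" does not
grep -ow 'car' file1.txt | wc -l    # prints 2
```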

