bash: Count how many times each word from a word list appears in a file?
Disclaimer: This page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original author (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/10662645/
Count how many times each word from a word list appears in a file?
Asked by Village
I have a file, list.txt, which contains a list of words. I want to check how many times each word appears in another file, file1.txt, then output the results. A simple output of all of the numbers is sufficient, as I can manually add them to list.txt with a spreadsheet program, but if the script adds the numbers at the end of each line in list.txt, that is even better, e.g.:
bear 3
fish 15
I have tried this, but it does not work:
cat list.txt | grep -c file1.txt
Answered by Todd A. Jacobs
You can do this in a loop that reads a single word at a time from a word-list file, and then counts the instances in a data file. For example:
while read; do
    echo -n "$REPLY "
    fgrep -ow "$REPLY" data.txt | wc -l
done < <(sort -u word_list.txt)
The "secret sauce" consists of:
- using the implicit REPLY variable;
- using process substitution to collect words from the word-list file; and
- ensuring that you are grepping for whole words in the data file.
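To see the effect of whole-word matching, here is a minimal sketch (the sample data.txt contents below are hypothetical, for illustration only):

```shell
# Create a small sample file (hypothetical contents)
printf 'bear fish bear\nfish fish bearskin\n' > data.txt

# -o prints each match on its own line; -w restricts matches to whole words,
# so the "bear" inside "bearskin" is not counted
grep -ow 'bear' data.txt | wc -l    # prints 2
```

Without -w the same command would print 3, because the "bear" in "bearskin" would also match.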
Answered by glenn jackman
This awk method only has to pass through each file once:
awk '
# read the words in list.txt
NR == FNR { count[$1] = 0; next }
# process file1.txt
{
    for (i = 1; i <= NF; i++)
        if ($i in count)
            count[$i]++
}
# output the results
END {
    for (word in count)
        print word, count[word]
}
' list.txt file1.txt
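As a quick sanity check, the script can be exercised with small sample files (the contents below are hypothetical):

```shell
# Hypothetical sample inputs
printf 'bear\nfish\n' > list.txt
printf 'bear fish fish\nfish bear fish\n' > file1.txt

awk '
NR == FNR { count[$1] = 0; next }
{
    for (i = 1; i <= NF; i++)
        if ($i in count)
            count[$i]++
}
END {
    for (word in count)
        print word, count[word]
}
' list.txt file1.txt | sort
```

Note that awk's `for (word in count)` iterates in unspecified order, hence the trailing sort; the expected counts here are bear 2 and fish 4.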
Answered by potong
This might work for you (GNU sed):
tr -s ' ' '\n' < file1.txt |
sort |
uniq -c |
sed -e '1i\s|.*|& 0|' -e 's/\s*\(\S*\)\s\(\S*\)\s*/s|\\<\2\\>.*|\2 \1|/' |
sed -f - list.txt
Explanation:
- Split file1.txt into words
- Sort the words
- Count the words
- Create a sed script to match the words (initially zero out each word)
- Run the above script against list.txt
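The first three stages already produce raw per-word counts; a minimal sketch of that intermediate output (hypothetical file1.txt contents):

```shell
# Hypothetical sample input
printf 'bear fish bear fish fish\n' > file1.txt

# Split into one word per line, sort, then count adjacent duplicates
tr -s ' ' '\n' < file1.txt | sort | uniq -c
```

uniq -c prefixes each line with its count (left-padded with spaces), e.g. "2 bear" and "3 fish" here; the sed stage then rewrites those count/word pairs into substitution commands.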
Answered by Sahil Singh
A single-line command:
cat file1.txt | tr " " "\n" | sort | uniq -c | sort -n -r -k 1 | grep -w -f list.txt
The last part of the command tells grep to read the words to match from a list (the -f option) and then match whole words (-w), i.e. if list.txt contains car, grep should ignore carriage.
However, keep in mind that your notion of a whole word and grep's may differ. For example, although car will not match carriage, it will match car-wash; note that "-" is treated as a word boundary. grep treats anything other than letters, numbers, and underscores as a word boundary, which should not be a problem, as this conforms to the accepted definition of a word in the English language.
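The car-wash behaviour is easy to verify directly (hypothetical sample line):

```shell
# Hypothetical sample input
printf 'car carriage car-wash\n' > file1.txt

# "car" and the "car" in "car-wash" both match ("-" is a word boundary),
# but "carriage" does not
grep -ow 'car' file1.txt | wc -l    # prints 2
```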

