bash - How to create a frequency list of every word in a file?

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must keep the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/10552803/

Date: 2020-09-09 22:03:42  Source: igfitidea

How to create a frequency list of every word in a file?

bash, file-io, sed, grep

Asked by Village

I have a file like this:

This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.

I would like to generate a two-column list. The first column shows what words appear, the second column shows how often they appear, for example:

this@1
is@1
a@1
file@1
with@1
many@1
words@3
some@2
of@2
the@2
only@1
appear@2
more@1
than@1
one@1
once@1
time@1 
  • To make this work simpler, prior to processing the list, I will remove all punctuation, and change all text to lowercase letters.
  • Unless there is a simple solution around it, words and word can count as two separate words.

So far, I have this:

sed -i "s/ /\n/g" ./file1.txt # put all words on a new line
while read line
do
     count="$(grep -c $line file1.txt)"
     echo $line"@"$count >> file2.txt # add word and frequency to file
done < ./file1.txt
sort -u -d # remove duplicate lines

For some reason, this is only showing "0" after each word.

How can I generate a list of every word that appears in a file, along with frequency information?

Answer by eduffy

Not sed and grep, but tr, sort, uniq, and awk:

% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF

a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1

Answer by Bohdan

uniq -c already does what you want; just sort the input first:

echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c

output:

  6 a
  7 d
  7 s
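To get the word@count layout the question asks for, the same pipeline can be extended with a frequency sort and a small awk step (a sketch; the sample input is the same arbitrary string):

```shell
# Count words, sort by frequency (highest first), and print as word@count
echo 'a s d s d a s d s a a d d s a s d d s a' \
  | tr ' ' '\n' | sort | uniq -c | sort -nr \
  | awk '{print $2 "@" $1}'
```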

Answer by Rony

Content of the input file

$ cat inputFile.txt
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.

Using sed | sort | uniq

$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' inputFile.txt | sort | uniq -c
      1 a
      2 appear
      1 file
      1 is
      1 many
      1 more
      2 of
      1 once
      1 one
      1 only
      2 some
      1 than
      2 the
      1 this
      1 time
      1 with
      3 words

uniq -ic will count and ignore case, but the result list will have This instead of this.

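One way around that caveat (a sketch, not part of the answer itself): lowercase the whole text before counting, so uniq only ever sees one spelling. The GNU sed `\n` replacement is assumed, as in the answer above.

```shell
# Sample input, same text as the question
cat > inputFile.txt <<'EOF'
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF

# Lowercase first, then strip periods, split into words, and count
tr '[:upper:]' '[:lower:]' < inputFile.txt \
  | sed 's/\.//g;s/ /\n/g' | sort | uniq -c
```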
Answer by Jerin A Mathews

You can use tr for this; just run:

tr ' ' '\n' < NAME_OF_FILE | sort | uniq -c | sort -nr > result.txt

Sample Output for a text file of city names:

3026 Toronto
2006 Montréal
1117 Edmonton
1048 Calgary
905 Ottawa
724 Winnipeg
673 Vancouver
495 Brampton
489 Mississauga
482 London
467 Hamilton

Answer by potong

This might work for you:

tr '[:upper:]' '[:lower:]' <file |
tr -d '[:punct:]' |
tr -s ' ' '\n' | 
sort |
uniq -c |
sed 's/ *\([0-9]*\) \(.*\)/\2@\1/'

Answer by Sheharyar

Let's use AWK!

This function lists the frequency of each word occurring in the provided file in descending order:

function wordfrequency() {
  awk '
     BEGIN { FS="[^a-zA-Z]+" } {
         for (i=1; i<=NF; i++) {
             word = tolower($i)
             words[word]++
         }
     }
     END {
         for (w in words)
              printf("%3d %s\n", words[w], w)
     } ' | sort -rn
}

You can call it on your file like this:

$ cat your_file.txt | wordfrequency


Source: AWK-ward Ruby

Answer by John Red

Let's do it in Python 3!

"""Counts the frequency of each word in the given text; words are defined as
entities separated by whitespaces; punctuations and other symbols are ignored;
case-insensitive; input can be passed through stdin or through a file specified
as an argument; prints highest frequency words first"""

# Case-insensitive
# Ignore punctuations `~!@#$%^&*()_-+={}[]\|:;"'<>,.?/

import sys

# Find if input is being given through stdin or from a file
lines = None
if len(sys.argv) == 1:
    lines = sys.stdin
else:
    lines = open(sys.argv[1])

D = {}
for line in lines:
    for word in line.split():
        word = ''.join(list(filter(
            lambda ch: ch not in "`~!@#$%^&*()_-+={}[]\|:;\"'<>,.?/",
            word)))
        word = word.lower()
        if word in D:
            D[word] += 1
        else:
            D[word] = 1

for word in sorted(D, key=D.get, reverse=True):
    print(word + ' ' + str(D[word]))

Let's name this script "frequency.py" and add a line to "~/.bash_aliases":

alias freq="python3 /path/to/frequency.py"

Now, to find the word frequencies in your file "content.txt", you do:

freq content.txt

You can also pipe output to it:

cat content.txt | freq

And even analyze text from multiple files:

cat content.txt story.txt article.txt | freq


If you are using Python 2, just replace

  • ''.join(list(filter(args...))) with filter(args...)
  • python3 with python
  • print(whatever) with print whatever

Answer by Paused until further notice.

The sort requires GNU AWK (gawk). If you have another AWK without asort(), this can be easily adjusted and then piped to sort.

awk '{gsub(/\./, ""); for (i = 1; i <= NF; i++) {w = tolower($i); count[w]++; words[w] = w}} END {qty = asort(words); for (w = 1; w <= qty; w++) print words[w] "@" count[words[w]]}' inputfile

Broken out onto multiple lines:

awk '{
    gsub(/\./, ""); 
    for (i = 1; i <= NF; i++) {
        w = tolower($i); 
        count[w]++; 
        words[w] = w
    }
} 
END {
    qty = asort(words); 
    for (w = 1; w <= qty; w++)
        print words[w] "@" count[words[w]]
}' inputfile
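As noted above, on an awk without asort() the counting part stays the same and plain sort(1) can do the ordering; a sketch of that portable variant (the inputfile name is the same placeholder, filled here with the question's sample text so the snippet runs as-is):

```shell
# Sample data (the question's text), written to "inputfile"
cat > inputfile <<'EOF'
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF

# POSIX awk: strip periods, count lowercased words, print word@count;
# plain sort(1) replaces gawk's asort()
awk '{
    gsub(/\./, "")
    for (i = 1; i <= NF; i++) count[tolower($i)]++
}
END {
    for (w in count) print w "@" count[w]
}' inputfile | sort
```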

Answer by GL2014

#!/usr/bin/env bash

declare -A map
words="$1"

[[ -f $words ]] || { echo "usage: $(basename $0) wordfile"; exit 1 ;}

while read line; do
    for word in $line; do
        ((map[$word]++))
    done
done < <(cat $words)

for key in ${!map[@]}; do
    echo "the word $key appears ${map[$key]} times"
done | sort -nr -k5

Answer by Dani Konoplya

awk '
BEGIN { word[""] = 0 }
{
    for (el = 1; el <= NF; ++el) { word[$el]++ }
}
END {
    for (i in word) {
        if (i != "")
            print word[i], i
    }
}' file.txt | sort -nr
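For completeness, the bash associative-array technique from GL2014's answer can be exercised on its own; a minimal sketch with inline sample data instead of a word file:

```shell
# Count words with a bash (4+) associative array
declare -A map
while read -r line; do
    for word in $line; do
        # "|| true" guards the arithmetic exit status when a count starts at 0
        ((map[$word]++)) || true
    done
done <<'EOF'
a b b
c a
EOF

for key in "${!map[@]}"; do
    echo "the word $key appears ${map[$key]} times"
done | sort
```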