Original question: http://stackoverflow.com/questions/10552803/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverFlow
How to create a frequency list of every word in a file?
Asked by Village
I have a file like this:
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
I would like to generate a two-column list. The first column shows which words appear; the second column shows how often they appear. For example:
this@1
is@1
a@1
file@1
with@1
many@1
words@3
some@2
of@2
the@2
only@1
appear@2
more@1
than@1
one@1
once@1
time@1
- To make this work simpler, prior to processing the list, I will remove all punctuation and change all text to lowercase letters.
- Unless there is a simple solution around it, words and word can count as two separate words.
So far, I have this:
sed -i "s/ /\n/g" ./file1.txt # put all words on a new line
while read line
do
count="$(grep -c $line file1.txt)"
echo $line"@"$count >> file2.txt # add word and frequency to file
done < ./file1.txt
sort -u -d # remove duplicate lines
For some reason, this is only showing "0" after each word.
How can I generate a list of every word that appears in a file, along with frequency information?
Answered by eduffy
Not sed and grep, but tr, sort, uniq, and awk:
% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF
a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1
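The same pipeline can read from a file instead of a here-document (a minimal sketch, assuming the input is in file1.txt as in the question):

tr ' ' '\n' < file1.txt | sort | uniq -c | awk '{print $2"@"$1}'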
Answered by Bohdan
uniq -c already does what you want; just sort the input:
echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c
output:
6 a
7 d
7 s
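To list the most frequent words first, finish the pipeline with a reverse numeric sort (a sketch; sort -rn orders lines by the leading count, highest first):

echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c | sort -rn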
Answered by Rony
Content of the input file
$ cat inputFile.txt
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
Using sed | sort | uniq
$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' inputFile.txt | sort | uniq -c
1 a
2 appear
1 file
1 is
1 many
1 more
2 of
1 once
1 one
1 only
2 some
1 than
2 the
1 this
1 time
1 with
3 words
uniq -ic will count and ignore case, but the result list will have This instead of this.
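For example (a sketch, assuming the C locale so that sort groups both spellings together; uniq -i keeps the first spelling it encounters):

$ printf 'This\nthis\nthis\n' | sort | uniq -ic
      3 This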
Answered by Jerin A Mathews
You can use tr for this, just run
tr ' ' '\n' < NAME_OF_FILE | sort | uniq -c | sort -nr > result.txt
Sample Output for a text file of city names:
3026 Toronto
2006 Montréal
1117 Edmonton
1048 Calgary
905 Ottawa
724 Winnipeg
673 Vancouver
495 Brampton
489 Mississauga
482 London
467 Hamilton
Answered by potong
This might work for you:
tr '[:upper:]' '[:lower:]' <file |
tr -d '[:punct:]' |
tr -s ' ' '\n' |
sort |
uniq -c |
sed 's/ *\([0-9]*\) \(.*\)/\2@\1/'
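Run against the question's sample file, this prints each word in the asker's word@count format (a sketch of the expected output, assuming the C locale for sort):

a@1
appear@2
file@1
...
words@3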
Answered by Sheharyar
Let's use AWK!
This function lists the frequency of each word occurring in the provided file, in descending order:
function wordfrequency() {
awk '
BEGIN { FS="[^a-zA-Z]+" } {
for (i=1; i<=NF; i++) {
word = tolower($i)
words[word]++
}
}
END {
for (w in words)
printf("%3d %s\n", words[w], w)
} ' | sort -rn
}
You can call it on your file like this:
$ cat your_file.txt | wordfrequency
Source: AWK-ward Ruby
Answered by John Red
Let's do it in Python 3!
"""Counts the frequency of each word in the given text; words are defined as
entities separated by whitespaces; punctuations and other symbols are ignored;
case-insensitive; input can be passed through stdin or through a file specified
as an argument; prints highest frequency words first"""
# Case-insensitive
# Ignore punctuations `~!@#$%^&*()_-+={}[]\|:;"'<>,.?/
import sys
# Find if input is being given through stdin or from a file
lines = None
if len(sys.argv) == 1:
lines = sys.stdin
else:
lines = open(sys.argv[1])
D = {}
for line in lines:
for word in line.split():
word = ''.join(list(filter(
lambda ch: ch not in "`~!@#$%^&*()_-+={}[]\|:;\"'<>,.?/",
word)))
word = word.lower()
if word in D:
D[word] += 1
else:
D[word] = 1
for word in sorted(D, key=D.get, reverse=True):
print(word + ' ' + str(D[word]))
Let's name this script "frequency.py" and add a line to "~/.bash_aliases":
alias freq="python3 /path/to/frequency.py"
Now, to find the frequency of words in your file "content.txt", you do:
freq content.txt
You can also pipe output to it:
cat content.txt | freq
And even analyze text from multiple files:
cat content.txt story.txt article.txt | freq
If you are using Python 2, just replace:
- ''.join(list(filter(args...))) with filter(args...)
- python3 with python
- print(whatever) with print whatever
Answered by Paused until further notice.
The sort requires GNU AWK (gawk). If you have another AWK without asort(), this can be easily adjusted and then piped to sort.
awk '{gsub(/\./, ""); for (i = 1; i <= NF; i++) {w = tolower($i); count[w]++; words[w] = w}} END {qty = asort(words); for (w = 1; w <= qty; w++) print words[w] "@" count[words[w]]}' inputfile
Broken out onto multiple lines:
awk '{
gsub(/\./, "");
for (i = 1; i <= NF; i++) {
w = tolower($i);
count[w]++;
words[w] = w
}
}
END {
qty = asort(words);
for (w = 1; w <= qty; w++)
print words[w] "@" count[words[w]]
}' inputfile
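A sketch of that adjustment (an assumption, not part of the original answer): drop asort() and let an external sort order the output instead, which works in any POSIX awk:

awk '{
    gsub(/\./, "")
    for (i = 1; i <= NF; i++)
        count[tolower($i)]++
}
END {
    for (w in count)
        print w "@" count[w]
}' inputfile | sort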
Answered by GL2014
#!/usr/bin/env bash
declare -A map
words="$1"

[[ -f $words ]] || { echo "usage: $(basename "$0") wordfile"; exit 1 ;}

while read line; do
    for word in $line; do
        ((map[$word]++))
    done
done < <(cat $words)

for key in "${!map[@]}"; do
    echo "the word $key appears ${map[$key]} times"
done | sort -nr -k5

Or the same count in plain awk:

awk 'BEGIN { word[""] = 0 }
{
    for (el = 1; el <= NF; ++el) { word[$el]++ }
}
END {
    for (i in word) {
        if (i != "") {
            print word[i], i
        }
    }
}' file.txt | sort -nr
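Usage would look like this (the script name wordfreq.sh is hypothetical):

chmod +x wordfreq.sh
./wordfreq.sh file1.txt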