Original question: http://stackoverflow.com/questions/10552803/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverFlow
How to create a frequency list of every word in a file?
Asked by Village
I have a file like this:
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
I would like to generate a two-column list. The first column shows which words appear; the second column shows how often they appear. For example:
this@1
is@1
a@1
file@1
with@1
many@1
words@3
some@2
of@2
the@2
only@1
appear@2
more@1
than@1
one@1
once@1
time@1
- To make this work simpler, prior to processing the list, I will remove all punctuation and change all text to lowercase letters.
- Unless there is a simple solution around it, words and word can count as two separate words.
So far, I have this:
sed -i "s/ /\n/g" ./file1.txt # put all words on a new line
while read line
do
count="$(grep -c $line file1.txt)"
echo $line"@"$count >> file2.txt # add word and frequency to file
done < ./file1.txt
sort -u -d # remove duplicate lines
For some reason, this is only showing "0" after each word.
How can I generate a list of every word that appears in a file, along with frequency information?
Answered by eduffy
Not sed and grep, but tr, sort, uniq, and awk:
% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF
a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1
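The same pipeline can read from a file instead of a here-document (a minimal sketch, assuming the input is in file1.txt as in the question):

tr ' ' '\n' < file1.txt | sort | uniq -c | awk '{print $2"@"$1}'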
Answered by Bohdan
uniq -c already does what you want; just sort the input:
echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c
output:
6 a
7 d
7 s
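To list the most frequent words first, finish the pipeline with a reverse numeric sort (a sketch; sort -rn orders lines by the leading count, highest first):

echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c | sort -rn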
Answered by Rony
Content of the input file
$ cat inputFile.txt
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
Using sed | sort | uniq
$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' inputFile.txt | sort | uniq -c
1 a
2 appear
1 file
1 is
1 many
1 more
2 of
1 once
1 one
1 only
2 some
1 than
2 the
1 this
1 time
1 with
3 words
uniq -ic will count and ignore case, but the result list will have This instead of this.
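For example (a sketch, assuming the C locale so that sort groups both spellings together; uniq -i keeps the first spelling it encounters):

$ printf 'This\nthis\nthis\n' | sort | uniq -ic
      3 This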
Answered by Jerin A Mathews
You can use tr for this, just run
tr ' ' '\n' < NAME_OF_FILE | sort | uniq -c | sort -nr > result.txt
Sample Output for a text file of city names:
3026 Toronto
2006 Montréal
1117 Edmonton
1048 Calgary
905 Ottawa
724 Winnipeg
673 Vancouver
495 Brampton
489 Mississauga
482 London
467 Hamilton
Answered by potong
This might work for you:
tr '[:upper:]' '[:lower:]' <file |
tr -d '[:punct:]' |
tr -s ' ' '\n' |
sort |
uniq -c |
sed 's/ *\([0-9]*\) \(.*\)/\2@\1/'
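Run against the question's sample file, this prints each word in the asker's word@count format (a sketch of the expected output, assuming the C locale for sort):

a@1
appear@2
file@1
...
words@3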
Answered by Sheharyar
Let's use AWK!
This function lists the frequency of each word occurring in the provided file, in descending order:
function wordfrequency() {
awk '
BEGIN { FS="[^a-zA-Z]+" } {
for (i=1; i<=NF; i++) {
word = tolower($i)
words[word]++
}
}
END {
for (w in words)
printf("%3d %s\n", words[w], w)
} ' | sort -rn
}
You can call it on your file like this:
$ cat your_file.txt | wordfrequency
Source: AWK-ward Ruby
Answered by John Red
Let's do it in Python 3!
"""Counts the frequency of each word in the given text; words are defined as
entities separated by whitespaces; punctuations and other symbols are ignored;
case-insensitive; input can be passed through stdin or through a file specified
as an argument; prints highest frequency words first"""
# Case-insensitive
# Ignore punctuations `~!@#$%^&*()_-+={}[]\|:;"'<>,.?/
import sys
# Find if input is being given through stdin or from a file
lines = None
if len(sys.argv) == 1:
lines = sys.stdin
else:
lines = open(sys.argv[1])
D = {}
for line in lines:
for word in line.split():
word = ''.join(list(filter(
lambda ch: ch not in "`~!@#$%^&*()_-+={}[]\|:;\"'<>,.?/",
word)))
word = word.lower()
if word in D:
D[word] += 1
else:
D[word] = 1
for word in sorted(D, key=D.get, reverse=True):
print(word + ' ' + str(D[word]))
Let's name this script "frequency.py" and add a line to "~/.bash_aliases":
alias freq="python3 /path/to/frequency.py"
Now, to find the frequency of words in your file "content.txt", you do:
freq content.txt
You can also pipe output to it:
cat content.txt | freq
And even analyze text from multiple files:
cat content.txt story.txt article.txt | freq
If you are using Python 2, just replace:
- ''.join(list(filter(args...))) with filter(args...)
- python3 with python
- print(whatever) with print whatever
Answered by Paused until further notice.
The sort requires GNU AWK (gawk). If you have another AWK without asort(), this can be easily adjusted and then piped to sort.
awk '{gsub(/\./, ""); for (i = 1; i <= NF; i++) {w = tolower($i); count[w]++; words[w] = w}} END {qty = asort(words); for (w = 1; w <= qty; w++) print words[w] "@" count[words[w]]}' inputfile
Broken out onto multiple lines:
awk '{
gsub(/\./, "");
for (i = 1; i <= NF; i++) {
w = tolower($i);
count[w]++;
words[w] = w
}
}
END {
qty = asort(words);
for (w = 1; w <= qty; w++)
print words[w] "@" count[words[w]]
}' inputfile
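A sketch of that adjustment (an assumption, not part of the original answer): drop asort() and let an external sort order the output instead, which works in any POSIX awk:

awk '{
    gsub(/\./, "")
    for (i = 1; i <= NF; i++)
        count[tolower($i)]++
}
END {
    for (w in count)
        print w "@" count[w]
}' inputfile | sort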
Answered by GL2014
#!/usr/bin/env bash
declare -A map
words="$1"

[[ -f $words ]] || { echo "usage: $(basename "$0") wordfile"; exit 1 ;}

while read line; do
    for word in $line; do
        ((map[$word]++))
    done
done < <(cat $words)

for key in "${!map[@]}"; do
    echo "the word $key appears ${map[$key]} times"
done | sort -nr -k5

Or the same count in plain awk:

awk 'BEGIN { word[""] = 0 }
{
    for (el = 1; el <= NF; ++el) { word[$el]++ }
}
END {
    for (i in word) {
        if (i != "") {
            print word[i], i
        }
    }
}' file.txt | sort -nr
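Usage would look like this (the script name wordfreq.sh is hypothetical):

chmod +x wordfreq.sh
./wordfreq.sh file1.txt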