Python: counting word frequency and making a dictionary from it
Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/21852066/
Counting word frequency and making a dictionary from it
Asked by user3323103
I want to take every word from a text file, and count the word frequency in a dictionary.
Example: 'this is the textfile, and it is used to take words and count'
d = {'this': 1, 'is': 2, 'the': 1, ...}
I have not gotten very far, and I just can't see how to complete it. My code so far:
import sys
argv = sys.argv[1]
data = open(argv)
words = data.read()
data.close()
wordfreq = {}
for i in words:
    # there should be a counter and somehow it must fill the dict.
Accepted answer by Don
If you don't want to use collections.Counter, you can write your own function:
import sys

filename = sys.argv[1]
fp = open(filename)
data = fp.read()
words = data.split()
fp.close()

unwanted_chars = ".,-_"  # and so on: add any other characters to strip

wordfreq = {}
for raw_word in words:
    # strip unwanted characters from both ends of the word
    word = raw_word.strip(unwanted_chars)
    if word not in wordfreq:
        wordfreq[word] = 0
    wordfreq[word] += 1
For finer control, look at regular expressions.
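As a minimal sketch of the regular-expression route (the pattern and the helper name count_words are illustrative assumptions, not part of the original answer), re.findall can pull the words out directly, so no strip step is needed:

```python
import re

def count_words(text):
    """Count words in text (hypothetical helper, for illustration only)."""
    wordfreq = {}
    # assumed definition of a "word": a run of letters, digits or apostrophes
    for word in re.findall(r"[A-Za-z0-9']+", text):
        if word not in wordfreq:
            wordfreq[word] = 0
        wordfreq[word] += 1
    return wordfreq

freq = count_words("this is the textfile, and it is used to take words and count")
print(freq["is"], freq["textfile"])  # prints: 2 1
```

What counts as a word depends on the text, so the character class would likely need adjusting for other inputs.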
Answer by Michael
>>> from collections import Counter
>>> t = 'this is the textfile, and it is used to take words and count'
>>> dict(Counter(t.split()))
{'and': 2, 'is': 2, 'count': 1, 'used': 1, 'this': 1, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile,': 1}
Or, better, remove punctuation before counting:
>>> dict(Counter(t.replace(',', '').replace('.', '').split()))
{'and': 2, 'is': 2, 'count': 1, 'used': 1, 'this': 1, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile': 1}
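Since the result is built from a Counter, it may be worth keeping the Counter itself rather than converting it to a dict right away. A minimal sketch of two of its conveniences, most_common() and the zero default for missing keys (the variable names counts and top_two are mine):

```python
from collections import Counter

t = 'this is the textfile, and it is used to take words and count'
counts = Counter(t.replace(',', '').replace('.', '').split())

# most_common(n) returns the n most frequent (word, count) pairs
top_two = counts.most_common(2)
print(top_two)

# a Counter returns 0 for a word it has never seen, instead of raising KeyError
print(counts['missing'])  # prints: 0
```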
Answer by user1749431
The following takes the string, splits it into a list with split(), loops over the list with a for loop, and counts the frequency of each item with the list method count(). Each word, i, and its frequency are appended as a tuple to an initially empty list, ls, which is then converted into key/value pairs with dict().
sentence = 'this is the textfile, and it is used to take words and count'.split()
ls = []
for i in sentence:
    word_count = sentence.count(i)  # the list method count()
    ls.append((i, word_count))

dict_ = dict(ls)
print(dict_)
Output: {'and': 2, 'count': 1, 'used': 1, 'this': 1, 'is': 2, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile,': 1}
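Note that sentence.count(i) rescans the whole list for every item, and repeated words are counted more than once. A possible variation, offered as a sketch rather than as the author's method, counts each distinct word once by iterating over set(sentence):

```python
sentence = 'this is the textfile, and it is used to take words and count'.split()

# iterate over the distinct words only, so each word is counted exactly once
dict_ = {word: sentence.count(word) for word in set(sentence)}
print(dict_['is'], dict_['textfile,'])  # prints: 2 1
```

This is still quadratic in the worst case, so for large texts a single pass with a dict or Counter is probably the better fit.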
Answer by Grijesh Chauhan
Although using Counter from the collections library, as suggested by @Michael, is the better approach, I am adding this answer just to improve your code (I believe it will be useful for new Python learners):
From the comment in your code, it seems you want to improve your code. I think you are already able to read the file content into words (although I usually avoid the read() function and use a for line in file_descriptor: style of loop instead).
Since words is a string, in the loop for i in words: the loop variable i is not a word but a single character. You are iterating over the characters of the string words, not over the words in it. To see this, look at the following code snippet:
>>> for i in "Hi, h r u?":
...     print(i)
...
H
i
,
h
r
u
?
>>>
Because iterating over the string character by character, rather than word by word, is not what you wanted, you should use the split method of Python's string class to iterate word by word. str.split(sep=None, maxsplit=-1) returns a list of the words in the string, using sep as the separator (splitting on runs of whitespace if sep is left unspecified), optionally limiting the number of splits to maxsplit.
Consider the code examples below:
Split:
>>> "Hi, how are you?".split()
['Hi,', 'how', 'are', 'you?']
Loop with split:
>>> for i in "Hi, how are you?".split():
...     print(i)
...
Hi,
how
are
you?
This looks close to what you need, except for the word Hi,: because split() splits on whitespace by default, Hi, is kept as a single string, comma included, which is obviously not what you want when counting the frequency of words in the file.
One good solution is to use a regex, but to keep the answer simple I will first use the replace() method. str.replace(old, new[, count]) returns a copy of the string in which the occurrences of old have been replaced with new, optionally restricting the number of replacements to count.
Now check the code example below for what I am suggesting:
>>> "Hi, how are you?".split()
['Hi,', 'how', 'are', 'you?'] # it has , with Hi
>>> "Hi, how are you?".replace(',', ' ').split()
['Hi', 'how', 'are', 'you?'] # , replaced by space then split
Loop:
>>> for word in "Hi, how are you?".replace(',', ' ').split():
...     print(word)
...
Hi
how
are
you?
Now, how to count frequency:
One way is to use Counter, as @Michael suggested, but to keep your approach of starting from an empty dict, do something like this:
words = f.read()
wordfreq = {}
for word in words.replace(',', ' ').split():
    wordfreq[word] = wordfreq.setdefault(word, 0) + 1
    #                ^^ add 1 to 0 or to the old value from the dict
What am I doing here? Because wordfreq is initially empty, you cannot increment wordfreq[word] the first time a word is seen (reading the missing key raises a KeyError), so I used the dict method setdefault.
dict.setdefault(key, default=None) is similar to get(), but it also sets dict[key] = default if key is not already in the dict. So the first time a new word appears, setdefault stores it in the dict with the value 0; I then add 1 and assign the result back to the same key.
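To make the difference between get() and setdefault() concrete, here is a small sketch (the key 'hi' is just an illustrative value):

```python
wordfreq = {}

# get() only reads: the missing key is not added to the dict
print(wordfreq.get('hi', 0))  # prints: 0
print('hi' in wordfreq)       # prints: False

# setdefault() reads and, if the key is missing, also stores the default
print(wordfreq.setdefault('hi', 0))  # prints: 0
print('hi' in wordfreq)              # prints: True

# so the counting line could equally be written with get()
wordfreq['hi'] = wordfreq.get('hi', 0) + 1
print(wordfreq)  # prints: {'hi': 1}
```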
I wrote equivalent code using with open instead of a bare open:
import os

# open() does not expand '~', so expand it explicitly
with open(os.path.expanduser('~/Desktop/file')) as f:
    words = f.read()

wordfreq = {}
for word in words.replace(',', ' ').split():
    wordfreq[word] = wordfreq.setdefault(word, 0) + 1
print(wordfreq)
That runs like this:
$ cat file # file is
this is the textfile, and it is used to take words and count
$ python work.py # indented manually
{'and': 2, 'count': 1, 'used': 1, 'this': 1, 'is': 2,
'it': 1, 'to': 1, 'take': 1, 'words': 1,
'the': 1, 'textfile': 1}
Using re.split(pattern, string, maxsplit=0, flags=0)
Just change the for loop to for word in re.split(r"[,\s]+", words): and it should produce the correct output.
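One caveat worth checking with re.split: if the text ends with a separator, the result contains an empty string. A minimal sketch, using an assumed sample string rather than the file from the question:

```python
import re

words = "Hi, how are you,"

# a separator at the very end leaves an empty string in the result
parts = re.split(r"[,\s]+", words)
print(parts)  # prints: ['Hi', 'how', 'are', 'you', '']

wordfreq = {}
for word in parts:
    if not word:  # skip empty strings produced by the split
        continue
    wordfreq[word] = wordfreq.setdefault(word, 0) + 1
print(wordfreq)
```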
Edit: it is better to find all runs of alphanumeric characters, because the text may contain more than one kind of punctuation symbol.
>>> re.findall(r'[\w]+', words) # manually indent output
['this', 'is', 'the', 'textfile', 'and',
'it', 'is', 'used', 'to', 'take', 'words', 'and', 'count']
Use the for loop as: for word in re.findall(r'[\w]+', words):
How I would write the code without using read():
File is:
$ cat file
This is the text file, and it is used to take words and count. And multiple
Lines can be present in this file.
It is also possible that Same words repeated in with capital letters.
Code is:
$ cat work.py
import re

wordfreq = {}
with open('file') as f:
    for line in f:
        for word in re.findall(r'[\w]+', line.lower()):
            wordfreq[word] = wordfreq.setdefault(word, 0) + 1

print(wordfreq)
I used lower() to convert uppercase letters to lowercase.
Output:
$ python work.py # output wrapped manually
{'and': 3, 'letters': 1, 'text': 1, 'is': 3,
'it': 2, 'file': 2, 'in': 2, 'also': 1, 'same': 1,
'to': 1, 'take': 1, 'capital': 1, 'be': 1, 'used': 1,
'multiple': 1, 'that': 1, 'possible': 1, 'repeated': 1,
'words': 2, 'with': 1, 'present': 1, 'count': 1, 'this': 2,
'lines': 1, 'can': 1, 'the': 1}
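As a sketch of how this line-by-line loop could be combined with the collections.Counter approach from the other answer (io.StringIO stands in for the real file here, an assumption made only to keep the example self-contained):

```python
import io
import re
from collections import Counter

# io.StringIO stands in for the opened file so the sketch runs on its own
f = io.StringIO(
    "This is the text file, and it is used to take words and count. And multiple\n"
    "Lines can be present in this file.\n"
    "It is also possible that Same words repeated in with capital letters.\n"
)

wordfreq = Counter()
for line in f:
    # update() adds the counts of every word found on this line
    wordfreq.update(re.findall(r'\w+', line.lower()))

print(wordfreq['and'], wordfreq['file'], wordfreq['words'])  # prints: 3 2 2
```

With a real file, the StringIO object would simply be replaced by with open('file') as f:.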
Answer by Fuji Komalan
sentence = "this is the textfile, and it is used to take words and count"

# split the sentence into words and iterate through every word
counter_dict = {}
for word in sentence.lower().split():
    # add the word to counter_dict, initialized with 0
    if word not in counter_dict:
        counter_dict[word] = 0
    # increase its count by 1
    counter_dict[word] += 1
Answer by Rajeev Sharma
# Open your text file and count word frequency
with open("Counter.txt", 'r') as file_obj:
    w_list = file_obj.read()
print(w_list.split())

di = dict()
for word in w_list.split():
    if word in di:
        di[word] = di[word] + 1
    else:
        di[word] = 1

max_count = max(di.values())
largest = -1
maxusedword = ''
# find the most used word and its count
for k, v in di.items():
    print(k, v)
    if v > largest:
        largest = v
        maxusedword = k
print(maxusedword, largest)
Answer by AnitaAgrawal
My approach is to do a few things from the ground up:
- Remove punctuation from the text input.
- Make a list of words.
- Remove empty strings.
- Iterate through the list.
- Make each new word a key in the dictionary with value 1.
- If a word already exists as a key, increment its value by one.
text = '''this is the textfile, and it is used to take words and count'''
word = ''      # holds the word currently being built
wordList = []  # collection of words

for ch in text:  # traverse the text character by character
    # a character in a-z, A-Z or 0-9 is valid and is added to the current word
    if ('a' <= ch <= 'z') or ('A' <= ch <= 'Z') or ('0' <= ch <= '9'):
        word += ch
    elif ch == ' ':  # a single space is a separator
        wordList.append(word)  # append the finished word to the list
        word = ''              # reset to collect the next word

wordList.append(word)  # append the last word, since the loop ends before adding it
print(wordList)

wordCountDict = {}  # empty dictionary that will hold the word counts
for word in wordList:  # traverse the word list
    if wordCountDict.get(word.lower(), 0) == 0:  # new word: add it with value 1
        wordCountDict[word.lower()] = 1
    else:  # existing word: increment its value by one
        wordCountDict[word.lower()] = wordCountDict[word.lower()] + 1
print(wordCountDict)
Another approach:
text = '''this is the textfile, and it is used to take words and count'''
for ch in '.\'!")(,;:?-\n':
    text = text.replace(ch, ' ')
wordsArray = text.split(' ')

wordDict = {}
for word in wordsArray:
    if len(word) == 0:
        continue
    else:
        wordDict[word.lower()] = wordDict.get(word.lower(), 0) + 1
print(wordDict)
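A possible variation on this approach, assuming Python 3, replaces all ASCII punctuation in one pass with str.translate. str.maketrans and string.punctuation are standard library, but their use here is my substitution, not the author's code, and the names table and cleaned are made up for the sketch:

```python
import string

text = '''this is the textfile, and it is used to take words and count'''

# map every ASCII punctuation character to a space in a single pass
table = str.maketrans({ch: ' ' for ch in string.punctuation})
cleaned = text.translate(table)

wordDict = {}
# split() with no argument drops empty strings, so no length check is needed
for word in cleaned.split():
    wordDict[word.lower()] = wordDict.get(word.lower(), 0) + 1
print(wordDict['textfile'], wordDict['and'])  # prints: 1 2
```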
Answer by Rangita R
You can also use a defaultdict with int as the default type.
from collections import defaultdict

wordDict = defaultdict(int)
text = 'this is the textfile, and it is used to take words and count'.split(" ")
for word in text:
    wordDict[word] += 1
Explanation: we initialize a defaultdict whose values are of type int, so the default value for any missing key is 0 and we do not need to check whether a key is already present in the dictionary. We then split the text on spaces into a list of words, iterate through the list, and increment each word's count.
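A small sketch of the behaviour described above. The key name 'never_seen' is made up for illustration; note the side effect that merely reading a missing key also inserts it:

```python
from collections import defaultdict

wordDict = defaultdict(int)
print(wordDict['never_seen'])    # prints: 0
# reading a missing key inserts it with the default value
print('never_seen' in wordDict)  # prints: True

counts = defaultdict(int)
for word in 'this is the textfile, and it is used to take words and count'.split():
    counts[word] += 1
print(counts['is'])  # prints: 2
```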
Answer by rajadhiraja
wordList = 'this is the textfile, and it is used to take words and count'.split()
wordFreq = {}

# Logic: if the word is not in the dict, give it a value of 1; if the key is already present, add 1.
for word in wordList:
    if word not in wordFreq:
        wordFreq[word] = 1
    else:
        wordFreq[word] += 1

print(wordFreq)

