Efficiently count word frequencies in Python
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/35857519/
Asked by rkjt50r983
I'd like to count frequencies of all words in a text file.
>>> countInFile('test.txt')
should return {'aaa': 1, 'bbb': 2, 'ccc': 1} if the target text file is like:
# test.txt
aaa bbb ccc
bbb
I've implemented it with pure Python, following some posts. However, I've found pure-Python approaches are insufficient for the huge file size (> 1GB).
I think borrowing sklearn's power is a candidate.
If you let CountVectorizer count frequencies for each line, I guess you'd get word frequencies by summing up each column. But that sounds like a somewhat indirect way.
What is the most efficient and straightforward way to count words in a file with Python?
Update
My (very slow) code is here:
from collections import Counter
import string

def get_term_frequency_in_file(source_file_path):
    wordcount = {}
    with open(source_file_path) as f:
        for line in f:
            line = line.lower().translate(None, string.punctuation)
            this_wordcount = Counter(line.split())
            wordcount = add_merge_two_dict(wordcount, this_wordcount)
    return wordcount

def add_merge_two_dict(x, y):
    return {k: x.get(k, 0) + y.get(k, 0) for k in set(x) | set(y)}
Answered by ShadowRanger
The most succinct approach is to use the tools Python gives you.
from future_builtins import map  # Only on Python 2
from collections import Counter
from itertools import chain

def countInFile(filename):
    with open(filename) as f:
        return Counter(chain.from_iterable(map(str.split, f)))
That's it. map(str.split, f) is making a generator that returns lists of words from each line. Wrapping that in chain.from_iterable converts it to a single generator that produces a word at a time. Counter takes an input iterable and counts all unique values in it. At the end, you return a dict-like object (a Counter) that stores all unique words and their counts, and during creation, you only store a line of data at a time and the running counts, not the whole file at once.
In theory, on Python 2.7 and 3.1, you might do slightly better looping over the chained results yourself and using a dict or collections.defaultdict(int) to count (because Counter is implemented in Python, which can make it slower in some cases), but letting Counter do the work is simpler and more self-documenting (I mean, the whole goal is counting, so use a Counter). Beyond that, on CPython (the reference interpreter) 3.2 and higher, Counter has a C-level accelerator for counting iterable inputs that will run faster than anything you could write in pure Python.
Update: You seem to want punctuation stripped and case-insensitivity, so here's a variant of my earlier code that does that:
from string import punctuation

def countInFile(filename):
    with open(filename) as f:
        linewords = (line.translate(None, punctuation).lower().split() for line in f)
        return Counter(chain.from_iterable(linewords))
Your code runs much more slowly because it's creating and destroying many small Counter and set objects, rather than .update-ing a single Counter once per line (which, while slightly slower than what I gave in the updated code block, would be at least algorithmically similar in scaling factor).
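A minimal sketch of that single-Counter, update-per-line approach (using io.StringIO to stand in for an opened text file):

```python
from collections import Counter
from io import StringIO

def count_in_file(f):
    # One Counter for the whole file, updated in place once per line,
    # instead of allocating a fresh Counter and set per line.
    wordcount = Counter()
    for line in f:
        wordcount.update(line.lower().split())
    return wordcount

# StringIO stands in for an opened text file here
fake_file = StringIO("aaa bbb ccc\nbbb\n")
print(count_in_file(fake_file))  # Counter({'bbb': 2, 'aaa': 1, 'ccc': 1})
```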
Answered by alvas
A memory-efficient and accurate way is to make use of:

- CountVectorizer in scikit (for ngram extraction)
- NLTK for word_tokenize
- numpy matrix sum to collect the counts
- collections.Counter for collecting the counts and vocabulary
An example:
import urllib.request
from collections import Counter
import numpy as np
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
# Our sample textfile.
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')
# Note that `ngram_range=(1, 1)` means we want to extract Unigrams, i.e. tokens.
ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
# X matrix where the row represents sentences and column is our one-hot vector for each token in our vocabulary
X = ngram_vectorizer.fit_transform(data.split('\n'))
# Vocabulary
vocab = list(ngram_vectorizer.get_feature_names())
# Column-wise sum of the X matrix.
# It's some crazy numpy syntax that looks horribly unpythonic
# For details, see http://stackoverflow.com/questions/3337301/numpy-matrix-to-array
# and http://stackoverflow.com/questions/13567345/how-to-calculate-the-sum-of-all-columns-of-a-2d-numpy-array-efficiently
counts = X.sum(axis=0).A1
freq_distribution = Counter(dict(zip(vocab, counts)))
print (freq_distribution.most_common(10))
[out]:
[(',', 32000),
('.', 17783),
('de', 11225),
('a', 7197),
('que', 5710),
('la', 4732),
('je', 4304),
('se', 4013),
('на', 3978),
('na', 3834)]
Essentially, you can also do this:
from collections import Counter
import numpy as np
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
def freq_dist(data):
    """
    :param data: A string with sentences separated by '\n'
    :type data: str
    """
    ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
    X = ngram_vectorizer.fit_transform(data.split('\n'))
    vocab = list(ngram_vectorizer.get_feature_names())
    counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))
Let's timeit:
import time
start = time.time()
word_distribution = freq_dist(data)
print (time.time() - start)
[out]:
5.257147789001465
Note that CountVectorizer can also take a file instead of a string, so there's no need to read the whole file into memory. In code:
import io
from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
infile = '/path/to/input.txt'
ngram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)
with io.open(infile, 'r', encoding='utf8') as fin:
    X = ngram_vectorizer.fit_transform(fin)
vocab = ngram_vectorizer.get_feature_names()
counts = X.sum(axis=0).A1
freq_distribution = Counter(dict(zip(vocab, counts)))
print (freq_distribution.most_common(10))
Answered by nat gillin
Here are some benchmarks. They may look strange, but the crudest code wins.
[code]:
from collections import Counter, defaultdict
import io, time
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

infile = '/path/to/file'

def extract_dictionary_sklearn(file_path):
    with io.open(file_path, 'r', encoding='utf8') as fin:
        ngram_vectorizer = CountVectorizer(analyzer='word')
        X = ngram_vectorizer.fit_transform(fin)
        vocab = ngram_vectorizer.get_feature_names()
        counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))

def extract_dictionary_native(file_path):
    dictionary = Counter()
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            dictionary.update(line.split())
    return dictionary

def extract_dictionary_paddle(file_path):
    dictionary = defaultdict(int)
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            for word in line.split():  # fixed: loop variable was misspelled as "words"
                dictionary[word] += 1
    return dictionary

start = time.time()
extract_dictionary_sklearn(infile)
print time.time() - start

start = time.time()
extract_dictionary_native(infile)
print time.time() - start

start = time.time()
extract_dictionary_paddle(infile)
print time.time() - start
[out]:
38.306814909
24.8241138458
12.1182529926
Data size (154MB) used in the benchmark above:
$ wc -c /path/to/file
161680851
$ wc -l /path/to/file
2176141
Some things to note:
- With the sklearn version, there's the overhead of vectorizer creation + numpy manipulation and conversion into a Counter object
- With the native Counter update version, it seems like Counter.update() is an expensive operation
Answered by Goodies
This should suffice.
def countinfile(filename):
    d = {}
    with open(filename, "r") as fin:
        for line in fin:
            words = line.strip().split()
            for word in words:
                try:
                    d[word] += 1
                except KeyError:
                    d[word] = 1
    return d
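The try/except bookkeeping above can also be expressed with collections.defaultdict; a small sketch (not from the original answer, the helper name is illustrative):

```python
from collections import defaultdict
from io import StringIO

def countinfile_dd(f):
    # defaultdict(int) supplies the 0 default, so no KeyError handling is needed
    d = defaultdict(int)
    for line in f:
        for word in line.strip().split():
            d[word] += 1
    return dict(d)

print(countinfile_dd(StringIO("aaa bbb ccc\nbbb\n")))  # {'aaa': 1, 'bbb': 2, 'ccc': 1}
```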
Answered by Nizam Mohamed
Instead of decoding the whole bytes read from the url, I process the binary data. Because bytes.translate expects its second argument to be a byte string, I utf-8 encode punctuation. After removing punctuation, I utf-8 decode the byte string.
The function freq_dist expects an iterable, which is why I've passed data.splitlines().
from urllib2 import urlopen
from collections import Counter
from string import punctuation
from time import time
import sys
from pprint import pprint
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
data = urlopen(url).read()
def freq_dist(data):
    """
    :param data: file-like object opened in binary mode or
                 sequence of byte strings separated by '\n'
    :type data: an iterable sequence
    """
    # For readability:
    # return Counter(word for line in data
    #                for word in line.translate(
    #                    None, bytes(punctuation.encode('utf-8'))).decode('utf-8').split())
    punc = punctuation.encode('utf-8')
    words = (word for line in data for word in line.translate(None, punc).decode('utf-8').split())
    return Counter(words)
start = time()
word_dist = freq_dist(data.splitlines())
print('elapsed: {}'.format(time() - start))
pprint(word_dist.most_common(10))
Output:
elapsed: 0.806480884552
[(u'de', 11106),
(u'a', 6742),
(u'que', 5701),
(u'la', 4319),
(u'je', 4260),
(u'se', 3938),
(u'\u043d\u0430', 3929),
(u'na', 3623),
(u'da', 3534),
(u'i', 3487)]
It seems a plain dict is more efficient than a Counter object.
def freq_dist(data):
    """
    :param data: A string with sentences separated by '\n'
    :type data: str
    """
    d = {}
    punc = punctuation.encode('utf-8')
    words = (word for line in data for word in line.translate(None, punc).decode('utf-8').split())
    for word in words:
        d[word] = d.get(word, 0) + 1
    return d
start = time()
word_dist = freq_dist(data.splitlines())
print('elapsed: {}'.format(time() - start))
pprint(sorted(word_dist.items(), key=lambda x: (x[1], x[0]), reverse=True)[:10])
Output:
elapsed: 0.642680168152
[(u'de', 11106),
(u'a', 6742),
(u'que', 5701),
(u'la', 4319),
(u'je', 4260),
(u'se', 3938),
(u'\u043d\u0430', 3929),
(u'na', 3623),
(u'da', 3534),
(u'i', 3487)]
To be more memory efficient when processing a huge file, pass just the opened url. But then the timing will include the file download time too.
data = urlopen(url)
word_dist = freq_dist(data)
Answered by Stephen Grimes
Skip CountVectorizer and scikit-learn.
The file may be too large to load into memory, but I doubt the Python dictionary gets too large. The easiest option for you may be to split the large file into 10-20 smaller files and extend your code to loop over those smaller files.
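As a rough sketch of that loop-over-pieces idea (streaming the file instead of physically splitting it), one could read a chunk of lines at a time and merge the partial counts; the helper name and `lines_per_chunk` value here are illustrative, not from the answer:

```python
from collections import Counter
from itertools import islice

def count_in_chunks(path, lines_per_chunk=100000):
    # Read the file one chunk of lines at a time and fold each
    # chunk's counts into a single running Counter, so only one
    # chunk of lines is in memory at once.
    total = Counter()
    with open(path) as f:
        while True:
            chunk = list(islice(f, lines_per_chunk))
            if not chunk:
                break
            total.update(word for line in chunk for word in line.split())
    return total
```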
Answered by Murtadha Alrahbi
You can try with sklearn:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
data = ['i am student', 'the student suffers a lot']
transformed_data = vectorizer.fit_transform(data)
vocab = {a: b for a, b in zip(vectorizer.get_feature_names(), np.ravel(transformed_data.sum(axis=0)))}
print(vocab)
Answered by Pradeep Singh
Combining everyone else's views and some of my own :) Here is what I have for you.
from collections import Counter
from nltk.tokenize import RegexpTokenizer, word_tokenize
from nltk.corpus import stopwords

text = '''Note that if you use RegexpTokenizer option, you lose
natural language features special to word_tokenize
like splitting apart contractions. You can naively
split on the regex \w+ without any need for the NLTK.
'''

# tokenize
raw = ' '.join(word_tokenize(text.lower()))
tokenizer = RegexpTokenizer(r'[A-Za-z]{2,}')
words = tokenizer.tokenize(raw)

# remove stopwords
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]

# count word frequency, sort and return just 20
counter = Counter()
counter.update(words)
most_common = counter.most_common(20)
most_common
Output
(All ones)
[('note', 1),
('use', 1),
('regexptokenizer', 1),
('option', 1),
('lose', 1),
('natural', 1),
('language', 1),
('features', 1),
('special', 1),
('word', 1),
('tokenize', 1),
('like', 1),
('splitting', 1),
('apart', 1),
('contractions', 1),
('naively', 1),
('split', 1),
('regex', 1),
('without', 1),
('need', 1)]
One can do better than this in terms of efficiency, but if you are not too worried about that, this code is the best.
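For reference, the naive `\w+` regex split that the sample text mentions (no NLTK at all) might look like this sketch; the sample sentence is illustrative:

```python
import re
from collections import Counter

text = "You can naively split on the regex, without any need for the NLTK."

# \w+ matches runs of word characters; lowercasing first folds case together
words = re.findall(r'\w+', text.lower())
counts = Counter(words)
print(counts['the'])  # 2
```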

