Efficiently count word frequencies in Python

Disclaimer: this page is a translated mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, link to the original, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/35857519/


Efficiently count word frequencies in python

python, nlp, scikit-learn, word-count, frequency-distribution

Asked by rkjt50r983

I'd like to count frequencies of all words in a text file.


>>> countInFile('test.txt')

should return {'aaa': 1, 'bbb': 2, 'ccc': 1} if the target text file is like:


# test.txt
aaa bbb ccc
bbb

I've implemented it in pure Python following some posts. However, I've found that pure-Python approaches are too slow for a huge file (> 1 GB).


I think borrowing sklearn's power is a candidate.


If you let CountVectorizer count frequencies for each line, I guess you could get word frequencies by summing up each column. But that sounds like a somewhat indirect way.


What is the most efficient and straightforward way to count words in a file with python?


Update


My (very slow) code is here:


import string
from collections import Counter

def get_term_frequency_in_file(source_file_path):
    wordcount = {}
    with open(source_file_path) as f:
        for line in f:
            # lowercase, then strip punctuation (Python 2 str.translate signature)
            line = line.lower().translate(None, string.punctuation)
            this_wordcount = Counter(line.split())
            wordcount = add_merge_two_dict(wordcount, this_wordcount)
    return wordcount

def add_merge_two_dict(x, y):
    return { k: x.get(k, 0) + y.get(k, 0) for k in set(x) | set(y) }

Answered by ShadowRanger

The most succinct approach is to use the tools Python gives you.


from future_builtins import map  # Only on Python 2

from collections import Counter
from itertools import chain

def countInFile(filename):
    with open(filename) as f:
        return Counter(chain.from_iterable(map(str.split, f)))

That's it. map(str.split, f) is making a generator that returns lists of words from each line. Wrapping it in chain.from_iterable converts that to a single generator that produces a word at a time. Counter takes an input iterable and counts all unique values in it. At the end, you return a dict-like object (a Counter) that stores all unique words and their counts, and during creation, you only store a line of data at a time and the total counts, not the whole file at once.

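For instance, with the test.txt from the question, a call would look roughly like this (illustrative sketch, not captured output):

>>> countInFile('test.txt')
Counter({'bbb': 2, 'aaa': 1, 'ccc': 1})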

In theory, on Python 2.7 and 3.1, you might do slightly better looping over the chained results yourself and using a dict or collections.defaultdict(int) to count (because Counter is implemented in Python, which can make it slower in some cases), but letting Counter do the work is simpler and more self-documenting (I mean, the whole goal is counting, so use a Counter). Beyond that, on CPython (the reference interpreter) 3.2 and higher, Counter has a C-level accelerator for counting iterable inputs that will run faster than anything you could write in pure Python.

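For reference, a minimal sketch of that manual-loop alternative using collections.defaultdict(int) (the function name is made up for illustration):

from collections import defaultdict
from itertools import chain

def count_in_file_defaultdict(filename):
    counts = defaultdict(int)  # missing keys start at 0
    with open(filename) as f:
        for word in chain.from_iterable(map(str.split, f)):
            counts[word] += 1
    return counts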

Update: You seem to want punctuation stripped and case-insensitivity, so here's a variant of my earlier code that does that:


from collections import Counter
from itertools import chain
from string import punctuation

def countInFile(filename):
    with open(filename) as f:
        # Python 2 str.translate: a None table deletes the given punctuation characters
        linewords = (line.translate(None, punctuation).lower().split() for line in f)
        return Counter(chain.from_iterable(linewords))
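
Note that line.translate(None, punctuation) is the Python 2 form of str.translate; a rough Python 3 sketch of the same idea (assuming a UTF-8 text file) would be:

from collections import Counter
from itertools import chain
from string import punctuation

def countInFile(filename):
    # str.maketrans('', '', punctuation) builds a table that deletes punctuation characters
    delete_punct = str.maketrans('', '', punctuation)
    with open(filename, encoding='utf-8') as f:
        linewords = (line.translate(delete_punct).lower().split() for line in f)
        return Counter(chain.from_iterable(linewords))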

Your code runs much more slowly because it's creating and destroying many small Counter and set objects, rather than .update-ing a single Counter once per line (which, while slightly slower than what I gave in the updated code block, would be at least algorithmically similar in scaling factor).

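A minimal sketch of that per-line .update pattern (keeping the OP's lower-casing but leaving out punctuation stripping for brevity):

from collections import Counter

def count_in_file_update(filename):
    wordcount = Counter()
    with open(filename) as f:
        for line in f:
            # one in-place update per line instead of building and merging new dicts
            wordcount.update(line.lower().split())
    return wordcount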

Answered by alvas

A memory-efficient and accurate way is to make use of:


  • CountVectorizer in scikit (for ngram extraction)
  • NLTK for word_tokenize
  • numpy matrix sum to collect the counts
  • collections.Counter for collecting the counts and vocabulary

An example:


import urllib.request
from collections import Counter

import numpy as np 

from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# Our sample textfile.
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')


# Note that `ngram_range=(1, 1)` means we want to extract Unigrams, i.e. tokens.
ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
# X matrix where the row represents sentences and column is our one-hot vector for each token in our vocabulary
X = ngram_vectorizer.fit_transform(data.split('\n'))

# Vocabulary
vocab = list(ngram_vectorizer.get_feature_names())

# Column-wise sum of the X matrix.
# It's some crazy numpy syntax that looks horribly unpythonic
# For details, see http://stackoverflow.com/questions/3337301/numpy-matrix-to-array
# and http://stackoverflow.com/questions/13567345/how-to-calculate-the-sum-of-all-columns-of-a-2d-numpy-array-efficiently
counts = X.sum(axis=0).A1

freq_distribution = Counter(dict(zip(vocab, counts)))
print (freq_distribution.most_common(10))

[out]:


[(',', 32000),
 ('.', 17783),
 ('de', 11225),
 ('a', 7197),
 ('que', 5710),
 ('la', 4732),
 ('je', 4304),
 ('se', 4013),
 ('на', 3978),
 ('na', 3834)]

Essentially, you can also do this:


from collections import Counter
import numpy as np 
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

def freq_dist(data):
    """
    :param data: A string with sentences separated by '\n'
    :type data: str
    """
    ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
    X = ngram_vectorizer.fit_transform(data.split('\n'))
    vocab = list(ngram_vectorizer.get_feature_names())
    counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))

Let's timeit:


import time

start = time.time()
word_distribution = freq_dist(data)
print (time.time() - start)

[out]:


5.257147789001465

Note that CountVectorizer can also take a file instead of a string, and there's no need to read the whole file into memory. In code:


import io
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

infile = '/path/to/input.txt'

ngram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)

with io.open(infile, 'r', encoding='utf8') as fin:
    X = ngram_vectorizer.fit_transform(fin)
    vocab = ngram_vectorizer.get_feature_names()
    counts = X.sum(axis=0).A1
    freq_distribution = Counter(dict(zip(vocab, counts)))
    print (freq_distribution.most_common(10))

Answered by nat gillin

Here's a benchmark. It may look strange, but the crudest code wins.


[code]:


from collections import Counter, defaultdict
import io, time

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

infile = '/path/to/file'

def extract_dictionary_sklearn(file_path):
    with io.open(file_path, 'r', encoding='utf8') as fin:
        ngram_vectorizer = CountVectorizer(analyzer='word')
        X = ngram_vectorizer.fit_transform(fin)
        vocab = ngram_vectorizer.get_feature_names()
        counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))

def extract_dictionary_native(file_path):
    dictionary = Counter()
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            dictionary.update(line.split())
    return dictionary

def extract_dictionary_paddle(file_path):
    dictionary = defaultdict(int)
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            for word in line.split():
                dictionary[word] += 1
    return dictionary

start = time.time()
extract_dictionary_sklearn(infile)
print(time.time() - start)

start = time.time()
extract_dictionary_native(infile)
print(time.time() - start)

start = time.time()
extract_dictionary_paddle(infile)
print(time.time() - start)

[out]:


38.306814909
24.8241138458
12.1182529926

Data size (154MB) used in the benchmark above:


$ wc -c /path/to/file
161680851

$ wc -l /path/to/file
2176141

Some things to note:


  • With the sklearn version, there's an overhead of vectorizer creation + numpy manipulation and conversion into a Counter object
  • With the native Counter update version, it seems like Counter.update() is an expensive operation

Answered by Goodies

This should suffice.


def countinfile(filename):
    d = {}
    with open(filename, "r") as fin:
        for line in fin:
            words = line.strip().split()
            for word in words:
                try:
                    d[word] += 1
                except KeyError:
                    d[word] = 1
    return d

Answered by Nizam Mohamed

Instead of decoding the whole byte string read from the URL, I process the binary data. Because bytes.translate expects its second argument to be a byte string, I UTF-8 encode punctuation. After removing the punctuation, I UTF-8 decode the byte string.


The function freq_dist expects an iterable. That's why I've passed data.splitlines().


from urllib2 import urlopen
from collections import Counter
from string import punctuation
from time import time
import sys
from pprint import pprint

url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'

data = urlopen(url).read()

def freq_dist(data):
    """
    :param data: file-like object opened in binary mode or
                 sequence of byte strings separated by '\n'
    :type data: an iterable sequence
    """
    #For readability   
    #return Counter(word for line in data
    #    for word in line.translate(
    #    None,bytes(punctuation.encode('utf-8'))).decode('utf-8').split())

    punc = punctuation.encode('utf-8')
    words = (word for line in data for word in line.translate(None, punc).decode('utf-8').split())
    return Counter(words)


start = time()
word_dist = freq_dist(data.splitlines())
print('elapsed: {}'.format(time() - start))
pprint(word_dist.most_common(10))

Output:


elapsed: 0.806480884552

[(u'de', 11106),
 (u'a', 6742),
 (u'que', 5701),
 (u'la', 4319),
 (u'je', 4260),
 (u'se', 3938),
 (u'\u043d\u0430', 3929),
 (u'na', 3623),
 (u'da', 3534),
 (u'i', 3487)]

It seems a plain dict is more efficient than a Counter object.


def freq_dist(data):
    """
    :param data: A string with sentences separated by '\n'
    :type data: str
    """
    d = {}
    punc = punctuation.encode('utf-8')
    words = (word for line in data for word in line.translate(None, punc).decode('utf-8').split())
    for word in words:
        d[word] = d.get(word, 0) + 1
    return d

start = time()
word_dist = freq_dist(data.splitlines())
print('elapsed: {}'.format(time() - start))
pprint(sorted(word_dist.items(), key=lambda x: (x[1], x[0]), reverse=True)[:10])

Output:


elapsed: 0.642680168152

[(u'de', 11106),
 (u'a', 6742),
 (u'que', 5701),
 (u'la', 4319),
 (u'je', 4260),
 (u'se', 3938),
 (u'\u043d\u0430', 3929),
 (u'na', 3623),
 (u'da', 3534),
 (u'i', 3487)]

To be more memory efficient when processing a huge file, you can pass just the opened URL. But the timing will then include the file download time too.


data = urlopen(url)
word_dist = freq_dist(data)

Answered by Stephen Grimes

Skip CountVectorizer and scikit-learn.


The file may be too large to load into memory, but I doubt the Python dictionary gets too large. The easiest option for you may be to split the large file into 10-20 smaller files and extend your code to loop over the smaller files.

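One way to approximate this without actually splitting the file on disk is to read a bounded number of lines at a time and merge the counts. A rough sketch, assuming a plain UTF-8 text file and a made-up helper name:

from collections import Counter
from itertools import islice

def count_in_chunks(path, lines_per_chunk=100000):
    total = Counter()
    with open(path, encoding='utf-8') as f:
        while True:
            chunk = list(islice(f, lines_per_chunk))  # read at most lines_per_chunk lines
            if not chunk:
                break
            total.update(word for line in chunk for word in line.split())
    return total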

Answered by Murtadha Alrahbi

You can try with sklearn:


import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

data = ['i am student', 'the student suffers a lot']
transformed_data = vectorizer.fit_transform(data)
vocab = {a: b for a, b in zip(vectorizer.get_feature_names(), np.ravel(transformed_data.sum(axis=0)))}
print(vocab)
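
If I'm not mistaken, scikit-learn's default token pattern drops single-character tokens such as 'i' and 'a', so this should print something like {'am': 1, 'lot': 1, 'student': 2, 'suffers': 1, 'the': 1}.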

Answered by Pradeep Singh

Combining everyone else's views and some of my own :) Here is what I have for you:


from collections import Counter
from nltk.tokenize import RegexpTokenizer, word_tokenize
from nltk.corpus import stopwords

text='''Note that if you use RegexpTokenizer option, you lose 
natural language features special to word_tokenize 
like splitting apart contractions. You can naively 
split on the regex \w+ without any need for the NLTK.
'''

# tokenize
raw = ' '.join(word_tokenize(text.lower()))

tokenizer = RegexpTokenizer(r'[A-Za-z]{2,}')
words = tokenizer.tokenize(raw)

# remove stopwords
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]

# count word frequency, sort and return just 20
counter = Counter()
counter.update(words)
most_common = counter.most_common(20)
most_common

Output


(All ones)


[('note', 1),
 ('use', 1),
 ('regexptokenizer', 1),
 ('option', 1),
 ('lose', 1),
 ('natural', 1),
 ('language', 1),
 ('features', 1),
 ('special', 1),
 ('word', 1),
 ('tokenize', 1),
 ('like', 1),
 ('splitting', 1),
 ('apart', 1),
 ('contractions', 1),
 ('naively', 1),
 ('split', 1),
 ('regex', 1),
 ('without', 1),
 ('need', 1)]

One can do better than this in terms of efficiency, but if you are not too worried about it, this code does the job.
