Python text processing: NLTK and pandas

Note: this page is a translation of a popular StackOverflow question. It is provided under the CC BY-SA 4.0 license; you are free to use or share it, but you must attribute it to the original authors (not me): StackOverflow.
Original source: http://stackoverflow.com/questions/34784004/
Asked by IVR
I'm looking for an effective way to construct a Term Document Matrix in Python that can be used together with extra data.
I have some text data with a few other attributes. I would like to run some analyses on the text and I would like to be able to correlate features extracted from text (such as individual word tokens or LDA topics) with the other attributes.
My plan was to load the data as a pandas data frame so that each response would represent a document. Unfortunately, I ran into an issue:
import pandas as pd
import nltk
pd.options.display.max_colwidth = 10000
txt_data = pd.read_csv("data_file.csv",sep="|")
txt = str(txt_data.comment)
len(txt)
Out[7]: 71581
txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[10]: 45
txt_lines = []
f = open("txt_lines_only.txt")
for line in f:
txt_lines.append(line)
txt = str(txt_lines)
len(txt)
Out[14]: 1668813
txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[17]: 10086
Note that in both cases the text was pre-processed so that anything other than spaces, letters, and ,.?! was removed (for simplicity).
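A cleaning step like that can be written as a small regex, for example (an illustrative sketch only, not the exact code used on this data; clean and cleaned are names introduced here for illustration):

import re

def clean(text):
    # keep only letters, spaces and the characters , . ? ! ; drop everything else
    return re.sub(r"[^A-Za-z ,.?!]", "", str(text))

cleaned = txt_data.comment.astype(str).apply(clean)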
As you can see, a pandas field converted into a string returns fewer matches, and the length of the string is also shorter.
Is there any way to improve the above code?
Also, str(x) creates one big string out of the comments, while [str(x) for x in txt_data.comment] creates a list object which cannot be broken into a bag of words. What is the best way to produce an nltk.Text object that will retain document indices? In other words, I'm looking for a way to create a Term Document Matrix, the equivalent of R's TermDocumentMatrix() from the tm package.
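To illustrate the target structure (terms as rows, documents as columns; the counts below are made up purely for illustration):

          doc_0  doc_1  doc_2
the          45     12     30
data          3      5      0
analysis      2      0      1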
Many thanks.
Answered by Stefan
The benefit of using a pandas DataFrame would be to apply the nltk functionality to each row, like so:
import numpy as np
import pandas as pd
import nltk
from nltk import word_tokenize

# toy corpus: 50 documents, each made of 1,000 random words from the system dictionary
word_file = "/usr/share/dict/words"
words = open(word_file).read().splitlines()[10:50]
random_word_list = [[' '.join(np.random.choice(words, size=1000, replace=True))] for i in range(50)]
df = pd.DataFrame(random_word_list, columns=['text'])
df.head()
text
0 Aaru Aaronic abandonable abandonedly abaction ...
1 abampere abampere abacus aback abalone abactor...
2 abaisance abalienate abandonedly abaff abacina...
3 Ababdeh abalone abac abaiser abandonable abact...
4 abandonable abandon aba abaiser abaft Abama ab...
len(df)
50
txt = df.text.apply(word_tokenize)
txt.head()
0 [Aaru, Aaronic, abandonable, abandonedly, abac...
1 [abampere, abampere, abacus, aback, abalone, a...
2 [abaisance, abalienate, abandonedly, abaff, ab...
3 [Ababdeh, abalone, abac, abaiser, abandonable,...
4 [abandonable, abandon, aba, abaiser, abaft, Ab...
txt.apply(len)
0 1000
1 1000
2 1000
3 1000
4 1000
....
44 1000
45 1000
46 1000
47 1000
48 1000
49 1000
Name: text, dtype: int64
As a result, you get the .count() for each row entry:
txt = txt.apply(lambda x: nltk.Text(x).count('abac'))
txt.head()
0 27
1 24
2 17
3 25
4 32
You can then sum the result using:
txt.sum()
1239
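If you need the full term-document matrix itself (the analogue of R's TermDocumentMatrix() from the tm package), one way to build it on top of the tokenized rows is sketched below. This is an extension of the toy example above, not part of the original answer; it assumes the df from earlier, and the column is re-tokenized because txt was overwritten by the per-row counts above.

from collections import Counter

tokens = df.text.apply(word_tokenize)             # one token list per document, index preserved
dtm = pd.DataFrame([Counter(t) for t in tokens],  # one row of term counts per document
                   index=df.index).fillna(0).astype(int)

dtm.shape    # (50, number of distinct terms) -- a document-term matrix
tdm = dtm.T  # transpose if you want terms as rows, like R's TermDocumentMatrix

Because dtm shares df's index, it can be concatenated with the other attribute columns (e.g. pd.concat([df, dtm], axis=1)) so that term counts can be correlated with those attributes. For larger corpora, scikit-learn's CountVectorizer builds the same kind of matrix in sparse form.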