Python 使用 NLTK 的高效术语文档矩阵

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/15899861/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-18 21:19:48  来源:igfitidea点击:

efficient Term Document Matrix with NLTK

pythonpandasnltkterm-document-matrix

提问by user1043144

I am trying to create a term document matrix with NLTK and pandas. I wrote the following function:

我正在尝试使用 NLTK 和 Pandas 创建一个术语文档矩阵。我写了以下函数:

def fnDTM_Corpus(xCorpus):
    import pandas as pd
    '''to create a Term Document Matrix from a NLTK Corpus'''
    fd_list = []
    for x in range(0, len(xCorpus.fileids())):
        fd_list.append(nltk.FreqDist(xCorpus.words(xCorpus.fileids()[x])))
    DTM = pd.DataFrame(fd_list, index = xCorpus.fileids())
    DTM.fillna(0,inplace = True)
    return DTM.T

to run it

运行它

import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'C:/Data/'

newcorpus = PlaintextCorpusReader(corpus_root, '.*')

x = fnDTM_Corpus(newcorpus)

It works well for few small files in the corpus but gives me a MemoryErrorwhen I try to run it with a corpus of 4,000 files (of about 2 kb each).

它适用于语料库中的几个小文件,但是 当我尝试使用 4,000 个文件(每个大约 2 kb)的语料库运行它时,它给了我一个MemoryError

Am I missing something?

我错过了什么吗?

I am using a 32 bit python. (am on windows 7, 64-bit OS, Core Quad CPU, 8 GB RAM). Do I really need to use 64 bit for corpus of this size ?

我正在使用 32 位 python。(在 Windows 7、64 位操作系统、Core Quad CPU、8 GB RAM 上)。对于这种大小的语料库,我真的需要使用 64 位吗?

采纳答案by user1043144

Thanks to Radim and Larsmans. My objective was to have a DTM like the one you get in R tm. I decided to use scikit-learn and partly inspired by this blog entry. This the code I came up with.

感谢 Radim 和 Larsmans。我的目标是拥有一个像您在 R tm 中获得的 DTM。我决定使用 scikit-learn 并部分受到这篇博客条目的启发。这是我想出的代码。

I post it here in the hope that someone else will find it useful.

我把它贴在这里,希望其他人会觉得它有用。

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer 

def fn_tdm_df(docs, xColNames = None, **kwargs):
    ''' create a term document matrix as pandas DataFrame
    with **kwargs you can pass arguments of CountVectorizer
    if xColNames is given the dataframe gets columns Names'''

    #initialize the  vectorizer
    vectorizer = CountVectorizer(**kwargs)
    x1 = vectorizer.fit_transform(docs)
    #create dataFrame
    df = pd.DataFrame(x1.toarray().transpose(), index = vectorizer.get_feature_names())
    if xColNames is not None:
        df.columns = xColNames

    return df

to use it on a list of text in a directory

在目录中的文本列表上使用它

DIR = 'C:/Data/'

def fn_CorpusFromDIR(xDIR):
    ''' functions to create corpus from a Directories
    Input: Directory
    Output: A dictionary with 
             Names of files ['ColNames']
             the text in corpus ['docs']'''
    import os
    Res = dict(docs = [open(os.path.join(xDIR,f)).read() for f in os.listdir(xDIR)],
               ColNames = map(lambda x: 'P_' + x[0:6], os.listdir(xDIR)))
    return Res

to create the dataframe

创建数据框

d1 = fn_tdm_df(docs = fn_CorpusFromDIR(DIR)['docs'],
          xColNames = fn_CorpusFromDIR(DIR)['ColNames'], 
          stop_words=None, charset_error = 'replace')  

回答by duhaime

I know the OP wanted to create a tdm in NLTK, but the textminingpackage (pip install textmining) makes it dead simple:

我知道 OP 想在 NLTK 中创建一个 tdm,但是textmining包 ( pip install textmining) 使它变得非常简单:

import textmining

def termdocumentmatrix_example():
    # Create some very short sample documents
    doc1 = 'John and Bob are brothers.'
    doc2 = 'John went to the store. The store was closed.'
    doc3 = 'Bob went to the store too.'
    # Initialize class to create term-document matrix
    tdm = textmining.TermDocumentMatrix()
    # Add the documents
    tdm.add_doc(doc1)
    tdm.add_doc(doc2)
    tdm.add_doc(doc3)
    # Write out the matrix to a csv file. Note that setting cutoff=1 means
    # that words which appear in 1 or more documents will be included in
    # the output (i.e. every word will appear in the output). The default
    # for cutoff is 2, since we usually aren't interested in words which
    # appear in a single document. For this example we want to see all
    # words however, hence cutoff=1.
    tdm.write_csv('matrix.csv', cutoff=1)
    # Instead of writing out the matrix you can also access its rows directly.
    # Let's print them to the screen.
    for row in tdm.rows(cutoff=1):
            print row

termdocumentmatrix_example()

Output:

输出:

['and', 'the', 'brothers', 'to', 'are', 'closed', 'bob', 'john', 'was', 'went', 'store', 'too']
[1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0]
[0, 2, 0, 1, 0, 1, 0, 1, 1, 1, 2, 0]
[0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]

Alternatively, one can use pandas and sklearn [source]:

或者,可以使用 pandas 和 sklearn [source]

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ['why hello there', 'omg hello pony', 'she went there? omg']
vec = CountVectorizer()
X = vec.fit_transform(docs)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
print(df)

Output:

输出:

   hello  omg  pony  she  there  went  why
0      1    0     0    0      1     0    1
1      1    1     1    0      0     0    0
2      0    1     0    1      1     1    0

回答by Ajay Ohri

An Alternative approach using tokens and Data Frame

使用令牌和数据帧的替代方法

import nltk
comment #nltk.download() to get toenize
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw)

tokens = nltk.word_tokenize(raw)
type(tokens)

tokens[1:10]
['Project',
 'Gutenberg',
 'EBook',
 'of',
 'Crime',
 'and',
 'Punishment',
 ',',
 'by']

tokens2=pd.DataFrame(tokens)
tokens2.columns=['Words']
tokens2.head()


Words
0   ?The
1   Project
2   Gutenberg
3   EBook
4   of

    tokens2.Words.value_counts().head()
,                 16178
.                  9589
the                7436
and                6284
to                 5278