Python: Efficient Term Document Matrix with NLTK

Question

Asked by user1043144

I am trying to create a term document matrix with NLTK and pandas. I wrote the following function:

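def fnDTM_Corpus(xCorpus):
    '''Create a term-document matrix from an NLTK corpus.'''
    import nltk
    import pandas as pd
    # One frequency distribution (token -> count) per file in the corpus
    fd_list = [nltk.FreqDist(xCorpus.words(fileid)) for fileid in xCorpus.fileids()]
    # Rows are documents; tokens missing from a document become NaN, then 0
    DTM = pd.DataFrame(fd_list, index=xCorpus.fileids())
    DTM.fillna(0, inplace=True)
    # Transpose so that rows are terms and columns are documents
    return DTM.T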

To run it:

import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'C:/Data/'

newcorpus = PlaintextCorpusReader(corpus_root, '.*')

x = fnDTM_Corpus(newcorpus)

It works well for a few small files in the corpus, but gives me a MemoryError when I try to run it on a corpus of 4,000 files (about 2 KB each).

Am I missing something?

I am using 32-bit Python (on Windows 7, 64-bit OS, Core Quad CPU, 8 GB RAM). Do I really need 64-bit Python for a corpus of this size?
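
A rough back-of-the-envelope estimate shows why the dense DataFrame can exhaust a 32-bit process. The sketch below assumes a hypothetical vocabulary of 50,000 distinct tokens (the real figure depends on the corpus) and 8 bytes per cell, since filling missing counts with `fillna(0)` leaves 8-byte numeric columns. The helper name `dense_matrix_bytes` is made up for this illustration.

```python
# Rough size of a dense term-document matrix.
# n_terms is an assumed value chosen for illustration; measure your own corpus.
def dense_matrix_bytes(n_docs, n_terms, bytes_per_cell=8):
    """Return the bytes needed for a dense n_docs x n_terms matrix."""
    return n_docs * n_terms * bytes_per_cell

n_docs = 4000      # number of files mentioned in the question
n_terms = 50000    # assumed vocabulary size (hypothetical)
size_gib = dense_matrix_bytes(n_docs, n_terms) / 2**30
print("about %.2f GiB" % size_gib)
```

A 32-bit Python process on Windows typically has only about 2 GB of address space, and temporary copies made while building and transposing the frame add to the total, so keeping the counts in a sparse structure is usually a more practical fix than switching interpreters.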

Answer 1

Accepted answer by user1043144

Thanks to Radim and Larsmans. My objective was to have a DTM like the one you get from R's tm package. I decided to use scikit-learn, partly inspired by this blog entry. This is the code I came up with.

I post it here in the hope that someone else will find it useful.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer 

def fn_tdm_df(docs, xColNames = None, **kwargs):
    ''' create a term document matrix as pandas DataFrame
    with **kwargs you can pass arguments of CountVectorizer
    if xColNames is given the dataframe gets columns Names'''

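    # initialize the vectorizer
    vectorizer = CountVectorizer(**kwargs)
    x1 = vectorizer.fit_transform(docs)
    # create the DataFrame (terms as rows); get_feature_names_out() replaces
    # get_feature_names(), which was removed in scikit-learn 1.2
    df = pd.DataFrame(x1.toarray().transpose(), index=vectorizer.get_feature_names_out())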
    if xColNames is not None:
        df.columns = xColNames

    return df

To use it on the text files in a directory:

DIR = 'C:/Data/'

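def fn_CorpusFromDIR(xDIR):
    ''' Create a corpus from a directory.
    Input: directory path
    Output: a dictionary with
             names of files ['ColNames']
             the text in corpus ['docs']'''
    import os
    # List the directory once so that docs and ColNames stay aligned
    fnames = os.listdir(xDIR)
    docs = []
    for f in fnames:
        # Close each file after reading instead of leaking the handle
        with open(os.path.join(xDIR, f)) as fh:
            docs.append(fh.read())
    # A list rather than map(): in Python 3, map() returns a one-shot iterator
    Res = dict(docs=docs, ColNames=['P_' + f[0:6] for f in fnames])
    return Res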

To create the DataFrame:

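corpus = fn_CorpusFromDIR(DIR)  # read the directory only once
# decode_error replaces the older charset_error argument of CountVectorizer
d1 = fn_tdm_df(docs = corpus['docs'],
          xColNames = corpus['ColNames'],
          stop_words=None, decode_error = 'replace')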

Answer 2

Answer by duhaime

I know the OP wanted to create a TDM in NLTK, but the textmining package (pip install textmining) makes it dead simple:

import textmining

def termdocumentmatrix_example():
    # Create some very short sample documents
    doc1 = 'John and Bob are brothers.'
    doc2 = 'John went to the store. The store was closed.'
    doc3 = 'Bob went to the store too.'
    # Initialize class to create term-document matrix
    tdm = textmining.TermDocumentMatrix()
    # Add the documents
    tdm.add_doc(doc1)
    tdm.add_doc(doc2)
    tdm.add_doc(doc3)
    # Write out the matrix to a csv file. Note that setting cutoff=1 means
    # that words which appear in 1 or more documents will be included in
    # the output (i.e. every word will appear in the output). The default
    # for cutoff is 2, since we usually aren't interested in words which
    # appear in a single document. For this example we want to see all
    # words however, hence cutoff=1.
    tdm.write_csv('matrix.csv', cutoff=1)
    # Instead of writing out the matrix you can also access its rows directly.
    # Let's print them to the screen.
    for row in tdm.rows(cutoff=1):
        print(row)

termdocumentmatrix_example()

Output:

['and', 'the', 'brothers', 'to', 'are', 'closed', 'bob', 'john', 'was', 'went', 'store', 'too']
[1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0]
[0, 2, 0, 1, 0, 1, 0, 1, 1, 1, 2, 0]
[0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]
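
For readers who cannot install textmining, the same kind of matrix can be sketched with the standard library alone. This is an illustration, not the package's implementation: the regular-expression tokenizer and the alphabetical column order are assumptions, so the columns appear in a different order from the output above. The function name `term_document_rows` is hypothetical.

```python
import re
from collections import Counter

def term_document_rows(docs):
    """Return a header row of terms followed by one row of counts per document."""
    # Assumed tokenizer: lowercase the text and keep runs of letters only
    counts = [Counter(re.findall(r"[a-z]+", doc.lower())) for doc in docs]
    # Vocabulary is the sorted union of all terms seen in any document
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for terms a document does not contain
    return [vocab] + [[c[term] for term in vocab] for c in counts]

docs = ['John and Bob are brothers.',
        'John went to the store. The store was closed.',
        'Bob went to the store too.']
for row in term_document_rows(docs):
    print(row)
```

Each document is reduced to a `Counter`, and each row simply looks up every vocabulary term in that counter.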

Alternatively, one can use pandas and sklearn [source]:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ['why hello there', 'omg hello pony', 'she went there? omg']
vec = CountVectorizer()
X = vec.fit_transform(docs)
# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
print(df)

Output:

   hello  omg  pony  she  there  went  why
0      1    0     0    0      1     0    1
1      1    1     1    0      0     0    0
2      0    1     0    1      1     1    0

Answer 3

Answer by Ajay Ohri

An alternative approach using tokens and a DataFrame:

import nltk
import pandas as pd
# Run nltk.download() once beforehand to get the tokenizer data
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
# 'utf-8-sig' drops the byte-order mark at the start of the file
raw = response.read().decode('utf-8-sig')
type(raw)

tokens = nltk.word_tokenize(raw)
type(tokens)

tokens[1:10]
['Project',
 'Gutenberg',
 'EBook',
 'of',
 'Crime',
 'and',
 'Punishment',
 ',',
 'by']

tokens2=pd.DataFrame(tokens)
tokens2.columns=['Words']
tokens2.head()

       Words
0        The
1    Project
2  Gutenberg
3      EBook
4         of

tokens2.Words.value_counts().head()
,                 16178
.                  9589
the                7436
and                6284
to                 5278
