Efficient way to create term density matrix from pandas DataFrame
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/22205845/
Asked by nikosd
I am trying to create a term density matrix from a pandas dataframe, so I can rate terms appearing in the dataframe. I also want to be able to keep the 'spatial' aspect of my data (see comment at the end of post for an example of what I mean).
I am new to pandas and NLTK, so I expect my problem to be soluble with some existing tools.
I have a dataframe which contains two columns of interest: say 'title' and 'page'
    import pandas as pd
    import re
    df = pd.DataFrame({'title':['Delicious boiled egg','Fried egg ','Split orange','Something else'], 'page':[1, 2, 3, 4]})
    df.head()
       page                 title
    0     1  Delicious boiled egg
    1     2            Fried egg 
    2     3          Split orange
    3     4        Something else
My goal is to clean up the text, and pass terms of interest to a TDM dataframe. I use two functions to help me clean up the strings
    import nltk.classify
    from nltk.tokenize import wordpunct_tokenize
    from nltk.corpus import stopwords
    import string   
    def remove_punct(strin):
        '''
        returns a string with the punctuation marks removed, and all lower case letters
        input: strin, an ascii string. convert using strin.encode('ascii','ignore') if it is unicode 
        '''
        return strin.translate(string.maketrans("",""), string.punctuation).lower()
    sw = stopwords.words('english')
    def tok_cln(strin):
        '''
        tokenizes string and removes stopwords
        '''
        return set(nltk.wordpunct_tokenize(strin)).difference(sw)
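Note that remove_punct as written relies on Python 2 string handling (string.maketrans together with the two-argument form of str.translate). On Python 3, a rough equivalent of the two helpers might look like the sketch below; the _py3 names are only used here to keep them distinct from the originals.
    # Rough Python 3 sketch of the two helpers above (not part of the original post).
    # str.maketrans with three arguments builds a table that deletes punctuation.
    import string
    import nltk
    from nltk.corpus import stopwords

    _punct_table = str.maketrans('', '', string.punctuation)
    sw_py3 = set(stopwords.words('english'))

    def remove_punct_py3(strin):
        '''returns the string lowercased with punctuation removed'''
        return strin.translate(_punct_table).lower()

    def tok_cln_py3(strin):
        '''tokenizes the string and drops stopwords (still a set, so per-word counts are lost)'''
        return set(nltk.wordpunct_tokenize(strin)) - sw_py3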
And one function which does the dataframe manipulation
    def df2tdm(df,titleColumn,placementColumn,newPlacementColumn):
        '''
        takes in a DataFrame with at least two columns, and returns a dataframe with the term density matrix
        of the words appearing in the titleColumn
        Inputs: df, a DataFrame containing titleColumn, placementColumn among others
        Outputs: tdm_df, a DataFrame containing newPlacementColumn and columns with all the terms in df[titleColumn]
        '''
        tdm_df = pd.DataFrame(index=df.index, columns=[newPlacementColumn])
        tdm_df = tdm_df.fillna(0)
        for idx in df.index:
            for word in tok_cln( remove_punct(df[titleColumn][idx].encode('ascii','ignore')) ):
                if word not in tdm_df.columns:
                    newcol = pd.DataFrame(index = df.index, columns = [word])
                    tdm_df = tdm_df.join(newcol)
                tdm_df[newPlacementColumn][idx] = df[placementColumn][idx]
                tdm_df[word][idx] = 1
        return tdm_df.fillna(0,inplace = False)
    tdm_df = df2tdm(df,'title','page','pub_page')
    tdm_df.head()
This returns
      pub_page boiled egg delicious fried orange split something else
    0        1      1   1         1     0      0     0         0    0
    1        2      0   1         0     1      0     0         0    0
    2        3      0   0         0     0      1     1         0    0
    3        4      0   0         0     0      0     0         1    1
But it is painfully slow when parsing large sets (an output of hundreds of thousands of rows and thousands of columns). My two questions:
Can I speed up this implementation?
Is there some other tool I could use to get this done?
I want to be able to keep the 'spatial' aspect of my data, for example if 'egg' appears very often in pages 1-10 and then reappears often in pages 500-520, I want to know that.
Answered by herrfz
You can use scikit-learn's CountVectorizer:
    In [14]: from sklearn.feature_extraction.text import CountVectorizer
    In [15]: countvec = CountVectorizer()
    In [16]: countvec.fit_transform(df.title)
    Out[16]: 
    <4x8 sparse matrix of type '<type 'numpy.int64'>'
        with 9 stored elements in Compressed Sparse Column format>
It returns the term-document matrix in a sparse representation, because such a matrix is usually huge and, well, sparse.
For your particular example I guess converting it back to a DataFrame would still work:
    In [17]: pd.DataFrame(countvec.fit_transform(df.title).toarray(), columns=countvec.get_feature_names())
    Out[17]: 
       boiled  delicious  egg  else  fried  orange  something  split
    0       1          1    1     0      0       0          0      0
    1       0          0    1     0      1       0          0      0
    2       0          0    0     0      0       1          0      1
    3       0          0    0     1      0       0          1      0
    [4 rows x 8 columns]
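If you also want to keep the page numbers next to the term counts (the 'spatial' aspect mentioned in the question), one possible sketch, reusing df and countvec from above, is to concatenate the page column back onto the dense matrix:
    # Sketch (not from the original answer): reattach the page column so each
    # row of term counts still carries its page number.
    tdm = pd.DataFrame(countvec.fit_transform(df.title).toarray(),
                       columns=countvec.get_feature_names(),
                       index=df.index)
    tdm_with_pages = pd.concat([df[['page']], tdm], axis=1)
From there you could, for example, group rows into page ranges and sum the term columns to see where a word such as 'egg' clusters.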
Answered by Jeremy Jordan
herrfz provides a way to handle this, but I just wanted to point out that creating a term density data structure using a Python set is counterproductive, since a set is a collection of unique objects. You won't be able to capture the count for each word, only the presence of a word in a given row.
    return set(nltk.wordpunct_tokenize(strin)).difference(sw)
In order to strip out the stopwords you could do something like
    tokens_stripped = [token for token in tokens
                       if token not in stopwords]
after tokenization.
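If per-word counts matter, a small sketch (not from the original answer) that counts tokens after dropping the stopwords could look like this:
    from collections import Counter
    import nltk

    # Sketch: keep term frequencies per title instead of collapsing to a set;
    # 'stopwords' is assumed to be a set of stopword strings, as in the question.
    def tok_counts(strin, stopwords):
        tokens = nltk.wordpunct_tokenize(strin)
        return Counter(token for token in tokens if token not in stopwords)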

