Python: How to compute a word-word co-occurrence matrix with sklearn?

Question

Asked by newdev14

I am looking for a module in sklearn that lets you derive the word-word co-occurrence matrix.

I can get the document-term matrix, but I am not sure how to obtain a word-word matrix of co-occurrences.

Answer 1

Answered by Guiem Bosch

You can use the ngram_range parameter of CountVectorizer or TfidfVectorizer.

Code example:

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2)) # (2, 2) means you only want pairs of 2 adjacent words

If you want to state explicitly which word co-occurrences to count, use the vocabulary parameter, e.g. vocabulary = {'awesome unicorns':0, 'batman forever':1}

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Here is self-explanatory, ready-to-use code with predefined word-word co-occurrences. In this case we track only the co-occurrences of awesome unicorns and batman forever:

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

samples = ['awesome unicorns are awesome','batman forever and ever','I love batman forever']
# a fixed vocabulary restricts counting to exactly these two bigrams
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), vocabulary = {'awesome unicorns':0, 'batman forever':1})
co_occurrences = bigram_vectorizer.fit_transform(samples)
print('Printing sparse matrix:', co_occurrences)
print('Printing dense matrix (cols are vocabulary keys 0-> "awesome unicorns", 1-> "batman forever")', co_occurrences.todense())
sum_occ = np.sum(co_occurrences.todense(),axis=0)
print('Sum of word-word occurrences:', sum_occ)
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
print('Pretty printing of co_occurrences count:', list(zip(bigram_vectorizer.get_feature_names_out(),np.array(sum_occ)[0].tolist())))

The final output is ('awesome unicorns', 1), ('batman forever', 2), which corresponds exactly to the data provided in samples.
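As a rough illustration of what ngram_range=(2, 2) counts, the sketch below uses two made-up sentences (they are assumptions for illustration, not from the answer). It suggests that a bigram is counted only when the two words are adjacent, so "batman" and "forever" separated by another word do not add to the "batman forever" count.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical samples: the second one separates "batman" and "forever"
samples = ['batman forever', 'batman is forever']

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X = bigram_vectorizer.fit_transform(samples)

# Sum each bigram column over all samples
totals = np.asarray(X.sum(axis=0)).ravel()
counts = {term: int(totals[idx])
          for term, idx in bigram_vectorizer.vocabulary_.items()}
print(counts)
```

Both samples contain "batman" and "forever", yet only the first contributes to the 'batman forever' bigram.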

Answer 2

Answered by titipata

Here is my example solution using CountVectorizer in scikit-learn. Referring to this post, you can simply use matrix multiplication to get the word-word co-occurrence matrix.

from sklearn.feature_extraction.text import CountVectorizer
docs = ['this this this book',
        'this cat good',
        'cat good shit']
count_model = CountVectorizer(ngram_range=(1,1)) # default unigram model
X = count_model.fit_transform(docs)
# X[X > 0] = 1 # run this line if you don't want extra within-text co-occurrence (see below)
Xc = (X.T * X) # this is the co-occurrence matrix in sparse csr format
Xc.setdiag(0) # sometimes you want to set same-word co-occurrence to 0
print(Xc.todense()) # print out matrix in dense format

You can also look up the word-to-index dictionary stored in count_model:

count_model.vocabulary_

Or, if you want to normalize by the diagonal component (see the answer in the previous post):

import scipy.sparse as sp
Xc = (X.T * X)
g = sp.diags(1./Xc.diagonal())
Xc_norm = g * Xc # normalized co-occurrence matrix

Extra: as @Federico Caccia's answer notes, if you don't want spurious co-occurrences that come from a word repeating within a single text, set every count greater than 1 to 1, e.g.

X[X > 0] = 1 # do this line first before computing cooccurrence
Xc = (X.T * X)
...
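The binarize-then-multiply idea can be sketched end to end as follows. This is a minimal sketch, not the answerer's code: the docs list is a slightly altered version of the one above, the helper name cooc is hypothetical, and CountVectorizer's binary=True option is used here as an assumed stand-in for the X[X > 0] = 1 step.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book',
        'this cat good',
        'cat good dog']

# binary=True stores 1 if a word occurs in a document, 0 otherwise
count_model = CountVectorizer(binary=True)
X = count_model.fit_transform(docs)

# Dense word-word matrix: entry (i, j) is the number of documents
# that contain both word i and word j
Xc = (X.T @ X).toarray()

# Words ordered by their column index
terms = sorted(count_model.vocabulary_, key=count_model.vocabulary_.get)

def cooc(w1, w2):
    """Hypothetical helper: look up a co-occurrence count by word."""
    v = count_model.vocabulary_
    return int(Xc[v[w1], v[w2]])

print(terms)
print(cooc('cat', 'good'))
print(cooc('this', 'book'))
```

With binary counts, the three repetitions of "this" in the first document no longer inflate its co-occurrence with "book".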

Answer 3

Answered by Federico Caccia

@titipata I think your solution is not a good metric, because it gives the same weight to real co-occurrences and to occurrences that are merely spurious. For example, suppose I have 5 texts and the words apple and house appear with these frequencies:

text1: apple:10, "house":1

text2: apple:10, "house":0

text3: apple:10, "house":0

text4: apple:10, "house":0

text5: apple:10, "house":0

The co-occurrence we are going to measure is 10*1+10*0+10*0+10*0+10*0=10, but it is just spurious.

And in other, genuinely important cases, like the following:

text1: apple:1, "banana":1

text2: apple:1, "banana":1

text3: apple:1, "banana":1

text4: apple:1, "banana":1

text5: apple:1, "banana":1

we are going to get a co-occurrence of just 1*1+1*1+1*1+1*1+1*1=5, when in fact that co-occurrence is really important.

@Guiem Bosch In your case, co-occurrences are measured only when the two words are contiguous.

I propose using something like the @titipata solution to compute the matrix:

Xc = (Y.T * Y) # this is co-occurrence matrix in sparse csr format

where, instead of using X, you use a matrix Y with ones in the positions where X is greater than 0 and zeros in all other positions.

Using this, in the first example we are going to have co-occurrence: 1*1+1*0+1*0+1*0+1*0=1, and in the second example co-occurrence: 1*1+1*1+1*1+1*1+1*1=5, which is what we are really looking for.
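The arithmetic of the two examples can be checked with a small numpy sketch. The count matrices below are typed in by hand from the examples (rows are texts, columns are words), and the helper name cooc is a hypothetical one chosen for illustration.

```python
import numpy as np

# Example 1: columns are (apple, house)
X1 = np.array([[10, 1], [10, 0], [10, 0], [10, 0], [10, 0]])
# Example 2: columns are (apple, banana)
X2 = np.array([[1, 1]] * 5)

def cooc(X, binarize):
    """Co-occurrence of column 0 with column 1, from raw or binarized counts."""
    Y = (X > 0).astype(int) if binarize else X
    return int((Y.T @ Y)[0, 1])

print(cooc(X1, binarize=False), cooc(X1, binarize=True))  # 10 1
print(cooc(X2, binarize=False), cooc(X2, binarize=True))  # 5 5
```

Binarizing leaves the apple/banana score at 5 while reducing the apple/house score from 10 to 1, which matches the ranking argued for above.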

Answer 4

Answered by Anwarvic

None of the provided answers takes the moving-window concept into consideration. So I wrote my own function that finds the co-occurrence matrix by applying a moving window of a defined size. This function takes a list of sentences and a window_size number, and returns a pandas.DataFrame object representing the co-occurrence matrix:

from collections import defaultdict
import numpy as np
import pandas as pd

def co_occurrence(sentences, window_size):
    d = defaultdict(int)
    vocab = set()
    for text in sentences:
        # preprocessing (use tokenizer instead)
        text = text.lower().split()
        # iterate over the tokens of the sentence
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            next_token = text[i+1 : i+1+window_size]
            for t in next_token:
                key = tuple( sorted([t, token]) )
                d[key] += 1

    # formulate the dictionary into dataframe
    vocab = sorted(vocab) # sort vocab
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df

Let's try it out with the following two simple sentences:

>>> text = ["I go to school every day by bus .",
            "i go to theatre every night by bus"]
>>> 
>>> df = co_occurrence(text, 2)
>>> df
         .  bus  by  day  every  go  i  night  school  theatre  to
.        0    1   1    0      0   0  0      0       0        0   0
bus      1    0   2    1      0   0  0      1       0        0   0
by       1    2   0    1      2   0  0      1       0        0   0
day      0    1   1    0      1   0  0      0       1        0   0
every    0    0   2    1      0   0  0      1       1        1   2
go       0    0   0    0      0   0  2      0       1        1   2
i        0    0   0    0      0   2  0      0       0        0   2
night    0    1   1    0      1   0  0      0       0        1   0
school   0    0   0    1      1   1  0      0       0        0   1
theatre  0    0   0    0      1   1  0      1       0        0   1
to       0    0   0    0      2   2  2      0       1        1   0

[11 rows x 11 columns]

Now we have our co-occurrence matrix.

Answer 5

Answered by nathandrake

I used the code below to create a co-occurrence matrix with a window size:

#https://stackoverflow.com/questions/4843158/check-if-a-python-list-item-contains-a-string-inside-another-string
import pandas as pd
def co_occurance_matrix(input_text,top_words,window_size):
    co_occur = pd.DataFrame(index=top_words, columns=top_words)

    for row,nrow in zip(top_words,range(len(top_words))):
        for colm,ncolm in zip(top_words,range(len(top_words))):        
            count = 0
            if row == colm: 
                co_occur.iloc[nrow,ncolm] = count
            else: 
                for single_essay in input_text:
                    essay_split = single_essay.split(" ")
                    max_len = len(essay_split)
                    top_word_index = [index for index, split in enumerate(essay_split) if row in split]
                    for index in top_word_index:
                        if index == 0:
                            count = count + essay_split[:window_size + 1].count(colm)
                        elif index == (max_len -1): 
                            count = count + essay_split[-(window_size + 1):].count(colm)
                        else:
                            count = count + essay_split[index + 1 : (index + window_size + 1)].count(colm)
                            if index < window_size: 
                                count = count + essay_split[: index].count(colm)
                            else:
                                count = count + essay_split[(index - window_size): index].count(colm)
                co_occur.iloc[nrow,ncolm] = count

    return co_occur

Then I used the code below to run a test:

corpus = ['ABC DEF IJK PQR','PQR KLM OPQ','LMN PQR XYZ ABC DEF PQR ABC']
words = ['ABC','PQR','DEF']
window_size =2 

result = co_occurance_matrix(corpus,words,window_size)
result


Answer 6

Answered by a.k

With numpy, where the corpus is a list of lists (each list a tokenized document):

corpus = [['<START>', 'All', 'that', 'glitters', "isn't", 'gold', '<END>'], 
          ['<START>', "All's", 'well', 'that', 'ends', 'well', '<END>']]

the function below returns the matrix and a word->row/col mapping:

import numpy as np

def compute_co_occurrence_matrix(corpus, window_size):

    words = sorted(list(set([word for words_list in corpus for word in words_list])))
    num_words = len(words)

    M = np.zeros((num_words, num_words))
    word2Ind = dict(zip(words, range(num_words)))

    for doc in corpus:

        cur_idx = 0
        doc_len = len(doc)

        while cur_idx < doc_len:

            left = max(cur_idx-window_size, 0)
            right = min(cur_idx+window_size+1, doc_len)
            words_to_add = doc[left:cur_idx] + doc[cur_idx+1:right]
            focus_word = doc[cur_idx]

            for word in words_to_add:
                outside_idx = word2Ind[word]
                M[outside_idx, word2Ind[focus_word]] += 1

            cur_idx += 1

    return M, word2Ind
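A possible way to try this out is sketched below. To keep the snippet self-contained it restates the same windowing logic in a compact form under a different, hypothetical name (window_cooccurrence), so it is an assumed equivalent rather than the answerer's exact code. It runs on the corpus above with a window of 1.

```python
import numpy as np

def window_cooccurrence(corpus, window_size):
    # Compact restatement of the symmetric-window counting shown above
    words = sorted({w for doc in corpus for w in doc})
    word2ind = {w: i for i, w in enumerate(words)}
    M = np.zeros((len(words), len(words)))
    for doc in corpus:
        for i, focus in enumerate(doc):
            left = max(i - window_size, 0)
            # Up to window_size tokens on each side of the focus word
            context = doc[left:i] + doc[i + 1:i + 1 + window_size]
            for w in context:
                M[word2ind[w], word2ind[focus]] += 1
    return M, word2ind

corpus = [['<START>', 'All', 'that', 'glitters', "isn't", 'gold', '<END>'],
          ['<START>', "All's", 'well', 'that', 'ends', 'well', '<END>']]
M, word2ind = window_cooccurrence(corpus, window_size=1)

print(M[word2ind['All'], word2ind['that']])   # 1.0
print(M[word2ind['well'], word2ind['that']])  # 1.0
print(np.array_equal(M, M.T))                 # True
```

Because the window extends equally to the left and right of the focus word, every pair is counted once from each side and the resulting matrix is symmetric.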
