Python: how do I calculate a word-word co-occurrence matrix with sklearn?
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/35562789/
How do I calculate a word-word co-occurrence matrix with sklearn?
Asked by newdev14
I am looking for a module in sklearn that lets you derive the word-word co-occurrence matrix.
I can get the document-term matrix, but I am not sure how to go about obtaining a word-word matrix of co-occurrences.
Answered by Guiem Bosch
You can use the ngram_range parameter of CountVectorizer or TfidfVectorizer.
Code example:
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2)) # (2, 2) means you only want n-grams of exactly 2 words, i.e. pairs
If you want to say explicitly which word co-occurrences to count, use the vocabulary parameter, e.g. vocabulary = {'awesome unicorns': 0, 'batman forever': 1}
Here is self-explanatory, ready-to-use code with predefined word-word co-occurrences. In this case we are tracking only the co-occurrences of awesome unicorns and batman forever:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
samples = ['awesome unicorns are awesome', 'batman forever and ever', 'I love batman forever']
# Only the two bigrams listed in vocabulary are counted; the values are their column indices
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), vocabulary={'awesome unicorns': 0, 'batman forever': 1})
co_occurrences = bigram_vectorizer.fit_transform(samples)
print('Printing sparse matrix:', co_occurrences)
print('Printing dense matrix (cols are vocabulary keys 0-> "awesome unicorns", 1-> "batman forever")', co_occurrences.todense())
sum_occ = np.sum(co_occurrences.todense(), axis=0)
print('Sum of word-word occurrences:', sum_occ)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print('Pretty printing of co_occurrences count:', list(zip(bigram_vectorizer.get_feature_names_out(), np.array(sum_occ)[0].tolist())))
The final output is ('awesome unicorns', 1), ('batman forever', 2), which corresponds exactly to the samples data we provided.
Answered by titipata
Here is my example solution using CountVectorizer in scikit-learn. Referring to this post, you can simply use matrix multiplication to get the word-word co-occurrence matrix.
from sklearn.feature_extraction.text import CountVectorizer
docs = ['this this this book',
'this cat good',
'cat good shit']
count_model = CountVectorizer(ngram_range=(1,1)) # default unigram model
X = count_model.fit_transform(docs)
# X[X > 0] = 1 # run this line if you don't want extra within-text co-occurrence (see below)
Xc = (X.T * X) # this is the co-occurrence matrix in sparse format
Xc.setdiag(0) # sometimes you want to set same-word co-occurrence to 0
print(Xc.todense()) # print out the matrix in dense format
You can also look up the dictionary of words (word to column index) in count_model:
count_model.vocabulary_
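As an illustration of how vocabulary_ ties the matrix back to words, the sketch below looks up individual co-occurrence counts by word. The toy docs and the names vocab, words and cat_good are assumptions made for this example, not part of the answer:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, chosen only for illustration
docs = ['this this this book',
        'this cat good',
        'cat good book']

count_model = CountVectorizer(ngram_range=(1, 1))
X = count_model.fit_transform(docs)
Xc = (X.T @ X).toarray()  # dense word-word co-occurrence matrix

# vocabulary_ maps each word to its row/column index in Xc
vocab = count_model.vocabulary_
words = sorted(vocab, key=vocab.get)  # words ordered by index

cat_good = int(Xc[vocab['cat'], vocab['good']])
book_this = int(Xc[vocab['book'], vocab['this']])
print(words)
print('cat/good:', cat_good, 'book/this:', book_this)
```

Here book/this is 3 rather than 1 because the raw counts are multiplied: "this" appears three times in the one document that also contains "book".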
Or, if you want to normalize by the diagonal component (see the answer in the previous post):
import scipy.sparse as sp
Xc = (X.T * X)
g = sp.diags(1./Xc.diagonal())
Xc_norm = g * Xc # normalized co-occurrence matrix
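A small worked example may help show what this normalization does. The sketch below uses a hypothetical binary document-term matrix Y (rows are documents, columns are the made-up words apple, banana, cherry). Each row of the result is divided by that word's own diagonal count, so with binary input, entry (i, j) can be read as the fraction of documents containing word i that also contain word j:

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical binary document-term matrix: columns = (apple, banana, cherry)
Y = sp.csr_matrix(np.array([[1, 1, 0],
                            [1, 1, 0],
                            [1, 0, 1],
                            [1, 0, 0]]))

Xc = Y.T @ Y                       # raw co-occurrence counts
g = sp.diags(1.0 / Xc.diagonal())  # inverse of each word's own count
Xc_norm = (g @ Xc).toarray()       # row i is scaled by 1 / Xc[i, i]

print(Xc_norm)
```

The result is not symmetric: banana always appears with apple (1.0), while apple appears with banana in only half of its documents (0.5).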
As an extra note on @Federico Caccia's answer: if you don't want spurious co-occurrences that come from repetitions within a single text, set every occurrence count greater than 1 to 1, e.g.
X[X > 0] = 1 # do this line first before computing cooccurrence
Xc = (X.T * X)
...
Answered by Federico Caccia
@titipata I think your solution is not a good metric, because it gives the same weight to real co-occurrences and to occurrences that are just spurious. For example, if I have 5 texts and the words apple and house appear with this frequency:
text1: apple: 10, house: 1
text2: apple: 10, house: 0
text3: apple: 10, house: 0
text4: apple: 10, house: 0
text5: apple: 10, house: 0
The co-occurrence we are going to measure is 10*1+10*0+10*0+10*0+10*0=10, but it is just spurious.
And in other important cases, like the following:
text1: apple: 1, banana: 1
text2: apple: 1, banana: 1
text3: apple: 1, banana: 1
text4: apple: 1, banana: 1
text5: apple: 1, banana: 1
we are going to get a co-occurrence of just 1*1+1*1+1*1+1*1+1*1=5, when in fact that co-occurrence is really important.
@Guiem Bosch In your solution, co-occurrences are counted only when the two words are contiguous.
I propose to use something like @titipata's solution to compute the matrix:
Xc = (Y.T * Y) # this is co-occurrence matrix in sparse csr format
where, instead of using X, you use a matrix Y with ones in the positions where X is greater than 0 and zeros in all other positions.
Using this, in the first example we are going to have co-occurrence: 1*1+1*0+1*0+1*0+1*0=1, and in the second example co-occurrence: 1*1+1*1+1*1+1*1+1*1=5, which is what we are really looking for.
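The two examples above can be checked numerically. This is a minimal sketch with plain NumPy arrays standing in for the document-term matrix; the helper name cooccurrence and its binarize flag are made up for this illustration:

```python
import numpy as np

# Rows are the five texts; columns are (apple, house)
X1 = np.array([[10, 1], [10, 0], [10, 0], [10, 0], [10, 0]])
# Rows are the five texts; columns are (apple, banana)
X2 = np.array([[1, 1]] * 5)

def cooccurrence(X, binarize):
    # With binarize=True, every count greater than 0 becomes 1 (the matrix Y)
    Y = (X > 0).astype(int) if binarize else X
    return Y.T @ Y

raw_apple_house = int(cooccurrence(X1, binarize=False)[0, 1])
bin_apple_house = int(cooccurrence(X1, binarize=True)[0, 1])
raw_apple_banana = int(cooccurrence(X2, binarize=False)[0, 1])
bin_apple_banana = int(cooccurrence(X2, binarize=True)[0, 1])

print(raw_apple_house, bin_apple_house)    # apple/house: raw vs. binarized
print(raw_apple_banana, bin_apple_banana)  # apple/banana: raw vs. binarized
```

With scikit-learn, passing binary=True to CountVectorizer should give the matrix Y directly, since that option sets all non-zero counts to 1.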
Answered by Anwarvic
None of the provided answers takes the moving-window concept into consideration. So I wrote my own function that finds the co-occurrence matrix by applying a moving window of a defined size. This function takes a list of sentences and a window_size number, and returns a pandas.DataFrame object representing the co-occurrence matrix:
from collections import defaultdict
import numpy as np
import pandas as pd

def co_occurrence(sentences, window_size):
    d = defaultdict(int)
    vocab = set()
    for text in sentences:
        # preprocessing (use a tokenizer instead)
        text = text.lower().split()
        # iterate over the tokens of the sentence
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            # the window_size tokens that follow the current token
            next_token = text[i+1 : i+1+window_size]
            for t in next_token:
                # sort the pair so (a, b) and (b, a) share one key
                key = tuple( sorted([t, token]) )
                d[key] += 1
    # formulate the dictionary into a dataframe
    vocab = sorted(vocab)  # sort vocab
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        # fill both halves so the matrix is symmetric
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df
Let's try it out given the following two simple sentences:
>>> text = ["I go to school every day by bus .",
...         "i go to theatre every night by bus"]
>>>
>>> df = co_occurrence(text, 2)
>>> df
         .  bus  by  day  every  go  i  night  school  theatre  to
.        0    1   1    0      0   0  0      0       0        0   0
bus      1    0   2    1      0   0  0      1       0        0   0
by       1    2   0    1      2   0  0      1       0        0   0
day      0    1   1    0      1   0  0      0       1        0   0
every    0    0   2    1      0   0  0      1       1        1   2
go       0    0   0    0      0   0  2      0       1        1   2
i        0    0   0    0      0   2  0      0       0        0   2
night    0    1   1    0      1   0  0      0       0        1   0
school   0    0   0    1      1   1  0      0       0        0   1
theatre  0    0   0    0      1   1  0      1       0        0   1
to       0    0   0    0      2   2  2      0       1        1   0
[11 rows x 11 columns]
Now we have our co-occurrence matrix.
Answered by nathandrake
I used the code below to create a co-occurrence matrix with a window size:
# https://stackoverflow.com/questions/4843158/check-if-a-python-list-item-contains-a-string-inside-another-string
import pandas as pd

def co_occurance_matrix(input_text, top_words, window_size):
    co_occur = pd.DataFrame(index=top_words, columns=top_words)
    for row, nrow in zip(top_words, range(len(top_words))):
        for colm, ncolm in zip(top_words, range(len(top_words))):
            count = 0
            if row == colm:
                co_occur.iloc[nrow, ncolm] = count
            else:
                for single_essay in input_text:
                    essay_split = single_essay.split(" ")
                    max_len = len(essay_split)
                    # positions of tokens that contain the row word (substring match)
                    top_word_index = [index for index, split in enumerate(essay_split) if row in split]
                    for index in top_word_index:
                        if index == 0:
                            count = count + essay_split[:window_size + 1].count(colm)
                        elif index == (max_len - 1):
                            count = count + essay_split[-(window_size + 1):].count(colm)
                        else:
                            # tokens to the right of the row word
                            count = count + essay_split[index + 1 : (index + window_size + 1)].count(colm)
                            # tokens to the left of the row word
                            if index < window_size:
                                count = count + essay_split[: index].count(colm)
                            else:
                                count = count + essay_split[(index - window_size): index].count(colm)
                co_occur.iloc[nrow, ncolm] = count
    return co_occur
Then I used the code below to test it:
corpus = ['ABC DEF IJK PQR','PQR KLM OPQ','LMN PQR XYZ ABC DEF PQR ABC']
words = ['ABC','PQR','DEF']
window_size =2
result = co_occurance_matrix(corpus,words,window_size)
result
Answered by a.k
With numpy, where the corpus is a list of lists (each list being a tokenized document):
corpus = [['<START>', 'All', 'that', 'glitters', "isn't", 'gold', '<END>'],
['<START>', "All's", 'well', 'that', 'ends', 'well', '<END>']]
and the function returns the matrix together with a word -> row/col mapping:
import numpy as np

def compute_co_occurrence_matrix(corpus, window_size):
    # sorted list of the distinct words across all documents
    words = sorted(list(set([word for words_list in corpus for word in words_list])))
    num_words = len(words)
    M = np.zeros((num_words, num_words))
    word2Ind = dict(zip(words, range(num_words)))
    for doc in corpus:
        cur_idx = 0
        doc_len = len(doc)
        while cur_idx < doc_len:
            # window of up to window_size words on each side of the focus word
            left = max(cur_idx-window_size, 0)
            right = min(cur_idx+window_size+1, doc_len)
            words_to_add = doc[left:cur_idx] + doc[cur_idx+1:right]
            focus_word = doc[cur_idx]
            for word in words_to_add:
                outside_idx = word2Ind[word]
                M[outside_idx, word2Ind[focus_word]] += 1
            cur_idx += 1
    return M, word2Ind
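For completeness, here is a sketch of how the function could be called on the corpus shown earlier, using word2Ind to look up single entries. The function body is a condensed restatement of the one above, repeated only so the snippet runs on its own; window_size=1 and the name start_all are choices made for this illustration:

```python
import numpy as np

def compute_co_occurrence_matrix(corpus, window_size):
    # Condensed restatement of the function above, so this snippet is self-contained
    words = sorted({word for doc in corpus for word in doc})
    word2Ind = {word: i for i, word in enumerate(words)}
    M = np.zeros((len(words), len(words)))
    for doc in corpus:
        for idx, focus_word in enumerate(doc):
            left = max(idx - window_size, 0)
            context = doc[left:idx] + doc[idx + 1:idx + window_size + 1]
            for word in context:
                M[word2Ind[word], word2Ind[focus_word]] += 1
    return M, word2Ind

corpus = [['<START>', 'All', 'that', 'glitters', "isn't", 'gold', '<END>'],
          ['<START>', "All's", 'well', 'that', 'ends', 'well', '<END>']]

M, word2Ind = compute_co_occurrence_matrix(corpus, window_size=1)

# Look up how often two words fall within one position of each other
start_all = M[word2Ind['<START>'], word2Ind['All']]
print(M.shape, start_all)
```

Because each pair is counted once from either side of the window, the resulting matrix is symmetric.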