Original question: http://stackoverflow.com/questions/30829382/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Check the similarity between two words with NLTK in Python
Asked by Punuth
I have two lists, and I want to check the similarity between each pair of words across the two lists and find the maximum similarity. Here is my code:
from nltk.corpus import wordnet
list1 = ['Compare', 'require']
list2 = ['choose', 'copy', 'define', 'duplicate', 'find', 'how', 'identify', 'label', 'list', 'listen', 'locate', 'match', 'memorise', 'name', 'observe', 'omit', 'quote', 'read', 'recall', 'recite', 'recognise', 'record', 'relate', 'remember', 'repeat', 'reproduce', 'retell', 'select', 'show', 'spell', 'state', 'tell', 'trace', 'write']
list = []
for word1 in list1:
    for word2 in list2:
        wordFromList1 = wordnet.synsets(word1)[0]
        wordFromList2 = wordnet.synsets(word2)[0]
        s = wordFromList1.wup_similarity(wordFromList2)
        list.append(s)
print(max(list))
But this results in an error:
wordFromList2 = wordnet.synsets(word2)[0]
IndexError: list index out of range
Please help me fix this. Thank you.
Answered by omerbp
Try checking whether the synset lists are empty before you use them:
from nltk.corpus import wordnet
list1 = ['Compare', 'require']
list2 = ['choose', 'copy', 'define', 'duplicate', 'find', 'how', 'identify', 'label', 'list', 'listen', 'locate', 'match', 'memorise', 'name', 'observe', 'omit', 'quote', 'read', 'recall', 'recite', 'recognise', 'record', 'relate', 'remember', 'repeat', 'reproduce', 'retell', 'select', 'show', 'spell', 'state', 'tell', 'trace', 'write']
similarities = []  # renamed from `list` to avoid shadowing the built-in
for word1 in list1:
    for word2 in list2:
        wordFromList1 = wordnet.synsets(word1)
        wordFromList2 = wordnet.synsets(word2)
        # An empty list is falsy, so words without synsets are skipped (thanks to @alexis' note)
        if wordFromList1 and wordFromList2:
            s = wordFromList1[0].wup_similarity(wordFromList2[0])
            # wup_similarity returns None when no connecting path is found
            if s is not None:
                similarities.append(s)
print(max(similarities))
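The guard in the code above works because an empty list is falsy in Python, so indexing with [0] is only attempted when the lookup returned something. Here is a minimal sketch of that pattern that runs without WordNet installed. The FAKE_SYNSETS table and the lookup() and first_sense() helpers are invented for illustration and are not part of NLTK.

```python
# Toy stand-in for wordnet.synsets(): maps a word to a list of "senses".
# The table contents are made up for illustration only.
FAKE_SYNSETS = {
    "compare": ["compare.v.01", "compare.v.02"],
    "choose": ["choose.v.01"],
    "xyzzy": [],  # a word the lookup knows nothing about
}


def lookup(word):
    """Return the list of senses for word, or an empty list if unknown."""
    return FAKE_SYNSETS.get(word.lower(), [])


def first_sense(word):
    """Return the first sense, or None when the lookup comes back empty."""
    senses = lookup(word)
    # An empty list is falsy, so this avoids senses[0] raising IndexError.
    return senses[0] if senses else None


print(first_sense("Compare"))  # compare.v.01
print(first_sense("xyzzy"))    # None
```

Calling lookup("xyzzy")[0] directly would raise the same IndexError: list index out of range that appears in the question.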
Answered by alexis
You get an error when a synset list is empty and you try to get the element at the (non-existent) index zero. But why check only the zeroth element? If you want to check everything, try all pairs of elements in the returned synsets. You can use itertools.product() to save yourself two nested for-loops:
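from itertools import product
from nltk.corpus import wordnet
sims = []
for word1, word2 in product(list1, list2):  # list1 and list2 as defined in the question
    syns1 = wordnet.synsets(word1)
    syns2 = wordnet.synsets(word2)
    # Compare every sense of word1 with every sense of word2
    for sense1, sense2 in product(syns1, syns2):
        d = wordnet.wup_similarity(sense1, sense2)
        sims.append((d, sense1, sense2))  # record the pair of senses that produced d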
This is inefficient because the same synsets are looked up again and again, but it is the closest to the logic of your code. If you have enough data to make speed an issue, you can speed it up by collecting the synsets for all words in list1 and list2 once, and then taking the product of the two synset collections.
>>> allsyns1 = set(ss for word in list1 for ss in wordnet.synsets(word))
>>> allsyns2 = set(ss for word in list2 for ss in wordnet.synsets(word))
>>> best = max((wordnet.wup_similarity(s1, s2) or 0, s1, s2) for s1, s2 in
...            product(allsyns1, allsyns2))
>>> print(best)
(0.9411764705882353, Synset('command.v.02'), Synset('order.v.01'))
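Two details of the max() call above are easy to miss. Tuples compare element by element, so putting the score first makes max() pick the highest-scoring pair. The `or 0` replaces a None score with 0, since comparing None with a float raises TypeError in Python 3. The sketch below shows both points without needing WordNet. The SCORES table and toy_similarity() are hypothetical stand-ins for wordnet.wup_similarity(), and the sense names are invented.

```python
from itertools import product

# Hypothetical scores standing in for wordnet.wup_similarity().
# None marks a pair for which no score is available.
SCORES = {
    ("a1", "b1"): 0.5,
    ("a1", "b2"): None,
    ("a2", "b1"): 0.9,
    ("a2", "b2"): 0.25,
}


def toy_similarity(s1, s2):
    """Return the made-up score for a pair of senses (may be None)."""
    return SCORES[(s1, s2)]


senses1 = ["a1", "a2"]
senses2 = ["b1", "b2"]

# Score first, so tuple comparison ranks by score; `or 0` turns None into 0.
best = max(
    (toy_similarity(s1, s2) or 0, s1, s2)
    for s1, s2 in product(senses1, senses2)
)
print(best)  # (0.9, 'a2', 'b1')
```

One consequence of `or 0` is that a pair with no score is indistinguishable from a pair that scores exactly 0. If that distinction matters, filtering out None before calling max() may be a better fit.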