Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me), citing the original StackOverflow address: http://stackoverflow.com/questions/5042873/


Javascript text similarity algorithm

Tags: javascript, algorithm, text, similarity

Asked by Karington

I'm building a website that collects various news feeds, and I would like to compare the texts for similarity. What I need is some sort of news text similarity algorithm. I know that PHP has the similar_text function, but I'm not sure how good it is, and I need it for JavaScript. Could anyone point me to an example, a plugin, or any instructions on how to do this, or at least where to start investigating?

Answered by Flexo

There's a JavaScript implementation of the Levenshtein distance metric, which is often used for text comparisons. If you want to compare whole articles or headlines, though, you might be better off looking at intersections between the sets of words that make up the texts (and the frequencies of those words) rather than just string similarity measures.
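As a rough illustration of the word-set idea, here is a minimal sketch that scores two texts by the overlap of their distinct words (the Jaccard index). The names `tokenize` and `jaccardSimilarity` are made up for this example, the tokenizer is deliberately naive (lowercase ASCII letters and digits only), and word frequencies are ignored, so treat it as a starting point rather than a finished solution.

```javascript
// Split text into lowercase word tokens (naive: ASCII letters and digits only).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Jaccard similarity: |A ∩ B| / |A ∪ B| over the sets of distinct words.
function jaccardSimilarity(textA, textB) {
  const a = new Set(tokenize(textA));
  const b = new Set(tokenize(textB));
  // Assumption: two texts without any words are treated as identical.
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  for (const word of a) {
    if (b.has(word)) shared++;
  }
  return shared / (a.size + b.size - shared);
}

// Four of the six distinct words are shared, so this logs roughly 0.667.
console.log(jaccardSimilarity("Stocks rise as markets rally",
                              "Markets rally as stocks climb"));
```

For real news text you would probably want to drop very common words first, since otherwise they dominate the overlap.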

Answered by philonous

Whether two texts are similar is a philosophical question as long as you don't specify exactly what "similar" should mean. Consider the strings "house" and "mouse". On a semantic level they are not very similar, but they are very similar in their "physical appearance", because only one letter differs (and in this case you could go by Levenshtein distance).
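To make the "house" / "mouse" case concrete, here is a small sketch of the Levenshtein distance using the standard dynamic-programming recurrence with two rows. The function names `levenshtein` and `levenshteinSimilarity` are just for this example, and the normalization to a 0..1 score is one common convention, not the only one. Note that it compares UTF-16 code units, so characters outside the Basic Multilingual Plane count as two.

```javascript
// Minimum number of single-character insertions, deletions and
// substitutions needed to turn string a into string b.
function levenshtein(a, b) {
  // prev[j] = distance between the first i-1 chars of a and the first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // turning i chars of a into the empty string takes i deletions
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // delete a[i-1]
        curr[j - 1] + 1,    // insert b[j-1]
        prev[j - 1] + cost  // substitute (or keep if equal)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumption: normalize by the longer string's length to get a 0..1 score.
function levenshteinSimilarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

console.log(levenshtein("house", "mouse")); // 1
```

The cost is proportional to the product of the two lengths, which is fine for headlines but can get slow when comparing many full articles pairwise.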

To decide on similarity you need an appropriate text representation. You could, for instance, extract and count all n-grams and compare the two resulting frequency vectors using a similarity measure such as cosine similarity. Or you could remove all stopwords, stem the remaining words to their root form, count their occurrences, and use these counts as input for a similarity measure.
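The n-gram variant could look roughly like the sketch below: count overlapping character n-grams in each text, then compare the two count vectors with cosine similarity. The names `ngramCounts` and `cosineSimilarity`, the default of n = 3, and the choice of character (rather than word) n-grams are all assumptions made for illustration. The same `cosineSimilarity` function would also work on counts of stemmed words, but stemming and stopword lists are left out here.

```javascript
// Count overlapping character n-grams,
// e.g. "house" with n = 3 gives "hou", "ous", "use".
function ngramCounts(text, n = 3) {
  const counts = new Map();
  const s = text.toLowerCase();
  for (let i = 0; i + n <= s.length; i++) {
    const gram = s.slice(i, i + n);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// Cosine similarity between two sparse count vectors stored as Maps:
// dot(a, b) / (|a| * |b|). Returns 0 if either vector is empty.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [gram, count] of a) {
    normA += count * count;
    if (b.has(gram)) dot += count * b.get(gram);
  }
  for (const count of b.values()) normB += count * count;
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// "house" and "mouse" share two of their three trigrams ("ous", "use").
console.log(cosineSimilarity(ngramCounts("house"), ngramCounts("mouse")));
```

Because counts are never negative, the score stays between 0 (no n-grams in common) and 1 (same n-gram proportions).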

There are plenty of approaches and papers on this topic, e.g. this one about short texts. In any case: the higher the abstraction level at which you want to decide whether two texts are similar, the more difficult it gets. I think your question is a non-trivial one (and hence my answer is rather abstract) ... ;-)
