string: Algorithm to find articles with similar text

Question

Asked by Osama Al-Maadeed

I have many articles in a database (each with a title and text), and I'm looking for an algorithm to find the X most similar articles, something like Stack Overflow's "Related Questions" list that appears when you ask a question.

I tried googling for this but only found pages about other "similar text" problems, such as comparing every article with all the others and storing a similarity score somewhere. SO does this in "real time" on text that I have just typed.

How?

Answer 1

Accepted answer by Jay Kominek

Edit distance isn't a likely candidate, as it would be spelling- and word-order-dependent, and much more computationally expensive than Will is leading you to believe, considering the size and number of the documents you'd actually be interested in searching.

Something like Lucene is the way to go. You index all your documents, and then when you want to find documents similar to a given document, you turn that document into a query and search the index. Internally, Lucene uses tf-idf and an inverted index, so the whole process takes an amount of time proportional to the number of documents that could possibly match, not the total number of documents in the collection.
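
To make the tf-idf plus inverted index idea concrete, here is a minimal sketch in Python. It is not Lucene's API or its actual scoring formula; the class name TinyIndex, the tokenizer, and the idf smoothing are all assumptions made for illustration. The point it shows is that only documents sharing at least one term with the query are ever scored.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index with tf-idf scoring (hypothetical, not Lucene)."""

    def __init__(self):
        # term -> {doc_id: term frequency in that document}
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, text):
        """Index one document under the given id."""
        self.doc_count += 1
        for term, tf in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = tf

    def similar(self, text, top_n=3):
        """Treat `text` as a query and return the best-scoring (doc_id, score) pairs."""
        scores = defaultdict(float)
        for term, query_tf in Counter(tokenize(text)).items():
            docs = self.postings.get(term)
            if not docs:
                continue  # term unknown to the index: no document is touched
            # Assumed smoothing: rarer terms get a larger weight.
            idf = math.log(1 + self.doc_count / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += query_tf * tf * idf * idf
        # Highest score first; ties broken by doc id for determinism.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


index = TinyIndex()
index.add("a", "Levenshtein edit distance between two strings")
index.add("b", "How to compute edit distance in Python")
index.add("c", "Baking sourdough bread at home")
print(index.similar("edit distance for strings", top_n=2))
```

In this toy run, document "c" shares no term with the query, so it never enters the scoring loop at all, which is the property the answer attributes to the inverted index.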

Answer 2

Answered by Will

It depends upon your definition of similar.

The edit-distance algorithm is the standard algorithm for (Latin-language) dictionary suggestions, and can work on whole texts. Two texts are similar if they have basically the same words (or rather, letters) in the same order. So the following two book reviews would be fairly similar:

1) "This is a great book"

2) "These are not great books"

(The number of letters to insert, delete or alter to turn (2) into (1) is termed the 'edit distance'.)
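
As a rough illustration of the definition above, the following is a sketch of the classic dynamic-programming (Levenshtein) edit distance in Python. The function name edit_distance is chosen here for illustration; the answer does not prescribe a particular implementation.

```python
def edit_distance(a, b):
    """Return the minimum number of single-character insertions,
    deletions, or substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string costs i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("This is a great book", "These are not great books"))
```

The running time is proportional to the product of the two string lengths, which is why comparing one long article against every other article quickly becomes expensive.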

To implement this you would want to visit every review programmatically. This is perhaps not as costly as it sounds, and if it is too costly you could do the comparisons as a background task and store the n most similar reviews in a database field.

Another approach is to understand something of the structure of (Latin) languages. If you strip short (non-capitalised or quoted) words, and assign weights to words (or prefixes) that are common or unique, you can do a Bayesian-style comparison. The following two book reviews might be simplified and found to be similar:

3) "The french revolution was blah blah War and Peace blah blah France." -> France/French(2) Revolution(1) War(1) Peace(1) (note that a dictionary has been used to combine France and French)

4) "This book is blah blah a revolution in french cuisine." -> France(1) Revolution(1)

To implement this you would want to identify the 'keywords' in a review when it is created or updated. To find similar reviews, use these keywords in the WHERE clause of a query (ideally 'full text' searching if the database supports it), perhaps with post-processing of the result set to score the candidates found.
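
The keyword step described above could be sketched as follows. Everything specific here is an assumption for illustration: the CANONICAL table stands in for the "dictionary" that merges France and French, MIN_LENGTH is a guessed cutoff for "short" words, and STOPWORDS is a tiny hand-picked list. A real system would need much larger tables or a stemmer.

```python
import re
from collections import Counter

# Hypothetical lookup standing in for the dictionary mentioned in the answer.
CANONICAL = {"french": "france"}
# Hypothetical stop list; filler words that carry no meaning in the examples.
STOPWORDS = {"the", "this", "blah", "book"}
MIN_LENGTH = 4  # assumed cutoff: shorter uncapitalised words are stripped


def keywords(text):
    """Count the keywords of a text: drop stop words and short uncapitalised
    words, then fold variants together via CANONICAL."""
    counts = Counter()
    for word in re.findall(r"[A-Za-z]+", text):
        lowered = word.lower()
        if lowered in STOPWORDS:
            continue
        if len(word) < MIN_LENGTH and not word[0].isupper():
            continue
        counts[CANONICAL.get(lowered, lowered)] += 1
    return counts


def shared_keywords(a, b):
    """Keywords present in both texts, with the smaller of the two counts."""
    return keywords(a) & keywords(b)


review_3 = "The french revolution was blah blah War and Peace blah blah France."
review_4 = "This book is blah blah a revolution in french cuisine."
print(keywords(review_3))
print(shared_keywords(review_3, review_4))
```

With these assumed tables, review 3 reduces to france (2), revolution (1), war (1) and peace (1), and the two reviews share france and revolution. The stored keywords could then feed the query described above, with the overlap used to score candidates.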

Books also have categories: are thrillers set in France similar to historical studies of France, and so on? Metadata beyond title and text might be useful for keeping results relevant.

Answer 3

Answered by alex77

The tutorial at this link sounds like it may be what you need. It is easy to follow and works very well.

His algorithm rewards both common substrings and a common ordering of those substrings and so should pick out similar titles quite nicely.

Answer 4

Answered by Guido

I suggest indexing your articles using Apache Lucene, a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Once your articles are indexed, you can easily find related ones.

Answer 5

Answered by mempko

One common algorithm used is the Self-Organizing Map. It is a type of neural network that will automatically categorize your articles. You can then find the location of the current article in the map, and all articles near it are related. The important part of the algorithm is how you vector quantize your input. There are several ways to do this with text: you can hash your document or title, you can count words and use the counts as an n-dimensional vector, and so on. Hope that helps, although I may have opened up a Pandora's box for you of an endless journey in AI.
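
The "count words and use that as an n-dimensional vector" step can be sketched on its own. Note that this is only the input encoding, not a Self-Organizing Map: the cosine similarity used below to compare two vectors is a swapped-in, simpler technique chosen for illustration, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Encode a text as a sparse word-count vector (word -> count)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors, from 0.0 to 1.0."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


first = word_counts("Algorithm to find similar articles")
second = word_counts("Find articles with similar text")
print(cosine_similarity(first, second))
```

A map-based approach would train on many such vectors and place each article at a grid location; the pairwise comparison here just shows what "near" can mean for word-count vectors.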

Answer 6

Answered by Treb

SO does the comparison only on the title, not on the body text of the question, so only on rather short strings.

You can use their algorithm (no idea what it looks like) on the article title and the keywords. If you have more CPU time to burn, you can also run it on the abstracts of your articles.

Answer 7

Answered by b w

Seconding the Lucene suggestion for full-text search, but note that Java is not a requirement; a .NET port is available. Also see the main Lucene page for links to other projects, including Lucy, a C port.

Answer 8

Answered by Vinnie

Maybe what you're looking for is something that does paraphrasing. I only have cursory knowledge of this, but paraphrasing is a natural language processing concept for determining whether two passages of text actually mean the same thing, even though they may use entirely different words.

Unfortunately, I don't know of any tools that allow you to do this (although I'd be interested in finding one).

Answer 9

Answered by spacemonkeys

If you are looking for words that sound alike, you could convert the words to Soundex codes and match on those. It worked for me.
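
For reference, here is a sketch of the American Soundex encoding in Python. The answer does not say which Soundex variant or library was used, so this follows the common rules (keep the first letter, encode consonants as digits, collapse runs of the same digit, ignore h and w, pad to four characters) as an assumption.

```python
# Consonant groups and the digit each group maps to.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex code of a word, or "" if it has no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = CODES.get(first, "")  # the first letter also suppresses a repeat
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(letter, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # vowels reset this, so a repeat after a vowel is kept
    return (first.upper() + "".join(digits) + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Words with the same code can then be treated as matches, for example by storing the code in an indexed column next to each title word.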

Answer 10

Answered by Mitchel Sellers

You can use a SQL Server full-text index to get the smart comparison. I believe that SO is using an Ajax call that runs a query to return the similar questions.

What technologies are you using?
