MySQL 如何找到相似的结果并按相似度排序?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/3338889/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-31 16:40:50  来源:igfitidea点击:

How to find similar results and sort by similarity?

mysqlsqlstringsortingsimilarity

提问by TheFlash

How do I query for records ordered by similarity?

如何查询按相似度排序的记录?

Eg. searching for "Stock Overflow" would return

例如。搜索“库存溢出”将返回

  1. Stack Overflow
  2. SharePoint Overflow
  3. Math Overflow
  4. Politic Overflow
  5. VFX Overflow
  1. 堆栈溢出
  2. SharePoint 溢出
  3. 数学溢出
  4. 溢出
  5. 视觉特效溢出

Eg. searching for "LO" would return:

例如。搜索“LO”将返回:

  1. pabLO picasso
  2. michelangeLO
  3. Hymanson polLOck
  1. 巴勃罗毕加索
  2. 米开朗基罗
  3. Hyman逊波洛克


What I need help with:

我需要什么帮助:

  1. Using a search engine to index & search a MySQL table, for better results

    • Using the Sphinxsearch engine, with PHP

    • Using the Luceneengine with PHP

  2. Using full-text indexing, to find similar/containing strings

  1. 使用搜索引擎索引和搜索 MySQL 表,以获得更好的结果

    • 使用Sphinx搜索引擎,搭配 PHP

    • 在PHP 中使用Lucene引擎

  2. 使用全文索引,查找相似/包含的字符串



What does not work well

什么不工作

  • Levenshtein distanceis very erratic. (UDF, Query)
    Searching for "dog" gives me:
    1. dog
    2. bog
    3. ago
    4. big
    5. echo
  • LIKEreturns better results, but returns nothing for long queries although similar strings do exist
    1. dog
    2. dogid
    3. dogaral
    4. dogma
  • Levenshtein 距离非常不稳定。( UDF, Query)
    搜索“狗”给了我:
    1. 沼泽
    2. 回声
  • LIKE返回更好的结果,但对于长查询不返回任何内容,尽管确实存在类似的字符串
    1. 多加拉尔
    2. 教条

回答by Yanick Rochon

I have found out that the Levenshtein distance may be good when you are searching a full string against another full string, but when you are looking for keywords within a string, this method does not return (sometimes) the wanted results. Moreover, the SOUNDEX function is not suitable for languages other than english, so it is quite limited. You could get away with LIKE, but it's really for basic searches. You may want to look into other search methods for what you want to achieve. For example:

我发现当您针对另一个完整字符串搜索完整字符串时,Levenshtein 距离可能很好,但是当您在字符串中查找关键字时,此方法不会返回(有时)想要的结果。而且,SOUNDEX 功能不适用于英语以外的语言,因此非常有限。您可以使用 LIKE,但它确实适用于基本搜索。您可能想要查看其他搜索方法以了解您想要实现的目标。例如:

You may use Luceneas search base for your projects. It's implemented in most major programming languages and it'd quite fast and versatile. This method is probably the best, as it not only search for substrings, but also letter transposition, prefixes and suffixes (all combined). However, you need to keep a separate index (using CRON to update it from a independent script once in a while works though).

您可以使用Lucene作为您项目的搜索基础。它以大多数主要的编程语言实现,而且速度非常快且用途广泛。这种方法可能是最好的,因为它不仅搜索子字符串,还搜索字母换位、前缀和后缀(全部组合)。但是,您需要保留一个单独的索引(尽管偶尔使用 CRON 从独立脚本更新它)。

Or, if you want a MySQL solution, the fulltext functionality is pretty good, and certainly faster than a stored procedure. If your tables are not MyISAM, you can create a temporary table, then perform your fulltext search :

或者,如果您想要一个 MySQL 解决方案,全文功能非常好,而且肯定比存储过程快。如果您的表不是 MyISAM,您可以创建一个临时表,然后执行全文搜索:

CREATE TABLE IF NOT EXISTS `tests`.`data_table` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(2000) CHARACTER SET latin1 NOT NULL,
  `description` text CHARACTER SET latin1 NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1 ;

Use a data generatorto generate some random data if you don't want to bother creating it yourself...

如果您不想自己创建,请使用数据生成器生成一些随机数据...

** NOTE** : the column type should be latin1_binto perform a case sensitive search instead of case insensitive with latin1. For unicode strings, I would recommend utf8_binfor case sensitive and utf8_general_cifor case insensitive searches.

**注意**:列类型应该是latin1_bin执行区分大小写的搜索,而不是不区分大小写的latin1。对于 unicode 字符串,我建议utf8_bin进行区分大小写和utf8_general_ci不区分大小写的搜索。

DROP TABLE IF EXISTS `tests`.`data_table_temp`;
CREATE TEMPORARY TABLE `tests`.`data_table_temp`
   SELECT * FROM `tests`.`data_table`;

ALTER TABLE `tests`.`data_table_temp`  ENGINE = MYISAM;

ALTER TABLE `tests`.`data_table_temp` ADD FULLTEXT `FTK_title_description` (
  `title` ,
  `description`
);

SELECT *,
       MATCH (`title`,`description`)
       AGAINST ('+so* +nullam lorem' IN BOOLEAN MODE) as `score`
  FROM `tests`.`data_table_temp`
 WHERE MATCH (`title`,`description`)
       AGAINST ('+so* +nullam lorem' IN BOOLEAN MODE)
 ORDER BY `score` DESC;

DROP TABLE `tests`.`data_table_temp`;

Read more about it from the MySQL API reference page

MySQL API 参考页面阅读更多相关信息

The downside to this is that it will not look for letter transposition or "similar, sounds like" words.

这样做的缺点是它不会寻找字母换位或“相似,听起来像”的词。

** UPDATE**

**更新**

Using Lucene for your search, you will simply need to create a cron job (all web hosts have this "feature") where this job will simply execute a PHP script (i.g. "cd /path/to/script; php searchindexer.php") that will update the indexes. The reason being that indexing thousands of "documents" (rows, data, etc.) may take several seconds, even minutes, but this is to ensure that all searches are performed as fast as possible. Therefore, you may want to create a delay job to be run by the server. It may be overnight, or in the next hour, this is up to you. The PHP script should look something like this:

使用 Lucene 进行搜索,您只需要创建一个 cron 作业(所有网络主机都有这个“功能”),这个作业将简单地执行一个 PHP 脚本(ig“cd /path/to/script; php searchindexer.php” ) 将更新索引。原因是索引数千个“文档”(行、数据等)可能需要几秒钟甚至几分钟,但这是为了确保尽可能快地执行所有搜索。因此,您可能希望创建一个由服务器运行的延迟作业。可能是一夜之间,也可能是下一小时,这取决于您。PHP 脚本应如下所示:

$indexer = Zend_Search_Lucene::create('/path/to/lucene/data');

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
  // change this option for your need
  new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
);

$rowSet = getDataRowSet();  // perform your SQL query to fetch whatever you need to index
foreach ($rowSet as $row) {
   $doc = new Zend_Search_Lucene_Document();
   $doc->addField(Zend_Search_Lucene_Field::text('field1', $row->field1, 'utf-8'))
       ->addField(Zend_Search_Lucene_Field::text('field2', $row->field2, 'utf-8'))
       ->addField(Zend_Search_Lucene_Field::unIndexed('someValue', $someVariable))
       ->addField(Zend_Search_Lucene_Field::unIndexed('someObj', serialize($obj), 'utf-8'))
  ;
  $indexer->addDocument($doc);
}

// ... you can get as many $rowSet as you want and create as many documents
// as you wish... each document doesn't necessarily need the same fields...
// Lucene is pretty flexible on this

$indexer->optimize();  // do this every time you add more data to you indexer...
$indexer->commit();    // finalize the process

Then, this is basically how you search (basic search) :

然后,这基本上就是您搜索的方式(基本搜索):

$index = Zend_Search_Lucene::open('/path/to/lucene/data');

// same search options
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
   new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
);

Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');

$query = 'php +field1:foo';  // search for the word 'php' in any field,
                                 // +search for 'foo' in field 'field1'

$hits = $index->find($query);

$numHits = count($hits);
foreach ($hits as $hit) {
   $score = $hit->score;  // the hit weight
   $field1 = $hit->field1;
   // etc.
}

Here are great sites about Lucene in Java, PHP, and .Net.

这里有关于 Lucene in JavaPHP.Net 的很棒的站点。

In conclusioneach search methods have their own pros and cons :

总之,每种搜索方法都有其优缺点:

  • You mentioned Sphinx searchand it looks very good, as long as you can make the deamon run on your web host.
  • Zend Lucene requires a cron job to re-index the database. While it is quite transparent to the user, this means that any new data (or deleted data!) is not always in sync with the data in your database and therefore won't show up right away on user search.
  • MySQL FULLTEXT search is good and fast, but will not give you all the power and flexibility of the first two.
  • 你提到了Sphinx 搜索,它看起来非常好,只要你能让守护进程在你的 web 主机上运行。
  • Zend Lucene 需要一个 cron 作业来重新索引数据库。虽然它对用户非常透明,但这意味着任何新数据(或删除的数据!)并不总是与数据库中的数据同步,因此不会立即显示在用户搜索中。
  • MySQL FULLTEXT 搜索既好又快,但不会为您提供前两者的所有功能和灵活性。

Please feel free to comment if I have forgotten/missed anything.

如果我忘记/遗漏了什么,请随时发表评论。

回答by opatut

1. Similarity

1. 相似性

For Levenshtein in MySQL I found this, from www.codejanitor.com/wp/2007/02/10/levenshtein-distance-as-a-mysql-stored-function

对于 MySQL 中的 Levenshtein 我发现了这个,来自www.codejanitor.com/wp/2007/02/10/levenshtein-distance-as-a-mysql-stored-function

SELECT 
    column, 
    LEVENSHTEIN(column, 'search_string') AS distance 
FROM table 
WHERE 
    LEVENSHTEIN(column, 'search_string') < distance_limit
ORDER BY distance DESC

2. Containing, case insensitive

2. 包含,不区分大小写

Use the LIKEstatement of MySQL, which is case insensitive by default. The %is a wildcard, so there may be any string before and after search_string.

使用LIKEMySQL的语句,默认不区分大小写。The%是通配符,因此 前后可能有任何字符串search_string

SELECT 
    *
FROM 
    table
WHERE 
    column_name LIKE "%search_string%"

3. Containing, case sensitive

3.包含,区分大小写

The MySQL Manualhelps:

MySQL手册帮助:

The default character set and collation are latin1 and latin1_swedish_ci, so nonbinary string comparisons are case insensitive by default. This means that if you search with col_name LIKE 'a%', you get all column values that start with A or a. To make this search case sensitive, make sure that one of the operands has a case sensitive or binary collation. For example, if you are comparing a column and a string that both have the latin1 character set, you can use the COLLATE operator to cause either operand to have the latin1_general_cs or latin1_bin collation...

默认字符集和排序规则是 latin1 和 latin1_swedish_ci,因此默认情况下非二进制字符串比较不区分大小写。这意味着如果您使用 col_name LIKE 'a%' 进行搜索,您将获得所有以 A 或 a 开头的列值。要使此搜索区分大小写,请确保其中一个操作数具有区分大小写或二进制排序规则。例如,如果您要比较都具有 latin1 字符集的列和字符串,则可以使用 COLLATE 运算符使任一操作数具有 latin1_general_cs 或 latin1_bin 排序规则...

My MySQL setup does not support latin1_general_csor latin1_bin, but it worked fine for me to use the collation utf8_binas binary utf8 is case sensitive:

我的 MySQL 设置不支持latin1_general_csor latin1_bin,但它对我来说很好用,utf8_bin因为二进制 utf8 区分大小写:

SELECT 
    *
FROM 
    table
WHERE 
    column_name LIKE "%search_string%" COLLATE utf8_bin

2. / 3. sorted by Levenshtein Distance

2. / 3. 按 Levenshtein 距离排序

SELECT 
    column, 
    LEVENSHTEIN(column, 'search_string') AS distance // for sorting
FROM table 
WHERE 
    column_name LIKE "%search_string%"
    COLLATE utf8_bin // for case sensitivity, just leave out for CI
ORDER BY
    distance
    DESC

回答by DaL

It seems that your definition of similarity is semantic similarity. So in order to build such a similarity function, you should use semantic similarity measures. Note that the scope of work on the issue might vary from few hours to years so it is recommended to decide on the scope before getting into work. I didn't figure out which data do you have in order to build the similarity relation. I assume the you have access the a dataset of documents and a dataset of queries. You can start with co-occurrence of the words (e.g., conditional probability). You will discover quickly that you get the list of stop wordsas related the most of the words simply because they are very popular. Using the lift of conditional probability will take care of the stop words but will make the relation prone to error in small number (most of your cases). You might try Jacardbut since it is symmetric there will be many relations it won't find. Then you might consider relations that appear only in short distance from the base word. You can (and should) consider relations base on general corpus's (e.g., Wikipedia) and user specific (e.g., his emails).

看来您对相似性的定义是语义相似性。因此,为了构建这样的相似度函数,您应该使用语义相似度度量。请注意,该问题的工作范围可能从几个小时到几年不等,因此建议在开始工作之前确定范围。我没有弄清楚你有哪些数据来建立相似性关系。我假设您可以访问文档数据集和查询数据集。您可以从单词的共现开始(例如,条件概率)。你很快就会发现你得到了停用词列表与大多数词相关的原因仅仅是因为它们非常流行。使用条件概率的提升将处理停用词,但会使关系在少数情况下容易出错(大多数情况下)。您可能会尝试Jacard,但由于它是对称的,因此它将找不到许多关系。然后,您可能会考虑仅出现在距基本词短距离内的关系。您可以(并且应该)考虑基于通用语料库(例如,维基百科)和特定于用户(例如,他的电子邮件)的关系。

Very shortly you will have plenty of similarity measures, when all the measures are good and have some advantage over the others.

很快,当所有度量都很好并且比其他度量具有一些优势时,您将拥有大量相似性度量。

In order to combine such measures, I like to reduce the problem into a classification problem.

为了结合这些措施,我喜欢将问题简化为分类问题。

You should build a data set of paris of words and label them as "is related". In order to build a large labeled dataset you can:

您应该构建一个 paris of words 的数据集,并将它们标记为“相关”。为了构建大型标记数据集,您可以:

  • Use sources of known related words (e.g., good old Wikipedia categories) for positives
  • Most of the word not known as related are not related.
  • 使用已知相关词的来源(例如,古老的维基百科类别)进行正面评价
  • 大多数不被称为相关的词都是不相关的。

Then use all the measures you have as features of the pairs. Now you are in the domain of supervised classification problem. Build a classifier on the data set, evaluated according to your needs and get a similarity measure that fits your needs.

然后使用您拥有的所有度量作为对的特征。现在您处于监督分类问题的领域。在数据集上构建分类器,根据您的需求进行评估,并获得适合您需求的相似性度量。