Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/547542/

Date: 2020-08-31 12:47:28  Source: igfitidea

How can I manipulate MySQL fulltext search relevance to make one field more 'valuable' than another?

Tags: mysql, search, indexing, full-text-search, relevance

Asked by Buzz

Suppose I have two columns, keywords and content. I have a fulltext index across both. I want a row with foo in the keywords to have more relevance than a row with foo in the content. What do I need to do to cause MySQL to weight the matches in keywords higher than those in content?

I'm using the "match against" syntax.

SOLUTION:

Was able to make this work in the following manner:

SELECT *, 
CASE when Keywords like '%watermelon%' then 1 else 0 END as keywordmatch, 
CASE when Content like '%watermelon%' then 1 else 0 END as contentmatch,
MATCH (Title, Keywords, Content) AGAINST ('watermelon') AS relevance 
FROM about_data  
WHERE MATCH(Title, Keywords, Content) AGAINST ('watermelon' IN BOOLEAN MODE) 
HAVING relevance > 0  
ORDER by keywordmatch desc, contentmatch desc, relevance desc 

Accepted answer by notnot

Actually, using a case statement to make a pair of flags might be a better solution:

select
...
, case when keyword like concat('%', @input, '%') then 1 else 0 end as keywordmatch
, case when content like concat('%', @input, '%') then 1 else 0 end as contentmatch
-- or whatever check you use for the matching
-- (MySQL joins strings with CONCAT(); '+' is numeric addition there)
from
   ...
   and here the rest of your usual matching query
   ...
order by keywordmatch desc, contentmatch desc

Again, this is only if all keyword matches rank higher than all the content-only matches. I also made the assumption that a match in both keyword and content is the highest rank.
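
The flag ordering described above can be tried end to end with a small script. This is only a sketch under assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL (the CASE and LIKE parts behave the same way for this purpose), and the table layout, sample rows, and the ranked_ids helper are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; the table and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE about_data (id INTEGER PRIMARY KEY, keyword TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO about_data (id, keyword, content) VALUES (?, ?, ?)",
    [
        (1, "fruit", "a story about watermelon"),    # content-only match
        (2, "watermelon", "nothing relevant here"),  # keyword-only match
        (3, "watermelon", "watermelon recipes"),     # match in both columns
        (4, "apple", "no match at all"),             # no match for 'watermelon'
    ],
)


def ranked_ids(term):
    """Return matching ids: keyword matches first, then content-only matches."""
    rows = conn.execute(
        """
        SELECT id,
               CASE WHEN keyword LIKE :p THEN 1 ELSE 0 END AS keywordmatch,
               CASE WHEN content LIKE :p THEN 1 ELSE 0 END AS contentmatch
        FROM about_data
        WHERE keyword LIKE :p OR content LIKE :p
        ORDER BY keywordmatch DESC, contentmatch DESC, id
        """,
        {"p": f"%{term}%"},
    ).fetchall()
    return [row[0] for row in rows]


print(ranked_ids("watermelon"))
```

The row that matches in both columns sorts first, then the keyword-only row, then the content-only row, which is the ranking assumption stated in the answer.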

Answer by mintywalker

Create three full text indexes

  • a) one on the keyword column
  • b) one on the content column
  • c) one on both keyword and content column

Then, your query:

SELECT id, keyword, content,
  MATCH (keyword) AGAINST ('watermelon') AS rel1,
  MATCH (content) AGAINST ('watermelon') AS rel2
FROM table
WHERE MATCH (keyword,content) AGAINST ('watermelon')
ORDER BY (rel1*1.5)+(rel2) DESC

The point is that rel1 gives you the relevance of your query just in the keyword column (because you created the index only on that column). rel2 does the same, but for the content column. You can now add these two relevance scores together, applying any weighting you like.

However, you aren't using either of these two indexes for the actual search. For that, you use your third index, which is on both columns.

The index on (keyword,content) controls your recall, that is, which rows are returned.

The two separate indexes (one on keyword only, one on content only) control your relevance. And you can apply your own weighting criteria here.
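
The answer does not show how the three indexes are created, so here is a minimal sketch that generates the statements. The helper name fulltext_index_ddl, the index names (ft_keyword, ft_content, ft_all), and the table name search_table are assumptions for illustration; the generated statements use MySQL's ALTER TABLE ... ADD FULLTEXT INDEX syntax.

```python
def fulltext_index_ddl(table, columns):
    """Build one FULLTEXT index per column plus one index across all columns.

    Table and column names are interpolated as-is, so they must come from
    trusted code, never from user input.
    """
    statements = [
        f"ALTER TABLE `{table}` ADD FULLTEXT INDEX `ft_{col}` (`{col}`)"
        for col in columns
    ]
    combined = ", ".join(f"`{col}`" for col in columns)
    statements.append(
        f"ALTER TABLE `{table}` ADD FULLTEXT INDEX `ft_all` ({combined})"
    )
    return statements


# Hypothetical table name; prints the three statements to run against MySQL.
for stmt in fulltext_index_ddl("search_table", ["keyword", "content"]):
    print(stmt + ";")
```

The two single-column indexes back the rel1 and rel2 expressions, and the combined index backs the WHERE clause.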

Note that you can use any number of different indexes (or, vary the indexes and weightings you use at query time based on other factors perhaps ... only search on keyword if the query contains a stop word ... decrease the weighting bias for keywords if the query contains more than 3 words ... etc).
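
One way to vary the weighting at query time is to pick the weights in application code and pass them as parameters. The sketch below is an assumption-laden illustration: build_weighted_query, the 1.1 weight for longer queries, and the table name search_table are invented here, and the %s placeholders assume a MySQL driver that uses the "format" parameter style (as PyMySQL and mysqlclient do).

```python
def build_weighted_query(search_text, table="search_table"):
    """Return (sql, params) that weights keyword matches above content matches."""
    word_count = len(search_text.split())
    # Hypothetical policy: reduce the keyword bias when the query has more than 3 words.
    keyword_weight = 1.5 if word_count <= 3 else 1.1
    content_weight = 1.0
    sql = (
        "SELECT id, keyword, content, "
        "MATCH (keyword) AGAINST (%s) AS rel1, "
        "MATCH (content) AGAINST (%s) AS rel2 "
        f"FROM `{table}` "
        "WHERE MATCH (keyword, content) AGAINST (%s) "
        "ORDER BY (rel1 * %s) + (rel2 * %s) DESC"
    )
    params = (search_text, search_text, search_text, keyword_weight, content_weight)
    return sql, params


sql, params = build_weighted_query("red watermelon")
print(sql)
print(params)
```

With a driver cursor this would be run as cursor.execute(sql, params); the table name is interpolated directly and must therefore not come from user input.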

Each index does use up disk space, so more indexes, more disk. And in turn, higher memory footprint for mysql. Also, inserts will take longer, as you have more indexes to update.

You should benchmark performance for your situation (being careful to turn off the MySQL query cache while benchmarking, or your results will be skewed). This isn't Google-grade efficient, but it is pretty easy and "out of the box", and it's almost certainly a lot better than your use of "like" in the queries.

I find it works really well.

Answer by lubosdz

Simpler version using only 2 fulltext indexes (credit to @mintywalker):

SELECT id, 
   MATCH (`content_ft`) AGAINST ('keyword*' IN BOOLEAN MODE) AS relevance1,  
   MATCH (`title_ft`) AGAINST ('keyword*' IN BOOLEAN MODE) AS relevance2
FROM search_table
HAVING (relevance1 + relevance2) > 0
ORDER BY (relevance1 * 1.5) + (relevance2) DESC
LIMIT 0, 1000;

This searches both fulltext-indexed columns for the keyword and selects the relevance of each match into two separate columns. Rows with no match (relevance1 and relevance2 both zero) are excluded, and the results are ordered with extra weight on the content_ft column. No composite fulltext index is needed.

Answer by lubosdz

I did this a few years ago, but without the full text index. I don't have the code handy (former employer), but I remember the technique well.

In a nutshell, I selected a "weight" from each column. For example:

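-- LIKE '%foo%' stands in for whatever match test you use;
-- COALESCE keeps the sum from becoming NULL when only one column matches
select t.id,
       coalesce(a.keyword_relevance, 0) + coalesce(b.content_relevance, 0) as relevance
from table_name t
   left join
      (select id, 1 as keyword_relevance from table_name where keyword like '%foo%') a
   on t.id = a.id
   left join
      (select id, 0.75 as content_relevance from table_name where content like '%foo%') b
   on t.id = b.id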

Please forgive any shoddy SQL here; it's been a few years since I needed to write any, and I'm doing this off the top of my head...
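
Since the SQL above was written from memory, here is a runnable sketch of the same per-column weight idea. It is not the author's original code: it assumes SQLite (via Python's sqlite3) as a stand-in database, uses LIKE as the match test, wraps each weight in COALESCE so that a row matching only one column does not end up with a NULL sum, and the docs table and sample rows are invented.

```python
import sqlite3

# In-memory SQLite stands in for the real database; rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, keyword TEXT, content TEXT)")
conn.executemany(
    "INSERT INTO docs (id, keyword, content) VALUES (?, ?, ?)",
    [
        (1, "foo", "bar"),            # keyword match only
        (2, "bar", "some foo text"),  # content match only
        (3, "foo", "more foo"),       # both columns match
        (4, "bar", "baz"),            # no match
    ],
)


def weighted_relevance(term, keyword_weight=1.0, content_weight=0.75):
    """Return (id, relevance) pairs, summing a fixed weight per matching column."""
    return conn.execute(
        """
        SELECT d.id,
               COALESCE(a.keyword_relevance, 0) + COALESCE(b.content_relevance, 0)
                   AS relevance
        FROM docs d
        LEFT JOIN (SELECT id, :kw AS keyword_relevance
                   FROM docs WHERE keyword LIKE :p) a ON d.id = a.id
        LEFT JOIN (SELECT id, :cw AS content_relevance
                   FROM docs WHERE content LIKE :p) b ON d.id = b.id
        WHERE a.id IS NOT NULL OR b.id IS NOT NULL
        ORDER BY relevance DESC, d.id
        """,
        {"p": f"%{term}%", "kw": keyword_weight, "cw": content_weight},
    ).fetchall()


print(weighted_relevance("foo"))
```

The WHERE clause drops rows that matched neither subquery, which the original sketch would have returned with a NULL relevance.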

Hope this helps!

J.Js

Answer by Tom

In Boolean mode, MySQL supports the ">" and "<" operators to change a word's contribution to the relevance value that is assigned to a row.

I wonder if something like this would work?

SELECT *, 
MATCH (Keywords) AGAINST ('>watermelon' IN BOOLEAN MODE) AS relStrong, 
MATCH (Title,Keywords,Content) AGAINST ('<watermelon' IN BOOLEAN MODE) AS relWeak 
FROM about_data  
WHERE MATCH(Title, Keywords, Content) AGAINST ('watermelon' IN BOOLEAN MODE) 
ORDER by (relStrong+relWeak) desc

Answer by dasplann

I needed something similar and used the OP's solution, but I noticed that fulltext doesn't match partial words. So if 'watermelon' is in Keywords or Content as part of a word (like watermelonsalesmanager) it doesn't MATCH and is not included in the results because of the WHERE MATCH. So I fooled around a bit and tweaked the OP's query to this:

SELECT *, 
CASE WHEN Keywords LIKE '%watermelon%' THEN 1 ELSE 0 END AS keywordmatch, 
CASE WHEN Content LIKE '%watermelon%' THEN 1 ELSE 0 END AS contentmatch,
MATCH (Title, Keywords, Content) AGAINST ('watermelon') AS relevance 
FROM about_data  
WHERE (Keywords LIKE '%watermelon%' OR 
  Title LIKE '%watermelon%' OR 
  MATCH(Title, Keywords, Content) AGAINST ('watermelon' IN BOOLEAN MODE)) 
HAVING (keywordmatch > 0 OR contentmatch > 0 OR relevance > 0)  
ORDER BY keywordmatch DESC, contentmatch DESC, relevance DESC

Hope this helps.

Answer by Davide

Well, that depends on what exactly you mean by:

I want a row with foo in the keywords to have more relevance than a row with foo in the content.

If you mean that a row with foo in the keywords should come before any row with foo in the content, then I would run two separate queries: one for the keywords and then (possibly lazily, only if it's requested) another one on the content.
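
A sketch of the two-query idea, with the content query run lazily. This assumes Python's sqlite3 with LIKE as a stand-in for the real MySQL match; the docs table, the sample rows, the search generator, and the queries_run log are all invented for illustration.

```python
import sqlite3
from itertools import islice

# In-memory SQLite stands in for MySQL; rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, keyword TEXT, content TEXT)")
conn.executemany(
    "INSERT INTO docs (id, keyword, content) VALUES (?, ?, ?)",
    [
        (1, "foo", "unrelated text"),
        (2, "bar", "text mentioning foo"),
        (3, "foo", "foo again"),
    ],
)

queries_run = []  # records which queries were actually executed


def search(term):
    """Yield keyword matches first; query the content column only if more rows are requested."""
    pattern = f"%{term}%"
    queries_run.append("keyword")
    keyword_ids = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM docs WHERE keyword LIKE ? ORDER BY id", (pattern,)
        )
    ]
    yield from keyword_ids

    # Reached only when the caller asks for more than the keyword matches.
    queries_run.append("content")
    seen = set(keyword_ids)
    for (row_id,) in conn.execute(
        "SELECT id FROM docs WHERE content LIKE ? ORDER BY id", (pattern,)
    ).fetchall():
        if row_id not in seen:  # skip rows already returned as keyword matches
            yield row_id


first_page = list(islice(search("foo"), 2))
print(first_page, queries_run)
```

Because the generator is only advanced as far as the caller reads, a first page that is filled entirely by keyword matches never triggers the content query.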

Answer by adamJLev

As far as I know, this isn't supported with MySQL fulltext search, but you can achieve the effect by somehow repeating that word several times in the keyword field. Instead of having keywords "foo bar", have "foo bar foo bar foo bar", that way both foo and bar are equally important within the keywords column, and since they appear several times they become more relevant to mysql.
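
The repetition could be applied in application code before the row is saved. A minimal sketch, assuming the keywords are a whitespace-separated string; the helper name boost_keywords and the default of three repetitions are made up here.

```python
def boost_keywords(keywords, repeat=3):
    """Repeat the whole keyword list so every keyword occurs `repeat` times."""
    words = keywords.split()
    return " ".join(words * repeat)


# The value to store in the keywords column instead of the plain "foo bar".
print(boost_keywords("foo bar"))
```

Repeating the whole list, rather than a single word, keeps the keywords equally weighted relative to each other, as the answer describes.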

We use this on our site and it works.

Answer by notnot

If the metric is just that all the keyword matches are more "valuable" than all the content matches then you can just use a union with row counts. Something along these lines.

select *
from (
   select row_number() over(order by blahblah) as row, t.*
   from thetable t
   where keyword match

   union

   select row_number() over(order by blahblah) + @@rowcount + 1 as row, t.*
   from thetable t
   where content match
)
order by row

For anything more complicated than that, where you want to apply an actual weight to every row, I don't know how to help.
