SQL 什么是全文搜索与 LIKE

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/224714/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-01 00:01:29  来源:igfitidea点击:

What is Full Text Search vs LIKE

sqlfull-text-searchsql-like

提问by Nathan W

I just read a post mentioning "full text search" in SQL.

我刚刚读了一篇提到 SQL 中的“全文搜索”的帖子。

I was just wondering what the difference between FTS and LIKE are. I did read a couple of articles but couldn't find anything that explained it well.

我只是想知道 FTS 和 LIKE 之间的区别是什么。我确实阅读了几篇文章,但找不到任何可以很好地解释它的内容。

回答by erickson

In general, there is a tradeoff between "precision" and "recall". High precision means that fewer irrelevant results are presented (no false positives), while high recall means that fewer relevant results are missing (no false negatives). Using the LIKE operator gives you 100% precision with no concessions for recall. A full text search facility gives you a lot of flexibility to tune down the precision for better recall.

通常,在“精度”和“召回”之间存在权衡。高精度意味着呈现较少的不相关结果(无误报),而高召回率意味着丢失较少的相关结果(无误报)。使用 LIKE 运算符可以为您提供 100% 的精确度,而不会在召回方面做出让步。全文搜索工具为您提供了很大的灵活性来调低精度以获得更好的召回率。

Most full text search implementations use an "inverted index". This is an index where the keys are individual terms, and the associated values are sets of records that contain the term. Full text search is optimized to compute the intersection, union, etc. of these record sets, and usually provides a ranking algorithm to quantify how strongly a given record matches search keywords.

大多数全文搜索实现使用“倒排索引”。这是一个索引,其中键是单个术语,关联的值是包含该术语的记录集。全文搜索被优化以计算这些记录集的交集、并集等,并且通常提供排名算法来量化给定记录与搜索关键字匹配的强度。

The SQL LIKE operator can be extremely inefficient. If you apply it to an un-indexed column, a full scan will be used to find matches (just like any query on an un-indexed field). If the column is indexed, matching can be performed against index keys, but with far less efficiency than most index lookups. In the worst case, the LIKE pattern will have leading wildcards that require every index key to be examined. In contrast, many information retrieval systems can enable support for leading wildcards by pre-compiling suffix trees in selected fields.

SQL LIKE 运算符可能非常低效。如果将其应用于未编入索引的列,则将使用完整扫描来查找匹配项(就像对未编入索引的字段进行任何查询一样)。如果该列已编入索引,则可以针对索引键执行匹配,但效率远低于大多数索引查找。在最坏的情况下,LIKE 模式将具有需要检查每个索引键的前导通配符。相比之下,许多信息检索系统可以通过在选定字段中预编译后缀树来支持前导通配符。

Other features typical of full-text search are

全文搜索的其他典型特征是

  • lexical analysis or tokenization—breaking a block of unstructured text into individual words, phrases, and special tokens
  • morphological analysis, or stemming—collapsing variations of a given word into one index term; for example, treating "mice" and "mouse", or "electrification" and "electric" as the same word
  • ranking—measuring the similarity of a matching record to the query string
  • 词法分析或标记化——将非结构化文本块分解为单独的单词、短语和特殊标记
  • 形态分析或词干提取——将给定词的变体合并为一个索引词;例如,将“小鼠”和“鼠标”,或“电气化”和“电动”视为同一个词
  • 排名——测量匹配记录与查询字符串的相似度

回答by Ignacio Vazquez-Abrams

FTS involves indexing the individual words within a text field in order to make searching through many records quick. Using LIKE still requires you to do a string search (linear or the like) within the field.

FTS 涉及对文本字段中的单个单词进行索引,以便快速搜索许多记录。使用 LIKE 仍然需要您在字段内进行字符串搜索(线性搜索或类似搜索)。

回答by Vipin Jain

MySQL creates an index from the words of the enabled full-text search column and performs searches on this index. MySQL uses a sophisticated algorithm to determine the rows matched against the search query.

MySQL 从启用的全文搜索列的单词创建索引,并在该索引上执行搜索。MySQL 使用复杂的算法来确定与搜索查询匹配的行。

Also, from this SO answer:

另外,从这个 SO 答案

There's a few advantages to full text searching.

Indexing:

Something like:

WHERE Foo LIKE '%Bar';

Cannot take advantage of an index. It has to look at every single row, and see if it matches. A fulltext index, however, can. In fact, fulltext indexes can offer a lot more flexibility in terms of the order of matching words, how close those words are together, etc.

Stemming:

A fulltext search can stem words. If you search for run, you can get results for "ran" or "running". Most fulltext engines have stem dictionaries in a variety of languages.

Weighted Results:

A fulltext index can encompass multiple columns. For example, you can search for "peach pie", and the index can include a title, keywords, and a body. Results that match the title can be weighted higher, as more relevant, and can be sorted to show near the top.

Disadvantages:

A fulltext index can potentially be huge, many times larger than a standard B-TREE index. For this reason, many hosted providers who offer database instances disable this feature, or at least charge extra for it. For example, last I checked, Windows Azure did not support fulltext queries.

Fulltext indexes can also be slower to update. If the data changes a lot, there might be some lag updating indexes compared to standard indexes.

全文搜索有几个优点。

索引:

就像是:

WHERE Foo LIKE '%Bar';

无法利用索引。它必须查看每一行,看看它是否匹配。但是,全文索引可以。事实上,全文索引可以在匹配词的顺序、这些词之间的接近程度等方面提供更大的灵活性。

词干:

全文搜索可以词干。如果您搜索 run,您可以获得“ran”或“running”的结果。大多数全文引擎都有多种语言的词干词典。

加权结果:

全文索引可以包含多个列。例如,您可以搜索“桃子派”,索引可以包括标题、关键字和正文。与标题匹配的结果的权重可以更高,因为更相关,并且可以排序以显示在顶部附近。

缺点:

全文索引可能很大,比标准 B-TREE 索引大很多倍。出于这个原因,许多提供数据库实例的托管提供商禁用此功能,或者至少为此收取额外费用。例如,上次我检查过,Windows Azure 不支持全文查询。

全文索引的更新速度也可能较慢。如果数据变化很大,与标准索引相比,更新索引可能会有一些滞后。

回答by Mitchel Sellers

Like uses wildcards only, and isn't all that powerful.

Like 只使用通配符,并不是那么强大。

Full text allows much more complex searching, including And, Or, Not, even similar sounding results (SOUNDEX) and many more items.

全文允许更复杂的搜索,包括 And、Or、Not,甚至类似的发音结果 (SOUNDEX) 和更多项目。

I would start looking at the SQL CONTAINS() FREETEXT() and related Full Text search items to help get a better understanding of what is available.

我将开始查看 SQL CONTAINS() FREETEXT() 和相关的全文搜索项,以帮助更好地了解可用内容。

回答by Kingz

The real difference is the scanning methodologies. For full-text search, the words (terms) are used as hash keys - each of which is associated with an array of documents the keys (terms) appears in. Its like this:

真正的区别在于扫描方法。对于全文搜索,单词(术语)用作哈希键 - 每个哈希键都与出现键(术语)的文档数组相关联。它是这样的:

Document sets = {d1, d2, d3, d4, ... dn}
Term sets = {t1, t2, t3, .. tn}

Now term-document matrix (which term member of which document) can be represented as:

现在术语-文档矩阵(哪个文档的哪个术语成员)可以表示为:

t1 -> {d1, d5, d9,.. dn}
t2 -> {d11, d50, d2,.. dn}
t3 -> {d23, d67, d34,.. dn}
:
tn -> {d90, d87, d57,.. dn}

When the request comes in asking for "Get me all documents containing the word/term t1" - then the document set {d1, d5, d9,.. dn} is returned.

当请求要求“获取包含单词/术语 t1 的所有文档”时,{d1, d5, d9,.. dn返回文档集}。

You could hack a de-normalized table schema to store documents - each row in MySQL table will be considered as "document" and a TEXT column could contain a paragraph etc. The inverted index will contain the terms as hash keys and the row-ids as the document ids.

您可以破解一个非规范化的表模式来存储文档 - MySQL 表中的每一行都将被视为“文档”,并且一个 TEXT 列可以包含一个段落等。倒排索引将包含术语作为哈希键和行 ID作为文档 ID。

Remember that this SQL query will have more or less O(1) performance. The query will be independent of

请记住,此 SQL 查询或多或少具有 O(1) 性能。查询将独立于

  1. Number of words/terms in the TEXT column
  2. The number of rows/documents matching the criteria
  3. The length of the words/terms
  1. TEXT 列中的字数/术语数
  2. 符合条件的行数/文档数
  3. 单词/术语的长度

For instance this SQL could be fired to extract all rows matching the given word XYZ:

例如,可以触发此 SQL 以提取与给定单词 XYZ 匹配的所有行:

SELECT * 
FROM   my_table 
WHERE  MATCH (my_text_column) against ('XYZ' IN boolean mode) ;

Caveat: If you add ORDER BY to this query, your runtimes will vary based on the several parameters, one of which is the number of matching rows/documents. So beware.

警告:如果您向此查询添加 ORDER BY,您的运行时将根据多个参数而变化,其中之一是匹配行/文档的数量。所以要小心。

The LIKE however has got nothing of this. It is forced to linearly scan the sentence/string and find all matching terms. Adding wild card adds to the mess. It works great for small length strings, as you can imagine, but will fail miserably for longer sentences. And definitely not comparable when having a paragraph or a whole page of text etc.

然而,LIKE 对此一无所知。它被迫线性扫描句子/字符串并找到所有匹配的术语。添加通配符会增加混乱。正如您可以想象的那样,它适用于小长度的字符串,但对于较长的句子会失败。当有一段或一整页的文字等时,绝对没有可比性。

回答by kamskyleo

FTS is more efficient, powerful (especially for Word Breakers and stemming functionalities) ... but check your requirements because sometimes DBs don't support all languages for example MSSQL doesn't support Greek (check on this page http://msdn.microsoft.com/en-us/library/ms176076(v=sql.110).aspx)

FTS 更高效、更强大(特别是对于分词器和词干提取功能)……但是请检查您的要求,因为有时 DB 不支持所有语言,例如 MSSQL 不支持希腊语(请查看此页面http://msdn。 microsoft.com/en-us/library/ms176076(v=sql.110).aspx)