SQL: What is the fastest way to find a string by substring in SQL?

Question

Asked by msergey

I have a huge table with two columns: Id and Title. Id is bigint, and I'm free to choose the type of the Title column: varchar, char, text, whatever. The Title column contains random text strings like "abcdefg", "q", "allyourbasebelongtous", with a maximum of 255 characters.

My task is to find the strings that contain a given substring. The substrings also have random length and can match the start, middle, or end of a string. The most obvious way to do it:

SELECT * FROM t WHERE Title LIKE '%abc%'

I don't care about INSERT performance; I only need fast SELECTs. What can I do to make the search as fast as possible?

I use MS SQL Server 2008 R2. As far as I can see, full-text search will be useless here.

Answer 1

Accepted answer by antlersoft

If you want to use less space than Randy's answer and there is considerable repetition in your data, you can create an N-ary tree data structure where each edge is the next character, and hang each string and each of its trailing substrings (suffixes) on it.

You number the nodes in depth-first order. Then you create a table with up to 255 rows for each of your records, each row holding the Id of the record and the id of the tree node that matches the string or one of its trailing substrings. When you search, you find the node id that represents the string you are searching for and do a range search: with depth-first numbering, that node and everything below it (all strings and trailing substrings that start with the search string) occupy one contiguous range of node ids.
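
To make the numbering idea concrete, here is a minimal in-memory sketch in Python. It is only an illustration under my own assumptions, not a SQL Server implementation: the names `Node`, `build_index`, and `search` are hypothetical, and the sorted `rows` list stands in for an indexed table of (node id, record Id) pairs.

```python
import bisect


class Node:
    """One trie node; `first`/`last` bound the ids used inside its subtree."""
    __slots__ = ("children", "first", "last")

    def __init__(self):
        self.children = {}
        self.first = -1
        self.last = -1


def build_index(records):
    """records: dict of record Id -> Title. Returns (trie root, sorted rows)."""
    root = Node()
    ends = []  # (node where a trailing substring ends, record Id)
    for rec_id, title in records.items():
        for start in range(len(title)):  # every trailing substring
            node = root
            for ch in title[start:]:
                node = node.children.setdefault(ch, Node())
            ends.append((node, rec_id))

    # Number nodes in depth-first (pre-order) order, iteratively.
    counter = 0
    stack = [(root, False)]
    while stack:
        node, finished = stack.pop()
        if finished:
            node.last = counter - 1  # highest id handed out in this subtree
            continue
        node.first = counter
        counter += 1
        stack.append((node, True))
        for child in node.children.values():
            stack.append((child, False))

    # Stand-in for the "(node id, record Id)" table with an index on node id.
    rows = sorted((node.first, rec_id) for node, rec_id in ends)
    return root, rows


def search(root, rows, needle):
    """Return the set of record Ids whose Title contains `needle`."""
    node = root
    for ch in needle:
        node = node.children.get(ch)
        if node is None:
            return set()
    # Range search: every match lives in [node.first, node.last].
    lo = bisect.bisect_left(rows, (node.first, float("-inf")))
    hi = bisect.bisect_right(rows, (node.last, float("inf")))
    return {rec_id for _, rec_id in rows[lo:hi]}


records = {1: "abcdefg", 2: "q", 3: "allyourbasebelongtous", 4: "xabcx"}
root, rows = build_index(records)
print(search(root, rows, "abc"))
```

In a database, the walk down the tree and the final range lookup would each need their own table and index; the sketch only shows why depth-first numbering turns "all strings under this node" into a single BETWEEN-style range.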

Answer 2

Answered by Randy

If you don't care about storage, you can create another table with partial Title entries, one beginning at each character position of the title (up to 255 entries per title).

This way you can index these substrings and match only against the beginning of the string, which should greatly improve performance.
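
A rough sketch of that idea, using Python's built-in sqlite3 module as a stand-in for SQL Server so it can be run anywhere. The table `t_suffix`, the index `ix_suffix`, and the helper names are my own assumptions. The prefix match is written as a range comparison, which plays the role that `LIKE 'abc%'` would play against an indexed column; it assumes a non-empty search string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (Id INTEGER PRIMARY KEY, Title TEXT NOT NULL);
    -- One row per trailing substring of every title (hypothetical table).
    CREATE TABLE t_suffix (Suffix TEXT NOT NULL, Id INTEGER NOT NULL);
    CREATE INDEX ix_suffix ON t_suffix (Suffix, Id);
""")


def add_title(rec_id, title):
    """Insert the title plus all of its trailing substrings."""
    conn.execute("INSERT INTO t VALUES (?, ?)", (rec_id, title))
    conn.executemany(
        "INSERT INTO t_suffix VALUES (?, ?)",
        [(title[i:], rec_id) for i in range(len(title))],
    )


def find_by_substring(needle):
    """Ids of titles containing `needle`, via a prefix match on the suffixes."""
    if not needle:
        raise ValueError("needle must not be empty")
    # Smallest string that sorts after every string starting with `needle`.
    upper = needle[:-1] + chr(ord(needle[-1]) + 1)
    rows = conn.execute(
        "SELECT DISTINCT Id FROM t_suffix "
        "WHERE Suffix >= ? AND Suffix < ? ORDER BY Id",
        (needle, upper),
    )
    return [row[0] for row in rows]


for rec_id, title in [(1, "abcdefg"), (2, "q"),
                      (3, "allyourbasebelongtous"), (4, "xabcx")]:
    add_title(rec_id, title)

print(find_by_substring("abc"))
```

The cost is storage: a 255-character title produces 255 suffix rows, which is the trade-off this answer accepts up front.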

Answer 3

Answered by BradC

Sounds like you've ruled out all good alternatives.

You already know that your query

SELECT * FROM t WHERE TITLE LIKE '%abc%'

won't use an index; it will do a full table scan every time.

If you were sure that the string was at the beginning of the field, you could do

SELECT * FROM t WHERE TITLE LIKE 'abc%'

which would use an index on Title.

Are you sure full text search wouldn't help you here?

Depending on your business requirements, I've sometimes used the following logic:

Do a "begins with" query (LIKE 'abc%') first, which will use an index.
Depending on whether any rows are returned (or how many), conditionally move on to the "harder" search that does the full scan (LIKE '%abc%').

It depends on what you need, of course, but I've used this in situations where I can show the easiest and most common results first, and only move on to the more difficult query when necessary.
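
The control flow of that two-step logic could look roughly like this. It is a sketch only: Python's sqlite3 stands in for SQL Server, and the function `two_stage_search` and its `min_rows` threshold are names I made up. It shows the order of the queries, not SQL Server's index behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (Id INTEGER PRIMARY KEY, Title TEXT NOT NULL)")
conn.execute("CREATE INDEX ix_title ON t (Title)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "abcdefg"), (2, "q"), (3, "allyourbasebelongtous"), (4, "xabcx")],
)


def like_escape(text):
    """Escape LIKE wildcards so user input is matched literally."""
    return text.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def two_stage_search(needle, min_rows=1):
    """Try the cheap 'begins with' query first, then fall back to the scan."""
    sql = "SELECT Id, Title FROM t WHERE Title LIKE ? ESCAPE '\\' ORDER BY Id"
    safe = like_escape(needle)

    # Step 1: LIKE 'abc%', the query that can use an index on Title.
    rows = conn.execute(sql, (safe + "%",)).fetchall()
    if len(rows) >= min_rows:
        return rows

    # Step 2: LIKE '%abc%', the full scan, only when step 1 found too little.
    return conn.execute(sql, ("%" + safe + "%",)).fetchall()


print(two_stage_search("abc"))   # satisfied by the prefix query
print(two_stage_search("base"))  # needs the fallback scan
```

Note that when step 1 finds enough rows, titles that contain the string only in the middle are not returned; that is the deliberate trade-off of showing the easy results first.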

Answer 4

Answered by Dharmendar Kumar 'DK'

You can add a computed column to the table: titleLength AS LEN(title) PERSISTED. This stores the length of the title column. Create an index on it.

Also, add another computed column: ReverseTitle AS REVERSE(title) PERSISTED.

Now when someone searches for a keyword, check whether the length of the keyword is the same as titleLength. If so, do an "=" search. If the keyword is shorter than titleLength, do a LIKE. But first do title LIKE 'abc%', then do reverseTitle LIKE 'cba%' (which finds titles ending in 'abc'). This is similar to Brad's approach, i.e. you run the next, more difficult query only if required.
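
Here is how I read those stages, as a small runnable sketch. Python's sqlite3 stands in for SQL Server, so the two computed columns are filled in from Python instead of being declared PERSISTED; the function `staged_search`, the stage names, and the final "contains" fallback are my own assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (
        Id           INTEGER PRIMARY KEY,
        Title        TEXT NOT NULL,
        TitleLength  INTEGER NOT NULL,  -- LEN(title) in the answer
        ReverseTitle TEXT NOT NULL      -- REVERSE(title) in the answer
    );
    CREATE INDEX ix_title   ON t (Title);
    CREATE INDEX ix_length  ON t (TitleLength);
    CREATE INDEX ix_reverse ON t (ReverseTitle);
""")
for rec_id, title in [(1, "abcdefg"), (2, "q"),
                      (3, "allyourbasebelongtous"), (4, "xabcx")]:
    conn.execute("INSERT INTO t VALUES (?, ?, ?, ?)",
                 (rec_id, title, len(title), title[::-1]))


def like_escape(text):
    """Escape LIKE wildcards so the keyword is matched literally."""
    return text.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def staged_search(keyword):
    """Run the cheapest query first; return (stage name, matching Ids)."""
    stages = [
        ("exact", "TitleLength = ? AND Title = ?", (len(keyword), keyword)),
        ("prefix", "Title LIKE ? ESCAPE '\\'", (like_escape(keyword) + "%",)),
        # A title ends with the keyword if its reverse starts with the
        # reversed keyword.
        ("suffix", "ReverseTitle LIKE ? ESCAPE '\\'",
         (like_escape(keyword[::-1]) + "%",)),
        # Last resort: the full scan (assumed, by analogy with Brad's answer).
        ("contains", "Title LIKE ? ESCAPE '\\'",
         ("%" + like_escape(keyword) + "%",)),
    ]
    for name, where, params in stages:
        sql = f"SELECT Id FROM t WHERE {where} ORDER BY Id"
        ids = [row[0] for row in conn.execute(sql, params)]
        if ids:
            return name, ids
    return "none", []


print(staged_search("tous"))
```

Only matches in the middle of a title still need the full scan; exact, prefix, and suffix matches can each be answered from an index.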

Also, if the 80-20 rule applies to your keywords/substrings (i.e. most of the searches are for a minority of the keywords), you can consider some sort of caching. For example: say you find that many users search for the keyword "abc" and this search returns the records with ids 20, 22, 24, 25. You can store this in a separate, indexed table. Now when someone searches for a keyword, first look in this "cache" table to see whether the same search was already performed by an earlier user. If so, there is no need to look in the main table again; simply return the results from the "cache" table.
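
A minimal sketch of such a cache table, again with Python's sqlite3 as a stand-in for SQL Server. The tables `search_done` and `search_cache`, the `stats` counter, and the function name are hypothetical, and the sketch assumes the main table does not change after a keyword has been cached (the question says INSERTs are not a concern).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (Id INTEGER PRIMARY KEY, Title TEXT NOT NULL);
    -- Keywords that were already searched, including ones with no hits.
    CREATE TABLE search_done (Keyword TEXT PRIMARY KEY);
    -- Cached results: one row per (keyword, matching record Id).
    CREATE TABLE search_cache (
        Keyword TEXT NOT NULL,
        Id      INTEGER NOT NULL,
        PRIMARY KEY (Keyword, Id)
    );
""")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "abcdefg"), (2, "q"), (3, "allyourbasebelongtous"), (4, "xabcx")],
)

stats = {"full_scans": 0}  # only here to show when the main table is hit


def cached_search(keyword):
    """Return matching Ids, scanning the main table once per keyword."""
    seen = conn.execute(
        "SELECT 1 FROM search_done WHERE Keyword = ?", (keyword,)
    ).fetchone()
    if seen is None:
        stats["full_scans"] += 1
        # instr() is SQLite's substring test; it plays the role of
        # LIKE '%keyword%' in this sketch.
        ids = [row[0] for row in conn.execute(
            "SELECT Id FROM t WHERE instr(Title, ?) > 0", (keyword,))]
        conn.execute("INSERT INTO search_done VALUES (?)", (keyword,))
        conn.executemany("INSERT INTO search_cache VALUES (?, ?)",
                         [(keyword, rec_id) for rec_id in ids])
    rows = conn.execute(
        "SELECT Id FROM search_cache WHERE Keyword = ? ORDER BY Id",
        (keyword,),
    )
    return [row[0] for row in rows]


print(cached_search("abc"))
print(cached_search("abc"))  # second call is answered from the cache
```

Keeping `search_done` separate from `search_cache` lets a keyword with zero matches be remembered too, so it does not trigger a new scan every time.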

You can also combine the above with SQL Server full-text search (assuming you have a valid reason not to rely on it alone). You could still use full-text search first to shortlist the result set, and then run a SQL query against your table to get exact results, passing the Ids returned by the full-text search as a parameter along with your keyword.

All this obviously assumes you have to use SQL. If not, you can explore something like Apache Solr.

Answer 5

Answered by KuldipMCA

Create an indexed view, which is a newer feature in SQL Server: create an index on the column that you need to search, then use that view in your search. That will give you faster results.

Answer 6

Answered by Uğur Gümüşhan

Use an ASCII charset with a clustered index on the char column. The charset influences search performance because of the data size both in RAM and on disk. The bottleneck is often I/O.
Your column is at most 255 characters long, so you can use a normal index on your char field rather than a full-text index, which is faster. Do not select unnecessary columns in your SELECT statement.
Lastly, add more RAM to the server and increase the cache size.

Answer 7

Answered by Mohit Verma

Do one thing: put the primary key on the specific column and index it as a clustered index.

Then search using any method (wildcard, =, or anything else). It will search optimally because the table is already clustered, so the engine knows where to find the rows (the column is already in sorted order).
