Original question: http://stackoverflow.com/questions/5876861/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
How to search millions of records in a SQL table faster?
Asked by user737063
I have a SQL table with millions of domain names. But now when I search for, let's say,
SELECT *
FROM tblDomainResults
WHERE domainName LIKE '%lifeis%'
It takes more than 10 minutes to get the results. I tried indexing but that didn't help.
What is the best way to store these millions of records and access the information quickly?
There are about 50 million records and 5 columns so far.
Answer by Igor Nazarenko
Most likely, you tried a traditional index which cannot be used to optimize LIKE queries unless the pattern begins with a fixed string (e.g. 'lifeis%').
What you need for your query is a full-text index. Most DBMS support it these days.
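To illustrate the point about fixed prefixes, here is a minimal sketch that asks a query planner how it would run each pattern. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the real DBMS; the table `domains`, the index `ix_domains_name`, and the sample rows are assumptions made up for the demonstration. Other engines report plans differently, but the same contrast should show up.

```python
import sqlite3

# In-memory SQLite database standing in for the real DBMS (an assumption).
conn = sqlite3.connect(":memory:")
# SQLite's LIKE is case-insensitive by default, so the column is declared
# NOCASE; that lets an ordinary index serve prefix patterns.
conn.execute(
    "CREATE TABLE domains (id INTEGER PRIMARY KEY, name TEXT COLLATE NOCASE)"
)
conn.execute("CREATE INDEX ix_domains_name ON domains(name)")
conn.executemany(
    "INSERT INTO domains (name) VALUES (?)",
    [("lifeisgood.com",), ("thelifeishard.net",), ("example.org",)],
)


def plan(sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text


prefix_sql = "SELECT name FROM domains WHERE name LIKE 'lifeis%'"
infix_sql = "SELECT name FROM domains WHERE name LIKE '%lifeis%'"

prefix_plan = plan(prefix_sql)  # fixed prefix: the index can bound the range
infix_plan = plan(infix_sql)    # leading wildcard: every entry must be read

print(prefix_plan)
print(infix_plan)
```

The first plan should describe an index range search and the second a scan. Both queries still return correct rows; only the amount of data read differs.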
Answer by Will A
Full-text indexing is the far-and-away best option here - how this is accomplished will depend on the DBMS you're using.
Short of that, ensuring that you have an index on the column being matched with the pattern will help performance, but by the sounds of it, you've tried this and it didn't help a great deal.
Answer by Aaron Bertrand
Assuming that your 50 million row table includes duplicates (perhaps that is part of the problem), and assuming SQL Server (the syntax may change but the concept is similar on most RDBMSes), another option is to store domains in a lookup table, e.g.
CREATE TABLE dbo.Domains
(
DomainID INT IDENTITY(1,1) PRIMARY KEY,
DomainName VARCHAR(255) NOT NULL
);
CREATE UNIQUE INDEX dn ON dbo.Domains(DomainName);
When you load new data, check if any of the domain names are new - and insert those into the Domains table. Then in your big table, you just include the DomainID. Not only will this keep your 50 million row table much smaller, it will also make lookups like this much more efficient.
SELECT * -- please specify column names
FROM dbo.tblDomainResults AS dr
INNER JOIN dbo.Domains AS d
ON dr.DomainID = d.DomainID
WHERE d.DomainName LIKE '%lifeis%';
Of course, except on the tiniest of tables, it will always help to avoid LIKE clauses with a leading wildcard.
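The load step described in this answer (insert any new domain names into the lookup table, then store only the DomainID in the big table) can be sketched roughly as follows. This is an illustration under assumptions, not the answer's own code: it uses Python's sqlite3 module instead of SQL Server, the `Payload` column is a hypothetical stand-in for the other columns, and `load_rows` is a made-up helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for SQL Server here
conn.executescript("""
    CREATE TABLE Domains (
        DomainID   INTEGER PRIMARY KEY,
        DomainName TEXT NOT NULL UNIQUE
    );
    CREATE TABLE tblDomainResults (
        ResultID INTEGER PRIMARY KEY,
        DomainID INTEGER NOT NULL REFERENCES Domains(DomainID),
        Payload  TEXT  -- hypothetical stand-in for the remaining columns
    );
""")


def load_rows(rows):
    """Insert (domain_name, payload) pairs, adding unseen domains first."""
    with conn:  # one transaction per batch
        # New names are inserted; names already present are skipped
        # thanks to the UNIQUE constraint.
        conn.executemany(
            "INSERT OR IGNORE INTO Domains (DomainName) VALUES (?)",
            [(name,) for name, _ in rows],
        )
        # The big table stores only the integer key of each domain.
        conn.executemany(
            """INSERT INTO tblDomainResults (DomainID, Payload)
               SELECT DomainID, ? FROM Domains WHERE DomainName = ?""",
            [(payload, name) for name, payload in rows],
        )


load_rows([("lifeisgood.com", "a"), ("example.org", "b"), ("lifeisgood.com", "c")])
domain_count = conn.execute("SELECT COUNT(*) FROM Domains").fetchone()[0]
result_count = conn.execute("SELECT COUNT(*) FROM tblDomainResults").fetchone()[0]
print(domain_count, result_count)
```

Three result rows end up referencing two distinct domains, so a pattern search only has to examine the smaller Domains table before joining back.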
Answer by tereško
Stop using the LIKE statement. You could use full-text search, but it will require a MyISAM table and isn't all that good a solution.
I would recommend that you examine the available third-party solutions, like Lucene and Sphinx.
They will be superior.
Answer by RHSeeger
One thing you might want to consider is having a separate search engine for such lookups. For example, you can use a Solr (Lucene) server to search on and retrieve the ids of entries that match your search, then retrieve the data from the database by id. Even with two separate calls, it's very likely to wind up being faster.
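The two-call pattern above might look roughly like the sketch below. Everything in it is an assumption for illustration: `search_ids` is a hypothetical stand-in for the request to the search server (here it just filters a small Python dict), and SQLite via Python's sqlite3 module plays the role of the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblDomainResults (id INTEGER PRIMARY KEY, domainName TEXT)")
conn.executemany(
    "INSERT INTO tblDomainResults (id, domainName) VALUES (?, ?)",
    [(1, "lifeisgood.com"), (2, "example.org"), (3, "thelifeishard.net")],
)


def search_ids(term):
    """Hypothetical stand-in for the search-engine call; returns matching ids.

    A real deployment would query the external engine here instead of
    filtering an in-memory dict.
    """
    engine_docs = {1: "lifeisgood.com", 2: "example.org", 3: "thelifeishard.net"}
    return [doc_id for doc_id, text in engine_docs.items() if term in text]


def fetch_by_ids(ids):
    """Second call: primary-key lookups in the database."""
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    sql = (
        "SELECT id, domainName FROM tblDomainResults "
        f"WHERE id IN ({placeholders}) ORDER BY id"
    )
    return conn.execute(sql, ids).fetchall()


rows = fetch_by_ids(search_ids("lifeis"))
print(rows)
```

The database only ever does primary-key lookups, which stay cheap regardless of how the text matching was done.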
Answer by Jody
Indexes are slowed down whenever they have to go look up ("bookmark lookup") data that the index itself doesn't contain. For instance, if your index has 2 columns, ID and NAME, but you're selecting * (which is 5 columns total), the database has to read the index for the first two columns, then go look up the other 3 columns in a less efficient data structure somewhere else.
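The difference between a query the index fully covers and one that needs the extra lookup can be seen in a query plan. The sketch below is only an illustration using Python's sqlite3 module; the table `t`, its columns, and the index name are made up, and other engines word their plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT, a TEXT, b TEXT, c TEXT)"
)
# In SQLite an index entry also carries the row's integer key, so this
# index effectively holds the two columns (name, id).
conn.execute("CREATE INDEX ix_t_name ON t(name)")


def plan(sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Only columns present in the index: no trip to the table is needed.
covered_plan = plan("SELECT id, name FROM t WHERE name = 'x'")
# All five columns: each index hit needs a further lookup in the table.
lookup_plan = plan("SELECT * FROM t WHERE name = 'x'")

print(covered_plan)
print(lookup_plan)
```

The first plan should mention a covering index, while the second uses the same index without that qualifier, which is the extra lookup the paragraph above describes.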
In this case, your index can't be used because of the LIKE. This is similar to not putting any WHERE filter on the query: the engine skips the index altogether, and since it has to read the whole table anyway, it does just that (a "table scan"). There is a threshold (I think around 35-50%) where the engine normally flips over to this.
In short, it seems unlikely that you need all 50 million rows from the DB for a production application, but if you do... use a machine with more memory and try methods that keep that data in memory. Maybe a NoSQL DB would be a better option: MongoDB, CouchDB, Tokyo Cabinet. Things like this. Good luck!
Answer by Scott Bruns
You could try breaking up the domain into chunks and then search the chunks themselves. I did something like that years ago when I needed to search for words in sentences. I did not have full-text searching available, so I broke up the sentences into a word list and searched the words. It was really fast to find the results since the words were indexed.
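One way the chunk idea could be applied to domain names is sketched below. This is an assumption-laden illustration, not the method the answer actually used: domains have no word boundaries, so the sketch uses overlapping 3-character chunks, keeps the index in a Python dict rather than an indexed table, and the function names are invented.

```python
from collections import defaultdict

N = 3  # chunk length; an arbitrary choice for this sketch


def chunks(text):
    """All overlapping substrings of length N."""
    return {text[i:i + N] for i in range(len(text) - N + 1)}


def build_index(domains):
    """Map each chunk to the set of domains that contain it."""
    index = defaultdict(set)
    for domain in domains:
        for chunk in chunks(domain):
            index[chunk].add(domain)
    return index


def search(index, term):
    """Find domains containing term, using only exact chunk lookups."""
    term_chunks = chunks(term)
    if not term_chunks:
        raise ValueError(f"search term must have at least {N} characters")
    # Intersect the per-chunk sets, then confirm the full substring,
    # because sharing every chunk does not guarantee a contiguous match.
    candidates = set.intersection(*(index.get(c, set()) for c in term_chunks))
    return sorted(d for d in candidates if term in d)


index = build_index(["lifeisgood.com", "thelifeishard.net", "example.org"])
matches = search(index, "lifeis")
print(matches)
```

In a database, the dict would become a table of (chunk, domain id) rows with an ordinary index on the chunk column, so each lookup is an equality match rather than a leading-wildcard LIKE.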