SQL: Select a random sample from SQL Server quickly

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/652064/

Time: 2020-09-01 01:26:54  Source: igfitidea

Select random sampling from sqlserver quickly

Tags: sql, sql-server, database, performance, random

Asked by Byron Whitlock

I have a huge table of more than 10 million rows. I need to efficiently grab a random sample of 5000 rows from it. I have some constraints that reduce the total rows I am looking at to about 9 million.

I tried using ORDER BY NEWID(), but that query takes too long because it has to do a table scan of all rows.
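
For reference, the slow approach described here usually looks something like the sketch below. The table name dbo.BigTable and the IsActive filter are hypothetical stand-ins for the asker's table and constraints, which the question does not show.

```sql
-- Baseline from the question: give every qualifying row a random GUID,
-- sort all of them, and keep the first 5000.
-- dbo.BigTable and IsActive are hypothetical names, not from the original post.
SELECT TOP (5000) *
FROM dbo.BigTable
WHERE IsActive = 1      -- stand-in for the constraints (about 9 million rows)
ORDER BY NEWID();       -- sorts every qualifying row, which is where the time goes
```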

Is there a faster way to do this?

Answered by K. Brian Kelley

If you can use a pseudo-random sampling and you're on SQL Server 2005/2008, then take a look at TABLESAMPLE. For instance, here is an example against AdventureWorks2008 on SQL Server 2008 that samples by row count:

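USE AdventureWorks2008;
GO

-- Sample roughly 100 rows (the count is approximate),
-- then apply the WHERE filter to the sampled rows.
SELECT FirstName, LastName
FROM Person.Person
TABLESAMPLE (100 ROWS)
WHERE EmailPromotion = 2;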

The catch is that TABLESAMPLE isn't exactly random, because it samples physical pages rather than individual rows, so the number of rows returned is only approximate. You may not get back exactly 5000 rows unless you limit with TOP as well. If you're on SQL Server 2000, you'll have to either build a temporary table of primary key values to sample from, or use a method based on NEWID().
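
One way to combine TABLESAMPLE with TOP for the 5000-row case in the question is sketched below. It is only an illustration: dbo.BigTable and IsActive are hypothetical names, and the 1 percent figure is an assumed oversampling rate chosen so that comfortably more than 5000 rows survive the filter.

```sql
-- Oversample pages, filter, then trim to exactly 5000 rows.
-- dbo.BigTable and IsActive are hypothetical names.
SELECT TOP (5000) *
FROM dbo.BigTable TABLESAMPLE (1 PERCENT)   -- roughly 100,000 of 10 million rows
WHERE IsActive = 1                          -- filter is applied to the sampled rows
ORDER BY NEWID();                           -- shuffle only the sampled rows, not the whole table
```

If the sampled pages yield fewer than 5000 qualifying rows, the query simply returns fewer rows, so the percentage needs some headroom. Rows that share a page are still picked together, so the result remains page-biased.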

Answered by John Sansom

Have you looked into using the TABLESAMPLE clause?

For example:

select *
from HumanResources.Department tablesample (5 percent)

Answered by Mike Lieser

SQL Server 2000 solution, according to Microsoft (instead of the slow NEWID() approach on larger tables):

SELECT * FROM Table1
WHERE (ABS(CAST(
 (BINARY_CHECKSUM(*) *
  RAND()) as int)) % 100) < 10

The SQL Server team at Microsoft realized that not being able to take random samples of rows easily was a common problem in SQL Server 2000; so, the team addressed the problem in SQL Server 2005 by introducing the TABLESAMPLE clause. This clause selects a subset of rows by choosing random data pages and returning all of the rows on those pages. However, for those of us who still have products that run on SQL Server 2000 and need backward-compatibility, or who need truly row-level randomness, the BINARY_CHECKSUM query is a very effective workaround.

Explanation can be found here: http://msdn.microsoft.com/en-us/library/cc441928.aspx
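
As a rough illustration of row-level sampling applied to the 5000-row case from the question, the sketch below uses a per-row random filter. It is a variation on the idea, not the exact query from this answer: it swaps BINARY_CHECKSUM(*) * RAND() for CHECKSUM(NEWID()), and dbo.BigTable and IsActive are hypothetical names.

```sql
-- Keep each qualifying row with probability of about 0.1%, then trim to 5000.
-- NEWID() is evaluated once per row, so every row gets its own random draw.
-- dbo.BigTable and IsActive are hypothetical names.
SELECT TOP (5000) *
FROM dbo.BigTable
WHERE IsActive = 1
  AND (CHECKSUM(NEWID()) & 2147483647) % 1000 < 1   -- mask the sign bit, keep ~1 in 1000
ORDER BY NEWID();                                   -- shuffle only the few thousand survivors
```

With about 9 million qualifying rows this filter keeps around 9000 of them, so the final sort is small. The query still reads every row once; what it avoids is sorting the entire table.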

Answered by friism

Yeah, TABLESAMPLE is your friend (note that it's not random in the statistical sense of the word): see TABLESAMPLE on MSDN.