T-SQL中的随机加权选择
如何根据所有候选行的权重在T-SQL中随机选择表行?
例如,我在表中有一组行,权重分别为50、25和25(总计为100,但不需要),我想随机选择其中的一行,其统计结果等于各自重量。
解决方案
回答
我们只需要对所有候选行的权重求和,然后在该总和中选择一个随机点,然后选择与该选定点进行协调的记录(每个记录将逐渐累积一个累加的权重和)。
DECLARE @id int, @weight_sum int, @weight_point int DECLARE @table TABLE (id int, weight int) INSERT INTO @table(id, weight) VALUES(1, 50) INSERT INTO @table(id, weight) VALUES(2, 25) INSERT INTO @table(id, weight) VALUES(3, 25) SELECT @weight_sum = SUM(weight) FROM @table SELECT @weight_point = ROUND(((@weight_sum - 1) * RAND() + 1), 0) SELECT TOP 1 @id = t1.id FROM @table t1, @table t2 WHERE t1.id >= t2.id GROUP BY t1.id HAVING SUM(t2.weight) >= @weight_point ORDER BY t1.id SELECT @id
回答
如果我们有很多记录,则"逐渐增加一个累加的总和"部分会很昂贵。如果我们也已经拥有很宽的分数/权重范围(即:范围足够宽,大多数记录的权重都是唯一的。1-5个星可能不会削减),我们可以执行类似的操作来选择一个权重值。我在这里使用VB.Net进行演示,但这也可以在纯Sql中轻松完成:
Function PickScore() 'Assume we have a database wrapper class instance called SQL and seeded a PRNG already 'Get count of scores in database Dim ScoreCount As Double = SQL.ExecuteScalar("SELECT COUNT(score) FROM [MyTable]") ' You could also approximate this with just the number of records in the table, which might be faster. 'Random number between 0 and 1 with ScoreCount possible values Dim rand As Double = Random.GetNext(ScoreCount) / ScoreCount 'Use the equation y = 1 - x^3 to skew results in favor of higher scores ' For x between 0 and 1, y is also between 0 and 1 with a strong bias towards 1 rand = 1 - (rand * rand * rand) 'Now we need to map the (0,1] vector to [1,Maxscore]. 'Just find MaxScore and mutliply by rand Dim MaxScore As UInteger = SQL.ExecuteScalar("SELECT MAX(Score) FROM Songs") Return MaxScore * rand End Function
运行此命令,然后选择得分最高且小于返回重量的记录。如果多于一个记录共享该分数,则随机选择它。这样做的好处是我们不必维护任何和,并且可以调整用于满足自己口味的概率方程。但同样,它在分数分布较大时效果最佳。
回答
使用随机数生成器执行此操作的方法是集成概率密度函数。使用一组离散值,我们可以计算前缀总和(直到该总和的所有值的总和)并进行存储。使用此选项,我们可以选择大于随机数的最小前缀总和(迄今为止的总和)值。
在数据库上,插入后的后续值必须更新。如果更新的相对频率和数据集的大小不致使执行此操作的成本过高,则意味着可以从单个s-argable(可通过索引查找解析的谓词)查询中获得适当的值。
回答
丹恩(Dane)的答案包括以引入平方律的方式进行自我连接。联接后的(n * n / 2)行,其中表中有n行。
更为理想的是能够只解析一次表。
DECLARE @id int, @weight_sum int, @weight_point int DECLARE @table TABLE (id int, weight int) INSERT INTO @table(id, weight) VALUES(1, 50) INSERT INTO @table(id, weight) VALUES(2, 25) INSERT INTO @table(id, weight) VALUES(3, 25) SELECT @weight_sum = SUM(weight) FROM @table SELECT @weight_point = FLOOR(((@weight_sum - 1) * RAND() + 1), 0) SELECT @id = CASE WHEN @weight_point < 0 THEN @id ELSE [table].id END, @weight_point = @weight_point - [table].weight FROM @table [table] ORDER BY [table].Weight DESC
这将遍历表格,将@id设置为每条记录的id值,同时将@weight减一。最终," @ weight_point"将变为负数。这意味着所有先前权重的" SUM"大于随机选择的目标值。这是我们想要的记录,因此从那时起,我们将`@id'设置为自身(忽略表中的任何ID)。
这仅在表中运行一次,但是即使所选值是第一条记录,也必须在整个表中运行。因为平均位置在表格的一半位置(如果按权重递增排序,则位置会更少),因此编写循环可能会更快...(特别是如果权重在同一组中):
DECLARE @id int, @weight_sum int, @weight_point int, @next_weight int, @row_count int DECLARE @table TABLE (id int, weight int) INSERT INTO @table(id, weight) VALUES(1, 50) INSERT INTO @table(id, weight) VALUES(2, 25) INSERT INTO @table(id, weight) VALUES(3, 25) SELECT @weight_sum = SUM(weight) FROM @table SELECT @weight_point = ROUND(((@weight_sum - 1) * RAND() + 1), 0) SELECT @next_weight = MAX(weight) FROM @table SELECT @row_count = COUNT(*) FROM @table SET @weight_point = @weight_point - (@next_weight * @row_count) WHILE (@weight_point > 0) BEGIN SELECT @next_weight = MAX(weight) FROM @table WHERE weight < @next_weight SELECT @row_count = COUNT(*) FROM @table WHERE weight = @next_weight SET @weight_point = @weight_point - (@next_weight * @row_count) END -- # Once the @weight_point is less than 0, we know that the randomly chosen record -- # is in the group of records WHERE [table].weight = @next_weight SELECT @row_count = FLOOR(((@row_count - 1) * RAND() + 1), 0) SELECT @id = CASE WHEN @row_count < 0 THEN @id ELSE [table].id END, @row_count = @row_count - 1 FROM @table [table] WHERE [table].weight = @next_weight ORDER BY [table].Weight DESC