
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/21255424/

Time: 2020-09-01 00:50:07  Source: igfitidea

SAS distinct in proc sql vs proc sort nodupkey

Tags: sql, sorting, sas, distinct, proc

Asked by user2280549

I have the following dataset:

data work.dataset;
input a b c;
datalines;
27 93 71 
27 93 72
46 68 75
55 55 33
46 68 68
34 34 32
45 67 88
56 75 22
34 34 32
;
run;

I want to select all distinct records from the first two columns, so I wrote:

proc sql;
create table work.output1 as
select distinct t1.a,
t1.b
from work.dataset t1;
quit;

But now I want to know which value of c in the original dataset sits next to each (a, b) combination that appears in the output. Is there a way to find out? I tried the following proc sort, but I don't know whether it works the same way as selecting distinct records in proc sql.

proc sort data = work.dataset out = work.output2 NODUPKEY;
by a b;
run;

Thanks in advance for your help.

Answered by Joe

PROC SORT with NODUPKEY will always return the physical first record; that is, with the data as you list it, c=71 will always be kept. PROC SQL will not necessarily return any particular record. You could ask for min or max, but you could not guarantee the first record in sort order regardless of how you wrote the query; SQL will often re-sort the data as needed to accomplish the query as efficiently as possible.

They will be identical insofar as they both return the same number of records, if that is your concern.

You cannot accomplish exactly the same thing in a straightforward manner in SQL. Because SQL has no concept of row ordering, you would either need a rule for choosing which c to keep (max(c), min(c), etc.), or you would have to add a row counter and choose the lowest value of that.

For example:

data work.dataset;
input a b c;
rowcounter=_n_;
datalines;
27 93 71 
27 93 72
46 68 75
55 55 33
46 68 68
34 34 32
45 67 88
56 75 22
34 34 32
;
run;

proc sql;
select a,b,min(rowcounter*100+c)-min(rowcounter*100) as c
from work.dataset
group by a,b;
quit;

That uses a cheat: it relies on rowcounter*100 always dominating the size of c, which holds here because every c is a non-negative value below 100. Of course, if your c doesn't have values appropriate for that, this won't work and you're better off merging it on separately.
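
The "merging it on separately" route could look roughly like the sketch below. It assumes the rowcounter variable created in the data step above; the table name work.output3 and the column name first_row are made up for illustration. The inner query finds the lowest rowcounter for each (a, b) pair, and the join brings back the c value from that row, so no assumption about the range of c is needed.

```sas
proc sql;
  /* work.output3 and first_row are hypothetical names */
  create table work.output3 as
  select t.a, t.b, t.c
  from work.dataset as t
       inner join
       /* lowest row counter per (a, b) combination */
       (select a, b, min(rowcounter) as first_row
        from work.dataset
        group by a, b) as f
       on t.rowcounter = f.first_row
  order by t.a, t.b;
quit;
```

Because rowcounter is unique per row, joining on it alone is enough to pick exactly one row per (a, b) pair.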

If you are interested in the SQL solution, you may consider posting that explicitly as a separate question, as the SQL-only folk will then answer it.

Answered by Laurent de Walick

NODUPKEY will return one observation for each key. In your example, only one of the two observations with a=27 and b=93 will be kept; either c=71 or c=72 will be lost.

The NODUPREC option will remove duplicate records. Both observations with a=27 and b=93 will be kept, but only one of the two observations with the values a=34, b=34 and c=32 will be kept.
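
As a rough side-by-side sketch of the two options on the question's dataset (the output table names work.by_key and work.by_record are made up for illustration): the second step sorts by all three variables, on the assumption that this is a simple way to make identical records adjacent, since NODUPRECS only compares each record with the one before it.

```sas
/* One observation per (a, b) key; work.by_key is a hypothetical name */
proc sort data=work.dataset out=work.by_key nodupkey;
  by a b;
run;

/* Drops only fully identical records; work.by_record is a hypothetical name */
proc sort data=work.dataset out=work.by_record noduprecs;
  by a b c;
run;
```

With the nine rows from the question, work.by_key should end up with 6 observations and work.by_record with 8, because only the repeated 34 34 32 record is a full duplicate.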

Answered by EconomySizeAl

SQL will not return a value for variable c in the above query, as it is not listed in the select statement. I think what you may be looking for is:

proc sql;
create table work.output1 as
select t1.a,
t1.b,
min(t1.c) as c
from work.dataset t1
group by a, b;
quit;

If you would like the maximum value of c, then you can replace the function with max(t1.c) as c, or use any of the other SQL functions to select your value. If you want to replicate PROC SORT nodupkey and take the first value listed, you would need to use the function monotonic (I know... unsupported by SAS, but it's there so whatever). Your code would now be:

proc sql;
create table work.output1 as
select monotonic() as rownum,
t1.a,
t1.b,
t1.c
from work.dataset t1
group by a, b
having calculated rownum = min(calculated rownum);
quit;