Notice: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/3879522/

finding unique values from a list

Tags: list, r, unique

Asked by Roman Luštrik

Suppose you have a list of values:

x <- list(a=c(1,2,3), b = c(2,3,4), c=c(4,5,6))

I would like to find the unique values across all list elements combined. So far, the following code has done the trick:

unique(unlist(x))

Does anyone know a more efficient way? I have a hefty list with a lot of values and would appreciate any speed-up.

Answered by Gavin Simpson

This solution suggested by Marek is the best answer to the original Q. See below for a discussion of other approaches and why Marek's is the most useful.

> unique(unlist(x, use.names = FALSE))
[1] 1 2 3 4 5 6


Discussion

A faster solution is to compute unique() on the components of your x first and then do a final unique() on those results. This will only work if the components of the list have the same number of unique values, as they do in both examples below. E.g.:

First your version, then my double unique approach:

> unique(unlist(x))
[1] 1 2 3 4 5 6
> unique.default(sapply(x, unique))
[1] 1 2 3 4 5 6

We have to call unique.default as there is a matrix method for unique that keeps one margin fixed; this is fine as a matrix can be treated as a vector.
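
To make the difference between the two calls concrete, here is a minimal sketch using the small x from the question. The variable names m, by_rows and as_vector are only illustrative and do not appear in the original answer.

```r
# Stack the per-component unique values of x as the columns of a 3 x 3 matrix
x <- list(a = c(1, 2, 3), b = c(2, 3, 4), c = c(4, 5, 6))
m <- sapply(x, unique)

# unique() dispatches to the matrix method, which only drops duplicated rows;
# no two rows are equal here, so the matrix keeps all nine cells
by_rows <- unique(m)
dim(by_rows)
# 3 3

# unique.default() ignores the dimensions and works on the underlying vector
as_vector <- unique.default(m)
as_vector
# 1 2 3 4 5 6
```

So calling plain unique() on the sapply() result would not collapse the values across components, which is why the answer calls the default method directly.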

Marek, in the comments to this answer, notes that the slow speed of the unlist approach is potentially due to the names on the list. Marek's solution is to make use of the use.names argument to unlist, which, if used, results in a faster solution than the double unique version above. For the simple x of Roman's post we get

> unique(unlist(x, use.names = FALSE))
[1] 1 2 3 4 5 6
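
As a rough illustration of where the extra work comes from, the sketch below compares the names that unlist builds with and without use.names. The variable names with_names, without_names and same_result are made up for this example.

```r
x <- list(a = c(1, 2, 3), b = c(2, 3, 4), c = c(4, 5, 6))

# By default unlist() builds a name for every element of the result
with_names <- unlist(x)
names(with_names)
# "a1" "a2" "a3" "b1" "b2" "b3" "c1" "c2" "c3"

# With use.names = FALSE that step is skipped and the result is unnamed
without_names <- unlist(x, use.names = FALSE)
names(without_names)
# NULL

# unique() returns an unnamed vector either way, so the answer is unchanged
same_result <- identical(unique(with_names), unique(without_names))
same_result
# TRUE
```

On a list with many elements, that is one name string per value built only to be thrown away by unique().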

Marek's solution will work even when the number of unique elements differs between components.

Here is a larger example with some timings of all three methods:

## Create a large list (1000 components of length 1000 each)
DF <- as.list(data.frame(matrix(sample(1:10, 1000*1000, replace = TRUE), 
                                ncol = 1000)))

Here are results for the three approaches using DF:

> ## Do the three approaches give the same result:
> all.equal(unique.default(sapply(DF, unique)), unique(unlist(DF)))
[1] TRUE
> all.equal(unique(unlist(DF, use.names = FALSE)), unique(unlist(DF)))
[1] TRUE
> ## Timing Roman's original:
> system.time(replicate(10, unique(unlist(DF))))
   user  system elapsed 
  12.884   0.077  12.966
> ## Timing double unique version:
> system.time(replicate(10, unique.default(sapply(DF, unique))))
   user  system elapsed 
  0.648   0.000   0.653
> ## timing of Marek's solution:
> system.time(replicate(10, unique(unlist(DF, use.names = FALSE))))
   user  system elapsed 
  0.510   0.000   0.512

This shows that the double unique approach (applying unique() to the individual components and then unique() to those smaller sets of unique values) is a lot quicker than Roman's original, but this speed-up is purely due to the names on the list DF. If we tell unlist not to use the names, Marek's solution is marginally quicker than the double unique for this problem. As Marek's solution uses the correct tool properly, and it is quicker than the work-around, it is the preferred solution.

The big gotcha with the double unique approach is that it will only work if, as in the two examples here, each component of the input list (DF or x) has the same number of unique values. In such cases sapply simplifies the result to a matrix, which allows us to apply unique.default. If the components of the input list have differing numbers of unique values, the double unique solution will fail.
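
The failure mode can be sketched with a small made-up list y whose components have 3, 2 and 4 unique values. The list y, the other variable names, and the lapply-based variant at the end are assumptions added for illustration; they are not part of the original answer.

```r
# Hypothetical list whose components have 3, 2 and 4 unique values
y <- list(a = c(1, 2, 3), b = c(2, 2, 4), c = c(4, 5, 6, 7))

# sapply() cannot simplify results of unequal length, so it returns a list
per_component <- sapply(y, unique)
is.list(per_component)
# TRUE

# unique.default() then compares whole components, not single values,
# so the result is still a list rather than a vector of unique values
broken <- unique.default(per_component)
is.list(broken)
# TRUE

# Marek's version does not depend on the component lengths
marek <- unique(unlist(y, use.names = FALSE))
marek
# 1 2 3 4 5 6 7

# Assumed repair of the double unique idea: lapply() always returns a list,
# which is then flattened without names before the final unique()
repaired <- unique(unlist(lapply(y, unique), use.names = FALSE))
repaired
# 1 2 3 4 5 6 7
```

Whether the lapply variant is any faster than Marek's one-liner was not measured here; it is shown only as a way to keep the two-step idea working on uneven input.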
