Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/22870198/

Time: 2020-09-11 02:09:16  Source: igfitidea

Is there a more efficient way to replace NULL with NA in a list?

Tags: r, performance, list, null

Asked by Jon M

I quite often come across data that is structured something like this:

employees <- list(
    list(id = 1,
         dept = "IT",
         age = 29,
         sportsteam = "softball"),
    list(id = 2,
         dept = "IT",
         age = 30,
         sportsteam = NULL),
    list(id = 3,
         dept = "IT",
         age = 29,
         sportsteam = "hockey"),
    list(id = 4,
         dept = NULL,
         age = 29,
         sportsteam = "softball"))

In many cases such lists could be tens of millions of items long, so memory use and efficiency are always a concern.

I would like to turn the list into a dataframe but if I run:

library(data.table)
employee.df <- rbindlist(employees)

I get errors because of the NULL values. My normal strategy is to use a function like:

nullToNA <- function(x) {
    x[sapply(x, is.null)] <- NA
    return(x)
}

and then:

employees <- lapply(employees, nullToNA)
employee.df <- rbindlist(employees)

which returns

   id dept age sportsteam
1:  1   IT  29   softball
2:  2   IT  30         NA
3:  3   IT  29     hockey
4:  4   NA  29   softball

However, the nullToNA function is very slow when applied to 10 million cases so it'd be good if there was a more efficient approach.

One thing that seems to slow the process down is that the is.null function can only be applied to one item at a time (unlike is.na, which can scan a full list in one go).
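
One possible way around the element-by-element is.null calls, offered as a sketch rather than something from the original post: base R's lengths() (R 3.2.0 or later) reports the length of every element in a single call, and NULL has length 0. The name nullToNA_lengths is made up for illustration. Note that this variant would also replace other zero-length elements such as character(0), which may or may not be what you want.

```r
# Original helper from the question, kept here for comparison.
nullToNA <- function(x) {
    x[sapply(x, is.null)] <- NA
    return(x)
}

# Hypothetical variant: lengths() measures every element in one call,
# and a NULL element has length 0.
nullToNA_lengths <- function(x) {
    x[lengths(x) == 0L] <- NA
    x
}

rec <- list(id = 2, dept = "IT", age = 30, sportsteam = NULL)
res <- nullToNA_lengths(rec)
str(res)
```

It drops in the same way as the original helper, for example lapply(employees, nullToNA_lengths). Whether it is actually faster on 10 million records would need to be measured.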

Any advice on how to do this operation efficiently on a large dataset?

Accepted answer by Rich Scriven

Many efficiency problems in R are solved by first changing the original data into a form that makes the processes that follow as fast and easy as possible. Usually, this is matrix form.

If you bring all the data together with rbind, your nullToNA function no longer has to search through nested lists, and therefore sapply serves its purpose (looking through a matrix) more efficiently. In theory, this should make the process faster.

Good question, by the way.

> dat <- do.call(rbind, lapply(employees, rbind))
> dat
     id dept age sportsteam
[1,] 1  "IT" 29  "softball"
[2,] 2  "IT" 30  NULL      
[3,] 3  "IT" 29  "hockey"  
[4,] 4  NULL 29  "softball"

> nullToNA(dat)
     id dept age sportsteam
[1,] 1  "IT" 29  "softball"
[2,] 2  "IT" 30  NA        
[3,] 3  "IT" 29  "hockey"  
[4,] 4  NA   29  "softball"
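
Since the speed-up is described as "in theory", a rough timing sketch may help to check it on your own machine. The synthetic data (10,000 records, every fourth one missing sportsteam) is an assumption made for illustration, and the timings will vary with data size and R version, so no expected numbers are shown.

```r
nullToNA <- function(x) {
    x[sapply(x, is.null)] <- NA
    return(x)
}

# Synthetic records (an assumption): every fourth one has a NULL sportsteam.
n <- 1e4
employees <- lapply(seq_len(n), function(i) {
    list(id = i, dept = "IT", age = 29,
         sportsteam = if (i %% 4 == 0) NULL else "softball")
})

# Approach from the question: one nullToNA call per record.
t_nested <- system.time(a <- lapply(employees, nullToNA))

# Approach from this answer: bind into a list-matrix, then one nullToNA call.
t_matrix <- system.time(
    b <- nullToNA(do.call(rbind, lapply(employees, rbind)))
)

print(rbind(nested = t_nested, matrix = t_matrix)[, "elapsed"])
```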

Answer by infominer

A two-step approach creates a dataframe after combining the records with rbind:

employee.df<-data.frame(do.call("rbind",employees))

Now replace the NULLs. I am comparing against the string "NULL" because R does not keep NULL when you load the data; it reads it as a character value.

employee.df.withNA <- sapply(employee.df, function(x) ifelse(x == "NULL", NA, x))

Answer by amanda

A tidyverse solution that I find easier to read is to write a function that works on a single element and map it over all of your NULLs.

I'll use @rich-scriven's rbind and lapply approach to create a matrix, and then turn that into a dataframe.

library(magrittr)

dat <- do.call(rbind, lapply(employees, rbind)) %>% 
  as.data.frame()

dat
#>   id dept age sportsteam
#> 1  1   IT  29   softball
#> 2  2   IT  30       NULL
#> 3  3   IT  29     hockey
#> 4  4 NULL  29   softball

Then we can use purrr::modify_depth() at a depth of 2 to apply replace_x():

replace_x <- function(x, replacement = NA_character_) {
  if (length(x) == 0 || length(x[[1]]) == 0) {
    replacement
  } else {
    x
  }
}

out <- dat %>% 
  purrr::modify_depth(2, replace_x)

out
#>   id dept age sportsteam
#> 1  1   IT  29   softball
#> 2  2   IT  30         NA
#> 3  3   IT  29     hockey
#> 4  4   NA  29   softball

Answer by Barbara Bukhvalova

All of these solutions (I think) hide the fact that the data table is still a list of lists and not a list of vectors (I did not notice this in my application either, until it started throwing unexpected errors during :=). Try this:

data.table(t(sapply(employees, function(x) unlist(lapply(x, function(x) ifelse(is.null(x),NA,x))))))

I believe it works fine, but I am not sure whether it will be slow or whether it can be optimized further.
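
One thing to watch with this approach, as far as I can tell: unlist() coerces each mixed record to a single type, so the matrix (and every resulting column) ends up as character, including id and age. A minimal base-R sketch of the same idea, with type.convert() used afterwards to re-infer column types; the two-record data set is made up for illustration, and the data.frame method of type.convert() assumes a reasonably recent R version.

```r
employees <- list(
    list(id = 1, dept = "IT", age = 29, sportsteam = "softball"),
    list(id = 2, dept = "IT", age = 30, sportsteam = NULL))

# Same idea as the one-liner above, spelled out with base R only.
m <- t(sapply(employees, function(rec) {
    unlist(lapply(rec, function(v) if (is.null(v)) NA else v))
}))
typeof(m)   # each record was coerced to a single type by unlist()

# Let R re-infer a type per column; as.is = TRUE keeps strings as character.
df <- type.convert(as.data.frame(m, stringsAsFactors = FALSE), as.is = TRUE)
str(df)
```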

Answer by MS Berends

I often find do.call() functions hard to read. A solution I use daily (with MySQL output containing "NULL" character values):

NULL2NA <- function(df) {
  df[, 1:length(df)][df[, 1:length(df)] == 'NULL'] <- NA
  return(df)
}

But for all solutions: please remember that NA cannot be used in calculations without na.rm = TRUE, whereas NULL can (c() simply drops it). NaN gives the same problem. For example:

> mean(c(1, 2, 3))
2

> mean(c(1, 2, NA, 3))
NA

> mean(c(1, 2, NULL, 3))
2

> mean(c(1, 2, NaN, 3))
NaN
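
To round this off, a small sketch of the na.rm = TRUE behaviour mentioned above, and of why NULL causes no trouble in the first place: c() drops it while building the vector. The variable names are only for illustration.

```r
x <- c(1, 2, NA, 3)
mean(x)                  # NA: the missing value propagates
mean(x, na.rm = TRUE)    # 2: NA is removed before averaging

z <- c(1, 2, NaN, 3)
mean(z, na.rm = TRUE)    # 2: na.rm also removes NaN

# NULL never makes it into the vector: c() drops it.
y <- c(1, 2, NULL, 3)
length(y)                # 3
```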