Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/3492379/

Time: 2020-09-11 01:32:14  Source: igfitidea

data.frame rows to a list

Tags: list, r, dataframe

Asked by Roman Luštrik

I have a data.frame which I would like to convert to a list by rows, meaning each row would correspond to its own list elements. In other words, I would like a list that is as long as the data.frame has rows.

So far, I've tackled this problem in the following manner, but I was wondering if there's a better way to approach this.

xy.df <- data.frame(x = runif(10),  y = runif(10))

# pre-allocate a list and fill it with a loop
xy.list <- vector("list", nrow(xy.df))
for (i in 1:nrow(xy.df)) {
    xy.list[[i]] <- xy.df[i,]
}

Answered by flodel

Like this:

xy.list <- split(xy.df, seq(nrow(xy.df)))

And if you want the rownames of xy.df to be the names of the output list, you can do:

xy.list <- setNames(split(xy.df, seq(nrow(xy.df))), rownames(xy.df))
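
As a quick check of the approach above, here is a minimal sketch on a small made-up data frame (the names df and rows and the data itself are illustrative, not from the answer). Each element of the result should be a one-row data.frame, and setNames carries the row names over to the list.

```r
# Small illustrative data frame with custom row names (hypothetical data)
df <- data.frame(x = c(1.5, 2.5, 3.5), y = c("a", "b", "c"),
                 row.names = c("r1", "r2", "r3"),
                 stringsAsFactors = FALSE)

# Split into one-row data frames, then name the list after the row names
rows <- setNames(split(df, seq_len(nrow(df))), rownames(df))

length(rows)         # one element per row
class(rows[["r2"]])  # each element is still a data.frame
rows[["r2"]]$y       # so $ access keeps working
```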

Answered by Roman Luštrik

Eureka!

xy.list <- as.list(as.data.frame(t(xy.df)))
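
One caveat worth checking before relying on the transpose approach: t() converts the data.frame to a matrix first, so columns of different types are coerced to a common type. A minimal sketch with a made-up mixed-type data frame (mixed and rows are illustrative names, not from the answer):

```r
# Hypothetical data frame mixing character and integer columns
mixed <- data.frame(x = c("a", "b", "c"), y = 3:1, stringsAsFactors = FALSE)

# t() goes through as.matrix(), which coerces everything to character here
rows <- as.list(as.data.frame(t(mixed), stringsAsFactors = FALSE))

length(rows)      # one element per original row
class(rows[[1]])  # character: the integer column lost its type
```

For an all-numeric data frame such as xy.df this coercion is harmless, which is why the one-liner works well in the question's setting.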

Answered by Qiou Bi

If you want to completely abuse the data.frame (as I do) and like to keep the $ functionality, one way is to split your data.frame into one-row data.frames gathered in a list:

> df = data.frame(x=c('a','b','c'), y=3:1)
> df
  x y
1 a 3
2 b 2
3 c 1

# 'convert' into a list of data.frames
> ldf = lapply(as.list(1:dim(df)[1]), function(x) df[x[1],])
> ldf
[[1]]
  x y
1 a 3

[[2]]
  x y
2 b 2

[[3]]
  x y
3 c 1

# and the 'coolest'
> ldf[[2]]$y
[1] 2

This is not just an intellectual exercise: it lets you 'transform' the data.frame into a list of its rows while keeping $ indexing, which can be useful later with lapply (assuming the function you pass to lapply uses $ indexing).
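
To illustrate that last point, here is a minimal sketch of passing a $-based function to lapply/sapply over the list of one-row data.frames. The helper name labels and the pasted format are assumptions made up for this example; stringsAsFactors = FALSE is set only to keep the behaviour the same across R versions.

```r
df <- data.frame(x = c("a", "b", "c"), y = 3:1, stringsAsFactors = FALSE)

# Same idea as above: one one-row data.frame per list element
ldf <- lapply(seq_len(nrow(df)), function(i) df[i, ])

# A function that relies on $ indexing, applied to each row
labels <- sapply(ldf, function(row) paste0(row$x, ":", row$y))
labels
```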

Answered by Mike Stanley

A more modern solution uses only purrr::transpose:

library(purrr)
iris[1:2,] %>% purrr::transpose()
#> [[1]]
#> [[1]]$Sepal.Length
#> [1] 5.1
#> 
#> [[1]]$Sepal.Width
#> [1] 3.5
#> 
#> [[1]]$Petal.Length
#> [1] 1.4
#> 
#> [[1]]$Petal.Width
#> [1] 0.2
#> 
#> [[1]]$Species
#> [1] 1
#> 
#> 
#> [[2]]
#> [[2]]$Sepal.Length
#> [1] 4.9
#> 
#> [[2]]$Sepal.Width
#> [1] 3
#> 
#> [[2]]$Petal.Length
#> [1] 1.4
#> 
#> [[2]]$Petal.Width
#> [1] 0.2
#> 
#> [[2]]$Species
#> [1] 1

Answered by lmo

I was working on this today for a data.frame (really a data.table) with millions of observations and 35 columns. My goal was to return a list of data.frames (data.tables) each with a single row. That is, I wanted to split each row into a separate data.frame and store these in a list.

Here are two methods I came up with that were roughly 3 times faster than split(dat, seq_len(nrow(dat))) for that data set. Below, I benchmark the three methods on a 7500 row, 5 column data set (iris repeated 50 times).

library(data.table)
library(microbenchmark)

microbenchmark(
split={dat1 <- split(dat, seq_len(nrow(dat)))},
setDF={dat2 <- lapply(seq_len(nrow(dat)),
                  function(i) setDF(lapply(dat, "[", i)))},
attrDT={dat3 <- lapply(seq_len(nrow(dat)),
           function(i) {
             tmp <- lapply(dat, "[", i)
             attr(tmp, "class") <- c("data.table", "data.frame")
             setDF(tmp)
           })},
datList = {datL <- lapply(seq_len(nrow(dat)),
                          function(i) lapply(dat, "[", i))},
times=20
) 

This returns

Unit: milliseconds
       expr      min       lq     mean   median        uq       max neval
      split 861.8126 889.1849 973.5294 943.2288 1041.7206 1250.6150    20
      setDF 459.0577 466.3432 511.2656 482.1943  500.6958  750.6635    20
     attrDT 399.1999 409.6316 461.6454 422.5436  490.5620  717.6355    20
    datList 192.1175 201.9896 241.4726 208.4535  246.4299  411.2097    20

While the differences are not as large as in my previous test, the straight setDF method is significantly faster at all levels of the distribution of runs, with max(setDF) < min(split), and the attr method is typically more than twice as fast.

A fourth method is the extreme champion: a simple nested lapply that returns a nested list. This method exemplifies the cost of constructing a data.frame from a list. Moreover, all methods I tried with the data.frame function were roughly an order of magnitude slower than the data.table techniques.

data

dat <- vector("list", 50)
for(i in 1:50) dat[[i]] <- iris
dat <- setDF(rbindlist(dat))
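
The nested lapply idea from the benchmark can be tried on a small input without data.table. A minimal sketch, using the first three rows of iris as stand-in data (small and row_lists are illustrative names, not from the answer):

```r
small <- iris[1:3, ]

# For each row index, pull element i out of every column. The result is a
# plain list of named lists, so no data.frame is constructed per row.
row_lists <- lapply(seq_len(nrow(small)),
                    function(i) lapply(small, "[", i))

length(row_lists)            # 3, one entry per row
names(row_lists[[1]])        # the column names of iris
row_lists[[2]]$Sepal.Length  # 4.9
```

Because each inner element is a list rather than a data.frame, $ access by column name still works, but data.frame methods do not apply to the rows.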

Answered by Artem Klevtsov

It seems that the current version of the purrr package (0.2.2 at the time of writing) is the fastest solution:

by_row(x, function(v) list(v)[[1L]], .collate = "list")$.out

Let's compare the most interesting solutions:

data("Batting", package = "Lahman")
x <- Batting[1:10000, 1:10]
library(benchr)
library(purrr)
benchmark(
    split = split(x, seq_len(.row_names_info(x, 2L))),
    mapply = .mapply(function(...) structure(list(...), class = "data.frame", row.names = 1L), x, NULL),
    purrr = by_row(x, function(v) list(v)[[1L]], .collate = "list")$.out
)

Results:

Benchmark summary:
Time units : milliseconds 
  expr n.eval   min  lw.qu median   mean  up.qu  max  total relative
 split    100 983.0 1060.0 1130.0 1130.0 1180.0 1450 113000     34.3
mapply    100 826.0  894.0  963.0  972.0 1030.0 1320  97200     29.3
 purrr    100  24.1   28.6   32.9   44.9   40.5  183   4490      1.0

Also we can get the same result with Rcpp:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List df2list(const DataFrame& x) {
    std::size_t nrows = x.rows();
    std::size_t ncols = x.cols();
    CharacterVector nms = x.names();
    List res(no_init(nrows));
    for (std::size_t i = 0; i < nrows; ++i) {
        List tmp(no_init(ncols));
        for (std::size_t j = 0; j < ncols; ++j) {
            switch(TYPEOF(x[j])) {
                case INTSXP: {
                    if (Rf_isFactor(x[j])) {
                        IntegerVector t = as<IntegerVector>(x[j]);
                        RObject t2 = wrap(t[i]);
                        t2.attr("class") = "factor";
                        t2.attr("levels") = t.attr("levels");
                        tmp[j] = t2;
                    } else {
                        tmp[j] = as<IntegerVector>(x[j])[i];
                    }
                    break;
                }
                case LGLSXP: {
                    tmp[j] = as<LogicalVector>(x[j])[i];
                    break;
                }
                case CPLXSXP: {
                    tmp[j] = as<ComplexVector>(x[j])[i];
                    break;
                }
                case REALSXP: {
                    tmp[j] = as<NumericVector>(x[j])[i];
                    break;
                }
                case STRSXP: {
                    tmp[j] = as<std::string>(as<CharacterVector>(x[j])[i]);
                    break;
                }
                default: stop("Unsupported type '%s'.", type2name(x));
            }
        }
        tmp.attr("class") = "data.frame";
        tmp.attr("row.names") = 1;
        tmp.attr("names") = nms;
        res[i] = tmp;
    }
    res.attr("names") = x.attr("row.names");
    return res;
}

Now compare with purrr:

benchmark(
    purrr = by_row(x, function(v) list(v)[[1L]], .collate = "list")$.out,
    rcpp = df2list(x)
)

Results:

Benchmark summary:
Time units : milliseconds 
 expr n.eval  min lw.qu median mean up.qu   max total relative
purrr    100 25.2  29.8   37.5 43.4  44.2 159.0  4340      1.1
 rcpp    100 19.0  27.9   34.3 35.8  37.2  93.8  3580      1.0

Answered by Cro-Magnon

The best way for me was:

Example data:

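# values chosen to match the table shown below
Var1 <- c("X1", "X4", "X7")
Var2 <- c("X2", "X5", "X8")
Var3 <- c("X3", "X6", "X9")

# a data.frame (not a matrix) so that data$lists can be assigned later
data <- data.frame(ID = 1:3, Var1, Var2, Var3, stringsAsFactors = FALSE)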

ID    Var1   Var2  Var3 
1      X1     X2    X3
2      X4     X5    X6
3      X7     X8    X9

We load the BBmisc library:

library(BBmisc)

data$lists<-convertRowsToList(data[,2:4])

And the result will be:

ID    Var1   Var2  Var3  lists
1      X1     X2    X3   list("X1", "X2", "X3")
2      X4     X5    X6   list("X4", "X5", "X6")
3      X7     X8    X9   list("X7", "X8", "X9")

Answered by user3553260

An alternative is to convert the df to a matrix and then apply lapply over it: ldf <- lapply(as.matrix(myDF), function(x) x)
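
It is worth checking the shape of the result here: calling lapply on a matrix visits every cell, so the list has one element per cell rather than one per row. A minimal sketch with a made-up data frame (the values in myDF are hypothetical), with a row-wise variant over the same matrix for comparison:

```r
myDF <- data.frame(a = 1:3, b = 4:6)
m <- as.matrix(myDF)

# As written in the answer: one list element per matrix cell
per_cell <- lapply(m, function(x) x)
length(per_cell)  # 6 = 3 rows * 2 columns

# Row-wise variant: index the matrix by row number instead
per_row <- lapply(seq_len(nrow(m)), function(i) m[i, ])
length(per_row)   # 3
per_row[[2]]      # named vector with a = 2, b = 5
```

The row-wise variant is a swapped-in adjustment, not the answer's original code. It still goes through as.matrix, so mixed column types would be coerced to a common type.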

Answered by MrHopko

Another alternative uses library(purrr) (which seems to be a bit quicker on large data.frames):

flatten(by_row(xy.df, ..f = function(x) flatten_chr(x), .labels = FALSE))

Answered by Ronak Shah

A couple more options:

With asplit

asplit(xy.df, 1)
#[[1]]
#     x      y 
#0.1137 0.6936 

#[[2]]
#     x      y 
#0.6223 0.5450 

#[[3]]
#     x      y 
#0.6093 0.2827 
#....


With split and row

split(xy.df, row(xy.df)[, 1])

#$`1`
#       x      y
#1 0.1137 0.6936

#$`2`
#       x     y
#2 0.6223 0.545

#$`3`
#       x      y
#3 0.6093 0.2827
#....

data

set.seed(1234)
xy.df <- data.frame(x = runif(10),  y = runif(10))