Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original address and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/3642535/

Date: 2020-09-11 01:32:38  Source: igfitidea

Creating an R dataframe row-by-row

Tags: list, r, dataframe

Asked by David B

I would like to construct a dataframe row by row in R. I've done some searching, and all I found was the suggestion to create an empty list, keep a list index scalar, then each time add a single-row dataframe to the list and advance the list index by one. Finally, call do.call(rbind, ) on the list.
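
The list-based approach described above could be sketched roughly as follows. The names rows, idx and result, and the toy columns x and square, are made up for illustration only.

```r
rows <- list()   # empty list that will hold one single-row data frame per step
idx <- 0         # list index scalar, advanced by hand

for (e in 1:3) {
  idx <- idx + 1
  rows[[idx]] <- data.frame(x = e, square = e^2)  # store a single-row data frame
}

# Bind all single-row data frames into one data frame at the end
result <- do.call(rbind, rows)
print(result)
```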

While this works, it seems very cumbersome. Isn't there an easier way for achieving the same goal?

Obviously, I am referring to cases where I can't use some apply function and explicitly need to create the dataframe row by row. At the very least, is there a way to push onto the end of a list instead of explicitly keeping track of the last index used?

Accepted answer by Dirk Eddelbuettel

You can grow them row by row by appending or using rbind().

That does not mean you should. Dynamically growing structures is one of the least efficient ways to code in R.

If you can, allocate your entire data.frame up front:

N <- 1e4  # total number of rows to preallocate--possibly an overestimate

DF <- data.frame(num=rep(NA, N), txt=rep("", N),  # as many cols as you need
                 stringsAsFactors=FALSE)          # you don't know levels yet

and then, during your operations, insert one row at a time:

DF[i, ] <- list(1.4, "foo")

That should work for an arbitrary data.frame and be much more efficient. If you overshot N, you can always trim the unused rows at the end.
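
Put together, the preallocate, fill, and trim steps might look like the sketch below. The counter n_used and the small value of N are assumptions for illustration; NA_real_ is used here instead of NA only so the num column starts out numeric.

```r
N <- 10  # assumed upper bound on the number of rows (deliberately too large)

DF <- data.frame(num = rep(NA_real_, N), txt = rep("", N),
                 stringsAsFactors = FALSE)

n_used <- 0  # hypothetical counter of rows actually filled
for (i in 1:4) {
  n_used <- n_used + 1
  DF[n_used, ] <- list(i * 1.5, paste0("row", i))  # fill one preallocated row
}

# Trim the rows that were preallocated but never used
DF <- DF[seq_len(n_used), , drop = FALSE]
print(DF)
```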

Answer by mbq

One can add rows to NULL:

df <- NULL
while (...) {
  # some code that generates the new row
  df <- rbind(df, row)
}

For instance:

df <- NULL
for (e in 1:10) df <- rbind(df, data.frame(x = e, square = e^2, even = factor(e %% 2 == 0)))
print(df)

Answer by hatmatrix

This is a silly example of how to use do.call(rbind, ) on the output of Map() [which is similar to lapply()]:

> DF <- do.call(rbind,Map(function(x) data.frame(a=x,b=x+1),x=1:3))
> DF
  a b
1 1 2
2 2 3
3 3 4
> class(DF)
[1] "data.frame"

I use this construct quite often.

Answer by Allan Stokes

The reason I like Rcpp so much is that I don't always get how R Core thinks, and with Rcpp, more often than not, I don't have to.

Speaking philosophically, you're in a state of sin with regard to the functional paradigm, which tries to ensure that every value appears independent of every other value; changing one value should never cause a visible change in another value, the way you get with pointers sharing representation in C.

The problems arise when functional programming signals the small craft to move out of the way, and the small craft replies "I'm a lighthouse". Making a long series of small changes to a large object that you also want to keep processing in the meantime puts you squarely in lighthouse territory.

In the C++ STL, push_back() is a way of life. It doesn't try to be functional, but it does try to accommodate common programming idioms efficiently.

With some cleverness behind the scenes, you can sometimes arrange to have one foot in each world. Snapshot-based file systems are a good example (they evolved from concepts such as union mounts, which also ply both sides).

If R Core wanted to do this, underlying vector storage could function like a union mount. One reference to the vector storage might be valid for subscripts 1:N, while another reference to the same storage is valid for subscripts 1:(N+1). There could be reserved storage not yet validly referenced by anything but convenient for a quick push_back(). You don't violate the functional concept when appending outside the range that any existing reference considers valid.

Eventually, as you keep appending rows incrementally, you run out of reserved storage. You'll then need to create new copies of everything, with the storage multiplied by some factor. The STL implementations I've used tend to multiply storage by 2 when extending an allocation. I thought I read in R Internals that there is a memory structure where the storage increments by 20%. Either way, growth operations occur with logarithmic frequency relative to the total number of elements appended. On an amortized basis, this is usually acceptable.
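
To make the "logarithmic frequency" claim concrete, here is a toy sketch of a doubling buffer that counts how often it has to grow. make_buffer() and push_back() are hypothetical helpers written for this illustration; they do not reflect how R or any STL implementation actually manages memory, and R's own copy semantics are ignored here.

```r
# Hypothetical buffer: a numeric vector with spare capacity plus bookkeeping
make_buffer <- function(capacity = 1L) {
  list(data = numeric(capacity), size = 0L, reallocs = 0L)
}

# Hypothetical push_back(): double the capacity only when the buffer is full
push_back <- function(buf, value) {
  if (buf$size == length(buf$data)) {
    buf$data <- c(buf$data, numeric(length(buf$data)))  # capacity * 2
    buf$reallocs <- buf$reallocs + 1L
  }
  buf$size <- buf$size + 1L
  buf$data[buf$size] <- value
  buf
}

buf <- make_buffer()
for (v in 1:1000) buf <- push_back(buf, v)

# 1000 appends trigger only 10 growth steps (capacity 1 -> 1024)
print(c(size = buf$size, capacity = length(buf$data), reallocs = buf$reallocs))
```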

As tricks behind the scenes go, I've seen worse. Every time you push_back() a new row onto the dataframe, a top-level index structure would need to be copied. The new row could be appended onto the shared representation without impacting any old functional values. I don't even think it would complicate the garbage collector much; since I'm not proposing push_front(), all references are prefix references to the front of the allocated vector storage.

Answer by John

Dirk Eddelbuettel's answer is the best; here I just note that you can get away with not pre-specifying the dataframe dimensions or data types, which is sometimes useful if you have multiple data types and lots of columns:

row1 <- list("a", 1, FALSE)  # use 'list', not 'c' or 'cbind'!
row2 <- list("b", 2, TRUE)

df <- data.frame(row1, stringsAsFactors = FALSE)  # first row
df <- rbind(df, row2)  # now this works as you'd expect

Answer by phili_b

I've found this way to create a dataframe row by row without using a matrix.

With automatic column names:

df<-data.frame(
        t(data.frame(c(1,"a",100),c(2,"b",200),c(3,"c",300)))
        ,row.names = NULL,stringsAsFactors = FALSE
    )

With explicit column names:

df<-setNames(
        data.frame(
            t(data.frame(c(1,"a",100),c(2,"b",200),c(3,"c",300)))
            ,row.names = NULL,stringsAsFactors = FALSE
        ), 
        c("col1","col2","col3")
    )

Answer by Arthur Yip

Depending on the format of your new row, you might use tibble::add_row if the row is simple and can be specified in "value pairs". Or you could use dplyr::bind_rows, "an efficient implementation of the common pattern of do.call(rbind, dfs)".
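
As a rough sketch of both options, assuming the tibble and dplyr packages are installed (the data frame and its columns x and y are invented for illustration):

```r
# Assumes the tibble and dplyr packages are installed
df <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)

# add_row(): the new row is given as name = value pairs
df2 <- tibble::add_row(df, x = 3L, y = "c")

# bind_rows(): append a whole list of single-row data frames in one call
rows <- lapply(4:6, function(i) {
  data.frame(x = i, y = letters[i], stringsAsFactors = FALSE)
})
df3 <- dplyr::bind_rows(df2, rows)
print(df3)
```

If the rows are produced in a loop anyway, collecting them in a list and calling bind_rows once at the end avoids growing the data frame on every iteration.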

Answer by Keegan Smith

If you have vectors destined to become rows, concatenate them using c(), pass them to a matrix row-by-row, and convert that matrix to a dataframe.

For example, the rows

dummydata1=c(2002,10,1,12.00,101,426340.0,4411238.0,3598.0,0.92,57.77,4.80,238.29,-9.9)
dummydata2=c(2002,10,2,12.00,101,426340.0,4411238.0,3598.0,-3.02,78.77,-9999.00,-99.0,-9.9)
dummydata3=c(2002,10,8,12.00,101,426340.0,4411238.0,3598.0,-5.02,88.77,-9999.00,-99.0,-9.9)

can be converted to a data frame thus:

dummyset=c(dummydata1,dummydata2,dummydata3)
col.len=length(dummydata1)
dummytable=data.frame(matrix(data=dummyset,ncol=col.len,byrow=TRUE))

Admittedly, I see two major limitations: (1) this only works with single-mode data, and (2) you must know the final number of columns for this to work (i.e., I'm assuming that you're not working with a ragged array whose greatest row length is unknown a priori).
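
The single-mode limitation can be seen in a small sketch: as soon as one element is a string, c() coerces the whole row to character, so every column of the resulting data frame is character too. The names mixed1, mixed2, tab_raw and tab are made up for this illustration, and re-typing the columns with type.convert() afterwards is one possible workaround, not part of the approach above.

```r
mixed1 <- c(2002, 10, "a")  # one string coerces the whole vector to character
mixed2 <- c(2003, 11, "b")

tab_raw <- data.frame(matrix(c(mixed1, mixed2), ncol = length(mixed1), byrow = TRUE),
                      stringsAsFactors = FALSE)
print(sapply(tab_raw, class))  # every column is character

# Possible workaround: let R re-guess each column's type afterwards
tab <- tab_raw
tab[] <- lapply(tab, type.convert, as.is = TRUE)
print(sapply(tab, class))
```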

This solution seems simple, but from my experience with type conversions in R, I'm sure it creates new challenges down the line. Can anyone comment on this?
