How do I make a list of data frames?
Original source: http://stackoverflow.com/questions/17499013/
This page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors on Stack Overflow (not to the maintainer of this page).
Asked by Ben
How do I make a list of data frames and how do I access each of those data frames from the list?
For example, how can I put these data frames in a list?
d1 <- data.frame(y1 = c(1, 2, 3),
                 y2 = c(4, 5, 6))
d2 <- data.frame(y1 = c(3, 2, 1),
                 y2 = c(6, 5, 4))
Accepted answer by Peyton
This isn't related to your question, but you want to use = and not <- within the function call. If you use <-, you'll end up creating variables y1 and y2 in whatever environment you're working in:
d1 <- data.frame(y1 <- c(1, 2, 3), y2 <- c(4, 5, 6))
y1
# [1] 1 2 3
y2
# [1] 4 5 6
This won't have the seemingly desired effect of creating column names in the data frame:
d1
#   y1....c.1..2..3. y2....c.4..5..6.
# 1                1                4
# 2                2                5
# 3                3                6
The = operator, on the other hand, will associate your vectors with arguments to data.frame.
As for your question, making a list of data frames is easy:
d1 <- data.frame(y1 = c(1, 2, 3), y2 = c(4, 5, 6))
d2 <- data.frame(y1 = c(3, 2, 1), y2 = c(6, 5, 4))
my.list <- list(d1, d2)
You access the data frames just like you would access any other list element:
my.list[[1]]
#   y1 y2
# 1  1  4
# 2  2  5
# 3  3  6
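If you would rather refer to the data frames by name than by position, the list elements can be named when the list is created. A minimal sketch, reusing d1 and d2 from the question; the element names first and second are arbitrary labels chosen for illustration:

```r
d1 <- data.frame(y1 = c(1, 2, 3), y2 = c(4, 5, 6))
d2 <- data.frame(y1 = c(3, 2, 1), y2 = c(6, 5, 4))

# Name the elements at creation time ("first" and "second" are made-up labels).
my.list <- list(first = d1, second = d2)

length(my.list)      # number of data frames in the list
names(my.list)       # the element names
my.list$second       # access by name with $
my.list[["second"]]  # the same data frame, by name with [[
my.list[[2]]         # positional access still works
```

Names and positions can be mixed freely, so naming the elements costs nothing.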
Answer by Gregor Thomas
The other answers show you how to make a list of data.frames when you already have a bunch of data.frames, e.g., d1, d2, .... Having sequentially named data frames is a problem, and putting them in a list is a good fix, but best practice is to avoid having a bunch of data.frames not in a list in the first place.
The other answers give plenty of detail on how to assign data frames to list elements, access them, etc. We'll cover that a little here too, but the main point is: don't wait until you have a bunch of data.frames to add them to a list. Start with the list.
The rest of this answer will cover some common cases where you might be tempted to create sequential variables, and show you how to go straight to lists. If you're new to lists in R, you might also want to read the question "What's the difference between [[ and [ in accessing elements of a list?".
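As a quick illustration of that difference, the sketch below applies both operators to the same list. The data frames are the ones from the question; the variable names sub_list and one_df are made up for this example:

```r
my.list <- list(d1 = data.frame(y1 = c(1, 2, 3), y2 = c(4, 5, 6)),
                d2 = data.frame(y1 = c(3, 2, 1), y2 = c(6, 5, 4)))

sub_list <- my.list[1]    # single bracket: still a list, now of length 1
one_df   <- my.list[[1]]  # double bracket: the data frame stored in that slot

class(sub_list)  # "list"
class(one_df)    # "data.frame"
```

In short, [ keeps the container and [[ reaches inside it, which is why functions that expect a data frame usually need [[.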
Lists from the start
Don't ever create d1, d2, d3, ..., dn in the first place. Create a list d with n elements.
Reading multiple files into a list of data frames
This is done pretty easily when reading in files. Maybe you've got files data1.csv, data2.csv, ... in a directory. Your goal is a list of data.frames called my_data. The first thing you need is a vector with all the file names. You can construct this with paste (e.g., my_files = paste0("data", 1:5, ".csv")), but it's probably easier to use list.files to grab all the appropriate files: my_files <- list.files(pattern = "\\.csv$"). You can use regular expressions to match the files; read more about regular expressions in other questions if you need help there. This way you can grab all CSV files even if they don't follow a nice naming scheme. Or you can use a fancier regex pattern if you need to pick certain CSV files out from a bunch of them.
At this point, most R beginners will use a for loop, and there's nothing wrong with that; it works just fine.
my_data <- list()
for (i in seq_along(my_files)) {
  my_data[[i]] <- read.csv(file = my_files[i])
}
A more R-like way to do it is with lapply, which is a shortcut for the above:
my_data <- lapply(my_files, read.csv)
Of course, substitute another data import function for read.csv as appropriate. readr::read_csv or data.table::fread will be faster, or you may need a different function for a different file type.
Either way, it's handy to name the list elements to match the files:
names(my_data) <- gsub("\\.csv$", "", my_files)
# or, if you prefer the consistent syntax of stringr
names(my_data) <- stringr::str_replace(my_files, pattern = "\\.csv$", replacement = "")
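To see the read-and-name workflow run end to end, here is a minimal sketch. It assumes nothing about your own files: it first writes two small CSV files (data1.csv and data2.csv, with made-up contents) into a temporary folder. It also swaps in tools::file_path_sans_ext together with basename for the gsub call, because full.names = TRUE makes list.files return whole paths rather than bare file names.

```r
# Create a scratch folder with two small CSV files (hypothetical names and data).
tmp_dir <- file.path(tempdir(), "df_list_demo")
dir.create(tmp_dir, showWarnings = FALSE)
write.csv(data.frame(y1 = 1:3, y2 = 4:6),
          file.path(tmp_dir, "data1.csv"), row.names = FALSE)
write.csv(data.frame(y1 = 3:1, y2 = 6:4),
          file.path(tmp_dir, "data2.csv"), row.names = FALSE)

# full.names = TRUE returns paths that read.csv can open from any working directory.
my_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
my_data <- lapply(my_files, read.csv)

# Name each element after its file, without the directory or the extension.
names(my_data) <- tools::file_path_sans_ext(basename(my_files))

names(my_data)
my_data$data2
```

Adding a third file to the folder changes nothing in the code; the list simply gets one more element.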
Splitting a data frame into a list of data frames
This is super easy: the base function split() does it for you. You can split by a column (or columns) of the data, or by anything else you want:
mt_list = split(mtcars, f = mtcars$cyl)
# This gives a list of three data frames, one for each value of cyl
This is also a nice way to break a data frame into pieces for cross-validation. Maybe you want to split mtcars into training, test, and validation pieces.
groups = sample(c("train", "test", "validate"),
                size = nrow(mtcars), replace = TRUE)
mt_split = split(mtcars, f = groups)
# and mt_split has appropriate names already!
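To check what split() actually produced, it helps to look at the names and sizes of the pieces. A short sketch using the cyl split from above (mtcars ships with base R):

```r
mt_list <- split(mtcars, f = mtcars$cyl)

names(mt_list)           # "4" "6" "8"
sapply(mt_list, nrow)    # number of rows in each piece
head(mt_list[["4"]])     # the names are strings, so quote them inside [[ ]]
```

Because the list is named after the values of the splitting variable, each piece can be pulled out by the group it represents.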
Simulating a list of data frames
Maybe you're simulating data, something like this:
my_sim_data = data.frame(x = rnorm(50), y = rnorm(50))
But who does only one simulation? You want to do this 100 times, 1000 times, more! But you don't want 10,000 data frames in your workspace. Use replicate and put them in a list:
sim_list = replicate(n = 10,
                     expr = {data.frame(x = rnorm(50), y = rnorm(50))},
                     simplify = FALSE)
In this case especially, you should also consider whether you really need separate data frames, or whether a single data frame with a "group" column would work just as well. Using data.table or dplyr, it's quite easy to do things "by group" to a data frame.
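The single-data-frame alternative can be sketched in base R as well, without data.table or dplyr. The column name sim, the seed, and the choice of aggregate are assumptions made for this illustration, not part of the original answer:

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
sim_list <- replicate(n = 10,
                      expr = data.frame(x = rnorm(50), y = rnorm(50)),
                      simplify = FALSE)

# Stack the simulations and record which one each row came from.
sim_df <- do.call(rbind, sim_list)
sim_df$sim <- rep(seq_along(sim_list), times = sapply(sim_list, nrow))

# One row of means per simulation, computed "by group" on the single data frame.
means_by_sim <- aggregate(cbind(x, y) ~ sim, data = sim_df, FUN = mean)
head(means_by_sim)
```

Whether the list or the stacked data frame is more convenient depends on what you do next; both avoid filling the workspace with numbered objects.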
I didn't put my data in a list :( I will next time, but what can I do now?
If they're an odd assortment (which is unusual), you can simply assign them:
mylist <- list()
mylist[[1]] <- mtcars
mylist[[2]] <- data.frame(a = rnorm(50), b = runif(50))
...
If you have data frames named in a pattern, e.g., df1, df2, df3, and you want them in a list, you can get them if you can write a regular expression to match the names. Something like:
df_list = mget(ls(pattern = "df[0-9]"))
# this would match any object with "df" followed by a digit in its name
# you can test what objects will be got by just running the
ls(pattern = "df[0-9]")
# part and adjusting the pattern until it gets the right objects.
Generally, mget is used to get multiple objects and return them in a named list. Its counterpart get is used to get a single object and return it (not in a list).
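The mget approach can be tried safely in a scratch environment, so the result does not depend on whatever happens to be in your workspace. In this sketch the environment env and the objects in it are made up for illustration, and the pattern is anchored with ^ and $, which is a stricter choice than the unanchored pattern shown above:

```r
env <- new.env()  # stands in for your workspace
assign("df1", data.frame(a = 1:2), envir = env)
assign("df2", data.frame(a = 3:4), envir = env)
assign("my_df10_backup", "not wanted", envir = env)

# Anchoring avoids matching names that merely contain "df" followed by a digit.
df_names <- ls(envir = env, pattern = "^df[0-9]+$")
df_list <- mget(df_names, envir = env)

names(df_list)
```

Running the ls() call on its own first, as suggested above, is the easiest way to confirm the pattern picks up the right objects.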
Combining a list of data frames into a single data frame
A common task is combining a list of data frames into one big data frame. If you want to stack them on top of each other, you would use rbind for a pair of them, but for a list of data frames here are three good choices:
# base option - slower but no extra dependencies
big_data = do.call(what = rbind, args = df_list)
# data table and dplyr have nice functions for this that
# - are much faster
# - add id columns to identify the source
# - fill in missing values if some data frames have more columns than others
# see their help pages for details
big_data = data.table::rbindlist(df_list)
big_data = dplyr::bind_rows(df_list)
(Similarly, use cbind or dplyr::bind_cols for columns.)
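The base option does not add an id column by itself, but one can be attached before stacking. A minimal sketch, assuming a named list; the column name source and the toy data are made up for this example. (With the packages, the corresponding arguments are idcol in data.table::rbindlist and .id in dplyr::bind_rows.)

```r
df_list <- list(a = data.frame(x = 1:2), b = data.frame(x = 3:5))

# Tag each data frame with the name of its list element, then stack them.
tagged <- Map(function(df, id) cbind(source = id, df), df_list, names(df_list))
big_data <- do.call(rbind, tagged)
rownames(big_data) <- NULL  # drop the "a.1", "a.2", ... row names rbind creates

big_data
```

This only works cleanly when all the data frames share the same columns, which is also what rbind itself requires.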
To merge (join) a list of data frames, you can see these answers. Often, the idea is to use Reduce with merge (or some other joining function) to get them together.
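A small sketch of the Reduce-with-merge idea, using three made-up tables that share an id column (the table contents and the key name are assumptions for illustration):

```r
tables <- list(
  data.frame(id = 1:3, height = c(150, 160, 170)),
  data.frame(id = 1:3, weight = c(50, 60, 70)),
  data.frame(id = 2:3, age = c(31, 42))
)

# Reduce applies merge pairwise: merge(merge(tables[[1]], tables[[2]]), tables[[3]]).
merged <- Reduce(function(x, y) merge(x, y, by = "id"), tables)
merged
```

Because merge does an inner join by default, only ids present in every table survive; pass all = TRUE inside the function if you want to keep unmatched rows.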
Why put the data in a list?
Put similar data in lists because you want to do similar things to each data frame, and functions like lapply, sapply, do.call, the purrr package, and the old plyr l*ply functions make it easy to do that. Examples of people easily doing things with lists are all over SO.
Even if you use a lowly for loop, it's much easier to loop over the elements of a list than it is to construct variable names with paste and access the objects with get. Easier to debug, too.
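The contrast is easy to see side by side. This sketch computes the same column totals both ways, using the data frames from the question; the result names totals_get and totals_list are made up for the example:

```r
d1 <- data.frame(y1 = c(1, 2, 3), y2 = c(4, 5, 6))
d2 <- data.frame(y1 = c(3, 2, 1), y2 = c(6, 5, 4))

# Sequential variables: names are built as strings and looked up with get().
totals_get <- numeric(2)
for (i in 1:2) {
  totals_get[i] <- sum(get(paste0("d", i))$y2)
}

# List: iterate over the elements directly, with no string manipulation.
d <- list(d1, d2)
totals_list <- sapply(d, function(df) sum(df$y2))

totals_get
totals_list
```

In the first version the number of data frames is hard-coded and a typo in the pasted name only fails at run time; in the second, the list carries both the data and its length.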
Think of scalability. If you really only need three variables, it's fine to use d1, d2, d3. But then if it turns out you really need 6, that's a lot more typing. And next time, when you need 10 or 20, you find yourself copying and pasting lines of code, maybe using find/replace to change d14 to d15, and you're thinking this isn't how programming should be. If you use a list, the difference between 3 cases, 30 cases, and 300 cases is at most one line of code, and no change at all if your number of cases is automatically detected by, e.g., how many .csv files are in your directory.
You can name the elements of a list, in case you want to use something other than numeric indices to access your data frames (and you can use both, this isn't an XOR choice).
Overall, using lists will lead you to write cleaner, easier-to-read code, which will result in fewer bugs and less confusion.
Answer by Rich Scriven
You can also access specific columns and values in each list element with [ and [[. Here are a couple of examples. First, we can access only the first column of each data frame in the list with lapply(ldf, "[", 1), where 1 signifies the column number.
ldf <- list(d1 = d1, d2 = d2) ## create a named list of your data frames
lapply(ldf, "[", 1)
# $d1
#   y1
# 1  1
# 2  2
# 3  3
#
# $d2
#   y1
# 1  3
# 2  2
# 3  1
Similarly, we can access the first value in the second column with
lapply(ldf, "[", 1, 2)
# $d1
# [1] 4
#
# $d2
# [1] 6
Then we can also access the column values directly, as a vector, with [[
lapply(ldf, "[[", 1)
# $d1
# [1] 1 2 3
#
# $d2
# [1] 3 2 1
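The same trick works with column names and with sapply, which simplifies the result when every data frame returns a vector of the same length. A small sketch using the question's data; the result names y2_by_df and y1_means are made up here:

```r
d1 <- data.frame(y1 = c(1, 2, 3), y2 = c(4, 5, 6))
d2 <- data.frame(y1 = c(3, 2, 1), y2 = c(6, 5, 4))
ldf <- list(d1 = d1, d2 = d2)

# Extract column y2 from every data frame; sapply binds the vectors into a matrix
# with one column per list element.
y2_by_df <- sapply(ldf, "[[", "y2")

# Any function of a data frame can be applied the same way.
y1_means <- sapply(ldf, function(df) mean(df$y1))

y2_by_df
y1_means
```

Use lapply instead of sapply when you want the result to stay a list regardless of its shape.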
Answer by Mark Miller
If you have a large number of sequentially named data frames you can create a list of the desired subset of data frames like this:
d1 <- data.frame(y1=c(1,2,3), y2=c(4,5,6))
d2 <- data.frame(y1=c(3,2,1), y2=c(6,5,4))
d3 <- data.frame(y1=c(6,5,4), y2=c(3,2,1))
d4 <- data.frame(y1=c(9,9,9), y2=c(8,8,8))
my.list <- list(d1, d2, d3, d4)
my.list
my.list2 <- lapply(paste('d', seq(2,4,1), sep=''), get)
my.list2
where my.list2 returns a list containing the 2nd, 3rd and 4th data frames.
[[1]]
  y1 y2
1  3  6
2  2  5
3  1  4
[[2]]
  y1 y2
1  6  3
2  5  2
3  4  1
[[3]]
  y1 y2
1  9  8
2  9  8
3  9  8
Note, however, that the data frames in the above list are no longer named. If you want to create a list containing a subset of data frames and want to preserve their names you can try this:
list.function <- function() {
  d1 <- data.frame(y1=c(1,2,3), y2=c(4,5,6))
  d2 <- data.frame(y1=c(3,2,1), y2=c(6,5,4))
  d3 <- data.frame(y1=c(6,5,4), y2=c(3,2,1))
  d4 <- data.frame(y1=c(9,9,9), y2=c(8,8,8))
  sapply(paste('d', seq(2,4,1), sep=''), get, environment(), simplify = FALSE)
}
my.list3 <- list.function()
my.list3
which returns:
> my.list3
$d2
  y1 y2
1  3  6
2  2  5
3  1  4
$d3
  y1 y2
1  6  3
2  5  2
3  4  1
$d4
  y1 y2
1  9  8
2  9  8
3  9  8
> str(my.list3)
List of 3
 $ d2:'data.frame': 3 obs. of 2 variables:
  ..$ y1: num [1:3] 3 2 1
  ..$ y2: num [1:3] 6 5 4
 $ d3:'data.frame': 3 obs. of 2 variables:
  ..$ y1: num [1:3] 6 5 4
  ..$ y2: num [1:3] 3 2 1
 $ d4:'data.frame': 3 obs. of 2 variables:
  ..$ y1: num [1:3] 9 9 9
  ..$ y2: num [1:3] 8 8 8
> my.list3[[1]]
  y1 y2
1  3  6
2  2  5
3  1  4
> my.list3$d4
  y1 y2
1  9  8
2  9  8
3  9  8
Answer by lmo
Taking as given that you have a "large" number of data.frames with similar names (here d#, where # is some positive integer), the following is a slight improvement of @mark-miller's method. It is more terse and returns a named list of data.frames, where each name in the list is the name of the corresponding original data.frame.
The key is using mget together with ls. If the data frames d1 and d2 provided in the question were the only objects with names d# in the environment, then
my.list <- mget(ls(pattern="^d[0-9]+$"))
which would return
my.list
$d1
  y1 y2
1  1  4
2  2  5
3  3  6
$d2
  y1 y2
1  3  6
2  2  5
3  1  4
This method takes advantage of the pattern argument in ls, which allows us to use regular expressions to do a finer parsing of the names of objects in the environment. An alternative to the regex "^d[0-9]+$" is "^d\\d+$".
As @gregor points out, it is better overall to set up your data construction process so that the data.frames are put into named lists at the start.
data
d1 <- data.frame(y1 = c(1,2,3),y2 = c(4,5,6))
d2 <- data.frame(y1 = c(3,2,1),y2 = c(6,5,4))
Answer by ML_for_now
This may be a little late but going back to your example I thought I would extend the answer just a tad.
D1 <- data.frame(Y1=c(1,2,3), Y2=c(4,5,6))
D2 <- data.frame(Y1=c(3,2,1), Y2=c(6,5,4))
D3 <- data.frame(Y1=c(6,5,4), Y2=c(3,2,1))
D4 <- data.frame(Y1=c(9,9,9), Y2=c(8,8,8))
Then you make your list easily:
mylist <- list(D1,D2,D3,D4)
Now you have a list but instead of accessing the list the old way such as
mylist[[1]] # to access 'd1'
you can use this function to obtain & assign the dataframe of your choice.
GETDF_FROMLIST <- function(DF_LIST, ITEM_LOC){
  DF_SELECTED <- DF_LIST[[ITEM_LOC]]
  return(DF_SELECTED)
}
Now get the one you want.
D1 <- GETDF_FROMLIST(mylist, 1)
D2 <- GETDF_FROMLIST(mylist, 2)
D3 <- GETDF_FROMLIST(mylist, 3)
D4 <- GETDF_FROMLIST(mylist, 4)
Hope that extra bit helps.
Cheers!
Answer by Soufiane Chami
Very simple! Here is my suggestion:
If you want to select the data frames in your workspace, try this:
Filter(function(x) is.data.frame(get(x)) , ls())
or
ls()[sapply(ls(), function(x) is.data.frame(get(x)))]
Both of these give the same result.
You can change is.data.frame to check other types of variables, like is.function.
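Both expressions return the names of the data frames rather than the data frames themselves. If a list of the actual objects is wanted, one possible variation combines mget with Filter. This is a sketch run in a scratch environment (env and its contents are made up for illustration), not part of the original answer:

```r
env <- new.env()  # stands in for your workspace
env$d1 <- data.frame(y1 = c(1, 2, 3), y2 = c(4, 5, 6))
env$d2 <- data.frame(y1 = c(3, 2, 1), y2 = c(6, 5, 4))
env$n  <- 10      # not a data frame, so it should be filtered out

all_objects <- mget(ls(envir = env), envir = env)  # named list of every object
df_list <- Filter(is.data.frame, all_objects)      # keep only the data frames

names(df_list)
```

In an interactive session the envir arguments can be dropped, in which case ls and mget look at the current workspace.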
Answer by Loek van der Kallen
I consider myself a complete newbie, but I think I have an extremely simple answer to one of the original subquestions that has not been stated here: accessing the data frames, or parts of them.
Let's start by creating the list with data frames as was stated above:
d1 <- data.frame(y1 = c(1, 2, 3), y2 = c(4, 5, 6))
d2 <- data.frame(y1 = c(3, 2, 1), y2 = c(6, 5, 4))
my.list <- list(d1, d2)
Then, if you want to access a specific value in one of the data frames, you can do so by using the double brackets sequentially. The first set gets you into the data frame, and the second set gets you to the specific coordinates:
my.list[[1]][[3,2]]
[1] 6
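The second pair of brackets does not have to be double. As a sketch of some equivalent spellings (the names v1 to v4 are made up for this example), each of the following reaches row 3 of column y2 in the first data frame:

```r
d1 <- data.frame(y1 = c(1, 2, 3), y2 = c(4, 5, 6))
d2 <- data.frame(y1 = c(3, 2, 1), y2 = c(6, 5, 4))
my.list <- list(d1, d2)

v1 <- my.list[[1]][[3, 2]]    # double brackets, row and column by position
v2 <- my.list[[1]][3, 2]      # ordinary data frame indexing
v3 <- my.list[[1]][3, "y2"]   # column by name
v4 <- my.list[[1]]$y2[3]      # pull the column first, then the element

c(v1, v2, v3, v4)
```

The first [[ is the part that matters: it is what gets you from the list to the data frame, after which any normal data frame indexing applies.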