Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/48888001/
Creating multiple dataframes with a loop
Asked by Bob Isahofferer
This undoubtedly reflects lack of knowledge on my part, but I can't find anything online to help. I am very new to programming. I want to load 6 csv files and do a few things to each of them before combining them later. The following code iterates over each file but only creates one dataframe, called df.
import pandas as pd

files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
dfs = ('df1', 'df2', 'df3', 'df4', 'df5', 'df6')
for df, file in zip(dfs, files):
    df = pd.read_csv(file)
    print(df.shape)
    print(df.dtypes)
    print(list(df))
Answered by Keith Dowd
I think your code is not doing what you expect it to do.
Specifically, this line: df = pd.read_csv(file)
You might think that in each iteration of the for loop this line is executed with df replaced by a string from dfs and file replaced by a filename from files. While the latter is true, the former is not.
Each iteration of the for loop reads a csv file and stores the result in the variable df, overwriting the dataframe that was read in during the previous iteration. In other words, df in your for loop is not being replaced with the variable names you defined in dfs.
The key takeaway here is that strings (e.g., 'df1', 'df2', etc.) cannot be substituted and used as variable names when executing code.
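A minimal sketch of this point, using plain Python values as stand-ins for dataframes. The names `names`, `values`, `last_df`, and `df1_exists` are illustrative assumptions, not part of the question. Assigning to the loop variable only rebinds that one name; it never creates variables called df1 or df2:

```python
names = ('df1', 'df2')
values = (10, 20)  # stand-ins for the dataframes pd.read_csv() would return

for df, value in zip(names, values):
    # At this point df holds the string 'df1' (then 'df2').
    df = value  # rebinds the name df; no variable named df1 or df2 appears

last_df = df                     # only the value from the final iteration survives
df1_exists = 'df1' in globals()  # checks whether a variable named df1 was created
print(last_df, df1_exists)
```

This is why the question's loop ends with a single dataframe called df that holds only the last file.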
One way to achieve the result you want is to store each dataframe returned by pd.read_csv() in a dictionary, where the key is the name of the dataframe (e.g., 'df1', 'df2', etc.) and the value is the dataframe itself.
list_of_dfs = {}
for df, file in zip(dfs, files):
    list_of_dfs[df] = pd.read_csv(file)
    print(list_of_dfs[df].shape)
    print(list_of_dfs[df].dtypes)
    print(list(list_of_dfs[df]))
You can then reference each of your dataframes like this:
print(list_of_dfs['df1'])
print(list_of_dfs['df2'])
You can learn more about dictionaries here:
https://docs.python.org/3.6/tutorial/datastructures.html#dictionaries
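Since the question mentions combining the dataframes later, here is a hedged sketch of how the dictionary approach could lead into that step. It assumes all files share the same columns, and it uses small in-memory io.StringIO objects as hypothetical stand-ins for data1.csv and data2.csv so that it runs without any files on disk. The names `csv_sources`, `frames`, and `combined` are illustrative only:

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for data1.csv and data2.csv
csv_sources = {
    'df1': io.StringIO("a,b\n1,2\n3,4\n"),
    'df2': io.StringIO("a,b\n5,6\n7,8\n"),
}

# Load every source into a dictionary keyed by name
frames = {name: pd.read_csv(src) for name, src in csv_sources.items()}

# Do per-dataframe work by iterating over the dictionary
for name, frame in frames.items():
    print(name, frame.shape)

# Combine them afterwards; ignore_index=True renumbers the rows from 0
combined = pd.concat(list(frames.values()), ignore_index=True)
print(combined)
```

With real files, the dictionary comprehension would take filenames instead of StringIO objects; the rest stays the same.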
Answered by ilia timofeev
Use a dictionary to store your DataFrames and access them by name
files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
dfs_names = ('df1', 'df2', 'df3', 'df4', 'df5', 'df6')
dfs = {}
for dfn, file in zip(dfs_names, files):
    dfs[dfn] = pd.read_csv(file)
    print(dfs[dfn].shape)
    print(dfs[dfn].dtypes)
print(dfs['df3'])
Use a list to store your DataFrames and access them by index
files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
dfs = []
for file in files:
    dfs.append(pd.read_csv(file))
    # dfs[-1] is the dataframe that was just appended
    print(dfs[-1].shape)
    print(dfs[-1].dtypes)
print(dfs[2])
Do not store the intermediate DataFrames; just process each one and add it to the resulting DataFrame.
files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
df = pd.DataFrame()
for file in files:
    df_n = pd.read_csv(file)
    print(df_n.shape)
    print(df_n.dtypes)
    # do what you want to do with df_n
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    df = pd.concat([df, df_n])
print(df)
If you will process each one differently, then you do not need a general structure to store them. Simply handle each one independently.
df = pd.DataFrame()

def do_general_stuff(d):
    # here we do common things with a DataFrame
    print(d.shape, d.dtypes)

df1 = pd.read_csv("data1.csv")
# do what you want to with df1
do_general_stuff(df1)
# DataFrame.append was removed in pandas 2.0; pd.concat replaces it
df = pd.concat([df, df1])
del df1

df2 = pd.read_csv("data2.csv")
# do what you want to with df2
do_general_stuff(df2)
df = pd.concat([df, df2])
del df2

df3 = pd.read_csv("data3.csv")
# do what you want to with df3
do_general_stuff(df3)
df = pd.concat([df, df3])
del df3
# ... and so on
And one geeky way, but don't ask how it works :)
from collections import namedtuple

files = ['data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv']
df = namedtuple('Cdfs',
                ['df1', 'df2', 'df3', 'df4', 'df5', 'df6']
                )(*[pd.read_csv(file) for file in files])
for df_n in df._fields:
    print(getattr(df, df_n).shape, getattr(df, df_n).dtypes)
print(df.df3)
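For readers who do want to know how it works, here is a rough sketch of the same mechanism with plain strings standing in for the dataframes. The list `loaded` and its values are illustrative assumptions. namedtuple() creates a tuple subclass with named fields, and the * unpacking passes the loaded objects to it positionally:

```python
from collections import namedtuple

# namedtuple() builds a new tuple subclass whose fields are the given names
Cdfs = namedtuple('Cdfs', ['df1', 'df2', 'df3'])

# Stand-ins for the dataframes; unpacking with * passes them positionally,
# so the first value lands in df1, the second in df2, and so on
loaded = ['first', 'second', 'third']
df = Cdfs(*loaded)

print(df._fields)          # the field names, in order
print(df.df3)              # attribute access by name
print(getattr(df, 'df2'))  # the same lookup with the name held in a string
```

The result is immutable, so this suits a fixed set of files better than a collection that grows over time.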
Answered by Gerard H. Pille
A dictionary can store them too
import pandas as pd
from pprint import pprint

files = ('doms_stats201610051.csv', 'doms_stats201610052.csv')
dfsdic = {}
dfs = ('df1', 'df2')
for df, file in zip(dfs, files):
    dfsdic[df] = pd.read_csv(file)
    print(dfsdic[df].shape)
    print(dfsdic[df].dtypes)
    print(list(dfsdic[df]))
print(dfsdic['df1'].shape)
print(dfsdic['df2'].shape)