pandas: create multiple dataframes using a loop

Question

Asked by Bob Isahofferer

This undoubtedly reflects lack of knowledge on my part, but I can't find anything online to help. I am very new to programming. I want to load 6 csvs and do a few things to them before combining them later. The following code iterates over each file but only creates one dataframe, called df.

files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
dfs = ('df1', 'df2', 'df3', 'df4', 'df5', 'df6')
for df, file in zip(dfs, files):
    df = pd.read_csv(file)
    print(df.shape)
    print(df.dtypes)
    print(list(df))

Answer 1

Answered by Keith Dowd

I think you think your code is doing something that it is not actually doing.

Specifically, this line: df = pd.read_csv(file)

You might think that on each iteration of the for loop this line is executed in a modified form, with df replaced by a string from dfs and file replaced by a filename from files. The latter is true; the former is not.

Each iteration of the for loop reads a csv file and stores it in the variable df, overwriting the dataframe that was read in during the previous iteration. In other words, df in your for loop is not replaced with the names you defined in dfs.

The key takeaway here is that strings (e.g., 'df1', 'df2', etc.) cannot be substituted and used as variable names when executing code.
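
To make the point concrete, here is a minimal sketch in plain Python. The integers are hypothetical stand-ins for the dataframes, and names such as last_value and df1_exists are introduced only for illustration.

```python
names = ("df1", "df2", "df3")
values = (10, 20, 30)  # hypothetical stand-ins for the DataFrames

for name, value in zip(names, values):
    # This rebinds the loop variable `name` to the value.
    # It does NOT create variables called df1, df2, df3.
    name = value

# Only the value from the final iteration is still reachable.
last_value = name

# Check whether a variable called df1 was ever created.
try:
    df1  # noqa: F821 (deliberately referencing an undefined name)
    df1_exists = True
except NameError:
    df1_exists = False

print(last_value, df1_exists)
```

The loop variable is just a name that gets rebound on every pass, which is why a container such as a dictionary is needed to keep all the results.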

One way to achieve the result you want is to store each dataframe returned by pd.read_csv() in a dictionary, where the key is the name of the dataframe (e.g., 'df1', 'df2', etc.) and the value is the dataframe itself.

list_of_dfs = {}
for df, file in zip(dfs, files):
    list_of_dfs[df] = pd.read_csv(file)
    print(list_of_dfs[df].shape)
    print(list_of_dfs[df].dtypes)
    print(list(list_of_dfs[df]))

You can then reference each of your dataframes like this:

print(list_of_dfs['df1'])
print(list_of_dfs['df2'])
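
Since the question mentions combining the dataframes later, the following sketch shows one possible way to loop over such a dictionary and combine its contents. It assumes small in-memory CSV strings (via io.StringIO) in place of the real files, and the names csv_sources, frames, and combined are hypothetical.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV data standing in for data1.csv and data2.csv.
csv_sources = {
    "df1": io.StringIO("a,b\n1,2\n3,4\n"),
    "df2": io.StringIO("a,b\n5,6\n"),
}

# Build the dictionary of DataFrames, keyed by name.
frames = {name: pd.read_csv(src) for name, src in csv_sources.items()}

# Iterate over names and DataFrames together.
for name, frame in frames.items():
    print(name, frame.shape)

# Passing a dict to pd.concat uses its keys as the outer index level,
# so each row stays traceable to the DataFrame it came from.
combined = pd.concat(frames, names=["source", "row"])
print(combined)
```

With this layout, combined.loc["df1"] should give back just the rows that came from the first dataframe.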

You can learn more about dictionaries here:

https://docs.python.org/3.6/tutorial/datastructures.html#dictionaries

Answer 2

Answered by ilia timofeev

Use a dictionary to store your DataFrames and access them by name.

import pandas as pd

files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
dfs_names = ('df1', 'df2', 'df3', 'df4', 'df5', 'df6')
dfs = {}
for dfn, file in zip(dfs_names, files):
    dfs[dfn] = pd.read_csv(file)
    print(dfs[dfn].shape)
    print(dfs[dfn].dtypes)
print(dfs['df3'])

Use a list to store your DataFrames and access them by index.

files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
dfs = []
for file in files:
    dfs.append(pd.read_csv(file))
    print(dfs[-1].shape)   # dfs[-1] is the DataFrame just appended
    print(dfs[-1].dtypes)
print(dfs[2])

Do not store the intermediate DataFrames; just process each one and add it to the resulting DataFrame.

files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
parts = []
for file in files:
    df_n = pd.read_csv(file)
    print(df_n.shape)
    print(df_n.dtypes)
    # do whatever you want with df_n here
    parts.append(df_n)
# DataFrame.append was removed in pandas 2.0; pd.concat combines the parts
df = pd.concat(parts)
print(df)
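
A possible variation on combining everything into one DataFrame is to tag each row with the file it came from before combining. The sketch below assumes small in-memory CSV strings instead of the real files; the source column and the names sources, parts, and result are hypothetical. It uses pd.concat because DataFrame.append was removed in pandas 2.0.

```python
import io

import pandas as pd

# Hypothetical CSV contents keyed by the filename they stand in for.
sources = {
    "data1.csv": "a,b\n1,2\n",
    "data2.csv": "a,b\n3,4\n5,6\n",
}

parts = []
for filename, text in sources.items():
    part = pd.read_csv(io.StringIO(text))
    # Record where each row came from before the frames are combined.
    part = part.assign(source=filename)
    parts.append(part)

# ignore_index=True renumbers the rows 0..n-1 in the combined result.
result = pd.concat(parts, ignore_index=True)
print(result)
```

Collecting the pieces in a list and concatenating once at the end also avoids copying the growing result on every iteration.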

If you are going to process each one differently, then you do not need a general structure to store them. Simply handle each one independently.

parts = []  # processed DataFrames, combined at the end

def do_general_stuff(d):  # common things done with every DataFrame
    print(d.shape, d.dtypes)

df1 = pd.read_csv("data1.csv")
# do whatever you want with df1

do_general_stuff(df1)
parts.append(df1)
del df1

df2 = pd.read_csv("data2.csv")
# do whatever you want with df2

do_general_stuff(df2)
parts.append(df2)
del df2

df3 = pd.read_csv("data3.csv")
# do whatever you want with df3

do_general_stuff(df3)
parts.append(df3)
del df3

# ... and so on

# DataFrame.append was removed in pandas 2.0; combine with pd.concat instead
df = pd.concat(parts)

And one geeky way, but don't ask how it works:)

from collections import namedtuple
files = ['data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv']

df = namedtuple('Cdfs',
                ['df1', 'df2', 'df3', 'df4', 'df5', 'df6']
               )(*[pd.read_csv(file) for file in files])

for df_n in df._fields:
    print(getattr(df, df_n).shape,getattr(df, df_n).dtypes)

print(df.df3)
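
For readers who do want to know how the namedtuple approach works, here is a minimal sketch in plain Python. Strings stand in for the DataFrames, and the names Frames, values, and frames are hypothetical.

```python
from collections import namedtuple

# namedtuple() builds a new tuple subclass whose positions also have names.
Frames = namedtuple("Frames", ["df1", "df2", "df3"])

# Stand-ins for the DataFrames that pd.read_csv would return.
values = ["first", "second", "third"]

# The * unpacks the list so each item fills one named position, in order.
frames = Frames(*values)

# Each item is reachable by attribute name or by position.
print(frames.df2)
print(frames[1])

# _fields lists the names, which is what the loop iterates over;
# getattr looks an attribute up from its name given as a string.
for field in frames._fields:
    print(field, getattr(frames, field))
```

The result is immutable, so this suits a fixed set of dataframes better than a collection that grows over time.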

Answer 3

Answered by Gerard H. Pille

A dictionary can store them too

import pandas as pd

files = ('doms_stats201610051.csv', 'doms_stats201610052.csv')
dfsdic = {}
dfs = ('df1', 'df2')
for df, file in zip(dfs, files):
  dfsdic[df] = pd.read_csv(file)
  print(dfsdic[df].shape)
  print(dfsdic[df].dtypes)
  print(list(dfsdic[df]))

print(dfsdic['df1'].shape)
print(dfsdic['df2'].shape)
