Pandas: import multiple CSV files into a DataFrame using a loop and hierarchical indexing

Disclaimer: this page is a Chinese/English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/21149920/

Date: 2020-09-13 21:34:35  Source: igfitidea

Pandas: import multiple csv files into dataframe using a loop and hierarchical indexing

Tags: python, csv, pandas, hierarchical-data

Asked by mellover

I would like to read multiple CSV files (with a different number of columns) from a target directory into a single Python Pandas DataFrame to efficiently search and extract data.


Example file:


Events 
1,0.32,0.20,0.67
2,0.94,0.19,0.14,0.21,0.94
3,0.32,0.20,0.64,0.32
4,0.87,0.13,0.61,0.54,0.25,0.43 
5,0.62,0.21,0.77,0.44,0.16

Here is what I have so far:


# get a list of all csv files in target directory
import glob
import os

import pandas as pd

my_dir = "C:\\Data\\"  # a trailing backslash must be escaped in a normal string literal
filelist = []
os.chdir(my_dir)
for files in glob.glob("*.csv"):
    filelist.append(files)

# read each csv file into single dataframe and add a filename reference column 
# (i.e. file1, file2, file 3) for each file read
df = pd.DataFrame()
columns = range(1,100)
for c, f in enumerate(filelist) :
    key = "file%i" % c
    frame = pd.read_csv( (my_dir + f), skiprows = 1, index_col=0, names=columns )
    frame['key'] = key
    df = df.append(frame,ignore_index=True)
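A note for current pandas versions: `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so the loop above fails on recent installs. The usual replacement is to collect the frames in a list and call `pd.concat` once at the end. A minimal, self-contained sketch of that pattern (the temp-directory CSVs below are synthetic stand-ins for the real files in `C:\Data`):

```python
import glob
import os
import tempfile

import pandas as pd

# Create two small ragged CSV files as stand-ins for the real event files.
my_dir = tempfile.mkdtemp()
with open(os.path.join(my_dir, "a.csv"), "w") as fh:
    fh.write("Events\n1,0.32,0.20\n2,0.94,0.19,0.14\n")
with open(os.path.join(my_dir, "b.csv"), "w") as fh:
    fh.write("Events\n1,0.87,0.13,0.61,0.54\n2,0.62,0.21\n")

# Collect one frame per file, then concatenate once at the end.
frames = []
for c, f in enumerate(sorted(glob.glob(os.path.join(my_dir, "*.csv")))):
    # Oversized `names` tolerates a different field count on every row.
    frame = pd.read_csv(f, skiprows=1, index_col=0, header=None,
                        names=range(100))
    frame["key"] = "file%i" % c
    frames.append(frame)

df = pd.concat(frames)
```

Building the list first and concatenating once is also much faster than growing a DataFrame inside the loop, since each `append`/`concat` call copies all the data.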

(the indexing isn't working properly)


Essentially, the script below is exactly what I want (tried and tested) but needs to be looped through 10 or more csv files:


df1 = pd.DataFrame()
df2 = pd.DataFrame()
columns = range(1,100)
df1 = pd.read_csv(r"C:\Data\Currambene_001y09h00m_events.csv",
                  skiprows=1, index_col=0, names=columns)
df2 = pd.read_csv(r"C:\Data\Currambene_001y12h00m_events.csv",
                  skiprows=1, index_col=0, names=columns)
keys = [('file1'), ('file2')]
df = pd.concat([df1, df2], keys=keys, names=['fileno'])
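The tried-and-tested two-file version generalizes directly: build the list of frames and the list of keys in one pass, then hand both to `pd.concat`. A sketch of that loop, using synthetic temp-directory files in place of the real `Currambene_*.csv` paths:

```python
import glob
import os
import tempfile

import pandas as pd

# Synthetic stand-ins for the real C:\Data\Currambene_*.csv files.
my_dir = tempfile.mkdtemp()
for name, rows in [("f1.csv", "1,0.1,0.2\n2,0.3\n"),
                   ("f2.csv", "1,0.4,0.5,0.6\n")]:
    with open(os.path.join(my_dir, name), "w") as fh:
        fh.write("Events\n" + rows)

filelist = sorted(glob.glob(os.path.join(my_dir, "*.csv")))
columns = range(1, 100)
frames = [pd.read_csv(f, skiprows=1, index_col=0, names=columns)
          for f in filelist]
keys = ["file%i" % (i + 1) for i in range(len(frames))]

# keys= labels each frame's rows, producing the hierarchical index.
df = pd.concat(frames, keys=keys, names=["fileno"])
```

This scales to any number of files without declaring `df1`, `df2`, ... by hand.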

I have found many related links; however, I am still not able to get this to work.


Answered by dmvianna

You need to decide along which axis you want to append your files. Pandas will always try to do the right thing by:


  1. Assuming that each column from each file is different, and appending digits to columns with similar names across files if necessary, so that they don't get mixed;
  2. Items that belong to the same row index across files are placed side by side, under their respective columns.
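To illustrate point 2, a tiny made-up example: when concatenating along the column axis, rows sharing an index label are aligned side by side under their respective columns.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]}, index=["r1", "r2"])
b = pd.DataFrame({"y": [3, 4]}, index=["r2", "r1"])

# Rows with the same index label line up, regardless of their order
# within each frame.
side_by_side = pd.concat([a, b], axis=1)
print(side_by_side.loc["r1", "y"])  # 4
```

Note that `b` lists `r2` before `r1`, yet the value 4 still lands in row `r1` after alignment.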

The trick to appending efficiently is to tip the files sideways, so the behaviour matches what pandas.concat will be doing. This is my recipe:


from pandas import *
files = !ls *.csv  # IPython magic
d = concat([read_csv(f, index_col=0, header=None).T for f in files], keys=files)

Notice that each frame is transposed with .T before concatenation, so the frames are joined along the column axis, preserving their names. If you need, you can transpose the resulting DataFrame back with d.T.


EDIT:


For a different number of columns in each source file, you'll need to supply a header. I understand you don't have a header in your source files, so let's create one with a simple function:


def reader(f):
    # read one headerless file, then renumber its columns 0..n-1
    d = read_csv(f, index_col=0, header=None)
    d.columns = range(d.shape[1])
    return d

df = concat([reader(f) for f in files], keys=files)
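Once the frames are concatenated with `keys=files`, the file names form the outer level of a hierarchical (MultiIndex) row index, so one file's rows can be recovered with `.loc`. A small illustration with made-up frames standing in for two parsed CSV files:

```python
import pandas as pd

# Two made-up frames with different column counts, as if read from
# two ragged CSV files.
d1 = pd.DataFrame({0: [0.32, 0.94], 1: [0.20, 0.19]})
d2 = pd.DataFrame({0: [0.87], 1: [0.13], 2: [0.61]})

# keys= builds the hierarchical index: file label first, row label second.
df = pd.concat([d1, d2], keys=["file1", "file2"], names=["fileno"])

# All rows that came from one source file:
print(df.loc["file1"].shape)  # (2, 3)
```

Columns missing from a shorter file are filled with NaN, which is what lets files with different column counts share one DataFrame.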