Pandas: import multiple CSV files into a DataFrame using a loop and hierarchical indexing

Disclaimer: this page is a Chinese/English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/21149920/

Date: 2020-09-13 21:34:35  Source: igfitidea

Pandas: import multiple csv files into dataframe using a loop and hierarchical indexing

Tags: python, csv, pandas, hierarchical-data

Asked by mellover

I would like to read multiple CSV files (with a different number of columns) from a target directory into a single Python Pandas DataFrame to efficiently search and extract data.


Example file:


Events 
1,0.32,0.20,0.67
2,0.94,0.19,0.14,0.21,0.94
3,0.32,0.20,0.64,0.32
4,0.87,0.13,0.61,0.54,0.25,0.43 
5,0.62,0.21,0.77,0.44,0.16

Here is what I have so far:


# get a list of all csv files in target directory
import glob
import os

import pandas as pd

my_dir = "C:\\Data\\"  # a trailing backslash must be escaped in a normal string literal
filelist = []
os.chdir(my_dir)
for files in glob.glob("*.csv"):
    filelist.append(files)

# read each csv file into single dataframe and add a filename reference column 
# (i.e. file1, file2, file 3) for each file read
df = pd.DataFrame()
columns = range(1,100)
for c, f in enumerate(filelist) :
    key = "file%i" % c
    frame = pd.read_csv( (my_dir + f), skiprows = 1, index_col=0, names=columns )
    frame['key'] = key
    df = df.append(frame,ignore_index=True)
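A note for current pandas versions: `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so the loop above fails on recent installs. The usual replacement is to collect the frames in a list and call `pd.concat` once at the end. A minimal, self-contained sketch of that pattern (the temp-directory CSVs below are synthetic stand-ins for the real files in `C:\Data`):

```python
import glob
import os
import tempfile

import pandas as pd

# Create two small ragged CSV files as stand-ins for the real event files.
my_dir = tempfile.mkdtemp()
with open(os.path.join(my_dir, "a.csv"), "w") as fh:
    fh.write("Events\n1,0.32,0.20\n2,0.94,0.19,0.14\n")
with open(os.path.join(my_dir, "b.csv"), "w") as fh:
    fh.write("Events\n1,0.87,0.13,0.61,0.54\n2,0.62,0.21\n")

# Collect one frame per file, then concatenate once at the end.
frames = []
for c, f in enumerate(sorted(glob.glob(os.path.join(my_dir, "*.csv")))):
    # Oversized `names` tolerates a different field count on every row.
    frame = pd.read_csv(f, skiprows=1, index_col=0, header=None,
                        names=range(100))
    frame["key"] = "file%i" % c
    frames.append(frame)

df = pd.concat(frames)
```

Building the list first and concatenating once is also much faster than growing a DataFrame inside the loop, since each `append`/`concat` call copies all the data.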

(the indexing isn't working properly)


Essentially, the script below is exactly what I want (tried and tested) but needs to be looped through 10 or more csv files:


df1 = pd.DataFrame()
df2 = pd.DataFrame()
columns = range(1,100)
df1 = pd.read_csv(r"C:\Data\Currambene_001y09h00m_events.csv",
                  skiprows=1, index_col=0, names=columns)
df2 = pd.read_csv(r"C:\Data\Currambene_001y12h00m_events.csv",
                  skiprows=1, index_col=0, names=columns)
keys = [('file1'), ('file2')]
df = pd.concat([df1, df2], keys=keys, names=['fileno'])
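The tried-and-tested two-file version generalizes directly: build the list of frames and the list of keys in one pass, then hand both to `pd.concat`. A sketch of that loop, using synthetic temp-directory files in place of the real `Currambene_*.csv` paths:

```python
import glob
import os
import tempfile

import pandas as pd

# Synthetic stand-ins for the real C:\Data\Currambene_*.csv files.
my_dir = tempfile.mkdtemp()
for name, rows in [("f1.csv", "1,0.1,0.2\n2,0.3\n"),
                   ("f2.csv", "1,0.4,0.5,0.6\n")]:
    with open(os.path.join(my_dir, name), "w") as fh:
        fh.write("Events\n" + rows)

filelist = sorted(glob.glob(os.path.join(my_dir, "*.csv")))
columns = range(1, 100)
frames = [pd.read_csv(f, skiprows=1, index_col=0, names=columns)
          for f in filelist]
keys = ["file%i" % (i + 1) for i in range(len(frames))]

# keys= labels each frame's rows, producing the hierarchical index.
df = pd.concat(frames, keys=keys, names=["fileno"])
```

This scales to any number of files without declaring `df1`, `df2`, ... by hand.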

I have found many related links; however, I am still not able to get this to work.


Answered by dmvianna

You need to decide along which axis you want to append your files. Pandas will always try to do the right thing by:


  1. Assuming that each column from each file is different, and appending digits to columns with similar names across files if necessary, so that they don't get mixed;
  2. Items that belong to the same row index across files are placed side by side, under their respective columns.
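To illustrate point 2, a tiny made-up example: when concatenating along the column axis, rows sharing an index label are aligned side by side under their respective columns.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]}, index=["r1", "r2"])
b = pd.DataFrame({"y": [3, 4]}, index=["r2", "r1"])

# Rows with the same index label line up, regardless of their order
# within each frame.
side_by_side = pd.concat([a, b], axis=1)
print(side_by_side.loc["r1", "y"])  # 4
```

Note that `b` lists `r2` before `r1`, yet the value 4 still lands in row `r1` after alignment.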

The trick to appending efficiently is to tip the files sideways, so the behaviour matches what pandas.concat will be doing. This is my recipe:


from pandas import *
files = !ls *.csv  # IPython magic
d = concat([read_csv(f, index_col=0, header=None).T for f in files], keys=files)

Notice that each frame is transposed with .T before concatenation, so the frames are joined along the column axis, preserving their names. If you need, you can transpose the resulting DataFrame back with d.T.


EDIT:


For a different number of columns in each source file, you'll need to supply a header. I understand you don't have a header in your source files, so let's create one with a simple function:


def reader(f):
    # read one headerless file, then renumber its columns 0..n-1
    d = read_csv(f, index_col=0, header=None)
    d.columns = range(d.shape[1])
    return d

df = concat([reader(f) for f in files], keys=files)
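Once the frames are concatenated with `keys=files`, the file names form the outer level of a hierarchical (MultiIndex) row index, so one file's rows can be recovered with `.loc`. A small illustration with made-up frames standing in for two parsed CSV files:

```python
import pandas as pd

# Two made-up frames with different column counts, as if read from
# two ragged CSV files.
d1 = pd.DataFrame({0: [0.32, 0.94], 1: [0.20, 0.19]})
d2 = pd.DataFrame({0: [0.87], 1: [0.13], 2: [0.61]})

# keys= builds the hierarchical index: file label first, row label second.
df = pd.concat([d1, d2], keys=["file1", "file2"], names=["fileno"])

# All rows that came from one source file:
print(df.loc["file1"].shape)  # (2, 3)
```

Columns missing from a shorter file are filled with NaN, which is what lets files with different column counts share one DataFrame.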