Reading multiple CSV files into a Pandas DataFrame

Question

Asked by Watty62

I am attempting to read multiple CSV files into a Pandas data frame. The CSVs aren't comma separated - the fields are delimited by a semicolon ";".

I based my code on the answers here.

My data is all in a specific subdirectory: /data/luftdaten/5331

This is what I run:

import glob
import pandas as pd

path =r'data/luftdaten/5331' # use your path

filenames = glob.glob(path + "/*.csv")
count_files = 0
dfs = []
for filename in filenames:
    if count_files ==0:
        dfs.append(pd.read_csv(filename, sep=";")) 
        count_files += 1
    else:
        dfs.append(pd.read_csv(filename, sep=";", skiprows=[0]))
        count_files +=1

big_frame = pd.concat(dfs, ignore_index=True)

I use count_files to check whether it is the first CSV, in which case I import the headers. Otherwise, it skips the headers.

The code executes OK.

If I run it with a single file in that directory, everything is fine:

big_frame.info()

Output:

RangeIndex: 146 entries, 0 to 145
Data columns (total 12 columns):
sensor_id      146 non-null int64
sensor_type    146 non-null object
etc......

If I run it with two or more files in the directory, things go wrong from the start.

Output with 4 files:

RangeIndex: 1893 entries, 0 to 1892
Data columns (total 33 columns):
-2.077                 1164 non-null float64
-2.130                 145 non-null float64
2.40                   145 non-null float64

Running big_frame.head() on the single CSV version gives this, with the correct column names:

While running the same with four files imported gives me this:

Is there anything obvious that I am doing which is causing not only the number of rows to grow but the columns too?

Your guidance would be gratefully appreciated!

Answer 1

Answered by asongtoruin

The reason it currently isn't working is that when you pass skiprows=[0] for each file after the first, the header line is skipped, so the file's second line (index 1, i.e. its first data row) is used as the column titles of that new dataframe. Hence, when the frames are concatenated, there are lots of column headers that don't match, and pd.concat keeps every one of them as a separate column. If you remove the skiprows=[0], it should work.
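
To see the effect in isolation, here is a minimal sketch that reads the same in-memory text twice, once normally and once with skiprows=[0]. The sample text and its column names are made up for illustration; they only assume a semicolon-delimited file with a header line, as described in the question.

```python
import io

import pandas as pd

# Hypothetical stand-in for one sensor file: a header line plus two data rows.
raw = "sensor_id;sensor_type;temperature\n5331;DHT22;2.40\n5331;DHT22;2.50\n"

# Default behaviour: the first line of the file becomes the header.
with_header = pd.read_csv(io.StringIO(raw), sep=";")

# skiprows=[0] drops the header line, so the first data row becomes the header.
skipped = pd.read_csv(io.StringIO(raw), sep=";", skiprows=[0])

print(list(with_header.columns))  # ['sensor_id', 'sensor_type', 'temperature']
print(list(skipped.columns))      # ['5331', 'DHT22', '2.40']

# Concatenating frames with different headers keeps the union of all columns.
combined = pd.concat([with_header, skipped], ignore_index=True)
print(combined.shape)             # (3, 6)
```

Each additional file read this way contributes its own first data row as a new set of column names, which would explain the column count growing along with the row count.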

Assuming all of your files have the same header (or you're okay with NaN where they differ), you should be able to do this in a one-liner:

big_frame = pd.concat([pd.read_csv(f, sep=';') for f in glob.glob(path + "/*.csv")],
                      ignore_index=True)
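
For a quick check that the one-liner behaves as expected, the following sketch writes two small semicolon-delimited files to a temporary directory and concatenates them. The file names, column names, and values are hypothetical; only the glob pattern, sep=";", and ignore_index=True come from the code above.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical sample data: two files sharing the same header line.
samples = {
    "a.csv": ["5331;DHT22;2.40"],
    "b.csv": ["5331;DHT22;2.50", "5331;DHT22;2.60"],
}

with tempfile.TemporaryDirectory() as path:
    for name, rows in samples.items():
        with open(os.path.join(path, name), "w") as fh:
            fh.write("sensor_id;sensor_type;temperature\n")
            fh.write("\n".join(rows) + "\n")

    # Read every file with its header intact, then stack the frames.
    filenames = sorted(glob.glob(os.path.join(path, "*.csv")))
    big_frame = pd.concat(
        [pd.read_csv(f, sep=";") for f in filenames],
        ignore_index=True,
    )

print(big_frame.shape)          # (3, 3)
print(list(big_frame.columns))  # ['sensor_id', 'sensor_type', 'temperature']
```

Sorting the file list is optional; glob does not guarantee an order, so it is only there to make the row order reproducible.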
