Reading multiple CSV files into a Pandas DataFrame

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow original URL: http://stackoverflow.com/questions/50351908/

Date: 2020-09-14 05:34:00  Source: igfitidea

Tags: python, pandas, csv, dataframe

Asked by Watty62

I am attempting to read multiple CSV files into a Pandas data frame. The CSVs aren't comma separated - the fields are delimited by a semicolon ";".

I based my code on the answers here.

My data is all in a specific subdirectory: /data/luftdaten/5331

This is what I run:

import glob
import pandas as pd

path =r'data/luftdaten/5331' # use your path

filenames = glob.glob(path + "/*.csv")
count_files = 0
dfs = []
for filename in filenames:
    if count_files ==0:
        dfs.append(pd.read_csv(filename, sep=";")) 
        count_files += 1
    else:
        dfs.append(pd.read_csv(filename, sep=";", skiprows=[0]))
        count_files +=1

big_frame = pd.concat(dfs, ignore_index=True)

I use count_files to track whether it is the first CSV, in which case I import the headers. Otherwise, it skips the headers.

The code executes OK.

If I run it with a single file in that directory, everything is fine:

big_frame.info()

Output:

RangeIndex: 146 entries, 0 to 145
Data columns (total 12 columns):
sensor_id      146 non-null int64
sensor_type    146 non-null object
etc......

If I run it with 2 or more files in the directory, things go wrong from the start.

Output with 4 files:

RangeIndex: 1893 entries, 0 to 1892
Data columns (total 33 columns):
-2.077                 1164 non-null float64
-2.130                 145 non-null float64
2.40                   145 non-null float64

Running big_frame.head() on the single CSV version gives this, with the correct column names:

[Image: output from importing single CSV]

While running the same with four files imported gives me this:

[Image: output from importing 4 CSV files (cropped right)]

Is there anything obvious I am doing that causes not only the number of rows to grow, but the number of columns too?

Your guidance would be gratefully appreciated!

Answered by asongtoruin

The reason it currently doesn't work is that when you pass skiprows=[0] for each file after the first, the header line is skipped and the file's second line (index 1), which is the first row of data, is used as the column titles. Hence, when the frames are concatenated, there are lots of column headers that don't match. If you remove the skiprows=[0], it should work.
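
To see the effect in isolation, here is a minimal sketch using two small in-memory CSVs instead of real files. The sample values and the column name temperature are made up for illustration; only sensor_id comes from the question's output.

```python
import io

import pandas as pd

# Two tiny in-memory "files" sharing the same header (made-up sample data).
csv_a = "sensor_id;temperature\n1;2.40\n2;2.50\n"
csv_b = "sensor_id;temperature\n3;-2.077\n4;-2.130\n"

first = pd.read_csv(io.StringIO(csv_a), sep=";")

# skiprows=[0] drops the header line, so the first data row
# ("3;-2.077") is promoted to the column names.
second = pd.read_csv(io.StringIO(csv_b), sep=";", skiprows=[0])

print(list(first.columns))   # ['sensor_id', 'temperature']
print(list(second.columns))  # ['3', '-2.077']

# The column names do not line up, so concat keeps all four columns
# and fills the gaps with NaN.
combined = pd.concat([first, second], ignore_index=True)
print(combined.shape)        # (3, 4)
```

This mirrors the symptom in the question: column names such as -2.077 show up, and every extra file adds more columns rather than only more rows.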

Assuming all of your files have the same header (or you're okay with NaN where they differ), you should be able to do this in a one-liner:

big_frame = pd.concat([pd.read_csv(f, sep=';') for f in glob.glob(path + "/*.csv")],
                      ignore_index=True)
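
If you prefer something a little more defensive than the one-liner, the same idea could be wrapped in a small helper. This is only a sketch: read_csv_dir is a hypothetical name, not part of pandas, and sorting the filenames and raising on an empty directory are my own assumptions about what is useful here (pd.concat raises a less descriptive ValueError when given an empty list).

```python
import glob
import os

import pandas as pd


def read_csv_dir(path, sep=";"):
    """Read every *.csv file in `path` into one DataFrame (hypothetical helper)."""
    # Sort so the row order is reproducible; glob makes no ordering guarantee.
    filenames = sorted(glob.glob(os.path.join(path, "*.csv")))
    if not filenames:
        raise FileNotFoundError(f"no CSV files found in {path!r}")
    # Every file keeps its own header row; concat aligns columns by name.
    frames = [pd.read_csv(f, sep=sep) for f in filenames]
    return pd.concat(frames, ignore_index=True)
```

With the path from the question, this would be called as big_frame = read_csv_dir(r'data/luftdaten/5331').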