Note: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original: http://stackoverflow.com/questions/42756696/

Time: 2020-09-14 03:10:19  Source: igfitidea

Read multiple csv files and Add filename as new column in pandas

Tags: python, csv, pandas, operating-system, glob

Asked by amwade2

I have several csv files in a single folder and I want to open them all in one dataframe and insert a new column with the associated filename. So far I've coded the following:

import pandas as pd
import glob, os
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('path/*.csv'))))
df['filename']= os.path.basename(csv)
df

This gives me the dataframe I want, but the new column 'filename' lists only the last filename in the folder for every row. I want each row to be populated with the name of the csv file it came from, not just the last file in the folder.

Any assistance for this newbie is much appreciated.

Answered by jezrael

I think you need assign to add the new column inside the loop. The parameter ignore_index=True is also passed to concat to avoid duplicate values in the index:

Files for test are a.csv, b.csv, c.csv.

import pandas as pd
import glob, os

files = glob.glob('files/*.csv')
print(files)
['files\\a.csv', 'files\\b.csv', 'files\\c.csv']

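# Tag each file's rows with its name, then stack the frames under a fresh 0..N-1 index
df = pd.concat([pd.read_csv(fp).assign(New=os.path.basename(fp)) for fp in files], ignore_index=True)
print(df)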
   a  b  c  d    New
0  0  1  2  5  a.csv
1  1  5  8  3  a.csv
2  0  9  6  5  b.csv
3  1  6  4  2  b.csv
4  0  7  1  7  c.csv
5  1  3  2  6  c.csv
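
To see what ignore_index=True changes in the concat call, here is a minimal sketch. The two in-memory frames and their contents are made up for illustration and stand in for the results of pd.read_csv.

```python
import pandas as pd

# Two small hypothetical frames standing in for two CSV files
a = pd.DataFrame({"x": [0, 1]}).assign(New="a.csv")
b = pd.DataFrame({"x": [2, 3]}).assign(New="b.csv")

# Default: each frame keeps its own 0..n-1 index, so labels repeat
kept = pd.concat([a, b])

# ignore_index=True: the result gets a fresh 0..N-1 index
fresh = pd.concat([a, b], ignore_index=True)

print(kept.index.tolist())   # [0, 1, 0, 1]
print(fresh.index.tolist())  # [0, 1, 2, 3]
```

With repeated labels, a lookup such as kept.loc[0] returns one row per file, which is usually not what you want after stacking.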


# Same idea, but keep only the part of the file name before the first dot
files = glob.glob('files/*.csv')
df = pd.concat([pd.read_csv(fp).assign(New=os.path.basename(fp).split('.')[0]) for fp in files], ignore_index=True)
print(df)
   a  b  c  d New
0  0  1  2  5   a
1  1  5  8  3   a
2  0  9  6  5   b
3  1  6  4  2   b
4  0  7  1  7   c
5  1  3  2  6   c
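
Note that split('.')[0] keeps only the text before the first dot, so a name containing extra dots would be cut short. If that matters, os.path.splitext removes only the final extension. A small sketch, where the helper name stem and the example paths are hypothetical:

```python
import os

def stem(path):
    """Return the file name without its directory and final extension."""
    return os.path.splitext(os.path.basename(path))[0]

# Hypothetical paths; the second one has an extra dot in its name
names = [stem(p) for p in ["files/a.csv", "files/sales.2020.csv"]]
print(names)  # ['a', 'sales.2020']
```

The helper can be dropped into the assign call in place of the split expression, for example assign(New=stem(fp)).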

Answered by Abid Hasan

Firstly, you have no csv variable defined.

But anyway, this behaviour makes sense: you use csv only once, at the end, so every row is set to that single value (the last file). Ideally, you can use glob again to get all filenames, then set that as a new column.

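import glob, os
import pandas as pd

# This is a Python list containing the filenames
csvs = glob.glob(os.path.join('path', '*.csv'))

# Now put the filenames into a pandas Series
csv_paths = pd.Series(csvs)

# Note: the lengths only match if df has exactly one row per file
df['file_name'] = csv_paths.values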
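
When files contribute several rows each, one option for getting a filename on every row is to pass the names as keys to pd.concat and then move that index level into a column. The following is a sketch under assumptions: the in-memory frames stand in for pd.read_csv results, and the file names and the column x are hypothetical.

```python
import pandas as pd

# Hypothetical file names and the frames that pd.read_csv would return for them
names = ["a.csv", "b.csv"]
frames = [pd.DataFrame({"x": [0, 1]}), pd.DataFrame({"x": [2, 3, 4]})]

# keys= labels every row with the name of the frame it came from,
# as an outer index level called "file_name"
df = pd.concat(frames, keys=names, names=["file_name"])

# Turn that level into a regular column, then renumber the rows
df = df.reset_index(level="file_name").reset_index(drop=True)

print(df)
```

Here the row counts differ per file (2 and 3), and each row still carries the right name because concat does the labelling rather than a separate assignment.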