Original question: http://stackoverflow.com/questions/42756696/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Read multiple csv files and Add filename as new column in pandas
Asked by amwade2
I have several csv files in a single folder and I want to open them all in one dataframe and insert a new column with the associated filename. So far I've coded the following:
import pandas as pd
import glob, os
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('path/*.csv'))))
df['filename']= os.path.basename(csv)
df
This gives me the dataframe I want, but the new 'filename' column lists only the last filename in the folder for every row. I want each row to be populated with its associated csv file, not just the last file in the folder.
Any assistance for this newbie is much appreciated.
Answered by jezrael
I think you need assign to add the new column inside the loop. The parameter ignore_index=True is also passed to concat to avoid duplicate values in the index:
Files for test are a.csv, b.csv, c.csv.
import pandas as pd
import glob, os
files = glob.glob('files/*.csv')
print(files)
['files\\a.csv', 'files\\b.csv', 'files\\c.csv']
# Read each file, tag its rows with the file name, then concatenate
df = pd.concat([pd.read_csv(fp).assign(New=os.path.basename(fp)) for fp in files], ignore_index=True)
print(df)
   a  b  c  d    New
0  0  1  2  5  a.csv
1  1  5  8  3  a.csv
2  0  9  6  5  b.csv
3  1  6  4  2  b.csv
4  0  7  1  7  c.csv
5  1  3  2  6  c.csv
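To try this approach without preparing files by hand, here is a minimal self-contained sketch. It is not part of the original answer: the two tiny files written to a temporary directory, the helper name `load_with_filename`, and the column name `filename` are all illustrative assumptions. It also compares the resulting index with and without `ignore_index=True`.

```python
import glob
import os
import tempfile

import pandas as pd


def load_with_filename(folder, ignore_index=True):
    """Read every CSV in `folder` and tag each row with its source file name."""
    # sorted() makes the order deterministic; glob itself guarantees no ordering
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    frames = [pd.read_csv(p).assign(filename=os.path.basename(p)) for p in paths]
    return pd.concat(frames, ignore_index=ignore_index)


with tempfile.TemporaryDirectory() as tmp:
    # Two tiny illustrative files with the same columns (made-up data)
    for name, rows in [("a.csv", "a,b\n0,1\n1,5\n"), ("b.csv", "a,b\n0,9\n1,6\n")]:
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write(rows)

    kept = load_with_filename(tmp, ignore_index=False)
    fresh = load_with_filename(tmp, ignore_index=True)

print(kept.index.tolist())   # per-file labels repeat: [0, 1, 0, 1]
print(fresh.index.tolist())  # renumbered: [0, 1, 2, 3]
print(fresh["filename"].tolist())
```

Without `ignore_index=True` each file keeps its own 0-based labels, so label-based lookups such as `df.loc[0]` would return one row per file.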
# Same approach, but keep only the part of the file name before the first dot
files = glob.glob('files/*.csv')
df = pd.concat([pd.read_csv(fp).assign(New=os.path.basename(fp).split('.')[0]) for fp in files], ignore_index=True)
print(df)
   a  b  c  d New
0  0  1  2  5   a
1  1  5  8  3   a
2  0  9  6  5   b
3  1  6  4  2   b
4  0  7  1  7   c
5  1  3  2  6   c
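One caveat with `split('.')[0]`: it cuts at the first dot, so a file name containing several dots loses more than its extension. As a possible alternative (not from the original answer), the standard library's `os.path.splitext` and `pathlib.Path.stem` strip only the final extension. The file name below is hypothetical.

```python
import os
from pathlib import Path

fp = os.path.join("files", "sales.2017.csv")  # hypothetical path

# Cuts at the FIRST dot of the base name
first_dot = os.path.basename(fp).split(".")[0]

# Both of these remove only the LAST extension
last_ext = os.path.splitext(os.path.basename(fp))[0]
stem = Path(fp).stem

print(first_dot)  # sales
print(last_ext)   # sales.2017
print(stem)       # sales.2017
```

For names like a.csv the three variants give the same result, so the choice only matters when dots can appear inside the name.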
Answered by Abid Hasan
Firstly, you have no csv variable defined.
But anyway, this behaviour makes sense: csv is used only once, after all the files have been read, so the whole column is set to a single value, the last file. Ideally, you can use glob again to get all filenames, then set that as a new column.
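The single-value behaviour is easy to reproduce in isolation. In this minimal sketch the small frame and the literal "c.csv" are made-up stand-ins for the concatenated data and for whatever `csv` happened to hold.

```python
import pandas as pd

# Stands in for the already concatenated data (illustrative values)
df = pd.DataFrame({"a": [0, 1, 0, 1]})

# A scalar on the right-hand side is broadcast to every row
df["filename"] = "c.csv"

print(df["filename"].unique())
```

Because the assignment happens after concatenation, pandas has no record of which rows came from which file, which is why the name has to be attached per file or repeated per row.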
# This is a Python list containing the file paths
csvs = glob.glob(os.path.join('path/*.csv'))
# Now put the paths into a pandas Series
csv_paths = pd.Series(csvs)
df['file_name'] = csv_paths.values
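Note that assigning one path per file directly to the column only lines up if the frame has exactly one row per file; with a different number of rows pandas raises a ValueError about mismatched lengths. One way to align names with rows is to repeat each name once per row of its file. The sketch below is an assumption-laden illustration, not the answerer's code: in-memory frames stand in for the files read from disk, and the paths are hypothetical.

```python
import os

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the file paths and the frames read from them
csvs = ["path/a.csv", "path/b.csv"]
frames = [pd.DataFrame({"a": [0, 1, 2]}), pd.DataFrame({"a": [3, 4]})]

df = pd.concat(frames, ignore_index=True)

# Repeat each file name once per row that came from that file
names = [os.path.basename(p) for p in csvs]
df["file_name"] = np.repeat(names, [len(f) for f in frames])

print(df)
```

This requires keeping the per-file row counts around, so tagging each frame before concatenation is usually the simpler route.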