pandas: read multiple csv files and add the filename as a new column

Question

Asked by amwade2

I have several csv files in a single folder and I want to open them all in one dataframe and insert a new column with the associated filename. So far I've coded the following:

import pandas as pd
import glob, os
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('path/*.csv'))))
df['filename']= os.path.basename(csv)
df

This gives me the dataframe I want, but the new column 'filename' lists only the last filename in the folder for every row. I want each row to be populated with the name of the csv file it came from, not just the last file in the folder.

Any assistance for this newbie is much appreciated.

Answer 1

Answered by jezrael

I think you need assign to add the new column inside the loop. The parameter ignore_index=True is also passed to concat to avoid duplicate values in the index:

The test files are a.csv, b.csv and c.csv.

import pandas as pd
import glob, os

files = glob.glob('files/*.csv')
print (files)
['files\\a.csv', 'files\\b.csv', 'files\\c.csv']

# tag each file's rows with its name, then renumber the index from 0
df = pd.concat([pd.read_csv(fp).assign(New=os.path.basename(fp)) for fp in files], ignore_index=True)
print (df)
   a  b  c  d    New
0  0  1  2  5  a.csv
1  1  5  8  3  a.csv
2  0  9  6  5  b.csv
3  1  6  4  2  b.csv
4  0  7  1  7  c.csv
5  1  3  2  6  c.csv

files = glob.glob('files/*.csv')
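# same idea, but drop the extension; splitext also copes with extra dots in the name
df = pd.concat([pd.read_csv(fp).assign(New=os.path.splitext(os.path.basename(fp))[0]) for fp in files], ignore_index=True)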
print (df)
   a  b  c  d New
0  0  1  2  5   a
1  1  5  8  3   a
2  0  9  6  5   b
3  1  6  4  2   b
4  0  7  1  7   c
5  1  3  2  6   c
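
The approach above can be tried end to end with the following minimal sketch. It is only an illustration: the helper name read_with_filename, the column name "filename", and the two throwaway csv files written to a temporary folder are all made up for the example, and pathlib is used in place of glob and os.path.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_with_filename(folder, column="filename"):
    """Read every csv in `folder` and tag each row with its source file name.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Sort the paths so the row order is reproducible across platforms.
    paths = sorted(Path(folder).glob("*.csv"))
    # assign() adds the column to each frame *before* concatenation,
    # so every row keeps the name of the file it was read from.
    frames = [pd.read_csv(p).assign(**{column: p.name}) for p in paths]
    return pd.concat(frames, ignore_index=True)


# Write two small csv files to a temporary folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"a": [0, 1], "b": [1, 5]}).to_csv(Path(tmp) / "a.csv", index=False)
    pd.DataFrame({"a": [0, 1], "b": [9, 6]}).to_csv(Path(tmp) / "b.csv", index=False)
    combined = read_with_filename(tmp)

print(combined)
```

Note that pd.concat raises a ValueError when the folder contains no csv files, so a real script may want to check that the list of paths is not empty first.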

Answer 2

Answered by Abid Hasan

Firstly, you have no csv variable defined.

But anyway, this behaviour makes sense: you use csv only at the end, after all the files have been read, so it is set to the last file, and that single value is written to every row. Instead, you can use glob again to get all the filenames, then set them as a new column.

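# this is a Python list containing the file paths
csvs = glob.glob(os.path.join('path', '*.csv'))

# read the files in that same order so rows and paths line up
frames = [pd.read_csv(csv) for csv in csvs]
df = pd.concat(frames, ignore_index=True)

# now put the paths into a pd.Series, repeating each one once per row of its file
csv_paths = pd.Series(csvs).repeat([len(frame) for frame in frames])

df['file_name'] = csv_paths.values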
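
To illustrate why every row ended up with the same filename, here is a small sketch. The four-row DataFrame and the file names in it are invented for the example. It shows that assigning a single string to a column broadcasts it to all rows, while a list is accepted only when it has exactly one entry per row.

```python
import pandas as pd

# Stand-in for the concatenated data: four rows from (hypothetically) two files.
df = pd.DataFrame({"a": [0, 1, 0, 1]})

# A single string is broadcast: every row receives the same value.
df["filename"] = "c.csv"
print(df["filename"].unique())

# A list must have one entry per row; two values for four rows is rejected.
mismatch_raised = False
try:
    df["filename"] = ["a.csv", "b.csv"]
except ValueError as exc:
    mismatch_raised = True
    print("length mismatch:", exc)
```

This is why the filenames have to be attached per file (or repeated once per row) rather than assigned once to the finished DataFrame.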
