Loading multiple csv files of a folder into one dataframe
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/52289386/
Asked by PV8
I have multiple csv files saved in one folder, all with the same column layout, and want to load them into Python as a single pandas dataframe.
The question is really similar to this thread.
I am using the following code:
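import glob
import pandas as pd

salesdata = pd.DataFrame()
# Raw string so the backslashes in the Windows path are not read as escapes.
for f in glob.glob(r"TransactionData\Promorelevant\*.csv"):
    appenddata = pd.read_csv(f, header=None, sep=";")
    # Note: DataFrame.append was removed in pandas 2.0, so this needs an older version.
    salesdata = salesdata.append(appenddata, ignore_index=True)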
Is there a better solution for this, perhaps with another package?
This is taking too much time.
Thanks
Answered by jezrael
Answered by Muhammad Haseeb
With help from the linked answer, this seems to be the best one-liner:
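import glob, os
import pandas as pd
# '' is a placeholder for the folder path; read_csv runs with its defaults here
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "*.csv"))))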
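The one-liner above calls pd.read_csv with its defaults (comma separator, first row as header), while the code in the question reads with sep=";" and header=None. A minimal sketch of the same concat + map idea with those options passed through functools.partial follows. The helper name load_folder and its pattern argument are made up for illustration, not part of the original answer.

```python
import glob
import os
from functools import partial

import pandas as pd


def load_folder(folder, pattern="*.csv"):
    """Hypothetical helper: read every matching file and concatenate once."""
    # sep=";" and header=None are taken from the question's code.
    read_semicolon_csv = partial(pd.read_csv, sep=";", header=None)
    # Sort the paths so the row order is reproducible between runs.
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r} in {folder!r}")
    # ignore_index=True gives the combined frame one continuous 0..n-1 index.
    return pd.concat(map(read_semicolon_csv, paths), ignore_index=True)
```

With the folder from the question, this would be called as salesdata = load_folder(r"TransactionData\Promorelevant").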
Answered by PascalVKooten
Maybe using bash will be faster:
head -n 1 "TransactionData/Promorelevant/0.csv" > merged.csv
tail -q -n +2 TransactionData/Promorelevant/*.csv >> merged.csv
Or, if running from within a Jupyter notebook:
!head -n 1 "TransactionData/Promorelevant/0.csv" > merged.csv
!tail -q -n +2 TransactionData/Promorelevant/*.csv >> merged.csv
The idea is that you won't need to parse anything.
The first command copies the header line of one of the files. You can skip this line if your files don't have a header. tail then skips the first line (the header) of every file and appends the remaining rows to merged.csv.
Appending in Python is probably more expensive.
Of course, make sure the merged file still parses correctly with pandas:
pd.read_csv("merged.csv")
Curious about your benchmark.
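For readers without a Unix shell (the backslash paths in the question suggest Windows), the same head/tail idea can be sketched in plain Python: copy the first file whole, then copy every other file minus its first line, without parsing any rows. The function name merge_csv_raw and the has_header flag are assumptions for illustration. Note that the question reads its files with header=None, in which case has_header=False would be the right setting so that no data row is dropped.

```python
import glob
import os
import shutil


def merge_csv_raw(folder, out_path, has_header=True):
    """Hypothetical helper: concatenate csv files byte-wise, keeping one header."""
    # Collect the inputs first; write out_path outside `folder` so that a
    # previous merged file is not picked up as an input on the next run.
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    with open(out_path, "wb") as out:
        for i, path in enumerate(paths):
            with open(path, "rb") as src:
                if has_header and i > 0:
                    src.readline()  # skip the header line, like tail -n +2
                # Copy the rest unparsed. As with tail, a file that does not
                # end in a newline will run into the next file's first row.
                shutil.copyfileobj(src, out)
    return len(paths)
```

As in the bash version, the result should still be checked by reading it back with pd.read_csv.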
Answered by PV8
I checked all of these approaches except the bash one, using the time function (only one run each; also note that the files are on a shared drive).
Here are the results:
My approach: 1220.49
List comprehension + concat: 1135.53
concat + map + join: 1116.31
I will go for list comprehension + concat, which will save me some minutes and which I feel quite familiar with.
Thanks for your ideas.
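The list comprehension + concat variant named in the results is not shown on this page. A minimal sketch of what that approach usually looks like follows, assuming the same separator and header settings as the code in the question. The function name is made up for illustration.

```python
import glob
import os

import pandas as pd


def load_with_list_comprehension(folder):
    """Hypothetical helper: read all files into a list, then concatenate once."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    # header=None and sep=";" mirror the read_csv call in the question.
    frames = [pd.read_csv(p, header=None, sep=";") for p in paths]
    # A single concat copies the data once; appending inside the loop
    # re-copies the growing frame on every iteration.
    # pd.concat raises ValueError if no files were found (empty list).
    return pd.concat(frames, ignore_index=True)
```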