pandas: load multiple csv files from a folder into one dataframe

Question

Asked by PV8

I have multiple csv files saved in one folder, all with the same column layout, and want to load them into Python as a single pandas dataframe.

The question is really similar to this thread.

I am using the following code:

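import glob
import pandas as pd

salesdata = pd.DataFrame()
# Raw string so the backslashes in the Windows path are not treated as escapes.
for f in glob.glob(r"TransactionData\Promorelevant\*.csv"):
    appenddata = pd.read_csv(f, header=None, sep=";")
    # Each append copies the accumulated frame (DataFrame.append was removed in pandas 2.0).
    salesdata = salesdata.append(appenddata, ignore_index=True)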

Is there a better solution, perhaps with another package?

This is taking too much time.

Thanks

Answer 1

Answered by jezrael

I suggest using a list comprehension with concat:

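import glob
import pandas as pd

# Raw string keeps the backslashes in the Windows path literal.
files = glob.glob(r"TransactionData\Promorelevant\*.csv")
# Read each file once, then concatenate in a single call.
dfs = [pd.read_csv(f, header=None, sep=";") for f in files]

salesdata = pd.concat(dfs, ignore_index=True)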
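
For readers who want to try the pattern without the original data, here is a minimal self-contained sketch. The helper name load_folder, the temporary folder and the sample rows are all invented for illustration, and pathlib's Path.glob stands in for glob.glob so that no backslashes are needed in the pattern.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_folder(folder, pattern="*.csv", **read_csv_kwargs):
    """Read every file in folder that matches pattern and concatenate once."""
    files = sorted(Path(folder).glob(pattern))  # sorted for a stable row order
    if not files:
        raise FileNotFoundError(f"no files matching {pattern!r} in {folder}")
    frames = [pd.read_csv(f, **read_csv_kwargs) for f in files]
    return pd.concat(frames, ignore_index=True)


# Hypothetical sample data: two small header-less, semicolon-separated files.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "a.csv").write_text("1;x\n2;y\n")
(demo_dir / "b.csv").write_text("3;z\n")

salesdata = load_folder(demo_dir, header=None, sep=";")
print(salesdata.shape)
```

Collecting the frames in a list and calling concat once avoids re-copying the accumulated data on every iteration, which is what the append loop in the question does.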

Answer 2

Answered by Muhammad Haseeb

With help from the link to the actual answer:

This seems to be the best one-liner:

import glob, os
import pandas as pd

# os.path.join('', "*.csv") matches the csv files in the current directory.
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "*.csv"))))
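
Note that the one-liner calls pd.read_csv with its default arguments, so it would not pick up the header=None and sep=";" settings used in the question. One possible way to keep the map style while passing those options is functools.partial. The sketch below assumes made-up file names and values and writes them to a temporary folder just for the demonstration.

```python
import glob
import os
import tempfile
from functools import partial

import pandas as pd

# Hypothetical sample folder with two header-less, semicolon-separated files.
folder = tempfile.mkdtemp()
for name, text in [("jan.csv", "1;10.5\n2;20.0\n"), ("feb.csv", "3;30.25\n")]:
    with open(os.path.join(folder, name), "w") as fh:
        fh.write(text)

# Bind the parsing options once, then map the bound reader over the paths.
read_sales = partial(pd.read_csv, header=None, sep=";")
paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
df = pd.concat(map(read_sales, paths), ignore_index=True)
print(len(df))
```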

Answer 3

Answered by PascalVKooten

Maybe using bash will be faster:

head -n 1 "TransactionData/Promorelevant/0.csv" > merged.csv
tail -q -n +2 TransactionData/Promorelevant/*.csv >> merged.csv

Or, if running from within a Jupyter notebook:

!head -n 1 "TransactionData/Promorelevant/0.csv" > merged.csv
!tail -q -n +2 TransactionData/Promorelevant/*.csv >> merged.csv

The idea being that you won't need to parse anything.

The first command copies the header of one of the files. The tail command then skips the first line (the header) of every file and appends the remaining rows to merged.csv. If your files have no header, skip the first command and use -n +1 instead of -n +2, otherwise the first data row of each file is dropped.

Appending in Python is probably more expensive.

Of course, check that the merged file still parses correctly with pandas:

pd.read_csv("merged.csv")

Curious to see your benchmark.
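
If bash is not available (for example on Windows), the same idea of merging the files as raw text without parsing them can be sketched in Python. This is only an illustration, not part of the answer above: the function name merge_csv_text and the sample files are hypothetical, and it assumes every file ends with a newline.

```python
import shutil
import tempfile
from pathlib import Path


def merge_csv_text(files, out_path, has_header=True):
    """Concatenate CSV files as raw text, keeping only the first header."""
    with open(out_path, "w", newline="") as out:
        for i, path in enumerate(files):
            with open(path, newline="") as src:
                if has_header and i > 0:
                    src.readline()  # drop the repeated header line
                shutil.copyfileobj(src, out)  # copy the rest without parsing


# Hypothetical sample data: two files that share the same header.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "0.csv").write_text("id;amount\n1;10\n")
(demo_dir / "1.csv").write_text("id;amount\n2;20\n")

# The output gets a different extension so it cannot match the *.csv pattern.
merged = demo_dir / "merged.txt"
merge_csv_text(sorted(demo_dir.glob("*.csv")), merged)
print(merged.read_text())
```

With has_header=False nothing is skipped, which corresponds to the header-less files in the question.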

Answer 4

Answered by PV8

I checked all of these approaches except the bash one, using the time function (only one run each; note also that the files are on a shared drive).

Here are the results:

My approach: 1220.49

List comprehension + concat: 1135.53

concat + map + join: 1116.31
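
The exact measurement code is not shown, so the following is only a sketch of how such a comparison could be timed. The helper names time_once and best_of are hypothetical, and the summation is a placeholder workload to be swapped for one of the loading approaches. Repeating the run and taking the minimum reduces noise, which matters with files on a shared drive.

```python
import time


def time_once(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def best_of(func, repeats=3):
    """Return the smallest elapsed time over several runs of func()."""
    return min(time_once(func)[1] for _ in range(repeats))


# Placeholder workload; swap in one of the loading approaches.
result, elapsed = time_once(sum, range(1_000_000))
print(f"single run: {elapsed:.4f} s")
print(f"best of 3:  {best_of(lambda: sum(range(1_000_000))):.4f} s")
```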

I will go with list comprehension + concat, which saves me some time and is a pattern I already feel quite familiar with.

Thanks for your ideas.
