pandas: writing a DataFrame to .xlsx is too slow

Note: this page reproduces a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/38987767/

write dataframe to .xlsx too slow

pandas

Asked by inaMinute

I have a 40 MB DataFrame, 'dfScore', that I am writing to .xlsx. The code is as follows:

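import pandas

# dfScore is the existing 40 MB DataFrame.
writer = pandas.ExcelWriter('test.xlsx', engine='xlsxwriter')
dfScore.to_excel(writer, sheet_name='Sheet1')
writer.save()  # removed in pandas 2.0; call writer.close() there instead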

The call dfScore.to_excel takes almost an hour, and writer.save() takes another hour. Is this normal? Is there a good way to get this under 10 minutes?

I have already searched Stack Overflow, but the suggestions I found do not seem to work for my problem.

Answered by payam.1991

Why don't you save it as .csv? I have worked with heavier DataFrames on my personal laptop and I had the same problem with writing to xlsx.

your_dataframe.to_csv('my_file.csv', encoding='utf-8', columns=list_of_dataframe_columns)

Then you can simply convert it to .xlsx with MS Excel or an online converter.

Answered by jmcnamara

The code dfScore.to_excel takes almost an hour, and the code writer.save() takes another hour. Is this normal?

That sounds a bit too high. I ran an XlsxWriter test writing 1,000,000 rows x 5 columns and it took about 100 s. The time will vary with the CPU and memory of the test machine, but 1 hour is 36 times slower than that, which doesn't seem right.
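
The answer does not include the test script itself. A minimal sketch of such a benchmark, assuming the XlsxWriter package is installed, might look like the following. The names make_rows and time_xlsx_write are hypothetical, and the synthetic integers stand in for whatever data the original test used.

```python
import time


def make_rows(n_rows, n_cols=5):
    """Yield n_rows lists of n_cols integers (synthetic test data)."""
    for r in range(n_rows):
        yield [r * n_cols + c for c in range(n_cols)]


def time_xlsx_write(path, n_rows, n_cols=5):
    """Write synthetic rows with XlsxWriter and return the elapsed seconds."""
    import xlsxwriter  # third-party package, imported here so make_rows works without it

    start = time.perf_counter()
    workbook = xlsxwriter.Workbook(path)
    worksheet = workbook.add_worksheet()
    for row_idx, row in enumerate(make_rows(n_rows, n_cols)):
        worksheet.write_row(row_idx, 0, row)  # one row at a time, starting at column 0
    workbook.close()  # the .xlsx file is assembled and written to disk here
    return time.perf_counter() - start
```

Calling time_xlsx_write('bench.xlsx', 1_000_000) would reproduce the 1,000,000 x 5 shape mentioned above; the timing will depend on the machine.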

Note that Excel, and thus XlsxWriter, only supports 1,048,576 rows per worksheet, so you are effectively throwing away three quarters of your data and wasting time doing it.
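
If all the rows do need to end up in Excel, one possible workaround (not something the answer proposes) is to split the DataFrame across several worksheets. The sketch below assumes pandas and XlsxWriter are installed; sheet_slices and write_split are hypothetical helper names, and one row per sheet is reserved for the header that to_excel writes by default.

```python
EXCEL_MAX_ROWS = 1_048_576  # per-worksheet row limit cited above


def sheet_slices(n_rows, rows_per_sheet=EXCEL_MAX_ROWS - 1):
    """Return (start, stop) row ranges, each small enough for one worksheet.

    The default leaves one row per sheet free for the column header.
    """
    if rows_per_sheet < 1:
        raise ValueError("rows_per_sheet must be at least 1")
    return [
        (start, min(start + rows_per_sheet, n_rows))
        for start in range(0, n_rows, rows_per_sheet)
    ]


def write_split(df, path):
    """Write df to path, one worksheet per slice (Sheet1, Sheet2, ...)."""
    import pandas  # third-party package, imported here so sheet_slices works without it

    with pandas.ExcelWriter(path, engine='xlsxwriter') as writer:
        for i, (start, stop) in enumerate(sheet_slices(len(df)), start=1):
            df.iloc[start:stop].to_excel(writer, sheet_name=f'Sheet{i}')
```

This keeps every row but does not make the write any faster; it only avoids exceeding the per-sheet limit.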

Is there a good way to take less than 10 min?

For pure XlsxWriter programs, pypy gives a good speedup. For example, rerunning my 1,000,000 rows x 5 columns test case with pypy, the time went from 99.15 s to 16.49 s. I don't know if Pandas works with pypy though.
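
To put those two timings in perspective, the ratio can be worked out directly. The short sketch below uses a hypothetical helper, speedup, and also prints which interpreter is running, which is handy when comparing CPython and pypy runs of the same script.

```python
import platform


def speedup(baseline_seconds, new_seconds):
    """Return how many times faster the new run is than the baseline."""
    return baseline_seconds / new_seconds


# Reports the interpreter in use, e.g. 'CPython' or 'PyPy'.
print(platform.python_implementation())

# Timings quoted in the answer: 99.15 s before, 16.49 s with pypy.
print(round(speedup(99.15, 16.49), 2))  # 6.01
```

So the quoted pypy run is roughly six times faster than the original run for that test.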
