pandas: writing a DataFrame to .xlsx is too slow
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/38987767/
write dataframe to .xlsx too slow
Asked by inaMinute
I have a 40 MB DataFrame, 'dfScore', that I am writing to .xlsx.
The code is as follows:
writer = pandas.ExcelWriter('test.xlsx', engine='xlsxwriter')
dfScore.to_excel(writer,sheet_name='Sheet1')
writer.save()
The call dfScore.to_excel takes almost an hour, and writer.save() takes another hour. Is this normal? Is there a good way to get this under 10 minutes?
I have already searched Stack Overflow, but the suggestions I found do not seem to work for my problem.
Answered by payam.1991
Why don't you save it as .csv? I have worked with larger DataFrames on my personal laptop and had the same problem when writing to .xlsx.
your_dataframe.to_csv('my_file.csv',encoding='utf-8',columns=list_of_dataframe_columns)
Then you can simply convert it to .xlsx with MS Excel or an online converter.
Answered by jmcnamara
"The call dfScore.to_excel takes almost an hour, and writer.save() takes another hour. Is this normal?"
That sounds a bit too high. I ran an XlsxWriter test writing 1,000,000 rows x 5 columns and it took about 100 s. The time will vary with the CPU and memory of the test machine, but 1 hour is 36 times slower, which doesn't seem right.
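To compare against these numbers on your own machine, it may help to time each stage separately. The following is a minimal sketch using only the standard library; the helper name `timed` and the stand-in workload are hypothetical and not part of the original answer.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}

# Stand-in workload (an assumption): build 100,000 rows x 5 columns in memory.
with timed("build", timings):
    rows = [[i, i * 2, i * 3, i * 4, i * 5] for i in range(100_000)]

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f}s")
```

In the question's code, the same helper could wrap the dfScore.to_excel(...) call and the writer.save() call in two separate `with timed(...)` blocks to see where the time actually goes.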
Note that Excel, and thus XlsxWriter, only supports 1,048,576 rows per worksheet, so you are effectively throwing away three quarters of your data and wasting time doing it.
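One way to stay within the per-worksheet limit is to split the rows across several sheets. The sketch below only computes the row ranges; the function name `sheet_ranges`, the single header row, and the 4,000,000-row figure are assumptions for illustration and do not come from the original question.

```python
EXCEL_MAX_ROWS = 1_048_576  # per-worksheet row limit cited above

def sheet_ranges(n_rows, header_rows=1, max_rows=EXCEL_MAX_ROWS):
    """Yield (start, stop) data-row ranges that each fit on one worksheet."""
    per_sheet = max_rows - header_rows  # leave room for the header row(s)
    if per_sheet <= 0:
        raise ValueError("header_rows must be smaller than max_rows")
    for start in range(0, n_rows, per_sheet):
        yield start, min(start + per_sheet, n_rows)

# Hypothetical row count, chosen only to illustrate the split.
ranges = list(sheet_ranges(4_000_000))
print(len(ranges))  # 4
print(ranges[0])    # (0, 1048575)

# Each range could then be written to its own sheet, for example:
#   dfScore.iloc[start:stop].to_excel(writer, sheet_name=f"Sheet{i + 1}")
```

This does not make the write itself faster; it only avoids exceeding the row limit on a single worksheet.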
"Is there a good way to get this under 10 minutes?"
For pure XlsxWriter programs, pypy gives a good speedup. For example, rerunning my 1,000,000 rows x 5 columns test case with pypy, the time went from 99.15 s to 16.49 s. I don't know if pandas works with pypy, though.