Importing CSV and XLSX files into a pandas DataFrame: speed issue

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/16182822/

Date: 2020-09-13 20:47:09  Source: igfitidea

csv & xlsx files import to pandas data frame: speed issue

python csv pandas xlsx openpyxl

Asked by sashkello

Reading data (just 20,000 numbers) from an xlsx file takes forever:

import pandas as pd

# Open the workbook, then parse one sheet with no header row and no index column
xlsxfile = pd.ExcelFile("myfile.xlsx")
data = xlsxfile.parse("Sheet1", index_col=None, header=None)

This takes about 9 seconds.

If I save the same file in CSV format, reading it takes ~25 ms:

import pandas as pd

# Same data saved as CSV, again with no header row and no index column
csvfile = "myfile.csv"
data = pd.read_csv(csvfile, index_col=None, header=None)

Is this an issue with openpyxl, or am I missing something? Are there any alternatives?
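The question quotes two timings without showing how they were measured. One way to reproduce such a comparison is sketched below. The helper names (`time_call`, `write_sample_csv`, `read_csv_with_pandas`, `read_csv_with_stdlib`) are hypothetical and not part of the original post, and the sample file only approximates "20,000 numbers"; the real file's layout is not given.

```python
import csv
import time


def time_call(func, *args, repeat=3, **kwargs):
    """Call func `repeat` times and return (best elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result


def write_sample_csv(path, n_rows=20000):
    """Write n_rows one-number rows (an assumed stand-in for the asker's file)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for i in range(n_rows):
            writer.writerow([i * 0.5])


def read_csv_with_pandas(path):
    """Load the CSV with the same arguments the question uses (requires pandas)."""
    import pandas as pd  # imported lazily so the timing helper runs without pandas
    return pd.read_csv(path, index_col=None, header=None)


def read_csv_with_stdlib(path):
    """Baseline reader that uses only the standard library."""
    with open(path, newline="") as handle:
        return [float(row[0]) for row in csv.reader(handle)]
```

For example, `seconds, data = time_call(read_csv_with_pandas, "myfile.csv")` gives the best of three runs; passing a function that wraps the `ExcelFile(...).parse(...)` call times the xlsx path the same way, so both numbers come from one harness.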

Answered by Matti John

xlrd has support for .xlsx files, and this answer suggests that at least the beta version of xlrd with .xlsx support was quicker than openpyxl.

The current stable version of pandas (0.11.0) uses openpyxl for .xlsx files, but this has been changed for the next release. If you want to give it a go, you can download the dev version from GitHub.
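The answer refers to a pandas release where the reader library was fixed by the installed version. Later pandas releases expose an `engine` argument on `pd.read_excel`, so the reader can be chosen explicitly; a minimal sketch under that assumption follows. The helper names `available_excel_engines` and `read_xlsx` are hypothetical. Which engines can actually open an .xlsx file depends on the installed versions (xlrd 2.0 and later reads only the older .xls format).

```python
import importlib.util


def available_excel_engines(candidates=("openpyxl", "xlrd")):
    """Return the candidate reader packages that can be imported in this environment."""
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]


def read_xlsx(path, sheet="Sheet1", engine=None):
    """Read one sheet with no header row, optionally forcing a reader engine.

    Assumes a pandas version whose read_excel accepts `engine`;
    engine=None lets pandas pick its default for the file type.
    """
    import pandas as pd  # imported lazily; only needed when a file is actually read
    return pd.read_excel(path, sheet_name=sheet, index_col=None,
                         header=None, engine=engine)
```

Calling `read_xlsx("myfile.xlsx", engine=name)` for each name returned by `available_excel_engines()` and timing each call is one way to check whether the speed difference described in the answer still holds on a given setup.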