Importing csv and xlsx files into a Pandas DataFrame: speed issue

Question

Asked by sashkello

Reading data (just 20000 numbers) from an xlsx file takes forever:

import pandas as pd
xlsxfile = pd.ExcelFile("myfile.xlsx")
data = xlsxfile.parse('Sheet1', index_col = None, header = None)

This takes about 9 seconds.

If I save the same file in csv format, reading it takes ~25 ms:

import pandas as pd
csvfile = "myfile.csv"
data = pd.read_csv(csvfile, index_col = None, header = None)
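
The two timings above can be reproduced with a small helper. This is only a sketch: `best_time` and `compare_readers` are hypothetical names introduced here, the file and sheet names are the ones from the snippets above, and pandas is imported lazily so the timing helper itself depends only on the standard library.

```python
import time


def best_time(fn, repeats=3):
    """Return the shortest wall-clock time in seconds over `repeats` calls of fn()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare_readers(csv_path="myfile.csv", xlsx_path="myfile.xlsx"):
    """Time the csv and xlsx reads from the question (paths are assumptions)."""
    import pandas as pd  # imported here so best_time works without pandas

    csv_seconds = best_time(
        lambda: pd.read_csv(csv_path, index_col=None, header=None)
    )
    xlsx_seconds = best_time(
        lambda: pd.ExcelFile(xlsx_path).parse(
            "Sheet1", index_col=None, header=None
        )
    )
    return {"csv": csv_seconds, "xlsx": xlsx_seconds}
```

Taking the minimum over a few runs reduces noise from disk caching and other processes; calling `compare_readers()` returns both timings in seconds.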

Is this an issue with openpyxl, or am I missing something? Are there any alternatives?

Answer 1

Answered by Matti John

xlrd has support for .xlsx files, and this answer suggests that at least the beta version of xlrd with .xlsx support was quicker than openpyxl.

The current stable version of Pandas (0.11.0) uses openpyxl for .xlsx files, but this has been changed for the next release. If you want to give it a go, you can download the dev version from GitHub.
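
In pandas releases much newer than the one discussed here, the parser can be chosen per call through the `engine` argument of `pd.read_excel`, so there is no need to switch pandas versions to try a different backend. A minimal sketch, assuming a recent pandas with openpyxl installed; `read_xlsx` is a hypothetical helper name, and the sheet name is the one from the question. Note that recent xlrd releases read only the legacy .xls format, which is why openpyxl is the default assumed here.

```python
def read_xlsx(path, sheet_name="Sheet1", engine="openpyxl"):
    """Read one sheet without a header row, using an explicitly chosen engine."""
    import pandas as pd  # lazy import: assumes pandas and the engine are installed

    return pd.read_excel(
        path,
        sheet_name=sheet_name,
        index_col=None,
        header=None,
        engine=engine,  # the package named here must be installed separately
    )
```

For example, `read_xlsx("myfile.xlsx")` would load `Sheet1` with openpyxl; passing a different `engine` makes it easy to time the backends against each other.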
