Disclaimer: this page is a Chinese/English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must likewise follow CC BY-SA and attribute the original authors (not me). Original question: http://stackoverflow.com/questions/28766133/

Date: 2020-08-19 03:42:41  Source: igfitidea

Faster way to read Excel files to pandas dataframe

Tags: python, pandas, import-from-excel

Asked by jsignell

I have a 14MB Excel file with five worksheets that I'm reading into a pandas dataframe, and although the code below works, it takes 9 minutes!


Does anyone have suggestions for speeding it up?


import pandas as pd

def OTT_read(xl, site_name):
    # Parse one sheet: skip the two header rows, keep only the first three
    # columns, and use the parsed dates in column 0 as the index.
    # (parse_dates takes a list of columns; pass the ExcelFile itself.)
    df = pd.read_excel(xl, site_name, skiprows=2, header=None,
                       usecols=[0, 1, 2], index_col=0, parse_dates=[0],
                       names=['date_time', '%s_depth' % site_name,
                              '%s_temp' % site_name])
    return df

def make_OTT_df(FILEDIR, OTT_FILE):
    # Open the workbook once and parse every sheet from the same handle.
    xl = pd.ExcelFile(FILEDIR + OTT_FILE)
    site_names = xl.sheet_names
    df_list = [OTT_read(xl, site_name) for site_name in site_names]
    return site_names, df_list

FILEDIR = 'c:/downloads/'
OTT_FILE = 'OTT_Data_All_stations.xlsx'
site_names_OTT, df_list_OTT = make_OTT_df(FILEDIR, OTT_FILE)

Accepted answer by Rich Signell

As others have suggested, CSV reading is faster. So if you are on Windows and have Excel, you could call a VBScript to convert the Excel file to CSV and then read the CSV. I tried the script below and it took about 30 seconds.


import pandas as pd
from subprocess import call

# create a list with the sheet numbers you want to process
sheets = [str(n) for n in range(1, 6)]

# convert each sheet to csv and then read it using read_csv
# (raw strings keep the Windows backslashes from being treated as escapes)
df = {}
excel = r'C:\Users\rsignell\OTT_Data_All_stations.xlsx'
for sheet in sheets:
    csv = r'C:\Users\rsignell\test' + sheet + '.csv'
    call(['cscript.exe', r'C:\Users\rsignell\ExcelToCsv.vbs', excel, csv, sheet])
    df[sheet] = pd.read_csv(csv)

Here's a little snippet of Python to create the ExcelToCsv.vbs script:

#write vbscript to file
vbscript="""if WScript.Arguments.Count < 3 Then
    WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"
    Wscript.Quit
End If

csv_format = 6

Set objFSO = CreateObject("Scripting.FileSystemObject")

src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0))
dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1))
worksheet_number = CInt(WScript.Arguments.Item(2))

Dim oExcel
Set oExcel = CreateObject("Excel.Application")

Dim oBook
Set oBook = oExcel.Workbooks.Open(src_file)
oBook.Worksheets(worksheet_number).Activate

oBook.SaveAs dest_file, csv_format

oBook.Close False
oExcel.Quit
"""

# open in text mode with an explicit encoding; writing bytes to a
# text-mode file would fail on Python 3
with open('ExcelToCsv.vbs', 'w', encoding='utf-8') as f:
    f.write(vbscript)

This answer benefited from Convert XLS to CSV on command line and csv & xlsx files import to pandas data frame: speed issue.

Answered by toto_tico

If you have fewer than 65536 rows (in each sheet) you can try xls (instead of xlsx). In my experience xls is faster than xlsx. It is difficult to compare to csv because it depends on the number of sheets.


Although this is not an ideal solution (xls is an old, binary, proprietary format), I have found it useful if you are working with a lot of sheets, internal formulas with values that are often updated, or if for whatever reason you would really like to keep the Excel multi-sheet functionality (instead of separate csv files).

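To check the xls-vs-xlsx claim on your own data, a small timing helper is enough. This is a sketch; the file names in the commented usage are placeholders for your own copies of the workbook, not files from the question:

```python
import time

def time_read(read_fn, *args, **kwargs):
    """Time a single call to a reader function; return (result, elapsed_seconds)."""
    t0 = time.perf_counter()
    result = read_fn(*args, **kwargs)
    return result, time.perf_counter() - t0

# Hypothetical usage -- "data.xls" / "data.xlsx" are placeholder names:
# df_xls, s_xls = time_read(pd.read_excel, "data.xls", sheet_name=None)
# df_xlsx, s_xlsx = time_read(pd.read_excel, "data.xlsx", sheet_name=None)
# print(s_xls, s_xlsx)
```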

Answered by kirby

I know this is old, but in case anyone else is looking for an answer that doesn't involve VB: pandas read_csv() is faster, but you don't need a VB script to get a csv file.


Open your Excel file and save it in *.csv (comma-separated values) format.


Under Tools you can select Web Options, and under the Encoding tab you can change the encoding to whatever works for your data. I ended up using Windows (Western European) because the Windows UTF encoding is "special", but there are lots of ways to accomplish the same thing. Then use the encoding argument in pd.read_csv() to specify your encoding.


Encoding options are listed here

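As a minimal sketch of the encoding argument, the cp1252 codepage below stands in for whatever "Western European" encoding you chose when saving, and the in-memory bytes stand in for the saved file:

```python
import io
import pandas as pd

# Bytes as they would appear in a CSV saved with a Windows
# "Western European" (cp1252) encoding:
raw = "name,city\nRenée,Málaga\n".encode("cp1252")

# Without encoding="cp1252" the accented characters would be misread.
df = pd.read_csv(io.BytesIO(raw), encoding="cp1252")
print(df.loc[0, "name"])  # → Renée
```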

Answered by Dan Steingart

There's no reason to open Excel if you're willing to deal with the slow conversion once.


  1. Read the data into a dataframe with pd.read_excel()
  2. Dump it into a csv right away with df.to_csv()
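The two steps above can be sketched as a one-time conversion helper. The function name and the CSV naming scheme are my own illustration, not from the answer:

```python
import pandas as pd

def excel_to_csvs(xlsx_path, prefix=""):
    """One-time conversion: dump every sheet of a workbook to its own CSV.

    Returns the list of CSV paths written.
    """
    sheets = pd.read_excel(xlsx_path, sheet_name=None)  # dict: name -> DataFrame
    written = []
    for name, df in sheets.items():
        out = f"{prefix}{name}.csv"
        df.to_csv(out, index=False)
        written.append(out)
    return written

# On later runs, skip Excel entirely:
# df = pd.read_csv("Sheet1.csv")
```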

Avoid both Excel- and Windows-specific calls. In my case the one-time hit was worth the hassle.


Answered by Ehsan Tabatabaei

In my experience, pandas read_excel() works fine with Excel files that have multiple sheets. As suggested in Using Pandas to read multiple worksheets, if you assign sheet_name to None, it will automatically put every sheet in a DataFrame and output a dictionary of DataFrames keyed by sheet name.

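A minimal in-memory sketch of that behavior (building the tiny demo workbook requires an xlsx engine such as openpyxl; the sheet names and data are made up):

```python
import io
import pandas as pd

# Build a tiny two-sheet workbook in memory.
buf = io.BytesIO()
with pd.ExcelWriter(buf) as xw:
    pd.DataFrame({"x": [1]}).to_excel(xw, sheet_name="A", index=False)
    pd.DataFrame({"x": [2]}).to_excel(xw, sheet_name="B", index=False)
buf.seek(0)

# sheet_name=None -> a dict of DataFrames keyed by sheet name.
frames = pd.read_excel(buf, sheet_name=None)
print(sorted(frames))  # → ['A', 'B']
```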

But what takes time is where you parse text in your code. A 14MB Excel file with 5 sheets is not that much. I have a 20.1MB Excel file with 46 sheets, each with more than 6000 rows and 17 columns, and using read_excel it took as shown below:


import time
import datetime as dt
import pandas as pd

t0 = time.time()

def parse(datestr):
    # split "YYYY/MM/DD" strings into a datetime.date
    y, m, d = datestr.split("/")
    return dt.date(int(y), int(m), int(d))

# sheet_name=None loads every sheet into a dict of DataFrames
# (note: the encoding argument the original answer passed here was
# removed from read_excel in later pandas versions)
data = pd.read_excel("DATA (1).xlsx", sheet_name=None, skiprows=1,
                     header=0, parse_dates=[1], date_parser=parse)

t1 = time.time()

print(t1 - t0)
## result: 37.54169297218323 seconds

In the code above, data is a dictionary of 46 DataFrames.


As others suggested, using read_csv() can help because reading a .csv file is faster. But consider that since .xlsx files use compression, .csv files might be larger and hence slower to read. If you want to convert your file to comma-separated values using Python (the VB code is offered by Rich Signell), you can use: Convert xlsx to csv
