使用 Pandas 读取 Excel XML .xls 文件

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/33470130/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 00:08:43  来源:igfitidea点击:

Read Excel XML .xls file with pandas

pythonexcelpandasxlrd

提问by JBWhitmore

I'm aware of a number of previously asked questions, but none of the solutions given work on the reproducible example that I provide below.

我知道之前提出的一些问题,但没有一个解决方案适用于我在下面提供的可重现示例。

I am trying to read in .xlsfiles from http://www.eia.gov/coal/data.cfm#production-- specifically the Historical detailed coal production data (1983-2013)coalpublic2012.xlsfile that's freely available via the dropdown. Pandas cannot read it.

我正在尝试.xlshttp://www.eia.gov/coal/data.cfm#production读取文件——特别是历史详细煤炭生产数据 (1983-2013)coalpublic2012.xls文件,该文件可通过下拉菜单免费获得。Pandas无法读取它。

In contrast, the file for the most recent year available, 2013, coalpublic2013.xlsfile, works without a problem:

相比之下,最近一年可用的coalpublic2013.xls文件,2013, 文件,工作没有问题:

import pandas as pd
df1 = pd.read_excel("coalpublic2013.xls")

but the next decade of .xlsfiles (2004-2012) do not load. I have looked at these files with Excel, and they open, and are not corrupted.

但下一个十年的.xls文件(2004-2012)不会加载。我用 Excel 查看了这些文件,它们打开了,并且没有损坏。

The error that I get from pandas is:

我从Pandas那里得到的错误是:

---------------------------------------------------------------------------
XLRDError                                 Traceback (most recent call last)
<ipython-input-28-0da33766e9d2> in <module>()
----> 1 df = pd.read_excel("coalpublic2012.xlsx")

/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, engine, **kwds)
    161 
    162     if not isinstance(io, ExcelFile):
--> 163         io = ExcelFile(io, engine=engine)
    164 
    165     return io._parse_excel(

/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in __init__(self, io, **kwds)
    204                 self.book = xlrd.open_workbook(file_contents=data)
    205             else:
--> 206                 self.book = xlrd.open_workbook(io)
    207         elif engine == 'xlrd' and isinstance(io, xlrd.Book):
    208             self.book = io

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/__init__.pyc in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    433         formatting_info=formatting_info,
    434         on_demand=on_demand,
--> 435         ragged_rows=ragged_rows,
    436         )
    437     return bk

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
     89         t1 = time.clock()
     90         bk.load_time_stage_1 = t1 - t0
---> 91         biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
     92         if not biff_version:
     93             raise XLRDError("Can't determine file's BIFF version")

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in getbof(self, rqd_stream)
   1228             bof_error('Expected BOF record; met end of file')
   1229         if opcode not in bofcodes:
-> 1230             bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
   1231         length = self.get2bytes()
   1232         if length == MY_EOF:

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in bof_error(msg)
   1222         if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
   1223         def bof_error(msg):
-> 1224             raise XLRDError('Unsupported format, or corrupt file: ' + msg)
   1225         savpos = self._position
   1226         opcode = self.get2bytes()

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<?xml ve'

And I have tried various other things:

我还尝试了其他各种事情:

df = pd.ExcelFile("coalpublic2012.xls", encoding_override='cp1252')
import xlrd
wb = xlrd.open_workbook("coalpublic2012.xls")

to no avail. My pandas version: 0.17.0

无济于事。我的Pandas版本:0.17.0

I've also submitted this as a bug to the pandas github issueslist.

我也将此作为 bug 提交到 pandas github问题列表。

回答by jrovegno

You can convert this Excel XML file programmatically. Requirement: only python and pandas.

您可以以编程方式转换此 Excel XML 文件。要求:只有python和pandas。

import pandas as pd
from xml.sax import ContentHandler, parse

# Reference https://goo.gl/KaOBG3
class ExcelHandler(ContentHandler):
    def __init__(self):
        self.chars = [  ]
        self.cells = [  ]
        self.rows = [  ]
        self.tables = [  ]
    def characters(self, content):
        self.chars.append(content)
    def startElement(self, name, atts):
        if name=="Cell":
            self.chars = [  ]
        elif name=="Row":
            self.cells=[  ]
        elif name=="Table":
            self.rows = [  ]
    def endElement(self, name):
        if name=="Cell":
            self.cells.append(''.join(self.chars))
        elif name=="Row":
            self.rows.append(self.cells)
        elif name=="Table":
            self.tables.append(self.rows)

excelHandler = ExcelHandler()
parse('coalpublic2012.xls', excelHandler)
df1 = pd.DataFrame(excelHandler.tables[0][4:], columns=excelHandler.tables[0][3])

回答by maxymoo

The problem is that while the 2013 data is an actual Excel file, the 2012 data is an XML document, something which seems to not be supported in Python. I would say your best bet is to open it in Excel, and save a copy as either a proper Excel file, or as a CSV.

问题是,虽然 2013 年的数据是一个实际的 Excel 文件,但 2012 年的数据是一个 XML 文档,这在 Python 中似乎不受支持。我想说最好的办法是在 Excel 中打开它,然后将副本另存为正确的 Excel 文件或 CSV。

回答by jrovegno

You can convert this Excel XML file programmatically. Requirement: Windows, Office installed.

您可以以编程方式转换此 Excel XML 文件。要求:安装了Windows、Office。

1.Create in Notepad ExcelToCsv.vbs script:

1.在记事本中创建ExcelToCsv.vbs脚本:

if WScript.Arguments.Count < 3 Then
    WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"
    Wscript.Quit
End If

csv_format = 6

Set objFSO = CreateObject("Scripting.FileSystemObject")

src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0))
dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1))
worksheet_number = CInt(WScript.Arguments.Item(2))

Dim oExcel
Set oExcel = CreateObject("Excel.Application")

Dim oBook
Set oBook = oExcel.Workbooks.Open(src_file)
oBook.Worksheets(worksheet_number).Activate

oBook.SaveAs dest_file, csv_format

oBook.Close False
oExcel.Quit
  1. Convert the Excel XML file in CSV:
  1. 将 Excel XML 文件转换为 CSV:

$ cscript ExcelToCsv.vbs coalpublic2012.xls coalpublic2012.csv 1

$ cscript ExcelToCsv.vbs coalpublic2012.xls coalpublic2012.csv 1

  1. Open the CSV file with pandas
  1. 用 Pandas 打开 CSV 文件

>>> df1 = pd.read_csv('coalpublic2012.csv', skiprows=3)

>>> df1 = pd.read_csv('coalpublic2012.csv', skiprows=3)

Reference: Faster way to read Excel files to pandas dataframe

参考:将 Excel 文件读取到 Pandas 数据框的更快方法

回答by Stuti Verma

@JBWhitmore I have run the following code:

@JBWhitmore 我运行了以下代码:

import pandas as pd
#Read and write to excel
dataFileUrl = r"/Users/stutiverma/Downloads/coalpublic2012.xls"
data = pd.read_table(dataFileUrl)

This reads the file successfully without giving any error. But, it gives all the data in the exact format as mentioned. So, you may have to do extra efforts in order to process the data after reading it successfully.

这将成功读取文件而不会出现任何错误。但是,它以上述确切格式提供所有数据。因此,您可能需要付出额外的努力才能在成功读取数据后对其进行处理。