如何使用 xlrd 将 Excel 文件读入 Python?它可以读取较新的 Office 格式吗?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/118516/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-11-03 19:31:35  来源:igfitidea点击:

How do I read an Excel file into Python using xlrd? Can it read newer Office formats?

pythonxlrdimport-from-excel

提问by John Machin

My issue is below but would be interested comments from anyone with experience with xlrd.

我的问题在下面,但是任何有 xlrd 经验的人都会感兴趣的评论。

I just found xlrd and it looks like the perfect solution but I'm having a little problem getting started. I am attempting to extract data programatically from an Excel file I pulled from Dow Jones with current components of the Dow Jones Industrial Average (link: http://www.djindexes.com/mdsidx/?event=showAverages)

我刚刚找到 xlrd,它看起来是完美的解决方案,但我在开始时遇到了一些问题。我正在尝试以编程方式从我从道琼斯提取的 Excel 文件中提取数据,其中包含道琼斯工业平均指数的当前组成部分(链接:http: //www.djindexes.com/mdsidx/?event=showAverages

When I open the file unmodified I get a nasty BIFF error (binary format not recognized)

当我打开未修改的文件时,我收到一个令人讨厌的 BIFF 错误(无法识别二进制格式)

However you can see in this screenshot that Excel 2008 for Mac thinks it is in 'Excel 1997-2004' format (screenshot: http://skitch.com/alok/ssa3/componentreport-dji.xls-properties)

但是,您可以在此屏幕截图中看到 Excel 2008 for Mac 认为它是“Excel 1997-2004”格式(屏幕截图:http: //skitch.com/alok/ssa3/componentreport-dji.xls-properties

If I instead open it in Excel manually and save as 'Excel 1997-2004' format explicitly, then open in python usig xlrd, everything is wonderful. Remember, Office thinks the file is already in 'Excel 1997-2004' format. All files are .xls

如果我改为手动在 Excel 中打开它并显式保存为“Excel 1997-2004”格式,然后在 python usig xlrd 中打开,一切都很棒。请记住,Office 认为该文件已经是“Excel 1997-2004”格式。所有文件都是 .xls

Here is a pastebin of an ipython session replicating the issue: http://pastie.textmate.org/private/jbawdtrvlrruh88mzueqdq

这是复制问题的 ipython 会话的 pastebin:http://pastie.textmate.org/private/jbawdtrvlrruh88mzueqdq

Any thoughts on: How to trick xlrd into recognizing the file so I can extract data? How to use python to automate the explicit 'save as' format to one that xlrd will accept? Plan B?

关于以下问题的任何想法:如何欺骗 xlrd 识别文件以便我可以提取数据?如何使用python将显式“另存为”格式自动化为xlrd将接受的格式?B计划?

回答by John Machin

FWIW, I'm the author of xlrd, and the maintainer of xlwt (a fork of pyExcelerator). A few points:

FWIW,我是 xlrd 的作者,也是 xlwt(pyExcelerator 的一个分支)的维护者。几点:

  1. The file ComponentReport-DJI.xls is misnamed; it is not an XLS file, it is a tab-separated-values file. Open it with a text editor (e.g. Notepad) and you'll see what I mean. You can also look at the not-very-raw raw bytes with Python:

    >>> open('ComponentReport-DJI.xls', 'rb').read(200)
    'COMPANY NAME\tPRIMARY EXCHANGE\tTICKER\tSTYLE\tICB SUBSECTOR\tMARKET CAP RANGE\
    tWEIGHT PCT\tUSD CLOSE\t\r\n3M Co.\tNew York SE\tMMM\tN/A\tDiversified Industria
    ls\tBroad\t5.15676229508\t50.33\t\r\nAlcoa Inc.\tNew York SE\tA'
    

    You can read this file using Python's csv module ... just use delimiter="\t"in your call to csv.reader().

  2. xlrd can read any file that pyExcelerator can, and read them better—dates don't come out as floats, and the full story on Excel dates is in the xlrd documentation.

  3. pyExcelerator is abandonware—xlrd and xlwt are alive and well. Check out http://groups.google.com/group/python-excel

  1. ComponentReport-DJI.xls 文件命名错误;它不是 XLS 文件,而是制表符分隔值文件。用文本编辑器(例如记事本)打开它,你就会明白我的意思。您还可以使用 Python 查看不太原始的原始字节:

    >>> open('ComponentReport-DJI.xls', 'rb').read(200)
    'COMPANY NAME\tPRIMARY EXCHANGE\tTICKER\tSTYLE\tICB SUBSECTOR\tMARKET CAP RANGE\
    tWEIGHT PCT\tUSD CLOSE\t\r\n3M Co.\tNew York SE\tMMM\tN/A\tDiversified Industria
    ls\tBroad\t5.15676229508\t50.33\t\r\nAlcoa Inc.\tNew York SE\tA'
    

    您可以使用 Python 的 csv 模块读取此文件……只需delimiter="\t"在调用csv.reader().

  2. xlrd 可以读取 pyExcelerator 可以读取的任何文件,并且可以更好地读取它们——日期不会以浮点数形式出现,有关 Excel 日期的完整故事在 xlrd 文档中。

  3. pyExcelerator 是废弃软件——xlrd 和 xlwt 还活着。查看http://groups.google.com/group/python-excel

HTH John

HTH约翰

回答by msanders

xlrd support for Office 2007/2008 (OpenXML) format is in alpha test - see the following post in the python-excel newsgroup: http://groups.google.com/group/python-excel/msg/0c5f15ad122bf24b?hl=en

xlrd 对 Office 2007/2008 (OpenXML) 格式的支持正在进行 alpha 测试 - 请参阅 python-excel 新闻组中的以下帖子:http://groups.google.com/group/python-excel/msg/0c5f15ad122bf24b?hl=en

回答by John Fouhy

More info on pyExcelerator: To read a file, do this:

有关 pyExcelerator 的更多信息:要读取文件,请执行以下操作:

import pyExcelerator
book = pyExcelerator.parse_xls(filename)

where filename is a string that is the filename to read (not a file-like object). This will give you a data structure representing the workbook: a list of pairs, where the first element of the pair is the worksheet name and the second element is the worksheet data.

其中 filename 是一个字符串,它是要读取的文件名(不是类似文件的对象)。这将为您提供一个表示工作簿的数据结构:一个对的列表,其中该对的第一个元素是工作表名称,第二个元素是工作表数据。

The worksheet data is a dictionary, where the keys are (row, col) pairs (starting with 0) and the values are the cell contents -- generally int, float, or string. So, for instance, in the simple case of all the data being on the first worksheet:

工作表数据是一个字典,其中键是 (row, col) 对(从 0 开始),值是单元格内容——通常是 int、float 或 string。因此,例如,在所有数据都在第一个工作表上的简单情况下:

data = book[0][1]
print 'Cell A1 of worksheet %s is: %s' % (book[0][0], repr(data[(0, 0)]))

If the cell is empty, you'll get a KeyError. If you're dealing with dates, they may(I forget) come through as integers or floats; if this is the case, you'll need to convert. Basically the rule is: datetime.datetime(1899, 12, 31) + datetime.timedelta(days=n) but that might be off by 1 or 2 (because Excel treats 1900 as a leap-year for compatibility with Lotus, and because I can't remember if 1900-1-1 is 0 or 1), so do some trial-and-error to check. Datetimes are stored as floats, I think (days and fractions of a day).

如果单元格为空,您将收到 KeyError。如果您正在处理日期,它们可能(我忘记了)以整数或浮点数形式出现;如果是这种情况,您需要转换。基本上规则是: datetime.datetime(1899, 12, 31) + datetime.timedelta(days=n) 但这可能会相差 1 或 2(因为 Excel 将 1900 视为与 Lotus 兼容的闰年,并且因为我不记得 1900-1-1 是 0 还是 1),所以要反复试验检查。我认为日期时间存储为浮点数(天数和一天的一小部分)。

I think there is partial support for forumulas, but I wouldn't guarantee anything.

我认为对论坛有部分支持,但我不能保证任何事情。

回答by Michael Neale

Well here is some code that I did: (look down the bottom): here

好吧,这是我做的一些代码:(向下看):这里

Not sure about the newer formats - if xlrd can't read it, xlrd needs to have a new version released !

不确定较新的格式 - 如果 xlrd 无法读取,xlrd 需要发布新版本!

回答by John Fouhy

Do you have to use xlrd? I just downloaded 'UPDATED - Dow Jones Industrial Average Movers - 2008' from that website and had no trouble reading it with pyExcelerator.

你必须使用xlrd吗?我刚刚从该网站下载了“更新 - 道琼斯工业平均指数 - 2008 年”,使用pyExcelerator阅读它没有任何问题。

import pyExcelerator
book = pyExcelerator.parse_xls('DJIAMovers.xls')