Python: how to convert OpenDocument spreadsheets to a pandas DataFrame?

Disclaimer: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/17834995/

Date: 2020-08-19 09:16:44  Source: igfitidea

How to convert OpenDocument spreadsheets to a pandas DataFrame?

Tags: python, pandas, libreoffice, dataframe, opendocument

Asked by Lamps1829

The Python library pandas can read Excel spreadsheets and convert them to a pandas.DataFrame with the pandas.read_excel(file) command. Under the hood, it uses the xlrd library, which does not support ods files.

Is there an equivalent of pandas.read_excel for ods files? If not, how can I do the same for an Open Document Format spreadsheet (ods file)? ODF is used by LibreOffice and OpenOffice.

Answer by Dr. Jan-Philip Gehrcke

If possible, save the data as CSV from the spreadsheet application and then use pandas.read_csv(). IIRC, an 'ods' spreadsheet file is actually a ZIP archive of XML files, which also contain quite a lot of formatting information. So, if you only care about the tabular data, first extract the raw data to an intermediate file (CSV, in this case), which you can then parse with other programs, such as Python/pandas.
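
A minimal sketch of the CSV route, assuming the sheet has already been exported with a header row. The sample data is made up, and io.StringIO stands in for the exported file so that the snippet is self-contained:

```python
import io

import pandas as pd

# Made-up stand-in for a CSV file exported from the spreadsheet application
csv_text = "a,b,c\n1,2,3\n4,5,6\n"

# read_csv accepts a path or any file-like object;
# by default the first row is used as the header
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

With a real export you would pass the file path instead of the StringIO object.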

Answer by Lamps1829

Another option: read-ods-with-odfpy. This module takes an OpenDocument Spreadsheet as input and returns a list, from which a DataFrame can be created.
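
As a rough illustration, assuming the returned list is a list of rows with the header in the first row (an assumption about the output shape, not something confirmed here), building the DataFrame could look like this. The `rows` data is made up:

```python
import pandas as pd

# Hypothetical output of an ODS reader: a list of rows, header first
rows = [
    ["name", "qty"],
    ["apple", 3],
    ["pear", 5],
]

# Use the first row as column names and the remaining rows as data
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
```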

Answer by Matthias Berth

Pandas supports reading Excel files (both xls and xlsx); see the read_excel command. You can use OpenOffice to save the spreadsheet as xlsx. Apparently, the conversion can also be done automatically on the command line, using the convert-to command line parameter.

Reading the data from xlsx avoids some of the issues (date formats, number formats, Unicode) that you may run into when you convert to CSV first.
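
A sketch of the command-line conversion driven from Python, assuming a LibreOffice installation whose `soffice` binary is on the PATH. The helper names `build_convert_command` and `convert_ods_to_xlsx` and the file names are illustrative, not part of any library:

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(ods_path, out_dir):
    """Build the LibreOffice command line for a headless ODS -> XLSX conversion."""
    return ["soffice", "--headless", "--convert-to", "xlsx",
            "--outdir", str(out_dir), str(ods_path)]


def convert_ods_to_xlsx(ods_path, out_dir="."):
    """Run the conversion and return the expected path of the .xlsx file."""
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found on PATH")
    subprocess.run(build_convert_command(ods_path, out_dir), check=True)
    return Path(out_dir) / (Path(ods_path).stem + ".xlsx")


# Show the command without actually running LibreOffice
print(" ".join(build_convert_command("myODSfile.ods", "/tmp")))
```

The path returned by `convert_ods_to_xlsx` can then be passed to pandas.read_excel.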

Answer by CPBL

It seems the answer is no! I would characterize the tools for reading ODS as still ragged. If you're on POSIX, one option is to export to xlsx on the fly and then use pandas' very nice xlsx import tools:

unoconv -f xlsx -o tmp.xlsx myODSfile.ods 

Altogether, my code looks like:

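import os
import pandas as pd

# fileOlderThan() is an external helper from cpblUtilities.
# Re-export only when tmp.xlsx is missing or older than the .ods source.
if fileOlderThan('tmp.xlsx', 'myODSfile.ods'):
    os.system('unoconv -f xlsx -o tmp.xlsx myODSfile.ods')
# Parse every sheet into its own DataFrame, keyed by sheet name
xl_file = pd.ExcelFile('tmp.xlsx')
dfs = {sheet_name: xl_file.parse(sheet_name)
       for sheet_name in xl_file.sheet_names}
df = dfs['Sheet1']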

Here fileOlderThan() is a function (see http://github.com/cpbl/cpblUtilities) which returns True if tmp.xlsx does not exist or is older than the .ods file.
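
If you don't want to pull in cpblUtilities, a helper with the behaviour described above can be sketched with the standard library. This is not the cpblUtilities implementation; the name `file_older_than` and the demo file names are made up for illustration:

```python
import os
import tempfile


def file_older_than(target, source):
    """Return True if `target` is missing or was modified before `source`."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(target) < os.path.getmtime(source)


# Demo with throwaway files
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "data.ods")
    dst = os.path.join(tmp, "tmp.xlsx")
    open(src, "w").close()
    missing = file_older_than(dst, src)   # target does not exist yet
    open(dst, "w").close()
    os.utime(src, (1000, 1000))           # push the source mtime into the past
    os.utime(dst, (2000, 2000))           # target is now newer than the source
    fresh = file_older_than(dst, src)

print(missing, fresh)
```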

Answer by davidovitch

You can read ODF (Open Document Format .ods) documents in Python using the following modules:

Using ezodf, a simple ODS-to-DataFrame converter could look like this:

import pandas as pd
import ezodf

doc = ezodf.opendoc('some_odf_spreadsheet.ods')

print("Spreadsheet contains %d sheet(s)." % len(doc.sheets))
for sheet in doc.sheets:
    print("-"*40)
    print("   Sheet name : '%s'" % sheet.name)
    print("Size of Sheet : (rows=%d, cols=%d)" % (sheet.nrows(), sheet.ncols()) )

# convert the first sheet to a pandas.DataFrame
sheet = doc.sheets[0]
df_dict = {}
for i, row in enumerate(sheet.rows()):
    # row is a list of cells
    # assume the header is on the first row
    if i == 0:
        # columns as lists in a dictionary
        df_dict = {cell.value:[] for cell in row}
        # create index for the column headers
        col_index = {j:cell.value for j, cell in enumerate(row)}
        continue
    for j, cell in enumerate(row):
        # use header instead of column index
        df_dict[col_index[j]].append(cell.value)
# and convert to a DataFrame
df = pd.DataFrame(df_dict)

P.S.

  • ODF spreadsheet (*.ods files) support has been requested on the pandas issue tracker: https://github.com/pydata/pandas/issues/2311, but at the time of this answer it had not been implemented.

  • ezodf was used in the unfinished PR 9070 to implement ODF support in pandas. That PR is now closed (read the PR for a technical discussion), but it is still available as an experimental feature in this pandas fork.

  • There are also some brute-force methods that read directly from the XML code (here).
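
To give an idea of what reading directly from the XML can involve, here is a standard-library sketch. It is not the method linked above: it assumes a regular .ods package (a ZIP archive containing content.xml), returns every cell as displayed text, and ignores repeated rows and merged cells. The function name `read_ods_rows` and the in-memory test document are made up for illustration:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}
ROW_TAG = "{%s}table-row" % NS["table"]
REPEAT_ATTR = "{%s}number-columns-repeated" % NS["table"]


def read_ods_rows(file, sheet_index=0):
    """Return the displayed text of one sheet as a list of rows (lists of str)."""
    with zipfile.ZipFile(file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    table = root.findall(".//table:table", NS)[sheet_index]
    rows = []
    for row in table.iter(ROW_TAG):
        values = []
        for cell in row.findall("table:table-cell", NS):
            # A cell may hold several paragraphs; join them with newlines
            text = "\n".join("".join(p.itertext())
                             for p in cell.findall("text:p", NS))
            values.extend([text] * int(cell.get(REPEAT_ATTR, "1")))
        # Drop the run of empty cells that often pads the end of a row
        while values and values[-1] == "":
            values.pop()
        rows.append(values)
    return rows


# Build a tiny made-up .ods package in memory to exercise the function
CONTENT = (
    '<office:document-content'
    ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    ' xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"'
    ' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    '<office:body><office:spreadsheet><table:table table:name="Sheet1">'
    '<table:table-row>'
    '<table:table-cell><text:p>a</text:p></table:table-cell>'
    '<table:table-cell><text:p>b</text:p></table:table-cell>'
    '</table:table-row>'
    '<table:table-row>'
    '<table:table-cell table:number-columns-repeated="2">'
    '<text:p>1</text:p></table:table-cell>'
    '</table:table-row>'
    '</table:table></office:spreadsheet></office:body>'
    '</office:document-content>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content.xml", CONTENT)
buffer.seek(0)

rows = read_ods_rows(buffer)
print(rows)
```

The resulting list of rows holds strings only, so any type conversion is left to the caller before or after building a DataFrame from it.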

Answer by MaxU

Here is a quick and dirty hack which uses the ezodf module:

import pandas as pd
import ezodf

def read_ods(filename, sheet_no=0, header=0):
    tab = ezodf.opendoc(filename=filename).sheets[sheet_no]
    return pd.DataFrame({col[header].value:[x.value for x in col[header+1:]]
                         for col in tab.columns()})

Test:

In [92]: df = read_ods(filename='fn.ods')

In [93]: df
Out[93]:
     a    b    c
0  1.0  2.0  3.0
1  4.0  5.0  6.0
2  7.0  8.0  9.0

NOTES:

  • all other useful parameters, such as header, skiprows, index_col, and parse_cols, are NOT implemented in this function - feel free to update this question if you want to implement them
  • ezodf depends on lxml, so make sure you have it installed

Answer by wordsforthewise

If you only have a few .ods files to read, I would just open each one in OpenOffice and save it as an Excel file. If you have a lot of files, you could use the unoconv command on Linux to convert the .ods files to .xls programmatically (with bash).

Then it's really easy to read it in with pd.read_excel('filename.xls')

Answer by Mike Adrion

I've had good luck with pandas' read_clipboard. Select the cells and copy them from Excel or an OpenDocument spreadsheet, then run the following in Python.

import pandas as pd
data = pd.read_clipboard()

Pandas will do a good job based on the cells copied.

Answer by iuvbio

Based heavily on the answer by davidovitch (thank you), I have put together a package that reads in a .ods file and returns a DataFrame. It's not a full implementation in pandas itself, like his PR, but it provides a simple read_ods function that does the job.

You can install it with pip install pandas_ods_reader. It's also possible to specify whether the file contains a header row or not, and to specify custom column names.

Answer by Will Ayd

This is available natively as of pandas 0.25. As long as you have odfpy installed, you can do:

pd.read_excel("the_document.ods", engine="odf")