Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/50472696/

Pandas read excel sheet with multiple header when first column is empty

Tags: python, excel, pandas

Asked by Alexandra Espichán

I have an excel sheet like this:

[Image: Excel sheet]

I want to read it with pandas read_excel and I tried this:

df = pd.read_excel("test.xlsx", header=[0,1])

but it raises this error:

ParserError: Passed header=[0,1] are too many rows for this multi_index of columns

Any suggestions?

Answered by Orenshi

If you don't mind massaging the DataFrame after reading the Excel file, you can try the two approaches below:

>>> pd.read_excel("/tmp/sample.xlsx", usecols = "B:F", skiprows=[0])
  header1 Unnamed: 1 Unnamed: 2 header2 Unnamed: 4
0    col1       col2       col3    col4       col5
1       a          0          x       3          d
2       b          1          y       4          e
3       c          2          z       5          f

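In the output above, you would still have to fix up the header. header1 and header2 are merged cells, so the other cells in that row come through as Unnamed: N, and the second header row (col1 to col5) is read as the first data row.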
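
One way to do that fix-up, as a minimal sketch. The frame below is typed in by hand to mirror the output above so that it runs without the workbook, and the names raw, top, bottom and fixed are only illustrative. It assumes the placeholder labels always start with "Unnamed".

```python
import pandas as pd

# Hand-built stand-in for the frame printed above (not read from the workbook).
raw = pd.DataFrame(
    [
        ["col1", "col2", "col3", "col4", "col5"],
        ["a", 0, "x", 3, "d"],
        ["b", 1, "y", 4, "e"],
        ["c", 2, "z", 5, "f"],
    ],
    columns=["header1", "Unnamed: 1", "Unnamed: 2", "header2", "Unnamed: 4"],
)

# Blank out the "Unnamed: N" placeholders, then forward-fill the real labels
# so every column under a merged cell gets that cell's label.
top = pd.Series(raw.columns)
top = top.where(~top.str.startswith("Unnamed")).ffill().tolist()

# The first data row actually holds the second header level.
bottom = raw.iloc[0].tolist()

# Drop that row from the data and attach the two-level header.
fixed = raw.iloc[1:].reset_index(drop=True)
fixed.columns = pd.MultiIndex.from_arrays([top, bottom])
```

Forward-filling works here because a merged cell only reports its label in its left-most column.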

>>> pd.read_excel("/tmp/sample.xlsx", header=[0,1], usecols="B:F", skiprows=[0])
        header1      header2
header1    col1 col2    col3 col4
a             0    x       3    d
b             1    y       4    e
c             2    z       5    f

This gets pretty close by skipping the empty row and parsing only the columns that contain data (B:F). Notice, though, that the columns got shifted: the first data column (a, b, c) ended up as the index, so the remaining labels are off by one.

Note: this is not a clean solution. I just wanted to share the samples with you in a post rather than a comment.

-- Edit based on discussion with OP --

Based on the documentation for pandas read_excel, header=[1,2] creates a MultiIndex for your columns. It looks like it determines the row labels of the DataFrame from whatever is populated in column A. Since there is nothing there, the index ends up full of NaN, like so:

>>> pd.read_excel("/tmp/sample.xlsx", header=[1,2])
    header1           header2
       col1 col2 col3    col4 col5
NaN       a    0    x       3    d
NaN       b    1    y       4    e
NaN       c    2    z       5    f

Again, if you're okay with cleaning up the columns and the first column of the xlsx is always blank, you can drop it as shown below. Hopefully this is what you're looking for.

>>> pd.read_excel("/tmp/sample.xlsx", header=[1,2]).reset_index().drop(['index'], level=0, axis=1)
  header1           header2
     col1 col2 col3    col4 col5
0       a    0    x       3    d
1       b    1    y       4    e
2       c    2    z       5    f
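
If the NaN index carries no information, reset_index(drop=True) should give the same result in one step. Here is a small sketch on a hand-built frame that mimics the header=[1,2] output above. The frame and the names df and cleaned are assumptions for illustration, not something read from the workbook.

```python
import numpy as np
import pandas as pd

# Two-level column header like the one read_excel produced above.
columns = pd.MultiIndex.from_tuples(
    [
        ("header1", "col1"),
        ("header1", "col2"),
        ("header1", "col3"),
        ("header2", "col4"),
        ("header2", "col5"),
    ]
)

# Hand-built stand-in: the all-NaN index plays the role of the empty column A.
df = pd.DataFrame(
    [["a", 0, "x", 3, "d"], ["b", 1, "y", 4, "e"], ["c", 2, "z", 5, "f"]],
    index=[np.nan, np.nan, np.nan],
    columns=columns,
)

# drop=True discards the old index instead of inserting it as a column,
# so there is nothing left to drop afterwards.
cleaned = df.reset_index(drop=True)
```

The column MultiIndex is left untouched; only the row labels are replaced by 0, 1, 2.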

Answered by BallpointBen

Here is the documentation on the header parameter:

Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row positions will be combined into a MultiIndex. Use None if there is no header.
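
To see the list form in isolation, here is a small sketch that uses read_csv on an in-memory string instead of the workbook, since read_csv takes a header argument with the same meaning. The CSV text is made up to resemble the sheet, with the merged labels repeated by hand, so treat it as an illustration rather than a reproduction of the Excel case.

```python
import io

import pandas as pd

# Made-up data shaped like the sheet: two header rows, then three data rows.
csv_text = (
    "header1,header1,header1,header2,header2\n"
    "col1,col2,col3,col4,col5\n"
    "a,0,x,3,d\n"
    "b,1,y,4,e\n"
    "c,2,z,5,f\n"
)

# A list of row positions combines those rows into a MultiIndex of columns.
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# Columns are then addressed by (first level, second level) tuples.
col4 = df[("header2", "col4")]
```

Both header rows exist and are fully populated here, which is exactly what the sheet in the question lacks in its first row and first column.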

I think the following should work:

df = pd.read_excel("test.xlsx", skiprows=2, usecols='B:F', header=0)