Original source: http://stackoverflow.com/questions/50472696/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Pandas read excel sheet with multiple header when first column is empty
Asked by Alexandra Espichán
I have an excel sheet like this:
I want to read it with pandas read_excel and I tried this:
df = pd.read_excel("test.xlsx", header=[0,1])
but it throws me this error:
ParserError: Passed header=[0,1] are too many rows for this multi_index of columns
Any suggestions?
Answered by Orenshi
If you don't mind massaging the DataFrame after reading the Excel file, you can try the two approaches below:
>>> pd.read_excel("/tmp/sample.xlsx", usecols = "B:F", skiprows=[0])
header1 Unnamed: 1 Unnamed: 2 header2 Unnamed: 4
0 col1 col2 col3 col4 col5
1 a 0 x 3 d
2 b 1 y 4 e
3 c 2 z 5 f
In the above, you'd have to fix the first level of the MultiIndex yourself, since header1 and header2 are merged cells and only their first column carries the label.
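One way that fix-up might look is sketched below. It is only an illustration, not part of the original answer: the `raw` frame is a hand-built stand-in for the result of the first `read_excel` call, and the names `raw`, `top`, `bottom`, and `fixed` are hypothetical. The idea is to blank out the `Unnamed: N` placeholders, forward-fill the merged labels, and use the first data row as the second header level.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by the first read_excel call.
raw = pd.DataFrame(
    [
        ["col1", "col2", "col3", "col4", "col5"],
        ["a", 0, "x", 3, "d"],
        ["b", 1, "y", 4, "e"],
        ["c", 2, "z", 5, "f"],
    ],
    columns=["header1", "Unnamed: 1", "Unnamed: 2", "header2", "Unnamed: 4"],
)

# Top level: replace the "Unnamed: N" placeholders with NaN, then forward-fill
# so each merged label covers all of its columns.
top = pd.Series(raw.columns)
top = top.mask(top.str.startswith("Unnamed")).ffill()

# Second level: the first data row actually holds the sub-headers.
bottom = raw.iloc[0]

# Drop the sub-header row from the data and attach the two-level columns.
fixed = raw.iloc[1:].reset_index(drop=True)
fixed.columns = pd.MultiIndex.from_arrays([top.to_numpy(), bottom.to_numpy()])

print(fixed)
```

Because the data rows were read alongside the sub-header row, the resulting columns have object dtype; calling `fixed.infer_objects()` afterwards is one option if numeric dtypes matter.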
>>> pd.read_excel("/tmp/sample.xlsx", header=[0,1], usecols = "B:F",
skiprows=[0])
header1 header2
header1 col1 col2 col3 col4
a 0 x 3 d
b 1 y 4 e
c 2 z 5 f
In the above, it got pretty close by skipping the empty row and parsing only the columns (B:F) that contain data. Notice, though, that the columns got shifted...
Note: Not a clean solution, but I just wanted to share samples with you in a post rather than a comment.
-- Edit based on discussion with OP --
Based on the documentation for pandas read_excel, header=[1,2] creates a MultiIndex for your columns. It looks like it determines the row labels for the DataFrame from whatever is populated in column A. Since there's nothing there, the index ends up as a bunch of NaN values, like so:
>>> pd.read_excel("/tmp/sample.xlsx", header=[1,2])
header1 header2
col1 col2 col3 col4 col5
NaN a 0 x 3 d
NaN b 1 y 4 e
NaN c 2 z 5 f
Again if you're okay with cleaning up columns and if the first column of the xlsx is always blank... you can drop it like below. Hopefully this is what you're looking for.
>>> pd.read_excel("/tmp/sample.xlsx", header=[1,2]).reset_index().drop(['index'], level=0, axis=1)
header1 header2
col1 col2 col3 col4 col5
0 a 0 x 3 d
1 b 1 y 4 e
2 c 2 z 5 f
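As a possible simplification (an assumption on my part, not something the answer states), `reset_index(drop=True)` discards the NaN index in a single step instead of turning it into a column and dropping it afterwards. The sketch below builds a hypothetical in-memory frame that mimics the `header=[1,2]` result, so it runs without the Excel file.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking the header=[1,2] result: two-level columns
# and an index made up entirely of NaN (from the empty column A).
columns = pd.MultiIndex.from_tuples(
    [
        ("header1", "col1"),
        ("header1", "col2"),
        ("header1", "col3"),
        ("header2", "col4"),
        ("header2", "col5"),
    ]
)
df = pd.DataFrame(
    [["a", 0, "x", 3, "d"], ["b", 1, "y", 4, "e"], ["c", 2, "z", 5, "f"]],
    index=[np.nan] * 3,
    columns=columns,
)

# drop=True throws the old index away rather than inserting it as a column,
# leaving a default 0..n-1 index and the MultiIndex columns untouched.
clean = df.reset_index(drop=True)

print(clean)
```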
Answered by BallpointBen
Here is the documentation on the header parameter:
Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row positions will be combined into a MultiIndex. Use None if there is no header.
I think the following should work:
df = pd.read_excel("test.xlsx", skiprows=2, usecols='B:F', header=0)