How to read merged Excel cells with NaN into Pandas DataFrame

Note: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors. Original question: http://stackoverflow.com/questions/47834025/



Tags: python, excel, python-3.x, pandas

Asked by CPU

I would like to read an Excel sheet into a Pandas DataFrame. However, the sheet contains merged cells as well as null rows (fully or partially NaN-filled), as shown below. To clarify, John H. has placed an order for all the albums from "The Bodyguard" to "Red Pill Blues".

[Image: Excel sheet capture]

When I read this Excel sheet into a Pandas DataFrame, the data does not transfer correctly. Pandas treats a merged cell as a single cell, so only its first row receives the value. The DataFrame looks like the following (note: values in parentheses are the values I would like to have there):

[Image: DataFrame capture]

Please note that the last row does not contain merged cells; it only carries a value for the Artist column.



EDIT: I did try the following to forward-fill the NaN values (taken from "Pandas: Reading Excel with merged cells"):

df.index = pd.Series(df.index).fillna(method='ffill')  

However, the NaN values remain. What strategy or method could I use to populate the DataFrame correctly? Is there a Pandas method for unmerging the cells and duplicating the corresponding contents?
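One likely reason the attempt above changes nothing: it forward-fills the index, while the NaN values sit in ordinary columns. The sketch below illustrates this with a small hypothetical DataFrame standing in for the sheet (the data and the name nan_count are made up for illustration, and ffill() is used in place of fillna(method='ffill')).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sheet: merged cells arrive as NaN in
# ordinary columns, while the default index 0, 1, 2 has no gaps.
df = pd.DataFrame({
    "Order_ID": [1.0, np.nan, np.nan],
    "Album_Name": ["The Bodyguard", "Lemonade", "Thriller"],
})

# Forward-filling the index has nothing to fill, so the columns are untouched.
df.index = pd.Series(df.index).ffill()

# Count the NaN values still left in the Order_ID column.
nan_count = int(df["Order_ID"].isna().sum())
print(nan_count)
```

The column still holds its NaN values afterwards, which matches what the question describes.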

Answered by Parfait

The linked answer you tried only needed to forward-fill the index column. For your use case, you need to fill the NaN values in all DataFrame columns. So simply forward-fill the entire DataFrame:

import pandas as pd

df = pd.read_excel("Input.xlsx")
print(df)

#    Order_ID Customer_name            Album_Name           Artist  Quantity
# 0       NaN           NaN            RadioShake              NaN       NaN
# 1       1.0       John H.         The Bodyguard  Whitney Houston       2.0
# 2       NaN           NaN              Lemonade          Beyonce       1.0
# 3       NaN           NaN  The Thrill Of It All        Sam Smith       2.0
# 4       NaN           NaN              Thriller  Michael Jackson      11.0
# 5       NaN           NaN                Divide       Ed Sheeran       4.0
# 6       NaN           NaN            Reputation     Taylor Swift       3.0
# 7       NaN           NaN        Red Pill Blues         Maroon 5       5.0

# Forward-fill every column; equivalent to fillna(method='ffill'),
# which newer pandas versions deprecate.
df = df.ffill()
print(df)

#    Order_ID Customer_name            Album_Name           Artist  Quantity
# 0       NaN           NaN            RadioShake              NaN       NaN
# 1       1.0       John H.         The Bodyguard  Whitney Houston       2.0
# 2       1.0       John H.              Lemonade          Beyonce       1.0
# 3       1.0       John H.  The Thrill Of It All        Sam Smith       2.0
# 4       1.0       John H.              Thriller  Michael Jackson      11.0
# 5       1.0       John H.                Divide       Ed Sheeran       4.0
# 6       1.0       John H.            Reputation     Taylor Swift       3.0
# 7       1.0       John H.        Red Pill Blues         Maroon 5       5.0
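Forward-filling the whole DataFrame also fills gaps that never came from merged cells. As a possible refinement, the sketch below restricts the fill to the columns that were merged in Excel. The sample data is hypothetical, and the choice of Order_ID and Customer_name as the merged columns is an assumption based on the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the layout above; the last row has a
# genuinely missing album name that should not be filled.
df = pd.DataFrame({
    "Order_ID": [np.nan, 1.0, np.nan, np.nan],
    "Customer_name": [np.nan, "John H.", np.nan, np.nan],
    "Album_Name": ["RadioShake", "The Bodyguard", "Lemonade", np.nan],
    "Artist": [np.nan, "Whitney Houston", "Beyonce", "Maroon 5"],
})

# Assumption: only these columns were merged in the Excel sheet.
merged_cols = ["Order_ID", "Customer_name"]

# Forward-fill just those columns and leave every other column as read.
df[merged_cols] = df[merged_cols].ffill()
print(df)
```

With this variant the missing album name in the last row stays NaN, while the order and customer values are carried down.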

Answered by Manuel

Using a conditional:

import pandas as pd

df_excel = pd.ExcelFile('Sales.xlsx')
df = df_excel.parse('Info')

for col in list(df):  # every column
    pprow = 0  # row above the one being examined
    prow = 1   # row being examined (and possibly filled)
    for row in df[1:].iterrows():  # row[0] is the index of the look-ahead row
        # Fill only when the row has an album name and both this cell and the
        # look-ahead cell are empty; the value is copied from the row above.
        # Rows without an album name are left untouched.
        if (pd.notnull(df.loc[prow, 'Album Name'])
                and pd.isnull(df.loc[prow, col])
                and pd.isnull(df.loc[row[0], col])):
            df.loc[prow, col] = df.loc[pprow, col]
        pprow = prow
        prow = row[0]

Output (I use different names):

    Order_ID Customer_name    Album Name
0        NaN           NaN         Radio
1        1.0          John             a
2        1.0          John             b
3        1.0          John             c
4        1.0          John             d
5        1.0          John             e
6        1.0          John             f
7        NaN           NaN            GE
8        2.0         Harry   We are Born
9        3.0         Lizzy       Relapse
10       4.0           Abe         Smoke
11       4.0           Abe       Tell me
12       NaN           NaN           NaN
13       NaN           NaN      Best Buy
14       5.0        Kristy      The wall
15       6.0         Sammy  Kind of blue
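The same rule (fill a cell only when it and the cell below are both empty, and the row has an album name) can also be approximated without an explicit row loop. This is a sketch on hypothetical data, not a line-for-line equivalent of the loop: ffill() takes the last non-empty value above, which can differ from the loop when empty header rows sit inside a run. The names has_album, next_empty and fill_here are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: "Radio" and "GE" are header rows with no order data.
df = pd.DataFrame({
    "Order_ID": [np.nan, 1.0, np.nan, np.nan, np.nan, 2.0],
    "Customer_name": [np.nan, "John", np.nan, np.nan, np.nan, "Harry"],
    "Album Name": ["Radio", "a", "b", "c", "GE", "We are Born"],
})

has_album = df["Album Name"].notna()

for col in ["Order_ID", "Customer_name"]:
    empty = df[col].isna()
    # True where the cell directly below is empty too; the last row has no
    # cell below, so it is treated as "not empty" and is never filled.
    next_empty = empty.shift(-1, fill_value=False)
    fill_here = has_album & empty & next_empty
    # Keep the original value unless the rule says to fill from above.
    df[col] = df[col].where(~fill_here, df[col].ffill())

print(df)
```

In this sample the rows for "b" and "c" inherit the order from "a", while the "GE" header row keeps its NaN values because the row below it starts a new order.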