Pandas: Parsing merged header columns from Excel

Question

Asked by Samarth Bharadwaj

The data in excel sheets is stored as follows:

   Area     |          Product1     |      Product2        |      Product3
            |      sales|sales.Value|   sales |sales.Value |  sales |sales.Value
  Location1 |    20     | 20000     |      25 |  10000     |   200  | 100
  Location2 |    30     | 30000     |      3  | 12300      |   213  | 10

Each product name is a merged cell spanning the two sub-columns in the second header row, "no of sales" and "sales value", and there is one data row for each of 1000 or so areas in a given month. Similarly, there are separate files for each month of the last 5 years. In addition, products have been added and removed in different months, so a different month's file might look like:

   Area     |          Product1     |      Product4        |      Product3

Can the forum suggest the best way to read this data using pandas? I can't rely on column positions (index), since the product columns are different each month.

Ideally, I would like to convert the initial format above to:

 Area      | Product1.sales|Product1.sales.Value| Product2.sales |Product2.sales.Value | 
 Location1 | 20            | 20000              | 25             | 10000               |  
 Location2 | 30            | 30000              | 3              | 12300               |

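import pandas as pd

# The first two rows of the sheet are always blank, so skip them.
# header=None keeps the two header rows as data, matching the output shown here.
xl_file = pd.read_excel("file path", skiprows=2, sheet_name=0, header=None)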


                  0            1        2               3                      4
      0          NaN          NaN      NaN       Auto loan                    NaN
      1  Branch Code  Branch Name   Region  No of accounts  Portfolio Outstanding
      2         3000       Name1  Central               0                      0
      3         3001       Name2  Central               0                      0

I want to convert it so that Auto loan.No of accounts and Auto loan.Portfolio Outstanding are the headers.

Answer 1

Answered by unutbu

Suppose your DataFrame is df:

import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([
    (nan, nan, nan, 'Auto loan', nan),
    ('Branch Code', 'Branch Name', 'Region', 'No of accounts',
     'Portfolio Outstanding'),
    (3000, 'Name1', 'Central', 0, 0),
    (3001, 'Name2', 'Central', 0, 0),
])

so that it looks like this:

             0            1        2               3                      4
0          NaN          NaN      NaN       Auto loan                    NaN
1  Branch Code  Branch Name   Region  No of accounts  Portfolio Outstanding
2         3000       Name1  Central               0                      0
3         3001       Name2  Central               0                      0

First, forward-fill the NaNs in the first two rows from left to right (thus propagating 'Auto loan' across its merged cell, for example).

df.iloc[0:2] = df.iloc[0:2].ffill(axis=1)

Next fill in the remaining NaNs with empty strings:

df.iloc[0:2] = df.iloc[0:2].fillna('')

Now join the two rows together with '.' and assign the result as the column labels:

df.columns = df.iloc[0:2].apply(lambda x: '.'.join([y for y in x if y]), axis=0)

And finally, remove the first two rows:

df = df.iloc[2:]

This yields

  Branch Code Branch Name   Region Auto loan.No of accounts  \
2        3000      Name1  Central                        0   
3        3001      Name2  Central                        0   

  Auto loan.Portfolio Outstanding  
2                               0  
3                               0
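
The steps above can be collected into one helper so that the same logic runs on every monthly file. This is only a minimal sketch, not part of the original answer: the name `flatten_header` and its `n_header_rows` and `sep` parameters are hypothetical, and it assumes the raw frame was read without a header row so that the header rows are still in the data. Unlike the steps above, it resets the row index to start at 0.

```python
import numpy as np
import pandas as pd


def flatten_header(raw, n_header_rows=2, sep="."):
    """Merge the first rows of `raw` into flat column names (hypothetical helper).

    Assumes a merged cell shows up as one value followed by NaNs to its right.
    """
    header = raw.iloc[:n_header_rows]
    # Propagate merged-cell labels to the right, then blank out what is left.
    header = header.ffill(axis=1).fillna("")
    # Join the non-empty labels of each column, top row first.
    names = [sep.join(str(v) for v in header[col] if v != "")
             for col in header.columns]
    # Drop the header rows from the data and renumber the remaining rows.
    out = raw.iloc[n_header_rows:].reset_index(drop=True)
    out.columns = names
    return out


raw = pd.DataFrame([
    (np.nan, np.nan, np.nan, "Auto loan", np.nan),
    ("Branch Code", "Branch Name", "Region", "No of accounts",
     "Portfolio Outstanding"),
    (3000, "Name1", "Central", 0, 0),
    (3001, "Name2", "Central", 0, 0),
])
flat = flatten_header(raw)
print(list(flat.columns))
```

Because the columns are then addressed by name, it no longer matters which position a product occupies in a given month's sheet.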

Alternatively, you could create a MultiIndex column instead of creating a flat column index:

import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([
    (nan, nan, nan, 'Auto loan', nan),
    ('Branch Code', 'Branch Name', 'Region', 'No of accounts',
     'Portfolio Outstanding'),
    (3000, 'Name1', 'Central', 0, 0),
    (3001, 'Name2', 'Central', 0, 0),
])
df.iloc[0:2] = df.iloc[0:2].ffill(axis=1)
df.iloc[0:2] = df.iloc[0:2].fillna('Area')

df.columns = pd.MultiIndex.from_tuples(
    zip(*df.iloc[0:2].to_records(index=False).tolist()))
df = df.iloc[2:]

Now df looks like this:

         Area                           Auto loan                      
  Branch Code Branch Name   Region No of accounts Portfolio Outstanding
2        3000      Name1  Central              0                     0
3        3001      Name2  Central              0                     0

The column index is a MultiIndex:

In [275]: df.columns
Out[275]: 
MultiIndex(levels=[[u'Area', u'Auto loan'], [u'Branch Code', u'Branch Name', u'No of accounts', u'Portfolio Outstanding', u'Region']],
           labels=[[0, 0, 0, 1, 1], [0, 1, 4, 2, 3]])

The column has two levels. The first level has values [u'Area', u'Auto loan'], the second has values [u'Branch Code', u'Branch Name', u'No of accounts', u'Portfolio Outstanding', u'Region'].
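
To check which values each level holds without reading the full repr, the column index can be queried directly with `nlevels` and `get_level_values`. In the sketch below the frame is rebuilt by hand with `MultiIndex.from_arrays` purely so that the snippet runs on its own; that construction is an assumption for illustration, not the answer's code.

```python
import pandas as pd

# Hand-built copy of the two-level column index described in the answer.
columns = pd.MultiIndex.from_arrays([
    ["Area", "Area", "Area", "Auto loan", "Auto loan"],
    ["Branch Code", "Branch Name", "Region", "No of accounts",
     "Portfolio Outstanding"],
])
df = pd.DataFrame(
    [(3000, "Name1", "Central", 0, 0), (3001, "Name2", "Central", 0, 0)],
    columns=columns,
)

# Number of levels in the column index.
print(df.columns.nlevels)

# Distinct labels of the first level, in order of appearance.
top = df.columns.get_level_values(0).unique().tolist()
print(top)

# Second-level labels that sit under 'Auto loan'.
second_under_auto = df["Auto loan"].columns.tolist()
print(second_under_auto)
```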

You can then access a column by specifying the value from both levels:

print(df.loc[:, ('Area', 'Branch Name')])
# 2    Name1
# 3    Name2
# Name: (Area, Branch Name), dtype: object

print(df.loc[:, ('Auto loan', 'No of accounts')])
# 2    0
# 3    0
# Name: (Auto loan, No of accounts), dtype: object

One advantage of using a MultiIndex is that you can easily select all columns which have a certain level value. For instance, to select the sub-DataFrame having to do with Auto loans, you could use:

In [279]: df.loc[:, 'Auto loan']
Out[279]: 
  No of accounts Portfolio Outstanding
2              0                     0
3              0                     0
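
The same idea works in the other direction: `DataFrame.xs` with `axis=1` and `level=1` picks one second-level label across every first-level group. The sketch below applies this to the product layout from the question. The frame is constructed by hand, with values copied from the question's table, rather than read from Excel, so the construction itself is an assumption for illustration.

```python
import pandas as pd

# Two-level columns: product on level 0, measure on level 1.
columns = pd.MultiIndex.from_tuples([
    ("Product1", "sales"), ("Product1", "sales.Value"),
    ("Product2", "sales"), ("Product2", "sales.Value"),
])
df = pd.DataFrame(
    [(20, 20000, 25, 10000), (30, 30000, 3, 12300)],
    index=pd.Index(["Location1", "Location2"], name="Area"),
    columns=columns,
)

# Select the "sales" column of every product; xs drops the measure level,
# leaving one column per product.
sales = df.xs("sales", axis=1, level=1)
print(sales)

# Equivalent selection with a slicer; this keeps both column levels.
sales_both_levels = df.loc[:, pd.IndexSlice[:, "sales"]]
print(sales_both_levels.columns.tolist())
```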

For more on selecting rows and columns from a MultiIndex, see MultiIndexing Using Slicers.
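
For the monthly files in the question, where products come and go, one option that the answer above does not cover is to give each month's frame named columns and let `pd.concat` align them by name; a product missing from a given month then shows up as NaN. This is a sketch under assumptions: the month labels and the second month's numbers are made up for illustration, and the flat `Product.measure` column names are taken to have been produced already.

```python
import pandas as pd

# Two hypothetical months with flat "Product.measure" column names.
jan = pd.DataFrame({
    "Area": ["Location1", "Location2"],
    "Product1.sales": [20, 30],
    "Product2.sales": [25, 3],
})
feb = pd.DataFrame({  # made-up values; Product4 replaces Product2
    "Area": ["Location1", "Location2"],
    "Product1.sales": [22, 31],
    "Product4.sales": [7, 9],
})

# keys= labels each block; concat aligns columns by name, not by position.
combined = pd.concat([jan, feb], keys=["2015-01", "2015-02"],
                     names=["month", "row"])
# Turn the month label into an ordinary column and renumber the rows.
combined = combined.reset_index(level="month").reset_index(drop=True)
print(combined)
```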
