Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/22933158/

Time: 2020-09-13 21:54:15  Source: igfitidea

Applying a function to a MultiIndex pandas.DataFrame column

python pandas apply multi-index

Asked by VGonPa

I have a MultiIndex pandas DataFrame in which I want to apply a function to one of its columns and assign the result to that same column.

In [1]:
    import numpy as np
    import pandas as pd
    cols = ['One', 'Two', 'Three', 'Four', 'Five']
    df = pd.DataFrame(np.array(list('ABCDEFGHIJKLMNO'), dtype='object').reshape(3,5), index = list('ABC'), columns=cols)
    df.to_hdf('/tmp/test.h5', 'df')
    df = pd.read_hdf('/tmp/test.h5', 'df')
    df
Out[1]:
         One     Two     Three  Four    Five
    A    A       B       C      D       E
    B    F       G       H      I       J
    C    K       L       M      N       O
    3 rows × 5 columns

In [2]:
    df.columns = pd.MultiIndex.from_arrays([list('UUULL'), ['One', 'Two', 'Three', 'Four', 'Five']])
    df['L']['Five'] = df['L']['Five'].apply(lambda x: x.lower())
    df
-c:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead 
Out[2]:
         U                      L
         One    Two     Three   Four    Five
    A    A      B       C       D       E
    B    F      G       H       I       J
    C    K      L       M       N       O
    3 rows × 5 columns

In [3]:
    df.columns = ['One', 'Two', 'Three', 'Four', 'Five']
    df    
Out[3]:
         One    Two     Three   Four    Five
    A    A      B       C       D       E
    B    F      G       H       I       J
    C    K      L       M       N       O
    3 rows × 5 columns

In [4]:
    df['Five'] = df['Five'].apply(lambda x: x.upper())
    df
Out[4]:
         One    Two     Three   Four    Five
    A    A      B       C       D       E
    B    F      G       H       I       J
    C    K      L       M       N       O
    3 rows × 5 columns

As you can see, the function is not applied to the column. I guess that is because I get this warning:

-c:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead

What is strange is that this warning only appears sometimes, and I haven't been able to work out when it happens and when it does not.

I managed to apply the function by slicing the DataFrame with .loc, as the warning recommended:

In [5]:
    df.columns = pd.MultiIndex.from_arrays([list('UUULL'), ['One', 'Two', 'Three', 'Four', 'Five']])
    df.loc[:,('L','Five')] = df.loc[:,('L','Five')].apply(lambda x: x.lower())
    df

Out[5]:
         U                      L
         One    Two     Three   Four    Five
    A    A      B       C       D       e
    B    F      G       H       I       j
    C    K      L       M       N       o
    3 rows × 5 columns

But I would like to understand why this behavior happens with dict-like slicing (e.g. df['L']['Five']) and not with .loc slicing.

NOTE: The DataFrame comes from an HDF file which was not multi-indexed. Is this perhaps the cause of the strange behavior?

EDIT: I'm using Pandas v.0.13.1 and NumPy v.1.8.0

Answered by Jeff

df['L']['Five'] selects level 0 with the value 'L', which returns a DataFrame; the column 'Five' is then selected from that DataFrame, returning the accessed Series.

The __getitem__ accessor for a DataFrame (the []) will try to do the right thing, and gives you the correct column. However, this is chained indexing, which the pandas documentation discusses under returning a view versus a copy.

To access a multi-index, use the tuple notation ('a','b') together with .loc, which is unambiguous, e.g. df.loc[:,('a','b')]. Furthermore, this allows indexing on multiple axes at the same time (e.g. rows AND columns).
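
As a rough sketch of the tuple notation, the snippet below rebuilds the frame from the question and selects the ('L', 'Five') column, first for all rows and then for a subset of rows in the same call. The variable names `five` and `subset` are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question with two-level columns.
values = np.array(list('ABCDEFGHIJKLMNO'), dtype='object').reshape(3, 5)
columns = pd.MultiIndex.from_arrays(
    [list('UUULL'), ['One', 'Two', 'Three', 'Four', 'Five']]
)
df = pd.DataFrame(values, index=list('ABC'), columns=columns)

# One column, all rows: the tuple names both column levels at once.
five = df.loc[:, ('L', 'Five')]

# Rows and columns in the same call: rows 'A' and 'B' of that column.
subset = df.loc[['A', 'B'], ('L', 'Five')]
```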

So, why does this not work when you do chained indexing and assignment, e.g. df['L']['Five'] = value?

df['L'] returns a DataFrame that is singly indexed. Then another Python operation, df_with_L['Five'], selects the Series indexed by 'Five' (I use a separate variable, df_with_L, to make the intermediate result explicit). Because pandas sees these operations as separate events (i.e. separate calls to __getitem__), it has to treat them as linear operations that happen one after another.
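
To make the two steps concrete, here is a minimal sketch that spells out the chained lookup with the intermediate variable df_with_L. It only reads data; the frame is rebuilt from the question so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

values = np.array(list('ABCDEFGHIJKLMNO'), dtype='object').reshape(3, 5)
columns = pd.MultiIndex.from_arrays(
    [list('UUULL'), ['One', 'Two', 'Three', 'Four', 'Five']]
)
df = pd.DataFrame(values, index=list('ABC'), columns=columns)

# First __getitem__ call: a new DataFrame whose columns are a plain Index.
df_with_L = df['L']

# Second, independent __getitem__ call on that intermediate frame.
five = df_with_L['Five']

# df['L']['Five'] is just these two calls written on one line.
```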

Contrast this with df.loc[:,('L','Five')], which passes a nested tuple, (:,('L','Five')), to a single call to __getitem__. This allows pandas to deal with it as a single entity (and, FYI, it is quite a bit faster because it can index directly into the frame).
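
The difference in what Python hands to __getitem__ can be seen without pandas at all. The class below, KeyRecorder, is a hypothetical helper written only for this illustration; it returns whatever key it receives.

```python
class KeyRecorder:
    """Hypothetical helper: returns the key Python passes to __getitem__."""

    def __getitem__(self, key):
        return key


rec = KeyRecorder()

# .loc-style access: one call that receives the whole nested tuple.
single_call_key = rec[:, ('L', 'Five')]

# Chained access: the first call only ever sees 'L'; 'Five' would go to a
# second, separate call on whatever the first call returned.
first_key = rec['L']
```

Here single_call_key is the tuple (slice(None, None, None), ('L', 'Five')), while first_key is just 'L', so nothing in the first call knows that 'Five' is coming next.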

Why does this matter? Since chained indexing is 2 calls, either call may return a copy of the data because of the way it is sliced. Thus when setting a value you may actually be setting it on a copy, and not on the original frame. It is impossible for pandas to figure this out because there are 2 separate Python operations that are not connected.

The SettingWithCopy warning is a 'heuristic' to detect this (meaning it tends to catch most cases but is simply a lightweight check). Figuring this out for real is way complicated.

The .loc operation is a single Python operation, and thus can select a slice (which still may be a copy), but it allows pandas to assign that slice back into the frame after it is modified, thus setting the values as you would expect.
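
A small sketch of that read, modify, assign-back pattern follows. It mirrors the .loc assignment from the question, except that the vectorized .str.lower() is swapped in for apply(lambda x: x.lower()); that substitution is my assumption, not part of the original answer.

```python
import numpy as np
import pandas as pd

values = np.array(list('ABCDEFGHIJKLMNO'), dtype='object').reshape(3, 5)
columns = pd.MultiIndex.from_arrays(
    [list('UUULL'), ['One', 'Two', 'Three', 'Four', 'Five']]
)
df = pd.DataFrame(values, index=list('ABC'), columns=columns)

# Read the column, lower-case it, and assign it back through a single
# .loc set operation on the original frame.
df.loc[:, ('L', 'Five')] = df.loc[:, ('L', 'Five')].str.lower()

result = df.loc[:, ('L', 'Five')].tolist()
```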

The reason for the warning is this. Sometimes when you slice an array you will simply get a view back, which means you can set it with no problem. However, even a single-dtyped array can generate a copy if sliced in a particular way. A multi-dtyped DataFrame (meaning it has, say, float and object data) will almost always yield a copy. Whether a view is created depends on the memory layout of the array.
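
The view-versus-copy distinction is easiest to see on a plain NumPy array. This is only an illustration of the general idea on a single-dtyped array, not of what pandas does internally.

```python
import numpy as np

arr = np.arange(10)

basic = arr[2:5]        # basic slicing returns a view on the same memory
fancy = arr[[2, 3, 4]]  # integer-array ("fancy") indexing returns a copy

basic[0] = 100          # writes through to arr
fancy[1] = -1           # changes only the copy

shares_basic = np.shares_memory(arr, basic)
shares_fancy = np.shares_memory(arr, fancy)
```

After this runs, arr[2] has changed because basic is a view, while arr[3] is untouched because fancy is a copy.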

Note: this doesn't have anything to do with the source of the data.
