Python: Efficiently replace values in one column with values from another column in a Pandas DataFrame

Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/39903090/

Time: 2020-08-19 22:53:04  Source: igfitidea

Efficiently replace values from a column to another column Pandas DataFrame

python pandas replace dataframe

Asked by Pablo

I have a Pandas DataFrame like this:

   col1 col2 col3
1   0.2  0.3  0.3
2   0.2  0.3  0.3
3     0  0.4  0.4
4     0    0  0.3
5     0    0    0
6   0.1  0.4  0.4

I want to replace the values in col1 with the values from the second column (col2) only where col1 is equal to 0, and then, for the zeros that remain, do the same with the third column (col3). The desired result is:

   col1 col2 col3
1   0.2  0.3  0.3
2   0.2  0.3  0.3
3   0.4  0.4  0.4
4   0.3    0  0.3
5     0    0    0
6   0.1  0.4  0.4

I did it using the pd.replace function, but it seems too slow. I think there must be a faster way to accomplish this.

df.col1.replace(0,df.col2,inplace=True)
df.col1.replace(0,df.col3,inplace=True)

Is there a faster way to do that, using some other function instead of the pd.replace function?

Answered by root

Using np.where is faster. Using a similar pattern to the one you used with replace:

df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])

However, using a nested np.where is slightly faster:

df['col1'] = np.where(df['col1'] == 0, 
                      np.where(df['col2'] == 0, df['col3'], df['col2']),
                      df['col1'])
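
As a minimal, self-contained sketch of the nested np.where pattern above, the following rebuilds the 6-row sample from the question and applies the fallback in one pass. The intermediate name `fallback` and the `result` list are only illustrative and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data from the question (index 1..6).
df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    },
    index=range(1, 7),
)

# Inner step: use col2, unless col2 is 0, in which case use col3.
fallback = np.where(df["col2"] == 0, df["col3"], df["col2"])

# Outer step: keep col1 where it is non-zero, otherwise take the fallback.
df["col1"] = np.where(df["col1"] == 0, fallback, df["col1"])

result = df["col1"].tolist()
print(df)
```

Row 5 stays at 0 because all three columns are 0 there, which matches the desired result in the question.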

Timings

Using the following setup to produce a larger sample DataFrame and timing functions:

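import numpy as np
import pandas as pd

# df is the 6-row sample DataFrame from the question, repeated 10**4 times.
df = pd.concat([df]*10**4, ignore_index=True)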

def root_nested(df):
    df['col1'] = np.where(df['col1'] == 0, np.where(df['col2'] == 0, df['col3'], df['col2']), df['col1'])
    return df

def root_split(df):
    df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
    df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])
    return df

def pir2(df):
    df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
    return df

def pir2_2(df):
    slc = (df.values != 0).argmax(axis=1)
    return df.values[np.arange(slc.shape[0]), slc]

def andrew(df):
    df.col1[df.col1 == 0] = df.col2
    df.col1[df.col1 == 0] = df.col3
    return df

def pablo(df):
    df['col1'] = df['col1'].replace(0,df['col2'])
    df['col1'] = df['col1'].replace(0,df['col3'])
    return df
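
The timings in this answer use IPython's %timeit. Outside IPython, a rough equivalent can be sketched with the standard-library timeit module. This is only an illustration under assumptions: the names `small`, `big`, `nested`, `split`, and `same` are hypothetical, and absolute numbers will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Rebuild the 6-row sample and enlarge it as in the setup above.
small = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)
big = pd.concat([small] * 10**4, ignore_index=True)


def nested(df):
    # Single pass: col1, else col2, else col3.
    inner = np.where(df["col2"] == 0, df["col3"], df["col2"])
    df["col1"] = np.where(df["col1"] == 0, inner, df["col1"])
    return df


def split(df):
    # Two passes, mirroring the two replace calls in the question.
    df["col1"] = np.where(df["col1"] == 0, df["col2"], df["col1"])
    df["col1"] = np.where(df["col1"] == 0, df["col3"], df["col1"])
    return df


# Check that both variants agree before comparing their speed.
same = nested(big.copy())["col1"].equals(split(big.copy())["col1"])
print("results match:", same)

for fn in (nested, split):
    # Best of 3 repeats, 10 calls each; each call works on a fresh copy.
    best = min(timeit.repeat(lambda: fn(big.copy()), number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```

Note that each timed call includes the cost of `big.copy()`, just as the %timeit calls below include `df.copy()`.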

I get the following timings:

%timeit root_nested(df.copy())
100 loops, best of 3: 2.25 ms per loop

%timeit root_split(df.copy())
100 loops, best of 3: 2.62 ms per loop

%timeit pir2(df.copy())
100 loops, best of 3: 6.25 ms per loop

%timeit pir2_2(df.copy())
1 loop, best of 3: 2.4 ms per loop

%timeit andrew(df.copy())
100 loops, best of 3: 8.55 ms per loop

I tried timing your method, but it ran for multiple minutes without completing. As a comparison, timing your method on just the 6-row example DataFrame (not the much larger one tested above) took 12.8 ms.

Answered by Andrew

I'm not sure if it's faster, but you're right that you can slice the DataFrame to get your desired result.

df.col1[df.col1 == 0] = df.col2
df.col1[df.col1 == 0] = df.col3
print(df)

Output:

   col1  col2  col3
0   0.2   0.3   0.3
1   0.2   0.3   0.3
2   0.4   0.4   0.4
3   0.3   0.0   0.3
4   0.0   0.0   0.0
5   0.1   0.4   0.4
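
The chained form `df.col1[mask] = ...` can trigger a SettingWithCopyWarning, and in recent pandas versions with Copy-on-Write it may not modify `df` at all. A sketch of the same idea using `.loc`, which assigns directly on the DataFrame, could look like this. It is a variation on the approach above, not the original author's code, and the name `mask` is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)

# First pass: rows where col1 is 0 take their value from col2.
mask = df["col1"] == 0
df.loc[mask, "col1"] = df.loc[mask, "col2"]

# Second pass: rows that are still 0 take their value from col3.
mask = df["col1"] == 0
df.loc[mask, "col1"] = df.loc[mask, "col3"]

print(df)
```

The mask is recomputed before the second pass so that only the zeros remaining after the first pass are touched.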

Alternatively, if you want it to be more terse (though I don't know if it's faster), you can combine what you did with what I did.

df.col1[df.col1 == 0] = df.col2.replace(0, df.col3)
print(df)

Output:

   col1  col2  col3
0   0.2   0.3   0.3
1   0.2   0.3   0.3
2   0.4   0.4   0.4
3   0.3   0.0   0.3
4   0.0   0.0   0.0
5   0.1   0.4   0.4

Answered by piRSquared

An approach using pd.DataFrame.where and pd.DataFrame.bfill:

df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
df
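
To make the one-liner above easier to follow, here is a step-by-step sketch on the question's sample data. The intermediate names `masked` and `filled` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)

# Step 1: turn every 0 into NaN so that zeros count as "missing".
masked = df.where(df.ne(0))

# Step 2: back-fill along each row, so a missing col1 takes the next
# non-missing value to its right (col2, then col3).
filled = masked.bfill(axis=1)

# Step 3: rows where all three columns were 0 are still NaN; restore 0.
df["col1"] = filled["col1"].fillna(0)

print(df)
```

This relies on the column order (col1, col2, col3), since the back-fill pulls values from the columns to the right.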

Another approach using np.argmax

def pir2(df):
    slc = (df.values != 0).argmax(axis=1)
    return df.values[np.arange(slc.shape[0]), slc]

I know there is a better way to use numpy to slice. I just can't think of it at the moment.
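
One possible alternative for the indexing step, offered as a sketch rather than as the approach the author had in mind, is np.take_along_axis (available in NumPy 1.15 and later). The names `values`, `first_nonzero`, and `picked` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)

values = df.to_numpy()

# Column position of the first non-zero entry in each row.
# For an all-zero row, argmax returns 0, so the row keeps its 0.
first_nonzero = (values != 0).argmax(axis=1)

# Pick one element per row at the computed column position.
picked = np.take_along_axis(values, first_nonzero[:, None], axis=1).ravel()

df["col1"] = picked
print(df)
```

Like the argmax version above, this assumes all columns share one numeric dtype so that `to_numpy()` returns a plain 2-D float array.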