
Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/43177685/

Time: 2020-09-14 03:20:02  Source: igfitidea

How to implement sql coalesce in pandas

python, pandas

Asked by Anoop

I have a data frame like this:

df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
     A     B   C
0  1.0   NaN   5
1  2.0  10.0  10
2  NaN   NaN   7 

I want to add a new column 'D' that holds, for each row, the first non-null value among A, B and C (like SQL COALESCE). The expected output is:

     A     B   C    D
0  1.0   NaN   5    1.0
1  2.0  10.0  10    2.0
2  NaN   NaN   7    7.0

Thanks in advance!


Answered by philshem

Another way is to explicitly fill column D from A, B and C, in that order.

df['D'] = np.nan
df['D'] = df.D.fillna(df.A).fillna(df.B).fillna(df.C)

Answered by jezrael

I think you need bfill along axis=1, followed by selecting the first column with iloc:

df['D'] = df.bfill(axis=1).iloc[:,0]
print (df)
     A     B   C    D
0  1.0   NaN   5  1.0
1  2.0  10.0  10  2.0
2  NaN   NaN   7  7.0

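This is the same as the following (note that the method argument of fillna is deprecated in recent pandas versions, so the bfill form above is preferable):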

df['D'] = df.fillna(method='bfill',axis=1).iloc[:,0]
print (df)
     A     B   C    D
0  1.0   NaN   5  1.0
1  2.0  10.0  10  2.0
2  NaN   NaN   7  7.0
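
One thing to keep in mind with the bfill approach is that it scans every column from left to right, so the result depends on the column order of the frame. A minimal sketch, assuming you want an explicit priority order instead (the `priority` list is a name introduced here for illustration, not part of the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan], "B": [np.nan, 10, np.nan], "C": [5, 10, 7]})

# Hypothetical priority order: prefer B, then A, then C.
priority = ["B", "A", "C"]

# Back-fill along each row within the selected columns only, then keep the
# first column, which now holds the first non-null value of each row.
df["D"] = df[priority].bfill(axis=1).iloc[:, 0]
print(df)
```

Selecting the columns first also keeps the result stable if the frame later gains unrelated columns.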

Answered by piRSquared

Option 1: pandas

# Note: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0.
df.assign(D=df.lookup(df.index, df.isnull().idxmin(1)))

     A     B   C    D
0  1.0   NaN   5  1.0
1  2.0  10.0  10  2.0
2  NaN   NaN   7  7.0
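
Since DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, option 1 does not run on current versions as written. The following is one possible replacement, not the answer author's code: it converts the column labels to positions with `Index.get_indexer` and picks one value per row with NumPy indexing. The variable names are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan], "B": [np.nan, 10, np.nan], "C": [5, 10, 7]})

# Label of the first non-null column in each row: idxmin on the boolean
# null mask returns the first False, i.e. the first non-null entry.
first_valid = df.isnull().idxmin(axis=1)

# Translate column labels to integer positions, then pick one value per row.
col_pos = df.columns.get_indexer(first_valid)
values = df.to_numpy()[np.arange(len(df)), col_pos]

result = df.assign(D=values)
print(result)
```

For a row where every column is null, idxmin falls back to the first column, so the picked value is simply NaN.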

Option 2: numpy

v = df.values
j = np.isnan(v).argmin(1)
df.assign(D=v[np.arange(len(v)), j])

     A     B   C    D
0  1.0   NaN   5  1.0
1  2.0  10.0  10  2.0
2  NaN   NaN   7  7.0


Naive time test

Over the given data:

[image: timing results, not available]

Over larger data:

[image: timing results, not available]
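
A rough way to run such a timing comparison yourself is sketched below. It compares the bfill approach with the numpy approach using `timeit`. The function names, the data size (10,000 rows) and the share of missing cells (about half) are assumptions chosen here for illustration, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def coalesce_bfill(df):
    # Back-fill along rows and keep the first column.
    return df.bfill(axis=1).iloc[:, 0]


def coalesce_numpy(df):
    # Position of the first non-NaN value in each row, picked with NumPy.
    v = df.to_numpy()
    j = np.isnan(v).argmin(axis=1)
    return pd.Series(v[np.arange(len(v)), j], index=df.index)


# Hypothetical test data: 10,000 rows, 3 float columns, about half the cells missing.
rng = np.random.default_rng(0)
data = rng.random((10_000, 3))
data[rng.random(data.shape) < 0.5] = np.nan
big = pd.DataFrame(data, columns=list("ABC"))

for fn in (coalesce_bfill, coalesce_numpy):
    seconds = timeit.timeit(lambda: fn(big), number=10)
    print(f"{fn.__name__}: {seconds / 10:.6f} s per call")
```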

Answered by yardsale8

Another approach is to use the combine_first method of a pd.Series. Using your example df,

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
>>> df
     A     B   C
0  1.0   NaN   5
1  2.0  10.0  10
2  NaN   NaN   7

we have

>>> df.A.combine_first(df.B).combine_first(df.C)
0    1.0
1    2.0
2    7.0
Name: A, dtype: float64

We can use reduce to abstract this pattern so that it works with an arbitrary number of columns.

>>> from functools import reduce  # reduce is not a builtin in Python 3
>>> cols = [df[c] for c in df.columns]
>>> reduce(lambda acc, col: acc.combine_first(col), cols)
0    1.0
1    2.0
2    7.0
Name: A, dtype: float64

Let's put this all together in a function.

>>> def coalesce(*args):
...     return reduce(lambda acc, col: acc.combine_first(col), args)
...
>>> coalesce(*cols)
0    1.0
1    2.0
2    7.0
Name: A, dtype: float64
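
As a self-contained sketch of how this function could be used to produce the column D from the question (assuming Python 3, where reduce has to be imported from functools; the docstring and parameter name are additions made here):

```python
from functools import reduce

import numpy as np
import pandas as pd


def coalesce(*series):
    """Return, per row, the first non-null value among the given Series."""
    return reduce(lambda acc, col: acc.combine_first(col), series)


df = pd.DataFrame({"A": [1, 2, np.nan], "B": [np.nan, 10, np.nan], "C": [5, 10, 7]})

# Pass the columns in priority order; combine_first aligns them on the index.
df["D"] = coalesce(df["A"], df["B"], df["C"])
print(df)
```

Because the arguments are passed explicitly, the priority order is visible at the call site and does not depend on the column order of the frame.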

Answered by Denys

There is already a Series method in pandas that does this:

df['D'] = df['A'].combine_first(df['C'])

Or just chain the calls if you want to look up values across several columns sequentially:

df['D'] = df['A'].combine_first(df['B']).combine_first(df['C'])

This outputs the following:

>>> df
     A     B   C    D
0  1.0   NaN   5  1.0
1  2.0  10.0  10  2.0
2  NaN   NaN   7  7.0