Python: coalesce values from 2 columns into a single column in a pandas DataFrame
Original source: http://stackoverflow.com/questions/38152389/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) and keep the same license.
Coalesce values from 2 columns into a single column in a pandas dataframe
Asked by Sevyns
I'm looking for a method that behaves similarly to coalesce in T-SQL. I have 2 columns (column A and B) that are sparsely populated in a pandas dataframe. I'd like to create a new column using the following rules:
- If the value in column A is not null, use that value for the new column C
- If the value in column A is null, use the value in column B for the new column C
Like I mentioned, this can be accomplished in MS SQL Server via the coalesce function. I haven't found a good pythonic method for this; does one exist?
Answered by MaxU
Use combine_first():
In [16]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('ab'))
In [17]: df.loc[::2, 'a'] = np.nan
In [18]: df
Out[18]:
a b
0 NaN 0
1 5.0 5
2 NaN 8
3 2.0 8
4 NaN 3
5 9.0 4
6 NaN 7
7 2.0 0
8 NaN 6
9 2.0 5
In [19]: df['c'] = df.a.combine_first(df.b)
In [20]: df
Out[20]:
a b c
0 NaN 0 0.0
1 5.0 5 5.0
2 NaN 8 8.0
3 2.0 8 2.0
4 NaN 3 3.0
5 9.0 4 9.0
6 NaN 7 7.0
7 2.0 0 2.0
8 NaN 6 6.0
9 2.0 5 2.0
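The session above uses random integers, so its values change on every run. A minimal reproducible sketch of the same idea, with made-up fixed values chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the original answer):
# "a" is sparsely populated, "b" is the fallback column.
df = pd.DataFrame({"a": [np.nan, 5.0, np.nan, 2.0],
                   "b": [0, 5, 8, 8]})

# Take "a" where it is not null, otherwise fall back to "b".
df["c"] = df["a"].combine_first(df["b"])
print(df["c"].tolist())
```

The result is a float column because "a" contains NaN.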
Answered by Merlin
Try this as well; it is easier to remember:
df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
This is slightly faster than combine_first: df['c'] = np.where(df["a"].isnull(), df["b"], df["a"])
%timeit df['d'] = df.a.combine_first(df.b)
1000 loops, best of 3: 472 μs per loop
%timeit df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
1000 loops, best of 3: 291 μs per loop
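Timings like these depend on the data size and the pandas/NumPy versions, so they are worth re-measuring locally. One behavioral difference to keep in mind is that np.where returns a plain NumPy array rather than a Series. A small sketch, using made-up data, that checks both approaches give the same values:

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the answer.
df = pd.DataFrame({"a": [np.nan, 5.0, np.nan, 2.0],
                   "b": [0.0, 5.0, 8.0, 8.0]})

# np.where returns a NumPy array; assigning it to a column
# places the values by position, not by index label.
via_where = np.where(df["a"].isnull(), df["b"], df["a"])
via_combine = df["a"].combine_first(df["b"])

print(type(via_where).__name__)
print(np.array_equal(via_where, via_combine.to_numpy()))
```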
Answered by cs95
combine_first is the most straightforward option. Below I outline a few more solutions, some of which apply to different cases.
Case #1: Non-mutually Exclusive NaNs
Not all rows have NaNs, and these NaNs are not mutually exclusive between columns.
df = pd.DataFrame({
'a': [1.0, 2.0, 3.0, np.nan, 5.0, 7.0, np.nan],
'b': [5.0, 3.0, np.nan, 4.0, np.nan, 6.0, 7.0]})
df
a b
0 1.0 5.0
1 2.0 3.0
2 3.0 NaN
3 NaN 4.0
4 5.0 NaN
5 7.0 6.0
6 NaN 7.0
Let's combine first on a.
df['a'].mask(pd.isnull, df['b'])
# df['a'].mask(df['a'].isnull(), df['b'])
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 7.0
6 7.0
Name: a, dtype: float64
df['a'].where(pd.notnull, df['b'])
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 7.0
6 7.0
Name: a, dtype: float64
You can use similar syntax with np.where.
Alternatively, to combine first on b, switch the conditions around.
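As a sketch of what "switching the conditions around" could look like on the Case #1 data, here b takes priority and a is only the fallback (the variable names are chosen for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, np.nan, 5.0, 7.0, np.nan],
    'b': [5.0, 3.0, np.nan, 4.0, np.nan, 6.0, 7.0]})

# Keep "b" where it is not null, otherwise take "a".
first_on_b = df['b'].where(df['b'].notnull(), df['a'])

# The same selection with np.where (returns a NumPy array).
first_on_b_np = np.where(df['b'].notnull(), df['b'], df['a'])

print(first_on_b.tolist())
```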
Case #2: Mutually Exclusive Positioned NaNs
All rows have NaNs which are mutually exclusive between columns.
df = pd.DataFrame({
'a': [1.0, 2.0, 3.0, np.nan, 5.0, np.nan, np.nan],
'b': [np.nan, np.nan, np.nan, 4.0, np.nan, 6.0, 7.0]})
df
a b
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
3 NaN 4.0
4 5.0 NaN
5 NaN 6.0
6 NaN 7.0
Series.update works in-place, modifying the original DataFrame. This is an efficient option for this use case.
df['b'].update(df['a'])
# Or, to update "a" in-place,
# df['a'].update(df['b'])
df
a b
0 1.0 1.0
1 2.0 2.0
2 3.0 3.0
3 NaN 4.0
4 5.0 5.0
5 NaN 6.0
6 NaN 7.0
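Alternatively, on the original DataFrame (before the in-place update above), add the two columns with NaN treated as 0:
df['a'].add(df['b'], fill_value=0)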
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
dtype: float64
DataFrame.fillna + DataFrame.sum (again on the original DataFrame)
df.fillna(0).sum(axis=1)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
dtype: float64
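The addition-based approaches only coalesce correctly when at most one column per row is non-null, and they only apply to numeric data. A short sketch with made-up overlapping values shows how they diverge from a true coalesce otherwise:

```python
import numpy as np
import pandas as pd

# Illustrative data: rows 0 and 2 have values in BOTH columns.
df_overlap = pd.DataFrame({'a': [1.0, np.nan, 7.0],
                           'b': [5.0, 4.0, 6.0]})

# Summing adds the overlapping values together instead of picking one.
summed = df_overlap.fillna(0).sum(axis=1)

# combine_first picks "a" when present and falls back to "b".
coalesced = df_overlap['a'].combine_first(df_overlap['b'])

print(summed.tolist())
print(coalesced.tolist())
```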
Answered by Erfan
Coalesce for multiple columns with DataFrame.bfill
December 2019 answer
All these methods work for two columns and are fine with maybe three columns, but they all require method chaining if you have n columns with n > 2:
Example dataframe:
import numpy as np
import pandas as pd
# np.nan is used instead of the np.NaN alias, which was removed in NumPy 2.0
df = pd.DataFrame({'col1': [np.nan, 2, 4, 5, np.nan],
                   'col2': [np.nan, 5, 1, 0, np.nan],
                   'col3': [2, np.nan, 9, 1, np.nan],
                   'col4': [np.nan, 10, 11, 4, 8]})
print(df)
col1 col2 col3 col4
0 NaN NaN 2.0 NaN
1 2.0 5.0 NaN 10.0
2 4.0 1.0 9.0 11.0
3 5.0 0.0 1.0 4.0
4 NaN NaN NaN 8.0
Using DataFrame.bfill along the columns axis (axis=1), we can get the values in a generalized way, even for a large number of columns n:
Plus, this also works for string-type columns.
df['coalesce'] = df.bfill(axis=1).iloc[:, 0]
col1 col2 col3 col4 coalesce
0 NaN NaN 2.0 NaN 2.0
1 2.0 5.0 NaN 10.0 2.0
2 4.0 1.0 9.0 11.0 4.0
3 5.0 0.0 1.0 4.0 5.0
4 NaN NaN NaN 8.0 8.0
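To illustrate the remark about string columns, here is a small sketch on made-up string data (the column names s1, s2, s3 are hypothetical). Back-filling along axis=1 moves the first non-null value of each row into the first column, which is then selected with iloc:

```python
import numpy as np
import pandas as pd

# Illustrative string data, not from the original answer.
df_str = pd.DataFrame({'s1': [np.nan, 'x', np.nan],
                       's2': ['a', 'y', np.nan],
                       's3': ['b', np.nan, 'c']})

# Back-fill across columns, then keep the first column.
first_non_null = df_str.bfill(axis=1).iloc[:, 0]
print(first_non_null.tolist())
```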
Using Series.combine_first (the accepted answer) gets quite cumbersome and eventually becomes impractical as the number of columns grows:
df['coalesce'] = (
df['col1'].combine_first(df['col2'])
.combine_first(df['col3'])
.combine_first(df['col4'])
)
col1 col2 col3 col4 coalesce
0 NaN NaN 2.0 NaN 2.0
1 2.0 5.0 NaN 10.0 2.0
2 4.0 1.0 9.0 11.0 4.0
3 5.0 0.0 1.0 4.0 5.0
4 NaN NaN NaN 8.0 8.0
Answered by David Smith
I encountered this problem but wanted to coalesce multiple columns, picking the first non-null value from several columns. I found the following helpful:
Build dummy data
import pandas as pd
df = pd.DataFrame({'a1': [None, 2, 3, None],
'a2': [2, None, 4, None],
'a3': [4, 5, None, None],
'a4': [None, None, None, None],
'b1': [9, 9, 9, 999]})
df
a1 a2 a3 a4 b1
0 NaN 2.0 4.0 None 9
1 2.0 NaN 5.0 None 9
2 3.0 4.0 NaN None 9
3 NaN NaN NaN None 999
Coalesce a1, a2, a3 into a new column A:
def get_first_non_null(dfrow, columns_to_search):
    # Return the first non-null value among the given columns of one row.
    for c in columns_to_search:
        if pd.notnull(dfrow[c]):
            return dfrow[c]
    return None

# sample usage:
cols_to_search = ['a1', 'a2', 'a3']
df['A'] = df.apply(lambda x: get_first_non_null(x, cols_to_search), axis=1)
print(df)
a1 a2 a3 a4 b1 A
0 NaN 2.0 4.0 None 9 2.0
1 2.0 NaN 5.0 None 9 2.0
2 3.0 4.0 NaN None 9 3.0
3 NaN NaN NaN None 999 NaN
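A row-wise apply calls a Python function once per row, which tends to be slow on large frames. As an alternative sketch (not part of this answer), the same column subset can be coalesced with the bfill approach from the earlier answer:

```python
import pandas as pd

df = pd.DataFrame({'a1': [None, 2, 3, None],
                   'a2': [2, None, 4, None],
                   'a3': [4, 5, None, None],
                   'b1': [9, 9, 9, 999]})

cols_to_search = ['a1', 'a2', 'a3']

# Back-fill across the selected columns only, then take the first one.
df['A'] = df[cols_to_search].bfill(axis=1).iloc[:, 0]
print(df['A'])
```

The last row stays NaN because none of the searched columns has a value there.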
Answered by Christian DiMare
I'm thinking of a solution like this:
def coalesce(s: pd.Series, *series: pd.Series) -> pd.Series:
    """Coalesce the column information like a SQL coalesce."""
    # *series is annotated with the element type; each extra argument is a Series.
    for other in series:
        s = s.mask(pd.isnull, other)
    return s
because, given a DataFrame with columns ['a', 'b', 'c'], you can use it like a SQL coalesce:
df['d'] = coalesce(df.a, df.b, df.c)
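A self-contained sketch of this helper in use, with made-up data. The `*series: pd.Series` annotation is a choice made here so that the snippet runs without importing `List`:

```python
import numpy as np
import pandas as pd


def coalesce(s: pd.Series, *series: pd.Series) -> pd.Series:
    """Return the first non-null value per row, like SQL COALESCE."""
    for other in series:
        # Wherever s is still null, take the value from the next Series.
        s = s.mask(pd.isnull, other)
    return s


# Illustrative data: each row has its first non-null value in a different column.
df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [np.nan, 2.0, np.nan],
                   'c': [9.0, 9.0, 3.0]})

df['d'] = coalesce(df.a, df.b, df.c)
print(df['d'].tolist())
```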
Answered by Cilantro Ditrek
For a more general case, where there are no NaNs but you want the same behavior:
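No code accompanies this sentence, so the following is only one possible reading, under the assumption that some other sentinel (here an empty string) plays the role of "missing". Series.where accepts any boolean condition, so the fallback logic is not tied to NaN:

```python
import pandas as pd

# Illustrative data; the empty string is an assumed "missing" marker.
df = pd.DataFrame({'a': ['x', '', 'z'],
                   'b': ['p', 'q', 'r']})

# Keep "a" where it is non-empty, otherwise fall back to "b".
df['c'] = df['a'].where(df['a'] != '', df['b'])
print(df['c'].tolist())
```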
Answered by Stefan Voshage
Good code, but you have a typo for Python 3; the correct version looks like this:
    """coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)
    return s