Coalesce values from 2 columns into a single column in a pandas dataframe

Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/38152389/

Date: 2020-08-19 20:25:46  Source: igfitidea

Tags: python, pandas, numpy, dataframe

Asked by Sevyns

I'm looking for a method that behaves similarly to coalesce in T-SQL. I have 2 columns (column A and B) that are sparsely populated in a pandas dataframe. I'd like to create a new column using the following rules:

  1. If the value in column A is not null, use that value for the new column C
  2. If the value in column A is null, use the value in column B for the new column C

Like I mentioned, this can be accomplished in MS SQL Server via the coalesce function. I haven't found a good pythonic method for this; does one exist?

Answered by MaxU

Use combine_first():

In [16]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('ab'))

In [17]: df.loc[::2, 'a'] = np.nan

In [18]: df
Out[18]:
     a  b
0  NaN  0
1  5.0  5
2  NaN  8
3  2.0  8
4  NaN  3
5  9.0  4
6  NaN  7
7  2.0  0
8  NaN  6
9  2.0  5

In [19]: df['c'] = df.a.combine_first(df.b)

In [20]: df
Out[20]:
     a  b    c
0  NaN  0  0.0
1  5.0  5  5.0
2  NaN  8  8.0
3  2.0  8  2.0
4  NaN  3  3.0
5  9.0  4  9.0
6  NaN  7  7.0
7  2.0  0  2.0
8  NaN  6  6.0
9  2.0  5  2.0

Answered by Merlin

Try this as well; it is easier to remember:

df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )

This is slightly faster: df['c'] = np.where(df["a"].isnull() == True, df["b"], df["a"] ) (the == True is redundant, since isnull() already returns a boolean mask). The timings below compare combine_first with np.where:

%timeit df['d'] = df.a.combine_first(df.b)
1000 loops, best of 3: 472 μs per loop


%timeit  df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
1000 loops, best of 3: 291 μs per loop
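The timings above depend on the machine, the pandas version, and the frame size, so they are worth re-measuring. The following is a minimal sketch using the standard timeit module; the frame size, the seed, and the variable names are assumptions for illustration only. It also checks that both approaches produce the same values, and shows that np.where returns a plain NumPy array rather than a Series.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test frame: 100,000 rows, every other value of "a" missing.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(100_000, 2)).astype(float),
                  columns=["a", "b"])
df.loc[::2, "a"] = np.nan

via_combine = df["a"].combine_first(df["b"])              # returns a Series
via_where = np.where(df["a"].isnull(), df["b"], df["a"])  # returns an ndarray

# Both approaches should yield the same values.
same_values = np.array_equal(via_combine.to_numpy(), via_where)

t_combine = timeit.timeit(lambda: df["a"].combine_first(df["b"]), number=20)
t_where = timeit.timeit(
    lambda: np.where(df["a"].isnull(), df["b"], df["a"]), number=20)
print(f"combine_first: {t_combine:.4f}s  np.where: {t_where:.4f}s  "
      f"same values: {same_values}")
```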

Answered by cs95

combine_first is the most straightforward option. I outline a few more solutions below, some applicable to different cases.

Case #1: Non-mutually Exclusive NaNs

Not all rows have NaNs, and these NaNs are not mutually exclusive between columns.

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, np.nan, 5.0, 7.0, np.nan],
    'b': [5.0, 3.0, np.nan, 4.0, np.nan, 6.0, 7.0]})      
df

     a    b
0  1.0  5.0
1  2.0  3.0
2  3.0  NaN
3  NaN  4.0
4  5.0  NaN
5  7.0  6.0
6  NaN  7.0

Let's coalesce with priority on a (keep a where it is present, otherwise take b).

Series.mask

df['a'].mask(pd.isnull, df['b'])
# df['a'].mask(df['a'].isnull(), df['b'])
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    7.0
6    7.0
Name: a, dtype: float64

Series.where

df['a'].where(pd.notnull, df['b'])

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    7.0
6    7.0
Name: a, dtype: float64

You can use similar syntax using np.where.

Alternatively, to combine first on b, switch the conditions around.
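As a sketch of those two remarks, using the Case #1 data: the first expression is the np.where counterpart of the mask/where calls, and the second gives priority to b by swapping the roles of the columns. The names a_first and b_first are only illustrative. Note that np.where returns a NumPy array, so it is wrapped in a Series here to keep the index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, np.nan, 5.0, 7.0, np.nan],
    "b": [5.0, 3.0, np.nan, 4.0, np.nan, 6.0, 7.0],
})

# Priority on "a": take "b" only where "a" is null.
a_first = pd.Series(np.where(df["a"].isnull(), df["b"], df["a"]),
                    index=df.index)

# Priority on "b": same idea with the columns swapped.
b_first = df["b"].mask(df["b"].isnull(), df["a"])

print(a_first.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 7.0, 7.0]
print(b_first.tolist())  # [5.0, 3.0, 3.0, 4.0, 5.0, 6.0, 7.0]
```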

Case #2: Mutually Exclusive Positioned NaNs

All rows have NaNs, and these are mutually exclusive between columns: each row has a value in exactly one of a or b.

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, np.nan, 5.0, np.nan, np.nan],
    'b': [np.nan, np.nan, np.nan, 4.0, np.nan, 6.0, 7.0]})
df

     a    b
0  1.0  NaN
1  2.0  NaN
2  3.0  NaN
3  NaN  4.0
4  5.0  NaN
5  NaN  6.0
6  NaN  7.0

Series.update

This method works in place, modifying the original DataFrame. It is an efficient option for this use case.

df['b'].update(df['a'])
# Or, to update "a" in-place,
# df['a'].update(df['b'])
df

     a    b
0  1.0  1.0
1  2.0  2.0
2  3.0  3.0
3  NaN  4.0
4  5.0  5.0
5  NaN  6.0
6  NaN  7.0
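Whether df['b'].update(...) actually writes through to df depends on the pandas version and settings. With Copy-on-Write enabled (as far as I know, the default from pandas 3.0), df['b'] is a separate object and the update may not reach the DataFrame. A minimal sketch that does not rely on this, assuming the Case #2 frame above, updates an explicit copy and assigns it back:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, np.nan, 5.0, np.nan, np.nan],
    "b": [np.nan, np.nan, np.nan, 4.0, np.nan, 6.0, 7.0],
})

# Update an explicit copy, then assign it back, so the result does not
# depend on whether df["b"] is a view of the DataFrame.
b = df["b"].copy()
b.update(df["a"])  # non-null values of "a" overwrite "b" in place
df["b"] = b

print(df["b"].tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
```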

Series.add

df['a'].add(df['b'], fill_value=0)

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
dtype: float64

DataFrame.fillna + DataFrame.sum

df.fillna(0).sum(1)

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
dtype: float64
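The two addition-based approaches only coalesce correctly when the NaNs are mutually exclusive. As a quick check, here is a sketch that applies Series.add to the Case #1 data, where some rows have values in both columns; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Case #1 data: rows 0, 1 and 5 have values in both columns.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, np.nan, 5.0, 7.0, np.nan],
    "b": [5.0, 3.0, np.nan, 4.0, np.nan, 6.0, 7.0],
})

coalesced = df["a"].combine_first(df["b"])   # keeps "a" where present
summed = df["a"].add(df["b"], fill_value=0)  # adds "a" and "b" where both exist

# The two results disagree exactly on the rows that have two values.
disagree = coalesced.index[coalesced != summed].tolist()
print(disagree)  # [0, 1, 5]
```

So for data like Case #1, prefer combine_first, mask, or where.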

Answered by Erfan

Coalesce for multiple columns with DataFrame.bfill

December 2019 answer

All these methods work for two columns and are fine with maybe three, but they all require method chaining if you have n columns with n > 2:

Example dataframe:

import numpy as np
import pandas as pd

# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({'col1':[np.nan, 2, 4, 5, np.nan],
                   'col2':[np.nan, 5, 1, 0, np.nan],
                   'col3':[2, np.nan, 9, 1, np.nan],
                   'col4':[np.nan, 10, 11, 4, 8]})

print(df)

   col1  col2  col3  col4
0   NaN   NaN   2.0   NaN
1   2.0   5.0   NaN  10.0
2   4.0   1.0   9.0  11.0
3   5.0   0.0   1.0   4.0
4   NaN   NaN   NaN   8.0

Using DataFrame.bfill along the column axis (axis=1), we can get the values in a generalized way, even for a large number n of columns.

Plus, this also works for string-type columns!

df['coalesce'] = df.bfill(axis=1).iloc[:, 0]

   col1  col2  col3  col4  coalesce
0   NaN   NaN   2.0   NaN       2.0
1   2.0   5.0   NaN  10.0       2.0
2   4.0   1.0   9.0  11.0       4.0
3   5.0   0.0   1.0   4.0       5.0
4   NaN   NaN   NaN   8.0       8.0
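To illustrate the remark about string columns, here is a small sketch with a hypothetical frame (the column names and values are made up) in which None marks a missing value:

```python
import pandas as pd

# Hypothetical string columns; None marks a missing value.
df = pd.DataFrame({
    "first": [None, "b1", None],
    "second": ["a2", "b2", None],
    "third": ["a3", None, "c3"],
})

# Back-fill across columns, then take the first column of the result.
df["coalesce"] = df.bfill(axis=1).iloc[:, 0]

print(df["coalesce"].tolist())  # ['a2', 'b1', 'c3']
```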

Using Series.combine_first (the accepted answer) can get quite cumbersome and eventually becomes impractical as the number of columns grows:

df['coalesce'] = (
    df['col1'].combine_first(df['col2'])
        .combine_first(df['col3'])
        .combine_first(df['col4'])
)

   col1  col2  col3  col4  coalesce
0   NaN   NaN   2.0   NaN       2.0
1   2.0   5.0   NaN  10.0       2.0
2   4.0   1.0   9.0  11.0       4.0
3   5.0   0.0   1.0   4.0       5.0
4   NaN   NaN   NaN   8.0       8.0
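If you prefer to stay with combine_first, one way to avoid writing the chain by hand is to fold it over a list of columns with functools.reduce. This is only a sketch; the cols list and its priority order are assumptions for illustration.

```python
from functools import reduce

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [np.nan, 2, 4, 5, np.nan],
                   "col2": [np.nan, 5, 1, 0, np.nan],
                   "col3": [2, np.nan, 9, 1, np.nan],
                   "col4": [np.nan, 10, 11, 4, 8]})

cols = ["col1", "col2", "col3", "col4"]  # priority order, highest first

# Fold combine_first over the columns instead of chaining calls manually.
df["coalesce"] = reduce(lambda acc, col: acc.combine_first(df[col]),
                        cols[1:], df[cols[0]])

print(df["coalesce"].tolist())  # [2.0, 2.0, 4.0, 5.0, 8.0]
```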

Answered by David Smith

I encountered this problem but wanted to coalesce multiple columns, picking the first non-null value from several columns. I found the following helpful:

Build dummy data

import pandas as pd
df = pd.DataFrame({'a1': [None, 2, 3, None],
                   'a2': [2, None, 4, None],
                   'a3': [4, 5, None, None],
                   'a4': [None, None, None, None],
                   'b1': [9, 9, 9, 999]})

df
    a1   a2   a3    a4   b1
0  NaN  2.0  4.0  None    9
1  2.0  NaN  5.0  None    9
2  3.0  4.0  NaN  None    9
3  NaN  NaN  NaN  None  999

Coalesce a1, a2, a3 into a new column A

def get_first_non_null(dfrow, columns_to_search):
    for c in columns_to_search:
        if pd.notnull(dfrow[c]):
            return dfrow[c]
    return None

# sample usage:
cols_to_search = ['a1', 'a2', 'a3']
df['A'] = df.apply(lambda x: get_first_non_null(x, cols_to_search), axis=1)

print(df)
    a1   a2   a3    a4   b1    A
0  NaN  2.0  4.0  None    9  2.0
1  2.0  NaN  5.0  None    9  2.0
2  3.0  4.0  NaN  None    9  3.0
3  NaN  NaN  NaN  None  999  NaN
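The row-wise apply above calls a Python function once per row, which can be slow on large frames. A vectorized alternative, sketched here under the assumption that the same three columns are searched in the same order, restricts the frame to those columns and back-fills across them:

```python
import pandas as pd

df = pd.DataFrame({"a1": [None, 2, 3, None],
                   "a2": [2, None, 4, None],
                   "a3": [4, 5, None, None],
                   "b1": [9, 9, 9, 999]})

cols_to_search = ["a1", "a2", "a3"]

# Back-fill only the selected columns, then keep the first one.
df["A"] = df[cols_to_search].bfill(axis=1).iloc[:, 0]

print(df["A"].tolist())  # [2.0, 2.0, 3.0, nan]
```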

Answered by Christian DiMare

I'm thinking of a solution like this:

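import pandas as pd

# Each extra positional argument is a single Series, so the annotation is
# pd.Series rather than List[pd.Series] (List was also never imported).
def coalesce(s: pd.Series, *series: pd.Series) -> pd.Series:
    """Coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)
    return s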

because, given a DataFrame with columns ['a', 'b', 'c'], you can use it like a SQL coalesce:

df['d'] = coalesce(df.a, df.b, df.c)
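For a self-contained run of that call, here is a sketch with a hypothetical three-column frame (the data values are made up). The helper is restated with s.isnull() as the mask condition, which is equivalent to passing pd.isnull as a callable:

```python
import numpy as np
import pandas as pd

def coalesce(s: pd.Series, *series: pd.Series) -> pd.Series:
    """Return the first non-null value per row, scanning left to right."""
    for other in series:
        s = s.mask(s.isnull(), other)
    return s

# Hypothetical three-column frame; the last row is null everywhere.
df = pd.DataFrame({"a": [1.0, np.nan, np.nan, np.nan],
                   "b": [np.nan, 2.0, np.nan, np.nan],
                   "c": [9.0, 9.0, 3.0, np.nan]})

df["d"] = coalesce(df.a, df.b, df.c)

print(df["d"].tolist())  # [1.0, 2.0, 3.0, nan]
```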

Answered by Cilantro Ditrek

For a more general case, where there are no NaNs but you want the same behavior:

Merge 'left', but override 'right' values where possible
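The code for this answer is not reproduced here, so the following is only a guess at the kind of situation it describes, not the answer's own method. It assumes a hypothetical frame where a sentinel (an empty string) rather than NaN marks a missing value, and uses Series.where with an explicit condition to get coalesce-like behavior:

```python
import pandas as pd

# Hypothetical frame where an empty string, not NaN, marks a missing value.
df = pd.DataFrame({"a": ["x", "", "z", ""],
                   "b": ["p", "q", "", ""]})

# Keep "a" where it is non-empty, otherwise fall back to "b".
df["c"] = df["a"].where(df["a"] != "", df["b"])

print(df["c"].tolist())  # ['x', 'q', 'z', '']
```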

Answered by Stefan Voshage

Good code, but you have a typo for Python 3; the correct version looks like this:

    """coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)        
    return s