pandas: left join and update existing column

Question

Asked by iwbabn

I am new to pandas and can't seem to get this to work with the merge function:

>>> left       >>> right
   a  b   c       a  c   d 
0  1  4   9    0  1  7  13
1  2  5  10    1  2  8  14
2  3  6  11    2  3  9  15
3  4  7  12

With a left join on column a, I would like to update the common columns by the joined keys. Note that the last value in column c comes from the left table, since there is no match.

>>> final
   a  b   c    d
0  1  4   7   13
1  2  5   8   14
2  3  6   9   15
3  4  7  12  NaN

How should I do this with the pandas merge function? Thank you.

Answer 1

Accepted answer by Andy Hayden

One way to do this is to set the a column as the index and use update:

In [11]: left_a = left.set_index('a')

In [12]: right_a = right.set_index('a')

Note: update only does a left join (it does not merge), so in addition to set_index you also need to add the columns from right_a that are not present in left_a.

In [13]: res = left_a.reindex(columns=left_a.columns.union(right_a.columns))

In [14]: res.update(right_a)

In [15]: res.reset_index(inplace=True)

In [16]: res
Out[16]:
   a   b   c   d
0  1   4   7  13
1  2   5   8  14
2  3   6   9  15
3  4   7  12 NaN
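
For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch. The left and right frames are rebuilt from the tables in the question; everything else follows the set_index / reindex / update sequence shown in this answer.

```python
import pandas as pd

# Sample frames reconstructed from the tables in the question.
left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 5, 6, 7], "c": [9, 10, 11, 12]})
right = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9], "d": [13, 14, 15]})

left_a = left.set_index("a")
right_a = right.set_index("a")

# Add the columns that exist only in right (here: d) so update() has somewhere to write.
res = left_a.reindex(columns=left_a.columns.union(right_a.columns))

# update() aligns on the index and overwrites in place with the non-NaN values of right_a.
res.update(right_a)

res = res.reset_index()
print(res)
```

Row a=4 has no match in right, so it keeps c=12 from left and gets NaN in d.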

Answer 2

Answered by Zero

You can use merge() between left and right with how='left' on the 'a' column.

In [74]: final = left.merge(right, on='a', how='left')

In [75]: final
Out[75]:
   a  b  c_x  c_y   d
0  1  4    9    7  13
1  2  5   10    8  14
2  3  6   11    9  15
3  4  7   12  NaN NaN

Replace the NaN values in c_y with the values from c_x:

In [76]: final['c'] = final['c_y'].fillna(final['c_x'])

In [77]: final
Out[77]:
   a  b  c_x  c_y   d   c
0  1  4    9    7  13   7
1  2  5   10    8  14   8
2  3  6   11    9  15   9
3  4  7   12  NaN NaN  12

Drop the unwanted columns, and you have the result:

In [79]: final.drop(['c_x', 'c_y'], axis=1)
Out[79]:
   a  b   d   c
0  1  4  13   7
1  2  5  14   8
2  3  6  15   9
3  4  7 NaN  12
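
The same merge-then-fillna idea can be written a bit more compactly. This is only a sketch of a variant, not part of the original answer: it assumes the sample frames from the question, and the name merged and the suffix '_left' are my own choices. Suffixing only the left-hand copy means the right-hand values keep the name c, so the final column order matches the desired output.

```python
import pandas as pd

# Sample frames reconstructed from the tables in the question.
left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 5, 6, 7], "c": [9, 10, 11, 12]})
right = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9], "d": [13, 14, 15]})

# Suffix only the left-hand copy so the values from right keep the name "c".
merged = left.merge(right, on="a", how="left", suffixes=("_left", ""))

# pop() removes c_left from the frame and returns it, to be used as the fallback.
merged["c"] = merged["c"].fillna(merged.pop("c_left"))

# Optional: c no longer contains NaN, so it can be cast back to its original dtype.
merged["c"] = merged["c"].astype(left["c"].dtype)
print(merged)
```

Column d still contains NaN for the unmatched row, so it stays a float column.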

Answer 3

Answered by starostise

Another way is to use pd.merge like so:

 >>> import pandas as pd

 >>> final = pd.merge(right, left, 
                      how='outer',
                      left_index=True,
                      right_index=True,
                      on=('a', 'c')
                     ).sort_index(axis=1)

 >>> final       
    a  b   c   d 
 0  1  4   7   13.0
 1  2  5   8   14.0
 2  3  6   9   15.0
 3  4  7   12  NaN

You can compute the intersection of the two DataFrames' column names that you want to update and pass it to the on= parameter of the function.

Unlike Zero's solution, it does not create unwanted columns that have to be dropped.

Edit: The NaN value may turn the integers in the same column into floats.
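
Recent pandas releases may reject on= combined with left_index=True and right_index=True, so the call above might not run everywhere. A different technique that gives the same "right wins, left fills the gaps" result without creating suffixed columns is combine_first. The sketch below is a swapped-in alternative rather than this answer's method, and it assumes the sample frames from the question.

```python
import pandas as pd

# Sample frames reconstructed from the tables in the question.
left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 5, 6, 7], "c": [9, 10, 11, 12]})
right = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9], "d": [13, 14, 15]})

# combine_first keeps the caller's values and falls back to the argument's values
# wherever the caller has none; the result uses the union of both indexes and columns.
final = right.set_index("a").combine_first(left.set_index("a")).reset_index()
print(final)
```

Note that this behaves like an outer join on a: keys that exist only in right would also appear in the result, which makes no difference for the sample data.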

Answer 4

Answered by fixxxer

Here's a way to do it with join:

In [632]: t = left.set_index('a').join(right.set_index('a'), rsuffix='_right')

In [633]: t
Out[633]: 
   b   c  c_right   d
a                    
1  4   9        7  13
2  5  10        8  14
3  6  11        9  15
4  7  12      NaN NaN

Now we want to fill the null values of c_right (which comes from the right dataframe) with values from column c of the left dataframe. I updated the process below with a method taken from @John Galt's answer:

In [657]: t['c_right'] = t['c_right'].fillna(t['c'])

In [658]: t
Out[658]: 
   b   c  c_right   d
a                    
1  4   9        7  13
2  5  10        8  14
3  6  11        9  15
4  7  12       12 NaN

In [659]: t.drop('c', axis=1).rename(columns={'c_right': 'c'})
Out[659]: 
   b   c   d
a           
1  4   7  13
2  5   8  14
3  6   9  15
4  7  12 NaN

Answer 5

Answered by Ben

DataFrame.update() is nice, but it doesn't let you specify columns to join on and, more importantly, if the other dataframe has NaN values, those NaN values will not overwrite non-NaN values in the original DataFrame. To me, this is undesirable behavior.
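
To make the NaN behavior concrete, here is a tiny sketch. The frames original and other are made up for illustration and do not come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical frames, used only to show how update() treats NaN.
original = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
other = pd.DataFrame({"x": [np.nan, 20.0, np.nan]})

# update() modifies original in place, aligning rows on the index.
# NaN entries in other are skipped, so rows 0 and 2 keep their old values.
original.update(other)
print(original["x"].tolist())
```

Only row 1 changes, because that is the only row where other holds a non-NaN value.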

Here's a custom method I wrote to fix these issues. It's freshly written, so users beware.

join_insertion()

import numpy as np
import pandas as pd


def join_insertion(into_df, from_df, on, cols, mult='error'):
    """
    Suppose A and B are dataframes. A has columns {foo, bar, baz} and B has columns {foo, baz, buz}
    This function allows you to do an operation like:
    "where A and B match via the column foo, insert the values of baz and buz from B into A"
    Note that this'll update A's values for baz and it'll insert buz as a new column.
    This is a lot like DataFrame.update(), but that method annoyingly ignores NaN values in B!

    :param into_df: dataframe you want to modify
    :param from_df: dataframe with the values you want to insert
    :param cols: column name or list of column names (values to insert)
    :param on: column name, list of column names (values to join on), or a dict of {into:from} column name pairs
    :param mult: if a key of into_df matches multiple rows of from_df, how should this be handled?
    an error can be raised, or the first matching value can be inserted, or the last matching value
    can be inserted
    :return: a modified copy of into_df, with updated values using from_df
    """

    # Infer left_on, right_on
    if isinstance(on, dict):
        left_on = list(on.keys())
        right_on = list(on.values())
    elif isinstance(on, list):
        left_on = on
        right_on = on
    elif isinstance(on, str):
        left_on = [on]
        right_on = [on]
    else:
        raise Exception("on should be a str, list, or dict")

    # Make cols a list if it isn't already
    if isinstance(cols, str):
        cols = [cols]

    # Setup
    A = into_df.copy()
    B = from_df[right_on + cols].copy()

    # Insert row ids
    A['_A_RowId_'] = np.arange(A.shape[0])
    B['_B_RowId_'] = np.arange(B.shape[0])

    A = pd.merge(
        left=A,
        right=B,
        how='left',
        left_on=left_on,
        right_on=right_on,
        suffixes=(None, '_y'),
        indicator=True
    ).sort_values(['_A_RowId_', '_B_RowId_'])

    # Check for rows of A which got duplicated by the merge, and then handle appropriately
    if mult == 'error':
        if A.groupby('_A_RowId_').size().max() > 1:
            raise Exception("At least one key of into_df matched multiple rows of from_df.")
    elif mult == 'first':
        A = A.groupby('_A_RowId_').first().reset_index()
    elif mult == 'last':
        A = A.groupby('_A_RowId_').last().reset_index()

    mask = A._merge == 'both'
    cols_in_both = list(set(into_df.columns.to_list()).intersection(set(cols)))
    for col in cols_in_both:
        A.loc[mask, col] = A.loc[mask, col + '_y']

    # Drop unwanted columns
    A.drop(columns=list(set(A.columns).difference(set(into_df.columns.to_list() + cols))), inplace=True)

    return A

Example Use

into_df = pd.DataFrame({
    'foo': [1, 2, 3],
    'bar': [4, 5, 6],
    'baz': [7, 8, 9]
})
   foo  bar  baz
0    1    4    7
1    2    5    8
2    3    6    9

from_df = pd.DataFrame({
    'foo': [1, 3, 5, 7, 3],
    'baz': [70, 80, 90, 30, 40],
    'buz': [0, 1, 2, 3, 4]
})
   foo  baz  buz
0    1   70    0
1    3   80    1
2    5   90    2
3    7   30    3
4    3   40    4

# Use it!

join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='error')
  Exception: At least one key of into_df matched multiple rows of from_df.

join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='first')
   foo  bar   baz  buz
0    1    4  70.0  0.0
1    2    5   8.0  NaN
2    3    6  80.0  1.0

join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='last')
   foo  bar   baz  buz
0    1    4  70.0  0.0
1    2    5   8.0  NaN
2    3    6  40.0  4.0

As an aside, this is one of those things I sorely miss from R's data.table package. With data.table, this is as easy as x[y, Foo := i.Foo, on = c("a", "b")]
