pandas left join and update existing column
Original question: http://stackoverflow.com/questions/30045086/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by iwbabn
I am new to pandas and can't seem to get this to work with the merge function:
>>> left
   a  b   c
0  1  4   9
1  2  5  10
2  3  6  11
3  4  7  12

>>> right
   a  c   d
0  1  7  13
1  2  8  14
2  3  9  15
With a left join on column a, I would like to update the common columns BY THE JOINED KEYS. Note that the last value in column c comes from the LEFT table, since there is no match.
>>> final
   a  b   c    d
0  1  4   7   13
1  2  5   8   14
2  3  6   9   15
3  4  7  12  NaN
How should I do this with the pandas merge function? Thank you.
Accepted answer by Andy Hayden
One way to do this is to set the a column as the index and then call update:
In [11]: left_a = left.set_index('a')
In [12]: right_a = right.set_index('a')
Note: update only does a left join (it does not merge), so in addition to set_index you also need to add the columns that are not present in left_a.
In [13]: res = left_a.reindex(columns=left_a.columns.union(right_a.columns))
In [14]: res.update(right_a)
In [15]: res.reset_index(inplace=True)
In [16]: res
Out[16]:
   a  b   c    d
0  1  4   7   13
1  2  5   8   14
2  3  6   9   15
3  4  7  12  NaN
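For reference, here is a self-contained sketch of the set_index/update approach. The left and right frames are rebuilt from the tables in the question; depending on the pandas version, the updated columns may come back as floats, so it is safer to compare values than dtypes.

```python
import pandas as pd

# Frames rebuilt from the tables in the question.
left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 5, 6, 7], "c": [9, 10, 11, 12]})
right = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9], "d": [13, 14, 15]})

# Index both frames by the join key.
left_a = left.set_index("a")
right_a = right.set_index("a")

# Add the columns that exist only in right_a (filled with NaN for now).
res = left_a.reindex(columns=left_a.columns.union(right_a.columns))

# Overwrite values in place wherever right_a has a non-NaN value for the same key.
res.update(right_a)

res = res.reset_index()
print(res)
```

Keys that are missing from right (here a == 4) keep the value from left in c and stay NaN in d.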
Answer by Zero
You can use merge() between left and right with how='left' on the 'a' column.
In [74]: final = left.merge(right, on='a', how='left')
In [75]: final
Out[75]:
   a  b  c_x  c_y    d
0  1  4    9    7   13
1  2  5   10    8   14
2  3  6   11    9   15
3  4  7   12  NaN  NaN
Replace the NaN values in c_y with the corresponding c_x values:
In [76]: final['c'] = final['c_y'].fillna(final['c_x'])
In [77]: final
Out[77]:
   a  b  c_x  c_y    d   c
0  1  4    9    7   13   7
1  2  5   10    8   14   8
2  3  6   11    9   15   9
3  4  7   12  NaN  NaN  12
Drop the unwanted columns, and you have the result:
In [79]: final.drop(['c_x', 'c_y'], axis=1)
Out[79]:
   a  b    d   c
0  1  4   13   7
1  2  5   14   8
2  3  6   15   9
3  4  7  NaN  12
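The three steps above can be combined into one runnable sketch. The frames are rebuilt from the question; the final column reordering and the cast back to the original integer dtype are optional additions, not part of the original answer.

```python
import pandas as pd

# Frames rebuilt from the tables in the question.
left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 5, 6, 7], "c": [9, 10, 11, 12]})
right = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9], "d": [13, 14, 15]})

# Left join on "a"; the overlapping column "c" becomes c_x (left) and c_y (right).
final = left.merge(right, on="a", how="left")

# Prefer the value from right, fall back to the value from left when there is no match.
final["c"] = final["c_y"].fillna(final["c_x"])

# Drop the helper columns and restore the column order from the question.
final = final.drop(columns=["c_x", "c_y"])[["a", "b", "c", "d"]]

# Optional: c has no NaN left, so it can be cast back to left's integer dtype.
final["c"] = final["c"].astype(left["c"].dtype)
print(final)
```

Column d cannot be cast back the same way because it still contains NaN for the unmatched key.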
Answer by starostise
Another way is to use pd.merge like so:
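>>> import pandas as pd
>>> # Note: recent pandas versions may raise MergeError when on= is combined with left_index/right_index.
>>> final = pd.merge(right, left,
...                  how='outer',
...                  left_index=True,
...                  right_index=True,
...                  on=('a', 'c')
...                  ).sort_index(axis=1)
>>> final
   a  b   c     d
0  1  4   7  13.0
1  2  5   8  14.0
2  3  6   9  15.0
3  4  7  12   NaN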
You can compute the intersection of the two DataFrames' column names that you want to update and pass it to the 'on=' parameter of the function.
Unlike Zero's solution, this approach does not create unwanted columns that have to be dropped.
Edit: The NaN value might change integers to floats in the same column.
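The column intersection mentioned in this answer can be computed with Index.intersection. This is only a small sketch of that one step, using frames rebuilt from the question; the names common and update_cols are illustrative.

```python
import pandas as pd

# Frames rebuilt from the tables in the question.
left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 5, 6, 7], "c": [9, 10, 11, 12]})
right = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9], "d": [13, 14, 15]})

# Column names present in both frames.
common = left.columns.intersection(right.columns).tolist()
print(common)  # ['a', 'c']

# The shared columns other than the join key are the ones that would be updated.
update_cols = [col for col in common if col != "a"]
print(update_cols)  # ['c']
```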
Answer by fixxxer
Here's a way to do it with join:
In [632]: t = left.set_index('a').join(right.set_index('a'), rsuffix='_right')
In [633]: t
Out[633]:
   b   c  c_right    d
a
1  4   9        7   13
2  5  10        8   14
3  6  11        9   15
4  7  12      NaN  NaN
Now we want to fill the null values of c_right (which comes from the right dataframe) with the values of the c column from the left dataframe. I updated the process below with a method taken from @John Galt's answer.
In [657]: t['c_right'] = t['c_right'].fillna(t['c'])
In [658]: t
Out[658]:
   b   c  c_right    d
a
1  4   9        7   13
2  5  10        8   14
3  6  11        9   15
4  7  12       12  NaN
In [659]: t.drop('c', axis=1).rename(columns={'c_right': 'c'})
Out[659]:
   b   c    d
a
1  4   7   13
2  5   8   14
3  6   9   15
4  7  12  NaN
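A self-contained sketch of the join approach, written so that the filled values end up in the c column. The frames are rebuilt from the question, and writing the result straight into c is a small variation on the steps shown in this answer.

```python
import pandas as pd

# Frames rebuilt from the tables in the question.
left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 5, 6, 7], "c": [9, 10, 11, 12]})
right = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9], "d": [13, 14, 15]})

# join() aligns on the index, so use "a" as the index on both sides.
t = left.set_index("a").join(right.set_index("a"), rsuffix="_right")

# Take the value from right where the key matched, otherwise keep the value from left.
t["c"] = t["c_right"].fillna(t["c"])

# The helper column is no longer needed.
t = t.drop(columns="c_right")
print(t)
```

Call t.reset_index() afterwards if a should be a regular column again.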
Answer by Ben
DataFrame.update() is nice, but it doesn't let you specify columns to join on and, more importantly, if the other dataframe has NaN values, those NaN values will not overwrite non-NaN values in the original DataFrame. To me, this is undesirable behavior.
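A minimal sketch of the NaN behavior described above, using two small made-up frames (original and other are illustrative names):

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({"c": [9.0, 10.0, 11.0]}, index=[1, 2, 3])
other = pd.DataFrame({"c": [7.0, np.nan, 9.0]}, index=[1, 2, 3])

# update() aligns on the index and only copies non-NaN values from other.
original.update(other)
print(original)
```

The row with index 2 keeps its original value 10.0, because the NaN in other is skipped rather than written.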
Here's a custom method I rolled to fix these issues. It's freshly written, so users beware.
join_insertion()
import numpy as np
import pandas as pd


def join_insertion(into_df, from_df, on, cols, mult='error'):
    """
    Suppose A and B are dataframes. A has columns {foo, bar, baz} and B has columns {foo, baz, buz}
    This function allows you to do an operation like:
    "where A and B match via the column foo, insert the values of baz and buz from B into A"
    Note that this'll update A's values for baz and it'll insert buz as a new column.
    This is a lot like DataFrame.update(), but that method annoyingly ignores NaN values in B!

    :param into_df: dataframe you want to modify
    :param from_df: dataframe with the values you want to insert
    :param on: column name, list of column names (values to join on), or a dict of {into:from} column name pairs
    :param cols: column name or list of column names (values to insert)
    :param mult: if a key of into_df matches multiple rows of from_df, how should this be handled?
        an error can be raised, or the first matching value can be inserted, or the last matching value
        can be inserted
    :return: a modified copy of into_df, with updated values using from_df
    """

    # Infer left_on, right_on
    if isinstance(on, dict):
        left_on = list(on.keys())
        right_on = list(on.values())
    elif isinstance(on, list):
        left_on = on
        right_on = on
    elif isinstance(on, str):
        left_on = [on]
        right_on = [on]
    else:
        raise Exception("on should be a string, list, or dictionary")

    # Make cols a list if it isn't already
    if isinstance(cols, str):
        cols = [cols]

    # Setup
    A = into_df.copy()
    B = from_df[right_on + cols].copy()

    # Insert row ids
    A['_A_RowId_'] = np.arange(A.shape[0])
    B['_B_RowId_'] = np.arange(B.shape[0])

    A = pd.merge(
        left=A,
        right=B,
        how='left',
        left_on=left_on,
        right_on=right_on,
        suffixes=(None, '_y'),
        indicator=True
    ).sort_values(['_A_RowId_', '_B_RowId_'])

    # Check for rows of A which got duplicated by the merge, and then handle appropriately
    # Note: GroupBy.first() and GroupBy.last() skip NaN values within each column
    if mult == 'error':
        if A.groupby('_A_RowId_').size().max() > 1:
            raise Exception("At least one key of into_df matched multiple rows of from_df.")
    elif mult == 'first':
        A = A.groupby('_A_RowId_').first().reset_index()
    elif mult == 'last':
        A = A.groupby('_A_RowId_').last().reset_index()

    # Overwrite the shared columns with the values from from_df wherever the keys matched
    mask = A._merge == 'both'
    cols_in_both = list(set(into_df.columns.to_list()).intersection(set(cols)))
    for col in cols_in_both:
        A.loc[mask, col] = A.loc[mask, col + '_y']

    # Drop unwanted columns
    A.drop(columns=list(set(A.columns).difference(set(into_df.columns.to_list() + cols))), inplace=True)

    return A
Example Use
into_df = pd.DataFrame({
    'foo': [1, 2, 3],
    'bar': [4, 5, 6],
    'baz': [7, 8, 9]
})
   foo  bar  baz
0    1    4    7
1    2    5    8
2    3    6    9

from_df = pd.DataFrame({
    'foo': [1, 3, 5, 7, 3],
    'baz': [70, 80, 90, 30, 40],
    'buz': [0, 1, 2, 3, 4]
})
   foo  baz  buz
0    1   70    0
1    3   80    1
2    5   90    2
3    7   30    3
4    3   40    4

# Use it!
join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='error')
Exception: At least one key of into_df matched multiple rows of from_df.

join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='first')
   foo  bar   baz  buz
0    1    4  70.0  0.0
1    2    5   8.0  NaN
2    3    6  80.0  1.0

join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='last')
   foo  bar   baz  buz
0    1    4  70.0  0.0
1    2    5   8.0  NaN
2    3    6  40.0  4.0
As an aside, this is one of those things I severely miss from R's data.table package. With data.table, this is as easy as x[y, Foo := i.Foo, on = c("a", "b")]

