pandas - Anti-Join Pandas
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me) at StackOverflow.
Original: http://stackoverflow.com/questions/38516664/
Anti-Join Pandas
Asked by Ayelavan
I have two tables and I would like to append them so that all the data in table A is retained and data from table B is only added if its key is unique (key values are unique within table A and within table B; however, in some cases a key will occur in both table A and table B).
I think the way to do this will involve some sort of filtering join (anti-join) to get the values in table B that do not occur in table A, then append the two tables.
I am familiar with R and this is the code I would use to do this in R.
library("dplyr")
## Filtering join to remove values already in "TableA" from "TableB"
FilteredTableB <- anti_join(TableB,TableA, by = "Key")
## Append "FilteredTableB" to "TableA"
CombinedTable <- bind_rows(TableA,FilteredTableB)
How would I achieve this in Python?
Accepted answer by piRSquared
Consider the following dataframes
import numpy as np
import pandas as pd

TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableA
TableB
This is one way to do what you want
Method 1
# Identify what values are in TableB and not in TableA
key_diff = set(TableB.Key).difference(TableA.Key)
where_diff = TableB.Key.isin(key_diff)
# Slice TableB accordingly and append to TableA
TableA.append(TableB[where_diff], ignore_index=True)
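One note if you are on a recent pandas: DataFrame.append was removed in pandas 2.0, so the same slice-and-append step can be written with pd.concat instead, for example:

# Equivalent of Method 1 on pandas >= 2.0, where DataFrame.append no longer exists
key_diff = set(TableB.Key).difference(TableA.Key)
where_diff = TableB.Key.isin(key_diff)
pd.concat([TableA, TableB[where_diff]], ignore_index=True)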
Method 2
rows = []
for i, row in TableB.iterrows():
    if row.Key not in TableA.Key.values:
        rows.append(row)

pd.concat([TableA.T] + rows, axis=1).T
Timing

4 rows with 2 overlapping keys: Method 1 is much quicker.

10,000 rows with 5,000 overlapping keys: loops are bad.
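A rough sketch of how such a comparison could be reproduced (the frame sizes, column names and helper functions below are illustrative, not from the original answer):

import timeit
import numpy as np
import pandas as pd

# Two frames of 10,000 rows whose keys overlap on 5,000 values
n = 10000
TableA = pd.DataFrame({'Key': np.arange(n), 'Value': np.random.rand(n)})
TableB = pd.DataFrame({'Key': np.arange(n // 2, n + n // 2), 'Value': np.random.rand(n)})

def method1():
    # set difference on the keys, then slice and concatenate
    key_diff = set(TableB.Key).difference(TableA.Key)
    where_diff = TableB.Key.isin(key_diff)
    return pd.concat([TableA, TableB[where_diff]], ignore_index=True)

def method2():
    # row-by-row membership test with iterrows (slow)
    rows = [row for _, row in TableB.iterrows()
            if row.Key not in TableA.Key.values]
    return pd.concat([TableA.T] + rows, axis=1).T

print('Method 1:', timeit.timeit(method1, number=10))
print('Method 2:', timeit.timeit(method2, number=1))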
Answered by Dennis Lyubyvy
indicator = True in the merge command creates a new column _merge that tells you where each row came from, with three possible values:

left_only
right_only
both

Keep right_only and left_only. That is it.
outer_join = TableA.merge(TableB, how = 'outer', indicator = True)
anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis = 1)
Easy!
Here is a comparison with the solution from piRSquared:

1) When run on this example, matching on one column, piRSquared's solution is faster.

2) But it only works for matching on one column. If you want to match on several columns, my solution works just as well as it does with one column (a sketch of the multi-column case follows below).

So it's up to you to decide.
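For reference, a minimal sketch of the multi-column case (the frames and the column names 'Key1'/'Key2' here are made up for illustration):

import pandas as pd

TableA = pd.DataFrame({'Key1': ['a', 'b', 'c'], 'Key2': [1, 2, 3], 'Val': [10, 20, 30]})
TableB = pd.DataFrame({'Key1': ['a', 'd', 'c'], 'Key2': [1, 4, 9], 'Val': [11, 40, 90]})

# Outer merge on both key columns; _merge records where each row came from
outer_join = TableA.merge(TableB, how='outer', on=['Key1', 'Key2'], indicator=True)

# Anti-join: keep everything that did not match in both tables
anti_join = outer_join[outer_join._merge != 'both'].drop('_merge', axis=1)
print(anti_join)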
Answered by tommy.carstensen
I had the same problem. This answer using how='outer' and indicator=True of merge inspired me to come up with this solution:
import pandas as pd
import numpy as np

TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
print('TableA', TableA, sep='\n')
print('TableB', TableB, sep='\n')

TableB_only = pd.merge(
    TableA, TableB,
    how='outer', on='Key', indicator=True, suffixes=('_foo', '')).query(
        '_merge == "right_only"')
print('TableB_only', TableB_only, sep='\n')

Table_concatenated = pd.concat((TableA, TableB_only), join='inner')
print('Table_concatenated', Table_concatenated, sep='\n')
Which prints this output:
TableA
Key A B C
0 a 0.035548 0.344711 0.860918
1 b 0.640194 0.212250 0.277359
2 c 0.592234 0.113492 0.037444
3 d 0.112271 0.205245 0.227157
TableB
Key A B C
0 a 0.754538 0.692902 0.537704
1 e 0.499092 0.864145 0.004559
2 c 0.082087 0.682573 0.421654
3 f 0.768914 0.281617 0.924693
TableB_only
Key A_foo B_foo C_foo A B C _merge
4 e NaN NaN NaN 0.499092 0.864145 0.004559 right_only
5 f NaN NaN NaN 0.768914 0.281617 0.924693 right_only
Table_concatenated
Key A B C
0 a 0.035548 0.344711 0.860918
1 b 0.640194 0.212250 0.277359
2 c 0.592234 0.113492 0.037444
3 d 0.112271 0.205245 0.227157
4 e 0.499092 0.864145 0.004559
5 f 0.768914 0.281617 0.924693
Answered by Jamie Marshall
Easiest answer imaginable:
# Add a dummy column of 1s spanning all rows of tableB
tableB = pd.concat([tableB, pd.Series(1, index=tableB.index)], axis=1)
# Left join: rows of tableA with no match in tableB get NaN in that dummy column
mergedTable = tableA.merge(tableB, how="left", on="key")
answer = mergedTable[mergedTable.iloc[:, -1].isnull()][tableA.columns.tolist()]
Should be the fastest proposed as well.
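A quick sketch of how this recipe behaves on two small made-up frames (the column names are illustrative); swap the roles of tableA and tableB to get the direction the question asks for, i.e. the rows of table B whose key is not in table A:

import pandas as pd

tableA = pd.DataFrame({'key': ['a', 'b', 'c', 'd'], 'valA': [1, 2, 3, 4]})
tableB = pd.DataFrame({'key': ['a', 'e', 'c', 'f'], 'valB': [10, 20, 30, 40]})

# Dummy indicator column on tableB, then a left join against tableA
tableB = pd.concat([tableB, pd.Series(1, index=tableB.index)], axis=1)
mergedTable = tableA.merge(tableB, how="left", on="key")

# Rows of tableA whose key never appears in tableB ('b' and 'd' here)
answer = mergedTable[mergedTable.iloc[:, -1].isnull()][tableA.columns.tolist()]
print(answer)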
Answered by Jossie Calderon
You'll have both tables TableA and TableB such that both DataFrame objects have columns with unique values in their respective tables, but some columns may have values that occur simultaneously (have the same values for a row) in both tables.

Then, we want to merge the rows in TableA with the rows in TableB that don't match any in TableA for a 'Key' column. The concept is to picture it as comparing two series of variable length, and combining the rows in one series sA with the other series sB if sB's values don't match sA's. The following code solves this exercise:
import pandas as pd

TableA = pd.DataFrame([[2, 3, 4], [5, 6, 7], [8, 9, 10]])
TableB = pd.DataFrame([[1, 3, 4], [5, 7, 8], [9, 10, 0]])

removeTheseIndexes = []
keyColumnA = TableA.iloc[:, 1]  # your 'Key' column here
keyColumnB = TableB.iloc[:, 1]  # same

for i in range(0, len(keyColumnA)):
    firstValue = keyColumnA[i]
    for j in range(0, len(keyColumnB)):
        copycat = keyColumnB[j]
        if firstValue == copycat:
            removeTheseIndexes.append(j)

TableB.drop(removeTheseIndexes, inplace=True)
TableA = TableA.append(TableB)
TableA = TableA.reset_index(drop=True)
Note this affects TableB's data as well. Alternatively, you can use inplace=False, re-assign the result to a newTable, and then use TableA.append(newTable).
# Table A
0 1 2
0 2 3 4
1 5 6 7
2 8 9 10
# Table B
0 1 2
0 1 3 4
1 5 7 8
2 9 10 0
# Set 'Key' column = 1
# Run the script after the loop
# Table A
0 1 2
0 2 3 4
1 5 6 7
2 8 9 10
3 5 7 8
4 9 10 0
# Table B
0 1 2
1 5 7 8
2 9 10 0
Answered by thrillhouse
Based on one of the other suggestions, here's a function that should do it. It uses only pandas functions, no looping. You can use multiple columns as the key as well. If you change the line output = merged.loc[merged.dummy_col.isna(), tableA.columns.tolist()] to output = merged.loc[~merged.dummy_col.isna(), tableA.columns.tolist()], you have a semi_join.
def anti_join(tableA, tableB, on):
    # if joining on the index, make it into a column
    if tableB.index.name is not None:
        dummy = tableB.reset_index()[on]
    else:
        dummy = tableB[on]

    # create a dummy column of 1s
    if isinstance(dummy, pd.Series):
        dummy = dummy.to_frame()
    dummy.loc[:, 'dummy_col'] = 1

    # preserve the index of tableA if it has one
    if tableA.index.name is not None:
        idx_name = tableA.index.name
        tableA = tableA.reset_index(drop=False)
    else:
        idx_name = None

    # do a left-join
    merged = tableA.merge(dummy, on=on, how='left')

    # keep only the non-matches
    output = merged.loc[merged.dummy_col.isna(), tableA.columns.tolist()]

    # reset the index (if applicable)
    if idx_name is not None:
        output = output.set_index(idx_name)

    return output
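A usage sketch with the TableA/TableB frames from the accepted answer (the example data and the final concat step are added here for illustration; they are not part of the original function):

import numpy as np
import pandas as pd

TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()

# Rows of TableB whose 'Key' does not appear in TableA
FilteredTableB = anti_join(TableB, TableA, on=['Key'])

# Append them to TableA, mirroring the dplyr anti_join + bind_rows from the question
CombinedTable = pd.concat([TableA, FilteredTableB], ignore_index=True)
print(CombinedTable)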