Note: this page is a translation of a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/38516664/

Date: 2020-09-14 01:39:06  Source: igfitidea

Anti-Join Pandas

Tags: python, pandas, dataframe, merge, anti-join

Asked by Ayelavan

I have two tables and I would like to append them so that all the data in table A is retained and data from table B is added only if its key does not already occur in table A. (Key values are unique within table A and within table B, but in some cases a key occurs in both tables.)

I think the way to do this will involve some sort of filtering join (anti-join) to get values in table B that do not occur in table A then append the two tables.

I am familiar with R and this is the code I would use to do this in R.

library("dplyr")

## Filtering join to remove values already in "TableA" from "TableB"
FilteredTableB <- anti_join(TableB,TableA, by = "Key")

## Append "FilteredTableB" to "TableA"
CombinedTable <- bind_rows(TableA,FilteredTableB)

How would I achieve this in Python?

Accepted answer by piRSquared

Consider the following dataframes

import numpy as np
import pandas as pd

# Two 4-row tables of random values; keys 'a' and 'c' occur in both
TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()


[Figure: TableA]

[Figure: TableB]

This is one way to do what you want

Method 1

# Identify what values are in TableB and not in TableA
key_diff = set(TableB.Key).difference(TableA.Key)
where_diff = TableB.Key.isin(key_diff)

# Slice TableB accordingly and append to TableA
# (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
pd.concat([TableA, TableB[where_diff]], ignore_index=True)
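
The same filter can also be written without building an intermediate set, by negating isin directly. This is a minimal sketch rather than part of the original answer; the column values are made up for illustration and replace the random data used above.

```python
import pandas as pd

# Illustrative data (an assumption, not from the answer): keys a and c occur in both tables
TableA = pd.DataFrame({"Key": list("abcd"), "A": [1.0, 2.0, 3.0, 4.0]})
TableB = pd.DataFrame({"Key": list("aecf"), "A": [10.0, 20.0, 30.0, 40.0]})

# Anti-join: rows of TableB whose Key does not occur in TableA
FilteredTableB = TableB[~TableB["Key"].isin(TableA["Key"])]

# Append the filtered rows to TableA
CombinedTable = pd.concat([TableA, FilteredTableB], ignore_index=True)
print(CombinedTable)
```

This mirrors the two steps of the R code in the question: anti_join followed by bind_rows.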

Method 2

rows = []
for i, row in TableB.iterrows():
    if row.Key not in TableA.Key.values:
        rows.append(row)

pd.concat([TableA.T] + rows, axis=1).T


Timing

4 rows with 2 overlapping keys

Method 1 is much quicker.

10,000 rows with 5,000 overlapping keys

Loops are bad.
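
The timing screenshots are not reproduced here, but the comparison can be rerun locally. Below is a rough sketch using the standard-library timeit module. The helper names make_tables, method_1 and method_2 and the table size are assumptions for illustration, and absolute timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_tables(n):
    """Build two n-row tables whose keys overlap on half of the rows (hypothetical helper)."""
    half = n // 2
    a = pd.DataFrame({"Key": np.arange(n), "A": np.random.rand(n)})
    b = pd.DataFrame({"Key": np.arange(half, half + n), "A": np.random.rand(n)})
    return a, b


def method_1(a, b):
    # Vectorised: set difference on the keys, then a boolean mask
    key_diff = set(b.Key).difference(a.Key)
    return pd.concat([a, b[b.Key.isin(key_diff)]], ignore_index=True)


def method_2(a, b):
    # Row-by-row loop over b, kept only for comparison
    rows = [row for _, row in b.iterrows() if row.Key not in a.Key.values]
    return pd.concat([a.T] + rows, axis=1).T


a, b = make_tables(200)
for func in (method_1, method_2):
    seconds = timeit.timeit(lambda: func(a, b), number=5)
    print(f"{func.__name__}: {seconds / 5:.6f} s per call")
```

Raising the argument of make_tables to 10,000 approximates the larger case described above.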

Answer by Dennis Lyubyvy

Setting indicator=True in the merge command tells you which join was applied by creating a new column _merge with three possible values:

  • left_only
  • right_only
  • both

Keep right_only and left_only. That is it.

outer_join = TableA.merge(TableB, how = 'outer', indicator = True)

anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis = 1)
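
For the one-sided anti-join asked about in the question (rows of TableB whose Key is absent from TableA), the same indicator column can be filtered on right_only alone. This is a hedged sketch: it assumes matching on the Key column only, which is passed explicitly through on because merge otherwise joins on every column the two tables share. The data is made up for illustration.

```python
import pandas as pd

# Illustrative data (an assumption): keys a and c occur in both tables
TableA = pd.DataFrame({"Key": list("abcd"), "A": [1, 2, 3, 4]})
TableB = pd.DataFrame({"Key": list("aecf"), "A": [10, 20, 30, 40]})

# Outer-merge only the key column of TableA so that TableB's columns keep their names
outer_join = TableA[["Key"]].merge(TableB, how="outer", on="Key", indicator=True)

# right_only marks rows whose Key exists in TableB but not in TableA
anti_join = outer_join[outer_join["_merge"] == "right_only"].drop(columns="_merge")

CombinedTable = pd.concat([TableA, anti_join], ignore_index=True)
print(CombinedTable)
```

Note that column A comes back as floats here, because the outer merge introduces NaN for keys that exist only in TableA.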


easy!

Here is a comparison with piRSquared's solution:

1) When run on this example, which matches on one column, piRSquared's solution is faster.

2) But it only works for matching on one column. If you want to match on several columns, my solution works just as well as it does with one column.

So it's up to you to decide.
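
To illustrate point 2, here is a small sketch of an anti-join on two key columns using the indicator approach. The column names k1, k2 and val and the data are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"k1": [1, 1, 2], "k2": ["x", "y", "x"], "val": [10, 20, 30]})
right = pd.DataFrame({"k1": [1, 2, 3], "k2": ["x", "y", "x"], "val": [100, 200, 300]})

keys = ["k1", "k2"]

# Left-merge against right's distinct key pairs and record where each row came from
merged = left.merge(right[keys].drop_duplicates(), how="left", on=keys, indicator=True)

# left_only rows have a (k1, k2) pair that never occurs in right
left_anti = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(left_anti)
```

Merging against only the key columns of right avoids suffixed duplicates of val in the result.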

Answer by tommy.carstensen

I had the same problem. This answer using how='outer' and indicator=True of merge inspired me to come up with this solution:

import pandas as pd
import numpy as np

TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()

print('TableA', TableA, sep='\n')
print('TableB', TableB, sep='\n')

TableB_only = pd.merge(
    TableA, TableB,
    how='outer', on='Key', indicator=True, suffixes=('_foo','')).query(
        '_merge == "right_only"')

print('TableB_only', TableB_only, sep='\n')

Table_concatenated = pd.concat((TableA, TableB_only), join='inner')

print('Table_concatenated', Table_concatenated, sep='\n')

Which prints this output:

TableA
  Key         A         B         C
0   a  0.035548  0.344711  0.860918
1   b  0.640194  0.212250  0.277359
2   c  0.592234  0.113492  0.037444
3   d  0.112271  0.205245  0.227157
TableB
  Key         A         B         C
0   a  0.754538  0.692902  0.537704
1   e  0.499092  0.864145  0.004559
2   c  0.082087  0.682573  0.421654
3   f  0.768914  0.281617  0.924693
TableB_only
  Key  A_foo  B_foo  C_foo         A         B         C      _merge
4   e    NaN    NaN    NaN  0.499092  0.864145  0.004559  right_only
5   f    NaN    NaN    NaN  0.768914  0.281617  0.924693  right_only
Table_concatenated
  Key         A         B         C
0   a  0.035548  0.344711  0.860918
1   b  0.640194  0.212250  0.277359
2   c  0.592234  0.113492  0.037444
3   d  0.112271  0.205245  0.227157
4   e  0.499092  0.864145  0.004559
5   f  0.768914  0.281617  0.924693

Answer by Jamie Marshall

Easiest answer imaginable:

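# Add a marker column of 1s to the key column of tableB
marked = tableB[["key"]].assign(marker=1)
mergedTable = tableA.merge(marked, how="left", on="key")

# Rows whose marker is null have no match in tableB
answer = mergedTable[mergedTable.iloc[:, -1].isnull()][tableA.columns.tolist()]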

Should be the fastest proposed as well.

Answer by Jossie Calderon

You'll have two tables, TableA and TableB, such that both DataFrame objects have columns with unique values in their respective tables, but some columns may have values that occur in both tables (the same value for a row).

Then, we want to combine the rows in TableA with the rows in TableB that don't match any row in TableA on a 'Key' column. The idea is to picture it as comparing two series of variable length, and combining the rows in one series sA with those of the other series sB if sB's values don't match sA's. The following code solves this exercise:

import pandas as pd

TableA = pd.DataFrame([[2, 3, 4], [5, 6, 7], [8, 9, 10]])
TableB = pd.DataFrame([[1, 3, 4], [5, 7, 8], [9, 10, 0]])

removeTheseIndexes = []
keyColumnA = TableA.iloc[:,1] # your 'Key' column here
keyColumnB = TableB.iloc[:,1] # same

for i in range(0, len(keyColumnA)):
    firstValue = keyColumnA[i]
    for j in range(0, len(keyColumnB)):
        copycat = keyColumnB[j]
        if firstValue == copycat:
            removeTheseIndexes.append(j)

TableB.drop(removeTheseIndexes, inplace = True)
TableA = pd.concat([TableA, TableB])  # DataFrame.append was removed in pandas 2.0
TableA = TableA.reset_index(drop=True)

Note that this modifies TableB's data as well. Alternatively, you can use inplace=False, assign the result to a newTable, and then append newTable to TableA instead.

# Table A
   0  1   2
0  2  3   4
1  5  6   7
2  8  9  10

# Table B
   0   1  2
0  1   3  4
1  5   7  8
2  9  10  0

# Set 'Key' column = 1
# Run the script after the loop

# Table A
   0   1   2
0  2   3   4
1  5   6   7
2  8   9  10
3  5   7   8
4  9  10   0

# Table B
   0   1  2
1  5   7  8
2  9  10  0
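
The nested loop above can also be expressed with isin, which leaves TableB unmodified. This is a sketch under the same assumption that column 1 plays the role of the 'Key' column; it reuses the example data from this answer.

```python
import pandas as pd

TableA = pd.DataFrame([[2, 3, 4], [5, 6, 7], [8, 9, 10]])
TableB = pd.DataFrame([[1, 3, 4], [5, 7, 8], [9, 10, 0]])

key = 1  # label of the 'Key' column, as in the example above

# Keep only the rows of TableB whose key value does not occur in TableA
newTable = TableB[~TableB[key].isin(TableA[key])]

Combined = pd.concat([TableA, newTable]).reset_index(drop=True)
print(Combined)
```

Because isin is vectorised, this avoids the quadratic comparison of every key in TableA against every key in TableB.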

Answer by thrillhouse

Based on one of the other suggestions, here's a function that should do it. Using only pandas functions, no looping. You can use multiple columns as the key as well. If you change the line output = merged.loc[merged.dummy_col.isna(),tableA.columns.tolist()] to output = merged.loc[~merged.dummy_col.isna(),tableA.columns.tolist()] you have a semi_join.

def anti_join(tableA,tableB,on):

   #if joining on index, make it into a column
   if tableB.index.name is not None:
       dummy = tableB.reset_index()[on]
   else:
       dummy = tableB[on]

   #create a dummy column of 1s (assign returns a copy, so tableB is left untouched)
   if isinstance(dummy, pd.Series):
       dummy = dummy.to_frame()

   dummy = dummy.assign(dummy_col=1)

   #preserve the index of tableA if it has one
   if tableA.index.name is not None:
       idx_name = tableA.index.name
       tableA = tableA.reset_index(drop = False)
   else:
       idx_name = None

   #do a left-join
   merged = tableA.merge(dummy,on=on,how='left')

   #keep only the non-matches
   output = merged.loc[merged.dummy_col.isna(),tableA.columns.tolist()]

   #reset the index (if applicable)
   if idx_name is not None:
       output = output.set_index(idx_name)

   return output
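
A short usage sketch of the anti-join and semi-join idea. To keep the block self-contained it defines a condensed stand-in, filter_join (a hypothetical name), rather than calling the function above; it skips the index handling and assumes the keys are ordinary columns. The data is made up for illustration.

```python
import pandas as pd


def filter_join(tableA, tableB, on, anti=True):
    """Return rows of tableA without (anti=True) or with (anti=False) a match in tableB."""
    on = [on] if isinstance(on, str) else list(on)
    # One row per distinct key in tableB, flagged with a column of 1s
    dummy = tableB[on].drop_duplicates().assign(dummy_col=1)
    merged = tableA.merge(dummy, on=on, how="left")
    mask = merged["dummy_col"].isna()
    return merged.loc[mask if anti else ~mask, tableA.columns.tolist()]


tableA = pd.DataFrame({"Key": list("abcd"), "A": [1, 2, 3, 4]})
tableB = pd.DataFrame({"Key": list("aecf"), "A": [10, 20, 30, 40]})

print(filter_join(tableA, tableB, on="Key"))              # anti-join: keys b and d
print(filter_join(tableA, tableB, on="Key", anti=False))  # semi-join: keys a and c
```

Dropping duplicate keys from tableB before the merge keeps a semi-join from repeating rows of tableA when a key occurs more than once in tableB.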