比较 PandaS DataFrames 并返回第一个缺失的行
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/33349797/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Compare PandaS DataFrames and return rows that are missing from the first one
提问by runski74
I have 2 dataFrames and want to compare them and return rows from the first one (df1) that are not in the second one (df2). I found a way to compare them and return the differences, but can't figure out how to return only missing ones from df1.
我有 2 个数据帧,想比较它们并从第一个(df1)中返回不在第二个(df2)中的行。我找到了一种比较它们并返回差异的方法,但无法弄清楚如何仅从 df1.
import pandas as pd
from pandas import Series, DataFrame
df1 = pd.DataFrame( {
"City" : ["Chicago", "San Franciso", "Boston"] ,
"State" : ["Illinois", "California", "Massachusett"] } )
df2 = pd.DataFrame( {
"City" : ["Chicago", "Mmmmiami", "Dallas" , "Omaha"] ,
"State" : ["Illinois", "Florida", "Texas", "Nebraska"] } )
df = pd.concat([df1, df2])
df = df.reset_index(drop=True)
df_gpby = df.groupby(list(df.columns))
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
blah = df.reindex(idx)
回答by jabellcu
Building on @EdChum's suggestion:
基于@EdChum 的建议:
df = pd.merge(df1, df2, how='outer', suffixes=('','_y'), indicator=True)
rows_in_df1_not_in_df2 = df[df['_merge']=='left_only'][df1.columns]
rows_in_df1_not_in_df2
|Index |City |State |
|------|------------|------------|
|1 |San Franciso|California |
|2 |Boston |Massachusett|
EDIT: incorporate @RobertPeters suggestion
编辑:合并@RobertPeters 的建议
回答by EdChum
IIUC then if you're using pandas version 0.17.0
then you can use merge
and set indicator=True
:
IIUC 那么如果您使用的是Pandas版本,0.17.0
那么您可以使用merge
并设置indicator=True
:
In [80]:
df1 = pd.DataFrame( {
"City" : ["Chicago", "San Franciso", "Boston"] ,
"State" : ["Illinois", "California", "Massachusett"] } )
?
df2 = pd.DataFrame( {
"City" : ["Chicago", "Mmmmiami", "Dallas" , "Omaha"] ,
"State" : ["Illinois", "Florida", "Texas", "Nebraska"] } )
pd.merge(df1,df2, how='outer', indicator=True)
Out[80]:
City State _merge
0 Chicago Illinois both
1 San Franciso California left_only
2 Boston Massachusett left_only
3 Mmmmiami Florida right_only
4 Dallas Texas right_only
5 Omaha Nebraska right_only
This adds a column to indicator whether the rows are only present in either lhs or rhs
这将添加一列以指示行是否仅存在于 lhs 或 rhs
回答by Zero
If you're on pandas < 0.17.0
如果你在Pandas < 0.17.0
You could work your way up like
你可以像
In [182]: df = pd.merge(df1, df2, on='City', how='outer')
In [183]: df
Out[183]:
City State_x State_y
0 Chicago Illinois Illinois
1 San Franciso California NaN
2 Boston Massachusett NaN
3 Mmmmiami NaN Florida
4 Dallas NaN Texas
5 Omaha NaN Nebraska
In [184]: df.ix[df['State_y'].isnull(),:]
Out[184]:
City State_x State_y
1 San Franciso California NaN
2 Boston Massachusett NaN