Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/48693547/

Date: 2020-09-14 05:08:49  Source: igfitidea

Comparing two CSV files and getting the difference using Python

python pandas csv

Asked by vishal

I have two CSV files:

a1.csv

city,state,link
Aguila,Arizona,https://www.glendaleaz.com/planning/documents/AppendixAZONING.pdf
AkChin,Arizona,http://www.maricopa-az.gov/zoningcode/wp-content/uploads/2014/05/Zoning-Code-Rewrite-Public-Review-Draft-3-Tracked-Edits-lowres1.pdf
Aguila,Arizona,http://www.co.apache.az.us/planning-and-zoning-division/zoning-ordinances/

a2.csv

city,state,link
Aguila,Arizona,http://www.co.apache.az.us

I want to get the difference between the two files. Here is the code I tried:

import pandas as pd

a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')

mask = a.isin(b.to_dict(orient='list'))
# Reverse the mask and remove null rows.
# Upside is that index of original rows that
# are now gone are preserved (see result).
c = a[~mask].dropna()
print(c)

Expected Output:

city,state,link
Aguila,Arizona,https://www.glendaleaz.com/planning/documents/AppendixAZONING.pdf
AkChin,Arizona,http://www.maricopa-az.gov/zoningcode/wp-content/uploads/2014/05/Zoning-Code-Rewrite-Public-Review-Draft-3-Tracked-Edits-lowres1.pdf

But I am getting an empty result instead: Empty DataFrame Columns: [city, state, link] Index: []

I want to check based on the first two rows; if they are the same, remove them.

Thanks in Advance.

Accepted answer by shindig_

First, concatenate the two DataFrames, then drop the duplicate rows, keeping the first occurrence of each. Finally, reset the index so that it runs consecutively.

import pandas as pd

a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
c = pd.concat([a,b], axis=0)

c.drop_duplicates(keep='first', inplace=True) # Set keep to False if you don't want any
                                              # of the duplicates at all
c.reset_index(drop=True, inplace=True)
print(c)
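
To see how the `keep` argument changes the result, here is a minimal sketch that uses small in-memory frames instead of the question's files. The `Tempe` row and the `example.com` links are invented for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical stand-ins for a1.csv and a2.csv (data invented for illustration).
a = pd.DataFrame({
    "city": ["Aguila", "AkChin", "Tempe"],
    "state": ["Arizona", "Arizona", "Arizona"],
    "link": ["http://example.com/1", "http://example.com/2", "http://example.com/3"],
})
b = pd.DataFrame({
    "city": ["Tempe"],
    "state": ["Arizona"],
    "link": ["http://example.com/3"],
})

combined = pd.concat([a, b], axis=0)

# keep='first': one copy of each duplicated row survives (a union of the rows).
union = combined.drop_duplicates(keep="first").reset_index(drop=True)

# keep=False: every copy of a duplicated row is dropped, leaving only rows
# that appear in exactly one of the two frames.
only_once = combined.drop_duplicates(keep=False).reset_index(drop=True)

print(union)
print(only_once)
```

Rows are compared across all columns by default, so two rows with the same city and state but different links are not treated as duplicates. `drop_duplicates` also accepts a `subset` argument if the comparison should be limited to chosen columns.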

Answer by TYZ

You can use pandas to read in the two files, concatenate them, and remove all duplicate rows:

import pandas as pd
a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
ab = pd.concat([a,b], axis=0)
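# drop_duplicates returns a new DataFrame; keep=False drops every copy of a duplicated row
diff = ab.drop_duplicates(keep=False)
print(diff)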

Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
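
Concatenating and dropping with `keep=False` also keeps rows that exist only in the second file. If only the rows of the first file that are missing from the second are wanted, a left merge with `indicator=True` is one possible alternative. This is a different technique from the one in the answer above, sketched on invented in-memory data; the `Tempe` and `Yuma` rows and the `example.com` links are assumptions made for the example.

```python
import pandas as pd

# Hypothetical stand-ins for a1.csv and a2.csv (data invented for illustration).
a = pd.DataFrame({
    "city": ["Aguila", "AkChin", "Tempe"],
    "state": ["Arizona", "Arizona", "Arizona"],
    "link": ["http://example.com/1", "http://example.com/2", "http://example.com/3"],
})
b = pd.DataFrame({
    "city": ["Tempe", "Yuma"],
    "state": ["Arizona", "Arizona"],
    "link": ["http://example.com/3", "http://example.com/4"],
})

# With no `on` argument, merge joins on all columns the frames share.
# indicator=True adds a `_merge` column: 'left_only', 'right_only' or 'both'.
merged = a.merge(b, how="left", indicator=True)

# Keep the rows found only in `a`, then drop the helper column.
only_in_a = (
    merged.loc[merged["_merge"] == "left_only"]
    .drop(columns="_merge")
    .reset_index(drop=True)
)
print(only_in_a)
```

Unlike the concatenation approach, the `Yuma` row, which exists only in `b`, does not appear in the result.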
