pandas 用python pandas比较两个csv文件
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/42418180/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Compare two csv files with python pandas
提问by poyim
I have two csv files both consist of two columns.
我有两个 csv 文件都由两列组成。
The first one has the product id, and the second has the serial number.
第一个有产品ID,第二个有序列号。
I need to lookup, all serial numbers from the first csv, and find matches, on the second csv. The result report, will have the serial number that matched, and the corresponding product ids from each csv, in a separate column i trued to modify the below code, no luck.
我需要查找第一个 csv 中的所有序列号,并在第二个 csv 上找到匹配项。结果报告将包含匹配的序列号,以及来自每个 csv 的相应产品 ID,在我修改以下代码的单独列中,没有运气。
How would you approach this?
你会如何处理这个问题?
import pandas as pd
A=set(pd.read_csv("c1.csv", index_col=False, header=None)[0]) #reads the csv, takes only the first column and creates a set out of it.
B=set(pd.read_csv("c2.csv", index_col=False, header=None)[0]) #same here
print(A-B) #set A - set B gives back everything thats only in A.
print(B-A) # same here, other way around.
回答by jezrael
I think you need merge
:
我认为你需要merge
:
A = pd.DataFrame({'product id': [1455,5452,3775],
'serial number':[44,55,66]})
print (A)
B = pd.DataFrame({'product id': [7000,2000,1000],
'serial number':[44,55,77]})
print (B)
print (pd.merge(A, B, on='serial number'))
product id_x serial number product id_y
0 1455 44 7000
1 5452 55 2000
回答by MaxU
Try this:
尝试这个:
A = pd.read_csv("c1.csv", header=None, usecols=[0], names=['col']).drop_duplicates()
B = pd.read_csv("c2.csv", header=None, usecols=[0], names=['col']).drop_duplicates()
# A - B
pd.merge(A, B, on='col', how='left', indicator=True).query("_merge == 'left_only'")
# B - A
pd.merge(A, B, on='col', how='right', indicator=True).query("_merge == 'right_only'")
回答by Shijo
You can convert df into Sets , that will ignore the index while comparing the data, then use set symmetric_difference
您可以将 df 转换为 Sets ,在比较数据时将忽略索引,然后使用set symmetric_difference
ds1 = set([ tuple(values) for values in df1.values.tolist()])
ds2 = set([ tuple(values) for values in df2.values.tolist()])
ds1.symmetric_difference(ds2)
print df1 ,'\n\n'
print df2,'\n\n'
print pd.DataFrame(list(ds1.difference(ds2))),'\n\n'
print pd.DataFrame(list(ds2.difference(ds1))),'\n\n'
df1
df1
id Name score isEnrolled Comment
0 111 Hyman 2.17 True He was late to class
1 112 Nick 1.11 False Graduated
2 113 Zoe 4.12 True NaN
df2
df2
id Name score isEnrolled Comment
0 111 Hyman 2.17 True He was late to class
1 112 Nick 1.21 False Graduated
2 113 Zoe 4.12 False On vacation
Output
输出
0 1 2 3 4
0 113 Zoe 4.12 True NaN
1 112 Nick 1.11 False Graduated
0 1 2 3 4
0 113 Zoe 4.12 False On vacation
1 112 Nick 1.21 False Graduated
回答by Lohith
first_one=pd.read_csv(file_path)
//same way for second_one
// if product_id is the first column then its location would be at '0'
len_=len(first_one)
i=0
while(len_!=0)
{
if(first_one[i]==second_one[i])
{
//it is a match do whatever you want with this matched data
i=i-1;
}
len_=len_-1;
}