pandas 用python pandas比较两个csv文件

Question

提问by poyim

I have two csv files both consist of two columns.

我有两个 csv 文件都由两列组成。

The first one has the product id, and the second has the serial number.

第一个有产品ID，第二个有序列号。

I need to lookup, all serial numbers from the first csv, and find matches, on the second csv. The result report, will have the serial number that matched, and the corresponding product ids from each csv, in a separate column i trued to modify the below code, no luck.

我需要查找第一个 csv 中的所有序列号，并在第二个 csv 上找到匹配项。结果报告将包含匹配的序列号，以及来自每个 csv 的相应产品 ID，在我修改以下代码的单独列中，没有运气。

How would you approach this?

你会如何处理这个问题？

import pandas as pd
    A=set(pd.read_csv("c1.csv", index_col=False, header=None)[0]) #reads the csv, takes only the first column and creates a set out of it.
    B=set(pd.read_csv("c2.csv", index_col=False, header=None)[0]) #same here
    print(A-B) #set A - set B gives back everything thats only in A.
    print(B-A) # same here, other way around.

Answer 1

回答by jezrael

I think you need merge:

我认为你需要merge：

A = pd.DataFrame({'product id':   [1455,5452,3775],
                    'serial number':[44,55,66]})

print (A)

B = pd.DataFrame({'product id':   [7000,2000,1000],
                    'serial number':[44,55,77]})

print (B)

print (pd.merge(A, B, on='serial number'))
   product id_x  serial number  product id_y
0          1455             44          7000
1          5452             55          2000

Answer 2

回答by MaxU

Try this:

尝试这个：

A = pd.read_csv("c1.csv", header=None, usecols=[0], names=['col']).drop_duplicates()
B = pd.read_csv("c2.csv", header=None, usecols=[0], names=['col']).drop_duplicates()
# A - B
pd.merge(A, B, on='col', how='left', indicator=True).query("_merge == 'left_only'")
# B - A
pd.merge(A, B, on='col', how='right', indicator=True).query("_merge == 'right_only'")

Answer 3

回答by Shijo

You can convert df into Sets , that will ignore the index while comparing the data, then use set symmetric_difference

您可以将 df 转换为 Sets ，在比较数据时将忽略索引，然后使用set symmetric_difference

ds1 = set([ tuple(values) for values in df1.values.tolist()])
ds2 = set([ tuple(values) for values in df2.values.tolist()])

ds1.symmetric_difference(ds2)
print df1 ,'\n\n'
print df2,'\n\n'

print pd.DataFrame(list(ds1.difference(ds2))),'\n\n'
print pd.DataFrame(list(ds2.difference(ds1))),'\n\n'

df1

id  Name  score isEnrolled               Comment
0  111  Hyman   2.17       True  He was late to class
1  112  Nick   1.11      False             Graduated
2  113   Zoe   4.12       True                   NaN

df2

    id  Name  score isEnrolled               Comment
0  111  Hyman   2.17       True  He was late to class
1  112  Nick   1.21      False             Graduated
2  113   Zoe   4.12      False           On vacation

Output

输出

     0     1     2      3          4
0  113   Zoe  4.12   True        NaN
1  112  Nick  1.11  False  Graduated 


     0     1     2      3            4
0  113   Zoe  4.12  False  On vacation
1  112  Nick  1.21  False    Graduated

Answer 4

回答by Lohith

first_one=pd.read_csv(file_path)
//same way for second_one
// if product_id is the first column then its location would be at '0'
len_=len(first_one)
i=0
while(len_!=0)
{
if(first_one[i]==second_one[i])
{
//it is a match do whatever you want with this matched data
i=i-1;
}
len_=len_-1;
}

pandas 用python pandas比较两个csv文件

提问by poyim

回答by jezrael

回答by MaxU

回答by Shijo

回答by Lohith

相关推荐

最近更新

标签

pandas 用python pandas比较两个csv文件

提问by poyim

回答by jezrael

回答by MaxU

回答by Shijo

回答by Lohith

相关推荐

Pandas DataFrame 如何查询最近的日期时间索引？

Pandas：创建一个以列列表作为值的字典

pandas 类型错误：无法对具有非 np.nan 值的混合类型进行就地布尔设置

pandas 如何在python中为绘图添加填充？

相关推荐

最近更新

标签