Pandas fuzzy merge/match name column, with duplicates

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow

Original question: http://stackoverflow.com/questions/19964546/
Asked by Wizuriel
I currently have two dataframes, one for donors and one for fundraisers. I'm trying to find out whether any fundraisers also gave donations, and if so, copy some of that information into my fundraisers dataset (donor name, email and their first donation). The problems with my data are:
- I need to match by name and email, but a user might have slightly different names (e.g. 'Kat' and 'Kathy'; a quick score check after this list illustrates this).
- There are duplicate names in both donors and fundraisers:
  - 2a) With donors I can collapse to unique name/email combinations, since I only care about the first donation date.
  - 2b) With fundraisers, though, I need to keep both rows so I don't lose data like the date.
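To see why an exact merge falls short, here is a quick check (my own illustration, not from the original post) of how fuzzywuzzy scores such name variants:

from fuzzywuzzy import fuzz

# Near-identical spellings should score high, unrelated names low (0-100 scale)
print(fuzz.ratio("Kat test", "Kathy test"))   # should land well above the 75 cutoff used later
print(fuzz.ratio("Tes Ester", "John Doe"))    # a much lower score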
Sample code I have right now:
import pandas as pd
import datetime
from fuzzywuzzy import fuzz
import difflib
donors = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Tom Smith","Jane Doe","Jane Doe","Kat test"]), "Email": pd.Series(['[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]']),"Date": (["27/03/2013 10:00:00 AM","1/03/2013 10:39:00 AM","2/03/2013 10:39:00 AM","3/03/2013 10:39:00 AM","4/03/2013 10:39:00 AM","27/03/2013 10:39:00 AM"])})
fundraisers = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Kathy test","Tes Ester", "Jane Doe"]),"Email": pd.Series(['[email protected]','[email protected]','[email protected]','[email protected]','[email protected]']),"Date": pd.Series(["2/03/2013 10:39:00 AM","27/03/2013 11:39:00 AM","3/03/2013 10:39:00 AM","4/03/2013 10:40:00 AM","27/03/2013 10:39:00 AM"])})
donors["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)
fundraisers["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)
donors["code"] = donors.apply(lambda row: str(row['name'])+' '+str(row['Email']), axis=1)
idx = donors.groupby('code')["Date"].transform(min) == donors['Date']
donors = donors[idx].reset_index().drop('index',1)
So this leaves me with the first donation by each donor (assuming anyone with the exact same name and email is the same person).
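For what it's worth, a roughly equivalent way to keep each donor's first donation (a sketch of mine, not the original code, assuming the unfiltered donors frame defined above):

first_donations = (donors.sort_values("Date")
                         .drop_duplicates(subset=["name", "Email"], keep="first")
                         .reset_index(drop=True))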
Ideally I want my fundraisers dataset to look like:
Date Email name Donor Name Donor Email Donor Date
2013-03-27 10:00:00 [email protected] John Doe John Doe [email protected] 2013-03-27 10:00:00
2013-01-03 10:39:00 [email protected] John Doe John Doe [email protected] 2013-03-27 10:00:00
2013-02-03 10:39:00 [email protected] Kathy test Kat test [email protected] 2013-03-27 10:39:00
2013-03-03 10:39:00 [email protected] Tes Ester
2013-04-03 10:39:00 [email protected] Jane Doe Jane Doe [email protected] 2013-04-03 10:39:00
I tried following this thread: is it possible to do fuzzy match merge with python pandas? but I keep getting index out of range errors (my guess is it doesn't like the duplicated names in fundraisers) :( So, any ideas how I can match/merge these datasets?

Doing it with for loops works, but it is super slow and I feel there has to be a better way.
Code:
fundraisers["donor name"] = ""
fundraisers["donor email"] = ""
fundraisers["donor date"] = ""
for donindex in range(len(donors.index)):
max = 75
for funindex in range(len(fundraisers.index)):
aname = donors["name"][donindex]
comp = fundraisers["name"][funindex]
ratio = fuzz.ratio(aname, comp)
if ratio > max:
if (donors["Email"][donindex] == fundraisers["Email"][funindex]):
ratio *= 2
max = ratio
fundraisers["donor name"][funindex] = aname
fundraisers["donor email"][funindex] = donors["Email"][donindex]
fundraisers["donor date"][funindex] = donors["Date"][donindex]
Accepted answer by Roman Pekar
Here's somewhat more pythonic (in my view) code that works on your example, without explicit loops:
def get_donors(row):
    # Score each donor against this fundraiser: double the fuzz ratio when the emails
    # match, otherwise give a score of 1 so the row can never pass the threshold
    d = donors.apply(lambda x: fuzz.ratio(x['name'], row['name']) * 2 if row['Email'] == x['Email'] else 1, axis=1)
    d = d[d >= 75]
    if len(d) == 0:
        v = [''] * 3
    else:
        # .ix was current pandas at the time; in modern pandas use .loc instead
        v = donors.ix[d.idxmax(), ['name', 'Email', 'Date']].values
    return pd.Series(v, index=['donor name', 'donor email', 'donor date'])

pd.concat((fundraisers, fundraisers.apply(get_donors, axis=1)), axis=1)
Output:
Date Email name donor name donor email donor date
0 2013-03-27 10:00:00 [email protected] John Doe John Doe [email protected] 2013-03-01 10:39:00
1 2013-03-01 10:39:00 [email protected] John Doe John Doe [email protected] 2013-03-01 10:39:00
2 2013-03-02 10:39:00 [email protected] Kathy test Kat test [email protected] 2013-03-27 10:39:00
3 2013-03-03 10:39:00 [email protected] Tes Ester
4 2013-03-04 10:39:00 [email protected] Jane Doe Jane Doe [email protected] 2013-03-04 10:39:00
Answer by lostsoul29
I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler].
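For orientation (my own quick check, not part of this answer): jellyfish returns Jaro-Winkler similarity on a 0-1 scale, and shared prefixes are weighted heavily, so name variants score noticeably higher than unrelated names:

import jellyfish

print(jellyfish.jaro_winkler("Kat test", "Kathy test"))   # relatively high
print(jellyfish.jaro_winkler("Tes Ester", "Jane Doe"))    # much lower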
This is how I would do it with Jaro-Winkler from the jellyfish package:
import jellyfish
import pandas

def get_closest_match(x, list_strings):
    # Return the string in list_strings with the highest Jaro-Winkler similarity to x
    best_match = None
    highest_jw = 0
    for current_string in list_strings:
        current_score = jellyfish.jaro_winkler(x, current_string)
        if current_score > highest_jw:
            highest_jw = current_score
            best_match = current_string
    return best_match

df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

# Remap df2's index to the closest-matching labels in df1's index, then join on the index
df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))
df1.join(df2)
Output:
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
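Applied to the question's frames, the same idea could look like the following sketch (my own adaptation, not part of the original answer; it reuses get_closest_match, assumes the donors and fundraisers DataFrames from the question, and the "matched name" column and merged variable are names I made up):

# Map each fundraiser name to the closest donor name, then left-join the donor info onto it.
# Note: without a similarity threshold every fundraiser gets some match, even a poor one.
fundraisers["matched name"] = fundraisers["name"].map(
    lambda n: get_closest_match(n, donors["name"].tolist()))
merged = fundraisers.merge(donors, left_on="matched name", right_on="name",
                           how="left", suffixes=("", " donor"))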
Update: use jaro_winkler from the Levenshtein module for improved performance.
from jellyfish import jaro_winkler as jf_jw
from Levenshtein import jaro_winkler as lv_jw
%timeit jf_jw("appel", "apple")
>> 339 ns ± 1.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit lv_jw("appel", "apple")
>> 193 ns ± 0.675 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Answer by Shulaz Shan
How to identify fuzzy duplicates in a DataFrame using Pandas:
from fuzzywuzzy import fuzz

def get_ratio(row):
    # Compare each row's name against the target string
    name = row['Name_1']
    return fuzz.token_sort_ratio(name, "Ceylon Hotels Corporation")

# Keep the rows whose name is a close fuzzy match (score above 70)
df[df.apply(get_ratio, axis=1) > 70]
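For instance, with a small made-up frame (the df and Name_1 column mirror the answer's assumptions), the filter should keep only the near-duplicate spellings of the target name:

import pandas as pd
from fuzzywuzzy import fuzz

df = pd.DataFrame({"Name_1": ["Ceylon Hotel Corp", "Ceylon Hotels Corporation",
                              "Galle Face Hotel", "ceylon hotels corp ltd"]})

def get_ratio(row):
    return fuzz.token_sort_ratio(row["Name_1"], "Ceylon Hotels Corporation")

# Scores above 70 flag likely fuzzy duplicates of the target name
print(df[df.apply(get_ratio, axis=1) > 70])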

