Pandas fuzzy merge/match name column, with duplicates

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow

Original question: http://stackoverflow.com/questions/19964546/
Asked by Wizuriel
I currently have two dataframes, one for donors and one for fundraisers. I'm trying to find out whether any fundraisers also gave donations, and if so, copy some of that information into my fundraisers dataset (donor name, email and their first donation). The problems with my data are:
- I need to match by name and email, but a user might have slightly different names (e.g. 'Kat' and 'Kathy'; a quick score check after this list illustrates this).
- There are duplicate names in both donors and fundraisers:
  - 2a) With donors I can collapse to unique name/email combinations, since I only care about the first donation date.
  - 2b) With fundraisers, though, I need to keep both rows so I don't lose data like the date.
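To see why an exact merge falls short, here is a quick check (my own illustration, not from the original post) of how fuzzywuzzy scores such name variants:

from fuzzywuzzy import fuzz

# Near-identical spellings should score high, unrelated names low (0-100 scale)
print(fuzz.ratio("Kat test", "Kathy test"))   # should land well above the 75 cutoff used later
print(fuzz.ratio("Tes Ester", "John Doe"))    # a much lower score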
Sample code I have right now:
import pandas as pd
import datetime
from fuzzywuzzy import fuzz
import difflib
donors = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Tom Smith","Jane Doe","Jane Doe","Kat test"]), "Email": pd.Series(['[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]']),"Date": (["27/03/2013 10:00:00 AM","1/03/2013 10:39:00 AM","2/03/2013 10:39:00 AM","3/03/2013 10:39:00 AM","4/03/2013 10:39:00 AM","27/03/2013 10:39:00 AM"])})
fundraisers = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Kathy test","Tes Ester", "Jane Doe"]),"Email": pd.Series(['[email protected]','[email protected]','[email protected]','[email protected]','[email protected]']),"Date": pd.Series(["2/03/2013 10:39:00 AM","27/03/2013 11:39:00 AM","3/03/2013 10:39:00 AM","4/03/2013 10:40:00 AM","27/03/2013 10:39:00 AM"])})
donors["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)
fundraisers["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)
donors["code"] = donors.apply(lambda row: str(row['name'])+' '+str(row['Email']), axis=1)
idx = donors.groupby('code')["Date"].transform(min) == donors['Date']
donors = donors[idx].reset_index().drop('index',1)
So this leaves me with the first donation by each donor (assuming anyone with the exact same name and email is the same person).
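For what it's worth, a roughly equivalent way to keep each donor's first donation (a sketch of mine, not the original code, assuming the unfiltered donors frame defined above):

first_donations = (donors.sort_values("Date")
                         .drop_duplicates(subset=["name", "Email"], keep="first")
                         .reset_index(drop=True))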
Ideally I want my fundraisers dataset to look like:
Date Email name Donor Name Donor Email Donor Date
2013-03-27 10:00:00 [email protected] John Doe John Doe [email protected] 2013-03-27 10:00:00
2013-01-03 10:39:00 [email protected] John Doe John Doe [email protected] 2013-03-27 10:00:00
2013-02-03 10:39:00 [email protected] Kathy test Kat test [email protected] 2013-03-27 10:39:00
2013-03-03 10:39:00 [email protected] Tes Ester
2013-04-03 10:39:00 [email protected] Jane Doe Jane Doe [email protected] 2013-04-03 10:39:00
I tried following this thread: is it possible to do fuzzy match merge with python pandas? but I keep getting index out of range errors (my guess is it doesn't like the duplicated names in fundraisers) :( So, any ideas how I can match/merge these datasets?

Doing it with for loops works, but it is super slow and I feel there has to be a better way.
Code:
fundraisers["donor name"] = ""
fundraisers["donor email"] = ""
fundraisers["donor date"] = ""
for donindex in range(len(donors.index)):
max = 75
for funindex in range(len(fundraisers.index)):
aname = donors["name"][donindex]
comp = fundraisers["name"][funindex]
ratio = fuzz.ratio(aname, comp)
if ratio > max:
if (donors["Email"][donindex] == fundraisers["Email"][funindex]):
ratio *= 2
max = ratio
fundraisers["donor name"][funindex] = aname
fundraisers["donor email"][funindex] = donors["Email"][donindex]
fundraisers["donor date"][funindex] = donors["Date"][donindex]
Accepted answer by Roman Pekar
Here's somewhat more pythonic (in my view) code that works on your example, without explicit loops:
def get_donors(row):
    # Score each donor against this fundraiser: double the fuzz ratio when the emails
    # match, otherwise give a score of 1 so the row can never pass the threshold
    d = donors.apply(lambda x: fuzz.ratio(x['name'], row['name']) * 2 if row['Email'] == x['Email'] else 1, axis=1)
    d = d[d >= 75]
    if len(d) == 0:
        v = [''] * 3
    else:
        # .ix was current pandas at the time; in modern pandas use .loc instead
        v = donors.ix[d.idxmax(), ['name', 'Email', 'Date']].values
    return pd.Series(v, index=['donor name', 'donor email', 'donor date'])

pd.concat((fundraisers, fundraisers.apply(get_donors, axis=1)), axis=1)
Output:
Date Email name donor name donor email donor date
0 2013-03-27 10:00:00 [email protected] John Doe John Doe [email protected] 2013-03-01 10:39:00
1 2013-03-01 10:39:00 [email protected] John Doe John Doe [email protected] 2013-03-01 10:39:00
2 2013-03-02 10:39:00 [email protected] Kathy test Kat test [email protected] 2013-03-27 10:39:00
3 2013-03-03 10:39:00 [email protected] Tes Ester
4 2013-03-04 10:39:00 [email protected] Jane Doe Jane Doe [email protected] 2013-03-04 10:39:00
Answer by lostsoul29
I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler].
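For orientation (my own quick check, not part of this answer): jellyfish returns Jaro-Winkler similarity on a 0-1 scale, and shared prefixes are weighted heavily, so name variants score noticeably higher than unrelated names:

import jellyfish

print(jellyfish.jaro_winkler("Kat test", "Kathy test"))   # relatively high
print(jellyfish.jaro_winkler("Tes Ester", "Jane Doe"))    # much lower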
This is how I would do it with Jaro-Winkler from the jellyfish package:
import jellyfish
import pandas

def get_closest_match(x, list_strings):
    # Return the string in list_strings with the highest Jaro-Winkler similarity to x
    best_match = None
    highest_jw = 0
    for current_string in list_strings:
        current_score = jellyfish.jaro_winkler(x, current_string)
        if current_score > highest_jw:
            highest_jw = current_score
            best_match = current_string
    return best_match

df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

# Remap df2's index to the closest-matching labels in df1's index, then join on the index
df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))
df1.join(df2)
Output:
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
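Applied to the question's frames, the same idea could look like the following sketch (my own adaptation, not part of the original answer; it reuses get_closest_match, assumes the donors and fundraisers DataFrames from the question, and the "matched name" column and merged variable are names I made up):

# Map each fundraiser name to the closest donor name, then left-join the donor info onto it.
# Note: without a similarity threshold every fundraiser gets some match, even a poor one.
fundraisers["matched name"] = fundraisers["name"].map(
    lambda n: get_closest_match(n, donors["name"].tolist()))
merged = fundraisers.merge(donors, left_on="matched name", right_on="name",
                           how="left", suffixes=("", " donor"))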
Update: use jaro_winkler from the Levenshtein module for improved performance.
from jellyfish import jaro_winkler as jf_jw
from Levenshtein import jaro_winkler as lv_jw
%timeit jf_jw("appel", "apple")
>> 339 ns ± 1.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit lv_jw("appel", "apple")
>> 193 ns ± 0.675 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Answer by Shulaz Shan
How to identify fuzzy duplicates in a DataFrame using Pandas:
from fuzzywuzzy import fuzz

def get_ratio(row):
    # Compare each row's name against the target string
    name = row['Name_1']
    return fuzz.token_sort_ratio(name, "Ceylon Hotels Corporation")

# Keep the rows whose name is a close fuzzy match (score above 70)
df[df.apply(get_ratio, axis=1) > 70]
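For instance, with a small made-up frame (the df and Name_1 column mirror the answer's assumptions), the filter should keep only the near-duplicate spellings of the target name:

import pandas as pd
from fuzzywuzzy import fuzz

df = pd.DataFrame({"Name_1": ["Ceylon Hotel Corp", "Ceylon Hotels Corporation",
                              "Galle Face Hotel", "ceylon hotels corp ltd"]})

def get_ratio(row):
    return fuzz.token_sort_ratio(row["Name_1"], "Ceylon Hotels Corporation")

# Scores above 70 flag likely fuzzy duplicates of the target name
print(df[df.apply(get_ratio, axis=1) > 70])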

