Note: this page reproduces a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/38577332/

Time: 2020-09-14 01:40:14  Source: igfitidea

Apply fuzzy matching across a dataframe column and save results in a new column

python pandas fuzzy-search fuzzywuzzy

Asked by Jstuff

I have two data frames, each with a different number of rows. Below are a couple of rows from each data set:

df1 =
     Company                                   City         State  ZIP
     FREDDIE LEES AMERICAN GOURMET SAUCE       St. Louis    MO     63101
     CITYARCHRIVER 2015 FOUNDATION             St. Louis    MO     63102
     GLAXOSMITHKLINE CONSUMER HEALTHCARE       St. Louis    MO     63102
     LACKEY SHEET METAL                        St. Louis    MO     63102

and

df2 = 
     FDA Company                    FDA City    FDA State   FDA ZIP
     LACKEY SHEET METAL             St. Louis   MO          63102
     PRIMUS STERILIZER COMPANY LLC  Great Bend  KS          67530
     HELGET GAS PRODUCTS INC        Omaha       NE          68127
     ORTHOQUEST LLC                 La Vista    NE          68128

I joined them side by side using combined_data = pandas.concat([df1, df2], axis = 1). My next goal is to compare each string under df1['Company'] to each string under df2['FDA Company'] using several different matching commands from the fuzzywuzzy module and return the value of the best match and its name. I want to store that in a new column. For example, if I ran fuzz.ratio and fuzz.token_sort_ratio on LACKEY SHEET METAL in df1['Company'] against df2['FDA Company'], it would return that the best match was LACKEY SHEET METAL with a score of 100, and this would then be saved under a new column in combined_data. The results would look like:

combined_data =
     Company                                   City         State  ZIP      FDA Company                     FDA City    FDA State   FDA ZIP     fuzzy.token_sort_ratio    match    fuzzy.ratio         match
     FREDDIE LEES AMERICAN GOURMET SAUCE       St. Louis    MO     63101    LACKEY SHEET METAL              St. Louis   MO          63102       LACKEY SHEET METAL        100      LACKEY SHEET METAL  100
     CITYARCHRIVER 2015 FOUNDATION             St. Louis    MO     63102    PRIMUS STERILIZER COMPANY LLC   Great Bend  KS          67530
     GLAXOSMITHKLINE CONSUMER HEALTHCARE       St. Louis    MO     63102    HELGET GAS PRODUCTS INC         Omaha       NE          68127
     LACKEY SHEET METAL                        St. Louis    MO     63102    ORTHOQUEST LLC                  La Vista    NE          68128

I tried:

combined_data['name_ratio'] = combined_data.apply(lambda x: fuzz.ratio(x['Company'], x['FDA Company']), axis = 1) 

But I got an error because the columns have different lengths.
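One likely reason for the failure, sketched here on made-up row counts: pandas.concat with axis=1 aligns the frames on their index, so the shorter frame is padded with NaN, and a row-wise apply then passes a float NaN to the scorer instead of a string. A row-wise comparison also only pairs rows that happen to share a position, not every Company with every FDA Company. The two-row df2 below is an assumption for illustration.

```python
import pandas as pd

# Sample rows copied from the question; df2 is deliberately shorter than df1.
df1 = pd.DataFrame({'Company': ['FREDDIE LEES AMERICAN GOURMET SAUCE',
                                'CITYARCHRIVER 2015 FOUNDATION',
                                'GLAXOSMITHKLINE CONSUMER HEALTHCARE',
                                'LACKEY SHEET METAL']})
df2 = pd.DataFrame({'FDA Company': ['LACKEY SHEET METAL',
                                    'PRIMUS STERILIZER COMPANY LLC']})

# Side-by-side concat aligns on the index, so rows 2 and 3 have no FDA Company.
combined = pd.concat([df1, df2], axis=1)

# True marks the padded rows that would reach the scorer as NaN, not as strings.
missing = combined['FDA Company'].isna().tolist()
print(missing)
```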

I am stumped. How can I accomplish this?

Answered by piRSquared

I couldn't tell what you were doing. This is how I would do it.

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

Create a series of tuples to compare:

compare = pd.MultiIndex.from_product([df1['Company'],
                                      df2['FDA Company']]).to_series()
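To see what compare holds, here is a minimal sketch on two tiny stand-in frames (the rows are copied from the question, and only a few are used). Each element of the series is a (Company, FDA Company) tuple, and the index is the matching two-level MultiIndex.

```python
import pandas as pd

# Tiny stand-ins for the question's frames.
df1 = pd.DataFrame({'Company': ['LACKEY SHEET METAL',
                                'CITYARCHRIVER 2015 FOUNDATION']})
df2 = pd.DataFrame({'FDA Company': ['LACKEY SHEET METAL',
                                    'ORTHOQUEST LLC',
                                    'HELGET GAS PRODUCTS INC']})

# Cartesian product of the two columns: one entry per (Company, FDA Company) pair.
compare = pd.MultiIndex.from_product([df1['Company'],
                                      df2['FDA Company']]).to_series()

print(len(compare))     # number of pairs: 2 companies times 3 FDA companies
print(compare.iloc[0])  # the first pair, as a tuple
```

The series grows as len(df1) * len(df2), so for large frames it may be worth filtering candidates first (for example by State or ZIP) before building the product.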

Create a special function to calculate fuzzy metrics and return a series.

def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])
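The two scorers differ mainly in whether word order matters. The sketch below is a rough standard-library approximation of that idea, not fuzzywuzzy's implementation: simple_ratio and simple_token_sort_ratio are hypothetical names, and difflib.SequenceMatcher stands in for the library's string comparison.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Hypothetical stand-in for fuzz.ratio: similarity as an integer 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def simple_token_sort_ratio(a, b):
    """Hypothetical stand-in for fuzz.token_sort_ratio.

    Lowercases, splits on whitespace, sorts the words, then compares,
    so the same words in a different order score as identical.
    """
    def normalize(s):
        return ' '.join(sorted(s.lower().split()))
    return simple_ratio(normalize(a), normalize(b))


reordered = ('SHEET METAL LACKEY', 'LACKEY SHEET METAL')
plain_score = simple_ratio(*reordered)             # penalized for word order
sorted_score = simple_token_sort_ratio(*reordered)  # word order ignored
print(plain_score, sorted_score)
```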

Apply metrics to the compare series:

compare.apply(metrics)

There are a bunch of ways to do this next part:

Get the closest matches among the rows of df1 (the result has one row per FDA Company in df2, holding the best-matching Company):

compare.apply(metrics).unstack().idxmax().unstack(0)
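The unstack/idxmax chain is easier to follow on a small table. In this sketch the scores are hand-made placeholders standing in for compare.apply(metrics), not real fuzzywuzzy output, and the labels A1, A2, B1, B2, B3 are hypothetical company names.

```python
import pandas as pd

# Hand-made scores for every (Company, FDA Company) pair; illustrative only.
index = pd.MultiIndex.from_product([['A1', 'A2'], ['B1', 'B2', 'B3']],
                                   names=['Company', 'FDA Company'])
scores = pd.DataFrame({'ratio': [90, 10, 20, 30, 80, 40],
                       'token': [95, 15, 25, 35, 85, 45]}, index=index)

# unstack() moves the last index level (FDA Company) into the columns.
# idxmax() then picks, for each (metric, FDA Company) column, the Company
# with the highest score, and unstack(0) turns the metrics into columns.
best_company = scores.unstack().idxmax().unstack(0)
print(best_company)
```

The result is indexed by FDA Company, and each cell names the Company that scored highest against it under that metric.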

Get the closest matches among the rows of df2 (the result has one row per Company in df1, holding the best-matching FDA Company):

compare.apply(metrics).unstack(0).idxmax().unstack(0)
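The question also asks for the winning score to be stored next to the match in new columns. One possible way to get there from the same pairwise table is sketched below, assuming company names are unique within each frame (unstack needs a unique index). simple_ratio is a hypothetical standard-library stand-in so the sketch runs without fuzzywuzzy; fuzz.ratio or fuzz.token_sort_ratio could be called in its place. The column names ratio_match and ratio_score are also made up for illustration.

```python
from difflib import SequenceMatcher

import pandas as pd


def simple_ratio(a, b):
    """Hypothetical stand-in for fuzz.ratio: similarity as an integer 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Sample rows copied from the question.
df1 = pd.DataFrame({'Company': ['FREDDIE LEES AMERICAN GOURMET SAUCE',
                                'LACKEY SHEET METAL']})
df2 = pd.DataFrame({'FDA Company': ['LACKEY SHEET METAL',
                                    'PRIMUS STERILIZER COMPANY LLC',
                                    'ORTHOQUEST LLC']})

compare = pd.MultiIndex.from_product([df1['Company'],
                                      df2['FDA Company']]).to_series()

# One score per pair, reshaped into a Company x FDA Company table.
table = compare.apply(lambda tup: simple_ratio(*tup)).unstack()

# Row-wise winner and its score, looked up by company name so that the
# row order of the reshaped table does not matter.
df1['ratio_match'] = df1['Company'].map(table.idxmax(axis=1))
df1['ratio_score'] = df1['Company'].map(table.max(axis=1))
print(df1)
```

Repeating the last three statements with a second scorer would add a second pair of columns, one pair per metric.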
