pandas: fuzzywuzzy fuzzy string matching on 2 large datasets based on a condition - python

Question

Asked by Nirav

I have 2 large data sets that I have read into Pandas DataFrames (~20K rows and ~40K rows respectively). When I try merging these two DFs outright using pandas.merge on the address field, I get a paltry number of matches compared to the number of rows. So I thought I would try fuzzy string matching to see if it improves the number of output matches.

I approached this by trying to create a new column in DF1 (20K rows) holding the result of applying the fuzzywuzzy extractOne function to each DF1[addressline] value against all of DF2[addressline]. I quickly realized that this would take forever, since it would perform close to 1 billion comparisons (roughly 20K x 40K = 800 million).

Both of these datasets have "County" fields, and my question is this: is there a way to do a fuzzy string match on the "addressline" fields in both DFs only where the "county" fields are the same? While researching questions similar to mine I stumbled upon this discussion: Fuzzy logic on big datasets using Python.

However I am still fuzzy (no pun intended) on how to go about grouping/blocking fields based on county. Any advice would be greatly appreciated!

import pandas as pd
from fuzzywuzzy import fuzz, process  # fuzz is needed for fuzz.ratio below

def fuzzy_match(x, choices, scorer, cutoff):
  # extractOne returns None when no choice scores at least `cutoff`
  match = process.extractOne(x, choices=choices, scorer=scorer, score_cutoff=cutoff)
  return match[0] if match else None

test = pd.DataFrame({'Address1':['123 Cheese Way','234 Cookie Place','345 Pizza Drive','456 Pretzel Junction'],'ID':['X','U','X','Y']})
test2 = pd.DataFrame({'Address1':['123 chese wy','234 kookie Pl','345 Pizzza DR','456 Pretzel Junktion'],'ID':['X','U','X','Y']})
# Lowercase both address columns so the comparison is case-insensitive
test['Address1'] = test['Address1'].str.lower()
test2['Address1'] = test2['Address1'].str.lower()
test['FuzzyAddress1'] = test['Address1'].apply(fuzzy_match, args=(test2['Address1'], fuzz.ratio, 80))

I've added 2 images that are sample sets of the 2 different DFs imported into Excel. Not all the fields are included, since they aren't important to my question. To reiterate my end goal: I want a new column in one of the DFs that holds the top result from fuzzy matching its address line against the address lines in the 2nd DF, but only against those rows where the counties match between both DFs. From there I plan to merge the two DFs on the fuzzy-matched address in the 1st DF and the address line column in the 2nd DF. Hopefully this doesn't sound confusing.

Answer 1

Answered by maxymoo

You could adapt your fuzzy_match function to take the id as a variable and use it to subset your choices before doing the fuzzy search (note that this requires applying the function over the whole dataframe rather than just the address column).

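def fuzzy_match(x, choices, scorer, cutoff):
    # x is a whole row; only compare against rows of `choices` with the same ID
    match = process.extractOne(x['Address1'], 
                               choices=choices.loc[choices['ID'] == x['ID'], 
                                                   'Address1'], 
                               scorer=scorer, 
                               score_cutoff=cutoff)
    # match is None when nothing reaches the cutoff; the function then returns None
    if match:
        return match[0]

# axis=1 passes each row (not each column) to fuzzy_match
test['FuzzyAddress1'] = test.apply(fuzzy_match, 
                                   args=(test2, fuzz.ratio, 80), 
                                   axis=1)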
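
For larger frames, filtering the second DataFrame again for every row can itself become slow. One possible variation, shown here only as a sketch, is to build the candidate list for each county once with groupby and then look it up per row, followed by the merge described in the question. The column name County, the sample rows, the cutoff of 70, and the helper names similarity and best_match are all assumptions made for illustration; the scorer uses the standard library's difflib.SequenceMatcher as a stand-in for fuzz.ratio so the snippet runs without fuzzywuzzy installed, and its scores are not guaranteed to equal fuzzywuzzy's.

```python
import difflib
import pandas as pd

def similarity(a, b):
    """Stand-in scorer on a 0-100 scale (assumption: used in place of fuzz.ratio)."""
    return 100 * difflib.SequenceMatcher(None, a, b).ratio()

def best_match(address, candidates, cutoff):
    """Return the candidate most similar to `address`, or None if none reaches `cutoff`."""
    best, best_score = None, cutoff
    for cand in candidates:
        score = similarity(address, cand)
        if score >= best_score:
            best, best_score = cand, score
    return best

# Hypothetical sample data; 'County' is the blocking key.
df1 = pd.DataFrame({
    'Address1': ['123 cheese way', '234 cookie place', '999 nowhere road'],
    'County': ['A', 'B', 'A'],
})
df2 = pd.DataFrame({
    'Address1': ['123 chese wy', '234 kookie pl', '123 cheese way'],
    'County': ['A', 'B', 'B'],
})

# Build each county's candidate list once, instead of filtering df2 per row.
blocks = df2.groupby('County')['Address1'].apply(list).to_dict()

# Each df1 address is compared only with df2 addresses from the same county.
df1['FuzzyAddress1'] = [
    best_match(addr, blocks.get(county, []), cutoff=70)
    for addr, county in zip(df1['Address1'], df1['County'])
]

# Merge on county plus the fuzzy-matched address; unmatched rows keep NaN.
merged = df1.merge(df2, left_on=['County', 'FuzzyAddress1'],
                   right_on=['County', 'Address1'],
                   how='left', suffixes=('', '_df2'))
print(merged)
```

Note that the exact string '123 cheese way' exists in df2 under county B, yet the county A row in df1 is only ever compared with county A candidates. That is the point of blocking: the number of comparisons drops from len(df1) * len(df2) to the sum of the per-county products.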
