How to parallelize many (fuzzy) string comparisons using apply in Pandas?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/37979167/


how to parallelize many (fuzzy) string comparisons using apply in Pandas?

python, pandas, parallel-processing, dask, fuzzywuzzy

Asked by ??????

I have the following problem.

I have a dataframe master that contains sentences, such as

master
Out[8]: 
                  original
0  this is a nice sentence
1      this is another one
2    stackoverflow is nice

For every row in master, I look up the best match in another dataframe slave using fuzzywuzzy. I use fuzzywuzzy because the matched sentences between the two dataframes could differ a bit (extra characters, etc.).

For instance, slave could be

slave
Out[10]: 
   my_value                      name
0         2               hello world
1         1           congratulations
2         2  this is a nice sentence 
3         3       this is another one
4         1     stackoverflow is nice

Here is a fully functional, compact working example :)

from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np
import difflib


master = pd.DataFrame({'original': ['this is a nice sentence',
                                    'this is another one',
                                    'stackoverflow is nice']})

slave = pd.DataFrame({'name': ['hello world',
                               'congratulations',
                               'this is a nice sentence ',
                               'this is another one',
                               'stackoverflow is nice'],
                      'my_value': [2, 1, 2, 3, 1]})

def fuzzy_score(str1, str2):
    return fuzz.token_set_ratio(str1, str2)

def helper(orig_string, slave_df):
    #use fuzzywuzzy to see how close original and name are
    slave_df['score'] = slave_df.name.apply(lambda x: fuzzy_score(x,orig_string))
    #return my_value corresponding to the highest score
    return slave_df.loc[slave_df.score.idxmax(), 'my_value']  # .loc, since .ix is removed in modern pandas

master['my_value'] = master.original.apply(lambda x: helper(x,slave))

The million-dollar question is: can I parallelize my apply code above?

After all, every row in master is compared to all the rows in slave (slave is a small dataset, and I can hold many copies of it in RAM).

I don't see why I couldn't run multiple comparisons (i.e. process multiple rows at the same time).

Problem: I don't know how to do that, or whether it's even possible.

Any help greatly appreciated!

Answered by MRocklin

You can parallelize this with Dask.dataframe.

>>> dmaster = dd.from_pandas(master, npartitions=4)
>>> dmaster['my_value'] = dmaster.original.apply(lambda x: helper(x, slave), name='my_value'))
>>> dmaster.compute()
                  original  my_value
0  this is a nice sentence         2
1      this is another one         3
2    stackoverflow is nice         1

Additionally, you should think about the tradeoffs between using threads vs processes here. Your fuzzy string matching almost certainly doesn't release the GIL, so you won't get any benefit from using multiple threads. However, using processes will cause data to serialize and move around your machine, which might slow things down a bit.

You can experiment with threads, processes, or a distributed system by passing the get= keyword argument to the compute() method.

import dask.multiprocessing
import dask.threaded

>>> dmaster.compute(get=dask.threaded.get)  # this is default for dask.dataframe
>>> dmaster.compute(get=dask.multiprocessing.get)  # try processes instead

Answered by shellcat_zero

I'm working on something similar, and I wanted to provide a more complete working solution for anyone else who might stumble upon this question. @MRocklin unfortunately has some syntax errors in the code snippets provided. I am no expert with Dask, so I can't comment on some performance considerations, but this should accomplish your task just as @MRocklin suggested. This is using Dask version 0.17.2 and Pandas version 0.22.0:

import dask.dataframe as dd
import dask.multiprocessing
import dask.threaded
from fuzzywuzzy import fuzz
import pandas as pd

master = pd.DataFrame({'original': ['this is a nice sentence',
                                    'this is another one',
                                    'stackoverflow is nice']})

slave = pd.DataFrame({'name': ['hello world',
                               'congratulations',
                               'this is a nice sentence ',
                               'this is another one',
                               'stackoverflow is nice'],
                      'my_value': [1, 2, 3, 4, 5]})

def fuzzy_score(str1, str2):
    return fuzz.token_set_ratio(str1, str2)

def helper(orig_string, slave_df):
    slave_df['score'] = slave_df.name.apply(lambda x: fuzzy_score(x,orig_string))
    #return my_value corresponding to the highest score
    return slave_df.loc[slave_df.score.idxmax(),'my_value']

dmaster = dd.from_pandas(master, npartitions=4)
dmaster['my_value'] = dmaster.original.apply(lambda x: helper(x, slave),meta=('x','f8'))

Then, obtain your results (like in this interpreter session):

In [6]: dmaster.compute(get=dask.multiprocessing.get)
Out[6]:
                  original  my_value
0  this is a nice sentence         3
1      this is another one         4
2    stackoverflow is nice         5

Answered by Learning stats by example

These answers are a little bit old. Some newer code:

dmaster = dd.from_pandas(master, npartitions=4)
dmaster['my_value'] = dmaster.original.apply(lambda x: helper(x, slave),meta=('x','f8'))
dmaster.compute(scheduler='processes') 

Personally, I'd ditch that apply call to fuzzy_score in the helper function and just perform the operation there.
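A hedged sketch of that refactor (the name best_my_value is illustrative, and difflib's SequenceMatcher stands in for fuzz.token_set_ratio so the snippet is self-contained; swap fuzzywuzzy back in if it's installed):

```python
import difflib

import pandas as pd

def best_my_value(orig_string, slave_df):
    # score every candidate in a plain comprehension -- no nested .apply,
    # and no mutation of slave_df from inside the parallel workers
    scores = [difflib.SequenceMatcher(None, name, orig_string).ratio()
              for name in slave_df['name']]
    return slave_df['my_value'].iloc[scores.index(max(scores))]

slave = pd.DataFrame({'name': ['hello world',
                               'congratulations',
                               'this is a nice sentence ',
                               'this is another one',
                               'stackoverflow is nice'],
                      'my_value': [1, 2, 3, 4, 5]})

print(best_my_value('this is a nice sentence', slave))  # 3
```

Not assigning a score column to slave also avoids each worker writing into a shared frame, which matters once the function runs inside a process-based scheduler.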

You can alter the scheduler using these tips.