Source: Stack Overflow, http://stackoverflow.com/questions/42892617/. This content is provided under the CC BY-SA 4.0 license: you are free to use and share it, but you must keep the same license and attribute it to the original authors.

Edit distance between two pandas columns

Tags: python, string, pandas, nlp, nltk

Asked by Orest Xherija

I have a pandas DataFrame consisting of two columns of strings. I would like to create a third column containing the Edit Distance of the two columns.

from nltk.metrics import edit_distance    
df['edit'] = edit_distance(df['column1'], df['column2'])

For some reason this seems to enter some sort of infinite loop: it remains unresponsive for quite some time, and then I have to terminate it manually.

Any suggestions are welcome.

Answered by alexis

NLTK's edit_distance function compares a single pair of strings. If you want to compute the edit distance between corresponding pairs of strings, apply it separately to each row's strings like this:

results = df.apply(lambda x: edit_distance(x["column1"], x["column2"]), axis=1)
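
As a minimal sketch of the row-wise call on a tiny DataFrame (the sample strings are made up for illustration; only the column names come from the question), the snippet below also shows what happens when whole sequences are passed in one call: they are compared element by element as if each were a single "string", giving one number for the entire input. Since the work grows with the product of the two lengths, this is presumably why the call in the question appeared to hang on a large DataFrame.

```python
import pandas as pd
from nltk.metrics import edit_distance

# Hypothetical sample data; the column names follow the question.
df = pd.DataFrame({
    "column1": ["kitten", "flaw", "pandas"],
    "column2": ["sitting", "lawn", "pandas"],
})

# One edit_distance call per row, each on a pair of plain strings.
results = df.apply(lambda x: edit_distance(x["column1"], x["column2"]), axis=1)
print(results.tolist())

# Passing whole sequences treats each one as a single sequence of items,
# so the result is one number for the whole input, not one per row.
whole = edit_distance(list(df["column1"]), list(df["column2"]))
print(whole)
```

Here results holds one distance per row, while whole is a single integer counting how many list elements differ after alignment.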

Or like this (probably a little more efficient), to avoid including the irrelevant columns of the dataframe:

results = df.loc[:, ["column1", "column2"]].apply(lambda x: edit_distance(*x), axis=1)

To add the results to your dataframe, you'd use it like this:

df["distance"] = df.loc[:, ["column1","column2"]].apply(lambda x: edit_distance(*x), axis=1)
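
An alternative that is not part of the original answer: a list comprehension over the two columns with zip gives the same per-row distances and often avoids some of the per-row overhead of apply with axis=1. This is a sketch with made-up sample data, and it has not been benchmarked here.

```python
import pandas as pd
from nltk.metrics import edit_distance

# Hypothetical sample data; "other" stands in for any unrelated column.
df = pd.DataFrame({
    "column1": ["kitten", "flaw", "pandas"],
    "column2": ["sitting", "lawn", "pandas"],
    "other": [1, 2, 3],
})

# Pair up the two columns row by row and call edit_distance on each pair.
# The resulting list has one entry per row, in the DataFrame's row order.
df["distance"] = [
    edit_distance(a, b) for a, b in zip(df["column1"], df["column2"])
]
print(df)
```

Because the list is assigned positionally, this works regardless of the DataFrame's index labels.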