pandas: Edit distance between two pandas columns

Question

Asked by Orest Xherija

I have a pandas DataFrame consisting of two columns of strings. I would like to create a third column containing the Edit Distance of the two columns.

from nltk.metrics import edit_distance    
df['edit'] = edit_distance(df['column1'], df['column2'])

For some reason this seems to go to some sort of infinite loop in the sense that it remains unresponsive for quite some time and then I have to terminate it manually.

Any suggestions are welcome.

Answer 1

Answered by alexis

nltk's edit_distance function compares a single pair of strings; it does not work element-wise on whole columns. To compute the edit distance between corresponding pairs of strings, apply it separately to each row's strings, like this:

results = df.apply(lambda x: edit_distance(x["column1"], x["column2"]), axis=1)

Or like this (probably a little more efficient), to avoid including the irrelevant columns of the dataframe:

results = df.loc[:, ["column1", "column2"]].apply(lambda x: edit_distance(*x), axis=1)

To add the results to your dataframe, you'd use it like this:

df["distance"] = df.loc[:, ["column1","column2"]].apply(lambda x: edit_distance(*x), axis=1)
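For reference, here is a minimal end-to-end sketch of the row-wise approach. The sample strings are made up for illustration, and the column names are taken from the question. It also shows a possible alternative that pairs the two columns with zip instead of apply; that variant is not part of the answer above and is only an assumption about what may be convenient.

```python
import pandas as pd
from nltk.metrics import edit_distance

# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({
    "column1": ["kitten", "flaw", "pandas"],
    "column2": ["sitting", "lawn", "pandas"],
})

# Row-wise apply: each row's two strings are unpacked into edit_distance.
df["distance"] = df.loc[:, ["column1", "column2"]].apply(
    lambda x: edit_distance(*x), axis=1
)

# Alternative (an assumption, not from the answer): walk both columns
# in parallel with zip and build the result as a plain list.
df["distance_zip"] = [
    edit_distance(a, b) for a, b in zip(df["column1"], df["column2"])
]

print(df)
```

Both columns should hold the same values, since each calls edit_distance once per pair of strings.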
