Python: How to compare two string variables in pandas?

Question

Asked by ??????

I have two string columns in my Pandas dataset

name1     name2
John Doe  John Doe
AleX T    Franz K

and I need to check whether name1 equals name2. The naive approach I use now is a simple mask

mask=df.name1==df.name2

But the problem is that there may be mislabeled strings (in a way that is not predictable - the data is too big) that prevent an exact matching to occur.

For instance, "John Doe" and "John Doe " would not match. Of course, I have trimmed and lower-cased my strings, but other possibilities remain.

One idea would be to check whether name1 is contained in name2. But it seems I cannot use str.contains with another column as the argument. Any other ideas?

Many thanks!

EDIT: using isin gives nonsensical results. Example:

test = pd.DataFrame({'A': ["john doe", " john doe", 'John'], 'B': [' john doe', 'eddie murphy', 'batman']})

test
Out[6]: 
           A             B
0   john doe      john doe
1   john doe  eddie murphy
2       John        batman

test['A'].isin(test['B'])
Out[7]: 
0    False
1     True
2    False
Name: A, dtype: bool
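
For reference, the result above is what isin is defined to do: it tests whether each value of A occurs anywhere in column B, not whether it matches B on the same row. Row 1 (" john doe") is True because that exact string is the B value on row 0. A minimal sketch of a row-wise containment check, assuming a plain Python `in` test on each pair of values is acceptable (the column name A_in_B is made up for illustration):

```python
import pandas as pd

test = pd.DataFrame({'A': ["john doe", " john doe", 'John'],
                     'B': [' john doe', 'eddie murphy', 'batman']})

# Row-wise substring test: is A[i] contained in B[i]?
test['A_in_B'] = [a in b for a, b in zip(test['A'], test['B'])]
print(test['A_in_B'].tolist())
```

This pairs values by position, so it assumes both columns come from the same DataFrame and contain no missing values.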

Answer 1

Accepted answer by jezrael

I think you can use str.lower and str.replace with the whitespace regex \s+:

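import pandas as pd

test = pd.DataFrame({'A': ["john  doe", " john doe", 'John'],
                     'B': [' john doe', 'eddie murphy', 'batman']})

# Lower-case both columns and remove every run of whitespace before comparing
a = test['A'].str.lower().str.replace(r'\s+', '', regex=True)
b = test['B'].str.lower().str.replace(r'\s+', '', regex=True)
print(a == b)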


0     True
1    False
2    False
dtype: bool

Answer 2

Answer by EdChum

strip the spaces and lower the case:

In [414]:
test['A'].str.strip().str.lower() == test['B'].str.strip().str.lower()

Out[414]:
0     True
1    False
2    False
dtype: bool
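
The same idea can be wrapped in one helper that also covers a few other common sources of near-duplicates. This is only a sketch: the helper name normalize and the choice of steps (Unicode NFKC normalization, casefold, collapsing whitespace runs) are assumptions, not something the question specifies.

```python
import pandas as pd

def normalize(s: pd.Series) -> pd.Series:
    """Hypothetical helper: canonicalise strings before an exact comparison."""
    return (s.str.normalize('NFKC')                 # unify Unicode forms (non-breaking space -> space)
             .str.casefold()                        # aggressive lower-casing
             .str.replace(r'\s+', ' ', regex=True)  # collapse runs of whitespace
             .str.strip())                          # drop leading/trailing whitespace

test = pd.DataFrame({'A': ["john  doe", " john doe", 'John'],
                     'B': [' John\u00a0Doe', 'eddie murphy', 'batman']})
mask = normalize(test['A']) == normalize(test['B'])
print(mask.tolist())
```

Normalizing once and comparing the results keeps the comparison itself an exact match, which stays fast on large data.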

Answer 3

Answer by steboc

You can use difflib to compute a similarity ratio between two strings (1.0 means identical):

import difflib as dfl
dfl.SequenceMatcher(None,'John Doe', 'John doe').ratio()

Edit: integration with pandas:

import pandas as pd
import difflib as dfl
df = pd.DataFrame({'A': ["john doe", " john doe", 'John'], 'B': [' john doe', 'eddie murphy', 'batman']})
df['VAR1'] = df.apply(lambda x : dfl.SequenceMatcher(None, x['A'], x['B']).ratio(),axis=1)
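
To turn the ratio into a match/no-match decision, one option is to trim and lower-case first and then apply a cutoff. The following is a sketch under assumptions: the helper name similarity, the column name match, and the 0.9 threshold are all illustrative and would need tuning on real data.

```python
import difflib
import pandas as pd

df = pd.DataFrame({'A': ["john doe", " john doe", 'John'],
                   'B': [' john doe', 'eddie murphy', 'batman']})

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] after trimming and lower-casing both strings."""
    return difflib.SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

df['VAR1'] = [similarity(a, b) for a, b in zip(df['A'], df['B'])]
THRESHOLD = 0.9  # assumed cutoff; tune it on a labelled sample of your data
df['match'] = df['VAR1'] >= THRESHOLD
print(df[['VAR1', 'match']])
```

SequenceMatcher runs in pure Python on every row, so on very large frames it may be worth applying it only to rows that fail the exact comparison.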

Answer 4

Answer by Mai

What you want is a string distance based on editing effort, distance(s1, s2), commonly called the edit distance between strings. Once you define that function in your namespace you can do:

# axis=1 passes each row to the lambda (the default would pass columns)
df['distance_s'] = df.apply(lambda r: distance(r['name1'], r['name2']), axis=1)
filtered = df[df['distance_s'] < eps]  # you define eps

From a Google search, the following came up:

https://pypi.python.org/pypi/editdistance

It is a dynamic programming problem, so you can challenge yourself by writing your own too. It may not be as efficient though.
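
For anyone who wants to try writing it, here is a minimal sketch of the classic Levenshtein dynamic program. It is one possible definition of distance (insertions, deletions and substitutions each cost 1) and is not the implementation used by the editdistance package.

```python
def distance(s1: str, s2: str) -> int:
    """Levenshtein distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn s1 into s2."""
    # prev[j] is the distance between the s1 prefix processed so far and s2[:j]
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute, or keep if equal
        prev = curr
    return prev[-1]

print(distance("John Doe", "John doe"))
print(distance("kitten", "sitting"))
```

It keeps only two rows of the table, so memory grows with the length of s2 while time grows with len(s1) * len(s2). It can be passed row by row to df.apply with axis=1.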
