Python: how to compare two string variables in pandas?

Notice: this page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/35940880/

Time: 2020-08-19 17:11:11  Source: igfitidea

how to compare two string variables in pandas?

python string pandas

Asked by ??????

I have two string columns in my Pandas dataset

name1     name2
John Doe  John Doe
AleX T    Franz K

and I need to check whether name1 equals name2. The naive way I use now is a simple mask:

mask = df.name1 == df.name2

But the problem is that there may be mislabeled strings (in a way that is not predictable; the data is too big) that prevent an exact match.

For instance, "John Doe" and "John Doe " would not match. Of course, I have trimmed and lower-cased my strings, but other possibilities remain.

One idea would be to check whether name1 is contained in name2. But it seems I cannot use str.contains with another variable as the argument. Any other ideas?

Many thanks!

EDIT: using isin gives nonsensical results. Example:

test = pd.DataFrame({'A': ["john doe", " john doe", 'John'], 'B': [' john doe', 'eddie murphy', 'batman']})

test
Out[6]: 
           A             B
0   john doe      john doe
1   john doe  eddie murphy
2       John        batman

test['A'].isin(test['B'])
Out[7]: 
0    False
1     True
2    False
Name: A, dtype: bool
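
The output above is consistent with what isin does: it tests whether each value of A occurs anywhere in column B, not on the same row, so row 1 is True because ' john doe' appears in B at row 0. A minimal sketch of a row-wise containment check, assuming both columns hold plain strings with no missing values; the column name A_in_B is made up for illustration:

```python
import pandas as pd

test = pd.DataFrame({'A': ["john doe", " john doe", 'John'],
                     'B': [' john doe', 'eddie murphy', 'batman']})

# isin is a membership test against the whole column B, not a row-wise comparison.
whole_column = test['A'].isin(test['B'])

# Row-wise check: is the stripped, lower-cased value of A a substring of B
# on the same row? str.contains takes a single pattern, so zip the columns instead.
a = test['A'].str.strip().str.lower()
b = test['B'].str.strip().str.lower()
test['A_in_B'] = [x in y for x, y in zip(a, b)]  # hypothetical column name

print(test['A_in_B'].tolist())
```

The list comprehension is plain Python, so it is slower than a vectorized comparison, but it works with a different pattern on every row.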

Accepted answer by jezrael

I think you can use str.lower and str.replace with the regex \s+, which matches arbitrary whitespace:

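import pandas as pd

test = pd.DataFrame({'A': ["john  doe", " john doe", 'John'],
                     'B': [' john doe', 'eddie murphy', 'batman']})

# Remove every run of whitespace before comparing; regex=True makes \s+ a pattern
print(test['A'].str.lower().str.replace(r'\s+', '', regex=True) ==
      test['B'].str.strip().str.replace(r'\s+', '', regex=True))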


0     True
1    False
2    False
dtype: bool
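
The same idea can be pushed a little further for other kinds of mislabeling. A minimal sketch, assuming the differences are limited to case, stray whitespace, and Unicode compatibility forms (such as non-breaking spaces); normalize_name is a hypothetical helper name, not a pandas function:

```python
import re
import unicodedata


def normalize_name(s):
    """Return a canonical form of s for comparison (hypothetical helper)."""
    s = unicodedata.normalize('NFKC', s)   # fold compatibility characters, e.g. U+00A0 becomes a space
    s = s.casefold()                       # aggressive lower-casing
    return re.sub(r'\s+', ' ', s).strip()  # collapse whitespace runs, trim both ends


print(normalize_name('  John\u00a0 DOE ') == normalize_name('john doe'))
```

With pandas this could be applied as df['name1'].map(normalize_name) == df['name2'].map(normalize_name), assuming neither column contains missing values.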

Answer by EdChum

strip the spaces and lower the case:

In [414]:
test['A'].str.strip().str.lower() == test['B'].str.strip().str.lower()

Out[414]:
0     True
1    False
2    False
dtype: bool

Answer by steboc

You can use difflib to compute a similarity ratio between the two strings (1.0 means they are identical):

import difflib as dfl
dfl.SequenceMatcher(None,'John Doe', 'John doe').ratio()

Edit: integration with pandas:

import pandas as pd
import difflib as dfl
df = pd.DataFrame({'A': ["john doe", " john doe", 'John'], 'B': [' john doe', 'eddie murphy', 'batman']})
df['VAR1'] = df.apply(lambda x : dfl.SequenceMatcher(None, x['A'], x['B']).ratio(),axis=1)
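
To turn the ratio into a match/no-match decision, one option is to compare it against a cutoff. A minimal sketch, assuming the strings are stripped and lower-cased first; the cutoff of 0.9 and the column names ratio and match are illustrative assumptions, to be tuned on real data:

```python
import difflib as dfl

import pandas as pd

df = pd.DataFrame({'A': ["john doe", " john doe", 'John'],
                   'B': [' john doe', 'eddie murphy', 'batman']})

# Normalize first so that case and surrounding spaces do not lower the ratio.
a = df['A'].str.strip().str.lower()
b = df['B'].str.strip().str.lower()

# ratio() is a similarity in [0, 1]; 1.0 means the two strings are identical.
df['ratio'] = [dfl.SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]

threshold = 0.9  # assumed cutoff, not a recommended value
df['match'] = df['ratio'] >= threshold

print(df[['ratio', 'match']])
```

A lower cutoff accepts more typos but also more false matches, so it is worth inspecting a sample of rows near the threshold.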

Answer by Mai

What you want is a string distance based on editing effort, distance(s1, s2), which is commonly called the edit distance between two strings. Once you define that function in your namespace you can do:

# axis=1 applies the function to each row rather than to each column
df['distance_s'] = df.apply(lambda r: distance(r['name1'], r['name2']), axis=1)
filtered = df[df['distance_s'] < eps]  # you define eps

From a Google search, the following came up:

https://pypi.python.org/pypi/editdistance

It is a dynamic programming problem, so you can challenge yourself by writing your own too. It may not be as efficient though.
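
For reference, a minimal sketch of the dynamic programming approach, assuming the Levenshtein variant of edit distance (unit cost for insertion, deletion, and substitution); edit_distance is an illustrative name and is not the API of the editdistance package:

```python
def edit_distance(s1, s2):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, or substitutions needed to turn s1 into s2."""
    # Row 0 of the table: turning '' into s2[:j] takes j insertions.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into '' takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # delete c1
                               current[j - 1] + 1,       # insert c2
                               previous[j - 1] + cost))  # substitute or keep
        previous = current  # only the last row is needed for the next step
    return previous[-1]


print(edit_distance('john doe', 'jon doe'))  # 1
print(edit_distance('kitten', 'sitting'))    # 3
```

It could stand in for distance in the snippet above, for example df.apply(lambda r: edit_distance(r['name1'], r['name2']), axis=1), at the cost of running a pure-Python double loop for every row.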
