How to compare two string variables in pandas?
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/35940880/
Asked by ??????
I have two string columns in my Pandas dataset
name1       name2
John Doe    John Doe
AleX T      Franz K
and I need to check whether name1 equals name2. The naive way I use now is a simple mask:
mask=df.name1==df.name2
But the problem is that there may be mislabeled strings (in a way that is not predictable, and the data is too big to inspect) that prevent an exact match.
For instance, "John Doe" and "John Doe " would not match. Of course, I trimmed and lower-cased my strings, but other possibilities remain.
One idea would be to check whether name1 is contained in name2. But it seems I cannot use str.contains with another variable as argument. Any other ideas?
Many thanks!
EDIT: using isin gives nonsensical results. Example:
test = pd.DataFrame({'A': ["john doe", " john doe", 'John'], 'B': [' john doe', 'eddie murphy', 'batman']})
test
Out[6]:
           A             B
0   john doe      john doe
1   john doe  eddie murphy
2       John        batman
test['A'].isin(test['B'])
Out[7]:
0 False
1 True
2 False
Name: A, dtype: bool
Accepted answer by jezrael
I think you can use str.lower and str.replace with the arbitrary-whitespace pattern \s+:
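import pandas as pd
test = pd.DataFrame({'A': ["john doe", " john doe", 'John'],
                     'B': [' john doe', 'eddie murphy', 'batman']})
# lower-case both columns and remove all whitespace (regex \s+), then compare
print(test['A'].str.lower().str.replace(r'\s+', '', regex=True) ==
      test['B'].str.lower().str.replace(r'\s+', '', regex=True))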
0 True
1 False
2 False
dtype: bool
Answer by EdChum
strip the spaces and lower the case:
In [414]:
test['A'].str.strip().str.lower() == test['B'].str.strip().str.lower()
Out[414]:
0 True
1 False
2 False
dtype: bool
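The two answers above can be combined into a single normalization step. This is a minimal sketch, assuming a helper named normalize (a hypothetical name, not from the original answers) that strips the ends, lower-cases, and collapses internal runs of whitespace to one space before comparing row by row:

```python
import pandas as pd

def normalize(s: pd.Series) -> pd.Series:
    """Strip, lower-case, and collapse internal whitespace runs to one space."""
    return (s.str.strip()
             .str.lower()
             .str.replace(r'\s+', ' ', regex=True))

# Sample data adapted from the question, with extra spacing and casing noise
test = pd.DataFrame({'A': ["john doe", " john  doe", 'John'],
                     'B': [' John Doe', 'eddie murphy', 'batman']})

# Row-wise comparison of the normalized columns
mask = normalize(test['A']) == normalize(test['B'])
print(mask)
```

Unlike removing all whitespace, collapsing to a single space keeps word boundaries, so "john doe" and "johndoe" would still count as different.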
Answer by steboc
You can use difflib to compute a similarity ratio between 0 and 1 (1 means the strings are identical):
import difflib as dfl
dfl.SequenceMatcher(None,'John Doe', 'John doe').ratio()
Edit: integration with pandas:
import pandas as pd
import difflib as dfl
df = pd.DataFrame({'A': ["john doe", " john doe", 'John'], 'B': [' john doe', 'eddie murphy', 'batman']})
df['VAR1'] = df.apply(lambda x : dfl.SequenceMatcher(None, x['A'], x['B']).ratio(),axis=1)
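To turn the ratio into a match / no-match decision, one option is to normalize the strings first and then threshold the ratio. The sketch below assumes a cutoff of 0.9, which is an arbitrary value chosen for illustration and should be tuned on real data. The names similarity and THRESHOLD are hypothetical.

```python
import difflib
import pandas as pd

df = pd.DataFrame({'A': ["john doe", " john doe", 'John'],
                   'B': [' john doe', 'eddie murphy', 'batman']})

def similarity(a: str, b: str) -> float:
    """SequenceMatcher ratio in [0, 1] after stripping and lower-casing."""
    return difflib.SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

THRESHOLD = 0.9  # assumption: arbitrary cutoff, tune on your own data

# Compute the ratio row by row, then flag rows above the cutoff
df['ratio'] = [similarity(a, b) for a, b in zip(df['A'], df['B'])]
df['match'] = df['ratio'] >= THRESHOLD
print(df)
```

Iterating with zip avoids the overhead of df.apply(..., axis=1), though either form gives the same values.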
Answer by Mai
What you want is a string distance based on editing effort, distance(s1, s2), which is what we call the edit distance of strings. Once you define that function in your namespace you can do:
# axis=1 applies the function to each row rather than each column
df['distance_s'] = df.apply(lambda r: distance(r['name1'], r['name2']), axis=1)
filtered = df[df['distance_s'] < eps] # you define eps
From a Google search, the following came up:
https://pypi.python.org/pypi/editdistance
It is a dynamic programming problem, so you can challenge yourself by writing your own too. It may not be as efficient as the library, though.
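For readers who want to write their own, the dynamic-programming approach mentioned above can be sketched as follows. This is a plain Levenshtein distance (insertions, deletions, and substitutions each cost 1), not the implementation of the editdistance package. The sample frame and the value of eps are assumptions for illustration.

```python
import pandas as pd

def distance(s1: str, s2: str) -> int:
    """Levenshtein edit distance, computed one DP row at a time."""
    prev = list(range(len(s2) + 1))  # distance from '' to each prefix of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to ''
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

# Assumed sample data modelled on the question's columns
df = pd.DataFrame({'name1': ['John Doe', 'AleX T'],
                   'name2': ['John Doe ', 'Franz K']})
df['distance_s'] = df.apply(lambda r: distance(r['name1'], r['name2']), axis=1)

eps = 2  # assumption: keep rows that differ by fewer than 2 edits
filtered = df[df['distance_s'] < eps]
print(filtered)
```

A pure-Python loop like this runs in time proportional to len(s1) * len(s2) per pair, so on large data a compiled library is likely to be much faster.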