Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/39582984/
pandas - Merging on string columns not working (bug?)
Asked by user1496984
I'm trying to do a simple merge between two dataframes. These come from two different SQL tables, where the joining keys are strings:
>>> df1.col1.dtype
dtype('O')
>>> df2.col2.dtype
dtype('O')
I try to merge them using this:
>>> merge_res = pd.merge(df1, df2, left_on='col1', right_on='col2')
The result of the inner join is empty, which first prompted me that there might not be any entries in the intersection:
>>> merge_res.shape
(0, 19)
But when I try to match a single element, I see this really odd behavior.
# Pick random element in second dataframe
>>> df2.iloc[5,:].col2
'95498208100000'
# Manually look for it in the first dataframe
>>> df1[df1.col1 == '95498208100000']
0 rows × 19 columns
# Empty, which makes sense given the above merge result
# Now look for the same value as an integer
>>> df1[df1.col1 == 95498208100000]
1 rows × 19 columns
# FINDS THE ELEMENT!?!
So the columns are defined with the 'object' dtype. Searching for the value as a string yields no results, but searching for it as an integer does return a match, and I think this is why the merge above doesn't work.
Any ideas what's going on?
It's almost as though pandas converts df1.col1 to an integer just because it can, even though it should be treated as a string while matching.
(I tried to replicate this using sample dataframes, but for small examples, I don't see this behavior. Any suggestions on how I can find a more descriptive example would be appreciated as well.)
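For reference, a minimal sketch of the suspected cause (the values here are made up): an object column can silently hold a mix of Python ints and strings, and an int key never equals its string spelling.

```python
import pandas as pd

# Hypothetical reproduction: 'col1' mixes a Python int and a string,
# which still reports dtype('O'), just like an all-string column would.
df1 = pd.DataFrame({'col1': [95498208100000, 'abc']})
df2 = pd.DataFrame({'col2': ['95498208100000', 'abc']})

print(df1.col1.dtype)  # object, despite the mixed contents

# The int 95498208100000 does not match the string '95498208100000',
# so only the 'abc' row survives the inner join.
merged = pd.merge(df1, df2, left_on='col1', right_on='col2')
print(len(merged))
```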
Answered by user1496984
The issue was that the object dtype is misleading. I thought it meant that all items were strings. But apparently, while reading the file, pandas was converting some elements to ints and leaving the rest as strings.
The solution was to make sure that every field is a string:
>>> df1.col1 = df1.col1.astype(str)
>>> df2.col2 = df2.col2.astype(str)
Then the merge works as expected.
然后合并按预期工作。
(I wish there were a way of specifying a dtype of str...)
Answered by seeiespi
I ran into a case where the df.col = df.col.astype(str) solution did not work. It turned out the problem was in the encoding.
My original data looked like this:
In [72]: df1['col1'][:3]
Out[73]:
col1
0 dustin pedroia
1 kevin youkilis
2 david ortiz
In [72]: df2['col2'][:3]
Out[73]:
col2
0 dustin pedroia
1 kevin youkilis
2 david ortiz
And after using .astype(str) the merge still wasn't working, so I executed the following:
df1.col1 = df1.col1.str.encode('utf-8')
df2.col2 = df2.col2.str.encode('utf-8')
and was able to find the difference:
In [95]: df1
Out[95]:
col1
0 b'dustin\xc2\xa0pedroia'
1 b'kevin\xc2\xa0youkilis'
2 b'david\xc2\xa0ortiz'
In [95]: df2
Out[95]:
col2
0 b'dustin pedroia'
1 b'kevin youkilis'
2 b'david ortiz'
At that point all I had to do was run df1.col1 = df1.col1.str.replace('\xa0', ' ') on the decoded df1.col1 column (i.e. before running .str.encode('utf-8')) and the merge worked perfectly.
NOTE: Regardless of what I was replacing, I always used .str.encode('utf-8') to check whether it had worked.
Alternatively
Using regular expressions and the Variable Explorer in the Spyder IDE for Anaconda, I found the following difference.
import re
#places the raw string into a list
df1.col1 = df1.col1.apply(lambda x: re.findall(x, x))
df2.col2 = df2.col2.apply(lambda x: re.findall(x, x))
where my df1 data turned into this (copied and pasted from Spyder):
['dustin\xa0pedroia']
['kevin\xa0youkilis']
['david\xa0ortiz']
which is just a slightly different way of spotting the same problem. I don't know in what case the first approach wouldn't work and the second would, but I wanted to provide both in case someone runs into it :)
Answered by Feras
Thanks @seeiespi, the .str.encode('utf-8') trick helped me figure out that my strings needed to be stripped, as below:
20 b'Belize ' ... 0,612
21 b'Benin ' ... 0,546
The solution was to use str.strip():
df1.col1 = df1.col1.str.strip()
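A small sketch of this whitespace case, with values modeled on the output above: trailing spaces ('Belize ' vs 'Belize') also defeat an exact-match merge until both sides are stripped.

```python
import pandas as pd

# One side carries trailing spaces, as in the b'Belize ' output above.
df1 = pd.DataFrame({'col1': ['Belize ', 'Benin ']})
df2 = pd.DataFrame({'col2': ['Belize', 'Benin']})

# str.strip() removes leading/trailing whitespace from every value.
df1.col1 = df1.col1.str.strip()
merged = pd.merge(df1, df2, left_on='col1', right_on='col2')
print(len(merged))  # both rows now match
```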
Answered by Jan
None of the above solutions worked for me because the merge was actually done correctly but the indexing got messed up. Removing the index solved it for me:
df['sth'] = df.merge(df2, how='left', on=['x', 'y'])['sth'].values
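A sketch of why the .values trick helps, using hypothetical frames with the column names x, y, and sth from the snippet above: merge() returns a result with a fresh 0..n-1 index, so assigning it back by label would misalign rows whenever df's index isn't the default; .values drops the index and assigns positionally.

```python
import pandas as pd

# df has a non-default index, e.g. left over from earlier filtering.
df = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']}, index=[10, 20])
df2 = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b'], 'sth': ['foo', 'bar']})

# The merge result is indexed 0..1, which doesn't line up with [10, 20];
# .values turns it into a plain array so assignment happens by position.
df['sth'] = df.merge(df2, how='left', on=['x', 'y'])['sth'].values
print(df['sth'].tolist())
```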