Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/39582984/
pandas - Merging on string columns not working (bug?)
Asked by user1496984
I'm trying to do a simple merge between two dataframes. These come from two different SQL tables, where the joining keys are strings:
>>> df1.col1.dtype
dtype('O')
>>> df2.col2.dtype
dtype('O')
I try to merge them using this:
>>> merge_res = pd.merge(df1, df2, left_on='col1', right_on='col2')
The result of the inner join is empty, which first prompted me that there might not be any entries in the intersection:
>>> merge_res.shape
(0, 19)
But when I try to match a single element, I see this really odd behavior.
# Pick random element in second dataframe
>>> df2.iloc[5,:].col2
'95498208100000'
# Manually look for it in the first dataframe
>>> df1[df1.col1 == '95498208100000']
0 rows × 19 columns
# Empty, which makes sense given the above merge result
# Now look for the same value as an integer
>>> df1[df1.col1 == 95498208100000]
1 rows × 19 columns
# FINDS THE ELEMENT!?!
So the columns are defined with the 'object' dtype. Searching for the value as a string yields no results, but searching for it as an integer does return a match, and I think this is why the merge above doesn't work.
Any ideas what's going on?
It's almost as though pandas converts df1.col1 to an integer just because it can, even though it should be treated as a string while matching.
(I tried to replicate this using sample dataframes, but for small examples, I don't see this behavior. Any suggestions on how I can find a more descriptive example would be appreciated as well.)
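For reference, a minimal sketch of the suspected cause (the values here are made up): an object column can silently hold a mix of Python ints and strings, and an int key never equals its string spelling.

```python
import pandas as pd

# Hypothetical reproduction: 'col1' mixes a Python int and a string,
# which still reports dtype('O'), just like an all-string column would.
df1 = pd.DataFrame({'col1': [95498208100000, 'abc']})
df2 = pd.DataFrame({'col2': ['95498208100000', 'abc']})

print(df1.col1.dtype)  # object, despite the mixed contents

# The int 95498208100000 does not match the string '95498208100000',
# so only the 'abc' row survives the inner join.
merged = pd.merge(df1, df2, left_on='col1', right_on='col2')
print(len(merged))
```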
Answered by user1496984
The issue was that the object dtype is misleading. I thought it meant that all items were strings. But apparently, while reading the file, pandas was converting some elements to ints and leaving the rest as strings.
The solution was to make sure that every field is a string:
>>> df1.col1 = df1.col1.astype(str)
>>> df2.col2 = df2.col2.astype(str)
Then the merge works as expected.
然后合并按预期工作。
(I wish there were a way of specifying a dtype of str...)
Answered by seeiespi
I ran into a case where the df.col = df.col.astype(str) solution did not work. It turned out the problem was in the encoding.
My original data looked like this:
In [72]: df1['col1'][:3]
Out[73]:
col1
0 dustin pedroia
1 kevin youkilis
2 david ortiz
In [72]: df2['col2'][:3]
Out[73]:
col2
0 dustin pedroia
1 kevin youkilis
2 david ortiz
And after using .astype(str) the merge still wasn't working, so I executed the following:
df1.col1 = df1.col1.str.encode('utf-8')
df2.col2 = df2.col2.str.encode('utf-8')
and was able to find the difference:
In [95]: df1
Out[95]:
col1
0 b'dustin\xc2\xa0pedroia'
1 b'kevin\xc2\xa0youkilis'
2 b'david\xc2\xa0ortiz'
In [95]: df2
Out[95]:
col2
0 b'dustin pedroia'
1 b'kevin youkilis'
2 b'david ortiz'
At that point all I had to do was run df1.col1 = df1.col1.str.replace('\xa0', ' ') on the decoded df1.col1 column (i.e. before running .str.encode('utf-8')) and the merge worked perfectly.
NOTE: Regardless of what I was replacing, I always used .str.encode('utf-8') to check whether it had worked.
Alternatively
Using regular expressions and the Variable Explorer in the Spyder IDE for Anaconda, I found the following difference.
import re
#places the raw string into a list
df1.col1 = df1.col1.apply(lambda x: re.findall(x, x))
df2.col2 = df2.col2.apply(lambda x: re.findall(x, x))
where my df1 data turned into this (copied and pasted from Spyder):
['dustin\xa0pedroia']
['kevin\xa0youkilis']
['david\xa0ortiz']
which is just a slightly different way of spotting the same problem. I don't know in what case the first approach wouldn't work and the second would, but I wanted to provide both in case someone runs into it :)
Answered by Feras
Thanks @seeiespi, the .str.encode('utf-8') trick helped me figure out that my strings needed to be stripped, as below:
20 b'Belize ' ... 0,612
21 b'Benin ' ... 0,546
The solution was to use str.strip():
df1.col1 = df1.col1.str.strip()
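A small sketch of this whitespace case, with values modeled on the output above: trailing spaces ('Belize ' vs 'Belize') also defeat an exact-match merge until both sides are stripped.

```python
import pandas as pd

# One side carries trailing spaces, as in the b'Belize ' output above.
df1 = pd.DataFrame({'col1': ['Belize ', 'Benin ']})
df2 = pd.DataFrame({'col2': ['Belize', 'Benin']})

# str.strip() removes leading/trailing whitespace from every value.
df1.col1 = df1.col1.str.strip()
merged = pd.merge(df1, df2, left_on='col1', right_on='col2')
print(len(merged))  # both rows now match
```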
Answered by Jan
None of the above solutions worked for me because the merge was actually done correctly but the indexing got messed up. Removing the index solved it for me:
df['sth'] = df.merge(df2, how='left', on=['x', 'y'])['sth'].values
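A sketch of why the .values trick helps, using hypothetical frames with the column names x, y, and sth from the snippet above: merge() returns a result with a fresh 0..n-1 index, so assigning it back by label would misalign rows whenever df's index isn't the default; .values drops the index and assigns positionally.

```python
import pandas as pd

# df has a non-default index, e.g. left over from earlier filtering.
df = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']}, index=[10, 20])
df2 = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b'], 'sth': ['foo', 'bar']})

# The merge result is indexed 0..1, which doesn't line up with [10, 20];
# .values turns it into a plain array so assignment happens by position.
df['sth'] = df.merge(df2, how='left', on=['x', 'y'])['sth'].values
print(df['sth'].tolist())
```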