pandas: Set Union in pandas

Notice: this page is a Chinese-English bilingual translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/38428108/
Time: 2020-09-14 01:36:19  Source: igfitidea

Set Union in pandas

Tags: python, python-3.x, numpy, pandas, vectorization

Asked by cppgnlearner

I have two columns in my DataFrame that store sets.

I want to perform a set union on the two columns using a fast vectorized operation:

df['union'] = df.set1 | df.set2

but the error TypeError: unsupported operand type(s) for |: 'set' and 'bool' is preventing me from doing so, as I have np.nan values in both columns.

Is there a good solution to overcome this?
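
One way to get past the np.nan entries is to normalize each cell to a set before taking the union. The following is a minimal sketch, not a definitive fix: the helper name as_set is hypothetical, the sample data is made up for illustration, and treating a missing value as an empty set is an assumption that may not suit every dataset.

```python
import numpy as np
import pandas as pd


def as_set(value):
    """Hypothetical helper: keep real sets, map anything else (e.g. np.nan) to an empty set."""
    return value if isinstance(value, (set, frozenset)) else set()


# Illustrative data with np.nan in both columns, as described in the question.
df = pd.DataFrame({
    "set1": [{1, 2}, np.nan, {3}],
    "set2": [{2, 3}, {4}, np.nan],
})

# Plain-Python loop over the two columns; NaN is treated as an empty set (assumption).
df["union"] = [as_set(a) | as_set(b) for a, b in zip(df["set1"], df["set2"])]
```

If missing values should instead stay missing in the result, the helper would need to propagate np.nan rather than substitute an empty set.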

Accepted answer by ayhan

For these operations, pure Python may be more efficient.

%timeit pd.Series([set1.union(set2) for set1, set2 in zip(df['A'], df['B'])])
10 loops, best of 3: 43.3 ms per loop

%timeit df.apply(lambda x: x.A.union(x.B), axis=1)
1 loop, best of 3: 2.6 s per loop
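
The %timeit lines above need IPython. A rough way to repeat the comparison in a plain script is sketched below, using the standard timeit module on a smaller frame. The function names union_comprehension and union_apply, the seed, and the frame size of 1000 rows are all assumptions chosen for illustration; absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
letters = list("abcdefg")


def random_sets(n):
    """Build n random sets of 1 to 4 draws from the letters above."""
    return [set(rng.choice(letters, rng.integers(1, 5))) for _ in range(n)]


df = pd.DataFrame({"A": random_sets(1000), "B": random_sets(1000)})


def union_comprehension(frame):
    # Pure-Python loop over the two columns.
    return pd.Series([a.union(b) for a, b in zip(frame["A"], frame["B"])],
                     index=frame.index)


def union_apply(frame):
    # Row-wise apply: builds a Series object for every row.
    return frame.apply(lambda row: row["A"].union(row["B"]), axis=1)


t_comp = timeit.timeit(lambda: union_comprehension(df), number=5)
t_apply = timeit.timeit(lambda: union_apply(df), number=5)
print(f"list comprehension: {t_comp:.4f} s, apply: {t_apply:.4f} s")
```

Both functions should return the same sets row by row, so the comparison only measures the overhead of each approach.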

If we could use +, it would probably take about half the time, as the timings for set difference with - suggest (subclassing set to support it may not be worth it):

%timeit df['A'] - df['B']
10 loops, best of 3: 22.1 ms per loop

%timeit pd.Series([set1.difference(set2) for set1, set2 in zip(df['A'], df['B'])])
10 loops, best of 3: 35.7 ms per loop
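
The inheritance idea mentioned above could look roughly like the following sketch. The class name AddableSet is hypothetical and not part of pandas or the original answer; it only shows that an object-dtype Series forwards + to each element's __add__. NaN entries would still need separate handling, and every set in the columns would have to be converted to the subclass first.

```python
import pandas as pd


class AddableSet(set):
    """Hypothetical set subclass whose + operator means union."""

    def __add__(self, other):
        # set.__or__ returns a plain set, so wrap the result again.
        return AddableSet(self | other)


left = pd.Series([AddableSet({1, 2}), AddableSet({3})])
right = pd.Series([AddableSet({2, 3}), AddableSet({4})])

# Object-dtype arithmetic calls AddableSet.__add__ element by element.
combined = left + right
```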


DataFrame for timings:

import pandas as pd
import numpy as np
l1 = [set(np.random.choice(list('abcdefg'), np.random.randint(1, 5))) for _ in range(100000)]
l2 = [set(np.random.choice(list('abcdefg'), np.random.randint(1, 5))) for _ in range(100000)]

df = pd.DataFrame({'A': l1, 'B': l2})

Answer by piRSquared

This is the best I could come up with:

# method 1
df.apply(lambda x: x.set1.union(x.set2), axis=1)

# method 2
df.applymap(list).sum(1).apply(set)

Wow!

I expected method 2 to be quicker. Not so!
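
To see what method 2 does to a single row, here is a plain-Python sketch of the same three steps (set to list, list concatenation, list back to set). The variable names are made up for illustration. It does not prove where the time goes, but it shows the intermediate lists that method 2 builds and method 1 does not.

```python
# One row of the frame: the two sets to be combined.
row = [{1, 2, 3}, {3, 4, 5}]

# Step 1 (applymap(list)): turn every set into a list.
as_lists = [list(s) for s in row]

# Step 2 (sum(1)): summing lists concatenates them, duplicates included.
concatenated = sum(as_lists, [])

# Step 3 (apply(set)): converting back to a set removes the duplicates.
union = set(concatenated)

# Same result as a direct set union.
assert union == row[0] | row[1]
```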

Example

df = pd.DataFrame([[{1, 2, 3}, {3, 4, 5}] for _ in range(3)],
                  columns=list('AB'))
df

# The example frame names its columns A and B, so use those here.
df.apply(lambda x: x.A.union(x.B), axis=1)

0    {1, 2, 3, 4, 5}
1    {1, 2, 3, 4, 5}
2    {1, 2, 3, 4, 5}