A faster alternative to the Pandas `isin` function

Question

Asked by user3576212

I have a very large data frame `df` that looks like:

ID       Value1    Value2
1345      3.2      332
1355      2.2      32
2346      1.0      11
3456      8.9      322

I also have a list, `ID_list`, that contains a subset of the IDs. I need the subset of `df` whose `ID` is contained in `ID_list`.

Currently, I am using `df_sub = df[df.ID.isin(ID_list)]` to do it, but it takes a lot of time. The IDs in `ID_list` don't follow any pattern, so they don't fall within a certain range. (I also need to apply the same operation to many similar dataframes.) I was wondering if there is any faster way to do this. Would it help a lot to make `ID` the index?

Thanks!

Answer 1

Answered by chancyk

EDIT 2: Here's a link to a more recent look into the performance of various `pandas` operations, though to date it doesn't seem to include merge and join.

https://github.com/mm-mansour/Fast-Pandas

EDIT 1: These benchmarks were run on a quite old version of pandas and are likely no longer relevant. See Mike's comment below on `merge`.
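
The comment on `merge` mentioned in the edit above is not reproduced here. As a rough sketch only, a merge-based filter on current pandas might look like the following. The toy data, the `ids` helper frame, and the extra ID 9999 are assumptions made for illustration, and no timing claim is implied.

```python
import pandas as pd

# Toy frame modelled on the question; the values are illustrative only.
df = pd.DataFrame({
    "ID": [1345, 1355, 2346, 3456],
    "Value1": [3.2, 2.2, 1.0, 8.9],
    "Value2": [332, 32, 11, 322],
})
ID_list = [1355, 3456, 3456, 9999]  # hypothetical; contains a repeat and a missing ID

# Deduplicate first: a repeated ID on the right side would repeat rows in the result.
ids = pd.DataFrame({"ID": ID_list}).drop_duplicates()

# An inner merge keeps only the rows of df whose ID also appears in ids.
df_sub = df.merge(ids, on="ID", how="inner")
print(df_sub)
```

Whether this beats `isin` depends on the pandas version and the data, so it is worth timing on your own frames.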

It depends on the size of your data, but for large datasets `DataFrame.join` seems to be the way to go. This requires your DataFrame index to be your 'ID', and the Series or DataFrame you're joining against to have an index that is your 'ID_list'. The Series must also have a `name` to be used with `join`; that name becomes the label of the new column it is pulled in as. You also need to specify an inner join to get something like `isin`, because `join` defaults to a left join. The `query` `in` syntax seems to have the same speed characteristics as `isin` for large datasets.
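
To make the requirements above concrete, here is a minimal sketch of the join approach on a toy frame. The data, the `lookup` Series, and its name `in_list` are assumptions made for illustration; they are not part of the original benchmark.

```python
import pandas as pd

# Toy frame modelled on the question; the values are illustrative only.
df = pd.DataFrame({
    "ID": [1345, 1355, 2346, 3456],
    "Value1": [3.2, 2.2, 1.0, 8.9],
    "Value2": [332, 32, 11, 322],
})
ID_list = [1355, 3456, 9999]  # hypothetical IDs; 9999 is absent from df

# join matches on the index, so the IDs must be the index on both sides.
indexed = df.set_index("ID")

# The Series needs a name; that name becomes an extra column in the result.
# Its values are irrelevant here, only its index is used for matching.
lookup = pd.Series(True, index=pd.Index(ID_list, name="ID"), name="in_list")

# how="inner" keeps only IDs present on both sides (the default is a left join).
df_sub = indexed.join(lookup, how="inner").drop(columns="in_list").reset_index()
print(df_sub)
```

If `ID_list` can contain repeated values, deduplicate it before building `lookup`, since a join repeats rows for repeated keys while `isin` does not.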

If you're working with small datasets, the behavior is different: it actually becomes faster to use a list comprehension or `apply` against a dictionary than to use `isin`.
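
For the small-data case, the idea can be sketched as follows. This version uses a `set` where the benchmark used a dictionary; both give hash-based membership checks. The frame and `ID_list` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4]})
ID_list = list(range(2, 10000))  # hypothetical lookup values

# A set (or dict) gives a hash-based membership check per element.
id_set = set(ID_list)

# Boolean mask built with a list comprehension.
mask = [x in id_set for x in df["ID"]]
df_sub = df[mask]

# Equivalent mask built with Series.apply.
df_sub_apply = df[df["ID"].apply(lambda x: x in id_set)]
print(df_sub)
```

Both variants loop in Python, so they only make sense while the frame itself is small.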

Otherwise, you can try to get more speed with Cython.

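import pandas as pd

# I'm ignoring that the index is defaulting to a sequential number. You
# would need to explicitly assign your IDs to the index here, e.g.:
# >>> l_series.index = ID_list
mil = list(range(1000000))  # a real list, as range() returned under Python 2
l = mil
l_series = pd.Series(l)

df = pd.DataFrame(l_series, columns=['ID'])
# DataFrame.join needs a named Series; the name becomes the new column.
l_series.name = 'ID_list'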


In [247]: %timeit df[df.index.isin(l)]
1 loops, best of 3: 1.12 s per loop

In [248]: %timeit df[df.index.isin(l_series)]
1 loops, best of 3: 549 ms per loop

# index vs column doesn't make a difference here
In [304]: %timeit df[df.ID.isin(l_series)]
1 loops, best of 3: 541 ms per loop

In [305]: %timeit df[df.index.isin(l_series)]
1 loops, best of 3: 529 ms per loop

# query 'in' syntax has the same performance as 'isin'
In [249]: %timeit df.query('index in @l')
1 loops, best of 3: 1.14 s per loop

In [250]: %timeit df.query('index in @l_series')
1 loops, best of 3: 564 ms per loop

# ID must be the index for DataFrame.join and l_series must have a name.
# join defaults to a left join so we need to specify inner for existence.
In [251]: %timeit df.join(l_series, how='inner')
10 loops, best of 3: 93.3 ms per loop

# Smaller datasets.
df = pd.DataFrame([1,2,3,4], columns=['ID'])
l = range(10000)
l_dict = dict(zip(l, l))
l_series = pd.Series(l)
l_series.name = 'ID_list'


In [363]: %timeit df.join(l_series, how='inner')
1000 loops, best of 3: 733 μs per loop

In [291]: %timeit df[df.ID.isin(l_dict)]
1000 loops, best of 3: 742 μs per loop

In [292]: %timeit df[df.ID.isin(l)]
1000 loops, best of 3: 771 μs per loop

In [294]: %timeit df[df.ID.isin(l_series)]
100 loops, best of 3: 2 ms per loop

# It's actually faster to use apply or a list comprehension for these small cases.
In [296]: %timeit df[[x in l_dict for x in df.ID]]
1000 loops, best of 3: 203 μs per loop

In [299]: %timeit df[df.ID.apply(lambda x: x in l_dict)]
1000 loops, best of 3: 297 μs per loop
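
Since the timings above come from an old pandas release, it may be worth re-measuring on your own installation. The following is a minimal sketch using the standard `timeit` module outside IPython; the size `n`, the `candidates` dictionary, and the labels are assumptions chosen for illustration.

```python
import timeit

import pandas as pd

n = 100_000  # smaller than the original run so the script finishes quickly
ids = list(range(n))
df = pd.DataFrame({"ID": ids})
lookup = pd.Series(ids, name="ID_list")  # named, so it can be used with join

# Each candidate returns the filtered frame; all three keep every row here.
candidates = {
    "isin(list)": lambda: df[df["ID"].isin(ids)],
    "isin(Series)": lambda: df[df["ID"].isin(lookup)],
    "join(inner)": lambda: df.join(lookup, how="inner"),
}

timings = {}
for label, fn in candidates.items():
    # Best of 3 runs, one call per run, similar in spirit to %timeit.
    timings[label] = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{label:>14}: {timings[label] * 1e3:.1f} ms")
```

As in the original session, the join here matches on the default integer index rather than on real IDs, so treat the numbers as a rough comparison only.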
