Idiomatic way to do list/dict in Cython?

Question

Asked by ramanujan

My problem: I've found that processing large data sets with raw C++ using the STL map and vector can often be considerably faster (and with lower memory footprint) than using Cython.

I figure that part of this speed penalty is due to using Python lists and dicts, and that there might be some tricks to use less encumbered data structures in Cython. For example, this page (http://wiki.cython.org/tutorials/numpy) shows how to make numpy arrays very fast in Cython by predefining the size and types of the ND array.

Question: Is there any way to do something similar with lists/dicts, e.g. by stating roughly how many elements or (key,value) pairs you expect to have in them? That is, is there an idiomatic way to convert lists/dicts to (fast) data structures in Cython?

If not I guess I'll just have to write it in C++ and wrap in a Cython import.

Answer 1

Accepted answer by Sam Hartsfield

Cython now has template support, and comes with declarations for some of the STL containers.

See http://docs.cython.org/src/userguide/wrapping_CPlusPlus.html#standard-library

Here's the example they give:

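# distutils: language = c++
from libcpp.vector cimport vector

cdef vector[int] vect  # C++ std::vector<int>; elements are plain C ints
cdef int i
for i in range(10):
    vect.push_back(i)
for i in range(10):
    print(vect[i])  # each element becomes a Python object only when printed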

Answer 2

Answer by Mike Graham

Doing similar operations in Python as in C++ can often be slower. list and dict are actually implemented very well, but you incur a lot of overhead using Python objects, which are more abstract than C++ objects and require a lot more lookup at runtime.

Incidentally, std::vector is implemented in a pretty similar way to list. std::map, though, is actually implemented in a way that makes many operations slower than dict as its size gets large. For suitably large examples of each, dict overcomes the constant factor by which it's slower than std::map and will actually do operations like lookup, insertion, etc. faster.

If you want to use std::map and std::vector, nothing is stopping you. You'll have to wrap them yourself if you want to expose them to Python. Do not be shocked if this wrapping consumes all or much of the time you were hoping to save. I am not aware of any tools that make this automatic for you.

There are C API calls for controlling the creation of objects with some detail. You can say "Make a list with at least this many elements", but this doesn't improve the overall complexity of your list creation-and-filling operation. It certainly doesn't change much later as you try to change your list.
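As a rough Python-level illustration of that point (this is not the C API itself, where PyList_New takes the initial length): preallocating with `[None] * n` and filling by index builds the same list as repeated `append`, and both stay linear in n. The function names here are hypothetical.

```python
def build_by_append(n):
    """Grow the list one element at a time; append is amortized O(1)."""
    out = []
    for i in range(n):
        out.append(i * i)
    return out


def build_preallocated(n):
    """Allocate all n slots up front, then fill them by index."""
    out = [None] * n
    for i in range(n):
        out[i] = i * i
    return out
```

Any difference between the two is a constant factor, so it is worth measuring before assuming preallocation helps.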

My general advice is

1. If you want a fixed-size array (you talk about specifying the size of a list), you may actually want something like a numpy array.
2. I doubt you are going to get the speedup you want out of using std::vector over list as a general replacement in your code. If you want to use it behind the scenes, it may give you a satisfying size and space improvement (I of course don't know without measuring, nor do you. ;) ).
3. dict actually does its job really well. I definitely wouldn't try introducing a new general-purpose type for use in Python based on std::map, which has worse algorithmic complexity in time for many important operations and (in at least some implementations) leaves some optimisations to the user that dict already has.
4. If I did want something that worked a little more like std::map, I'd probably use a database. This is generally what I do if stuff I want to store in a dict (or for that matter, stuff I store in a list) gets too big for me to feel comfortable storing in memory. Python has sqlite3 in the stdlib, and drivers are available for all other major databases.
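To make the database suggestion concrete, here is a minimal sketch of a dict-like store on top of the stdlib sqlite3 module. The class name `SqliteDict`, the table name `kv`, and the choice of text keys and values are assumptions made for illustration, not something from the answer.

```python
import sqlite3


class SqliteDict:
    """Hypothetical minimal dict-like store backed by sqlite3 (illustration only)."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to keep the data on disk.
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)"
        )

    def __setitem__(self, key, value):
        # INSERT OR REPLACE gives dict-style overwrite semantics.
        self._conn.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value)
        )
        self._conn.commit()

    def __getitem__(self, key):
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def keys_in_order(self):
        # ORDER BY returns keys sorted, similar to iterating a std::map.
        rows = self._conn.execute("SELECT key FROM kv ORDER BY key")
        return [r[0] for r in rows]
```

Usage looks like a dict: `store = SqliteDict(); store["b"] = "2"; store["a"] = "1"`, after which `store.keys_in_order()` yields the keys sorted. Committing on every write keeps the sketch simple but is slow for bulk loads; batching writes in one transaction would be the usual adjustment.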

Answer 3

Answer by Andreas

C++ is fast not just because of the static declarations of the vector and the elements that go into it, but crucially because, using templates/generics, one specifies that the vector will only contain elements of a certain type, e.g. a vector of tuples of three elements. Cython can't do this last thing, and it sounds nontrivial: it would have to be enforced at compile time somehow (typechecking at runtime is what Python already does). So right now, when you pop something off a list in Cython, there is no way of knowing in advance what type it is, and putting it in a typed variable only adds a typecheck, not speed. This means that there is no way of bypassing the Python interpreter in this regard, and it seems to me it's the most crucial shortcoming of Cython for non-numerical tasks.

The manual way of solving this is to subclass the python list/dict (or perhaps std::vector) with a cdef class for a specific type of element or key-value combination. This would amount to the same thing as the code that templates are generating. As long as you use the resulting class in Cython code it should provide an improvement.

Using databases or arrays just solves a different problem, because this is about putting arbitrary objects (but with a specific type, and preferably a cdef class) in containers.

And std::map shouldn't be compared to dict: std::map maintains keys in sorted order because it is a balanced tree, while dict is a hash table and solves a different problem. A better comparison would be dict and Google's hashtable.
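The difference shows up on the Python side as follows. This is a small sketch with made-up keys: a dict iterates in insertion order, so the sorted traversal and range lookups that a tree-based map provides need an explicit sort or a separately maintained sorted key list (here via the stdlib bisect module).

```python
import bisect

prices = {}
for key, value in [("pear", 3), ("apple", 1), ("fig", 2)]:
    prices[key] = value

# dict iterates in insertion order (guaranteed since Python 3.7), not sorted order.
insertion_order = list(prices)

# Sorted traversal, which std::map provides by construction, needs an explicit sort ...
sorted_order = sorted(prices)

# ... or a separately maintained sorted key list for range queries.
keys = []
for key in prices:
    bisect.insort(keys, key)

# All keys k with "b" <= k < "g".
lo = bisect.bisect_left(keys, "b")
hi = bisect.bisect_left(keys, "g")
in_range = keys[lo:hi]
```

Note that `bisect.insort` on a list costs O(n) per insertion because of element shifting, so this is only a stand-in for a balanced tree, not an equivalent.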

Answer 4

Answer by Andrey Vlasovskikh

You can take a look at the standard array module for Python if this is appropriate for your Cython setting. I'm not sure since I have never used Cython.
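For reference, a short pure-Python sketch of what the array module offers: a homogeneous, contiguous container with a declared element type. The element count and the `'q'` type code (signed 64-bit integers) are arbitrary choices for the example, and the size estimate for the list is approximate because small int objects are shared by the interpreter.

```python
import sys
from array import array

n = 100_000

# A list stores pointers to separately allocated int objects.
as_list = list(range(n))

# array('q') stores signed 64-bit integers contiguously, one machine value each.
as_array = array("q", range(n))

# Rough footprint: the list's pointer table plus the int objects it refers to.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
array_bytes = sys.getsizeof(as_array)

# Only values of the declared type are accepted.
try:
    as_array.append("not an int")
except TypeError:
    rejected = True
else:
    rejected = False
```

Printing `list_bytes` and `array_bytes` on your own machine shows the gap. Whether this also helps speed inside Cython depends on how the array is accessed there, which is not shown here.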

Answer 5

Answer by Karl Guertin

There is no way to get native Python lists/dicts up to the speed of a C++ map/vector or even anywhere close. It has nothing to do with allocation or type declaration but rather paying the interpreter overhead. The example you mention (numpy) is a C extension and is written in C for precisely this reason.

Answer 6

Answer by Jan Joswig

Just because it was not mentioned here: You can easily wrap for example a C++ vector in a custom extension type.

from libcpp.vector cimport vector

cdef class pyvector:
    """Extension type wrapping a vector"""
    cdef vector[long] _data

    cpdef void push_back(self, long x):
        self._data.push_back(x)

    @property
    def data(self):
        return self._data

In this way, you can store your data in a vector allowing fast Cython operations while still being able to access the data (with some overhead) from the Python side.
