Python Greenlet Vs. Threads
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/15556718/
Greenlet Vs. Threads
Asked by Rsh
I am new to gevent and greenlets. I found some good documentation on how to work with them, but none of it explained how and when I should use greenlets!
- What are they really good at?
- Is it a good idea to use them in a proxy server or not?
- Why not threads?
What I am not sure about is how they can provide us with concurrency if they're basically co-routines.
Accepted answer by Matt Joiner
Greenlets provide concurrency but not parallelism. Concurrency is when code can run independently of other code. Parallelism is the execution of concurrent code simultaneously. Parallelism is particularly useful when there's a lot of work to be done in userspace, which is typically CPU-heavy stuff. Concurrency is useful for breaking problems apart, enabling different parts to be scheduled and managed more easily in parallel.
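To make "concurrency without parallelism" concrete, here is a minimal sketch using the greenlet library directly: everything runs in a single thread, and control only changes hands at explicit switch() calls.

from greenlet import greenlet

def task_a():
    print("A: step 1")
    gr_b.switch()  # hand control to task_b
    print("A: step 2")
    gr_b.switch()

def task_b():
    print("B: step 1")
    gr_a.switch()  # hand control back to task_a
    print("B: step 2")

gr_a = greenlet(task_a)
gr_b = greenlet(task_b)
gr_a.switch()  # interleaves in one thread: A1, B1, A2, B2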
Greenlets really shine in network programming, where interactions with one socket can occur independently of interactions with other sockets. This is a classic example of concurrency. Because each greenlet runs in its own context, you can continue to use synchronous APIs without threading. This is good because threads are very expensive in terms of virtual memory and kernel overhead, so the concurrency you can achieve with threads is significantly less. Additionally, threading in Python is more expensive and more limited than usual due to the GIL. Alternatives to concurrency are usually projects like Twisted, libevent, libuv, node.js etc., where all your code shares the same execution context and registers event handlers.
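A minimal sketch of that synchronous style with gevent, assuming monkey-patching is used (one common setup; the hostnames are just placeholders):

import gevent
from gevent import monkey
monkey.patch_all()  # make blocking stdlib calls cooperative

import socket

def lookup(host):
    # ordinary synchronous code; the greenlet yields to others
    # whenever this call waits on the network
    return host, socket.gethostbyname(host)

jobs = [gevent.spawn(lookup, h) for h in ('www.python.org', 'www.example.com')]
gevent.joinall(jobs)
print([job.value for job in jobs])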
It's an excellent idea to use greenlets (with appropriate networking support, such as through gevent) for writing a proxy, as your request handlers are able to execute independently and should be written as such.
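As a rough illustration of that shape, a bare-bones TCP proxy built on gevent might look like the following sketch (the addresses are placeholders, and shutdown/error handling is omitted):

import gevent
from gevent import socket
from gevent.server import StreamServer

UPSTREAM = ('127.0.0.1', 8080)  # placeholder backend address

def pipe(src, dst):
    # copy bytes one way until the sender closes
    while True:
        data = src.recv(4096)
        if not data:
            break
        dst.sendall(data)

def handle(client, address):
    # one greenlet per direction; each blocks independently
    upstream = socket.create_connection(UPSTREAM)
    try:
        gevent.joinall([
            gevent.spawn(pipe, client, upstream),
            gevent.spawn(pipe, upstream, client),
        ])
    finally:
        upstream.close()
        client.close()

StreamServer(('127.0.0.1', 9000), handle).serve_forever()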
Greenlets provide concurrency for the reasons I gave earlier. Concurrency is not parallelism. By concealing event registration and performing scheduling for you on calls that would normally block the current thread, projects like gevent expose this concurrency without requiring a change to an asynchronous API, and at significantly less cost to your system.
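One way to observe that scheduling is with gevent.sleep: two greenlets that each block for a second finish in roughly one second of wall time, because each blocking call yields control to the other (a small sketch; timings are approximate):

import gevent
from datetime import datetime

def napper(name):
    gevent.sleep(1)  # blocks only this greenlet; the hub runs the other one meanwhile
    print(name, "woke up")

t1 = datetime.now()
gevent.joinall([gevent.spawn(napper, 'a'), gevent.spawn(napper, 'b')])
print("elapsed: %s" % (datetime.now() - t1).total_seconds())  # ~1 second, not ~2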
Answered by max
This is interesting enough to analyze. Here is some code comparing the performance of greenlets versus a multiprocessing pool versus multi-threading:
import gevent
from gevent import socket as gsock
import socket as sock
from multiprocessing import Pool
from threading import Thread
from datetime import datetime


class IpGetter(Thread):
    def __init__(self, domain):
        Thread.__init__(self)
        self.domain = domain

    def run(self):
        self.ip = sock.gethostbyname(self.domain)


if __name__ == "__main__":
    URLS = ['www.google.com', 'www.example.com', 'www.python.org', 'www.yahoo.com', 'www.ubc.ca', 'www.wikipedia.org']

    # gevent: one greenlet per lookup
    t1 = datetime.now()
    jobs = [gevent.spawn(gsock.gethostbyname, url) for url in URLS]
    gevent.joinall(jobs, timeout=2)
    t2 = datetime.now()
    print("Using gevent it took: %s" % (t2 - t1).total_seconds())
    print("-----------")

    # multiprocessing: one worker process per URL
    t1 = datetime.now()
    pool = Pool(len(URLS))
    results = pool.map(sock.gethostbyname, URLS)
    t2 = datetime.now()
    pool.close()
    print("Using multiprocessing it took: %s" % (t2 - t1).total_seconds())
    print("-----------")

    # threading: one thread per URL
    t1 = datetime.now()
    threads = []
    for url in URLS:
        t = IpGetter(url)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    t2 = datetime.now()
    print("Using multi-threading it took: %s" % (t2 - t1).total_seconds())
Here are the results:
Using gevent it took: 0.083758
-----------
Using multiprocessing it took: 0.023633
-----------
Using multi-threading it took: 0.008327
I think greenlet claims that it is not bound by the GIL, unlike the multithreading library. Moreover, the greenlet docs say that it is meant for network operations. For a network-intensive operation, thread switching is fine, and you can see that the multi-threading approach is pretty fast. Also, it's always preferable to use Python's official libraries; I tried installing greenlet on Windows and encountered a DLL dependency problem, so I ran this test on a Linux VM. Always try to write code with the hope that it runs on any machine.
Answered by TemporalBeing
Taking @Max's answer and scaling it up to make it more relevant, you can see the difference. I achieved this by changing the URL list to be filled as follows:
URLS_base = ['www.google.com', 'www.example.com', 'www.python.org', 'www.yahoo.com', 'www.ubc.ca', 'www.wikipedia.org']

URLS = []
for _ in range(10000):
    for url in URLS_base:
        URLS.append(url)
I had to drop the multiprocessing version, as it fell over before I reached 500 iterations; but at 10,000 iterations:
Using gevent it took: 3.756914
-----------
Using multi-threading it took: 15.797028
So you can see there is a significant difference in I/O when using gevent.
Answered by zzzeek
Correcting @TemporalBeing's answer above: greenlets are not "faster" than threads, and it is an incorrect programming technique to spawn 60,000 threads to solve a concurrency problem; a small pool of threads is instead appropriate. Here is a more reasonable comparison (from my reddit post, in response to people citing this SO post).
import gevent
from gevent import socket as gsock
import socket as sock
import threading
from datetime import datetime


def timeit(fn, URLS):
    t1 = datetime.now()
    fn()
    t2 = datetime.now()
    print(
        "%s / %d hostnames, %s seconds" % (
            fn.__name__,
            len(URLS),
            (t2 - t1).total_seconds()
        )
    )


def run_gevent_without_a_timeout():
    ip_numbers = []

    def greenlet(domain_name):
        ip_numbers.append(gsock.gethostbyname(domain_name))

    # one greenlet per hostname
    jobs = [gevent.spawn(greenlet, domain_name) for domain_name in URLS]
    gevent.joinall(jobs)
    assert len(ip_numbers) == len(URLS)


def run_threads_correctly():
    ip_numbers = []

    def process():
        # each worker pulls hostnames off the shared queue
        while queue:
            try:
                domain_name = queue.pop()
            except IndexError:
                pass
            else:
                ip_numbers.append(sock.gethostbyname(domain_name))

    # a fixed pool of 50 threads, regardless of list size
    threads = [threading.Thread(target=process) for i in range(50)]

    queue = list(URLS)
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert len(ip_numbers) == len(URLS)


URLS_base = ['www.google.com', 'www.example.com', 'www.python.org',
             'www.yahoo.com', 'www.ubc.ca', 'www.wikipedia.org']

for NUM in (5, 50, 500, 5000, 10000):
    URLS = []
    for _ in range(NUM):
        for url in URLS_base:
            URLS.append(url)

    print("--------------------")
    timeit(run_gevent_without_a_timeout, URLS)
    timeit(run_threads_correctly, URLS)
Here are some results:
--------------------
run_gevent_without_a_timeout / 30 hostnames, 0.044888 seconds
run_threads_correctly / 30 hostnames, 0.019389 seconds
--------------------
run_gevent_without_a_timeout / 300 hostnames, 0.186045 seconds
run_threads_correctly / 300 hostnames, 0.153808 seconds
--------------------
run_gevent_without_a_timeout / 3000 hostnames, 1.834089 seconds
run_threads_correctly / 3000 hostnames, 1.569523 seconds
--------------------
run_gevent_without_a_timeout / 30000 hostnames, 19.030259 seconds
run_threads_correctly / 30000 hostnames, 15.163603 seconds
--------------------
run_gevent_without_a_timeout / 60000 hostnames, 35.770358 seconds
run_threads_correctly / 60000 hostnames, 29.864083 seconds
The misunderstanding everyone has about non-blocking IO with Python is the belief that the Python interpreter can attend to the work of retrieving results from sockets at large scale faster than the network connections themselves can return IO. While this is certainly true in some cases, it is not true nearly as often as people think, because the Python interpreter is really, really slow. In my blog post, I illustrate some graphical profiles which show that, even for very simple things, if you are dealing with crisp and fast network access to things like databases or DNS servers, those services can come back a lot faster than the Python code can attend to many thousands of those connections.
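As a side note on the small-pool point: the standard library's concurrent.futures manages the pool and work queue for you, so a sketch roughly equivalent to run_threads_correctly above can be quite short:

import socket
from concurrent.futures import ThreadPoolExecutor

URLS = ['www.google.com', 'www.example.com', 'www.python.org',
        'www.yahoo.com', 'www.ubc.ca', 'www.wikipedia.org'] * 5000

# a fixed pool of 50 threads works through the list
with ThreadPoolExecutor(max_workers=50) as pool:
    ip_numbers = list(pool.map(socket.gethostbyname, URLS))

assert len(ip_numbers) == len(URLS)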

