Bad Django / uwsgi performance

Note: this page is a mirror of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/14962289/

Tags: python, django, uwsgi, django-rest-framework

Asked by Maverick

I am running a Django app with nginx & uwsgi. Here's how I run uwsgi:

sudo uwsgi -b 25000 --chdir=/www/python/apps/pyapp --module=wsgi:application --env DJANGO_SETTINGS_MODULE=settings --socket=/tmp/pyapp.socket --cheaper=8 --processes=16  --harakiri=10  --max-requests=5000  --vacuum --master --pidfile=/tmp/pyapp-master.pid --uid=220 --gid=499

and my nginx configuration:

server {
    listen 80;
    server_name test.com;

    root /www/python/apps/pyapp/;

    access_log /var/log/nginx/test.com.access.log;
    error_log /var/log/nginx/test.com.error.log;

    # https://docs.djangoproject.com/en/dev/howto/static-files/#serving-static-files-in-production
    location /static/ {
        alias /www/python/apps/pyapp/static/;
        expires 30d;
    }

    location /media/ {
        alias /www/python/apps/pyapp/media/;
        expires 30d;
    }

    location / {
        uwsgi_pass unix:///tmp/pyapp.socket;
        include uwsgi_params;
        proxy_read_timeout 120;
    }

    # what to serve if upstream is not available or crashes
    #error_page 500 502 503 504 /media/50x.html;
}

Here comes the problem. When running "ab" (ApacheBench) against the server I get the following results:

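For reference, the numbers below came from an invocation along these lines, reconstructed from the report's request count, concurrency, host and path (the exact flags used are an assumption):

    ab -n 1000 -c 100 "http://pycms.com/api/nodes/mostviewed/8/?format=json"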

nginx version: nginx/1.2.6

uwsgi version: 1.4.5

Server Software:        nginx/1.0.15
Server Hostname:        pycms.com
Server Port:            80

Document Path:          /api/nodes/mostviewed/8/?format=json
Document Length:        8696 bytes

Concurrency Level:      100
Time taken for tests:   41.232 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      8866000 bytes
HTML transferred:       8696000 bytes
Requests per second:    24.25 [#/sec] (mean)
Time per request:       4123.216 [ms] (mean)
Time per request:       41.232 [ms] (mean, across all concurrent requests)
Transfer rate:          209.99 [Kbytes/sec] received

While running at a concurrency level of 500:

Concurrency Level:      500
Time taken for tests:   2.175 seconds
Complete requests:      1000
Failed requests:        50
   (Connect: 0, Receive: 0, Length: 50, Exceptions: 0)
Write errors:           0
Non-2xx responses:      950
Total transferred:      629200 bytes
HTML transferred:       476300 bytes
Requests per second:    459.81 [#/sec] (mean)
Time per request:       1087.416 [ms] (mean)
Time per request:       2.175 [ms] (mean, across all concurrent requests)
Transfer rate:          282.53 [Kbytes/sec] received

As you can see... all requests on the server fail with either timeout errors or "Client prematurely disconnected" or:

writev(): Broken pipe [proto/uwsgi.c line 124] during GET /api/nodes/mostviewed/9/?format=json

Here's a little bit more about my application: basically, it's a collection of models that reflect the MySQL tables containing all the content. On the frontend, I have django-rest-framework serving JSON content to the clients.

I've installed django-profiling & django-debug-toolbar to see what's going on. In django-profiling, here's what I get when running a single request:

Instance wide RAM usage

Partition of a set of 147315 objects. Total size = 20779408 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  63960  43  5726288  28   5726288  28 str
     1  36887  25  3131112  15   8857400  43 tuple
     2   2495   2  1500392   7  10357792  50 dict (no owner)
     3    615   0  1397160   7  11754952  57 dict of module
     4   1371   1  1236432   6  12991384  63 type
     5   9974   7  1196880   6  14188264  68 function
     6   8974   6  1076880   5  15265144  73 types.CodeType
     7   1371   1  1014408   5  16279552  78 dict of type
     8   2684   2   340640   2  16620192  80 list
     9    382   0   328912   2  16949104  82 dict of class
<607 more rows. Type e.g. '_.more' to view.>



CPU Time for this request

         11068 function calls (10158 primitive calls) in 0.064 CPU seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.064    0.064 /usr/lib/python2.6/site-packages/django/views/generic/base.py:44(view)
        1    0.000    0.000    0.064    0.064 /usr/lib/python2.6/site-packages/django/views/decorators/csrf.py:76(wrapped_view)
        1    0.000    0.000    0.064    0.064 /usr/lib/python2.6/site-packages/rest_framework/views.py:359(dispatch)
        1    0.000    0.000    0.064    0.064 /usr/lib/python2.6/site-packages/rest_framework/generics.py:144(get)
        1    0.000    0.000    0.064    0.064 /usr/lib/python2.6/site-packages/rest_framework/mixins.py:46(list)
        1    0.000    0.000    0.038    0.038 /usr/lib/python2.6/site-packages/rest_framework/serializers.py:348(data)
     21/1    0.000    0.000    0.038    0.038 /usr/lib/python2.6/site-packages/rest_framework/serializers.py:273(to_native)
     21/1    0.000    0.000    0.038    0.038 /usr/lib/python2.6/site-packages/rest_framework/serializers.py:190(convert_object)
     11/1    0.000    0.000    0.036    0.036 /usr/lib/python2.6/site-packages/rest_framework/serializers.py:303(field_to_native)
    13/11    0.000    0.000    0.033    0.003 /usr/lib/python2.6/site-packages/django/db/models/query.py:92(__iter__)
      3/1    0.000    0.000    0.033    0.033 /usr/lib/python2.6/site-packages/django/db/models/query.py:77(__len__)
        4    0.000    0.000    0.030    0.008 /usr/lib/python2.6/site-packages/django/db/models/sql/compiler.py:794(execute_sql)
        1    0.000    0.000    0.021    0.021 /usr/lib/python2.6/site-packages/django/views/generic/list.py:33(paginate_queryset)
        1    0.000    0.000    0.021    0.021 /usr/lib/python2.6/site-packages/django/core/paginator.py:35(page)
        1    0.000    0.000    0.020    0.020 /usr/lib/python2.6/site-packages/django/core/paginator.py:20(validate_number)
        3    0.000    0.000    0.020    0.007 /usr/lib/python2.6/site-packages/django/core/paginator.py:57(_get_num_pages)
        4    0.000    0.000    0.020    0.005 /usr/lib/python2.6/site-packages/django/core/paginator.py:44(_get_count)
        1    0.000    0.000    0.020    0.020 /usr/lib/python2.6/site-packages/django/db/models/query.py:340(count)
        1    0.000    0.000    0.020    0.020 /usr/lib/python2.6/site-packages/django/db/models/sql/query.py:394(get_count)
        1    0.000    0.000    0.020    0.020 /usr/lib/python2.6/site-packages/django/db/models/query.py:568(_prefetch_related_objects)
        1    0.000    0.000    0.020    0.020 /usr/lib/python2.6/site-packages/django/db/models/query.py:1596(prefetch_related_objects)
        4    0.000    0.000    0.020    0.005 /usr/lib/python2.6/site-packages/django/db/backends/util.py:36(execute)
        1    0.000    0.000    0.020    0.020 /usr/lib/python2.6/site-packages/django/db/models/sql/query.py:340(get_aggregation)
        5    0.000    0.000    0.020    0.004 /usr/lib64/python2.6/site-packages/MySQLdb/cursors.py:136(execute)
        2    0.000    0.000    0.020    0.010 /usr/lib/python2.6/site-packages/django/db/models/query.py:1748(prefetch_one_level)
        4    0.000    0.000    0.020    0.005 /usr/lib/python2.6/site-packages/django/db/backends/mysql/base.py:112(execute)
        5    0.000    0.000    0.019    0.004 /usr/lib64/python2.6/site-packages/MySQLdb/cursors.py:316(_query)
       60    0.000    0.000    0.018    0.000 /usr/lib/python2.6/site-packages/django/db/models/query.py:231(iterator)
        5    0.012    0.002    0.015    0.003 /usr/lib64/python2.6/site-packages/MySQLdb/cursors.py:278(_do_query)
       60    0.000    0.000    0.013    0.000 /usr/lib/python2.6/site-packages/django/db/models/sql/compiler.py:751(results_iter)
       30    0.000    0.000    0.010    0.000 /usr/lib/python2.6/site-packages/django/db/models/manager.py:115(all)
       50    0.000    0.000    0.009    0.000 /usr/lib/python2.6/site-packages/django/db/models/query.py:870(_clone)
       51    0.001    0.000    0.009    0.000 /usr/lib/python2.6/site-packages/django/db/models/sql/query.py:235(clone)
        4    0.000    0.000    0.009    0.002 /usr/lib/python2.6/site-packages/django/db/backends/__init__.py:302(cursor)
        4    0.000    0.000    0.008    0.002 /usr/lib/python2.6/site-packages/django/db/backends/mysql/base.py:361(_cursor)
        1    0.000    0.000    0.008    0.008 /usr/lib64/python2.6/site-packages/MySQLdb/__init__.py:78(Connect)
  910/208    0.003    0.000    0.008    0.000 /usr/lib64/python2.6/copy.py:144(deepcopy)
       22    0.000    0.000    0.007    0.000 /usr/lib/python2.6/site-packages/django/db/models/query.py:619(filter)
       22    0.000    0.000    0.007    0.000 /usr/lib/python2.6/site-packages/django/db/models/query.py:633(_filter_or_exclude)
       20    0.000    0.000    0.005    0.000 /usr/lib/python2.6/site-packages/django/db/models/fields/related.py:560(get_query_set)
        1    0.000    0.000    0.005    0.005 /usr/lib64/python2.6/site-packages/MySQLdb/connections.py:8()

..etc

However, django-debug-toolbar shows the following:

Resource Usage
Resource    Value
User CPU time   149.977 msec
System CPU time 119.982 msec
Total CPU time  269.959 msec
Elapsed time    326.291 msec
Context switches    11 voluntary, 40 involuntary

and 5 queries in 27.1 ms

The problem is that "top" shows the load average rising quickly, and ApacheBench (which I ran both on the local server and from a remote machine within the network) shows that I am not serving many requests per second. What is the problem? This is as far as I could get when profiling the code, so it would be appreciated if someone could point out what I am doing wrong here.

Edit (23/02/2013): Adding more details based on Andrew Alcock's answer. The points that require my attention / answer are: (3) I've executed "show global variables" on MySQL and found that the max_connections setting is 151, which is more than enough to serve the workers I am starting for uwsgi.

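For reference, a quick way to compare that setting against the worker count from Python; this is a sketch only, with placeholder connection details (MySQLdb itself is real and already used by the app, per the profile above):

    import MySQLdb

    # placeholder credentials - substitute your own
    conn = MySQLdb.connect(host='localhost', user='user', passwd='secret', db='pyapp')
    cur = conn.cursor()
    cur.execute("SHOW GLOBAL VARIABLES LIKE 'max_connections'")
    name, value = cur.fetchone()
    workers = 16  # --processes from the uwsgi command line
    print 'max_connections=%s for %d uwsgi workers' % (value, workers)
    conn.close()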

(3)(4)(2) The single request I am profiling is the heaviest one. It executes 4 queries according to django-debug-toolbar. The queries run in 3.71, 2.83, 0.88 and 4.84 ms respectively.

(4) Here you're referring to memory paging? If so, how could I tell?

(5) With 16 workers, a concurrency level of 100 and 1000 requests, the load average goes up to ~12. I ran the tests with different numbers of workers (concurrency level is 100):

  1. 1 worker, load average ~ 1.85, 19 reqs / second, Time per request: 5229.520, 0 non-2xx
  2. 2 workers, load average ~ 1.5, 19 reqs / second, Time per request: 516.520, 0 non-2xx
  3. 4 workers, load average ~ 3, 16 reqs / second, Time per request: 5929.921, 0 non-2xx
  4. 8 workers, load average ~ 5, 18 reqs / second, Time per request: 5301.458, 0 non-2xx
  5. 16 workers, load average ~ 19, 15 reqs / second, Time per request: 6384.720, 0 non-2xx

As you can see, the more workers we have, the more load we put on the system. I can also see in uwsgi's daemon log that the response time in milliseconds increases when I increase the number of workers.

With 16 workers, running requests at a concurrency level of 500, uwsgi starts logging errors:

 writev(): Broken pipe [proto/uwsgi.c line 124] 

Load goes up to ~10 as well, and the tests don't take much time because 923 out of 1000 responses are non-2xx, which is why the responses here are quite fast: they're almost empty. This is also a reply to your point #4 in the summary.

Assuming that what I am facing here is OS latency due to I/O and networking, what is the recommended action to scale this up? New hardware? A bigger server?

Thanks

Accepted answer by Andrew Alcock

EDIT 1: Having seen the comment that you have 1 virtual core, I'm adding commentary on all the relevant points.

EDIT 2: More information from Maverick, so I'm eliminating the ideas that have been ruled out and developing the confirmed issues.

EDIT 3: Filled out more details about the uwsgi request queue and scaling options. Improved grammar.

EDIT 4: Updates from Maverick and minor improvements.

Comments are too small, so here are some thoughts:

  1. Load average is basically how many processes are running on or waiting for CPU attention. For a perfectly loaded system with 1 CPU core, the load average should be 1.0; for a 4 core system, it should be 4.0. The moment you run the web test, the thread count rockets and you have a lot of processes waiting for CPU. Unless the load average exceeds the number of CPU cores by a significant margin, it is not a concern.
  2. The first 'Time per request' value of 4s correlates to the length of the request queue - 1000 requests dumped on Django nearly instantaneously and took on average 4s to service, about 3.4s of which were waiting in a queue. This is due to the very heavy mismatch between the number of requests (100) vs. the number of processors (16) causing 84 of the requests to be waiting for a processor at any one moment.
  3. Running at a concurrency of 100, the tests take 41 seconds at 24 requests/sec. You have 16 processes (threads), so each request is processed in about 700ms. Given your type of transaction, that is a long time per request. This may be because:

    1. The CPU cost of each request is high in Django (which is highly unlikely given the low CPU value from the debug toolbar)
    2. The OS is task switching a lot (especially if the load average is higher than 4-8), and the latency is purely down to having too many processes.
    3. There are not enough DB connections serving the 16 processes so processes are waiting to have one come available. Do you have at least one connection available per process?
    4. There is considerable latency around the DB, either:

      1. Tens of small requests each taking, say, 10ms, most of which is networking overhead. If so, can you introduce caching or reduce the number of SQL calls (see the ORM sketch after this list)? Or
      2. One or a couple of requests are taking hundreds of ms. To check this, run profiling on the DB. If so, you need to optimise that request.
  4. The split between system and user CPU cost is unusually high in system, although the total CPU is low. This implies that most of the work in Django is kernel related, such as networking or disk. In this scenario, it might be network costs (eg receiving and sending HTTP requests and receiving and sending requests to the DB). Sometimes this will be high because of paging. If there's no paging going on, then you probably don't have to worry about this at all.

  5. You have set the processes at 16, but have a high load average (you don't state how high). Ideally you should always have at least one process waiting for CPU (so that CPUs don't spin idly). Processes here don't seem CPU bound, but have a significant latency, so you need more processes than cores. How many more? Try running uwsgi with different numbers of processes (1, 2, 4, 8, 12, 16, 24, etc) until you have the best throughput. If you change the latency of the average process, you will need to adjust this again.
  6. The 500 concurrency level definitely is a problem, but is it the client or the server? The report says 50 (out of 1000) responses had an incorrect content-length, which implies a server problem. The non-2xx responses also seem to point there. Is it possible to capture the non-2xx responses for debugging? Stack traces or the specific error message would be incredibly useful (EDIT) and this is caused by the uwsgi request queue running with its default value of 100.
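
If the DB latency turns out to be lots of small queries (case 1 in point 3.4, referenced above), the usual Django fix is to batch the related lookups and cache the hot endpoint. A minimal sketch, assuming a hypothetical Node model with an author foreign key and a tags many-to-many field (these names are illustrative, not from the original app):

    from django.core.cache import cache
    from myapp.models import Node  # hypothetical model

    def most_viewed_nodes(limit):
        key = 'mostviewed:%d' % limit
        nodes = cache.get(key)
        if nodes is None:
            # select_related pulls the FK row in the same query;
            # prefetch_related batches the M2M into one extra query
            # instead of one query per object.
            qs = (Node.objects
                      .select_related('author')
                      .prefetch_related('tags')
                      .order_by('-views')[:limit])
            nodes = list(qs)
            cache.set(key, nodes, 60)  # cache for 60 seconds
        return nodes

Whether this helps depends on which of the two DB-latency cases applies, so profile first.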

So, in summary:

[Diagram: the HTTP request queue sitting in front of the uwsgi worker processes]

  1. Django seems fine
  2. Mismatch between concurrency of load test (100 or 500) vs. processes (16): You're pushing way too many concurrent requests into the system for the number of processes to handle. Once you are above the number of processes, all that will happen is that you will lengthen the HTTP Request queue in the web server
  3. There is a large latency, so either

    1. Mismatch between processes (16) and CPU cores (1): If the load average is >3, then it's probably too many processes. Try again with a smaller number of processes

      1. Load average > 2 -> try 8 processes
      2. Load average > 4 -> try 4 processes
      3. Load average > 8 -> try 2 processes
    2. If the load average <3, it may be in the DB, so profile the DB to see whether there are loads of small requests (additively causing the latency) or one or two SQL statements are the problem

  4. Without capturing the failed response, there's not much I can say about the failures at 500 concurrency

Developing ideas

Your load averages of >10 on a single cored machine are really nasty and (as you observe) lead to a lot of task switching and general slow behaviour. I personally don't remember seeing a machine with a load average of 19 (which you have for 16 processes) - congratulations for getting it so high ;)

The DB performance is great, so I'd give that an all-clear right now.

Paging: To answer your question on how to see paging - you can detect OS paging in several ways. For example, the header of top has page-ins and page-outs (see the last line):

Processes: 170 total, 3 running, 4 stuck, 163 sleeping, 927 threads                                                                                                        15:06:31
Load Avg: 0.90, 1.19, 1.94  CPU usage: 1.37% user, 2.97% sys, 95.65% idle  SharedLibs: 144M resident, 0B data, 24M linkedit.
MemRegions: 31726 total, 2541M resident, 120M private, 817M shared. PhysMem: 1420M wired, 3548M active, 1703M inactive, 6671M used, 1514M free.
VM: 392G vsize, 1286M framework vsize, 1534241(0) pageins, 0(0) pageouts. Networks: packets: 789684/288M in, 912863/482M out. Disks: 739807/15G read, 996745/24G written.
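
Staying in Python, you can also read the page-fault counters the kernel keeps for the current process; a small sketch using only the standard library:

    import resource

    usage = resource.getrusage(resource.RUSAGE_SELF)
    # ru_majflt counts major faults (ones that had to hit the disk);
    # a steadily climbing value under load means the box is paging.
    print 'major faults: %d, minor faults: %d' % (usage.ru_majflt, usage.ru_minflt)

At the OS level, the si/so columns of vmstat tell the same story.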

Number of processes: In your current configuration, the number of processes is way too high. Scale the number of processes back to 2. We might bring this value up later, depending on shifting further load off this server.

Location of Apache Benchmark: The load average of 1.85 for one process suggests to me that you are running the load generator on the same machine as uwsgi - is that correct?

If so, you really need to run this from another machine otherwise the test runs are not representative of actual load - you're taking memory and CPU from the web processes for use in the load generator. In addition, the load generator's 100 or 500 threads will generally stress your server in a way that does not happen in real life. Indeed this might be the reason the whole test fails.

Location of the DB: The load average for one process also suggests that you are running the DB on the same machine as the web processes - is this correct?

If I'm correct about the DB, then the first and best way to start scaling is to move the DB to another machine. We do this for a couple of reasons:

  1. A DB server needs a different hardware profile from a processing node:

    1. Disk: DB needs a lot of fast, redundant, backed up disk, and a processing node needs just a basic disk
    2. CPU: A processing node needs the fastest CPU you can afford whereas a DB machine can often make do without (often its performance is gated on disk and RAM)
    3. RAM: a DB machine generally needs as much RAM as possible (and the fastest DBs keep all their data in RAM), whereas many processing nodes need much less (yours needs about 20MB per process - very small)
    4. Scaling: Atomic DBs scale best by having monster machines with many CPUs, whereas the web tier (not having state) can scale by plugging in many identical small boxen.
  2. CPU affinity: It's better for the CPU to have a load average of 1.0 and processes to have affinity to a single core. Doing so maximizes the use of the CPU cache and minimizes task switching overheads. By separating the DB and processing nodes, you are enforcing this affinity in HW.

500 concurrency with exceptions: The request queue in the diagram above holds at most 100 requests - if uwsgi receives a request when the queue is full, the request is rejected with a 5xx error. I think this is what happened in your 500 concurrency load test - basically the queue filled up with the first 100 or so threads, then the other 400 threads issued the remaining 900 requests and received immediate 5xx errors.

To handle 500 requests per second you need to ensure two things:

  1. The Request Queue size is configured to handle the burst: use the --listen argument to uwsgi (example below)
  2. The system can handle a throughput above 500 requests per second if 500 is a normal condition, or a bit below that if 500 is a peak. See the scaling notes below.

I imagine that uwsgi has the queue set to a smaller number to better handle DDoS attacks; if placed under huge load, most requests immediately fail with almost no processing, allowing the box as a whole to remain responsive to the administrators.

General advice for scaling a system

Your most important consideration is probably to maximize throughput. Another possible goal is to minimize response time, but I won't discuss that here. In maximising throughput, you are trying to maximize the system, not individual components; some local decreases might improve overall system throughput (for example, making a change that happens to add latency in the web tier in order to improve performance of the DB is a net gain).

Onto specifics:

  1. Move the DB to a separate machine. After this, profile the DB during your load test by running top and your favorite MySQL monitoring tool. You need to be able to profile it. Moving the DB to a separate machine will introduce some additional latency (several ms) per request, so expect to slightly increase the number of processes at the web tier to keep the same throughput.
  2. Ensure that the uwsgi request queue is large enough to handle a burst of traffic, using the --listen argument. This should be several times the maximum steady-state requests-per-second your system can handle.
  3. On the web/app tier: balance the number of processes with the number of CPU cores and the inherent latency in the process. Too many processes slow performance; too few mean that you'll never fully utilize the system resources. There is no fixed balancing point, as every application and usage pattern is different, so benchmark and adjust. As a guide, use the processes' latency (see the sizing sketch after this list); if each task has:

    • 0% latency, then you need 1 process per core
    • 50% latency (i.e. the CPU time is half the actual time), then you need 2 processes per core
    • 67% latency, then you need 3 processes per core
  4. Check top during the test to ensure that you are above 90% CPU utilisation (for every core) and that you have a load average a little above 1.0. If the load average is higher, scale back the processes. If all goes well, at some point you won't be able to achieve this target, and the DB might then be the bottleneck

  5. At some point you will need more power in the web tier. You can either choose to add more CPU to the machine (relatively easy) and so add more processes, and/or you can add in more processing nodes (horizontal scalability). The latter can be achieved in uwsgi using the method discussed here by Łukasz Mierzwa
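
As a worked version of the latency guide in point 3 (the sizing sketch referenced there), with the latency fractions taken from the bullets rather than measured:

    # A worker that spends `latency` fraction of each request waiting
    # (on the DB, network, etc.) needs ~1 / (1 - latency) processes
    # per core to keep that core busy.
    def processes_per_core(latency):
        return 1.0 / (1.0 - latency)

    for latency in (0.0, 0.5, 0.67):
        print '%2.0f%% latency -> %.1f processes per core' % (latency * 100, processes_per_core(latency))

This reproduces the 1 / 2 / 3 processes-per-core rule of thumb above.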

Answered by barracel

Adding more workers and getting fewer r/s means that your request "is pure CPU" and there are no IO waits that another worker could use to serve another request.

If you want to scale you will need to use another server with more (or faster) CPUs.

However, this is a synthetic test; the number of r/s you get is the upper bound for the exact request that you are testing. Once in production there are many more variables that can affect performance.

Answered by Łukasz Mierzwa

Please run benchmarks for much longer than a minute (5-10 minutes at least); you really won't get much information from such a short test. And use uWSGI's carbon plugin to push stats to a carbon/graphite server (you will need to have one); you will have much more information for debugging.

When you send 500 concurrent requests to your app and it can't handle such load, the listen queue on each backend will fill up pretty quickly (it holds 100 requests by default). You might want to increase that, but if the workers can't process requests that fast and the listen queue (also known as the backlog) is full, the linux network stack will drop requests and you will start getting errors.

Your first benchmark shows that you can process a single request in ~42 ms, so a single worker could process at most 1000ms / 42ms = ~23 requests per second (if the db and other parts of the app stack don't slow down as concurrency goes up). So to process 500 concurrent requests you would need at least 500 / 23 = 21 workers (but in reality I would say at least 40). You have only 16, no wonder it breaks under such load.

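The same arithmetic as a sketch, with the numbers taken from the paragraph above:

    ms_per_request = 42.0                  # observed service time
    per_worker = 1000.0 / ms_per_request   # ~23.8 requests/sec per worker
    target_rps = 500.0                     # desired requests per second
    print 'workers needed: ~%d' % round(target_rps / per_worker)  # ~21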

EDIT: I've mixed up rate with concurrency - at least 21 workers will allow you to process 500 requests per second, not 500 concurrent requests. If you really want to handle 500 concurrent requests then you simply need 500 workers. Unless you run your app in async mode - check the "Gevent" section in the uWSGI docs.

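A hedged sketch of what async mode could look like; it assumes the app is gevent-safe and uWSGI was built with the gevent plugin, and 100 is an illustrative number of async cores:

    uwsgi --gevent 100 --socket=/tmp/pyapp.socket --module=wsgi:application ... (other flags as before)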

PS. uWSGI comes with a great load balancer with backend autoconfiguration (read the docs under "Subscription Server" and "FastRouter"). You can set it up in a way that allows you to hot-plug new backends as needed: you just start workers on a new node and they will subscribe to the FastRouter and start getting requests. This is the best way to scale horizontally. And with backends on AWS you can automate this so that new backends will be started quickly when needed.

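A rough sketch of that setup; the ports and addresses are illustrative, and the real details are in the uWSGI "FastRouter" and "Subscription Server" docs:

    # on the router box:
    uwsgi --fastrouter :80 --fastrouter-subscription-server :2626

    # on each new backend node, which announces itself to the router:
    uwsgi --socket :3031 --subscribe-to ROUTER_IP:2626:test.com --module=wsgi:application ... (other flags as before)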