PHP5-FPM randomly starts consuming a lot of CPU
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, include the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/13732436/
Asked by Eugene
I've run into a really strange problem which I am not sure how to debug further. I have an NGINX + PHP5-FPM + APC Amazon Ubuntu instance and there is a website installed on it which is a complex PHP framework. While trying to debug the problem, I've reduced the flow to this: a lot of big classes get included, main objects are created, session is started, array of configs is retrieved from memcached, an XML file is retrieved from memcached, HTML templates are included, output is sent to the client.
Then I use the http_load tool to put the website under a load of 20 requests per second: http_load -timeout 10 -rate 20 -fetches 10000 ./urls.txt
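(Aside, not part of the original question: http_load reads its targets from the file given on the command line, one absolute URL per line, so urls.txt would look roughly like the line below, with the hostname elided the same way it appears in the logs further down.)
http://ec2-....compute-1.amazonaws.com/usa/index.php?page=contact_us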
What happens next is rather strange. top shows a bunch of spawned php5-fpm processes, each taking a few % of CPU, and everything runs smoothly, like this:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28440 www-data 20 0 67352 10m 5372 S 4.3 1.8 0:20.33 php5-fpm
28431 www-data 20 0 67608 10m 5304 S 3.3 1.8 0:16.77 php5-fpm
28444 www-data 20 0 67352 10m 5372 S 3.3 1.8 0:17.17 php5-fpm
28445 www-data 20 0 67352 10m 5372 S 3.0 1.8 0:16.83 php5-fpm
28422 www-data 20 0 67608 10m 5292 S 2.3 1.8 0:18.99 php5-fpm
28424 www-data 20 0 67352 10m 5368 S 2.0 1.8 0:16.59 php5-fpm
28438 www-data 20 0 67608 10m 5304 S 2.0 1.8 0:17.91 php5-fpm
28439 www-data 20 0 67608 10m 5304 S 2.0 1.8 0:23.34 php5-fpm
28423 www-data 20 0 67608 10m 5292 S 1.7 1.8 0:20.02 php5-fpm
28430 www-data 20 0 67608 10m 5300 S 1.7 1.8 0:15.77 php5-fpm
28433 www-data 20 0 67352 10m 5372 S 1.7 1.8 0:17.08 php5-fpm
28434 www-data 20 0 67608 10m 5292 S 1.7 1.8 0:18.56 php5-fpm
20648 memcache 20 0 51568 8192 708 S 1.3 1.3 2:51.06 memcached
28420 www-data 20 0 69876 13m 6300 S 1.3 2.3 0:20.89 php5-fpm
28421 www-data 20 0 67608 10m 5300 S 1.3 1.8 0:21.19 php5-fpm
28429 www-data 20 0 9524 2260 992 S 1.3 0.4 0:11.68 nginx
28435 www-data 20 0 67608 10m 5304 S 1.3 1.8 0:18.58 php5-fpm
28437 www-data 20 0 67352 10m 5372 S 1.3 1.8 0:17.87 php5-fpm
28441 www-data 20 0 67608 10m 5292 S 1.3 1.8 0:20.75 php5-fpm
Then, after some time that can be anywhere between one second and several minutes, several (usually two) php5-fpm processes suddenly consume all the CPU:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28436 www-data 20 0 67608 10m 5304 R 48.5 1.8 0:23.68 php5-fpm
28548 www-data 20 0 67608 10m 5276 R 45.2 1.7 0:07.62 php5-fpm
28434 www-data 20 0 67608 10m 5292 R 2.0 1.8 0:23.28 php5-fpm
28439 www-data 20 0 67608 10m 5304 R 2.0 1.8 0:26.63 php5-fpm
At this point everything gets stuck and all new HTTP requests time out. If I stop the http_load tool, the php5-fpm processes will hang there for many minutes. Interestingly enough, if I do php5-fpm stop, the php5-fpm processes will disappear, but any commands that make use of the filesystem will have problems executing. E.g. if I try to download a file via ssh, top will show the following, and it takes many minutes before the actual download starts:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3298 sshd 20 0 7032 876 416 R 75.2 0.1 0:04.52 sshd
3297 sshd 20 0 7032 876 416 R 24.9 0.1 0:04.49 sshd
The PHP error log usually has this:
[05-Dec-2012 20:31:39] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 0 idle, and 58 total children
[05-Dec-2012 20:32:08] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 0 idle, and 66 total children
The Nginx error log is flooded with these entries:
2012/12/05 20:31:36 [error] 4800#0: *5559 connect() to unix:/dev/shm/php-fpm-www.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: ..., server: ec2-....compute-1.amazonaws.com, request: "GET /usa/index.php?page=contact_us HTTP/1.0", upstream: "fastcgi://unix:/dev/shm/php-fpm-www.sock:", host: "ec2-....compute-1.amazonaws.com"
The PHP-FPM slow log doesn't show anything interesting, swapping never happens, and I haven't managed to gather any other interesting facts about the problem. I've gone through many iterations of config file changes, the most recent ones being:
nginx.conf: http://pastebin.com/uaD56hJF
pool.d/www.conf: http://pastebin.com/mFeeUULC
===UPDATE 1===
site's config: http://pastebin.com/qvinVNhB
===UPDATE 2===
Also just found that dmesg reports errors like this:
[6483131.164331] php5-fpm[28687]: segfault at b6ec8ff4 ip b78c3c32 sp bff551f0 error 4 in ld-2.13.so[b78b5000+1c000]
===UPDATE 3===
We've got a new Amazon EC2 micro instance just in case, to exclude possible hardware issues. Also, I am using php-fastcgi now to exclude possible fpm bugs. The other differences are minor; I think the only thing that changed is Ubuntu -> Debian. The same problem still happens, except that now the server manages to slightly recover after max_execution_time seconds (and then spikes again).
I tried playing with a separate test.php and I am not sure whether it's the same issue, but at least in top it looks the same. I created a test.php and included a bunch of libs that belong to our framework. The libs don't do anything except define classes or include other libs that define classes. I checked with APC and all of this gets successfully served by it. Then I started pressuring test.php with 200 requests per second and after some time the same thing happened. Except that now I managed to get some errors saying "too many open files". It doesn't always happen though; sometimes it just starts timing out without outputting the error, and a few php processes are stuck consuming all CPU. I only played with it a bit, but I think there is a correlation here - by controlling the number of included libs or slightly varying the requests/second rate, I can control when the CPU spike will happen. I increased the relevant OS variables but the issue is still there, although it takes longer to happen (also note that I've set the limits to values N times larger than the total number of requests I do during tests):
fs.file-max = 70000
...
* soft nofile 10000
* hard nofile 30000
...
worker_rlimit_nofile 10000;
...
(reloaded all the configs and made sure the new system vars actually took effect)
So the next best and only explanation I can come up with so far is that even though APC is supposed to pull files from memory, internally it is implemented in a way that still uses a file descriptor whenever PHP include-s are called. And either because it releases them with a delay, or because at some unfortunate moment too many requests arrive at the same time, the system runs out of descriptors and newly arriving HTTP requests quickly stack up into a huge queue. I'll try to test this somehow.
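(A possible way to probe that hypothesis - added here as a sketch, not something from the original post: APC and PHP expose ini settings that reduce per-include filesystem activity. apc.stat, realpath_cache_size and realpath_cache_ttl are standard directives, but the values below are only guesses, and apc.stat = 0 means changed files are only picked up after an FPM reload.)
; php.ini / apc.ini - cut down on stat() calls for every include
apc.stat = 0
realpath_cache_size = 256k
realpath_cache_ttl = 300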
Accepted answer by Kevin A. Naudé
I've run a website with similar configuration for many months, with zero down time. I've had a look at your config, and it looks ok. That being said, I did my config quite a while ago.
I would consider reducing pm.max_requests = 10000 to something more reasonable like pm.max_requests = 500. This just means "don't use each instance for more than X requests". It is good not to have this number too high, because keeping it lower gives you resilience against possible PHP engine bugs.
I think the real problem is most likely in your PHP scripts. It's hard to say without knowing more.
EDIT:
Consider uncommenting ;request_terminate_timeout = 0 and setting it to something like request_terminate_timeout = 20. Your scripts will then be required to complete within 20 seconds. You will most likely see a change in behaviour, but I think your site might stay live. That would indicate a PHP script error.
EDIT2: My own php-fpm config is as follows:
[example.com]
listen = /var/run/sockets/example.com.socket
user = www-data
group = www-data
pm = dynamic
pm.start_servers = 5
pm.max_children = 15
pm.min_spare_servers = 5
pm.max_spare_servers = 10
pm.max_requests = 500
php_flag[expose_php] = off
php_flag[short_open_tag] = on
EDIT3: I spotted something unexpected in your nginx config, but it may be nothing.
You are using fastcgi_ignore_client_abort on; which causes problems in worker processes under older versions of nginx. I haven't seen this problem myself, since I am running a custom compile of a recent version. Here's the description of the problem on the nginx site:
In 1.0.2 POST requests are not handled correctly when fastcgi_ignore_client_abort is set to on which can lead to workers processes segfaulting. Switching fastcgi_ignore_client_abort back to default (off) should resolve this issue.
Answered by Animanga
A simple but very useful trick that can reduce processor usage by up to 50%: just edit your php-fpm config:
pm = dynamic
and change it to:
pm = ondemand
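(For context, and not from the answer itself: pm = ondemand only exists in PHP-FPM 5.3.9 and newer; it forks workers when requests arrive and reaps them after an idle period, so the neighbouring pool directives matter too. A rough sketch with placeholder values:)
pm = ondemand
; upper bound on simultaneously running workers
pm.max_children = 15
; how long an idle worker is kept around before it is killed (default 10s)
pm.process_idle_timeout = 10s
; recycle each worker periodically, as the accepted answer also suggests
pm.max_requests = 500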
Answered by minhhq
The behavior of PHP-FPM on my server is the same as yours; there is a bottleneck somewhere for sure.
The question then becomes: how do you find the bottleneck in an Nginx - PHP-FPM - MySQL stack?
The fastest way to find out is to enable the slowlog for PHP-FPM.
Add the lines below to your php-fpm.conf pool, and make sure the path exists:
request_slowlog_timeout = 10
slowlog = /var/log/php-fpm/slow.$pool.log
By reading the log backtraces, you can find out why PHP-FPM is spending so much CPU or timing out. Here is my case:
[28-Dec-2018 14:56:55] [pool laravel] pid 19061
script_filename = /public_html/index.php
[0x00007efdda4d8100] hasChildren() /public_html/laravel/vendor/symfony/finder/Iterator/ExcludeDirectoryFilterIterator.php:75
[0x00007ffe31cd9e40] hasChildren() unknown:0
[0x00007ffe31cda200] next() unknown:0
[0x00007ffe31cda540] next() unknown:0
[0x00007ffe31cda880] next() unknown:0
[0x00007efdda4d7fa8] gc() /public_html/laravel/vendor/laravel/framework/src/Illuminate/Session/FileSessionHandler.php:91
[0x00007efdda4d7e50] gc() /public_html/laravel/vendor/laravel/framework/src/Illuminate/Session/Middleware.php:159
[0x00007efdda4d7d48] collectGarbage() /public_html/laravel/vendor/laravel/framework/src/Illuminate/Session/Middleware.php:128
[0x00007efdda4d7c20] closeSession() /public_html/laravel/vendor/laravel/framework/src/Illuminate/Session/Middleware.php:79
[0x00007efdda4d7ac8] handle() /public_html/laravel/vendor/laravel/framework/src/Illuminate/Cookie/Queue.php:47
[0x00007efdda4d7930] handle() /public_html/laravel/vendor/laravel/framework/src/Illuminate/Cookie/Guard.php:51
[0x00007efdda4d7818] handle() /public_html/laravel/vendor/stack/builder/src/Stack/StackedHttpKernel.php:23
[0x00007efdda4d76e0] handle() /public_html/laravel/vendor/laravel/framework/src/Illuminate/Foundation/Application.php:641
[0x00007efdda4d7598] run() /public_html/index.php:51
The backtrace mentions these keywords:
"cookie" "session" "collectGarbage()" "laravel"
I kept searching and, tada: Laravel uses a RANDOM method to clear expired sessions, and in my config PHP was handling sessions as files on the SSD.
When the number of sessions becomes "very big", this makes PHP spend much more time handling them => high CPU usage.
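(A possible mitigation, sketched here rather than taken from the answer: make session garbage collection rarer, or move sessions off the filesystem entirely. In Laravel the relevant keys live in the session config file - app/config/session.php on Laravel 4, config/session.php in later versions - roughly like this:)
<?php
// session config - keys and defaults vary slightly between Laravel versions
return array(
    // a memory-backed driver such as 'memcached' or 'redis' avoids the file
    // scanning that FileSessionHandler::gc() does in the backtrace above
    'driver'  => 'memcached',
    // chance that a request triggers session GC: 1 in 1000 instead of the
    // default 2 in 100
    'lottery' => array(1, 1000),
);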
There can be many kinds of bottleneck; we only really know which one it is once we have "debugged" it.
Have a nice investigation.
Answered by Berto
I'm going through this same problem right now, and wanted to point you to this post:
How to determine which script is being executed in PHP-FPM process
It's got to be one of your PHP scripts. See if you can connect the dots between the runaway process IDs and the .php script file that's holding you up.
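(One concrete way to make that connection, sketched here and not taken from the linked post: enable the PHP-FPM status page via the standard pm.status_path pool directive, add an nginx location that passes that path to the same upstream socket - ideally restricted to localhost - and query it with ?full while the CPU spike is happening; the full output lists, per worker PID, the script and request URI currently being handled.)
; pool.d/www.conf
pm.status_path = /status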
Funny, this has been on a server that's been impeccably fast. I think a WordPress upgrade (plugin or core) might well be responsible.
Answered by Allen
I had the same problem. I tried reconfiguring PHP-FPM and NGINX and didn't get very far. One of our guys disabled v8js.php (http://php.net/manual/en/book.v8js.php) and it fixed the problem. I suggest disabling PHP modules one by one until you find the troublemaker. Hopefully that helps someone.
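(For anyone taking this route: on Debian/Ubuntu each extension normally has its own ini file, so disabling one usually means commenting out its extension line and reloading FPM; the path below is an assumption for a php5 install.)
; /etc/php5/mods-available/v8js.ini (assumed path)
;extension=v8js.so
; ...then reload php5-fpm for the change to take effect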

