Best way to manage long-running PHP script?

Question

Asked by kbanman

I have a PHP script that takes a long time (5-30 minutes) to complete. Just in case it matters, the script is using curl to scrape data from another server. This is the reason it's taking so long; it has to wait for each page to load before processing it and moving to the next.

I want to be able to initiate the script and let it be until it's done, which will set a flag in a database table.

What I need to know is how to be able to end the http request before the script is finished running. Also, is a php script the best way to do this?

Answer 1

Answered by symcbean

Certainly it can be done with PHP, however you should NOT do this as a background task - the new process has to be dissociated from the process group where it is initiated.

Since people keep giving the same wrong answer to this FAQ, I've written a fuller answer here:

http://symcbean.blogspot.com/2010/02/php-and-long-running-processes.html

From the comments:

The short version is shell_exec('echo /usr/bin/php -q longThing.php | at now'); but the reasons why are a bit long for inclusion here.
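
For what it's worth, here is a minimal sketch of that one-liner wrapped in two functions. The names build_at_command and start_detached are made up for illustration (they are not from the answer), and the sketch assumes at(1) is installed and that the web server user is allowed to submit jobs.

```php
<?php
/**
 * Builds the shell pipeline that hands a PHP CLI script to at(1).
 * $phpBinary and $script are illustrative parameters.
 */
function build_at_command(string $phpBinary, string $script): string
{
    // The job text that at(1) reads from stdin and later runs on its own,
    // outside the web server's process group.
    $job = escapeshellarg($phpBinary) . ' -q ' . escapeshellarg($script);

    return 'echo ' . escapeshellarg($job) . ' | at now';
}

/**
 * Queues the script with at(1). The call returns as soon as the job is
 * queued, so the HTTP request does not wait for the script to finish.
 */
function start_detached(string $phpBinary, string $script): void
{
    shell_exec(build_at_command($phpBinary, $script) . ' 2>&1');
}
```

Calling start_detached('/usr/bin/php', 'longThing.php') from the web request would queue the job. Quoting both parts with escapeshellarg matters if the script path ever comes from user input.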

Answer 2

Answered by FlorianH

The quick and dirty way would be to use the ignore_user_abort function in PHP. This basically says: don't care what the user does, run this script until it is finished. This is somewhat dangerous on a public-facing site, because you can end up with 20 or more copies of the script running at the same time if it is initiated 20 times.
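
A rough sketch of the quick and dirty way, with a lock file added to guard against the "20 copies at once" problem. The lock file is an addition for illustration, not part of the original suggestion, and run_ignoring_disconnect is a hypothetical name.

```php
<?php
/**
 * Runs $job to completion even if the client disconnects, and refuses to
 * start a second copy while one is already running.
 * Returns false when another copy holds the lock.
 */
function run_ignoring_disconnect(callable $job, string $lockFile): bool
{
    ignore_user_abort(true); // keep running after the browser goes away
    set_time_limit(0);       // lift PHP's execution time limit for this run

    // Mode 'c' creates the file if needed without truncating it.
    $lock = fopen($lockFile, 'c');
    if ($lock === false || !flock($lock, LOCK_EX | LOCK_NB)) {
        return false; // lock unavailable: assume another copy is running
    }

    try {
        $job();
    } finally {
        flock($lock, LOCK_UN);
        fclose($lock);
    }

    return true;
}
```

Note that the web server or a proxy in front of it may still enforce its own timeout, which this does not change.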

The "clean" way (at least IMHO) is to set a flag (in the db, for example) when you want to initiate the process, and to run a cronjob every hour (or so) that checks whether the flag is set. If it IS set, the long-running script starts; if it is NOT set, nothing happens.
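
To make the idea concrete, here is a small sketch that uses a flag file instead of a database row, purely to keep the example self-contained. The function names and the flag path are hypothetical; with a database the same check would be a SELECT followed by an UPDATE.

```php
<?php
// Hypothetical hourly cron entry for the checker script:
//   0 * * * * /usr/bin/php /path/to/check_flag.php

/** Called by the web page to request a run. */
function set_flag(string $flagFile): void
{
    touch($flagFile);
}

/**
 * Called from cron. Runs $job only if the flag is set. The flag is claimed
 * by renaming it, so two cron runs that see the same flag cannot both
 * claim it.
 */
function run_if_flagged(string $flagFile, callable $job): bool
{
    $claimed = $flagFile . '.running';
    if (!file_exists($flagFile) || !@rename($flagFile, $claimed)) {
        return false; // flag not set, or already claimed: nothing happens
    }

    try {
        $job();
    } finally {
        unlink($claimed); // remove the claim once the job has ended
    }

    return true;
}
```

The trade-off is latency: a request can wait up to one cron interval before the work starts.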

Answer 3

Answered by Leon Timmermans

You could use exec or system to start a background job, and then do the work in that.

Also, there are better approaches to scraping the web than the one you're using. You could use a threaded approach (multiple threads doing one page at a time), or one using an event loop (one thread doing multiple pages at a time). My personal approach using Perl would be using AnyEvent::HTTP.
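
Since the asker is already using curl from PHP, the "one thread, multiple pages at a time" idea can be sketched with PHP's curl_multi functions rather than Perl. This is only an illustration under that assumption; fetch_pages_concurrently and its parameters are hypothetical names.

```php
<?php
/**
 * Downloads several pages at once from a single PHP process.
 * Returns the response bodies keyed like the input array.
 */
function fetch_pages_concurrently(array $urls, int $timeoutSeconds = 30): array
{
    if ($urls === []) {
        return [];
    }

    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers together; wait for socket activity between rounds
    // instead of spinning.
    do {
        $status = curl_multi_exec($multi, $stillRunning);
        if ($stillRunning) {
            curl_multi_select($multi, 1.0);
        }
    } while ($stillRunning && $status === CURLM_OK);

    $bodies = [];
    foreach ($handles as $key => $ch) {
        // Body as returned by curl; inspect curl_getinfo($ch) if you need
        // the HTTP status of an individual transfer.
        $bodies[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);

    return $bodies;
}
```

With this shape the total time is roughly that of the slowest page in a batch rather than the sum of all pages, though the remote server may not appreciate a large batch size.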

ETA: symcbean explained how to detach the background process properly here.

Answer 4

Answered by jamieb

No, PHP is not the best solution.

I'm not sure about Ruby or Perl, but with Python you could rewrite your page scraper to be multi-threaded and it would probably run at least 20x faster. Writing multi-threaded apps can be somewhat of a challenge, but the very first Python app I wrote was a multi-threaded page scraper. And you could simply call the Python script from within your PHP page by using one of the shell execution functions.

Answer 5

Answered by aljo f

Yes, you can do it in PHP. But in addition to PHP it would be wise to use a Queue Manager. Here's the strategy:

Break up your large task into smaller tasks. In your case, each task could be loading a single page.
Send each small task to the queue.
Run your queue workers somewhere.
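
The three steps above can be sketched as follows. SplQueue is only an in-process stand-in for a real queue manager, and enqueue_pages, run_worker and the task layout are hypothetical; a real setup would publish to a broker and run the worker as a separate process.

```php
<?php
/** Steps 1 and 2: one small task per page, each sent to the queue. */
function enqueue_pages(SplQueue $queue, array $urls): void
{
    foreach ($urls as $url) {
        $queue->enqueue(['url' => $url, 'attempts' => 0]);
    }
}

/**
 * Step 3: a worker drains the queue. A task that throws is put back on the
 * queue (up to $maxAttempts tries) instead of restarting the whole run.
 * Returns the URLs that were processed successfully, in completion order.
 */
function run_worker(SplQueue $queue, callable $scrapePage, int $maxAttempts = 3): array
{
    $done = [];
    while (!$queue->isEmpty()) {
        $task = $queue->dequeue();
        try {
            $scrapePage($task['url']);
            $done[] = $task['url'];
        } catch (Throwable $e) {
            $task['attempts']++;
            if ($task['attempts'] < $maxAttempts) {
                $queue->enqueue($task); // retry later, after the other tasks
            }
        }
    }

    return $done;
}
```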

Using this strategy has the following advantages:

For long running tasks it has the ability to recover in case a fatal problem occurs in the middle of the run -- no need to start from the beginning.
If your tasks do not have to be run sequentially, you can run multiple workers to run tasks simultaneously.

You have a variety of options (these are just a few):

RabbitMQ (https://www.rabbitmq.com/tutorials/tutorial-one-php.html)
ZeroMQ (http://zeromq.org/bindings:php)
If you're using the Laravel framework, queues are built-in (https://laravel.com/docs/5.4/queues), with drivers for AWS SQS, Redis, Beanstalkd

Answer 6

Answered by daotoad

PHP may or may not be the best tool, but you know how to use it, and the rest of your application is written using it. These two qualities, combined with the fact that PHP is "good enough", make a pretty strong case for using it instead of Perl, Ruby, or Python.

If your goal is to learn another language, then pick one and use it. Any language you mentioned will do the job, no problem. I happen to like Perl, but what you like may be different.

Symcbean has some good advice about how to manage background processes at his link.

In short, write a CLI PHP script to handle the long bits. Make sure that it reports status in some way. Make a PHP page to handle status updates, either using AJAX or traditional methods. Your kickoff script will then start the process running in its own session, and return confirmation that the process is going.
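
One possible shape for the "reports status in some way" part, assuming the CLI script and the status page can both reach the same file on disk. The function names and the JSON layout are invented for this sketch; a database row would work just as well.

```php
<?php
/** Called by the CLI script as it makes progress. */
function write_status(string $statusFile, int $percent, string $message): void
{
    $payload = json_encode([
        'percent' => $percent,
        'message' => $message,
        'updated' => time(),
    ]);

    // Write to a temporary file and rename it into place, so the status page
    // never reads a half-written file.
    $tmp = $statusFile . '.tmp';
    file_put_contents($tmp, $payload);
    rename($tmp, $statusFile);
}

/** Called by the status page; returns null if nothing has been reported yet. */
function read_status(string $statusFile): ?array
{
    if (!is_file($statusFile)) {
        return null;
    }
    $data = json_decode((string) file_get_contents($statusFile), true);

    return is_array($data) ? $data : null;
}
```

The status page would then just echo json_encode(read_status($file)) for an AJAX poller, or render it into HTML for a plain page refresh.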

Good luck.

Answer 7

Answered by JAL

You can send it as an XHR (Ajax) request. Clients don't usually have any timeout for XHRs, unlike normal HTTP requests.

Answer 8

Answered by Francisco Luz

I realize this is quite an old question but would like to give it a shot. This script tries to address both concerns: letting the initial kick-off call finish quickly, and chopping the heavy load into smaller chunks. I haven't tested this solution.

<?php
/**
 * crawler.php located at http://mysite.com/crawler.php
 */

// Make sure this script keeps on running after the client closes the
// connection with it.
ignore_user_abort(TRUE);


function get_remote_sources_to_crawl() {
  // Do a database or a log file query here.

  $query_result = array (
    1 => 'http://exemple.com',
    2 => 'http://exemple1.com',
    3 => 'http://exemple2.com',
    4 => 'http://exemple3.com',
    // ... and so on.
  );

  // Return the first record on the list as array($id, $url), so the caller
  // can mark it as done afterwards.
  foreach ($query_result as $id => $url) {
    return array($id, $url);
  }
  return FALSE;
}

function update_remote_sources_to_crawl($id) {
  // Update the database or log file list so the $id record won't show up
  // on the next call to get_remote_sources_to_crawl().
}

$crawling_source = get_remote_sources_to_crawl();

if ($crawling_source) {
  list($id, $url) = $crawling_source;

  // Run your scraping code on $url here, and set this placeholder flag to
  // TRUE once the page has been scraped successfully.
  $your_scraping_has_finished = FALSE;

  if ($your_scraping_has_finished) {
    // Update your database or log file.
    update_remote_sources_to_crawl($id);

    $ctx = stream_context_create(array(
      'http' => array(
        // I am not quite sure but I reckon the timeout set here actually
        // starts rolling after the connection to the remote server is made
        // limiting only how long the downloading of the remote content should take.
        // So as we are only interested to trigger this script again, 5 seconds
        // should be plenty of time.
        'timeout' => 5,
      )
    ));

    // Open a new connection to this script and close it after 5 seconds in.
    file_get_contents('http://' . $_SERVER['HTTP_HOST'] . '/crawler.php', FALSE, $ctx);

    print 'The cronjob kick off has been initiated.';
  }
}
else {
  print 'Yay! The whole thing is done.';
}

Answer 9

Answered by YudhiWidyatama

I would like to propose a solution that is a little different from symcbean's, mainly because I have an additional requirement that the long-running process needs to run as another user, and not as the apache / www-data user.

First solution using cron to poll a background task table:

PHP web page inserts into a background task table, state 'SUBMITTED'
cron runs once every 3 minutes, as another user, running a PHP CLI script that checks the background task table for 'SUBMITTED' rows
PHP CLI will update the state column in the row into 'PROCESSING' and begin processing, after completion it will be updated to 'COMPLETED'
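
The state transitions in this first solution might look roughly like the sketch below. The table name background_task, its columns, and both function names are assumptions made for illustration, and the PDO connection is expected to use PDO::ERRMODE_EXCEPTION.

```php
<?php
/**
 * Claims the oldest 'SUBMITTED' task by moving it to 'PROCESSING'.
 * Returns the task id, or null if there is nothing to do. The WHERE clause
 * re-checks the state, so two cron runs cannot both claim the same row.
 */
function claim_next_task(PDO $db): ?int
{
    $id = $db->query(
        "SELECT id FROM background_task WHERE state = 'SUBMITTED' ORDER BY id LIMIT 1"
    )->fetchColumn();
    if ($id === false) {
        return null;
    }

    $claim = $db->prepare(
        "UPDATE background_task SET state = 'PROCESSING' WHERE id = ? AND state = 'SUBMITTED'"
    );
    $claim->execute([$id]);

    return $claim->rowCount() === 1 ? (int) $id : null;
}

/** Marks a task as finished once the long-running work has completed. */
function complete_task(PDO $db, int $id): void
{
    $db->prepare("UPDATE background_task SET state = 'COMPLETED' WHERE id = ?")
       ->execute([$id]);
}
```

The CLI script run by cron would call claim_next_task, do the work for the returned id, then call complete_task.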

Second solution using Linux inotify facility:

PHP web page updates a control file with the parameters set by the user, and also assigns a task id
shell script (as a non-www user) running inotifywait will wait for the control file to be written
after the control file is written, a close_write event will be raised and the shell script will continue
shell script executes PHP CLI to do the long running process
PHP CLI writes the output to a log file identified by task id, or alternatively updates progress in a status table
PHP web page could poll the log file (based on task id) to show progress of the long running process, or it could also query status table

Some additional info can be found in my post: http://inventorsparadox.blogspot.co.id/2016/01/long-running-process-in-linux-using-php.html

Answer 10

Answered by Jacob

I agree with the answers that say this should be run in a background process. But it's also important that you report on the status so the user knows that the work is being done.

When receiving the PHP request to kick off the process, you could store in a database a representation of the task with a unique identifier. Then, start the screen-scraping process, passing it the unique identifier. Report back to the iPhone app that the task has been started and that it should check a specified URL, containing the new task ID, to get the latest status. The iPhone application can now poll (or even "long poll") this URL. In the meantime, the background process would update the database representation of the task as it worked with a completion percentage, current step, or whatever other status indicators you'd like. And when it has finished, it would set a completed flag.
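
As a rough illustration of the kick-off response, the sketch below generates a unique task ID and the URL the client should poll. The function name, the status.php path and the field names are all hypothetical; storing the task and starting the background process are left out.

```php
<?php
/**
 * Builds the data the kick-off request sends back to the client: a unique
 * task id plus the URL to poll for status.
 */
function create_task(string $baseUrl): array
{
    // 16 random bytes give a 32-character hex id that is hard to guess.
    $taskId = bin2hex(random_bytes(16));

    return [
        'task_id'    => $taskId,
        'state'      => 'started',
        'status_url' => rtrim($baseUrl, '/') . '/status.php?task=' . $taskId,
    ];
}
```

The kick-off page would store the task, start the background process with the task id, and echo json_encode(create_task(...)) back to the app.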
