PHP: Manipulate a string that is 30 million characters long

Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/1342583/


Manipulate a string that is 30 million characters long

php, memory-management

Asked by JD Isaacks

I am downloading a CSV file from another server as a data feed from a vendor.

I am using curl to get the contents of the file and save them into a variable called $contents.

I can get to that part just fine, but I tried exploding by \r and \n to get an array of lines, and it fails with an 'out of memory' error.

I echo strlen($contents) and it's about 30.5 million chars.

I need to manipulate the values and insert them into a database. What do I need to do to avoid memory allocation errors?

Answered by Pascal MARTIN

As other answers have said:

  • you can't have all that in memory
  • a solution would be to use CURLOPT_FILE

But you might not want to actually create a file; you might prefer to work with the data in memory, using it as soon as it "arrives".

One possible solution is defining your own stream wrapper, and using it, instead of a real file, with CURLOPT_FILE.

First of all, see the PHP manual pages on stream_wrapper_register and the streamWrapper class.

And now, let's go with an example.

First, let's create our stream wrapper class:

class MyStream {
    protected $buffer;

    function stream_open($path, $mode, $options, &$opened_path) {
        // Has to be declared, it seems...
        return true;
    }

    public function stream_write($data) {
        // Extract the lines; in my tests, data was 8192 bytes long, never more
        $lines = explode("\n", $data);

        // The buffer contains the end of the last line from the previous call
        // => it goes at the beginning of the first line we are getting this time
        $lines[0] = $this->buffer . $lines[0];

        // And the last line is only partial
        // => save it for next time, and remove it from the list this time
        $nb_lines = count($lines);
        $this->buffer = $lines[$nb_lines-1];
        unset($lines[$nb_lines-1]);

        // Here, do your work with the lines you have in the buffer
        var_dump($lines);
        echo '<hr />';

        return strlen($data);
    }
}

What I do is:

  • work on the chunks of data (I use var_dump, but you'd do your usual stuff instead) as they arrive
  • note that you don't get "full lines": the end of a line is the beginning of a chunk, and the beginning of that same line was at the end of the previous chunk; so you have to keep some parts of a chunk between the calls to stream_write


Next, we register this stream wrapper, to be used with the pseudo-protocol "test":

// Register the wrapper
stream_wrapper_register("test", "MyStream")
    or die("Failed to register protocol");


And now we do our curl request, like we would when writing to a "real" file, as other answers suggested:

// Open the "file"
$fp = fopen("test://MyTestVariableInMemory", "r+");

// Configuration of curl
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.rue89.com/");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_BUFFERSIZE, 256);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FILE, $fp);    // Data will be sent to our stream ;-)

curl_exec($ch);

curl_close($ch);

// Don't forget to close the "file" / stream
fclose($fp);
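
One thing the class above does not do is flush the final partial line: when the transfer ends, the last line of the feed is still sitting in $this->buffer. A minimal sketch of one way to handle that, assuming you add this method to MyStream: PHP calls stream_close() when fclose() is invoked on the stream, so the leftover buffer can be processed there.

    public function stream_close() {
        // The last chunk usually ends mid-line; whatever remains in
        // the buffer when the stream is closed is the final line.
        if ($this->buffer !== null && $this->buffer !== '') {
            var_dump($this->buffer); // or your usual processing
        }
        $this->buffer = '';
    }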

Note that we don't work with a real file, but with our pseudo-protocol.


This way, each time a chunk of data arrives, the MyStream::stream_write method will get called and will be able to work on a small amount of data (when I tested, I always got 8192 bytes, whatever value I used for CURLOPT_BUFFERSIZE).


A few notes:

  • You need to test this more than I did, obviously
  • my stream_write implementation will probably not work if lines are longer than 8192 bytes; up to you to patch it ;-)
  • It's only meant as a few pointers, and not a fully-working solution: you have to test (again), and probably code a bit more!

Still, I hope this helps ;-)
Have fun!

Answered by Alan Storm

PHP is choking because it's running out of memory. Instead of having curl populate a PHP variable with the contents of the file, use the

CURLOPT_FILE

option to save the file to disk instead.

//pseudo, untested code to give you the idea

$fp = fopen('path/to/save/file', 'w');
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_exec($ch);
curl_close($ch);
fclose($fp);

Then, once the file is saved, instead of using the file or file_get_contents functions (which would load the entire file into memory, killing PHP again), use fopen and fgets to read the file one line at a time.
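
For instance, a minimal sketch of that read loop (the path is the same placeholder as above, and str_getcsv() just stands in for whatever per-row processing you need):

$fp = fopen('path/to/save/file', 'r');
while (($line = fgets($fp)) !== false) {
    // Only one line of the feed is in memory at any moment
    $fields = str_getcsv($line); // parse this CSV row into an array
    // ... validate $fields and insert them into the database ...
}
fclose($fp);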

Answered by Googol

Darren Cook's comment on Pascal MARTIN's response is really interesting. In modern PHP+curl versions, the CURLOPT_WRITEFUNCTION option can be set so curl invokes a callback for each received "chunk" of data. Specifically, the callable will receive two parameters: the first one is the invoking curl object, and the second one is the data chunk. The function should return strlen($data) in order for curl to continue sending more data.

Callables can be methods in PHP. Using all this, I've developed a possible solution that I find more readable than the previous one (although Pascal MARTIN's response is really great, things have changed since then). I've used public attributes for simplicity, but I'm sure readers could adapt and improve the code. You can even abort the curl request when a number of lines (or bytes) has been reached. I hope this will be useful for others.

<?php
class SplitCurlByLines {

    public function curlCallback($curl, $data) {

        $this->currentLine .= $data;
        $lines = explode("\n", $this->currentLine);
        // The last line could be unfinished. We should not
        // process it yet.
        $numLines = count($lines) - 1;
        $this->currentLine = $lines[$numLines]; // Save for the next callback.

        for ($i = 0; $i < $numLines; ++$i) {
            $this->processLine($lines[$i]); // Do whatever you want
            ++$this->totalLineCount; // Statistics.
            $this->totalLength += strlen($lines[$i]) + 1;
        }
        return strlen($data); // Ask curl for more data (!= value will stop).

    }

    public function processLine($str) {
        // Do what ever you want (split CSV, ...).
        echo $str . "\n";
    }

    public $currentLine = '';
    public $totalLineCount = 0;
    public $totalLength = 0;

} // SplitCurlByLines

// Just for testing, I will echo the content of the Stackoverflow
// main page. To avoid artifacts, I will inform the browser about
// the plain text MIME type, so the source code should be visible.
header('Content-type: text/plain');

$splitter = new SplitCurlByLines();

// Configuration of curl
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://stackoverflow.com/");
curl_setopt($ch, CURLOPT_WRITEFUNCTION, array($splitter, 'curlCallback'));

curl_exec($ch);

// Process the last line.
$splitter->processLine($splitter->currentLine);

curl_close($ch);

error_log($splitter->totalLineCount . " lines; " .
 $splitter->totalLength . " bytes.");
?>
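
As a sketch of the abort idea mentioned above: returning any value other than strlen($data) from the write callback makes curl stop the transfer (curl_exec() will then return false with a "write error", which is expected here). The 1000-line cap below is an arbitrary example; the check would go at the end of curlCallback():

    // Inside curlCallback(), after the processing loop:
    if ($this->totalLineCount >= 1000) {
        return 0; // != strlen($data), so curl aborts the download
    }
    return strlen($data); // otherwise, keep receiving data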

Answered by Sebastian Paaske Tørholm

You might want to consider saving it to a temporary file, and then reading it one line at a time using fgets or fgetcsv.

This way you avoid the initial big array that you get from exploding such a large string.
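
A minimal sketch of that approach (the temporary path is just an example); fgetcsv() reads and parses a single CSV row per call, so only one line's fields are ever in memory:

$fp = fopen('/tmp/vendor-feed.csv', 'r');
while (($row = fgetcsv($fp)) !== false) {
    // $row is an array of the fields on this line
    // ... insert $row into the database here ...
}
fclose($fp);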

Answered by pingw33n

  1. Increase memory_limit in php.ini.
  2. Read data using fopen() and fgets().
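
If you can't edit php.ini, a one-line sketch of the same idea at runtime ('256M' is an arbitrary example value, and some hosts disallow raising the limit):

ini_set('memory_limit', '256M');

Note that raising the limit only postpones the problem for bigger feeds; the fopen()/fgets() approach scales regardless of file size.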

Answered by Daniel Pryden

Spool it to a file. Don't try to hold all that data in memory at once.

Answered by Mark White

NB:

"Basically, if you open a file with fopen, fclose it and then unlink it, it works fine. But if between fopen and fclose, you give the file handle to cURL to do some writing into the file, then the unlink fails. Why this is happening is beyond me. I think it may be related to Bug #48676"

“基本上,如果你用 fopen 打开一个文件,fclose 它然后取消链接它,它工作正常。但是如果在 fopen 和 fclose 之间,你给 cURL 文件句柄来对文件做一些写入,那么取消链接会失败。为什么这种情况超出了我的范围。我认为这可能与错误 #48676 相关”

http://bugs.php.net/bug.php?id=49517

So be careful if you're on an older version of PHP. There is a simple fix on this page: double-close the file resource:

fclose($fp);
if (is_resource($fp))
    fclose($fp);
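
A sketch of the workaround in context (the URL and path are just examples), for the CURLOPT_FILE download-then-unlink flow where the bug bites:

$ch = curl_init('http://example.com/feed.csv');
$fp = fopen('/tmp/vendor-feed.csv', 'w');
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_exec($ch);
curl_close($ch);

fclose($fp);
if (is_resource($fp))  // on affected versions, the handle can survive
    fclose($fp);       // the first fclose() after curl wrote to it

unlink('/tmp/vendor-feed.csv'); // the unlink should now succeed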