Importing large CSV files in MySQL using Laravel
Disclaimer: this page is an English-Chinese translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/34502749/
Asked by dmotors
I have a csv file that can range from 50k to over 100k rows of data.
I'm currently using Laravel w/ Laravel Forge, MySQL, and Maatwebsite Laravel Excel package.
This is to be used by an end-user and not myself so I have created a simple form on my blade view as such:
{!! Form::open(array(
    'route' => 'import.store',
    'class' => 'form',
    'id' => 'upload',
    'novalidate' => 'novalidate',
    'files' => true)) !!}
    <div class="form-group">
        <h3>CSV Product Import</h3>
        {!! Form::file('upload_file', array('class' => 'file')) !!}
    </div>
    <div class="form-group">
        {!! Form::submit('Upload Products', array('class' => 'btn btn-success')) !!}
    </div>
{!! Form::close() !!}
This successfully stores the file on the server, and I can now iterate through the results with something like a foreach loop.
Here are the issues I've faced, in chronological order, along with the fixes/attempts (10k-row test CSV file):
- [issue] PHP times out.
- [remedy] Changed it to run asynchronously via a job command.
- [result] Imports up to 1500 rows.
- [issue] Server runs out of memory.
- [remedy] Added a swap drive of 1GB.
- [result] Imports up to 3000 rows.
- [issue] Server runs out of memory.
- [remedy] Turned on chunking of the results, 250 rows per chunk (see the sketch after this list).
- [result] Imports up to 5000 rows.
- [issue] Server runs out of memory.
- [remedy] Removed some transposing/joined-tables logic.
- [result] Imports up to 7000 rows.
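For reference, the chunked read looks roughly like this (a minimal sketch using the Laravel Excel 2.x chunk filter; the callback body is trimmed down to the part that matters):

// Assumes the Maatwebsite Excel facade; $filePath points to the stored upload.
Excel::filter('chunk')->load($filePath)->chunk(250, function ($results) {
    foreach ($results as $row) {
        // Each $row exposes the CSV columns by their headings, e.g. $row->sku.
        // This is where the per-row validation and inserts happen.
    }
});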
As you can see, the gains are marginal and nowhere near 50k; I can barely even get near 10k.
I've read up and looked into possible suggestions such as:
- Use a raw query to run LOAD DATA LOCAL INFILE.
- Split files before importing.
- Store on server then have server split into files and have a cron process them.
- Upgrade my 512mb DO droplet to 1gb as a last resort.
Going with LOAD DATA LOCAL INFILE may not work because my header columns can change from file to file, which is why I have logic to process/iterate through them.
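For what it's worth, the raw route would look roughly like the sketch below, with the column list built from the header row (assuming local_infile is enabled on the MySQL server and PDO::MYSQL_ATTR_LOCAL_INFILE is set on the connection, and that the headers map one-to-one onto table columns, which in my case they don't):

$handle = fopen($filePath, 'r');
$headers = fgetcsv($handle);   // e.g. ['profile_id', 'category_id', 'sku', ...]
fclose($handle);

// Build the column list from whatever headers this particular file has.
$columns = implode(', ', array_map(function ($h) {
    return '`' . str_replace('`', '', $h) . '`';
}, $headers));

$pdo = DB::connection()->getPdo();
$sql = "LOAD DATA LOCAL INFILE " . $pdo->quote($filePath) . "
    INTO TABLE inventory
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
    ($columns)";
$pdo->exec($sql);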
Splitting files before importing is fine under 10k but for 50k or more? That would be highly impractical.
Store on the server, then have the server split it and run the pieces individually without troubling the end-user? Possibly, but I'm not even sure how to achieve this in PHP yet; I've only briefly read about it.
Also worth noting: my queue worker is set to time out after 10000 seconds, which is also very impractical and bad practice, but it seemed to be the only way it would keep running before memory takes a hit.
Now I could give in and just upgrade the memory to 1GB, but I feel that at best it would get me to 20k rows before it fails again. Something needs to process all these rows quickly and efficiently.
Lastly here is a glimpse of my table structure:
Inventory
+----+------------+-------------+-------+---------+
| id | profile_id | category_id | sku | title |
+----+------------+-------------+-------+---------+
| 1 | 50 | 51234 | mysku | mytitle |
+----+------------+-------------+-------+---------+
Profile
+----+---------------+
| id | name |
+----+---------------+
| 50 | myprofilename |
+----+---------------+
Category
+----+------------+--------+
| id | categoryId | name |
+----+------------+--------+
| 1 | 51234 | brakes |
+----+------------+--------+
Specifics
+----+---------------------+------------+-------+
| id | specificsCategoryId | categoryId | name |
+----+---------------------+------------+-------+
| 1 | 20 | 57357 | make |
| 2 | 20 | 57357 | model |
| 3 | 20 | 57357 | year |
+----+---------------------+------------+-------+
SpecificsValues
+----+-------------+-------+--------+
| id | inventoryId | name | value |
+----+-------------+-------+--------+
| 1 | 1 | make | honda |
| 2 | 1 | model | accord |
| 3 | 1 | year | 1998 |
+----+-------------+-------+--------+
Full CSV Sample
+----+------------+-------------+-------+---------+-------+--------+------+
| id | profile_id | category_id | sku | title | make | model | year |
+----+------------+-------------+-------+---------+-------+--------+------+
| 1 | 50 | 51234 | mysku | mytitle | honda | accord | 1998 |
+----+------------+-------------+-------+---------+-------+--------+------+
So a quick run-through of my logic workflow as simple as possible would be:
- Load the file into Maatwebsite/Laravel-Excel and iterate through it in a chunked loop.
- Check whether category_id and sku are empty; if so, ignore the row and log an error to an array.
- Look up category_id and pull all relevant column fields from all related tables it uses, then, if nothing is null, insert into the database (roughly sketched after this list).
- Generate a custom title using more logic, based on the fields available in the file.
- Rinse and repeat.
- Lastly, export the errors array into a file and log it into a database for download, so the errors can be viewed at the end.
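A rough per-row sketch of the validation, lookup, and insert steps (table and column names are guessed from the samples below; the importRow helper and the error handling are hypothetical simplifications):

function importRow(array $row, array &$errors)
{
    // Skip rows with a missing category_id or sku and log the error.
    if (empty($row['category_id']) || empty($row['sku'])) {
        $errors[] = 'Missing category_id or sku: ' . json_encode($row);
        return;
    }

    // Look up the category; bail out if it does not exist.
    $category = DB::table('category')->where('categoryId', $row['category_id'])->first();
    if (is_null($category)) {
        $errors[] = "Unknown category {$row['category_id']} for sku {$row['sku']}";
        return;
    }

    // Insert the inventory row (the real code builds a custom title here).
    $inventoryId = DB::table('inventory')->insertGetId(array(
        'profile_id'  => $row['profile_id'],
        'category_id' => $row['category_id'],
        'sku'         => $row['sku'],
        'title'       => $row['title'],
    ));

    // Insert one SpecificsValues row per specific present in the CSV.
    foreach (array('make', 'model', 'year') as $name) {
        if (!empty($row[$name])) {
            DB::table('specificsvalues')->insert(array(
                'inventoryId' => $inventoryId,
                'name'        => $name,
                'value'       => $row[$name],
            ));
        }
    }
}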
I hope someone can share some insight on possible ways to tackle this, keeping in mind that I'm using Laravel, and that it's not a simple upload: each line needs to be processed and put into different related tables, otherwise I'd LOAD DATA INFILE it all at once.
Thanks!
Answered by Daniel Castro
You seem to have already figured out the logic for interpreting the CSV lines and converting them to insert queries on the database, so I will focus on the memory exhaustion issue.
When working with large files in PHP, any approach that loads the entire file into memory will either fail, become unbearably slow, or require a lot more RAM than your Droplet has.
So my advice is:
Read the file line by line using fgetcsv
$handle = fopen('file.csv', 'r');
if ($handle) {
    while (($line = fgetcsv($handle)) !== false) {
        // Process this line and save to database
    }
    fclose($handle);
}
This way only one row at a time will be loaded to memory. Then, you can process it, save to the database, and overwrite it with the next one.
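Since your header columns can change per file, you can read the first row as the header and map every following row onto it (a small sketch; it assumes each data row has the same number of columns as the header and skips rows that don't):

$handle = fopen('file.csv', 'r');
$headers = fgetcsv($handle);                    // first row holds the column names

while (($line = fgetcsv($handle)) !== false) {
    if (count($line) !== count($headers)) {
        continue;                               // skip malformed rows
    }
    $row = array_combine($headers, $line);      // e.g. ['sku' => 'mysku', 'title' => ...]
    // Process $row and save to database
}

fclose($handle);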
Keep a separate file handle for logging
Your server is short on memory, so logging errors to an array may not be a good idea as all errors will be kept in it. That can become a problem if your csv has lots of entries with empty skus and category ids.
Laravel ships with Monolog out of the box, and you can try to adapt it to your needs. However, if it also ends up using too many resources or not fitting your needs, a simpler approach may be the solution.
$log = fopen('log.txt', 'w');
// Inside the processing loop, whenever a row fails validation:
if ($rowHasError) {
    fwrite($log, $text . PHP_EOL);
}
// When the import is finished:
fclose($log);
Then, at the end of the script you can store the log file wherever you want.
Disable Laravel's query log
Laravel keeps all your queries stored in memory, and that's likely to be a problem for your application. Luckily, you can use the disableQueryLog method to free some precious RAM.
DB::connection()->disableQueryLog();
Use raw queries if needed
I think it's unlikely that you will run out of memory again if you follow these tips, but you can always sacrifice some of Laravel's convenience to extract that last drop of performance.
If you know your way around SQL, you can execute raw queries to the database.
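For example, instead of issuing one INSERT per row, you could batch each chunk into a single raw statement (a sketch; the SpecificsValues table comes from your samples, while $chunk is a hypothetical array of rows already validated and mapped):

$placeholders = array();
$bindings = array();

foreach ($chunk as $row) {
    $placeholders[] = '(?, ?, ?)';
    array_push($bindings, $row['inventoryId'], $row['name'], $row['value']);
}

DB::insert(
    'INSERT INTO SpecificsValues (inventoryId, name, value) VALUES ' . implode(', ', $placeholders),
    $bindings
);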
Edit:
As for the timeout issue, you should be running this code as a queued task, as suggested in the comments, regardless. Inserting that many rows WILL take some time (especially if you have lots of indexes) and the user shouldn't be staring at an unresponsive page for that long.
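A minimal sketch of such a job (Laravel 5.x style; the ImportCsv class name and its wiring are hypothetical, and the line-by-line processing is the same as above):

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Support\Facades\DB;

class ImportCsv implements ShouldQueue
{
    use InteractsWithQueue, Queueable;

    protected $path;

    public function __construct($path)
    {
        $this->path = $path;   // path to the already-stored upload
    }

    public function handle()
    {
        DB::connection()->disableQueryLog();

        $handle = fopen($this->path, 'r');
        $headers = fgetcsv($handle);

        while (($line = fgetcsv($handle)) !== false) {
            // Map $line onto $headers, validate, and insert as shown above.
        }

        fclose($handle);
    }
}

// Dispatched from the controller once the upload has been stored:
dispatch(new ImportCsv($storedPath));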