
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must attribute it to the original authors (not me) and link the original: http://stackoverflow.com/questions/1066264/


MySQL Inserting large data sets from file with Java

Tags: java, mysql

Asked by Derek Organ

I need to insert about 1.8 million rows from a CSV file into a MySQL database. (only one table)

Currently using Java to parse through the file and insert each line.

As you can imagine this takes quite a few hours to run (roughly 10).

The reason I'm not piping it straight in from the file into the db is that the data has to be manipulated before it's added to the database.

This process needs to be run by an IT manager there, so I've set it up as a nice batch file for them to run after they drop the new CSV file into the right location. So I need to make this work nicely by dropping the file into a certain location and running a batch file. (Windows environment)

My question is: what would be the fastest way to insert this much data? Large inserts from a temp parsed file, or one insert at a time? Some other idea, possibly?

The second question is: how can I optimize my MySQL installation to allow very quick inserts? (There will be a point where a large select of all the data is required as well.)

Note: the table will eventually be dropped and the whole process run again at a later date.

Some clarification: currently using ...opencsv.CSVReader to parse the file, then doing an insert on each line. I'm concatenating some columns, though, and ignoring others.

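For context, here is a minimal sketch of that row-by-row approach (the com.opencsv package name assumes a recent opencsv release, and the column handling is hypothetical):

    import com.opencsv.CSVReader;
    import java.io.FileReader;

    public class RowByRowLoader {
        public static void main(String[] args) throws Exception {
            try (CSVReader reader = new CSVReader(new FileReader("input.csv"))) {
                String[] line;
                while ((line = reader.readNext()) != null) {
                    // Concatenate some columns, ignore the others...
                    String merged = line[0] + " " + line[1];
                    // ...then fire one INSERT per row. This per-row round trip
                    // to the server is what makes the job take hours.
                }
            }
        }
    }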

More clarification: local DB, MyISAM table.

Answered by Vinko Vrsalovic

Tips for fast insertion:

  • Use the LOAD DATA INFILE syntax to let MySQL parse it and insert it, even if you have to mangle the data and feed it in after the manipulation (see the sketch after this list).
  • Use this insert syntax:

    insert into table (col1, col2) values (val1, val2), (val3, val4), ...

  • Remove all keys/indexes prior to insertion.

  • Do it on the fastest machine you've got (mainly IO-wise, but RAM and CPU also matter). That goes for both the DB server and the inserting client; remember you'll be paying the IO price twice (once reading, once inserting).
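
A hedged sketch combining the LOAD DATA and key-removal tips via JDBC; the table name, file path, and credentials are placeholders, and LOAD DATA LOCAL INFILE additionally needs allowLoadLocalInfile=true on the Connector/J URL (plus local_infile enabled on the server):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class FastLoad {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/mydb?allowLoadLocalInfile=true",
                    "user", "pass");
                 Statement st = con.createStatement()) {
                st.execute("ALTER TABLE my_table DISABLE KEYS"); // MyISAM: defer index maintenance
                st.execute("LOAD DATA LOCAL INFILE 'massaged.csv' INTO TABLE my_table"
                        + " FIELDS TERMINATED BY ','");
                st.execute("ALTER TABLE my_table ENABLE KEYS");  // rebuild indexes in one pass
            }
        }
    }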

Answered by Hardwareguy

I'd probably pick a large number, like 10k rows, load that many rows from the CSV, massage the data, do a batch update, then repeat until you've gone through the entire CSV. Depending on the massaging and the amount of data, 1.8 million rows shouldn't take 10 hours; more like 1-2, depending on your hardware.

Edit: whoops, I left out a fairly important part: your con has to have autocommit set to false. The code I copied this from did it as part of the GetConnection() method.

    Connection con = GetConnection();
    con.setAutoCommit(false); // essential: otherwise every row commits individually
    try {
        PreparedStatement ps = con.prepareStatement(
                "INSERT INTO table(col1, col2) VALUES(?, ?)");
        try {
            for (Data d : massagedData) {
                ps.setString(1, d.whatever());
                ps.setString(2, d.whatever2());
                ps.addBatch(); // queue the row rather than executing it now
            }
            ps.executeBatch(); // send all queued rows to the server
            con.commit();      // autocommit is off, so commit explicitly
        } finally {
            ps.close();
        }
    } finally {
        con.close();
    }
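
To get the 10k-row chunking described above, a variant of the loop flushes and commits each batch as it fills; the batch size is a starting guess to tune, and con, ps, and massagedData are as in the snippet above:

    int count = 0;
    for (Data d : massagedData) {
        ps.setString(1, d.whatever());
        ps.setString(2, d.whatever2());
        ps.addBatch();
        if (++count % 10000 == 0) { // flush and commit every 10k rows
            ps.executeBatch();
            con.commit();
        }
    }
    ps.executeBatch(); // flush the remainder
    con.commit();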

Answered by Thorbjørn Ravn Andersen

Are you absolutely CERTAIN you have disabled auto-commit in the JDBC driver?

This is the typical performance killer for JDBC clients.

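For reference, a minimal sketch of the setting in question (URL and credentials are placeholders):

    Connection con = DriverManager.getConnection(
            "jdbc:mysql://localhost/mydb", "user", "pass");
    con.setAutoCommit(false); // otherwise every INSERT is its own transaction
    // ... batched inserts ...
    con.commit();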

Answered by Brian

You can improve bulk INSERT performance from MySQL / Java by using the batching capability in its Connector/J JDBC driver.

MySQL doesn't "properly" handle batches (see my article link, bottom), but the driver can rewrite INSERTs to make use of MySQL's quirky multi-row syntax; e.g. you can tell the driver to rewrite two INSERTs:

INSERT INTO mytable (col1, col2) VALUES ('val1', 'val2');
INSERT INTO mytable (col1, col2) VALUES ('val3', 'val4');

as a single statement:

INSERT INTO mytable (col1, col2) VALUES ('val1', 'val2'), ('val3', 'val4');

(Note that I'm not saying you need to rewrite your SQL in this way; the driver does it when it can)

We did this for a bulk insert investigation of our own: it made an order-of-magnitude difference. Used with explicit transactions as mentioned by others, you'll see a big improvement overall.

The relevant driver property setting is:

jdbc:mysql:///<dbname>?rewriteBatchedStatements=true
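
A hedged sketch of wiring that property up (host, database, and credentials are placeholders); the batching itself works exactly as in the PreparedStatement example above:

    Connection con = DriverManager.getConnection(
            "jdbc:mysql://localhost/mydb?rewriteBatchedStatements=true",
            "user", "pass");
    // addBatch()/executeBatch() on this connection can now be rewritten
    // into multi-row INSERTs by the driver.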

See: A 10x Performance Increase for Batch INSERTs With MySQL Connector/J Is On The Way

Answered by Pierre

Another idea: do you use a PreparedStatement for inserting your data with JDBC?

Answered by Roee Adler

You should really use LOAD DATA on the MySQL console itself for this and not work through the code...

LOAD DATA INFILE 'data.txt' INTO TABLE db2.my_table;

If you need to manipulate the data, I would still recommend manipulating it in memory, rewriting it to a flat file, and pushing it to the database using LOAD DATA; I think it should be more efficient.

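A minimal sketch of that massage-then-flat-file step; massagedRows stands in for whatever the transformation actually produces:

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.util.Arrays;
    import java.util.List;

    public class WriteFlatFile {
        public static void main(String[] args) throws Exception {
            List<String[]> massagedRows = Arrays.asList(
                    new String[] { "a", "b" }, new String[] { "c", "d" });
            try (BufferedWriter out = new BufferedWriter(new FileWriter("data.txt"))) {
                for (String[] row : massagedRows) {
                    out.write(row[0] + "\t" + row[1]); // tab is LOAD DATA's default separator
                    out.newLine();
                }
            }
            // Then: LOAD DATA INFILE 'data.txt' INTO TABLE db2.my_table;
        }
    }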

Answered by ChssPly76

Depending on what exactly you need to do with the data prior to inserting it, your best options in terms of speed are:

  • Parse the file in Java, do what you need with the data, write the "massaged" data out to a new CSV file, and use LOAD DATA INFILE on that.
  • If your data manipulation is conditional (e.g. you need to check for record existence and do different things based on whether it's an insert or an update, etc.) then (1) may be impossible. In which case you're best off doing batch inserts / updates.
    Experiment to find the best batch size working for you (starting with about 500-1000 should be ok). Depending on the storage engine you're using for your table, you may need to split this into multiple transactions as well - having a single one span 1.8M rows ain't going to do wonders for performance.
Answered by Nathan Voxland

Your biggest performance problem is most likely not Java but MySQL, in particular any indexes, constraints, and foreign keys you have on the table you are inserting into. Before you begin your inserts, make sure you disable them. Re-enabling them at the end will take a considerable amount of time, but it is far more efficient than having the database evaluate them after each statement.

You may also be seeing MySQL performance problems due to the size of your transaction. Your transaction log will grow very large with that many inserts, so performing a commit after X number of inserts (say 10,000-100,000) will help insert speed as well.

From the JDBC layer, make sure you are using the addBatch() and executeBatch() commands on your PreparedStatement rather than the normal executeUpdate().

Answered by Pierre

Wouldn't it be faster if you used LOAD DATA INFILE instead of inserting each row?

Answered by Reed

I would run three threads...

1. Reads the input file and pushes each row into a transformation queue
2. Pops from the queue, transforms the data, and pushes into a db queue
3. Pops from the db queue and inserts the data

In this manner, you can be reading data from disk while the db threads are waiting for their IO to complete, and vice versa.

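A minimal sketch of that pipeline; the file read, transformation, and insert bodies are stubbed, and poison-pill objects mark the end of each queue:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class ThreeStagePipeline {
        private static final String[] EOF = new String[0]; // poison pill

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String[]> transformQueue = new ArrayBlockingQueue<>(1000);
            BlockingQueue<String[]> dbQueue = new ArrayBlockingQueue<>(1000);

            Thread reader = new Thread(() -> {
                try {
                    for (int i = 0; i < 100; i++) { // stand-in for reading CSV rows
                        transformQueue.put(new String[] { "col" + i, "val" + i });
                    }
                    transformQueue.put(EOF);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            Thread transformer = new Thread(() -> {
                try {
                    for (String[] row; (row = transformQueue.take()) != EOF; ) {
                        dbQueue.put(new String[] { row[0] + row[1] }); // massage step
                    }
                    dbQueue.put(EOF);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            Thread writer = new Thread(() -> {
                try {
                    for (String[] row; (row = dbQueue.take()) != EOF; ) {
                        // stand-in for the addBatch()/executeBatch() insert
                        System.out.println("insert: " + row[0]);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            reader.start(); transformer.start(); writer.start();
            reader.join(); transformer.join(); writer.join();
        }
    }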