Using Hibernate's ScrollableResults to slowly read 90 million records
Original question: http://stackoverflow.com/questions/2826319/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow.
Asked by at.
I simply need to read each row in a table in my MySQL database using Hibernate and write a file based on it. But there are 90 million rows and they are pretty big. So it seemed like the following would be appropriate:
ScrollableResults results = session.createQuery("SELECT person FROM Person person")
.setReadOnly(true).setCacheable(false).scroll(ScrollMode.FORWARD_ONLY);
while (results.next())
storeInFile(results.get()[0]);
The problem is the above will try and load all 90 million rows into RAM before moving on to the while loop... and that will kill my memory with OutOfMemoryError: Java heap space exceptions :(.
So I guess ScrollableResults isn't what I was looking for? What is the proper way to handle this? I don't mind if this while loop takes days (well I'd love it to not).
I guess the only other way to handle this is to use setFirstResult and setMaxResults to iterate through the results and just use regular Hibernate results instead of ScrollableResults. That feels like it will be inefficient though and will start taking a ridiculously long time when I'm calling setFirstResult on the 89 millionth row...
UPDATE: setFirstResult/setMaxResults doesn't work, it turns out to take an unusably long time to get to the offsets like I feared. There must be a solution here! Isn't this a pretty standard procedure?? I'm willing to forgo Hibernate and use JDBC or whatever it takes.
UPDATE 2: the solution I've come up with which works ok, not great, is basically of the form:
select * from person where id > <offset> and <other_conditions> limit 1
Since I have other conditions (even with everything in an index), it's still not as fast as I'd like it to be... so I'm still open to other suggestions.
Accepted answer by Michael
Using setFirstResult and setMaxResults is your only option that I'm aware of.
Traditionally a scrollable resultset would only transfer rows to the client on an as-required basis. Unfortunately the MySQL Connector/J actually fakes it: it executes the entire query and transports it to the client, so the driver actually has the entire result set loaded in RAM and will drip-feed it to you (evidenced by your out-of-memory problems). You had the right idea; it's just a shortcoming in the MySQL Java driver.
I found no way to get around this, so went with loading large chunks using the regular setFirst/max methods. Sorry to be the bringer of bad news.
Just make sure to use a stateless session so there's no session level cache or dirty tracking etc.
EDIT:
Your UPDATE 2 approach is the best you're going to get unless you break out of MySQL Connector/J. Though there's no reason you can't raise the limit on the query. Provided you have enough RAM to hold the index, this should be a fairly cheap operation. I'd modify it slightly to grab a batch at a time, and use the highest id of that batch to grab the next batch.
Note: this will only work if other_conditions use equality (no range conditions allowed) and have the last column of the index as id.
select *
from person
where id > <max_id_of_last_batch> and <other_conditions>
order by id asc
limit <batch_size>
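To make that concrete, here is a minimal sketch of such a batching loop in Java, using a StatelessSession as recommended above. The batch size, the getId() accessor on Person, and the storeInFile call are assumptions for illustration, not the answerer's exact code:

// Sketch only: keyset-style batching. Each query is anchored on the
// highest id seen so far, so MySQL never scans past a large OFFSET.
// Add your <other_conditions> to the WHERE clause as discussed above.
StatelessSession session = sessionFactory.openStatelessSession();
try {
    long lastId = -1;
    int batchSize = 10000; // assumed value, tune for your RAM
    while (true) {
        List<Person> batch = session
            .createQuery("FROM Person p WHERE p.id > :lastId ORDER BY p.id ASC")
            .setParameter("lastId", lastId)
            .setMaxResults(batchSize)
            .list();
        if (batch.isEmpty()) {
            break; // all rows processed
        }
        for (Person person : batch) {
            storeInFile(person); // the question's per-row processing
        }
        lastId = batch.get(batch.size() - 1).getId(); // anchor for the next batch
    }
} finally {
    session.close();
}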
Answered by SteveD
With 90 million records, it sounds like you should be batching your SELECTs. I've done this with Oracle when doing the initial load into a distributed cache. Looking at the MySQL documentation, the equivalent seems to be the LIMIT clause: http://dev.mysql.com/doc/refman/5.0/en/select.html
Here's an example:
SELECT * from Person
LIMIT 200, 100
This would return rows 201 through 300 of the Person table.
You'd need to get the record count from your table first, then divide it by your batch size and work out your looping and LIMIT parameters from there.
The other benefit of this would be parallelism - you can execute multiple threads in parallel on this for faster processing.
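As a rough sketch of that idea, assuming a plain JDBC dataSource and a hypothetical batch size (and bearing in mind the question's update that large OFFSETs get slow on MySQL):

// Rough sketch: split the table into LIMIT/OFFSET ranges and process
// them on a small thread pool. Needs java.sql.* and java.util.concurrent.*;
// dataSource and the row handling are assumptions, error handling is simplified.
long totalRows = 90_000_000L; // e.g. from SELECT COUNT(*) FROM Person
int batchSize = 100_000;
ExecutorService pool = Executors.newFixedThreadPool(4);
for (long offset = 0; offset < totalRows; offset += batchSize) {
    final long start = offset;
    pool.submit(() -> {
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT * FROM Person LIMIT ?, ?")) {
            ps.setLong(1, start);
            ps.setInt(2, batchSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // process one row
                }
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    });
}
pool.shutdown();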
Processing 90 million records also doesn't sound like the sweet spot for using Hibernate.
Answered by Brian Deterling
I've used the Hibernate scroll functionality successfully before without it reading the entire result set in. Someone said that MySQL does not do true scroll cursors, but it claims to, based on the JDBC dmd.supportsResultSetType(ResultSet.TYPE_SCROLL_INSENSITIVE), and searching around suggests other people have used it. Make sure it's not caching the Person objects in the session; I've used it on SQL queries where there was no entity to cache. You can call evict at the end of the loop to be sure, or test with a SQL query. Also play around with setFetchSize to optimize the number of trips to the server.
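For reference, the dmd.supportsResultSetType check mentioned above is plain JDBC; given an open Connection it looks like this (whether the driver's claim matches its actual behavior is another matter, as the accepted answer notes):

// Ask the driver whether it claims support for scrollable result sets.
DatabaseMetaData dmd = connection.getMetaData();
boolean claimsScrollable =
        dmd.supportsResultSetType(ResultSet.TYPE_SCROLL_INSENSITIVE);
System.out.println("Driver claims scroll support: " + claimsScrollable);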
Answered by Reboot
The problem could be that Hibernate keeps references to all objects in the session until you close the session. That has nothing to do with query caching. Maybe it would help to evict() the objects from the session after you are done writing each one to the file. If they are no longer referenced by the session, the garbage collector can free the memory and you won't run out of memory anymore.
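Applied to the question's loop, that suggestion would look roughly like this; the evict call and the local variable are the only additions, everything else is from the question:

ScrollableResults results = session.createQuery("SELECT person FROM Person person")
        .setReadOnly(true).setCacheable(false).scroll(ScrollMode.FORWARD_ONLY);
while (results.next()) {
    Person person = (Person) results.get()[0];
    storeInFile(person);
    session.evict(person); // drop the session's reference so the GC can reclaim it
}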
Answered by einnocent
Actually you could have gotten what you wanted -- low-memory scrollable results with MySQL -- if you had used the answer mentioned here:
Streaming large result sets with MySQL
Note that you will have problems with Hibernate lazy-loading because it will throw an exception on any queries performed before the scroll is finished.
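The linked answer comes down to MySQL Connector/J's row-streaming mode; a bare-bones JDBC version (connection setup omitted) looks roughly like this:

// Streaming mode: forward-only, read-only, and a fetch size of
// Integer.MIN_VALUE tell Connector/J to stream rows one at a time
// instead of buffering the whole result set in RAM.
Statement stmt = conn.createStatement(
        ResultSet.TYPE_FORWARD_ONLY,
        ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
ResultSet rs = stmt.executeQuery("SELECT * FROM Person");
while (rs.next()) {
    // process one row; memory use stays flat
}
rs.close();
stmt.close();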
Answered by Haris
Set the fetch size in the query to an optimal value, as shown below.
Also, when caching is not required, it may be better to use StatelessSession.
ScrollableResults results = session.createQuery("SELECT person FROM Person person")
        .setReadOnly(true)
        .setFetchSize(1000) // <<--- !!!!
        .setCacheable(false)
        .scroll(ScrollMode.FORWARD_ONLY);
Answered by ChechuHa
FetchSize must be Integer.MIN_VALUE, otherwise it won't work.
This is taken literally from the official reference: https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html
Answered by Sean S.
You should be able to use a ScrollableResults, though it requires a few magic incantations to get working with MySQL. I wrote up my findings in a blog post (http://www.numerati.com/2012/06/26/reading-large-result-sets-with-hibernate-and-mysql/) but I'll summarize here:
"The [JDBC] documentation says:
To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
This can be done using the Query interface (this should work for Criteria as well) in version 3.2+ of the Hibernate API:
Query query = session.createQuery(hql); // hql is your HQL query string
query.setReadOnly(true);
// MIN_VALUE gives a hint to the JDBC driver to stream results
query.setFetchSize(Integer.MIN_VALUE);
ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
// iterate over results
while (results.next()) {
    Object row = results.get();
    // process row then release reference
    // you may need to evict() as well
}
results.close();
This allows you to stream over the result set; however, Hibernate will still cache results in the Session, so you'll need to call session.evict() or session.clear() every so often. If you are only reading data, you might consider using a StatelessSession, though you should read its documentation beforehand."
Answered by smalbequi
I propose, more than sample code, a query template based on Hibernate to do this workaround for you (pagination, scrolling and clearing of the Hibernate session).
It can also easily be adapted to use an EntityManager.
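The linked template is not inlined in the answer; a hedged approximation of the pattern with an EntityManager (entity name, ordering column, and page size are assumptions) might look like:

// Hypothetical sketch: page with setFirstResult/setMaxResults and clear
// the persistence context between pages so memory stays bounded.
int pageSize = 1000;
int first = 0;
List<Person> page;
do {
    page = entityManager
            .createQuery("SELECT p FROM Person p ORDER BY p.id", Person.class)
            .setFirstResult(first)
            .setMaxResults(pageSize)
            .getResultList();
    for (Person p : page) {
        // process p
    }
    entityManager.clear(); // detach the processed page
    first += pageSize;
} while (!page.isEmpty());

Note that offset-based paging is exactly what the asker found too slow on this particular table, so the keyset variant sketched under the accepted answer may be preferable at this scale.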
Answered by user2928872
Recently I worked on a problem like this and wrote a blog post about how to face it; I hope it is helpful to anyone. I use a lazy-list approach with partial acquisition, replacing the query's LIMIT and OFFSET pagination with manual pagination. In my example, the select returns 10 million records; I fetch them and insert them into a "temporary table":
create or replace function load_records ()
returns VOID as $$
BEGIN
drop sequence if exists temp_seq;
create temp sequence temp_seq;
insert into tmp_table
SELECT linea.*
FROM
(
select nextval('temp_seq') as ROWNUM,* from table1 t1
join table2 t2 on (t2.fieldpk = t1.fieldpk)
join table3 t3 on (t3.fieldpk = t2.fieldpk)
) linea;
END;
$$ language plpgsql;
After that, I can paginate without counting each row, using the assigned sequence instead:
select * from tmp_table where counterrow >= 9000000 and counterrow <= 9025000
From the Java perspective, I implemented this pagination through partial acquisition with a lazy list, that is, a list that extends AbstractList and implements the get() method. The get method can use a data access interface to fetch the next set of data and release the memory heap:
@Override
public E get(int index) {
    if (bufferParcial.size() <= (index - lastIndexRoulette)) {
        lastIndexRoulette = index;
        // drop the previous chunk and fetch the next one
        bufferParcial = new ArrayList<E>();
        bufferParcial.addAll(daoInterface.getBufferParcial());
        if (bufferParcial.isEmpty()) {
            return null;
        }
    }
    return bufferParcial.get(index - lastIndexRoulette);
}
On the other hand, the data access interface uses the query to paginate, implementing a method that iterates progressively, 25,000 records at a time, until everything is processed.
Results of this approach can be seen here: http://www.arquitecturaysoftware.co/2013/10/laboratorio-1-iterar-millones-de.html