Java 如何使用 JPA(或至少使用 Hibernate)处理大型数据集?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/2761543/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How to handle large dataset with JPA (or at least with Hibernate)?
提问by Roman
I need to make my web-app work with really huge datasets. At the moment I get either an OutOfMemoryException or output that takes 1-2 minutes to generate.
我需要让我的网络应用程序处理非常庞大的数据集。目前我要么得到 OutOfMemoryException，要么输出需要 1-2 分钟才能生成。
Let's put it simply and suppose that we have 2 tables in the DB: Worker and WorkLog, with about 1,000 rows in the first one and 10,000,000 rows in the second one. The latter table has several fields, including 'workerId' and 'hoursWorked' among others. What we need is:
简单起见，假设数据库中有两个表：Worker 和 WorkLog，第一个表约有 1000 行，第二个表有 10,000,000 行。后一个表有若干字段，其中包括 'workerId' 和 'hoursWorked' 等。我们需要的是：
count total hours worked by each user;
list of work periods for each user.
计算每个用户的总工作时间;
每个用户的工作时间列表。
The most straightforward approach (IMO) for each task in plain SQL is:
对于普通 SQL 中的每个任务,最直接的方法 (IMO) 是:
1)
1)
select Worker.name, sum(hoursWorked) from Worker, WorkLog
where Worker.id = WorkLog.workerId
group by Worker.name;
//results of this query should be transformed to Multimap<Worker, Long>
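The transformation hinted at in the comment can be sketched in plain, dependency-free Java. This is a hypothetical helper (a plain Map<String, Long> stands in for Guava's Multimap, and the row layout (name, hours) mirrors the query's select list):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HoursAggregator {
    // Sums hoursWorked per worker name from rows shaped like the query
    // result above: row[0] = Worker.name, row[1] = sum(hoursWorked).
    static Map<String, Long> totalHours(List<Object[]> rows) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (Object[] row : rows) {
            totals.merge((String) row[0], (Long) row[1], Long::sum);
        }
        return totals;
    }
}
```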
2)
2)
select Worker.name, WorkLog.start, WorkLog.hoursWorked from Worker, WorkLog
where Worker.id = WorkLog.workerId;
//results of this query should be transformed to Multimap<Worker, Period>
//if it was JDBC then it would be vitally
//to set resultSet.setFetchSize (someSmallNumber), ~100
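That JDBC approach could look roughly like the sketch below. It assumes a java.sql.Connection supplied by the caller and the table/column names above; the RowHandler callback and FETCH_SIZE constant are illustrative:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;

class WorkLogStreamer {
    static final int FETCH_SIZE = 100; // small fetch size so the driver streams rows in batches

    static final String PERIODS_SQL =
        "select Worker.name, WorkLog.start, WorkLog.hoursWorked "
      + "from Worker, WorkLog where Worker.id = WorkLog.workerId";

    interface RowHandler {
        void onRow(String workerName, Timestamp start, long hoursWorked);
    }

    // Walks the result set row by row instead of loading it all into memory.
    static void stream(Connection conn, RowHandler handler) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(PERIODS_SQL)) {
            ps.setFetchSize(FETCH_SIZE); // the vital setting mentioned in the comment above
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    handler.onRow(rs.getString(1), rs.getTimestamp(2), rs.getLong(3));
                }
            }
        }
    }
}
```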
So, I have two questions:
所以,我有两个问题:
- how to implement each of my approaches with JPA (or at least with Hibernate);
- how would you handle this problem (with JPA or Hibernate of course)?
- 如何使用 JPA(或至少使用 Hibernate)实现我的每个方法;
- 您将如何处理这个问题(当然是使用 JPA 或 Hibernate)?
采纳答案by Pascal Thivent
suppose that we have 2 tables in DB: Worker and WorkLog with about 1000 rows in the first one and 10 000 000 rows in the second one
假设我们在 DB 中有 2 个表:Worker 和 WorkLog,第一个表大约有 1000 行,第二个表有 10 000 000 行
For high volumes like this, my recommendation would be to use the StatelessSession interface from Hibernate:
对于这样的大数据量，我的建议是使用 Hibernate 的 StatelessSession 接口：
Alternatively, Hibernate provides a command-oriented API that can be used for streaming data to and from the database in the form of detached objects. A StatelessSession has no persistence context associated with it and does not provide many of the higher-level life cycle semantics. In particular, a stateless session does not implement a first-level cache nor interact with any second-level or query cache. It does not implement transactional write-behind or automatic dirty checking. Operations performed using a stateless session never cascade to associated instances. Collections are ignored by a stateless session. Operations performed via a stateless session bypass Hibernate's event model and interceptors. Due to the lack of a first-level cache, stateless sessions are vulnerable to data aliasing effects. A stateless session is a lower-level abstraction that is much closer to the underlying JDBC.

StatelessSession session = sessionFactory.openStatelessSession();
Transaction tx = session.beginTransaction();

ScrollableResults customers = session.getNamedQuery("GetCustomers")
    .scroll(ScrollMode.FORWARD_ONLY);
while ( customers.next() ) {
    Customer customer = (Customer) customers.get(0);
    customer.updateStuff(...);
    session.update(customer);
}

tx.commit();
session.close();

In this code example, the Customer instances returned by the query are immediately detached. They are never associated with any persistence context.

The insert(), update() and delete() operations defined by the StatelessSession interface are considered to be direct database row-level operations. They result in the immediate execution of a SQL INSERT, UPDATE or DELETE respectively. They have different semantics to the save(), saveOrUpdate() and delete() operations defined by the Session interface.
或者，Hibernate 提供了一个面向命令的 API，可用于以游离（detached）对象的形式将数据流式传入和传出数据库。StatelessSession 没有与之关联的持久化上下文，也不提供许多更高级别的生命周期语义。特别是，无状态会话不实现一级缓存，也不与任何二级缓存或查询缓存交互。它不实现事务性写延迟或自动脏检查。使用无状态会话执行的操作永远不会级联到关联的实例。无状态会话会忽略集合。通过无状态会话执行的操作会绕过 Hibernate 的事件模型和拦截器。由于缺少一级缓存，无状态会话容易受到数据别名（data aliasing）效应的影响。无状态会话是一种更接近底层 JDBC 的低级抽象。

StatelessSession session = sessionFactory.openStatelessSession();
Transaction tx = session.beginTransaction();

ScrollableResults customers = session.getNamedQuery("GetCustomers")
    .scroll(ScrollMode.FORWARD_ONLY);
while ( customers.next() ) {
    Customer customer = (Customer) customers.get(0);
    customer.updateStuff(...);
    session.update(customer);
}

tx.commit();
session.close();

在此代码示例中，查询返回的 Customer 实例会立即处于游离状态，从不与任何持久化上下文相关联。

StatelessSession 接口定义的 insert()、update() 和 delete() 操作被视为直接的数据库行级操作，会分别立即执行 SQL 的 INSERT、UPDATE 或 DELETE。它们与 Session 接口定义的 save()、saveOrUpdate() 和 delete() 操作具有不同的语义。
回答by Archimedes Trajano
Raw SQL shouldn't be considered a last resort. It should still be considered an option if you want to keep things "standard" on the JPA tier, but not on the database tier. JPA also has support for native queries where it will still do the mapping to standard entities for you.
不应将原始 SQL 视为最后的手段。如果您想在 JPA 层而不是在数据库层上保持“标准”,它仍然应该被视为一个选项。JPA 还支持本机查询,它仍然会为您映射到标准实体。
However, if you have a large result set that cannot be processed in the database, then you really should just use plain JDBC as JPA (standard) does not support streaming of large sets of data.
但是,如果您有一个无法在数据库中处理的大型结果集,那么您真的应该只使用普通的 JDBC,因为 JPA(标准)不支持大型数据集的流传输。
It will be harder to port your application across different application servers if you use JPA implementation specific constructs since the JPA engine is embedded in the application server and you may not have a control on which JPA provider is being used.
如果您使用 JPA 实现特定的构造,将您的应用程序移植到不同的应用程序服务器将更加困难,因为 JPA 引擎嵌入在应用程序服务器中,并且您可能无法控制正在使用的 JPA 提供程序。
回答by Bojan Kraut
I'm using something like this and it works very fast. I also hate to use native SQL as our application should work on any database.
我正在使用这样的东西,它的工作速度非常快。我也不喜欢使用原生 SQL,因为我们的应用程序应该可以在任何数据库上运行。
The following results in a very optimized SQL query and returns a list of records which are maps.
下面的代码会生成非常优化的 SQL，并返回由 map 组成的记录列表。
String hql = "select distinct " +
"t.uuid as uuid, t.title as title, t.code as code, t.date as date, t.dueDate as dueDate, " +
"t.startDate as startDate, t.endDate as endDate, t.constraintDate as constraintDate, t.closureDate as closureDate, t.creationDate as creationDate, " +
"sc.category as category, sp.priority as priority, sd.difficulty as difficulty, t.progress as progress, st.type as type, " +
"ss.status as status, ss.color as rowColor, (p.rKey || ' ' || p.name) as project, ps.status as projectstatus, (r.code || ' ' || r.title) as requirement, " +
"t.estimate as estimate, w.title as workgroup, o.name || ' ' || o.surname as owner, " +
"ROUND(sum(COALESCE(a.duration, 0)) * 100 / case when ((COALESCE(t.estimate, 0) * COALESCE(t.progress, 0)) = 0) then 1 else (COALESCE(t.estimate, 0) * COALESCE(t.progress, 0)) end, 2) as factor " +
"from " + Task.class.getName() + " t " +
"left join t.category sc " +
"left join t.priority sp " +
"left join t.difficulty sd " +
"left join t.taskType st " +
"left join t.status ss " +
"left join t.project p " +
"left join t.owner o " +
"left join t.workgroup w " +
"left join p.status ps " +
"left join t.requirement r " +
"left join p.status sps " +
"left join t.iterationTasks it " +
"left join t.taskActivities a " +
"left join it.iteration i " +
"where sps.active = true and " +
"ss.done = false and " +
"(i.uuid <> :iterationUuid or it.uuid is null) " + filterHql +
"group by t.uuid, t.title, t.code, t.date, t.dueDate, " +
"t.startDate, t.endDate, t.constraintDate, t.closureDate, t.creationDate, " +
"sc.category, sp.priority, sd.difficulty, t.progress, st.type, " +
"ss.status, ss.color, p.rKey, p.name, ps.status, r.code, r.title, " +
"t.estimate, w.title, o.name, o.surname " + sortHql;
if (logger.isDebugEnabled()) {
logger.debug("Executing hql: " + hql );
}
Query query = hibernateTemplate.getSessionFactory().getCurrentSession().getSession(EntityMode.MAP).createQuery(hql);
for(String key: filterValues.keySet()) {
Object valueSet = filterValues.get(key);
if (logger.isDebugEnabled()) {
logger.debug("Setting query parameter for " + key );
}
if (valueSet instanceof java.util.Collection<?>) {
query.setParameterList(key, (Collection)filterValues.get(key));
} else {
query.setParameter(key, filterValues.get(key));
}
}
query.setString("iterationUuid", iteration.getUuid());
query.setResultTransformer(Transformers.ALIAS_TO_ENTITY_MAP);
if (logger.isDebugEnabled()) {
logger.debug("Query building complete.");
logger.debug("SQL: " + query.getQueryString());
}
return query.list();
回答by Steve Ebersole
I agree that doing the calculation on the database server is your best option in the particular case you mentioned. HQL and JPAQL can handle both of those queries:
我同意在您提到的特定情况下,在数据库服务器上进行计算是您的最佳选择。HQL 和 JPAQL 可以处理这两个查询:
1)
1)
select w, sum(wl.hoursWorked)
from Worker w, WorkLog wl
where w.id = wl.workerId
group by w
or, if the association is mapped:
或者,如果关联被映射:
select w, sum(wl.hoursWorked)
from Worker w join w.workLogs wl
group by w
Both of which return you a List<Object[]> where the Object[]s are Worker and Long. Or you could also use "dynamic instantiation" queries to wrap that up, for example:
这两个查询都会返回 List<Object[]>，其中 Object[] 的元素是 Worker 和 Long。或者你也可以使用"动态实例化"查询来包装结果，例如：
select new WorkerTotal( w, sum(wl.hoursWorked) )
from Worker w join w.workLogs wl
group by w
or (depending on need) probably even just:
或(根据需要)甚至可能只是:
select new WorkerTotal( w.id, w.name, sum(wl.hoursWorked) )
from Worker w join w.workLogs wl
group by w.id, w.name
WorkerTotal is just a plain class. It must have matching constructor(s).
WorkerTotal 只是一个普通的类。它必须具有匹配的构造函数。
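As a sketch, such a class might look like the following for the second form of the query (the field and getter names are illustrative; only the constructor signature, matching w.id, w.name, sum(wl.hoursWorked), actually matters):

```java
// Plain DTO for HQL "dynamic instantiation"; no annotations needed.
class WorkerTotal {
    private final Long workerId;
    private final String name;
    private final Long totalHours;

    // Parameter order and types must match the select clause.
    public WorkerTotal(Long workerId, String name, Long totalHours) {
        this.workerId = workerId;
        this.name = name;
        this.totalHours = totalHours;
    }

    public Long getWorkerId()   { return workerId; }
    public String getName()     { return name; }
    public Long getTotalHours() { return totalHours; }
}
```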
2)
2)
select w, new Period( wl.start, wl.hoursWorked )
from Worker w join w.workLogs wl
This will return you a result for each row in the WorkLog table... The new Period(...) bit is called "dynamic instantiation" and is used to wrap tuples from the result into objects (for easier consumption).
这将为 WorkLog 表中的每一行返回一个结果……其中 new Period(...) 这部分称为"动态实例化"，用于将结果中的元组包装成对象（便于使用）。
For manipulation and general usage, I recommend StatelessSession as Pascal points out.
对于操作和一般使用,我推荐如 Pascal 指出的 StatelessSession。
回答by Darrell Teague
There are several techniques that may need to be used in conjunction with one another to create and manipulate queries for large data-sets where memory is a limitation:
有几种技术可能需要相互结合使用,以便为内存有限的大型数据集创建和操作查询:
- Use setFetchSize(some value, maybe 100+), as the default (via JDBC) is 10. This is more about performance and is the single biggest related factor thereof. Can be done in JPA using a query hint available from the provider (Hibernate, etc.). There does not (for whatever reason) seem to be a JPA Query.setFetchSize(int) method.
- Do not try to marshall the entire result-set for 10K+ records. Several strategies apply: for GUIs, use paging or a framework that does paging. Consider Lucene or commercial searching/indexing engines (Endeca if the company has the money). For sending data somewhere, stream it and flush the buffer every N records to limit how much memory is used. The stream may be flushed to a file, network, etc. Remember that underneath, JPA uses JDBC and JDBC keeps the result-set on the server, only fetching N rows in a row-set group at a time. This break-down can be manipulated to facilitate flushing data in groups.
- Consider what the use-case is. Typically, an application is trying to answer questions. When the answer is to weed through 10K+ rows, then the design should be reviewed. Again, consider using indexing engines like Lucene, refine the queries, consider using BloomFilters as contains-check caches to find needles in haystacks without going to the database, etc.
- 使用 setFetchSize(某个值，比如 100+)，因为默认值（通过 JDBC）是 10。这更多是性能问题，也是其中最大的相关因素。在 JPA 中可以通过提供者（Hibernate 等）提供的 query hint 来设置。（不知出于什么原因）JPA 似乎没有 Query.setFetchSize(int) 方法。
- 不要试图把 10K+ 条记录的整个结果集一次性装配到内存中。有几种策略可用：对于 GUI，使用分页或支持分页的框架；考虑 Lucene 或商业搜索/索引引擎（如果公司有钱，可用 Endeca）；要把数据发送到别处，可以流式传输并每 N 条记录刷新一次缓冲区，以限制内存使用量，流可以刷新到文件、网络等。请记住，底层 JPA 使用的是 JDBC，而 JDBC 把结果集保存在服务器端，每次只获取一个行集组中的 N 行。可以利用这一机制来按组刷新数据。
- 考虑用例是什么。通常，应用程序是在回答问题。如果答案是要翻遍 10K+ 行数据，那就应该重新审视设计。再次强调，可以考虑使用像 Lucene 这样的索引引擎、优化查询、考虑把 BloomFilter 用作包含性检查缓存以便不访问数据库就能大海捞针，等等。
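The "flush the buffer every N records" idea from the second point can be sketched without any JPA machinery. In this hypothetical helper, the sink stands in for whatever file or network target is actually used:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Buffers records and hands them to a sink in groups of flushEvery,
// so memory use stays bounded regardless of the total row count.
class ChunkedWriter<T> {
    private final int flushEvery;
    private final Consumer<List<T>> sink;
    private final List<T> buffer = new ArrayList<>();

    ChunkedWriter(int flushEvery, Consumer<List<T>> sink) {
        this.flushEvery = flushEvery;
        this.sink = sink;
    }

    void add(T record) {
        buffer.add(record);
        if (buffer.size() >= flushEvery) {
            flush();
        }
    }

    // Push whatever is buffered; call once more after the last row.
    void flush() {
        if (!buffer.isEmpty()) {
            sink.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```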
回答by lapsus63
It seems you can do this with EclipseLink too. Check this : http://wiki.eclipse.org/EclipseLink/Examples/JPA/Pagination:
看来你也可以用 EclipseLink 做到这一点。参见：http://wiki.eclipse.org/EclipseLink/Examples/JPA/Pagination ：
Query query = em.createQuery(...);
query.setHint(QueryHints.CURSOR, true)
     .setHint(QueryHints.SCROLLABLE_CURSOR, true);
ScrollableCursor scrl = (ScrollableCursor) query.getSingleResult();
Object o = null;
while ((o = scrl.next()) != null) { ... }