java - Process huge volume of data using Java
Original source: http://stackoverflow.com/questions/986784/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use and share them, but you must attribute them to the original authors (not me): StackOverflow
Process huge volume of data using Java
Asked by Gaurav Saini
As part of the requirement we need to process nearly 3 million records and associate them with a bucket. This association is decided by a set of rules (comprising 5-15 attributes, with single values or ranges of values, and precedence) which derive the bucket for a record. Sequential processing of such a big number is clearly out of scope. Can someone guide us on the approach to effectively design a solution?
Answered by skaffman
3 million records isn't really that much from a volume-of-data point of view (depending on record size, obviously), so I'd suggest that the easiest thing to try is parallelising the processing across multiple threads (using the java.util.concurrent.Executor framework). As long as you have multiple CPU cores available, you should be able to get near-linear performance increases.
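A minimal sketch of that idea, assuming the records are already in memory and that BucketRules.assignBucket stands in for your rule evaluation (both types are invented for illustration):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelBucketer {
    // Hypothetical record and rule types; the real ones come from your domain.
    interface Record {}
    interface BucketRules { String assignBucket(Record r); }

    public static void process(List<Record> records, BucketRules rules) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        int chunk = (records.size() + cores - 1) / cores;   // records per worker
        for (int start = 0; start < records.size(); start += chunk) {
            final List<Record> slice = records.subList(start, Math.min(start + chunk, records.size()));
            pool.execute(() -> {
                for (Record r : slice) {
                    String bucket = rules.assignBucket(r);  // pure CPU work, no shared state
                    // ... record the (r, bucket) pair locally; write back later
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}

With CPU-bound rule evaluation and no contention between the slices, this should scale close to linearly with the number of cores, as the answer suggests.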
Answered by akarnokd
It depends on the data source. If it is a single database, you will spend most of the time retrieving the data anyway. If it is in a local file, then you can partition the data into smaller files or you can pad the records to have equal size - this allows random access to a batch of records.
If you have a multi-core machine, the partitioned data can be processed in parallel. Once you have determined the record-bucket assignments, you can write the information back to the database using PreparedStatement's batch capability.
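A hedged sketch of that batch write-back, assuming a record_bucket table with record_id and bucket_id columns (both names invented for illustration):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;

public class BatchWriter {
    // Writes record-to-bucket assignments in batches of 1000 statements.
    public static void writeAssignments(Connection conn, Map<Long, Integer> bucketByRecordId) throws SQLException {
        String sql = "UPDATE record_bucket SET bucket_id = ? WHERE record_id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int pending = 0;
            for (Map.Entry<Long, Integer> e : bucketByRecordId.entrySet()) {
                ps.setInt(1, e.getValue());
                ps.setLong(2, e.getKey());
                ps.addBatch();
                if (++pending == 1000) {   // flush periodically to bound memory
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) ps.executeBatch();
        }
    }
}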
If you have only a single-core machine, you can still achieve some performance improvement by designing a data retrieval / data processing / batch write-back separation that takes advantage of the pause times of the I/O operations.
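One way to sketch that separation is a small producer-consumer pipeline on a BlockingQueue, so rule evaluation overlaps with I/O waits; fetchBatch and assignBucket below are placeholders:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class Pipeline {
    private static final List<String> POISON = new ArrayList<>(); // end-of-stream marker

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(8);

        Thread reader = new Thread(() -> {
            try {
                for (int batch = 0; batch < 100; batch++) {  // stand-in for DB/file reads
                    queue.put(fetchBatch(batch));
                }
                queue.put(POISON);
            } catch (InterruptedException ie) { Thread.currentThread().interrupt(); }
        });

        Thread processor = new Thread(() -> {
            try {
                List<String> batch;
                while ((batch = queue.take()) != POISON) {
                    for (String record : batch) {
                        assignBucket(record);                // rule evaluation goes here
                    }
                }
            } catch (InterruptedException ie) { Thread.currentThread().interrupt(); }
        });

        reader.start(); processor.start();
        reader.join(); processor.join();
    }

    private static List<String> fetchBatch(int n) { return List.of("record-" + n); }
    private static void assignBucket(String record) { /* placeholder */ }
}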
Answered by Dave Webb
I'm not quite sure what you're after, but here's a blog post about how the New York Times used the Apache Hadoop project to process a large volume of data.
Answered by Will Hartung
As a meaningless benchmark, we have a system that has an internal cache. We're currently loading 500K rows. For each row we generate statistics, place keys in different caches, etc. Currently this takes < 20s for us to process.
It's a meaningless benchmark, but it does illustrate that, depending on the circumstances, 3M rows is not a lot of rows on today's hardware.
That said.
As others have suggested, break the job up into pieces and parallelize the runs, 1-2 threads per core. Each thread maintains its own local data structures and state, and at the end the master process consolidates the results. This is a crude "map/reduce" algorithm. The key here is to ensure that the threads aren't fighting over global resources like global counters, etc. Let the final processing of the thread results deal with those serially.
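A sketch of that thread-local-then-consolidate pattern, where each worker returns its own counts and the master merges them serially at the end (the one-character bucket rule is a toy stand-in for real rule evaluation):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MapReduceStyle {
    public static Map<String, Long> countBuckets(List<List<String>> partitions) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        List<Future<Map<String, Long>>> futures = new ArrayList<>();
        for (List<String> part : partitions) {
            futures.add(pool.submit(() -> {
                Map<String, Long> local = new HashMap<>();   // no shared counters
                for (String record : part) {
                    String bucket = record.isEmpty() ? "default" : record.substring(0, 1); // toy rule
                    local.merge(bucket, 1L, Long::sum);
                }
                return local;
            }));
        }
        Map<String, Long> total = new HashMap<>();
        for (Future<Map<String, Long>> f : futures) {        // serial consolidation step
            f.get().forEach((k, v) -> total.merge(k, v, Long::sum));
        }
        pool.shutdown();
        return total;
    }
}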
You can use more than one thread per core if each thread is doing DB IO, since no single thread will be purely CPU bound. Simply run the process several times with different thread counts until it comes out fastest.
We've seen 50% speed-ups, versus linear processing, even when we run batches through a persistent queueing system like JMS to distribute the work, and I've seen these gains on 2-core laptop computers, so there is definite room for progress here.
Another thing if possible is don't do ANY disk IO (save reading the data from the DB) until the very end. At that point you have a lot more opportunity to batch any updates that need to be made so you can, at least, cut down on network round trip times. Even if you had to update every single row, large batches of SQL will still show net gains in performance. Obviously this can be memory intensive. Thankfully, most modern systems have a lot of memory.
Answered by CoderTao
Based on the revised description, I think I'd try and look at sorting the data.
Sorting can be an O(n log n) process; and if most of the comparisons are for direct equality on sortable fields, this should yield a total complexity of ~O(n log n). Theoretically. If, after assigning an item to a bucket, it's no longer needed, just remove it from the list of data.
Even if the data needs to be re-sorted a few times for various steps in the logic, it should still be a bit faster than the n^2 approach.
Basically, this would involve preprocessing the data to make it easier for actual processing.
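As a hedged illustration of that preprocessing: sort on the attribute most rules test for equality, then walk the sorted list once so records sharing a key are handled as a run (the Rec shape and its regionCode key are invented here):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortThenAssign {
    record Rec(String regionCode, long id) {}    // hypothetical record shape

    public static void assign(List<Rec> records) {
        // O(n log n) sort on the attribute most rules test for equality
        records.sort(Comparator.comparing(Rec::regionCode));

        int i = 0;
        while (i < records.size()) {
            String key = records.get(i).regionCode();
            List<Rec> run = new ArrayList<>();
            while (i < records.size() && records.get(i).regionCode().equals(key)) {
                run.add(records.get(i++));       // collect the run sharing this key
            }
            // evaluate the rule set once for the key, then apply to the whole run
            String bucket = lookupBucketFor(key);
            for (Rec r : run) { store(r, bucket); }
        }
    }

    private static String lookupBucketFor(String key) { return "bucket-" + key; }
    private static void store(Rec r, String bucket) { /* persist the association */ }
}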
This makes certain assumptions about the logic of bucket assignment (namely, that it's not too far from the pseudocode provided), and would be invalid if you needed to extract data from every pair of A, B.
Hope this helps.
Edit: I would comment if I could; but, alas, I am too new. Preprocessing applies as much to the data as it does to the individual categories. Ultimately, all you need to do to go from a 15-minute compute time to a 5-minute compute time is to be able to programmatically determine two-thirds or more of the categories that cannot and will never match... in less than O(n) amortized time. Which might not be applicable to your specific situation, I admit.
Answered by Brian
I would make efforts to push back on the specification author to focus more on 'what' needs to be done rather than how. I can't imagine why a specification would push 'java' for a data-intensive operation. If it has to do with data, do it with SQL. If you're using Oracle, there is a function called ntile. So creating a fixed set of buckets is as trivial as:
我会努力与规范作者一起反驳,更多地关注需要完成的“什么”,而不是如何完成。我无法想象为什么规范会为数据密集型操作推送“java”。如果它与数据有关,请使用 SQL。如果您使用 Oracle,则有一个名为 nTile 的函数。因此,创建一组固定的存储桶非常简单:
select ntile(4) over (order by empno) grp, empno, ename from emp
Which results in:
GRP EMPNO ENAME
--- ----- ---------
  1  7369 SMITH
  1  7499 ALLEN
  1  7521 WARD
  1  7566 JONES
  2  7654 MARTIN
  2  7698 BLAKE
  2  7782 CLARK
  2  7788 SCOTT
  3  7839 KING
  3  7844 TURNER
  3  7876 ADAMS
  4  7900 JAMES
  4  7902 FORD
  4  7934 MILLER
At minimum you could establish your 'buckets' in SQL; then your Java code would just need to process a given bucket.
Worker worker = new Worker(bucketID);
worker.doWork();
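A hypothetical version of that Worker, letting the database do the partitioning so the Java side only ever sees one bucket's rows (table and column names follow the emp example above; a real schema would differ):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class Worker {
    private final Connection conn;
    private final int bucketId;

    Worker(Connection conn, int bucketId) {
        this.conn = conn;
        this.bucketId = bucketId;
    }

    void doWork() throws SQLException {
        // Fetch only this worker's ntile partition; rule evaluation stays in Java.
        String sql = "select empno, ename from " +
                     "(select ntile(4) over (order by empno) grp, empno, ename from emp) " +
                     "where grp = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, bucketId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    process(rs.getLong("empno"), rs.getString("ename"));
                }
            }
        }
    }

    private void process(long empno, String ename) { /* rule evaluation here */ }
}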
If you don't care about the number of buckets (the example above asked for 4 buckets) but rather want a fixed size for each bucket (5 records per bucket), then the SQL is:
select ceil(row_number() over (order by empno) / 5.0) grp,
       empno,
       ename
  from emp
Output:
GRP      EMPNO ENAME
--- ---------- -------
  1       7369 SMITH
  1       7499 ALLEN
  1       7521 WARD
  1       7566 JONES
  1       7654 MARTIN
  2       7698 BLAKE
  2       7782 CLARK
  2       7788 SCOTT
  2       7839 KING
  2       7844 TURNER
  3       7876 ADAMS
  3       7900 JAMES
  3       7902 FORD
  3       7934 MILLER
Both examples above come from the terrific book SQL Cookbook, 1st Edition, by Anthony Molinaro.
Answered by Frank V
Is there a reason that you have to use Java to process the data? Couldn't you use SQL queries to write to intermediate fields? You could build upon each field -- attributes -- until you have everything in the bucket you need.
Or you could use a hybrid of SQL and Java: use different procedures to get different "buckets" of information and send each down its own thread path for more detailed processing, while another query fetches another set of data and sends it down a different thread path...
Answered by John Bellone
The same goes for most projects where you need to process large amounts of information. I am going to assume that each record is treated the same, i.e. you process it the same way each time, which is the point at which you can spawn a separate thread to do the processing.
The second obvious point is where you are fetching your information; in this case you mentioned a database, but really that is pretty irrelevant. You want to separate the I/O and processing elements of your code into separate threads (or, more likely, a pool of executors for the processing).
Try to make each as independent as possible, and remember to use locking when necessary. Here are some links that you may want to read up on.
http://www.ibm.com/developerworks/library/j-thread.html
http://www.ibm.com/developerworks/java/library/j-threads1.html
http://www.devarticles.com/c/a/Java/Multithreading-in-Java/
Answered by Tetsujin no Oni
Effective design steps for this scenario consist of, first, determining any and all places where you can partition the records to be processed, to allow full-engine parallelization (i.e., four units each running against 750k records is comparatively cheap). Then, depending on the cost of the rules that summarize your record (I view assignment of a bucket as a summarization operation), determine whether your operation is going to be CPU-bound or record-retrieval-bound.
If you're CPU bound, increasing the partitioning is your best performance gain. If you're IO bound, rule processing worker threads that can work in parallel in response to chunked data retrieval is a better-performing design.
All of this assumes that your rules will not result in state which needs to be tracked between records. Such a scenario deeply threatens the parallelization approach. If parallelization is not a tractable solution because of cumulative state being a component of the rule set, then your best solution may in fact be sequential processing of individual records.
Answered by Carl Manaster
Sequential processing of such a big number is clearly out of scope.
I don't think you know that. How long does it take to process 1,000 records in this way? 10,000? 100,000? 1,000,000? If the answer is really "too long," then fine: start to look for optimizations. But you might find the answer is "insignificant," and then you're done.
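A throwaway measurement along those lines might look like the following, where processOne is an invented stand-in for the per-record rule evaluation:

import java.util.List;

public class Measure {
    public static void main(String[] args) {
        for (int n : new int[] {1_000, 10_000, 100_000, 1_000_000}) {
            List<String> sample = makeRecords(n);            // stand-in for real data
            long start = System.nanoTime();
            for (String r : sample) processOne(r);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("%,d records: %,d ms%n", n, ms);
        }
    }

    private static List<String> makeRecords(int n) {
        return java.util.stream.IntStream.range(0, n).mapToObj(i -> "rec-" + i).toList();
    }

    private static void processOne(String r) { /* rule evaluation goes here */ }
}

If the per-record time scales roughly linearly and the extrapolated total is acceptable, the simple sequential design wins.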
Other answers have alluded to this, but it's my entire answer. Prove that you have a problem before you start optimizing. Then you've at least got a simple, correct system to profile and against which to compare optimized answers.

