Original question: http://stackoverflow.com/questions/2334462/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverFlow
Fast Oracle Select [Huge Data]
Asked by Henry B
I have a project whereby I'm reading huge volumes of data from an Oracle database from Java.
I have the feeling that the application we are writing will process the data far faster than a single-threaded SELECT query can deliver it, so I've been trying to research faster ways of obtaining the data.
Does anyone have anything I could read that would help me with my plight?
Answered by Vincent Malgrat
Oracle supports parallel DML. In particular this applies to SELECT queries. Ultimately the bottleneck will probably be the IO read speed. Either use faster disks or stripe the data across many disks.
Update
As APC noted in the comments, Parallel Query/DML is an Enterprise Edition feature and is not available in the Standard Edition.
Also, Parallel DML/Query is not the solution to all performance problems. Because the query will use more than one process, it may improve throughput, but at the cost of concurrency. The purpose of parallelism is to use more resources to process the query faster. If the query is already IO-bound or CPU-bound, there are no extra resources to use, and adding parallelism will only make matters worse.
From the link above:
Parallel execution is not normally useful for:
- Environments in which the CPU, memory, or I/O resources are already heavily utilized. Parallel execution is designed to exploit additional available hardware resources; if no such resources are available, then parallel execution will not yield any benefits and indeed may be detrimental to performance.
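To make the parallel query idea concrete, here is a minimal sketch of requesting parallel execution for one statement from Java with Oracle's PARALLEL optimizer hint. The class name, the table name passed in, and the degree of parallelism are all hypothetical, and the sketch assumes Enterprise Edition with spare CPU and IO capacity. The hint is only a request, so Oracle may still run the statement serially, and the rows still reach the client through a single session.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class ParallelHintSketch {

    // Builds a SELECT that asks the optimizer for a parallel full table scan.
    // "table" and "degree" are illustrative values supplied by the caller;
    // never build this string from untrusted input.
    static String parallelSelect(String table, int degree) {
        if (degree < 1) {
            throw new IllegalArgumentException("degree must be >= 1");
        }
        return "SELECT /*+ FULL(t) PARALLEL(t, " + degree + ") */ * FROM " + table + " t";
    }

    // Runs the hinted query on an existing connection and counts the rows returned.
    static long countRows(Connection conn, String table, int degree) throws SQLException {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(parallelSelect(table, degree))) {
            long rows = 0;
            while (rs.next()) {
                rows++; // real code would process each row here
            }
            return rows;
        }
    }
}
```

Comparing the elapsed time of `countRows` with and without the hint on a quiet system is one way to check whether parallelism helps at all for a given query.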
Answered by dpbradley
You haven't given us a lot of information on why it will be necessary to bring "huge volumes of data" into the Java application instead of processing it on the database side. Although there can be exceptions, usually this is a signal to rethink the design. As a general rule with Oracle, it is most efficient to do as much work as you can with pure set operations (SQL), followed by procedural processing within the RDBMS engine (PL/SQL), before bringing results back to the client application.
Answered by redcayuga
Use the setFetchSize(int) method on the Statement or PreparedStatement before you open the query. You should experiment with different sizes. Try 75 as a starting point.
For a slightly different usage, people have said that the PL/SQL bulk fetch "sweet spot" is between 2000 and 3000, but I saw one benchmark that indicated that 75 was optimum.
A large fetch size will tend to reduce the number of round trips between client and server. But if it is too large the database has to have a big buffer and the networking software may have to break up the big message into a lot of packets.
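A minimal sketch of the fetch size advice, assuming an already open JDBC connection; the class and method names are hypothetical. `setFetchSize` is only a hint to the driver, and `roundTrips` is a rough estimate (rows divided by fetch size, rounded up) of how many fetch calls are needed to drain a result set, which is handy when deciding which sizes to try.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class FetchSizeSketch {

    // Approximate number of fetch round trips: ceil(rowCount / fetchSize).
    // Ignores any rows a driver may return together with the execute call.
    static long roundTrips(long rowCount, int fetchSize) {
        if (rowCount < 0 || fetchSize < 1) {
            throw new IllegalArgumentException("rowCount must be >= 0 and fetchSize >= 1");
        }
        return (rowCount + fetchSize - 1) / fetchSize;
    }

    // Executes "sql" with the given fetch size hint and counts the rows read.
    static long drain(Connection conn, String sql, int fetchSize) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setFetchSize(fetchSize); // set before the query is executed
            try (ResultSet rs = ps.executeQuery()) {
                long rows = 0;
                while (rs.next()) {
                    rows++; // real code would process each row here
                }
                return rows;
            }
        }
    }
}
```

For one million rows, `roundTrips` gives 100000 fetches at a size of 10 and 13334 at a size of 75. Timing `drain` with a few different sizes against the real query is the experiment the answer recommends.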
Answered by Gary Myers
Firstly, 'huge data' to database people is [at least] gigabytes, in which case I suspect your problems are going to be reading those sorts of volumes into your process's memory and aggregating them there. Why do you think a single-threaded select will be the bottleneck?
If the bottleneck were getting the data from disk, then having multiple threads pulling data from the same disk wouldn't necessarily be faster and may even be slower. But if you could spread the data over separate disks, separate threads would be faster. If you are using SSDs and don't think the disks will be a contention point, we can look elsewhere.
If the bottleneck was network bandwidth, again multiple threads wouldn't fit any more data through the pipe any faster. You may even benefit from unloading the data to a flat file, compressing it and transferring that.
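The flat file idea can be sketched as follows, assuming one text line per row and gzip as the compression format (both are assumptions, the answer names neither). The class and method names are hypothetical. In practice the unload would run close to the database and only the compressed file would cross the network.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class CompressedUnloadSketch {

    // Writes one line per row into a gzip-compressed file.
    static void writeCompressed(Path target, Iterable<String> lines) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(target)), StandardCharsets.UTF_8))) {
            for (String line : lines) {
                out.write(line);
                out.write('\n');
            }
        }
    }

    // Reads the rows back on the receiving side.
    static List<String> readCompressed(Path source) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(source)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Demonstration helper: writes to a temporary file, reads it back, cleans up.
    static List<String> roundTrip(List<String> lines) {
        try {
            Path tmp = Files.createTempFile("unload", ".csv.gz");
            try {
                writeCompressed(tmp, lines);
                return readCompressed(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Whether this pays off depends on how compressible the data is and on the CPU time spent compressing, so it is worth measuring against a plain transfer.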
If the select is being sorted or comes from a hash-join, you may use memory more efficiently with a single thread. Multiple sessions would have to share the machine's memory.
If there is CPU-intensive processing, then multiple threads may help. That could be as simple as having multiple connections from Java, each getting a different slice of the data (e.g. A-K and L-Z), but it would very much depend on the SELECT.
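A minimal sketch of the "one connection per slice" approach, under stated assumptions: the rows can be partitioned by a numeric key column (a hypothetical `id` on a hypothetical `big_table`, where the answer's example used alphabetical ranges), and a `DataSource` hands out independent connections. All class and method names are illustrative.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToLongFunction;
import javax.sql.DataSource;

class SlicedReadSketch {

    // Half-open key range [from, to).
    static final class Slice {
        final long from;
        final long to;
        Slice(long from, long to) { this.from = from; this.to = to; }
    }

    // Splits [min, max) into at most "parts" contiguous, non-overlapping slices.
    // Assumes (max - min) * parts fits in a long.
    static List<Slice> split(long min, long max, int parts) {
        if (parts < 1 || max < min) {
            throw new IllegalArgumentException("need parts >= 1 and max >= min");
        }
        List<Slice> slices = new ArrayList<>();
        long span = max - min;
        long from = min;
        for (int i = 1; i <= parts && from < max; i++) {
            long to = min + span * i / parts;
            if (to > from) {
                slices.add(new Slice(from, to));
                from = to;
            }
        }
        return slices;
    }

    // Runs "reader" once per slice, each on its own thread, and sums the results.
    static long readAll(List<Slice> slices, ToLongFunction<Slice> reader) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, slices.size()));
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (Slice s : slices) {
                Callable<Long> task = () -> reader.applyAsLong(s);
                pending.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : pending) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    // Reads one slice over its own connection and returns the number of rows seen.
    // Table and column names are placeholders.
    static long countSlice(DataSource ds, Slice s) {
        String sql = "SELECT * FROM big_table WHERE id >= ? AND id < ?";
        try (Connection conn = ds.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, s.from);
            ps.setLong(2, s.to);
            try (ResultSet rs = ps.executeQuery()) {
                long rows = 0;
                while (rs.next()) {
                    rows++; // real code would do the CPU-intensive work here
                }
                return rows;
            }
        } catch (SQLException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A call such as `readAll(split(0, 1_000_000, 4), s -> countSlice(ds, s))` would read four slices concurrently. As the answer says, this only helps when the client-side processing, rather than the disk or the network, is the limiting factor.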
I agree with dpbradley that you should determine the bottleneck first. If you have the data and the SELECT, it should be simple enough to determine how long it takes (both on the local machine and through the network), and a trace would be a necessary starting point to really go into how it could be sped up.