Original URL: http://stackoverflow.com/questions/25335813/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
Fetching rows at extremely high speed
Asked by user3007501
I have a very large table (hundreds of millions of rows, containing numbers and strings) in Oracle, and I need to read the entire content of this table, format it, and write it to a file or any other resource. Generally, my solution looks like this:
package my.odp;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.sql.*;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;

public class Main {

    public static volatile boolean finished = false;

    public static void main(final String[] args) throws InterruptedException {
        final ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<String>(10000);

        final Thread writeWorker = new Thread("ODP Writer") {
            public void run() {
                try {
                    File targetFile = new File(args[0]);
                    BufferedWriter writer = new BufferedWriter(new FileWriter(targetFile));
                    try {
                        String str;
                        // Keep draining until the reader is done AND the queue is empty,
                        // otherwise rows still queued at shutdown are lost.
                        while (!finished || !queue.isEmpty()) {
                            str = queue.poll(200, TimeUnit.MILLISECONDS);
                            if (str == null) {
                                continue;
                            }
                            writer.write(str);
                            writer.write('\n');
                        }
                    } finally {
                        writer.close(); // also flushes buffered output on normal exit
                    }
                } catch (Throwable e) {
                    e.printStackTrace();
                }
            }
        };

        final Thread readerThread = new Thread("ODP Reader") {
            public void run() {
                try {
                    Class.forName("oracle.jdbc.OracleDriver");
                    Connection conn = DriverManager.getConnection(
                            "jdbc:oracle:thin:@//xxx.xxx.xxx.xxx:1521/orcl", "user", "pass");
                    Statement stmt = conn.createStatement(
                            ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
                    stmt.setFetchSize(500000);
                    ResultSet rs = stmt.executeQuery("select * from src_schema.big_table_view");
                    System.out.println("Fetching result");
                    while (rs.next()) {
                        StringBuilder sb = new StringBuilder();
                        sb.append(rs.getString(1)).append('\t');  // OWNER
                        sb.append(rs.getString(2)).append('\t');  // OBJECT_NAME
                        sb.append(rs.getString(3)).append('\t');  // SUBOBJECT_NAME
                        sb.append(rs.getLong(4)).append('\t');    // OBJECT_ID
                        sb.append(rs.getLong(5)).append('\t');    // DATA_OBJECT_ID
                        sb.append(rs.getString(6)).append('\t');  // OBJECT_TYPE
                        sb.append(rs.getString(7)).append('\t');  // CREATED
                        sb.append(rs.getString(8)).append('\t');  // LAST_DDL_TIME
                        sb.append(rs.getString(9)).append('\t');  // TIMESTAMP
                        sb.append(rs.getString(10)).append('\t'); // STATUS
                        sb.append(rs.getString(11)).append('\t'); // TEMPORARY
                        sb.append(rs.getString(12)).append('\t'); // GENERATED
                        sb.append(rs.getString(13)).append('\t'); // SECONDARY
                        sb.append(rs.getString(14)).append('\t'); // NAMESPACE
                        sb.append(rs.getString(15));              // EDITION_NAME
                        queue.put(sb.toString());
                    }
                    rs.close();
                    stmt.close();
                    conn.close();
                } catch (Throwable e) {
                    e.printStackTrace();
                } finally {
                    finished = true; // let the writer terminate even if reading failed
                }
            }
        };

        long startTime = System.currentTimeMillis();
        writeWorker.start();
        readerThread.start();
        System.out.println("Waiting for join..");
        writeWorker.join();
        System.out.println("Exit: " + (System.currentTimeMillis() - startTime));
    }
}
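The queue handshake between reader and writer can also be done with a "poison pill" sentinel instead of a timed poll plus a volatile flag: the producer enqueues a sentinel value last, and the consumer stops when it sees it. A minimal standalone sketch (no database involved; the names are illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PoisonPillDemo {
    private static final String EOF = "\u0000EOF"; // sentinel marking end of data

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
        StringBuilder sink = new StringBuilder(); // stands in for the file writer

        Thread writer = new Thread(() -> {
            try {
                // Consume until the sentinel arrives; no flag polling needed.
                for (String s; !(s = queue.take()).equals(EOF); ) {
                    sink.append(s).append('\n');
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        for (int i = 0; i < 5; i++) {
            queue.put("row-" + i);
        }
        queue.put(EOF); // tell the writer no more rows are coming
        writer.join();
        System.out.println(sink.toString().split("\n").length);
    }
}
```

With this pattern there is no window in which queued rows can be dropped at shutdown.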
There are two threads: one fetches rows from the result set and one writes the string values. The measured loading speed was about 10 MB/s, and in my case I need to make it 10 times faster. The profiler shows that the most time-consuming methods are
oracle.jdbc.driver.OracleResultSetImpl.getString()
and
oracle.net.ns.Packet.receive()
Do you have any ideas on how to make JDBC load data much faster? Any ideas about query optimisation, string loading optimisation, tweaking the JDBC driver or using another one, using Oracle JDBC implementations directly, or tweaking Oracle would be appreciated.
UPDATE: I have compiled and listed the discussion results below:
I have no access to the DBMS server apart from the connection to the Oracle DB, and the server cannot connect to any external resource. Dump and extraction utilities that use the server's or a remote file system cannot be applied, and it is impossible to install or use any external Java or PL/SQL routines on the server. A connection on which to execute queries is all I have.
I used a profiler and dug into the Oracle JDBC driver. I found that the most expensive operation is reading data, i.e. Socket.read(). All string fields are represented as one char array and have almost no influence on performance. I checked the whole app with the profiler, and Socket.read() is definitely the most expensive operation. Extracting fields, building strings, and writing data consume almost nothing. The problem lies only in reading data.
Optimisations of the data representation on the server side have no real effect. Concatenating strings and converting timestamps did nothing for performance.
The app was rewritten to have several reader threads that put ready data into the writer queue. Each thread has its own connection; no pools are used, because they slow down the extraction (I used the UCP pool recommended by Oracle, and it consumed about 10% of the execution time, so I gave up on it). The result set fetchSize was also increased, because switching from the default value (10) to 50000 gives up to 50% performance growth.
I tested how the multithreaded version works with 4 reading threads and found that increasing the reader count only slows the extraction. I tried launching 2 instances, each with two readers, and both took the same time as a single instance, i.e. extracting double the data required the same time as a single instance. I don't know why this happens, but it looks like the Oracle driver has some performance limitation. An application with 4 independent connections works slower than 2 app instances with 2 connections each. (The profiler was used to ensure that the driver's Socket.read() is still the main issue; all other parts work fine in multithreaded mode.)
I tried to fetch all the data with SAS, and it can perform the same extraction 2 times faster than JDBC; both used a single connection to Oracle and could not use any dump operations. Oracle assures that the JDBC thin driver is as fast as the native one.
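For the multi-reader experiments described above, each connection needs a disjoint slice of the table. One way to split rows without relying on an indexed key is ORA_HASH over ROWID (this assumes the view exposes ROWID, which not every view does); the per-reader query strings can be generated like this (the view name is from the question, the bucket count is an assumption):

```java
public class BucketQueries {
    // ORA_HASH(expr, maxBucket) returns values 0..maxBucket, so
    // ORA_HASH(ROWID, readers - 1) yields one bucket per reader.
    static String[] bucketQueries(String table, int readers) {
        String[] queries = new String[readers];
        for (int i = 0; i < readers; i++) {
            queries[i] = "select * from " + table
                    + " where ORA_HASH(ROWID, " + (readers - 1) + ") = " + i;
        }
        return queries;
    }

    public static void main(String[] args) {
        for (String q : bucketQueries("src_schema.big_table_view", 4)) {
            System.out.println(q);
        }
    }
}
```

Each reader thread then executes its own query on its own connection, and the buckets are disjoint by construction.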
Maybe Oracle has other ways to perform fast extraction to a remote host, through ODBC or something else?
Answered by Rajeev Sreedharan
Assuming you have already checked the basic network aspects such as interfaces, firewalls, and proxies, as well as the DB server's hardware.
Option 1:
Instead of :
Class.forName("oracle.jdbc.OracleDriver");
Connection conn = DriverManager.getConnection("jdbc:oracle:thin:@//xxx.xxx.xxx.xxx:1521/orcl", "user", "pass");
try using :
OracleDataSource ods = new OracleDataSource(); // oracle.jdbc.pool.OracleDataSource
java.util.Properties prop = new java.util.Properties();
prop.setProperty("MinLimit", "2");
prop.setProperty("MaxLimit", "10");
String url = "jdbc:oracle:oci8:@//xxx.xxx.xxx.xxx:1521/orcl";
ods.setURL(url);
ods.setUser("USER");
ods.setPassword("PWD");
ods.setConnectionCachingEnabled(true);
ods.setConnectionCacheProperties(prop);
ods.setConnectionCacheName("ImplicitCache01");
Connection conn = ods.getConnection(); // connections now come from the implicit cache
More details are in Oracle's documentation on implicit connection caching.
Option 2: Fetch size
As Stephen strongly pointed out, the fetch size seems too large.
And, for a fetch size of 500,000, what are your -Xms and -Xmx? Also, in the profiler, what is the highest heap size?
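For reference, with the Oracle thin driver the row-prefetch default can also be set per connection, via the defaultRowPrefetch connection property, rather than per statement. A sketch (the actual connection is left commented out, since it needs a live database; the URL is the placeholder from the question):

```java
import java.util.Properties;

public class PrefetchConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("user", "user");
        props.setProperty("password", "pass");
        // The Oracle thin driver honors this as the per-connection
        // default fetch size for all statements on the connection.
        props.setProperty("defaultRowPrefetch", "50");

        // Connection conn = DriverManager.getConnection(
        //         "jdbc:oracle:thin:@//xxx.xxx.xxx.xxx:1521/orcl", props);
        System.out.println(props.getProperty("defaultRowPrefetch"));
    }
}
```

A Statement.setFetchSize() call still overrides the connection-level default for that statement.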
Option 3: DB
Check indexes and the query plan for
src_schema.big_table_view
Is this a tool or an application system? If just a tool, you could add parallel degrees, index hints, partitioning, etc., based on the DB system's capabilities.
Option 4: Threads
Say n < the number of cores on the application server.
You can start n writer threads, each configured to process a certain bucket, e.g. thread 1 processes rows 0 to 10000, writing to n different files. Once all threads are done, after the join, merge the files together, preferably using a low-level OS command.
That said, none of this should be pre-defined code like it is now. 'n' and the buckets should be calculated at runtime. And creating more threads than your system supports only messes things up.
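The runtime bucket calculation called for above can be as simple as splitting the total row count into contiguous ranges; a standalone sketch (how each range maps onto a query, e.g. via ROWNUM or a key column, is left open):

```java
public class Buckets {
    // Split [0, totalRows) into workerCount contiguous [start, end) ranges,
    // distributing the remainder across the first ranges.
    static long[][] buckets(long totalRows, int workerCount) {
        long[][] ranges = new long[workerCount][2];
        long base = totalRows / workerCount;
        long rem = totalRows % workerCount;
        long start = 0;
        for (int i = 0; i < workerCount; i++) {
            long size = base + (i < rem ? 1 : 0);
            ranges[i][0] = start;
            ranges[i][1] = start + size;
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] r : buckets(10, 3)) {
            System.out.println(r[0] + ".." + r[1]);
        }
    }
}
```

In practice, totalRows would come from a COUNT query and workerCount from Runtime.getRuntime().availableProcessors().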
Option 5:
Instead of
select * from src_schema.big_table_view
You could use
SELECT column1||CHR(9)||column2||CHR(9).....||columnN FROM src_schema.big_table_view
This avoids creating 500,000 StringBuilders and Strings (assuming no other complex formatting is involved). CHR(9) is the tab character.
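Rather than hand-writing the concatenated column list, the single-column query can be generated from the column names; a small sketch (column names taken from the question's code):

```java
import java.util.List;

public class ConcatQuery {
    // Join the columns with ||CHR(9)|| so Oracle returns one
    // tab-separated string per row instead of N separate columns.
    static String tabSeparatedSelect(String table, List<String> cols) {
        return "SELECT " + String.join("||CHR(9)||", cols) + " FROM " + table;
    }

    public static void main(String[] args) {
        System.out.println(tabSeparatedSelect("src_schema.big_table_view",
                List.of("OWNER", "OBJECT_NAME", "OBJECT_ID")));
    }
}
```

On the Java side the read loop then shrinks to a single rs.getString(1) per row.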
Option 6:
Meanwhile, you could also check with your DBA for any DB system issues and raise an SR with Oracle Support.
Answered by Stephen C
It looks like you have already found and tweaked the row prefetch parameter. However, according to the Oracle documentation:
"There is no maximum prefetch setting, but empirical evidence suggests that 10 is effective. Oracle has never observed a performance benefit to setting prefetch higher than 50. If you do not set the default row-prefetch value for a connection, then 10 is the default."
You are setting it to 500,000. Try winding it back to around 50 ... as Oracle recommends. (Why? Well, it could be that a hugely excessive prefetch size is causing the server or client to use excessive amounts of memory to buffer the prefetched data. That could have a "knock-on effect" on other things, leading to reduced throughput.)
Reference (from the Oracle 10g documentation):
You might be able to get greater throughput by running simultaneous queries in multiple Java threads (e.g. on separate "sections" of the table), writing each result set to a separate stream / file. But then you have the problem of stitching the output streams / files together. (And whether you do get an overall improvement will depend on the number of client and server-side cores, network and NIC capacity, and disc I/O capacity.)
Apart from that, I can't think of any way to do this faster in Java. But you could try PL/SQL, or something lower level. (I'm not an Oracle expert. Talk to your DBAs.)
A factor of 10 speed up in Java is ... ambitious.
Answered by Stephen C
Your profiling is flawed
The methods you list are most likely highly optimized already. I have analyzed systems where the most called method, and the one the most time was spent in, was StringBuffer.append() inside the Oracle JDBC code, because the entire system used PreparedStatement and it calls that method a lot! Needless to say, that was a red herring in our case.
Profile your network traffic:
If your connection is saturated, that is your bottleneck, not the code you listed.
This needs to be done on the server side if Oracle has to be the source of the data. You will never pull hundreds of millions of records across a network connection, and then back again, at 10X the speed you are getting now, unless you have 10X the network cards in both endpoints and have all of them bonded. Even then, I am skeptical you will get 10X the throughput.
If you really are limited to Java and Oracle, the only way you will get more throughput than you are getting now is to run the Java as a stored procedure on the server(s), generate the files you need, and then retrieve them from the remote system.
I have built systems that dealt with millions of transactions a minute; that kind of throughput doesn't happen over a single network connection. It happens over a grid of machines with multiple network interfaces, on dedicated send/receive switches, on a dedicated subnet isolated from the rest of the traffic in the data center.
Also
Your threading code is naive at best. You should never create and manage threads manually. ExecutorService has been around for 10 years; use it! ExecutorCompletionService is what you want to use in this case, and actually in almost all cases.
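A minimal standalone illustration of ExecutorCompletionService, consuming results in completion order rather than submission order (dummy computations stand in for the JDBC readers):

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CompletionDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        CompletionService<Integer> cs = new ExecutorCompletionService<>(pool);

        for (int i = 1; i <= 3; i++) {
            final int id = i;
            cs.submit(() -> id * 10); // stand-in for "read bucket i"
        }

        int sum = 0;
        for (int i = 0; i < 3; i++) {
            sum += cs.take().get();   // take() yields futures as tasks finish
        }
        pool.shutdown();
        System.out.println(sum);
    }
}
```

The pool handles thread lifecycle, and the completion queue removes the hand-rolled flag-and-poll coordination from the question's code.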
ListenableFuture is an even better choice if you can use Guava.