将数据从 oracle 移动到 HDFS,处理并从 HDFS 移动到 Teradata
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/16890053/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Move data from oracle to HDFS, process and move to Teradata from HDFS
提问by Manikandan Kannan
My requirement is to
我的要求是
- Move data from Oracle to HDFS
- Process the data on HDFS
- Move processed data to Teradata.
- 将数据从 Oracle 移动到 HDFS
- 处理 HDFS 上的数据
- 将处理后的数据移动到 Teradata。
It is also required to do this entire processing every 15 minutes. The volume of source data may be close to 50 GB and the processed data also may be the same.
还需要每 15 分钟进行一次整个处理。源数据量可能接近50GB,处理后的数据也可能相同。
After searching a lot on the internet, i found that
在互联网上搜索了很多之后,我发现
- ORAOOP to move data from Oracle to HDFS (Have the code withing the shell script and schedule it to run at the required interval).
- Do large scale processing either by Custom MapReduce or Hive or PIG.
- SQOOP - Teradata Connector to move data from HDFS to Teradata (again have a shell script with the code and then schedule it).
- ORAOOP 将数据从 Oracle 移动到 HDFS(使用 shell 脚本编写代码并安排它以所需的时间间隔运行)。
- 通过自定义 MapReduce 或 Hive 或 PIG 进行大规模处理。
- SQOOP - Teradata 连接器将数据从 HDFS 移动到 Teradata(再次使用带有代码的 shell 脚本,然后对其进行调度)。
Is this the right option in the first place and is this feasible for the required time period (Please note that this is not the daily batch or so)?
这首先是正确的选择吗,这在所需的时间段内是否可行(请注意,这不是每日批次左右)?
Other options that i found are the following
我发现的其他选项如下
- STORM (for real time data processing). But i am not able to find the oracle Spout or Teradata bolt out of the box.
- Any open source ETL tools like Talend or Pentaho.
- STORM(用于实时数据处理)。但是我找不到开箱即用的 oracle Spout 或 Teradata bolt。
- 任何开源 ETL 工具,如 Talend 或 Pentaho。
Please share your thoughts on these options as well and any other possibilities.
请分享您对这些选项以及任何其他可能性的想法。
回答by Charles Menguy
Looks like you have several questions so let's try to break it down.
看起来您有几个问题,所以让我们尝试将其分解。
Importing in HDFS
在 HDFS 中导入
It seems you are looking for Sqoop. Sqoop is a tool that lets you easily transfer data in/out of HDFS, and can connect to various databases including Oracle natively. Sqoop is compatible with the Oracle JDBC thin driver. Here is how you would transfer from Oracle to HDFS:
看来您正在寻找Sqoop。Sqoop 是一个工具,可以让您轻松地将数据传入/传出 HDFS,并且可以本地连接到包括 Oracle 在内的各种数据库。Sqoop 与 Oracle JDBC 瘦驱动程序兼容。以下是从 Oracle 转移到 HDFS 的方法:
sqoop import --connect jdbc:oracle:thin@myhost:1521/db --username xxx --password yyy --table tbl --target-dir /path/to/dir
For more information: hereand here. Note than you can also import directly into a Hive table with Sqoop which could be convenient to do your analysis.
欲了解更多信息:这里和这里。请注意,您还可以使用 Sqoop 直接导入到 Hive 表中,这可以方便地进行分析。
Processing
加工
As you noted, since your data initially is relational, it is a good idea to use Hive to do your analysis since you might be more familiar with SQL-like syntax. Pig is more pure relational algebra and the syntax is NOT SQL-like, it is more a matter of preference but both approaches should work fine.
正如您所指出的,由于您的数据最初是关系数据,因此最好使用 Hive 进行分析,因为您可能更熟悉类似 SQL 的语法。Pig 是更纯粹的关系代数,并且语法不像 SQL,它更多是一个偏好问题,但两种方法都应该可以正常工作。
Since you can import data into Hive directly with Sqoop, your data should be directly ready to be processed after it is imported.
由于您可以使用 Sqoop 直接将数据导入 Hive,因此您的数据在导入后应该可以直接进行处理。
In Hive you could run your query and tell it to write the results in HDFS:
在 Hive 中,您可以运行查询并告诉它在 HDFS 中写入结果:
hive -e "insert overwrite directory '/path/to/output' select * from mytable ..."
Exporting into TeraData
导出到 TeraData
Cloudera released last year a connector for Teradata for Sqoop as described here, so you should take a look as this looks like exactly what you want. Here is how you would do it:
Cloudera的描述去年公布的连接器为Teradata为Sqoop在这里,所以你应该看一看,因为这看起来像你想要什么。以下是您的操作方法:
sqoop export --connect jdbc:teradata://localhost/DATABASE=MY_BASE --username sqooptest --password xxxxx --table MY_DATA --export-dir /path/to/hive/output
The whole thing is definitely doable in whatever time period you want, in the end what will matter is the size of your cluster, if you want it quick then scale your cluster up as needed. The good thing with Hive and Sqoop is that processing will be distributed in your cluster, so you have total control over the schedule.
在您想要的任何时间段内,整件事绝对是可行的,最终重要的是您的集群的大小,如果您希望它快速,然后根据需要扩展您的集群。Hive 和 Sqoop 的好处是处理将分布在您的集群中,因此您可以完全控制计划。
回答by user2711581
If you have concerns about the overhead or latency of moving the data from Oracle into HDFS, a possible commercial solution might be Dell Software's SharePlex. They recently released a connector for Hadoop that would allow you to replicate table data from Oracle to Hadoop. More info here.
如果您担心将数据从 Oracle 移动到 HDFS 的开销或延迟,一个可能的商业解决方案可能是 Dell Software 的 SharePlex。他们最近发布了一个 Hadoop 连接器,允许您将表数据从 Oracle 复制到 Hadoop。更多信息在这里。
I'm not sure if you need to reprocess the entire data set each time or can possibly just use the deltas. SharePlex also supports replicating the change data to a JMS queue. It might be possible to create a Spout that reads from that queue. You could probably also build your own trigger based solution but it would be a bit of work.
我不确定您是否每次都需要重新处理整个数据集,或者是否可以只使用增量。SharePlex 还支持将更改数据复制到 JMS 队列。可以创建一个从该队列读取的 Spout。您可能还可以构建自己的基于触发器的解决方案,但这需要一些工作。
As a disclosure, I work for Dell Software.
作为披露,我为戴尔软件工作。