Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/16618753/

Time: 2020-09-19 01:40:37  Source: igfitidea

How to find optimal number of mappers when running Sqoop import and export?

oracle hadoop mapreduce hdfs sqoop

Asked by Bohdan

I'm using Sqoop version 1.4.2 with an Oracle database.

When running a Sqoop command, for example like this:

./sqoop import                               \
    --fs <name node>                         \
    --jt <job tracker>                       \
    --connect <JDBC string>                  \
    --username <user> --password <password>  \
    --table <table> --split-by <cool column> \
    --target-dir <where>                     \
    --verbose --m 2

With --m we can specify how many parallel tasks we want Sqoop to run (they may also be accessing the database at the same time). The same option is available for ./sqoop export <...>.

Is there a heuristic (probably based on the size of the data) that helps estimate the optimal number of tasks to use?

Thank you!

Answered by Chris Marotta

This is taken from Apache Sqoop Cookbook by O'Reilly Media, and seems to be the most logical answer.

The optimal number of mappers depends on many variables: you need to take into account your database type, the hardware that is used for your database server, and the impact to other requests that your database needs to serve. There is no optimal number of mappers that works for all scenarios. Instead, you're encouraged to experiment to find the optimal degree of parallelism for your environment and use case. It's a good idea to start with a small number of mappers, slowly ramping up, rather than to start with a large number of mappers, working your way down.
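
The "start small and ramp up" advice can be turned into a small timing loop. The following is only a sketch under assumptions: `ramp_up_mappers` is a hypothetical helper (not part of Sqoop), the step list is arbitrary, and the base command is assumed to be a complete Sqoop invocation without any mapper option, since the helper appends `--num-mappers` itself. For an import, each run would also need a fresh target directory, which this sketch does not handle.

```shell
# Hypothetical helper: run the same command at increasing mapper counts
# and print how long each run took. Not part of Sqoop.
#   $1    space-separated list of mapper counts to try, smallest first
#   $2... base command (e.g. a full `sqoop import ...`) without -m/--num-mappers
ramp_up_mappers() {
    local steps="$1" n start end
    shift
    for n in $steps; do
        start=$(date +%s)
        # Append the mapper count for this run; stop at the first failure.
        if ! "$@" --num-mappers "$n" >/dev/null 2>&1; then
            echo "mappers=$n failed" >&2
            return 1
        fi
        end=$(date +%s)
        echo "mappers=$n seconds=$((end - start))"
    done
}
```

A call might look like `ramp_up_mappers "1 2 4 8" ./sqoop export ...` with your own connection arguments. Watching the load on the database while the loop runs matters as much as the timings, since the quoted passage names the impact on other requests as one of the variables.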

Answered by Engineiro

In "Hadoop: The Definitive Guide," the authors explain that when you set the maximum number of map/reduce tasks on each TaskTracker, you should consider the processor and its cores. I would apply the same logic here: look at how many processes your processor(s) can run (counting Hyper-Threading and cores) and set --m to that value minus 1, leaving one slot open for other tasks that may pop up during the export. BUT this only applies if you have a large dataset and want to get the export done in a timely manner.
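
As a rough illustration of the "cores minus one" rule, here is a minimal sketch. `suggest_mappers` is a hypothetical helper name, and reading the logical CPU count with `getconf _NPROCESSORS_ONLN` is an assumption about the platform (it is available on Linux and macOS). The count is for the machine the script runs on; the answer does not say which machine's processors should be counted.

```shell
# Hypothetical helper: suggest a mapper count of (logical CPUs - 1),
# never going below 1. Pass a CPU count as $1 to override detection.
suggest_mappers() {
    # Assumption: getconf reports logical CPUs, so Hyper-Threading is included.
    local cores="${1:-$(getconf _NPROCESSORS_ONLN)}"
    if [ "$cores" -gt 1 ]; then
        echo $((cores - 1))
    else
        echo 1
    fi
}
```

The result could then be passed along as `--m "$(suggest_mappers)"`, treated as a starting point rather than a tuned value.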

If you don't have a large dataset, remember that the output is split into as many files as the value of --m, so if you are exporting a 100-row table from the database, you may want to set --m to 1 to keep all the data in one file.
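
One way to check how many files a run produced is to count the per-mapper files in the target directory. This sketch assumes the files follow the map-only MapReduce naming pattern `part-m-NNNNN`; `count_part_files` is a hypothetical helper that reads a directory listing on standard input, and the path in the usage comment is a placeholder.

```shell
# Hypothetical helper: count per-mapper output files in a directory listing
# read from stdin. Assumes files are named like part-m-00000.
count_part_files() {
    # grep -c prints the number of matching lines; `|| true` keeps the
    # exit status at 0 when the count is zero.
    grep -c 'part-m-[0-9]' || true
}

# Usage (assumes the hdfs CLI is available; the path is a placeholder):
#   hdfs dfs -ls /path/to/target-dir | count_part_files
```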
