How to set the number of partitions/nodes when importing data into Spark
Note: this is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/34597923/
Asked by pemfir
Problem: I want to import data into Spark on EMR from S3 using:
data = sqlContext.read.json("s3n://.....")
Is there a way to set the number of nodes that Spark uses to load and process the data? This is an example of how I process the data:
data.registerTempTable("table")
SqlData = sqlContext.sql("SELECT * FROM table")
Context: The data is not too big, yet it takes a long time to load into Spark and also to query. I think Spark partitions the data across too many nodes, and I want to be able to set that manually. I know that when dealing with RDDs and sc.parallelize I can pass the number of partitions as an input. I have also seen repartition(), but I am not sure whether it solves my problem. The variable data is a DataFrame in my example.
Let me define partition more precisely. Definition one: commonly referred to as the "partition key", where a column is selected and indexed to speed up queries (that is not what I want). Definition two (this is where my concern is): given a data set, Spark decides to distribute it across many nodes so it can run operations on the data in parallel. If the data size is too small, this may actually slow the process down. How can I set that value?
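For comparison, here is a minimal PySpark sketch (not from the original post) of the RDD case the question mentions, where sc.parallelize takes an explicit partition count, next to the DataFrame read, which takes none; the S3 path is the placeholder from the question and will not resolve as written.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="partition-example")
sqlContext = SQLContext(sc)

# RDD: the second argument fixes the number of partitions explicitly.
rdd = sc.parallelize(range(1000), 10)
print(rdd.getNumPartitions())   # 10

# DataFrame: read.json has no partition-count argument; Spark decides
# the number of partitions from the input splits.
data = sqlContext.read.json("s3n://.....")
print(data.rdd.getNumPartitions())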
Answered by Durga Viswanath Gadiraju
By default, Spark SQL uses 200 shuffle partitions. You can change this with a set command in the SQL context: sqlContext.sql("set spark.sql.shuffle.partitions=10"). However, you need to set it with caution, based on your data characteristics.
Answered by Raju Bairishetti
You can call repartition() on the DataFrame to set the number of partitions. You can also set the spark.sql.shuffle.partitions property after creating the Hive context, or pass it to the spark-submit jar:
spark-submit .... --conf spark.sql.shuffle.partitions=100
or
dataframe.repartition(100)
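A short sketch of how the two options might fit together, assuming data is the DataFrame loaded from S3 in the question; neither line below is from the original answer.

# Option 1: set the shuffle partition count as a SQL property; this is
# equivalent to passing --conf spark.sql.shuffle.partitions=100 to spark-submit.
sqlContext.sql("set spark.sql.shuffle.partitions=100")

# Option 2: repartition the DataFrame itself. This triggers a full shuffle
# now, but later stages then run with exactly 100 tasks.
data = data.repartition(100)
print(data.rdd.getNumPartitions())   # 100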
Answered by Thomas Decaux
The number of "input" partitions is fixed by the file system configuration.
One 1 GB file with a block size of 128 MB will give you 8 tasks. I am not sure you can change that.
repartition() can be very bad: if you have a lot of input partitions, it will cause a lot of shuffling (data traffic) between partitions.
There is no magic method; you have to experiment and use the web UI to see how many tasks are generated.
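As a rough, hedged sketch of the kind of checking this answer suggests (again assuming data is the DataFrame from the question), the partition count can also be inspected from code, and coalesce() reduces it without the full shuffle that repartition() causes:

# Number of input partitions, roughly file size divided by the block size.
print(data.rdd.getNumPartitions())

# coalesce(n) merges existing partitions without shuffling data across the
# cluster, avoiding the traffic this answer warns about; repartition(n)
# would reshuffle everything.
smaller = data.coalesce(10)
print(smaller.rdd.getNumPartitions())   # 10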