How to set the number of partitions/nodes when importing data into Spark
Note: this is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/34597923/
Asked by pemfir
Problem: I want to import data into Spark on EMR from S3 using:
data = sqlContext.read.json("s3n://.....")
Is there a way to set the number of nodes that Spark uses to load and process the data? This is an example of how I process the data:
data.registerTempTable("table")
SqlData = sqlContext.sql("SELECT * FROM table")
Context: The data is not too big, yet it takes a long time to load into Spark and also to query. I think Spark partitions the data across too many nodes, and I want to be able to set that manually. I know that when dealing with RDDs and sc.parallelize I can pass the number of partitions as an input. I have also seen repartition(), but I am not sure whether it solves my problem. The variable data is a DataFrame in my example.
Let me define partition more precisely. Definition one: commonly referred to as the "partition key", where a column is selected and indexed to speed up queries (that is not what I want). Definition two (this is where my concern is): given a data set, Spark decides to distribute it across many nodes so it can run operations on the data in parallel. If the data size is too small, this may actually slow the process down. How can I set that value?
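For comparison, here is a minimal PySpark sketch (not from the original post) of the RDD case the question mentions, where sc.parallelize takes an explicit partition count, next to the DataFrame read, which takes none; the S3 path is the placeholder from the question and will not resolve as written.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="partition-example")
sqlContext = SQLContext(sc)

# RDD: the second argument fixes the number of partitions explicitly.
rdd = sc.parallelize(range(1000), 10)
print(rdd.getNumPartitions())   # 10

# DataFrame: read.json has no partition-count argument; Spark decides
# the number of partitions from the input splits.
data = sqlContext.read.json("s3n://.....")
print(data.rdd.getNumPartitions())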
Answered by Durga Viswanath Gadiraju
By default, Spark SQL uses 200 shuffle partitions. You can change this with a set command in the SQL context: sqlContext.sql("set spark.sql.shuffle.partitions=10"). However, you need to set it with caution, based on your data characteristics.
Answered by Raju Bairishetti
You can call repartition() on the DataFrame to set the number of partitions. You can also set the spark.sql.shuffle.partitions property after creating the Hive context, or pass it to the spark-submit jar:
spark-submit .... --conf spark.sql.shuffle.partitions=100
or
dataframe.repartition(100)
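A short sketch of how the two options might fit together, assuming data is the DataFrame loaded from S3 in the question; neither line below is from the original answer.

# Option 1: set the shuffle partition count as a SQL property; this is
# equivalent to passing --conf spark.sql.shuffle.partitions=100 to spark-submit.
sqlContext.sql("set spark.sql.shuffle.partitions=100")

# Option 2: repartition the DataFrame itself. This triggers a full shuffle
# now, but later stages then run with exactly 100 tasks.
data = data.repartition(100)
print(data.rdd.getNumPartitions())   # 100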
Answered by Thomas Decaux
The number of "input" partitions is fixed by the file system configuration.
One 1 GB file with a block size of 128 MB will give you 8 tasks. I am not sure you can change that.
repartition() can be very bad: if you have a lot of input partitions, it will cause a lot of shuffling (data traffic) between partitions.
There is no magic method; you have to experiment and use the web UI to see how many tasks are generated.
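As a rough, hedged sketch of the kind of checking this answer suggests (again assuming data is the DataFrame from the question), the partition count can also be inspected from code, and coalesce() reduces it without the full shuffle that repartition() causes:

# Number of input partitions, roughly file size divided by the block size.
print(data.rdd.getNumPartitions())

# coalesce(n) merges existing partitions without shuffling data across the
# cluster, avoiding the traffic this answer warns about; repartition(n)
# would reshuffle everything.
smaller = data.coalesce(10)
print(smaller.rdd.getNumPartitions())   # 10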