Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same CC BY-SA license, link to the original address, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/43024766/

What are SparkSession Config Options
Asked by Sha2b
I am trying to use SparkSession in Spark Notebook to convert a file's JSON data to an RDD. I already have the JSON file.
val spark = SparkSession
.builder()
.appName("jsonReaderApp")
.config("config.key.here", configValueHere)
.enableHiveSupport()
.getOrCreate()
val jread = spark.read.json("search-results1.json")
I am very new to Spark and do not know what to use for config.key.here and configValueHere.
Answered by Clay
SparkSession
To get all the "various Spark parameters as key-value pairs" for a SparkSession, "the entry point to programming Spark with the Dataset and DataFrame API," run the following (this uses the Spark Python API; the Scala version is very similar).
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
SparkConf().getAll()
or without importing SparkConf:
spark.sparkContext.getConf().getAll()
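If you only need one specific setting rather than the full list, you can also look it up by key (a small sketch; the key names and default values below are just examples):
# Look up a single value through the session's runtime config; the second argument is a fallback.
spark.conf.get("spark.app.name", "not set")
# Or go through the underlying SparkContext's SparkConf.
spark.sparkContext.getConf().get("spark.master", "local[*]")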
Depending on which api you are using, see one of the following:
- https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SparkSession
- https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?#pyspark.sql.SparkSession
- https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.html
You can get a deeper level list of SparkSession configuration options by running the code below. Most are the same, but there are a few extra ones. I am not sure if you can change these.
spark.sparkContext._conf.getAll()
SparkContext
To get all the "various Spark parameters as key-value pairs" for a SparkContext, the "Main entry point for Spark functionality," ... "connection to a Spark cluster," ... and "to create RDDs, accumulators and broadcast variables on that cluster," run the following.
import pyspark
from pyspark import SparkConf, SparkContext
spark_conf = SparkConf().setAppName("test")
spark = SparkContext(conf = spark_conf)
SparkConf().getAll()
Depending on which api you are using, see one of the following:
- https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
- https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
- https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html
Spark parameters
You should get a list of tuples that contain the "various Spark parameters as key-value pairs" similar to the following:
[(u'spark.eventLog.enabled', u'true'),
(u'spark.yarn.appMasterEnv.PYSPARK_PYTHON', u'/<yourpath>/parcels/Anaconda-4.2.0/bin/python'),
...
...
(u'spark.yarn.jars', u'local:/<yourpath>/lib/spark2/jars/*')]
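If the raw list of tuples is awkward to search, one option (just a convenience sketch) is to turn it into a dict and look keys up directly:
# Build a dict from the (key, value) tuples for easy lookups.
conf_dict = dict(spark.sparkContext.getConf().getAll())
conf_dict.get("spark.eventLog.enabled")  # e.g. 'true', or None if the key is not set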
Depending on which api you are using, see one of the following:
- https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkConf
- https://spark.apache.org/docs/latest/api/python/pyspark.html?#pyspark.SparkConf
- https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkConf.html
For a complete list of Spark properties, see:
http://spark.apache.org/docs/latest/configuration.html#viewing-spark-properties
Setting Spark parameters
Each tuple is ("spark.some.config.option", "some-value"), which you can set with:
SparkSession
spark = (SparkSession
.builder
.appName("Your App Name")
.config("spark.some.config.option1", "some-value")
.config("spark.some.config.option2", "some-value")
.getOrCreate())
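Some options can also be changed after the session exists through its runtime config. This is only a sketch; whether a given key is mutable at run time depends on the property, and static ones (for example spark.sql.warehouse.dir) will be rejected:
# Adjust a mutable SQL option on a live session and read it back.
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.get("spark.sql.shuffle.partitions")  # '200'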
SparkContext
spark_conf = (SparkConf()
    .setAppName("Your App Name")
    .set("spark.some.config.option1", "some-value")
    .set("spark.some.config.option2", "some-value"))
sc = SparkContext(conf=spark_conf)
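To confirm the options took effect, you can read them back from the live SparkContext created above (a small sketch reusing the example keys):
# getConf() returns a copy of the context's SparkConf.
sc.getConf().get("spark.some.config.option1")  # 'some-value'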
Answered by Jeff A.
This is how I added Spark or Hive settings in my Scala code:
{
val spark = SparkSession
.builder()
.appName("StructStreaming")
.master("yarn")
.config("hive.merge.mapfiles", "false")
.config("hive.merge.tezfiles", "false")
.config("parquet.enable.summary-metadata", "false")
.config("spark.sql.parquet.mergeSchema","false")
.config("hive.merge.smallfiles.avgsize", "160000000")
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("spark.sql.orc.impl", "native")
.config("spark.sql.parquet.binaryAsString","true")
.config("spark.sql.parquet.writeLegacyFormat","true")
//.config("spark.sql.streaming.checkpointLocation", "hdfs://pp/apps/hive/warehouse/dev01_landing_initial_area.db")
.getOrCreate()
}
Answered by Sriram
In simple terms, values set via the "config" method are automatically propagated to both the SparkConf and the SparkSession's own configuration.
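As a quick illustration of that propagation (a sketch assuming a fresh session; spark.sql.shuffle.partitions is used only as an example key):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.sql.shuffle.partitions", "50")
    .getOrCreate())
# The value is visible in the session's runtime config...
spark.conf.get("spark.sql.shuffle.partitions")  # '50'
# ...and in the SparkConf of the underlying SparkContext.
spark.sparkContext.getConf().get("spark.sql.shuffle.partitions")  # '50'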
For example, you can refer to https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-settings.html to understand how Hive warehouse locations are set for a SparkSession using the config option.
To learn about this API, you can refer to: https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/SparkSession.Builder.html
Answered by Anil
Every Spark config option is explained at: http://spark.apache.org/docs/latest/configuration.html
You can set these at run time, as in your example above, or through the config file given to spark-submit.
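For instance (a sketch; the application file and values are placeholders), the same properties can be passed on the spark-submit command line with --conf, or collected in a properties file such as conf/spark-defaults.conf:
spark-submit \
  --conf spark.executor.memory=4g \
  --conf spark.eventLog.enabled=true \
  your_app.py

# Equivalent entries in conf/spark-defaults.conf (or a file passed via --properties-file):
# spark.executor.memory   4g
# spark.eventLog.enabled  true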

