Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same CC BY-SA license, link to the original address, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/43024766/

What are SparkSession Config Options
Asked by Sha2b
I am trying to use SparkSession in Spark Notebook to convert a file's JSON data to an RDD. I already have the JSON file.
val spark = SparkSession
.builder()
.appName("jsonReaderApp")
.config("config.key.here", configValueHere)
.enableHiveSupport()
.getOrCreate()
val jread = spark.read.json("search-results1.json")
I am very new to Spark and do not know what to use for config.key.here and configValueHere.
Answered by Clay
SparkSession
To get all the "various Spark parameters as key-value pairs" for a SparkSession, "the entry point to programming Spark with the Dataset and DataFrame API," run the following (this uses the Spark Python API; the Scala version is very similar).
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
SparkConf().getAll()
or without importing SparkConf:
spark.sparkContext.getConf().getAll()
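If you only need one specific setting rather than the full list, you can also look it up by key (a small sketch; the key names and default values below are just examples):
# Look up a single value through the session's runtime config; the second argument is a fallback.
spark.conf.get("spark.app.name", "not set")
# Or go through the underlying SparkContext's SparkConf.
spark.sparkContext.getConf().get("spark.master", "local[*]")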
Depending on which api you are using, see one of the following:
- https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SparkSession
- https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?#pyspark.sql.SparkSession
- https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.html
You can get a deeper level list of SparkSession configuration options by running the code below. Most are the same, but there are a few extra ones. I am not sure if you can change these.
spark.sparkContext._conf.getAll()
SparkContext
To get all the "various Spark parameters as key-value pairs" for a SparkContext, the "Main entry point for Spark functionality," ... "connection to a Spark cluster," ... and "to create RDDs, accumulators and broadcast variables on that cluster," run the following.
import pyspark
from pyspark import SparkConf, SparkContext
spark_conf = SparkConf().setAppName("test")
spark = SparkContext(conf = spark_conf)
SparkConf().getAll()
Depending on which api you are using, see one of the following:
- https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
- https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
- https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html
Spark parameters
You should get a list of tuples that contain the "various Spark parameters as key-value pairs" similar to the following:
[(u'spark.eventLog.enabled', u'true'),
(u'spark.yarn.appMasterEnv.PYSPARK_PYTHON', u'/<yourpath>/parcels/Anaconda-4.2.0/bin/python'),
...
...
(u'spark.yarn.jars', u'local:/<yourpath>/lib/spark2/jars/*')]
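If the raw list of tuples is awkward to search, one option (just a convenience sketch) is to turn it into a dict and look keys up directly:
# Build a dict from the (key, value) tuples for easy lookups.
conf_dict = dict(spark.sparkContext.getConf().getAll())
conf_dict.get("spark.eventLog.enabled")  # e.g. 'true', or None if the key is not set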
Depending on which api you are using, see one of the following:
- https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkConf
- https://spark.apache.org/docs/latest/api/python/pyspark.html?#pyspark.SparkConf
- https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkConf.html
For a complete list of Spark properties, see:
http://spark.apache.org/docs/latest/configuration.html#viewing-spark-properties
Setting Spark parameters
Each tuple is ("spark.some.config.option", "some-value"), which you can set with:
SparkSession
spark = (SparkSession
.builder
.appName("Your App Name")
.config("spark.some.config.option1", "some-value")
.config("spark.some.config.option2", "some-value")
.getOrCreate())
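Some options can also be changed after the session exists through its runtime config. This is only a sketch; whether a given key is mutable at run time depends on the property, and static ones (for example spark.sql.warehouse.dir) will be rejected:
# Adjust a mutable SQL option on a live session and read it back.
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.get("spark.sql.shuffle.partitions")  # '200'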
SparkContext
spark_conf = (SparkConf()
    .setAppName("Your App Name")
    .set("spark.some.config.option1", "some-value")
    .set("spark.some.config.option2", "some-value"))
sc = SparkContext(conf=spark_conf)
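To confirm the options took effect, you can read them back from the live SparkContext created above (a small sketch reusing the example keys):
# getConf() returns a copy of the context's SparkConf.
sc.getConf().get("spark.some.config.option1")  # 'some-value'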
Answered by Jeff A.
This is how I added Spark or Hive settings in my Scala code:
{
val spark = SparkSession
.builder()
.appName("StructStreaming")
.master("yarn")
.config("hive.merge.mapfiles", "false")
.config("hive.merge.tezfiles", "false")
.config("parquet.enable.summary-metadata", "false")
.config("spark.sql.parquet.mergeSchema","false")
.config("hive.merge.smallfiles.avgsize", "160000000")
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("spark.sql.orc.impl", "native")
.config("spark.sql.parquet.binaryAsString","true")
.config("spark.sql.parquet.writeLegacyFormat","true")
//.config("spark.sql.streaming.checkpointLocation", "hdfs://pp/apps/hive/warehouse/dev01_landing_initial_area.db")
.getOrCreate()
}
Answered by Sriram
In simple terms, values set via the "config" method are automatically propagated to both the SparkConf and the SparkSession's own configuration.
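As a quick illustration of that propagation (a sketch assuming a fresh session; spark.sql.shuffle.partitions is used only as an example key):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.sql.shuffle.partitions", "50")
    .getOrCreate())
# The value is visible in the session's runtime config...
spark.conf.get("spark.sql.shuffle.partitions")  # '50'
# ...and in the SparkConf of the underlying SparkContext.
spark.sparkContext.getConf().get("spark.sql.shuffle.partitions")  # '50'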
For example, you can refer to https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-settings.html to understand how Hive warehouse locations are set for a SparkSession using the config option.
To learn about this API, you can refer to: https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/SparkSession.Builder.html
Answered by Anil
Every Spark config option is explained at: http://spark.apache.org/docs/latest/configuration.html
You can set these at run time, as in your example above, or through the config file given to spark-submit.
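For instance (a sketch; the application file and values are placeholders), the same properties can be passed on the spark-submit command line with --conf, or collected in a properties file such as conf/spark-defaults.conf:
spark-submit \
  --conf spark.executor.memory=4g \
  --conf spark.eventLog.enabled=true \
  your_app.py

# Equivalent entries in conf/spark-defaults.conf (or a file passed via --properties-file):
# spark.executor.memory   4g
# spark.eventLog.enabled  true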

