scala - Difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession?
原文地址: http://stackoverflow.com/questions/43802809/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession?
Asked by Mostwanted Mani
- What is the difference between SparkContext, JavaSparkContext, SQLContext and SparkSession?
- Is there any method to convert or create a Context using a SparkSession?
- Can I completely replace all the Contexts using one single entry SparkSession?
- Are all the functions in SQLContext, SparkContext, and JavaSparkContext also in SparkSession?
- Some functions like parallelize have different behaviors in SparkContext and JavaSparkContext. How do they behave in SparkSession?
- How can I create the following using a SparkSession? RDD, JavaRDD, JavaPairRDD, Dataset
Is there a method to transform a JavaPairRDD into a Dataset or a Dataset into a JavaPairRDD?
Answered by Balaji Reddy
sparkContext is the Scala implementation entry point and JavaSparkContext is a Java wrapper of sparkContext.
SQLContext is the entry point of Spark SQL, which can be obtained from sparkContext. Prior to 2.x.x, RDD, DataFrame and Dataset were three different data abstractions. Since Spark 2.x.x, all three data abstractions are unified and SparkSession is the unified entry point of Spark.
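For illustration, here is a minimal Scala sketch of what "unified entry point" means in 2.x (the master and app name are placeholder values): the same SparkSession hands out DataFrames (which are just Dataset[Row]), typed Datasets, and, if needed, the underlying RDDs.

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("unified-entry").getOrCreate()  // placeholder settings
val df: DataFrame = spark.range(5).toDF("id")      // DataFrame is an alias for Dataset[Row]
val ds: Dataset[java.lang.Long] = spark.range(5)   // strongly typed Dataset
val rdd = ds.rdd                                   // drop down to the RDD abstraction when needed
spark.stop()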
An additional note: RDDs are meant for unstructured, strongly typed data, while DataFrames are for structured, loosely typed data. You can check
Is there any method to convert or create a Context using SparkSession?
Yes. It's sparkSession.sparkContext() and, for SQL, sparkSession.sqlContext().
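As a note, in the Scala API these are exposed as vals rather than methods; a minimal sketch (builder settings are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("contexts").getOrCreate()  // placeholder settings
val sc  = spark.sparkContext   // the underlying SparkContext
val sql = spark.sqlContext     // the SQLContext kept for backward compatibility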
Can I completely replace all the Contexts using one single entry SparkSession?
Yes. You can get the respective contexts from sparkSession.
Are all the functions in SQLContext, SparkContext, JavaSparkContext, etc. added in SparkSession?
Not directly. You have to get the respective context and make use of it, something like backward compatibility.
How to use such a function in SparkSession?
Get the respective context and make use of it.
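For instance, a sketch of the parallelize case from the question: the Scala-flavoured call lives on the session's sparkContext, and the Java-flavoured one (taking a java.util.List) is still reachable by wrapping that same context. This reuses the spark value from the sketch above; the values are illustrative.

import org.apache.spark.api.java.JavaSparkContext

val nums = spark.sparkContext.parallelize(1 to 10)                  // Scala-style parallelize on a Scala collection
val jsc = new JavaSparkContext(spark.sparkContext)                  // wrap the same context for the Java API
val javaNums = jsc.parallelize(java.util.Arrays.asList(1, 2, 3))    // Java-style parallelize on a java.util.List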
How to create the following using SparkSession?
- RDD can be created from sparkSession.sparkContext.parallelize(???)
- JavaRDD: the same applies, but with the Java implementation
- JavaPairRDD: sparkSession.sparkContext.parallelize(???).map(...) (mapping your data into key-value pairs here is one way)
- Dataset: what sparkSession returns is a Dataset if it is structured data.
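A minimal Scala sketch of the four creations listed above, assuming a local session with placeholder names:

import org.apache.spark.api.java.{JavaPairRDD, JavaRDD}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("create-abstractions").getOrCreate()  // placeholder settings
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))      // RDD
val javaRdd: JavaRDD[(String, Int)] = rdd.toJavaRDD()                  // JavaRDD wraps the Scala RDD
val javaPairRdd: JavaPairRDD[String, Int] = JavaPairRDD.fromRDD(rdd)   // JavaPairRDD wraps a key-value RDD
val ds = spark.createDataset(rdd)                                      // Dataset[(String, Int)] via an implicit Encoder
spark.stop()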
Answered by Deanzz
Explanation from the Spark source code under branch-2.1:
SparkContext: Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
Only one SparkContext may be active per JVM. You must stop() the active SparkContext before
creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
JavaSparkContext: A Java-friendly version of [[org.apache.spark.SparkContext]] that returns [[org.apache.spark.api.java.JavaRDD]]s and works with Java collections instead of Scala ones.
Only one SparkContext may be active per JVM. You must stop() the active SparkContext before
creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
SQLContext: The entry point for working with structured data (rows and columns) in Spark 1.x.
As of Spark 2.0, this is replaced by [[SparkSession]]. However, we are keeping the class here for backward compatibility.
SparkSession: The entry point to programming Spark with the Dataset and DataFrame API.
Answered by Gaurang Shah
I will talk about Spark version 2.x only.
SparkSession: It's the main entry point of your Spark application. To run any code on Spark, this is the first thing you should create.
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("Word Count")\
.config("spark.some.config.option", "some-value")\
.getOrCreate()
SparkContext: It's an inner object (property) of SparkSession. It's used to interact with the low-level API. Through SparkContext you can create RDDs, accumulators and broadcast variables.
For most cases you won't need SparkContext. You can get the SparkContext from the SparkSession:
val sc = spark.sparkContext
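A brief sketch of the low-level pieces mentioned above (RDD, accumulator, broadcast variable), built from that sc; the names are illustrative:

val data    = sc.parallelize(Seq(1, 2, 3, 4))     // RDD
val counter = sc.longAccumulator("my-counter")    // accumulator, written on executors
val lookup  = sc.broadcast(Map("a" -> 1))         // read-only broadcast variable
data.foreach(n => counter.add(n))
println(counter.value)                            // read the accumulator on the driver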
Answered by Nilesh Shinde
Spark Context is a class in the Spark API and is the first stage in building a Spark application. The functionality of the Spark context is to allocate memory in RAM (we call this driver memory) and to allocate the number of executors and cores; in short, it is all about cluster management. Spark Context can be used to create RDDs and shared variables. To access it, we need to create an object of it.
This way we can create a Spark Context: var sc = new SparkContext()
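In practice the context is usually created from a SparkConf; a minimal sketch with placeholder master and app-name values:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("my-app")  // placeholder values
val sc = new SparkContext(conf)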
Spark Session: this is a new object added since Spark 2.x, which is a replacement for SQLContext and HiveContext. Earlier we had two options: one was SQLContext, which is the way to do SQL operations on a DataFrame, and the second was HiveContext, which manages the Hive connectivity related stuff and fetches/inserts data from/to Hive tables.
Since 2.x came, we can create a SparkSession for SQL operations on a DataFrame, and if you have any Hive-related work, just call the method enableHiveSupport(); then you can use the SparkSession for both DataFrame and Hive-related SQL operations.
This way we can create a SparkSession for SQL operations on a DataFrame:
val sparksession=SparkSession.builder().getOrCreate();
The second way is to create a SparkSession for SQL operations on a DataFrame as well as Hive operations:
val sparkSession=SparkSession.builder().enableHiveSupport().getOrCreate()
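Once created, the same session serves both kinds of work; a quick sketch (the table name my_hive_table and the column id are purely illustrative):

val hiveDf = sparkSession.sql("SELECT * FROM my_hive_table")   // query an existing Hive table, assuming one is defined
hiveDf.filter("id > 10").show()                                // then use the ordinary DataFrame API on the result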

