Python: setting SparkContext for pyspark

Disclaimer: the content on this page comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/24996302/

setting SparkContext for pyspark

Tags: python, apache-spark

Asked by Dalek

I am a newbie with spark and pyspark. I would appreciate it if somebody could explain what exactly the SparkContext parameter does, and how I could set spark_context for a Python application.

Accepted answer by mdurant

See here: the spark_context represents your interface to a running spark cluster manager. In other words, you will have already defined one or more running environments for spark (see the installation/initialization docs), detailing the nodes to run on etc. You start a spark_context object with a configuration which tells it which environment to use and, for example, the application name. All further interaction, such as loading data, happens as methods of the context object.

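As a rough sketch of that flow (the master setting, application name and file path here are only illustrative assumptions, not part of the original answer):

from pyspark import SparkContext

# Start a context with a configuration: which environment to use and the app name.
sc = SparkContext(master="local[4]", appName="My App")

# All further interaction happens as methods of the context object, e.g. loading data.
lines = sc.textFile("data.txt")  # "data.txt" is a placeholder path
print(lines.count())

sc.stop()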

For simple examples and testing, you can run the spark cluster "locally" and skip much of the detail above, e.g.,

./bin/pyspark --master local[4]

will start an interpreter with a context already set to use four threads on your own CPU.

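Inside that interpreter the context is already available as sc, so a quick check might look like this (a small sketch, assuming the shell was started with the command above):

sc.master                             # 'local[4]'
sc.parallelize(range(100)).count()    # 100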

In a standalone app, to be run with spark-submit:

from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
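
A slightly fuller version of that standalone app, as a sketch (the file name simple_app.py and the toy computation are only for illustration):

from pyspark import SparkContext

sc = SparkContext("local", "Simple App")
# Do some work through the context, e.g. sum a small in-memory dataset.
total = sc.parallelize([1, 2, 3, 4]).sum()
print(total)
sc.stop()

You would then launch it with something like ./bin/spark-submit simple_app.py.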

Answered by iec2011007

The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application.

If you are running the pyspark shell, then Spark automatically creates the SparkContext object for you with the name sc. But if you are writing your own Python program, you have to do something like:

from pyspark import SparkContext
sc = SparkContext(appName="test")

Any configuration would go into this spark context object, like setting the executor memory or the number of cores.

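For example, a sketch of setting those values programmatically through a SparkConf (the memory size and core count below are arbitrary assumptions):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("test")
        .set("spark.executor.memory", "2g")
        .set("spark.executor.cores", "1"))
sc = SparkContext(conf=conf)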

These parameters can also be passed from the shell when invoking spark-submit, for example:

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
lib/spark-examples*.jar \
10

To pass parameters to pyspark, use something like this:

./bin/pyspark --num-executors 17 --executor-cores 5 --executor-memory 8G

Answered by SeekingAlpha

The SparkContext object is the driver program. This object co-ordinates the processes over the cluster that you will be running your application on.

When you run the PySpark shell, a default SparkContext object is automatically created and exposed as the variable sc.

If you create a standalone application, you will need to initialize the SparkContext object in your script, as below:

from pyspark import SparkContext
sc = SparkContext("local", "My App")

Where the first parameter is the URL to the cluster and the second parameter is the name of your app.

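For instance, to connect to a standalone Spark cluster instead of running locally, the first argument becomes the cluster's master URL (the host and port below are placeholders):

from pyspark import SparkContext

sc = SparkContext("spark://master-host:7077", "My App")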

I have written an article that goes through the basics of PySpark and Apache Spark, which you may find useful: https://programmathics.com/big-data/apache-spark/apache-installation-and-building-stand-alone-applications/

DISCLAIMER: I am the creator of that website.
