Python 教程中的多个 SparkContexts 错误
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/23280629/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
multiple SparkContexts error in tutorial
提问by Glenn Strycker
I am attempting to run the very basic Spark+Python pyspark tutorial -- see http://spark.apache.org/docs/0.9.0/quick-start.html
我正在尝试运行非常基本的 Spark+Python pyspark 教程——请参阅http://spark.apache.org/docs/0.9.0/quick-start.html
When I attempt to initialize a new SparkContext,
当我尝试初始化一个新的 SparkContext 时,
from pyspark import SparkContext
sc = SparkContext("local[4]", "test")
I get the following error:
我收到以下错误:
ValueError: Cannot run multiple SparkContexts at once
I'm wondering if my previous attempts at running example code loaded something into memory that didn't clear out. Is there a way to list current SparkContexts already in memory and/or clear them out so the sample code will run?
我想知道我之前运行示例代码的尝试是否将一些未清除的内容加载到内存中。有没有办法列出内存中已经存在的当前 SparkContexts 和/或清除它们以便示例代码运行?
采纳答案by Glenn Strycker
Turns out that running ./bin/pyspark interactively AUTOMATICALLY LOADS A SPARKCONTEXT. Here is what I see when I start pyspark:
事实证明,以交互方式运行 ./bin/pyspark 会自动加载 SPARKCONTEXT。这是我在启动 pyspark 时看到的:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 0.9.1
      /_/
Using Python version 2.6.6 (r266:84292, Feb 22 2013 00:00:18)
Spark context available as sc.
...so you can either run "del sc" at the beginning or else go ahead and use "sc" as automatically defined.
...因此您可以在开头运行“del sc”,或者继续使用自动定义的“sc”。
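For example, a quick sanity check of the second option, using the pre-defined sc (a minimal sketch; the sample data is made up):
# Run inside the interactive pyspark shell, where `sc` is already defined.
# Counting a tiny RDD confirms the auto-created context is usable.
rdd = sc.parallelize([1, 2, 3, 4])
print rdd.count()   # expected output: 4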
The other problem with the example is that it appears to reference a path on the regular NFS filesystem, whereas it is actually looking in Hadoop's HDFS filesystem. I had to upload the README.md file from the $SPARK_HOME location to HDFS using "hadoop fs -put README.md README.md" before running the code.
该示例的另一个问题是它似乎查看的是常规 NFS 文件系统位置,而实际上它试图查看 Hadoop 的 HDFS 文件系统。在运行代码之前,我必须使用“hadoop fs -put README.md README.md”将 README.md 文件上传到 $SPARK_HOME 位置。
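(If you would rather not upload the file to HDFS at all, an explicit file:// URI can usually force a local read instead — a sketch, with a hypothetical path:)
# Hypothetical local path on the driver machine; bypasses the default HDFS lookup.
logData = sc.textFile("file:///path/to/spark/README.md").cache()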
Here is the modified example program that I ran interactively:
这是我交互式运行的修改后的示例程序:
from pyspark import SparkContext
logFile = "README.md"
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
and here is the modified version of the stand-alone python file:
这是独立 python 文件的修改版本:
"""SimpleApp.py"""
from pyspark import SparkContext
logFile = "README.md" # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
which I can now execute using $SPARK_HOME/bin/pyspark SimpleApp.py
我现在可以使用 $SPARK_HOME/bin/pyspark SimpleApp.py 执行
回答by Kun
Have you tried to use sc.stop() before you were trying to create another SparkContext?
在尝试创建另一个 SparkContext 之前,您是否尝试过使用 sc.stop()?
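A minimal sketch of that suggestion, run inside the interactive shell where `sc` already exists:
sc.stop()                                # shut down the automatically created context
from pyspark import SparkContext
sc = SparkContext("local[4]", "test")    # a new context can now be created without the ValueError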
回答by Sourabh Potnis
Instead of setting custom configurations on the SparkContext at the PySpark prompt, you can set them when starting PySpark.
您可以在启动 PySpark 时设置这些配置,而不是在 PySpark 提示时为 SparkContext 设置自定义配置。
e.g.
例如
pyspark --master yarn --queue my_spark_pool1 \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}"
This will apply these settings to the sc object in PySpark.
它会将这些 conf 应用到 PySpark 中的 sc 对象。
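To double-check that the flags took effect, something like the following can be run at the prompt (a sketch; _conf is an internal attribute, used here only for inspection):
print sc.master                                      # e.g. 'yarn'
print sc._conf.get("spark.driver.extraLibraryPath")  # should echo the value passed via --conf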
回答by Statham
This happens because when you type "pyspark" in the terminal, the system automatically initializes a SparkContext object, so you should stop it before creating a new one.
发生这种情况是因为当您在终端中键入“pyspark”时,系统会自动初始化一个 SparkContext 对象,因此您应该在创建新的 SparkContext 之前停止它。
You can use
您可以使用
sc.stop()
before you create your new SparkContext.
在创建新的 SparkContext 之前。
Also, you can use
此外,您可以使用
sc = SparkContext.getOrCreate()
instead of
代替
sc = SparkContext()
I am new to Spark and don't know much about the meaning of the parameters of SparkContext(), but both of the snippets shown above worked for me.
我是 Spark 的新手,不太了解 SparkContext() 函数参数的含义,但上面的两段代码都对我有用。
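Putting the getOrCreate variant together, a short sketch (assuming a pyspark version that provides SparkContext.getOrCreate — newer releases do; the 0.9 version used in the question does not):
from pyspark import SparkConf, SparkContext

# getOrCreate returns the already-running context if one exists (e.g. the shell's `sc`);
# otherwise it creates a new one. The conf is only applied when a new context is created.
conf = SparkConf().setAppName("Simple App").setMaster("local")
sc = SparkContext.getOrCreate(conf)
print sc.appName   # e.g. 'Simple App' if this call created the context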