Python 教程中的多个 SparkContexts 错误
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/23280629/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
multiple SparkContexts error in tutorial
提问by Glenn Strycker
I am attempting to run the very basic Spark+Python pyspark tutorial -- see http://spark.apache.org/docs/0.9.0/quick-start.html
我正在尝试运行非常基本的 Spark+Python pyspark 教程——请参阅http://spark.apache.org/docs/0.9.0/quick-start.html
When I attempt to initialize a new SparkContext,
当我尝试初始化一个新的 SparkContext 时,
from pyspark import SparkContext
sc = SparkContext("local[4]", "test")
I get the following error:
我收到以下错误:
ValueError: Cannot run multiple SparkContexts at once
I'm wondering if my previous attempts at running example code loaded something into memory that didn't clear out. Is there a way to list current SparkContexts already in memory and/or clear them out so the sample code will run?
我想知道我之前运行示例代码的尝试是否将一些未清除的内容加载到内存中。有没有办法列出内存中已经存在的当前 SparkContexts 和/或清除它们以便示例代码运行?
采纳答案by Glenn Strycker
Turns out that running ./bin/pyspark interactively AUTOMATICALLY LOADS A SPARKCONTEXT. Here is what I see when I start pyspark:
事实证明,以交互方式运行 ./bin/pyspark 会自动加载 SPARKCONTEXT。这是我在启动 pyspark 时看到的:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 0.9.1
      /_/
Using Python version 2.6.6 (r266:84292, Feb 22 2013 00:00:18)
Spark context available as sc.
...so you can either run "del sc" at the beginning or else go ahead and use "sc" as automatically defined.
...因此您可以在开头运行“del sc”,或者继续使用自动定义的“sc”。
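For example, a quick sanity check of the second option, using the pre-defined sc (a minimal sketch; the sample data is made up):
# Run inside the interactive pyspark shell, where `sc` is already defined.
# Counting a tiny RDD confirms the auto-created context is usable.
rdd = sc.parallelize([1, 2, 3, 4])
print rdd.count()   # expected output: 4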
The other problem with the example is that it appears to reference a path on the regular NFS filesystem, whereas it is actually looking in Hadoop's HDFS filesystem. I had to upload the README.md file from the $SPARK_HOME location to HDFS using "hadoop fs -put README.md README.md" before running the code.
该示例的另一个问题是它似乎查看的是常规 NFS 文件系统位置,而实际上它试图查看 Hadoop 的 HDFS 文件系统。在运行代码之前,我必须使用“hadoop fs -put README.md README.md”将 README.md 文件上传到 $SPARK_HOME 位置。
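(If you would rather not upload the file to HDFS at all, an explicit file:// URI can usually force a local read instead — a sketch, with a hypothetical path:)
# Hypothetical local path on the driver machine; bypasses the default HDFS lookup.
logData = sc.textFile("file:///path/to/spark/README.md").cache()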
Here is the modified example program that I ran interactively:
这是我交互式运行的修改后的示例程序:
from pyspark import SparkContext
logFile = "README.md"
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
and here is the modified version of the stand-alone python file:
这是独立 python 文件的修改版本:
"""SimpleApp.py"""
from pyspark import SparkContext
logFile = "README.md" # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
which I can now execute using $SPARK_HOME/bin/pyspark SimpleApp.py
我现在可以使用 $SPARK_HOME/bin/pyspark SimpleApp.py 执行
回答by Kun
Have you tried to use sc.stop() before you were trying to create another SparkContext?
在尝试创建另一个 SparkContext 之前,您是否尝试过使用 sc.stop()?
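A minimal sketch of that suggestion, run inside the interactive shell where `sc` already exists:
sc.stop()                                # shut down the automatically created context
from pyspark import SparkContext
sc = SparkContext("local[4]", "test")    # a new context can now be created without the ValueError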
回答by Sourabh Potnis
Instead of setting custom configurations on the SparkContext at the PySpark prompt, you can set them when starting PySpark.
您可以在启动 PySpark 时设置这些配置,而不是在 PySpark 提示时为 SparkContext 设置自定义配置。
e.g.
例如
pyspark --master yarn --queue my_spark_pool1 \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}"
This will apply these settings to the sc object in PySpark.
它会将这些 conf 应用到 PySpark 中的 sc 对象。
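To double-check that the flags took effect, something like the following can be run at the prompt (a sketch; _conf is an internal attribute, used here only for inspection):
print sc.master                                      # e.g. 'yarn'
print sc._conf.get("spark.driver.extraLibraryPath")  # should echo the value passed via --conf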
回答by Statham
This happens because when you type "pyspark" in the terminal, the system automatically initializes a SparkContext object, so you should stop it before creating a new one.
发生这种情况是因为当您在终端中键入“pyspark”时,系统会自动初始化一个 SparkContext 对象,因此您应该在创建新的 SparkContext 之前停止它。
You can use
您可以使用
sc.stop()
before you create your new SparkContext.
在创建新的 SparkContext 之前。
Also, you can use
此外,您可以使用
sc = SparkContext.getOrCreate()
instead of
代替
sc = SparkContext()
I am new to Spark and don't know much about the meaning of the parameters of SparkContext(), but both of the snippets shown above worked for me.
我是 Spark 的新手,不太了解 SparkContext() 函数参数的含义,但上面的两段代码都对我有用。
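Putting the getOrCreate variant together, a short sketch (assuming a pyspark version that provides SparkContext.getOrCreate — newer releases do; the 0.9 version used in the question does not):
from pyspark import SparkConf, SparkContext

# getOrCreate returns the already-running context if one exists (e.g. the shell's `sc`);
# otherwise it creates a new one. The conf is only applied when a new context is created.
conf = SparkConf().setAppName("Simple App").setMaster("local")
sc = SparkContext.getOrCreate(conf)
print sc.appName   # e.g. 'Simple App' if this call created the context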