Configuring Spark to work with Jupyter Notebook and Anaconda
Original source: http://stackoverflow.com/questions/47824131/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by puifais
I've spent a few days now trying to make Spark work with my Jupyter Notebook and Anaconda. Here's what my .bash_profile looks like:
PATH="/my/path/to/anaconda3/bin:$PATH"
export JAVA_HOME="/my/path/to/jdk"
export PYTHON_PATH="/my/path/to/anaconda3/bin/python"
export PYSPARK_PYTHON="/my/path/to/anaconda3/bin/python"
export PATH=$PATH:/my/path/to/spark-2.1.0-bin-hadoop2.7/bin
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
export SPARK_HOME=/my/path/to/spark-2.1.0-bin-hadoop2.7
alias pyspark="pyspark --conf spark.local.dir=/home/puifais --num-executors 30 --driver-memory 128g --executor-memory 6g --packages com.databricks:spark-csv_2.11:1.5.0"
When I type /my/path/to/spark-2.1.0-bin-hadoop2.7/bin/spark-shell, I can launch Spark just fine in my command line shell, and the output of sc is not empty. It seems to work fine.
When I type pyspark, it launches my Jupyter Notebook fine. When I create a new Python3 notebook, this error appears:
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py:
And sc in my Jupyter Notebook is empty.
Can anyone help solve this situation?
Just want to clarify: there is nothing after the colon at the end of the error. I also tried to create my own start-up file using this post, which I quote here so you don't have to go look there:
I created a short initialization script init_spark.py as follows:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("yarn-client")
sc = SparkContext(conf = conf)
and placed it in the ~/.ipython/profile_default/startup/ directory
When I did this, the error then became:
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py:
[IPKernelApp] WARNING | Unknown error in handling startup files:
Accepted answer by Alain Domissy
Conda can help correctly manage a lot of dependencies...
Install Spark. Assuming Spark is installed in /opt/spark, include this in your ~/.bashrc:
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
Create a conda environment with all needed dependencies apart from spark:
conda create -n findspark-jupyter-openjdk8-py3 -c conda-forge python=3.5 jupyter=1.0 notebook=5.0 openjdk=8.0.144 findspark=1.1.0
Activate the environment
$ source activate findspark-jupyter-openjdk8-py3
Launch a Jupyter Notebook server:
$ jupyter notebook
In your browser, create a new Python3 notebook
Try calculating PI with the following script (borrowed from this)
import findspark
findspark.init()
import pyspark
import random
sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
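If the notebook cell fails, it can help to separate "the Monte Carlo logic is wrong" from "Spark is not wired up". The following is a minimal sketch of the same estimate in plain Python, with no Spark involved; the function name estimate_pi and the seed parameter are illustrative choices, not part of the answer above.

```python
import random


def inside(rng):
    # Draw a random point in the unit square and test whether it
    # falls inside the quarter circle of radius 1.
    x, y = rng.random(), rng.random()
    return x * x + y * y < 1


def estimate_pi(num_samples, seed=None):
    # A dedicated Random instance keeps the run reproducible when a seed is given.
    rng = random.Random(seed)
    count = sum(1 for _ in range(num_samples) if inside(rng))
    return 4 * count / num_samples


if __name__ == "__main__":
    # Far fewer samples than the Spark version, since this runs on one core.
    print(estimate_pi(1000000))
```

If this prints a value close to 3.14 while the Spark version fails, the problem most likely lies in the Spark or findspark setup rather than in the script.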
Answer by desertnaut
Well, it really pains me to see how crappy hacks, like setting PYSPARK_DRIVER_PYTHON=jupyter, have been promoted to "solutions" and now tend to become standard practice, despite the fact that they evidently lead to ugly outcomes, like typing pyspark and ending up with a Jupyter notebook instead of a PySpark shell, plus yet-unseen problems lurking downstream, such as when you try to use spark-submit with the above settings... :(
(Don't get me wrong, it is not your fault and I am not blaming you; I have seen dozens of posts here at SO where this "solution" has been proposed, accepted, and upvoted...).
There is one and only one proper way to customize a Jupyter notebook in order to work with other languages (PySpark here), and this is the use of Jupyter kernels.
The first thing to do is run the jupyter kernelspec list command to get the list of kernels already available on your machine; here is the result in my case (Ubuntu):
$ jupyter kernelspec list
Available kernels:
python2 /usr/lib/python2.7/site-packages/ipykernel/resources
caffe /usr/local/share/jupyter/kernels/caffe
ir /usr/local/share/jupyter/kernels/ir
pyspark /usr/local/share/jupyter/kernels/pyspark
pyspark2 /usr/local/share/jupyter/kernels/pyspark2
tensorflow /usr/local/share/jupyter/kernels/tensorflow
The first kernel, python2, is the "default" one coming with IPython (there is a good chance this is the only one present on your system); as for the rest, I have 2 more Python kernels (caffe & tensorflow), an R one (ir), and two PySpark kernels for use with Spark 1.6 and Spark 2.0 respectively.
The entries of the list above are directories, and each one contains a single file named kernel.json. Let's see the contents of this file for my pyspark2 kernel:
{
  "display_name": "PySpark (Spark 2.0)",
  "language": "python",
  "argv": [
    "/opt/intel/intelpython27/bin/python2",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/home/ctsats/spark-2.0.0-bin-hadoop2.6",
    "PYTHONPATH": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python:/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/lib/py4j-0.10.1-src.zip",
    "PYTHONSTARTUP": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/pyspark/shell.py",
    "PYSPARK_PYTHON": "/opt/intel/intelpython27/bin/python2"
  }
}
I have not bothered to change my details to /my/path/to etc., and you can already see that there are some differences between our cases (I use Intel Python 2.7, and not Anaconda Python 3), but hopefully you get the idea (BTW, don't worry about the connection_file - I don't use one either).
Now, the easiest way for you would be to manually make the necessary changes (paths only) to the kernel shown above and save it in a new subfolder of the .../jupyter/kernels directory (that way, it should be visible if you run the jupyter kernelspec list command again). And if you think this approach is also a hack, well, I would agree with you, but it is the one recommended in the Jupyter documentation (page 12):
However, there isn't a great way to modify the kernelspecs. One approach uses jupyter kernelspec list to find the kernel.json file and then modifies it, e.g. kernels/python3/kernel.json, by hand.
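The manual edit described above can also be scripted. The following is a minimal sketch, assuming the same four env entries as the kernel.json shown earlier; the helper names build_pyspark_kernel_spec and write_kernel_spec are hypothetical, and the py4j zip is looked up under SPARK_HOME/python/lib because its version differs between Spark releases.

```python
import glob
import json
import os


def build_pyspark_kernel_spec(spark_home, python_exe, display_name="PySpark"):
    """Return a kernel.json-style dict modelled on the pyspark2 kernel above."""
    # Look up the bundled py4j source zip instead of hard-coding its version.
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    py4j_zips = sorted(glob.glob(pattern))
    python_path = [os.path.join(spark_home, "python")] + py4j_zips[:1]
    return {
        "display_name": display_name,
        "language": "python",
        "argv": [python_exe, "-m", "ipykernel", "-f", "{connection_file}"],
        "env": {
            "SPARK_HOME": spark_home,
            "PYTHONPATH": os.pathsep.join(python_path),
            "PYTHONSTARTUP": os.path.join(spark_home, "python", "pyspark", "shell.py"),
            "PYSPARK_PYTHON": python_exe,
        },
    }


def write_kernel_spec(spec, kernel_dir):
    """Write spec as kernel.json inside kernel_dir and return the file path."""
    os.makedirs(kernel_dir, exist_ok=True)
    path = os.path.join(kernel_dir, "kernel.json")
    with open(path, "w") as f:
        json.dump(spec, f, indent=2)
    return path
```

For example, write_kernel_spec(build_pyspark_kernel_spec("/my/path/to/spark-2.1.0-bin-hadoop2.7", "/my/path/to/anaconda3/bin/python"), kernel_dir) would produce a file for the asker's setup, where kernel_dir is a new subfolder of whichever .../jupyter/kernels directory your installation uses.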
If you don't already have a .../jupyter/kernels folder, you can still install a new kernel using jupyter kernelspec install - I haven't tried it, but have a look at this SO answer.
Finally, don't forget to remove all the PySpark-related environment variables from your bash profile (leaving only SPARK_HOME should be OK). And confirm that, when you type pyspark, you find yourself with a PySpark shell, as it should be, and not with a Jupyter notebook...
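One way to confirm the cleanup took effect is to check a fresh shell for leftovers. The following is a small sketch, assuming that "PySpark-related" means variables whose names start with PYSPARK_ (as with the ones in the question's .bash_profile); the function name leftover_pyspark_vars is hypothetical.

```python
import os


def leftover_pyspark_vars(environ=None):
    """Return the PYSPARK_* variables still set in the given environment."""
    # Default to the current process environment; a plain dict can be
    # passed instead, which makes the check easy to try out.
    environ = os.environ if environ is None else environ
    return {k: v for k, v in environ.items() if k.startswith("PYSPARK_")}


if __name__ == "__main__":
    leftovers = leftover_pyspark_vars()
    if leftovers:
        for name, value in sorted(leftovers.items()):
            print("still set: {}={}".format(name, value))
    else:
        print("no PYSPARK_* variables set")
```

Run it from a newly opened terminal, since variables exported by an old profile remain set in shells that were started before the edit.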
UPDATE (after comment): If you want to pass command-line arguments to PySpark, you should add the PYSPARK_SUBMIT_ARGS setting under env; for example, here is the last line of my respective kernel file for Spark 1.6.0, where we still had to use the external spark-csv package for reading CSV files:
"PYSPARK_SUBMIT_ARGS": "--master local --packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"
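If you generate kernel files programmatically, the value can be assembled rather than typed by hand. The following is a minimal sketch that reproduces the shape of the line above, including the trailing pyspark-shell token; the helper name build_submit_args is hypothetical, and only the --master and --packages options seen in this answer are covered.

```python
def build_submit_args(master=None, packages=()):
    """Build a PYSPARK_SUBMIT_ARGS string shaped like the example above."""
    parts = []
    if master:
        parts += ["--master", master]
    if packages:
        # --packages takes a comma-separated list of Maven coordinates.
        parts += ["--packages", ",".join(packages)]
    # Keep the trailing token used in the kernel file above.
    parts.append("pyspark-shell")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_submit_args("local", ["com.databricks:spark-csv_2.10:1.4.0"]))
```

The resulting string would go under the "env" key of kernel.json, next to SPARK_HOME and the other entries.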
Answer by matanster
After fiddling here a little, I just conda installed sparkmagic (after re-installing a newer version of Spark). I think that alone simply works.
I am not sure as I've fiddled a little before that, but I am placing this as a tentative answer as it is much simpler than fiddling configuration files by hand.