Python: Configuring Spark to work with Jupyter Notebook and Anaconda

Disclaimer: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/47824131/

Configuring Spark to work with Jupyter Notebook and Anaconda

Tags: python, pyspark, anaconda, jupyter-notebook, jupyter

Asked by puifais

I've spent a few days now trying to make Spark work with my Jupyter Notebook and Anaconda. Here's what my .bash_profile looks like:

PATH="/my/path/to/anaconda3/bin:$PATH"

export JAVA_HOME="/my/path/to/jdk"
export PYTHON_PATH="/my/path/to/anaconda3/bin/python"
export PYSPARK_PYTHON="/my/path/to/anaconda3/bin/python"

export PATH=$PATH:/my/path/to/spark-2.1.0-bin-hadoop2.7/bin
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
export SPARK_HOME=/my/path/to/spark-2.1.0-bin-hadoop2.7
alias pyspark="pyspark --conf spark.local.dir=/home/puifais --num-executors 30 --driver-memory 128g --executor-memory 6g --packages com.databricks:spark-csv_2.11:1.5.0"

When I type /my/path/to/spark-2.1.0-bin-hadoop2.7/bin/spark-shell, I can launch Spark just fine in my command line shell. And the output sc is not empty. It seems to work fine.

When I type pyspark, it launches my Jupyter Notebook fine. When I create a new Python3 notebook, this error appears:

[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py: 

And sc in my Jupyter Notebook is empty.

Can anyone help solve this situation?




Just want to clarify: there is nothing after the colon at the end of the error. I also tried to create my own start-up file following this post, and I quote it here so you don't have to go look there:

I created a short initialization script init_spark.py as follows:

from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("yarn-client")
sc = SparkContext(conf = conf)

and placed it in the ~/.ipython/profile_default/startup/ directory

When I did this, the error then became:

[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py:
[IPKernelApp] WARNING | Unknown error in handling startup files:

Accepted answer by Alain Domissy

Conda can help correctly manage a lot of dependencies...

Install Spark. Assuming Spark is installed in /opt/spark, include this in your ~/.bashrc:

export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH

Create a conda environment with all needed dependencies apart from spark:

conda create -n findspark-jupyter-openjdk8-py3 -c conda-forge python=3.5 jupyter=1.0 notebook=5.0 openjdk=8.0.144 findspark=1.1.0

Activate the environment

$ source activate findspark-jupyter-openjdk8-py3

Launch a Jupyter Notebook server:

$ jupyter notebook

In your browser, create a new Python3 notebook

Try calculating PI with the following script (borrowed from this)

import findspark
findspark.init()
import pyspark
import random
sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000
def inside(p):     
  x, y = random.random(), random.random()
  return x*x + y*y < 1
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()

Answer by desertnaut

Well, it really gives me pain to see how crappy hacks, like setting PYSPARK_DRIVER_PYTHON=jupyter, have been promoted to "solutions" and tend now to become standard practices, despite the fact that they evidently lead to ugly outcomes, like typing pyspark and ending up with a Jupyter notebook instead of a PySpark shell, plus yet-unseen problems lurking downstream, such as when you try to use spark-submit with the above settings... :(

(Don't get me wrong, it is not your fault and I am not blaming you; I have seen dozens of posts here at SO where this "solution" has been proposed, accepted, and upvoted...).

There is one and only one proper way to customize a Jupyter notebook in order to work with other languages (PySpark here), and this is the use of Jupyter kernels.

The first thing to do is run a jupyter kernelspec list command, to get the list of any already available kernels in your machine; here is the result in my case (Ubuntu):

$ jupyter kernelspec list
Available kernels:
  python2       /usr/lib/python2.7/site-packages/ipykernel/resources
  caffe         /usr/local/share/jupyter/kernels/caffe
  ir            /usr/local/share/jupyter/kernels/ir
  pyspark       /usr/local/share/jupyter/kernels/pyspark
  pyspark2      /usr/local/share/jupyter/kernels/pyspark2
  tensorflow    /usr/local/share/jupyter/kernels/tensorflow

The first kernel, python2, is the "default" one coming with IPython (there is a great chance of this being the only one present in your system); as for the rest, I have 2 more Python kernels (caffe & tensorflow), an R one (ir), and two PySpark kernels for use with Spark 1.6 and Spark 2.0 respectively.

The entries of the list above are directories, and each one contains one single file, named kernel.json. Let's see the contents of this file for my pyspark2 kernel:

{
 "display_name": "PySpark (Spark 2.0)",
 "language": "python",
 "argv": [
  "/opt/intel/intelpython27/bin/python2",
  "-m",
  "ipykernel",
  "-f",
  "{connection_file}"
 ],
 "env": {
  "SPARK_HOME": "/home/ctsats/spark-2.0.0-bin-hadoop2.6",
  "PYTHONPATH": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python:/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/lib/py4j-0.10.1-src.zip",
  "PYTHONSTARTUP": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/pyspark/shell.py",
  "PYSPARK_PYTHON": "/opt/intel/intelpython27/bin/python2"
 }
}

I have not bothered to change my details to /my/path/to etc., and you can already see that there are some differences between our cases (I use Intel Python 2.7, and not Anaconda Python 3), but hopefully you get the idea (BTW, don't worry about the connection_file - I don't use one either).

Now, the easiest way for you would be to manually make the necessary changes (paths only) to my kernel shown above and save it in a new subfolder of the .../jupyter/kernels directory (that way, it should be visible if you run a jupyter kernelspec list command again). And if you think this approach is also a hack, well, I would agree with you, but it is the one recommended in the Jupyter documentation (page 12):

However, there isn't a great way to modify the kernelspecs. One approach uses jupyter kernelspec list to find the kernel.json file and then modifies it, e.g. kernels/python3/kernel.json, by hand.

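Concretely, here is a minimal sketch of doing that by hand for the Anaconda 3 / Spark 2.1.0 setup from the question. The paths are the question's own placeholders, the kernel name is arbitrary, and the py4j zip name is an assumption; check what actually sits under $SPARK_HOME/python/lib before using this:

# Sketch only: paths are the question's placeholders; adjust them and the py4j version to your install
KERNEL_DIR=~/.local/share/jupyter/kernels/pyspark-anaconda   # per-user kernels directory on Linux
mkdir -p "$KERNEL_DIR"
cat > "$KERNEL_DIR/kernel.json" <<'EOF'
{
 "display_name": "PySpark (Spark 2.1.0, Anaconda)",
 "language": "python",
 "argv": [
  "/my/path/to/anaconda3/bin/python",
  "-m",
  "ipykernel",
  "-f",
  "{connection_file}"
 ],
 "env": {
  "SPARK_HOME": "/my/path/to/spark-2.1.0-bin-hadoop2.7",
  "PYTHONPATH": "/my/path/to/spark-2.1.0-bin-hadoop2.7/python:/my/path/to/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip",
  "PYTHONSTARTUP": "/my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py",
  "PYSPARK_PYTHON": "/my/path/to/anaconda3/bin/python"
 }
}
EOF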

If you don't already have a .../jupyter/kernels folder, you can still install a new kernel using jupyter kernelspec install - I haven't tried it, but have a look at this SO answer.

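If you go that route, the invocation is along these lines (just a sketch: the source directory and the --name value are arbitrary examples):

jupyter kernelspec install /path/to/my-pyspark-kernel --user --name pyspark-anaconda
jupyter kernelspec list    # the new kernel should now show up here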

Finally, don't forget to remove all the PySpark-related environment variables from your bash profile (leaving only SPARK_HOME should be OK). And confirm that, when you type pyspark, you find yourself with a PySpark shell, as it should be, and not with a Jupyter notebook...

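Once the kernel is in place, sc should already exist inside a notebook started with it, because PYTHONSTARTUP runs pyspark/shell.py; a quick sanity check (just a sketch) could be:

# sc is created by pyspark/shell.py via PYTHONSTARTUP, so no import is needed here
print(sc.version)                        # e.g. '2.1.0'
print(sc.parallelize(range(100)).sum())  # expect 4950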

UPDATE (after comment): If you want to pass command-line arguments to PySpark, you should add the PYSPARK_SUBMIT_ARGS setting under env; for example, here is the last line of my respective kernel file for Spark 1.6.0, where we still had to use the external spark-csv package for reading CSV files:

"PYSPARK_SUBMIT_ARGS": "--master local --packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"

Answer by matanster

After fiddling here a little, I just conda-installed sparkmagic (after re-installing a newer version of Spark). I think that alone simply works.

I am not sure, as I had fiddled a little before that, but I am posting this as a tentative answer, since it is much simpler than fiddling with configuration files by hand.

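For anyone wanting to try the same route, the install step was presumably something along these lines (a sketch only: the conda-forge channel is an assumption, and depending on the version you may still need to register sparkmagic's kernels, which the answer does not cover):

# Sketch only; channel and any follow-up kernel registration are assumptions, not taken from the answer
conda install -c conda-forge sparkmagic
jupyter notebook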