How to load IPython shell with PySpark

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original source, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/31862293/
Asked by pg2455
I want to load the IPython shell (not the IPython notebook) so that I can use PySpark from the command line. Is that possible? I have installed Spark 1.4.1.
Accepted answer by zero323
If you use Spark < 1.2, you can simply execute bin/pyspark with the environment variable IPYTHON=1:
IPYTHON=1 /path/to/bin/pyspark
or
export IPYTHON=1
/path/to/bin/pyspark
While the above still works with Spark 1.2 and later, the recommended way to set the Python environment for these versions is PYSPARK_DRIVER_PYTHON:
PYSPARK_DRIVER_PYTHON=ipython /path/to/bin/pyspark
or
export PYSPARK_DRIVER_PYTHON=ipython
/path/to/bin/pyspark
You can replace ipython with the path to the interpreter of your choice.
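For example, a minimal sketch pointing the driver at a specific interpreter binary (the path below is hypothetical; adjust it to wherever IPython lives on your system):

export PYSPARK_DRIVER_PYTHON=/usr/local/bin/ipython3
/path/to/bin/pyspark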
Answered by pg2455
Here is what worked for me:
# if you run IPython for Python 2.7 via the ipython2 command,
# whatever command you use to launch the IPython shell goes after the '=' sign
export PYSPARK_DRIVER_PYTHON=ipython2
and then from the SPARK_HOME directory:
./bin/pyspark
Answered by Yang Bryan
According to the official GitHub repository, IPYTHON=1 is not available in Spark 2.0+. Please use PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON instead.
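For example, a minimal sketch of the Spark 2.0+ equivalent of the earlier IPYTHON=1 invocation (the paths are placeholders):

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=ipython
/path/to/bin/pyspark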
Answered by Jomonsugi
What I found helpful is to write bash scripts that load Spark in a specific way. Doing this gives you an easy way to start Spark in different environments (for example, IPython or a Jupyter notebook).
To do this, open a blank script (using whatever text editor you prefer), for example one called ipython_spark.sh.
For this example I will provide the script I use to open Spark with the IPython interpreter:
#!/bin/bash
export PYSPARK_DRIVER_PYTHON=ipython
${SPARK_HOME}/bin/pyspark \
--master local[4] \
--executor-memory 1G \
--driver-memory 1G \
--conf spark.sql.warehouse.dir="file:///tmp/spark-warehouse" \
--packages com.databricks:spark-csv_2.11:1.5.0,com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.3
Note that I have SPARK_HOME defined in my bash_profile, but you could just insert the whole path to wherever pyspark is located on your computer.
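For reference, a minimal sketch of defining SPARK_HOME in your bash_profile (the install location is a placeholder):

export SPARK_HOME=/path/to/spark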
I like to put all scripts like this in one place, so I put this file in a folder called "scripts".
Now for this example you need to go to your bash_profile and enter the following lines:
export PATH=$PATH:/Users/<username>/scripts
alias ispark="bash /Users/<username>/scripts/ipython_spark.sh"
These paths will be specific to where you put ipython_spark.sh. You might then need to update permissions:
$ chmod 711 ipython_spark.sh
and source your bash_profile:
$ source ~/.bash_profile
I'm on a Mac, but this should all work on Linux as well, although you will most likely be updating .bashrc instead of bash_profile.
What I like about this method is that you can write multiple scripts with different configurations and open Spark accordingly. Depending on whether you are setting up a cluster, need to load different packages, or want to change the number of cores Spark has at its disposal, you can either update this script or make new ones. As noted by @zero323 above, PYSPARK_DRIVER_PYTHON= is the correct syntax for Spark > 1.2; I am using Spark 2.2.
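As an illustration of a second configuration, here is a minimal sketch of a companion script for launching Spark with a Jupyter notebook as the driver front-end instead of IPython (it assumes jupyter is installed and on your PATH; the resource settings are placeholders):

#!/bin/bash
# jupyter_spark.sh - open pyspark with the Jupyter notebook as the driver
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
${SPARK_HOME}/bin/pyspark \
--master local[4] \
--driver-memory 1G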
Answered by NYCeyes
This answer is an adapted and shortened version of a similar post on my website: https://jupyter.ai/pyspark-session/
I use ptpython(1), which supplies ipython functionality as well as your choice of either vi(1) or emacs(1) key-bindings. It also supplies dynamic code pop-ups/intelligence, which is extremely useful when performing ad-hoc Spark work on the CLI, or simply trying to learn the Spark API.
Here is what my vi-enabled ptpython session looks like; note the VI (INSERT) mode at the bottom of the screenshot, as well as the ipython-style prompt indicating that those ptpython capabilities have been selected (more on how to select them in a moment):
To get all of this, perform the following simple steps:
user@linux$ pip3 install ptpython # Everything here assumes Python3
user@linux$ vi ${SPARK_HOME}/conf/spark-env.sh
# Comment-out/disable the following two lines. This is necessary because
# they take precedence over any UNIX environment settings for them:
# PYSPARK_PYTHON=/path/to/python
# PYSPARK_DRIVER_PYTHON=/path/to/python
user@linux$ vi ${HOME}/.profile # Or whatever your login RC-file is.
# Add these two lines:
export PYSPARK_PYTHON=python3 # Fully-Qualify this if necessary. (python3)
export PYSPARK_DRIVER_PYTHON=ptpython3 # Fully-Qualify this if necessary. (ptpython3)
user@linux$ . ${HOME}/.profile # Source the RC file.
user@linux$ pyspark
# You are now running pyspark(1) within ptpython; a code pop-up/interactive
# shell; with your choice of vi(1) or emacs(1) key-bindings; and
# your choice of ipython functionality or not.
To select your ptpython preferences (and there are a bunch of them), simply press F2 from within a ptpython session and select whatever options you want.
CLOSING NOTE: If you are submitting a Python Spark application (as opposed to interacting with pyspark(1) via the CLI, as shown above), simply set PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON programmatically in Python, like so:
import os

os.environ['PYSPARK_PYTHON'] = 'python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = 'python3'  # Not 'ptpython3' in this case.
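A hypothetical submission command for such an application would then look like this (the script name is a placeholder):

${SPARK_HOME}/bin/spark-submit my_app.py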
I hope this answer and setup is useful.
Answered by shengshan zhang
If your Spark version is >= 2.0, the following configuration can be added to .bashrc:
export PYSPARK_PYTHON=/data/venv/your_env/bin/python
export PYSPARK_DRIVER_PYTHON=/data/venv/your_env/bin/ipython
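After adding these lines (the virtualenv paths above are of course specific to your machine), a minimal way to pick up the change, assuming pyspark is on your PATH, is:

source ~/.bashrc
pyspark   # should now start the driver in the IPython from your_env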
Answered by stasdeep
None of the mentioned answers worked for me. I always got the error:
.../pyspark/bin/load-spark-env.sh: No such file or directory
What I did was launch ipython and create the Spark session manually:
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("example-spark")\
    .config("spark.sql.crossJoin.enabled", "true")\
    .getOrCreate()
To avoid doing this every time, I moved the code to ~/.ispark.py and created the following alias (add this to ~/.bashrc):
alias ipyspark="ipython -i ~/.ispark.py"
After that, you can launch PySpark with IPython by typing:
ipyspark