How do I install pyspark for use in standalone scripts?

Disclaimer: this page is a Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): Stack Overflow.

Original question: http://stackoverflow.com/questions/25205264/
Asked by W.P. McNeill
I'm trying to use Spark with Python. I installed the Spark 1.0.2 for Hadoop 2 binary distribution from the downloads page. I can run through the quick start examples in Python interactive mode, but now I'd like to write a standalone Python script that uses Spark. The quick start documentation says to just import pyspark, but this doesn't work because it's not on my PYTHONPATH.
I can run bin/pyspark and see that the module is installed beneath SPARK_DIR/python/pyspark. I can manually add this to my PYTHONPATH environment variable, but I'd like to know the preferred automated method.
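As a stopgap before a proper install, the PYTHONPATH trick can also be done from inside the script itself. The following is only a sketch, assuming SPARK_HOME points at the Spark install and that py4j ships as a zip under python/lib (the exact zip name varies by distribution):

```python
import glob
import os
import sys

def add_spark_to_path(spark_home):
    """Prepend Spark's Python libraries to sys.path.

    spark_home: path to the Spark install (e.g. /opt/spark).
    Returns the list of paths that were added.
    """
    added = [os.path.join(spark_home, "python")]
    # py4j ships inside Spark as a zip; the version in the name varies
    added += glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip"))
    for p in reversed(added):
        if p not in sys.path:
            sys.path.insert(0, p)
    return added

# usage, before "import pyspark":
# add_spark_to_path(os.environ["SPARK_HOME"])
```

This keeps the script runnable with a plain python interpreter on a machine where Spark is installed but not on the PYTHONPATH.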
What is the best way to add pyspark support for standalone scripts? I don't see a setup.py anywhere under the Spark install directory. How would I create a pip package for a Python script that depends on Spark?
Accepted answer by mdurant
You can set the PYTHONPATH manually as you suggest, and this may be useful to you when testing stand-alone non-interactive scripts on a local installation.
However, (py)spark is all about distributing your jobs to nodes on clusters. Each cluster has a configuration defining a manager and many parameters; the details of setting this up are here, and include a simple local cluster (this may be useful for testing functionality).
In production, you will submit tasks to Spark via spark-submit, which distributes your code to the cluster nodes and establishes the context in which it runs on those nodes. You do, however, need to make sure that the Python installations on the nodes have all the required dependencies (the recommended way), or that the dependencies are shipped along with your code (I don't know how that works).
Answered by prabeesh
From Spark 2.2.0 onwards, use pip install pyspark to install pyspark on your machine.
For older versions, refer to the following steps. Add the PySpark lib to the Python path in your bashrc:
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
Also don't forget to set SPARK_HOME. PySpark depends on the py4j Python package, so install it as follows:
pip install py4j
For more details about standalone PySpark applications, refer to this post.
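Once pyspark is importable, the core of a standalone script can look like the sketch below. This is a minimal word-count example of my own, not from the linked post; the names and the task are illustrative:

```python
# wordcount.py -- core logic for a minimal standalone PySpark script
# (illustrative; the helper names and the word-count task are assumptions)

def split_words(line):
    """Split one line of text into words."""
    return line.split()

def add(a, b):
    """Combiner passed to reduceByKey."""
    return a + b

def count_words(sc, lines):
    """Return {word: count} computed with Spark transformations."""
    return (sc.parallelize(lines)
              .flatMap(split_words)
              .map(lambda w: (w, 1))
              .reduceByKey(add)
              .collectAsMap())

# driver section, run via: spark-submit wordcount.py
# (shown as comments so the helpers stay importable without Spark):
#
#   from pyspark import SparkContext
#   sc = SparkContext("local", "wordcount")
#   print(count_words(sc, ["spark is fast", "spark is spark"]))
#   sc.stop()
```

With pyspark on the PYTHONPATH (or installed via pip), the commented driver lines can be uncommented and the script run either with spark-submit or with a plain python interpreter.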
Answered by ssoto
I installed pyspark for standalone use by following a guide. The steps are:
export SPARK_HOME="/opt/spark"
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
Then you need to install py4j:
pip install py4j
To try it:
./bin/spark-submit --master local[8] <python_file.py>
Answered by waku
Don't write export $SPARK_HOME; write export SPARK_HOME (no dollar sign when setting the variable).
Answered by Kamil Sindi
As of Spark 2.2, PySpark is now available in PyPI. Thanks @Evan_Zamir.
pip install pyspark
As of Spark 2.1, you just need to download Spark and run setup.py:
cd my-spark-2.1-directory/python/
python setup.py install # or pip install -e .
There is also a ticket for adding it to PyPI.

