How to link PyCharm with PySpark?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same terms and attribute it to the original authors (not me): StackOverflow.

Original question: http://stackoverflow.com/questions/34685905/
Asked by tumbleweed
I'm new to Apache Spark, and I installed apache-spark with Homebrew on my MacBook:
Last login: Fri Jan 8 12:52:04 on console
user@MacBook-Pro-de-User-2:~$ pyspark
Python 2.7.10 (default, Jul 13 2015, 12:05:58)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/01/08 14:46:44 INFO SparkContext: Running Spark version 1.5.1
16/01/08 14:46:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/08 14:46:47 INFO SecurityManager: Changing view acls to: user
16/01/08 14:46:47 INFO SecurityManager: Changing modify acls to: user
16/01/08 14:46:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); users with modify permissions: Set(user)
16/01/08 14:46:50 INFO Slf4jLogger: Slf4jLogger started
16/01/08 14:46:50 INFO Remoting: Starting remoting
16/01/08 14:46:51 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:50199]
16/01/08 14:46:51 INFO Utils: Successfully started service 'sparkDriver' on port 50199.
16/01/08 14:46:51 INFO SparkEnv: Registering MapOutputTracker
16/01/08 14:46:51 INFO SparkEnv: Registering BlockManagerMaster
16/01/08 14:46:51 INFO DiskBlockManager: Created local directory at /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/blockmgr-769e6f91-f0e7-49f9-b45d-1b6382637c95
16/01/08 14:46:51 INFO MemoryStore: MemoryStore started with capacity 530.0 MB
16/01/08 14:46:52 INFO HttpFileServer: HTTP File server directory is /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/spark-8e4749ea-9ae7-4137-a0e1-52e410a8e4c5/httpd-1adcd424-c8e9-4e54-a45a-a735ade00393
16/01/08 14:46:52 INFO HttpServer: Starting HTTP Server
16/01/08 14:46:52 INFO Utils: Successfully started service 'HTTP file server' on port 50200.
16/01/08 14:46:52 INFO SparkEnv: Registering OutputCommitCoordinator
16/01/08 14:46:52 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/01/08 14:46:52 INFO SparkUI: Started SparkUI at http://192.168.1.64:4040
16/01/08 14:46:53 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/01/08 14:46:53 INFO Executor: Starting executor ID driver on host localhost
16/01/08 14:46:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50201.
16/01/08 14:46:53 INFO NettyBlockTransferService: Server created on 50201
16/01/08 14:46:53 INFO BlockManagerMaster: Trying to register BlockManager
16/01/08 14:46:53 INFO BlockManagerMasterEndpoint: Registering block manager localhost:50201 with 530.0 MB RAM, BlockManagerId(driver, localhost, 50201)
16/01/08 14:46:53 INFO BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/
Using Python version 2.7.10 (default, Jul 13 2015 12:05:58)
SparkContext available as sc, HiveContext available as sqlContext.
>>>
I would like to start playing with it in order to learn more about MLlib. However, I use PyCharm to write scripts in Python. The problem is: when I go to PyCharm and try to call pyspark, PyCharm cannot find the module. I tried adding the path to PyCharm as follows:
Then, following a blog, I tried this:
import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME'] = "/Users/user/Apps/spark-1.5.2-bin-hadoop2.4"

# Append pyspark to Python Path
sys.path.append("/Users/user/Apps/spark-1.5.2-bin-hadoop2.4/python/pyspark")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)
    sys.exit(1)
And I still cannot start using PySpark with PyCharm. Any idea of how to "link" PyCharm with apache-pyspark?
Update:
Then I searched for the apache-spark and python paths in order to set PyCharm's environment variables:
apache-spark path:
user@MacBook-Pro-User-2:~$ brew info apache-spark
apache-spark: stable 1.6.0, HEAD
Engine for large-scale data processing
https://spark.apache.org/
/usr/local/Cellar/apache-spark/1.5.1 (649 files, 302.9M) *
Poured from bottle
From: https://github.com/Homebrew/homebrew/blob/master/Library/Formula/apache-spark.rb
python path:
user@MacBook-Pro-User-2:~$ brew info python
python: stable 2.7.11 (bottled), HEAD
Interpreted, interactive, object-oriented programming language
https://www.python.org
/usr/local/Cellar/python/2.7.10_2 (4,965 files, 66.9M) *
Then, with the above information, I tried to set the environment variables as follows:
Any idea of how to correctly link PyCharm with pyspark?
Then, when I run a Python script with the above configuration, I get this exception:
/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/user/PycharmProjects/spark_examples/test_1.py
Traceback (most recent call last):
File "/Users/user/PycharmProjects/spark_examples/test_1.py", line 1, in <module>
from pyspark import SparkContext
ImportError: No module named pyspark
UPDATE: Then I tried the configurations proposed by @zero323
Configuration 1:
/usr/local/Cellar/apache-spark/1.5.1/
out:
user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1$ ls
CHANGES.txt NOTICE libexec/
INSTALL_RECEIPT.json README.md
LICENSE bin/
Configuration 2:
/usr/local/Cellar/apache-spark/1.5.1/libexec
out:
user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1/libexec$ ls
R/ bin/ data/ examples/ python/
RELEASE conf/ ec2/ lib/ sbin/
Accepted answer by zero323
With PySpark package (Spark 2.2.0 and later)
With SPARK-1267 being merged, you should be able to simplify the process by pip-installing Spark in the environment you use for PyCharm development.
- Go to File -> Settings -> Project Interpreter
- Click on the install button and search for PySpark
- Click on the install package button.
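As a quick check (my own sketch, not part of the original answer), a script like the following should run directly with the project interpreter once the package is installed, assuming a Java runtime is available on the machine:

# Minimal smoke test for a pip-installed PySpark (Spark 2.2+); run it with
# the same project interpreter you configured above.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("pycharm-pyspark-test")
         .getOrCreate())

print(spark.range(10).count())  # should print 10
spark.stop()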
Manually, with a user-provided Spark installation
Create a Run configuration:
- Go to Run -> Edit configurations
- Add a new Python configuration
- Set Script path so it points to the script you want to execute
- Edit the Environment variables field so it contains at least:
  - SPARK_HOME - it should point to the directory with the Spark installation. It should contain directories such as bin (with spark-submit, spark-shell, etc.) and conf (with spark-defaults.conf, spark-env.sh, etc.)
  - PYTHONPATH - it should contain $SPARK_HOME/python and optionally $SPARK_HOME/python/lib/py4j-some-version.src.zip if not available otherwise. some-version should match the Py4J version used by a given Spark installation (0.8.2.1 - 1.5, 0.9 - 1.6, 0.10.3 - 2.0, 0.10.4 - 2.1, 0.10.4 - 2.2, 0.10.6 - 2.3, 0.10.7 - 2.4)
- Apply the settings
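Not part of the original answer, but a small sanity-check script such as the one below (a sketch of mine) can make it obvious whether the run configuration is actually being picked up; it only assumes the SPARK_HOME/PYTHONPATH layout described above:

# Sanity check for the run configuration above: verifies that SPARK_HOME
# points at a plausible Spark layout and that pyspark is importable via
# PYTHONPATH. Run it with the run configuration you just created.
import os
import os.path

spark_home = os.environ.get("SPARK_HOME", "")
print("SPARK_HOME = " + spark_home)
for sub in ("bin/spark-submit", "conf", "python"):
    status = "ok" if os.path.exists(os.path.join(spark_home, sub)) else "MISSING"
    print("  " + sub + ": " + status)

try:
    import pyspark
    print("pyspark imported from " + pyspark.__file__)
except ImportError as e:
    print("pyspark is still not importable: " + str(e))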
Add the PySpark library to the interpreter path (required for code completion):
- Go to File -> Settings -> Project Interpreter
- Open the settings for the interpreter you want to use with Spark
- Edit the interpreter paths so they contain the path to $SPARK_HOME/python (and Py4J if required)
- Save the settings
Optionally
- Install or add to the path type annotations matching the installed Spark version, to get better completion and static error detection (Disclaimer - I am an author of the project).
Finally
Use the newly created configuration to run your script.
Answer by grc
From the documentation:
To run Spark applications in Python, use the bin/spark-submit script located in the Spark directory. This script will load Spark's Java/Scala libraries and allow you to submit applications to a cluster. You can also use bin/pyspark to launch an interactive Python shell.
You are invoking your script directly with the CPython interpreter, which I think is causing problems.
Try running your script with:
"${SPARK_HOME}"/bin/spark-submit test_1.py
If that works, you should be able to get it working in PyCharm by setting the project's interpreter to spark-submit.
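For reference, a hedged sketch of what test_1.py could contain (my example, not from the original answer) is shown below; it only uses the RDD API already available in the question's Spark version:

# test_1.py - a minimal script to try with spark-submit; the file name matches
# the one used in the question.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("spark-submit-test")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).sum())  # should print 4950
sc.stop()

If "${SPARK_HOME}"/bin/spark-submit test_1.py prints 4950, the Spark side is fine and any remaining problem is in the PyCharm configuration.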
Answer by obug
I used the following page as a reference and was able to get pyspark/Spark 1.6.1 (installed via homebrew) imported in PyCharm 5.
http://renien.com/blog/accessing-pyspark-pycharm/
import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME'] = "/usr/local/Cellar/apache-spark/1.6.1"

# Append pyspark to Python Path
sys.path.append("/usr/local/Cellar/apache-spark/1.6.1/libexec/python")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)
    sys.exit(1)
With the above, pyspark loads, but I get a gateway error when I try to create a SparkContext. There's some issue with the Homebrew Spark, so I just grabbed Spark from the Spark website (download the pre-built for Hadoop 2.6 and later) and pointed to the spark and py4j directories under that. Here's the code in PyCharm that works!
import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME'] = "/Users/myUser/Downloads/spark-1.6.1-bin-hadoop2.6"

# Need to explicitly point to python3 if you are using Python 3.x
os.environ['PYSPARK_PYTHON'] = "/usr/local/Cellar/python3/3.5.1/bin/python3"

# You might need to enter your local IP
# os.environ['SPARK_LOCAL_IP'] = "192.168.2.138"

# Paths for pyspark and py4j
sys.path.append("/Users/myUser/Downloads/spark-1.6.1-bin-hadoop2.6/python")
sys.path.append("/Users/myUser/Downloads/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)
    sys.exit(1)

sc = SparkContext('local')
words = sc.parallelize(["scala", "java", "hadoop", "spark", "akka"])
print(words.count())
I had a lot of help from these instructions, which helped me troubleshoot in PyDev and then get it working in PyCharm - https://enahwe.wordpress.com/2015/11/25/how-to-configure-eclipse-for-developing-with-python-and-spark-on-hadoop/
I'm sure somebody has spent a few hours bashing their head against their monitor trying to get this working, so hopefully this helps save their sanity!
Answer by sthomps
Here's how I solved this on Mac OS X.
brew install apache-spark

Add this to ~/.bash_profile:

export SPARK_VERSION=`ls /usr/local/Cellar/apache-spark/ | sort | tail -1`
export SPARK_HOME="/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec"
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH

Add pyspark and py4j to the content root (use the correct Spark version):

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip
/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip
Answer by Jason Wolosonovich
Check out this video.
Assume your spark python directory is: /home/user/spark/python
Assume your Py4j source is: /home/user/spark/python/lib/py4j-0.9-src.zip
Basically, you add the Spark python directory, and the py4j directory within it, to the interpreter paths. I don't have enough reputation to post a screenshot, or I would.
In the video, the user creates a virtual environment within PyCharm itself. However, you can create the virtual environment outside of PyCharm, or activate a pre-existing one, then start PyCharm with it and add those paths to the virtual environment's interpreter paths from within PyCharm.
I used other methods to add Spark via the bash environment variables, which works great outside of PyCharm, but for some reason they weren't recognized within PyCharm; this method, however, worked perfectly.
Answer by thecheech
I followed the tutorials online and added the env variables to .bashrc:
# add pyspark to python
export SPARK_HOME=/home/lolo/spark-1.6.1
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
I then just got the values of SPARK_HOME and PYTHONPATH into PyCharm:
(srz-reco)lolo@K:~$ echo $SPARK_HOME
/home/lolo/spark-1.6.1
(srz-reco)lolo@K:~$ echo $PYTHONPATH
/home/lolo/spark-1.6.1/python/lib/py4j-0.9-src.zip:/home/lolo/spark-1.6.1/python/:/home/lolo/spark-1.6.1/python/lib/py4j-0.9-src.zip:/home/lolo/spark-1.6.1/python/:/python/lib/py4j-0.8.2.1-src.zip:/python/:
Then I copied them into the script's Run/Debug Configurations -> Environment variables.
Answer by tczhaodachuan
You need to set up PYTHONPATH and SPARK_HOME before you launch the IDE or Python.
On Windows, edit the environment variables and add the Spark python and py4j paths to PYTHONPATH:
PYTHONPATH=%PYTHONPATH%;{py4j};{spark python}
On Unix:
export PYTHONPATH=${PYTHONPATH}:{py4j}:{spark/python}
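One simple way to confirm that the IDE actually inherited these variables is to print them from a script run inside PyCharm; the snippet below is my sketch, not part of the answer:

# Confirm that PyCharm inherited the variables set before the IDE was launched.
import os

for name in ("SPARK_HOME", "PYTHONPATH"):
    print(name + " = " + os.environ.get(name, "<not set>"))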
Answer by Gaurav Khare
Configure pyspark in PyCharm (Windows)
File menu - Settings - Project Interpreter - (gear shape) - More - (tree below funnel) - (+) - [add the python folder from the Spark installation and then py4j-*.zip] - click OK
Ensure SPARK_HOME is set in the Windows environment; PyCharm will pick it up from there. To confirm:
Run menu - edit configurations - environment variables - [...] - show
Optionally set SPARK_CONF_DIR in environment variables.
Answer by Michael
Here is the setup that works for me (Win7 64-bit, PyCharm 2017.3 CE).
Set up Intellisense:
Click File -> Settings -> Project: -> Project Interpreter
Click the gear icon to the right of the Project Interpreter dropdown
Click More... from the context menu
Choose the interpreter, then click the "Show Paths" icon (bottom right)
Click the + icon to add the following paths:
\python\lib\py4j-0.9-src.zip
\bin\python\lib\pyspark.zip
Click OK, OK, OK
Go ahead and test your new intellisense capabilities.
Answer by H S Rathore
The easiest way is
Go to the site-packages folder of your Anaconda/Python installation and copy-paste the pyspark and pyspark.egg-info folders there.
Restart PyCharm to update the index. The two folders mentioned above are present in the spark/python folder of your Spark installation. This way you'll also get code-completion suggestions from PyCharm.
The site-packages folder can easily be found in your Python installation. In Anaconda it's under anaconda/lib/pythonx.x/site-packages.
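If you are not sure where site-packages lives for the interpreter PyCharm uses, a short sketch like the one below (my addition, with the caveat that getsitepackages() is missing in some older virtualenv setups) prints the candidates:

# Print the interpreter path and its candidate site-packages directories, so
# you know where to copy the pyspark and pyspark.egg-info folders.
import site
import sys

print(sys.executable)
for path in site.getsitepackages():
    print(path)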