Python ImportError: No module named numpy on Spark workers
Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms, link to the original, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/35214231/
ImportError: No module named numpy on spark workers
Asked by ajkl
Launching pyspark in client mode:
bin/pyspark --master yarn-client --num-executors 60
The import of numpy in the shell goes fine, but it fails in KMeans. My feeling is that somehow the executors do not have numpy installed. I couldn't find any good solution anywhere for making the workers aware of numpy. I tried setting PYSPARK_PYTHON, but that didn't work either.
import numpy  # succeeds on the driver
features = numpy.load(open("combined_features.npz"))
features = features['arr_0']
features.shape
features_rdd = sc.parallelize(features, 5000)
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt
# fails on the executors with the ImportError below
clusters = KMeans.train(features_rdd, 2, maxIterations=10, runs=10, initializationMode="random")
Stack trace
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/mllib/__init__.py", line 25, in <module>
ImportError: No module named numpy
at org.apache.spark.api.python.PythonRunner$$anon.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Accepted answer by dayman
To use Spark in Yarn client mode, you'll need to install any dependencies to the machines on which Yarn starts the executors. That's the only surefire way to make this work.
Using Spark with Yarn cluster mode is a different story. You can distribute python dependencies with spark-submit.
spark-submit --master yarn-cluster --py-files my_dependency.zip my_script.py
However, the situation with numpy is complicated by the same thing that makes it so fast: the fact that it does the heavy lifting in C. Because of the way that it is installed, you won't be able to distribute numpy in this fashion.
Answered by Somum
I had a similar issue, but I don't think you need to set PYSPARK_PYTHON; instead, just install numpy on the worker machines (via apt-get or yum). The error will also tell you on which machine the import was missing.
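As an editor's aside (not part of the original answer), a quick way to see which machines still need numpy is to run a small probe from the pyspark shell. This is only a minimal sketch: it assumes sc is the shell's SparkContext, and has_numpy is just a throwaway helper name.

import socket

def has_numpy(_):
    # Try the import inside a task, i.e. on whichever executor runs it.
    try:
        import numpy
        status = numpy.__version__
    except ImportError:
        status = "numpy missing"
    return [(socket.gethostname(), status)]

# Spread enough tasks to reach every executor, then de-duplicate per host.
num_tasks = sc.defaultParallelism * 4
for host, status in sc.parallelize(range(num_tasks), num_tasks).mapPartitions(has_numpy).distinct().collect():
    print("%s\t%s" % (host, status))

Each output line is a hostname together with the numpy version found there, or a note that it is missing.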
Answered by sincosmos
numpy is not installed on the worker (virtual) machines. If you use Anaconda, it is very convenient to upload such Python dependencies when deploying the application in cluster mode (so there is no need to install numpy or other modules on each machine; instead, they must be in your Anaconda distribution). First, zip your Anaconda installation and put the zip file on the cluster (for example on HDFS), and then you can submit a job using the following script.
spark-submit \
--master yarn \
--deploy-mode cluster \
--archives hdfs://host/path/to/anaconda.zip#python-env \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python-env/anaconda/bin/python \
app_main.py
Yarn will copy anaconda.zip from the HDFS path to each worker and use python-env/anaconda/bin/python to execute the tasks.
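As a hedged addition (not from the original answer), a tiny check like the one below, dropped into app_main.py, can confirm the executors really picked up the shipped interpreter; the expected python-env/anaconda/bin/python path is only an assumption based on the --archives alias above.

from pyspark import SparkContext

sc = SparkContext()
# Each task reports which Python binary the executor-side worker runs under; with the
# archive approach above you would expect a path containing python-env/anaconda/bin/python
# rather than the system /usr/bin/python.
executor_python = sc.parallelize([0], 1).map(lambda _: __import__("sys").executable).first()
print("executor python: %s" % executor_python)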
Refer to Running PySpark with Virtualenv for more information.
Answered by shashank rai
I had the same issue. Try installing numpy with pip3 if you're using Python 3:
pip3 install numpy
Answered by Mehdi LAMRANI
You have to be aware that you need to have numpy installed on each and every worker, and even on the master itself (depending on your component placement).
Also make sure to run the pip install numpy command from a root account (sudo does not suffice) after forcing umask to 022 (umask 022) so that it cascades the rights to the Spark (or Zeppelin) user.
Answered by Gal Bracha
What solved it for me (on Mac) was actually this guide, which also explains how to run Python through Jupyter Notebooks:
https://medium.com/@yajieli/installing-spark-pyspark-on-mac-and-fix-of-some-common-errors-355a9050f735
In a nutshell (assuming you installed Spark with brew install apache-spark):
- Find the SPARK_PATH using brew info apache-spark
- Add these lines to your ~/.bash_profile
# Spark and Python
######
export SPARK_PATH=/usr/local/Cellar/apache-spark/2.4.1
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
#For python 3, You have to add the line below or you will get an error
export PYSPARK_PYTHON=python3
alias snotebook='$SPARK_PATH/bin/pyspark --master local[2]'
######
- You should be able to open a Jupyter Notebook simply by calling: pyspark
And just remember that you don't need to set the SparkContext; instead, simply call:
sc = SparkContext.getOrCreate()
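As a final sanity check of my own (not from the answer): once the notebook picks up the right interpreter, numpy should import on the executors as well, so a trivial job like this sketch should run without the ImportError.

from pyspark import SparkContext
import numpy

sc = SparkContext.getOrCreate()
# The lambda imports numpy inside each task, i.e. on the executors;
# if everything is wired up correctly this prints 285.0 instead of raising ImportError.
print(sc.parallelize(range(10)).map(lambda x: float(numpy.square(x))).sum())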