Python ImportError: No module named numpy on Spark workers

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/35214231/


ImportError: No module named numpy on spark workers

Tags: python, numpy, apache-spark, pyspark

Asked by ajkl

Launching pyspark in client mode: bin/pyspark --master yarn-client --num-executors 60. Importing numpy in the shell works fine, but it fails inside the KMeans job. My feeling is that the executors somehow do not have numpy installed. I didn't find any good solution anywhere for making the workers aware of numpy. I tried setting PYSPARK_PYTHON, but that didn't work either.


import numpy

# Load the precomputed feature matrix from the .npz archive on the driver
features = numpy.load(open("combined_features.npz"))
features = features['arr_0']
features.shape

# Distribute the features across 5000 partitions
features_rdd = sc.parallelize(features, 5000)

from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt

# Fails here: the executors cannot import numpy
clusters = KMeans.train(features_rdd, 2, maxIterations=10, runs=10, initializationMode="random")

Stack trace


 org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
  File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/mllib/__init__.py", line 25, in <module>

ImportError: No module named numpy

        at org.apache.spark.api.python.PythonRunner$$anon.read(PythonRDD.scala:166)
        at org.apache.spark.api.python.PythonRunner$$anon.<init>(PythonRDD.scala:207)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
        at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Accepted answer by dayman

To use Spark in Yarn client mode, you'll need to install any dependencies on the machines on which Yarn starts the executors. That's the only surefire way to make this work.


Using Spark with Yarn cluster mode is a different story. You can distribute python dependencies with spark-submit.


spark-submit --master yarn-cluster my_script.py --py-files my_dependency.zip
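For a pure-Python dependency this works out of the box: the zip passed via --py-files is added to sys.path on the driver and on the executors, so the package can be imported both in the driver script and inside functions shipped to the workers. Below is a minimal sketch of that pattern; my_dependency and its clean_text function are hypothetical names for illustration, not something from the original answer.

# my_script.py -- sketch assuming my_dependency.zip contains a pure-Python
# package named my_dependency exposing a module-level function clean_text()
from pyspark import SparkContext
import my_dependency  # resolves from the zip shipped via --py-files

sc = SparkContext(appName="py-files-demo")
lines = sc.parallelize(["some raw text", "more raw text"], 2)
# clean_text runs on the executors, which also see the shipped zip on sys.path
print(lines.map(my_dependency.clean_text).collect())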

However, the situation with numpy is complicated by the same thing that makes it so fast: the fact that it does the heavy lifting in C. Because of the way that it is installed, you won't be able to distribute numpy in this fashion.


Answered by Somum

I had a similar issue, but I don't think you need to set PYSPARK_PYTHON; instead, just install numpy on the worker machines (with apt-get or yum). The error will also tell you on which machine the import was missing.

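If it is not obvious from the stack trace which hosts are affected, a small diagnostic job can report, per executor host, which Python binary is in use and whether numpy imports. This is a sketch of my own, not part of the original answer; it assumes an existing SparkContext named sc, as in the pyspark shell:

def probe(_):
    # Runs on the executors: record hostname, interpreter path and numpy status
    import socket, sys
    try:
        import numpy
        status = numpy.__version__
    except ImportError:
        status = "missing"
    return [(socket.gethostname(), sys.executable, status)]

# Use enough partitions that tasks are likely to land on every executor
print(sc.parallelize(range(1000), 100).mapPartitions(probe).distinct().collect())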

Answered by sincosmos

numpy is not installed on the worker (virtual) machines. If you use Anaconda, it's very convenient to upload such Python dependencies when deploying the application in cluster mode (so there is no need to install numpy or other modules on each machine; instead, they must be in your Anaconda environment). First, zip your Anaconda environment and put the zip file on the cluster; then you can submit a job using the following script.


 spark-submit \
 --master yarn \
 --deploy-mode cluster \
 --archives hdfs://host/path/to/anaconda.zip#python-env \
 --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python-env/anaconda/bin/python \
 app_main.py
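To confirm that the executors really pick up the shipped environment rather than the system Python, app_main.py can log the interpreter path seen on the workers. The following is only a hedged sketch of such a check (the file name app_main.py comes from the command above; everything else is illustrative). In cluster mode the printed output ends up in the YARN application logs.

from pyspark import SparkContext

sc = SparkContext(appName="anaconda-env-check")

def interpreter_path(_):
    # Executed on the executors; per the answer above, this should point
    # somewhere inside the unpacked python-env/anaconda/ directory
    import sys
    return [sys.executable]

print(sc.parallelize(range(8), 4).mapPartitions(interpreter_path).distinct().collect())

import numpy  # should now resolve from the shipped anaconda environment
print("driver numpy:", numpy.__version__)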

Yarn will copy anaconda.zip from the HDFS path to each worker and use that python-env/anaconda/bin/python to execute tasks.


Refer to Running PySpark with Virtualenv for more information.


Answered by shashank rai

I had the same issue. Try installing numpy with pip3 if you're using Python 3:


pip3 install numpy


Answered by Mehdi LAMRANI

Be aware that you need to have numpy installed on each and every worker, and even on the master itself (depending on your component placement).

Also make sure to run the pip install numpy command from a root account (sudo does not suffice) after forcing umask to 022 (umask 022), so that the rights cascade down to the Spark (or Zeppelin) user.


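One way to sanity-check the point about rights is to run the interpreter as the Spark (or Zeppelin) user and confirm that numpy is both importable and readable. A small illustrative check, not from the original answer:

import os
import numpy

# If the import above already succeeds for this user, the basic rights are fine.
# With umask 022 at install time, the mode should include read access for
# "other" (e.g. 644 for modules, 755 for directories).
print(numpy.__file__)
print(oct(os.stat(numpy.__file__).st_mode & 0o777))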

Answered by Gal Bracha

What solved it for me (on Mac) was actually this guide, which also explains how to run Python through Jupyter Notebooks: https://medium.com/@yajieli/installing-spark-pyspark-on-mac-and-fix-of-some-common-errors-355a9050f735


In a nutshell (assuming you installed Spark with brew install apache-spark):


  1. Find the SPARK_PATH using: brew info apache-spark
  2. Add these lines to your ~/.bash_profile:
# Spark and Python
######
export SPARK_PATH=/usr/local/Cellar/apache-spark/2.4.1
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
#For python 3, You have to add the line below or you will get an error
export PYSPARK_PYTHON=python3
alias snotebook='$SPARK_PATH/bin/pyspark --master local[2]'
######
  3. You should be able to open a Jupyter Notebook simply by calling: pyspark

And just remember that you don't need to set up the SparkContext yourself; instead simply call:


sc = SparkContext.getOrCreate()
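Once the shell or notebook is up, a tiny end-to-end check of the setup might look like the sketch below (my own addition, not part of the answer). The elements of a parallelized numpy array unpickle as numpy scalars, so the job only succeeds if the executors can import numpy as well:

from pyspark import SparkContext
import numpy

sc = SparkContext.getOrCreate()

# The reduce runs on the executors, which deserialize numpy scalars there
rdd = sc.parallelize(numpy.arange(100), 4)
print(rdd.sum())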