如何在 pyspark 中获取 Python 库?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/36217090/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverFlow
How do I get Python libraries in pyspark?
提问 by thenakulchawla
I want to use matplotlib.bblpath or shapely.geometry libraries in pyspark.
我想在 pyspark 中使用 matplotlib.bblpath 或 shapely.geometry 库。
When I try to import any of them I get the below error:
当我尝试导入其中任何一个时,出现以下错误:
>>> from shapely.geometry import polygon
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named shapely.geometry
I know the module isn't present, but how can these packages be brought to my pyspark libraries?
我知道该模块不存在,但是如何将这些包带到我的 pyspark 库中?
回答 by armatita
In the Spark context try using:
在 Spark 上下文中尝试使用:
sc.addPyFile("module.py")  # also works with a .zip; addPyFile is called on the SparkContext instance
Quoting from the docs:
引自文档:
Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
为将来要在此 SparkContext 上执行的所有任务添加 .py 或 .zip 依赖项。传递的路径可以是本地文件、HDFS(或其他 Hadoop 支持的文件系统)中的文件,也可以是 HTTP、HTTPS 或 FTP URI。
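To make the idea concrete, here is a minimal sketch (not part of the original answer) run from the pyspark shell, where sc is the shell's SparkContext. The file deps.zip and the package mymodule inside it are hypothetical names; addPyFile ships the archive to every executor so that the import succeeds inside tasks. Note that a package with native dependencies such as shapely also needs its GEOS library installed on each node, so shipping Python code alone may not be enough for it.

# Minimal sketch: sc is the pyspark shell's SparkContext; deps.zip is a
# hypothetical archive containing a pure-Python package called "mymodule".
sc.addPyFile("deps.zip")       # a single .py file works the same way

def use_module(x):
    import mymodule            # resolved on the executor from the shipped zip
    return mymodule.square(x)  # hypothetical function inside that package

print(sc.parallelize(range(4)).map(use_module).collect())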
回答 by Hussain Bohra
This is how I got it working in our AWS EMR cluster (it should be the same in any other cluster as well). I created the following shell script and executed it as a bootstrap action:
这就是我让它在我们的 AWS EMR 集群中工作的方式(在任何其他集群中也应该相同)。我创建了以下 shell 脚本并将其作为引导操作执行:
#!/bin/bash
# shapely installation
# Build GEOS from source (the native library that shapely requires)
wget http://download.osgeo.org/geos/geos-3.5.0.tar.bz2
tar jxf geos-3.5.0.tar.bz2
cd geos-3.5.0 && ./configure --prefix=$HOME/geos-bin && make && make install
# Copy the built shared libraries where the dynamic linker can find them
sudo cp /home/hadoop/geos-bin/lib/* /usr/lib
sudo /bin/sh -c 'echo "/usr/lib" >> /etc/ld.so.conf'
sudo /bin/sh -c 'echo "/usr/lib/local" >> /etc/ld.so.conf'
sudo /sbin/ldconfig
sudo /bin/sh -c 'echo -e "\nexport LD_LIBRARY_PATH=/usr/lib" >> /home/hadoop/.bashrc'
source /home/hadoop/.bashrc
# Install the Python packages themselves
sudo pip install shapely
echo "Shapely installation complete"
pip install https://pypi.python.org/packages/74/84/fa80c5e92854c7456b591f6e797c5be18315994afd3ef16a58694e1b5eb1/Geohash-1.0.tar.gz
exit 0
Note: Instead of running it as a bootstrap action, this script can be executed independently on every node in the cluster. I have tested both scenarios.
注意:该脚本可以在集群中的每个节点中独立执行,而不是作为引导操作运行。我已经测试了这两种情况。
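Before the full UDF demo below, a quick per-executor check (my own sketch, not part of the original answer) can confirm that shapely is importable on the workers after the bootstrap:

def shapely_version(_):
    import shapely              # runs on the executor, not the driver
    return shapely.__version__

# with several partitions this will usually exercise more than one executor
print(set(sc.parallelize(range(8), 4).map(shapely_version).collect()))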
The following is sample pyspark and shapely code (a Spark SQL UDF) to verify that the above commands work as expected:
以下是一段使用 pyspark 和 shapely 的示例代码(Spark SQL UDF),用于确认上述命令按预期工作:
Python 2.7.10 (default, Dec 8 2015, 18:25:23)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to Spark version 1.6.1
Using Python version 2.7.10 (default, Dec 8 2015 18:25:23)
SparkContext available as sc, HiveContext available as sqlContext.
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StringType
>>> from shapely.wkt import loads as load_wkt
>>> def parse_region(region):
...     from shapely.wkt import loads as load_wkt
...     reverse_coordinate = lambda coord: ' '.join(reversed(coord.split(':')))
...     coordinate_list = map(reverse_coordinate, region.split(', '))
...     if coordinate_list[0] != coordinate_list[-1]:
...         coordinate_list.append(coordinate_list[0])
...     return str(load_wkt('POLYGON ((%s))' % ','.join(coordinate_list)).wkt)
...
>>> udf_parse_region=udf(parse_region, StringType())
16/09/06 22:18:34 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/09/06 22:18:34 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
>>> df = sqlContext.sql('select id, bounds from <schema.table_name> limit 10')
>>> df2 = df.withColumn('bounds1', udf_parse_region('bounds'))
>>> df2.first()
Row(id=u'0089d43a-1b42-4fba-80d6-dda2552ee08e', bounds=u'33.42838509594465:-119.0533447265625, 33.39170168789402:-119.0203857421875, 33.29992542601392:-119.0478515625', bounds1=u'POLYGON ((-119.0533447265625 33.42838509594465, -119.0203857421875 33.39170168789402, -119.0478515625 33.29992542601392, -119.0533447265625 33.42838509594465))')
>>>
Thanks, Hussain Bohra
谢谢,侯赛因·博拉
回答 by Jon
Is this on standalone (i.e. laptop/desktop) or in a cluster environment (e.g. AWS EMR)?
这是在独立(即笔记本电脑/台式机)上还是在集群环境中(例如 AWS EMR)?
If on your laptop/desktop,
pip install shapely
should work just fine. You may need to check the environment variables for your default Python environment(s). For example, if you typically use Python 3 but use Python 2 for pyspark, then shapely would not be available to pyspark.

If in a cluster environment such as AWS EMR, you can try:

import os

def myfun(x):
    os.system("pip install shapely")
    return x

rdd = sc.parallelize([1, 2, 3, 4])  ## assuming 4 worker nodes
rdd.map(lambda x: myfun(x)).collect()  ## call each worker to run the code and install the library
如果在您的笔记本电脑/台式机上,
pip install shapely
应该可以正常工作。您可能需要检查默认 Python 环境的环境变量。例如,如果您通常使用 Python 3,但将 Python 2 用于 pyspark,那么 shapely 在 pyspark 中将不可用。

如果在 AWS EMR 等集群环境中,您可以尝试:

import os

def myfun(x):
    os.system("pip install shapely")
    return x

rdd = sc.parallelize([1, 2, 3, 4])  ## assuming 4 worker nodes
rdd.map(lambda x: myfun(x)).collect()  ## call each worker to run the code and install the library
"I know the module isn't present, but I want to know how can these packages be brought to my pyspark libraries."
“我知道该模块不存在,但我想知道如何将这些包带到我的 pyspark 库中。”
On EMR, if you want pyspark to be pre-prepared with whatever other libraries and configurations you want, you can use a bootstrap step to make those adjustments. Aside from that, you can't "add" a library to pyspark without compiling Spark in Scala (which would be a pain to do if you're not savvy with SBT).
在 EMR 上,如果您希望 pyspark 预先准备好您想要的任何其他库和配置,您可以使用引导步骤进行这些调整。除此之外,您不能在不使用 Scala 编译 Spark 的情况下将库“添加”到 pyspark(如果您不熟悉 SBT,这样做会很痛苦)。
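A small variation on the pip-install-inside-a-job snippet shown earlier in this answer (my own sketch, not the answer author's code): running the install once per partition with mapPartitions avoids shelling out to pip for every record, and using sys.executable targets the same interpreter the executors run.

import subprocess
import sys

def install_shapely(partition):
    # executed once per partition on an executor; harmless if already installed
    subprocess.call([sys.executable, "-m", "pip", "install", "--user", "shapely"])
    for record in partition:
        yield record

sc.parallelize(range(8), 4).mapPartitions(install_shapely).count()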
回答 by faisal12
I found a great solution in the AWS docs that uses the SparkContext. I was able to add Pandas and other packages this way:
我在 AWS 文档中找到了一个使用 SparkContext 的很好的解决方案。我用这种方法添加了 Pandas 和其他包:
Using SparkContext to add packages to notebook with PySpark Kernel in EMR
在 EMR 中使用 SparkContext 将包添加到带有 PySpark 内核的笔记本
sc.install_pypi_package("pandas==0.25.1")
sc.install_pypi_package("pandas==0.25.1")
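For context (a hedged sketch of my own, not from the answer): install_pypi_package and list_packages are EMR-Notebook helpers attached to the SparkContext by the PySpark kernel, not part of open-source PySpark, so the lines below only apply inside an EMR notebook. Installing shapely the same way should also cover the original question, assuming pip can find a wheel with bundled GEOS:

sc.install_pypi_package("pandas==0.25.1")  # installs into the cluster-side virtualenv
sc.install_pypi_package("shapely")         # the package from the original question
sc.list_packages()                         # list what the executors can now import

import pandas as pd
from shapely.geometry import Point
print(pd.__version__, Point(0, 0).wkt)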