如何在 pyspark 中获取 Python 库?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/36217090/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverFlow
How do I get Python libraries in pyspark?
提问 by thenakulchawla
I want to use matplotlib.bblpath or shapely.geometry libraries in pyspark.
我想在 pyspark 中使用 matplotlib.bblpath 或 shapely.geometry 库。
When I try to import any of them I get the below error:
当我尝试导入其中任何一个时,出现以下错误:
>>> from shapely.geometry import polygon
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named shapely.geometry
I know the module isn't present, but how can these packages be brought to my pyspark libraries?
我知道该模块不存在,但是如何将这些包带到我的 pyspark 库中?
回答 by armatita
In the Spark context try using:
在 Spark 上下文中尝试使用:
sc.addPyFile("module.py")  # also works with a .zip; addPyFile is called on the SparkContext instance
Quoting from the docs:
引自文档:
Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
为将来要在此 SparkContext 上执行的所有任务添加 .py 或 .zip 依赖项。传递的路径可以是本地文件、HDFS(或其他 Hadoop 支持的文件系统)中的文件,也可以是 HTTP、HTTPS 或 FTP URI。
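To make the idea concrete, here is a minimal sketch (not part of the original answer) run from the pyspark shell, where sc is the shell's SparkContext. The file deps.zip and the package mymodule inside it are hypothetical names; addPyFile ships the archive to every executor so that the import succeeds inside tasks. Note that a package with native dependencies such as shapely also needs its GEOS library installed on each node, so shipping Python code alone may not be enough for it.

# Minimal sketch: sc is the pyspark shell's SparkContext; deps.zip is a
# hypothetical archive containing a pure-Python package called "mymodule".
sc.addPyFile("deps.zip")       # a single .py file works the same way

def use_module(x):
    import mymodule            # resolved on the executor from the shipped zip
    return mymodule.square(x)  # hypothetical function inside that package

print(sc.parallelize(range(4)).map(use_module).collect())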
回答 by Hussain Bohra
This is how I got it working in our AWS EMR cluster (it should be the same in any other cluster as well). I created the following shell script and executed it as a bootstrap action:
这就是我让它在我们的 AWS EMR 集群中工作的方式(在任何其他集群中也应该相同)。我创建了以下 shell 脚本并将其作为引导操作执行:
#!/bin/bash
# shapely installation
# Build GEOS from source (the native library that shapely requires)
wget http://download.osgeo.org/geos/geos-3.5.0.tar.bz2
tar jxf geos-3.5.0.tar.bz2
cd geos-3.5.0 && ./configure --prefix=$HOME/geos-bin && make && make install
# Copy the built shared libraries where the dynamic linker can find them
sudo cp /home/hadoop/geos-bin/lib/* /usr/lib
sudo /bin/sh -c 'echo "/usr/lib" >> /etc/ld.so.conf'
sudo /bin/sh -c 'echo "/usr/lib/local" >> /etc/ld.so.conf'
sudo /sbin/ldconfig
sudo /bin/sh -c 'echo -e "\nexport LD_LIBRARY_PATH=/usr/lib" >> /home/hadoop/.bashrc'
source /home/hadoop/.bashrc
# Install the Python packages themselves
sudo pip install shapely
echo "Shapely installation complete"
pip install https://pypi.python.org/packages/74/84/fa80c5e92854c7456b591f6e797c5be18315994afd3ef16a58694e1b5eb1/Geohash-1.0.tar.gz
exit 0
Note: Instead of running it as a bootstrap action, this script can be executed independently on every node in the cluster. I have tested both scenarios.
注意:该脚本可以在集群中的每个节点中独立执行,而不是作为引导操作运行。我已经测试了这两种情况。
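Before the full UDF demo below, a quick per-executor check (my own sketch, not part of the original answer) can confirm that shapely is importable on the workers after the bootstrap:

def shapely_version(_):
    import shapely              # runs on the executor, not the driver
    return shapely.__version__

# with several partitions this will usually exercise more than one executor
print(set(sc.parallelize(range(8), 4).map(shapely_version).collect()))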
The following is sample pyspark and shapely code (a Spark SQL UDF) to verify that the above commands work as expected:
以下是一段使用 pyspark 和 shapely 的示例代码(Spark SQL UDF),用于确认上述命令按预期工作:
Python 2.7.10 (default, Dec 8 2015, 18:25:23)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to Spark version 1.6.1
Using Python version 2.7.10 (default, Dec 8 2015 18:25:23)
SparkContext available as sc, HiveContext available as sqlContext.
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StringType
>>> from shapely.wkt import loads as load_wkt
>>> def parse_region(region):
...     from shapely.wkt import loads as load_wkt
...     reverse_coordinate = lambda coord: ' '.join(reversed(coord.split(':')))
...     coordinate_list = map(reverse_coordinate, region.split(', '))
...     if coordinate_list[0] != coordinate_list[-1]:
...         coordinate_list.append(coordinate_list[0])
...     return str(load_wkt('POLYGON ((%s))' % ','.join(coordinate_list)).wkt)
...
>>> udf_parse_region=udf(parse_region, StringType())
16/09/06 22:18:34 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/09/06 22:18:34 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
>>> df = sqlContext.sql('select id, bounds from <schema.table_name> limit 10')
>>> df2 = df.withColumn('bounds1', udf_parse_region('bounds'))
>>> df2.first()
Row(id=u'0089d43a-1b42-4fba-80d6-dda2552ee08e', bounds=u'33.42838509594465:-119.0533447265625, 33.39170168789402:-119.0203857421875, 33.29992542601392:-119.0478515625', bounds1=u'POLYGON ((-119.0533447265625 33.42838509594465, -119.0203857421875 33.39170168789402, -119.0478515625 33.29992542601392, -119.0533447265625 33.42838509594465))')
>>>
Thanks, Hussain Bohra
谢谢,侯赛因·博拉
回答 by Jon
Is this on standalone (i.e. laptop/desktop) or in a cluster environment (e.g. AWS EMR)?
这是在独立(即笔记本电脑/台式机)上还是在集群环境中(例如 AWS EMR)?
If on your laptop/desktop,
pip install shapely
should work just fine. You may need to check the environment variables for your default Python environment(s). For example, if you typically use Python 3 but use Python 2 for pyspark, then shapely would not be available to pyspark.

If in a cluster environment such as AWS EMR, you can try:

import os

def myfun(x):
    os.system("pip install shapely")
    return x

rdd = sc.parallelize([1, 2, 3, 4])  ## assuming 4 worker nodes
rdd.map(lambda x: myfun(x)).collect()  ## call each worker to run the code and install the library
如果在您的笔记本电脑/台式机上,
pip install shapely
应该可以正常工作。您可能需要检查默认 Python 环境的环境变量。例如,如果您通常使用 Python 3,但将 Python 2 用于 pyspark,那么 shapely 在 pyspark 中将不可用。

如果在 AWS EMR 等集群环境中,您可以尝试:

import os

def myfun(x):
    os.system("pip install shapely")
    return x

rdd = sc.parallelize([1, 2, 3, 4])  ## assuming 4 worker nodes
rdd.map(lambda x: myfun(x)).collect()  ## call each worker to run the code and install the library
"I know the module isn't present, but I want to know how can these packages be brought to my pyspark libraries."
“我知道该模块不存在,但我想知道如何将这些包带到我的 pyspark 库中。”
On EMR, if you want pyspark to be pre-prepared with whatever other libraries and configurations you want, you can use a bootstrap step to make those adjustments. Aside from that, you can't "add" a library to pyspark without compiling Spark in Scala (which would be a pain to do if you're not savvy with SBT).
在 EMR 上,如果您希望 pyspark 预先准备好您想要的任何其他库和配置,您可以使用引导步骤进行这些调整。除此之外,您不能在不使用 Scala 编译 Spark 的情况下将库“添加”到 pyspark(如果您不熟悉 SBT,这样做会很痛苦)。
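A small variation on the pip-install-inside-a-job snippet shown earlier in this answer (my own sketch, not the answer author's code): running the install once per partition with mapPartitions avoids shelling out to pip for every record, and using sys.executable targets the same interpreter the executors run.

import subprocess
import sys

def install_shapely(partition):
    # executed once per partition on an executor; harmless if already installed
    subprocess.call([sys.executable, "-m", "pip", "install", "--user", "shapely"])
    for record in partition:
        yield record

sc.parallelize(range(8), 4).mapPartitions(install_shapely).count()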
回答 by faisal12
I found a great solution in the AWS docs that uses the SparkContext. I was able to add Pandas and other packages this way:
我在 AWS 文档中找到了一个使用 SparkContext 的很好的解决方案。我用这种方法添加了 Pandas 和其他包:
Using SparkContext to add packages to notebook with PySpark Kernel in EMR
在 EMR 中使用 SparkContext 将包添加到带有 PySpark 内核的笔记本
sc.install_pypi_package("pandas==0.25.1")
sc.install_pypi_package("pandas==0.25.1")
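For context (a hedged sketch of my own, not from the answer): install_pypi_package and list_packages are EMR-Notebook helpers attached to the SparkContext by the PySpark kernel, not part of open-source PySpark, so the lines below only apply inside an EMR notebook. Installing shapely the same way should also cover the original question, assuming pip can find a wheel with bundled GEOS:

sc.install_pypi_package("pandas==0.25.1")  # installs into the cluster-side virtualenv
sc.install_pypi_package("shapely")         # the package from the original question
sc.list_packages()                         # list what the executors can now import

import pandas as pd
from shapely.geometry import Point
print(pd.__version__, Point(0, 0).wkt)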