Use AWS Glue Python with NumPy and Pandas Python Packages
Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same CC BY-SA terms and attribute it to the original authors (not me) on Stack Overflow, citing the original question.
Original question: http://stackoverflow.com/questions/46329561/
Asked by jumpman23
What is the easiest way to use packages such as NumPy and Pandas within the new ETL tool on AWS called Glue? I have a complete Python script that uses NumPy and Pandas which I would like to run in AWS Glue.
Accepted answer by Jasper_Li
I think the current answer is that you cannot. According to the AWS Glue documentation:
Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.
But even when I tried to include a normal pure-Python library from S3, the Glue job failed because of an HDFS permission problem. If you find a way to solve this, please let me know as well.
Answered by Prabhakar Reddy
If the libraries you need are not pure Python and you still want to use them, you can use the script below to install them from within your Glue code:
import os
import site
from setuptools.command import easy_install

# GLUE_INSTALLATION points at a writable install directory inside the Glue environment
install_path = os.environ['GLUE_INSTALLATION']

# Install the package into that directory, then reload the site module
# so the new install location is picked up on sys.path
easy_install.main(["--install-dir", install_path, "<library-name>"])
reload(site)  # reload is a builtin on Python 2; use importlib.reload on Python 3

import <installed library>
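For instance, a concrete (hypothetical) invocation of the same snippet might look like this; the package name "requests" is only an example, not something the original answer specifies, and the try/except is added so the snippet also runs on a Python 3 shell:

import os
import site
from setuptools.command import easy_install

try:
    from importlib import reload  # Python 3: reload lives in importlib
except ImportError:
    pass                          # Python 2: reload is already a builtin

install_path = os.environ['GLUE_INSTALLATION']

# Install "requests" (example package) into the Glue-provided directory,
# then refresh the site module so the new location is importable
easy_install.main(["--install-dir", install_path, "requests"])
reload(site)

import requests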
Answered by letstry
When you click Run job there is a Job parameters (optional) button that is collapsed by default. When you expand it you get the following fields, which you can use to point the job at libraries saved in S3, and this works for me (a programmatic equivalent is sketched after the fields below):
Python library path
s3://bucket-name/folder-name/file-name
Dependent jars path
s3://bucket-name/folder-name/file-name
Referenced files path
s3://bucket-name/folder-name/file-name
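For reference, these three console fields correspond to the Glue job arguments --extra-py-files, --extra-jars, and --extra-files, so the same thing can be done programmatically. A minimal sketch with boto3 follows; the job name, bucket, and object keys are placeholders, not values from the answer:

import boto3

glue = boto3.client("glue")

# Equivalent of filling in the three "Job parameters (optional)" fields in the console
glue.start_job_run(
    JobName="my-etl-job",  # hypothetical job name
    Arguments={
        "--extra-py-files": "s3://bucket-name/folder-name/libs.zip",
        "--extra-jars": "s3://bucket-name/folder-name/driver.jar",
        "--extra-files": "s3://bucket-name/folder-name/config.json",
    },
)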
Answered by Cristian Saavedra Desmoineaux
There is an update:
...You can now use Python shell jobs... ...Python shell jobs in AWS Glue support scripts that are compatible with Python 2.7 and come pre-loaded with libraries such as the Boto3, NumPy, SciPy, pandas, and others.
https://aws.amazon.com/about-aws/whats-new/2019/01/introducing-python-shell-jobs-in-aws-glue/
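A Python shell job is created like any other Glue job, just with pythonshell as the command name. Here is a minimal, hypothetical boto3 sketch; the job name, IAM role, and script location are placeholders:

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="pandas-shell-job",       # hypothetical job name
    Role="MyGlueServiceRole",      # hypothetical IAM role
    Command={
        "Name": "pythonshell",     # Python shell job: Boto3, NumPy, SciPy, pandas pre-loaded
        "ScriptLocation": "s3://bucket-name/scripts/pandas_etl.py",
        "PythonVersion": "3",      # "2" was the only option at launch; "3" came later
    },
    MaxCapacity=0.0625,            # smallest capacity setting for Python shell jobs
)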
Answered by BigData-Guru
As of now, you can use Python extension modules and libraries with your AWS Glue ETL scripts as long as they are written in pure Python. Libraries that rely on C extensions, such as pandas, are not supported at present, nor are extensions written in other languages.
Answered by MadCityDev
If you go to edit a job (or create a new one), there is a collapsed optional section called "Script libraries and job parameters (optional)". In there, you can specify an S3 path for Python libraries (as well as other things). I haven't tried that part myself yet, but I think that's what you are looking for.
Answered by Vin Odh
If you want to integrate Python modules into your AWS Glue ETL job, you can, and you can use whatever Python module you want, because Glue is nothing but a serverless Python run environment. All you need to do is package the modules that your script requires using pip install -t /path/to/your/directory and then upload them to your S3 bucket. While creating the AWS Glue job, after pointing at the S3 script and temp location, go to the advanced job parameters option and you will see a python_libraries option there. You can just point it at the Python module packages that you uploaded to S3.
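A rough sketch of that packaging step, assuming the dependencies are zipped before upload; the module list, local staging directory, and bucket/key names below are placeholders:

import shutil
import subprocess

import boto3

# Install the modules your script needs into a local staging directory
subprocess.check_call(["pip", "install", "-t", "glue_deps", "pytz", "six"])

# Zip the staging directory so its contents sit at the root of the archive
archive = shutil.make_archive("glue_deps", "zip", root_dir="glue_deps")

# Upload the archive and point the job's python_libraries / Python library path at it
boto3.client("s3").upload_file(archive, "bucket-name", "libs/glue_deps.zip")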
Answered by Sergey Nasonov
In order to install a specific version (for instance, for an AWS Glue Python job), navigate to the site hosting the Python packages, for example the page for the package "pg8000": https://pypi.org/project/pg8000/1.12.5/#files
Then select an appropriate version, copy the link to the file, and paste it into the snippet below:
import os
import site
from setuptools.command import easy_install

# GLUE_INSTALLATION points at a writable install directory inside the Glue environment
install_path = os.environ['GLUE_INSTALLATION']

# Pass the direct download URL of the pinned release instead of a bare package name
easy_install.main(["--install-dir", install_path, "https://files.pythonhosted.org/packages/83/03/10902758730d5cc705c0d1dd47072b6216edc652bc2e63a078b58c0b32e6/pg8000-1.12.5.tar.gz"])
reload(site)  # reload is a builtin on Python 2; use importlib.reload on Python 3
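After the install and reload, the pinned version can be imported and used as usual. For example, a hypothetical pg8000 connection; the credentials, host, and database name are placeholders:

import pg8000

# pg8000 1.12.5, installed above, is now on the path
conn = pg8000.connect(
    user="db_user",
    password="db_password",
    host="db-host.example.com",
    database="mydb",
)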
Answered by Jingkun
The accepted answer is no longer true as of 2019.
awswrangler is what you need. It allows you to use pandas in Glue and Lambda.
https://github.com/awslabs/aws-data-wrangler
Install using AWS Lambda Layer
https://aws-data-wrangler.readthedocs.io/en/latest/install.html#setting-up-lambda-layer
Example: Typical Pandas ETL
import pandas
import awswrangler as wr
df = pandas.read_... # Read from anywhere
# Typical Pandas, Numpy or Pyarrow transformation HERE!
wr.pandas.to_parquet(  # Storing the data and metadata to Data Lake
    dataframe=df,
    database="database",
    path="s3://...",
    partition_cols=["col_name"],
)