Python: How to read a list of parquet files from S3 as a pandas dataframe using pyarrow?

Disclaimer: this is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/45043554/


How to read a list of parquet files from S3 as a pandas dataframe using pyarrow?

python, pandas, dataframe, boto3, pyarrow

Asked by Diego Mora Cespedes

I have a hacky way of achieving this using boto3 (1.4.4), pyarrow (0.4.1) and pandas (0.20.3).


First, I can read a single parquet file locally like this:


import pyarrow.parquet as pq

path = 'parquet/part-r-00000-1e638be4-e31f-498a-a359-47d017a0059c.gz.parquet'
table = pq.read_table(path)
df = table.to_pandas()
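
(If only a few columns are needed, read_table should also accept a columns argument in recent pyarrow versions; the column names below are just placeholders.)

import pyarrow.parquet as pq

path = 'parquet/part-r-00000-1e638be4-e31f-498a-a359-47d017a0059c.gz.parquet'
# Only materialize the columns you actually need (the column names here are placeholders)
table = pq.read_table(path, columns=['col_a', 'col_b'])
df = table.to_pandas()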

I can also read a directory of parquet files locally like this:


import pyarrow.parquet as pq

dataset = pq.ParquetDataset('parquet/')
table = dataset.read()
df = table.to_pandas()

Both work like a charm. Now I want to achieve the same remotely with files stored in an S3 bucket. I was hoping that something like this would work:


dataset = pq.ParquetDataset('s3n://dsn/to/my/bucket')

But it does not:


OSError: Passed non-file path: s3n://dsn/to/my/bucket


After reading pyarrow's documentation thoroughly, this does not seem to be possible at the moment. So I came up with the following solution:


Reading a single file from S3 and getting a pandas dataframe:


import io
import boto3
import pyarrow.parquet as pq

buffer = io.BytesIO()
s3 = boto3.resource('s3')
s3_object = s3.Object('bucket-name', 'key/to/parquet/file.gz.parquet')
s3_object.download_fileobj(buffer)
table = pq.read_table(buffer)
df = table.to_pandas()

And here is my hacky, not-so-optimized solution to create a pandas dataframe from an S3 folder path:


import io
import boto3
import pandas as pd
import pyarrow.parquet as pq

bucket_name = 'bucket-name'
def download_s3_parquet_file(s3, bucket, key):
    buffer = io.BytesIO()
    s3.Object(bucket, key).download_fileobj(buffer)
    return buffer

client = boto3.client('s3')
s3 = boto3.resource('s3')
objects_dict = client.list_objects_v2(Bucket=bucket_name, Prefix='my/folder/prefix')
s3_keys = [item['Key'] for item in objects_dict['Contents'] if item['Key'].endswith('.parquet')]
buffers = [download_s3_parquet_file(s3, bucket_name, key) for key in s3_keys]
dfs = [pq.read_table(buffer).to_pandas() for buffer in buffers]
df = pd.concat(dfs, ignore_index=True)
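
Note that list_objects_v2 returns at most 1000 keys per call, so for bigger folders I would probably have to switch to a paginator, roughly like this (untested sketch):

import boto3

bucket_name = 'bucket-name'  # same bucket as above
client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')

s3_keys = []
for page in paginator.paginate(Bucket=bucket_name, Prefix='my/folder/prefix'):
    # 'Contents' is missing on empty pages, hence the .get()
    for item in page.get('Contents', []):
        if item['Key'].endswith('.parquet'):
            s3_keys.append(item['Key'])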

Is there a better way to achieve this? Maybe some kind of connector for pandas using pyarrow? I would like to avoid using pyspark, but if there is no other solution, then I would take it.


Accepted answer by vak

You should use the s3fs module as proposed by yjk21. However, as a result of calling ParquetDataset you'll get a pyarrow.parquet.ParquetDataset object. To get the pandas DataFrame you'll rather want to apply .read_pandas().to_pandas() to it:


import pyarrow.parquet as pq
import s3fs
s3 = s3fs.S3FileSystem()

pandas_dataframe = pq.ParquetDataset('s3://your-bucket/', filesystem=s3).read_pandas().to_pandas()
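
The same approach should work for a single folder (prefix) inside the bucket, and as far as I can tell read_pandas forwards a columns argument if you only need part of the data; the path and column names below are placeholders:

import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()

# Read just one prefix and only selected columns (path and column names are placeholders)
dataset = pq.ParquetDataset('s3://your-bucket/path/to/folder/', filesystem=s3)
df = dataset.read_pandas(columns=['col_a', 'col_b']).to_pandas()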

Answer by yjk21

You can use s3fs from dask, which implements a filesystem interface for S3. Then you can use the filesystem argument of ParquetDataset like so:


import s3fs
import pyarrow.parquet as pq

s3 = s3fs.S3FileSystem()
dataset = pq.ParquetDataset('s3n://dsn/to/my/bucket', filesystem=s3)
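
If the default AWS credential chain is not enough, s3fs also lets you pass credentials explicitly, something like this (the key values are placeholders, of course):

import s3fs
import pyarrow.parquet as pq

# Explicit credentials instead of relying on the default AWS credential chain (placeholders)
s3 = s3fs.S3FileSystem(key='YOUR_ACCESS_KEY_ID', secret='YOUR_SECRET_ACCESS_KEY')
dataset = pq.ParquetDataset('s3n://dsn/to/my/bucket', filesystem=s3)
table = dataset.read()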

Answer by oya163

It can be done using boto3 as well, without the use of pyarrow:


import boto3
import io
import pandas as pd

# Download the parquet file into an in-memory buffer and read it with pandas
buffer = io.BytesIO()
s3 = boto3.resource('s3')
obj = s3.Object('bucket_name', 'key')
obj.download_fileobj(buffer)
df = pd.read_parquet(buffer)

print(df.head())
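
If memory is a concern, pd.read_parquet should also let you load only selected columns (the column names below are placeholders):

import boto3
import io
import pandas as pd

buffer = io.BytesIO()
s3 = boto3.resource('s3')
s3.Object('bucket_name', 'key').download_fileobj(buffer)

# Load only the columns that are actually needed (column names are placeholders)
df = pd.read_parquet(buffer, columns=['col_a', 'col_b'])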

Answer by Rich Signell

Probably the easiest way to read parquet data on the cloud into dataframes is to use dask.dataframe in this way:


import dask.dataframe as dd
df = dd.read_parquet('s3://bucket/path/to/data-*.parq')

dask.dataframe can read from Google Cloud Storage, Amazon S3, Hadoop file system and more!

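If you need a plain pandas dataframe in the end, or have to pass credentials yourself, something along these lines should work (the credential values are placeholders):

import dask.dataframe as dd

# Credentials can go through storage_options if the default chain is not enough (placeholders)
ddf = dd.read_parquet('s3://bucket/path/to/data-*.parq',
                      storage_options={'key': 'YOUR_ACCESS_KEY_ID',
                                       'secret': 'YOUR_SECRET_ACCESS_KEY'})

# .compute() turns the lazy dask dataframe into a regular in-memory pandas dataframe
df = ddf.compute()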

Answer by Louis Yang

Thanks! Your question actually tells me a lot. This is how I do it now with pandas (0.21.1), which will call pyarrow, and boto3 (1.3.1).


import boto3
import io
import pandas as pd

# Read single parquet file from S3
def pd_read_s3_parquet(key, bucket, s3_client=None, **args):
    if s3_client is None:
        s3_client = boto3.client('s3')
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    return pd.read_parquet(io.BytesIO(obj['Body'].read()), **args)

# Read multiple parquets from a folder on S3 generated by spark
def pd_read_s3_multiple_parquets(filepath, bucket, s3=None, 
                                 s3_client=None, verbose=False, **args):
    if not filepath.endswith('/'):
        filepath = filepath + '/'  # Add '/' to the end
    if s3_client is None:
        s3_client = boto3.client('s3')
    if s3 is None:
        s3 = boto3.resource('s3')
    s3_keys = [item.key for item in s3.Bucket(bucket).objects.filter(Prefix=filepath)
               if item.key.endswith('.parquet')]
    if not s3_keys:
        print('No parquet found in', bucket, filepath)
    elif verbose:
        print('Load parquets:')
        for p in s3_keys: 
            print(p)
    dfs = [pd_read_s3_parquet(key, bucket=bucket, s3_client=s3_client, **args) 
           for key in s3_keys]
    return pd.concat(dfs, ignore_index=True)

Then you can read multiple parquet files under a folder from S3 with


df = pd_read_s3_multiple_parquets('path/to/folder', 'my_bucket')

(One can simplify this code a lot I guess.)

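Since **args is passed straight through to pd.read_parquet, you should also be able to push things like a column selection through it, e.g. (column names are placeholders):

# columns is forwarded via **args to pd.read_parquet (column names are placeholders)
df = pd_read_s3_multiple_parquets('path/to/folder', 'my_bucket',
                                  verbose=True, columns=['col_a', 'col_b'])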

Answer by Igor Tavares

If you are open to also using AWS Data Wrangler:


import awswrangler as wr

df = wr.s3.read_parquet(path="s3://...")
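
As far as I know, it can also read a whole folder as a dataset and restrict the columns, roughly like this (the path and column names are placeholders):

import awswrangler as wr

# Read all parquet files under a prefix as one dataset, keeping only selected columns
# (the path and the column names are placeholders)
df = wr.s3.read_parquet(path='s3://bucket/prefix/', dataset=True, columns=['col_a', 'col_b'])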