Python: How to import a text file on AWS S3 into pandas without writing to disk

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/37703634/

How to import a text file on AWS S3 into pandas without writing to disk

Tags: python, pandas, heroku, amazon-s3, boto3

Asked by alpalalpal

I have a text file saved on S3 which is a tab delimited table. I want to load it into pandas but cannot save it first because I am running on a heroku server. Here is what I have so far.

import io
import boto3
import os
import pandas as pd

os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"

s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket",Key="filename.txt")
file = response["Body"]


pd.read_csv(file, header=14, delimiter="\t", low_memory=False)

The error is:

OSError: Expected file path name or file-like object, got <class 'bytes'> type

How do I convert the response body into a format pandas will accept?

pd.read_csv(io.StringIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: initial_value must be str or None, not StreamingBody

pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: 'StreamingBody' does not support the buffer interface

UPDATE: Using the following worked:

file = response["Body"].read()

and

pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
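
Putting the update together, a complete minimal sketch of the working approach (the bucket and key names are the placeholders from the question):

import io

import boto3
import pandas as pd

s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket", Key="filename.txt")

# read() drains the StreamingBody into bytes; BytesIO wraps those bytes
# in the file-like object that pandas expects
body = response["Body"].read()
df = pd.read_csv(io.BytesIO(body), header=14, delimiter="\t", low_memory=False)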

Answered by Stefan

pandas uses boto for read_csv, so you should be able to:

import boto
data = pd.read_csv('s3://bucket....csv')

If you need boto3 because you are on Python 3.4+, you can:

import boto3
import io
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))

Since version 0.20.1, pandas uses s3fs; see the answer below.

Answered by Wesam

Now pandas can handle S3 URLs. You could simply do:

import pandas as pd
import s3fs

df = pd.read_csv('s3://bucket-name/file.csv')

You need to install s3fs if you don't have it: pip install s3fs

Authentication

If your S3 bucket is private and requires authentication, you have two options:

1- Add access credentials to your ~/.aws/credentials config file:

[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Or

2- Set the following environment variables with their proper values (a minimal sketch follows the list):

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_SESSION_TOKEN
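
For example, a minimal sketch of option 2 done from Python itself via os.environ (the key values are placeholders, and s3fs must be installed as noted above):

import os

import pandas as pd

# Placeholder credentials: substitute real values, or export these
# variables in the shell before starting Python
os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"
# os.environ["AWS_SESSION_TOKEN"] = "xxxxxxxx"  # only for temporary credentials

df = pd.read_csv('s3://bucket-name/file.csv')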

Answered by Raveen Beemsingh

This is now supported in the latest pandas. See

http://pandas.pydata.org/pandas-docs/stable/io.html#reading-remote-files

e.g.:

df = pd.read_csv('s3://pandas-test/tips.csv')

Answered by Dror

With s3fs it can be done as follows:

import s3fs
import pandas as pd
fs = s3fs.S3FileSystem(anon=False)

# CSV
with fs.open('mybucket/path/to/object/foo.csv') as f:
    df = pd.read_csv(f)

# Pickle
with fs.open('mybucket/path/to/object/foo.pkl') as f:
    df = pd.read_pickle(f)

Answered by aviral sanjay

Since the files can be too large, it is not wise to load them into the dataframe all at once. Hence, read line by line and save it in the dataframe. Yes, we can also provide the chunk size in read_csv, but then we have to keep track of the number of rows read.

Hence, I came up with this approach:

import codecs
from io import StringIO

import pandas as pd

# Method of a class that wraps the boto3 bucket resource and the S3 key
def create_file_object_for_streaming(self):
    print("creating file object for streaming")
    self.file_object = self.bucket.Object(key=self.package_s3_key)
    print("File object is: " + str(self.file_object))
    print("Object file created.")
    return self.file_object

# Decode the streaming body and run each line through read_csv separately
for row in codecs.getreader(self.encoding)(self.response[u'Body']).readlines():
    row_string = StringIO(row)
    df = pd.read_csv(row_string, sep=",")

I also delete the df once work is done: del df
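
For comparison, a minimal sketch of the chunksize route mentioned at the start of this answer, using the boto3 client and the placeholder bucket/key names from the question:

import io

import boto3
import pandas as pd

s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket", Key="filename.txt")

# chunksize makes read_csv yield DataFrames of at most 10000 rows each,
# so only one chunk is parsed into a DataFrame at a time
for chunk in pd.read_csv(io.BytesIO(response["Body"].read()),
                         delimiter="\t", chunksize=10000):
    print(chunk.shape)  # replace with real per-chunk processing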

Answered by billmanH

An option is to convert the csv to json via df.to_dict() and then store it as a string. Note this is only relevant if the CSV is not a requirement but you just want to quickly put the dataframe in an S3 bucket and retrieve it again.

from boto.s3.connection import S3Connection
import pandas as pd
import yaml

conn = S3Connection()
mybucket = conn.get_bucket('mybucketName')
myKey = mybucket.get_key("myKeyName")

myKey.set_contents_from_string(str(df.to_dict()))

This will convert the df to a dict string, and then save that as json in S3. You can later read it in the same json format:

df = pd.DataFrame(yaml.load(myKey.get_contents_as_string(), Loader=yaml.SafeLoader))  # newer PyYAML requires an explicit Loader

The other solutions are also good, but this is a little simpler. Yaml may not necessarily be required, but you need something to parse the json string. If the S3 file doesn't necessarily need to be a CSV, this can be a quick fix.

Answered by Harry_pb

For text files, you can use the code below with a pipe-delimited file, for example:

import pandas as pd
import io
import boto3
s3_client = boto3.client('s3', use_ssl=False)
bucket = #
prefix = #
filename = #
obj = s3_client.get_object(Bucket=bucket, Key=prefix + filename)
df = pd.read_fwf(io.BytesIO(obj['Body'].read()), encoding='unicode_escape',
                 delimiter='|', error_bad_lines=False, header=None, dtype=str)
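
If the object is simply a pipe-delimited text table, a read_csv sketch with sep='|' (reusing the same placeholder bucket, prefix and filename) may be more direct:

import io

import boto3
import pandas as pd

s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket=bucket, Key=prefix + filename)

# An explicit separator lets read_csv treat the object as a delimited
# table rather than fixed-width text
df = pd.read_csv(io.BytesIO(obj['Body'].read()), sep='|', header=None, dtype=str)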