How to import a text file on AWS S3 into pandas without writing to disk
Note: this content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/37703634/
Asked by alpalalpal
I have a text file saved on S3 which is a tab-delimited table. I want to load it into pandas, but I cannot save it to disk first because I am running on a Heroku server. Here is what I have so far:
import io
import boto3
import os
import pandas as pd
os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"
s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket",Key="filename.txt")
file = response["Body"]
pd.read_csv(file, header=14, delimiter="\t", low_memory=False)
The error is:
OSError: Expected file path name or file-like object, got <class 'bytes'> type
How do I convert the response body into a format pandas will accept?
pd.read_csv(io.StringIO(file), header=14, delimiter="\t", low_memory=False)
returns
TypeError: initial_value must be str or None, not StreamingBody
pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
returns
TypeError: 'StreamingBody' does not support the buffer interface
UPDATE: using the following worked:
file = response["Body"].read()
and
pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
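Putting the two pieces together, the complete fix looks roughly like this (a minimal sketch using the same placeholder bucket and key as above):
import io
import boto3
import pandas as pd

s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket", Key="filename.txt")

# read() drains the StreamingBody into bytes; BytesIO then makes it file-like
file = response["Body"].read()
df = pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)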
Answered by Stefan
pandas uses boto for read_csv, so you should be able to:
import boto
data = pd.read_csv('s3://bucket....csv')
If you need boto3 because you are on Python 3.4+, you can:
import boto3
import io
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
Since version 0.20.1, pandas uses s3fs; see the answer below.
Answered by Wesam
Now pandas can handle S3 URLs. You could simply do:
import pandas as pd
import s3fs
df = pd.read_csv('s3://bucket-name/file.csv')
You need to install s3fs if you don't have it:
pip install s3fs
Authentication
If your S3 bucket is private and requires authentication, you have two options:
1. Add access credentials to your ~/.aws/credentials config file:
[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Or
2. Set the following environment variables with their proper values:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_SESSION_TOKEN
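On newer pandas versions (roughly 1.2 and later), credentials can also be passed explicitly through the storage_options argument, which is forwarded to s3fs; a hedged sketch, reusing the example key values above as placeholders:
import pandas as pd

# The bucket path and credential values here are placeholders.
df = pd.read_csv(
    's3://bucket-name/file.csv',
    storage_options={
        'key': 'AKIAIOSFODNN7EXAMPLE',
        'secret': 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
    },
)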
Answered by Raveen Beemsingh
This is now supported in the latest pandas. See:
http://pandas.pydata.org/pandas-docs/stable/io.html#reading-remote-files
e.g.:
df = pd.read_csv('s3://pandas-test/tips.csv')
Answered by aviral sanjay
Since the files can be too large, it is not wise to load them into the dataframe all at once. Instead, read line by line and build the dataframe from the rows. Yes, you can also pass a chunk size to read_csv, but then you have to keep track of the number of rows read (a chunked sketch is shown at the end of this answer).
Hence, I came up with this approach:
# Excerpt from a class: self.bucket, self.package_s3_key, self.response and
# self.encoding are assumed to be set on the instance elsewhere.
import codecs
from io import StringIO
import pandas as pd

def create_file_object_for_streaming(self):
    print("creating file object for streaming")
    self.file_object = self.bucket.Object(key=self.package_s3_key)
    print("File object is: " + str(self.file_object))
    print("Object file created.")
    return self.file_object

# Decode the streaming body and parse it one line at a time
# (note: readlines() still loads all decoded lines into memory at once).
for row in codecs.getreader(self.encoding)(self.response[u'Body']).readlines():
    row_string = StringIO(row)
    df = pd.read_csv(row_string, sep=",")
I also delete the df once the work is done:
del df
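For reference, the chunksize route mentioned at the top of this answer would look roughly like this (a sketch assuming boto3 and placeholder bucket/key names; the object's bytes are still downloaded in full, only the parsing into dataframes is chunked):
import io
import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my-bucket', Key='big-file.csv')
body = io.BytesIO(obj['Body'].read())

frames = []
for chunk in pd.read_csv(body, chunksize=100000):
    frames.append(chunk)   # or process each chunk here and discard it
df = pd.concat(frames, ignore_index=True)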
Answered by billmanH
An option is to convert the csv to json via df.to_dict() and then store it as a string. Note this is only relevant if the CSV is not a requirement but you just want to quickly put the dataframe in an S3 bucket and retrieve it again.
from boto.s3.connection import S3Connection
import pandas as pd
import yaml
conn = S3Connection()
mybucket = conn.get_bucket('mybucketName')
myKey = mybucket.get_key("myKeyName")
myKey.set_contents_from_string(str(df.to_dict()))
This will convert the df to a dict string, and then save that as json in S3. You can later read it in the same json format:
df = pd.DataFrame(yaml.load(myKey.get_contents_as_string()))
The other solutions are also good, but this is a little simpler. Yaml may not necessarily be required, but you need something to parse the json string. If the S3 file doesn't necessarily need to be a CSV, this can be a quick fix.
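A somewhat tidier variant of the same idea (my own sketch, not from the original answer) is to round-trip through df.to_json() and pd.read_json() with boto3, which avoids the yaml dependency; bucket and key names are placeholders:
import io
import boto3
import pandas as pd

s3 = boto3.client('s3')
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})   # example dataframe

# write: dataframe -> JSON string -> S3 object
s3.put_object(Bucket='mybucketName', Key='myKeyName', Body=df.to_json().encode('utf-8'))

# read back: S3 object -> JSON -> dataframe
obj = s3.get_object(Bucket='mybucketName', Key='myKeyName')
df2 = pd.read_json(io.BytesIO(obj['Body'].read()))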
Answered by Harry_pb
For text files, you can use the code below with a pipe-delimited file, for example:
import pandas as pd
import io
import boto3

s3_client = boto3.client('s3', use_ssl=False)
bucket = ''     # your bucket name
prefix = ''     # key prefix, if any
filename = ''   # file name
obj = s3_client.get_object(Bucket=bucket, Key=prefix + filename)
df = pd.read_fwf(io.BytesIO(obj['Body'].read()), encoding='unicode_escape',
                 delimiter='|', error_bad_lines=False, header=None, dtype=str)
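If the file really is just pipe-delimited text (rather than fixed-width), read_csv with sep='|' is the more usual call; a minimal sketch with the same placeholder names:
import io
import boto3
import pandas as pd

s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket='my-bucket', Key='path/to/file.txt')   # placeholder names
df = pd.read_csv(io.BytesIO(obj['Body'].read()), sep='|', header=None, dtype=str)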