Python: read a file line by line from S3 using boto?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/28618468/

Read a file line by line from S3 using boto?

python, amazon-s3, boto

Asked by gignosko

I have a csv file in S3 and I'm trying to read the header line to get the size (these files are created by our users so they could be almost any size). Is there a way to do this using boto? I thought maybe I could use a python BufferedReader, but I can't figure out how to open a stream from an S3 key. Any suggestions would be great. Thanks!

Accepted answer by John Rotenstein

It appears that boto has a read() function that can do this. Here's some code that works for me:

>>> import boto
>>> from boto.s3.key import Key
>>> conn = boto.s3.connect_to_region('ap-southeast-2')
>>> bucket = conn.get_bucket('bucket-name')
>>> k = Key(bucket)
>>> k.key = 'filename.txt'
>>> k.open()
>>> k.read(10)
'This text '

The call to read(n) returns the next n bytes from the object.

Of course, this won't automatically return "the header line", but you could call it with a number large enough to return at least the header line.
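
For example, a minimal sketch along those lines, assuming the same k key object as above and that the header fits within the first 4096 bytes:

# A sketch, not a documented boto recipe: read a fixed-size chunk and
# keep everything up to the first newline. 4096 is an assumed upper
# bound on the header size.
k.open()
chunk = k.read(4096)
header_line = chunk.split('\n', 1)[0]
k.close()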

Answered by Michael Korbakov

You may find https://pypi.python.org/pypi/smart_open useful for your task.

From the documentation:

for line in smart_open.smart_open('s3://mybucket/mykey.txt'):
    print(line)
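
Because this iterates lazily, you can stop after the first line; for instance, a sketch for the original header-line use case (the s3:// URL is a placeholder):

import smart_open

# Grab only the header line, then stop; the rest of the object
# is never consumed.
for line in smart_open.smart_open('s3://mybucket/mykey.txt'):
    header = line
    break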

Answered by kooshywoosh

Here's a solution which actually streams the data line by line:

from io import TextIOWrapper
from gzip import GzipFile
...

# get StreamingBody from botocore.response
response = s3.get_object(Bucket=bucket, Key=key)
# if gzipped
gzipped = GzipFile(None, 'rb', fileobj=response['Body'])
data = TextIOWrapper(gzipped)

for line in data:
    print(line)  # process each decoded line

Answered by robertzp

With boto3 you can access the raw stream and read it line by line. Just note that the raw stream is, for some reason, a private property.

import boto3

s3 = boto3.resource('s3', aws_access_key_id='xxx', aws_secret_access_key='xxx')
obj = s3.Object('bucket name', 'file key')

# Call get() only once: each get() call issues a new request and returns
# a fresh stream, so calling it per line would re-read line 1 every time.
body = obj.get()['Body']

body._raw_stream.readline() # line 1
body._raw_stream.readline() # line 2
body._raw_stream.readline() # line 3...

Answered by oneschilling

If you want to read multiple files (line by line) with a specific bucket prefix (i.e., in a "subfolder"), you can do this:

import boto3

s3 = boto3.resource('s3', aws_access_key_id='<key_id>', aws_secret_access_key='<access_key>')

bucket = s3.Bucket('<bucket_name>')
for obj in bucket.objects.filter(Prefix='<your prefix>'):
    for line in obj.get()['Body'].read().splitlines():
        print(line.decode('utf-8'))

The lines here are bytes, so I am decoding them; but if they are already strings, you can skip that.
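
Note that read() pulls each whole object into memory first. If the files can be large, a sketch of the same loop using botocore's StreamingBody.iter_lines() (with the same placeholder bucket and prefix as above) would stream them instead:

# Stream each matching object line by line rather than loading it whole.
for obj in bucket.objects.filter(Prefix='<your prefix>'):
    for line in obj.get()['Body'].iter_lines():
        print(line.decode('utf-8'))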

Answered by KiteCoder

The most flexible, lowest-cost way to read the file is to read it one byte at a time until you have the number of lines you need.

line_count = 0
line_data_bytes = b''

while line_count < 2:
    incoming = correlate_file_obj['Body'].read(1)
    if not incoming:  # end of stream: stop rather than loop forever on short files
        break
    if incoming == b'\n':
        line_count = line_count + 1
    line_data_bytes = line_data_bytes + incoming

logger.debug("read bytes:")
logger.debug(line_data_bytes)

line_data = line_data_bytes.split(b'\n')

You won't need to guess the header size if it can change, you won't end up downloading the whole file, and you don't need third-party tools. Granted, you need to make sure the line delimiter in your file is correct and that you are reading the right number of bytes to find it.
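
If this comes up often, the loop can be wrapped in a small helper; a sketch, where body is the response's StreamingBody and the name read_first_lines is mine, not from the answer:

def read_first_lines(body, n_lines, delimiter=b'\n'):
    # Read one byte at a time until n_lines delimiters are seen,
    # or the stream ends.
    data = b''
    count = 0
    while count < n_lines:
        byte = body.read(1)
        if not byte:  # end of stream
            break
        if byte == delimiter:
            count += 1
        data += byte
    return data.split(delimiter)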

Answered by hansaplast

Using boto3:

import boto3

s3 = boto3.resource('s3')
obj = s3.Object(BUCKET, key)
for line in obj.get()['Body']._raw_stream:
    print(line)  # do something with each line

Answered by peon

I know it's a very old question.

But as of now, we can just use s3_conn.get_object(Bucket=bucket, Key=key)['Body'].iter_lines()
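
For the original header-line question, that might look like this sketch (assuming s3_conn is a boto3 S3 client and bucket/key are already defined):

import boto3

s3_conn = boto3.client('s3')
response = s3_conn.get_object(Bucket=bucket, Key=key)
header = next(response['Body'].iter_lines())  # first line only, as bytes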

Answered by Dean Gurvitz

Expanding on kooshywoosh's answer: using TextIOWrapper (which is very useful) directly on a StreamingBody from a plain binary file isn't possible, as you'll get the following error:

"builtins.AttributeError: 'StreamingBody' object has no attribute 'readable'"

However, you can use the following hack mentioned in this long-standing issue on botocore's GitHub page, and define a very simple wrapper class around StreamingBody:

from io import RawIOBase
...

class StreamingBodyIO(RawIOBase):
    """Wrap a boto StreamingBody in the IOBase API."""

    def __init__(self, body):
        self.body = body

    def readable(self):
        return True

    def read(self, n=-1):
        n = None if n < 0 else n
        return self.body.read(n)

Then, you can simply use the following code:

from io import TextIOWrapper
...

# get StreamingBody from botocore.response
response = s3.get_object(Bucket=bucket, Key=key)
data = TextIOWrapper(StreamingBodyIO(response['Body']))
for line in data:
    print(line)  # process each decoded line
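
And for the original question, a single readline() on the wrapper is then enough to get the header, e.g.:

header = data.readline()  # reads only a small buffer from S3, not the whole object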