Python: Read csv from Google Cloud Storage into a pandas dataframe

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/49357352/


Read csv from Google Cloud storage to pandas dataframe

python | pandas | csv | google-cloud-platform | google-cloud-storage

Asked by user1838940

I am trying to read a csv file stored in a Google Cloud Storage bucket into a pandas dataframe.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from io import BytesIO

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('createbucket123')
blob = bucket.blob('my.csv')
path = "gs://createbucket123/my.csv"
df = pd.read_csv(path)

It shows this error message:

FileNotFoundError: File b'gs://createbucket123/my.csv' does not exist

What am I doing wrong? I am not able to find any solution that does not involve Google Datalab.

Answered by Lukasz Tracewski

UPDATE

As of pandas version 0.24, read_csv supports reading directly from Google Cloud Storage. Simply provide a link to the file in the bucket like this:

df = pd.read_csv('gs://bucket/your_path.csv')

I leave three other options for the sake of completeness.

  • Home-made code
  • gcsfs
  • dask

I will cover them below.

The hard way: do-it-yourself code

I have written some convenience functions to read from Google Cloud Storage. To make the code more readable I added type annotations. If you happen to be on Python 2, simply remove the annotations and the code will work all the same.

It works equally well on public and private data sets, assuming you are authorised. With this approach you don't need to download the data to your local drive first.

How to use it:

fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path')
df = pd.read_csv(fileobj)
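
If you need to authenticate with a service account key file, the optional last argument defined in the code below can be used; a hypothetical call (the paths are placeholders) might look like this:

fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path',
                           service_account_credentials_path='/path/to/credentials.json')  # placeholder key file
df = pd.read_csv(fileobj)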

The code:

from io import BytesIO, StringIO
from google.cloud import storage
from google.oauth2 import service_account

def get_byte_fileobj(project: str,
                     bucket: str,
                     path: str,
                     service_account_credentials_path: str = None) -> BytesIO:
    """
    Retrieve data from a given blob on Google Storage and pass it as a file object.
    :param path: path within the bucket
    :param project: name of the project
    :param bucket: name of the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: file object (BytesIO)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream

def get_bytestring(project: str,
                   bucket: str,
                   path: str,
                   service_account_credentials_path: str = None) -> bytes:
    """
    Retrieve data from a given blob on Google Storage and pass it as a byte-string.
    :param path: path within the bucket
    :param project: name of the project
    :param bucket: name of the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: byte-string (needs to be decoded)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    s = blob.download_as_string()
    return s


def _get_blob(bucket_name, path, project, service_account_credentials_path):
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob

gcsfs

gcsfs is a "Pythonic file-system for Google Cloud Storage".

How to use it:

import pandas as pd
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)
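
The same filesystem object works for writing as well; a minimal sketch (the bucket and path are placeholders) for saving a dataframe back to GCS:

# Write the dataframe back to the bucket (placeholder path)
with fs.open('bucket/output.csv', 'w') as f:
    df.to_csv(f, index=False)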

dask

Dask"provides advanced parallelism for analytics, enabling performance at scale for the tools you love". It's great when you need to deal with large volumes of data in Python. Dask tries to mimic much of the pandasAPI, making it easy to use for newcomers.

Dask“为分析提供高级并行性,为您喜爱的工具提供大规模性能”。当您需要在 Python 中处理大量数据时,这非常棒。Dask 试图模仿大部分pandasAPI,使其易于新手使用。

Dask has its own read_csv.

How to use it:

import dask.dataframe as dd

df = dd.read_csv('gs://bucket/data.csv')
df2 = dd.read_csv('gs://bucket/path/*.csv') # nice!

# df is now a Dask dataframe, ready for distributed processing
# If you want the pandas version, simply:
df_pd = df.compute()

Answered by Lak

Another option is to use TensorFlow, which comes with the ability to do a streaming read from Google Cloud Storage:

import pandas as pd
from tensorflow.python.lib.io import file_io

with file_io.FileIO('gs://bucket/file.csv', 'r') as f:
  df = pd.read_csv(f)

Using tensorflow also gives you a convenient way to handle wildcards in the filename. For example:

Reading wildcard CSV into Pandas

Here is code that will read all CSVs that match a specific pattern (e.g. gs://bucket/some/dir/train-*) into a pandas dataframe:

import tensorflow as tf
from tensorflow.python.lib.io import file_io
import pandas as pd

def read_csv_file(filename):
  with file_io.FileIO(filename, 'r') as f:
    df = pd.read_csv(f, header=None, names=['col1', 'col2'])
    return df

def read_csv_files(filename_pattern):
  filenames = tf.gfile.Glob(filename_pattern)
  dataframes = [read_csv_file(filename) for filename in filenames]
  return pd.concat(dataframes)

Usage:

import os

DATADIR = 'gs://my-bucket/some/dir'
traindf = read_csv_files(os.path.join(DATADIR, 'train-*'))
evaldf = read_csv_files(os.path.join(DATADIR, 'eval-*'))

Answered by bnaul

As of pandas==0.24.0 this is supported natively if you have gcsfs installed: https://github.com/pandas-dev/pandas/pull/22704.

Until the official release you can try it out with pip install pandas==0.24.0rc1.

Answered by Burhan Khalid

read_csv does not support gs://

From the documentation:

The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv

You can download the file or fetch it as a string in order to manipulate it.
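
For example, a minimal sketch using the google-cloud-storage client (with the bucket and object names taken from the question) might look like this:

from io import BytesIO
import pandas as pd
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('createbucket123')
blob = bucket.blob('my.csv')

# Fetch the object contents as bytes and hand them to pandas
content = blob.download_as_string()
df = pd.read_csv(BytesIO(content))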

Answered by Ahmad M.

There are three ways of accessing files in GCS:

  1. Downloading the client library (this one is for you)
  2. Using Cloud Storage Browser in the Google Cloud Platform Console
  3. Using gsutil, a command-line tool for working with files in Cloud Storage.

Using step 1, set up GCS for your work. After that you have to:

import cloudstorage as gcs
from google.appengine.api import app_identity

Then you have to specify the Cloud Storage bucket name and create read/write functions to access your bucket:
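
A minimal read sketch with the App Engine cloudstorage client might look like the following (the object name is a placeholder, and the App Engine default bucket is assumed):

import cloudstorage as gcs
from google.appengine.api import app_identity

# Assumes the App Engine default bucket; object paths start with /bucket-name/
bucket_name = app_identity.get_default_gcs_bucket_name()
filename = '/' + bucket_name + '/my.csv'  # placeholder object name

with gcs.open(filename) as gcs_file:
    contents = gcs_file.read()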

You can find the remaining read/write tutorial in the client library documentation.

Answered by shubham

If I understood your question correctly, then maybe this link can help you get a better URL for your read_csv() function:

https://cloud.google.com/storage/docs/access-public-data
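
If the object is publicly readable, its public HTTPS endpoint can also be passed straight to read_csv; a sketch with a hypothetical public object:

import pandas as pd

# Public objects are served at storage.googleapis.com/<bucket>/<object>
df = pd.read_csv('https://storage.googleapis.com/createbucket123/my.csv')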

Answered by Ashwin Kasilingam

One will still need gcsfs installed (import gcsfs) if loading compressed files.

I tried pd.read_csv('gs://your-bucket/path/data.csv.gz') with pandas version 0.25.3 and got the following error:

/opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    438     # See https://github.com/python/mypy/issues/1297
    439     fp_or_buf, _, compression, should_close = get_filepath_or_buffer(
--> 440         filepath_or_buffer, encoding, compression
    441     )
    442     kwds["compression"] = compression

/opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/common.py in get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode)
    211 
    212     if is_gcs_url(filepath_or_buffer):
--> 213         from pandas.io import gcs
    214 
    215         return gcs.get_filepath_or_buffer(

/opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/gcs.py in <module>
      3 
      4 gcsfs = import_optional_dependency(
----> 5     "gcsfs", extra="The gcsfs library is required to handle GCS files"
      6 )
      7 

/opt/conda/anaconda/lib/python3.6/site-packages/pandas/compat/_optional.py in import_optional_dependency(name, extra, raise_on_missing, on_version)
     91     except ImportError:
     92         if raise_on_missing:
---> 93             raise ImportError(message.format(name=name, extra=extra)) from None
     94         else:
     95             return None

ImportError: Missing optional dependency 'gcsfs'. The gcsfs library is required to handle GCS files Use pip or conda to install gcsfs.
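
Installing gcsfs resolves the missing-dependency error; after that, the original call should work. A sketch (the path is a placeholder):

# pip install gcsfs
import pandas as pd

# Compression is inferred from the .gz extension, or can be passed explicitly
df = pd.read_csv('gs://your-bucket/path/data.csv.gz', compression='gzip')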