Read csv from Google Cloud storage to pandas dataframe
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/49357352/
Asked by user1838940
I am trying to read a csv file from a Google Cloud Storage bucket into a pandas dataframe.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from io import BytesIO
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.get_bucket('createbucket123')
blob = bucket.blob('my.csv')
path = "gs://createbucket123/my.csv"
df = pd.read_csv(path)
It shows this error message:
FileNotFoundError: File b'gs://createbucket123/my.csv' does not exist
What am I doing wrong? I am not able to find any solution that does not involve Google Datalab.
Answered by Lukasz Tracewski
UPDATE
As of version 0.24 of pandas, read_csv supports reading directly from Google Cloud Storage. Simply provide the link to the bucket like this:
df = pd.read_csv('gs://bucket/your_path.csv')
I leave three other options for the sake of completeness.
- Home-made code
- gcsfs
- dask
I will cover them below.
The hard way: do-it-yourself code
I have written some convenience functions to read from Google Storage. To make them more readable I added type annotations. If you happen to be on Python 2, simply remove the annotations and the code will work all the same.
It works equally on public and private data sets, assuming you are authorised. In this approach you don't need to first download the data to your local drive.
How to use it:
fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path')
df = pd.read_csv(fileobj)
The code:
from io import BytesIO, StringIO
from google.cloud import storage
from google.oauth2 import service_account
def get_byte_fileobj(project: str,
                     bucket: str,
                     path: str,
                     service_account_credentials_path: str = None) -> BytesIO:
    """
    Retrieve data from a given blob on Google Storage and pass it as a file object.
    :param project: name of the project
    :param bucket: name of the bucket
    :param path: path within the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: file object (BytesIO)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream

def get_bytestring(project: str,
                   bucket: str,
                   path: str,
                   service_account_credentials_path: str = None) -> bytes:
    """
    Retrieve data from a given blob on Google Storage and pass it as a byte-string.
    :param project: name of the project
    :param bucket: name of the bucket
    :param path: path within the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: byte-string (needs to be decoded)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    return blob.download_as_string()

def _get_blob(bucket_name, path, project, service_account_credentials_path):
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    return bucket.blob(path)
gcsfs
gcsfs is a "Pythonic file-system for Google Cloud Storage".
How to use it:
import pandas as pd
import gcsfs
fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('bucket/path.csv') as f:
df = pd.read_csv(f)
dask
Dask "provides advanced parallelism for analytics, enabling performance at scale for the tools you love". It's great when you need to deal with large volumes of data in Python. Dask tries to mimic much of the pandas API, making it easy to use for newcomers, and it provides its own read_csv.
How to use it:
import dask.dataframe as dd
df = dd.read_csv('gs://bucket/data.csv')
df2 = dd.read_csv('gs://bucket/path/*.csv') # nice!
# df is now a Dask dataframe, ready for distributed processing
# If you want the pandas version, simply:
df_pd = df.compute()
Answered by Lak
Another option is to use TensorFlow which comes with the ability to do a streaming read from Google Cloud Storage:
import pandas as pd
from tensorflow.python.lib.io import file_io

with file_io.FileIO('gs://bucket/file.csv', 'r') as f:
    df = pd.read_csv(f)
Using tensorflow also gives you a convenient way to handle wildcards in the filename. For example:
Reading wildcard CSV into Pandas
Here is code that will read all CSVs matching a specific pattern (e.g. gs://bucket/some/dir/train-*) into a pandas dataframe:
import os

import pandas as pd
import tensorflow as tf
from tensorflow.python.lib.io import file_io

def read_csv_file(filename):
    with file_io.FileIO(filename, 'r') as f:
        df = pd.read_csv(f, header=None, names=['col1', 'col2'])
    return df

def read_csv_files(filename_pattern):
    filenames = tf.gfile.Glob(filename_pattern)
    dataframes = [read_csv_file(filename) for filename in filenames]
    return pd.concat(dataframes)
Usage:
DATADIR='gs://my-bucket/some/dir'
traindf = read_csv_files(os.path.join(DATADIR, 'train-*'))
evaldf = read_csv_files(os.path.join(DATADIR, 'eval-*'))
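The same glob-and-concat pattern can be sketched with plain pandas against local files, which runs without TensorFlow or GCS access (the temporary shard names below are made up for illustration):

```python
import glob
import os
import tempfile

import pandas as pd

# Create two local CSV shards matching a train-* pattern.
tmpdir = tempfile.mkdtemp()
for i in range(2):
    pd.DataFrame({"col1": [i], "col2": [i * 10]}).to_csv(
        os.path.join(tmpdir, f"train-{i:04d}.csv"), index=False
    )

# Glob the pattern and concatenate, mirroring read_csv_files above.
filenames = sorted(glob.glob(os.path.join(tmpdir, "train-*")))
df = pd.concat((pd.read_csv(f) for f in filenames), ignore_index=True)
```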
Answered by bnaul
As of pandas==0.24.0 this is supported natively if you have gcsfs installed: https://github.com/pandas-dev/pandas/pull/22704.
Until the official release you can try it out with pip install pandas==0.24.0rc1.
Answered by Burhan Khalid
read_csv does not support gs://.
From the documentation:
The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv
You can download the file or fetch it as a string in order to manipulate it.
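A sketch of the "fetch it as a string" route: on a real blob you would call blob.download_as_string(), which is stubbed here with literal bytes so the parsing step runs on its own:

```python
from io import BytesIO

import pandas as pd

# In real code: data = bucket.blob('my.csv').download_as_string()
data = b"name,score\nalice,1\nbob,2\n"

# Wrap the bytes in a buffer so pandas can read them like a file.
df = pd.read_csv(BytesIO(data))
```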
Answered by Ahmad M.
There are three ways of accessing files in GCS:
- Downloading the client library (this one for you)
- Using Cloud Storage Browser in the Google Cloud Platform Console
- Using gsutil, a command-line tool for working with files in Cloud Storage.
Using step 1, set up GCS for your work. After which you have to:
import cloudstorage as gcs
from google.appengine.api import app_identity
Then you have to specify the Cloud Storage bucket name and create read/write functions to access your bucket:
You can find the remaining read/write tutorial here:
Answered by shubham
If I understood your question correctly, then maybe this link can help you get a better URL for your read_csv() function:
Answered by Ashwin Kasilingam
One will still need to import gcsfs if loading compressed files.
Tried pd.read_csv('gs://your-bucket/path/data.csv.gz') with pd.__version__ => 0.25.3 and got the following error:
/opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
438 # See https://github.com/python/mypy/issues/1297
439 fp_or_buf, _, compression, should_close = get_filepath_or_buffer(
--> 440 filepath_or_buffer, encoding, compression
441 )
442 kwds["compression"] = compression
/opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/common.py in get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode)
211
212 if is_gcs_url(filepath_or_buffer):
--> 213 from pandas.io import gcs
214
215 return gcs.get_filepath_or_buffer(
/opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/gcs.py in <module>
3
4 gcsfs = import_optional_dependency(
----> 5 "gcsfs", extra="The gcsfs library is required to handle GCS files"
6 )
7
/opt/conda/anaconda/lib/python3.6/site-packages/pandas/compat/_optional.py in import_optional_dependency(name, extra, raise_on_missing, on_version)
91 except ImportError:
92 if raise_on_missing:
---> 93 raise ImportError(message.format(name=name, extra=extra)) from None
94 else:
95 return None
ImportError: Missing optional dependency 'gcsfs'. The gcsfs library is required to handle GCS files Use pip or conda to install gcsfs.