Python: Write a Pandas DataFrame to Google Cloud Storage or BigQuery
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me) on StackOverflow.
Original question: http://stackoverflow.com/questions/36314797/
Write a Pandas DataFrame to Google Cloud Storage or BigQuery
Asked by EcoWarrior
Hello and thanks for your time and consideration. I am developing a Jupyter Notebook in the Google Cloud Platform / Datalab. I have created a Pandas DataFrame and would like to write this DataFrame to both Google Cloud Storage (GCS) and/or BigQuery. I have a bucket in GCS and have, via the following code, created the following objects:
import gcp
import gcp.storage as storage

# Default project and the target bucket (already created in GCS)
project = gcp.Context.default().project_id
bucket_name = 'steve-temp'
bucket_path = bucket_name
bucket = storage.Bucket(bucket_path)

# Confirms the bucket is reachable
bucket.exists()
I have tried various approaches based on Google Datalab documentation but continue to fail. Thanks
Accepted answer by Anthonios Partheniou
Try the following working example:
from datalab.context import Context
import google.datalab.storage as storage
import google.datalab.bigquery as bq
import pandas as pd

# DataFrame to write (rows are lists, not sets, so column order is well defined)
simple_dataframe = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)

# Create storage bucket if it does not exist
if not sample_bucket.exists():
    sample_bucket.create()

# Define BigQuery dataset and table
dataset = bq.Dataset(bigquery_dataset_name)
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)

# Create BigQuery dataset
if not dataset.exists():
    dataset.create()

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(simple_dataframe)
table.create(schema=table_schema, overwrite=True)

# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable simple_dataframe --object $sample_bucket_object

# Write the DataFrame to a BigQuery table
table.insert(simple_dataframe)
I used this example, and the _table.py file from the datalab GitHub site as a reference. You can find other datalab source code files at this link.
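As a quick sanity check, the inserted rows can usually be read straight back into a DataFrame. A minimal sketch, assuming the table object from the example above and that the datalab Table class exposes to_dataframe():

# Read the rows back into a DataFrame to confirm the insert worked.
# Note: streaming inserts can take a few seconds to become visible.
result_df = table.to_dataframe()
result_df.head()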
Answered by Jan Krynauw
Using the Google Cloud Datalab documentation
import datalab.storage as gcs

# Serialize the DataFrame to CSV in memory and write it to gs://bucket-name/to/data.csv
gcs.Bucket('bucket-name').item('to/data.csv').write_to(simple_dataframe.to_csv(), 'text/csv')
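The same write_to call accepts any string payload, so a JSON dump works just as well. A small variation on the one-liner above, assuming the simple_dataframe from the accepted answer:

# Write the same DataFrame as JSON instead of CSV
gcs.Bucket('bucket-name').item('to/data.json').write_to(simple_dataframe.to_json(), 'application/json')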
Answered by Ekaba Bisong
Writing a Pandas DataFrame to BigQuery
Update on @Anthonios Partheniou's answer. The code is a bit different now, as of Nov. 29, 2017.
To define a BigQuery dataset
Pass a tuple containing project_id and dataset_id to bq.Dataset.
# define a BigQuery dataset
bigquery_dataset_name = ('project_id', 'dataset_id')
dataset = bq.Dataset(name = bigquery_dataset_name)
To define a BigQuery table
Pass a tuple containing project_id, dataset_id and the table name to bq.Table.
# define a BigQuery table
bigquery_table_name = ('project_id', 'dataset_id', 'table_name')
table = bq.Table(bigquery_table_name)
Create the dataset/table and write to the table in BQ
# Create BigQuery dataset
if not dataset.exists():
    dataset.create()

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(dataFrame_name)
table.create(schema=table_schema, overwrite=True)

# Write the DataFrame to a BigQuery table
table.insert(dataFrame_name)
Answered by Theo
Uploading to Google Cloud Storage without writing a temporary file and only using the standard GCS module
from google.cloud import storage
import os
import pandas as pd

# Only need this if you're running this code locally.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'/your_GCP_creds/credentials.json'

# Rows are lists (not sets) so column order is preserved
df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

client = storage.Client()
bucket = client.get_bucket('my-bucket-name')

# Serialize to CSV in memory and upload it as a blob -- no temporary file needed
bucket.blob('upload_test/test.csv').upload_from_string(df.to_csv(), 'text/csv')
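For the reverse direction, a minimal sketch that pulls the uploaded CSV back into a DataFrame without a temporary file, assuming the client and bucket objects from the snippet above:

from io import BytesIO

# Download the blob into memory and parse it; index_col=0 restores the
# index column that df.to_csv() wrote out.
content = bucket.blob('upload_test/test.csv').download_as_string()
df_roundtrip = pd.read_csv(BytesIO(content), index_col=0)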
Answered by Porada Kev
I have a slightly simpler solution for the task using Dask. You can convert your DataFrame to a Dask DataFrame, which can be written as CSV to Cloud Storage:
import dask.dataframe as dd
import pandas

df  # your Pandas DataFrame

# Convert to a single-partition Dask DataFrame and write it straight to GCS.
# 'gcs' is assumed to be an authenticated gcsfs filesystem object (not defined in this snippet).
ddf = dd.from_pandas(df, npartitions=1, sort=True)
ddf.to_csv('gs://YOUR_BUCKET/ddf-*.csv', index=False, sep=',', header=False,
           storage_options={'token': gcs.session.credentials})
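If no explicit credentials object is at hand, gcsfs (which Dask uses for gs:// paths) can typically resolve application-default credentials on its own; the token value below is an assumption, not part of the original answer:

# Alternative: let gcsfs look up Google application-default credentials itself
ddf.to_csv('gs://YOUR_BUCKET/ddf-*.csv', index=False, sep=',', header=False,
           storage_options={'token': 'google_default'})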
Answered by intotecho
Since 2017, Pandas has had a DataFrame-to-BigQuery function, pandas.DataFrame.to_gbq.
The documentation has an example:
import pandas_gbq as gbq
gbq.to_gbq(df, 'my_dataset.my_table', projectid, if_exists='fail')
The if_exists parameter can be set to 'fail', 'replace' or 'append'.
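For illustration, the same upload via the DataFrame method form, this time appending to an existing table; the dataset, table and project names are placeholders from the example above:

# Equivalent method form; 'append' adds rows instead of failing when
# my_dataset.my_table already exists.
df.to_gbq('my_dataset.my_table', project_id=projectid, if_exists='append')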
See also this example.
Answered by Vova Pytsyuk
I spent a lot of time finding the easiest way to solve this:
import pandas as pd
df = pd.DataFrame(...)
df.to_csv('gs://bucket/path')
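A note on this approach: pandas hands gs:// URLs to the gcsfs package, so it must be installed and authenticated. A slightly more explicit variant with a concrete object name and no index column (the bucket and path are placeholders):

import pandas as pd

# Assumes gcsfs is installed and application-default credentials are set up
df = pd.DataFrame({'a': [1, 4], 'b': [2, 5], 'c': [3, 6]})
df.to_csv('gs://bucket/path/data.csv', index=False)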
Answered by dartdog
I think you need to load it into a plain bytes variable and use %%storage write --variable $sample_bucketpath (see the doc) in a separate cell... I'm still figuring it out... but that is roughly the inverse of what I needed to do to read a CSV file in. I don't know if it makes a difference on write, but I had to use BytesIO to read the buffer created by the %%storage read command... Hope it helps, let me know!
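A rough sketch of the read direction described here, assuming the legacy Datalab %%storage read magic and a placeholder object path; the exact magic syntax is an assumption on my part:

# In one notebook cell: read the object's bytes into a variable
%%storage read --object gs://bucket/path/data.csv --variable csv_bytes

# In the next cell: parse the bytes back into a DataFrame
from io import BytesIO
import pandas as pd
df = pd.read_csv(BytesIO(csv_bytes))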