Python: Write a Pandas DataFrame to Google Cloud Storage or BigQuery

Note: This content is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/36314797/

Write a Pandas DataFrame to Google Cloud Storage or BigQuery

python, google-cloud-storage, google-cloud-platform, google-cloud-datalab

Asked by EcoWarrior

Hello and thanks for your time and consideration. I am developing a Jupyter Notebook on Google Cloud Platform / Datalab. I have created a Pandas DataFrame and would like to write this DataFrame to both Google Cloud Storage (GCS) and/or BigQuery. I have a bucket in GCS and have, via the following code, created the following objects:

import gcp
import gcp.storage as storage
project = gcp.Context.default().project_id    
bucket_name = 'steve-temp'           
bucket_path  = bucket_name   
bucket = storage.Bucket(bucket_path)
bucket.exists()  

I have tried various approaches based on the Google Datalab documentation but continue to fail. Thanks.

Accepted answer by Anthonios Partheniou

Try the following working example:

from datalab.context import Context
import google.datalab.storage as storage
import google.datalab.bigquery as bq
import pandas as pd

# DataFrame to write (rows of data with columns a, b, c)
simple_dataframe = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)

# Create storage bucket if it does not exist
if not sample_bucket.exists():
    sample_bucket.create()

# Define BigQuery dataset and table
dataset = bq.Dataset(bigquery_dataset_name)
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)

# Create BigQuery dataset
if not dataset.exists():
    dataset.create()

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(simple_dataframe)
table.create(schema = table_schema, overwrite = True)

# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable simple_dataframe --object $sample_bucket_object

# Write the DataFrame to a BigQuery table
table.insert(simple_dataframe)

I used this example, and the _table.py file from the datalab GitHub site, as a reference. You can find other datalab source code files at this link.

Answered by Jan Krynauw

Using the Google Cloud Datalab documentation

import datalab.storage as gcs

# Write the DataFrame as CSV text straight to gs://bucket-name/to/data.csv
gcs.Bucket('bucket-name').item('to/data.csv').write_to(simple_dataframe.to_csv(), 'text/csv')
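
Reading the object back into pandas is roughly the mirror image; a sketch, assuming the Item API's read_from counterpart to write_to:

import pandas as pd
from io import BytesIO
import datalab.storage as gcs

# read_from() is assumed to return the object's raw bytes, which pandas can parse from a buffer
raw = gcs.Bucket('bucket-name').item('to/data.csv').read_from()
df = pd.read_csv(BytesIO(raw), index_col=0)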

Answered by Ekaba Bisong

Writing a Pandas DataFrame to BigQuery

Update on @Anthonios Partheniou's answer.
The code is a bit different now, as of Nov. 29 2017.

To define a BigQuery dataset

Pass a tuple containing project_id and dataset_id to bq.Dataset.

# define a BigQuery dataset    
bigquery_dataset_name = ('project_id', 'dataset_id')
dataset = bq.Dataset(name = bigquery_dataset_name)

To define a BigQuery table

Pass a tuple containing project_id, dataset_id, and the table name to bq.Table.

# define a BigQuery table    
bigquery_table_name = ('project_id', 'dataset_id', 'table_name')
table = bq.Table(bigquery_table_name)

Create the dataset/table and write to the table in BigQuery

# Create BigQuery dataset
if not dataset.exists():
    dataset.create()

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(dataFrame_name)
table.create(schema = table_schema, overwrite = True)

# Write the DataFrame to a BigQuery table
table.insert(dataFrame_name)

Answered by Theo

Uploading to Google Cloud Storage without writing a temporary file, using only the standard GCS module:

from google.cloud import storage
import os
import pandas as pd

# Only need this if you're running this code locally.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'/your_GCP_creds/credentials.json'

df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

client = storage.Client()
bucket = client.get_bucket('my-bucket-name')

bucket.blob('upload_test/test.csv').upload_from_string(df.to_csv(), 'text/csv')
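
As a quick sanity check, the object can be read straight back into pandas with the same client; a sketch, assuming a recent google-cloud-storage version that provides download_as_text:

from io import StringIO

# Download the CSV text and parse it; index_col=0 restores the index written by to_csv()
blob = bucket.blob('upload_test/test.csv')
df_roundtrip = pd.read_csv(StringIO(blob.download_as_text()), index_col=0)
print(df_roundtrip)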

Answered by Porada Kev

I have a slightly simpler solution for this task using Dask. You can convert your DataFrame to a Dask DataFrame, which can be written to CSV on Cloud Storage:

import dask.dataframe as dd
import pandas as pd

df = ...  # your Pandas DataFrame

# Convert to a Dask DataFrame (one partition is fine for small data)
ddf = dd.from_pandas(df, npartitions=1, sort=True)

# Write the partitions to Cloud Storage as CSV; `gcs` is assumed to be an
# authenticated gcsfs filesystem whose credentials are passed as the token
ddf.to_csv('gs://YOUR_BUCKET/ddf-*.csv', index=False, sep=',', header=False,
           storage_options={'token': gcs.session.credentials})
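
To read the partitioned files back, the same glob pattern works; again a sketch, assuming gcsfs is available for the gs:// protocol:

import dask.dataframe as dd

# Pick up every partition that was written; header=None matches header=False above
ddf_back = dd.read_csv('gs://YOUR_BUCKET/ddf-*.csv', header=None)
df_back = ddf_back.compute()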

Answered by intotecho

Since 2017, Pandas has had a DataFrame-to-BigQuery function, pandas.DataFrame.to_gbq.

The documentation has an example:

import pandas_gbq as gbq

gbq.to_gbq(df, 'my_dataset.my_table', projectid, if_exists='fail')

The if_exists parameter can be set to 'fail', 'replace' or 'append'.
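
The same call is also available as a method on the DataFrame itself; a minimal sketch, assuming pandas-gbq is installed and using placeholder project/table names:

import pandas as pd

df = pd.DataFrame({'a': [1, 4], 'b': [2, 5], 'c': [3, 6]})

# Append to the table if it already exists ('my-project' is a placeholder project ID)
df.to_gbq('my_dataset.my_table', project_id='my-project', if_exists='append')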

See also this example.

Answered by Vova Pytsyuk

I spent a lot of time finding the easiest way to solve this:

import pandas as pd

df = pd.DataFrame(...)

# pandas resolves gs:// paths through the gcsfs package, so it must be installed
df.to_csv('gs://bucket/path')
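
Reading it back works the same way, again through gcsfs (the bucket path is the placeholder from above):

import pandas as pd

# read_csv accepts gs:// URLs directly when gcsfs is installed
df = pd.read_csv('gs://bucket/path')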

Answered by dartdog

I think you need to load it into a plain bytes variable and use %%storage write --variable $sample_bucketpath (see the doc) in a separate cell... I'm still figuring it out... But that is roughly the inverse of what I needed to do to read a CSV file in. I don't know if it makes a difference on write, but I had to use BytesIO to read the buffer created by the %%storage read command... Hope it helps, let me know!
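
A rough sketch of that idea, reusing simple_dataframe and sample_bucket_object from the accepted answer above; the Datalab magics would each go in their own cell:

import pandas as pd           # used in the read-back step
from io import BytesIO        # used in the read-back step

# Serialize the DataFrame to a plain text variable first
csv_data = simple_dataframe.to_csv()

# Then, in a separate Datalab cell, hand that variable to the storage magic:
#   %%storage write --variable csv_data --object $sample_bucket_object

# Reading is roughly the inverse: pull the object into a variable, then wrap it in BytesIO:
#   %%storage read --object $sample_bucket_object --variable csv_bytes
#   df_roundtrip = pd.read_csv(BytesIO(csv_bytes))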