Efficiently write a Pandas dataframe to Google BigQuery
Notice: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original address and author information, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/48886761/
Asked by Pablo
I'm trying to upload a pandas.DataFrame to Google BigQuery using the pandas.DataFrame.to_gbq() function documented here. The problem is that to_gbq() takes 2.3 minutes, while uploading directly through the Google Cloud Storage GUI takes less than a minute. I'm planning to upload a bunch of dataframes (~32), each of a similar size, so I want to know what the faster alternative is.
This is the script that I'm using:
dataframe.to_gbq('my_dataset.my_table',
                 'my_project_id',
                 chunksize=None,  # I've tried several chunk sizes; it runs faster as one big chunk (at least for me)
                 if_exists='append',
                 verbose=False
                 )
dataframe.to_csv(str(month) + '_file.csv')  # the file size is 37.3 MB; this takes almost 2 seconds
# manually upload the file into the GCS GUI
print(dataframe.shape)
(363364, 21)
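As a back-of-the-envelope check on the figures quoted above (a 37.3 MB CSV, 2.3 minutes per to_gbq() call, about 32 dataframes), the following sketch estimates the effective throughput and the total upload time. It assumes that every dataframe takes the same time, that uploads run one after another, and that the CSV size is a fair proxy for the amount of data sent; the variable names are illustrative only.

```python
# Figures quoted in the question.
file_size_mb = 37.3        # size of one dataframe saved as CSV
upload_minutes = 2.3       # observed duration of one to_gbq() call
n_dataframes = 32          # approximate number of dataframes to upload

# Effective throughput of a single to_gbq() call, in MB per second.
throughput_mb_per_s = file_size_mb / (upload_minutes * 60)

# Total time if every upload takes the same time and they run sequentially.
total_minutes = upload_minutes * n_dataframes

print("throughput: {:.2f} MB/s".format(throughput_mb_per_s))
print("total: {:.1f} minutes".format(total_minutes))
```

Under these assumptions the whole batch would take more than an hour with to_gbq(), which is why the choice of upload path matters here.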
My question is: which is faster?
1. Uploading the Dataframe using the pandas.DataFrame.to_gbq() function
2. Saving the Dataframe as CSV and then uploading it as a file to BigQuery using the Python API
3. Saving the Dataframe as CSV, uploading the file to Google Cloud Storage using this procedure, and then reading it from BigQuery
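For reference, the third option (CSV to Google Cloud Storage, then a BigQuery load job) could look roughly like the sketch below. This is not code from the question: the function name load_csv_via_gcs, the bucket, object, and file names are all hypothetical, and the client objects are passed in as arguments so the flow can be tried with stand-ins. With the real libraries they would be google.cloud.storage.Client() and google.cloud.bigquery.Client().

```python
def load_csv_via_gcs(storage_client, bigquery_client, job_config,
                     local_path, bucket_name, blob_name, table_id):
    """Upload a local CSV to GCS, then load it into BigQuery from its gs:// URI."""
    # Step 1: copy the local CSV file into the bucket.
    blob = storage_client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_path)

    # Step 2: start a BigQuery load job that reads the object from GCS.
    uri = "gs://{}/{}".format(bucket_name, blob_name)
    job = bigquery_client.load_table_from_uri(uri, table_id, job_config=job_config)
    job.result()  # wait for the load job to finish
    return job.output_rows


def run_with_real_clients():
    """Example wiring; needs google-cloud-storage, google-cloud-bigquery and credentials."""
    from google.cloud import bigquery, storage

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # skip the header row written by to_csv()
        autodetect=True,       # let BigQuery infer the schema
    )
    return load_csv_via_gcs(
        storage.Client(), bigquery.Client(), job_config,
        local_path="1_file.csv",                       # hypothetical file name
        bucket_name="my-bucket",                       # hypothetical bucket
        blob_name="uploads/1_file.csv",                # hypothetical object name
        table_id="my_project_id.my_dataset.my_table",  # placeholder IDs from the question
    )
```

If the CSV is produced with to_csv(), writing it with index=False avoids an unnamed index column that schema autodetection would otherwise have to deal with.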
Update:
Alternative 2, using pd.DataFrame.to_csv() and load_data_from_file(), seems to take longer than alternative 1 (17.9 seconds more on average over 3 loops):
from google.cloud import bigquery

def load_data_from_file(dataset_id, table_id, source_file_name):
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)
    with open(source_file_name, 'rb') as source_file:
        # This example uses CSV, but you can use other formats.
        # See https://cloud.google.com/bigquery/loading-data
        job_config = bigquery.LoadJobConfig()
        # SourceFormat.CSV is the documented constant for CSV input (the original used 'text/csv')
        job_config.source_format = bigquery.SourceFormat.CSV
        job_config.autodetect = True
        job = bigquery_client.load_table_from_file(
            source_file, table_ref, job_config=job_config)
    job.result()  # Waits for job to complete
    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_id, table_id))
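A comparison like the 3-loop average mentioned above can be taken with a small timing harness. The sketch below is only an illustration: time_loops and mean_difference are hypothetical helper names, and the two fake_alternative functions are stand-ins that sleep instead of uploading, so that the example runs without BigQuery access.

```python
import statistics
import time


def time_loops(func, loops=3):
    """Call func() `loops` times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(loops):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def mean_difference(durations_a, durations_b):
    """Mean of durations_a minus mean of durations_b, in seconds."""
    return statistics.mean(durations_a) - statistics.mean(durations_b)


# Stand-in workloads; replace these with the real upload calls.
def fake_alternative_1():
    time.sleep(0.01)


def fake_alternative_2():
    time.sleep(0.02)


alt1 = time_loops(fake_alternative_1)
alt2 = time_loops(fake_alternative_2)
print("difference in mean duration: {:.3f} s".format(mean_difference(alt2, alt1)))
```

With only three loops the mean is sensitive to network noise, so the individual durations are returned as well and can be inspected before drawing conclusions.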
Thank you!
Accepted answer by enle lin
I compared alternatives 1 and 3 in Datalab using the following code:
from datalab.context import Context
import datalab.storage as storage
import datalab.bigquery as bq
import pandas as pd
from pandas import DataFrame
import time
# Dataframe to write
# Rows are lists rather than sets, so the column order is guaranteed
my_data = [[1, 2, 3]]
for i in range(0, 100000):
    my_data.append([1, 2, 3])
not_so_simple_dataframe = pd.DataFrame(data=my_data, columns=['a', 'b', 'c'])
# Alternative 1
start = time.time()
not_so_simple_dataframe.to_gbq('TestDataSet.TestTable',
                               Context.default().project_id,
                               chunksize=10000,
                               if_exists='append',
                               verbose=False
                               )
end = time.time()
print("time alternative 1 " + str(end - start))
# Alternative 3
start = time.time()
sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'
# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)
# Define the BigQuery table. NOTE: `table` was never defined in the original snippet;
# this line is an assumption based on the otherwise unused names above.
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)
# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_dataframe(not_so_simple_dataframe)
table.create(schema=table_schema, overwrite=True)  # assumed: this call was also missing from the original
# Write the DataFrame to GCS (Google Cloud Storage); %storage is a Datalab notebook magic
%storage write --variable not_so_simple_dataframe --object $sample_bucket_object
# Write the DataFrame to a BigQuery table
table.insert_data(not_so_simple_dataframe)
end = time.time()
print("time alternative 3 " + str(end - start))
And here are the results for n = {10000, 100000, 1000000}:
n         alternative_1   alternative_3
10000     30.72s          8.14s
100000    162.43s         70.64s
1000000   1473.57s        688.59s
Judging from the results, alternative 3 is faster than alternative 1 at every size tested, by a factor of roughly 2.1 to 3.8.
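To make the comparison concrete, the speedup for each n can be computed directly from the timings in the results table. This is a small worked calculation, not part of the original answer; the dictionary simply restates the table.

```python
# Timings in seconds, copied from the results table:
# n -> (alternative_1, alternative_3)
results = {
    10000: (30.72, 8.14),
    100000: (162.43, 70.64),
    1000000: (1473.57, 688.59),
}

# Speedup = time of alternative 1 divided by time of alternative 3.
speedups = {n: alt1 / alt3 for n, (alt1, alt3) in results.items()}

for n, speedup in speedups.items():
    print("n = {:>7}: alternative 3 is {:.2f}x faster".format(n, speedup))
```

The advantage is largest for the smallest dataframe and settles at a little over 2x for the larger ones.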