Efficiently write a Pandas DataFrame to Google BigQuery

Question

Asked by Pablo

I'm trying to upload a pandas.DataFrame to Google BigQuery using the pandas.DataFrame.to_gbq() function documented here. The problem is that to_gbq() takes 2.3 minutes, while uploading directly through the Google Cloud Storage GUI takes less than a minute. I'm planning to upload a bunch of DataFrames (~32), each of a similar size, so I want to know what the fastest alternative is.

This is the script that I'm using:

dataframe.to_gbq('my_dataset.my_table',
                 'my_project_id',
                 chunksize=None,  # I've tried several chunk sizes; one big chunk runs fastest (at least for me)
                 if_exists='append',
                 verbose=False
                 )

dataframe.to_csv(str(month) + '_file.csv')  # the file size is 37.3 MB; this takes almost 2 seconds
# manually upload the file through the GCS GUI
print(dataframe.shape)
# (363364, 21)

My question is: which is faster?

1. Uploading the DataFrame using the pandas.DataFrame.to_gbq() function
2. Saving the DataFrame as a CSV file and then uploading it to BigQuery using the Python API
3. Saving the DataFrame as a CSV file, uploading it to Google Cloud Storage using this procedure, and then reading it from BigQuery
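
For reference, the third alternative could look roughly like the sketch below when written against the google-cloud-storage and google-cloud-bigquery client libraries. This is only an illustration under stated assumptions, not necessarily the procedure referenced in the list: it assumes both packages are installed and default credentials are configured, and the names gcs_uri, load_dataframe_via_gcs, bucket_name, blob_name, local_path and table_id are hypothetical.

```python
def gcs_uri(bucket_name, blob_name):
    """Build the gs:// URI that a BigQuery load job accepts as its source."""
    return "gs://{}/{}".format(bucket_name, blob_name)


def load_dataframe_via_gcs(dataframe, bucket_name, blob_name, local_path, table_id):
    """Stage `dataframe` as a CSV file in GCS, then append it to `table_id`.

    All arguments are placeholders supplied by the caller (hypothetical names).
    """
    # Imported lazily so that gcs_uri() works without the Google libraries.
    from google.cloud import bigquery, storage

    # Step 1: write the DataFrame to a local CSV file (header row, no index).
    dataframe.to_csv(local_path, index=False)

    # Step 2: upload the CSV file to the bucket.
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_path)

    # Step 3: ask BigQuery to load the staged file into the table.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the header row written by to_csv
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    job = bigquery.Client().load_table_from_uri(
        gcs_uri(bucket_name, blob_name), table_id, job_config=job_config)
    job.result()  # wait for the load job to finish
    return job.output_rows
```

A call might look like load_dataframe_via_gcs(dataframe, 'my-bucket', 'uploads/file.csv', 'file.csv', 'my_dataset.my_table'), where the bucket and object names are made up for the example.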

Update:

Alternative 2, using pd.DataFrame.to_csv() and load_data_from_file(), seems to take longer than alternative 1 (17.9 seconds more on average over 3 loops):

from google.cloud import bigquery


def load_data_from_file(dataset_id, table_id, source_file_name):
    bigquery_client = bigquery.Client()
    # Note: Client.dataset() is deprecated in newer google-cloud-bigquery releases
    dataset_ref = bigquery_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)

    with open(source_file_name, 'rb') as source_file:
        # This example uses CSV, but you can use other formats.
        # See https://cloud.google.com/bigquery/loading-data
        job_config = bigquery.LoadJobConfig()
        # Use the documented enum for CSV sources rather than the MIME type 'text/csv'
        job_config.source_format = bigquery.SourceFormat.CSV
        job_config.autodetect = True
        job = bigquery_client.load_table_from_file(
            source_file, table_ref, job_config=job_config)

    job.result()  # Waits for job to complete

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_id, table_id))
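
An average over several loops, like the 3-loop figure in the update, can be measured with a small helper along these lines. This is a minimal sketch using only the standard library; average_runtime is a hypothetical name, and the callable passed in stands for whichever upload is being timed.

```python
import time


def average_runtime(func, loops=3):
    """Call `func` `loops` times and return the mean wall-clock time in seconds."""
    durations = []
    for _ in range(loops):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)
```

To time a real upload, pass a zero-argument callable, for example average_runtime(lambda: load_data_from_file('my_dataset', 'my_table', 'file.csv')), where the argument values are placeholders.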

Thank you!

Answer 1

Accepted answer by enle lin

I compared alternatives 1 and 3 in Datalab using the following code:

from datalab.context import Context
import datalab.storage as storage
import datalab.bigquery as bq
import pandas as pd
import time

# DataFrame to write (tuples rather than sets, so each row keeps its column order)
my_data = [(1, 2, 3)]
for i in range(0, 100000):
    my_data.append((1, 2, 3))
not_so_simple_dataframe = pd.DataFrame(data=my_data, columns=['a', 'b', 'c'])

# Alternative 1
start = time.time()
not_so_simple_dataframe.to_gbq('TestDataSet.TestTable',
                 Context.default().project_id,
                 chunksize=10000,
                 if_exists='append',
                 verbose=False
                 )
end = time.time()
print("time alternative 1 " + str(end - start))

# Alternative 3
start = time.time()
sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)

# Define the BigQuery table
# (assumption: the snippet as posted uses `table` without defining it)
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_dataframe(not_so_simple_dataframe)
table.create(schema=table_schema, overwrite=True)

# Write the DataFrame to GCS (Google Cloud Storage); %storage is a Datalab notebook magic
%storage write --variable not_so_simple_dataframe --object $sample_bucket_object

# Write the DataFrame to a BigQuery table
table.insert_data(not_so_simple_dataframe)
end = time.time()
print("time alternative 3 " + str(end - start))

Here are the results for n = {10000, 100000, 1000000}:

n       alternative_1  alternative_3
10000   30.72s         8.14s
100000  162.43s        70.64s
1000000 1473.57s       688.59s

Judging from the results, alternative 3 is faster than alternative 1.
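
To put a number on "faster", the speedup of alternative 3 over alternative 1 can be computed directly from the timings in the table. The snippet below is just that arithmetic; the variable names are illustrative.

```python
# Timings in seconds, copied from the results table: n -> (alternative_1, alternative_3)
results = {
    10000: (30.72, 8.14),
    100000: (162.43, 70.64),
    1000000: (1473.57, 688.59),
}

# Speedup = time of alternative 1 divided by time of alternative 3
speedups = {n: alt1 / alt3 for n, (alt1, alt3) in results.items()}

for n, ratio in speedups.items():
    print("n={}: alternative 3 is {:.1f}x faster".format(n, ratio))
```

On these measurements the ratio is about 3.8x for n = 10000 and settles at roughly 2.1x to 2.3x for the larger sizes.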
