How to write data to Redshift that is a result of a dataframe created in Python?
Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/38402995/
Asked by Sahil
I have a dataframe in Python. Can I write this data to Redshift as a new table? I have successfully created a db connection to Redshift and am able to execute simple sql queries. Now I need to write a dataframe to it.
Answered by Andrew
You can use to_sql to push data to a Redshift database. I've been able to do this using a connection to my database through a SQLAlchemy engine. Just be sure to set index = False in your to_sql call. The table will be created if it doesn't exist, and you can specify if you want your call to replace the table, append to the table, or fail if the table already exists.
from sqlalchemy import create_engine
import pandas as pd
conn = create_engine('postgresql://username:password@yoururl.com:5439/yourdatabase')
df = pd.DataFrame([{'A': 'foo', 'B': 'green', 'C': 11},{'A':'bar', 'B':'blue', 'C': 20}])
df.to_sql('your_table', conn, index=False, if_exists='replace')
Note that you may need to pip install psycopg2 in order to connect to Redshift through SQLAlchemy.
Answered by AidanGawronski
import pandas_redshift as pr

pr.connect_to_redshift(dbname = <dbname>,
                       host = <host>,
                       port = <port>,
                       user = <user>,
                       password = <password>)

pr.connect_to_s3(aws_access_key_id = <aws_access_key_id>,
                 aws_secret_access_key = <aws_secret_access_key>,
                 bucket = <bucket>,
                 subdirectory = <subdirectory>)

# Write the DataFrame to S3 and then to redshift
pr.pandas_to_redshift(data_frame = data_frame,
                      redshift_table_name = 'gawronski.nba_shots_log')
Answered by Gaurav
I tried using pandas df.to_sql() but it was tremendously slow. It was taking me well over 10 minutes to insert 50 rows. See this open issue (as of writing).
I tried using odo from the blaze ecosystem (as per the recommendations in the issue discussion), but faced a ProgrammingError which I didn't bother to investigate.
Finally what worked:
import psycopg2

# Fill in the blanks for the conn object
conn = psycopg2.connect(user = 'user',
                        password = 'password',
                        host = 'host',
                        dbname = 'db',
                        port = 666)
cursor = conn.cursor()

# Adjust ... according to number of columns
args_str = b','.join(cursor.mogrify("(%s,%s,...)", x) for x in tuple(map(tuple,np_data)))
cursor.execute("insert into table (a,b,...) VALUES "+args_str.decode("utf-8"))

cursor.close()
conn.commit()
conn.close()
Yep, plain old psycopg2. This is for a numpy array, but converting from a df to an ndarray shouldn't be too difficult. This gave me around 3k rows/minute.
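For example, a minimal sketch of that conversion (the column names here are hypothetical and should match the column list in the INSERT statement):

# Hypothetical columns a and b, in the same order as the INSERT column list
np_data = df[['a', 'b']].values      # 2-D numpy array
rows = tuple(map(tuple, np_data))    # the shape the mogrify loop above expects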
However, the fastest solution as per recommendations from other teammates is to use the COPY command after dumping the dataframe as a TSV/CSV to an S3 bucket and then copying it over. You should investigate this if you're copying really huge datasets. (I will update here if and when I try it out)
Answered by BigPanda
Assuming you have access to S3, this approach should work:
Step 1: Write the DataFrame as a csv to S3 (I use AWS SDK boto3 for this)
Step 2: You know the columns, datatypes, and key/index for your Redshift table from your DataFrame, so you should be able to generate a create table script and push it to Redshift to create an empty table
Step 3: Send a copy command from your Python environment to Redshift to copy data from S3 into the empty table created in step 2
Works like a charm every time.
Step 4: Before your cloud storage folks start yelling at you, delete the csv from S3
If you see yourself doing this several times, wrapping all four steps in a function keeps it tidy (a minimal sketch follows below).
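A minimal sketch of those four steps wrapped in one function, assuming boto3 and psycopg2 are installed; the table columns, bucket, key, IAM role, and connection parameters below are placeholders, not from the original answer:

import io
import boto3
import psycopg2

def df_to_redshift(df, table, bucket, key, iam_role, conn_params):
    # Step 1: write the DataFrame as a csv to S3
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)
    boto3.client('s3').put_object(Bucket=bucket, Key=key, Body=buf.getvalue())

    conn = psycopg2.connect(**conn_params)
    cursor = conn.cursor()
    # Step 2: create the empty table (columns are hard-coded here for illustration)
    cursor.execute("CREATE TABLE IF NOT EXISTS %s (a VARCHAR(16), b VARCHAR(16), c INTEGER);" % table)
    # Step 3: COPY from S3 into the empty table
    cursor.execute("COPY %s FROM 's3://%s/%s' iam_role '%s' csv;" % (table, bucket, key, iam_role))
    conn.commit()
    cursor.close()
    conn.close()

    # Step 4: delete the csv from S3
    boto3.client('s3').delete_object(Bucket=bucket, Key=key)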
Answered by erncyp
I used to rely on the pandas to_sql() function, but it is just too slow. I have recently switched to doing the following:
import pandas as pd
import s3fs        # great module which allows you to read/write to s3 easily
import sqlalchemy

df = pd.DataFrame([{'A': 'foo', 'B': 'green', 'C': 11},{'A':'bar', 'B':'blue', 'C': 20}])

s3 = s3fs.S3FileSystem(anon=False)
filename = 'my_s3_bucket_name/file.csv'
with s3.open(filename, 'w') as f:
    df.to_csv(f, index=False, header=False)

con = sqlalchemy.create_engine('postgresql://username:password@yoururl.com:5439/yourdatabase')

# make sure the schema for mytable exists
# if you need to delete the table but not the schema leave DELETE mytable
# if you want to only append, I think just removing the DELETE mytable would work

con.execute("""
    DELETE mytable;
    COPY mytable
    from 's3://%s'
    iam_role 'arn:aws:iam::xxxx:role/role_name'
    csv;""" % filename)
The role has to allow Redshift access to S3; see here for more details.
I found that for a 300KB file (12000x2 dataframe) this takes 4 seconds, compared to the 8 minutes I was getting with the pandas to_sql() function.
Answered by Michael Robellard
For the purpose of this conversation, Postgres = Redshift. You have two options:
Option 1:
From Pandas: http://pandas.pydata.org/pandas-docs/stable/io.html#io-sql
The pandas.io.sql module provides a collection of query wrappers to both facilitate data retrieval and to reduce dependency on DB-specific API. Database abstraction is provided by SQLAlchemy if installed. In addition you will need a driver library for your database. Examples of such drivers are psycopg2 for PostgreSQL or pymysql for MySQL.
Writing DataFrames
Assuming the following data is in a DataFrame data, we can insert it into the database using to_sql().
id  Date        Col_1  Col_2  Col_3
26  2012-10-18  X       25.7  True
42  2012-10-19  Y      -12.4  False
63  2012-10-20  Z       5.73  True
In [437]: data.to_sql('data', engine)
With some databases, writing large DataFrames can result in errors due to packet size limitations being exceeded. This can be avoided by setting the chunksize parameter when calling to_sql. For example, the following writes data to the database in batches of 1000 rows at a time:
In [438]: data.to_sql('data_chunked', engine, chunksize=1000)
Option 2
Or you can simply roll your own. If you have a dataframe called data, simply loop over it using iterrows:
for row in data.iterrows():
then add each row to your database. I would use copy instead of insert for each row, as it will be much faster.
http://initd.org/psycopg/docs/usage.html#using-copy-to-and-copy-from
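A minimal sketch of that copy approach, following the linked psycopg2 docs and this answer's Postgres = Redshift assumption; the table, columns, and connection parameters are placeholders (note that Redshift itself only accepts COPY from S3 and similar sources, as the earlier answers show, so this sketch applies to the plain-Postgres case the answer equates it with):

import io
import psycopg2

# Fill in your own connection parameters
conn = psycopg2.connect(user='user', password='password',
                        host='host', dbname='db', port=5439)
cursor = conn.cursor()

# Stream the dataframe out as CSV and load it with a single COPY ... FROM STDIN
# (no quoting is handled here, so values must not contain the separator)
buf = io.StringIO()
data.to_csv(buf, index=False, header=False)
buf.seek(0)
cursor.copy_from(buf, 'your_table', sep=',', columns=('a', 'b', 'c'))
conn.commit()
cursor.close()
conn.close()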