Fast insertion of pandas DataFrame into Postgres DB using psycopg2
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/9826431/
Asked by Arthur G
I am trying to insert a pandas DataFrame into a PostgreSQL DB (9.1) in the most efficient way (using Python 2.7).
Using "cursor.executemany" is really slow, and so is "DataFrame.to_csv(buffer, ...)" together with "copy_from".
I found a much faster solution on the web (http://eatthedots.blogspot.de/2008/08/faking-read-support-for-psycopgs.html), which I adapted to work with pandas.
My code can be found below.
My question is whether the method from this related question (using "copy from stdin with binary") can easily be transferred to work with DataFrames, and whether it would be much faster:
Use binary COPY table FROM with psycopg2
Unfortunately my Python skills aren't sufficient to understand the implementation of that approach.
This is my approach:
import psycopg2
import connectDB  # this is simply a module that returns a connection to the db
from datetime import datetime

class ReadFaker:
    """
    This could be extended to include the index column optionally. Right now the index
    is not inserted.
    """
    def __init__(self, data):
        self.iter = data.itertuples()

    def readline(self, size=None):
        try:
            line = self.iter.next()[1:]  # element 0 is the index
            row = '\t'.join(x.encode('utf8') if isinstance(x, unicode) else str(x) for x in line) + '\n'
            # in my case all strings in line are unicode objects.
        except StopIteration:
            return ''
        else:
            return row

    read = readline

def insert(df, table, con=None, columns=None):
    time1 = datetime.now()
    close_con = False
    if not con:
        try:
            con = connectDB.getCon()  # dbLoader returns a connection with my settings
            close_con = True
        except psycopg2.Error, e:
            print e.pgerror
            print e.pgcode
            return "failed"

    inserted_rows = df.shape[0]
    data = ReadFaker(df)

    try:
        curs = con.cursor()
        print 'inserting %s entries into %s ...' % (inserted_rows, table)
        if columns is not None:
            curs.copy_from(data, table, null='nan', columns=[col for col in columns])
        else:
            curs.copy_from(data, table, null='nan')
        con.commit()
        curs.close()
        if close_con:
            con.close()
    except psycopg2.Error, e:
        print e.pgerror
        print e.pgcode
        con.rollback()
        if close_con:
            con.close()
        return "failed"

    time2 = datetime.now()
    print time2 - time1
    return inserted_rows
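For readers on Python 3, the `unicode`/`.next()` parts of the code above no longer apply. Below is a minimal, untested Python 3 sketch of the same idea; the class name is illustrative. psycopg2's copy_from only requires a file-like object exposing read()/readline(), so wrapping any row iterator works without materialising a CSV first:

```python
class RowStream:
    """File-like wrapper over an iterator of row tuples.

    psycopg2's copy_from() only calls read()/readline(), so any object
    providing those can feed COPY FROM STDIN row by row.
    """
    def __init__(self, rows):
        self._rows = iter(rows)

    def readline(self, size=-1):
        try:
            row = next(self._rows)
        except StopIteration:
            return ''  # empty string signals end-of-stream to copy_from
        # tab-separated, newline-terminated, matching COPY's default text format
        return '\t'.join(str(x) for x in row) + '\n'

    read = readline


# Usage with psycopg2 (table name and cursor are placeholders):
#   curs.copy_from(RowStream(df.itertuples(index=False)), 'my_table')
```

Returning one row per read() call is slower than returning larger chunks, but it keeps the sketch simple and matches the structure of the ReadFaker class above.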
Answered by foobarbecue
Answered by lbolla
I have not tested the performance, but maybe you can use something like this:
- Iterate through the rows of the DataFrame, yielding a string representing each row (see below)
- Convert this iterable into a stream, using for example Python: Convert an iterable to a stream?
- Finally use psycopg's copy_from on this stream.
To yield the rows of a DataFrame efficiently, use something like:
def r(df):
    for idx, row in df.iterrows():
        yield ','.join(map(str, row))
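The second step (converting the iterable into a stream) can be sketched as a tiny file-like adapter. This is an untested illustration; IterStream is a hypothetical helper, not a psycopg2 or pandas API:

```python
class IterStream:
    """Wrap an iterable of lines in a minimal readable stream.

    copy_from() calls read() repeatedly; returning '' signals end-of-data.
    """
    def __init__(self, lines):
        self._it = iter(lines)

    def read(self, size=-1):
        try:
            # one newline-terminated line per read() call
            return next(self._it) + '\n'
        except StopIteration:
            return ''

    readline = read


# Combined with the r(df) generator above, using commas as the separator
# (cursor and table name are placeholders):
#   curs.copy_from(IterStream(r(df)), 'my_table', sep=',')
```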

