Turn pandas dataframe into a file-like object in memory?

Note: this page is a Chinese-English translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not the translator). Original question: http://stackoverflow.com/questions/38204064/

Turn pandas dataframe into a file-like object in memory?

Tags: pandas, psycopg2

Asked by trench

I am loading about 2 to 2.5 million records into a Postgres database every day.

I then read this data back with pd.read_sql to turn it into a dataframe, do some column manipulation and some minor merging, and save the modified data as a separate table for other people to use.

When I do pd.to_sql it takes forever. If I save a CSV file and use COPY FROM in Postgres, the whole thing only takes a few minutes, but the server is on a separate machine and it is a pain to transfer files there.

Using psycopg2, it looks like I can use copy_expert to benefit from bulk copying while still using Python. If possible, I want to avoid writing an actual CSV file. Can I do this in memory with a pandas dataframe?

Here is an example of my pandas code. I would like to add copy_expert or something similar to make saving this data much faster, if possible.

    for date in required_date_range:
        df = pd.read_sql(sql=query, con=pg_engine, params={'x': date})
        # ... do stuff to the columns ...
        df.to_sql('table_name', pg_engine, index=False, if_exists='append', dtype=final_table_dtypes)

Can someone help me with example code? I would prefer to keep using pandas, and it would be nice to do it in memory. If not, I will just write a temporary CSV file and do it that way.

Edit: here is my final code, which works. It only takes a couple of hundred seconds per date (millions of rows) instead of a couple of hours.

to_sql = """COPY %s FROM STDIN WITH CSV HEADER"""

import io

def process_file(conn, table_name, file_object):
    # grab a raw psycopg2 connection from the SQLAlchemy engine
    fake_conn = cms_dtypes.pg_engine.raw_connection()
    fake_cur = fake_conn.cursor()
    fake_cur.copy_expert(sql=to_sql % table_name, file=file_object)
    fake_conn.commit()
    fake_cur.close()


# after doing stuff to the dataframe
s_buf = io.StringIO()
df.to_csv(s_buf)
s_buf.seek(0)  # rewind the buffer so copy_expert reads from the start
process_file(cms_dtypes.pg_engine, 'fact_cms_employee', s_buf)
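
A slightly more defensive variant of this helper (a sketch, not the code as posted) would use the engine argument it receives, rewind the buffer itself, and always close the raw connection, even when COPY fails:

def process_file(engine, table_name, file_object):
    """Bulk-load a CSV buffer into table_name via COPY (sketch)."""
    file_object.seek(0)  # always read from the start of the buffer
    raw_conn = engine.raw_connection()
    try:
        cur = raw_conn.cursor()
        cur.copy_expert(sql=to_sql % table_name, file=file_object)
        raw_conn.commit()
        cur.close()
    finally:
        raw_conn.close()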

Answered by ptrj

The Python module io (docs) has the necessary tools for file-like objects.

import io

# text buffer
s_buf = io.StringIO()

# saving a data frame to a buffer (same as with a regular file):
df.to_csv(s_buf)

Edit. (I forgot.) In order to read from the buffer afterwards, its position should be set back to the beginning:

s_buf.seek(0)

I'm not familiar with psycopg2, but according to the docs, both copy_expert and copy_from can be used, for example:

cur.copy_from(s_buf, table)

(For Python 2, see StringIO.)
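
Putting the pieces together, here is a minimal end-to-end sketch (the connection string, my_table, and the sample dataframe are illustrative assumptions, not from the question):

import io

import pandas as pd
import psycopg2

# illustrative data; in practice this is the dataframe built earlier
df = pd.DataFrame({"id": [1, 2], "name": ["foo", "bar"]})

# assumed connection details; adjust for your environment
conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()

s_buf = io.StringIO()
df.to_csv(s_buf, index=False, header=False)  # plain rows only, to match copy_from
s_buf.seek(0)  # rewind before reading

cur.copy_from(s_buf, "my_table", sep=",")  # bulk-load the buffer into the table
conn.commit()
cur.close()
conn.close()

Note that copy_from treats the input as plain delimiter-separated text; if fields may contain commas or quotes, copy_expert with a COPY ... FROM STDIN WITH CSV statement is the safer choice.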

Answered by a_bigbadwolf

I had problems implementing the solution from ptrj.

I think the issue stems from pandas leaving the buffer's position (pos) at the end after writing.

See the following:

from StringIO import StringIO  # Python 2
import pandas as pd

df = pd.DataFrame({"name": ['foo', 'bar'], "id": [1, 2]})
s_buf = StringIO()
df.to_csv(s_buf)
s_buf.__dict__

# Output
# {'softspace': 0, 'buflist': ['foo,1\n', 'bar,2\n'], 'pos': 12, 'len': 12, 'closed': False, 'buf': ''}

Notice that pos is at 12. I had to set pos to 0 in order for the subsequent copy_from command to work:

s_buf.pos = 0  # rewind; Python 2's StringIO exposes the position as 'pos'
cur = conn.cursor()
cur.copy_from(s_buf, tablename, sep=',')
conn.commit()
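
The same pitfall exists in Python 3 with io.StringIO, but there is no writable pos attribute there; seek(0) is the portable fix (a minimal sketch, using the same illustrative dataframe):

import io

import pandas as pd

df = pd.DataFrame({"name": ['foo', 'bar'], "id": [1, 2]})
s_buf = io.StringIO()
df.to_csv(s_buf)
print(s_buf.tell())  # nonzero: the position is at the end after writing

s_buf.seek(0)  # rewind; io.StringIO has no writable 'pos' attribute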