Loading 5 million rows into Pandas from MySQL

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) and link to the original question: http://stackoverflow.com/questions/31702621/

Tags: mysql, pandas

Asked by Dervin Thunk

I have 5 million rows in a MySQL DB sitting over the (local) network (so quick connection, not on the internet).

The connection to the DB works fine, but if I try to do

f = pd.read_sql_query('SELECT * FROM mytable', engine, index_col = 'ID')

This takes a really long time. Even chunking with chunksize will be slow. Besides, I don't really know whether it's just hung there or indeed retrieving information.

I would like to ask, for those people working with large data on a DB, how do they retrieve their data for their Pandas session?

Would it be "smarter", for example, to run the query, return a csv file with the results and load thatinto Pandas? Sounds much more involved than it needs to be.

Accepted answer by firelynx

The best way of loading all data from a table out of any SQL database into pandas is:

  1. Dumping the data out of the database using COPY for PostgreSQL, SELECT INTO OUTFILE for MySQL, or something similar for other dialects.
  2. Reading the csv file with pandas using the pandas.read_csv function (a rough sketch of both steps follows after this list).
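
As a rough sketch of those two steps for the MySQL case (the pymysql connector, the dump path and the column names below are assumptions, not part of the original answer) - note that SELECT ... INTO OUTFILE writes the file on the database server, subject to its secure_file_priv setting and the account's FILE privilege, so the file must be reachable from the client before pandas can read it:

import pandas as pd
import pymysql  # any DB-API connector would do; pymysql is only an assumption here

# Step 1: ask the MySQL server to dump the table to a CSV file.
dump_path = '/var/lib/mysql-files/mytable.csv'   # hypothetical server-side path
conn = pymysql.connect(host='db-host', user='user', password='pw', database='mydb')
with conn.cursor() as cur:
    cur.execute(
        "SELECT * FROM mytable "
        "INTO OUTFILE '" + dump_path + "' "
        "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n'"
    )
conn.close()

# Step 2: read the dump with pandas. OUTFILE writes no header row, so the
# column names must be supplied explicitly (made-up names here).
df = pd.read_csv(dump_path, header=None,
                 names=['ID', 'col_a', 'col_b'], index_col='ID')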

Use the connector only for reading a few rows. The power of an SQL database is its ability to deliver small chunks of data based on indices.

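For instance, a small, index-backed read like the following stays fast through the plain connector (the connection string and ID range are assumptions for illustration):

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; any SQLAlchemy-compatible URL works.
engine = create_engine('mysql+pymysql://user:pw@db-host/mydb')

# An indexed range lookup returns only a small slice of the table, which is
# exactly what the connector path is good at.
subset = pd.read_sql_query(
    'SELECT * FROM mytable WHERE ID BETWEEN 1000 AND 2000',
    engine, index_col='ID')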

Delivering entire tables is something you do with dumps.

Answer by Thomas Kimber

I had a similar issue whilst working with an Oracle db (for me it turned out it was taking a long time to retrieve all the data, during which time I had no idea how far along it was or whether there was any problem) - my solution was to stream the results of my query into a set of csv files, and then load them into Pandas.

I'm sure there are faster ways of doing this, but this worked surprisingly well for datasets of around 8 million rows.

You can see the code I used on my GitHub page in easy_query.py, but the core function I used looked like this:

import cx_Oracle as ora
import pandas as pd

def SQLCurtoCSV(sqlstring, connstring, filename, chunksize):
    # Stream a query's results into a series of CSV files, chunksize rows per
    # file. filename should contain '%%', which is replaced by the chunk number.
    connection = ora.connect(connstring)
    cursor = connection.cursor()
    params = []                      # no bind parameters in this simple case
    cursor.execute(sqlstring, params)
    cursor.arraysize = 256           # rows fetched per round trip to the server
    r = []                           # rows buffered for the current chunk
    c = 0                            # row count within the current chunk
    i = 0                            # chunk (file) counter
    for row in cursor:
        c += 1
        r.append(row)
        if c >= chunksize:
            c = 0
            i += 1
            df = pd.DataFrame.from_records(r)
            df.columns = [rec[0] for rec in cursor.description]
            df.to_csv(filename.replace('%%', str(i)), sep='|')
            r = []
    # Write whatever is left over (the final, possibly partial, chunk).
    if r:
        i += 1
        df = pd.DataFrame.from_records(r)
        df.columns = [rec[0] for rec in cursor.description]
        df.to_csv(filename.replace('%%', str(i)), sep='|')
    cursor.close()
    connection.close()

The surrounding module imports cx_Oracle to provide the various database hooks/API calls, but I'd expect similar functions to be available through a similarly provided MySQL API.

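As a rough illustration (my assumption, not part of the answer), the Oracle-specific connection and cursor lines could be swapped for something like the following with PyMySQL, whose server-side SSCursor streams rows instead of buffering the whole result set; the chunk-and-write loop above would stay the same:

import pymysql
import pymysql.cursors

# SSCursor fetches rows from the server incrementally rather than all at once.
connection = pymysql.connect(host='db-host', user='user', password='pw',
                             database='mydb',
                             cursorclass=pymysql.cursors.SSCursor)
cursor = connection.cursor()
cursor.execute('SELECT * FROM mytable')
for row in cursor:   # same chunk-and-write logic as in SQLCurtoCSV goes here
    pass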

What's nice is that you can see the files building up in your chosen directory, so you get some kind of feedback as to whether your extract is working, and how many results per second/minute/hour you can expect to receive.

It also means you can work on the initial files whilst the rest are being fetched.

Once all the data is saved down to individual files, they can be loaded up into a single Pandas dataframe using multiple pandas.read_csv and pandas.concat statements.

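A minimal sketch of that reassembly step (assuming the chunks were written as mytable_1.csv, mytable_2.csv, ... with sep='|' as in the function above):

import glob
import pandas as pd

# Lexicographic order is fine for a sketch; zero-pad the chunk numbers in the
# filenames if strict ordering matters.
parts = sorted(glob.glob('mytable_*.csv'))
df = pd.concat((pd.read_csv(p, sep='|', index_col=0) for p in parts),
               ignore_index=True)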

Answer by Saurabh Nair

query: your SQL query.
conn: a connection to your database.
chunksize: extracts the data in batches of this many rows; read_sql_query then returns a generator.

Try the code below to extract the data in chunks, then use the helper function to convert the generator object to a dataframe.

import pandas as pd

# query and conn are your SQL string and an open DB connection; chunksize
# makes read_sql_query return a generator of DataFrames rather than one
# large frame.
df_chunks = pd.read_sql_query(query, conn, chunksize=50000)

def chunks_to_df(gen):
    # Materialise every chunk, then stitch them together into one DataFrame.
    chunks = []
    for df in gen:
        chunks.append(df)
    return pd.concat(chunks).reset_index(drop=True)

df = chunks_to_df(df_chunks)

This will help you reduce the load on the database server, fetch all your data in batches, and use it for further analysis.
