Pandas is faster to load CSV than SQL
Disclaimer: this page is a translation of a popular StackOverflow question.
Original URL: http://stackoverflow.com/questions/43874559/
Warning: the contents are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverFlow
Asked by Haelle
It seems that loading data into pandas from a CSV file is faster than from SQL (PostgreSQL). (I have an SSD.)
Here is my test code:
import time

import pandas as pd
import numpy as np
from sqlalchemy import create_engine

# Time loading the data from a CSV file.
start = time.time()
df = pd.read_csv('foo.csv')
df *= 3
duration = time.time() - start
print('{0}s'.format(duration))

# Time loading the same data from PostgreSQL.
engine = create_engine('postgresql://user:password@host:port/schema')

start = time.time()
df = pd.read_sql_query("select * from mytable", engine)
df *= 3
duration = time.time() - start
print('{0}s'.format(duration))
The foo.csv file and the database contain the same data (the same amount of data and columns in both: 4 columns, 100 000 rows full of random integers).
CSV takes 0.05s
SQL takes 0.5s
Do you think it's normal that CSV is 10 times faster than SQL? I'm wondering if I'm missing something here...
Accepted answer by Steven G
This is normal behavior; reading a CSV file is always one of the quickest ways to simply load data.
A CSV is very naive and simple, so loading directly from it will be very quick. For a massive database with a complex structure, CSV is not an option. SQL is super fast at selecting data from a table and returning that data to you. Naturally, being able to select, modify and manipulate data adds an overhead time cost to your call.
Imagine that you have a time series in a CSV from 1920 to 2017, but you only want data from 2010 to today.
The CSV approach would be to load the entire CSV and then select the years 2010 to 2017.
The SQL approach would be to pre-select the years via a SQL SELECT query.
In that scenario, SQL would be MUCH faster.
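To make that scenario concrete, here is a minimal sketch of the two approaches in pandas. The file name, table name and year column are invented for illustration, and it reuses the placeholder connection string from the question; it assumes the same data lives both in a CSV file and in a PostgreSQL table.

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string, as in the question.
engine = create_engine('postgresql://user:password@host:port/schema')

# CSV approach: parse the whole file, then throw most of it away.
df_csv = pd.read_csv('timeseries.csv')          # hypothetical file covering 1920-2017
df_csv = df_csv[df_csv['year'] >= 2010]

# SQL approach: let the database filter, so only the 2010+ rows are transferred.
df_sql = pd.read_sql_query(
    "SELECT * FROM timeseries WHERE year >= 2010", engine)

The CSV version pays for parsing every row from 1920 onward; the SQL version only transfers the rows the WHERE clause keeps.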
Answered by Simon G.
When working with a PostgreSQL database, you can use a combination of SQL and CSV to get the best from both methods: SQL to select exactly the data you need, and CSV output to quickly load it into a pandas DataFrame.
import io

import pandas as pd
import psycopg2

# conn_params holds your connection parameters (host, dbname, user, password, ...).
conn = psycopg2.connect(**conn_params)
with conn.cursor() as cur:
    sql = 'SELECT * FROM large_table'
    buf = io.StringIO()
    # Stream the query result into an in-memory buffer in CSV format.
    cur.copy_expert(f'COPY ({sql}) TO STDOUT WITH CSV HEADER', buf)
    buf.seek(0)
    df = pd.read_csv(buf, header=0, low_memory=False,
                     true_values='t', false_values='f')
conn.close()
This uses PostgreSQL's fast COPY command in combination with psycopg2's copy_expert() function to read query results into a string buffer in CSV format. You can then use pandas read_csv() on that string buffer.
The drawback is that you may have to convert data types afterwards (e.g. timestamps would be strings). The read_csv() function has a few parameters that can help deal with that (e.g. parse_dates, true_values, false_values, ...).
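For instance, a minimal sketch of that cleanup on the buffer filled by copy_expert() above; the created_at timestamp column is made up for illustration, and it assumes boolean columns arrive as PostgreSQL's 't'/'f' CSV encoding:

df = pd.read_csv(
    buf,                          # the StringIO buffer filled by copy_expert() above
    header=0,
    low_memory=False,
    parse_dates=['created_at'],   # hypothetical timestamp column -> datetime64
    true_values=['t'],            # PostgreSQL's CSV output encodes booleans as 't'/'f'
    false_values=['f'],
)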
In my use case (30 million rows, 15 columns) this gave me a performance improvement of about 2-3x compared to the pandas read_sql() function.
Answered by Mikael Gueck
While Steven G's explanation of the process is fundamentally the answer to your question, and Simon G's COPY solution is the most efficient one I've been able to find, I decided to look into your question in more depth, and actually measure the different aspects relevant to it.
At https://git.mikael.io/mikaelhg/pandas-pg-csv-speed-poc there is a project which contains pytest benchmarks for the various alternative solutions.
The CSV for this test is an order of magnitude larger than in the question, with the shape of (3742616, 6), just to make sure that there is less of a chance of various buffers being just the right size to skew the results.
Thanks to the Finnish Traffic Safety Bureau Trafi's open data initiative for providing the test data.
As for the PostgreSQL installation, it's inside the canonical Docker container, and was started with upped shared_buffers and work_mem values, with the data files stored under the host machine's /dev/shm mount point, in order to negate actual disk I/O. Its UNIX socket connection point is also similarly exposed.
version: '3'

services:

  db:
    image: 'postgres:10-alpine'
    command: "postgres -c 'shared_buffers=512MB' -c 'temp_buffers=80MB' -c 'work_mem=256MB'"
    ports:
      - '5432:5432'
    volumes:
      - '/dev/shm/pgtest/data:/var/lib/postgresql/data'
      - '/dev/shm/pgtest/run:/var/run/postgresql'
    environment:
      POSTGRES_USER: test
      POSTGRES_PASSWORD: test
      POSTGRES_DB: test

  test:
    image: pandas_speed_poc:temp
    build:
      context: .
      dockerfile: Dockerfile.test-runner
    volumes:
      - '.:/app'
      - '/dev/shm/pgtest/run:/var/run/postgresql'
    working_dir: '/app'
    user: '1000'
The test runner is a simple Ubuntu 18.04 container:
FROM ubuntu:18.04
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get -qq update && \
    apt-get -y -qq install python3-dev python3-pip python3-psycopg2 \
        build-essential \
        bash less nano wait-for-it
RUN pip3 install sqlalchemy numpy pandas \
    pytest pytest-benchmark
WORKDIR /app
CMD wait-for-it db:5432 -- /bin/bash -c "trap : TERM INT; sleep infinity & wait"
The actual benchmark is a Python 3 unittest written for pytest-benchmark:
#!/usr/bin/python3

from sqlalchemy import create_engine
import psycopg2
import psycopg2.extensions

import pandas as pd
import numpy as np

import io
import time
import gzip

import unittest
import pytest

DATA_FILE = 'data/licenses.csv.gz'

DROP_TABLE = "DROP TABLE IF EXISTS licenses"

CREATE_TABLE = """
    CREATE TABLE licenses (
        a VARCHAR(16),
        b CHAR(3),
        c CHAR(6),
        d INTEGER,
        e INTEGER,
        f INTEGER
    )
"""

COPY_FROM = """
    COPY licenses (a, b, c, d, e, f) FROM STDIN
    WITH (FORMAT CSV, DELIMITER ';', HEADER)
"""

COPY_TO = "COPY licenses TO STDOUT WITH (FORMAT CSV, HEADER)"

SELECT_FROM = 'SELECT * FROM licenses'

VACUUM = "VACUUM FULL ANALYZE"

DB_UNIX_SOCKET_URL = 'postgresql://test:test@/test'
DB_TCP_URL = 'postgresql://test:test@db/test'

def my_cursor_factory(*args, **kwargs):
    cursor = psycopg2.extensions.cursor(*args, **kwargs)
    cursor.itersize = 10240
    return cursor

class TestImportDataSpeed(unittest.TestCase):

    @pytest.fixture(autouse=True)
    def setupBenchmark(self, benchmark):
        self.benchmark = benchmark

    @classmethod
    def setUpClass(cls):
        cls.engine = create_engine(DB_TCP_URL, connect_args={'cursor_factory': my_cursor_factory})
        connection = cls.engine.connect().connection
        cursor = connection.cursor()
        cursor.execute(DROP_TABLE)
        cursor.execute(CREATE_TABLE)
        with gzip.open(DATA_FILE, 'rb') as f:
            cursor.copy_expert(COPY_FROM, file=f, size=1048576)
        connection.commit()
        connection.set_session(autocommit=True)
        cursor.execute(VACUUM)
        cursor.close()
        connection.close()

    def test_pd_csv(self):

        def result():
            return pd.read_csv(DATA_FILE, delimiter=';', low_memory=False)

        df = self.benchmark(result)
        assert df.shape == (3742616, 6)

    def test_psycopg2_cursor(self):

        def result():
            connection = self.engine.connect().connection
            cursor = connection.cursor()
            cursor.itersize = 102400
            cursor.arraysize = 102400
            cursor.execute(SELECT_FROM)
            rows = cursor.fetchall()
            cursor.close()
            connection.close()
            return pd.DataFrame(rows)

        df = self.benchmark(result)
        assert df.shape == (3742616, 6)

    def test_pd_sqla_naive(self):

        def result():
            return pd.read_sql_query(SELECT_FROM, self.engine)

        df = self.benchmark(result)
        assert df.shape == (3742616, 6)

    def test_pd_sqla_chunked(self):

        def result():
            gen = (x for x in pd.read_sql(SELECT_FROM, self.engine, chunksize=10240))
            return pd.concat(gen, ignore_index=True)

        df = self.benchmark(result)
        assert df.shape == (3742616, 6)

    def test_pg_copy(self):
        connection = self.engine.connect().connection
        cursor = connection.cursor()

        def result(cursor):
            f = io.StringIO()
            cursor.copy_expert(COPY_TO, file=f, size=1048576)
            f.seek(0)
            return pd.read_csv(f, low_memory=False)

        df = self.benchmark(result, cursor)
        assert df.shape == (3742616, 6)
The final results:
speed_test.py .....
-------------------------------------------------------------------------------- benchmark: 5 tests -------------------------------------------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_pd_csv 1.4623 (1.0) 1.4903 (1.0) 1.4776 (1.0) 0.0110 (1.21) 1.4786 (1.0) 0.0171 (1.15) 2;0 0.6768 (1.0) 5 1
test_pg_copy 3.0631 (2.09) 3.0842 (2.07) 3.0732 (2.08) 0.0091 (1.0) 3.0769 (2.08) 0.0149 (1.0) 2;0 0.3254 (0.48) 5 1
test_psycopg2_cursor 4.5325 (3.10) 4.5724 (3.07) 4.5531 (3.08) 0.0161 (1.77) 4.5481 (3.08) 0.0249 (1.68) 2;0 0.2196 (0.32) 5 1
test_pd_sqla_naive 6.0177 (4.12) 6.0523 (4.06) 6.0369 (4.09) 0.0147 (1.62) 6.0332 (4.08) 0.0242 (1.63) 2;0 0.1656 (0.24) 5 1
test_pd_sqla_chunked 6.0247 (4.12) 6.1454 (4.12) 6.0889 (4.12) 0.0442 (4.86) 6.0963 (4.12) 0.0524 (3.52) 2;0 0.1642 (0.24) 5 1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Your results will vary! Run the tests on your own architecture for relevant results.
Answered by dader
It's perfectly normal that CSV is much faster than SQL, but they are not meant for the same thing, even though you can use them for the same thing:
CSV is for sequential access, i.e. you start at the beginning of the file and you read each row one after the other, treating it as needed.
SQL is for indexed access, i.e. you look at an index and then you go to the row you're looking for. You can also perform a full table scan, i.e. not make use of any index, which makes the table essentially a bloated CSV.
Your query is a full table scan; it doesn't look at the index, because it goes for ALL the data. So yes, it's normal.
On the other hand, if you try a query like
select * from mytable where myindex = 'myvalue';
You'll get a HUGE boost compared to searching for the same rows in the CSV. That's because of the indexes in SQL.
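As a rough sketch of that difference from pandas, using the question's table and column names and its placeholder connection string; the index name is invented for illustration, and it assumes a myindex column exists on mytable:

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:password@host:port/schema')

# One-time setup (hypothetical index name): an index lets the database
# jump straight to the matching rows instead of scanning the whole table.
with engine.begin() as conn:
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS mytable_myindex_idx ON mytable (myindex)"))

# Full table scan: every row is read and transferred, much like loading a whole CSV.
all_rows = pd.read_sql_query("SELECT * FROM mytable", engine)

# Indexed lookup: the database touches only the matching rows.
some_rows = pd.read_sql_query(
    "SELECT * FROM mytable WHERE myindex = 'myvalue'", engine)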