Pandas is faster to load CSV than SQL
Disclaimer: this page is a translation of a popular StackOverflow question.
Original URL: http://stackoverflow.com/questions/43874559/
Warning: the contents are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverFlow
Asked by Haelle
It seems that loading data into pandas from a CSV file is faster than from SQL (PostgreSQL). (I have an SSD.)
Here is my test code:
import time

import pandas as pd
import numpy as np
from sqlalchemy import create_engine

# Time loading the data from a CSV file.
start = time.time()
df = pd.read_csv('foo.csv')
df *= 3
duration = time.time() - start
print('{0}s'.format(duration))

# Time loading the same data from PostgreSQL.
engine = create_engine('postgresql://user:password@host:port/schema')

start = time.time()
df = pd.read_sql_query("select * from mytable", engine)
df *= 3
duration = time.time() - start
print('{0}s'.format(duration))
The foo.csv file and the database contain the same data (the same amount of data and columns in both: 4 columns, 100 000 rows full of random integers).
CSV takes 0.05s
SQL takes 0.5s
Do you think it's normal that CSV is 10 times faster than SQL? I'm wondering if I'm missing something here...
Accepted answer by Steven G
This is normal behavior; reading a CSV file is always one of the quickest ways to simply load data.
A CSV is very naive and simple, so loading directly from it will be very quick. For a massive database with a complex structure, CSV is not an option. SQL is super fast at selecting data from a table and returning that data to you. Naturally, being able to select, modify and manipulate data adds an overhead time cost to your call.
Imagine that you have a time series in a CSV from 1920 to 2017, but you only want data from 2010 to today.
The CSV approach would be to load the entire CSV and then select the years 2010 to 2017.
The SQL approach would be to pre-select the years via a SQL SELECT query.
In that scenario, SQL would be MUCH faster.
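To make that scenario concrete, here is a minimal sketch of the two approaches in pandas. The file name, table name and year column are invented for illustration, and it reuses the placeholder connection string from the question; it assumes the same data lives both in a CSV file and in a PostgreSQL table.

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string, as in the question.
engine = create_engine('postgresql://user:password@host:port/schema')

# CSV approach: parse the whole file, then throw most of it away.
df_csv = pd.read_csv('timeseries.csv')          # hypothetical file covering 1920-2017
df_csv = df_csv[df_csv['year'] >= 2010]

# SQL approach: let the database filter, so only the 2010+ rows are transferred.
df_sql = pd.read_sql_query(
    "SELECT * FROM timeseries WHERE year >= 2010", engine)

The CSV version pays for parsing every row from 1920 onward; the SQL version only transfers the rows the WHERE clause keeps.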
Answered by Simon G.
When working with a PostgreSQL database, you can use a combination of SQL and CSV to get the best from both methods: SQL to select exactly the data you need, and CSV output to quickly load it into a pandas DataFrame.
import io

import pandas as pd
import psycopg2

# conn_params holds your connection parameters (host, dbname, user, password, ...).
conn = psycopg2.connect(**conn_params)
with conn.cursor() as cur:
    sql = 'SELECT * FROM large_table'
    buf = io.StringIO()
    # Stream the query result into an in-memory buffer in CSV format.
    cur.copy_expert(f'COPY ({sql}) TO STDOUT WITH CSV HEADER', buf)
    buf.seek(0)
    df = pd.read_csv(buf, header=0, low_memory=False,
                     true_values='t', false_values='f')
conn.close()
This uses PostgreSQL's fast COPY command in combination with psycopg2's copy_expert() function to read query results into a string buffer in CSV format. You can then use pandas read_csv() on that string buffer.
The drawback is that you may have to convert data types afterwards (e.g. timestamps would be strings). The read_csv() function has a few parameters that can help deal with that (e.g. parse_dates, true_values, false_values, ...).
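For instance, a minimal sketch of that cleanup on the buffer filled by copy_expert() above; the created_at timestamp column is made up for illustration, and it assumes boolean columns arrive as PostgreSQL's 't'/'f' CSV encoding:

df = pd.read_csv(
    buf,                          # the StringIO buffer filled by copy_expert() above
    header=0,
    low_memory=False,
    parse_dates=['created_at'],   # hypothetical timestamp column -> datetime64
    true_values=['t'],            # PostgreSQL's CSV output encodes booleans as 't'/'f'
    false_values=['f'],
)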
In my use case (30 million rows, 15 columns) this gave me a performance improvement of about 2-3x compared to the pandas read_sql() function.
Answered by Mikael Gueck
While Steven G's explanation of the process is fundamentally the answer to your question, and Simon G's COPY solution is the most efficient one I've been able to find, I decided to look into your question in more depth, and actually measure the different aspects relevant to it.
At https://git.mikael.io/mikaelhg/pandas-pg-csv-speed-poc there is a project which contains pytest benchmarks for the various alternative solutions.
The CSV for this test is an order of magnitude larger than in the question, with the shape of (3742616, 6), just to make sure that there is less of a chance of various buffers being just the right size to skew the results.
Thanks to the Finnish Traffic Safety Bureau Trafi's open data initiative for providing the test data.
As for the PostgreSQL installation, it's inside the canonical Docker container, and was started with upped shared_buffers and work_mem values, with the data files stored under the host machine's /dev/shm mount point, in order to negate actual disk I/O. Its UNIX socket connection point is also similarly exposed.
version: '3'

services:

  db:
    image: 'postgres:10-alpine'
    command: "postgres -c 'shared_buffers=512MB' -c 'temp_buffers=80MB' -c 'work_mem=256MB'"
    ports:
      - '5432:5432'
    volumes:
      - '/dev/shm/pgtest/data:/var/lib/postgresql/data'
      - '/dev/shm/pgtest/run:/var/run/postgresql'
    environment:
      POSTGRES_USER: test
      POSTGRES_PASSWORD: test
      POSTGRES_DB: test

  test:
    image: pandas_speed_poc:temp
    build:
      context: .
      dockerfile: Dockerfile.test-runner
    volumes:
      - '.:/app'
      - '/dev/shm/pgtest/run:/var/run/postgresql'
    working_dir: '/app'
    user: '1000'
The test runner is a simple Ubuntu 18.04 container:
FROM ubuntu:18.04
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get -qq update && \
    apt-get -y -qq install python3-dev python3-pip python3-psycopg2 \
        build-essential \
        bash less nano wait-for-it
RUN pip3 install sqlalchemy numpy pandas \
    pytest pytest-benchmark
WORKDIR /app
CMD wait-for-it db:5432 -- /bin/bash -c "trap : TERM INT; sleep infinity & wait"
The actual benchmark is a Python 3 unittest written for pytest-benchmark:
#!/usr/bin/python3

from sqlalchemy import create_engine
import psycopg2
import psycopg2.extensions

import pandas as pd
import numpy as np

import io
import time
import gzip

import unittest
import pytest

DATA_FILE = 'data/licenses.csv.gz'

DROP_TABLE = "DROP TABLE IF EXISTS licenses"

CREATE_TABLE = """
    CREATE TABLE licenses (
        a VARCHAR(16),
        b CHAR(3),
        c CHAR(6),
        d INTEGER,
        e INTEGER,
        f INTEGER
    )
"""

COPY_FROM = """
    COPY licenses (a, b, c, d, e, f) FROM STDIN
    WITH (FORMAT CSV, DELIMITER ';', HEADER)
"""

COPY_TO = "COPY licenses TO STDOUT WITH (FORMAT CSV, HEADER)"

SELECT_FROM = 'SELECT * FROM licenses'

VACUUM = "VACUUM FULL ANALYZE"

DB_UNIX_SOCKET_URL = 'postgresql://test:test@/test'
DB_TCP_URL = 'postgresql://test:test@db/test'

def my_cursor_factory(*args, **kwargs):
    cursor = psycopg2.extensions.cursor(*args, **kwargs)
    cursor.itersize = 10240
    return cursor

class TestImportDataSpeed(unittest.TestCase):

    @pytest.fixture(autouse=True)
    def setupBenchmark(self, benchmark):
        self.benchmark = benchmark

    @classmethod
    def setUpClass(cls):
        cls.engine = create_engine(DB_TCP_URL, connect_args={'cursor_factory': my_cursor_factory})
        connection = cls.engine.connect().connection
        cursor = connection.cursor()
        cursor.execute(DROP_TABLE)
        cursor.execute(CREATE_TABLE)
        with gzip.open(DATA_FILE, 'rb') as f:
            cursor.copy_expert(COPY_FROM, file=f, size=1048576)
        connection.commit()
        connection.set_session(autocommit=True)
        cursor.execute(VACUUM)
        cursor.close()
        connection.close()

    def test_pd_csv(self):

        def result():
            return pd.read_csv(DATA_FILE, delimiter=';', low_memory=False)

        df = self.benchmark(result)
        assert df.shape == (3742616, 6)

    def test_psycopg2_cursor(self):

        def result():
            connection = self.engine.connect().connection
            cursor = connection.cursor()
            cursor.itersize = 102400
            cursor.arraysize = 102400
            cursor.execute(SELECT_FROM)
            rows = cursor.fetchall()
            cursor.close()
            connection.close()
            return pd.DataFrame(rows)

        df = self.benchmark(result)
        assert df.shape == (3742616, 6)

    def test_pd_sqla_naive(self):

        def result():
            return pd.read_sql_query(SELECT_FROM, self.engine)

        df = self.benchmark(result)
        assert df.shape == (3742616, 6)

    def test_pd_sqla_chunked(self):

        def result():
            gen = (x for x in pd.read_sql(SELECT_FROM, self.engine, chunksize=10240))
            return pd.concat(gen, ignore_index=True)

        df = self.benchmark(result)
        assert df.shape == (3742616, 6)

    def test_pg_copy(self):
        connection = self.engine.connect().connection
        cursor = connection.cursor()

        def result(cursor):
            f = io.StringIO()
            cursor.copy_expert(COPY_TO, file=f, size=1048576)
            f.seek(0)
            return pd.read_csv(f, low_memory=False)

        df = self.benchmark(result, cursor)
        assert df.shape == (3742616, 6)
The final results:
speed_test.py .....
-------------------------------------------------------------------------------- benchmark: 5 tests -------------------------------------------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_pd_csv 1.4623 (1.0) 1.4903 (1.0) 1.4776 (1.0) 0.0110 (1.21) 1.4786 (1.0) 0.0171 (1.15) 2;0 0.6768 (1.0) 5 1
test_pg_copy 3.0631 (2.09) 3.0842 (2.07) 3.0732 (2.08) 0.0091 (1.0) 3.0769 (2.08) 0.0149 (1.0) 2;0 0.3254 (0.48) 5 1
test_psycopg2_cursor 4.5325 (3.10) 4.5724 (3.07) 4.5531 (3.08) 0.0161 (1.77) 4.5481 (3.08) 0.0249 (1.68) 2;0 0.2196 (0.32) 5 1
test_pd_sqla_naive 6.0177 (4.12) 6.0523 (4.06) 6.0369 (4.09) 0.0147 (1.62) 6.0332 (4.08) 0.0242 (1.63) 2;0 0.1656 (0.24) 5 1
test_pd_sqla_chunked 6.0247 (4.12) 6.1454 (4.12) 6.0889 (4.12) 0.0442 (4.86) 6.0963 (4.12) 0.0524 (3.52) 2;0 0.1642 (0.24) 5 1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Your results will vary! Run the tests on your own architecture for relevant results.
Answered by dader
It's perfectly normal that CSV is much faster than SQL, but they are not meant for the same thing, even though you can use them for the same thing:
CSV is for sequential access, i.e. you start at the beginning of the file and you read each row one after the other, treating it as needed.
SQL is for indexed access, i.e. you look at an index and then you go to the row you're looking for. You can also perform a full table scan, i.e. not make use of any index, which makes the table essentially a bloated CSV.
Your query is a full table scan; it doesn't look at the index, because it goes for ALL the data. So yes, it's normal.
On the other hand, if you try a query like
select * from mytable where myindex = 'myvalue';
You'll get a HUGE boost compared to searching for the same rows in the CSV. That's because of the indexes in SQL.
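As a rough sketch of that difference from pandas, using the question's table and column names and its placeholder connection string; the index name is invented for illustration, and it assumes a myindex column exists on mytable:

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:password@host:port/schema')

# One-time setup (hypothetical index name): an index lets the database
# jump straight to the matching rows instead of scanning the whole table.
with engine.begin() as conn:
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS mytable_myindex_idx ON mytable (myindex)"))

# Full table scan: every row is read and transferred, much like loading a whole CSV.
all_rows = pd.read_sql_query("SELECT * FROM mytable", engine)

# Indexed lookup: the database touches only the matching rows.
some_rows = pd.read_sql_query(
    "SELECT * FROM mytable WHERE myindex = 'myvalue'", engine)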