Python: Fastest way to parse large CSV files in Pandas

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me) and link to the original: http://stackoverflow.com/questions/25508510/

Fastest way to parse large CSV files in Pandas

python pandas

Asked by Ginger

I am using pandas to analyse the large data files here: http://www.nielda.co.uk/betfair/data/ They are around 100 megs in size.

Each load from csv takes a few seconds, and then more time to convert the dates.

I have tried loading the files, converting the dates from strings to datetimes, and then re-saving them as pickle files. But loading those takes a few seconds as well.

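For reference, a minimal sketch of that workflow (the file name and the date column name are placeholders, not from the original question):

```python
import pandas as pd

# Parse the CSV once, converting the date column while reading
# (column and file names are assumptions).
df = pd.read_csv("betfair_data.csv", parse_dates=["date"])

# Re-save in pickle format; later loads skip CSV parsing,
# but can still take a few seconds for a ~100 MB file.
df.to_pickle("betfair_data.pkl")
df = pd.read_pickle("betfair_data.pkl")
```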

What fast methods could I use to load/save the data from disk?

Answered by DrV

One thing to check is the actual performance of the disk system itself. Especially if you use spinning disks (not SSD), your practical disk read speed may be one of the explaining factors for the performance. So, before doing too much optimization, check if reading the same data into memory (by, e.g., mydata = open('myfile.txt').read()) takes an equivalent amount of time. (Just make sure you do not get bitten by disk caches; if you load the same data twice, the second time it will be much faster because the data is already in RAM cache.)

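A minimal timing sketch of that check (the file name is a placeholder): if the raw read already accounts for most of the time, parsing is not the bottleneck. Remember that a second run will usually be much faster because the data is then served from the OS page cache.

```python
import time
import pandas as pd

path = "betfair_data.csv"  # placeholder file name

# Raw read: roughly measures disk (or cache) throughput, with no parsing at all.
t0 = time.perf_counter()
raw = open(path, "rb").read()
t_raw = time.perf_counter() - t0

# Full parse with pandas, for comparison.
t0 = time.perf_counter()
df = pd.read_csv(path)
t_parse = time.perf_counter() - t0

print(f"raw read: {t_raw:.2f}s ({len(raw) / 1e6:.0f} MB), read_csv: {t_parse:.2f}s")
```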

See the update below before believing what I write underneath

If your problem is really parsing of the files, then I am not sure if any pure Python solution will help you. As you know the actual structure of the files, you do not need to use a generic CSV parser.

There are three things to try, though:

  1. Python csv package and csv.reader
  2. NumPy genfromtxt
  3. NumPy loadtxt

The third one is probably fastest if you can use it with your data. At the same time it has the most limited set of features. (Which actually may make it fast.)

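A quick sketch of the three alternatives, assuming a hypothetical file with one header row and purely numeric columns (loadtxt in particular cannot handle non-numeric fields without extra converters):

```python
import csv
import numpy as np

path = "betfair_data.csv"  # placeholder file name

# 1. Stdlib csv.reader: iterates row by row and returns lists of strings.
with open(path, newline="") as f:
    rows = list(csv.reader(f))

# 2. np.genfromtxt: flexible (handles missing values), but relatively slow.
arr1 = np.genfromtxt(path, delimiter=",", skip_header=1)

# 3. np.loadtxt: the most limited of the three, which is also why it can be fast.
arr2 = np.loadtxt(path, delimiter=",", skiprows=1)
```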

Also, the suggestions given to you in the comments by crclayton, BKay, and EdChum are good ones.

Try the different alternatives! If they do not work, then you will have to write something in a compiled language (either compiled Python or, e.g., C).

Update: I do believe what chrisb says below, i.e. the pandas parser is fast.

Then the only way to make the parsing faster is to write an application-specific parser in C (or other compiled language). Generic parsing of CSV files is not straightforward, but if the exact structure of the file is known there may be shortcuts. In any case parsing text files is slow, so if you ever can translate it into something more palatable (HDF5, NumPy array), loading will be only limited by the I/O performance.

Answered by joris

As @chrisb said, pandas' read_csv is probably faster than csv.reader/numpy.genfromtxt/loadtxt. I don't think you will find something better to parse the csv (as a note, read_csv is not a 'pure python' solution, as the CSV parser is implemented in C).

But, if you have to load/query the data often, a solution would be to parse the CSV only once and then store it in another format, e.g. HDF5. You can use pandas (with PyTables in the background) to query that efficiently (docs).
See here for a comparison of the IO performance of HDF5, CSV and SQL with pandas: http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations

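A minimal sketch of that parse-once-then-store approach (the file and column names are placeholders; to_hdf requires the PyTables package):

```python
import pandas as pd

csv_path = "betfair_data.csv"  # placeholder file name
h5_path = "betfair_data.h5"

# One-off conversion: parse the CSV (including the date column) a single time...
df = pd.read_csv(csv_path, parse_dates=["date"])

# ...and store it in HDF5 format.
df.to_hdf(h5_path, key="betfair", mode="w")

# Subsequent loads skip text parsing entirely and are mostly limited by I/O.
df = pd.read_hdf(h5_path, "betfair")
```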

And a possibly relevant other question: "Large data" work flows using pandas

Answered by Ravi Singh

Modin is an early-stage project at UC Berkeley's RISELab designed to facilitate the use of distributed computing for Data Science. It is a multiprocess Dataframe library with an identical API to pandas that allows users to speed up their Pandas workflows. Modin accelerates Pandas queries by 4x on an 8-core machine, only requiring users to change a single line of code in their notebooks.

pip install modin

If using dask:

pip install modin[dask]

Import modin by typing:

import modin.pandas as pd

It uses all CPU cores to import the CSV file, and its API is almost identical to pandas.

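With that single import change, an existing pandas call should work unchanged (the file name below is a placeholder):

```python
import modin.pandas as pd

# Same call as plain pandas; Modin distributes the parsing across CPU cores.
df = pd.read_csv("betfair_data.csv")
print(df.head())
```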