Sorting in pandas for large datasets

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/21271727/
Asked by user1867185
I would like to sort my data by a given column, specifically p-values. However, the issue is that I am not able to load my entire data into memory. Thus, the following doesn't work, or rather works only for small datasets.
data = data.sort(columns=["P_VALUE"], ascending=True, axis=0)
Is there a quick way to sort my data by a given column that only works on chunks at a time and doesn't require loading the entire dataset into memory?
Answered by Ami Tavory
In the past, I've used Linux's pair of venerable sort and split utilities to sort massive files that choked pandas.
I don't want to disparage the other answer on this page. However, since your data is in text format (as you indicated in the comments), I think it's a tremendous complication to start transferring it into other formats (HDF, SQL, etc.) for something that GNU/Linux utilities have been solving very efficiently for the last 30-40 years.
Say your file is called stuff.csv, and looks like this:
4.9,3.0,1.4,0.6
4.8,2.8,1.3,1.2
Then the following command will sort it by the 3rd column:
sort --parallel=8 -t , -nrk3 stuff.csv
Note that the number of threads here is set to 8.
The above will work with files that fit into the main memory. When your file is too large, you would first split it into a number of parts. So
split -l 100000 stuff.csv stuff
would split the file into files of length at most 100000 lines.
Now you would sort each file individually, as above. Finally, you would merge them with a merge sort, again through (wait for it...) sort:
sort -m sorted_stuff_* > final_sorted_stuff.csv
Finally, if your file is not in CSV (say it is a tgz file), then you should find a way to pipe a CSV version of it into split.
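If you would rather stay in Python, the same split/sort/merge idea can also be sketched with pandas chunks and the standard library's heapq.merge. This is only a minimal sketch, not part of the original answer: it assumes the CSV has a header row with a numeric P_VALUE column (file and column names are placeholders), and it uses the modern sort_values API.

import csv
import glob
import heapq
import pandas as pd

CHUNK_ROWS = 100000  # tune to what fits comfortably in memory

# 1) Sort each chunk in memory and write it to its own temporary CSV.
for i, chunk in enumerate(pd.read_csv("stuff.csv", chunksize=CHUNK_ROWS)):
    chunk.sort_values("P_VALUE").to_csv("sorted_chunk_%05d.csv" % i, index=False)

# 2) Stream-merge the pre-sorted chunks; only one row per chunk is held in memory.
def rows_by_p_value(path):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield float(row["P_VALUE"]), row

streams = [rows_by_p_value(p) for p in sorted(glob.glob("sorted_chunk_*.csv"))]
with open("final_sorted_stuff.csv", "w", newline="") as out:
    writer = None
    for _, row in heapq.merge(*streams, key=lambda item: item[0]):
        if writer is None:
            writer = csv.DictWriter(out, fieldnames=list(row.keys()))
            writer.writeheader()
        writer.writerow(row)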
Answered by iled
As I mentioned in the comments, this answer already provides a possible solution. It is based on the HDF format.
About the sorting problem, there are at least three possible ways to solve it with that approach.
First, you can try to use pandas directly, querying the HDF-stored DataFrame.
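A minimal sketch of that first option, under some assumptions: the frame was written in table format with P_VALUE declared as a data column (e.g. df.to_hdf('data.h5', 'data', format='table', data_columns=['P_VALUE'])), and the single P_VALUE column fits in memory even though the full frame does not. File, key and column names here are placeholders.

import numpy as np
import pandas as pd

with pd.HDFStore("data.h5", mode="r") as store:
    # A single column is often small enough to hold in memory even when
    # the whole frame is not; sort it to get the desired row order.
    p_values = store.select_column("data", "P_VALUE")
    order = np.argsort(p_values.values, kind="mergesort")

    # Pull the full rows back in batches of row coordinates and append them
    # to an output file, re-sorting within each batch to be safe about the
    # order in which the rows come back.
    batch = 100000
    for start in range(0, len(order), batch):
        part = store.select("data", where=order[start:start + batch])
        part.sort_values("P_VALUE").to_csv(
            "sorted_by_pvalue.csv", mode="a", header=(start == 0), index=False)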
Second, you can use PyTables, which pandas uses under the hood.
Francesc Alted gives a hint in the PyTables mailing list:
The simplest way is by setting the sortby parameter to true in the Table.copy() method. This triggers an on-disk sorting operation, so you don't have to be afraid of your available memory. You will need the Pro version for getting this capability.
In the docs, it says:
sortby : If specified, and sortby corresponds to a column with an index, then the copy will be sorted by this index. If you want to ensure a fully sorted order, the index must be a CSI one. A reverse sorted copy can be achieved by specifying a negative value for the step keyword. If sortby is omitted or None, the original table order is used.
Third, still with PyTables, you can use the method Table.itersorted().
From the docs:
Table.itersorted(sortby, checkCSI=False, start=None, stop=None, step=None)
Iterate table data following the order of the index of sortby column. The sortby column must have associated a full index.
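A minimal PyTables sketch of the second and third options, assuming data.h5 contains a Table node at /data with a P_VALUE column (all names here are placeholders):

import tables

with tables.open_file("data.h5", mode="a") as h5:
    table = h5.root.data

    # Both approaches need a completely sorted (CSI) index on the column.
    if not table.cols.P_VALUE.is_indexed:
        table.cols.P_VALUE.create_csindex()

    # Option A: iterate the rows in sorted order without loading them all.
    for row in table.itersorted("P_VALUE", checkCSI=True):
        pass  # process row["P_VALUE"], etc.

    # Option B: write an on-disk copy of the table, sorted by the index.
    table.copy(newparent="/", newname="data_sorted",
               sortby="P_VALUE", checkCSI=True, overwrite=True)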
Another approach consists of using a database in between. The detailed workflow can be seen in this IPython Notebook published at plot.ly.
This makes it possible to solve the sorting problem, along with other data analyses that pandas allows. It looks like it was created by the user chris, so all the credit goes to him. I am copying the relevant parts here.
Introduction
This notebook explores a 3.9 GB CSV file.
This notebook is a primer on out-of-memory data analysis with
- pandas: A library with easy-to-use data structures and data analysis tools. Also, interfaces to out-of-memory databases like SQLite.
- IPython notebook: An interface for writing and sharing python code, text, and plots.
- SQLite: A self-contained, server-less database that's easy to set up and query from Pandas.
- Plotly: A platform for publishing beautiful, interactive graphs from Python to the web.
Requirements
import pandas as pd
from sqlalchemy import create_engine # database connection
Import the CSV data into SQLite
- Load the CSV, chunk-by-chunk, into a DataFrame
- Process the data a bit, strip out uninteresting columns
- Append it to the SQLite database
disk_engine = create_engine('sqlite:///311_8M.db') # Initializes database with filename 311_8M.db in current directory
chunksize = 20000
index_start = 1

for df in pd.read_csv('311_100M.csv', chunksize=chunksize, iterator=True, encoding='utf-8'):
    # do stuff
    df.index += index_start
    df.to_sql('data', disk_engine, if_exists='append')
    index_start = df.index[-1] + 1
Query value counts and order the results
Housing and Development Dept receives the most complaints
df = pd.read_sql_query('SELECT Agency, COUNT(*) as `num_complaints` '
                       'FROM data '
                       'GROUP BY Agency '
                       'ORDER BY -num_complaints', disk_engine)
Limiting the number of sorted entries
Which 10 cities receive the most complaints?
df = pd.read_sql_query('SELECT City, COUNT(*) as `num_complaints` '
                       'FROM data '
                       'GROUP BY `City` '
                       'ORDER BY -num_complaints '
                       'LIMIT 10 ', disk_engine)
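Applied to the original question, once the big CSV has been loaded into SQLite in the same way, the database can do the sorting and hand the rows back in chunks. A minimal sketch, assuming the data went into a table named data with a P_VALUE column (the database and file names here are placeholders):

import pandas as pd
from sqlalchemy import create_engine

disk_engine = create_engine('sqlite:///my_data.db')

# Let SQLite sort on disk, then stream the result back chunk by chunk so the
# full sorted result never has to fit in memory at once.
first = True
for chunk in pd.read_sql_query('SELECT * FROM data ORDER BY P_VALUE ASC',
                               disk_engine, chunksize=20000):
    chunk.to_csv('sorted_by_pvalue.csv', mode='a', header=first, index=False)
    first = False

Creating an index on the column first (CREATE INDEX idx_pvalue ON data (P_VALUE)) should speed up the ORDER BY.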
Possibly related and useful links
Answered by Back2Basics
Blaze might be the tool for you, with the ability to work with pandas and CSV files out of core: http://blaze.readthedocs.org/en/latest/ooc.html
import blaze
import pandas as pd
d = blaze.Data('my-large-file.csv')
d.P_VALUE.sort() # Uses Chunked Pandas
For faster processing, first load it into a database that Blaze can control. But if this is a one-off and you have some time, then the posted code should do it.
Answered by ZFY
If your CSV file contains only structured data, I would suggest an approach using only Linux commands.
Assume the CSV file contains two columns, COL_1 and P_VALUE:
map.py:
import sys

for line in sys.stdin:
    # Swap the columns so that P_VALUE comes first; `sort` will then sort on it.
    col_1, p_value = line.rstrip('\n').split(',')
    print('%s,%s' % (p_value, col_1))
Then the following Linux command will generate the CSV file sorted by p_value:
cat input.csv | python map.py | sort -t , -k 1 -g > output.csv
If you're familiar with Hadoop, using the above map.py and adding a simple reduce.py will generate the sorted CSV file via the Hadoop streaming system.
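For illustration only, a minimal reduce.py could simply pass the already-sorted lines through (an identity reducer). Note that Hadoop streaming compares keys as text by default, so for true numeric ordering of p-values you would typically also configure a numeric key comparator.

import sys

# Identity reducer: the streaming framework has already sorted the mapper
# output by key, so just echo every line unchanged.
for line in sys.stdin:
    sys.stdout.write(line)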
Answered by Sampath
Here is my honest suggestion; there are three options you can try.
- I like pandas for its rich documentation and features, but I have been advised to use NumPy, as it can feel comparatively faster for larger datasets. You can also think of using other tools for an easier job.
- In case you are using Python 3, you can break your big data into chunks and do concurrent threading. I am too lazy for this and it does not look cool; as I understand it, pandas, NumPy and SciPy are built with hardware design perspectives to enable multi-threading.
- I prefer this last one: it is the easy and lazy technique, in my view. Check the documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort.html
You can also use the 'kind' parameter in the pandas sort function you are using.
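For example, with a recent pandas (where sort_values replaces the old sort shown above) that would look something like this; note it still needs the frame to fit in memory:

# In-memory sort; `kind` selects the underlying algorithm (e.g. a stable mergesort).
data = data.sort_values('P_VALUE', ascending=True, kind='mergesort')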
Godspeed my friend.

