Python "large data" workflows using pandas

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/14262433/

"Large data" work flows using pandas

python, mongodb, pandas, hdf5, large-data

Asked by Zelazny7

I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work and it is great for its out-of-core support. However, SAS is horrible as a piece of software for numerous other reasons.

One day I hope to replace my use of SAS with python and pandas, but I currently lack an out-of-core workflow for large datasets. I'm not talking about "big data" that requires a distributed network, but rather files too large to fit in memory but small enough to fit on a hard-drive.

My first thought is to use HDFStoreto hold large datasets on disk and pull only the pieces I need into dataframes for analysis. Others have mentioned MongoDB as an easier to use alternative. My question is this:

What are some best-practice workflows for accomplishing the following:

  1. Loading flat files into a permanent, on-disk database structure
  2. Querying that database to retrieve data to feed into a pandas data structure
  3. Updating the database after manipulating pieces in pandas

Real-world examples would be much appreciated, especially from anyone who uses pandas on "large data".

Edit -- an example of how I would like this to work:

  1. Iteratively import a large flat-file and store it in a permanent, on-disk database structure. These files are typically too large to fit in memory.
  2. In order to use Pandas, I would like to read subsets of this data (usually just a few columns at a time) that can fit in memory.
  3. I would create new columns by performing various operations on the selected columns.
  4. I would then have to append these new columns into the database structure.

I am trying to find a best-practice way of performing these steps. Reading links about pandas and pytables, it seems that appending a new column could be a problem.

Edit -- Responding to Jeff's questions specifically:

  1. I am building consumer credit risk models. The kinds of data include phone, SSN and address characteristics; property values; derogatory information like criminal records, bankruptcies, etc... The datasets I use every day have nearly 1,000 to 2,000 fields on average of mixed data types: continuous, nominal and ordinal variables of both numeric and character data. I rarely append rows, but I do perform many operations that create new columns.
  2. Typical operations involve combining several columns using conditional logic into a new, compound column. For example, if var1 > 2 then newvar = 'A' elif var2 = 4 then newvar = 'B'. The result of these operations is a new column for every record in my dataset (see the sketch after this list).
  3. Finally, I would like to append these new columns into the on-disk data structure. I would repeat step 2, exploring the data with crosstabs and descriptive statistics trying to find interesting, intuitive relationships to model.
  4. A typical project file is usually about 1GB. Files are organized into such a manner where a row consists of a record of consumer data. Each row has the same number of columns for every record. This will always be the case.
  5. It's pretty rare that I would subset by rows when creating a new column. However, it's pretty common for me to subset on rows when creating reports or generating descriptive statistics. For example, I might want to create a simple frequency for a specific line of business, say Retail credit cards. To do this, I would select only those records where the line of business = retail in addition to whichever columns I want to report on. When creating new columns, however, I would pull all rows of data and only the columns I need for the operations.
  6. The modeling process requires that I analyze every column, look for interesting relationships with some outcome variable, and create new compound columns that describe those relationships. The columns that I explore are usually done in small sets. For example, I will focus on a set of say 20 columns just dealing with property values and observe how they relate to defaulting on a loan. Once those are explored and new columns are created, I then move on to another group of columns, say college education, and repeat the process. What I'm doing is creating candidate variables that explain the relationship between my data and some outcome. At the very end of this process, I apply some learning techniques that create an equation out of those compound columns.
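
A minimal pandas sketch of that kind of conditional recode, using np.select with made-up column values (an illustration only, not the actual data):

import numpy as np
import pandas as pd

df = pd.DataFrame({"var1": [3, 1, 0], "var2": [2, 4, 1]})   # toy data

# conditions are evaluated in order, like an if/elif chain; "C" is the fallback
conditions = [df["var1"] > 2, df["var2"] == 4]
df["newvar"] = np.select(conditions, ["A", "B"], default="C")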

It is rare that I would ever add rows to the dataset. I will nearly always be creating new columns (variables or features in statistics/machine learning parlance).

Accepted answer by Jeff

I routinely use tens of gigabytes of data in just this fashion e.g. I have tables on disk that I read via queries, create data and append back.

It's worth reading the docs and late in this thread for several suggestions for how to store your data.

Details which will affect how you store your data, like:
Give as much detail as you can; and I can help you develop a structure.

  1. Size of data, # of rows, columns, types of columns; are you appending rows, or just columns?
  2. What will typical operations look like. E.g. do a query on columns to select a bunch of rows and specific columns, then do an operation (in-memory), create new columns, save these.
    (Giving a toy example could enable us to offer more specific recommendations.)
  3. After that processing, then what do you do? Is step 2 ad hoc, or repeatable?
  4. Input flat files: how many, rough total size in Gb. How are these organized e.g. by records? Does each one contain different fields, or do they have some records per file with all of the fields in each file?
  5. Do you ever select subsets of rows (records) based on criteria (e.g. select the rows with field A > 5)? and then do something, or do you just select fields A, B, C with all of the records (and then do something)?
  6. Do you 'work on' all of your columns (in groups), or are there a good proportion that you may only use for reports (e.g. you want to keep the data around, but don't need to pull in that column explicitly until final results time)?

Solution

Ensure you have pandas at least 0.10.1 installed.

Read the docs on iterating files chunk-by-chunk and on multiple table queries.

Since pytables is optimized to operate row-wise (which is what you query on), we will create a table for each group of fields. This way it's easy to select a small group of fields (which will work with a big table, but it's more efficient to do it this way... I think I may be able to fix this limitation in the future... this is more intuitive anyhow):
(The following is pseudocode.)

import numpy as np
import pandas as pd

# create a store
store = pd.HDFStore('mystore.h5')

# this is the key to your storage:
#    this maps your fields to a specific group, and defines 
#    what you want to have as data_columns.
#    you might want to create a nice class wrapping this
#    (as you will want to have this map and its inversion)  
group_map = dict(
    A = dict(fields = ['field_1','field_2',.....], dc = ['field_1',....,'field_5']),
    B = dict(fields = ['field_10',......        ], dc = ['field_10']),
    .....
    REPORTING_ONLY = dict(fields = ['field_1000','field_1001',...], dc = []),

)

group_map_inverted = dict()
for g, v in group_map.items():
    group_map_inverted.update(dict([ (f,g) for f in v['fields'] ]))

Reading in the files and creating the storage (essentially doing what append_to_multiple does):

for f in files:
   # read in the file, additional options may be necessary here
   # the chunksize is not strictly necessary, you may be able to slurp each 
   # file into memory in which case just eliminate this part of the loop 
   # (you can also change chunksize if necessary)
   for chunk in pd.read_table(f, chunksize=50000):
       # we are going to append to each table by group
       # we are not going to create indexes at this time
       # but we *ARE* going to create (some) data_columns

       # figure out the field groupings
       for g, v in group_map.items():
             # create the frame for this group
             frame = chunk.reindex(columns = v['fields'], copy = False)    

             # append it
             store.append(g, frame, index=False, data_columns = v['dc'])

Now you have all of the tables in the file (actually you could store them in separate files if you wish, you would prob have to add the filename to the group_map, but probably this isn't necessary).

This is how you get columns and create new ones:

frame = store.select(group_that_I_want)
# you can optionally specify:
# columns = a list of the columns IN THAT GROUP (if you wanted to
#     select only say 3 out of the 20 columns in this sub-table)
# and a where clause if you want a subset of the rows

# do calculations on this frame
new_frame = cool_function_on_frame(frame)

# to 'add columns', create a new group (you probably want to
# limit the columns in this new_group to be only NEW ones
# (e.g. so you don't overlap from the other tables)
# add this info to the group_map
store.append(new_group, new_frame.reindex(columns = new_columns_created, copy = False), data_columns = new_columns_created)

When you are ready for post_processing:

# This may be a bit tricky; and depends what you are actually doing.
# I may need to modify this function to be a bit more general:
report_data = store.select_as_multiple([groups_1,groups_2,.....], where =['field_1>0', 'field_1000=foo'], selector = group_1)

About data_columns, you don't actually need to define ANY data_columns; they allow you to sub-select rows based on the column. E.g. something like:

store.select(group, where = ['field_1000=foo', 'field_1001>0'])

They may be most interesting to you in the final report generation stage (essentially a data column is segregated from other columns, which might impact efficiency somewhat if you define a lot).

You also might want to:

  • create a function which takes a list of fields, looks up the groups in the groups_map, then selects these and concatenates the results so you get the resulting frame (this is essentially what select_as_multiple does). This way the structure would be pretty transparent to you (a rough sketch follows after this list).
  • indexes on certain data columns (makes row-subsetting much faster).
  • enable compression.
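
A rough sketch of such a helper, building on the group_map / group_map_inverted dicts defined above; it assumes every group table was appended in the same row order (so an axis-1 concat lines up) and that any where clause only refers to data_columns present in each group:

def select_fields(store, fields, group_map_inverted, where=None):
    # figure out which group each requested field lives in
    groups = {}
    for f in fields:
        groups.setdefault(group_map_inverted[f], []).append(f)

    # pull just those columns from each group and glue the pieces together
    frames = [store.select(g, columns=cols, where=where)
              for g, cols in groups.items()]
    return pd.concat(frames, axis=1)

For the last two bullets, a hypothetical illustration (key and column names taken from the pseudocode above) might look like:

# index the data_columns you filter on most often (makes row subsetting faster)
store.create_table_index('A', columns=['field_1'], kind='full')

# compression is specified when the store is opened and applies to newly written tables
store = pd.HDFStore('mystore.h5', complevel=9, complib='blosc')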

Let me know when you have questions!

Answered by brian_the_bungler

This is the case for pymongo. I have also prototyped using sql server, sqlite, HDF, ORM (SQLAlchemy) in python. First and foremost pymongo is a document based DB, so each person would be a document (dict of attributes). Many people form a collection and you can have many collections (people, stock market, income).

pd.DataFrame -> pymongo. Note: I use the chunksize in read_csv to keep it to 5 to 10k records (pymongo drops the socket if larger)

aCollection.insert((a[1].to_dict() for a in df.iterrows()))
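
A minimal sketch of that chunked load, assuming a local MongoDB instance and made-up file/collection names (insert_many is the newer pymongo spelling of insert):

import pandas as pd
from pymongo import MongoClient

client = MongoClient()                    # local MongoDB
people = client["mydb"]["people"]         # hypothetical database / collection

# stream the flat file in ~10k-row chunks so each insert stays small
for chunk in pd.read_csv("people.csv", chunksize=10000):
    people.insert_many(chunk.to_dict(orient="records"))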

querying: gt = greater than...

pd.DataFrame(list(mongoCollection.find({'anAttribute':{'$gt':2887000, '$lt':2889000}})))

.find() returns an iterator so I commonly use ichunked to chop into smaller iterators.

How about a join since I normally get 10 data sources to paste together:

aJoinDF = pandas.DataFrame(list(mongoCollection.find({'anAttribute':{'$in':Att_Keys}})))

then (in my case sometimes I have to agg on aJoinDF first before it's "mergeable".)

df = pandas.merge(df, aJoinDF, on=aKey, how='left')

And you can then write the new info to your main collection via the update method below. (logical collection vs physical datasources).

collection.update({primarykey:foo},{key:change})

On smaller lookups, just denormalize. For example, you have code in the document and you just add the field code text and do a dict lookup as you create documents.

Now you have a nice dataset based around a person, you can unleash your logic on each case and make more attributes. Finally you can read into pandas your 3 to memory max key indicators and do pivots/agg/data exploration. This works for me for 3 million records with numbers/big text/categories/codes/floats/...

You can also use the two methods built into MongoDB (MapReduce and aggregate framework). See here for more info about the aggregate framework, as it seems to be easier than MapReduce and looks handy for quick aggregate work. Notice I didn't need to define my fields or relations, and I can add items to a document. At the current state of the rapidly changing numpy, pandas, python toolset, MongoDB helps me just get to work :)
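
For example, a small aggregation-framework pipeline (field names made up here) that does the grouping server-side before anything reaches pandas:

pipeline = [
    {"$match": {"line_of_business": "retail"}},
    {"$group": {"_id": "$state",
                "n": {"$sum": 1},
                "avg_income": {"$avg": "$income"}}},
]
summary = pd.DataFrame(list(mongoCollection.aggregate(pipeline)))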

Answered by Johann Hibschman

I spotted this a little late, but I work with a similar problem (mortgage prepayment models). My solution has been to skip the pandas HDFStore layer and use straight pytables. I save each column as an individual HDF5 array in my final file.

My basic workflow is to first get a CSV file from the database. I gzip it, so it's not as huge. Then I convert that to a row-oriented HDF5 file, by iterating over it in python, converting each row to a real data type, and writing it to a HDF5 file. That takes some tens of minutes, but it doesn't use any memory, since it's only operating row-by-row. Then I "transpose" the row-oriented HDF5 file into a column-oriented HDF5 file.

The table transpose looks like:

import logging

import tables

logger = logging.getLogger(__name__)

def transpose_table(h_in, table_path, h_out, group_name="data", group_path="/"):
    # Get a reference to the input data.
    tb = h_in.getNode(table_path)
    # Create the output group to hold the columns.
    grp = h_out.createGroup(group_path, group_name, filters=tables.Filters(complevel=1))
    for col_name in tb.colnames:
        logger.debug("Processing %s", col_name)
        # Get the data.
        col_data = tb.col(col_name)
        # Create the output array.
        arr = h_out.createCArray(grp,
                                 col_name,
                                 tables.Atom.from_dtype(col_data.dtype),
                                 col_data.shape)
        # Store the data.
        arr[:] = col_data
    h_out.flush()

Reading it back in then looks like:

import numpy as np
import pandas as pd
import tables

def read_hdf5(hdf5_path, group_path="/data", columns=None):
    """Read a transposed data set from a HDF5 file."""
    if isinstance(hdf5_path, tables.file.File):
        hf = hdf5_path
    else:
        hf = tables.openFile(hdf5_path)

    grp = hf.getNode(group_path)
    if columns is None:
        data = [(child.name, child[:]) for child in grp]
    else:
        data = [(child.name, child[:]) for child in grp if child.name in columns]

    # Convert any float32 columns to float64 for processing.
    for i in range(len(data)):
        name, vec = data[i]
        if vec.dtype == np.float32:
            data[i] = (name, vec.astype(np.float64))

    if not isinstance(hdf5_path, tables.file.File):
        hf.close()
    # from_items is the older pandas spelling; pd.DataFrame(dict(data)) also works
    return pd.DataFrame.from_items(data)
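
Hypothetical usage of the two helpers above, following the same (older) PyTables spelling the functions use; the file paths and column names are invented:

h_in = tables.openFile("rows.h5")          # row-oriented file built from the CSV
h_out = tables.openFile("columns.h5", "w")
transpose_table(h_in, "/data/table", h_out)
h_in.close()
h_out.close()

# later, pull just the columns needed for one piece of the analysis
df = read_hdf5("columns.h5", columns=["loan_age", "coupon"])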

Now, I generally run this on a machine with a ton of memory, so I may not be careful enough with my memory usage. For example, by default the load operation reads the whole data set.

This generally works for me, but it's a bit clunky, and I can't use the fancy pytables magic.

Edit: The real advantage of this approach, over the array-of-records pytables default, is that I can then load the data into R using h5r, which can't handle tables. Or, at least, I've been unable to get it to load heterogeneous tables.

Answered by rjurney

If your datasets are between 1 and 20GB, you should get a workstation with 48GB of RAM. Then Pandas can hold the entire dataset in RAM. I know its not the answer you're looking for here, but doing scientific computing on a notebook with 4GB of RAM isn't reasonable.

Answered by user1827356

I think the answers above are missing a simple approach that I've found very useful.

When I have a file that is too large to load in memory, I break up the file into multiple smaller files (either by row or cols)

Example: in the case of 30 days' worth of trading data (~30GB), I break it into one file per day of ~1GB. I subsequently process each file separately and aggregate the results at the end

One of the biggest advantages is that it allows parallel processing of the files (either multiple threads or processes)
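
A minimal sketch of that pattern with multiprocessing (the file pattern and column names are invented):

import glob
from multiprocessing import Pool

import pandas as pd

def process_one_day(path):
    # each ~1GB daily file fits in memory on its own
    df = pd.read_csv(path)
    # do the per-file work and return only a small aggregate
    return df.groupby("symbol")["volume"].sum()

if __name__ == "__main__":
    daily_files = sorted(glob.glob("trades_*.csv"))   # one file per day
    with Pool(processes=4) as pool:
        partials = pool.map(process_one_day, daily_files)
    # combine the per-day results at the end
    totals = pd.concat(partials).groupby(level=0).sum()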

The other advantage is that file manipulation (like adding/removing dates in the example) can be accomplished by regular shell commands, which is not possible in more advanced/complicated file formats

This approach doesn't cover all scenarios, but is very useful in a lot of them

Answered by Golf Monkey

Consider Ruffus if you go the simple path of creating a data pipeline which is broken down into multiple smaller files.

Answered by chishaku

I know this is an old thread but I think the Blaze library is worth checking out. It's built for these types of situations.

From the docs:

Blaze extends the usability of NumPy and Pandas to distributed and out-of-core computing. Blaze provides an interface similar to that of the NumPy ND-Array or Pandas DataFrame but maps these familiar interfaces onto a variety of other computational engines like Postgres or Spark.

Edit: By the way, it's supported by ContinuumIO and Travis Oliphant, author of NumPy.

Answered by Ophir Yoktan

One more variation

Many of the operations done in pandas can also be done as a db query (sql, mongo)

Using a RDBMS or mongodb allows you to perform some of the aggregations in the DB Query (which is optimized for large data, and uses cache and indexes efficiently)

Later, you can perform post processing using pandas.
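
As a sketch (sqlite here only to keep the example self-contained; the table and column names are made up):

import sqlite3

import pandas as pd

conn = sqlite3.connect("consumers.db")

# let the database do the heavy aggregation...
summary = pd.read_sql_query(
    """
    SELECT line_of_business, COUNT(*) AS n, AVG(balance) AS avg_balance
    FROM accounts
    GROUP BY line_of_business
    """,
    conn,
)

# ...then post-process the much smaller result in pandas
summary["share"] = summary["n"] / summary["n"].sum()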

The advantage of this method is that you gain the DB optimizations for working with large data, while still defining the logic in a high level declarative syntax - and not having to deal with the details of deciding what to do in memory and what to do out of core.

And although the query language and pandas are different, it's usually not complicated to translate part of the logic from one to another.

Answered by Private

There is now, two years after the question, an 'out-of-core' pandas equivalent: dask. It is excellent! Though it does not support all of pandas functionality, you can get really far with it.
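
A tiny dask sketch (the file pattern and column names are invented); expressions build a lazy task graph and compute() does the work chunk-by-chunk:

import dask.dataframe as dd

# lazily point at a set of CSVs that would not fit in memory together
ddf = dd.read_csv("data/part-*.csv")

# familiar pandas-style expressions; nothing is read until compute()
result = (ddf[ddf.balance > 0]
          .groupby("line_of_business")["balance"]
          .mean()
          .compute())          # returns an ordinary pandas Series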

Answered by timpjohns

I recently came across a similar issue. I found simply reading the data in chunks and appending it as I write it in chunks to the same csv works well. My problem was adding a date column based on information in another table, using the value of certain columns as follows. This may help those confused by dask and hdf5 but more familiar with pandas like myself.

import pandas as pd

def addDateColumn():
    """Adds time to the daily rainfall data. Reads the csv as chunks of 100k
    rows at a time and outputs them, appending as needed, to a single csv.
    Uses the column of the raster names to get the date.
    """
    # pathlist and newyears are defined elsewhere in the original script
    df = pd.read_csv(pathlist[1]+"CHIRPS_tanz.csv", iterator=True,
                     chunksize=100000) #read csv file as 100k chunks

    '''Do some stuff'''

    count = 1 #for indexing item in time list
    chunknum = 0 #track how many chunks have been processed
    for chunk in df: #for each 100k rows
        newtime = [] #empty list to append repeating times for different rows
        toiterate = chunk[chunk.columns[2]] #ID of raster nums to base time
        while count <= toiterate.max():
            for i in toiterate:
                if i == count:
                    newtime.append(newyears[count])
            count += 1
        chunknum += 1
        print("Finished", chunknum, "chunks")
        chunk["time"] = newtime #create new column in dataframe based on time
        outname = "CHIRPS_tanz_time2.csv"
        #append each output to same csv, using no header
        chunk.to_csv(pathlist[2]+outname, mode='a', header=None, index=None)