Original source: http://stackoverflow.com/questions/31077382/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Pandas: Applying Lambda to Multiple Data Frames
Asked by jtrowbridge
I'm trying to figure out how to apply a lambda function to multiple dataframes simultaneously, without first merging the data frames together. I am working with large data sets (>60MM records) and I need to be extra careful with memory management.
My hope is that there is a way to apply lambda to just the underlying dataframes so that I can avoid the cost of stitching them together first, and then dropping that intermediary dataframe from memory before I move on to the next step in the process.
I have experience dodging out-of-memory issues by using HDF5-based dataframes, but I'd rather try exploring something different first.
I have provided a toy problem to help demonstrate what I am talking about.
import numpy as np
import pandas as pd
# Here's an arbitrary function to use with lambda
def someFunction(input1, input2, input3, input4):
    theSum = input1 + input2
    theAverage = (input1 + input2 + input3 + input4) / 4
    theProduct = input2 * input3 * input4
    return pd.Series({'Sum' : theSum, 'Average' : theAverage, 'Product' : theProduct})
# Cook up some dummy dataframes
df1 = pd.DataFrame(np.random.randn(6,2),columns=list('AB'))
df2 = pd.DataFrame(np.random.randn(6,1),columns=list('C'))
df3 = pd.DataFrame(np.random.randn(6,1),columns=list('D'))
# Currently, I merge the dataframes together and then apply the lambda function
dfConsolodated = pd.concat([df1, df2, df3], axis=1)
# This works just fine, but merging the dataframes seems like an extra step
dfResults = dfConsolodated.apply(lambda x: someFunction(x['A'], x['B'], x['C'], x['D']), axis = 1)
# I want to avoid the concat completely in order to be more efficient with memory. I am hoping for something like this:
# I am COMPLETELY making this syntax up for conceptual purposes, my apologies.
dfResultsWithoutConcat = [df1, df2, df3].apply(lambda x: someFunction(df1['A'], df1['B'], df2['C'], df3['D']), axis = 1)
Accepted answer by Alexander
One option would be to explicitly create the desired aggregations:
theSum = df1.A + df1.B
theAverage = (df1.A + df1.B + df2.C + df3.D) / 4.
theProduct = df1.B * df2.C * df3.D
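# axis=1 places the three Series side by side as columns; the default (axis=0) would stack them into one long Series
theResult = pd.concat([theSum, theAverage, theProduct], axis=1)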
theResult.columns = ['Sum', 'Average', 'Product']
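As a fuller sketch of the explicit approach, the snippet below rebuilds the three toy frames from the question and assembles the result column-wise. It assumes the frames share the same row index, since pandas aligns on index labels when adding or multiplying Series. The `keys` argument to `pd.concat` is used here in place of assigning `columns` afterwards, and the seed and the name `result` are only illustrative.

```python
import numpy as np
import pandas as pd

# Toy frames from the question; the seed is an assumption for repeatability.
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.standard_normal((6, 2)), columns=list('AB'))
df2 = pd.DataFrame(rng.standard_normal((6, 1)), columns=list('C'))
df3 = pd.DataFrame(rng.standard_normal((6, 1)), columns=list('D'))

# Column arithmetic is vectorized and aligned on the shared index,
# so the input frames never have to be concatenated first.
theSum = df1['A'] + df1['B']
theAverage = (df1['A'] + df1['B'] + df2['C'] + df3['D']) / 4
theProduct = df1['B'] * df2['C'] * df3['D']

# axis=1 places the three Series side by side; keys names the columns.
result = pd.concat([theSum, theAverage, theProduct], axis=1,
                   keys=['Sum', 'Average', 'Product'])
print(result)
```

Only the three output columns are materialized, which is the memory saving the question is after. If the frames do not share an index, the arithmetic would produce NaN for unmatched labels, so that assumption is worth checking first.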
Another possibility is to use query, but this really depends on your use case and how you intend to aggregate your data. Here is an example from the docs that might be applicable to you.
map(lambda frame: frame.query(expr), [df, df2])
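The line above leaves `expr`, `df`, and `df2` undefined. A minimal sketch, assuming two hypothetical frames that both have columns `a` and `b` and a made-up filter expression, might look like the following. In Python 3 `map` returns a lazy iterator, so it is wrapped in `list` here. Note also that `query` filters rows within each frame; it does not combine values across frames.

```python
import pandas as pd

# Hypothetical frames and expression, chosen only for illustration.
df = pd.DataFrame({'a': [1, 5, 3], 'b': [4, 2, 6]})
df2 = pd.DataFrame({'a': [7, 0, 2], 'b': [1, 9, 2]})
expr = 'a < b'

# Apply the same query to each frame independently.
filtered = list(map(lambda frame: frame.query(expr), [df, df2]))
for frame in filtered:
    print(frame)
```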
Answer by arthaigo
I know this question is kind of old, but here is a way I came up with. It is not nice, but it works.
The basic idea is to query the second dataframe inside the applied function. By using the name of the passed series, you can identify the column/index and use it to retrieve the needed value from the other dataframe(s).
def func(x, other):
    other_value = other.loc[x.name]
    return your_actual_method(x, other_value)
result = df1.apply(lambda x: func(x, df2))
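To make the idea concrete, here is a sketch that applies it row by row. It assumes `df1` and `df2` share the same row index, passes `axis=1` so that `x.name` is the row label, and uses a hypothetical `your_actual_method` that simply adds values. The sample data is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real per-row computation.
def your_actual_method(row, other_row):
    return row['A'] + row['B'] + other_row['C']

def func(x, other):
    # With axis=1, x.name is the index label of the current row.
    other_value = other.loc[x.name]
    return your_actual_method(x, other_value)

df1 = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]})
df2 = pd.DataFrame({'C': [100.0, 200.0, 300.0]})

# Each row of df1 looks up the matching row of df2 by label.
result = df1.apply(lambda x: func(x, df2), axis=1)
print(result.tolist())
```

Because this runs Python code and a label lookup once per row, it is likely to be much slower than vectorized column arithmetic on frames with tens of millions of rows, though it does avoid building a merged frame.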

