Fillna (forward fill) on a large dataframe efficiently with groupby?
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original source: http://stackoverflow.com/questions/36871783/
Asked by trench
What is the most efficient way to forward fill information in a large dataframe?
I combined about 6 million rows x 50 columns of dimensional data from daily files. I dropped the duplicates and now have about 200,000 rows of unique data that track any change that happens to one of the dimensions.
Unfortunately, some of the raw data is messed up and has null values. How do I efficiently fill in the null data with the previous values?
id start_date end_date is_current location dimensions...
xyz987 2016-03-11 2016-04-02 Expired CA lots_of_stuff
xyz987 2016-04-03 2016-04-21 Expired NaN lots_of_stuff
xyz987 2016-04-22 NaN Current CA lots_of_stuff
That's the basic shape of the data. The issue is that some dimensions are blank when they shouldn't be (this is an error in the raw data). For example, the location is filled in on one row but blank on the next. I know that the location has not changed, but because the field is blank the row is captured as unique.
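For reference, a minimal frame with this shape might be built as follows (the abc123 rows and their values are invented for illustration):

import pandas as pd
import numpy as np

# Hypothetical repro of the shape above; the abc123 id and its values are made up
df = pd.DataFrame({
    'id': ['xyz987', 'xyz987', 'xyz987', 'abc123', 'abc123'],
    'start_date': ['2016-03-11', '2016-04-03', '2016-04-22', '2016-01-05', '2016-02-10'],
    'end_date': ['2016-04-02', '2016-04-21', np.nan, '2016-02-09', np.nan],
    'is_current': ['Expired', 'Expired', 'Current', 'Expired', 'Current'],
    'location': ['CA', np.nan, 'CA', 'NY', np.nan],
})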
I assume that I need to do a groupby using the ID field. Is this the correct syntax? Do I need to list all of the columns in the dataframe?
cols = [list of all of the columns in the dataframe]
wfm.groupby(['id'])[cols].fillna(method='ffill', inplace=True)
There are about 75,000 unique IDs within the 200,000 row dataframe. I tried doing a
df.fillna(method='ffill', inplace=True)
but I need to do it based on the IDs and I want to make sure that I am being as efficient as possible (it took my computer a long time to read and consolidate all of these files into memory).
Accepted answer by Alexander
How about forward filling each group?
df = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
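As a quick sanity check on the hypothetical frame sketched in the question above, this fills gaps only within each id:

filled = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
# xyz987's blank location is filled with 'CA' from its own earlier row,
# while abc123's blank location gets 'NY'; nothing leaks across ids
print(filled[['id', 'location']])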
Answered by bbaker
Answered by xmduhan
From jreback on GitHub (https://github.com/pandas-dev/pandas/issues/11296): this is a dupe of #7895. .ffill is not implemented in cython on a groupby operation (though it certainly could be), and instead calls python space on each group. here's an easy way to do this.
According to jreback's answer, ffill() is not optimized when used with a groupby, but cumsum() is. Try this:
df = df.sort_values('id')
# Note: group the 0/1 not-null indicator frame by the actual id values (df['id']);
# in the indicator frame the 'id' column itself is all 1s, so grouping by the
# string 'id' would lump every row into a single group
df.ffill() * (1 - df.isnull().astype(int)).groupby(df['id']).cumsum().applymap(lambda x: None if x == 0 else 1)
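To see why the per-group cumulative count matters, here is a small numeric sketch of the same idea (made-up data), using .where for the masking step instead of the multiplication/applymap trick:

import pandas as pd
import numpy as np

df = pd.DataFrame({'id': ['a', 'a', 'b', 'b'],
                   'value': [1.0, np.nan, np.nan, 2.0]}).sort_values('id')

# Per-group running count of non-null cells in each column;
# a zero means no value has appeared yet in that id group
seen = df.notnull().astype(int).groupby(df['id']).cumsum()

# Keep a forward-filled cell only if its own group already had a value,
# so the 1.0 from group 'a' never leaks into group 'b'
result = df.ffill().where(seen > 0)
print(result)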