Fillna (forward fill) on a large dataframe efficiently with groupby?
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original source: http://stackoverflow.com/questions/36871783/
Asked by trench
What is the most efficient way to forward fill information in a large dataframe?
I combined about 6 million rows x 50 columns of dimensional data from daily files. I dropped the duplicates and now have about 200,000 rows of unique data that track any change that happens to one of the dimensions.
Unfortunately, some of the raw data is messed up and has null values. How do I efficiently fill in the null data with the previous values?
id start_date end_date is_current location dimensions...
xyz987 2016-03-11 2016-04-02 Expired CA lots_of_stuff
xyz987 2016-04-03 2016-04-21 Expired NaN lots_of_stuff
xyz987 2016-04-22 NaN Current CA lots_of_stuff
That's the basic shape of the data. The issue is that some dimensions are blank when they shouldn't be (this is an error in the raw data). For example, the location is filled in on one row but blank on the next. I know that the location has not changed, but because the field is blank the row is captured as unique.
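For reference, a minimal frame with this shape might be built as follows (the abc123 rows and their values are invented for illustration):

import pandas as pd
import numpy as np

# Hypothetical repro of the shape above; the abc123 id and its values are made up
df = pd.DataFrame({
    'id': ['xyz987', 'xyz987', 'xyz987', 'abc123', 'abc123'],
    'start_date': ['2016-03-11', '2016-04-03', '2016-04-22', '2016-01-05', '2016-02-10'],
    'end_date': ['2016-04-02', '2016-04-21', np.nan, '2016-02-09', np.nan],
    'is_current': ['Expired', 'Expired', 'Current', 'Expired', 'Current'],
    'location': ['CA', np.nan, 'CA', 'NY', np.nan],
})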
I assume that I need to do a groupby using the ID field. Is this the correct syntax? Do I need to list all of the columns in the dataframe?
cols = [list of all of the columns in the dataframe]
wfm.groupby(['id'])[cols].fillna(method='ffill', inplace=True)
There are about 75,000 unique IDs within the 200,000 row dataframe. I tried doing a
df.fillna(method='ffill', inplace=True)
but I need to do it based on the IDs and I want to make sure that I am being as efficient as possible (it took my computer a long time to read and consolidate all of these files into memory).
Accepted answer by Alexander
How about forward filling each group?
df = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
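As a quick sanity check on the hypothetical frame sketched in the question above, this fills gaps only within each id:

filled = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
# xyz987's blank location is filled with 'CA' from its own earlier row,
# while abc123's blank location gets 'NY'; nothing leaks across ids
print(filled[['id', 'location']])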
Answered by bbaker
Answered by xmduhan
From jreback on GitHub (https://github.com/pandas-dev/pandas/issues/11296): this is a dupe of #7895. .ffill is not implemented in cython on a groupby operation (though it certainly could be), and instead calls python space on each group. here's an easy way to do this.
According to jreback's answer, ffill() is not optimized when used with a groupby, but cumsum() is. Try this:
df = df.sort_values('id')
# Note: group the 0/1 not-null indicator frame by the actual id values (df['id']);
# in the indicator frame the 'id' column itself is all 1s, so grouping by the
# string 'id' would lump every row into a single group
df.ffill() * (1 - df.isnull().astype(int)).groupby(df['id']).cumsum().applymap(lambda x: None if x == 0 else 1)
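To see why the per-group cumulative count matters, here is a small numeric sketch of the same idea (made-up data), using .where for the masking step instead of the multiplication/applymap trick:

import pandas as pd
import numpy as np

df = pd.DataFrame({'id': ['a', 'a', 'b', 'b'],
                   'value': [1.0, np.nan, np.nan, 2.0]}).sort_values('id')

# Per-group running count of non-null cells in each column;
# a zero means no value has appeared yet in that id group
seen = df.notnull().astype(int).groupby(df['id']).cumsum()

# Keep a forward-filled cell only if its own group already had a value,
# so the 1.0 from group 'a' never leaks into group 'b'
result = df.ffill().where(seen > 0)
print(result)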