Pandas interpolate within a groupby
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original post, and attribute it to the original authors (not me): StackOverflow
Original post: http://stackoverflow.com/questions/37057187/
Asked by R. W.
I've got a dataframe with the following information:
filename val1 val2
t
1 file1.csv 5 10
2 file1.csv NaN NaN
3 file1.csv 15 20
6 file2.csv NaN NaN
7 file2.csv 10 20
8 file2.csv 12 15
I would like to interpolate the values in the dataframe based on the indices, but only within each file group.
To interpolate, I would normally do
df = df.interpolate(method="index")
And to group, I do
grouped = df.groupby("filename")
I would like the interpolated dataframe to look like this:
filename val1 val2
t
1 file1.csv 5 10
2 file1.csv 10 15
3 file1.csv 15 20
6 file2.csv NaN NaN
7 file2.csv 10 20
8 file2.csv 12 15
Where the NaN's are still present at t = 6 since they are the first items in the file2 group.
I suspect I need to use "apply", but haven't been able to figure out exactly how...
grouped.apply(interp1d)
...
TypeError: __init__() takes at least 3 arguments (2 given)
Any help would be appreciated.
Accepted answer by Alexander
>>> df.groupby('filename').apply(lambda group: group.interpolate(method='index'))
filename val1 val2
t
1 file1.csv 5 10
2 file1.csv 10 15
3 file1.csv 15 20
6 file2.csv NaN NaN
7 file2.csv 10 20
8 file2.csv 12 15
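On newer pandas versions, apply can warn about operating on the grouping column. An equivalent sketch (assuming the question's column names) that interpolates only the value columns via transform:

```python
import numpy as np
import pandas as pd

# Rebuild the question's example frame.
df = pd.DataFrame(
    {
        "filename": ["file1.csv"] * 3 + ["file2.csv"] * 3,
        "val1": [5, np.nan, 15, np.nan, 10, 12],
        "val2": [10, np.nan, 20, np.nan, 20, 15],
    },
    index=pd.Index([1, 2, 3, 6, 7, 8], name="t"),
)

# transform keeps the original index and never sees the grouping
# column, so the filled values can be assigned straight back.
cols = ["val1", "val2"]
df[cols] = df.groupby("filename")[cols].transform(
    lambda s: s.interpolate(method="index")
)

print(df)
```

As in the accepted answer, the leading NaN at t = 6 stays NaN, because interpolate only fills between known points by default.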
Answered by PMende
I ran into this as well. Instead of using apply, you can use transform, which will reduce your run time by more than 25% if you have on the order of 1000 groups:
import numpy as np
import pandas as pd
np.random.seed(500)
test_df = pd.DataFrame({
'a': np.random.randint(low=0, high=1000, size=10000),
'b': np.random.choice([1, 2, 4, 7, np.nan], size=10000, p=([0.2475]*4 + [0.01]))
})
Tests:
%timeit test_df.groupby('a').transform(pd.DataFrame.interpolate)
Output: 566 ms ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_df.groupby('a').apply(pd.DataFrame.interpolate)
Output: 788 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_df.groupby('a').apply(lambda group: group.interpolate())
Output: 787 ms ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_df.interpolate()
Output: 918 μs ± 16.9 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You will still see a significant increase in run time compared to a fully vectorized call to interpolate on the full DataFrame, but I don't think you can do much better in pandas.
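One caveat when swapping apply for transform (a sketch on the same synthetic frame as above): transform returns only the non-grouping columns, aligned to the original index, so it is usually assigned back into the frame rather than used as the whole result:

```python
import numpy as np
import pandas as pd

np.random.seed(500)
test_df = pd.DataFrame({
    "a": np.random.randint(low=0, high=1000, size=10000),
    "b": np.random.choice([1, 2, 4, 7, np.nan], size=10000, p=[0.2475] * 4 + [0.01]),
})

# The result contains only column 'b'; the grouping column 'a' is dropped.
filled = test_df.groupby("a").transform(lambda s: s.interpolate())
test_df["b"] = filled["b"]
```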
Answered by WkCui
Considering the long run times of the methods above, I suggest using a for loop with interpolate(); it is only a few lines of code, but much faster:
for name in df['filename'].unique():
    mask = df['filename'] == name
    # Interpolate the value columns within this file's rows only.
    df.loc[mask, ['val1', 'val2']] = df.loc[mask, ['val1', 'val2']].interpolate(method='index')
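The same idea can be written with groupby iteration, which lets pandas slice each group once instead of recomputing unique values and a boolean mask on every pass; a self-contained sketch that rebuilds the question's frame:

```python
import numpy as np
import pandas as pd

# Rebuild the question's example frame.
df = pd.DataFrame(
    {
        "filename": ["file1.csv"] * 3 + ["file2.csv"] * 3,
        "val1": [5, np.nan, 15, np.nan, 10, 12],
        "val2": [10, np.nan, 20, np.nan, 20, 15],
    },
    index=pd.Index([1, 2, 3, 6, 7, 8], name="t"),
)

# groupby yields each group's rows once; interpolate per group,
# then stitch the pieces back together.
parts = []
for name, group in df.groupby("filename"):
    filled = group.copy()
    filled[["val1", "val2"]] = group[["val1", "val2"]].interpolate(method="index")
    parts.append(filled)
result = pd.concat(parts)
```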