Pandas interpolate within a groupby
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original post, and attribute it to the original authors (not me): StackOverflow
Original post: http://stackoverflow.com/questions/37057187/
Asked by R. W.
I've got a dataframe with the following information:
filename val1 val2
t
1 file1.csv 5 10
2 file1.csv NaN NaN
3 file1.csv 15 20
6 file2.csv NaN NaN
7 file2.csv 10 20
8 file2.csv 12 15
I would like to interpolate the values in the dataframe based on the indices, but only within each file group.
To interpolate, I would normally do
df = df.interpolate(method="index")
And to group, I do
grouped = df.groupby("filename")
I would like the interpolated dataframe to look like this:
filename val1 val2
t
1 file1.csv 5 10
2 file1.csv 10 15
3 file1.csv 15 20
6 file2.csv NaN NaN
7 file2.csv 10 20
8 file2.csv 12 15
Where the NaN's are still present at t = 6 since they are the first items in the file2 group.
I suspect I need to use "apply", but haven't been able to figure out exactly how...
grouped.apply(interp1d)
...
TypeError: __init__() takes at least 3 arguments (2 given)
Any help would be appreciated.
Accepted answer by Alexander
>>> df.groupby('filename').apply(lambda group: group.interpolate(method='index'))
filename val1 val2
t
1 file1.csv 5 10
2 file1.csv 10 15
3 file1.csv 15 20
6 file2.csv NaN NaN
7 file2.csv 10 20
8 file2.csv 12 15
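On newer pandas versions, apply can warn about operating on the grouping column. An equivalent sketch (assuming the question's column names) that interpolates only the value columns via transform:

```python
import numpy as np
import pandas as pd

# Rebuild the question's example frame.
df = pd.DataFrame(
    {
        "filename": ["file1.csv"] * 3 + ["file2.csv"] * 3,
        "val1": [5, np.nan, 15, np.nan, 10, 12],
        "val2": [10, np.nan, 20, np.nan, 20, 15],
    },
    index=pd.Index([1, 2, 3, 6, 7, 8], name="t"),
)

# transform keeps the original index and never sees the grouping
# column, so the filled values can be assigned straight back.
cols = ["val1", "val2"]
df[cols] = df.groupby("filename")[cols].transform(
    lambda s: s.interpolate(method="index")
)

print(df)
```

As in the accepted answer, the leading NaN at t = 6 stays NaN, because interpolate only fills between known points by default.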
Answered by PMende
I ran into this as well. Instead of using apply, you can use transform, which will reduce your run time by more than 25% if you have on the order of 1000 groups:
import numpy as np
import pandas as pd
np.random.seed(500)
test_df = pd.DataFrame({
'a': np.random.randint(low=0, high=1000, size=10000),
'b': np.random.choice([1, 2, 4, 7, np.nan], size=10000, p=([0.2475]*4 + [0.01]))
})
Tests:
%timeit test_df.groupby('a').transform(pd.DataFrame.interpolate)
Output: 566 ms ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_df.groupby('a').apply(pd.DataFrame.interpolate)
Output: 788 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_df.groupby('a').apply(lambda group: group.interpolate())
Output: 787 ms ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_df.interpolate()
Output: 918 μs ± 16.9 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You will still see a significant increase in run time compared to a fully vectorized call to interpolate on the full DataFrame, but I don't think you can do much better in pandas.
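One caveat when swapping apply for transform (a sketch on the same synthetic frame as above): transform returns only the non-grouping columns, aligned to the original index, so it is usually assigned back into the frame rather than used as the whole result:

```python
import numpy as np
import pandas as pd

np.random.seed(500)
test_df = pd.DataFrame({
    "a": np.random.randint(low=0, high=1000, size=10000),
    "b": np.random.choice([1, 2, 4, 7, np.nan], size=10000, p=[0.2475] * 4 + [0.01]),
})

# The result contains only column 'b'; the grouping column 'a' is dropped.
filled = test_df.groupby("a").transform(lambda s: s.interpolate())
test_df["b"] = filled["b"]
```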
Answered by WkCui
Considering the long run times of the methods above, I suggest using a for loop with interpolate(); it is only a few lines of code, but much faster:
for name in df['filename'].unique():
    mask = df['filename'] == name
    # Interpolate the value columns within this file's rows only.
    df.loc[mask, ['val1', 'val2']] = df.loc[mask, ['val1', 'val2']].interpolate(method='index')
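The same idea can be written with groupby iteration, which lets pandas slice each group once instead of recomputing unique values and a boolean mask on every pass; a self-contained sketch that rebuilds the question's frame:

```python
import numpy as np
import pandas as pd

# Rebuild the question's example frame.
df = pd.DataFrame(
    {
        "filename": ["file1.csv"] * 3 + ["file2.csv"] * 3,
        "val1": [5, np.nan, 15, np.nan, 10, 12],
        "val2": [10, np.nan, 20, np.nan, 20, 15],
    },
    index=pd.Index([1, 2, 3, 6, 7, 8], name="t"),
)

# groupby yields each group's rows once; interpolate per group,
# then stitch the pieces back together.
parts = []
for name, group in df.groupby("filename"):
    filled = group.copy()
    filled[["val1", "val2"]] = group[["val1", "val2"]].interpolate(method="index")
    parts.append(filled)
result = pd.concat(parts)
```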