Original question: http://stackoverflow.com/questions/25909984/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Missing data, insert rows in Pandas and fill with NAN
Asked by mati
I'm new to Python and Pandas so there might be a simple solution which I don't see.
I have a number of discontinuous datasets which look like this:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 3.5 2 0
4 4.0 4 5
5 4.5 3 3
I am now looking for a way to get the following:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NAN NAN
4 2.0 NAN NAN
5 2.5 NAN NAN
6 3.0 NAN NAN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
The problem is that the gap in A varies in position and length from dataset to dataset.
Answered by cronos
set_index and reset_index are your friends.
from numpy import arange
from pandas import DataFrame, Index
df = DataFrame({"A":[0,0.5,1.0,3.5,4.0,4.5], "B":[1,4,6,2,4,3], "C":[3,2,1,0,5,3]})
First move column A to the index:
In [64]: df.set_index("A")
Out[64]:
B C
A
0.0 1 3
0.5 4 2
1.0 6 1
3.5 2 0
4.0 4 5
4.5 3 3
Then reindex with a new index. Index values that are not present in the original data get rows filled with NaN. We use an Index object because it can be given a name; that name is used in the next step.
In [66]: new_index = Index(arange(0,5,0.5), name="A")
In [67]: df.set_index("A").reindex(new_index)
Out[67]:
B C
0.0 1 3
0.5 4 2
1.0 6 1
1.5 NaN NaN
2.0 NaN NaN
2.5 NaN NaN
3.0 NaN NaN
3.5 2 0
4.0 4 5
4.5 3 3
Finally move the index back to the columns with reset_index. Since we named the index, it all works magically:
In [69]: df.set_index("A").reindex(new_index).reset_index()
Out[69]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
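Since the gap sits in a different place in every dataset, the three steps above can be wrapped into a small helper that derives the new index from the data itself. This is only a minimal sketch and not part of the original answer: the name fill_gaps and its step parameter are made up for illustration, and it assumes that column A has no duplicates and lies on a grid whose step is exactly representable in floating point (such as 0.5).

```python
import numpy as np
import pandas as pd


def fill_gaps(df, column="A", step=0.5):
    """Insert rows so `column` runs from its min to its max in `step` increments.

    Hypothetical helper: rows that were missing get NaN in all other columns.
    """
    start, stop = df[column].min(), df[column].max()
    # Build the grid from an integer count so the last value is always included.
    n_steps = int(round((stop - start) / step))
    new_index = pd.Index(start + step * np.arange(n_steps + 1), name=column)
    return df.set_index(column).reindex(new_index).reset_index()


df = pd.DataFrame({"A": [0, 0.5, 1.0, 3.5, 4.0, 4.5],
                   "B": [1, 4, 6, 2, 4, 3],
                   "C": [3, 2, 1, 0, 5, 3]})
filled = fill_gaps(df)
print(filled)
```

Note that reindex matches index values exactly, so for steps such as 0.1 the generated grid may not line up with the stored floats; rounding both sides first is one way around that.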
Answered by EdChum
In this case I generate a new dataframe that holds the full range of A values, merge it with your original df, and then re-sort the result:
In [177]:
# assumes: import numpy as np; import pandas as pd
full_range = pd.DataFrame({'A': np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5)})
df.merge(full_range, how='right', on='A').sort_values(by='A').reset_index(drop=True)
Out[177]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
In the general case you can adjust the call to arange, which takes a start value, an end value and a step. Note that I added 0.5 to the end value because the range is half-open: the start is included but the end is not.
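The half-open behaviour of arange is easy to check on the values from the question. The following is a short sketch with variable names chosen only for illustration; np.linspace is shown as an alternative that includes the end point, which the answer itself does not use.

```python
import numpy as np

start, stop, step = 0.0, 4.5, 0.5

# arange excludes the end value, so this grid stops one step short of 4.5.
without_end = np.arange(start, stop, step)

# Adding one step to the end value brings 4.5 back in.
with_end = np.arange(start, stop + step, step)

# Alternative: linspace includes both end points when given the number of points.
n_points = int(round((stop - start) / step)) + 1
also_with_end = np.linspace(start, stop, n_points)

print(without_end[-1], with_end[-1], also_with_end[-1])
```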
A more general method could be like this:
In [197]:
df = df.set_index(keys='A', drop=False).reindex(np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5))
df['A'] = df.index  # overwrite A (NaN in the new rows) with the full range
df = df.reset_index(drop=True)
df
Out[197]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Here we set the index to column A but don't drop it, and then reindex the df using the arange function.
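As a quick sanity check, the merge-based and the reindex-based approaches should produce the same frame. The sketch below assumes a reasonably recent pandas release in which sort_values and pandas.testing.assert_frame_equal are available; the variable names are for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 0.5, 1.0, 3.5, 4.0, 4.5],
                   "B": [1, 4, 6, 2, 4, 3],
                   "C": [3, 2, 1, 0, 5, 3]})

# Full range of A values, with one extra step so the last value is included.
grid = np.arange(df["A"].iloc[0], df["A"].iloc[-1] + 0.5, 0.5)

# Approach 1: right-merge against a frame holding the full range of A.
merged = (df.merge(pd.DataFrame({"A": grid}), how="right", on="A")
            .sort_values(by="A")
            .reset_index(drop=True))

# Approach 2: move A to the index, reindex on the full range, move it back.
reindexed = df.set_index("A").reindex(pd.Index(grid, name="A")).reset_index()

# Raises AssertionError if the two results differ.
pd.testing.assert_frame_equal(merged, reindexed)
```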
Answered by JustAC0der
Using the answer by EdChum above, I created the following function:
import numpy as np
import pandas as pd

def fill_missing_range(df, field, range_from, range_to, range_step=1, fill_with=0):
    # range_to is exclusive, as in np.arange
    full_range = pd.DataFrame({field: np.arange(range_from, range_to, range_step)})
    return (df.merge(full_range, how='right', on=field)
              .sort_values(by=field)
              .reset_index(drop=True)
              .fillna(fill_with))
Example usage:
fill_missing_range(df, 'A', 0.0, 5.0, 0.5, np.nan)  # range_to is exclusive, so 5.0 keeps the row at 4.5
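Because np.arange excludes its end value, passing the last value of A as range_to would leave that row out of a right merge. One possible variant, sketched here under the assumption that an inclusive end and NaN-filled gaps are what is wanted, builds the range from a step count instead. The name fill_missing_range_inclusive is hypothetical and not from the original answer.

```python
import numpy as np
import pandas as pd


def fill_missing_range_inclusive(df, field, range_from, range_to, range_step=1):
    """Hypothetical variant: range_to is included and inserted rows stay NaN."""
    # Count the steps so the grid always ends exactly on the last step.
    n_steps = int(round((range_to - range_from) / range_step))
    full_range = pd.DataFrame(
        {field: range_from + range_step * np.arange(n_steps + 1)}
    )
    return (df.merge(full_range, how="right", on=field)
              .sort_values(by=field)
              .reset_index(drop=True))


df = pd.DataFrame({"A": [0, 0.5, 1.0, 3.5, 4.0, 4.5],
                   "B": [1, 4, 6, 2, 4, 3],
                   "C": [3, 2, 1, 0, 5, 3]})
result = fill_missing_range_inclusive(df, "A", 0.0, 4.5, 0.5)
print(result)
```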

