Missing data: inserting rows in Pandas and filling them with NaN

Question

Asked by mati

I'm new to Python and Pandas so there might be a simple solution which I don't see.

I have a number of discontinuous datasets which look like this:

ind A    B  C  
0   0.0  1  3  
1   0.5  4  2  
2   1.0  6  1  
3   3.5  2  0  
4   4.0  4  5  
5   4.5  3  3

I now look for a solution to get the following:

ind A    B  C  
0   0.0  1  3  
1   0.5  4  2  
2   1.0  6  1  
3   1.5  NAN NAN  
4   2.0  NAN NAN  
5   2.5  NAN NAN  
6   3.0  NAN NAN  
7   3.5  2  0  
8   4.0  4  5  
9   4.5  3  3

The problem is that the gap in A varies from dataset to dataset, both in position and in length.

Answer 1

Answered by cronos

set_index and reset_index are your friends.

from numpy import arange
from pandas import DataFrame, Index

df = DataFrame({"A": [0, 0.5, 1.0, 3.5, 4.0, 4.5], "B": [1, 4, 6, 2, 4, 3], "C": [3, 2, 1, 0, 5, 3]})

First move column A to the index:

In [64]: df.set_index("A")
Out[64]: 
     B  C
 A        
0.0  1  3
0.5  4  2
1.0  6  1
3.5  2  0
4.0  4  5
4.5  3  3

Then reindex with a new index; the rows that are missing from the original data are filled with NaN. We use an Index object because it can be named, and that name is used in the next step.

In [66]: new_index = Index(arange(0,5,0.5), name="A")
In [67]: df.set_index("A").reindex(new_index)
Out[67]: 
      B   C
0.0   1   3
0.5   4   2
1.0   6   1
1.5 NaN NaN
2.0 NaN NaN
2.5 NaN NaN
3.0 NaN NaN
3.5   2   0
4.0   4   5
4.5   3   3

Finally move the index back to the columns with reset_index. Since we named the index, it all works magically:

In [69]: df.set_index("A").reindex(new_index).reset_index()
Out[69]: 
       A   B   C
0    0.0   1   3
1    0.5   4   2
2    1.0   6   1
3    1.5 NaN NaN
4    2.0 NaN NaN
5    2.5 NaN NaN
6    3.0 NaN NaN
7    3.5   2   0
8    4.0   4   5
9    4.5   3   3
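
Because the gap moves around between datasets, it may help to derive the new index from the data rather than hard-coding 0 and 5. The following is a minimal sketch of the same three steps, assuming the spacing of A (the `step` variable, a name introduced here for illustration) is known to be 0.5:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0.0, 0.5, 1.0, 3.5, 4.0, 4.5],
                   "B": [1, 4, 6, 2, 4, 3],
                   "C": [3, 2, 1, 0, 5, 3]})

step = 0.5  # assumption: the regular spacing of column A is known

# Build the full grid from the data itself, so the gap may sit anywhere.
# np.arange excludes its stop value, hence the extra step.
new_index = pd.Index(np.arange(df["A"].min(), df["A"].max() + step, step), name="A")

# set_index -> reindex -> reset_index, exactly as in the steps above.
filled = df.set_index("A").reindex(new_index).reset_index()
print(filled)
```

Note that B and C become float columns once NaN is introduced, so recent pandas versions print their values as 1.0, 4.0 and so on.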

Answer 2

Answered by EdChum

In this case I am overwriting your A column with a newly generated dataframe and merging this with your original df; I then re-sort it:

In [177]:

df.merge(how='right', on='A', right=pd.DataFrame({'A': np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5)})).sort_values(by='A').reset_index(drop=True)
Out[177]:
     A   B   C
0  0.0   1   3
1  0.5   4   2
2  1.0   6   1
3  1.5 NaN NaN
4  2.0 NaN NaN
5  2.5 NaN NaN
6  3.0 NaN NaN
7  3.5   2   0
8  4.0   4   5
9  4.5   3   3

So in the general case you can adjust the arange call, which takes a start value, an end value and a step. Note that I added 0.5 to the end because the range is half-open: the end value itself is excluded.
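
The half-open behaviour is easy to check in isolation. The sketch below uses the start, end and step from the example data (the variable names are introduced here for illustration) and also shows np.linspace as one possible alternative that includes the endpoint:

```python
import numpy as np

start, stop, step = 0.0, 4.5, 0.5

without_pad = np.arange(start, stop, step)         # stop itself is left out
with_pad = np.arange(start, stop + step, step)     # padding by one step includes it

print(without_pad[-1])  # 4.0
print(with_pad[-1])     # 4.5

# Alternative: linspace takes a point count and includes the endpoint.
n_points = int(round((stop - start) / step)) + 1
grid = np.linspace(start, stop, n_points)
print(len(grid))        # 10
```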

A more general method could be like this:

In [197]:

df = df.set_index(keys='A', drop=False).reindex(np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5))
df['A'] = df.index  # refill A from the new index so the gap rows get their A values
df.reset_index(drop=True)
Out[197]:
       A   B   C
0    0.0   1   3
1    0.5   4   2
2    1.0   6   1
3    1.5 NaN NaN
4    2.0 NaN NaN
5    2.5 NaN NaN
6    3.0 NaN NaN
7    3.5   2   0
8    4.0   4   5
9    4.5   3   3

Here we set the index to column A but don't drop it, and then reindex the df using the arange function.
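
One caveat with reindexing on float values: a step of 0.5 is exactly representable in binary floating point, but a step such as 0.1 is not, so grid values built by arange may fail to match the values stored in A. The sketch below shows one possible workaround, rounding both sides to a fixed number of decimals. The helper `regular_grid` and its `decimals` parameter are hypothetical names introduced here, not part of pandas or NumPy, and the sample data is made up for illustration.

```python
import numpy as np
import pandas as pd

def regular_grid(start, stop, step, decimals=6):
    """Return start, start + step, ..., stop (inclusive), rounded to tame float noise."""
    n_steps = int(round((stop - start) / step))
    return np.round(start + step * np.arange(n_steps + 1), decimals)

# Illustrative data with a step of 0.1 and a gap at 0.3 and 0.4.
df = pd.DataFrame({"A": [0.0, 0.1, 0.2, 0.5], "B": [1, 2, 3, 4]})

grid = regular_grid(df["A"].min(), df["A"].max(), 0.1)

# Round the existing keys the same way so both sides compare equal.
keyed = df.assign(A=df["A"].round(6)).set_index("A")
filled = keyed.reindex(pd.Index(grid, name="A")).reset_index()
print(filled)
```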

Answer 3

Answered by JustAC0der

Using the answer by EdChum above, I created the following function:

def fill_missing_range(df, field, range_from, range_to, range_step=1, fill_with=0):
    # range_to is exclusive, as in np.arange
    full_range = pd.DataFrame({field: np.arange(range_from, range_to, range_step)})
    return (df.merge(full_range, how='right', on=field)
              .sort_values(by=field)
              .reset_index(drop=True)
              .fillna(fill_with))

Example usage:

fill_missing_range(df, 'A', 0.0, 5.0, 0.5, np.nan)  # range_to is exclusive, so pass 5.0 to keep the row at 4.5
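
For reference, here is a self-contained run of the function on the data from the question. It is a sketch that assumes a known spacing of 0.5 (the `step` variable is introduced here for illustration) and takes the range from the data, padding the end by one step because the end of the range is exclusive:

```python
import numpy as np
import pandas as pd

def fill_missing_range(df, field, range_from, range_to, range_step=1, fill_with=0):
    # range_to is exclusive, as in np.arange
    full_range = pd.DataFrame({field: np.arange(range_from, range_to, range_step)})
    return (df.merge(full_range, how="right", on=field)
              .sort_values(by=field)
              .reset_index(drop=True)
              .fillna(fill_with))

df = pd.DataFrame({"A": [0.0, 0.5, 1.0, 3.5, 4.0, 4.5],
                   "B": [1, 4, 6, 2, 4, 3],
                   "C": [3, 2, 1, 0, 5, 3]})

step = 0.5  # assumption: the regular spacing of column A is known
lo, hi = df["A"].min(), df["A"].max() + step

with_nan = fill_missing_range(df, "A", lo, hi, step, np.nan)  # gap rows stay NaN
with_zero = fill_missing_range(df, "A", lo, hi, step)         # default: gap rows become 0
print(with_nan)
print(with_zero)
```

Passing np.nan as fill_with leaves the inserted rows empty, which is what the question asks for; the default of 0 replaces them with zeros instead.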
