Python: filter out NaN rows in a specific column

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/43821529/

Time: 2020-08-19 23:25:11  Source: igfitidea

Filter out nan rows in a specific column

Tags: python, pandas, numpy

Asked by Dinosaurius

df =

Col1 Col2 Col3
1    nan  4
2    5    4
3    3    nan

Given the dataframe df, I want to obtain a new dataframe df2 that does not contain nan in the column Col2. This is the expected result:

df2 =

Col1 Col2 Col3
2    5    4
3    3    nan

I know that it's possible to use pandas.isnull and dropna, but how do I specify only the particular column to which the filtering should be applied?

Answered by MaxU

You can use the DataFrame.dropna() method:

In [202]: df.dropna(subset=['Col2'])
Out[202]:
   Col1  Col2  Col3
1     2   5.0   4.0
2     3   3.0   NaN
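
The session above assumes df already exists. To reproduce it as a standalone script, a minimal sketch could look like the following; the frame is rebuilt from the values shown in the question, and the name df2 follows the question's wording.

```python
import numpy as np
import pandas as pd

# Rebuild the question's example frame (values taken from the question).
df = pd.DataFrame(
    {"Col1": [1, 2, 3], "Col2": [np.nan, 5, 3], "Col3": [4, 4, np.nan]}
)

# Drop only the rows whose Col2 is NaN; NaN in other columns is kept.
df2 = df.dropna(subset=["Col2"])
print(df2)
```

Because subset restricts the check to Col2, the row with NaN in Col3 survives, and df itself is left unchanged.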

or (in this case) the less idiomatic Series.notnull():

In [204]: df.loc[df.Col2.notnull()]
Out[204]:
   Col1  Col2  Col3
1     2   5.0   4.0
2     3   3.0   NaN

or use the DataFrame.query() method:

In [205]: df.query("Col2 == Col2")
Out[205]:
   Col1  Col2  Col3
1     2   5.0   4.0
2     3   3.0   NaN

numexpr solution:

In [241]: import numexpr as ne

In [242]: col = df.Col2

In [243]: df[ne.evaluate("col == col")]
Out[243]:
   Col1  Col2  Col3
1     2   5.0   4.0
2     3   3.0   NaN
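
The query and numexpr variants above both compare the column with itself. The sketch below, which rebuilds the question's frame as an assumption, illustrates why that works: a floating-point NaN never compares equal to itself.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Col1": [1, 2, 3], "Col2": [np.nan, 5, 3], "Col3": [4, 4, np.nan]}
)

# NaN is the only float value that is not equal to itself.
print(np.nan == np.nan)  # False

# So comparing a column with itself is False exactly where it holds NaN.
mask = df["Col2"] == df["Col2"]
print(mask.tolist())  # [False, True, True]

# For this float column the self-comparison matches notna().
assert mask.equals(df["Col2"].notna())
```

This relies on float NaN semantics. For columns using pandas' nullable dtypes, where the missing marker is pd.NA, a comparison propagates the missing value instead of returning False, so notna() is probably the safer choice there.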

Answered by jezrael

Use dropna:

df = df.dropna(subset=['Col2'])
print (df)
   Col1  Col2  Col3
1     2   5.0   4.0
2     3   3.0   NaN

Another solution is boolean indexing with notnull:

df = df[df['Col2'].notnull()]
print (df)
   Col1  Col2  Col3
1     2   5.0   4.0
2     3   3.0   NaN

Which is the same as:

df = df[~df['Col2'].isnull()]
print (df)
   Col1  Col2  Col3
1     2   5.0   4.0
2     3   3.0   NaN
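
As a quick check that these spellings select the same rows, here is a small sketch on the question's frame (rebuilt here as an assumption). It also uses notna and isna, the newer aliases pandas provides for notnull and isnull.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Col1": [1, 2, 3], "Col2": [np.nan, 5, 3], "Col3": [4, 4, np.nan]}
)

by_dropna = df.dropna(subset=["Col2"])
by_notnull = df[df["Col2"].notnull()]
by_notna = df[df["Col2"].notna()]      # notna is an alias of notnull
by_not_isnull = df[~df["Col2"].isnull()]
by_not_isna = df[~df["Col2"].isna()]   # isna is an alias of isnull

# Every variant keeps the same rows, index labels and dtypes.
for other in (by_notnull, by_notna, by_not_isnull, by_not_isna):
    assert by_dropna.equals(other)
print(by_dropna)
```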

Answered by piRSquared

Using numpy's isnan to build a mask and construct a new dataframe:

m = ~np.isnan(df.Col2.values)
pd.DataFrame(df.values[m], df.index[m], df.columns)

   Col1  Col2  Col3
1   2.0   5.0   4.0
2   3.0   3.0   NaN
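
Two side effects of this approach are worth knowing, sketched below on the question's frame (rebuilt here as an assumption; the labels series is a made-up illustration). Going through a single 2-D array upcasts the integer column to float, as the output above shows, and np.isnan does not accept object arrays such as a column of strings, whereas pd.isna does.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Col1": [1, 2, 3], "Col2": [np.nan, 5, 3], "Col3": [4, 4, np.nan]}
)

m = ~np.isnan(df["Col2"].to_numpy())
rebuilt = pd.DataFrame(df.to_numpy()[m], df.index[m], df.columns)

# Going through one 2-D array upcasts the integer column to float.
print(df["Col1"].dtype, "->", rebuilt["Col1"].dtype)

# np.isnan rejects object arrays, while pd.isna accepts them.
labels = pd.Series(["a", np.nan, "c"], dtype=object)  # hypothetical column
try:
    np.isnan(labels.to_numpy())
    isnan_failed = False
except TypeError:
    isnan_failed = True
print("np.isnan raised TypeError:", isnan_failed)
print(pd.isna(labels).tolist())
```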


Timing
Bigger Data

np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice([np.nan, 1], size=(10000, 10))).add_prefix('Col')

%%timeit
m = ~np.isnan(df.Col2.values)
pd.DataFrame(df.values[m], df.index[m], df.columns)
1000 loops, best of 3: 326 μs per loop

%timeit df.query("Col2 == Col2")
1000 loops, best of 3: 1.48 ms per loop

%timeit df.loc[df.Col2.notnull()]
1000 loops, best of 3: 417 μs per loop

%timeit df[~df['Col2'].isnull()]
1000 loops, best of 3: 385 μs per loop

%timeit df.dropna(subset=['Col2'])
1000 loops, best of 3: 913 μs per loop
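
The %timeit magics above only work inside IPython. To repeat the comparison in a plain script, a rough sketch with the standard timeit module might look like this; the seed, the candidates dictionary and the numpy_mask function name are choices made for this sketch, and the numbers will differ by machine and by pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
df = pd.DataFrame(rng.choice([np.nan, 1.0], size=(10000, 10))).add_prefix("Col")


def numpy_mask():
    # Mask with numpy, then rebuild the frame from the raw array.
    m = ~np.isnan(df["Col2"].to_numpy())
    return pd.DataFrame(df.to_numpy()[m], df.index[m], df.columns)


candidates = {
    "numpy mask": numpy_mask,
    "query": lambda: df.query("Col2 == Col2"),
    "loc + notnull": lambda: df.loc[df["Col2"].notnull()],
    "~isnull": lambda: df[~df["Col2"].isnull()],
    "dropna(subset)": lambda: df.dropna(subset=["Col2"]),
}

# Check that every approach returns the same frame before timing it.
results = {name: fn() for name, fn in candidates.items()}
reference = results["dropna(subset)"]
for name, fn in candidates.items():
    assert results[name].equals(reference)
    best = min(timeit.repeat(fn, number=50, repeat=3)) / 50
    print(f"{name:>15}: {best * 1e6:8.1f} µs per call")
```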

Answered by Alex Schwab

If you want to count and graph the number of NaNs before dropping your column(s):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

cols = list(df.columns)  # df is the dataframe to inspect
nans = [df[col].isna().sum() for col in cols]  # NaN count per column

sns.set(font_scale=1.1)
# Recent seaborn releases require x and y to be passed as keywords
ax = sns.barplot(x=cols, y=nans, palette='hls')
ax.set(xlabel='Feature', ylabel='Number of NaNs', title='Number of NaNs per feature')
# Write each count above its bar
for p, uniq in zip(ax.patches, nans):
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 10,  # vertical offset above the bar, in data units
            uniq,
            ha="center")
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
plt.show()

Answered by Grant Shannon

The simple implementation below follows on from the above, but shows filtering out nan rows in a specific column in place and, for large data frames, counting rows with nan by column name (before and after).

import pandas as pd
import numpy as np

df = pd.DataFrame([[1,np.nan,'A100'],[4,5,'A213'],[7,8,np.nan],[10,np.nan,'GA23']])
df.columns = ['areaCode','Distance','accountCode']

dataframe:

areaCode    Distance    accountCode
1           NaN         A100
4           5.0         A213
7           8.0         NaN
10          NaN         GA23

Before: count rows with nan (for each column):

df.isnull().sum()

count by column:

areaCode       0
Distance       2
accountCode    1
dtype: int64

remove unwanted rows in-place:

df.dropna(subset=['Distance'],inplace=True)

After: count rows with nan (for each column):

df.isnull().sum()

count by column:

areaCode       0
Distance       0
accountCode    1
dtype: int64

dataframe:

areaCode    Distance    accountCode
4           5.0         A213
7           8.0         NaN
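
If modifying df in place is not desired, the same steps can be wrapped in a function that returns a new frame. This is only a sketch: drop_nan_rows is a hypothetical helper name, not part of pandas, and resetting the index is an optional choice made here.

```python
import numpy as np
import pandas as pd


def drop_nan_rows(frame, column):
    """Hypothetical helper: return a copy of frame without rows where column is NaN."""
    before = int(frame[column].isna().sum())
    # dropna returns a new frame; reset_index renumbers the remaining rows.
    cleaned = frame.dropna(subset=[column]).reset_index(drop=True)
    after = int(cleaned[column].isna().sum())
    print(f"{column}: {before} NaN rows before, {after} after")
    return cleaned


df = pd.DataFrame(
    [[1, np.nan, "A100"], [4, 5, "A213"], [7, 8, np.nan], [10, np.nan, "GA23"]],
    columns=["areaCode", "Distance", "accountCode"],
)
cleaned = drop_nan_rows(df, "Distance")
print(cleaned)
```

The original df keeps all four rows, which makes it easier to compare counts before and after.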