Python: filter out nan rows in a specific column
Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/43821529/
Filter out nan rows in a specific column
Asked by Dinosaurius
df =
Col1 Col2 Col3
1 nan 4
2 5 4
3 3 nan
Given the dataframe df, I want to obtain a new dataframe df2 that does not contain nan in the column Col2. This is the expected result:
df2 =
Col1 Col2 Col3
2 5 4
3 3 nan
I know that it's possible to use pandas.isnull and dropna, but how do I specify the particular column to which the filtering should be applied?
Answered by MaxU
You can use the DataFrame.dropna() method:
In [202]: df.dropna(subset=['Col2'])
Out[202]:
Col1 Col2 Col3
1 2 5.0 4.0
2 3 3.0 NaN
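For anyone who wants to reproduce this outside an IPython session, here is a minimal sketch. It assumes the question's frame was built with a default integer index, which is what the row labels 1 and 2 in the output above suggest; the construction of df is my assumption, not something shown in the question.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's dataframe (default 0..2 index).
df = pd.DataFrame({
    "Col1": [1, 2, 3],
    "Col2": [np.nan, 5, 3],
    "Col3": [4, 4, np.nan],
})

# Only Col2 is inspected; the NaN in Col3 does not cause a row to be dropped.
df2 = df.dropna(subset=["Col2"])
print(df2)

# dropna returns a new frame, so df still has all three rows.
print(len(df), len(df2))
```

Note that the surviving rows keep their original index labels; chaining .reset_index(drop=True) is one way to renumber them if that matters.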
Or the (in this case) less idiomatic Series.notnull():
In [204]: df.loc[df.Col2.notnull()]
Out[204]:
Col1 Col2 Col3
1 2 5.0 4.0
2 3 3.0 NaN
Or using the DataFrame.query() method:
In [205]: df.query("Col2 == Col2")
Out[205]:
Col1 Col2 Col3
1 2 5.0 4.0
2 3 3.0 NaN
numexpr solution:
In [241]: import numexpr as ne
In [242]: col = df.Col2
In [243]: df[ne.evaluate("col == col")]
Out[243]:
Col1 Col2 Col3
1 2 5.0 4.0
2 3 3.0 NaN
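The query and numexpr variants above both rely on the fact that a floating-point NaN never compares equal to itself, so "Col2 == Col2" is False exactly where Col2 is NaN. A small sketch of that idea, assuming the same three-row frame as in the question (the construction of df is my assumption):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Col1": [1, 2, 3],
    "Col2": [np.nan, 5, 3],
    "Col3": [4, 4, np.nan],
})

# IEEE 754: NaN is not equal to anything, including itself.
print(np.nan == np.nan)

# Element-wise self-comparison therefore yields the same mask as notna().
mask = df["Col2"] == df["Col2"]
print(mask.tolist())

filtered = df.query("Col2 == Col2")
print(filtered)
```

This is a trick for float NaN specifically; I would not assume it carries over unchanged to other missing-value markers, where notna() is the more explicit choice.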
Answered by jezrael
Use dropna:
df = df.dropna(subset=['Col2'])
print (df)
Col1 Col2 Col3
1 2 5.0 4.0
2 3 3.0 NaN
Another solution: boolean indexing with notnull:
df = df[df['Col2'].notnull()]
print (df)
Col1 Col2 Col3
1 2 5.0 4.0
2 3 3.0 NaN
Which is the same as:
df = df[~df['Col2'].isnull()]
print (df)
Col1 Col2 Col3
1 2 5.0 4.0
2 3 3.0 NaN
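To check that these spellings really select the same rows, here is a short sketch. It assumes the question's three-row frame (my reconstruction) and also uses notna, which in current pandas is an alias of notnull:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Col1": [1, 2, 3],
    "Col2": [np.nan, 5, 3],
    "Col3": [4, 4, np.nan],
})

# Three ways to build the same boolean mask.
mask_notnull = df["Col2"].notnull()
mask_notna = df["Col2"].notna()
mask_inverted = ~df["Col2"].isnull()

# Boolean indexing keeps the rows where the mask is True.
kept = df[mask_notna]
print(kept)
```

All three masks should be identical, so which one to use is mostly a matter of readability.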
Answered by piRSquared
Using numpy's isnan to mask and construct a new dataframe:
m = ~np.isnan(df.Col2.values)
pd.DataFrame(df.values[m], df.index[m], df.columns)
Col1 Col2 Col3
1 2.0 5.0 4.0
2 3.0 3.0 NaN
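One thing worth noticing in the output above is that Col1 now prints as 2.0 and 3.0: df.values on a frame with mixed integer and float columns produces a single float array, so the rebuilt frame loses the integer dtype. A sketch of that side effect, assuming the question's frame (my reconstruction), next to plain boolean indexing with the same mask:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Col1": [1, 2, 3],
    "Col2": [np.nan, 5, 3],
    "Col3": [4, 4, np.nan],
})

m = ~np.isnan(df["Col2"].values)

# Rebuilding from the raw array upcasts every column to float64.
rebuilt = pd.DataFrame(df.values[m], df.index[m], df.columns)

# Indexing the frame with the same numpy mask keeps per-column dtypes.
kept = df[m]

print(rebuilt.dtypes)
print(kept.dtypes)
```

Also keep in mind that np.isnan expects a numeric array; on an object-dtype column (for example, strings mixed with NaN) it raises a TypeError, so this approach is best reserved for numeric columns.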
Timing
Bigger Data
np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice([np.nan, 1], size=(10000, 10))).add_prefix('Col')
%%timeit
m = ~np.isnan(df.Col2.values)
pd.DataFrame(df.values[m], df.index[m], df.columns)
1000 loops, best of 3: 326 μs per loop
%timeit df.query("Col2 == Col2")
1000 loops, best of 3: 1.48 ms per loop
%timeit df.loc[df.Col2.notnull()]
1000 loops, best of 3: 417 μs per loop
%timeit df[~df['Col2'].isnull()]
1000 loops, best of 3: 385 μs per loop
%timeit df.dropna(subset=['Col2'])
1000 loops, best of 3: 913 μs per loop
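The %timeit magics above only work inside IPython or Jupyter. As a rough sketch, a similar comparison can be run as a plain script with the standard timeit module. The random generator, the seed, and the repeat counts below are my own choices, so the numbers will not match the ones above and will vary by machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

# Assumed test data: 10000 x 10 frame of NaN/1.0, columns Col0..Col9.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.choice([np.nan, 1.0], size=(10000, 10))).add_prefix("Col")

candidates = {
    "dropna": lambda: df.dropna(subset=["Col2"]),
    "notna": lambda: df[df["Col2"].notna()],
    "query": lambda: df.query("Col2 == Col2"),
}

# Sanity check first: every candidate should return the same rows.
results = {name: fn() for name, fn in candidates.items()}

# Best of 3 runs, 20 calls per run, reported as seconds per call.
timings = {
    name: min(timeit.repeat(fn, number=20, repeat=3)) / 20
    for name, fn in candidates.items()
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.0f} microseconds per call")
```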
Answered by Alex Schwab
If you want to count and graph the number of NaNs per column before dropping anything:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

cols = list(df.columns)
# Number of NaN values in each column
nans = [df[col].isna().sum() for col in cols]
sns.set(font_scale=1.1)
# Recent seaborn releases require x and y as keyword arguments
ax = sns.barplot(x=cols, y=nans, palette='hls')
ax.set(xlabel='Feature', ylabel='Number of NaNs', title='Number of NaNs per feature')
# Write each count above its bar
for p, uniq in zip(ax.patches, nans):
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2.,
            height + 10,
            uniq,
            ha="center")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()
Answered by Grant Shannon
The simple implementation below follows on from the above. It shows how to filter out nan rows in a specific column in place and, for large data frames, how to count the rows with nan by column name (before and after).
import pandas as pd
import numpy as np
df = pd.DataFrame([[1,np.nan,'A100'],[4,5,'A213'],[7,8,np.nan],[10,np.nan,'GA23']])
df.columns = ['areaCode','Distance','accountCode']
dataframe:
areaCode Distance accountCode
1 NaN A100
4 5.0 A213
7 8.0 NaN
10 NaN GA23
Before: count rows with nan (for each column):
df.isnull().sum()
count by column:
areaCode 0
Distance 2
accountCode 1
dtype: int64
remove unwanted rows in-place:
df.dropna(subset=['Distance'],inplace=True)
After: count rows with nan (for each column):
df.isnull().sum()
count by column:
areaCode 0
Distance 0
accountCode 1
dtype: int64
dataframe:
areaCode Distance accountCode
4 5.0 A213
7 8.0 NaN
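The before/after steps above could be wrapped into one helper. The sketch below is only an illustration: the function name drop_nan_rows and the shape of its report are hypothetical, and it returns a new frame rather than using inplace=True so the original stays available for comparison.

```python
import numpy as np
import pandas as pd


def drop_nan_rows(frame, column):
    """Drop rows where `column` is NaN; also return per-column NaN counts."""
    before = frame.isna().sum()
    filtered = frame.dropna(subset=[column])
    after = filtered.isna().sum()
    report = pd.DataFrame({"before": before, "after": after})
    return filtered, report


df = pd.DataFrame(
    [[1, np.nan, "A100"], [4, 5, "A213"], [7, 8, np.nan], [10, np.nan, "GA23"]],
    columns=["areaCode", "Distance", "accountCode"],
)

filtered, report = drop_nan_rows(df, "Distance")
print(report)
print(filtered)
```

With the sample data from this answer, the report should show Distance going from 2 NaNs to 0 while accountCode keeps its single NaN, matching the counts listed above.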