Python: filter out rows with nan in a specific column

Question

Asked by Dinosaurius

df =

Col1 Col2 Col3
1    nan  4
2    5    4
3    3    nan

Given the dataframe df, I want to obtain a new dataframe df2 that does not contain nan in the column Col2. This is the expected result: df2 =

Col1 Col2 Col3
2    5    4
3    3    nan

I know that it's possible to use pandas.isnull and dropna, but how do I specify only the particular column to which the filtering should be applied?

Answer 1

Answered by MaxU

You can use the DataFrame.dropna() method:

In [202]: df.dropna(subset=['Col2'])
Out[202]:
   Col1  Col2  Col3
1     2   5.0   4.0
2     3   3.0   NaN

or (in this case) the less idiomatic Series.notnull():

In [204]: df.loc[df.Col2.notnull()]
Out[204]:
   Col1  Col2  Col3
1     2   5.0   4.0
2     3   3.0   NaN

or using the DataFrame.query() method:

In [205]: df.query("Col2 == Col2")
Out[205]:
   Col1  Col2  Col3
1     2   5.0   4.0
2     3   3.0   NaN

numexpr solution:

In [241]: import numexpr as ne

In [242]: col = df.Col2

In [243]: df[ne.evaluate("col == col")]
Out[243]:
   Col1  Col2  Col3
1     2   5.0   4.0
2     3   3.0   NaN
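
The query() and numexpr variants above rely on the fact that NaN never compares equal to itself, so "Col2 == Col2" is False exactly on the rows where Col2 is NaN. A minimal sketch of that idea in plain pandas, assuming the example frame from the question is rebuilt by hand (the construction below is an illustration, not part of the original answer):

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "Col1": [1, 2, 3],
    "Col2": [np.nan, 5, 3],
    "Col3": [4, 4, np.nan],
})

# NaN never compares equal to anything, including itself.
print(np.nan == np.nan)  # False

# Comparing the column with itself is therefore False exactly where Col2 is NaN.
mask = df["Col2"] == df["Col2"]

# The self-comparison mask agrees with notnull() row for row.
print(mask.equals(df["Col2"].notnull()))  # True

# Boolean indexing with the mask keeps only the rows where Col2 is present.
df2 = df[mask]
print(df2)
```

This is why the trick needs no explicit isnull call, although dropna(subset=...) or notnull() states the intent more directly.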

Answer 2

Answered by jezrael

Use dropna:

df = df.dropna(subset=['Col2'])
print (df)
   Col1  Col2  Col3
1     2   5.0   4.0
2     3   3.0   NaN

Another solution: boolean indexing with notnull:

df = df[df['Col2'].notnull()]
print (df)
   Col1  Col2  Col3
1     2   5.0   4.0
2     3   3.0   NaN

Which is the same as:

df = df[~df['Col2'].isnull()]
print (df)
   Col1  Col2  Col3
1     2   5.0   4.0
2     3   3.0   NaN
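
As a quick sanity check, the three variants above can be compared on the example data. This is only a sketch: the frame is rebuilt by hand, and the names by_dropna, by_notnull and by_isnull are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Example frame from the question, rebuilt by hand.
df = pd.DataFrame({
    "Col1": [1, 2, 3],
    "Col2": [np.nan, 5, 3],
    "Col3": [4, 4, np.nan],
})

# Three ways to keep only the rows where Col2 is not NaN.
by_dropna = df.dropna(subset=["Col2"])
by_notnull = df[df["Col2"].notnull()]
by_isnull = df[~df["Col2"].isnull()]

# All three produce the same rows, index labels and dtypes.
print(by_dropna.equals(by_notnull) and by_notnull.equals(by_isnull))  # True

# None of them modifies df itself; the result has to be assigned to a name.
print(len(df), len(by_dropna))  # 3 2
```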

Answer 3

Answered by piRSquared

Using numpy's isnan to mask and construct a new dataframe:

m = ~np.isnan(df.Col2.values)
pd.DataFrame(df.values[m], df.index[m], df.columns)

   Col1  Col2  Col3
1   2.0   5.0   4.0
2   3.0   3.0   NaN
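
Note that Col1 comes back as 2.0 and 3.0 in the output above: df.values is a single array, so the integer column is upcast to float along the way. A minimal sketch of that side effect and one possible way to restore the original dtypes with astype (the restore step is an addition for illustration, not part of the original answer):

```python
import numpy as np
import pandas as pd

# Example frame from the question, rebuilt by hand.
df = pd.DataFrame({
    "Col1": [1, 2, 3],
    "Col2": [np.nan, 5, 3],
    "Col3": [4, 4, np.nan],
})

# Mask of rows where Col2 is not NaN, computed on the raw numpy array.
m = ~np.isnan(df["Col2"].values)
rebuilt = pd.DataFrame(df.values[m], df.index[m], df.columns)

# df.values is one float64 array here, so the integer Col1 becomes float.
print(df.dtypes["Col1"], "->", rebuilt.dtypes["Col1"])  # int64 -> float64

# One way to get the original dtypes back after rebuilding.
restored = rebuilt.astype(df.dtypes.to_dict())
print(restored.dtypes["Col1"])  # int64
```

np.isnan only accepts numeric arrays, so this approach assumes the filtered column is numeric.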

Timing
Bigger Data

np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice([np.nan, 1], size=(10000, 10))).add_prefix('Col')

%%timeit
m = ~np.isnan(df.Col2.values)
pd.DataFrame(df.values[m], df.index[m], df.columns)
1000 loops, best of 3: 326 μs per loop

%timeit df.query("Col2 == Col2")
1000 loops, best of 3: 1.48 ms per loop

%timeit df.loc[df.Col2.notnull()]
1000 loops, best of 3: 417 μs per loop

%timeit df[~df['Col2'].isnull()]
1000 loops, best of 3: 385 μs per loop

%timeit df.dropna(subset=['Col2'])
1000 loops, best of 3: 913 μs per loop
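
The %timeit and %%timeit lines above are IPython magics. Outside IPython, a rough equivalent can be sketched with the standard timeit module. The seed, the repeat counts and the candidates dictionary below are assumptions made for this illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: a different seed and generator than the original answer,
# but the same shape (10000 rows, 10 columns of NaN or 1).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.choice([np.nan, 1.0], size=(10000, 10))).add_prefix("Col")

# A few of the approaches from the answers, wrapped as zero-argument callables.
candidates = {
    "dropna": lambda: df.dropna(subset=["Col2"]),
    "notnull": lambda: df.loc[df["Col2"].notnull()],
    "query": lambda: df.query("Col2 == Col2"),
}

# Best of 3 runs, 20 calls per run, reported as seconds per call.
timings = {}
for name, fn in candidates.items():
    timings[name] = min(timeit.repeat(fn, number=20, repeat=3)) / 20
    print(f"{name}: {timings[name] * 1e6:.0f} us per call")
```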

Answer 4

Answered by Alex Schwab

If you want to count and graph the number of NaNs per column before dropping your column(s):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

cols = df.columns
nans = [df[col].isna().sum() for col in cols] 

sns.set(font_scale=1.1)
# Recent seaborn versions require x and y to be passed as keyword arguments.
ax = sns.barplot(x=list(cols), y=nans, palette='hls', log=False)
ax.set(xlabel='Feature', ylabel='Number of NaNs', title='Number of NaNs per feature')
for p, uniq in zip(ax.patches, nans):
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 10,
            uniq,
            ha="center") 
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
plt.show()

Answer 5

Answered by Grant Shannon

The simple implementation below follows on from the above, but shows filtering out nan rows in a specific column in place and, for large data frames, counting rows with nan by column name (before and after).

import pandas as pd
import numpy as np

df = pd.DataFrame([[1,np.nan,'A100'],[4,5,'A213'],[7,8,np.nan],[10,np.nan,'GA23']])
df.columns = ['areaCode','Distance','accountCode']

dataframe:

areaCode    Distance    accountCode
1           NaN         A100
4           5.0         A213
7           8.0         NaN
10          NaN         GA23

Before: count rows with nan (for each column):

df.isnull().sum()

count by column:

areaCode       0
Distance       2
accountCode    1
dtype: int64

remove unwanted rows in-place:

df.dropna(subset=['Distance'],inplace=True)

After: count rows with nan (for each column):

df.isnull().sum()

count by column:

areaCode       0
Distance       0
accountCode    1
dtype: int64

dataframe:

areaCode    Distance    accountCode
4           5.0         A213
7           8.0         NaN
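
After the in-place drop, the surviving rows keep their original index labels (1 and 2 in this example). A minimal sketch of a variant, assuming you would rather reassign than use inplace=True and want a fresh 0-based index; the names before, after and cleaned are made up for the illustration:

```python
import numpy as np
import pandas as pd

# Same example data as above.
df = pd.DataFrame(
    [[1, np.nan, "A100"], [4, 5, "A213"], [7, 8, np.nan], [10, np.nan, "GA23"]],
    columns=["areaCode", "Distance", "accountCode"],
)

# NaN count per column before filtering.
before = df.isnull().sum()

# Drop rows with NaN in Distance only, then renumber the index from 0.
cleaned = df.dropna(subset=["Distance"]).reset_index(drop=True)

# NaN count per column after filtering.
after = cleaned.isnull().sum()

# Side-by-side view of the counts by column name.
print(pd.DataFrame({"before": before, "after": after}))
```

Because the result is assigned to a new name, df itself still holds all four rows.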
