Note: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me).
Original StackOverflow URL: http://stackoverflow.com/questions/37157444/

Date: 2020-09-14 01:13:12  Source: igfitidea

How to filter by NaN in string column in pandas?

Tags: python, pandas

Asked by Richard

I'm using pandas 0.18. I have loaded a dataframe from CSV using pd.read_csv(), and it looks as though the empty cells in the CSV have been loaded as NaN in the dataframe.

Now I want to find the number of rows with an empty value in a particular column, but I'm struggling.

This is my dataframe:

      ods         id provider
0  A86016        NaN     emis
1  L81042     463061      NaN
2  C84013        NaN      tpp
3  G82228     462941     emis
4  C81083        NaN      tpp

This is what I get from a df.describe():

           ods         id provider
count     9897       7186     9022
unique    8066        192        4
top     N83028     463090     emis
freq         7        169     4860

I want to get all the rows where provider was empty in the CSV. This is what I've tried:

>>> print(len(df[df.provider == 'NaN']))
0
>>> print(len(df[df.provider == np.nan]))
0

I can see that there are some NaN values in there (e.g. row 1), so what gives?

Also, why does pandas convert empty values in string columns like provider to NaN? Wouldn't it make more sense to convert them to an empty string?

Answered by jezrael

Use isnull to check for NaN:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ods': {0: 'A86016', 1: 'L81042', 2: 'C84013', 3: 'G82228', 4: 'C81083'},
                   'id': {0: np.nan, 1: 463061.0, 2: np.nan, 3: 462941.0, 4: np.nan},
                   'provider': {0: 'emis', 1: np.nan, 2: 'tpp', 3: 'emis', 4: 'tpp'}})

print(df)
      ods        id provider
0  A86016       NaN     emis
1  L81042  463061.0      NaN
2  C84013       NaN      tpp
3  G82228  462941.0     emis
4  C81083       NaN      tpp

# Rows where provider is missing
print(df[df.provider.isnull()])
      ods        id provider
1  L81042  463061.0      NaN

# Number of such rows
print(len(df[df.provider.isnull()]))
1
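
Since the question asks for the number of rows with an empty value, a minimal sketch of counting them directly is shown here. It rebuilds the same sample data with plain lists (an assumption made only to keep the snippet short) and relies on the fact that summing a boolean mask counts its True entries.

```python
import numpy as np
import pandas as pd

# Same sample data as above, written with lists for brevity.
df = pd.DataFrame({
    "ods": ["A86016", "L81042", "C84013", "G82228", "C81083"],
    "id": [np.nan, 463061.0, np.nan, 462941.0, np.nan],
    "provider": ["emis", np.nan, "tpp", "emis", "tpp"],
})

# isnull() returns a boolean mask; sum() counts the True values.
missing_provider = int(df["provider"].isnull().sum())
print(missing_provider)

# The same idea applied to every column at once.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

In newer pandas versions isna() is available as an alias of isnull(); the asker's pandas 0.18 predates it, so isnull() is used here.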

If you need to convert NaN to an empty string (''), use fillna:

# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['provider'] = df['provider'].fillna('')
print(df)
      ods        id provider
0  A86016       NaN     emis
1  L81042  463061.0         
2  C84013       NaN      tpp
3  G82228  462941.0     emis
4  C81083       NaN      tpp

Docs:

Warning

One has to be mindful that in python (and numpy), the nan's don't compare equal, but None's do. Note that Pandas/numpy uses the fact that np.nan != np.nan, and treats None like np.nan.

In [11]: None == None
Out[11]: True

In [12]: np.nan == np.nan
Out[12]: False

So as compared to above, a scalar equality comparison versus a None/np.nan doesn't provide useful information.

In [13]: df2['one'] == np.nan
Out[13]: 
a    False
b    False
c    False
d    False
e    False
f    False
g    False
h    False
Name: one, dtype: bool
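
To make the contrast concrete, here is a small sketch that applies both checks to the same Series. The Series `s` is a made-up example, not the `df2` from the docs, and object dtype is set explicitly to mirror a string column loaded from CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical string column with one missing value.
s = pd.Series(["emis", np.nan, "tpp"], dtype=object)

# Element-wise equality: NaN never compares equal, so nothing matches.
eq_mask = s == np.nan

# Dedicated missing-value check: flags the NaN entry.
null_mask = s.isnull()

print(eq_mask.tolist())    # [False, False, False]
print(null_mask.tolist())  # [False, True, False]
```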

But if nan is a string:

df = pd.DataFrame({'ods': {0: 'A86016', 1: 'L81042', 2: 'C84013', 3: 'G82228', 4: 'C81083'}, 
                   'id': {0: np.nan, 1: 463061.0, 2: np.nan, 3: 462941.0, 4: np.nan}, 
                   'provider': {0: 'emis', 1: 'nan', 2: 'tpp', 3: 'emis', 4: 'tpp'}})

print(df)
      ods        id provider
0  A86016       NaN     emis
1  L81042  463061.0      nan
2  C84013       NaN      tpp
3  G82228  462941.0     emis
4  C81083       NaN      tpp


print (df[df.provider == 'nan'])
      ods        id provider
1  L81042  463061.0      nan
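
If a column might contain both real NaN values and the literal string 'nan', the two checks can be combined. This is only a sketch on a hypothetical Series named `provider`; whether the string 'nan' should count as missing depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing a real NaN with the literal string 'nan'.
provider = pd.Series(["emis", np.nan, "nan", "tpp"], dtype=object)

# True for real missing values OR the placeholder string.
is_missing = provider.isnull() | (provider == "nan")

print(is_missing.tolist())  # [False, True, True, False]
```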

Do you know why pandas imports empty strings as NaN rather than as empty strings?

See the docs (bold by me):

na_values : str, list-like or dict, default None

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'NA', '#NA', 'NULL', 'NaN', '-NaN', 'nan', '-nan', ''.
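
Because '' is in that default list, empty cells become NaN. If you would rather keep them as empty strings, one option is to switch the default list off with read_csv's keep_default_na parameter. The sketch below assumes a tiny inline CSV (made up for illustration) and reads every column as a string so the result is easy to compare; note that this also stops the empty id cells from becoming NaN.

```python
import io

import pandas as pd

# Hypothetical CSV text with empty cells in the id and provider columns.
csv_text = "ods,id,provider\nA86016,,emis\nL81042,463061,\nC84013,,tpp\n"

# Default behaviour: empty cells are parsed as NaN.
default_df = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False with no na_values: no strings are parsed as NaN.
raw_df = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

print(default_df["provider"].isnull().sum())
print(raw_df["provider"].tolist())
```

The trade-off is that a numeric column such as id then has to be converted by hand, which is one reason the default NaN handling is usually the more convenient choice.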

Answered by ysearka

You can first store the non-NA values, and then keep only the rows whose index is not among them:

without_na = df['provider'].dropna()
df[~df.index.isin(without_na.index)]
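
As a sanity check, the sketch below runs this index-based approach next to a plain isnull() mask on the question's sample data (rebuilt with lists, which is an assumption for brevity) and confirms that both select the same rows. The index-based version assumes the index labels are unique; with duplicate labels the isin test could misclassify rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ods": ["A86016", "L81042", "C84013", "G82228", "C81083"],
    "id": [np.nan, 463061.0, np.nan, 462941.0, np.nan],
    "provider": ["emis", np.nan, "tpp", "emis", "tpp"],
})

# Index-based approach: keep rows whose label is NOT among the non-NA rows.
without_na = df["provider"].dropna()
via_index = df[~df.index.isin(without_na.index)]

# Boolean-mask approach: select rows where provider is missing.
via_mask = df[df["provider"].isnull()]

print(via_index.equals(via_mask))  # True
```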