pandas: How to filter by NaN in a string column?

Question

Asked by Richard

I'm using pandas 0.18. I have loaded a dataframe from CSV using pd.read_csv(), and it looks as though the empty cells in the CSV have been loaded as NaN in the dataframe.

Now I want to find the number of rows with an empty value in a particular column, but I'm struggling.

This is my dataframe:

      ods         id provider
0  A86016        NaN     emis
1  L81042     463061      NaN
2  C84013        NaN      tpp
3  G82228     462941     emis
4  C81083        NaN      tpp

This is what I get from df.describe():

           ods         id provider
count     9897       7186     9022
unique    8066        192        4
top     N83028     463090     emis
freq         7        169     4860

I want to get all the rows where provider was empty in the CSV. This is what I've tried:

>>> print len(df[df.provider == 'NaN'])
0
>>> print len(df[df.provider == np.nan])
0

I can see that there are some NaN values in there (e.g. row 1), so what gives?

Also, why does pandas convert empty values in string columns like provider to NaN? Wouldn't it make more sense to convert them to an empty string?

Answer 1

Answered by jezrael

Use isnull to test for NaN:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ods': {0: 'A86016', 1: 'L81042', 2: 'C84013', 3: 'G82228', 4: 'C81083'}, 
                   'id': {0: np.nan, 1: 463061.0, 2: np.nan, 3: 462941.0, 4: np.nan}, 
                   'provider': {0: 'emis', 1: np.nan, 2: 'tpp', 3: 'emis', 4: 'tpp'}})

print(df)
         id     ods provider
0       NaN  A86016     emis
1  463061.0  L81042      NaN
2       NaN  C84013      tpp
3  462941.0  G82228     emis
4       NaN  C81083      tpp

print (df[df.provider.isnull()])

      ods        id provider
1  L81042  463061.0      NaN

print(len(df[df.provider.isnull()]))
1
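
Since the question asks for the number of rows, a count can also be taken directly from the boolean mask. The following is a minimal sketch, assuming the same sample data as above; the names missing_provider and missing_per_column are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({
    'ods': ['A86016', 'L81042', 'C84013', 'G82228', 'C81083'],
    'id': [np.nan, 463061.0, np.nan, 462941.0, np.nan],
    'provider': ['emis', np.nan, 'tpp', 'emis', 'tpp'],
})

# Summing a boolean mask counts its True entries (True counts as 1).
missing_provider = int(df['provider'].isnull().sum())
print(missing_provider)

# Missing-value counts for every column at once.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

This avoids building the filtered dataframe when only the count is needed.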

If you need to convert NaN to an empty string (''), use fillna:

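# Assign the result back: fillna(inplace=True) on df.provider may not update df in newer pandas versions.
df['provider'] = df['provider'].fillna('')
print(df)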
         id     ods provider
0       NaN  A86016     emis
1  463061.0  L81042         
2       NaN  C84013      tpp
3  462941.0  G82228     emis
4       NaN  C81083      tpp

Docs:

Warning
One has to be mindful that in python (and numpy), the nan's don't compare equal, but None's do. Note that Pandas/numpy uses the fact that np.nan != np.nan, and treats None like np.nan.

In [11]: None == None
Out[11]: True

In [12]: np.nan == np.nan
Out[12]: False

So as compared to above, a scalar equality comparison versus a None/np.nan doesn't provide useful information.

In [13]: df2['one'] == np.nan
Out[13]: 
a    False
b    False
c    False
d    False
e    False
f    False
g    False
h    False
Name: one, dtype: bool
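
The behaviour described in the docs excerpt can be checked on a small string column. This is only a sketch; the Series s and the mask names are made up for illustration.

```python
import numpy as np
import pandas as pd

# A small string column with one missing value (hypothetical data).
s = pd.Series(['emis', np.nan, 'tpp'], name='provider')

# Elementwise == against np.nan: NaN never compares equal, so nothing matches.
eq_mask = (s == np.nan)

# isnull() tests for missing values explicitly and finds the NaN.
null_mask = s.isnull()

print(eq_mask.tolist())
print(null_mask.tolist())

# The module-level function treats None and np.nan alike.
print(pd.isnull(None), pd.isnull(np.nan))
```

The equality mask is False everywhere, which is why both attempts in the question returned 0 rows.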

But if nan is stored as the string 'nan':

df = pd.DataFrame({'ods': {0: 'A86016', 1: 'L81042', 2: 'C84013', 3: 'G82228', 4: 'C81083'}, 
                   'id': {0: np.nan, 1: 463061.0, 2: np.nan, 3: 462941.0, 4: np.nan}, 
                   'provider': {0: 'emis', 1: 'nan', 2: 'tpp', 3: 'emis', 4: 'tpp'}})

print(df)
      ods        id provider
0  A86016       NaN     emis
1  L81042  463061.0      nan
2  C84013       NaN      tpp
3  G82228  462941.0     emis
4  C81083       NaN      tpp


print (df[df.provider == 'nan'])
      ods        id provider
1  L81042  463061.0      nan

Do you know why pandas imports empty strings as NaN rather than empty strings?

See the docs (emphasis mine):

na_values : str, list-like or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'NA', '#NA', 'NULL', 'NaN', '-NaN', 'nan', '-nan', ''.
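
If empty strings are preferred over NaN, read_csv also accepts keep_default_na=False, which turns off the default list quoted above. A minimal sketch, assuming a tiny in-memory CSV (csv_text is made-up sample data) and reading every column as a string:

```python
import io

import pandas as pd

# Hypothetical two-row CSV with one empty id cell and one empty provider cell.
csv_text = "ods,id,provider\nA86016,,emis\nL81042,463061,\n"

# Default behaviour: empty cells become NaN.
default_df = pd.read_csv(io.StringIO(csv_text))

# With the default NA strings disabled, empty cells stay as ''.
# dtype=str keeps every column as text, so the id column is not numeric here.
raw_df = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

print(default_df['provider'].isnull().tolist())
print(raw_df['provider'].tolist())
```

Note that this also stops strings such as 'NA' and 'NULL' from being read as missing, which may or may not be what you want.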

Answer 2

Answered by ysearka

You can first collect the non-NaN values with dropna, and then select the rows whose index is not among them:

without_na = df['provider'].dropna()
df[~df.index.isin(without_na.index)]
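
As a quick check, the index-based selection can be compared with a plain isnull mask. This sketch reuses the sample data from the first answer; by_index and by_mask are illustrative names, and it assumes the index labels are unique (with duplicate labels the index-based version could keep or drop the wrong rows).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ods': ['A86016', 'L81042', 'C84013', 'G82228', 'C81083'],
    'id': [np.nan, 463061.0, np.nan, 462941.0, np.nan],
    'provider': ['emis', np.nan, 'tpp', 'emis', 'tpp'],
})

# Index-based selection, as in the answer above.
without_na = df['provider'].dropna()
by_index = df[~df.index.isin(without_na.index)]

# Direct boolean mask on the column.
by_mask = df[df['provider'].isnull()]

# Both approaches select the same rows for this data.
print(by_index.equals(by_mask))
print(by_index)
```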
