Is this a Pandas bug with notnull() or a fundamental misunderstanding on my part (probably misunderstanding)

Source: Stack Overflow, http://stackoverflow.com/questions/23789378/ (content licensed under CC BY-SA 4.0; if you reuse it, attribute it to the original authors and link to the original question).

Tags: python, pandas, null

Asked by FinanceGuyThatCantCode

I have a pandas dataframe with two columns and default indexing. The first column is a string and the second is a date. The top date is NaN (though it should be NaT really).

index    somestr    date
0        ON         NaN
1        1C         2014-06-11 00:00:00
2        2C         2014-07-09 00:00:00
3        3C         2014-08-13 00:00:00
4        4C         2014-09-10 00:00:00
5        5C         2014-10-08 00:00:00
6        6C         2014-11-12 00:00:00
7        7C         2014-12-10 00:00:00
8        8C         2015-01-14 00:00:00
9        9C         2015-02-11 00:00:00
10       10C        2015-03-11 00:00:00
11       11C        2015-04-08 00:00:00
12       12C        2015-05-13 00:00:00

Call this dataframe df.

When I run:

df[pd.notnull(df['date'])]

I expect the first row to go away. It doesn't. If I remove the string column by setting:

df=df[['date']]

Then apply:

df[pd.notnull(df['date'])]

then the first row with the null does go away.

Also, the row with the null always goes away if all columns are number/date types. When a column with a string appears, this problem occurs.

Surely this is a bug, right? I am not sure if others will be able to replicate this. This was on my Enthought Canopy for Windows (I am not smart enough for UNIX/Linux command line noise)

Per the requests below from Jeff and unutbu. @unutbu:

df.dtypes
somestr    object
date       object
dtype:  object

Also:

type(df.iloc[0]['date'])
pandas.tslib.NaTType

In the code this column was specifically assigned as pd.NaT. I also do not understand why it says NaN when it should say NaT. The filtering I used worked fine when I used this toy frame:

df=pd.DataFrame({'somestr' : ['aa', 'bb'], 'date' : [pd.NaT, dt.datetime(2014,4,15)]}, columns=['somestr', 'date'])

It should also be noted that although the table above had NaN in the output, the following output NaT:

df['date'][0]
NaT

Also:

pd.notnull(df['date'][0])
False

pd.notnull(df['date'][1])
True

But when evaluating the whole array, they all came back True, which is bizarre:

np.all(pd.notnull(df['date']))
True
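
For comparison, here is a minimal sketch of the same check on a recent pandas release (an assumption; the question is about 0.12). It compares the element-by-element result with the vectorized one on an object-dtype column. The names `s`, `elementwise`, and `vectorized` are illustrative and not from the original post.

```python
import datetime as dt
import pandas as pd

# Object-dtype column holding NaT and a datetime, as in the question.
s = pd.Series([pd.NaT, dt.datetime(2014, 6, 11)], dtype=object)

# Check each value on its own, then check the whole column at once.
elementwise = [bool(pd.notnull(v)) for v in s]
vectorized = pd.notnull(s).tolist()

print(s.dtype)
print(elementwise)
print(vectorized)
# If the two lists differ, the vectorized path is mishandling the column.
```

If the two lists agree, the element-wise and array-level checks are consistent, which is what one would expect; a mismatch like the one described above points at the vectorized code path.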

@Jeff - this is 0.12, and I am stuck with that version. The frame was created by concatenating two different frames that were grabbed from database queries using psql. The date and some other float columns were then added by calculations I did. Of course, I filtered down to the two relevant columns that made sense here once I pinpointed that the string-valued columns were causing problems.

************How to Replicate **********

import pandas as pd
import datetime as dt

print(pd.__version__)
# 0.12.0

df = pd.DataFrame({'somestr': ['aa', 'bb'], 'date': ['cc', 'dd']},
                  columns=['somestr', 'date'])
df['date'].iloc[0] = pd.NaT
df['date'].iloc[1] = pd.to_datetime(dt.datetime(2014, 4, 15))
print(df[pd.notnull(df['date'])])
#   somestr                 date
# 0      aa                  NaN
# 1      bb  2014-04-15 00:00:00

df2 = df[['date']]
print(df2[pd.notnull(df2['date'])])
#                  date
# 1 2014-04-15 00:00:00

So, this dataframe originally had all string entries. The date column was then converted to dates with a NaT at the top. Note that the printed table shows NaN, but df.iloc[0]['date'] does return NaT. Using the snippet above, you can see that filtering by not-null behaves differently with and without the somestr column. Again, this is Enthought Canopy for Windows with Pandas 0.12 and NumPy 1.8.
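
One possible workaround, sketched here under the assumption that the column converts cleanly: turn the object-dtype date column into a real datetime64 column with `pd.to_datetime` before filtering. This is not confirmed as the fix for the 0.12 behaviour; it was written against a recent pandas release, and `fixed` and `filtered` are illustrative names.

```python
import datetime as dt
import pandas as pd

# Rebuild a frame like the one above, with 'date' held as object dtype.
df = pd.DataFrame({
    'somestr': ['aa', 'bb'],
    'date': pd.Series([pd.NaT, dt.datetime(2014, 4, 15)], dtype=object),
}, columns=['somestr', 'date'])
print(df.dtypes)

# Convert the column to datetime64 so missing values are stored as NaT.
fixed = df.copy()
fixed['date'] = pd.to_datetime(fixed['date'])
print(fixed.dtypes)

# Filter on the converted column.
filtered = fixed[pd.notnull(fixed['date'])]
print(filtered)
```

Checking `df.dtypes` before and after the conversion is a quick way to confirm whether the date column is still stored as generic Python objects.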

Answered by Roger Miller

I encountered this problem also. Here's how I fixed it. "isnull()" is a function that checks whether a value is missing (NaN, None, or NaT). The "~" (tilde) operator negates the expression that follows it. So we are saying: give me a dataframe from your original dataframe, but only the rows where the 'date' column is NOT null.

df = df[~df['date'].isnull()]

Hope this helps!
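
As a rough illustration of the approach above on a recent pandas release (an assumption), the sketch below applies the `~isnull()` mask to a small frame shaped like the one in the question and shows that `dropna(subset=...)` selects the same rows. The sample data and the names `mask`, `kept`, and `dropped` are made up for the example.

```python
import datetime as dt
import pandas as pd

# Small frame shaped like the one in the question (sample values only).
df = pd.DataFrame({
    'somestr': ['ON', '1C', '2C'],
    'date': [pd.NaT, dt.datetime(2014, 6, 11), dt.datetime(2014, 7, 9)],
})

# True where 'date' is present, False where it is missing.
mask = ~df['date'].isnull()
kept = df[mask]

# dropna restricted to the 'date' column should select the same rows.
dropped = df.dropna(subset=['date'])

print(kept)
print(kept.equals(dropped))
```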
