Pandas read_csv fills empty values with string 'nan', instead of parsing date
Original question: http://stackoverflow.com/questions/16157939/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow.
Asked by Jeff
I assign np.nan to the missing values in a column of a DataFrame. The DataFrame is then written to a csv file using to_csv. If I open the resulting csv file in a text editor, it correctly has nothing between the commas for the missing values. But when I read that csv file back into a DataFrame using read_csv, the missing values become the string 'nan' instead of NaN. As a result, isnull() does not work. For example:
In [13]: df
Out[13]:
index value date
0 975 25.35 nan
1 976 26.28 nan
2 977 26.24 nan
3 978 25.76 nan
4 979 26.08 nan
In [14]: df.date.isnull()
Out[14]:
0 False
1 False
2 False
3 False
4 False
Am I doing anything wrong? Should I assign some value other than np.nan to the missing values so that isnull() can pick them up?
EDIT: Sorry, forgot to mention that I also set parse_dates = [2] to parse that column. That column contains dates, with some rows missing. I would like the missing rows to be NaN.
EDIT: I just found out that the issue is really due to parse_dates. If the date column contains missing values, read_csv will not parse that column. Instead, it will read the dates as strings and assign the string 'nan' to the empty values.
In [21]: data = pd.read_csv('test.csv', parse_dates = [1])
In [22]: data
Out[22]:
value date id
0 2 2013-3-1 a
1 3 2013-3-1 b
2 4 2013-3-1 c
3 5 nan d
4 6 2013-3-1 d
In [23]: data.date[3]
Out[23]: 'nan'
pd.to_datetime does not work either:
In [12]: data
Out[12]:
value date id
0 2 2013-3-1 a
1 3 2013-3-1 b
2 4 2013-3-1 c
3 5 nan d
4 6 2013-3-1 d
In [13]: data.dtypes
Out[13]:
value int64
date object
id object
In [14]: pd.to_datetime(data['date'])
Out[14]:
0 2013-3-1
1 2013-3-1
2 2013-3-1
3 nan
4 2013-3-1
Name: date
Is there a way to make read_csv with parse_dates work on columns that contain missing values, i.e. assign NaN to missing values and still parse the valid dates?
Answered by Jeff
This is currently a buglet in the parser, see: https://github.com/pydata/pandas/issues/3062. An easy workaround is to force-convert the column after you read it in; this populates the NaNs with NaT, the Not-A-Time marker (the datetime equivalent of NaN). This should work on 0.10.1.
In [22]: df
Out[22]:
value date id
0 2 2013-3-1 a
1 3 2013-3-1 b
2 4 2013-3-1 c
3 5 NaN d
4 6 2013-3-1 d
In [23]: df.dtypes
Out[23]:
value int64
date object
id object
dtype: object
In [24]: pd.to_datetime(df['date'])
Out[24]:
0 2013-03-01 00:00:00
1 2013-03-01 00:00:00
2 2013-03-01 00:00:00
3 NaT
4 2013-03-01 00:00:00
Name: date, dtype: datetime64[ns]
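As a minimal, self-contained sketch of that workaround (the CSV text below is made up to mirror the table above, and the column is read without parse_dates so the conversion is done explicitly afterwards):

```python
import io

import pandas as pd

# Made-up CSV mirroring the table above; the fourth row has an empty date field.
csv_text = """value,date,id
2,2013-03-01,a
3,2013-03-01,b
4,2013-03-01,c
5,,d
6,2013-03-01,d
"""

# Read without parse_dates, so the date column holds strings plus a real NaN.
df = pd.read_csv(io.StringIO(csv_text))

# Force-convert after reading; the missing entry becomes NaT.
df["date"] = pd.to_datetime(df["date"])

print(df["date"].isnull().tolist())
```

After the conversion, isnull() flags only the row whose date field was empty.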
If the string 'nan' actually appears in your data, you can do this:
In [31]: s = Series(['2013-1-1','2013-1-1','nan','2013-1-1'])
In [32]: s
Out[32]:
0 2013-1-1
1 2013-1-1
2 nan
3 2013-1-1
dtype: object
In [39]: s[s=='nan'] = np.nan
In [40]: s
Out[40]:
0 2013-1-1
1 2013-1-1
2 NaN
3 2013-1-1
dtype: object
In [41]: pandas.to_datetime(s)
Out[41]:
0 2013-01-01 00:00:00
1 2013-01-01 00:00:00
2 NaT
3 2013-01-01 00:00:00
dtype: datetime64[ns]
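The same steps can be written as a standalone script. This is only a sketch: it uses Series.mask rather than the in-place assignment shown above, so the original Series is left untouched, and the sample values are made up.

```python
import pandas as pd

# Made-up sample: one entry is the literal string 'nan', not a missing value.
s = pd.Series(["2013-01-01", "2013-01-01", "nan", "2013-01-01"])

# mask() replaces entries where the condition is True with NaN by default
# and returns a new Series, leaving s unchanged.
cleaned = s.mask(s == "nan")

# Converting afterwards turns the NaN into NaT.
dates = pd.to_datetime(cleaned)

print(dates.isnull().tolist())
```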
Answered by bdiamante
Answered by ccxxxx
I had the same problem when importing a csv file using
dataframe1 = pd.read_csv(input_file, parse_dates=['date1', 'date2'])
where date1 contains valid dates while date2 is an empty column. Apparently dataframe1['date2'] is filled with a whole column of 'nan'.
In other words, when a column listed in parse_dates is empty, read_csv fills it with the string 'nan' instead of NaN. numpy and pandas recognize NaN as null, but not the string 'nan'.
A simple solution is:
from numpy import nan
dataframe1.replace('nan', nan, inplace=True)
And then you should be good to go!
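A small sketch of this replacement on a hand-built frame (the column names follow the answer, but the data is made up and the frame is constructed directly rather than read from a file). It uses the non-inplace form of replace, which returns a new DataFrame:

```python
import numpy as np
import pandas as pd

# Made-up frame imitating the situation described: date2 holds only 'nan' strings.
df = pd.DataFrame({
    "date1": ["2013-03-01", "2013-03-02"],
    "date2": ["nan", "nan"],
})

# Replace the literal string 'nan' with a real missing value everywhere.
cleaned = df.replace("nan", np.nan)

# isnull() now recognizes the entries in date2 as missing.
print(cleaned["date2"].isnull().all())
```

Note that this replaces the string 'nan' in every column, so it is only appropriate when 'nan' never occurs as a legitimate value.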

