Converting pandas columns to datetime64 including missing values

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/20689288/



Tags: python, datetime, pandas

Asked by ptpatil

I'm using pandas to work with some timeseries-based data that contains dates, numbers, categories, etc.

The problem I'm having is getting pandas to handle my date/time columns correctly in a DataFrame created from a CSV. There are 18 date columns in my data; they are not contiguous, and unknown values in the raw CSV have the string value "Unknown". Some columns have a valid datetime in every cell and get their dtype guessed correctly by the pandas read_csv method. However, some columns in a particular data sample have "Unknown" in every cell, and these get typed as object.

My code to load the CSV is as follows:


self.datecols = ['Claim Date', 'Lock Date', 'Closed Date', 'Service Date', 'Latest_Submission', 'Statement Date 1', 'Statement Date 2', 'Statement Date 3', 'Patient Payment Date 1', 'Patient Payment Date 2', 'Patient Payment Date 3', 'Primary 1 Payment Date', 'Primary 2 Payment Date', 'Primary 3 Payment Date', 'Secondary 1 Payment Date', 'Secondary 2 Payment Date', 'Tertiary Payment Date']
self.csvbear = pd.read_csv(file_path, index_col="Claim ID", parse_dates=True, na_values=['Unknown'])
self.csvbear = pd.DataFrame.convert_objects(self.csvbear, convert_dates='coerce')
print self.csvbear.dtypes
print self.csvbear['Tertiary Payment Date'].values

The output from print self.csvbear.dtypes


Prac                            object
Doctor Name                     object
Practice Name                   object
Specialty                       object
Speciality Code                  int64
Claim Date              datetime64[ns]
Lock Date               datetime64[ns]
Progress Note Locked            object
Aging by Claim Date              int64
Aging by Lock Date               int64
Closed Date             datetime64[ns]
Service Date            datetime64[ns]
Week Number                      int64
Month                   datetime64[ns]
Current Insurance               object
...
Secondary 2 Deductible        float64
Secondary 2 Co Insurance      float64
Secondary 2 Member Balance    float64
Secondary 2 Paid              float64
Secondary 2 Witheld           float64
Secondary 2 Ins                object
Tertiary Payment Date          object
Tertiary Payment ID           float64
Tertiary Allowed              float64
Tertiary Deductible           float64
Tertiary Co Insurance         float64
Tertiary Member Balance       float64
Tertiary Paid                 float64
Tertiary Witheld              float64
Tertiary Ins                  float64
Length: 96, dtype: object
[nan nan nan ..., nan nan nan]
Press any key to continue . . .

As you can see, the Tertiary Payment Date column should have dtype datetime64, but it is simply object, and its actual content is just NaN (put there by the read_csv function for the string 'Unknown').

How can I reliably convert all of the date columns to dtype datetime64, with NaT for the 'Unknown' cells?

Answered by Jeff

If you have an all-NaN column it won't be coerced properly by read_csv. The easiest fix is just to do this (if a column is already datetime64[ns] it will simply pass through):

In [3]: df = DataFrame(dict(A = Timestamp('20130101'), B = np.random.randn(5), C = np.nan))

In [4]: df
Out[4]: 
                    A         B   C
0 2013-01-01 00:00:00 -0.859994 NaN
1 2013-01-01 00:00:00 -2.562136 NaN
2 2013-01-01 00:00:00  0.410673 NaN
3 2013-01-01 00:00:00  0.480578 NaN
4 2013-01-01 00:00:00  0.464771 NaN

[5 rows x 3 columns]

In [5]: df.dtypes
Out[5]: 
A    datetime64[ns]
B           float64
C           float64
dtype: object

In [6]: df['A'] = pd.to_datetime(df['A'])

In [7]: df['C'] = pd.to_datetime(df['C'])

In [8]: df
Out[8]: 
                    A         B   C
0 2013-01-01 00:00:00 -0.859994 NaT
1 2013-01-01 00:00:00 -2.562136 NaT
2 2013-01-01 00:00:00  0.410673 NaT
3 2013-01-01 00:00:00  0.480578 NaT
4 2013-01-01 00:00:00  0.464771 NaT

[5 rows x 3 columns]

In [9]: df.dtypes
Out[9]: 
A    datetime64[ns]
B           float64
C    datetime64[ns]
dtype: object

convert_objects won't forcibly convert a column to datetime unless it has at least one non-NaN value that is a date (that's why your example fails). to_datetime can be more aggressive because it 'knows' that you really want the conversion.
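A note for readers on current pandas: convert_objects was deprecated and later removed, and pd.to_datetime with errors='coerce' is the usual replacement for this kind of forced conversion. A minimal sketch reproducing the session above with that call (the column names are illustrative):

```python
import numpy as np
import pandas as pd

# A frame like the one above: a real datetime column, a numeric
# column, and an all-NaN column.
df = pd.DataFrame({
    "A": pd.Timestamp("20130101"),
    "B": np.random.randn(5),
    "C": np.nan,
})

# errors='coerce' maps anything unparseable to NaT, so even the
# all-NaN column C comes out as datetime64[ns] filled with NaT.
for col in ["A", "C"]:
    df[col] = pd.to_datetime(df[col], errors="coerce")

print(df.dtypes)
```

Column A, already datetime64[ns], passes through unchanged, just as convert_objects did in the original session.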

Answered by ericmjl

I like your method of using DataFrame.convert_objects; it is much more elegant than what I had tried many eons before.

Looking at the API docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html


parse_dates : boolean, list of ints or names, list of lists, or dict

If True -> try parsing the index. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'

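To make the list-of-names form concrete, here is a small sketch (the CSV content is invented for illustration, borrowing two of the asker's column names):

```python
import io
import pandas as pd

# A tiny CSV with one well-behaved date column and one column that
# is entirely 'Unknown', like the problematic sample in the question.
csv = io.StringIO(
    "Claim ID,Claim Date,Tertiary Payment Date\n"
    "1,2013-01-05,Unknown\n"
    "2,2013-03-10,Unknown\n"
)

# List-of-names form: parse exactly these columns as dates;
# na_values turns 'Unknown' into missing values first.
df = pd.read_csv(
    csv,
    index_col="Claim ID",
    parse_dates=["Claim Date", "Tertiary Payment Date"],
    na_values=["Unknown"],
)
print(df.dtypes)
```

Naming the columns explicitly is more predictable than parse_dates=True, which only guesses from the data it sees.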

I presume that you're at the data wrangling stage of your analysis right now. Formatting the data to get it into the right format is generally the longest part of analysis. Some things just don't behave well, so hard-coding those things for special cases is going to be necessary.


Therefore, since you know which columns don't parse properly, I suggest that you go back into the code and parse those columns right at the read_csv stage. Here's a suggestion:

self.csvbear = pd.read_csv(file_path, index_col="Claim ID", parse_dates=[column, numbers, go, here], na_values=['Unknown'])

You will note that parse_dates=True has been changed to parse_dates=[column, numbers, go, here]. For the columns that aren't behaving properly, this is probably the quickest brute-force way to get where you need to go.
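Combining both answers gives a belt-and-braces approach. This is a sketch, not the asker's code: the two columns shown stand in for the full datecols list from the question, and the explicit to_datetime loop is the addition that guarantees datetime64[ns] with NaT even when a sample's column is entirely 'Unknown':

```python
import io
import pandas as pd

# Stand-ins for the asker's full datecols list.
datecols = ["Claim Date", "Tertiary Payment Date"]

csv = io.StringIO(
    "Claim ID,Claim Date,Tertiary Payment Date\n"
    "1,2013-01-05,Unknown\n"
    "2,Unknown,Unknown\n"
)

# 'Unknown' becomes NaN via na_values; no date parsing here yet.
df = pd.read_csv(csv, index_col="Claim ID", na_values=["Unknown"])

# Explicit pass over the known date columns: errors='coerce' turns
# anything unparseable (including an all-NaN column) into NaT, so
# every listed column ends up datetime64[ns] regardless of content.
for col in datecols:
    df[col] = pd.to_datetime(df[col], errors="coerce")

print(df.dtypes)
```

Because the conversion is driven by the known column list rather than by what happens to be in a given sample, it behaves the same whether a column is fully populated, partly 'Unknown', or entirely 'Unknown'.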