Original question: http://stackoverflow.com/questions/25653220/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
working with dates in pandas - remove unseen characters in datetime and convert to string
Asked by Ryan
I am using pandas to import data:
dfST = read_csv( ... , parse_dates={'timestamp':[date]})
In my csv, the date is in the format YYYY/MM/DD, which is all I need; there is no time. I have several data sets that I need to compare for membership. When I convert these 'timestamp' values to a string, sometimes I get something like this:
'1977-07-31T00:00:00.000000000Z'
which I understand is a datetime including milliseconds and a timezone. Is there any way to suppress the addition of the extraneous time on import? If not, I need to exclude it somehow.
dfST.timestamp[1]
Out[138]: Timestamp('1977-07-31 00:00:00')
I have tried formatting it, which seemed to work until I called the formatted values:
dfSTdate=pd.to_datetime(dfST.timestamp, format="%Y-%m-%d")
dfSTdate.head()
Out[123]:
0 1977-07-31
1 1977-07-31
Name: timestamp, dtype: datetime64[ns]
But no... when I test the value of this I also get the time:
dfSTdate[1]
Out[124]: Timestamp('1977-07-31 00:00:00')
When I convert this to an array, the time is included with the millisecond and the timezone, which really messes my comparisons up.
test97=np.array(dfSTdate)
test97[1]
Out[136]: numpy.datetime64('1977-07-30T20:00:00.000000000-0400')
How can I get rid of the time?!?
Ultimately I wish to compare membership among data sets using numpy.in1d, with the date as a string ('YYYY-MM-DD') as one part of the comparison.
Answered by joris
This is due to the way datetime values are stored in pandas: using the numpy datetime64[ns] dtype. So datetime values are always stored at nanosecond resolution. Even if you only have a date, it will be converted to a timestamp at nanosecond resolution with the time set to zero (midnight). This is simply how pandas implements it.
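As a rough illustration of the storage behaviour described above, the sketch below parses date-only strings and inspects what comes back. The sample values are hypothetical stand-ins for the asker's csv column, and the exact resolution shown in the dtype may vary between pandas versions.

```python
import pandas as pd

# Hypothetical date-only input in the 'YYYY/MM/DD' style from the question.
s = pd.to_datetime(pd.Series(["1977/07/31", "1977/08/01"]), format="%Y/%m/%d")

# The column gets a datetime64 dtype; there is no separate date-only dtype.
print(s.dtype)

# A single element is a pandas Timestamp whose time part is midnight.
first = s[0]
print(repr(first))

# Check that the time component really is all zeros.
is_midnight = (first.hour, first.minute, first.second, first.nanosecond) == (0, 0, 0, 0)
print(is_midnight)
```

The time shown in the repr is therefore not extra data picked up on import; it is the zero time that every date-only value receives.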
The issues you have with printing the values and getting unexpected results are only due to how these objects are printed in the python console (their representation), not to their actual value.
If you print a single value, you get the Timestamp representation of pandas:
Timestamp('1977-07-31 00:00:00')
So you get the seconds here as well, just because this is the default representation.
If you convert it to an array, and then print it, you get the standard numpy representation:
numpy.datetime64('1977-07-30T20:00:00.000000000-0400')
This is indeed a very misleading representation: just for printing in the console, numpy converts the value to your local timezone. This doesn't change the actual value; it is only odd printing.
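To check that only the printed form differs, one can compare the array element against a plain date. This is a minimal sketch with hypothetical sample data; the names dfSTdate and test97 mirror the question. Note that NumPy 1.11 and later print datetime64 values without a local-timezone offset, so the shifted repr may not appear on a current installation.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's parsed column.
dfSTdate = pd.Series(pd.to_datetime(["1977-07-31", "1977-07-31"]), name="timestamp")
test97 = np.array(dfSTdate)

# Whatever the repr shows, the stored value is midnight on 1977-07-31.
same_instant = test97[1] == np.datetime64("1977-07-31T00:00:00")

# Casting to day resolution drops the (zero) time part inside numpy.
as_days = test97.astype("datetime64[D]")

# numpy can also format the values as 'YYYY-MM-DD' strings directly.
as_text = np.datetime_as_string(test97, unit="D")

print(same_instant, as_days[1], as_text[1])
```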
That is the background. Now to answer your question: how do I get rid of the time?
That depends on your goal. Do you really want to convert it to a string? Or do you just not like the repr?
If you just want to work with the datetime values, you don't need to get rid of it.
If you want to convert it to strings, you can apply strftime: df['timestamp'].apply(lambda x: x.strftime('%Y-%m-%d')). Or, if the goal is to write the values as strings to csv, use the date_format keyword in to_csv.
If you really want a 'date', you can use the datetime.date type (standard python type) in a DataFrame column. You can convert your existing column to this with: pd.DatetimeIndex(dfST['timestamp']).date. But personally I don't think this has many advantages.
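The options above can be sketched together as follows. The sample frame and the comparison array `other` are hypothetical. The `.dt.strftime` accessor is used here as a vectorised alternative to the `apply` call in the answer, and `np.isin` stands in for the `numpy.in1d` mentioned in the question, since `in1d` is deprecated in NumPy 2.0.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the asker's dfST.
dfST = pd.DataFrame({"timestamp": pd.to_datetime(["1977-07-31", "1977-08-01"])})

# Convert to 'YYYY-MM-DD' strings (same result as apply + strftime).
date_strings = dfST["timestamp"].dt.strftime("%Y-%m-%d")

# Write the dates to csv without a time part via date_format.
buf = io.StringIO()
dfST.to_csv(buf, index=False, date_format="%Y-%m-%d")
csv_lines = buf.getvalue().splitlines()

# Convert to standard-library datetime.date objects.
dates = pd.DatetimeIndex(dfST["timestamp"]).date

# Membership test on the string form against a second (hypothetical) data set.
other = np.array(["1977-07-31", "1980-01-01"])
mask = np.isin(np.array(date_strings.tolist()), other)

print(date_strings.tolist(), csv_lines, dates[0], mask)
```

Because the underlying datetime values compare correctly as they are, the string conversion is only needed when the other data sets hold their dates as text.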

