Original question: http://stackoverflow.com/questions/15683588/
Notice: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Iterating through a pandas dataframe
Asked by sfactor
I have a pandas dataframe where one column indicates whether the location value in another column changes in the row below it. As an example,
2013-02-05 19:45:00 (39.94, -86.159) True
2013-02-05 19:50:00 (39.94, -86.159) True
2013-02-05 19:55:00 (39.94, -86.159) False
2013-02-05 20:00:00 (39.777, -85.995) False
2013-02-05 20:05:00 (39.775, -85.978) True
2013-02-05 20:10:00 (39.775, -85.978) True
2013-02-05 20:15:00 (39.775, -85.978) False
2013-02-05 20:20:00 (39.94, -86.159) True
2013-02-05 20:25:00 (39.94, -86.159) False
What I want to do is go through this dataframe row by row and find the rows with False, then (maybe by adding another column) record the total 'continuous' time spent at that place. The same place can be visited again, as in the example above; in that case it is treated as a separate visit. So, for the example above, something like:
2013-02-05 19:45:00 (39.94, -86.159) True 0
2013-02-05 19:50:00 (39.94, -86.159) True 0
2013-02-05 19:55:00 (39.94, -86.159) False 15
2013-02-05 20:00:00 (39.777, -85.995) False 5
2013-02-05 20:05:00 (39.775, -85.978) True 0
2013-02-05 20:10:00 (39.775, -85.978) True 0
2013-02-05 20:15:00 (39.775, -85.978) False 15
2013-02-05 20:20:00 (39.94, -86.159) True 0
2013-02-05 20:25:00 (39.94, -86.159) False 10
I would then plot a histogram of these 'continuous' times per day using the hist() function. How would I get the second dataframe from the first by iterating through the dataframe? I'm new to Python and pandas, and the real data file is huge, so I need something reasonably efficient.
Answered by user1827356
Here's another take
df['group'] = (df.condition == False).astype('int').cumsum().shift(1).fillna(0)
df
date long lat condition group
2/5/2013 19:45:00 39.940 -86.159 True 0
2/5/2013 19:50:00 39.940 -86.159 True 0
2/5/2013 19:55:00 39.940 -86.159 False 0
2/5/2013 20:00:00 39.777 -85.995 False 1
2/5/2013 20:05:00 39.775 -85.978 True 2
2/5/2013 20:10:00 39.775 -85.978 True 2
2/5/2013 20:15:00 39.775 -85.978 False 2
2/5/2013 20:20:00 39.940 -86.159 True 3
2/5/2013 20:25:00 39.940 -86.159 False 3
df['result'] = df.groupby(['group']).date.transform(lambda sdf: 5 * len(sdf))
df
date long lat condition group result
2/5/2013 19:45:00 39.940 -86.159 True 0 15
2/5/2013 19:50:00 39.940 -86.159 True 0 15
2/5/2013 19:55:00 39.940 -86.159 False 0 15
2/5/2013 20:00:00 39.777 -85.995 False 1 5
2/5/2013 20:05:00 39.775 -85.978 True 2 15
2/5/2013 20:10:00 39.775 -85.978 True 2 15
2/5/2013 20:15:00 39.775 -85.978 False 2 15
2/5/2013 20:20:00 39.940 -86.159 True 3 10
2/5/2013 20:25:00 39.940 -86.159 False 3 10
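For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same cumsum/groupby idea. It is not the answerer's exact code: the sample frame is rebuilt by hand from the question, the 5-minute sampling interval is an assumption, and the last step (keeping the duration only on the False rows so the column matches the layout asked for in the question) is an addition.

```python
import pandas as pd

# Sample data rebuilt from the question (assumption: one row every 5 minutes).
df = pd.DataFrame({
    "date": pd.date_range("2013-02-05 19:45", periods=9, freq="5min"),
    "condition": [True, True, False, False, True, True, False, True, False],
})

# A stay ends on a False row, so label each row with the number of
# False rows seen strictly before it.
df["group"] = (~df["condition"]).cumsum().shift(1, fill_value=0)

# Every row stands for one 5-minute sample, so a stay lasts 5 * (rows in group).
stay_minutes = df.groupby("group")["condition"].transform("size") * 5

# Keep the duration only on the row that closes the stay; 0 elsewhere.
df["result"] = stay_minutes.where(~df["condition"], 0)

print(df)
```

With the sample data, the result column comes out as 0, 0, 15, 5, 0, 0, 15, 0, 10. Multiplying the row count by 5 only holds if the samples are evenly spaced; with gaps in the data you would want to work from the timestamps instead.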
Answered by Jeff
You will need pandas 0.11-dev. I think this will give you what you are looking for. See this section for more info: http://pandas.pydata.org/pandas-docs/dev/timeseries.html#time-deltas (timedeltas are a newer data type that pandas supports).
Here's your data (I separated long/lat just for convenience; the key thing is that the condition column is a bool)
In [137]: df = pd.read_csv(StringIO.StringIO(data),index_col=0,parse_dates=True)
In [138]: df
Out[138]:
date long lat condition
2013-02-05 19:45:00 39.940 -86.159 True
2013-02-05 19:50:00 39.940 -86.159 True
2013-02-05 19:55:00 39.940 -86.159 False
2013-02-05 20:00:00 39.777 -85.995 False
2013-02-05 20:05:00 39.775 -85.978 True
2013-02-05 20:10:00 39.775 -85.978 True
2013-02-05 20:15:00 39.775 -85.978 False
2013-02-05 20:20:00 39.940 -86.159 True
2013-02-05 20:25:00 39.940 -86.159 False
In [139]: df.dtypes
Out[139]:
long float64
lat float64
condition bool
dtype: object
Create two date columns from the index (these have datetime64[ns] dtype)
In [140]: df['date'] = df.index
In [141]: df['rdate'] = df.index
Set rdate to NaT on the rows where condition is False (np.nan is converted to NaT)
In [142]: df.loc[~df['condition'],'rdate'] = np.nan
Forward fill the NaTs from the previous value
In [143]: df['rdate'] = df['rdate'].ffill()
Subtract rdate from date; this produces a timedelta64[ns] column of the time differences
In [144]: df['diff'] = df['date']-df['rdate']
In [151]: df
Out[151]:
date long lat condition rdate \
2013-02-05 19:45:00 2013-02-05 19:45:00 -86.159 True 2013-02-05 19:45:00
2013-02-05 19:50:00 2013-02-05 19:50:00 -86.159 True 2013-02-05 19:50:00
2013-02-05 19:55:00 2013-02-05 19:55:00 -86.159 False 2013-02-05 19:50:00
2013-02-05 20:00:00 2013-02-05 20:00:00 -85.995 False 2013-02-05 19:50:00
2013-02-05 20:05:00 2013-02-05 20:05:00 -85.978 True 2013-02-05 20:05:00
2013-02-05 20:10:00 2013-02-05 20:10:00 -85.978 True 2013-02-05 20:10:00
2013-02-05 20:15:00 2013-02-05 20:15:00 -85.978 False 2013-02-05 20:10:00
2013-02-05 20:20:00 2013-02-05 20:20:00 -86.159 True 2013-02-05 20:20:00
2013-02-05 20:25:00 2013-02-05 20:25:00 -86.159 False 2013-02-05 20:20:00
diff
2013-02-05 19:45:00 00:00:00
2013-02-05 19:50:00 00:00:00
2013-02-05 19:55:00 00:05:00
2013-02-05 20:00:00 00:10:00
2013-02-05 20:05:00 00:00:00
2013-02-05 20:10:00 00:00:00
2013-02-05 20:15:00 00:05:00
2013-02-05 20:20:00 00:00:00
2013-02-05 20:25:00 00:05:00
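The steps above can be reproduced on a current pandas install roughly as follows. This is a sketch, not the original session: the frame is rebuilt by hand from the question (long/lat are left out because they play no part in the calculation), and Series.where is used in place of the np.nan assignment to produce the NaT values.

```python
import pandas as pd

# Sample data rebuilt from the question (assumption: 5-minute sampling).
index = pd.date_range("2013-02-05 19:45", periods=9, freq="5min")
df = pd.DataFrame(
    {"condition": [True, True, False, False, True, True, False, True, False]},
    index=index,
)

df["date"] = df.index
# Keep the timestamp where condition is True; False rows become NaT.
df["rdate"] = df["date"].where(df["condition"])
# Carry the last True timestamp forward into the NaT rows.
df["rdate"] = df["rdate"].ffill()
# datetime64 minus datetime64 gives a timedelta64 column.
df["diff"] = df["date"] - df["rdate"]

print(df[["condition", "rdate", "diff"]])
```

The diff column matches the output shown above (0, 0, 5, 10, 0, 0, 5, 0, 5 minutes), which is not yet the 15/5/15/10 the question asks for.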
The diff column is now timedelta64[ns], so you want integers in minutes (FYI, this is a little bit clunky right now, as pandas doesn't have a scalar Timedelta type analogous to Timestamp for dates)
(Also, you may have to do a shift() on the rdate series before you ffill; I think I am off by 1 somewhere)... but this is the idea
In [175]: df['diff'].map(lambda x: x.item().seconds/60)
Out[175]:
2013-02-05 19:45:00 0
2013-02-05 19:50:00 0
2013-02-05 19:55:00 5
2013-02-05 20:00:00 10
2013-02-05 20:05:00 0
2013-02-05 20:10:00 0
2013-02-05 20:15:00 5
2013-02-05 20:20:00 0
2013-02-05 20:25:00 5
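One possible way to close the off-by-one mentioned above, sketched for current pandas. This is an interpretation rather than the answerer's code: it assumes a stay begins on the first row and on every row that follows a False row, and that the closing sample counts as a full 5-minute interval (that is what makes three samples add up to 15 minutes in the question). The names starts_stay, stay_start and step are made up for the illustration.

```python
import pandas as pd

# Sample data rebuilt from the question.
dates = pd.Series(pd.date_range("2013-02-05 19:45", periods=9, freq="5min"))
condition = pd.Series([True, True, False, False, True, True, False, True, False])
step = pd.Timedelta(minutes=5)  # assumed sampling interval

# A stay starts on the first row and on every row that follows a False row.
starts_stay = (~condition).shift(1, fill_value=True).astype(bool)
stay_start = dates.where(starts_stay).ffill()

# Count the closing sample too, so three 5-minute samples give 15 minutes.
elapsed = dates - stay_start + step
# Report the duration only on the False rows; zero elsewhere.
elapsed = elapsed.where(~condition, pd.Timedelta(0))

# Whole minutes without a Python-level lambda.
minutes = elapsed // pd.Timedelta(minutes=1)
print(minutes.tolist())
```

On the sample data this gives 0, 0, 15, 5, 0, 0, 15, 0, 10. If fractional minutes matter, elapsed.dt.total_seconds() / 60 does the same conversion as floats. Note that the .seconds attribute drops whole days, so it is only safe when every difference is shorter than 24 hours.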

