pandas: iterating through a pandas dataframe

Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/15683588/

Time: 2020-09-13 20:44:41  Source: igfitidea

Iterating through a pandas dataframe

python pandas

Asked by sfactor

I have a pandas dataframe where one column indicates whether the location value in another column changed in the row below it. As an example,

2013-02-05 19:45:00   (39.94, -86.159)     True
2013-02-05 19:50:00   (39.94, -86.159)     True
2013-02-05 19:55:00   (39.94, -86.159)    False
2013-02-05 20:00:00  (39.777, -85.995)    False
2013-02-05 20:05:00  (39.775, -85.978)     True
2013-02-05 20:10:00  (39.775, -85.978)     True
2013-02-05 20:15:00  (39.775, -85.978)    False
2013-02-05 20:20:00   (39.94, -86.159)     True
2013-02-05 20:25:00   (39.94, -86.159)    False

So, what I want to do is go row by row through this dataframe and check for the rows with False, and then (maybe in another column) record the total 'continuous' time spent in that place. The same place can be visited again, as in the example above; in that case it is treated as a separate case. So, for the above example, something like:

2013-02-05 19:45:00   (39.94, -86.159)     True    0
2013-02-05 19:50:00   (39.94, -86.159)     True    0
2013-02-05 19:55:00   (39.94, -86.159)    False   15
2013-02-05 20:00:00  (39.777, -85.995)    False    5  
2013-02-05 20:05:00  (39.775, -85.978)     True    0
2013-02-05 20:10:00  (39.775, -85.978)     True    0
2013-02-05 20:15:00  (39.775, -85.978)    False   15
2013-02-05 20:20:00   (39.94, -86.159)     True    0 
2013-02-05 20:25:00   (39.94, -86.159)    False   10

I would then plot a histogram of these 'continuous' times per day using the hist() function. How would I get the second dataframe from the first by iterating through the dataframe? I'm new to Python and pandas, and the real data file is huge, so I would need something reasonably efficient.

Answered by user1827356

Here's another take

df['group'] = (df.condition == False).astype('int').cumsum().shift(1).fillna(0)

df
             date    long     lat condition  group
2/5/2013 19:45:00  39.940 -86.159      True      0
2/5/2013 19:50:00  39.940 -86.159      True      0
2/5/2013 19:55:00  39.940 -86.159     False      0
2/5/2013 20:00:00  39.777 -85.995     False      1
2/5/2013 20:05:00  39.775 -85.978      True      2
2/5/2013 20:10:00  39.775 -85.978      True      2
2/5/2013 20:15:00  39.775 -85.978     False      2
2/5/2013 20:20:00  39.940 -86.159      True      3
2/5/2013 20:25:00  39.940 -86.159     False      3

df['result'] = df.groupby(['group']).date.transform(lambda sdf: 5 *len(sdf))

df
             date    long     lat condition  group result
2/5/2013 19:45:00  39.940 -86.159      True      0     15
2/5/2013 19:50:00  39.940 -86.159      True      0     15
2/5/2013 19:55:00  39.940 -86.159     False      0     15
2/5/2013 20:00:00  39.777 -85.995     False      1      5
2/5/2013 20:05:00  39.775 -85.978      True      2     15
2/5/2013 20:10:00  39.775 -85.978      True      2     15
2/5/2013 20:15:00  39.775 -85.978     False      2     15
2/5/2013 20:20:00  39.940 -86.159      True      3     10
2/5/2013 20:25:00  39.940 -86.159     False      3     10
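
For anyone trying this on a current pandas, here is a minimal sketch of the same grouping idea. The column names and the fixed 5-minute sampling interval are assumptions taken from the sample data, and the last step (zeroing the rows that are not False) is an addition to match the output the question asks for, not part of the answer above.

```python
import pandas as pd

# Sample data transcribed from the question; column names are assumptions.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2013-02-05 19:45", "2013-02-05 19:50", "2013-02-05 19:55",
        "2013-02-05 20:00", "2013-02-05 20:05", "2013-02-05 20:10",
        "2013-02-05 20:15", "2013-02-05 20:20", "2013-02-05 20:25",
    ]),
    "condition": [True, True, False, False, True, True, False, True, False],
})

# A False row closes a stay, so the row after it starts a new group.
ends_stay = ~df["condition"]
df["group"] = ends_stay.cumsum().shift(1, fill_value=0)

# Rows per group times the sampling interval (assumed to be a constant 5 minutes).
minutes = df.groupby("group")["group"].transform("count") * 5

# Report the total only on the closing (False) row and 0 everywhere else.
df["result"] = minutes.where(ends_stay, 0)
print(df)
```

Because everything here is vectorized, no explicit row-by-row loop is needed, which should matter for a large file.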

Answered by Jeff

You will need 0.11-dev. I think this will give you what you are looking for. See this section: http://pandas.pydata.org/pandas-docs/dev/timeseries.html#time-deltas for more info, as timedeltas are a newer data type that pandas is supporting

Here's your data (I separated long/lat just for convenience; the key thing is that the condition column is a bool)

In [137]: df = pd.read_csv(StringIO.StringIO(data),index_col=0,parse_dates=True)

In [138]: df
Out[138]: 
               date    long       lat condition
2013-02-05 19:45:00  39.940   -86.159      True
2013-02-05 19:50:00  39.940   -86.159      True
2013-02-05 19:55:00  39.940   -86.159     False
2013-02-05 20:00:00  39.777   -85.995     False
2013-02-05 20:05:00  39.775   -85.978      True
2013-02-05 20:10:00  39.775   -85.978      True
2013-02-05 20:15:00  39.775   -85.978     False
2013-02-05 20:20:00  39.940   -86.159      True
2013-02-05 20:25:00  39.940   -86.159     False

In [139]: df.dtypes
Out[139]: 
date         float64
long lat     float64
condition       bool
dtype: object
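
The `StringIO.StringIO` call above is the Python 2 module; on Python 3 the equivalent is `io.StringIO`. A minimal sketch of the load step follows. The answer never shows the `data` string, so the comma-separated text below is a hypothetical stand-in built from the sample rows.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the `data` string the answer does not show.
data = """date,long,lat,condition
2013-02-05 19:45:00,39.940,-86.159,True
2013-02-05 19:50:00,39.940,-86.159,True
2013-02-05 19:55:00,39.940,-86.159,False
2013-02-05 20:00:00,39.777,-85.995,False
2013-02-05 20:05:00,39.775,-85.978,True
2013-02-05 20:10:00,39.775,-85.978,True
2013-02-05 20:15:00,39.775,-85.978,False
2013-02-05 20:20:00,39.940,-86.159,True
2013-02-05 20:25:00,39.940,-86.159,False
"""

# The first column becomes a DatetimeIndex; True/False is parsed as bool.
df = pd.read_csv(io.StringIO(data), index_col=0, parse_dates=True)
print(df.dtypes)
```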

Create date columns from the index (these are datetime64[ns] dtype)

In [140]: df['date'] = df.index   
In [141]: df['rdate'] = df.index

Set rdate to NaT in the rows where condition is False (np.nan is converted to NaT)

In [142]: df.loc[~df['condition'],'rdate'] = np.nan

Forward-fill the NaTs from the previous value

In [143]: df['rdate'] = df['rdate'].ffill()

Subtract rdate from date; this produces a timedelta64[ns] column of the time differences

In [144]: df['diff'] = df['date']-df['rdate']

In [151]: df
Out[151]: 
                                   date  long lat condition               rdate  \
2013-02-05 19:45:00 2013-02-05 19:45:00   -86.159      True 2013-02-05 19:45:00   
2013-02-05 19:50:00 2013-02-05 19:50:00   -86.159      True 2013-02-05 19:50:00   
2013-02-05 19:55:00 2013-02-05 19:55:00   -86.159     False 2013-02-05 19:50:00   
2013-02-05 20:00:00 2013-02-05 20:00:00   -85.995     False 2013-02-05 19:50:00   
2013-02-05 20:05:00 2013-02-05 20:05:00   -85.978      True 2013-02-05 20:05:00   
2013-02-05 20:10:00 2013-02-05 20:10:00   -85.978      True 2013-02-05 20:10:00   
2013-02-05 20:15:00 2013-02-05 20:15:00   -85.978     False 2013-02-05 20:10:00   
2013-02-05 20:20:00 2013-02-05 20:20:00   -86.159      True 2013-02-05 20:20:00   
2013-02-05 20:25:00 2013-02-05 20:25:00   -86.159     False 2013-02-05 20:20:00   

                        diff  
2013-02-05 19:45:00 00:00:00  
2013-02-05 19:50:00 00:00:00  
2013-02-05 19:55:00 00:05:00  
2013-02-05 20:00:00 00:10:00  
2013-02-05 20:05:00 00:00:00  
2013-02-05 20:10:00 00:00:00  
2013-02-05 20:15:00 00:05:00  
2013-02-05 20:20:00 00:00:00  
2013-02-05 20:25:00 00:05:00  
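
The steps above can be collected into one short script. This is only a sketch: the index is rebuilt with `pd.date_range` on the assumption that the samples are evenly spaced 5 minutes apart, and `Series.where` is swapped in for the `.loc` assignment of `np.nan` (it also leaves NaT in the False rows).

```python
import pandas as pd

# Assumed: nine samples, 5 minutes apart, matching the timestamps shown above.
idx = pd.date_range("2013-02-05 19:45", periods=9, freq="5min")
df = pd.DataFrame(
    {"condition": [True, True, False, False, True, True, False, True, False]},
    index=idx,
)

df["date"] = df.index
# Keep the timestamp only where condition is True; False rows become NaT.
df["rdate"] = df["date"].where(df["condition"])
# Carry the last True timestamp forward into the False rows.
df["rdate"] = df["rdate"].ffill()
# Time elapsed since the last True row, as a timedelta column.
df["diff"] = df["date"] - df["rdate"]
print(df[["condition", "rdate", "diff"]])
```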

The diff column is now timedelta64[ns], and you want integers in minutes (FYI, this is a little bit clunky right now, as pandas doesn't have a scalar Timedelta type analogous to Timestamp for dates)

(Also, you may have to do a shift() on this rdate series before you ffill; I think I am off by 1 somewhere...but this is the idea)

In [175]: df['diff'].map(lambda x: x.item().seconds/60)
Out[175]: 
2013-02-05 19:45:00     0
2013-02-05 19:50:00     0
2013-02-05 19:55:00     5
2013-02-05 20:00:00    10
2013-02-05 20:05:00     0
2013-02-05 20:10:00     0
2013-02-05 20:15:00     5
2013-02-05 20:20:00     0
2013-02-05 20:25:00     5
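
One possible reading of the shift() remark, sketched under assumptions: mark the first row of each stay (the row after a False, plus the very first row), forward-fill that start time, and add one sampling interval so a stay of n rows counts n times 5 minutes. The 5-minute interval and the final masking of rows that are not False are assumptions chosen to reproduce the output the question asks for; they are not from the answer. The minutes conversion uses `.dt.total_seconds()`, which unlike `.seconds` does not drop whole days.

```python
import pandas as pd

# Assumed: nine samples, 5 minutes apart, as in the example data.
idx = pd.date_range("2013-02-05 19:45", periods=9, freq="5min")
df = pd.DataFrame(
    {"condition": [True, True, False, False, True, True, False, True, False]},
    index=idx,
)
df["date"] = df.index

# A stay starts on the first row and on every row that follows a False row.
prev = df["condition"].shift(1, fill_value=False).astype(bool)
is_start = ~prev

# Forward-fill the start timestamp of each stay.
df["rdate"] = df["date"].where(is_start).ffill()

# Elapsed time since the stay began, plus one interval for the current sample.
elapsed = df["date"] - df["rdate"] + pd.Timedelta(minutes=5)
minutes = (elapsed.dt.total_seconds() // 60).astype(int)

# Keep the total only on the closing (False) rows.
df["minutes"] = minutes.where(~df["condition"], 0)
print(df[["condition", "minutes"]])
```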