Note: this page is a bilingual (Chinese-English) rendering of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/19079104/

Date: 2020-09-13 21:12:24  Source: igfitidea

Getting rows from a data frame which satisfy a condition in pandas

python pandas

Asked by Geekster

I have a data frame and I have a range of numbers. I want to find the rows where values in a particular column lie in that range.

This seems like a trivial job. I tried the techniques given here: http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-boolean

I took a simple example:

In [6]: df_s
Out[6]: 
   time  value
0     1      3
1     2      4
2     3      3
3     4      4
4     5      3
5     6      2
6     7      2
7     8      3
8     9      3

In [7]: df_s[df_s.time.isin(range(1,8))]
Out[7]: 
   time  value
0     1      3
1     2      4
2     3      3
3     4      4
4     5      3
5     6      2
6     7      2

Then I tried a sample from the data set I am actually working with, which has a timestamp column and a value column:

In [8]: df_s = pd.DataFrame({'time': range(1379945743841,1379945743850), 'value': [3,4,3,4,3,2,2,3,3]})

In [9]: df_s
Out[9]: 
            time  value
0  1379945743841      3
1  1379945743842      4
2  1379945743843      3
3  1379945743844      4
4  1379945743845      3
5  1379945743846      2
6  1379945743847      2
7  1379945743848      3
8  1379945743849      3

In [10]: df_s[df_s.time.isin(range(1379945743843,1379945743845))]
Out[10]: 
Empty DataFrame
Columns: [time, value]
Index: []

Why doesn't the same technique work in this case? What am I doing wrong?

I tried another approach:

In [11]: df_s[df_s.time >= 1379945743843 and df_s.time <=1379945743845]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-45c44def41b4> in <module>()
----> 1 df_s[df_s.time >= 1379945743843 and df_s.time <=1379945743845]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Then I tried a slightly more complex approach:

In [13]: df_s.ix[[idx for idx in df_s.index if df_s.ix[idx]['time'] in range(1379945743843, 1379945743845)]]
Out[13]: 
            time  value
2  1379945743843      3
3  1379945743844      4

This gives the desired result, but it is far too slow on my original data set, which has 209920 rows. I expect the number of rows to increase once I put my code to real use.

Can anyone direct me towards the right approach?

I am using Python 2.7.3 and pandas 0.12.0.

Update:

Jeff's answer worked.

But I find the isin approach simpler, more intuitive and less cluttered. Please comment if anyone has any idea why it failed.

Thanks!

Answered by Jeff

Try it this way:

In [7]:  df_s = pd.DataFrame({'time': range(1379945743841,1379945743850), 'value': [3,4,3,4,3,2,2,3,3]})

Convert your millisecond epoch timestamps to actual datetimes:

In [8]: df_s['time'] = pd.to_datetime(df_s['time'],unit='ms')

In [9]: df_s
Out[9]: 
                        time  value
0 2013-09-23 14:15:43.841000      3
1 2013-09-23 14:15:43.842000      4
2 2013-09-23 14:15:43.843000      3
3 2013-09-23 14:15:43.844000      4
4 2013-09-23 14:15:43.845000      3
5 2013-09-23 14:15:43.846000      2
6 2013-09-23 14:15:43.847000      2
7 2013-09-23 14:15:43.848000      3
8 2013-09-23 14:15:43.849000      3
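
For readers who want to reproduce the conversion step outside an IPython session, here is a minimal sketch. It assumes a current pandas installation rather than 0.12, and the names converted and first are hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Same sample frame as in the question: integer millisecond epoch timestamps.
df_s = pd.DataFrame({
    'time': range(1379945743841, 1379945743850),
    'value': [3, 4, 3, 4, 3, 2, 2, 3, 3],
})

# Interpret the integers as milliseconds since the Unix epoch.
# 'converted' is a hypothetical name; the original column is left untouched.
converted = pd.to_datetime(df_s['time'], unit='ms')

# Inspect the first converted value and the resulting dtype.
first = converted.iloc[0]
print(first)
print(converted.dtype)
```

Keeping the converted values in a separate variable, instead of overwriting df_s['time'], makes it easy to compare the integer and datetime forms side by side.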

These are your converted endpoints:

In [10]: pd.to_datetime(1379945743843,unit='ms')
Out[10]: Timestamp('2013-09-23 14:15:43.843000', tz=None)

In [11]: pd.to_datetime(1379945743845,unit='ms')
Out[11]: Timestamp('2013-09-23 14:15:43.845000', tz=None)

In [12]: df = df_s.set_index('time')

You must use & (not and) and wrap each comparison in parentheses:

In [13]: df_s[(df_s.time>pd.to_datetime(1379945743843,unit='ms')) & (df_s.time<pd.to_datetime(1379945743845,unit='ms'))]
Out[13]: 
                    time  value
3 2013-09-23 14:15:43.844000      4
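
The same boolean-mask technique also applies to the raw integer column, without converting to datetimes first. The sketch below assumes a current pandas installation; the names lo, hi, half_open and closed are hypothetical. Note that Jeff's example uses strict inequalities and so returns only row 3, whereas the half-open mask here mirrors range(lo, hi) from the question.

```python
import pandas as pd

df_s = pd.DataFrame({
    'time': range(1379945743841, 1379945743850),
    'value': [3, 4, 3, 4, 3, 2, 2, 3, 3],
})

# Hypothetical names for the bounds used in the question.
lo, hi = 1379945743843, 1379945743845

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than >= and <.
half_open = df_s[(df_s['time'] >= lo) & (df_s['time'] < hi)]

# Series.between includes both endpoints by default.
closed = df_s[df_s['time'].between(lo, hi)]

print(half_open)  # rows 2 and 3, matching range(lo, hi)
print(closed)     # rows 2, 3 and 4, since hi itself is included
```

Python's and asks for a single truth value of a whole Series, which is what raises the ValueError shown in the question; & operates element by element instead.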

In 0.13 (coming soon), you will be able to do this:

In [7]: df_s.query('"2013-09-23 14:15:43.843" < time < "2013-09-23 14:15:43.845"')
Out[7]: 
                    time  value
3 2013-09-23 14:15:43.844000      4
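
As a variant of the query call above, the sketch below filters the raw integer column and passes the bounds in as local variables with the @ prefix. This is an assumption-laden illustration rather than Jeff's exact code: it assumes a pandas version where DataFrame.query is available (0.13 or later), and lo, hi and result are hypothetical names.

```python
import pandas as pd

df_s = pd.DataFrame({
    'time': range(1379945743841, 1379945743850),
    'value': [3, 4, 3, 4, 3, 2, 2, 3, 3],
})

# Hypothetical names for the bounds; @name refers to a Python variable
# in the calling scope, while bare names refer to columns.
lo, hi = 1379945743843, 1379945743845

# A chained comparison is allowed inside the query string.
result = df_s.query('@lo <= time < @hi')

print(result)
```

The half-open comparison is chosen here to match range(lo, hi) from the question; use strict inequalities on both sides to reproduce the single-row result shown above.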

Your isin approach DOES work. Not sure why it's not working for you.

In [11]: df_s[df_s.time.isin(range(1379945743843,1379945743845))]
Out[11]: 
            time  value
2  1379945743843      3
3  1379945743844      4
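
To check how isin behaves on a given installation, a self-contained script along these lines can be run. It is only a sketch, assuming a current pandas version, and it does not explain the empty result reported in the question; mask and subset are hypothetical names.

```python
import pandas as pd

df_s = pd.DataFrame({
    'time': range(1379945743841, 1379945743850),
    'value': [3, 4, 3, 4, 3, 2, 2, 3, 3],
})

# isin tests each element for membership in the given collection.
# range(a, b) excludes b, so only 1379945743843 and 1379945743844 can match.
mask = df_s['time'].isin(range(1379945743843, 1379945743845))
subset = df_s[mask]

# Printing the dtype helps rule out an unexpected integer width.
print(df_s['time'].dtype)
print(subset)
```

For a wide contiguous range, a pair of comparisons combined with & avoids materializing every candidate value, so isin is best suited to short or non-contiguous sets of values.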