Subsetting a Python DataFrame

Notice: this page is a side-by-side Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/19237878/

Time: 2020-08-19 13:13:27  Source: igfitidea

subsetting a Python DataFrame

python pandas subset

Asked by user1717931

I am transitioning from R to Python. I just began using Pandas. I have an R code that subsets nicely:

k1 <- subset(data, Product == p.id & Month < mn & Year == yr, select = c(Time, Product))

Now I want to do something similar in Python. This is what I have so far:

import pandas as pd
data = pd.read_csv("../data/monthly_prod_sales.csv")


# first, index the dataset by Product, and get everything that matches a given 'p.id' and time.
data.set_index('Product')
k = data.ix[[p.id, 'Time']]

# then, index this subset with Time and do more subsetting..

I am beginning to feel that I am doing this the wrong way. Perhaps there is an elegant solution. Can anyone help? I need to extract the month and year from the timestamp I have and then subset on them. Perhaps there is a one-liner that will accomplish all this:

k1 <- subset(data, Product == p.id & Time >= start_time & Time < end_time, select = c(Time, Product))

Thanks.

Accepted answer by Phillip Cloud

I'll assume that Time and Product are columns in a DataFrame, df is an instance of DataFrame, and that the other variables are scalar values:

For now, you'll have to reference the DataFrame instance:

k1 = df.loc[(df.Product == p_id) & (df.Time >= start_time) & (df.Time < end_time), ['Time', 'Product']]
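As a minimal sketch of the .loc call above, the following builds a small made-up DataFrame and selects the matching rows and the two columns in one step. The sample values, the Sales column, and the concrete values of p_id, start_time and end_time are assumptions for illustration, not the asker's data.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's sales file
df = pd.DataFrame({
    "Product": ["A", "A", "B", "A"],
    "Time": pd.to_datetime(["2013-01-15", "2013-03-10", "2013-03-12", "2013-07-01"]),
    "Sales": [10, 20, 30, 40],
})

# Assumed scalar filter values
p_id = "A"
start_time = pd.Timestamp("2013-01-01")
end_time = pd.Timestamp("2013-06-01")

# The first argument to .loc is the boolean row mask, the second the column list
mask = (df.Product == p_id) & (df.Time >= start_time) & (df.Time < end_time)
k1 = df.loc[mask, ["Time", "Product"]]
print(k1)
```

Naming the mask separately is optional; it only keeps the .loc line short.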

The parentheses are necessary because of the precedence of the & operator relative to the comparison operators. The & operator is an overloaded bitwise operator, and in Python bitwise operators bind more tightly than comparison operators, so without parentheses an expression like df.Product == p_id & df.Time >= start_time would be grouped as df.Product == (p_id & df.Time) >= start_time.
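The effect of leaving the parentheses out can be seen in a small sketch. The Series s and the bounds used here are made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# With parentheses: two boolean Series are combined element-wise
ok = (s > 1) & (s < 4)
print(ok.tolist())

# Without parentheses Python evaluates `1 & s` first and then the chained
# comparison `s > (1 & s) < 4`. A chained comparison needs the truth value
# of a whole Series, which pandas refuses to give, so ValueError is raised.
raised = False
try:
    s > 1 & s < 4
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```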

In pandas 0.13 a new experimental DataFrame.query() method will be available. It's extremely similar to R's subset, modulo the select argument:

With query() you'd do it like this:

df.query('Product == @p_id and Month < @mn and Year == @yr')[['Time', 'Product']]
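The question also asks how to get the month and year out of a timestamp. One possible sketch, assuming Time is a datetime column, derives Month and Year with the .dt accessor and then filters with query(). In released pandas versions, local Python variables are referenced inside the query string with an @ prefix, and the filter has to run before the column selection because Month and Year are not among the selected columns. The sample data and the values of p_id, mn and yr are made up.

```python
import pandas as pd

# Hypothetical sample data; Time is parsed as datetime64
df = pd.DataFrame({
    "Product": ["A", "A", "B", "A"],
    "Time": pd.to_datetime(["2013-01-15", "2013-03-10", "2013-03-12", "2014-02-01"]),
})

# Derive the Month and Year columns from the timestamp column
df["Month"] = df["Time"].dt.month
df["Year"] = df["Time"].dt.year

p_id, mn, yr = "A", 3, 2013  # assumed scalar values

# Filter on all columns first, then keep only the columns of interest
k1 = df.query("Product == @p_id and Month < @mn and Year == @yr")[["Time", "Product"]]
print(k1)
```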

Here's a simple example:

这是一个简单的例子:

In [9]: df = DataFrame({'gender': np.random.choice(['m', 'f'], size=10), 'price': poisson(100, size=10)})

In [10]: df
Out[10]:
  gender  price
0      m     89
1      f    123
2      f    100
3      m    104
4      m     98
5      m    103
6      f    100
7      f    109
8      f     95
9      m     87

In [11]: df.query('gender == "m" and price < 100')
Out[11]:
  gender  price
0      m     89
4      m     98
9      m     87

The final query that you're interested in will even be able to take advantage of chained comparisons, like this:

k1 = df[['Time', 'Product']].query('Product == @p_id and @start_time <= Time < @end_time')
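To illustrate the chained comparison on its own, here is a small sketch on a numeric price column like the one in the simple example. The values and the bounds low and high are assumptions. It checks that the chained form inside query() selects the same rows as the explicit two-mask form.

```python
import pandas as pd

df = pd.DataFrame({"price": [89, 123, 100, 104, 98]})
low, high = 95, 105  # assumed bounds, referenced with @ inside the query string

# Chained comparison inside query()
chained = df.query("@low <= price < @high")

# Equivalent explicit boolean masks
masked = df[(df.price >= low) & (df.price < high)]

print(chained.equals(masked))
```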

Answer by sernle

Just for someone looking for a solution more similar to R:

df[(df.Product == p_id) & (df.Time > start_time) & (df.Time < end_time)][['Time', 'Product']]

No need for data.loc or query, but I do think it is a bit long.

Answer by gpicard

I've found that you can apply any subset condition on a given column by wrapping it in []. For instance, suppose you have a df with columns ['Product', 'Time', 'Year', 'Color'].

And let's say you want to include products made before 2014. You could write,

df[df['Year'] < 2014]

This returns all the rows where the condition holds. You can combine it with further conditions:

df[(df['Year'] < 2014) & (df['Color'] == 'Red')]

Then just choose the columns you want as directed above. For instance, the Product and Color columns for the df above:

df[(df['Year'] < 2014) & (df['Color'] == 'Red')][['Product', 'Color']]
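Bracket placement matters in these conditions: the comparison has to sit outside the column lookup, as in df['Color'] == 'Red'. A small sketch with made-up rows shows the combined filter, and what happens when the comparison ends up inside the brackets.

```python
import pandas as pd

# Hypothetical rows for the four columns named above
df = pd.DataFrame({
    "Product": ["chair", "table", "lamp", "desk"],
    "Time": [5, 7, 3, 9],
    "Year": [2012, 2015, 2013, 2010],
    "Color": ["Red", "Red", "Blue", "Red"],
})

# Both conditions combined into one row mask, then the column selection
result = df[(df["Year"] < 2014) & (df["Color"] == "Red")][["Product", "Color"]]
print(result)

# 'Color' == 'Red' is a plain string comparison that evaluates to False,
# so df['Color' == 'Red'] looks up a column named False and fails.
misplaced_failed = False
try:
    df["Color" == "Red"]
except KeyError:
    misplaced_failed = True
print(misplaced_failed)
```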

Answer by Santosh Vutukuri

Creating an Empty Dataframe with known Column Name:

Names = ['Col1','ActivityID','TransactionID']
df = pd.DataFrame(columns = Names)

Creating a dataframe from csv:

df = pd.read_csv('...../file_name.csv')

Creating a dynamic filter to subset a dataframe:

i = 12
df[df['ActivityID'] <= i]

Creating a dynamic filter to subset the required columns of a dataframe:

df[df['ActivityID'] == i][['TransactionID','ActivityID']]
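If the same dynamic filter is needed repeatedly, it could be wrapped in a small function. The helper name subset_by_activity and the sample rows below are hypothetical, introduced only for this sketch.

```python
import pandas as pd

def subset_by_activity(df, max_id, columns=None):
    """Hypothetical helper: rows with ActivityID <= max_id, optionally limited to `columns`."""
    rows = df["ActivityID"] <= max_id
    if columns is None:
        return df.loc[rows]
    return df.loc[rows, columns]

# Made-up rows using the column names from the answer
df = pd.DataFrame({
    "Col1": ["x", "y", "z"],
    "ActivityID": [10, 12, 15],
    "TransactionID": [1001, 1002, 1003],
})

i = 12
all_cols = subset_by_activity(df, i)
some_cols = subset_by_activity(df, i, ["TransactionID", "ActivityID"])
print(all_cols)
print(some_cols)
```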

Answer by miraculixx

Regarding some points mentioned in previous answers, and to improve readability:

No need for data.loc or query, but I do think it is a bit long.

The parentheses are also necessary, because of the precedence of the & operator vs. the comparison operators.

I like to write such expressions as follows: fewer brackets, faster to type, easier to read. Closer to R, too.

# c is just a convenience: it splits a comma-separated string into a list of column names
c = lambda v: v.split(',')

q_product = df.Product == p_id
q_start = df.Time > start_time
q_end = df.Time < end_time

df.loc[q_product & q_start & q_end, c('Time,Product')]