Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/40171498/

Date: 2020-09-14 02:15:37  Source: igfitidea

Is there a query method or similar for pandas Series (pandas.Series.query())?

Tags: python, pandas, dataframe, series, method-chaining

Asked by dmeu

The pandas.DataFrame.query() method is very useful for (pre/post)-filtering data when loading or plotting. It is particularly handy for method chaining.

I often find myself wanting to apply the same logic to a pandas.Series, e.g. after calling a method such as df.value_counts, which returns a pandas.Series.

Example

Let's assume there is a huge table with the columns Player, Game, Points, and I want to plot a histogram of the players who scored 3 points more than 14 times. I first have to sum the points of each player (groupby -> agg), which returns a Series of ~1000 players and their overall points. Applying the .query logic, it would look something like this:

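import random
import pandas as pd

df = pd.DataFrame({
    'Points': [random.choice([1,3]) for x in range(100)],
    'Player': [random.choice(["A","B","C"]) for x in range(100)]})

# Desired usage: Series.query() does not exist, so this chain does not run
(df
     .query("Points == 3")
     .Player.value_counts()
     .query("> 14")
     .hist())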

The only solutions I have found force me to make an unnecessary assignment and break the method chain:

points_series = (df
     .query("Points == 3")
     .groupby("Player").size())
points_series[points_series > 100].hist()

Method chaining and the query method both help keep the code legible, whereas subsetting-filtering can get messy quite quickly.

# just to make my point :)
series_bestplayers_under_100[series_prefiltered_under_100 > 0].shape

Please help me out of my dilemma! Thanks.

Accepted answer by jezrael

IIUC, you can add query("Points > 100"):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Points':[50,20,38,90,0, np.inf],
                   'Player':['a','a','a','s','s','s']})

print (df)
  Player     Points
0      a  50.000000
1      a  20.000000
2      a  38.000000
3      s  90.000000
4      s   0.000000
5      s        inf

points_series = df.query("Points < inf").groupby("Player").agg({"Points": "sum"})['Points']
print (points_series)     
a = points_series[points_series > 100]
print (a)     
Player
a    108.0
Name: Points, dtype: float64


# Parentheses are needed so the chain can span several lines
points_series = (df.query("Points < inf")
                   .groupby("Player")
                   .agg({"Points": "sum"})
                   .query("Points > 100"))

print (points_series)     
        Points
Player        
a        108.0


Another solution is selection by callable:

# .loc accepts a callable that receives the Series and returns a boolean mask
points_series = (df.query("Points < inf")
                   .groupby("Player")
                   .agg({"Points": "sum"})['Points']
                   .loc[lambda x: x > 100])

print (points_series)     
Player
a    108.0
Name: Points, dtype: float64


Edit, in response to the edited question:

np.random.seed(1234)
df = pd.DataFrame({
    'Points': [np.random.choice([1,3]) for x in range(100)], 
    'Player': [np.random.choice(["A","B","C"]) for x in range(100)]})

print (df.query("Points == 3").Player.value_counts().loc[lambda x: x > 15])
C    19
B    16
Name: Player, dtype: int64

print (df.query("Points == 3").groupby("Player").size().loc[lambda x: x > 15])
Player
B    16
C    19
dtype: int64
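
As a small deterministic sketch of the same idea, the callable passed to .loc can return any boolean mask, so several conditions can be combined with & inside one step of the chain. The data and the thresholds below are made up for illustration and are not from the original question.

```python
import pandas as pd

# Illustrative data (an assumption, not the question's random table)
df = pd.DataFrame({
    "Player": list("AAAABBBCC"),
    "Points": [3, 3, 3, 1, 3, 3, 1, 3, 1],
})

three_pointers = (
    df.query("Points == 3")          # keep only the 3-point rows
      .Player.value_counts()         # Series: number of 3-pointers per player
      .loc[lambda s: (s >= 2) & (s < 3)]  # filter the Series without leaving the chain
)

print(three_pointers)
```

Each comparison needs its own parentheses because & binds more tightly than >= and <.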

Answer by Martin

Why not convert from Series to DataFrame, do the querying, and then convert back?

df["Points"] = df["Points"].to_frame().query('Points > 100')["Points"]

Here, .to_frame() converts the Series to a DataFrame, while the trailing ["Points"] converts the result back to a Series.

The .query() method can then be used consistently, whether the pandas object has one column or several.
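
The round trip can also sit inside a method chain when the Series has no name yet, as with the result of groupby(...).size(). The following is a minimal sketch under that assumption; the column name "n" and the sample data are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Illustrative data (an assumption, not from the original answer)
df = pd.DataFrame({
    "Player": ["A", "A", "B", "B", "B", "C"],
    "Points": [3, 3, 3, 1, 3, 3],
})

counts = (
    df.query("Points == 3")
      .groupby("Player")
      .size()               # unnamed Series indexed by Player
      .to_frame("n")        # give it a column name so query() can refer to it
      .query("n >= 2")["n"] # filter, then select the column to get a Series back
)

print(counts)
```

Passing a name to to_frame() is what makes the query string possible here, since an unnamed Series would otherwise become a column labelled 0.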

Answer by Ilya Prokin

Instead of query, you can use pipe:

s.pipe(lambda x: x[x>0]).pipe(lambda x: x[x<10])
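
If the same filter is needed in several places, the lambda can be factored into a small named function, since pipe forwards extra positional arguments to it. This is only a sketch: keep_where is a hypothetical helper, not part of pandas, and the sample values are made up.

```python
import pandas as pd


def keep_where(s, predicate):
    """Hypothetical helper: keep the elements of s for which predicate(s) is True."""
    return s[predicate(s)]


s = pd.Series([-5, 0, 3, 7, 12, 20])

# pipe passes the Series as the first argument and the lambda as the second
result = (
    s.pipe(keep_where, lambda x: x > 0)
     .pipe(keep_where, lambda x: x < 10)
)

print(result.tolist())
```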