Python Pandas: return a DataFrame where the value count is above a set number

Note: this page is a Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/43945653/

Date: 2020-09-14 03:35:36  Source: igfitidea

Python Pandas return DataFrame where value count is above a set number

Tags: python, pandas

Asked by Emac

I have a Pandas DataFrame, and I want to return the DataFrame only if that Customer Number occurs more than a set number of times.

Here is a sample of the DataFrame:

114  2017-04-26      1       7507       34      13
115  2017-04-26      3      77314       41      14
116  2017-04-27      7       4525      190     315
117  2017-04-27      7       5525       67      94
118  2017-04-27      1       6525       43     378
119  2017-04-27      3       7415       38      27
120  2017-04-27      2       7613       47      10
121  2017-04-27      2      77314        9       3
122  2017-04-28      1        227       17       4
123  2017-04-28      8       4525      205     341
124  2017-04-28      1       7415       31      20
125  2017-04-28      2      77314        8       2

And here is whether each customer occurs more than 5 times, using this code:

print(zip_data_df['Customers'].value_counts()>5)

7415      True
4525      True
5525      True
77314     True
6525      True
4111      True
227       True
206      False
7507     False
7613     False
4108     False
3046     False
2605     False
4139     False
4119     False

Now I expected that if I did this:

print(zip_data_df[zip_data_df['Customers'].value_counts()>5])

It would show me the whole DataFrame for customers that occur more than 5 times, but I got a Boolean error. I realize why it gives me an error now: one DataFrame is just telling me if that customer number occurs more than 5 times or not, and the other is showing me every time that customer number occurs. They don't match in length. But how do I get it so the dataframe will only return records where that customer occurs more than 5 times?

I'm sure there is some simple answer I'm overlooking, but I appreciate any help you can give me.
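The length mismatch described above can be reproduced on a small frame. The following is a minimal sketch, assuming toy data (only the 'Customers' column name comes from the question) and a lower threshold, `min_count`, standing in for 5. It also shows one possible fix that is not taken from the answers: mapping each row's customer number to its count with `Series.map`, which gives a mask aligned with the rows.

```python
import pandas as pd

# Hypothetical toy data; only the 'Customers' column name comes from the question.
zip_data_df = pd.DataFrame({"Customers": [4525, 4525, 4525, 7415, 7415, 227]})
min_count = 2  # stand-in for the threshold of 5 used in the question

counts = zip_data_df["Customers"].value_counts()

# This mask has one entry per distinct customer and is indexed by customer number,
# so it cannot be used directly to select rows of zip_data_df.
mask_by_customer = counts > min_count
print(len(mask_by_customer), "mask entries vs", len(zip_data_df), "rows")

# Looking up each row's customer in `counts` yields one count per row,
# so the comparison produces a mask that lines up with the DataFrame.
row_mask = zip_data_df["Customers"].map(counts) > min_count
result = zip_data_df[row_mask]
print(result)
```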

Answered by abe

So the issue here is indexing: value_counts() returns a Series indexed by the values of 'Customers', while zip_data_df seems to be indexed by something else. You can do something like:

cust_counts = zip_data_df['Customers'].value_counts().rename('cust_counts')

zip_data_df = zip_data_df.merge(cust_counts.to_frame(),
                                left_on='Customers',
                                right_index=True)

From there, you can select conditionally from zip_data_df like so:

zip_data_df[zip_data_df.cust_counts > 5]
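For reference, here is a self-contained sketch of the merge approach, assuming toy data: the 'Sales' column and the `min_count` threshold are hypothetical, and only 'Customers' and `cust_counts` come from the thread. The merged result is kept in a separate variable so the original frame is left unchanged.

```python
import pandas as pd

# Hypothetical toy data; 'Sales' is an invented column for illustration.
zip_data_df = pd.DataFrame({
    "Customers": [4525, 4525, 4525, 7415, 7415, 227],
    "Sales": [190, 205, 67, 38, 31, 17],
})
min_count = 2  # stand-in for the threshold of 5 used in the question

# One count per distinct customer, indexed by customer number.
cust_counts = zip_data_df["Customers"].value_counts().rename("cust_counts")

# Attach each row's customer count as a new column by joining on the counts' index.
merged = zip_data_df.merge(cust_counts.to_frame(),
                           left_on="Customers",
                           right_index=True)

# Now the count lives on every row, so an ordinary boolean mask works.
frequent = merged[merged["cust_counts"] > min_count]
print(frequent)
```

One consequence of this design is that the count stays in the result as an extra column, which may or may not be wanted.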

Answered by DrTRD

I believe what you're looking for is:

zip_data_df['Customers'].value_counts()[zip_data_df['Customers'].value_counts()>5]
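Note that this expression returns a Series of counts for the frequent customers rather than the matching rows of the DataFrame. A small sketch, assuming toy data and a hypothetical `min_count` threshold, shows what it produces and how its index could be used to select the rows, if that is what is wanted:

```python
import pandas as pd

# Hypothetical toy data; only the 'Customers' column name comes from the question.
zip_data_df = pd.DataFrame({"Customers": [4525, 4525, 4525, 7415, 7415, 227]})
min_count = 2  # stand-in for the threshold of 5 used in the question

# Compute the counts once and reuse them instead of calling value_counts() twice.
counts = zip_data_df["Customers"].value_counts()

# A Series mapping customer number -> count, restricted to frequent customers.
frequent_counts = counts[counts > min_count]
print(frequent_counts)

# The index of that Series holds the frequent customer numbers,
# which can be used to pick the corresponding rows.
frequent_rows = zip_data_df[zip_data_df["Customers"].isin(frequent_counts.index)]
print(frequent_rows)
```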

Answered by Zak Keirn

I had a similar problem and solved it this way.

cust_counts = zip_data_df['Customers'].value_counts()
cust_list = cust_counts[cust_counts > 5].index.tolist()
zip_data_df = zip_data_df[zip_data_df['Customers'].isin(cust_list)]

Answered by Tom Wattley

You can get the job done with a handy little groupby transform here:

subset_customers_df = zip_data_df[
    zip_data_df.groupby('Customers')['Customers'].transform('size') > 5
]

That works here with pandas 0.25.3.
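To see why the transform works as a row mask, it helps to look at the intermediate result. This is a minimal sketch on toy data, with a hypothetical `min_count` threshold in place of 5: `transform('size')` returns one value per original row (the size of that row's group), so it has the same index as the DataFrame.

```python
import pandas as pd

# Hypothetical toy data; only the 'Customers' column name comes from the question.
zip_data_df = pd.DataFrame({"Customers": [4525, 4525, 4525, 7415, 7415, 227]})
min_count = 2  # stand-in for the threshold of 5 used in the question

# One value per row: how many rows share that row's customer number.
group_sizes = zip_data_df.groupby("Customers")["Customers"].transform("size")
print(group_sizes.tolist())

# Same index as zip_data_df, so the comparison can be used directly as a mask.
subset_customers_df = zip_data_df[group_sizes > min_count]
print(subset_customers_df)
```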

Answered by Koray Tugay

Haven't tried it, but this should work:

cust_by_size = zip_data_df.groupby("Customers").size()
cust_index_gt_5 = cust_by_size.index[cust_by_size > 5]
zip_data_cust_index_gt_5 = zip_data_df[zip_data_df["Customers"].isin(cust_index_gt_5)]
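A related option that none of the answers use is `GroupBy.filter`, which keeps or drops whole groups based on a predicate. The sketch below assumes toy data and a hypothetical `min_count` threshold; it is offered as an alternative to compare against, not as the method from the thread.

```python
import pandas as pd

# Hypothetical toy data; only the 'Customers' column name comes from the question.
zip_data_df = pd.DataFrame({"Customers": [4525, 4525, 4525, 7415, 7415, 227]})
min_count = 2  # stand-in for the threshold of 5 used in the question

# The lambda receives each customer's sub-DataFrame; groups for which it
# returns True are kept in full, all others are dropped.
frequent_rows = zip_data_df.groupby("Customers").filter(
    lambda group: len(group) > min_count
)
print(frequent_rows)
```

Because the predicate is a Python function called once per group, this may be slower than the `isin` or `transform` approaches when there are many distinct customers.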