Original source: http://stackoverflow.com/questions/16947336/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Binning a DataFrame in pandas in Python
Question
Given the following DataFrame in pandas:
import numpy as np
import pandas
df = pandas.DataFrame({"a": np.random.random(100), "b": np.random.random(100), "id": np.arange(100)})
where id is an id for each point consisting of an a and b value, how can I bin a and b into a specified set of bins (so that I can then take the median/average value of a and b in each bin)? df might have NaN values for a or b (or both) for any given row. Thanks.
Here's a better example using Joe Kington's solution with a more realistic df. The thing I'm unsure about is how to access the df.b elements for each df.a group below:
a = np.random.random(20)
df = pandas.DataFrame({"a": a, "b": a + 10})
# bins for df.a
bins = np.linspace(0, 1, 10)
# bin df according to a
groups = df.groupby(np.digitize(df.a,bins))
# Get the mean of a in each group
print(groups.mean())
## But how to get the mean of b for each group of a?
# ...
Accepted answer by Joe Kington
There may be a more efficient way (I have a feeling pandas.crosstab would be useful here), but here's how I'd do it:
import numpy as np
import pandas
df = pandas.DataFrame({"a": np.random.random(100),
"b": np.random.random(100),
"id": np.arange(100)})
# Bin the data frame by "a" with 10 bins...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(np.digitize(df.a, bins))
# Get the mean of each bin:
print(groups.mean())  # Also could do "groups.aggregate(np.mean)"
# Similarly, the median:
print(groups.median())
# Apply some arbitrary function to aggregate binned data
print(groups.aggregate(lambda x: np.mean(x[x > 0.5])))
Edit: As the OP was asking specifically for just the means of b binned by the values in a, just do
groups.mean().b
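As a variation on the line above, the column can also be selected before aggregating, so that only b is averaged. This is a minimal sketch that reuses the question's second example; the fixed seed and the names bin_index and b_mean_by_a_bin are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed: an assumption, only for reproducibility
a = rng.random(20)
df = pd.DataFrame({"a": a, "b": a + 10})

# Bin edges for df.a, as in the question
bins = np.linspace(0, 1, 10)
# Integer bin number (1..9 here) for each row of df.a
bin_index = np.digitize(df["a"], bins)

# Select "b" first so that only that column is aggregated per bin of "a"
b_mean_by_a_bin = df.groupby(bin_index)["b"].mean()
print(b_mean_by_a_bin)
```

Both forms should give the same numbers; selecting the column first simply avoids computing means for columns that are not needed.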
Also, if you want the index to look nicer (e.g. display intervals as the index), as in @bdiamante's example, use pandas.cut instead of numpy.digitize. (Kudos to bdiamante. I didn't realize pandas.cut existed.)
import numpy as np
import pandas
df = pandas.DataFrame({"a": np.random.random(100),
"b": np.random.random(100) + 10})
# Bin the data frame by "a" with 10 bins...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(pandas.cut(df.a, bins))
# Get the mean of b, binned by the values in a
print(groups.mean().b)
This results in:
a
(0.00186, 0.111] 10.421839
(0.111, 0.22] 10.427540
(0.22, 0.33] 10.538932
(0.33, 0.439] 10.445085
(0.439, 0.548] 10.313612
(0.548, 0.658] 10.319387
(0.658, 0.767] 10.367444
(0.767, 0.876] 10.469655
(0.876, 0.986] 10.571008
Name: b
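The question mentions that a or b may contain NaN. The sketch below shows how the pandas.cut approach appears to treat such rows, using a tiny hand-made frame; the data, the bin edges, and the names a_bin and b_mean are assumptions chosen for illustration. A row whose a is NaN gets no bin and drops out of the grouping, while a NaN in b is skipped when the mean is taken.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one NaN in each column
df = pd.DataFrame({
    "a": [0.1, 0.2, np.nan, 0.6, 0.9],
    "b": [10.0, np.nan, 12.0, 13.0, 14.0],
})
bins = [0, 0.5, 1]

# The row with a = NaN is assigned no interval
a_bin = pd.cut(df["a"], bins)

# groupby ignores rows whose key is NaN; mean() skips NaN values in "b"
b_mean = df.groupby(a_bin, observed=False)["b"].mean()
print(b_mean)
```

If rows with missing values need to be counted rather than dropped, they have to be handled explicitly before grouping, for example by filling or filtering them.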
Answer by bdiamante
Not 100% sure if this is what you're looking for, but here's what I think you're getting at:
In [144]: df = DataFrame({"a": np.random.random(100), "b": np.random.random(100), "id": np.arange(100)})
In [145]: bins = [0, .25, .5, .75, 1]
In [146]: a_bins = df.a.groupby(cut(df.a,bins))
In [147]: b_bins = df.b.groupby(cut(df.b,bins))
In [148]: a_bins.agg([mean,median])
Out[148]:
mean median
a
(0, 0.25] 0.124173 0.114613
(0.25, 0.5] 0.367703 0.358866
(0.5, 0.75] 0.624251 0.626730
(0.75, 1] 0.875395 0.869843
In [149]: b_bins.agg([mean,median])
Out[149]:
mean median
b
(0, 0.25] 0.147936 0.166900
(0.25, 0.5] 0.394918 0.386729
(0.5, 0.75] 0.636111 0.655247
(0.75, 1] 0.851227 0.838805
Of course, I don't know what bins you had in mind, so you'll have to swap mine out for your circumstance.
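The session above relies on names such as DataFrame, cut, mean, and median already being imported into the interactive namespace. A self-contained sketch of the same idea follows, assuming a seeded generator and passing the aggregations by name; the seed and the names a_stats and b_stats are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed: an assumption, only for reproducibility
df = pd.DataFrame({
    "a": rng.random(100),
    "b": rng.random(100),
    "id": np.arange(100),
})
bins = [0, 0.25, 0.5, 0.75, 1]

# Bin each column by its own values, then compute mean and median per bin
a_stats = df["a"].groupby(pd.cut(df["a"], bins), observed=False).agg(["mean", "median"])
b_stats = df["b"].groupby(pd.cut(df["b"], bins), observed=False).agg(["mean", "median"])

print(a_stats)
print(b_stats)
```

The numbers will differ from the session above because the random data differ.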
Answer by Perk
Joe Kington's answer was very helpful. However, I noticed that its pandas.cut version does not bin all of the data: it leaves out the row with a = a.min(), because the intervals are open on the left. Summing up groups.size() gave 99 instead of 100.
To guarantee that all data is binned, just pass the number of bins to cut(); the function will then automatically pad the first (or last) bin by 0.1% of the data range so that all data is included.
import numpy as np
import pandas
df = pandas.DataFrame({"a": np.random.random(100),
"b": np.random.random(100) + 10})
# Bin the data frame by "a" with 10 bins...
groups = df.groupby(pandas.cut(df.a, 10))
# Get the mean of b, binned by the values in a
print(groups.mean().b)
In this case, summing up groups.size() gave 100.
I know this is a picky point for this particular problem, but for a similar problem I was trying to solve, it was crucial to obtain the correct answer.
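The difference can be checked directly by counting how many rows receive a bin. The sketch below compares explicit edges, explicit edges with include_lowest=True (another pandas.cut option, not mentioned in the answer), and an integer bin count; the seed and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed: an assumption, only for reproducibility
df = pd.DataFrame({"a": rng.random(100), "b": rng.random(100) + 10})
edges = np.linspace(df["a"].min(), df["a"].max(), 10)

# Explicit edges: the first interval is (min, ...], so the minimum itself is excluded
explicit = pd.cut(df["a"], edges)
# include_lowest=True makes the first interval closed on the left as well
closed = pd.cut(df["a"], edges, include_lowest=True)
# An integer bin count lets pandas extend the range slightly to cover every value
auto = pd.cut(df["a"], 10)

# Number of rows that were assigned a bin in each case
print(explicit.notna().sum(), closed.notna().sum(), auto.notna().sum())
# 99 100 100 (assuming the minimum of "a" is unique)
```

Passing an integer is the simplest fix; include_lowest=True may be preferable when the bin edges themselves must stay exactly as specified.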
Answer by bio
If you do not have to stick to pandas grouping, you could use scipy.stats.binned_statistic:
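from scipy.stats import binned_statistic
# binned_statistic returns (statistic, bin_edges, binnumber); keep only the per-bin means of b
means = binned_statistic(df.a, df.b, bins=np.linspace(df.a.min(), df.a.max(), 10)).statistic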
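For a fuller picture of what binned_statistic returns, here is a minimal sketch under assumed inputs (the seeded data and the names edges, result, and means are illustrative). The result carries the per-bin statistic, the bin edges, and the bin number of each input value; wrapping the statistic in a Series with interval labels is an optional convenience, not something the answer does.

```python
import numpy as np
import pandas as pd
from scipy.stats import binned_statistic

rng = np.random.default_rng(0)  # fixed seed: an assumption, only for reproducibility
df = pd.DataFrame({"a": rng.random(100), "b": rng.random(100) + 10})
edges = np.linspace(df["a"].min(), df["a"].max(), 10)

# Mean of b within each bin of a; "mean" is also the default statistic
result = binned_statistic(df["a"].to_numpy(), df["b"].to_numpy(),
                          statistic="mean", bins=edges)

# Label the 9 bins with intervals. Bins are closed on the left, except that
# the last bin also includes its right edge.
labels = pd.IntervalIndex.from_breaks(edges, closed="left")
means = pd.Series(result.statistic, index=labels, name="b")
print(means)
```

Swapping statistic="mean" for "median" gives the per-bin medians the question also asks about.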

