Binning a DataFrame in pandas in Python

Question

Given the following DataFrame in pandas:

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100), "b": np.random.random(100), "id": np.arange(100)})

where id is an id for each point consisting of an a and b value, how can I bin a and b into a specified set of bins (so that I can then take the median/average value of a and b in each bin)? df might have NaN values for a or b (or both) for any given row in df. Thanks.

Here's a better example using Joe Kington's solution with a more realistic df. The thing I'm unsure about is how to access the df.b elements for each df.a group below:

a = np.random.random(20)
df = pandas.DataFrame({"a": a, "b": a + 10})
# bins for df.a
bins = np.linspace(0, 1, 10)
# bin df according to a
groups = df.groupby(np.digitize(df.a,bins))
# Get the mean of a in each group
print(groups.mean())
## But how to get the mean of b for each group of a?
# ...

Answer 1

Accepted answer by Joe Kington

There may be a more efficient way (I have a feeling pandas.crosstab would be useful here), but here's how I'd do it:

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100),
                       "id": np.arange(100)})

# Bin the data frame by "a" using 10 evenly spaced bin edges...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(np.digitize(df.a, bins))

# Get the mean of each bin:
print(groups.mean())  # Also could do "groups.aggregate(np.mean)"

# Similarly, the median:
print(groups.median())

# Apply some arbitrary function to aggregate binned data
print(groups.aggregate(lambda x: np.mean(x[x > 0.5])))
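
As a side note on how the group labels above come about, here is a minimal sketch of what np.digitize returns. The arrays `edges` and `values` are made up for illustration and are not part of the original answer. With the default right=False, each value gets the 1-based index of the bin it falls into, and a value equal to the last edge gets the index len(edges), so the maximum of a ends up in a group of its own when the edges run from a.min() to a.max().

```python
import numpy as np

# Hypothetical bin edges and sample values, chosen so the result is easy to check by hand.
edges = np.array([0.0, 0.5, 1.0])
values = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

# With right=False (the default), label i means edges[i-1] <= value < edges[i].
labels = np.digitize(values, edges)
print(labels)  # [1 1 2 2 3]; the value equal to the last edge gets len(edges)
```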

Edit: As the OP was asking specifically for just the means of b binned by the values in a, just do

groups.mean().b

Also, if you want the index to look nicer (e.g. display intervals as the index), as in @bdiamante's example, use pandas.cut instead of numpy.digitize. (Kudos to bdiamante. I didn't realize pandas.cut existed.)

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100) + 10})

# Bin the data frame by "a" using 10 bin edges (9 intervals)...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(pandas.cut(df.a, bins))

# Get the mean of b, binned by the values in a
print(groups.mean().b)

This results in:

a
(0.00186, 0.111]    10.421839
(0.111, 0.22]       10.427540
(0.22, 0.33]        10.538932
(0.33, 0.439]       10.445085
(0.439, 0.548]      10.313612
(0.548, 0.658]      10.319387
(0.658, 0.767]      10.367444
(0.767, 0.876]      10.469655
(0.876, 0.986]      10.571008
Name: b
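
The question also mentions that a or b may contain NaN. The following is a small sketch of how that case behaves, using a tiny hand-written frame (the values and the `bins` list are assumptions for illustration only). pandas.cut assigns no bin to a NaN in a, so groupby drops that row, and the mean/median aggregations skip NaN values in b. The observed=False argument is passed explicitly only to keep the behaviour the same across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one NaN in "a" and one NaN in "b".
df = pd.DataFrame({
    "a": [0.1, 0.2, np.nan, 0.6, 0.7, 0.9],
    "b": [10.0, np.nan, 12.0, 13.0, 15.0, 17.0],
})
bins = [0.0, 0.5, 1.0]

# Rows whose "a" is NaN get no bin and are left out of every group.
binned_b = df.groupby(pd.cut(df["a"], bins), observed=False)["b"]

# count/mean/median ignore NaN values of "b" inside each bin.
summary = binned_b.agg(["count", "mean", "median"])
print(summary)
```

Here the first bin ends up with a single usable b value and the second with three, so the row with a missing a and the missing b value are both ignored rather than raising an error.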

Answer 2

Answer by bdiamante

Not 100% sure if this is what you're looking for, but here's what I think you're getting at:

In [144]: df = DataFrame({"a": np.random.random(100), "b": np.random.random(100), "id": np.arange(100)})

In [145]: bins = [0, .25, .5, .75, 1]

In [146]: a_bins = df.a.groupby(cut(df.a,bins))

In [147]: b_bins = df.b.groupby(cut(df.b,bins))

In [148]: a_bins.agg([mean,median])
Out[148]:
                 mean    median
a
(0, 0.25]    0.124173  0.114613
(0.25, 0.5]  0.367703  0.358866
(0.5, 0.75]  0.624251  0.626730
(0.75, 1]    0.875395  0.869843

In [149]: b_bins.agg([mean,median])
Out[149]:
                 mean    median
b
(0, 0.25]    0.147936  0.166900
(0.25, 0.5]  0.394918  0.386729
(0.5, 0.75]  0.636111  0.655247
(0.75, 1]    0.851227  0.838805

Of course, I don't know what bins you had in mind, so you'll have to swap mine out for your circumstance.
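
The session above relies on names such as DataFrame, cut, mean and median already being imported into the interactive namespace. A self-contained sketch of the same idea, assuming a recent pandas and using string aggregation names instead of the bare functions, might look like this (the seed and variable names `a_stats` and `b_stats` are illustrative choices, not from the original answer):

```python
import numpy as np
import pandas as pd

# Seeded random data so the run is repeatable; the seed itself is arbitrary.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.random(100),
                   "b": rng.random(100),
                   "id": np.arange(100)})

bins = [0, 0.25, 0.5, 0.75, 1]

# Bin each column by its own values, then take the mean and median per bin.
a_stats = df["a"].groupby(pd.cut(df["a"], bins), observed=False).agg(["mean", "median"])
b_stats = df["b"].groupby(pd.cut(df["b"], bins), observed=False).agg(["mean", "median"])

print(a_stats)
print(b_stats)
```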

Answer 3

Answer by Perk

Joe Kington's answer was very helpful; however, I noticed that it does not bin all of the data. It actually leaves out the row with a = a.min(). Summing up groups.size() gave 99 instead of 100.

To guarantee that all data is binned, just pass the number of bins to cut(); the function will then automatically pad the first [last] bin by 0.1% to ensure all data is included.

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100) + 10})

# Bin the data frame by "a" with 10 bins...
groups = df.groupby(pandas.cut(df.a, 10))

# Get the mean of b, binned by the values in a
print(groups.mean().b)

In this case, summing up groups.size() gave 100.

I know this is a picky point for this particular problem, but for a similar problem I was trying to solve, it was crucial to obtain the correct answer.
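
To make the difference concrete, here is a rough sketch that counts the binned rows three ways. It assumes seeded random data with a unique minimum, and the names `n_edges`, `n_lowest` and `n_count` are illustrative. Besides passing an integer bin count, pandas.cut also accepts include_lowest=True, which closes the first interval on the left when explicit edges are used. Note that 10 edges from np.linspace give 9 intervals, whereas cut(df.a, 10) gives 10.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed for repeatability
df = pd.DataFrame({"a": rng.random(100), "b": rng.random(100) + 10})

# Explicit edges from min to max: intervals are (left, right], so the minimum is excluded.
edges = np.linspace(df["a"].min(), df["a"].max(), 10)
n_edges = df.groupby(pd.cut(df["a"], edges), observed=False).size().sum()

# Same edges, but with the first interval closed on the left.
n_lowest = df.groupby(pd.cut(df["a"], edges, include_lowest=True),
                      observed=False).size().sum()

# Integer bin count: cut widens the range slightly so every value is covered.
n_count = df.groupby(pd.cut(df["a"], 10), observed=False).size().sum()

print(n_edges, n_lowest, n_count)
```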

Answer 4

Answer by bio

If you do not have to stick to pandas grouping, you could use scipy.stats.binned_statistic:

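import numpy as np
from scipy.stats import binned_statistic

# binned_statistic returns (statistic, bin_edges, binnumber); the default statistic is the mean
means = binned_statistic(df.a, df.b, bins=np.linspace(min(df.a), max(df.a), 10)).statistic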
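
A slightly fuller sketch of the same approach follows, assuming a small deterministic frame made up for illustration. It drops rows with NaN before binning, since binned_statistic works on plain arrays and does not skip missing values the way the pandas aggregations do, and it asks for the median rather than the default mean.

```python
import numpy as np
import pandas as pd
from scipy.stats import binned_statistic

# Hypothetical data: b is a shifted copy of a, with one missing value.
a = np.linspace(0.0, 1.0, 20)
df = pd.DataFrame({"a": a, "b": a + 10})
df.loc[3, "b"] = np.nan

# Remove incomplete rows before handing the arrays to SciPy.
clean = df.dropna(subset=["a", "b"])

# 10 edges define 9 bins; the last bin includes its right edge.
edges = np.linspace(clean["a"].min(), clean["a"].max(), 10)
result = binned_statistic(clean["a"].to_numpy(), clean["b"].to_numpy(),
                          statistic="median", bins=edges)

print(result.statistic)   # one median of b per bin
print(result.bin_edges)   # the edges that were used
print(result.binnumber)   # 1-based bin index for each remaining row
```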
