pandas Python:带熊猫的加权中值算法

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/26102867/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-13 22:31:13  来源:igfitidea点击:

Python: weighted median algorithm with pandas

pythonalgorithmpandas

提问by svenkatesh

I have a dataframe that looks like this:

我有一个看起来像这样的数据框:

Out[14]:
    impwealth  indweight
16     180000     34.200
21     384000     37.800
26     342000     39.715
30    1154000     44.375
31     421300     44.375
32    1210000     45.295
33    1062500     45.295
34    1878000     46.653
35     876000     46.653
36     925000     53.476

I want to calculate the weighted median of the column impwealthusing the frequency weights in indweight. My pseudo code looks like this:

我想impwealth使用 中的频率权重计算列的加权中位数indweight。我的伪代码如下所示:

# Sort `impwealth` in ascending order 
df.sort('impwealth', 'inplace'=True)

# Find the 50th percentile weight, P
P = df['indweight'].sum() * (.5)

# Search for the first occurrence of `impweight` that is greater than P 
i = df.loc[df['indweight'] > P, 'indweight'].last_valid_index()

# The value of `impwealth` associated with this index will be the weighted median
w_median = df.ix[i, 'impwealth']

This method seems clunky, and I'm not sure it's correct. I didn't find a built in way to do this in pandas reference. What is the best way to go about finding weighted median?

这种方法看起来很笨拙,我不确定它是否正确。我没有在大Pandas参考中找到内置的方法来做到这一点。寻找加权中位数的最佳方法是什么?

采纳答案by prooffreader

If you want to do this in pure pandas, here's a way. It does not interpolate either. (@svenkatesh, you were missing the cumulative sum in your pseudocode)

如果你想在纯Pandas中做到这一点,这里有一种方法。它也不插值。(@svenkatesh,您的伪代码中缺少累积总和)

df.sort_values('impwealth', inplace=True)
cumsum = df.indweight.cumsum()
cutoff = df.indweight.sum() / 2.0
median = df.impwealth[cumsum >= cutoff].iloc[0]

This gives a median of 925000.

这给出了 925000 的中位数。

回答by chrisb

Have you tried the wquantilespackage? I had never used it before, but it has a weighted median function that seems to give at least a reasonable answer (you'll probably want to double check that it's using the approach you expect).

您是否尝试过wquantiles包?我以前从未使用过它,但它有一个加权中值函数,似乎至少给出了一个合理的答案(您可能需要仔细检查它是否使用了您期望的方法)。

In [12]: import weighted

In [13]: weighted.median(df['impwealth'], df['indweight'])
Out[13]: 914662.0859091772

回答by Max Ghenis

This function generalizes proofreader's solution:

此函数概括了校对者的解决方案:

def weighted_median(df, val, weight):
    df_sorted = df.sort_values(val)
    cumsum = df_sorted[weight].cumsum()
    cutoff = df_sorted[weight].sum() / 2.
    return df_sorted[cumsum >= cutoff][val].iloc[0]

In this example it'd be weighted_median(df, 'impwealth', 'indweight').

在这个例子中,它将是weighted_median(df, 'impwealth', 'indweight').

回答by Ash

You can also use this function that I've written for the same purpose.

您也可以使用我为相同目的编写的这个函数。

Note: weighted uses interpolation at the end to select the 0.5 quantile (you can look at the code yourself)

注意:weighted最后使用插值选择0.5分位数(可以自己看代码)

My written function just returns the one bounding 0.5 weight.

我编写的函数只返回一个边界为 0.5 的权重。

import numpy as np

def weighted_median(values, weights):
    ''' compute the weighted median of values list. The 
weighted median is computed as follows:
    1- sort both lists (values and weights) based on values.
    2- select the 0.5 point from the weights and return the corresponding values as results
    e.g. values = [1, 3, 0] and weights=[0.1, 0.3, 0.6] assuming weights are probabilities.
    sorted values = [0, 1, 3] and corresponding sorted weights = [0.6,     0.1, 0.3] the 0.5 point on
    weight corresponds to the first item which is 0. so the weighted     median is 0.'''

    #convert the weights into probabilities
    sum_weights = sum(weights)
    weights = np.array([(w*1.0)/sum_weights for w in weights])
    #sort values and weights based on values
    values = np.array(values)
    sorted_indices = np.argsort(values)
    values_sorted  = values[sorted_indices]
    weights_sorted = weights[sorted_indices]
    #select the median point
    it = np.nditer(weights_sorted, flags=['f_index'])
    accumulative_probability = 0
    median_index = -1
    while not it.finished:
        accumulative_probability += it[0]
        if accumulative_probability > 0.5:
            median_index = it.index
            return values_sorted[median_index]
        elif accumulative_probability == 0.5:
            median_index = it.index
            it.iternext()
            next_median_index = it.index
            return np.mean(values_sorted[[median_index, next_median_index]])
        it.iternext()

    return values_sorted[median_index]
#compare weighted_median function and np.median
print weighted_median([1, 3, 0, 7], [2,3,3,9])
print np.median([1,1,0,0,0,3,3,3,7,7,7,7,7,7,7,7,7])

回答by Max Ghenis

You can use this solutionto Weighted percentile using numpy:

您可以使用numpy将此解决方案用于加权百分位数

def weighted_quantile(values, quantiles, sample_weight=None, 
                      values_sorted=False, old_style=False):
    """ Very close to numpy.percentile, but supports weights.
    NOTE: quantiles should be in [0, 1]!
    :param values: numpy.array with data
    :param quantiles: array-like with many quantiles needed
    :param sample_weight: array-like of the same length as `array`
    :param values_sorted: bool, if True, then will avoid sorting of
        initial array
    :param old_style: if True, will correct output to be consistent
        with numpy.percentile.
    :return: numpy.array with computed quantiles.
    """
    values = np.array(values)
    quantiles = np.array(quantiles)
    if sample_weight is None:
        sample_weight = np.ones(len(values))
    sample_weight = np.array(sample_weight)
    assert np.all(quantiles >= 0) and np.all(quantiles <= 1), \
        'quantiles should be in [0, 1]'

    if not values_sorted:
        sorter = np.argsort(values)
        values = values[sorter]
        sample_weight = sample_weight[sorter]

    weighted_quantiles = np.cumsum(sample_weight) - 0.5 * sample_weight
    if old_style:
        # To be convenient with numpy.percentile
        weighted_quantiles -= weighted_quantiles[0]
        weighted_quantiles /= weighted_quantiles[-1]
    else:
        weighted_quantiles /= np.sum(sample_weight)
    return np.interp(quantiles, weighted_quantiles, values)

Call as weighted_quantile(df.impwealth, quantiles=0.5, df.indweight).

调用为weighted_quantile(df.impwealth, quantiles=0.5, df.indweight)