Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/39512260/

Time: 2020-08-19 22:22:17  Source: igfitidea

Calculating Gini coefficient in Python/numpy

Tags: python, numpy, statistics

Asked by mvd

I'm calculating the Gini coefficient (similar to: Python - Gini coefficient calculation using Numpy) but I get an odd result. For a uniform distribution sampled from np.random.rand(), the Gini coefficient is 0.3, but I would have expected it to be close to 0 (perfect equality). What is going wrong here?

import numpy as np
import matplotlib.pyplot as plt

def G(v):
    bins = np.linspace(0., 100., 11)
    total = float(np.sum(v))
    yvals = []
    for b in bins:
        bin_vals = v[v <= np.percentile(v, b)]
        bin_fraction = (np.sum(bin_vals) / total) * 100.0
        yvals.append(bin_fraction)
    # perfect equality area
    pe_area = np.trapz(bins, x=bins)
    # lorenz area
    lorenz_area = np.trapz(yvals, x=bins)
    gini_val = (pe_area - lorenz_area) / float(pe_area)
    return bins, yvals, gini_val

v = np.random.rand(500)
bins, result, gini_val = G(v)
plt.figure()
plt.subplot(2, 1, 1)
plt.plot(bins, result, label="observed")
plt.plot(bins, bins, '--', label="perfect eq.")
plt.xlabel("fraction of population")
plt.ylabel("fraction of wealth")
plt.title("GINI: %.4f" %(gini_val))
plt.legend()
plt.subplot(2, 1, 2)
plt.hist(v, bins=20)
plt.show()

For the given set of numbers, the above code calculates the fraction of the total distribution's values that fall at or below each percentile bin.

The result:

[Figure: output of the code above, with the Lorenz curve plot on top and the histogram of v below]

Uniform distributions should be near "perfect equality", so the bend in the Lorenz curve looks wrong.

Answered by Warren Weckesser

This is to be expected. A random sample from a uniform distribution does not result in uniform values (i.e. values that are all relatively close to each other). With a little calculus, it can be shown that the expected value (in the statistical sense) of the Gini coefficient of a sample from the uniform distribution on [0, 1] is 1/3, so getting values around 1/3 for a given sample is reasonable.

You'll get a lower Gini coefficient with a sample such as v = 10 + np.random.rand(500). Those values are all close to 10.5; the relative variation is lower than in the sample v = np.random.rand(500). In fact, the expected value of the Gini coefficient for the sample base + np.random.rand(n) is 1/(6*base + 3).
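
The calculus behind the 1/3 and 1/(6*base + 3) figures can be sketched as follows. This is a derivation of the population Gini coefficient, which the expected sample value approaches as the sample grows; the symbols b (standing for base), X, Y and \mu are notation introduced here and do not appear in the original answer.

```latex
% X, Y independent and uniform on [b, b+1]; |X - Y| does not depend on b
\mathbb{E}|X - Y| = \int_0^1\!\!\int_0^1 |x - y|\,dx\,dy = \frac{1}{3},
\qquad
\mu = \mathbb{E}[X] = b + \tfrac{1}{2},
\\[4pt]
G = \frac{\mathbb{E}|X - Y|}{2\mu} = \frac{1/3}{2b + 1} = \frac{1}{6b + 3},
\qquad
G\big|_{b=0} = \frac{1}{3}.
```

Shifting the sample by b leaves the absolute differences unchanged but raises the mean, which is why the coefficient shrinks as base grows.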

Here's a simple implementation of the Gini coefficient. It uses the fact that the Gini coefficient is half the relative mean absolute difference.

def gini(x):
    # (Warning: This is a concise implementation, but it is O(n**2)
    # in time and memory, where n = len(x).  *Don't* pass in huge
    # samples!)

    # Mean absolute difference
    mad = np.abs(np.subtract.outer(x, x)).mean()
    # Relative mean absolute difference
    rmad = mad/np.mean(x)
    # Gini coefficient
    g = 0.5 * rmad
    return g

Here's the Gini coefficient for several samples of the form v = base + np.random.rand(500):

In [80]: v = np.random.rand(500)

In [81]: gini(v)
Out[81]: 0.32760618249832563

In [82]: v = 1 + np.random.rand(500)

In [83]: gini(v)
Out[83]: 0.11121487509454202

In [84]: v = 10 + np.random.rand(500)

In [85]: gini(v)
Out[85]: 0.01567937753659053

In [86]: v = 100 + np.random.rand(500)

In [87]: gini(v)
Out[87]: 0.0016594595244509495
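
The session above can be compared against the 1/(6*base + 3) formula with a short seeded check. This is only a sketch: the helper mean_gini and its defaults (n, trials, seed) are hypothetical choices made here for illustration, and gini restates the pairwise definition from this answer.

```python
import numpy as np

def gini(x):
    # Same pairwise definition as in the answer above: half the
    # relative mean absolute difference (O(n**2) time and memory).
    x = np.asarray(x, dtype=float)
    mad = np.abs(np.subtract.outer(x, x)).mean()
    return 0.5 * mad / x.mean()

def mean_gini(base, n=200, trials=100, seed=0):
    # Average the sample Gini over several seeded draws of
    # base + uniform[0, 1); n, trials and seed are arbitrary choices.
    rng = np.random.default_rng(seed)
    return np.mean([gini(base + rng.random(n)) for _ in range(trials)])

for base in (0, 1, 10, 100):
    expected = 1.0 / (6 * base + 3)
    # Prints the base, the averaged sample Gini, and the formula value.
    print(base, mean_gini(base), expected)
```

The averaged values should sit close to the formula, with a small downward offset for finite n because the pairwise mean includes the zero differences of each value with itself.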

Answered by andrewtavis

A quick note on the original methodology:

When calculating Gini coefficients directly from areas under curves with np.trapz or another integration method, the first value of the Lorenz curve needs to be 0 so that the area between the origin and the second value is accounted for. The following changes to G(v) fix this:

yvals = [0]
for b in bins[1:]:

I also discussed this issue in this answer, where including the origin in those calculations gives the same result as the other methods discussed here (which do not need 0 to be prepended).

In short, when calculating Gini coefficients directly using integration, you need to start from the origin. If you use the other methods discussed here, that is not needed.
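
Putting the two changed lines back into the question's function gives something like the sketch below. The names gini_from_lorenz and trapezoid_area are hypothetical, introduced here for illustration; the trapezoidal rule is written out by hand so the example does not depend on which integration helper a given NumPy version provides.

```python
import numpy as np

def trapezoid_area(y, x):
    # Plain trapezoidal rule, written out to avoid depending on a
    # particular NumPy version's integration helper.
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x)))

def gini_from_lorenz(v, n_bins=10):
    # Lorenz-curve estimate of the Gini coefficient, with the curve
    # pinned to the origin as described above.
    v = np.asarray(v, dtype=float)
    bins = np.linspace(0.0, 100.0, n_bins + 1)
    total = v.sum()
    yvals = [0.0]  # the Lorenz curve starts at (0, 0)
    for b in bins[1:]:
        bin_vals = v[v <= np.percentile(v, b)]
        yvals.append(bin_vals.sum() / total * 100.0)
    pe_area = trapezoid_area(bins, bins)       # perfect equality
    lorenz_area = trapezoid_area(yvals, bins)  # observed curve
    return (pe_area - lorenz_area) / pe_area

rng = np.random.default_rng(0)
print(gini_from_lorenz(rng.random(500)))       # uniform on [0, 1)
print(gini_from_lorenz(10 + rng.random(500)))  # uniform on [10, 11)
```

With only eleven points on the curve the result is a coarse estimate, so it will differ slightly from the pairwise methods on the same data.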

Answered by bhartii

The Gini coefficient is computed from the Lorenz curve: it is the area between the line of perfect equality and the Lorenz curve, divided by the total area under the line of perfect equality. It is usually calculated to analyze the distribution of income in a population. https://github.com/oliviaguest/gini provides a simple implementation in Python.
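
For reference, the usual relation between the Gini coefficient and the Lorenz curve can be written as below, assuming both axes are normalised to [0, 1]. The symbols L(p), A and B are notation introduced here: L(p) is the Lorenz curve, A the area between the line of perfect equality and the curve, and B the area under the curve.

```latex
G = \frac{A}{A + B} = 2A = 1 - 2\int_0^1 L(p)\,dp,
\qquad
A = \tfrac{1}{2} - \int_0^1 L(p)\,dp,
\quad
B = \int_0^1 L(p)\,dp .
% Example: for the uniform distribution on [0, 1], L(p) = p^2,
% so G = 1 - 2/3 = 1/3.
```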

Answered by Ulf Aslak

A slightly faster implementation (using numpy vectorization and only computing each difference once):

def gini_coefficient(x):
    """Compute Gini coefficient of array of values"""
    diffsum = 0
    for i, xi in enumerate(x[:-1], 1):
        diffsum += np.sum(np.abs(xi - x[i:]))
    return diffsum / (len(x)**2 * np.mean(x))

Note: x must be a numpy array.
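
If the Python loop is still too slow for large samples, one alternative is a closed form on sorted data that avoids pairwise differences altogether. The sketch below is not part of the original answer: gini_sorted is a hypothetical name, and gini_loop restates the loop version above only so the two can be compared.

```python
import numpy as np

def gini_sorted(x):
    # Closed form on sorted data:
    #   G = sum_i (2*i - n - 1) * x_(i) / (n * sum(x)),  i = 1..n
    # Sorting dominates the cost, so this is O(n log n).
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)
    return float(np.sum((2 * ranks - n - 1) * x) / (n * np.sum(x)))

def gini_loop(x):
    # The loop version from the answer above, for comparison.
    x = np.asarray(x, dtype=float)
    diffsum = 0.0
    for i, xi in enumerate(x[:-1], 1):
        diffsum += np.sum(np.abs(xi - x[i:]))
    return diffsum / (len(x) ** 2 * np.mean(x))

rng = np.random.default_rng(0)
v = rng.random(500)
# The two values should agree up to floating-point rounding.
print(gini_sorted(v), gini_loop(v))
```

Both functions use the same normalisation by n**2, so they should match the pairwise definition used elsewhere on this page.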