numpy/scipy equivalent of R ecdf(x)(x) function?

Note: this page reproduces a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors. Original Stack Overflow question: http://stackoverflow.com/questions/15792552/

Tags: python, r, numpy, scipy

Question

What is the equivalent of R's ecdf(x)(x) function in Python, in either numpy or scipy? Is ecdf(x)(x) basically the same as:

import numpy as np
def ecdf(x):
  # normalize X to sum to 1
  x = x / np.sum(x)
  return np.cumsum(x)

or is something else required?

EDIT: How can one control the number of bins used by ecdf?

Accepted answer by yasouser

Try these links:

statsmodels.ECDF

ECDF in python without step function?

Example code

import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF
import matplotlib.pyplot as plt

data = np.random.normal(0,5, size=2000)

ecdf = ECDF(data)
plt.plot(ecdf.x,ecdf.y)

Answer by CompEcon

This author has a very nice example of a user-written ECDF function: John Stachurski's Python lectures. His lecture series is geared towards graduate students in computational economics; however, it is my go-to recommendation for anyone learning general scientific computing in Python.

Edit: This is a year old now, but I thought I'd still answer the "Edit" part of your question, in case you (or others) still find it useful.

There really aren't any "bins" with ECDFs as there are with histograms. If G is your empirical distribution function formed from data vector Z, G(x) is literally the number of elements of Z that are <= x, divided by len(Z). This requires no "binning" to determine. Thus there is a sense in which the ECDF retains all possible information about a dataset (since it must retain the entire dataset for calculations), whereas a histogram actually loses some information about the dataset by binning. For this reason, I much prefer to work with ECDFs over histograms when possible.
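
The definition above translates almost directly into code. As a minimal sketch (the names make_ecdf, Z, and G are illustrative and not taken from any library), G(x) can be computed by counting the observations that are <= x, with no bins anywhere:

```python
import numpy as np

def make_ecdf(Z):
    """Return G, where G(x) = (number of elements of Z <= x) / len(Z)."""
    Z = np.asarray(Z, dtype=float)

    def G(x):
        # Compare every query point against every observation and average
        # the resulting True/False values; no binning is involved.
        x = np.atleast_1d(np.asarray(x, dtype=float))
        return (Z[None, :] <= x[:, None]).mean(axis=1)

    return G

Z = [3.0, 1.0, 2.0, 2.0]
G = make_ecdf(Z)
print(G([0.5, 1.0, 2.0, 2.5, 3.0]))  # values: 0, 0.25, 0.75, 0.75, 1
```

This brute-force comparison uses memory proportional to len(Z) times the number of query points, so it is meant to illustrate the definition rather than to scale; sorting once and using a binary search is the usual way to make it efficient.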

Fun bonus: if you need to create a small-footprint ECDF-like object from very large streaming data, you should look into this "Data Skeletons" paper by McDermott et al.

Answer by RubenLaguna

The OP's implementation of ecdf is wrong: you are not supposed to cumsum() the values. So not ys = np.cumsum(x)/np.sum(x) but ys = np.cumsum(np.ones(len(x)))/float(len(x)), or better, ys = np.arange(1, len(x)+1)/float(len(x)), with x sorted first.

You can either go with statsmodels's ECDF, if you are OK with that extra dependency, or provide your own implementation. See below:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.distributions.empirical_distribution import ECDF
# In a Jupyter notebook, also run: %matplotlib inline

grades = (93.5,93,60.8,94.5,82,87.5,91.5,99.5,86,93.5,92.5,78,76,69,94.5,
          89.5,92.8,78,65.5,98,98.5,92.3,95.5,76,91,95,61)


def ecdf_wrong(x):
    xs = np.sort(x)  # values need to be sorted
    ys = np.cumsum(xs) / np.sum(xs)  # wrong: cumulative share of the total, not a count
    return xs, ys


def ecdf(x):
    xs = np.sort(x)
    ys = np.arange(1, len(xs) + 1) / float(len(xs))  # i-th sorted value gets i/n
    return xs, ys


xs, ys = ecdf_wrong(grades)
plt.plot(xs, ys, label="wrong cumsum")
xs, ys = ecdf(grades)
plt.plot(xs, ys, label="handwritten", marker=">", markerfacecolor='none')
cdf = ECDF(grades)
plt.plot(cdf.x, cdf.y, label="statsmodels", marker="<", markerfacecolor='none')
plt.legend()
plt.show()

[Figure: ECDF comparison]
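
A small numeric check may make the difference between the two curves more concrete. This is only a sketch on a made-up four-element sample (the array x and the names share and ecdf_vals are illustrative): the cumsum-of-values approach weights each point by its magnitude, while the ECDF gives every observation the same weight 1/n.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0])  # made-up sample for illustration
xs = np.sort(x)

# Cumulative share of the total: each point is weighted by its magnitude.
share = np.cumsum(xs) / np.sum(xs)

# ECDF at the sorted points: each observation contributes exactly 1/n.
ecdf_vals = np.arange(1, len(xs) + 1) / len(xs)

print(share)      # values: 0.0625, 0.1875, 0.375, 1.0
print(ecdf_vals)  # values: 0.25, 0.5, 0.75, 1.0
```

The two only agree at the last point. With negative values in the sample, the cumulative share is not even guaranteed to be non-decreasing, which a distribution function must be.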

Answer by Tim

The ecdf function in R returns the empirical cumulative distribution function itself (a callable), so the exact equivalent would rather be:

import numpy as np

def ecdf(x):
    x = np.sort(x)
    n = len(x)
    def _ecdf(v):
        # side='right' because we want Pr(x <= v):
        # searchsorted then returns the number of observations <= v
        return np.searchsorted(x, v, side='right') / n
    return _ecdf

np.random.seed(42)
X = np.random.normal(size=10_000)
Fn = ecdf(X)
Fn([3, 2, 1]) - Fn([-3, -2, -1])
## array([0.9972, 0.9533, 0.682 ])

As shown, it gives the correct 68–95–99.7% probabilities for a normal distribution.
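
To tie this back to the original question, here is a hedged sketch of what R's ecdf(x)(x) evaluates to: the ECDF applied to the data itself, returned in the original order of x, with tied values sharing the same result. The helper name ecdf_at_data is made up for this illustration.

```python
import numpy as np

def ecdf_at_data(x):
    """Sketch of R's ecdf(x)(x): the ECDF evaluated at each observation."""
    x = np.asarray(x)
    xs = np.sort(x)
    # side='right' counts the observations <= each value, so ties share a value.
    return np.searchsorted(xs, x, side='right') / len(xs)

print(ecdf_at_data([3, 1, 2, 2]))  # values: 1.0, 0.25, 0.75, 0.75
```

Unlike the np.arange(1, n+1)/n construction on sorted data, this keeps the input order and assigns both copies of a repeated value the same probability.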
