Python numpy/scipy equivalent of R's ecdf(x)(x) function?

Question

What is the equivalent of R's ecdf(x)(x) function in Python, in either numpy or scipy? Is ecdf(x)(x) basically the same as:

import numpy as np
def ecdf(x):
  # normalize X to sum to 1
  x = x / np.sum(x)
  return np.cumsum(x)

Or is something else required?

EDIT: How can one control the number of bins used by ecdf?

Answer 1

Accepted answer by yasouser

Try these links:

statsmodels.ECDF

ECDF in python without step function?

Example code

import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF
import matplotlib.pyplot as plt

data = np.random.normal(0, 5, size=2000)  # 2000 draws with mean 0, sd 5

ecdf = ECDF(data)
plt.plot(ecdf.x, ecdf.y)  # x: step locations, y: cumulative proportions
plt.show()

Answer 2

Answer by CompEcon

This author has a very nice example of a user-written ECDF function: John Stachurski's Python lectures. His lecture series is geared towards graduate students in computational economics; however, it is the resource I recommend first to anyone learning general scientific computing in Python.

Edit: This is a year old now, but I thought I'd still answer the "Edit" part of your question, in case you (or others) still find it useful.

There really aren't any "bins" with ECDFs as there are with histograms. If G is your empirical distribution function formed using data vector Z, G(x) is literally the number of elements of Z that are <= x, divided by len(Z). This requires no "binning" to determine. Thus there is a sense in which the ECDF retains all possible information about a dataset (since it must retain the entire dataset for calculations), whereas a histogram actually loses some information about the dataset by binning. For this reason, I much prefer to work with ECDFs rather than histograms when possible.
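The definition above can be written out directly. This is only a minimal sketch, assuming plain NumPy; the helper name ecdf_at is made up for illustration and is not part of any library:

```python
import numpy as np

def ecdf_at(z, x):
    """Fraction of observations in z that are <= x (hypothetical helper)."""
    z = np.asarray(z)
    # Count the elements satisfying z <= x and divide by the sample size
    return np.count_nonzero(z <= x) / len(z)

z = [3, 1, 4, 1, 5]
print(ecdf_at(z, 1))    # 2 of the 5 values are <= 1
print(ecdf_at(z, 3.5))  # 3 of the 5 values are <= 3.5
print(ecdf_at(z, 0))    # none of the values are <= 0
```

No bin count appears anywhere: the only inputs are the data and the point at which the function is evaluated.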

Fun bonus: if you need to create a small-footprint ECDF-like object from very large streaming data, you should look into the "Data Skeletons" paper by McDermott et al.

Answer 3

Answer by RubenLaguna

The OP's implementation of ecdf is wrong: you are not supposed to cumsum() the values. So not ys = np.cumsum(x)/np.sum(x) but ys = np.cumsum(np.ones(len(x)))/float(len(x)) or, better, ys = np.arange(1, len(x)+1)/float(len(x)), paired with the sorted values of x.

You can either use statsmodels's ECDF, if you are OK with that extra dependency, or provide your own implementation. See below:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.distributions.empirical_distribution import ECDF
# %matplotlib inline  # IPython magic: uncomment only inside a Jupyter notebook

grades = (93.5,93,60.8,94.5,82,87.5,91.5,99.5,86,93.5,92.5,78,76,69,94.5,
          89.5,92.8,78,65.5,98,98.5,92.3,95.5,76,91,95,61)


def ecdf_wrong(x):
    xs = np.sort(x) # need to be sorted
    ys = np.cumsum(xs)/np.sum(xs) # normalize so sum == 1
    return (xs,ys)
def ecdf(x):
    xs = np.sort(x)
    ys = np.arange(1, len(xs)+1)/float(len(xs))
    return xs, ys

xs, ys = ecdf_wrong(grades)
plt.plot(xs, ys, label="wrong cumsum")
xs, ys = ecdf(grades)
plt.plot(xs, ys, label="handwritten", marker=">", markerfacecolor='none')
cdf = ECDF(grades)
plt.plot(cdf.x, cdf.y, label="statsmodels", marker="<", markerfacecolor='none')
plt.legend()
plt.show()
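One caveat with the arange-based version: the grades sample contains repeated values (93.5, 94.5, 78 and 76 each appear twice), so the same x gets two different heights. A possible variant, sketched here with plain NumPy under the hypothetical name ecdf_steps, keeps one point per distinct value. Note that it takes a cumulative sum of the counts of observations, not of the values themselves:

```python
import numpy as np

def ecdf_steps(x):
    """Return the distinct sorted values and the ECDF height at each one."""
    xs, counts = np.unique(x, return_counts=True)  # sorted distinct values
    ys = np.cumsum(counts) / len(x)                # fraction of data <= each value
    return xs, ys

xs, ys = ecdf_steps([76, 78, 78, 93.5, 93.5])
print(xs)  # distinct values: 76, 78, 93.5
print(ys)  # heights: 0.2, 0.6, 1.0
```

If a step plot is wanted, something like plt.step(xs, ys, where="post") should draw the function as right-continuous steps.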

Answer 4

Answer by Tim

The ecdf function in R returns the empirical cumulative distribution function itself (a callable), so the exact equivalent would rather be:

import numpy as np

def ecdf(x):
    x = np.sort(x)
    n = len(x)
    def _ecdf(v):
        # side='right' returns the number of elements <= v, i.e. n * Pr(x <= v)
        return np.searchsorted(x, v, side='right') / n
    return _ecdf

np.random.seed(42)
X = np.random.normal(size=10_000)
Fn = ecdf(X)
Fn([3, 2, 1]) - Fn([-3, -2, -1])
## array([0.9972, 0.9533, 0.682 ])

As shown, it gives approximately the 68–95–99.7% probabilities expected for a normal distribution.
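The question asks specifically about ecdf(x)(x), that is, the ECDF evaluated at the data points themselves and returned in the original order of x. A minimal sketch of that call, assuming plain NumPy (the name ecdf_at_data is made up for illustration):

```python
import numpy as np

def ecdf_at_data(x):
    """Evaluate the ECDF of x at each element of x, in the original order."""
    x = np.asarray(x)
    xs = np.sort(x)
    # side='right' counts the elements <= each value, so ties share one height
    return np.searchsorted(xs, x, side='right') / len(x)

print(ecdf_at_data([3, 1, 2, 1]))  # heights 1.0, 0.5, 0.75, 0.5
```

If scipy is available, scipy.stats.rankdata(x, method='max') / len(x) should give the same values, since the "max" rank of an element is the number of elements less than or equal to it.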
