Python: How to find the probability distribution and parameters for real data? (Python 3)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/37487830/

Time: 2020-08-19 19:27:09  Source: igfitidea

How to find probability distribution and parameters for real data? (Python 3)

python machine-learning statistics distribution data-fitting

Asked by O.rka

I have a dataset from sklearn and I plotted the distribution of the load_diabetes.target data (i.e. the values of the regression that the load_diabetes.data are used to predict).

I used this because it has the fewest variables/attributes of the regression datasets in sklearn.datasets.

Using Python 3, how can I get the distribution type and parameters of the distribution this most closely resembles?

All I know is that the target values are all positive and skewed (positive skew/right skew). Is there a way in Python to provide a few distributions and then get the best fit for the target data/vector? Or, to actually suggest a fit based on the data that's given? That would be really useful for people who have theoretical statistical knowledge but little experience with applying it to "real data".

Bonus: Would it make sense to use this type of approach to figure out what your posterior distribution would be with "real data"? If not, why not?

from sklearn.datasets import load_diabetes
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd

#Get Data
data = load_diabetes()
X, y_ = data.data, data.target

#Organize Data
SR_y = pd.Series(y_, name="y_ (Target Vector Distribution)")

#Plot Data
fig, ax = plt.subplots()
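# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True gives a similar plot
sns.histplot(SR_y, bins=25, color="g", kde=True, stat="density", ax=ax)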
plt.show()

Accepted answer by carrdelling

To the best of my knowledge, there is no automatic way of obtaining the distribution type and parameters of a sample (as inferring the distribution of a sample is a statistical problem by itself).

In my opinion, the best you can do is:

(for each attribute)

  • Try to fit each attribute to a reasonably large list of possible distributions (e.g. see "Fitting empirical distribution to theoretical ones with Scipy (Python)?" for an example with Scipy)

  • Evaluate all your fits and pick the best one. This can be done by performing a Kolmogorov-Smirnov test between your sample and each of the fitted distributions (Scipy has an implementation of this, too), and picking the one that minimises D, the test statistic (the largest absolute difference between the sample's empirical CDF and the fitted CDF).
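
The two steps above can be sketched roughly as follows. This is a minimal illustration rather than the answerer's own code: the helper name `rank_by_ks`, the list of candidate distributions, and the synthetic gamma sample (standing in for one attribute of real data) are all assumptions made for the example.

```python
import numpy as np
import scipy.stats as st

def rank_by_ks(sample, dist_names):
    """Fit each named scipy.stats distribution and sort by the KS statistic D."""
    results = []
    for name in dist_names:
        dist = getattr(st, name)
        params = dist.fit(sample)  # maximum-likelihood estimate
        d_stat, _ = st.kstest(sample, name, args=params)
        results.append((name, d_stat, params))
    # Smallest D first, i.e. the closest fit in the KS sense
    return sorted(results, key=lambda r: r[1])

# Hypothetical stand-in data: positive and right-skewed
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=50.0, size=500)

ranking = rank_by_ks(sample, ["norm", "gamma", "lognorm", "expon"])
for name, d_stat, _ in ranking:
    print(f"{name:8s} D = {d_stat:.4f}")
```

Ranking by D rather than by p-value sidesteps the fact that the p-value is not reliable when the parameters were estimated from the same sample.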

Bonus: It would make sense - as you'll be building a model on each of the variables as you pick a fit for each one - although the quality of your predictions would depend on the quality of your data and on the distributions you use for fitting. You are building a model, after all.

Answer by Pasindu Tennage

Use this approach

import scipy.stats as st
def get_best_distribution(data):
    dist_names = ["norm", "exponweib", "weibull_max", "weibull_min", "pareto", "genextreme"]
    dist_results = []
    params = {}
    for dist_name in dist_names:
        dist = getattr(st, dist_name)
        param = dist.fit(data)

        params[dist_name] = param
        # Applying the Kolmogorov-Smirnov test
        D, p = st.kstest(data, dist_name, args=param)
        print("p value for "+dist_name+" = "+str(p))
        dist_results.append((dist_name, p))

    # select the best fitted distribution
    best_dist, best_p = (max(dist_results, key=lambda item: item[1]))
    # store the name of the best fit and its p value

    print("Best fitting distribution: "+str(best_dist))
    print("Best p value: "+ str(best_p))
    print("Parameters for the best fit: "+ str(params[best_dist]))

    return best_dist, best_p, params[best_dist]
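
One caveat worth keeping in mind with the function above: the p-value from `kstest` assumes the distribution's parameters were fixed in advance, so when they are fitted to the same sample the p-values tend to be optimistic. Below is a minimal sketch of one possible workaround, assuming you can afford to split the data: fit on one half and test on the other. The helper name `holdout_ks`, the candidate list, and the synthetic sample are illustrative and not part of the original answer.

```python
import numpy as np
import scipy.stats as st

def holdout_ks(data, dist_names, seed=0):
    """Fit on a random half of the data, run the KS test on the other half."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(data, dtype=float))
    half = len(shuffled) // 2
    train, test = shuffled[:half], shuffled[half:]
    results = {}
    for name in dist_names:
        params = getattr(st, name).fit(train)
        d_stat, p_value = st.kstest(test, name, args=params)
        results[name] = (d_stat, p_value, params)
    return results

# Hypothetical stand-in data: positive and right-skewed
sample = np.random.default_rng(1).gamma(shape=2.0, scale=50.0, size=600)

results = holdout_ks(sample, ["norm", "gamma", "lognorm"])
best = min(results, key=lambda name: results[name][0])  # smallest D
print("Best by held-out KS statistic:", best)
```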

Answer by Alexis Clarembeau

You can use this code to fit different distributions to your data (by maximum likelihood):

import scipy.stats
from sklearn.datasets import load_diabetes

# The sample to fit; here the target vector from the question
y = load_diabetes().target

dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']

for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    # Maximum-likelihood fit; returns (shape parameters..., loc, scale)
    param = dist.fit(y)
    print(dist_name, param)

You can see a sample snippet about how to use the parameters obtained here: Fitting empirical distribution to theoretical ones with Scipy (Python)?
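
As a rough illustration of what using the fitted parameters can look like (this is not the linked snippet; the synthetic sample `y` and the choice of the gamma distribution are assumptions made for the example), the tuple returned by `fit` can be unpacked into shape parameters, `loc` and `scale`, and used to build a frozen distribution:

```python
import numpy as np
import scipy.stats as st

# Hypothetical stand-in for real data: positive and right-skewed
rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=50.0, size=500)

# fit returns (shape parameters..., loc, scale); gamma has one shape parameter
params = st.gamma.fit(y)
*shape_args, loc, scale = params

# A "frozen" distribution carries the fitted parameters with it
frozen = st.gamma(*shape_args, loc=loc, scale=scale)

# Evaluate the fitted density on a grid, e.g. to overlay on a histogram
x = np.linspace(y.min(), y.max(), 200)
pdf_values = frozen.pdf(x)

print("fitted parameters:", params)
print("fitted mean:", frozen.mean(), "sample mean:", y.mean())
```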

Then, you can pick the distribution with the best log-likelihood (there are also other criteria for choosing the "best" distribution, such as Bayesian posterior probability, AIC, BIC or BICc values, ...).
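
A minimal sketch of that comparison, assuming the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k counts the fitted parameters (including `loc` and `scale`). The helper name `information_criteria`, the candidate list, and the synthetic sample are illustrative assumptions:

```python
import numpy as np
import scipy.stats as st

def information_criteria(sample, dist_names):
    """Return (name, log-likelihood, AIC, BIC) per distribution, lowest AIC first."""
    n = len(sample)
    rows = []
    for name in dist_names:
        dist = getattr(st, name)
        params = dist.fit(sample)
        log_lik = np.sum(dist.logpdf(sample, *params))
        k = len(params)  # number of fitted parameters
        aic = 2 * k - 2 * log_lik
        bic = k * np.log(n) - 2 * log_lik
        rows.append((name, log_lik, aic, bic))
    return sorted(rows, key=lambda r: r[2])

# Hypothetical stand-in data: positive and right-skewed
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=50.0, size=500)

rows = information_criteria(sample, ["norm", "gamma", "lognorm", "expon"])
for name, log_lik, aic, bic in rows:
    print(f"{name:8s} logL = {log_lik:9.2f}  AIC = {aic:9.2f}  BIC = {bic:9.2f}")
```

Unlike the raw log-likelihood, AIC and BIC penalise distributions with more parameters, which matters when the candidates differ in the number of shape parameters.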

For your bonus question, I think there's no generic answer. If your data set is substantial and was obtained under the same conditions as the real-world data, you can do it.
