pandas Probability Distribution Function Python

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/36150257/

Time: 2020-09-14 00:55:09  Source: igfitidea

Probability Distribution Function Python

python numpy pandas matplotlib visualization

Asked by Sitz Blogz

I have a set of raw data and I have to identify the distribution of that data. What is the easiest way to plot a probability distribution function? I have tried fitting it to a normal distribution.

But I am more curious to know which distribution the data itself follows.

I have no code to show my progress, as I have failed to find any functions in Python that would let me test the distribution of the dataset. I do not want to slice the data and force it to fit a normal or skewed distribution.

Is there any way to determine the distribution of the dataset? Any suggestions are appreciated.

Is this a correct approach? Example
This is close to what I am looking for, but again it fits the data to a normal distribution. Example

EDIT:


The input has a million rows; a short sample is given below:

Hashtag,Frequency
#Car,45
#photo,4
#movie,6
#life,1

The frequency ranges from 1 to 20,000, and I am trying to identify the distribution of the keyword frequencies. I tried plotting a simple histogram, but the output is a single bar.

Code:


import pandas
import matplotlib.pyplot as plt


df = pandas.read_csv('Paris_random_hash.csv', sep=',')
plt.hist(df['Frequency'])
plt.show()

Output: (image: output of frequency count)

Answered by Chiel

This is a minimal working example for showing a histogram. It only solves part of your question, but it can be a step towards your goal. Note that the histogram function returns the bin edges, so you have to take the midpoint of each pair of adjacent edges to get the bin centers.

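import numpy as np
import matplotlib.pyplot as pl

# Draw 10,000 samples from a standard normal distribution
x = np.random.randn(10000)

nbins = 20

# density=True normalizes the histogram so that it integrates to 1
n, bins = np.histogram(x, nbins, density=True)

# np.histogram returns nbins + 1 edges; use the midpoints as x-coordinates
pdfx = 0.5 * (bins[:-1] + bins[1:])
pdfy = n

pl.plot(pdfx, pdfy)
pl.show()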

You can fit your data using the example shown in:


Fitting empirical distribution to theoretical ones with Scipy (Python)?
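
As a rough sketch of what such a fit could look like (not the code from the linked question), the snippet below fits a few candidate distributions with scipy.stats and ranks them by the Kolmogorov-Smirnov statistic. The synthetic lognormal data and the short candidate list are assumptions made only for illustration; substitute your own array and candidates.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the real data (assumption): lognormal samples
rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

# Arbitrary short list of candidate distributions (assumption)
candidates = ["norm", "expon", "lognorm"]

results = {}
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(data)  # maximum-likelihood estimate of the parameters
    ks_stat, p_value = stats.kstest(data, name, args=params)
    results[name] = ks_stat  # smaller statistic = closer to the data

best = min(results, key=results.get)
for name, ks_stat in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name:8s} KS statistic = {ks_stat:.3f}")
print("smallest KS statistic:", best)
```

Because the parameters are estimated from the same data that is then tested, the p-values returned by kstest are too optimistic, which is why only the statistic is used for ranking here.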


Answered by Greg Friedman

Did you try using the seaborn library? It has a nice kernel density estimation function. Try:

import seaborn as sns
sns.kdeplot(df['Frequency'])  # column name as in the question's CSV header

You can find installation instructions here
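
If seaborn is not available, a similar kernel density estimate can be sketched with scipy.stats.gaussian_kde. The data below is a synthetic stand-in for the Frequency column, and estimating the density on a log10 scale is an assumption made here because the frequencies in the question span 1 to 20,000.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for df['Frequency'] (assumption): heavy-tailed positive values
rng = np.random.default_rng(0)
freq = rng.lognormal(mean=1.0, sigma=1.5, size=5000)

# Work on a log10 scale so the long right tail does not dominate (assumption)
log_freq = np.log10(freq)

kde = stats.gaussian_kde(log_freq)  # Gaussian kernels, default bandwidth rule
grid = np.linspace(log_freq.min() - 1, log_freq.max() + 1, 400)
density = kde(grid)

# Sanity check: a density should integrate to roughly 1 over the grid
area = np.sum(density) * (grid[1] - grid[0])
print(f"area under the estimated density: {area:.3f}")
```

Plotting grid against density with matplotlib would then give a curve comparable to what kdeplot draws.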

Answered by Matt

Definitely a stats question - sounds like you're trying to do a probability test of whether the distribution is significantly similar to the normal, lognormal, binomial, etc. distributions. The easiest is to test for normal or lognormal as explained below.


Set your P-value cutoff; usually, if the P-value is <= 0.05, you reject normality, i.e. the data is NOT considered normally distributed.

In Python, use SciPy. You just need the P value it returns for the test, so the function has 2 return values (I'm ignoring the optional, not needed, inputs here for clarity):

import scipy.stats

# x is your 1-D array of data
W, Pvalue = scipy.stats.shapiro(x)

Perform the Shapiro-Wilk test for normality. The Shapiro-Wilk test tests the null hypothesis that the data was drawn from a normal distribution.


If you want to see if it is lognormally distributed (provided it doesn't pass the P test above), you can try:


import numpy

# Test the log of the data for normality
W, Pvalue = scipy.stats.shapiro(numpy.log(x))

Interpret the same way - I just tested on a known lognormally distributed simulation and got a 0.17 Pvalue on the np.log(x) test, and a number close to 0 for the standard shapiro(x) test. That tells me lognormally distributed is the better choice, normally distributed fails miserably.
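
A self-contained sketch of that comparison is shown below. The simulated lognormal sample, the seed and the 0.05 cutoff are assumptions for illustration, so the exact P values will differ from the ones quoted above.

```python
import numpy as np
from scipy import stats

# Simulated lognormal sample (assumption); replace with your own positive data
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)

alpha = 0.05  # conventional cutoff

# Shapiro-Wilk on the raw data and on its logarithm
w_raw, p_raw = stats.shapiro(x)
w_log, p_log = stats.shapiro(np.log(x))

print(f"raw data: W = {w_raw:.3f}, P = {p_raw:.3g}, normality rejected: {p_raw <= alpha}")
print(f"log data: W = {w_log:.3f}, P = {p_log:.3g}, normality rejected: {p_log <= alpha}")
```

A large P value on the log data only means that normality of log(x) was not rejected; it does not prove that the data is lognormal.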


I made it simple, which is what I gathered you are looking for. For other distributions, you may need to do more work along the lines of Q-Q plots (https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot) rather than simply following the few tests I proposed. That means you plot the distribution you are trying to fit against your data. Here's a quick example that can get you down that path if you so desire:

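import numpy as np
import pylab
import scipy.stats as stats

# Placeholder data (assumption): replace with whatever data you are looking to fit to a distribution
mydata = np.random.lognormal(size=1000)
# Q-Q plot of the data against a normal distribution
stats.probplot(mydata, dist='norm', plot=pylab)
pylab.show()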

Above you can substitute anything for dist='norm' from the scipy library http://docs.scipy.org/doc/scipy/reference/tutorial/stats/continuous.html#continuous-distributions-in-scipy-stats, then find its scipy name (you must add shape parameters according to the documentation, such as stats.probplot(mydata, dist='loggamma', sparams=(1,1), plot=pylab) or, for Student's t, stats.probplot(mydata, dist='t', sparams=(1,), plot=pylab)), then look at the plot and see how closely your data follows that distribution. If the data points are close to the straight line, you've found your distribution. It will give you an R^2 value on the graph too; the closer to 1, the better the fit generally.
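
The same comparison can be sketched numerically: when no plot argument is passed, probplot just returns the ordered values and the least-squares line together with its correlation coefficient r. The simulated data and the shape parameter 1.0 for the lognormal candidate are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Simulated data (assumption); replace with the data you want to check
rng = np.random.default_rng(0)
mydata = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Without plot=..., probplot returns ((quantiles, ordered data), (slope, intercept, r))
_, (slope_n, intercept_n, r_norm) = stats.probplot(mydata, dist="norm")

# 'lognorm' needs a shape parameter; 1.0 is an assumed value for this sketch
_, (slope_l, intercept_l, r_lognorm) = stats.probplot(
    mydata, dist="lognorm", sparams=(1.0,)
)

print(f"normal:    R^2 = {r_norm ** 2:.3f}")
print(f"lognormal: R^2 = {r_lognorm ** 2:.3f}")
```

The candidate whose R^2 is closer to 1 lies closer to a straight line in its Q-Q plot.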

And if you want to continue trying to do what you're doing with the dataframe, try changing to: plt.hist(df['Frequency'].values)
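
If the histogram still comes out as a single bar, one plausible explanation (an assumption, since the real file is not available here) is that nearly all frequencies are small while a few reach 20,000, so equal-width bins put almost everything into the first bin. A sketch with logarithmically spaced bins on synthetic heavy-tailed counts:

```python
import numpy as np

# Synthetic heavy-tailed counts between 1 and 20,000 (assumption),
# standing in for df['Frequency'].values
rng = np.random.default_rng(0)
freq = np.clip(np.round(rng.pareto(1.0, size=100_000) + 1), 1, 20_000)

# Equal-width bins: almost every value lands in the first bin
lin_counts, lin_edges = np.histogram(freq, bins=10)
share_first_linear_bin = lin_counts[0] / freq.size

# Logarithmically spaced bin edges spread the counts over many bins
log_edges = np.logspace(0, np.log10(20_000), 21)
log_counts, _ = np.histogram(freq, bins=log_edges)

print(f"share of data in first equal-width bin: {share_first_linear_bin:.4f}")
print("non-empty log-spaced bins:", np.count_nonzero(log_counts))
```

With matplotlib, passing bins=log_edges to plt.hist and calling plt.xscale('log') would draw the corresponding figure.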


Please vote for this answer if it answers your question :) Need some bounty to get replies on my own programming dilemmas.


Answered by BigSchottkyD

The histogram does not do what you think it does; you are trying to show a bar graph. The histogram needs each data point separately in a list, not the frequency itself. You have [3,2,0,4,...] but should have [1,1,1,2,2,4,4,4,4]. You cannot determine a probability distribution automatically.
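
To illustrate the conversion described here, np.repeat can expand per-value counts into individual data points. The assumption is that the counts [3, 2, 0, 4] belong to the values 1, 2, 3 and 4, which is what the expanded list in the answer implies.

```python
import numpy as np

values = np.array([1, 2, 3, 4])   # the distinct values (assumed)
counts = np.array([3, 2, 0, 4])   # how often each value occurs

# Repeat each value as many times as it was counted
samples = np.repeat(values, counts)
print(samples)  # [1 1 1 2 2 4 4 4 4]
```

The resulting array is in the form that plt.hist or np.histogram expects.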

Answered by Stop harming Monica

The only distribution the data carry within itself is the empirical probability. If you have data as a 1d numpy array data you can compute the value of the empirical distribution function at x as the cumulative relative frequency of the values less than or equal to x:

data[data <= x].size / data.size

This is a step function so it does not have an associated probability density function but a probability mass function where the mass of each observed value is its relative frequency. To compute the relative frequencies:


values, freqs = np.unique(data, return_counts=True)
rfreqs = freqs / data.size
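
A tiny worked example of both computations, using a made-up four-element array (the array and the helper name ecdf are illustrative assumptions):

```python
import numpy as np

data = np.array([1, 1, 2, 4])  # made-up observations

def ecdf(data, x):
    """Fraction of observations less than or equal to x."""
    return data[data <= x].size / data.size

# Probability mass function: relative frequency of each observed value
values, freqs = np.unique(data, return_counts=True)
rfreqs = freqs / data.size

# The cumulative sum of the masses gives the ECDF at the observed values
cumulative = np.cumsum(rfreqs)

print(ecdf(data, 2))   # 0.75
print(values, rfreqs)  # [1 2 4] [0.5  0.25 0.25]
print(cumulative)      # [0.5  0.75 1.  ]
```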

This does not mean that the data is a random sample from their empirical distribution. If you want to know what distribution your data are a sample from (if any) just by looking at the data, the answer is you can't. But that is more about statistics than about programming.


Answered by Charles Merriam

I think you are asking a slightly different question:


What is the correlation between my raw data and the curve to which I have mapped it?


This is a conceptual problem, and you are trying to understand the meanings of the values R and R squared. Start by working through this MiniTab blog post. You may want to skim this non-Python Kaledia Graph Guide to understand the classes of curves to fit and the usage of Least-Mean-Squares in fitting the curves.

You were probably downvoted because it is a math question more than a programming question.


Answered by Christian

I may be missing something, but it seems that a major point is being overlooked across the board: the data set you are describing is a categorical data set. That is, the x-values are not numeric, they're just words (#Car, #photo, etc.). The concept of the shape of a probability distribution is meaningless for a categorical data set, since there is no logical ordering for the categories. What would a histogram even look like? Would #Car be the first bin? Or would it be all the way to the right of your graph? Unless you have some criteria for quantifying your categories, trying to make judgments based on the shape of the distribution is meaningless.

Here's a small text-based example to clarify what I'm saying. Suppose I survey a group of people and ask their favorite color. I plot the results:


   Red | ##
 Green | #####
  Blue | #######
Yellow | #####
Orange | ##

Huh, looks like color preferences are normally distributed. Wait, what if I had randomly put the colors in a different order in my graph:


  Blue | #######
Yellow | #####
 Green | #####
Orange | ##
   Red | ##

I guess the data is actually positively skewed? Not so, of course - for a categorical data set the shape of the distribution is meaningless. Only if you were to decide to somehow quantify each hashtag in your data set would the problem have meaning. Do you want to compare the length of a hashtag to its frequency? Or the alphabetical ordering of a hashtag to its frequency? Etc.