Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/24575869/

Date: 2020-08-19 04:48:27  Source: igfitidea

Read file and plot CDF in Python

Tags: python, numpy, matplotlib, scipy, cdf

Asked by Phani.lav

I need to read a long file of timestamps (in seconds) and plot their CDF using numpy or scipy. I tried with numpy, but the output is not what it is supposed to be. The code is below; any suggestions are appreciated.

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('Filename.txt')
sorted_data = np.sort(data)
cumulative = np.cumsum(sorted_data)

plt.plot(cumulative)
plt.show()

Accepted answer by tmdavison

You have two options:

1: you can bin the data first. This can be done easily with the numpy.histogram function:

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('Filename.txt')

# Choose how many bins you want here
num_bins = 20

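# Use the histogram function to bin the data
# (the old normed= argument was removed from NumPy; density= replaces it)
counts, bin_edges = np.histogram(data, bins=num_bins, density=True)

# Now find the cdf: density times bin width is the probability mass per bin
cdf = np.cumsum(counts * np.diff(bin_edges))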

# And finally plot the cdf
plt.plot(bin_edges[1:], cdf)

plt.show()
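
As a sanity check on option 1, here is a minimal sketch that needs no input file. It assumes a synthetic exponential sample in place of Filename.txt (the generator, seed, and scale are illustrative, not from the question) and normalizes raw bin counts by the sample size, so the last CDF value should be 1.

```python
import numpy as np

# Hypothetical stand-in for np.loadtxt('Filename.txt'): synthetic timestamps
rng = np.random.default_rng(0)
data = rng.exponential(scale=5.0, size=1000)

# Raw counts per bin (no density normalization)
counts, bin_edges = np.histogram(data, bins=20)

# Fraction of samples at or below each bin's right edge
cdf = np.cumsum(counts) / counts.sum()

print(cdf[-1])
```

Plotting bin_edges[1:] against cdf then gives the binned CDF curve.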

2: rather than using numpy.cumsum, just plot the sorted_data array against the number of items smaller than each element in the array (see this answer for more details: https://stackoverflow.com/a/11692365/588071):

import numpy as np

import matplotlib.pyplot as plt

data = np.loadtxt('Filename.txt')

sorted_data = np.sort(data)

yvals=np.arange(len(sorted_data))/float(len(sorted_data)-1)

plt.plot(sorted_data,yvals)

plt.show()
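
To see why the cumulative sum in the question does not give a CDF, the sketch below compares it with the option 2 values on a tiny made-up sample (the four numbers are an assumption for illustration). np.cumsum adds up the data values themselves, whereas a CDF needs a fraction between 0 and 1 for each sorted value.

```python
import numpy as np

data = np.array([4.0, 1.0, 3.0, 2.0])  # made-up sample
sorted_data = np.sort(data)

# Running sums of the values themselves: not probabilities
running_total = np.cumsum(sorted_data)

# Fraction of items smaller than each element, from 0 to 1
yvals = np.arange(len(sorted_data)) / float(len(sorted_data) - 1)

print(running_total)
print(yvals)
```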

Answer by nayyarv

As a quick answer,

plt.plot(sorted_data, np.linspace(0, 1, sorted_data.size))

should have gotten you what you wanted.
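
This one-liner gives the same y-values as the second option in the accepted answer, since np.linspace(0, 1, n) places n evenly spaced points from 0 to 1. A small sketch that checks this on a made-up random sample (the seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for illustration only
sorted_data = np.sort(rng.normal(size=50))

y_linspace = np.linspace(0, 1, sorted_data.size)
y_arange = np.arange(sorted_data.size) / float(sorted_data.size - 1)

# True when both expressions produce the same y-values
same = np.allclose(y_linspace, y_arange)
print(same)
```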

Answer by Amedeo

For completeness, you should also consider:

  • duplicates: the same point can occur more than once in your data.
  • points can be unevenly spaced.
  • points can be floats.

You can use numpy.histogram, setting the bin edges so that each bin collects all the occurrences of exactly one point. You should keep density=False, because according to the documentation:

Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen

Instead, you can normalize the number of elements in each bin by dividing it by the size of your data.

import numpy as np
import matplotlib.pyplot as plt

def cdf(data):

    data_size=len(data)

    # Set bins edges
    data_set=sorted(set(data))
    bins=np.append(data_set, data_set[-1]+1)

    # Use the histogram function to bin the data
    counts, bin_edges = np.histogram(data, bins=bins, density=False)

    counts=counts.astype(float)/data_size

    # Find the cdf
    cdf = np.cumsum(counts)

    # Plot the cdf
    plt.plot(bin_edges[0:-1], cdf,linestyle='--', marker="o", color='b')
    plt.ylim((0,1))
    plt.ylabel("CDF")
    plt.grid(True)

    plt.show()


As an example, with the following data:

#[ 0.   0.   0.1  0.1  0.2  0.2  0.3  0.3  0.4  0.4  0.6  0.8  1.   1.2]
data = np.concatenate((np.arange(0,0.5,0.1),np.arange(0.6,1.4,0.2),np.arange(0,0.5,0.1)))
cdf(data)

you would get:

[Figure: CDF]
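
The same per-value counts can also be obtained without building bin edges. The sketch below swaps numpy.histogram for np.unique with return_counts=True, which is a different technique from the one in this answer; the helper name unique_cdf and the sample values are made up for illustration.

```python
import numpy as np

def unique_cdf(data):
    """Return the distinct values and the fraction of samples <= each one."""
    # np.unique returns the distinct values in sorted order
    values, counts = np.unique(data, return_counts=True)
    return values, np.cumsum(counts) / len(data)

values, cdf_values = unique_cdf([0.0, 0.0, 1.0, 1.0, 1.0, 3.0])
print(values)
print(cdf_values)
```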



You can also interpolate the cdf in order to get a continuous function (with either a linear interpolation or a cubic spline):

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d

def cdf(data):

    data_size=len(data)

    # Set bins edges
    data_set=sorted(set(data))
    bins=np.append(data_set, data_set[-1]+1)

    # Use the histogram function to bin the data
    counts, bin_edges = np.histogram(data, bins=bins, density=False)

    counts=counts.astype(float)/data_size

    # Find the cdf
    cdf = np.cumsum(counts)

    x = bin_edges[0:-1]
    y = cdf

    f = interp1d(x, y)
    f2 = interp1d(x, y, kind='cubic')

    # Stay inside the data range: interp1d raises an error outside it by default
    xnew = np.linspace(min(x), max(x), num=1000, endpoint=True)

    # Plot the cdf
    plt.plot(x, y, 'o', xnew, f(xnew), '-', xnew, f2(xnew), '--')
    plt.legend(['data', 'linear', 'cubic'], loc='best')
    plt.title("Interpolation")
    plt.ylim((0,1))
    plt.ylabel("CDF")
    plt.grid(True)

    plt.show()

[Figure: Interpolation]
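
If only the linear interpolant is needed, NumPy's np.interp can evaluate it at chosen points without SciPy. A minimal sketch with made-up CDF points (the x and y arrays are assumptions, not output of the code above):

```python
import numpy as np

# Made-up CDF points: distinct sorted values and their cumulative fractions
x = np.array([0.0, 1.0, 3.0])
y = np.array([1/3, 5/6, 1.0])

# Linear interpolation between the known points
query = np.array([0.5, 2.0, 3.0])
interpolated = np.interp(query, x, y)
print(interpolated)
```

Linear interpolation between non-decreasing points stays non-decreasing, whereas a cubic spline can overshoot between points, so the linear form may be the safer choice when the result must remain a valid CDF.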

Answer by Allen Downey

Here's an implementation that's a bit more efficient if there are many repeated values (since we only have to sort the unique values). And it plots the CDF as a step function, which it is, strictly speaking.

import sys

import numpy as np
import matplotlib.pyplot as plt

from collections import Counter


def read_data(fp):
    t = []
    for line in fp:
        x = float(line.rstrip())
        t.append(x)
    return t


def main(script, filename=None):
    if filename is None:
        fp = sys.stdin
    else:
        fp = open(filename)

    t = read_data(fp)
    counter = Counter(t)

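    # Sort the unique values (dict views have no .sort() in Python 3)
    xs = sorted(counter)

    # Look up each count in the same sorted order so xs and ys stay aligned
    ys = np.cumsum([counter[x] for x in xs]).astype(float)
    ys /= ys[-1]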

    options = dict(linewidth=3, alpha=0.5)
    plt.step(xs, ys, where='post', **options)
    plt.xlabel('Values')
    plt.ylabel('CDF')
    plt.show()


if __name__ == '__main__':
    main(*sys.argv)
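
Because the empirical CDF is a step function, it can be evaluated at any point by counting how many samples are less than or equal to it. A minimal sketch using np.searchsorted; the helper name ecdf_at and the sample are hypothetical and not part of this answer:

```python
import numpy as np

def ecdf_at(sample, points):
    """Fraction of sample values <= each query point."""
    sorted_sample = np.sort(sample)
    # side='right' counts the elements that are <= the query point
    ranks = np.searchsorted(sorted_sample, points, side='right')
    return ranks / len(sorted_sample)

result = ecdf_at([3.0, 1.0, 2.0, 2.0], [0.0, 2.0, 2.5, 5.0])
print(result)
```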

Answer by vergil_chiou

My implementation has the following steps:

1. Sort your data.

2. Calculate the cumulative probability of every 'x'.

import numpy as np
import matplotlib.pyplot as plt

def cdf(data):
    n = len(data)
    x = np.sort(data) # sort your data
    y = np.arange(1, n + 1) / n # calculate cumulative probability
    return x, y

x_data, y_data = cdf(your_data)
plt.plot(x_data, y_data) 

Example:

test_data = np.random.normal(size= 100)
x_data, y_data = cdf(test_data)
plt.plot(x_data, y_data, marker= '.', linestyle= 'none')

Figure: link to the graph
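
One way to sanity-check an empirical CDF like this is to compare it with the known CDF of the distribution the sample came from. The sketch below assumes a standard normal sample and computes the theoretical CDF with math.erf; the seed, sample size, and variable names are illustrative.

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
sample = rng.normal(size=2000)

n = len(sample)
x = np.sort(sample)
y = np.arange(1, n + 1) / n  # empirical cumulative probability

# Standard normal CDF: 0.5 * (1 + erf(x / sqrt(2)))
theory = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x])

# Largest vertical distance between the empirical and theoretical curves
max_gap = np.max(np.abs(y - theory))
print(max_gap)
```

For a sample that really is standard normal, the gap should shrink as the sample size grows.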

Answer by svmldon

If you can use the seaborn library, proceed as follows:

import seaborn as sns
import numpy as np
import pandas as pd  # needed for pd.read_csv below
import matplotlib.pyplot as plt
data = pd.read_csv('Filename.txt', sep=" ", header=None)
plt.figure()
sns.kdeplot(data,cumulative=True)
plt.show()