Notice: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/18091694/

Time: 2020-09-13 21:04:02  Source: igfitidea

Monte Carlo Simulation with Python: building a histogram on the fly

Tags: python, numpy, pandas, histogram, montecarlo

Asked by marillion

I have a conceptual question on building a histogram on the fly with Python. I am trying to figure out if there is a good algorithm or maybe an existing package.

I wrote a function that runs a Monte Carlo simulation. It gets called 1,000,000,000 times and returns a 64-bit floating-point number at the end of each run. Here is the function:

import numpy as np

def MonteCarlo(df, head, span):
    # Pick the initial truck at random
    rnd_truck = np.random.randint(0, len(df))
    full_length = df['length'][rnd_truck]
    full_weight = df['gvw'][rnd_truck]

    # Keep adding random trucks, each preceded by a gap of `head`,
    # until the next truck no longer fits on the bridge
    while True:
        rnd_truck = np.random.randint(0, len(df))
        full_length += head + df['length'][rnd_truck]
        if full_length > span:
            break
        full_weight += df['gvw'][rnd_truck]

    # Return the average weight per foot on the bridge
    return full_weight / span

df is a pandas DataFrame with columns labeled 'length' and 'gvw', which hold the truck lengths and weights, respectively. head is the distance between two consecutive trucks, and span is the bridge length. The function randomly places trucks on the bridge as long as the total length of the truck train does not exceed the bridge length. Finally, it calculates the average truck weight per foot of bridge (the total weight on the bridge divided by the bridge length).

As a result I would like to build a tabular histogram showing the distribution of the returned values, which can be plotted later. I had some ideas in mind:

  1. Keep collecting the returned values in a numpy vector, then use existing histogram functions once the Monte Carlo analysis is completed. This would not be feasible: if my calculation is correct, I would need about 7.5 GB of memory for that vector alone (1,000,000,000 64-bit floats ~ 7.5 GB).

  2. Initialize a numpy array with a given range and number of bins. Increase the number of items in the matching bin by one at the end of each run. The problem is, I do not know the range of values I would get. Setting up a histogram with a range and an appropriate bin size is an unknown. I also have to figure out how to assign values to the correct bins, but I think it is doable.

  3. Do it somehow on the fly. Modify ranges and bin sizes each time the function returns a number. This would be too tricky to write from scratch I think.
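
As a rough illustration of ideas 1 and 2, the sketch below checks the memory estimate and then accumulates a histogram chunk by chunk over fixed bin edges, so that only one chunk is ever held in memory. The helper name accumulate_histogram, the chunk size, and the edge range are assumptions made for this example, not part of the original question. It also counts values that fall outside the chosen range, which is one way to notice that the guessed range was too narrow.

```python
import numpy as np

def accumulate_histogram(sample_chunks, edges):
    """Sum per-chunk histograms over fixed edges and count out-of-range values."""
    counts = np.zeros(len(edges) - 1, dtype=np.int64)
    below = 0
    above = 0
    for chunk in sample_chunks:
        chunk = np.asarray(chunk, dtype=np.float64)
        # np.histogram ignores values outside [edges[0], edges[-1]]
        counts += np.histogram(chunk, bins=edges)[0]
        below += int(np.count_nonzero(chunk < edges[0]))
        above += int(np.count_nonzero(chunk > edges[-1]))
    return counts, below, above

# Idea 1: memory for 10**9 float64 values at 8 bytes each
gib = (10**9 * 8) / 2**30  # roughly 7.45 GiB

# Idea 2: stand-in data (uniform samples) processed in five chunks of 10,000
rng = np.random.default_rng(0)
edges = np.linspace(0.0, 1.0, 11)  # 10 equal-width bins on [0, 1]
chunks = (rng.random(10_000) for _ in range(5))
counts, below, above = accumulate_histogram(chunks, edges)
```

In a real run, each chunk would be a batch of MonteCarlo return values, and non-zero below or above counts would signal that the range should be widened and the run repeated.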

Well, I bet there may be a better way to handle this problem. Any ideas would be welcome!

On a second note, I tested running the above function 1,000,000,000 times just to get the largest computed value (the code snippet is below). This takes around an hour when span = 200. The computation time would increase for longer spans, because the while loop runs longer to fill the bridge with trucks. Do you think there is a way to optimize this?

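max_w = 0
i = 0
while i < 1000000000:
    # Call the simulation once per iteration; calling it a second time
    # would store a different random value than the one that was compared
    w = MonteCarlo(df_basic, 15., 200.)
    if max_w < w:
        max_w = w
    i += 1
print(max_w)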
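
One possible direction for the optimization question, offered as a sketch rather than a measured result: element-by-element indexing into a pandas Series usually carries more overhead than indexing a plain NumPy array, so the two columns can be extracted once and passed in as arrays. The names monte_carlo_arrays and running_max are hypothetical, and the tiny arrays below are stand-in data chosen so that the outcome is deterministic.

```python
import numpy as np

def monte_carlo_arrays(lengths, weights, head, span, rng):
    """Same placement rule as MonteCarlo, but on plain NumPy arrays."""
    n = len(lengths)
    k = rng.integers(n)  # random index in [0, n)
    full_length = lengths[k]
    full_weight = weights[k]
    while True:
        k = rng.integers(n)
        full_length += head + lengths[k]
        if full_length > span:
            break
        full_weight += weights[k]
    return full_weight / span

def running_max(lengths, weights, head, span, n_runs, seed=0):
    """Track the largest simulated value, calling the simulation once per run."""
    rng = np.random.default_rng(seed)
    max_w = 0.0
    for _ in range(n_runs):
        w = monte_carlo_arrays(lengths, weights, head, span, rng)
        if w > max_w:
            max_w = w
    return max_w

# Stand-in data: every truck is 50 ft long and weighs 10 units.
# With a real DataFrame the arrays could come from df['length'].to_numpy()
# and df['gvw'].to_numpy().
lengths = np.full(4, 50.0)
weights = np.full(4, 10.0)
result = running_max(lengths, weights, 15.0, 200.0, n_runs=100)
```

With identical trucks, three of them fit on a 200 ft span with 15 ft gaps (50 + 65 + 65 = 180), so every run returns 30 / 200 = 0.15. How much time this saves on real data would need to be measured.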

Thanks!

Accepted answer by marillion

Here is a possible solution with a fixed bin size and half-open bins of the form [k * size, (k + 1) * size). The function finalizebins returns two lists: one with the bin counts (a), and the other (b) with the bin lower bounds (each upper bound is the lower bound plus binsize).

import math, random

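def updatebins(bins, binsize, x):
    # Index k of the bin [k * binsize, (k + 1) * binsize) that contains x;
    # int() keeps the key an integer even where math.floor returns a float
    i = int(math.floor(x / binsize))
    bins[i] = bins.get(i, 0) + 1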

def finalizebins(bins, binsize):
    imin = min(bins.keys())
    imax = max(bins.keys())
    a = [0] * (imax - imin + 1)
    b = [binsize * k for k in range(imin, imax + 1)]
    for i in range(imin, imax + 1):
        if i in bins:
            a[i - imin] = bins[i]
    return a, b

# A test with a mixture of gaussian distributions

def check(n):
    bins = {}
    binsize = 5.0
    for i in range(n):
        if random.random() > 0.5:
            x = random.gauss(100, 50)
        else:
            x = random.gauss(-200, 150)
        updatebins(bins, binsize, x)
    return finalizebins(bins, binsize)

a, b = check(10000)

# This must be 10000
sum(a)

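# Plot the data; b holds the bin lower bounds, so align bars on their left edge
import matplotlib.pyplot as plt
plt.bar(b, a, width=5.0, align='edge')  # width matches binsize in check()
plt.show()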
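
For a very large number of samples, the same fixed-bin-size idea could be applied to whole batches of values at once instead of one value per call. The following is a minimal sketch under that assumption; update_bins_chunk is a hypothetical helper, not part of the answer above. It uses the same floor(x / binsize) rule and stores counts in a collections.Counter, which is a dict subclass and can therefore be passed to finalizebins unchanged.

```python
from collections import Counter

import numpy as np

def update_bins_chunk(bins, binsize, values):
    """Add a batch of values to bins, keyed by k for [k * binsize, (k + 1) * binsize)."""
    scaled = np.asarray(values, dtype=np.float64) / binsize
    idx = np.floor(scaled).astype(np.int64)
    keys, counts = np.unique(idx, return_counts=True)
    for k, c in zip(keys.tolist(), counts.tolist()):
        bins[k] += c  # Counter treats missing keys as 0

bins = Counter()
binsize = 5.0
# 12.3 and 14.9 fall in bin 2, -0.1 in bin -1, 0.0 in bin 0, 5.0 in bin 1
update_bins_chunk(bins, binsize, [12.3, 14.9, -0.1, 0.0, 5.0])
update_bins_chunk(bins, binsize, [4.999, 10.0])
```

Negative values map to negative bin indices, so no range has to be chosen in advance; only the bin size is fixed.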
