C++ 高效计算运行中位数

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/10930732/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-27 14:37:43  来源:igfitidea点击:

C++ Efficiently Calculating a Running Median

c++algorithmmedian

提问by Edge

Those of you that have read my previous questions know about my work at understanding and implementing quicksort and quickselect, as well as some other basic algorithms.

那些阅读过我之前问题的人都知道我在理解和实现快速排序和快速选择以及其他一些基本算法方面的工作。

Quickselect is used to calculate the kth smallest element in an unsorted list, and this concept can also be used to find the median in an unsorted list.

Quickselect 用于计算未排序列表中的第 k 个最小元素,此概念也可用于查找未排序列表中的中值。

This time, I need aid in devising an efficient technique to calculate the running median, because quickselect isn't a good choice as it needs to re-calculate every time the list changes. Because quickselect has to restart everytime, it can't take advantage of previous calculations done, so I'm looking for a different algorithm that's similar (possibly) but is more efficient in the area of running medians.

这一次,我需要帮助设计一种有效的技术来计算运行中位数,因为 quickselect 不是一个好的选择,因为它需要在每次列表更改时重新计算。因为 quickselect 每次都必须重新启动,所以它不能利用以前完成的计算,所以我正在寻找一种不同的算法,它相似(可能)但在运行中位数方面更有效。

回答by user448810

The streaming medianis computed using two heaps. All the numbers less than or equal to the current median are in the left heap, which is arranged so that the maximum number is at the root of the heap. All the numbers greater than or equal to the current median are in the right heap, which is arranged so that the minimum number is at the root of the heap. Note that numbers equal to the current median can be in either heap. The count of numbers in the two heaps never differs by more than 1.

流中位数是用两个堆计算。小于或等于当前中位数的所有数都在左堆中,其排列方式使得最大数位于堆的根部。所有大于或等于当前中位数的数字都在右堆中,它的排列使得最小的数字在堆的根部。请注意,等于当前中位数的数字可以位于任一堆中。两个堆中的数字计数的差异永远不会超过 1。

When the process begins the two heaps are initially empty. The first number in the input sequence is added to one of the heaps, it doesn't matter which, and returned as the first streaming median. The second number in the input sequence is then added to the other heap, if the root of the right heap is less than the root of the left heap the two heaps are swapped, and the average of the two numbers is returned as the second streaming median.

当进程开始时,这两个堆最初是空的。输入序列中的第一个数字被添加到其中一个堆中,哪个无关紧要,并作为第一个流中位数返回。然后将输入序列中的第二个数字添加到另一个堆中,如果右堆的根小于左堆的根,则交换两个堆,并将这两个数字的平均值作为第二个流返回中位数。

Then the main algorithm begins. Each subsequent number in the input sequence is compared to the current median, and added to the left heap if it is less than the current median or to the right heap if it is greater than the current median; if the input number is equal to the current median, it is added to whichever heap has the smaller count, or to either heap arbitrarily if they have the same count. If that causes the counts of the two heaps to differ by more than 1, the root of the larger heap is removed and inserted in the smaller heap. Then the current median is computed as the root of the larger heap, if they differ in count, or the average of the roots of the two heaps, if they are the same size.

然后主要算法开始。输入序列中的每个后续数字都与当前中位数进行比较,如果小于当前中位数,则添加到左堆,如果大于当前中位数,则添加到右堆;如果输入数字等于当前中位数,则将其添加到具有较小计数的堆中,或者如果它们具有相同的计数,则将其添加到任意堆中。如果这导致两个堆的计数相差超过 1,则较大堆的根将被删除并插入较小的堆中。然后将当前中值计算为较大堆的根(如果它们的计数不同)或两个堆的根的平均值(如果它们的大小相同)。

Code in Scheme and Python is available at my blog.

Scheme 和 Python 中的代码可在我的博客中找到

回答by Jeff McClintock

The Jeff McClintock running median estimate. Requires keeping only two values. This example iterates over an array of sampled values (CPU consumption). Seems to converge relatively quickly (about 100 samples) to an estimate of the median. The idea is at each iteration the median inches toward the input signal at a constant rate. The rate depends on what magnitude you estimate the median to be. I use the average as an estimate of the magnitude of the median, to determines the size of each increment of the median. If you need your median accurate to about 1%, use a step-size of 0.01 * the average.

杰夫麦克林托克运行中值估计。只需要保留两个值。此示例迭代一组采样值(CPU 消耗)。似乎相对较快地(大约 100 个样本)收敛到中值的估计值。这个想法是在每次迭代中以恒定速率向输入信号移动中位数英寸。该比率取决于您估计的中位数是多少。我使用平均值作为对中位数大小的估计,以确定中位数的每个增量的大小。如果您需要将中位数精确到 1% 左右,请使用 0.01 * 平均值的步长。

float median = 0.0f;
float average = 0.0f;

// for each sample
{
    average += ( sample - average ) * 0.1f; // rough running average.
    median += _copysign( average * 0.01, sample - median );
}

enter image description here

在此处输入图片说明

回答by Fred Foo

One solution would be to maintain an order statistic tree, inserting each element of the sequence in turn, then compute the median of the elements in the tree.

一种解决方案是维护一个顺序统计树依次插入序列的每个元素,然后计算树中元素的中值。

This would take O(lg n) time per insertion and O(lg n) time per median, for a total of O(nlg n) time, plus O(n) space.

每次插入需要 O(lg n) 时间,每个中位数需要 O(lg n) 时间,总共 O( nlg n) 时间,加上 O( n) 空间。

回答by Mike Gashler

Here is a C++ balanced tree structure that provides the ability to query by index in the sorted list. Since it maintains all values in sorted order, this is not be quite as efficient as the two-heaps approach, but it offers some additional flexibility. For example, it could also give you a running quartile.

这是一个 C++ 平衡树结构,它提供了按排序列表中的索引进行查询的能力。由于它以排序的顺序维护所有值,因此它不如两堆方法那么有效,但它提供了一些额外的灵活性。例如,它还可以为您提供运行四分位数。

template <typename T>
class Node
{
public:
    T key;
    Node* left;
    Node* right;
    size_t size;

    Node(T k) : key(k)
    {
        isolate();
    }

    ~Node()
    {
        delete(left);
        delete(right);
    }

    void isolate()
    {
        left = NULL;
        right = NULL;
        size = 1;
    }

    void recount()
    {
        size = 1 + (left ? left->size : 0) + (right ? right->size : 0);
    }

    Node<T>* rotateLeft()
    {
        Node<T>* c = right;
        Node<T>* gc = right->left;
        right = gc;
        c->left = this;
        recount();
        c->recount();
        return c;
    }

    Node<T>* rotateRight()
    {
        Node<T>* c = left;
        Node<T>* gc = left->right;
        left = gc;
        c->right = this;
        recount();
        c->recount();
        return c;
    }

    Node<T>* balance()
    {
        size_t lcount = left ? left->size : 0;
        size_t rcount = right ? right->size : 0;
        if((lcount + 1) * 2 < (rcount + 1))
        {
            size_t lcount2 = right->left ? right->left->size : 0;
            size_t rcount2 = right->right ? right->right->size : 0;
            if(lcount2 > rcount2)
                right = right->rotateRight();
            return rotateLeft();
        }
        else if((rcount + 1) * 2 <= (lcount + 1))
        {
            size_t lcount2 = left->left ? left->left->size : 0;
            size_t rcount2 = left->right ? left->right->size : 0;
            if(lcount2 < rcount2)
                left = left->rotateLeft();
            return rotateRight();
        }
        else
        {
            recount();
            return this;
        }
    }

    Node<T>* insert(Node<T>* newNode)
    {
        if(newNode->key < key)
        {
            if(left)
                left = left->insert(newNode);
            else
                left = newNode;
        }
        else
        {
            if(right)
                right = right->insert(newNode);
            else
                right = newNode;
        }
        return balance();
    }

    Node<T>* get(size_t index)
    {
        size_t lcount = left ? left->size : 0;
        if(index < lcount)
            return left->get(index);
        else if(index > lcount)
            return right ? right->get(index - lcount - 1) : NULL;
        else
            return this;
    }

    Node<T>* find(T k, size_t start, size_t* outIndex)
    {
        if(k < key)
            return left ? left->find(k, start, outIndex) : NULL;
        else if(k > key)
            return right ? right->find(k, left ? start + left->size + 1 : start + 1, outIndex) : NULL;
        else
        {
            if(outIndex)
                *outIndex = start + (left ? left->size : 0);
            return this;
        }
    }

    Node<T>* remove_by_index(size_t index, Node<T>** outNode)
    {
        size_t lcount = left ? left->size : 0;
        if(index < lcount)
            left = left->remove_by_index(index, outNode);
        else if(index > lcount)
            right = right->remove_by_index(index - lcount - 1, outNode);
        else
        {
            *outNode = this;
            size_t rcount = right ? right->size : 0;
            if(lcount < rcount)
                return left ? right->insert(left) : right;
            else
                return right ? left->insert(right) : left;
        }
        return balance();
    }

    Node<T>* remove_by_value(T k, Node<T>** outNode)
    {
        if(k < key)
        {
            if(!left)
                throw "not found";
            left = left->remove_by_value(k, outNode);
        }
        else if(k > key)
        {
            if(!right)
                throw "not found";
            right = right->remove_by_value(k, outNode);
        }
        else
        {
            *outNode = this;
            size_t lcount = left ? left->size : 0;
            size_t rcount = right ? right->size : 0;
            if(lcount < rcount)
                return left ? right->insert(left) : right;
            else
                return right ? left->insert(right) : left;
        }
        return balance();
    }
};

template <typename T>
class MyReasonablyEfficientRunningSortedIndexedCollection
{
private:
    Node<T>* root;
    Node<T>* spare;

public:
    MyReasonablyEfficientRunningSortedIndexedCollection() : root(NULL), spare(NULL)
    {
    }

    ~MyReasonablyEfficientRunningSortedIndexedCollection()
    {
        delete(root);
        delete(spare);
    }

    void insert(T key)
    {
        if(spare)
            spare->key = key;
        else
            spare = new Node<T>(key);
        if(root)
            root = root->insert(spare);
        else
            root = spare;
        spare = NULL;
    }

    void drop_by_index(size_t index)
    {
        if(!root || index >= root->size)
            throw "out of range";
        delete(spare);
        root = root->remove_by_index(index, &spare);
        spare->isolate();
    }

    void drop_by_value(T key)
    {
        if(!root)
            throw "out of range";
        delete(spare);
        root = root->remove_by_value(key, &spare);
        spare->isolate();
    }

    T get(size_t index)
    {
        if(!root || index >= root->size)
            throw "out of range";
        return root->get(index)->key;
    }

    size_t find(T key)
    {
        size_t outIndex;
        Node<T>* node = root ? root->find(key, 0, &outIndex) : NULL;
        if(node)
            return outIndex;
        else
            throw "not found";
    }

    size_t size()
    {
        return root ? root->size : 0;
    }
};

回答by Shimon Doodkin

Algorithm of rolling window median:

滚动窗口中位数算法:

median is a sorted array where you take from it the middle value.

中位数是一个排序数组,您可以从中获取中间值。

simple rolling implementation is with a queue(dqueue) and a sorted_array (any implementation, binary tree, skiparray).

简单的滚动实现是使用 queue(dqueue) 和 sorted_array(任何实现,二叉树,skiparray)。

d_queue is an array where you can push to tail and shift (pop) from the front of the array.

d_queue 是一个数组,您可以在其中推送到尾部并从数组的前面移动(弹出)。

sorted_array is an array where you insert by order at position found using binary search.

sorted_array 是一个数组,您可以在其中按顺序在使用二分搜索找到的位置插入。

I used a queue (first-in-first-out array) to track order of added values to know which items to remove from the median array, when they the queue is longer than the wanted size. to fall off elements by date time or some running index, it is possible to add another queue and check the first element is too old, and decide if to remove first value from both queues.

我使用了一个队列(先进先出数组)来跟踪添加值的顺序,以了解当队列长于所需大小时要从中值数组中删除哪些项目。要按日期时间或某个运行索引减少元素,可以添加另一个队列并检查第一个元素是否太旧,并决定是否从两个队列中删除第一个值。

To calculate a median efficiently I use a sorted array technique. it is when you insert new items into its sorted place, so the array is always sorted.

为了有效地计算中位数,我使用了排序数组技术。当您将新项目插入其已排序的位置时,因此数组始终是排序的。

  1. The insert:

    • Insert at ordered place in the sorted_array,
    • and push a value into a queue.
  2. The remove:

    • If d_queue first element is off the window, or if in an another queue you can have with indexes, the index is too old, then:
      • remove the first item from d_queue(s),
      • and binary search for it in the sorted array and remove it.
  3. To have the median:

    • Use the value(s) in the middle of the sorted_array.
    • If sorted_array length is even use the item in the middle.
    • If sorted_array length is odd use an average of two items in the middle.
  1. 插入:

    • 在 sorted_array 的有序位置插入,
    • 并将值推送到队列中。
  2. 移除:

    • 如果 d_queue 第一个元素不在窗口中,或者如果在另一个队列中可以使用索引,则索引太旧,则:
      • 从 d_queue(s) 中删除第一项,
      • 并在已排序的数组中二分搜索并将其删除。
  3. 要获得中位数:

    • 使用 sorted_array 中间的值。
    • 如果 sorted_array 长度甚至使用中间的项目。
    • 如果 sorted_array 长度是奇数,则使用中间两个项目的平均值。

回答by Abdel Hechavarria Diaz

#include<cstdio>
#include<iostream>
#include<queue>
#include <vector>         
#include <functional>

typedef priority_queue<unsigned int> type_H_low;
typedef priority_queue<unsigned int, std::vector<unsigned int>, std::greater<unsigned int> > type_H_high;

size_t signum(int left, int right) {
    if (left == right){
        return 0;
    }
    return (left < right)?-1:1;
}

void get_median( unsigned int x_i, unsigned int &m, type_H_low *l, type_H_high *r) {

    switch (signum( l->size(), r->size() )) {
    case 1: // There are more elements in left (max) heap
        if (x_i < m) {
            r->push(l->top());
            l->pop();
            l->push(x_i);
        } else {
            r->push(x_i);
        }
        break;

    case 0: // The left and right heaps contain same number of elements
        if (x_i < m){
            l->push(x_i);
        } else {
            r->push(x_i);
        }
        break;

    case -1: // There are more elements in right (min) heap
        if (x_i < m){
            l->push(x_i);
        } else {
            l->push(r->top());
            r->pop();
            r->push(x_i);
        }
        break;
    }

    if (l->size() == r->size()){
        m = l->top();
    } else if (l->size() > r->size()){
        m = l->top();
    } else {
        m = r->top();
    }

    return;
}

void print_median(vector<unsigned int> v) {
    unsigned int median = 0;
    long int sum = 0;
    type_H_low  H_low;
    type_H_high H_high;

    for (vector<unsigned int>::iterator x_i = v.begin(); x_i != v.end(); x_i++) {
        get_median(*x_i, median, &H_low, &H_high);
        std::cout << median << std::endl;
    }
}