Original URL: http://stackoverflow.com/questions/1102692/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must attribute it to the original authors (not me): StackOverflow
How to alpha blend RGBA unsigned byte color fast?
Asked by user25749
I am using C++, and I want to do alpha blending using the following code.
#define CLAMPTOBYTE(color) \
    if ((color) & (~255)) { \
        color = (BYTE)((-(color)) >> 31); \
    } else { \
        color = (BYTE)(color); \
    }
#define GET_BYTE(accessPixel, x, y, scanline, bpp) \
    ((BYTE*)((accessPixel) + (y) * (scanline) + (x) * (bpp)))

for (int y = top; y < bottom; ++y)
{
    BYTE* resultByte = GET_BYTE(resultBits, left, y, stride, bytepp);
    BYTE* srcByte = GET_BYTE(srcBits, left, y, stride, bytepp);
    BYTE* srcByteTop = GET_BYTE(srcBitsTop, left, y, stride, bytepp);
    BYTE* maskCurrent = GET_GREY(maskSrc, left, y, width);
    int alpha = 0;
    int red = 0;
    int green = 0;
    int blue = 0;
    for (int x = left; x < right; ++x)
    {
        alpha = *maskCurrent;
        red = (srcByteTop[R] * alpha + srcByte[R] * (255 - alpha)) / 255;
        green = (srcByteTop[G] * alpha + srcByte[G] * (255 - alpha)) / 255;
        blue = (srcByteTop[B] * alpha + srcByte[B] * (255 - alpha)) / 255;
        CLAMPTOBYTE(red);
        CLAMPTOBYTE(green);
        CLAMPTOBYTE(blue);
        resultByte[R] = red;
        resultByte[G] = green;
        resultByte[B] = blue;

        srcByte += bytepp;
        srcByteTop += bytepp;
        resultByte += bytepp;
        ++maskCurrent;
    }
}
However, I find it is still slow: it takes about 40-60 ms to compose two 600 × 600 images. Is there any way to bring that below 16 ms?
Can anybody help me speed this code up? Many thanks!
Answered by Tom Leys
Use SSE - start around page 131.
The basic workflow:
1. Load 4 pixels from src (16 one-byte numbers), RGBA RGBA RGBA RGBA (streaming load).
2. Load 4 more which you want to blend with, srcbytetop: RGBx RGBx RGBx RGBx.
3. Do some swizzling so that the A term in 1 fills every slot, i.e. xxxA xxxB xxxC xxxD -> AAAA BBBB CCCC DDDD. (In my solution below I opted instead to re-use your existing "maskcurrent" array, but having alpha integrated into the "A" field of 1 would require fewer loads from memory and thus be faster. Swizzling in this case would probably be: AND with a mask to select A, B, C, D; shift right 8; OR with the original; shift right 16; OR again.)
4. Add the above to a vector that is all -255 in every slot.
5. Multiply 1 * 4 (source with 255-alpha) and 2 * 3 (result with alpha). You should be able to use the "multiply and discard bottom 8 bits" SSE2 instruction for this.
6. Add those two (4 and 5) together.
7. Store those somewhere else (if possible) or on top of your destination (if you must).
Here is a starting point for you:
// Define your image with __declspec(align(16)), i.e. char __declspec(align(16)) image[640*480],
// so the first byte is aligned correctly for SIMD.
// Stride must be a multiple of 16.

for (int y = top; y < bottom; ++y)
{
    BYTE* resultByte = GET_BYTE(resultBits, left, y, stride, bytepp);
    BYTE* srcByte = GET_BYTE(srcBits, left, y, stride, bytepp);
    BYTE* srcByteTop = GET_BYTE(srcBitsTop, left, y, stride, bytepp);
    BYTE* maskCurrent = GET_GREY(maskSrc, left, y, width);
    for (int x = left; x < right; x += 4)
    {
        // If you can't align, use _mm_loadu_si128()
        // Step 1
        __m128i src = _mm_load_si128(reinterpret_cast<__m128i*>(srcByte));
        // Step 2
        __m128i srcTop = _mm_load_si128(reinterpret_cast<__m128i*>(srcByteTop));
        // Step 3
        // Fill the 4 positions for the first pixel with maskCurrent[0], etc.
        // Could do better with shifts and so on, but this is clear
        __m128i mask = _mm_set_epi8(maskCurrent[0], maskCurrent[0], maskCurrent[0], maskCurrent[0],
                                    maskCurrent[1], maskCurrent[1], maskCurrent[1], maskCurrent[1],
                                    maskCurrent[2], maskCurrent[2], maskCurrent[2], maskCurrent[2],
                                    maskCurrent[3], maskCurrent[3], maskCurrent[3], maskCurrent[3]);
        // Step 4
        __m128i maskInv = _mm_subs_epu8(_mm_set1_epi8((char)255), mask);

        // TODO: multiply, with saturate - find the correct instructions for steps 4..6
        // Note you can use multiply-and-add, _mm_madd_epi16
        // __m128i result; // to be produced by steps 5..6

        // Scalar placeholder for the unfinished steps:
        alpha = *maskCurrent;
        red = (srcByteTop[R] * alpha + srcByte[R] * (255 - alpha)) / 255;
        green = (srcByteTop[G] * alpha + srcByte[G] * (255 - alpha)) / 255;
        blue = (srcByteTop[B] * alpha + srcByte[B] * (255 - alpha)) / 255;
        CLAMPTOBYTE(red);
        CLAMPTOBYTE(green);
        CLAMPTOBYTE(blue);
        resultByte[R] = red;
        resultByte[G] = green;
        resultByte[B] = blue;
        //----

        // Step 7 - store result.
        // Store aligned if output is aligned on a 16-byte boundary
        _mm_store_si128(reinterpret_cast<__m128i*>(resultByte), result);
        // Slower version if you can't guarantee alignment:
        //_mm_storeu_si128(reinterpret_cast<__m128i*>(resultByte), result);

        // Move pointers forward 4 places
        srcByte += bytepp * 4;
        srcByteTop += bytepp * 4;
        resultByte += bytepp * 4;
        maskCurrent += 4;
    }
}
To find out which AMD processors will run this code (currently it is using SSE2 instructions), see Wikipedia's List of AMD Turion microprocessors. You could also look at other lists of processors on Wikipedia, but my research shows that AMD CPUs from around 4 years ago all support at least SSE2.
You should expect a good SSE2 implementation to run around 8-16 times faster than your current code. That is because we eliminate branches in the loop, process 4 pixels (or 12 channels) at once, and improve cache performance by using streaming instructions. As an alternative to SSE, you could probably make your existing code run much faster by eliminating the if checks you are using for saturation. Beyond that, I would need to run a profiler on your workload.
Of course, the best solution is to use hardware support (i.e. code your problem up in DirectX) and have it done on the video card.
Answered by Jasper Bekkers
You can always calculate the alpha of red and blue at the same time. You can also use this trick with the SIMD implementation mentioned before.
unsigned int blendPreMulAlpha(unsigned int colora, unsigned int colorb, unsigned int alpha)
{
    unsigned int rb = (colora & 0xFF00FF) + ((alpha * (colorb & 0xFF00FF)) >> 8);
    unsigned int g  = (colora & 0x00FF00) + ((alpha * (colorb & 0x00FF00)) >> 8);
    return (rb & 0xFF00FF) + (g & 0x00FF00);
}

unsigned int blendAlpha(unsigned int colora, unsigned int colorb, unsigned int alpha)
{
    unsigned int rb1 = ((0x100 - alpha) * (colora & 0xFF00FF)) >> 8;
    unsigned int rb2 = (alpha * (colorb & 0xFF00FF)) >> 8;
    unsigned int g1  = ((0x100 - alpha) * (colora & 0x00FF00)) >> 8;
    unsigned int g2  = (alpha * (colorb & 0x00FF00)) >> 8;
    return ((rb1 | rb2) & 0xFF00FF) + ((g1 | g2) & 0x00FF00);
}
0 <= alpha <= 0x100
Answered by Guilherme Campos Hazan
For people who want to divide by 255, I found a perfect formula:
pt->r = (r+1 + (r >> 8)) >> 8; // fast way to divide by 255
Answered by Roddy
Here are some pointers.
Consider using pre-multiplied foreground images, as described by Porter and Duff. As well as potentially being faster, you avoid a lot of potential colour-fringing effects.
The compositing equation changes from

    r = kA + (1-k)B

...to...

    r = A + (1-k)B
Alternatively, you can rework the standard equation to remove one multiply:

    r = kA + (1-k)B
      == kA + B - kB
      == k(A-B) + B

I may be wrong, but I think you shouldn't need the clamping either...
Answered by Roddy
Not exactly answering the question, but...
One thing is to do it fast; the other is to do it right. Alpha compositing is a dangerous beast: it looks straightforward and intuitive, but common errors have been widespread for decades without (almost) anybody noticing!
The most famous and common mistake is NOT using premultiplied alpha. I highly recommend this: Alpha Blending for Leaves
Answered by Eric Bainville
You can use 4 bytes per pixel in both images (for memory alignment), and then use SSE instructions to process all channels together. Search "visual studio sse intrinsics".
Answered by nfries88
I can't comment because I don't have enough reputation, but I want to say that Jasper's version will not overflow for valid input. Masking the multiplication result is necessary because otherwise the red+blue multiplication would leave bits in the green channel (this would also be true if you multiplied red and blue separately; you'd still need to mask out bits in the blue channel), and the green multiplication would leave bits in the blue channel. These are bits that would be lost to the right shift if you separated the components out, as is often the case with alpha blending. So they're not overflow or underflow; they're just useless bits that need to be masked out to achieve the expected results.
That said, Jasper's version is incorrect. It should be 0xFF - alpha (255 - alpha), not 0x100 - alpha (256 - alpha). This would probably not produce a visible error. What will produce a visible error is his use of | instead of + when merging the multiplication results.
I've found an adaptation of Jasper's code to be faster than my old alpha blending code, which was already decent, and am currently using it in my software renderer project. I work with 32-bit ARGB pixels:
Pixel AlphaBlendPixels(Pixel p1, Pixel p2)
{
    static const int AMASK    = 0xFF000000;
    static const int RBMASK   = 0x00FF00FF;
    static const int GMASK    = 0x0000FF00;
    static const int AGMASK   = AMASK | GMASK;
    static const int ONEALPHA = 0x01000000;
    unsigned int a  = (p2 & AMASK) >> 24;
    unsigned int na = 255 - a;
    unsigned int rb = ((na * (p1 & RBMASK)) + (a * (p2 & RBMASK))) >> 8;
    unsigned int ag = (na * ((p1 & AGMASK) >> 8)) + (a * (ONEALPHA | ((p2 & GMASK) >> 8)));
    return ((rb & RBMASK) | (ag & AGMASK));
}
Answered by Vinnie Falco
First of all, let's use the proper formula for each color component.
You start with this:
v = ( 1-t ) * v0 + t * v1
where t = interpolation parameter [0..1], v0 = source color value, v1 = transfer color value, and v = output value.
Reshuffling the terms, we can reduce the number of operations:
v = v0 + t * (v1 - v0)
You would need to perform this calculation once per color channel (3 times for RGB).
For 8-bit unsigned color components, you need to use correct fixed-point math:
i = i0 + t * ( ( i1 - i0 ) + 127 ) / 255
where t = interpolation parameter [0..255], i0 = source color value [0..255], i1 = transfer color value [0..255], and i = output color.
If you leave out the +127, then your colors will be biased towards the darker end. Very often, people use /256 or >> 8 for speed. This is not correct! If you divide by 256, you will never be able to reach pure white (255,255,255), because 255/256 is slightly less than one.
I hope this helps.
Answered by Crashworks
Move it to the GPU.
Answered by colithium
I've done similar code in unsafe C#. Is there any reason you aren't looping through each pixel directly? Why use all the BYTE* and GET_BYTE() calls? That is probably part of the speed issue.
What does GET_GRAY look like?
More importantly, are you sure your platform doesn't expose alpha blending capabilities? What platform are you targeting? Wiki informs me that the following support it out of the box:
- Mac OS X
- Windows 2000, XP, Server 2003, Windows CE, Vista and Windows 7
- The XRender extension to the X Window System (this includes modern Linux systems)
- RISC OS Adjust
- QNX Neutrino
- Plan 9
- Inferno
- AmigaOS 4.1
- BeOS, Zeta and Haiku
- Syllable
- MorphOS