GDI+ blend functions use bitmaps whose RGB channels are premultiplied by alpha, for efficiency. However, premultiplying by alpha is very costly since you have to process each pixel one by one.
It seems this would be a good candidate for SSE assembly. Is there someone here who would be willing to share their implementation? I know this is hard work, which is why I'm asking; I'm not trying to steal your work, and you'll have my full appreciation if you can share it.
Edit: I'm not trying to do alpha blending in software. I'm trying to premultiply each color component of each pixel in an image by its alpha. I'm doing this because alpha blending is defined by the formula dst = src*src.alpha + dst*(1 - src.alpha), whereas the Win32 AlphaBlend function implements dst = src + dst*(1 - src.alpha) for optimisation reasons. To get the correct result, src must already equal src*src.alpha before calling AlphaBlend.
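To make it concrete, here is roughly what the scalar version of that premultiplication looks like in plain C++ (assuming 32-bit BGRA pixels; this is the per-pixel loop I'd like to speed up):

```cpp
#include <cstddef>
#include <cstdint>

// Premultiply each pixel's color channels by its alpha (scalar version).
// `pixels` points to 32-bit BGRA data, `count` is the number of pixels.
void PremultiplyAlpha(uint32_t* pixels, std::size_t count)
{
    uint8_t* p = reinterpret_cast<uint8_t*>(pixels);
    for (std::size_t i = 0; i < count; ++i, p += 4)
    {
        const unsigned a = p[3];                      // alpha
        p[0] = static_cast<uint8_t>(p[0] * a / 255);  // blue
        p[1] = static_cast<uint8_t>(p[1] * a / 255);  // green
        p[2] = static_cast<uint8_t>(p[2] * a / 255);  // red
        // p[3] (alpha) is left unchanged
    }
}
```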
The SSE version would take me a while to write since I know little about assembly, which is why I was asking if someone would share their implementation. SSE would be great: according to the paper, the gain over alpha blending in software is 300%.
There's a good article here. It's a bit old, but you might find something useful in the section where it uses MMX to implement alpha blending. This could easily be translated to SSE instructions to take advantage of the larger register size (128 bits); a rough sketch of such a translation is given after the links below.
MMX Enhanced Alpha Blending
Intel Application Notes here, with source code
Using MMX™ Instructions to Implement Alpha Blending
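A hedged sketch of what that SSE2 translation might look like using intrinsics rather than inline assembly (untested; it assumes 32-bit BGRA pixels and a pixel count divisible by four, so a production version would also need a scalar tail loop and alignment handling):

```cpp
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Sketch only: premultiply 32-bit BGRA pixels by their alpha, four pixels
// per iteration. The alpha channel itself is left unchanged.
void PremultiplyAlphaSSE2(uint32_t* pixels, std::size_t count)
{
    const __m128i zero      = _mm_setzero_si128();
    const __m128i rgbMask   = _mm_set1_epi32(0x00FFFFFF);
    const __m128i alphaMask = _mm_set1_epi32(static_cast<int>(0xFF000000u));
    const __m128i one16     = _mm_set1_epi16(1);

    for (std::size_t i = 0; i < count; i += 4)
    {
        __m128i px = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pixels + i));

        // Widen to 16-bit lanes: lo holds pixels 0-1, hi holds pixels 2-3.
        __m128i lo = _mm_unpacklo_epi8(px, zero);
        __m128i hi = _mm_unpackhi_epi8(px, zero);

        // Broadcast each pixel's alpha (lane 3 of its 4-lane group) to all of its lanes.
        __m128i aLo = _mm_shufflehi_epi16(
            _mm_shufflelo_epi16(lo, _MM_SHUFFLE(3, 3, 3, 3)), _MM_SHUFFLE(3, 3, 3, 3));
        __m128i aHi = _mm_shufflehi_epi16(
            _mm_shufflelo_epi16(hi, _MM_SHUFFLE(3, 3, 3, 3)), _MM_SHUFFLE(3, 3, 3, 3));

        // c * a, followed by the usual fast divide-by-255: (t + 1 + (t >> 8)) >> 8.
        __m128i tLo = _mm_mullo_epi16(lo, aLo);
        __m128i tHi = _mm_mullo_epi16(hi, aHi);
        tLo = _mm_srli_epi16(_mm_add_epi16(_mm_add_epi16(tLo, _mm_srli_epi16(tLo, 8)), one16), 8);
        tHi = _mm_srli_epi16(_mm_add_epi16(_mm_add_epi16(tHi, _mm_srli_epi16(tHi, 8)), one16), 8);

        // Narrow back to bytes and restore the original (un-multiplied) alpha.
        __m128i packed = _mm_packus_epi16(tLo, tHi);
        packed = _mm_or_si128(_mm_and_si128(packed, rgbMask),
                              _mm_and_si128(px, alphaMask));

        _mm_storeu_si128(reinterpret_cast<__m128i*>(pixels + i), packed);
    }
}
```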
You may want to have a look at the Eigen C++ template library. It lets you write high-level C++ code that compiles down to optimized assembly, with support for SSE/AltiVec.
Fast. (See benchmark).
Expression templates allow temporaries to be removed intelligently and enable lazy evaluation when that is appropriate; Eigen takes care of this automatically and handles aliasing too in most cases.
Explicit vectorization is performed for the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code. Expression templates allow these optimizations to be performed globally, for whole expressions.
With fixed-size objects, dynamic memory allocation is avoided, and the loops are unrolled when that makes sense.
For large matrices, special attention is paid to cache-friendliness.
Elegant. (See API showcase).
The API is extremely clean and expressive, thanks to expression templates. Implementing an algorithm on top of Eigen feels like just copying pseudocode. You can use complex expressions and still rely on Eigen to produce optimized code: there is no need for you to manually decompose expressions into small steps.
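As a toy illustration of that style (illustrative only; it assumes the image's channels are available as separate float arrays), premultiplying by alpha with Eigen's Array type is one expression per channel, and each one compiles to a vectorized loop:

```cpp
#include <Eigen/Dense>

// Illustrative sketch: premultiply planar color channels by alpha.
// Each *= is a single element-wise, SSE/AltiVec-vectorized loop.
void Premultiply(Eigen::ArrayXf& r, Eigen::ArrayXf& g,
                 Eigen::ArrayXf& b, const Eigen::ArrayXf& a)
{
    r *= a;
    g *= a;
    b *= a;
}
```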
Processing each pixel is not expensive with the native Win32 GDI APIs.
See MSDN
I'm building a high-performance UI layout engine on top of Direct3D 11. The application is being developed using Visual Studio 2013, targeting x64 and is intended for Windows 7 (with Platform Update) and up.
I need to do matrix transformations on 2D elements in the visual tree, and I am wondering whether using DirectXMath's built-in (SIMD-optimized) XMMATRIX and its related functions is efficient for 2D use (since that only requires a 3x3 matrix, while XMMATRIX et al. are 4x4), or whether I should roll my own matrix class / functions (probably without any SIMD-specific code, though).
It seems to me that a 4x4 matrix throughout would mean a lot of redundant calculations being performed, but then again that might be offset by SIMD instructions when compared to non-SIMD 3x3 matrix work.
Edit: Comments about how "premature optimization is the root of all evil" (and derivatives thereof) are superfluous here (and ironically premature, since you know nothing about the project - or me). The question sums up what I am interested in getting some viewpoints on and knowing more about.
Layout engines tend to have a lot of chained transformations, so using (and keeping for the duration of the chain) your data in SSE registers is likely to improve performance (even more so than typical game scenarios which usually only have a handful of chained transformations). If you are specifically not going to use SSE in your custom class, then XMMATRIX will probably be faster. The column difference shouldn't really matter much since each row fits in an SSE register, but the row difference will mean an extra load. Still, the benefit of SSE is probably worth it.
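For instance, a chained 2D element transform built with DirectXMath might look something like the sketch below (illustrative only; the Z and W components are simply carried along):

```cpp
#include <DirectXMath.h>
using namespace DirectX;

// Illustrative sketch: compose a 2D element transform from DirectXMath's
// 4x4 building blocks. The intermediates stay in XMM registers when inlined.
XMMATRIX ComposeElementTransform(float scaleX, float scaleY,
                                 float rotation,
                                 float offsetX, float offsetY)
{
    const XMMATRIX scale     = XMMatrixScaling(scaleX, scaleY, 1.0f);
    const XMMATRIX rotate    = XMMatrixRotationZ(rotation);
    const XMMATRIX translate = XMMatrixTranslation(offsetX, offsetY, 0.0f);
    return scale * rotate * translate;
}
```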
That said, many modern compilers auto-vectorize now, so a custom class you write in vanilla C++ might end up getting SSE-optimized behind the scenes anyway.
Either way, you probably won't see any difference in the performance if you haven't already optimized your engine for caching behavior. For example, if your engine represents the hierarchy using pointers, and you just allocate new elements on the heap whenever you need them, you'll thrash the cache and have plenty of time to calculate transformations while you wait for memory, SSE or not.
I wonder how OpenCV does operations on matrices. For example, when I write code for
cv::add (Mat mat1, Mat mat2, Mat &result)
using two for loops, it takes around 120-130 ms for a 1000x750 image, but using the OpenCV add function it takes 6-7 ms. Does anyone know what their trick is? I want to learn it so I can write functions that OpenCV doesn't have.
I have searched inside OpenCV and found these two .cpp files (first, second), but I don't know if I'm looking in the correct place.
I just want to know how to use this power. Could somebody help me?
Thanks,
The two cpp files you provided are for GPU operations (CUDA and OpenCL). From your question, I think you are looking for non-GPU operations, and this is the correct file.
OpenCV is famous for its speed, and that speed comes from the many optimizations they make in their code. I will just give some hints about a few of them.
1. SIMD Optimization
This is one of the major sources of optimization in OpenCV. Almost all arithmetic operations are SIMD-optimized. In your case too, SIMD optimization is the better option (and OpenCV has already done it). It improves performance several times over, depending on the level of your implementation. All modern processors come with built-in SIMD support (SSE, AVX, etc.).
It is a little more complicated than normal C++. Instead of adding only one pair of pixels from the two matrices at a time, you add some 16 pixels (it depends on the data type) simultaneously. Theoretically that provides a 16x speedup. Here is a simple example which I wrote while I was learning SIMD assembly (you can use intrinsics, which are much simpler). It is not heavily optimized (written just to learn), yet it still provides a speedup of about 20x.
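To give a feel for the idea, here is a minimal sketch (not OpenCV's actual code) of a saturating 8-bit add using SSE2 intrinsics, handling 16 pixels per instruction; it assumes the total byte count is a multiple of 16, whereas a real implementation would also handle alignment and the tail:

```cpp
#include <emmintrin.h>   // SSE2
#include <cstddef>
#include <cstdint>

// Add two 8-bit images with saturation, 16 bytes at a time.
void AddSaturate(const uint8_t* a, const uint8_t* b, uint8_t* dst, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 16)
    {
        const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), _mm_adds_epu8(va, vb));
    }
}
```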
Similarly, for the ARM platform, the code is NEON-optimized (contributed mainly by the Nvidia team for their Tegra processors). Example
2. Multi-threading via TBB
Another important one is the use of TBB. Someone has already mentioned it in their answer: you have to compile the OpenCV source with TBB enabled, and as they mentioned, that may not be an easy task. Many functions, like face detection, are TBB-optimized in OpenCV.
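To show the shape of it, here is a hedged sketch (not OpenCV's actual implementation) of how a row-parallel operation is typically written against cv::parallel_for_, which dispatches to TBB when OpenCV is built with it:

```cpp
#include <opencv2/core/core.hpp>

// Sketch: row-parallel saturating add of two single-channel 8-bit images (CV_8UC1).
class AddRowsBody : public cv::ParallelLoopBody
{
public:
    AddRowsBody(const cv::Mat& a, const cv::Mat& b, cv::Mat& dst)
        : a_(a), b_(b), dst_(dst) {}

    void operator()(const cv::Range& range) const
    {
        for (int r = range.start; r < range.end; ++r)
        {
            const uchar* pa = a_.ptr<uchar>(r);
            const uchar* pb = b_.ptr<uchar>(r);
            uchar* pd = dst_.ptr<uchar>(r);
            for (int c = 0; c < a_.cols; ++c)
                pd[c] = cv::saturate_cast<uchar>(pa[c] + pb[c]);
        }
    }

private:
    const cv::Mat& a_;
    const cv::Mat& b_;
    cv::Mat& dst_;
};

// usage: cv::parallel_for_(cv::Range(0, a.rows), AddRowsBody(a, b, dst));
```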
OpenCV uses some other techniques as well, such as loop unrolling. (Example) It provides a slight improvement; modern compilers are already very good at this.
You can read Agner Fog's optimization manuals for more details on optimizing C++ code. All of those details are relevant.
On this page they say, at the end of the document, that it is faster because core functions are multi-thread enabled via Intel Threading Building Blocks.
For an application I'm developing, I need to be able to:
draw lines of different widths and colours
draw solid color filled triangles
draw textured (no alpha) quads
Very easy...but...
All coordinates are integers in pixel space and, very importantly: reading back all the pixels from the framebuffer (glReadPixels) on two different machines, with two different graphics cards, running two different OSes (Linux and FreeBSD), must result in exactly the same sequence of bits (given an appropriate constant format conversion).
I think this is impossible to achieve reliably using OpenGL and hardware acceleration, since I bet different graphics cards (from different vendors) may implement different algorithms for rasterization. (The OpenGL specs are clear about this: they propose an algorithm, but they also state that implementations may differ under certain circumstances.)
Also, I don't really need hardware acceleration, since I will be rendering simple graphics at very low speed.
Do you think I can achieve this by just disabling hardware acceleration? What happens in that case under Linux: will I fall back to the Mesa software rasterizer? And in that case, can I be sure it will always work, or am I missing something?
That you're reading back rendered pixels and strongly depend on their mathematical exactness/reproducibility sounds like a design flaw. What is the purpose of this? If, for example, you need to extract some information from the image, why not try to extract that information from the abstract, vectorized data prior to rendering?
Anyhow, if you depend on external rendering code and there's no way to make your reading code more robust to small errors, you're signing up for lots of pain and maintenance work. Other people could break your code with every tiny patch, because that kind of pixel exactness to the bit-level is usually a non-issue when they're doing their unit tests etc. Let alone the infinite permutations of hard- and software layers that are possible, and all might have influence on the exact pixel bits.
If you only need those two operations, lines (with different widths and colors) and quads (with/without texture), I recommend writing your own rendering/rasterizer code which operates on an 8-bit uint array representing the image pixels (R8G8B8). The operations you're proposing aren't too nasty, so if performance is unimportant, this might actually be the better way to go in the long run.
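As a starting point, such a software rasterizer can be very small. Here is a minimal sketch (names are illustrative) of a deterministic Bresenham line drawn into an interleaved R8G8B8 buffer; triangles, textured quads, line widths and clipping would be layered on top:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Illustrative image buffer: interleaved R8G8B8, row-major.
struct Image
{
    int width, height;
    std::vector<uint8_t> rgb;   // width * height * 3 bytes

    void setPixel(int x, int y, uint8_t r, uint8_t g, uint8_t b)
    {
        if (x < 0 || y < 0 || x >= width || y >= height) return;
        uint8_t* p = &rgb[(static_cast<std::size_t>(y) * width + x) * 3];
        p[0] = r; p[1] = g; p[2] = b;
    }
};

// Classic integer Bresenham line: fully deterministic across platforms.
void drawLine(Image& img, int x0, int y0, int x1, int y1,
              uint8_t r, uint8_t g, uint8_t b)
{
    const int dx = std::abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    const int dy = -std::abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;

    for (;;)
    {
        img.setPixel(x0, y0, r, g, b);
        if (x0 == x1 && y0 == y1) break;
        const int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }
        if (e2 <= dx) { err += dx; y0 += sy; }
    }
}
```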
I thought about:
1) Implement everything for the b/w images, then make wrappers for the methods that check whether it's a color image. If it is, split the channels, perform the operations on each individually and then merge them.
2) Use functors to correctly update the values depending on what I'm dealing with. The problem is that the compiler errors would be really complicated and I'm not used to them, and I think I may end up needing quite a few functors. Not sure if this is a good idea, to be honest.
There might also be a suitable design pattern here that I'm not seeing. There could also be a channel/color-agnostic way to do this in OpenCV, though I haven't found it yet, and so far the book I'm reading (OpenCV 2 Computer Vision Application Programming Cookbook) hasn't shown me such a possibility.
If speed is important, Don't.
It sounds like you're trying to encapsulate or abstract away the type of pixel using OO techniques or the like. This could add an extra level of indirection for every pixel access, killing your performance.
Calling a function directly rather than through a pointer (e.g., a delegate, overridden method, or functor) can still be faster for the CPU, but if you're doing function calls at all, reconsider: they're still extra work. If you can nest everything inside the outer for loop, it will look ugly and functional programming snobs will sneer at you, but remember, this isn't a big LOB app that will get hard to maintain. That's why engineers can still perfectly maintain 30-year-old QuickBASIC code: the problem space doesn't need anything smarter (though usually their problems themselves need something a lot smarter than I!).
If you want speed, it's best to implement simple things (e.g., a threshold op or resizing) optimized for each kind of image. You can also look into transformation matrices and see if you can accomplish your work that way. That way you can write only the two transform algorithms (for b&w) and, using a similar (or the same) matrix, do the same thing for both types of pictures.
Hence you accomplish a major goal of abstraction anyway: seamless reuse and separation of concerns. And speed to boot (but hopefully not reboot!). Good luck.
Splitting the channels could work well with algorithms that work with the channels independently; not all of them do, so this will be quite limiting. You'll also spend a bit of time and space making all those copies.
By functors I presume you mean making templates out of your algorithm functions, with a pixel type as the template parameter. That could work also, but it means defining your basic pixel operations in a way that they could be implemented as functions or operators on a generic pixel type. This is harder than it looks and should be done after you've had some experience in implementing the algorithms.
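A minimal sketch of what that could look like (hypothetical names; the pixel type supplies the arithmetic through its operators):

```cpp
#include <cstddef>

// Blend two images of any pixel type that defines operator*(float) and operator+.
// Works for a grey pixel that is a plain float as well as for an RGB pixel struct.
template <typename PixelT>
void blendImages(const PixelT* a, const PixelT* b, PixelT* dst,
                 std::size_t count, float t)
{
    for (std::size_t i = 0; i < count; ++i)
        dst[i] = a[i] * (1.0f - t) + b[i] * t;
}
```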
A third option not mentioned is to promote the b/w images to full color, process them, and convert back to b/w. This optimizes the full color processing at the expense of the b/w.
For most algorithms it is not necessary to worry about monochrome vs. colour images. You either use the grey value of the monochrome image or you calculate the luminance/intensity/whatever of the colour image and use that. You choose the measure (luminance etc.) by looking at which colour space will give you the result you want.
Once you have worked out how you are going to modify your images, you use some pixel-aware processing; e.g., blending two pixels might be pixel_a*0.5 + pixel_b*0.5, and your pixel class will sort out how to apply that to the different colour channels, i.e. Pixel::operator+(const Pixel&), Pixel::operator*(float) and so on.
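A minimal sketch of such a pixel class (illustrative only), with the blend written exactly as above:

```cpp
// A pixel type that knows how to apply arithmetic across its channels.
struct Pixel
{
    float r, g, b;

    Pixel operator+(const Pixel& o) const { return { r + o.r, g + o.g, b + o.b }; }
    Pixel operator*(float s)        const { return { r * s, g * s, b * s }; }
};

inline Pixel blend(const Pixel& pixel_a, const Pixel& pixel_b)
{
    return pixel_a * 0.5f + pixel_b * 0.5f;
}
```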
There are algorithms that are applied individually to each colour channel, but they are not as common, and often there is some correlation between the spatiotemporal changes in the colours, so you wouldn't do something as basic as processing each channel totally independently of the others.
My own Image class uses a planar structure (that is, color channels are separate) instead of an interleaved structure. However this is VERY limiting when it comes to image quantization and other joint color processing tasks.
I am planning to rewrite it to use the other approach, i.e. to simply be a two-dimensional array of pixels. At the moment I am not sure exactly how I will implement it (a templated pixel class, a Pixel base class, or a simple three-dimensional array).
I also plan to write a planar wrapper for this interleaved image structure to ease any disadvantages I might encounter. One thing is sure: this wrapper will be much more efficient than a pixel wrapper would be for planar images.
Frankly, I believe splitting into planes is rather inefficient, since you pay various overheads several times. For example, if you want to resize an image, calculating the various filter coefficients is very expensive, and it would be MUCH better to calculate them just once and apply Pixel::operator* and operator+ rather than doing the same with the underlying sub-pixel components.
I have a device that acquires X-ray images. Due to some technical constraints, the detector is made of multiple tilted, partially overlapping tiles with heterogeneous pixel sizes. The image is thus distorted. The detector geometry is known precisely.
I need a function that converts these distorted images into a flat image with homogeneous pixel size. I have already done this on the CPU, but I would like to give OpenGL a try so as to use the GPU in a portable way.
I have no experience with OpenGL programming, and most of the information I could find on the web was useless for this use case. How should I proceed? How do I do this?
The image size is 560x860 pixels, and we have batches of 720 images to process. I'm on Ubuntu.
OpenGL is for rendering polygons. You might be able to do multiple passes and use shaders to get what you want, but you would be better off rewriting the algorithm in OpenCL. The bonus then is that you have something portable that will even use multi-core CPUs if no graphics accelerator is available.
Rather than OpenGL, this sounds like a CUDA problem, or more generally a GPGPU problem.
If you have C or C++ code to do it already, CUDA should be little more than figuring out the types you want to use on the GPU and how the algorithm can be tiled.
If you want to do this with OpenGL, you'd normally do it by supplying the current data as a texture, writing a fragment shader that processes that data, and setting it up to render to a texture. Once the output texture is fully rendered, you can retrieve it back to the CPU and write it out as a file.
I'm afraid it's hard to do much more than a very general sketch of the overall flow without knowing more about what you're doing -- but if (as you said) you've already done this with CUDA, you apparently already have a pretty fair idea of most of the details.
At heart what you are asking here is "how can I use a GPU to solve this problem?"
Modern GPUs are essentially linear algebra engines, so your first step would be to define your problem as a matrix that transforms an input coordinate <x, y> to its output in homogeneous space:
For example, you would represent a transformation of scaling x by ½, scaling y by 1.2, and translating up and left by two units as:
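Assuming the usual y-up convention ("up and left" meaning -2 in x and +2 in y), that is the homogeneous matrix:

```
[ 0.5   0   -2 ]   [x]   [ 0.5*x - 2 ]
[  0   1.2   2 ] * [y] = [ 1.2*y + 2 ]
[  0    0    1 ]   [1]   [     1     ]
```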
and you can work out analogous transforms for rotation, shear, etc, as well.
Once you've got your transform represented as a matrix-vector multiplication, all you need to do is load your source data into a texture, specify your transform as the projection matrix, and render it to the result. The GPU performs the multiplication per pixel. (You can also write shaders, etc, that do more complicated math, factor in multiple vectors and matrices and what-not, but this is the basic idea.)
That said, once you have got your problem expressed as a linear transform, you can make it run a lot faster on the CPU too, by leveraging e.g. SIMD or one of the many linear algebra libraries out there. Unless you need real-time performance or have a truly immense amount of data to process, using CUDA/GL/shaders etc. may be more trouble than it's strictly worth, as there's a bit of clumsy machinery involved in initializing the libraries, setting up render targets, learning the details of graphics development, and so on.
Simply converting your inner loop from ad-hoc math to a well-optimized linear algebra subroutine may give you enough of a performance boost on the CPU that you're done right there.
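As a rough idea of what that CPU path can look like, here is a nearest-neighbour sketch (illustrative names; a real implementation would interpolate) that maps every output pixel back through an assumed row-major 3x3 homogeneous matrix H and samples the source image:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Warp `src` (srcW x srcH, 16-bit grey) into `dst` (dstW x dstH) using the
// inverse mapping H: output (x, y, 1) -> source coordinates.
void warp(const std::vector<uint16_t>& src, int srcW, int srcH,
          std::vector<uint16_t>& dst, int dstW, int dstH,
          const float H[9])
{
    dst.assign(static_cast<std::size_t>(dstW) * dstH, 0);
    for (int y = 0; y < dstH; ++y)
        for (int x = 0; x < dstW; ++x)
        {
            const float w  = H[6] * x + H[7] * y + H[8];
            const float sx = (H[0] * x + H[1] * y + H[2]) / w;
            const float sy = (H[3] * x + H[4] * y + H[5]) / w;
            const int ix = static_cast<int>(std::lround(sx));
            const int iy = static_cast<int>(std::lround(sy));
            if (ix >= 0 && iy >= 0 && ix < srcW && iy < srcH)
                dst[static_cast<std::size_t>(y) * dstW + x] =
                    src[static_cast<std::size_t>(iy) * srcW + ix];
        }
}
```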
You might find this tutorial useful (it's a bit old, but note that it does contain some OpenGL 2.x GLSL after the Cg section). I don't believe there are any shortcuts to image processing in GLSL, if that's what you're looking for... you do need to understand a lot of the 3D rasterization aspect and historical baggage to use it effectively, although once you do have a framework for inputs and outputs set up you can forget about that and play around with your own algorithms in shader code relatively easily.
Having been doing this sort of thing for years (initially using Direct3D shaders, but more recently with CUDA), I have to say that I entirely agree with the posts here recommending CUDA/OpenCL. It makes life much simpler, and it generally runs faster. I'd have to be pretty desperate to go back to a graphics API implementation of non-graphics algorithms now.