Supporting multiple pixel formats - c++

I need to write some code which will operate on a number of pixel formats (eg A8R8G8B8, R8G8B8, R5G6B6, and even potentially floating point formats).
Ideally I would like to not have to write each function for each format since that is a massive amount of near identical code.
The only thing I could think of is some kind of interface letting it deal with pixel format conversions eg:
class IBitmap
{
public:
virtual unsigned getPixel(unsigned x, unsigned y)const=0;
virtual void setPixel(unsigned x, unsigned y, unsigned argb)=0;
virtual unsigned getWidth()const=0;
virtual unsigned getHeight()const=0;
};
However calling a virtual function for every get or set pixel operation is hardly fast, since not only is there the extra overhead of a virtual call, but it also far more importantly prevents inlining for something which is just a few instructions long.
Are there any other options which would allow for me to support all these formats efficiently? Generally speaking my code is likely to only operate on a small portion of the bitmap, and needs read/write access in many cases (blending).

Consider using templates. This will allow you to write generically where possible and take the hit for specialising the code at compile time
Sometimes known as static polymorphism.
http://en.wikipedia.org/wiki/Curiously_Recurring_Template_Pattern

On modern processors virtual functions are pretty darn fast, so I'm not sure I'd reject them out of hand. Run a timing test or two to be sure. Virtual functions give you runtime polymorphism, which allows you to conceivably write (or generate at compile time) less code. The static polymorphism represented by templates will probably cause the compiler to generate that massive amount of code that you're trying to avoid.

To avoid having to write loads of formats, write a Color class or similar that always stores a particular format (eg. A8R8G8B8 or A32R32G32B32F) then have methods on that class to retrieve in different formats, eg. get(FORMAT_R5G6B6) or similar. All your methods can then deal with that class. Whenever possible, convert the color only once (eg. when drawing a rect, convert the rect color to the destination format then write that to all the pixels - don't convert it for every pixel!).
Avoid having a SetPixel/GetPixel method at all. It cannot be done efficiently, especially if you're doing colour conversions on the fly, and especially not if the optimizer decides not to inline the calls. You'd do better to expose memory buffers directly, and trust that the caller will correctly use the memory. That's how every other API I've used before does it.

(This is my new answer because after extensive research, I concluded that Adobe GIL is not suitable for this purpose.)
I would highly recommend the architecture and interface design of the Windows Imaging Component.
What I mean is:
Their interface is what I recommend. Not the implementation.
Everyone can implement something similar. In fact, the Mono Project (Wine) contains a partial implementation of the WIC.
Instead of the producer-consumer model, WIC uses a pipeline-storage model.
A read-only bitmap implements IWICBitmapSource interface
There are only 5 member methods.
A writable bitmap implements IWICBitmap interface and provides direct memory read/write access to the pixel data.
There are only 3 additional member methods, on top of IWICBitmapSource.
Being a "storage" class means that this class does not depend on any other bitmap instances when producing pixel values.

The overhead of converting everything to a single common format, then converting back after the operation, may be less than you think. If you can limit the number of pixels that are converted at a time, say to a single row, the intermediate results may remain in cache for the entire operation. I've actually seen a case where the end-to-end times were faster using this approach, although it involved one-bit pixels that are inherently difficult to work with when still packed.
You are right to be wary of a pixel-level function call. I've never seen a case where a pixel function didn't make the processing unacceptably slow.

Related

How to store a boost::quantity with possible different boost::dimension

I am using boost::units library to enforce physical consistency in a scientific project. I have read and tried several examples from boost documentation. I am able to create my dimensions, units and quantities. I did some calculus, it works very well. It is exactly what I expected, except that...
In my project, I deal with time series which have several different units (temperature, concentration, density, etc.) based on six dimensions. In order to allow safe and easy units conversions, I would like to add a member to each channel class representing the dimensions and units of time series. And, data treatment (import, conversion, etc.) are user-driven, therefore dynamic.
My problem is the following, because of the boost::units structure, quantities within an homogeneous system but with different dimensions have different types. Therefore you cannot directly declare a member such as:
boost::units::quantity channelUnits;
Compiler will claim you have to specify dimensions using template chevrons. But if you do so, you will not be able to store different type of quantities (say quantities with different dimensions).
Then, I looked for boost::units::quantity declaration to find out if there is a base class that I can use in a polymorphic way. But I haven't found it, instead I discovered that boost::units heavily uses Template Meta Programming which is not an issue but does not exactly fit my dynamic needs since everything is resolved at compile-time not at run-time.
After more reading, I tried to wrap different quantities in a boost::variant object (nice to meet it for the very first time).
typedef boost::variant<
boost::units::quantity<dim1>,
...
> channelUnitsType;
channelUnitsType channelUnits;
I performed some tests and it seems to work. But I am not confident with boost::variant and the visitor-pattern.
My questions are the following:
Is there another - maybe best - way to have run-time type resolution?
Is dynamic_cast one of them? Units conversion will not happen very often and only few data are in concern.
If boost::variant is a suitable solution, what are its drawbacks?
Going deeper in my problem I read two articles providing tracks for a solution:
Kostadin Damevski, Expressing Measurements Units in Interfaces for Scientific Component Software;
Lingxiao Jiang, A Practical Type System for Validating Dimensional Units Correctness of C Programs.
The first gives good ideas for the interface implementation. The second gives a complete overview of what you must cope with.
I keep in mind that boost::units is a complete and efficient way for dimension consistency at compile-time without overhead at runtime. Anyway, for runtime dimension consistency involving dimensions changes you do need a dynamic structure that boost::units does not provide. So here am I: designing a units class that will exactly fit my needs. More work to achieve, more satisfaction at the end...
About the original questions:
boost::variant works well (it provides the dynamic boost::units is missing) for this job. And furthermore, it can be serialized out of the box. Thus it is a effective approach. But it is adding a layer of abstraction for a simple - I am not saying trivial - task that could be done by a single class.
Casting is achieved by boost::variant_cast<> instead of dynamic_cast<>.
boost::any could be easier to implement but serialization becomes an hard way.
I have been thinking about this problem and came up with the following conclusion:
1. Implement type erasure (pros: nice interfaces, cons:memory overhead)
It looks impossible to store without overhead a general quantity with common dimension, that break one of the design principles of the libraries. Even type erasure won't help here.
2. Implement a convertible type (pros: nice interfaces, cons:operational overhead)
The only way I see without storage overhead, is to choose a conventional (possibly hidden) system where all units are converted to and from. There is no memory overhead but there is a multiplication overhead in almost all queries to the values and a tremendous number of conversion and some loose of precision of high exponent, (think of conversion from avogadro number to the 10 power).
3. Allow implicit conversions (pros: nice interfaces, cons:harder to debug, unexpected operational overheads)
Another option, mostly in the practical side to alleviate the problem is to allow implicit conversion at the interface level, see here: https://groups.google.com/d/msg/boost-devel-archive/JvA5W9OETt8/5fMwXWuCdDsJ
4. Template/Generic code (pros: no runtime or memory overhead, conceptually correct, philosophy follows that of the library, cons: harder to debug, ugly interfaces, possible code bloat, lots of template parameters everywhere)
If you ask the library designer probably they will tell you that you need to make your functions generic. This is possible but it complicates the code. For example:
template<class Length>
auto square(Length l) -> decltype(l*l){return l*l;}
I use C++11 to simplify the example here (it is possible to do it in C++98), and also to show that this is becoming easier to do in C++11 (and even simpler in C++14 with decltype(auto).
I know that this is not the type of code you had in mind but it is consistent with the design of the library. You may think, well how do I restrict this function to physical length and not something else? Well, the answer is that you don't need to this, however if you insist, in the worst case...
template<class Length, typename std::enable_if<std::is_same<typename get_dimension<Lenght>::type, boost::units::length_dimension>::value>::type>
auto square(Length l) -> decltype(l*l){return l*l;}
(In better cases decltype will do the SFINAE job.)
In my opinion, option 4. and possibly combined with 3. is the most elegant way ahead.
References:
https://www.boost.org/doc/libs/1_69_0/boost/units/get_dimension.hpp

Is it sufficient that a class is POD to read it from binary?

For a client/server application I need to send and recive c++ objects. I don't need the corresponding classes to do anything fancy but want to have maximal performance (regarding network traffic and computation). So I though of simply transferring them as binary strings. Basicly I want to be able to do the following
//Create original object
MyClass oldObj();
//save to char array
char* save = new char[sizeof(MyClass)];
memcpy(save, &oldObj, sizeof(MyClass));
//Somewhere of course there would be the transfer to the client/server
//Read back from char array
MyClass newObj();
memcpy(&newObj, save, sizeof(MyClass));
My question: What does my class need to fullfill in order for this to work?
Naturaly Pointers as members won't work when transferring to another application. But is it sufficient that my Class is considered POD (in c++03 and/or c++11) and does not have any pointers or equivalents (like STL containers) as members?
Both machines need to:
Have the same Endianess (for int)
The same floating point representation (double)
The same size for all types.
The Same compiler
The Same flags used to build the application.
Pointers dont transfer well.
BUT the network is going to be the slowest part here.
The cost of serializing most objects is going to be irrelevant compared to the cost of transfer. Of course the bigger your object the higher the cost but it takes a while before it is significant to make a dent.
The higher cost of maintenance is also you should factor in.
What does my class need to fulfill in order for this to work?
It must not have pointer members, you already mention that.
It must not have members whose size is implementation defined, like int.
It must not have integers members, due to different endianness.
It must not have floating point members, due to different representations.
...and probably more!
Basically, you cannot do that except for very particularly constrained scenarios. You will have to pick a protocol and make your data conform to it to send it through the network safely.
Is not a big deal since performance will be bounded by network speed and latency, not by the operations needed on your values to conform to the protocol.
How much control do you have over the hardware/OS that this runs on? Are you writing code that is super-portable, or will it ONLY run on 32- and 64-bit x86 Windows [for example]?
To be fully "super-portable", as explained above, you can't have any form of "implementation defined" sized objects (such as int that can be 16, 18, 32, 36 or 64 bits, for example). Such items need to be stored as bytes of defined number and order to make sure it will not get cut off/re-ordered when transferring. Floating point can be even worse...
A lot of "super-portable" applications store their data as text. It's a little slower, but it makes it trivially portable, since text is just a stream of bytes whatever architecture you run it on, and it's ordered the same way whichever machine you use (as long as you stick to 0-9, A-Za-z, !?<>,.()*& and a few other characters - and beware of EBCDIC encoded machines, but they tend to handle "ascii-to-ebcdic" conversion). The other end just need to conver the text back to strings/integers/floats/doubles, whatever you need. A conversion from integer to string of digits takes one divide per digit (using hex or base-36 makes that a bit better, but makes it much less human readable - sometimes a good thing, sometimes a bad thing). This is clearly slower than storing 4 bytes. THe other drawback is that it's (depending on values used) often longer to store a number in text than as binary. So your network packets will be a little larger. This will have a greater impact than the conversion, as processors can do a lot of math in the time it takes to send 1KB with a 10Gbit network card. And of course, you need a few extra bytes (spaces, commas, newlines or whatever it may be) so that you can tell the difference between one number 123456 and three 12, 34, 56. [Of course, no need to use ", " between each]. And you need some code to parse the whole thing at the other end once it has arrived.
If you know that your system(s) always have 32-bit integers and IEEE-754 floating point numbers [these are extremely common!], then you may well get away with just worrying about byte order. And if you know that it's always going to be on "x86" or some such, you don't have to worry about byte order either. But you now may have to modify your code when you decide that "running my code on an iphone would be a good idea". Of course, you could leave that to the iphone side of things to conform to whatever the rest requires.
Other answers have mentioned how it is possible to use a class for this purpose. Personally, I prefer to use a struct instead. In C++, a struct can have member methods/operators, constructor/destructor, supports inheritance, etc just like a class does. However, a struct has a well-defined and predictable memory layout and can have that layout explicitally aligned via #pragma statements to add/remove the compiler's implicit padding (I have never tried aligning a class before, but I think it is supported). I always use an 8bit-aligned struct for data that has to be exchanged outside of the app's process. For all intents and purposes, in modern compilers, a struct is basically identical to a class, just the default visibility of its members is public instead of private. But I like to keep struct and class separated for different purposes. A struct is just a raw container of data that you can freely manpulate, overwrite in memory, etc. A class is an object whose memory layout and padding is compiler-defined and should not be messed with.

Image interpolation implementation with C++

I have a question related to the implementation of image interpolation (bicubic and bilinear methods) with C++. My main concern is speed. Based on my understanding of the problem, in order to make the interpolation program fast and efficient, the following strategies can be adopted:
Fast image interpolation using Streaming SIMD Extensions (SSE)
Image interpretation with multi-thread or GPU
Fast image interpolation algorithms
C++ implementation tricks
Here, I am more interested in the last strategy. I set up a class for interpolation:
/**
* This class is used to perform interpretaion for a certain poin in
* the image grid.
*/
class Sampling
{
public:
// samples[0] *-------------* samples[1]
// --------------
// --------------
// samples[2] *-------------*samples[3]
inline void sampling_linear(unsigned char *samples, unsigned char &res)
{
unsigned char res_temp[2];
sampling_linear_1D(samples,res_temp[0]);
sampling_linear_1D(samples+2,res_temp[1]);
sampling_linear_1D(res_temp,res);
}
private:
inline void sampling_linear_1D(unsigned char *samples, unsigned char &res)
{
}
}
Here I only give an example for bilinear interpolation. In order to make the program run faster, the inline function is employed. My question is whether this implementation scheme is efficient. Additionally, during the interpretation procedure if I give the use the option of choosing between different interpolation methods. Then I have two choices:
Depending on the interpolation method, invoke the function the perform interpolation for the whole image.
For each output image pixel, first determine its position in the input image, and then according to the interpolation method setting, determine the interpolation function.
The first method means more codes in the program while the second one may lead to inefficiency. Then, how could I choose between these two schemes? Thanks!
Fast image interpolation using Streaming SIMD Extensions (SSE)
This may not provide desired result, because I expect that your algorithm will be memory-bounded rather than FLOP/s bounded.
I mean - it definitely will be improvement, but not beneficial in compare to implementation cost.
And by the way, modern compilers can perform auto-vectorization (i.e. use of SSE and futher extensions): GCC starting from 4.0, MSVC starting from 2012, MSVC Auto-Vectorization video lectures.
Image interpretation with multi-thread or GPU
Multi-thread version should give good effect, because it would allow you to exploit all available memory throughput.
If you do not plan to process data several times, or use it in some way on GPU, then GPGPU may not give desired result. Yes, it will produce result faster (mostly due to higher memory speed), but this effect will be crossed out by slow transfer between main RAM and GPU's RAM.
Just for example, approximate modern throughputs:
CPU RAM ~ 20GiB/s
GPU RAM ~ 150GiB/s
Transfering between CPU RAM <-> GPU RAM ~ 3-5 GiB/s
For single pass memory bounded algorithms, in most cases, third item makes usage of GPUs impractical (for such algoirthms).
In order to make the program run faster, the inline function is employed
Class member functions are "inline" by default. Beaware, that main purpose of "inline" is not actually "inling", but helping to prevent One Definition Rule violation when your functions are defined in headers.
There are compiler-dependent "forceinline" features, for instance MSVC has __forceinline. Or abstracted from compiler ifdef'ed BOOST_FORCEINLINE macro.
Anyway, trust your compiler unless you don't prove otherwise (with help of assembler for example). Most important fact, is that compiler should see functions defenitions - then it can decide itself to inline, even if function is not inline itself.
My question is whether this implementation scheme is efficient.
As I understand, as pre-step, you gather samples into 2x2 matrix. I think it may be better to pass directly two pointers to arrays of two elements within image directly, or one pointer + width size (to calc second pointer automaticly). However, it is not a big issue, most likely your temporary 2x2 matrix will be optimized away.
What is really important - is how you traverse your image.
Let's say for given x and y, index is calculated as:
i=width*y+x;
Then your traversal loop should be:
for(int y=/*...*/)
for(int x=/*...*/)
{
// loop body
}
Because, if you would chose another order (x first, then y) - it will be not cache-friendly, and as the result performance drop can be up to 64x (depending on your pixel size). You may check it just for your interest.
The first method means more codes in the program while the second one may lead to inefficiency. Then, how could I choose between these two schemes? Thanks!
In this case, you can use compile-time polymorphism to reduce code amount in first version. For instance, based on templates.
Just look at std::accumulate - it can be written once, and then it will work on different types of iterators, different binary operations (functions or functors), without imply any runtime penalty due to it's polymorphism.
Alexander Stepanov says:
For many years, I tried to achieve relative efficiency in more advanced languages (e.g., Ada and Scheme) but failed. My generic versions of even simple algorithms were not able to compete with built-in primitives. But in C++ I was finally able to not only accomplish relative efficiency but come very close to the more ambitious goal of absolute efficiency. To verify this, I spent countless hours looking at the assembly code generated by different compilers on different architectures.
Check Boost's Generic Image Library - it has good tutorial, and there is video presentation from author.

What are the advantages to using bitsets for bitmap storage?

I'm currently evaluating whether I should utilize a single large bitset or many 64-bit unsigned longs (uint_64) to store a large amount of bitmap information. In this case, the bitmap represents the current status of a few GB of memory pages (dirty / not dirty), and has thousands of entries.
The work which I am performing requires that I be able to query and update the dirty pages, including performing OR operations between two dirty page bitmaps.
To be clear, I will be performing the following:
Importing a bitmap from a file, and performing a bitwise OR operation with the existing bitmap
Computing the hamming weight (counting the number of bits set to 1, which represents the number of dirty pages)
Resetting / clearing a bit, to mark it as updated / clean
Checking the current status of a bit, to determine if it is clean
It looks like it is easy to perform bitwise operations on a C++ bitset, and easily compute the hamming weight. However, I imagine there is no magic here -- the CPU can only perform bitwise operations on as many bytes as it can store in a register -- so the routine utilized by the bitset is likely the same I would implement myself. This is probably also true for the hamming weight.
In addition, importing the bitmap data from the file to the bitset looks ugly -- I need to perform bitshifts multiple times, as shown here. I imagine given the size of the bitsets I would be working with, this would have a negative performance impact. Of course, I imagine I could just use many small bitsets instead, but there may be no advantage to this (other then perhaps ease of implementation).
Any advice is appriciated, as always. Thanks!
Sounds like you have a very specific single-use application. Personally, I've never used a bitset, but from what I can tell its advantages are in being accessible as if it was an array of bools as well as being able to grow dynamically like a vector.
From what I can gather, you don't really have a need for either of those. If that's the case and if populating the bitset is a drama, I would tend towards doing it myself, given that it really is quite simple to allocate a whole bunch of integers and do bit operations on them.
Given that have very specific requirements, you will probably benefit from making your own optimizations. Having access to the raw bit data is kinda crucial for this (for example, using pre-calculated tables of hamming weights for a single byte, or even two bytes if you have memory to spare).
I don't generally advocate reinventing the wheel... But if you have special optimization requirements, it might be best to tailor your solution towards those. In this case, the functionality you are implementing is pretty simple.
Thousands bits does not sound as a lot. But maybe you have millions.
I suggest you write your code as-if you had the ideal implementation by abstracting it (to begin with use whatever implementation is easier to code, ignoring any performance and memory requirement problems) then try several alternative specific implementations to verify (by measuring them) which performs best.
One solution that you did not even consider is to use Judy arrays (specifically Judy1 arrays).
I think if I were you I would probably just save myself the hassle of any DIY and use boost::dynamic_bitset. They've got all the bases covered in terms of functionality, including stream operator overloads which you could use for file IO (or just read your data in as unsigned ints and use their conversions, see their examples) and a count method for your Hamming weight. Boost is very highly regarded a least by Sutter & Alexandrescu, and they do everything in the header file--no linking, just #include the appropriate files. In addition, unlike the Standard Library bitset, you can wait until runtime to specify the size of the bitset.
Edit: Boost does seem to allow for the fast input reading that you need. dynamic_bitset supplies the following constructor:
template <typename BlockInputIterator>
dynamic_bitset(BlockInputIterator first, BlockInputIterator last,
const Allocator& alloc = Allocator());
The underlying storage is a std::vector (or something almost identical to it) of Blocks, e.g. uint64s. So if you read in your bitmap as a std::vector of uint64s, this constructor will write them directly into memory without any bitshifting.

Storing collections of items that an algorithm might need

I have a class MyClass that stores a collection of PixelDescriptor* objects. MyClass uses a function specified by a Strategy pattern style object (call it DescriptorFunction) to do something for each descriptor:
void MyFunction()
{
descriptorA = DescriptorCollection[0];
for_each descriptor in DescriptorCollection
{
DescriptorFunction->DoSomething(descriptor)
}
}
However, this only makes sense if the descriptors are of a type that the DescriptorFunction knows how to handle. That is, not all DescriptorFunction's know how to handle all descriptor types, but as long as the descriptors that are stored are of the type that the visitor that is specified knows about, all is well.
How would you ensure the right type of descriptors are computed? Even worse, what if the strategy object needs more than one type of descriptor?
I was thinking about making a composite descriptor type, something like:
class CompositeDescriptor
{
std::vector<PixelDescriptor*> Descriptors;
}
Then a CompositeDescriptor could be passed to the DescriptorFunction. But again, how would I ensure that the correct descriptors are present in the CompositeDescriptor?
As a more concrete example, say one descriptor is Color and another is Intensity. One Strategy may be to average Colors. Another strategy may be to average Intensities. A third strategy may be to pick the larger of the average color or the average intensity.
I've thought about having another Strategy style class called DescriptorCreator that the client would be responsible for setting up. If a ColorDescriptorCreator was provided, then the ColorDescriptorFunction would have everything it needs. But making the client responsible for getting this pairing correct seems like a bad idea.
Any thoughts/suggestions?
EDIT: In response to Tom's comments, a bit more information:
Essentially DescriptorFunction is comparing two pixels in an image. These comparisons can be done in many ways (besides just finding the absolute difference between the pixels themseles). For example 1) Compute the average of corresponding pixels in regions around the pixels (centered at the pixels). 2) Compute a fancier "descriptor" which typically produces a vector at each pixel and average the difference of the two vectors element-wise. 3) compare 3D points corresponding to the pixel locations in external data, etc etc.
I've run into two problems.
1) I don't want to compute everything inside the strategy (if the strategy just took the 2 pixels to compare as arguments) because then the strategy has to store lots of other data (the image, there is a mask involved describing some invalid regions, etc etc) and I don't think it should be responsible for that.
2) Some of these things are expensive to compute. I have to do this millions of times (the pixels being compared are always difference, but the features at each pixel do not change), so I don't want to compute any feature more than once. For example, consider the strategy function compares the fancy descriptors. In each iteration, one pixels is compared to all other pixels. This means in the second iteration, all of the features would have to be computed again, which is extremely redundant. This data needs to be stored somewhere that all of the strategies can access it - this is why I was trying to keep a vector in the main client.
Does this clarify the problem? Thanks for the help so far!
The first part sounds like a visitor pattern could be appropriate. A visitor can ignore any types it doesn't want to handle.
If they require more than one descriptor type, then it is a different abstraction entirely. Your description is somewhat vague, so it's hard to say exactly what to do. I suspect that you are over thinking it. I say that because generally choosing arguments for an algorithm is a high level concern.
I think the best advice is to try it out.
I would write each algorithm with the concrete arguments (or stubs if its well understood). Write code to translate the collection into the concrete arguments. If there is an abstraction to be made, it should be apparent while writing your translations. Writing a few choice helper functions for these translations is usually the bulk of the work.
Giving a concrete example of the descriptors and a few example declarations might give enough information to offer more guidance.