Image interpolation implementation with C++ - c++

I have a question related to the implementation of image interpolation (bicubic and bilinear methods) with C++. My main concern is speed. Based on my understanding of the problem, in order to make the interpolation program fast and efficient, the following strategies can be adopted:
Fast image interpolation using Streaming SIMD Extensions (SSE)
Image interpolation with multi-thread or GPU
Fast image interpolation algorithms
C++ implementation tricks
Here, I am more interested in the last strategy. I set up a class for interpolation:
/**
* This class is used to perform interpolation for a certain point in
* the image grid.
*/
class Sampling
{
public:
// samples[0] *-------------* samples[1]
// --------------
// --------------
// samples[2] *-------------*samples[3]
inline void sampling_linear(unsigned char *samples, unsigned char &res)
{
unsigned char res_temp[2];
sampling_linear_1D(samples,res_temp[0]);
sampling_linear_1D(samples+2,res_temp[1]);
sampling_linear_1D(res_temp,res);
}
private:
inline void sampling_linear_1D(unsigned char *samples, unsigned char &res)
{
}
};
Here I only give an example for bilinear interpolation. In order to make the program run faster, the inline function is employed. My question is whether this implementation scheme is efficient. Additionally, suppose that during the interpolation procedure I give the user the option of choosing between different interpolation methods. Then I have two choices:
Depending on the interpolation method, invoke a function that performs the interpolation for the whole image.
For each output image pixel, first determine its position in the input image, and then select the interpolation function according to the interpolation method setting.
The first method means more code in the program, while the second one may lead to inefficiency. How should I choose between these two schemes? Thanks!

Fast image interpolation using Streaming SIMD Extensions (SSE)
This may not give the desired result, because I expect your algorithm to be memory-bound rather than FLOP/s-bound.
I mean, it will definitely be an improvement, but probably not a large enough one to justify the implementation cost.
By the way, modern compilers can perform auto-vectorization (i.e. use SSE and later extensions): GCC starting from 4.0, MSVC starting from 2012 (see the MSVC Auto-Vectorization video lectures).
Image interpolation with multi-thread or GPU
A multi-threaded version should work well, because it lets you exploit all of the available memory throughput.
If you do not plan to process the data several times, or to use it on the GPU in some other way, then GPGPU may not give the desired result. Yes, it will produce the result faster (mostly due to higher memory bandwidth), but this gain will be cancelled out by the slow transfer between main RAM and the GPU's RAM.
Just for example, approximate modern throughputs:
CPU RAM ~ 20GiB/s
GPU RAM ~ 150GiB/s
Transferring between CPU RAM <-> GPU RAM ~ 3-5 GiB/s
For single-pass memory-bound algorithms, the third item makes GPUs impractical in most cases.
In order to make the program run faster, the inline function is employed
Member functions defined inside the class body are "inline" by default. Be aware that the main purpose of "inline" is not actually inlining, but preventing One Definition Rule violations when your functions are defined in headers.
There are compiler-dependent "forceinline" features; for instance, MSVC has __forceinline. There is also the BOOST_FORCEINLINE macro, which hides the compiler differences behind #ifdefs.
Anyway, trust your compiler unless you can prove otherwise (by inspecting the generated assembly, for example). The most important point is that the compiler must see the function definitions; then it can decide to inline by itself, even if a function is not declared inline.
My question is whether this implementation scheme is efficient.
As I understand it, as a preliminary step you gather the samples into a 2x2 matrix. I think it may be better to pass two pointers to the two-element rows within the image directly, or one pointer plus the image width (to calculate the second pointer automatically). However, it is not a big issue: most likely your temporary 2x2 matrix will be optimized away.
What is really important is how you traverse your image.
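A minimal sketch of the "one pointer + width" idea, assuming an 8-bit, single-channel, row-major image. The name sample_bilinear and its parameters are hypothetical (they are not from the question), and the caller is assumed to guarantee that (x0 + 1, y0 + 1) is still inside the image:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: bilinear sample of an 8-bit, single-channel,
// row-major image. `stride` is the number of bytes per row.
// fx, fy in [0, 1] are the fractional offsets from the top-left sample.
// Assumption: (x0 + 1, y0 + 1) lies inside the image.
inline std::uint8_t sample_bilinear(const std::uint8_t* image, std::size_t stride,
                                    std::size_t x0, std::size_t y0,
                                    float fx, float fy)
{
    const std::uint8_t* top    = image + y0 * stride + x0; // samples[0], samples[1]
    const std::uint8_t* bottom = top + stride;             // samples[2], samples[3]

    // Interpolate horizontally on both rows, then vertically between them.
    const float t = top[0]    + fx * (top[1]    - top[0]);
    const float b = bottom[0] + fx * (bottom[1] - bottom[0]);
    const float v = t + fy * (b - t);

    return static_cast<std::uint8_t>(v + 0.5f); // round to nearest
}
```

No temporary 2x2 array is needed here: the second row is reached by adding the stride to the first pointer.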
Let's say for given x and y, index is calculated as:
i=width*y+x;
Then your traversal loop should be:
for(int y=/*...*/)
for(int x=/*...*/)
{
// loop body
}
If you choose the other order (x in the outer loop, y in the inner loop), the traversal is not cache-friendly, and the resulting performance drop can be up to 64x (depending on your pixel size). You may want to measure it yourself out of interest.
The first method means more code in the program, while the second one may lead to inefficiency. How should I choose between these two schemes? Thanks!
In this case, you can use compile-time polymorphism, for instance based on templates, to reduce the amount of code in the first version.
Just look at std::accumulate: it is written once, and then it works with different types of iterators and different binary operations (functions or functors), without any runtime penalty for its polymorphism.
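As a rough sketch of how that could look for the interpolation case: the whole-image loop is written once as a template, and each interpolation method is a small policy type. All names here (ImageView, Nearest, Bilinear, resize) are hypothetical, and the code assumes a non-empty 8-bit, single-channel, row-major image:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical read-only view of an 8-bit, single-channel, row-major image.
struct ImageView {
    const std::uint8_t* data;
    std::size_t width, height;
    std::uint8_t at(std::size_t x, std::size_t y) const { return data[y * width + x]; }
};

// Each policy exposes the same static sample(); (x, y) must lie inside the image.
struct Nearest {
    static std::uint8_t sample(const ImageView& img, float x, float y) {
        return img.at(static_cast<std::size_t>(x + 0.5f), static_cast<std::size_t>(y + 0.5f));
    }
};

struct Bilinear {
    static std::uint8_t sample(const ImageView& img, float x, float y) {
        const std::size_t x0 = static_cast<std::size_t>(x);
        const std::size_t y0 = static_cast<std::size_t>(y);
        const std::size_t x1 = std::min(x0 + 1, img.width - 1);  // clamp at the border
        const std::size_t y1 = std::min(y0 + 1, img.height - 1);
        const float fx = x - static_cast<float>(x0);
        const float fy = y - static_cast<float>(y0);
        const float top = img.at(x0, y0) + fx * (img.at(x1, y0) - img.at(x0, y0));
        const float bot = img.at(x0, y1) + fx * (img.at(x1, y1) - img.at(x0, y1));
        return static_cast<std::uint8_t>(top + fy * (bot - top) + 0.5f);
    }
};

// The loop is written once. The method is a compile-time parameter, so there is
// no per-pixel branch or virtual call, and sample() can be inlined into the loop.
template <class Interp>
std::vector<std::uint8_t> resize(const ImageView& src, std::size_t outW, std::size_t outH)
{
    std::vector<std::uint8_t> dst(outW * outH);
    const float sx = outW > 1 ? static_cast<float>(src.width - 1) / static_cast<float>(outW - 1) : 0.0f;
    const float sy = outH > 1 ? static_cast<float>(src.height - 1) / static_cast<float>(outH - 1) : 0.0f;
    for (std::size_t y = 0; y < outH; ++y)       // rows in the outer loop: cache-friendly
        for (std::size_t x = 0; x < outW; ++x)
            dst[y * outW + x] = Interp::sample(src, x * sx, y * sy);
    return dst;
}
```

The user's runtime choice is then dispatched once per image, e.g. by calling resize<Nearest>(...) or resize<Bilinear>(...) from a single if or switch, instead of once per pixel.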
Alexander Stepanov says:
For many years, I tried to achieve relative efficiency in more advanced languages (e.g., Ada and Scheme) but failed. My generic versions of even simple algorithms were not able to compete with built-in primitives. But in C++ I was finally able to not only accomplish relative efficiency but come very close to the more ambitious goal of absolute efficiency. To verify this, I spent countless hours looking at the assembly code generated by different compilers on different architectures.
Check Boost's Generic Image Library: it has a good tutorial, and there is a video presentation from the author.

Related

Profiler tells that function call overhead is 10x of normal statement

My profiler reports that the overhead of calling a function is very large.
It is currently a bottleneck in my program.
The function is in a template class:
template<class U> class CustomArray{
....
public: U& operator[](int n){ //<-- 2.8%
... some cheap assertion ... //<-- 0.2%
return database()[n]; //<-- 0.3% (just add address to allocated mem)
} //<-- 2.7%
}
(^ The image was edited a little to protect me from my boss.)
Question
Is this possible? Is the profiler wrong?
If not, how can I optimize it?
I have tried the inline keyword (no difference). This function is already inline, isn't it?
I am using Visual Studio 2015's profiler (optimization -O2).
The result is very inconsistent with How much overhead is there in calling a function in C++?.
Edit: I confirm that Profiling Collection = Sampling (not Instrumentation).
Let's assume you are using the default sampling method of profiling in Visual Studio.
Such profilers usually work at the assembly level, for example, by sampling the current instruction pointer periodically. They then use debug data to try to map that data back to source lines. For heavily optimized and inlined code, this mapping isn't always reliable (indeed, some instructions may not originate from any line, or one instruction may effectively be shared among several lines).
In addition to making profiling tricky, this also means that statements like "function call has a 10x overhead of a normal statement" aren't really meaningful: there is no "typical" function call and there certainly is no typical "normal statement". Function calls can vary from totally free (when inlined or even eliminated) to somewhat expensive (mis-predicted virtual calls [1]), and statements span an even greater range from free to almost unlimited in cost (a common case would be a cache miss taking hundreds of cycles).
On top of that, sampling methods often have inherent error or skew. For example, an expensive instruction may tend to spread its samples out among subsequent instructions rather than being assigned all the samples itself. This leads to additional error at the instruction level.
All this adds up to mean that while sampling results may be quite accurate for broad-stroke profiling (i.e., identifying features on the order of hundreds of cycles), you can't always read too much into very fine-grained results such as your one-line function above.
If you do want to read into those results, the first step is to see if the sampling mode has an assembly level view and to use that view, since at least then you remove entirely the assembly-to-source mapping issue.
[1] Is there anything worse that could reasonably be considered a "function call" in C++?

Is there any workaround to "reserve" a cache fraction?

Assume I have to write a computationally intensive C or C++ function that has 2 arrays as input and one array as output. If the computation uses the 2 input arrays more often than it updates the output array, I'll end up in a situation where the output array seldom stays cached because it's evicted in order to fetch the 2 input arrays.
I want to reserve one fraction of the cache for the output array and enforce somehow that those lines don't get evicted once they are fetched, in order to always write partial results in the cache.
Update1(output[]) // Output gets cached
DoCompute1(input1[]); // Input 1 gets cached
DoCompute2(input2[]); // Input 2 gets cached
Update2(output[]); // Output is not in the cache anymore and has to get cached again
...
I know there are mechanisms to help eviction: clflush, clevict, _mm_clevict, etc. Are there any mechanisms for the opposite?
I am thinking of 3 possible solutions:
Using _mm_prefetch from time to time to fetch the data back if it has been evicted. However, this might generate unnecessary traffic, and I would need to be very careful about where to insert the prefetches;
Trying to do processing on smaller chunks of data. However this would work only if the problem allows it;
Disabling hardware prefetchers where that's possible to reduce the rate of unwanted evictions.
Other than that, is there any elegant solution?
Intel CPUs have something called No Eviction Mode (NEM) but I doubt this is what you need.
While you are attempting to optimise the second (unnecessary) fetch of output[], have you given thought to using SSE2/3/4 registers to store your intermediate output values, update them when necessary, and writing them back only when all updates related to that part of output[] are done?
I have done something similar while computing FFTs (Fast Fourier Transforms) where part of the output is in registers and they are moved out (to memory) only when it is known they will not be accessed anymore. Until then, all updates happen to the registers. You'll need to introduce inline assembly to effectively use SSE* registers. Of course, such optimisations are highly dependent on the nature of the algorithm and data placement.
I am trying to get a better understanding of the question:
If it is true that the 'output' array is strictly for output, and you never do something like
output[i] = Foo(newVal, output[i]);
then, all elements in output[] are strictly write. If so, all you would ever need to 'reserve' is one cache-line. Isn't that correct?
In this scenario, all writes to 'output' generate cache-fills and could compete with the cachelines needed for 'input' arrays.
Wouldn't you want a cap on the cachelines 'output' can consume, as opposed to reserving a certain number of lines?
I see two options, which may or may not work depending on the CPU you are targeting, and on your precise program flow:
If output is only written to and not read, you can use streaming-stores, i.e., a write instruction with a no-read hint, so it will not be fetched into cache.
You can use prefetching with a non-temporal (NTA) hint for input. I don't know how this is implemented in general, but I know for sure that on some Intel CPUs (e.g., the Xeon Phi) each hardware thread uses a specific way of the cache for NTA data, i.e., with an 8-way cache, 1/8th per thread.
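A minimal sketch of the streaming-store option, assuming an x86 target with SSE2, 16-byte-aligned arrays and a length that is a multiple of 4. The function name add_streaming and the macro HAVE_SSE2_STREAM are made up for this illustration; on other targets the code falls back to ordinary stores:

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   // SSE2: _mm_load_si128, _mm_add_epi32, _mm_stream_si128
#define HAVE_SSE2_STREAM 1
#endif

// Hypothetical kernel: out[i] = a[i] + b[i].
// Assumptions: n is a multiple of 4 and all three pointers are 16-byte aligned.
inline void add_streaming(const std::int32_t* a, const std::int32_t* b,
                          std::int32_t* out, std::size_t n)
{
#ifdef HAVE_SSE2_STREAM
    for (std::size_t i = 0; i < n; i += 4) {
        const __m128i va = _mm_load_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb = _mm_load_si128(reinterpret_cast<const __m128i*>(b + i));
        // Non-temporal store: the write is hinted to bypass the cache, so the
        // output does not compete with the inputs for cache lines.
        _mm_stream_si128(reinterpret_cast<__m128i*>(out + i), _mm_add_epi32(va, vb));
    }
    _mm_sfence();   // make the streamed stores visible before returning
#else
    for (std::size_t i = 0; i < n; ++i)   // portable fallback: ordinary stores
        out[i] = a[i] + b[i];
#endif
}
```

This only pays off when the output really is write-only; if the same elements are read back soon afterwards, bypassing the cache will likely hurt instead.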
I guess the solution depends on the algorithm employed, the L1 cache size and the cache line size.
Though I am not sure how much performance improvement we will see with this.
We can probably introduce artificial reads which the compiler cannot optimize away and which do not hurt the computation at run time. A single artificial read should fill as many cache lines as are needed to hold one page. The algorithm should therefore be modified to compute the output array in blocks, similar to the way huge matrices are multiplied on GPUs, where the computation and the writing of the result are done block by block.
As pointed out earlier, the write to the output array should happen as a stream.
To bring in the artificial read, we should initialize the output array at compile time at the right places, once in each block, probably with 0 or 1.
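The "compute the output in blocks" part can be sketched as follows. The update formula is a made-up placeholder, accumulate_blocked is a hypothetical name, and the default block size of 1024 elements is only an assumption that would have to be tuned to the actual cache size:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical computation: every output element receives `passes` updates,
//   out[i] += in1[i] * (k + 1) + in2[i]   for k = 0 .. passes - 1.
// Instead of sweeping the full arrays once per pass, all passes are applied to
// one block at a time, so that block of out/in1/in2 can stay cached meanwhile.
// Assumptions: in1 and in2 are at least as long as out, and block > 0.
inline void accumulate_blocked(const std::vector<float>& in1,
                               const std::vector<float>& in2,
                               std::vector<float>& out,
                               int passes, std::size_t block = 1024)
{
    const std::size_t n = out.size();
    for (std::size_t begin = 0; begin < n; begin += block) {
        const std::size_t end = std::min(begin + block, n);
        for (int k = 0; k < passes; ++k)
            for (std::size_t i = begin; i < end; ++i)
                out[i] += in1[i] * static_cast<float>(k + 1) + in2[i];
    }
}
```

Whether the real problem allows this reordering depends on its data dependencies, as the question already notes.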

Supporting multiple pixel formats

I need to write some code which will operate on a number of pixel formats (e.g. A8R8G8B8, R8G8B8, R5G6B5, and even potentially floating point formats).
Ideally I would like to not have to write each function for each format, since that is a massive amount of near-identical code.
The only thing I could think of is some kind of interface that deals with the pixel format conversions, e.g.:
class IBitmap
{
public:
virtual unsigned getPixel(unsigned x, unsigned y)const=0;
virtual void setPixel(unsigned x, unsigned y, unsigned argb)=0;
virtual unsigned getWidth()const=0;
virtual unsigned getHeight()const=0;
};
However calling a virtual function for every get or set pixel operation is hardly fast, since not only is there the extra overhead of a virtual call, but it also far more importantly prevents inlining for something which is just a few instructions long.
Are there any other options which would allow for me to support all these formats efficiently? Generally speaking my code is likely to only operate on a small portion of the bitmap, and needs read/write access in many cases (blending).
Consider using templates. They allow you to write generically where possible, and the cost of specialising the code is paid at compile time.
This is sometimes known as static polymorphism.
http://en.wikipedia.org/wiki/Curiously_Recurring_Template_Pattern
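A small sketch of what the template approach could look like, using format descriptor types rather than full CRTP. The descriptor names, toArgb/fromArgb and invertRgb are all hypothetical, and only two of the formats from the question are shown:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical format descriptors: each one knows its storage type and how to
// convert a stored pixel to and from 8-bit-per-channel ARGB.
struct A8R8G8B8 {
    using Storage = std::uint32_t;
    static std::uint32_t toArgb(Storage p) { return p; }
    static Storage fromArgb(std::uint32_t argb) { return argb; }
};

struct R5G6B5 {
    using Storage = std::uint16_t;
    static std::uint32_t toArgb(Storage p) {
        const std::uint32_t r = (p >> 11) & 0x1F, g = (p >> 5) & 0x3F, b = p & 0x1F;
        // Expand to 8 bits by replicating the high bits into the low bits.
        return 0xFF000000u | ((r << 3 | r >> 2) << 16) | ((g << 2 | g >> 4) << 8)
                           | (b << 3 | b >> 2);
    }
    static Storage fromArgb(std::uint32_t argb) {
        return static_cast<Storage>((((argb >> 19) & 0x1F) << 11)
                                  | (((argb >> 10) & 0x3F) << 5)
                                  | ((argb >> 3) & 0x1F));
    }
};

// Written once for every format. The conversions are resolved at compile time
// and can be inlined into the loop, so there is no per-pixel virtual call.
template <class Format>
void invertRgb(typename Format::Storage* pixels, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i) {
        const std::uint32_t argb = Format::toArgb(pixels[i]);
        pixels[i] = Format::fromArgb(argb ^ 0x00FFFFFFu);  // flip R, G, B; keep alpha
    }
}
```

The compiler does generate one copy of invertRgb per format, but the source is written only once.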
On modern processors virtual functions are pretty darn fast, so I'm not sure I'd reject them out of hand. Run a timing test or two to be sure. Virtual functions give you runtime polymorphism, which allows you to conceivably write (or generate at compile time) less code. The static polymorphism represented by templates will probably cause the compiler to generate that massive amount of code that you're trying to avoid.
To avoid having to write loads of formats, write a Color class or similar that always stores a particular format (e.g. A8R8G8B8 or A32R32G32B32F), then have methods on that class to retrieve the value in different formats, e.g. get(FORMAT_R5G6B5) or similar. All your methods can then deal with that class. Whenever possible, convert the color only once (e.g. when drawing a rect, convert the rect color to the destination format, then write that to all the pixels; don't convert it for every pixel!).
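A minimal sketch of the "convert once" idea, assuming a 16-bit R5G6B5 destination. The Color interface and fillR5G6B5 shown here are hypothetical; the answer only describes them in words:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical colour class: always stores 8-bit-per-channel ARGB internally.
class Color {
public:
    explicit Color(std::uint32_t argb) : argb_(argb) {}

    std::uint32_t asA8R8G8B8() const { return argb_; }

    // Keep the top 5/6/5 bits of red/green/blue; alpha is dropped.
    std::uint16_t asR5G6B5() const {
        const std::uint32_t r = (argb_ >> 19) & 0x1F;
        const std::uint32_t g = (argb_ >> 10) & 0x3F;
        const std::uint32_t b = (argb_ >> 3) & 0x1F;
        return static_cast<std::uint16_t>((r << 11) | (g << 5) | b);
    }

private:
    std::uint32_t argb_;
};

// Fills `count` 16-bit pixels. The colour is converted once, before the loop,
// not once per pixel.
inline void fillR5G6B5(std::uint16_t* dst, std::size_t count, const Color& c)
{
    const std::uint16_t packed = c.asR5G6B5();
    for (std::size_t i = 0; i < count; ++i)
        dst[i] = packed;
}
```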
Avoid having a SetPixel/GetPixel method at all. It cannot be done efficiently, especially if you're doing colour conversions on the fly, and especially not if the optimizer decides not to inline the calls. You'd do better to expose memory buffers directly, and trust that the caller will correctly use the memory. That's how every other API I've used before does it.
(This is my new answer because after extensive research, I concluded that Adobe GIL is not suitable for this purpose.)
I would highly recommend the architecture and interface design of the Windows Imaging Component.
What I mean is:
Their interface is what I recommend. Not the implementation.
Everyone can implement something similar. In fact, the Mono Project (Wine) contains a partial implementation of the WIC.
Instead of the producer-consumer model, WIC uses a pipeline-storage model.
A read-only bitmap implements IWICBitmapSource interface
There are only 5 member methods.
A writable bitmap implements IWICBitmap interface and provides direct memory read/write access to the pixel data.
There are only 3 additional member methods, on top of IWICBitmapSource.
Being a "storage" class means that this class does not depend on any other bitmap instances when producing pixel values.
The overhead of converting everything to a single common format, then converting back after the operation, may be less than you think. If you can limit the number of pixels that are converted at a time, say to a single row, the intermediate results may remain in cache for the entire operation. I've actually seen a case where the end-to-end times were faster using this approach, although it involved one-bit pixels that are inherently difficult to work with when still packed.
You are right to be wary of a pixel-level function call. I've never seen a case where a pixel function didn't make the processing unacceptably slow.

Coding Practices which enable the compiler/optimizer to make a faster program

Many years ago, C compilers were not particularly smart. As a workaround, K&R invented the register keyword to hint to the compiler that it might be a good idea to keep this variable in an internal register. They also introduced the ternary operator to help generate better code.
As time passed, the compilers matured. They became very smart, in that their flow analysis allows them to make better decisions about what values to hold in registers than you could possibly make. The register keyword became unimportant.
FORTRAN can be faster than C for some sorts of operations, due to alias issues. In theory with careful coding, one can get around this restriction to enable the optimizer to generate faster code.
What coding practices are available that may enable the compiler/optimizer to generate faster code?
Identifying the platform and compiler you use, would be appreciated.
Why does the technique seem to work?
Sample code is encouraged.
Here is a related question
[Edit] This question is not about the overall process to profile, and optimize. Assume that the program has been written correctly, compiled with full optimization, tested and put into production. There may be constructs in your code that prohibit the optimizer from doing the best job that it can. What can you do to refactor that will remove these prohibitions, and allow the optimizer to generate even faster code?
[Edit] Offset related link
Here's a coding practice to help the compiler create fast code—any language, any platform, any compiler, any problem:
Do not use any clever tricks which force, or even encourage, the compiler to lay variables out in memory (including cache and registers) as you think best. First write a program which is correct and maintainable.
Next, profile your code.
Then, and only then, you might want to start investigating the effects of telling the compiler how to use memory. Make 1 change at a time and measure its impact.
Expect to be disappointed and to have to work very hard indeed for small performance improvements. Modern compilers for mature languages such as Fortran and C are very, very good. If you read an account of a 'trick' to get better performance out of code, bear in mind that the compiler writers have also read about it and, if it is worth doing, probably implemented it. They probably wrote what you read in the first place.
Write to local variables and not output arguments! This can be a huge help for getting around aliasing slowdowns. For example, if your code looks like
void DoSomething(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
for (int i=0; i<numFoo; i++)
{
barOut.munge(foo1, foo2[i]);
}
}
the compiler doesn't know that foo1 and barOut are different objects (&foo1 != &barOut), and thus has to reload foo1 each time through the loop. It also can't read foo2[i] until the write to barOut is finished. You could start messing around with restricted pointers, but it's just as effective (and much clearer) to do this:
void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
Foo barTemp = barOut;
for (int i=0; i<numFoo; i++)
{
barTemp.munge(foo1, foo2[i]);
}
barOut = barTemp;
}
It sounds silly, but the compiler can be much smarter dealing with the local variable, since it can't possibly overlap in memory with any of the arguments. This can help you avoid the dreaded load-hit-store (mentioned by Francis Boivin in this thread).
The order in which you traverse memory can have a profound impact on performance, and compilers aren't really good at figuring that out and fixing it. You have to be conscious of cache locality when you write code if you care about performance. For example, two-dimensional arrays in C are allocated in row-major format. Traversing arrays in column-major order will tend to cause more cache misses and make your program more memory-bound than processor-bound:
#define N 10000 // no trailing semicolon; the N x N array must still fit in memory
int matrix[N][N] = { ... };
//awesomely fast
long sum = 0;
for(int i = 0; i < N; i++){
for(int j = 0; j < N; j++){
sum += matrix[i][j];
}
}
//painfully slow
long sum = 0;
for(int i = 0; i < N; i++){
for(int j = 0; j < N; j++){
sum += matrix[j][i];
}
}
Generic Optimizations
Here are some of my favorite optimizations. I have actually reduced execution times and program sizes by using these.
Declare small functions as inline or macros
Each call to a function (or method) incurs overhead, such as pushing variables onto the stack. Some functions may incur an overhead on return as well. An inefficient function or method has fewer statements in its content than the combined overhead. These are good candidates for inlining, whether it be as #define macros or inline functions. (Yes, I know inline is only a suggestion, but in this case I consider it as a reminder to the compiler.)
Remove dead and redundant code
If the code isn't used or does not contribute to the program's result, get rid of it.
Simplify design of algorithms
I once removed a lot of assembly code and execution time from a program by writing down the algebraic equation it was calculating and then simplified the algebraic expression. The implementation of the simplified algebraic expression took up less room and time than the original function.
Loop Unrolling
Each loop has an overhead of incrementing and termination checking. To get an estimate of the performance factor, count the number of instructions in the overhead (minimum 3: increment, check, goto start of loop) and divide by the number of statements inside the loop. The lower the number the better.
Edit: provide an example of loop unrolling
Before:
unsigned int sum = 0;
for (size_t i = 0; i < BYTES_TO_CHECKSUM; ++i)
{
sum += *buffer++;
}
After unrolling:
unsigned int sum = 0;
size_t i = 0;
const size_t STATEMENTS_PER_LOOP = 8;
// Run the unrolled body only while a full group of 8 bytes remains.
for (i = 0; i + STATEMENTS_PER_LOOP <= BYTES_TO_CHECKSUM; i += STATEMENTS_PER_LOOP)
{
sum += *buffer++; // 1
sum += *buffer++; // 2
sum += *buffer++; // 3
sum += *buffer++; // 4
sum += *buffer++; // 5
sum += *buffer++; // 6
sum += *buffer++; // 7
sum += *buffer++; // 8
}
// Handle the remainder:
for (; i < BYTES_TO_CHECKSUM; ++i)
{
sum += *buffer++;
}
In this case, a secondary benefit is gained: more statements are executed before the processor has to reload the instruction cache.
I've had amazing results when I unrolled a loop to 32 statements. This was one of the bottlenecks since the program had to calculate a checksum on a 2GB file. This optimization combined with block reading improved performance from 1 hour to 5 minutes. Loop unrolling provided excellent performance in assembly language too, my memcpy was a lot faster than the compiler's memcpy. -- T.M.
Reduction of if statements
Processors hate branches, or jumps, since they can force the processor to reload its queue of instructions.
Boolean Arithmetic (Edited: applied code format to code fragment, added example)
Convert if statements into boolean assignments. Some processors can conditionally execute instructions without branching:
bool status = true;
status = status && /* first test */;
status = status && /* second test */;
The short circuiting of the Logical AND operator (&&) prevents execution of the tests if the status is false.
Example:
struct Reader_Interface
{
virtual bool write(unsigned int value) = 0;
};
struct Rectangle
{
unsigned int origin_x;
unsigned int origin_y;
unsigned int height;
unsigned int width;
bool write(Reader_Interface * p_reader)
{
bool status = false;
if (p_reader)
{
status = p_reader->write(origin_x);
status = status && p_reader->write(origin_y);
status = status && p_reader->write(height);
status = status && p_reader->write(width);
}
return status;
}
};
Factor Variable Allocation outside of loops
If a variable is created on the fly inside a loop, move the creation / allocation to before the loop. In most instances, the variable doesn't need to be allocated during each iteration.
Factor constant expressions outside of loops
If a calculation or variable value does not depend on the loop index, move it outside (before) the loop.
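A tiny illustration of hoisting a loop-invariant expression; the function scale and its parameters are made up for the example. Compilers often do this on their own, but they may not when the invariant involves a call they cannot see through:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical example: scale every sample by gain / sqrt(norm).
inline void scale(std::vector<float>& v, float gain, float norm)
{
    // The factor does not depend on the loop index, so it is computed once
    // before the loop instead of once per element.
    const float factor = gain / std::sqrt(norm);
    for (std::size_t i = 0, n = v.size(); i < n; ++i)
        v[i] *= factor;
}
```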
I/O in blocks
Read and write data in large chunks (blocks). The bigger the better. For example, reading one octet at a time is less efficient than reading 1024 octets with one read.
Example:
static const char Menu_Text[] = "\n"
"1) Print\n"
"2) Insert new customer\n"
"3) Destroy\n"
"4) Launch Nasal Demons\n"
"Enter selection: ";
static const size_t Menu_Text_Length = sizeof(Menu_Text) - sizeof('\0');
//...
std::cout.write(Menu_Text, Menu_Text_Length);
The efficiency of this technique can be visually demonstrated. :-)
Don't use printf family for constant data
Constant data can be output using a block write. Formatted write will waste time scanning the text for formatting characters or processing formatting commands. See above code example.
Format to memory, then write
Format to a char array using multiple sprintf, then use fwrite. This also allows the data layout to be broken up into "constant sections" and variable sections. Think of mail-merge.
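A rough sketch of "format to memory, then write". It uses snprintf, the bounded variant of the sprintf mentioned above, so that the buffer cannot overflow; format_report, write_report and the 256-byte buffer are hypothetical choices for this example:

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: formats one line per value into buf and returns the
// number of bytes produced. Stops early if the next line would not fit.
inline std::size_t format_report(char* buf, std::size_t cap,
                                 const int* values, std::size_t count)
{
    std::size_t used = 0;
    for (std::size_t i = 0; i < count && used < cap; ++i) {
        const int n = std::snprintf(buf + used, cap - used,
                                    "value[%zu] = %d\n", i, values[i]);
        if (n < 0 || static_cast<std::size_t>(n) >= cap - used)
            break;                        // encoding error or truncated line
        used += static_cast<std::size_t>(n);
    }
    return used;
}

// All formatting happens in memory; the stream sees a single block write.
inline bool write_report(std::FILE* out, const int* values, std::size_t count)
{
    char buf[256];
    const std::size_t used = format_report(buf, sizeof buf, values, count);
    return std::fwrite(buf, 1, used, out) == used;
}
```

A call such as write_report(stdout, values, count) then issues one fwrite for the whole report instead of one formatted write per line.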
Declare constant text (string literals) as static const
When variables are declared without the static, some compilers may allocate space on the stack and copy the data from ROM. These are two unnecessary operations. This can be fixed by using the static prefix.
Lastly, Code like the compiler would
Sometimes, the compiler can optimize several small statements better than one complicated version. Also, writing code to help the compiler optimize helps too. If I want the compiler to use special block transfer instructions, I will write code that looks like it should use the special instructions.
The optimizer isn't really in control of the performance of your program, you are. Use appropriate algorithms and structures and profile, profile, profile.
That said, you shouldn't inner-loop on a small function from one file in another file, as that stops it from being inlined.
Avoid taking the address of a variable if possible. Asking for a pointer isn't "free" as it means the variable needs to be kept in memory. Even an array can be kept in registers if you avoid pointers — this is essential for vectorizing.
Which leads to the next point, read the ^#$# manual! GCC can vectorize plain C code if you sprinkle a __restrict__ here and an __attribute__( __aligned__ ) there. If you want something very specific from the optimizer, you might have to be specific.
On most modern processors, the biggest bottleneck is memory.
Aliasing: Load-Hit-Store can be devastating in a tight loop. If you're reading one memory location and writing to another and know that they are disjoint, carefully putting an alias keyword on the function parameters can really help the compiler generate faster code. However if the memory regions do overlap and you used 'alias', you're in for a good debugging session of undefined behaviors!
Cache-miss: Not really sure how you can help the compiler since it's mostly algorithmic, but there are intrinsics to prefetch memory.
Also don't try to convert floating point values to int and vice versa too much since they use different registers and converting from one type to another means calling the actual conversion instruction, writing the value to memory and reading it back in the proper register set.
The vast majority of code that people write will be I/O bound (I believe all the code I have written for money in the last 30 years has been so bound), so the activities of the optimiser for most folks will be academic.
However, I would remind people that for the code to be optimised you have to tell the compiler to optimise it - lots of people (including me when I forget) post C++ benchmarks here that are meaningless without the optimiser being enabled.
Use const correctness as much as possible in your code. It allows the compiler to optimize much better.
This document has loads of other optimization tips: CPP optimizations (a somewhat old document, though)
highlights:
use constructor initialization lists
use prefix operators
use explicit constructors
inline functions
avoid temporary objects
be aware of the cost of virtual functions
return objects via reference parameters
consider per class allocation
consider stl container allocators
the 'empty member' optimization
etc
Attempt to program using static single assignment as much as possible. SSA is exactly the same as what you end up with in most functional programming languages, and that's what most compilers convert your code to in order to do their optimizations, because it's easier to work with. By doing this, places where the compiler might get confused are brought to light. It also makes all but the worst register allocators work as well as the best register allocators, and allows you to debug more easily because you almost never have to wonder where a variable got its value from, as there was only one place where it was assigned.
Avoid global variables.
When working with data by reference or pointer pull that into local variables, do your work, and then copy it back. (unless you have a good reason not to)
Make use of the almost free comparison against 0 that most processors give you when doing math or logic operations. You almost always get a flag for ==0 and <0, from which you can easily get 3 conditions:
x= f();
if(!x){
a();
} else if (x<0){
b();
} else {
c();
}
is almost always cheaper than testing for other constants.
Another trick is to use subtraction to eliminate one compare in range testing.
#define FOO_MIN 8
#define FOO_MAX 199
int good_foo(int foo) {
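unsigned int bar = (unsigned int)foo - FOO_MIN; // wraps to a huge value when foo < FOO_MIN
int rc = (bar <= (unsigned int)(FOO_MAX-FOO_MIN)) ? 1 : 0; // 1 when FOO_MIN <= foo <= FOO_MAX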
return rc;
}
This can very often avoid a jump in languages that do short circuiting on boolean expressions and avoids the compiler having to try to figure out how to handle keeping up with the result of the first comparison while doing the second and then combining them.
This may look like it has the potential to use up an extra register, but it almost never does. Often you don't need foo anymore anyway, and if you do rc isn't used yet so it can go there.
When using the string functions in C (strcpy, memcpy, ...) remember what they return -- the destination! You can often get better code by 'forgetting' your copy of the pointer to the destination and just grabbing it back from the return value of these functions.
Never overlook the opportunity to return exactly the same thing the last function you called returned. Compilers are not so great at picking up that:
foo_t * make_foo(int a, int b, int c) {
foo_t * x = malloc(sizeof *x);
if (!x) {
// return NULL;
return x; // x is NULL, already in the register used for returns, so duh
}
x->a= a;
x->b = b;
x->c = c;
return x;
}
Of course, you could reverse the logic on that if and only have one return point.
(tricks I recalled later)
Declaring functions as static when you can is always a good idea. If the compiler can prove to itself that it has accounted for every caller of a particular function then it can break the calling conventions for that function in the name of optimization. Compilers can often avoid moving parameters into registers or stack positions that called functions usually expect their parameters to be in (it has to deviate in both the called function and the location of all callers to do this). The compiler can also often take advantage of knowing what memory and registers the called function will need and avoid generating code to preserve variable values that are in registers or memory locations that the called function doesn't disturb. This works particularly well when there are few calls to a function. This gets much of the benefit of inlining code, but without actually inlining.
I wrote an optimizing C compiler and here are some very useful things to consider:
Make most functions static. This allows interprocedural constant propagation and alias analysis to do its job, otherwise the compiler needs to presume that the function can be called from outside the translation unit with completely unknown values for the parameters. If you look at the well-known open-source libraries they all mark functions static except the ones that really need to be extern.
If global variables are used, mark them static and constant if possible. If they are initialized once (read-only), it's better to use an initializer list like static const int VAL[] = {1,2,3,4}, otherwise the compiler might not discover that the variables are actually initialized constants and will fail to replace loads from the variable with the constants.
NEVER use a goto to the inside of a loop, the loop will not be recognized anymore by most compilers and none of the most important optimizations will be applied.
Use pointer parameters only if necessary, and mark them restrict if possible. This helps alias analysis a lot because the programmer guarantees there is no alias (the interprocedural alias analysis is usually very primitive). Very small struct objects should be passed by value, not by reference.
Use arrays instead of pointers whenever possible, especially inside loops (a[i]). An array usually offers more information for alias analysis and after some optimizations the same code will be generated anyway (search for loop strength reduction if curious). This also increases the chance for loop-invariant code motion to be applied.
Try to hoist outside the loop calls to large functions or external functions that don't have side-effects (don't depend on the current loop iteration). Small functions are in many cases inlined or converted to intrinsics that are easy to hoist, but large functions might seem for the compiler to have side-effects when they actually don't. Side-effects for external functions are completely unknown, with the exception of some functions from the standard library which are sometimes modeled by some compilers, making loop-invariant code motion possible.
When writing tests with multiple conditions place the most likely one first. if(a || b || c) should be if(b || a || c) if b is more likely to be true than the others. Compilers usually don't know anything about the possible values of the conditions and which branches are taken more (they could be known by using profile information, but few programmers use it).
Using a switch is faster than doing a test like if(a || b || ... || z). Check first if your compiler does this automatically, some do and it's more readable to have the if though.
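As a small illustration of the hoisting tip in the list above, here is a minimal sketch. lookup_gain is a hypothetical stand-in for a large or external function without side effects, and the gain formula is invented for the example; the point is only that the call is made once, outside the loop.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a large or external function with no side
// effects; the formula is invented for this example.
double lookup_gain(int channel) {
    assert(channel >= 0);
    return 1.0 + 0.5 * channel;
}

// The call does not depend on the loop iteration, so it is made once
// before the loop instead of once per sample.
void apply_gain(std::vector<double>& samples, int channel) {
    const double gain = lookup_gain(channel);  // hoisted by hand
    for (double& s : samples) {
        s *= gain;
    }
}
```

A compiler that can see the body of lookup_gain may well do this on its own; hoisting by hand matters mostly when the function lives in another translation unit or library.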
In the case of embedded systems and code written in C/C++, I try and avoid dynamic memory allocation as much as possible. The main reason I do this is not necessarily performance but this rule of thumb does have performance implications.
Algorithms used to manage the heap are notoriously slow in some platforms (e.g., vxworks). Even worse, the time that it takes to return from a call to malloc is highly dependent on the current state of the heap. Therefore, any function that calls malloc is going to take a performance hit that cannot be easily accounted for. That performance hit may be minimal if the heap is still clean but after that device runs for a while the heap can become fragmented. The calls are going to take longer and you cannot easily calculate how performance will degrade over time. You cannot really produce a worst-case estimate. The optimizer cannot provide you with any help in this case either. To make matters even worse, if the heap becomes too heavily fragmented, the calls will start failing altogether. The solution is to use memory pools (e.g., glib slices) instead of the heap. The allocation calls are going to be much faster and deterministic if you do it right.
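To make the memory pool idea concrete, here is a minimal sketch of a fixed-size block pool. FixedPool and its interface are hypothetical names made up for this example (this is not the glib slice API), and it assumes single-threaded use and blocks of one size only.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Minimal fixed-size block pool (hypothetical, single-threaded).
// All storage is reserved up front; allocate() and release() are O(1)
// and never call malloc, so their cost does not depend on heap state.
template <std::size_t BlockSize, std::size_t BlockCount>
class FixedPool {
    static_assert(BlockSize >= sizeof(void*), "block must hold a free-list link");

    struct Node { Node* next; };
    static constexpr std::size_t kAlign = alignof(std::max_align_t);
    // Round the block size up so every block is suitably aligned.
    static constexpr std::size_t kStride = (BlockSize + kAlign - 1) / kAlign * kAlign;

public:
    FixedPool() {
        // Thread every block onto the free list.
        for (std::size_t i = 0; i < BlockCount; ++i) {
            free_ = new (storage_ + i * kStride) Node{free_};
        }
    }

    // Returns a block, or nullptr when the pool is exhausted.
    void* allocate() {
        if (free_ == nullptr) return nullptr;
        Node* n = free_;
        free_ = n->next;
        return n;
    }

    // Gives a block obtained from allocate() back to the pool.
    void release(void* p) {
        assert(p != nullptr);
        free_ = new (p) Node{free_};
    }

private:
    alignas(std::max_align_t) unsigned char storage_[kStride * BlockCount];
    Node* free_ = nullptr;
};
```

Exhaustion is reported by returning nullptr rather than falling back to the heap, so the worst case stays bounded; whether that is the right policy depends on the device.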
A dumb little tip, but one that will save you some microscopic amounts of speed and code.
Always pass function arguments in the same order.
If you have f_1(x, y, z) which calls f_2, declare f_2 as f_2(x, y, z). Do not declare it as f_2(x, z, y).
The reason for this is that C/C++ platform ABI (AKA calling convention) promises to pass arguments in particular registers and stack locations. When the arguments are already in the correct registers then it does not have to move them around.
While reading disassembled code I've seen some ridiculous register shuffling because people didn't follow this rule.
Two coding techniques I didn't see in the list above:
Bypass the linker by writing code as a single source
While separate compilation is really nice for compile time, it is very bad for optimization. Basically the compiler can't optimize beyond the compilation unit; that is the linker's domain.
But if you design your program well you can also compile it through a single common source. That is, instead of compiling unit1.c and unit2.c and then linking both objects, compile all.c, which merely #includes unit1.c and unit2.c. That way you benefit from all the compiler optimizations.
It's very much like writing header-only programs in C++ (and even easier to do in C).
This technique is easy enough if you write your program to enable it from the beginning, but you must also be aware that it changes part of C's semantics, and you can run into problems such as static variable or macro collisions. For most programs it's easy enough to overcome the small problems that occur. Also be aware that compiling as a single source is much slower and may take a huge amount of memory (usually not a problem on modern systems).
Using this simple technique I happened to make some programs I wrote ten times faster!
Like the register keyword, this trick could also become obsolete soon. Optimizing through the linker is beginning to be supported by compilers; see gcc: Link time optimization.
Separate atomic tasks in loops
This one is more tricky. It's about the interaction between algorithm design and the way the optimizer manages cache and register allocation. Quite often programs have to loop over some data structure and perform some actions for each item. Quite often the actions performed can be split into two logically independent tasks. If that is the case, you can write exactly the same program with two loops over the same bounds, each performing exactly one task. In some cases writing it this way can be faster than the single loop (the details are more complex, but one explanation is that in the simple-task case all variables can be kept in processor registers, while in the more complex one that's not possible, so some registers must be written to memory and read back later, and that cost is higher than the additional flow control).
Be careful with this one (profile with and without this trick): as with register, it may just as well make performance worse rather than better.
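A minimal sketch of what splitting a loop into independent tasks looks like. The Stats type and both function names are made up for the example; both versions compute the same results, and which one is faster has to be measured on the target.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Stats {
    double sum;
    double max;
};

// One loop, two independent tasks per element.
Stats scale_and_reduce_fused(const std::vector<double>& in,
                             std::vector<double>& out, double k) {
    assert(!in.empty() && out.size() == in.size());
    Stats s{0.0, in[0]};
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] * k;                // task 1: transform
        s.sum += in[i];                    // task 2: reduce
        if (in[i] > s.max) s.max = in[i];
    }
    return s;
}

// The same work as two loops over the same bounds, one task each.
Stats scale_and_reduce_split(const std::vector<double>& in,
                             std::vector<double>& out, double k) {
    assert(!in.empty() && out.size() == in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] * k;                // task 1 only
    }
    Stats s{0.0, in[0]};
    for (std::size_t i = 0; i < in.size(); ++i) {
        s.sum += in[i];                    // task 2 only
        if (in[i] > s.max) s.max = in[i];
    }
    return s;
}
```

Note the trade-off: the split version reads the input twice, so for data that does not fit in cache the extra memory traffic can outweigh any gain from lower register pressure.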
I've actually seen this done in SQLite and they claim it results in performance boosts ~5%: Put all your code in one file or use the preprocessor to do the equivalent to this. This way the optimizer will have access to the entire program and can do more interprocedural optimizations.
Most modern compilers should do a good job speeding up tail recursion, because the function calls can be optimized out.
Example:
int fac2(int x, int cur) {
if (x == 1) return cur;
return fac2(x - 1, cur * x);
}
int fac(int x) {
return fac2(x, 1);
}
Of course this example doesn't have any bounds checking.
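For comparison, here is a sketch of the same accumulator-style recursion next to the loop that tail-call elimination effectively produces. The function names are made up for the example, and the C++ standard does not guarantee the transformation; compilers typically do it only with optimization enabled.

```cpp
#include <cassert>

// Accumulator-style tail recursion, same shape as fac2 above.
unsigned long long fac_tail(unsigned n, unsigned long long acc = 1) {
    assert(n <= 20);  // 21! no longer fits in 64 bits
    if (n <= 1) return acc;
    return fac_tail(n - 1, acc * n);  // nothing left to do after the call
}

// The loop a compiler can turn the tail call into: the parameters
// become loop variables and the call becomes a jump.
unsigned long long fac_loop(unsigned n) {
    assert(n <= 20);
    unsigned long long acc = 1;
    while (n > 1) {
        acc *= n;
        --n;
    }
    return acc;
}
```

Unlike the original example, both versions stop at n <= 1, so an argument of 0 terminates.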
Late Edit
While I have no direct knowledge of the code; it seems clear that the requirements of using CTEs on SQL Server were specifically designed so that it can optimize via tail-end recursion.
Don't do the same work over and over again!
A common antipattern that I see goes along these lines:
void Function()
{
MySingleton::GetInstance()->GetAggregatedObject()->DoSomething();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingElse();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingCool();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingReallyNeat();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingYetAgain();
}
The compiler actually has to call all of those functions all of the time. Assuming you, the programmer, know that the aggregated object isn't changing over the course of these calls, for the love of all that is holy...
void Function()
{
MySingleton* s = MySingleton::GetInstance();
AggregatedObject* ao = s->GetAggregatedObject();
ao->DoSomething();
ao->DoSomethingElse();
ao->DoSomethingCool();
ao->DoSomethingReallyNeat();
ao->DoSomethingYetAgain();
}
In the case of the singleton getter the calls may not be too costly, but it is certainly a cost (typically, "check to see if the object has been created, if it hasn't, create it, then return it"). The more complicated this chain of getters becomes, the more wasted time we'll have.
Use the most local scope possible for all variable declarations.
Use const whenever possible
Don't use register unless you plan to profile both with and without it
The first two of these, especially the first, help the optimizer analyze the code. They especially help it make good choices about which variables to keep in registers.
Blindly using the register keyword is as likely to hurt as to help your optimization. It's just too hard to know what will matter until you look at the assembly output or profile.
There are other things that matter to getting good performance out of code; designing your data structures to maximize cache coherency for instance. But the question was about the optimizer.
Align your data to native/natural boundaries.
I was reminded of something that I encountered once, where the symptom was simply that we were running out of memory, but the result was substantially increased performance (as well as huge reductions in memory footprint).
The problem in this case was that the software we were using made tons of little allocations. Like, allocating four bytes here, six bytes there, etc. A lot of little objects, too, running in the 8-12 byte range. The problem wasn't so much that the program needed lots of little things, it's that it allocated lots of little things individually, which bloated each allocation out to (on this particular platform) 32 bytes.
Part of the solution was to put together an Alexandrescu-style small object pool, but extend it so I could allocate arrays of small objects as well as individual items. This helped immensely in performance as well since more items fit in the cache at any one time.
The other part of the solution was to replace the rampant use of manually-managed char* members with an SSO (small-string optimization) string. The minimum allocation being 32 bytes, I built a string class that had an embedded 28-character buffer behind a char*, so 95% of our strings didn't need to do an additional allocation (and then I manually replaced almost every appearance of char* in this library with this new class, that was fun or not). This helped a ton with memory fragmentation as well, which then increased the locality of reference for other pointed-to objects, and similarly there were performance gains.
A neat technique I learned from @MSalters' comment on this answer allows compilers to do copy elision even when returning different objects according to some condition:
// before
BigObject a, b;
if(condition)
return a;
else
return b;
// after
BigObject a, b;
if(!condition)
swap(a,b);
return a;
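A self-contained sketch of the single-return form. The BigObject definition and the make_candidate function are assumptions made for the example; whether the copy is actually elided is up to the compiler, so the sketch only shows the shape of the code.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical large type, standing in for BigObject from the snippet.
struct BigObject {
    std::vector<int> data;
};

// Builds two candidates and returns the second one when use_second is true.
BigObject make_candidate(bool use_second, std::size_t n) {
    assert(n > 0);
    BigObject a{std::vector<int>(n, 1)};
    BigObject b{std::vector<int>(n, 2)};
    if (use_second) {
        std::swap(a, b);  // cheap here: the vectors exchange their buffers
    }
    return a;  // every path returns the same named object
}
```

The trick only pays off when swapping is much cheaper than copying, as it is for types that own their data through a pointer.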
If you've got small functions you call repeatedly, I have in the past got large gains by putting them in headers as "static inline". Function calls on the ix86 are surprisingly expensive.
Reimplementing recursive functions in a non-recursive way using an explicit stack can also gain a lot, but then you really are in the realm of development time vs gain.
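As an illustration of replacing recursion with an explicit stack, here is a minimal sketch that sums a binary tree both ways. The Node type and function names are made up for the example.

```cpp
#include <cassert>
#include <vector>

struct Node {
    int value;
    const Node* left;
    const Node* right;
};

// Straightforward recursive version: one function call per node.
int sum_recursive(const Node* n) {
    if (n == nullptr) return 0;
    return n->value + sum_recursive(n->left) + sum_recursive(n->right);
}

// Same traversal with an explicit stack of pending nodes.
int sum_iterative(const Node* root) {
    int total = 0;
    std::vector<const Node*> pending;
    if (root != nullptr) pending.push_back(root);
    while (!pending.empty()) {
        const Node* n = pending.back();
        pending.pop_back();
        assert(n != nullptr);  // only non-null nodes are ever pushed
        total += n->value;
        if (n->left != nullptr) pending.push_back(n->left);
        if (n->right != nullptr) pending.push_back(n->right);
    }
    return total;
}
```

The iterative version trades call overhead and call-stack depth for heap growth of the vector, so it is also worth profiling before committing to it.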
Here's my second piece of optimisation advice. As with my first piece of advice this is general purpose, not language or processor specific.
Read the compiler manual thoroughly and understand what it is telling you. Use the compiler to its utmost.
I agree with one or two of the other respondents who have identified selecting the right algorithm as critical to squeezing performance out of a program. Beyond that the rate of return (measured in code execution improvement) on the time you invest in using the compiler is far higher than the rate of return in tweaking the code.
Yes, compiler writers are not from a race of coding giants and compilers contain mistakes and what should, according to the manual and according to compiler theory, make things faster sometimes makes things slower. That's why you have to take one step at a time and measure before- and after-tweak performance.
And yes, ultimately, you might be faced with a combinatorial explosion of compiler flags so you need to have a script or two to run make with various compiler flags, queue the jobs on the large cluster and gather the run time statistics. If it's just you and Visual Studio on a PC you will run out of interest long before you have tried enough combinations of enough compiler flags.
Regards
Mark
When I first pick up a piece of code I can usually get a factor of 1.4 -- 2.0 times more performance (ie the new version of the code runs in 1/1.4 or 1/2 of the time of the old version) within a day or two by fiddling with compiler flags. Granted, that may be a comment on the lack of compiler savvy among the scientists who originate much of the code I work on, rather than a symptom of my excellence. Having set the compiler flags to max (and it's rarely just -O3) it can take months of hard work to get another factor of 1.05 or 1.1
When DEC came out with its alpha processors, there was a recommendation to keep the number of arguments to a function under 7, as the compiler would always try to put up to 6 arguments in registers automatically.
For performance, focus first on writing maintainable code - componentized, loosely coupled, etc, so when you have to isolate a part either to rewrite, optimize or simply profile, you can do it without much effort.
Optimizer will help your program's performance marginally.
You're getting good answers here, but they assume your program is pretty close to optimal to begin with, and you say
Assume that the program has been written correctly, compiled with full optimization, tested and put into production.
In my experience, a program may be written correctly, but that does not mean it is near optimal. It takes extra work to get to that point.
If I can give an example, this answer shows how a perfectly reasonable-looking program was made over 40 times faster by macro-optimization. Big speedups can't be done in every program as first written, but in many (except for very small programs), it can, in my experience.
After that is done, micro-optimization (of the hot-spots) can give you a good payoff.
I use the Intel compiler, on both Windows and Linux.
When more or less done, I profile the code, then focus on the hotspots and try to change the code to let the compiler do a better job.
If the code is computational and contains a lot of loops, the vectorization report in the Intel compiler is very helpful; look for 'vec-report' in the help.
So the main idea: polish the performance-critical code. As for the rest, the priority is to be correct and maintainable: short functions, and clear code that can be understood one year later.
One optimization I have used in C++ is creating a constructor that does nothing. One must manually call an init() in order to put the object into a working state.
This has benefit in the case where I need a large vector of these classes.
I call reserve() to allocate the space for the vector, but the constructor does not actually touch the page of memory the object is on. So I have spent some address space, but not actually consumed a lot of physical memory. I avoid the page faults associated with the construction costs.
As I generate objects to fill the vector, I set them using init(). This limits my total page faults, and avoids the need to resize() the vector while filling it.
One thing I've done is try to keep expensive actions to places where the user might expect the program to delay a bit. Overall performance is related to responsiveness, but isn't quite the same, and for many things responsiveness is the more important part of performance.
The last time I really had to do improvements in overall performance, I kept an eye out for suboptimal algorithms, and looked for places that were likely to have cache problems. I profiled and measured performance first, and again after each change. Then the company collapsed, but it was interesting and instructive work anyway.
I have long suspected, but never proved that declaring arrays so that they hold a power of 2, as the number of elements, enables the optimizer to do a strength reduction by replacing a multiply by a shift by a number of bits, when looking up individual elements.
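To show what that strength reduction would amount to, here is a small sketch with a row width of 64. The constants and function names are assumptions for the example; whether a given compiler emits a shift (or whether it matters on current hardware) is something to check in the generated assembly.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCols = 64;      // row width, a power of two (2^6)
constexpr std::size_t kColsShift = 6;  // log2(kCols)
static_assert((std::size_t{1} << kColsShift) == kCols, "shift must match width");

// Index computed the obvious way, with a multiply.
std::size_t index_mul(std::size_t row, std::size_t col) {
    assert(col < kCols);
    return row * kCols + col;
}

// What the optimizer can reduce it to when the width is a power of two.
std::size_t index_shift(std::size_t row, std::size_t col) {
    assert(col < kCols);
    return (row << kColsShift) + col;
}
```

Both functions return the same index for every valid input; the second just spells out the shift by hand.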
Put small and/or frequently called functions at the top of the source file. That makes it easier for the compiler to find opportunities for inlining.

How does BLAS get such extreme performance?

Out of curiosity I decided to benchmark my own matrix multiplication function versus the BLAS implementation... I was to say the least surprised at the result:
Custom Implementation, 10 trials of
1000x1000 matrix multiplication:
Took: 15.76542 seconds.
BLAS Implementation, 10 trials of
1000x1000 matrix multiplication:
Took: 1.32432 seconds.
This is using single precision floating point numbers.
My Implementation:
template<class ValT>
void mmult(const ValT* A, int ADim1, int ADim2, const ValT* B, int BDim1, int BDim2, ValT* C)
{
if ( ADim2!=BDim1 )
throw std::runtime_error("Error sizes off");
memset((void*)C,0,sizeof(ValT)*ADim1*BDim2);
int cc2,cc1,cr1;
for ( cc2=0 ; cc2<BDim2 ; ++cc2 )
for ( cc1=0 ; cc1<ADim2 ; ++cc1 )
for ( cr1=0 ; cr1<ADim1 ; ++cr1 )
C[cc2*ADim1+cr1] += A[cc1*ADim1+cr1]*B[cc2*BDim1+cc1];
}
I have two questions:
Given that a matrix-matrix multiplication, say nxm * mxn, requires n*n*m multiplications, so in the case above 1000^3 or 1e9 operations: how is it possible on my 2.6 GHz processor for BLAS to do 10*1e9 operations in 1.32 seconds? Even if multiplications were a single operation and there was nothing else being done, it should take ~4 seconds.
Why is my implementation so much slower?
A good starting point is the great book The Science of Programming Matrix Computations by Robert A. van de Geijn and Enrique S. Quintana-Ortí. They provide a free download version.
BLAS is divided into three levels:
Level 1 defines a set of linear algebra functions that operate on vectors only. These functions benefit from vectorization (e.g. from using SSE).
Level 2 functions are matrix-vector operations, e.g. some matrix-vector product. These functions could be implemented in terms of Level1 functions. However, you can boost the performance of these functions if you can provide a dedicated implementation that makes use of some multiprocessor architecture with shared memory.
Level 3 functions are operations like the matrix-matrix product. Again you could implement them in terms of Level2 functions. But Level3 functions perform O(N^3) operations on O(N^2) data. So if your platform has a cache hierarchy then you can boost performance if you provide a dedicated implementation that is cache optimized/cache friendly. This is nicely described in the book. The main boost of Level3 functions comes from cache optimization. This boost significantly exceeds the second boost from parallelism and other hardware optimizations.
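The layering described above can be sketched as follows. These are simplified reference routines with made-up names and signatures (axpy_ref, gemv_ref, gemm_ref); they are not the actual BLAS interface, which takes additional arguments such as strides, leading dimensions and transpose flags. Column-major storage is assumed.

```cpp
#include <cassert>
#include <cstddef>

// Level 1 style: y += alpha * x, for vectors of length n.
void axpy_ref(std::size_t n, double alpha, const double* x, double* y) {
    assert(n == 0 || (x != nullptr && y != nullptr));
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += alpha * x[i];
    }
}

// Level 2 style: y += A * x, with A an m x n column-major matrix.
// Built from one Level 1 call per column of A.
void gemv_ref(std::size_t m, std::size_t n,
              const double* A, const double* x, double* y) {
    for (std::size_t j = 0; j < n; ++j) {
        axpy_ref(m, x[j], A + j * m, y);
    }
}

// Level 3 style: C += A * B, with A m x k, B k x n and C m x n,
// all column-major. Built from one Level 2 call per column of C.
void gemm_ref(std::size_t m, std::size_t n, std::size_t k,
              const double* A, const double* B, double* C) {
    for (std::size_t j = 0; j < n; ++j) {
        gemv_ref(m, k, A, B + j * k, C + j * m);
    }
}
```

This layered version is correct but streams all of A through the cache once per column of C, which is exactly the reuse a dedicated, cache-aware Level 3 implementation recovers.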
By the way, most (or even all) of the high performance BLAS implementations are NOT implemented in Fortran. ATLAS is implemented in C. GotoBLAS/OpenBLAS is implemented in C and its performance critical parts in Assembler. Only the reference implementation of BLAS is implemented in Fortran. However, all these BLAS implementations provide a Fortran interface such that it can be linked against LAPACK (LAPACK gains all its performance from BLAS).
Optimized compilers play a minor role in this respect (and for GotoBLAS/OpenBLAS the compiler does not matter at all).
IMHO no BLAS implementation uses algorithms like the Coppersmith–Winograd algorithm or the Strassen algorithm. The likely reasons are:
Maybe it's not possible to provide a cache optimized implementation of these algorithms (i.e. you would lose more than you would win)
These algorithms are numerically not stable. As BLAS is the computational kernel of LAPACK this is a no-go.
Although these algorithms have a nice time complexity on paper, the Big O notation hides a large constant, so it only starts to become viable for extremely large matrices.
Edit/Update:
The new and groundbreaking papers on this topic are the BLIS papers. They are exceptionally well written. For my lecture "Software Basics for High Performance Computing" I implemented the matrix-matrix product following their paper. Actually I implemented several variants of the matrix-matrix product. The simplest variant is written entirely in plain C and has less than 450 lines of code. All the other variants merely optimize the loops
for (l=0; l<MR*NR; ++l) {
AB[l] = 0;
}
for (l=0; l<kc; ++l) {
for (j=0; j<NR; ++j) {
for (i=0; i<MR; ++i) {
AB[i+j*MR] += A[i]*B[j];
}
}
A += MR;
B += NR;
}
The overall performance of the matrix-matrix product only depends on these loops. About 99.9% of the time is spent here. In the other variants I used intrinsics and assembler code to improve the performance. You can see the tutorial going through all the variants here:
ulmBLAS: Tutorial on GEMM (Matrix-Matrix Product)
Together with the BLIS papers it becomes fairly easy to understand how libraries like Intel MKL can gain such a performance. And why it does not matter whether you use row or column major storage!
The final benchmarks are here (we called our project ulmBLAS):
Benchmarks for ulmBLAS, BLIS, MKL, openBLAS and Eigen
Another Edit/Update:
I also wrote some tutorial on how BLAS gets used for numerical linear algebra problems like solving a system of linear equations:
High Performance LU Factorization
(This LU factorization is for example used by Matlab for solving a system of linear equations.)
I hope to find time to extend the tutorial to describe and demonstrate how to realise a highly scalable parallel implementation of the LU factorization like in PLASMA.
Ok, here you go: Coding a Cache Optimized Parallel LU Factorization
P.S.: I also did make some experiments on improving the performance of uBLAS. It actually is pretty simple to boost (yeah, play on words :) ) the performance of uBLAS:
Experiments on uBLAS.
Here a similar project with BLAZE:
Experiments on BLAZE.
So first of all BLAS is just an interface of about 50 functions. There are many competing implementations of the interface.
Firstly I will mention things that are largely unrelated:
Fortran vs C, makes no difference
Advanced matrix algorithms such as Strassen: implementations don't use them as they don't help in practice
Most implementations break each operation into small-dimension matrix or vector operations in the more or less obvious way. For example a large 1000x1000 matrix multiplication may be broken into a sequence of 50x50 matrix multiplications.
These fixed-size small-dimension operations (called kernels) are hardcoded in CPU-specific assembly code using several CPU features of their target:
SIMD-style instructions
Instruction Level Parallelism
Cache-awareness
Furthermore these kernels can be executed in parallel with respect to each other using multiple threads (CPU cores), in the typical map-reduce design pattern.
Take a look at ATLAS which is the most commonly used open source BLAS implementation. It has many different competing kernels, and during the ATLAS library build process it runs a competition among them (some are even parameterized, so the same kernel can have different settings). It tries different configurations and then selects the best for the particular target system.
(Tip: That is why if you are using ATLAS you are better off building and tuning the library by hand for your particular machine than using a prebuilt one.)
First, there are more efficient algorithms for matrix multiplication than the one you're using.
Second, your CPU can do much more than one instruction at a time.
Your CPU executes 3-4 instructions per cycle, and if the SIMD units are used, each instruction processes 4 floats or 2 doubles. (of course this figure isn't accurate either, as the CPU can typically only process one SIMD instruction per cycle)
Third, your code is far from optimal:
You're using raw pointers, which means that the compiler has to assume they may alias. There are compiler-specific keywords or flags you can specify to tell the compiler that they don't alias. Alternatively, you should use other types than raw pointers, which take care of the problem.
You're thrashing the cache by performing a naive traversal of each row/column of the input matrices. You can use blocking to perform as much work as possible on a smaller block of the matrix, which fits in the CPU cache, before moving on to the next block.
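A minimal sketch of the blocking idea, for square row-major matrices. The function name and the tile size parameter bs are assumptions for the example; a good tile size depends on the cache sizes of the target machine and has to be found by measurement. This is plain C++ for illustration, nowhere near a tuned BLAS kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// C += A * B for n x n row-major matrices, computed tile by tile so that
// the tiles being worked on stay in cache while they are reused.
void matmul_blocked(std::size_t n, std::size_t bs,
                    const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::vector<double>& C) {
    assert(bs > 0);
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
    for (std::size_t ii = 0; ii < n; ii += bs) {
        const std::size_t i_end = std::min(ii + bs, n);
        for (std::size_t kk = 0; kk < n; kk += bs) {
            const std::size_t k_end = std::min(kk + bs, n);
            for (std::size_t jj = 0; jj < n; jj += bs) {
                const std::size_t j_end = std::min(jj + bs, n);
                // Multiply one tile of A by one tile of B.
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const double a = A[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j) {
                            C[i * n + j] += a * B[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The std::min calls handle matrix sizes that are not a multiple of the tile size, and the innermost loop walks both C and B along contiguous memory.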
For purely numerical tasks, Fortran is pretty much unbeatable, and C++ takes a lot of coaxing to get up to a similar speed. It can be done, and there are a few libraries demonstrating it (typically using expression templates), but it's not trivial, and it doesn't just happen.
I don't know specifically about BLAS implementations, but there are more efficient algorithms for matrix multiplication that have better than O(n^3) complexity. A well-known one is the Strassen algorithm.
Most arguments to the second question -- assembler, splitting into blocks etc. (but not the less than N^3 algorithms, they are really overdeveloped) -- play a role. But the slowness of your algorithm is caused essentially by matrix size and the unfortunate arrangement of the three nested loops. Your matrices are so large that they do not fit at once in cache memory. You can rearrange the loops such that as much as possible will be done on a row in cache, this way dramatically reducing cache refreshes (BTW splitting into small blocks has an analogous effect, best if loops over the blocks are arranged similarly). A model implementation for square matrices follows. On my computer its time consumption was about 1:10 compared to the standard implementation (as yours).
In other words: never program a matrix multiplication along the "row times column" scheme that we learned in school.
After having rearranged the loops, more improvements are obtained by unrolling loops, assembler code etc.
void vector(int m, double ** a, double ** b, double ** c) {
int i, j, k;
for (i=0; i<m; i++) {
double * ci = c[i];
for (k=0; k<m; k++) ci[k] = 0.;
for (j=0; j<m; j++) {
double aij = a[i][j];
double * bj = b[j];
for (k=0; k<m; k++) ci[k] += aij*bj[k];
}
}
}
One more remark: This implementation is even better on my computer than replacing all by the BLAS routine cblas_dgemm (try it on your computer!). But much faster (1:4) is calling dgemm_ of the Fortran library directly. I think this routine is in fact not Fortran but assembler code (I do not know what is in the library, I don't have the sources). Totally unclear to me is why cblas_dgemm is not as fast since to my knowledge it is merely a wrapper for dgemm_.
This is a realistic speed up. For an example of what can be done with SIMD assembler over C++ code, see some example iPhone matrix functions - these were over 8x faster than the C version, and aren't even "optimized" assembly - there's no pipe-lining yet and there are unnecessary stack operations.
Also your code is not "restrict correct" - how does the compiler know that when it modifies C, it isn't modifying A and B?
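As a sketch of what a "restrict correct" signature can look like in C++: the language has no standard restrict keyword, but GCC, Clang and MSVC accept __restrict as an extension. The function below is a made-up example, not part of any library; the qualifier is a promise by the caller that the three ranges do not overlap, and breaking that promise is undefined behavior.

```cpp
#include <cassert>
#include <cstddef>

// out[i] = a[i] * b[i]. __restrict (a compiler extension, not standard C++)
// tells the compiler that a, b and out never alias, so it does not have to
// reload a[i] or b[i] after each store to out.
void multiply_arrays(const float* __restrict a,
                     const float* __restrict b,
                     float* __restrict out,
                     std::size_t n) {
    assert(a != out && b != out);  // cheap sanity check, not a full overlap test
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] * b[i];
    }
}
```

Without the qualifier the compiler can often still vectorize such a loop, but only by adding a runtime overlap check in front of it.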
With respect to the original code in MM multiply, memory references for most operations are the main cause of bad performance. Memory runs 100-1000 times slower than cache.
Most of the speedup comes from employing loop optimization techniques for this triple-loop function in MM multiply. Two main loop optimization techniques are used: unrolling and blocking. With respect to unrolling, we unroll the two outermost loops and block them for data reuse in cache. Outer loop unrolling helps optimize data access temporally by reducing the number of memory references to the same data at different times during the entire operation. Blocking the loop index at a specific number helps with retaining the data in cache. You can choose to optimize for L2 cache or L3 cache.
https://en.wikipedia.org/wiki/Loop_nest_optimization
For many reasons.
First, Fortran compilers are highly optimized, and the language allows them to be. C and C++ are very loose in terms of array handling (e.g. the case of pointers referring to the same memory area). This means that the compiler cannot know in advance what to do, and is forced to create generic code. In Fortran, your cases are more streamlined, and the compiler has better control of what happens, allowing it to optimize more (e.g. using registers).
Another thing is that Fortran stores data column-wise, while C stores data row-wise. I haven't checked your code, but be careful how you perform the product. In C you must scan row-wise: this way you scan your array along contiguous memory, reducing cache misses. Cache misses are the first source of inefficiency.
Third, it depends on the BLAS implementation you are using. Some implementations might be written in assembler, and optimized for the specific processor you are using. The netlib version is written in Fortran 77.
Also, you are doing a lot of operations, most of them repeated and redundant. All those multiplications to obtain the index are detrimental for the performance. I don't really know how this is done in BLAS, but there are a lot of tricks to prevent expensive operations.
For example, you could rework your code this way
template<class ValT>
void mmult(const ValT* A, int ADim1, int ADim2, const ValT* B, int BDim1, int BDim2, ValT* C)
{
if ( ADim2!=BDim1 ) throw std::runtime_error("Error sizes off");
memset((void*)C,0,sizeof(ValT)*ADim1*BDim2);
int cc2,cc1,cr1, a1,a2,a3;
for ( cc2=0 ; cc2<BDim2 ; ++cc2 ) {
a1 = cc2*ADim1;
a3 = cc2*BDim1;
for ( cc1=0 ; cc1<ADim2 ; ++cc1 ) {
a2=cc1*ADim1;
ValT b = B[a3+cc1];
for ( cr1=0 ; cr1<ADim1 ; ++cr1 ) {
C[a1+cr1] += A[a2+cr1]*b;
}
}
}
}
Try it, I am sure you will save something.
On your #1 question, the reason is that matrix multiplication scales as O(n^3) if you use a trivial algorithm. There are algorithms that scale much better.