Why is my Strassen's Matrix Multiplication slow? - c++

I wrote two Matrix Multiplication programs in C++: Regular MM (source), and Strassen's MM (source), both of which operate on square matrices of size 2^k x 2^k (in other words, square matrices whose size is a power of two).
Results are just terrible. For 1024 x 1024 matrix, Regular MM takes 46.381 sec, while Strassen's MM takes 1484.303 sec (25 minutes !!!!).
I attempted to keep the code as simple as possible. Other Strassen's MM examples found on the web are not that much different from my code. One issue with Strassen's code is obvious - I don't have a cutoff point that switches to regular MM.
What other issues does my Strassen's MM code have?
Thanks!
Direct links to sources
http://pastebin.com/HqHtFpq9
http://pastebin.com/USRQ5tuy
Edit1.
First, a lot of great advice. Thank you for taking your time and sharing knowledge.
I implemented the changes (kept all of my code) and added a cut-off point.
MM of 2048x2048 matrix, with cutoff 512 already gives good results.
Regular MM: 191.49s
Strassen's MM: 112.179s
Significant improvement.
Results were obtained on prehistoric Lenovo X61 TabletPC with Intel Centrino processor, using Visual Studio 2012.
I will do more checks(to make sure I got the correct results), and will publish the results.

One issue with Strassen's code is obvious - I don't have cutoff point,
that switches to regular MM.
It's fair to say that recursing down to 1 point is the bulk of (if not the entire) problem. Trying to guess at other performance bottlenecks without addressing this is almost moot due to the massive performance hit that it brings. (In other words, you're comparing Apples to Oranges.)
As discussed in the comments, cache alignment could have an effect, but not to this scale. Furthermore, cache alignment would likely hurt the regular algorithm more than the Strassen algorithm since the latter is cache-oblivious.
void strassen(int **a, int **b, int **c, int tam) {
    // trivial case: when the matrix is 1 x 1:
    if (tam == 1) {
        c[0][0] = a[0][0] * b[0][0];
        return;
    }
That's far too small. While the Strassen algorithm has a smaller complexity, it has a much bigger Big-O constant. For one, you have function call overhead all the way down to 1 element.
This is analogous to using merge or quick sort and recursing all the way down to one element. To be efficient you need to stop the recursion when the size gets small and fall back to the classic algorithm.
In quick/merge sort, you'd fall back to a low-overhead O(n^2) insertion or selection sort. Here you would fall back to the normal O(n^3) matrix multiply.
The threshold at which you fall back to the classic algorithm should be a tunable parameter that will likely vary depending on the hardware and the ability of the compiler to optimize the code.
For something like Strassen multiplication, where the advantage is only O(N^2.8074) over the classic O(N^3), don't be surprised if this threshold turns out to be very high (thousands of elements?).
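As a concrete sketch of the cutoff (assuming the question's int** matrices, and a helper name naiveMM that I'm making up here), the only change is the base case: fall back to a classic triple loop once tam drops below a tunable threshold, and leave the seven recursive sub-products as they are.

const int CUTOFF = 64;   // tunable; benchmark values from ~32 up to several hundred or more

void naiveMM(int **a, int **b, int **c, int tam) {
    for (int i = 0; i < tam; ++i)
        for (int j = 0; j < tam; ++j) {
            int sum = 0;
            for (int k = 0; k < tam; ++k)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

void strassen(int **a, int **b, int **c, int tam) {
    if (tam <= CUTOFF) {              // small block: the classic O(n^3) multiply wins here
        naiveMM(a, b, c, tam);
        return;
    }
    // ... the existing seven recursive sub-products on tam/2 blocks go here, unchanged ...
}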
In some applications there can be many algorithms, each with decreasing complexity but an increasing Big-O constant. The result is that multiple algorithms become optimal at different sizes.
Large integer multiplication is a notorious example of this:
Grade-school Multiplication: O(N^2) optimal for < ~100 digits*
Karatsuba Multiplication: O(N^1.585) faster than above at ~100 digits*
Toom-Cook 3-way: O(N^1.465) faster than Karatsuba at ~3000 digits*
Floating-point FFT: O(> N log(N)) faster than Karatsuba/Toom-3 at ~700 digits*
Schönhage–Strassen algorithm (SSA): O(N log(n) loglog(n)) faster than FFT at ~ a billion digits*
Fixed-width Number-Theoretic Transform: O(N log(n)) faster than SSA at ~ a few billion digits?*
*Note these example thresholds are approximate and can vary drastically - often by more than a factor of 10.

So, there may be more problems than this, but your first problem is that you're using arrays of pointers to arrays. And since you're using array sizes that are powers of 2, this is an especially big performance hit compared to allocating the elements contiguously and using integer division to fold the long array of numbers into rows.
Anyway, that's my first guess as to a problem. As I said, there may be more, and I'll add to this answer as I discover them.
Edit: This likely only contributes a small amount to the problem. The problem is likely the one Luchian Grigore refers to involving cache line contention issues with powers of two.
I verified that my concern is valid for the naive algorithm. The time for the naive algorithm goes down by almost 50% if the array is contiguous instead. Here is the code for this (using a SquareMatrix class that is C++11 dependent) on pastebin.
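For illustration, here is a minimal sketch of the contiguous idea (not the pastebin SquareMatrix class): one flat allocation, with element access done by index arithmetic instead of chasing a pointer per row. The i-k-j loop order also keeps the accesses to b sequential.

#include <cstddef>
#include <vector>

struct FlatMatrix {
    std::size_t n;
    std::vector<int> data;                       // one contiguous block, zero-initialized
    explicit FlatMatrix(std::size_t size) : n(size), data(size * size, 0) {}
    int&       at(std::size_t r, std::size_t c)       { return data[r * n + c]; }
    const int& at(std::size_t r, std::size_t c) const { return data[r * n + c]; }
};

// Naive multiply over the flat layout; c is assumed to start out all zeros.
void multiply(const FlatMatrix& a, const FlatMatrix& b, FlatMatrix& c) {
    for (std::size_t i = 0; i < a.n; ++i)
        for (std::size_t k = 0; k < a.n; ++k) {
            const int aik = a.at(i, k);
            for (std::size_t j = 0; j < a.n; ++j)
                c.at(i, j) += aik * b.at(k, j);
        }
}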

Related

What does time complexity actually mean?

I got the task of showing the time taken by the merge sort algorithm theoretically (n log(n)) and practically (by program) on a graph, using different values of n and the time taken.
In the program, I'm printing the time difference between before calling the function and after the end of the function, in microseconds. I want to know what n log(n) actually means.
I tried with these values:
Number of values:
10000 20000 30000 40000 50000 60000 70000 80000 90000 100000
program time in micro second:
12964 24961 35905 47870 88764 67848 81782 97739 111702 119682
time using n log n formula:
132877 285754 446180 611508 780482 952360 1.12665e+006 1.30302e+006 1.48119e+006 1.66096e+006
code:
auto start = std::chrono::high_resolution_clock::now();
mergeSort(arr, 0, n - 1);
auto elapsed = std::chrono::high_resolution_clock::now() - start;
long long microseconds = std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
cout << microseconds << " ";
Graph I got:
What does time complexity actually mean?
I interpret your question in the following way:
Why is the actual time needed by the program not K*n*log(n) microseconds?
The answer is: Because on modern computers, the same step (such as comparing two numbers) does not need the same time if it is executed multiple times.
If you look at the time needed for 50,000 and 60,000 numbers, you can see that the 50,000 numbers actually needed more time than the 60,000 numbers.
The reason might be some interrupt that occurred while the 50,000 numbers were being sorted; I assume that you'll get a time between the 40,000-number and the 60,000-number runs if you run your program a second time.
In other words: External influences (like interrupts) have more impact on the time needed by your program than the program itself.
I got the task of showing the time taken by the merge sort algorithm theoretically (n log(n)) and practically (by program) on a graph by using different values of n and time taken.
I'd take a number of elements to be sorted that takes about one second to sort. Let's say sorting 3 million numbers takes one second; then I would sort 3, 6, 9, 12, ... and 30 million numbers and measure the time.
This reduces the influence of interrupts etc. on the measurement. However, you'll still have some effect of the memory cache in this case.
You can use your existing measurements (especially the 50,000 and the 60,000 runs) to show that for a small number of elements to be sorted, there are other factors that influence the run time.
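As a sketch of that kind of measurement (using std::sort as a stand-in for your mergeSort, which is an assumption on my part), print time / (n * log2(n)) for growing n; if the algorithm really behaves like K*n*log(n) microseconds, the printed ratio should settle around a roughly constant K.

#include <algorithm>
#include <chrono>
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist;

    for (std::size_t n = 3000000; n <= 30000000; n += 3000000) {
        std::vector<int> v(n);
        for (auto& x : v) x = dist(rng);          // fresh random input for each size

        auto start = std::chrono::high_resolution_clock::now();
        std::sort(v.begin(), v.end());            // stand-in for mergeSort(arr, 0, n - 1)
        auto elapsed = std::chrono::high_resolution_clock::now() - start;
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();

        // If the run time is roughly K*n*log2(n) microseconds, this ratio is roughly K.
        std::cout << n << "\t" << us / (n * std::log2((double)n)) << "\n";
    }
}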
Note that a graph of y = x log(x) is surprisingly close to a straight line.
This is because the gradient at any point x is 1 + log(x), which is a slowly growing function of x.
In other words, it's difficult within the bounds of experimental error to distinguish between O(N) and O(N log N).
The fact that the blue line is pretty straight is a reasonable verification that the algorithm is not O(N * N), but really without better statistical analysis and program control set-up, one can't say much else.
The difference between the red and blue line is down to "big O" not concerning itself with proportionality constants and other coefficients.
The time complexity is the time a program takes to execute, as a function of the problem size.
The problem size is usually expressed as the number of input elements, but some other measures can sometimes be used (e.g. algorithms on matrices of size NxN can be rated in terms of N instead of N²).
The time can effectively be measured in units of time (seconds), but is often assessed by just counting the number of atomic operations of some kind performed (e.g. the number of comparisons, of array accesses...)
In fact, for theoretical studies, the exact time is not relevant information, because it is not "portable": it strongly depends on the performance of the computer used and also on implementation details.
This is why algorithmicians do not really care about exact figures, but rather about how the time varies with increasing problem sizes. This leads to the concept of asymptotic complexity, which measures the running time up to an unknown factor, and for mathematical convenience, an approximation of the running time is often used to make the computations tractable.
If you study the complexity by pure benchmarking (timing), you can obtain experimental points, which you could call empirical complexity. But some statistical rigor should be applied.
(Some of the other answers do merge the concepts of complexity and asymptotic complexity, but this is not correct.)
In this discussion of complexity, you can replace time by space and you study the memory footprint of the program.
Time complexity has nothing to do with actual time.
It's just a way that helps us to compare different algorithms - which algorithm will run faster.
For example -
In the case of sorting: bubble sort has time complexity O(n^2) and merge sort has time complexity O(n log(n)). So, with the help of time complexity, we can say that merge sort is much better than bubble sort for sorting things.
Big-O notation was created so that we can have a generalized way of comparing the speed of different algorithms, a way which is not machine dependent.

Is there a more efficient way to calculate a rolling maximum / minimum than the naive method? [duplicate]

I have input array A
A[0], A[1], ... , A[N-1]
I want a function Max(T, A) which returns an array B representing the max value of A over the previous moving window of size T, where
B[i+T] = max(A[i], A[i+1], ..., A[i+T])
By using a max heap to keep track of the max value of the current moving window A[i] to A[i+T], this algorithm yields O(N log(T)) worst case.
I would like to know if there is any better algorithm? Maybe an O(N) algorithm?
O(N) is possible using Deque data structure. It holds pairs (Value; Index).
at every step:
  if (!Deque.Empty) and (Deque.Head.Index <= CurrentIndex - T) then
    Deque.ExtractHead;
    //Head is too old, it is leaving the window
  while (!Deque.Empty) and (Deque.Tail.Value > CurrentValue) do
    Deque.ExtractTail;
    //remove elements that have no chance to become minimum in the window
  Deque.AddTail(CurrentValue, CurrentIndex);
  CurrentMin = Deque.Head.Value
  //Head value is minimum in the current window
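Translated to C++ for the maximum case, a minimal sketch of the same idea (windows here are A[i-T+1..i]); every element is pushed and popped at most once, hence O(N) overall:

#include <deque>
#include <utility>
#include <vector>

std::vector<int> slidingMax(const std::vector<int>& a, int T) {
    std::deque<std::pair<int, int>> dq;   // (value, index), values kept in decreasing order
    std::vector<int> out;
    for (int i = 0; i < (int)a.size(); ++i) {
        // Drop the head if it has left the window.
        if (!dq.empty() && dq.front().second <= i - T)
            dq.pop_front();
        // Drop tail elements that can never be the window maximum again.
        while (!dq.empty() && dq.back().first <= a[i])
            dq.pop_back();
        dq.emplace_back(a[i], i);
        if (i >= T - 1)
            out.push_back(dq.front().first);   // head holds the current window maximum
    }
    return out;
}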
It's called RMQ (range minimum query). Actually, I once wrote an article about that (with C++ code). See http://attiix.com/2011/08/22/4-ways-to-solve-%C2%B11-rmq/
Or you may prefer the Wikipedia article, Range Minimum Query.
After the preparation, you can get the max of any given range in O(1).
There is a sub-field in image processing called Mathematical Morphology. The operation you are implementing is a core concept in this field, called dilation. Obviously, this operation has been studied extensively and we know how to implement it very efficiently.
The most efficient algorithm for this problem was proposed in 1992 and 1993, independently by van Herk, and Gil and Werman. This algorithm needs exactly 3 comparisons per sample, independently of the size of T.
Some years later, Gil and Kimmel further refined the algorithm to need only 2.5 comparisons per sample. Though the increased complexity of the method might offset the fewer comparisons (I find that more complex code runs more slowly). I have never implemented this variant.
The HGW algorithm, as it's called, needs two intermediate buffers of the same size as the input. For ridiculously large inputs (billions of samples), you could split up the data into chunks and process it chunk-wise.
In short, you walk through the data forward, computing the cumulative max over chunks of size T. You do the same walking backward. Each of these requires one comparison per sample. Finally, the result for each window is the maximum of one value from each of these two temporary arrays. For data locality, you can do the two passes over the input at the same time.
I guess you could even do a running version, where the temporary arrays are of length 2*T, but that would be more complex to implement.
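Under those assumptions, here is a minimal sketch of the forward/backward idea (windows [i, i+T-1], two full-size temporary buffers, about three comparisons per sample); it's an illustration, not the authors' reference code:

#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<double> slidingMaxHGW(const std::vector<double>& a, std::size_t T) {
    const std::size_t n = a.size();
    if (T == 0 || n < T) return {};
    std::vector<double> g(n), h(n), out(n - T + 1);

    // Forward cumulative max, restarted at every block boundary (multiples of T).
    for (std::size_t i = 0; i < n; ++i)
        g[i] = (i % T == 0) ? a[i] : std::max(g[i - 1], a[i]);

    // Backward cumulative max, restarted at the same block boundaries.
    for (std::size_t i = n; i-- > 0; )
        h[i] = (i == n - 1 || i % T == T - 1) ? a[i] : std::max(h[i + 1], a[i]);

    // Each window spans at most two blocks; combine the two partial maxima.
    for (std::size_t i = 0; i + T <= n; ++i)
        out[i] = std::max(h[i], g[i + T - 1]);

    return out;
}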
van Herk, "A fast algorithm for local minimum and maximum filters on rectangular and octagonal kernels", Pattern Recognition Letters 13(7):517-521, 1992 (doi)
Gil, Werman, "Computing 2-D min, median, and max filters", IEEE Transactions on Pattern Analysis and Machine Intelligence 15(5):504-507 , 1993 (doi)
Gil, Kimmel, "Efficient dilation, erosion, opening, and closing algorithms", IEEE Transactions on Pattern Analysis and Machine Intelligence 24(12):1606-1617, 2002 (doi)
(Note: cross-posted from this related question on Code Review.)

Introsort (quicksort + heapsort) implementation and complexity

I've read that C++ uses introsort (introspective sort) for its built-in std::sort where it starts off with quicksort and switches to heapsort when you hit the depth limit.
I've also read that the depth limit is supposed to be 2*log(2,N).
Is this value purely experimental? Or is there some mathematical theory behind it?
If you have an interval (range or array), the number of times you'll have to split the interval in half before you end up with an empty (or one element) interval is log(2,N); that's just a mathematical fact, and you can work it out easily if you want. If all goes perfectly well with quicksort, it should recurse log(2,N) times, for the same reason (and at each recursion level, it has to process all values of the interval, which leads to an O(N*log(2,N)) complexity for the overall algorithm). The problem is that quicksort could require many more recursions (if it keeps getting "unlucky" with picking pivot values, which means that it doesn't split the interval in half, but in an imbalanced way instead). At worst, quicksort could end up recursing N times, which is definitely not acceptable for a production-quality implementation.
Switching to heap-sort at 2*log(2,N) is just a good heuristic in general, to detect a much too deep number of recursions.
Technically, you could base this on the empirical performance of heapsort versus quicksort, to figure out which limit is the best. But such tests are highly dependent on the application (what are you sorting? how are you comparing elements? how cheap are the element swaps? etc.). So, most one-size-fits-all implementations, like std::sort, just pick a reasonable limit like 2*log(2,N).
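For illustration, a minimal introsort-style sketch (not libstdc++'s or any other library's actual std::sort): quicksort with a naive middle-element pivot until the 2*log2(N) depth budget runs out, then heapsort via make_heap/sort_heap, with a small-range insertion sort at the bottom.

#include <algorithm>
#include <cmath>
#include <iterator>

template <typename It>
void insertionSort(It first, It last) {
    for (It i = first; i != last; ++i)
        std::rotate(std::upper_bound(first, i, *i), i, i + 1);
}

template <typename It>
void introsortImpl(It first, It last, int depthLimit) {
    using T = typename std::iterator_traits<It>::value_type;
    const auto n = last - first;
    if (n <= 16) {                          // tiny range: low-overhead quadratic sort
        insertionSort(first, last);
        return;
    }
    if (depthLimit == 0) {                  // recursion too deep: switch to heapsort
        std::make_heap(first, last);
        std::sort_heap(first, last);
        return;
    }
    const T pivot = *(first + n / 2);       // naive pivot choice, for illustration only
    It mid  = std::partition(first, last, [&](const T& x) { return x < pivot; });
    It mid2 = std::partition(mid,   last, [&](const T& x) { return !(pivot < x); });
    introsortImpl(first, mid,  depthLimit - 1);   // elements less than the pivot
    introsortImpl(mid2,  last, depthLimit - 1);   // elements greater than the pivot
}

template <typename It>
void introsort(It first, It last) {
    const auto n = last - first;
    introsortImpl(first, last, n > 1 ? 2 * (int)std::log2((double)n) : 0);
}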
What @Mikael Persson said regarding why the depth limit is 2*log(2,N) is partly correct. It is not just a good heuristic, or a reasonable limit.
In fact, as you have probably guessed (hinted at by your second question), there is an important mathematical reason for this: in tilde notation (search for tilde notation), quicksort makes on average ~2*N*ln(N) comparisons. In big-oh notation, this is equivalent to O(N*log(2,N)).
That is why introsort switches to heapsort (which has worst-case O(N*log(2,N)) complexity) when the depth of the recursion becomes more than 2*log(2,N). You can think of it as something that is not usual to happen and most probably means that something went wrong with the pivot picking, in which case quicksort alone would lead to O(N^2) complexity.
You can find a short mathematical proof of the average number of compares quicksort does here (slide 21).

Performance of doing bitwise operations on bitsets

In C++ if I do a logical OR (or AND) on two bitsets, for example:
bitset<1000000> b1, b2;
//some stuff
b1 |= b2;
Does this happen in O(n) or O(1) time? Why?
Also, can this be accomplished using an array of bools in O(1) time?
Thanks.
It has to happen in O(N) time since there is a finite number of bits that can be processed in any given chunk of time by a given processor platform. In other words, the larger the bit-set, the longer the amount of time each operation will take, and the increase will be linear with respect to the number of bits in the bitset.
You also end up with the same problem using the array of bool types. While each individual operation itself will take O(1) time, the total amount of time for N objects will be O(N).
It's impossible to perform a logical operation (e.g. OR or AND) on arbitrary arrays of flags in unit time. True Big-O analysis deals with runtime as the size of the data tends to infinity, and a Core i7 is never going to OR together a billion bits in the same time it takes to OR together two bits.
I think it needs to be made clear that Big O is a boundary - an asymptotic boundary - and that it states the order of magnitude of the speed of a computation. So think about how an array works: if you can do the operation in one computation or so, or in a known amount of work that is very small and much less than N, then it is constant. If you need to iterate in some manner (in this case all the bits need to be checked, and there is no shortcut for bitwise OR), then N bits need to be computed, and therefore it's O(N). (It's actually a tighter bound than that, but we're dealing with just Big O here.) The array itself stores N bits.
In fact, few things are really O(1). Index lookups at a known address using a pointer can be O(1) (if you already know what you are looking up). But if you have M things that need to be looked up in constant time, then it is O(M) * O(1) = O(M).
This is a function of the modern-day computer, since most things are processed sequentially (multi-core helps, but doesn't come close to affecting Big O notation yet). There is, of course, the computer's ability to process words in parallel, but even that is just a constant factor: O(n) / 64 is still O(n).
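To make the constant-factor point concrete, here is a rough sketch (not what any particular standard library does internally, though the word-at-a-time loop is the typical shape): both loops below are O(N) in the number of bits, the second just handles 64 bits per iteration.

#include <bitset>
#include <cstdint>
#include <vector>

// One logical OR per bit: N iterations for N bits.
void orBools(std::vector<bool>& dst, const std::vector<bool>& src) {
    for (std::size_t i = 0; i < dst.size(); ++i)
        dst[i] = dst[i] || src[i];
}

// 64 bits per OR: N/64 iterations, still linear in N.
void orWords(std::vector<std::uint64_t>& dst, const std::vector<std::uint64_t>& src) {
    for (std::size_t i = 0; i < dst.size(); ++i)
        dst[i] |= src[i];
}

int main() {
    std::bitset<1000000> b1, b2;
    b1 |= b2;                                 // the library does a word-at-a-time loop for us
    std::vector<bool> v1(1000000), v2(1000000);
    orBools(v1, v2);
    std::vector<std::uint64_t> w1(1000000 / 64 + 1), w2(1000000 / 64 + 1);
    orWords(w1, w2);
}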

How to select an unlike number in an array in C++?

I'm using C++ to write a ROOT script for some task. At some point I have an array of doubles in which many are quite similar and one or two are different. I want to average all the numbers except those sore thumbs. How should I approach it? For an example, let's consider:
x = [2.3, 2.4, 2.11, 10.5, 1.9, 2.2, 11.2, 2.1]
I want to somehow average all the numbers except 10.5 and 11.2, the dissimilar ones. This algorithm is going to be repeated several thousand times, and the array of doubles has 2000 entries, so optimization (while maintaining readability) is desired. Thanks, SO!
Check out:
http://tinypic.com/r/111p0ya/3
The "dissimilar" numbers of the y-values of the pulse.
The point of this to determine the ground value for the waveform. I am comparing the most negative value to the ground and hoped to get a better method for grounding than to average the first N points in the sample.
Given that you are using ROOT you might consider looking at the TSpectrum classes which have support for extracting backgrounds from under an unspecified number of peaks...
I have never used them with so much baseline noise, but they ought to be robust.
BTW: what is the source of this data? The peak looks like a particle detector pulse, but the high level of background jitter suggests that you could really improve things by some fairly minor adjustments in the DAQ hardware, which might be better than trying to solve a difficult software problem.
Finally, unless you are restricted to some very primitive hardware (in which case why and how are you running ROOT?), if you only have a couple thousand such spectra you can afford a pretty slow algorithm. Or is that 2000 spectra per event and a high event rate?
If you can, maintain a sorted list; then you can easily chop off the head and the tail of the list each time you work out the average.
This is much like removing outliers based on the median (i.e., you're going to need two passes over the data: one to find the median - which is almost as slow as sorting for floating point data - and another to calculate the average), but it requires less overhead at the time of working out the average, at the cost of maintaining a sorted list. Which one is fastest will depend entirely on your circumstances. It may be, of course, that what you really want is the median anyway!
If you had discrete data (say, bytes = 256 possible values), you could use 256 histogram 'bins' with a single pass over your data, counting the values that go in each bin; then it's really easy to find the median / approximate the mean / remove outliers, etc. This would be my preferred option, if you could afford to lose some of the precision in your data, followed by maintaining a sorted list, if that is appropriate for your data.
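As a one-off version of the "chop off the head and the tail" idea, here is a small sketch (trimFraction is a made-up tuning parameter): sort a copy, drop a fixed fraction from each end, and average the middle.

#include <algorithm>
#include <numeric>
#include <vector>

double trimmedMean(std::vector<double> v, double trimFraction = 0.05) {
    if (v.empty()) return 0.0;
    std::sort(v.begin(), v.end());
    const std::size_t drop = (std::size_t)(v.size() * trimFraction);   // cut from each end
    const auto first = v.begin() + drop;
    const auto last  = v.end() - drop;
    return std::accumulate(first, last, 0.0) / (double)(last - first);
}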
A quick way might be to take the median, and then take the average of the numbers not so far off from the median.
"Not so far off" being dependent on your project.
A good rule of thumb for determining likely outliers is to calculate the Interquartile Range (IQR), and then any values that are 1.5*IQR away from the nearest quartile are outliers.
This is the basic method many statistics systems (like R) use to automatically detect outliers.
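A minimal sketch of that rule (quartile definitions differ slightly between statistics packages; this one uses simple index-based quartiles via nth_element, so each call is close to linear time):

#include <algorithm>
#include <vector>

double iqrTrimmedMean(std::vector<double> v) {        // by value: nth_element reorders it
    if (v.empty()) return 0.0;
    const std::size_t n = v.size();
    std::nth_element(v.begin(), v.begin() + n / 4, v.end());
    const double q1 = v[n / 4];
    std::nth_element(v.begin(), v.begin() + 3 * n / 4, v.end());
    const double q3 = v[3 * n / 4];
    const double lo = q1 - 1.5 * (q3 - q1);
    const double hi = q3 + 1.5 * (q3 - q1);

    double sum = 0.0;
    std::size_t count = 0;
    for (double x : v)                                // average everything inside the fences
        if (x >= lo && x <= hi) { sum += x; ++count; }
    return count ? sum / count : 0.0;
}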
Any method that is statistically significant and a good way to approach it (Dark Eru, Daniel White) will be too computationally intense to repeat, and I think I've found a workaround that will allow later correction (meaning, leave it un-grounded).
Thanks for the suggestions. I'll look into them if I have time and want to see if their gain is worth the slowdown.
Here's a quick and dirty method that I've used before (works well if there are very few outliers at the beginning, and you don't have very complicated conditions for what constitutes an outlier)
The algorithm is O(N). The only really expensive part is the division.
The real advantage here is that you can have it up and running in a couple minutes.
avgX = Array[0]            // initialize running average with the first point
sumX = Array[0]
N = length(Array)
percentDeviation = 0.3     // percent deviation acceptable for non-outliers
count = 1
foreach x in Array[1..N-1]
    if x < avgX + avgX*percentDeviation
       and x > avgX - avgX*percentDeviation
        count++
        sumX += x
        avgX = sumX / count
    endif
endfor
return avgX