I have recently started looking into cache efficiency in C++, trying to avoid cache misses. So far I have taken away the following:
Try to avoid linked lists of objects where possible when processing. Instead, use them to point to contiguous data that you can keep in cache and operate on directly.
Be careful of holding state in classes as it makes the above potentially more difficult.
Use structs when allocating on the heap, as this helps in localising data.
Try to use 1D arrays for lists of data when possible.
So my question is broken into two parts:
Is the above correct? Have I made any fundamental misunderstandings?
When dealing with 2D arrays I have seen other users recommend the use of Hilbert curves. I do not understand how this provides a speedup over using division and modulus operators on an index to simulate a 2D array; surely that is fewer instructions, which is good for speed and instruction-cache usage?
Thanks for reading.
P.S. I do not have a CompSci background; therefore, if you notice anything I have said that is incorrect, I would appreciate it if you could alert me so that I can read around that topic.
Your approach is flawed for at least one reason: you are willing to sacrifice everything to avoid cache misses. How do you know that cache misses are the major performance factor in your code?
For example, there are MANY cases where a linked list is better than a contiguous array, specifically where you frequently insert or delete items: you would pay greatly for compacting or expanding an array.
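A toy illustration of that trade-off (function names are mine, purely for illustration): inserting at the front of a std::vector shifts every existing element, while a std::list just links a node.

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <vector>

// Front insertion is O(n) per insert for a vector (every element shifts
// right) but O(1) for a linked list. The flip side is that iterating the
// vector afterwards is far more cache-friendly.
std::vector<int> build_vector_front(int n) {
    std::vector<int> v;
    for (int i = 0; i < n; ++i) v.insert(v.begin(), i);  // O(n^2) total work
    return v;
}

std::list<int> build_list_front(int n) {
    std::list<int> l;
    for (int i = 0; i < n; ++i) l.push_front(i);         // O(n) total work
    return l;
}
```

Both produce the same sequence; which one wins overall depends on how often you insert versus how often you traverse, which is exactly why profiling should come first.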
So the answer to your first question is: yes, you will improve data locality using those four principles, but possibly at a cost greater than the savings.
For the second question, I suggest you read about Hilbert curves. You don't need them if you are processing your 2D array in order, row by row. They will help a lot with data locality if you process some area of your 2D array, because the distance between elements in the same column but different rows is much smaller that way.
So I've run into a problem in some code I'm writing in C++. I need to find the median of an array of points with an offset and step size (example).
This code will be executed millions of times, as it's part of one of my core data structures, so I'm trying to make it as fast as possible.
Research has led me to believe that for the best worst-case time complexity, introselect is the fastest way to find a median in a set of unordered values. I have some additional limitations that have to do with optimization:
I can't swap any values in the array. The values in the array are all exactly where they need to be based on that context in the program. But I still need the median.
I can't make any "new" allocations or call anything that does heap allocation. Or if I have to, they need to be kept to a minimum, as they are costly.
I've tried implementing the following in C++:
First Second Third
Is this possible? Or are there alternatives that are just as fast at finding the median and fit those requirements?
You could consider using the same heap allocation for all operations, and avoid freeing it until you're done. This way, rather than creating millions of arrays you just create one.
Of course this approach is more complex if you're doing these find-median operations in parallel. You'd need one array per thread.
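A sketch of that idea (the function and parameter names are my own invention, not from the question's code): copy the strided elements into a scratch buffer that lives across calls, then select with std::nth_element, which uses introselect-style partial sorting in common implementations. The original array is never touched, and after the first call the buffer's capacity is reused, so there are no further heap allocations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of array[offset], array[offset+step], ... without mutating the
// original data. `scratch` is the caller-owned reusable buffer: clear()
// keeps its capacity, so repeated calls stop allocating once it has grown.
double strided_median(const double* array, std::size_t n,
                      std::size_t offset, std::size_t step,
                      std::vector<double>& scratch) {
    scratch.clear();
    for (std::size_t i = offset; i < n; i += step) scratch.push_back(array[i]);
    auto mid = scratch.begin() + scratch.size() / 2;
    std::nth_element(scratch.begin(), mid, scratch.end());  // average O(n)
    return *mid;  // upper median for even-sized selections
}
```

One vector per thread makes this safe to run in parallel, matching the note above.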
I have searched for an answer to this question but have not found anything that can directly help me.
I am working on a 3D numerical integrator for a non-linear PDE using the parallel FFT library included in MKL.
My arrays consist of 2^30 data points, which is much, much larger than the cache. This results in ~50% of cache references being misses, which appears to add a massive amount of memory-access overhead.
Is there a clever way I can deal with this? Is it expected to have 50% cache misses using an array this large?
Any help would be much appreciated.
Thanks,
Dylan
2^30 data points in a single FFT counts as being quite big!
The data plus the exponentials and the output array are several thousand times bigger than the L3 cache, and millions of times bigger than L1.
Given that disparity one might argue that a 50% cache miss rate is actually quite good, especially for an algorithm like an FFT which accesses memory in non-sequential ways.
I don't think that there will be much you can do about it. The MKL is quite good, and I'm sure that they've taken advantage of whatever cache hinting instructions there are.
You might try contacting Mercury Systems Inc. (www.mrcy.com) and ask them about their Scientific Algorithms Library (SAL). They have a habit of writing their own math libraries, and in my experience they are pretty good at it. Their FFT on PowerPC was 30% quicker than the next best one; quite an achievement. You can try an un-optimised version of SAL for free (http://sourceforge.net/projects/opensal/). The real optimised for Intel SAL is definitely not free though.
Also bear in mind that no matter how clever the algorithm is, with a data set that size you're always going to be fundamentally stuck with main memory bandwidths, not cache bandwidths.
GPUs might be worth a look, but you'd need one with a lot of memory to hold 2^30 data points (at 8 bytes per complex value, i.e. two 32-bit floats, that's 8 GB; the same again for the output array, plus exponentials, etc).
I think the problem of excessive misses is due to a failure of the cache prefetch mechanism, but not knowing the details of the memory accesses I can't tell you exactly why.
It does not matter that your arrays are very large; 50% misses are excessive. The processor should avoid misses by detecting that you are iterating over an array and loading, ahead of time, the data elements you are likely to use.
Either the pattern of array accesses is not regular, so the prefetcher in the processor does not figure out a pattern to prefetch, or you have a cache-associativity problem; that is, elements in your iteration might be mapped to the same cache slots.
For example, assume a cache size of 1 MB and a set associativity of 4. In this simplified model, the cache maps memory to an internal slot using the lower 20 bits of the address. If your stride is exactly 1 MB, the lower 20 bits are always the same, so every new element goes to the same cache slot as the old one. By the fifth element, all four ways of that slot are used up, and from then on it is only misses: your cache is effectively a single slot. If you stride by half the cache size, the effective number of slots is 2, which might be enough to have no misses at all, or 100%, or anything in between, depending on whether your access pattern needs both slots simultaneously.
To convince yourself of this, make a toy program with varying stride sizes; you'll see that strides that divide, or are multiples of, the cache size increase misses. You can measure this with valgrind --tool=cachegrind.
You should first make sure you know what is causing the cache misses; they may be the fault of other code you've written rather than the FFT library. In fact, I expect that is very likely the case.
The rest of this post assumes that the FFT is really at fault and we need to optimize.
The standard trick to get data locality out of an FFT is to
Arrange the data in a two-dimensional array
Do an FFT along each row
Apply twiddle factors
Do a matrix transpose
Do an FFT along each row
This is the Cooley-Tukey algorithm, in the case where we factor 2^(m+n) = 2^m * 2^n.
The point of this is that the recursive calls to the FFT are much much smaller, and may very well fit in cache. And if not, you can apply this method recursively until things do fit in cache. And if you're ambitious, you do a lot of benchmarking to figure out the optimal way to do the splitting.
Thus, assuming you also use a good matrix transpose algorithm, the end result is a relatively cache-friendly FFT.
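To make the factorisation concrete, here is a small sketch of that row-FFT / twiddle / transpose / row-FFT scheme. A naive O(n^2) DFT stands in for the small FFTs, the R x C split and every name are mine (not MKL's), and the input is viewed column-major so that the first pass really is over rows.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <random>
#include <vector>

using cd = std::complex<double>;
const double PI = std::acos(-1.0);

// Naive O(n^2) DFT, standing in for the recursive "small FFT" calls.
std::vector<cd> dft(const std::vector<cd>& x) {
    const std::size_t n = x.size();
    std::vector<cd> X(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            X[k] += x[j] * std::polar(1.0, -2.0 * PI * double(j * k) / double(n));
    return X;
}

// Cooley-Tukey decomposition of a length R*C DFT.
std::vector<cd> fft_rowcol(const std::vector<cd>& x, std::size_t R, std::size_t C) {
    const std::size_t N = R * C;
    assert(x.size() == N);
    // View the input column-major as an R x C matrix: B[r][c] = x[c*R + r].
    std::vector<std::vector<cd>> B(R, std::vector<cd>(C));
    for (std::size_t r = 0; r < R; ++r)
        for (std::size_t c = 0; c < C; ++c)
            B[r][c] = x[c * R + r];
    // 1. Length-C DFT along each row.
    for (std::size_t r = 0; r < R; ++r) B[r] = dft(B[r]);
    // 2. Twiddle factors: element (r, k2) picks up w_N^(r*k2).
    for (std::size_t r = 0; r < R; ++r)
        for (std::size_t k2 = 0; k2 < C; ++k2)
            B[r][k2] *= std::polar(1.0, -2.0 * PI * double(r * k2) / double(N));
    // 3. Transpose to a C x R matrix.
    std::vector<std::vector<cd>> W(C, std::vector<cd>(R));
    for (std::size_t r = 0; r < R; ++r)
        for (std::size_t k2 = 0; k2 < C; ++k2)
            W[k2][r] = B[r][k2];
    // 4. Length-R DFT along each row of the transpose.
    for (std::size_t k2 = 0; k2 < C; ++k2) W[k2] = dft(W[k2]);
    // 5. Read out: X[k2 + C*k1] = W[k2][k1]. Doing this as one more
    //    transpose instead gives the "six-step" variant.
    std::vector<cd> X(N);
    for (std::size_t k2 = 0; k2 < C; ++k2)
        for (std::size_t k1 = 0; k1 < R; ++k1)
            X[k2 + C * k1] = W[k2][k1];
    return X;
}
```

Each inner dft call touches only C or R elements at a time, which is the whole point: those working sets can fit in cache even when N itself never will.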
The library you're using really should be doing this already. If it's not, then some options are:
Maybe it exposes enough lower-level functionality that you can tell it to use Cooley-Tukey efficiently even though the high-level routines don't.
You could implement Cooley-Tukey yourself, using the given library to do the smaller FFTs.
I am trying to program a type of neural network using C with CUDA. I have one basic question. For the programming, I can either use big arrays or a different naming strategy. For example, for the weights, I can put all the weights in one big array, or use different arrays for different layers with different names, such as weight1 for layer one, weight2 for layer two, and so on. The first strategy is a little troublesome, while the second one is easier for me. However, I am wondering: if I use the separate-naming strategy, does it make the program slower to run on the GPU?
As long as all the arrays are allocated only once and not resized, the difference in performance should be negligible.
If you are constantly reallocating memory and resizing arrays holding the weights, then there might be a performance benefit in managing your own memory within the big array.
That, however, is very implementation-specific; if you don't know what you are doing, managing your own memory/arrays could make your code slower and less robust. Also, if your NN is huge, you might have trouble finding a contiguous block of memory large enough to hold your memory/array block.
This is my 2 cents.
The drawbacks of having 1 very large array:
Harder to resize; if you intend to resize individual layers, a single large block works against you.
As Daniel said, it might be hard to find a contiguous block of memory (keep in mind that something might feel large but isn't from a technical/hardware perspective).
The drawbacks of separate arrays or containers:
If you have a very granular, unpredictable access pattern, access times can be slower when it takes multiple steps to find a single location. For example, with a list of pointers to a list of pointers to a list of pointers, you pay three (slightly expensive) indirections every time. This can be avoided with proper coding.
In general I would be in favor of splitting up.
I have the following three-dimensional bit array (for a Bloom filter):
unsigned char P_bit_table_[P_ROWS][ROWS][COLUMNS];
The P_ROWS dimension represents independent two-dimensional bit arrays (i.e., P_ROWS[0], P_ROWS[1], P_ROWS[2] are independent bit arrays); each could be as large as 100 MB and contains data populated independently. The data I am looking for could be in any of these P_ROWS, and right now I search through them one at a time: P_ROWS[0], then P_ROWS[1], and so on, until I get a positive or reach the end (P_ROWS[n-1]). This implies that if n is 100 I have to do this search (bit comparison) 100 times, and this search is done very often. Somebody suggested that I could improve the search performance by doing bit grouping (use a column-major order on the row-major-order array -- I DON'T KNOW HOW).
I really need to improve the performance of the search because the program does a lot of it.
I will be happy to give more details of my bit table implementation if required.
Sorry for the poor language.
Thanks for your help.
EDIT:
The bit grouping could be done in the following format:
Assume the array to be :
unsigned char P_bit_table_[P_ROWS][ROWS][COLUMNS] = {{(a1,a2,a3),(b1,b2,b3),(c1,c2,c3)},
                                                     {(a1,a2,a3),(b1,b2,b3),(c1,c2,c3)},
                                                     {(a1,a2,a3),(b1,b2,b3),(c1,c2,c3)}};
As you can see, all the rows -- on the third dimension -- have similar data. What I want after the grouping is for all the a1's to be in one group (as just one entity, so that I can compare them with another bit to check whether they are on or off), all the b1's in another group, and so on.
Re-use Other People's Algorithms
There are a ton of bit-calculation optimizations out there including many that are non-obvious, like Hamming Weights and specialized algorithms for finding the next true or false bit, that are rather independent of how you structure your data.
Reusing algorithms that other people have written can really speed up computation and lookups, not to mention development time. Some algorithms are so specialized and use computational magic that will have you scratching your head: in that case, you can take the author's word for it (after you confirm their correctness with unit tests).
Take Advantage of CPU Caching and Multithreading
I personally reduce my multidimensional bit arrays to one dimension, optimized for expected traversal.
This way, there is a greater chance of hitting the CPU cache.
In your case, I would also think deeply about the mutability of the data and whether you want to put locks on blocks of bits. With 100MBs of data, you have the potential of running your algorithms in parallel using many threads, if you can structure your data and algorithms to avoid contention.
You may even have a lockless model if you divide up ownership of the blocks of data by thread so no two threads can read or write to the same block. It all depends on your requirements.
Now is a good time to think about these issues. But since no one knows your data and usage better than you do, you must consider design options in the context of your data and usage patterns.
I'm curious about the efficiency of using a higher dimensional array vs a one dimensional array. Do you lose anything when defining, and iterating through an array like this:
array[i][j][k];
or defining and iterating through an array like this:
array[k + j*jmax + i*imax];
My inclination is that there wouldn't be a difference, but I'm still learning about high efficiency programming (I've never had to care about this kind of thing before).
Thanks!
The only way to know for sure is to benchmark both ways (with optimization flags on in the compiler, of course). The one thing you lose for sure in the second method is clarity of reading.
The former way and the latter way to access arrays are identical once you compile it. Keep in mind that accessing memory locations that are close to one another does make a difference in performance, as they're going to be cached differently. Thus, if you're storing a high-dimensional matrix, ensure that you store rows one after the other if you're going to be accessing them that way.
In general, CPU caches optimize for temporal and spatial locality. That is, if you access memory address X, the odds of you accessing X+1 are higher. It's much more efficient to operate on values within the same cache line.
Check out this article on CPU caches for more information on how different storage policies affect performance: http://en.wikipedia.org/wiki/CPU_cache
If you can rewrite the indexing, so can the compiler. I wouldn't worry about that.
Trust your compiler(tm)!
It probably depends on the implementation, but I'd say it more or less compiles down to the same code as a one-dimensional array.
Do yourself a favor and worry about such things only after profiling the code. It is very unlikely that something like this will affect the performance of the application as a whole; using the correct algorithms is much more important.
And even if it does matter, it is most certainly only a single inner loop that needs attention.