3D FFT with data larger than cache - c++

I have searched for an answer to this question but have not found anything that can directly help me.
I am working on a 3D numerical integrator for a non-linear PDE using the parallel FFT library included in MKL.
My arrays consist of 2^30 data points, which is much, much larger than the cache. This results in ~50% of cache references being misses, which appears to add a massive amount of overhead when accessing memory.
Is there a clever way I can deal with this? Is it expected to have 50% cache misses using an array this large?
Any help would be much appreciated.
Thanks,
Dylan

2^30 data points in a single FFT counts as being quite big!
The data plus the exponentials and the output array are several thousand times bigger than the L3 cache, and millions of times bigger than L1.
Given that disparity one might argue that a 50% cache miss rate is actually quite good, especially for an algorithm like an FFT which accesses memory in non-sequential ways.
I don't think that there will be much you can do about it. The MKL is quite good, and I'm sure that they've taken advantage of whatever cache hinting instructions there are.
You might try contacting Mercury Systems Inc. (www.mrcy.com) and ask them about their Scientific Algorithms Library (SAL). They have a habit of writing their own math libraries, and in my experience they are pretty good at it. Their FFT on PowerPC was 30% quicker than the next best one; quite an achievement. You can try an un-optimised version of SAL for free (http://sourceforge.net/projects/opensal/). The real optimised for Intel SAL is definitely not free though.
Also bear in mind that no matter how clever the algorithm is, with a data set that size you're always going to be fundamentally stuck with main memory bandwidths, not cache bandwidths.
GPUs might be worth a look, but you'd need one with a lot of memory to hold 2^30 data points (complex values with 32-bit components come to 8 GB, the same again for the output array, plus exponentials, etc).

I think the problem of excessive misses is due to a failure of the cache prefetch mechanism, but not knowing the details of the memory accesses I can't tell you exactly why.
Even though your arrays are very large, 50% misses are excessive. The processor should avoid most misses by detecting that you are iterating over an array and prefetching the data elements you are likely to use ahead of time.
Either the pattern of array accesses is not regular, so the prefetcher in the processor cannot figure out a pattern to prefetch, or you have a cache associativity problem, that is, elements in your iteration get mapped to the same cache slots.
For example, assume a cache size of 1 MB and a set associativity of 4. In this simplified example, the cache maps memory to an internal slot using the lower 20 bits of the address. If you stride by 1 MB, i.e. your iterations jump exactly 1 MB at a time, then the lower 20 bits are always the same, so every access goes to the same cache slot and each new element shares that slot with the previous ones. Once you reach the fifth element, all four ways are used up and from then on you get nothing but misses; in that case your cache effectively has a single slot. If you stride by half the cache size, the effective number of slots is 2, which might be enough to avoid misses entirely, or give 100% misses, or anything in between, depending on whether your access pattern needs both slots simultaneously.
To convince yourself of this, write a toy program with varying stride sizes and you will see that strides that divide, or are multiples of, the cache size increase misses. You can measure this with valgrind --tool=cachegrind; a sketch follows below.
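Here is a minimal sketch of such a toy program; the 16-line working set and the suggested strides are arbitrary illustration values, not tuned to any particular cache:

```cpp
// conflict_misses.cpp -- toy demonstration of cache-associativity conflict misses.
// Build: g++ -O2 conflict_misses.cpp -o conflict_misses
// Run:   valgrind --tool=cachegrind ./conflict_misses <stride_in_bytes>
//        Compare e.g. 1048576 (1 MB, a power of two) with 1048640 (1 MB + one line).
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char** argv) {
    const std::size_t stride = (argc > 1) ? std::strtoull(argv[1], nullptr, 10)
                                          : (std::size_t)1 << 20;
    const std::size_t hot_lines = 16;              // only 16 "hot" cache lines in play
    std::vector<char> buf(stride * hot_lines, 1);
    volatile char* data = buf.data();              // volatile keeps every access in the binary

    long long sum = 0;
    // Repeatedly touch the same 16 addresses, spaced `stride` bytes apart. If the
    // stride is a multiple of (cache size / associativity), all 16 lines compete
    // for the same set and nearly every access misses; otherwise they all stay cached.
    for (int rep = 0; rep < 1000000; ++rep)
        for (std::size_t k = 0; k < hot_lines; ++k)
            sum += data[k * stride];

    std::printf("sum = %lld\n", sum);
    return 0;
}
```

Run it once with a power-of-two stride and once with the stride perturbed by one cache line, and compare the D1 miss rates cachegrind reports; the set-conflict effect is obvious.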

You should first make sure you know what is causing the cache misses; they may be the fault of other code you've written rather than the FFT library. In fact, I expect that is very likely the case.
The rest of this post assumes that the FFT is really at fault and we need to optimize.
The standard trick to get data locality out of an FFT is to
1. Arrange the data in a two-dimensional array
2. Do an FFT along each row
3. Apply twiddle factors
4. Do a matrix transpose
5. Do an FFT along each row
This is the Cooley-Tukey algorithm, in the case where we factor 2^(m+n) = 2^m * 2^n.
The point of this is that the recursive calls to the FFT are much much smaller, and may very well fit in cache. And if not, you can apply this method recursively until things do fit in cache. And if you're ambitious, you do a lot of benchmarking to figure out the optimal way to do the splitting.
Thus, assuming you also use a good matrix transpose algorithm, the end result is a relatively cache-friendly FFT.
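To make the structure concrete, here is an illustration-only sketch of that decomposition. fft_1d is a naive placeholder for whatever contiguous 1-D FFT routine your library provides (the real point is the data movement around it), the transposes are unblocked, and the result is returned in the transposed order that four-step implementations typically leave it in:

```cpp
#include <algorithm>
#include <complex>
#include <vector>

using cd = std::complex<double>;
static const double PI = 3.14159265358979323846;

// Placeholder for the library's contiguous 1-D FFT (e.g. an MKL descriptor applied to
// one row). A naive O(n^2) DFT is used here only so the sketch is self-contained.
void fft_1d(cd* row, std::size_t n) {
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cd s(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j)
            s += row[j] * std::polar(1.0, -2.0 * PI * double(j * k % n) / double(n));
        out[k] = s;
    }
    std::copy(out.begin(), out.end(), row);
}

// Sketch of the row/column (Cooley-Tukey) decomposition of an N = R*C point FFT.
// The output comes back in "transposed" order: X[q*C + p] ends up at out[p*R + q].
std::vector<cd> fft_four_step(const std::vector<cd>& x, std::size_t R, std::size_t C) {
    const std::size_t N = R * C;

    // Step 1: arrange the data as an R x C matrix whose row r holds the stride-R
    // subsequence x[r], x[r+R], x[r+2R], ...  (this gather is itself a transpose).
    std::vector<cd> a(N);
    for (std::size_t r = 0; r < R; ++r)
        for (std::size_t c = 0; c < C; ++c)
            a[r * C + c] = x[c * R + r];

    // Step 2: FFT along each contiguous (and hopefully cache-resident) row.
    for (std::size_t r = 0; r < R; ++r)
        fft_1d(&a[r * C], C);

    // Step 3: apply the twiddle factor exp(-2*pi*i*r*p / N) to element (r, p).
    for (std::size_t r = 0; r < R; ++r)
        for (std::size_t p = 0; p < C; ++p)
            a[r * C + p] *= std::polar(1.0, -2.0 * PI * double(r) * double(p) / double(N));

    // Step 4: matrix transpose (use a cache-blocked transpose in real code).
    std::vector<cd> b(N);
    for (std::size_t r = 0; r < R; ++r)
        for (std::size_t p = 0; p < C; ++p)
            b[p * R + r] = a[r * C + p];

    // Step 5: FFT along each row of the transposed C x R matrix.
    for (std::size_t p = 0; p < C; ++p)
        fft_1d(&b[p * R], R);

    return b;   // b[p*R + q] == X[q*C + p]
}
```

In a real implementation the row FFTs, twiddle multiplication and transposes would all come from (or be fused with) the library's optimized, threaded routines; the naive pieces above are only there to show the data layout at each step.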
The library you're using really should be doing this already. If it's not, then some options are:
Maybe it exposes enough lower-level functionality that you can apply Cooley-Tukey efficiently yourself, even though the high-level routines don't.
You could implement Cooley-Tukey yourself, using the given library to do the smaller FFTs.

Related

Any optimization for random-access array modification?

Given an array A of size 10^5.
Then m operations are given (m is very large, m >> the size of A); each operation takes a position p and an increment t:
A[p] += t
Finally, I output the value of every position of the whole array.
Is there any constant-factor optimization that can speed up the intermediate modification operations?
For example, if I sort the positions, I can apply the modifications sequentially and avoid random access. However, that incurs the additional cost of sorting. Is there any other way to speed this up?
Re-executing all operations after sorting can be an order of magnitude faster than executing them directly, but the cost of the sort itself is too high.
On architectures with many cores, the best solution is certainly to perform atomic updates of A[p] in parallel. This assumes the number of cores is big enough for the parallelism not only to mitigate the overhead of the atomic operations but also to beat the serial implementation. It can be done quite easily with OpenMP or with native C++ threads/atomics. The number of cores should not be too large, though; otherwise the number of conflicts may grow significantly, causing contention and hurting performance. That should be fine here since the number of items is pretty big. This solution also assumes the accesses are fairly uniformly random. If they are not (e.g. a normal distribution), the contention can be too high for the method to be efficient.
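For illustration, a minimal OpenMP sketch of that atomic variant; the Op record and the scheduling are made-up placeholders, not a tuned implementation:

```cpp
#include <cstdint>
#include <vector>
#include <omp.h>   // compile with -fopenmp

struct Op { std::int64_t p; std::int64_t t; };   // illustrative operation record

// Apply all operations with atomic updates; each thread handles a slice of ops.
void apply_atomic(std::vector<std::int64_t>& A, const std::vector<Op>& ops) {
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)ops.size(); ++i) {
        #pragma omp atomic
        A[ops[i].p] += ops[i].t;
    }
}
```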
An alternative solution is to split the accesses between N threads spatially. The array range can be statically split into N (roughly equal) parts. All the threads read the inputs, but only the thread owning the target range of the output array writes into it. The array parts can then be combined afterwards. This method works well with few threads and when the data distribution is uniform. When the distribution is not uniform at all (e.g. a normal distribution), a pre-computation step may be needed to adjust the array ranges owned by the threads. For example, one can compute the median, or even the quartiles, to better balance the work between threads. Computing quartiles can be done with a selection algorithm like Floyd-Rivest (std::nth_element should not be too bad, although I expect it to use a kind of introselect algorithm that is often a bit slower). The pre-computation may be expensive, but it should still be significantly cheaper than a full sort. Using OpenMP is certainly a good idea to implement this.
Another alternative is simply to perform the reduction separately in each thread, each thread accumulating into its own private copy of the array, and then sum the per-thread arrays into the global array. This works well in your case (since m >> the size of A) as long as the number of cores is not too big; if it is, one needs to mix this method with the first one. This last method is probably the simplest efficient one.
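And a sketch of this per-thread-copy reduction, with the same made-up Op record (again only an illustration of the idea):

```cpp
#include <cstdint>
#include <vector>
#include <omp.h>   // compile with -fopenmp

struct Op { std::int64_t p; std::int64_t t; };   // illustrative operation record

// Per-thread private accumulation followed by a reduction over the copies.
// Works well when m >> |A| and the number of cores is moderate.
void apply_private_copies(std::vector<std::int64_t>& A, const std::vector<Op>& ops) {
    const int nthreads = omp_get_max_threads();
    std::vector<std::vector<std::int64_t>> local(
        nthreads, std::vector<std::int64_t>(A.size(), 0));

    #pragma omp parallel
    {
        std::vector<std::int64_t>& mine = local[omp_get_thread_num()];
        #pragma omp for schedule(static)
        for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)ops.size(); ++i)
            mine[ops[i].p] += ops[i].t;          // no synchronization needed here
    }

    // Sum the private copies into the global array (parallel over positions).
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t j = 0; j < (std::ptrdiff_t)A.size(); ++j)
        for (int t = 0; t < nthreads; ++t)
            A[j] += local[t][j];
}
```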
Besides, @Jérôme Richard's answer targets parallel multi-threaded computation.
I would suggest a partial sort, along the lines of "merge-sort-just-a-few-iterations" or "bucket-sort-only-in-bucket" (note that they are different). Preferably, set the block/bucket size to the OS page size to get better overall performance at the OS level, especially considering that m is extraordinarily big. The cost of the partial sort would be amortized by the savings in cache misses and page swaps.
And if this is an interview question, I would ask for more details about m, p, t, data sparsity, distribution, hardware, CPU, memory, power consumption, latency, etc., and customize a more detailed design for each new condition.

CUDA - operations on single elements of a matrix - getting ideas

I'm about to write a CUDA kernel that performs a single operation on every single element of a matrix (e.g. square-rooting every element, or exponentiating, or calculating the sine/cosine if all the numbers are in [-1; 1], etc.).
I chose the blocks/threads grid dimensions and I think the code is pretty straightforward and simple, but I'm asking myself... what can I do to maximize coalescence/SM occupancy?
My first idea was: have each half-warp (16 threads) load a batch of data from global memory and then set them all computing, but it turns out there is not enough overlap between memory transfers and computation: all threads load data, then compute, then load data again, then compute again, which sounds really poor in terms of performance.
I thought using shared memory would be great, maybe using some sort of locality so that a thread loads more data than it actually needs in order to facilitate other threads' work, but this sounds stupid too, because the latter would have to wait for the former to finish loading before starting its own work.
I'm not really sure I've described my problem clearly; I'm just gathering ideas before starting to work on something concrete.
Every comment/suggestion/criticism is welcome, and thanks.
If you have defined the grid so that threads read along the major dimension of the array containing your matrix, then you have already guaranteed coalesced memory access, and there is little else to be done to improve performance. These sorts of O(N) operations really do not contain sufficient arithmetic intensity to give good parallel speed-up over an optimized CPU implementation. Often the best strategy is to fuse multiple O(N) operations together into a single kernel to improve the FLOP-to-memory-transaction ratio.
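As an illustration of that fusion idea, here is a sketch of a single kernel that applies several element-wise operations in one pass; the particular operations, names, and launch configuration are arbitrary:

```cuda
#include <cuda_runtime.h>
#include <math.h>

// One fused pass: y[i] = sin(sqrt(x[i])) * scale -- three "O(N) operations" in a
// single kernel, so each element makes one trip from and to global memory.
__global__ void fused_elementwise(const float* __restrict__ x,
                                  float* __restrict__ y,
                                  float scale, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n)
        y[i] = sinf(sqrtf(x[i])) * scale;            // consecutive threads -> coalesced
}

// Host-side launch (error checking omitted for brevity).
void launch_fused(const float* d_x, float* d_y, float scale, int n)
{
    const int block = 256;
    const int grid = (n + block - 1) / block;
    fused_elementwise<<<grid, block>>>(d_x, d_y, scale, n);
}
```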
In my eyes, your problem is this: "load data ensemble from global memory".
It seems that your algorithm idea is:
1. Do something on the CPU: have some matrix.
2. Transfer the matrix from host to device memory.
3. Perform your operation on every element.
4. Transfer the matrix back from device to host memory.
5. Do something else on the CPU, sometimes going back to step 1.
This kind of computation is almost always I/O-bandwidth limited (I/O meaning memory I/O), not compute limited. GPGPUs can sustain a very high memory bandwidth, but only between device memory and the GPU; transfers from host memory always go over the comparatively slow PCIe bus (slow compared to the device memory connection, which can deliver 160 GB/s and more on fast cards). So the main thing for getting good results is to keep the data (the matrix) in device memory, preferably even generating it there if possible (depending on your problem). Never shuttle the data back and forth between CPU and GPU, as the transfer overhead eats up all your speedup. Also keep in mind that your matrix must have a certain size to amortize the transfer overhead, which you can't avoid (computing a matrix with 10 x 10 elements would gain you almost nothing; in fact it would even cost more).
Interleaving transfer/compute/transfer is perfectly fine, that's how such GPU algorithms work, but only if the transfer is from device memory.
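A rough sketch of that resident-data structure, with square_inplace standing in for whatever per-element kernel you actually run (error checking omitted, names illustrative):

```cuda
#include <cuda_runtime.h>

// Stand-in for whatever per-element operation you actually perform.
__global__ void square_inplace(float* m, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) m[i] = m[i] * m[i];
}

// Keep the matrix resident in device memory across many iterations; the PCIe
// transfer cost is paid only once in each direction.
void run_resident(float* host_matrix, int n, int iterations)
{
    float* d_m = nullptr;
    cudaMalloc(&d_m, n * sizeof(float));
    cudaMemcpy(d_m, host_matrix, n * sizeof(float), cudaMemcpyHostToDevice);   // once

    const int block = 256, grid = (n + block - 1) / block;
    for (int it = 0; it < iterations; ++it)
        square_inplace<<<grid, block>>>(d_m, n);   // all intermediate work stays on the GPU

    cudaMemcpy(host_matrix, d_m, n * sizeof(float), cudaMemcpyDeviceToHost);   // once
    cudaFree(d_m);
}
```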
The GPU for something this trivial is overkill and will be slower than just keeping it on the CPU. Especially if you have a multicore CPU.
I have seen many projects showing the "great" advantages of the GPU over the CPU. They rarely stand up to scrutiny. Of course, goofy managers who want to impress their own managers want to show how "leading edge" their group is.
Someone in the department toils for months getting silly GPU code optimized (which is generally 8x harder to read than equivalent CPU code), then has the "equivalent" CPU code written by an outsourced sweatshop (by a programmer whose last project was PGP), compiled with the slowest version of gcc they can find and with no optimization, and then touts a 2x speed improvement. And by the way, many overlook I/O speed as if it were somehow not important.

how to memory map a huge matrix?

Suppose you have a huge (40+ GB) feature-value (floating-point) matrix; rows are different features and columns are the samples/images.
The table is precomputed column-wise.
Then it is accessed entirely row-wise and multi-threaded (each thread loads a whole row), several times.
What would be the best way to handle this matrix? I'm especially pondering over 5 points:
1. Since it's run on an x64 PC, I could memory-map the whole matrix at once, but would that make sense?
2. What about the effects of multithreading (multithreaded initial computation as well)?
3. How should the matrix be laid out: row-major or column-major?
4. Would it help to mark the matrix as read-only after the precomputation has finished?
5. Could something like http://www.kernel.org/doc/man-pages/online/pages/man2/madvise.2.html be used to speed it up?
Memory mapping the whole file could make the process much easier.
You want to lay out your data to optimize for the most common access pattern. It sounds like the data is going to be written once (column-wise) and read several times (row-wise). That suggests the data should be stored in row-major order.
Marking the matrix read-only once the pre-computation is done probably won't help performance (there are some possible low-level optimizations, but I don't think anything implements them), but it will prevent bugs from accidentally writing to data you don't intend to. Might as well.
madvise could end up being useful, once you've got your application written and working.
My overall advice: write the program in the simplest way you can, sequentially at first, and then put timers around the whole thing and the various major operations. Make sure the major operation times sum to the overall time, so you can be sure you're not missing anything. Then target your performance improvement efforts toward the components that are actually taking the most time.
Per JimR's mention of 4MB pages in his comment, you may end up wanting to look into hugetlbfs or using a Linux Kernel release with transparent huge page support (merged for 2.6.38, could probably be patched into earlier versions). This would likely save you a whole lot of TLB misses, and convince the kernel to do the disk IO in sufficiently large chunks to amortize any seek overhead.
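For reference, a sketch of what the mmap/madvise part might look like (plain POSIX calls; the read-only mapping, the float element type, and the sequential-access hint are assumptions that would need to match your actual layout):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

// Map the precomputed matrix file read-only and tell the kernel we'll stream
// through it (row-major layout, rows read front to back).
float* map_matrix(const char* path, std::size_t& bytes_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    bytes_out = static_cast<std::size_t>(st.st_size);

    void* p = mmap(nullptr, bytes_out, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                        // the mapping stays valid after close
    if (p == MAP_FAILED) return nullptr;

    // Hint the expected access pattern; MADV_SEQUENTIAL enables aggressive
    // read-ahead for front-to-back scans of each row.
    madvise(p, bytes_out, MADV_SEQUENTIAL);

    return static_cast<float*>(p);
}
```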
1. Maybe, see below.
2. The size of the total working set of all threads must not exceed the available RAM, otherwise the program will run at a snail's pace because of swapping.
3. The layout should match the access patterns, as long as condition 2 is respected.
4. What do you mean by "mark as read-only"?
5. Measure it.
Re 3: If you have, e.g., 8 CPUs but do not have enough RAM to load 8 rows, you should make each thread process its row sequentially in manageable chunks. In that case a block layout of the matrix would make sense. If each thread MUST have its whole row in memory to process it, I'm afraid you can't use all the CPUs, as the process will start thrashing, i.e., kicking some subset of the matrix out of RAM and reloading another needed subset. This is slightly less bad than full swapping, because the matrix is never modified, so the contents of the pages do not need to be written to the swap file before being evicted. But it still hurts performance badly.
Also, doing random access I/O from multiple threads is a bad idea, which is what you'll end up doing if you use mmap(). You have (presumably) only a single disk, and parallel I/O will just make it slower. So mmap() might not make sense and you could achieve better I/O performance by reading data sequentially into ram.
Note that 40 GB is approximately 10.5 million pages of 4096 bytes. By doing mmap(), you will, in the worst case, slow down the computation by that many hard disk seeks. At 8 ms per seek (figure taken from Wikipedia), you'd end up wasting roughly 84,000 seconds, i.e., almost a whole day!
If you could fit the whole thing into main memory, then yes: memory-map it all, and it doesn't matter whether it's column-major or row-major. However, at 40+ GB, I'm sure it's too big for main memory. In which case:
1. No, don't map the whole thing! At least, don't expect the memory to behave like normal memory if you map it all. Your program will take forever if you don't properly deal with the I/O issues.
2. The multi-threaded access issue is solved if you store it row-major (it sounds like you don't have multi-threaded column writes).
3. You should lay it out row-wise, assuming each cell is written once and then read many times.
4. Yes, I think it would help to mark the matrix as read-only after it's been written, but purely as a way to prevent bugs (accidental writes). It won't affect performance.
5. No, no amount of clever kernel read-ahead is going to solve your performance problems. You need to solve it at the algorithm level.
I think you are going to have a performance problem with a naive implementation. Either the computer will thrash while writing (if you store it row-major) or it will thrash while querying (if you store it column-major). The latter is presumably worse, but it's a problem both ways.
The right solution is to use an intermediate representation which is neither row-major nor column-major but 'large squares'. Take the first 50,000 columns and store them in a memory-mapped file (phase 1). It doesn't matter if it's column major or row major since it'll be purely memory resident. Then, take each row and write it into the final row-major memory-mapped file (phase 2). Then repeat the cycle for the next 50,000 columns, and so on.
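A sketch of the phase-2 copy for one such chunk of columns; the names and sizes are illustrative, and out stands for the final row-major memory-mapped file:

```cpp
#include <cstddef>
#include <vector>

// Phase 2 of the "large squares" scheme: a chunk of `chunk_cols` columns has been
// computed column-major in RAM; copy it row by row into the row-major output
// (here `out` stands for the memory-mapped final file).
void flush_chunk(float* out, std::size_t n_rows, std::size_t n_cols,
                 const std::vector<float>& chunk,        // chunk[c * n_rows + r]
                 std::size_t first_col, std::size_t chunk_cols)
{
    for (std::size_t r = 0; r < n_rows; ++r) {
        float* dst = out + r * n_cols + first_col;       // contiguous run within row r
        for (std::size_t c = 0; c < chunk_cols; ++c)
            dst[c] = chunk[c * n_rows + r];              // gather from the column-major chunk
    }
}
```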

Data type size impact on performance

I'm running intensive numerical simulations. I often use long integers, but I realized it would be safe to use integers instead. Would this improve the speed of my simulations significantly?
Depends. If you have lots and lots of numbers consecutively in memory, they're more likely to fit in the L2 cache, so you have fewer cache misses. L2 cache misses can, depending on your platform, have a significant performance impact, so it's definitely a good idea to make things fit in cache as much as possible (and prefetch if you can). But don't expect your code to suddenly fly just because your types are smaller.
EDIT: One more thing - if you choose an awkward type (like a 16-bit integer on a 32-bit or 64-bit platform), you may end up with worse performance because the CPU will have to surgically extract the 16-bit value and turn it into something it can work with. But typically, ints are a good choice.
Depends on your data set sizes. Obviously, halving the size of your integers could double the amount of data that fits into the CPU caches, and thus access to the data would be faster. For more details I suggest you read Ulrich Drepper's famous paper What Every Programmer Should Know About Memory.
This is why typedef is your friend. :-)
If mathematically possible, try using floats instead of integers. I read somewhere that floating point arithmetic (esp. multiplication) can actually be faster on some processors.
The best thing is to experiment and benchmark. It's damn near impossible to figure out analytically which micro-optimizations work best.
EDIT: This post discusses the performance difference between integer and float.
All the answers have already treated the CPU cache issue: if your data is two times smaller, then in some cases it can fit into L2 cache completely, yielding performance boost.
However, there is another very important and more general factor: memory bandwidth. If your algorithm is linear (i.e. O(N) complexity) and accesses memory sequentially, then it may be memory-bound. That means memory reads/writes are the bottleneck, and the CPU simply wastes a lot of cycles waiting for memory operations to complete. In such a case, reducing the total memory size by a factor of two would yield a reliable 2x performance boost.
Moreover, in such cases switching to bytes may yield an even bigger performance boost, despite the fact that computations on bytes may be slower on the CPU, as one of the other answerers has already mentioned.
In general, the answer depends on several things: the total size of the data your algorithm works with, the memory access pattern (random/sequential), the algorithm's asymptotic complexity, and the computation-per-memory-access ratio (mostly relevant for linear algorithms).
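If you want to see the bandwidth effect directly, a rough benchmark sketch along these lines (array size and element types chosen arbitrarily; absolute numbers will vary by machine) times the same streaming reduction over different element widths:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <vector>

// Time a streaming sum over n elements of type T; for arrays far larger than the
// caches the runtime tracks bytes moved, so narrower types tend to finish roughly
// in proportion to their size.
template <typename T>
double time_sum(std::size_t n)
{
    std::vector<T> v(n, T(1));
    auto t0 = std::chrono::steady_clock::now();
    auto s = std::accumulate(v.begin(), v.end(), std::int64_t(0));
    auto t1 = std::chrono::steady_clock::now();
    std::printf("sum=%lld ", (long long)s);              // keep the result observable
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    const std::size_t n = std::size_t(1) << 27;          // ~134M elements
    std::printf("int64: %.3f s\n", time_sum<std::int64_t>(n));
    std::printf("int32: %.3f s\n", time_sum<std::int32_t>(n));
    std::printf("int8 : %.3f s\n", time_sum<std::int8_t>(n));
    return 0;
}
```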

How to implement Radix sort on multi-GPU?

How do you implement radix sort on multiple GPUs? The same way as on a single GPU, i.e., by splitting the data, building histograms on separate GPUs, and then merging the data back (like a bunch of cards)?
That method would work, but I don't think it would be the fastest approach. Specifically, merging histograms for every K bits (K=4 is currently best) would require the keys to be exchanged between GPUs 32/K = 8 times to sort 32-bit integers. Since the memory bandwidth between GPUs (~5GB/s) is much lower than the memory bandwidth on a GPU (~150GB/s) this will kill performance.
A better strategy would be to split the data into multiple parts, sort each part in parallel on a different GPU, and then merge the parts once at the end. This approach requires only one inter-GPU transfer (vs. 8 above) so it will be considerably faster.
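As a sketch of that split / sort-per-GPU / single-merge idea, assuming two GPUs and Thrust for the on-device sort (error handling omitted; the host-side merge is just one possible way to combine the parts):

```cuda
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Sort one chunk of keys on the given GPU, then copy the sorted chunk back to host memory.
static void sort_chunk_on_gpu(int device, std::uint32_t* first, std::size_t count)
{
    cudaSetDevice(device);
    thrust::device_vector<std::uint32_t> d(first, first + count);   // host -> device
    thrust::sort(d.begin(), d.end());                               // radix sort for integer keys
    cudaMemcpy(first, thrust::raw_pointer_cast(d.data()),
               count * sizeof(std::uint32_t), cudaMemcpyDeviceToHost);
}

// Split the keys, sort each half on its own GPU in parallel, then merge once on the host.
void multi_gpu_sort(std::vector<std::uint32_t>& keys)
{
    const std::size_t half = keys.size() / 2;
    std::thread t0(sort_chunk_on_gpu, 0, keys.data(), half);
    std::thread t1(sort_chunk_on_gpu, 1, keys.data() + half, keys.size() - half);
    t0.join();
    t1.join();
    std::inplace_merge(keys.begin(), keys.begin() + half, keys.end());
}
```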
Unfortunately this question is not adequately posed. It depends on element size, where the elements begin life in memory, and where you want the sorted elements to end up residing.
Sometimes it's possible to compress the sorted list by storing elements in groups sharing the same common prefix, or you can deduplicate elements on the fly, storing each element once in the sorted list with an associated count. For example, you might sort a huge list of 32-bit integers into 64K distinct lists of 16-bit values, cutting your memory requirement in half.
The general principle is that you want to make as few passes over the data as possible, and that your throughput will almost always correspond to the bandwidth constraints associated with your storage policy.
If your data set exceeds the size of fast memory, you probably want to finish with a merge pass rather than continue to radix sort, as another person has already answered.
I'm just getting into GPU architecture and I don't understand the K=4 comment above. I've never seen an architecture yet where such a small K would prove optimal.
I suspect merging histograms is also the wrong approach. I'd probably let the elements fragment in memory rather than merge histograms. Is it that hard to manage meso-scale scatter/gather lists in the GPU fabric? I sure hope not.
Finally, it's hard to conceive of a reason why you would want to involve multiple GPUs for this task. Say your card has 2 GB of memory and 60 GB/s of write bandwidth (that's what my mid-range card shows). A three-pass radix sort (11-bit histograms) generates 6 GB of writes (write bandwidth being the likely rate-limiting factor), or about 100 ms to sort a 2 GB list of 32-bit integers. Great, they're sorted, now what? If you need to ship them anywhere else without some kind of preprocessing or compression, the sorting time will be small fish.
In any case, I just compiled my first example programs today. There's still a lot to learn. My target application is permutation-intensive, which is closely related to sorting. I'm sure I'll weigh in on this subject again in the future.