I have a large piece of software written in Fortran 77, running on Linux, that stores time-indexed matrices for several different series in multidimensional arrays, but the indexing order is row-major, like C. Fortran, however, is column-major, so there are cache-miss penalties when the values are accessed sequentially.
As an example, I have two series of 100,000 10x10 matrices. I'm storing them as:
MATRIX(2, 100000, 10, 10)
but, as I understand it, if the values are to be accessed sequentially, the optimal declaration for Fortran would be
MATRIX(10, 10, 100000, 2)
Right now the impact of refactoring all the code to the other ordering would be large, but I would like to get an idea of the potential benefit first.
Is there any way I could easily measure the impact of having the wrong ordering, to estimate the potential improvement of changing it?
Maybe some automated measurement of cache misses, or of the cache misses associated with those arrays.
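One low-effort way to estimate the potential improvement without touching the real code is a small stand-alone benchmark that emulates both layouts and the program's access order. The sketch below is such a toy (written in C++ for brevity, but the same comparison can be written in Fortran 77), with the sizes copied from the example above; for the real program, hardware miss counters can be read with tools like perf stat -e cache-misses or valgrind --tool=cachegrind.

```cpp
// Toy benchmark: emulate Fortran column-major storage for the two declarations
// MATRIX(2,100000,10,10) and MATRIX(10,10,100000,2), then time the same sequential
// sweep (series -> time step -> matrix element) over both. Sizes are taken from the
// question; each array is ~160 MB, so adjust if memory is tight.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const int NS = 2, NT = 100000, N = 10;
    std::vector<double> a(static_cast<std::size_t>(NS) * NT * N * N, 1.0); // MATRIX(NS,NT,N,N)
    std::vector<double> b(a.size(), 1.0);                                  // MATRIX(N,N,NT,NS)

    auto bench = [](const char* label, auto&& body) {
        auto t0 = std::chrono::steady_clock::now();
        double sum = body();
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%s: sum=%g  %.1f ms\n", label, sum,
                    std::chrono::duration<double, std::milli>(t1 - t0).count());
    };

    // Current layout: the series index varies fastest in memory, so consecutive
    // elements of one matrix are NS*NT doubles apart.
    bench("current ", [&] {
        double s = 0.0;
        for (int is = 0; is < NS; ++is)
            for (int it = 0; it < NT; ++it)
                for (int j = 0; j < N; ++j)
                    for (int i = 0; i < N; ++i)
                        s += a[is + NS * (it + static_cast<std::size_t>(NT) * (i + N * j))];
        return s;
    });

    // Proposed layout: the elements of one matrix are contiguous.
    bench("proposed", [&] {
        double s = 0.0;
        for (int is = 0; is < NS; ++is)
            for (int it = 0; it < NT; ++it)
                for (int j = 0; j < N; ++j)
                    for (int i = 0; i < N; ++i)
                        s += b[i + N * (j + N * (it + static_cast<std::size_t>(NT) * is))];
        return s;
    });
}
```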
Related
Given an array A of size 10^5.
Then given m operations (m is very large, m >> the size of A), where each operation specifies a position p and a value t and performs
A[p]+=t
Finally, I output the value of each position of the whole array.
Is there any constant-factor optimization to speed up the intermediate modification operations?
For example, if I sort the positions, I can modify them sequentially to avoid random access. However, this operation will incur an additional sorting cost. Is there any other way to speed it up?
Trying to re-execute all operations after sorting can be an order of magnitude faster than executing them directly. But the cost of sorting is too high.
On architectures with many cores, the best solution is certainly to perform atomic updates of A[p] in parallel. This assumes the number of cores is big enough for the parallelism not only to mitigate the overhead of the atomic operations but also to beat the serial implementation. It can be done quite easily with OpenMP or with native C++ threads/atomics. The number of cores must not be too large either, otherwise the number of conflicts may grow significantly, causing contention and thus decreasing performance. That should be fine here since the number of items is pretty big. This solution also assumes the accesses are fairly uniformly random; if they are not (e.g. a normal distribution), the contention can be too high for the method to be efficient.
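A minimal OpenMP sketch of this atomic approach (function and array names are illustrative; the updates are assumed to arrive as parallel arrays p and t):

```cpp
// Sketch of the atomic approach with OpenMP. A is the target array; the m updates are
// assumed to be given as parallel arrays p (positions) and t (increments).
#include <cstdint>
#include <vector>

void apply_updates_atomic(std::vector<std::int64_t>& A,
                          const std::vector<int>& p,
                          const std::vector<std::int64_t>& t) {
    const std::int64_t m = static_cast<std::int64_t>(p.size());
    #pragma omp parallel for schedule(static)
    for (std::int64_t k = 0; k < m; ++k) {
        #pragma omp atomic
        A[p[k]] += t[k];   // hardware atomic add; contention stays low if positions are spread out
    }
}
```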
An alternative solution is to split the accesses between N threads spatially. The array range can be statically split into N (roughly equal) parts. All the threads read the inputs, but only the thread owning the target range of the output array writes into it. The array parts can then be combined afterwards. This method works well with few threads and when the data distribution is uniform. When the distribution is not uniform at all (e.g. a normal distribution), a pre-computation step may be needed to adjust the array ranges owned by the threads. For example, one can compute the median, or even the quartiles, to better balance the work between threads. Computing quartiles can be done with a selection algorithm like Floyd-Rivest (std::nth_element should not be too bad, though I expect it to use a kind of introselect algorithm that is often a bit slower). The pre-computation may be expensive, but it should still be significantly faster than a full sort. Using OpenMP is certainly a good idea for implementing this.
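A sketch of this range-ownership variant, assuming a simple uniform split of the array (the quartile-based balancing described above would replace the even division; names are again illustrative):

```cpp
// Sketch of the spatial-split approach: every thread scans all updates but applies only
// those falling into the sub-range of A it owns, so no atomics are needed. A uniform
// split is assumed; quartile-based balancing would replace the even division.
#include <cstdint>
#include <vector>
#include <omp.h>

void apply_updates_owned(std::vector<std::int64_t>& A,
                         const std::vector<int>& p,
                         const std::vector<std::int64_t>& t) {
    const std::int64_t n = static_cast<std::int64_t>(A.size());
    const std::int64_t m = static_cast<std::int64_t>(p.size());
    #pragma omp parallel
    {
        const int nthreads = omp_get_num_threads();
        const int tid      = omp_get_thread_num();
        const std::int64_t lo = n * tid / nthreads;        // first index owned by this thread
        const std::int64_t hi = n * (tid + 1) / nthreads;  // one past the last owned index
        for (std::int64_t k = 0; k < m; ++k) {
            const std::int64_t pos = p[k];
            if (pos >= lo && pos < hi)
                A[pos] += t[k];                            // single writer per range
        }
    }
}
```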
Another alternative is simply to perform the reduction separately in each thread and then sum the per-thread arrays into a global array at the end. This solution works well in your case (since m >> the size of A) assuming the number of cores is not too big; if it is, one needs to mix this method with the first one. This last method is probably the simplest efficient one.
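And a sketch of the per-thread reduction; each thread pays O(size of A) extra memory, which is acceptable here since A is small compared to m:

```cpp
// Sketch of the per-thread reduction: each thread accumulates into a private copy of A
// and the copies are summed into the global array at the end. (With OpenMP 4.5+ and a
// raw array this can also be written with "reduction(+:A[:n])".)
#include <cstdint>
#include <vector>

void apply_updates_reduction(std::vector<std::int64_t>& A,
                             const std::vector<int>& p,
                             const std::vector<std::int64_t>& t) {
    const std::int64_t n = static_cast<std::int64_t>(A.size());
    const std::int64_t m = static_cast<std::int64_t>(p.size());
    #pragma omp parallel
    {
        std::vector<std::int64_t> local(n, 0);   // private accumulator, no sharing during the scan
        #pragma omp for nowait
        for (std::int64_t k = 0; k < m; ++k)
            local[p[k]] += t[k];
        #pragma omp critical                     // executed once per thread, cost O(n) each
        for (std::int64_t i = 0; i < n; ++i)
            A[i] += local[i];
    }
}
```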
Besides, @Jérôme Richard's answer targeted parallel multi-threaded computation.
I would describe the idea as a partial sort, something like "merge sort for only a few iterations" or "bucket sort only within each bucket" (note, they are different). Preferably, set the bucket size to the page size to get better overall performance at the OS level, especially considering m is extraordinarily big. The cost of the partial sort would be amortized by the savings in cache misses and page swaps.
And if this is an interview question, I would ask for more details about m, p, t, data sparsity, distribution, hardware, CPU, memory, power consumption, latency, etc., and for each new condition customize the design in more detail accordingly.
I would like to write, in C++/TensorFlow, a sparse matrix dense vector (SpMV) multiplication: y = Ax
The sparse matrix, A, is stored in CSR format. The typical sparsity of A is between 50% and 90%. The goal is to achieve a time similar to or better than that of dense matrix dense vector (DMv) multiplication.
Please note that I have already viewed the following posts: Q1 Q2 Q3. However, I still am wondering about the following:
How does SpMV multiplication compare to DMv in terms of time? Since the sparsity is relatively high, I assume that SpMV should be better given the reduction in the number of operations - yes?
What should I take into account to make SpMV as fast as or faster than DMv? Why do people say that DMv will perform better than SpMV? Does the storage representation make a difference?
Any recommended libraries that do SpMV in C++, for either CPU or GPU?
This question is relevant to my other question here: (CSCC: Convolution Split Compression Calculation Algorithm for Deep Neural Network)
To answer the edited question:
Unless the Matrix is very sparse (<10% nonzeros on CPU, probably <1% on GPU), you will likely not benefit from the sparsity. While the number of floating point operations is reduced, the amount of storage is at least double (column or row index + value), memory access is irregular (you have an indirection via the index for the right-hand side), it becomes far more difficult to vectorize (or to achieve coalescing on the GPU) and if you parallelize you have to deal with the fact that rows are of varying length and therefore a static schedule is likely to be suboptimal.
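To make the indirection concrete, this is roughly what a scalar CSR kernel looks like (a minimal sketch; row_ptr, col_idx and val are the usual CSR arrays, and the indexed load of x is the irregular access mentioned above):

```cpp
// Minimal scalar CSR kernel for y = A x. row_ptr has n_rows+1 entries; col_idx and val
// hold the nonzeros row by row. Note the indirect load x[col_idx[k]].
#include <vector>

void spmv_csr(int n_rows,
              const std::vector<int>&    row_ptr,
              const std::vector<int>&    col_idx,
              const std::vector<double>& val,
              const std::vector<double>& x,
              std::vector<double>&       y) {
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];   // irregular, index-driven access to x
        y[i] = sum;
    }
}
```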
Beyond the points above, yes, the storage representation matters. For example a COO-matrix stores two indices and the value, while CSR/CSC only store one but require an additional offset array which makes them more complex to build on the fly. Especially on the GPU, storage formats matter if you want to at least achieve some coalescing. This paper looks into how storage formats affect performance on the GPU: https://onlinelibrary.wiley.com/doi/full/10.1111/cgf.13957
For something generic try Eigen or cuSparse on GPU. There are plenty of others that perform better for specific use cases, but this part of the question isn't clearly answerable.
Beyond the matrix format itself, even the ordering of entries in your matrix can have a massive impact on performance, which is why the Cuthill-McKee algorithm is often used to reduce matrix bandwidth (and thereby improve cache performance).
I have searched for an answer to this question but have not found anything that can directly help me.
I am working on a 3D numerical integrator for a non-linear PDE using the parallel FFT library included in MKL.
My arrays consist of 2^30 data points which is much much larger than the cache. This results in ~50% of cache references being misses, which appears to add a massive amount of overhead accessing memory.
Is there a clever way I can deal with this? Is it expected to have 50% cache misses using an array this large?
Any help would be much appreciated.
Thanks,
Dylan
2^30 data points in a single FFT counts as being quite big!
The data, plus the exponentials and the output array, is several thousand times bigger than the L3 cache and millions of times bigger than L1.
Given that disparity one might argue that a 50% cache miss rate is actually quite good, especially for an algorithm like an FFT which accesses memory in non-sequential ways.
I don't think that there will be much you can do about it. The MKL is quite good, and I'm sure that they've taken advantage of whatever cache hinting instructions there are.
You might try contacting Mercury Systems Inc. (www.mrcy.com) and ask them about their Scientific Algorithms Library (SAL). They have a habit of writing their own math libraries, and in my experience they are pretty good at it. Their FFT on PowerPC was 30% quicker than the next best one; quite an achievement. You can try an un-optimised version of SAL for free (http://sourceforge.net/projects/opensal/). The real optimised for Intel SAL is definitely not free though.
Also bear in mind that no matter how clever the algorithm is, with a data set that size you're always going to be fundamentally stuck with main memory bandwidths, not cache bandwidths.
GPUs might be worth a look, but you'd need one with a lot of memory to hold 2^30 data points (32-bit complex values = 4 GB, the same again for the output array, plus exponentials, etc).
I think the problem of excessive misses is due to a failure of the cache prefetch mechanism, but not knowing the details of the memory accesses I can't tell you exactly why.
It does not matter that your arrays are very large; 50% misses is excessive. The processor should avoid misses by detecting that you are iterating over an array and loading ahead of time the data elements you are likely to use.
Either the pattern of array accesses is not regular, so the prefetcher in the processor does not figure out a pattern to prefetch, or you have a cache associativity problem, that is, the elements in your iteration get mapped to the same cache slots.
For example, assume a cache size of 1 MB and a set associativity of 4. In this simplified example, the cache maps memory to an internal slot using the lower 20 bits of the address. If you stride by 1 MB, that is, your iterations are exactly 1 MB apart, then the lower 20 bits are always the same and every access goes to the same cache slot: each new element shares the slot with the previous one. When you get to the fifth element, all four ways are used up and from then on you get only misses; in such a case your cache size is effectively a single slot. If you stride by half the cache size, the effective number of slots is 2, which might be enough to have no misses at all, or 100% misses, or anything in between, depending on whether your access pattern requires both slots simultaneously or not.
To convince yourself of this, make a toy program with varying stride sizes: you'll see that strides that divide, or are multiples of, the cache size increase the misses. You can measure them with valgrind --tool=cachegrind.
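A toy of that kind could look like the sketch below (array size and stride are placeholders; run it under cachegrind or perf with different stride arguments and compare the miss counts):

```cpp
// Toy stride experiment: sum every element of a large array, visiting it with a given
// stride. Run under "valgrind --tool=cachegrind" (or "perf stat -e cache-misses") with
// different stride arguments; power-of-two strides near the cache size hurt the most.
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char** argv) {
    const std::size_t N      = std::size_t(1) << 26;  // 64 Mi ints (~256 MB), larger than any cache
    const std::size_t STRIDE = argc > 1 ? std::strtoul(argv[1], nullptr, 10) : 1;
    std::vector<int> data(N, 1);

    long long sum = 0;
    for (std::size_t start = 0; start < STRIDE; ++start)   // every element is visited once,
        for (std::size_t i = start; i < N; i += STRIDE)    // but in strided order
            sum += data[i];

    std::printf("stride=%zu sum=%lld\n", STRIDE, sum);
}
```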
You should first make sure you know what is causing the cache misses; they may be the fault of other code you've written rather than the FFT library. In fact, I expect that is very likely the case.
The rest of this post assumes that the FFT is really at fault and we need to optimize.
The standard trick to get data locality out of an FFT is to
Arrange the data in a two-dimensional array
Do an FFT along each row
Apply twiddle factors
Do a matrix transpose
Do an FFT along each row
This is the Cooley-Tukey algorithm, in the case where we factor 2^(m+n) = 2^m * 2^n.
The point of this is that the recursive calls to the FFT are much much smaller, and may very well fit in cache. And if not, you can apply this method recursively until things do fit in cache. And if you're ambitious, you do a lot of benchmarking to figure out the optimal way to do the splitting.
Thus, assuming you also use a good matrix transpose algorithm, the end result is a relatively cache-friendly FFT.
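As a structural sketch of the above (not actual MKL calls: fft_1d_batch and transpose are assumed placeholder routines for a batched 1-D FFT and a matrix transpose, declared but not implemented here):

```cpp
// Structural sketch of the four-step decomposition of a length n = n1*n2 FFT.
// fft_1d_batch(rows, length) and transpose(rows, cols) are placeholders (declared only)
// for the batched 1-D FFT and transpose routines of whatever library you use.
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

void fft_1d_batch(std::vector<cplx>& m, std::size_t rows, std::size_t length); // placeholder
void transpose(std::vector<cplx>& m, std::size_t rows, std::size_t cols);      // placeholder

void fft_four_step(std::vector<cplx>& x, std::size_t n1, std::size_t n2) {
    const std::size_t n = n1 * n2;
    std::vector<cplx> m(n);
    for (std::size_t r = 0; r < n1; ++r)        // 1. arrange as an n1 x n2 matrix: row r holds
        for (std::size_t c = 0; c < n2; ++c)    //    samples x[r], x[r+n1], x[r+2*n1], ...
            m[r * n2 + c] = x[r + n1 * c];

    fft_1d_batch(m, n1, n2);                    // 2. length-n2 FFT along each row

    const double pi = 3.141592653589793;
    for (std::size_t r = 0; r < n1; ++r)        // 3. twiddle factors exp(-2*pi*i*r*c/n)
        for (std::size_t c = 0; c < n2; ++c)
            m[r * n2 + c] *= std::polar(1.0, -2.0 * pi * double(r) * double(c) / double(n));

    transpose(m, n1, n2);                       // 4. now an n2 x n1 matrix

    fft_1d_batch(m, n2, n1);                    // 5. length-n1 FFT along each row

    for (std::size_t k2 = 0; k2 < n2; ++k2)     // result sits transposed: entry (k2, k1)
        for (std::size_t k1 = 0; k1 < n1; ++k1) // is X[k1*n2 + k2]; copy back in natural order
            x[k1 * n2 + k2] = m[k2 * n1 + k1];
}
```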
The library you're using really should be doing this already. If it's not, then some options are:
Maybe it exposes enough lower-level functionality that you can tell it to use Cooley-Tukey in an efficient way, even though the high-level routines don't.
You could implement Cooley-Tukey yourself, using the given library to do the smaller FFTs.
Suppose you have a huge (40+ GB) feature-value (floating-point) matrix; the rows are different features and the columns are the samples/images.
The table is precomputed column-wise.
Then it is accessed entirely row-wise, several times, from multiple threads (each thread loads a whole row).
What would be the best way to handle this matrix? I'm especially pondering over 5 points:
1. Since it's run on an x64 PC I could memory map the whole matrix at once, but would that make sense?
2. What about the effects of multithreading (multithreaded initial computation as well?)?
3. How to lay out the matrix: row or column major?
4. Would it help to mark the matrix as read-only after the precomputation has been finished?
5. Could something like http://www.kernel.org/doc/man-pages/online/pages/man2/madvise.2.html be used to speed it up?
Memory mapping the whole file could make the process much easier.
You want to lay out your data to optimize for the most common access pattern. It sounds like the data is going to be written once (column-wise) and read several times (row-wise). That suggests the data should be stored in row-major order.
Marking the matrix read-only once the pre-computation is done probably won't help performance (there are some possible low-level optimizations, but I don't think anything implements them), but it will prevent bugs from accidentally writing to data you don't intend to. Might as well.
madvise could end up being useful, once you've got your application written and working.
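For reference, a minimal mmap-plus-madvise sketch for Linux; the file name, the element type and the choice of MADV_SEQUENTIAL are assumptions, and error handling is mostly trimmed:

```cpp
// Map the precomputed matrix file read-only and tell the kernel the access pattern.
// Linux-specific; file name and element type are placeholders, error handling is minimal.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char* path = "features.bin";          // hypothetical precomputed matrix file
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    madvise(base, st.st_size, MADV_SEQUENTIAL); // rows will be streamed; ask for aggressive read-ahead

    const float* matrix = static_cast<const float*>(base); // row-major feature values
    (void)matrix;                               // ... hand rows out to worker threads here ...

    munmap(base, st.st_size);
    close(fd);
    return 0;
}
```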
My overall advice: write the program in the simplest way you can, sequentially at first, and then put timers around the whole thing and the various major operations. Make sure the major operation times sum to the overall time, so you can be sure you're not missing anything. Then target your performance improvement efforts toward the components that are actually taking the most time.
Per JimR's mention of 4MB pages in his comment, you may end up wanting to look into hugetlbfs or using a Linux Kernel release with transparent huge page support (merged for 2.6.38, could probably be patched into earlier versions). This would likely save you a whole lot of TLB misses, and convince the kernel to do the disk IO in sufficiently large chunks to amortize any seek overhead.
1. Maybe, see below.
2. The size of the total working set of all threads must not exceed available RAM, otherwise the program will run at snail speed because of swapping.
3. Layout should match access patterns, as long as condition 2 is respected.
4. What do you mean by "mark as read only"?
5. Measure it.
Re 3: If you have, e.g., 8 CPUs but do not have enough RAM to load 8 rows, you should make each thread process its row sequentially in manageable chunks. In this case, a block layout of the matrix would make sense. If a thread MUST have the whole row in memory to process it, I'm afraid you can't use all the CPUs, as the process will start thrashing, i.e., kicking some subset of the matrix out of RAM and reloading another needed subset. This is slightly less bad than full swapping, as the matrix is never modified, so the contents of the pages do not need to be written to the swap file before being evicted. But it still hurts performance badly.
Also, doing random access I/O from multiple threads is a bad idea, which is what you'll end up doing if you use mmap(). You have (presumably) only a single disk, and parallel I/O will just make it slower. So mmap() might not make sense and you could achieve better I/O performance by reading data sequentially into ram.
Note that 40GB is approximately 10.5 million pages of 4096 bytes. By doing mmap(), you will, in the worst case, slow down computation by that many hard disk seeks. At 8ms per seek (taken from wikipedia), you'll end up wasting 83666 seconds, i.e., almost a whole day!
If you could fit the whole thing into main memory, then yes: memory map it all, and it doesn't matter whether it's column major or row major. However, at 40+ GB, I'm sure it's too big for main memory. In which case:
No, don't map the whole thing! At least, don't expect the memory to work like normal memory if you map it all. Your program will take forever if you don't properly deal with the i/o issues.
The multi-threaded access issue is solved if you store it row-major (it sounds like you don't have multi-threaded column writes).
You should lay it out row-wise, assuming each cell is written once and then read many times.
Yes, I think it would help to mark the matrix as read-only after it's been written, but purely as a way to prevent bugs (accidental writes). It won't affect performance.
No, no amount of clever kernel read-ahead is going to solve your performance problems. You need to solve it at the algorithm level.
I think you are going to have a performance problem with a naive implementation. Either the computer will thrash while writing (if you store it row major) or it will thrash while querying (if you store it column major). The latter is presumably worse, but it's a problem both ways.
The right solution is to use an intermediate representation which is neither row-major nor column-major but 'large squares'. Take the first 50,000 columns and store them in a memory-mapped file (phase 1). It doesn't matter if it's column major or row major since it'll be purely memory resident. Then, take each row and write it into the final row-major memory-mapped file (phase 2). Then repeat the cycle for the next 50,000 columns, and so on.
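A rough sketch of that two-phase rewrite; the dimensions, block width, file name and the compute_columns helper are all placeholders, and it assumes a 64-bit long (Linux x64) so plain fseek can address a 40 GB file:

```cpp
// Two-phase "large squares" rewrite: buffer a block of columns in RAM (phase 1), then
// scatter each row's slice of that block into the final row-major file (phase 2).
// Dimensions, block width, file name and compute_columns() are placeholders.
#include <algorithm>
#include <cstdio>
#include <vector>

// Placeholder: computes columns [first_col, first_col+width) into a row-major buffer
// of `rows` rows and `width` columns.
void compute_columns(float* block, std::size_t first_col, std::size_t width, std::size_t rows);

int main() {
    const std::size_t rows = 100000, cols = 100000, block = 50000;   // ~40 GB of floats
    std::vector<float> buf(rows * block);                            // one block must fit in RAM

    std::FILE* out = std::fopen("matrix_rowmajor.bin", "wb");
    std::fseek(out, static_cast<long>(rows * cols * sizeof(float)) - 1, SEEK_SET);
    std::fputc(0, out);                                              // pre-size so we can seek-and-write

    for (std::size_t c0 = 0; c0 < cols; c0 += block) {
        const std::size_t width = std::min(block, cols - c0);
        compute_columns(buf.data(), c0, width, rows);                // phase 1: purely in-memory

        for (std::size_t r = 0; r < rows; ++r) {                     // phase 2: one 200 KB write per row
            std::fseek(out, static_cast<long>((r * cols + c0) * sizeof(float)), SEEK_SET);
            std::fwrite(&buf[r * width], sizeof(float), width, out);
        }
    }
    std::fclose(out);
    return 0;
}
```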
How should radix sort be implemented on multiple GPUs – the same way as on a single GPU, i.e. by splitting the data, building histograms on the separate GPUs, and then merging the data back (like a bunch of cards)?
That method would work, but I don't think it would be the fastest approach. Specifically, merging histograms for every K bits (K=4 is currently best) would require the keys to be exchanged between GPUs 32/K = 8 times to sort 32-bit integers. Since the memory bandwidth between GPUs (~5GB/s) is much lower than the memory bandwidth on a GPU (~150GB/s) this will kill performance.
A better strategy would be to split the data into multiple parts, sort each part in parallel on a different GPU, and then merge the parts once at the end. This approach requires only one inter-GPU transfer (vs. 8 above) so it will be considerably faster.
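As a rough illustration of that strategy, here is a CPU analogue using host threads (on real hardware each part would be sorted on its own GPU, e.g. with a library sort, and only the final merge would cross the slow inter-GPU link; it assumes the key count divides evenly into a power-of-two number of parts):

```cpp
// CPU analogue of the split / sort-in-parallel / merge-once strategy. Each "part" here
// is sorted by a host thread; on real hardware it would be sorted on its own GPU.
// Assumes keys.size() is a multiple of `parts` and `parts` is a power of two.
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

void split_sort_merge(std::vector<std::uint32_t>& keys, unsigned parts) {
    const std::size_t n = keys.size();
    const std::size_t w = n / parts;

    std::vector<std::thread> workers;
    for (unsigned p = 0; p < parts; ++p) {
        auto lo = keys.begin() + p * w;
        workers.emplace_back([lo, w] { std::sort(lo, lo + w); });   // one "GPU" per part
    }
    for (auto& t : workers) t.join();

    // Single merge phase at the end (pairwise bottom-up merges of the sorted parts).
    for (std::size_t width = w; width < n; width *= 2)
        for (std::size_t lo = 0; lo + width < n; lo += 2 * width)
            std::inplace_merge(keys.begin() + lo,
                               keys.begin() + lo + width,
                               keys.begin() + std::min(lo + 2 * width, n));
}
```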
Unfortunately this question is not adequately posed. It depends on element size, where the elements begin life in memory, and where you want the sorted elements to end up residing.
Sometimes it's possible to compress the sorted list by storing elements in groups sharing the same common prefix, or you can deduplicate elements on the fly, storing each element once in the sorted list with an associated count. For example, you might sort a huge list of 32-bit integers into 64K distinct lists of 16-bit values, cutting your memory requirement in half.
The general principle is that you want to make the fewest number of passes over the data as possible and that your throughput will almost always correspond to bandwidth constraints associated with your storage policy.
If your data set exceeds the size of fast memory, you probably want to finish with a merge pass rather than continue to radix sort, as another person has already answered.
I'm just getting into GPU architecture and I don't understand the K=4 comment above. I've never seen an architecture yet where such a small K would prove optimal.
I suspect merging histograms is also the wrong approach. I'd probably let the elements fragment in memory rather than merge histograms. Is it that hard to manage meso-scale scatter/gather lists in the GPU fabric? I sure hope not.
Finally, it's hard to conceive of a reason why you would want to involve multiple GPUs for this task. Say your card has 2GB of memory and 60GB/s write bandwidth (that's what my mid-range card is showing). A three pass radix sort (11-bit histograms) requires 6GB of write bandwidth (likely your rate limiting factor), or about 100ms to sort a 2GB list of 32-bit integers. Great, they're sorted, now what? If you need to ship them anywhere else without some kind of preprocessing or compression, the sorting time will be small fish.
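For concreteness, a scalar sketch of such a three-pass radix sort with 11-bit digits (a GPU version builds the same per-digit histograms, just in parallel):

```cpp
// Scalar sketch of a three-pass LSD radix sort with 11-bit digits: 32-bit keys are
// consumed as 11 + 11 + 10 bits, and every pass is one histogram plus one scatter
// (the scatter is the bandwidth-heavy write referred to above).
#include <cstddef>
#include <cstdint>
#include <vector>

void radix_sort_11(std::vector<std::uint32_t>& keys) {
    const int RADIX_BITS = 11, BUCKETS = 1 << RADIX_BITS;
    std::vector<std::uint32_t> tmp(keys.size());
    for (int shift = 0; shift < 32; shift += RADIX_BITS) {          // 3 passes: shifts 0, 11, 22
        std::vector<std::size_t> count(BUCKETS + 1, 0);
        for (std::uint32_t k : keys)                                // histogram of the current digit
            ++count[((k >> shift) & (BUCKETS - 1)) + 1];
        for (int b = 0; b < BUCKETS; ++b)                           // prefix sum -> bucket start offsets
            count[b + 1] += count[b];
        for (std::uint32_t k : keys)                                // stable scatter into tmp
            tmp[count[(k >> shift) & (BUCKETS - 1)]++] = k;
        keys.swap(tmp);
    }
}
```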
In any case, just compiled my first example programs today. There's still a lot to learn. My target application is permutation intensive, which is closely related to sorting. I'm sure I'll weigh in on this subject again in future.