how to improve multi-dimensional bit array comparison performance in C or C++

I have the following three-dimensional bit array (for a Bloom filter):
unsigned char P_bit_table_[P_ROWS][ROWS][COLUMNS];
The P_ROWS dimension represents independent two-dimensional bit arrays (i.e., P_ROWS[0], P_ROWS[1], P_ROWS[2] are independent bit arrays), could be as large as 100 MB, and contains data that is populated independently. The data I am looking for could be in any of these planes, and right now I search them one at a time: P_ROWS[0], then P_ROWS[1], and so on until I get a positive or reach the end (P_ROWS[n-1]). This means that if n is 100 I have to do this search (bit comparison) 100 times, and the search is done very often. Somebody suggested that I could improve the search performance if I did bit grouping (use a column-major order on the row-major array), but I don't know how to do that.
I really need to improve the performance of the search because the program does a lot of it.
I will be happy to give more details of my bit table implementation if required.
Sorry for the poor language.
Thanks for your help.
EDIT:
The bit grouping could be done in the following format:
Assume the array to be:
unsigned char P_bit_table_[P_ROWS][ROWS][COLUMNS] = {{{a1,a2,a3},{b1,b2,b3},{c1,c2,c3}},
                                                     {{a1,a2,a3},{b1,b2,b3},{c1,c2,c3}},
                                                     {{a1,a2,a3},{b1,b2,b3},{c1,c2,c3}}};
As you can see, all the rows (on the third dimension) have similar data. What I want after the grouping is for all the a1's to be in one group (as a single entity, so that I can compare them against another bit to check whether they are on or off), all the b1's in another group, and so on.
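For illustration only (this is not the asker's code, and it assumes the number of planes fits in one 64-bit word), one way to realize that grouping is a bit-sliced layout: for each (row, column) bit position, keep a single word holding that bit from every plane, so one load answers the question for all planes at once.

#include <cstdint>
#include <cstddef>

constexpr std::size_t P_ROWS  = 64;    // number of independent planes; assumed <= 64 here
constexpr std::size_t ROWS    = 1024;  // illustrative dimensions
constexpr std::size_t COLUMNS = 1024;
static_assert(P_ROWS <= 64, "this sketch packs one plane per bit of a 64-bit word");

// Grouped ("bit-sliced") table: bit p of grouped[row][col] is the bit that
// plane p stores at position (row, col).
static std::uint64_t grouped[ROWS][COLUMNS];

// Set the bit at (row, col) in plane p.
inline void set_bit(std::size_t p, std::size_t row, std::size_t col) {
    grouped[row][col] |= std::uint64_t{1} << p;
}

// One load tests the bit at (row, col) in all planes at once; bit p of the
// returned mask is set iff plane p has that bit set.
inline std::uint64_t test_all_planes(std::size_t row, std::size_t col) {
    return grouped[row][col];
}

A Bloom-filter probe then loads one mask per hashed position and ANDs the masks together; any bit still set in the result identifies a plane that passes the test, so the loop over n planes disappears. With more than 64 planes you would use an array of words per position instead.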

Re-use Other People's Algorithms
There are a ton of bit-manipulation optimizations out there, including many non-obvious ones like Hamming weight (population count) and specialized algorithms for finding the next set or clear bit, that are largely independent of how you structure your data.
Reusing algorithms that other people have written can really speed up computation and lookups, not to mention development time. Some algorithms are so specialized, and use such computational magic, that they will have you scratching your head: in that case, you can take the author's word for it (after you confirm their correctness with unit tests).
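For instance, with standard C++20 facilities (nothing specific to the question's code), Hamming weight and next-set-bit lookups are one-liners that usually compile down to single instructions:

#include <bit>        // std::popcount, std::countr_zero (C++20)
#include <cstdint>

// Hamming weight of one 64-bit chunk of a bit table;
// typically compiles to a single POPCNT instruction on x86-64.
inline int set_bits(std::uint64_t word) {
    return std::popcount(word);
}

// Index of the lowest set bit (the "next true bit" within a word), or -1 if none.
inline int first_set_bit(std::uint64_t word) {
    return word ? std::countr_zero(word) : -1;
}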
Take Advantage of CPU Caching and Multithreading
I personally reduce my multidimensional bit arrays to one dimension, optimized for expected traversal.
This way, there is a greater chance of hitting the CPU cache.
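As a sketch of what that can look like (names and dimensions are illustrative, not taken from the question):

#include <cstddef>
#include <vector>

// One flat allocation instead of a nested array; the index function decides
// which dimension varies fastest, i.e. which traversal order walks contiguous memory.
struct BitTable3D {
    std::size_t planes, rows, cols;
    std::vector<unsigned char> data;

    BitTable3D(std::size_t p, std::size_t r, std::size_t c)
        : planes(p), rows(r), cols(c), data(p * r * c) {}

    // Plane-major indexing: scanning within a single plane touches adjacent bytes.
    unsigned char& at(std::size_t p, std::size_t r, std::size_t c) {
        return data[(p * rows + r) * cols + c];
    }
};

If the hot loop instead walks across planes at a fixed (row, column), swap the index computation so the plane index varies fastest; the point is simply that the innermost loop should move through adjacent memory.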
In your case, I would also think deeply about the mutability of the data and whether you want to put locks on blocks of bits. With 100 MB of data, you have the potential of running your algorithms in parallel using many threads, if you can structure your data and algorithms to avoid contention.
You may even have a lockless model if you divide up ownership of the blocks of data by thread so no two threads can read or write to the same block. It all depends on your requirements.
Now is a good time to think about these issues. But since no one knows your data and usage better than you do, you must consider design options in the context of your data and usage patterns.

Related

Colstore vs Rowstore for in-memory algorithms

I'm familiar with using a column-store vs a row-store for how a database internally persists data to disk. My question is whether, for a dataset that is entirely in memory with no storage to disk, the row- vs column-orientation makes much of a difference.
The things I can think of that may make a difference would be:
For fields under 8 bytes, it would involve fewer memory accesses for columns than for rows.
Compression would also be easier on a column-store regardless of whether in memory or not (seems like a non-issue if not saving back to storage I suppose? does compression ever matter on in-memory operations?)
Possible to vectorize operations.
Much, much easier to work with a struct on a row-by-row basis of course.
Are these accurate, and are there any more? Given this, would there be substantial performance improvements from using an in-memory colstore vs rowstore on a read-only dataset, or just a marginal improvement?
I'm familiar with using a column-store vs a row-store for how a database internally persists data to disk. My question is whether, for a dataset that is entirely in memory with no storage to disk, the row- vs column-orientation makes much of a difference.
A lot depends on the size of the dataset, what the contents of each row are, how you need to search in it, whether you want to add items to or remove items from the dataset, and so on.
There is also the CPU and memory architecture to consider; how big are your caches, what is the size of a cache line, and how intelligent is your CPU's prefetcher.
For fields under 8 bytes, it would involve fewer memory accesses for columns than for rows.
Memory is not accessed a register at a time, but rather a cache line at a time. On most contemporary machines, cache lines are 64 bytes.
Compression would also be easier on a column-store regardless of whether in memory or not
Not really. You can compress/decompress a column even if it is not stored in memory consecutively. It might be faster though.
does compression ever matter on in-memory operations?
That depends. If it's in-memory, then it's likely that compression will reduce performance, but on the other hand, the amount of data that you need to store is smaller, so you will be able to fit more into memory.
Possible to vectorize operations.
Vectorization is possible either way; it's only the loads and stores from memory that might be slower if the data is grouped by rows.
Much, much easier to work with a struct on a row-by-row basis of course.
It's easy to use a pointer to a struct with a row-by-row store, but with C++ you can make classes that hide the fact that data is stored column-by-column. That's a bit more work up front, but might make it as easy as row-by-row once you have set that up.
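A minimal sketch of that idea (field names are made up): the data lives column-by-column, and a small proxy presents it row-by-row.

#include <cstddef>
#include <string>
#include <vector>

// Column-by-column (struct-of-arrays) storage with a lightweight row "view",
// so calling code can still think in terms of rows.
struct PeopleColumns {
    std::vector<std::string> name;
    std::vector<int>         age;
    std::vector<double>      score;

    struct RowRef {            // proxy referring to one logical row
        std::string& name;
        int&         age;
        double&      score;
    };

    RowRef row(std::size_t i) { return RowRef{name[i], age[i], score[i]}; }

    std::size_t add(std::string n, int a, double s) {
        name.push_back(std::move(n));
        age.push_back(a);
        score.push_back(s);
        return name.size() - 1;
    }
};

A column-wise scan (say, averaging score) then touches only one contiguous vector, while code that needs a whole record can still write table.row(i).age.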
Also, column-by-column store is often used in the entity-component-system pattern, and there are libraries such as EnTT that make it quite easy to work with.
Are both of those accurate, and are there any more? Given this, would there be substantial performance improvements on using an in-memory colstore vs rowstore on a read-only dataset, or just a marginal improvement?
Again, it heavily depends on the size of the dataset and how you want to access it. If you frequently use all columns in a row, then row-by-row store is preferred. If you frequently just use one column, and need to access that column of many consecutive rows, then a column-by-column store is best.
Also, there are hybrid solutions possible. You could have one column on its own, and then all the other columns stored in row-by-row fashion.
How you will search in a read-only dataset matters a lot. Is it going to be sorted, or is it more like a hash map? In the former case, you want the index to be as compact as possible, and possibly ordered like a B-tree as Alex Guteniev already mentioned. If it's going to be like a hash map, then you probably want row-by-row.
For in-memory arrays, this is called AoS vs SoA (array of structs vs struct of arrays).
I think the main advantage in SoA for a read-only database is that searches would need to access smaller memory range. This is more cache friendly, less prone to page faults.
The amount of improvement depends on how you use the database. There may be some more significant improvement from using a more targeted structure (sorted array, B-tree).
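For concreteness, a tiny sketch of the two layouts (field names invented): a search over key in the SoA form reads only the key array, which is the smaller memory range mentioned above.

#include <cstdint>
#include <vector>

// AoS: each record's fields sit next to each other in memory.
struct RecordAoS { std::uint64_t key; double a, b, c; };
using TableAoS = std::vector<RecordAoS>;   // scanning `key` drags 32 bytes per record through the cache

// SoA: each field gets its own contiguous array.
struct TableSoA {
    std::vector<std::uint64_t> key;        // scanning `key` touches only 8 bytes per record
    std::vector<double> a, b, c;
};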

Being cache efficient with data, mainly arrays

I have recently started to look into being cache efficient by trying to avoid cache misses in C++. So far I have taken away the following:
Try to avoid linked-list objects where possible when processing. Instead, use them to point to contiguous data that you can store in cache and perform operations on.
Be careful of holding state in classes as it makes the above potentially more difficult.
Use structs when allocating on the heap, as this helps in localising data.
Try and use 1D arrays when possible for lists of data.
So my question is broken into two parts:
Is the above correct? Have I made any fundamental misunderstandings?
When dealing with 2D arrays I have seen other users recommend the use of Hilbert curves. I do not understand how this provides a speed increase over using division and modulus operators on an index to simulate a 2D array, since that is surely fewer instructions, which is good for speed and instruction-cache usage.
Thanks for reading.
P.S. I do not have a CompSci background therefore, if you notice anything that I have said that is incorrect I would appreciate it if you could alert me so that I can read around that topic.
Your approach is flawed for at least one reason: you are willing to sacrifice everything to avoid cache misses. How do you know if that (cache misses) is the major performance factor in your code?
For example, there are MANY cases where a linked list is better than a contiguous array, specifically where you frequently insert or delete items; you would pay greatly for compacting or expanding an array.
So the answer to your first question is: yes, you will improve the data locality using those four principles. But probably at a cost greater than the savings.
For the second question, I suggest you read about Hilbert curves. You don't need them if you are processing your 2D array in order, row-by-row. They will help a lot (with data locality) if you process some area of your 2D array, because the distance between elements in the same column/different rows is much smaller that way.
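To make the row-by-row case concrete, here is a minimal sketch (names are illustrative) of simulating a 2D array with a single 1D allocation; traversed with the inner loop over x it already walks memory sequentially, and the division/modulus from the question only appears if you need to invert the mapping.

#include <cstddef>
#include <vector>

struct Grid {
    std::size_t width, height;
    std::vector<float> cells;

    Grid(std::size_t w, std::size_t h) : width(w), height(h), cells(w * h) {}

    // Row-major mapping: neighbouring x values are adjacent in memory.
    float& at(std::size_t x, std::size_t y) { return cells[y * width + x]; }

    // Inverse mapping, where the division/modulus shows up:
    //   x = index % width, y = index / width.
};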

3D FFT with data larger than cache

I have searched for an answer to this question but have not found anything that can directly help me.
I am working on a 3D numerical integrator for a non-linear PDE using the parallel FFT library included in MKL.
My arrays consist of 2^30 data points which is much much larger than the cache. This results in ~50% of cache references being misses, which appears to add a massive amount of overhead accessing memory.
Is there a clever way I can deal with this? Is it expected to have 50% cache misses using an array this large?
Any help would be much appreciated.
Thanks,
Dylan
2^30 data points in a single FFT counts as being quite big!
The data plus the exponentials and the output array is several thousand times bigger than the L3 cache, and millions of times bigger than L1.
Given that disparity one might argue that a 50% cache miss rate is actually quite good, especially for an algorithm like an FFT which accesses memory in non-sequential ways.
I don't think that there will be much you can do about it. The MKL is quite good, and I'm sure that they've taken advantage of whatever cache hinting instructions there are.
You might try contacting Mercury Systems Inc. (www.mrcy.com) and ask them about their Scientific Algorithms Library (SAL). They have a habit of writing their own math libraries, and in my experience they are pretty good at it. Their FFT on PowerPC was 30% quicker than the next best one; quite an achievement. You can try an un-optimised version of SAL for free (http://sourceforge.net/projects/opensal/). The real optimised for Intel SAL is definitely not free though.
Also bear in mind that no matter how clever the algorithm is, with a data set that size you're always going to be fundamentally stuck with main memory bandwidths, not cache bandwidths.
GPUs might be worth a look, but you'd need one with a lot of memory to hold 2^30 data points (32 bit complex values = 2gbytes, same again for the output array, plus exponentials, etc).
I think the problem of excessive misses is due to a failure of the cache prefetch mechanism, but not knowing the details of the memory accesses I can't tell you exactly why.
It does not matter that your arrays are very large; 50% misses are excessive. The processor should avoid misses by detecting that you are iterating over an array and loading ahead of time the data elements you are likely to use.
Either the pattern of array accesses is not regular, and thus the prefetcher in the processor does not figure out a pattern to prefetch, or you have a cache associativity problem, that is, elements in your iteration might be mapped to the same cache slots.
For example, assume a cache size of 1 MB and a set associativity of 4. In this simplified example, the cache maps memory to an internal slot using the lower 20 bits of the address. If you stride by 1 MB, that is, your iterations are exactly 1 MB apart, then the lower 20 bits are always the same, so every new element goes to the same cache slot as the old one. When you get to the fifth element, all four positions are used up, and from then on it is only misses; in that case your cache size is effectively a single slot. If you stride by half the cache size, the effective number of slots is 2, which might be enough to have no misses at all, or 100%, or anything in between, depending on whether your access pattern requires both slots simultaneously or not.
To convince yourself of this, make a toy program with varying stride sizes; you'll see that strides that divide, or are multiples of, the cache size increase misses. You can measure this with valgrind --tool=cachegrind.
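A minimal version of such a toy program might look like the following (buffer size and strides are arbitrary; absolute timings will depend on your cache hierarchy):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t bytes = 256u * 1024u * 1024u;   // much larger than any cache
    std::vector<unsigned char> buf(bytes, 1);

    for (std::size_t stride : {64u, 4096u, 262144u, 1048576u}) {
        auto t0 = std::chrono::steady_clock::now();
        unsigned long long sum = 0;
        std::size_t idx = 0;
        // Touch the same number of elements for every stride so times are comparable.
        for (std::size_t n = 0; n < (1u << 22); ++n) {
            sum += buf[idx];
            idx += stride;
            if (idx >= bytes) idx -= bytes;
        }
        auto t1 = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("stride %8zu bytes: %lld ms (checksum %llu)\n",
                    stride, static_cast<long long>(ms), sum);
    }
    return 0;
}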
You should first make sure you know what is causing the cache misses; they may be the fault of other code you've written rather than the FFT library. In fact, I expect that is very likely the case.
The rest of this post assumes that the FFT is really at fault and we need to optimize.
The standard trick to get data locality out of an FFT is to
Arrange the data in a two-dimensional array
Do an FFT along each row
Apply twiddle factors
Do a matrix transpose
Do an FFT along each row
This is the Cooley-Tukey algorithm, in the case where we factor 2^(m+n) = 2^m * 2^n.
The point of this is that the recursive calls to the FFT are much much smaller, and may very well fit in cache. And if not, you can apply this method recursively until things do fit in cache. And if you're ambitious, you do a lot of benchmarking to figure out the optimal way to do the splitting.
Thus, assuming you also use a good matrix transpose algorithm, the end result is a relatively cache-friendly FFT.
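Below is a compact sketch of that factorization. It uses a naive O(n^2) DFT as a stand-in for the smaller (library) FFT calls and strided column access in place of the explicit transpose, purely to show where the twiddle factors and the index reshuffling go; it is not an optimized implementation.

#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Naive O(n^2) DFT, standing in for the "small" FFTs that a real
// implementation would delegate to the library (e.g. MKL).
static void dft(std::vector<cd>& v) {
    const double pi = std::acos(-1.0);
    const std::size_t n = v.size();
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cd acc{0.0, 0.0};
        for (std::size_t j = 0; j < n; ++j) {
            double ang = -2.0 * pi * double(j * k) / double(n);
            acc += v[j] * cd{std::cos(ang), std::sin(ang)};
        }
        out[k] = acc;
    }
    v = out;
}

// Cooley-Tukey "four-step" FFT of length N = rows * cols, input in natural order.
std::vector<cd> four_step_fft(const std::vector<cd>& x,
                              std::size_t rows, std::size_t cols) {
    const double pi = std::acos(-1.0);
    const std::size_t n = rows * cols;

    // Step 1: view x as a rows x cols matrix (x[r*cols + c]) and take a
    // length-`rows` DFT down each column (strided access instead of a transpose).
    std::vector<cd> a(n);
    std::vector<cd> col(rows);
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) col[r] = x[r * cols + c];
        dft(col);
        for (std::size_t q = 0; q < rows; ++q) a[q * cols + c] = col[q];
    }

    // Step 2: multiply by the twiddle factors W_N^(q*c).
    for (std::size_t q = 0; q < rows; ++q)
        for (std::size_t c = 0; c < cols; ++c) {
            double ang = -2.0 * pi * double(q * c) / double(n);
            a[q * cols + c] *= cd{std::cos(ang), std::sin(ang)};
        }

    // Steps 3-4: a length-`cols` DFT along each row, scattering the results so
    // that output index k = q + rows * p ends up in natural order.
    std::vector<cd> out(n);
    std::vector<cd> row(cols);
    for (std::size_t q = 0; q < rows; ++q) {
        for (std::size_t c = 0; c < cols; ++c) row[c] = a[q * cols + c];
        dft(row);
        for (std::size_t p = 0; p < cols; ++p) out[q + rows * p] = row[p];
    }
    return out;
}

Applying the same split recursively to the row and column transforms is what eventually makes every sub-transform fit in cache.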
The library you're using really should be doing this already. If it's not, then some options are:
Maybe it exposes enough lower-level functionality that you can tell it to use Cooley-Tukey in an efficient way even though the high-level routines aren't doing so.
You could implement Cooley-Tukey yourself, using the given library to do the smaller FFTs.

how to memory map a huge matrix?

Suppose you have a huge (40+ GB) feature-value (floating-point) matrix; rows are different features and columns are the samples/images.
The table is precomputed column-wise.
Then it is completely accessed row-wise and multi-threaded (each thread loads a whole row) several times.
What would be the best way to handle this matrix? I'm especially pondering over 5 points:
Since it's run on an x64 PC I could memory map the whole matrix at once but would that make sense?
What about the effects of multithreading (multithreaded initial computation as well?)?
How to lay out the matrix: row- or column-major?
Would it help to mark the matrix as read-only after the precomputation has been finished?
Could something like http://www.kernel.org/doc/man-pages/online/pages/man2/madvise.2.html be used to speed it up?
Memory mapping the whole file could make the process much easier.
You want to lay out your data to optimize for the most common access pattern. It sounds like the data is going to be written once (column-wise) and read several times (row-wise). That suggests the data should be stored in row-major order.
Marking the matrix read-only once the pre-computation is done probably won't help performance (there are some possible low-level optimizations, but I don't think anything implements them), but it will prevent bugs from accidentally writing to data you don't intend to. Might as well.
madvise could end up being useful, once you've got your application written and working.
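For reference, a minimal POSIX sketch of mapping the precomputed file read-only and passing an access-pattern hint (the file name and element type are placeholders):

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const char* path = "features.bin";               // placeholder for the precomputed matrix file
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    // Map the whole file read-only; pages are faulted in on first access.
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    // Hint that we will read mostly sequentially (row-major layout, rows read
    // front to back), so the kernel can read ahead aggressively.
    madvise(base, st.st_size, MADV_SEQUENTIAL);

    const float* matrix = static_cast<const float*>(base);
    (void)matrix;  // ... hand out rows (row * n_cols + col) to worker threads here ...

    munmap(base, st.st_size);
    close(fd);
    return 0;
}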
My overall advice: write the program in the simplest way you can, sequentially at first, and then put timers around the whole thing and the various major operations. Make sure the major operation times sum to the overall time, so you can be sure you're not missing anything. Then target your performance improvement efforts toward the components that are actually taking the most time.
Per JimR's mention of 4MB pages in his comment, you may end up wanting to look into hugetlbfs or using a Linux Kernel release with transparent huge page support (merged for 2.6.38, could probably be patched into earlier versions). This would likely save you a whole lot of TLB misses, and convince the kernel to do the disk IO in sufficiently large chunks to amortize any seek overhead.
Maybe, see below.
The size of the total working set of all threads must not exceed available RAM, otherwise the program will run at snail speed because of swapping.
Layout should match access patterns, as long as condition 2 is respected.
What do you mean by "mark as read only"?
Measure it.
Re 3: If you have, e.g., 8 CPUs but do not have enough RAM to load 8 rows, you should make each thread process its row sequentially in manageable chunks. In this case, a block layout of the matrix would make sense. If a thread MUST have the whole row in memory to process it, I'm afraid you can't use all the CPUs, as the process will start thrashing, i.e., evicting some subset of the matrix from RAM and reloading another needed subset. This is slightly less bad than full swapping, as the matrix is never modified, so the contents of the pages do not need to be written to the swap file before being evicted. But it still hurts performance badly.
Also, doing random access I/O from multiple threads is a bad idea, which is what you'll end up doing if you use mmap(). You have (presumably) only a single disk, and parallel I/O will just make it slower. So mmap() might not make sense and you could achieve better I/O performance by reading data sequentially into ram.
Note that 40 GB is approximately 10.5 million pages of 4096 bytes. By doing mmap(), you will, in the worst case, slow down computation by that many hard disk seeks. At 8 ms per seek (taken from Wikipedia), you'll end up wasting about 84,000 seconds, i.e., almost a whole day!
If you could fit the whole thing into main memory, then yes: memory map it all, and it doesn't matter whether it's column-major or row-major. However, at 40+ GB, I'm sure it's too big for main memory. In which case:
No, don't map the whole thing! At least, don't expect the memory to work like normal memory if you map it all. Your program will take forever if you don't properly deal with the i/o issues.
The multi-threaded access issue is solved if you store it row-major (it sounds like you don't have multi-threaded column writes).
You should lay it out row-wise, assuming each cell is written once and then read many times.
Yes, I think it would help to mark the matrix as read-only after it's been written, but purely as a way to prevent bugs (accidental writes). It won't affect performance.
No, no amount of clever kernel read-ahead is going to solve your performance problems. You need to solve it at the algorithm level.
I think you are going to have a performance problem with a naive implementation. Either the computer will thrash while writing (if you store it row-major) or it will thrash while querying (if you store it column-major). The latter is presumably worse, but it's a problem both ways.
The right solution is to use an intermediate representation which is neither row-major nor column-major but 'large squares'. Take the first 50,000 columns and store them in a memory-mapped file (phase 1). It doesn't matter if it's column major or row major since it'll be purely memory resident. Then, take each row and write it into the final row-major memory-mapped file (phase 2). Then repeat the cycle for the next 50,000 columns, and so on.
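A rough sketch of that two-phase scheme (all sizes, the file name, and the placeholder computation are invented for illustration):

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>
#include <vector>

int main() {
    const std::size_t n_rows     = 10000;    // features        (illustrative)
    const std::size_t n_cols     = 1000000;  // samples/images  (illustrative)
    const std::size_t block_cols = 50000;    // columns kept in RAM per batch (~2 GB here)

    int fd = open("matrix_rowmajor.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, (off_t)(n_rows * n_cols * sizeof(float))) != 0) { perror("ftruncate"); return 1; }

    std::vector<float> block(n_rows * block_cols);   // phase-1 buffer, purely memory resident

    for (std::size_t col0 = 0; col0 < n_cols; col0 += block_cols) {
        const std::size_t cols_here = std::min(block_cols, n_cols - col0);

        // Phase 1: compute this batch of columns (placeholder computation).
        for (std::size_t c = 0; c < cols_here; ++c)
            for (std::size_t r = 0; r < n_rows; ++r)
                block[r * block_cols + c] = float(r + col0 + c);

        // Phase 2: write each row's slice at its final position in the row-major file.
        for (std::size_t r = 0; r < n_rows; ++r) {
            off_t off = (off_t)((r * n_cols + col0) * sizeof(float));
            if (pwrite(fd, &block[r * block_cols], cols_here * sizeof(float), off) < 0) {
                perror("pwrite"); return 1;
            }
        }
    }
    close(fd);
    return 0;
}

Phase 1 is column-order but purely in RAM; phase 2 writes each row's slice at its final offset, so the on-disk file ends up row-major without ever thrashing.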

How to implement Radix sort on multi-GPU?

How do you implement radix sort on multiple GPUs? The same way as on a single GPU, i.e. by splitting the data, building histograms on separate GPUs, and then merging the data back (like a bunch of cards)?
That method would work, but I don't think it would be the fastest approach. Specifically, merging histograms for every K bits (K=4 is currently best) would require the keys to be exchanged between GPUs 32/K = 8 times to sort 32-bit integers. Since the memory bandwidth between GPUs (~5GB/s) is much lower than the memory bandwidth on a GPU (~150GB/s) this will kill performance.
A better strategy would be to split the data into multiple parts, sort each part in parallel on a different GPU, and then merge the parts once at the end. This approach requires only one inter-GPU transfer (vs. 8 above) so it will be considerably faster.
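As a plain-CPU illustration of that strategy (std::sort and std::thread stand in for the per-GPU radix sorts; this is not GPU code), each part is sorted independently and only a single merge pass combines them at the end:

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <thread>
#include <vector>

std::vector<std::uint32_t> split_sort_merge(std::vector<std::uint32_t> keys,
                                            std::size_t n_parts) {
    // Split the keys into roughly equal parts (one per "device").
    std::size_t chunk = (keys.size() + n_parts - 1) / n_parts;
    std::vector<std::vector<std::uint32_t>> parts;
    for (std::size_t i = 0; i < keys.size(); i += chunk)
        parts.emplace_back(keys.begin() + i,
                           keys.begin() + std::min(i + chunk, keys.size()));

    // Sort every part in parallel (stand-in for the per-GPU radix sort).
    std::vector<std::thread> workers;
    for (auto& p : parts)
        workers.emplace_back([&p] { std::sort(p.begin(), p.end()); });
    for (auto& t : workers) t.join();

    // Single merge pass at the end; only this step would cross the slow inter-GPU link.
    std::vector<std::uint32_t> out;
    for (auto& p : parts) {
        std::vector<std::uint32_t> merged;
        merged.reserve(out.size() + p.size());
        std::merge(out.begin(), out.end(), p.begin(), p.end(),
                   std::back_inserter(merged));
        out.swap(merged);
    }
    return out;
}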
Unfortunately this question is not adequately posed. It depends on element size, where the elements begin life in memory, and where you want the sorted elements to end up residing.
Sometimes it's possible to compress the sorted list by storing elements in groups sharing the same common prefix, or you can unique elements on the fly, storing each element once in the sorted list with an associated count. For example, you might sort a huge list of 32-bit integers into 64K distinct lists of 16-bit values, cutting your memory requirement in half.
The general principle is that you want to make the fewest number of passes over the data as possible and that your throughput will almost always correspond to bandwidth constraints associated with your storage policy.
If your data set exceeds the size of fast memory, you probably want to finish with a merge pass rather than continue to radix sort, as another person has already answered.
I'm just getting into GPU architecture and I don't understand the K=4 comment above. I've never seen an architecture yet where such a small K would prove optimal.
I suspect merging histograms is also the wrong approach. I'd probably let the elements fragment in memory rather than merge histograms. Is it that hard to manage meso-scale scatter/gather lists in the GPU fabric? I sure hope not.
Finally, it's hard to conceive of a reason why you would want to involve multiple GPUs for this task. Say your card has 2 GB of memory and 60 GB/s write bandwidth (that's what my mid-range card is showing). A three-pass radix sort (11-bit histograms) requires 6 GB of write bandwidth (likely your rate-limiting factor), or about 100 ms to sort a 2 GB list of 32-bit integers. Great, they're sorted; now what? If you need to ship them anywhere else without some kind of preprocessing or compression, the sorting time will be a minor part of the total cost.
In any case, I just compiled my first example programs today. There's still a lot to learn. My target application is permutation intensive, which is closely related to sorting. I'm sure I'll weigh in on this subject again in future.