How do I use the cache efficiently when transposing an array?

How do I use the cache efficiently when transposing an array? - c++

If I have a 1D array that represents the contents of an MxN matrix (where the least significant dimension is contiguous in memory), how do I make the best use of caching when transposing it (to place the contents of the most significant dimension in contiguous memory). This question could be rephrased as follows;
If I have a choice between reading contiguous memory but writing to random access locations or reading from random access locations and writing to contiguous memory, all things being equal, which should I choose?

Only one generally correct approach: code, profile, measure, and compare.
For example: do you need to actually transpose the array? Or could it suffice to read it transposed (in which case an iterator will do the trick). Often times when I interact with my favorite enemy ( Fortran) I have to "read transposed" because the fool is column major.
Play with Eigen, which lets you specify the storage order.
But---again---test and see. It may very we'll be the case that you are pursuing a red herring, and the difference in performance won't make it worth your while to complicate the code.

I would chose read contiguous over write contiguous if I have to pick one. Reasons
In multi-processor systems when multiple processors are concurrently operating on this data structure, there will be a cache invalidation during writes while cache is much more useful during reads. So in a way cache friendly reads are more beneficial than writes since it can also be shared across processors (or in cases of NUMA)
Many disks buffers writes at disk controller level and combines writes to disk to maximize throughput so some optimizations there might automatically help in writes.
Of course since there are many assumptions here and depends on your specific use case and hardware so you might have to profile it yourself to see how valid these claims are.

Related

Colstore vs Rowstore for in-memory algorithms

I'm familiar with using a column- vs a row-store for how a databases internally persists data to disk. My question is whether, for a dataset is entirely in memory, and there's no storage to disk, if the row- vs column- orientation makes much of a difference?
The things I can think of that may make a difference would be:
For fields under 8 bytes, it would involve less memory accesses for columns than for rows.
Compression would also be easier on a column-store regardless of whether in memory or not (seems like a non-issue if not saving back to storage I suppose? does compression ever matter on in-memory operations?)
Possible to vectorize operations.
Much, much easier to work with a struct on a row-by-row basis of course.
Are both of those accurate, and are there any more? Given this, would there be substantial performance improvements on using an in-memory colstore vs rowstore on a read-only dataset, or just a marginal improvement?

I'm familiar with using a column- vs a row-store for how a databases internally persists data to disk. My question is whether, for a dataset is entirely in memory, and there's no storage to disk, if the row- vs column- orientation makes much of a difference?
A lot depends on the size of the dataset, what the contents of each row are, how you need to search in it, whether you want to add items to or remove items from the dataset, and so on.
There is also the CPU and memory architecture to consider; how big are your caches, what is the size of a cache line, and how intelligent is your CPU's prefetcher.
For fields under 8 bytes, it would involve less memory accesses for columns than for rows.
Memory is not accessed a register at a time, but rather a cache line at a time. On most contemporary machines, cache lines are 64 bytes.
Compression would also be easier on a column-store regardless of whether in memory or not
Not really. You can compress/decompress a column even if it is not stored in memory consecutively. It might be faster though.
does compression ever matter on in-memory operations?
That depends. If it's in-memory, then it's likely that compression will reduce performance, but on the other hand, the amount of data that you need to store is smaller, so you will be able to fit more into memory.
Possible to vectorize operations.
It's only loading/storing to memory that might be slower if data is grouped by rows.
Much, much easier to work with a struct on a row-by-row basis of course.
It's easy to use a pointer to a struct with a row-by-row store, but with C++ you can make classes that hide the fact that data is stored column-by-column. That's a bit more work up front, but might make it as easy as row-by-row once you have set that up.
Also, column-by-column store is often used in the entity-component-system pattern, and there are libraries such as EnTT that make it quite easy to work with.
Are both of those accurate, and are there any more? Given this, would there be substantial performance improvements on using an in-memory colstore vs rowstore on a read-only dataset, or just a marginal improvement?
Again, it heavily depends on the size of the dataset and how you want to access it. If you frequently use all columns in a row, then row-by-row store is preferred. If you frequently just use one column, and need to access that column of many consecutive rows, then a column-by-column store is best.
Also, there are hybrid solutions possible. You could have one column on its own, and then all the other columns stored in row-by-row fashion.
How you will search in a read-only dataset matters a lot. Is it going to be sorted, or is it more like a hash map? In the former case, you want the index to be as compact as possible, and possibly ordered like a B-tree as Alex Guteniev already mentioned. If it's going to be like a hash map, then you probably want row-by-row.

For in-memory arrays, this is called AoS vs SoA (array of structs vs struct of arrays).
I think the main advantage in SoA for a read-only database is that searches would need to access smaller memory range. This is more cache friendly, less prone to page faults.
The amount of improvement depends on how you use the database. There may be some more significant improvement by using more targetted structure (sorted array, B-tree)

Data Orientated Design; how do I optimize a data structure in c++ for performance?

I would like to have a class of varying number n of objects which are easily iterated over as a group, with each object member having a large list (20+) of individually modified variables influencing class methods. Before I started learning OOP, I would just make a 2D array and load the variable values into each row, corresponding to each object, and then append/delete rows as needed. Is this still a good solution? Is there a better solution?
Again, in this case I am more interested in pushing processor performance rather than preserving abstraction and modularity, etc. In this respect, I am very confused about the way the data container ultimately is read into the L1 cache, and how to ensure that I do not induce page inefficiency or cache-misses. If for example, I have a 128 kb cache, I assume the entire container should fit into this cache to be efficient, correct?

According to Agner Fog's optimization manual, the C++ Standard Template Library is rather inefficient, because it makes extensive use of dynamic memory allocation. However, a fixed size array that is made larger than necessary (e.g. because the needed size is not known at compile time) can also be bad for performance, because a larger size means that it won't fit into the cache as easily. In such situations, the STL's dynamic memory allocation could perform better.
Generally, it is best to store your data in contiguous memory. You can use a fixed size array or an std::vector for this. However, before using std::vector, you should call std::vector::reserve() for performance reasons, so that the memory does not have to be reallocated too often. If you reallocate too often, the heap could become fragmented, which is also bad for cache performance.
Ideally, the data that you are working on will fit entirely into the Level 1 data cache (which is about 32 KB on modern desktop processors). However, even if it doesn't fit, the Level 2 cache is much larger (about 512 KB) and the Level 3 Cache is several Megabytes. The higher-level caches are still significantly faster than reading from main memory.
It is best if your memory access patterns are predictable, so that the hardware prefetcher can do its work best. Sequential memory accesses are easiest for the hardware prefetcher to predict.
The CPU cache works best if you access the same data several times and if the data is small enough to be kept in the cache. However, even if the data is used only once, the CPU cache can still make the memory access faster, by making use of prefetching.
A cache miss will occur if
the data is being accessed for the first time and the hardware prefetcher was not able to predict and prefetch the needed memory address in time, or
the data is no longer cached, because the cache had to make room for other data, due to the data being too large to fit in the cache.
In addition to the hardware prefetcher attempting to predict needed memory addresses in advance (which is automatic), it is also possible for the programmer to explicity issue a software prefetch. However, from what I have read, it is hard to get significant performance gains from doing this, except under very special circumstances.

Optimising data-structures so that they take advantage of virtual memory

I would like to know how to optimise data-structures in openCV (the mat type specifically) so that I am able to leverage the operating systems built in memory/virtual memory management.
For a full context please read the Q and A here - but otherwise the situation could be summed up that I have a large collection of mats* that I'll need to access arbitrarily and rapidly. The main complication is that full amount of data is well above the amount of RAM available.
(*Conceptually the data is a recursively defined 3D array of 3D arrays, but let's not muddy the water with that confusion!)
Rather than build my own LRU cache and RAM-hungry and inefficient 'page' addressing strategies to access it, I'd rather let the OS do this for me.
I think I get the concepts, but when it comes to the actual implementation I'm twiddling thumbs:
Is this a generic C++ consideration, or something I need to address at the openCV level?
Is it as simple as making the granularity of the of data close to (but not over) 4KB? (see the solution here for the 4KB motivation)
How would the mat(s) actually be saved, accessed and represented on disk? (is this how memory-mapping is involved?)

Is this a generic C++ consideration, or something I need to address at the openCV level?
You just allocate and use boatloads of memory. The whole point of paging / virtual memory is that it's completely transparent. Everything gets extremely slow, but keeps working. You don't get ENOMEM until you're out of swap space + RAM.
On a normal Linux system, your normal swap partition should be very small (under 1GB), so you'll probably need to dd a swap file, and mkswap / swapon on it. Make sure the swap file is has read-write permission for root only. Obviously every major OS will have its own procedures.
Is it as simple as making the granularity of the of data close to (but not over) 4KB? (see the solution here for the 4KB motivation)
If you have pointers to other data, make sure you keep them together. You want all the small "hot" data to be in only a few pages that a decent OS LRU algorithm won't page out.
If you have hot data mixed with cold data, it will easily get paged out and lead to an extra page-file round trip before the cache miss for the final data can even happen.
Like Yakk says, sequential access patterns will do much better, because disk I/O does better with multi-block reads. (Even SSDs have better throughput with larger blocks). This also allows prefetching, which allows one I/O request to start before the previous one's data arrives. Maxing out I/O throughput requires pipelining requests.
Try to design your algorithms to do sequential accesses when possible. This is advantageous at all levels of memory, from paging all the way up to L1 cache. Sequential access even enables auto-vectorization with vector-registers.
Cache blocking (aka loop tiling) techniques are also applicable to page misses. Google for details, but the main idea is to do all the steps of your algorithm over a subset of the data, instead of touching all the data at each step. Then each piece of data only has to be loaded into cache once total, instead of once for each step of your algorithm.
Think of DRAM as a cache for your giant virtual address space.
How would the mat(s) actually be saved, accessed and represented on disk? (is this how memory-mapping is involved?)
Swap space / the pagefile is the backing store for your process's address space. So yes, it's very similar to what you'd get if you allocated memory by mmaping a big file instead of making an anonymous allocation.

c++: how to optimize IO?

I am working on a mathematical problem that has the advantage of being able to "pre-compute" about half of the problem, save this information to file, and then reuse it many times to compute various 'instances' of my problem. The difficulty is that uploading all of this information in order to solve the actual problem is a major bottleneck.
More specifically:
I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map<int,int>, and much more - and save all this stuff to disk (several Gb).
The second half of my program accepts an input argument D. For each D, I need to perform a great many computations that involve a combination of the pre-computed data (from file), and some other data that are specific to D (so that the problem is different for each D).
Sometimes I will need to pick out certain pieces of pre-computed information from the files. Other times, I will need to upload every piece of data from a (large) file.
Are there any strategies for making the IO faster?
I already have the program parallelized (MPI, via boost::mpi) for other reasons, but regardless, accessing files on the disk is making my compute time unbearable.
Any strategies or optimizations?
Currently I am doing everything with cstdio, i.e. no iostream. Will that make a big difference?

Certainly the fastest (but the fragilest) solution would be to mmap the data to a fixed address. Slap it all in one big struct, and instantiate the std:::map with an allocator which will allocate in a block attached to the end of the struct. It's not simple, but it will be fast; one call to mmap, and the data is in your (virtual) memory. And because you're forcing the address in mmap, you can even store the pointers, etc.
As mentioned above, in addition to requiring a fair amount of work, it's fragile. Recompile your application, and the targeted address might not be available, or the layout might be different, or whatever. But since it's really just an optimization, this might not be an issue; anytime a compatibility issue arises, just drop the old file and start over. It will make the first run after a change which breaks compatibility extremely slow, but if you don't break compatibility too often...

The stuff that isn't in a map is easy. You put everything in one contiguous chunk of memory that you know (like a big array, or a struct/class with no pointers), and then use write() to write it out. Later use read() to read it in, in a single operation. If the size might vary, then use one operation to read a single int with the size, allocate the memory, and then use a single read() to pull it in.
The map part is a bit harder, since you can't do it all in one operation. Here you need to come up with a convention for serializing it. To make the i/o as fast as possible, your best bet is to convert it from the map to an in-memory form that is all in one place and you can convert back to the map easily and quickly. If, for example your keys are ints, and your values are of constant size then you could make an array of keys, and an array of values, copy your keys into the one array and values into the other, and then write() the two arrays, possibly writing out their size as well. Again, you read things in with only two or three calls to read().
Note that nothing ever got translated to ASCII, and there are a minimum number of system calls. The file will not be human readable, but it will be compact, and fast to read in. Three things make i/o slow: 1) system calls, if you use small reads/writes; 2) translation to/from ASCII (printf, scanf); 3) disk speed. Hard to do much about 3) (other than an SSD). You can do the read in a background thread, but you might need to block waiting for the data to be in.

Some guidelines:
multiple calls to read() are more expensive than single call
binary files are faster than text files
single file is faster than multiple files for large values of "multiple"
use memory-mapped files if you can
use 64 bit OS to let OS manage the memory for you
Ideally, I'd try to put all long doubles into memory-mapped file, and all maps into binary files.
Divide and conquer: if 64 bits is not an option, try to break your data into large chunks in a way that all chunks are never used together, and the entire chunk is needed when it's needed. This way you could load the chunks when they needed and discard them when they are not.

These suggestions of uploading the whole data to the RAM are good when two conditions are met:
Sum of all I/O times during is much more than cost of loading all data to RAM
Relatively large portion of all data is being accessed during application run
(they are usually met when some application is running for a long time processing different data)
However for other cases other options might be considered.
E.g. it is essential to understand if access pattern is truly random. If no, look into reordering data to ensure that items that are accessible together are close to each other. This will ensure that OS caching is performing at its best, and also will reduce HDD seek times (not a case for SSD of course).
If accesses are truly random, and application is not running as long as needed to ammortize one-time data loading cost I would look into architecture, e.g. by extracting this data manager into separate module that will keep this data preloaded.
For Windows it might be system service, for other OSes other options are available.

Cache, cache, cache. If it's only several GB it should be feasible to cache most if not all of your data in something like memcached. This is an especially good solution if you're using MPI across multiple machines rather than just multiple processors on the same machine.
If it's all running on the same machine, consider a shared memory cache if you have the memory available.
Also, make sure your file writes are being done on a separate thread. No need to block an entire process waiting for a file to write.

As was said, cache as much as you can in memory.
If you're finding that the amount you need to cache is larger than your memory will allow, try swapping out the caches between memory and disk how it is often done when virtual memory pages need to be swapped to disk. It is essentially the same problem.
One common method is the Least Recently Used Algorithm for determining which page will be swapped.

It really depends on how much memory is available and what the access pattern is.
The simplest solution is to use memory mapped files. This generally requires that the file has been layed out as if the objects were in memory, so you will need to only use POD data with no pointers (but you can use relative indexes).
You need to study your access pattern to see if you can group together the values that are often used together. This will help the OS in better caching those values (ie, keeping them in memory for you, rather than always going to the disk to read them).
Another option will be to split the file into several chunks, preferably in a logical way. It might be necessary to create an index file that map a range of values to the file that contain them.
Then, you can only access the set of files required.
Finally, for complex data structures (where memory mapped files fail) or for sparse reading (when you only ever extract only a small piece of information from a given file), it might be interesting to read about LRU caches.
The idea will be to use serialization and compression. You write several files, among which an index, and compress all of them (zip). Then, at launch time, you start by loading the index and save it in memory.
Whenever you need to access a value, you first try your cache, if it is not it, you access the file that contains it, decompress it in memory, dump its content in your cache. Note: if the cache is too small, you have to be picky about what you dump in... or reduce the size of the files.
The frequently accessed values will stay in cache, avoiding unnecessary round-trip, and because the file is zipped there will be less IO.

Structure your data in a way that caching can be effective. For instance, when you are reading "certain pieces," if those are all contiguous it won't have to seek around the disk to gather all of them.
Reading and writing in batches, instead of record by record will help if you are sharing disk access with another process.

More specifically: I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map, and much more - and save all this stuff to disk (several Gb).
As far as I understood the std::map are pre-calculated also and there are no insert/remove operations. Only search. How about an idea to replace the maps to something like std::hash_map or sparsehash. In theory it can give performance gain.

More specifically: I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map, and much more - and save all this stuff to disk (several Gb).
Don't reinvent the wheel. I'd suggest using a key-value data store, such as berkeley db: http://docs.oracle.com/cd/E17076_02/html/gsg/C/concepts.html
This will enable saving and sharing the files, caching the parts you actually use a lot and keeping other parts on disk.

how to memory map a huge matrix?

Suppose you got a huge (40+ GB) feature value (floating-point) matrix, rows are different features and columns are the samples/images.
The table is precomputed column-wise.
Then it is completely accessed row-wise and multi-threaded (each thread loads a whole row) several times.
What would be the best way to handle this matrix? I'm especially pondering over 5 points:
Since it's run on an x64 PC I could memory map the whole matrix at once but would that make sense?
What about the effects of multithreading (multithreaded initial computation as well?)?
How to layout the matrix: row or column major?
Would it help to mark the matrix as read-only after the precomputation has been finished?
Could something like http://www.kernel.org/doc/man-pages/online/pages/man2/madvise.2.html be used to speed it up?

Memory mapping the whole file could make the process much easier.
You want to lay out your data to optimize for the most common access pattern. It sounds like the data is going to be written once (column-wise) and read several times (row-wise). That suggests the data should be stored in row-major order.
Marking the matrix read-only once the pre-computation is done probably won't help performance (there are some possible low-level optimizations, but I don't think anything implements them), but it will prevent bugs from accidentally writing to data you don't intend to. Might as well.
madvise could end up being useful, once you've got your application written and working.
My overall advice: write the program in the simplest way you can, sequentially at first, and then put timers around the whole thing and the various major operations. Make sure the major operation times sum to the overall time, so you can be sure you're not missing anything. Then target your performance improvement efforts toward the components that are actually taking the most time.
Per JimR's mention of 4MB pages in his comment, you may end up wanting to look into hugetlbfs or using a Linux Kernel release with transparent huge page support (merged for 2.6.38, could probably be patched into earlier versions). This would likely save you a whole lot of TLB misses, and convince the kernel to do the disk IO in sufficiently large chunks to amortize any seek overhead.

Maybe, see below.
The size of the total working set of all threads must not exceed available RAM, otherwise the program will run at snail speed because of swapping.
Layout should match access patterns, as long as condition 2 is respected.
What do you mean by "mark as read only"?
Measure it.
Re 3: If you have, e.g., 8 CPUs but do not have enough RAM to load 8 rows, you should make each thread process its row sequentially in manageable chunks. In this case, block-layout of a matrix would make sense. If the thread MUST have the whole row in memory to process it, I'm afraid that you can't use all the CPUs, as the process will start thrashing, i.e., kicking out some subset of the matrix out of the ram and reloading another needed subset. This is slightly less bad than full swapping as the matrix is never modified, so the contents of the pages do not need to be written to the swap file before being kicked out. But it still hurts performance badly.
Also, doing random access I/O from multiple threads is a bad idea, which is what you'll end up doing if you use mmap(). You have (presumably) only a single disk, and parallel I/O will just make it slower. So mmap() might not make sense and you could achieve better I/O performance by reading data sequentially into ram.
Note that 40GB is approximately 10.5 million pages of 4096 bytes. By doing mmap(), you will, in the worst case, slow down computation by that many hard disk seeks. At 8ms per seek (taken from wikipedia), you'll end up wasting 83666 seconds, i.e., almost a whole day!

If you could fit the whole thing into main memory, then yes: memory map it all, and it doesn't matter whether it's column major or row major. However, at 40+ Gb, I'm sure it's too big for main memory. In which case:
No, don't map the whole thing! At least, don't expect the memory to work like normal memory if you map it all. Your program will take forever if you don't properly deal with the i/o issues.
The multi-threaded access issue is solved if you store it row-major (it sounds like you don't have multi-threaded column writes).
You should lay it out row-wise, assuming each cell is written once and then read many times.
Yes, I think it would help to mark the matrix as read-only after it's been written, but purely as a way to prevent bugs (accidental writes). It won't affect performance.
No, no amount of clever kernel read-ahead is going to solve your performance problems. You need to solve it at the algorithm level.
I think you are going to have a performance problem with a naive implementation. Either the computer with thrash while writing (if you store it row major) or it will thrash while querying (if you store it column major). The latter is presumably worse, but it's a problem both ways.
The right solution is to use an intermediate representation which is neither row-major nor column-major but 'large squares'. Take the first 50,000 columns and store them in a memory-mapped file (phase 1). It doesn't matter if it's column major or row major since it'll be purely memory resident. Then, take each row and write it into the final row-major memory-mapped file (phase 2). Then repeat the cycle for the next 50,000 columns, and so on.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js