Insert MANY key value pairs fast into berkeley db with hash access - c++

i'm trying to build a hash with berkeley db, which shall contain many tuples (approx 18GB of key value pairs), but in all my tests the performance of the insert operations degrades drastically over time. I've written this script to test the performance:
#include<iostream>
#include<db_cxx.h>
#include<ctime>
#define MILLION 1000000
int main () {
long long a = 0;
long long b = 0;
int passes = 0;
int i = 0;
u_int32_t flags = DB_CREATE;
Db* dbp = new Db(NULL,0);
dbp->set_cachesize( 0, 1024 * 1024 * 1024, 1 );
int ret = dbp->open(
NULL,
"test.db",
NULL,
DB_HASH,
flags,
0);
time_t time1 = time(NULL);
while ( passes < 100 ) {
while( i < MILLION ) {
Dbt key( &a, sizeof(long long) );
Dbt data( &b, sizeof(long long) );
dbp->put( NULL, &key, &data, 0);
a++; b++; i++;
}
DbEnv* dbep = dbp->get_env();
int tmp;
dbep->memp_trickle( 50, &tmp );
i=0;
passes++;
std::cout << "Inserted one million --> pass: " << passes << " took: " << time(NULL) - time1 << "sec" << std::endl;
time1 = time(NULL);
}
}
Perhaps you can tell me why after some time the "put" operation takes increasingly longer and maybe how to fix this.
Thanks for your help,
Andreas

You may want to look at the information provided by the db_stat utility and the HASH-specific tuning functions that are available. Please see BDB Reference Guide section on configuring a HASH database.
I would expect you to get 10s of thousands of inserts per second on commodity hardware. What are you experiencing and what is your performance target?
Regards,
Dave

I would recommend trying the bulk insert API, you can read about that in the documentation here:
http://www.oracle.com/technology/documentation/berkeley-db/db/api_reference/CXX/dbput.html#put_DB_MULTIPLE_KEY
Also, I would guess that your call to memp_trickle is responsible for most of the slowdown. As the cache becomes dirtier, finding pages to trickle becomes more expensive. In fact, since you are only writing, having a large cache only hurts (once you've written the data, you don't use it again, so you don't want it to hang around in the cache.) I would recommend testing different (smaller) cache sizes.
Finally, if your sole concern is insert performance, using a larger page size will help. You'll be able to fit more data on each page and that will result in fewer disk writes.
-Ben

The memp_trickle is almost certainly slowing things down. It's often good to use trickle, but it belongs in its own thread to be effective. BDB (except when you get into the higher level replication APIs) creates no threads for you -- nothing happens behind the scenes (thread-wise). trickle will be effective when you have dirty pages forced from the cache (look at a stat output to see if that's what's happening).
You might also consider using a BTREE instead of a HASH. Yeah, I know you specifically said hash, but why? If you are looking to maximize performance, why add that restriction? You may be able to take advantage of locality of reference to reduce your cache footprint - there is often much more locality that you believe, or perhaps you can create some -- if you generate keys that are random digits, for example, prepend the date and time. That usually introduces locality into a perceived 'random' system. If you do use btree, you'll need to pay some attention to byte ordering for your keys for your system (look up Endianness in Wikipedia), if you are using a Little Endian system, you'll need to swap the bytes. Using BTREE with the right ordering and introduced locality means your key/value pairs will be stored in 'key-generation-time' order, so if you see the most action on the recent keys, you'll be tending to hit the same pages over and over (see your cache hit rate in your stats). So you'll need less cache. Another way to think of it is with the same cache amount, your solution will scale by a larger multiple.
I expect your actual app really doesn't insert integers keys in order (if it does, you'll be lucky). So you should write a benchmark that closely simulates your access patterns, at least with respect to: size of key, size of data, access pattern, number of items in the database, mix of read/write. Once you have that, look at the stats -- pay close attention to anything that implies IO or contention.
BTW, I've recently started a blog at http://libdb.wordpress.com to discuss BDB performance tuning (and other matters related to BDB). You may get some good ideas there. There can be a huge difference of latency and throughput depending on the kind of tuning you do. In particular, see http://libdb.wordpress.com/2011/01/31/revving-up-a-benchmark-from-626-to-74000-operations-per-second/

Your performance degrade can have several reasons, which aren't actually connected with your code. I may be mistaken, but I think this is all about internal database structure (and used data structure).
Think of a situation where database uses some approach other than a hash table, for example an RB Tree. Inserting in that tree would take O(logN) in Big-O sense and every inserted element increases the time needed for the next insert.
Unfortunately, the same could happen with a plain hash-table, so that initial O(1) insertion operation time degrades to something worse. This could have several reasons, but it's all about hash collisions, which could happen due to wrong hash function, wrong data (that is bad for currently used hash function) or even due to the moon phase.
If I were you, I would try to dig into your db internal structure. Also, I think testing your keys with something other than your db (e.g boost::unordered_map) could also benefit your testing and profiling.
Edit: Also to mention, did you try to change that cache_size stuff in your sample? Or maybe there are some other performance-related parameters that could be modified?

Related

For-loop variables and cache misses? [duplicate]

What is the difference between "cache unfriendly code" and the "cache friendly" code?
How can I make sure I write cache-efficient code?
Preliminaries
On modern computers, only the lowest level memory structures (the registers) can move data around in single clock cycles. However, registers are very expensive and most computer cores have less than a few dozen registers. At the other end of the memory spectrum (DRAM), the memory is very cheap (i.e. literally millions of times cheaper) but takes hundreds of cycles after a request to receive the data. To bridge this gap between super fast and expensive and super slow and cheap are the cache memories, named L1, L2, L3 in decreasing speed and cost. The idea is that most of the executing code will be hitting a small set of variables often, and the rest (a much larger set of variables) infrequently. If the processor can't find the data in L1 cache, then it looks in L2 cache. If not there, then L3 cache, and if not there, main memory. Each of these "misses" is expensive in time.
(The analogy is cache memory is to system memory, as system memory is to hard disk storage. Hard disk storage is super cheap but very slow).
Caching is one of the main methods to reduce the impact of latency. To paraphrase Herb Sutter (cfr. links below): increasing bandwidth is easy, but we can't buy our way out of latency.
Data is always retrieved through the memory hierarchy (smallest == fastest to slowest). A cache hit/miss usually refers to a hit/miss in the highest level of cache in the CPU -- by highest level I mean the largest == slowest. The cache hit rate is crucial for performance since every cache miss results in fetching data from RAM (or worse ...) which takes a lot of time (hundreds of cycles for RAM, tens of millions of cycles for HDD). In comparison, reading data from the (highest level) cache typically takes only a handful of cycles.
In modern computer architectures, the performance bottleneck is leaving the CPU die (e.g. accessing RAM or higher). This will only get worse over time. The increase in processor frequency is currently no longer relevant to increase performance. The problem is memory access. Hardware design efforts in CPUs therefore currently focus heavily on optimizing caches, prefetching, pipelines and concurrency. For instance, modern CPUs spend around 85% of die on caches and up to 99% for storing/moving data!
There is quite a lot to be said on the subject. Here are a few great references about caches, memory hierarchies and proper programming:
Agner Fog's page. In his excellent documents, you can find detailed examples covering languages ranging from assembly to C++.
If you are into videos, I strongly recommend to have a look at Herb Sutter's talk on machine architecture (youtube) (specifically check 12:00 and onwards!).
Slides about memory optimization by Christer Ericson (director of technology # Sony)
LWN.net's article "What every programmer should know about memory"
Main concepts for cache-friendly code
A very important aspect of cache-friendly code is all about the principle of locality, the goal of which is to place related data close in memory to allow efficient caching. In terms of the CPU cache, it's important to be aware of cache lines to understand how this works: How do cache lines work?
The following particular aspects are of high importance to optimize caching:
Temporal locality: when a given memory location was accessed, it is likely that the same location is accessed again in the near future. Ideally, this information will still be cached at that point.
Spatial locality: this refers to placing related data close to each other. Caching happens on many levels, not just in the CPU. For example, when you read from RAM, typically a larger chunk of memory is fetched than what was specifically asked for because very often the program will require that data soon. HDD caches follow the same line of thought. Specifically for CPU caches, the notion of cache lines is important.
Use appropriate c++ containers
A simple example of cache-friendly versus cache-unfriendly is c++'s std::vector versus std::list. Elements of a std::vector are stored in contiguous memory, and as such accessing them is much more cache-friendly than accessing elements in a std::list, which stores its content all over the place. This is due to spatial locality.
A very nice illustration of this is given by Bjarne Stroustrup in this youtube clip (thanks to #Mohammad Ali Baydoun for the link!).
Don't neglect the cache in data structure and algorithm design
Whenever possible, try to adapt your data structures and order of computations in a way that allows maximum use of the cache. A common technique in this regard is cache blocking (Archive.org version), which is of extreme importance in high-performance computing (cfr. for example ATLAS).
Know and exploit the implicit structure of data
Another simple example, which many people in the field sometimes forget is column-major (ex. fortran,matlab) vs. row-major ordering (ex. c,c++) for storing two dimensional arrays. For example, consider the following matrix:
1 2
3 4
In row-major ordering, this is stored in memory as 1 2 3 4; in column-major ordering, this would be stored as 1 3 2 4. It is easy to see that implementations which do not exploit this ordering will quickly run into (easily avoidable!) cache issues. Unfortunately, I see stuff like this very often in my domain (machine learning). #MatteoItalia showed this example in more detail in his answer.
When fetching a certain element of a matrix from memory, elements near it will be fetched as well and stored in a cache line. If the ordering is exploited, this will result in fewer memory accesses (because the next few values which are needed for subsequent computations are already in a cache line).
For simplicity, assume the cache comprises a single cache line which can contain 2 matrix elements and that when a given element is fetched from memory, the next one is too. Say we want to take the sum over all elements in the example 2x2 matrix above (lets call it M):
Exploiting the ordering (e.g. changing column index first in c++):
M[0][0] (memory) + M[0][1] (cached) + M[1][0] (memory) + M[1][1] (cached)
= 1 + 2 + 3 + 4
--> 2 cache hits, 2 memory accesses
Not exploiting the ordering (e.g. changing row index first in c++):
M[0][0] (memory) + M[1][0] (memory) + M[0][1] (memory) + M[1][1] (memory)
= 1 + 3 + 2 + 4
--> 0 cache hits, 4 memory accesses
In this simple example, exploiting the ordering approximately doubles execution speed (since memory access requires much more cycles than computing the sums). In practice, the performance difference can be much larger.
Avoid unpredictable branches
Modern architectures feature pipelines and compilers are becoming very good at reordering code to minimize delays due to memory access. When your critical code contains (unpredictable) branches, it is hard or impossible to prefetch data. This will indirectly lead to more cache misses.
This is explained very well here (thanks to #0x90 for the link): Why is processing a sorted array faster than processing an unsorted array?
Avoid virtual functions
In the context of c++, virtual methods represent a controversial issue with regard to cache misses (a general consensus exists that they should be avoided when possible in terms of performance). Virtual functions can induce cache misses during look up, but this only happens if the specific function is not called often (otherwise it would likely be cached), so this is regarded as a non-issue by some. For reference about this issue, check out: What is the performance cost of having a virtual method in a C++ class?
Common problems
A common problem in modern architectures with multiprocessor caches is called false sharing. This occurs when each individual processor is attempting to use data in another memory region and attempts to store it in the same cache line. This causes the cache line -- which contains data another processor can use -- to be overwritten again and again. Effectively, different threads make each other wait by inducing cache misses in this situation.
See also (thanks to #Matt for the link): How and when to align to cache line size?
An extreme symptom of poor caching in RAM memory (which is probably not what you mean in this context) is so-called thrashing. This occurs when the process continuously generates page faults (e.g. accesses memory which is not in the current page) which require disk access.
In addition to #Marc Claesen's answer, I think that an instructive classic example of cache-unfriendly code is code that scans a C bidimensional array (e.g. a bitmap image) column-wise instead of row-wise.
Elements that are adjacent in a row are also adjacent in memory, thus accessing them in sequence means accessing them in ascending memory order; this is cache-friendly, since the cache tends to prefetch contiguous blocks of memory.
Instead, accessing such elements column-wise is cache-unfriendly, since elements on the same column are distant in memory from each other (in particular, their distance is equal to the size of the row), so when you use this access pattern you are jumping around in memory, potentially wasting the effort of the cache of retrieving the elements nearby in memory.
And all that it takes to ruin the performance is to go from
// Cache-friendly version - processes pixels which are adjacent in memory
for(unsigned int y=0; y<height; ++y)
{
for(unsigned int x=0; x<width; ++x)
{
... image[y][x] ...
}
}
to
// Cache-unfriendly version - jumps around in memory for no good reason
for(unsigned int x=0; x<width; ++x)
{
for(unsigned int y=0; y<height; ++y)
{
... image[y][x] ...
}
}
This effect can be quite dramatic (several order of magnitudes in speed) in systems with small caches and/or working with big arrays (e.g. 10+ megapixels 24 bpp images on current machines); for this reason, if you have to do many vertical scans, often it's better to rotate the image of 90 degrees first and perform the various analysis later, limiting the cache-unfriendly code just to the rotation.
Optimizing cache usage largely comes down to two factors.
Locality of Reference
The first factor (to which others have already alluded) is locality of reference. Locality of reference really has two dimensions though: space and time.
Spatial
The spatial dimension also comes down to two things: first, we want to pack our information densely, so more information will fit in that limited memory. This means (for example) that you need a major improvement in computational complexity to justify data structures based on small nodes joined by pointers.
Second, we want information that will be processed together also located together. A typical cache works in "lines", which means when you access some information, other information at nearby addresses will be loaded into the cache with the part we touched. For example, when I touch one byte, the cache might load 128 or 256 bytes near that one. To take advantage of that, you generally want the data arranged to maximize the likelihood that you'll also use that other data that was loaded at the same time.
For just a really trivial example, this can mean that a linear search can be much more competitive with a binary search than you'd expect. Once you've loaded one item from a cache line, using the rest of the data in that cache line is almost free. A binary search becomes noticeably faster only when the data is large enough that the binary search reduces the number of cache lines you access.
Time
The time dimension means that when you do some operations on some data, you want (as much as possible) to do all the operations on that data at once.
Since you've tagged this as C++, I'll point to a classic example of a relatively cache-unfriendly design: std::valarray. valarray overloads most arithmetic operators, so I can (for example) say a = b + c + d; (where a, b, c and d are all valarrays) to do element-wise addition of those arrays.
The problem with this is that it walks through one pair of inputs, puts results in a temporary, walks through another pair of inputs, and so on. With a lot of data, the result from one computation may disappear from the cache before it's used in the next computation, so we end up reading (and writing) the data repeatedly before we get our final result. If each element of the final result will be something like (a[n] + b[n]) * (c[n] + d[n]);, we'd generally prefer to read each a[n], b[n], c[n] and d[n] once, do the computation, write the result, increment n and repeat 'til we're done.2
Line Sharing
The second major factor is avoiding line sharing. To understand this, we probably need to back up and look a little at how caches are organized. The simplest form of cache is direct mapped. This means one address in main memory can only be stored in one specific spot in the cache. If we're using two data items that map to the same spot in the cache, it works badly -- each time we use one data item, the other has to be flushed from the cache to make room for the other. The rest of the cache might be empty, but those items won't use other parts of the cache.
To prevent this, most caches are what are called "set associative". For example, in a 4-way set-associative cache, any item from main memory can be stored at any of 4 different places in the cache. So, when the cache is going to load an item, it looks for the least recently used3 item among those four, flushes it to main memory, and loads the new item in its place.
The problem is probably fairly obvious: for a direct-mapped cache, two operands that happen to map to the same cache location can lead to bad behavior. An N-way set-associative cache increases the number from 2 to N+1. Organizing a cache into more "ways" takes extra circuitry and generally runs slower, so (for example) an 8192-way set associative cache is rarely a good solution either.
Ultimately, this factor is more difficult to control in portable code though. Your control over where your data is placed is usually fairly limited. Worse, the exact mapping from address to cache varies between otherwise similar processors. In some cases, however, it can be worth doing things like allocating a large buffer, and then using only parts of what you allocated to ensure against data sharing the same cache lines (even though you'll probably need to detect the exact processor and act accordingly to do this).
False Sharing
There's another, related item called "false sharing". This arises in a multiprocessor or multicore system, where two (or more) processors/cores have data that's separate, but falls in the same cache line. This forces the two processors/cores to coordinate their access to the data, even though each has its own, separate data item. Especially if the two modify the data in alternation, this can lead to a massive slowdown as the data has to be constantly shuttled between the processors. This can't easily be cured by organizing the cache into more "ways" or anything like that either. The primary way to prevent it is to ensure that two threads rarely (preferably never) modify data that could possibly be in the same cache line (with the same caveats about difficulty of controlling the addresses at which data is allocated).
Those who know C++ well might wonder if this is open to optimization via something like expression templates. I'm pretty sure the answer is that yes, it could be done and if it was, it would probably be a pretty substantial win. I'm not aware of anybody having done so, however, and given how little valarray gets used, I'd be at least a little surprised to see anybody do so either.
In case anybody wonders how valarray (designed specifically for performance) could be this badly wrong, it comes down to one thing: it was really designed for machines like the older Crays, that used fast main memory and no cache. For them, this really was a nearly ideal design.
Yes, I'm simplifying: most caches don't really measure the least recently used item precisely, but they use some heuristic that's intended to be close to that without having to keep a full time-stamp for each access.
Welcome to the world of Data Oriented Design. The basic mantra is to Sort, Eliminate Branches, Batch, Eliminate virtual calls - all steps towards better locality.
Since you tagged the question with C++, here's the obligatory typical C++ Bullshit. Tony Albrecht's Pitfalls of Object Oriented Programming is also a great introduction into the subject.
Just piling on: the classic example of cache-unfriendly versus cache-friendly code is the "cache blocking" of matrix multiply.
Naive matrix multiply looks like:
for(i=0;i<N;i++) {
for(j=0;j<N;j++) {
dest[i][j] = 0;
for( k=0;k<N;k++) {
dest[i][j] += src1[i][k] * src2[k][j];
}
}
}
If N is large, e.g. if N * sizeof(elemType) is greater than the cache size, then every single access to src2[k][j] will be a cache miss.
There are many different ways of optimizing this for a cache. Here's a very simple example: instead of reading one item per cache line in the inner loop, use all of the items:
int itemsPerCacheLine = CacheLineSize / sizeof(elemType);
for(i=0;i<N;i++) {
for(j=0;j<N;j += itemsPerCacheLine ) {
for(jj=0;jj<itemsPerCacheLine; jj+) {
dest[i][j+jj] = 0;
}
for( k=0;k<N;k++) {
for(jj=0;jj<itemsPerCacheLine; jj+) {
dest[i][j+jj] += src1[i][k] * src2[k][j+jj];
}
}
}
}
If the cache line size is 64 bytes, and we are operating on 32 bit (4 byte) floats, then there are 16 items per cache line. And the number of cache misses via just this simple transformation is reduced approximately 16-fold.
Fancier transformations operate on 2D tiles, optimize for multiple caches (L1, L2, TLB), and so on.
Some results of googling "cache blocking":
http://stumptown.cc.gt.atl.ga.us/cse6230-hpcta-fa11/slides/11a-matmul-goto.pdf
http://software.intel.com/en-us/articles/cache-blocking-techniques
A nice video animation of an optimized cache blocking algorithm.
http://www.youtube.com/watch?v=IFWgwGMMrh0
Loop tiling is very closely related:
http://en.wikipedia.org/wiki/Loop_tiling
Processors today work with many levels of cascading memory areas. So the CPU will have a bunch of memory that is on the CPU chip itself. It has very fast access to this memory. There are different levels of cache each one slower access ( and larger ) than the next, until you get to system memory which is not on the CPU and is relatively much slower to access.
Logically, to the CPU's instruction set you just refer to memory addresses in a giant virtual address space. When you access a single memory address the CPU will go fetch it. in the old days it would fetch just that single address. But today the CPU will fetch a bunch of memory around the bit you asked for, and copy it into the cache. It assumes that if you asked for a particular address that is is highly likely that you are going to ask for an address nearby very soon. For example if you were copying a buffer you would read and write from consecutive addresses - one right after the other.
So today when you fetch an address it checks the first level of cache to see if it already read that address into cache, if it doesn't find it, then this is a cache miss and it has to go out to the next level of cache to find it, until it eventually has to go out into main memory.
Cache friendly code tries to keep accesses close together in memory so that you minimize cache misses.
So an example would be imagine you wanted to copy a giant 2 dimensional table. It is organized with reach row in consecutive in memory, and one row follow the next right after.
If you copied the elements one row at a time from left to right - that would be cache friendly. If you decided to copy the table one column at a time, you would copy the exact same amount of memory - but it would be cache unfriendly.
It needs to be clarified that not only data should be cache-friendly, it is just as important for the code. This is in addition to branch predicition, instruction reordering, avoiding actual divisions and other techniques.
Typically the denser the code, the fewer cache lines will be required to store it. This results in more cache lines being available for data.
The code should not call functions all over the place as they typically will require one or more cache lines of their own, resulting in fewer cache lines for data.
A function should begin at a cache line-alignment-friendly address. Though there are (gcc) compiler switches for this be aware that if the the functions are very short it might be wasteful for each one to occupy an entire cache line. For example, if three of the most often used functions fit inside one 64 byte cache line, this is less wasteful than if each one has its own line and results in two cache lines less available for other usage. A typical alignment value could be 32 or 16.
So spend some extra time to make the code dense. Test different constructs, compile and review the generated code size and profile.
As #Marc Claesen mentioned that one of the ways to write cache friendly code is to exploit the structure in which our data is stored. In addition to that another way to write cache friendly code is: change the way our data is stored; then write new code to access the data stored in this new structure.
This makes sense in the case of how database systems linearize the tuples of a table and store them. There are two basic ways to store the tuples of a table i.e. row store and column store. In row store as the name suggests the tuples are stored row wise. Lets suppose a table named Product being stored has 3 attributes i.e. int32_t key, char name[56] and int32_t price, so the total size of a tuple is 64 bytes.
We can simulate a very basic row store query execution in main memory by creating an array of Product structs with size N, where N is the number of rows in table. Such memory layout is also called array of structs. So the struct for Product can be like:
struct Product
{
int32_t key;
char name[56];
int32_t price'
}
/* create an array of structs */
Product* table = new Product[N];
/* now load this array of structs, from a file etc. */
Similarly we can simulate a very basic column store query execution in main memory by creating an 3 arrays of size N, one array for each attribute of the Product table. Such memory layout is also called struct of arrays. So the 3 arrays for each attribute of Product can be like:
/* create separate arrays for each attribute */
int32_t* key = new int32_t[N];
char* name = new char[56*N];
int32_t* price = new int32_t[N];
/* now load these arrays, from a file etc. */
Now after loading both the array of structs (Row Layout) and the 3 separate arrays (Column Layout), we have row store and column store on our table Product present in our memory.
Now we move on to the cache friendly code part. Suppose that the workload on our table is such that we have an aggregation query on the price attribute. Such as
SELECT SUM(price)
FROM PRODUCT
For the row store we can convert the above SQL query into
int sum = 0;
for (int i=0; i<N; i++)
sum = sum + table[i].price;
For the column store we can convert the above SQL query into
int sum = 0;
for (int i=0; i<N; i++)
sum = sum + price[i];
The code for the column store would be faster than the code for the row layout in this query as it requires only a subset of attributes and in column layout we are doing just that i.e. only accessing the price column.
Suppose that the cache line size is 64 bytes.
In the case of row layout when a cache line is read, the price value of only 1(cacheline_size/product_struct_size = 64/64 = 1) tuple is read, because our struct size of 64 bytes and it fills our whole cache line, so for every tuple a cache miss occurs in case of a row layout.
In the case of column layout when a cache line is read, the price value of 16(cacheline_size/price_int_size = 64/4 = 16) tuples is read, because 16 contiguous price values stored in memory are brought into the cache, so for every sixteenth tuple a cache miss ocurs in case of column layout.
So the column layout will be faster in the case of given query, and is faster in such aggregation queries on a subset of columns of the table. You can try out such experiment for yourself using the data from TPC-H benchmark, and compare the run times for both the layouts. The wikipedia article on column oriented database systems is also good.
So in database systems, if the query workload is known beforehand, we can store our data in layouts which will suit the queries in workload and access data from these layouts. In the case of above example we created a column layout and changed our code to compute sum so that it became cache friendly.
Be aware that caches do not just cache continuous memory. They have multiple lines (at least 4) so discontinous and overlapping memory can often be stored just as efficiently.
What is missing from all the above examples is measured benchmarks. There are many myths about performance. Unless you measure it you do not know. Do not complicate your code unless you have a measured improvement.
Cache-friendly code is code that has been optimized to make efficient use of the CPU cache. This typically involves organizing data in a way that takes advantage of spatial and temporal locality, which refers to the idea that data that is accessed together is likely to be stored together in memory, and that data that is accessed frequently is likely to be accessed again in the near future.
There are several ways to make code cache-friendly, including:
Using contiguous memory layouts: By storing data in contiguous
blocks in memory, you can take advantage of spatial locality and
reduce the number of cache misses.
Using arrays: Arrays are a good choice for data structures when you
need to access data sequentially, as they allow you to take
advantage of temporal locality and keep hot data in the cache.
Using pointers carefully: Pointers can be used to access data that
is not stored contiguously in memory, but they can also lead to
cache misses if they are used excessively. If you need to use
pointers, try to use them in a way that takes advantage of spatial
and temporal locality to minimize cache misses.
Using compiler optimization flags: Most compilers have optimization
flags that can be used to optimize the use of the CPU cache. These
flags can help to minimize the number of cache misses and improve
the overall performance of your code.
It is important to note that the specific techniques that work best for optimizing the use of the CPU cache will depend on the specific requirements and constraints of your system. It may be necessary to experiment with different approaches to find the best solution for your needs.

Efficiency of 2 vectors versus a vector of structs

I'm working on a C++ project where I need to search through a vector ignoring those that have already been visited. If one has been visited I set its corresponding visited to 1 and ignore it. Which solution is faster?
Solution 1:
vector<string> stringsToVisit;
vector<int> stringVisited;
for (int i = 0; i < stringToVisit.size(); ++i) {
if (stringVisited[i] == 0) {
string current = stringsToVisit[i];
...Do Stuff...
stringVisited[i] = 1;
}
}
or
Solution 2:
struct StringInfo {
string myString;
int visited = 0;
}
vector<StringInfo> stringsToVisit;
for (int i = 0; i < stringsToVisit.size(); ++i) {
if (stringsToVisit[i].visited == 0) {
string current = stringsToVisit[i].myString;
...Do Stuff...
stringsToVisit[i].visited = 1;
}
}
As Bernard notes, the time and memory complexity of both proposed solutions is identical, and the slightly more complex addressing required by the second solution isn't going to slow things down on modern processors. But I disagree with his suggestion that "Solution 2 is likely to be faster." We really don't know enough to even say that it should theoretically be faster, and except perhaps in a few degenerate situations, the difference in actual performance is likely to be unmeasurable.
The first iteration of the loop would indeed likely be slower. The cache is cold, and the first solution requires two cache lines to store the first elements, while the second solution only requires one. But after that, both solutions are doing a forward linear traversal. The CPU is going to have no problem prefetching additional cache lines, so in most situations, that initial extra overhead is unlikely to actually matter too much.
On the other hand, you are writing data while in this loop, so some of the cache lines you access also get marked dirty (meaning their data needs to be written back to a shared cache or main memory eventually, and they get purged from any other cores' caches). In solution 1, depending on sizeof(string) and sizeof(int), only 5-25% of the cache lines are getting marked dirty. Solution 2, however, dirties every single one, so it may actually use more memory bandwidth.
So, some things that might make solution 2 faster are:
The list of strings being processed is extremely short
...Do Stuff... is very complicated (enough so that the cache lines holding the data get purged from L1 cache)
Some things that might make solution 1 equivalent or faster than solution 2:
The list of strings being processed is moderate to large
...Do Stuff... is not very complex, so the cache stays warm
The program is multithreaded, and another thread wants to read data from stringsToVisit at the same time.
The bottom line is, it probably doesn't matter.
First of all, you should profile your code to check if this piece of code is really the bottleneck, and accurately measure the amount of time each solution takes to run. That would give the best results.
Nevertheless, here's my answer:
The time complexity of both solutions are O(n), so we are talking only about constant-factor optimizations here.
Solution 1 requires the lookup of two different memory blocks - stringsToVisit[i] and stringVisited[i] in each loop. This isn't good for CPU caches, as compared to Solution 2, each iteration of the loop accesses a single struct stored in contiguous locations in memory. As such, Solution 2 would perform better.
Solution 2 would need a more complicated indirect memory lookup than Solution 1 to access the visited property of the struct: (base address of stringsToVisit) + (index) * (struct size) + (displacement in struct). Nevertheless, this kind of lookup fits well into most processors' SIB (scale-index-base) addressing, so it will compile to one assembly instruction only, so there would not be much slowness, if any at all. It is worth noting that an optimizing compiler might notice that you're accessing memory sequentially and do optimizations to avoid using SIB addressing totally.
Hence, Solution 2 is likely to be faster.

Fast hash function for long string key

I am using an extendible hash and I want to have strings as keys. The problem is that the current hash function that I am using iterates over the whole string/key and I think that this is pretty bad for the program's performance since the hash function is called multiple times especially when I am splitting buckets.
Current hash function
int hash(const string& key)
{
int seed = 131;
unsigned long hash = 0;
for(unsigned i = 0; i < key.length(); i++)
{
hash = (hash * seed) + key[i];
}
return hash;
}
The keys could be as long as 40 characters.
Example of string/key
string key = "from-to condition"
I have searched over the internet for a better one but I didn't find anything to match my case. Any suggestions?
You should prefer to use std::hash unless measurement shows that you can do better. To limit the number of characters it uses, use something like:
const auto limit = min(key.length(), 16);
for(unsigned i = 0; i < limit; i++)
You will want to experiment to find the best value of 16 to use.
I would actually expect the performance to get worse (because you will have more collisions). If your strings were several k, then limiting to the first 64 bytes might well be worth while.
Depending on your strings, it might be worth starting not at the beginning. For example, hashing filenames you would probably do better using the characters between 20 and 5 from the end (ignore the often constant pathname prefix, and the file extension). But you still have to measure.
I am using an extendible hash and I want to have strings as keys.
As mentioned before, use std::hash until there is a good reason not to.
The problem is that the current hash function that I am using iterates over the whole string/key and I think that this is pretty bad...
It's an understandable thought, but is actually unlikely to be a real concern.
(anticipating) why?
A quick scan over stack overflow will reveal many experienced developers talking about caches and cache lines.
(forgive me if I'm teaching my grandmother to suck eggs)
A modern CPU is incredibly quick at processing instructions and performing (very complex) arithmetic. In almost all cases, what limits its performance is having to talk to memory across a bus, which is by comparison, horribly slow.
So chip designers build in memory caches - extremely fast memory that sits in the CPU (and therefore does not have to be communicated with over a slow bus). Unhappily, there is only so much space available [plus heat constraints - a topic for another day] for this cache memory so the CPU must treat it like an OS does a disk cache, flushing memory and reading in memory as and when it needs to.
As mentioned, communicating across the bus is slow - (simply put) it requires all the electronic components on the motherboard to stop and synchronise with each other. This wastes a horrible amount of time [This would be a fantastic point to go into a discussion about the propagation of electronic signals across the motherboard being constrained by approximately half the speed of light - it's fascinating but there's only so much space here and I have only so much time]. So rather than transfer one byte, word or longword at a time, the memory is accessed in chunks - called cache lines.
It turns out that this is a good decision by chip designers because they understand that most memory is accessed sequentially - because most programs spend most of their time accessing memory linearly (such as when calculating a hash, comparing strings or objects, transforming sequences, copying and initialising sequences and so on).
What's the upshot of all this?
Well, bizarrely, if your string is not already in-cache it turns of that reading one byte of it is almost exactly as expensive as reading all the bytes in the first (say) 128 bytes of it.
Plus, because the cache circuitry assumes that memory access is linear, it will begin a fetch of the next cache line as soon as it has fetched your first one. It will do this while your CPU is performing its hash computation.
I hope you can see that in this case, even if your string was many thousands of bytes long, and you chose to only hash (say) every 128th byte, all you would be doing would be to compute a very much inferior hash which still causing the memory cache to halt the processor while it fetched large chunks of unused memory. It would take just as long - for a worse result!
Having said that, what are good reasons not to use the standard implementation?
Only when:
The users are complaining that your software is too slow to be useful, and
The program is verifiably CPU-bound (using 100% of CPU time), and
The program is not wasting any cycles by spinning, and
Careful profiling has revealed that the program's biggest bottleneck is the hash function, and
Independent analysis by another experienced developer confirms that there is no way to improve the algorithm (for example by calling hash less often).
In short, almost never.
You can directly use std::hashlink instead of implementing your own function.
#include <iostream>
#include <functional>
#include <string>
size_t hash(const std::string& key)
{
std::hash<std::string> hasher;
return hasher(key);
}
int main() {
std::cout << hash("abc") << std::endl;
return 0;
}
See this code here: https://ideone.com/4U89aU

boost::flat_map and its performance compared to map and unordered_map

It is common knowledge in programming that memory locality improves performance a lot due to cache hits. I recently found out about boost::flat_map which is a vector based implementation of a map. It doesn't seem to be nearly as popular as your typical map/unordered_map so I haven't been able to find any performance comparisons. How does it compare and what are the best use cases for it?
Thanks!
I have run a benchmark on different data structures very recently at my company so I feel I need to drop a word. It is very complicated to benchmark something correctly.
Benchmarking
On the web, we rarely find (if ever) a well-engineered benchmark. Until today I only found benchmarks that were done the journalist way (pretty quickly and sweeping dozens of variables under the carpet).
1) You need to consider cache warming
Most people running benchmarks are afraid of timer discrepancy, therefore they run their stuff thousands of times and take the whole time, they just are careful to take the same thousand of times for every operation, and then consider that comparable.
The truth is, in the real world it makes little sense, because your cache will not be warm, and your operation will likely be called just once. Therefore you need to benchmark using RDTSC, and time stuff calling them once only.
Intel has made a paper describing how to use RDTSC (using a cpuid instruction to flush the pipeline, and calling it at least 3 times at the beginning of the program to stabilize it).
2) RDTSC accuracy measure
I also recommend doing this:
u64 g_correctionFactor; // number of clocks to offset after each measurement to remove the overhead of the measurer itself.
u64 g_accuracy;
static u64 const errormeasure = ~((u64)0);
#ifdef _MSC_VER
#pragma intrinsic(__rdtsc)
inline u64 GetRDTSC()
{
int a[4];
__cpuid(a, 0x80000000); // flush OOO instruction pipeline
return __rdtsc();
}
inline void WarmupRDTSC()
{
int a[4];
__cpuid(a, 0x80000000); // warmup cpuid.
__cpuid(a, 0x80000000);
__cpuid(a, 0x80000000);
// measure the measurer overhead with the measurer (crazy he..)
u64 minDiff = LLONG_MAX;
u64 maxDiff = 0; // this is going to help calculate our PRECISION ERROR MARGIN
for (int i = 0; i < 80; ++i)
{
u64 tick1 = GetRDTSC();
u64 tick2 = GetRDTSC();
minDiff = std::min(minDiff, tick2 - tick1); // make many takes, take the smallest that ever come.
maxDiff = std::max(maxDiff, tick2 - tick1);
}
g_correctionFactor = minDiff;
printf("Correction factor %llu clocks\n", g_correctionFactor);
g_accuracy = maxDiff - minDiff;
printf("Measurement Accuracy (in clocks) : %llu\n", g_accuracy);
}
#endif
This is a discrepancy measurer, and it will take the minimum of all measured values, to avoid getting a -10**18 (64 bits first negatives values) from time to time.
Notice the use of intrinsics and not inline assembly. First inline assembly is rarely supported by compilers nowadays, but much worse of all, the compiler creates a full ordering barrier around inline assembly because it cannot static analyze the inside, so this is a problem to benchmark real-world stuff, especially when calling stuff just once. So an intrinsic is suited here because it doesn't break the compiler free-re-ordering of instructions.
3) parameters
The last problem is people usually test for too few variations of the scenario.
A container performance is affected by:
Allocator
size of the contained type
cost of implementation of the copy operation, assignment operation, move operation, construction operation, of the contained type.
number of elements in the container (size of the problem)
type has trivial 3.-operations
type is POD
Point 1 is important because containers do allocate from time to time, and it matters a lot if they allocate using the CRT "new" or some user-defined operation, like pool allocation or freelist or other...
(for people interested about pt 1, join the mystery thread on gamedev about system allocator performance impact)
Point 2 is because some containers (say A) will lose time copying stuff around, and the bigger the type the bigger the overhead. The problem is that when comparing to another container B, A may win over B for small types, and lose for larger types.
Point 3 is the same as point 2, except it multiplies the cost by some weighting factor.
Point 4 is a question of big O mixed with cache issues. Some bad-complexity containers can largely outperform low-complexity containers for a small number of types (like map vs. vector, because their cache locality is good, but map fragments the memory). And then at some crossing point, they will lose, because the contained overall size starts to "leak" to main memory and cause cache misses, that plus the fact that the asymptotic complexity can start to be felt.
Point 5 is about compilers being able to elide stuff that are empty or trivial at compile time. This can optimize greatly some operations because the containers are templated, therefore each type will have its own performance profile.
Point 6 same as point 5, PODs can benefit from the fact that copy construction is just a memcpy, and some containers can have a specific implementation for these cases, using partial template specializations, or SFINAE to select algorithms according to traits of T.
About the flat map
Apparently, the flat map is a sorted vector wrapper, like Loki AssocVector, but with some supplementary modernizations coming with C++11, exploiting move semantics to accelerate insert and delete of single elements.
This is still an ordered container. Most people usually don't need the ordering part, therefore the existence of unordered...
Have you considered that maybe you need a flat_unorderedmap? which would be something like google::sparse_map or something like that—an open address hash map.
The problem of open address hash maps is that at the time of rehash they have to copy everything around to the new extended flat land, whereas a standard unordered map just has to recreate the hash index, while the allocated data stays where it is. The disadvantage of course is that the memory is fragmented like hell.
The criterion of a rehash in an open address hash map is when the capacity exceeds the size of the bucket vector multiplied by the load factor.
A typical load factor is 0.8; therefore, you need to care about that, if you can pre-size your hash map before filling it, always pre-size to: intended_filling * (1/0.8) + epsilon this will give you a guarantee of never having to spuriously rehash and recopy everything during filling.
The advantage of closed address maps (std::unordered..) is that you don't have to care about those parameters.
But the boost::flat_map is an ordered vector; therefore, it will always have a log(N) asymptotic complexity, which is less good than the open address hash map (amortized constant time). You should consider that as well.
Benchmark results
This is a test involving different maps (with int key and __int64/somestruct as value) and std::vector.
tested types information:
typeid=__int64 . sizeof=8 . ispod=yes
typeid=struct MediumTypePod . sizeof=184 . ispod=yes
Insertion
EDIT:
My previous results included a bug: they actually tested ordered insertion, which exhibited a very fast behavior for the flat maps.
I left those results later down this page because they are interesting.
This is the correct test:
I have checked the implementation, there is no such thing as a deferred sort implemented in the flat maps here. Each insertion sorts on the fly, therefore this benchmark exhibits the asymptotic tendencies:
map: O(N * log(N))
hashmaps: O(N)
vector and flatmaps: O(N * N)
Warning: hereafter the 2 tests for std::map and both flat_maps are buggy and actually test ordered insertion (vs random insertion for other containers. yes it's confusing sorry):
We can see that ordered insertion, results in back pushing, and is extremely fast. However, from the non-charted results of my benchmark, I can also say that this is not near the absolute optimality for a back-insertion. At 10k elements, perfect back-insertion optimality is obtained on a pre-reserved vector. Which gives us 3Million cycles; we observe 4.8M here for the ordered insertion into the flat_map (therefore 160% of the optimal).
Analysis: remember this is 'random insert' for the vector, so the massive 1 billion cycles come from having to shift half (in average) the data upward (one element by one element) at each insertion.
Random search of 3 elements (clocks renormalized to 1)
in size = 100
in size = 10000
Iteration
over size 100 (only MediumPod type)
over size 10000 (only MediumPod type)
Final grain of salt
In the end, I wanted to come back on "Benchmarking §3 Pt1" (the system allocator). In a recent experiment, I am doing around the performance of an open address hash map I developed, I measured a performance gap of more than 3000% between Windows 7 and Windows 8 on some std::unordered_map use cases (discussed here).
This makes me want to warn the reader about the above results (they were made on Win7): your mileage may vary.
From the docs it seems this is analogous to Loki::AssocVector which I'm a fairly heavy user of. Since it's based on a vector it has the characteristics of a vector, that is to say:
Iterators gets invalidated whenever size grows beyond capacity.
When it grows beyond capacity it needs to reallocated and move objects over, ie insertion isn't guaranteed constant time except for the special case of inserting at end when capacity > size
Lookup is faster than std::map due to cache locality, a binary search which has the same performance characteristics as std::map otherwise
Uses less memory because it isn't a linked binary tree
It never shrinks unless you forcibly tell it to ( since that triggers reallocation )
The best use is when you know the number of elements in advance ( so you can reserve upfront ), or when insertion / removal is rare but lookup is frequent. Iterator invalidation makes it a bit cumbersome in some use cases so they're not interchangeable in terms of program correctness.

What is a "cache-friendly" code?

What is the difference between "cache unfriendly code" and the "cache friendly" code?
How can I make sure I write cache-efficient code?
Preliminaries
On modern computers, only the lowest level memory structures (the registers) can move data around in single clock cycles. However, registers are very expensive and most computer cores have less than a few dozen registers. At the other end of the memory spectrum (DRAM), the memory is very cheap (i.e. literally millions of times cheaper) but takes hundreds of cycles after a request to receive the data. To bridge this gap between super fast and expensive and super slow and cheap are the cache memories, named L1, L2, L3 in decreasing speed and cost. The idea is that most of the executing code will be hitting a small set of variables often, and the rest (a much larger set of variables) infrequently. If the processor can't find the data in L1 cache, then it looks in L2 cache. If not there, then L3 cache, and if not there, main memory. Each of these "misses" is expensive in time.
(The analogy is cache memory is to system memory, as system memory is to hard disk storage. Hard disk storage is super cheap but very slow).
Caching is one of the main methods to reduce the impact of latency. To paraphrase Herb Sutter (cfr. links below): increasing bandwidth is easy, but we can't buy our way out of latency.
Data is always retrieved through the memory hierarchy (smallest == fastest to slowest). A cache hit/miss usually refers to a hit/miss in the highest level of cache in the CPU -- by highest level I mean the largest == slowest. The cache hit rate is crucial for performance since every cache miss results in fetching data from RAM (or worse ...) which takes a lot of time (hundreds of cycles for RAM, tens of millions of cycles for HDD). In comparison, reading data from the (highest level) cache typically takes only a handful of cycles.
In modern computer architectures, the performance bottleneck is leaving the CPU die (e.g. accessing RAM or higher). This will only get worse over time. The increase in processor frequency is currently no longer relevant to increase performance. The problem is memory access. Hardware design efforts in CPUs therefore currently focus heavily on optimizing caches, prefetching, pipelines and concurrency. For instance, modern CPUs spend around 85% of die on caches and up to 99% for storing/moving data!
There is quite a lot to be said on the subject. Here are a few great references about caches, memory hierarchies and proper programming:
Agner Fog's page. In his excellent documents, you can find detailed examples covering languages ranging from assembly to C++.
If you are into videos, I strongly recommend to have a look at Herb Sutter's talk on machine architecture (youtube) (specifically check 12:00 and onwards!).
Slides about memory optimization by Christer Ericson (director of technology # Sony)
LWN.net's article "What every programmer should know about memory"
Main concepts for cache-friendly code
A very important aspect of cache-friendly code is all about the principle of locality, the goal of which is to place related data close in memory to allow efficient caching. In terms of the CPU cache, it's important to be aware of cache lines to understand how this works: How do cache lines work?
The following particular aspects are of high importance to optimize caching:
Temporal locality: when a given memory location was accessed, it is likely that the same location is accessed again in the near future. Ideally, this information will still be cached at that point.
Spatial locality: this refers to placing related data close to each other. Caching happens on many levels, not just in the CPU. For example, when you read from RAM, typically a larger chunk of memory is fetched than what was specifically asked for because very often the program will require that data soon. HDD caches follow the same line of thought. Specifically for CPU caches, the notion of cache lines is important.
Use appropriate c++ containers
A simple example of cache-friendly versus cache-unfriendly is c++'s std::vector versus std::list. Elements of a std::vector are stored in contiguous memory, and as such accessing them is much more cache-friendly than accessing elements in a std::list, which stores its content all over the place. This is due to spatial locality.
A very nice illustration of this is given by Bjarne Stroustrup in this youtube clip (thanks to #Mohammad Ali Baydoun for the link!).
Don't neglect the cache in data structure and algorithm design
Whenever possible, try to adapt your data structures and order of computations in a way that allows maximum use of the cache. A common technique in this regard is cache blocking (Archive.org version), which is of extreme importance in high-performance computing (cfr. for example ATLAS).
Know and exploit the implicit structure of data
Another simple example, which many people in the field sometimes forget is column-major (ex. fortran,matlab) vs. row-major ordering (ex. c,c++) for storing two dimensional arrays. For example, consider the following matrix:
1 2
3 4
In row-major ordering, this is stored in memory as 1 2 3 4; in column-major ordering, this would be stored as 1 3 2 4. It is easy to see that implementations which do not exploit this ordering will quickly run into (easily avoidable!) cache issues. Unfortunately, I see stuff like this very often in my domain (machine learning). #MatteoItalia showed this example in more detail in his answer.
When fetching a certain element of a matrix from memory, elements near it will be fetched as well and stored in a cache line. If the ordering is exploited, this will result in fewer memory accesses (because the next few values which are needed for subsequent computations are already in a cache line).
For simplicity, assume the cache comprises a single cache line which can contain 2 matrix elements and that when a given element is fetched from memory, the next one is too. Say we want to take the sum over all elements in the example 2x2 matrix above (lets call it M):
Exploiting the ordering (e.g. changing column index first in c++):
M[0][0] (memory) + M[0][1] (cached) + M[1][0] (memory) + M[1][1] (cached)
= 1 + 2 + 3 + 4
--> 2 cache hits, 2 memory accesses
Not exploiting the ordering (e.g. changing row index first in c++):
M[0][0] (memory) + M[1][0] (memory) + M[0][1] (memory) + M[1][1] (memory)
= 1 + 3 + 2 + 4
--> 0 cache hits, 4 memory accesses
In this simple example, exploiting the ordering approximately doubles execution speed (since memory access requires much more cycles than computing the sums). In practice, the performance difference can be much larger.
Avoid unpredictable branches
Modern architectures feature pipelines and compilers are becoming very good at reordering code to minimize delays due to memory access. When your critical code contains (unpredictable) branches, it is hard or impossible to prefetch data. This will indirectly lead to more cache misses.
This is explained very well here (thanks to #0x90 for the link): Why is processing a sorted array faster than processing an unsorted array?
Avoid virtual functions
In the context of c++, virtual methods represent a controversial issue with regard to cache misses (a general consensus exists that they should be avoided when possible in terms of performance). Virtual functions can induce cache misses during look up, but this only happens if the specific function is not called often (otherwise it would likely be cached), so this is regarded as a non-issue by some. For reference about this issue, check out: What is the performance cost of having a virtual method in a C++ class?
Common problems
A common problem in modern architectures with multiprocessor caches is called false sharing. This occurs when each individual processor is attempting to use data in another memory region and attempts to store it in the same cache line. This causes the cache line -- which contains data another processor can use -- to be overwritten again and again. Effectively, different threads make each other wait by inducing cache misses in this situation.
See also (thanks to #Matt for the link): How and when to align to cache line size?
An extreme symptom of poor caching in RAM memory (which is probably not what you mean in this context) is so-called thrashing. This occurs when the process continuously generates page faults (e.g. accesses memory which is not in the current page) which require disk access.
In addition to #Marc Claesen's answer, I think that an instructive classic example of cache-unfriendly code is code that scans a C bidimensional array (e.g. a bitmap image) column-wise instead of row-wise.
Elements that are adjacent in a row are also adjacent in memory, thus accessing them in sequence means accessing them in ascending memory order; this is cache-friendly, since the cache tends to prefetch contiguous blocks of memory.
Instead, accessing such elements column-wise is cache-unfriendly, since elements on the same column are distant in memory from each other (in particular, their distance is equal to the size of the row), so when you use this access pattern you are jumping around in memory, potentially wasting the effort of the cache of retrieving the elements nearby in memory.
And all that it takes to ruin the performance is to go from
// Cache-friendly version - processes pixels which are adjacent in memory
for(unsigned int y=0; y<height; ++y)
{
for(unsigned int x=0; x<width; ++x)
{
... image[y][x] ...
}
}
to
// Cache-unfriendly version - jumps around in memory for no good reason
for(unsigned int x=0; x<width; ++x)
{
for(unsigned int y=0; y<height; ++y)
{
... image[y][x] ...
}
}
This effect can be quite dramatic (several order of magnitudes in speed) in systems with small caches and/or working with big arrays (e.g. 10+ megapixels 24 bpp images on current machines); for this reason, if you have to do many vertical scans, often it's better to rotate the image of 90 degrees first and perform the various analysis later, limiting the cache-unfriendly code just to the rotation.
Optimizing cache usage largely comes down to two factors.
Locality of Reference
The first factor (to which others have already alluded) is locality of reference. Locality of reference really has two dimensions though: space and time.
Spatial
The spatial dimension also comes down to two things: first, we want to pack our information densely, so more information will fit in that limited memory. This means (for example) that you need a major improvement in computational complexity to justify data structures based on small nodes joined by pointers.
Second, we want information that will be processed together also located together. A typical cache works in "lines", which means when you access some information, other information at nearby addresses will be loaded into the cache with the part we touched. For example, when I touch one byte, the cache might load 128 or 256 bytes near that one. To take advantage of that, you generally want the data arranged to maximize the likelihood that you'll also use that other data that was loaded at the same time.
For just a really trivial example, this can mean that a linear search can be much more competitive with a binary search than you'd expect. Once you've loaded one item from a cache line, using the rest of the data in that cache line is almost free. A binary search becomes noticeably faster only when the data is large enough that the binary search reduces the number of cache lines you access.
Time
The time dimension means that when you do some operations on some data, you want (as much as possible) to do all the operations on that data at once.
Since you've tagged this as C++, I'll point to a classic example of a relatively cache-unfriendly design: std::valarray. valarray overloads most arithmetic operators, so I can (for example) say a = b + c + d; (where a, b, c and d are all valarrays) to do element-wise addition of those arrays.
The problem with this is that it walks through one pair of inputs, puts results in a temporary, walks through another pair of inputs, and so on. With a lot of data, the result from one computation may disappear from the cache before it's used in the next computation, so we end up reading (and writing) the data repeatedly before we get our final result. If each element of the final result will be something like (a[n] + b[n]) * (c[n] + d[n]);, we'd generally prefer to read each a[n], b[n], c[n] and d[n] once, do the computation, write the result, increment n and repeat 'til we're done.2
Line Sharing
The second major factor is avoiding line sharing. To understand this, we probably need to back up and look a little at how caches are organized. The simplest form of cache is direct mapped. This means one address in main memory can only be stored in one specific spot in the cache. If we're using two data items that map to the same spot in the cache, it works badly -- each time we use one data item, the other has to be flushed from the cache to make room for the other. The rest of the cache might be empty, but those items won't use other parts of the cache.
To prevent this, most caches are what are called "set associative". For example, in a 4-way set-associative cache, any item from main memory can be stored at any of 4 different places in the cache. So, when the cache is going to load an item, it looks for the least recently used3 item among those four, flushes it to main memory, and loads the new item in its place.
The problem is probably fairly obvious: for a direct-mapped cache, two operands that happen to map to the same cache location can lead to bad behavior. An N-way set-associative cache increases the number from 2 to N+1. Organizing a cache into more "ways" takes extra circuitry and generally runs slower, so (for example) an 8192-way set associative cache is rarely a good solution either.
Ultimately, this factor is more difficult to control in portable code though. Your control over where your data is placed is usually fairly limited. Worse, the exact mapping from address to cache varies between otherwise similar processors. In some cases, however, it can be worth doing things like allocating a large buffer, and then using only parts of what you allocated to ensure against data sharing the same cache lines (even though you'll probably need to detect the exact processor and act accordingly to do this).
False Sharing
There's another, related item called "false sharing". This arises in a multiprocessor or multicore system, where two (or more) processors/cores have data that's separate, but falls in the same cache line. This forces the two processors/cores to coordinate their access to the data, even though each has its own, separate data item. Especially if the two modify the data in alternation, this can lead to a massive slowdown as the data has to be constantly shuttled between the processors. This can't easily be cured by organizing the cache into more "ways" or anything like that either. The primary way to prevent it is to ensure that two threads rarely (preferably never) modify data that could possibly be in the same cache line (with the same caveats about difficulty of controlling the addresses at which data is allocated).
Those who know C++ well might wonder if this is open to optimization via something like expression templates. I'm pretty sure the answer is that yes, it could be done and if it was, it would probably be a pretty substantial win. I'm not aware of anybody having done so, however, and given how little valarray gets used, I'd be at least a little surprised to see anybody do so either.
In case anybody wonders how valarray (designed specifically for performance) could be this badly wrong, it comes down to one thing: it was really designed for machines like the older Crays, that used fast main memory and no cache. For them, this really was a nearly ideal design.
Yes, I'm simplifying: most caches don't really measure the least recently used item precisely, but they use some heuristic that's intended to be close to that without having to keep a full time-stamp for each access.
Welcome to the world of Data Oriented Design. The basic mantra is to Sort, Eliminate Branches, Batch, Eliminate virtual calls - all steps towards better locality.
Since you tagged the question with C++, here's the obligatory typical C++ Bullshit. Tony Albrecht's Pitfalls of Object Oriented Programming is also a great introduction into the subject.
Just piling on: the classic example of cache-unfriendly versus cache-friendly code is the "cache blocking" of matrix multiply.
Naive matrix multiply looks like:
for(i=0;i<N;i++) {
for(j=0;j<N;j++) {
dest[i][j] = 0;
for( k=0;k<N;k++) {
dest[i][j] += src1[i][k] * src2[k][j];
}
}
}
If N is large, e.g. if N * sizeof(elemType) is greater than the cache size, then every single access to src2[k][j] will be a cache miss.
There are many different ways of optimizing this for a cache. Here's a very simple example: instead of reading one item per cache line in the inner loop, use all of the items:
int itemsPerCacheLine = CacheLineSize / sizeof(elemType);
for(i=0;i<N;i++) {
for(j=0;j<N;j += itemsPerCacheLine ) {
for(jj=0;jj<itemsPerCacheLine; jj+) {
dest[i][j+jj] = 0;
}
for( k=0;k<N;k++) {
for(jj=0;jj<itemsPerCacheLine; jj+) {
dest[i][j+jj] += src1[i][k] * src2[k][j+jj];
}
}
}
}
If the cache line size is 64 bytes, and we are operating on 32 bit (4 byte) floats, then there are 16 items per cache line. And the number of cache misses via just this simple transformation is reduced approximately 16-fold.
Fancier transformations operate on 2D tiles, optimize for multiple caches (L1, L2, TLB), and so on.
Some results of googling "cache blocking":
http://stumptown.cc.gt.atl.ga.us/cse6230-hpcta-fa11/slides/11a-matmul-goto.pdf
http://software.intel.com/en-us/articles/cache-blocking-techniques
A nice video animation of an optimized cache blocking algorithm.
http://www.youtube.com/watch?v=IFWgwGMMrh0
Loop tiling is very closely related:
http://en.wikipedia.org/wiki/Loop_tiling
Processors today work with many levels of cascading memory areas. So the CPU will have a bunch of memory that is on the CPU chip itself. It has very fast access to this memory. There are different levels of cache each one slower access ( and larger ) than the next, until you get to system memory which is not on the CPU and is relatively much slower to access.
Logically, to the CPU's instruction set you just refer to memory addresses in a giant virtual address space. When you access a single memory address the CPU will go fetch it. in the old days it would fetch just that single address. But today the CPU will fetch a bunch of memory around the bit you asked for, and copy it into the cache. It assumes that if you asked for a particular address that is is highly likely that you are going to ask for an address nearby very soon. For example if you were copying a buffer you would read and write from consecutive addresses - one right after the other.
So today when you fetch an address it checks the first level of cache to see if it already read that address into cache, if it doesn't find it, then this is a cache miss and it has to go out to the next level of cache to find it, until it eventually has to go out into main memory.
Cache friendly code tries to keep accesses close together in memory so that you minimize cache misses.
So an example would be imagine you wanted to copy a giant 2 dimensional table. It is organized with reach row in consecutive in memory, and one row follow the next right after.
If you copied the elements one row at a time from left to right - that would be cache friendly. If you decided to copy the table one column at a time, you would copy the exact same amount of memory - but it would be cache unfriendly.
It needs to be clarified that not only data should be cache-friendly, it is just as important for the code. This is in addition to branch predicition, instruction reordering, avoiding actual divisions and other techniques.
Typically the denser the code, the fewer cache lines will be required to store it. This results in more cache lines being available for data.
The code should not call functions all over the place as they typically will require one or more cache lines of their own, resulting in fewer cache lines for data.
A function should begin at a cache line-alignment-friendly address. Though there are (gcc) compiler switches for this be aware that if the the functions are very short it might be wasteful for each one to occupy an entire cache line. For example, if three of the most often used functions fit inside one 64 byte cache line, this is less wasteful than if each one has its own line and results in two cache lines less available for other usage. A typical alignment value could be 32 or 16.
So spend some extra time to make the code dense. Test different constructs, compile and review the generated code size and profile.
As #Marc Claesen mentioned that one of the ways to write cache friendly code is to exploit the structure in which our data is stored. In addition to that another way to write cache friendly code is: change the way our data is stored; then write new code to access the data stored in this new structure.
This makes sense in the case of how database systems linearize the tuples of a table and store them. There are two basic ways to store the tuples of a table i.e. row store and column store. In row store as the name suggests the tuples are stored row wise. Lets suppose a table named Product being stored has 3 attributes i.e. int32_t key, char name[56] and int32_t price, so the total size of a tuple is 64 bytes.
We can simulate a very basic row store query execution in main memory by creating an array of Product structs with size N, where N is the number of rows in table. Such memory layout is also called array of structs. So the struct for Product can be like:
struct Product
{
int32_t key;
char name[56];
int32_t price'
}
/* create an array of structs */
Product* table = new Product[N];
/* now load this array of structs, from a file etc. */
Similarly we can simulate a very basic column store query execution in main memory by creating an 3 arrays of size N, one array for each attribute of the Product table. Such memory layout is also called struct of arrays. So the 3 arrays for each attribute of Product can be like:
/* create separate arrays for each attribute */
int32_t* key = new int32_t[N];
char* name = new char[56*N];
int32_t* price = new int32_t[N];
/* now load these arrays, from a file etc. */
Now after loading both the array of structs (Row Layout) and the 3 separate arrays (Column Layout), we have row store and column store on our table Product present in our memory.
Now we move on to the cache friendly code part. Suppose that the workload on our table is such that we have an aggregation query on the price attribute. Such as
SELECT SUM(price)
FROM PRODUCT
For the row store we can convert the above SQL query into
int sum = 0;
for (int i=0; i<N; i++)
sum = sum + table[i].price;
For the column store we can convert the above SQL query into
int sum = 0;
for (int i=0; i<N; i++)
sum = sum + price[i];
The code for the column store would be faster than the code for the row layout in this query as it requires only a subset of attributes and in column layout we are doing just that i.e. only accessing the price column.
Suppose that the cache line size is 64 bytes.
In the case of row layout when a cache line is read, the price value of only 1(cacheline_size/product_struct_size = 64/64 = 1) tuple is read, because our struct size of 64 bytes and it fills our whole cache line, so for every tuple a cache miss occurs in case of a row layout.
In the case of column layout when a cache line is read, the price value of 16(cacheline_size/price_int_size = 64/4 = 16) tuples is read, because 16 contiguous price values stored in memory are brought into the cache, so for every sixteenth tuple a cache miss ocurs in case of column layout.
So the column layout will be faster in the case of given query, and is faster in such aggregation queries on a subset of columns of the table. You can try out such experiment for yourself using the data from TPC-H benchmark, and compare the run times for both the layouts. The wikipedia article on column oriented database systems is also good.
So in database systems, if the query workload is known beforehand, we can store our data in layouts which will suit the queries in workload and access data from these layouts. In the case of above example we created a column layout and changed our code to compute sum so that it became cache friendly.
Be aware that caches do not just cache continuous memory. They have multiple lines (at least 4) so discontinous and overlapping memory can often be stored just as efficiently.
What is missing from all the above examples is measured benchmarks. There are many myths about performance. Unless you measure it you do not know. Do not complicate your code unless you have a measured improvement.
Cache-friendly code is code that has been optimized to make efficient use of the CPU cache. This typically involves organizing data in a way that takes advantage of spatial and temporal locality, which refers to the idea that data that is accessed together is likely to be stored together in memory, and that data that is accessed frequently is likely to be accessed again in the near future.
There are several ways to make code cache-friendly, including:
Using contiguous memory layouts: By storing data in contiguous
blocks in memory, you can take advantage of spatial locality and
reduce the number of cache misses.
Using arrays: Arrays are a good choice for data structures when you
need to access data sequentially, as they allow you to take
advantage of temporal locality and keep hot data in the cache.
Using pointers carefully: Pointers can be used to access data that
is not stored contiguously in memory, but they can also lead to
cache misses if they are used excessively. If you need to use
pointers, try to use them in a way that takes advantage of spatial
and temporal locality to minimize cache misses.
Using compiler optimization flags: Most compilers have optimization
flags that can be used to optimize the use of the CPU cache. These
flags can help to minimize the number of cache misses and improve
the overall performance of your code.
It is important to note that the specific techniques that work best for optimizing the use of the CPU cache will depend on the specific requirements and constraints of your system. It may be necessary to experiment with different approaches to find the best solution for your needs.