Cache locality performance - C++

If I had a C or C++ program where I was using say 20 integers throughout the program, would it improve performance to create an array of size 20 to store the integers and then create aliases for each number?
Would this improve the cache locality (rather than just creating 20 normal ints), because the ints would be loaded into the cache together as part of the int array (or at least improve the chances of this)?

The question is how you allocate space for them. I doubt that you just randomly do new int 20 times here and there in the code. If they are local variables then they will end up on the stack and get cached.
The main question is whether it is worth bothering at all. Try to write your program in a readable and elegant way first, then remove the major bottlenecks, and only after that start messing with micro-optimizations. If you are processing 20 ints, shouldn't they essentially be an array anyway?
Also, is this a theoretical question? If it is, then yes, an array will likely be cached better than 20 random areas in memory. If it is a practical question, then I doubt this really matters unless you are writing performance-critical code, and even then micro-optimizations are the last thing to deal with.

It might improve performance a bit, yes. It might also completely ruin your performance. Or it might have no impact whatsoever because the compiler already did something similar for you. Or it might have no impact because you're just not using those integers often enough for this to make a difference.
It also depends on whether one or multiple threads access these integers, and whether they just read, or also modify the numbers. (if you have multiple threads and you write to those integers, then putting them in an array will cause false sharing which will hurt your performance far more than anything you'd hoped to gain)
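To make the false-sharing point concrete, here is a minimal sketch (assuming 64-byte cache lines; Padded and bump are invented names): with the packed array the two threads keep invalidating each other's cache line, while the padded version gives each counter a line of its own.

#include <atomic>
#include <functional>
#include <thread>

// Two counters packed next to each other will almost certainly share one cache line.
std::atomic<int> packed[2];

// Padding each counter out to a full cache line (64 bytes assumed) avoids false sharing.
struct alignas(64) Padded { std::atomic<int> value; };
Padded padded[2];

void bump(std::atomic<int>& counter)
{
    for (int i = 0; i < 50'000'000; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);
}

int main()
{
    // Time this run, then pass padded[0].value and padded[1].value instead and compare.
    std::thread a(bump, std::ref(packed[0]));
    std::thread b(bump, std::ref(packed[1]));
    a.join();
    b.join();
}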
So why don't you just try it?
There is no simple, single answer. The only serious answer you're going to get is "it depends". If you want to know how it would behave in your case, then you have two options:
try it and see what happens, or
gain a thorough understanding of how your CPU works, gather data on exactly how often these values are accessed and in which patterns, so you can make an educated guess at how the change would affect your performance.
If you choose #2, you'll likely need to follow it up with #1 anyway, to verify that your guess was correct.
Performance isn't simple. There are few universal rules, and everything depends on context. A change which is an optimization in one case might slow everything down in another.
If you're serious about optimizing your code, then there's no substitute for the two steps above. And if you're not serious about it, don't do it. :)
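If you do take the "try it" route, a timing harness along these lines is usually enough; this is only a sketch, and doWork is an invented stand-in for whichever variant (20 separate ints vs. one array) you want to compare.

#include <chrono>
#include <cstdio>

// Stand-in workload; replace the body with the code you actually want to time.
long long doWork()
{
    static int data[1 << 20];   // about 4 MB of ints
    long long sum = 0;
    for (int pass = 0; pass < 100; ++pass)
        for (int x : data)
            sum += x;
    return sum;
}

int main()
{
    auto start = std::chrono::steady_clock::now();
    long long result = doWork();
    auto stop = std::chrono::steady_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    // Printing the result keeps the compiler from discarding the work entirely.
    std::printf("result=%lld elapsed=%lld us\n", result, static_cast<long long>(us));
}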

Yes, the theoretical chance of the 20 integers ending up on the same cache line would be higher, although I think a good compiler would almost always be able to replicate the same performance for you even when not using an array.

So, you currently have int positionX, positionY, positionZ; then somewhere else int fuzzy; and int foo;, and so on, to make about 20 integers?
And you want to do something like this:
int arr[20];
#define positionX arr[0]
#define positionY arr[1]
#define positionZ arr[2]
#define fuzzy arr[3]
#define foo arr[4]
I would expect that if there is ANY performance difference, it may make it slower, because the compiler will notice that you are using arr in some other place, and thus can't use registers to store the value of foo, since it sees that you call update_position which touches arr[0]..arr[2]. It depends on how fine-grained the compiler's detection of "we're touching the same data" is. And I suspect it may quite often be based on the "object" rather than on individual fields of an object - particularly for arrays.
However, if you do have data that is used close together, e.g. position variables, it would probably help to have them next to each other.
But I seriously think that you are wasting your time trying to put variables next to each other, and using an array is almost certainly a BAD idea.
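That said, if the real goal is just to keep genuinely related values adjacent, a plain struct expresses it more cleanly than the macro trick above. A minimal sketch, reusing the variable names from the question (GameState is invented):

// Values that are used together, laid out together.
struct Position {
    int x, y, z;
};

struct GameState {
    Position position;   // positionX/Y/Z from the question, adjacent by construction
    int fuzzy;
    int foo;
};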

It would likely decrease performance. Modern compilers will move variables around in memory when you're not looking, and may store two variables at the same address when they're not used concurrently. With your array idea, those variables cannot overlap, and must use distinct cache lines.

Yes, this may improve your performance, but it may not, as it is really variables that get used together that should be stored together.
So if they are used together, then yes. Variables and objects should really be declared in the function in which they are used, as they will be stored on the stack, which is usually hot in the level-1 cache.
So yes, if you are going to use them together, i.e. they are relevant to each other, then this would probably be a little more efficient, providing you also take into consideration how you allocate their memory.

Related

How to measure sequential memory read speed in C/C++

The measurement does not need to take the CPU cache into consideration; that is, let the cache do its job (let the CPU cache improve the performance).
My idea is to allocate a big enough chunk of memory (so that not all of it fits into cache), treat it as one data type (like int), and do additions over it so the compiler cannot completely optimize away the reads. The question is: does the data type affect the measurement? Or is there a more general way of doing it?
EDIT: This might have been a bit misleading before. An example is AIDA64's memory and cache benchmark, which is able to measure memory read/write speed as well as latency. I want a general idea of how that is done.
Microbenchmarks like this are not easy in C/C++. The amount of time something takes in C++ is not a specified aspect of the language. Indeed, for every use case except this one, faster is better, so compilers are encouraged to do smart things.
The trick to these is to write the benchmark, compile it, and then look at the assembly to see whether it's doing clever tricks. Or, at the very least, check to make sure that it makes sense (accessing more memory = more time).
Compilers are getting smart. Addition is not always enough. More than once I've had Visual Studio realize what I was doing to construct the microbenchmark and compile it all away.
At the moment, I am having good luck using the argc argument passed into main as a seed, and using a cryptographic hash like SHA-1 or MD5 to fill the data. This tends to be enough to trick the compiler into actually performing all of the reads. But verify your results. There's no guarantee that a new compiler doesn't get even smarter.
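A minimal sketch of that kind of benchmark, with two stated assumptions: a 256 MB buffer is taken to be far larger than any cache level, and a simple xorshift generator seeded from argc stands in for the cryptographic hash. As described above, verify the generated assembly before trusting the numbers.

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main(int argc, char**)
{
    // 32 Mi 64-bit words = 256 MB, assumed to be far larger than any cache level.
    std::vector<std::uint64_t> data(32 * 1024 * 1024);

    // Fill with values derived from argc so the compiler cannot precompute the sum.
    std::uint64_t x = static_cast<std::uint64_t>(argc) * 0x9E3779B97F4A7C15ull + 1;
    for (auto& v : data) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;   // xorshift step
        v = x;
    }

    auto start = std::chrono::steady_clock::now();
    std::uint64_t sum = 0;
    for (std::uint64_t v : data)                   // sequential read of the whole buffer
        sum += v;
    auto stop = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(stop - start).count();
    double bytes = static_cast<double>(data.size() * sizeof(std::uint64_t));
    // Printing sum keeps the read loop from being optimized away.
    std::printf("sum=%llu  %.2f GB/s\n", static_cast<unsigned long long>(sum),
                bytes / seconds / 1e9);
}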

2 Arrays vs Array of Structures with 2 data members

I was wondering which one is better, 2 Arrays or the Array of a structure with 2 data members.
I want insights regarding:
Is a struct a kind of wrapper which takes extra memory? I am aware of padding in structures to align members to word boundaries.
Which one is faster if I want to access both the data members together? I think it's the array of structures.
And may an array of structures occupy more memory than 2 arrays due to the possibility of structure padding?
Answers both in general context and language specific are welcome.
And please don't suggest having a look at SoA vs AoS questions, already done that.
It entirely depends on what you're trying to do, neither answer is always going to be "correct".
Outside of compiler-specific padding, structs do not take up any extra memory unless you give them virtual functions, in which case each object gets a vtable pointer, but that's it.
As long as you're targeting a machine with a cache large enough to fit two pages (usually 4KB each, but check for your specific CPU), it doesn't matter and you should choose whichever is easier to work with and makes more sense in your code. The array of structs will use one page and cause a cache miss once for every 4KB of structs you load, while the two arrays of values will load two pages that cause two cache misses half as often. If you do happen to be working with a dinky cache that can only hold one page of your program data, then yes, it'll be much faster to use the array of structs, since the alternative would cause cache misses on every read.
Same answer as #1 - arrays will never have their own padding, but a struct might have padding built into it by your compiler.
Struct padding though depends entirely on your compiler, which probably has flags to turn it on or off or set the maximum pad size or whatever. Inspect the raw data of an array of your objects to see if they have padding, and if so, find out how to turn that off in your compiler if you need that memory.
What compiler are you using, and what are you trying to do with your project?
And perhaps more importantly: What stage is your project in, and are you running into speed issues already? Premature optimization is the root of all evil, and you are likely wasting your time worrying about this issue.
Structures are padded so that the CPU can access each member at its natural alignment, so they may take more memory. If the fields are already aligned, no padding is needed. So a struct is not a wrapper in the sense of always wrapping your data with extra bytes; think of padding as an alignment adjustment made by the compiler.
If the members are used together, the array of structures will be faster, as each entire structure is likely to sit in the cache together. If you have two separate arrays, one of them may fall out of the cache.
If padded, yes.
Don't forget one important reason to keep the data together: code readability. That said, if you are planning to process each field independently in a different thread, you may get performance improvements from the two separate arrays.
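For reference, a minimal sketch of the two layouts being discussed (Item, ids, and weights are invented names):

// Array of structures: the two members of an element sit next to each other,
// so accessing both together touches the same cache line.
struct Item {
    int   id;
    float weight;
};
Item items[1000];

// Two arrays (structure of arrays): each member is contiguous across elements,
// which suits loops (or threads) that touch only one of the members.
int   ids[1000];
float weights[1000];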

Higher dimensional array vs 1-D array efficiency in C++

I'm curious about the efficiency of using a higher dimensional array vs a one dimensional array. Do you lose anything when defining and iterating through an array like this:
array[i][j][k];
or defining and iterating through an array like this:
array[k + j*jmax + i*imax];
My inclination is that there wouldn't be a difference, but I'm still learning about high efficiency programming (I've never had to care about this kind of thing before).
Thanks!
The only way to know for sure is to benchmark both ways (with optimization flags on in the compiler, of course). The one thing you lose for sure with the second method is readability.
The former and the latter way of accessing the array are identical once compiled. Keep in mind that accessing memory locations that are close to one another does make a difference in performance, as they will be cached differently. Thus, if you're storing a high-dimensional matrix, make sure you store rows one after the other if that is how you're going to access them.
In general, CPU caches exploit temporal and spatial locality. That is, if you access memory address X, the odds of you accessing X+1 soon are higher. It's much more efficient to operate on values within the same cache line.
Check out this article on CPU caches for more information on how different storage policies affect performance: http://en.wikipedia.org/wiki/CPU_cache
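As a concrete sketch of that advice (the extents are made-up values, and the index formula is the usual row-major one, written a little differently from the question's): keep the fastest-varying index in the innermost loop so consecutive iterations touch adjacent memory, exactly as they would with a built-in array[i][j][k].

#include <cstddef>

constexpr std::size_t imax = 64, jmax = 64, kmax = 64;
float flat[imax * jmax * kmax];

void fill()
{
    for (std::size_t i = 0; i < imax; ++i)
        for (std::size_t j = 0; j < jmax; ++j)
            for (std::size_t k = 0; k < kmax; ++k)             // innermost index: stride of 1
                flat[(i * jmax + j) * kmax + k] += 1.0f;
}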
If you can rewrite the indexing, so can the compiler. I wouldn't worry about that.
Trust your compiler(tm)!
It probably depends on the implementation, but I'd say it more or less compiles down to the same code as the one-dimensional array version.
Do yourself a favor and care about such things after profiling the code. It is very unlikely that something like this will affect the performance of the application as a whole. Using the correct algorithms is much more important.
And even if it does matter, it is most certainly only a single inner loop that needs attention.

Is it better/faster to have class variables or local function variables?

Ok I know the title doesn't fully explain this question. So I'm writing a program that performs a large number of calculations and I'm trying to optimize it so that it won't run quite so slow. I have a function that is a member of a class that gets called around 5 million times. This is the function:
void PointCamera::GetRay(float x, float y, Ray& out)
{
    // Find difference between location on view plane and origin, and normalize
    float vpPointx = pixelSizex * (x - 0.5f * (float)width);
    float vpPointy = pixelSizey * (((float)height - y) - 0.5f * height);
    // Transform ray to camera's direction
    out.d = u * vpPointx + v * vpPointy - w * lens_distance;
    out.d.Normalize();
    // Set origin to camera location
    out.o = loc;
}
I'm wondering if it is better/faster to declare the variables vpPointx and vpPointy in the class than to declare them each time I call the function. Would this be a good optimization or would it have little effect?
And in general, if there is anything here that could be optimized please let me know.
By limiting the scope of your variables, you are giving more opportunity to the compiler optimiser to rearrange your code and make it run faster. For example, it might keep the values of those variables entirely within CPU registers, which may be an order of magnitude faster than memory access. Also, if those variables were class instance variables, then the compiler would have to generate code to dereference this every time you accessed them, which would very likely be slower than local variable access.
As always, you should measure the performance yourself and try the code both ways (or better, as many ways as you can think of). All optimisation advice is subject to whatever your compiler actually does, which requires experimentation.
Always prefer locals
Anything that is a temporary value should be a local. It's possible that such a value can exist entirely within a register without kicking something else out of cache or requiring a pointless memory store that will use a resource in far shorter supply than CPU cycles.
A dual 3 GHz CPU can execute 6 billion CPU cycles per second. In order to approach that 6 billion figure, typically most ops should involve no memory or cache operations, and the results of most cycles must not be needed by the next instruction unless the CPU can find a later instruction that is immediately dispatchable. This all gets quite complicated, but 6 billion somethings, including some wait states, will certainly happen each second.
However, that same CPU system is capable of only 10-40 million memory operations per second. The disparity is partly compensated for by the cache systems, although they are still slower than the CPU is, they are limited in size, and they do not cope with writes as well as they do with reads.
The good news is that good software abstractions and software speed optimization both agree in this case. Do not store transient state in an object unless you have a reason to reference it later.
How about precomputing some of those multiplications that never change? For example, w * lens_distance and 0.5f * height. Compute them once whenever the underlying variables change, then just use the stored values in this function.
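A sketch of that precomputation, with the caveat that Vec3, Ray, UpdateCache, and the cached member names are invented stand-ins for the real types in the question; it also keeps a float copy of height so the per-call int-to-float casts disappear.

#include <cmath>

struct Vec3 {
    float x = 0, y = 0, z = 0;
    Vec3 operator*(float s) const { return {x * s, y * s, z * s}; }
    Vec3 operator+(const Vec3& o) const { return {x + o.x, y + o.y, z + o.z}; }
    Vec3 operator-(const Vec3& o) const { return {x - o.x, y - o.y, z - o.z}; }
    void Normalize() { float len = std::sqrt(x * x + y * y + z * z); x /= len; y /= len; z /= len; }
};
struct Ray { Vec3 o, d; };

class PointCamera {
public:
    void UpdateCache()   // call whenever width, height, w, or lens_distance change
    {
        halfWidth  = 0.5f * static_cast<float>(width);
        halfHeight = 0.5f * static_cast<float>(height);
        heightF    = static_cast<float>(height);
        wLens      = w * lens_distance;
    }

    void GetRay(float x, float y, Ray& out) const
    {
        // Only the per-ray work remains; everything invariant was folded into UpdateCache().
        float vpPointx = pixelSizex * (x - halfWidth);
        float vpPointy = pixelSizey * ((heightF - y) - halfHeight);
        out.d = u * vpPointx + v * vpPointy - wLens;
        out.d.Normalize();
        out.o = loc;
    }

    // Members as in the question, plus the cached values.
    int width = 0, height = 0;
    float pixelSizex = 0, pixelSizey = 0, lens_distance = 0;
    Vec3 u, v, w, loc;
    float halfWidth = 0, halfHeight = 0, heightF = 0;
    Vec3 wLens;
};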
There is a performance penalty for declaring them in the class. They are accessed, in effect, by using this->field. There will be, at minimum, one memory write to store the result. The function local variables could live in registers for their entire lifetime.
I'm not sure, although my guess is it's better inside the function (since it's just a push on the stack to "declare" the variable, whereas making it part of the class means accessing it from memory using indirection every time you access it). Of course, in reality the compiler probably optimizes all of this into registers anyway.
Which brings me to my point:
You're going about this the wrong way
I don't think that anyone can really tell you what will be faster. It shouldn't matter even if someone does. The only real way to optimize is by measuring.
This usually means one of two things:
One option is to try each way, measure the time it takes, and compare. Note that this isn't always trivial to do (since each run will sometimes depend on external factors, difficult memory issues, etc). But running the code a few million times will probably iron that out for you.
Ideally, you should be using a profiler. That's a piece of software designed to measure the code for you, and tell you what parts take the longest amount of time. As most people who have dealt with optimization will tell you, you'll usually be surprised at what takes up a lot of time.
That's why you should always go with the "scientific" method of measuring, instead of relying on anyone's guesswork.
Appears to be a raytracer. The little things do add up, but also consider the big hits: you'll get a huge speedup with decent spatial partitioning. Get yourself an octree or kd-tree for a few orders of magnitude speedup on complex scenes.
As for your direct question: profile it.
The others have already covered the benefits of using locals over class variables, so I won't go into that.. but since you asked for optimization tips in general:
Your int-to-float casts jump out at me. There's a cost to them, especially if you are using the x87 FPU. Using SSE registers would make it better, but it looks thoroughly unnecessary for your function: you could simply store float copies of width and height in your class.
You mentioned in a comment that you were still working on your kd-tree. It's probably a better idea to finish that first before doing the low-level optimization; what looks important now may account for only a fraction of the time later.
Use an instruction-level profiler, like VTune. gprof doesn't give you anywhere near enough information.
Have you heard of ompf.org? It's a wonderful raytracing forum, and you can learn a lot about the relevant optimizations there.
See my answer to this post for more tips.
Read Agner Fog.
As an aside: I've heard that the Bounding Interval Hierarchy is much easier to implement. I've not implemented a kd-tree, but I have implemented a BIH, and I'd say it's reasonably straightforward.

C++, ways to benchmark improvements in cache locality?

I have an implementation of a class X, that has two pointers to two pieces of information. I have written a new implementation, class Y, that has only one pointer to a struct that contains the two pieces of information together as adjacent members. X's and Y's methods usually only need to manipulate one of the pieces of information, but provide a get() method that returns a pointer to the second piece (in this case class X just returns its pointer to that piece and class Y returns the address of the struct's second member). In normal usage, calls to X's and Y's methods will happen interspersed by calls to get() and doing work on that returned second piece.
I expect that in real life situations there should be a performance improvement, now that the two pieces of information are next to one another in memory in the class Y implementation (because they are adjacent members of a struct), but I'm not seeing any difference in the benchmarks I've written (interspersing calls to X's and Y's methods with doing work on their second pieces in big loops). I suspect this is because everything fits in cache in either case in my tests. I don't want to try this in my real app yet because the semantics of X and Y differ in other subtle ways not related to this optimization and porting the using application will be some work, and these benchmarks are supposed to help justify doing that work in the first place.
What's the best way to observe the difference in performance due to better cache locality? If I do a bunch of dummy work on an array equal to the size of the cache in between calls is that sufficient? Or do I want to do work on an array slightly less than the cache size, so that work on my instances of my class will cause things to fall in and out of cache? I'm not sure how to code something that is robust against compiler optimizations and different cache sizes.
If you are on Linux, then using Cachegrind in conjunction with KCacheGrind might provide more insight into how your cache is behaving.
You could design a benchmark specifically to bust the cache. For instance, allocate the pointed-to data blocks such that they're all guaranteed to be on different cache lines (say, by using a custom memory allocator that pads allocations out to at least a few hundred bytes). Then repeatedly iterate over a number of objects too big to fit everything into even the L2 cache (very platform-dependent, since it depends on the number of lines in the cache, but 1 million objects would cover most architectures and only require a few hundred megabytes of RAM in total).
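A hedged sketch of that kind of cache-busting setup; PaddedBlock is an invented name, the 256-byte pad and one-million count are assumptions, and a real run would substitute class X (or a padding allocator) for the plain struct used here.

#include <cstddef>
#include <vector>

// Each pointed-to block is padded out so no two blocks can share a cache line
// (256 bytes comfortably exceeds common 64- or 128-byte lines).
struct PaddedBlock {
    int payload = 0;
    char pad[256 - sizeof(int)];
};

int main()
{
    const std::size_t count = 1'000'000;     // far too many blocks to fit in L2/L3
    std::vector<PaddedBlock> blocks(count);  // roughly 256 MB in total

    std::vector<PaddedBlock*> objects;       // stand-ins for the pointers held by class X
    objects.reserve(count);
    for (auto& b : blocks)
        objects.push_back(&b);

    long long sum = 0;
    for (PaddedBlock* p : objects)           // each dereference is likely a cache miss
        sum += p->payload;
    return static_cast<int>(sum & 1);        // use the result so the loop isn't optimized away
}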
This will give you an upper limit on the performance gain made by the change from X to Y. But it does it by degrading the performance of X down to below any likely real-world usage. And to prove your case you need a lower-limit estimate, not an upper-limit estimate. So I'm not sure you'd achieve much, unless you discover that even this worst case still makes no significant difference and you needn't bother with the optimization.
Even if you don't aim for the theoretical worst-case performance of X, any benchmark designed to exceed the cache is just picking an arbitrary point of bad performance of X and checking whether Y is better. It's not far off rigging the benchmark to make Y look good. It really doesn't matter how your code performs in dodgy benchmarks, except maybe for the purposes of marketing literature.
The best way to observe the real-world difference in performance, is to measure a real-world client of your class. You say that "the semantics of X and Y differ in other subtle ways not related to this optimization", in which case I can only recommend that you write a class Z which differs from X only in respect of this optimization, and use that in your application as the comparison.
Once your tests attempt to represent the worst realistic use, then if you aren't seeing any difference in performance there's probably no performance gain to be had.
All that said, if it makes logical sense (that is, it doesn't make the code any more astonishing), then I would advocate minimising the number of heap allocations in C++ simply as a rule of thumb. It doesn't tend to make speed or total memory usage worse, and it does tend to simplify your resource handling. A rule of thumb doesn't justify a re-write of working code, of course.
If I'm understanding your situation correctly (and please correct me if not), then it's six of one, or half a dozen of the other.
In class X, you need one pointer lookup for either piece of information. In class Y, you need one lookup for the first, and two (get the first and then offset) for the second. That's sacrificing "locality" for another memory access. Compilers are still, unfortunately, very good at wasting bus time looking up words in RAM.
If it's possible, you'll get the best results by holding the two pieces of target information directly within the class in question (i.e. each as its own class member), rather than using those pointers for unnecessary indirection. Not seeing any code, that's pretty much all I can say.
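A minimal sketch of the difference being described (the names are invented; a and b stand for the two pieces of information):

// Indirect: every access to a or b goes through a pointer first.
struct Indirect {
    int* a;
    int* b;
};

// Direct: both pieces live inside the object itself, with no extra lookup,
// and they are adjacent in memory by construction.
struct Direct {
    int a;
    int b;
};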
At any rate, you'll get a lot more performance out of studying the algorithmic complexity of your application than you ever will by micro-optimizing two variables in a class definition. It's also a great idea to use a profiling tool to see (objectively) where your bottlenecks are (gprof is common on *nix systems). Is there a particular reason you're looking to improve cache locality specifically?