Locality and shared access to objects - c++

Profiling my code, I see a lot of cache misses and would like to know whether there is a way to improve the situation. Optimization is not really needed; I'm more curious about whether there exist general approaches to this problem (this is a follow-up question).
// class to compute stuff
class A {
    double compute();
    ...
    // depends on other objects
    std::vector<A*> dependencies;
};
I have a container class that stores pointers to all created objects of class A. I do not store copies, as I want shared access. Previously I used shared_ptr, but since single As are meaningless without the container, raw pointers are fine.
class Container {
    ...
    void compute_all();
    std::vector<A*> objects;
    ...
};
The vector objects is insertion-sorted so that the full evaluation can be done by simply iterating and calling A::compute(); by the time an object is reached, all of its dependencies are already resolved.
With objects a_i of class A, the evaluation might look like this:
a_1 => a_2 => a_3 --> a_2 --> a_1 => a_4 => ....
where => denotes iteration in Container and --> iteration over A::dependencies
Moreover, the Container class is created only once and compute_all() is called many times, so rearranging the whole structure after creation is an option and wouldn't harm efficiency much.
Now to the observations/questions:
Obviously, iterating over Container::objects is cache efficient, but accessing the pointees is definitely not.
Moreover, each object of type A has to iterate over A::dependencies, which again can produce cache misses.
Would it help to create a separate vector<A*> of all needed objects in evaluation order, such that the dependencies of each A are inserted as copies?
Something like this:
a_1 => a_2 => a_3 => a_2_c => a_1_c => a_4 -> ....
where a_i_c are copies of a_i.
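To make the idea concrete, the rebuild I have in mind would run once after construction, roughly like this (flat_order and build_flat_order are just illustrative names, not existing members):
// hypothetical sketch: build a flat, by-value evaluation order once,
// so that compute_all() only touches contiguous memory afterwards
void Container::build_flat_order() {
    flat_order.clear();                  // std::vector<A> flat_order; would be a new member
    for (A* a : objects) {
        flat_order.push_back(*a);        // the object itself
        for (A* dep : a->dependencies)
            flat_order.push_back(*dep);  // its dependencies inserted as copies
    }
}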
Thanks for your help and sorry if this question is confusing, but I find it rather difficult to extrapolate from simple examples to large applications.

Unfortunately, I'm not sure if I'm understanding your question correctly, but I'll try to answer.
Cache misses are caused by the processor requiring data that is scattered all over memory.
One very common way of increasing cache hits is just organizing your data so that everything that is accessed sequentially is in the same region of memory. Judging by your explanation, I think this is most likely your problem; your A objects are scattered all over the place.
If you're just calling regular new every single time you need to allocate an A, you'll probably end up with all of your A objects being scattered.
You can create a custom allocator for objects that will be created many times and accessed sequentially. This custom allocator could allocate a large number of objects and hand them out as requested. This may be similar to what you meant by reordering your data.
It can take a bit of work to implement this, however, because you have to consider cases such as what happens when it runs out of objects, how it knows which objects have been handed out, and so on.
// This example is very simple. Instead of using new to create an Object,
// the code can just call Allocate() and use the pointer returned.
// This ensures that all Object instances reside in the same region of memory.
struct CustomAllocator {
    CustomAllocator() : nextObject(cache) { }

    Object* Allocate() {
        assert(nextObject < cache + 1024);  // a real implementation must handle running out of objects
        return nextObject++;
    }

    Object* nextObject;
    Object cache[1024];
};
Another method involves caching operations that work on sequential data, but aren't performed sequentially. I think this is what you meant by having a separate vector.
However, it's important to understand that your CPU doesn't just keep one section of memory in cache at a time. It keeps multiple sections of memory cached.
If you're jumping back and forth between operations on data in one section and operations on data in another section, this most likely will not cause many cache misses; your CPU can and should keep both sections cached at the same time.
If you're jumping between operations on 50 different sets of data, you'll probably encounter many cache misses. In this scenario, caching operations would be beneficial.
In your case, I don't think caching operations will give you much benefit. Ensuring that all of your A objects reside in the same section of memory, however, probably will.
Another thing to consider is threading, but this can get pretty complicated. If your thread is doing a lot of context switches, you may encounter a lot of cache misses.

+1 for profiling first :)
While using a custom allocator can be the correct solution, I'd certainly recommend two things first:
keep a reference/pointer to the entire vector of A instead of a vector of A*:
class Container {
    ...
    void compute_all();
    std::vector<A>* objects;
    ...
};
Use a standard library with custom allocators (I think Boost has some good ones; EASTL is centered around the very concept).
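Building on the first point, a sketch of how the dependencies could then be expressed as indices into that single vector of A instead of raw pointers (the index-based dependencies member is my own suggestion, not something from the question):
#include <cstddef>
#include <vector>

class A {
    double compute();
    // positions in the shared std::vector<A>, instead of raw pointers
    std::vector<std::size_t> dependencies;
};

class Container {
    void compute_all();
    std::vector<A>* objects;   // all As live contiguously in this one vector
};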
$0.02

Related

Is there a use case for std::unique_ptr<std::array<T,N>>

I came across something like:
using arr_t=std::array<std::array<std::array<int,1000>,1000>,1000>;
std::unique_ptr<arr_t> u_ptr;
The unique pointer was used, obviously, to avoid a stack overflow. Is there any case for using the previous code rather than just using std::vector? Is there a real use case for std::unique_ptr<std::array<T,N>>?
The code above generates one contiguous buffer of a billion elements, with [] access that lets you get at elements as a 3-dimensional 1000-sided cube.
A vector of vectors of vectors would be a whole pile of non-contiguous buffers linked by pointers and ownership semantics.
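For illustration, a minimal sketch of how the unique_ptr version is typically created and used (value-initialization via std::make_unique, which zeroes the elements, is my assumption about the intent):
#include <array>
#include <memory>

using arr_t = std::array<std::array<std::array<int, 1000>, 1000>, 1000>;

int main()
{
    auto u_ptr = std::make_unique<arr_t>();  // one contiguous heap allocation of a billion ints
    (*u_ptr)[1][2][3] = 42;                  // [] access as a 1000-sided cube
}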
I suspect you are suggesting
using u_ptr=std::vector<std::array<std::array<int,1000>,1000>>;
then resizing said vector to 1000 once created. This has the modest cost of an extra two pointers of overhead in the handle object. It also permits variable size, which means that ensuring it stays the intended fixed size is something the user code has to ensure. You'd want to block a pile of methods, basically everything unique_ptr doesn't expose, to ensure safety, or audit that your code doesn't use any of them.
Some of those operations could be very expensive; .push_back({}) would reallocate a gigabyte.
Now, maybe you intend never to call that; but if you have generic code that processes vectors, you'd have to audit all of it to ensure that none of it ever does these operations. It isn't possible to have a non-const handle to a vector that cannot resize it, for example, without rolling your own span class at this point.
We could block the methods we do not want to expose with private inheritance and using statements, but at this point we end up doing most of the work to get back to the unique_ptr solution.
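A rough sketch of that wrapper (fixed_buffer is a hypothetical name, and the list of exposed members is deliberately incomplete):
#include <cstddef>
#include <vector>

template <class T, std::size_t N>
class fixed_buffer : private std::vector<T>   // hide the resizing interface
{
    using base = std::vector<T>;
public:
    fixed_buffer() : base(N) {}               // size fixed at construction
    using base::operator[];
    using base::begin;
    using base::end;
    using base::size;
    // no push_back, resize, insert, ...
};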

Keep vector members of class contiguous with class instance

I have a class that implements two simple, pre-sized stacks; those are stored as members of the class of type vector pre-sized by the constructor. They are small and cache line size friendly objects.
Those two stacks are constant in size, persisted and updated lazily, and are often accessed together by some computationally cheap methods that, however, can be called a large number of times (tens to hundreds of thousands of times per second).
All objects are already in good state (the code is clean and does what it's supposed to do) and all sizes are kept under control (64 KB to 128 KB in most cases for the whole chain of ops including results; rarely do they get close to 256 KB, so at worst an L2 look-up and often L1).
Some auto-vectorization comes into play, but other than that it's single-threaded code throughout.
The class, minus some minor things and padding, looks like this:
class Curve{
private:
    std::vector<ControlPoint> m_controls;
    std::vector<Segment> m_segments;
    unsigned int m_cvCount;
    unsigned int m_sgCount;
    std::vector<unsigned int> m_sgSampleCount;
    unsigned int m_maxIter;
    unsigned int m_iterSamples;
    float m_lengthTolerance;
    float m_length;
};
Curve::Curve(){
    m_controls = std::vector<ControlPoint>(CONTROL_CAP);
    m_segments = std::vector<Segment>( (CONTROL_CAP-3) );
    m_cvCount = 0;
    m_sgCount = 0;
    m_sgSampleCount = std::vector<unsigned int>(CONTROL_CAP-3); // assign the member, not a shadowing local
    m_maxIter = 3;
    m_iterSamples = 20;
    m_lengthTolerance = 0.001f;
    m_length = 0.0f;
}

Curve::~Curve(){}
Bear with the verbosity, please, I'm trying to educate myself and make sure I'm not operating by some half-arsed knowledge:
Given the operations that are run on those and their actual use, performance is largely memory I/O bound.
I have a few questions related to optimal positioning of the data; keep in mind this is on Intel CPUs (Ivy Bridge and a few Haswell) and with GCC 4.4, and I have no other use cases for this:
I'm assuming that if the actual storage of controls and segments is contiguous with the instance of Curve, that's an ideal scenario for the cache (size-wise the lot can easily fit on my target CPUs).
A related assumption is that if the vectors are distant from the instance of the Curve, and from each other, then as methods alternately access the contents of those two members there will be more frequent eviction and re-population of the L1 cache.
1) Is that correct (data is pulled for the entire stretch of cache size from the address first looked up on a new operation, and not in multiple convenient segments of appropriate size), or am I misunderstanding the caching mechanism, and the cache can pull and preserve multiple smaller stretches of RAM?
2) Following the above: so far, by pure circumstance, all my tests end up with the class instance and the vectors contiguous, but I assume that's just dumb luck, however statistically probable. Normally instancing the class reserves only the space for that object, and then the vectors are allocated in the next free contiguous chunk available, which is not guaranteed to be anywhere near my Curve instance if that previously found a small, emptier niche in memory.
Is this correct?
3) Assuming 1 and 2 are correct, or close enough functionally speaking, I understand that to guarantee performance I'd have to write an allocator of sorts to make sure the class object itself is large enough on instancing, and then copy the vectors in there myself and from there on refer to those.
I can probably hack my way to something like that if it's the only way to work through the problem, but I'd rather not hack it horribly if there are nice/smart ways to go about something like that. Any pointers on best practices and suggested methods would be hugely helpful (beyond "don't use malloc it's not guaranteed contiguous", that one I already have down :) ).
If the Curve instance fits into a cache line and the data of the two vectors also fits into a cacheline each, the situation is not that bad, because you then have four constant cachelines. If every element were accessed indirectly and randomly positioned in memory, every access to an element might cost you a fetch operation, which is avoided in that case. In the case that both Curve and its elements fit into fewer than four cachelines, you would reap benefits from putting them into contiguous storage.
True.
If you used std::array, you would have the guarantee that the elements are embedded in the owning class and not have the dynamic allocation (which in and of itself costs you memory space and bandwidth). You would then even avoid the indirect access that you would still have if you used a special allocator that puts the vector content in contiguous storage with the Curve instance.
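A minimal sketch of that, assuming CONTROL_CAP is a compile-time constant (the constructor calls in the question suggest it is):
#include <array>

class Curve {
private:
    std::array<ControlPoint, CONTROL_CAP> m_controls;        // embedded in the object, no separate allocation
    std::array<Segment, CONTROL_CAP - 3> m_segments;
    std::array<unsigned int, CONTROL_CAP - 3> m_sgSampleCount;
    unsigned int m_cvCount = 0;                               // number of entries actually in use
    unsigned int m_sgCount = 0;
    // ... remaining members as before ...
};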
BTW: Short style remark:
Curve::Curve()
{
    m_controls = std::vector<ControlPoint>(CONTROL_CAP, ControlPoint());
    m_segments = std::vector<Segment>(CONTROL_CAP - 3, Segment());
    ...
}
...should be written like this:
Curve::Curve() :
    m_controls(CONTROL_CAP),
    m_segments(CONTROL_CAP - 3)
{
    ...
}
This is called a "member initializer list"; search for that term for further explanations. Also, a default-initialized element, which you provide as the second parameter, is already the default, so there is no need to specify it explicitly.

Should I use manual alloc to allow move semantics?

I'm interested to learn when I should start considering using move semantics in favour of copying data, depending on the size of that data and the usage of the class. For example, for a Matrix4 class we have two options:
struct Matrix4{
    float* data;

    Matrix4(){ data = new float[16]; }
    Matrix4(Matrix4&& other){
        *this = std::move(other);
    }
    Matrix4& operator=(Matrix4&& other)
    {
        ... removed for brevity ...
    }
    ~Matrix4(){ delete [] data; }

    ... other operators and class methods ...
};
struct Matrix4{
    float data[16]; // let the compiler do the magic

    Matrix4(){}
    Matrix4(const Matrix4& other){
        std::copy(other.data, other.data+16, data);
    }
    Matrix4& operator=(const Matrix4& other)
    {
        std::copy(other.data, other.data+16, data);
        return *this;
    }

    ... other operators and class methods ...
};
I believe there is some overhead in having to allocate and deallocate memory "by hand", and given the chances of actually hitting the move constructor when using this class, what is the preferred implementation for a class with such a small in-memory size? Is move really always preferred over copy?
In the first case, allocation and deallocation are expensive - because you are dynamically allocating memory from the heap, even if your matrix is constructed on the stack - and moves are cheap (just copying a pointer).
In the second case, allocation and deallocation are cheap, but moves are expensive - because they are actually copies.
So if you are writing an application and you just care about performance of that application, the answer to the question "Which one is better?" likely depends on how much you are creating/destroying matrices vs how much you are copying/moving them - and in any case, do your own measurements to support any conjectures.
By doing measurements you will also check whether your compiler is doing a lot of copy/move elisions in places where you expect moves to be going on - results may be against your expectations.
Also, cache locality may have an impact here: if you allocate storage for a matrix's data on the heap, then having three matrices that you want to process element-by-element created on the stack will likely require quite a scattered memory access pattern, potentially resulting in more cache misses.
On the other hand, if you are using arrays for which memory is allocated on the stack, it is likely that the same cache line will be able to hold the data of all those matrices - thus increasing the cache hit rate. Not to mention the fact that in order to access elements on the heap you first need to read the value of the data pointer, which means accessing a different region of memory than the one holding the elements.
So once more, the moral of the story is: do your own measurements.
If you are writing a library on the other hand, and you cannot predict how many constructions/destructions vs moves/copies the client is going to perform, then you may offer two such matrix classes, and factor out the common behavior into a base class - possibly a base class template.
That will give the client flexibility and will give you a sufficiently high degree of reuse - no need to write the implementation of all common member functions twice.
This way, clients may choose the matrix class that best fits the creation/moving profile of the application in which they are using it.
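One way to get that reuse, sketched as a class template parameterized on a storage policy (all names here are mine, not from the question or the answer):
#include <memory>

// storage policies: one keeps the elements embedded, the other on the heap
struct EmbeddedStorage {
    float elems[16] = {};
    float*       ptr()       { return elems; }
    const float* ptr() const { return elems; }
};

struct HeapStorage {
    std::unique_ptr<float[]> elems = std::make_unique<float[]>(16);
    float*       ptr()       { return elems.get(); }
    const float* ptr() const { return elems.get(); }
};

// the common behaviour is written once against the storage policy
template <class Storage>
struct BasicMatrix4 : private Storage {
    float& operator()(int r, int c)       { return this->ptr()[r * 4 + c]; }
    float  operator()(int r, int c) const { return this->ptr()[r * 4 + c]; }
    // ... other operators and methods written once ...
};

using StackMatrix4 = BasicMatrix4<EmbeddedStorage>;  // copies, no heap allocation
using HeapMatrix4  = BasicMatrix4<HeapStorage>;      // cheap moves, move-only as written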
UPDATE:
As DeadMG points out in the comments, one advantage of the array-based approach over the dynamic allocation approach is that the latter is doing manual resource management through raw pointers, new, and delete, which forces you to write user-defined destructor, copy constructor, move constructor, copy-assignment operator, and move-assignment operator.
You could avoid all of this if you were using std::vector, which would perform the memory management task for you and would save you from the burden of defining all those special member functions.
This said, the mere fact of suggesting to use std::vector instead of doing manual memory management - as much as it is a good advice in terms of design and programming practice - does not answer the question, while I believe the original answer does.
Like everything else in programming, especially when performance is concerned, it's a complicated trade-off.
Here, you have two designs: to keep the data inside your class (method 1) or to allocate the data on the heap and keep a pointer to it in the class (method 2).
As far as I can tell, these are the trade-offs you are making:
Construction/Destruction Speed: Naively implemented, method 2 will be slower here, because it requires dynamic memory allocation and deallocation. However, you can help the situation using custom memory allocators, especially if the size of your data is predictable and/or fixed.
Size: In your 4x4 matrix example, method 2 requires storing an additional pointer, plus memory allocation size overhead (typically anywhere from 4 to 32 bytes). This might or might not be a factor, but it certainly must be considered, especially if your class instances are small.
Move Speed: Method 2 has a very fast move operation, because it only requires setting two pointers. In method 1, you have no choice but to copy your data. However, while being able to rely on fast moves can make your code pretty, straightforward, readable and more efficient, compilers are quite good at copy elision, which means that you can write your pretty, straightforward and readable pass-by-value interfaces even if you implement method 1, and the compiler will not generate too many copies anyway. But you can't be sure of that, so relying on this compiler optimization, especially if your instances are larger, requires measurement and inspection of the generated code.
Member Access Speed: This is the most important differentiator for small classes, in my opinion. Each time you access an element in a matrix implemented using method 2 (or access a field in a class implemented that way, i.e., with external data) you access memory twice: once to read the address of the external block of memory, and once to actually read the data you want. In method 1, you just directly access the field or element you want. This means that in method 2, every access could potentially generate an additional cache miss, which could hurt your performance. This is especially important if your class instances are small (e.g. a 4x4 matrix) and you operate on many of them stored in arrays or vectors.
In fact, this is why you might want to actually copy bytes around when you are copying/moving an instance of your matrix, instead of just setting a pointer: to keep your data contiguous. This is why flat data structures (like arrays of values) are much preferred in high-performance code over pointer-spaghetti data structures (like arrays of pointers, linked lists, etc.). So, while moving is cooler and faster than copying in isolation, you sometimes want to copy your instances to make (or keep) a whole bunch of them contiguous and make iterating over and accessing them much, much more efficient.
Flexibility of Length/Size: Method 2 is obviously more flexible in this regard because you can decide how much data you need at runtime, be it 16 or 16777216 bytes.
All in all, here's the algorithm I suggest you use for picking one implementation:
If you need variable amount of data, pick method 2.
If you have very large amounts of data in each instance of your class (e.g. several kilobytes,) pick method 2.
If you need to copy instances of your class around a lot (and I mean a lot!) pick method 2 (but try to measure the performance improvement and inspect the generated code, especially in hot areas).
In all other cases, prefer method 1.
In short, method 1 should be your default, until proven otherwise. And the way to prove anything regarding performance is measurement! So don't optimize anything unless you have measured and have proof that one method is better than another, and also (as mentioned in other answers,) you might want to implement both methods if you are writing a library and let your users choose the implementation.
I would probably use a stdlib container (such as std::vector or std::array) that already implements move semantics, and then I would simply have the vectors or arrays move.
For example, you could use std::array<std::array<float, 4>, 4> or std::vector<std::vector<float>> to represent your matrix type.
I don't think it will matter a lot for a 4x4 matrix, but it might for 10000x10000. So yes, a move constructor for a matrix type is definitely worth it, especially if you're planning to work with a lot of temporary matrices (which seems likely when you want to do calculations with them). It will also allow returning Matrix4 objects efficiently (whereas you'd have to use a by-ref call to get around copying otherwise).
Rather unrelated to the matter, but probably worth mentioning: in case you decide to use std::array, please make Matrix a class template (instead of embedding the size in the class name).
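For example, something along these lines (a sketch only; the names are mine):
#include <array>
#include <cstddef>

template <class T, std::size_t Rows, std::size_t Cols>
struct Matrix {
    std::array<std::array<T, Cols>, Rows> data{};  // contiguous storage, no heap allocation

    T&       operator()(std::size_t r, std::size_t c)       { return data[r][c]; }
    const T& operator()(std::size_t r, std::size_t c) const { return data[r][c]; }
};

using Matrix4 = Matrix<float, 4, 4>;  // the size lives in the template parameters, not the class name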

In generic object update loop, is it better to update per controller or per object?

I'm writing some generic code which basically will have a vector of objects being updated by a set of controllers.
The code is a bit complex in my specific context but a simplification would be:
template< class T >
class Controller
{
public:
    virtual ~Controller(){}
    virtual void update( T& ) = 0;
    // and potentially other functions used in other cases than update
};

template< class T >
class Group
{
public:
    typedef std::shared_ptr< Controller<T> > ControllerPtr;
    void add_controller( ControllerPtr );    // register a controller
    void remove_controller( ControllerPtr ); // remove a controller
    void update();                           // update all objects using controllers
private:
    std::vector< T > m_objects;
    std::vector< ControllerPtr > m_controllers;
};
I intentionally didn't use std::function because I can't use it in my specific case.
I also intentionally use shared pointers instead of raw pointers, this is not important for my question actually.
Anyway here it's the update() implementation that interest me.
I can do it two ways.
A) For each controller, update all objects.
template< class T >
void Group<T>::update()
{
    for( auto& controller : m_controllers )
        for( auto& object : m_objects )
            controller->update( object );
}
B) For each object, update by applying all controllers.
template< class T >
void Group<T>::update()
{
    for( auto& object : m_objects )
        for( auto& controller : m_controllers )
            controller->update( object );
}
"Measure! Measure! Measure!" you will say and I fully agree, but I can't measure what I don't use. The problem is that it's generic code. I don't know the size of T, I just assume it will not be gigantic, maybe small, maybe still a bit big. Really I can't assume much about T other than it is designed to be contained in a vector.
I also don't know how many controllers or T instances will be used. In my current use cases, there would be widely different counts.
The question is: which solution would be the most efficient in general?
I'm thinking about cache coherency here. Also, I assume this code would be used on different compilers and platforms.
My gut tells me that updating the instruction cache is certainly faster than updating the data cache, which would make solution B) the more efficient one in general. However, I have learnt not to trust my gut when I have doubts about performance, so I'm asking here.
The solution I'm getting to would allow the user to choose (using a compile-time policy) which update implementation to use with each Group instance, but I want to provide a default policy and I can't decide which one would be the most efficient for most of the cases.
We have living proof that modern compilers (Intel C++ in particular) are able to interchange loops, so it shouldn't really matter for you.
I remember it from the great Mysticial's answer:
Intel Compiler 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune the mispredictions, it is also twice as fast as whatever VC++ and GCC can generate!
Wikipedia article about the topic
Detecting whether loop interchange can be done requires checking if the swapped code will really produce the same results. In theory it could be possible to prepare classes that won't allow for the swap, but then again, it could be possible to prepare classes that would benefit from either version more.
Cache-Friendliness Is Close to Godliness
Knowing nothing else about how the update methods of individual Controllers behave, I think the most important factor in performance would be cache-friendliness.
Considering cache effectiveness, the only difference between the two loops is that m_objects are laid out contiguously (because they are contained in the vector) and they are accessed linearly in memory (because the loop is in order) but m_controllers are only pointed to here and they can be anywhere in memory and moreover, they can be of different types with different update() methods that themselves can reside anywhere. Therefore, while looping over them we would be jumping around in memory.
In respect to cache, the two loops would behave like this: (things are never simple and straightforward when you are concerned about performance, so bear with me!)
Loop A: The inner loop runs efficiently (unless the objects are large - hundreds or thousands of bytes - or they store their data outside themselves, e.g., std::string) because the cache access pattern is predictable and the CPU will prefetch consecutive cachelines so there won't be much stalling on reading memory for the objects. However, if the size of the vector of objects is larger than the size of the L2 (or L3) cache, each iteration of the outer loop will require reloading of the entire cache. But again, that cache reloading will be efficient!
Loop B: If indeed the controllers have many different types of update() methods, the inner loop here may cause wild jumping around in memory, but all these different update functions will be working on data that is cached and available (especially if the objects are large or they themselves contain pointers to data scattered in memory). Unless, that is, the update() methods access so much memory themselves (because, e.g., their code is huge or they require a large amount of their own - i.e. controller - data) that they thrash the cache on each invocation; in which case all bets are off anyway.
So, I suggest the following strategy generally, which requires information that you probably don't have:
If objects are small (or smallish!) and POD-like (don't contain pointers themselves) definitely prefer loop A.
If objects are large and/or complex, or if there are many many different types of complex controllers (hundreds or thousands of different update() methods) prefer loop B.
If objects are large and/or complex, and there are so very many of them that iterating over them will thrash the cache many times (millions of objects), and the update() methods are many and they are very large and complex and require a lot of other data, then I'd say the order of your loop doesn't make any difference and you need to consider redesigning objects and controllers.
Sorting the Code
If you can, it may be beneficial to sort the controllers based on their type! You can use some internal mechanism in Controller or something like typeid() or another technique to sort the controllers based on their type, so the behavior of consecutive update() passes become more regular and predictable and nice.
This is a good idea regardless of which loop order you choose to implement, but it will have much more effect in loop B.
However, if you have so much variation among controllers (i.e. if practically all are unique) this won't help much. Also, obviously, if you need to preserve the order that controllers are applied, you won't be able to do this.
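A sketch of what that sort could look like using typeid (sort_controllers_by_type is a hypothetical helper, and this assumes the relative order of controllers is allowed to change):
#include <algorithm>
#include <typeindex>

template< class T >
void Group<T>::sort_controllers_by_type()
{
    std::sort( m_controllers.begin(), m_controllers.end(),
               []( const ControllerPtr& a, const ControllerPtr& b )
               {
                   // typeid on the dereferenced pointer yields the dynamic type,
                   // so controllers of the same concrete type end up adjacent
                   return std::type_index( typeid(*a) ) < std::type_index( typeid(*b) );
               } );
}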
Adaptation and Improvisation
It should not be hard to implement both loop strategies and select between them at compile-time (or even runtime) based on either user hint or based on information available at compile time (e.g. size of T or some traits of T; if T is small and/or a POD, you probably should use loop A.)
You can even do this at runtime, basing your decision on the number of objects and controllers and anything else you can find out about them.
But, these kinds of "Klever" tricks can get you into trouble as the behavior of your container will depend on weird, opaque and even surprising heuristics and hacks. Also, they might and will even hurt performance in some cases, because there are many other factors meddling in performance of these two loops, including but not limited to the nature of the data and the code in objects and controllers, the exact sizes and configurations of cache levels and their relative speeds, the architecture of CPU and the exact way it handles prefetching, branch prediction, cache misses, etc., the code that the compiler generates, and much more.
If you want to use this technique (implementing both loops and switching between them are compile- and/or run-time) I highly suggest that you let the user do the choosing. You can accept a hint about which update strategy to use, either as a template parameter or a constructor argument. You can even have two update functions (e.g. updateByController() and updateByObject()) that the user can call at will.
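For instance, the two-function variant could be as simple as this (the names come straight from the paragraph above):
template< class T >
void Group<T>::updateByController()   // loop A: controller-major
{
    for( auto& controller : m_controllers )
        for( auto& object : m_objects )
            controller->update( object );
}

template< class T >
void Group<T>::updateByObject()       // loop B: object-major
{
    for( auto& object : m_objects )
        for( auto& controller : m_controllers )
            controller->update( object );
}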
On Branch Prediction
The only interesting branch here is the virtual update call, and as an indirect call through two pointers (the pointer to the controller instance and then the pointer to its vtable) it is quite hard to predict. However, sorting controllers based on type will help immensely with this.
Also remember that a mispredicted branch will cause a stall of a few to a few dozen CPU cycles, but for a cache miss, the stall will be in the hundreds of cycles. Of course, a mispredicted branch can cause a cache miss too, so... As I said before, nothing is simple and straightforward when it comes to performance!
In any case, I think cache friendliness is by far the most important factor in performance here.

std::sort on container of pointers

I want to explore the performance differences for multiple dereferencing of data inside a vector of new-ly allocated structs (or classes).
struct Foo
{
    int val;
    // some variables
};

std::vector<Foo*> vectorOfFoo;

// Foo objects are new-ed and pushed in vectorOfFoo
for (int i = 0; i < N; i++)
{
    Foo* f = new Foo;
    vectorOfFoo.push_back(f);
}
In the parts of the code where I iterate over the vector I would like to enhance locality of reference through the many iterator dereferences; for example, I very often have to perform a double nested loop
for (std::vector<Foo*>::iterator iter1 = vectorOfFoo.begin(); iter1 != vectorOfFoo.end(); ++iter1)
{
    int somevalue = (*iter1)->val;
}
Obviously if the pointers inside vectorOfFoo point to objects that are very far apart in memory, I think locality of reference is somewhat lost.
What about performance if I sort the vector before iterating over it? Should I get better performance from the repeated dereferencing?
Am I guaranteed that consecutive calls to new allocate objects which are close to each other in the memory layout?
Just to answer your last question: no, there is no guarantee whatsoever where new allocates memory. The allocations can be distributed throughout the memory. Depending on the current fragmentation of the memory you may be lucky that they are sometimes close to each other but no guarantee is - or, actually, can be - given.
If you want to improve the locality of reference for your objects then you should look into Pool Allocation.
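A minimal sketch of the idea, assuming the number of objects N is known up front (the pool is just a contiguous vector that owns the Foo objects and must outlive the pointers into it):
std::vector<Foo> pool;
pool.reserve(N);                      // capacity fixed up front, so pointers into it stay valid

std::vector<Foo*> vectorOfFoo;
for (int i = 0; i < N; i++)
{
    pool.emplace_back();              // objects end up contiguous in memory
    vectorOfFoo.push_back(&pool.back());
}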
But that's pointless without profiling.
It depends on many factors.
First, it depends on how the objects that are pointed to from the vector were allocated. If they were allocated on different pages, there is nothing for it but to fix the allocation part and/or try to use software prefetching.
You can generally check what virtual addresses malloc gives out, but as a part of the larger program the result of separate allocations is not deterministic. So if you want to control the allocation, you have to do it smarter.
In the case of a NUMA system, you have to make sure that the memory you are accessing is allocated from the physical memory of the node on which your process is running. Otherwise, no matter what you do, the memory will be coming from the other node, and you cannot do much in that case except transfer your program back to its "home" node.
You have to check the stride that is needed in order to jump from one object to another. The prefetcher can recognize a stride within a 512-byte window. If the stride is greater, you are talking about random memory access from the prefetcher's point of view. Then it will shut off so as not to evict your data from the cache, and the best you can do there is to try and use software prefetching. Which may or may not help (always test it).
So if sorting the vector of pointers makes the objects they point to sit one after another with a relatively small stride, then yes, you will improve the memory access speed by making it more friendly for the prefetch hardware.
You also have to make sure that sorting that vector doesn't result in a worse gain/loss ratio.
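If you do try the sorting approach, ordering the pointers by the addresses they hold is the straightforward way to make the traversal order follow the memory layout (a sketch, assuming the iteration order of the Foo objects is otherwise irrelevant):
#include <algorithm>
#include <functional>

std::sort(vectorOfFoo.begin(), vectorOfFoo.end(), std::less<Foo*>());
// iteration over vectorOfFoo now walks the pointees in increasing address order,
// which keeps the stride small if the objects were allocated near each other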
On a side note, depending on how you use each element, you may want to allocate them all at once and/or split those objects into different smaller structures and iterate over smaller data chunks.
At any rate, you absolutely must measure the performance of the whole application before and after your changes. This sort of optimization is a tricky business and things can get worse even though in theory the performance should have improved. There are many tools that can be used to help you profile memory access. For example, cachegrind. Intel's VTune does the same. And many other tools. So don't guess; experiment and verify the results.