Let's say my array is 32 KB and L1 is 64 KB. Does Windows use some of it while my program is running? Maybe I am not able to use all of L1 because Windows is running other programs? Should I set the priority of my program so it can use all of the cache?
for (int i = 0; i < 8192; i++)
{
    array_3[i] += clock() * (rand() % 256); // are clock() and rand() in cache too?
    // how many times do I need to use a variable to make it stay in cache?
    // or is the cache only for reading? see below please
    temp_a += array_x[i] * my_function();
}
The program is in C/C++.
The same questions apply to L2 as well.
Also, are functions kept in the cache? Is the cache read-only? (If I modify my array, does it lose its place in the cache?)
Does the compiler generate assembly code that uses the cache more effectively?
Thanks
How can I know my array is in cache?
In general, you can't. Generally speaking, the cache is managed directly by hardware, not by Windows. You also can't control whether data resides in the cache (although it is possible to specify that an area of memory shouldn't be cached).
Does Windows use some of it while my program is running? Maybe I am not able to use all of L1 because Windows is running other programs? Should I set the priority of my program so it can use all of the cache?
The L1 and L2 caches are shared by all processes running on a given core. When your process is running, it will use all of the cache it needs. When there's a context switch, some or all of the cache will be evicted, depending on what the second process needs. So the next time there's a context switch back to your process, the cache may have to be refilled all over again.
But again, this is all done automatically by the hardware.
Also, are functions kept in the cache?
On most modern processors, there is a separate cache for instructions. See e.g. this diagram which shows the arrangement for the Intel Nehalem architecture; note the shared L2 and L3 caches, but the separate L1 caches for instructions and data.
Is the cache read-only? (If I modify my array, does it lose its place in the cache?)
No. Caches can handle modified data, although this is considerably more complex (because of the problem of synchronising multiple caches in a multi-core system.)
Does the compiler generate assembly code that uses the cache more effectively?
As cache activity is generally all handled automatically by the hardware, no special instructions are needed.
Cache is not directly controlled by the operating system; it is managed in hardware.
In case of a context switch, another application may modify the cache, but you should not care about this. It is more important to handle the cases where your own program behaves in a cache-unfriendly way.
Functions are kept in cache (in the I-cache, the instruction cache).
Cache is not read-only; when you write something, it goes to [memory and] the cache.
The cache is primarily controlled by the hardware. However, I know that the Windows scheduler tends to schedule a thread onto the same core it ran on before, specifically because of the caches: it knows they would have to be reloaded on another core. Windows has used this behavior at least since Windows 2000.
As others have stated, you generally cannot control what is in cache. If you are writing code for high-performance and need to rely on cache for performance, then it is not uncommon to write your code so that you are using about half the space of L1 cache. Methods for doing so involve a great deal of discussion beyond the scope of StackOverflow questions. Essentially, you would want to do as much work as possible on some data before moving on to other data.
As a matter of what works practically, using about half of cache leaves enough space for other things to occur that most of your data will remain in cache. You cannot rely on this without cooperation from the operating system and other aspects of the computing platform, so it may be a useful technique for speeding up research calculations but it cannot be used where real-time performance must be guaranteed, as in operating dangerous machinery.
There are additional caveats besides how much data you use. Using data that maps to the same cache lines can evict data from cache even though there is plenty of cache unused. Matrix transposes are notorious for this, because a matrix whose row length is a multiple of a moderate power of two will have columns in which elements map to a small set of cache lines. So learning to use cache efficiently is a significant job.
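As a rough, hedged sketch of the "work on about half of L1 at a time" idea described above (the 16 KB block size assumes a hypothetical 32 KB L1 data cache; measure and adjust for your own hardware):
#include <cstddef>

const std::size_t kBlockBytes = 16 * 1024;                   // ~half of an assumed 32 KB L1d
const std::size_t kBlockElems = kBlockBytes / sizeof(float);

void process(float* data, std::size_t n)
{
    for (std::size_t start = 0; start < n; start += kBlockElems)
    {
        std::size_t end = (start + kBlockElems < n) ? start + kBlockElems : n;
        // Do as much work as possible on this block while it is (probably) still cached...
        for (std::size_t i = start; i < end; ++i)
            data[i] *= 2.0f;
        // ...before moving on to the next block.
        for (std::size_t i = start; i < end; ++i)
            data[i] += 1.0f;
    }
}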
As far as I know, you can't control what will be in the cache. You can declare a variable as register var_type a, and then accessing it will take a single cycle (or a small number of cycles). Moreover, the number of cycles it takes to access a chunk of memory also depends on virtual memory translation and the TLB.
It should be noted that the register keyword is merely a suggestion and the compiler is perfectly free to ignore it, as was suggested by the comment.
Even though you may not know which data is in the cache and which is not, you can still get an idea of how much of the cache you are utilizing. Modern processors have quite a few performance counters, and some of them are related to the cache. Intel's processors can tell you how many L1 and L2 misses there were. Check this for more details of how to do it: How to read performance counters on i5, i7 CPUs
Related
I have read that on x86 and x86-64 Intel CPUs, gcc provides a special prefetching instruction:
#include <xmmintrin.h>
enum _mm_hint
{
    _MM_HINT_T0 = 3,
    _MM_HINT_T1 = 2,
    _MM_HINT_T2 = 1,
    _MM_HINT_NTA = 0
};
void _mm_prefetch(void *p, enum _mm_hint h);
Programs can use the _mm_prefetch intrinsic on any pointer in the program. The different hints to be used with the _mm_prefetch intrinsic are implementation-defined. Generally speaking, each of the hints has its own meaning:
_MM_HINT_T0 fetches data to all levels of the cache for inclusive caches and to the lowest-level cache for exclusive caches.
The _MM_HINT_T1 hint pulls the data into L2 and not into L1d. If there is an L3 cache, the _MM_HINT_T2 hint can do something similar for it.
_MM_HINT_NTA allows telling the processor to treat the prefetched cache line specially.
So can someone describe examples of when this instruction is used?
And how do I properly choose the hint?
The idea of prefetching is based upon these facts:
Accessing memory is very expensive the first time.
The first time a memory address1 is accessed it must be fetched from memory; it is then stored in the cache hierarchy2.
Accessing memory is inherently asynchronous.
The CPU doesn't need any resource from the core to perform the lengthiest part of a load/store3 and thus it can be easily done in parallel with other tasks4.
Thanks to the above, it makes sense to try a load before the data is actually needed, so that when the code actually needs it, it won't have to wait.
It is worth noting that the CPU can go pretty far ahead when looking for something to do, but not arbitrarily far; so sometimes it needs the help of the programmer to perform optimally.
The cache hierarchy is, by its very nature, an aspect of the micro-architecture not the architecture (read ISA). Intel or AMD cannot give strong guarantees on what these instructions do.
Furthermore, using them correctly is not easy, as the programmer must have a clear idea of how many cycles each instruction can take.
Finally, the latest CPUs are getting better and better at hiding memory latency and lowering it.
So in general prefetching is a job for the skilled assembly programmer.
That said, the only possible scenario is one where the timing of a piece of code must be consistent at every invocation.
For example, if you know that an interrupt handler always updates a state and must perform as fast as possible, it is worth prefetching the state variable when setting up the hardware that uses that interrupt.
Regarding the different levels of prefetching, my understanding is that the different levels (L1 - L4) correspond to different amounts of sharing and polluting.
For example, prefetcht0 is good if the thread/core that executes the instruction is the same one that will read the variable.
However, this will take a line in all the caches, eventually evicting other, possibly useful, lines.
You can use this, for example, when you know that you will surely need the data shortly.
prefetcht1 is good for making the data quickly available to all cores, or a core group (depending on how L2 is shared), without polluting L1.
You can use this if you know that you may need the data, or that you'll need it after you are done with another task (that takes priority in using the cache).
This is not as fast as having the data in L1, but much better than having it in memory.
prefetcht2 can be used to remove most of the memory access latency, since it moves the data into the L3 cache.
It doesn't pollute L1 or L2 and it is shared among cores, so it's good for data used by rare (but possible) code paths or for preparing data for other cores.
prefetchnta is the easiest to understand: it is a non-temporal move. It avoids creating an entry in every cache for a line of data that is accessed only once.
prefetchw/prefetchwt1 are like the others but make the line Exclusive and invalidate other cores' copies that alias this one.
Basically, this makes writing faster, as the line is already in the optimal state of the MESI protocol (for cache coherence).
Finally, a prefetch can be done incrementally, first by moving into L3 and then by moving into L1 (just for the threads that need it).
In short, each instruction lets you decide the compromise between pollution, sharing, and speed of access.
Since these all require keeping track of the use of the cache very carefully (you need to know that it's not worth creating an entry in L1 but it is in L2), their use is limited to very specific environments.
In a modern OS, it's not possible to keep track of the cache; you can do a prefetch just to find that your quantum has expired and your program has been replaced by another one that evicts the just-loaded line.
As for a concrete example I'm a bit out of ideas.
In the past, I had to measure the timing of some external event as consistently as possible.
I used an interrupt to periodically monitor the event; in that case I prefetched the variables needed by the interrupt handler, thereby eliminating the latency of the first access.
Another, unorthodox, use of the prefetching is to move the data into the cache.
This is useful if you want to test the cache system or unmap a device from memory relying on the cache to keep the data a bit longer.
In this case moving the data to L3 is enough; not all CPUs have an L3, so we may need to move it to L2 instead.
I understand these examples are not very good, though.
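Still, as a rough illustration of the streaming case, here is a minimal, hedged sketch of issuing a prefetch a fixed distance ahead of the current element; the distance of 16 elements and the T0 hint are assumptions chosen just for illustration, and the right values depend on the memory latency and on how much work each iteration does.
#include <xmmintrin.h>

float sum_with_prefetch(const float* data, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
    {
        if (i + 16 < n)
            _mm_prefetch((const char*)&data[i + 16], _MM_HINT_T0); // pull a future element into all cache levels
        sum += data[i];
    }
    return sum;
}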
1 Actually the granularity is "cache lines" not "addresses".
2 Which I assume you are familiar with. Put shortly: at present, it goes from L1 to L3/L4. L3/L4 is shared among cores. L1 is always private per core and shared by the core's threads; L2 usually is like L1, but some models may have L2 shared across pairs of cores.
3 The lengthiest part is the data transfer from the RAM. Computing the address and initializing the transaction takes up resources (store buffer slots and TLB entries for example).
4 However, any resource used to access the memory can become a critical issue, as pointed out by @Leeor and proved by the Linux kernel developer.
I want to dynamically allocate a memory block for an array in C/C++, and this array will be accessed at a high frequency. So I want this array to stay on chip, i.e., in the Cache. How can I do this explicitly with code in C/C++?
There is no standard C++ language feature that allows you to do this.
Depending on your compiler and CPU, you may be able to use an arch-specific CPU instruction in an asm block:
T* p = new T(...);
size_t n = sizeof(T);
asm {
    "CACHE n bytes at address p"
}
...or some builtin compiler function ("intrinsic") that does this.
You will need to consult your CPU manual and/or your compiler manual.
As an example, x86 CPUs have a set of instructions starting with PREFETCH.
As another example, GCC has a built-in function called __builtin_prefetch. See GCC Data Prefetch Support.
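For what it's worth, here is a minimal sketch of how __builtin_prefetch is typically used (GCC/Clang only); the prefetch distance of 8 elements and the hint arguments are assumptions chosen purely for illustration.
void scale(double* a, int n, double k)
{
    for (int i = 0; i < n; ++i)
    {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8], 1, 3); // 1 = prefetch for write, 3 = high temporal locality
        a[i] *= k;
    }
}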
I will try to answer this question from a slightly different perspective. Do you really need to do this? And even if there were a way to do so, would it be worth it? Imagine there is a "magic" void *malloc_and_lock_in_cache(int cacheLevel) function. What are you going to do with this data? If your application is limited to a while (1) loop with random array access from a single thread, you will get such behaviour anyway thanks to the optimiser and the CPU architecture. If you think about more real-world solutions, you always have logic around the access: locking for multithreading, certain conditions, and so on. So the question is: is the rest of your application's logic so perfect that the only thing left to do is to allocate the array in cache?
Are all the other access/sorting/lookup functions such state-of-the-art logic that they cannot be reviewed, leaving only the very limited performance gain of trying to override the CPU's own optimisation?
Also, do you plan to run your application without ANY operating system, on raw hardware, so that you don't need to care how your allocation affects the OS or the rest of the applications running around it?
And what should happen if your application runs inside a virtual machine or an environment like Xen?
I can remember one similar popular subject 15-18 years ago about physical memory usage and disk-caching utilities. Indeed, tools like MS-DOS smartdrive and similar utilities were REALLY useful and sped things up a lot. Usenet was full of 'tuning advice' and performance analyses for things like write-through/write-back settings.
Especially if your DOS application was processing large amounts of data and implemented some memory-swapping logic (I am talking about times when 4 MB of RAM was a luxury), it mostly became a drama: from one point of view you need as much memory as you can get, but from another you need to swap, and swapping goes through the cache, and so on.
But what happened next? We got 386 virtual memory mode and disk caching/memory swapping integrated into the OS, and nobody cared anymore about things like tuning smartdrive or ramdisks. In general it became 'cheaper' to allocate as much virtual memory as you need than to implement your own voodoo algorithms to swap physical memory blocks (although this functionality is still in the WinAPI).
So I would really recommend concentrating your efforts on algorithms and application design rather than trying to use very low-level features with really unpredictable results, unless you are developing some new microkernel OS.
I don't think you can. First, which cache? L3, L2, L1? You can prefetch, and align the data so its access is better optimized, and then maybe touch it periodically so it stays resident and doesn't get evicted by the LRU policy, but you can't really make it stay in cache.
First you have to know the architecture of the machine you want to run the code on. Then you should check whether there's an instruction that does that kind of thing.
Actually using the memory heavily will force the cache controller to put this region in cache.
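As a rough sketch of that idea (not a guarantee of residency), you can "warm" a buffer by touching one byte per cache line before the hot loop; the 64-byte line size is an assumption, and nothing stops the lines from being evicted again afterwards.
#include <cstddef>

volatile unsigned char g_sink; // volatile so the reads are not optimized away

void warm_buffer(const unsigned char* buf, std::size_t bytes)
{
    for (std::size_t i = 0; i < bytes; i += 64) // assumes 64-byte cache lines
        g_sink = buf[i];                        // touch one byte per line
}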
And there are three rules of optimizing, you may want to know them first :)
http://c2.com/cgi/wiki?RulesOfOptimization
Is there a way to determine exactly what values, memory addresses, and/or other information currently resides in the CPU cache (L1, L2, etc.) - for current or all processes?
I've been doing quite a bit a reading which shows how to optimize programs to utilize the CPU cache more effectively. However, I'm looking for a way to truly determine if certain approaches are effective.
Bottom line: is it possible to be 100% certain of what does and does not make it into the CPU cache?
Searching for this topic returns several results on how to determine the cache size, but not contents.
Edit: To clarify some of the comments below: since software would undoubtedly alter the cache, do CPU manufacturers have a tool / hardware diagnostic system (built in) which provides this functionality?
Without using specialized hardware, you cannot directly inspect what is in the CPU cache. The act of running any software to inspect the CPU cache would alter the state of the cache.
The best approach I have found is simply to identify real hot spots in your application and benchmark alternative algorithms on hardware the code will run on in production (or on a range of likely hardware if you do not have control over the production environment).
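A minimal sketch of that benchmarking approach, assuming nothing beyond the standard library; algorithm_a and algorithm_b are hypothetical placeholders for the alternatives being compared.
#include <chrono>
#include <cstdio>

template <typename F>
double time_ms(F f)
{
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Usage (hypothetical functions), ideally repeated several times and averaged:
//   std::printf("A: %.2f ms  B: %.2f ms\n", time_ms(algorithm_a), time_ms(algorithm_b));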
In addition to Eric J.'s answer, I'll add that while I'm sure the big chip manufacturers do have such tools, it's unlikely that such a "debug" facility would be made available to regular mortals like you and me; but even if it were, it wouldn't really be of much help.
Why? It's unlikely that you are having performance issues that you've traced to cache and which cannot be solved using the well-known and "common sense" techniques for maintaining high cache-hit ratios.
Have you really optimized all other hotspots in the code and poor cache behavior by the CPU is the problem? I very much doubt that.
Additionally, as food for thought: do you really want to optimize your program's behavior to only one or two particular CPUs? After all, caching algorithms change all the time, as do the parameters of the caches, sometimes dramatically.
If you have a relatively modern processor running Windows then take a look at
http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization
and see if that might provide some of what you are looking for.
Optimizing for one specific CPU cache size is usually in vain, since the optimization will break as soon as your assumptions about the cache sizes turn out to be wrong on a different CPU.
But there is a way out. You should optimize for certain access patterns that allow the CPU to easily predict what memory locations will be read next (the most obvious one is a linearly increasing read), as sketched below. To be able to fully utilize a CPU you should read about cache-oblivious algorithms; most of them follow a divide-and-conquer strategy, where a problem is divided into sub-parts until all memory accesses fit completely into the CPU cache.
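To make the access-pattern point concrete, here is a small hedged sketch: both functions sum the same N x N row-major matrix, but the first walks memory sequentially (prefetcher-friendly) while the second strides through it; the value of N is arbitrary.
const int N = 1024; // arbitrary matrix dimension for illustration

double sum_row_major(const double* m) // m points to N*N doubles, stored row-major
{
    double s = 0.0;
    for (int row = 0; row < N; ++row)
        for (int col = 0; col < N; ++col)
            s += m[row * N + col]; // contiguous, stride-1 reads
    return s;
}

double sum_col_major(const double* m)
{
    double s = 0.0;
    for (int col = 0; col < N; ++col)
        for (int row = 0; row < N; ++row)
            s += m[row * N + col]; // jumps N * sizeof(double) bytes per read
    return s;
}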
It is also worth mentioning that the code cache and the data cache are separate. Herb Sutter has a nice video online where he talks about the CPU internals in depth.
The Visual Studio Profiler can collect CPU counters dealing with memory and L2 counters. These options are available when you select instrumentation profiling.
Intel also has a paper online which talks in greater detail about these CPU counters, about what the task managers of Windows and Linux show you, and about how wrong that is for today's CPUs, which work internally asynchronously and in parallel at many different levels. Unfortunately there is no tool from Intel to display this stuff directly. The only tool I know of is the VS profiler. Perhaps VTune has similar capabilities.
If you have gone this far to optimize your code, you might as well look into GPU programming. You need at least a PhD to get your head around SIMD instructions, cache locality, and so on, to get perhaps a factor of 5 over your original design. But by porting your algorithm to a GPU you can get a factor of 100 with much less effort on a decent graphics card. NVidia GPUs which support CUDA (all cards sold today support it) can be programmed very nicely in a C dialect. There are even wrappers for managed code (.NET) to take advantage of the full power of GPUs.
You can stay platform agnostic by using OpenCL, but NVidia's OpenCL support is very bad. The OpenCL drivers are at least 8 times slower than their CUDA counterparts.
Almost everything you do will be in the cache at the moment when you use it, unless you are reading memory that has been configured as "uncacheable" - typically, that's frame buffer memory of your graphics card. The other way to "not hit the cache" is to use specific load and store instructions that are "non-temporal". Everything else is read into the L1 cache before it reaches the target registers inside the CPU itself.
In nearly all cases, CPUs have a fairly good system for knowing what to keep and what to throw away in the cache, and the cache is nearly always "full" - though not necessarily of useful stuff. If, for example, you are working your way through an enormous array, it will just contain a lot of "old array" [this is where the "non-temporal" memory operations come in handy, as they allow you to read and/or write data that won't be stored in the cache, since the next time you get back to the same point, it wouldn't have been in the cache anyway].
And yes, processors usually have special registers [that can be accessed in kernel drivers] that can inspect the contents of the cache. But they are quite tricky to use without at the same time losing the content of the cache(s). And they are definitely not useful as "how much of array A is in the cache" type checking. They are specifically for "Hmm, it looks like cache-line 1234 is broken, I'd better read the cached data to see if it's really the value it should be" when processors aren't working as they should.
As DanS says, there are performance counters that you can read from suitable software [need to be in the kernel to use those registers too, so you need some sort of "driver" software for that]. In Linux, there's "perf". And AMD has a similar set of performance counters that can be used to find out, for example "how many cache misses have we had over this period of time" or "how many cache hits in L" have we had, etc.
I need to evaluate the time taken by a C++ function under a bunch of hypotheses about memory-hierarchy efficiency (e.g. the time taken when there is a cache miss, a cache hit, or a page fault when reading a portion of an array), so I'd like some libraries that let me count the cache misses / page faults in order to auto-generate a performance summary.
I know there are some tools like cachegrind that gives some related statistics on a given application execution, but I'd like a library, as I've already said.
edit Oh, I forgot: I'm using Linux and I'm not interested in portability, it's an academic thing.
Any suggestion is welcome!
Most recent CPUs (both AMD and Intel) have performance monitor registers that can be used for this kind of job. For Intel, they're covered in the programmer's reference manual, volume 3B, chapter 30. For AMD, it's in the BIOS and Kernel Developer's Guide.
Either way, you can count things like cache hits, cache misses, memory requests, data prefetches, etc. They have pretty specific selectors, so you could get a count of (for example) the number of reads on the L2 cache to fill lines in the L1 instruction cache (while still excluding L2 reads to fill lines in the L1 data cache).
There is a Linux kernel module to give access to MSRs (Model-specific registers). Offhand, I don't know whether it gives access to the performance monitor registers, but I'd expect it probably does.
It looks like now there is exactly what I was searching for: perf_event_open.
It lets you do interesting things like initializing/enabling/disabling some performance counters and subsequently fetching their values through a uniform and intuitive API (it gives you a special file descriptor from which you read a struct containing the previously requested information).
It is a Linux-only solution and the functionality varies depending on the kernel version, so be careful :)
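For reference, here is a minimal sketch along the lines of the perf_event_open(2) manual page; it counts hardware cache misses around a region of code (the "code under test" placeholder is yours to fill in), and exact event support depends on your CPU and kernel.
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                            int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES; // typically last-level cache misses
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0); // this process, any CPU
    if (fd == -1) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... code under test goes here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    long long misses = 0;
    if (read(fd, &misses, sizeof(misses)) == sizeof(misses))
        printf("cache misses: %lld\n", misses);
    close(fd);
    return 0;
}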
Intel VTune is a performance tuning tool that does exactly what you are asking for.
Of course it works with Intel processors, as it accesses the internal processor counters, as explained by Jerry Coffin, so this probably will not work on an AMD processor.
It exposes literally hundreds of counters, like cache hits/misses, branch prediction rates, etc. The real issue with it is understanding which counters to check ;)
Cache misses cannot simply be counted easily. Most tools or profilers simulate memory access by redirecting memory accesses to a function that provides this feature. That means these kinds of tools instrument the code at every place where a memory access is done, which makes your code run awfully slowly. This is not what you intend, I guess.
However, depending on the hardware you might have some other possibilities. But even if that is the case, the OS has to support it (because otherwise you would get system-global stats, not the ones related to a process or thread).
EDIT: I could find this interesting article that may help you: http://lwn.net/Articles/417979/
I have a problem....
I'm writing data into an array in a while-loop, and the point is that I'm doing it really frequently. It seems that this writing is now the bottleneck in the code, and I presume it's caused by the writes to memory. This array is not really large (something like 300 elements). The question: is it possible to store it in the cache and update it in memory only after the while-loop is finished?
First, I'd like to thank all of you for your answers. Indeed, it was a little dumb not to post the code, so I've decided to do it now.
double* array1 = new double[1000000];           // this array has its elements filled in elsewhere
unsigned long* array2 = new unsigned long[300];
double varX, t, sum = 0;
int iter = 0, i = 0;
// nm0, nm1, difX and max_steps are declared and initialized elsewhere
while (i <= max_steps)
{
    varX += difX;
    nm0 = int(varX);
    if (nm1 != nm0)
    {
        array2[iter] = nm0; // if you comment out this line, the application runs more than 2 times faster :)
        nm1 = nm0;
        t = array1[nm0];    // if you comment out this line, there is almost no change in running time
        ++iter;
    }
    sum += t;
    ++i;
}
So that's it. It would be nice if someone has any ideas. Thank you very much again.
Sincerely
Alex
Not intentionally, no. Among other things, you have no idea how big the cache is, so you have no idea of what's going to fit. Furthermore, if the app were allowed to lock off part of the cache, the effects on the OS might be devastating to overall system performance. This falls squarely onto my list of "you can't do it because you shouldn't do it. Ever."
What you can do is to improve your locality of reference - try to arrange the loop such that you don't access the elements more than once, and try to access them in order in memory.
Without more clues about your application, I don't think more specific advice can be given.
The CPU does not usually offer fine-grained cache control; you're not allowed to choose what is evicted, or to pin things in the cache. You do have a few cache operations on some CPUs, though. Just as a bit of info on what you can do: here are some interesting cache-related instructions on newer x86{-64} CPUs (doing things like this makes portability hell, but I figured you may be curious).
Software Data Prefetch
The non-temporal instruction is prefetchnta, which fetches the data into the second-level cache, minimizing cache pollution.
The temporal instructions are as follows:
* prefetcht0 – fetches the data into all cache levels, that is, to the second-level cache for the Pentium® 4 processor.
* prefetcht1 – identical to prefetcht0
* prefetcht2 – identical to prefetcht0
Additionally, there is a set of instructions for accessing data in memory that explicitly tells the processor not to insert the data into the cache. These are called non-temporal instructions. An example of one is MOVNTI.
You could use the non-temporal instructions for every piece of data you DON'T want in the cache, in the hope that the rest will always stay cached. I don't know whether this would actually improve performance, as there are subtle behaviors to be aware of when it comes to the cache. It also sounds like it would be relatively painful to do.
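As a hedged sketch of the non-temporal idea, here is what a streaming store looks like with the SSE2 intrinsic _mm_stream_si32 (which compiles to MOVNTI); the function and loop are illustrative only, and this only makes sense for data you will not read back soon.
#include <emmintrin.h> // SSE2

void fill_streaming(int* dst, int n, int value)
{
    for (int i = 0; i < n; ++i)
        _mm_stream_si32(&dst[i], value); // non-temporal store: bypasses the cache
    _mm_sfence();                        // make the streaming stores globally visible
}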
I have a problem.... I'm writing data into an array in a while-loop, and I'm doing it really frequently. It seems that this writing is now the bottleneck in the code, and I presume it's caused by the writes to memory. This array is not really large (something like 300 elements). Is it possible to store it in the cache and update it in memory only after the while-loop is finished?
You don't need to. The only reason why it might get pushed out of the cache is if some other data is deemed more urgent to put in the cache.
Apart from this, an array of 300 elements should fit in the cache with no trouble (assuming the element size isn't too crazy), so most likely your data is already in cache.
In any case, the most effective solution is probably to tweak your code. Use lots of temporaries (to indicate to the compiler that the memory address isn't important), rather than writing/reading into the array constantly. Reorder your code so loads are performed once, at the start of the loop, and break up dependency chains as much as possible.
Manually unrolling the loop gives you more flexibility to achieve these things.
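Here is a small hedged sketch of those suggestions applied to a generic accumulation loop (not the original code): four partial sums break up the dependency chain, and the running values live in locals that the compiler can keep in registers instead of re-reading memory.
double sum4(const double* a, int n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0; // independent accumulators
    int i = 0;
    for (; i + 4 <= n; i += 4)             // manually unrolled by 4
    {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i)                     // remainder
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}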
And finally, two obvious tools you should use rather than guessing about the cache behavior:
A profiler, and cachegrind if available. A good profiler can tell you a lot of statistics on cache misses, and cachegrind gives you a lot of information too.
Us here at StackOverflow. If you post your loop code and ask how its performance can be improved, I'm sure a lot of us will find it a fun challenge.
But as others have mentioned, when working with performance, don't guess. You need hard data and measurements, not hunches and gut feelings.
Unless your code does something completely different in between writes to the array, most of the array will probably be kept in the cache.
Unfortunately there isn't anything you can do to affect what is in the cache, apart from rewriting your algorithm with the cache in mind. Try to use as little memory as possible in between writes to the array: don't use lots of variables, don't call many other functions, and try to write to the same region of the array consecutively.
I doubt that this is possible, at least on a high-level multitasking operating system. You can't guarantee that your process won't be pre-empted and lose the CPU. If your process then owned the cache, other processes couldn't use it, which would make their execution very slow and complicate things a great deal. You really don't want to run a modern several-GHz processor without cache just because one application has locked all others out of it.
In this case, array2 will be quite "hot" and stay in cache for that reason alone. The trick is keeping array1 out of cache (!). You're reading it only once, so there is no point in caching it. The SSE instruction for that is MOVNTPD, with the intrinsic void _mm_stream_pd(double *destination, __m128d source).
Even if you could, it's a bad idea.
Modern desktop computers use multi-core CPUs. Intel's chips are the most common chips in desktop machines... but the Core and Core 2 processors don't share an on-die cache.
That is, they didn't share a cache until the Core i7 chips were released, which share an on-die 8 MB L3 cache.
So, if you were able to lock data in the cache on the computer I'm typing this from, you can't even guarantee this process will be scheduled on the same core, so that cache lock may be totally useless.
If your writes are slow, make sure that no other CPU core is writing in the same memory area at the same time.
When you have a performance problem, don't assume anything, measure first. For example, comment out the writes, and see if the performance is any different.
If you are writing to an array of structures, use a structure pointer to cache the address of the structure so you are not doing the array multiply each time you do an access. Make sure you are using the native word length for the array indexer variable for maximum optimisation.
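A tiny sketch of that pointer-caching idea; the struct and field names here are hypothetical, and the point is just to compute the element's address once instead of re-indexing the array for every field access.
struct Item { double x, y, z; };

void update(Item* items, int i)
{
    Item* p = &items[i]; // compute the address once
    p->x += 1.0;         // reuse it instead of items[i].x, items[i].y, ...
    p->y += 2.0;
    p->z += 3.0;
}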
As other people have said, you can't control this directly, but changing your code may indirectly enable better caching. If you're running on linux and want to get more insight into what's happening with the CPU cache when your program runs, you can use the Cachegrind tool, part of the Valgrind suite. It's a simulation of a processor, so it's not totally realistic, but it gives you information that is hard to get any other way.
It might be possible to use some assembly code, or, as onebyone pointed out, assembly intrinsics, to prefetch lines of memory into the cache, but that would cost a lot of time to tinker with.
Just as a trial, try reading in all the data (in a manner that the compiler won't optimize away) and then doing the writes. See if that helps.
In the early boot phases of CoreBoot (formerly LinuxBIOS), since there is no access to RAM yet (we are talking about BIOS code, and therefore the RAM hasn't been initialized yet), they set up something they call Cache-as-RAM (CAR), i.e. they use the processor cache as RAM even though it is not backed by actual RAM.