Scenarios when software prefetching manual instructions are reasonable

Scenarios when software prefetching manual instructions are reasonable - c++

I have read about that on x86 and x86-64 Intel gcc provides special prefetching instructions:
#include <xmmintrin.h>
enum _mm_hint
{
_MM_HINT_T0 = 3,
_MM_HINT_T1 = 2,
_MM_HINT_T2 = 1,
_MM_HINT_NTA = 0
};
void _mm_prefetch(void *p, enum _mm_hint h);
Programs can use the _mm_prefetch intrinsic on any
pointer in the program. And The different hints to be used with the _mm_prefetch
intrinsic are implementation defined. Generally said is that each of the hints have its own meaning.
_MM_HINT_T0
fetches data to all levels of the cache for inclusive caches
and to the lowest level cache for exclusive caches
_MM_HINT_T1 hint pulls the data into L2 and
not into L1d. If there is an L3 cache the _MM_HINT_T2
hints can do something similar for it
_MM_HINT_NTA, allows telling the processor to treat the prefetched cache line specially
So can someone describe examples when this instruction used?
And how to properly choose the hint?

The idea of prefetching is based upon these facts:
Accessing memory is very expensive the first time.
The first time a memory address1 is accessed is must be fetched from memory, it is then stored in the cache hierarchy2.
Accessing memory is inherently asynchronous.
The CPU doesn't need any resource from the core to perform the lengthiest part of a load/store3 and thus it can be easily done in parallel with other tasks4.
Thanks to the above it makes sense to try a load before it is actually needed so that when the code will actually need the data, it won't have to wait.
It is very worth nothing that the CPU can go pretty far ahead when looking for something to do, but not arbitrarily deep; so sometimes it needs the help of the programmer to perform optimally.
The cache hierarchy is, by its very nature, an aspect of the micro-architecture not the architecture (read ISA). Intel or AMD cannot give strong guarantees on what these instructions do.
Furthermore using them correctly is not easy as the programmer must have clear in mind how many cycles each instruction can take.
Finally, the latest CPU are getting more and more good at hiding memory latency and lowering it.
So in general prefetching is a job for the skilled assembly programmer.
That said the only possible scenario is where the timing of a piece of code must be consistent at every invocation.
For example, if you know that an interrupt handler always update a state and it must perform as fast as possible, it is worth, when setting the hardware that uses such interrupt, to prefetch the state variable.
Regarding the different level of prefetching, my understanding is that different levels (L1 - L4) correspond to different amounts of sharing and polluting.
For example prefetch0 is good if the thread/core that executes the instruction is the same that will read the variable.
However, this will take a line in all the caches, eventually evicting other, possibly useful, lines.
You can use this for example when you know that you'll need the data surely in short.
prefetch1 is good to make the data quickly available for all core or core group (depending on how L2 is shared) without polluting L1.
You can use this if you know that you may need the data or that you'll need it after having done with another task (that takes priority in using the cache).
This is not as fast as having the data in L1 but much better than having it in memory.
prefetch2 can be used to take out most of the memory access latency since it moves the data in the L3 cache.
It doesn't pollute L1 or L2 and it is shared among cores, so it's good for data used by rare (but possible) code paths or for preparing data for other cores.
prefetchnta is the easiest to understand, it is a non-temporal move. It avoids creating an entry in every cache line for a data that is accessed only once.
prefetchw/prefetchwnt1 are like the others but makes the line Exclusive and invalidates other cores lines that alias this one.
Basically, it makes writing faster as it is in the optimal state of the MESI protocol (for cache coherence).
Finally, a prefetch can be done incrementally, first by moving into L3 and then by moving into L1 (just for the threads that need it).
In short, each instruction let you decide the compromise between pollution, sharing and speed of access.
Since these all require to keep track of the use of the cache very carefully (you need to know that it's not worth creating and entry in the L1 but it is in the L2) the use is limited to very specific environments.
In a modern OS, it's not possible to keep track of the cache, you can do a prefetch just to find your quantum expired and your program replaced by another one that evicts the just loaded line.
As for a concrete example I'm a bit out of ideas.
In the past, I had to measure the timing of some external event as consistently as possible.
I used and interrupt to periodically monitor the event, in such case I prefetched the variables needed by the interrupt handler, thereby eliminating the latency of the first access.
Another, unorthodox, use of the prefetching is to move the data into the cache.
This is useful if you want to test the cache system or unmap a device from memory relying on the cache to keep the data a bit longer.
In this case moving to L3 is enough, not all CPU has an L3, so we may need to move to L2 instead.
I understand these examples are not very good, though.
1 Actually the granularity is "cache lines" not "addresses".
2 Which I assume you are familiar with. Shortly put: It, as present, goes from L1 to L3/L4. L3/L4 is shared among cores. L1 is always private per core and shared by the core's threads, L2 usually is like L1 but some model may have L2 shared across pairs of cores.
3 The lengthiest part is the data transfer from the RAM. Computing the address and initializing the transaction takes up resources (store buffer slots and TLB entries for example).
4 However any resource used to access the memory can become a critical issue as pointed out by #Leeor and proved by the Linux kernel developer.

Related

Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees?

TL;DR: In a producer-consumer queue does it ever make sense to put an unnecessary (from C++ memory model viewpoint) memory fence, or unnecessarily strong memory order to have better latency at the expense of possibly worse throughput?
C++ memory model is executed on the hardware by having some sort of memory fences for stronger memory orders and not having them on weaker memory orders.
In particular, if producer does store(memory_order_release), and consumer observes the stored value with load(memory_order_acquire), there are no fences between load and store. On x86 there are no fences at all, on ARM fences are put operation before store and after load.
The value stored without a fence will eventually be observed by load without a fence (possibly after few unsuccessful attempts)
I'm wondering if putting a fence on either of sides of the queue can make the value to be observed faster?
What is the latency with and without fence, if so?
I expect that just having a loop with load(memory_order_acquire) and pause / yield limited to thousands of iterations is the best option, as it is used everywhere, but want to understand why.
Since this question is about hardware behavior, I expect there's no generic answer. If so, I'm wondering mostly about x86 (x64 flavor), and secondarily about ARM.
Example:
T queue[MAX_SIZE]
std::atomic<std::size_t> shared_producer_index;
void producer()
{
std::size_t private_producer_index = 0;
for(;;)
{
private_producer_index++; // Handling rollover and queue full omitted
/* fill data */;
shared_producer_index.store(
private_producer_index, std::memory_order_release);
// Maybe barrier here or stronger order above?
}
}
void consumer()
{
std::size_t private_consumer_index = 0;
for(;;)
{
std::size_t observed_producer_index = shared_producer_index.load(
std::memory_order_acquire);
while (private_consumer_index == observed_producer_index)
{
// Maybe barrier here or stronger order below?
_mm_pause();
observed_producer_index= shared_producer_index.load(
std::memory_order_acquire);
// Switching from busy wait to kernel wait after some iterations omitted
}
/* consume as much data as index difference specifies */;
private_consumer_index = observed_producer_index;
}
}

Basically no significant effect on inter-core latency, and definitely never worth using "blindly" without careful profiling, if you suspect there might be any contention from later loads missing in cache.
It's a common misconception that asm barriers are needed to make the store buffer commit to cache. In fact barriers just make this core wait for something that was already going to happen on its own, before doing later loads and/or stores. For a full barrier, blocking later loads and stores until the store buffer is drained.
Size of store buffers on Intel hardware? What exactly is a store buffer?
In the bad old days before std::atomic, compiler barriers were one way to stop the compiler from keeping values in registers (private to a CPU core / thread, not coherent), but that's a compilation issue not asm. CPUs with non-coherent caches are possible in theory (where std::atomic would need to do explicit flushing to make a store visible), but in practice no implementation runs std::thread across cores with non-coherent caches.
If I don't use fences, how long could it take a core to see another core's writes? is highly related, I've written basically this answer at least a few times before. (But this looks like a good place for an answer specifically about this, without getting into the weeds of which barriers do what.)
There might be some very minor secondary effects of blocking later loads that could maybe compete with RFOs (for this core to get exclusive access to a cache line to commit a store). The CPU always tries to drain the store buffer as fast as possible (by committing to L1d cache). As soon as a store commits to L1d cache, it becomes globally visible to all other cores. (Because they're coherent; they'd still have to make a share request...)
Getting the current core to write-back some store data to L3 cache (especially in shared state) could reduce the miss penalty if the load on another core happens somewhat after this store commits. But there are no good ways to do that. Creating a conflict miss in L1d and L2 maybe, if producer performance is unimportant other than creating low latency for the next read.
On x86, Intel Tremont (low power Silvermont series) will introduce cldemote (_mm_cldemote) that writes back a line as far as an outer cache, but not all the way to DRAM. (clwb could possibly help, but does force the store to go all the way to DRAM. Also, the Skylake implementation is just a placeholder and works like clflushopt.)
Is there any way to write for Intel CPU direct core-to-core communication code?
How to force cpu core to flush store buffer in c?
x86 MESI invalidate cache line latency issue
Force a migration of a cache line to another core (not possible)
Fun fact: non-seq_cst stores/loads on PowerPC can store-forward between logical cores on the same physical core, making stores visible to some other cores before they become globally visible to all other cores. This is AFAIK the only real hardware mechanism for threads to not agree on a global order of stores to all objects. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?. On other ISAs, including ARMv8 and x86, it's guaranteed that stores become visible to all other cores at the same time (via commit to L1d cache).
For loads, CPUs already prioritize demand loads over any other memory accesses (because of course execution has to wait for them.) A barrier before a load could only delay it.
That might happen to be optimal by coincidence of timing, if that makes it see the store it was waiting for instead of going "too soon" and seeing the old cached boring value. But there's generally no reason to assume or ever predict that a pause or barrier could be a good idea ahead of a load.
A barrier after a load shouldn't help either. Later loads or stores might be able to start, but out-of-order CPUs generally do stuff in oldest-first priority so later loads probably can't fill up all the outstanding load buffers before this load gets a chance to get its load request sent off-core (assuming a cache miss because another core stored recently.)
I guess I could imagine a benefit to a later barrier if this load address wasn't ready for a while (pointer-chasing situation) and the max number of off-core requests were already in-flight when the address did become known.
Any possible benefit is almost certainly not worth it; if there was that much useful work independent of this load that it could fill up all the off-core request buffers (LFBs on Intel) then it might well not be on the critical path and it's probably a good thing to have those loads in flight.

Flushing the cache to prevent benchmarking fluctiations

I am running the c++ code of someone to do the benchmarking on a dataset. The issue I have is that often I get a timing for the first run, and these numbers massively change (i.e. 28 seconds to 10 seconds) if I run the same code again. I assume this happens due to CPU's automatic caching. Is there a way to flush the cache, or prevent these fluctuations somehow?

Not one that works "for everything, everywhere". Most processors have special instructions to flush the cache, but they are often privileged instructions, so it has to be done from inside the OS kernel, not your user-mode code. And of course, it's completely different instructions for each processor architecture.
All current x86 processors does have a clflush instruction, that flushes one cache-line, but to do that, you have to have the address of the data (or code) you want to flush. Which is fine for small and simple data structures, not so good if you have a binary tree that is all over the place. And of course, not at all portable.
In most environments, reading and writing a large block of alternative data, e.g. something like:
// Global variables.
const size_t bigger_than_cachesize = 10 * 1024 * 1024;
long *p = new long[bigger_than_cachesize];
...
// When you want to "flush" cache.
for(int i = 0; i < bigger_than_cachesize; i++)
{
p[i] = rand();
}
Using rand will be much slower than filling with something constant/known. But the compiler can't optimise the call away, which means it's (almost) guaranteed that the code will stay.
The above won't flush instruction caches - that is a lot more difficult to do, basically, you have to run some (large enough) other piece of code to do that reliably. However, instruction caches tend to have less effect on overall benchmark performance (instruction cache is EXTREMELY important for modern processor's perforamnce, that's not what I'm saying, but in the sense that the code for a benchmark is typically small enough that it all fits in cache, and the benchmark runs many times over the same code, so it's only slower the first iteration)
Other ideas
Another way to simulate "non-cache" behaviour is allocate a new area for each benchmark pass - in other words, not freeing the memory until the end of the benchmark or using an array containing the data, and output results, such that each run has it's own set of data to work on.
Further, it's common to actually measure the performance of the "hot runs" of a benchmark, not the first "cold run" where the caches are empty. This does of course depend on what you are actually trying to achieve...

Here's my basic approach:
Allocate a memory region 2x the size of the LLC, if you can determine the LLC size dynamically (or you know it statically), or if you don't, some reasonable multiple of the largest LLC size on the platform of interest1.
memset the memory region to some non-zero value: 1 will do just fine.
"Sink" the pointer somewhere so that the compiler can't optimize out the stuff above or below (writing to a volatile global works pretty much 100% of the time).
Read from random indexes in the region until you've touched each cache line an average of 10 times or so (accumulate the read values into a sum that you sink in a similar way to (3)).
Here are some notes on why this is generally works and why doing less may not work - the details are x86-centric, but similar concerns will apply on many other architectures.
You absolutely want to write to the allocated memory (step 2) before you begin your main read-only flushing loop, since otherwise you might just be repeatedly reading from the same small zero-mapped page returned by the OS to satisfy your memory allocation.
You want to use a region considerably larger than the LLC size, since the outer cache levels are typically physically addressed, but you can only allocate and access virtual addresses. If you just allocate an LLC-sized region, you generally won't get full coverage of all the ways of every cache set: some sets will be over-represented (and so will be fully flushed), while other sets be under-represented and so not all existing values can even be flushed by accessing this region of memory. A 2x over-allocation makes it highly likely that almost all sets have enough representation.
You want to avoid the optimizer doing clever things, such as noting the memory never escapes the function and eliminating all your reads and writes.
You want to iterate randomly around the memory region, rather than just striding through it linearly: some designs like the LLC on recent Intel detect when a "streaming" pattern is present, and switch from LRU to MRU since LRU is about the worst-possible replacement policy for such a load. The effect is that no matter how many times you stream though memory, some "old" lines from before your efforts can remain in the cache. Randomly accessing memory defeats this behavior.
You want to access more than just LLC amount of memory for (a) the same reason you allocate more than the LLC size (virtual access vs physical caching) and (b) because random access needs more accesses before you have a high likelihood of hitting every set enough times (c) caches are usually only pseudo-LRU, so you need more than the number of accesses you'd expect under exact-LRU to flush out every line.
Even this is not foolproof. Other hardware optimizations or caching behaviors not considered above could cause this approach to fail. You might get very unlucky with the page allocation provided by the OS and not be able to reach all the pages (you can largely mitigate this by using 2MB pages). I highly recommend testing whether your flush technique is adequate: one approach is to measure the number of cache misses using CPU performance counters while running your benchmark and see if the number makes sense based on the known working-set size2.
Note that this leaves all levels of the cache with lines in E (exclusive) or perhaps S (shared) state, and not the M (modified) state. This means that these lines don't need to be evicted to other cache levels when they are replaced by accesses in your benchmark: they can simply be dropped. The approach described in the other answer will leave most/all lines in the M state, so you'll initially have 1 line of eviction traffic for every line you access in your benchmark. You can achieve the same behavior with my recipe above by changing step 4 to write rather than read.
In that regard, neither approach here is inherently "better" than the other: in the real world the cache levels will have a mix of modified and not-modified lines, while these approaches leave the cache at the two extremes of the continuum. In principle you could benchmark with both the all-M and no-M states, and see if it matters much: if it does, you can try to evaluate what the real-world state of the cache will usually be an replicate that.
1Remember that LLC sizes are growing almost every CPU generation (mostly because core counts are increasing), so you want to leave some room for growth if this needs to be future-proof.
2 I just throw that out there as if it was "easy", but in reality may be very difficult depending on your exact problem.

How to clear L1, L2 and L3 caches?

I am doing some cache performance measuring and I need to ensure the caches are empty of "useful" data before timing.
Assuming an L3 cache is 10MB would it suffice to create a vector of 10M/4 = 2,500,000 floats, iterate through the whole of this vector, sum the numbers and that would empty the whole cache of any data which was in it prior to iterating through the vector?

Yes, that should be sufficient for flushing the L3 cache of useful data.
I have done similar types of measurements and cross-verified by using Intel's cache counters to verify that I incur the expected number of L3 cache misses during my tests.
If you want to absolutely sure, you should also use the counters. In particular, you can measure last-level cache misses by using Event select 2EH, Umask 41H in most Intel architectures.
See the Intel Manual for details on these counters.

It depends on how insane you are trying to be to get your guarantee.
x86_64 L3 cache is physically indexed, and while a 10MiB chunk that's linear in virtual space is almost definitely going to be physically contiguous on a lightly mem-loaded machine, it's not guaranteed.
Sandy and Ivy Bridge, for example, have L3 cache in 2MiB slices with 16-way set associativity (128kiB stride), so you could guarantee physical coverage by doing a MAP_HUGETLB mmap() call, assuming standard 2-4MiB huge pages.
Also, since each slice (on new Sandy/Ivy Bridge at least) is attached to a different core, and which slice a given physical address resides on is determined by a hash of some low/middle-order address bits, you might have to make an array slightly larger than the size of L3 to counter for minutely uneven overlap.
At this point, scrubbing your array a few times linearly should do the trick.

Another option is to use dedicated cache invalidation instructions that some ISAs provide. x86 for e.g. has wbinvd for this purpose (or clflush for a single line).
http://x86.renejeschke.de/html/file_module_x86_id_325.html
One problem is that it requires ring-0 permissions. Another one is that it doesn't guarantee that the flush is completed prior to any serialization point, so it's not good enough to guarantee system non-volatility, but it may be enough for benchmarking as long as you can prevent the ensuing WBs from eating up your memory bandwidth.
If you can overcome these issues, it may be a better solution in some cases than going over some large data structure just to make sure the cache is flushed. Some CPUs may decide to avoid caching fetches they believe would not be reused in the future (there are several papers about these options, and at least some claims that it's implemented in real CPUs)

Concurrent stores seen in a consistent order

The Intel Architectures Software Developer's Manual, Aug. 2012, vol. 3A, sect. 8.2.2:
Any two stores are seen in a consistent order by processors other than
those performing the stores.
But can this be so?
The reason I ask is this: Consider a dual-core Intel i7 processor with HyperThreading. According to the Manual's vol. 1, Fig. 2-8, the i7's logical processors 0 and 1 share an L1/L2 cache, but its logical processors 2 and 3 share a different L1/L2 cache -- whereas all the logical processors share a single L3 cache. Suppose that logical processors 0 and 2 -- which do not share an L1/L2 cache -- write to the same memory location at about the same time, and that the writes go no deeper than L2 for the moment. Could not logical processors 1 and 3 (which are "processors other than those performing the stores") then see the "two stores in an inconsistent order"?
To achieve consistency, must not logical processors 0 and 2 issue SFENCE instructions, and logical processors 1 and 3 issue LFENCE instructions? Notwithstanding, the Manual seems to think otherwise, and its opinion in the matter does not have the look of a mere misprint. It looks deliberate. I'm confused.
UPDATE
In light of #Benoit's answer, a following question: The only purpose of L1 and L2 therefore is to speed loads. It is L3 that speeds stores. Is that right?

Intel CPUs (like all normal SMP systems) use (a variant of) MESI to ensure cache coherency for cached loads/stores. i.e. that all cores see the same view of memory through their caches.
A core can only write to a cache line after doing a Read For Ownership (RFO), getting the line in Exclusive state (no other caches have a valid copy of the line that could satisfy loads). Related: atomic RMW operations prevent other cores from doing anything to the target cache-line by locking it in Modified state for the duration of the operation.
To test for this kind of reordering, you need two other threads which both read both stores (in opposite order). Your proposed scenario has one core (reader2) reading an old value from memory (or L3, or its own private L2/L1) after another core (reader1) has read the new value of the same line stored by writer1. This is impossible: for reader1 to see writer1's store, writer1 must have already completed a RFO that invalidates all other copies of the cache line anywhere. And reading directly from DRAM without (effectively) snooping any write-back caches is not allowed. (Wikipedia's MESI article has diagrams.)
When a store commits (from the store buffer inside a core) to L1d cache, it becomes globally visible to all other cores at the same time. Before that, only the local core could "see" it (via store->load forwarding from the store buffer).
On a system where the only way for data to propagate from one core to another is through the global cache-coherency domain, MESI cache coherency alone guarantees that a single global store order exists, that all threads can agree on. x86's strong memory ordering rules make this global store order be some interleaving of program order, and we call this a Total Store Order memory model.
x86's strong memory model disallows LoadLoad reordering, so loads take their data from cache in program order without any barrier instructions in the reader threads.1
Loads actually snoop the local store buffer before taking data from the coherent cache. This is the reason the consistent order rule you quoted excludes the case where either store was done by the same core that's doing the loads. See Globally Invisible load instructions for more about where load data really comes from. But when the load addresses don't overlap with any recent stores, what I said above applies: load order is the order of sampling from the shared globally coherent cache domain.
The consistent order rule is a pretty weak requirement. Many non-x86 ISAs don't guarantee it on paper, but very few actual (non-x86) CPU designs have a mechanism by which one core can see store data from another core before it becomes globally visible to all cores. IBM POWER with SMT is one such example: Will two atomic writes to different locations in different threads always be seen in the same order by other threads? explains how forwarding between logical cores within one physical core can cause it. (This is a like what you proposed, but within the store buffer rather than L2).
x86 microarchitectures with HyperThreading (or AMD's SMT in Ryzen) obey that requirement by statically partitioning the store buffer between the logical cores on one physical core. What will be used for data exchange between threads are executing on one Core with HT? So even within one physical core, a store has to commit to L1d (and become globally visible) before the other logical core can load the new data.
It's probably simpler to not have forwarding from retired-but-not-committed stores in one logical core to the other logical cores on the same physical core.
(The other requirements of x86's TSO memory model, like loads and stores appearing in program order, are harder. Modern x86 CPUs execute out of order, but use a Memory Order Buffer to maintain the illusion and have stores commit to L1d in program order. Loads can speculatively take values earlier than they're "supposed" to, and then check later. This is why Intel CPUs have "memory-order mis-speculation" pipeline nukes: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?.)
As #BeeOnRope points out, there is an interaction between HT and maintaining the illusion of no LoadLoad reordering: normally a CPU can detect when another core touched a cache line after a load actual read it but before it was architecturally allowed to have read it: the load port can track invalidations to that cache line. But with HT, load ports also have to snoop the stores that the other hyperthread commits to L1d cache, because they won't invalidate the line. (Other mechanisms are possible, but it is a problem that CPU designers have to solve if they want high performance for "normal" loads.)
Footnote 1: On a weakly-ordered ISA, you'd use load-ordering barriers to control the order in which the 2 loads in each reader take their data from the globally coherent cache domain.
The writer threads are only doing a single store each so a fence is meaningless.
Because all cores share a single coherent cache domain, fences only need to control local reordering within a core. The store buffer in each core already tries to make stores globally visible as quickly as possible (while respecting the ordering rules of the ISA), so a barrier just makes the CPU wait before doing later operations.
x86 lfence has basically no memory-ordering use cases, and sfence is only useful with NT stores. Only mfence is useful for "normal" stuff, when one thread is writing something and then reading another location. http://preshing.com/20120515/memory-reordering-caught-in-the-act/. So it blocks StoreLoad reordering and store-forwarding across the barrier.
In light of #Benoit's answer, a following question: The only purpose of L1 and L2 therefore is to speed loads. It is L3 that speeds stores. Is that right?
No, L1d and L2 are write-back caches: Which cache mapping technique is used in intel core i7 processor?. Repeated stores to the same line can be absorbed by L1d.
But Intel uses inclusive L3 caches, so how can L1d in one core have the only copy? L3 is actually tag-inclusive, which is all that's needed for L3 tags work as a snoop filter (instead of broadcasting RFO requests to every core). The actual data in dirty lines is private to the per-core inner caches, but L3 knows which core has the current data for a line (and thus where to send a request when another core wants to read a line that another core has in Modified state). Clean cache lines (in Shared state) are data-inclusive of L3, but writing to a cache line doesn't write-through to L3.

I believe what the Intel documentation is saying is that the mechanics of the x86 chip will ensure that the other processors always see the writes in a consistent order.
So the other processors will only ever see one of the following results when reading that memory location:
value before either write (I.e. the read preceeded both writes)
value after processor 0's write (I.e. as if processor 2 wrote first, and then processor 0 overwrote)
value after processor 2's write (I.e. as if processor 0 wrote first and then processor 2 overwrote)
It won't be possible for processor 1 to see the value after processor 0's write, but at the same time have processor 3 see the value after processor 2's write (or vice versa).
Keep in mind that since intra-processor re-ordering is allowed (see section 8.2.3.5) processor's 0 and 2 may see things differently.

Ouch, this is a tough question! But I'll try...
the writes go no deeper than L2
Basically this is impossible since Intel uses inclusive caches. Any data written to L1, will also takes place in L2 and L3, unless you prevent from caching by disabling them through CR0/MTRR.
That being said, I guess there are arbitration mechanisms: processors issue a request to write data and an arbiter selects which request is granted from among the pending requests from each of the request queues. The selected requests are broadcasted to the snoopers, and to caches then. I suppose it would prevent from race, enforcing the consistent order seen by processors other than the one performing the request.

How can i know my array is in cache?

Lets say my array is 32KB, L1 is 64 KB. Does Windows use some of it while program is running? Maybe I am not able to use L1 because windows is making other programs work? Should I set priority of my program to use all cache?
for(int i=0;i<8192;i++)
{
array_3[i]+=clock()*(rand()%256);//clock() and rand in cache too?
//how many times do I need to use a variable to make it stay in cache?
//or cache is only for reading? look below plz
temp_a+=array_x[i]*my_function();
}
The program is in C/C++.
Same thing for L2 too please.
Also are functions kept in cache? Cache is read only? (If I change my array then it loses the cache bond?)
Does the compiler create the asm codes to use cache more yield?
Thanks

How can i know my array is in cache?
In general, you can't. Generally speaking, the cache is managed directly by hardware, not by Windows. You also can't control whether data resides in the cache (although it is possible to specify that an area of memory shouldn't be cached).
Does windows use some of it while program is running? Maybe i am not able to use L1 because windows is making other programs work? Should i set priority of my program to use all cache?
The L1 and L2 caches are shared by all processes running on a given core. When your process is running, it will use all of cache (if it needs it). When there's a context switch, some or all of the cache will be evicted, depending on what the second process needs. So next time there's a context switch back to your process, the cache may have to be refilled all over again.
But again, this is all done automatically by the hardware.
also functions are kept in cache?
On most modern processors, there is a separate cache for instructions. See e.g. this diagram which shows the arrangement for the Intel Nehalem architecture; note the shared L2 and L3 caches, but the separate L1 caches for instructions and data.
cache is read only?(if i change my array then it loses the cache bond?)
No. Caches can handle modified data, although this is considerably more complex (because of the problem of synchronising multiple caches in a multi-core system.)
does the compiler create the asm codes to use cache more yield?
As cache activity is generally all handled automatically by the hardware, no special instructions are needed.

Cache is not directly controlled by the operating system, it is done
in hardware
In case of a context switch, another application may modify the
cache, but you should not care about this. It is more important to
handle cases when your program behaves cache unfriendly.
Functions are kept in cache (I-Cahce , instruction cache)
Cache is not read only, when you write something it goes to [memory
and] the cache.

The cache is primarily controlled by the hardware. However, I know that Windows scheduler tends to schedule execution of a thread to the same core as before specifically because of the caches. It understands that it will be necessary to reload them on another core. Windows is using this behavior at least since Windows 2000.

As others have stated, you generally cannot control what is in cache. If you are writing code for high-performance and need to rely on cache for performance, then it is not uncommon to write your code so that you are using about half the space of L1 cache. Methods for doing so involve a great deal of discussion beyond the scope of StackOverflow questions. Essentially, you would want to do as much work as possible on some data before moving on to other data.
As a matter of what works practically, using about half of cache leaves enough space for other things to occur that most of your data will remain in cache. You cannot rely on this without cooperation from the operating system and other aspects of the computing platform, so it may be a useful technique for speeding up research calculations but it cannot be used where real-time performance must be guaranteed, as in operating dangerous machinery.
There are additional caveats besides how much data you use. Using data that maps to the same cache lines can evict data from cache even though there is plenty of cache unused. Matrix transposes are notorious for this, because a matrix whose row length is a multiple of a moderate power of two will have columns in which elements map to a small set of cache lines. So learning to use cache efficiently is a significant job.

As far as I know, you can't control what will be in the cache. You can declare a variable as register var_type a and then access to it will be in a single cycle(or a small number of cycles). Moreover, the amount of cycles it will take you to access a chunk of memory also depends on virtual memory translation and TLB.
It should be noted that the register keyword is merely a suggestion and the compiler is perfectly free to ignore it, as was suggested by the comment.

Even though you may not know which data is in cache and which not, you still may get an idea how much of the cache you are utilizing. Modern processor have quite many performance counters and some of them related to cache. Intel's processors may tell you how many L1 and L2 misses there were. Check this for more details of how to do it: How to read performance counters on i5, i7 CPUs

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js