2-bit bit-field array effects on performance and cache efficiency? - c++

I am in need of a 2-bit array. I am not concerned with saving memory at all, but I am concerned with minimizing cache misses and maximizing cache efficiency. Using an array of bools will use 4 times more memory, which means that for every usable chunk of data in the cache, there will be 3 that are not used. So technically, I can fit 4 times as much useful data in the cache if I use bit-fields.
The plan is to implement it as an array of bytes, each divided into 4 equal bit-fields, and use the div function to get the integral quotient and remainder (possibly in a single operation) and use those to access the right byte and the right bit-field.
The array I need is about 10000 elements long, so it makes for significantly denser packed data; using 2 actual bits allows the entire array to fit in the L1 cache, while with a byte array this would not be possible.
So my question is whether someone can tell me if this is a good idea in a performance-oriented task, so I know whether it is worth going ahead and implementing a 2-bit array. And surely, the best way to know is profiling, but any information in advance would be useful and appreciated.
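For concreteness, a minimal sketch of the packing I have in mind (the class and member names are just placeholders):

#include <cstddef>
#include <cstdint>
#include <cstdlib>   // std::div
#include <vector>

// Packs 4 two-bit entries per byte; ~10000 entries -> ~2500 bytes.
struct TwoBitArray {
    std::vector<std::uint8_t> bytes;

    explicit TwoBitArray(std::size_t n) : bytes((n + 3) / 4, 0) {}

    unsigned get(std::size_t i) const {
        std::div_t d = std::div(static_cast<int>(i), 4);   // quotient = byte, remainder = field
        return (bytes[d.quot] >> (2 * d.rem)) & 0x03u;
    }
    void set(std::size_t i, unsigned v) {
        std::div_t d = std::div(static_cast<int>(i), 4);
        bytes[d.quot] = static_cast<std::uint8_t>(
            (bytes[d.quot] & ~(0x03u << (2 * d.rem))) | ((v & 0x03u) << (2 * d.rem)));
    }
};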

With 10000 elements, on a modern processor, it should fit nicely in the cache as bytes (10KB), so I wouldn't worry too much about it, unless you want this to run on some very tiny microprocessor with a cache that is much smaller than the typical 16-32KB L1 caches that modern CPUs have.
Of course, you may well want to TEST the performance with different solutions, if you think this is an important part of your code from a performance perspective [as measured from your profiling that you've already done before you start optimising, right?].

It's not clear to me that this will result in a performance gain. Accessing each field will require several instructions ((data[i / 4] >> 2 * (i % 4)) & 0x03), and a lot of modern processors have an L3 cache which would hold the entire array with one byte per entry. Whether the extra cost in execution time will be greater or less than the difference in caching is hard to say; you'll have to profile to know exactly.
If you can organize your algorithms to work a byte (or even a word) at a time, the cost of access may be much less. Iterating over the entire array, for example:
for ( int i = 0; i < 10000; i += 4 ) {
    unsigned char w1 = data[ i / 4 ];
    for ( int j = 0; j < 4; ++ j ) {
        unsigned char w2 = w1 & 0x03;
        // w2 is entry i + j...
        w1 >>= 2;
    }
}
could make a significant difference. Most compilers will be able to keep w1 and w2 in registers, meaning you'll only have 1/4 as many memory accesses. Packing with unsigned int would probably be even faster.
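To illustrate that last suggestion, a rough sketch of the same loop packing 16 entries per 32-bit word (hypothetical, just to show the shape of the code):

#include <cstddef>
#include <cstdint>

// data32 points to the same bits viewed as 32-bit words: 16 two-bit entries per word.
void scan(const std::uint32_t* data32, std::size_t n_entries) {
    for (std::size_t i = 0; i < n_entries; i += 16) {
        std::uint32_t w = data32[i / 16];
        for (int j = 0; j < 16; ++j) {
            unsigned entry = w & 0x03u;   // entry i + j
            (void)entry;                  // ... use entry here ...
            w >>= 2;
        }
    }
}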

Related

How to set bits of a bit vector efficiently in parallel?

Consider a bit vector of N bits in it (N is large) and an array of M numbers (M is moderate, usually much smaller than N), each in range 0..N-1 indicating which bit of the vector must be set to 1. The latter array is not sorted. The bit vector is just an array of integers, specifically __m256i, where 256 bits are packed into each __m256i structure.
How can this work be split efficiently across multiple threads?
Preferred language is C++ (MSVC++2017 toolset v141), assembly is also great. Preferred CPU is x86_64 (intrinsics are ok). AVX2 is desired, if any benefit from it.
Let's assume you want to divide this work up among T threads. It's a pretty interesting problem since it isn't trivially parallelizable via partitioning and various solutions may apply for different sizes of N and M.
Fully Concurrent Baseline
You could simply divide up the array M into T partitions and have each thread work on its own partition of M with a shared N. The main problem is that since M is not sorted, all threads may access any element of N and hence stomp on each other's work. To avoid this, you'd have to use atomic operations such as std::atomic::fetch_or for each modification of the shared N array, or else come up with some locking scheme. Both approaches are likely to kill performance (i.e., using an atomic operation to set a bit is likely to be an order of magnitude slower than the equivalent single-threaded code).
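For reference, a minimal sketch of that fully concurrent baseline, assuming for illustration that N is viewed as an array of std::atomic<std::uint64_t> words rather than __m256i:

#include <atomic>
#include <cstddef>
#include <cstdint>

// Each thread processes its own slice [begin, end) of M; all threads share `bits`.
void set_bits_atomic(std::atomic<std::uint64_t>* bits,
                     const std::uint32_t* M, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        std::uint32_t idx = M[i];
        bits[idx / 64].fetch_or(std::uint64_t(1) << (idx % 64),
                                std::memory_order_relaxed);
    }
}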
Let's look at ideas that are likely faster.
Private N
One relatively obvious idea to avoid the "shared N" problem, which requires atomic operations for all mutations of N, is simply to give each thread a private copy of N and merge them at the end via or.
Unfortunately, this solution is O(N) + O(M/T) whereas the original single-threaded solution is O(M) and the "atomic" solution above is something like O(M/T) [4]. Since we know that N >> M this is likely to be a poor tradeoff in this case. Still, it's worth noting that the hidden constants in each term are very different: the O(N) term, which comes from the merging step [0], can use 256-bit wide vpor instructions, meaning a throughput of something close to 200-500 bits/cycle (if cached), while the bit-setting step, which is O(M/T), I estimate at closer to 1 bit/cycle. So this approach can certainly be the best one for moderate T even if the size of N is 10 or 100 times the size of M.
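A sketch of the merge step under this scheme (viewing the bitmaps as plain 64-bit words for simplicity; a vectorizing compiler, or explicit vpor intrinsics, can widen the inner loop):

#include <cstddef>
#include <cstdint>

// OR each thread's private bitmap into the shared result. n_words = N / 64.
void merge_bitmaps(std::uint64_t* result,
                   const std::uint64_t* const* private_maps,
                   std::size_t n_threads, std::size_t n_words) {
    for (std::size_t t = 0; t < n_threads; ++t)
        for (std::size_t w = 0; w < n_words; ++w)
            result[w] |= private_maps[t][w];
}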
Partitions of M
The basic idea here is to partition the indexes in M such that each worker thread can then work on a disjoint part of the N array. If M was sorted, that would be trivial, but it's not, so...
A simple algorithm that will work well if M is smoothly distributed is to first partition the values of M into T buckets, with the buckets having values in the ranges [0, N/T), [N/T, 2N/T), ..., [(T-1)N/T, N). That is, divide N into T disjoint regions and then find the values of M that fall into each of them. You can spread that work across the T threads by assigning each thread an equal-size chunk of M, having each of them create the T partitions, and then logically merging [1] them at the end so you have the T partitions of M.
The second step is to actually set all the bits: you assign one partition to each thread T, which can set the bits in a "single threaded" way, i.e., not worrying about concurrent updates, since each thread is working on a disjoint partition of N [2].
Both steps are O(M), and the second step is identical to the single-threaded case, so the overhead of parallelizing this is the first step. I suspect the first step will range from about the same speed as the second to perhaps 2-4 times as slow, depending on implementation and hardware, so you can expect a speedup on a machine with many cores, but with only 2 or 4 it might not be any better.
If the distribution of M is not smooth, such that the partitions created in the first step have very different sizes, it will work poorly because some threads will get a lot more work. A simple strategy is to create, say, 10 * T partitions rather than only T, and have the threads in the second pass all consume from the same queue of partitions until complete. In this way you spread the work more evenly, unless the array M is very bunched up. In that case you might consider a refinement of the first step which first essentially creates a bucketed histogram of the elements, and then a reduce stage which looks at the combined histogram to create a good partitioning.
Essentially, we are just progressively refining the first stage into a type of parallel sort/partitioning algorithm, for which there is already lots of literature. You might even find that a full (parallel) sort is fastest, since it will greatly help in the bit-setting phase, since accesses will be in-order and have the best spatial locality (helping with prefetching and caching, respectively).
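A sketch of the first-stage bucketing (single-threaded for brevity, indices as 32-bit values; the per-thread chunking and merging described above are omitted):

#include <cstddef>
#include <cstdint>
#include <vector>

// Partition the indices in M into T buckets by the region of N they touch.
// Each bucket can then be handled by one thread without synchronization.
std::vector<std::vector<std::uint32_t>>
partition_indices(const std::uint32_t* M, std::size_t m, std::uint64_t N, unsigned T) {
    std::vector<std::vector<std::uint32_t>> buckets(T);
    const std::uint64_t region = (N + T - 1) / T;   // size of each disjoint region of N
    for (std::size_t i = 0; i < m; ++i)
        buckets[M[i] / region].push_back(M[i]);
    return buckets;
}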
[0] ... and also from the "allocate a private array of length N" step, although this is likely to be quite fast.
[1] The conceptually simplest form of merging would be to simply copy each thread's partitions of M such that you have a contiguous partition of all of M, but in practice if the partitions are large you can just leave the partitions where they are and link them together, adding some complexity to the consuming code, but avoiding the compacting step.
[2] To make it truly disjoint from a threading point of view, you want to ensure the partition of N falls on "byte boundaries", and perhaps even cache-line boundaries to avoid false sharing (although the latter is likely not to be a big problem since it only occurs at the edge of each partition, and the order of processing means that you are not likely to get contention).
[4] In practice, the exact "order" of the baseline concurrent solution using shared N is hard to define because there will be contention, so the O(M/T) scaling will break down for large enough T. If we assume N is quite large and T is limited to typical hardware concurrency of at most a dozen cores or so, it's probably an OK approximation.
@IraBaxter posted an interesting but flawed idea which can be made to work (at significant cost). I suspect @BeeOnRope's idea of partial-sort / partitioning the M array will perform better (especially for CPUs with large private caches which can keep parts of N hot). I'll summarize the modified version of Ira's idea that I described in comments on his deleted answer. (That answer has some suggestions about how big N has to be before it's worth multi-threading.)
Each writer thread gets a chunk of M with no sorting/partitioning.
The idea is that conflicts are very rare because N is large compared to the number of stores that can be in flight at once. Since setting a bit is idempotent, we can handle conflicts (where two threads want to set different bits in the same byte) by checking the value in memory to make sure it really does have the bit set that we want after a RMW operation like or [N + rdi], al (with no lock prefix).
E.g. thread 1 tried to store 0x1 and stepped on thread 2's store of 0x2. Thread 2 must notice and retry the read-modify-write (probably with lock or, to keep it simple and make further retries unnecessary) to end up with 0x3 in the conflict byte.
We need an mfence instruction before the read-back. Otherwise store-forwarding will give us the value we just wrote before other threads see our store. In other words, a thread can observe its own stores earlier than they appear in the global order. x86 does have a Total Order for stores, but not for loads. Thus, we need mfence to prevent StoreLoad reordering. (Intel's "Loads Are not Reordered with Older Stores to the Same Location" guarantee is not as useful as it sounds: store/reload isn't a memory barrier; they're just talking about out-of-order execution preserving program-order semantics.)
mfence is expensive, but the trick that makes this better than just using lock or [N+rdi], al is that we can batch operations. e.g. do 32 or instructions and then 32 read-back. It's a tradeoff between mfence overhead per operation vs. increased chance of false-sharing (reading back cache lines that had already been invalidated by another CPU claiming them).
Instead of an actual mfence instruction, we can do the last or of a group as a lock or. This is better for throughput on both AMD and Intel. For example, according to Agner Fog's tables, mfence has one per 33c throughput on Haswell/Skylake, where lock add (same performance as or) has 18c or 19c throughput. Or for Ryzen, ~70c (mfence) vs. ~17c (lock add).
If we keep the amount of operations per fence very low, the array index (m[i]/8) + mask (1<<(m[i] & 7)) can be kept in registers for all the operations. This probably isn't worth it; fences are too expensive to do as often as every 6 or operations. Using the bts and bt bit-string instructions would mean we could keep more indices in registers (because no shift-result is needed), but probably not worth it because they're slow.
Using vector registers to hold indices might be a good idea, to avoid having to reload them from memory after the barrier. We want the load addresses to be ready as soon as the read-back load uops can execute (because they're waiting for the last store before the barrier to commit to L1D and become globally visible).
Using single-byte read-modify-write makes actual conflicts as unlikely as possible. Each write of a byte only does a non-atomic RMW on 7 neighbouring bits. Performance still suffers from false-sharing when two threads modify bytes in the same 64B cache-line, but at least we avoid having to actually redo as many or operations. 32-bit element size would make some things more efficient (like using xor eax,eax / bts eax, reg to generate 1<<(m[i] & 31) with only 2 uops, or 1 for BMI2 shlx eax, r10d, reg (where r10d=1).)
Avoid the bit-string instructions like bts [N], eax: it has worse throughput than doing the indexing and mask calculation for or [N + rax], dl. This is the perfect use-case for it (except that we don't care about the old value of the bit in memory, we just want to set it), but still its CISC baggage is too much.
In C, a function might look something like
/// UGLY HACKS AHEAD, for testing only.
// #include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
void set_bits( volatile uint8_t * restrict N, const unsigned *restrict M, size_t len)
{
    const int batchsize = 32;

    // FIXME: loop bounds should be len-batchsize or something.
    for (int i = 0 ; i < len ; i+=batchsize ) {
        for (int j = 0 ; j<batchsize-1 ; j++ ) {
            unsigned idx = M[i+j];
            unsigned mask = 1U << (idx&7);
            idx >>= 3;
            N[idx] |= mask;
        }

        // do the last operation of the batch with a lock prefix as a memory barrier.
        // seq_cst RMW is probably a full barrier on non-x86 architectures, too.
        unsigned idx = M[i+batchsize-1];
        unsigned mask = 1U << (idx&7);
        idx >>= 3;
        __atomic_fetch_or(&N[idx], mask, __ATOMIC_SEQ_CST);
        // _mm_mfence();

        // TODO: cache `M[]` in vector registers
        for (int j = 0 ; j<batchsize ; j++ ) {
            unsigned idx = M[i+j];
            unsigned mask = 1U << (idx&7);
            idx >>= 3;
            if (! (N[idx] & mask)) {
                __atomic_fetch_or(&N[idx], mask, __ATOMIC_RELAXED);
            }
        }
    }
}
This compiles to approximately what we want with gcc and clang. The asm (Godbolt) could be more efficient in several ways, but it might be interesting to try this. This is not safe: I just hacked this together in C to get the asm I wanted for this stand-alone function, without inlining into a caller or anything. __atomic_fetch_or is not a proper compiler barrier for non-atomic variables the way asm("":::"memory") is. (At least the C11 stdatomic version isn't.) I should probably have used the legacy __sync_fetch_and_or, which is a full barrier for all memory operations.
It uses GNU C atomic builtins to do atomic RMW operations where desired on variables that aren't atomic_uint8_t. Running this function from multiple threads at once would be C11 UB, but we only need it to work on x86. I used volatile to get the asynchronous-modification-allowed part of atomic without forcing N[idx] |= mask; to be atomic. The idea is to make sure that the read-back checks don't optimize away.
I use __atomic_fetch_or as a memory barrier because I know it will be on x86. With seq_cst, it probably will be on other ISAs, too, but this is all a big hack.
There are a number of operations involved in sets (A, B = sets, X = an element of a set):
Set operation                 Instruction
---------------------------------------------
Intersection of A,B           A and B
Union of A,B                  A or B
Symmetric difference of A,B   A xor B
A is subset of B              A and B = A
A is superset of B            A and B = B
A <> B                        A xor B <> 0
A = B                         A xor B = 0
X in A                        BT  [A],X
Add X to A                    BTS [A],X
Remove X from A               BTR [A],X
Given the fact that you can use the boolean operators to replace set operations, you can use VPXOR, VPAND etc.
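In C++ with intrinsics, the whole-set operations might look roughly like this (a sketch, assuming the sets are stored as 32-byte-aligned arrays of __m256i):

#include <immintrin.h>
#include <cstddef>

// Whole-set operations on bitsets stored as arrays of __m256i (256 elements per word).
void set_union(__m256i* out, const __m256i* a, const __m256i* b, std::size_t n_words) {
    for (std::size_t i = 0; i < n_words; ++i)
        out[i] = _mm256_or_si256(a[i], b[i]);    // union:        A or B
}

void set_intersection(__m256i* out, const __m256i* a, const __m256i* b, std::size_t n_words) {
    for (std::size_t i = 0; i < n_words; ++i)
        out[i] = _mm256_and_si256(a[i], b[i]);   // intersection: A and B
}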
To set, reset or test individual bits you simply use
mov eax,BitPosition
BT [rcx],rax
You can test whether a set is empty (or equal to something else) using the following code
vpxor     ymm0,ymm0,ymm0   //ymm0 = 0
//replace the previous instruction with something else if you don't want
//to compare to zero.
vpcmpeqq  ymm1,ymm0,[mem]  //compare mem qwords to 0, per qword
vpmovmskb eax,ymm1         //collect one mask bit per byte of the compare result
cmp       eax,-1           //all ones -> every qword was 0
//if the set is empty, ZF is set; if it's not, ZF is clear.
(I'm sure this code can be improved using the blend/gather etc instructions)
From here you can just extend to bigger sets or other operations.
Note that bt, btc, bts with a memory operand are not limited to 64 bits.
The following will work just fine.
mov eax,1023
bts [rcx],rax //set the 1024th element (the first element is 0).

Roofline model: calculating operational intensity

Say I have a toy loop like this
float x[N];
float y[N];
for (int i = 1; i < N-1; i++)
    y[i] = a*(x[i-1] - x[i] + x[i+1]);
And I assume my cache line is 64 Byte (i.e. big enough). Then I will have (per iteration) basically 2 accesses to the RAM and 3 FLOP:
1 (cached) read access: loading all 3 x[i-1], x[i], x[i+1]
1 write access: storing y[i]
3 FLOP (1 mul, 1 add, 1 sub)
The operational intensity is ergo
OI = 3 FLOP/(2 * 4 BYTE)
Now what happens if I do something like this
float x[N];
for (int i = 1; i < N-1; i++)
    x[i] = a*(x[i-1] - x[i] + x[i+1]);
Note that there is no y anymore. Does it mean now that I have a single RAM access
1 (cached) read/write: loading x[i-1], x[i], x[i+1], storing x[i]
or still 2 RAM accesses
1 (cached) read: loading x[i-1], x[i], x[i+1]
1 (cached) write: storing x[i]
Because the operational intensity OI would be different in either case. Can anyone tell something about this? Or maybe clarify some things. Thanks
Disclaimer: I've never heard of the roofline performance model until today. As far as I can tell, it attempts to calculate a theoretical bound on the "arithmetic intensity" of an algorithm, which is the number of FLOPS per byte of data accessed. Such a measure may be useful for comparing similar algorithms as the size of N grows large, but is not very helpful for predicting real-world performance.
As a general rule of thumb, modern processors can execute instructions much more quickly than they can fetch/store data (this becomes drastically more pronounced as the data starts to grow larger than the size of the caches). So contrary to what one might expect, a loop with higher arithmetic intensity may run much faster than a loop with lower arithmetic intensity; what matters most as N scales is the total amount of data touched (this will hold true as long as memory remains significantly slower than the processor, as is true in common desktop and server systems today).
In short, x86 CPUs are unfortunately too complex to be accurately described with such a simple model. An access to memory goes through several layers of caching (typically L1, L2, and L3) before hitting RAM. Maybe all your data fits in L1 -- the second time you run your loop(s) there could be no RAM accesses at all.
And there's not just the data cache. Don't forget that code is in memory too and has to be loaded into the instruction cache. Each read/write is also done from/to a virtual address, which is supported by the hardware TLB (that can in extreme cases trigger a page fault and, say, cause the OS to write a page to disk in the middle of your loop). All of this is assuming your program is hogging the hardware all to itself (in non-realtime OSes this is simply not the case, as other processes and threads are competing for the same limited resources).
Finally, the execution itself is not (directly) done with memory reads and writes, but rather the data is loaded into registers first (then the result is stored).
How the compiler allocates registers, if it attempts loop unrolling, auto-vectorization, the instruction scheduling model (interleaving instructions to avoid data dependencies between instructions) etc. will also all affect the actual throughput of the algorithm.
So, finally, depending on the code produced, the CPU model, the amount of data processed, and the state of various caches, the latency of the algorithm will vary by orders of magnitude. Thus, the operational intensity of a loop cannot be determined by inspecting the code (or even the assembly produced) alone, since there are many other (non-linear) factors in play.
To address your actual question, though, as far as I can see by the definition outlined here, the second loop would count as a single additional 4-byte access per iteration on average, so its OI would be θ(3N FLOPS / 4N bytes). Intuitively, this makes sense because the cache already has the data loaded, and the write can change the cache directly instead of going back to main memory (the data does eventually have to be written back, however, but that requirement is unchanged from the first loop).
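Putting rough numbers on that definition: the first loop gives OI = 3 FLOP / (2 * 4 B) = 0.375 FLOP/byte, while the second gives OI = 3 FLOP / 4 B = 0.75 FLOP/byte, i.e. roughly double, under the (idealized) assumption that the cache absorbs the extra traffic as described above.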

Double-checking understanding of memory coalescing in CUDA

Suppose I define some arrays which are visible to the GPU:
double* doubleArr = createCUDADouble(fieldLen);
float* floatArr = createCUDAFloat(fieldLen);
char* charArr = createCUDAChar(fieldLen);
Now, I have the following CUDA thread:
void thread(){
    int o = getOffset(); // the same for all threads in launch
    double d = doubleArr[threadIdx.x + o];
    float f = floatArr[threadIdx.x + o];
    char c = charArr[threadIdx.x + o];
}
I'm not quite sure whether I am correctly interpreting the documentation, and it's very critical for my design: will the memory accesses for double, float and char be nicely coalesced? (Guess: yes, it will fit into sizeof(type) * blockSize.x / (transaction size) transactions, plus maybe one extra transaction at the upper and lower boundary.)
Yes, for all the cases you have shown, and assuming createCUDAxxxxx translates into some kind of ordinary cudaMalloc type operation, everything should nicely coalesce.
If we have ordinary 1D device arrays allocated via cudaMalloc, in general we should have good coalescing behavior across threads if our load pattern includes an array index of the form:
data_array[some_constant + threadIdx.x];
It really does not matter what data type the array is - it will coalesce nicely.
However, from a performance perspective, global loads (assuming an L1 miss) will occur in a minimum 128-byte granularity. Therefore loading larger sizes per thread (say, int, float, double, float4, etc.) may give slightly better performance. The caches tend to mitigate any difference, if the loads are across a large enough number of warps.
It's pretty easy also to verify this on a particular piece of code with a profiler. There are many ways to do this depending on which profiler you choose, but for example with nvprof you can do:
nvprof --metrics gld_efficiency ./my_exe
and it will return an average percentage number that more or less exactly reflects the percentage of optimal coalescing that is occurring on global loads.
This is the presentation I usually cite for additional background info on memory optimization.
I suppose someone will come along and notice that this pattern:
data_array[some_constant + threadIdx.x];
roughly corresponds to the access type shown on slides 40-41 of the above presentation. And aha!! efficiency drops to 50%-80%. That is true, if only a single warp-load is being considered. However, referring to slide 40, we see that the "first" load will require two cachelines to be loaded. After that however, additional loads (moving to the right, for simplicity) will only require one additional/new cacheline per warp-load (assuming the existence of an L1 or L2 cache, and reasonable locality, i.e. lack of thrashing). Therefore, over a reasonably large array (more than just 128 bytes), the average requirement will be one new cacheline per warp, which corresponds to 100% efficiency.

iteration direction on an array

Say we have two arrays a and b of a fundamental type (say, a float) and we need to calculate a[i] + b[i] for every valid index i, as well as store the result. What is the best way to iterate over the arrays to maximize cache hits? Is it front-to-back, back-to-front or something else?
For this kind of operation you should use the auto-vectorization of your compiler. Iterate from small i to large i. Also, the answer depends on what you mean by "store the result" and the number n of items you are going to iterate over.
If you mean c[i] = a[i] + b[i] and n is not too small then your compiler's auto-vectorizer will optimize this best without any more changes. Even MSVC will get that one correct (at least for SSE). Your compiler will have to do some adjustments for n not a multiple of 4 (or 8 for AVX) and alignment but this cost will be amortized across n and this overhead will have a negligible effect except for small n. If n is small then you might want to consider alignment. How small is small has to be determined but I would guess it's much less than 100.
If you mean sum += a[i] + b[i], a reduction, then you do need to think about this. This has a dependency chain, so you need to unroll your loop 3-10 times. Additionally, you need to use a relaxed floating-point model, since floating-point arithmetic is not associative and auto-vectorization won't kick in without it, so add -ffast-math to GCC (/fp:fast to MSVC). If you unroll the loop and use a relaxed floating-point model then GCC, ICC, Clang, and MSVC should auto-vectorize your reduction efficiently.
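A sketch of what that unrolling looks like by hand, using four accumulators to break the dependency chain (the compiler does something similar under -ffast-math):

#include <cstddef>

// Reduction with four independent accumulators; assumes n is a multiple of 4
// for brevity (a real version would handle the remainder).
float sum_pairs(const float* a, const float* b, std::size_t n) {
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    for (std::size_t i = 0; i < n; i += 4) {
        s0 += a[i]     + b[i];
        s1 += a[i + 1] + b[i + 1];
        s2 += a[i + 2] + b[i + 2];
        s3 += a[i + 3] + b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}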
In order to utilize the cache pre-fetch capability you need to read the arrays from front to back sequentially.
Furthermore, the arrays should be SSE aligned (16 byte). Even more important is that the items (e.g. floats) will be aligned on their size (4 bytes for floats). This is important so data will not cross cache lines (slower read).
After the arrays are aligned, you can use SSE/AVX to read, add and store the results doing 4 or 8 operations in a single instruction.
Edit:
You can read more on cache prefetching here and in depth description in the Intel SW Developer Manual.

How to find the size of the L1 cache line size with IO timing measurements?

As a school assignment, I need to find a way to get the L1 data cache line size, without reading config files or using API calls. I am supposed to use memory access read/write timings to analyze and get this info. So how might I do that?
In an incomplete attempt at another part of the assignment, to find the levels & sizes of the cache, I have:
for (i = 0; i < steps; i++) {
    arr[(i * 4) & lengthMod]++;
}
I was thinking maybe I just need to vary line 2, the (i * 4) part? So once I exceed the cache line size, I might need to replace it, which takes some time? But is it so straightforward? The required block might already be in memory somewhere? Or perhaps I can still count on the fact that if I have a large enough steps, it will still work out quite accurately?
UPDATE
Here's an attempt on GitHub ... main part below
// repeatedly access/modify data, varying the STRIDE
for (int s = 4; s <= MAX_STRIDE/sizeof(int); s *= 2) {
    start = wall_clock_time();
    for (unsigned int k = 0; k < REPS; k++) {
        data[(k * s) & lengthMod]++;
    }
    end = wall_clock_time();
    timeTaken = ((float)(end - start))/1000000000;
    printf("%d, %1.2f \n", s * sizeof(int), timeTaken);
}
The problem is that there don't seem to be many differences between the timings. FYI, this is for the L1 cache, and I have SIZE = 32 K (size of the array).
Allocate a BIG char array (make sure it is too big to fit in L1 or L2 cache). Fill it with random data.
Start walking over the array in steps of n bytes. Do something with the retrieved bytes, like summing them.
Benchmark and calculate how many bytes/second you can process with different values of n, starting from 1 and counting up to 1000 or so. Make sure that your benchmark prints out the calculated sum, so the compiler can't possibly optimize the benchmarked code away.
When n == your cache line size, each access will require reading a new line into the L1 cache. So the benchmark results should get slower quite sharply at that point.
If the array is big enough, by the time you reach the end, the data at the beginning of the array will already be out of cache again, which is what you want. So after you increment n and start again, the results will not be affected by having needed data already in the cache.
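A rough sketch of that benchmark in C++ (the buffer size and timing mechanism are arbitrary choices here, and the stride is doubled rather than counted up one by one for brevity):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
    const std::size_t size = 64 * 1024 * 1024;   // much bigger than L1/L2
    std::vector<unsigned char> buf(size);
    for (auto& b : buf) b = static_cast<unsigned char>(std::rand());

    for (std::size_t stride = 1; stride <= 1024; stride *= 2) {
        unsigned long long sum = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < size; i += stride)
            sum += buf[i];
        auto t1 = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(t1 - t0).count();
        double bytes_read = double(size) / double(stride);   // one byte per access
        // print the sum so the loop cannot be optimized away
        std::printf("stride %4zu: %.1f MB/s (sum=%llu)\n",
                    stride, bytes_read / secs / 1e6, sum);
    }
}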
Have a look at Calibrator. All of the work is copyrighted but the source code is freely available. From its documentation, the idea it uses to calculate cache line sizes sounds much more educated than what has already been said here.
The idea underlying our calibrator tool is to have a micro benchmark whose performance only depends on the frequency of cache misses that occur. Our calibrator is a simple C program, mainly a small loop that executes a million memory reads. By changing the stride (i.e., the offset between two subsequent memory accesses) and the size of the memory area, we force varying cache miss rates.
In principle, the occurrence of cache misses is determined by the array size. Array sizes that fit into the L1 cache do not generate any cache misses once the data is loaded into the cache. Analogously, arrays that exceed the L1 cache size but still fit into L2 will cause L1 misses but no L2 misses. Finally, arrays larger than L2 cause both L1 and L2 misses.
The frequency of cache misses depends on the access stride and the cache line size. With strides equal to or larger than the cache line size, a cache miss occurs with every iteration. With strides smaller than the cache line size, a cache miss occurs only every n iterations (on average), where n is the ratio cache line size / stride.
Thus, we can calculate the latency for a cache miss by comparing the execution time without misses to the execution time with exactly one miss per iteration. This approach only works if memory accesses are executed purely sequentially, i.e., we have to ensure that neither two or more load instructions nor memory access and pure CPU work can overlap. We use a simple pointer chasing mechanism to achieve this: the memory area we access is initialized such that each load returns the address for the subsequent load in the next iteration. Thus, super-scalar CPUs cannot benefit from their ability to hide memory access latency by speculative execution.
To measure the cache characteristics, we run our experiment several times, varying the stride and the array size. We make sure that the stride varies at least between 4 bytes and twice the maximal expected cache line size, and that the array size varies from half the minimal expected cache size to at least ten times the maximal expected cache size.
I had to comment out #include "math.h" to get it to compile; after that it found my laptop's cache values correctly. I also couldn't view the generated PostScript files.
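A minimal sketch of the pointer-chasing idea described in the quoted text (using indices rather than raw addresses; the permutation seed and hop count are placeholders):

#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Pointer chasing: next[] is a single-cycle permutation (Sattolo's algorithm),
// so every load depends on the previous one and the CPU cannot overlap them.
std::size_t chase(std::size_t n_elems, std::size_t n_hops) {
    std::vector<std::size_t> next(n_elems);
    std::iota(next.begin(), next.end(), 0);
    std::mt19937 rng{42};
    for (std::size_t i = n_elems - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> d(0, i - 1);
        std::swap(next[i], next[d(rng)]);
    }

    std::size_t idx = 0;
    for (std::size_t i = 0; i < n_hops; ++i)
        idx = next[idx];   // serialized: each load's address comes from the previous load
    return idx;            // returned so the chain is not optimized away
}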
You can use the CPUID instruction in assembler; although non-portable, it will give you what you want.
For Intel microprocessors, the cache line size can be calculated by multiplying bh by 8 after calling cpuid function 0x1.
For AMD microprocessors, the data cache line size is in cl and the instruction cache line size is in dl after calling cpuid function 0x80000005.
I took this from this article here.
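For example, with GCC or Clang the Intel variant might be queried like this (a sketch using the <cpuid.h> helper; the AMD leaf 0x80000005 would be read analogously):

#include <cpuid.h>
#include <cstdio>

int main() {
    unsigned eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        // Bits 15:8 of EBX hold the CLFLUSH line size in 8-byte units.
        unsigned line_size = ((ebx >> 8) & 0xff) * 8;
        std::printf("cache line size: %u bytes\n", line_size);
    }
    return 0;
}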
I think you should write a program that walks through the array in random order instead of straight through, because modern processors do hardware prefetching.
For example, make an array of int whose values are the index of the next cell to visit.
I wrote a similar program a year ago: http://pastebin.com/9mFScs9Z
Sorry for my English, I am not a native speaker.
See how memtest86 is implemented. It measures and analyzes the data transfer rate in some way; the points where the rate changes correspond to the sizes of the L1, L2 and possibly L3 caches.
If you get stuck in the mud and can't get out, look here.
There are manuals and code that explain how to do what you're asking. The code is pretty high quality as well. Look at "Subroutine library".
The code and manuals are based on X86 processors.
Just a note: the cache line size is variable on a few ARM Cortex families and can change during execution without any notification to the current program.
I think it should be enough to time an operation that uses some amount of memory, then progressively increase the memory (operands, for instance) used by the operation.
When the operation's performance severely decreases, you have found the limit.
I would go with just reading a bunch of bytes without printing them (printing would hit the performance so badly that it would become a bottleneck). While reading, the timing should be directly proportional to the amount of bytes read until the data no longer fits in L1, at which point you will see the performance hit.
You should also allocate the memory once at the start of the program, and before starting to count time.