CUDA Time Events - C++

I am timing how long my CUDA program takes to calculate matrices of a certain size. For example, 10x10, 100x100, 500x500, 1000x1000.
However, the results are not at all what I was expecting. The numbers I get for the graph are not what I expected: as the matrices grow in size, the computational time decreases.
For example, here is the average time (from 1000 runs):
10x10: 0.032768s
100x100: 0.068960s
500x500: 0.006336s
1000x1000: 0.018400s
The time goes down, then up again at 1000x1000. What is going on? Shouldn't the numbers level off at a certain point? Why does it go up and down like a roller coaster?
Here is how the actual timing code is being run:
int blocksNeeded = 0;
cudaError_t cudaStatus;
blocksNeeded = (size / MAXTHREADS) + 1;
int threadsPerBlock = MAXTHREADS / blocksNeeded + 1;
cudaEvent_t start, stop;
float elapsedtime;
...
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
addKernel<<<blocksNeeded, size>>>(dev_c, dev_a, dev_b, size);
cudaStatus = cudaDeviceSynchronize();
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedtime, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
where MAXTHREADS is 1024 and size is the number of elements in the matrix, i.e. a 10x10 matrix has 100 elements, which is the size.
Updated with kernel:
__global__ void addKernel(float *c, float *a, float *b, int size)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < size)
        c[idx] = a[idx] + b[idx];
}

I ran a test on a recent GPU cluster equipped with NVIDIA Tesla M2090 cards. Basically I'm performing a vector addition with different sizes. The results are:
Size Kernel time (msec)
===========================
2 0.04
4 0.010912
8 0.012128
16 0.012256
32 0.011296
64 0.01248
128 0.012192
256 0.012576
512 0.012416
1024 0.012736
2048 0.01232
4096 0.011968
8192 0.011264
16384 0.007296
32768 0.007776
65536 0.009728
131072 0.018304
262144 0.031392
524288 0.055168
1048576 0.10352
What you can see is that there is a knee at a vector size of 16384, which basically matches your observations. This is not an error but normal behavior, since the GPU has to be sufficiently utilized before it shows its full performance. For the Tesla M2090, the point of full utilization is reached at around 16384 parallel additions.
The way you are measuring kernel performance is perfectly fine. I assume you've taken this from the CUDA "Best Practices Guide".
Notice: please consider that the data shown was generated using a single kernel run per size, i.e. it is not representative. Generally, for exact time measurements the kernel should be run multiple times on the same problem, and the kernel time taken as the mean of the runs.
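A minimal sketch of such an averaging loop, reusing the event and buffer names from the question (the run count of 1000 matches the question's averaging; error checking omitted):

float total = 0.0f;
for (int run = 0; run < 1000; ++run) {
    cudaEventRecord(start, 0);
    addKernel<<<blocksNeeded, threadsPerBlock>>>(dev_c, dev_a, dev_b, size);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);              // wait until the stop event is reached
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // this run's time in milliseconds
    total += ms;
}
float meanMs = total / 1000.0f;              // mean kernel time over 1000 runs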

You must call the kernel with
addKernel<<<blocksNeeded, MAXTHREADS>>>(dev_c, dev_a, dev_b, size);
The second parameter of a kernel launch is the number of threads to launch in each block, not the total number of threads.
At 100x100 (size = 10000) you are already exceeding the maximum number of threads per block, which is 1024 for compute capability 2.x.
I also just noticed that you calculate some kind of threadsPerBlock, which is wrong, and that you never use it. Choose a number of threads per block; then divide the total number of elements to process by it, adding 1 to the result if the remainder is non-zero, and you get the number of blocks to launch.
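A sketch of that calculation (the block size of 256 is an assumed choice, not mandated by anything; any value up to the per-block hardware limit works):

const int threadsPerBlock = 256;  // chosen freely, not derived from size
int blocks = size / threadsPerBlock + (size % threadsPerBlock != 0 ? 1 : 0);
// equivalently: int blocks = (size + threadsPerBlock - 1) / threadsPerBlock;
addKernel<<<blocks, threadsPerBlock>>>(dev_c, dev_a, dev_b, size);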


Why OpenCL work group size has huge performance impact on GPU?

I am benchmarking a simple matrix transposition kernel on a Qualcomm Adreno 630 GPU, and I am trying to see the impact of different work group sizes, but surprisingly I get some interesting results which I cannot explain. Here is my kernel code:
__kernel void transpose(__global float *input, __global float *output, const int width, const int height)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    output[i*height + j] = input[j*width + i];
}
The width and height are both 6400, and the experiment results are as follows (the execution time is the difference between the END and START events):
work group size (x, y)   execution time
4     64                 24 ms
64    4                  169 ms
256   1                  654 ms
1     256                34 ms
8     32                 27 ms
1     1024               375 ms
1024  1                  657 ms
32    32                 26 ms
After this I did another experiment where I changed the width and height from 6400 to 6401 (and the global work size in the NDRangeKernel call as well), and the result is even more interesting:
work group size (x, y)   execution time
4     64                 28 ms
64    4                  105 ms
256   1                  359 ms
1     256                31 ms
8     32                 32 ms
1     1024               99 ms
1024  1                  358 ms
32    32                 32 ms
The execution time of most scenarios drops significantly. I know memory coalescing or caching could play a role here, but I cannot completely explain this.
Memory coalescence occurs when consecutive threads access data at consecutive global memory addresses within a 128-byte aligned segment. Then memory accesses are coalesced into one, significantly reducing overall latency.
In the 2D range, coalescing only happens along get_global_id(1), or the j direction in your case. In the line output[i*height + j] = input[j*width + i];, input[j*width + i] is a misaligned (non-coalesced) read and output[i*height + j] is a coalesced write. Coalesced memory access generally is much faster than misaligned access, but the performance penalty for coalesced/misaligned reads can be vastly different from that for coalesced/misaligned writes. On most desktop GPU architectures, the combination of a misaligned read and a coalesced write is faster than the other way around, so your implementation should already be the faster variant.
Since coalesced access is only possible along the j index, if you have a range of (x=256, y=1) (with i along the x-direction and j along the y-direction), you do not get any coalescing. For (x=8, y=32), j is coalesced in groups of 32, 8 times per thread block, so memory bandwidth is fairly well saturated and performance is good.
If you want maximum possible performance, I'd suggest you go with 1D indexing. This way you have full control about coalescing and coalescing happens over the entire thread block. Your matrix transpose kernel then would look like this:
#define width 6400
#define height 6400

__kernel void transpose(__global float *input, __global float *output) {
    const int n = get_global_id(0);
    int i = n / width;
    int j = n % width;
    output[i*height + j] = input[j*width + i];
}
You can bake width and height into the OpenCL code at C++ runtime, before OpenCL compile time, via string concatenation.
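A minimal sketch of that baking step (the variable names kernelBody, context and err are illustrative and error handling is omitted; clCreateProgramWithSource is the standard OpenCL C API entry point):

std::string defines = "#define width 6400\n#define height 6400\n";
std::string source  = defines + kernelBody;  // kernelBody holds the kernel source without the #defines
const char* src = source.c_str();
size_t      len = source.size();
cl_program program = clCreateProgramWithSource(context, 1, &src, &len, &err);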

How to verify CPU cache line size with C++ code?

I read an article from Igor's blog. The article said:
... today’s CPUs do not access memory byte by byte. Instead, they fetch memory in chunks of (typically) 64 bytes, called cache lines. When you read a particular memory location, the entire cache line is fetched from the main memory into the cache. And, accessing other values from the same cache line is cheap!
The article also provides C# code to verify the above conclusion:
int[] arr = new int[64 * 1024 * 1024];
// Loop 1 (step = 1)
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
// Loop 2 (step = 16)
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
The two for-loops take about the same time, 80 and 78 ms respectively on Igor's machine, so the cache line mechanism is verified.
I then applied the same idea to implement a C++ version to verify the cache line size, as follows:
#include "stdafx.h"
#include <iostream>
#include <chrono>
#include <math.h>
using namespace std::chrono;
const int total_buff_count = 16;
const int buff_size = 32 * 1024 * 1024;
int testCacheHit(int * pBuffer, int size, int step)
{
int result = 0;
for (int i = 0; i < size;) {
result += pBuffer[i];
i += step;
}
return result;
}
int main()
{
int * pBuffer = new int[buff_size*total_buff_count];
for (int i = 0; i < total_buff_count; ++i) {
int step = (int)pow(2, i);
auto start = std::chrono::system_clock::now();
volatile int result = testCacheHit(pBuffer + buff_size*i, buff_size, step);
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;
std::cout << "step: " << step << ", elapsed time: " << elapsed_seconds.count() * 1000 << "ms\n";
}
delete[] pBuffer;
}
But my test result is totally different from the one in Igor's article: if the step is 1, the time cost is about 114 ms; if the step is 16, the time cost is about 78 ms. The test application is built in the release configuration; there is 32 GB of memory on my machine and the CPU is an Intel Xeon E5-2420 v2 @ 2.2 GHz.
The interesting finding is that the time cost decreases significantly when the step is 2 and again when the step is 2048. My question is: how can the gaps at step 2 and step 2048 in my test be explained? And why is my result totally different from Igor's result? Thanks.
My own explanation for the first question is that the time cost of the code contains two parts: one is the "memory read/write" cost, and the other is the "other cost" of the loop and the calculation. If the step is 2, then the "memory read/write" cost almost doesn't change (because of the cache line), but the calculation and loop cost is halved, so we see an obvious gap. And I guess the cache line on my CPU is 4096 bytes (1024 * 4 bytes) rather than 64 bytes; that's why we get another gap when the step is 2048. But it's just my guess. Any help from you guys is appreciated, thanks.
Drop between 1024 and 2048
Note that you are using an uninitialized array. This basically means that
int * pBuffer = new int[buff_size*total_buff_count];
does not cause your program to actually ask for any physical memory; instead, just some virtual address space is reserved.
Then, when you first touch an array element, a page fault is triggered and the OS maps the page to physical memory. This is a relatively slow operation which may significantly influence your experiment. Since the page size on your system is likely 4 kB, it can hold 1024 4-byte integers. When you go for a step of 2048, only every second page is actually accessed, and the runtime drops proportionally.
You can avoid the negative effect of this mechanism by "touching" the memory in advance:
int * pBuffer = new int[buff_size*total_buff_count]{};
When I tried that, I got an almost linear decrease of time between step sizes of 64 and 8192.
Drop between 1 and 2
The cache line size on your system is definitely not 2048 bytes; it's very likely 64 bytes. (Generally, it may have different values, and even different values for different cache levels.)
As for the first part: when the step is 1, there are simply many more arithmetic operations involved (additions of array elements and increments of i).
Difference from Igor's experiment
We can only speculate about why Igor's experiment gave practically the same times in both cases. I would guess that the runtime of the arithmetic is negligible there, since only a single loop counter increment is involved, and he writes into the array, which requires an additional transfer of cache lines back to memory. (We could say that the byte/op ratio is much higher than in your experiment.)
How to verify CPU cache line size with C++ code?
There is std::hardware_destructive_interference_size in C++17 (in the <new> header), which should provide the smallest cache line size. Note that it is a compile-time value, and the compiler relies on your input as to what machine is targeted. When targeting an entire architecture, the number may be inaccurate.
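A minimal usage sketch (C++17; the value is a compile-time constant baked in by the compiler, not a runtime probe):

#include <iostream>
#include <new>  // std::hardware_destructive_interference_size

int main()
{
    std::cout << "destructive interference size: "
              << std::hardware_destructive_interference_size << " bytes\n";
}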
How to verify CPU cache line size with C++ code?
You reliably cannot.
And you should write portable C++ code. Read n3337.
Imagine that you did not enable compiler optimizations in your C++ compiler. And imagine that you run your program under some emulator.
On Linux specifically, you could parse the /proc/cpuinfo pseudo file and get the CPU cache line size from it.
For example:
% head -20 /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 8
model name : AMD Ryzen Threadripper 2970WX 24-Core Processor
stepping : 2
microcode : 0x800820b
cpu MHz : 1776.031
cache size : 512 KB
physical id : 0
siblings : 48
core id : 0
cpu cores : 24
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
BTW, there are many different organizations and levels of caches.
You could imagine a C++ application on Linux parsing the output of /proc/cpuinfo and then making HTTP requests (using libcurl) to the Web to get more information about the processor.
See also this answer.
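For a programmatic variant on Linux, here is a sketch using sysconf (a glibc extension, not portable C++; it may report 0 or -1 when the value is unknown):

#include <unistd.h>
#include <cstdio>

int main()
{
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);  // glibc-specific query
    std::printf("L1 dcache line size: %ld bytes\n", line);
}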

Any optimization for random access on a very big array when the value in 95% of cases is either 0 or 1?

Is there any possible optimization for random access on a very big array (I currently use uint8_t, and I'm asking about what's better)
uint8_t MyArray[10000000];
when the value at any position in the array is
0 or 1 for 95% of all cases,
2 in 4% of cases,
between 3 and 255 in the other 1% of cases?
So, is there anything better than a uint8_t array to use for this? It should be as quick as possible to loop over the whole array in random order, and this is very heavy on RAM bandwidth, so when more than a few threads are doing this at the same time for different arrays, the whole RAM bandwidth is quickly saturated.
I'm asking because it feels very inefficient to have such a big array (10 MB) when it's actually known that almost all values, apart from 5%, will be either 0 or 1. When 95% of all values in the array only actually need 1 bit instead of 8 bits, this would reduce memory usage by almost an order of magnitude.
It feels like there has to be a more memory efficient solution that would greatly reduce RAM bandwidth required for this, and as a result also be significantly quicker for random access.
A simple possibility that comes to mind is to keep a compressed array of 2 bits per value for the common cases, plus a separate, sorted array of 4 bytes per value (24 bits for the original element index, 8 bits for the actual value, so (idx << 8) | value) for the other ones.
When you look up a value, you first do a lookup in the 2bpp array (O(1)); if you find 0, 1 or 2 it's the value you want; if you find 3 it means that you have to look it up in the secondary array. There you perform a binary search for the index of your interest left-shifted by 8 (O(log n), with a small n, as this should be the 1%), and extract the value from the 4-byte entry.
#include <algorithm>
#include <stdint.h>
#include <cstdlib>
#include <vector>

std::vector<uint8_t> main_arr;
std::vector<uint32_t> sec_arr;

uint8_t lookup(unsigned idx) {
    // extract the 2 bits of our interest from the main array
    uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
    // usual (likely) case: value between 0 and 2
    if(v != 3) return v;
    // bad case: lookup the index<<8 in the secondary array
    // lower_bound finds the first >=, so we don't need to mask out the value
    auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
#ifdef _DEBUG
    // some coherency checks
    if(ptr == sec_arr.end()) std::abort();
    if((*ptr >> 8) != idx) std::abort();
#endif
    // extract our 8-bit value from the 32 bit (index, value) thingie
    return (*ptr) & 0xff;
}

void populate(uint8_t *source, size_t size) {
    main_arr.clear(); sec_arr.clear();
    // size the main storage (round up)
    main_arr.resize((size+3)/4);
    for(size_t idx = 0; idx < size; ++idx) {
        uint8_t in = source[idx];
        uint8_t &target = main_arr[idx>>2];
        // if the input doesn't fit, cap to 3 and put in secondary storage
        if(in >= 3) {
            // top 24 bits: index; low 8 bits: value
            sec_arr.push_back((idx << 8) | in);
            in = 3;
        }
        // store in the target according to the position
        target |= in << ((idx & 3)*2);
    }
}
For an array such as the one you proposed, this should take 10000000 / 4 = 2500000 bytes for the first array, plus 10000000 * 1% * 4 B = 400000 bytes for the second array; hence 2900000 bytes, i.e. less than one third of the original array, and the most used portion is all kept together in memory, which should be good for caching (it may even fit in L3).
If you need more than 24-bit addressing, you'll have to tweak the "secondary storage"; a trivial way to extend it is to have a 256 element pointer array to switch over the top 8 bits of the index and forward to a 24-bit indexed sorted array as above.
Quick benchmark
#include <algorithm>
#include <vector>
#include <stdint.h>
#include <chrono>
#include <stdio.h>
#include <math.h>

using namespace std::chrono;

/// XorShift32 generator; extremely fast, 2^32-1 period, way better quality
/// than LCG but fail some test suites
struct XorShift32 {
    /// This stuff allows to use this class wherever a library function
    /// requires a UniformRandomBitGenerator (e.g. std::shuffle)
    typedef uint32_t result_type;
    static uint32_t min() { return 1; }
    static uint32_t max() { return uint32_t(-1); }

    /// PRNG state
    uint32_t y;

    /// Initializes with seed
    XorShift32(uint32_t seed = 0) : y(seed) {
        if(y == 0) y = 2463534242UL;
    }

    /// Returns a value in the range [1, 1<<32)
    uint32_t operator()() {
        y ^= (y<<13);
        y ^= (y>>17);
        y ^= (y<<15);
        return y;
    }

    /// Returns a value in the range [0, limit); this conforms to the RandomFunc
    /// requirements for std::random_shuffle
    uint32_t operator()(uint32_t limit) {
        return (*this)()%limit;
    }
};

struct mean_variance {
    double rmean = 0.;
    double rvariance = 0.;
    int count = 0;

    void operator()(double x) {
        ++count;
        double ormean = rmean;
        rmean += (x-rmean)/count;
        rvariance += (x-ormean)*(x-rmean);
    }

    double mean() const { return rmean; }
    double variance() const { return rvariance/(count-1); }
    double stddev() const { return std::sqrt(variance()); }
};

std::vector<uint8_t> main_arr;
std::vector<uint32_t> sec_arr;

uint8_t lookup(unsigned idx) {
    // extract the 2 bits of our interest from the main array
    uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
    // usual (likely) case: value between 0 and 2
    if(v != 3) return v;
    // bad case: lookup the index<<8 in the secondary array
    // lower_bound finds the first >=, so we don't need to mask out the value
    auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
#ifdef _DEBUG
    // some coherency checks
    if(ptr == sec_arr.end()) std::abort();
    if((*ptr >> 8) != idx) std::abort();
#endif
    // extract our 8-bit value from the 32 bit (index, value) thingie
    return (*ptr) & 0xff;
}

void populate(uint8_t *source, size_t size) {
    main_arr.clear(); sec_arr.clear();
    // size the main storage (round up)
    main_arr.resize((size+3)/4);
    for(size_t idx = 0; idx < size; ++idx) {
        uint8_t in = source[idx];
        uint8_t &target = main_arr[idx>>2];
        // if the input doesn't fit, cap to 3 and put in secondary storage
        if(in >= 3) {
            // top 24 bits: index; low 8 bits: value
            sec_arr.push_back((idx << 8) | in);
            in = 3;
        }
        // store in the target according to the position
        target |= in << ((idx & 3)*2);
    }
}

volatile unsigned out;

int main() {
    XorShift32 xs;
    std::vector<uint8_t> vec;
    int size = 10000000;
    for(int i = 0; i<size; ++i) {
        uint32_t v = xs();
        if(v < 1825361101) v = 0;      // 42.5%
        else if(v < 4080218931) v = 1; // 95.0%
        else if(v < 4252017623) v = 2; // 99.0%
        else {
            while((v & 0xff) < 3) v = xs();
        }
        vec.push_back(v);
    }
    populate(vec.data(), vec.size());
    mean_variance lk_t, arr_t;
    for(int i = 0; i<50; ++i) {
        {
            unsigned o = 0;
            auto beg = high_resolution_clock::now();
            for(int i = 0; i < size; ++i) {
                o += lookup(xs() % size);
            }
            out += o;
            int dur = (high_resolution_clock::now()-beg)/microseconds(1);
            fprintf(stderr, "lookup: %10d µs\n", dur);
            lk_t(dur);
        }
        {
            unsigned o = 0;
            auto beg = high_resolution_clock::now();
            for(int i = 0; i < size; ++i) {
                o += vec[xs() % size];
            }
            out += o;
            int dur = (high_resolution_clock::now()-beg)/microseconds(1);
            fprintf(stderr, "array: %10d µs\n", dur);
            arr_t(dur);
        }
    }
    fprintf(stderr, " lookup | ± | array | ± | speedup\n");
    printf("%7.0f | %4.0f | %7.0f | %4.0f | %0.2f\n",
           lk_t.mean(), lk_t.stddev(),
           arr_t.mean(), arr_t.stddev(),
           arr_t.mean()/lk_t.mean());
    return 0;
}
(code and data always updated in my Bitbucket)
The code above populates a 10M-element array with random data distributed as the OP specified in their post, initializes my data structure, and then:
performs a random lookup of 10M elements with my data structure
does the same through the original array.
(Notice that in the case of sequential lookup the array always wins by a huge margin, as it's the most cache-friendly lookup you can do.)
These last two blocks are repeated 50 times and timed; at the end, the mean and standard deviation for each type of lookup are calculated and printed, along with the speedup (lookup_mean/array_mean).
I compiled the code above with g++ 5.4.0 (-O3 -static, plus some warnings) on Ubuntu 16.04 and ran it on several machines; most of them are running Ubuntu 16.04, some an older Linux, some a newer one. I don't think the OS should be relevant at all in this case.
CPU                        | cache    | lookup (µs)    | array (µs)     | speedup (x)
Xeon E5-1650 v3 @ 3.50GHz  | 15360 KB | 60011 ± 3667   | 29313 ± 2137   | 0.49
Xeon E5-2697 v3 @ 2.60GHz  | 35840 KB | 66571 ± 7477   | 33197 ± 3619   | 0.50
Celeron G1610T @ 2.30GHz   | 2048 KB  | 172090 ± 629   | 162328 ± 326   | 0.94
Core i3-3220T @ 2.80GHz    | 3072 KB  | 111025 ± 5507  | 114415 ± 2528  | 1.03
Core i5-7200U @ 2.50GHz    | 3072 KB  | 92447 ± 1494   | 95249 ± 1134   | 1.03
Xeon X3430 @ 2.40GHz       | 8192 KB  | 111303 ± 936   | 127647 ± 1503  | 1.15
Core i7 920 @ 2.67GHz      | 8192 KB  | 123161 ± 35113 | 156068 ± 45355 | 1.27
Xeon X5650 @ 2.67GHz       | 12288 KB | 106015 ± 5364  | 140335 ± 6739  | 1.32
Core i7 870 @ 2.93GHz      | 8192 KB  | 77986 ± 429    | 106040 ± 1043  | 1.36
Core i7-6700 @ 3.40GHz     | 8192 KB  | 47854 ± 573    | 66893 ± 1367   | 1.40
Core i3-4150 @ 3.50GHz     | 3072 KB  | 76162 ± 983    | 113265 ± 239   | 1.49
Xeon X5650 @ 2.67GHz       | 12288 KB | 101384 ± 796   | 152720 ± 2440  | 1.51
Core i7-3770T @ 2.50GHz    | 8192 KB  | 69551 ± 1961   | 128929 ± 2631  | 1.85
The results are... mixed!
In general, on most of these machines there is some kind of speedup, or at least they are on a par.
The two cases where the array truly trumps the "smart structure" lookup are machines with lots of cache that are not particularly busy: the Xeon E5-1650 above (15 MB cache) is a nightly build machine, at the moment quite idle; the Xeon E5-2697 (35 MB cache) is a machine for high-performance calculations, also in an idle moment. It makes sense: the original array fits completely in their huge cache, so the compact data structure only adds complexity.
At the opposite side of the "performance spectrum", but where again the array is slightly faster, there's the humble Celeron that powers my NAS; it has so little cache that neither the array nor the "smart structure" fits in it at all. Other machines with similarly small caches perform similarly.
The Xeon X5650 entries must be taken with some caution: they are virtual machines on a quite busy dual-socket virtualization server, and it may well be that, although nominally they have a decent amount of cache, they got preempted several times by completely unrelated virtual machines during the test.
Another option could be
check if the result is 0, 1 or 2
if not, do a regular lookup
In other words, something like:
unsigned char lookup(int index) {
    int code = (bmap[index>>2]>>(2*(index&3)))&3;
    if (code != 3) return code;
    return full_array[index];
}
where bmap uses 2 bits per element with the value 3 meaning "other".
This structure is trivial to update and uses 25% more memory, but the big part is looked up only in 5% of the cases. Of course, as usual, whether it's a good idea or not depends on a lot of other conditions, so the only answer is experimenting with real usage.
This is more of a "long comment" than a concrete answer
Unless your data is something well-known, I doubt anyone can DIRECTLY answer your question (and I'm not aware of anything that matches your description, but then I don't know EVERYTHING about all kinds of data patterns for all kinds of use cases). Sparse data is a common problem in high performance computing, but it's typically "we have a very large array, but only some values are non-zero".
For patterns that are not well known, like what I think yours is, nobody will KNOW directly which is better; it depends on the details. How random is the random access: is the system accessing clusters of data items, or is it completely random, like from a uniform random number generator? Is the table data completely random, or are there sequences of 0 and sequences of 1, with a scattering of other values? Run-length encoding would work well if you have reasonably long sequences of 0 and 1, but it won't work if you have a "checkerboard of 0/1". Also, you'd have to keep a table of "starting points", so you can work your way to the relevant place reasonably quickly.
I know from a long time back that some big databases are just a large table in RAM (telephone exchange subscriber data in this example), and one of the problems there is that caches and page-table optimisations in the processor are pretty useless. The caller is so rarely the same as one who recently called someone that there is no pre-loaded data of any kind; it's just purely random. Big page tables are the best optimisation for that type of access.
In a lot of cases, "speed versus small size" is one of those trade-offs you have to pick between in software engineering [in other engineering it's not necessarily so much of a compromise]. So, "wasting memory for simpler code" is quite often the preferred choice. In this sense, the "simple" solution is quite likely better for speed; but if you have a "better" use for the RAM, then optimising for size of the table would give you sufficient performance and a good improvement in size. There are lots of different ways you could achieve this. As suggested in a comment, a 2-bit field where the two or three most common values are stored, and then some alternative data format for the other values; a hash table would be my first approach, but a list or binary tree may work too. Again, it depends on the patterns of where your "not 0, 1 or 2" values are: are they clustered, or more of an evenly distributed pattern?
But a problem with that is that you are still reading the data from RAM, and then spending more code processing the data, including some code to cope with the "this is not a common value" case.
The problem with most common compression algorithms is that they are based on unpacking sequences, so you can't randomly access them. And the overhead of splitting your big data into chunks of, say, 256 entries at a time, uncompressing the 256 into a uint8_t array, fetching the data you want, and then throwing away the uncompressed data is highly unlikely to give you good performance; assuming that's of some importance, of course.
In the end, you will probably have to implement one or a few of the ideas in comments/answers to test out, see if it helps solving your problem, or if memory bus is still the main limiting factor.
What I've done in the past is use a hashmap in front of a bitset.
This halves the space compared to Matteo's answer, but may be slower if "exception" lookups are slow (i.e., if there are many exceptions).
Often, however, "cache is king".
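A minimal sketch of that layout (the names and sizes are illustrative, and the populate step is not shown; values 2..255 live in the hash map, everything else in the bitset):

#include <stdint.h>
#include <unordered_map>
#include <vector>

std::vector<bool> bits(10000000);              // 1 bit per index for the 0/1 majority
std::unordered_map<uint32_t, uint8_t> except;  // index -> value for the exceptions

uint8_t lookupValue(uint32_t idx) {
    auto it = except.find(idx);                // hashmap consulted first ("in front")
    if (it != except.end()) return it->second; // exception: some value >= 2
    return bits[idx];                          // common case: plain 0 or 1
}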
Unless there is a pattern to your data, it is unlikely that there is any sensible speed or size optimisation; assuming you are targeting a normal computer, 10 MB isn't that big a deal anyway.
There are two assumptions in your question:
The data is being poorly stored because you aren't using all the bits
Storing it better would make things faster.
I think both of these assumptions are false. In most cases the appropriate way to store data is to store the most natural representation. In your case, this is the one you've gone for: a byte for a number between 0 and 255. Any other representation will be more complex and therefore, all other things being equal, slower and more error-prone. To justify diverting from this general principle, you need a stronger reason than potentially six "wasted" bits on 95% of your data.
For your second assumption, it will be true if, and only if, changing the size of the array results in substantially fewer cache misses. Whether this will happen can only be definitively determined by profiling working code, but I think it's highly unlikely to make a substantial difference. Because you will be randomly accessing the array in either case, the processor will struggle to know which bits of data to cache and keep in either case.
If the data and accesses are uniformly randomly distributed, performance is probably going to depend upon what fraction of accesses avoid an outer-level cache miss. Optimizing that will require knowing what size array can be reliably accommodated in cache. If your cache is large enough to accommodate one byte for every five cells, the simplest approach may be to have one byte hold five base-3 encoded values in the range 0-2 (there are 243 combinations of 5 such values, so that will fit in a byte), along with a 10,000,000-byte array that would be queried whenever the base-3 value indicates "2".
If the cache isn't that big, but could accommodate one byte per 8 cells, then it would not be possible to use one byte value to select from among all 6,561 possible combinations of eight base-3 values, but since the only effect of changing a 0 or 1 to a 2 would be to cause an otherwise-unnecessary lookup, correctness wouldn't require supporting all 6,561. Instead, one could focus on the 256 most "useful" values.
Especially if 0 is more common than 1, or vice versa, a good approach might be to use 219 values to encode the combinations of 0 and 1 that contain 5 or fewer 1's, 16 values to encode xxxx0000 through xxxx1111, 16 to encode 0000xxxx through 1111xxxx, and one for xxxxxxxx. Four values would remain for whatever other use one might find. If the data are randomly distributed as described, a slight majority of all queries would hit bytes which contained just zeroes and ones (in about 2/3 of all groups of eight, all bits would be zeroes and ones, and about 7/8 of those would have five or fewer 1 bits); the vast majority of those that didn't would land in a byte which contained four x's, and would have a 50% chance of landing on a zero or a one. Thus, only about one in four queries would necessitate a large-array lookup.
If the data are randomly distributed but the cache isn't big enough to handle one byte per eight elements, one could try to use this approach with each byte handling more than eight items, but unless there is a strong bias toward 0 or toward 1, the fraction of values that can be handled without having to do a lookup in the big array will shrink as the number handled by each byte increases.
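A sketch of the five-values-per-byte base-3 packing described in the first paragraph above (the function names are made up; each digit is 0, 1 or 2, where 2 means "consult the big array"):

#include <stdint.h>

uint8_t pack5(const uint8_t v[5]) {  // five base-3 digits: 3^5 = 243 <= 256
    uint8_t b = 0;
    for (int i = 4; i >= 0; --i)
        b = (uint8_t)(b * 3 + v[i]);
    return b;
}

uint8_t digitAt(uint8_t b, int i) {  // extract the i-th base-3 digit again
    for (int k = 0; k < i; ++k)
        b /= 3;
    return b % 3;
}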
I'll add to @o11c's answer, as his wording might be a bit confusing.
If I needed to squeeze out the last bit and CPU cycle, I'd do the following.
We start by constructing a balanced binary search tree that holds the 5% "something else" cases. For every lookup, you walk the tree quickly: you have 10000000 elements, 5% of which are in the tree, hence the tree data structure holds 500000 elements. Walking this in O(log n) time gives you 19 iterations. I'm no expert at this, but I guess there are some memory-efficient implementations out there. Let's guesstimate:
A balanced tree, so the subtree position can be calculated (indices do not need to be stored in the nodes of the tree), the same way a heap (the data structure) is stored in linear memory.
1 byte for the value (2 to 255)
3 bytes for the index (10000000 takes 24 bits, which fits in 3 bytes)
In total 4 bytes per node: 500000 * 4 = 1953 kB. Fits the cache!
For all the other cases (0 or 1), you can use a bitvector. Note that you cannot leave it out just because of the 5% "something else" cases; random access must be answered for every index, so it takes 10000000 bits ≈ 1.19 MB.
The combination of these two uses approximately 3.099 MB. Using this technique, you save a factor of 3.08 in memory.
However, this doesn't beat the answer of @Matteo Italia (which uses 2.76 MB), which is a pity. Is there anything extra we can do? The most memory-consuming part is the 3 bytes of index in the tree. If we can get this down to 2, we would save 488 kB and the total memory usage would be 2.622 MB, which is smaller!
How do we do this? We have to reduce the indexing to 2 bytes. Again, 10000000 takes 24 bits, so we need to be able to drop 8 of them. We can do this by partitioning the range of 10000000 elements into regions of 2^16 (= 65536) elements: about 153 regions. Now we can build a balanced tree for each of these regions, with roughly 3270 elements on average. Picking the right tree is done by a simple bitshift of the target index (>> 16), and the index to store inside a tree fits in the remaining 16 bits. Note that there is some overhead for storing the length of each tree, but this is negligible. Also note that this splitting mechanism reduces the required number of iterations to walk a tree: only about 12 iterations are left.
Note that you could theoretically repeat the process to cut off the next 8 bits: partition into regions of 256 elements (about 39000 regions), each with a balanced tree of ~13 elements on average and a 1-byte stored index. This would result in 2.143 MB, with only about 4 iterations to walk the tree, which is a considerable speedup compared to the 19 iterations we started with.
As a final conclusion: this beats the 2-bit vector strategy by a tiny bit of memory usage, but is a whole struggle to implement. But if it can make the difference between fitting in the cache or not, it may be worth a try.
If you only perform read operations, it would be better not to assign a value to a single index but to an interval of indices.
For example:
[0, 15000] = 0
[15001, 15002] = 153
[15003, 26876] = 2
[26877, 31578] = 0
...
This can be done with a struct. You also might want to define a class similar to this if you like an OO approach.
class Interval {
private:
    uint32_t start; // First element of interval
    uint32_t end;   // Last element of interval
    uint8_t value;  // Assigned value

public:
    Interval(uint32_t start, uint32_t end, uint8_t value);
    bool isInInterval(uint32_t item); // Checks if item lies within the interval
    uint8_t getValue();               // Returns the assigned value
};
Now you just have to iterate through a list of intervals and check if your index lies within one of them, which can be much less memory-intensive on average but costs more CPU resources.
Interval intervals[INTERVAL_COUNT];
intervals[0] = Interval(0, 15000, 0);
intervals[1] = Interval(15001, 15002, 153);
intervals[2] = Interval(15003, 26876, 2);
intervals[3] = Interval(26877, 31578, 0);
...

uint8_t checkIntervals(uint32_t item)
{
    for (int i = 0; i < INTERVAL_COUNT; i++)
    {
        if (intervals[i].isInInterval(item) == true)
        {
            return intervals[i].getValue();
        }
    }
    return DEFAULT_VALUE;
}
If you order the intervals by descending size, you increase the probability that the item you are looking for is found early, which further decreases your average memory and CPU resource usage.
You could also remove all intervals with a size of 1: put the corresponding values into a map and check it only if the item you are looking for wasn't found in the intervals. This should also raise the average performance a bit; a sketch follows.
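A sketch of that refinement (it reuses intervals, INTERVAL_COUNT and DEFAULT_VALUE from above; the singletons map and the function name are made up):

#include <map>

std::map<uint32_t, uint8_t> singletons;  // former size-1 intervals: index -> value

uint8_t lookupWithSingletons(uint32_t item)
{
    for (int i = 0; i < INTERVAL_COUNT; i++)
    {
        if (intervals[i].isInInterval(item))
            return intervals[i].getValue();
    }
    auto it = singletons.find(item);     // only consulted when no interval matched
    if (it != singletons.end())
        return it->second;
    return DEFAULT_VALUE;
}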
A long, long time ago, as far as I can remember...
In university we got the task of accelerating a ray tracer program that has to read, by algorithm, over and over again from buffer arrays. A friend told me to always use RAM reads that are multiples of 4 bytes. So I changed the array from a pattern of [x1,y1,z1,x2,y2,z2,...,xn,yn,zn] to a pattern of [x1,y1,z1,0,x2,y2,z2,0,...,xn,yn,zn,0], i.e. I added an empty field after each 3D coordinate. After some performance testing: it was faster.
Long story short: read multiples of 4 bytes from your array in RAM, and maybe also from the right starting position, so that you read a little cluster containing the searched index and then read the searched index from this little cluster in the CPU. (In your case you will not need to insert fill fields, but the concept should be clear.)
Maybe other multiples could also be the key on newer systems.
I don't know if this will work in your case, so if it doesn't: sorry. If it works, I'd be happy to hear about some test results.
PS: Oh, and if there is any access pattern, or nearby accessed indices, you can reuse the cached cluster.
PPS: It could be that the multiple factor was more like 16 bytes or something like that; it was too long ago for me to remember exactly.
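A sketch of the padding described in the anecdote (the struct name is made up; alignas(16) reflects the 16-byte guess from the PPS):

struct alignas(16) PaddedVec3 {
    float x, y, z;
    float pad;  // empty filler field, never read; keeps each entry 16-byte aligned
};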
Looking at this, you could split your data, for example into:
a bitset which gets indexed and represents the value 0 (std::vector<bool> would be useful here)
a bitset which gets indexed and represents the value 1
a std::vector for the values of 2, containing the indexes which refer to this value
a map for the other values (or a std::vector of (index, value) pairs)
Since every index holds exactly one value, you could even remove one of the bitsets and represent its value as the index being missing from all the other structures.
This will save you some memory for this case, though it would make the worst case worse.
You'll also need more CPU power to do the lookups.
Make sure to measure!
Like Mats mentions in his comment-answer, it is hard to say what is actually the best solution without knowing specifically what kind of data you have (e.g., are there long runs of 0's, and so on), and what your access pattern looks like (does "random" mean "all over the place" or just "not strictly in completely linear fashion" or "every value exactly once, just randomized" or ...).
That said, there are two mechanisms coming to mind:
Bit arrays; i.e., if you only had two values, you could trivially compress your array by a factor of 8; if you have 4 values (or "3 values + everything else") you can compress by a factor of two. Which might just not be worth the trouble and would need benchmarks, especially if you have really random access patterns which escape your caches and hence do not change the access time at all.
(index,value) or (value,index) tables. I.e., have one very small table for the 1% case, maybe one table for the 5% case (which only needs to store the indexes as all have the same value), and a big compressed bit array for the final two cases. And with "table" I mean something which allows relatively quick lookup; i.e., maybe a hash, a binary tree, and so on, depending on what you have available and your actual needs. If these subtables fit into your 1st/2nd level caches, you might get lucky.
I am not very familiar with C, but in C++ you can use unsigned char to represent an integer in the range 0 to 255.
Compared to a normal int, which requires 4 bytes (32 bits) (again, I am coming from the Java and C++ world), an unsigned char requires 1 byte (8 bits).
So it might reduce the total size of the array by 75%.
You have succinctly described all the distribution characteristics of your array; toss the array.
You can easily replace the array with a randomized method that produces the same probabilistic output as the array; a sketch follows below.
If consistency matters (producing the same value for the same random index), consider using a Bloom filter and/or hash map to track repeat hits. If your array accesses really are random, though, this is totally unnecessary.
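A minimal sketch of such a method (the hash constant and thresholds are illustrative; hashing the index keeps the output deterministic per index while matching the stated distribution):

#include <stdint.h>

uint8_t valueAt(uint32_t idx) {
    uint32_t h = idx * 2654435761u;  // Knuth-style multiplicative hash
    h ^= h >> 16;                    // mix high bits down
    uint32_t r = h % 100;            // percentage bucket
    if (r < 95) return h & 1;        // 95% of cases: 0 or 1
    if (r < 99) return 2;            // 4% of cases: 2
    return 3 + (h >> 8) % 253;       // 1% of cases: 3..255
}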

Concurrent matrix sum - past exam paper

I'm currently in my 3rd year at university, studying for my Computer Systems and Concurrency exam, and I'm confused about a past paper question. Nobody, not even the lecturer, has been able to answer my question.
Question:
Consider the following GPU: it consists of 8 multiprocessors clocked at 1.5 GHz, each of which contains 8 multithreaded single-precision floating-point units and integer processing units. It has a memory system that consists of 8 partitions of 1 GHz Graphics DDR3 DRAM, each 8 bytes wide and with 256 MB of capacity. Making reasonable assumptions (state them), and using a naive matrix multiplication algorithm, compute how much time the computation C = A * B would take. A, B, and C are n x n matrices, and n is determined by the amount of memory the system has.
Answer given in solutions:
> Assuming it has a single-precision FP multiply-add instruction:
> Single-precision FP multiply-add performance
>   = #MPs * #SPs/MP * #FLOPs/instr/SP * #instr/clock * #clocks/sec
>   = 8 * 8 * 2 * 1 * 1.5 G = 192 GFlops/second
> Total DDR3 RAM memory size = 8 * 256 MB = 2048 MB
> Peak DDR3 bandwidth
>   = #partitions * #bytes/transfer * #transfers/clock * #clocks/sec
>   = 8 * 8 * 2 * 1 G = 128 GB/sec
> Modern computers use 32-bit single precision. So, if we want 3 n*n SP matrices, the maximum n is given by
>   3n^2 * 4 <= 2048 * 1024 * 1024
> which gives nmax = 13377 = n.
> The number of operations that a naive mm algorithm (triply nested loop) needs is calculated as follows: for each element of the result, we need n multiply-adds; for each row of the result, we need n * n multiply-adds; for the entire result matrix, we need n * n * n multiply-adds. Thus, approximately 2393 GFlops.
> Assuming no cache, we have loading of 2 matrices and storing of 1 to the graphics memory.
> That is 3 * n^2 = 512 GB of data. This process will take 512 / 128 = 4 seconds.
> Also, the processing will take 2393 / 192 = 12.46 seconds. Thus the entire matrix multiplication will take 16.46 seconds.
Now my question is: how does the calculation 3 * (13377^2) = 536,832,387 translate to 512 GB?
That is 536.8 million values, each 4 bytes long. The memory interface is 8 bytes wide; assuming the GPU cannot fetch 2 values and split them, that effectively doubles the size of the reads and writes. Therefore the 2 GB of memory used is effectively read/written twice (because 8 bytes are read and 4 ignored), so only about 4 GB of data passes between the RAM and the GPU.
Can someone please tell me where I am going wrong? The only explanation I can think of is that the 536.8 million result is the amount of memory traffic in KB, which is not stated anywhere.

Timing difference when using `%` vs `&`

In trying to determine the cache size of a given CPU, I tried to time accesses to memory/cache like this:
lengthMod = sizes[i] / sizeof(int) - 1; // where sizes[i] is something like 1024, 2048 ...
for (unsigned int k = 0; k < REPS; k++) {
    data[(k * 16) & lengthMod]++;
}
1, 0.52
4, 0.52
8, 0.52
16, 0.52
32, 0.52
64, 1.11 // << note the jump in timing. L1 cache size is 32K
128, 1.12
256, 1.19
So I think that if lengthMod is not a power of 2, I can't do this. So I tried doing:
lengthMod = sizes[i] / sizeof(int);
for (unsigned int k = 0; k < REPS; k++) {
    data[(k * 16) % lengthMod]++;
}
1, 2.67
4, 2.57
8, 2.55
16, 2.51
32, 2.42
64, 2.42 // << no jump anymore ...
128, 2.42
256, 2.42
Then I find that the timing increase I expected is no longer there. I expected the time to increase, but shouldn't it apply to all values? That is, if it's x seconds when using &, I'd expect ~x+c seconds when using % (where c is approximately constant). But that's not the case; in fact, the timing difference shrinks to nothing. Why is that?
What you're seeing is a trade-off of bottlenecks.
In the first example, you are bottlenecked by your cache bandwidth.
In the second example, you are bottlenecked by integer division.
Before we continue, let's look at the difference between the two examples:
In the first case, you use & which is a fast bitwise operation.
In the second case, you use %, which requires a very slow division.
Divisions are very slow. Modern compilers will try to optimize them when the divisor/modulus is a compile-time constant.
But that's not the case here, so you pay the full cost of a hardware division. This is why the times in your second example are much slower than in the first.
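As an illustration (a sketch; the template approach and names are made up), making the modulus a compile-time constant lets the compiler strength-reduce % into multiplies and shifts:

#include <cstddef>

// LengthMod is known at compile time, so the compiler can replace '%'
// with a multiply/shift sequence instead of a hardware division.
template <std::size_t LengthMod>
void touchArray(int* data, unsigned reps) {
    for (unsigned k = 0; k < reps; k++)
        data[(k * 16) % LengthMod]++;
}

// Hypothetical usage: one instantiation per size,
// e.g. touchArray<1024 / sizeof(int)>(data, REPS);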
With the &, the code is fast enough to max out the cache bandwidth. However, with %, the code is much slower - not fast enough to keep up with the cache. So you see the same times all the way up.
It looks like in the second case the compiler generates less optimized code (because of the division with remainder).