I'm currently in my 3rd year of university, revising for my Computer Systems and Concurrency exam, and I'm confused about a past paper question. Nobody - not even the lecturer - has answered my question.
Question:
Consider the following GPU that consists of 8 multiprocessors clocked at 1.5 GHz, each of which contains 8 multithreaded single-precision floating-point units and integer processing units. It has a memory system that consists of 8 partitions of 1 GHz Graphics DDR3 DRAM, each 8 bytes wide and with 256 MB of capacity. Making reasonable assumptions (state them) and assuming a naive matrix multiplication algorithm, compute how much time the computation C = A * B would take. A, B, and C are n * n matrices, and n is determined by the amount of memory the system has.
Answer given in solutions:
> Assuming it has a single-precision FP multiply-add instruction,
> single-precision FP multiply-add performance =
> #MPs * #SP/MP * #FLOPs/instr/SP * #instr/clock * #clocks/sec =
> 8 * 8 * 2 * 1 * 1.5 G = 192 GFlops/second
>
> Total DDR3 RAM memory size = 8 * 256 MB = 2048 MB
>
> The peak DDR3 bandwidth = #partitions * #bytes/transfer * #transfers/clock * #clocks/sec = 8 * 8 * 2 * 1 G = 128 GB/sec
>
> Modern computers have 32-bit single precision. So, if we want 3 n*n SP matrices, the maximum n satisfies
> 3n^2 * 4 <= 2048 * 1024 * 1024
> nmax = 13377 = n
>
> The number of operations that a naive matrix-multiplication algorithm (triply nested loop) needs is calculated as follows: for each element of the result, we need n multiply-adds; for each row of the result, we need n * n multiply-adds; for the entire result matrix, we need n * n * n multiply-adds. Thus, approximately 2393 GFlops.
>
> Assuming no cache, we have loading of 2 matrices and storing of 1 to the graphics memory. That is 3 * n^2 = 512 GB of data. This process will take 512 / 128 = 4 seconds. Also, the processing will take 2393 / 192 = 12.46 seconds. Thus the entire matrix multiplication will take 16.46 seconds.
Now my question is: how does the calculation 3 * (13377^2) = 536,832,387 translate into 512 GB?
That is 536.8 million values, each 4 bytes long. The memory interface is 8 bytes wide; assuming the GPU cannot fetch 2 values and split them, that effectively doubles the size of the reads and writes. Therefore the 2 GB of memory used is effectively read/written twice (because 8 bytes are read and 4 are ignored), so only 4 GB of data is passed between the RAM and the GPU.
Can someone please tell me where I am going wrong? The only way I can make the numbers work is if the 536.8 million figure were the size of the memory operations in KB - which is not stated anywhere.
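Here is the arithmetic I am doing, as a quick C++ sketch (assuming 4-byte single-precision floats, as in the model answer):

#include <cstdio>

int main() {
    const long long n = 13377;                 // nmax from the model answer
    const long long values = 3 * n * n;        // elements in A, B and C together
    const long long bytes = values * 4;        // 4 bytes per single-precision float
    std::printf("values = %lld\n", values);    // 536,832,387
    std::printf("bytes = %lld (about %.2f GB)\n",
                bytes, bytes / (1024.0 * 1024.0 * 1024.0));   // about 2 GB, not 512 GB
    return 0;
}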
I am benchmarking a simple matrix transposition kernel on a Qualcomm Adreno 630 GPU, and I am trying to see the impact of different work group sizes, but surprisingly, I get some interesting results which I cannot explain. Here is my kernel code:
__kernel void transpose(__global float *input, __global float *output, const int width, const int height) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    output[i*height + j] = input[j*width + i];
}
and the width and height are both 6400; the experiment results are as follows (execution time is the difference between the END and START events):
work group size      execution time
x      y
4      64            24 ms
64     4             169 ms
256    1             654 ms
1      256           34 ms
8      32            27 ms
1      1024          375 ms
1024   1             657 ms
32     32            26 ms
After this I did another experiment where I changed the width and height from 6400 to 6401 (and the global work size in the NDRangeKernel call as well), and the result is even more interesting:
work group size      execution time
x      y
4      64            28 ms
64     4             105 ms
256    1             359 ms
1      256           31 ms
8      32            32 ms
1      1024          99 ms
1024   1             358 ms
32     32            32 ms
The execution time of most scenarios drops significantly. I know memory coalescing or caching could play a role here, but I cannot completely explain it.
Memory coalescence occurs when consecutive threads access data at consecutive global memory addresses within a 128-byte aligned segment. Then memory accesses are coalesced into one, significantly reducing overall latency.
In the 2D range, coalescing only happens along get_global_id(1), or the j direction in your case. In the line output[i*height + j] = input[j*width + i];, input[j*width + i] is a misaligned (non-coalesced) read and output[i*height + j] is a coalesced write. Coalesced memory access generally is much faster than misaligned access, but the performance penalty for coalesced/misaligned reads can be vastly different from that for coalesced/misaligned writes. On most desktop GPU architectures, the combination of a misaligned read and a coalesced write is faster than the other way around, so your implementation should already be the faster variant.
Since coalesced access is only possible along the j index, if you have a range of (x=256, y=1) (i along the x-direction, j along the y-direction), you do not get any coalescing. For (x=8, y=32), j is coalesced in groups of 32, 8 times per thread block, so memory bandwidth is fairly saturated and performance is good.
If you want maximum possible performance, I'd suggest you go with 1D indexing. This way you have full control over coalescing, and coalescing happens over the entire thread block. Your matrix transpose kernel would then look like this:
#define width 6400
#define height 6400

__kernel void transpose(__global float *input, __global float *output) {
    const int n = get_global_id(0);
    const int i = n / width;
    const int j = n % width;
    output[i*height + j] = input[j*width + i];
}
You can bake width and height into the OpenCL C code at C++ runtime, before OpenCL compile time, via string concatenation.
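For example, a hypothetical host-side sketch of that string concatenation (kernelBody is assumed to hold the rest of the kernel source):

#include <string>

// Prepend the matrix dimensions to the kernel source before it is handed to the
// OpenCL compiler, so they become compile-time constants inside the kernel.
std::string buildTransposeSource(int width, int height, const std::string& kernelBody) {
    std::string defines =
        "#define width "  + std::to_string(width)  + "\n" +
        "#define height " + std::to_string(height) + "\n";
    return defines + kernelBody;   // pass the result to clCreateProgramWithSource
}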
I am having a problem with understanding/managing sizes of .vtu files in VTK. I need to write CFD output for hexahedral meshes with millions of cells and nodes. So, I am looking at ways to improve the efficiency of storage. I started with simple test cases.
Case1: 80x40x40 hexahedral mesh with 8 points for each hexahedron. So, 128000 cells and 1024000 points in total. Let's call it C1.vtu.
Case2: 80x40x40 hexahedral mesh with only unique points. So, 128000 cells and 136161 points in total. Let's call it C2.vtu.
I store one vector field (velocity) for each point in each case. I use vtkFloatArray for this data. The size of C1.vtu is 7.5 MB, and the C2.vtu file is 3.0 MB.
This is not what I expected when I created C2.vtu. As I store only about 13% of the points (compared to Case 1) in Case 2, I expected C2.vtu to shrink accordingly (at least 5 times). However, the reduction is only 2.5 times.
I would like to understand what is going on internally. Also, I appreciate any insights on reducing the file size further.
I am using VTK 6.2 with C++ on Ubuntu 12.04.
It sounds like you have compression enabled in the writer; does writer->GetCompressor() return a non-NULL pointer? If so, then that is almost surely the reason for the difference in file sizes. Without compression, I would expect larger file sizes than you are reporting. As the comments above noted, unstructured storage adds connectivity overhead. Consider your meshes C1 and C2:
C1
connectivity size = 128000 * (1 cell type + 1 cell offset + 8 point IDs) * (4 or 8 bytes per integer)
point coordinate size = 1024000 * (3 coords) * (4 or 8 bytes per coord)
vector field size = 1024000 * (3 components per tuple) * (4 or 8 bytes per component)
that would be 28.32 MiB at a minimum (all int32/float32), yet you report it is 7.5 MB
C2
connectivity size = 128000 * (1 cell type + 1 cell offset + 8 point IDs) * (4 or 8 bytes per integer)
point coordinate size = 136161 * (3 coords) * (4 or 8 bytes per coord)
vector field size = 136161 * (3 components per tuple) * (4 or 8 bytes per component)
that would be 8 MiB at a minimum, but you report 3 MB.
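For reference, here is the arithmetic behind those minimums as a small C++ sketch (assuming 4-byte integers and 4-byte floats throughout, i.e. the int32/float32 case):

#include <cstdio>

// Uncompressed lower bound for a .vtu holding hexahedral cells:
// connectivity (cell type + offset + 8 point IDs), point coordinates, and one vector field.
double minSizeMiB(long long cells, long long points) {
    long long connectivity = cells * (1 + 1 + 8) * 4;
    long long coordinates = points * 3 * 4;
    long long vectorField = points * 3 * 4;
    return (connectivity + coordinates + vectorField) / (1024.0 * 1024.0);
}

int main() {
    std::printf("C1: %.2f MiB\n", minSizeMiB(128000, 1024000)); // about 28.3 MiB
    std::printf("C2: %.2f MiB\n", minSizeMiB(128000, 136161));  // about 8.0 MiB
    return 0;
}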
I'm reading about loading DDS textures. I read this article and saw this posting. (I also read the wiki about S3TC)
I understood most of the code, but there are three lines I didn't quite get.
blockSize = (format == GL_COMPRESSED_RGBA_S3TC_DXT1_EXT) ? 8 : 16;
and:
size = ((width + 3) / 4) * ((height + 3) / 4) * blockSize;
and:
bufsize = mipMapCount > 1 ? linearSize * 2 : linearSize;
1. What is blockSize? And why are we using 8 for DXT1 and 16 for the rest?
2. What exactly is happening when we're calculating size? More specifically, why are we adding 3, dividing by 4, then multiplying by blockSize?
3. Why are we multiplying by 2 if mipMapCount > 1?
DXT1-5 formats are also called BCn formats (the names end with numbers, but not exactly the same ones) and BC stands for block compression. Pixels are not stored separately; each stored block of data encodes the equivalent of 4x4 pixels.
The 1st line checks if it's DXT1, because DXT1 has a size of 8 bytes per block, while DXT3 and DXT5 use 16 bytes per block. (Note that newer formats exist and at least one of them, BC4, is also 8 bytes/block.)
The 2nd line rounds up the dimensions of the texture to a multiple of the dimensions of a block. This is required since these formats can only store whole blocks, not individual pixels. For example, if you have a texture of 15x6 pixels, and since BCn blocks are 4x4 pixels, you will need 4 blocks across and 2 blocks down, even if the last column/row of blocks is only partially filled.
One way of rounding up a positive integer (let's call it i) to a multiple of another positive integer (let's call it m), is:
(i + m - 1) / m * m
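For example, rounding i = 15 up to a multiple of m = 4 with integer division: (15 + 4 - 1) / 4 * 4 = 18 / 4 * 4 = 4 * 4 = 16.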
Here, we need to get the number of blocks in each dimension and then multiply by the size of a block to get the total size of the texture. To do that we round width and height up to the next multiple of 4, divide each by 4 to get the number of blocks, and finally multiply by the size of a block:

size = ((width + 3) / 4 * 4 / 4) * ((height + 3) / 4 * 4 / 4) * blockSize;
// round up to a multiple of 4, then divide by 4 to get the block count, per dimension
If you look closely, there's a *4 followed by a /4 in each dimension that can be simplified away. If you do that, you'll get exactly the same code you had. The conclusion to all this could be: comment any code that's not perfectly obvious :P
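Putting it together, here is a hedged C++ helper for the size of one mip level (the function name is mine, not from the tutorial code):

#include <cstddef>

// Size in bytes of one mip level of a BCn/DXT-compressed texture.
// blockSize is 8 for DXT1 (BC1) and 16 for DXT3/DXT5 (BC2/BC3).
std::size_t dxtLevelSize(unsigned width, unsigned height, unsigned blockSize) {
    unsigned blocksAcross = (width + 3) / 4;   // round up to whole 4x4 blocks
    unsigned blocksDown = (height + 3) / 4;
    return static_cast<std::size_t>(blocksAcross) * blocksDown * blockSize;
}
// Example: dxtLevelSize(15, 6, 8) == 4 * 2 * 8 = 64 bytes for the 15x6 DXT1 texture above.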
The 3rd line may be an approximation to calculate a buffer size big enough to hold the whole mipmap chain easily. But I'm not sure what this linearSize is; it corresponds to dwPitchOrLinearSize in the DDS header. In any case, you don't really need this value, since you can calculate the size of each level easily with the code above.
I have hundreds of thousands of sparse bit strings of length 32 bits.
I'd like to do a nearest neighbour search on them, and look-up performance is critical. I've been reading up on various algorithms, but they seem to target text strings rather than binary strings. I think either locality-sensitive hashing or spectral hashing could be good candidates, or I could look into compression. Will any of these work well for my bit string problem? Any direction or guidance would be greatly appreciated.
Here's a fast and easy method,
then a variant with better performance at the cost of more memory.
In: array Uint X[], e.g. 1M 32-bit words
Wanted: a function near( Uint q ) --> j with small hammingdist( q, X[j] )
Method: binary search q in sorted X,
then linear search a block around that.
Pseudocode:
def near( q, X, Blocksize=100 ):
    # preprocess: sort X once
    Uint* p = binsearch( q, X )   # match q in leading bits
    linear-search Blocksize words around p
    return the hamming-nearest of these.
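In C++, that idea might look roughly like this (a hedged sketch: it assumes X is already sorted and non-empty, and uses a GCC/Clang popcount builtin):

#include <algorithm>
#include <cstdint>
#include <vector>

// Hamming distance between two 32-bit words.
static int hamming(uint32_t a, uint32_t b) {
    return __builtin_popcount(a ^ b);   // GCC/Clang builtin; substitute your own popcount if needed
}

// Approximate nearest neighbour: binary-search q in sorted X, then scan a
// window of blockSize words around that position and keep the Hamming-nearest.
uint32_t nearApprox(uint32_t q, const std::vector<uint32_t>& X, size_t blockSize = 100) {
    size_t p = std::lower_bound(X.begin(), X.end(), q) - X.begin();   // match q in leading bits
    size_t half = blockSize / 2;
    size_t lo = std::min(p > half ? p - half : 0, X.size() - 1);      // clamp window start
    size_t hi = std::min(X.size(), lo + blockSize);

    uint32_t best = X[lo];
    int bestDist = hamming(q, best);
    for (size_t k = lo + 1; k < hi; ++k) {
        int d = hamming(q, X[k]);
        if (d < bestDist) { bestDist = d; best = X[k]; }
    }
    return best;
}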
This is fast --
Binary search 1M words
+ nearest hammingdist in a block of size 100
takes < 10 us on my Mac ppc.
(This is highly cache-dependent — your mileage will vary.)
How close does this come to finding the true nearest X[j] ?
I can only experiment, can't do the math:
for 1M random queries in 1M random words,
the nearest match is on average 4-5 bits away,
vs. 3 away for the true nearest (linear scan all 1M):
near32 N 1048576 Nquery 1048576 Blocksize 100
binary search, then nearest +- 50
7 usec
distance distribution: 0 4481 38137 185212 443211 337321 39979 235 0
near32 N 1048576 Nquery 100 Blocksize 1048576
linear scan all 1048576
38701 usec
distance distribution: 0 0 7 58 35 0
Run your data with blocksizes say 50 and 100
to see how the match distances drop.
To get even nearer, at the cost of twice the memory,
make a copy Xswap of X with upper / lower halfwords swapped,
and return the better of
near( q, X, Blocksize )
near( swap q, Xswap, Blocksize )
With lots of memory, one can use many more bit-shuffled copies of X,
e.g. 32 rotations.
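As a concrete sketch of those bit shuffles (the helper names are mine, not part of the method above):

#include <cstdint>

// Swap the upper and lower 16-bit halves, used to build the Xswap copy.
inline uint32_t swapHalves(uint32_t q) {
    return (q << 16) | (q >> 16);
}

// One of the 32 rotations: rotate q left by r bits (0 <= r < 32).
inline uint32_t rotl32(uint32_t q, unsigned r) {
    return r == 0 ? q : (q << r) | (q >> (32 - r));
}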
I have no idea how performance varies with Nshuffle and Blocksize —
a question for LSH theorists.
(Added): To near-match bit strings of say 320 bits, 10 words,
make 10 arrays of pointers, sorted on word 0, word 1 ...
and search blocks with binsearch as above:
nearest( query word 0, Sortedarray0, 100 ) -> min Hammingdist e.g. 42 of 320
nearest( query word 1, Sortedarray1, 100 ) -> min Hammingdist 37
nearest( query word 2, Sortedarray2, 100 ) -> min Hammingdist 50
...
-> e.g. the 37.
This will of course miss near-matches where no single word is close,
but it's very simple, and sort and binsearch are blazingly fast.
The pointer arrays take exactly as much space as the data bits.
100 words, 3200 bits would work in exactly the same way.
But: this works only if there are roughly equal numbers of 0 bits and 1 bits,
not 99 % 0 bits.
I just came across a paper that addresses this problem.
Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering (Ravichandran et al, 2005)
The basic idea is similar to Denis's answer (sort lexicographically by different permutations of the bits) but it includes a number of additional ideas and further references for articles on the topic.
It is actually implemented in https://github.com/soundcloud/cosine-lsh-join-spark which is where I found it.
I am timing how long it takes my CUDA program to calculate matrices of a certain size, for example 10x10, 100x100, 500x500, 1000x1000.
However, the results are not at all what I was expecting. The numbers for the graph are not what I expected: as the matrices increase in size, the computation time decreases.
For example, here is the average time (from 1000 runs):
10x10: 0.032768s
100x100: 0.068960s
500x500: 0.006336s
1000x1000: 0.018400s
The time goes down, then up again at 1000x1000. What is going on? Shouldn't the times level off at a certain point? Why do they go up and down like a roller coaster?
Here is how the actual timing code is being run:
int blocksNeeded=0;
cudaError_t cudaStatus;
blocksNeeded=(size/MAXTHREADS)+1;
int threadsPerBlock = MAXTHREADS/blocksNeeded+1;
cudaEvent_t start, stop;
float elapsedtime;
.
.
.
.
.
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
addKernel<<<blocksNeeded, size>>>(dev_c, dev_a, dev_b,size);
cudaStatus = cudaDeviceSynchronize();
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedtime, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
where MAXTHREADS is 1024 and size is the number of elements I have in the matrix, i.e. a 10x10 matrix will have 100 elements, which is the size.
Updated with kernel:
__global__ void addKernel(float *c, float *a, float *b, int size)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < size)
        c[idx] = a[idx] + b[idx];
}
I've made a test on a recent GPU cluster equipped with an NVIDIA Tesla M2090. Basically, I'm performing a vector addition with different sizes. The results are:
Size Kernel time (msec)
===========================
2 0.04
4 0.010912
8 0.012128
16 0.012256
32 0.011296
64 0.01248
128 0.012192
256 0.012576
512 0.012416
1024 0.012736
2048 0.01232
4096 0.011968
8192 0.011264
16384 0.007296
32768 0.007776
65536 0.009728
131072 0.018304
262144 0.031392
524288 0.055168
1048576 0.10352
What you can see is that there is a knee at a vector size of 16384, which basically resembles your observations. This is not an error but normal behavior, since the GPU has to be fully utilized before it shows its peak performance. In the case of the Tesla M2090, that point of full utilization is reached at around 16384 parallel additions.
The way you are measuring kernel performance is perfectly ok. I assume you've taken this from the "Best Practices Guide" for CUDA.
Notice: Please consider that the shown data was generated with a single kernel run per size, i.e. it is not representative. Generally, for exact time measurements the kernel should be run multiple times on the same problem, and the kernel time taken as the mean of those runs.
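A hedged sketch of what averaging over several launches could look like, reusing the events and kernel from the question's code (with a corrected launch configuration, see the other answers):

// Time the same kernel several times and report the mean.
const int runs = 1000;
float totalMs = 0.0f;
for (int r = 0; r < runs; ++r) {
    cudaEventRecord(start, 0);
    addKernel<<<blocksNeeded, threadsPerBlock>>>(dev_c, dev_a, dev_b, size);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    totalMs += ms;
}
printf("mean kernel time: %f ms\n", totalMs / runs);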
You must call the kernel with
addKernel<<<blocksNeeded, MAXTHREADS>>>(dev_c, dev_a, dev_b,size);
The second parameter on a kernel call is the number of threads to launch in each block, not the total number of threads.
At 100x100 you are already exceeding the maximum number of threads per block, which is 1024 for compute capability 2.x.
Also, I just noticed that you calculate some kind of threadsPerBlock, which is wrong, and that you don't even use it. Choose a number of threads per block. Then divide the total number of elements to process by it, add 1 if the remainder is non-zero, and you get the number of blocks to launch, as sketched below.
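In code, that could look like this (a minimal sketch; the fixed block size of 256 is my choice, any value up to the device limit works):

const int threadsPerBlock = 256;   // pick a block size; do not derive it from blocksNeeded
const int blocksNeeded = (size + threadsPerBlock - 1) / threadsPerBlock;   // rounds up, same as divide and add 1 on a remainder

addKernel<<<blocksNeeded, threadsPerBlock>>>(dev_c, dev_a, dev_b, size);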