Continuing my Chapel adventures...
I have a matrix A.
var idx = {1..n};
var adom = {idx, idx};
var A: [adom] int;
//populate A;
var rowsums: [idx] int;
What is the most efficient way to populate rowsums?
The most efficient solution is hard to define. However, here is one way to compute rowsums that is both parallel and elegant:
config const n = 8; // an undeclared "naked" n would cause compilation to fail:
const indices = 1..n; //   tio.chpl:1: error: 'n' undeclared (first use this function)
const adom = {indices, indices};
var A: [adom] int;
// Populate A
[(i,j) in adom] A[i, j] = i*j;
var rowsums: [indices] int;
forall i in indices {
rowsums[i] = + reduce(A[i, ..]);
}
writeln(rowsums);
Try it online!
This utilizes the + reduction over row slices of A.
Note that both the forall and the + reduce introduce parallelism into the program above. If the size of indices is sufficiently small, it may be more efficient to use only a for loop, avoiding the task-spawning overhead.
A few hints to make the code actually run live in both [SEQ] and [PAR] modes:
Besides a few implementation details, the assumption stated above by #bencray, that the overhead costs of a [PAR] setup may favor purely serial processing in a [SEQ] setup, was not experimentally confirmed. It is fair to also note that distributed mode was not tested on the live <TiO>-IDE for obvious reasons; a small-if-not-tiny-scale distributed implementation would be far more an oxymoron than a scientifically meaningful experiment to run.
Facts matter
Processing rowsums[], even at the smallest possible scale of 2x2, was slower in [SEQ] mode than the same processing for 256x256 in [PAR] mode.
Good job, Chapel team, indeed cool results on optimum alignment for harnessing the compact silicon resources to the max in [PAR]!
For exact run-time performance records, see the self-documented tables below, or do not hesitate to visit the live IDE run (referenced above) and experiment on your own.
Readers may also recognise extrinsic noise in small-scale experiments, as O/S- and hosted-IDE-related processes intervene in resource usage and influence the <SECTION-UNDER-TEST> runtime performance via adverse CPU / Lx-CACHE / memory-I/O / process / et al. conflicts, which excludes these measurements from being used for any generalised interpretation.
I hope all will enjoy Chapel's lovely [TIME] results demonstrated across the growing [EXPSPACE]-scaled computing landscapes.
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ use Time;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_SEQ: Timer;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_PAR: Timer;
// const max_idx = 123456; // seems to be too fat for <TiO>-IDE to allocate <TiO>-- /wrappers/chapel: line 6: 24467 Killed
const max_idx = 4096;
// const max_idx = 8192; // seems to be too long for <TiO>-IDE to let the [SEQ] part run <TiO>-- The request exceeded the 60 second time limit and was terminated
// const max_idx = 16384; // seems to be too long for <TiO>-IDE to let the [PAR] part run too <TiO>-- /wrappers/chapel: line 6: 12043 Killed
const indices = 1..max_idx;
const adom = {indices, indices};
var A: [adom] int;
[(i,j) in adom] A[i, j] = i*j; // Populate A[,]
var rowsums: [indices] int;
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.start();
for i in indices { // SECTION-UNDER-TEST--
rowsums[i] = + reduce(A[i, ..]); // SECTION-UNDER-TEST--
} // SECTION-UNDER-TEST--
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.stop();
/*
<SECTION-UNDER-TEST> took 8973 [us] to run in [SEQ] mode for 2 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 28611 [us] to run in [SEQ] mode for 4 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 58824 [us] to run in [SEQ] mode for 8 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 486786 [us] to run in [SEQ] mode for 64 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 1019990 [us] to run in [SEQ] mode for 128 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 2010680 [us] to run in [SEQ] mode for 256 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 4154970 [us] to run in [SEQ] mode for 512 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 8260960 [us] to run in [SEQ] mode for 1024 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 15853000 [us] to run in [SEQ] mode for 2048 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 33126800 [us] to run in [SEQ] mode for 4096 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took n/a [us] to run in [SEQ] mode for 8192 elements on <TiO>-IDE
============================================ */
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_PAR.start();
forall i in indices { // SECTION-UNDER-TEST--
rowsums[i] = + reduce(A[i, ..]); // SECTION-UNDER-TEST--
} // SECTION-UNDER-TEST--
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_PAR.stop();
/*
<SECTION-UNDER-TEST> took 12131 [us] to run in [PAR] mode for 2 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 8095 [us] to run in [PAR] mode for 4 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 8023 [us] to run in [PAR] mode for 8 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 8156 [us] to run in [PAR] mode for 64 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 7990 [us] to run in [PAR] mode for 128 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 8692 [us] to run in [PAR] mode for 256 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 15134 [us] to run in [PAR] mode for 512 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 16926 [us] to run in [PAR] mode for 1024 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 30671 [us] to run in [PAR] mode for 2048 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 105323 [us] to run in [PAR] mode for 4096 elements on <TiO>-IDE
<SECTION-UNDER-TEST> took 292232 [us] to run in [PAR] mode for 8192 elements on <TiO>-IDE
============================================ */
writeln( rowsums,
"\n <SECTION-UNDER-TEST> took ", aStopWATCH_SEQ.elapsed( Time.TimeUnits.microseconds ), " [us] to run in [SEQ] mode for ", max_idx, " elements on <TiO>-IDE",
"\n <SECTION-UNDER-TEST> took ", aStopWATCH_PAR.elapsed( Time.TimeUnits.microseconds ), " [us] to run in [PAR] mode for ", max_idx, " elements on <TiO>-IDE"
);
This is what makes Chapel so great.
Thanks for developing and improving such a great computing tool for HPC.
Related
I have a minimal reproducible sample, which is as follows:
#include <iostream>
#include <chrono>
#include <immintrin.h>
#include <vector>
#include <numeric>
template<typename type>
void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size){
for(size_t i=0; i < size * size; i++){
result[i] = matA[i] + matB[i];
}
}
int main(){
size_t size = 8192;
//std::cout<<sizeof(double) * 8<<std::endl;
auto matA = (float*) aligned_alloc(sizeof(float), size * size * sizeof(float));
auto matB = (float*) aligned_alloc(sizeof(float), size * size * sizeof(float));
auto result = (float*) aligned_alloc(sizeof(float), size * size * sizeof(float));
for(int i = 0; i < size * size; i++){
*(matA + i) = i;
*(matB + i) = i;
}
auto start = std::chrono::high_resolution_clock::now();
for(int j=0; j<500; j++){
AddMatrixOpenMP<float>(matA, matB, result, size);
}
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
std::cout<<"Average Time is = "<<duration/500<<std::endl;
std::cout<<*(result + 100)<<" "<<*(result + 1343)<<std::endl;
}
I experiment as follows - I time the code with #pragma omp for simd directive for the loop in the AddMatrixOpenMP function and then time it without the directive. I compile the code as follows -
g++ -O3 -fopenmp example.cpp
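(For reference, the "with directive" variant being timed presumably looks something like the following; only the version without the pragma is shown above:)
// Presumed directive variant of the function under test (reconstruction,
// not taken verbatim from the posted code):
template<typename type>
void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size){
    #pragma omp for simd
    for(size_t i = 0; i < size * size; i++){
        result[i] = matA[i] + matB[i];
    }
}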
Upon inspecting the assembly, both variants generate vector instructions, but when the OpenMP pragma is explicitly specified, the code runs 3 times slower.
I am not able to understand why.
Edit - I am running GCC 9.3 and OpenMP 4.5. This is running on an i7 9750h 6C/12T on Ubuntu 20.04. I ensured no major processes were running in the background. The CPU frequency held more or less constant during the run for both versions (Minor variations from 4.0 to 4.1)
TIA
The non-OpenMP vectorizer is defeating your benchmark with loop inversion.
Make your function __attribute__((noinline, noclone)) to stop GCC from inlining it into the repeat loop. For cases like this with large enough functions that call/ret overhead is minor, and constant propagation isn't important, this is a pretty good way to make sure that the compiler doesn't hoist work out of the loop.
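As a sketch (my reconstruction, applying the attribute to the function from the question):
#include <cstddef>

// Sketch: forbid inlining/cloning so GCC cannot hoist or invert the work
// out of the timed repeat loop in main().
template<typename type>
__attribute__((noinline, noclone))
void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size){
    for(size_t i = 0; i < size * size; i++){
        result[i] = matA[i] + matB[i];
    }
}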
And in future, check the asm, and/or make sure the benchmark time scales linearly with the iteration count. e.g. increasing 500 up to 1000 should give the same average time in a benchmark that's working properly, but it won't with -O3. (Although it's surprisingly close here, so that smell test doesn't definitively detect the problem!)
After adding the missing #pragma omp simd to the code, yeah I can reproduce this. On i7-6700k Skylake (3.9GHz with DDR4-2666) with GCC 10.2 -O3 (without -march=native or -fopenmp), I get 18266, but with -O3 -fopenmp I get avg time 39772.
With the OpenMP vectorized version, if I look at top while it runs, memory usage (RSS) is steady at 771 MiB. (As expected: init code faults in the two inputs, and the first iteration of the timed region writes to result, triggering page-faults for it, too.)
But with the "normal" vectorizer (not OpenMP), I see the memory usage climb from ~500 MiB until it exits just as it reaches the max 770MiB.
So it looks like gcc -O3 performed some kind of loop inversion after inlining and defeated the memory-bandwidth-intensive aspect of your benchmark loop, only touching each array element once.
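In source terms, the inverted code behaves roughly like the following (my reconstruction from the asm below, not anything GCC actually emits as source):
#include <cstddef>

// Reconstruction of the -O3 behaviour: each element is computed and stored
// exactly once, and the 500-iteration repeat loop survives only as an
// empty delay loop per element.
void invertedEquivalent(const float* matA, const float* matB,
                        float* result, size_t size) {
    for (size_t i = 0; i < size * size; i++) {
        float sum = matA[i] + matB[i];        // one scalar add per element
        for (int j = 500; j != 0; --j) { }    // empty inner loop, kept as a delay
        result[i] = sum;
    }
}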
The asm shows the evidence: GCC 9.3 -O3 on Godbolt doesn't vectorize, and it leaves an empty inner loop instead of repeating the work.
.L4: # outer loop
movss xmm0, DWORD PTR [rbx+rdx*4]
addss xmm0, DWORD PTR [r13+0+rdx*4] # one scalar operation
mov eax, 500
.L3: # do {
sub eax, 1 # empty inner loop after inversion
jne .L3 # }while(--i);
add rdx, 1
movss DWORD PTR [rcx], xmm0
add rcx, 4
cmp rdx, 67108864
jne .L4
This is only 2 or 3x faster than fully doing the work. Probably because it's not vectorized, and it's effectively running a delay loop instead of optimizing away the empty inner loop entirely. And because modern desktops have very good single-threaded memory bandwidth.
Bumping up the repeat count from 500 to 1000 only improved the computed "average" from 18266 to 17821 us per iter. An empty loop still takes 1 iteration per clock. Normally scaling linearly with the repeat count is a good litmus test for broken benchmarks, but this is close enough to be believable.
There's also the overhead of page faults inside the timed region, but the whole thing runs for multiple seconds so that's minor.
The OpenMP vectorized version does respect your benchmark repeat-loop. (Or to put it another way, doesn't manage to find the huge optimization that's possible in this code.)
Looking at memory bandwidth while the benchmark is running:
Running intel_gpu_top -l while the proper benchmark (OpenMP, or with __attribute__((noinline, noclone))) is running shows the following. IMC is the Integrated Memory Controller on the CPU die, shared by the IA cores and the GPU via the ring bus. That's why a GPU-monitoring program is useful here.
$ intel_gpu_top -l
Freq MHz IRQ RC6 Power IMC MiB/s RCS/0 BCS/0 VCS/0 VECS/0
req act /s % W rd wr % se wa % se wa % se wa % se wa
0 0 0 97 0.00 20421 7482 0.00 0 0 0.00 0 0 0.00 0 0 0.00 0 0
3 4 14 99 0.02 19627 6505 0.47 0 0 0.00 0 0 0.00 0 0 0.00 0 0
7 7 20 98 0.02 19625 6516 0.67 0 0 0.00 0 0 0.00 0 0 0.00 0 0
11 10 22 98 0.03 19632 6516 0.65 0 0 0.00 0 0 0.00 0 0 0.00 0 0
3 4 13 99 0.02 19609 6505 0.46 0 0 0.00 0 0 0.00 0 0 0.00 0 0
Note the ~19.6GB/s read / 6.5GB/s write. Read ~= 3x write since it's not using NT stores for the output stream.
But with -O3 defeating the benchmark, with a 1000 repeat count, we see only near-idle levels of main-memory bandwidth.
Freq MHz IRQ RC6 Power IMC MiB/s RCS/0 BCS/0 VCS/0 VECS/0
req act /s % W rd wr % se wa % se wa % se wa % se wa
...
8 8 17 99 0.03 365 85 0.62 0 0 0.00 0 0 0.00 0 0 0.00 0 0
9 9 17 99 0.02 349 90 0.62 0 0 0.00 0 0 0.00 0 0 0.00 0 0
4 4 5 100 0.01 303 63 0.25 0 0 0.00 0 0 0.00 0 0 0.00 0 0
7 7 15 100 0.02 345 69 0.43 0 0 0.00 0 0 0.00 0 0 0.00 0 0
10 10 21 99 0.03 350 74 0.64 0 0 0.00 0 0 0.00 0 0 0.00 0 0
vs. a baseline of 150 to 180 MB/s read, 35 to 50MB/s write when the benchmark isn't running at all. (I have some programs running that don't totally sleep even when I'm not touching the mouse / keyboard.)
I am trying to parallelize, using OpenMP, a for(){...} loop which takes a number of "lines" N of an N*M "table" and sorts each line in ascending order.
I added #pragma omp parallel and #pragma omp for schedule directives, but I don't see any change, as if they do nothing at all.
Here is full program:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <iostream>
double GetTime() {
struct timeval clock;
gettimeofday(&clock, NULL);
double rez = (double)clock.tv_sec+(double)clock.tv_usec/1000000;
return rez;
}
void genMatrix(int *A, int N, int M) {
// Generate matrix
for (int i=0; i<N; i++) {
for (int j=0; j<M; j++) A[i*M+j] = (int)((double)rand()/RAND_MAX*99) + 1;
}
}
int main() {
srand(time(NULL));
int N = 4800;
int M = 6000;
int *A = new int[N*M];
int t, n;
genMatrix(A, N, M);
double t_start = GetTime();
#pragma omp parallel
{
#pragma omp for schedule
for (int k=0; k<N; k++) {
for (int i=0; i<M-1; i++) {
for (int j=0; j<M-1; j++) {
if (A[k*M+j] > A[k*M+j+1]) {
t = A[k*M+j];
A[k*M+j] = A[k*M+j+1];
A[k*M+j+1] = t;
}}}}}
double t_load = GetTime();
// Print matrix
// for (int i=0; i<N; i++) {
// for (int j=0; j<M; j++) {
// printf("%3d", A[i*M+j]);
// }
// printf("\n");
// }
printf("Load time: %.2f\n", t_load - t_start);
system("pause");
}
What is wrong and how should I add parallelization with OpenMP in this case?
Also, I don't know why, but when trying to print the matrix A with big sizes (like int N = 480; int M = 600;), some values are not sorted.
Is it a printing problem?
There are three distinct things, each a sine qua non, for going omp parallel:
A ) - the algorithm has to be correct
B ) - the algorithm has to use resources efficiently
C ) - the algorithm has to spend less on add-on overhead costs, than it receives from going omp
Fixing A), and after some slight experimentation on B) and C):
one may soon realise that the costs demonstrated under B) and C) for the rand() processing are way higher than any benefit from whatever naive or smarter mapping of the matrix coverage onto resources. The source of randomness is a singular engine here: any type of concurrency has to re-propagate a new state of the rand() source across all of its concurrent uses, which costs way more than it could deliver in concurrently operated matrix coverage (plus the naive, cache-line-unaware crossing of the matrix does not help either).
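For clarity, the correctness part under A) amounts to not letting the threads share the swap temporary and the inner indices; a minimal sketch of just that fix (chunk size 5 mirrors myOMP_SCHEDULE_CHUNKS in the full listing below):
#include <omp.h>

// Minimal sketch of the A) fix: t, i and j are declared inside the
// parallel loop, so each thread gets private copies instead of racing
// on shared variables declared in main().
void sortRows( int *A, int N, int M )
{
    #pragma omp parallel for schedule( static, 5 )
    for ( int k = 0; k < N; k++ )
        for ( int i = 0; i < M - 1; i++ )
            for ( int j = 0; j < M - 1; j++ )
                if ( A[k*M + j] > A[k*M + j + 1] ) {
                    int t          = A[k*M + j];     // thread-local temporary
                    A[k*M + j]     = A[k*M + j + 1];
                    A[k*M + j + 1] = t;
                }
}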
The best results ( without optimising the myOMP_SCHEDULE_CHUNKS ):
/*
-O3 private( ..., i, j ) omp single
MATRIX.RAND time: 3 191 [us] 3 446 [us] 3 444 [us] 3 384 [us] 3 173 [us]
MATRIX.SORT time: 96 270 [us] 98 401 [us] 98 423 [us] 95 911 [us] 101 019 [us] #( 3 ) = OMP_THREADS in [ 5 ] OMP_SCHEDULE_CHUNKS
*/
The global view:
/* COMPILE:: -fopenmp
*
* MAY SHELL:: $ export OMP_NUM_THREADS=3
*             $ export OMP_DISPLAY_ENV=1
* https://stackoverflow.com/questions/47495916/how-to-parallelize-matrix-sorting-for-loop
*/
#include <omp.h>
#define myOMP_SCHEDULE_CHUNKS 5 // OMP schedule( static, chunk ) ~ better cache-line depletion
#define myOMP_THREADS 4
/*
$ ./OMP_matrix_SORT
MATRIX.RAND time: 187 744 [us] 234 729 [us] 174 535 [us] 254 273 [us] 122 983 [us]
MATRIX.SORT time: 1 911 310 [us] 1 898 494 [us] 2 026 455 [us] 1 978 631 [us] 1 911 231 [us] #( 3 ) = OMP_THREADS
MATRIX.RAND time: 6 166 [us] 6 977 [us] 6 722 [us]
MATRIX.SORT time: 2 448 608 [us] 2 264 572 [us] 2 355 366 [us] #( 3 ) = OMP_THREADS in [ 5 ] OMP_SCHEDULE_CHUNKS
MATRIX.RAND time: 6 918 [us] 17 551 [us] 7 194 [us]
MATRIX.SORT time: 1 774 883 [us] 1 809 002 [us] 1 786 494 [us] #( 1 ) = OMP_THREADS
MATRIX.RAND time: 7 321 [us] 7 337 [us] 6 698 [us]
MATRIX.SORT time: 2 152 945 [us] 1 900 149 [us] 1 883 638 [us] #( 1 ) = OMP_THREADS
MATRIX.RAND time: 54 198 [us] 67 290 [us] 52 123 [us]
MATRIX.SORT time: 759 248 [us] 769 580 [us] 760 759 [us] 812 875 [us] #( 3 ) = OMP_THREADS
MATRIX.RAND time: 7 054 [us] 6 414 [us] 6 435 [us] 6 426 [us]
MATRIX.SORT time: 687 021 [us] 760 917 [us] 674 496 [us] 705 629 [us] #( 3 ) = OMP_THREADS
-O3
MATRIX.RAND time: 5 890 [us] 6 147 [us] 6 081 [us] 5 796 [us] 6 143 [us]
MATRIX.SORT time: 148 381 [us] 152 664 [us] 184 922 [us] 155 236 [us] 169 442 [us] #( 3 ) = OMP_THREADS in [ 5 ] OMP_SCHEDULE_CHUNKS
-O3 private( ..., i, j )
MATRIX.RAND time: 6 410 [us] 6 111 [us] 6 903 [us] 5 831 [us] 6 224 [us]
MATRIX.SORT time: 129 787 [us] 129 836 [us] 195 299 [us] 136 111 [us] 161 117 [us] #( 4 ) = OMP_THREADS in [ 5 ] OMP_SCHEDULE_CHUNKS
MATRIX.RAND time: 6 349 [us] 6 532 [us] 6 104 [us] 6 213 [us]
MATRIX.SORT time: 151 202 [us] 152 542 [us] 160 403 [us] 180 375 [us] #( 3 ) = OMP_THREADS in [ 5 ] OMP_SCHEDULE_CHUNKS
MATRIX.RAND time: 6 745 [us] 5 834 [us] 5 791 [us] 7 164 [us] 6 535 [us]
MATRIX.SORT time: 214 590 [us] 214 563 [us] 209 610 [us] 205 940 [us] 230 787 [us] #( 2 ) = OMP_THREADS in [ 5 ] OMP_SCHEDULE_CHUNKS
*/
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <iostream>
long GetTime() { // double GetTime()
struct timeval clock;
gettimeofday( &clock, NULL );
return (long)clock.tv_sec * 1000000 // in [us] ( go (long) instead of float )
+ (long)clock.tv_usec; //
/* double rez = (double)clock.tv_sec
* + (double)clock.tv_usec / 1000000;
* // + (double)clock.tv_usec * 0.000001; // NEVER DIV
return rez;
*/
}
void genMatrix( int *A, int N, int M ) { // Generate matrix
register int i, iM, j;
#pragma omp parallel
for ( i = 0; i < N; i++ ) {
iM = i * M;
/* for ( register int i = 0; i < N; i++ ) {
register int iM = i * M;
*/
// #pragma omp parallel // 234 729 [us]
// for ( register int j = 0; j < M; j++ )
// #pragma omp parallel for schedule( static, myOMP_SCHEDULE_CHUNKS ) // 122 983 [us] #( 3 ) = OMP_THREADS ~~~ v/s 6 698 [us] #( 1 ) = OMP_THREADS
// // v/s 5 796 [us] # NON-OMP
#pragma omp single // ~~ 3 191 [us]
for ( int j = 0; j < M; j++ )
A[iM +j] = (int)( (double)rand() / RAND_MAX * 99 ) + 1;
// A[i*M+j] = (int)( (double)rand() / RAND_MAX * 99 ) + 1;
}
}
int main() {
srand( time( NULL ) );
int N = 480; // 4800; ~ 100x faster outputs
int M = 600; // 6000;
int Mb1 = M - 1;
int *A = new int[N*M];
omp_set_num_threads( myOMP_THREADS );
long long int t_start = GetTime();
genMatrix( A, N, M );
long long int t_load = GetTime();
printf( "MATRIX.RAND time: %lld [us]\n", t_load - t_start );
register int thisB,
this1,
next1,
t, i, j;
t_start = GetTime(); // double t_start = GetTime();
// for ( register int k = 0; k < N; k++ ) {
// #pragma omp parallel
// #pragma omp parallel for schedule( static, myOMP_SCHEDULE_CHUNKS ) // schedule( type, chunk ):
// #pragma omp parallel for schedule( static, myOMP_SCHEDULE_CHUNKS ) private( thisB, this1, next1, t ) // schedule( type, chunk ):
#pragma omp parallel for schedule( static, myOMP_SCHEDULE_CHUNKS ) private( thisB, this1, next1, t, i, j ) // schedule( type, chunk ):
for ( int k = 0; k < N; k++ ) {
thisB = k*M;
if ( omp_get_num_threads() != myOMP_THREADS ) {
printf( "INF: myOMP_THREADS ( == %d ) do not match the number of executed ones ( == %d ) ", myOMP_THREADS, omp_get_num_threads() );
}
//--------------------------------------------------// -------------SORT ROW-k
// for ( register int i = 0; i < Mb1; i++ ) { // < M-1; i++ ) {
// for ( register int j = 0; j < Mb1; j++ ) { // < M-1; j++ ) {
for ( i = 0; i < Mb1; i++ ) {
for ( j = 0; j < Mb1; j++ ) {
this1 = thisB + j,
next1 = this1 + 1;
if ( A[this1] > A[next1] ){ // A[k*M+j ] > A[k*M+j+1] ) {
t = A[this1]; // t = A[k*M+j ];
A[this1] = A[next1]; // A[k*M+j ] = A[k*M+j+1];
A[next1] = t; // A[k*M+j+1] = t;
}
}
}
//--------------------------------------------------// -------------SORT ROW-k
}
t_load = GetTime(); // double t_load = GetTime();
/* Print matrix
//
for ( int i = 0; i < N; i++ ) {
for ( int j = 0; j < M; j++ ) {
printf( "%3d", A[i*M+j] );
}
printf("\n");
}
//
*/
printf( "MATRIX.SORT time: %lld [us] #( %d ) = OMP_THREADS in [ %d ] OMP_SCHEDULE_CHUNKS\n",
t_load - t_start,
myOMP_THREADS,
myOMP_SCHEDULE_CHUNKS
);
// system( "pause" );
}
So I have implemented a heap, merge, and quicksort. I time the three sorts all the same way.
double DiffClocks(clock_t clock1, clock_t clock2){
double diffticks = clock1 - clock2;
double diffsecs = diffticks / CLOCKS_PER_SEC;
return diffsecs;
}
Then with each sort, I time them the same way. Just repeated for each different sort.
void heapsort(int myArray[], int n){
clock_t begin, end;
begin = clock();
heapSortMain(myArray, n);
end = clock();
double elapsedTime = heapDiffClocks(end, begin);
std::cout << '\t' << elapsedTime;
}
All three of the sorts are working. I have a function that verifies the arrays are sorted after executing each sort. My question is, why do I have such a big difference between the timing when running on g++ and on Visual Studio?
My output from Visual Studio 2012:
n Heap Merge Quick
100 0 0 0
1000 0 0 0
10000 0.01 0 0
100000 0.14 0.02 0.03
1000000 1.787 0.22 0.33
10000000 24.116 2.475 6.956
My output from g++ 4.7.2
n Heap Merge Quick
100 0 0 0
1000 0 0 0
10000 0 0 0.01
100000 0.05 0.02 0.02
1000000 0.59 0.33 0.29
10000000 10.78 3.79 3.3
I used a standard bubbleDown and swap implementation with heap. A recursive mergesort with a merge to merge the two sorted subarrays. A recursive quicksort with a median of 3 pivot and partition function.
I have always understood quicksort to be the fastest general sorting algorithm. On VS it really lags behind mergesort, and heapsort's time shoots up when I hit 10 million.
My code does 2D matrix multiplication, similar to http://gpgpu-computing4.blogspot.de/2009/09/matrix-multiplication-2-opencl.html.
The dimensions of the matrices are (1000*1000, 10000*10000 and 100000*100000).
My Hardware is: NVIDIA Corporation GM204 [GeForce GTX 980] (MAX_WORK_GROUP_SIZES: 1024 1024 64).
The question is:
I get some confusing results depending on local_item_size, and I need to understand what is happening:
1000 x 1000 matrices & local_item_size = 16 : INVALID_WORKGROUP_SIZE.
1000 x 1000 matrices & local_item_size = 8 : WORKS :).
1000 x 1000 matrices & local_item_size = 10 : WORKS :) (the execution time with 8 was better).
10000 x 10000 matrices & local_item_size = 8 or 16 : CL_OUT_OF_RESOURCES.
Thanks in advance,
To your second question, this is the reasoning behind it:
1000 / 8 = 125, ok
1000 / 16 = 62.5, wrong! INVALID_WORKGROUP_SIZE
1000 / 10 = 100, ok, but 10 (and multiples of 10) will never fully use the GPU cores.
I.e.: if you have 16 warps, 6 are wasted; if you have 32, 2 are wasted; and so on.
10000 x 10000 = 400 MB (at least, if using floats) for just the input, so something is getting too big for the memory, therefore CL_OUT_OF_RESOURCES.
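A common workaround for the divisibility constraint (not part of the original answer; roundUp is a hypothetical helper) is to pad the global NDRange up to the next multiple of the local size and guard the extra work-items inside the kernel:
#include <cstddef>

// Hypothetical helper: round a global dimension up to a multiple of the
// local work-group size, so that global % local == 0.
static size_t roundUp( size_t global, size_t local ) {
    return ( ( global + local - 1 ) / local ) * local;
}

// e.g. for the 1000 x 1000 case with local_item_size = 16:
//   size_t local[2]  = { 16, 16 };
//   size_t global[2] = { roundUp( 1000, 16 ), roundUp( 1000, 16 ) };  // 1008 x 1008
//   clEnqueueNDRangeKernel( queue, kernel, 2, NULL, global, local, 0, NULL, NULL );
// and the kernel then needs an "if (row < 1000 && col < 1000)" guard for the padding.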
Why did they do this:
Sys_SetPhysicalWorkMemory( 192 << 20, 1024 << 20 ); //Min = 201,326,592 Max = 1,073,741,824
Instead of this:
Sys_SetPhysicalWorkMemory( 201326592, 1073741824 );
The article I got the code from
A neat property is that shifting a value << 10 is the same as multiplying it by 1024 (1 KiB), and << 20 multiplies it by 1024*1024 (1 MiB).
Shifting by successive multiples of 10 yields all of our standard units of computer storage:
1 << 10 = 1 KiB (Kibibyte)
1 << 20 = 1 MiB (Mebibyte)
1 << 30 = 1 GiB (Gibibyte)
...
So that function is expressing its arguments to Sys_SetPhysicalWorkMemory(int minBytes, int maxBytes) as 192 MB (min) and 1024 MB (max).
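A quick illustrative check of those units (not from the article; the compiler verifies the arithmetic at compile time):
// The shift amounts map directly onto the storage units listed above.
static_assert( (1 << 10) ==               1024, "1 KiB" );
static_assert( (1 << 20) ==        1024 * 1024, "1 MiB" );
static_assert( (1 << 30) == 1024 * 1024 * 1024, "1 GiB" );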
Self commenting code:
192 << 20 means 192 * 2^20 = 192 * 2^10 * 2^10 = 192 * 1024 * 1024 = 192 MByte
1024 << 20 means 1024 * 2^20 = 1 GByte
Computations on constants are optimized away so nothing is lost.
I might be wrong (and I didn't study the source), but I guess it's just for readability reasons.
I think the point (not mentioned yet) is that
All but the most basic compilers will do the shift at compilation time. Whenever you use operators with constant expressions, the compiler will be able to do this before the code is even generated.
Note that before constexpr and C++11, this did not extend to functions.
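For example, since C++11 the same idea can be wrapped in a constexpr function and still be evaluated entirely at compile time (illustrative sketch; MiB is a made-up helper name):
// Illustrative constexpr wrapper: evaluated at compile time since C++11.
constexpr int MiB( int n ) { return n << 20; }

static_assert( MiB(192)  ==  192 * 1024 * 1024, "still a compile-time constant" );
static_assert( MiB(1024) == 1024 * 1024 * 1024, "1 GiB" );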