I am implementing a basic NeuroEvolution program. The data set I am attempting is the "Cover Type" data set. It has 15,120 records with 56 inputs (numerical data on patches of forest land) and 1 output (the cover type). As recommended, I am using 150 hidden neurons. The fitness function iterates through all 15,120 records to calculate an error and uses that error to calculate the fitness. The method is shown below.
double getFitness() {
    double error = 0.0;
    // for every record in the dataset.
    for (int record = 0; record < targets.size(); record++) {
        // evaluate output vector for this record of inputs.
        vector<double> outputs = eval(inputs[record]);
        // for every value in those records.
        for (int value = 0; value < outputs.size(); value++) {
            // add to error the difference.
            error += abs(outputs[value] - targets[record][value]);
        }
    }
    return 1 - (error / targets.size());
}
"inputs" and "targets" are 2-D vectors read-in from the CSV file. The entire program uses ~40 MB of memory at run-time. That is not a problem. The program mutates off of a parent network, both are evaluated for fitness, and the most fit is kept to be mutated-upon. In the entire process, the getFitness() function is taking the most time. The program is written in Visual Studio 2017 (on a 2.6GHz i7) in Windows 10.
It took ~7 minutes to evaluate ONE network's fitness, using 21% of the CPU. Smaller problems have required hundreds of thousands of such evaluations.
What methods are available to get that number down?
The program, which (apparently) makes significant use of vectors, cannot be directly offloaded to a GPU using OpenCL or CUDA without significant modifications. This makes OpenMP the most viable option: as few as 2 lines of code need to be added to "go parallel." In addition, Visual Studio must be set to use OpenMP (Project -> Properties -> C/C++ -> Language -> OpenMP Support).
#include <omp.h>
#include <numeric>
// libraries, variables, functions, etc...
double getFitness() {
    double err = 0.0;
    vector<double> error;
    error.resize(inputs.size());
    // for every record in the dataset.
    #pragma omp parallel for
    for (int record = 0; record < (int)targets.size(); record++) {
        // evaluate output vector for this record of inputs.
        vector<double> outputs = eval(inputs[record]);
        // accumulate this record's error into its own slot,
        // so threads never write to the same element.
        for (int value = 0; value < (int)outputs.size(); value++) {
            error[record] += abs(outputs[value] - targets[record][value]);
        }
    }
    // sum the per-record errors; the 0.0 keeps accumulate working in double, not int.
    err = accumulate(error.begin(), error.end(), 0.0);
    return 1 - (err / targets.size());
}
In the above code, OpenMP creates a team of threads (typically one per logical core) and splits the records among them; each record writes its error into its own slot of the error vector, so no locking is needed. Be aware that it will saturate your CPU while it runs. Expect a ~4x improvement on a quad-core CPU.
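If you prefer to avoid the temporary per-record error vector, OpenMP's reduction clause can do the summation for you. The sketch below is a minimal alternative under the same assumption the code above already makes, namely that eval() is safe to call from multiple threads:

double getFitness() {
    double error = 0.0;
    // each thread accumulates into a private copy of 'error';
    // OpenMP adds the copies together at the end of the loop.
    #pragma omp parallel for reduction(+:error)
    for (int record = 0; record < (int)targets.size(); record++) {
        vector<double> outputs = eval(inputs[record]);
        for (int value = 0; value < (int)outputs.size(); value++) {
            error += abs(outputs[value] - targets[record][value]);
        }
    }
    return 1 - (error / targets.size());
}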
Related
When an OpenACC "#pragma acc routine worker" routine contains multiple loops with vector (and worker) level parallelism, how do vector_length and num_workers work?
I played around with some code (see below) and stumbled upon a few things:
1. Setting the vector length of these loops seriously confuses me. Using the vector_length(#) clause on the outer parallel region seems to behave strangely when I compare run times: when I increase the vector length to huge numbers, e.g. 4096, the run time actually gets smaller. In my understanding, a huge number of threads should lie dormant when there are only 10 iterations in the vector loop. Am I doing something wrong here?
2. I noticed that the output weirdly depends on the number of workers in foo(). If it is 16 or smaller, the output is "correct". If it is 32 or even much larger, the loops inside the worker routine somehow get executed twice. What am I missing here?
Can someone give me a hand with the OpenACC routine clause? Many thanks in advance.
Here is the example code:
#include <iostream>
#include <chrono>
class A{
public:
    int out;
    int* some_array;

    A(){
        some_array = new int[1000*100*10];
        for(int i = 0; i < 1000*100*10; ++i){
            some_array[i] = 1;
        }
        #pragma acc enter data copyin(this, some_array[0:1000*100*10])
    };

    ~A(){
        #pragma acc exit data delete(some_array, this)
        delete [] some_array;
    }

    #pragma acc routine worker
    void some_worker(int i){
        int private_out = 10;
        #pragma acc loop vector reduction(+: private_out)
        for(int j=0; j < 10; ++j){
            //do some stuff
            private_out -= some_array[j];
        }

        #pragma acc loop reduction(+: private_out) worker
        for(int j=0; j < 100; ++j){
            #pragma acc loop reduction(+: private_out) vector
            for(int k=0; k < 10; ++k){
                //do some other stuff
                private_out += some_array[k+j*10+i*10*100];
            }
        }

        #pragma acc atomic update
        out += private_out;
    }

    void foo(){
        #pragma acc data present(this, some_array[0:1000*100*10]) pcreate(out)
        {
            #pragma acc serial
            out=0;

            //#######################################################
            //# setting num_workers and vector_length produce weird #
            //# results and runtimes                                #
            //#######################################################
            #pragma acc parallel loop gang num_workers(64) vector_length(4096)
            for(int i=0; i < 1000; ++i){
                some_worker(i);
            }

            #pragma acc update host(out)
        }
    }
};

int main() {
    using namespace std::chrono;
    A a;

    auto start = high_resolution_clock::now();
    a.foo();
    auto stop = high_resolution_clock::now();

    std::cout << a.out << std::endl
              << "took " << duration_cast<microseconds>(stop - start).count() << "ms" << std::endl;

    //output for num_workers(16) vector_length(4096)
    //1000000
    //took 844ms
    //
    //output for num_workers(16) vector_length(2)
    //1000000
    //took 1145ms
    //
    //output for num_workers(32) vector_length(2)
    //1990000
    //took 1480ms
    //
    //output for num_workers(64) vector_length(1)
    //1990000
    //took 502ms
    //
    //output for num_workers(64) vector_length(4096)
    //1000000
    //took 853ms
    return 0;
}
Machine specs: nvc++ 21.3-0 with OpenACC 2.7, Tesla K20c with cc35, NVIDIA-driver 470.103.01 with CUDA 11.4
Edit:
Additional information for 2.:
I simply used some printfs in the worker to look at the intermediate results, placing them at the implicit barriers between the loops. I could see that the value of private_out went from the initial 10 to -10 (instead of 0) after the first loop, and then to 1990 (instead of 1000) after the second. This looks to me like both loops are being executed twice.
More results for convenience
To add to the strangeness of this example: the code does not compile for some combinations of num_workers/vector_length, e.g. leaving num_workers at 64 and setting vector_length to 2, 4, 8, 16, or even 32 (which pushes the thread count over the limit of 1024). It gives the error message
ptxas error : Entry function '_ZN1A14foo_298_gpu__1Ev' with max regcount of 32 calls function '_ZN1A11some_workerEi' with regcount of 41
However, after simply inserting the printfs described above, it suddenly compiles fine but fails at runtime with the error "call to cuLaunchKernel returned error 1: Invalid value".
Strangest of all, it compiles and runs fine for 64/64 but returns incorrect results. Below is the output of this setting with NV_ACC_TIME=1; note that the output is almost exactly the same for every configuration that compiles and runs, except for the block: [1x#-######] part.
Accelerator Kernel Timing data
/path/to/src/main.cpp
  _ZN1AC1Ev  NVIDIA  devicenum=0
    time(us): 665
    265: data region reached 1 time
        265: data copyin transfers: 3
             device time(us): total=665 max=650 min=4 avg=221
/path/to/src/main.cpp
  _ZN1AD1Ev  NVIDIA  devicenum=0
    time(us): 8
    269: data region reached 1 time
        269: data copyin transfers: 1
             device time(us): total=8 max=8 min=8 avg=8
/path/to/src/main.cpp
  _ZN1A3fooEv  NVIDIA  devicenum=0
    time(us): 1,243
    296: data region reached 2 times
    298: compute region reached 2 times
        298: kernel launched 2 times
            grid: [1-1000]  block: [1-32x1-24]
             device time(us): total=1,230 max=1,225 min=5 avg=615
            elapsed time(us): total=1,556 max=1,242 min=314 avg=778
    304: update directive reached 1 time
        304: data copyout transfers: 1
             device time(us): total=13 max=13 min=13 avg=13
The exact mapping of workers and vectors depends on the target device and implementation. Specifically, when using NVHPC targeting NVIDIA GPUs, a "gang" maps to a CUDA block, "worker" maps to the y-dimension of a thread block, and "vector" to the x-dimension. The value used in "num_workers" or "vector_length" may be reduced given the constraints of the target. A CUDA block can contain at most 1024 threads, so the "4096" value will be reduced to what the hardware allows. Secondly, in order to support vector reductions in device routines, the maximum vector_length is 32. In other words, your "4096" value is actually "32" due to these constraints.
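Given those constraints, a hedged adjustment of the launch configuration in the question (not a verified fix, just the caps described above applied to the same loop) would keep the vector length at the 32-thread warp size and num_workers at 16 or below, so the block stays within the hardware limit and in the range that produced correct results:

// vector capped at the warp size (32); 16 workers x 32 vector = 512 threads per block
#pragma acc parallel loop gang num_workers(16) vector_length(32)
for(int i=0; i < 1000; ++i){
    some_worker(i);
}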
Note: to see the max thread block size on your device, run the "nvaccelinfo" utility and look at the "Maximum Threads per Block" and "Maximum Block Dimensions" fields. Also, setting the environment variable "NV_ACC_TIME=1" will have the runtime produce some basic profiling information, including the actual number of blocks and the thread block size used during the run.
In my understanding, a huge number of threads should lie dormant when there are only 10 iterations in the vector loop.
CUDA threads are grouped into "warps" of 32 threads, where all threads of a warp execute the same instructions concurrently (aka SIMT, single instruction multiple threads). Hence even though only 10 threads are doing useful work, the remaining 22 are not dormant. Plus, they still take resources such as registers, so adding too many threads for loops with low trip counts may actually hurt performance.
In this case, setting the vector length to 1 is most likely the best choice, since the warp can then be composed of the y-dimension (worker) threads. Setting it to 2 will cause a full 32-thread warp in the x-dimension, but with only 2 threads doing useful work.
As to why some combinations give incorrect results, I didn't investigate. Routine worker, especially with reductions, is rarely used, so it's possible we have some type of code generation issue, like an off-by-one error in the reduction, at these irregular schedule sizes. I'll look into it later and determine whether I need to file an issue report.
For #2: how are you determining that it's getting run twice? Is this just based on the runtime?
I have the following sequential code:
1.
ProcessImage(){
for_each_line
{
for_each_pixel_of_line()
{
A = ComputeA();
B = ComputeB();
DoBiggerWork();
}
}
}
Now I changed it to precalculate all the A and B values for the whole image, line by line, as below.
2.
ProcessImage(){
for_each_line
{
A = ComputeAinLine();
B = ComputeBinLine();
for_each_pixel_of_line()
{
Ai = A[i];
Bi = B[i];
DoBiggerWork();
}
}
}
The results show that the 2nd block of code executes about 10% slower than the 1st block.
I am wondering: is it a cache-miss issue in the 2nd block of code?
I am going to use SIMD to parallelize the precalculation in the 2nd block of code. Is it worth trying?
It all depends on how you implemented your functions. Profile your code and determine where the bottlenecks are.
If there is no benefit to calculating the values once per row, then don't do it. You need the A and B values only for one pixel at a time. In the second block of code you traverse the line once to calculate the values, then traverse it again for DoBiggerWork(), each time fetching the values from the prepared arrays. That costs more CPU time.
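To make the two-pass point concrete, here is a minimal sketch of the precalculated variant; ComputeA, ComputeB and DoBiggerWork are hypothetical stand-ins, not the asker's actual routines:

#include <vector>

// stand-ins for the real per-pixel routines
static float ComputeA(float px)             { return px * 0.5f; }
static float ComputeB(float px)             { return px + 1.0f; }
static void  DoBiggerWork(float a, float b) { (void)a; (void)b; /* heavy per-pixel work */ }

void ProcessLinePrecalc(const std::vector<float>& line)
{
    std::vector<float> A(line.size()), B(line.size());
    for (size_t i = 0; i < line.size(); ++i) {   // pass 1: fill the temporaries
        A[i] = ComputeA(line[i]);
        B[i] = ComputeB(line[i]);
    }
    for (size_t i = 0; i < line.size(); ++i) {   // pass 2: re-load A[i], B[i] per pixel
        DoBiggerWork(A[i], B[i]);
    }
}

The second pass re-loads A[i] and B[i] from memory for every pixel, which is the extra work the answer above is referring to.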
I encountered this weird bug in a C++ multithreaded program on Linux. The multithreaded part basically executes a loop. A single iteration first loads a SIFT file containing some features, and then queries these features against a tree. Since I have a lot of images, I use multiple threads to do the querying. Here are the code snippets.
struct MultiMatchParam
{
    int thread_id;
    float *scores;
    double *scores_d;
    int *perm;
    size_t db_image_num;
    std::vector<std::string> *query_filenames;
    int start_id;
    int num_query;
    int dim;
    VocabTree *tree;
    FILE *file;
};

// multi-thread will do normalization anyway
void MultiMatch(MultiMatchParam &param)
{
    // Clear scores
    for(size_t t = param.start_id; t < param.start_id + param.num_query; t++)
    {
        for (size_t i = 0; i < param.db_image_num; i++)
            param.scores[i] = 0.0;

        DTYPE *keys;
        int num_keys;
        keys = ReadKeys_sfm((*param.query_filenames)[t].c_str(), param.dim, num_keys);

        int normalize = true;
        double mag = param.tree->MultiScoreQueryKeys(num_keys, normalize, keys, param.scores);
        delete [] keys;
    }
}
I run this on an 8-core CPU. At first it runs perfectly and the CPU usage is nearly 100% on all 8 cores. After each thread has queried several images (about 20), all of a sudden the performance (CPU usage) drops drastically, down to about 30% across all eight cores.
I suspect the key to this bug lies in this line of code.
double mag = param.tree->MultiScoreQueryKeys(num_keys, normalize, keys, param.scores);
If I replace it with another costly operation (e.g., a large for-loop containing sqrt), the CPU usage stays at nearly 100%. This MultiScoreQueryKeys function performs a complex operation on a tree. Since all eight cores may read the same tree (there are no writes to it), I wonder whether the read operations have some kind of blocking effect. But they shouldn't, because I have no write operations in this function. Also, the operations in the loop are basically the same; if they were going to block and drag down CPU usage, it would happen in the first few iterations. If you need to see the details of this function or other parts of this project, please let me know.
Use std::async() instead of zeta::SimpleLock lock
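For illustration, here is a minimal sketch of driving MultiMatch() with std::async (this is not the asker's actual threading code, which isn't shown; it assumes the params are set up so each task writes only through its own scores buffer and therefore needs no lock):

#include <future>
#include <vector>

void MultiMatchAll(std::vector<MultiMatchParam> &params)
{
    std::vector<std::future<void>> tasks;
    for (auto &p : params)                         // one task per chunk of query images
        tasks.push_back(std::async(std::launch::async, [&p] { MultiMatch(p); }));
    for (auto &t : tasks)
        t.get();                                   // wait for all queries to finish
}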
I am trying to implement the quickHull algorithm (for convex hulls) in parallel in CUDA. It works correctly for input_size <= 1 million. When I try 10 million points, the program crashes. My graphics card has 1982 MB of memory, and all the data structures in the algorithm collectively require no more than 600 MB for this input size, which is less than 50% of the available space.
By commenting out lines of my kernels, I found that the crash occurs when I try to access an array element, even though the index of the element I am trying to access is not out of bounds (double-checked). The following is the kernel code where it crashes.
for(unsigned int i = old_setIndex; i < old_setIndex + old_setS[tid]; i++)
{
    int pI = old_set[i];
    if(pI <= -1 || pI > pts.size())
    {
        printf("Thread %d: i = %d, pI = %d\n", tid, i, pI);
        continue;
    }
    p = pts[pI];
    double d = distance(A,B,p);
    if(d > dist) {
        dist = d;
        furthestPoint = i;
        fpi = pI;
    }
}
//fpi = old_set[furthestPoint];
//printf("Thread %d: Furthestpoint = %d\n", tid, furthestPoint);
My code crashes when I uncomment the statements (array access and printf) after the for loop. I am unable to explain the error, as furthestPoint is always within the bounds of the old_set array. old_setS stores the sizes of the smaller arrays that each thread operates on. It crashes even if I just try to print the value of furthestPoint (last line) without the array access statement above it.
There is no problem with the above code for input sizes <= 1 million. Am I overflowing some buffer on the device in the 10 million case?
Please help me in finding the source of the crash.
There is no out of bounds memory access in your code (or at least not one which is causing the symptoms you are seeing).
What is happening is that your kernel is being killed by the display driver because it is taking too much time to execute on your display GPU. All CUDA platform display drivers include a time limit for any operation on the GPU. This exists to prevent the display from freezing for a sufficiently long time that either the OS kernel panics or the user panics and thinks the machine has crashed. On the windows platform you are using, the time limit is about 2 seconds.
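One way to confirm whether this limit applies to your GPU (a standard CUDA runtime query, independent of the asker's code) is to check the kernelExecTimeoutEnabled device property:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // reports whether the driver enforces a run-time limit on kernels for device 0
    std::printf("%s: kernel timeout %s\n", prop.name,
                prop.kernelExecTimeoutEnabled ? "enabled" : "disabled");
    return 0;
}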
What has partly misled you into thinking the problem is array addressing is that commenting out code makes the problem disappear. What really happens there is an artifact of compiler optimization: when you comment out a global memory write, the compiler recognizes that the calculations leading to the stored value are unused and removes all of that code from the assembly it emits (google "nvcc dead code removal" for more information). That makes the code run much faster and brings it back under the display driver's time limit.
For workarounds see this recent stackoverflow question and answer
I have the following C++ code:
const int N = 1000000;
int id[N];      // Value can range from 0 to 9
float value[N];
// load id and value from an external source...

int size[10] = { 0 };
float sum[10] = { 0 };
for (int i = 0; i < N; ++i)
{
    ++size[id[i]];
    sum[id[i]] += value[i];
}
How should I optimize the loop?
I considered using SSE to add every 4 floats to a running vector sum, so that after N iterations the total is just the sum of the 4 floats in the xmm register, but this doesn't work when the source is indexed like this and needs to be written out to 10 different buckets.
This kind of loop is very hard to optimize using SIMD instructions. Not only is there no easy way in most SIMD instruction sets to do this kind of indexed read ("gather") or write ("scatter"); even if there were, this particular loop still has the problem that two values might map to the same id within one SIMD register, e.g. when
id[0] == 0
id[1] == 1
id[2] == 2
id[3] == 0
in this case, the obvious approach (pseudocode here)
x = gather(size, id[i]);
y = gather(sum, id[i]);
x += 1; // componentwise
y += value[i];
scatter(x, size, id[i]);
scatter(y, sum, id[i]);
won't work either!
You can get by if there's a really small number of possible cases (e.g. assume that sum and size only had 3 elements each) by just doing brute-force compares, but that doesn't really scale.
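To make the brute-force-compare idea concrete, here is a hedged SSE2 sketch for the 3-bucket case mentioned above (it assumes n is a multiple of 4; ids are compared against each bucket index, and the resulting lane masks select which values and counts each bucket receives):

#include <emmintrin.h>

void bucket_sums_sse(const int *id, const float *value, int n,
                     float sum[3], int size[3])
{
    __m128  acc[3] = { _mm_setzero_ps(), _mm_setzero_ps(), _mm_setzero_ps() };
    __m128i cnt[3] = { _mm_setzero_si128(), _mm_setzero_si128(), _mm_setzero_si128() };

    for (int i = 0; i < n; i += 4) {
        __m128i ids = _mm_loadu_si128((const __m128i *)(id + i));
        __m128  val = _mm_loadu_ps(value + i);
        for (int b = 0; b < 3; ++b) {
            // lanes where id == b become all-ones (-1), others zero
            __m128i m = _mm_cmpeq_epi32(ids, _mm_set1_epi32(b));
            // masked add of the values that belong to bucket b
            acc[b] = _mm_add_ps(acc[b], _mm_and_ps(_mm_castsi128_ps(m), val));
            // subtracting -1 per matching lane increments the count
            cnt[b] = _mm_sub_epi32(cnt[b], m);
        }
    }
    for (int b = 0; b < 3; ++b) {            // horizontal sums into the scalar outputs
        float a[4]; int c[4];
        _mm_storeu_ps(a, acc[b]);
        _mm_storeu_si128((__m128i *)c, cnt[b]);
        sum[b]  += a[0] + a[1] + a[2] + a[3];
        size[b] += c[0] + c[1] + c[2] + c[3];
    }
}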
One way to get this somewhat faster without using SIMD is by breaking up the dependencies between instructions a bit using unrolling:
int size[10] = { 0 }, size2[10] = { 0 };
float sum[10] = { 0 }, sum2[10] = { 0 };

for (int i = 0; i < N/2; i++) {
    int id0 = id[i*2+0], id1 = id[i*2+1];
    ++size[id0];
    ++size2[id1];
    sum[id0] += value[i*2+0];
    sum2[id1] += value[i*2+1];
}

// if N was odd, process the last element
if (N & 1) {
    ++size[id[N-1]];
    sum[id[N-1]] += value[N-1];
}

// add the partial counts and sums together
for (int i = 0; i < 10; i++) {
    size[i] += size2[i];
    sum[i] += sum2[i];
}
Whether this helps or not depends on the target CPU though.
Well, you are reading id[i] twice in your loop. You could store it in a variable, or in a register int if you wanted to.
register int index;
for (int i = 0; i < N; ++i)
{
    index = id[i];
    ++size[index];
    sum[index] += value[i];
}
The MSDN docs state this about register:
The register keyword specifies that the variable is to be stored in a machine register. Microsoft Specific: The compiler does not accept user requests for register variables; instead, it makes its own register choices when global register-allocation optimization (/Oe option) is on. However, all other semantics associated with the register keyword are honored.
Something you can do is compile it with the -S flag (or the equivalent if you aren't using gcc) and compare the assembly output at -O, -O2, and -O3. One common way to optimize a loop is to do some degree of unrolling, for (a very simple, naive) example:
int end = N/2;
int index = 0;
for (int i = 0; i < end; ++i)
{
    index = 2 * i;
    ++size[id[index]];
    sum[id[index]] += value[index];
    index++;
    ++size[id[index]];
    sum[id[index]] += value[index];
}
which will cut the number of cmp instructions in half. However, any half-decent optimizing compiler will do this for you.
Are you sure it will make much difference? The likelihood is that the loading of "id from an external source" will take significantly longer than adding up the values.
Do not optimise until you KNOW where the bottleneck is.
Edit in answer to the comment: You misunderstand me. If it takes 10 seconds to load the ids from a hard disk, then the fractions of a second spent on processing the list are immaterial in the grander scheme of things. Let's say it takes 10 seconds to load and 1 second to process:
You optimise the processing loop so it takes 0 seconds (almost impossible, but it illustrates the point), and it is STILL taking 10 seconds. 11 seconds really isn't that bad a performance hit, and you would be better off focusing your optimisation time on the actual data load, as that is far more likely to be the slow part.
In fact it can be quite optimal to do double-buffered data loads: i.e. you load buffer 0, then start the load of buffer 1. While buffer 1 is loading you process buffer 0. When that's done you start the load of the next buffer while processing buffer 1, and so on. This way you can completely amortise the cost of processing (see the sketch after this paragraph).
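A minimal double-buffering sketch of that idea, with hypothetical LoadChunk/ProcessChunk helpers standing in for the real disk load and the summation loop (they are not from the question):

#include <future>
#include <vector>

bool LoadChunk(std::vector<float> &buf);        // fills buf, returns false at end of data
void ProcessChunk(const std::vector<float> &buf);

void ProcessAll()
{
    std::vector<float> bufs[2];
    int cur = 0;
    if (!LoadChunk(bufs[cur])) return;
    for (;;) {
        // start loading the other buffer while we process the current one
        auto next = std::async(std::launch::async, LoadChunk, std::ref(bufs[cur ^ 1]));
        ProcessChunk(bufs[cur]);
        if (!next.get()) break;                 // nothing more was loaded
        cur ^= 1;                               // swap buffers and continue
    }
}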
Further edit: In fact your best optimisation would probably come from loading things into a set of buckets that eliminates the "id[i]" part of the calculation. You could then simply offload to 3 threads where each uses SSE adds. This way you could have them all going simultaneously and, provided you have at least a triple-core machine, process the whole data set in a tenth of the time. Organising data for optimal processing will always allow for the best optimisation, IMO.
Depending on your target machine and compiler, see if you have the _mm_prefetch intrinsic and give it a shot. Back in the Pentium D days, pre-fetching data using the asm instruction for that intrinsic was a real speed win as long as you were pre-fetching a few loop iterations before you needed the data.
See here (Page 95 in the PDF) for more info from Intel.
This computation is trivially parallelizable; just add
#pragma omp parallel for reduction(+:size,sum) schedule(static)
immediately above the loop if you have OpenMP support (-fopenmp in GCC). However, I would not expect much speedup on a typical multicore desktop machine; you're doing so little computation per item fetched that you're almost certainly going to be constrained by memory bandwidth.
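For reference, a self-contained sketch of the same idea (a hypothetical wrapper function; the array-section reduction syntax used here requires OpenMP 4.5 or later, so on older compilers you would merge per-thread copies by hand instead):

#include <omp.h>

void accumulate_buckets(const int *id, const float *value, int n,
                        int size[10], float sum[10])
{
    // each thread gets private copies of size[] and sum[]; OpenMP adds them together at the end
    #pragma omp parallel for reduction(+:size[:10],sum[:10]) schedule(static)
    for (int i = 0; i < n; ++i)
    {
        ++size[id[i]];
        sum[id[i]] += value[i];
    }
}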
If you need to perform the summation several times for a given id mapping (i.e. the value[] array changes more often than id[]), you can halve your memory bandwidth requirements by pre-sorting the value[] elements into id order and eliminating the per-element fetch from id[]:
float tmp;
for (int i = 0, j = 0, k = 0; j < 10; sum[j] += tmp, j++)
    for (k += size[j], tmp = 0; i < k; i++)
        tmp += value[i];
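With the data laid out this way, each bucket's values are contiguous, so the per-bucket inner sum is exactly the pattern SSE handles well, which addresses the original xmm-register concern. A hedged sketch of that inner sum (hypothetical helper name; the tail that doesn't fill a full register is handled with scalar adds):

#include <xmmintrin.h>

float sum_range(const float *v, int begin, int end)
{
    __m128 acc = _mm_setzero_ps();
    int i = begin;
    for (; i + 4 <= end; i += 4)            // 4 floats per iteration
        acc = _mm_add_ps(acc, _mm_loadu_ps(v + i));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float s = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < end; ++i)                    // leftover tail elements
        s += v[i];
    return s;
}

Each call handles one bucket, using the same running offsets from size[] that the loop above tracks in k.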