Precalculate data vs sequential processing - c++

I have the following sequential code:
1.
ProcessImage() {
    for_each_line {
        for_each_pixel_of_line() {
            A = ComputeA();
            B = ComputeB();
            DoBiggerWork();
        }
    }
}
Now I have changed it to precalculate all the A and B values for the whole image, as below.
2.
ProcessImage() {
    for_each_line {
        A = ComputeAinLine();
        B = ComputeBinLine();
        for_each_pixel_of_line() {
            Ai = A[i];
            Bi = B[i];
            DoBiggerWork();
        }
    }
}
The result shows that the 2nd block of code executes about 10% slower than the 1st block.
I wonder: is this a cache-miss issue in the 2nd block of code?
I am planning to use SIMD to parallelize the precalculation in the 2nd block. Is it worth trying?

It all depends on how you implemented your functions. Profile your code and determine where the bottlenecks are.
If there is no benefit to calculating the values once per row, don't do it: you only need the A and B values for one pixel routine at a time. In the second block of code you traverse the line once to calculate the values, then traverse it again for DoBiggerWork(), each time fetching the values from the prepared arrays. Those extra passes and memory accesses cost CPU time.
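
If you do keep the precalculated variant, here is a minimal sketch of a cache-friendlier layout (the signatures, buffer types, and kernels below are assumptions, since the original code is pseudocode):

#include <vector>

// Hypothetical stand-ins for the real per-pixel kernels.
inline float ComputeA(int x) { return x * 0.5f; }
inline float ComputeB(int x) { return x * 0.25f; }

void ProcessImage(int width, int height) {
    // Allocate the per-line buffers once, outside the line loop, so
    // every line reuses the same hot cache lines instead of touching
    // fresh memory.
    std::vector<float> A(width), B(width);

    for (int y = 0; y < height; ++y) {
        // Tight, branch-free loops like these are what an
        // auto-vectorizer (or hand-written SIMD) handles best.
        for (int x = 0; x < width; ++x) A[x] = ComputeA(x);
        for (int x = 0; x < width; ++x) B[x] = ComputeB(x);

        for (int x = 0; x < width; ++x) {
            // DoBiggerWork(A[x], B[x]); // the real per-pixel work goes here
        }
    }
}

Whether SIMD is worth trying depends on whether ComputeA/ComputeB show up in the profile at all: if DoBiggerWork() dominates, vectorizing the precalculation will barely move the total.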

Related

How can one improve performance in a neural network?

I am implementing a basic NeuroEvolution program. The data set I am attempting is the "Cover Type" data set. This set has 15,120 records with 56 inputs (numerical data on patches of forest land) and 1 output (the cover type). As recommended, I am using 150 hidden neurons. The fitness function iterates through all 15,120 records to calculate an error, and uses that error to calculate the fitness. The method is shown below.
double getFitness() {
    double error = 0.0;
    // for every record in the dataset.
    for (int record = 0; record < targets.size(); record++) {
        // evaluate output vector for this record of inputs.
        vector<double> outputs = eval(inputs[record]);
        // for every value in those records.
        for (int value = 0; value < outputs.size(); value++) {
            // add to error the difference.
            error += abs(outputs[value] - targets[record][value]);
        }
    }
    return 1 - (error / targets.size());
}
"inputs" and "targets" are 2-D vectors read-in from the CSV file. The entire program uses ~40 MB of memory at run-time. That is not a problem. The program mutates off of a parent network, both are evaluated for fitness, and the most fit is kept to be mutated-upon. In the entire process, the getFitness() function is taking the most time. The program is written in Visual Studio 2017 (on a 2.6GHz i7) in Windows 10.
It took ~7 minutes to evaluate ONE network's fitness, using 21% of CPU. Smaller problems have required hundreds of thousands.
What methods are available to get that number down?
The program, which (apparently) makes significant use of vectors, cannot be directly offloaded to a GPU using OpenCL or CUDA without significant modifications. That makes OpenMP the most viable option: as few as 2 lines of code need to be added to "go parallel." In addition, Visual Studio must be set to use OpenMP (Project -> Properties -> C/C++ -> Language -> OpenMP Support).
#include <omp.h>
#include <numeric>
// libraries, variables, functions, etc...

double getFitness() {
    double err = 0.0;
    vector<double> error;
    error.resize(inputs.size());
    // for every record in the dataset.
    #pragma omp parallel for
    for (int record = 0; record < (int)targets.size(); record++) {
        // evaluate output vector for this record of inputs.
        vector<double> outputs = eval(inputs[record]);
        // for every value in those records: each thread writes only to
        // its own slot of `error`, so no synchronization is needed.
        for (int value = 0; value < (int)outputs.size(); value++) {
            // accumulate the per-record error (+=, so every value of the
            // record contributes, not just the last one).
            error[record] += abs(outputs[value] - targets[record][value]);
        }
    }
    // 0.0 (not 0), so the sum is accumulated as double rather than int.
    err = accumulate(error.begin(), error.end(), 0.0);
    return 1 - (err / targets.size());
}
In the above code, OpenMP splits the record loop across a team of threads (typically one per logical core), not one thread per record. Unless you run this on a cloud instance, it will eat your CPU. Expect a ~4x improvement on a quad-core CPU.
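
A variant worth trying (a sketch of an alternative, not part of the original answer): OpenMP's reduction clause gives each thread a private partial sum that the runtime combines at the end, so the intermediate error vector disappears entirely. This reuses the question's inputs, targets, and eval():

#include <omp.h>
#include <cmath>
// libraries, variables, functions, etc...

double getFitness() {
    double err = 0.0;
    // Each thread accumulates into a private copy of err; OpenMP adds
    // the copies together when the parallel region ends.
    #pragma omp parallel for reduction(+:err)
    for (int record = 0; record < (int)targets.size(); record++) {
        vector<double> outputs = eval(inputs[record]);
        for (int value = 0; value < (int)outputs.size(); value++)
            err += std::abs(outputs[value] - targets[record][value]);
    }
    return 1 - (err / targets.size());
}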

How to run a graph algorithm concurrently in Java using multi-core parallelism

I want to run an algorithm on large graphs concurrently, using multi-core parallelism. I have been working on it for a while, but haven't been able to come up with a good solution.
This is the naive algorithm:
W - a very large number
double weight = 0
while (weight < W)
    v : get_random_node_from(Graph)
    weight += calculate(v)
I looked into fork/join, but can't figure out a way to divide this problem into smaller subproblems.
Then I tried using Java 8 streams, for which I need to create a lambda expression. When I tried something like this:

double weight = 0;
Callable<Object> task = () -> {
    // cannot update weight here, as it needs to be (effectively) final
};
My question is: is it possible to update a variable like weight inside a lambda expression? Or is there a better way to solve this problem?
The closest I have got is with ExecutorService, but I ran into synchronization problems.
------------EDIT--------------
Here is the detailed algorithm:
In a nutshell, what I am trying to do is traverse a massive graph, perform an operation on randomly selected nodes (as long as weight < W), and update a global structure, Index.
This is taking too long as it doesn't utilize the full power of the CPU.
Ideally, all threads/processes on multiple cores would perform the operations on the randomly selected nodes, and update the shared weight and Index.
Note: It doesn't matter if different threads pick up the same node, as it's random without replacement.
Algorithm:
function Serial() {
    List<List<Integer>> I   // shared data structure which I want to update
    double weight

    //// Task which I want to parallelize
    while (weight < W) {
        v : get_random_node_from(Graph)
        bfs(v, affected_nodes)   // this fills up affected_nodes from v
        foreach (affected_node in affected_nodes) {
            // update I related to affected_node
            // and do other computation
        }
        weight += affected_nodes.size()
    }
    ///////// Parallelization ends here

    use_index(I)   // I is now passed to some other method (not important) to get further results
}
The important thing is, all threads update the same I and weight.
Thanks.
Well, you could wrap that weight into an array of a single element; it's a known trick for this kind of thing, even done internally by Java. Like this:

weight[0] = weight[0] + calculate(v);

But there are problems with this, since you are going to run it in parallel: you will not get the result you want, because updates to weight[0] are not thread-safe. You could add some sort of synchronization, but Java already has a great solution for that: DoubleAdder, which scales far better in contended environments (and across multiple CPUs).
A trivial, self-contained example:

import java.util.concurrent.atomic.DoubleAdder;
import java.util.stream.Stream;

public class WeightExample {

    static DoubleAdder weight = new DoubleAdder();

    private static int calculate(int v) {
        return v + 1;
    }

    public static void main(String[] args) {
        Stream.of(1, 2, 3, 4, 5, 6, 7, 8, 9)
              .parallel()
              .forEach(x -> {
                  int y = calculate(x);
                  weight.add(y);
              });
        System.out.println(weight); // 54.0
    }
}
Then there is the problem of the randomizer you are going to choose for get_random_node_from(Graph): you need a random node indeed, but at the same time you need to get each of them exactly once.
You might not need that, though, if you can, say, flatten all the nodes into a single List.
The problem here is that graphs are usually traversed recursively, so you don't know their exact size up front:
while (parent.hasChildren) {
    // traverse children and so on...
}
This parallelizes badly under streams; see Spliterators#spliteratorUnknownSize. The batch size grows arithmetically from 1024. That is why I suggest flattening the nodes into a single List with a known size: it will parallelize much better.

c++ stack efficient for multicore application

I am trying to code a multicore Markov chain in C++, and while I am trying to take advantage of the many CPUs (up to 24) to run a different chain on each one, I have a problem picking the right container to gather the results of the numerical evaluations on each CPU. What I am trying to measure is basically the average value of an array of boolean variables. I have tried coding a wrapper around a std::vector object, which looks like this:
#include <vector>
#include <fstream>
using namespace std;

struct densityStack {
    vector<int> density; // will store the sum of the boolean variables
    int card;            // will store the number of elements summed over, for normalizing at the end

    densityStack(int size) { // constructor taking as its only parameter the size of the array, usually size = 30
        density = vector<int>(size, 0);
        card = 0;
    }

    void push_back(vector<int> & toBeAdded) { // method summing a new array (of measurements) onto our stack
        for (auto valStack = density.begin(), newVal = toBeAdded.begin(); valStack != density.end(); ++valStack, ++newVal)
            *valStack += *newVal;
        card++;
    }

    void savef(const char * fname) { // method writing the result to a file
        ofstream out(fname);
        out.precision(10);
        out << card << "\n"; // saving the cardinal on the first line
        for (auto val = density.begin(); val != density.end(); ++val)
            out << (double) *val / card << "\n";
        out.close();
    }
};
Then, in my code, I use a single densityStack object, and every time a CPU core has data (which can be 100 times per second) it calls push_back to send the data back to the densityStack.
My issue is that this seems to be slower than the first, raw approach, where each core stored each array of measurements in a file and I then used a Python script to average and clean up the results (I was unhappy with that approach because it stored too much information and put too much useless stress on the hard drives).
Do you see where I could be losing a lot of performance? Is there an obvious source of overhead? Copying back the vector, even at frequencies of 1000 Hz, should not be too much.
How are you synchronizing your shared densityStack instance?
From the limited info here my guess is that the CPUs are blocked waiting to write data every time they have a tiny chunk of data. If that is the issue, a simple technique to improve performance would be to reduce the number of writes. Keep a buffer of data for each CPU and write to the densityStack less frequently.
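
A minimal sketch of that buffering idea (the flush threshold and the mutex-guarded merge are assumptions, not code from the question):

#include <mutex>
#include <vector>

// Hypothetical per-thread recorder: it batches measurements locally and
// merges them into the shared densityStack under a mutex, far less often
// than once per measurement.
struct BufferedRecorder {
    densityStack &shared;                   // the single shared stack
    std::mutex   &lock;                     // guards push_back on the shared stack
    std::vector<std::vector<int>> buffer;   // thread-local batch
    size_t flushThreshold;

    BufferedRecorder(densityStack &s, std::mutex &m, size_t threshold = 1000)
        : shared(s), lock(m), flushThreshold(threshold) {}

    void record(std::vector<int> &measurement) {
        buffer.push_back(measurement);      // no locking on the hot path
        if (buffer.size() >= flushThreshold)
            flush();
    }

    void flush() {                          // call once more when the chain finishes
        std::lock_guard<std::mutex> guard(lock); // one lock per ~1000 samples
        for (auto &m : buffer)
            shared.push_back(m);
        buffer.clear();
    }
};

Each worker thread owns one BufferedRecorder; only flush() ever touches the shared object, so lock contention drops by roughly the flush threshold.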

Multithread performance drops down after a few operations

I encountered this weird bug in a C++ multithreaded program on Linux. The multithreaded part basically executes a loop. Each iteration first loads a SIFT file containing some features and then queries these features against a tree. Since I have a lot of images, I used multiple threads to do this querying. Here are the code snippets.
struct MultiMatchParam
{
    int thread_id;
    float *scores;
    double *scores_d;
    int *perm;
    size_t db_image_num;
    std::vector<std::string> *query_filenames;
    int start_id;
    int num_query;
    int dim;
    VocabTree *tree;
    FILE *file;
};

// multi-thread will do normalization anyway
void MultiMatch(MultiMatchParam &param)
{
    for (size_t t = param.start_id; t < param.start_id + param.num_query; t++)
    {
        // Clear scores
        for (size_t i = 0; i < param.db_image_num; i++)
            param.scores[i] = 0.0;

        DTYPE *keys;
        int num_keys;
        keys = ReadKeys_sfm((*param.query_filenames)[t].c_str(), param.dim, num_keys);

        int normalize = true;
        double mag = param.tree->MultiScoreQueryKeys(num_keys, normalize, keys, param.scores);
        delete [] keys;
    }
}
I run this on an 8-core CPU. At first it runs perfectly and the CPU usage is nearly 100% on all 8 cores. After each thread has queried several images (about 20), the performance (CPU usage) suddenly drops drastically, down to about 30% across all eight cores.
I suspect the key to this bug lies in this line of code:

double mag = param.tree->MultiScoreQueryKeys(num_keys, normalize, keys, param.scores);

If I replace it with another costly operation (e.g., a large for-loop containing sqrt), the CPU usage stays at nearly 100%. This MultiScoreQueryKeys function performs a complex operation on a tree. Since all eight cores may read the same tree (there are no write operations on it), I wonder whether the read operation has some kind of blocking effect. But it should not have such an effect, because there are no write operations in this function. Also, the operations in the loop are basically the same; if they were going to depress CPU usage, it would happen in the first few iterations. If you need to see the details of this function or other parts of this project, please let me know.
Use std::async() instead of a zeta::SimpleLock lock.
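
A minimal sketch of what that might look like (assuming the work is already split into per-thread ranges via MultiMatchParam; this is an illustration, not the project's actual code):

#include <future>
#include <vector>

// Launch one asynchronous task per range of query images instead of
// managing threads and locks by hand; each task owns its own param
// block, so no shared state needs locking.
void MultiMatchAll(std::vector<MultiMatchParam> &params)
{
    std::vector<std::future<void>> tasks;
    for (auto &param : params)
        tasks.push_back(std::async(std::launch::async,
                                   [&param] { MultiMatch(param); }));

    for (auto &task : tasks)
        task.get(); // wait for every range to finish
}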

Cache Poisoning Issue for deep nested loop

I am writing code for a mathematical method (incomplete Cholesky) and I have hit a curious roadblock. Please see the following simplified code.
for (k = 0; k < nosUnknowns; k++)
{
    // Pieces of code
    for (i = k+1; i < nosUnknowns; i++)
    {
        // more code
    }
    for (j = k+1; j < nosUnknowns; j++)
    {
        for (i = j; i < nosUnknowns; i++)
        {
            // Some more code
            if (xOk && yOk && zOk)
            {
                if (xDF == 1 && yDF == 0 && zDF == 0)
                {
                    for (row = 0; row < 3; row++)
                    {
                        for (col = 0; col < 3; col++)
                        {
                            // All 3x3 static arrays. This is the line:
                            statObj->A1_[row][col] -= localFuncArr[row][col];
                        }
                    }
                }
            }
        } // inner loop i ends here
    } // inner loop j ends here
} // outer loop k ends here
For context,
statObj is an object containing a number of 3x3 static double arrays. I initialize statObj with a call to new, then populate the arrays inside it using some mathematical functions. One such array is A1_. The value of the variable nosUnknowns is around 3000. The array localFuncArr is a double array generated earlier by matrix multiplication.
Now this is my problem:
1. When I use the line as shown in the code, the code runs extremely sluggishly: something like 245 seconds for the whole function.
2. When I comment out the said line, the code runs extremely fast: something like 6 seconds.
3. When I replace the said line with localFuncArr[row][col] += 3.0, the code again runs at the same speed as in case (2).
Clearly, something about the access to statObj->A1_ is making the code run slowly.
My question(s):
Is cache poisoning the reason this is happening?
If so, what could be changed, in terms of array initialization, object initialization, loop unrolling, or any other form of code optimization, to speed this up?
Any insights from experienced folks are highly appreciated.
EDIT: Changed the description to be more verbose and redress some of the points mentioned in the comments.
If the conditions are mostly true, your line of code is executed up to 3000 x 3000 x 3000 x 3 x 3 times, which is about 245 billion times. Depending on your hardware architecture, 245 seconds might be a very reasonable timing: that is one iteration every 2 cycles, assuming a 2 GHz processor. In any case, there isn't anything in the code that suggests cache poisoning.
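
For completeness, the triangular j/i bounds in the posted loops make the exact count somewhat smaller (a back-of-the-envelope estimate, not part of the original answer):

$$\sum_{k=0}^{n-1} \; \sum_{j=k+1}^{n-1} \; \sum_{i=j}^{n-1} 9 \;\approx\; \frac{9n^3}{6} \;\approx\; 4 \times 10^{10} \quad \text{for } n = 3000.$$

That is still tens of billions of executions of the inner statement, so the conclusion stands: the timing is dominated by sheer iteration count, not by cache effects.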