I am trying to implement a multi-threaded PageRank algorithm using std::atomic and compare_exchange.
I am definitely doing something wrong, because the program never terminates. The page ranks are stored in pr_next and pr_curr.
I was able to make it work using an array of mutexes, one per vertex of the graph, locking around pr_next[v] += pr_curr[u]/out_degree (commented out in the first for-loop below).
To allow for atomics, I made pr_next an array of std::atomic<PageRankType> and am using compare_exchange_weak to write the new value.
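For context, the accumulation I am trying to express is, in isolation, the usual CAS loop for adding to an atomic float. A standalone sketch of what I believe it should do (not the project code):

#include <atomic>

// Add `delta` to an atomic float using a CAS loop.
void atomic_add(std::atomic<float>& target, float delta) {
    float expected = target.load();
    // On failure, compare_exchange_weak reloads `expected` with the current
    // value, so the loop retries with a freshly computed desired result.
    while (!target.compare_exchange_weak(expected, expected + delta)) {
    }
}

The full program follows: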
#include "core/graph.h"
#include "core/utils.h"
#include <iomanip>
#include <iostream>
#include <stdlib.h>
#include <thread>
#define INIT_PAGE_RANK 1.0
#define EPSILON 0.01
#define DAMPING 0.85
#define PAGE_RANK(x) (1 - DAMPING + DAMPING * x)
typedef float PageRankType;
//std::vector<std::mutex> mutexes;
void threadFunction(Graph& g, int max_iters, uintV start, uintV end, PageRankType* pr_curr, std::atomic<PageRankType>* pr_next, CustomBarrier& bar){
for (int iter = 0; iter < max_iters; iter++) {
// for each vertex 'u', process all its outNeighbors 'v'
for (uintV u = start; u < end; u++) {
uintE out_degree = g.vertices_[u].getOutDegree();
for (uintE i = 0; i < out_degree; i++) {
uintV v = g.vertices_[u].getOutNeighbor(i);
//mutex version
// mutexes[v].lock();
//pr_next[v] += (pr_curr[u] / out_degree);
//mutexes[v].unlock();
PageRankType expected = std::atomic_load(&pr_next[v]);
while(!std::atomic_compare_exchange_weak(&pr_next[v], &expected, expected + (pr_curr[u] / out_degree))){
}
}
bar.wait();
for (uintV v = start; v < end; v++) {
pr_next[v] = PAGE_RANK(pr_next[v]);
// reset pr_curr for the next iteration
pr_curr[v] = pr_next[v];
pr_next[v] = 0.0;
}
bar.wait();
}
rs.time = t1.stop();
}
}
void pageRankParallel(Graph &g, int max_iters, uint n_workers) {
uintV n = g.n_;
PageRankType *pr_curr = new PageRankType[n];
std::atomic<PageRankType> *pr_next = new std::atomic<PageRankType>[n];
for (uintV i = 0; i < n; i++) {
pr_curr[i] = INIT_PAGE_RANK;
pr_next[i] = 0.0;
}
//mutexes = std::vector<std::mutex>(n);
// Push based pagerank
std::thread threads[n_workers];
CustomBarrier bar(n_workers);
uintV part = n/n_workers;
uintV start = 0;
uintV end = start + part;
for(uint i = 0; i < n_workers; i++){
if( i == n_workers-1)
end = n;
threads[i] = std::thread(threadFunction, std::ref(g), max_iters, start, end, pr_curr, pr_next, std::ref(bar));
start += part;
end += part;
}
for(uint i = 0; i < n_workers; i++) {
threads[i].join();
}
}
int main(int argc, char *argv[]) {
uint n_workers = 4;
int max_iterations = 5;
std::string input_file_path = argv[1];
Graph g;
std::cout << "Reading graph\n";
g.readGraphFromBinary<int>(input_file_path);
std::cout << "Created graph\n";
pageRankParallel(g, max_iterations, n_workers);
return 0;
}
The only changes I made from my working mutex version are changing PageRankType* pr_next to std::atomic<PageRankType>* pr_next and these two lines:
PageRankType expected = std::atomic_load(&pr_next[v]);
while(!std::atomic_compare_exchange_weak(&pr_next[v], &expected, expected + (pr_curr[u] / out_degree))){
}
It compiles, but when I run it, it just runs forever. If I add print statements before and after the compare_exchange loop, it keeps printing "before loop" and "after loop" forever.
What am I doing wrong here?
This is my first time using multi-threading to speed up a heavy calculation.
Background: the idea is to calculate a kernel covariance matrix by reading a list of 3D points x_test and computing the corresponding matrix, which has dimensions x_test.size() x x_test.size().
I already sped up the calculation by only computing the lower triangular matrix. Since all the calculations are independent of each other, I tried to speed up the process (x_test.size() = 27000 in my case) by splitting the calculation of the matrix entries row-wise, assigning a range of rows to each thread.
On a single core the calculation took about 280 seconds each time; on 4 cores it took 270-290 seconds.
main.cpp
int main(int argc, char *argv[]) {
double sigma0sq = 1;
double lengthScale [] = {0.7633, 0.6937, 3.3307e+07};
const std::vector<std::vector<double>> x_test = parse2DCsvFile(inputPath);
/* Finding data slices of similar size */
//This piece of code works, each thread is assigned roughly the same number of matrix entries
int numElements = x_test.size()*x_test.size()/2;
const int numThreads = 4;
int elemsPerThread = numElements / numThreads;
std::vector<int> indices;
int j = 0;
for(std::size_t i=1; i<x_test.size()+1; ++i){
int prod = i*(i+1)/2 - j*(j+1)/2;
if (prod > elemsPerThread) {
i--;
j = i;
indices.push_back(i);
if(indices.size() == numThreads-1)
break;
}
}
indices.insert(indices.begin(), 0);
indices.push_back(x_test.size());
/* Spreading calculations to multiple threads */
std::vector<std::thread> threads;
for(std::size_t i = 1; i < indices.size(); ++i){
threads.push_back(std::thread(calculateKMatrixCpp, x_test, lengthScale, sigma0sq, i, indices.at(i-1), indices.at(i)));
}
for(auto & th: threads){
th.join();
}
return 0;
}
As you can see, each thread performs the following calculations on the data assigned to it:
void calculateKMatrixCpp(const std::vector<std::vector<double>> xtest, double lengthScale[], double sigma0sq, int threadCounter, int start, int stop){
char buffer[8192];
std::ofstream out("lower_half_matrix_" + std::to_string(threadCounter) +".csv");
out.rdbuf()->pubsetbuf(buffer, 8192);
for(int i = start; i < stop; ++i){
for(int j = 0; j < i+1; ++j){
double kij = seKernel(xtest.at(i), xtest.at(j), lengthScale, sigma0sq);
if (j!=0)
out << ',';
out << kij;
}
if(i!=xtest.size()-1 )
out << '\n';
}
out.close();
}
and
double seKernel(const std::vector<double> x1,const std::vector<double> x2, double lengthScale[], double sigma0sq) {
double sum(0);
for(std::size_t i=0; i<x1.size();i++){
sum += pow((x1.at(i)-x2.at(i))/lengthScale[i],2);
}
return sigma0sq*exp(-0.5*sum);
}
Aspects I considered
Locking due to simultaneous access to the data vector -> I don't pass a reference to the threads, but a copy of the data. I know this is not optimal in terms of RAM usage, but as far as I know it should prevent simultaneous data access, since every thread has its own copy.
Output -> every thread writes its part of the lower triangular matrix to its own file. My task manager doesn't indicate full SSD utilization at all.
Compiler and machine
Windows 11
GNU GCC Compiler
Code::Blocks (although I don't think that should be of importance)
There are many details that can be improved in your code, but I think the two biggest issues are:
using vectors of vectors, which leads to fragmented data;
writing each piece of data to file as soon as its value is computed.
The first point is easy to fix: use something like std::vector<std::array<double, 3>>. In the code below I use an alias to make it more readable:
using Point3D = std::array<double, 3>;
std::vector<Point3D> x_test;
The second point is slightly harder to address. I assume you wanted to write to the disk inside each thread because you couldn't manage to write to a shared buffer that you could then write to a file.
Here is a way to do exactly that:
void calculateKMatrixCpp(
std::vector<Point3D> const& xtest, Point3D const& lengthScale, double sigma0sq,
int threadCounter, int start, int stop, std::vector<double>& kMatrix
) {
// ...
double& kij = kMatrix[i * xtest.size() + j];
kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
// ...
}
// ...
threads.push_back(std::thread(
calculateKMatrixCpp, x_test, lengthScale, sigma0sq,
i, indices[i-1], indices[i], std::ref(kMatrix)
));
Here, kMatrix is the shared buffer and represents the whole matrix you are trying to compute. You need to pass it to the thread via std::ref. Each thread will write to a different location in that buffer, so there is no need for any mutex or other synchronization.
Once you make these changes and try to write kMatrix to the disk, you will realize that this is the part that takes the most time, by far.
Below is the full code I tried on my machine, and the computation time was about 2 seconds whereas the writing-to-file part took 300 seconds! No amount of multithreading can speed that up.
If you truly want to write all that data to the disk, you may have some luck with file mapping. Computing the exact size needed should be easy enough if all values have the same number of digits, and it looks like you could write the values with multithreading. I have never done anything like that, so I can't really say much more about it, but it looks to me like the fastest way to write multiple gigabytes of memory to the disk.
#include <vector>
#include <thread>
#include <iostream>
#include <string>
#include <cmath>
#include <array>
#include <random>
#include <fstream>
#include <chrono>
using Point3D = std::array<double, 3>;
auto generateSampleData() -> std::vector<Point3D> {
static std::minstd_rand g(std::random_device{}());
std::uniform_real_distribution<> d(-1.0, 1.0);
std::vector<Point3D> data;
data.reserve(27000);
for (auto i = 0; i < 27000; ++i) {
data.push_back({ d(g), d(g), d(g) });
}
return data;
}
double seKernel(Point3D const& x1, Point3D const& x2, Point3D const& lengthScale, double sigma0sq) {
double sum = 0.0;
for (auto i = 0u; i < 3u; ++i) {
double distance = (x1[i] - x2[i]) / lengthScale[i];
sum += distance*distance;
}
return sigma0sq * std::exp(-0.5*sum);
}
void calculateKMatrixCpp(std::vector<Point3D> const& xtest, Point3D const& lengthScale, double sigma0sq, int threadCounter, int start, int stop, std::vector<double>& kMatrix) {
std::cout << "start of thread " << threadCounter << "\n" << std::flush;
for(int i = start; i < stop; ++i) {
for(int j = 0; j < i+1; ++j) {
double& kij = kMatrix[i * xtest.size() + j];
kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
}
}
std::cout << "end of thread " << threadCounter << "\n" << std::flush;
}
int main() {
double sigma0sq = 1;
Point3D lengthScale = {0.7633, 0.6937, 3.3307e+07};
const std::vector<Point3D> x_test = generateSampleData();
/* Finding data slices of similar size */
//This piece of code works, each thread is assigned roughly the same number of matrix entries
int numElements = x_test.size()*x_test.size()/2;
const int numThreads = 4;
int elemsPerThread = numElements / numThreads;
std::vector<int> indices;
int j = 0;
for(std::size_t i = 1; i < x_test.size()+1; ++i){
int prod = i*(i+1)/2 - j*(j+1)/2;
if (prod > elemsPerThread) {
i--;
j = i;
indices.push_back(i);
if(indices.size() == numThreads-1)
break;
}
}
indices.insert(indices.begin(), 0);
indices.push_back(x_test.size());
auto start = std::chrono::system_clock::now();
std::vector<double> kMatrix(x_test.size() * x_test.size(), 0.0);
std::vector<std::thread> threads;
for (std::size_t i = 1; i < indices.size(); ++i) {
threads.push_back(std::thread(calculateKMatrixCpp, x_test, lengthScale, sigma0sq, i, indices[i - 1], indices[i], std::ref(kMatrix)));
}
for (auto& t : threads) {
t.join();
}
auto end = std::chrono::system_clock::now();
auto elapsed_seconds = std::chrono::duration<double>(end - start).count();
std::cout << "computation time: " << elapsed_seconds << "s" << std::endl;
start = std::chrono::system_clock::now();
constexpr int buffer_size = 131072;
char buffer[buffer_size];
std::ofstream out("matrix.csv");
out.rdbuf()->pubsetbuf(buffer, buffer_size);
for (int i = 0; i < x_test.size(); ++i) {
for (int j = 0; j < i + 1; ++j) {
if (j != 0) {
out << ',';
}
out << kMatrix[i * x_test.size() + j];
}
if (i != x_test.size() - 1) {
out << '\n';
}
}
end = std::chrono::system_clock::now();
elapsed_seconds = std::chrono::duration<double>(end - start).count();
std::cout << "writing time: " << elapsed_seconds << "s" << std::endl;
}
Okay, I've written an implementation with optimized formatting.
Using @Nelfeal's code, the run took around 250 seconds on my system, with the write time taking the most by far; or rather, the std::ofstream formatting took most of the time.
I've written a C++20 version using std::format_to/std::format. It is multi-threaded and takes around 25-40 seconds to complete all the computations, formatting, and writing. Run on a single thread, it takes around 70 seconds on my system. The same performance should be achievable with the {fmt} library on C++11/14/17.
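For reference, the {fmt} equivalent of the hot formatting call should look roughly like this (a sketch; I only tested the std::format version shown below):

#include <fmt/format.h>
#include <iterator>
#include <vector>

// Sketch: append one formatted value to the in-memory buffer using {fmt}
// instead of std::format_to, for pre-C++20 toolchains.
void appendValue(std::vector<char>& buffer, double kij) {
    fmt::format_to(std::back_inserter(buffer), "{:.6g}, ", kij);
}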
Here is the code:
import <vector>;
import <thread>;
import <iostream>;
import <string>;
import <cmath>;
import <array>;
import <random>;
import <fstream>;
import <chrono>;
import <format>;
import <filesystem>;
using Point3D = std::array<double, 3>;
auto generateSampleData(Point3D scale) -> std::vector<Point3D>
{
static std::minstd_rand g(std::random_device{}());
std::uniform_real_distribution<> d(-1.0, 1.0);
std::vector<Point3D> data;
data.reserve(27000);
for (auto i = 0; i < 27000; ++i)
{
data.push_back({ d(g)* scale[0], d(g)* scale[1], d(g)* scale[2] });
}
return data;
}
double seKernel(Point3D const& x1, Point3D const& x2, Point3D const& lengthScale, double sigma0sq) {
double sum = 0.0;
for (auto i = 0u; i < 3u; ++i) {
double distance = (x1[i] - x2[i]) / lengthScale[i];
sum += distance * distance;
}
return sigma0sq * std::exp(-0.5 * sum);
}
void calculateKMatrixCpp(std::vector<Point3D> const& xtest, Point3D lengthScale, double sigma0sq, int threadCounter, int start, int stop, std::filesystem::path localPath)
{
using namespace std::string_view_literals;
std::vector<char> buffer;
buffer.reserve(15'000);
std::ofstream out(localPath);
std::cout << std::format("starting thread {}: from {} to {}\n"sv, threadCounter, start, stop);
for (int i = start; i < stop; ++i)
{
for (int j = 0; j < i; ++j)
{
double kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
std::format_to(std::back_inserter(buffer), "{:.6g}, "sv, kij);
}
double kii = seKernel(xtest[i], xtest[i], lengthScale, sigma0sq);
std::format_to(std::back_inserter(buffer), "{:.6g}\n"sv, kii);
out.write(buffer.data(), buffer.size());
buffer.clear();
}
}
int main() {
double sigma0sq = 1;
Point3D lengthScale = { 0.7633, 0.6937, 3.3307e+07 };
const std::vector<Point3D> x_test = generateSampleData(lengthScale);
/* Finding data slices of similar size */
//This piece of code works, each thread is assigned roughly the same number of matrix entries
int numElements = x_test.size() * (x_test.size()+1) / 2;
const int numThreads = 3;
int elemsPerThread = numElements / numThreads;
std::vector<int> indices;
int j = 0;
for (std::size_t i = 1; i < x_test.size() + 1; ++i) {
int prod = i * (i + 1) / 2 - j * (j + 1) / 2;
if (prod > elemsPerThread) {
i--;
j = i;
indices.push_back(i);
if (indices.size() == numThreads - 1)
break;
}
}
indices.insert(indices.begin(), 0);
indices.push_back(x_test.size());
auto start = std::chrono::system_clock::now();
std::vector<std::thread> threads;
using namespace std::string_view_literals;
for (std::size_t i = 1; i < indices.size(); ++i)
{
threads.push_back(std::thread(calculateKMatrixCpp, std::ref(x_test), lengthScale, sigma0sq, i, indices[i - 1], indices[i], std::format("./matrix_{}.csv"sv, i-1)));
}
for (auto& t : threads)
{
t.join();
}
auto end = std::chrono::system_clock::now();
auto elapsed_seconds = std::chrono::duration<double>(end - start);
std::cout << std::format("total elapsed time: {}"sv, elapsed_seconds);
return 0;
}
Note: I used 6 digits of precision here since that is the default for std::ofstream. More digits mean more data written to disk and lower performance.
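For instance, requesting nine significant digits instead would only mean changing the format spec in the loop above, at the cost of a larger file and a longer write (a sketch of the modified line):

// Same call as in the inner loop, but with 9 significant digits.
std::format_to(std::back_inserter(buffer), "{:.9g}, "sv, kij);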
Please check out my code and the question below. Thanks.
Code:
#include <iostream>
#include <chrono>
using namespace std;
int bufferWriteIndex = 0;
float curSample = 0;
float damping[5] = { 1, 1, 1, 1, 1 };
float modeDampingTermsExp[5] = { 0.447604, 0.0497871, 0.00247875, 0.00012341, 1.37263e-05 };
float modeDampingTermsExp2[5] = { -0.803847, -3, -6, -9, -11.1962 };
int main(int argc, char** argv) {
float subt = 0;
int subWriteIndex = 0;
auto now = std::chrono::high_resolution_clock::now();
while (true) {
curSample = 0;
for (int i = 0; i < 5; i++) {
//Slow version
damping[i] = damping[i] * modeDampingTermsExp2[i];
//Fast version
//damping[i] = damping[i] * modeDampingTermsExp[i];
float cosT = 2 * damping[i];
for (int m = 0; m < 5; m++) {
curSample += cosT;
}
}
//t += tIncr;
bufferWriteIndex++;
//measure calculations per second
auto elapsed = std::chrono::high_resolution_clock::now() - now;
if ((elapsed / std::chrono::milliseconds(1)) > 1000) {
now = std::chrono::high_resolution_clock::now();
int idx = bufferWriteIndex;
cout << idx - subWriteIndex << endl;
subWriteIndex = idx;
}
}
}
As you can see, I'm measuring the number of calculations, i.e. increments of bufferWriteIndex, per second.
Question:
Why is performance faster when using modeDampingTermsExp than when using modeDampingTermsExp2?
Program output with modeDampingTermsExp:
12625671
12285846
12819392
11179072
12272587
11722863
12648955
versus the output with modeDampingTermsExp2:
1593620
1668170
1614495
1785965
1814576
1851797
1808568
1801945
It's roughly 7x faster. It seems like the numbers in those two arrays have an impact on the calculation time. Why?
I am using Visual Studio 2019 with the following flags: /O2 /Oi /Ot /fp:fast
This is because you are hitting denormal numbers (also see this question).
You can get rid of denormals like so:
#include <cmath>
// [...]
for (int i = 0; i < 5; i++) {
damping[i] = damping[i] * modeDampingTermsExp2[i];
if (std::fpclassify(damping[i]) == FP_SUBNORMAL) {
damping[i] = 0; // Treat denormals as 0.
}
float cosT = 2 * damping[i];
for (int m = 0; m < 5; m++) {
curSample += cosT;
}
}
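If treating denormals as zero is acceptable for the whole computation, another option on x86/SSE targets (a sketch, assuming MSVC or GCC with the SSE intrinsics headers available; not something the original code does) is to enable the flush-to-zero and denormals-are-zero modes once at startup:

#include <pmmintrin.h> // _MM_SET_DENORMALS_ZERO_MODE; pulls in xmmintrin.h for _MM_SET_FLUSH_ZERO_MODE

int main() {
    // Flush denormal results to zero and treat denormal inputs as zero
    // for this thread, so the slow microcode path is never taken.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    // ... rest of the benchmark loop unchanged ...
}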
I made a function that computes the inverse of a matrix, and then a multithreaded version of it, since I have to invert arrays larger than 2000 x 2000.
A 1000x1000 array unthreaded takes 2.5 seconds (on an i5-4460, 4 cores, 2.9 GHz),
and multithreaded it takes 7.25 seconds.
I placed the threads in the part where most of the time is spent. What is wrong?
Is it because vectors are used instead of 2-dimensional arrays?
This is the minimum code to test both versions:
#include<iostream>
#include <vector>
#include <stdlib.h>
#include <time.h>
#include <chrono>
#include <thread>
const int NUCLEOS = 8;
#ifdef __linux__
#include <unistd.h> //usleep()
typedef std::chrono::system_clock t_clock; //try to use high_resolution_clock on new linux x64 computer!
#else
typedef std::chrono::high_resolution_clock t_clock;
#pragma warning(disable:4996)
#endif
using namespace std;
std::chrono::time_point<t_clock> start_time, stop_time = start_time; char null_char = '\0';
void timer(char *title = 0, int data_size = 1) { stop_time = t_clock::now(); double us = (double)chrono::duration_cast<chrono::microseconds>(stop_time - start_time).count(); if (title) printf("%s time = %7lgms = %7lg MOPs\n", title, (double)us*1e-3, (double)data_size / us); start_time = t_clock::now(); }
//makes columns 0
void colum_zero(vector< vector<double> > &x, vector< vector<double> > &y, int pos0, int pos1,int dim, int ord);
//returns inverse of x, x is not modified, not threaded
vector< vector<double> > inverse(vector< vector<double> > x)
{
if (x.size() != x[0].size())
{
cout << "ERROR on inverse() not square array" << endl; getchar(); return{};//returns a null
}
size_t dim = x.size();
int i, j, ord;
vector< vector<double> > y(dim,vector<double>(dim,0));//initializes output = 0
//init_2Dvector(y, dim, dim);
//1. Unity array y:
for (i = 0; i < dim; i++)
{
y[i][i] = 1.0;
}
double diagon, coef;
double *ptrx, *ptry, *ptrx2, *ptry2;
for (ord = 0; ord<dim; ord++)
{
//2. Make the diagonal element of x equal to 1
int i2;
if (fabs(x[ord][ord])<1e-15) //If that element is 0, a row containing a non-zero entry is added
{
for (i2 = ord + 1; i2<dim; i2++)
{
if (fabs(x[i2][ord])>1e-15) break;
}
if (i2 >= dim)
return{};//error, returns null
for (i = 0; i<dim; i++)//add the row found above, which has no 0 in this column
{
x[ord][i] += x[i2][i];
y[ord][i] += y[i2][i];
}
}
diagon = 1.0/x[ord][ord];
ptry = &y[ord][0];
ptrx = &x[ord][0];
for (i = 0; i < dim; i++)
{
*ptry++ *= diagon;
*ptrx++ *= diagon;
}
//uses the same function but not threaded:
colum_zero(x,y,0,dim,dim,ord);
}//end ord
return y;
}
//threaded version
vector< vector<double> > inverse_th(vector< vector<double> > x)
{
if (x.size() != x[0].size())
{
cout << "ERROR on inverse() not square array" << endl; getchar(); return{};//returns a null
}
int dim = (int) x.size();
int i, ord;
vector< vector<double> > y(dim, vector<double>(dim, 0));//initializes output = 0
//init_2Dvector(y, dim, dim);
//1. Unity array y:
for (i = 0; i < dim; i++)
{
y[i][i] = 1.0;
}
std::thread tarea[NUCLEOS];
double diagon;
double *ptrx, *ptry;// , *ptrx2, *ptry2;
for (ord = 0; ord<dim; ord++)
{
//2. Make the diagonal element of x equal to 1
int i2;
if (fabs(x[ord][ord])<1e-15) //If the diagonal element is 0, add a row whose entry in this column is non-zero
{
for (i2 = ord + 1; i2<dim; i2++)
{
if (fabs(x[i2][ord])>1e-15) break;
}
if (i2 >= dim)
return{};//error, returns null
for (i = 0; i<dim; i++)//Look for a row with a non-zero entry in this column and add it, to avoid dividing by 0 later
{
x[ord][i] += x[i2][i];
y[ord][i] += y[i2][i];
}
}
diagon = 1.0 / x[ord][ord];
ptry = &y[ord][0];
ptrx = &x[ord][0];
for (i = 0; i < dim; i++)
{
*ptry++ *= diagon;
*ptrx++ *= diagon;
}
int pos0 = 0, N1 = dim;//initial array position
if ((N1<1) || (N1>5000))
{
cout << "It is detected out than 1-5000 simulations points=" << N1 << " ABORT or press enter to continue" << endl; getchar();
}
//cout << "Initiation of " << NUCLEOS << " threads" << endl;
for (int thread = 0; thread<NUCLEOS; thread++)
{
int pos1 = (int)((thread + 1)*N1 / NUCLEOS);//next position
tarea[thread] = std::thread(colum_zero, std::ref(x), std::ref(y), pos0, pos1, dim, ord);
pos0 = pos1;//next thread will work at next point
}
for (int thread = 0; thread<NUCLEOS; thread++)
{
tarea[thread].join();
//cout << "Thread num: " << thread << " end\n";
}
}//end ord
return y;
}
//makes columns 0
void colum_zero(vector< vector<double> > &x, vector< vector<double> > &y, int pos0, int pos1,int dim, int ord)
{
double coef;
double *ptrx, *ptry, *ptrx2, *ptry2;
//Zero out column 'ord' except for the diagonal element:
for (int i = pos0; i<pos1; i++)//Begin to end for every thread
{
if (i == ord) continue;
coef = x[i][ord];//element to make 0
if (fabs(coef)<1e-15) continue; //If already zero, it is avoided
ptry = &y[i][0];
ptry2 = &y[ord][0];
ptrx = &x[i][0];
ptrx2 = &x[ord][0];
for (int j = 0; j < dim; j++)
{
*ptry++ = *ptry - coef * (*ptry2++);//first matrix
*ptrx++ = *ptrx - coef * (*ptrx2++);//second matrix
}
}
}
void test_6_inverse(int dim)
{
vector< vector<double> > vec1(dim, vector<double>(dim));
for (int i=0;i<dim;i++)
for (int j = 0; j < dim; j++)
{
vec1[i][j] = (-1.0 + 2.0*rand() / RAND_MAX) * 10000;
}
vector< vector<double> > vec2,vec3;
double ini, end;
ini = (double)clock();
vec2 = inverse(vec1);
end = (double)clock();
cout << "=== Time inverse unthreaded=" << (end - ini) / CLOCKS_PER_SEC << endl;
ini=end;
vec3 = inverse_th(vec1);
end = (double)clock();
cout << "=== Time inverse threaded=" << (end - ini) / CLOCKS_PER_SEC << endl;
cout<<vec2[2][2]<<" "<<vec3[2][2]<<endl;//to force the software to actually compute the inverses
cout << endl;
}
int main()
{
test_6_inverse(1000);
cout << endl << "=== END ===" << endl; getchar();
return 1;
}
After looking deeper into the code of the colum_zero() function, I have seen that one thread rewrites data that is used by other threads, so the threads are not INDEPENDENT of each other. Fortunately the compiler detects this and avoids it.
Conclusions:
It is not recommended to use the Gauss-Jordan method on its own for multithreading.
If somebody finds that the multithreaded version is slower even though the work is split correctly across threads, it may be because one thread's results are used by another.
The main function inverse() works and can be used by other programmers, so this question should not be deleted.
Unanswered question:
What matrix inversion method could be spread across many independent threads, for example to be used on a GPU?
I'm moving outside my comfort zone and trying to make a random number distribution program, while also making sure it is still somewhat uniform.
Here is my code.
This is the RandomDistribution.h file:
#pragma once
#include <vector>
#include <random>
#include <iostream>
static float randy(float low, float high) {
static std::random_device rd;
static std::mt19937 random(rd());
std::uniform_real_distribution<float> ran(low, high);
return ran(random);
}
typedef std::vector<float> Vfloat;
class RandomDistribution
{
public:
RandomDistribution();
RandomDistribution(float percent, float contents, int container);
~RandomDistribution();
void setvariables(float percent, float contents, int container);
Vfloat RunDistribution();
private:
float divider;
float _percent;
int jar_limit;
float _contents;
float _maxdistribution;
Vfloat Jar;
bool is0;
};
This is my RandomDistribution.cpp:
#include "RandomDistribution.h"
RandomDistribution::RandomDistribution() {
}
RandomDistribution::RandomDistribution(float percent, float contents, int containers):_contents(contents),jar_limit(containers)
{
Jar.resize(containers);
if (percent < 0)
_percent = 0;
else {
_percent = percent;
}
divider = jar_limit * percent;
is0 = false;
}
RandomDistribution::~RandomDistribution()
{
}
void RandomDistribution::setvariables(float percent, float contents, int container) {
if (jar_limit != container)
Jar.resize(container);
_contents = contents;
jar_limit = container;
is0 = false;
if (percent < 0)
_percent = 0;
else {
_percent = percent;
}
divider = jar_limit * percent;
}
Vfloat RandomDistribution::RunDistribution() {
for (int i = 0; i < jar_limit; i++) {
if (!is0) {
if (i + 1 >= jar_limit || _contents < 2) {
Jar[i] = _contents;
_contents -= Jar[i];
is0 = true;
}
if (!_percent <= 0) {//making sure it does not get the whole container at once
_maxdistribution = (_contents / (divider)) * (i + 1);
}
else {
_maxdistribution = _contents;
}
Jar[i] = randy(0, _maxdistribution);
if (Jar[i] < 1) {
Jar[i] = 0;
continue;
}
_contents -= Jar[i];
}
else {
Jar[0];
}
//mixing Jar so it is randomly spaced out instead all at the top
int swapper = randy(0, i);
float hold = Jar[i];
Jar[i] = Jar[swapper];
Jar[swapper] = hold;
}
return Jar;
}
Source code:
#include "RandomDistribution.h"
#include <chrono>
using namespace std;
int main(){
RandomDistribution distribution[100];
for (int i = 0; i < 100; i++) {
distribution[i] = {RandomDistribution(1.0f, 5000.0f, 2000) };
}
Vfloat k;
k.resize(200);
for (int i = 0; i < 10; i++) {
auto t3 = chrono::steady_clock::now();
for (int b = 0; b < 100; b++) {
k = distribution[b].RunDistribution();
distribution[b].setvariables(1.0f, 5000.0f, 2000);
}
auto t4 = chrono::steady_clock::now();
auto time_span = chrono::duration_cast<chrono::duration<double>>(t4 - t3);
cout << time_span.count() << " seconds\n";
}
}
What prints out is usually between 1 and 2 seconds for each cycle. I want to bring it down to a tenth of a second if possible, because this is going to be only one step of the process and I want to run it a lot more than 100 times. What can I do to speed this up? Is there a trick or something I'm just missing here?
Here is a sample of the timings:
4.71113 seconds
1.35444 seconds
1.45008 seconds
1.74961 seconds
2.59192 seconds
2.76171 seconds
1.90149 seconds
2.2822 seconds
2.36768 seconds
2.61969 seconds
Cheinan Marks has some benchmarks and performance tips related to random generators and friends in his CppCon 2016 talk "I Just Wanted a Random Integer!". He mentions some fast generators as well, IIRC. I'd start there.
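If the generator itself turns out to be the hot spot, one cheap experiment (my assumption, not something the talk prescribes) is to use a lighter engine and reuse a single distribution object instead of constructing one per call; a hypothetical randy_fast() sketch:

#include <random>

// Hypothetical drop-in for randy(): lighter engine, one distribution object
// reused across calls, with the range passed as per-call parameters.
static float randy_fast(float low, float high) {
    static std::minstd_rand random(std::random_device{}());
    static std::uniform_real_distribution<float> ran;
    return ran(random, std::uniform_real_distribution<float>::param_type(low, high));
}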
I'm doing blocking communication with a server from a client. The function runs in a thread, and I would like to add a timeout to it. I'm not using Boost or anything like that; I'm using the Windows threading library.
Here is the function I want to add the timeout to:
bool S3W::IWFSData::WaitForCompletion(unsigned int timeout)
{
if (m_Buffer)
{
while (!m_Buffer.IsEmpty())
{
unsigned int i = 0;
char gfname[255]; // must be changed to SBuffer
char minHeightArr[8], maxHeightArr[8], xArr[8], yArr[8];
m_PingTime += timeout;
if (m_PingTime > PONG_TIMEOUT)
{
m_PingTime = 0;
return false;
}
while (m_Buffer[i] != '\0')
{
gfname[i] = m_Buffer[i];
i++;
}
gfname[i] = '\0';
for (unsigned int j = 0; j < 8; j++)
{
minHeightArr[j] = m_Buffer[i++];
}
for (unsigned int j = 0; j < 8; j++)
{
maxHeightArr[j] = m_Buffer[i++];
}
double minH = *(double*)minHeightArr;
double maxH = *(double*)maxHeightArr;
for (unsigned int j = 0; j < 8; j++)
{
xArr[j] = m_Buffer[i++];
}
for (unsigned int j = 0; j < 8; j++)
{
yArr[j] = m_Buffer[i++];
}
double x = *(double*)xArr;
double y = *(double*)yArr;
OGRFeature *poFeature = OGRFeature::CreateFeature(m_Layer->GetLayerDefn());
if(poFeature)
{
poFeature->SetField("gfname", gfname);
poFeature->SetField("minHeight", minH);
poFeature->SetField("maxHeight", maxH);
OGRPoint point;
point.setX(x);
point.setY(y);
poFeature->SetGeometry(&point);
if (m_Layer->CreateFeature(poFeature) != OGRERR_NONE)
{
std::cout << "error inserting an area" << std::endl;
}
else
{
std::cout << "Created a feature" << std::endl;
}
}
OGRFeature::DestroyFeature(poFeature);
m_Buffer.Cut(0, i);
}
}
return true;
}
There is a thread that writes the data into the buffer:
int S3W::ImplConnection::Thread(void * pData)
{
SNet::SAutoLock lockReader(m_sLock);
// RECEIVE DATA
SNet::SBuffer buffer;
m_data->SrvReceive(buffer);
// Driver code for inserting data into the buffer in blocking communication
SNet::SAutoLock lockWriter(m_sLockWriter);
m_data->SetData("ahmed", strlen("ahmed"));
double minHeight = 10;
double maxHeight = 11;
double x = 4;
double y = 2;
char minHeightArr[sizeof(minHeight)];
memcpy(&minHeightArr, &minHeight, sizeof(minHeight));
char maxHeightArr[sizeof(maxHeight)];
memcpy(&maxHeightArr, &maxHeight, sizeof(maxHeight));
char xArr[sizeof(x)];
memcpy(&xArr, &x, sizeof(x));
char yArr[sizeof(y)];
memcpy(&yArr, &y, sizeof(y));
m_data->SetData(minHeightArr, sizeof(minHeightArr));
m_data->SetData(maxHeightArr, sizeof(maxHeightArr));
m_data->SetData(xArr, sizeof(xArr));
m_data->SetData(yArr, sizeof(yArr));
m_data->WaitForCompletion(1000);
return LOOP_TIME;
}
In general, you should not use threads for these purposes, because when terminating a thread like this, the process and the other threads could be left in an unknown state. Look here for the explanation.
Therefore, consider using processes instead. Read here about opening processes in C++.
If you do want to use threads, you can exit the thread after the time has passed.
Make a loop (as you have) that breaks when some time has elapsed.
#include <ctime>
#define NUM_SECONDS_TO_WAIT 5
// outside your loop
std::time_t t1 = std::time(0);
// and in your while loop, each iteration:
std::time_t t2 = std::time(0);
if ((t2 - t1) >= NUM_SECONDS_TO_WAIT)
{
break; // ...
}
You can have a class member that holds a timestamp of when to time out (set its value to currentTime + intervalToTimeout). In WaitForCompletion(), get the current time and compare it with that timeout time.
I assume that in your code m_PingTime is the time you start the communication, and you want to time out after 1000 ms. What you need to do in WaitForCompletion() is:
while (!m_Buffer.IsEmpty())
{
...
long time = getCurrentTime(); // fake code
if (m_PingTime + timeout < time)
{
m_PingTime = 0;
return false;
}
...
}
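Filling in the fake getCurrentTime() with std::chrono::steady_clock, the same idea as a self-contained helper might look like this (a sketch; waitWithTimeout and its done predicate are hypothetical names, not part of your code):

#include <chrono>
#include <functional>

// Sketch: poll `done` until it returns true or `timeout_ms` milliseconds pass.
bool waitWithTimeout(const std::function<bool()>& done, unsigned int timeout_ms)
{
    const auto deadline = std::chrono::steady_clock::now()
                        + std::chrono::milliseconds(timeout_ms);
    while (!done())
    {
        if (std::chrono::steady_clock::now() >= deadline)
            return false; // timed out
        // ... do one unit of work or sleep briefly here ...
    }
    return true;
}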
Here is something I did, if you want to implement it yourself:
#include <ctime>
clock_t startTime = clock();
double secondsElapsed = 0;
while (true) {
    secondsElapsed = double(clock() - startTime) / CLOCKS_PER_SEC;
    if (secondsElapsed >= 10) {
        //do what you want upon timeout, in this case it is 10 secs
        break;
    }
    // ... do a unit of work here ...
}