We have an urban legend regarding chained Armadillo operations having "issues". Here's a comment from a recent code change:
// calculate coefficients
// Note: yes, we could write this more efficiently in one line of code.
// e.g., a = (s.t() * s).i() * s.t() * p
// However we have had exception issues with armadillo that seem to have been solved
// by un-chaining blocks of code, which we will do here:
The code, as implemented, was functionally equivalent to the chained version because of the use of auto:
auto st = s.t();
auto sts = st * s;
auto stsi = sts.i();
auto stsist = stsi * st;
arma::vec a = stsist * p;
This works fine in a single-threaded run. However, when running multiple threads (each thread operating on its own instances, so there should be no concurrency issues), the final statement hangs.
The fix is to explicitly assign the intermediate steps to arma::mat:
arma::mat st = s.t();
arma::mat sts = st * s;
arma::mat stsi = sts.i();
arma::mat stsist = stsi * st;
arma::vec a = stsist * p;
Now, all threads run just fine.
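For reference, a small standalone check (just a sketch; the printed type names are compiler-mangled) suggests that with auto the intermediates are Armadillo expression-template proxies holding references to s, rather than materialized matrices:
#include <armadillo>
#include <iostream>
#include <typeinfo>

int main() {
    arma::mat s(10, 3, arma::fill::randu);
    auto st  = s.t();    // an arma::Op<...> expression, not an arma::mat
    auto sts = st * s;   // an arma::Glue<...> expression, still unevaluated
    std::cout << typeid(st).name() << "\n"
              << typeid(sts).name() << "\n";  // neither names arma::Mat
    return 0;
}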
What is going on that Armadillo can't do chained operations on different objects concurrently?
Related
I'm using the XLA C++ API, and I've managed to run a simple addition, but I've no idea if I'm doing it right. There seem to be an awful lot of classes that I've not used. Here's my example
auto builder = new XlaBuilder("XlaBuilder");
auto one = ConstantR0(builder, 1);
auto two = ConstantR0(builder, 2);
auto res = one + two;
ValueInferenceMode value_inf_mode;
auto value_inf = new ValueInference(builder);
auto lit = value_inf
->AnalyzeConstant(res, value_inf_mode)
->GetValue()
->Clone();
// I'm using `untyped_data` because I can't express arbitrary array types.
// I guess I could use `data<int32>` in this simple case
auto data = lit.untyped_data();
std::cout << ((int32*) data)[0] << std::endl; // prints 3
I suspect I didn't actually run that computation through XLA. Here's a different approach based on a sample harness in the XLA source code
XlaComputation computation = res.builder()->Build().ConsumeValueOrDie();
ExecutionProfile profile;
Literal lit = ClientLibrary::LocalClientOrDie()
->ExecuteAndTransfer(computation, {}, nullptr, &profile)
.ConsumeValueOrDie();
data = lit.untyped_data();
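For what it's worth, a typed read of the same result, using the data<int32> accessor I mentioned above, would look roughly like this (sketch only):
std::cout << lit.data<int32>()[0] << std::endl;  // should also print 3
std::cout << lit.ToString() << std::endl;        // human-readable dump of the literal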
I am using R with Rcpp to perform computationally expensive calculations which also require a lot of RAM. Since I often do these for different parameters, I want to calculate them in parallel. For this I use packages foreach and doParallel. My problem is that once a worker on a thread has finished, it seems like it does not release the RAM. For example, if I use 7 cores and want to scan 9 parameters, I get approximately this behavior:
The memory-usage plot shows that the jump in memory is roughly the same for the two workers 8 and 9 as it is for the seven workers 1-7. Only after workers 8 and 9 finish does the memory seem to be released.
My minimal working example in R:
library(minpack.lm)
library(Rcpp)
library(myRcppPackage)
library(foreach)
library(doParallel)
myParameters <- c(1:9)
# setup parallel backend
cores=detectCores()
close( file( "./monitorfile.txt", open="w" ) ) # flush the monitor-file
cl <- makeCluster(cores[1]-1, outfile="./monitorfile.txt")
registerDoParallel(cl)
clusterExport(cl, list("performCppLoop"), envir = environment())
myResult <- foreach(i=1:length(myParameters), .combine=rbind) %dopar% {
  # perform C++ loop with my parameters
  myData <- performCppLoop(myParameters[i])
  # do some stuff with myData
  result <- cbind(mean(myData[,1]), mean(myData[,2]), mean(myData[,3]))
  rm(myData)
  result # return the row of means (rm() must not be the last expression of the block)
}
stopCluster(cl)
MWE C++ code:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix performCppLoop(double myParameter){
  const unsigned long int number_of_steps = 20000000;
  // ps for phase space
  NumericMatrix data(number_of_steps, 3);
  for (unsigned long int i = 0; i < number_of_steps; i++) {
    // just some dummy calculations
    data(i, 0) = sqrt(myParameter);
    data(i, 1) = myParameter*2.0;
    data(i, 2) = myParameter/2.0;
  }
  return data;
}
What am I doing wrong?
I recently got interested in Intel Threading Building Blocks. I would like to make use of the tbb::task_group class to manage a thread pool.
My first attempt was to build a test where I copy one vector into another: I create nth tasks, each taking care of copying a contiguous slice of the vector.
However, performance decreases with the number of threads. I get the same results with another thread pool implementation. With TBB 2018 Update 5 and gcc 6.3 on Debian Stretch, running on an 8-core i7 box, I get the following figures for copying a vector of 1,000,000 elements:
nth real user
1 0.808s 0.807s
2 1.068s 2.105s
4 1.109s 4.282s
Maybe some of you could help me understand the issue. Here is the code:
#include <iostream>
#include <cstdlib>
#include <vector>
#include <algorithm>
#include "tbb/task_group.h"
#include "tbb/task_scheduler_init.h"
namespace mgis{
using real = double;
using size_type = size_t;
}
void my_copy(std::vector<mgis::real>& d,
const std::vector<mgis::real>& s,
const mgis::size_type b,
const mgis::size_type e){
const auto pb = s.begin()+b;
const auto pe = s.begin()+e;
const auto po = d.begin()+b;
std::copy(pb,pe,po);
}
int main(const int argc, const char* const* argv) {
using namespace mgis;
if (argc != 3) {
std::cerr << "invalid number of arguments\n";
std::exit(-1);
}
const auto ng = std::stoi(argv[1]);
const auto nth = std::stoi(argv[2]);
tbb::task_scheduler_init init(nth);
tbb::task_group g;
std::vector<real> v(ng,0);
std::vector<real> v2(ng);
for (auto rep = 0; rep != 2000; ++rep) { // repeat to get measurable timings; renamed to avoid shadowing the inner i
const auto d = ng / nth;
const auto r = ng % nth;
size_type b = 0;
for (size_type i = 0; i != r; ++i) {
g.run([&v2, &v, b, d] { my_copy(v2, v, b, b + d + 1); });
b += d+1;
}
for (size_type i = r; i != nth; ++i) {
g.run([&v2, &v, b, d] { my_copy(v2, v, b, b + d); });
b += d ;
}
g.wait();
}
return EXIT_SUCCESS;
}
Such a short benchmark does not make sense: TBB needs to create the threads and get them started, and this does not happen immediately on the first call to TBB, since it is a lazy, asynchronous process. That said, your user times suggest that the threads are up and running but probably don't have enough work to do.
A memory copy is a bad candidate for a scalability study because it does not scale beyond the number of memory controllers/channels. So it doesn't matter whether you have 4 CPUs or 24: it's unlikely you can get more than a 4x speedup even on good hardware, and yours might have fewer channels.
Instead of manually splitting the range, use tbb::parallel_for; you don't need task_group there. Moreover, invoking tasks one by one has linear complexity, while parallel_for has logarithmic complexity.
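For illustration, a minimal sketch of the parallel_for version (using plain double vectors; adapt to the mgis aliases from the question):
#include <algorithm>
#include <cstddef>
#include <vector>
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

// Copy s into d in parallel; TBB splits the range and balances the work.
void copy_parallel(std::vector<double>& d, const std::vector<double>& s) {
  tbb::parallel_for(tbb::blocked_range<std::size_t>(0, s.size()),
                    [&](const tbb::blocked_range<std::size_t>& r) {
                      std::copy(s.begin() + r.begin(), s.begin() + r.end(),
                                d.begin() + r.begin());
                    });
}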
I am trying to perform several FFTs in parallel. I am using FFTW and OpenMP. Each FFT is different, so I'm not relying on FFTW's built-in multithreading (which I know uses OpenMP).
int m;
// assume:
// int numberOfColumns = 100;
// int numberOfRows = 100;
#pragma omp parallel for default(none) private(m) shared(numberOfColumns, numberOfRows)// num_threads(4)
for(m = 0; m < 36; m++){
// create pointers
double *inputTest;
fftw_complex *outputTest;
fftw_plan testPlan;
// preallocate vectors for FFTW
outputTest = (fftw_complex*)fftw_malloc(sizeof(fftw_complex)*numberOfRows*numberOfColumns);
inputTest = (double *)fftw_malloc(sizeof(double)*numberOfRows*numberOfColumns);
// confirm that preallocation worked
if (inputTest == NULL || outputTest == NULL){
logger_.log_error("\t\t FFTW memory not allocated on m = %i", m);
}
// EDIT: copy this iteration's data into inputTest (same size for all m);
// copying, rather than reassigning the pointer, keeps the fftw_malloc'd buffer valid
memcpy(inputTest, someDataSpecificToThisIteration(m), sizeof(double)*numberOfRows*numberOfColumns);
// create FFTW plan
#pragma omp critical (make_plan)
{
testPlan = fftw_plan_dft_r2c_2d(numberOfRows, numberOfColumns, inputTest, outputTest, FFTW_ESTIMATE);
}
// confirm that plan was created correctly
if (testPlan == NULL){
logger_.log_error("\t\t failed to create plan on m = %i", m);
}
// execute plan
fftw_execute(testPlan);
// clean up
fftw_free(inputTest);
fftw_free(outputTest);
fftw_destroy_plan(testPlan);
}// end parallelized for loop
This all works fine. However, if I remove the critical construct from around the plan creation (fftw_plan_dft_r2c_2d) my code will fail. Can someone explain why? fftw_plan_dft_r2c_2d isn't really an "orphan", right? Is it because two threads might both try to hit the numberOfRows or numberOfColumns memory location at the same time?
It's pretty much all written in the FFTW documentation about thread safety:
... but some care must be taken because the planner routines share data (e.g. wisdom and trigonometric tables) between calls and plans.
The upshot is that the only thread-safe (re-entrant) routine in FFTW is fftw_execute (and the new-array variants thereof). All other routines (e.g. the planner) should only be called from one thread at a time. So, for example, you can wrap a semaphore lock around any calls to the planner; even more simply, you can just create all of your plans from one thread. We do not think this should be an important restriction (FFTW is designed for the situation where the only performance-sensitive code is the actual execution of the transform), and the benefits of shared data between plans are great.
In a typical application of the FFT, plans are constructed seldom, so it doesn't really matter if you have to synchronise their creation. In your case you don't need to create a new plan at each iteration, unless the dimensions of the data change. You would rather do the following:
#pragma omp parallel default(none) private(m) shared(numberOfColumns, numberOfRows)
{
// create pointers
double *inputTest;
fftw_complex *outputTest;
fftw_plan testPlan;
// preallocate vectors for FFTW
outputTest = (fftw_complex*)fftw_malloc(sizeof(fftw_complex)*numberOfRows*numberOfColumns);
inputTest = (double *)fftw_malloc(sizeof(double)*numberOfRows*numberOfColumns);
// confirm that preallocation worked
if (inputTest == NULL || outputTest == NULL){
logger_.log_error("\t\t FFTW memory not allocated on m = %i", m);
}
// create FFTW plan
#pragma omp critical (make_plan)
testPlan = fftw_plan_dft_r2c_2d(numberOfRows, numberOfColumns, inputTest, outputTest, FFTW_ESTIMATE);
#pragma omp for
for (m = 0; m < 36; m++) {
// execute plan
fftw_execute(testPlan);
}
// clean up
fftw_free(inputTest);
fftw_free(outputTest);
fftw_destroy_plan(testPlan);
}
Now the plans are created only once in each thread and the serialisation overhead diminishes with each execution of fftw_execute(). If running on a NUMA system (e.g. a multi-socket AMD64 or Intel (post-)Nehalem system), you should also enable thread binding (for example via the OMP_PROC_BIND environment variable or your compiler's affinity settings) in order to achieve maximum performance.
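As a further variation (not part of the answer above, just a sketch building on the "new-array variants" mentioned in the quoted documentation), a single plan can be created once and then executed on per-thread buffers with fftw_execute_dft_r2c; fillInput() below is a hypothetical stand-in for loading each iteration's data:
#include <fftw3.h>

void run_many_ffts(int numberOfRows, int numberOfColumns) {
    // Over-allocate the complex output as in the question; r2c strictly needs
    // numberOfRows*(numberOfColumns/2 + 1) complex values.
    const int n = numberOfRows * numberOfColumns;
    // Plan once, in serial, using scratch buffers.
    double* scratchIn = (double*) fftw_malloc(sizeof(double) * n);
    fftw_complex* scratchOut = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * n);
    fftw_plan plan = fftw_plan_dft_r2c_2d(numberOfRows, numberOfColumns,
                                          scratchIn, scratchOut, FFTW_ESTIMATE);
    #pragma omp parallel
    {
        // Per-thread buffers with the same (fftw_malloc) alignment as the originals.
        double* in = (double*) fftw_malloc(sizeof(double) * n);
        fftw_complex* out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * n);
        #pragma omp for
        for (int m = 0; m < 36; m++) {
            // fillInput(in, m);  // hypothetical: load this iteration's data
            fftw_execute_dft_r2c(plan, in, out);  // thread-safe new-array execution
        }
        fftw_free(in);
        fftw_free(out);
    }
    fftw_destroy_plan(plan);
    fftw_free(scratchIn);
    fftw_free(scratchOut);
}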
I've written a program that searches for the maximum in arrays using C++0x threads (for learning purposes). For the implementation I used the standard thread and future classes. However, the parallelized function consistently shows the same or worse run time than the non-parallelized one.
The code is below. I tried storing the data in a one-dimensional array, in a multi-dimensional array, and ended up with several separate arrays. However, no option has given good results. I tried to compile and run my code from Eclipse and from the command line, still with no success. I also tried a similar test without array usage; parallelization gave only a 20% speed-up there. From my point of view, I am running a very simple parallel program, without locks and with almost no resource sharing (each thread operates on its own array). What is the bottleneck?
My machine has Intel Core i7 processor 2.2 GHz with 8 GB of RAM, running Ubuntu 12.04.
#include <algorithm>
#include <cstdlib>
#include <future>
#include <iostream>
#include <sys/time.h>

using namespace std;

const int n = 100000000;
int a[n], b[n], c[n], d[n];
int find_max_usual() {
int res = 0;
for (int i = 0; i < n; ++i) {
res = max(res, a[i]);
res = max(res, b[i]);
res = max(res, c[i]);
res = max(res, d[i]);
}
return res;
}
int find_max(int *a) {
int res = 0;
for (int i = 0; i < n; ++i)
res = max(res, a[i]);
return res;
}
int find_max_parallel() {
future<int> res_a = async(launch::async, find_max, a);
future<int> res_b = async(launch::async, find_max, b);
future<int> res_c = async(launch::async, find_max, c);
future<int> res_d = async(launch::async, find_max, d);
int res = max(max(res_a.get(), res_b.get()), max(res_c.get(), res_d.get()));
return res;
}
double get_time() {
timeval tim;
gettimeofday(&tim, NULL);
double t = tim.tv_sec + (tim.tv_usec / 1000000.0);
return t;
}
int main() {
for (int i = 0; i < n; ++i) {
a[i] = rand();
b[i] = rand();
c[i] = rand();
d[i] = rand();
}
double start = get_time();
int x = find_max_usual();
cerr << x << " " << get_time() - start << endl;
start = get_time();
x = find_max_parallel();
cerr << x << " " << get_time() - start << endl;
return 0;
}
Timing showed that almost all the time in find_max_parallel is consumed by
int res = max(max(res_a.get(), res_b.get()), max(res_c.get(), res_d.get()));
Compilation command line
g++ -O3 -std=c++0x -pthread x.cpp
Update. The problem is solved. I got the desired results with the same test: 4 threads give about a 3.3x speed-up, 3 threads give about 2.5x, and 2 threads behave almost ideally with a 1.9x speed-up. I had just rebooted the system with some new updates; I haven't seen any significant difference in CPU load or running programs.
Thanks to all for help.
You have to explicitly set std::launch::async.
future<int> res_c = async(std::launch::async, find_max, c);
If you omit the flag, std::launch::async | std::launch::deferred is assumed, which leaves it up to the implementation to choose whether to start the task asynchronously or deferred.
Current versions of gcc use std::launch::deferred; MSVC has a runtime scheduler which decides at runtime how the task should be run.
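A quick way to check what your implementation picked (sketch, assuming the arrays and functions from the question plus <chrono>):
auto f = std::async(find_max, a);  // no explicit launch policy
if (f.wait_for(std::chrono::seconds(0)) == std::future_status::deferred)
    std::cerr << "task was deferred, not run on a separate thread" << std::endl;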
Also note that if you want to try:
std::async(find_max, c);
this will also block because the destructor of std::future waits for the task to finish.
I just ran the same test with gcc-4.7.1 and the threaded version is roughly 4 times faster (on a 4-core server).
So the problem is obviously not in the std::future implementation, but in choosing threading settings that are not optimal for your environment. As noted above, your test is not CPU-intensive but memory-intensive, so the bottleneck is definitely memory access.
You'd probably want to run some CPU-intensive test (like computing pi to high precision) to benchmark threading properly.
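For example, a CPU-bound benchmark along those lines might look like this (just a sketch; each task sums a disjoint chunk of the Leibniz series for pi, so the work is arithmetic-heavy and touches almost no memory):
#include <future>
#include <iostream>

double partial_pi(long begin, long end) {
    double sum = 0.0;
    for (long k = begin; k < end; ++k)
        sum += (k % 2 == 0 ? 4.0 : -4.0) / (2.0 * k + 1.0);
    return sum;
}

int main() {
    const long terms = 400000000;
    auto f1 = std::async(std::launch::async, partial_pi, 0L, terms / 4);
    auto f2 = std::async(std::launch::async, partial_pi, terms / 4, terms / 2);
    auto f3 = std::async(std::launch::async, partial_pi, terms / 2, 3 * terms / 4);
    auto f4 = std::async(std::launch::async, partial_pi, 3 * terms / 4, terms);
    std::cout << f1.get() + f2.get() + f3.get() + f4.get() << std::endl;  // ~3.14159
    return 0;
}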
Without experimenting with different numbers of threads and different array sizes, it's hard to say where exactly the bottleneck is, but there are probably a few things in play:
- You probably have a 2-channel memory controller (it's either 2 or 3), so going above 2 threads will just introduce additional contention around memory access. Thus the thesis about having no locking and no resource sharing is not correct: at the hardware level there is contention around concurrent memory access.
- The non-parallel version will be efficiently optimized by prefetching data into cache. On the other hand, there's a chance that in the parallel version you end up with intensive context switching and, as a result, thrash the CPU cache.
Given both factors, you are likely to see a speedup if you tune the number of threads down to 2.