I have code that computes a Gaussian Mixture Model with Expectation Maximization in order to identify clusters in a given input data sample.
Part of the code repeats the computation of this model for a number of trials Ntrials (each independent of the others, but using the same input data) and finally picks the best solution (the one maximizing the total likelihood of the model). This concept generalizes to many other clustering algorithms (e.g. k-means).
I want to parallelize the part of the code that is repeated Ntrials times through multi-threading with C++11, such that each thread executes one trial.
A code example, assuming an input Eigen::ArrayXXd sample of size (Ndimensions x Npoints), could look like this:
double bestTotalModelProbability = 0;
Eigen::ArrayXd clusterIndicesFromSample(Npoints);
clusterIndicesFromSample.setZero();

for (int i = 0; i < Ntrials; i++)
{
    double totalModelProbability = computeGaussianMixtureModel(sample);

    // Check if this trial is better than the previous one.
    // If so, update the results (cluster index for each point
    // in the sample) and keep them.
    if (totalModelProbability > bestTotalModelProbability)
    {
        bestTotalModelProbability = totalModelProbability;
        ...
        clusterIndicesFromSample = obtainClusterMembership(sample);
    }
}
where I pass a reference to sample (Eigen::Ref), and not sample itself, to both the functions computeGaussianMixtureModel() and obtainClusterMembership().
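For example, with signatures along the lines of (the exact return types here are only illustrative):

double computeGaussianMixtureModel(const Eigen::Ref<const Eigen::ArrayXXd> sample);
Eigen::ArrayXd obtainClusterMembership(const Eigen::Ref<const Eigen::ArrayXXd> sample);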
My code is heavily based on Eigen arrays, and the N-dimensional problems I work with can have on the order of 10-100 dimensions and 500-1000 sample points. I am looking for examples of a multi-threaded version of this code using Eigen arrays and std::thread from C++11, but I could not find anything around, and I am struggling to write even simple examples that manipulate Eigen arrays from threads.
I am not even sure Eigen can be used within std::thread in C++11.
Can someone help me with even a simple example to understand the syntax?
I am using clang++ as my compiler on Mac OS X, on a CPU with 6 cores (12 threads).
OP's question attracted my attention because number-crunching with speed-up earned by multi-threading is one of the top to-dos on my personal list.
I must admit that my experience with the Eigen library is very limited. (I once used the decomposition of 3×3 rotation matrices into Euler angles, which is solved very cleverly and generally in the Eigen library.)
Hence, I defined another sample task, consisting of a rather stupid counting of values in a sample data set.
This is done multiple times (using the same evaluation function):
single-threaded (to get a value for comparison)
starting each sub-task in an extra thread (in an admittedly rather stupid approach)
starting threads with interleaved access to sample data
starting threads with partitioned access to sample data.
test-multi-threading.cc:
#include <cstdint>
#include <cstdlib>
#include <chrono>
#include <iomanip>
#include <iostream>
#include <limits>
#include <thread>
#include <vector>
// a sample function to process a certain amount of data
template <typename T>
size_t countFrequency(
size_t n, const T data[], const T &begin, const T &end)
{
size_t result = 0;
for (size_t i = 0; i < n; ++i) result += data[i] >= begin && data[i] < end;
return result;
}
typedef std::uint16_t Value;
typedef std::chrono::high_resolution_clock Clock;
typedef std::chrono::microseconds MuSecs;
typedef decltype(std::chrono::duration_cast<MuSecs>(Clock::now() - Clock::now())) Time;
Time duration(const Clock::time_point &t0)
{
return std::chrono::duration_cast<MuSecs>(Clock::now() - t0);
}
std::vector<Time> makeTest()
{
const Value SizeGroup = 4, NGroups = 10000, N = SizeGroup * NGroups;
const size_t NThreads = std::thread::hardware_concurrency();
// make a test sample
std::vector<Value> sample(N);
for (Value &value : sample) value = (Value)rand();
// prepare result vectors
std::vector<size_t> results4[4] = {
std::vector<size_t>(NGroups, 0),
std::vector<size_t>(NGroups, 0),
std::vector<size_t>(NGroups, 0),
std::vector<size_t>(NGroups, 0)
};
// make test
std::vector<Time> times{
[&]() { // single threading
// make a copy of test sample
std::vector<Value> data(sample);
std::vector<size_t> &results = results4[0];
// remember start time
const Clock::time_point t0 = Clock::now();
// do experiment single-threaded
for (size_t i = 0; i < NGroups; ++i) {
results[i] = countFrequency(data.size(), data.data(),
(Value)(i * SizeGroup), (Value)((i + 1) * SizeGroup));
}
// done
return duration(t0);
}(),
[&]() { // multi-threading - stupid approach
// make a copy of test sample
std::vector<Value> data(sample);
std::vector<size_t> &results = results4[1];
// remember start time
const Clock::time_point t0 = Clock::now();
// do experiment multi-threaded
std::vector<std::thread> threads(NThreads);
for (Value i = 0; i < NGroups;) {
size_t nT = 0;
for (; nT < NThreads && i < NGroups; ++nT, ++i) {
threads[nT] = std::thread(
    [i, &results, &data, SizeGroup]() {
        size_t result = countFrequency(data.size(), data.data(),
            (Value)(i * SizeGroup), (Value)((i + 1) * SizeGroup));
        results[i] = result;
    });
}
for (size_t iT = 0; iT < nT; ++iT) threads[iT].join();
}
// done
return duration(t0);
}(),
[&]() { // multi-threading - interleaved
// make a copy of test sample
std::vector<Value> data(sample);
std::vector<size_t> &results = results4[2];
// remember start time
const Clock::time_point t0 = Clock::now();
// do experiment multi-threaded
std::vector<std::thread> threads(NThreads);
for (Value iT = 0; iT < NThreads; ++iT) {
threads[iT] = std::thread(
    [iT, &results, &data, NGroups, SizeGroup, NThreads]() {
        for (Value i = iT; i < NGroups; i += NThreads) {
            size_t result = countFrequency(data.size(), data.data(),
                (Value)(i * SizeGroup), (Value)((i + 1) * SizeGroup));
            results[i] = result;
        }
    });
}
for (std::thread &threadI : threads) threadI.join();
// done
return duration(t0);
}(),
[&]() { // multi-threading - grouped
std::vector<Value> data(sample);
std::vector<size_t> &results = results4[3];
// remember start time
const Clock::time_point t0 = Clock::now();
// do experiment multi-threaded
std::vector<std::thread> threads(NThreads);
for (Value iT = 0; iT < NThreads; ++iT) {
threads[iT] = std::thread(
    [iT, &results, &data, NGroups, SizeGroup, NThreads]() {
        for (Value i = iT * NGroups / NThreads,
                   iN = (iT + 1) * NGroups / NThreads; i < iN; ++i) {
            size_t result = countFrequency(data.size(), data.data(),
                (Value)(i * SizeGroup), (Value)((i + 1) * SizeGroup));
            results[i] = result;
        }
    });
}
for (std::thread &threadI : threads) threadI.join();
// done
return duration(t0);
}()
};
// check results (must be equal for any kind of computation)
const unsigned nResults = sizeof results4 / sizeof *results4;
for (unsigned i = 1; i < nResults; ++i) {
size_t nErrors = 0;
for (Value j = 0; j < NGroups; ++j) {
if (results4[0][j] != results4[i][j]) {
++nErrors;
#ifdef _DEBUG
std::cerr
<< "results4[0][" << j << "]: " << results4[0][j]
<< " != results4[" << i << "][" << j << "]: " << results4[i][j]
<< "!\n";
#endif // _DEBUG
}
}
if (nErrors) std::cerr << nErrors << " errors in results4[" << i << "]!\n";
}
// done
return times;
}
int main()
{
std::cout << "std::thread::hardware_concurrency(): "
<< std::thread::hardware_concurrency() << '\n';
// heat up
std::cout << "Heat up...\n";
for (unsigned i = 0; i < 3; ++i) makeTest();
// repeat NTrials times
const unsigned NTrials = 10;
std::cout << "Measuring " << NTrials << " runs...\n"
<< " "
<< " | " << std::setw(10) << "Single"
<< " | " << std::setw(10) << "Multi 1"
<< " | " << std::setw(10) << "Multi 2"
<< " | " << std::setw(10) << "Multi 3"
<< '\n';
std::vector<double> sumTimes;
for (unsigned i = 0; i < NTrials; ++i) {
std::vector<Time> times = makeTest();
std::cout << std::setw(2) << (i + 1) << ".";
for (const Time &time : times) {
std::cout << " | " << std::setw(10) << time.count();
}
std::cout << '\n';
sumTimes.resize(times.size(), 0.0);
for (size_t j = 0; j < times.size(); ++j) sumTimes[j] += times[j].count();
}
std::cout << "Average Values:\n ";
for (const double &sumTime : sumTimes) {
std::cout << " | "
<< std::setw(10) << std::fixed << std::setprecision(1)
<< sumTime / NTrials;
}
std::cout << '\n';
std::cout << "Ratio:\n ";
for (const double &sumTime : sumTimes) {
std::cout << " | "
<< std::setw(10) << std::fixed << std::setprecision(3)
<< sumTime / sumTimes.front();
}
std::cout << '\n';
// done
return 0;
}
Compiled and tested on cygwin64 on Windows 10:
$ g++ --version
g++ (GCC) 7.3.0
$ g++ -std=c++11 -O2 -o test-multi-threading test-multi-threading.cc
$ ./test-multi-threading
std::thread::hardware_concurrency(): 8
Heat up...
Measuring 10 runs...
| Single | Multi 1 | Multi 2 | Multi 3
1. | 384008 | 1052937 | 130662 | 138411
2. | 386500 | 1103281 | 133030 | 132576
3. | 382968 | 1078988 | 137123 | 137780
4. | 395158 | 1120752 | 138731 | 138650
5. | 385870 | 1105885 | 144825 | 129405
6. | 366724 | 1071788 | 137684 | 130289
7. | 352204 | 1104191 | 133675 | 130505
8. | 331679 | 1072299 | 135476 | 138257
9. | 373416 | 1053881 | 138467 | 137613
10. | 370872 | 1096424 | 136810 | 147960
Average Values:
| 372939.9 | 1086042.6 | 136648.3 | 136144.6
Ratio:
| 1.000 | 2.912 | 0.366 | 0.365
I did the same on coliru.com (I had to reduce the heat-up cycles and the sample size, as I exceeded the time limit with the original values):
g++ (GCC) 8.1.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
std::thread::hardware_concurrency(): 4
Heat up...
Measuring 10 runs...
| Single | Multi 1 | Multi 2 | Multi 3
1. | 224684 | 297729 | 48334 | 39016
2. | 146232 | 337222 | 66308 | 59994
3. | 195750 | 344056 | 61383 | 63172
4. | 198629 | 317719 | 62695 | 50413
5. | 149125 | 356471 | 61447 | 57487
6. | 155355 | 322185 | 50254 | 35214
7. | 140269 | 316224 | 61482 | 53889
8. | 154454 | 334814 | 58382 | 53796
9. | 177426 | 340723 | 62195 | 54352
10. | 151951 | 331772 | 61802 | 46727
Average Values:
| 169387.5 | 329891.5 | 59428.2 | 51406.0
Ratio:
| 1.000 | 1.948 | 0.351 | 0.303
Live Demo on coliru
I'm a little surprised that the ratios on coliru (with only 4 threads) are even better than on my PC (with 8 threads). Actually, I don't know how to explain this.
However, there are a lot of other differences between the two setups which may or may not be responsible. At least, both measurements show a rough speed-up of 3 for the 3rd and 4th approaches, while the 2nd squanders every potential speed-up (probably through the overhead of starting and joining all these threads).
Looking at the sample code, you will notice that there is no mutex or any other explicit locking. This is intentional. As I learned (many, many years ago), every attempt at parallelization may cause a certain extra amount of communication overhead (for concurrent tasks which have to exchange data). If the communication overhead becomes too big, it simply consumes the speed advantage of concurrency. So, the best speed-up can be achieved by:
the least communication overhead, i.e. concurrent tasks operating on independent data,
the least effort for post-merging the concurrently computed results.
In my sample code, I
prepared all data and storage before starting the threads,
ensured that shared data which is read is never changed while the threads are running,
wrote data as if it were thread-local (no two threads write to the same address),
evaluated the computed results only after all threads had been joined.
Concerning 3., I struggled a bit over whether this is legal, i.e. whether it is guaranteed that data written in threads appears correctly in the main thread after joining. (The fact that something seems to work fine is deceptive in general, but especially so concerning multi-threading.)
cppreference.com provides the following explanations:
for std::thread::thread()
The completion of the invocation of the constructor synchronizes-with (as defined in std::memory_order) the beginning of the invocation of the copy of f on the new thread of execution.
for std::thread::join()
The completion of the thread identified by *this synchronizes with the corresponding successful return from join().
In Stack Overflow, I found the following related Q/A's:
Does relaxed memory order effect can be extended to after performing-thread's life?
Are memory fences required here?
Is there an implicit memory barrier with synchronized-with relationship on thread::join?
which convinced me that it is OK.
However, the drawback is that the creation and joining of threads causes additional effort (and it's not that cheap).
An alternative approach could be to use a thread pool to overcome this. I googled a bit and found e.g. Jakob Progsch's ThreadPool on GitHub. However, I guess that with a thread pool, the locking issue is back in the game.
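For illustration, the idea behind such a pool can be sketched in a few lines (untested and intentionally minimal; note the mutex and condition variable, which are exactly the locking that comes back into the game):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A minimal fixed-size thread pool sketch: a shared task queue guarded by a
// mutex, and worker threads that sleep on a condition variable until work
// (or shutdown) arrives.
class MiniPool {
  public:
    explicit MiniPool(unsigned n)
    {
      for (unsigned i = 0; i < n; ++i)
        workers.emplace_back([this]() { workerLoop(); });
    }
    ~MiniPool()
    {
      { std::lock_guard<std::mutex> lock(mtx); done = true; }
      cv.notify_all();
      for (std::thread &worker : workers) worker.join();
    }
    void submit(std::function<void()> task)
    {
      { std::lock_guard<std::mutex> lock(mtx); tasks.push(std::move(task)); }
      cv.notify_one();
    }
  private:
    void workerLoop()
    {
      for (;;) {
        std::function<void()> task;
        { // take one task (or leave if shutting down and nothing is queued)
          std::unique_lock<std::mutex> lock(mtx);
          cv.wait(lock, [this]() { return done || !tasks.empty(); });
          if (done && tasks.empty()) return;
          task = std::move(tasks.front());
          tasks.pop();
        }
        task(); // run outside the lock
      }
    }
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex mtx;
    std::condition_variable cv;
    bool done = false;
};

Each trial would then become one pool.submit(...) call, and the threads are created only once.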
Whether this will work for the Eigen functions as well depends on how the respective Eigen functions are implemented. If they access global variables (which become shared when the same function is called concurrently), that will cause a data race.
Googling a bit, I found the following doc.
Eigen and multi-threading – Using Eigen in a multi-threaded application:
In the case your own application is multithreaded, and multiple threads make calls to Eigen, then you have to initialize Eigen by calling the following routine before creating the threads:
#include <Eigen/Core>
int main(int argc, char** argv)
{
Eigen::initParallel();
...
}
Note
With Eigen 3.3, and a fully C++11 compliant compiler (i.e., thread-safe static local variable initialization), then calling initParallel() is optional.
Warning
note that all functions generating random matrices are not re-entrant nor thread-safe. Those include DenseBase::Random(), and DenseBase::setRandom() despite a call to Eigen::initParallel(). This is because these functions are based on std::rand which is not re-entrant. For thread-safe random generator, we recommend the use of boost::random or c++11 random feature.
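Putting the pieces together for the original Ntrials loop, a minimal sketch could look like the following (untested; it assumes computeGaussianMixtureModel() and obtainClusterMembership() are thread-safe in the sense that they only read the shared sample and keep all mutable state local; the bodies below are mere placeholders for the real model code):

#include <Eigen/Core>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

// Placeholders standing in for the real model code.
double computeGaussianMixtureModel(const Eigen::Ref<const Eigen::ArrayXXd> sample)
{
  return sample.abs().sum(); // dummy "total likelihood"
}

Eigen::ArrayXd obtainClusterMembership(const Eigen::Ref<const Eigen::ArrayXXd> sample)
{
  return Eigen::ArrayXd::Zero(sample.cols()); // dummy membership
}

int main()
{
  Eigen::initParallel(); // optional with Eigen >= 3.3 and a C++11 compiler
  const int Ndimensions = 10, Npoints = 500, Ntrials = 20;
  // Random() is not thread-safe, hence the sample is generated up front.
  const Eigen::ArrayXXd sample = Eigen::ArrayXXd::Random(Ndimensions, Npoints);

  double bestTotalModelProbability = 0;
  Eigen::ArrayXd clusterIndicesFromSample = Eigen::ArrayXd::Zero(Npoints);
  std::mutex bestMutex;

  unsigned nThreads = std::thread::hardware_concurrency();
  if (nThreads == 0) nThreads = 2;
  std::vector<std::thread> threads;
  for (unsigned t = 0; t < nThreads; ++t) {
    threads.emplace_back([&, t]() {
      // interleaved distribution of the Ntrials trials over the threads
      for (int i = (int)t; i < Ntrials; i += (int)nThreads) {
        const double p = computeGaussianMixtureModel(sample);
        // lock only around the (cheap) comparison and update of the best trial
        std::lock_guard<std::mutex> lock(bestMutex);
        if (p > bestTotalModelProbability) {
          bestTotalModelProbability = p;
          clusterIndicesFromSample = obtainClusterMembership(sample);
        }
      }
    });
  }
  for (std::thread &thread : threads) thread.join();
  std::cout << "best total model probability: " << bestTotalModelProbability << '\n';
}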
Related
The code below was taken from an example compiled with g++, where the multi-threaded version was 2x faster than the single-threaded one.
I'm executing it in Visual Studio 2019, and the results are the opposite: the single-threaded version is 2x faster than the multi-threaded one.
#include <thread>
#include <iostream>
#include <chrono>
using namespace std;
using ll = long long;
ll odd, even;
void par(const ll ini, const ll fim)
{
for (auto i = ini; i <= fim; i++)
if (!(i & 1))
even += i;
}
void impar(const ll ini, const ll fim)
{
for (auto i = ini; i <= fim; i++)
if (i & 1)
odd += i;
}
int main()
{
const ll start = 0;
const ll end = 190000000;
/* SINGLE THREADED */
auto start_single = chrono::high_resolution_clock::now();
par(start, end);
impar(start, end);
auto end_single = chrono::high_resolution_clock::now();
auto single_duration = chrono::duration_cast<chrono::microseconds>(end_single - start_single).count();
cout << "SINGLE THREADED\nEven sum: " << even << "\nOdd sum: " << odd << "\nTime: " << single_duration << "us\n\n\n";
/* END OF SINGLE*/
/* MULTI THREADED */
even = odd = 0;
auto start_multi = chrono::high_resolution_clock::now();
thread t(par, start, end);
thread t2(impar, start, end);
t.join();
t2.join();
auto end_multi = chrono::high_resolution_clock::now();
auto multi_duration = chrono::duration_cast<chrono::microseconds>(end_multi - start_multi).count();
cout << "MULTI THREADED\nEven sum: " << even << "\nOdd sum: " << odd << "\nTime: " << multi_duration << "us\n";
/*END OF MULTI*/
cout << "\n\nIs multi faster than single? => " << boolalpha << (multi_duration < single_duration) << '\n';
}
However, if I make a small modification to my functions, as shown below:
void par(const ll ini, const ll fim)
{
ll temp = 0;
for (auto i = ini; i <= fim; i++)
if (!(i & 1))
temp += i;
even = temp;
}
void impar(const ll ini, const ll fim)
{
ll temp = 0;
for (auto i = ini; i <= fim; i++)
if (i & 1)
temp += i;
odd = temp;
}
The multi-threaded version performs better. I would like to know what leads to this behavior (what possible differences in implementation explain it).
Also, I compiled with gcc on www.onlinegdb.com, and the results there are similar to Visual Studio's on my machine.
You are a victim of false sharing.
odd and even reside next to each other in memory, and accessing them from two threads leads to cache-line contention (a.k.a. false sharing).
You can fix it by spreading them 64 bytes apart, to make sure they reside in different cache lines, for example like this:
alignas(64) ll odd, even;
With that change I get good speedup with 2 threads:
SINGLE THREADED
Even sum: 9025000095000000
Odd sum: 9025000000000000
Time: 825954us
MULTI THREADED
Even sum: 9025000095000000
Odd sum: 9025000000000000
Time: 532420us
As for G++'s performance: it might be performing for you, automatically, the optimization you made manually. MSVC is more careful when it comes to optimizing global variables.
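By the way, a quick sanity check for such cases is to compare the variables' addresses divided by the cache-line size (a sketch, assuming the typical 64-byte lines of x86):

#include <cstdint>
#include <iostream>

long long odd, even; // adjacent globals, as in the question

int main()
{
  // Two objects can falsely share only if their addresses fall into the same
  // 64-byte block; with alignas(64) on both, this prints false.
  auto lineOf = [](const void *p) { return (std::uintptr_t)p / 64; };
  std::cout << std::boolalpha
            << "odd/even share a cache line: "
            << (lineOf(&odd) == lineOf(&even)) << '\n';
}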
I'm trying to optimize my code using multithreading, and it's not just that the program isn't twice as fast as it's supposed to be on this dual-core computer; it is MUCH SLOWER. I just want to know if I'm doing something wrong or if it's normal that multithreading does not help in this case. I made this recreation of how I use multithreading, and on my computer the parallel version takes 4 times as long as the normal version:
#include <iostream>
#include <random>
#include <thread>
#include <chrono>
#include <numeric> // for std::accumulate
using namespace std;
default_random_engine ran;
inline bool get(){
return ran() % 3;
}
void normal_serie(unsigned repetitions, unsigned &result){
for (unsigned i = 0; i < repetitions; ++i)
result += get();
}
unsigned parallel_series(unsigned repetitions){
const unsigned hardware_threads = std::thread::hardware_concurrency();
cout << "Threads in this computer: " << hardware_threads << endl;
const unsigned threads_number = (hardware_threads != 0) ? hardware_threads : 2;
const unsigned its_per_thread = repetitions / threads_number;
unsigned *results = new unsigned[threads_number]();
std::thread *threads = new std::thread[threads_number - 1];
for (unsigned i = 0; i < threads_number - 1; ++i)
threads[i] = std::thread(normal_serie, its_per_thread, std::ref(results[i]));
normal_serie(its_per_thread, results[threads_number - 1]);
for (unsigned i = 0; i < threads_number - 1; ++i)
threads[i].join();
auto result = std::accumulate(results, results + threads_number, 0);
delete[] results;
delete[] threads;
return result;
}
int main()
{
constexpr unsigned repetitions = 100000000;
auto to = std::chrono::high_resolution_clock::now();
cout << parallel_series(repetitions) << endl;
auto tf = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(tf - to).count();
cout << "Parallel duration: " << duration << "ms" << endl;
to = std::chrono::high_resolution_clock::now();
unsigned r = 0;
normal_serie(repetitions, r);
cout << r << endl;
tf = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::milliseconds>(tf - to).count();
cout << "Normal duration: " << duration << "ms" << endl;
return 0;
}
Things that I already know but didn't do, to keep this code shorter:
I should set a max_iterations_per_thread, because you don't want to do just 10 iterations per thread; but in this case we are doing a hundred million iterations, so that is not going to happen.
The number of iterations must be divisible by the number of threads; otherwise the code will not perform all the requested iterations.
This is the output that I get on my computer:
Threads in this computer: 2
66665160
Parallel duration: 4545ms
66664432
Normal duration: 1019ms
(Partially solved by making these changes:)
inline bool get(default_random_engine &ran){
return ran() % 3;
}
void normal_serie(unsigned repetitions, unsigned &result){
default_random_engine eng;
unsigned saver_result = 0;
for (unsigned i = 0; i < repetitions; ++i)
saver_result += get(eng);
result += saver_result;
}
All your threads are tripping over each other fighting for access to ran, which can only perform one operation at a time because it only has one state, and each operation advances its state. There is no point in running operations in parallel if the vast majority of each operation involves a choke point that cannot support any concurrency.
All elements of results are likely to share a cache line, which means there is a lot of inter-core communication going on.
Try modifying normal_serie to accumulate into a local variable, and only write it to results at the end.
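For example, a sketch combining that advice with the per-thread engine from the partial fix above:

#include <random>

// Each thread gets its own engine (no shared state to fight over) and
// accumulates into a local variable, touching the shared slot only once.
void normal_serie(unsigned repetitions, unsigned &result)
{
  std::default_random_engine eng(std::random_device{}()); // distinct seed per thread
  unsigned local = 0;
  for (unsigned i = 0; i < repetitions; ++i)
    local += (eng() % 3) != 0;
  result = local; // single write avoids ping-ponging the results cache line
}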
Consider the following code:
#include <iostream>
#include <string>
#include <chrono>
#include <cstdlib> // for srand() and rand()
using namespace std;
int main()
{
int iter = 1000000;
int loops = 10;
while (loops)
{
int a=0, b=0, c=0, f = 0, m = 0, q = 0;
auto begin = chrono::high_resolution_clock::now();
auto end = chrono::high_resolution_clock::now();
auto deltaT = end - begin;
auto accumT = end - begin;
accumT = accumT - accumT;
auto controlT = accumT;
srand(chrono::duration_cast<chrono::nanoseconds>(begin.time_since_epoch()).count());
for (int i = 0; i < iter; i++) {
begin = chrono::high_resolution_clock::now();
// No arithmetic operation
end = chrono::high_resolution_clock::now();
deltaT = end - begin;
accumT += deltaT;
}
controlT = accumT; // Control duration
accumT = accumT - accumT; // Reset to zero
for (int i = 0; i < iter; i++) {
auto n1 = rand() % 100;
auto n2 = rand() % 100;
begin = chrono::high_resolution_clock::now();
c += i*2*n1*n2; // Some arbitrary arithmetic operation
end = chrono::high_resolution_clock::now();
deltaT = end - begin;
accumT += deltaT;
}
// Print the difference in time between loop with no arithmetic operation and loop with
cout << " c = " << c << "\t\t" << " | ";
cout << "difference between the 1st and 2nd loop: "
<< chrono::duration_cast<chrono::nanoseconds>(accumT - controlT).count()
<< endl;
loops--;
}
return 0;
}
It tries to isolate the time measurement of an operation. The first loop is a control to establish a baseline and the second loop has an arbitrary arithmetic operation.
Then it outputs to the console. Here's sample output:
c = 2116663282 | difference between 1st and 2nd loop: -8620916
c = 112424882 | difference between 1st and 2nd loop: -1197927
c = -1569775878 | difference between 1st and 2nd loop: -5226990
c = 1670984684 | difference between 1st and 2nd loop: 4394706
c = -1608171014 | difference between 1st and 2nd loop: 676683
c = -1684897180 | difference between 1st and 2nd loop: 2868093
c = 112418158 | difference between 1st and 2nd loop: 5846887
c = 2019014070 | difference between 1st and 2nd loop: -951609
c = 656490372 | difference between 1st and 2nd loop: 997815
c = 263579698 | difference between 1st and 2nd loop: 2371088
Here's the very interesting part: sometimes the loop with the arithmetic operation finishes faster than the loop with no arithmetic operation (negative difference). This means that the operation recording the current time is slower than the arithmetic operation, and thus not negligible.
Is there a way around this?
PS: Yes, I understand you can wrap the whole loop between begin and end.
Setup machine: Core i7 architecture, Windows 10 64 bit, and Visual Studio 2015
Your problem is that you measure time, not the number of instructions processed. Time can be influenced by a lot of things that are not really what you would expect, or wish to measure.
Instead, you should measure the number of clock cycles. There is a library for this, which can be found on Agner Fog's website. He has a lot of useful information about optimization:
http://www.agner.org/optimize/#manuals
Even using clock cycles, you can still experience peculiarities in the results. This could happen if the processor uses out-of-order execution which enables the processor to optimize the order of execution of the operations.
If you have compiled your code with debugging symbols, the compiler may have injected additional code, which may impact the result. When performing tests like this, you should always compile without debugging information.
You should use a steady clock, std::steady_clock.
std::system_clock (and std::high_resolution_clock, which is often an alias for it) can be adjusted by the OS while you are measuring.
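For illustration, a sketch that times the whole loop once with std::steady_clock and divides by the iteration count (note that this version measures the rand() calls together with the arithmetic; separating them per iteration is exactly what the clock overhead prevents):

#include <chrono>
#include <cstdlib>
#include <iostream>

int main()
{
  const int iter = 1000000;
  long long c = 0;
  const auto t0 = std::chrono::steady_clock::now(); // monotonic, not adjusted by the OS
  for (int i = 0; i < iter; ++i)
    c += (long long)i * 2 * (std::rand() % 100) * (std::rand() % 100);
  const auto t1 = std::chrono::steady_clock::now();
  const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
  std::cout << "c = " << c << ", average: " << ns / (double)iter << " ns/iteration\n";
}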
I need to do something like this in the fastest way possible (O(1) would be perfect):
for (int j = 0; j < V; ++j)
{
if(!visited[j]) required[j]=0;
}
I came up with this solution:
for (int j = 0; j < V; ++j)
{
required[j]=visited[j]&required[j];
}
This made the program run 3 times faster, but I believe there is an even better way to do it. Am I right?
By the way, required and visited are dynamically allocated arrays:
bool *required;
bool *visited;
required = new bool[V];
visited = new bool[V];
In the case where you're using a list of simple objects, you are most likely best served by the functionality provided by the C++ Standard Library. Structures like valarray and vector are recognized and optimized very effectively by all modern compilers.
Much debate exists as to how much you can rely on your compiler, but one guarantee is that your compiler was built alongside the standard library, and relying on it for basic functionality (such as your problem) is generally a safe bet.
Never be afraid to run your own time tests and race your compiler! It's a fun exercise and one that is ever increasingly difficult to achieve.
Construct a valarray (highly optimized in C++11 and later):
std::valarray<bool> valRequired(required, V);
std::valarray<bool> valVisited(visited, V);
valRequired &= valVisited;
Alternatively, you could do it with one line using transform:
std::transform(required, required + V, visited, required, [](bool r, bool v){ return r & v; });
Edit: while fewer lines is not faster per se, your compiler will likely vectorize this operation.
I also tested their timing:
#include <algorithm>
#include <chrono>
#include <iostream>
#include <valarray>

int main(int argc, const char * argv[]) {
auto clock = std::chrono::high_resolution_clock{};
{
bool visited[5] = {1,0,1,0,0};
bool required[5] = {1,1,1,0,1};
auto start = clock.now();
for (int i = 0; i < 5; ++i) {
required[i] &= visited[i];
}
auto end = clock.now();
std::cout << "1: " << (end - start).count() << std::endl;
}
{
bool visited[5] = {1,0,1,0,0};
bool required[5] = {1,1,1,0,1};
auto start = clock.now();
for (int i = 0; i < 5; ++i) {
required[i] = visited[i] & required[i];
}
auto end = clock.now();
std::cout << "2: " << (end - start).count() << std::endl;
}
{
bool visited[5] = {1,0,1,0,0};
bool required[5] = {1,1,1,0,1};
auto start = clock.now();
std::transform(required, required + 5, visited, required, [](bool r, bool v){ return r & v; });
auto end = clock.now();
std::cout << "3: " << (end - start).count() << std::endl;
}
{
bool visited[5] = {1,0,1,0,0};
bool required[5] = {1,1,1,0,1};
std::valarray<bool> valVisited(visited, 5);
std::valarray<bool> valrequired(required, 5);
auto start = clock.now();
valrequired &= valVisited;
auto end = clock.now();
std::cout << "4: " << (end - start).count() << std::endl;
}
}
Output:
1: 102
2: 55
3: 47
4: 45
Program ended with exit code: 0
Along the lines of @AlanStokes' suggestion: use packed binary data and combine it with the AVX-512 instruction _mm512_and_epi64, 512 bits at a time. Be prepared for messed-up hair.
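A portable flavor of the same idea, without intrinsics, is to pack the flags into 64-bit words and AND them word by word; the compiler is then free to vectorize (a sketch; the packed layout with nWords = (V + 63) / 64 is an assumption on top of the question's bool arrays):

#include <cstdint>
#include <vector>

// required &= visited, 64 flags at a time, assuming both flag sets are
// packed into uint64_t words (bit j of word i is flag i * 64 + j).
void andPacked(std::vector<std::uint64_t> &required,
               const std::vector<std::uint64_t> &visited)
{
  for (std::size_t i = 0; i < required.size(); ++i)
    required[i] &= visited[i];
}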
I'm trying to implement a naive version of LU decomposition in OpenCL. To start, I have implemented a sequential version in C++ and constructed methods to verify my result (i.e., multiplication methods). Next I implemented my algorithm in a kernel and tested it with manually verified input (i.e., a 5x5 matrix). This works fine.
However, when I run my algorithm on a randomly generated matrix bigger than 5x5, I get strange results. I've cleaned up my code and checked the calculations manually, but I can't figure out where my kernel is going wrong. I'm starting to think that it might have something to do with floats and the stability of the calculations. By this I mean that error margins get propagated and grow bigger and bigger. I'm well aware that I can swap rows to get the biggest pivot value and such, but the error margin is sometimes way off. And in any case I would have expected the result - albeit a wrong one - to be the same as that of the sequential algorithm. I would like some help identifying where I could be doing something wrong.
I'm using a single dimensional array so addressing a matrix with two dimensions happens like this:
A(row, col) = A[row * matrix_width + col].
About the results, I might add that I decided to merge the L and U matrices into one. So given L and U:
L: U:
1 0 0 A B C
X 1 0 0 D E
Y Z 1 0 0 F
I display them as:
A:
A B C
X D E
Y Z F
The kernel is the following:
The parameter source is the original matrix I want to decompose, and destin is the destination. matrix_size is the total number of elements (so that would be 9 for a 3x3 matrix) and matrix_width is the width (3 for a 3x3 matrix).
__kernel void matrix(
__global float * source,
__global float * destin,
unsigned int matrix_size,
unsigned int matrix_width
)
{
unsigned int index = get_global_id(0);
int col_idx = index % matrix_width;
int row_idx = index / matrix_width;
if (index >= matrix_size)
return;
// First of all, copy our value to the destination.
destin[index] = source[index];
// Iterate over all the pivots.
for(int piv_idx = 0; piv_idx < matrix_width; piv_idx++)
{
// We have to be the row below the pivot row
// And we have to be the column of the pivot
// or right of that column.
if(col_idx < piv_idx || row_idx <= piv_idx)
return;
// Calculate the divisor.
float pivot_value = destin[(piv_idx * matrix_width) + piv_idx];
float below_pivot_value = destin[(row_idx * matrix_width) + piv_idx];
float divisor = below_pivot_value/ pivot_value;
// Get the value in the pivot row on this column.
float pivot_row_value = destin[(piv_idx * matrix_width) + col_idx];
float current_value = destin[index];
destin[index] = current_value - (pivot_row_value * divisor);
// Write the divisor to the memory (we won't use these values anymore!)
// if we are the value under the pivot.
barrier(CLK_GLOBAL_MEM_FENCE);
if(col_idx == piv_idx)
{
int divisor_location = (row_idx * matrix_width) + piv_idx;
destin[divisor_location] = divisor;
}
barrier(CLK_GLOBAL_MEM_FENCE);
}
}
This is the sequential version:
// Decomposes a matrix into L and U but in the same matrix.
float * decompose(float* A, int matrix_width)
{
int total_length = matrix_width*matrix_width;
float *U = new float[total_length];
for (int i = 0; i < total_length; i++)
{
U[i] = A[i];
}
for (int row = 0; row < matrix_width; row++)
{
int pivot_idx = row;
float pivot_val = U[pivot_idx * matrix_width + pivot_idx];
for (int r = row + 1; r < matrix_width; r++)
{
float below_pivot = U[r*matrix_width + pivot_idx];
float divisor = below_pivot / pivot_val;
for (int row_idx = pivot_idx; row_idx < matrix_width; row_idx++)
{
float value = U[row * matrix_width + row_idx];
U[r*matrix_width + row_idx] = U[r*matrix_width + row_idx] - (value * divisor);
}
U[r * matrix_width + pivot_idx] = divisor;
}
}
return U;
}
An example output I get is the following:
Workgroup size: 1
Array dimension: 6
Original unfactorized:
| 176.000000 | 133.000000 | 431.000000 | 839.000000 | 739.000000 | 450.000000 |
| 507.000000 | 718.000000 | 670.000000 | 753.000000 | 122.000000 | 941.000000 |
| 597.000000 | 449.000000 | 596.000000 | 742.000000 | 491.000000 | 212.000000 |
| 159.000000 | 944.000000 | 797.000000 | 717.000000 | 822.000000 | 219.000000 |
| 266.000000 | 755.000000 | 33.000000 | 231.000000 | 824.000000 | 785.000000 |
| 724.000000 | 408.000000 | 652.000000 | 863.000000 | 663.000000 | 113.000000 |
Sequential:
| 176.000000 | 133.000000 | 431.000000 | 839.000000 | 739.000000 | 450.000000 |
| 2.880682 | 334.869324 | -571.573853 | -1663.892090 | -2006.823730 | -355.306763 |
| 3.392045 | -0.006397 | -869.627747 | -2114.569580 | -2028.558716 | -1316.693359 |
| 0.903409 | 2.460203 | -2.085742 | -357.893066 | 860.526367 | -2059.689209 |
| 1.511364 | 1.654343 | -0.376231 | -2.570729 | 4476.049805 | -5097.599121 |
| 4.113636 | -0.415427 | 1.562076 | -0.065806 | 0.003290 | 52.263515 |
Sequential multiplied matching with original?:
1
GPU:
| 176.000000 | 133.000000 | 431.000000 | 839.000000 | 739.000000 | 450.000000 |
| 2.880682 | 334.869293 | -571.573914 | -1663.892212 | -2006.823975 | -355.306885 |
| 3.392045 | -0.006397 | -869.627808 | -2114.569580 | -2028.558716 | -1316.693359 |
| 0.903409 | 2.460203 | -2.085742 | -357.892578 | 5091.575684 | -2059.688965 |
| 1.511364 | 1.654343 | -0.376232 | -2.570732 | 16116.155273 | -5097.604980 |
| 4.113636 | -0.415427 | -0.737347 | 2.005755 | -3.655331 | -237.480438 |
GPU multiplied matching with original?:
Values differ: 5053.05 -- 822
0
Values differ: 5091.58 -- 860.526
Correct solution? 0
Edit
Okay, I think I understand why it was not working before. The reason is that I only synchronize within each work group. When I called my kernel with a work-group size equal to the number of items in my matrix, it was always correct, because then the barriers worked properly. However, I decided to go with the approach mentioned in the comments: enqueue multiple kernels and wait for each kernel to finish before starting the next one. This maps onto an iteration over each row of the matrix, reducing it with the pivot element. It makes sure that I do not modify or read elements that are being modified by the kernel at that point.
Again, this works, but only for small matrices. So I think I was wrong in assuming that the synchronization was the only issue. As per Baiz's request, I am posting here my entire main that calls the kernel:
int main(int argc, char *argv[])
{
try {
if (argc != 5) {
std::ostringstream oss;
oss << "Usage: " << argv[0] << " <kernel_file> <kernel_name> <workgroup_size> <array width>";
throw std::runtime_error(oss.str());
}
// Read in arguments.
std::string kernel_file(argv[1]);
std::string kernel_name(argv[2]);
unsigned int workgroup_size = atoi(argv[3]);
unsigned int array_dimension = atoi(argv[4]);
int total_matrix_length = array_dimension * array_dimension;
// Print parameters
std::cout << "Workgroup size: " << workgroup_size << std::endl;
std::cout << "Array dimension: " << array_dimension << std::endl;
// Create matrix to work on.
// Create a random array.
int matrix_width = sqrt(total_matrix_length);
float* input_matrix = randomMatrix(total_matrix_length); // (was: new[] then reassigned, which leaked)
/// Debugging
//float* input_matrix = new float[9];
//int matrix_width = 3;
//total_matrix_length = matrix_width * matrix_width;
//input_matrix[0] = 10; input_matrix[1] = -7; input_matrix[2] = 0;
//input_matrix[3] = -3; input_matrix[4] = 2; input_matrix[5] = 6;
//input_matrix[6] = 5; input_matrix[7] = -1; input_matrix[8] = 5;
// Allocate memory on the host and populate source
float *gpu_result = new float[total_matrix_length];
// OpenCL initialization
std::vector<cl::Platform> platforms;
std::vector<cl::Device> devices;
cl::Platform::get(&platforms);
platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
cl::Context context(devices);
cl::CommandQueue queue(context, devices[0], CL_QUEUE_PROFILING_ENABLE);
// Load the kernel source.
std::string file_text;
std::ifstream file_stream(kernel_file.c_str());
if (!file_stream) {
std::ostringstream oss;
oss << "There is no file called " << kernel_file;
throw std::runtime_error(oss.str());
}
file_text.assign(std::istreambuf_iterator<char>(file_stream), std::istreambuf_iterator<char>());
// Compile the kernel source.
std::string source_code = file_text;
std::pair<const char *, size_t> source(source_code.c_str(), source_code.size());
cl::Program::Sources sources;
sources.push_back(source);
cl::Program program(context, sources);
try {
program.build(devices);
}
catch (cl::Error& e) {
std::string msg;
program.getBuildInfo<std::string>(devices[0], CL_PROGRAM_BUILD_LOG, &msg);
std::cerr << "Your kernel failed to compile" << std::endl;
std::cerr << "-----------------------------" << std::endl;
std::cerr << msg;
throw(e);
}
// Allocate memory on the device
cl::Buffer source_buf(context, CL_MEM_READ_ONLY, total_matrix_length*sizeof(float));
cl::Buffer dest_buf(context, CL_MEM_WRITE_ONLY, total_matrix_length*sizeof(float));
// Create the actual kernel.
cl::Kernel kernel(program, kernel_name.c_str());
// transfer source data from the host to the device
queue.enqueueWriteBuffer(source_buf, CL_TRUE, 0, total_matrix_length*sizeof(float), input_matrix);
for (int pivot_idx = 0; pivot_idx < matrix_width; pivot_idx++)
{
// set the kernel arguments
kernel.setArg<cl::Memory>(0, source_buf);
kernel.setArg<cl::Memory>(1, dest_buf);
kernel.setArg<cl_uint>(2, total_matrix_length);
kernel.setArg<cl_uint>(3, matrix_width);
kernel.setArg<cl_int>(4, pivot_idx);
// execute the code on the device
std::cout << "Enqueueing new kernel for " << pivot_idx << std::endl;
cl::Event evt;
queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(total_matrix_length), cl::NDRange(workgroup_size), 0, &evt);
evt.wait();
std::cout << "Iteration " << pivot_idx << " done" << std::endl;
}
// transfer destination data from the device to the host
queue.enqueueReadBuffer(dest_buf, CL_TRUE, 0, total_matrix_length*sizeof(float), gpu_result);
// Calculate sequentially.
float* sequential = decompose(input_matrix, matrix_width);
// Print out the results.
std::cout << "Sequential:\n";
printMatrix(total_matrix_length, sequential);
// Print out the results.
std::cout << "GPU:\n";
printMatrix(total_matrix_length, gpu_result);
std::cout << "Correct solution? " << equalMatrices(gpu_result, sequential, total_matrix_length);
// compute the data throughput in GB/s
//float throughput = (2.0*total_matrix_length*sizeof(float)) / t; // t is in nano seconds
//std::cout << "Achieved throughput: " << throughput << std::endl;
// Cleanup
// Deallocate memory
delete[] gpu_result;
delete[] input_matrix;
delete[] sequential;
return 0;
}
catch (cl::Error& e) {
std::cerr << e.what() << ": " << jc::readable_status(e.err());
return 3;
}
catch (std::exception& e) {
std::cerr << e.what() << std::endl;
return 2;
}
catch (...) {
std::cerr << "Unexpected error. Aborting!\n" << std::endl;
return 1;
}
}
As maZZZu already stated, due to the parallel execution of the work items you cannot be sure whether an element in the array has already been read/written.
This can be ensured using
CLK_LOCAL_MEM_FENCE/CLK_GLOBAL_MEM_FENCE
however, these mechanisms only work on threads within the same work group.
There is no possibility to synchronize work items from different work groups.
Your problem most likely is:
you use multiple work groups for an algorithm which is most likely only executable by a single work group
you do not use enough barriers
if you already use only a single work group, try adding a
barrier(CLK_GLOBAL_MEM_FENCE);
to all parts where you read/write from/to destin.
You should restructure your algorithm:
have only one work group perform the algorithm on your matrix
use local memory for better performance(since you repeatedly access elements)
use barriers everywhere. If the algorithm works, you can start removing them after working out which ones you don't need.
Could you post your kernel call and the working sizes?
EDIT:
From your algorithm I came up with this code.
I haven't tested it, and I doubt it'll work right away, but it should help you understand how to parallelize a sequential algorithm.
It will decompose the matrix with only one kernel launch.
Some restrictions:
This code only works with a single work group.
It will only work for matrices whose size does not exceed your maximum local work-group size (probably between 256 and 1024).
If you want to change that, you should refactor the algorithm to use only as many work items as the width of the matrix.
Just adapt the following to your kernel.setArg(...) code:
int nbElements = width*height;
clSetKernelArg (kernel, 0, sizeof(A), &A);
clSetKernelArg (kernel, 1, sizeof(U), &U);
clSetKernelArg (kernel, 2, sizeof(float) * widthMat * heightMat, NULL); // Local memory
clSetKernelArg (kernel, 3, sizeof(int), &width);
clSetKernelArg (kernel, 4, sizeof(int), &height);
clSetKernelArg (kernel, 5, sizeof(int), &nbElements);
Kernel code:
inline int indexFrom2d(const int u, const int v, const int width)
{
return width*v + u;
}
kernel void decompose(global float* A,
global float* U,
local float* localBuffer,
const int widthMat,
const int heightMat,
const int nbElements)
{
int gidx = get_global_id(0);
int col = gidx%widthMat;
int row = gidx/widthMat;
if(gidx >= nbElements)
return;
// Copy from global to local memory
localBuffer[gidx] = A[gidx];
// Sync copy process
barrier(CLK_LOCAL_MEM_FENCE);
for (int rowOuter = 0; rowOuter < widthMat; ++rowOuter)
{
int pivotIdx = rowOuter;
float pivotValue = localBuffer[indexFrom2d(pivotIdx, pivotIdx, widthMat)];
// Data for all work items in the row
float belowPivot = localBuffer[indexFrom2d(pivotIdx, row, widthMat)];
float divisor = belowPivot / pivotValue;
float value = localBuffer[indexFrom2d(col, rowOuter, widthMat)];
// Only work items below pivot and from pivot to the right
if( col >= pivotIdx && col < widthMat &&
    row >= pivotIdx + 1 && row < heightMat)
{
localBuffer[indexFrom2d(col, row, widthMat)] = localBuffer[indexFrom2d(col, row, widthMat)] - (value * divisor);
if(col == pivotIdx)
localBuffer[indexFrom2d(pivotIdx, row, widthMat)] = divisor;
}
barrier(CLK_LOCAL_MEM_FENCE);
}
// Write back to global memory
U[gidx] = localBuffer[gidx];
}
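Alternatively, if you keep the one-launch-per-pivot approach from the question's edit, each iteration can be made race-free by splitting it into two kernels, so that within a single launch no work item ever reads an element that another work item may be writing (a sketch, untested; the buffer m holds the matrix, copied from source once before the host loop):

// Launch 1: one work item per row; rows below the pivot store their divisor
// in place (this entry of the pivot column becomes the L multiplier).
__kernel void computeDivisors(__global float *m, unsigned int width, int piv)
{
    int row = get_global_id(0);
    if (row > piv && row < (int)width)
        m[row * width + piv] /= m[piv * width + piv];
}

// Launch 2: one work item per element strictly below and right of the pivot;
// it only reads values finalized in earlier launches or in launch 1.
__kernel void eliminate(__global float *m, unsigned int width, int piv)
{
    int col = get_global_id(0) % width;
    int row = get_global_id(0) / width;
    if (row > piv && row < (int)width && col > piv)
        m[row * width + col] -= m[row * width + piv] * m[piv * width + col];
}

The host then enqueues computeDivisors and eliminate for each pivot and waits in between, just like the existing loop does.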
The errors are way too big to be caused by float arithmetic.
Without any deeper understanding of your algorithm, I would say that the problem is that you are using values from the destination buffer. With sequential code this is fine, because you know what values are there. But with OpenCL, kernels are executed in parallel, so you cannot tell whether another kernel has already stored its value in the destination buffer or not.