OpenCL computation does not match output of sequential algorithm - c++

I'm trying to implement a naive version of LU decomposition in OpenCL. To start, I implemented a sequential version in C++ and constructed methods to verify my result (i.e., multiplication methods). Next, I implemented my algorithm in a kernel and tested it with manually verified input (i.e., a 5x5 matrix). This works fine.
However, when I run my algorithm on a randomly generated matrix bigger than 5x5 I get strange results. I've cleaned up my code and checked the calculations manually, but I can't figure out where my kernel is going wrong. I'm starting to think that it might have something to do with the floats and the stability of the calculations: error margins would propagate and grow with each step. I'm well aware that I can swap rows to get the biggest pivot value and such, but the error margin is sometimes way off. And in any case, I would have expected the result - albeit a wrong one - to be the same as that of the sequential algorithm. I would like some help identifying where I could be doing something wrong.
I'm using a single dimensional array so addressing a matrix with two dimensions happens like this:
A(row, col) = A[row * matrix_width + col].
About the results I might add that I decided to merge the L and U matrix into one. So given L and U:
L: U:
1 0 0 A B C
X 1 0 0 D E
Y Z 1 0 0 F
I display them as:
A:
A B C
X D E
Y Z F
The kernel is the following:
The parameter source is the original matrix I want to decompose.
The parameter destin is the destination. matrix_size is the total size of the matrix (so that would be 9 for a 3x3) and matrix_width is the width (3 for a 3x3 matrix).
__kernel void matrix(
    __global float * source,
    __global float * destin,
    unsigned int matrix_size,
    unsigned int matrix_width
)
{
    unsigned int index = get_global_id(0);
    int col_idx = index % matrix_width;
    int row_idx = index / matrix_width;

    if (index >= matrix_size)
        return;

    // First of all, copy our value to the destination.
    destin[index] = source[index];

    // Iterate over all the pivots.
    for (int piv_idx = 0; piv_idx < matrix_width; piv_idx++)
    {
        // We have to be the row below the pivot row
        // and we have to be the column of the pivot
        // or right of that column.
        if (col_idx < piv_idx || row_idx <= piv_idx)
            return;

        // Calculate the divisor.
        float pivot_value = destin[(piv_idx * matrix_width) + piv_idx];
        float below_pivot_value = destin[(row_idx * matrix_width) + piv_idx];
        float divisor = below_pivot_value / pivot_value;

        // Get the value in the pivot row on this column.
        float pivot_row_value = destin[(piv_idx * matrix_width) + col_idx];
        float current_value = destin[index];
        destin[index] = current_value - (pivot_row_value * divisor);

        // Write the divisor to memory (we won't use these values anymore!)
        // if we are the value under the pivot.
        barrier(CLK_GLOBAL_MEM_FENCE);
        if (col_idx == piv_idx)
        {
            int divisor_location = (row_idx * matrix_width) + piv_idx;
            destin[divisor_location] = divisor;
        }
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}
This is the sequential version:
// Decomposes a matrix into L and U, stored in the same matrix.
float * decompose(float* A, int matrix_width)
{
    int total_length = matrix_width*matrix_width;
    float *U = new float[total_length];
    for (int i = 0; i < total_length; i++)
    {
        U[i] = A[i];
    }
    for (int row = 0; row < matrix_width; row++)
    {
        int pivot_idx = row;
        float pivot_val = U[pivot_idx * matrix_width + pivot_idx];
        for (int r = row + 1; r < matrix_width; r++)
        {
            float below_pivot = U[r*matrix_width + pivot_idx];
            float divisor = below_pivot / pivot_val;
            // Note: row_idx iterates over the columns of row r here.
            for (int row_idx = pivot_idx; row_idx < matrix_width; row_idx++)
            {
                float value = U[row * matrix_width + row_idx];
                U[r*matrix_width + row_idx] = U[r*matrix_width + row_idx] - (value * divisor);
            }
            U[r * matrix_width + pivot_idx] = divisor;
        }
    }
    return U;
}
An example output I get is the following:
Workgroup size: 1
Array dimension: 6
Original unfactorized:
| 176.000000 | 133.000000 | 431.000000 | 839.000000 | 739.000000 | 450.000000 |
| 507.000000 | 718.000000 | 670.000000 | 753.000000 | 122.000000 | 941.000000 |
| 597.000000 | 449.000000 | 596.000000 | 742.000000 | 491.000000 | 212.000000 |
| 159.000000 | 944.000000 | 797.000000 | 717.000000 | 822.000000 | 219.000000 |
| 266.000000 | 755.000000 | 33.000000 | 231.000000 | 824.000000 | 785.000000 |
| 724.000000 | 408.000000 | 652.000000 | 863.000000 | 663.000000 | 113.000000 |
Sequential:
| 176.000000 | 133.000000 | 431.000000 | 839.000000 | 739.000000 | 450.000000 |
| 2.880682 | 334.869324 | -571.573853 | -1663.892090 | -2006.823730 | -355.306763 |
| 3.392045 | -0.006397 | -869.627747 | -2114.569580 | -2028.558716 | -1316.693359 |
| 0.903409 | 2.460203 | -2.085742 | -357.893066 | 860.526367 | -2059.689209 |
| 1.511364 | 1.654343 | -0.376231 | -2.570729 | 4476.049805 | -5097.599121 |
| 4.113636 | -0.415427 | 1.562076 | -0.065806 | 0.003290 | 52.263515 |
Sequential multiplied matching with original?:
1
GPU:
| 176.000000 | 133.000000 | 431.000000 | 839.000000 | 739.000000 | 450.000000 |
| 2.880682 | 334.869293 | -571.573914 | -1663.892212 | -2006.823975 | -355.306885 |
| 3.392045 | -0.006397 | -869.627808 | -2114.569580 | -2028.558716 | -1316.693359 |
| 0.903409 | 2.460203 | -2.085742 | -357.892578 | 5091.575684 | -2059.688965 |
| 1.511364 | 1.654343 | -0.376232 | -2.570732 | 16116.155273 | -5097.604980 |
| 4.113636 | -0.415427 | -0.737347 | 2.005755 | -3.655331 | -237.480438 |
GPU multiplied matching with original?:
Values differ: 5053.05 -- 822
0
Values differ: 5091.58 -- 860.526
Correct solution? 0
Edit
Okay, I think I now understand why it was not working before. The reason is that I only synchronize within each workgroup. When I called my kernel with a workgroup size equal to the number of items in my matrix it would always be correct, because then the barriers would work properly. However, I decided to go with the approach mentioned in the comments: enqueue multiple kernels and wait for each kernel to finish before starting the next one. Each launch then maps onto one iteration over a pivot row of the matrix, eliminating the rows below it with the pivot element. This makes sure that I do not modify or read elements that are being modified by the kernel at that point.
Again, this works, but only for small matrices. So I think I was wrong in assuming it was only the synchronization. As per Baiz's request, I am posting my entire main here, which calls the kernel:
int main(int argc, char *argv[])
{
    try {
        if (argc != 5) {
            std::ostringstream oss;
            oss << "Usage: " << argv[0] << " <kernel_file> <kernel_name> <workgroup_size> <array width>";
            throw std::runtime_error(oss.str());
        }

        // Read in arguments.
        std::string kernel_file(argv[1]);
        std::string kernel_name(argv[2]);
        unsigned int workgroup_size = atoi(argv[3]);
        unsigned int array_dimension = atoi(argv[4]);
        int total_matrix_length = array_dimension * array_dimension;

        // Print parameters
        std::cout << "Workgroup size: " << workgroup_size << std::endl;
        std::cout << "Array dimension: " << array_dimension << std::endl;

        // Create a random matrix to work on.
        int matrix_width = sqrt(total_matrix_length);
        float* input_matrix = randomMatrix(total_matrix_length);

        /// Debugging
        //float* input_matrix = new float[9];
        //int matrix_width = 3;
        //total_matrix_length = matrix_width * matrix_width;
        //input_matrix[0] = 10; input_matrix[1] = -7; input_matrix[2] = 0;
        //input_matrix[3] = -3; input_matrix[4] = 2;  input_matrix[5] = 6;
        //input_matrix[6] = 5;  input_matrix[7] = -1; input_matrix[8] = 5;

        // Allocate memory on the host for the result.
        float *gpu_result = new float[total_matrix_length];

        // OpenCL initialization
        std::vector<cl::Platform> platforms;
        std::vector<cl::Device> devices;
        cl::Platform::get(&platforms);
        platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
        cl::Context context(devices);
        cl::CommandQueue queue(context, devices[0], CL_QUEUE_PROFILING_ENABLE);

        // Load the kernel source.
        std::string file_text;
        std::ifstream file_stream(kernel_file.c_str());
        if (!file_stream) {
            std::ostringstream oss;
            oss << "There is no file called " << kernel_file;
            throw std::runtime_error(oss.str());
        }
        file_text.assign(std::istreambuf_iterator<char>(file_stream), std::istreambuf_iterator<char>());

        // Compile the kernel source.
        std::string source_code = file_text;
        std::pair<const char *, size_t> source(source_code.c_str(), source_code.size());
        cl::Program::Sources sources;
        sources.push_back(source);
        cl::Program program(context, sources);
        try {
            program.build(devices);
        }
        catch (cl::Error& e) {
            std::string msg;
            program.getBuildInfo<std::string>(devices[0], CL_PROGRAM_BUILD_LOG, &msg);
            std::cerr << "Your kernel failed to compile" << std::endl;
            std::cerr << "-----------------------------" << std::endl;
            std::cerr << msg;
            throw(e);
        }

        // Allocate memory on the device
        cl::Buffer source_buf(context, CL_MEM_READ_ONLY, total_matrix_length*sizeof(float));
        cl::Buffer dest_buf(context, CL_MEM_WRITE_ONLY, total_matrix_length*sizeof(float));

        // Create the actual kernel.
        cl::Kernel kernel(program, kernel_name.c_str());

        // Transfer source data from the host to the device
        queue.enqueueWriteBuffer(source_buf, CL_TRUE, 0, total_matrix_length*sizeof(float), input_matrix);

        for (int pivot_idx = 0; pivot_idx < matrix_width; pivot_idx++)
        {
            // Set the kernel arguments
            kernel.setArg<cl::Memory>(0, source_buf);
            kernel.setArg<cl::Memory>(1, dest_buf);
            kernel.setArg<cl_uint>(2, total_matrix_length);
            kernel.setArg<cl_uint>(3, matrix_width);
            kernel.setArg<cl_int>(4, pivot_idx);

            // Execute the code on the device
            std::cout << "Enqueueing new kernel for " << pivot_idx << std::endl;
            cl::Event evt;
            queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(total_matrix_length), cl::NDRange(workgroup_size), 0, &evt);
            evt.wait();
            std::cout << "Iteration " << pivot_idx << " done" << std::endl;
        }

        // Transfer destination data from the device to the host
        queue.enqueueReadBuffer(dest_buf, CL_TRUE, 0, total_matrix_length*sizeof(float), gpu_result);

        // Calculate sequentially.
        float* sequential = decompose(input_matrix, matrix_width);

        // Print out the results.
        std::cout << "Sequential:\n";
        printMatrix(total_matrix_length, sequential);
        std::cout << "GPU:\n";
        printMatrix(total_matrix_length, gpu_result);
        std::cout << "Correct solution? " << equalMatrices(gpu_result, sequential, total_matrix_length);

        // Compute the data throughput in GB/s
        //float throughput = (2.0*total_matrix_length*sizeof(float)) / t; // t is in nano seconds
        //std::cout << "Achieved throughput: " << throughput << std::endl;

        // Deallocate memory
        delete[] gpu_result;
        delete[] input_matrix;
        delete[] sequential;

        return 0;
    }
    catch (cl::Error& e) {
        std::cerr << e.what() << ": " << jc::readable_status(e.err());
        return 3;
    }
    catch (std::exception& e) {
        std::cerr << e.what() << std::endl;
        return 2;
    }
    catch (...) {
        std::cerr << "Unexpected error. Aborting!\n" << std::endl;
        return 1;
    }
}

As maZZZu already stated, due to the parallel execution of the work items you cannot be sure whether an element in the array has been read/written yet.
This can be ensured using
CLK_LOCAL_MEM_FENCE/CLK_GLOBAL_MEM_FENCE
however, these mechanisms only work on threads within the same work group.
There is no way to synchronize work items from different work groups.
Your problem most likely is:
- you use multiple work groups for an algorithm which is most likely only executable by a single work group
- you do not use enough barriers
If you already use only a single work group, try adding a
barrier(CLK_GLOBAL_MEM_FENCE);
to all parts where you read/write from/to destin.
You should restructure your algorithm:
- have only one work group perform the algorithm on your matrix
- use local memory for better performance (since you repeatedly access elements)
- use barriers everywhere. If the algorithm works, you can start removing them after working out which ones you don't need.
Could you post your kernel call and the working sizes?
EDIT:
From your algorithm I came up with this code.
I haven't tested it, and I doubt it will work right away.
But it should help you understand how to parallelize a sequential algorithm.
It will decompose the matrix with only one kernel launch.
Some restrictions:
- This code only works with a single work group.
- It will only work for matrices whose total size does not exceed your maximum local work-group size (probably between 256 and 1024).
If you want to change that, you should refactor the algorithm to use only as many work items as the matrix is wide.
Just adapt them to your kernel.setArg(...) code
int nbElements = width*height;
clSetKernelArg (kernel, 0, sizeof(A), &A);
clSetKernelArg (kernel, 1, sizeof(U), &U);
clSetKernelArg (kernel, 2, sizeof(float) * widthMat * heightMat, NULL); // Local memory
clSetKernelArg (kernel, 3, sizeof(int), &width);
clSetKernelArg (kernel, 4, sizeof(int), &height);
clSetKernelArg (kernel, 5, sizeof(int), &nbElements);
Kernel code:
inline int indexFrom2d(const int u, const int v, const int width)
{
    return width*v + u;
}

kernel void decompose(global float* A,
                      global float* U,
                      local float* localBuffer,
                      const int widthMat,
                      const int heightMat,
                      const int nbElements)
{
    int gidx = get_global_id(0);
    int col = gidx % widthMat;
    int row = gidx / widthMat;

    if (gidx >= nbElements)
        return;

    // Copy from global to local memory
    localBuffer[gidx] = A[gidx];

    // Sync copy process
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int rowOuter = 0; rowOuter < widthMat; ++rowOuter)
    {
        int pivotIdx = rowOuter;
        float pivotValue = localBuffer[indexFrom2d(pivotIdx, pivotIdx, widthMat)];

        // Data for all work items in the row
        float belowPivot = localBuffer[indexFrom2d(pivotIdx, row, widthMat)];
        float divisor = belowPivot / pivotValue;
        float value = localBuffer[indexFrom2d(col, rowOuter, widthMat)];

        // Only work items below the pivot and from the pivot column to the right.
        // Note: a chained comparison like `widthMat > col >= pivotIdx` would not
        // do a range test in C, so the bounds are spelled out explicitly.
        if (col >= pivotIdx && col < widthMat &&
            row >= pivotIdx + 1 && row < heightMat)
        {
            localBuffer[indexFrom2d(col, row, widthMat)] = localBuffer[indexFrom2d(col, row, widthMat)] - (value * divisor);
            if (col == pivotIdx)
                localBuffer[indexFrom2d(pivotIdx, row, widthMat)] = divisor;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Write back to global memory
    U[gidx] = localBuffer[gidx];
}

The errors are way too big to be caused by float arithmetic.
Without any deeper understanding of your algorithm, I would say the problem is that you are using values from the destination buffer. With sequential code this is fine, because you know what values are there. But with OpenCL, the work items are executed in parallel, so you cannot tell whether another work item has already stored its value to the destination buffer or not.

Related

Best way to represent Quarto Board Game Pieces

I'm creating Quarto, the board game. It's essentially advanced connect 4.
Each piece has four distinguishing features (WHITE VS BLACK | TALL VS SHORT | SQUARE VS CIRCLE | SOLID VS HOLLOW). If you get four of any of those features in a row, you win.
At the moment, I am trying to write a function to check for a win. However, it is on track to be O(n^4), and I think I can improve this depending on how I structure my code.
Currently, I have this:
enum Piece
{
    WTSS, WTSH, WTHS, WTHH,
    WSSS, WSSH, WSHS, WSHH,
    BTSS, BTSH, BTHS, BTHH,
    BSSS, BSSH, BSHS, BSHH,
    EMPTY
};

static const char * PieceStrings[] = {
    "WTSS ", "WTSH ", "WTHS ", "WTHH ",
    "WSSS ", "WSSH ", "WSHS ", "WSHH ",
    "BTSS ", "BTSH ", "BTHS ", "BTHH ",
    "BSSS ", "BSSH ", "BSHS ", "BSHH ",
    "____ ",
};
But I don't think this is very efficient. I have thought about making them their own class, but it then makes it hard to initialize and work with all of these pieces.
This is how I am starting to go about checking for a win:
// go through all N+2 possible connections
// check if any have shared feature
bool Board::isWin() {
    int i, j, k;
    for (i = 0, k = BOARDSIZE-1; i < BOARDSIZE; i++, k--) {
        for (j = 0; j < BOARDSIZE; j++) {
            // Horizontal case
            // board[i][j];
            // Vertical case
            // board[j][i];
        }
        // Diagonal cases
        // board[i][i];
        // board[k][i];
    }
    return false;
}

// Checks if pieces have a similar component
bool Board::checkPieces(list<string> & piecesInARow) {
    int i, j;
    list<string>::iterator it = piecesInARow.begin();
    for (i = 0; i < BOARDSIZE; i++) {
        for (j = 0; j < BOARDSIZE; j++) {
            // check if pieces all have shared char at index
        }
    }
}
How can I improve this and make it easier for myself?
Each property has 2 possibilities, so you could store them in a bit-field as 0 or 1. For example:
unsigned char type = (color << 0) | (size << 1) | (shape << 2) | (thickness << 3);
where each value is a 0 or 1. Let's say:
enum Color { BLACK = 0, WHITE = 1 };
enum Size { SHORT = 0, TALL = 1 };
enum Shape { CIRCLE = 0, SQUARE = 1 };
enum Thickness { HOLLOW = 0, SOLID = 1 };
then you could compare them by checking XNOR (bit equality) to compare 2 at a time in one operation, where each comparison will return a bit-field of which types compared equally.
(piece1 XNOR piece2) AND (piece2 XNOR piece3) AND (piece3 XNOR piece4) != 0
Like this:
class Piece {
public:
    Piece(Color color, Size size, Shape shape, Thickness thickness) {
        type = (color) | (size << 1) | (shape << 2) | (thickness << 3);
    }
    Color color() const {
        return static_cast<Color>((type >> 0) & 1);
    }
    Size size() const {
        return static_cast<Size>((type >> 1) & 1);
    }
    Shape shape() const {
        return static_cast<Shape>((type >> 2) & 1);
    }
    Thickness thickness() const {
        return static_cast<Thickness>((type >> 3) & 1);
    }
    static bool compare(Piece p0, Piece p1, Piece p2, Piece p3) {
        unsigned char c[3];
        c[0] = ~( p0.type ^ p1.type ); // XNOR
        c[1] = ~( p1.type ^ p2.type ); // XNOR
        c[2] = ~( p2.type ^ p3.type ); // XNOR
        return (c[0] & c[1] & c[2] & 0b1111) != 0;
    }
protected:
    unsigned char type;
};
I tested it with this code:
int main() {
    Piece p0(WHITE, SHORT, CIRCLE, HOLLOW);
    Piece p1(WHITE, SHORT, CIRCLE, HOLLOW);
    Piece p2(BLACK, SHORT, CIRCLE, HOLLOW);
    Piece p3(BLACK, TALL, CIRCLE, SOLID);
    const char* str = Piece::compare(p0, p1, p2, p3) ? "success" : "fail";
    std::cout << str << '\n';
    return 0;
}
I decided to manage it with a class; however, each piece is a 4-bit value, so it can be managed in an integral type just as easily.
Another consequence of doing it this way is that pieces can be randomized by picking a value between 0b0000 and 0b1111, which is [0,15] inclusive.

Fast algorithm for computing intersections of N sets

I have n sets A0, A1, ..., An-1 holding items of a set E.
I define a configuration C as the integer made of n bits, so C has values between 0 and 2^n-1. Now, I define the following:
(C) an item e of E is in configuration C
<=> for each bit b of C, if b==1 then e is in Ab, else e is not in Ab
For instance for n=3, the configuration C=011 corresponds to items of E that are in A0 and A1 but NOT in A2 (the NOT is important)
C[bitmap] is the count of elements that have exactly that presence/absence pattern in the sets. C[001] is the number of elements that are in A0 and in no other set.
Another possible definition is :
(V) an item e of E is in configuration V
<=> for each bit b of V, if b==1 then e is in Ab
For instance for n=3, the (V) configuration V=011 corresponds to items of E that are in A0 and A1
V[bitmap] is the count of the intersection of the selected sets. (i.e. the count of how many elements are in all of the sets where the bitmap is true.) V[001] is the number of elements in A0. V[011] is the number of elements in A0 and A1, regardless of whether or not they're also in A2.
In the following, the first picture shows items of sets A0, A1 and A2, the second picture shows size of (C) configurations and the third picture shows size of (V) configurations.
I can also represent the configurations by either of two vectors:
C[001]= 5 V[001]=14
C[010]=10 V[010]=22
C[100]=11 V[100]=24
C[011]= 2 V[011]= 6
C[101]= 3 V[101]= 7
C[110]= 6 V[110]=10
C[111]= 4 V[111]= 4
What I want is to write a C/C++ function that transforms C into V as efficiently as possible. A naive approach could be the following transfo function, which is obviously O(4^n):
#include <vector>
#include <cstdio>

using namespace std;

vector<size_t> transfo (const vector<size_t>& C)
{
    vector<size_t> V (C.size());
    for (size_t i=0; i<C.size(); i++)
    {
        V[i] = 0;
        for (size_t j=0; j<C.size(); j++)
        {
            if ((j&i)==i) { V[i] += C[j]; }
        }
    }
    return V;
}

int main()
{
    vector<size_t> C = {
        /* 000 */  0,
        /* 001 */  5,
        /* 010 */ 10,
        /* 011 */  2,
        /* 100 */ 11,
        /* 101 */  3,
        /* 110 */  6,
        /* 111 */  4
    };
    vector<size_t> V = transfo (C);
    for (size_t i=1; i<V.size(); i++) { printf ("[%2ld] C=%2ld V=%2ld\n", i, C[i], V[i]); }
}
My question is : is there a more efficient algorithm than the naive one for transforming a vector C into a vector V ? And what would be the complexity of such a "good" algorithm ?
Note that I could be interested by any SIMD solution.
Well, you are trying to compute 2^n values, so you cannot do better than O(2^n).
The naive approach starts from the observation that V[X] is obtained by fixing all the 1 bits in X and iterating over all the possible values where the 0 bits are. For example,
V[010] = C[010] + C[011] + C[110] + C[111]
But this approach performs O(2^n) additions for every element of V, yielding a total complexity of O(4^n).
Here is an O(n × 2^n) algorithm. I too am curious whether an O(2^n) algorithm exists.
Let n = 4. Let us consider the full table of V versus C. Each line in the table below corresponds to one value of V and this value is calculated by summing up the columns marked with a *. The layout of * symbols can be easily deduced from the naive approach.
|0000|0001|0010|0011|0100|0101|0110|0111||1000|1001|1010|1011|1100|1101|1110|1111
0000| * | * | * | * | * | * | * | * || * | * | * | * | * | * | * | *
0001| | * | | * | | * | | * || | * | | * | | * | | *
0010| | | * | * | | | * | * || | | * | * | | | * | *
0011| | | | * | | | | * || | | | * | | | | *
0100| | | | | * | * | * | * || | | | | * | * | * | *
0101| | | | | | * | | * || | | | | | * | | *
0110| | | | | | | * | * || | | | | | | * | *
0111| | | | | | | | * || | | | | | | | *
-------------------------------------------------------------------------------------
1000| | | | | | | | || * | * | * | * | * | * | * | *
1001| | | | | | | | || | * | | * | | * | | *
1010| | | | | | | | || | | * | * | | | * | *
1011| | | | | | | | || | | | * | | | | *
1100| | | | | | | | || | | | | * | * | * | *
1101| | | | | | | | || | | | | | * | | *
1110| | | | | | | | || | | | | | | * | *
1111| | | | | | | | || | | | | | | | *
Notice that the top-left, top-right and bottom-right corners contain identical layouts. Therefore, we can perform some calculations in bulk as follows:
1. Compute the bottom half of the table (the bottom-right corner).
2. Add the values to the top half.
3. Compute the top-left corner.
If we let q = 2^n, the recurrence for the complexity is
T(q) = 2T(q/2) + O(q)
which, by the Master Theorem, solves to
T(q) = O(q log q)
or, in terms of n,
T(n) = O(n × 2^n)
Based on the great observation by @CătălinFrâncu, I wrote two recursive implementations of the transformation (see code below):
- transfo_recursive: a very straightforward recursive implementation
- transfo_avx2: still recursive, but using AVX2 for the last step of the recursion, for n=3
I propose here that the counters are coded on 32 bits and that the n value can grow up to 28.
I also wrote an iterative implementation (transfo_iterative) based on my own observation of the recursion's behaviour. Actually, I guess it is close to the non-recursive implementation proposed by @chtz.
Here is the benchmark code:
// compiled with: g++ -O3 intersect.cpp -march=native -mavx2 -lpthread -DNDEBUG
#include <vector>
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <cmath>
#include <thread>
#include <algorithm>
#include <sys/times.h>
#include <immintrin.h>
#include <boost/align/aligned_allocator.hpp>

using namespace std;

////////////////////////////////////////////////////////////////////////////////
typedef u_int32_t Count;

// Note: alignment is important for AVX2
typedef std::vector<Count,boost::alignment::aligned_allocator<Count, 8*sizeof(Count)>> CountVector;

typedef void (*callback) (CountVector::pointer C, size_t q);

typedef vector<pair<const char*, callback>> FunctionsVector;

unsigned int randomSeed = 0;

////////////////////////////////////////////////////////////////////////////////
double timestamp()
{
    struct timespec timet;
    clock_gettime(CLOCK_MONOTONIC, &timet);
    return timet.tv_sec + (timet.tv_nsec / 1000000000.0);
}

////////////////////////////////////////////////////////////////////////////////
CountVector getRandomVector (size_t n)
{
    // We use the same seed, so we'll get the same random values
    srand (randomSeed);

    // We fill a vector of size q=2^n with random values
    CountVector C(1ULL<<n);
    for (size_t i=0; i<C.size(); i++) { C[i] = rand() % (1ULL<<(8*sizeof(Count))); }
    return C;
}
////////////////////////////////////////////////////////////////////////////////
void copy_add_block (CountVector::pointer C, size_t q)
{
    for (size_t i=0; i<q/2; i++) { C[i] += C[i+q/2]; }
}

////////////////////////////////////////////////////////////////////////////////
void copy_add_block_avx2 (CountVector::pointer C, size_t q)
{
    __m256i* target = (__m256i*) (C);
    __m256i* source = (__m256i*) (C+q/2);

    size_t imax = q/(2*8);
    for (size_t i=0; i<imax; i++)
    {
        target[i] = _mm256_add_epi32 (source[i], target[i]);
    }
}

////////////////////////////////////////////////////////////////////////////////
// Naive approach : O(4^n)
////////////////////////////////////////////////////////////////////////////////
CountVector transfo_naive (const CountVector& C)
{
    CountVector V (C.size());
    for (size_t i=0; i<C.size(); i++)
    {
        V[i] = 0;
        for (size_t j=0; j<C.size(); j++)
        {
            if ((j&i)==i) { V[i] += C[j]; }
        }
    }
    return V;
}

////////////////////////////////////////////////////////////////////////////////
// Recursive approach : O(n.2^n)
////////////////////////////////////////////////////////////////////////////////
void transfo_recursive (CountVector::pointer C, size_t q)
{
    if (q>1)
    {
        transfo_recursive (C+q/2, q/2);
        transfo_recursive (C,     q/2);
        copy_add_block (C, q);
    }
}

////////////////////////////////////////////////////////////////////////////////
// Iterative approach : O(n.2^n)
////////////////////////////////////////////////////////////////////////////////
void transfo_iterative (CountVector::pointer C, size_t q)
{
    size_t i = 0;
    for (size_t n=q; n>1; n>>=1, i++)
    {
        size_t d = 1<<i;
        for (ssize_t j=q-1-d; j>=0; j--)
        {
            if ( ((j>>i)&1)==0) { C[j] += C[j+d]; }
        }
    }
}
////////////////////////////////////////////////////////////////////////////////
// Recursive AVX2 approach : O(n.2^n)
////////////////////////////////////////////////////////////////////////////////
#define ROTATE1(s) _mm256_permutevar8x32_epi32 (s, _mm256_set_epi32(0,7,6,5,4,3,2,1))
#define ROTATE2(s) _mm256_permutevar8x32_epi32 (s, _mm256_set_epi32(0,0,7,6,5,4,3,2))
#define ROTATE4(s) _mm256_permutevar8x32_epi32 (s, _mm256_set_epi32(0,0,0,0,7,6,5,4))

void transfo_avx2 (CountVector::pointer V, size_t N)
{
    __m256i k1 = _mm256_set_epi32 (0,0xFFFFFFFF,0,0xFFFFFFFF,0,0xFFFFFFFF,0,0xFFFFFFFF);
    __m256i k2 = _mm256_set_epi32 (0,0,0xFFFFFFFF,0xFFFFFFFF,0,0,0xFFFFFFFF,0xFFFFFFFF);
    __m256i k4 = _mm256_set_epi32 (0,0,0,0,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF);

    if (N==8)
    {
        __m256i* source = (__m256i*) (V);
        *source = _mm256_add_epi32 (*source, _mm256_and_si256(ROTATE1(*source),k1));
        *source = _mm256_add_epi32 (*source, _mm256_and_si256(ROTATE2(*source),k2));
        *source = _mm256_add_epi32 (*source, _mm256_and_si256(ROTATE4(*source),k4));
    }
    else // if (N>8)
    {
        transfo_avx2 (V+N/2, N/2);
        transfo_avx2 (V,     N/2);
        copy_add_block_avx2 (V, N);
    }
}

#define ROTATE1_AND(s) _mm256_srli_epi64 ((s), 32)  // odd 32bit elements
#define ROTATE2_AND(s) _mm256_bsrli_epi128 ((s), 8) // high 64bit halves

// gcc doesn't have _mm256_zextsi128_si256
// and _mm256_castsi128_si256 doesn't guarantee zero extension
// vperm2i128 can do the same job as vextracti128, but is slower on Ryzen
#ifdef __clang__ // high 128bit lane
#define ROTATE4_AND(s) _mm256_zextsi128_si256(_mm256_extracti128_si256((s),1))
#else
//#define ROTATE4_AND(s) _mm256_castsi128_si256(_mm256_extracti128_si256((s),1))
#define ROTATE4_AND(s) _mm256_permute2x128_si256((s),(s),0x81) // high bit set = zero that lane
#endif

void transfo_avx2_pcordes (CountVector::pointer C, size_t q)
{
    if (q==8)
    {
        __m256i* source = (__m256i*) (C);
        __m256i tmp = *source;
        tmp = _mm256_add_epi32 (tmp, ROTATE1_AND(tmp));
        tmp = _mm256_add_epi32 (tmp, ROTATE2_AND(tmp));
        tmp = _mm256_add_epi32 (tmp, ROTATE4_AND(tmp));
        *source = tmp;
    }
    else // if (q>8)
    {
        transfo_avx2_pcordes (C+q/2, q/2);
        transfo_avx2_pcordes (C,     q/2);
        copy_add_block_avx2 (C, q);
    }
}
////////////////////////////////////////////////////////////////////////////////
// Template specialization (same as transfo_avx2_pcordes)
////////////////////////////////////////////////////////////////////////////////
template <int n>
void transfo_template (__m256i* C)
{
    const size_t q = 1ULL << n;

    transfo_template<n-1> (C);
    transfo_template<n-1> (C + q/2);

    __m256i* target = (__m256i*) (C);
    __m256i* source = (__m256i*) (C+q/2);

    for (size_t i=0; i<q/2; i++)
    {
        target[i] = _mm256_add_epi32 (source[i], target[i]);
    }
}

template <>
void transfo_template<0> (__m256i* C)
{
    __m256i* source = (__m256i*) (C);
    __m256i tmp = *source;
    tmp = _mm256_add_epi32 (tmp, ROTATE1_AND(tmp));
    tmp = _mm256_add_epi32 (tmp, ROTATE2_AND(tmp));
    tmp = _mm256_add_epi32 (tmp, ROTATE4_AND(tmp));
    *source = tmp;
}

void transfo_recur_template (CountVector::pointer C, size_t q)
{
    #define CASE(n) case 1ULL<<n: transfo_template<n> ((__m256i*)C); break;

    q = q / 8; // 8 is the number of 32 bits items in the AVX2 registers

    // We have to 'link' the dynamic value of q with a static template specialization
    switch (q)
    {
        CASE( 1); CASE( 2); CASE( 3); CASE( 4); CASE( 5); CASE( 6); CASE( 7); CASE( 8); CASE( 9);
        CASE(10); CASE(11); CASE(12); CASE(13); CASE(14); CASE(15); CASE(16); CASE(17); CASE(18); CASE(19);
        CASE(20); CASE(21); CASE(22); CASE(23); CASE(24); CASE(25); CASE(26); CASE(27); CASE(28); CASE(29);
        default: printf ("transfo_template undefined for q=%ld\n", q); break;
    }
}

////////////////////////////////////////////////////////////////////////////////
// Recursive approach, multithreaded : O(n.2^n)
////////////////////////////////////////////////////////////////////////////////
void transfo_recur_thread (CountVector::pointer C, size_t q)
{
    std::thread t1 (transfo_recur_template, C+q/2, q/2);
    std::thread t2 (transfo_recur_template, C,     q/2);
    t1.join();
    t2.join();
    copy_add_block_avx2 (C, q);
}
////////////////////////////////////////////////////////////////////////////////
void header (const char* title, const FunctionsVector& functions)
{
    printf ("\n");
    for (size_t i=0; i<functions.size(); i++) { printf ("------------------"); } printf ("\n");
    printf ("%s\n", title);
    for (size_t i=0; i<functions.size(); i++) { printf ("------------------"); } printf ("\n");
    printf ("%3s\t", "# n");
    for (auto fct : functions) { printf ("%20s\t", fct.first); }
    printf ("\n");
}

////////////////////////////////////////////////////////////////////////////////
// Check that alternative implementations provide the same result as the naive one
////////////////////////////////////////////////////////////////////////////////
void check (const FunctionsVector& functions, size_t nmin, size_t nmax)
{
    header ("CHECK (a 0 value means identical to the naive approach)", functions);

    for (size_t n=nmin; n<=nmax; n++)
    {
        printf ("%3ld\t", n);

        CountVector reference = transfo_naive (getRandomVector(n));

        for (auto fct : functions)
        {
            // We call the (in place) transformation
            CountVector C = getRandomVector(n);
            (*fct.second) (C.data(), C.size());

            int nbDiffs = 0;
            for (size_t i=0; i<C.size(); i++)
            {
                if (reference[i]!=C[i]) { nbDiffs++; }
            }
            printf ("%20d\t", nbDiffs);
        }
        printf ("\n");
    }
}
////////////////////////////////////////////////////////////////////////////////
// Performance test
////////////////////////////////////////////////////////////////////////////////
void performance (const FunctionsVector& functions, size_t nmin, size_t nmax)
{
    header ("PERFORMANCE", functions);

    for (size_t n=nmin; n<=nmax; n++)
    {
        printf ("%3ld\t", n);

        for (auto fct : functions)
        {
            // We compute the average time over several executions.
            // We use more executions for small n values in order
            // to have more accurate results.
            size_t nbRuns = 1ULL<<(2+nmax-n);
            vector<double> timeValues;

            // We run the test several times
            for (size_t r=0; r<nbRuns; r++)
            {
                // We don't want to measure time for vector fill
                CountVector C = getRandomVector(n);

                double t0 = timestamp();
                (*fct.second) (C.data(), C.size());
                double t1 = timestamp();

                timeValues.push_back (t1-t0);
            }

            // We sort the vector of times in order to get the median value
            std::sort (timeValues.begin(), timeValues.end());
            double median = timeValues[timeValues.size()/2];
            printf ("%20lf\t", log(1000.0*1000.0*median)/log(2));
        }
        printf ("\n");
    }
}
////////////////////////////////////////////////////////////////////////////////
//
////////////////////////////////////////////////////////////////////////////////
int main (int argc, char* argv[])
{
size_t nmin = argc>=2 ? atoi(argv[1]) : 14;
size_t nmax = argc>=3 ? atoi(argv[2]) : 28;
// We get a common random seed
randomSeed = time(NULL);
FunctionsVector functions = {
make_pair ("transfo_recursive", transfo_recursive),
make_pair ("transfo_iterative", transfo_iterative),
make_pair ("transfo_avx2", transfo_avx2),
make_pair ("transfo_avx2_pcordes", transfo_avx2_pcordes),
make_pair ("transfo_recur_template", transfo_recur_template),
make_pair ("transfo_recur_thread", transfo_recur_thread)
};
// We check for some n that alternative implementations
// provide the same result as the naive approach
check (functions, 5, 15);
// We run the performance test
performance (functions, nmin, nmax);
}
And here is the performance graph:
One can observe that the simple recursive implementation is pretty good, even compared to the AVX2 version. The iterative implementation is a little disappointing, but I made no big effort to optimize it.
Finally, for my own use case with 32-bit counters and for n values up to 28, these implementations are clearly good enough for me compared to the initial "naive" O(4^n) approach.
Update
Following some remarks from @PeterCordes and @chtz, I added the following implementations:
transfo_avx2_pcordes: the same as transfo_avx2 with some AVX2 optimizations
transfo_recur_template: the same as transfo_avx2_pcordes but using C++ template specialization to implement the recursion
transfo_recur_thread: using multithreading for the two initial recursive calls of transfo_recur_template
Here is the updated benchmark result:
A few remarks about this result:
the AVX2 implementations are logically the best options, although they do not reach the maximum potential 8x speedup available with 32-bit counters
among the AVX2 implementations, the template specialization brings a small speedup, but it almost fades away for bigger values of n
the simple two-thread version performs badly for n<20; for n>=20 there is always a small speedup, but far from the potential 2x.

Eigen library with C++11 multithreading

I have code that computes a Gaussian Mixture Model with Expectation Maximization in order to identify the clusters in a given input data sample.
A piece of the code repeats the computation of such a model for a number of trials Ntrials (each independent of the others, but using the same input data) in order to finally pick the best solution (the one maximizing the total likelihood of the model). This concept can be generalized to many other clustering algorithms (e.g. k-means).
I want to parallelize the part of the code that has to be repeated Ntrials times through multi-threading with C++11 such that each thread will execute one trial.
A code example, assuming an input Eigen::ArrayXXd sample of (Ndimensions x Npoints), could be of the following type:
double bestTotalModelProbability = 0;
Eigen::ArrayXd clusterIndicesFromSample(Npoints);
clusterIndicesFromSample.setZero();
for (int i=0; i < Ntrials; i++)
{
double totalModelProbability = computeGaussianMixtureModel(sample);
// Check if this trial is better than the previous one.
// If so, update the results (cluster index for each point
// in the sample) and keep them.
if (totalModelProbability > bestTotalModelProbability)
{
bestTotalModelProbability = totalModelProbability;
...
clusterIndicesFromSample = obtainClusterMembership(sample);
}
}
where I pass sample by reference (Eigen::Ref), and not sample itself, to both the functions computeGaussianMixtureModel() and obtainClusterMembership().
My code is heavily based on Eigen arrays, and the N-dimensional problems that I tackle have on the order of 10-100 dimensions and 500-1000 different sample points. I am looking for examples of how to create a multi-threaded version of this code using Eigen arrays and C++11 std::thread, but could not find anything, and I am struggling to write even simple examples that manipulate Eigen arrays from threads.
I am not even sure Eigen can be used within std::thread in C++11.
Can someone help me, even with some simple example, to understand the syntax?
I am using clang++ as the compiler on Mac OS X, on a CPU with 6 cores (12 threads).
The OP's question attracted my attention because number-crunching with speed-ups earned by multi-threading is near the top of my personal to-do list.
I must admit that my experience with the Eigen library is very limited. (I once used its decomposition of 3×3 rotation matrices to Euler angles, which is solved very cleverly, in a general way, in the Eigen library.)
Hence, I defined another sample task consisting of a simple counting of values in a sample data set.
This is done multiple times (using the same evaluation function):
single threaded (to get a value for comparison)
starting each sub-task in an extra thread (in an admittedly rather stupid approach)
starting threads with interleaved access to sample data
starting threads with partitioned access to sample data.
test-multi-threading.cc:
#include <cstdint>
#include <cstdlib>
#include <chrono>
#include <iomanip>
#include <iostream>
#include <limits>
#include <thread>
#include <vector>
// a sample function to process a certain amount of data
template <typename T>
size_t countFrequency(
size_t n, const T data[], const T &begin, const T &end)
{
size_t result = 0;
for (size_t i = 0; i < n; ++i) result += data[i] >= begin && data[i] < end;
return result;
}
typedef std::uint16_t Value;
typedef std::chrono::high_resolution_clock Clock;
typedef std::chrono::microseconds MuSecs;
typedef decltype(std::chrono::duration_cast<MuSecs>(Clock::now() - Clock::now())) Time;
Time duration(const Clock::time_point &t0)
{
return std::chrono::duration_cast<MuSecs>(Clock::now() - t0);
}
std::vector<Time> makeTest()
{
const Value SizeGroup = 4, NGroups = 10000, N = SizeGroup * NGroups;
const size_t NThreads = std::thread::hardware_concurrency();
// make a test sample
std::vector<Value> sample(N);
for (Value &value : sample) value = (Value)rand();
// prepare result vectors
std::vector<size_t> results4[4] = {
std::vector<size_t>(NGroups, 0),
std::vector<size_t>(NGroups, 0),
std::vector<size_t>(NGroups, 0),
std::vector<size_t>(NGroups, 0)
};
// make test
std::vector<Time> times{
[&]() { // single threading
// make a copy of test sample
std::vector<Value> data(sample);
std::vector<size_t> &results = results4[0];
// remember start time
const Clock::time_point t0 = Clock::now();
// do experiment single-threaded
for (size_t i = 0; i < NGroups; ++i) {
results[i] = countFrequency(data.size(), data.data(),
(Value)(i * SizeGroup), (Value)((i + 1) * SizeGroup));
}
// done
return duration(t0);
}(),
[&]() { // multi-threading - stupid approach
// make a copy of test sample
std::vector<Value> data(sample);
std::vector<size_t> &results = results4[1];
// remember start time
const Clock::time_point t0 = Clock::now();
// do experiment multi-threaded
std::vector<std::thread> threads(NThreads);
for (Value i = 0; i < NGroups;) {
size_t nT = 0;
for (; nT < NThreads && i < NGroups; ++nT, ++i) {
threads[nT] = std::move(std::thread(
[i, &results, &data, SizeGroup]() {
size_t result = countFrequency(data.size(), data.data(),
(Value)(i * SizeGroup), (Value)((i + 1) * SizeGroup));
results[i] = result;
}));
}
for (size_t iT = 0; iT < nT; ++iT) threads[iT].join();
}
// done
return duration(t0);
}(),
[&]() { // multi-threading - interleaved
// make a copy of test sample
std::vector<Value> data(sample);
std::vector<size_t> &results = results4[2];
// remember start time
const Clock::time_point t0 = Clock::now();
// do experiment multi-threaded
std::vector<std::thread> threads(NThreads);
for (Value iT = 0; iT < NThreads; ++iT) {
threads[iT] = std::move(std::thread(
[iT, &results, &data, NGroups, SizeGroup, NThreads]() {
for (Value i = iT; i < NGroups; i += NThreads) {
size_t result = countFrequency(data.size(), data.data(),
(Value)(i * SizeGroup), (Value)((i + 1) * SizeGroup));
results[i] = result;
}
}));
}
for (std::thread &threadI : threads) threadI.join();
// done
return duration(t0);
}(),
[&]() { // multi-threading - grouped
std::vector<Value> data(sample);
std::vector<size_t> &results = results4[3];
// remember start time
const Clock::time_point t0 = Clock::now();
// do experiment multi-threaded
std::vector<std::thread> threads(NThreads);
for (Value iT = 0; iT < NThreads; ++iT) {
threads[iT] = std::move(std::thread(
[iT, &results, &data, NGroups, SizeGroup, NThreads]() {
for (Value i = iT * NGroups / NThreads,
iN = (iT + 1) * NGroups / NThreads; i < iN; ++i) {
size_t result = countFrequency(data.size(), data.data(),
(Value)(i * SizeGroup), (Value)((i + 1) * SizeGroup));
results[i] = result;
}
}));
}
for (std::thread &threadI : threads) threadI.join();
// done
return duration(t0);
}()
};
// check results (must be equal for any kind of computation)
const unsigned nResults = sizeof results4 / sizeof *results4;
for (unsigned i = 1; i < nResults; ++i) {
size_t nErrors = 0;
for (Value j = 0; j < NGroups; ++j) {
if (results4[0][j] != results4[i][j]) {
++nErrors;
#ifdef _DEBUG
std::cerr
<< "results4[0][" << j << "]: " << results4[0][j]
<< " != results4[" << i << "][" << j << "]: " << results4[i][j]
<< "!\n";
#endif // _DEBUG
}
}
if (nErrors) std::cerr << nErrors << " errors in results4[" << i << "]!\n";
}
// done
return times;
}
int main()
{
std::cout << "std::thread::hardware_concurrency(): "
<< std::thread::hardware_concurrency() << '\n';
// heat up
std::cout << "Heat up...\n";
for (unsigned i = 0; i < 3; ++i) makeTest();
// repeat NTrials times
const unsigned NTrials = 10;
std::cout << "Measuring " << NTrials << " runs...\n"
<< " "
<< " | " << std::setw(10) << "Single"
<< " | " << std::setw(10) << "Multi 1"
<< " | " << std::setw(10) << "Multi 2"
<< " | " << std::setw(10) << "Multi 3"
<< '\n';
std::vector<double> sumTimes;
for (unsigned i = 0; i < NTrials; ++i) {
std::vector<Time> times = makeTest();
std::cout << std::setw(2) << (i + 1) << ".";
for (const Time &time : times) {
std::cout << " | " << std::setw(10) << time.count();
}
std::cout << '\n';
sumTimes.resize(times.size(), 0.0);
for (size_t j = 0; j < times.size(); ++j) sumTimes[j] += times[j].count();
}
std::cout << "Average Values:\n ";
for (const double &sumTime : sumTimes) {
std::cout << " | "
<< std::setw(10) << std::fixed << std::setprecision(1)
<< sumTime / NTrials;
}
std::cout << '\n';
std::cout << "Ratio:\n ";
for (const double &sumTime : sumTimes) {
std::cout << " | "
<< std::setw(10) << std::fixed << std::setprecision(3)
<< sumTime / sumTimes.front();
}
std::cout << '\n';
// done
return 0;
}
Compiled and tested on cygwin64 on Windows 10:
$ g++ --version
g++ (GCC) 7.3.0
$ g++ -std=c++11 -O2 -o test-multi-threading test-multi-threading.cc
$ ./test-multi-threading
std::thread::hardware_concurrency(): 8
Heat up...
Measuring 10 runs...
| Single | Multi 1 | Multi 2 | Multi 3
1. | 384008 | 1052937 | 130662 | 138411
2. | 386500 | 1103281 | 133030 | 132576
3. | 382968 | 1078988 | 137123 | 137780
4. | 395158 | 1120752 | 138731 | 138650
5. | 385870 | 1105885 | 144825 | 129405
6. | 366724 | 1071788 | 137684 | 130289
7. | 352204 | 1104191 | 133675 | 130505
8. | 331679 | 1072299 | 135476 | 138257
9. | 373416 | 1053881 | 138467 | 137613
10. | 370872 | 1096424 | 136810 | 147960
Average Values:
| 372939.9 | 1086042.6 | 136648.3 | 136144.6
Ratio:
| 1.000 | 2.912 | 0.366 | 0.365
I did the same on coliru.com. (I had to reduce the heat up cycles and the sample size as I exceeded the time limit with the original values.):
g++ (GCC) 8.1.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
std::thread::hardware_concurrency(): 4
Heat up...
Measuring 10 runs...
| Single | Multi 1 | Multi 2 | Multi 3
1. | 224684 | 297729 | 48334 | 39016
2. | 146232 | 337222 | 66308 | 59994
3. | 195750 | 344056 | 61383 | 63172
4. | 198629 | 317719 | 62695 | 50413
5. | 149125 | 356471 | 61447 | 57487
6. | 155355 | 322185 | 50254 | 35214
7. | 140269 | 316224 | 61482 | 53889
8. | 154454 | 334814 | 58382 | 53796
9. | 177426 | 340723 | 62195 | 54352
10. | 151951 | 331772 | 61802 | 46727
Average Values:
| 169387.5 | 329891.5 | 59428.2 | 51406.0
Ratio:
| 1.000 | 1.948 | 0.351 | 0.303
Live Demo on coliru
I am a little surprised that the ratios on coliru (with only 4 threads) are even better than on my PC (with 8 threads). Honestly, I don't know how to explain this.
However, there are a lot of other differences between the two setups which may or may not be responsible. At least, both measurements show a rough speed-up of 3 for the 3rd and 4th approaches, while the 2nd eats up every potential speed-up (probably through the overhead of starting and joining all these threads).
Looking at the sample code, you will notice that there is no mutex or any other explicit locking. This is intentional. As I learned (many, many years ago), every attempt at parallelization may cause a certain extra amount of communication overhead (for concurrent tasks which have to exchange data). If the communication overhead becomes too big, it simply consumes the speed advantage of concurrency. So, the best speed-up is achieved by:
least communication overhead i.e. concurrent tasks operate on independent data
least effort for post-merging the concurrently computed results.
In my sample code, I
prepared all data and storage before starting the threads,
ensured that shared data which is read is never changed while threads are running,
ensured that data is written as if it were thread-local (no two threads write to the same address),
evaluated the computed results only after all threads had been joined.
Concerning point 3, I struggled a bit over whether this is legal or not, i.e. whether it is guaranteed that data written in threads appears correctly in the main thread after joining. (The fact that something seems to work fine is deceptive in general, but especially deceptive concerning multi-threading.)
cppreference.com provides the following explanations
for std::thread::thread()
The completion of the invocation of the constructor synchronizes-with (as defined in std::memory_order) the beginning of the invocation of the copy of f on the new thread of execution.
for std::thread::join()
The completion of the thread identified by *this synchronizes with the corresponding successful return from join().
In Stack Overflow, I found the following related Q/A's:
Does relaxed memory order effect can be extended to after performing-thread's life?
Are memory fences required here?
Is there an implicit memory barrier with synchronized-with relationship on thread::join?
which convinced me, it is OK.
However, the drawback is that
the creation and joining of threads causes additional effort (and it's not that cheap).
An alternative approach could be the use of a thread pool to overcome this. I googled a bit and found e.g. Jakob Progsch's ThreadPool on GitHub. However, I guess that with a thread pool the locking issue is back "in the game".
Whether this will work for Eigen functions as well depends on how the respective Eigen functions are implemented. If they access global variables (which become shared when the same function is called concurrently), this will cause a data race.
Googling a bit, I found the following doc.
Eigen and multi-threading – Using Eigen in a multi-threaded application:
In the case your own application is multithreaded, and multiple threads make calls to Eigen, then you have to initialize Eigen by calling the following routine before creating the threads:
#include <Eigen/Core>
int main(int argc, char** argv)
{
Eigen::initParallel();
...
}
Note
With Eigen 3.3, and a fully C++11 compliant compiler (i.e., thread-safe static local variable initialization), then calling initParallel() is optional.
Warning
note that all functions generating random matrices are not re-entrant nor thread-safe. Those include DenseBase::Random(), and DenseBase::setRandom() despite a call to Eigen::initParallel(). This is because these functions are based on std::rand which is not re-entrant. For thread-safe random generator, we recommend the use of boost::random or c++11 random feature.

Why is this CUDA kernel slow?

I need help getting my CUDA program to run faster. The NVIDIA Visual Profiler shows poor performance, saying "Low Compute Utilization 1.4%":
The code is below. First kernel preparations:
void laskeSyvyydet(int& tiilet0, int& tiilet1, int& tiilet2, int& tiilet3) {
cudaArray *tekstuuriSisaan, *tekstuuriUlos;
//take care of synchronization
cudaEvent_t cEvent;
cudaEventCreate(&cEvent);
//let's take control of OpenGL textures
cudaGraphicsMapResources(1, &cuda.cMaxSyvyys);
cudaEventRecord(cEvent, 0);
cudaGraphicsMapResources(1, &cuda.cDepthTex);
cudaEventRecord(cEvent, 0);
//need to create CUDA pointers
cudaGraphicsSubResourceGetMappedArray(&tekstuuriSisaan, cuda.cDepthTex, 0, 0);
cudaGraphicsSubResourceGetMappedArray(&tekstuuriUlos, cuda.cMaxSyvyys, 0, 0);
cudaProfilerStart();
//launch kernel
cLaskeSyvyydet(tiilet0, tiilet1, tiilet2, tiilet3, tekstuuriSisaan, tekstuuriUlos);
cudaEventRecord(cEvent, 0);
cudaProfilerStop();
//release textures back to OpenGL
cudaGraphicsUnmapResources(1, &cuda.cMaxSyvyys, 0);
cudaEventRecord(cEvent, 0);
cudaGraphicsUnmapResources(1, &cuda.cDepthTex, 0);
cudaEventRecord(cEvent, 0);
//final synchronization
cudaEventSynchronize(cEvent);
cudaEventDestroy(cEvent);
}
Kernel launch:
void cLaskeSyvyydet(int& tiilet0, int& tiilet1, int& tiilet2, int& tiilet3, cudaArray* tekstuuriSisaan, cudaArray* tekstuuriUlos) {
cudaBindTextureToArray(surfRefSisaan, tekstuuriSisaan);
cudaBindSurfaceToArray(surfRefUlos, tekstuuriUlos);
int blocksW = (int)ceilf( tiilet0 / 32.0f );
int blocksH = (int)ceilf( tiilet1 / 32.0f );
dim3 gridDim( blocksW, blocksH, 1 );
dim3 blockDim(32, 32, 1 );
kLaskeSyvyydet<<<gridDim, blockDim>>>(tiilet0, tiilet1, tiilet2, tiilet3);
}
And the kernel:
__global__ void kLaskeSyvyydet(const int tiilet0, const int tiilet1, const int tiilet2, const int tiilet3) {
//first define indexes
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;
if (i >= tiilet0 || j >= tiilet1) return;
//if we are inside boundaries, let's find the greatest depth value
unsigned int takana=0;
unsigned int ddd;
uchar4 syvyys;
uchar4 dd;
//there's possibly four different tile sizes to choose between
if (j!=tiilet1-1 && i!=tiilet0-1) {
for (int y=j*BLOCK_SIZE; y<(j+1)*BLOCK_SIZE; y++) {
for (int x=i*BLOCK_SIZE; x<(i+1)*BLOCK_SIZE; x++) {
dd=tex2D(surfRefSisaan, x, y);
ddd=(dd.x << 24) | (dd.y << 16) | (dd.z << 8) | (dd.w);
takana=max(takana, ddd);
}
}
} else if (j==tiilet1-1 && i!=tiilet0-1) {
for (int y=j*BLOCK_SIZE; y<j*BLOCK_SIZE+tiilet3; y++) {
for (int x=i*BLOCK_SIZE; x<(i+1)*BLOCK_SIZE; x++) {
dd=tex2D(surfRefSisaan, x, y);
ddd=(dd.x << 24) | (dd.y << 16) | (dd.z << 8) | (dd.w);
takana=max(takana, ddd);
}
}
} else if (j!=tiilet1-1 && i==tiilet0-1) {
for (int y=j*BLOCK_SIZE; y<(j+1)*BLOCK_SIZE; y++) {
for (int x=i*BLOCK_SIZE; x<i*BLOCK_SIZE+tiilet2; x++) {
dd=tex2D(surfRefSisaan, x, y);
ddd=(dd.x << 24) | (dd.y << 16) | (dd.z << 8) | (dd.w);
takana=max(takana, ddd);
}
}
} else if (j==tiilet1-1 && i==tiilet0-1) {
for (int y=j*BLOCK_SIZE; y<j*BLOCK_SIZE+tiilet3; y++) {
for (int x=i*BLOCK_SIZE; x<i*BLOCK_SIZE+tiilet2; x++) {
dd=tex2D(surfRefSisaan, x, y);
ddd=(dd.x << 24) | (dd.y << 16) | (dd.z << 8) | (dd.w);
takana=max(takana, ddd);
}
}
}
//if there's empty texture, then we choose the maximum possible value
if (takana==0) {
takana=1000000000;
}
//after slicing the greatest 32bit depth value into four 8bit pieces we write the value into another texture
syvyys.x=(takana & 0xFF000000) >> 24;
syvyys.y=(takana & 0x00FF0000) >> 16;
syvyys.z=(takana & 0x0000FF00) >> 8;
syvyys.w=(takana & 0x000000FF) >> 0;
surf2Dwrite(syvyys, surfRefUlos, i*sizeof(syvyys), j, cudaBoundaryModeZero);
}
Please help me get this working faster, I have no ideas...
It looks like you have a 2D int input array of the size
((tiilet0-1)*BLOCK_SIZE+tiilet2, ((tiilet1-1)*BLOCK_SIZE)+tiilet3)
Each of your thread will sequentially read all elements in an input block of the size
(BLOCK_SIZE, BLOCK_SIZE)
and write the maximum of each input block to a 2D result array of the size
(tiilet0, tiilet1)
Compared to coalesced memory access, this may be the worst possible way to access global memory, even with a 2D texture. You may want to read about coalesced memory access:
https://devblogs.nvidia.com/parallelforall/how-access-global-memory-efficiently-cuda-c-kernels/
Generally, you put too much work into one thread. Given the way you map the CUDA thread blocks to your input array, I guess that unless you have a VERY large input, your gridDim will be too small to fully utilize the GPU.
For better performance you may want to change from one CUDA thread per input block to one CUDA thread block per input block (int[BLOCK_SIZE][BLOCK_SIZE]), and use parallel reduction to find the block-wise maximum.

Reading then adding large number of integers from binary file fast in C/C++

I was writing code to read unsigned integers from a binary file using C/C++ on a 32 bit Linux OS intended to run on an 8-core x86 system. The application takes an input file which contains unsigned integers in little-endian format one after another. So the input file size in bytes is a multiple of 4. The file could have a billion integers in it. What is the fastest way to read and add all the integers and return the sum with 64 bit precision?
Below is my implementation. Error checking for corrupt data is not the major concern here, and the input file is assumed to be without any issues in this case.
#include <iostream>
#include <fstream>
#include <pthread.h>
#include <string>
#include <string.h>
using namespace std;
string filepath;
unsigned int READBLOCKSIZE = 1024*1024;
unsigned long long nFileLength = 0;
unsigned long long accumulator = 0; // assuming 32 bit OS running on X86-64
unsigned int seekIndex[8] = {};
unsigned int threadBlockSize = 0;
unsigned long long acc[8] = {};
pthread_t thread[8];
void* threadFunc(void* pThreadNum);
//time_t seconds1;
//time_t seconds2;
int main(int argc, char *argv[])
{
if (argc < 2)
{
cout << "Please enter a file path\n";
return -1;
}
//seconds1 = time (NULL);
//cout << "Start Time in seconds since January 1, 1970 -> " << seconds1 << "\n";
string path(argv[1]);
filepath = path;
ifstream ifsReadFile(filepath.c_str(), ifstream::binary); // Create FileStream for the file to be read
if(0 == ifsReadFile.is_open())
{
cout << "Could not find/open input file\n";
return -1;
}
ifsReadFile.seekg (0, ios::end);
nFileLength = ifsReadFile.tellg(); // get file size
ifsReadFile.seekg (0, ios::beg);
if(nFileLength < 16*READBLOCKSIZE)
{
//cout << "Using One Thread\n"; //**
char* readBuf = new char[READBLOCKSIZE];
if(0 == readBuf) return -1;
unsigned int startOffset = 0;
if(nFileLength > READBLOCKSIZE)
{
while(startOffset + READBLOCKSIZE < nFileLength)
{
//ifsReadFile.flush();
ifsReadFile.read(readBuf, READBLOCKSIZE); // At this point ifsReadFile is open
unsigned int* num = reinterpret_cast<unsigned int*>(readBuf); // unsigned: avoid sign-extension
for(unsigned int i = 0 ; i < (READBLOCKSIZE/4) ; i++)
{
accumulator += *(num + i);
}
startOffset += READBLOCKSIZE;
}
}
if(nFileLength - (startOffset) > 0)
{
ifsReadFile.read(readBuf, nFileLength - (startOffset));
unsigned int* num = reinterpret_cast<unsigned int*>(readBuf); // unsigned: avoid sign-extension
for(unsigned int i = 0 ; i < ((nFileLength - startOffset)/4) ; ++i)
{
accumulator += *(num + i);
}
}
delete[] readBuf; readBuf = 0;
}
else
{
//cout << "Using 8 Threads\n"; //**
unsigned int currthreadnum[8] = {0,1,2,3,4,5,6,7};
if(nFileLength > 200000000) READBLOCKSIZE *= 16; // read larger blocks
//cout << "Read Block Size -> " << READBLOCKSIZE << "\n";
if(nFileLength % 28)
{
threadBlockSize = (nFileLength / 28);
threadBlockSize *= 4;
}
else
{
threadBlockSize = (nFileLength / 7);
}
for(int i = 0; i < 8 ; ++i)
{
seekIndex[i] = i*threadBlockSize;
//cout << seekIndex[i] << "\n";
}
pthread_create(&thread[0], NULL, threadFunc, (void*)(currthreadnum + 0));
pthread_create(&thread[1], NULL, threadFunc, (void*)(currthreadnum + 1));
pthread_create(&thread[2], NULL, threadFunc, (void*)(currthreadnum + 2));
pthread_create(&thread[3], NULL, threadFunc, (void*)(currthreadnum + 3));
pthread_create(&thread[4], NULL, threadFunc, (void*)(currthreadnum + 4));
pthread_create(&thread[5], NULL, threadFunc, (void*)(currthreadnum + 5));
pthread_create(&thread[6], NULL, threadFunc, (void*)(currthreadnum + 6));
pthread_create(&thread[7], NULL, threadFunc, (void*)(currthreadnum + 7));
pthread_join(thread[0], NULL);
pthread_join(thread[1], NULL);
pthread_join(thread[2], NULL);
pthread_join(thread[3], NULL);
pthread_join(thread[4], NULL);
pthread_join(thread[5], NULL);
pthread_join(thread[6], NULL);
pthread_join(thread[7], NULL);
for(int i = 0; i < 8; ++i)
{
accumulator += acc[i];
}
}
//seconds2 = time (NULL);
//cout << "End Time in seconds since January 1, 1970 -> " << seconds2 << "\n";
//cout << "Total time to add " << nFileLength/4 << " integers -> " << seconds2 - seconds1 << " seconds\n";
cout << accumulator << "\n";
return 0;
}
void* threadFunc(void* pThreadNum)
{
unsigned int threadNum = *reinterpret_cast<int*>(pThreadNum);
char* localReadBuf = new char[READBLOCKSIZE];
unsigned int startOffset = seekIndex[threadNum];
ifstream ifs(filepath.c_str(), ifstream::binary); // Create FileStream for the file to be read
if(0 == ifs.is_open())
{
cout << "Could not find/open input file\n";
return 0;
}
ifs.seekg (startOffset, ios::beg); // Seek to the correct offset for this thread
acc[threadNum] = 0;
unsigned int endOffset = startOffset + threadBlockSize;
if(endOffset > nFileLength) endOffset = nFileLength; // for last thread
//cout << threadNum << "-" << startOffset << "-" << endOffset << "\n";
if((endOffset - startOffset) > READBLOCKSIZE)
{
while(startOffset + READBLOCKSIZE < endOffset)
{
ifs.read(localReadBuf, READBLOCKSIZE); // At this point ifs is open
unsigned int* num = reinterpret_cast<unsigned int*>(localReadBuf); // unsigned: avoid sign-extension
for(unsigned int i = 0 ; i < (READBLOCKSIZE/4) ; i++)
{
acc[threadNum] += *(num + i);
}
startOffset += READBLOCKSIZE;
}
}
if(endOffset - startOffset > 0)
{
ifs.read(localReadBuf, endOffset - startOffset);
unsigned int* num = reinterpret_cast<unsigned int*>(localReadBuf); // unsigned: avoid sign-extension
for(unsigned int i = 0 ; i < ((endOffset - startOffset)/4) ; ++i)
{
acc[threadNum] += *(num + i);
}
}
//cout << "Thread " << threadNum + 1 << " subsum = " << acc[threadNum] << "\n"; //**
delete[] localReadBuf; localReadBuf = 0;
return 0;
}
I wrote a small C# program to generate the input binary file for testing.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace BinaryNumWriter
{
class Program
{
static UInt64 total = 0;
static void Main(string[] args)
{
BinaryWriter bw = new BinaryWriter(File.Open("test.txt", FileMode.Create));
Random rn = new Random();
for (UInt32 i = 1; i <= 500000000; ++i)
{
UInt32 num = (UInt32)rn.Next(0, 0xffff);
bw.Write(num);
total += num;
}
bw.Flush();
bw.Close();
}
}
}
Running the program on a Core i5 machine @ 3.33 GHz (it's quad-core, but it's what I have at the moment) with 2 GB RAM and Ubuntu 9.10 32-bit gave the following performance numbers:
100 integers ~ 0 seconds (I would really have to suck otherwise)
100000 integers < 0 seconds
100000000 integers ~ 7 seconds
500000000 integers ~ 29 seconds (1.86 GB input file)
I am not sure if the HDD is 5400 RPM or 7200 RPM. I tried different buffer sizes for reading and found that reading 16 MB at a time for big input files was roughly the sweet spot.
Are there any better ways to read from the file faster to increase overall performance? Is there a smarter way to add large arrays of integers faster, e.g. by folding repeatedly? Are there any major roadblocks to performance in the way I have written the code, or am I doing something obviously wrong that is costing a lot of time?
What can I do to make this process of reading and adding data faster?
Thanks.
Chinmay
Accessing a mechanical HDD from multiple threads the way you do is going to cause some head movement (read: slow it down). You're almost surely I/O bound (65 MB/s for the 1.86 GB file).
Try to change your strategy by:
starting the 8 threads - we'll call them CONSUMERS
the 8 threads will wait for data to be made available
in the main thread start to read chunks (say 256KB) of the file thus being a PROVIDER for the CONSUMERS
main thread hits the EOF and signals the workers that there is no more avail data
main thread waits for the 8 workers to join.
You'll need quite a bit of synchronization to get it working flawlessly, but I think it would totally max out your HDD / filesystem I/O capabilities by doing sequential file access. YMMV on smallish files which can be cached and served from the cache at lightning speed.
Another thing you can try is to start only 7 threads, leaving one CPU free for the main thread and the rest of the system.
.. or get an SSD :)
Edit:
For simplicity, see how fast you can simply read the file (discarding the buffers) with no processing, single-threaded. That, plus epsilon, is your theoretical limit for how fast you can get this done.
If you want to read (or write) a lot of data fast, and you don't want to do much processing with that data, you need to avoid extra copies of the data between buffers. That means you want to avoid fstream or FILE abstractions (as they introduce an extra buffer that needs to be copied through), and avoid read/write type calls that copy stuff between kernel and user buffers.
Instead, on linux, you want to use mmap(2). On a 64-bit OS, just mmap the entire file into memory, use madvise(MADV_SEQUENTIAL) to tell the kernel you're going to be accessing it mostly sequentially, and have at it. For a 32-bit OS, you'll need to mmap in chunks, unmapping the previous chunk each time. Something much like your current structure, with each thread mmapping one fixed-size chunk at a time should work well.