How to make a template cycle? - c++

I very often write nested loops that are almost identical, and I think a single template could simplify a lot of code.
// the block sizes can differ, but they are known at compile time
const int block_1 = 10,
          block_2 = 4,
          block_3 = 6,
          block_4 = 3;
Basically, all the loops look like this:
for (int i = 1; i < block_1 - 1; ++i) {
}
or this
for (int i = 1; i < block_1 - 1; ++i) {
    for (int k = 1; k < block_2 - 1; ++k) {
    }
}
or this
for (int i = 1; i < block_1 - 1; ++i) {
    for (int k = 1; k < block_2 - 1; ++k) {
        for (int j = 1; j < block_3 - 1; ++j) {
        }
    }
}
The nesting can get deep, but the loops all have the same shape.
I think using a template instead of writing the loops out every time would be more convenient, but maybe it is a bad idea and you will talk me out of it.
Ideally, I would like a template like this:
for_funk(block_1, block_2, block_3) {
    // here I write the code that goes inside the three loops over block_1, block_2, block_3
}
Maybe this will help https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2374r4.html

Yes, you can compose iota_view and cartesian_product_view to get nested indexes in C++23:
#include <ranges>
constexpr inline auto for_funk = [](auto... index) {
    return std::views::cartesian_product(std::views::iota(1, index - 1)...);
};
const int block_1 = 10,
          block_2 = 4,
          block_3 = 6,
          block_4 = 3;
for (auto [i, j, k, w] : for_funk(block_1, block_2, block_3, block_4)) {
    // use i, j, k, w
}
Demo with range-v3
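If std::views::cartesian_product is not available, a similar effect can be sketched in C++17 with a recursive function template that takes the loop body as a callable. This is only a sketch: the names for_each_index and count_iterations and the std::array parameter are hypothetical and not part of the answer above. The bounds follow the question (1 <= i < block - 1).

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a standard facility): calls body(i0, i1, ...) for
// every index tuple with 1 <= i_d < bounds[d] - 1, outermost index first.
template <std::size_t Depth = 0, std::size_t N, typename Body, typename... Idx>
void for_each_index(const std::array<int, N>& bounds, Body&& body, Idx... idx) {
    if constexpr (Depth == N) {
        body(idx...);  // innermost level reached: run the loop body
    } else {
        for (int i = 1; i < bounds[Depth] - 1; ++i)
            for_each_index<Depth + 1>(bounds, body, idx..., i);
    }
}

// Example use: counts how often the body runs for three blocks.
inline int count_iterations(int b1, int b2, int b3) {
    int n = 0;
    for_each_index(std::array<int, 3>{b1, b2, b3}, [&](int, int, int) { ++n; });
    return n;
}
```

The nesting depth is fixed by the size of the array, so the compiler unrolls the recursion into ordinary nested loops. The body receives one int per level.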

Related

Calculating Big-O notation that has three nested loops

I want to find the Big-O complexity of my code. It has three nested loops, and each loop has a parameter that may vary.
According to my understanding (I am not sure it is correct), the time complexity is O(NKC), where N is the size used in the outer loop, K is a constant entered by the user, and C is also a constant that may change when using another dataset.
My code:
for (int m = 0; m < size; m++)
{
    int array_Y_class_target[2]{};
    float CT[2]{};
    float SumOf_Each_class_distances[2] = { 0.0 };
    int min_index = -1;
    for (int i = k; i > 0; --i) {
        for (int c = 0; c < 2; ++c) {
            for (int j = 0; j < i; ++j) {
                int index = index_arr[j];
                if (Y_train[index] == c)
                {
                    array_Y_class_target[c] ++;
                    float dist = array_dist[index_arr[j]];
                    SumOf_Each_class_distances[c] += dist;
                }
            }
            if (array_Y_class_target[c] != 0)
            {
                CT[c] = (((float)k / (float)array_Y_class_target[c]) + (SumOf_Each_class_distances[c] / (float)array_Y_class_target[c]));
            }
            else
            {
                CT[c] = 1.5; // max CT value
            }
        }
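One way to check a complexity guess like the one above is to count how often the innermost statement runs. The sketch below keeps only the loop structure of the posted code. The function name count_inner_iterations and the parameters n, k and classes are made up for illustration. Because the j loop runs i times while i goes from k down to 1, the count comes out as n * classes * k * (k + 1) / 2. That suggests the cost grows quadratically in k rather than linearly, assuming the part of the code not shown adds no further loops.

```cpp
#include <cassert>
#include <cstdint>

// Counts executions of the innermost loop body for the loop nest in the
// question (hypothetical helper; only the loop structure is reproduced).
std::int64_t count_inner_iterations(int n, int k, int classes) {
    assert(n >= 0 && k >= 0 && classes >= 0);
    std::int64_t count = 0;
    for (int m = 0; m < n; ++m)
        for (int i = k; i > 0; --i)
            for (int c = 0; c < classes; ++c)
                for (int j = 0; j < i; ++j)
                    ++count;
    return count;
}
```

Doubling k while keeping n and classes fixed roughly quadruples the result, which is the signature of a quadratic term.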

Is this the fastest single digit sort algorithm possible?

I'm trying to learn C++ and wrote this algorithm, and I am wondering if there is a faster way to do the same thing. This assumes that the input is valid. I was trying to think of how to remove the nested for loop, but decided that it is fine since it is not exponential. Is this correct? Thanks
void DigitSort(int* arr, int size)
{
    int counts[10] = { 0,0,0,0,0,0,0,0,0,0 };
    int k = -1;
    while (++k < size)
        counts[arr[k]]++;
    k = -1;
    for (int j = 0; j < 10; ++j)
        for (int i = 0; i < counts[j]; ++i)
            arr[++k] = j;
}
There is no benchmark, but here is a (probably) faster solution, using std::fill_n.
#include <algorithm> // std::fill_n
void DigitSort(int* arr, int size)
{
    int counts[10] = { 0,0,0,0,0,0,0,0,0,0 };
    int k = -1, sum_count = 0;
    while (++k < size)
        counts[arr[k]]++;
    for (k = 0; k < 10; ++k) {
        // write counts[k] copies of digit k, starting where the previous digit ended
        std::fill_n(arr + sum_count, counts[k], k);
        sum_count += counts[k];
    }
}
I say "probably" because the compiler can optimize std::fill_n into memset-like code.

What is the reason of bad parallel performance?

I'm trying to implement a parallel algorithm that computes the Levenshtein distance between every pair of sequences in a list and stores the results in a matrix (a 2D vector). In other words, I'm given a 2D vector of numbers (thousands of number sequences of up to 30 numbers each) and I need to compute the Levenshtein distance between each pair of integer vectors. I implemented a serial algorithm that works, but when I converted it to a parallel one, it became much slower (the more threads, the slower it is). The parallel version is implemented with C++11 threads (I also tried OpenMP, with the same results).
Here is the function that distributes work:
vector<vector<int>> getGraphParallel(vector<vector<int>>& records){
    int V = records.size();
    auto threadCount = std::thread::hardware_concurrency();
    if(threadCount == 0){
        threadCount = 1;
    }
    vector<future<vector<vector<int>>>> futures;
    int rowCount = V / threadCount;
    vector<vector<int>>::const_iterator first = records.begin();
    vector<vector<int>>::const_iterator last = records.begin() + V;
    for(int i = 0; i < threadCount; i++){
        int start = i * rowCount;
        if(i == threadCount - 1){
            rowCount += V % threadCount;
        }
        futures.push_back(std::async(getRows, std::ref(records), start, rowCount, V));
    }
    vector<vector<int>> graph;
    for(int i = 0; i < futures.size(); i++){
        auto result = futures[i].get();
        for(const auto &row : result){
            graph.push_back(row);
        }
    }
    for(int i = 0; i < V; i++)
    {
        for(int j = i + 1; j < V; j++){
            graph[j][i] = graph[i][j];
        }
    }
    return graph;
}
Here is the function that computes the rows of the final matrix:
vector<vector<int>> getRows(vector<vector<int>>& records, int from, int count, int size){
    vector<vector<int>> result(count, vector<int>(size, 0));
    for(int i = 0; i < count; i++){
        for(int j = i + from + 1; j < size; j++){
            result[i][j] = levenshteinDistance(records[i + from], records[j]);
        }
    }
    return result;
}
And finally, the function that computes the Levenshtein distance:
int levenshteinDistance(const vector<int>& first, const vector<int>& second){
    const int sizeFirst = first.size();
    const int sizeSecond = second.size();
    if(sizeFirst == 0) return sizeSecond;
    if(sizeSecond == 0) return sizeFirst;
    vector<vector<int>> distances(sizeFirst + 1, vector<int>(sizeSecond + 1, 0));
    for(int i = 0; i <= sizeFirst; i++){
        distances[i][0] = i;
    }
    for(int j = 0; j <= sizeSecond; j++){
        distances[0][j] = j;
    }
    for (int j = 1; j <= sizeSecond; j++)
        for (int i = 1; i <= sizeFirst; i++)
            if (first[i - 1] == second[j - 1])
                distances[i][j] = distances[i - 1][j - 1];
            else
                distances[i][j] = min(min(
                    distances[i - 1][j] + 1,
                    distances[i][j - 1] + 1),
                    distances[i - 1][j - 1] + 1
                );
    return distances[sizeFirst][sizeSecond];
}
One thing that came to my mind is that this slowdown is caused by false sharing, but I could not check it with perf, because I'm working with Ubuntu in Oracle VirtualBox and cache-miss counters are not available there. If I'm right and the slowdown is caused by false sharing, what should I do to fix it? If not, what is the reason for this slowdown?
One problem I can see is that you are using std::async without specifying a launch policy. A task can run either asynchronously or deferred. A deferred task is evaluated lazily on the thread that calls get(), so everything would run in one thread. With the default policy the choice is left to the implementation. In your case, it would make sense that it runs slower with more deferred evaluations. You can try std::async(std::launch::async, ...).
Make sure your VM is also set up to use more than one core. Ideally, when doing optimizations like this, try to eliminate as many variables as possible. If you can, run the program locally without a VM. Profiling tools are your best bet and will show you exactly where the time is being spent.
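To illustrate the launch policy, here is a minimal sketch that splits a workload into chunks and forces each chunk onto its own thread with std::launch::async. The functions partial_sum and parallel_sum are hypothetical stand-ins for the row computation in the question, not a rewrite of it.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Sums v[begin, end); stands in for a real per-chunk workload.
long long partial_sum(const std::vector<int>& v, std::size_t begin, std::size_t end) {
    return std::accumulate(v.begin() + begin, v.begin() + end, 0LL);
}

// Splits the work into `tasks` chunks; std::launch::async requests a new
// thread per task instead of allowing deferred (lazy) evaluation.
long long parallel_sum(const std::vector<int>& v, std::size_t tasks) {
    assert(tasks > 0);
    std::vector<std::future<long long>> futures;
    const std::size_t chunk = v.size() / tasks;
    for (std::size_t t = 0; t < tasks; ++t) {
        const std::size_t begin = t * chunk;
        // the last task also takes the remainder
        const std::size_t end = (t + 1 == tasks) ? v.size() : begin + chunk;
        futures.push_back(std::async(std::launch::async, partial_sum,
                                     std::cref(v), begin, end));
    }
    long long total = 0;
    for (auto& f : futures) total += f.get();  // get() blocks until the task is done
    return total;
}
```

On some toolchains the program has to be built with thread support enabled (for example -pthread with GCC), otherwise starting the tasks can fail at run time.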

Parallelize four and more nested loops with CUDA

I am working on a compiler generating parallel C++ code. I am new to CUDA programming but I am trying to parallelize the C++ code with CUDA.
Currently if I have the following sequential C++ code:
for(int i = 0; i < a; i++) {
    for(int j = 0; j < b; j++) {
        for(int k = 0; k < c; k++) {
            A[i*y*z + j*z + k] = 1;
        }
    }
}
and this results in the following CUDA code:
__global__ void kernelExample() {
    int _cu_x = ((blockIdx.x*blockDim.x)+threadIdx.x);
    int _cu_y = ((blockIdx.y*blockDim.y)+threadIdx.y);
    int _cu_z = ((blockIdx.z*blockDim.z)+threadIdx.z);
    A[_cu_x*y*z + _cu_y*z + _cu_z] = 1;
}
so each loop nest is mapped to one dimension, but what would be the correct way to parallelize four and more nested loops:
for(int i = 0; i < a; i++) {
    for(int j = 0; j < b; j++) {
        for(int k = 0; k < c; k++) {
            for(int l = 0; l < d; l++) {
                A[i*x*y*z + j*y*z + k*z + l] = 1;
            }
        }
    }
}
Is there any similar way? Noteworthy: all loop dimensions are parallel and there are no dependencies between iterations.
Thanks in advance!
EDIT: the goal is to map all iterations to CUDA threads, since all iterations are independent and could be executed concurrently.
You could keep the outer loop unchanged and map the three inner loops to the grid. It is also better to use .x for the innermost loop, so that neighbouring threads access neighbouring global memory addresses (coalesced access).
__global__ void kernelExample() {
    int _cu_x = ((blockIdx.x*blockDim.x)+threadIdx.x);
    int _cu_y = ((blockIdx.y*blockDim.y)+threadIdx.y);
    int _cu_z = ((blockIdx.z*blockDim.z)+threadIdx.z);
    for(int i = 0; i < a; i++) {
        A[i*x*y*z + _cu_z*y*z + _cu_y*z + _cu_x] = 1;
    }
}
However if your a,b,c,d are all very small, you may not be able to get enough parallelism. In that case you could convert a linear index to n-D indices.
__global__ void kernelExample() {
    int tid = ((blockIdx.x*blockDim.x)+threadIdx.x);
    if (tid >= a*b*c*d) return; // skip surplus threads in the last block
    int i = tid / (b*c*d);
    int j = tid / (c*d) % b;
    int k = tid / d % c;
    int l = tid % d;
    A[i*x*y*z + j*y*z + k*z + l] = 1;
}
But be careful: calculating i, j, k, l may introduce a lot of overhead, as integer division and modulo are slow on GPUs. As an alternative, you could map i and j to .z and .y, and calculate only k, l and any further dimensions from .x in a similar way.
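The linear-to-n-D index arithmetic can be checked on the host in plain C++ before it goes into a kernel. A sketch, assuming extents b, c, d as in the kernel above; the names Index4, unflatten and flatten are hypothetical.

```cpp
#include <cassert>

struct Index4 { int i, j, k, l; };  // hypothetical holder for the four loop indices

// Recovers (i, j, k, l) from a linear id with the same arithmetic as the
// kernel above: l varies fastest, i slowest. The extent of i is not needed.
Index4 unflatten(int tid, int b, int c, int d) {
    assert(tid >= 0 && b > 0 && c > 0 && d > 0);
    return { tid / (b * c * d), tid / (c * d) % b, tid / d % c, tid % d };
}

// Inverse mapping: position of (i, j, k, l) in the order the nested loops visit it.
int flatten(const Index4& x, int b, int c, int d) {
    return ((x.i * b + x.j) * c + x.k) * d + x.l;
}
```

Running the four sequential loops next to a counter and comparing each tuple with unflatten(counter, b, c, d) confirms that every iteration gets exactly one linear id.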

C++ variable number of nested loops

I want to make a function that, depending on the depth of nested loop, does this:
if depth = 1:
for(i = 0; i < max; i++){
    pot[a++] = wyb[i];
}
if depth = 2:
for(i = 0; i < max; i++){
    for( j = i+1; j < max; j++){
        pot[a++] = wyb[i] + wyb[j];
    }
}
if depth = 3:
for(i = 0; i < max; i++){
    for( j = i+1; j < max; j++){
        for( k = j+1; k < max; k++){
            pot[a++] = wyb[i] + wyb[j] + wyb[k];
        }
    }
}
and so on.
So the result would be:
depth = 1
pot[0] = wyb[0]
pot[1] = wyb[1]
...
pot[max-1] = wyb[max-1]
depth = 2, max = 4
pot[0] = wyb[0] + wyb[1]
pot[1] = wyb[0] + wyb[2]
pot[2] = wyb[0] + wyb[3]
pot[3] = wyb[1] + wyb[2]
pot[4] = wyb[1] + wyb[3]
pot[5] = wyb[2] + wyb[3]
I think you get the idea. I can't think of a way to do this neatly.
Could someone present an easy way of using recursion (or maybe not?) to achieve this, keeping in mind that I'm still a beginner in c++, to point me in the right direction?
Thank you for your time.
You may use std::prev_permutation to enumerate the combinations:
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>
std::vector<int> compute(const std::vector<int>& v, std::size_t depth)
{
    if (depth == 0 || v.size() < depth) {
        throw std::out_of_range("depth is out of range");
    }
    std::vector<int> res;
    std::vector<int> coeffs(depth, 1);
    coeffs.resize(v.size(), 0); // coeffs is now {1, .., 1, 0, .., 0}
    do {
        int sum = 0;
        for (std::size_t i = 0; i != v.size(); ++i) {
            sum += v[i] * coeffs[i];
        }
        res.push_back(sum);
    // stepping the mask backwards visits the combinations in nested-loop order
    } while (std::prev_permutation(coeffs.begin(), coeffs.end()));
    return res;
}
Live example
Simplified recursive version:
int *sums_recursive(int *pot, int *wyb, int max, int depth) {
    if (depth == 1) {
        while (max--)
            *pot++ = *wyb++;
        return pot;
    }
    for (int i = 1; i <= max - depth + 1; ++i) {
        // sums of depth - 1 elements taken from the values after wyb[i - 1] ...
        int *pot2 = sums_recursive(pot, wyb + i, max - i, depth - 1);
        // ... each extended by wyb[i - 1]
        for (int *p = pot; p < pot2; ++p) *p += wyb[i - 1];
        pot = pot2;
    }
    return pot;
}
Iterative version:
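void sums(int *pot, int *wyb, int max, int depth) {
    int maxi = 1;
    int o = 0;
    for (int d = 0; d < depth; ++d) { maxi *= max; } // max^depth, may overflow for large inputs
    for (int i = 0; i < maxi; ++i) {
        // read i as depth digits in base max, lowest digit first
        int i_div = i;
        int idx = max;
        int sum = 0;
        int d;
        for (d = 0; d < depth; ++d) {
            int new_idx = i_div % max;
            if (new_idx >= idx) break; // digits must strictly decrease towards the top digit
            sum += wyb[new_idx];
            idx = new_idx;
            i_div /= max;
        }
        if (d == depth) pot[o++] = sum; // store only complete combinations, in nested-loop order
    }
}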
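A more direct way to mirror the nested loops from the question is a recursive helper in which each call plays the role of one loop level. A minimal sketch, assuming std::vector is used for input and output; the name sum_combinations and its parameters are hypothetical and not taken from the answers above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: one call corresponds to one loop level. `start` is the
// first index this level may use, `sum` is the total picked by the outer levels.
void sum_combinations(const std::vector<int>& wyb, std::size_t start, int depth,
                      int sum, std::vector<int>& pot) {
    assert(depth >= 0);
    if (depth == 0) {  // all levels have picked an element: emit one result
        pot.push_back(sum);
        return;
    }
    for (std::size_t i = start; i < wyb.size(); ++i)
        sum_combinations(wyb, i + 1, depth - 1, sum + wyb[i], pot);
}
```

Calling sum_combinations(wyb, 0, 2, 0, pot) with four elements appends the six pairwise sums in the order listed in the question.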