Background:
I have two similar blocks of code that I would like to merge into a single function. One block handles the x axis and the other handles the y axis. I have run into this issue several times before and shrugged it off, since I assumed there was no clean way to merge them.
Problem:
How do I make a function that can replace both snippets of code below in the least number of lines?
Code:
//rows
vector<float> rowSpectrum;
float tempVal;
for (int i = 0; i < ROI.size().height; i++) {
tempVal = cv::mean(cleanImg.row(i)).val[0];
rowSpectrum.push_back(tempVal);
}
//columns
vector<float> colSpectrum;
for (int i = 0; i < ROI.size().width; i++) {
tempVal = cv::mean(cleanImg.col(i)).val[0];
colSpectrum.push_back(tempVal);
}
auto calcSpectrum = [&](int size, cv::Mat (cv::Mat::*memFn)(int) const) {
vector<float> spectrum;
for (int i = 0; i < size; i++) {
auto tempVal = cv::mean((cleanImg.*memFn)(i)).val[0];
spectrum.push_back(tempVal);
}
return spectrum;
};
auto rowSpectrum = calcSpectrum(ROI.size().height, &cv::Mat::row);
auto colSpectrum = calcSpectrum(ROI.size().width, &cv::Mat::col);
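For readers without OpenCV at hand, the same pointer-to-member idea can be sketched on a small stand-in type. Grid, its row() and col() accessors, and the free function calcSpectrum below are hypothetical names made up for this illustration; they are not part of OpenCV, and the mean is computed by hand rather than with cv::mean.

```cpp
#include <numeric>
#include <vector>

// Hypothetical stand-in for cv::Mat: a row-major grid of floats.
struct Grid {
    int rows;
    int cols;
    std::vector<float> data;

    // Copy of row r.
    std::vector<float> row(int r) const {
        return std::vector<float>(data.begin() + r * cols,
                                  data.begin() + (r + 1) * cols);
    }
    // Copy of column c.
    std::vector<float> col(int c) const {
        std::vector<float> out;
        for (int r = 0; r < rows; ++r) out.push_back(data[r * cols + c]);
        return out;
    }
};

// One function serves both axes: the caller picks the accessor to use.
std::vector<float> calcSpectrum(const Grid& g, int size,
                                std::vector<float> (Grid::*slice)(int) const) {
    std::vector<float> spectrum;
    spectrum.reserve(size);
    for (int i = 0; i < size; ++i) {
        const std::vector<float> line = (g.*slice)(i);
        const float sum = std::accumulate(line.begin(), line.end(), 0.0f);
        spectrum.push_back(sum / static_cast<float>(line.size()));
    }
    return spectrum;
}
```

Calling calcSpectrum(g, g.rows, &Grid::row) yields the per-row means and calcSpectrum(g, g.cols, &Grid::col) the per-column means, mirroring the two calls above.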
I wrote some code to calculate the moving L2 norm of two arrays.
void func_lstl2(const int &nx, const float x[],const int &ny, const float y[], int &shift, double &lstl2)
{
int maxshift = 200;
int len_z = maxshift * 2;
int len_work = len_z + ny;
//initialize array work and array z
double *z = new double[len_z]; float *work = new float[len_work];
for (int i = 0; i < len_z; i++)
z[i] = 0;
for (int i = 0; i < len_work; i++)
work[i] = 0;
for (int i = 0; i < ny; i++)
work[i + maxshift] = y[i];
// do moving least square residue calculation
float temp;
for (int i = 0; i < len_z; i++)
{
for (int j = 0; j < nx; j++)
{
temp = x[j] - work[i + j];
z[i] += temp * temp;
}
}
// find the best fit value
lstl2 = 1E30;
shift = 0;
for (int i = 0; i < len_z; i++)
{
if (z[i] < lstl2)
{
lstl2 = z[i];
shift = i - maxshift;
}
}
//end of program
delete[] z;
delete[] work;
}
I tested two pairs of arrays with exactly the same length and the same scale.
int shift; double lstl2;
func_lstl2(2000,z1,2000,z2,shift,lstl2) ;
func_lstl2(2000,x1,2000,x2,shift,lstl2) ;
For the z arrays it took 0.0032346 seconds; for the x arrays it took 0.0140903 seconds. I cannot figure out why there is a difference of nearly 5 times in run time. Could you help me figure it out? Thank you very much!
Here is the link for z array and x array.
https://drive.google.com/file/d/1aONKTjE_7NI1bp8YkDL2CMfg9C5h67Fe/view?usp=sharing
I strongly suspect you're seeing the effects of denormalized (subnormal) floating-point arithmetic. Using your existing function, loading the values into vectors, and running it seven times on each provided input (compiled with -O3 optimization):
for (int i = 0; i < 7; ++i)
{
int shift = 0;
double lstl2 = 0;
auto tp0 = steady_clock::now();
func_lstl2(2000, v1.data(), 2000, v2.data(), shift, lstl2);
auto tp1 = steady_clock::now();
std::cout << pr[0] << ',' << pr[1] << ':';
std::cout << duration_cast<milliseconds>(tp1 - tp0).count() << "ms\n";
}
I receive the following output, confirming your conundrum:
x1.txt,x2.txt:23ms
x1.txt,x2.txt:19ms
x1.txt,x2.txt:21ms
x1.txt,x2.txt:21ms
x1.txt,x2.txt:19ms
x1.txt,x2.txt:22ms
x1.txt,x2.txt:21ms
z1.txt,z2.txt:8ms
z1.txt,z2.txt:9ms
z1.txt,z2.txt:5ms
z1.txt,z2.txt:5ms
z1.txt,z2.txt:6ms
z1.txt,z2.txt:5ms
z1.txt,z2.txt:5ms
However, enabling denormals-are-zero (DAZ) and flush-to-zero (FTZ) for floating-point calculations (the mechanism for doing so is toolchain-dependent; below is clang 13.01 on macOS, where these macros come from <pmmintrin.h> and <xmmintrin.h> respectively):
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
delivers the following:
x1.txt,x2.txt:4ms
x1.txt,x2.txt:4ms
x1.txt,x2.txt:3ms
x1.txt,x2.txt:5ms
x1.txt,x2.txt:3ms
x1.txt,x2.txt:3ms
x1.txt,x2.txt:5ms
z1.txt,z2.txt:7ms
z1.txt,z2.txt:6ms
z1.txt,z2.txt:4ms
z1.txt,z2.txt:3ms
z1.txt,z2.txt:3ms
z1.txt,z2.txt:4ms
z1.txt,z2.txt:3ms
Your x-data set is sensitive to this; z does not appear to be. See this question for a better explanation.
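As a rough, portable way to check whether a data set contains subnormal values at all, something like the following sketch could be used. countSubnormals is a hypothetical helper name, not part of the question's code. It only inspects the values passed in, so to check the intermediates that func_lstl2 actually computes you would pass it the differences or their squares rather than the raw inputs.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Counts how many of the given floats are subnormal (denormalized).
// Zero, normal values, infinities and NaNs are not counted.
std::size_t countSubnormals(const std::vector<float>& values) {
    std::size_t n = 0;
    for (float v : values) {
        if (std::fpclassify(v) == FP_SUBNORMAL) ++n;
    }
    return n;
}
```

A non-zero count for the x data and a zero count for the z data would be consistent with the timings above, although it would not prove the cause on its own.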
Currently, I am making a C++ program that solves a sudoku. In order to do this, I calculate the "energy" of the sudoku (the number of faults) frequently. This calculation unfortunately takes up a lot of computation time. I think that it can be sped up significantly by using pointers and references in the calculation, but have trouble figuring out how to implement this.
In my solver class, I have a vector<vector<int>> data member called _sudoku that contains the value of each site. Currently, when calculating the energy, I call a lot of functions that pass by value. I tried adding a & to the function arguments and a * when declaring the variables, but this did not work. How can I make this program run faster by using pass-by-reference?
Calculating the energy should not change the vector anyway, so avoiding the copies would be preferable.
Using CPU usage profiling, I traced 80% of the calculation time to the functions below, where the vectors are used.
int SudokuSolver::calculateEnergy() {
int energy = 243 - (rowUniques() + colUniques() + blockUniques()); // count the number of faults
return energy;
}
int SudokuSolver::colUniques() {
int count = 0;
for (int col = 0; col < _dim; col++) {
vector<int> colVec = _sudoku[col];
for (int i = 1; i <= _dim; i++) {
if (isUnique(colVec, i)) {
count++;
}
}
}
return count;
}
int SudokuSolver::rowUniques() {
int count = 0;
for (int row = 0; row < _dim; row++) {
vector<int> rowVec(_dim);
for (int i = 0; i < _dim; i++) {
rowVec[i] = _sudoku[i][row];
}
for (int i = 1; i <= _dim; i++) {
if (isUnique(rowVec, i)) {
count++;
}
}
}
return count;
}
int SudokuSolver::blockUniques() {
int count = 0;
for (int nBlock = 0; nBlock < _dim; nBlock++) {
vector<int> blockVec = blockMaker(nBlock);
for (int i = 1; i <= _dim; i++) {
if (isUnique(blockVec, i)) {
count++;
}
}
}
return count;
}
vector<int> SudokuSolver::blockMaker(int No) {
vector<int> block(_dim);
int xmin = 3 * (No % 3);
int ymin = 3 * (No / 3);
int col, row;
for (int i = 0; i < _dim; i++) {
col = xmin + (i % 3);
row = ymin + (i / 3);
block[i] = _sudoku[col][row];
}
return block;
}
bool SudokuSolver::isUnique(vector<int> v, int n) {
int count = 0;
for (int i = 0; i < _dim; i++) {
if (v[i] == n) {
count++;
}
}
if (count == 1) {
return true;
} else {
return false;
}
}
The specific lines that use a lot of computation time are the ones like:
vector<int> colVec = _sudoku[col];
and every time isUnique() is called.
I expect that if I switch to using pass-by-reference, my code will speed up significantly. Could anyone help me in doing so, if that would indeed be the case?
Thanks in advance.
If you change SudokuSolver::isUnique to take vector<int> &v, that is the only change you need in order to pass by reference instead of by value. Passing a pointer is similar to passing by reference, with the difference that a pointer can be reassigned or be NULL, while a reference cannot.
I suspect you would see a performance increase if the problem is large enough for the copies to matter (if your problem is small, a minor performance gain will be difficult to measure).
Hope this helps!
vector<int> colVec = _sudoku[col]; does copy/transfer all the elements, while const vector<int>& colVec = _sudoku[col]; would not (it only creates an alias for the right hand side).
Same with bool SudokuSolver::isUnique(vector<int> v, int n) { versus bool SudokuSolver::isUnique(const vector<int>& v, int n) {
Edited after Jesper Juhl's suggestion: The const addition makes sure that you don't change the reference contents by mistake.
Edit 2: Another thing to notice is vector<int> rowVec(_dim);. These vectors are allocated and deallocated on every iteration, which might get costly. You could try something like
int SudokuSolver::rowUniques() {
int count = 0;
vector<int> rowVec(_maximumDim); // Specify maximum dimension
for (int row = 0; row < _dim; row++) {
for (int i = 0; i < _dim; i++) {
rowVec[i] = _sudoku[i][row];
}
for (int i = 1; i <= _dim; i++) {
if (isUnique(rowVec, i)) {
count++;
}
}
}
return count;
}
if that doesn't interfere with the rest of your implementation.
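As a minimal sketch of the const-reference change discussed in this answer, here are isUnique and colUniques written as free functions so that the snippet compiles on its own. In the original they are members of SudokuSolver that use _sudoku and _dim; taking the board as a parameter is an assumption made only for this illustration.

```cpp
#include <vector>

// Takes the vector by const reference: no copy is made, and the
// function cannot modify the caller's data.
bool isUnique(const std::vector<int>& v, int n) {
    int count = 0;
    for (int x : v) {
        if (x == n) ++count;
    }
    return count == 1;
}

// Hypothetical free-function version of colUniques().
int colUniques(const std::vector<std::vector<int>>& sudoku) {
    int count = 0;
    for (const std::vector<int>& colVec : sudoku) {  // alias, not a copy
        const int dim = static_cast<int>(colVec.size());
        for (int i = 1; i <= dim; ++i) {
            if (isUnique(colVec, i)) ++count;
        }
    }
    return count;
}
```

The behavior is the same as with pass-by-value; only the copies of each column disappear.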
I don't know how to parallelize these loops, because they have a lot of dependent variables and I am very confused.
Can you help and guide me?
The first one is:
for (int a = 0; a < sigmaLen; ++a) {
int f = freq[a];
if (f >= sumFreqLB)
if (updateRemainingDistances(s, a, pos))
if (prunePassed(pos + 1)) {
lmer[pos] = a;
enumerateStrings(pos + 1, sumFreqLB - f);
}
}
The second one is :
void preprocessLowerBounds() {
int i = stackSz - 1;
int pairOffset = (i * (i - 1)) >> 1;
for (int k = L; k; --k) {
int *dsn = dist[k] + pairOffset;
int *ds = dist[k - 1] + pairOffset;
int *s = colS[k - 1];
char ci = s[i];
for (int j = 0; j < i; ++j) {
char cj = s[j];
*ds++ = (*dsn++) + (ci != cj);
}
}
}
Another one is:
void enumerateSubStrings(int rowNumber, int remainQTolerance) {
int nItems = rowSize[rowNumber][stackSz];
if (shouldGenerateNeighborhood(rowNumber, nItems)) {
bruteForceIt(rowNumber, nItems);
} else {
indexType *row = rowItem[rowNumber];
for (int j = 0; j < nItems; ++j) {
indexType ind = row[j];
addString(lmers + ind);
preprocessLowerBounds();
uint threshold = maxLB[stackSz] - addMaxFreq();
if (hasSolution(0, threshold)) {
if (getValid<hasPreprocessedPairs, useQ>(rowNumber + 1,
(stackSz <= 2 ? n : smallN), threshold + LminusD,
ind, remainQTolerance)) {
enumerateSubStrings<hasPreprocessedPairs, useQ>(
rowNumber + 1, remainQTolerance);
}
}
removeLastString();
}
}
}
void addString(const char *t) {
int *mf = colMf[stackSz + 1];
for (int j = 0; j < L; ++j) {
int c = t[j];
colS[j][stackSz] = c;
mf[j] = colMaxFreq[j] + (colMaxFreq[j] == colFreq[j][c]++);
}
colMaxFreq = mf;
++stackSz;
}
void removeLastString() {
--stackSz;
for (int j = 0; j < L; ++j)
--colFreq[j][colS[j][stackSz]];
colMaxFreq = colMf[stackSz];
}
For OpenMP to parallelize a loop safely, you basically have to follow two rules. First, never write to the same memory location from different threads. Second, never depend on reading a memory location that another thread may modify. In the first loop you only write to the lmer variable, and the other operations read variables that I assume are not being changed at the same time elsewhere in your code, so the first loop would look as follows:
// My guess is that s and pos are global or belong to a class, so each thread
// needs its own copy; firstprivate also carries over their current values.
// The loop variable a is private automatically. sumFreqLB and freq are not
// listed because they are only read.
#pragma omp parallel for firstprivate(s, pos)
for (int a = 0; a < sigmaLen; ++a) {
    int f = freq[a];
    if (f >= sumFreqLB)
        if (updateRemainingDistances(s, a, pos))
            if (prunePassed(pos + 1)) {
                // Only one thread at a time may enter; unsynchronized
                // writes to lmer would be a data race.
                #pragma omp critical
                {
                    lmer[pos] = a;
                }
                enumerateStrings(pos + 1, sumFreqLB - f);
            }
}
You must make sure that the functions updateRemainingDistances, prunePassed and enumerateStrings do not use static or global variables internally.
In the second loop, the countdown form for (int k = L; k; --k) is not one OpenMP recognizes, because the loop test has to be a relational comparison such as k > 0. Note, however, that each iteration writes dist[k - 1] and the next iteration reads that same row back as dist[k], so the iterations of the outer loop depend on each other and have to run in order. The inner loop is different: every j reads and writes only its own element, so it can be shared among threads as long as no other thread modifies dist or colS at the same time. Whether this pays off depends on how large i is.
void preprocessLowerBounds() {
    int i = stackSz - 1;
    int pairOffset = (i * (i - 1)) >> 1;
    for (int k = L; k > 0; --k) { // serial: dist[k - 1] is computed from dist[k]
        int *dsn = dist[k] + pairOffset;
        int *ds = dist[k - 1] + pairOffset;
        int *s = colS[k - 1];
        char ci = s[i];
        #pragma omp parallel for
        for (int j = 0; j < i; ++j) {
            char cj = s[j];
            ds[j] = dsn[j] + (ci != cj); // each j touches only its own element
        }
    }
}
The last function calls many functions whose source code I cannot see, so I cannot tell whether it can be parallelized. The following example shows which kinds of functions are and are not safe to call from a parallel loop:
#include <vector>

std::vector<int> myVector;
void notParalelizable_1(int i) {
    myVector.push_back(i);
}
void notParalelizable_2(int i) {
    static int A = 0;
    A = A + i;
}
int varGlobal = 0;
void notParalelizable_3(int i) {
    varGlobal = varGlobal + i;
}
void oneFunctionParalelizable(int i)
{
    int B = i;
}
int main()
{
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
    {
        // Wrong: myVector is modified from several threads at once. At best
        // the values are not stored in ascending order; because push_back is
        // not thread-safe, this can also produce wrong results or crash at
        // run time.
        notParalelizable_1(i);
    }
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
    {
        notParalelizable_2(i); // Wrong: A is modified from several threads at once
    }
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
    {
        notParalelizable_3(i); // Wrong: varGlobal is modified from several threads at once
    }
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
    {
        oneFunctionParalelizable(i); // No problem: only local variables are touched
    }
    // The following code is correct
    int *vector = new int[10];
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
    {
        vector[i] = i; // No problem: each thread writes to a different memory position
    }
    delete[] vector;
    // The following code is wrong
    int k = 2;
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
    {
        k = k + i; // The final value of k will be wrong because it is modified from different threads
    }
    return 0;
}
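The last loop in the example above (k = k + i) is the classic case for a reduction clause, which the answer does not mention. A minimal sketch, assuming the compiler is invoked with OpenMP support enabled (for example -fopenmp on GCC or Clang); sumWithReduction is a hypothetical name used only here.

```cpp
// Adds 0..n-1 onto an initial value. The reduction clause gives each
// thread a private partial sum and combines the partial sums with the
// original value of k when the loop ends. If OpenMP is not enabled, the
// pragma is ignored and the loop simply runs serially.
int sumWithReduction(int initial, int n) {
    int k = initial;
    #pragma omp parallel for reduction(+:k)
    for (int i = 0; i < n; i++) {
        k += i;
    }
    return k;
}
```

With initial = 2 and n = 10 this returns 47 regardless of the number of threads, which is what the racy loop was trying to compute.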
I can't find out what's wrong with this part of my program. I want to find the most frequently occurring number in my array of structures, but it only finds the last number :/
void Daugiausiai(int n)
{
int max = 0;
int sk;
for(int i = 0; i < n; i++){
int kiek = 0;
for(int j=0; j < n; j++){
if(A[i].datamet == A[j].datamet){
kiek++;
if(kiek > max){
max = kiek;
sk = A[i].datamet;
}
}
}
}
}
P.S. This is only a part of my code.
You haven't shown us enough of your code, but it is likely that you are not looking at the real result of your function. The result, sk, is local to the function and you don't return it. If you have a global variable that is also named sk, it will not be touched by Daugiausiai.
In the same way, you pass the number of elements in your struct array, but work on a global struct. It is good practice to "encapsulate" functions so that they receive the data they work on as arguments and return a result. Your function should therefore take both the array and its length and return the result.
(Such an encapsulation doesn't work in all cases, but here, it has the benefit that you can use the same function for many different arrays of the same structure type.)
It is also enough to compare the count with the maximum so far once, after the counting loop has finished.
Putting all this together:
struct Data {
int datamet;
};
int Daugiausiai(const struct Data A[], int n)
{
int max = 0;
int sk;
for (int i = 0; i < n; i++){
int kiek = 0;
// Count occurrences
for(int j = 0; j < n; j++){
if(A[i].datamet == A[j].datamet) kiek++;
}
// Check for maximum
if (kiek > max) {
max = kiek;
sk = A[i].datamet;
}
}
return sk;
}
And you call it like this:
struct Data A[6] = {{1}, {2}, {1}, {4}, {1}, {2}};
int n = Daugiausiai(A, 6);
printf("%d\n", n); // 1
It would be nice if you had English variable names, so I could read them a bit better ^^. What is your parameter n for? Is that the array length? And what should your function do? It has no return value.
int getMostOccuring(int array[], int length)
{
int current_number;
int current_count = 0;
int most_occuring_number;
int most_occuring_count = 0;
for (int i = 0; i < length; i++)
{
current_number = array[i];
current_count = 0;
for (int j = i; j < length; j++)
{
int test_number = array[j];
if (test_number == current_number)
{
current_count ++;
if (current_count > most_occuring_count)
{
most_occuring_number = current_number;
most_occuring_count = current_count;
}
}
}
}
return most_occuring_number;
}
This should work and return the most frequently occurring number in the given array (it has a bad runtime, but it is very simple and easy to understand).
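If the quadratic runtime ever becomes a problem, a different technique is to count occurrences in a hash map in a single pass. The sketch below swaps in std::unordered_map and works on a plain std::vector<int> instead of the struct array from the question; mostFrequent is a hypothetical name, and returning 0 for an empty input is an arbitrary choice made for this illustration.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Returns the most frequent value. On a tie, the value that reaches the
// winning count first is returned. An empty input yields 0.
int mostFrequent(const std::vector<int>& values) {
    std::unordered_map<int, std::size_t> counts;
    int best = 0;
    std::size_t bestCount = 0;
    for (int v : values) {
        const std::size_t c = ++counts[v];  // running count for this value
        if (c > bestCount) {
            bestCount = c;
            best = v;
        }
    }
    return best;
}
```

For the sample data {1, 2, 1, 4, 1, 2} this returns 1, the same answer as the nested-loop versions.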
One often reads that there is little performance difference between a dynamically allocated array and std::vector.
Here are two versions of a solution to Project Euler problem 10.
With std::vector:
const __int64 sum_of_primes_below_vectorversion(int max)
{
auto primes = new_primes_vector(max);
__int64 sum = 0;
for (auto p : primes) {
sum += p;
}
return sum;
}
const std::vector<int> new_primes_vector(__int32 max_prime)
{
std::vector<bool> is_prime(max_prime, true);
is_prime[0] = is_prime[1] = false;
for (auto i = 2; i < max_prime; i++) {
is_prime[i] = true;
}
for (auto i = 1; i < max_prime; i++) {
if (is_prime[i]) {
auto max_j = max_prime / i;
for (auto j = i; j < max_j; j++) {
is_prime[j * i] = false;
}
}
}
auto primes_count = 0;
for (auto i = 0; i < max_prime; i++) {
if (is_prime[i]) {
primes_count++;
}
}
std::vector<int> primes(primes_count, 0);
for (auto i = 0; i < max_prime; i++) {
if (is_prime[i]) {
primes.push_back(i);
}
}
return primes;
}
Note that I also tested a version that calls the default constructor of std::vector, without precomputing its final size.
Here is the array version:
const __int64 sum_of_primes_below_carrayversion(int max)
{
auto p_length = (int*)malloc(sizeof(int));
auto primes = new_primes_array(max, p_length);
auto last_index = *p_length - 1;
__int64 sum = 0;
for (int i = 0; i < last_index; i++) {
sum += primes[i];
}
free((__int32*)(primes));
free(p_length);
return sum;
}
const __int32* new_primes_array(__int32 max_prime, int* p_primes_count)
{
auto is_prime = (bool*)malloc(max_prime * sizeof(bool));
is_prime[0] = false;
is_prime[1] = false;
for (auto i = 2; i < max_prime; i++) {
is_prime[i] = true;
}
for (auto i = 1; i < max_prime; i++) {
if (is_prime[i]) {
auto max_j = max_prime / i;
for (auto j = i; j < max_j; j++) {
is_prime[j * i] = false;
}
}
}
auto primes_count = 0;
for (auto i = 0; i < max_prime; i++) {
if (is_prime[i]) {
primes_count++;
}
}
*p_primes_count = primes_count;
int* primes = (int*)malloc(*p_primes_count * sizeof(__int32));
int index_primes = 0;
for (auto i = 0; i < max_prime; i++) {
if (is_prime[i]) {
primes[index_primes] = i;
index_primes++;
}
}
free(is_prime);
return primes;
}
This is compiled with the Visual Studio 2013 compiler, with the /O2 optimization flag.
I don't see where a big difference would come from, given move semantics (which allow returning the big vector by value without a copy).
Here are the results, with an input of 2E6:
C array version
avg= 0.0438
std= 0.00928224
vector version
avg= 0.0625
std= 0.0005
vector version (no realloc)
avg= 0.0687
std= 0.00781089
The statistics are on 10 trials.
These differences seem significant to me. Is there something in my code that should be improved?
edit: after correction of my code (and another improvement), here are my new results:
C array version
avg= 0.0344
std= 0.00631189
vector version
avg= 0.0343
std= 0.00611637
vector version (no realloc)
avg= 0.0469
std= 0.00997447
which confirms that there is no penalty for std::vector compared to C arrays (and that one should avoid reallocating).
There shouldn't be a performance difference between vector and a dynamic array, since a vector is a dynamic array.
The performance difference in your code comes from the fact that you are actually doing different things between the vector and array version. For instance:
std::vector<int> primes(primes_count, 0);
for (auto i = 0; i < max_prime; i++) {
if (is_prime[i]) {
primes.push_back(i);
}
}
return primes;
This creates a vector of size primes_count, all initialized to 0, and then pushes back a bunch of primes onto it. But it still starts with primes_count 0s! So that's wasted memory from both an initialization perspective and an iteration perspective. What you want to do is:
std::vector<int> primes;
primes.reserve(primes_count);
// same push_back loop
return primes;
Along the same lines, consider this block:
std::vector<int> is_prime(max_prime, true);
is_prime[0] = is_prime[1] = false;
for (auto i = 2; i < max_prime; i++) {
is_prime[i] = true;
}
You construct a vector of max_prime ints initialized to true... and then assign most of them to true again. You're doing the initialization twice here, whereas in the array implementation you only do it once. You should just remove this for loop.
I bet if you fix these two issues - which would make the two algorithms comparable - you'd get the same performance.
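To make the first point concrete, here is a small sketch contrasting the sized constructor with reserve. fillSized and fillReserved are hypothetical helper names; they push the values 0..n-1 rather than primes so that the difference in the resulting size is easy to see.

```cpp
#include <cstddef>
#include <vector>

// Mirrors the original mistake: n zeros up front, then n more elements
// appended behind them.
std::vector<int> fillSized(std::size_t n) {
    std::vector<int> v(n, 0);
    for (std::size_t i = 0; i < n; ++i) v.push_back(static_cast<int>(i));
    return v;
}

// reserve() only allocates capacity; the size stays 0 until push_back
// adds elements, and no reallocation happens along the way.
std::vector<int> fillReserved(std::size_t n) {
    std::vector<int> v;
    v.reserve(n);
    for (std::size_t i = 0; i < n; ++i) v.push_back(static_cast<int>(i));
    return v;
}
```

fillSized(5) ends up with 10 elements (five leading zeros plus the pushed values), while fillReserved(5) ends up with exactly 5.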