Multithreaded mean and standard deviation calculation does not improve efficiency - C++

I am a novice at C++ multithreaded programming, and I am trying to compute the mean and standard deviation of my data in parallel to reduce the run time. My function for calculating the mean and standard deviation is as follows.
void cal_mean_std(float* data, float* mean, float* sd, int N, int start_index, int span_cols)
{
    float value; // float, not int: the data is float, and an int would truncate each value
    for (int j = start_index; j < start_index + span_cols; j++) {
        mean[j] = 0;
        sd[j] = 0;
        for (int i = 0; i < N; i++) {
            value = data[j * N + i];
            mean[j] += value;
            sd[j] += value * value;
        }
        mean[j] = mean[j] / N;
        sd[j] = sqrt(sd[j] / N - mean[j] * mean[j]);
    }
}
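(For reference: the function accumulates each column's sum and sum of squares in one pass, then applies the identity Var(X) = E[X^2] - (E[X])^2 to get the standard deviation.)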
I specify the start index and the number of columns each thread handles, and I launch my thread pool as follows.
x.mean = new float[x.M];
x.sd = new float[x.M];
std::vector<std::thread> thread_pool;
int h = 4;
thread_pool.reserve(h);
int SNIPs = static_cast<int>(x.M / h + 1);
int SNIPs_final = x.M - (h - 1) * SNIPs;
for (int i = 0; i < h - 1; i++)
{
    thread_pool.push_back(std::thread(std::bind(cal_mean_std, x.data, x.mean, x.sd,
                                                x.N, i * SNIPs, SNIPs)));
}
thread_pool.push_back(std::thread(std::bind(cal_mean_std, x.data, x.mean, x.sd,
                                            x.N, (h - 1) * SNIPs, SNIPs_final)));
for (int i = 0; i < h; i++)
    thread_pool.at(i).join();
where x.M is the total number of columns of my data. However, I found that this implementation did not improve the program's efficiency, and I am not sure what the problem is.
We can simulate data to reproduce the computation. My data size is 5k x 300k. The sequential calculation, looping over all the data in one thread, takes 15 seconds; my multithreaded version sometimes takes 16 seconds.
The simulation code is below. I find that when I use h = 1, the program takes 6 s to finish, but when I use h = 4, it takes 14 s.
#include <thread>
#include <vector>
#include <functional>
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <math.h>
#include <time.h>

void gen_matrix(int N, int P, float* data)
{
    for (int i = 0; i < N * P; i++)
    {
        data[i] = rand() % 10;
    }
}

void cal_mean_std(float* data, float* mean, float* sd, int N, int start_index, int span_cols)
{
    float value; // float, not int: the data is float, and an int would truncate each value
    for (int j = start_index; j < start_index + span_cols; j++) {
        mean[j] = 0;
        sd[j] = 0;
        for (int i = 0; i < N; i++) {
            value = data[j * N + i];
            mean[j] += value;
            sd[j] += value * value;
        }
        mean[j] = mean[j] / N;
        sd[j] = sqrt(sd[j] / N - mean[j] * mean[j]);
    }
}

int main()
{
    int N = 5000;
    int P = 300000;
    float* data = new float[N * P];
    gen_matrix(N, P, data);
    float* mean = new float[P];
    float* std = new float[P];
    std::vector<std::thread> thread_pool;
    clock_t t1;
    t1 = clock();
    int h = 1;
    thread_pool.reserve(h);
    int SNIPs = static_cast<int>(P / h + 1);
    int SNIPs_final = P - (h - 1) * SNIPs;
    for (int i = 0; i < h - 1; i++)
    {
        thread_pool.push_back(std::thread(std::bind(cal_mean_std, data, mean, std,
                                                    N, i * SNIPs, SNIPs)));
    }
    thread_pool.push_back(std::thread(std::bind(cal_mean_std, data, mean, std,
                                                N, (h - 1) * SNIPs, SNIPs_final)));
    for (int i = 0; i < h; i++)
        thread_pool.at(i).join();
    std::cout << "Time for the cal mean and std is " << (clock() - t1) * 1.0 / CLOCKS_PER_SEC << std::endl;
    return 0;
}

Thank you, everyone. I finally found the problem with my code: the clock_t timer measures CPU time consumed by all threads combined, not wall-clock time.
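For anyone hitting the same issue, here is a minimal sketch of wall-clock timing with std::chrono (steady_clock is the usual choice for measuring intervals), replacing the clock()-based measurement above:

#include <chrono>
#include <iostream>

int main()
{
    auto t_start = std::chrono::steady_clock::now();
    // ... launch the threads and join them here, exactly as above ...
    auto t_end = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t_end - t_start).count();
    std::cout << "Wall time for mean and std: " << seconds << " s" << std::endl;
    return 0;
}

Measured this way, the multithreaded version should show its real speedup: clock() adds up the CPU time of every thread, so four busy threads report roughly four times the wall time.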

Related

Speedup when avoiding false sharing problem?

I am trying to add cache-line padding to avoid the false-sharing problem, but I can't see a big difference in speedup. With padding it is only 1.2x faster. I am running both the code without padding and the code with padding with n = 700 million for testing. Should I get more than a 1.2x speedup? Maybe I have missed something in my padding implementation? I am adding 15 ints of padding because I am assuming the counters don't have to be allocated at the start of a cache line. Any tips appreciated.
Here is my code:
template <const int k> void par_countingsort2(int *out, int const *in, const int n) {
    const int paddingAmount = cachelinesize / sizeof(int);
    const int kPadded = k + (paddingAmount - 1);
    printf("\n%d", kPadded); // was "/n%d": the newline escape is backslash-n
    int counters[nproc][kPadded] = {}; // all zeros
    #pragma omp parallel
    {
        int *thcounters = counters[omp_get_thread_num()];
        #pragma omp for
        for (int i = 0; i < n; ++i)
            ++thcounters[in[i]];
        #pragma omp single
        {
            int tmp, sum = 0;
            for (int j = 0; j < k; ++j)
                for (int i = 0; i < nproc; ++i) {
                    tmp = counters[i][j];
                    counters[i][j] = sum;
                    sum += tmp;
                }
        }
        #pragma omp for
        for (int i = 0; i < n; ++i)
            out[thcounters[in[i]]++] = in[i];
    }
}
#define k 1000
int main(int argc, char *argv[]) {
    // init input
    int n = argc > 1 && atoi(argv[1]) > 0 ? atoi(argv[1]) : 0;
    int* in = (int*)malloc(sizeof(int) * n);
    int* out = (int*)malloc(sizeof(int) * n);
    for (int i = 0; i < n; ++i)
        in[i] = rand() % k;
    printf("n = %d\n", n);
    // print some parameters
    printf("nproc = %d\n", nproc);
    printf("cachelinesize = %d byte\n", cachelinesize);
    printf("k = %d\n", k);
    double tp2 = omp_get_wtime();
    par_countingsort2<k>(out, in, n);
    tp2 = omp_get_wtime() - tp2;
    printf("par2, elapsed time = %.3f seconds (%.1fx speedup from par1), check passed = %c\n", tp2, tp/tp2, checkreset(out,in,n) ? 'y' : 'n');
    // free mem
    free(in);
    free(out);
    return EXIT_SUCCESS;
}
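For comparison, here is a minimal sketch (not the asker's code) of a stricter padding scheme: give each thread's counter row cache-line alignment with alignas, so two rows can never share a line. It assumes a 64-byte cache line; where available, C++17's std::hardware_destructive_interference_size could replace the constant:

#include <omp.h>
#include <cstring>

constexpr int kCacheLine = 64; // assumed cache-line size in bytes
constexpr int kBuckets = 1000; // matches k in the question

// alignas(64) also rounds sizeof(PaddedCounters) up to a multiple of 64,
// so consecutive rows in an array start on fresh cache lines
struct alignas(kCacheLine) PaddedCounters {
    int c[kBuckets];
};

void count_parallel(const int* in, int n, PaddedCounters* counters) {
    #pragma omp parallel
    {
        // each thread touches only its own aligned row, so no false sharing
        int* my = counters[omp_get_thread_num()].c;
        std::memset(my, 0, sizeof(int) * kBuckets);
        #pragma omp for
        for (int i = 0; i < n; ++i)
            ++my[in[i]];
    }
}

One caveat worth noting: with k = 1000, each counter row is already about 4 KB, so only the few counters near a row boundary can falsely share a cache line in the first place; that alone may explain why the observed speedup from padding is modest.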

Very slow mutex in LLVM/OpenMP

I wrote code to test the performance of OpenMP on Windows (Win7 x64, Core i7 3.4 GHz) and on Mac (macOS 10.12.3, Core i7 2.7 GHz).
In Xcode I created a console application with the default compiler settings. I use LLVM 3.7 and OpenMP 5 (in omp.h I found #define KMP_VERSION_MAJOR 5, #define KMP_VERSION_MINOR 0, and KMP_VERSION_BUILD = 20150701; the library is libiomp5) on macOS 10.12.3.
For Windows I use VS2010 SP1. Additionally, I set C/C++ -> Optimization -> Optimization = Maximize Speed (/O2) and C/C++ -> Optimization -> Favor Size Or Speed = Favor Fast Code (/Ot).
If I run the application in a single thread, the time difference roughly corresponds to the ratio of the processors' clock frequencies. But if I run 4 threads, the difference becomes dramatic: the Windows program is about 70 times faster than the Mac program.
#include <cmath>
#include <mutex>
#include <cstdint>
#include <cstdio>
#include <iostream>
#include <omp.h>
#include <boost/chrono/chrono.hpp>

static double ActionWithNumber(double number)
{
    double sum = 0.0f;
    for (std::uint32_t i = 0; i < 50; i++)
    {
        double coeff = sqrt(pow(std::abs(number), 0.1));
        double res = number*(1.0-coeff)*number*(1.0-coeff) * 3.0;
        sum += sqrt(res);
    }
    return sum;
}

static double TestOpenMP(void)
{
    const std::uint32_t len = 4000000;
    double *a;
    double *b;
    double *c;
    double sum = 0.0;
    std::mutex _mutex;
    a = new double[len];
    b = new double[len];
    c = new double[len];
    for (std::uint32_t i = 0; i < len; i++)
    {
        c[i] = 0.0;
        a[i] = sin((double)i);
        b[i] = cos((double)i);
    }
    boost::chrono::time_point<boost::chrono::system_clock> start, end;
    start = boost::chrono::system_clock::now();
    double k = 2.0;
    omp_set_num_threads(4);
    #pragma omp parallel for
    for (int i = 0; i < len; i++)
    {
        c[i] = k*a[i] + b[i] + k;
        if (c[i] > 0.0)
        {
            c[i] += ActionWithNumber(c[i]);
        }
        else
        {
            c[i] -= ActionWithNumber(c[i]);
        }
        std::lock_guard<std::mutex> scoped(_mutex);
        sum += c[i];
    }
    end = boost::chrono::system_clock::now();
    boost::chrono::duration<double> elapsed_time = end - start;
    double sum2 = 0.0;
    for (std::uint32_t i = 0; i < len; i++)
    {
        sum2 += c[i];
        c[i] /= sum2;
    }
    if (std::abs(sum - sum2) > 0.01) printf("Incorrect result.\n");
    delete[] a;
    delete[] b;
    delete[] c;
    return elapsed_time.count();
}

int main()
{
    double sum = 0.0;
    const std::uint32_t steps = 5;
    for (std::uint32_t i = 0; i < steps; i++)
    {
        sum += TestOpenMP();
    }
    sum /= (double)steps;
    std::cout << "Elapsed time = " << sum;
    return 0;
}
I specifically use a mutex here to compare the performance of OpenMP on the Mac and on Windows. On Windows the function returns a time of 0.39 seconds; on the Mac it returns 25 seconds, i.e. roughly 70 times slower.
What is the cause of this difference?
First of all, thanks for editing my post (I use a translator to write my text).
In the real app, I update the values of a huge matrix (20000x20000) in random order. Each thread determines a new value and writes it to a particular cell. I create a mutex for each row, since in most cases different threads write to different rows; but apparently when two threads write to the same row, there is a long lock. At the moment I can't divide the rows between threads, since the order of writes is determined by the FEM elements.
So simply putting a critical section in is not an option, since it would block writes to the entire matrix.
I wrote code similar to the real application:
#include <cmath>
#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <boost/thread.hpp>
#include <boost/chrono/chrono.hpp>

static double ActionWithNumber(double number)
{
    const unsigned int steps = 5000;
    double sum = 0.0f;
    for (unsigned int i = 0; i < steps; i++)
    {
        double coeff = sqrt(pow(std::abs(number), 0.1));
        double res = number*(1.0-coeff)*number*(1.0-coeff) * 3.0;
        sum += sqrt(res);
    }
    sum /= (double)steps;
    return sum;
}

static double RealAppTest(void)
{
    const unsigned int elementsNum = 10000;
    double* matrix;
    unsigned int* elements;
    boost::mutex* mutexes;
    elements = new unsigned int[elementsNum*3];
    matrix = new double[elementsNum*elementsNum];
    mutexes = new boost::mutex[elementsNum];
    for (unsigned int i = 0; i < elementsNum; i++)
        for (unsigned int j = 0; j < elementsNum; j++)
            matrix[i*elementsNum + j] = (double)(rand() % 100);
    for (unsigned int i = 0; i < elementsNum; i++) // build FEM element like Triangle
    {
        elements[3*i] = rand()%(elementsNum-1);
        elements[3*i+1] = rand()%(elementsNum-1);
        elements[3*i+2] = rand()%(elementsNum-1);
    }
    boost::chrono::time_point<boost::chrono::system_clock> start, end;
    start = boost::chrono::system_clock::now();
    omp_set_num_threads(4);
    #pragma omp parallel for
    for (int i = 0; i < elementsNum; i++)
    {
        unsigned int* elems = &elements[3*i];
        for (unsigned int j = 0; j < 3; j++)
        {
            // lock the mutex for the row being written, i.e. row elems[j]
            boost::lock_guard<boost::mutex> lockup(mutexes[elems[j]]);
            double res = 0.0;
            for (unsigned int k = 0; k < 3; k++)
            {
                res += ActionWithNumber(matrix[elems[j]*elementsNum + elems[k]]);
            }
            for (unsigned int k = 0; k < 3; k++)
            {
                matrix[elems[j]*elementsNum + elems[k]] = res;
            }
        }
    }
    end = boost::chrono::system_clock::now();
    boost::chrono::duration<double> elapsed_time = end - start;
    delete[] elements;
    delete[] matrix;
    delete[] mutexes;
    return elapsed_time.count();
}

int main()
{
    double sum = 0.0;
    const unsigned int steps = 5;
    for (unsigned int i = 0; i < steps; i++)
    {
        sum += RealAppTest();
    }
    sum /= (double)steps;
    std::cout << "Elapsed time = " << sum;
    return 0;
}
You're combining two different sets of threading/synchronization primitives: OpenMP, which is built into the compiler and has its own runtime system, and a manually created POSIX-style mutex via std::mutex. It's probably not surprising that there are interoperability hiccups with some compiler/OS combinations.
My guess is that in the slow case, the OpenMP runtime is going overboard to make sure there are no interactions between its higher-level ongoing threading tasks and the manual mutex, and that doing so inside a tight loop causes the dramatic slowdown.
For mutex-like behaviour in the OpenMP framework, we can use critical sections:
#pragma omp parallel for
for (int i = 0; i < len; i++)
{
    //...
    // replacing this: std::lock_guard<std::mutex> scoped(_mutex);
    #pragma omp critical
    sum += c[i];
}
or explicit locks:
omp_lock_t sumlock;
omp_init_lock(&sumlock);
#pragma omp parallel for
for (int i = 0; i < len; i++)
{
    //...
    // replacing this: std::lock_guard<std::mutex> scoped(_mutex);
    omp_set_lock(&sumlock);
    sum += c[i];
    omp_unset_lock(&sumlock);
}
omp_destroy_lock(&sumlock);
We get much more reasonable timings:
$ time ./openmp-original
real 1m41.119s
user 1m15.961s
sys 1m53.919s
$ time ./openmp-critical
real 0m16.470s
user 1m2.313s
sys 0m0.599s
$ time ./openmp-locks
real 0m15.819s
user 1m0.820s
sys 0m0.276s
Update: there's no problem with using an array of OpenMP locks in exactly the same way as the mutexes:
omp_lock_t sumlocks[elementsNum];
for (unsigned idx = 0; idx < elementsNum; idx++)
    omp_init_lock(&(sumlocks[idx]));
//...
#pragma omp parallel for
for (int i = 0; i < elementsNum; i++)
{
    unsigned int* elems = &elements[3*i];
    for (unsigned int j = 0; j < 3; j++)
    {
        // lock the row being written, i.e. row elems[j]
        double res = 0.0;
        for (unsigned int k = 0; k < 3; k++)
        {
            res += ActionWithNumber(matrix[elems[j]*elementsNum + elems[k]]);
        }
        omp_set_lock(&(sumlocks[elems[j]]));
        for (unsigned int k = 0; k < 3; k++)
        {
            matrix[elems[j]*elementsNum + elems[k]] = res;
        }
        omp_unset_lock(&(sumlocks[elems[j]]));
    }
}
for (unsigned idx = 0; idx < elementsNum; idx++)
    omp_destroy_lock(&(sumlocks[idx]));

C++ memory leak, how to detect

I am using SSE to implement matrix multiplication, but I found there is a memory leak: the memory usage increases from 400 MB to 1 GB or more.
But I do free memory in the code.
The code follows.
main.cpp
#include "sse_matrix.h"
#include <ctime>
int main(int argc, char* argv[])
{
vector<float> left(size, 0);
vector<float> right(size, 0);
vector<float> result(size, 0);
// initialize value
for (int i = 0; i < dim; i ++)
{
for (int j = 0; j < dim; j ++)
{
left[i*dim + j] = j;
right[i*dim + j] = j;
}
}
cout << "1. INFO: value initialized, starting matrix multiplication" << endl;
// calculate the result
clock_t my_time = clock();
SSE_Matrix_Multiply(&left, &right, &result);
cout << "2. INFO: SSE matrix multiplication result has got" << endl;
/*for (int i = 0; i < dim; i ++)
{
for (int j = 0; j < dim; j ++)
{
cout << result[i * dim + j] << " ";
}
cout << endl;
}*/
cout << "3. INFO: " << float(clock() - my_time)/1000.0 << endl;
system("pause");
return 0;
}
sse_matrix.h
#ifndef __SSE_MATRIX_H
#define __SSE_MATRIX_H

#include <xmmintrin.h> // for the _mm_* SSE intrinsics
#include <vector>
#include <iostream>

using std::cin;
using std::cout;
using std::endl;
using std::vector;

//#define dim 8
//#define size (dim * dim)
const int dim = 4096;
const int size = dim * dim;

struct Matrix_Info
{
    vector<float> * A;
    int ax, ay;
    vector<float> * B;
    int bx, by;
    vector<float> * C;
    int cx, cy;
    int m;
    int n;
};

void Transpose_Matrix_SSE(float * matrix)
{
    __m128 row1 = _mm_loadu_ps(&matrix[0*4]);
    __m128 row2 = _mm_loadu_ps(&matrix[1*4]);
    __m128 row3 = _mm_loadu_ps(&matrix[2*4]);
    __m128 row4 = _mm_loadu_ps(&matrix[3*4]);
    _MM_TRANSPOSE4_PS(row1, row2, row3, row4);
    _mm_storeu_ps(&matrix[0*4], row1);
    _mm_storeu_ps(&matrix[1*4], row2);
    _mm_storeu_ps(&matrix[2*4], row3);
    _mm_storeu_ps(&matrix[3*4], row4);
}

float * Shuffle_Matrix_Multiply(float * left, float * right)
{
    __m128 _t1, _t2, _sum;
    _sum = _mm_setzero_ps(); // set all values of _sum to zero
    float * _result = new float[size];
    float _res[4] = {0};
    for (int i = 0; i < 4; i++)
    {
        for (int j = 0; j < 4; j++)
        {
            _t1 = _mm_loadu_ps(left + i * 4);
            _t2 = _mm_loadu_ps(right + j * 4);
            _sum = _mm_mul_ps(_t1, _t2);
            _mm_storeu_ps(_res, _sum);
            _result[i * 4 + j] = _res[0] + _res[1] + _res[2] + _res[3];
        }
    }
    return _result;
}

float * SSE_4_Matrix(struct Matrix_Info * my_info)
{
    int m = my_info->m;
    int n = my_info->n;
    int ax = my_info->ax;
    int ay = my_info->ay;
    int bx = my_info->bx;
    int by = my_info->by;
    //1. split Matrix A and Matrix B
    float * _a = new float[16];
    float * _b = new float[16];
    for (int i = 0; i < m; i++)
    {
        for (int j = 0; j < m; j++)
        {
            _a[i*m + j] = (*my_info->A)[(i + ax) * n + j + ay];
            _b[i*m + j] = (*my_info->B)[(i + bx) * n + j + by];
        }
    }
    //2. transpose Matrix B
    Transpose_Matrix_SSE(_b);
    //3. calculate result and return a float pointer
    return Shuffle_Matrix_Multiply(_a, _b);
}

int Matrix_Multiply(struct Matrix_Info * my_info)
{
    int m = my_info->m;
    int n = my_info->n;
    int cx = my_info->cx;
    int cy = my_info->cy;
    for (int i = 0; i < m; i++)
    {
        for (int j = 0; j < m; j++)
        {
            float * temp = SSE_4_Matrix(my_info);
            (*my_info->C)[(i + cx) * n + j + cy] += temp[i*m + j];
            delete [] temp;
        }
    }
    return 0;
}

void SSE_Matrix_Multiply(vector<float> * left, vector<float> * right, vector<float> * result)
{
    struct Matrix_Info my_info;
    my_info.A = left;
    my_info.B = right;
    my_info.C = result;
    my_info.n = dim;
    my_info.m = 4;
    // Matrix A row:i, column:j
    for (int i = 0; i < dim; i += 4)
    {
        for (int j = 0; j < dim; j += 4)
        {
            // Matrix B row:j column:k
            for (int k = 0; k < dim; k += 4)
            {
                my_info.ax = i;
                my_info.ay = j;
                my_info.bx = j;
                my_info.by = k;
                my_info.cx = i;
                my_info.cy = k;
                Matrix_Multiply(&my_info);
            }
        }
    }
}
#endif
I guess the memory leak may be in the Shuffle_Matrix_Multiply function in sse_matrix.h, but I am not sure. Meanwhile, the memory usage keeps increasing and my system will crash.
I hope someone can help me figure this out. Thanks in advance.
You never free the _a and _b allocated in SSE_4_Matrix.
You also allocate a lot of memory dynamically just to throw it away a bit later. For example, _a and _b could be arrays of 16 floats on the stack.
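Here is a minimal sketch of that suggestion, with _a and _b on the stack and a variant of Shuffle_Matrix_Multiply (the _Into name is ours) that writes into a caller-provided 16-float buffer instead of returning a new[] allocation. Note also that the original Shuffle_Matrix_Multiply allocates new float[size], a full dim x dim buffer, just to hold a 4x4 result, so this change removes that waste as well:

// Writes the 4x4 product into a caller-provided buffer: no heap traffic at all
void Shuffle_Matrix_Multiply_Into(const float * left, const float * right, float * result)
{
    float _res[4];
    for (int i = 0; i < 4; i++)
    {
        for (int j = 0; j < 4; j++)
        {
            __m128 _t1 = _mm_loadu_ps(left + i * 4);
            __m128 _t2 = _mm_loadu_ps(right + j * 4);
            _mm_storeu_ps(_res, _mm_mul_ps(_t1, _t2));
            result[i * 4 + j] = _res[0] + _res[1] + _res[2] + _res[3];
        }
    }
}

void SSE_4_Matrix_Stack(struct Matrix_Info * my_info, float out[16])
{
    float _a[16]; // stack arrays: nothing to free, nothing to leak
    float _b[16];
    int m = my_info->m, n = my_info->n;
    for (int i = 0; i < m; i++)
    {
        for (int j = 0; j < m; j++)
        {
            _a[i*m + j] = (*my_info->A)[(i + my_info->ax) * n + j + my_info->ay];
            _b[i*m + j] = (*my_info->B)[(i + my_info->bx) * n + j + my_info->by];
        }
    }
    Transpose_Matrix_SSE(_b);
    Shuffle_Matrix_Multiply_Into(_a, _b, out);
}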
I like to use a header file to help me check for memory leaks. The header file is as follows:
MemoryLeakChecker.hpp
#ifndef __MemoryLeakChecker_H__
#define __MemoryLeakChecker_H__

// _CRTDBG_MAP_ALLOC must be defined before <crtdbg.h> is included
#define _CRTDBG_MAP_ALLOC
#include <crtdbg.h>
#include <cassert>

// for memory leak check
#ifdef _DEBUG
#define DEBUG_CLIENTBLOCK new(_CLIENT_BLOCK, __FILE__, __LINE__)
#else
#define DEBUG_CLIENTBLOCK
#endif

#ifdef _DEBUG
#define new DEBUG_CLIENTBLOCK
#endif

inline void checkMemoryLeak() {
    _CrtSetDbgFlag(_CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF);
    int m_count = _CrtDumpMemoryLeaks();
    assert(m_count == 0);
}
#endif
In my project, I use MemoryLeakChecker.hpp in the file containing the main function, as follows:
MemoryLeakTest.cpp
#include "MemoryLeakChecker.hpp"
int main() {
//_crtBreakAlloc = 148; //if you only know the memory leak block number is 148 after checking memory leak log, use this to locate the code causing memory leak.
//do some things
atexit(checkMemoryLeak); //check all leak after main() function called
return 0;
}
Run your program in debug mode in Visual Studio, and you will get the memory-leak log in the output window after the program exits. The log also shows where the leaked memory was allocated.

Designing MERGE-SORT Algorithm - VERY WEIRD ISSUE! "std::bad_alloc at memory location 0x00486F78"

This is for an assignment in an algorithm class. I understand and agree that using a vector would simplify things, but that isn't an option.
The code for the Mergesort / merge algorithm can't be modified either.
I need to run the merge sort as follows:
starting from 100 all the way to 1000, in increments of 100. For each increment I run it 5 times, and each of those runs sorts 1000 times.
That being said, everything works fine until my loop reaches 700, where it crashes with the error: "Unhandled exception at 0x75612F71 in msdebug.exe: Microsoft C++ exception: std::bad_alloc at memory location 0x010672F4."
Here is my code:
#include <cstdlib> // rand, srand
#include <ctime>   // time, clock_t

int const size = 6;
int const size2 = 1001;
int const times = 6;
int const interval = 11;

void merge(int arr[], int p, int q, int r)
{
    int n1 = q - p + 1;
    int n2 = r - q;
    int * L = new int[n1 + 1];
    int * R = new int[n2 + 1]; // line giving the error after 700
    for (int i = 1; i <= n1; i++)
    {
        L[i] = arr[p + i - 1];
    }
    for (int j = 1; j <= n2; j++)
    {
        R[j] = arr[q + j];
    }
    L[n1 + 1] = 32768;
    R[n2 + 1] = 32768;
    int i, j;
    i = j = 1;
    for (int k = p; k <= r; k++)
    {
        if (L[i] <= R[j])
        {
            arr[k] = L[i];
            i++;
        }
        else
        {
            arr[k] = R[j];
            j++;
        }
    }
}

void mergeSort(int arr[], int p, int r)
{
    int q;
    if (p < r)
    {
        q = ((p + r) / 2);
        mergeSort(arr, p, q);
        mergeSort(arr, (q + 1), r);
        merge(arr, p, q, r);
    }
}

void copyArray(int original[][size2], int copy[], int row, int finish)
{
    int i = 1;
    while (i <= finish)
    {
        copy[i] = original[row][i];
        i++;
    }
}

void copyOneD(int orig[], int cop[])
{
    for (int i = 1; i < size2; i++)
    {
        cop[i] = orig[i];
    }
}
int main()
{
    clock_t start, end;
    srand(time(NULL));
    int arr[size][size2];
    int arr2[size2];
    int arrCopy[size2];
    double tMergeSort[times][interval];
    double avgTmergeSort[11];
    /*for (int i = 1; i < (size2); i++)
    {
        arr2[i] = rand();
    }*/
    for (int i = 1; i < size; i++)
    {
        for (int j = 1; j < size2; j++)
        {
            arr[i][j] = rand();
        }
    }
    for (int x = 100; x <= 1000; x = x + 100) // This loop crashes at >= 700
    {
        for (int r = 1; r <= 5; r++)
        {
            copyArray(arr, arr2, r, 1001);
            for (int k = 0; k < 1000; k++)
            {
                copyOneD(arr2, arrCopy);
                mergeSort(arrCopy, 1, x);
            }
        }
    }
    return 0;
}
You can ignore the code and the arrays; those functions work fine.
Everything works until the loop reaches x = 700 or higher, and then it crashes.
I had a theory that the computer runs out of memory for the pointers in the merge algorithm, but when I tried to use delete it also crashed.
Any help and suggestions are appreciated.
Thanks
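A note for later readers, based on the code as posted: merge() never frees L and R, so the millions of merge() calls across the test loops steadily leak memory until new eventually throws std::bad_alloc. Worse, the sentinel stores L[n1 + 1] = 32768 and R[n2 + 1] = 32768 write one element past arrays allocated with n1 + 1 and n2 + 1 elements, corrupting the heap, which is why adding delete also appeared to crash. (copyArray(arr, arr2, r, 1001) likewise touches index 1001 of arrays whose last valid index is 1000.) If the no-modification constraint were ever lifted, a minimal sketch of the fix inside merge() would be:

int * L = new int[n1 + 2]; // +2: slots 1..n1 hold data, slot n1+1 holds the sentinel
int * R = new int[n2 + 2];
// ... copy, set sentinels, and merge exactly as before ...
delete[] L; // without these, repeated calls leak until new throws std::bad_alloc
delete[] R;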

Fibonacci numbers - dynamic array

I want to write a Fibonacci number program using a dynamic array allocated inside a function. If I allocate the array in the function, where must I delete it? Here is the code:
#include <iostream>
using namespace std;

int* fibo(int);

int main()
{
    int *fibonacci, n;
    cout << "Enter how many fibonacci numbers you want to print: ";
    cin >> n;
    fibonacci = fibo(n);
    for (int i = 0; i < n; i++)
        cout << fibonacci[i] << " ";
    //for (int i = 0; i < n; i++)
    //    delete w_fibo[i];
    //delete[] w_fibo;
    return 0;
}

int* fibo(int n)
{
    int* w_fibo = new int[n];
    if (n >= 1)
        w_fibo[0] = 1;
    if (n >= 2)
        w_fibo[1] = 1;
    for (int i = 2; i < n; i++) // i stays below n, so nothing is written past the end
        w_fibo[i] = w_fibo[i - 1] + w_fibo[i - 2];
    return w_fibo;
}
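To answer the question as literally asked: the caller owns the pointer returned by fibo(), so a single delete[] in main, after the array has been used, is all that is needed. It must be delete[] (matching the new int[n]), and it applies to the name in main, fibonacci, not w_fibo:

for (int i = 0; i < n; i++)
    cout << fibonacci[i] << " ";
delete[] fibonacci; // one delete[] for the one new int[n] inside fibo()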
You don't have to allocate an array at all! A better iterative Fibonacci could look like this:
int fib2 (int n) {
    int i = 1, j = 0; // after t iterations: j == F(t) and i == F(t-1)
    for (int k = 0; k < n; k++) {
        j += i;       // j becomes the next Fibonacci number
        i = j - i;    // i becomes the previous value of j
    }
    return j;         // F(n)
}
You can also get the n-th Fibonacci number using the following method. It runs in O(log n) time, so it is very efficient:
int fib3 (int n) {
    // Fast doubling via 2x2 matrix exponentiation:
    // (i, j) accumulate the result, (k, h) hold the current power of the base matrix.
    int i = 1, j = 0, k = 0, h = 1, t = 0;
    while (n > 0) {
        if (n % 2) { // odd: multiply the accumulator by the current matrix power
            t = j * h;
            j = i * h + j * k + t;
            i = i * k + t;
        }
        // square the matrix power
        t = h * h;
        h = 2 * k * h + t;
        k = k * k + t;
        n /= 2;
    }
    return j;
}
If you build a std::vector<int> inside fibo() and return it by value, the memory management is taken care of for you. Note that reserve() only sets capacity; it does not create elements, so the vector must be given an actual size before it is indexed:
#include <iostream>
#include <vector>
using namespace std;

std::vector<int> fibo(int n)
{
    std::vector<int> w_fibo(n); // sized, not just reserved: operator[] needs real elements
    if (n >= 1)
        w_fibo[0] = 1;
    if (n >= 2)
        w_fibo[1] = 1;
    for (int i = 2; i < n; i++)
        w_fibo[i] = w_fibo[i - 1] + w_fibo[i - 2];
    return w_fibo;
}
int main()
{
    int n = 10;
    std::vector<int> fibonacci = fibo(n);
    for (int i = 0; i < n; i++)
        cout << fibonacci[i] << " ";
}
NOTE: This is guaranteed to avoid needless copying in C++11 (move semantics) and is likely to do so in C++98 as well (copy elision via the return-value optimization).
This is an old question, but in case someone passes by, this might be helpful.
If you need an efficient method to get the n-th Fibonacci number, there is a constant-time procedure.
It is based on Binet's formula, which our friends over at math.se are better placed to prove, so feel free to follow that link.
Given a = 1.618 and b = -0.618 (approximate values of the golden ratio and its conjugate), the n-th term is (a^n - b^n) / 2.236, where 2.236 approximates sqrt(5). A good way to round it off (since we are using approximate values) is to add 0.5 and take the floor:
math.floor((math.pow(1.618, n) - math.pow(-0.618, n)) / 2.236 + 0.5)
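For completeness, a C++ rendering of that one-liner (a sketch: the constants are the truncated approximations quoted above, so floating-point error makes it reliable only for small n):

#include <cmath>

// Binet's formula with approximate constants; illustrative, not production-grade
int fib_binet(int n) {
    return (int)std::floor((std::pow(1.618, n) - std::pow(-0.618, n)) / 2.236 + 0.5);
}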