I am rotating an array or vector clockwise and counterclockwise in C++. What is the most efficient way to do this in terms of time complexity?
I used the rotate() function, but I want to know whether there is any faster method.
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;
int main()
{
    vector<int> v;
    for (int i = 0; i < 5; i++)
        v.push_back(i);
    int d = 2;
    rotate(v.begin(), v.begin() + d, v.end());
    return 0;
}
rotate() is a linear-time function, and that is the best you can do.
However, if you need to do multiple rotates, you can accumulate them.
For example: a rotation by 4 followed by a rotation by 5 is the same as a single rotation by 9.
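As a minimal sketch of accumulating several requested rotations into one std::rotate call (the offsets and vector contents here are made up for illustration):

#include <algorithm>
#include <cstddef>
#include <vector>

int main()
{
    std::vector<int> v{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    std::vector<std::size_t> offsets{4, 5};   // the individual rotations we were asked to do

    // Add the offsets modulo the size, then rotate once instead of twice.
    std::size_t total = 0;
    for (std::size_t d : offsets)
        total = (total + d) % v.size();

    std::rotate(v.begin(), v.begin() + total, v.end());   // same result as rotating by 4, then by 5
    return 0;
}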
Or, in fact, in some applications you may not even want to actually rotate.
For example, if you want to rotate by d, you can just write a function that returns v[(i + d) % v.size()] when asked for v[i]. That is a constant-time solution per access. But, like I said, this is application specific.
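As a rough sketch of that idea (the RotatedView wrapper and its names are purely illustrative, not a standard facility):

#include <cstddef>
#include <iostream>
#include <vector>

// A "rotated view": element i of the view is element (i + d) % size of the
// underlying vector, so no element is ever moved.
struct RotatedView {
    const std::vector<int>& v;
    std::size_t d;
    int operator[](std::size_t i) const { return v[(i + d) % v.size()]; }
};

int main()
{
    std::vector<int> v{0, 1, 2, 3, 4};
    RotatedView rv{v, 2};
    for (std::size_t i = 0; i < v.size(); ++i)
        std::cout << rv[i] << ' ';   // prints 2 3 4 0 1
    std::cout << '\n';
    return 0;
}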
General answer for the "can I make XY faster?" kind of question:
Maybe you can. But probably you shouldn't.
std::rotate is designed to be efficient for the average case. That means that if you have a very specific case, it might be possible to write a more performant implementation for that case.
BUT:
Don't bother searching for a more performant implementation for your specific case, because finding it will require you to know the detailed performance of every step you take and the optimizations your compiler will perform.
Don't bother implementing it, because you will have to test it, and your coverage of corner cases still won't be as good as the tests already run against the standard library implementation.
Don't use it, because someone will be irritated, asking themselves why you rolled your own implementation instead of just using the standard library. And someone else will use it for a case where it is not as performant as the standard implementation.
Don't invest time improving the performance of a clear piece of code unless you are 100% sure that it is a performance bottleneck. 100% sure means you have used a profiler and pinpointed the exact location of the bottleneck.
#include <iostream>
using namespace std;

// Rotate the array left by one position.
void rotatebyone(int arr[], int n) {
    int temp = arr[0];
    for (int i = 0; i < n - 1; i++)
        arr[i] = arr[i + 1];
    arr[n - 1] = temp;
}

int main()
{
    int arr[] = {2, 3, 4, 5, 6, 7, 8};
    int m = sizeof(arr) / sizeof(arr[0]);
    int d = 1;
    for (int i = 0; i < d; i++) {   // apply the single-step rotation d times
        rotatebyone(arr, m);
    }
    for (int i = 0; i < m; i++) {
        cout << arr[i] << " ";
    }
    return 0;
}
My understanding is that C++ best practice is to define variables in the smallest scope possible.
My understanding is that the primary reason for this is that it helps prevent accidental reuse. In addition, there is almost never a performance hit for doing so (or so I have been told). On the contrary, people seem to indicate that a compiler may actually be able to produce equivalent or better code when variables are defined locally. For example, the following two functions produce the same binaries on Godbolt:
#include <cstdio>
#include <cstdlib>
void printRand1() {
    char val;
    for (size_t i = 0; i < 100; ++i) {
        val = rand();
        puts(&val);
    }
}

void printRand2() {
    for (size_t i = 0; i < 100; ++i) {
        const char val = rand();
        puts(&val);
    }
}
So in this case, version 2 is clearly preferable. This is a context I can totally agree with and understand.
What is not clear to me is whether the same logic should be applied to larger data types such as arrays or vectors. One particular thing that I find a lot in code is something like this:
#include <cstdio>
#include <cstdlib>
#include <vector>
struct Bob {
    std::vector<char> buffer;
    void bar(int N) {
        buffer.resize(N);
        for (auto & elem : buffer) {
            elem = rand();
            puts(&elem);
        }
    }
};

void bob() {
    Bob obj;
    obj.bar(100);
}
despite the fact that we could have localized the data better in this dumb example:
#include <cstdio>
#include <cstdlib>
#include <vector>
struct Bob {
    void bar(int N) {
        std::vector<char> buffer(N);
        for (auto & elem : buffer) {
            elem = rand();
            puts(&elem);
        }
    }
};

void bob() {
    Bob obj;
    obj.bar(100);
}
Note: Before you guys jump on this, I totally realize that you don't actually need a vector in this example. I am just making a stupid example so that the binary code is not too large on Godbolt.
The rationale here for NOT localizing the data (aka. Snippet 1) is that the buffer could be some large vector, and we don't want to keep reallocating it every time we call the function.
The rationale for Snippet 2 is to localize the data better.
So what logic should I apply for this scenario? I am interested in the case where you actually need a vector (in this case you don't).
Should I follow the localization logic, or should I go with the logic that I should try to prevent repeated reallocations?
I realize that in an actual application, you would want to benchmark the performance, not the compiled size in Godbolt. But I wonder what should be my default style for this scenario (before I start profiling the code).
The main consideration in the scenario you describe is: "Is the buffer an integral part of what a Bob is? Or is it just something we use in the implementation of bar()?"
If every Bob has a sequence of contiguous chars throughout its life as a Bob, then that should be a member variable. If you only form that sequence to run bar(), then by the "smallest relevant scope" rule, that vector should only exist as a local variable inside bar().
Now, the above is the general-case answer. Sometimes, for reasons of performance, you may end up breaking your clean and reasonable abstractions. For example: you might have some single vector allocated and just have it associated with a Bob for a period of time, then dissociate the buffer from your Bob but keep it in some buffer cache. But don't resort to these kinds of contortions unless you have a very good reason to.
In version 2, memory will be allocated and deallocated with each bar() call, while version 1 will reuse the already allocated chunk. For a single call it doesn't matter; for multiple calls, version 1 would be preferred.
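A minimal sketch of the effect being described (the capacity() printouts are just for illustration): with a member vector, a second resize() to the same size normally does not allocate again, whereas a local vector allocates on every call.

#include <cstdio>
#include <vector>

struct Bob {
    std::vector<char> buffer;
    void bar(int N) {
        buffer.resize(N);   // after the first call, capacity is usually already large enough
        std::printf("capacity: %zu\n", buffer.capacity());
    }
};

int main()
{
    Bob obj;
    obj.bar(100);   // allocates
    obj.bar(100);   // typically reuses the same allocation
    return 0;
}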
I've recently switched from MATLAB to C++ in order to run simulations faster, but my code still runs slowly. I'm pretty positive that there is much to improve in terms of memory usage.
Consider the following code; it shows an example of two array/vector declarations that I use in a simulation.
One has a known fixed length (Array01) and the other an unknown length (array02) that changes during the run.
The question here is: what is the best/proper/most efficient way of declaring these variables (for both array types) in terms of memory usage and performance?
# include <iostream>
# include <vector>
# include <ctime>
# include <algorithm>
using namespace std;
const int n = 1000;
const int m= 100000;
int main()
{
srand((unsigned)time(NULL));
vector <double> array02;
vector <vector<double>> Array01(n,m);
for (unsigned int i=0; i<n; i++)
{
for (unsigned int j=0; j<m;j++)
{
array02.clear();
rr = rand() % 10;
for (unsigned int l = 0 ; l<rr <l++)
{
array02.pushback(l);
}
// perform some calculation with array01 and array02
}
}
}
You should consider defining your own Matrix class with a void resize(unsigned width, unsigned height) member function, a double get(unsigned i, unsigned j) inlined member function, and/or a double& at(unsigned i, unsigned j) inlined member function (both giving the Mi,j element). The matrix's internal data could be a one-dimensional array or vector of doubles. Using a vector of vectors (all of the same size) is not the best (or fastest) way to represent a matrix.
#include <cassert>
#include <vector>

class Matrix {
    std::vector<double> data;
    unsigned width, height;
public:
    Matrix() : data(), width(0), height(0) {};
    ~Matrix() = default;
    /// etc..., see rule of five
    void resize(unsigned w, unsigned h) {
        data.resize(w * h);
        width = w; height = h;
    }
    // element at row i, column j, stored row-major
    double get(unsigned i, unsigned j) const {
        assert(i < height && j < width);
        return data[i * width + j];
    }
    double& at(unsigned i, unsigned j) {
        assert(i < height && j < width);
        return data[i * width + j];
    }
}; // end class Matrix
Read also about the rule of five.
You could also try Scilab (it is free software). It is similar to Matlab and might perform differently. Don't forget to use a recent version.
BTW, there are tons of existing C++ numerical libraries dealing with matrices. Consider using one of them. If performance is of paramount importance, don't forget to ask your compiler to optimize your code after you have debugged it.
Assuming you are on Linux (which I recommend for numerical computations; it is significant that most supercomputers run Linux), compile with g++ -std=c++11 -Wall -Wextra -g during the debugging phase, then with g++ -std=c++11 -Wall -Wextra -mtune=native -O3 when benchmarking. Don't forget to profile, and remember that premature optimization is evil (you first need to make your program correct).
You might even spend weeks, months, or perhaps years of work using techniques like OpenMP, OpenCL, MPI, pthreads or std::thread for parallelization (a difficult subject that takes years to master).
If your matrix is big and/or has additional properties (it is sparse, triangular, symmetric, etc.), there is a great deal of mathematical and computer-science knowledge to master to improve the performance. You could do a PhD on that and spend your entire life on the subject. So go to your university library and read some books on numerical analysis and linear algebra.
For random numbers C++11 gives you <random>; BTW use C++11 or C++14, not some earlier version of C++.
Read also http://floating-point-gui.de/ and a good book about C++ programming.
PS. I don't claim any particular expertise in numerical computation. I much prefer symbolic computation.
First of all, don't try to reinvent the wheel :) Try to use a heavily optimized numerical library, for example:
Intel MKL (Fastest and most used math library for Intel and compatible processors)
LAPACK++ (library for high performance linear algebra)
Boost (not only numerical, but solves almost any problem)
Second: if you need a matrix for a very simple program, use the vector[i + width * j] notation. It's faster because you avoid the extra allocations (and indirection) of a vector of vectors.
Your example doesn't even compile. I tried to rewrite it a little:
#include <cstdlib>
#include <ctime>
#include <vector>

int main()
{
    const int rowCount = 1000;
    const int columnCount = 1000;
    srand(time(nullptr));

    // Declare matrix
    std::vector<double> matrix;
    // Preallocate elements (faster insertion later)
    matrix.reserve(rowCount * columnCount);

    // Insert elements
    for (size_t i = 0; i < rowCount * columnCount; ++i) {
        matrix.push_back(rand() % 10);
    }

    // Perform some calculation with the matrix.
    // For example, this is the matrix element at row 1, column 3 (row-major):
    double element_1_3 = matrix[1 * columnCount + 3];

    return EXIT_SUCCESS;
}
Now the speed depends on rand() (which is slow).
As people said:
Prefer a 1D array over a 2D array (vector of vectors) for matrices.
Don't reinvent the wheel; use an existing library. I think the Eigen library is the best fit for you, judging from your code. It also generates highly optimized code, since it uses C++ template metaprogramming to do work at compile time wherever possible.
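As a rough illustration of what that looks like (this assumes you have the third-party Eigen library installed; the sizes are arbitrary):

#include <iostream>
#include <Eigen/Dense>   // third-party header, not part of the standard library

int main()
{
    // A 1000 x 100 matrix of doubles, filled with random values in [-1, 1].
    Eigen::MatrixXd a = Eigen::MatrixXd::Random(1000, 100);

    // Eigen generates tight, vectorized code for expressions like this.
    Eigen::MatrixXd b = a.transpose() * a;   // 100 x 100 result

    std::cout << "b(0, 0) = " << b(0, 0) << '\n';
    return 0;
}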
I was wondering if there is any STL algorithm which produces the same result as the following code:
std::vector<int> data;
std::vector<int> counter(N); // I know in advance that all values in data
                             // are between 0 and N-1
for (int i = 0; i < data.size(); ++i)
    counter[data[i]]++;
This code simply computes the histogram of my integer data, with a pre-defined bin size of one.
I know that I should avoid loops as much as I can, as the equivalents written with STL algorithms are much better optimized than what the majority of C++ programmers may come up with.
Any suggestions?
Thank you in advance, Giuseppe
Well, you can certainly at least clean up the loop a bit:
for (auto i : data)
    ++counter[i];
You could (for example) use std::for_each instead:
std::for_each(data.begin(), data.end(), [&counter](int i) { ++counter[i]; });
...but that doesn't really look like much (if any) of an improvement to me.
I don't think there's a more efficient way of doing this. You're right about avoiding loops and preferring the STL in most cases, but this only applies to bigger, overly complicated loops which are harder to write and maintain, and therefore likely to be suboptimal.
Looking at the problem at the assembly level, the only way to compute this is essentially the way you have it in your example. Since C/C++ loops translate to assembly very efficiently with no unnecessary overhead, I believe no STL function could perform this faster than your algorithm.
There is an STL function called count, but its complexity is linear ( O(n) ), the same as your solution's.
If you really want to squeeze the maximum out of every CPU cycle, then consider using C-style arrays and a separate counter variable. The overhead introduced by vectors is barely even measurable, but if there is any, that's the only opportunity I see for optimization here. Not that I would suggest it, but I'm afraid that's the only way you can get a hair more speed out of this.
If you think about it, in order to count the occurrences of elements in a vector, each element has to be "visited" at least once; there's no avoiding it.
A simple loop like this is already the most efficient approach. You can try to unroll it, but that's probably the best you can do. STL or not, I doubt there's a better algorithm.
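For what it's worth, a sketch of what "unrolling" could look like here: keeping several partial counter arrays to reduce the dependency on a single counter slot, then merging them at the end (whether this actually helps depends heavily on the data and the compiler).

#include <cstddef>
#include <vector>

// Histogram with a manually unrolled loop: four partial counters are updated
// in each iteration and merged afterwards.
std::vector<int> histogram(const std::vector<int>& data, int N)
{
    std::vector<int> c0(N), c1(N), c2(N), c3(N);
    std::size_t i = 0;
    for (; i + 4 <= data.size(); i += 4) {
        ++c0[data[i]];
        ++c1[data[i + 1]];
        ++c2[data[i + 2]];
        ++c3[data[i + 3]];
    }
    for (; i < data.size(); ++i)   // leftover elements
        ++c0[data[i]];
    for (int v = 0; v < N; ++v)    // merge the partial counters
        c0[v] += c1[v] + c2[v] + c3[v];
    return c0;
}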
You can use for_each and a lambda function. Check this example:
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>

const int N = 10;
using namespace std;

int main()
{
    srand(time(0));
    std::vector<int> counter(N);
    std::vector<int> data(N);

    // Fill data with random values in [0, N-1].
    generate(data.begin(), data.end(), [] { return rand() % N; });
    for (int i = 0; i < N; i++)
        cout << data[i] << endl;
    cout << endl;

    // Count occurrences.
    for_each(data.begin(), data.end(), [&counter](int i) { ++counter[i]; });
    for (int i = 0; i < N; i++)
        cout << counter[i] << endl;
}
I am having trouble with rand_r. I have a simulation that generates millions of random numbers. I have noticed that at a certain point in time, these numbers are no longer uniform. What could be the problem?
What I do: I create an instance of a generator and give it its own seed.
mainRGen= new nativeRandRUni(idumSeed_g);
here is the class/object def:
class nativeRandRUni {
public:
    unsigned seed;
    nativeRandRUni(unsigned sd) { seed = sd; }
    float genP()      { return rand_r(&seed) / float(RAND_MAX); } // [0,1]
    int   genI(int R) { return rand_r(&seed) % R; }               // [0,R-1]
};
numbers are simply generated by:
newIntNumber= mainRGen->genI(desired_max);
newFloatNumber= mainRGen->genP();
The simulations have the problem described above. I know this is happening because I have checked the distribution of the generated numbers after the point in time at which a signature shows up in the results (see the top image at http://ubuntuone.com/0tbfidZaXfGNTfiVr3x7DR).
Also, if I print the seed at t-1 and t, t being the time point of the signature, I can see the seed changing by an order of magnitude, from 263069042 to 1069048066.
If I run the code with a different seed, the problem is always present, but at different time points.
Also, if I use rand() instead of my object, all goes well... I DO need the object because sometimes I use threads. The example above does not have threads.
I am really lost here, any clues?
EDIT:
It can be reproduced by looping enough times; the problem is that, like I said, it takes millions of iterations for the problem to arise. For seed -158342163 I get it at generation t=134065568. One can check the numbers generated before (uniform) and after (not uniform). I get the same problem if I change the seed manually at given t's, see (*) in the code. Something I also did not expect to happen.
#include <cstdlib>     // rand_r, RAND_MAX (POSIX)
#include <fstream>
#include <iostream>
#include <sstream>

using std::cout;
using std::endl;
using std::ofstream;

class nativeRandRUni {
public:
    unsigned seed;
    long count;
    nativeRandRUni(unsigned sd) { seed = sd; count = 0; }
    float genP()      { count++; return rand_r(&seed) / float(RAND_MAX); } // [0,1]
    int   genI(int R) { count++; return rand_r(&seed) % R; }               // [0,R-1]
};

int main(int argc, char *argv[]) {
    long timePointOfProblem = 134065568;
    nativeRandRUni* mainRGen = new nativeRandRUni(-158342163);
    int rr;
    //ofstream* fout_metaAux = new ofstream();
    //fout_metaAux->open("random.numbers");
    for (int i = 0; i < timePointOfProblem; i++) {
        rr = mainRGen->genI(1009200);
        //(*fout_metaAux) << rr << endl;
        //if (i%1000==0) mainRGen->seed = 111111; //(*) FORCE
    }
    //fout_metaAux->close();
}
Given that random numbers are key to your simulation, you should consider implementing your own generator. I don't know which algorithm rand_r is using, but it could be something pretty weak like a linear congruential generator.
I'd look into implementing something fast and with good statistical qualities, where you know the underlying algorithm. I'd start by looking at the Mersenne Twister:
http://en.wikipedia.org/wiki/Mersenne_twister
It's simple to implement and very fast - it requires no divisions.
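You don't necessarily have to write it yourself, either. A minimal sketch of a replacement for the generator class in the question, using the standard Mersenne Twister from C++11's <random> (the class and member names mirror the ones above and are otherwise arbitrary):

#include <random>

class nativeRandMtUni {
public:
    std::mt19937 engine;                       // Mersenne Twister engine
    explicit nativeRandMtUni(unsigned sd) : engine(sd) {}

    // Uniform float in [0,1)
    float genP() {
        return std::uniform_real_distribution<float>(0.0f, 1.0f)(engine);
    }
    // Uniform int in [0, R-1], without the modulo bias of rand_r() % R
    int genI(int R) {
        return std::uniform_int_distribution<int>(0, R - 1)(engine);
    }
};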
I ended up trying a simple solution from Boost, changing the generator to:
#include <boost/random.hpp>   // assumed include; the original snippet omitted it
using namespace boost;        // assumed, since the typedefs below are unqualified

class nativeRandRUni {
public:
    typedef mt19937 EngineType;
    typedef uniform_real<> DistributionType;
    typedef variate_generator<EngineType, DistributionType> VariateGeneratorType;
    nativeRandRUni(long s, float min, float max) : gen(EngineType(s), DistributionType(min, max)) {}
    VariateGeneratorType gen;
};
I don't get the problem anymore... Though it solved the issue, I don't feel very comfortable with not understanding what it was. I think Rafael is right: I should not trust rand_r for this intensive a number of generations.
Now, this is slower than before, so I may look for ways to optimize it.
QUESTION: Would a Mersenne Twister implementation in principle be faster?
And thanks to all!
I was trying this problem on SPOJ.
First of all, I came up with a fairly trivial O(b log b) algorithm (refer to the problem for what b is). But since the author of the problem gave the constraint b ∈ [0, 10^7], I was not convinced it would pass. Anyway, out of sheer belief I coded it as follows:
#include <stdio.h>
#include <iostream>
#include <algorithm>
#include <cmath>
#include <vector>
#include <cstdlib>
#include <stack>
#include <queue>
#include <string>
#include <cstring>
#define PR(x) cout<<#x"="<<x<<endl
#define READ2(x,y) scanf("%d %d",&x,&y)
#define REP(i,a) for(long long i=0;i<a;i++)
#define READ(x) scanf("%d",&x)
#define PRARR(x,n) for(long long i=0;i<n;i++)printf(#x"[%d]=\t%d\n",i,x[i])
using namespace std;

struct node {
    int val;
    int idx;
};

bool operator<(node a, node b) { return a.val < b.val; }

node contain[10000001];

int main() {
    int mx = 1, count = 1, t, n;
    scanf("%d", &t);
    while (t--) {
        count = 1; mx = 1;
        scanf("%d", &n);
        for (int i = 0; i < n; i++) {
            scanf("%d", &contain[i].val);
            contain[i].idx = i;
        }
        sort(contain, contain + n);
        for (int j = 1; j < n; j++) {
            if (contain[j].idx > contain[j-1].idx)
                count++;
            else count = 1;
            mx = max(count, mx);
        }
        printf("%d\n", n - mx);
    }
}
And it passed in 0.01 s on the SPOJ server (which is known to be slow).
But I soon came up with an O(b) algorithm, given below:
#include <stdio.h>
#include <iostream>
#include <algorithm>
#include <cmath>
#include <vector>
#include <cstdlib>
#include <stack>
#include <queue>
#include <string>
#include <cstring>
#define PR(x) printf(#x"=%d\n",x)
#define READ2(x,y) scanf("%d %d",&x,&y)
#define REP(i,a) for(int i=0;i<a;i++)
#define READ(x) scanf("%d",&x)
#define PRARR(x,n) for(int i=0;i<n;i++)printf(#x"[%d]=\t%d\n",i,x[i])
using namespace std;

int val[1001];
int arr[1001];

int main() {
    int t;
    int n;
    scanf("%d", &t);
    while (t--) {
        scanf("%d", &n);
        int mn = 2 << 29, count = 1, mx = 1;
        for (int i = 0; i < n; i++) {
            scanf("%d", &arr[i]);
            if (arr[i] < mn) { mn = arr[i]; }
        }
        for (int i = 0; i < n; i++) {
            val[arr[i] - mn] = i;
        }
        for (int i = 1; i < n; i++) {
            if (val[i] > val[i-1]) count++;
            else {
                count = 1;
            }
            if (mx < count) mx = count;
        }
        printf("%d\n", n - mx);
    }
}
But surprisingly it took 0.14 s :O
Now my question is: isn't O(b) better than O(b log b) for b > 2? Then why so much difference in time? One of the members of the community suggested that it may be due to cache misses: the O(b) code is less localized compared to the O(b log b) one. But I don't see that causing a difference of 0.10 s, especially for fewer than 1000 runs of the code. (Yes, b is actually less than 1000. I don't know why the problem setter exaggerated so much.)
EDIT: I see all answers are going towards the hidden constant values in asymptotic notation that often cause disparity in the running times of algorithms. But if you look at the codes, you will realize that all I am doing is replacing the call to sort with another traversal of the loop. Now, I am assuming sort accesses each element of the array at least once. Wouldn't that make both programs even closer if we think in terms of the number of lines that get executed? Besides, yes, my past experience with SPOJ tells me that I/O makes a drastic impact on the running time of a program, but I am using the same I/O routines in both codes.
Big O notation describes how long the function takes as the input set approaches infinite size. If you have large enough data sets, O(n) will always beat O(n log n).
In practice, some 'poorer-performing' algorithms are faster because of the other hidden variables in the big O formula. Some more scalable algorithms can be slower. The difference becomes more arbitrary as the input set becomes smaller.
I learned all this the hard way, when I spent hours implementing a scalable solution, and when testing, found that it would only be faster for large data sets.
Edit:
Regarding the specific case, some people mentioned that the same line of code can vary extremely with regards to performance. This is likely the case here. That means that the 'hidden variables' in the big O formula are very relevant. The better you understand how a computer works on the inside, the more optimization techniques you have up your sleeve.
If you only remember one thing, remember this. Never compare two algorithms' performance by just reading the code. If it's that important, time an actual implementation on realistic data sets.
I/O operations (scanf(), printf()) are biasing the result.
Those operations are notoriously slow and show a great discrepancy when timing them. You should never measure the performance of code that includes any I/O operations, unless those operations are what you are trying to measure.
So, remove those calls and try again.
I will also point out that 0.1 s is very small. The 0.1 s difference may be down to the time it takes to load the executable and prepare the code for execution.
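A minimal sketch of measuring just the computation while keeping the I/O outside the timed region (the solve() function and the data here are placeholders for the actual inner loop):

#include <chrono>
#include <cstdio>
#include <vector>

// Placeholder for the real work: here, just a dummy pass over the data.
long long solve(const std::vector<int>& data) {
    long long acc = 0;
    for (int x : data) acc += x;
    return acc;
}

int main() {
    std::vector<int> data(1000, 1);          // read the input with scanf() before timing

    auto start = std::chrono::steady_clock::now();
    long long result = solve(data);          // only the computation is inside the timed region
    auto stop = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(stop - start).count();
    std::printf("result=%lld, took %.3f ms\n", result, ms);   // print the result after timing
    return 0;
}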
Big-O notation isn't a formula that you can plug arbitrary values of n into. It merely describes the growth of the function as n heads to infinity.
This is a more interesting question than one might suspect. The O() concept can be useful, but it is not always as useful as some think. This is particularly true for logarithmic orders. Algebraically, a logarithm really has an order of zero, which is to say that log(n)/n^epsilon tends to zero for any positive epsilon.
More often than we like to think, the log factors in order calculations don't really matter.
However, Kendall Frey is right. For sufficiently large data sets, O(n*log(n)) will eventually lose. It's only that the data set may have to be very large for the logarithmic difference to show.
I looked at your solutions on SPOJ. I noticed that your O(n log n) solution takes 79M of memory while the O(n) one takes a very small amount, shown as 0K. I looked at the other solutions too; most of the fastest solutions I looked at used a large amount of memory. Now, the most obvious reason I can think of is the implementation of the std::sort() function. It is very nicely implemented, which makes your solution amazingly fast. As for the O(n) solution, I think it may be slow because of the if() {...} else {...}. Try changing it to the ternary operator and let us know if it makes any difference.
Hope it helps!!