Faster way to iterate through range

I'm a competitive programmer, and I've been asking myself if there is any shorter, more elegant way of writing for(int i=0; i<n; ++i). I can only use standard C++, no other libraries.

In C++ competitions there is a well-known set of macros (don't use them in commercial projects). You also asked for a more elegant solution; this one is well known, but certainly not more elegant.
For example, see this TopCoder website:
#define REP(x, n) for(int x = 0; x < (n); ++x)
then in code you can simply write
REP(i,n){
}
One basic complete header I found:
#include <cstdio>
#include <iostream>
#include <algorithm>
#include <string>
#include <vector>
using namespace std;
typedef vector<int> VI;
typedef long long LL;
#define FOR(x, b, e) for(int x = b; x <= (e); ++x)
#define FORD(x, b, e) for(int x = b; x >= (e); --x)
#define REP(x, n) for(int x = 0; x < (n); ++x)
// typeof is a GCC extension; auto (C++11) is the standard equivalent
#define VAR(v, n) auto v = (n)
#define ALL(c) (c).begin(), (c).end()
#define SIZE(x) ((int)(x).size())
#define FOREACH(i, c) for(VAR(i, (c).begin()); i != (c).end(); ++i)
#define PB push_back
#define ST first
#define ND second
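If you would rather avoid macros, a range-based for loop over a small helper type gets close to the same brevity in standard C++11. The following is only a sketch: IndexRange and sum_below are hypothetical names that do not come from the header above, and the helper only covers the half-open range [0, n).

```cpp
#include <cassert>

// Hypothetical helper (not part of the macro header above):
// a half-open integer range [0, n) usable with range-based for.
class IndexRange {
public:
    class Iterator {
    public:
        explicit Iterator(int value) : value_(value) {}
        int operator*() const { return value_; }
        Iterator& operator++() { ++value_; return *this; }
        bool operator!=(const Iterator& other) const { return value_ != other.value_; }
    private:
        int value_;
    };

    explicit IndexRange(int n) : end_(n) { assert(n >= 0); }
    Iterator begin() const { return Iterator(0); }
    Iterator end() const { return Iterator(end_); }

private:
    int end_;
};

// Example use: sums 0 + 1 + ... + (n - 1), replacing for(int i=0; i<n; ++i).
int sum_below(int n) {
    int total = 0;
    for (int i : IndexRange(n)) {
        total += i;
    }
    return total;
}
```

With this in place the loop header becomes for (int i : IndexRange(n)), which is scoped and type-checked, unlike REP(i, n).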

Without running timed tests, I would assume that both:
for(int i=0; i<n; ++i)
and:
int i=0;
while (i<n)
{
i++;
}
would be extraordinarily close in timing. Perhaps use timestamps within a program that runs both types of loops, and see what the overall time/loop is for each type.
These are the fundamental looping structures of C / C++, so I do not think there would be something that would run faster (but I'm willing to be wrong if I learn something new)
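The timestamp idea above could be sketched with std::chrono as follows. The names TimedResult, run_for_loop, run_while_loop and time_loop are made up for this illustration, and the volatile accumulator is an assumption on my part to stop the optimizer from deleting the loops; treat the numbers it produces as rough indications only.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Value computed by a loop plus the wall-clock time it took.
struct TimedResult {
    std::uint64_t sum;
    std::int64_t nanoseconds;
};

// The for-loop form. The volatile accumulator forces every iteration to run.
std::uint64_t run_for_loop(int n) {
    assert(n >= 0);
    volatile std::uint64_t sum = 0;
    for (int i = 0; i < n; ++i) {
        sum = sum + static_cast<std::uint64_t>(i);
    }
    return sum;
}

// The equivalent while-loop form.
std::uint64_t run_while_loop(int n) {
    assert(n >= 0);
    volatile std::uint64_t sum = 0;
    int i = 0;
    while (i < n) {
        sum = sum + static_cast<std::uint64_t>(i);
        i++;
    }
    return sum;
}

// Takes a timestamp before and after calling loop(n).
template <typename Function>
TimedResult time_loop(Function loop, int n) {
    const auto start = std::chrono::steady_clock::now();
    const std::uint64_t sum = loop(n);
    const auto stop = std::chrono::steady_clock::now();
    const auto elapsed =
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    return TimedResult{sum, static_cast<std::int64_t>(elapsed.count())};
}
```

Calling time_loop(run_for_loop, n) and time_loop(run_while_loop, n) several times with a large n and comparing the nanoseconds fields gives the time per loop for each form.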

Seeing as you didn't specify whether you need to use i, how about [1]:
int i=n+1; while(--i);
It's shorter!
[1] not proven to be correct.

I'm a competitive programmer too. This answer may be off-topic, but I think it will provide some useful ideas.
Personally, I think you shouldn't focus on these types of questions. I don't think there's a big difference between writing for (int i = 1; i <= n; ++i) and FOR(i, 1, n). The second one is obviously shorter and takes less time to type, but once you get to a high enough level, problem-solving skills matter much, much more than typing speed. Don't trust me? See tourist's code.
I think what you should focus on is improving your problem-solving skills. The best way is to solve as many problems as possible. Doing so will also increase your typing speed as a side effect.

Related

Fastest way to create a vector of indices from distance matrix in C++

I have a distance matrix D of size n by n and a constant L as input. I need to create a vector v containing all entries in D whose value is at most L. Here v must be in a specific order, v = [v1 v2 .. vn], where vi contains the entries in the ith row of D with value at most L. The order of entries within each vi is not important.
I wonder whether there is a fast way to create v using a vector, an array or any other data structure, plus parallelization. What I did is use for loops, and it is very slow for large n.
vector<int> v;
for (int i=0; i < n; ++i){
for (int j=0; j < n; ++j){
if (D(i,j) <= L) v.push_back(j);
}
}

The best way depends mostly on the context. If you are looking for GPU parallelization, you should take a look at OpenCL.
For CPU-based parallelization, the C++ standard thread library (#include <thread>) is probably your best bet, but you need to be careful:
Threads take time to create, so if n is relatively small (<1000 or so) it will slow you down
D(i,j) has to be readable by multiple threads at the same time
v has to be writable by multiple threads; a standard vector won't cut it
v may be a 2D vector with the vi as its subvectors, but these have to be initialized before the parallelization:
std::vector<std::vector<int>> v;
v.reserve(n);
for(size_t i = 0; i < n; i++)
{
v.push_back(std::vector<int>());
}
You need to decide how many threads you want to use. If this is for one machine only, hardcoding is a valid option. There is a function in the thread library that reports the number of supported threads, but it is more of a hint than a trustworthy value.
size_t threadAmount = std::thread::hardware_concurrency(); //how many threads should run; hardware_concurrency() is only a hint, not necessarily optimal, and may return 0
std::vector<std::thread> t; //to store the threads in
t.reserve(threadAmount-1); //you need threadAmount-1 extra threads (we already have the main-thread)
To start a thread you need a function it can execute. In this case this is to read through part of your matrix.
void CheckPart(size_t start, size_t amount, int L, std::vector<std::vector<int>>& vec)
{
for(size_t i = start; i < amount+start; i++)
{
for(size_t j = 0; j < n; j++)
{
if(D(i,j) <= L)
{
vec[i].push_back(j);
}
}
}
}
Now you need to split your matrix in parts of about n/threadAmount rows and start the threads. The thread constructor needs a function and its parameter, but it will always try to copy the parameters, even if the function wants a reference. To prevent this, you need to force using a reference with std::ref()
int i = 0;
int rows;
for(size_t a = 0; a < threadAmount-1; a++)
{
rows = n/threadAmount + ((n%threadAmount>a)?1:0);
t.push_back(std::thread(CheckPart, i, rows, L, std::ref(v)));
i += rows;
}
The threads are now running, and all that is left to do is run the last block on the main thread:
CheckPart(i, n/threadAmount, L, v);
After that you need to wait for the threads to finish and clean them up:
for(unsigned int a = 0; a < threadAmount-1; a++)
{
if(t[a].joinable())
{
t[a].join();
}
}
Please note that this is just a quick and dirty example. Different problems might need different implementation, and since I can't guess the context the help I can give is rather limited.
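The row split used in the loop above (n/threadAmount rows per thread, plus one extra row for the first n%threadAmount threads) is easy to get wrong, so it can help to check it in isolation. The following is a small sketch; RowBlock and split_rows are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A contiguous block of matrix rows: [start, start + count).
struct RowBlock {
    std::size_t start;
    std::size_t count;
};

// Splits n rows into thread_amount blocks whose sizes differ by at most one,
// using the same formula as the thread-launching loop above.
std::vector<RowBlock> split_rows(std::size_t n, std::size_t thread_amount) {
    assert(thread_amount > 0);  // hardware_concurrency() may return 0, so check first
    std::vector<RowBlock> blocks;
    blocks.reserve(thread_amount);
    std::size_t start = 0;
    for (std::size_t a = 0; a < thread_amount; ++a) {
        const std::size_t rows =
            n / thread_amount + ((n % thread_amount > a) ? 1 : 0);
        blocks.push_back(RowBlock{start, rows});
        start += rows;
    }
    return blocks;
}
```

For n = 10 and 4 threads this gives blocks of 3, 3, 2 and 2 rows, so every row is handled exactly once; the last block is the one run on the main thread.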

In consideration of the comments, I made the appropriate corrections (in emphasis).
Have you searched for tips on writing high-performance code, threading, asm instructions (if the generated assembly is not exactly what you want) and OpenCL for parallel processing? If not, I strongly recommend it!
In some cases, declaring all loop variables outside the for loop (to avoid declaring them many times) will make it faster, but not in this case (comment from our friend Paddy).
Also, using new instead of vector can be faster, as we see here: Using arrays or std::vectors in C++, what's the performance gap? I tested it, and with vector it is 6 seconds slower than with new, which only takes 1 second. I guess that the safety and ease-of-management guarantees that come with std::vector are not desired when someone is after performance, especially since using new is not so difficult: just avoid heap overflow by calculating sizes carefully and remember to use delete[].
user4581301 is correct here, and the following statement is untrue: Finally, if you build D in an array instead of a matrix (or maybe if you copy D into a constant array, maybe...), it will be much more cache-friendly and will save one for loop statement.

C++ is way slower than MATLAB

I am trying to generate 5000 by 5000 random number matrix. Here is what I do with MATLAB:
for i = 1:100
rand(5000)
end
And here is what I do in C++:
#include <iostream>
#include <stdlib.h>
#include <time.h>
#include <ctime>
using namespace std;
int main(){
int N = 5000;
double ** A = new double*[N];
for (int i=0;i<N;i++)
A[i] = new double[N];
srand(time(NULL));
clock_t start = clock();
for (int k=0;k<100;k++){
for (int i=0;i<N;i++){
for (int j=0;j<N;j++){
A[i][j] = rand();
}
}
}
cout << "T="<< (clock()-start)/(double)(CLOCKS_PER_SEC/1000)<< "ms " << endl;
}
MATLAB takes around 38 seconds while C++ takes around 90 seconds.
In another question, people executed the same code and got same speeds for both C++ and MATLAB.
I am using visual C++ with the following optimizations
I would like to learn what I am missing here? Thank you for all the help.
EDIT: Here is the key thing though...
Why MATLAB is faster than C++ in creating random numbers?
In this question, people gave me answers where their C++ speeds are same as MATLAB. When I use the same code I get way worse speeds and I am trying to understand why.

Your test is flawed, as others have noted, and does not even address the statement made by the title. You are comparing a built-in Matlab function to C++, not Matlab code itself, which in fact executes 100x more slowly than C++. Matlab is just a wrapper around the BLAS/LAPACK libraries in C/Fortran, so one would expect a Matlab script and competently written C++ code to be approximately equivalent, and indeed they are. This code in Matlab 2007b
tic; A = rand(5000); toc
executes in 810ms on my machine and this
#include <iostream>
#include <stdlib.h>
#include <time.h>
#include <ctime>
#define N 5000
int main()
{
srand(time(NULL));
clock_t start = clock();
int num_rows = N,
num_cols = N;
double * A = new double[N*N];
for (int i=0; i<N*N; ++i)
A[i] = rand();
std::cout << "T="<< (clock()-start)/(double)(CLOCKS_PER_SEC/1000)<< "ms " << std::endl;
return 0;
}
executes in 830ms. A slight advantage for Matlab's in-house RNG over rand() is not too surprising. Note also the single indexing. This is how Matlab does it, internally. It then uses a clever indexing system (developed by others) to give you a matrix-like interface to the data.

In your C++ code, you are doing 5000 allocations of double[5000] on the heap. You would probably get much better speed if you did a single allocation of a double[25000000], and then do your own arithmetic to convert your 2 indices to a single one.
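A minimal sketch of that idea, assuming row-major storage: FlatMatrix is a hypothetical name, and std::vector is used here for the single allocation instead of a raw new[] purely to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper: one contiguous allocation of rows * cols doubles,
// with (i, j) mapped to the single index i * cols + j (row-major order).
class FlatMatrix {
public:
    FlatMatrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    // Converts the two indices into the single index into the flat buffer.
    std::size_t flat_index(std::size_t i, std::size_t j) const {
        assert(i < rows_ && j < cols_);
        return i * cols_ + j;
    }

    double& at(std::size_t i, std::size_t j) { return data_[flat_index(i, j)]; }
    double at(std::size_t i, std::size_t j) const { return data_[flat_index(i, j)]; }

    std::size_t size() const { return data_.size(); }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<double> data_;
};
```

For the question's case this would be FlatMatrix A(5000, 5000), filled with A.at(i, j) = rand() in the same nested loops; the whole matrix then sits in one block of memory instead of 5000 separate ones.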

I believe MATLAB utilizes multiple CPU cores on your machine. Have you tried writing a multi-threaded version and measuring the difference?
Also, the quality of the (pseudo) random numbers would make a slight difference (but not that much).

In my experience,
First check that you execute your C++ code in release mode instead of in Debug mode. (Although I see in the picture you are in release mode)
Consider MPI parallelization.
Bear in mind that MATLAB is highly optimized and compiled with the Intel compiler which produces faster executables. You can also try more advanced compilers if you can afford them.
Lastly, you can aggregate the loops by using a function that generates the combinations of i, j in a single loop. (In Python this is common practice, using the product function from the itertools library.)
I hope it helps.

Make Sparse Matrix Multiply Fast

The code is written using C++11. Each process gets two sparse matrices. The test data can be downloaded from [link].
The test data contains 2 files: a0 (sparse matrix 0) and a1 (sparse matrix 1). Each line in a file is "i j v", meaning that row i, column j of the sparse matrix has the value v. i, j and v are all integers.
I use the C++11 unordered_map as the sparse matrix's data structure.
unordered_map<int, unordered_map<int, double> > matrix1 ;
matrix1[i][j] = v ; //means at row i column j of matrix1 is value v;
The following code took about 2 minutes. The compile command is g++ -O2 -std=c++11 ./matmult.cpp.
The g++ version is 4.8.1, on openSUSE 13.1. My computer's info: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz, 4 GB memory.
#include <iostream>
#include <fstream>
#include <unordered_map>
#include <vector>
#include <thread>
using namespace std;
void load(string fn, unordered_map<int,unordered_map<int, double> > &m) {
ifstream input ;
input.open(fn);
int i, j ; double v;
while (input >> i >> j >> v) {
m[i][j] = v;
}
}
unordered_map<int,unordered_map<int, double> > m1;
unordered_map<int,unordered_map<int, double> > m2;
//vector<vector<int> > keys(BLK_SIZE);
int main() {
load("./a0",m1);
load("./a1",m2);
for (auto r1 : m1) {
for (auto r2 : m2) {
double sim = 0.0 ;
for (auto c1 : r1.second) {
auto f = r2.second.find(c1.first);
if (f != r2.second.end()) {
sim += (f->second) * (c1.second) ;
}
}
}
}
return 0;
}
The code above is too slow. How can I make it run faster? I tried multithreading.
The new code follows; the compile command is g++ -O2 -std=c++11 -pthread ./test.cpp. It took about 1 minute. I want it to be faster.
How can I make the task faster? Thank you!
#include <iostream>
#include <fstream>
#include <unordered_map>
#include <vector>
#include <thread>
#define BLK_SIZE 8
using namespace std;
void load(string fn, unordered_map<int,unordered_map<int, double> > &m) {
ifstream input ;
input.open(fn);
int i, j ; double v;
while (input >> i >> j >> v) {
m[i][j] = v;
}
}
unordered_map<int,unordered_map<int, double> > m1;
unordered_map<int,unordered_map<int, double> > m2;
vector<vector<int> > keys(BLK_SIZE);
void thread_sim(int blk_id) {
for (auto row1_id : keys[blk_id]) {
auto r1 = m1[row1_id];
for (auto r2p : m2) {
double sim = 0.0;
for (auto col1 : r1) {
auto f = r2p.second.find(col1.first);
if (f != r2p.second.end()) {
sim += (f->second) * col1.second ;
}
}
}
}
}
int main() {
load("./a0",m1);
load("./a1",m2);
int df = BLK_SIZE - (m1.size() % BLK_SIZE);
int blk_rows = (m1.size() + df) / (BLK_SIZE - 1);
int curr_thread_id = 0;
int index = 0;
for (auto k : m1) {
keys[curr_thread_id].push_back(k.first);
index++;
if (index==blk_rows) {
index = 0;
curr_thread_id++;
}
}
cout << "ok" << endl;
std::thread t[BLK_SIZE];
for (int i = 0 ; i < BLK_SIZE ; ++i){
t[i] = std::thread(thread_sim,i);
}
for (int i = 0; i< BLK_SIZE; ++i)
t[i].join();
return 0 ;
}

Most times when working with sparse matrices one uses more efficient representations than the nested maps you have. Typical choices are Compressed Sparse Row (CSR) or Compressed Sparse Column (CSC). See https://en.wikipedia.org/wiki/Sparse_matrix for details.
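To give a rough idea of what CSR looks like for data in the "i j v" format from the question, here is a sketch. Triplet, CsrMatrix, build_csr and row_dot are names invented for this illustration, and it assumes each (row, column) pair appears at most once. The row-by-row dot product replaces the hash lookups of the nested maps with a linear merge over two sorted index lists.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// One "i j v" line from the input file.
struct Triplet {
    int row;
    int col;
    double value;
};

// Compressed Sparse Row: the entries of row r live in
// col_idx/values at positions [row_ptr[r], row_ptr[r + 1]).
struct CsrMatrix {
    std::vector<std::size_t> row_ptr;
    std::vector<int> col_idx;
    std::vector<double> values;
};

// Builds a CSR matrix; column indices end up sorted within each row.
CsrMatrix build_csr(std::vector<Triplet> entries, std::size_t num_rows) {
    std::sort(entries.begin(), entries.end(),
              [](const Triplet& a, const Triplet& b) {
                  return std::tie(a.row, a.col) < std::tie(b.row, b.col);
              });
    CsrMatrix m;
    m.row_ptr.assign(num_rows + 1, 0);
    for (const Triplet& t : entries) {
        assert(t.row >= 0 && static_cast<std::size_t>(t.row) < num_rows);
        ++m.row_ptr[static_cast<std::size_t>(t.row) + 1];  // count entries per row
        m.col_idx.push_back(t.col);
        m.values.push_back(t.value);
    }
    for (std::size_t r = 0; r < num_rows; ++r) {
        m.row_ptr[r + 1] += m.row_ptr[r];  // prefix sum turns counts into offsets
    }
    return m;
}

// Dot product of row ra of a with row rb of b, merging the sorted column lists.
double row_dot(const CsrMatrix& a, std::size_t ra,
               const CsrMatrix& b, std::size_t rb) {
    std::size_t i = a.row_ptr[ra], i_end = a.row_ptr[ra + 1];
    std::size_t j = b.row_ptr[rb], j_end = b.row_ptr[rb + 1];
    double sum = 0.0;
    while (i < i_end && j < j_end) {
        if (a.col_idx[i] < b.col_idx[j]) {
            ++i;
        } else if (b.col_idx[j] < a.col_idx[i]) {
            ++j;
        } else {
            sum += a.values[i] * b.values[j];
            ++i;
            ++j;
        }
    }
    return sum;
}
```

The inner loop of the question would then call row_dot for each pair of rows, with all data stored in three contiguous arrays per matrix rather than in many small hash tables.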

You haven't specified the time you expect your example to run in or the platform you hope to run on. These are important design constraints in this example.
There are several areas that I can think of for improving the efficiency of this:
Improve the way the data is stored
Improve the multithreading
Improve the algorithm
The first point is geared toward the way the system stores the sparse arrays and the interfaces to enable the data to be read. Nested unordered_maps are a good option when speed isn't important but there may be more specific data structures available that are geared toward this problem. At best you may find a library that provides a better way to store the data than nested maps, at worst you may have to come up with something yourself.
The second point refers to the way multithreading is supported in the language. The original spec for the multithreading system was meant to be platform independent and might miss out on handy features some systems have. Decide what system you want to target and use the OS's threading system. You'll have more control over the way the threading works and can possibly reduce the overhead, but you will lose cross-platform support.
The third point will take a bit of work. Is the way you're multiplying the matrices really the most efficient way, given the nature of the data? I'm no expert on these things, but it is something to consider, even though it will take a bit of effort.
Lastly, you can always be very specific about the platform you're running on and head into the world of assembly programming. Modern CPUs are complicated beasts. They can sometimes perform operations in parallel. For example, you may be able to do SIMD operations or do parallel integer and floating point operations. Doing this does require a deep understanding of what's going on and there are useful tools to help you out. Intel did have a tool called VTune (it may be something else now) that would analyse code and highlight potential bottlenecks. Ultimately, you'll be wanting to eliminate areas of the algorithm where the CPU is idle waiting for something to happen (like waiting for data from RAM) either by finding something else for the CPU to do or improving the algorithm (or both).
Ultimately, in order to improve the overall speed, you'll need to know what is slowing it down. This generally means knowing how to analyse your code and understand the results. Profilers are the general tool for this but there are platform specific tools available as well.
I know this isn't quite what you want but making code fast is really hard and very time consuming.

Is using string.length() in loop efficient?

For example, assuming a string s is this:
for(int x = 0; x < s.length(); x++)
better than this?:
int length = s.length();
for(int x = 0; x < length; x++)
Thanks,
Joel

In general, you should avoid function calls in the condition part of a loop, if the result does not change during the iteration.
The canonical form is therefore:
for (std::size_t x = 0, length = s.length(); x != length; ++x);
Note 3 things here:
The initialization can initialize more than one variable
The condition is expressed with != rather than <
I use pre-increment rather than post-increment
(I also changed the type because a negative length is nonsense and the string interface is defined in terms of std::string::size_type, which is std::size_t on most implementations).
Though... I admit that it's not so much for performance as for readability:
The double initialization means that both x and length scope is as tight as necessary
By memoizing the result the reader is not left in the doubt of whether or not the length may vary during iteration
Using pre-increment is usually better when you do not need to create a temporary with the "old" value
In short: use the best tool for the job at hand :)
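As a concrete use of that loop form, here is a small sketch. The function count_char is a made-up example, not something from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts how often target occurs in s. The length is read once in the
// initializer, and both x and length are scoped to the loop.
std::size_t count_char(const std::string& s, char target) {
    std::size_t count = 0;
    for (std::size_t x = 0, length = s.length(); x != length; ++x) {
        if (s[x] == target) {
            ++count;
        }
    }
    assert(count <= s.length());  // sanity check: cannot match more characters than exist
    return count;
}
```

Because s is taken by const reference, the reader can also see at a glance that the cached length stays valid for the whole loop.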

It depends on the inlining and optimization abilities of the compiler. Generally, the second variant will most likely be faster (better: it will be either faster or as fast as the first snippet, but almost never slower).
However, in most cases it doesn't matter, so people tend to prefer the first variant for its shortness.

It depends on your C++ implementation / library, the only way to be sure is to benchmark it. However, it's effectively certain that the second version will never be slower than the first, so if you don't modify the string within the loop it's a sensible optimisation to make.

How efficient do you want to be?
If you don't modify the string inside the loop, the compiler will easily see that the size doesn't change. Don't make it any more complicated than you have to!

Although I am not necessarily encouraging you to do so, it appears that constantly calling .length() is, surprisingly, faster than storing it in an int (at least on my computer; keep in mind that I'm using an MSI gaming laptop with a 4th-gen i5, but that shouldn't really affect which way is faster).
Test code for constant call:
#include <iostream>
#include <string>
using namespace std;
int main()
{
string g = "01234567890";
for(unsigned int rep = 0; rep < 25; rep++)
{
g += g;
}//for loop used to double the length 25 times.
int a = 0;
//int b = g.length();
for(unsigned int rep = 0; rep < g.length(); rep++)
{
a++;
}
return a;
}
On average, this ran for 385ms according to Code::Blocks
And here's the code that stores the length in a variable:
#include <iostream>
#include <string>
using namespace std;
int main()
{
string g = "01234567890";
for(unsigned int rep = 0; rep < 25; rep++)
{
g += g;
}//for loop used to double the length 25 times.
int a = 0;
int b = g.length();
for(unsigned int rep = 0; rep < b; rep++)
{
a++;
}
return a;
}
And this averaged around 420ms.
I know this question already has an accepted answer, but there haven't been any practically tested answers, so I decided to throw my 2 cents in. I had the same question as you, but I didn't find any helpful answers here, so I ran my own experiment.

Is s.length() inlined, and does it simply return a member variable? Then no. Otherwise you pay the cost of dereferencing and pushing things onto the stack; in other words, you incur all the overhead of a function call on each iteration.

How much one can do with (higher order) macros?

Is it "safe" to give macros names as arguments to other macros to simulate higher order functions?
I.e. where should I look to not shoot myself in the foot?
Here are some snippets:
#define foreach_even(ii, instr) for(int ii = 0; ii < 100; ii += 2) { instr; }
#define foreach_odd(ii, instr) for(int ii = 1; ii < 100; ii += 2) { instr; }
#define sum(foreach_loop, accu) \
foreach_loop(ii, {accu += ii;});
int acc = 0;
sum(foreach_even, acc);
sum(foreach_odd, acc);
What about partial application, can I do that? :
#define foreach(ii, start, end, step, instr) \
for(int ii = start; ii < end; ii += step) { instr; }
#define foreach_even(ii, instr) foreach(ii, 0, 100, 2, instr)
#define foreach_odd(ii, instr) foreach(ii, 1, 100, 2, instr)
#define sum(foreach_loop, accu) \
foreach_loop(ii, {accu += ii;});
int acc = 0;
sum(foreach_even, acc);
sum(foreach_odd, acc);
And can I define a macro inside a macro?
#define apply_first(new_macro, macro, arg) #define new_macro(x) macro(arg,x)

If you're into using preprocessor as much as possible, you may want to try boost.preprocessor.
But be aware that it is not safe to do so. Commas, for instance, cause a great number of problems when using preprocessors. Don't forget that preprocessors do not understand (or even try to understand) any of the code they are generating.
My basic advice is "don't do it", or "do it as cautiously as possible".
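For what it's worth, passing a function-like macro's name as an argument to another macro does work, as long as the inner arguments contain no unparenthesized commas. Here is a cautious sketch based on the snippets in the question; the upper-case macro names and the helper functions are renamed or invented for this example.

```cpp
#include <cassert>

// Partial application: FOREACH_EVEN and FOREACH_ODD fix start, end and step.
#define FOREACH_STEP(ii, start, end, step, instr) \
    for (int ii = (start); ii < (end); ii += (step)) { instr; }
#define FOREACH_EVEN(ii, instr) FOREACH_STEP(ii, 0, 100, 2, instr)
#define FOREACH_ODD(ii, instr) FOREACH_STEP(ii, 1, 100, 2, instr)

// "Higher order" macro: foreach_loop is the NAME of another macro.
// It is only expanded after substitution, when it is followed by "(".
#define SUM(foreach_loop, accu) foreach_loop(ii, accu += ii)

int sum_even_below_100() {
    int acc = 0;
    SUM(FOREACH_EVEN, acc)
    return acc;
}

int sum_odd_below_100() {
    int acc = 0;
    SUM(FOREACH_ODD, acc)
    return acc;
}

// Pitfall: FOREACH_EVEN(k, int a = 0, b = k) would not compile, because the
// preprocessor splits arguments at the comma and sees three of them.

// Evens and odds together cover 0..99.
void check_sums() {
    assert(sum_even_below_100() + sum_odd_below_100() == 4950);
}
```

This only shows that the mechanism works; it does nothing about the lack of scoping, type checking and readable error messages.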

I've implemented a rotten little unit testing framework entirely in the C preprocessor. It has several dozen macros and lots of "macro as an argument to another macro" type stuff.
This kind of thing is not "safe" in a best-practices sense of the word. There are subtle and very powerful ways to shoot yourself in the foot. The unit testing project is a toy that got out of hand.
Don't know if you can nest macro definitions. I doubt it, but I'll go try...gcc doesn't like it, and responds with
nested_macro.cc:8: error: stray '#' in program
nested_macro.cc:3: error: expected constructor, destructor, or type conversion before '(' token
nested_macro.cc:3: error: expected declaration before '}' token
Self plug: If you're interested the unit testing framework can be found at https://sourceforge.net/projects/dut/
