"Magical" intel c++ compiler : what happened? - c++

I am on *nix. Have this simple c++ code in looptest.cpp
#include <iostream>
#include <time.h>
int main()
{
    double sum = 0.0;
    int n ;
    std::cout << "n ?" << std::endl;
    std::cin >> n ;
    clock_t t_start = clock();
    for (int i = 0 ; i < n ; ++i)
    {
        sum+= static_cast<double>(i);
    }
    clock_t t_end = clock();
    clock_t diff = t_end - t_start;
    double diffd = static_cast<double>(diff)/CLOCKS_PER_SEC;
    std::cout << diffd << " seconds." << std::endl;
    return 0;
}
It is compiled with the Intel C++ compiler (icpc (ICC) 14.0.4 20140805, 2013) as follows:
/opt/intel/bin/icpc looptest.cpp -o looptest
When I test it, I have the following curious result :
My-MacBook-Air:tmp11 XXXX$ ./looptest
n ?
4e-06 seconds.
My-MacBook-Air:tmp11 XXXX$ ./looptest
n ?
3e-06 seconds.
My-MacBook-Air:tmp11 XXXX$ ./looptest
n ?
3e-06 seconds.
My-MacBook-Air:tmp11 XXXX$ ./looptest
n ?
2e-06 seconds.
My-MacBook-Air:tmp11 XXXX$ ./looptest
n ?
3e-06 seconds.
Strange, isn't it? What happened here? Of course, compiling with GNU g++ 5.2 instead of icpc gives the expected result (the time increases as n increases).

sum is never read, so all assignments to the variable were removed. This left the for-loop empty, so it was removed, too. Hence what remains is:
#include <iostream>
#include <time.h>
int main()
{
    int n ;
    std::cout << "n ?" << std::endl;
    std::cin >> n ;
    clock_t t_start = clock();
    clock_t t_end = clock();
    clock_t diff = t_end - t_start;
    double diffd = static_cast<double>(diff)/CLOCKS_PER_SEC;
    std::cout << diffd << " seconds." << std::endl;
    return 0;
}
Effectively you measure how fast a single call to clock() is.
Look at the compiled code to figure out the optimizations the compiler did. GCC "should" be able to do the same optimization, but it will only do it if you add the parameter -O (-O2, -O3, -Os) to the invocation.
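To time the loop itself, its result has to be used somewhere so that the compiler cannot discard it. Here is a minimal sketch of that idea, assuming two hypothetical helpers, sum_to and timed_sum (neither is in the original post), and std::chrono::steady_clock in place of clock():

```cpp
#include <chrono>
#include <utility>

// Sums 0 + 1 + ... + (n - 1) as doubles. Returning the result makes it
// observable, so the loop cannot simply be treated as dead code.
double sum_to(int n) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += static_cast<double>(i);
    }
    return sum;
}

// Hypothetical helper: runs sum_to(n) and returns {sum, elapsed seconds}.
std::pair<double, double> timed_sum(int n) {
    const auto start = std::chrono::steady_clock::now();
    const double sum = sum_to(n);
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    return {sum, elapsed.count()};
}
```

Printing both members of the returned pair (for example from main, after reading n) is what keeps the loop alive. An optimizing compiler may still vectorize or otherwise transform the loop, so the absolute numbers depend on the flags used.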


OpenMP parallel for does not speed up array sum code [duplicate]

I'm trying to test the speed up of OpenMP on an array sum program.
The elements are generated using random generator to avoid optimization.
The length of array is also set large enough to indicate the performance difference.
This program is built using g++ -fopenmp -g -O0 -o main main.cpp, -g -O0 are used to avoid optimization.
However, the OpenMP parallel for code is significantly slower than the sequential code.
Test result:
Your thread count is: 12
Filling arrays
filling time:66718888
Now running omp code
2thread omp time:11154095
result: 4294903886
Now running omp code
4thread omp time:10832414
result: 4294903886
Now running omp code
6thread omp time:11165054
result: 4294903886
Now running sequential code
sequential time: 3525371
result: 4294903886
#include <iostream>
#include <stdio.h>
#include <omp.h>
#include <ctime>
#include <random>
using namespace std;
long long llsum(char *vec, size_t size, int threadCount) {
    long long result = 0;
    size_t i;
#pragma omp parallel for num_threads(threadCount) reduction(+: result) schedule(guided)
    for (i = 0; i < size; ++i) {
        result += vec[i];
    }
    return result;
}
int main(int argc, char **argv) {
    int threadCount = 12;
    cout << "Your thread count is: " << threadCount << endl;
    const size_t TEST_SIZE = 8000000000;
    char *testArray = new char[TEST_SIZE];
    std::mt19937 rng;
    std::uniform_int_distribution<std::mt19937::result_type> dist6(0, 4);
    cout << "Filling arrays\n";
    auto fillingStartTime = clock();
    for (int i = 0; i < TEST_SIZE; ++i) {
        testArray[i] = dist6(rng);
    }
    auto fillingEndTime = clock();
    auto fillingTime = fillingEndTime - fillingStartTime;
    cout << "filling time:" << fillingTime << endl;
    // test omp time
    for (int i = 1; i <= 3; ++i) {
        cout << "Now running omp code\n";
        auto ompStartTime = clock();
        auto ompResult = llsum(testArray, TEST_SIZE, i * 2);
        auto ompEndTime = clock();
        auto ompTime = ompEndTime - ompStartTime;
        cout << i * 2 << "thread omp time:" << ompTime << endl << "result: " << ompResult << endl;
    }
    // test sequential addition time
    cout << "Now running sequential code\n";
    auto seqStartTime = clock();
    long long expectedResult = 0;
    for (int i = 0; i < TEST_SIZE; ++i) {
        expectedResult += testArray[i];
    }
    auto seqEndTime = clock();
    auto seqTime = seqEndTime - seqStartTime;
    cout << "sequential time: " << seqTime << endl << "result: " << expectedResult << endl;
    return 0;
}
As pointed out by @High Performance Mark, I should use omp_get_wtime() instead of clock().
clock() measures 'active processor time', not 'elapsed time'.
OpenMP time and clock() give two different results
After using omp_get_wtime(), and fixing the int i to size_t i, the result is more meaningful:
Your thread count is: 12
Filling arrays
filling time:267.038
Now running omp code
2thread omp time:26.1421
result: 15999820788
Now running omp code
4thread omp time:7.16911
result: 15999820788
Now running omp code
6thread omp time:5.66505
result: 15999820788
Now running sequential code
sequential time: 30.4056
result: 15999820788
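The gap between the two clocks is easier to see when both are read around the same piece of work. Below is a rough sketch, assuming a hypothetical Timings struct and measure helper (not from the post), with std::chrono::steady_clock standing in for omp_get_wtime() so that it builds without OpenMP. On platforms where clock() reports process CPU time (as on Linux), the CPU figure grows with the number of busy threads while the wall figure shrinks.

```cpp
#include <chrono>
#include <ctime>

// Hypothetical result type: both readings for the same piece of work.
struct Timings {
    double wall_seconds;  // elapsed real time
    double cpu_seconds;   // processor time reported by std::clock()
};

// Runs work() once and reports wall-clock and CPU time side by side.
template <typename Work>
Timings measure(Work work) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    work();
    const std::clock_t cpu_stop = std::clock();
    const auto wall_stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> wall = wall_stop - wall_start;
    return {wall.count(),
            static_cast<double>(cpu_stop - cpu_start) / CLOCKS_PER_SEC};
}
```

Something like measure([&] { llsum(testArray, TEST_SIZE, 4); }) would then give both numbers for one run; comparing speed-up should use wall_seconds only.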

Measuring time for my array of random numbers to print is always showing 0 seconds

My first post here. I'm wondering why my stopwatch always shows 0 seconds or 0 milliseconds, no matter how many random numbers are in my array. I appreciate the help so much.
Here's my code:
#include <iostream>
#include <cstdlib>
#include <time.h>
using namespace std;
double clock_start()
{
    clock_t start = clock();
    return start;
}
void random_number()
{
    int array[10000];
    cout << "10k Random numbers: ";
    for (int i = 0; i < 10000; i++)
    {
        array[i] = rand() % 99 + 1;
        cout << array[i] << "\n";
    }
}
int main()
{
    setlocale(LC_ALL, "");
    clock_t elapsed = (clock() - clock_start()) / (CLOCKS_PER_SEC / 1000);
    cout << "Stopwatch: " << elapsed << "ms" << " or " << elapsed * 1000 << "s" << endl;
    system("pause > nul");
    return 0;
}
(clock() - clock_start()) will be evaluated in the blink of an eye.
All clock_start() does is return clock(). (In fact, a good optimising compiler will replace clock_start() with clock() !)
The difference will almost certainly be zero. Did you want something like
clock_t start = clock();
clock_t elapsed = (clock() - start) / (CLOCKS_PER_SEC / 1000);
Thanks for the help, guys! I'm amazed how fast this community is at replying.
So I deleted the clock_start() function. And I added the:
clock_t start = clock();
to my main function.
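Putting the pieces together, here is one way the corrected stopwatch could look. It is only a sketch: ticks_to_ms, ms_to_seconds and time_random_output are hypothetical names, not from the thread. It also addresses a second slip in the question's code, which multiplies milliseconds by 1000 to get seconds instead of dividing.

```cpp
#include <cstdlib>
#include <ctime>
#include <iostream>

// Converts a clock() tick difference to milliseconds. Working in floating
// point avoids the integer division in CLOCKS_PER_SEC / 1000.
double ticks_to_ms(std::clock_t ticks) {
    return 1000.0 * static_cast<double>(ticks) / CLOCKS_PER_SEC;
}

// Milliseconds to seconds: divide by 1000 (the question multiplies).
double ms_to_seconds(double ms) {
    return ms / 1000.0;
}

// Hypothetical driver: takes the start reading before the work, prints
// `count` random numbers in [1, 99], and returns the elapsed milliseconds.
double time_random_output(int count) {
    const std::clock_t start = std::clock();
    for (int i = 0; i < count; ++i) {
        std::cout << std::rand() % 99 + 1 << "\n";
    }
    return ticks_to_ms(std::clock() - start);
}
```

In main this would be called as double ms = time_random_output(10000); and printed together with ms_to_seconds(ms). Since clock() counts CPU time on some platforms, time spent waiting for the console may not show up in the result.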

memset is significantly faster than Eigen::Tensor setZero()

I replaced the Eigen::Tensor setZero() call in my code with a memset call over the tensor data and observed significantly better performance. Built with VS 2016 (SSE2 support should be enabled by default). Why does this happen? I expected Eigen::Tensor to be highly optimized.
#include <unsupported/Eigen/CXX11/Tensor>
#include <iostream>
#include <ctime>
#define MyLayoutType Eigen::RowMajor
#define Tf3 Eigen::Tensor<float, 3, MyLayoutType>
clock_t begin = clock();
Tf3 tensor(1000, 500, 20);
for (size_t i = 0; i < 100; i++)
memset(tensor.data(), 0, tensor.size() * sizeof(float));
// VS:
clock_t end = clock();
double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
cout << "-----------------------" << endl;
cout << "Total time elapsed: " << elapsed_secs << " secs" << endl;
cout << tensor(0, 0, 0);
In my environment I got 2.1 for memset on average and 2.3 for setZero. The setRandom operation is much heavier than memset: if I comment out tensor.setRandom(), I get 0.4 for memset and 0.5 for setZero. In real code the difference in performance is bigger.

How can I measure CPU time and wall clock time on both Linux/Windows?

I mean: how can I measure the time my CPU spent on function execution and the wall clock time it takes to run my function? (I'm interested in Linux/Windows and both x86 and x86_64.) See what I want to do (I'm using C++ here but I would prefer a C solution):
int startcputime, endcputime, wcts, wcte;
startcputime = cputime();
endcputime = cputime();
std::cout << "it took " << endcputime - startcputime << " s of CPU to execute this\n";
wcts = wallclocktime();
wcte = wallclocktime();
std::cout << "it took " << wcte - wcts << " s of real time to execute this\n";
Another important question: is this type of time measuring architecture independent or not?
Here's a copy-paste solution that works on both Windows and Linux as well as C and C++.
As mentioned in the comments, there's a boost library that does this. But if you can't use boost, this should work:
// Windows
#ifdef _WIN32
#include <Windows.h>
double get_wall_time(){
    LARGE_INTEGER time,freq;
    if (!QueryPerformanceFrequency(&freq)){
        // Handle error
        return 0;
    }
    if (!QueryPerformanceCounter(&time)){
        // Handle error
        return 0;
    }
    return (double)time.QuadPart / freq.QuadPart;
}
double get_cpu_time(){
    FILETIME a,b,c,d;
    if (GetProcessTimes(GetCurrentProcess(),&a,&b,&c,&d) != 0){
        // Returns total user time.
        // Can be tweaked to include kernel times as well.
        return
            (double)(d.dwLowDateTime |
            ((unsigned long long)d.dwHighDateTime << 32)) * 0.0000001;
    }else{
        // Handle error
        return 0;
    }
}
// Posix/Linux
#else
#include <time.h>
#include <sys/time.h>
double get_wall_time(){
    struct timeval time;
    if (gettimeofday(&time,NULL)){
        // Handle error
        return 0;
    }
    return (double)time.tv_sec + (double)time.tv_usec * .000001;
}
double get_cpu_time(){
    return (double)clock() / CLOCKS_PER_SEC;
}
#endif
There's a bunch of ways to implement these clocks. But here's what the above snippet uses:
For Windows:
Wall Time: Performance Counters
CPU Time: GetProcessTimes()
For Linux:
Wall Time: gettimeofday()
CPU Time: clock()
And here's a small demonstration:
#include <math.h>
#include <iostream>
using namespace std;
int main(){
    // Start Timers
    double wall0 = get_wall_time();
    double cpu0 = get_cpu_time();
    // Perform some computation.
    double sum = 0;
#pragma omp parallel for reduction(+ : sum)
    for (long long i = 1; i < 10000000000; i++){
        sum += log((double)i);
    }
    // Stop timers
    double wall1 = get_wall_time();
    double cpu1 = get_cpu_time();
    cout << "Wall Time = " << wall1 - wall0 << endl;
    cout << "CPU Time = " << cpu1 - cpu0 << endl;
    // Prevent Code Elimination
    cout << endl;
    cout << "Sum = " << sum << endl;
}
Output (12 threads):
Wall Time = 15.7586
CPU Time = 178.719
Sum = 2.20259e+011
C++11. Much easier to write!
Use std::chrono::system_clock for wall clock and std::clock for cpu clock
#include <iostream>
#include <ctime>
#include <chrono>
std::clock_t startcputime = std::clock();
double cpu_duration = (std::clock() - startcputime) / (double)CLOCKS_PER_SEC;
std::cout << "Finished in " << cpu_duration << " seconds [CPU Clock] " << std::endl;
auto wcts = std::chrono::system_clock::now();
std::chrono::duration<double> wctduration = (std::chrono::system_clock::now() - wcts);
std::cout << "Finished in " << wctduration.count() << " seconds [Wall Clock]" << std::endl;
Et voilà, easy and portable! No need for #ifdef _WIN32 or LINUX!
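You could even use std::chrono::high_resolution_clock if you need more precision, or std::chrono::steady_clock for measuring intervals, since system_clock can be adjusted while the program runs.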
To give a concrete example of @lip's suggestion to use boost::timer if you can (tested with Boost 1.51):
#include <boost/timer/timer.hpp>
// this is wallclock AND cpu time
boost::timer::cpu_timer timer;
... run some computation ...
boost::timer::cpu_times elapsed = timer.elapsed();
std::cout << " CPU TIME: " << (elapsed.user + elapsed.system) / 1e9 << " seconds"
<< " WALLCLOCK TIME: " << elapsed.wall / 1e9 << " seconds"
<< std::endl;
Use the clock method in time.h:
clock_t start = clock();
/* Do stuffs */
clock_t end = clock();
float seconds = (float)(end - start) / CLOCKS_PER_SEC;
Unfortunately, this method returns CPU time on Linux, but returns wall-clock time on Windows (thanks to commenters for this information).

How to use clock() in C++

How do I call clock() in C++?
For example, I want to test how much time a linear search takes to find a given element in an array.
#include <iostream>
#include <cstdio>
#include <ctime>
int main() {
    std::clock_t start;
    double duration;
    start = std::clock();
    /* Your algorithm here */
    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    std::cout<<"printf: "<< duration <<'\n';
}
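For the linear search in the question, the "Your algorithm here" slot could be filled roughly as follows. This is a sketch under assumptions: linear_search and seconds_per_search are hypothetical names, and the search is repeated many times because a single pass over a small array is usually shorter than one tick of std::clock().

```cpp
#include <cstddef>
#include <ctime>
#include <vector>

// Returns the index of the first element equal to target, or -1 if absent.
long long linear_search(const std::vector<int>& data, int target) {
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] == target) {
            return static_cast<long long>(i);
        }
    }
    return -1;
}

// Hypothetical helper: average CPU seconds per search over `repetitions` runs.
double seconds_per_search(const std::vector<int>& data, int target,
                          int repetitions) {
    // Storing into a volatile keeps the result of each call observable.
    volatile long long sink = 0;
    const std::clock_t start = std::clock();
    for (int r = 0; r < repetitions; ++r) {
        sink = linear_search(data, target);
    }
    const std::clock_t stop = std::clock();
    return static_cast<double>(stop - start) / CLOCKS_PER_SEC / repetitions;
}
```

With aggressive optimization the compiler may still hoist the repeated search out of the loop; varying the target on each repetition is one way around that.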
An alternative solution, which is portable and with higher precision, available since C++11, is to use std::chrono.
Here is an example:
#include <iostream>
#include <chrono>
typedef std::chrono::high_resolution_clock Clock;
int main()
{
    auto t1 = Clock::now();
    auto t2 = Clock::now();
    std::cout << "Delta t2-t1: "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count()
              << " nanoseconds" << std::endl;
}
Running this on ideone.com gave me:
Delta t2-t1: 282 nanoseconds
clock() returns the number of clock ticks since your program started. There is a related constant, CLOCKS_PER_SEC, which tells you how many clock ticks occur in one second. Thus, you can test any operation like this:
clock_t startTime = clock();
clock_t endTime = clock();
clock_t clockTicksTaken = endTime - startTime;
double timeInSeconds = clockTicksTaken / (double) CLOCKS_PER_SEC;
On Windows at least, the only practically accurate measurement mechanism is QueryPerformanceCounter (QPC). std::chrono is implemented using it (since VS2015, if you use that), but it is not accurate to the same degree as using QueryPerformanceCounter directly. In particular, its claim to report at 1 nanosecond granularity is absolutely not correct. So, if you're measuring something that takes a very short amount of time (and your case might just be such a case), then you should use QPC, or the equivalent for your OS. I came up against this when measuring cache latencies, and I jotted down some notes that you might find useful.
#include <iostream>
#include <ctime>
#include <cstdlib> //_sleep() --- just a function that waits a certain amount of milliseconds
using namespace std;
int main()
{
    clock_t cl; //declaring a clock_t variable
    cl = clock(); //starting reading of the clock
    _sleep(5167); //insert code here
    cl = clock() - cl; //ticks elapsed up to the end point
    _sleep(1000); //testing to see if it actually stops at the end point
    cout << cl/(double)CLOCKS_PER_SEC << endl; //converts the ticks to seconds and prints them
    return 0;
    //outputs "5.17"
}
You can measure how long your program runs. The following functions help measure the time taken:
C++ (double)clock() / CLOCKS_PER_SEC with ctime included (CPU time since the start of the program).
Python time.clock() returns a floating-point value in seconds (removed in Python 3.8; time.process_time() is the replacement).
Java System.nanoTime() returns a long value in nanoseconds (elapsed time from an arbitrary origin, not CPU time).
My reference: algorithms toolbox week 1 course part of data structures and algorithms specialization by University of California San Diego & National Research University Higher School of Economics
So you can add this line of code after your algorithm:
cout << (double)clock() / CLOCKS_PER_SEC;
Expected output: the CPU time, in seconds, that the program has used so far.
You might be interested in a timer like this:
H : M : S . Msec.
The code for Linux:
#include <iostream>
#include <unistd.h>
using namespace std;
void newline();
int main() {
    int msec = 0;
    int sec = 0;
    int min = 0;
    int hr = 0;
    //cout << "Press any key to start:";
    //char start = _getch();
    for (;;)
    {
        newline();
        if(msec == 1000)
        {
            ++sec;
            msec = 0;
        }
        if(sec == 60)
        {
            ++min;
            sec = 0;
        }
        if(min == 60)
        {
            ++hr;
            min = 0;
        }
        cout << hr << " : " << min << " : " << sec << " . " << msec << endl;
        ++msec;
        usleep(1000); // sleep about 1 ms; printing also takes time, so the counter drifts
    }
    return 0;
}
void newline()
{
    cout << "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n";
}