Time not being calculated - C++

I wanted to see which would be faster to access, a struct or a tuple, so I wrote a small program. However, when it finishes running, the time recorded is 0.000000 for both. I'm pretty sure the program isn't finishing that fast (I'm running it on an online compiler since I am away from home).
#include <iostream>
#include <time.h>
#include <tuple>
#include <cstdlib>
using namespace std;

struct Item
{
    int x;
    int y;
};

typedef tuple<int, int> dTuple;

int main() {
    printf("Let's see which is faster...\n");
    //Timers
    time_t startTime;
    time_t currentTimeTuple;
    time_t currentTimeItem;
    //Delta times
    double deltaTimeTuple;
    double deltaTimeItem;
    //Collections
    dTuple tupleArray[100000];
    struct Item itemArray[100000];
    //Seed random number
    srand(time(NULL));
    printf("Generating tuple array...\n");
    //Initialize an array of tuples with random ints
    for(int i = 0; i < 100000; ++i)
    {
        tupleArray[i] = dTuple(rand() % 1000, rand() % 1000);
    }
    printf("Generating Item array...\n");
    //Initialize an array of Items
    for(int i = 0; i < 100000; ++i)
    {
        itemArray[i].x = rand() % 1000;
        itemArray[i].y = rand() % 1000;
    }
    //Begin timer for tuple array
    time(&startTime);
    //Iterate through the array of tuples and print out each value, timing how long it takes
    for(int i = 0; i < 100000; ++i)
    {
        printf("%d: %d", get<0>(tupleArray[i]), get<1>(tupleArray[i]));
    }
    //Get the time it took to go through the tuple array
    time(&currentTimeTuple);
    deltaTimeTuple = difftime(startTime, currentTimeTuple);
    //Start the timer for the array of Items
    time(&startTime);
    //Iterate through the array of Items and print out each value, timing how long it takes
    for(int i = 0; i < 100000; ++i)
    {
        printf("%d: %d", itemArray[i].x, itemArray[i].y);
    }
    //Get the time it took to go through the item array
    time(&currentTimeItem);
    deltaTimeItem = difftime(startTime, currentTimeItem);
    printf("\n\n");
    printf("It took %f seconds to go through the tuple array\nIt took %f seconds to go through the struct Item array\n", deltaTimeTuple, deltaTimeItem);
    return 0;
}
According to www.cplusplus.com/reference/ctime/time/, difftime should return the difference in seconds between two time_t values.

time() generally returns the number of seconds since 00:00 hours, Jan 1, 1970 UTC (i.e., the current Unix timestamp). So, depending on your library implementation, anything that takes less than 1 second might appear as 0.
You should prefer <chrono> for benchmarking purposes:
// requires #include <chrono>; chrono:: here is std::chrono:: (using namespace std)
chrono::high_resolution_clock::time_point t = chrono::high_resolution_clock::now();
// do something to benchmark
chrono::high_resolution_clock::time_point t2 = chrono::high_resolution_clock::now();
cout << "Exec in ms: " << chrono::duration_cast<chrono::milliseconds>(t2 - t).count() << endl;
You nevertheless have to consider the clock resolution. On Windows, for example, it is about 15 ms, so if your measured time is around 15 ms or below, you should increase the number of iterations.
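For illustration, here is a minimal sketch of that idea (the helper name and the default iteration count are made up for this example, not taken from the answer): repeat the code under test enough times that the measured interval is well above the clock resolution, then divide by the iteration count.

#include <chrono>

// Hypothetical helper: run `work` many times and report the average time per
// call in milliseconds, so sub-resolution operations still become measurable.
template <typename F>
double time_per_call_ms(F&& work, int iterations = 100000)
{
    using namespace std::chrono;
    auto t0 = high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i)
        work();
    auto t1 = high_resolution_clock::now();
    return duration<double, std::milli>(t1 - t0).count() / iterations;
}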

I recommend checking the assembly language listing before you start performance timings.
The first thing to check is that the assembly language is significantly different between accessing a tuple versus accessing your structure. You can get a rough estimate of the timing difference by counting the number of different instructions.
Secondly, I recommend looking at the definition of std::tuple. Something tells me that it is a structure declared similarly to yours. Thus I predict your timings should be nearly equal.
Thirdly, you should also compare std::pair, since you only have two items in your tuple and a std::pair has two elements. Again, it should be the same as your structure.
Lastly, be prepared for "noise" in your data. The noise may come from the data cache misses, other programs using the data cache, delegation of code between cores, and any I/O. The closer you can get to running your program on a single core, free of interruptions, the better quality your data will be.

time_t is a count of seconds, and 100,000 trivial operations on a fast, modern computer could indeed finish in less than a second.

Related

prime number below 2 billion - usage of std::list hinders performance

The problem statement is to find the prime numbers below 2 billion in a time frame of less than 20 seconds.
I followed the approaches below.
Divide the number n by the list of numbers k (k < sqrt(n)) - took 20 sec
Divide the number n by the list of prime numbers below sqrt(n). In this scenario I stored the prime numbers in a std::list - took more than 180 sec
Can someone help me understand why the 2nd approach took so long even though we reduced the number of divisions by roughly 50%? Or did I choose the wrong data structure?
Approach 1:
#include <iostream>
#include<list>
#include <ctime>
using namespace std;
list<long long> primeno;
void ListPrimeNumber();
int main()
{
clock_t time_req = clock();
ListPrimeNumber();
time_req = clock() - time_req;
cout << "time taken " << static_cast<float>(time_req) / CLOCKS_PER_SEC << " seconds" << endl;
return 0;
}
void check_prime(int i);
void ListPrimeNumber()
{
primeno.push_back(2);
primeno.push_back(3);
primeno.push_back(5);
for (long long i = 6; i <= 20000000; i++)
{
check_prime(i);
}
}
void check_prime(int i)
{
try
{
int j = 0;
int limit = sqrt(i);
for (j = 2 ; j <= limit;j++)
{
if(i % j == 0)
{
break;
}
}
if( j > limit)
{
primeno.push_back(i);
}
}
catch (exception ex)
{
std::cout << "Message";
}
}
Approach 2 :
#include <iostream>
#include<list>
#include <ctime>
using namespace std;
list<long long> primeno;
int noofdiv = 0;
void ListPrimeNumber();
int main()
{
clock_t time_req = clock();
ListPrimeNumber();
time_req = clock() - time_req;
cout << "time taken " << static_cast<float>(time_req) / CLOCKS_PER_SEC << " seconds" << endl;
cout << "No of divisions : " << noofdiv;
return 0;
}
void check_prime(int i);
void ListPrimeNumber()
{
primeno.push_back(2);
primeno.push_back(3);
primeno.push_back(5);
for (long long i = 6; i <= 10000; i++)
{
check_prime(i);
}
}
void check_prime(int i)
{
try
{
int limit = sqrt(i);
for (int iter : primeno)
{
noofdiv++;
if (iter <= limit && (i%iter) == 0)
{
break;
}
else if (iter > limit)
{
primeno.push_back(i);
break;
}
}
}
catch (exception ex)
{
std::cout << "Message";
}
}
The reason your second example takes longer is you're iterating a std::list.
A std::list in C++ is a linked list, which means it doesn't use contiguous memory. This is bad because to iterate the list you must jump from node to node in a way that is unpredictable to the CPU and its prefetcher. Also, you're most likely only "using" a few bytes of each cache line. RAM is slow: fetching a byte from RAM takes a lot longer than fetching it from L1. CPUs are fast these days, so your program spends most of its time doing nothing, just waiting for memory to arrive.
Use a std::vector instead. It stores all values one after the other and iterating is very cheap. Since you're iterating forward in memory without jumping, you're using the full cacheline and your prefetcher will be able to fetch further pages before you need them because your access of memory is predictable.
It has been proven by numerous people, including Bjarne Stroustrup, that std::vector is in a lot of cases faster than std::list, even in cases where the std::list has "theoretically" better complexity (random insert, delete, ...) just because caching helps a lot. So always use std::vector as your default. And if you think a linked list would be faster in your case, measure it and be surprised that - most of the time - std::vector dominates.
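If you want to see this for yourself, a minimal, self-contained sketch of such a measurement might look like the following (the size and names are illustrative, not from the original post):

#include <chrono>
#include <cstdio>
#include <list>
#include <vector>

int main()
{
    const int N = 10000000;
    std::vector<long long> vec(N, 1);
    std::list<long long> lst(N, 1);

    // time a simple forward traversal of whatever container we're given
    auto time_sum = [](const auto& container, const char* name) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (long long x : container) sum += x;
        auto t1 = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%s: sum=%lld, %lld ms\n", name, sum, ms);
    };

    time_sum(vec, "vector"); // contiguous: cache- and prefetcher-friendly
    time_sum(lst, "list");   // node-based: pointer chasing, many cache misses
}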
Edit: as others have noted, your method of finding primes isn't very efficient. I just played around a bit and implemented a Sieve of Eratosthenes using a bitset.
constexpr int max_prime = 1000000000;
std::bitset<max_prime> *bitset = new std::bitset<max_prime>{};
// Note: Bit SET means NO prime
bitset->set(0);
bitset->set(1);
for(int i = 4; i < max_prime; i += 2)
    bitset->set(i); // set all even numbers
int max = sqrt(max_prime);
for(int i = 3; i < max; i += 2) { // No point testing even numbers as they can't be prime
    if(!bitset->test(i)) { // If i is prime
        for(int j = i * 2; j < max_prime; j += i)
            bitset->set(j); // set all multiples of i to non-prime
    }
}
This takes about 30 seconds (it used to take between 4.2 and 4.5 seconds; not sure why it changed that much after slight modifications... must be an optimization I'm not hitting anymore) to find all primes below one billion (1,000,000,000) on my machine. Your approach took way too long even for 100 million. I cancelled the 1 billion search after about two minutes.
Comparison for 100 million:
time taken: 63.515 seconds
time taken bitset: 1.874 seconds
No of divisions : 1975961174
No of primes found: 5761455
No of primes found bitset: 5761455
I'm not a mathematician, so I'm pretty sure there are still ways to optimize it further; I only special-case the even numbers.
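As a hypothetical follow-up to the sieve fragment above: every bit still clear is a prime, so counting the primes (and releasing the bitset, since it was allocated with new) could look like this.

long long no_primes_found = (long long)max_prime - (long long)bitset->count();
std::printf("No of primes found bitset: %lld\n", no_primes_found);
delete bitset; // allocated with new above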
The first thing to do is make sure you are compiling with optimisations enabled. The C++ standard library template classes tend to perform very poorly with unoptimised code, as they generate lots of function calls. The optimiser inlines most of these function calls, which makes them much cheaper.
std::list is a linked list. It is mostly useful where you want to insert or remove elements randomly (i.e. not from the end).
For the case where you are only appending to the end of a list std::list has the following issues:
Iterating through the list is relatively expensive as the code has to follow node pointers and then retrieve the data
The list uses quite a lot more memory, each element needs a pointer to the previous and next nodes in addition to the actual data. On a 64-bit system this equates to 20 bytes per element rather than 4 for a list of int
As the elements in the list are not contiguous in memory the compiler can't perform as many SIMD optimisations and you will suffer more from CPU cache misses
A std::vector would solve all of the above as its memory is contiguous and iterating through it is basically just a case of incrementing an array index. You do need to make sure that you call reserve on your vector at the beginning with a sufficiently large value so that appending to the vector doesn't cause the whole array to be copied to a new larger array.
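A minimal sketch of that, keeping the question's primeno name and assuming an estimated upper bound on the number of primes below 2 billion (roughly 98 million), so push_back never triggers a reallocation:

std::vector<long long> primeno;
primeno.reserve(100000000); // illustrative capacity, not from the original post
primeno.push_back(2);
primeno.push_back(3);
primeno.push_back(5);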
A bigger optimisation than the above would be to use the Sieve of Eratosthenes to calculate your primes. As generating this might require random deletions (depending on your exact implementation), std::list might perform better than std::vector, though even in this case the benefits of std::list might not outweigh its overheads.
A test at Ideone (the OP's code with a few superficial alterations) completely contradicts the claims made in this question:
/* check_prime__list:
time taken No of divisions No of primes
10M: 0.873 seconds 286144936 664579
20M: 2.169 seconds 721544444 1270607 */
2B: projected time: at least 16 minutes but likely much more (*)
/* check_prime__nums:
time taken No of divisions No of primes
10M: 4.650 seconds 1746210131 664579
20M: 12.585 seconds 4677014576 1270607 */
2B: projected time: at least 3 hours but likely much more (*)
I also changed the type of the number-of-divisions counter to long int because it was wrapping around the data type's limit, so the OP could have been misinterpreting that.
But the run time wasn't being affected by that. A wall clock is a wall clock.
The most likely explanation seems to be sloppy testing by the OP, with different values used in each test case by mistake.
(*) The time projection was made by the empirical orders of growth analysis:
100**1.32 * 2.169 / 60 = 15.8
100**1.45 * 12.585 / 3600 = 2.8
Empirical orders of growth, as measured on the given range of sizes, were noticeably better for the list algorithm, n^1.32 vs. the n^1.45 for the testing by all numbers. This is entirely expected from theoretical complexity, since there are fewer primes than all numbers up to n, by a factor of log n, for a total complexity of O(n^1.5 / log n) vs. O(n^1.5). It is also highly unlikely for any implementational discrepancy to beat an actual algorithmic advantage.
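For reference, here is a small sketch of how such an empirical exponent can be computed from two of the timings quoted above (a sketch only; the exponents above were presumably derived the same way):

#include <cmath>
#include <cstdio>

int main()
{
    // timings quoted in this answer for the list-of-primes version
    double t10M = 0.873, t20M = 2.169;
    // the problem size doubles (10M -> 20M), so the exponent b in t ~ n^b is:
    double b = std::log(t20M / t10M) / std::log(2.0);
    std::printf("empirical order of growth: n^%.2f\n", b); // roughly 1.3, consistent with the n^1.32 above
}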

Function consuming wall time for a specific amount of time

I have a function with a factor that needs to be adjusted according to the load on the machine so that the function consumes exactly the wall time passed to it. The factor can vary according to the load of the machine.
void execute_for_wallTime(int factor, int wallTime)
{
double d = 0;
for (int n = 0; n<factor; ++n)
for (int m = 0; wall_time; ++m)
d += d * n*m;
}
Is there a way to dynamically check the load on the machine and adjust the factor accordingly in order to consume the exact wall time passed to the function?
The wall time is read from a file and passed to this function. The values are in microseconds, e.g.:
73
21
44
According to OP comment:
#include <sys/time.h>

// time in microseconds between the 2 timevals; might require longs anyway
int deltaTime(struct timeval *tv1, struct timeval *tv2){
    return ((tv2->tv_sec - tv1->tv_sec)*1000000) + tv2->tv_usec - tv1->tv_usec;
}

void execute_for_wallTime(int wallTime)
{
    struct timeval tvStart, tvNow;
    gettimeofday(&tvStart, NULL);

    double d = 0;
    for (int m = 0; wallTime; ++m){
        gettimeofday(&tvNow, NULL);
        if(deltaTime(&tvStart, &tvNow) >= wallTime) { // if wallTime is 1000 microseconds,
                                                      // this function returns after
                                                      // 1000 microseconds (and a
                                                      // little more due to overhead)
            return;
        }
        d += d*m;
    }
}
Now deal with wallTime by increasing or decreasing it in logic outside this function, depending on your performance calculations. This function simply runs for wallTime microseconds.
For C++ style, you can use std::chrono.
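A minimal sketch of the same busy-wait written with std::chrono (wallTime is in microseconds, as above; the volatile variable is only there so the filler work isn't optimized away):

#include <chrono>

void execute_for_wallTime(int wallTime)
{
    using namespace std::chrono;
    const auto start = steady_clock::now();
    volatile double d = 0; // filler work, as in the original
    for (int m = 0; duration_cast<microseconds>(steady_clock::now() - start).count() < wallTime; ++m)
        d = d + m;
}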
I must comment that I would handle things differently, for example by calling nanosleep(). The operations make no sense unless in the actual code you plan to substitute these "fillers" with real operations; in that case you might consider threads and schedulers. Besides, the clock calls add overhead.
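For completeness, a minimal sketch of the nanosleep() idea, assuming a POSIX system (the process sleeps instead of burning CPU):

#include <time.h>

void sleep_for_wallTime(int wallTime) // wallTime in microseconds, as above
{
    struct timespec ts;
    ts.tv_sec = wallTime / 1000000;
    ts.tv_nsec = (wallTime % 1000000) * 1000L; // microseconds -> nanoseconds
    nanosleep(&ts, NULL); // may return early if interrupted by a signal
}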

Mergesort pthread implementation taking the same time as single-threaded

(I have tried to simplify this as much as I could to find out where I'm doing something wrong.)
The idea of the code is that I have a global array *v (I hope using this array isn't slowing things down; the threads should never access the same value because they all work on different ranges) and I try to create 2 threads, each one sorting the first half and the second half respectively by calling the function merge_sort() with the respective parameters.
On the threaded run, I see the process going to 80-100% CPU usage (on a dual core CPU), while on the non-threaded run it only stays at 50%, yet the run times are very close.
This is the (relevant) code:
//These are the 2 sorting functions, each thread will call merge_sort(..). Is this a problem? both threads calling same (normal) function?
void merge (int *v, int start, int middle, int end) {
//dynamically creates 2 new arrays for the v[start..middle] and v[middle+1..end]
//copies the original values into the 2 halves
//then sorts them back into the v array
}
void merge_sort (int *v, int start, int end) {
//recursively calls merge_sort(start, (start+end)/2) and merge_sort((start+end)/2+1, end) to sort them
//calls merge(start, middle, end)
}
//here i'm expecting each thread to be created and to call merge_sort on its specific range (this is a simplified version of the original code to find the bug easier)
void* mergesort_t2(void * arg) {
t_data* th_info = (t_data*)arg;
merge_sort(v, th_info->a, th_info->b);
return (void*)0;
}
//in main I simply create 2 threads calling the above function
int main (int argc, char* argv[])
{
//some stuff
//getting the clock to calculate run time
clock_t t_inceput, t_sfarsit;
t_inceput = clock();
//ignore crt_depth for this example (in the full code i'm recursively creating new threads and i need this to know when to stop)
//the a and b are the range of values the created thread will have to sort
pthread_t thread[2];
t_data next_info[2];
next_info[0].crt_depth = 1;
next_info[0].a = 0;
next_info[0].b = n/2;
next_info[1].crt_depth = 1;
next_info[1].a = n/2+1;
next_info[1].b = n-1;
for (int i=0; i<2; i++) {
if (pthread_create (&thread[i], NULL, &mergesort_t2, &next_info[i]) != 0) {
cerr<<"error\n;";
return err;
}
}
for (int i=0; i<2; i++) {
if (pthread_join(thread[i], &status) != 0) {
cerr<<"error\n;";
return err;
}
}
//now i merge the 2 sorted halves
merge(v, 0, n/2, n-1);
//calculate end time
t_sfarsit = clock();
cout<<"Sort time (s): "<<double(t_sfarsit - t_inceput)/CLOCKS_PER_SEC<<endl;
delete [] v;
}
Output (on 1 million values):
Sort time (s): 1.294
Output with direct calling of merge_sort, no threads:
Sort time (s): 1.388
Output (on 10 million values):
Sort time (s): 12.75
Output with direct calling of merge_sort, no threads:
Sort time (s): 13.838
Solution:
I'd like to thank WhozCraig and Adam too as they've hinted to this from the beginning.
I've used the inplace_merge(..) function instead of my own, and the program run times are now as they should be.
Here's my initial merge function (not really sure if it's the initial one, I've probably modified it a few times since; also, the array indices might be wrong right now, I went back and forth between [a,b] and [a,b), this was just the last commented-out version):
void merge (int *v, int a, int m, int c) { //sorts v[a,m] - v[m+1,c] in v[a,c]
    //create the 2 new arrays
    int *st = new int[m-a+1];
    int *dr = new int[c-m+1];
    //copy the values
    for (int i1 = 0; i1 <= m-a; i1++)
        st[i1] = v[a+i1];
    for (int i2 = 0; i2 <= c-(m+1); i2++)
        dr[i2] = v[m+1+i2];
    //merge them back together in sorted order
    int is=0, id=0;
    for (int i=0; i<=c-a; i++) {
        if (id+m+1 > c || (a+is <= m && st[is] <= dr[id])) {
            v[a+i] = st[is];
            is++;
        }
        else {
            v[a+i] = dr[id];
            id++;
        }
    }
    delete[] st;
    delete[] dr;
}
all this was replaced with:
inplace_merge(v+a, v+m, v+c);
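For reference, std::inplace_merge takes half-open ranges: it merges [first, middle) and [middle, last). A minimal sketch assuming the inclusive bounds [a,m] and [m+1,c] used by the merge function above:

#include <algorithm>

// merges v[a..m] and v[m+1..c] (inclusive bounds) in place
std::inplace_merge(v + a, v + m + 1, v + c + 1);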
Edit: some timings on my 3 GHz dual core CPU:
1 million values:
1 thread : 7.236 s
2 threads: 4.622 s
4 threads: 4.692 s
10 million values:
1 thread : 82.034 s
2 threads: 46.189 s
4 threads: 47.36 s
There's one thing that struck me: "dynamically creates 2 new arrays[...]". Since both threads will need memory from the system, they need to acquire a lock for that, which could well be your bottleneck. In particular the idea of doing microscopic array allocations sounds horribly inefficient. Someone suggested an in-place sort that doesn't need any additional storage, which is much better for performance.
Another thing is the often-forgotten starting half-sentence for any big-O complexity measurements: "There is an n0 so that for all n>n0...". In other words, maybe you haven't reached n0 yet? I recently saw a video (hopefully someone else will remember it) where some people tried to determine this limit for some algorithms, and their results were that these limits are surprisingly high.
Note: since OP uses Windows, my answer below (which incorrectly assumed Linux) might not apply. I left it for sake of those who might find the information useful.
clock() is a wrong interface for measuring time on Linux: it measures CPU time used by the program (see http://linux.die.net/man/3/clock), which in case of multiple threads is the sum of CPU time for all threads. You need to measure elapsed, or wallclock, time. See more details in this SO question: C: using clock() to measure time in multi-threaded programs, which also tells what API can be used instead of clock().
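For illustration, a minimal sketch of measuring elapsed (wall-clock) time around the threaded sort, assuming a POSIX system; CLOCK_MONOTONIC is one possible replacement for clock() here:

#include <time.h>

struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
// ... pthread_create / pthread_join and the final merge go here ...
clock_gettime(CLOCK_MONOTONIC, &t1);
double elapsed_s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;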
In the MPI-based implementation that you try to compare with, two different processes are used (that's how MPI typically enables concurrency), and the CPU time of the second process is not included - so the CPU time is close to wallclock time. Nevertheless, it's still wrong to use CPU time (and so clock()) for performance measurement, even in serial programs; for one reason, if a program waits for e.g. a network event or a message from another MPI process, it still spends time - but not CPU time.
Update: In Microsoft's implementation of C run-time library, clock() returns wall-clock time, so is OK to use for your purpose. It's unclear though if you use Microsoft's toolchain or something else, like Cygwin or MinGW.

Testing parallel_for_ performance in OpenCV

I tested parallel_for_ in OpenCV by comparing it with the normal operation for just a simple array summation and multiplication.
I have an array of 100 integers, split it into 10 chunks of 10 each, and run it using parallel_for_.
Then I also have the normal 0-to-99 operation for summation and multiplication.
Then I measured the elapsed time, and the normal operation is faster than the parallel_for_ operation.
My CPU is an Intel(R) Core(TM) i7-2600 Quad Core CPU.
The parallel_for_ operation took 0.002 sec (2 clock ticks) for summation and 0.003 sec (3 clock ticks) for multiplication.
But the normal operation took 0.0000 sec (less than one clock tick) for both summation and multiplication. What am I missing? My code is as follows.
TEST Class
#include <opencv2\core\internal.hpp>
#include <opencv2\core\core.hpp>
#include <tbb\tbb.h>
using namespace tbb;
using namespace cv;
template <class type>
class Parallel_clipBufferValues:public cv::ParallelLoopBody
{
private:
type *buffertoClip;
type maxSegment;
char typeOperation;//m = mul, s = summation
static double total;
public:
Parallel_clipBufferValues(){ParallelLoopBody::ParallelLoopBody();};
Parallel_clipBufferValues(type *buffertoprocess, const type max, const char op): buffertoClip(buffertoprocess), maxSegment(max), typeOperation(op){
if(typeOperation == 's')
total = 0;
else if(typeOperation == 'm')
total = 1;
}
~Parallel_clipBufferValues(){ParallelLoopBody::~ParallelLoopBody();};
virtual void operator()(const cv::Range &r) const{
double tot = 0;
type *inputOutputBufferPTR = buffertoClip+(r.start*maxSegment);
for(int i = 0; i < 10; ++i)
{
if(typeOperation == 's')
total += *(inputOutputBufferPTR+i);
else if(typeOperation == 'm')
total *= *(inputOutputBufferPTR+i);
}
}
static double getTotal(){return total;}
void normalOperation(){
//int iteration = sizeof(buffertoClip)/sizeof(type);
if(typeOperation == 'm')
{
for(int i = 0; i < 100; ++i)
{
total *= buffertoClip[i];
}
}
else if(typeOperation == 's')
{
for(int i = 0; i < 100; ++i)
{
total += buffertoClip[i];
}
}
}
};
MAIN
#include "stdafx.h"
#include "TestClass.h"
#include <ctime>
double Parallel_clipBufferValues<int>::total;
int _tmain(int argc, _TCHAR* argv[])
{
const int SIZE=100;
int myTab[SIZE];
double totalSum_by_parallel;
double totalSun_by_normaloperation;
double elapsed_secs_parallel;
double elapsed_secs_normal;
for(int i = 1; i <= SIZE; i++)
{
myTab[i-1] = i;
}
int maxSeg =10;
clock_t begin_parallel = clock();
cv::parallel_for_(cv::Range(0,maxSeg), Parallel_clipBufferValues<int>(myTab, maxSeg, 'm'));
totalSum_by_parallel = Parallel_clipBufferValues<int>::getTotal();
clock_t end_parallel = clock();
elapsed_secs_parallel = double(end_parallel - begin_parallel) / CLOCKS_PER_SEC;
clock_t begin_normal = clock();
Parallel_clipBufferValues<int> norm_op(myTab, maxSeg, 'm');
norm_op.normalOperation();
totalSun_by_normaloperation = norm_op.getTotal();
clock_t end_normal = clock();
elapsed_secs_normal = double(end_normal - begin_normal) / CLOCKS_PER_SEC;
return 0;
}
Let me make a few observations:
Accuracy
The clock() function is not accurate at all. Its tick is roughly 1 / CLOCKS_PER_SEC, but how often it's updated, and whether it's uniform or not, is system and implementation dependent. See this post for more details about that.
Better alternatives to measure time:
This post for Windows.
This article for *nix.
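As a portable alternative to the platform-specific APIs linked above, a minimal sketch using std::chrono::steady_clock (monotonic, unaffected by system clock adjustments):

#include <chrono>

auto t0 = std::chrono::steady_clock::now();
// ... code under test ...
auto t1 = std::chrono::steady_clock::now();
double us = std::chrono::duration<double, std::micro>(t1 - t0).count();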
Trials and Test Environment
Measurements are always affected by errors. Performance measurement of your code is affected (short list, there is much more than that) by other programs, caches, operating system jobs, scheduling, and user activity. To get a better measurement you have to repeat it many times (let's say 1000 or more) and then calculate the average. Moreover, you should prepare your test environment to be as clean as possible.
More details about tests on these posts:
How do I write a correct micro-benchmark in Java?
NAS Parallel Benchmarks
Visual C++ 11 Beta Benchmark of Parallel Loops (for code examples)
Great articles from our own Eric Lippert about benchmarking (they're about C#, but most of it applies directly to any benchmark): C# Performance Benchmark Mistakes (part II).
Overhead and Scalability
In your case, the overhead of parallel execution (and of your test code structure) is much higher than the loop body itself. In this case it's not productive to make an algorithm parallel. Parallel execution must always be evaluated in a specific scenario, measured, and compared. It's not some kind of magic medicine that speeds up everything. Take a look at this article about How to Quantify Scalability.
Just as an example: if you have to sum/multiply 100 numbers, it's better to use SIMD instructions (even better within an unrolled loop); a rough sketch follows.
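A sketch of that idea (names are illustrative): with independent accumulators the compiler is free to vectorize the loop, which is usually plenty for 100 elements.

int sum_unrolled(const int* buf, int n) // assumes n is a multiple of 4
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += buf[i];
        s1 += buf[i + 1];
        s2 += buf[i + 2];
        s3 += buf[i + 3];
    }
    return s0 + s1 + s2 + s3; // independent accumulators expose SIMD parallelism
}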
Measure It!
Try to make your loop body empty (or execute a single NOP operation or a volatile write so it won't be optimized away). You'll roughly measure the overhead. Now compare it with your results.
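A rough sketch of such an overhead measurement, assuming the same OpenCV setup as the question (the volatile write keeps the body from being optimized away):

#include <opencv2/core/core.hpp>

class EmptyBody : public cv::ParallelLoopBody
{
public:
    virtual void operator()(const cv::Range& r) const
    {
        volatile int sink = 0;
        for (int i = r.start; i < r.end; ++i)
            sink = i; // no real work: whatever time this takes is framework overhead
    }
};

// then time it exactly like the real body:
// clock_t t0 = clock();
// cv::parallel_for_(cv::Range(0, 10), EmptyBody());
// clock_t t1 = clock();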
Notes About This Test
IMO this kind of test is pretty useless. You can't compare serial and parallel execution in a generic way. It's something you should always check against a specific situation (in the real world many things come into play, synchronization for example).
Imagine: you make your loop body really "heavy" and you'll see a big speed-up with parallel execution. Now you make your real program parallel and you see performance is worse. Why? Because parallel execution is slowed down by locks, by cache problems, or by serial access to a shared resource.
The test itself is meaningless unless you're testing your specific code in your specific situation (because too many factors come into play and you just can't ignore them). What does that mean? Well, that you can compare only what you tested... if your program performs total *= buffertoClip[i]; then your results are reliable. If your real program does something else, then you have to repeat the tests with that something else.

Interesting processing time results

I've made a small application that averages the numbers between 1 and 1000000. It's not hard to see (using a very basic algebraic formula) that the average is 500000.5, but this was more of a project in learning C++ than anything else.
Anyway, I made clock variables designed to find the number of clock steps required for the application to run. The first time I ran it, it said that it took 3770000 clock steps, but every time I've run it since then, it's taken "0.0" seconds...
I've attached my code at the bottom.
Either a.) It's saved the variables from the first time I ran it, and it's just running quickly to the answer...
or b.) something is wrong with how I'm declaring the time variables.
Regardless... it doesn't make sense.
Any help would be appreciated.
FYI (I'm running this through a Linux computer, not sure if that matters)
double avg (int arr[], int beg, int end)
{
int nums = end - beg + 1;
double sum = 0.0;
for(int i = beg; i <= end; i++)
{
sum += arr[i];
}
//for(int p = 0; p < nums*10000; p ++){}
return sum/nums;
}
int main (int argc, char *argv[])
{
int nums = 1000000;//atoi(argv[0]);
int myarray[nums];
double timediff;
//printf("Arg is: %d\n",argv[0]);
printf("Nums is: %d\n",nums);
clock_t begin_time = clock();
for(int i = 0; i < nums; i++)
{
myarray[i] = i+1;
}
double average = avg(myarray, 0, nums - 1);
printf("%f\n",average);
clock_t end_time = clock();
timediff = (double) difftime(end_time, begin_time);
printf("Time to Average: %f\n", timediff);
return 0;
}
You are measuring the I/O operation too (printf), which depends on external factors and might be affecting the run time. Also, clock() might not be as precise as needed to measure such a small task; look into higher resolution functions such as clock_gettime(). Even then, other processes might affect the run time by generating page fault interrupts, occupying the memory bus, etc. So this kind of fluctuation is not abnormal at all.
On the machine I tested, Linux's clock call was only accurate to 1/100th of a second. If your code runs in less than 0.01 seconds, it will usually say zero seconds have passed. Also, I ran your program a total of 50 times in 0.13 seconds, so I find it suspicious that you claim it takes 2 seconds to run it once on your computer.
Your code also uses difftime incorrectly: difftime expects two time_t values, but you are passing it clock_t tick counts, so the output could be wrong even when clock() does report that time has passed.
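What was probably intended, as a sketch: clock_t differences are in clock ticks, so divide by CLOCKS_PER_SEC instead of calling difftime, which expects time_t values.

clock_t begin_time = clock();
// ... the work being timed ...
clock_t end_time = clock();
double timediff = (double)(end_time - begin_time) / CLOCKS_PER_SEC; // seconds
printf("Time to Average: %f\n", timediff);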
I'd guess that the first timing you got was with different code than what is posted in this question, because I can't think of any way the code in this question could produce a time of 3770000.
Finally, benchmarking is hard, and your code has several benchmarking mistakes:
You're timing how long it takes to (1) fill an array, (2) calculate an average, (3) format the result string, and (4) make an OS call (slow) that prints said string in the right language/font/color/etc., which is especially slow.
You're attempting to time a task which takes less than a hundredth of a second, which is WAY too small for any accurate measurement.
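To illustrate the second point, a hypothetical sketch that repeats the too-small task enough times for the interval to become meaningful (REPS and the clock choice are made up; avg, myarray, and nums are the names from the question):

#include <chrono>
#include <cstdio>

const int REPS = 1000;
auto t0 = std::chrono::steady_clock::now();
double average = 0;
for (int r = 0; r < REPS; ++r)
    average = avg(myarray, 0, nums - 1); // the function from the question
auto t1 = std::chrono::steady_clock::now();
std::printf("average=%f, per call: %f s\n", average,
            std::chrono::duration<double>(t1 - t0).count() / REPS);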
Here is my take on your code, measuring that the average takes ~0.001968 seconds on this machine.