memset is significantly faster than Eigen::Tensor setZero() - C++

I have changed the Eigen::Tensor setZero() call in my code to a memset call over the tensor data and am observing significantly better performance. Built in VS 2016 (SSE2 support should be enabled by default). Why does this happen? I expected Eigen::Tensor to be highly optimized.
#include <unsupported/Eigen/CXX11/Tensor>
#include <iostream>
#include <cstring> // memset
#include <ctime>
#define MyLayoutType Eigen::RowMajor
#define Tf3 Eigen::Tensor<float, 3, MyLayoutType>
int main()
{
    clock_t begin = clock();
    Tf3 tensor(1000, 500, 20);
    for (size_t i = 0; i < 100; i++)
    {
        tensor.setRandom();
        memset(tensor.data(), 0, tensor.size() * sizeof(float));
        // vs:
        //tensor.setZero();
    }
    clock_t end = clock();
    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << "-----------------------" << std::endl;
    std::cout << "Total time elapsed: " << elapsed_secs << " secs" << std::endl;
    std::cout << tensor(0, 0, 0);
    return 0;
}
On my machine I get 2.1 secs on average for memset and 2.3 secs for setZero. The setRandom operation is much heavier than memset: if I comment out tensor.setRandom(), I get 0.4 secs for memset and 0.5 secs for setZero. In the real code the difference in performance is bigger.
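For reference, here is a minimal sketch (my own, same environment assumed) that times only the zeroing step, so the heavy setRandom call stays out of the measurement; swapping the memset line for tensor.setZero() gives the comparison point:
// Hedged sketch: setRandom runs once outside the timed loop, so only the
// zeroing cost is measured. Reading an element afterwards keeps the work observable.
#include <unsupported/Eigen/CXX11/Tensor>
#include <cstring>
#include <ctime>
#include <iostream>

int main()
{
    Eigen::Tensor<float, 3, Eigen::RowMajor> tensor(1000, 500, 20);
    tensor.setRandom(); // fill once, outside the timed region

    clock_t begin = clock();
    for (size_t i = 0; i < 100; i++)
    {
        std::memset(tensor.data(), 0, tensor.size() * sizeof(float));
        // vs: tensor.setZero();
    }
    clock_t end = clock();

    std::cout << "Zeroing only: " << double(end - begin) / CLOCKS_PER_SEC << " secs" << std::endl;
    std::cout << tensor(0, 0, 0) << std::endl; // keep the result observable
    return 0;
}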

Related

Measuring time for my array of random numbers to print is always showing 0 seconds

My first post here. I'm just wondering why my stopwatch always shows 0 seconds or 0 milliseconds no matter the amount of random numbers in my array. I appreciate the help so much.
Here's my code:
#include <iostream>
#include <cstdlib>
#include <time.h>
using namespace std;
double clock_start()
{
clock_t start = clock();
return start;
}
void random_number()
{
int array[10000];
srand(6);
cout << "10k Random numbers: ";
for (int i = 0; i < 10000; i++)
{
array[i] = rand() % 99 + 1;
cout << array[i] << "\n";
}
}
int main()
{
setlocale(LC_ALL, "");
//-------------------------//
random_number();
clock_t elapsed = (clock() - clock_start()) / (CLOCKS_PER_SEC / 1000);
cout << "Stopwatch: " << elapsed << "ms" << " or " << elapsed * 1000 << "s" << endl;
//-------------------------//
system("pause > nul");
return 0;
}
(clock() - clock_start()) will be evaluated in the blink of an eye.
All clock_start() does is return clock(). (In fact, a good optimising compiler will replace clock_start() with clock() !)
The difference will almost certainly be zero. Did you want something like
clock_t start = clock();
random_number();
clock_t elapsed = (clock() - start) / (CLOCKS_PER_SEC / 1000);
instead?
Thanks for the help guys! I'm amazed how fast this community is at replying.
I deleted the clock_start() function and added
clock_t start = clock();
to my main function.
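For completeness, a minimal sketch of how the corrected main() could look (note that converting the elapsed milliseconds to seconds is a division by 1000, not a multiplication):
int main()
{
    setlocale(LC_ALL, "");

    clock_t start = clock();                                        // start timing before the work
    random_number();
    clock_t elapsed = (clock() - start) / (CLOCKS_PER_SEC / 1000);  // elapsed milliseconds

    cout << "Stopwatch: " << elapsed << "ms"
         << " or " << elapsed / 1000.0 << "s" << endl;

    system("pause > nul");
    return 0;
}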

"Magical" intel c++ compiler : what happened?

I am on *nix and have this simple C++ code in looptest.cpp:
#include <iostream>
#include <time.h>
int main()
{
double sum = 0.0;
int n ;
std::cout << "n ?" << std::endl;
std::cin >> n ;
clock_t t_start = clock();
for (int i = 0 ; i < n ; ++i)
{
sum+= static_cast<double>(i);
}
clock_t t_end = clock();
clock_t diff = t_end - t_start;
double diffd = static_cast<double>(diff)/CLOCKS_PER_SEC;
std::cout << diffd << " seconds." << std::endl;
sum*=1.0;
return 0;
}
compiled with the Intel C++ compiler (icpc (ICC) 14.0.4 20140805, 2013) as follows:
/opt/intel/bin/icpc looptest.cpp -o looptest
When I test it, I have the following curious result :
My-MacBook-Air:tmp11 XXXX$ ./looptest
n ?
10000
4e-06 seconds.
My-MacBook-Air:tmp11 XXXX$ ./looptest
n ?
100000
3e-06 seconds.
My-MacBook-Air:tmp11 XXXX$ ./looptest
n ?
1000000
3e-06 seconds.
My-MacBook-Air:tmp11 XXXX$ ./looptest
n ?
1000000000
2e-06 seconds.
My-MacBook-Air:tmp11 XXXX$ ./looptest
n ?
4294967295
3e-06 seconds.
Strange, isn't it? What happened here? Of course, compiling with GNU 5.2's g++ instead of icpc gives the expected result (the time increases as n increases).
sum is nowhere read, so all assignments to the variable were removed. This made the for-loop empty, so it was removed, too. Hence what remains is:
#include <iostream>
#include <time.h>
int main()
{
int n ;
std::cout << "n ?" << std::endl;
std::cin >> n ;
clock_t t_start = clock();
clock_t t_end = clock();
clock_t diff = t_end - t_start;
double diffd = static_cast<double>(diff)/CLOCKS_PER_SEC;
std::cout << diffd << " seconds." << std::endl;
return 0;
}
Effectively you measure how fast a single call to clock() is.
Look at the compiled code to figure out the optimizations the compiler did. GCC "should" be able to do the same optimization, but it will only do it if you add the parameter -O (-O2, -O3, -Os) to the invocation.
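To illustrate the point, here is a minimal sketch (my own, not from the original post) that makes sum observable after the timing, so the compiler can no longer discard the whole computation as dead code (it may still vectorize or otherwise transform the loop, but it has to produce the value):
#include <iostream>
#include <time.h>

int main()
{
    double sum = 0.0;
    int n;
    std::cout << "n ?" << std::endl;
    std::cin >> n;

    clock_t t_start = clock();
    for (int i = 0; i < n; ++i)
        sum += static_cast<double>(i);
    clock_t t_end = clock();

    std::cout << static_cast<double>(t_end - t_start) / CLOCKS_PER_SEC
              << " seconds." << std::endl;
    std::cout << "sum = " << sum << std::endl; // reading sum keeps the loop alive
    return 0;
}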

Results of tbb::parallel_reduce and std::accumulate differ

I am learning Intel's TBB library. When summing all values in a std::vector, the result of tbb::parallel_reduce differs from std::accumulate once the vector has more than 16,777,220 elements (errors observed at 16,777,320 elements). Here is my minimal working example:
#include <iostream>
#include <vector>
#include <numeric>
#include <limits>
#include "tbb/tbb.h"
int main(int argc, const char * argv[]) {
int count = std::numeric_limits<int>::max() * 0.0079 - 187800; // - 187900 works
std::vector<float> heights(count);
std::fill(heights.begin(), heights.end(), 1.0f);
float ssum = std::accumulate(heights.begin(), heights.end(), 0);
float psum = tbb::parallel_reduce(tbb::blocked_range<std::vector<float>::iterator>(heights.begin(), heights.end()), 0,
[](tbb::blocked_range<std::vector<float>::iterator> const& range, float init) {
return std::accumulate(range.begin(), range.end(), init);
}, std::plus<float>()
);
std::cout << std::endl << " Heights serial sum: " << ssum << " parallel sum: " << psum;
return 0;
}
which outputs the following on my OS X 10.10.3 with Xcode 6.3.1 and TBB stable 4.3-20141023 (installed from Homebrew):
Heights serial sum: 1.67772e+07 parallel sum: 1.67773e+07
Why is that? Should I report an error to TBB developers?
Additional testing, applying your answers:
correct value is: 1949700403
because we add 1.0f to zero 1949700403 times
using (int) init values:
Runtime: 17.407 sec. Heights serial sum: 16777216.000, wrong
Runtime: 8.482 sec. Heights parallel sum: 131127368.000, wrong
using (float) init values:
Runtime: 12.594 sec. Heights serial sum: 16777216.000, wrong
Runtime: 5.044 sec. Heights parallel sum: 303073632.000, wrong
using (double) initial values:
Runtime: 13.671 sec. Heights serial sum: 1949700352.000, wrong
Runtime: 5.343 sec. Heights parallel sum: 263690016.000, wrong
using (double) initial values and tbb::parallel_deterministic_reduce:
Runtime: 13.463 sec. Heights serial sum: 1949700352.000, wrong
Runtime: 99.031 sec. Heights parallel sum: 1949700352.000, wrong >>> almost 10x slower !
Why do all reduce calls produce the wrong sum? Is (double) not sufficient?
Here is my testing code:
#include <iostream>
#include <vector>
#include <numeric>
#include <limits>
#include <sys/time.h>
#include <iomanip>
#include "tbb/tbb.h"
#include <cmath>
class StopWatch {
private:
double elapsedTime;
timeval startTime, endTime;
public:
StopWatch () : elapsedTime(0) {}
void startTimer() {
elapsedTime = 0;
gettimeofday(&startTime, 0);
}
void stopNprintTimer() {
gettimeofday(&endTime, 0);
elapsedTime = (endTime.tv_sec - startTime.tv_sec) * 1000.0; // compute sec to ms
elapsedTime += (endTime.tv_usec - startTime.tv_usec) / 1000.0; // compute us to ms and add
std::cout << " Runtime: " << std::right << std::setw(6) << elapsedTime / 1000 << " sec."; // show in sec
}
};
int main(int argc, const char * argv[]) {
StopWatch watch;
std::cout << std::fixed << std::setprecision(3) << "" << std::endl;
size_t count = std::numeric_limits<int>::max() * 0.9079;
std::vector<float> heights(count);
std::cout << " Vector size: " << count << std::endl;
std::fill(heights.begin(), heights.end(), 1.0f);
watch.startTimer();
float ssum = std::accumulate(heights.begin(), heights.end(), 0.0); // change type of initial value here
watch.stopNprintTimer();
std::cout << " Heights serial sum: " << std::right << std::setw(8) << ssum << std::endl;
watch.startTimer();
float psum = tbb::parallel_reduce(tbb::blocked_range<std::vector<float>::iterator>(heights.begin(), heights.end()), 0.0, // change type of initial value here
[](tbb::blocked_range<std::vector<float>::iterator> const& range, float init) {
return std::accumulate(range.begin(), range.end(), init);
}, std::plus<float>()
);
watch.stopNprintTimer();
std::cout << " Heights parallel sum: " << std::right << std::setw(8) << psum << std::endl;
return 0;
}
Answer to my last question: they all produce wrong results because the values are still being accumulated and/or stored as float, and a float cannot represent integers this large exactly (the serial double accumulation is actually correct, but is rounded to 1949700352 when stored into the float ssum). Switching the element and result type to int solves that:
[...]
std::vector<int> heights(count);
std::cout << " Vector size: " << count << std::endl;
std::fill(heights.begin(), heights.end(), 1);
watch.startTimer();
int ssum = std::accumulate(heights.begin(), heights.end(), (int)0);
watch.stopNprintTimer();
std::cout << " Heights serial sum: " << std::right << std::setw(8) << ssum << std::endl;
watch.startTimer();
int psum = tbb::parallel_reduce(tbb::blocked_range<std::vector<int>::iterator>(heights.begin(), heights.end()), (int)0,
[](tbb::blocked_range<std::vector<int>::iterator> const& range, int init) {
return std::accumulate(range.begin(), range.end(), init);
}, std::plus<int>()
);
watch.stopNprintTimer();
std::cout << " Heights parallel sum: " << std::right << std::setw(8) << psum << std::endl;
[...]
results in:
Vector size: 1949700403
Runtime: 13.041 sec. Heights serial sum: 1949700403, correct
Runtime: 4.728 sec. Heights parallel sum: 1949700403, correct and almost 4x faster
Your call to std::accumulate is doing integer addition, then transforming the result to float at the end of the calculation. In order to accumulate over floating point numbers, the accumulator should be a float*.
float ssum = std::accumulate(heights.begin(), heights.end(), 0.0f);
^^^^
* Or any other type that can accumulate float correctly.
In addition to the other correct answers to the 'why?' part, I'd add that TBB provides parallel_deterministic_reduce, which guarantees reproducible results across two or more runs on the same data (though it can still differ from std::accumulate). See the blog post describing the issue and the deterministic algorithm.
So regarding the 'Should I report an error to TBB developers?' part, the answer is obviously no (unless you find something insufficient on the TBB side).
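A minimal sketch (untested, reusing the question's own lambda) of what the deterministic variant could look like; the only change is the algorithm name, the blocked_range and functional form stay the same:
#include <numeric>
#include <vector>
#include "tbb/tbb.h"

// Hedged sketch: reproducible float reduction over `heights`.
float deterministic_sum(std::vector<float>& heights)
{
    return tbb::parallel_deterministic_reduce(
        tbb::blocked_range<std::vector<float>::iterator>(heights.begin(), heights.end()),
        0.0f, // float identity, so no integer accumulation sneaks in
        [](tbb::blocked_range<std::vector<float>::iterator> const& range, float init) {
            return std::accumulate(range.begin(), range.end(), init);
        },
        std::plus<float>());
}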
This may fix this particular problem for you:
Your call to std::accumulate is doing integer addition, then transforming the result to float at the end of the calculation.
BUT floating point addition is NOT an associative operation:
With accumulate: (...((s+a1)+a2)+...)+an
With parallel_reduce: any parenthesization is possible.
http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
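A tiny demonstration (my own example) of why the grouping matters for float: once a partial sum reaches 2^24, adding 1.0f no longer changes it, which is exactly why the serial sum above saturates at 16777216:
#include <iostream>
#include <iomanip>

int main()
{
    float big = 16777216.0f; // 2^24
    std::cout << std::fixed << std::setprecision(0);
    std::cout << big + 1.0f << "\n";          // 16777216: the 1.0f is absorbed
    std::cout << (1.0f + 1.0f) + big << "\n"; // 16777218: grouping the small terms first preserves them
    return 0;
}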

How can I measure CPU time and wall clock time on both Linux/Windows?

I mean: how can I measure the time my CPU spent on a function's execution, and the wall clock time it takes to run the function? (I'm interested in Linux/Windows and both x86 and x86_64.) Here is what I want to do (I'm using C++ here, but I would prefer a C solution):
int startcputime, endcputime, wcts, wcte;
startcputime = cputime();
function(args);
endcputime = cputime();
std::cout << "it took " << endcputime - startcputime << " s of CPU to execute this\n";
wcts = wallclocktime();
function(args);
wcte = wallclocktime();
std::cout << "it took " << wcte - wcts << " s of real time to execute this\n";
Another important question: is this type of time measuring architecture independent or not?
Here's a copy-paste solution that works on both Windows and Linux as well as C and C++.
As mentioned in the comments, there's a boost library that does this. But if you can't use boost, this should work:
// Windows
#ifdef _WIN32
#include <Windows.h>
double get_wall_time(){
LARGE_INTEGER time,freq;
if (!QueryPerformanceFrequency(&freq)){
// Handle error
return 0;
}
if (!QueryPerformanceCounter(&time)){
// Handle error
return 0;
}
return (double)time.QuadPart / freq.QuadPart;
}
double get_cpu_time(){
FILETIME a,b,c,d;
if (GetProcessTimes(GetCurrentProcess(),&a,&b,&c,&d) != 0){
// Returns total user time.
// Can be tweaked to include kernel times as well.
return
(double)(d.dwLowDateTime |
((unsigned long long)d.dwHighDateTime << 32)) * 0.0000001;
}else{
// Handle error
return 0;
}
}
// Posix/Linux
#else
#include <time.h>
#include <sys/time.h>
double get_wall_time(){
struct timeval time;
if (gettimeofday(&time,NULL)){
// Handle error
return 0;
}
return (double)time.tv_sec + (double)time.tv_usec * .000001;
}
double get_cpu_time(){
return (double)clock() / CLOCKS_PER_SEC;
}
#endif
There's a bunch of ways to implement these clocks. But here's what the above snippet uses:
For Windows:
Wall Time: Performance Counters
CPU Time: GetProcessTimes()
For Linux:
Wall Time: gettimeofday()
CPU Time: clock()
And here's a small demonstration:
#include <math.h>
#include <iostream>
using namespace std;
int main(){
// Start Timers
double wall0 = get_wall_time();
double cpu0 = get_cpu_time();
// Perform some computation.
double sum = 0;
#pragma omp parallel for reduction(+ : sum)
for (long long i = 1; i < 10000000000; i++){
sum += log((double)i);
}
// Stop timers
double wall1 = get_wall_time();
double cpu1 = get_cpu_time();
cout << "Wall Time = " << wall1 - wall0 << endl;
cout << "CPU Time = " << cpu1 - cpu0 << endl;
// Prevent Code Elimination
cout << endl;
cout << "Sum = " << sum << endl;
}
Output (12 threads):
Wall Time = 15.7586
CPU Time = 178.719
Sum = 2.20259e+011
C++11. Much easier to write!
Use std::chrono::system_clock for wall clock and std::clock for cpu clock
http://en.cppreference.com/w/cpp/chrono/system_clock
#include <cstdio>
#include <ctime>
#include <chrono>
....
std::clock_t startcputime = std::clock();
do_some_fancy_stuff();
double cpu_duration = (std::clock() - startcputime) / (double)CLOCKS_PER_SEC;
std::cout << "Finished in " << cpu_duration << " seconds [CPU Clock] " << std::endl;
auto wcts = std::chrono::system_clock::now();
do_some_fancy_stuff();
std::chrono::duration<double> wctduration = (std::chrono::system_clock::now() - wcts);
std::cout << "Finished in " << wctduration.count() << " seconds [Wall Clock]" << std::endl;
Et voilà, easy and portable! No need for #ifdef _WIN32 or LINUX!
You could even use chrono::high_resolution_clock if you need more precision
http://en.cppreference.com/w/cpp/chrono/high_resolution_clock
To give a concrete example of #lip's suggestion to use boost::timer if you can (tested with Boost 1.51):
#include <boost/timer/timer.hpp>
// this is wallclock AND cpu time
boost::timer::cpu_timer timer;
... run some computation ...
boost::timer::cpu_times elapsed = timer.elapsed();
std::cout << " CPU TIME: " << (elapsed.user + elapsed.system) / 1e9 << " seconds"
<< " WALLCLOCK TIME: " << elapsed.wall / 1e9 << " seconds"
<< std::endl;
Use the clock method in time.h:
clock_t start = clock();
/* Do stuffs */
clock_t end = clock();
float seconds = (float)(end - start) / CLOCKS_PER_SEC;
Unfortunately, this method returns CPU time on Linux, but returns wall-clock time on Windows (thanks to commenters for this information).

How to use clock() in C++

How do I call clock() in C++?
For example, I want to test how much time a linear search takes to find a given element in an array.
#include <iostream>
#include <cstdio>
#include <ctime>
int main() {
std::clock_t start;
double duration;
start = std::clock();
/* Your algorithm here */
duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
std::cout<<"printf: "<< duration <<'\n';
}
An alternative solution, which is portable and with higher precision, available since C++11, is to use std::chrono.
Here is an example:
#include <iostream>
#include <chrono>
typedef std::chrono::high_resolution_clock Clock;
int main()
{
auto t1 = Clock::now();
auto t2 = Clock::now();
std::cout << "Delta t2-t1: "
<< std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count()
<< " nanoseconds" << std::endl;
}
Running this on ideone.com gave me:
Delta t2-t1: 282 nanoseconds
clock() returns the number of clock ticks since your program started. There is a related constant, CLOCKS_PER_SEC, which tells you how many clock ticks occur in one second. Thus, you can test any operation like this:
clock_t startTime = clock();
doSomeOperation();
clock_t endTime = clock();
clock_t clockTicksTaken = endTime - startTime;
double timeInSeconds = clockTicksTaken / (double) CLOCKS_PER_SEC;
On Windows at least, the only practically accurate measurement mechanism is QueryPerformanceCounter (QPC). std::chrono is implemented using it (since VS2015, if you use that), but it is not accurate to the same degree as using QueryPerformanceCounter directly. In particular, its claim to report at 1-nanosecond granularity is absolutely not correct. So, if you're measuring something that takes a very short amount of time (and your case might just be such a case), then you should use QPC, or the equivalent for your OS. I came up against this when measuring cache latencies, and I jotted down some notes that you might find useful, here:
https://github.com/jarlostensen/notesandcomments/blob/master/stdchronovsqcp.md
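For illustration, a minimal Windows-only sketch (my own, untested) of timing with QueryPerformanceCounter directly, along the lines of the get_wall_time() helper shown earlier in this thread:
#include <windows.h>
#include <iostream>

int main()
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq); // counter ticks per second
    QueryPerformanceCounter(&t0);

    // ... the operation you want to measure ...

    QueryPerformanceCounter(&t1);
    double seconds = double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
    std::cout << seconds * 1e9 << " ns elapsed" << std::endl;
    return 0;
}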
#include <iostream>
#include <ctime>
#include <cstdlib> //_sleep() --- just a function that waits a certain amount of milliseconds
using namespace std;
int main()
{
clock_t cl; //initializing a clock type
cl = clock(); //starting time of clock
_sleep(5167); //insert code here
cl = clock() - cl; //end point of clock
_sleep(1000); //testing to see if it actually stops at the end point
cout << cl/(double)CLOCKS_PER_SEC << endl; //prints the determined ticks per second (seconds passed)
return 0;
}
//outputs "5.17"
You can measure how long your program works. The following functions help measure the CPU time since the start of the program:
C++: (double)clock() / CLOCKS_PER_SEC, with <ctime> included.
Python: time.clock() returns a floating-point value in seconds.
Java: System.nanoTime() returns a long value in nanoseconds.
My reference: the Algorithmic Toolbox course (week 1), part of the Data Structures and Algorithms specialization by the University of California San Diego & the National Research University Higher School of Economics.
So you can add this line of code after your algorithm:
cout << (double)clock() / CLOCKS_PER_SEC;
Expected output: the CPU time consumed so far, in seconds.
Probably you might be interested in timer like this :
H : M : S . Msec.
The code, for Linux:
#include <iostream>
#include <unistd.h>
using namespace std;
void newline();
int main() {
int msec = 0;
int sec = 0;
int min = 0;
int hr = 0;
//cout << "Press any key to start:";
//char start = _gtech();
for (;;)
{
newline();
if(msec == 1000)
{
++sec;
msec = 0;
}
if(sec == 60)
{
++min;
sec = 0;
}
if(min == 60)
{
++hr;
min = 0;
}
cout << hr << " : " << min << " : " << sec << " . " << msec << endl;
++msec;
usleep(1000); // sleep ~1 ms so the msec counter roughly tracks real milliseconds
}
return 0;
}
void newline()
{
cout << "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n";
}