compilers processing maths differently? - c++

I made a code to find the derivative of a function at a given point. The code reads
#include"stdafx.h"
#include<iostream>
using namespace std;
double function(double x) {
return (3 * x * x);
}
int main() {
double x, y, dy, dx;
cin >> x;
y = function(x);
dx = 0.00000001;
dy = function(x + dx) - y;
cout << "Derivative of function at x = " << x << " is " << (double)dy / dx;
cin >> x;
}
Now my college uses turbo C++ as its IDE and compiler while at home I have visual studio (because TC++ looks very bad on a 900p screen but jokes apart). When I tried a similar program on the college PCs the result was quite messed up and was much less accurate than what I am getting at home. for example:
Examples:
x = 3
#College result = 18.something
#Home result = 18 (precise without a decimal point)
x = 1
#College result = 6.000.....something
#Home result = 6 (precise without a decimal point)
The Very big Question:
Why are different compilers giving different results ?

I’m 90% sure the result’s same in both cases, and the only reason why you see difference is different output formatting.
For 64-bit IEEE double math, the precise results of those computations are probably 17.9999997129698385833762586116790771484375 and 6.0000000079440951594733633100986480712890625, respectively.
If you want to verify that hypothesis, you can print you double values this way:
void printDoubleAsHex( double val )
{
const uint64_t* p = (const uint64_t*)( &val );
printf( "%" PRIx64 "\n", *p );
}
And verify you have same output in both compilers.
However, there’s also 10% chance that indeed your two compilers compiled your code in a way that the result is different. Ain’t uncommon, it can even happen with the same compiler but different settings/flags/options.
The most likely reason is different instruction sets. By default, many modern compilers generate SSE instructions for the code like yours, older ones producer legacy x87 code (x87 operates on 80-bit floating point values on the stack, SSE on 32 or 64-bit FP values on these vector registers, hence the difference in precision). Another reason is different rounding modes. Another one is compiler specific optimizations, such as /fp in Visual C++.

Related

Double data type causing error

I am trying to get the fixed points for the tent equation. With the given initial conditions, the solution must be 0.6. Everything works fine when I use float for x0, but when I define x0 as a double, the solution changes to 0.59999 in 55th iteration, which causes further changes in the next iteration and so on. Why is there such a difference while choosing the data types?
using namespace std;
#include <iostream>
main()
{
double x0=.6;
for (int i=0;i<100;i++)
{
if(x0<.5)
x0=1.5*x0;
else
x0=1.5*(1-x0);
cout << i << "\t" << x0 << endl;
}
}
I have posted an image of the results. Comparison of solutions - Float and Double
The real value of the 55-th iteration is 0.5999994832150543633275674437754787504673004150390625 when a double is used and 0.60000002384185791015625 for a float (on my system).
The difference between the two is in how precise the numbers are and the rounding is what throws you off.
BTW, neither of the two values is absolutely accurate, they are just close enough approximations and nothing more. With the double being "more precise".
UPDATE
After a few comments back and forth it turned out that there was no need for floating-point arithmetic to be gin with. Integers (slightly modified) do just fine and do not introduce any rounding.

Casting float to int inconsistent across MinGw and Clang

Using C++, I'm trying to cast a float value to an int using these instructions :
#include <iostream>
int main() {
float NbrToCast = 1.8f;
int TmpNbr = NbrToCast * 10;
std::cout << TmpNbr << "\n";
}
I understand the value 1.8 cannot be precisely represented as a float and is actually stored as 1.79999995.
Thus, I would expect that multiplying this value by ten, would result to 17.99999995 and then casting it to an int would give 17.
When compiling and running this code with MinGW (v4.9.2 32bits) on Windows 7, I get the expected result (17).
When compiling and running this code with CLang (v600.0.57) on my Mac (OS X 10.11), I get 18as a result, which is not what I was expecting but which seems more correct in a mathematical way !
Why do I get this difference ?
Is there a way to have a consistent behavior no matter the OS or the compiler ?
Like Yuushi said in the comments, I guess the rounding rules may differ for each compiler. Having a portable solution on such a topic probably means you need to write your own rounding method.
So in your case you probably need to check the value of the digit after 7 and increment the value or not. Let's say something like:
int main() {
float NbrToCast = 1.8f;
float TmpNbr = NbrToCast * 10;
std::cout << RoundingFloatToInt(TmpNbr) << "\n";
}
int RoundingFloatToInt(const float &val)
{
float intPart, fractPart;
fractpart = modf (val, &intpart);
int result = intPart;
if (fractpart > 0.5)
{
result++;
}
return result;
}
(code not tested at all but you have the idea)
If you need performance, it's probably not great but I think it should be portable.

How does the cout statement affect the O/P of the code written?

#include <iostream>
#include <iomanip>
#include <math.h>
using namespace std;
int main() {
int t;
double n;
cin>>t;
while(t--)
{
cin>>n;
double x;
for(int i=1;i<=10000;i++)
{
x=n*i;
if(x==ceilf(x))
{
cout<<i<<endl;
break;
}
}
}
return 0;
}
For I/P:
3
5
2.98
3.16
O/P:
1
If my code is:
#include <iostream>
#include <iomanip>
#include <math.h>
using namespace std;
int main() {
int t;
double n;
cin>>t;
while(t--)
{
cin>>n;
double x;
for(int i=1;i<=10000;i++)
{
x=n*i;
cout<<"";//only this statement is added;
if(x==ceilf(x))
{
cout<<i<<endl;
break;
}
}
}
return 0;
}
For the same input O/P is:
1
50
25
The only extra line added in 2nd code is: cout<<"";
Can anyone please help in finding why there is such a difference in output just because of the cout statement added in the 2nd code?
Well this is a veritable Heisenbug. I've tried to strip your code down to a minimal replicating example, and ended up with this (http://ideone.com/mFgs0S):
#include <iostream>
#include <math.h>
using namespace std;
int main()
{
float n;
cin >> n; // this input is needed to reproduce, but the value doesn't matter
n = 2.98; // overwrite the input value
cout << ""; // comment this out => y = z = 149
float x = n * 50; // 149
float y = ceilf(x); // 150
cout << ""; // comment this out => y = z = 150
float z = ceilf(x); // 149
cout << "x:" << x << " y:" << y << " z:" << z << endl;
}
The behaviour of ceilf appears to depend on the particular sequence of iostream operations that occur around it. Unfortunately I don't have the means to debug in any more detail at the moment, but maybe this will help someone else to figure out what's going on. Regardless, it seems almost certain that it's a bug in gcc-4.9.2 and gcc-5.1. (You can check on ideone that you don't get this behaviour in gcc-4.3.2.)
You're probably getting an issue with floating point representations - which is to say that computers cannot perfectly represent all fractions. So while you see 50, the result is probably something closer to 50.00000000001. This is a pretty common problem you'll run across when dealing with doubles and floats.
A common way to deal with it is to define a very small constant (in mathematical terms this is Epsilon, a number which is simply "small enough")
const double EPSILON = 0.000000001;
And then your comparison will change from
if (x==ceilf(x))
to something like
double difference = fabs(x - ceilf(x));
if (difference < EPSILON)
This will smooth out those tiny inaccuracies in your doubles.
"Comparing for equality
Floating point math is not exact. Simple values like 0.2 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations can change the result. Different compilers and CPU architectures store temporary results at different precisions, so results will differ depending on the details of your environment. If you do a calculation and then compare the results against some expected value it is highly unlikely that you will get exactly the result you intended.
In other words, if you do a calculation and then do this comparison:
if (result == expectedResult)
then it is unlikely that the comparison will be true. If the comparison is true then it is probably unstable – tiny changes in the input values, compiler, or CPU may change the result and make the comparison be false."
From http://www.cygnus-software.com/papers/comparingfloats/Comparing%20floating%20point%20numbers.htm
Hope this answers your question.
Also you had a problem with
if(x==ceilf(x))
ceilf() returns a float value and x you have declared as a double.
Refer to problems in floating point comparison as to why that wont work.
change x to float and the program runs fine,
I made a plain try on my laptop and even online compilers.
g++ (4.9.2-10) gave the desired output (3 outputs), along with online compiler at geeksforgeeks.org. However, ideone, codechef did not gave the right output.
All I can infer is that online compilers name their compiler as "C++(gcc)" and give wrong output. While, geeksforgeeks.org, which names the compiler as "C++" runs perfectly, along with g++ (as tested on Linux).
So, we could arrive at a hypothesis that they use gcc to compile C++ code as a method suggested at this link. :)

Benchmarking math.h square root and Quake square root

Okay so I was board and wondered how fast math.h square root was in comparison to the one with the magic number in it (made famous by Quake but made by SGI).
But this has ended up in a world of hurt for me.
I first tried this on the Mac where the math.h would win hands down every time then on Windows where the magic number always won, but I think this is all down to my own noobness.
Compiling on the Mac with "g++ -o sq_root sq_root_test.cpp" when the program ran it takes about 15 seconds to complete. But compiling in VS2005 on release takes a split second. (in fact I had to compile in debug just to get it to show some numbers)
My poor man's benchmarking? is this really stupid? cos I get 0.01 for math.h and 0 for the Magic number. (it cant be that fast can it?)
I don't know if this matters but the Mac is Intel and the PC is AMD. Is the Mac using hardware for math.h sqroot?
I got the fast square root algorithm from http://en.wikipedia.org/wiki/Fast_inverse_square_root
//sq_root_test.cpp
#include <iostream>
#include <math.h>
#include <ctime>
float invSqrt(float x)
{
union {
float f;
int i;
} tmp;
tmp.f = x;
tmp.i = 0x5f3759df - (tmp.i >> 1);
float y = tmp.f;
return y * (1.5f - 0.5f * x * y * y);
}
int main() {
std::clock_t start;// = std::clock();
std::clock_t end;
float rootMe;
int iterations = 999999999;
// ---
rootMe = 2.0f;
start = std::clock();
std::cout << "Math.h SqRoot: ";
for (int m = 0; m < iterations; m++) {
(float)(1.0/sqrt(rootMe));
rootMe++;
}
end = std::clock();
std::cout << (difftime(end, start)) << std::endl;
// ---
std::cout << "Quake SqRoot: ";
rootMe = 2.0f;
start = std::clock();
for (int q = 0; q < iterations; q++) {
invSqrt(rootMe);
rootMe++;
}
end = std::clock();
std::cout << (difftime(end, start)) << std::endl;
}
There are several problems with your benchmarks. First, your benchmark includes a potentially expensive cast from int to float. If you want to know what a square root costs, you should benchmark square roots, not datatype conversions.
Second, your entire benchmark can be (and is) optimized out by the compiler because it has no observable side effects. You don't use the returned value (or store it in a volatile memory location), so the compiler sees that it can skip the whole thing.
A clue here is that you had to disable optimizations. That means your benchmarking code is broken. Never ever disable optimizations when benchmarking. You want to know which version runs fastest, so you should test it under the conditions it'd actually be used under. If you were to use square roots in performance-sensitive code, you'd enable optimizations, so how it behaves without optimizations is completely irrelevant.
Also, you're not benchmarking the cost of computing a square root, but of the inverse square root.
If you want to know which way of computing the square root is fastest, you have to move the 1.0/... division down to the Quake version. (And since division is a pretty expensive operation, this might make a big difference in your results)
Finally, it might be worth pointing out that Carmacks little trick was designed to be fast on 12 year old computers. Once you fix your benchmark, you'll probably find that it's no longer an optimization, because today's CPU's are much faster at computing "real" square roots.

How to speed up floating-point to integer number conversion? [duplicate]

This question already has answers here:
What is the fastest way to convert float to int on x86
(10 answers)
Closed 8 years ago.
We're doing a great deal of floating-point to integer number conversions in our project. Basically, something like this
for(int i = 0; i < HUGE_NUMBER; i++)
int_array[i] = float_array[i];
The default C function which performs the conversion turns out to be quite time consuming.
Is there any work around (maybe a hand tuned function) which can speed up the process a little bit? We don't care much about a precision.
Most of the other answers here just try to eliminate loop overhead.
Only deft_code's answer gets to the heart of what is likely the real problem -- that converting floating point to integers is shockingly expensive on an x86 processor. deft_code's solution is correct, though he gives no citation or explanation.
Here is the source of the trick, with some explanation and also versions specific to whether you want to round up, down, or toward zero: Know your FPU
Sorry to provide a link, but really anything written here, short of reproducing that excellent article, is not going to make things clear.
inline int float2int( double d )
{
union Cast
{
double d;
long l;
};
volatile Cast c;
c.d = d + 6755399441055744.0;
return c.l;
}
// this is the same thing but it's
// not always optimizer safe
inline int float2int( double d )
{
d += 6755399441055744.0;
return reinterpret_cast<int&>(d);
}
for(int i = 0; i < HUGE_NUMBER; i++)
int_array[i] = float2int(float_array[i]);
The double parameter is not a mistake! There is way to do this trick with floats directly but it gets ugly trying to cover all the corner cases. In its current form this function will round the float the nearest whole number if you want truncation instead use 6755399441055743.5 (0.5 less).
I ran some tests on different ways of doing float-to-int conversion. The short answer is to assume your customer has SSE2-capable CPUs and set the /arch:SSE2 compiler flag. This will allow the compiler to use the SSE scalar instructions which are twice as fast as even the magic-number technique.
Otherwise, if you have long strings of floats to grind, use the SSE2 packed ops.
There's an FISTTP instruction in the SSE3 instruction set which does what you want, but as to whether or not it could be utilized and produce faster results than libc, I have no idea.
Is the time large enough that it outweighs the cost of starting a couple of threads?
Assuming you have a multi-core processor or multiple processors on your box that you could take advantage of, this would be a trivial task to parallelize across multiple threads.
The key is to avoid the _ftol() function, which is needlessly slow. Your best bet for long lists of data like this is to use the SSE2 instruction cvtps2dq to convert two packed floats to two packed int64s. Do this twice (getting four int64s across two SSE registers) and you can shuffle them together to get four int32s (losing the top 32 bits of each conversion result). You don't need assembly to do this; MSVC exposes compiler intrinsics to the relevant instructions -- _mm_cvtpd_epi32() if my memory serves me correctly.
If you do this it is very important that your float and int arrays be 16-byte aligned so that the SSE2 load/store intrinsics can work at maximum efficiency. Also, I recommend you software pipeline a little and process sixteen floats at once in each loop, eg (assuming that the "functions" here are actually calls to compiler intrinsics):
for(int i = 0; i < HUGE_NUMBER; i+=16)
{
//int_array[i] = float_array[i];
__m128 a = sse_load4(float_array+i+0);
__m128 b = sse_load4(float_array+i+4);
__m128 c = sse_load4(float_array+i+8);
__m128 d = sse_load4(float_array+i+12);
a = sse_convert4(a);
b = sse_convert4(b);
c = sse_convert4(c);
d = sse_convert4(d);
sse_write4(int_array+i+0, a);
sse_write4(int_array+i+4, b);
sse_write4(int_array+i+8, c);
sse_write4(int_array+i+12, d);
}
The reason for this is that the SSE instructions have a long latency, so if you follow a load into xmm0 immediately with a dependent operation on xmm0 then you will have a stall. Having multiple registers "in flight" at once hides the latency a little. (Theoretically a magic all-knowing compiler could alias its way around this problem but in practice it doesn't.)
Failing this SSE juju you can supply the /QIfist option to MSVC which will cause it to issue the single opcode fist instead of a call to _ftol; this means it will simply use whichever rounding mode happens to be set in the CPU without making sure it is ANSI C's specific truncate op. The Microsoft docs say /QIfist is deprecated because their floating point code is fast now, but a disassembler will show you that this is unjustifiedly optimistic. Even /fp:fast simply results to a call to _ftol_sse2, which though faster than the egregious _ftol is still a function call followed by a latent SSE op, and thus unnecessarily slow.
I'm assuming you're on x86 arch, by the way -- if you're on PPC there are equivalent VMX operations, or you can use the magic-number-multiply trick mentioned above followed by a vsel (to mask out the non-mantissa bits) and an aligned store.
You might be able to load all of the integers into the SSE module of your processor using some magic assembly code, then do the equivalent code to set the values to ints, then read them as floats. I'm not sure this would be any faster though. I'm not a SSE guru, so I don't know how to do this. Maybe someone else can chime in.
In Visual C++ 2008, the compiler generates SSE2 calls by itself, if you do a release build with maxed out optimization options, and look at a disassembly (though some conditions have to be met, play around with your code).
See this Intel article for speeding up integer conversions:
http://software.intel.com/en-us/articles/latency-of-floating-point-to-integer-conversions/
According to Microsoft, the /QIfist compiler option is deprecated in VS 2005 because integer conversion has been sped up. They neglect to say how it has been sped up, but looking at the disassembly listing might give a clue.
http://msdn.microsoft.com/en-us/library/z8dh4h17(vs.80).aspx
most c compilers generate calls to _ftol or something for every float to int conversion. putting a reduced floating point conformance switch (like fp:fast) might help - IF you understand AND accept the other effects of this switch. other than that, put the thing in a tight assembly or sse intrinsic loop, IF you are ok AND understand the different rounding behavior.
for large loops like your example you should write a function that sets up floating point control words once and then does the bulk rounding with only fistp instructions and then resets the control word - IF you are ok with an x86 only code path, but at least you will not change the rounding.
read up on the fld and fistp fpu instructions and the fpu control word.
What compiler are you using? In Microsoft's more recent C/C++ compilers, there is an option under C/C++ -> Code Generation -> Floating point model, which has options: fast, precise, strict. I think precise is the default, and works by emulating FP operations to some extent. If you are using a MS compiler, how is this option set? Does it help to set it to "fast"? In any case, what does the disassembly look like?
As thirtyseven said above, the CPU can convert float<->int in essentially one instruction, and it doesn't get any faster than that (short of a SIMD operation).
Also note that modern CPUs use the same FP unit for both single (32 bit) and double (64 bit) FP numbers, so unless you are trying to save memory storing a lot of floats, there's really no reason to favor float over double.
On Intel your best bet is inline SSE2 calls.
I'm surprised by your result. What compiler are you using? Are you compiling with optimization turned all the way up? Have you confirmed using valgrind and Kcachegrind that this is where the bottleneck is? What processor are you using? What does the assembly code look like?
The conversion itself should be compiled to a single instruction. A good optimizing compiler should unroll the loop so that several conversions are done per test-and-branch. If that's not happening, you can unroll the loop by hand:
for(int i = 0; i < HUGE_NUMBER-3; i += 4) {
int_array[i] = float_array[i];
int_array[i+1] = float_array[i+1];
int_array[i+2] = float_array[i+2];
int_array[i+3] = float_array[i+3];
}
for(; i < HUGE_NUMBER; i++)
int_array[i] = float_array[i];
If your compiler is really pathetic, you might need to help it with the common subexpressions, e.g.,
int *ip = int_array+i;
float *fp = float_array+i;
ip[0] = fp[0];
ip[1] = fp[1];
ip[2] = fp[2];
ip[3] = fp[3];
Do report back with more info!
If you do not care very much about the rounding semantics, you can use the lrint() function. This allows for more freedom in rounding and it can be much faster.
Technically, it's a C99 function, but your compiler probably exposes it in C++. A good compiler will also inline it to one instruction (a modern G++ will).
lrint documentation
rounding only
excellent trick, only the use 6755399441055743.5 (0.5 less) to do rounding won't work.
6755399441055744 = 2^52 + 2^51 overflowing decimals off the end of the mantissa leaving the integer that you want in bits 51 - 0 of the fpu register.
In IEEE 754
6755399441055744.0 =
sign exponent mantissa
0 10000110011 1000000000000000000000000000000000000000000000000000
6755399441055743.5
will also however compile to
0100001100111000000000000000000000000000000000000000000000000000
the 0.5 overflows off the end (rounding up) which is why this works in the first place.
to do truncation you would have to add 0.5 to your double then do this
the guard digits should take care of rounding to the correct result done this way.
also watch out for 64 bit gcc linux where long rather annoyingly means a 64 bit integer.
If you have very large arrays (bigger than a few MB--the size of the CPU cache), time your code and see what the throughput is. You're probably saturating the memory bus, not the FP unit. Look up the maximum theoretical bandwidth for your CPU and see how close to it you are.
If you're being limited by the memory bus, extra threads will just make it worse. You need better hardware (e.g. faster memory, different CPU, different motherboard).
In response to Larry Gritz's comment...
You are correct: the FPU is a major bottleneck (and using the xs_CRoundToInt trick allows one to come very close to saturating the memory bus).
Here are some test results for a Core 2 (Q6600) processor. The theoretical main-memory bandwidth for this machine is 3.2 GB/s (L1 and L2 bandwidths are much higher). The code was compiled with Visual Studio 2008. Similar results for 32-bit and 64-bit, and with /O2 or /Ox optimizations.
WRITING ONLY...
1866359 ticks with 33554432 array elements (33554432 touched). Bandwidth: 1.91793 GB/s
154749 ticks with 262144 array elements (33554432 touched). Bandwidth: 23.1313 GB/s
108816 ticks with 8192 array elements (33554432 touched). Bandwidth: 32.8954 GB/s
USING CASTING...
5236122 ticks with 33554432 array elements (33554432 touched). Bandwidth: 0.683625 GB/s
2014309 ticks with 262144 array elements (33554432 touched). Bandwidth: 1.77706 GB/s
1967345 ticks with 8192 array elements (33554432 touched). Bandwidth: 1.81948 GB/s
USING xs_CRoundToInt...
1490583 ticks with 33554432 array elements (33554432 touched). Bandwidth: 2.40144 GB/s
1079530 ticks with 262144 array elements (33554432 touched). Bandwidth: 3.31584 GB/s
1008407 ticks with 8192 array elements (33554432 touched). Bandwidth: 3.5497 GB/s
(Windows) source code:
// floatToIntTime.cpp : Defines the entry point for the console application.
//
#include <windows.h>
#include <iostream>
using namespace std;
double const _xs_doublemagic = double(6755399441055744.0);
inline int xs_CRoundToInt(double val, double dmr=_xs_doublemagic) {
val = val + dmr;
return ((int*)&val)[0];
}
static size_t const N = 256*1024*1024/sizeof(double);
int I[N];
double F[N];
static size_t const L1CACHE = 128*1024/sizeof(double);
static size_t const L2CACHE = 4*1024*1024/sizeof(double);
static size_t const Sz[] = {N, L2CACHE/2, L1CACHE/2};
static size_t const NIter[] = {1, N/(L2CACHE/2), N/(L1CACHE/2)};
int main(int argc, char *argv[])
{
__int64 freq;
QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
cout << "WRITING ONLY..." << endl;
for (int t=0; t<3; t++) {
__int64 t0,t1;
QueryPerformanceCounter((LARGE_INTEGER*)&t0);
size_t const niter = NIter[t];
size_t const sz = Sz[t];
for (size_t i=0; i<niter; i++) {
for (size_t n=0; n<sz; n++) {
I[n] = 13;
}
}
QueryPerformanceCounter((LARGE_INTEGER*)&t1);
double bandwidth = 8*niter*sz / (((double)(t1-t0))/freq) / 1024/1024/1024;
cout << " " << (t1-t0) << " ticks with " << sz
<< " array elements (" << niter*sz << " touched). "
<< "Bandwidth: " << bandwidth << " GB/s" << endl;
}
cout << "USING CASTING..." << endl;
for (int t=0; t<3; t++) {
__int64 t0,t1;
QueryPerformanceCounter((LARGE_INTEGER*)&t0);
size_t const niter = NIter[t];
size_t const sz = Sz[t];
for (size_t i=0; i<niter; i++) {
for (size_t n=0; n<sz; n++) {
I[n] = (int)F[n];
}
}
QueryPerformanceCounter((LARGE_INTEGER*)&t1);
double bandwidth = 8*niter*sz / (((double)(t1-t0))/freq) / 1024/1024/1024;
cout << " " << (t1-t0) << " ticks with " << sz
<< " array elements (" << niter*sz << " touched). "
<< "Bandwidth: " << bandwidth << " GB/s" << endl;
}
cout << "USING xs_CRoundToInt..." << endl;
for (int t=0; t<3; t++) {
__int64 t0,t1;
QueryPerformanceCounter((LARGE_INTEGER*)&t0);
size_t const niter = NIter[t];
size_t const sz = Sz[t];
for (size_t i=0; i<niter; i++) {
for (size_t n=0; n<sz; n++) {
I[n] = xs_CRoundToInt(F[n]);
}
}
QueryPerformanceCounter((LARGE_INTEGER*)&t1);
double bandwidth = 8*niter*sz / (((double)(t1-t0))/freq) / 1024/1024/1024;
cout << " " << (t1-t0) << " ticks with " << sz
<< " array elements (" << niter*sz << " touched). "
<< "Bandwidth: " << bandwidth << " GB/s" << endl;
}
return 0;
}