How to get rid of different results of logarithm between CUDA and CPU? - c++

I want to implement an algorithm on the GPU using CUDA. At the same time, I have written a CPU version in C++ to verify the GPU results. However, I ran into trouble when using log() on the CPU and the GPU. A very simple piece of the algorithm (used on both the CPU and the GPU) is shown below:
float U;
float R = U * log(U);
However, when I compare the results from the two sides, I find that many of them (459883 out of 1843161) have small differences (the maximum difference is 0.5). Some results are shown below:
U -- R (CPU side) -- R (GPU side) -- R using Python (U * math.log(U))
86312.0 -- 980998.375000 -- 980998.3125 -- 980998.3627440572
67405.0 -- 749440.750000 -- 749440.812500 -- 749440.7721980268
49652.0 -- 536876.875000 -- 536876.812500 -- 536876.8452369706
32261.0 -- 334921.250000 -- 334921.281250 -- 334921.2605240216
24232.0 -- 244632.437500 -- 244632.453125 -- 244632.4440747978
Can anybody give me some suggestions? Which one should I trust?

Which one should I trust?
You should trust the double-precision result computed by Python, which you could also have computed with CUDA or C++ in double precision to obtain very similar (although likely still not identical) values.
To rephrase the first comment made by aland, if you care about an error of 0.0625 in 980998, you shouldn't be using single precision in the first place. Both the CPU and the GPU results are “wrong” for that level of accuracy. In your examples, the CPU result happens to be more accurate, but you can see that both single-precision results are quite distant from the more accurate double-precision Python result. This is simply a consequence of using a format that allows 24 significant binary digits (about 7 decimal digits), not just for the input and the end result, but also for intermediate computations.
If the input is provided as float and you want the most accurate float result for R, compute U * log(U) using double and round to float only in the end. Then the results will almost always be identical between CPU and GPU.
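A minimal sketch of that suggestion (plain host C++; the same function body also works in a CUDA __device__ function on hardware with native double support):
#include <cmath>
#include <cstdio>

// Promote to double, evaluate the log and the multiply in double,
// and round to float exactly once at the end.
float u_log_u(float U) {
    double u = static_cast<double>(U);
    return static_cast<float>(u * std::log(u));
}

int main() {
    const float inputs[] = {86312.0f, 67405.0f, 49652.0f, 32261.0f, 24232.0f};
    for (float U : inputs)
        std::printf("%.4f\n", u_log_u(U));  // e.g. 980998.3750 for U = 86312
    return 0;
}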

Out of curiosity, I compared the position of the last set bit in the significand (or, in other words, the number of trailing zeros in the significand).
I did it with Squeak Smalltalk because I'm more comfortable with it, but I'm pretty sure you can find equivalent libraries in Python:
CPU:
#(980998.375000 749440.750000 536876.875000 334921.250000 244632.437500)
collect: [:e | e asTrueFraction numerator highBit].
-> #(23 22 23 21 22)
GPU:
#(980998.3125 749440.812500 536876.812500 334921.281250 244632.453125)
collect: [:e | e asTrueFraction numerator highBit].
-> #(24 24 24 24 24)
Interestingly, that's not as random as we might expect, especially on the GPU side, but there is not enough of a clue at this stage...
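For anyone without Smalltalk at hand, here is a rough C++ equivalent of that measurement (a sketch, assuming IEEE-754 binary32 floats):
#include <cstdint>
#include <cstdio>
#include <cstring>

// Number of significand bits actually used by a normal float, i.e. 24 minus
// the number of trailing zero bits of its 24-bit significand. This is the
// same quantity as "asTrueFraction numerator highBit" above.
int significand_bits_used(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    uint32_t mantissa = (bits & 0x7FFFFFu) | 0x800000u;  // restore the implicit leading 1
    int trailing = 0;
    while ((mantissa & 1u) == 0u) { mantissa >>= 1; ++trailing; }
    return 24 - trailing;
}

int main() {
    const float cpu[] = {980998.375f, 749440.75f, 536876.875f, 334921.25f, 244632.4375f};
    const float gpu[] = {980998.3125f, 749440.8125f, 536876.8125f, 334921.28125f, 244632.453125f};
    for (int i = 0; i < 5; ++i)
        std::printf("CPU: %d bits, GPU: %d bits\n",
                    significand_bits_used(cpu[i]), significand_bits_used(gpu[i]));
    return 0;  // prints 23/24, 22/24, 23/24, 21/24, 22/24, matching the results above
}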
Then I used the ArbitraryPrecisionFloat package to perform (emulate) the operations in extended precision and round to the nearest single-precision float; the correctly rounded answer matches the CPU's quite exactly:
#( 86312 67405 49652 32261 24232 ) collect: [:e |
    | u r |
    u := e asArbitraryPrecisionFloatNumBits: 80.
    r := u * u ln.
    (r asArbitraryPrecisionFloatNumBits: 24) asTrueFraction printShowingMaxDecimalPlaces: 100]
-> #('980998.375' '749440.75' '536876.875' '334921.25' '244632.4375')
It works as well with 64 bits.
But if I emulate the operations in single precision, then I can say the GPU matches the emulated results quite well too (except the second item):
#( 86312 67405 49652 32261 24232 ) collect: [:e |
    | u r |
    u := e asArbitraryPrecisionFloatNumBits: 24.
    r := u * u ln.
    r asTrueFraction printShowingMaxDecimalPlaces: 100]
-> #('980998.3125' '749440.75' '536876.8125' '334921.28125' '244632.453125')
So I'd say the CPU probably used double (or extended) precision to evaluate the log and perform the multiplication.
The GPU, on the other hand, performed all the operations in single precision. The ln function of the ArbitraryPrecisionFloat package is correct to half an ulp, but that is not a requirement of IEEE 754, which can explain the observed mismatch on the second item.
You may try to write the code so as to force single precision (for example by using logf instead of log if it's C99, or by using intermediate results: float ln = log(u); float r = u * ln;) and possibly use appropriate compilation flags to forbid extended precision (I can't remember which; I don't use C every day). But even then you have very little guarantee of obtaining a 100% match on the log function; the standards are too lax.
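For reference, a minimal sketch of the force-everything-to-float variant described above (whether it matches the GPU still depends on how each side's logf rounds):
#include <cmath>
#include <cstdio>

// Keep every intermediate in single precision: logf returns a float and the
// product is stored in a float before use. Note that logf itself is not
// required to be correctly rounded, so different implementations may still
// disagree in the last bit.
float u_log_u_single(float u) {
    float ln = logf(u);   // single-precision log
    float r  = u * ln;    // single-precision multiply
    return r;
}

int main() {
    std::printf("%.4f\n", u_log_u_single(86312.0f));
    return 0;
}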

Related

Cyclic Redundancy check : Single and double bit error

I found this in the book by Forouzan (Data Communications and Networking, 5th edition), but I am not able to understand the logic behind it.
This is in the context of the topic of two isolated single-bit errors:
In other words, g(x) must not divide x^t + 1, where t is between 0 and n − 1. However, t = 0 is meaningless and t = 1 is needed as we will see later. This means t should be between 2 and n – 1
Why is t = 1 excluded here? (x^1 + 1) represents two consecutive errors, which must also be detected using our g(x), right?
The third image states that (x+1) should be a factor of g(x). This reduces the maximum length at which the CRC is guaranteed to detect 2-bit errors from n-1 to (n/2)-1, but it provides the advantage of being able to detect any odd number of bit errors, such as (x^k + x^j + x^i) where k+j+i <= (n/2)-1.
Not mentioned in the book is that some generators can detect more than 3 errors, but they sacrifice the maximum message length in order to do this.
If a CRC can detect e errors, then it can also correct floor(e/2) errors, but I'm not aware of an efficient algorithm to do this other than a huge table lookup (if there is enough space). For example, there is a 32-bit CRC (in hex: 1f1922815 = 787·557·465·3·3) that can detect 7-bit errors or correct 3-bit errors for a message size up to 1024 bits, but fast correction requires a 1.4 gigabyte lookup table.
As for "t = 1 is needed", the book later clarifies this by noting that g(x) = (x+1) cannot detect adjacent bit errors. In the other statement, the book does not special-case t = 0 or t = 1; it states, "If a generator cannot divide (x^t + 1), t between 0 and n-1, then all isolated double bit errors can be detected", except that if t = 0, (x^0 + 1) = (1 + 1) = 0, which would be a zero-bit-error case.
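To make the divisibility condition concrete, here is a small C++ sketch (my own illustration, not from the book) that checks for which t a sample generator g(x) = x^3 + x + 1 divides x^t + 1 over GF(2); the first such t bounds the block length for which isolated double-bit errors are guaranteed to be detected:
#include <cstdint>
#include <cstdio>

// Remainder of dividend modulo divisor, both taken as polynomials over GF(2).
// Bit i of a value is the coefficient of x^i (e.g. 0xB = 0b1011 = x^3 + x + 1).
uint64_t gf2_mod(uint64_t dividend, uint64_t divisor) {
    int divDeg = 63;
    while (divDeg > 0 && !((divisor >> divDeg) & 1u)) --divDeg;
    for (int i = 63; i >= divDeg; --i)
        if ((dividend >> i) & 1u)
            dividend ^= divisor << (i - divDeg);
    return dividend;
}

int main() {
    const uint64_t g = 0xB;  // x^3 + x + 1, chosen only as an example generator
    // A double-bit error x^a + x^b (a > b) factors as x^b * (x^(a-b) + 1), so it
    // is detected as long as g(x) does not divide x^t + 1 for t = a - b.
    for (int t = 2; t <= 10; ++t) {
        uint64_t poly = (1ull << t) | 1ull;  // x^t + 1
        std::printf("x^%d + 1 %s divisible by g(x)\n",
                    t, gf2_mod(poly, g) == 0 ? "IS" : "is not");
    }
    return 0;  // for this g(x), the first t that divides is t = 7
}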

Second real written to stdout with the P descriptor is wrong by factor 10

Here's a minimal working example:
program test_stuff
   implicit none
   real :: b
   b = 10000.0
   write(*,'(A10,1PE12.4,F12.4)') "b, b: ", b, b
end program
which I simply compile with gfortran test_stuff.f90 -o test_stuff
However, running the program gives the following output:
$ ./test_stuff
b, b: 1.0000E+04 100000.0000
The second real written to the screen is wrong by a factor of 10.
This happens with gfortran 9.3.0 as well as 10.2.0, so I definitely must be doing something wrong, but I can't see what it is. Can anybody spot what I'm doing wrong?
The P control edit descriptor "temporarily changes" (Fortran 2018 13.8.5) the scale factor connection mode of the connection.
However, what is meant by temporary is until the mode is changed again or until the end of the data transfer statement: (Fortran 2018 12.5.2)
Edit descriptors take effect when they are encountered in format processing. When a data transfer statement terminates, the values for the modes are reset to the values in effect immediately before the data transfer statement was executed.
In the case of the question, both output values are thus processed with the scale factor having value 1.
This scale factor is responsible for the "wrong" second value: there is a difference in interpretation of the scale factor for E and F editing. For E editing the scale factor simply changes the representation, with the external and internal values the same (with the significand scaled up by 10 and the exponent reduced by 1), but for F editing the output value is itself scaled:
On output, with F output editing, the effect is that the externally represented number equals the internally represented number multiplied by 10^k.
So while 10000 would be represented by 0.1000E+05 with scale factor 0 and 1.0000E+04 with scale factor 1 under E12.4, under F12.4 the value 10000 is scaled to 100000 with the scale factor in place.
As a style note: although the comma is optional between 1P and E12.4 (and similar), many would regard it as much better to include the comma, precisely to avoid this apparent tight coupling of the two descriptors (or their looking like one descriptor). As the scale factor has a different effect for each of E and F, has no effect for EN, and sometimes but not always has an effect with G, I'm not going to argue with anyone who calls P evil.
You are looking for section 12.5.2 of the Fortran 2018 standard.
A connection for formatted input/output has several changeable modes: these are ... and scale factor (13.8.5).
Values for the modes of a connection are established when the connection is initiated. If the connection is initiated by an OPEN statement, the values are as specified, either explicitly or implicitly, by the OPEN statement. If the connection is initiated other than by an OPEN statement (that is, if the file is an internal file or pre-connected file) the values established are those that would be implied by an initial OPEN statement without the corresponding keywords.
The scale factor cannot be explicitly specified in an OPEN statement; it is implicitly 0.
The modes of a connection can be temporarily changed by ... or by an edit descriptor. ... Edit descriptors take effect when they are encountered in format processing. When a data transfer statement terminates, the values for the modes are reset to the values in effect immediately before the data transfer statement was executed.
So when you used 1P in your format, you changed the mode for the connection. This applies to all output items after the 1P has been processed. When the write statement completes the scale factor is reset to 0.

Strange behavior while calling properties from REFPROP FORTRAN files

I am trying to use REFPROP's HSFLSH subroutine to compute properties for steam.
When the same state point is calculated over multiple iterations
(fixed enthalpy and entropy: enthalpy = 50000 J/mol and entropy = 125 J/mol),
the time taken to compute using HSFLSH after every 4th/5th iteration increases to about 0.15 ms, against a negligible amount of time for the other iterations. This is becoming problematic because my program calls this subroutine several thousand times, leading to abnormally long program run times.
The program used to generate the above log is here:
C     refprop check
      program time_check
      parameter (ncmax=20)
      dimension x(ncmax)
      real hkj,skj
      character hrf*3, herr*255
      character*255 hf(ncmax),hfmix
C
C     SETUP FOR WATER
C
      nc=1                        !Number of components
      hf(1)='water.fld'           !Fluid name
      hfmix='hmx.bnc'             !Mixture file name
      hrf='DEF'                   !Reference state (DEF means default)
      call setup(nc,hf,hfmix,hrf,ierr,herr)
      if (ierr.ne.0) write (*,*) herr
      call INFO(1,wm,ttp,tnbp,tc,pc,dc,zc,acf,dip,rgas)
      write(*,*) 'Mol weight ', wm
      h = 50000.0
      s = 125.0
C
      DO I=1,NCMAX
         x(I) = 0
      END DO
C     ******************************************************
C     THIS IS THE ACTUAL CALL PLACE
C     ******************************************************
      do I=1,100
         call cpu_time(tstrt)
         CALL HSFLSH(h,s,x,T_TEMP,P_TEMP,RHO_TEMP,dl,dv,xliq,xvap,
     &               WET_TEMP,e,
     &               cv,cp,VS_TEMP,ierr,herr)
         call cpu_time(tstop)
         write(*,*) I,' time taken to run hsflsh routine= ',tstop-tstrt
      end do
      stop
      end
(of course you will need the FORTRAN FILES, which unfortunately I cannot share since REFPROP isn't open source)
Can someone help me figure out why this is happening?
P.S.: The above code was compiled using gfortran -fdefault-real-8.
UPDATE
I tried using system_clock to time my computations, as suggested by @Ross below. The results are uniform across the loop (image below). I will have to find alternative ways to improve the computation speed, I guess. (Sigh!)
I don't have a concrete answer, but this sort of behaviour looks like what I would expect if all calls really took around 3 ms, but your call to CPU_TIME doesn't register anything below around 15 ms. Do you see any output with time taken less than, say 10 ms? Of particular interest to me is the approximately even spacing between calls that return nonzero time - it's about even at 5.
CPU timing can be a tricky business. I recommended in a comment that you try system_clock, which can be higher precision than CPU_TIME. You said it doesn't work, but I'm unconvinced. Did you pass a long integer to system_clock? What was the count_rate for your system? Were all the times still either 15 or 0 ms?
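For what it's worth, the granularity effect described above is easy to reproduce outside Fortran; here is a small C++ sketch (my own illustration, not REFPROP-specific) that measures how coarsely the CPU-time clock ticks compared to a monotonic clock. On platforms where the CPU-time clock only advances every ~10-16 ms, most fast calls report zero and every few iterations a whole tick shows up, which is exactly the pattern in the question:
#include <chrono>
#include <cstdio>
#include <ctime>

int main() {
    // Spin until the CPU-time clock changes twice; the difference between the
    // two observed changes approximates its tick size.
    std::clock_t t0 = std::clock(), t1, t2;
    do { t1 = std::clock(); } while (t1 == t0);
    do { t2 = std::clock(); } while (t2 == t1);
    std::printf("CPU-time clock tick: %.3f ms\n",
                1000.0 * double(t2 - t1) / CLOCKS_PER_SEC);

    // A monotonic clock (the analogue of Fortran's SYSTEM_CLOCK with a large
    // count rate) usually ticks far more finely.
    auto s0 = std::chrono::steady_clock::now();
    auto s1 = s0;
    while (s1 == s0) s1 = std::chrono::steady_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(s1 - s0).count();
    std::printf("steady_clock tick: about %lld ns\n", (long long)ns);
    return 0;
}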

Fast percentile in C++ - speed more important than precision

This is a follow-up to Fast percentile in C++
I have a sorted array of 365 daily cashflows (xDailyCashflowsDistro) which I randomly sample 365 times to get a generated yearly cashflow. Generating is carried out by
1/ picking a random probability in the [0,1] interval
2/ converting this probability to an index in the [0,364] interval
3/ determining which daily cashflow corresponds to this probability, using the index and some linear approximation,
and summing the 365 generated daily cashflows. Following the previously mentioned thread, my code precalculates the differences of the sorted daily cashflows (xDailyCashflowDiffs), where
xDailyCashflowDiffs[i] = xDailyCashflowsDistro[i+1] - xDailyCashflowsDistro[i]
and thus the whole code looks like
double _dIdxConverter = ((double)(365 - 1)) / (double)(RAND_MAX - 1);
for ( unsigned int xIdx = 0; xIdx < _xCount; xIdx++ )
{
    double generatedVal = 0.0;
    for ( unsigned int xDayIdx = 0; xDayIdx < 365; xDayIdx++ )
    {
        double dIdx = (double)fastRand() * _dIdxConverter;
        long iIdx1 = (unsigned long)dIdx;
        double dFloor = (double)iIdx1;
        generatedVal += xDailyCashflowsDistro[iIdx1] + xDailyCashflowDiffs[iIdx1] * (dIdx - dFloor);
    }
    results.push_back(generatedVal);
}
_xCount (the number of simulations) is 1K+, usually 10K.
The problem:
This simulation is currently being carried out 15M times (compared to 100K when the first thread was written), and it takes ~10 minutes on a 3.4 GHz machine. Due to the nature of the problem, this 15M is unlikely to be significantly lowered in the future, only increased. Having used VTune Analyzer, I am told that the second-to-last line (generatedVal += ...) generates 80% of the runtime. My question is why, and how I can work with that.
Things I have tried:
1/ getting rid of the (dIdx - dFloor) part, to see whether the double subtraction and multiplication are the main culprit - runtime dropped by a couple of percent
2/ declaring xDailyCashflowsDistro and xDailyCashflowDiffs as __restrict, so as to prevent the compiler from thinking they are dependent on each other - no change
3/ using 16 days (as opposed to 365), to see whether cache misses are dragging my performance down - not even a slight change
4/ using floats as opposed to doubles - no change
5/ compiling with different /fp: options - no change
6/ compiling as x64 - this has an effect on the double <-> ulong conversions, but the line in question is unaffected
What I am willing to sacrifice is resolution - I do not care whether the generatedVal is 100010.1 or 100020.0 at the end if the speed gain is substantial.
EDIT:
The daily/yearly cashflows are related to the whole portfolio. I could divide all daily cashflows by the portfolio size and would thus (at a 99.99% confidence level) ensure that daily cashflows / portfolio size do not fall outside the [-1000, +1000] interval. In this case, though, I would need precision to the hundredths.
Perhaps you could turn your piecewise linear function into a piecewise-linear "histogram" of its values. The number you're sampling appears to be the sum of 365 samples from that histogram. What you're doing is a not-particularly-fast way to sample from the sum of 365 samples from that histogram.
You might try computing a Fourier (or wavelet or similar) transform, keeping only the first few terms, raising it to the 365th power, and computing the inverse transform. You won't get a probability distribution in the end, but there shouldn't be "too much" mass below 0 or above 1 and the total mass shouldn't be "too different" from 1 with this technique. (I don't know what your data looks like; this technique may well be unworkable for good mathematical reasons.)

bitwise logical operators in COBOL?

How can one express bitwise logical operations in mainframe COBOL?
I have:
01  WRITE-CONTROL-CHAR.
    03  WCC-NOP            PIC X VALUE X'01'.
    03  WCC-RESET          PIC X VALUE X'02'.
    03  WCC-PRINTER1       PIC X VALUE X'04'.
    03  WCC-PRINTER2       PIC X VALUE X'08'.
    03  WCC-START-PRINTER  PIC X VALUE X'10'.
    03  WCC-SOUND-ALARM    PIC X VALUE X'20'.
    03  WCC-KEYBD-RESTORE  PIC X VALUE X'40'.
    03  WCC-RESET-MDT      PIC X VALUE X'80'.
In Micro Focus COBOL, I could do something like:
WCC-NOP B-AND WCC-RESET
but it seems there's no such operator on the mainframe (or at least not in Enterprise COBOL for z/OS).
Is there some (hopefully straightforward!) way to simulate/replicate bitwise logical operations in mainframe COBOL?
Your best bet appears to be CEESITST, as it appears to exist in z/OS COBOL. I found an example using it as well as the other bit-manipulation routines.
http://esj.com/articles/2000/09/01/callable-service-manipulating-binary-data-at-the-bit-level.aspx
If you AND together values that don't share bits, you will always get zero, so I'm going to assume you meant OR. This makes sense since you tend to OR independent bits to construct a multi-bit value.
With that in mind, when the bit masks are independent of each other, as in a single term doesn't interact with other terms, there is no difference between:
termA OR termB
and:
termA + termB
Your terms are all independent here, being x'1', x'2' and so forth (no x'03' or x'ff') so adding them should work fine:
COMPUTE TARGET = WCC-NOP + WCC-RESET
Now that's good for setting bits starting with nothing, but not so useful for clearing them. However, you can use a similar trick for clearing:
COMPUTE TARGET = 255 - WCC-NOP - WCC-RESET
Setting or clearing them from an arbitrary starting point (regardless of their current state) is a little trickier and can't be done easily with addition and subtraction.
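As a quick sanity check of that arithmetic (written in C++ only because it has the bitwise operators to compare against, and assuming the same X'01'/X'02' masks as above):
#include <cassert>
#include <cstdint>

int main() {
    const std::uint8_t WCC_NOP = 0x01, WCC_RESET = 0x02;

    // When two masks share no bits, OR and addition give the same result.
    std::uint8_t target = WCC_NOP + WCC_RESET;
    assert(target == (WCC_NOP | WCC_RESET));

    // Clearing a bit that is known to be set is the same as subtracting it.
    target = target - WCC_NOP;
    assert(target == WCC_RESET);
    assert(target == ((WCC_NOP + WCC_RESET) & ~WCC_NOP & 0xFF));

    // Caution: if the bit might already be set (or already clear), adding or
    // subtracting it again would carry into a neighbouring bit instead.
    return 0;
}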