Denormalized floating point in Objective-C?

What is the relevance of the Stack Overflow question "Why does changing 0.1f to 0 slow down performance by 10x?" for Objective-C? If there is any relevance, how should this change my coding habits? Is there some way to turn off denormalized floats on Mac OS X?
It seems like this is completely irrelevant to iOS. Is that correct?

As I said in response to your comment there:
it is more of a CPU than a language issue, so it probably has
relevance for Objective-C on x86. (iPhone's ARMv7 doesn't seem to support
denormalized floats, at least with the default runtime/build settings)
Update
I just tested. On Mac OS X on x86 the slowdown is observed, on iOS on ARMv7 it is not (default build settings).
As expected, when running on the iOS simulator (on x86), denormalized floats appear again.
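On the question of switching denormals off on an x86 Mac: SSE has a flush-to-zero (FTZ) bit in its MXCSR control register, and setting it makes would-be denormal results come out as zero. The following is only a minimal sketch under stated assumptions: it assumes an x86-64 build where float arithmetic goes through SSE, the function name probeFlushToZero is made up for this illustration, and it returns -1 on any other target instead of guessing. It restores the previous MXCSR value before returning.

```cpp
#include <cfloat>
#if defined(__x86_64__)
#include <xmmintrin.h>  // _mm_getcsr, _mm_setcsr
#endif

// Hypothetical helper (not from the original answer).
// Returns 1 if a would-be denormal result is flushed to zero once FTZ is set,
// 0 if it is not, and -1 on targets this sketch does not cover.
int probeFlushToZero() {
#if defined(__x86_64__)
    const unsigned int kFlushToZero = 0x8000;  // MXCSR bit 15 (FTZ)
    const unsigned int saved = _mm_getcsr();   // remember the current mode
    _mm_setcsr(saved | kFlushToZero);          // enable flush-to-zero

    volatile float smallest = FLT_MIN;         // volatile: force run-time math
    volatile float shrunk = smallest * 0.3f;   // below FLT_MIN, so denormal range
    const int flushed = (shrunk == 0.0f) ? 1 : 0;

    _mm_setcsr(saved);                         // restore the previous mode
    return flushed;
#else
    return -1;
#endif
}
```

MXCSR is per-thread state, so a real program would have to set the bit on every thread that does the floating-point work. Inputs that are already denormal are governed by a separate bit (DAZ, bit 6), which this sketch leaves alone.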
Interestingly, FLT_MIN and DBL_MIN are defined as the smallest normalized (non-denormalized) float and double, respectively (on iOS, Mac OS X, and Linux). Strange things happen using
DBL_MIN/2.0
in your code; the compiler happily emits a denormalized constant, but as soon as the ARM CPU touches it, it is set to zero:
double test = DBL_MIN/2.0;
printf("test == 0.0 %d\n",test==0.0);
printf("DBL_MIN/2 == 0.0 %d\n",DBL_MIN/2.0==0.0);
Outputs:
test == 0.0 1 // computer says YES
DBL_MIN/2 == 0.0 0 // compiler says NO
So a quick runtime check for whether denormalized numbers are supported could be:
#define SUPPORT_DENORMALIZATION ({volatile double t=DBL_MIN/2.0;t!=0.0;})
("given without even the implied warranty of fitness for any purpose")
This is what ARM has to say on flush to zero mode: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204h/Bcfheche.html
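The SUPPORT_DENORMALIZATION macro above relies on a GCC/Clang statement expression. As a rough portable alternative, the same probe can be written as an ordinary function that asks std::fpclassify whether the halved value is still a subnormal. The function name is made up for this sketch, and the result reflects only the floating-point mode of the calling thread at that moment.

```cpp
#include <cfloat>
#include <cmath>

// Hypothetical helper: true if DBL_MIN / 2 computed at run time is still
// a subnormal number, false if the hardware or runtime flushed it to zero.
bool subnormalsSupported() {
    volatile double smallest = DBL_MIN;    // volatile: keep the division at run time
    volatile double half = smallest / 2.0; // lands in the subnormal range
    const double value = half;             // plain copy for classification
    return std::fpclassify(value) == FP_SUBNORMAL;
}
```

On a default x86 build this would be expected to return true; with flush-to-zero active (as observed on ARMv7 above) it should return false.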
Update<<1
This is how you disable flush to zero mode on ARMv7:
int x;
asm(
"vmrs %[result],FPSCR \r\n"
"bic %[result],%[result],#16777216 \r\n"
"vmsr FPSCR,%[result]"
:[result] "=r" (x) : :
);
printf("ARM FPSCR: %08x\n",x);
with the following surprising result.
Column 1: a float, divided by 2 for every iteration
Column 2: the binary representation of this float
Column 3: the time taken to sum this float 1e7 times
You can clearly see that the denormalization comes at zero cost. (For an iPad 2. On iPhone 4, it comes at a small cost of a 10% slowdown.)
0.000000000000000000000000000000000100000004670110: 10111100001101110010000011100000 110 ms
0.000000000000000000000000000000000050000002335055: 10111100001101110010000101100000 110 ms
0.000000000000000000000000000000000025000001167528: 10111100001101110010000001100000 110 ms
0.000000000000000000000000000000000012500000583764: 10111100001101110010000110100000 110 ms
0.000000000000000000000000000000000006250000291882: 10111100001101110010000010100000 111 ms
0.000000000000000000000000000000000003125000145941: 10111100001101110010000100100000 110 ms
0.000000000000000000000000000000000001562500072970: 10111100001101110010000000100000 110 ms
0.000000000000000000000000000000000000781250036485: 10111100001101110010000111000000 110 ms
0.000000000000000000000000000000000000390625018243: 10111100001101110010000011000000 110 ms
0.000000000000000000000000000000000000195312509121: 10111100001101110010000101000000 110 ms
0.000000000000000000000000000000000000097656254561: 10111100001101110010000001000000 110 ms
0.000000000000000000000000000000000000048828127280: 10111100001101110010000110000000 110 ms
0.000000000000000000000000000000000000024414063640: 10111100001101110010000010000000 110 ms
0.000000000000000000000000000000000000012207031820: 10111100001101110010000100000000 111 ms
0.000000000000000000000000000000000000006103515209: 01111000011011100100001000000000 110 ms
0.000000000000000000000000000000000000003051757605: 11110000110111001000010000000000 110 ms
0.000000000000000000000000000000000000001525879503: 00010001101110010000100000000000 110 ms
0.000000000000000000000000000000000000000762939751: 00100011011100100001000000000000 110 ms
0.000000000000000000000000000000000000000381469876: 01000110111001000010000000000000 112 ms
0.000000000000000000000000000000000000000190734938: 10001101110010000100000000000000 110 ms
0.000000000000000000000000000000000000000095366768: 00011011100100001000000000000000 110 ms
0.000000000000000000000000000000000000000047683384: 00110111001000010000000000000000 110 ms
0.000000000000000000000000000000000000000023841692: 01101110010000100000000000000000 111 ms
0.000000000000000000000000000000000000000011920846: 11011100100001000000000000000000 110 ms
0.000000000000000000000000000000000000000005961124: 01111001000010000000000000000000 110 ms
0.000000000000000000000000000000000000000002980562: 11110010000100000000000000000000 110 ms
0.000000000000000000000000000000000000000001490982: 00010100001000000000000000000000 110 ms
0.000000000000000000000000000000000000000000745491: 00101000010000000000000000000000 110 ms
0.000000000000000000000000000000000000000000372745: 01010000100000000000000000000000 110 ms
0.000000000000000000000000000000000000000000186373: 10100001000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000092486: 01000010000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000046243: 10000100000000000000000000000000 111 ms
0.000000000000000000000000000000000000000000022421: 00001000000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000011210: 00010000000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000005605: 00100000000000000000000000000000 111 ms
0.000000000000000000000000000000000000000000002803: 01000000000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000001401: 10000000000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 110 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 110 ms
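For anyone who wants to repeat the measurement on other hardware, here is a rough C++ sketch of the kind of loop described above: print the bit pattern of a float and time how long it takes to add it up many times. It is not the code that produced the table; the function names are invented for this sketch, and bitsOf prints the sign bit first, which need not match the bit order used in the listing.

```cpp
#include <bitset>
#include <chrono>
#include <cstdint>
#include <cstring>
#include <string>

static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");

// Bit pattern of a float as text: sign bit first, then exponent, then mantissa.
std::string bitsOf(float f) {
    std::uint32_t raw;
    std::memcpy(&raw, &f, sizeof raw);  // well-defined way to read the bits
    return std::bitset<32>(raw).to_string();
}

// Milliseconds needed to add `value` to an accumulator `iterations` times.
double timeSumMs(float value, long iterations) {
    volatile float input = value;       // volatile: stop the loop being folded away
    float sum = 0.0f;
    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        sum += input;
    }
    const auto stop = std::chrono::steady_clock::now();
    volatile float sink = sum;          // keep the result observable
    (void)sink;
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Starting from a small normalized float, halving it on every pass and calling timeSumMs(f, 10000000) each time would give rows comparable to the ones above. A slowdown, if any, should show up once the values drop below FLT_MIN.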

Related

How to read data into an array

I have FORTRAN 77 code from an engineering textbook that I would like to make use of. The problem is that I do not understand how to input the data into the arrays FDAM1(61), FDAM2(61), FPOW1(61), FPOW2(61), UDAM(61) and UPOW(61).
For your reference the code has been taken from Page 49 of this book: https://books.google.pt/books?id=i2hyniQpecYC&lpg=PR6&dq=optimal%20design%20siddall&pg=PA49#v=onepage&q=optimal%20design%20siddall&f=false
C PROGRAM TST (INPUT,OUTPUT,TAPE5=INPUT,TAPE6=OUTPUT)
C
C PROGRAM TO ESTIMATE MAXIMUM EXPECTED VALUE FOR ALTERNATE DESIGNS
C
C FDENS(I)= ARRAYS FOR DATA DEFINING DENSITY FUNCTIONS
C FDAM1(I)= ARRAY DEFINING DENSITY FUNCTION FOR DAMAGE IN DESIGN 1
C FDAM2(I)= ARRAY DEFINING DENSITY FUNCTION FOR DAMAGE IN DESIGN 2
C FPOW1(I)= ARRAY DEFINING DENSITY FUNCTION FOR POWER IN DESIGN 1
C FPOW2(I)= ARRAY DEFINING DENSITY FUNCTION FOR POWER IN DESIGN 2
C UDAM(I)= VALUE CURVE FOR DAMAGE
C UPOW(I)= VALUE CURVE FOR POWER
C
DIMENSION FDENS(61),FDAM1(61),FDAM2(61),FPOW1(61),FPOW2(61),
1UDAM(61),UPOW(61),FUNC(61)
C
C NORMALIZE DENSITY FUNCTIONS
C
DO 1 I=1,4
READ(5,10)(FDENS(J),J=1,61)
READ(5,11)RANGE
AREA=FSIMP(FDENS,RANGE,61)
DO 2 J=1,61
GO TO(3,4,5,6)I
3 FDAM1(J)=FDENS(J)/AREA
GO TO 2
4 FDAM2(J)=FDENS(J)/AREA
GO TO 2
5 FPOW1(J)=FDENS(J)/AREA
GO TO 2
6 FPOW2(J)=FDENS(J)/AREA
2 CONTINUE
1 CONTINUE
C
C DETERMINE EXPECTED VALUES
C
READ(5,10)(UDAM(J),J=1,61)
READ(5,10)(UPOW(J),J=1,61)
DO 20 I=1,6
GO TO (30,31,32,33,34,35)I
30 DO 40 J=1,61
40 FUNC(J)=FDAM1(J)*UDAM(J)
RANGE=12.
E1=FSIMP(FUNC,RANGE,61)
GO TO 20
31 DO 41 J=1,61
41 FUNC(J)=FDAM2(J)*UDAM(J)
C
RANGE=12.
E2=FSIMP(FUNC,RANGE,61)
GO TO 20
32 DO 42 J=1,61
RANGE=60.
42 FUNC(J)=FPOW1(J)*UPOW(J)
E3=FSIMP(FUNC,RANGE,61)
33 DO 43 J=1,61
43 FUNC(J)=FPOW2(J)*UPOW(J)
RANGE=60.
E4=FSIMP(FUNC,RANGE,61)
GO TO 20
34 E5=8.17
GO TO 20
35 E6=2.20
20 CONTINUE
DES1=E1+E3+E5
DES2=E2+E4+E6
C
C OUTPUT
C
WRITE(6,100)
100 FORMAT(/,1H ,15X,24HEXPECTED VALUES OF VALUE,//)
WRITE(6,101)
101 FORMAT(/,1H ,12X,6HDAMAGE,7X,5HPOWER,9X,5HPARTS,8X,5HTOTAL,//)
WRITE(6,102)E1,E3,E5,DES1
102 FORMAT(/,1H ,8HDESIGN 1,4X,F5.3,8X,F5.3,9X,F5.3,8X,F6.3)
WRITE(6,103)E2,E4,E6,DES2
103 FORMAT(/,1H ,8HDESIGN 2,4X,F5.3,8X,F5.3,9X,F5.3,8X,F6.3)
10 FORMAT(16F5.2)
11 FORMAT(F5.0)
STOP
END
SUBROUTINE FSIMP
FUNCTION FSIMP(FUNC,RANGE,MINT)
C.... CALCULATES INTEGRAL BY SIMPSONS RULE WITH
C MODIFICATION IF MINT IS EVEN
C.... INPUT
C FUNC = ARRAY OF EQUALLY SPACED VALUES OF FUNCTION
C DIMENSION MINT
C RANGE = RANGE OF INTEGRATION
C MINT = NUMBER OF STATIONS
C.... OUTPUT
C FSIMP = AREA
DIMENSION FUNC(1)
C.... CHECK MINT FOR ODD OR EVEN
XX=RANGE/(3.*FLOAT(MINT-1))
M=MINT/2*2
IF(M.EQ.MINT) GO TO 3
C.... ODD
AREA=FUNC(1)+FUNC(MINT)
MM=MINT-1
DO 1 I=2,MM,2
1 AREA=AREA+4.*FUNC(I)
MM=MM-1
DO 2 I=3,MM,2
2 AREA=AREA+2.*FUNC(I)
FSIMP=XX*AREA
RETURN
C.... EVEN
C.... USE SIMPSONS RULE FOR ALL BUT THE LAST 3 INTERVALS
3 M=MINT-3
AREA=FUNC(1)+FUNC(M)
MM=M-1
DO 4 I=2,MM,2
4 AREA=AREA+4.*FUNC(I)
MM=MM-1
DO 5 I=3,MM,2
5 AREA=AREA+2.*FUNC(I)
FSIMP=XX*AREA
C.... USE NEWTONS 3/8 RULE FOR LAST THREE INTERVALS
FSIMP=FSIMP+9./8.*XX*(FUNC(MINT-3)+3.*(FUNC(MINT-2)+FUNC(MINT-1))
1 +FUNC(MINT))
RETURN
END

Here is a minimal example to help you get started:
C Minimal working example of creaky old FORTRAN I/O
PROGRAM ABYSS
IMPLICIT NONE
C
REAL FDENS(61)
REAL XRANGE
INTEGER J
C
10 FORMAT(16F5.2)
11 FORMAT(F5.0)
909 FORMAT(/, 'BEHOLD! A DENSITY DISTRIBUTION',/)
910 FORMAT(10(F5.2, 3X),/)
911 FORMAT(/, 'XRANGE is ', F6.1)
C
CONTINUE
C
READ(5,10) (FDENS(J), J=1,61)
READ(5,11) XRANGE
C
WRITE(6,909)
WRITE(6,910) (FDENS(J), J=1,61)
WRITE(6,911) XRANGE
C
STOP
END
Apologies for writing this in F77; I'm sticking with the style of the code posted above for the sake of this example. Ideally, you'd use F03 or F08 for new code, or a completely different language which actually has decent I/O features and a rich standard library. But I digress.
This code will operate on the data (be careful to preserve the spaces):
                                0.1  0.3  0.5  0.9 1.30 1.90 2.50 3.20 3.80 4.20
 4.70  5.0  5.1  5.2  5.2  5.1  4.9  4.7  4.6  4.4  4.2  3.9  3.8  3.6  3.4  3.2
  3.0  2.9  2.7  2.5  2.4  2.2  2.1  1.9  1.8  1.6  1.5  1.4  1.2  1.1  1.0  0.9
  0.8  0.7  0.6  0.5  0.4  0.3  0.3  0.2  0.1  0.1
12.
to produce
BEHOLD! A DENSITY DISTRIBUTION
0.00 0.00 0.00 0.00 0.00 0.00 0.10 0.30 0.50 0.90
1.30 1.90 2.50 3.20 3.80 4.20 4.70 5.00 5.10 5.20
5.20 5.10 4.90 4.70 4.60 4.40 4.20 3.90 3.80 3.60
3.40 3.20 3.00 2.90 2.70 2.50 2.40 2.20 2.10 1.90
1.80 1.60 1.50 1.40 1.20 1.10 1.00 0.90 0.80 0.70
0.60 0.50 0.40 0.30 0.30 0.20 0.10 0.10 0.00 0.00
0.00
XRANGE is 12.0
If the code is in abyss.f and the input data is in abyss.dat, you should be able to build the code with
gfortran -g -Wall -Og -o abyss abyss.f
and generate similar results by running
./abyss < abyss.dat > abyss.out
A key point to note is that the original code is reading from unit 5 (traditionally taken as stdin, now officially canonized in iso_fortran_env as INPUT_UNIT). In your own code, I'd suggest reading from a data file, so replace the literal 5 with whatever variable contains the unit number of the file you're reading from (hint: consider using the newunit argument to the open command introduced in Fortran 2008. It solves the perennially stupid Fortran problem of trying to find a free I/O unit number.) While you can use I/O redirection, it's suboptimal; it's used here to show how to work around the limitations of the original code.
Also, for the sake of later generations and your own sanity, please avoid taking advantage of Cold-War-era FORTRAN misfeatures such as this spaces-equal-zeroes nonsense. If your data is worth using, it's worth putting in a sensible format which can be easily parsed; columnar, space-delimited values are as good a choice as any. Fortran may actually get a standard library which can read and write CSV files sometime around 2156 (give or take a century) so you have plenty of time to design something decent...
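To make the "blank field reads as zero" behaviour of FORMAT(16F5.2) concrete, here is a rough sketch, in C++ rather than Fortran, of how such a record is cut into fixed-width fields. It is an illustration only, not how a Fortran runtime is implemented. The function name is invented, and it assumes every non-blank field contains an explicit decimal point (the implied two decimals that F5.2 applies to fields without a point are not modelled).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: cut `line` into `count` fields of `width` characters.
// An all-blank or missing field yields 0.0, mimicking the blank-as-zero input.
std::vector<double> parseFixedWidth(const std::string& line,
                                    std::size_t width,
                                    std::size_t count) {
    std::vector<double> values;
    values.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t start = i * width;
        const std::string field =
            start < line.size() ? line.substr(start, width) : std::string();
        if (field.find_first_not_of(' ') == std::string::npos) {
            values.push_back(0.0);               // blank field reads as zero
        } else {
            values.push_back(std::stod(field));  // stod skips leading blanks
        }
    }
    return values;
}
```

With width 5 and count 16, the first data line above (30 leading blanks, then ten values) would come back as six zeros followed by 0.1, 0.3 and so on, which is what the sample output shows.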

Valid range for QDateTime::fromMSecsSinceEpoch

I discovered a strange behavior in QDateTime of Qt 4.8 regarding fromMSecsSinceEpoch. The following code does not produce the result I would expect:
assert(
QDateTime::fromMSecsSinceEpoch(
std::numeric_limits<qint64>::max()
).isValid() == true
);
assert(
QDateTime::fromMSecsSinceEpoch(
std::numeric_limits<qint64>::max()
).toMSecsSinceEpoch() == std::numeric_limits<qint64>::max()
);
While the first assertion is true, the second fails. The returned result from Qt is -210866773624193.
The doc for QDateTime::fromMSecsSinceEpoch(qint64 msecs) clearly states:
There are possible values for msecs that lie outside the valid range of QDateTime, both negative and positive. The behavior of this function is undefined for those values.
However, there is no explicit statement about the valid range.
I found this Qt bug report about an issue regarding timezones in Qt 5.5.1, 5.6.0 and 5.7.0 Beta.
I am not sure whether this is a similar bug, or if the value I provided to QDateTime::fromMSecsSinceEpoch(qint64 msecs) is simply invalid.
What is (or rather should be) the maximum value that can be passed to this function and yields correct behavior?

std::numeric_limits<qint64>::max() ms yields 9 223 372 036 854 775 807 ms, or 9 223 372 036 854 775 s, or 2 562 047 788 015 hours, or 106 751 991 167 days, or 292 471 208 years: that's far beyond the year 11 million in the valid range of QDateTime.
From the doc, valid dates start from January 2nd, 4713 BCE, and go until QDate::toJulianDay() overflows: 2^31 days (the limit of a signed 32-bit integer) is roughly 5 900 000 years. That's 185 542 587 187 200 000 ms (counted from January 2nd, 4713 BCE, not from the Epoch), a "little" more than 2^57.
EDIT:
After discussion in the comments, you checked Qt4.8 sources and found that fromMSecsSinceEpoch() uses QDate(1970, 1, 1).addDays(ddays) internally, where the number of days is calculated directly from the msecs parameter.
Since ddays is of type int here, this overflows as soon as the day count exceeds 2^31 - 1.
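The arithmetic behind that overflow can be checked without Qt. The sketch below is not Qt's code: it only models "divide the milliseconds by the length of a day and store the result in an int", assumes a 32-bit int, and uses made-up function names. QDate's own limits may well be tighter than what this computes.

```cpp
#include <cstdint>
#include <limits>

constexpr std::int64_t kMsecsPerDay = 86400000;  // 24 * 60 * 60 * 1000

// Whole days contained in a millisecond count.
std::int64_t wholeDays(std::int64_t msecs) {
    return msecs / kMsecsPerDay;
}

// True if that day count can be stored in an int without overflowing.
bool dayCountFitsInInt(std::int64_t msecs) {
    const std::int64_t days = wholeDays(msecs);
    return days >= std::numeric_limits<int>::min() &&
           days <= std::numeric_limits<int>::max();
}

// Largest millisecond value whose day count still fits in an int.
std::int64_t largestMsecsWithIntDays() {
    const std::int64_t maxDays = std::numeric_limits<int>::max();
    return (maxDays + 1) * kMsecsPerDay - 1;
}
```

For std::numeric_limits<qint64>::max() the day count is 106 751 991 167, far outside the int range. Under these assumptions the last value that still fits is 185 542 587 187 199 999 ms.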

gperftools cpuProfiler cannot see the symbols when profiling a file created by an ARM device

I want to profile a C++ application that runs on an ARM device.
I ran my app and profiled it using ProfilerStart("googleProfBL.prof"), so the file is generated.
When I open the file from the ARM device on my local computer, I get this:
./pprof --text --add_lib=libraryIwanttoDebug.so BinaryThatLoadsThatLibrary googleProfBL.prof
Using local file /home/genius/PresControler/src-build-target/deploy/NavStartup.
Using local file ../traces/googleProfBL.prof.
Warning: address ffffffffffffffff is longer than address length 8
Warning: address ffffffffffffffff is longer than address length 8
Hexadecimal number > 0xffffffff non-portable at ./pprof line 4475.
Hexadecimal number > 0xffffffff non-portable at ./pprof line 4475.
Total: 5347 samples
258 4.8% 4.8% 258 4.8% 0x76d4c276
144 2.7% 7.5% 144 2.7% 0x76da2cc4
126 2.4% 9.9% 126 2.4% 0x5d0f8284
114 2.1% 12.0% 114 2.1% 0x76d27386
64 1.2% 13.2% 64 1.2% 0x76dba2dc
53 1.0% 14.2% 53 1.0% 0x76dba1f4
...
The .so library is compiled in debug mode (it is not stripped), so I do not know why I am not getting the symbols.
I tried this:
./pprof --text --add_lib=aFileOfTheLibrary.o BinaryThatLoadsThatLibrary googleProfBL.prof
Looks like I got a couple of symbols.
Using local file /home/genius/PresControler/src-build-target/deploy/NavStartup.
Using local file ../traces/googleProfBL.prof.
Warning: address ffffffffffffffff is longer than address length 8
Warning: address ffffffffffffffff is longer than address length 8
Hexadecimal number > 0xffffffff non-portable at ./pprof line 4475.
Hexadecimal number > 0xffffffff non-portable at ./pprof line 4475.
Total: 5347 samples
258 4.8% 4.8% 258 4.8% 0x76d4c276
144 2.7% 7.5% 144 2.7% 0x76da2cc4
126 2.4% 9.9% 126 2.4% 0x5d0f8284
114 2.1% 12.0% 114 2.1% 0x76d27386
64 1.2% 13.2% 64 1.2% 0x76dba2dc
53 1.0% 14.2% 53 1.0% 0x76dba1f4
50 0.9% 15.1% 50 0.9% 0x76dbf1bc
34 0.6% 15.8% 34 0.6% 0x72eae1b4
30 0.6% 16.3% 30 0.6% 0x76d8a32a
30 0.6% 16.9% 30 0.6% 0x76d8e2c0
..
0 0.0% 100.0% 7 0.1% std::forward_as_tuple <- I couldn't see that before!!!
I tried --add_lib for every .o I have, but I do not get any more symbols. Why don't I get the symbols? Does it have anything to do with the fact that I am checking the results on an Intel machine while they were collected on an ARM device? How could I fix that?
Thank you!!!

I got much more information now!
I was leaving my application by pressing Ctrl+C, so the file somehow got corrupted.
I did a test calling ProfilerStop() before pressing Ctrl+C and it worked (also using --lib_prefix to point to where the .so files are).
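One way to make sure the profile is always closed properly is to catch Ctrl+C, leave the main loop normally and only then stop the profiler. The following is a sketch of that pattern, not code from my application: runUntilInterrupted and its parameters are invented names, and the stopProfiler callback stands in for the real ProfilerStop() call so that the example compiles without gperftools.

```cpp
#include <csignal>

// Set by the signal handler, polled by the main loop.
volatile std::sig_atomic_t g_interrupted = 0;

extern "C" void onInterrupt(int) {
    g_interrupted = 1;  // only set a flag; no real work inside the handler
}

// Runs `work` until Ctrl+C (SIGINT) arrives or `maxIterations` is reached,
// then calls `stopProfiler` (in a real program: ProfilerStop()).
// Returns the number of iterations that were executed.
template <class Work, class StopProfiler>
int runUntilInterrupted(Work work, StopProfiler stopProfiler, int maxIterations) {
    std::signal(SIGINT, onInterrupt);
    int iterations = 0;
    while (!g_interrupted && iterations < maxIterations) {
        work();
        ++iterations;
    }
    stopProfiler();  // the profile file gets flushed and closed here
    return iterations;
}
```

The point of the design is that the process exits through its normal path, so the profile file is complete before the program terminates.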
I still got these warnings:
Warning: address ffffffffffffffff is longer than address length 8
Warning: address ffffffffffffffff is longer than address length 8
Hexadecimal number > 0xffffffff non-portable at ./pprof line 4475.
Hexadecimal number > 0xffffffff non-portable at ./pprof line 4475.
If someone knows why I am getting them (I assume it is because I am analyzing a profile generated on another device), please let me know.

Hyper-threading Performance Comparison

I have written a project that uses some basic OpenSSL functions such as RAND_bytes and des_ecb_encrypt.
My computer has an i7-2600 (4 cores, 8 logical CPUs). When I run my project with 4 threads, it takes 10 seconds. When I run it with 8 threads, it also takes 10 seconds.
In other words, hyper-threading doesn't give me any performance improvement. On Linux, the result is the same.
I found here that hyper-threading doesn't bring an improvement in some situations. I also found some intuitive results here.
However, I have tried to write some simple tests, and to find simple examples, that show hyper-threading giving no apparent improvement. Sadly, I haven't found any.
So my question is: are there simple tests that show hyper-threading giving no performance improvement?

You may find that hyperthreading helps more on code that is using large amounts of memory, so that the processor is regularly blocked on fetching from memory.
In my experience, it's quite hard to find "simple code" that shows benefits from hyperthreading. It tends to be more complex examples that show the benefit. Still, the benefit will most likely not be 2x that of "no hyperthreading". Count on getting perhaps 20-30% improvement.

Hyper-threading takes advantage of the fact that the CPU has many components: without hyper-threading, when one of them is in use, the others just sit there idle. You can try writing two types of threads, one doing integer calculations (that will hopefully use the ALU) and one doing floating point arithmetic (that will hopefully use the FPU).
I did not try this myself but it seems that in such a scenario hyper threading should improve the performance.
To show the opposite you can use only one type of the threads (either threads only doing integer operations or threads only doing floating point operations).
It may also be that your test is flawed, but in order to know if that is the case we'll need more information about that test.
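A rough sketch of such an experiment, for what it's worth: half of the threads do integer work and the other half floating-point work when mixed is true, otherwise all of them do integer work. As said, I have not measured this myself. The function names are invented, nothing pins threads to particular logical CPUs, and the scheduler is free to place them anywhere, so treat any timing difference with caution.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Integer-only busy work (xorshift steps).
std::uint64_t integerWork(std::uint64_t iterations) {
    std::uint64_t x = 88172645463325252ULL;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
    }
    return x;
}

// Floating-point-only busy work (multiply and add).
double floatWork(std::uint64_t iterations) {
    double x = 1.0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        x = x * 0.999999 + 1.0;
    }
    return x;
}

// Wall-clock seconds for `threadCount` threads to finish their work.
// mixed == true: odd-numbered threads do floating-point work instead.
double runSeconds(unsigned threadCount, bool mixed, std::uint64_t iterations) {
    std::vector<std::thread> pool;
    const auto start = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < threadCount; ++t) {
        const bool useFloat = mixed && (t % 2 == 1);
        pool.emplace_back([useFloat, iterations] {
            if (useFloat) {
                volatile double sink = floatWork(iterations);
                (void)sink;
            } else {
                volatile std::uint64_t sink = integerWork(iterations);
                (void)sink;
            }
        });
    }
    for (std::thread& worker : pool) {
        worker.join();
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Comparing runSeconds(4, ...) with runSeconds(8, ...) on a 4-core/8-thread CPU, once with mixed set and once without, should show whether the second logical CPU per core buys anything for each kind of workload. The per-thread iteration count is fixed, so 8 threads do twice the total work of 4.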

I have written a project, which use some basic functions in openssl such as RAND_bytes and des_ecb_encrypt... My computer has i7-2600(4 cores and 8 logic CPU). When I run my project with 4 threads, it will costs 10 seconds. When I run it with 8 threads, it also costs 10 seconds.
When using RDRAND (which RAND_bytes will do in this case), the bus is the limiting factor. You should peak at around 800MB/sec. It does not matter how many threads you have - the bus cannot transfer data fast enough. See Intel rdrand instruction revisited.
If you used AES, then you might see a better speedup over the DES/3DES observations. Your Ivy Bridge has AES-NI and it can achieve almost 1.3 cycles/byte, and that should be about two to three times faster than AES in software. To ensure you are using the AES-NI instructions, you have to use the EVP_* interfaces.
I found here tells me that hyper-threading doesn't give me some improvement in some situations. Also, I found here give me some intuitive results.
I think @selalerer and @Mats Petersson answered your question. The problem does not scale linearly and there's a maximum speedup you will encounter. Intel states it's about 30%.
Intel's newest architecture favors out-of-order execution over hyper-threading because it's supposed to be more efficient. Read about the Silvermont processor cores.
But if you want a formal deep dive, then see a book on computer engineering. Here's the book we used when I studied it in college: Computer Organization and Design (it's probably a bit dated now).
However, I have tried to write some simple tests and found some simple examples which will show hyper-threading won't give me apparent improvement.
OpenSSL also has a benchmarking app. See the source code in <openssl source>/apps/speed.c.
Also, benchmarking apps have their own personalities. An encryption stress test may not reveal the differences as predominantly as you hope to see them. See, for example, Benchmarking Tools.

Following are details and results of my MP benchmarks for Linux and Windows, which can behave differently. There is not much HT coverage, but the Linux tests include Atom (1 core, 2 threads) and the Windows tests include Core i7 results (4+4).
http://www.roylongbottom.org.uk/linux%20multithreading%20benchmarks.htm
http://www.roylongbottom.org.uk/quad%20core%208%20thread.htm
Take your pick, depending on whether you want to prove that HT provides better or worse performance. Following are RandMem results on the i7 (Linux seems better using this test). For CPUs such as the i7, you also need to consider that Turbo Boost clock speeds might be lower with multiple threads.
CPUs/HTs           MBytes Per Second Using Threads      Gain At Threads
                       1      2      4      6      8     2     4     6     8
Serial RD
Core i7 4/8  L1    11458  22661  37039  43717  46374   2.0   3.2   3.8   4.0
930          L2    10380  20832  32853  41711  42839   2.0   3.2   4.0   4.1
#### MHz     L3     8828  17743  29610  38414  40330   2.0   3.4   4.4   4.6
Win 764      RAM    4266   8712  17347  24946  25589   2.0   4.1   5.8   6.0
Serial RW
Core i7 4/8  L1    15282  13724  16240  16209  18379   0.9   1.1   1.1   1.2
930          L2    12223  18216  25326  28104  27047   1.5   2.1   2.3   2.2
#### MHz     L3    10234  19266  21931  24450  26351   1.9   2.1   2.4   2.6
Win 764      RAM    4533   7656  13876  14543  13390   1.7   3.1   3.2   3.0
Random RD
Core i7 4/8  L1    11266  22548  38174  45592  47141   2.0   3.4   4.0   4.2
930          L2     6233  12463  20059  24986  25667   2.0   3.2   4.0   4.1
#### MHz     L3     3499   6915   9211  10002   9531   2.0   2.6   2.9   2.7
Win 764      RAM     459    909   1241   1398   1364   2.0   2.7   3.0   3.0
Random RW
Core i7 4/8  L1    14375   3027   2780   2901   3297   0.2   0.2   0.2   0.2
930          L2     5887   4555   6117   6693   7281   0.8   1.0   1.1   1.2
#### MHz     L3     3104   4604   4721   5047   4933   1.5   1.5   1.6   1.6
Win 764      RAM     428    860    899    948   1026   2.0   2.1   2.2   2.4
#### 2.8 GHz running at up to 3.06 GHz via Turbo Boost, dual channel 1066 MHz DDR3 RAM
Then there is the MP Whetstone benchmark, which shows real gains:
                          MWIPS  MFLOP  MFLOP  MFLOP    COS    EXP  FIXPT     IF  EQUAL
CPU                  MHz             1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS
Core i7 1 Thrd      ####   3115   1065    886    738   79.3   39.7   2447   2936   1154
Core i7 Win7        ####  21690   8676   7621   5844    531    291  16643  12027   5034
Quad Core Thread 1                1091   1027    728   66.4   36.5   2050   1501    629
Plus HT   Thread 2                1089   1037    742   66.0   36.5   2090   1507    630
          Thread 3                1090    946    742   66.8   36.5   2069   1534    631
          Thread 4                1092   1037    727   66.6   36.6   2031   1501    630
          Thread 5                1042    959    736   66.4   36.5   1912   1483    630
          Thread 6                1091    874    723   66.6   36.1   2049   1507    629
          Thread 7                1090    867    725   65.6   36.3   2094   1516    631
          Thread 8                1091    874    722   66.3   36.3   2350   1476    624
Gain %                      696    815    860    792    670    733    680    410    436

What exactly does C++ profiling (google cpu perf tools) measure?

I'm trying to get started with Google Perf Tools to profile some CPU-intensive applications. It's a statistical calculation that dumps each step to a file using ofstream. I'm not a C++ expert so I'm having trouble finding the bottleneck. My first pass gives these results:
Total: 857 samples
357 41.7% 41.7% 357 41.7% _write$UNIX2003
134 15.6% 57.3% 134 15.6% _exp$fenv_access_off
109 12.7% 70.0% 276 32.2% scythe::dnorm
103 12.0% 82.0% 103 12.0% _log$fenv_access_off
58 6.8% 88.8% 58 6.8% scythe::const_matrix_forward_iterator::operator*
37 4.3% 93.1% 37 4.3% scythe::matrix_forward_iterator::operator*
15 1.8% 94.9% 47 5.5% std::transform
13 1.5% 96.4% 486 56.7% SliceStep::DoStep
10 1.2% 97.5% 10 1.2% 0x0002726c
5 0.6% 98.1% 5 0.6% 0x000271c7
5 0.6% 98.7% 5 0.6% _write$NOCANCEL$UNIX2003
This is surprising, since all the real calculation occurs in SliceStep::DoStep. The _write$UNIX2003 symbol (where can I find out what this is?) appears to come from writing the output file. Now, what confuses me is that if I comment out all the outfile << "text" statements and run pprof, 95% is in SliceStep::DoStep and _write$UNIX2003 goes away. However, my application does not speed up, as measured by total time: the whole thing speeds up by less than 1 percent.
What am I missing?
Added:
The pprof output without the outfile << statements is:
Total: 790 samples
205 25.9% 25.9% 205 25.9% _exp$fenv_access_off
170 21.5% 47.5% 170 21.5% _log$fenv_access_off
162 20.5% 68.0% 437 55.3% scythe::dnorm
83 10.5% 78.5% 83 10.5% scythe::const_matrix_forward_iterator::operator*
70 8.9% 87.3% 70 8.9% scythe::matrix_forward_iterator::operator*
28 3.5% 90.9% 78 9.9% std::transform
26 3.3% 94.2% 26 3.3% 0x00027262
12 1.5% 95.7% 12 1.5% _write$NOCANCEL$UNIX2003
11 1.4% 97.1% 764 96.7% SliceStep::DoStep
9 1.1% 98.2% 9 1.1% 0x00027253
6 0.8% 99.0% 6 0.8% 0x000274a6
This looks like what I'd expect, except I see no visible increase in performance (.1 second on a 10 second calculation). The code is essentially:
ofstream outfile("out.txt");
for loop:
SliceStep::DoStep()
outfile << 'result'
outfile.close()
Update: I am timing using boost::timer, starting where the profiler starts and ending where it ends. I do not use threads or anything fancy.

From my comments:
The numbers you get from your profiler say that the program should be around 40% faster without the print statements.
The runtime, however, stays nearly the same.
Obviously one of the measurements must be wrong. That means you have to do more and better measurements.
First I suggest starting with another easy tool: the time command. This should give you a rough idea of where your time is spent.
If the results are still not conclusive, you need a better test case:
Use a larger problem
Do a warmup before measuring. Do some loops and start any measurement afterwards (in the same process).
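The warm-up idea can be sketched like this: run the workload a few times without timing it, then time a number of further runs in the same process and report the average. This is just an illustration with an invented function name; it assumes timedRuns is greater than zero and that work is whatever you want to measure (for example one DoStep plus its output).

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper: average wall-clock seconds per call of `work`,
// measured over `timedRuns` calls after `warmupRuns` untimed calls.
double measureSeconds(const std::function<void()>& work,
                      int warmupRuns,
                      int timedRuns) {
    for (int i = 0; i < warmupRuns; ++i) {
        work();  // warm caches, page in memory, open files, etc.
    }
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < timedRuns; ++i) {
        work();
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / timedRuns;
}
```

Measuring the loop once with and once without the output statements this way gives two numbers that can be compared directly with what the profiler claims.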
Tiristan: It's all in user. What I'm doing is pretty simple, I think... Does the fact that the file is open the whole time mean anything?
That means the profiler is wrong.
Printing 100000 lines to the console using python results in something like:
for i in xrange(100000):
print i
To console:
time python print.py
[...]
real 0m2.370s
user 0m0.156s
sys 0m0.232s
Versus:
time python print.py > /dev/null
real 0m0.133s
user 0m0.116s
sys 0m0.008s
My point is:
Your internal measurements and time show you do not gain anything from disabling output. Google Perf Tools says you should. Who's wrong?

_write$UNIX2003 is probably referring to the POSIX write system call, which writes to a file descriptor (here, most likely your output file). I/O is very slow compared to almost anything else, so it makes sense that your program is spending a lot of time there if you are writing a fair bit of output.
I'm not sure why your program wouldn't speed up when you remove the output, but I can't really make a guess on only the information you've given. It would be nice to see some of the code, or even the perftools output when the cout statement is removed.

Google perftools collects samples of the call stack, so what you need is to get some visibility into those.
According to the doc, you can display the call graph at statement or address granularity. That should tell you what you need to know.
