RocksDB write stall with many writes of the same data - C++

First of all, I have to explain our setup, as it is quite different from the typical database server with RAID groups and so on. Our customers usually buy a server like an HP DL380 Gen10 with two 300 GB HDDs (not SSDs), configured as RAID 1 and running Windows.
We only manage the metadata of other storage systems here, so that a client can query us and find its information on those large storages.
As our old database kept getting corrupted, we searched for a new, more stable database with less overhead and found RocksDB, currently at version 6.12.0.
Unfortunately, after running for some hours it seems to block my program for many minutes due to a write stall:
2020/10/01-15:58:44.678646 1a64 [WARN] [db\column_family.cc:876] [default]
Stopping writes because we have 2 immutable memtables (waiting for flush),
max_write_buffer_number is set to 2
Am I right that the write workload is too much for the machine?
Our software is a service which receives at least one update per second from each of up to 2000 different servers (this limit might be increased in the future if possible). Most of the time the same database entries are just updated/written again, as one of the fields inside each entry is the current time of the respective server. Of course I could try to write the data to disk less often, but then, if a client requested this data from us, our information would not be up to date.
So my questions are:
1. I assume that currently every write request is really written to disk. Is there a way to enable some kind of cache (or to increase its size, if one already exists but is too small) so that the data is written to the HDD less often, while read requests still return the correct data from memory?
2. I also see that there is a merge operator, but I'm not sure when this merge would take place. Is there already a cache like the one mentioned under 1., where the data is collected for some time, then merged and then written to the HDD?
3. Are there any other optimizations which could help me in this situation?
Any help is welcome. Thanks in advance.
Here are some more log lines, in case they are of interest:
** File Read Latency Histogram By Level [default] **
** Compaction Stats [default] **
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
L0 2/0 17.12 MB 0.5 0.0 0.0 0.0 0.4 0.4 0.0 1.0 0.0 81.5 5.15 0.00 52 0.099 0 0
L1 3/0 192.76 MB 0.8 3.3 0.4 2.9 3.2 0.3 0.0 8.0 333.3 327.6 10.11 0.00 13 0.778 4733K 119K
L2 20/0 1.02 GB 0.4 1.6 0.4 1.2 1.2 -0.0 0.0 3.1 387.5 290.0 4.30 0.00 7 0.614 2331K 581K
Sum 25/0 1.22 GB 0.0 4.9 0.8 4.1 4.9 0.7 0.0 11.8 257.4 254.5 19.56 0.00 72 0.272 7064K 700K
Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0
** Compaction Stats [default] **
Priority Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
Low 0/0 0.00 KB 0.0 4.9 0.8 4.1 4.5 0.3 0.0 0.0 349.5 316.4 14.40 0.00 20 0.720 7064K 700K
High 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.4 0.4 0.0 0.0 0.0 81.7 5.12 0.00 51 0.100 0 0
User 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 50.9 0.03 0.00 1 0.030 0 0
Uptime(secs): 16170.9 total, 0.0 interval
Flush(GB): cumulative 0.410, interval 0.000
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 4.86 GB write, 0.31 MB/s write, 4.92 GB read, 0.31 MB/s read, 19.6 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
** File Read Latency Histogram By Level [default] **
2020/10/01-15:53:21.248110 1a64 [db\db_impl\db_impl_write.cc:1701] [default] New memtable created with log file: #10465. Immutable memtables: 0.
2020/10/01-15:58:44.678596 1a64 [db\db_impl\db_impl_write.cc:1701] [default] New memtable created with log file: #10466. Immutable memtables: 1.
2020/10/01-15:58:44.678646 1a64 [WARN] [db\column_family.cc:876] [default] Stopping writes because we have 2 immutable memtables (waiting for flush), max_write_buffer_number is set to 2
2020/10/01-16:02:57.448977 2328 [db\db_impl\db_impl.cc:900] ------- DUMPING STATS -------
2020/10/01-16:02:57.449034 2328 [db\db_impl\db_impl.cc:901]
** DB Stats **
Uptime(secs): 16836.8 total, 665.9 interval
Cumulative writes: 20M writes, 20M keys, 20M commit groups, 1.0 writes per commit group, ingest: 3.00 GB, 0.18 MB/s
Cumulative WAL: 20M writes, 0 syncs, 20944372.00 writes per sync, written: 3.00 GB, 0.18 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 517K writes, 517K keys, 517K commit groups, 1.0 writes per commit group, ingest: 73.63 MB, 0.11 MB/s
Interval WAL: 517K writes, 0 syncs, 517059.00 writes per sync, written: 0.07 MB, 0.11 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
Our watchdog triggered a breakpoint after a mutex had been blocked for several minutes; after that, this showed up in the log file:
2020/10/02-17:44:18.602776 2328 [db\db_impl\db_impl.cc:900] ------- DUMPING STATS -------
2020/10/02-17:44:18.602990 2328 [db\db_impl\db_impl.cc:901]
** DB Stats **
Uptime(secs): 109318.0 total, 92481.2 interval
Cumulative writes: 20M writes, 20M keys, 20M commit groups, 1.0 writes per commit group, ingest: 3.00 GB, 0.03 MB/s
Cumulative WAL: 20M writes, 0 syncs, 20944372.00 writes per sync, written: 3.00 GB, 0.03 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes per commit group, ingest: 0.00 MB, 0.00 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 MB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
** Compaction Stats [default] **
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
L0 2/0 17.12 MB 0.5 0.0 0.0 0.0 0.4 0.4 0.0 1.0 0.0 81.5 5.15 0.00 52 0.099 0 0
L1 3/0 192.76 MB 0.8 3.3 0.4 2.9 3.2 0.3 0.0 8.0 333.3 327.6 10.11 0.00 13 0.778 4733K 119K
L2 20/0 1.02 GB 0.4 1.6 0.4 1.2 1.2 -0.0 0.0 3.1 387.5 290.0 4.30 0.00 7 0.614 2331K 581K
Sum 25/0 1.22 GB 0.0 4.9 0.8 4.1 4.9 0.7 0.0 11.8 257.4 254.5 19.56 0.00 72 0.272 7064K 700K
Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0
** Compaction Stats [default] **
Priority Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
Low 0/0 0.00 KB 0.0 4.9 0.8 4.1 4.5 0.3 0.0 0.0 349.5 316.4 14.40 0.00 20 0.720 7064K 700K
High 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.4 0.4 0.0 0.0 0.0 81.7 5.12 0.00 51 0.100 0 0
User 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 50.9 0.03 0.00 1 0.030 0 0
Uptime(secs): 109318.0 total, 92481.2 interval
Flush(GB): cumulative 0.410, interval 0.000
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 4.86 GB write, 0.05 MB/s write, 4.92 GB read, 0.05 MB/s read, 19.6 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 1 memtable_compaction, 0 memtable_slowdown, interval 0 total count
** File Read Latency Histogram By Level [default] **

RocksDB does have a built-in cache, which you can definitely explore, but I suspect your write stalls can be handled by increasing either the max_write_buffer_number or max_background_flushes configuration parameter.
The former increases the number of memtables that can accumulate in memory before writes are stopped (your log shows the default of 2), while the latter increases the number of background threads available for flush operations.
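As a rough sketch, the relevant options look something like this (the specific values are assumptions to tune against your RAM budget, not recommendations):

#include <rocksdb/cache.h>
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Allow more memtables to accumulate before writes are stopped
  // (the default of 2 matches the warning in your log).
  options.max_write_buffer_number = 4;
  // More background threads for flushing memtables to disk.
  options.max_background_flushes = 2;
  // Larger memtables mean fewer, bigger flushes (64 MB is an assumption).
  options.write_buffer_size = 64 * 1024 * 1024;
  // The built-in block cache serves reads from memory.
  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_cache = rocksdb::NewLRUCache(256 * 1024 * 1024);
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "C:\\path\\to\\db", &db);
  if (!s.ok()) return 1;
  // ... use db ...
  delete db;
  return 0;
}

The trade-off is memory for fewer stalls: more and larger memtables mean more unflushed data sitting in RAM.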
It's hard to say whether the workload is too high for the machine from this information alone, because there are a lot of moving parts, but I doubt it. For starters, a cap of two write buffers is pretty low. Also, I'm obviously not familiar with the workload you're testing for, but since you're not using the caching mechanism yet, the performance can only improve from here.
Disk writes are tricky. Any Linux application "writing to disk" is really calling the write system call, and thus writing to an intermediate kernel buffer, which is then written to its intended destination once the kernel deems fit. The process of actually updating the data on the disk is known as writeback (I highly recommend the file I/O chapter of Robert Love's Linux System Programming for more on this).
You can use the fsync(2) system call to manually commit a file's cached data to disk, but it's not recommended, since the kernel usually knows best when to do this.
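Purely for illustration, the write-then-sync pattern at the system-call level looks like this (POSIX, error handling trimmed):

#include <fcntl.h>
#include <unistd.h>

int main() {
  int fd = open("data.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return 1;
  const char buf[] = "payload";
  // write() only copies the data into the kernel's page cache...
  if (write(fd, buf, sizeof(buf)) < 0) return 1;
  // ...while fsync() blocks until data and metadata reach the disk.
  if (fsync(fd) < 0) return 1;
  close(fd);
  return 0;
}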
Before rolling any bespoke optimizations, I would check out the tuning guide. If tuning the configuration for your workload doesn't solve the problem, I would look into a different RAID configuration. Write stalls are a disk I/O throughput limitation, so data striping might give you enough of a boost, and depending on the RAID level you opt for, it can even add some redundancy to your storage solution.
Either way, my gut feeling is that 2000 requests per second is way too low for the problem to be the current machine's CPU or even memory. Unless each request requires a significant amount of work to process, I doubt this is it. Of course, it never hurts to make sure the server configuration of your endpoint is also optimally tuned.

Related

Intel MPI Benchmarks result exceeds network bandwidth

I was running the Intel MPI Benchmarks on a mini MPI cluster of two nodes on AWS EC2.
The executables of the benchmark were compiled with make CC=/usr/bin/mpicc CXX=/usr/bin/mpicxx.
The cluster was set up based on this tutorial.
(What I did differently is that I installed OpenMPI instead of MPICH.)
The type of the AWS instances is t3a.xlarge.
According to this spec the network performance is up to 5 Gbps (i.e. 625 MB/s).
I chose Ubuntu 18.04 as the OS for these instances.
The PingPong benchmark gave me surprising results:
mpiuser#mpin0:~$ mpirun -rf rankfile -np 2 ./IMB-MPI1 PingPong
#------------------------------------------------------------
# Intel(R) MPI Benchmarks 2019 Update 3, MPI-1 part
#------------------------------------------------------------
# Date : Thu Nov 28 15:17:38 2019
# Machine : x86_64
# System : Linux
# Release : 4.15.0-1054-aws
# Version : #56-Ubuntu SMP Thu Nov 7 16:15:59 UTC 2019
# MPI Version : 3.1
# MPI_Datatype : MPI_BYTE
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 32.05 0.00
1 1000 31.74 0.03
2 1000 31.98 0.06
4 1000 32.08 0.12
8 1000 31.86 0.25
16 1000 31.30 0.51
32 1000 31.40 1.02
64 1000 32.86 1.95
128 1000 32.66 3.92
256 1000 33.56 7.63
512 1000 35.00 14.63
1024 1000 34.74 29.48
2048 1000 36.56 56.02
4096 1000 39.80 102.91
8192 1000 49.92 164.10
16384 1000 57.73 283.79
32768 1000 78.86 415.54
65536 640 170.26 384.91
131072 320 239.80 546.60
262144 160 357.34 733.61
524288 80 609.82 859.74
1048576 40 1106.36 947.77
2097152 20 2514.62 833.98
4194304 10 5830.39 719.39
Some of the speeds exceed 625 MB/s.
The rankfile was used to force the two processes onto two separate nodes, so it is not the case that both MPI processes were running on the same node.
The content of the rankfile is:
rank 0=mpin0 slot=0
rank 1=mpin1 slot=0
What are the possible reasons that caused the benchmark result to exceed the theoretical limit?

How to test the integrity of hardware on an AWS instance?

I have a cluster of consumers (50 or so instances) consuming from Kafka partitions.
I notice that one server is consistently slow: its CPU usage is always around 80-100%, while the other instances sit around 50%.
Originally I thought there was a slight chance this was traffic dependent, so I manually switched the partitions that the slow consumer was reading from. However, I did not observe an increase in processing speed.
I also don't see CPU steal in iostat, but since all consumers run the same code, I suspect there is some bottleneck in the hardware.
Unfortunately, I can't just replace the server unless I can provide conclusive proof that the hardware is the problem.
So I want to write a load-testing script that pinpoints the bottleneck.
My plan is to write a loop in Python that performs n computations, and find the maximum computation rate the slow consumer can sustain versus the fast consumer.
What other testing strategies could I use? Perhaps I should test for a disk bottleneck by having the script write to a text file?
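A minimal sketch of such a fixed-work timing loop (shown in C++ here rather than Python, to keep interpreter noise out of the comparison; the workload and iteration count are arbitrary placeholders):

#include <chrono>
#include <cstdint>
#include <iostream>

// Placeholder workload: any fixed amount of CPU-bound work will do,
// as long as it is identical on both machines.
static std::uint64_t burn(std::uint64_t n) {
  std::uint64_t acc = 1;
  for (std::uint64_t i = 1; i <= n; ++i)
    acc = acc * 6364136223846793005ULL + i;  // cheap integer math
  return acc;
}

int main() {
  const std::uint64_t n = 500000000;  // assumption: tune per machine
  auto t0 = std::chrono::steady_clock::now();
  volatile std::uint64_t sink = burn(n);  // volatile keeps the loop alive
  auto t1 = std::chrono::steady_clock::now();
  (void)sink;
  std::cout << "ops/sec: "
            << n / std::chrono::duration<double>(t1 - t0).count() << "\n";
  return 0;
}

Running the same binary on both machines and comparing ops/sec would isolate raw CPU throughput from the Kafka workload.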
Here is the fast consumer's iostat:
avg-cpu: %user %nice %system %iowait %steal %idle
50.01 0.00 3.96 0.13 0.12 45.77
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
xvda 1.06 0.16 11.46 422953 30331733
xvdb 377.63 0.01 46937.99 35897 124281808572
xvdc 373.43 0.01 46648.25 26603 123514631628
md0 762.53 0.01 93586.24 22235 247796440032
Here is the slow consumer's iostat:
avg-cpu: %user %nice %system %iowait %steal %idle
81.58 0.00 5.28 0.11 0.06 12.98
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
xvda 1.02 0.40 13.74 371145 12685265
xvdb 332.85 0.02 40775.06 18229 37636091096
xvdc 327.42 0.01 40514.44 10899 37395540132
md0 676.47 0.01 81289.50 11287 75031631060

How to read data into an array

I have FORTRAN 77 code from an engineering textbook that I would like to make use of. The problem is that I am unable to understand how to input the data into the arrays, namely FDAM1(61), FDAM2(61), FPOW1(61), FPOW2(61), UDAM(61) and UPOW(61).
For your reference the code has been taken from Page 49 of this book: https://books.google.pt/books?id=i2hyniQpecYC&lpg=PR6&dq=optimal%20design%20siddall&pg=PA49#v=onepage&q=optimal%20design%20siddall&f=false
C     PROGRAM TST (INPUT,OUTPUT,TAPE5=INPUT,TAPE6=OUTPUT)
C
C     PROGRAM TO ESTIMATE MAXIMUM EXPECTED VALUE FOR ALTERNATE DESIGNS
C
C     FDENS(I)= ARRAYS FOR DATA DEFINING DENSITY FUNCTIONS
C     FDAM1(I)= ARRAY DEFINING DENSITY FUNCTION FOR DAMAGE IN DESIGN 1
C     FDAM2(I)= ARRAY DEFINING DENSITY FUNCTION FOR DAMAGE IN DESIGN 2
C     FPOW1(I)= ARRAY DEFINING DENSITY FUNCTION FOR POWER IN DESIGN 1
C     FPOW2(I)= ARRAY DEFINING DENSITY FUNCTION FOR POWER IN DESIGN 2
C     UDAM(I)=  VALUE CURVE FOR DAMAGE
C     UPOW(I)=  VALUE CURVE FOR POWER
C
      DIMENSION FDENS(61),FDAM1(61),FDAM2(61),FPOW1(61),FPOW2(61),
     1UDAM(61),UPOW(61),FUNC(61)
C
C     NORMALIZE DENSITY FUNCTIONS
C
      DO 1 I=1,4
      READ(5,10)(FDENS(J),J=1,61)
      READ(5,11)RANGE
      AREA=FSIMP(FDENS,RANGE,61)
      DO 2 J=1,61
      GO TO(3,4,5,6)I
    3 FDAM1(J)=FDENS(J)/AREA
      GO TO 2
    4 FDAM2(J)=FDENS(J)/AREA
      GO TO 2
    5 FPOW1(J)=FDENS(J)/AREA
      GO TO 2
    6 FPOW2(J)=FDENS(J)/AREA
    2 CONTINUE
    1 CONTINUE
C
C     DETERMINE EXPECTED VALUES
C
      READ(5,10)(UDAM(J),J=1,61)
      READ(5,10)(UPOW(J),J=1,61)
      DO 20 I=1,6
      GO TO (30,31,32,33,34,35)I
   30 DO 40 J=1,61
   40 FUNC(J)=FDAM1(J)*UDAM(J)
      RANGE=12.
      E1=FSIMP(FUNC,RANGE,61)
      GO TO 20
   31 DO 41 J=1,61
   41 FUNC(J)=FDAM2(J)*UDAM(J)
C
      RANGE=12.
      E2=FSIMP(FUNC,RANGE,61)
      GO TO 20
   32 DO 42 J=1,61
      RANGE=60.
   42 FUNC(J)=FPOW1(J)*UPOW(J)
      E3=FSIMP(FUNC,RANGE,61)
   33 DO 43 J=1,61
   43 FUNC(J)=FPOW2(J)*UPOW(J)
      RANGE=60.
      E4=FSIMP(FUNC,RANGE,61)
      GO TO 20
   34 E5=8.17
      GO TO 20
   35 E6=2.20
   20 CONTINUE
      DES1=E1+E3+E5
      DES2=E2+E4+E6
C
C     OUTPUT
C
      WRITE(6,100)
  100 FORMAT(/,1H ,15X,24HEXPECTED VALUES OF VALUE,//)
      WRITE(6,101)
  101 FORMAT(/,1H ,12X,6HDAMAGE,7X,5HPOWER,9X,5HPARTS,8X,5HTOTAL,//)
      WRITE(6,102)E1,E3,E5,DES1
  102 FORMAT(/,1H ,8HDESIGN 1,4X,F5.3,8X,F5.3,9X,F5.3,8X,F6.3)
      WRITE(6,103)E2,E4,E6,DES2
  103 FORMAT(/,1H ,8HDESIGN 2,4X,F5.3,8X,F5.3,9X,F5.3,8X,F6.3)
   10 FORMAT(16F5.2)
   11 FORMAT(F5.0)
      STOP
      END
C     SUBROUTINE FSIMP
      FUNCTION FSIMP(FUNC,RANGE,MINT)
C.... CALCULATES INTEGRAL BY SIMPSONS RULE WITH
C     MODIFICATION IF MINT IS EVEN
C.... INPUT
C     FUNC  = ARRAY OF EQUALLY SPACED VALUES OF FUNCTION
C             DIMENSION MINT
C     RANGE = RANGE OF INTEGRATION
C     MINT  = NUMBER OF STATIONS
C.... OUTPUT
C     FSIMP = AREA
      DIMENSION FUNC(1)
C.... CHECK MINT FOR ODD OR EVEN
      XX=RANGE/(3.*FLOAT(MINT-1))
      M=MINT/2*2
      IF(M.EQ.MINT) GO TO 3
C.... ODD
      AREA=FUNC(1)+FUNC(M)
      MM=MINT-1
      DO 1 I=2,MM,2
    1 AREA=AREA+4.*FUNC(I)
      MM=MM-1
      DO 2 I=3,MM,2
    2 AREA=AREA+2.*FUNC(I)
      FSIMP=XX*AREA
      RETURN
C.... EVEN
C.... USE SIMPSONS RULE FOR ALL BUT THE LAST 3 INTERVALS
    3 M=MINT-3
      AREA=FUNC(1)+FUNC(M)
      MM=M-1
      DO 4 I=2,MM,2
    4 AREA=AREA+4.*FUNC(I)
      MM=MM-1
      DO 5 I=3,MM,2
    5 AREA=AREA+2.*FUNC(I)
      FSIMP=XX*AREA
C.... USE NEWTONS 3/8 RULE FOR LAST THREE INTERVALS
      FSIMP=FSIMP+9./8.*XX*(FUNC(MINT-3)+3.*(FUNC(MINT-2)+FUNC(MINT-1))
     1 +FUNC(MINT))
      RETURN
      END
Here is a minimal example to help you get started:
C     Minimal working example of creaky old FORTRAN I/O
      PROGRAM ABYSS
      IMPLICIT NONE
C
      REAL FDENS(61)
      REAL XRANGE
      INTEGER J
C
   10 FORMAT(16F5.2)
   11 FORMAT(F5.0)
  909 FORMAT(/, 'BEHOLD! A DENSITY DISTRIBUTION',/)
  910 FORMAT(10(F5.2, 3X),/)
  911 FORMAT(/, 'XRANGE is ', F6.1)
C
      CONTINUE
C
      READ(5,10) (FDENS(J), J=1,61)
      READ(5,11) XRANGE
C
      WRITE(6,909)
      WRITE(6,910) (FDENS(J), J=1,61)
      WRITE(6,911) XRANGE
C
      STOP
      END
Apologies for writing this in F77; I'm sticking with the style of the code posted above for the sake of this example. Ideally, you'd use F2003 or F2008 for new code, or a completely different language which actually has decent I/O features and a rich standard library. But I digress.
This code will operate on the data (be careful to preserve the spaces):
0.1 0.3 0.5 0.9 1.30 1.90 2.50 3.20 3.80 4.20
4.70 5.0 5.1 5.2 5.2 5.1 4.9 4.7 4.6 4.4 4.2 3.9 3.8 3.6 3.4 3.2
3.0 2.9 2.7 2.5 2.4 2.2 2.1 1.9 1.8 1.6 1.5 1.4 1.2 1.1 1.0 0.9
0.8 0.7 0.6 0.5 0.4 0.3 0.3 0.2 0.1 0.1
12.
to produce
BEHOLD! A DENSITY DISTRIBUTION
0.00 0.00 0.00 0.00 0.00 0.00 0.10 0.30 0.50 0.90
1.30 1.90 2.50 3.20 3.80 4.20 4.70 5.00 5.10 5.20
5.20 5.10 4.90 4.70 4.60 4.40 4.20 3.90 3.80 3.60
3.40 3.20 3.00 2.90 2.70 2.50 2.40 2.20 2.10 1.90
1.80 1.60 1.50 1.40 1.20 1.10 1.00 0.90 0.80 0.70
0.60 0.50 0.40 0.30 0.30 0.20 0.10 0.10 0.00 0.00
0.00
XRANGE is 12.0
If the code is in abyss.f and the input data in abyss.dat, you should be able to build the code with
gfortran -g -Wall -Og -o abyss abyss.f
and generate similar results by running
abyss < abyss.dat > abyss.out
A key point to note is that the original code is reading from unit 5 (traditionally taken as stdin, now officially canonized in iso_fortran_env as INPUT_UNIT). In your own code, I'd suggest reading from a data file, so replace the literal 5 with whatever variable contains the unit number of the file you're reading from (hint: consider using the newunit argument to the open command introduced in Fortran 2008. It solves the perennially stupid Fortran problem of trying to find a free I/O unit number.) While you can use I/O redirection, it's suboptimal; it's used here to show how to work around the limitations of the original code.
Also, for the sake of later generations and your own sanity, please avoid taking advantage of Cold-War-era FORTRAN misfeatures such as this spaces-equal-zeroes nonsense. If your data is worth using, it's worth putting in a sensible format which can be easily parsed; columnar, space-delimited values are as good a choice as any. Fortran may actually get a standard library which can read and write CSV files sometime around 2156 (give or take a century) so you have plenty of time to design something decent...

Determine if an allocation via malloc() is backed by a huge page

I understand pretty well how transparent hugepages work, and that any allocation, such as one performed by malloc, may be satisfied by a huge page.
What I'd like to know is whether there is any check I can make (possibly heuristic) after an allocation to determine whether the memory is backed by a huge page.
You can determine the exact status of any page, including whether it is backed by a transparent (or non-transparent) hugepage, by looking up the "pfn" (page frame number) in the /proc/kpageflags file. You get the pfn for a page by reading from the /proc/$PID/pagemap file for your process, which is indexed by virtual address.
Unfortunately, both the pfn value from pagemap¹ and the entire /proc/kpageflags file are accessible only to root. Still, if you can run your process as root, at least in the testing or benchmarking scenario you are interested in, this works well.
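If you want to do the lookup by hand, a bare-bones sketch looks like this (Linux-only, run as root, error handling omitted; the bit positions come from the kernel's pagemap documentation):

#include <fcntl.h>
#include <unistd.h>
#include <cstdint>

// Is the page containing `addr` backed by a transparent hugepage?
bool is_thp_backed(const void* addr) {
  const long page_size = sysconf(_SC_PAGESIZE);
  const std::uint64_t vpn =
      reinterpret_cast<std::uintptr_t>(addr) / page_size;

  // Step 1: virtual page number -> pagemap entry (8 bytes per page).
  int pagemap = open("/proc/self/pagemap", O_RDONLY);
  std::uint64_t entry = 0;
  pread(pagemap, &entry, sizeof(entry), vpn * 8);
  close(pagemap);
  if (!(entry & (1ULL << 63))) return false;       // bit 63: page present
  const std::uint64_t pfn = entry & ((1ULL << 55) - 1); // bits 0-54: pfn
  if (pfn == 0) return false;                      // pfn masked: not root

  // Step 2: pfn -> kpageflags entry (8 bytes per physical frame).
  int kpageflags = open("/proc/kpageflags", O_RDONLY);
  std::uint64_t flags = 0;
  pread(kpageflags, &flags, sizeof(flags), pfn * 8);
  close(kpageflags);
  return flags & (1ULL << 22);                     // bit 22: KPF_THP
}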
I wrote a small library called page-info which does the relevant parsing for you. Give it a range of memory and it will return you info on each page, including whether it is present in memory, backed by a hugepage, etc.
For example, running the included test process as sudo ./page-info-test THP gives the following output:
PAGE_SIZE = 4096, PID = 18868
       size  memset  FLAG     SET   UNSET  UNAVAIL
   0.25 MiB  BEFORE  THP        0       1       64
   0.25 MiB  AFTER   THP        0      65        0
   0.50 MiB  BEFORE  THP        0       1      128
   0.50 MiB  AFTER   THP        0     129        0
   1.00 MiB  BEFORE  THP        0       1      256
   1.00 MiB  AFTER   THP        0     257        0
   2.00 MiB  BEFORE  THP        0       1      512
   2.00 MiB  AFTER   THP        0     513        0
   4.00 MiB  BEFORE  THP        0       1     1024
   4.00 MiB  AFTER   THP      512     513        0
   8.00 MiB  BEFORE  THP        0       1     2048
   8.00 MiB  AFTER   THP     1536     513        0
  16.00 MiB  BEFORE  THP        0       1     4096
  16.00 MiB  AFTER   THP     3584     513        0
  32.00 MiB  BEFORE  THP        0       1     8192
  32.00 MiB  AFTER   THP     7680     513        0
  64.00 MiB  BEFORE  THP        0       1    16384
  64.00 MiB  AFTER   THP    15872     513        0
 128.00 MiB  BEFORE  THP        0       1    32768
 128.00 MiB  AFTER   THP    32256     513        0
 256.00 MiB  BEFORE  THP        0       1    65536
 256.00 MiB  AFTER   THP    65024     513        0
 512.00 MiB  BEFORE  THP        0       1   131072
 512.00 MiB  AFTER   THP   124416    6657        0
1024.00 MiB  BEFORE  THP        0       1   262144
1024.00 MiB  AFTER   THP        0  262145        0
DONE
The UNAVAIL column means that no information about the mapping was available, usually because the page has never been accessed and so isn't yet backed by any physical page at all. You can see that for these largish allocations only a single page is mapped in following the allocation, since we haven't touched the memory.
The AFTER rows show the same information after calling memset() on the entire allocation, which forces all pages to be physically allocated. Here we can see that no allocations are backed by transparent hugepages until we hit 4 MiB, at which point the majority of each allocation is backed by THP, except for 513 pages (which turn out to be at the edges of the allocated region). At 512 MiB the system starts running out of available hugepages yet still satisfies most of the allocation, and at 1024 MiB the entire allocation is satisfied with small pages.
This library isn't production ready, so don't use it for anything critical (e.g., some failures simply call exit()). Contributions are welcome.
¹ Since approximately kernel 4.0; before that, the pfn was accessible to non-root user processes. From roughly 4.0 to 4.1, the entire pagemap file was off-limits to non-root processes; since then the file is available again, but with the pfn masked out (it will always appear as zero).
There is a difference between traditional hugepages and transparent hugepages (THP). In the case of THPs, the application can use huge pages without any developer support (mmap, shmget, etc.) or sysadmin intervention.
In code, I am afraid there may be no straightforward way to check this. However, if you know the size of the allocated data structures or buffers, it is worth checking the THP usage on the system with the following command; the usage should increase while your application is running:
# grep AnonHugePages /proc/meminfo
AnonHugePages: 2648064 kB

MinGW gprof inaccurate results?

I've been profiling a program with gprof on Linux (Ubuntu 11.04) and Windows (7, latest version of MinGW), running the same program on more or less the same dataset each time, and getting significantly different results. (Significantly as in they would lead to different conclusions about which part of the code needs optimizing.)
It's possible that the results are legitimately different on the two systems, but I also have to consider the possibility that one result set is inaccurate and should be ignored; a priori, the more likely culprit is MinGW, as gprof is less extensively tested on Windows than on Linux. A stronger argument for that conclusion is that the results on Windows look distinctly weird:
% cumulative self self total
time seconds seconds calls us/call us/call name
27.43 1.13 1.13 68589813 0.02 0.02 addt
21.48 2.02 0.89 tok
19.17 2.81 0.79 hash
9.95 3.21 0.41 slot
7.89 3.54 0.33 nextx
4.85 3.74 0.20 next
3.52 3.88 0.14 27809047 0.01 0.01 get
0.85 3.92 0.04 eol
0.73 3.95 0.03 __mingw_pformat
0.73 3.98 0.03 ch
0.73 4.01 0.03 tokx
0.49 4.03 0.02 slot
0.49 4.05 0.02 tok
0.24 4.06 0.01 166896 0.06 0.06 mk2
0.24 4.07 0.01 6693 1.49 1.49 initt
0.24 4.08 0.01 __pformat_putchars
0.24 4.09 0.01 hashs
0.24 4.10 0.01 pop
0.24 4.11 0.01 quoted
0.12 4.12 0.01 eat
0.12 4.12 0.01 expand
0.00 4.12 0.00 145841014 0.00 0.00 initparse
There are a lot of gaps, and then initparse, an initialization function that is called only once and calls almost nothing else, is reported as having been called one hundred and forty-five million times.
Should I disregard the results from Windows and just use the ones from Linux? Or is there some issue with the reporting of number of calls on Windows that doesn't affect the percentage time results? Or am I misreading the output or otherwise misusing the tool?