Determine if an allocation via malloc() is backed by a huge page - c++

I understand pretty well how transparent hugepages work, and that any allocation, such as those performed by malloc, may be satisfied by a huge page.
What I'd like to know is whether there is any check I can make (possibly heuristic) after an allocation to determine whether the memory is backed by a huge page.

You can determine the exact status of any page, including whether it is backed by a transparent (or non-transparent) hugepage, by looking up the "pfn" (page frame number) in the /proc/kpageflags file. You get the pfn for a page by reading from the /proc/$PID/pagemap file for your process, which is indexed by virtual address.
Unfortunately, both the pfn value from pagemap [1] and the entire /proc/kpageflags file are accessible only to root users. Still, if you can run your process as root, at least in the testing or benchmarking scenario you are interested in, this works well.
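For reference, here is a minimal sketch of that lookup in C++ (assuming Linux, a 4 KiB base page size, and a process running as root; error handling is reduced to returning false). The pagemap entry layout (bit 63 = present, bits 0-54 = pfn) and the THP bit (22) in kpageflags come from the kernel's pagemap documentation:

#include <cstdint>
#include <fcntl.h>
#include <unistd.h>

// KPF_THP is bit 22 in /proc/kpageflags (see the kernel pagemap documentation).
constexpr uint64_t KPF_THP = 1ULL << 22;

// Returns true if the (4 KiB) page containing addr is currently backed by a
// physical page that is part of a transparent hugepage.
bool page_backed_by_thp(const void* addr) {
    long page_size = sysconf(_SC_PAGESIZE);
    uint64_t vpn = reinterpret_cast<uintptr_t>(addr) / page_size;

    // Each pagemap entry is 8 bytes: bit 63 = present, bits 0..54 = pfn.
    int pm = open("/proc/self/pagemap", O_RDONLY);
    if (pm < 0) return false;
    uint64_t entry = 0;
    ssize_t got = pread(pm, &entry, sizeof entry, vpn * sizeof entry);
    close(pm);
    if (got != (ssize_t)sizeof entry || !(entry & (1ULL << 63))) return false;
    uint64_t pfn = entry & ((1ULL << 55) - 1);   // masked to zero for non-root readers

    // Each kpageflags entry is 8 bytes of flag bits, indexed by pfn.
    int kf = open("/proc/kpageflags", O_RDONLY);
    if (kf < 0) return false;
    uint64_t flags = 0;
    got = pread(kf, &flags, sizeof flags, pfn * sizeof flags);
    close(kf);
    return got == (ssize_t)sizeof flags && (flags & KPF_THP);
}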
I wrote a small library called page-info which does the relevant parsing for you. Give it a range of memory and it will return you info on each page, including whether it is present in memory, backed by a hugepage, etc.
For example, running the included test process as sudo ./page-info-test THP gives the following output:
PAGE_SIZE = 4096, PID = 18868
        size  memset  FLAG     SET   UNSET  UNAVAIL
    0.25 MiB  BEFORE   THP       0       1       64
    0.25 MiB   AFTER   THP       0      65        0
    0.50 MiB  BEFORE   THP       0       1      128
    0.50 MiB   AFTER   THP       0     129        0
    1.00 MiB  BEFORE   THP       0       1      256
    1.00 MiB   AFTER   THP       0     257        0
    2.00 MiB  BEFORE   THP       0       1      512
    2.00 MiB   AFTER   THP       0     513        0
    4.00 MiB  BEFORE   THP       0       1     1024
    4.00 MiB   AFTER   THP     512     513        0
    8.00 MiB  BEFORE   THP       0       1     2048
    8.00 MiB   AFTER   THP    1536     513        0
   16.00 MiB  BEFORE   THP       0       1     4096
   16.00 MiB   AFTER   THP    3584     513        0
   32.00 MiB  BEFORE   THP       0       1     8192
   32.00 MiB   AFTER   THP    7680     513        0
   64.00 MiB  BEFORE   THP       0       1    16384
   64.00 MiB   AFTER   THP   15872     513        0
  128.00 MiB  BEFORE   THP       0       1    32768
  128.00 MiB   AFTER   THP   32256     513        0
  256.00 MiB  BEFORE   THP       0       1    65536
  256.00 MiB   AFTER   THP   65024     513        0
  512.00 MiB  BEFORE   THP       0       1   131072
  512.00 MiB   AFTER   THP  124416    6657        0
 1024.00 MiB  BEFORE   THP       0       1   262144
 1024.00 MiB   AFTER   THP       0  262145        0
DONE
The UNAVAIL column means that no information about the mapping was available, usually because the page has never been accessed and so isn't yet backed by any physical page at all. You can see that for these "largeish" allocations only a single page is mapped in following the allocation, since we haven't touched the memory.
The AFTER rows are the same information after calling memset() on the entire allocation, which causes all pages to be physically allocated. Here we can see that no allocations are backed by transparent hugepages until we hit allocations of 4 MiB, at which point the majority of each allocation is backed by THP, except for 513 pages (which turn out to be at the edges of the allocated region). At 512 MiB the system starts running out of available hugepages but still satisfies most of the allocation, but at 1024 MiB the entire allocation is satisfied with small pages.
This library isn't production ready, so don't use it for anything critical (e.g., some failures simply call exit()). Contributions welcome.
[1] Since approximately kernel 4.0; before that, the pfn was accessible to non-root user processes. From roughly 4.0 to 4.1, the entire pagemap file was off-limits to non-root processes; since then the file is available to non-root users again, but with the pfn masked out (it always appears as zero).

There is a difference between traditional hugepages and transparent hugepages (THP). With THP, the application can use huge pages without any developer support (mmap, shmget, etc.) or sysadmin intervention.
In the code, I am afraid there may be no straightforward way to check this. However, if you know the size of the allocated data structures or buffers, it is worth checking the THP usage on the system with the following command; this usage should increase while your application is running:
# grep AnonHugePages /proc/meminfo
AnonHugePages: 2648064 kB
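If it helps, the following is a small sketch (plain C++; the field name matches the /proc/meminfo output above) that reads this counter so you can sample it before and after your allocation and memset and compare:

#include <fstream>
#include <sstream>
#include <string>

// Returns the current AnonHugePages figure from /proc/meminfo in kB,
// or -1 if the field is not present (THP not enabled/supported).
long long anon_huge_pages_kb() {
    std::ifstream meminfo("/proc/meminfo");
    std::string line;
    while (std::getline(meminfo, line)) {
        if (line.compare(0, 14, "AnonHugePages:") == 0) {
            long long kb = 0;
            std::istringstream(line.substr(14)) >> kb;   // e.g. "AnonHugePages:  2648064 kB"
            return kb;
        }
    }
    return -1;
}

Note that this counter is system-wide, so activity from other processes can also move it.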

Related

Intel MPI Benchmarks result exceeds network bandwidth

I was running the Intel MPI Benchmarks on a mini MPI cluster of two nodes on AWS EC2.
The executables of the benchmark were compiled with make CC=/usr/bin/mpicc CXX=/usr/bin/mpicxx.
The cluster was set up based on this tutorial.
(What I did differently is that I installed OpenMPI instead of MPICH.)
The type of the AWS instances is t3a.xlarge.
According to this spec the network performance is up to 5 Gbps (i.e. 625 MB/s).
I chose Ubuntu 18.04 as the OS for these instances.
The PingPong benchmark gave me surprising results:
mpiuser@mpin0:~$ mpirun -rf rankfile -np 2 ./IMB-MPI1 PingPong
#------------------------------------------------------------
# Intel(R) MPI Benchmarks 2019 Update 3, MPI-1 part
#------------------------------------------------------------
# Date : Thu Nov 28 15:17:38 2019
# Machine : x86_64
# System : Linux
# Release : 4.15.0-1054-aws
# Version : #56-Ubuntu SMP Thu Nov 7 16:15:59 UTC 2019
# MPI Version : 3.1
# MPI_Datatype : MPI_BYTE
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000        32.05         0.00
            1         1000        31.74         0.03
            2         1000        31.98         0.06
            4         1000        32.08         0.12
            8         1000        31.86         0.25
           16         1000        31.30         0.51
           32         1000        31.40         1.02
           64         1000        32.86         1.95
          128         1000        32.66         3.92
          256         1000        33.56         7.63
          512         1000        35.00        14.63
         1024         1000        34.74        29.48
         2048         1000        36.56        56.02
         4096         1000        39.80       102.91
         8192         1000        49.92       164.10
        16384         1000        57.73       283.79
        32768         1000        78.86       415.54
        65536          640       170.26       384.91
       131072          320       239.80       546.60
       262144          160       357.34       733.61
       524288           80       609.82       859.74
      1048576           40      1106.36       947.77
      2097152           20      2514.62       833.98
      4194304           10      5830.39       719.39
Some of the speeds exceed 625 MB/s.
The rankfile was used to force the two processes onto two different nodes, so it is not the case that the two MPI processes are running on the same node.
The content of the rankfile is:
rank 0=mpin0 slot=0
rank 1=mpin1 slot=0
What are the possible reasons that caused the benchmark result to exceed the theoretical limit?

memory mapped file access is very slow

I am writing to a 930GB file (preallocated) on a Linux machine with 976 GB memory.
The application is written in C++ and I am memory mapping the file using Boost Interprocess. Before starting the code I set the stack size:
ulimit -s unlimited
The writing was very fast a week ago, but today it is running slowly. I don't think the code has changed, but I may have accidentally changed something in my environment (it is an AWS instance).
The application ("write_data") doesn't seem to be using all the available memory. "top" shows:
Tasks: 559 total, 1 running, 558 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 98.5%id, 1.5%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1007321952k total, 149232000k used, 858089952k free, 286496k buffers
Swap: 0k total, 0k used, 0k free, 142275392k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4904 root      20   0 2708m  37m  27m S  1.0  0.0  1:47.00  dockerd
56931 my_user   20   0  930g  29g  29g D  1.0  3.1 12:38.95  write_data
57179 root      20   0     0    0    0 D  1.0  0.0  0:25.55  kworker/u257:1
57512 my_user   20   0 15752 2664 1944 R  1.0  0.0  0:00.06  top
I thought the resident size (RES) should include the memory mapped data, so shouldn't it be > 930 GB (size of the file)?
Can someone suggest ways to diagnose the problem?
Memory mappings generally aren't eagerly populated. If some other program had forced the file into the page cache, you'd see good performance from the start; otherwise you'd see poor performance as the file was paged in.
Given that you have enough RAM to hold the whole file in memory, you may want to hint to the OS that it should prefetch the file, replacing many small reads triggered by page faults with larger bulk reads. The posix_madvise API can be used to provide this hint, by passing POSIX_MADV_WILLNEED as the advice, indicating that the whole file should be prefetched.
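A minimal sketch of that hint (using a plain mmap rather than the asker's Boost.Interprocess mapping; the file path is a placeholder):

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const char* path = "/path/to/the-930GB-file";   // placeholder path
    int fd = open(path, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    void* p = mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    // Advisory only: ask the kernel to read the whole range ahead of time.
    int rc = posix_madvise(p, st.st_size, POSIX_MADV_WILLNEED);
    if (rc != 0) std::fprintf(stderr, "posix_madvise failed: %d\n", rc);

    // ... write to the mapping as before ...

    munmap(p, st.st_size);
    close(fd);
    return 0;
}

Boost.Interprocess exposes the mapped address and size of its mapped region, so the same posix_madvise call can be applied to that region instead.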

Allocate large array in PGI Fortran

I am trying to allocate a real array finn_var(459,299,27,24,nspec) in Fortran. nspec = 24 works OK, while nspec = 25 does not. There is no error message from the allocation, but the print command prints nothing rather than zero values, and if I use the array I get a "segmentation fault" error. The test program is
program test
  implicit none
  integer :: nx, ny, nez, nt, nspec
  integer :: allocation_status
  real, allocatable :: finn_var(:,:,:,:,:)

  nx = 459
  ny = 299
  nez = 27
  nt = 24
  nspec = 24

  allocate( finn_var(nx, ny, nez, nt, nspec), stat = allocation_status )
  if (allocation_status > 0) then
    print*, "Allocation error for finn_var"
    stop
  end if

  print*, finn_var
end
It should not be a memory issue: I allocated a double precision finn_var(459,299,27,24,24) without problems. What is the reason then?
I use pgf90 on a Linux server. The cat /proc/meminfo command:
MemTotal: 396191724 kB
MemFree: 66065188 kB
Buffers: 402388 kB
Cached: 274584600 kB
SwapCached: 0 kB
Active: 131679328 kB
Inactive: 191625200 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 396191724 kB
LowFree: 66065188 kB
SwapTotal: 20971484 kB
SwapFree: 20971180 kB
Dirty: 605508 kB
Writeback: 0 kB
AnonPages: 48317148 kB
Mapped: 123328 kB
Slab: 6612824 kB
PageTables: 132920 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 219067344 kB
Committed_AS: 53206972 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 275624 kB
VmallocChunk: 34359462559 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
The ulimit -a command:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 3153920
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 3153920
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
I compiled with pgf90. If I compile with gfortran, there is no problem.
It doesn't have to be insufficient memory. The number of elements in the array is 2 223 304 200, which is suspiciously close to (and just above) the maximum 32-bit signed integer, 2 147 483 647.
It looks like the element count that the compiler uses internally overflows. The internal call to malloc then requests too little memory, and any attempt to access elements near the end of the array fails.
This is a limitation of the compiler in its default settings. It can be set up to use 64-bit addressing with the option -Mlarge_arrays.
See http://www.pgroup.com/products/freepgi/freepgi_ref/ch05.html#ArryIndex
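A small illustration of that overflow (in C++, purely to show the arithmetic; it is the compiler's internal 32-bit size computation that actually overflows):

#include <cstdint>
#include <cstdio>

int main() {
    // Element count of finn_var(459,299,27,24,25).
    int64_t count = 459LL * 299 * 27 * 24 * 25;          // 2 223 304 200
    int32_t truncated = static_cast<int32_t>(count);     // what a 32-bit count would hold
    std::printf("true count   = %lld\n", static_cast<long long>(count));
    std::printf("INT32_MAX    = %d\n", INT32_MAX);
    std::printf("32-bit value = %d\n", static_cast<int>(truncated));  // wraps to a negative number
    return 0;
}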
Your problem is most likely a memory issue.
Your array demands 459*299*27*24 * 4 B per nspec (assuming the default real requires 4 B of memory). For nspec == 24 this results in a memory requirement of approximately 7.95 GiB, while nspec == 25 needs around 8.28 GiB.
I guess your physical memory is limited to 8 GiB, or some ulimit restricts the amount of memory allowed for this process.
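For reference, the arithmetic behind those figures (assuming 4-byte reals):
459 * 299 * 27 * 24 = 88 932 168 elements per nspec slice, i.e. roughly 339 MiB.
88 932 168 * 24 * 4 B ≈ 7.95 GiB, and 88 932 168 * 25 * 4 B ≈ 8.28 GiB.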

Cache line size

It might be a very common and simple question, but I need some explanation about the curve that I just obtained from a cache benchmark code. The goal here is to find the cache line size. I used the code from here:
(https://github.com/jiewmeng/cs3210-assign1/blob/master/cache-l1-line.cpp)
This is the curve that I obtained from running the code on my machine (MacBook Pro with a Core i7 - the cache line size is 64 bytes and the L1 data cache is 32 KB).
The Time vs different stride size curve
I think the peak happens at 128 bytes and not at 64 bytes. If that is true, I want to know why.
Why is the time reduced at 512 bytes?
Update:
I also ran some code to determine the sizes of the L1 and L2 caches. Here is the figure, just to document the data. As you can see, there are two peaks, at 32 KB (the L1 cache size) and 256 KB (the L2 cache size).
Question:
I am wondering if there is any way to find the size of L3 shared cache.
Cache size figure.
Thanks
I'm guessing that the 128 B peak is most likely due to spatial prefetching. You can see in Intel's Optimization Guide, under section 2.1.5.4:
This prefetcher strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk
It wouldn't be a clean jump, since this prefetcher is not always firing, and even when it does, it only prefetches into the L2 - but it's much better than fetching from memory. To make sure this is the case, you can disable the prefetchers (through the BIOS or other means, although some systems may not support that) and check again.
As for the L3 size - you didn't specify your exact model, but I'm guessing you have more than 4 MB of L3 - just keep the curve going and see if it jumps there.
EDIT:
Just noticed another thing - your k*i expression is probably overflowing int at the maximum range, which means your access pattern might not be cyclic as you expect.
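For illustration, a sketch of the fix (the names k and i follow the question, but the loop body here is assumed rather than taken from the linked benchmark): computing the index in a 64-bit type keeps the access pattern from wrapping for large strides.

#include <cstdint>

// Touch every k-th byte of arr; with int64_t indices the product k * i
// cannot wrap around the way a 32-bit int would for large k and i.
void touch_strided(char* arr, int64_t len, int64_t k) {
    for (int64_t i = 0; k * i < len; ++i)
        arr[k * i] += 1;
}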
My BusSpeed benchmark was intended to identify cache sizes and performance at different strides, to show burst reading on buses:
http://www.roylongbottom.org.uk/busspd2k%20results.htm
Following are results on a Core i7 with 8 MB L3:
 Memory   Reg2   Reg2   Reg2   Reg2   Reg1   Reg2   Reg1   Reg2   Reg1   Reg8
 KBytes  Inc64  Inc32  Inc16   Inc8   Inc4   Inc4   Inc4   Inc4   Inc8   Inc8
   Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

      4  10025  10800  11262  11498  11612  11634   5850  11635  23093  23090
      8  10807  11267  11505  11627  11694  11694   5871  11694  23299  23297
     16  11251  11488  11620  11614  11712  11719   5873  11718  23391  23398
     32   9893   9853  10890  11170  11558  11492   5872  11466  21032  21025
     64   3219   4620   7289   9479  10805  10805   5875  10797  14426  14426
    128   3213   4805   7305   9467  10811  10810   5875  10805  14442  14408
    256   3144   4592   7231   9445  10759  10733   5870  10743  14336  14337
    512   2005   3497   5980   9056  10466  10467   5871  10441  13906  13905
   1024   2003   3482   5974   9017  10468  10466   5874  10467  13896  13818
   2048   2004   3497   5958   9088  10447  10448   5870  10447  13857  13857
   4096   1963   3398   5778   8870  10328  10328   5851  10328  13591  13630
   8192   1729   3045   5322   8270   9977   9963   5728   9965  12923  12892
  16384    692   1402   2495   4593   7811   7782   5406   7848   8335   8337
  32768    695   1406   2492   4584   7820   7826   5401   7792   8317   8322
  65536    695   1414   2488   4584   7823   7826   5403   7800   8321   8321
 131072    696   1402   2491   4575   7827   7824   5411   7846   8322   8323
 262144    696   1413   2498   4594   7791   7826   5409   7829   8333   8334
 524288    693   1416   2498   4595   7841   7842   5411   7847   8319   8285
1048576    704   1415   2478   4591   7845   7840   5410   7853   8290   8283

End of test Fri Jul 30 16:44:29 2010
CPUID and RDTSC Assembly Code
CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000106A5
Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz Measured 2807 MHz

Calculating used memory by a set of processes on Linux

I'm having trouble with calculating the actually used memory (resident) by a set of processes.
The issue that just came up is a user with a set of processes that share memory between themselves, so a simple addition of used memory ends up with a nonsense number (>60 GB when the machine only has 48 GB of memory).
Is there any simple way to approach this problem?
I can probably do some approximation. Take (res mem - shared mem) * num proc + shared mem. But not all processes necessarily share the same memory block.
I'm looking for a POSIX or Linux solution to this problem for C/C++.
You will want to iterate through each process's /proc/[pid]/smaps.
It will contain an entry like the following for each VM mapping:
7ffffffe7000-7ffffffff000 rw-p 00000000 00:00 0 [stack]
Size: 100 kB
Rss: 20 kB
Pss: 20 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 20 kB
Referenced: 20 kB
Anonymous: 20 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Private_Dirty memory is what you are interested in.
If you have the Pss field in your smaps file, then it reports the resident memory with each shared page's contribution divided by the number of processes sharing it.
Private_Clean could be copy-on-write mappings. Those are commonly used for shared libraries and are generally read/no-write/execute.
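If a starting point in code helps, here is a minimal sketch (assuming Linux and that the kernel reports Pss in smaps) that sums Pss across a set of PIDs; because Pss already divides each shared page among its sharers, the total does not double-count shared memory:

#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Sum the Pss fields (in kB) of every mapping of every given process.
long long total_pss_kb(const std::vector<int>& pids) {
    long long total = 0;
    for (int pid : pids) {
        std::ifstream smaps("/proc/" + std::to_string(pid) + "/smaps");
        std::string line;
        while (std::getline(smaps, line)) {
            if (line.compare(0, 4, "Pss:") == 0) {
                long long kb = 0;
                std::istringstream(line.substr(4)) >> kb;   // e.g. "Pss:   20 kB"
                total += kb;
            }
        }
    }
    return total;
}

int main(int argc, char** argv) {
    std::vector<int> pids;
    for (int i = 1; i < argc; ++i)
        pids.push_back(std::atoi(argv[i]));
    std::printf("Total PSS: %lld kB\n", total_pss_kb(pids));
    return 0;
}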