memory mapped file access is very slow - c++

I am writing to a 930GB file (preallocated) on a Linux machine with 976 GB memory.
The application is written in C++ and I am memory mapping the file using Boost Interprocess. Before starting the code I set the stack size:
ulimit -s unlimited
The writing was very fast a week ago, but today it is running slow. I don't think the code has changed, but I may have accidentally changed something in my environment (it is an AWS instance).
The application ("write_data") doesn't seem to be using all the available memory. "top" shows:
Tasks: 559 total, 1 running, 558 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 98.5%id, 1.5%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1007321952k total, 149232000k used, 858089952k free, 286496k buffers
Swap: 0k total, 0k used, 0k free, 142275392k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4904 root 20 0 2708m 37m 27m S 1.0 0.0 1:47.00 dockerd
56931 my_user 20 0 930g 29g 29g D 1.0 3.1 12:38.95 write_data
57179 root 20 0 0 0 0 D 1.0 0.0 0:25.55 kworker/u257:1
57512 my_user 20 0 15752 2664 1944 R 1.0 0.0 0:00.06 top
I thought the resident size (RES) should include the memory mapped data, so shouldn't it be > 930 GB (size of the file)?
Can someone suggest ways to diagnose the problem?

Memory mappings generally aren't eagerly populated. If some other program forced the file into the page cache, you'd see good performance from the start, otherwise you'd see poor performance as the file was paged in.
Given you have enough RAM to hold the whole file in memory, you may want to hint to the OS that it should prefetch the file, reducing the number of small reads triggered by page faults, substituting larger bulk reads. The posix_madvise API can be used to provide this hint, by passing POSIX_MADV_WILLNEED as the advice, indicating it should prefetch the whole file.

Related

Limiting Java 8 Memory Consumption

I'm running three Java 8 JVMs on a 64 bit Ubuntu VM which was built from a minimal install with nothing extra running other than the three JVMs. The VM itself has 2GB of memory and each JVM was limited by -Xmx512M which I assumed would be fine as there would be a couple of hundred MB spare.
A few weeks ago, one crashed and the hs_err_pid dump showed:
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 196608 bytes for committing reserved memory.
# Possible reasons:
# The system is out of physical RAM or swap space
# In 32 bit mode, the process size limit was hit
# Possible solutions:
# Reduce memory load on the system
# Increase physical memory or swap space
# Check if swap backing store is full
# Use 64 bit Java on a 64 bit OS
# Decrease Java heap size (-Xmx/-Xms)
# Decrease number of Java threads
# Decrease Java thread stack sizes (-Xss)
# Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
I restarted the JVM with a reduced heap size of 384MB and so far everything is fine. However when I currently look at the VM using the ps command and sort in descending RSS size I see
RSS %MEM VSZ PID CMD
708768 35.4 2536124 29568 java -Xms64m -Xmx512m ...
542776 27.1 2340996 12934 java -Xms64m -Xmx384m ...
387336 19.3 2542336 6788 java -Xms64m -Xmx512m ...
12128 0.6 288120 1239 /usr/lib/snapd/snapd
4564 0.2 21476 27132 -bash
3524 0.1 5724 1235 /sbin/iscsid
3184 0.1 37928 1 /sbin/init
3032 0.1 27772 28829 ps ax -o rss,pmem,vsz,pid,cmd --sort -rss
3020 0.1 652988 1308 /usr/bin/lxcfs /var/lib/lxcfs/
2936 0.1 274596 1237 /usr/lib/accountsservice/accounts-daemon
..
..
and the free command shows
total used free shared buff/cache available
Mem: 1952 1657 80 20 213 41
Swap: 0 0 0
Taking the first process as an example, there is an RSS size of 708768 KB even though the heap limit would be 524288 KB (512*1024).
I am aware that extra memory is used over the JVM heap but the question is how can I control this to ensure I do not run out of memory again ? I am trying to set the heap size for each JVM as large as I can without crashing them.
Or is there a good general guideline as to how to set JVM heap size in relation to overall memory availability ?
There does not appear to be a way of controlling how much extra memory the JVM will use over the heap. However by monitoring the application over a period of time, a good estimate of this amount can be obtained. If the overall consumption of the java process is higher than desired, then the heap size can be reduced. Further monitoring is needed to see if this impacts performance.
Continuing with the example above and using the command ps ax -o rss,pmem,vsz,pid,cmd --sort -rss we see usage as of today is
RSS %MEM VSZ PID CMD
704144 35.2 2536124 29568 java -Xms64m -Xmx512m ...
429504 21.4 2340996 12934 java -Xms64m -Xmx384m ...
367732 18.3 2542336 6788 java -Xms64m -Xmx512m ...
13872 0.6 288120 1239 /usr/lib/snapd/snapd
..
..
These java processes are all running the same application but with different data sets. The first process (29568) has stayed stable using about 190M beyond the heap limit while the second (12934) has reduced from 156M to 35M. The total memory usage of the third has stayed well under the heap size which suggests the heap limit could be reduced.
It would seem that allowing 200MB extra non heap memory per java process here would be more than enough as that gives 600MB leeway total. Subtracting this from 2GB leaves 1400MB so the three -Xmx parameter values combined should be less than this amount.
As will be gleaned from reading the article pointed out in a comment by Fairoz there are many different ways in which the JVM can use non heap memory. One of these that is measurable though is the thread stack size. The default for a JVM can be found on linux using java -XX:+PrintFlagsFinal -version | grep ThreadStackSize In the case above it is 1MB and as there are about 25 threads, we can safely say that at least 25MB extra will always be required.

Not enough space to cache rdd in memory warning

I am running a spark job, and I got Not enough space to cache rdd_128_17000 in memory warning. However, in the attached file, it obviously saying only 90.8 G out of 719.3 G is used. Why is that? Thanks!
15/10/16 02:19:41 WARN storage.MemoryStore: Not enough space to cache rdd_128_17000 in memory! (computed 21.4 GB so far)
15/10/16 02:19:41 INFO storage.MemoryStore: Memory use = 4.1 GB (blocks) + 21.2 GB (scratch space shared across 1 thread(s)) = 25.2 GB. Storage limit = 36.0 GB.
15/10/16 02:19:44 WARN storage.MemoryStore: Not enough space to cache rdd_129_17000 in memory! (computed 9.4 GB so far)
15/10/16 02:19:44 INFO storage.MemoryStore: Memory use = 4.1 GB (blocks) + 30.6 GB (scratch space shared across 1 thread(s)) = 34.6 GB. Storage limit = 36.0 GB.
15/10/16 02:25:37 INFO metrics.MetricsSaver: 1001 MetricsLockFreeSaver 339 comitted 11 matured S3WriteBytes values
15/10/16 02:29:00 INFO s3n.MultipartUploadOutputStream: uploadPart /mnt1/var/lib/hadoop/s3/959a772f-d03a-41fd-bc9d-6d5c5b9812a1-0000 134217728 bytes md5: qkQ8nlvC8COVftXkknPE3A== md5hex: aa443c9e5bc2f023957ed5e49273c4dc
15/10/16 02:38:15 INFO s3n.MultipartUploadOutputStream: uploadPart /mnt/var/lib/hadoop/s3/959a772f-d03a-41fd-bc9d-6d5c5b9812a1-0001 134217728 bytes md5: RgoGg/yJpqzjIvD5DqjCig== md5hex: 460a0683fc89a6ace322f0f90ea8c28a
15/10/16 02:42:20 INFO metrics.MetricsSaver: 2001 MetricsLockFreeSaver 339 comitted 10 matured S3WriteBytes values
This is likely to be caused by the configuration of spark.storage.memoryFraction being too low. Spark will only use this fraction of the allocated memory to cache RDDs.
Try either:
increasing the storage fraction
rdd.persist(StorageLevel.MEMORY_ONLY_SER) to reduce memory usage by serializing the RDD data
rdd.persist(StorageLevel.MEMORY_AND_DISK) to partially persist onto disk if memory limits are reached.
This could be due to the following issue if you're loading lots of avro files:
https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCANx3uAiJqO4qcTXePrUofKhO3N9UbQDJgNQXPYGZ14PWgfG5Aw#mail.gmail.com%3E
With a PR in progress at:
https://github.com/databricks/spark-avro/pull/95
I have a Spark-based batch application (a JAR with main() method, not written by me, I'm not a Spark expert) that I run in local mode without spark-submit, spark-shell, or spark-defaults.conf. When I tried to use IBM JRE (like one of my customers) instead of Oracle JRE (same machine and same data), I started getting those warnings.
Since the memory store is a fraction of the heap (see the page that Jacob suggested in his comment), I checked the heap size: IBM JRE uses a different strategy to decide default heap size and it was too small, so I simply added appropriate -Xms and -Xmx params and the problem disappeared: now the batch works fine both with IBM and Oracle JRE.
My usage scenario is not typical, I know, however I hope this can help someone.

Determine if an allocation via malloc() is backed by a huge page

I understand pretty well how transparent hugepages work, and that any allocation, such as those performed by malloc may be satisfied by a huge page.
What I'd like to know, is if there is any check I can make (possibly heuristic) after an allocation to determine if the memory is backed by a huge page.
You can determine the exact status of any page, including whether it is backed by a transparent (or non-transparent) hugepage by looking up the "pfn" (page frame number) in the /proc/kpageflags file. You get the pfn for a page by reading from the /proc/$PID/pagemap file for your process, which is indexed by virtual address.
Unfortunately, both the pfn value from pagemap1 and the entire /proc/kpageflags file are accessible only to root users. Still if you can run your process as root at least in the testing or benchmarking scenario you are interested in, this works well.
I wrote a small library called page-info which does the relevant parsing for you. Give it a range of memory and it will return you info on each page, including whether it is present in memory, backed by a hugepage, etc.
For example, running the included test process as sudo ./page-info-test THP gives the following output:
PAGE_SIZE = 4096, PID = 18868
size memset FLAG SET UNSET UNAVAIL
0.25 MiB BEFORE THP 0 1 64
0.25 MiB AFTER THP 0 65 0
0.50 MiB BEFORE THP 0 1 128
0.50 MiB AFTER THP 0 129 0
1.00 MiB BEFORE THP 0 1 256
1.00 MiB AFTER THP 0 257 0
2.00 MiB BEFORE THP 0 1 512
2.00 MiB AFTER THP 0 513 0
4.00 MiB BEFORE THP 0 1 1024
4.00 MiB AFTER THP 512 513 0
8.00 MiB BEFORE THP 0 1 2048
8.00 MiB AFTER THP 1536 513 0
16.00 MiB BEFORE THP 0 1 4096
16.00 MiB AFTER THP 3584 513 0
32.00 MiB BEFORE THP 0 1 8192
32.00 MiB AFTER THP 7680 513 0
64.00 MiB BEFORE THP 0 1 16384
64.00 MiB AFTER THP 15872 513 0
128.00 MiB BEFORE THP 0 1 32768
128.00 MiB AFTER THP 32256 513 0
256.00 MiB BEFORE THP 0 1 65536
256.00 MiB AFTER THP 65024 513 0
512.00 MiB BEFORE THP 0 1 131072
512.00 MiB AFTER THP 124416 6657 0
1024.00 MiB BEFORE THP 0 1 262144
1024.00 MiB AFTER THP 0 262145 0
DONE
The UNAVAIL column means that no information about the mapping was available - usually because the page has never been accesses and so isn't yet backed by any page at all. You can see that for these "largeish" allocations only a single page is mapped in following the allocation, since we haven't touched the memory.
The AFTER rows are the same information after calling memset() on the entire allocation, which causes all pages to be physically allocated. Here we can see that no allocations are backed by transparent hugepages until we hit allocations of 4 MiB, at which point the majority of each allocation is backed by THP, except for 513 pages (which turn out to be at the edges of the allocated region). At 512 MiB the system starts running out of available hugepages but still satisfies most of the allocation, but at 1024 MiB the entire allocation is satisfied with small pages.
This library isn't production ready so don't use it for anything critical (e.g., some failures simply call exit()). Contributions welcome.
1 Since kernel 4.0 approximately, before that the pfn was accessible to non-root user processes. From 4.0 to 4.1 or thereabouts, the entire pagemap was off-limits to non-root processes, but since then the file is again available but with the pfn masked out (it will always appear as zero).
There is a difference between traditional hugepages and transparent huge pages (THP). In the case of THP's, the application can use huge pages without any developer support (mmap, shmget, etc) or sys-admin intervention.
In the code, I am afraid there may be no straight forward way check this. However, if you know the sizeof() allocated data structure or buffers, it worth grepping and checking the THP usage on the system using the following command. This usage should increase while running your application:
# grep AnonHugePages /proc/meminfo
AnonHugePages: 2648064 kB

Why my systems doesnt free the buffers/cache

I have a memory hungry application. I let it run over night on a system with 32GB RAM. Also ran free -m -s 20 along with it to see how the memory status changes. My application was the only thing I manually started after restarting my Ubuntu (except the terminal of course). let's look at parts of output:
when the application started:
total used free shared buffers cached
Mem: 32100 1428 30671 35 69 594
-/+ buffers/cache: 765 31335
Swap: 32693 0 32693
before the application ends:
total used free shared buffers cached
Mem: 32100 31860 240 84 2 17420
-/+ buffers/cache: 14437 17663
Swap: 32693 12 32681
right after the application ends:
total used free shared buffers cached
Mem: 32100 18723 13376 84 2 17434
-/+ buffers/cache: 1285 30814
Swap: 32693 12 32681
and the status remain same for many hours until I came back in the morning.
My question is:
Why most of my memory is still considered a free part of buffers/cache ? when is this part of memory going to be the free part of the overall Mem: again?
I then opened a browser, an IDE and some other GUI application to see how and from where the memory is allocated to the new applications:
total used free shared buffers cached
Mem: 32100 20378 11721 88 160 18075
-/+ buffers/cache: 2143 29956
Swap: 32693 12 32681
Apparently, Free memory from both Mem: as well as buffers/cache: was allocated to new applications. Can you please interpret this for me?
The cached data is also part of the used memory. After a program ends, the data loaded from disk by your program will become a part of the cache. So your system will not free any of the data, but it drops the cached data, if your memory will be needed again. A more or less funny page, that describes the problem.

What information does "Top" output while using MPI

I am trying to figure out how much memory my program which uses MPI needs. It was suggested to use the function "top" to obtain the usage of memory. However, I am unclear on what the information means.
I wish to know how to estimate the system memory and how much it utilizes?
top - 13:52:41 up 208 days, 19:50, 1 user, load average: 0.68, 0.15, 0.05
Tasks: 86 total, 6 running, 80 sleeping, 0 stopped, 0 zombie
Cpu(s): 98.5% us, 0.6% sy, 0.0% ni, 0.8% id, 0.0% wa, 0.0% hi, 0.1% si
Mem: 1024708k total, 225144k used, 799564k free, 104232k buffers
Swap: 0k total, 0k used, 0k free, 37276k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12052 amohan 16 0 9024 4756 5504 R 99.0 0.5 0:09.65 greet
12054 amohan 16 0 9024 4756 5504 R 99.0 0.5 0:09.64 greet
12055 amohan 16 0 9024 4752 5504 R 98.7 0.5 0:09.65 greet
12053 amohan 16 0 9024 4760 5504 R 98.7 0.5 0:09.63 greet
This question is related to an earlier post Fatal Error in MPI_Irecv: Aborting Job
The standard information displayed by top is, in order:
The Process ID
The owning username
The kernel-assigned process priority (higher is "lower" priority)
The "niceness" of the process (higher is "nicer" to other processes and gives the process a "lower" priority)
The virtual memory allocated for the process in KiB
The resident memory in use by the process (the memory that has been malloc()ed and is not swapped out) in KiB
The shared memory (the memory that is accessible to both the running process, its sibling processes, and any other processes that have been granted access) accessible to the process in KiB
The run state (R is running, Z is zombie, S is sleeping, etc.)
The % of CPU in use (obviously)
The % of memory in use relative to total system memory
The cumulative time that the process has been in a Run state
More detailed information should be available in man top.
In particular, the memory that MPI likely uses is contained in the shared memory area. More detailed information can be pulled from the /proc/ directory, but I don't know the specifics off the top of my head.