Too many brk() calls noticed in strace - C++

I have a C++ service and am trying to debug the cause of its high startup time. In the strace logs I notice a lot of brk() calls (which contribute about 300 ms, which is very high for our SLA). On studying further, I see that brk() is a memory management call that controls the amount of memory allocated to the data segment of the process. malloc() can use brk() (for small allocations) or mmap() (for big allocations), depending on the situation.
Logs from strace:
891905 1674977609.119549 mmap2(NULL, 163840, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf6a01000 <0.000051>
891905 1674977609.119963 brk(0x209be000) = 0x209be000 <0.000025>
891905 1674977609.120495 brk(0x209df000) = 0x209df000 <0.000022>
891905 1674977609.121167 brk(0x20a00000) = 0x20a00000 <0.000041>
891905 1674977609.121776 brk(0x20a21000) = 0x20a21000 <0.000024>
891905 1674977609.122327 brk(0x20a42000) = 0x20a42000 <0.000024>
...
891905 1674977609.427039 brk(0x2432b000) = 0x2432b000 <0.000064>
891905 1674977609.427827 brk(0x2434c000) = 0x2434c000 <0.000069>
891905 1674977609.428695 brk(0x2436d000) = 0x2436d000 <0.000050>
I believe there are far too many brk() calls, which leaves scope for improving efficiency. I am trying to play around with tunables such as M_TRIM_THRESHOLD and M_MMAP_THRESHOLD to reduce the number of brk() calls, but I don't notice any change.
https://www.linuxjournal.com/files/linuxjournal.com/linuxjournal/articles/063/6390/6390s2.html
For instance, this is one of the ways I tried to make the changes before restarting the service. The service runs as a Docker container, so I tried making the changes both inside the container and on the host, but I don't notice any change.
export MALLOC_MMAP_THRESHOLD_=131072
export MALLOC_TRIM_THRESHOLD_=131072
export MALLOC_MMAP_MAX_=65536
Am I trying to hit the wrong target, or is my understanding that a lot of fragmentation is happening wrong? Any help is appreciated.
Thanks
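One way to check how much of the 300 ms is really spent inside brk() is to add up the per-call durations that strace prints in angle brackets and compare them with the wall-clock span between the first and last call. The Python sketch below does that for lines in the format shown in the log excerpt; the function name summarize and the regular expression are illustrative assumptions, not part of any strace tooling.

```python
import re

# Matches strace lines in the format shown in the log excerpt, e.g.
# "891905 1674977609.119963 brk(0x209be000) = 0x209be000 <0.000025>"
LINE_RE = re.compile(
    r"^\s*(?P<pid>\d+)\s+(?P<ts>\d+\.\d+)\s+(?P<name>\w+)\(.*\)"
    r"\s*=\s*\S+.*?<(?P<dur>\d+\.\d+)>"
)


def summarize(lines, syscall="brk"):
    """Return (call count, seconds spent inside the syscall, wall-clock span).

    The span is the difference between the first and last timestamp of the
    selected syscall, so it also includes time spent between the calls.
    """
    stamps = []
    inside = 0.0
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("name") == syscall:
            stamps.append(float(match.group("ts")))
            inside += float(match.group("dur"))
    span = stamps[-1] - stamps[0] if stamps else 0.0
    return len(stamps), inside, span
```

Calling summarize(open("strace.log")) on the full log (the file name is a placeholder) gives the two numbers to compare. In the excerpt shown, each brk() takes roughly 20 to 70 microseconds while consecutive calls are about 0.5 to 0.9 ms apart, which suggests that most of the elapsed time may lie between the calls rather than inside them. If the aim is still fewer brk() calls, glibc's M_TOP_PAD tunable (environment variable MALLOC_TOP_PAD_) sets how much extra heap is requested each time the break is moved; whether it helps here is untested.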

Related

C++ _popen() windows leaks paged pool memory

The main application runs as a Windows service, and that process starts other C++ console processes with their consoles hidden, i.e. the parent process is a Windows service and the child processes are non-console applications.
We observed that the system's paged pool memory increases during calls to _popen() on the customer's Windows Server 2016 system. The application runs clean on our lab system with the same OS.
Using the Windows Performance tool xperf, we captured the logs and checked the call stack.
A picture is attached for reference.
void CMachine::GetJavaVersion()
{
    m_stJavaVersion.m_strName = " Java version";
    CPUChar strVersion[64] = { 0 };
    BOOL bFound = CheckJREVersion(strVersion, 64);
    BYTE bytColorSt = RED;
    string strRemark;
    FILE *fp = NULL;
    char version[130] = { 0 };
    BOOL bFoundVersion = FALSE;
    fp = _popen("java -version 2>&1", "r");
    while (fp && fgets(version, sizeof version, fp))
    {
        string strTmp = version;
        if (strTmp.find("version") != string::npos)
        {
            bFoundVersion = TRUE;
            break;
        }
    }
    if (fp) _pclose(fp);
....
PoolMon trace
Memory:33401164K Avail:30057324K PageFlts: 92362 InRam Krnl:20212K P:776328K
Commit:3228052K Limit:37595468K Peak:4747992K Pool N:182820K P:782568K
System pool information
Tag   Type     Allocs            Frees             Diff      Bytes               Per Alloc
Toke  Paged  10546816 ( 390)   10319712 ( 382)   227104  324868080 (11392)        1430
CM31  Paged     42886 (   0)      20849 (   0)    22037  101154816 (    0)        4590
SeAt  Paged  44678436 (1662)   43769798 (1630)   908638   87253680 ( 3072)          96
QINi  Paged       234 (   0)          1 (   0)      233   60293216 (    0)      258769
MmSt  Paged   2683066 (  79)    2670922 (  83)    12144   27223856 ( 3312)        2241
Eric Lippert writes about benchmark mistakes. I think mistake #1 applies to your case:
Mistake #1: Choosing a bad metric.
Why do you measure "paged pool" to determine a memory leak?
Paged memory is the memory that is swapped out to disk. This happens because the physical RAM is needed for something else. What is the physical RAM needed for? Probably for running the process that you start.
Once the memory is swapped to disk, it may take a while until it is swapped back to RAM. That will happen just when some other application tries to access the memory - and that may be minutes, if ever.
I also tend to say that memory isn't leaked during a method call but after a method call. After the method call, all variables should be destroyed and the related resources should be released.
If you are told that the paged pool is the cause, then ask for proof.
On my Windows 10 system, the paged pool limit is 17 GB. This can be shown by Process Explorer in View/System Information with Symbols configured.
If you're running java -version so often that it leaks 17 GB of kernel memory, then something is seriously wrong. Of course there will be a pipe or something to redirect the output from Java to your application so you can read the stream. There will also be other kernel objects like a process, a thread etc.
Even with 1 kB of kernel memory leak for each call, you would need to call that 17 million times to exhaust the paged pool. If that's the case, maybe you should consider caching the result anyway. It should be unlikely that server admins install and uninstall Java 17 million times in a few days.
For monitoring the paged pool, you can try Poolmon with /p /P command line parameters. Poolmon is part of the WDK.
Problems in your code:
Your code has at least 2 problems:
If "version" never appears in the output, the loop only ends when fgets() returns NULL at end of stream, so it keeps waiting for as long as the child process keeps its output open. How could that happen? It's unlikely, but if I rename my HelloWorld.exe to java.exe, it could.
If "version" appears in the output but "ver" happens to land in the first buffer and "sion" in the second, you'll never find out that it actually was there, and bFoundVersion stays FALSE.
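The caching suggestion and the line-splitting concern can both be illustrated with a short Python sketch (Python rather than the C++ of the question, so it is an analogue and not a drop-in fix). It runs the command once per distinct command line, reads whole lines so that "version" cannot be split across buffers, and remembers the result. The function name find_version_line and the 30-second timeout are assumptions made for the illustration.

```python
import functools
import subprocess


@functools.lru_cache(maxsize=None)
def find_version_line(command):
    """Run `command` once and return the first output line containing "version".

    `command` must be a tuple of strings so that it is hashable for the cache.
    Returns None when no line matches or the program cannot be started; that
    None result is cached as well.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # same effect as "2>&1" in the original
            text=True,
            timeout=30,                # assumed upper bound, avoids waiting forever
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # splitlines() yields complete lines, so the keyword is never cut in half.
    for line in result.stdout.splitlines():
        if "version" in line:
            return line.strip()
    return None
```

A call such as find_version_line(("java", "-version")) would start the child process the first time only; later calls with the same tuple return the cached line without creating a new process, pipe, or other kernel objects.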

Google Cloud Filestore in status REPAIRING blocks *everything*

We are using Google's Filestore cloud service for sharing files between our GCE VMs. Randomly, all processes seem to hang, especially interactive SSH sessions, and after some investigation we have determined that our Filestore, universally mounted across all VMs, was being repaired and was blocking all processes that tried to get any information on it.
I was able to log in as root and investigate. I noticed that all my interactive activity would hang, and eventually I pinpointed it to trying to stat the mountpoint of the Filestore instance. Running strace df would hang like this:
statfs("/sys/kernel/config", {f_type=0x62656570, f_bsize=4096, f_blocks=0, f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={0, 0}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_RELATIME}) = 0
stat("/sys/kernel/config", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
statfs("/sys/fs/selinux", {f_type=SELINUX_MAGIC, f_bsize=4096, f_blocks=0, f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={0, 0}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_RELATIME}) = 0
stat("/sys/fs/selinux", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
statfs("/proc/sys/fs/binfmt_misc", {f_type=BINFMTFS_MAGIC, f_bsize=4096, f_blocks=0, f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={0, 0}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_RELATIME}) = 0
stat("/proc/sys/fs/binfmt_misc", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
statfs("/dev/hugepages", {f_type=HUGETLBFS_MAGIC, f_bsize=2097152, f_blocks=0, f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={0, 0}, f_namelen=255, f_frsize=2097152, f_flags=ST_VALID|ST_RELATIME}) = 0
stat("/dev/hugepages", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
statfs("/mnt/local-storage", {f_type=0x58465342, f_bsize=4096, f_blocks=131007745, f_bfree=86129973, f_bavail=86129973, f_files=262143488, f_ffree=262141571, f_fsid={2065, 0}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_RELATIME}) = 0
stat("/mnt/local-extra", {st_mode=S_IFDIR|0755, st_size=75, ...}) = 0
statfs("/mnt/shared-storage" ***HANG***
There was apparently no remedy except for waiting for the repair operation to complete. gcloud filestore operations list was showing no operations ongoing during that time. But gcloud filestore instances list would show the REPAIRING state like this:
[root@vm ~]# gcloud filestore instances list
INSTANCE_NAME   ZONE            TIER      CAPACITY_GB  FILE_SHARE_NAME  IP_ADDRESS   STATE      CREATE_TIME
shared-storage  europe-west1-b  STANDARD  1024         shared_storage   **.**.**.**  REPAIRING  2019-08-09T16:03:02
Google Cloud Status Dashboard never showed any issue at or around the time.
Does anybody know why this happens and how to prevent it from happening, if possible? As shown in the output above, we are using the standard tier of Filestore.
We had configured core dumps to be written to the share from two dozen VMs. When a mass death of our processes occurs, it seems that we reach the throughput limit of the share (standard tier), which causes the share to enter the REPAIRING state, in turn blocking everything that tries to access it.
If you have a similar problem: check if it's possible that somehow you're reaching the throughput limit on your share.

Windows shared memory access time slow

I am currently using shared memory with two mapped files (1.9 GBytes for the first one and 600 MBytes for the second) in a piece of software.
A process reads data from the first file, processes the data and writes the results to the second file.
I have sometimes noticed a strong delay (for reasons I do not understand) when reading from or writing to the mapped view with the memcpy function.
Mapped files are created this way :
m_hFile = ::CreateFileW(SensorFileName,
                        GENERIC_READ | GENERIC_WRITE,
                        0,
                        NULL,
                        CREATE_ALWAYS,
                        FILE_ATTRIBUTE_NORMAL,
                        NULL);
m_hMappedFile = CreateFileMapping(m_hFile,
                                  NULL,
                                  PAGE_READWRITE,
                                  dwFileMapSizeHigh,
                                  dwFileMapSizeLow,
                                  NULL);
And memory mapping is done this way :
m_lpMapView = MapViewOfFile(m_hMappedFile,
                            FILE_MAP_ALL_ACCESS,
                            dwOffsetHigh,
                            dwOffsetLow,
                            m_i64ViewSize);
The dwOffsetHigh/dwOffsetLow are "matching" granularity from the system info.
The process is reading about 300KB * N times, storing that in a buffer, processing and then writing 300KB * N times the processed contents of the previous buffer to the second file.
I have two different memory views (created/moved with MapViewOfFile function) with a size of 10 MBytes as default size.
For the memory view size, I tested 10 kB, 100 kB, 1 MB, 10 MB and 100 MB. Statistically there is no difference: 80% of the time the reading step behaves as described below (~200 ms), but the writing step is really slow.
Normally :
1/ Reading is done in ~200ms.
2/ Process done in 2.9 seconds.
3/ Writing is done in ~200ms.
I can see that 80% of the time, either reading or writing (in the worst case both) will take between 2 and 10 seconds.
Example: for writing, I am using the code below.
for (unsigned int i = 0 ; i < N ; i++) // N = 500~3k
{
    // Check the position of the memory view for ponderation
    if (###)
        MoveView(iOffset);
    if (m_lpMapView)
    {
        memcpy((BYTE*)m_lpMapView + iOffset, pANNHeader, uiANNStatus);
        // uiSize = ~300 kBytes
        memcpy((BYTE*)m_lpMapView + iTemp, pLine[i], uiSize);
    }
    else
        return uiANNStatus;
}
After using the GetTickCount function to pinpoint where the delay is, I see that the second memcpy call is always the one taking most of the time.
So far, I am seeing N calls to memcpy (for the test, I used N = 500) taking 10 seconds in the worst case when using those shared memories.
I wrote a small test program that performed the same number of memcpy calls on the same amount of data and could not reproduce the problem.
For tests, I used the following conditions, they all show the same delay :
1/ I can see this on various computers, 32 or 64 bits from windows 7 to windows 10.
2/ Using the main thread or multi-threads (up to 8 with critical sections for synchronization purpose) for reading/writing.
3/ OS on SATA or SSD, memory mapped files of the software physically on a SATA or SSD hard-disk, and if on external hard-disk, tests were done through USB1, USB2 or USB3.
I am kindly asking what you think my mistake is that makes memcpy slow.
Best regards.
I found a solution that works for me but might not for others.
Following Thomas Matthews' comments, I checked the MSDN and found two interesting functions, FlushViewOfFile and FlushFileBuffers (but couldn't find anything interesting about locking memory).
Calling both after the for loop forces an update of the mapped file.
I no longer have the "random" delay, but instead of the expected 200 ms I have an average of 400 ms, which is enough for my application.
After doing some tests I saw that calling them too often causes heavy hard-disk access and makes the delay worse (10 seconds for every for loop), so the flush should be used carefully.
Thanks.
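The "flush once after the loop" pattern can be sketched in Python as an analogue of the Win32 calls (this is not the original C++ code). As far as I know, mmap.flush() maps to FlushViewOfFile on Windows and os.fsync() ends up in FlushFileBuffers there; the function name write_chunks and its parameters are made up for the illustration.

```python
import mmap
import os
import tempfile


def write_chunks(path, chunks, chunk_size):
    """Copy equally sized chunks into a file-backed mapping, flushing once.

    Every element of `chunks` must be exactly `chunk_size` bytes long.
    Returns the number of bytes written.
    """
    total = len(chunks) * chunk_size
    if total == 0:
        return 0  # an empty file cannot be mapped
    with open(path, "w+b") as f:
        f.truncate(total)  # size the file before mapping it
        with mmap.mmap(f.fileno(), total) as view:
            for i, chunk in enumerate(chunks):
                offset = i * chunk_size
                view[offset:offset + chunk_size] = chunk  # plays the role of memcpy
            view.flush()  # one flush after the loop, not one per chunk
        os.fsync(f.fileno())  # ask the OS to push the file buffers to disk
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = [bytes([i % 256]) * 300_000 for i in range(10)]
        print(write_chunks(os.path.join(tmp, "out.bin"), data, 300_000))
```

Moving view.flush() inside the loop would correspond to the "calling those too often" case described above, where every chunk forces disk access.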

Python script terminated by SIGKILL rather than throwing MemoryError

Update Again
I have tried to create some simple way to reproduce this, but have not been successful.
So far, I have tried various simple array allocations and manipulations, but they all throw a MemoryError rather than crashing with SIGKILL.
For example:
x = np.asarray(range(999999999))
or:
x = np.empty([100,100,100,100,7])
just throw MemoryErrors as they should.
I hope to have a simple way to recreate this at some point.
End Update
I have a python script running numpy/scipy and some custom C extensions.
On my Ubuntu 14.04 under Virtual Box, it runs to completion just fine.
On an Amazon EC2 T2 micro instance, it terminates (after running a while) with the output:
Killed
Running under the python debugger, the signal is not caught and the debugger exits as well.
Running under strace, I get:
munmap(0x7fa5b7fa6000, 67112960) = 0
mmap(NULL, 67112960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa5b7fa6000
mmap(NULL, 67112960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa5affa4000
mmap(NULL, 67112960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa5abfa3000
mmap(NULL, 67637248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa5a7f22000
mmap(NULL, 67637248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa5a3ea1000
mmap(NULL, 67637248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa59fe20000
gettimeofday({1406518336, 306209}, NULL) = 0
gettimeofday({1406518336, 580022}, NULL) = 0
+++ killed by SIGKILL +++
Running under gdb while trying to catch "SIGKILL", I get:
[Thread 0x7fffe7148700 (LWP 28022) exited]
Program terminated with signal SIGKILL, Killed.
The program no longer exists.
(gdb) where
No stack.
Running Python's trace module (python -m trace --trace ), I get:
defmatrix.py(292): if (isinstance(obj, matrix) and obj._getitem): return
defmatrix.py(293): ndim = self.ndim
defmatrix.py(294): if (ndim == 2):
defmatrix.py(295): return
defmatrix.py(336): return out
--- modulename: linalg, funcname: norm
linalg.py(2052): x = asarray(x)
--- modulename: numeric, funcname: asarray
numeric.py(460): return array(a, dtype, copy=False, order=order)
I can't think of anything else at the moment to figure out what is going on.
I suspect maybe it might be running out of memory (it is an AWS Micro instance), but I can't figure out how to confirm or deny that.
Is there another tool I could use that might help pinpoint exactly where the program is stopping? (Or am I running one of the above tools the wrong way for this problem?)
Update
The Amazon EC2 T2 micro instance has no swap space defined by default, so I added a 4GB swap file and was able to run the program to completion.
However, I am still very interested in a way to run the program such that it terminates with a message a little closer to "Not Enough Memory" rather than "Killed".
If anyone has any suggestions, they would be appreciated.
It sounds like you've run into the dreaded Linux OOM Killer. When the system completely runs out of memory and the kernel absolutely needs to allocate memory, it kills a process rather than crashing the entire system.
Look in the syslog for confirmation of this. A line similar to:
kernel: [884145.344240] mysqld invoked oom-killer:
followed sometime later with:
kernel: [884145.344399] Out of memory: Kill process 3318
should be present (in this example, the process mentioned is mysqld).
You can add these lines to your /etc/sysctl.conf file to effectively disable the OOM killer:
vm.overcommit_memory = 2
vm.overcommit_ratio = 100
And then reboot. Now the original, memory-hungry process should fail to allocate memory and, hopefully, throw the proper exception.
Setting overcommit_memory to 2 means that Linux won't overcommit memory, so memory allocations will fail if there isn't enough memory for them. See this answer for details on what effect the overcommit_ratio has: https://serverfault.com/a/510857
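A per-process alternative, which is not part of the answer above and is offered only as a sketch, is to cap the address space of the Python process with resource.setrlimit(RLIMIT_AS). Allocations beyond the cap then fail inside the process, which Python reports as MemoryError, instead of the kernel killing it later. The helper names cap_address_space and run_capped and the child snippet are illustrative assumptions. RLIMIT_AS limits virtual address space rather than resident memory, and it is only known to be enforced on Linux.

```python
import resource
import subprocess
import sys

# Code run in the child: try one allocation and report what happened.
CHILD_CODE = (
    "import sys\n"
    "try:\n"
    "    bytearray(int(sys.argv[1]))\n"
    "    print('ok')\n"
    "except MemoryError:\n"
    "    print('MemoryError')\n"
)


def cap_address_space(max_bytes):
    """Lower the soft RLIMIT_AS of the current process (Unix only)."""
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        max_bytes = min(max_bytes, hard)  # the soft limit may not exceed the hard one
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))


def run_capped(max_bytes, request_bytes):
    """Start a child interpreter under the cap and let it try one allocation."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, str(request_bytes)],
        preexec_fn=lambda: cap_address_space(max_bytes),  # runs in the child before exec
        stdout=subprocess.PIPE,
        text=True,
        timeout=60,
    )
    return result.stdout.strip()
```

In a real script, cap_address_space() would simply be called once at startup with a value a little below the instance's RAM. C extensions still have to check their own allocation results for this to surface as a Python exception.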

Memory access time slow with VirtualAllocExNuma on Windows 7/64

Our application runs on a dual-Xeon server with 12 GB of memory local to each processor and a memory bus connecting the two Xeons. For performance reasons, we want to control where we allocate a large (>6 GB) block of memory. Below is simplified code:
DWORD processorNumber = GetCurrentProcessorNumber();
UCHAR nodeNumber = 255;
GetNumaProcessorNode((UCHAR)processorNumber, &nodeNumber);
// Get the amount of physical memory available on the node.
ULONGLONG availableMemory = MAXLONGLONG;
GetNumaAvailableMemoryNode(nodeNumber, &availableMemory);
// Make sure that we don't request too much. Initial limit will be 75% of available memory.
_allocateAmt = qMin(requestedMemory, availableMemory * 3 / 4);
// Allocate the cached memory region now.
HANDLE handle = (HANDLE)GetCurrentProcess();
cacheObject = (char*) VirtualAllocExNuma(handle, 0, _allocateAmt,
                                         MEM_COMMIT | MEM_RESERVE,
                                         PAGE_READWRITE | PAGE_NOCACHE, nodeNumber);
The code as is, works correctly using VS2008 on Win 7/64.
In our application this block of memory functions as a cache store for static objects (1-2 MB each) that are normally stored on the hard drive. My problem is that when we transfer data into the cache area using memcpy, it takes more than 10 times longer than if we allocate the memory using new char[xxxx], with no other code changes.
We are at a loss to understand why this is happening. Any suggestions as to where to look?
PAGE_NOCACHE is murder on perf, it disables the CPU cache. Was that intentional?