gstreamer initialization (gst_init) increases memory usage by 47 MB - c++

We've been using (lib)gstreamer to do some RTSP streaming stuff. So far, so good. It works and we got it all up and running. However, I noticed that after gst_init is called, memory usage increases a lot. I reproduced this by making a very simple program that does nothing else.
So, before the call to gst_init, the memory usage is around 6 MB, and then right after it, it's like 53 MB.
Does anyone know what is causing this large memory usage increase and how we might prevent it?
I've checked which extra modules (libraries) are being loaded from gst_init and that sums up to like 5 MB, so that can't be the issue. I also checked the gstreamer source code, but couldn't really find what was causing the issue.
The memory usage is just too much.
EDIT:
Because someone asked: it's going to run on a security system. The hardware is usually old and slow. By slow I mean usually still running XP, 2-4 GB of memory, 32-bit, and not even an i3. It runs 24h a day and there are more applications running on the same system that are using the limited amount of memory. The less memory an application uses, the more we have for others.

Related

I cannot create a memory leak

I am currently testing out memory leaks in C++ and found out that I cannot get the ram to go close to 99%. I mainly just wanted to test it and I remember once I had managed to throttle the computer before closing the program. However, I am currently doing
while (true)
{
int* x = new int[100];
}
It takes a while to accumulate memory but I also noticed the program closes on its own. In VS it gives me a std::bad_alloc exception. So I went and tested with release. And there also, the program accumulates around 2 gigs (I have 32 gigabytes and 40% was currently used by other processes) before closing on its own. I remember I once managed to get it to close to 90% usage but I cannot seem to get there anymore before it closes on its own. I checked via task manager and the debug window in Visual Studio.
I am wondering, is this the operating system that figures out that the program is allocating cluelessly and shuts it off, or what exactly is making it close on its own? Shouldn't the program just run? My PC doesn't freeze up or anything; as mentioned, it slowly approaches 2 gigabytes before it shuts off on its own. I was running the release config.
Please do let me know the process behind all this. Yes, I know it's kind of stupid, but I am learning and was just curious how fast it would hit 99% RAM usage, but it doesn't do it.
TIA.
The symptoms you describe are what you'd expect from a 32-bit program on Windows: a maximum memory usage of 2 GB. You can raise that limit somewhat by marking the executable as large-address-aware (the /LARGEADDRESSAWARE linker option).
If you like to know more, you can read: https://superuser.com/questions/1163749/why-do-32-bit-processes-have-a-2-gb-ram-limit
The best to do is to change your compiler flags to create a 64 bit executable, see https://learn.microsoft.com/en-us/cpp/build/how-to-configure-visual-cpp-projects-to-target-64-bit-platforms
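To make the answer above concrete, here is a minimal sketch of what the loop in the question is probably running into. When operator new cannot get more address space it throws std::bad_alloc, and if nothing catches it the program terminates, which looks like "closing on its own". The function names allocateBlocks and pointerBits are hypothetical helpers made up for this illustration, and the cap on the number of blocks is only there so the sketch is safe to run.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <memory>
#include <new>
#include <utility>
#include <vector>

// Hypothetical helper: allocates blocks of blockBytes until either maxBlocks
// is reached or the allocator fails. Returns how many blocks were obtained.
// The blocks are owned by the vector, so they are freed on return.
std::size_t allocateBlocks(std::size_t blockBytes, std::size_t maxBlocks) {
    assert(blockBytes > 0);
    std::vector<std::unique_ptr<char[]>> blocks;
    try {
        while (blocks.size() < maxBlocks) {
            std::unique_ptr<char[]> block(new char[blockBytes]);
            blocks.push_back(std::move(block));
        }
    } catch (const std::bad_alloc&) {
        // Allocation failed. Uncaught, this exception would end the program.
    }
    return blocks.size();
}

// Pointer width of the current build: 32 for an x86 build, 64 for x64.
constexpr std::size_t pointerBits() {
    return sizeof(void*) * CHAR_BIT;
}
```

Printing pointerBits() is a quick way to confirm whether the release configuration is really a 32-bit build. If it reports 32, stopping at about 2 GB is the address space limit, not the OS punishing the program.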

Extremely slow ffmpeg/sws_scale() - only on heavy duty

I am writing a video player using ffmpeg (Windows only, Visual Studio 2015, 64 bit compile).
With common videos (up to 4K @ 30FPS), it works pretty well. But with my maximum target - 4K @ 60FPS, it fails. Decoding still is fast enough, but when it comes to YUV/BGRA conversion it is simply not fast enough, even though it's done in 16 threads (one thread per frame on a 16/32 core machine).
So as a first countermeasure I skipped the conversion of some frames and got a stable frame rate of ~40 that way. Comparing the two versions in Concurrency Visualizer, I found a strange issue I don't know the reason of.
Here's an image of the frameskip version:
You see that the conversion is pretty quick (roughly 35ms on average).
Thus, as multiple threads are used, it also should be quick enough for 60FPS, but it isn't!
The image of the non-frameskip version shows why:
The conversion of a single frame has become ten times slower than before (roughly 350ms on average). Now a heavy workload on many cores would of course cause a minor slowdown per core due to reduced turbo - let's say 10 or 20%. But never an extreme slowdown of ~1000%.
An interesting detail is that the stack trace of the non-frameskip version shows some system activity I don't really understand - beginning with ntoskrnl.exe!KiPageFault+0x373. There are no exceptions, other error messages or such - it just becomes extremely slow.
Edit: A colleague just told me that this looks like a memory problem with paged-out memory at first glance - but my memory utilization is low (below 1GB, and more than 20GB free)
Can anyone tell me what could be causing this?
This is probably too old to be useful, but just for the record:
What's probably happening is that you're allocating 4K frames over and over again in multiple threads. The Windows allocator really doesn't like that access pattern.
The malloc itself will not show up in the profiler, because the OS only fetches the pages when the memory is actually accessed. This shows up as ntoskrnl.exe!KiPageFault and gets attributed to the function that first accesses the new memory.
Solutions include:
Using a different allocator (e.g. tbb_malloc, mimalloc, etc.)
Using your own per-thread or per process frame pool. ffmpeg does something similar internally, maybe you can just use that.
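The frame pool idea could be sketched roughly as follows. This is only an illustration under assumptions: FramePool and its methods are hypothetical names, not part of ffmpeg, and a real player would size the buffers from the decoded frame format. The point is that a recycled buffer has already been touched once, so reusing it avoids the fresh page faults that a new allocation per frame causes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical helper (not an ffmpeg API): hands out reusable frame buffers.
class FramePool {
public:
    using Buffer = std::unique_ptr<std::uint8_t[]>;

    explicit FramePool(std::size_t frameBytes) : frameBytes_(frameBytes) {
        assert(frameBytes_ > 0);
    }

    // Returns a recycled buffer if one is idle, otherwise allocates a new one.
    Buffer acquire() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (!idle_.empty()) {
                Buffer buf = std::move(idle_.back());
                idle_.pop_back();
                return buf;
            }
        }
        return Buffer(new std::uint8_t[frameBytes_]);
    }

    // Gives a buffer back to the pool instead of freeing it.
    void release(Buffer buf) {
        assert(buf != nullptr);
        std::lock_guard<std::mutex> lock(mutex_);
        idle_.push_back(std::move(buf));
    }

    // Number of buffers currently waiting to be reused.
    std::size_t idleCount() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return idle_.size();
    }

private:
    std::size_t frameBytes_;
    mutable std::mutex mutex_;
    std::vector<Buffer> idle_;
};
```

Each conversion thread would call acquire() before converting a frame and release() once the frame has been displayed. A single shared pool with a mutex is the simplest variant; a per-thread pool would avoid the lock at the cost of more idle memory.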

32 bit process memory leak on x64 processor

I made a 32 bit C++ program which is always run on x64 machines. A client is saying that running 5 instances of this process is causing all of their 24 GB RAM to be used.
Immediately I would think there was a memory leak but I am unable to reproduce this memory issue.
Doing a bit more research into memory allocations I found Memory Limits for Windows. This tells me that a 32 bit process will not be allowed more than 2 GB of memory by the OS.
Is it at all possible that a 32 bit application on a 64 bit windows will be able to have a memory leak use more than 2 GB?
P.S. Killing the process results in the memory being restored to normal operating levels (about 2 GB).
[EDIT] I have now seen that most of the memory being used is Kernel Memory: Nonpaged. Does this mean that it is some system resource which is being used and not a memory leak?
[UPDATE] The problem is not a driver or memory leak. It seems to be a process handle leak. Something is continuously opening new handles to a file. This was found using perfmon to monitor the process. As a rule of thumb, if a process has more than 2000 to 3000 handles you should investigate, especially if that number is increasing every few seconds.
As stated in Memory Limits for Windows, the limit for a 32-bit process on a 64-bit system is 4 GB with IMAGE_FILE_LARGE_ADDRESS_AWARE set, so your 5 processes could consume 20 GB of memory in total. The flag can be set through the LARGEADDRESSAWARE linker option, which expands the virtual address space.
It is obviously possible, as the client is experiencing it.
(maybe you did expect like some ideas how? You don't provide much info or code, so in a very general way I would suggest the memory allocation may be not in the app itself directly. Maybe the app itself will take only ~1-2GiB, but will also stir the OS to do something stupid, like virtual memory map of file of size of 4+GiB, or other devices lock, where the device driver does something stupid, etc...)
You should profile the memory usage on the target system to have idea how much your code does use. Then you can try to search for the rest of it.
In general, using the /LARGEADDRESSAWARE:ON linker switch can allow a 32-bit application to use more than 2GB. Using the Address Windowing Extensions can also allow using more memory. But if you aren't using any of these techniques in your application, it should be limited to the 2GB range. However, since the upper 2GB range is used for system resources, maybe you are leaking system resources?

Check available memory in the system for new allocations

I'm working in a Windows C++ application to work with point clouds. We use the PCL library along with Qt and OpenSceneGraph. The computer has 4 GB of RAM.
If we load a lot of points (for example, 40 point clouds have around 800 million points in total) the system goes crazy.
The app is almost unresponsive (it takes ages to move the mouse around it and the arrow changes to a circle that keeps spinning) and in the task manager, in the Performance tab, I got this output:
Memory (1 in the picture): goes up to 3,97 GB, almost the total of the system.
Free (2 in the picture): 0
I have checked these posts: here and here, and with the MEMORYSTATUSEX version I got the memory info.
The idea here is, before loading more clouds, check the memory available. If the "weight" of the cloud that we're going to load is bigger than the available memory, don't load it, so the app won't freeze and the user has the chance to remove older clouds to free some memory. It's worth noting that no exceptions are thrown; the worst scenario I got was that Windows killed the app itself when the memory was insufficient.
Now, is this a good idea? Is there a canonical way to deal with this thing?
I would be glad to hear your thoughts on this matter.
Your view is from a different direction from the usual approach to similar problems.
Normally, one would probably allocate then attempt to lock in physical memory the space they needed. (mlock() in POSIX, VirtualLock() in WinAPI). The reasoning is that even if the system has enough available physical memory at the moment, some other process could spawn the next moment and push part of your resident set into swap.
This will require you to use a custom allocator as well as ensure that your process has permission to lock down the required number of pages.
Read here for a start on this: http://msdn.microsoft.com/en-us/library/windows/desktop/aa366895(v=vs.85).aspx
You are also likely running into memory issues with your graphics card even once the points are loaded. You should probably monitor that as well. Once your loaded point clouds exceed your dedicated graphics card memory (which they almost certainly do in this case) the rendering slows to a crawl.
800 million is also an immense amount of points. With a minimum 3 floats per point (assuming no colorization) you are talking about 9.6GB of points so you are swapping like crazy.
I generally start voxeling to reduce memory usage once I get beyond 30-40 million points.
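The arithmetic above, combined with the pre-load check from the question, could be sketched like this. It assumes 32-bit floats and nothing stored per point beyond the coordinates. cloudBytes and fitsInMemory are hypothetical names, and the 0.8 headroom factor is an arbitrary assumption, not a recommended value. On Windows the available figure would typically be read from MEMORYSTATUSEX, as the question already does; it is passed in as a plain number here to keep the sketch portable.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for pointCount points that store floatsPerPoint floats each.
constexpr std::uint64_t cloudBytes(std::uint64_t pointCount,
                                   std::uint64_t floatsPerPoint = 3) {
    return pointCount * floatsPerPoint * sizeof(float);
}

// True if neededBytes fits into usableFraction of availableBytes.
// usableFraction is an assumed safety margin, chosen arbitrarily here.
bool fitsInMemory(std::uint64_t neededBytes, std::uint64_t availableBytes,
                  double usableFraction = 0.8) {
    assert(usableFraction > 0.0 && usableFraction <= 1.0);
    return static_cast<double>(neededBytes) <=
           static_cast<double>(availableBytes) * usableFraction;
}
```

With these numbers, cloudBytes(800000000) is 9,600,000,000 bytes, so fitsInMemory reports false even against the full 4 GB of the machine. A check like this only reflects one moment in time, since other processes can take memory right after it runs.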
This is more complicated than you might imagine. The available memory shown in the system display is physical memory. The amount of memory available to your application is virtual memory.
The physical memory is shared by all processes on the computer, including anything else you have running at the same time.
-=-=-=--=-=
I suspect that the problem you are seeing is processing. Using half the memory on a 4GB system should be no big deal.
If you are doing lengthy calculations do you give the system a chance to process accumulated events?
That is what I suspect the real problem is.

Random Complete System Unresponsiveness Running Mathematical Functions

I have a program that loads a file (anywhere from 10MB to 5GB) a chunk at a time (ReadFile), and for each chunk performs a set of mathematical operations (basically calculates the hash).
After calculating the hash, it stores info about the chunk in an STL map (basically <chunkID, hash>) and then writes the chunk itself to another file (WriteFile).
That's all it does. This program will cause certain PCs to choke and die. The mouse begins to stutter, the task manager takes > 2 min to show, ctrl+alt+del is unresponsive, running programs are slow.... the works.
I've done literally everything I can think of to optimize the program, and have triple-checked all objects.
What I've done:
Tried different (less intensive) hashing algorithms.
Switched all allocations to nedmalloc instead of the default new operator
Switched from std::map to unordered_set, found the performance to still be abysmal, so I switched again to Google's dense_hash_map.
Converted all objects to store pointers to objects instead of the objects themselves.
Caching all Read and Write operations. Instead of reading a 16k chunk of the file and performing the math on it, I read 4MB into a buffer and read 16k chunks from there instead. Same for all write operations - they are coalesced into 4MB blocks before being written to disk.
Run extensive profiling with Visual Studio 2010, AMD Code Analyst, and perfmon.
Set the thread priority to THREAD_MODE_BACKGROUND_BEGIN
Set the thread priority to THREAD_PRIORITY_IDLE
Added a Sleep(100) call after every loop.
Even after all this, the application still results in a system-wide hang on certain machines under certain circumstances.
Perfmon and Process Explorer show minimal CPU usage (with the sleep), no constant reads/writes from disk, few hard pagefaults (and only ~30k pagefaults in the lifetime of the application on a 5GB input file), little virtual memory (never more than 150MB), no leaked handles, no memory leaks.
The machines I've tested it on run Windows XP - Windows 7, x86 and x64 versions included. None have less than 2GB RAM, though the problem is always exacerbated under lower memory conditions.
I'm at a loss as to what to do next. I don't know what's causing it - I'm torn between CPU and memory as the culprit. CPU because without the sleep and under different thread priorities the system performance changes noticeably. Memory because there's a huge difference in how often the issue occurs when using unordered_set vs Google's dense_hash_map.
What's really weird? Obviously, the NT kernel design is supposed to prevent this sort of behavior from ever occurring (a user-mode application driving the system to this sort of extreme poor performance!?)..... but when I compile the code and run it on OS X or Linux (it's fairly standard C++ throughout) it performs excellently even on poor machines with little RAM and weaker CPUs.
What am I supposed to do next? How do I know what the hell it is that Windows is doing behind the scenes that's killing system performance, when all the indicators are that the application itself isn't doing anything extreme?
Any advice would be most welcome.
I know you said you had monitored memory usage and that it seems minimal here, but the symptoms sound very much like the OS thrashing like crazy, which would definitely cause general loss of OS responsiveness like you're seeing.
When you run the application on a file say 1/4 to 1/2 the size of available physical memory, does it seem to work better?
What I suspect may be happening is that Windows is "helpfully" caching your disk reads into memory and not giving up that cache memory to your application for use, forcing it to go to swap. Thus, even though swap use is minimal (150MB), it's going in and out constantly as you calculate the hash. This then brings the system to its knees.
Some things to check:
Antivirus software. These often scan files as they're opened to check for viruses. Is your delay occurring before any data is read by the application?
General system performance. Does copying the file using Explorer also show this problem?
Your code. Break it down into the various stages. Write a program that just reads the file, then one that reads and writes the files, then one that just hashes random blocks of RAM (i.e. remove the disk IO part) and see if any particular step is problematic. If you can get a profiler then use this as well to see if there are any slow spots in your code.
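The "hash random blocks of RAM" stage could look something like the sketch below. It is only an illustration: FNV-1a stands in for whatever hash the real program uses, and fnv1a, StageResult and timeHashStage are hypothetical names. No disk IO is involved, so if this stage alone runs fine on a problem machine, the hashing itself is unlikely to be the culprit.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// 64-bit FNV-1a, used here only as a stand-in for the real hash.
std::uint64_t fnv1a(const std::uint8_t* data, std::size_t len) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (std::size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 1099511628211ULL;  // FNV prime
    }
    return h;
}

struct StageResult {
    std::uint64_t checksum;  // combined hash, keeps the work from being optimized away
    double seconds;          // wall-clock time spent hashing
};

// Hashes chunkCount in-memory chunks of chunkBytes each and times the loop.
StageResult timeHashStage(std::size_t chunkBytes, std::size_t chunkCount) {
    assert(chunkBytes > 0);
    std::vector<std::uint8_t> chunk(chunkBytes);
    std::mt19937 rng(12345);  // fixed seed so runs are comparable
    for (std::uint8_t& byte : chunk) {
        byte = static_cast<std::uint8_t>(rng() & 0xFFu);
    }

    const auto start = std::chrono::steady_clock::now();
    std::uint64_t checksum = 0;
    for (std::size_t i = 0; i < chunkCount; ++i) {
        chunk[0] = static_cast<std::uint8_t>(i & 0xFFu);  // vary the input slightly
        checksum ^= fnv1a(chunk.data(), chunk.size());
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return StageResult{checksum, elapsed.count()};
}
```

Calling timeHashStage(16 * 1024, 320000) would cover roughly the same amount of data as a 5GB input in 16k chunks. Comparing its timing and the system's responsiveness against the read-only and read-write variants narrows down which stage triggers the hang.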
EDIT
More ideas. Perhaps your program is holding on to the GDI lock too much. This would explain everything else being slow without high CPU usage. Only one app at a time can have the GDI lock. Is this a GUI app, or just a simple console app?
You also mentioned RtlEnterCriticalSection. This is a costly operation, and can hang the system quite easily, i.e. mismatched Enters and Leaves. Are you multi-threading at all? Is the slow down due to race conditions between threads?
XPerf is your guide here - watch the PDC Video about it, and then take a trace of the misbehaving app. It will tell you exactly what's happening throughout the system, it is extremely powerful.
I like the disk-caching/thrashing suggestions, but if that's not it, here are some scattershot suggestions:
What non-MSVC libraries, if any, are you linking to?
Can your program be modified (#ifdef'd) to run without a GUI? Does the problem occur?
You added ::Sleep(100) after each loop in each thread, right? How many threads are you talking about? A handful or hundreds? How long does each loop take, roughly? What happens if you make that ::Sleep(10000)?
Is your program perhaps doing something else that locks a limited resource? (ProcExp can show you what handles are being acquired ... of course you might have difficulty with ProcExp not responding :-[ )
Are you sure CriticalSections are userland-only? I recall that was so back when I worked on Windows (or so I believed), but Microsoft could have modified that. I don't see any guarantee in the MSDN article Critical Section Objects (http://msdn.microsoft.com/en-us/library/ms682530%28VS.85%29.aspx) ... and this leads me to wonder: Anti-convoy locks in Windows Server 2003 SP1 and Windows Vista
Hmmm... presumably we're all multi-processor now, so are you setting the spin count on the CS?
How about running a debugging version of one of these OSes and monitoring the kernel debugging output (using DbgView)... possibly using the kernel debugger from the Platform SDK ... if MS still calls it that?
I wonder whether VMMap (another SysInternal/MS utility) might help with the Disk caching hypothesis.
It turns out that this is a bug in the Visual Studio compiler. Using a different compiler resolves the issue entirely.
In my case, I installed and used the Intel C++ Compiler and even with all optimizations disabled I did not see the full-system hang that I was experiencing with the Visual Studio 2005 - 2010 compilers on this library.
I'm not certain as to what is causing the compiler to generate such broken code, but it looks like we'll be buying a copy of the Intel compiler.
It sounds like you're poking around fixing things without knowing what the problem is. Take stackshots. They will tell you what your program is doing when the problem occurs. It might not be easy to get the stackshots if the problem occurs on other machines where you cannot use an IDE or a stack sampler. One possibility is to kill the app and get a stack dump when it's acting up. You need to reproduce the problem in an environment where you can get a stack dump.
Added: You say it performs well on OSX and Linux, and poorly on Windows. I assume the ratio of completion time is some fairly large number, like 10 or 100, if you've even had the patience to wait for it. I said this in the comment, but it is a key point. The program is waiting for something, and you need to find out what. It could be any of the things people mentioned, but it is not random.
Every program, all the time while it runs, has a call stack consisting of a hierarchy of call instructions at specific addresses. If at a point in time it is calculating, the last instruction on the stack is a non-call instruction. If it is in I/O the stack may reach into a few levels of library calls that you can't see into. That's OK. Every call instruction on the stack is waiting. It is waiting for the work it requested to finish. If you look at the call stack, and look at where the call instructions are in your code, you will know what your program is waiting for.
Your program, since it is taking so long to complete, is spending nearly all of its time waiting for something to finish, and as I said, that's what you need to find out. Get a stack dump while it's being slow, and it will give you the answer. The chance that it will miss it is 1/the-slowness-ratio.
Sorry to be so elemental about this, but lots of people (and profiler makers) don't get it. They think they have to measure.