Matplotlib: Memory and 'CPU' leak - python-2.7

python: 2.7
Ubuntu: 18.04
matplotlib: 2.2.2
I have a client GUI that gets information from a server and displays it. I see a memory leak and a change in CPU consumption over time. The picture below shows the change in CPU and memory utilization after restarting the client GUI (~25 seconds from the right, aligned with a spike in network traffic).
The CPU graph has a dip in CPU utilization, showing that CPU usage differs before and after the restart of the program.
The memory graph shows a large drop in memory utilization, followed by a slight increase as the same program reinitializes.
The network graph has a spike because the client requests all data from the server for visualization.
I suspect it has something to do with matplotlib. I have 7 figures that I redraw every 3 seconds.
I have added an image of my GUI. The middle 4 graphs are the history charts. I bin all data points into 300 bins, since I have ~300 pixels in that area; the binning is done in a separate thread. The data arrays (2 × 1,000,000 points: time and value) that store the information are preallocated from the very beginning, to make sure I don't have any memory-runaway problem as my datasets grow. I do not expect the datasets to grow beyond that, since a typical experiment runs at 0.1-0.01 Hz, which would take several million seconds to reach the end.
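For reference, a rough sketch of how such a binning step could look (assuming NumPy; the function and array names are illustrative, not my actual code):

import numpy as np

def bin_history(times, values, n_points, n_bins=300):
    # Reduce the first n_points samples to at most n_bins averaged points for plotting.
    t = times[:n_points]
    v = values[:n_points]
    edges = np.linspace(t[0], t[-1], n_bins + 1)
    idx = np.digitize(t, edges[1:-1])          # bin index (0..n_bins-1) for each sample
    binned_t, binned_v = [], []
    for i in range(n_bins):
        mask = idx == i
        if mask.any():                         # skip empty bins
            binned_t.append(t[mask].mean())
            binned_v.append(v[mask].mean())
    return np.array(binned_t), np.array(binned_v)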
Question: If it is Matplotlib, what can I do? If it is not, what else could it be?
added Sept 6 2018:
I thought of adding another example. Here is the screenshot of CPU and memory usage after I closed the GUI. The code ran for ~ 3 days. Python 2.7, Ubuntu 18.04.1.

Thank you, everyone, for helpful comments.
After some struggle, I figured out a way to solve the problem. Unfortunately, I made several changes to my code, so I cannot say definitively what actually helped.
Here is what was done:
All charting is done in a separate thread. The image is saved to an in-memory buffer as a byte stream with io.BytesIO() and later passed to the GUI. This was important for solving another problem (the GUI freezing while charting with matplotlib).
A new figure (figure = Figure(figsize=(7,8), dpi=80)) is created each time the plot is generated. Previously I had been reusing the same figure (self.figure = Figure(figsize=(7,8), dpi=80)).
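A minimal sketch of that approach (Agg backend, run from a worker thread; what actually gets plotted here is illustrative, not the real GUI code):

import io
from matplotlib.figure import Figure
from matplotlib.backends.backend_agg import FigureCanvasAgg

def render_chart(x, y):
    figure = Figure(figsize=(7, 8), dpi=80)   # fresh figure on every redraw
    FigureCanvasAgg(figure)                   # attach a non-interactive canvas
    ax = figure.add_subplot(111)
    ax.plot(x, y)
    buf = io.BytesIO()                        # keep the rendered image in memory
    figure.savefig(buf, format="png")
    buf.seek(0)
    return buf                                # hand the byte stream to the GUI thread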

Related

Why does my program run faster on first launch than on next launches?

I have been working for 2.5 years, in my leisure time, on a personal flight-sim project written in C++ and using OpenGL on a Windows 7 PC.
I recently had to move to Windows 10. The hardware is exactly the same. I reinstalled Code::Blocks.
It turns out that on the first launch of my project after the system starts, performance is OK, similar to what I used to see with Windows 7. But the second, third, and all subsequent launches give me lower performance, with significantly less fluidity in frame rate compared to the first run, detectable by eye. This never happened with Windows 7.
Any time I start my system, the first run is fast and the next ones are slower.
I had a look at the task manager while doing some runs. The first run is handled by one of the 4 cores of my CPU (Core i5-6500) at approximately 85%. For the next runs, the load is spread across the 4 cores. During those slower runs on 4 cores, I tried to modify the affinity and direct my program to only one core, without significant improvement in performance. The selected core was working at full load, though.
My C++ code doesn't explicitly use any threading at this stage. From my modest programmer's point of view, there is only one main thread running in main(). In the task manager, I can see that some 10 to 14 threads are alive when my program runs. I guess (wrongly?) that they are implicitly created by the use of joysticks, TrackIR, or other communication tasks with the GPU...
Could it come from memory not being correctly freed when my program stops? I thought Windows would free it properly, even if I forgot some 'delete' after using 'new'.
Has anyone encountered a similar situation? Any explanation coming to your minds?
Any suggestion to better understand these facts? Obviously, my ultimate goal is to have a consistent performance level whatever the number of launches.
trying to upload screenshots of second run as viewed by task manager:
trying to upload screenshots of first run as viewed by task manager:
Well, I ran into problems when switching clients to Windows 10 at my work too. Here are a few I encountered, all because Windows 10 has changed the scheduling of processes, creating a lot of issues like:
#1: lock-free thread-synchronization techniques from older Windows versions no longer working
A well-placed Sleep() sometimes helps. By the way, similar problems were encountered when switching from Windows 2000 to Windows XP.
#2: huge slowdowns and frequent freezes of a few seconds in older single-threaded apps
Usually setting the affinity to a single core solves this. You can also do it in Task Manager just to check, and if that helps you can do it in code too (see the sketch after this list). Here is an example of how to do it with WinAPI:
Cache size estimation on your system?
#3: messed-up driver timings causing zombie processes, even total freezes and/or BSODs
I deal with USB in my work, and it is sometimes a nightmare on Windows 10. On top of all this, Windows 10 tends to force the wrong drivers onto devices (graphics cards, custom USB systems, etc.).
#4: apps automatically frozen or closed if they do not respond to their WndProc in time
In Windows 10 the timeout is much, much smaller than in older versions. If this is the case, you can try running in compatibility mode (set in the icon properties on the desktop) for an older Windows (however, this does not help with #1 or #2), or change the app's code to speed up its response. For example, in VCL you can call ProcessMessages from inside blocking code to remedy this, or you can use threads for the heavy lifting. Just be careful with rendering and WinAPI: calling some WinAPI functions (any window/visual-related stuff) from outside the main thread causes havoc.
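For illustration, a rough sketch of the affinity call mentioned in #2, written in Python via ctypes (the same SetProcessAffinityMask WinAPI call applies from C++); treat it as an assumption-laden example rather than drop-in code:

import ctypes

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
handle = kernel32.GetCurrentProcess()     # pseudo-handle for the current process
mask = 0x1                                # bit 0 set -> run only on core 0
if not kernel32.SetProcessAffinityMask(handle, mask):
    raise ctypes.WinError(ctypes.get_last_error())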
On top of all this, old IDEs (especially for MCUs) don't work properly anymore, and the new ones are usually much worse to work with (or unusable because of a lack of functionality that was present in older versions), so I have stayed faithful to Windows 7 for development purposes.
If none of the above helps, try logging how much time some of your tasks take; it might show you which part of the code is the problem. I usually do this using a timing graph like this (a rough sketch of the logging side follows below):
Both the x and y axes are time, and each task has its own color and row in the graph. The graph scrolls in time (to the left in my case) and has a changeable time scale. The numbers show the actual and max (or sliding-average) values...
This way I can see whether some task is taking too much time or even overlapping its next execution. Peaks are also nicely visible, and all of this runs at runtime without any debug tools that might change the behavior of the execution.
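A rough sketch of the logging part of that idea, in Python for illustration (the task names and text report are placeholders; the original is a drawn graph, not a printout):

import time
from collections import defaultdict

durations = defaultdict(list)             # task name -> list of measured durations

def timed(name, fn, *args, **kwargs):
    # Run fn and record how long it took under the given task name.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    durations[name].append(time.perf_counter() - start)
    return result

def report():
    # Print last / max / average duration per task so overruns stand out.
    for name, samples in durations.items():
        print("%-10s last=%.3f ms max=%.3f ms avg=%.3f ms" % (
            name,
            samples[-1] * 1e3,
            max(samples) * 1e3,
            sum(samples) / len(samples) * 1e3,
        ))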

Cassandra crashes with Out Of Memory within minutes after starting

We have a Cassandra cluster with 3 nodes and replication factor 3 on AWS using EC2Snitch.
Instance type is c5.2xlarge (8 core and 16GB RAM).
The cluster had been working fine, but since yesterday evening the Cassandra process on all the nodes suddenly started crashing. They are set to restart automatically, but then they crash with an Out of Memory (heap space) error within 1 to 3 minutes of starting.
Heap configs:
MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="800M"
After this, we tried increasing the node size to r5.4xlarge (128 GB memory) and assigned a 64 GB heap, but the same thing still happens, irrespective of whether all 3 nodes are started or only one node at a time. We could see that the first garbage collection happens after some time, then GC runs back to back within seconds, failing to free any further memory, and the process eventually crashes.
We are not sure what is being pulled into memory immediately after starting.
Other parameters:
Cassandra version : 2.2.13
Database size is 250GB
hinted_handoff_enabled: true
commitlog_segment_size_in_mb: 64
memtable_allocation_type: offheap_buffers
Any help here, would be appreciated.
Edit:
We found that there is a particular table that, when queried, causes the Cassandra node to crash.
cqlsh:my_keyspace> select count(*) from my_table ;
ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
So we think it is related to the data in this particular table being corrupted or huge.
Thanks.
Some quick observations:
If you're building a new cluster, use the latest 3.11.x version. There's no point in building new on 2.2.
Based on your settings, it looks like you're using CMS GC. If you're not overly familiar with GC tuning, you may get more stability by switching to G1, and not specifying a HEAP_NEWSIZE (G1 figures out Eden sizing on its own).
If you're stuck on CMS, the guidance to set HEAP_NEWSIZE at 100 MB × cores is wrong. To avoid new→old gen promotion, set HEAP_NEWSIZE to 40%-50% of the total heap size and increase MaxTenuringThreshold to something like 6-8.
On a 16 GB RAM machine with CMS GC, I would use an 8 GB heap, and flip memtable_allocation_type: offheap_buffers back to heap_buffers.
Set commitlog_segment_size_in_mb back to 32. Usually when folks need to mess with that, it's to lower it, unless you've also changed max_mutation_size_in_kb.
You haven't mentioned what the application is doing when the crash happens. I suspect that a write-heavy load is happening. In that case, you may need more than 3 nodes, or look at rate-limiting the number of in-flight writes on the application side.
Additional info to help you:
CASSANDRA-8150 - A Cassandra committer discussion on good JVM settings.
Amy's Cassandra 2.1 Tuning Guide - Amy Tobey's admin guide has a lot of wisdom on good default settings for cluster configuration.
Edit
We are using G1 GC.
It is very, VERY important that you not set a heap new size (Xmn) with G1. Make sure that gets commented out.
select count(*) from my_table ;
Yes, unbound queries (queries without a WHERE clause) will absolutely put undue stress on a node, especially if the table is huge. These types of queries are something that Cassandra just doesn't do well. Find a way around using/needing this result.
You might be able to engineer this to work by setting your paging size smaller (driver side), by using something like Spark, or maybe by querying by token range and totaling the result on the app side. But you'll be much better off not doing it at all.
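For example, a rough sketch of driver-side paging with the Python cassandra-driver (the contact point and column name are illustrative assumptions):

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1"])                      # contact point is illustrative
session = cluster.connect("my_keyspace")

# A small fetch_size makes the driver pull rows in many small pages
# instead of one huge response; the counting happens on the app side.
stmt = SimpleStatement("SELECT some_key FROM my_table", fetch_size=500)
total = 0
for _ in session.execute(stmt):                      # further pages are fetched lazily
    total += 1
print(total)
cluster.shutdown()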
In addition to the GC and memory tuning suggestions by #aaron, you should also check that you are using the right compaction strategy for your data:
https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/config/configChooseCompactStrategy.html#Whichcompactionstrategyisbest
You should also check for corrupt SSTables, as trying to fetch corrupted data will also manifest in the same way (for example, https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/tools/toolsScrub.html).

Google Cloud Compute engine CPU usage shows 100% but dashboard only shows 10% usage

I am running a multiprocessing program and I expect the CPU usage to be close to 100%. It does show 100% when I run the top command.
However, the dashboard seems to show only 10% usage.
My machine setup is as follows:
I am curious whether this is a problem with Google Cloud, or am I misunderstanding some concept?
In the top output, the 100% on a particular process row refers to a single CPU core (as seen by the OS), not all of them. If you press the 1 key, top will also display the per-core CPU usage; you'll see that only one core is actually at or close to 100% busy.
Since you have 8 cores on your instance, your overall usage would be 100% / 8 = 12.5%, pretty much in line with the graph.
Maybe relevant: assuming the Python process you're showing in the top output is the one you're interested in, you should know it can't run on multiple cores; see Python threads all executing on a single core.
So if you're expecting to bump up your CPU usage, you'd have to split your Python application into multiple processes, not threads.
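A minimal sketch of what that split could look like (the work() function and its inputs are placeholders for your actual CPU-bound task):

from multiprocessing import Pool

def work(n):
    # Placeholder for a CPU-bound task.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    pool = Pool()                             # one worker process per core by default
    results = pool.map(work, [10 ** 6] * 8)   # each item runs in a separate process
    pool.close()
    pool.join()
    print(len(results))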

Profiling a legacy application

I am using an old version of a metastorm workflow designer.
We support this while we rewrite it in Microsoft technologies.
After a few changes, the "MAP" (*.epc) has become exceedingly slow to work with and to "PUBLISH".
The publish writes the map and its binaries to the DB, which a service then picks up and executes.
However, the publish "hangs" and never completes, going from a completion time of 15 minutes to in excess of 3 hours and still not finishing.
I can see the CPU is being hammered, but memory seems fine.
I ran Process Monitor, but it does not show me much, which leads me to believe either the process is doing something other than the norm or the map has grown to a point that is leading it to destruction.
My question: How else can I profile this black box exe?

ColdFusion: Recursion too deep; the stack overflowed

For the last few years we've been randomly seeing this message in the output logs when running scheduled tasks in ColdFusion:
Recursion too deep; the stack overflowed.
The code inside the task that is being called can vary, but in this case it's VERY simple code that does nothing but reset a counter in the database and then send me an email to tell me it was successful. But I've seen it happen with all kinds of code, so I'm pretty sure it's not the code that's causing this problem.
It even has an empty application.cfm/cfc to block any other code being called.
The only other time we see this is when we are restarting CF and we are attempting to view a page before the service has fully started.
The error rarely happens, but now we have some rather critical scheduled tasks that cause issues if they don't run. (Hence I'm posting here for help)
Memory usage is fine. The task that ran just before it reported over 80% free memory. Monitoring memory through the night doesn't show any out-of-the-ordinary spikes. The machine has 4 gigs of memory and nothing else running on it but the OS and CF. We recently tried to reinstall CF to resolve the problem, but it did not help. It happens on several of our other servers as well.
This is an internal server, so usage at 3am should be nonexistent. There are no other scheduled tasks being run at that time.
We've been seeing this on our CF7, CF8, and CF9 boxes (fully patched).
The current box in question info:
CF version: 9,0,1,274733
Edition: Enterprise
OS: Windows 2003 Server
Java Version: 1.6.0_17
Min JVM Heap: 1024
Max JVM Heap: 1024
Min Perm Size: 64m
Max Perm Size: 384m
Server memory: 4gb
Quad core machine that rarely sees more than 5% CPU usage
JVM settings:
-server -Dsun.io.useCanonCaches=false -XX:PermSize=64m -XX:MaxPermSize=384m -XX:+UseParallelGC -XX:+AggressiveHeap -Dcoldfusion.rootDir={application.home}/../
-Dcoldfusion.libPath={application.home}/../lib
-Doracle.jdbc.V8Compatible=true
Here is the incredibly complex code that failed to run last night, but has been running for years, and will most likely run tomorrow:
<cfquery datasource="common_app">
update import_counters
set current_count = 0
</cfquery>
<cfmail subject="Counters reset" to="my#email.com" from="my#email.com"></cfmail>
If I missed anything let me know. Thank you!
We had this issue for a while after our server was upgraded to ColdFusion 9. The fix seems to be in this technote from Adobe on jRun 4: http://kb2.adobe.com/cps/950/950218dc.html
You probably need to make some adjustments to permissions as noted in the technote.
Have you tried reducing the size of your heap from 1024 to, say, 800-something? You say there is over 80% of memory available, so if possible I would look at reducing the max.
Is it a 32- or 64-bit OS? When assigning the heap space, you have to take into consideration all the overhead of the JVM (stack, libraries, etc.) so that you don't go over the OS limit for the process.
What you could try is to set the Minimum JVM Heap Size to the same value as your Maximum JVM Heap Size (MB) within your CF Administrator.
Also update the JVM to the latest update (update 21), or at least update 20.
In the past I've always upgraded the JVM whenever something wacky started happening, as that usually solved the problem.