Cassandra crashes with Out Of Memory within minutes after starting

We have a Cassandra cluster with 3 nodes and replication factor 3 on AWS using EC2Snitch.
The instance type is c5.2xlarge (8 cores, 16 GB RAM).
The cluster had been working fine, but since yesterday evening the Cassandra process on all the nodes has been crashing. The nodes are set to restart automatically, but they then crash with an Out of Memory (heap space) error within one to three minutes of starting.
Heap configs:
MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="800M"
After this, we tried increasing the instance size to r5.4xlarge (128 GB memory) and assigned a 64 GB heap, but the same thing happens, regardless of whether all three nodes are started or only one node at a time. We noticed that the first garbage collection happens after some time, then GCs run back to back within seconds, failing to free any further memory, and the node eventually crashes.
We are not sure what is being pulled into memory immediately after startup.
Other parameters:
Cassandra version : 2.2.13
Database size is 250GB
hinted_handoff_enabled: true
commitlog_segment_size_in_mb: 64
memtable_allocation_type: offheap_buffers
Any help here would be appreciated.
Edit:
We found that there is a particular table that, when queried, causes the Cassandra node to crash.
cqlsh:my_keyspace> select count(*) from my_table ;
ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
So we think it is related to the data in this particular table being corrupt or very large.
Thanks.

Some quick observations:
If you're building a new cluster, use the latest 3.11.x version. There's no point in building new on 2.2.
Based on your settings, it looks like you're using CMS GC. If you're not overly familiar with GC tuning, you may get more stability by switching to G1, and not specifying a HEAP_NEWSIZE (G1 figures out Eden sizing on its own).
If you're stuck on CMS, the guidance of setting HEAP_NEWSIZE to 100 MB x cores is wrong. To avoid premature new->old gen promotion, set HEAP_NEWSIZE to 40%-50% of the total heap size (so roughly 1.6-2 GB of your 4 GB heap, rather than 800 MB) and increase MaxTenuringThreshold to something like 6-8.
On a 16 GB RAM machine with CMS GC, I would use an 8 GB heap, and flip memtable_allocation_type: offheap_buffers back to heap_buffers.
Set commitlog_segment_size_in_mb back to 32. Usually when folks need to mess with that, it's to lower it, unless you've also changed max_mutation_size_in_kb.
You haven't mentioned what the application is doing when the crash happens. I suspect that a write-heavy load is happening. In that case, you may need more than 3 nodes, or look at rate-limiting the number of in-flight writes on the application side.
Additional info to help you:
CASSANDRA-8150 - A Cassandra committer discussion on good JVM settings.
Amy's Cassandra 2.1 Tuning Guide - Amy Tobey's admin guide has a lot of wisdom on good default settings for cluster configuration.
Edit
We are using G1 GC.
It is very, VERY important that you not set a heap new size (Xmn) with G1. Make sure that gets commented out.
select count(*) from my_table ;
Yes, unbounded queries (queries without a WHERE clause) will absolutely put undue stress on a node, especially if the table is huge. These types of queries are something that Cassandra just doesn't do well. Find a way around using/needing this result.
You might be able to engineer this to work by setting your paging size smaller (driver side), or by using something like Spark, or maybe by querying by token range and totaling the result on the app side. But you'll be much better off not doing it.
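As a rough illustration of that token-range idea, here is a sketch using the DataStax Python driver; the contact point, keyspace, table, and partition key column (id) are placeholders, and the loop naively slices the full Murmur3 token range rather than the ranges each node actually owns:
# Hypothetical sketch: count rows slice-by-slice over the Murmur3 token range
# so that no single query has to scan the entire table at once.
from cassandra.cluster import Cluster

MIN_TOKEN = -2**63        # Murmur3Partitioner token range
MAX_TOKEN = 2**63 - 1
SLICES = 1000             # more slices = smaller, cheaper queries

cluster = Cluster(['10.0.0.1'])            # placeholder contact point
session = cluster.connect('my_keyspace')

step = (MAX_TOKEN - MIN_TOKEN) // SLICES
total = 0
start = MIN_TOKEN
while start < MAX_TOKEN:
    end = min(start + step, MAX_TOKEN)
    row = session.execute(
        "SELECT count(*) FROM my_table "
        "WHERE token(id) > %s AND token(id) <= %s",
        (start, end)).one()
    total += row.count
    start = end

print(total)
cluster.shutdown()
Even split up this way, the full scan is still expensive; it just avoids a single coordinator having to hold and wait on the whole result.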

In addition to the GC and memory tuning suggestions by @aaron, you should also check that you are using the right compaction strategy for your data.
https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/config/configChooseCompactStrategy.html#Whichcompactionstrategyisbest
You should also check for corrupt SSTables, as trying to fetch corrupted data will also manifest in the same way (for example, https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/tools/toolsScrub.html).

Related

Matplotlib: Memory and 'CPU' leak

python: 2.7
Ubuntu: 18.04
matplotlib: 2.2.2
I have a client GUI that gets information from a server and displays it. I see a memory leak and a change in CPU consumption over time. The picture below shows the change in CPU and memory utilization after restarting the GUI client (~25 seconds from the right, aligned with a spike in network traffic).
The CPU graph has a dip in CPU utilization, showing that CPU usage is different before and after the restart of the program.
The Memory graph shows a large drop in the memory utilization and then slight increase due to initialization of the same program.
The Network graph has a spike because the client requests all data from the server for visualization.
I suspect it has something to do with matplotlib. I have 7 figures that I re-chart every 3 seconds.
I have added an image of my GUI. The middle 4 graphs are the history charts. However, I am binning all data points into 300 bins, since I have ~300 pixels in that area. The binning is done in a separate thread. The data arrays (2 x 1,000,000 points: time and value) that store the information are preallocated from the very beginning, to ensure that I don't have any memory runaway problem as my datasets grow. I do not expect the datasets to grow beyond that, since a typical experiment runs at 0.1-0.01 Hz, which would take several million seconds to reach the end.
Question: If it is Matplotlib, what can I do? If it is not, what else could it be?
added Sept 6 2018:
I thought of adding another example. Here is the screenshot of CPU and memory usage after I closed the GUI. The code ran for ~ 3 days. Python 2.7, Ubuntu 18.04.1.
Thank you, everyone, for helpful comments.
After some struggle, I figured out a way to solve the problem. Unfortunately, I made several changes to my code, so I cannot say definitively what actually helped.
Here is what was done:
All charting is done in a separate thread. The image is saved in a buffer as a byte stream with io.BytesIO() and later passed to the GUI. This was important for me to solve another problem (the GUI freezing while charting with matplotlib).
Create a new figure (figure = Figure(figsize=(7,8), dpi=80)) each time you generate the plot. Previously I had been reusing the same figure (self.figure = Figure(figsize=(7,8), dpi=80)).
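A minimal sketch of that pattern, in case it helps someone else: a fresh Figure is created per render in a worker thread, drawn with the Agg canvas, and only the PNG bytes are handed to the GUI. The figure size matches the question; the data here is a placeholder.
# Sketch of the fix described above: new Figure each call, rendered off the
# GUI thread via the Agg backend, and returned to the GUI as PNG bytes.
import io
import numpy as np
from matplotlib.figure import Figure
from matplotlib.backends.backend_agg import FigureCanvasAgg

def render_history_chart(times, values):
    """Runs in the worker thread; returns PNG bytes for the GUI to display."""
    figure = Figure(figsize=(7, 8), dpi=80)   # new figure every call,
    canvas = FigureCanvasAgg(figure)          # nothing reused between renders
    ax = figure.add_subplot(111)
    ax.plot(times, values)

    buf = io.BytesIO()
    canvas.print_png(buf)                     # equivalently figure.savefig(buf, format='png')
    return buf.getvalue()

png_bytes = render_history_chart(np.arange(300), np.random.rand(300))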

Why do I keep getting MySQL Client Out of Memory error after a certain period of time...?

I'm developing a server application that performs many (probably hundreds to thousands of) MySQL queries a day (SELECTs, INSERTs, and UPDATEs).
The querying works great, until....
For some reason, after the server has been up for roughly 1 to 2 days, it generates a MySQL error any time I try to perform any MySQL query from the server... The server was developed using C++.
The error says that The MySQL Client has run out of memory.
I'm Using
MySQL Community Server 5.6.24
Is there some kind of hidden cache of data stored in memory that I don't know about that gets occupied anytime a MySQL Query gets executed....? That's the only thing I can think of.
This is probably related to the queries you're using.
Indexing can help as well as limiting the results returned from these queries.
I'm guessing you're loading a certain amount of data using SELECT commands, and the returned values need to be stored somewhere, namely in memory.
http://dev.mysql.com/doc/refman/5.7/en/out-of-memory.html
This link offers some possible solutions.
To remedy the problem, first check whether your query is correct. Is it reasonable that it should return so many rows? If not, correct the query and try again. Otherwise, you can invoke mysql with the --quick option. This causes it to use the mysql_use_result() C API function to retrieve the result set, which places less of a load on the client (but more on the server).
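The C API function mentioned there streams rows to the client instead of buffering the whole result set in client memory. As an illustration of the same idea (not the asker's C++ code), here is a sketch using PyMySQL's unbuffered SSCursor; the connection details and table name are placeholders:
# Illustration of the streaming-result idea from the quoted documentation,
# using PyMySQL's server-side (unbuffered) cursor. All names are placeholders.
import pymysql
import pymysql.cursors

conn = pymysql.connect(host='localhost', user='app', password='secret',
                       database='mydb', cursorclass=pymysql.cursors.SSCursor)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT id, payload FROM big_table")
        total = 0
        for row in cur:      # rows are fetched as you iterate,
            total += 1       # not all buffered in client memory at once
        print(total)
finally:
    conn.close()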
Also if you have enough RAM you could try to increase the memory limit in the config file.
To be more specific this is what you want to adjust:
InnoDB Settings
innodb_buffer_pool_size - Defaults to 128M. This is the main setting you want to change, since it sets how much memory InnoDB will use for data+indexes loaded into memory. For a dedicated MySQL server, the recommended size is 50-80% of installed memory. So for example, a server with 64G of RAM should have around a 50G buffer pool.
The danger of setting this value too high is that there will be no memory left for the operating system and some MySQL-subsystems that rely on filesystem cache such as binary logs, and InnoDB's transaction logs.
Taken from:
http://www.tocker.ca/2013/09/17/what-to-tune-in-mysql-56-after-installation.html
Another possibility is that something in the C++ code is allocating memory and not deallocating it properly once it's done with it.
Another thing that could be leaking is connections. Database connections are very expensive and tend to hang around. I googled "mysql connection pooling" and found these: if you're using Connector/J, you could look at http://dev.mysql.com/doc/connector-j/en/connector-j-usagenotes-j2ee-concepts-connection-pooling.html, and if you're using Connector/NET you could try http://dev.mysql.com/doc/connector-net/en/connector-net-programming-connection-pooling.html.
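Those links cover Connector/J and Connector/NET; as a rough illustration of the pooling idea itself (again, not the asker's C++ stack), here is a sketch with mysql-connector-python's built-in pool, with placeholder credentials and table name:
# Rough sketch of connection pooling: a fixed pool is created once, and each
# unit of work borrows a connection and returns it, rather than opening a new
# connection per query and possibly leaking it.
from mysql.connector import pooling

pool = pooling.MySQLConnectionPool(
    pool_name="app_pool",
    pool_size=5,
    host="localhost", user="app", password="secret", database="mydb")

def count_rows():
    conn = pool.get_connection()     # borrow a pooled connection
    try:
        cur = conn.cursor()
        cur.execute("SELECT COUNT(*) FROM some_table")
        return cur.fetchone()[0]
    finally:
        conn.close()                 # returns the connection to the pool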

cfindex causing template to be killed

My first question on Stack Overflow:
I'm running CF10 Enterprise on Windows 2003 Server (AMD Opteron 2.30 GHz, 4 GB RAM). I'm using cfindex with action="update" to index over 1k PDFs.
I'm getting JVM memory errors and the page is being killed when it's run as a scheduled task in the early hours of the morning.
This is the code in the page :
<cfindex collection="pdfs" action="update" type="path" extensions=".pdf" recurse="yes" urlpath="/site/files/" key="D:\Inetpub\wwwroot\site\files">
JVM.config contents
java.home=s:\ColdFusion10\jre
application.home=s:\ColdFusion10\cfusion
java.args=-server -Xms256m -Xmx1024m -XX:MaxPermSize=192m -XX:+UseParallelGC -Xbatch -Dcoldfusion.home={application.home} -Dcoldfusion.rootDir={application.home} -Dcoldfusion.libPath={application.home}/lib -Dorg.apache.coyote.USE_CUSTOM_STATUS_MSG_IN_HEADER=true -Dcoldfusion.jsafe.defaultalgo=FIPS186Random -Dcoldfusion.classPath={application.home}/lib/updates,{application.home}/lib,{application.home}/lib/axis2,{application.home}/gateway/lib/,{application.home}/wwwroot/WEB-INF/flex/jars,{application.home}/wwwroot/WEB-INF/cfform/jars
java.library.path={application.home}/lib,{application.home}/jintegra/bin,{application.home}/jintegra/bin/international,{application.home}/lib/oosdk/classes/win
java.class.path={application.home}/lib/oosdk/lib,{application.home}/lib/oosdk/classes
I've also tried going higher than 1024 MB for -Xmx; however, CF would not restart until I took it back to 1024 MB.
Could it be a rogue PDF, or do I need more RAM on the server?
Thanks in advance
I would say you probably need more RAM. CF10 64-bit with 4 gigs of RAM is pretty paltry. As an experiment, why don't you try indexing half of the files, then the other half (or divide them up however is appropriate)? If in each case the process completes and memory use remains normal, or recovers to normal, then there is your answer: you have hit a ceiling on RAM.
Meanwhile, more information would be helpful. Can you post your JVM settings (the contents of your jvm.config file)? If you are using the default heap size (512 MB) then you may have room (not much, but a little) to increase it. Keep in mind it is the max heap size, and not the size of physical RAM, that constrains your CF engine - though obviously your heap must run within said RAM.
Also keep in mind that Solr runs in its own JVM with its own settings. Check out this post for information on that - though I suspect it is your CF heap that is being overrun.

ColdFusion scheduler threads eating CPU

I've got CF10 running on a dev box, Windows 7, 64 bit.
Periodically, every minute or so, the CPU usage for CF10 will spike up to 100% for about 20 seconds and come back down. It's pretty regular.
I've found it difficult to diagnose this issue. I've seen talk of client variables purges, logging, monitoring and all manner of things - but I've turned these all off to no avail.
With VisualVM, I've managed to track the issue down to the 'scheduler' threads. I have 5 of these in a waiting state. Periodically each will run, bumping up the CPU dramatically.
Taking a thread dump, it seems that all these threads are calling java.io.WinNTFileSystem.getBooleanAttributes - something I've seen mentioned a few times as potentially problematic.
UPDATE: Recently I've been playing with onSessionEnd on another app, and discovered that the scheduler-x threads appear to be internal to ColdFusion - my onSessionEnd tasks always seem to run in one of these threads.
Looking in the temp folder, I can see that a lot of EH Cache folders have been made which I think are to do with query caching. The apps I have running make use of this fairly extensively. I thought clearing the temp folder out might improve performance but it has had no effect.
It's worth noting that if I start the CF service without actually calling any of my apps, the problem does not occur. That might suggest the issue is with the apps themselves, however they do not cause any issue in production - only on this box.
There are no scheduled tasks set up either.
Below is an example of one of the threads causing high CPU. I'd appreciate any help in diagnosing what this thread is doing and why, as well as how to potentially stop it from using so much resources.
"scheduler-2" - Thread t#84
java.lang.Thread.State: RUNNABLE
at java.io.WinNTFileSystem.getBooleanAttributes(Native Method)
at java.io.File.isDirectory(File.java:849)
at coldfusion.watch.Watcher.accept(Watcher.java:352)
at java.io.File.listFiles(File.java:1252)
at coldfusion.watch.Watcher.getFiles(Watcher.java:386)
at coldfusion.watch.Watcher.getFiles(Watcher.java:397)
at coldfusion.watch.Watcher.getFiles(Watcher.java:397)
at coldfusion.watch.Watcher.getFiles(Watcher.java:397)
at coldfusion.watch.Watcher.getFiles(Watcher.java:397)
at coldfusion.watch.Watcher.getFiles(Watcher.java:397)
at coldfusion.watch.Watcher.getFiles(Watcher.java:397)
at coldfusion.watch.Watcher.getFiles(Watcher.java:397)
at coldfusion.watch.Watcher.getFiles(Watcher.java:397)
at coldfusion.watch.Watcher.getFiles(Watcher.java:397)
at coldfusion.watch.Watcher.checkWatchedDirectories(Watcher.java:166)
at coldfusion.watch.Watcher.run(Watcher.java:216)
at coldfusion.scheduling.ThreadPool.run(ThreadPool.java:211)
at coldfusion.scheduling.WorkerThread.run(WorkerThread.java:71)
My environment:
Win 7 64-bit
CF10 Update 12
JDK 1.8.0_11
The issue occurs on multiple versions of JVM - this version is currently used to make monitoring available.
My java settings:
Min heap size: 512mb
Max heap size: 1024mb
-server -XX:MaxPermSize=512m -XX:+UseParallelGC -Xbatch -Dcoldfusion.home={application.home} -Dcoldfusion.rootDir={application.home} -Dcoldfusion.libPath={application.home}/lib -Dorg.apache.coyote.USE_CUSTOM_STATUS_MSG_IN_HEADER=true -Dcoldfusion.jsafe.defaultalgo=FIPS186Random -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8701 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false
I'd be lying if I said I understood what all of these settings do!
Sorry if you're one of those people that believes all CF developers should be Java app stack experts. I am not.
Any help, much appreciated. ;)
Using FusionReactor 6, I was able to solve this for us today. We were using this.javaSettings to hot-load Java class files. The watchInterval from this.javaSettings uses the DirectoryWatcher at the specified interval; in our case, I had lowered it to one second.
How I solved it: I set a breakpoint in FusionReactor and could see that it was constantly stuck scanning the directory above the one I specified in this.javaSettings. This directory has enough files and subfolders that it looks like one DirectoryWatcher was unable to finish before the next one was created. Had ColdFusion just stuck to the subfolder I specified in this.javaSettings, it would not have been a problem.
Example:
This.javaSettings = {
loadPaths = ["\externals\lib\"]
, loadColdFusionClassPath = true
, reloadOnChange = true
, watchInterval = 1
};
In the above case, lib has just 5 files. However, "externals" is loaded with stuff. In the breakpoint, it was typically looking at stuff in "externals."
Do you have scheduled tasks running that use the CFFILE tag? They tend to be resource hogs. Spinning these into their own threads may help with the CPU spike.
Another thought:
Looking at the JVM:
•Min heap size: 512mb
•Max heap size: 1024mb
These establish the minimum and maximum memory available to the Java virtual machine.
-server -XX:MaxPermSize=512m
This is the amount of memory dedicated to the Java permanent generation.
You've got half of your JVM-allocated memory dedicated to the permanent generation. Try bumping the maximum heap size up to 2048mb and restarting the ColdFusion service. It could go higher depending on whether or not you're running a 64-bit operating system.

ColdFusion: Recursion too deep; the stack overflowed

For the last few years we've been randomly seeing this message in the output logs when running scheduled tasks in ColdFusion:
Recursion too deep; the stack overflowed.
The code inside the task that is being called can vary, but in this case it's VERY simple code that does nothing but reset a counter in the database and then send me an email to tell me it was successful. But I've seen it happen with all kinds of code, so I'm pretty sure it's not the code that's causing this problem.
It even has an empty application.cfm/cfc to block any other code being called.
The only other time we see this is when we are restarting CF and we are attempting to view a page before the service has fully started.
The error rarely happens, but now we have some rather critical scheduled tasks that cause issues if they don't run. (Hence I'm posting here for help)
Memory usage is fine. The task that ran just before it reported over 80% free memory. Monitoring memory through the night doesn't show any out-of-the-ordinary spikes. The machine has 4 gigs of memory and nothing else running on it but the OS and CF. We recently tried to reinstall CF to resolve the problem, but it did not help. It happens on several of our other servers as well.
This is an internal server, so usage at 3am should be nonexistent. There are no other scheduled tasks being run at that time.
We've been seeing this on our CF7, CF8, and CF9 boxes (fully patched).
The current box in question info:
CF version: 9,0,1,274733
Edition: Enterprise
OS: Windows 2003 Server
Java Version: 1.6.0_17
Min JVM Heap: 1024
Max JVM Heap: 1024
Min Perm Size: 64m
Max Perm Size: 384m
Server memory: 4gb
Quad core machine that rarely sees more than 5% CPU usage
JVM settings:
-server -Dsun.io.useCanonCaches=false -XX:PermSize=64m -XX:MaxPermSize=384m -XX:+UseParallelGC -XX:+AggressiveHeap -Dcoldfusion.rootDir={application.home}/../
-Dcoldfusion.libPath={application.home}/../lib
-Doracle.jdbc.V8Compatible=true
Here is the incredibly complex code that failed to run last night, but has been running for years, and will most likely run tomorrow:
<cfquery datasource="common_app">
update import_counters
set current_count = 0
</cfquery>
<cfmail subject="Counters reset" to="my@email.com" from="my@email.com"></cfmail>
If I missed anything let me know. Thank you!
We had this issue for a while after our server was upgraded to ColdFusion 9. The fix seems to be in this technote from Adobe on jRun 4: http://kb2.adobe.com/cps/950/950218dc.html
You probably need to make some adjustments to permissions as noted in the technote.
Have you tried reducing the size of your heap from 1024 to, say, 800-something? You say there is over 80% of memory left available, so if possible I would look at reducing the max.
Is it a 32- or 64-bit OS? When assigning the heap space you have to take into consideration all the overhead of the JVM (stack, libraries, etc.) so that you don't go over the OS limit for the process.
What you could try is setting the Minimum JVM Heap Size to the same as your Maximum JVM Heap Size (MB) within your CF Administrator.
Also update the JVM to the latest update (_21), or at least _20.
In the past I've always upgraded the JVM whenever something wacky started happening, as that usually solved the problem.