Fabric: no handlers could be found for logger “paramiko.transport”

I am using Fabric to run an experiment on an Amazon Web Services EC2 cluster (50 instances). The experiment mainly uses a number of clients to make requests to my servers.
Because I am testing the scalability of my project, I increase the number of servers while keeping the number of clients the same. In the process, I occasionally run into this error, which interrupts my Fabric task.
If I run the task again, the error does not happen. I read the question No handlers could be found for logger “paramiko.transport”, but it does not really explain why this error terminates my task or why it happens only occasionally.
I also checked the context where the error occurs, but the last executed commands are not even the same.
Could someone provide some debugging tricks to identify where the problem is?
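
One debugging trick that often helps here: the "no handlers could be found" message means Python's logging has nowhere to send paramiko's log records, so the record describing the real transport failure is dropped. A minimal sketch (nothing project-specific assumed) that makes the underlying error visible:

import logging
import paramiko

# Give the root logger a handler so paramiko's records are not dropped;
# the real transport error should then appear on stderr.
logging.basicConfig(level=logging.DEBUG)

# Alternatively, let paramiko write its own log file.
paramiko.util.log_to_file("paramiko_debug.log")

Run the failing task with this in place, and the log should show what the transport was doing (authentication, banner exchange, a dropped socket) at the moment the task died.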

Related

504 timeout on AWS with nginx and gunicorn

I am running a Python Django app on an AWS EC2 instance. It uses gunicorn and nginx to serve the app, and the EC2 instance sits behind an application load balancer. Occasionally I get a 504 error where the entire EC2 instance becomes unreachable for everyone (including via SSH, which I use all the time otherwise). I then need to restart everything, which takes time.
I can replicate the error by overloading the app (e.g. uploading and processing a very large image): the gunicorn worker times out (I see the timeout message in the logs), the 504 error appears, and the instance becomes unreachable. I set my gunicorn timeout to 5 minutes (300 seconds), but it falls over sooner than that. There is nothing really useful in the CloudWatch logs.
I am looking for a way to resolve this for all current and future cases. That is, if the site gets overloaded, I want it to return an error message rather than become completely unreachable for everyone. Is there a way to do that?
There are many things to consider and test here before you can pin down the cause, but I think it is OOM (out of memory), mainly because you have to restart the instance even to log in over SSH.
Nginx uses an event-driven approach to handling requests, so a single nginx worker can handle thousands of requests simultaneously. Gunicorn, on the other hand, uses sync workers by default, which means a request stays with a worker until it has been processed.
When you send a large request, the machine tries to process it until memory runs out, and this mostly goes undetected by the services running inside the machine. Try monitoring memory with a monitoring tool in AWS, or just SSH in and run htop before calling the API.
In most cases with Django/gunicorn the culprit is OOM.
Edit:
AFAIK you cannot catch an OOM; you can only inspect the aftermath, i.e. check /var/log/syslog after the system restarts. As I said, monitor memory through AWS (I don't have much experience with AWS).
Regarding the solution:
First increase the memory of your EC2 instance until the error stops, to see how big the problem is.
Then profile your application to find out which part is actually taking that much memory. I haven't used a memory profiler myself, so maybe you can tell me afterwards which one is better.
Beyond that, the only thing you can do is optimise your application: see the common gotchas, best practices, query optimisations, etc.
https://haydenjames.io/how-to-diagnose-oom-errors-on-linux-systems/
https://www.pluralsight.com/blog/tutorials/how-to-profile-memory-usage-in-python
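
One way to get "return an error instead of taking the machine down" is to bound gunicorn's resource appetite in its config file. gunicorn.conf.py is plain Python; the numbers below are illustrative assumptions, not recommendations for this particular app:

# gunicorn.conf.py - a minimal sketch; tune the numbers for your instance.

# Few sync workers: each holds one request until it finishes, so more
# workers means more concurrent memory use on a small instance.
workers = 2

# Kill a worker whose request takes too long, so a runaway upload produces
# a 5xx response instead of slowly exhausting memory.
timeout = 60

# Recycle workers periodically so slow memory leaks cannot accumulate.
max_requests = 500
max_requests_jitter = 50

On the nginx side, capping client_max_body_size makes oversized uploads fail fast with a 413 before they ever reach a gunicorn worker.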

ChromeOS errors in GCP Logging

I'm seeing errors in StackDriver logging for my Compute Engine instance. The logs show the same issues repeating every hour, creating a lot of noise. I have a Spring Boot API deployed in a container to a VM in Compute Engine, using the latest stable version of Container OS.
I'm relatively new to GCP and don't understand what is causing this; searches have come up empty so far.
Failed to call method: org.chromium.SessionManagerInterface.RetrieveActiveSessions: object_path= /org/chromium/SessionManager: org.freedesktop.DBus.Error.ServiceUnknown: The name org.chromium.SessionManager was not provided by any .service files
CallMethodAndBlockWithTimeout(...): Domain=dbus, Code=org.freedesktop.DBus.Error.ServiceUnknown, Message=The name org.chromium.SessionManager was not provided by any .service file
Error calling D-Bus proxy call to interface '/org/chromium/SessionManager': The name org.chromium.SessionManager was not provided by any .service files
The same 3 lines repeat every hour. Is anyone aware of what might be causing this, or of how to fix/suppress these messages?
I looked into this error, and here are my findings:
The error message you are seeing is a symptom of Chrome reliably exiting shortly after starting up.
The UI job (which encompasses Chrome, the session_manager and the window manager) gets shut down by upstart because it is thrashing, and when the system tries to restart the session_manager, the session_manager cannot be reached over D-Bus.
The crash-collection software in Container OS was originally written for Chromebooks (laptops running the Chrome browser), so the code expects Chrome and some related software to be present on the system.
However, Container OS is a server OS and does not have Chrome. Because Chrome is missing, the software reports some errors. They are not real failures, just verbose error messages.
Overall, it is safe to ignore these logs and continue using your VM instances.
Hope this helps.
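
If you mainly want the noise gone, one option (my suggestion, not part of the findings above) is a Logs Router exclusion filter in Cloud Logging, so the repeating entries are dropped at ingestion. A filter along these lines should match them:

resource.type="gce_instance" AND textPayload:"org.chromium.SessionManager"

Exclusions can be created from the Logs Router page in the console; excluded entries are simply not stored, so nothing on the VM itself needs to change.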

flask-cache memcache connection auto-reconnect

I've recently moved my memcache server behind an Elastic Load Balancer in AWS. I'm also using Flask-Cache with this memcache. If I'm not mistaken (and it's totally possible I am), Flask-Cache opens a connection to memcache and holds it open. It also appears that the ELB terminates these long-standing connections after some period of time (I think about 60 minutes). This results in errors like:
SomeErrors: error 19 from flush_all: (0x4ff96f0) CONNECTION FAILURE, ::rec() returned zero, server has disconnected
If there was some way I could catch these errors and reconnect (or some magic setting to "try to reconnect on connection failure"), that would solve this problem.
FWIW, I'm using pylibmc, but don't see anything obvious (to me) that I could pass.
Any help would be greatly appreciated!
Being disconnected from ELB is very common and also very difficult to debug. Here are a few things that might help:
Debugging Ideas
Attempt to debug the problem in a staging environment with only one instance connected to the ELB.
Make sure you have application logging with timestamps, and that if you catch all exceptions in Python (which is generally not a great idea), you log the exception. It is possible you have a subtle, hidden bug that appears to be something else because you are catching all exceptions.
Simulate the failure (i.e. manually remove one instance from the ELB), then look at your logs and make sure you see the failure manifested there. If you can reproduce the same behaviour, you can figure out how to fix it.
Look into a load-testing tool such as https://loader.io/. This can be very helpful for simulating the conditions under which the disconnects appear.
Try the same application with a different load balancer, e.g. HAProxy (I would try this last).
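
On the original reconnect question: I am not aware of a pylibmc flag for this, but a thin wrapper that catches pylibmc errors, rebuilds the client, and retries once is a common workaround. A minimal sketch (the class, endpoint, and retry policy are mine, not part of Flask-Cache):

import pylibmc

SERVERS = ["my-elb-endpoint:11211"]  # hypothetical ELB endpoint

def make_client():
    return pylibmc.Client(SERVERS, binary=True)

class ReconnectingCache(object):
    """Rebuilds the memcache connection once when a call fails."""

    def __init__(self):
        self._client = make_client()

    def get(self, key):
        try:
            return self._client.get(key)
        except pylibmc.Error:
            # The ELB likely dropped the idle connection; reconnect, retry once.
            self._client = make_client()
            return self._client.get(key)

A cheap alternative is keeping the connection warm: a background task that touches a dummy key more often than the ELB's idle timeout.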

What could be causing a seemingly random AWS EC2 server to crash? (Error establishing database connection)

To begin, I am running a Wordpress site on an AWS EC2 Ubuntu Micro instance. I have already confirmed that this is NOT an error with Wordpress/mysql.
Seemingly at random, the site goes down and I get the "Error establishing database connection" message. The server reports that it is running just fine, and rebooting usually fixes the issue, but I'd like to find the cause and resolve it so this stops happening (for the past two weeks it has gone down almost every other day).
It's not a spike in traffic; at least, Google Analytics hasn't shown any spikes (the site averages about 300 visits per day).
What's the cause, and how can this be fixed?
Sounds like you might be running into the CPU throttling that is a limitation of t1.micro instances. If you use too many CPU cycles, you will be throttled.
See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html#available-cpu-resources-during-spikes
The next time this happens, check some general stats on the health of the instance. You can get a feel for its high-level health using the 'top' command (http://linuxaria.com/howto/understanding-the-top-command-on-linux?lang=en). Be sure to look at CPU and memory usage; you may find a process (pid) that is consuming a lot of resources and starving your app.
More likely, something within your application is going out of control (how did you come to the conclusion that this is not a Wordpress/MySQL issue?). Possibly a database connection is not being released. To see what your app is doing, find the process id (pid) for your app:
ps aux | grep "php"
and get a dump of what that process is doing. (kill -3 produces a thread dump for a Java process; for PHP, attaching strace -p <pid> serves a similar purpose.) This will help you see where your application is stuck, if it is.
Typically it's good practice to take two such dumps a few seconds apart and compare them; if there is an issue in the application, you should see a lot of activity stuck at the same point.
You might also want to checkout what MySQL is seeing (https://dev.mysql.com/doc/refman/5.1/en/show-processlist.html).
mysql> SHOW FULL PROCESSLIST;
Hope this helps, let us know what you find!
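
Since the box is usually only inspectable after the fact, another option is a small watchdog that records memory pressure and database reachability every minute, so you have data from just before a crash. A rough sketch assuming the psutil and PyMySQL packages, with hypothetical credentials:

import time
import logging

import psutil   # pip install psutil
import pymysql  # pip install pymysql

logging.basicConfig(filename="/var/log/wp-watchdog.log",
                    format="%(asctime)s %(message)s", level=logging.INFO)

def mysql_up():
    """True if MySQL accepts a connection (credentials are placeholders)."""
    try:
        conn = pymysql.connect(host="127.0.0.1", user="wp", password="secret",
                               database="wordpress", connect_timeout=3)
        conn.close()
        return True
    except pymysql.MySQLError:
        return False

while True:
    mem = psutil.virtual_memory()
    logging.info("mem_used=%.0f%% swap_used=%.0f%% mysql_up=%s",
                 mem.percent, psutil.swap_memory().percent, mysql_up())
    time.sleep(60)

If the kernel's OOM killer is taking mysqld down (a common failure mode on a micro instance running both a web server and MySQL), /var/log/syslog will also show "Killed process" entries around the crash time.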

Service blocks windows startup

We have an automatically started service which in some cases spends a lot of time, say 10 minutes, loading necessary data. During this time it works as expected (processing some huge data files required at startup). I report the progress via the C++ SetServiceStatus function, and that is working fine.
This service does not depend on anything, and only one other service, again our own, depends on it. That second service starts after those 10 minutes; it needs the first "server" service to be fully running before it can accept requests.
I thought Windows would start all the other automatic services in the meantime (in less than 10 minutes, as usual) and then run normally, but the system is completely blocked during startup (I can't log in to the computer or ping it) until this one specific service has started (reports SERVICE_RUNNING via SetServiceStatus). When our service completely starts, the other missing system services (required for networking, remote desktop, whatever, it's quite random) are also started. Is this normal behaviour? Why are non-dependent services (such as remote desktop and network connections) waiting for this service? Am I missing something?
I tried adding dependencies to postpone my service's startup, but I ended up with many dependencies and the behaviour was still somewhat random (as the order of services is random). Sometimes I was able to log in, but, for example, the Start button only began working after those 10 minutes, once my service had started. I am not sure what "the last service" to depend on would be, or which services to include in my dependency list; on some computers those services may be disabled, which would bring new problems, so I don't like this solution very much.
Another option was the Delayed start setting for our service. This is supposed to start the service once all other automatic services are running. It works: Windows boots, the computer is running and responding, and our service starts, but its performance is very bad, many times slower than usual; it seems delayed-start services run at a much lower priority or something like that.
My only current solution is to report to the system that my service is running (via SetServiceStatus) but continue loading (this works, I tested it). But then we have a problem with our dependent service, as it needs to start only when the first one is really ready. That can be solved, but I still wonder how this behaviour is possible, and whether there is some way to keep an automatically started service reporting "started" only once it is fully prepared to work. Thanks for any ideas.
Set SERVICE_RUNNING as soon as possible, and then continue processing in the background. Make your other service resilient to the first service being in a running state but not yet ready to serve requests.
The longer a service stays in the starting state, the more problems we have seen across different Windows versions.
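
The original service is C++, but to keep this page's examples in one language, here is the same pattern sketched with Python's pywin32 bindings (the service names are hypothetical; ReportServiceStatus wraps the SetServiceStatus call mentioned in the question):

import threading

import win32event
import win32service
import win32serviceutil

class EarlyRunningService(win32serviceutil.ServiceFramework):
    """Reports SERVICE_RUNNING immediately, then loads data in the background."""

    _svc_name_ = "EarlyRunningService"           # hypothetical name
    _svc_display_name_ = "Early Running Service"

    def __init__(self, args):
        win32serviceutil.ServiceFramework.__init__(self, args)
        self.stop_event = win32event.CreateEvent(None, 0, 0, None)
        self.ready = threading.Event()  # set once the heavy load finishes

    def _load_data(self):
        # ... the 10-minute data load goes here ...
        self.ready.set()

    def SvcDoRun(self):
        # Tell the service control manager we are running right away, so
        # boot is not held up, then do the real work in the background.
        self.ReportServiceStatus(win32service.SERVICE_RUNNING)
        threading.Thread(target=self._load_data, daemon=True).start()
        # Block until the SCM asks us to stop.
        win32event.WaitForSingleObject(self.stop_event, win32event.INFINITE)

    def SvcStop(self):
        self.ReportServiceStatus(win32service.SERVICE_STOP_PENDING)
        win32event.SetEvent(self.stop_event)

For the dependent service, the in-process threading.Event above is only a placeholder: in practice the readiness handshake needs something visible across processes, such as a named event, a registry value, or a pipe, which the second service polls before it starts accepting requests.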