I have a t2.small machine with a weird problem that cripples my site.
Fetching even a single small logo image can take anywhere from under a second to more than a minute. I just press F5 in the browser and the time to get a small PNG varies wildly!
I have more than 100 CPU credits.
No errors in the Apache error log.
In my tests I access the instance directly by IP address to bypass the ELB, but some refreshes in the browser still take a random time, from immediate to a minute! Sometimes I even get a 504 error because the request took more than 60 seconds.
It is an Ubuntu machine that used to work fine, running Apache 2.4 with KeepAliveTimeout 5.
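For reference, the variance can be reproduced without a browser. A minimal curl loop like this (hypothetical IP and path; port 80 assumed) shows the spread:

# time 20 fetches of the same small image; print status code and total time per request
for i in $(seq 1 20); do
  curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' http://203.0.113.10/logo.png
done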
Any ideas?
Thanks
This is assuming that the site "used to work ok":
1) Check for any changes you have made to the site's configuration. For example, you might have installed a misconfigured module into Apache. If you have an old snapshot, resurrect it and compare it with what you have now.
2) Look in /var/log/syslog. If there are any mysterious messages that look like a potential hardware fault, do an EC2 stop and start to move the VM to a different physical host.
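A sketch of both checks (the instance id is hypothetical; the stop/start can also be done from the console):

# scan the syslog for anything that smells like hardware trouble
grep -iE 'error|fault|fail' /var/log/syslog | tail -n 50
# stop and start (not reboot) so the VM is placed on a different physical host
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 start-instances --instance-ids i-0123456789abcdef0

Note that a plain reboot keeps the instance on the same host; only a full stop/start moves it.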
I am running a Python Django app on an AWS EC2 instance. It uses gunicorn and nginx to serve the app, and the EC2 instance sits behind an Application Load Balancer. Occasionally I get a 504 error where the entire EC2 instance becomes unreachable for everyone (including via SSH, which I use all the time otherwise). I then need to restart everything, which takes time.
I can replicate the error by overloading the app (e.g. uploading and processing a very large image): the gunicorn worker times out (I see the timeout message in the logs), the 504 error appears, and the instance becomes unreachable. I set my gunicorn timeout to 5 minutes (300 seconds), but it falls over sooner than that. There is nothing really useful in the CloudWatch logs.
I am looking for a way to resolve this for all current and future cases. I.e., if the site gets overloaded, I want it to return an error message instead of becoming completely unreachable for everyone. Is there a way to do that?
There are many things to consider and test here to pin down the reason, but I think it is OOM (out of memory), mainly because you have to restart the instance before you can even log in over SSH.
Nginx uses an event-driven approach to handle requests, so a single nginx worker can handle thousands of requests simultaneously. Gunicorn, on the other hand, mostly (by default) uses sync workers, which means a request occupies a worker until it has been processed.
When you send a large request, the machine tries to process it until memory runs out, and this mostly goes undetected by any service running inside the machine. Try monitoring memory with a monitoring tool in AWS, or just SSH in and run htop before calling the API.
In most cases with Django/gunicorn the culprit is OOM.
Edit:
AFAIK you cannot catch an OOM as it happens; the only thing you can do is look at the aftermath, i.e. after the system restarts, check /var/log/syslog. As I said, monitor memory with AWS monitoring (I don't have much experience with AWS).
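For example, after the instance comes back, something along these lines will confirm whether the OOM killer fired:

# evidence of the OOM killer in the kernel log and the syslog
dmesg -T | grep -i 'out of memory'
grep -i 'killed process' /var/log/syslog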
And regarding the solution:
First, increase the memory of your EC2 instance until the error goes away, to see how big the problem is.
Then optimise your application by profiling which part is actually taking that much memory. I haven't used any memory profiler myself, so maybe you can tell me afterwards which one is better.
In the end, the only lasting fix is to optimise your application: see common gotchas, best practices, query optimisations, etc.
https://haydenjames.io/how-to-diagnose-oom-errors-on-linux-systems/
https://www.pluralsight.com/blog/tutorials/how-to-profile-memory-usage-in-python
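To get the behaviour the question asks for (an error response instead of a dead instance), one approach is to bound gunicorn so a runaway request kills a worker rather than the whole box, and to add swap as a stopgap. This is only a sketch: the module name is hypothetical and the numbers need tuning for your app.

# few workers, hard timeout below the load balancer's 60 s default, worker recycling to limit slow leaks
gunicorn myproject.wsgi --workers 2 --timeout 55 --max-requests 500 --max-requests-jitter 50
# stopgap: 2 GB of swap so a memory spike degrades performance instead of waking the OOM killer
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

When a sync worker exceeds the timeout, gunicorn kills and replaces just that worker, and nginx returns an error for that one request instead of the whole instance locking up.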
I am facing some issues with GCP and the AI Platform (JupyterLab).
It seems that I am unable to maintain a stable connection with the server for long. I keep getting the 'server connection error' message. From there, two possibilities:
either nothing happens and my cell keeps running
or
the cells have stopped running and I see the status 'No Kernel!' at the top right of the notebook. Whenever I select a kernel (Python 3) again, depending on my luck I can either keep working, or the cell shows the running status (with the * to its left) but the kernel status at the bottom left stays at 'connected' (instead of 'busy'). In the latter case I need to restart the kernel and run all the cells again, which can take very long.
Sometimes this happens as soon as I run the first cell after (re)starting the instance, sometimes a bit later. The longest stable period I managed to work on the notebook without any issue was 20 to 30 minutes, which is quite annoying.
Configuration of my main instance:
- 16 CPUs
- 60 GB of RAM
- an NVIDIA P100 GPU
I have tried different instance types and I keep having the same problem; my network at home is stable.
[screenshot: error message]
What operating system and browser are you using at work?
I had the same problem as you did on Ubuntu 18 with the Firefox browser.
When I switched to Windows with Chrome the error did not reoccur, even though it was the same network.
I had a similar issue today: according to the Google docs, the cause is that the Docker / Jupyter service is not starting.
In our specific case, the reason these services could not start was a full disk.
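If you hit the same thing, checking and freeing disk space over SSH looks roughly like this (the Jupyter service name may differ between images):

df -h                          # is any filesystem at 100%?
sudo docker system prune -f    # reclaim space from unused containers and images
sudo systemctl restart jupyter # then restart the notebook service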
I'm experiencing a very weird behavior of the AWS Classic Load Balancer. From time to time, for no apparent reason, some requests to the load balancer randomly fail.
I saw 3 requests to the same URL in the Chrome console.
The first two were successful, but the third failed. For some reason the "Remote Address" part is not there.
The app keeps retrying the request; it fails a few times and then succeeds.
That happens only for some users for some time (never the same users and never at the same time), and then everything goes back to normal. It happens for users in different networks and countries.
I checked the CLB logs and I can see the first two requests there, but not the third. I contacted AWS support, but they have no clue what is happening. They asked me to run tcpdump on the machine the next time it happens and send them the .cap file.
The fact that the "Remote Address" part is missing makes me think it is a DNS issue, but I don't see how that makes sense, given that it was working a few seconds/minutes before and starts working again a few seconds/minutes later.
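For reference, the capture they asked for can be produced with something like this (port 80 assumed; adjust to your backend listener):

# capture full packets on all interfaces into a file support can analyse
sudo tcpdump -i any -s 0 -w elb-issue.cap port 80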
To begin, I am running a WordPress site on an AWS EC2 Ubuntu Micro instance. I have already confirmed that this is NOT an error with WordPress/MySQL.
Seemingly at random the site goes down and I get the "Error establishing database connection" message. The server reports that it is running just fine, and rebooting usually fixes the issue; however, I'd like to figure out the cause and resolve it so this stops happening (for the past two weeks it has gone down almost every other day).
It's not a spike in traffic, or at least Google Analytics hasn't shown any spikes (the site averages about 300 visits per day).
What's the cause, and how can this be fixed?
Sounds like you might be running into the throttling that is a limitation of the t1.micro instance type. If you use too many CPU cycles, you will be throttled.
See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html#available-cpu-resources-during-spikes
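Throttling shows up as CPU 'steal' time, so it can be confirmed from inside the instance while the site is slow, for example:

# the last column (st) is steal; consistently high values mean the hypervisor is throttling you
vmstat 1 10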
The next time this happens, I would check some general stats on the health of the instance. You can get a feel for the high-level health of the instance with the 'top' command (http://linuxaria.com/howto/understanding-the-top-command-on-linux?lang=en). Be sure to look at CPU and memory usage. You may find a process (pid) that is consuming a lot of resources and starving your app.
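Non-interactive snapshots are handy for comparing over time (the -o sort option needs a reasonably recent procps top):

top -b -n 1 -o %CPU | head -n 20   # biggest CPU consumers
top -b -n 1 -o %MEM | head -n 20   # biggest memory consumers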
More likely, something within your application (how did you come to the conclusion that this is not a WordPress/MySQL issue?) is going out of control. Possibly a database connection is not being released? To see what your app is doing, find the process id (pid) for your app:
ps aux | grep "php"
and get a snapshot of what that process is doing. (kill -3 <pid> prints a thread dump for a Java process; since WordPress runs on PHP, use strace -p <pid> instead to see where the process is blocked.) This will help you see where your application is stuck (if it is).
Typically it's good practice to capture two such snapshots a few seconds apart and compare them. If there is an issue in the application, you should see a lot of processes stuck at the same point.
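A rough way to take those two snapshots for the PHP processes (wchan shows what the kernel says each process is waiting on):

# two snapshots 10 seconds apart; entries that don't change are the suspects
ps -eo pid,stat,wchan:32,cmd | grep '[p]hp' > snap1.txt
sleep 10
ps -eo pid,stat,wchan:32,cmd | grep '[p]hp' > snap2.txt
diff snap1.txt snap2.txt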
You might also want to checkout what MySQL is seeing (https://dev.mysql.com/doc/refman/5.1/en/show-processlist.html).
mysql> SHOW FULL PROCESSLIST;
Hope this helps, let us know what you find!
First, my setup, which is used for testing purposes:
3 Virtual Machines running with the following configuration:
MS Windows 2008 Server Standard Edition
Latest version of AppFabric Cache
Each one has a local network share where the config file is stored (I have added all the machines in each config)
The cache is distributed but not high availability (we don't have the Enterprise version of Windows)
Each host is configured as lead, so according to the documentation at least one host should be allowed to crash.
Each machine has the website I am testing installed, with the local cache configured
One Linux machine is used as a proxy (running Varnish) to distribute the traffic for testing purposes.
That's the setup; now on to the problem. The scenario I am testing simulates one of the servers crashing and then bringing it back into the cluster. I have problems both with the server crashing and with bringing it back up. Steps I use to test it:
Direct the traffic with Varnish on the Linux machine to one server only.
Log in to make sure there is something in the cache.
Unplug the network cable for one of the other servers (simulates that server crashing)
At this point I get a cache timeout and a service error. I want the application to stay up on the servers that didn't crash, but it takes some time for the cache to come back up on the remaining servers. Is that how it should be? Plugging the network cable back in and starting the host causes a similar problem.
So my question is: have I missed something? What I would like to see is that if one server crashes, the cache stays up, since a majority of the lead hosts are still running, and that starting the crashed server again brings it back gracefully into the cluster without causing any problems on the other hosts. But maybe that is not how it works?
I ran through a similar test scenario a few months ago, where I had a test client generating load on a 3 lead-server cluster with a variety of Puts, Gets, and Removes. I rebooted one of the servers multiple times while the load test was running and the cache stayed online. If I remember correctly, there were a limited number of errors as that server rebooted, but overall the cache appeared to remain healthy.
I'm not sure why you're not seeing similar results, but I would try removing the Varnish proxy from your test and see if that helps.