API / Server Issue on Google Cloud (google-cloud-platform)

At 9 a.m. this morning I found that my server was unresponsive; I was unable to connect to it via either SSH or HTTP/S.
When I finally logged in, everything was fine other than high CPU usage (averaging 70-80%). In the console I can see that disk I/O was high and that there was also a spike in API requests per second.
Can anyone point me in the right direction so I can find out what happened?
Thanks,

What you experienced was most likely the result of the networking issue described here. In general, this forum is also a good place to stay updated.
Beyond that, if the time window doesn't line up with your observations (no time zone is specified in the question), you would need logs (e.g., Apache's, syslog, etc.) to determine a possible cause, though it can turn into an exercise in speculation after the fact.
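If it comes to that, a rough sketch of pulling the relevant window out of the usual suspects (the paths assume a stock Debian/Ubuntu Apache install and the date patterns are placeholders; adjust both to your system):

# Apache access log and syslog for the hour around the incident (placeholder dates).
grep "30/Aug/2013:09:" /var/log/apache2/access.log
grep "Aug 30 09:" /var/log/syslog
# An out-of-memory kill would also explain an unresponsive box; worth ruling out.
dmesg | grep -iE "oom|killed process"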

Related

Overleaf loses its connection every few minutes

I have two Windows computers, both using the Microsoft Edge browser. When I'm typing on the Overleaf website, the connection is lost every 2-5 minutes. What's worse, some unsynced sentences are gone when the connection resumes. I'm not sure this is a network problem, since all other websites work fine, including Twitter and Gmail. I wonder whether it has to do with the framework the Overleaf cloud service uses. Could anyone give some tips about this issue?

Duration of service alert constantly changing on Nagios

OK, so before I start, full disclosure: I'm pretty new to Nagios (I've only been using it for three weeks), so forgive the lack of brevity in this explanation.
In my environment, which I inherited, I have two redundant Nagios instances running (primary and secondary). On the primary, I added an active check to see whether Apache is running on a select group of remote hosts (modifying commands.cfg and services.cfg). Unfortunately, it didn't go well, so I had to revert to the previous configuration.
Here's where my issue comes in: after reverting the changes (I deleted the added lines and started Nagios back up), the primary instance's web UI shows a particular service going critical intermittently, with a changing duration; e.g., when the service shows as OK, the duration will be 4 hours, but when it's critical, it will show as 10 days (see here for an example host; the screenshots were taken less than a minute apart). This only happens when I refresh any of the Current Status pages, or go to an individual host to view its monitored services and refresh there. Also, note that this is a passive check for the service, with freshness checking enabled.
I've already done a manual check from the primary Nagios server via the CLI, and the status comes back as OK every time. I figured there was stale state somewhere in retention.dat, status.dat, objects.cache, or objects.precache, but even after stopping Nagios, removing those files, starting it back up, and restarting NSCA, the same behavior persists. The secondary Nagios server isn't showing this behavior, reports the correct statuses for all hosts and services, and hasn't been modified either.
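For reference, that cleanup procedure looks roughly like this (a sketch; the paths assume a source install under /usr/local/nagios, and init script names vary by distribution):

# Stop the daemons before touching state files.
service nagios stop
service nsca stop
# Remove the cached/retained state the UI might be serving.
cd /usr/local/nagios/var
rm -f retention.dat status.dat objects.cache objects.precache
# Bring everything back up.
service nagios start
service nsca start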
Any help would be greatly appreciated; thanks in advance! I've already posted on the Nagios Support forums, but to no avail.
EDIT: Never mind. It turns out there were two instances of Nagios running, hence the intermittent behavior. I killed off both, started Nagios again, and it stabilized.
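For anyone who lands here with the same symptom, a quick way to check for duplicate daemons (a sketch; adjust the process name to your install):

# More than one nagios daemon line here means duplicate instances.
ps aux | grep "[n]agios"
# Kill whatever is running, then start a single clean instance.
pkill nagios
service nagios start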

What could be causing a seemingly random AWS EC2 server crash? (Error establishing a database connection)

To begin, I am running a WordPress site on an AWS EC2 Ubuntu micro instance. I have already confirmed that this is NOT an error with WordPress/MySQL.
Seemingly at random, the site will go down and I'll get the "Error establishing a database connection" message. The server reports that it is running just fine, and rebooting usually fixes the issue; however, I'd like to find the cause and resolve it so this stops happening (for the past two weeks it has gone down almost every other day).
It's not a spike in traffic; at least, Google Analytics hasn't shown the site as having any traffic spikes (it averages about 300 visits per day).
What's the cause, and how can this be fixed?
Sounds like you might be running into the CPU throttling that is a limitation of t1.micro instances. If you use too many CPU cycles, you will be throttled.
See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html#available-cpu-resources-during-spikes
The next time this happens, check some general stats on the health of the instance. You can get a feel for its high-level health using the top command (http://linuxaria.com/howto/understanding-the-top-command-on-linux?lang=en). Be sure to look at CPU and memory usage. You may find a process (pid) that is consuming a lot of resources and starving your app.
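For example, something like the following captures a one-off snapshot you can compare against a healthy baseline (a generic sketch; nothing here is specific to your instance):

# One non-interactive iteration of top; the header shows load, CPU, and memory.
top -b -n 1 | head -n 20
# Memory breakdown in megabytes; micro instances have very little to spare.
free -m
# Rule out a full disk while you're at it.
df -h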
More likely, something within your application (how did you come to the conclusion that this is not a WordPress/MySQL issue?) is going out of control. Possibly a database connection is not being released? To see what your app is doing, find the process id (pid) for it:
ps aux | grep "php"
and get a thread dump for that process: kill -3 sends SIGQUIT, which makes a Java process print a thread dump. This will help you see where your application's threads are stuck (if they are).
Typically it's good practice to take two thread dumps a few seconds apart and compare the two. If there is an issue in the application, you should see a lot of threads stuck at the same point.
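Putting those steps together, a sketch (the "php" pattern and the 5-second gap are arbitrary, and note that kill -3 only yields a thread dump from a JVM, so adapt this to whatever runtime you're actually inspecting):

# Grab the first matching pid (assumes one worker of interest).
PID=$(pgrep -f "php" | head -n 1)
# Two dumps a few seconds apart; compare where the threads sit in both.
kill -3 "$PID"
sleep 5
kill -3 "$PID"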
You might also want to check out what MySQL is seeing (https://dev.mysql.com/doc/refman/5.1/en/show-processlist.html):
mysql> SHOW FULL PROCESSLIST;
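If you'd rather capture that from a script than an interactive session, the same statement works via the mysql client's -e flag (credentials here are placeholders):

mysql -u root -p -e "SHOW FULL PROCESSLIST;"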
Hope this helps, let us know what you find!

Strange Apache lag in requests

I have an Apache2 and Django (mod_wsgi) setup that provides a RESTful API. I have a set of automated tests for it that executes ~1000 API requests (plain HTTP GET/POST/PUT/DELETE) in sequential order.
The problem is that every 80 requests or so, I get a strange lag/timeout of exactly 5s or 10s. See these example timestamps:
Request 1: 2013-08-30T03:49:20.915
Response 1: 2013-08-30T03:49:30.940
Request 2: 2013-08-30T03:50:32.559
Response 2: 2013-08-30T03:50:37.597
I can't figure out why this happens. I have an Apache config with KeepAlive Off (a recommended setting for Django), but an otherwise standard install on Ubuntu 12.04 LTS.
I'm running the tests from the same server the webserver is on. I first thought this was some kind of DNS cache issue, so I added the hostname I'm requesting to /etc/hosts, but the problem persists.
The system is idle and has plenty of CPU and memory when these lags/timeouts happen.
The lag is not specific to a certain request (URL); it seems kinda random.
Considering that it's always exactly 5s or 10s to the millisecond, it feels like some specific setting somewhere is causing this.
In case it provides some insight, watch my talk from PyCon US.
http://lanyrd.com/2013/pycon/scdyzk/
The talk deals with things like process churn and startup costs. One thing you shouldn't do is set mod_wsgi's maximum-requests option if you don't really need it.
Also consider trying New Relic to help diagnose where the issue is. That will save a lot of guessing about whether it's a web application issue or a backend service/infrastructure issue.
As far as seeing how such monitoring can help, watch another one of my PyCon talks.
http://lanyrd.com/2012/pycon/spcdg/
This was a DNS issue; adding the domain name I used locally to /etc/hosts actually solved the problem. I just hadn't rebooted the server for the changes to take effect; I thought restarting networking would take care of that, but apparently not.
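For anyone verifying the same fix, two quick checks (the hostname is a placeholder):

# Confirm the /etc/hosts entry is what actually resolves.
getent hosts api.example.com
# Time a request; a floor of ~5s per request usually points at resolver timeouts.
time curl -s -o /dev/null http://api.example.com/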

Is there any way to access information about a ColdFusion server's load from within ColdFusion?

I am writing a scheduled task which I would like to run frequently.
The problem is that I do not want this task to be run if the server is experiencing a high traffic load.
Is there any way, other than getting the free/total/max memory from Java, to figure out whether this task should continue?
GetMetricData() is going to give you a very good indication of how busy your server is, e.g., how many requests are running and how many are queued, as well as other info.
It's the same info you get from running cfstat on the command line (you'll find that under {cfroot}\bin\cfstat.exe).
However, knowing how busy you are at this very moment might not be very useful if you only call that function once. It might be better to log performance data to a file or a database table using Windows perfmon. You can then get the average number of running/queued requests over the past 5 minutes (or whatever window suits you) and decide on that basis whether to run your task.
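On a Linux ColdFusion install you could approximate that with a cron job appending timestamped cfstat snapshots to a file (a sketch; {cfroot} is the placeholder from above, and cfstat's exact single-snapshot invocation varies by ColdFusion version, so check its usage output):

# Crontab entry: every 5 minutes, record the time and a cfstat snapshot.
*/5 * * * * date >> /var/log/cfstat.log; {cfroot}/bin/cfstat >> /var/log/cfstat.log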
There's an easy way to retrieve the memory usage information.
http://misterdai.wordpress.com/2009/11/25/retrieving-coldfusion-memory-usage/
For CPU load, I think you can get it from getMetricData(), but there are other methods too; since this is my first Stack Overflow post I'm only allowed one link :P It's on my blog, though, so just do a search for CPU when you follow the link above.
You might find it useful to dig into getMetricData() for the performance monitoring stats. It's a good way of telling how busy your server is from the number of running and queued requests.
Hope this helps,
Dave (aka Mister Dai)
Use the ColdFusion AdminApi. Call http://servername/CFIDE/adminapi/servermonitor.cfc in your browser to get the cfcdocs of the component. It gives you many methods to get the health of your CF server instance.
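The same component page can be pulled from code with a plain HTTP request (servername is the placeholder from the answer above; note the Admin API normally requires you to authenticate first):

curl "http://servername/CFIDE/adminapi/servermonitor.cfc"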