I'm writing a MapReduce job that gets its input from Accumulo. I'm using AccumuloInputFormat with a RegExFilter. When I run the job, it connects to Accumulo with no problems, but after the connection is established, I see the following warning ad infinitum:
WARN mapreduce.InputFormatBase: Unable to locate bins for specified ranges. Retrying.
I don't think there is anything wrong with Accumulo as I can scan my table of interest from the shell. What am I missing?
You likely want to check the Accumulo monitor and make sure that there are no unassigned tablets. Either your tablet server died, or for some other reason you have unassigned tablets; that is the most obvious reason why you would see that warning repeatedly.
It could also be that ZooKeeper is down, so Accumulo can't find out where its data is.
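As a quick way to rule the ZooKeeper theory out, ZooKeeper answers the four-letter "ruok" command on its client port. A minimal sketch in Python, assuming the default port 2181 ("zk-host" is a placeholder for one of your quorum members):

import socket

# Ask ZooKeeper "are you ok?"; a healthy server replies b"imok".
# "zk-host" and 2181 are placeholders for your quorum member and client port.
s = socket.create_connection(("zk-host", 2181), timeout=5)
s.sendall(b"ruok")
print(s.recv(4))
s.close()

No reply (or a refused connection) would point at ZooKeeper rather than Accumulo itself.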
I am working with a customer who is having issues with their GCP Cloud SQL deployment. Their questions are transcribed here:
When connecting to Cloud SQL, connections often fail intermittently. This can look like a Python error:
(psycopg2.DatabaseError) server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
Or, in Node, it can look like a timeout error or a socket hang up:
TimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
We have everything configured correctly, as far as we can tell, and have followed all the instructions in the Cloud SQL troubleshooting guide. We have an instance with 20GB of memory that should support 250 connections. The timeouts should be set to refresh the connections at the right intervals (< 10 min). So we're not sure what's going on here.
I know that isn't a ton to go on but I wanted to try and do my due diligence in seeing how we can help them. I realize we may not get a perfect answer on what is going on but some additional questions I can ask of them to help debug the issue would be a great help to start with.
I found this similar question that seems to be describing the same issue but it has no answers: PostgreSQL 'Sever closed the connection unexpectedly'
Thanks for any help!
As the error suggests, it's not clear what caused the connection to be closed. I would suggest looking into the Cloud SQL error logs (within your Google Cloud Console) for detailed information on why the connection was closed, as was the case in this GitHub issue (the wrong role was assigned).
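On the client side, it can also help to make the pool discard stale connections before handing them out. A minimal sketch, assuming SQLAlchemy over psycopg2 (the DSN and pool sizes are placeholders, not your customer's values):

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://user:pass@CLOUD_SQL_IP/dbname",  # placeholder DSN
    pool_size=5,
    max_overflow=2,
    # Retire connections before the <10 min refresh window they mention.
    pool_recycle=540,
    # Test each connection with a lightweight ping before checkout, so a
    # server-side drop is replaced instead of surfacing as
    # "server closed the connection unexpectedly".
    pool_pre_ping=True,
)

The same idea applies to the Node error: Knex exposes pool min/max sizes and acquire timeouts that can be tuned so idle connections are not held past the server's cutoff.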
Is there a way to determine through a command line interface or other trick if an AWS EC2 instance is ready to receive ssh connections?
The running state seems not to be enough. Trying to connect in the first minutes of the running state, the machine's status checks still show "initialising" and ssh times out while trying to connect.
(I am using the awscli pip package.)
Running is similar to turning a computer on and finishing the BIOS check. As far as the hypervisor is concerned, your instance is on.
The best way to know when your instance is ready, is to run a script at the end of startup (or when certain services are on) that will report its status to some other listener. Using that data, or event, you should know that your instance is ready to be connected to. This is purposely vague since there are so many different ways this can be accomplished.
You could also time the expected startup, try to connect after that, and retry the connection if it fails. You still need a point at which you would stop trying, as instances can fail to launch in some cases.
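A minimal sketch of that retry approach, using boto3 (the Python SDK that backs the awscli package); the instance id is a placeholder, and the final socket probe confirms sshd is actually listening rather than just that the status checks passed:

import socket
import time

import boto3  # uses the same credentials/config as awscli

ec2 = boto3.client("ec2")
instance_id = "i-0123456789abcdef0"  # placeholder

# Block until both the system and instance status checks report "ok".
ec2.get_waiter("instance_status_ok").wait(InstanceIds=[instance_id])

# Even then sshd can lag a few seconds behind, so probe port 22 directly.
desc = ec2.describe_instances(InstanceIds=[instance_id])
ip = desc["Reservations"][0]["Instances"][0]["PublicIpAddress"]
for attempt in range(20):
    try:
        socket.create_connection((ip, 22), timeout=5).close()
        print("ready for ssh")
        break
    except OSError:
        time.sleep(10)

The pure-CLI equivalent of the waiter is aws ec2 wait instance-status-ok --instance-ids <id>, though that only covers the status checks, not sshd itself.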
OK, so before I start, full disclosure: I'm pretty new to Nagios (I've only been using it for 3 weeks), so forgive me for the lack of brevity in this explanation.
In the environment I inherited, I have two redundant Nagios instances running (primary and secondary). On the primary, I added an active check for whether Apache is running on a select group of remote hosts (modifying commands.cfg and services.cfg). Unfortunately, it didn't go well, so I had to revert to the previous configuration.
Where my issue comes in is this: after reverting the changes (deleted the added lines, started Nagios back up), the primary Nagios instance's web UI shows a particular service going critical intermittently, with the duration changing too, e.g., when the service shows as OK the duration reads 4 hours, but when it's critical it reads 10 days (see here for an example host; the screenshots were taken less than a minute apart). This only happens when I refresh any of the Current Status pages, or when I go to an individual host to view its monitored services and refresh there. Also of note: this is a passive check for the service, with freshness checking enabled.
I've already done a manual check from the primary Nagios server via the CLI, and the status comes back as OK every time. I figured there was stale state somewhere in retention.dat, status.dat, objects.cache, or objects.precache, but even after stopping Nagios, removing said files, starting it back up, and restarting NSCA, the same behavior persists. The secondary Nagios server isn't showing this behavior, shows the correct statuses for all hosts and services, and no modifications were made to it either.
Any help would be greatly appreciated; thanks in advance! I've already posted on the Nagios Support forums, but to no avail.
EDIT: Never mind. Turns out there were two instances of Nagios running, hence the intermittent nature. Killed off both and started Nagios again and it stabilized.
To begin, I am running a WordPress site on an AWS EC2 Ubuntu micro instance. I have already confirmed that this is NOT an error with WordPress/MySQL.
Seemingly at random, the site will go down and I'll get the "Error establishing database connection" message. The server says that it is running just fine, and rebooting usually fixes the issue; however, I'd like to figure out the cause and resolve it so this can stop happening (for the past 2 weeks it has gone down almost every other day).
It's not a spike in traffic, or at least Google Analytics hasn't shown the site as having any spikes in traffic (it averages about 300 visits per day.)
What's the cause, and how can this be fixed?
Sounds like you might be running into the throttling that is a limitation of t1.micro instances. If you use too many CPU cycles, you will be throttled.
See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html#available-cpu-resources-during-spikes
The next time this happens, I would check some general stats on the health of the instance. You can get a feel for its high-level health using the 'top' command (http://linuxaria.com/howto/understanding-the-top-command-on-linux?lang=en). Be sure to look at CPU and memory usage; on a t1.micro, a high steal (st) percentage in top's CPU line is a telltale sign that the hypervisor is throttling you. You may find a process (pid) that is consuming a lot of resources and starving your app.
More likely, something within your application (how did you come to the conclusion that this is not a WordPress/MySQL issue?) is going out of control. Possibly there is a database connection not being released? To see what your app is doing, find the process id (pid) for your app:
ps aux | grep "php"
and get a thread dump for that process (kill -3 <pid> produces a thread dump for a Java process; for PHP, attaching strace to a stuck pid serves the same purpose). This will help you see where your application's threads are stuck (if they are).
Typically it's good practice to execute two thread dumps a few seconds apart and compare trends in both. If there is an issue in the application, you should see a lot of threads stuck at the same point.
You might also want to checkout what MySQL is seeing (https://dev.mysql.com/doc/refman/5.1/en/show-processlist.html).
mysql> SHOW FULL PROCESSLIST;
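If you'd rather watch this over time than eyeball the shell, here is a minimal sketch that counts idle connections (it assumes the pymysql package and placeholder credentials). A steadily growing number of connections stuck in the Sleep state would support the connection-leak theory:

import pymysql

# Placeholder credentials; point this at the WordPress database.
conn = pymysql.connect(host="localhost", user="wp", password="secret",
                       database="wordpress")
with conn.cursor() as cur:
    cur.execute("SHOW FULL PROCESSLIST")
    rows = cur.fetchall()
conn.close()

# Column 5 ("Command") is "Sleep" for idle connections.
sleeping = [r for r in rows if r[4] == "Sleep"]
print(len(rows), "connections,", len(sleeping), "sleeping")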
Hope this helps, let us know what you find!
I am running a c1.medium machine. Once in a while, the machine becomes unresponsive. Is there a way to detect this?
My initial solution was to write a simple script which pings this machine every hour. But this failed because the machine responded to pings yet was unresponsive (really weird!).
Is there a better way to monitor this?
Then you have to define what "unresponsive" means. If it's unresponsive on the http port (80), you can probe that specific port - the instance might be up but the process dead.
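As a sketch of that kind of probe (the URL is a placeholder), a cron-friendly Python check that fails when the web process stops answering, even if the host still replies to ping:

import sys
import urllib.request

URL = "http://ec2-203-0-113-10.compute-1.amazonaws.com/"  # placeholder host

try:
    # An HTTP error status raises, and so does a refused or timed-out socket.
    with urllib.request.urlopen(URL, timeout=10) as resp:
        resp.read(1)
except OSError:
    sys.exit(1)  # the instance may be up but the web process is dead
print("ok")

Exit status 1 is what a cron wrapper or monitoring agent would key off to send the alert.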