Nexus running with JettyServer will not start - jetty

Firstly please forgive my ignorance here - it's the first time I have asked a question on here and I am definitely out of my league. I have two staff members who would normally maintain this application but both are completely unavailable for some time yet.
We run an instance of Sonatype Nexus 2.11.1-01 using JettyServer on an Ubuntu instance on AWS. This morning we attempted to take a snapshot of the instance and the process froze up completely. We had to cancel this and since then Nexus will not run. There is simply a message "Nexus OSS failed to run".
I've tried this as different users and, oddly, there don't appear to be any entries in the logs for the last 4 hours or so, which is around the time it initially stopped working. Since then, despite many attempts at restarting, there is nothing in them, unless I am missing some logs that are stored somewhere else.
Again I apologise for any ignorance on my part but this isn't normally my forte and it is really important I get this running again. Thanks in advance for any help.

The problem was that during the process of creating the snapshot on AWS, the /var/run/nexus directory was deleted. This is pretty frightening - haven't actually got to the bottom of that, but we have created that directory again and given ownership to nexus:nexus, restarted and everything is working again.
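The fix described above (recreate the directory, hand it to nexus:nexus) can be sketched as a small script. This is a minimal sketch, not the poster's actual commands: the function name `ensure_run_dir` is hypothetical, and changing ownership to another user normally requires root. On many Ubuntu releases /var/run is backed by a tmpfs, so a reboot can empty it; that may or may not be what happened here.

```python
import os
import shutil


def ensure_run_dir(path, user=None, group=None):
    """Create `path` if it is missing and optionally set its owner.

    `ensure_run_dir` is a hypothetical helper name. Passing `user` or
    `group` calls shutil.chown, which usually needs root privileges.
    """
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    if user is not None or group is not None:
        shutil.chown(path, user=user, group=group)
    return path


# Example call (run as root; path and owner are the ones named in the answer):
# ensure_run_dir("/var/run/nexus", user="nexus", group="nexus")
```

Running something like this before starting Nexus makes the start-up tolerant of the directory having disappeared.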

Related

Unable to connect to runtime & how to avoid disconnecting

I've been running a few ML training sessions on a GCE VM (with Colab). At the start they save me a good deal of time and computing resources, but, like everything Google so far, ultimately the runtime disconnects and I cannot reconnect to my VM despite it still being there.
a) How do we reconnect to a runtime if the VM exists, we have been disconnected, and it says it cannot reconnect to the runtime?
b) How do we avoid disconnecting (and this issue) altogether? I am using Colab Pro+ and paying for VMs. They always cut out at some point and it's just another week of time gone out the window. I must be doing something wrong, as there's no way we pay just to lose all of our progress and time and have to restart in the hope it doesn't collapse again (it's been about 2 weeks of lost time and I'm just wondering why GCE VMs can't just run a job for 4 days without collapsing at some point). What am I doing wrong? I just want to pay for an external resource that runs the jobs I pay for, with no connect/disconnect/lose-everything issue every few days. I don't understand why Google does this.

Hadoop single node cluster slows down AWS instance

Happy ugly Christmas sweater day :-)
I am running into some strange problems with my AWS Linux 16.04 instance running Hadoop 2.9.2.
I have just successfully installed and configured Hadoop to run in a simulated distributed mode. Everything seems to be fine. When I start hdfs and yarn I don't get any errors. But as soon as I try to do even something as simple as list the contents of the root hdfs directory, or create a new directory, the whole instance becomes super slow. I wait for about 10 minutes and it never produces a directory listing, so I hit Ctrl+C and it takes another 5 minutes to kill the process. Then I try to stop both hdfs and yarn, and it succeeds but also takes a long time. Even after hdfs and yarn have been stopped the instance is still barely responsive. At this point all I can do to make it function normally again is to go to the AWS console and restart it.
Does anyone have any idea what I might've screwed up (I am pretty sure it's something I did. It usually is :-) )?
Thank you.
Well, I think I figured out what was wrong and the answer is trivial. Basically, my ec2 instance doesn't have enough RAM. It's a basic free tier eligible instance and by default it comes with only 1GB of RAM. Hilarious. Totally useless.
But I learned something useful anyway. One other thing I had to do to make my Hadoop installation work (I was getting "connection refused" error but I did make it work) was that in core-site.xml file I had to change the line that says
<value>hdfs://localhost:9000</value>
to
<value>hdfs://ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com:9000</value>
(replace the XXXs in the above with your instance's IP address)
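The same edit can be applied programmatically. This is a minimal sketch under assumptions: the helper name `replace_value`, the property name `fs.defaultFS` in the sample, and the host `namenode.example` are all illustrative and not taken from the post. It rewrites any `<value>` whose text matches the old URI.

```python
import xml.etree.ElementTree as ET


def replace_value(xml_text, old, new):
    """Return xml_text with every <value> equal to `old` changed to `new`.

    Hypothetical helper. ElementTree drops comments and the XML
    declaration when re-serializing, so keep a backup of the real file.
    """
    root = ET.fromstring(xml_text)
    for value in root.iter("value"):
        if value.text is not None and value.text.strip() == old:
            value.text = new
    return ET.tostring(root, encoding="unicode")


# Sample config; the property name fs.defaultFS is an assumption.
sample = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>"""

# namenode.example is a placeholder host, not a real address.
updated = replace_value(sample, "hdfs://localhost:9000", "hdfs://namenode.example:9000")
print(updated)
```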

Very slow: ActiveRecord::QueryCache#call

I have an app on heroku, running on Puma:
workers 2
threads_count 3
pool 5
It looks like some requests get stuck in the middleware, and it makes the app very slow (VERY!).
I have seen other people threads about this problem but no solution so far.
Please let me know if you have any hint.
I work for Heroku support, and Middleware/Rack/ActiveRecord::QueryCache#call is commonly reported as a problem by New Relic. Unfortunately, it's usually a red herring, as each time the source of the problem lies elsewhere.
QueryCache is where Rails first tries to check out a connection for use, so any problems with a connection will show up here as a request getting 'stuck' waiting. This doesn't mean the database server is out of connections necessarily (if you have Librato charts for Postgres they will show this). It likely means something is causing certain database connections to enter a bad state, and new requests for a connection are waiting. This can occur in older versions of Puma where multiple threads are used and the reaping_frequency is set - if some connections get into a bad state and the others are reaped this will cause problems.
Some high-level suggestions are as follows:
Upgrade Ruby & Puma
If using the rack-timeout gem, upgrade that too
These upgrades often help. If not, there are other options to look into such as switching from threads to worker based processes or using a Postgres connection pool such as PgBouncer. We have more suggestions on configuring concurrent web servers for use with Postgres here: https://devcenter.heroku.com/articles/concurrency-and-database-connections
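To make the connection arithmetic concrete, here is a rough sketch using the numbers from the question (2 workers, 3 threads, pool of 5). It assumes each Puma worker is a separate process with its own ActiveRecord pool, and that `threads_count` is the maximum thread count per worker; the function name `connection_budget` is hypothetical.

```python
def connection_budget(workers, threads, pool):
    """Estimate database connection usage for a clustered Puma setup.

    Assumes one ActiveRecord pool per worker process (hypothetical helper).
    """
    return {
        # Upper bound on connections this app instance may open.
        "max_connections": workers * pool,
        # Each thread may need its own connection, so the pool
        # should be at least as large as the thread count.
        "threads_fit_pool": threads <= pool,
    }


budget = connection_budget(workers=2, threads=3, pool=5)
print(budget)
```

If `max_connections` multiplied by the number of dynos approaches the database plan's connection limit, a connection pooler such as PgBouncer becomes more attractive.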
I will answer my own question:
I simply had to check all queries to my DB. One of them was taking a VERY long time, and even though it was not requested often, it would slow down the whole server for quite some time afterwards (even after the process was done, there was a sort of "traffic jam" on the server).
Solution:
Check all the queries to your database and fix the slowest ones (that might simply mean breaking a query down into a few steps, or running it at night when there is no traffic, etc.).
Once these queries are fixed, everything should go back to normal.
I recently started seeing a spike in time spent in ActiveRecord::QueryCache#call. After looking at the source, I decided to try clearing said cache using ActiveRecord::Base.connection.clear_query_cache from a Rails console attached to the production environment. The error I got back was "PG::ConnectionBad: could not fork new process for connection: Cannot allocate memory", which led me to this other SO question: "Heroku Rails could not fork new process for connection: Cannot allocate memory".

Duration of service alert constantly changing on Nagios

OK, so before I start, full disclosure: I'm pretty new to Nagios (only been using it for 3 weeks), so forgive me for lack of brevity in this explanation.
In my environment which I inherited, I have two redundant Nagios instances running (primary and secondary). On the primary, I added an active check for seeing if Apache is running on a select group of remote hosts (modifying commands.cfg and services.cfg). Unfortunately, it didn't go well so I had to revert the changes back to the previous configuration.
Where my issue comes in is this: after reverting the changes (deleted the added lines, started Nagios back up), the primary instance of Nagios' web UI is showing that a particular service is going critical intermittently with a change in duration, e.g., when the service is showing as OK, it'll be 4 hours but when it's critical, it'll show as 10 days (see here for an example host; the screenshots were taken less than a minute apart). This is only happening when I'm refreshing any of the Current Status pages or going to an individual host to view monitored services and refreshing there as well. Also, to note, this is a passive check for the service with checking freshness enabled.
I've already done a manual check from the primary Nagios server via the CLI and the status comes back as OK every time. I figured that there was a stale state somewhere in retention.dat, status.dat, objects.cache, or objects.precache, but even after stopping Nagios, removing said files, starting it back up, and restarting NSCA, the same behavior persists. The secondary Nagios server isn't showing this behavior: it shows the correct statuses for all hosts and services, and no modifications were made to it.
Any help would be greatly appreciated and in advance, thanks! I've already posted up on the Nagios Support forums, but to no avail.
EDIT: Never mind. Turns out there were two instances of Nagios running, hence the intermittent nature. Killed off both and started Nagios again and it stabilized.
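A quick way to check for the duplicate-daemon situation described in the edit is to look for more than one top-level process with the same name. The sketch below is an illustration under assumptions: the helper names are hypothetical, the process name is assumed to be `nagios`, and it ignores processes whose parent is also `nagios`, since the daemon may fork children of its own.

```python
import subprocess


def top_level_pids(ps_output, name):
    """Return PIDs of processes called `name` whose parent is not also `name`.

    Expects lines of "pid ppid comm" with no header (hypothetical helper).
    """
    rows = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[2].strip() == name:
            rows.append((int(parts[0]), int(parts[1])))
    same_name = {pid for pid, _ in rows}
    return sorted(pid for pid, ppid in rows if ppid not in same_name)


def running_daemons(name="nagios"):
    """List top-level PIDs for `name` using ps (headers suppressed with '=')."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    return top_level_pids(out, name)
```

If `running_daemons()` returns more than one PID, two independent daemons are probably running and writing to the same status files, which would match the flapping seen in the web UI.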

What could be causing seemingly random AWS EC2 server to Crash? (Error couldn't establish database connection)

To begin, I am running a Wordpress site on an AWS EC2 Ubuntu Micro instance. I have already confirmed that this is NOT an error with Wordpress/mysql.
Seemingly at random the site will go down and I'll get the "Error establishing database connection" message. The server says that it is running just fine, and rebooting usually fixes the issue, however I'd like to figure out the cause and resolve the issue so this can stop happening (it's been the past 2 weeks now that it goes down almost every other day.)
It's not a spike in traffic, or at least Google Analytics hasn't shown the site as having any spikes in traffic (it averages about 300 visits per day.)
What's the cause, and how can this be fixed?
Sounds like you might be running into the throttling that is a limitation of t1.micro. If you use too many CPU cycles you will be throttled.
See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html#available-cpu-resources-during-spikes
The next time this happens I would check some general stats on the health of the instance. You can get a feel for the high-level health of the instance using the 'top' command (http://linuxaria.com/howto/understanding-the-top-command-on-linux?lang=en). Be sure to look for CPU and memory usage. You may find a process (pid) that is consuming a lot of resources and starving your app.
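If logging in to run top at the right moment is impractical, the memory side of that check can be scripted and logged periodically. This is a minimal sketch, assuming a Linux host that exposes /proc/meminfo; the function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_used_fraction(info):
    """Fraction of RAM in use, preferring MemAvailable when the kernel reports it."""
    total = info["MemTotal"]
    available = info.get("MemAvailable", info.get("MemFree", 0))
    return (total - available) / total


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live meminfo file (Linux only)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

Calling `memory_used_fraction(read_meminfo())` from a cron job and appending the result to a log gives a memory history to inspect after the next outage.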
More likely, something within your application (how did you come to the conclusion that this is not a Wordpress/MySQL issue?) is going out of control. Possibly there is a database connection not being released? To see what your app is doing, find the process id (pid) for your app:
ps aux | grep "php"
and get a thread dump for that process. For a Java process, kill -3 <pid> produces a thread dump; a PHP process will not print one in response to that signal. A thread dump helps you see where your application's threads are stuck (if they are).
Typically it's good practice to execute two thread dumps a few seconds apart and compare trends in both. If there is an issue in the application, you should see a lot of threads stuck at the same point.
You might also want to check out what MySQL is seeing (https://dev.mysql.com/doc/refman/5.1/en/show-processlist.html).
mysql> SHOW FULL PROCESSLIST;
Hope this helps, let us know what you find!