Hadoop single node cluster slows down AWS instance

Happy ugly Christmas sweater day :-)
I am running into some strange problems with my AWS Ubuntu 16.04 instance running Hadoop 2.9.2.
I have just successfully installed and configured Hadoop to run in pseudo-distributed mode. Everything seems fine: when I start HDFS and YARN I don't get any errors. But as soon as I try something as simple as listing the contents of the root HDFS directory, or creating a new directory, the whole instance becomes extremely slow. I wait about 10 minutes and it never produces a directory listing, so I hit Ctrl+C, and it takes another 5 minutes to kill the process. Then I try to stop HDFS and YARN; that succeeds, but it also takes a very long time. Even after HDFS and YARN have been stopped, the instance remains barely responsive. At that point all I can do to make it behave normally again is go to the AWS console and restart it.
Does anyone have any idea what I might've screwed up? (I am pretty sure it's something I did. It usually is :-) )
Thank you.

Well, I think I figured out what was wrong, and the answer is trivial. Basically, my EC2 instance doesn't have enough RAM. It's a basic free-tier-eligible instance, and by default it comes with only 1 GB of RAM. Hilarious. Totally useless.
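If you want to confirm the same thing on your own instance, a quick sketch using standard Linux commands (nothing Hadoop-specific):

free -m                                  # total / used / free RAM in MB
sudo dmesg | grep -i "killed process"    # shows whether the OOM killer has been terminating the Java daemons

With the NameNode, DataNode, ResourceManager and NodeManager JVMs all running at once, 1 GB disappears very quickly, which matches the behaviour described above.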
But I learned something useful anyway. One other thing I had to do to make my Hadoop installation work (I was getting a "connection refused" error, but I did make it work) was that in core-site.xml I had to change the line that says
<value>hdfs://localhost:9000</value>
to
<value>hdfs://ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com:9000</value>
(replace the XXXs in the above with your instance's IP address)
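For context, the surrounding property block in core-site.xml typically looks something like this (fs.defaultFS is the Hadoop 2.x property name; the hostname is your instance's public DNS name):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com:9000</value>
  </property>
</configuration>

You will generally need to restart HDFS and YARN for the change to take effect.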

Related

Unable to connect to runtime & how to avoid disconnecting

I've been running a few ML training sessions on a GCE VM (with Colab). At first they saved me a good deal of time and compute, but, like everything Google so far, the runtime ultimately disconnects and I cannot reconnect to my VM despite it still being there.
a) How do we reconnect to a runtime if the VM still exists, we have been disconnected, and Colab says it cannot reconnect to the runtime?
b) How do we avoid disconnecting, or this issue at all? I am using Colab Pro+ and paying for the VMs, and they always cut out at some point, so another week of time goes out the window. I must be doing something wrong; there's no way we pay just to lose all of our progress and time over and over and have to restart hoping it doesn't collapse again (it's been about two weeks of lost time, and I'm just wondering why a GCE VM can't run a job for 4 days without collapsing at some point). What am I doing wrong? I just want to pay for an external resource that runs the jobs I pay for, with no connect/disconnect/lose-everything issue every few days. I don't understand why Google does this.

How to store rabbitmq RABBITMQ_MNESIA_DIR on remote disk

We have two EC2 servers. One has RabbitMQ on it; the second is a new one for storage purposes. Both are Amazon Linux 2.
On the second one we have a newly purchased volume mounted at /data:
/dev/nvme1n1  70G  104M  70G  1%  /data
That is where we would love to push our RabbitMQ queues and data. Basically, we would like to set RABBITMQ_MNESIA_DIR on the first RabbitMQ server so that it connects directly to the second instance and saves queues in the remote /data mentioned above.
Currently it is /var/lib/rabbitmq/mnesia, and our RabbitMQ config file is just the default /etc/rabbitmq/rabbitmq.conf.
I wonder if somebody has done this before, or can point us in the right direction on how to set RABBITMQ_MNESIA_DIR so it connects directly to the remote EC2 instance and stores and works with queues from there. Thank you
At the end of the day, #Parsifal was right.
We ended up making one instance bigger and changing RABBITMQ_MNESIA_DIR instead.
This was a bit tricky, because the problems only showed up after restarting the service with service rabbitmq-server restart.
First, we needed to make sure we had the right permissions on the /data/mnesia directory we had mounted; I managed it with chmod -R 755 /data, though read/write should be sufficient based on the docs.
Then we had to figure out why it kept producing errors like "broker forced connection closure with reason 'shutdown'" and "Error on AMQP connection" right after the start.
So I checked the ownership of the current mnesia dir against the new one, and it turned out the user and group were root:root instead of the original ones.
I switched it to match the original (drwxr-xr-x 4 rabbitmq rabbitmq 97 Dec 16 14:57 mnesia) and it started working.
Maybe this will save you some headaches; I didn't realize there was a separate rabbitmq user and group, since I didn't create it myself.
The only other thing to add: when you move the current working mnesia dir, you should copy its contents to the new location, since a lot of state is actively used and run from there. I tried it without copying, and even the admin password didn't work :D
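Putting the steps above together, a rough sketch of the migration (paths assume the new volume is mounted at /data; as far as I can tell from the docs, /etc/rabbitmq/rabbitmq-env.conf is the standard place for this setting, with the RABBITMQ_ prefix dropped):

sudo service rabbitmq-server stop

# point RabbitMQ at the new location
echo 'MNESIA_DIR=/data/mnesia' | sudo tee -a /etc/rabbitmq/rabbitmq-env.conf

# copy the existing state so users, passwords and queues survive the move
sudo cp -a /var/lib/rabbitmq/mnesia /data/

# the directory must be owned by the rabbitmq user, not root
sudo chown -R rabbitmq:rabbitmq /data/mnesia
sudo chmod -R 755 /data/mnesia

sudo service rabbitmq-server start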

Shutdown scripts to run upon AWS termination

I am trying to get some scripts to run upon an AWS termination action. I have created /etc/init.d/Script.sh and symlinked it to /etc/rc0.d/K01Script.sh.
However, terminating through the AWS console did not produce the output I was looking for. (The script does a quick API call to a server over HTTPS and should take only a few seconds.)
Then I tried again but specifically changed a kernel parameter:
sudo sysctl -w kernel.poweroff_cmd=/etc/rc0.d/K01Script.sh
and again no output.
I get the message "The system is going down for power off NOW!" when terminating the server, so I'm pretty sure the Ubuntu server is going into runlevel 0. The script is owned by root.
I know I could create a lifecycle hook to do something like this, but my team prefers the quick and dirty way.
Any help very much appreciated!
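In case it helps, a minimal sketch of what a kill script in /etc/rc0.d is generally expected to look like (the script name comes from the question; the API URL is a placeholder; the key point is that rc0.d scripts are invoked with the stop argument, so the script has to handle it):

#!/bin/sh
# /etc/init.d/Script.sh  (symlinked as /etc/rc0.d/K01Script.sh)
### BEGIN INIT INFO
# Provides:          notify-api
# Required-Start:
# Required-Stop:
# Default-Start:
# Default-Stop:      0 6
# Short-Description: Notify an API endpoint on shutdown
### END INIT INFO

case "$1" in
  stop)
    # quick HTTPS call before the instance goes away (placeholder URL)
    curl -s -m 10 https://example.com/api/notify-shutdown || true
    ;;
  *)
    ;;
esac
exit 0

Two caveats: on systemd-based Ubuntu releases the rc0.d links are only honoured through the SysV compatibility layer, so it is worth confirming the script is actually picked up there; and, as far as I know, kernel.poweroff_cmd is only consulted when the kernel itself initiates an orderly power-off (for example a thermal emergency), so setting it would not make the script run on a normal shutdown.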

Nexus running with JettyServer will not start

Firstly please forgive my ignorance here - it's the first time I have asked a question on here and I am definitely out of my league. I have two staff members who would normally maintain this application but both are completely unavailable for some time yet.
We run an instance of Sonatype Nexus 2.11.1-01 using JettyServer on an Ubuntu instance on AWS. This morning we attempted to take a snapshot of the instance and the process froze up completely. We had to cancel it, and since then Nexus will not run. There is simply a message "Nexus OSS failed to run".
I've tried this as different users, and oddly there don't appear to be any entries in the logs for the last 4 hours or so, which is around the time it initially stopped working. Since then, despite many attempts at restarting, there is nothing in them, unless I am missing logs stored somewhere else.
Again I apologise for any ignorance on my part but this isn't normally my forte and it is really important I get this running again. Thanks in advance for any help.
The problem was that during the process of creating the snapshot on AWS, the /var/run/nexus directory was deleted. This is pretty frightening and we haven't actually got to the bottom of it, but we have created that directory again, given ownership to nexus:nexus, restarted, and everything is working again.
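For anyone hitting the same thing, the fix amounted to something like this (the directory name is from the answer above; that Nexus runs as a nexus user is our setup and may differ in yours):

sudo mkdir -p /var/run/nexus
sudo chown nexus:nexus /var/run/nexus
sudo service nexus start

One possible explanation, though unconfirmed: on many distributions /var/run is a tmpfs that comes back empty after a reboot, so if the failed snapshot forced a restart, any directory that nothing recreates at startup would simply be gone.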

What could be causing a seemingly random AWS EC2 server crash? (Error: couldn't establish database connection)

To begin, I am running a Wordpress site on an AWS EC2 Ubuntu Micro instance. I have already confirmed that this is NOT an error with Wordpress/mysql.
Seemingly at random the site will go down and I'll get the "Error establishing database connection" message. The server says that it is running just fine, and rebooting usually fixes the issue; however, I'd like to figure out the cause and resolve it so this stops happening (for the past 2 weeks it has gone down almost every other day).
It's not a spike in traffic, or at least Google Analytics hasn't shown the site as having any spikes in traffic (it averages about 300 visits per day.)
What's the cause, and how can this be fixed?
Sounds like you might be running into the CPU throttling that is a limitation of t1.micro instances. If you use too many CPU cycles, you will be throttled.
See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html#available-cpu-resources-during-spikes
The next time this happens I would check some general stats on the health of the instance. You can get a feel for the high-level health of the instance using the 'top' command (http://linuxaria.com/howto/understanding-the-top-command-on-linux?lang=en). Be sure to look for CPU and memory usage. You may find a process (pid) that is consuming a lot of resources and starving your app.
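For example, something along these lines gives a quick snapshot you can save for later comparison:

top -b -n 1 | head -20    # one-shot, scriptable view of load average and the busiest processes
free -m                   # how much memory is actually free (a micro instance has very little headroom)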
More likely, something within your application (how did you come to the conclusion that this is not a Wordpress/MySQL issue?) is going out of control. Possibly there is a database connection not being released? To see what your app is doing, find the process id (pid) for your app:
ps aux | grep "php"
and get a thread dump for that process (for a Java process, kill -3 <pid> triggers one). This will help you see where your application's threads are stuck (if they are).
Typically it's good practice to execute two thread dumps a few seconds apart and compare trends in both. If there is an issue in the application, you should see a lot of threads stuck at the same point.
You might also want to check out what MySQL is seeing (https://dev.mysql.com/doc/refman/5.1/en/show-processlist.html).
mysql> SHOW FULL PROCESSLIST;
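A couple of related checks that are often useful alongside it (standard MySQL statements, not specific to this setup):

mysql> SHOW GLOBAL STATUS LIKE 'Threads_connected';
mysql> SHOW VARIABLES LIKE 'max_connections';

If Threads_connected is pinned at max_connections, connections aren't being released, which would line up with the "Error establishing database connection" message.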
Hope this helps, let us know what you find!