Can't access HDFS on Mesosphere DC/OS despite "healthy" status - amazon-web-services

So I've deployed a Mesos cluster in AWS using the CloudFormation script / instructions found here with the default cluster settings (5 private slaves, one public slave, single master, all m3.xlarge), and installed HDFS on the cluster with the dcos command: dcos package install hdfs.
The HDFS service is apparently up and healthy according to the DC/OS web UI and Marathon.
(the problem) At this point I should be able to SSH into my slave nodes and execute hadoop fs commands, but that returns the error -bash: hadoop: command not found (basically telling me there is no hadoop installed here).
There are no errors in the STDOUT and STDERR logging for the HDFS service, but for what it's worth there is a recurring "offer decline" message appearing in the logs:
Processing DECLINE call for offers: [ 5358a8d8-74b4-4f33-9418-b76578d6c82b-O8390 ] for framework 5358a8d8-74b4-4f33-9418-b76578d6c82b-0001 (hdfs) at scheduler-60fe6c75-9288-49bc-9180-f7a271c …
I'm sure I'm missing something silly.

So I figured out a way to at least verify that HDFS is running on your Mesos DC/OS cluster after install.
SSH into your master with the dcos CLI: dcos node ssh --master-proxy --leader
Create a docker container with hadoop installed to query your HDFS: docker run -ti cloudera/quickstart hadoop fs -ls hdfs://namenode-0.hdfs.mesos:9001/
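If you want a slightly stronger check than a bare ls, the same throwaway container can do a round-trip write/read. This is just a sketch; the test file path is an arbitrary placeholder, and the namenode address is whatever your cluster advertises (mine used namenode-0.hdfs.mesos:9001):
docker run -ti cloudera/quickstart bash -c "echo hello > /tmp/hdfs-check.txt &&
  hadoop fs -mkdir -p hdfs://namenode-0.hdfs.mesos:9001/tmp/hdfs-check &&
  hadoop fs -put /tmp/hdfs-check.txt hdfs://namenode-0.hdfs.mesos:9001/tmp/hdfs-check/ &&
  hadoop fs -cat hdfs://namenode-0.hdfs.mesos:9001/tmp/hdfs-check/hdfs-check.txt"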
Why this isn't a good solution & what to look out for:
Previous documentation all points to a default URL of hdfs://hdfs/, which instead will throw a java.net.UnknownHostException. I don't like pointing directly to a namenode.
Other documentation suggests you can run hdfs fs ... commands when you SSH into your cluster - this does not work as documented.
The image I used just to test that you can access HDFS is > 4GB (better options?)
None of this is documented (or at least not clearly/completely, which is why I'm keeping this post updated). I had to dig through DC/OS Slack chat to find an answer.
The Mesosphere/HDFS repo is a completely different version than the HDFS that is installed via dcos package install hdfs. That repo is no longer maintained and the new version isn't open sourced yet (hence the lack of current documentation I guess).
I'm hoping there is an easier way to interface with HDFS that I'm still missing. Any better solutions would still be very helpful!

Related

Is there a time sync process on the Docker for AWS nodes?

I haven't been able to determine if there is a time sync process (such as ntpd or chronyd) running on the docker swarm I've deployed to AWS using Docker Community Edition (CE) for AWS.
I've ssh'd to a swarm manager, but ps doesn't show much, and I don't see anything in /etc or /etc/conf.d that looks relevant.
I don't really have a good understanding of CloudFormation, but I can see that the created instances running the Docker nodes used the AMI Moby Linux 18.09.2-ce-aws1 stable (ami-0f4fb04ea796afb9a). I created a new instance with that AMI so I could ssh there. Still no indication of a time sync process with ps or in /etc.
I suppose one of the swarm control containers that is running may deal with sync'ing time (maybe docker4x/l4controller-aws:18.09.2-ce-aws1)? Or maybe the cloudformation template installed one on the instances? But I don't know how to verify that.
So can anyone tell me whether there is a time sync process running (and where)?
And if not, I feel there should be one, so how might I start one up?
You can verify the resources that are created by the CloudFormation template Docker-no-vpc.tmpl from the link you provided.
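If you have the AWS CLI handy, one way to see exactly what the stack created is to ask CloudFormation directly; the stack name below is a placeholder for whatever you named yours:
aws cloudformation list-stack-resources --stack-name Docker-for-AWS
aws cloudformation describe-stack-resources --stack-name Docker-for-AWS   # more detail per resource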
Second thing: do you think ntpd has something to do with Docker Swarm itself, or should it be installed on the underlying EC2 instance?
SSH into your EC2 instance and verify the status of the service; normally, most AWS AMIs have ntpd installed.
Or you can just type the following to check whether it is present:
ntpd
If you don't find it, you can install it yourself, or you can run Docker Swarm with a custom AMI.
UCP requires that the system clocks on all the machines in a UCP cluster be in sync, or else it can start having issues checking the status of the different nodes in the cluster. To ensure that the clocks in a cluster are synced, you can use NTP to set each machine's clock.
First, on each machine in the cluster, install NTP. For example, to
install NTP on an Ubuntu distribution, run:
sudo apt-get update && sudo apt-get install ntp
On CentOS and RHEL, run:
sudo yum install ntp
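Once installed, a quick way to confirm the daemon is actually running and syncing (the service name differs slightly between distributions, and chrony users would use chronyc instead):
sudo systemctl status ntp     # Debian/Ubuntu service name
sudo systemctl status ntpd    # CentOS/RHEL service name
ntpq -p                       # lists the peers ntpd is polling and the current offsets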
See: what-does-clock-skew-detected-mean
Last thing, do you really need the stack that is created by cloudformation?
EC2 instances + Auto Scaling groups
IAM profiles
DynamoDB Tables
SQS Queue
VPC + subnets and security groups
ELB
CloudWatch Log Group
I know CloudFormation eases our life, but if you do not know the template (what resources will be created), do not run it, otherwise you will end up with a hefty bill at the end of the month.
I would also suggest exploring AWS ECS and EKS; these are services specifically designed for Docker containers.

How to change yarn scheduler configuration on aws EMR?

Unlike HortonWorks or Cloudera, AWS EMR does not seem to give any GUI to change xml configurations of various hadoop ecosystem frameworks.
Logging into my EMR namenode and doing a quick
find / -iname yarn-site.xml
I was able to find it to be located at /etc/hadoop/conf.empty/yarn-site.xml and capacity-scheduler to be located at /etc/hadoop/conf.empty/capacity-scheduler.xml.
But note how these are under conf.empty and I suspect these might not be the actual locations for yarn-site and capacity-scheduler xmls.
I understand that I can set these configurations while creating a cluster, but what I need to know is how to change them without tearing down the cluster.
I just want to play around with scheduling properties and such, and try out different schedulers to identify what might work well with my Spark applications.
Thanks in advance!
Well, yarn-site.xml and capacity-scheduler.xml are indeed in the correct location (/etc/hadoop/conf.empty/); on a running cluster, editing them on the master node and restarting the YARN ResourceManager daemon will change the scheduler.
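For reference, restarting the ResourceManager on the master node after an edit looks something like this (the exact mechanism depends on the EMR release: newer releases use systemd, older ones upstart):
sudo systemctl restart hadoop-yarn-resourcemanager
# or, on older EMR releases managed by upstart:
sudo stop hadoop-yarn-resourcemanager && sudo start hadoop-yarn-resourcemanager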
When spinning up a new cluster, you can use the EMR Configurations API to set the appropriate values. http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
For example: specify appropriate values under the capacity-scheduler and yarn-site classifications in your configuration, and EMR will change those values in the corresponding XML files.
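As a rough sketch, a configurations file for that might look like the following. The property values are just examples (switching YARN to the fair scheduler and changing the capacity scheduler's resource calculator), and the release label, instance type, and file name in the CLI call are placeholders:
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.resourcemanager.scheduler.class": "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
    }
  },
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
    }
  }
]
aws emr create-cluster --release-label emr-5.30.0 --applications Name=Hadoop Name=Spark --use-default-roles --instance-type m5.xlarge --instance-count 3 --configurations file://./scheduler-config.json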
Edit: Sep 4, 2019 :
With Amazon EMR version 5.21.0 and later, you can override cluster configurations and specify additional configuration classifications for each instance group in a running cluster. You do this by using the Amazon EMR console, the AWS Command Line Interface (AWS CLI), or the AWS SDK.
Please see
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps-running-cluster.html

How to run Python Spark code on Amazon Aws?

I have written a python code in spark and I want to run it on Amazon's Elastic Map reduce.
My code works great on my local machine, but I am slightly confused over how to run it on Amazon's AWS?
More specifically, how should I transfer my python code over to the Master node? Do I need to copy my Python code to my s3 bucket and execute it from there? Or, should I ssh into Master and scp my python code to the spark folder in Master?
For now, I tried running the code locally on my terminal and connecting to the cluster address (I did this by reading the output of spark-submit's --help flag, so I might be missing a few steps here)
./bin/spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.1 \
--master spark://hadoop#ec2-public-dns-of-my-cluster.compute-1.amazonaws.com \
mypythoncode.py
I tried it with and without my permissions file i.e.
-i permissionsfile.pem
However, it fails and the stack trace shows something on the lines of
Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
......
......
Is my approach correct, and do I just need to resolve the access issues to get going, or am I heading in the wrong direction?
What is the right way of doing it?
I searched a lot on youtube but couldn't find any tutorials on running Spark on Amazon's EMR.
If it helps, the dataset I am working on it is part of Amazon's public dataset.
go to EMR, create new cluster... [recommendation: start with 1 node only, just for testing purposes].
Click the checkbox to install Spark, you can uncheck the other boxes if you don't need those additional programs.
configure the cluster further by choosing a VPC and a security key (ssh key, a.k.a pem key)
wait for it to boot up. Once your cluster says "waiting", you're free to proceed.
[spark submission via the GUI] In the GUI, you can add a Step, select Spark job, upload your Spark file to S3, and then choose the path to that newly uploaded S3 file. Once it runs, it will either succeed or fail. If it fails, wait a moment and then click "view logs" over on the right of that Step's line in the list of steps. Keep tweaking your script until you've got it working.
[submission via the command line] SSH into the driver node following the ssh instructions at the top of the page. Once inside, use a command-line text editor to create a new file, and paste the contents of your script in. Then spark-submit yourNewFile.py. If it fails, you'll see the error output straight to the console. Tweak your script, and re-run. Do that until you've got it working as expected.
Note: running jobs from your local machine against a remote cluster is troublesome, because you may actually be making your local instance of Spark responsible for some expensive computations and data transfer over the network. That's why you want to submit AWS EMR jobs from within EMR.
There are typically two ways to run a job on an Amazon EMR cluster (whether for Spark or other job types):
Log in to the master node and run Spark jobs interactively. See: Access the Spark Shell
Submit jobs to the EMR cluster. See: Adding a Spark Step
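As a sketch of the second option, adding a PySpark step from the CLI looks roughly like this (the cluster id, bucket, and script name are placeholders):
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps Type=Spark,Name=MyPySparkStep,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://my-bucket/mypythoncode.py]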
If you have Apache Zeppelin installed on your EMR cluster, you can use a web browser to interact with Spark.
The error you are experiencing is saying that the files were accessed via the s3n: protocol, which requires AWS credentials to be provided. If, instead, the files were accessed via s3:, I suspect that the credentials would be sourced from the IAM role that is automatically assigned to nodes in the cluster, and this error would be resolved.
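To illustrate the difference from an EMR node (the bucket and prefix are placeholders, and the spark.hadoop.* properties are only needed if you stay on s3n:):
hadoop fs -ls s3://my-bucket/some/prefix/    # s3:// picks up the instance profile credentials automatically
# if you must keep s3n://, the keys can be passed through spark-submit instead:
spark-submit --conf spark.hadoop.fs.s3n.awsAccessKeyId=YOUR_KEY --conf spark.hadoop.fs.s3n.awsSecretAccessKey=YOUR_SECRET mypythoncode.py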

Using Cloudwatch log service with older AMIs

I want to use the CloudWatch Logs service for programs running on older AMIs (2008-2010). Is there a way I can install it on such machines?
A workaround which I could think of, is to copy log files from these AMIs to the latest AMI with log service installed and upload the logs from there. But the downside is that I will end up paying cost for data transfer. Is there any alternate better way?
When Henry Hahn, in his Amazon CloudWatch Deep Dive presentation, says "I am gonna do a direct install", you find what you need.
$ wget https://s3.amazonaws.com/aws-cloudwatch/downloads/latest/awslogs-agent-setup.py
$ sudo python awslogs-agent-setup.py --region eu-west-1
(the --region can differ in your case)
Accept the defaults.
It will install a service called awslogs, which can be started/stopped like any other service.
Configuration file can be found at /var/awslogs/etc/awslogs.conf
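For reference, a minimal stanza in that file looks roughly like this (the log group, stream name, and monitored file are whatever you choose; datetime_format has to match your log lines):
[general]
state_file = /var/awslogs/state/agent-state

[/var/log/syslog]
file = /var/log/syslog
log_group_name = my-legacy-ami-logs
log_stream_name = {instance_id}
datetime_format = %b %d %H:%M:%S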
For me, this worked on my Debian Jessie notebook, which is definitely not an EC2 instance, so it should work for your older EC2 instances as well.
I expect this will work for a Raspberry Pi too (planning to try that soon).

Setting UP Spark on existing EC2 cluster

I have to access some big files in buckets on Amazon S3 and do processing on them. For this I was planning to use Apache Spark. I have 2 EC2 instances for this learning project. They are only used for small cron jobs, so could I use them to install and run Spark? If so, how do I install Spark on existing EC2 boxes, so that I can make one master and one slave?
If it helps, I installed Spark in standalone mode on one instance, and on the other as well, setting one as master and the other as slave. The detailed instructions I followed are
https://spark.apache.org/docs/1.2.0/spark-standalone.html#installing-spark-standalone-to-a-cluster
See the tutorial on Apache Spark Cluster on EC2 here http://www.supergloo.com/fieldnotes/apache-spark-cluster-amazon-ec2-tutorial/
Yes, you can easily create a master/slave setup with 2 AWS instances. Set SPARK_MASTER_IP=instance_privateIP_1 in spark-env.sh on both instances, and put instance 2's private IP in the slaves file in the conf folder; these configurations are the same on both machines. Other settings, like memory and cores, are set there too. Then you can start it from the master, and make sure Spark is installed in the same location on both machines.
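A rough sketch of what that looks like in practice (the private IPs are placeholders, and $SPARK_HOME is wherever you unpacked Spark on both boxes):
# conf/spark-env.sh, identical on both instances
export SPARK_MASTER_IP=172.31.0.10   # private IP of the instance chosen as master
export SPARK_WORKER_MEMORY=4g        # optional: per-worker memory
export SPARK_WORKER_CORES=2          # optional: per-worker cores

# conf/slaves, on the master: one worker hostname/IP per line
172.31.0.11

# then start the whole cluster from the master
$SPARK_HOME/sbin/start-all.sh
Note that sbin/start-all.sh assumes passwordless SSH from the master to the worker; otherwise start the worker manually with sbin/start-slave.sh on that box.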