AWS Spark images not supporting HDFS - amazon-web-services

I am currently working on running our daily ETLs on EKS instead of EMR. However, I see a few of the jobs are failing (a few of them use HDFS). It looks like the official Docker image provided by AWS doesn't support HDFS.
I am using the 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-5.32.0 image.
bash-4.2$ hdfs
/usr/bin/hdfs: line 8: /usr/lib/hadoop-hdfs/bin/hdfs: No such file or directory
These are the logs for the job failure:
Jobrun failed. Main Spark container terminated with errors. Last error seen in logs - Caused by: java.lang.Exception: java.io.IOException: Incomplete HDFS URI, no host: hdfs:/bucket-name/process-store/checkpoints/086461cd-df44-4ab4-a2ee-da2c5671f9b4\nCaused by: java.io.IOException: Incomplete HDFS URI, no host: hdfs:/bucket-name/process-store/checkpoints/086461cd-df44-4ab4-a2ee-da2c5671f9b4. Please refer logs uploaded to S3/CloudWatch based on your monitoring configuration
Not sure how to proceed from here.
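For what it's worth, a URI like hdfs:/bucket-name/... has no host component, and this image ships no HDFS daemons at all, so one possible direction (a sketch only, assuming the checkpoint location is configurable at submit time rather than hard-coded, and that the jobs can write to S3 via EMRFS) is to point the checkpoints at S3 instead:
# Hypothetical spark-submit sketch: replace the host-less hdfs:/ checkpoint URI with an
# S3 location. Bucket, path, and application JAR below are placeholders.
spark-submit \
  --deploy-mode cluster \
  --conf spark.sql.streaming.checkpointLocation=s3://your-bucket/process-store/checkpoints/ \
  s3://your-bucket/jars/etl-job.jar
If the location is set in code instead, the equivalent would be the checkpointLocation option on the stream writer.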

Related

AWS ECS Task Error - Error response from daemon: driver failed programming external connectivity on endpoint ecs-XXX

I'm new to Docker but know a little bit of AWS. I currently have a Docker image that's working perfectly on localhost. I've pushed my Docker image to ECR, copied the URI, and used that URI in the cluster on ECS. I've created a task on the cluster, but when I try to run the task I get this error:
STOPPED (CannotStartContainerError: Error response from dae)
Expanding the error further provides the error details as:
CannotStartContainerError: Error response from daemon: driver failed programming external connectivity on endpoint ecs-XXXX-New-1-XXXXX-new-c0f1889582c3fea77500
If I SSH into the EC2 instance, I can't pull the Docker image; I get this error:
no basic auth credentials
I've checked all the permissions in IAM and everything is as it should be, and the answers I come across don't mention AWS.
I'm so stuck right now!
Thanks in advance
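The "no basic auth credentials" part usually means the Docker daemon on that instance has not authenticated against ECR. A minimal sketch of the manual login, assuming AWS CLI v2 is installed on the instance (the account ID, region, and image name below are placeholders):
# Hypothetical: authenticate the local Docker daemon against ECR, then retry the pull.
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker pull 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest
For pulls done by ECS itself, it is the container instance's IAM role (or the task execution role on Fargate) that needs the ECR permissions, rather than any manual login.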

AWS ECS task fails to start because the daemon can't create a log stream

I have 2 versions of a service that run in the same cluster. I'm using the awslogs driver.
The v2 logs work fine; however, the v1 task fails to start because it can't create a log stream.
The setup is identical between services except for the container being used.
The log group exists, and the role has permission to create a log stream (logs:CreateLogStream) and put events (logs:PutLogEvents), as this is pretty much the same setup as the v2 service, just with a different group.
CannotStartContainerError: Error response from daemon: failed to initialize logging driver: failed to create Cloudwatch log stream: RequestError: send request failed caused by: Post https://logs.eu-west-1-v1.amazonaws.com/: dial tcp: lookup logs.eu-west-1
I've set up a new service and tried to spin it up again, but it failed, so I thought this was to do with the container setup.
The official documentation here recommends adding this to the environment variables:
ECS_AVAILABLE_LOGGING_DRIVERS '["json-file","awslogs"]'
After adding this, it still failed. I've been searching for a while on this and would appreciate any help or preferably guidance.
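For reference, a sketch of where that agent setting normally goes, assuming an ECS-optimized Amazon Linux 2 container instance (and, for what it's worth, the endpoint in the error, logs.eu-west-1-v1.amazonaws.com, looks as if the awslogs-region option might contain "eu-west-1-v1" instead of "eu-west-1", which is worth double-checking in the v1 task definition):
# Hypothetical: register the awslogs driver with the ECS agent on the container
# instance, then restart the agent so it picks up the change.
echo 'ECS_AVAILABLE_LOGGING_DRIVERS=["json-file","awslogs"]' | sudo tee -a /etc/ecs/ecs.config
sudo systemctl restart ecs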

Spark on EMR - Downloading Different Jar Files

Using a bootstrap action, I am downloading a MySQL connector JAR file to the spark/jars folder. I use the following:
sudo aws s3 cp s3://buck/emrtest/mysql-connector-java-5.1.39-bin.jar /usr/lib/spark/jars
Everything downloads correctly, but I eventually get a provisioning error and the cluster terminates. I get this error:
On 5 slave instances (including i-0505b9beda64e9,i-0f85f4664e1359 and i-00d346a73f717b), application provisioning failed
It doesn't fail on my master node but fails on my slave nodes. I have checked my logs and they don't give me any information. Why does this fail, and how would I go about downloading this JAR file to every node in a bootstrap fashion?
Thanks!
I figured out the answer. First off, there is no useful logging for this; the master node still launches even when provisioning fails.
I was retrieving a file from a private S3 bucket. Note: your local AWS configs do not get inherited by your EMR cluster.
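A sketch of what the bootstrap action could look like with that in mind (the script name is hypothetical; it assumes the cluster's EC2 instance profile, e.g. EMR_EC2_DefaultRole, has s3:GetObject on the bucket, since credentials from a local aws configure are not inherited by the nodes):
#!/bin/bash
# copy-mysql-jar.sh - hypothetical bootstrap action, stored in S3 and referenced when
# creating the cluster; runs on every node, master and slaves alike.
set -euxo pipefail
sudo aws s3 cp s3://buck/emrtest/mysql-connector-java-5.1.39-bin.jar /usr/lib/spark/jars/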

SYNC with master failed: -ERR unknown command 'SYNC'

I was trying to dump my Redis data that is hosted via AWS. I can log into interactive mode via redis-cli, but when I tried dumping the data to an RDB file, I received the error shown in the title:
user@awshost:~/TaoRedisExtract$ redis-cli -h myawsredis.amazonaws.com --rdb redis.dump.rdb
SYNC with master failed: -ERR unknown command 'SYNC'
I'm not sure if this is a bug, a configuration issue, or known/expected behavior for AWS Redis. I've searched and searched and not found any other reports of users getting this error message.
According to a reply to a similar question on the AWS forum, SYNC has been disabled as of Redis version 2.8.22:
"To maintain enhanced replication performance in Multi-AZ replication groups and for increased cluster stability, non-ElastiCache replicas are no longer supported"

Can't access HDFS on Mesosphere DC/OS despite "healthy" status

So I've deployed a Mesos cluster in AWS using the CloudFormation script / instructions found here with the default cluster settings (5 private slaves, one public slave, single master, all m3.xlarge), and installed HDFS on the cluster with the dcos command: dcos package install hdfs.
The HDFS service is apparently up and healthy according to the DC/OS web UI and Marathon.
(the problem) At this point I should be able to SSH into my slave nodes and execute hadoop fs commands, but that returns the error -bash: hadoop: command not found (basically telling me there is no hadoop installed here).
There are no errors coming from the STDOUT and STDERR logging for the HDFS service, but for what it's worth, there is a recurring "offer decline" message appearing in the logs:
Processing DECLINE call for offers: [ 5358a8d8-74b4-4f33-9418-b76578d6c82b-O8390 ] for framework 5358a8d8-74b4-4f33-9418-b76578d6c82b-0001 (hdfs) at scheduler-60fe6c75-9288-49bc-9180-f7a271c …
I'm sure I'm missing something silly.
So I figured out a solution for at least verifying that HDFS is running on your Mesos DC/OS cluster after install.
SSH into your master with the dcos CLI: dcos node ssh --master-proxy --leader
Create a docker container with hadoop installed to query your HDFS: docker run -ti cloudera/quickstart hadoop fs -ls hdfs://namenode-0.hdfs.mesos:9001/
Why this isn't a good solution & what to look out for:
Previous documentation all points to a default URL of hdfs://hdfs/, which instead will throw a java.net.UnknownHostException. I don't like pointing directly to a namenode.
Other documentation suggests you can run hdfs fs ... commands when you SSH into your cluster - this does not work as documented.
The image I used just to test that you can access HDFS is > 4 GB (better options? one lighter-weight check is sketched at the end of this post)
None of this is documented (or at least not clearly or completely, which is why I'm keeping this post updated). I had to dig through the DC/OS Slack chat to find an answer.
The Mesosphere/HDFS repo is a completely different version from the HDFS that is installed via dcos package install hdfs. That repo is no longer maintained, and the new version isn't open sourced yet (hence the lack of current documentation, I guess).
I'm hoping there is an easier way to interface with HDFS that I'm still missing. Any better solutions would still be very helpful!
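One lighter-weight check than pulling the 4 GB image, if it is of any use (a sketch only; it assumes WebHDFS is enabled on the name node and that the host and HTTP port below are correct for this particular package, neither of which I have verified):
# Hypothetical: list the HDFS root over WebHDFS from the master, no Hadoop client needed.
curl "http://namenode-0.hdfs.mesos:50070/webhdfs/v1/?op=LISTSTATUS"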