How to add a shell script after AWS EMR has started

How to add a shell script after AWS EMR has started - amazon-web-services

Currently, I am using a transient cluster, whenever my shell script encounters a failure in "add_step", it shuts down. I have started an EMR to debug this, but don't know where to add and test my script after it has launched.
I clicked on the steps and selected "Custom Jar" and
If I give my shell script in the S3 path as shown in the below screenshot. It fail's. How can I execute the script when EMR is running.
Thanks,
Xi

Here are the detailed steps on how to add
https://emr-etl.workshop.aws/spark_etl/steps.html

Related

How to run a script on a Google Compute Instance?

You can run a startup-script and a shutdown-script, but is it possible to use the Compute Engine API to run a script after startup?
The primary reason I'm asking is because the startup script isn't executing for me upon first run

Yes, this can be done. You can pass the file through a gcloud command or just add it to the instance metadata using the UI. Take a look at the following documentation for startup-script and shutdown-script.

AWS EMR script-runner access error

I'm running emr-5.12.0, with Amazon 2.8.3, Hive 2.3.2, Hue 4.1.0, Livy 0.4.0, Spark 2.2.1 and Zeppelin 0.7.3 on 1 m4.large as my master node and 1 m4.large as core node.
I am trying to execute a bootstrap action that configures some parts of the cluster. One of these includes the line:
sudo sed -i '/zeppelin.pyspark.python/c\ \"zepplin.pyspark.python\" : \"python3\",' /etc/alternatives/zeppelin-conf/interpreter.json
It makes sure that the Zeppelin uses python3.4 instead of python2.7. It works fine if I execute this in the terminal after SSH'ing to the master node, but it fails when I submit it as a Custom JAR step on the AWS Web interface. I get the following error:
ed: can't read /etc/alternatives/zeppelin-conf/interpreter.json
: No such file or directory
Command exiting with ret '2'
The same thing happens if I use
sudo sed -i '/zeppelin.pyspark.python/c\ \"zepplin.pyspark.python\" : \"python3\",' /etc/zeppelin-conf/interpreter.json
Obviously I could just change it from the Zeppelin UI, but I would like to include it in the bootstrap action.
Thanks!

It turns out that a bootstrap action submitted throug the AWS EMR web interface is submitted as a regular EMR step, so it's only run on the master node. This can be seen if you click the 'AWS CLI export' in the cluster web interface. The intended bootstrap action is listed as a regular step.
Using the command line to launch a cluster with a bootstrap action bypasses this problem, so I've just used that.
Edit: Looking back at the web interface, it's pretty clear that I was adding regular steps instead of bootstrap actions. My bad!

Getting Data From A Specific Website Using Google Cloud

I have a machine learning project and I have to get data from a website every 15 minutes. And I cannot use my own computer so I will use Google cloud. I am trying to use Google Compute Engine and I have a script for getting data (here is the link: https://github.com/BurkayKirnik/Automatic-Crypto-Currency-Data-Getter/blob/master/code.py). This script gets data every 15 mins and writes it down to csv files. I can run this code by opening an SSH terminal and executing it from there but it stops working when I close the terminal. I tried to run it by executing it in startup script but it doesn't work this way too. How can I run this and save the csv files? BTW I have to install an API to run the code and I am doing it in startup script. There is no problem in this part.

Instances running in Google Cloud Platform can be configured with the same tools available in the operating system that they are running. If your instance is a Linux instance, the best method would be to use a cronjob to execute your script repeatedly at your chosen interval.
Once you have accessed the instance via SSH, you can open the crontab configuration file by running the following command:
$ crontab -e
The above command will provide access to your personal crontab configuration (for the user you are logged in as). If you want to run the script as root you can use this instead:
$ sudo crontab -e
You can now edit the crontab configuration and add an entry that tells cron to execute your script at your required interval (in your case every 15 minutes).
Therefore, your crontab entry should look something like this:
*/15 * * * * /path/to/you/script.sh
Notice the first entry is for minutes, so by using the */15, you are telling the cron daemon to execute the script once every 15 minutes.
Once you have edited the crontab configuration file, it is a good idea to restart the cron daemon to ensure the change you made will take place. To do this you can run:
$ sudo service cron restart
If you would like to check the status to ensure the cron service is running you can run:
$ sudo service cron status
You script will now execute every 15 minutes.
In terms of storing the CSV files, you could either program your script to store them on the instance, or an alternative would be to use Google Cloud Storage bucket. File can be copied to buckets easily by making use of the gsutil (part of Cloud SDK) command as described here. It's also possible to mount buckets as a file system as described here.

Running updates on EC2s that roll back on failure of status check

I’m setting up a patch process for EC2 servers running a web application.
I need to build an automated process that installs system updates but, reverts back to the last working ec2 instance if the web application fails a status check.
I’ve been trying to do this using an Automation Document in EC2 Systems Manager that performs the following steps:
Stop EC2 instance
Create AMI from instance
Launch new instance from newly created AMI
Run updates
Run status check on web application
If check fails, stop new instance and restart original instance
The Automation Document runs the first 5 steps successfully, but I can't identify how to trigger step 6? Can I do this within the Automation Document? What output would I be able to call from step 5? If it uses aws:runCommand, should the runCommand trigger a new automation document or another AWS tool?

I tried the following to solve this, which more or less worked:
Included an aws:runCommand action in the automation document
This ran the DocumentName "AWS-RunShellScript" with the following parameters:
Downloaded the script from s3:
sudo aws s3 cp s3://path/to/s3/script.sh /tmp/script.sh
Set the file to executable:
chmod +x /tmp/script.sh
Executed the script using variables set in, or generated by the automation document
bash /tmp/script.sh -o {{VAR1}} -n {{VAR2}} -i {{VAR3}} -l {{VAR4}} -w {{VAR5}}
The script included the following getopts command to set the inputted variables:
while getopts o:n:i:l:w: option
do
case "${option}"
in
n) VAR1=${OPTARG};;
o) VAR2=${OPTARG};;
i) VAR3=${OPTARG};;
l) VAR4=${OPTARG};;
w) VAR5=${OPTARG};;
esac
done
The bash script used the variables to run the status check, and roll back to last working instance if it failed.

AWS Elastic Beanstalk - Starting SWF Background Workers

I have been trying to find out the best way to run background jobs using PHP on AWS Elastic beanstalk, and after many hours searching on Google and SO, I believe that one good solution is using SWF and activity workers.
I found this example buried in the aws-sdk-for-php: https://github.com/amazonwebservices/aws-sdk-for-php/tree/master/_samples/AmazonSimpleWorkflow/cron
The read-me file says:
To run this sample, you need to execute three scripts from the command line in separate terminal/console windows
and
Note that the start_cron_example_workflow.php script will exit quickly
while the decider and activity worker scripts keep running until you
manually terminate them.
the decider and activity worker will loop "forever", and trying to run these in EB is what I'm having trouble doing.
In my .ebextensions directory I have a file that executes these files:
container_commands:
01background_task:
command: "php -f start_cron_example_activity_workers.php"
02background_task:
command: "php -f start_cron_example_workflow_workers.php"
But I get the following error messages:
ERROR
Failed to deploy application version.
ERROR
Some instances have not responded to commands. Responses were not received from [i-a5417ed4].
Any way I can do this using config files? How can I make this work in AWS EB without introducing a single point of failure?
Thank you.

You might consider using a service like IronWorker — this is specifically designed for what you are trying to do and will probably work better than putting together your own solution on a micro instance.
I have not used Iron.io yet, but was evaluating it as I am looking to move my stuff over to AWS so I need to have cron jobs handled as well.

Have you taken a look at the Fat Controller ? It can daemonise anything. There's documentation and examples on the website: http://fat-controller.sourceforge.net

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js