Run cron task on AWS EMR master node - amazon-web-services

How can I run a periodic job in the background on an EMR cluster?
I have script.sh, which sets up a cron job, and application.py in S3, and I want to launch the cluster with this command:
aws emr create-cluster \
--name "Test cluster" \
--release-label emr-5.12.0 \
--applications Name=Hive Name=Pig Name=Ganglia Name=Spark \
--use-default-roles \
--ec2-attributes KeyName=myKey \
--instance-type m3.xlarge \
--instance-count 3 \
--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,\
Jar=s3://region.elasticmapreduce/libs/script-runner/script-runner.jar,\
Args=["s3://mybucket/script-path/script.sh"]
Finally, I want the cron job from script.sh to execute application.py.
I don't understand how to install cron on the master node. The Python file also needs some libraries, which should be installed too.

You need to SSH into the master node, and then perform the crontab setup from there, not on your local machine:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node-ssh.html
Connect to the Master Node Using SSH
Secure Shell (SSH) is a network protocol you can use to create a
secure connection to a remote computer. After you make a connection,
the terminal on your local computer behaves as if it is running on the
remote computer. Commands you issue locally run on the remote
computer, and the command output from the remote computer appears in
your terminal window.
When you use SSH with AWS, you are connecting to an EC2 instance,
which is a virtual server running in the cloud. When working with
Amazon EMR, the most common use of SSH is to connect to the EC2
instance that is acting as the master node of the cluster.
Using SSH to connect to the master node gives you the ability to
monitor and interact with the cluster. You can issue Linux commands on
the master node, run applications such as Hive and Pig interactively,
browse directories, read log files, and so on. You can also create a
tunnel in your SSH connection to view the web interfaces hosted on the
master node. For more information, see View Web Interfaces Hosted on
Amazon EMR Clusters.

Cron is installed by default on most Linux systems, so you don't need to install it manually.
To schedule a Spark job with cron, follow these steps:
Log in to the master node (SSH into the master).
Run the command
crontab -e
Add the line below to the crontab, then save and exit (:wq in vi)
*/15 * * * * /script-path/script.sh
Cron will now run the script every 15 minutes. (With 0 in the hour field, as in */15 0 * * *, it would run every 15 minutes only between 00:00 and 00:59.)
See the crontab man page (man 5 crontab) for the schedule syntax.
Hope this helps.
Thanks
Ravi
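To address the original question more directly: a step launched through script-runner.jar runs on the master node, so script.sh itself can register the cron entry instead of someone doing it by hand over SSH. Below is a minimal sketch of such a script.sh. The S3 key for application.py, the local path /home/hadoop/application.py, the log file and the --install flag are assumptions for illustration, not part of the original setup. Library installation is left as a commented placeholder because the required packages are not named in the question.

```shell
#!/bin/sh
# Sketch of script.sh: fetch application.py and schedule it with cron.
# Assumed (hypothetical) locations; adjust to your bucket and layout.
APP_S3="s3://mybucket/script-path/application.py"
APP_LOCAL="/home/hadoop/application.py"
CRON_LINE="*/15 * * * * python $APP_LOCAL >> /home/hadoop/application.log 2>&1"

# Read an existing crontab on stdin and print it with the entry appended
# exactly once, so re-running the step does not create duplicates.
add_cron_line() {
  { grep -vxF -e "$1" || true; }
  printf '%s\n' "$1"
}

# Only touch the machine when called as: script.sh --install
if [ "${1:-}" = "--install" ]; then
  aws s3 cp "$APP_S3" "$APP_LOCAL"
  # sudo pip install <libraries application.py needs>   (placeholder)
  { crontab -l 2>/dev/null || true; } | add_cron_line "$CRON_LINE" | crontab -
fi
```

With this layout the flag would be passed as a second element of the step arguments, e.g. Args=["s3://mybucket/script-path/script.sh","--install"]. The guard is there so the script can be read and tested without changing any crontab.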

Related

Cronjob Script not working for Graviton based EC2 instance with encryption enabled on EBS volume

I have an Ubuntu 18.04.6 LTS EC2 instance with the Graviton2 arm64 architecture. I have also enabled encryption on the EBS volume.
I configured a cron job for a bash script.
I can run the script manually with the ./backup-script.sh command.
But when I configure the cron job below:
0 4 * * * /bin/sh /path/to/script/backup-script.sh
it does not execute.
I have other EC2 instances where the cron job runs successfully, but those are not Graviton2-based and do not have EBS encryption enabled.
Have you checked the logs with grep CRON /var/log/syslog? It should work. I verified this by running a simple cron job on a Graviton2 instance with Ubuntu 18.
I found the solution. The cron job script now works on Graviton.
However, data from the /var/ directory is not being copied to S3.
It gives the error below:
"The user-provided path /var/www/ does not exist."
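A script that works from an interactive shell but fails under cron is often tripped up by cron's sparse environment: a short PATH, no profile files, and on Ubuntu /bin/sh is dash rather than bash. Whether that is the cause here is not stated in the thread. The following is only a debugging sketch that reruns a command under a cron-like environment so the failure can be reproduced by hand. The function name run_like_cron and the exact set of variables are assumptions.

```shell
#!/bin/sh
# Run a command string with a minimal, cron-like environment.
# cron typically provides little more than HOME, LOGNAME, SHELL and a short PATH.
run_like_cron() {
  env -i \
    HOME="$HOME" \
    LOGNAME="$(id -un)" \
    SHELL=/bin/sh \
    PATH=/usr/bin:/bin \
    /bin/sh -c "$1"
}

# Show which PATH the command would see under these conditions.
run_like_cron 'echo "PATH under cron: $PATH"'
```

To reproduce the job from the question, one would call run_like_cron '/bin/sh /path/to/script/backup-script.sh' and compare the output with a normal manual run.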

How to automatically start, execute and stop EC2?

I want to test my Python library on a GPU machine once a day.
I decided to use AWS EC2 for testing.
However, GPU instances are expensive, so I want to stop the instance after the test ends.
So I want to do the following automatically once a day:
Start the EC2 instance (which was set up manually)
Execute a command (test -> push logs to S3)
Stop the EC2 instance (not terminate it)
How to do this?
It is very simple...
Run script on startup
To run a script automatically when the instance starts (every time it starts, not just the first time), put your script in this directory:
/var/lib/cloud/scripts/per-boot/
Stop instance when test has finished
Simply issue a shutdown command to the operating system at the end of your script:
sudo shutdown now -h
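Put together, a per-boot script could look like the sketch below. The script name, the test command, the log path and the bucket are placeholders, and the DRY_RUN switch is an addition for safe experimentation: with the default DRY_RUN=1 the script only prints what it would do. Scripts in the per-boot directory are run by cloud-init as root, so sudo is not needed there.

```shell
#!/bin/sh
# Hypothetical /var/lib/cloud/scripts/per-boot/daily-test.sh
# All paths and the bucket name below are placeholders.
DRY_RUN="${DRY_RUN:-1}"
LOG_FILE="/tmp/daily-test.log"
LOG_DEST="s3://mybucket/test-logs/"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

daily_test() {
  # Run the test suite; keep going even if it fails so logs still get uploaded.
  run sh -c "cd /home/ubuntu/mylib && python3 -m pytest > $LOG_FILE 2>&1" || true
  run aws s3 cp "$LOG_FILE" "$LOG_DEST" || true
  # Halt the OS; an EBS-backed instance is stopped (not terminated) as long
  # as its shutdown behavior is left at the default "stop".
  run shutdown -h now
}

daily_test
```

Once the printed commands look right, setting DRY_RUN=0 in the script makes it act for real. Because the instance then shuts down on every boot, keep a way to disable the script (for example by editing the volume or user data) for maintenance sessions.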
You can push script logs to custom CloudWatch namespaces. For example, when the process ends, publish its state to CloudWatch. Then create a CloudWatch alarm based on that state, so that a completed state triggers an AWS Lambda function that stops the instance once your job is done.
Also, if you want to start and stop at specific times, you can use the EC2 instance scheduler to start and stop instances. It works like a cron job at specific intervals.
You can use the aws cli
To start an instance you would do the following
aws ec2 start-instances --instance-ids i-1234567890abcdef0
and to stop the instance you would do the following
aws ec2 stop-instances --instance-ids i-1234567890abcdef0
To execute commands inside the machine, you will need to SSH into it and run the commands you need. You can then use the AWS CLI to upload files to S3:
aws s3 cp test.txt s3://mybucket/test2.txt
I suggest reading the AWS CLI documentation; you will find most, if not all, of what you need to automate AWS commands there.
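The pieces above can be chained into one script that runs from a machine that stays on (for example from its crontab). This is a sketch under assumptions: the function name, the ubuntu login user and the remote command are placeholders, and the key pair is assumed to be available to ssh. It relies on the CLI's aws ec2 wait subcommand to block until the instance is running.

```shell
#!/bin/sh
# Start an instance, run one command over SSH, then stop the instance.
# Usage (hypothetical): run_on_instance i-1234567890abcdef0 './run.sh'
run_on_instance() {
  instance_id="$1"
  remote_cmd="$2"

  aws ec2 start-instances --instance-ids "$instance_id" >/dev/null
  aws ec2 wait instance-running --instance-ids "$instance_id"

  # The public IP can change on every start, so look it up each time.
  host=$(aws ec2 describe-instances --instance-ids "$instance_id" \
    --query 'Reservations[0].Instances[0].PublicIpAddress' --output text)

  # "ubuntu" is an assumed login user; sshd may need a few seconds after
  # the instance reports "running", so a retry loop may be needed here.
  ssh "ubuntu@$host" "$remote_cmd"
  status=$?

  # Stop (not terminate) the instance even if the remote command failed.
  aws ec2 stop-instances --instance-ids "$instance_id" >/dev/null
  return "$status"
}
```

The function returns the exit status of the remote command, so a calling cron job can tell a failed test run from a successful one.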
I created a shell script that starts an EC2 instance (if it is not already running), connects via SSH and, optionally, runs a command.
https://gist.github.com/jotaelesalinas/396812f821785f76e5e36cf928777a12
You can use it in three different ways:
./ec2-start-and-ssh.sh -i <instance id> -s
will show status information about your instance: running state and private and public IP addresses.
./ec2-start-and-ssh.sh -i <instance id>
will connect and leave you inside the default shell.
./ec2-start-and-ssh.sh -i <instance id> <command>
will run whatever command you specify, e.g.:
./ec2-start-and-ssh.sh -i <instance id> ./run.sh
./ec2-start-and-ssh.sh -i <instance id> sudo poweroff
I use the last two commands to run periodic jobs minimizing billing costs.
I hope this helps!

spark-submit from outside AWS EMR cluster

I have an AWS EMR cluster running spark, and I'd like to submit a PySpark job to it from my laptop (--master yarn) to run in cluster mode.
I know that I need to set up some config on the laptop, but I'd like to know what the bare minimum is. Do I just need some of the config files from the master node of the cluster? If so, which? Or do I need to install hadoop or yarn on my local machine?
I've done a fair bit of searching for an answer, but I haven't yet been able to be sure that what I was reading referred to launching a job from the master of the cluster or some arbitrary laptop...
If you want to run the spark-submit job solely on your AWS EMR cluster, you do not need to install anything locally. You only need the EC2 key pair you specified in the Security Options when you created the cluster.
I personally scp over any relevant scripts &/or jars, ssh into the master node of the cluster, and then run spark-submit.
You can specify most of the relevant spark job configurations via spark-submit itself. AWS documents in some more detail how to configure spark-submit jobs.
For example:
>> scp -i ~/PATH/TO/${SSH_KEY} /PATH/TO/PYSPARK_SCRIPT.py hadoop@${PUBLIC_MASTER_DNS}:
>> ssh -i ~/PATH/TO/${SSH_KEY} hadoop@${PUBLIC_MASTER_DNS}
>> spark-submit --conf spark.OPTION.OPTION=VALUE PYSPARK_SCRIPT.py
However, if you already pass a particular configuration when creating the cluster itself, you do not need to re-specify those same configuration options via spark-submit.
You can set up the AWS CLI on your local machine, put your deployment on S3, and then add an EMR step to run on the EMR cluster. Something like this:
aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=SparkWordCountApp,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,5,--executor-cores,5,--executor-memory,20g,s3://codelocation/wordcount.py,s3://inputbucket/input.txt,s3://outputbucket/],ActionOnFailure=CONTINUE
Source: https://aws.amazon.com/de/blogs/big-data/submitting-user-applications-with-spark-submit/
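If the laptop also needs to know when the step finishes, the step ID returned by add-steps can be fed to the CLI's waiter. A minimal sketch, assuming the AWS CLI is configured locally; the function name, the step name and the argument layout are illustrative, and the cluster ID and S3 path in the usage line are the placeholders from the command above.

```shell
#!/bin/sh
# Submit a PySpark script as an EMR step and wait for it to finish.
# Usage (hypothetical): submit_and_wait j-xxxxx s3://codelocation/wordcount.py
submit_and_wait() {
  cluster_id="$1"
  script_uri="$2"

  # add-steps returns the new step IDs; take the first one as plain text.
  step_id=$(aws emr add-steps --cluster-id "$cluster_id" \
    --steps "Type=spark,Name=SparkApp,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,$script_uri]" \
    --query 'StepIds[0]' --output text)

  # Blocks until the step completes; returns non-zero if it fails or times out.
  aws emr wait step-complete --cluster-id "$cluster_id" --step-id "$step_id"

  # Print the final state, e.g. COMPLETED or FAILED.
  aws emr describe-step --cluster-id "$cluster_id" --step-id "$step_id" \
    --query 'Step.Status.State' --output text
}
```

The sketch deliberately leaves out spark.yarn.submit.waitAppCompletion=false: with that setting the step is marked complete as soon as the application is submitted, so waiting on the step would not tell you whether the Spark job itself succeeded.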

How to run a script on an EC2 instance remotely?

I have an EC2 instance and I need to download a file from its D drive through my program. Currently, it's a very annoying process because I can't access the instance directly from my local machine. What I do now is run a script on the instance that uploads the file I need to S3, and my program then reads the file from S3.
Is there a simpler way to access the drive on the instance without going through S3?
I have used AWS DataPipeline and its task runner to execute scripts on a remote instance. The taskrunner waits for a pipeline event published to its worker group.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-using-task-runner.html
I use it to execute shell script and commands on a schedule. The script to run should be uploaded to S3, and the Data pipeline template specifies the script's path. Works great for periodic tasks. You can do anything you want on the remote box via the script.
You cannot download the file directly from the EC2 instance; you have to go through S3 (or possibly use the scp command against the remote instance).
To simplify this process, you can use AWS Systems Manager.
AWS Systems Manager Run Command lets you remotely and securely run a set of commands on EC2 instances as well as on-premises servers. The high-level steps are below.
Attach an instance IAM role:
The EC2 instance must have an IAM role with the AmazonSSMFullAccess policy (AWS also provides the narrower AmazonSSMManagedInstanceCore managed policy for this purpose). This role enables the instance to communicate with the Systems Manager API.
Install SSM Agent:
The EC2 instance must have SSM Agent installed. SSM Agent processes Run Command requests and configures the instance as the command specifies.
Execute the command:
Example usage via the AWS CLI:
Execute the following command to retrieve the services running on the instance. Replace Instance-ID with the EC2 instance ID.
aws ssm send-command --document-name "AWS-RunShellScript" --comment "listing services" --instance-ids "Instance-ID" --parameters commands="service --status-all" --region us-west-2 --output text
More detailed information: https://www.justdocloud.com/2018/04/01/run-commands-remotely-ec2-instances/
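send-command only queues the command; its output has to be fetched separately. The sketch below sends a command, waits for it, and prints its standard output. The function name is made up for illustration, and the region is assumed to come from the CLI configuration. Note that AWS-RunShellScript targets Linux instances; since the question mentions a D drive, the instance may be running Windows, in which case the AWS-RunPowerShellScript document would be the one to try.

```shell
#!/bin/sh
# Run one shell command on an instance through SSM and print its stdout.
# Usage (hypothetical): ssm_run Instance-ID "service --status-all"
ssm_run() {
  instance_id="$1"
  remote_cmd="$2"

  # Shorthand --parameters syntax; commands containing commas or quotes
  # are safer passed as JSON instead.
  command_id=$(aws ssm send-command \
    --document-name "AWS-RunShellScript" \
    --instance-ids "$instance_id" \
    --parameters "commands=$remote_cmd" \
    --query 'Command.CommandId' --output text)

  # Wait until the invocation reaches a final state (the waiter's exit
  # code is ignored here; the output below shows what happened).
  aws ssm wait command-executed \
    --command-id "$command_id" --instance-id "$instance_id" || true

  aws ssm get-command-invocation \
    --command-id "$command_id" --instance-id "$instance_id" \
    --query 'StandardOutputContent' --output text
}
```

For the original goal of reading a file, the remote command could simply print the file, although output returned this way is truncated for large results, so S3 remains the better route for anything sizeable.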

How to auto create new hosts in Logentries for AWS EC2 autoscaling group

What's the best way to send logs from EC2 Auto Scaling groups to Logentries?
I previously used the EC2 platform to set up log monitoring for all of the EC2 instances created by an Auto Scaling group. However, under the Auto Scaling rules, a new instance spins up whenever a current one is destroyed.
How do I automate Logentries so that it creates a new host and starts receiving logs? I've read https://logentries.com/doc/linux-agent-with-chef/#updating-le-agent but I'm stuck at override['le']['pull-server-side-config'] = false, since I don't know anything about Chef (I only took the training on their site).
For an Autoscaling group, you need to get this baked into an AMI, or scripted to run on startup. You can get an EC2 instance to run commands on startup, after you've figured out which script to run.
The Logentries Linux Agent installation docs have setup instructions for an Amazon AMI (under Installation > Select your distro below > Amazon AMI).
Run the following commands one by one in your terminal:
You will need to provide your Logentries credentials to link the agent to your account.
sudo -s
tee /etc/yum.repos.d/logentries.repo <<EOF
[logentries]
name=Logentries repo
enabled=1
metadata_expire=1d
baseurl=http://rep.logentries.com/amazon\$releasever/\$basearch
gpgkey=http://rep.logentries.com/RPM-GPG-KEY-logentries
EOF
yum update
yum install logentries
le register
yum install logentries-daemon
I recommend trying that script once and seeing if it works properly for you, then you could include it in the user data for your Autoscaling launch configuration.