Cloning existing EMR cluster into a new one using boto3 - amazon-web-services

When creating a new cluster using boto3, I want to reuse the configuration of an existing cluster (which is terminated) and thus clone it.
As far as I know, emr_client.run_job_flow requires all of the configuration (Instances, InstanceFleets etc.) to be provided as parameters.
Is there any way I can clone from an existing cluster, like I can from the AWS console for EMR?

What I can recommend is launching your cluster with the AWS CLI.
It lets you version your cluster configuration, and you can easily load the step configuration from a JSON file.
aws emr create-cluster --name "Cluster's name" --ec2-attributes KeyName=SSH_KEY --instance-type m3.xlarge --release-label emr-5.2.1 --log-uri s3://mybucket/logs/ --enable-debugging --instance-count 1 --use-default-roles --applications Name=Spark --steps file://step.json
Where step.json looks like:
[
  {
    "Name": "Step #1",
    "Type": "SPARK",
    "Jar": "command-runner.jar",
    "Args": [
      "--deploy-mode", "cluster",
      "--class", "com.your.data.set.class",
      "s3://path/to/your/spark-job.jar",
      "-c", "s3://path/to/your/config/or/not",
      "--aws-access-key", "ACCESS_KEY",
      "--aws-secret-key", "SECRET_KEY"
    ],
    "ActionOnFailure": "CANCEL_AND_WAIT"
  }
]
(Multiple steps are fine too.)
After that you can always start up the same configured cluster, and, for example, schedule the whole cluster and its steps from a single Airflow job.
But if you really want to use boto3, I suppose the describe_cluster() method can help you get all of the information and use the returned object to fire up a new one.
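A minimal sketch of that describe-then-recreate approach, assuming the cluster uses instance groups (not fleets); the field mapping below is an assumption, so extend it for bootstrap actions, steps, security configuration and anything else your cluster uses:
import boto3

emr = boto3.client('emr')

def clone_cluster(source_cluster_id, new_name):
    """Sketch: rebuild a run_job_flow call from a terminated cluster's description.

    describe_cluster does not return everything in run_job_flow format, so
    instance groups (and, if you use them, fleets, bootstrap actions, steps)
    have to be fetched separately and re-shaped.
    """
    src = emr.describe_cluster(ClusterId=source_cluster_id)['Cluster']
    ec2_attrs = src['Ec2InstanceAttributes']

    # Instance groups use different key names in the describe output
    # than run_job_flow expects, hence the re-mapping below.
    groups = emr.list_instance_groups(ClusterId=source_cluster_id)['InstanceGroups']
    instance_groups = [{
        'Name': g['Name'],
        'Market': g['Market'],
        'InstanceRole': g['InstanceGroupType'],
        'InstanceType': g['InstanceType'],
        'InstanceCount': g['RequestedInstanceCount'],
    } for g in groups]

    instances = {
        'InstanceGroups': instance_groups,
        'KeepJobFlowAliveWhenNoSteps': True,
    }
    if ec2_attrs.get('Ec2KeyName'):
        instances['Ec2KeyName'] = ec2_attrs['Ec2KeyName']
    if ec2_attrs.get('Ec2SubnetId'):
        instances['Ec2SubnetId'] = ec2_attrs['Ec2SubnetId']

    kwargs = {
        'Name': new_name,
        'ReleaseLabel': src['ReleaseLabel'],
        'Applications': [{'Name': a['Name']} for a in src['Applications']],
        'Configurations': src.get('Configurations', []),
        'ServiceRole': src['ServiceRole'],
        'JobFlowRole': ec2_attrs['IamInstanceProfile'],
        'Instances': instances,
    }
    if src.get('LogUri'):
        kwargs['LogUri'] = src['LogUri']
    return emr.run_job_flow(**kwargs)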

There is no way to get the EMR "export CLI" output through the command line.
You have to parse the parameters you want to clone out of "describe-cluster".
See the sample below:
https://github.com/awslabs/aws-support-tools/tree/master/EMR/Get_EMR_CLI_Export
import boto3
import json
import sys
cluster_id = sys.argv[1]
client = boto3.client('emr')
clst = client.describe_cluster(ClusterId=cluster_id)
...
awscli += ' --steps ' + '\'' + json.dumps(cli_steps) + '\''
...
awscli += ' --instance-groups ' + '\'' + json.dumps(cli_igroups) + '\''
print(awscli)
It works by parsing the parameters returned by "describe-cluster" and assembling strings that fit the "create-cluster" command of the AWS CLI.

Related

Groovy script issue with escaping quotes

I'm running this shell command using groovy (which worked in bash):
aws --profile profileName --region us-east-1 dynamodb update-item --table-name tableName --key '{"group_name": {"S": "group_1"}}' --attribute-updates '{"attr1": {"Value": {"S": "STOP"},"Action": "PUT"}}'
This updates the value of an item to STOP in DynamoDB. In my groovy script, I'm running this command like so:
String command = "aws --profile profileName --region us-east-1 dynamodb update-item --table-name tableName --key '{\"group_name\": {\"S\": \"group_1\"}}' --attribute-updates '{\"attr1\": {\"Value\": {\"S\": \"STOP\"},\"Action\": \"PUT\"}}'"
println(command.execute().text)
When I run this with groovy afile.groovy, nothing is printed out and when I check the table in DynamoDB, it's not updated to STOP. There is something wrong with the way I'm escaping the quotes but I'm not sure what. Would appreciate any insights.
Sidenote: When I do a simple aws command like aws s3 ls it works and prints out the results so it's something with this particular command that is throwing it off.
You don't quote for Groovy (and the underlying exec) -- you would have to quote for your shell. The execute() on a String does not work like a shell: the underlying code just splits at whitespace, so any quotes are simply passed down as part of the argument.
Use ["aws", "--profile", profile, ..., "--key", '{"group_name": ...', ...].execute() and forget about quoting.
And instead of banging strings together to generate JSON, use groovy.json.JsonOutput.toJson([group_name: [S: "group_1"]])
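The same two ideas carry over to any language: pass an argument vector so no shell is involved, and let a JSON library build the payload. For comparison, a rough Python equivalent (profile, table and key values taken from the question):
import json
import subprocess

# Argument vector: no shell is involved, so nothing needs to be escaped.
key = json.dumps({"group_name": {"S": "group_1"}})
updates = json.dumps({"attr1": {"Value": {"S": "STOP"}, "Action": "PUT"}})

result = subprocess.run(
    ["aws", "--profile", "profileName", "--region", "us-east-1",
     "dynamodb", "update-item",
     "--table-name", "tableName",
     "--key", key,
     "--attribute-updates", updates],
    capture_output=True,
    text=True,
)
print(result.stdout or result.stderr)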

Automating network interface related configurations on red hat ami-7.5

I have an ENI created, and I need to attach it as a secondary ENI to my EC2 instance dynamically using CloudFormation. As I am using a Red Hat AMI, I have to go ahead and manually configure RHEL, which includes the steps mentioned in the post below.
Manually Configuring secondary Elastic network interface on Red hat ami- 7.5
Can someone please tell me how to automate all of this using CloudFormation? Is there a way to do all of it using user data in a CloudFormation template? Also, I need to make sure that the configuration survives a reboot of my EC2 instance (currently the configuration gets deleted after a reboot).
It's not complete automation, but you can do the following to make sure that the ENI comes up after every reboot of your EC2 instance (RHEL instances only). If anyone has a better suggestion, kindly share.
vi /etc/systemd/system/create.service
Add the following content:
[Unit]
Description=XYZ
After=network.target
[Service]
ExecStart=/usr/local/bin/my.sh
[Install]
WantedBy=multi-user.target
Change permissions and enable the service
chmod a+x /etc/systemd/system/create.service
systemctl enable /etc/systemd/system/create.service
The shell script below does the ENI configuration on RHEL:
vi /usr/local/bin/my.sh
Add the following content:
#!/bin/bash
# The MAC address below is specific to this example's secondary interface;
# replace it with your ENI's MAC (see the instance metadata under network/interfaces/macs/).
my_eth1=`curl http://169.254.169.254/latest/meta-data/network/interfaces/macs/0e:3f:96:77:bb:f8/local-ipv4s/`
echo "this is the value--" $my_eth1 "hoo"
GATEWAY=`ip route | awk '/default/ { print $3 }'`
printf "NETWORKING=yes\nNOZEROCONF=yes\nGATEWAYDEV=eth0\n" >/etc/sysconfig/network
printf "\nBOOTPROTO=dhcp\nDEVICE=eth1\nONBOOT=yes\nTYPE=Ethernet\nUSERCTL=no\n" >/etc/sysconfig/network-scripts/ifcfg-eth1
ifup eth1
ip route add default via $GATEWAY dev eth1 tab 2
ip rule add from $my_eth1/32 tab 2 priority 600
Start the service
systemctl start create.service
You can check whether the script ran fine with:
journalctl -u create.service -b
I still need to figure out configuring the secondary ENI from within Linux, but this is the Python script I wrote to have the instance find the corresponding ENI and attach it to itself. The script works by taking a predefined Name tag for both the ENI and the instance, then joining the two together.
Prerequisites for setting this up are:
An IAM role on the instance that allows access to the S3 bucket where the script is stored
pip and the AWS CLI installed in the user data section:
curl -O https://bootstrap.pypa.io/get-pip.py
python get-pip.py
pip install awscli --upgrade
aws configure set default.region YOUR_REGION_HERE
pip install boto3
sleep 180
Note on the sleep 180 command: I swap the ENI between instances in an Auto Scaling group, so this gives the other instance an extra 3 minutes to shut down and release the ENI before the new one picks it up. It may or may not be necessary for your use case.
An AWS CLI command in the user data to download the file onto the instance (example below):
aws s3api get-object --bucket YOURBUCKETNAME --key NAMEOFOBJECT.py /home/ec2-user/NAMEOFOBJECT.py
# coding: utf-8
import boto3
import sys
import time

client = boto3.client('ec2')

# Get the ENI ID
eni = client.describe_network_interfaces(
    Filters=[
        {
            'Name': 'tag:Name',
            'Values': ['Put the name of your ENI tag here']
        },
    ]
)
eni_id = eni['NetworkInterfaces'][0]['NetworkInterfaceId']

# Get ENI status
eni_status = eni['NetworkInterfaces'][0]['Status']
print('Current Status: {}\n'.format(eni_status))

# Detach if in use
if eni_status == 'in-use':
    eni_attach_id = eni['NetworkInterfaces'][0]['Attachment']['AttachmentId']
    eni_detach = client.detach_network_interface(
        AttachmentId=eni_attach_id,
        DryRun=False,
        Force=False
    )
    print(eni_detach)

# Wait until ENI is available
print('start\n-----')
while eni_status != 'available':
    print('checking...')
    eni_state = client.describe_network_interfaces(
        Filters=[
            {
                'Name': 'tag:Name',
                'Values': ['Put the name of your ENI tag here']
            },
        ]
    )
    eni_status = eni_state['NetworkInterfaces'][0]['Status']
    print('ENI is currently: ' + eni_status + '\n')
    if eni_status != 'available':
        time.sleep(10)
print('end')

# Get the instance ID
instance = client.describe_instances(
    Filters=[
        {
            'Name': 'tag:Name',
            'Values': ['Put the tag name of your instance here']
        },
        {
            'Name': 'instance-state-name',
            'Values': ['running']
        }
    ]
)
instance_id = instance['Reservations'][0]['Instances'][0]['InstanceId']

# Attach the ENI
response = client.attach_network_interface(
    DeviceIndex=1,
    DryRun=False,
    InstanceId=instance_id,
    NetworkInterfaceId=eni_id
)
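One possible refinement (not part of the original script): the EC2 client in boto3 ships a network_interface_available waiter, assuming a reasonably recent botocore, so the polling loop above could be replaced with something like:
# Reuses `client` and `eni_id` from the script above; checks every 10 seconds
# for up to ~5 minutes until the ENI status is 'available'.
waiter = client.get_waiter('network_interface_available')
waiter.wait(
    NetworkInterfaceIds=[eni_id],
    WaiterConfig={'Delay': 10, 'MaxAttempts': 30},
)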

How to know RDS free storage

I've created an RDS Postgres instance with an initial size of 65 GB.
Is it possible to get the free space available using any Postgres query?
If not, how can I achieve this?
Thank you in advance.
There are a couple of ways to do it.
Using the AWS Console
Go to the RDS console and select the region your database is in. Click on the Show Monitoring button and pick your database instance. There will be a graph that shows Free Storage Space.
This is documented over at AWS RDS documentation.
Using the API via AWS CLI
Alternatively, you can use the AWS API to get the information from CloudWatch.
I will show how to do this with the AWS CLI.
This assumes you have set up your AWS CLI credentials. I export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables, but there are multiple ways to configure the CLI (or the SDKs).
REGION="eu-west-1"
START="$(date -u -d '5 minutes ago' '+%Y-%m-%dT%T')"
END="$(date -u '+%Y-%m-%dT%T')"
INSTANCE_NAME="tstirldbopgs001"
AWS_DEFAULT_REGION="$REGION" aws cloudwatch get-metric-statistics \
--namespace AWS/RDS --metric-name FreeStorageSpace \
--start-time $START --end-time $END --period 300 \
--statistics Average \
--dimensions "Name=DBInstanceIdentifier, Value=${INSTANCE_NAME}"
{
    "Label": "FreeStorageSpace",
    "Datapoints": [
        {
            "Timestamp": "2017-11-16T14:01:00Z",
            "Average": 95406264320.0,
            "Unit": "Bytes"
        }
    ]
}
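Using the API via Python (boto3)
For completeness, the same CloudWatch query can be made from Python; a minimal sketch that assumes the usual boto3 credential chain and reuses the region and instance identifier from the CLI example:
import datetime
import boto3

cw = boto3.client('cloudwatch', region_name='eu-west-1')

now = datetime.datetime.utcnow()
resp = cw.get_metric_statistics(
    Namespace='AWS/RDS',
    MetricName='FreeStorageSpace',
    Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': 'tstirldbopgs001'}],
    StartTime=now - datetime.timedelta(minutes=5),
    EndTime=now,
    Period=300,
    Statistics=['Average'],
)
for dp in resp['Datapoints']:
    # FreeStorageSpace is reported in bytes; convert to GB.
    print('Free Space: {:.2f} GB'.format(dp['Average'] / 1024 ** 3))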
Using the API via Java SDK
Here's a rudimentary example of how to get the same data via the AWS SDK for Java, using the CloudWatch API.
build.gradle contents
apply plugin: 'java'
apply plugin: 'application'

sourceCompatibility = 1.8

repositories {
    jcenter()
}

dependencies {
    compile 'com.amazonaws:aws-java-sdk-cloudwatch:1.11.232'
}

mainClassName = 'GetRDSInfo'
Java class
Again, I rely on the credential chain to get AWS API credentials (I set them in my environment). You can change the call to the builder to change this behavior (see Working with AWS Credentials documentation).
import java.util.Calendar;
import java.util.Date;

import com.amazonaws.regions.Regions;
import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.GetMetricStatisticsRequest;
import com.amazonaws.services.cloudwatch.model.GetMetricStatisticsResult;
import com.amazonaws.services.cloudwatch.model.StandardUnit;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.Datapoint;

public class GetRDSInfo {
    public static void main(String[] args) {
        final long GIGABYTE = 1024L * 1024L * 1024L;

        // calculate our endTime as now and startTime as 5 minutes ago.
        Calendar cal = Calendar.getInstance();
        Date endTime = cal.getTime();
        cal.add(Calendar.MINUTE, -5);
        Date startTime = cal.getTime();

        String dbIdentifier = "tstirldbopgs001";
        Regions region = Regions.EU_WEST_1;

        Dimension dim = new Dimension()
                .withName("DBInstanceIdentifier")
                .withValue(dbIdentifier);

        final AmazonCloudWatch cw = AmazonCloudWatchClientBuilder.standard()
                .withRegion(region)
                .build();

        GetMetricStatisticsRequest req = new GetMetricStatisticsRequest()
                .withNamespace("AWS/RDS")
                .withMetricName("FreeStorageSpace")
                .withStatistics("Average")
                .withStartTime(startTime)
                .withEndTime(endTime)
                .withDimensions(dim)
                .withPeriod(300);

        GetMetricStatisticsResult res = cw.getMetricStatistics(req);

        for (Datapoint dp : res.getDatapoints()) {
            // We requested only the average free space over the last 5 minutes,
            // so we only have one datapoint.
            double freespaceGigs = dp.getAverage() / GIGABYTE;
            System.out.println(String.format("Free Space: %.2f GB", freespaceGigs));
        }
    }
}
Example Java Code Execution
> gradle run
> Task :run
Free Space: 88.85 GB
BUILD SUCCESSFUL in 7s
The method using the AWS Management Console has changed.
Now you have to go to:
RDS > Databases > [your_db_instance]
From there, scroll down and click on "Monitoring".
There you should be able to see your DB's "Free Storage Space" (in MB/Second).

AWS EMR Spark Step args bug

I'm submitting a Spark job to EMR via the AWS CLI; the EMR steps and Spark configs are provided as separate JSON files. For some reason the name of my main class gets passed to my Spark jar as an unnecessary command-line argument, resulting in a failed job.
AWS CLI command:
aws emr create-cluster \
--name "Spark-Cluster" \
--release-label emr-5.5.0 \
--instance-groups \
InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
InstanceGroupType=CORE,InstanceCount=20,InstanceType=m3.xlarge \
--applications Name=Spark \
--use-default-roles \
--configurations file://conf.json \
--steps file://steps.json \
--log-uri s3://blah/logs
The json file describing my EMR Step:
[
  {
    "Name": "RunEMRJob",
    "Jar": "s3://blah/blah.jar",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "Type": "CUSTOM_JAR",
    "MainClass": "blah.blah.MainClass",
    "Args": [
      "--arg1", "these",
      "--arg2", "get",
      "--arg3", "passed",
      "--arg4", "to",
      "--arg5", "spark",
      "--arg6", "main",
      "--arg7", "class"
    ]
  }
]
The argument parser in my main class throws an error (and prints the parameters provided):
Exception in thread "main" java.lang.IllegalArgumentException: One or more parameters are invalid or missing:
blah.blah.MainClass --arg1 these --arg2 get --arg3 passed --arg4 to --arg5 spark --arg6 main --arg7 class
So for some reason the main class that I define in steps.json leaks into my separately provided command line arguments.
What's up?
I misunderstood how EMR steps work. There were two options for resolving this:
I could use Type = "CUSTOM_JAR" with Jar = "command-runner.jar" and put a normal spark-submit call in Args.
Using Type = "Spark" simply adds the "spark-submit" call as the first argument; one still needs to provide the master, jar location, main class, etc. in Args, as in the sketch below.
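For reference, the first option can also be expressed with boto3 instead of a steps.json file; a sketch, where the cluster id is a placeholder and the jar path, class name and arguments are the ones from the question:
import boto3

emr = boto3.client('emr')

# One command-runner.jar step whose Args are a full spark-submit call, so the
# main class is consumed by spark-submit and never leaks into the app's argv.
response = emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',  # placeholder cluster id
    Steps=[
        {
            'Name': 'RunEMRJob',
            'ActionOnFailure': 'TERMINATE_CLUSTER',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': [
                    'spark-submit',
                    '--deploy-mode', 'cluster',
                    '--class', 'blah.blah.MainClass',
                    's3://blah/blah.jar',
                    '--arg1', 'these',
                    '--arg2', 'get',
                ],
            },
        }
    ],
)
print(response['StepIds'])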

Getting a list of instances in an EC2 auto scale group?

Is there a utility or script available to retrieve a list of all instances in an AWS EC2 Auto Scaling group?
I need a dynamically generated list of production instances to hook into our deploy process. Is there an existing tool, or is this something I am going to have to script?
Here is a bash command that will give you the list of IP addresses of your instances in an AutoScaling group.
for ID in $(aws autoscaling describe-auto-scaling-instances --region us-east-1 --query AutoScalingInstances[].InstanceId --output text);
do
aws ec2 describe-instances --instance-ids $ID --region us-east-1 --query Reservations[].Instances[].PublicIpAddress --output text
done
(You might want to adjust the region, and to filter by Auto Scaling group if you have several of them.)
From a higher-level point of view, I would question the need to connect to individual instances in an Auto Scaling group. The dynamic nature of Auto Scaling encourages you to fully automate your deployment and admin processes. To quote an AWS customer: "If you need to ssh to your instance, change your deployment process"
--Seb
The describe-auto-scaling-groups command from the AWS Command Line Interface looks like what you're looking for.
Edit: Once you have the instance IDs, you can use the describe-instances command to fetch additional details, including the public DNS names and IP addresses.
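If you prefer to do that two-step chain from boto3 rather than the CLI, a minimal sketch (region and group name are placeholders):
import boto3

region = 'us-east-1'
asg_name = 'YOUR_ASG'  # placeholder

autoscaling = boto3.client('autoscaling', region_name=region)
ec2 = boto3.client('ec2', region_name=region)

# Step 1: instance IDs from the Auto Scaling group.
group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[asg_name])['AutoScalingGroups'][0]
instance_ids = [i['InstanceId'] for i in group['Instances']]

# Step 2: public DNS names and IP addresses from EC2.
if instance_ids:
    reservations = ec2.describe_instances(InstanceIds=instance_ids)['Reservations']
    for reservation in reservations:
        for instance in reservation['Instances']:
            print(instance['InstanceId'],
                  instance.get('PublicDnsName', ''),
                  instance.get('PublicIpAddress', ''))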
You can use the describe-auto-scaling-instances CLI command and query for your Auto Scaling group name.
Example:
aws autoscaling describe-auto-scaling-instances --region us-east-1 \
--query 'AutoScalingInstances[?AutoScalingGroupName==`YOUR_ASG`]' --output text
Hope that helps
You can also use the command below to fetch the private IP addresses without any jq/awk/sed/cut:
$ aws autoscaling describe-auto-scaling-instances --region us-east-1 --output text \
--query "AutoScalingInstances[?AutoScalingGroupName=='ASG-GROUP-NAME'].InstanceId" \
| xargs -n1 -I {} aws ec2 describe-instances --instance-ids {} --region us-east-1 \
--query "Reservations[].Instances[].PrivateIpAddress" --output text
courtesy this
I actually ended up writing a script in Python because I feel more comfortable in Python than in Bash.
#!/usr/bin/env python
"""
ec2-autoscale-instance.py
Read Autoscale DNS from AWS
Sample config file,
{
"access_key": "key",
"secret_key": "key",
"group_name": "groupName"
}
"""
from __future__ import print_function
import argparse
import boto.ec2.autoscale
try:
    import simplejson as json
except ImportError:
    import json

CONFIG_ACCESS_KEY = 'access_key'
CONFIG_SECRET_KEY = 'secret_key'
CONFIG_GROUP_NAME = 'group_name'


def main():
    arg_parser = argparse.ArgumentParser(
        description='Read Autoscale DNS names from AWS')
    arg_parser.add_argument('-c', dest='config_file',
                            help='JSON configuration file containing '
                                 'access_key, secret_key, and group_name')
    args = arg_parser.parse_args()

    config = json.loads(open(args.config_file).read())
    access_key = config[CONFIG_ACCESS_KEY]
    secret_key = config[CONFIG_SECRET_KEY]
    group_name = config[CONFIG_GROUP_NAME]

    ec2_conn = boto.connect_ec2(access_key, secret_key)
    as_conn = boto.connect_autoscale(access_key, secret_key)
    try:
        group = as_conn.get_all_groups([group_name])[0]
        instances_ids = [i.instance_id for i in group.instances]
        reservations = ec2_conn.get_all_reservations(instances_ids)
        instances = [i for r in reservations for i in r.instances]
        dns_names = [i.public_dns_name for i in instances]
        print('\n'.join(dns_names))
    finally:
        ec2_conn.close()
        as_conn.close()


if __name__ == '__main__':
    main()
The answer at https://stackoverflow.com/a/12592543/20774 was helpful in developing this script.
Use the snippet below to pick out the ASGs that carry a specific tag and list their instances' details.
#!/usr/bin/python
import boto3

ec2 = boto3.resource('ec2', region_name='us-west-2')


def get_instances():
    client = boto3.client('autoscaling', region_name='us-west-2')
    paginator = client.get_paginator('describe_auto_scaling_groups')
    groups = paginator.paginate(PaginationConfig={'PageSize': 100})
    # Keep only the ASGs whose 'Application' tag has the value 'CCP'.
    filtered_asgs = groups.search(
        'AutoScalingGroups[] | [?contains(Tags[?Key==`{}`].Value, `{}`)]'.format(
            'Application', 'CCP'))
    for asg in filtered_asgs:
        print(asg['AutoScalingGroupName'])
        instance_ids = [i['InstanceId'] for i in asg['Instances']]
        running_instances = ec2.instances.filter(
            Filters=[{'Name': 'instance-id', 'Values': instance_ids}])
        for instance in running_instances:
            print(instance.private_ip_address)


if __name__ == '__main__':
    get_instances()
For Ruby, using the aws-sdk gem v2:
First create an EC2 resource object like this:
ec2 = Aws::EC2::Resource.new(
  region: 'region',
  credentials: Aws::Credentials.new('IAM_KEY', 'IAM_SECRET')
)
instances = []
ec2.instances.each do |i|
  p "instance id---", i.id
  instances << i.id
end
This will fetch all instance IDs in the given region; you can narrow it down with additional filters, such as ip_address.