GCP Dataproc - configure YARN fair scheduler

I am trying to set up a Dataproc cluster that computes only one job (or a specified maximum number of jobs) at a time, keeping the rest in a queue.
I found this solution, How to configure monopolistic FIFO application queue in YARN?, but since I'm always creating a new cluster, I needed to automate this. I added this to the cluster creation request:
"softwareConfig": {
"properties": {
"yarn:yarn.resourcemanager.scheduler.class":"org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler",
"yarn:yarn.scheduler.fair.user-as-default-queue":"false",
"yarn:yarn.scheduler.fair.allocation.file":"$HADOOP_CONF_DIR/fair-scheduler.xml",
}
}
with another line in the init action script:
sudo echo "<allocations><queueMaxAppsDefault>1</queueMaxAppsDefault></allocations>" > /etc/hadoop/conf/fair-scheduler.xml
and the cluster tells me this when I fetch its config:
'softwareConfig': {
'imageVersion': '1.2.27',
'properties': {
'capacity-scheduler:yarn.scheduler.capacity.root.default.ordering-policy': 'fair',
'core:fs.gs.block.size': '134217728',
'core:fs.gs.metadata.cache.enable': 'false',
'distcp:mapreduce.map.java.opts': '-Xmx4096m',
'distcp:mapreduce.map.memory.mb': '5120',
'distcp:mapreduce.reduce.java.opts': '-Xmx4096m',
'distcp:mapreduce.reduce.memory.mb': '5120',
'hdfs:dfs.datanode.address': '0.0.0.0:9866',
'hdfs:dfs.datanode.http.address': '0.0.0.0:9864',
'hdfs:dfs.datanode.https.address': '0.0.0.0:9865',
'hdfs:dfs.datanode.ipc.address': '0.0.0.0:9867',
'hdfs:dfs.namenode.http-address': '0.0.0.0:9870',
'hdfs:dfs.namenode.https-address': '0.0.0.0:9871',
'hdfs:dfs.namenode.secondary.http-address': '0.0.0.0:9868',
'hdfs:dfs.namenode.secondary.https-address': '0.0.0.0:9869',
'mapred-env:HADOOP_JOB_HISTORYSERVER_HEAPSIZE': '3840',
'mapred:mapreduce.job.maps': '189',
'mapred:mapreduce.job.reduce.slowstart.completedmaps': '0.95',
'mapred:mapreduce.job.reduces': '63',
'mapred:mapreduce.map.cpu.vcores': '1',
'mapred:mapreduce.map.java.opts': '-Xmx4096m',
'mapred:mapreduce.map.memory.mb': '5120',
'mapred:mapreduce.reduce.cpu.vcores': '1',
'mapred:mapreduce.reduce.java.opts': '-Xmx4096m',
'mapred:mapreduce.reduce.memory.mb': '5120',
'mapred:mapreduce.task.io.sort.mb': '256',
'mapred:yarn.app.mapreduce.am.command-opts': '-Xmx4096m',
'mapred:yarn.app.mapreduce.am.resource.cpu-vcores': '1',
'mapred:yarn.app.mapreduce.am.resource.mb': '5120',
'spark-env:SPARK_DAEMON_MEMORY': '3840m',
'spark:spark.driver.maxResultSize': '1920m',
'spark:spark.driver.memory': '3840m',
'spark:spark.executor.cores': '8',
'spark:spark.executor.memory': '37237m',
'spark:spark.yarn.am.memory': '640m',
'yarn:yarn.nodemanager.resource.memory-mb': '81920',
'yarn:yarn.resourcemanager.scheduler.class': 'org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler',
'yarn:yarn.scheduler.fair.allocation.file': '$HADOOP_CONF_DIR/fair-scheduler.xml',
'yarn:yarn.scheduler.fair.user-as-default-queue': 'false',
'yarn:yarn.scheduler.maximum-allocation-mb': '81920',
'yarn:yarn.scheduler.minimum-allocation-mb': '1024'
}
},
The file fair-scheduler.xml also contains the specified code (everything is on one line, but I don't think that should be a problem).
After all this, the cluster still acts as if the capacity scheduler were in charge. I have no idea why. Any recommendation would help.
Thanks.

Since the init action script runs after the cluster is created, the YARN service is already running by the time the script modifies yarn-site.xml.
So after modifying that config file and creating the additional XML file, the YARN service needs to be restarted.
It can be done using this command:
sudo systemctl restart hadoop-yarn-resourcemanager.service
Also, since $HADOOP_CONF_DIR was not set (I thought it would be), the full path to the file has to be used. But with the full path set as a cluster property, the initial YARN service won't start, because it can't find the file, which is only created later by the init action script. So what I did instead was to append the allocation-file property to yarn-site.xml from the init action script as well.
The code for the init action script is the following:
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  # Write the Fair Scheduler allocation file
  echo "<allocations>" > /etc/hadoop/conf/fair-scheduler.xml
  echo "  <queueMaxAppsDefault>1</queueMaxAppsDefault>" >> /etc/hadoop/conf/fair-scheduler.xml
  echo "</allocations>" >> /etc/hadoop/conf/fair-scheduler.xml
  # Drop the closing </configuration> tag from yarn-site.xml, append the
  # allocation-file property with the full path, then close the tag again
  sed -i '$ d' /etc/hadoop/conf/yarn-site.xml
  echo "  <property>" >> /etc/hadoop/conf/yarn-site.xml
  echo "    <name>yarn.scheduler.fair.allocation.file</name>" >> /etc/hadoop/conf/yarn-site.xml
  echo "    <value>/etc/hadoop/conf/fair-scheduler.xml</value>" >> /etc/hadoop/conf/yarn-site.xml
  echo "  </property>" >> /etc/hadoop/conf/yarn-site.xml
  echo "</configuration>" >> /etc/hadoop/conf/yarn-site.xml
  # Restart the ResourceManager so it picks up the new scheduler configuration
  systemctl restart hadoop-yarn-resourcemanager.service
fi
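If you create clusters with the gcloud CLI rather than the REST API, the whole setup can be wired together roughly like this (a sketch; the bucket, region and cluster name are placeholders, not from the original setup):
# Upload the init action, then create the cluster with the Fair Scheduler properties
gsutil cp fair-scheduler-init.sh gs://my-bucket/init/fair-scheduler-init.sh
gcloud dataproc clusters create my-cluster --region=us-central1 \
    --properties='yarn:yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler,yarn:yarn.scheduler.fair.user-as-default-queue=false' \
    --initialization-actions=gs://my-bucket/init/fair-scheduler-init.sh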

Related

Wrong Yarn node label mapping with AWS EMR machine types

Does anyone have experience with YARN node labels on AWS EMR? If so, please share your thoughts. We want to run all the Spark executors on Task (Spot) machines and all the Spark ApplicationMasters/drivers on Core (On-Demand) machines. Previously we were running both Spark executors and Spark drivers on the CORE (On-Demand) machines.
To achieve this, we create the "TASK" YARN node label as part of a custom AWS EMR bootstrap action, and map that "TASK" label to any Spot instance when it registers with EMR in a separate bootstrap action. Since "CORE" is the default YARN node label expression, we simply map it to On-Demand instances when those nodes register.
We are using the "spark.yarn.executor.nodeLabelExpression": "TASK" Spark conf to launch Spark executors on Task nodes.
The problem is that the YARN node labels are briefly mapped to the wrong machine types: for a short period (around 1-2 minutes) the "TASK" label is mapped to On-Demand instances and the "CORE" label to Spot instances. During this window YARN launches Spark executors on On-Demand instances and Spark drivers on Spot instances.
This wrong mapping persists until the bootstrap actions are complete; after that it automatically resolves to the correct state.
The script we run as part of the bootstrap action is below. It runs on every new machine to assign a label to that machine, and it is launched as a background process because yarn only becomes available after all custom bootstrap actions have completed.
#!/usr/bin/env bash
set -ex

# Wait until the yarn binary is available on this node
function waitTillYarnComesUp() {
    IS_YARN_EXIST=$(which yarn | grep -i yarn | wc -l)
    while [ "$IS_YARN_EXIST" != "1" ]
    do
        echo "yarn does not exist yet"
        sleep 15
        IS_YARN_EXIST=$(which yarn | grep -i yarn | wc -l)
    done
    echo "yarn exists.."
}

# Wait until the TASK label (added on the master) is visible from this node
function waitTillTaskLabelSyncs() {
    LABEL_EXIST=$(yarn cluster --list-node-labels | grep -i TASK | wc -l)
    while [ "$LABEL_EXIST" -eq 0 ]
    do
        sleep 15
        LABEL_EXIST=$(yarn cluster --list-node-labels | grep -i TASK | wc -l)
    done
}

# Label this node TASK or CORE depending on whether it is a Spot or
# On-Demand instance, based on the EC2 instance metadata
function getHostInstanceTypeAndApplyLabel() {
    HOST_IP=$(curl http://169.254.169.254/latest/meta-data/local-hostname)
    echo "host ip is ${HOST_IP}"
    INSTANCE_TYPE=$(curl http://169.254.169.254/latest/meta-data/instance-life-cycle)
    echo "instance type is ${INSTANCE_TYPE}"
    PORT_NUMBER=8041
    spot="spot"
    onDemand="on-demand"
    if [[ "$INSTANCE_TYPE" == "$spot" ]]; then
        yarn rmadmin -replaceLabelsOnNode "${HOST_IP}:${PORT_NUMBER}=TASK"
    elif [[ "$INSTANCE_TYPE" == "$onDemand" ]]; then
        yarn rmadmin -replaceLabelsOnNode "${HOST_IP}:${PORT_NUMBER}=CORE"
    fi
}

waitTillYarnComesUp
# holding for resource manager sync
sleep 100
waitTillTaskLabelSyncs
getHostInstanceTypeAndApplyLabel
exit 0
The following command is run on the master instance at cluster creation time to create the new TASK YARN node label:
yarn rmadmin -addToClusterNodeLabels "TASK(exclusive=false)"
Does anyone have a clue how to prevent this wrong mapping of labels?
I would like to propose the following:
Create every node with a default label, like LABEL_PENDING. You can do this using EMR classifications;
In the bootstrap script, identify whether the current node is an On-Demand or a Spot instance;
Then, on every node, change LABEL_PENDING in /etc/hadoop/conf/yarn-site.xml to ON_DEMAND or SPOT;
On the master node, add the 3 labels to YARN: LABEL_PENDING, ON_DEMAND, and SPOT (see the sketch right after this list).
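A minimal sketch of that last step, assuming it is run on the master once YARN is up (the exclusivity settings here are an assumption, not something stated in the original answer):
# Register the three partitions with the ResourceManager (run on the master node)
yarn rmadmin -addToClusterNodeLabels "LABEL_PENDING(exclusive=false),ON_DEMAND(exclusive=false),SPOT(exclusive=false)"
# Verify that the labels are visible
yarn cluster --list-node-labels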
Example of EMR Classifications:
[
  {
    "classification": "yarn-site",
    "properties": {
      "yarn.node-labels.enabled": "true",
      "yarn.node-labels.am.default-node-label-expression": "ON_DEMAND",
      "yarn.nodemanager.node-labels.provider.configured-node-partition": "LABEL_PENDING"
    },
    "configurations": []
  },
  {
    "classification": "capacity-scheduler",
    "properties": {
      "yarn.scheduler.capacity.root.accessible-node-labels.ON_DEMAND.capacity": "100",
      "yarn.scheduler.capacity.root.accessible-node-labels.SPOT.capacity": "100",
      "yarn.scheduler.capacity.root.default.accessible-node-labels.ON_DEMAND.capacity": "100",
      "yarn.scheduler.capacity.root.default.accessible-node-labels.SPOT.capacity": "100"
    },
    "configurations": []
  },
  {
    "classification": "spark-defaults",
    "properties": {
      "spark.yarn.am.nodeLabelExpression": "ON_DEMAND",
      "spark.yarn.executor.nodeLabelExpression": "SPOT"
    },
    "configurations": []
  }
]
Example of the additional part of your bootstrap script:
yarnNodeLabelConfig="yarn.nodemanager.node-labels.provider.configured-node-partition"
yarnSiteXml="/etc/hadoop/conf/yarn-site.xml"

# Wait until yarn-site.xml exists and contains the node-partition property
function waitForYarnConfIsReady() {
    while [[ ! -e "$yarnSiteXml" ]]; do
        sleep 2
    done
    IS_CONF_PRESENT_IN_FILE=$(grep "$yarnNodeLabelConfig" "$yarnSiteXml" | wc -l)
    while [[ "$IS_CONF_PRESENT_IN_FILE" != "1" ]]
    do
        echo "yarn conf file doesn't have the property yet"
        sleep 2
        IS_CONF_PRESENT_IN_FILE=$(grep "$yarnNodeLabelConfig" "$yarnSiteXml" | wc -l)
    done
}

# Replace the LABEL_PENDING placeholder with SPOT or ON_DEMAND depending on
# the instance lifecycle reported by the EC2 metadata service
function updateLabelInYarnConf() {
    INSTANCE_TYPE=$(curl http://169.254.169.254/latest/meta-data/instance-life-cycle)
    echo "Instance type is $INSTANCE_TYPE"
    if [[ "$INSTANCE_TYPE" == "spot" ]]; then
        sudo sed -i 's/>LABEL_PENDING</>SPOT</' "$yarnSiteXml"
    elif [[ "$INSTANCE_TYPE" == "on-demand" ]]; then
        sudo sed -i 's/>LABEL_PENDING</>ON_DEMAND</' "$yarnSiteXml"
    fi
}

waitForYarnConfIsReady
updateLabelInYarnConf
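Once the cluster is up, a quick way to confirm the mapping from the master node (standard YARN commands, not part of the original answer):
# Labels known to the ResourceManager
yarn cluster --list-node-labels
# Nodes currently registered; per-node labels are also visible in the ResourceManager UI
yarn node -list -all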

Is there a way to confirm user_data ran successfully with Terraform for EC2?

I'm wondering if it's possible to know when the script in user_data has finished executing.
data "template_file" "script" {
template = file("${path.module}/installing.sh")
}
data "template_cloudinit_config" "config" {
gzip = false
base64_encode = false
# Main cloud-config configuration file.
part {
filename = "install.sh"
content = "${data.template_file.script.rendered}"
}
}
resource "aws_instance" "web" {
ami = "ami-04e7b4117bb0488e4"
instance_type = "t2.micro"
key_name = "KEY"
vpc_security_group_ids = [aws_default_security_group.default.id]
subnet_id = aws_default_subnet.default_az1.id
associate_public_ip_address = true
iam_instance_profile = "Role_S3"
user_data = data.template_cloudinit_config.config.rendered
tags = {
Name = "Terraform-Ansible"
}
}
And this is the content of the script.
Terraform tells me it applied the changes successfully, but the script is still running. Is there a way I can monitor that?
#!/usr/bin/env bash
exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1
echo BEGIN
sudo apt update
sudo apt upgrade -y
sudo apt install -y unzip
echo END
No, you cannot confirm the user data status from Terraform itself: user data is just a launch script that executes once the EC2 instance has launched. But with some extra effort in the init script there is one way to check:
How to check User Data status while launching the instance in aws
If you do something like what is described above and create a marker file once the user data has completed, then you can try the following to check for it.
resource "null_resource" "user_data_status_check" {
provisioner "local-exec" {
on_failure = "fail"
interpreter = ["/bin/bash", "-c"]
command = <<EOT
echo -e "\x1B[31m wait for few minute for instance warm up, adjust accordingly \x1B[0m"
# wait 30 sec
sleep 30
ssh -i yourkey.pem instance_ip ConnectTimeout=30 -o 'ConnectionAttempts 5' test -f "/home/user/markerfile.txt" && echo found || echo not found
if [ $? -eq 0 ]; then
echo "user data sucessfully executed"
else
echo "Failed to execute user data"
fi
EOT
}
triggers = {
#remove this once you test it out as it should run only once
always_run ="${timestamp()}"
}
depends_on = ["aws_instance.my_instance"]
}
This script checks for the marker file on the newly launched server over SSH, with a 30-second connection timeout and a maximum of 5 connection attempts.
Here are some pointers to remember:
User data shell scripts must start with the shebang #! characters and the path to the interpreter you want to run the script (commonly /bin/bash).
Scripts entered as user data are run as the root user, so there is no need to use the sudo command in the init script.
When a user data script is processed, it is copied to and run from /var/lib/cloud/instances/instance-id/. The script is not deleted after it is run and can be found in this directory with the name user-data.txt. So to check whether your shell script made it to the server, look in this directory for that file.
The cloud-init output log file (/var/log/cloud-init-output.log) captures the console output of your user_data shell script. To see how your script was executed and what it printed, check this file.
Source: https://www.middlewareinventory.com/blog/terraform-aws-ec2-user_data-example/
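As an alternative to a hand-made marker file, if the AMI ships a reasonably recent cloud-init (17.1 or later), you can block until user data has finished with cloud-init itself; a sketch, assuming the same key and host placeholders as above:
# Returns once cloud-init (and therefore the user_data script) has finished;
# the exit code is typically non-zero if cloud-init reported an error
ssh -i yourkey.pem user@instance_ip 'cloud-init status --wait'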
Well I use these two ways to confirm.
At the end of the cloud-init config file, this line sends me a notification through WhatsApp (using CallMeBot). That way, no matter how long the setup takes, I always get notified when the instance is ready to use. I watch a series or read something in the meantime; no time wasted.
curl -X POST "https://api.callmebot.com/whatsapp.php?phone=12345678910&text=Ec2+transcoder+setup+complete&apikey=12345"
And secondly, at the end of the cloud-init config these lines run:
echo "for faster/visual confirmation of above execution.."
wget https://www.sample-videos.com/video123/mp4/720/big_buck_bunny_720p_1mb.mp4 -O /home/ubuntu/dpnd_comp.mp4
When I sign in to the instance I can see the file right away.
And I'm loving it. Hope this helps someone. Also, don't forget to share your method too.

Environment Variables in newest AWS EC2 instance

I am trying to get environment variables into an EC2 instance (trying to run a Django app on Amazon Linux AMI 2018.03.0 (HVM), SSD Volume Type, ami-0ff8a91507f77f867). How do you get them in on the newest version of Amazon's Linux, or at least get logging so the problem can be traced?
user-data text (modified from here):
#!/bin/bash
#trying to get a file made
touch /tmp/testfile.txt
echo 'This and that' > /tmp/testfile.txt
#trying to log
echo 'Woot!' > /home/ec2-user/user-script-output.txt
#Trying to get the output logged to see what is going wrong
exec > >(tee /var/log/user-data.log|logger -t user-data ) 2>&1
#trying to log
echo "XXXXXXXXXX STARTING USER DATA SCRIPT XXXXXXXXXXXXXX"
#trying to store the ENVIRONMENT VARIABLES
PARAMETER_PATH='/'
REGION='us-east-1'
# Functions
AWS="/usr/local/bin/aws"
get_parameter_store_tags() {
    echo $($AWS ssm get-parameters-by-path --with-decryption --path ${PARAMETER_PATH} --region ${REGION})
}
params_to_env () {
    params=$1
    # If .Tags does not exist we assume an SSM Parameters object.
    SELECTOR="Name"
    for key in $(echo $params | /usr/bin/jq -r ".[][].${SELECTOR}"); do
        value=$(echo $params | /usr/bin/jq -r ".[][] | select(.${SELECTOR}==\"$key\") | .Value")
        key=$(echo "${key##*/}" | /usr/bin/tr ':' '_' | /usr/bin/tr '-' '_' | /usr/bin/tr '[:lower:]' '[:upper:]')
        export $key="$value"
        echo "$key=$value"
    done
}
# Get TAGS
if [ -z "$PARAMETER_PATH" ]
then
echo "Please provide a parameter store path. -p option"
exit 1
fi
TAGS=$(get_parameter_store_tags ${PARAMETER_PATH} ${REGION})
echo "Tags fetched via ssm from ${PARAMETER_PATH} ${REGION}"
echo "Adding new variables..."
params_to_env "$TAGS"
Notes -
What I think I know but am unsure of:
the user-data script is only run when the instance is first created, not when I stop and then start it, as mentioned here (although it also says, possibly outdated, that the output is logged to /var/log/cloud-init-output.log)
I may not be starting the instance correctly
I don't know where to store the bash script so that it can be executed
What I have verified:
the user-data text is on the instance: after SSH-ing in, curl http://169.254.169.254/latest/user-data shows the current text (#!/bin/bash …)
What I've tried:
editing rc.local directly to export AWS_ACCESS_KEY_ID='JEFEJEFEJEFEJEFE' … and the like
putting them in the AWS Parameter Store (I can see them via the correct call, I just can't trace getting them into the EC2 instance without logs or confirmation that the user-data is being run)
putting ENV variables in Tags and importing them as mentioned here:
outputting the logs to other files as suggested here (not seeing any log files on the instance or in the system log)
viewing the System Log in the AWS console via selecting the instance -> 'Actions' -> 'Instance Settings' -> 'Get System Log' (not seeing any commands run or log statements, only one unrelated mention of 'user')
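A quick way to confirm whether cloud-init actually ran the user-data script at all, based on the log locations mentioned in the earlier answer (assuming SSH access to the instance):
# Confirm the script was delivered and staged by cloud-init
sudo ls /var/lib/cloud/instances/*/
# Inspect the console output of the user-data script
sudo tail -n 100 /var/log/cloud-init-output.log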

Packer's Puppet provisioning stalls

I'm very new to this whole Packer/Vagrant/Puppet world. I'm trying to build my first VM using Packer and Puppet.
I can successfully build a VirtualBox image, and I've included a shell script provisioner to install Puppet. I've SSH'ed into the VM to verify that it works and that Puppet is installed.
Then I added an additional puppet-masterless provisioner whose manifest looks simply like this:
# java dependency
package { 'openjdk-7-jdk' :
ensure => present
}
When I run packer, it gets to this point and gets stuck:
==> virtualbox-iso: Provisioning with Puppet...
virtualbox-iso: Creating Puppet staging directory...
virtualbox-iso: Uploading manifests...
virtualbox-iso: Running Puppet: sudo -E puppet apply --verbose --modulepath='' --detailed-exitcodes /tmp/packer-puppet-masterless/manifests/ubuntu.pp
Any suggestions would be helpful, even on how to debug it to see what's going on behind the scenes.
I was having the same problem, and changed the execute_command so that it feeds in the vagrant user's password (the stall is typically puppet's sudo sitting silently at a password prompt; the -S flag makes sudo read the password from stdin):
"override": {
"virtualbox-iso": {
"execute_command": "echo 'vagrant' | {{.FacterVars}}{{if .Sudo}} sudo -S -E {{end}}puppet apply --verbose --modulepath='{{.ModulePath}}' {{if ne .HieraConfigPath \"\"}}--hiera_config='{{.HieraConfigPath}}' {{end}} {{if ne .ManifestDir \"\"}}--manifestdir='{{.ManifestDir}}' {{end}} --detailed-exitcodes {{.ManifestFile}}"
}
}
The whole block looks like this
{
  "type": "puppet-masterless",
  "manifest_file": "../puppet/manifests/base.pp",
  "module_paths": [
    "../puppet/modules/"
  ],
  "override": {
    "virtualbox-iso": {
      "execute_command": "echo 'vagrant' | {{.FacterVars}}{{if .Sudo}} sudo -S -E {{end}}puppet apply --verbose --modulepath='{{.ModulePath}}' {{if ne .HieraConfigPath \"\"}}--hiera_config='{{.HieraConfigPath}}' {{end}} {{if ne .ManifestDir \"\"}}--manifestdir='{{.ManifestDir}}' {{end}} --detailed-exitcodes {{.ManifestFile}}"
    }
  }
}
Source: Found an example here https://github.com/AdoptOpenJDK/openjdk-virtual-images/blob/master/packer/openjdk-development/openjdk-development.json
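For the debugging part of the question, two standard Packer options (not specific to this template; template.json stands for your own template file) make it easier to see what is going on behind the scenes:
# verbose logging from packer and its provisioners
PACKER_LOG=1 packer build template.json
# run the build step by step so you can inspect the VM between steps
packer build -debug template.json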

Amazon RDS - Online only when needed?

I have a question about Amazon RDS. I only need the database online for about 2 hours a day, but I am dealing with quite a large database, at around 1 GB.
I have two main questions:
Can I automate bringing my RDS database online and offline via scripts to save money?
When I take an RDS instance offline to stop the "work hours" counter from running and billing me, will it still have the same content when I bring it back online (i.e. will all my data stay there, or will it have to be a blank DB)? If it has to be blank, is there any way around this other than backing up to S3 and reimporting it every time?
If you wish to do this programmatically:
Snapshot the RDS instance using rds-create-db-snapshot http://docs.aws.amazon.com/AmazonRDS/latest/CommandLineReference/CLIReference-cmd-CopyDBSnapshot.html
Delete the running instance using rds-delete-db-instance http://docs.aws.amazon.com/AmazonRDS/latest/CommandLineReference/CLIReference-cmd-DeleteDBInstance.html
Restore the database from the snapshot using rds-restore-db-instance-from-db-snapshot http://docs.aws.amazon.com/AmazonRDS/latest/CommandLineReference/CLIReference-cmd-RestoreDBInstanceFromDBSnapshot.html
You may also do all of this from the AWS Web Console as well, if you wish to do this manually.
You can start EC2* instances using shell scripts, so I guess you can do the same for RDS.
(see http://docs.aws.amazon.com/AmazonRDS....html)
But unlike EC2*, you cannot "stop" an RDS instance without "destroying" it. You need to create a DB snapshot when terminating your database. You will use this DB snapshot when re-starting the database.
*EC2: Elastic Compute Cloud, i.e. renting a virtual server.
Here's a script that will stop/start/reboot an RDS instance
#!/bin/bash
# usage ./startStop.sh lhdevices start
INSTANCE="$1"
ACTION="$2"
# export vars to run RDS CLI
export JAVA_HOME=/usr;
export AWS_RDS_HOME=/home/mysql/RDSCli-1.15.001;
export PATH=$PATH:/home/mysql/RDSCli-1.15.001/bin;
export EC2_REGION=us-east-1;
export AWS_CREDENTIAL_FILE=/home/mysql/RDSCli-1.15.001/keysLightaria.txt;
if [ $# -ne 2 ]
then
echo "Usage: $0 {MySQL-Instance Name} {Action either start, stop or reboot}"
echo ""
exit 1
fi
shopt -s nocasematch
if [[ $ACTION == 'start' ]]
then
echo "This will $ACTION a MySQL Instance"
rds-restore-db-instance-from-db-snapshot lhdevices \
    --db-snapshot-identifier dbStart --availability-zone us-east-1a \
    --db-instance-class db.m1.small
echo "Sleeping while instance is created"
sleep 10m
echo "waking..."
rds-modify-db-instance lhdevices --db-security-groups kfarrell
echo "Sleeping while instance is modified for security group name"
sleep 5m
echo "waking..."
elif [[ $ACTION == 'stop' ]]
then
echo "This will $ACTION a MySQL Instance"
yes | rds-delete-db-snapshot dbStart
echo "Sleeping while deleting old snapshot "
sleep 10m
#rds-create-db-snapshot lhdevices --db-snapshot-identifier dbStart
# echo "Sleeping while creating new snapshot "
# sleep 10m
# echo "waking...."
#rds-delete-db-instance lhdevices --force --skip-final-snapshot
rds-delete-db-instance lhdevices --force --final-db-snapshot-identifier dbStart
echo "Sleeping while instance is deleted"
sleep 10m
echo "waking...."
elif [[ $ACTION == 'reboot' ]]
then
echo "This will $ACTION a MySQL Instance"
rds-reboot-db-instance lhdevices ;
echo "Sleeping while Instance is rebooted"
sleep 5m
echo "waking...."
else
echo "Did not recognize command: $ACTION"
echo "Usage: $0 {MySQL-Instance Name} {Action either start, stop or reboot}"
fi
shopt -u nocasematch
Amazon has since updated RDS and the CLI to include a way to start and stop RDS instances; stop-db-instance and start-db-instance detail the steps needed to perform these operations.
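A minimal sketch with the current AWS CLI, assuming a hypothetical instance identifier mydbinstance (a stopped instance keeps its storage and configuration, and AWS automatically starts it again after seven days):
# Stop the instance; data, automated backups and settings are retained
aws rds stop-db-instance --db-instance-identifier mydbinstance
# Start it again when the working window begins
aws rds start-db-instance --db-instance-identifier mydbinstance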