How to use AWS Data Pipeline ShellCommandPrecondition

My first question here! I've built a Data Pipeline for daily ETL which moves and transforms data between Aurora, Redshift and Hive. All works well, however I'm truly stuck on trying to implement a ShellCommandPrecondition. The aim is to check the total row count in a view sitting on Aurora MySQL. If the view is empty (0 rows), the Data Pipeline should execute. If there are rows in the view, the pipeline should wait a bit and eventually fail after 4 retries.
Can someone help me out with the code for the actual check and query? This is what I've got so far but no luck with it:
#!/bin/bash
count=`mysql -u USER -pPW -h MASTERPUBLIC -p 3306 -D DBNAME -s -N -e "SELECT count(*) from MyView"`
if $count = 0
then exit 0
else exit 1
fi
In the pipeline definition it looks as follows:
{
"retryDelay": "15 Minutes",
"scriptUri": "s3://mybucket/ETLprecondition.bash",
"maximumRetries": "4",
"name": "CheckViewEmpty",
"id": "PreconditionId_pznm2",
"type": "ShellCommandPrecondition"
},
I have very little experience coding so I may be completely off...

Right, a few hours have passed and I finally solved it. There were a few issues holding me up.
The MySQL client was not installed on the EC2 instance. Solved that by adding an install command.
The next issue was that the if $count = 0 line wasn't working as I expected with my limited experience (bash needs test/[ ] for a comparison; a bare if just tries to run the expanded words as a command). Exchanged it for if [ "$count" -eq "0" ];
Final and working code is:
#!/bin/bash
# Install the MySQL client first if it is not already on the EC2 instance
if ! type mysql >/dev/null 2>&1; then
    sudo yum install -y mysql
fi
# -s -N returns just the number; note the port flag for mysql is capital -P
count=$(mysql -u USER -pPW -h MASTERPUBLIC -P 3306 -D DBNAME -s -N -e "SELECT count(*) FROM MyView")
# Precondition is met (exit 0) only when the view is empty
if [ "$count" -eq 0 ]; then
    exit 0
else
    exit 1
fi
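For reference, a precondition only takes effect once the activity (or resource) it gates references it. A minimal sketch of how that reference might look, with a hypothetical activity id and its other required fields omitted:
{
"id": "MyAuroraToRedshiftCopy",
"type": "CopyActivity",
"precondition": { "ref": "PreconditionId_pznm2" }
}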

Related

Wrong Yarn node label mapping with AWS EMR machine types

Does anyone have experience with YARN node labels on AWS EMR? If so, please share your thoughts. We want to run all Spark executors on Task (Spot) machines and all Spark ApplicationMasters/drivers on Core (On-Demand) machines. Previously we were running both Spark executors and Spark drivers on the CORE (On-Demand) machines.
To achieve this, we create the "TASK" YARN node label as part of a custom AWS EMR bootstrap action, and in a separate bootstrap action we map that "TASK" label to any Spot instance when it registers with AWS EMR. Since "CORE" is the default YARN node label expression, we simply map it to On-Demand instances when those nodes register in the bootstrap action.
We are using the "spark.yarn.executor.nodeLabelExpression": "TASK" Spark conf to launch Spark executors on Task nodes.
The problem we are facing is a wrong mapping of YARN node labels to machine types: for a short time (around 1-2 minutes) the "TASK" label is mapped to On-Demand instances and the "CORE" label is mapped to Spot instances. During this window of wrong labeling, YARN launches Spark executors on On-Demand instances and Spark drivers on Spot instances.
This wrong mapping persists until the bootstrap actions complete, after which it automatically resolves to the correct state.
The script we are running as part of the bootstrap action is shown below. It runs on every new machine to assign a label to that machine, and it is launched as a background process because yarn only becomes available after all custom bootstrap actions have completed:
#!/usr/bin/env bash
set -ex
function waitTillYarnComesUp() {
IS_YARN_EXIST=$(which yarn | grep -i yarn | wc -l)
while [ $IS_YARN_EXIST != '1' ]
do
echo "Yarn not exist"
sleep 15
IS_YARN_EXIST=$(which yarn | grep -i yarn | wc -l)
done
echo "Yarn exist.."
}
function waitTillTaskLabelSyncs() {
LABEL_EXIST=$(yarn cluster --list-node-labels | grep -i TASK | wc -l)
while [ $LABEL_EXIST -eq 0 ]
do
sleep 15
LABEL_EXIST=$(yarn cluster --list-node-labels | grep -i TASK | wc -l)
done
}
function getHostInstanceTypeAndApplyLabel() {
HOST_IP=$(curl http://169.254.169.254/latest/meta-data/local-hostname)
echo "host ip is ${HOST_IP}"
INSTANCE_TYPE=$(curl http://169.254.169.254/latest/meta-data/instance-life-cycle)
echo "instance type is ${INSTANCE_TYPE}"
PORT_NUMBER=8041
spot="spot"
onDemand="on-demand"
if [ $INSTANCE_TYPE == $spot ]; then
yarn rmadmin -replaceLabelsOnNode "${HOST_IP}:${PORT_NUMBER}=TASK"
elif [ $INSTANCE_TYPE == $onDemand ]
then
yarn rmadmin -replaceLabelsOnNode "${HOST_IP}:${PORT_NUMBER}=CORE"
fi
}
waitTillYarnComesUp
# holding for resource manager sync
sleep 100
waitTillTaskLabelSyncs
getHostInstanceTypeAndApplyLabel
exit 0
This command is run on the Master instance at cluster-creation time to create the new TASK YARN node label:
yarn rmadmin -addToClusterNodeLabels "TASK(exclusive=false)"
Does anyone have a clue how to prevent this wrong mapping of labels?
I would like to propose the following:
Create every node with a default label, like LABEL_PENDING. You can do this using EMR classifications;
In the bootstrap script, identify whether the current node is an On-Demand or a Spot instance;
After that, on every node change LABEL_PENDING in /etc/hadoop/conf/yarn-site.xml to ON_DEMAND or SPOT;
On the master node, add the 3 labels to YARN: LABEL_PENDING, ON_DEMAND, and SPOT (see the sketch after the bootstrap snippet below).
Example of EMR Classifications:
[
{
"classification": "yarn-site",
"properties": {
"yarn.node-labels.enabled": "true",
"yarn.node-labels.am.default-node-label-expression": "ON_DEMAND",
"yarn.nodemanager.node-labels.provider.configured-node-partition": "LABEL_PENDING"
},
"configurations": []
},
{
"classification": "capacity-scheduler",
"properties": {
"yarn.scheduler.capacity.root.accessible-node-labels.ON_DEMAND.capacity": "100",
"yarn.scheduler.capacity.root.accessible-node-labels.SPOT.capacity": "100",
"yarn.scheduler.capacity.root.default.accessible-node-labels.ON_DEMAND.capacity": "100",
"yarn.scheduler.capacity.root.default.accessible-node-labels.SPOT.capacity": "100"
},
"configurations": []
},
{
"classification": "spark-defaults",
"properties": {
"spark.yarn.am.nodeLabelExpression": "ON_DEMAND",
"spark.yarn.executor.nodeLabelExpression": "SPOT"
},
"configurations": []
}
]
Example of the additional part for your bootstrap script:
yarnNodeLabelConfig="yarn.nodemanager.node-labels.provider.configured-node-partition"
yarnSiteXml="/etc/hadoop/conf/yarn-site.xml"
function waitForYarnConfIsReady() {
while [[ ! -e $yarnSiteXml ]]; do
sleep 2
done
IS_CONF_PRESENT_IN_FILE=$(grep $yarnNodeLabelConfig $yarnSiteXml | wc -l)
while [[ $IS_CONF_PRESENT_IN_FILE != "1" ]]
do
echo "Yarn conf file doesn't have properties"
sleep 2
IS_CONF_PRESENT_IN_FILE=$(grep $yarnNodeLabelConfig $yarnSiteXml | wc -l)
done
}
function updateLabelInYarnConf() {
INSTANCE_TYPE=$(curl http://169.254.169.254/latest/meta-data/instance-life-cycle)
echo "Instance type is $INSTANCE_TYPE"
if [[ $INSTANCE_TYPE == "spot" ]]; then
sudo sed -i 's/>LABEL_PENDING</>SPOT</' $yarnSiteXml
elif [[ $INSTANCE_TYPE == "on-demand" ]]
then
sudo sed -i 's/>LABEL_PENDING</>ON_DEMAND</' $yarnSiteXml
fi
}
waitForYarnConfIsReady
updateLabelInYarnConf
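For the last step (registering the labels on the master), something along these lines should work once the ResourceManager is up. This is a sketch only; the non-exclusive settings are an assumption:
# Run once on the master node after YARN is available
yarn rmadmin -addToClusterNodeLabels "LABEL_PENDING(exclusive=false),ON_DEMAND(exclusive=false),SPOT(exclusive=false)"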

Google API OAuth2 cannot retrieve access token

I am new to the Google API. I have a script to import all mails into Google Groups, but I cannot get the API to work.
I have my client_id and client_secret,
then I used this link:
https://accounts.google.com/o/oauth2/auth?client_id=[CLIENID]&redirect_uri=urn:ietf:wg:oauth:2.0:oob&scope=https://www.googleapis.com/auth/apps.groups.migration&response_type=code
where I replaced [CLIENID] with my client ID. I can authenticate and get back the auth code, which I then used to run this command:
curl --request POST --data "code=[AUTHCODE]&client_id=[CLIENTID]&client_secret=[CLIENTSECRET]&redirect_uri=urn:ietf:wg:oauth:2.0:oob&grant_type=authorization_code" https://accounts.google.com/o/oauth2/token
This works and shows me the refresh token; however, the script still says authentication failed. So I tried to run the command again and it says
"error": "invalid_grant"
"error_description": "Bad Request"
If I reopen the link above, get a new auth code and run the command again, it works, but only the first time. I am on an NPO Google account and I have activated the trial period.
Can anyone help me out here?
Complete script:
client_id="..."
client_secret="...."
refresh_token="......"
function usage() {
(
echo "usage: $0 <group-address> <mbox-dir>"
) >&2
exit 5
}
GROUP="$1"
shift
MBOX_DIR="$1"
shift
[ -z "$GROUP" -o -z "$MBOX_DIR" ] && usage
token=$(curl -s --request POST --data "client_id=$client_id&client_secret=$client_secret&refresh_token=$refresh_token&grant_type=refresh_token" https://accounts.google.com/o/oauth2/token | sed -n "s/^\s*\"access_token\":\s*\"\([^\"]*\)\",$/\1/p")
# create done folder if it doesn't already exist
DONE_FOLDER=$MBOX_DIR/../done
mkdir -p $DONE_FOLDER
i=0
for file in $MBOX_DIR/*; do
echo "importing $file"
response=$(curl -s -H"Authorization: Bearer $token" -H'Content-Type: message/rfc822' -X POST "https://www.googleapis.com/upload/groups/v1/groups/$GROUP/archive?uploadType=media" --data-binary @"${file}")
result=$(echo $response | grep -c "SUCCESS")
# check to see if it worked
if [[ $result -eq 0 ]]; then
echo "upload failed on file $file. please run command again to resume."
exit 1
fi
# it worked! move message to the done folder
mv $file $DONE_FOLDER/
((i=i+1))
if [[ $i -gt 9 ]]; then
expires_in=$(curl -s "https://www.googleapis.com/oauth2/v1/tokeninfo?access_token=$token" | sed -n "s/^\s*\"expires_in\":\s*\([0-9]*\),$/\1/p")
if [[ $expires_in -lt 300 ]]; then
# refresh token
echo "Refreshing token..."
token=$(curl -s --request POST --data "client_id=$client_id&client_secret=$client_secret&refresh_token=$refresh_token&grant_type=refresh_token" https://accounts.google.com/o/oauth2/token | sed -n "s/^\s*\"access_token\":\s*\"\([^\"]*\)\",$/\1/p")
fi
i=0
fi
done
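(A side note on the invalid_grant error: an authorization code can only be exchanged once, so re-running the initial curl with the same code is expected to fail. After the first exchange, new access tokens come from the refresh_token grant, which is what the token= line in the script does. A standalone sketch with placeholder credentials:)
# Trade a stored refresh token for a fresh access token (values are placeholders)
curl -s --request POST \
  --data "client_id=$client_id&client_secret=$client_secret&refresh_token=$refresh_token&grant_type=refresh_token" \
  https://accounts.google.com/o/oauth2/token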

How to make timer task in informatica succeed after a duration

I'm curious how to make the status of the timer task change to Succeeded. I have many sessions, some connected in series and some in parallel... After every session has run successfully, the status of the timer task still shows Running... How do I make it change to Succeeded as well...
The condition is: if the workflow finishes within the allocated time of 20 minutes, the timer task has to change to Succeeded, but if it exceeds 20 minutes, it should send an email to the assigned user and abort the workflow.....
Unix:
if[[ $Event_Exceed20min > 20 AND $EVent_Exceed20min.Status = Running ]]
pmcmd stopworkflow -service informatica-integration-Service -d domain-name - u user-name -p password -f folder-name -w workflow-name
$Event_Exceed20min.Status = SUCCEEDED
fi
You can use a UNIX script to do this. I don't see how Informatica alone can do this.
You can create a script which will kick off the Informatica workflow using pmcmd and keep polling the status:
kick off the flow and start timer
start checking status
if timer goes >1200 seconds, abort and mail, else continue polling
Code snippet below...
#!/bin/bash
wf=$1
sess=$2
mailids="xyz@abc.com,abc@goog.com"
log=~/log/"$wf"log.txt # ~ must stay unquoted so it expands to the home directory
echo "Start Workflow..."> $log
pmcmd startworkflow -sv service -d domain -u username -p password -f "FolderName" $wf
#Timer starts, works only in BASH
start=$SECONDS
while :
do
#Check Timer, if >20min abort the flow.
end=$SECONDS
duration=$(( end - start ))
if [ $duration -gt 1200 ]; then
pmcmd stopworkflow -sv service -d domain -u username -p password -f prd_CLAIMS -w $wf
STAT=$?
#Error check if not aborted
echo "Workflow $wf exceeded 20 minutes and was aborted" | mailx -s "Workflow took >20min so aborted" $mailids
exit 1
fi
pmcmd getsessionstatistics -sv service -d domain -u username -p password -f prd_CLAIMS -w $wf $sess > ~/log/tmp.txt
STAT=$?
if [ "$STAT" != 0 ]; then
echo "Staus check failed" >> $log
fi
# count sessions reporting Succeeded; leave the polling loop once the workflow is done
SUCCEEDED=$(grep -c "Succeeded" ~/log/tmp.txt)
if [ "$SUCCEEDED" -gt 0 ]; then
echo "Workflow Succeeded...">> $log
break
fi
sleep 30
done
echo "End Workflow...">> $log

Accumulo Overview console not reachable outside of VirtualBox VM

I am running Accumulo 1.5 in an Ubuntu 12.04 VirtualBox VM. I have set the instance.zookeeper.host property in accumulo-site.xml to the VM's IP address, and I can connect to Accumulo and run queries from a remote client machine. From the client machine, I can also use a browser to see the Hadoop NameNode, browse the filesystem, etc. But I cannot connect to the Accumulo Overview page (port 50095) from anywhere other than directly on the Accumulo VM. There is no firewall between the VM and the client, and aside from the Overview page not being reachable, everything else seems to work fine.
Is there a config setting that I need to change to allow outside access to the Accumulo Overview console?
thanks
I was able to get the Accumulo monitor to bind to all network interfaces by manually applying this patch:
https://git-wip-us.apache.org/repos/asf?p=accumulo.git;a=commit;h=7655de68
In conf/accumulo-env.sh add:
# Should the monitor bind to all network interfaces -- default: false
export ACCUMULO_MONITOR_BIND_ALL="true"
In bin/config.sh add:
# ACCUMULO-1985 provide a way to use the scripts and still bind to all network interfaces
export ACCUMULO_MONITOR_BIND_ALL=${ACCUMULO_MONITOR_BIND_ALL:-"false"}
And modify bin/start-server.sh to match:
SOURCE="${BASH_SOURCE[0]}"
while [ -h "$SOURCE" ]; do # resolve $SOURCE until the file is no longer a symlink
bin="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
SOURCE="$(readlink "$SOURCE")"
[[ $SOURCE != /* ]] && SOURCE="$bin/$SOURCE" # if $SOURCE was a relative symlink, we need to resolve it relative to the path where the symlink file was located
done
bin="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
# Stop: Resolve Script Directory
. "$bin"/config.sh
HOST="$1"
host "$1" >/dev/null 2>/dev/null
if [ $? -ne 0 ]; then
LOGHOST="$1"
else
LOGHOST=$(host "$1" | head -1 | cut -d' ' -f1)
fi
ADDRESS="$1"
SERVICE="$2"
LONGNAME="$3"
if [ -z "$LONGNAME" ]; then
LONGNAME="$2"
fi
SLAVES=$( wc -l < ${ACCUMULO_HOME}/conf/slaves )
IFCONFIG=/sbin/ifconfig
if [ ! -x $IFCONFIG ]; then
IFCONFIG='/bin/netstat -ie'
fi
# ACCUMULO-1985 Allow monitor to bind on all interfaces
if [ ${SERVICE} == "monitor" -a ${ACCUMULO_MONITOR_BIND_ALL} == "true" ]; then
ADDRESS="0.0.0.0"
fi
ip=$($IFCONFIG 2>/dev/null| grep inet[^6] | awk '{print $2}' | sed 's/addr://' | grep -v 0.0.0.0 | grep -v 127.0.0.1 | head -n 1)
if [ $? != 0 ]
then
ip=$(python -c 'import socket as s; print s.gethostbyname(s.getfqdn())')
fi
if [ "$HOST" = "localhost" -o "$HOST" = "`hostname`" -o "$HOST" = "$ip" ]; then
PID=$(ps -ef | egrep ${ACCUMULO_HOME}/.*/accumulo.*.jar | grep "Main $SERVICE" | grep -v grep | awk {'print $2'} | head -1)
else
PID=$($SSH $HOST ps -ef | egrep ${ACCUMULO_HOME}/.*/accumulo.*.jar | grep "Main $SERVICE" | grep -v grep | awk {'print $2'} | head -1)
fi
if [ -z $PID ]; then
echo "Starting $LONGNAME on $HOST"
if [ "$HOST" = "localhost" -o "$HOST" = "`hostname`" -o "$HOST" = "$ip" ]; then
#${bin}/accumulo ${SERVICE} --address $1 >${ACCUMULO_LOG_DIR}/${SERVICE}_${LOGHOST}.out 2>${ACCUMULO_LOG_DIR}/${SERVICE}_${LOGHOST}.err &
${bin}/accumulo ${SERVICE} --address ${ADDRESS} >${ACCUMULO_LOG_DIR}/${SERVICE}_${LOGHOST}.out 2>${ACCUMULO_LOG_DIR}/${SERVICE}_${LOGHOST}.err &
MAX_FILES_OPEN=$(ulimit -n)
else
#$SSH $HOST "bash -c 'exec nohup ${bin}/accumulo ${SERVICE} --address $1 >${ACCUMULO_LOG_DIR}/${SERVICE}_${LOGHOST}.out 2>${ACCUMULO_LOG_DIR}/${SERVICE}_${LOGHOST}.err' &"
$SSH $HOST "bash -c 'exec nohup ${bin}/accumulo ${SERVICE} --address ${ADDRESS} >${ACCUMULO_LOG_DIR}/${SERVICE}_${LOGHOST}.out 2>${ACCUMULO_LOG_DIR}/${SERVICE}_${LOGHOST}.err' &"
MAX_FILES_OPEN=$($SSH $HOST "/usr/bin/env bash -c 'ulimit -n'")
fi
if [ -n "$MAX_FILES_OPEN" ] && [ -n "$SLAVES" ] ; then
if [ "$SLAVES" -gt 10 ] && [ "$MAX_FILES_OPEN" -lt 65536 ]; then
echo "WARN : Max files open on $HOST is $MAX_FILES_OPEN, recommend 65536"
fi
fi
else
echo "$HOST : $LONGNAME already running (${PID})"
fi
Check that the monitor is bound to the correct interface, and not the "localhost" loopback interface. You may have to edit the monitors file in Accumulo's configuration directory with the IP/hostname of the correct interface.
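For example (a hedged sketch; the address is hypothetical and the file name assumes the default conf/monitor of a 1.5 tarball layout):
# point the monitor at the externally reachable interface, then restart the monitor process
echo "192.168.56.101" > $ACCUMULO_HOME/conf/monitor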

Amazon RDS - Online only when needed?

I have a question about Amazon RDS. I only need the database online for about 2 hours a day, but I am dealing with quite a large database at around 1 GB.
I have two main questions:
Can I automate bringing my RDS database online and offline via scripts to save money?
When I take the RDS instance offline to stop the "work hours" counter from running and billing me, will it still have the same content when I bring it back online (i.e. will all my data stay there, or will it have to be a blank DB)? If so, is there any way around this other than backing up to S3 and re-importing it every time?
If you wish to do this programmatically,
Snapshot the RDS instance using rds-create-db-snapshot http://docs.aws.amazon.com/AmazonRDS/latest/CommandLineReference/CLIReference-cmd-CopyDBSnapshot.html
Delete the running instance using rds-delete-db-instance http://docs.aws.amazon.com/AmazonRDS/latest/CommandLineReference/CLIReference-cmd-DeleteDBInstance.html
Restore the database from the snapshot using rds-restore-db-instance-from-db-snapshot http://docs.aws.amazon.com/AmazonRDS/latest/CommandLineReference/CLIReference-cmd-RestoreDBInstanceFromDBSnapshot.html
You may also do all of this from the AWS Web Console as well, if you wish to do this manually.
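With the current unified aws CLI, the same three steps look roughly like this (a sketch only; the instance and snapshot identifiers are hypothetical):
# 1. Snapshot the running instance
aws rds create-db-snapshot --db-instance-identifier mydb --db-snapshot-identifier mydb-offline
# 2. Delete the instance (skipping the final snapshot, since one was just taken)
aws rds delete-db-instance --db-instance-identifier mydb --skip-final-snapshot
# 3. Later, restore a new instance from that snapshot
aws rds restore-db-instance-from-db-snapshot --db-instance-identifier mydb --db-snapshot-identifier mydb-offline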
You can start EC2* instances using shell scripts, so I guess you can do the same for RDS.
(see http://docs.aws.amazon.com/AmazonRDS....html)
But unlike EC2*, you cannot "stop" an RDS instance without "destroying" it. You need to create a DB snapshot when terminating your database. You will use this DB snapshot when re-starting the database.
*EC2: Elastic Compute Cloud, i.e. renting a virtual or physical server.
Here's a script that will stop/start/reboot an RDS instance
#!/bin/bash
# usage ./startStop.sh lhdevices start
INSTANCE="$1"
ACTION="$2"
# export vars to run RDS CLI
export JAVA_HOME=/usr;
export AWS_RDS_HOME=/home/mysql/RDSCli-1.15.001;
export PATH=$PATH:/home/mysql/RDSCli-1.15.001/bin;
export EC2_REGION=us-east-1;
export AWS_CREDENTIAL_FILE=/home/mysql/RDSCli-1.15.001/keysLightaria.txt;
if [ $# -ne 2 ]
then
echo "Usage: $0 {MySQL-Instance Name} {Action either start, stop or reboot}"
echo ""
exit 1
fi
shopt -s nocasematch
if [[ $ACTION == 'start' ]]
then
echo "This will $ACTION a MySQL Instance"
rds-restore-db-instance-from-db-snapshot lhdevices \
--db-snapshot-identifier dbStart --availability-zone us-east-1a \
--db-instance-class db.m1.small
echo "Sleeping while instance is created"
sleep 10m
echo "waking..."
rds-modify-db-instance lhdevices --db-security-groups kfarrell
echo "Sleeping while instance is modified for security group name"
sleep 5m
echo "waking..."
elif [[ $ACTION == 'stop' ]]
then
echo "This will $ACTION a MySQL Instance"
yes | rds-delete-db-snapshot dbStart
echo "Sleeping while deleting old snapshot "
sleep 10m
#rds-create-db-snapshot lhdevices --db-snapshot-identifier dbStart
# echo "Sleeping while creating new snapshot "
# sleep 10m
# echo "waking...."
#rds-delete-db-instance lhdevices --force --skip-final-snapshot
rds-delete-db-instance lhdevices --force --final-db-snapshot-identifier dbStart
echo "Sleeping while instance is deleted"
sleep 10m
echo "waking...."
elif [[ $ACTION == 'reboot' ]]
then
echo "This will $ACTION a MySQL Instance"
rds-reboot-db-instance lhdevices ;
echo "Sleeping while Instance is rebooted"
sleep 5m
echo "waking...."
else
echo "Did not recognize command: $ACTION"
echo "Usage: $0 {MySQL-Instance Name} {Action either start, stop or reboot}"
fi
shopt -u nocasematch
Amazon recently updated their CLI to include a way to start and stop RDS instances; see stop-db-instance and start-db-instance for the steps needed to perform these operations.
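For example (the instance identifier is a placeholder):
# stop the instance outside working hours and start it again when needed
aws rds stop-db-instance --db-instance-identifier mydb
aws rds start-db-instance --db-instance-identifier mydb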