Dataprep: job finish event - google-cloud-platform

We are considering running Dataprep on an automatic schedule to wrangle and load a folder of GCS .gz files into BigQuery.
The challenge is: how can the source .gz files be moved to cold storage once they are processed?
I can't find any event generated by Dataprep that we could hook into to perform the archiving task. Ideally, Dataprep would archive the source files itself.
Any suggestions?

I don't believe there is a way to get notified directly by Dataprep when a job finishes. What you can do instead is poll the underlying Dataflow job. You could schedule a script like the one below to run whenever your scheduled Dataprep job runs:
#!/bin/bash
# List active Dataflow jobs, keep only the one whose name contains "dataprep",
# and take its job ID from the second line of the output (the first line is the header)
id=$(gcloud dataflow jobs list --status=active --filter="name:dataprep" | sed -n 2p | cut -f 1 -d " ")
# Poll until the state of the job changes to done
until [ "$(gcloud dataflow jobs describe "$id" | grep currentState | head -1 | awk '{print $2}')" = "JOB_STATE_DONE" ]
do
  # Sleep between polls to reduce API calls
  sleep 5m
done
# Send the source files to cold storage, e.g. gsutil mv ...
echo "done"
The catch is that the above assumes you only run one Dataprep job at a time. If you schedule several concurrent Dataprep jobs, the script would need to track multiple job IDs and become more complicated.
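Once the loop exits, the archiving step itself can be a gsutil move into a bucket created with a Coldline (or Archive) storage class. A minimal sketch, where both bucket names are placeholders:
#!/bin/bash
# Placeholder buckets -- the destination bucket is assumed to already exist
# with a Coldline or Archive default storage class.
SRC="gs://my-source-bucket/incoming"
DST="gs://my-archive-bucket/processed/$(date +%F)"
# Move all processed .gz files; -m parallelizes the copy and delete.
gsutil -m mv "${SRC}/*.gz" "${DST}/"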

Related

Automating the maintenance of Athena views

I am currently working on creating a data lake where we can compile, combine and analyze multiple data sets in S3.
I am using Athena and Quicksight as a central part of this to be able to quickly query and explore the data. To make things easier in Quicksight for end-users, I am creating many Athena views that do some basic transformation and aggregations.
I would like to be able to source control my views and create some automation around them so that we can have a code-driven approach and not rely on users manually updating views and running DDL to update the definitions.
There does not seem to be any support in Cloudformation for Athena views.
My current approach would be to just save the create or replace view as ... DDL in an .sql file in source control and then create some sort of script that runs the DDL so it could be made part of a continuous integration solution.
Anyone have any other experience with automation and CI for Athena views?
Long time since the OP posted, but here is a bash script that does just that. You can run it from the CI system of your choice.
The script assumes you have a directory containing all of the .sql view definition files. Within that directory there is a .env file used to make some deploy-time replacements of shell environment variables.
#!/bin/bash
export VIEWS_DIRECTORY="views"
export DEPLOY_ENVIRONMENT="dev"
export ENV_FILENAME="views.env"
export OUTPUT_BUCKET="your-bucket-name-$DEPLOY_ENVIRONMENT"
export OUTPUT_PREFIX="your-prefix"
export AWS_PROFILE="your-profile"
cd $VIEWS_DIRECTORY
# Create final .env file with any shell env replaced
env_file=".env"
envsubst < $ENV_FILENAME > $env_file
# Export variables in .env as shell environment variables
export $(grep -v '^#' ./$env_file | xargs)
# Loop through all SQL files replacing env variables and pushing to AWS
FILES="*.sql"
for view_file in $FILES
do
  echo "Processing $view_file file..."
  # Replacing env variables in query file
  envsubst < $view_file > query.sql
  # Running query via AWS CLI
  query=$(<query.sql) \
    && query_execution_id=$(aws athena start-query-execution \
      --query-string "$query" \
      --result-configuration "OutputLocation=s3://${OUTPUT_BUCKET}/${OUTPUT_PREFIX}" \
      --profile $AWS_PROFILE \
      | jq -r '.QueryExecutionId')
  # Checking for query completion successfully
  echo "Query executionID: $query_execution_id"
  while :
  do
    query_state=$(aws athena get-query-execution \
      --query-execution-id $query_execution_id \
      --profile $AWS_PROFILE \
      | jq '.QueryExecution.Status.State')
    echo "Query state: $query_state"
    if [[ "$query_state" == '"SUCCEEDED"' ]]; then
      echo "Query ran successfully"
      break
    elif [[ "$query_state" == '"FAILED"' ]]; then
      echo "Query failed with ExecutionID: $query_execution_id"
      exit 1
    elif [ -z "$query_state" ]; then
      echo "Unexpected error. Terminating routine."
      exit 1
    else
      echo "Waiting for query to finish running..."
      sleep 1
    fi
  done
done
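To make the .env mechanism concrete, here is a sketch of what the views directory might contain. The file names, database and table are hypothetical; envsubst simply replaces the ${...} references with the exported shell variables before the DDL is sent to Athena:
# Hypothetical views/views.env -- values may themselves reference variables
# exported by the CI job (e.g. DEPLOY_ENVIRONMENT).
cat > views/views.env <<'EOF'
DATABASE_NAME=analytics_${DEPLOY_ENVIRONMENT}
SOURCE_TABLE=raw_events
EOF
# Hypothetical views/daily_events.sql -- uses the variables defined above.
cat > views/daily_events.sql <<'EOF'
CREATE OR REPLACE VIEW ${DATABASE_NAME}.daily_events AS
SELECT date_trunc('day', event_time) AS day, count(*) AS events
FROM ${DATABASE_NAME}.${SOURCE_TABLE}
GROUP BY 1;
EOF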
I think you could use AWS Glue
When Should I Use AWS Glue?
You can use AWS Glue to build a data warehouse to organize, cleanse, validate, and format data. You can transform and move AWS Cloud data into your data store. You can also load data from disparate sources into your data warehouse for regular reporting and analysis. By storing it in a data warehouse, you integrate information from different parts of your business and provide a common source of data for decision making.
AWS Glue simplifies many tasks when you are building a data warehouse:
Discovers and catalogs metadata about your data stores into a central catalog. You can process semi-structured data, such as clickstream or process logs.
Populates the AWS Glue Data Catalog with table definitions from scheduled crawler programs. Crawlers call classifier logic to infer the schema, format, and data types of your data. This metadata is stored as tables in the AWS Glue Data Catalog and used in the authoring process of your ETL jobs.
Generates ETL scripts to transform, flatten, and enrich your data from source to target.
Detects schema changes and adapts based on your preferences.
Triggers your ETL jobs based on a schedule or event. You can initiate jobs automatically to move your data into your data warehouse. Triggers can be used to create a dependency flow between jobs.
Gathers runtime metrics to monitor the activities of your data warehouse.
Handles errors and retries automatically.
Scales resources, as needed, to run your jobs.
https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html
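If you do go the Glue route, the crawler and scheduled trigger mentioned above can be driven from the AWS CLI. A minimal sketch, where the crawler name, IAM role, database, job name and S3 path are all placeholders:
#!/bin/bash
# Placeholder resources -- replace with your own.
aws glue create-crawler \
  --name my-datalake-crawler \
  --role MyGlueServiceRole \
  --database-name my_datalake_db \
  --targets '{"S3Targets":[{"Path":"s3://my-datalake-bucket/raw/"}]}'
# Run an existing ETL job every day at 12:00 UTC.
aws glue create-trigger \
  --name my-daily-etl-trigger \
  --type SCHEDULED \
  --schedule "cron(0 12 * * ? *)" \
  --actions JobName=my-etl-job \
  --start-on-creation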

Remove processed source files after AWS Datapipeline completes

A third party sends me a daily upload of log files into an S3 bucket. I'm attempting to use DataPipeline to transform them into a slightly different format with awk, place the new files back on S3, then move the original files aside so that I don't end up processing the same ones again tomorrow.
Is there a clean way of doing this? Currently my shell command looks something like:
#!/usr/bin/env bash
set -eu -o pipefail
aws s3 cp s3://example/processor/transform.awk /tmp/transform.awk
for f in "${INPUT1_STAGING_DIR}"/*; do
  # strip the directory and extension to get the bare file name
  basename=${f##*/}
  basename=${basename%%.*}
  unzip -p "$f" | awk -f /tmp/transform.awk | gzip > "${OUTPUT1_STAGING_DIR}/${basename}.tsv.gz"
done
I could use the AWS CLI tool to move the source file aside on each iteration of the loop, but that seems flaky - if my loop dies halfway through processing, those earlier files are going to get lost.
A few possible solutions:
Create a trigger on your S3 bucket: whenever an object is added to the bucket, invoke a Lambda function (for example a Python script) that performs the transformation and copies the result to another bucket. On that second bucket, another Lambda trigger then deletes the corresponding file from the first bucket.
Personally, I feel what you have already achieved is good enough. All you need is error handling in the shell script: delete (or move) the source file ONLY when the output file has been successfully created, so you never lose data (you could probably also check the size of the output file), as sketched below.
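To illustrate that second suggestion, here is a minimal sketch building on the loop from the question. The source object is moved aside only after its output file exists and is non-empty; the s3://example/incoming and s3://example/processed prefixes are placeholders for wherever the raw uploads actually live and wherever you want to park processed originals:
#!/usr/bin/env bash
set -eu -o pipefail
# Placeholder prefixes -- adjust to your real raw and archive locations.
RAW_PREFIX="s3://example/incoming"
DONE_PREFIX="s3://example/processed"
aws s3 cp s3://example/processor/transform.awk /tmp/transform.awk
for f in "${INPUT1_STAGING_DIR}"/*; do
  srcname=${f##*/}          # file name with extension
  basename=${srcname%%.*}   # file name without extension
  out="${OUTPUT1_STAGING_DIR}/${basename}.tsv.gz"
  unzip -p "$f" | awk -f /tmp/transform.awk | gzip > "$out"
  # Move the original aside only once its output exists and is non-empty,
  # so a failure mid-loop never strands an unprocessed source file.
  if [ -s "$out" ]; then
    aws s3 mv "${RAW_PREFIX}/${srcname}" "${DONE_PREFIX}/${srcname}"
  fi
done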

Does an EMR master node know its cluster ID?

I want to be able to create EMR clusters, and for those clusters to send messages back to some central queue. In order for this to work, I need to have some sort of agent running on each master node. Each one of those agents will have to identify itself in this message so that the recipient knows which cluster the message is about.
Does the master node know its ID (j-*************)? If not, then is there some other piece of identifying information that could allow the message recipient to infer this ID?
I've taken a look through the config files in /home/hadoop/conf, and I haven't found anything useful. I found the ID in /mnt/var/log/instance-controller/instance-controller.log, but it looks like it'll be difficult to grep for. I'm wondering where instance-controller might get that ID from in the first place.
You can look at /mnt/var/lib/info/ on the master node to find a lot of info about your EMR cluster setup. More specifically, /mnt/var/lib/info/job-flow.json contains the jobFlowId, which is the cluster ID.
You can use the pre-installed JSON parser (jq) to get the jobflow ID.
cat /mnt/var/lib/info/job-flow.json | jq -r ".jobFlowId"
(updated as per #Marboni)
You can use the Amazon EC2 API to figure it out. The example below uses shell commands for simplicity; in real life you should use the appropriate API for these steps.
First you should find out your instance ID:
INSTANCE=`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`
Then you can use your instance ID to find out the cluster ID:
ec2-describe-instances $INSTANCE | grep TAG | grep aws:elasticmapreduce:job-flow-id
Hope this helps.
As has been specified above, the information is in the job-flow.json file. This file has several other attributes, so, knowing where it's located, you can extract the ID in a very easy way:
cat /mnt/var/lib/info/job-flow.json | grep jobFlowId | cut -f2 -d: | cut -f2 -d'"'
Edit: This command works on core nodes as well.
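Since the goal is for each master node to identify itself on a central queue, a minimal sketch of such an agent could read the ID out of job-flow.json and tag an SQS message with it. The queue URL is a placeholder, and this assumes the instance profile is allowed to call sqs:SendMessage:
#!/bin/bash
# Placeholder queue URL -- replace with your central queue.
QUEUE_URL="https://sqs.us-east-1.amazonaws.com/123456789012/emr-cluster-events"
# jq is pre-installed on EMR nodes; job-flow.json holds the cluster ID.
CLUSTER_ID=$(jq -r '.jobFlowId' /mnt/var/lib/info/job-flow.json)
# Send a message that identifies which cluster it came from.
aws sqs send-message \
  --queue-url "$QUEUE_URL" \
  --message-body "{\"clusterId\": \"$CLUSTER_ID\", \"event\": \"heartbeat\"}"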
Another option - query the metadata server:
curl -s http://169.254.169.254/2016-09-02/user-data/ | sed -r 's/.*clusterId":"(j-[A-Z0-9]+)",.*/\1/g'
Apparently the Hadoop MapReduce job has no way to know which cluster it is running on - I was surprised to find this out myself.
BUT: you can use other identifiers for each map to uniquely identify the mapper which is running, and the job that is running.
These are specified in the environment variables passed on to each mapper. If you are writing a job in Hadoop streaming, using Python, the code would be:
import os
if 'map_input_file' in os.environ:
    fileName = os.environ['map_input_file']
if 'mapred_tip_id' in os.environ:
    mapper_id = os.environ['mapred_tip_id'].split("_")[-1]
if 'mapred_job_id' in os.environ:
    jobID = os.environ['mapred_job_id']
That gives you: input file name, the task ID, and the job ID. Using one or a combination of those three values, you should be able to uniquely identify which mapper is running.
If you are looking for a specific job: "mapred_job_id" might be what you want.

AWS S3 Glacier - Programmatically Initiate Restore

I have been writing a web app using S3 for storage and Glacier for backup, so I set up a lifecycle policy to archive the objects. Now I want to write a web app that lists the archived files; the user should be able to initiate a restore from it and then get an email once their restore is complete.
The trouble I am running into is that I can't find a PHP SDK command I can issue to initiate the restore. It would also be nice if it notified SNS when the restore was complete: SNS would push the JSON onto SQS, I would poll SQS, and finally email the user when polling detected a completed restore.
Any help or suggestions would be nice.
Thanks.
You could also use the AWS CLI tool like so (here I'm assuming you want to restore all files in one directory):
aws s3 ls s3://myBucket/myDir/ | awk '{if ($4) print $4}' > myFiles.txt
for x in `cat myFiles.txt`
do
  echo "restoring $x"
  aws s3api restore-object \
    --bucket myBucket \
    --key "myDir/$x" \
    --restore-request '{"Days":30}'
done
Regarding your desire for notification, the CLI tool will report "A client error (RestoreAlreadyInProgress) occurred: Object restore is already in progress" if the request has already been initiated, and presumably a different message once the object is restored. You could run the restore command several times and look for a "restore done" error/message. Pretty hacky of course; there's probably a better way with the AWS CLI tool.
Caveat: be careful with Glacier restores that exceed the allotted free-restore amount/period. If you restore too much data too quickly, charges can pile up fast.
I wrote something fairly similar. I can't speak to the PHP API; however, there's a simple HTTP POST that kicks off the Glacier restoration.
Since that happens asynchronously (and takes up to 5 hours), you have to set up a process that polls the objects being restored by making HEAD requests for each one; the response carries the restoration status in an x-amz-restore header.
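With the AWS CLI, for example, that check can be a head-object call; the bucket and key below are placeholders, and the Restore field only appears once a restore has been requested:
#!/bin/bash
# Placeholder bucket/key -- replace with the object you asked to restore.
aws s3api head-object --bucket myBucket --key "myDir/myFile.log" | jq -r '.Restore'
# Prints something like:
#   ongoing-request="true"                                  (still restoring)
#   ongoing-request="false", expiry-date="Fri, ... GMT"     (restored copy available)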
If it helps, my ruby code for parsing this header looks like this:
if restore = headers['x-amz-restore']
  if restore.first =~ /ongoing-request="(.+?)", expiry-date="(.+?)"/
    restoring = $1 == "true"
    restore_date = DateTime.parse($2)
  elsif restore.first =~ /ongoing-request="(.+?)"/
    restoring = $1 == "true"
  end
end