Automating the maintenance of Athena views

I am currently working on creating a data lake where we can compile, combine, and analyze multiple data sets in S3.
I am using Athena and QuickSight as a central part of this to be able to quickly query and explore the data. To make things easier in QuickSight for end users, I am creating many Athena views that do some basic transformations and aggregations.
I would like to be able to source control my views and create some automation around them so that we can have a code-driven approach and not rely on users manually updating views and running DDL to update the definitions.
There does not seem to be any support in CloudFormation for Athena views.
My current approach would be to just save the CREATE OR REPLACE VIEW AS ... DDL in a .sql file in source control and then create some sort of script that runs the DDL so it could be made part of a continuous integration solution.
Anyone have any other experience with automation and CI for Athena views?

Long time since the OP posted, but here is a bash script to do just that. You can use this script in the CI of your choice.
This script assumes that you have a directory with all your .sql view definition files. Within that directory there is a .env file used to make some deploy-time replacements of shell environment variables.
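As a hypothetical example (the variable names and values are made up for illustration), the .env file just contains shell-style assignments that get substituted with envsubst and then exported before the view DDL is processed:

# views.env - deploy-time variables (hypothetical example)
DATABASE=analytics_${DEPLOY_ENVIRONMENT}
SOURCE_BUCKET=raw-data-${DEPLOY_ENVIRONMENT}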
#!/bin/bash
export VIEWS_DIRECTORY="views"
export DEPLOY_ENVIRONMENT="dev"
export ENV_FILENAME="views.env"
export OUTPUT_BUCKET="your-bucket-name-$DEPLOY_ENVIRONMENT"
export OUTPUT_PREFIX="your-prefix"
export AWS_PROFILE="your-profile"

cd $VIEWS_DIRECTORY

# Create final .env file with any shell env replaced
env_file=".env"
envsubst < $ENV_FILENAME > $env_file

# Export variables in .env as shell environment variables
export $(grep -v '^#' ./$env_file | xargs)

# Loop through all SQL files replacing env variables and pushing to AWS
FILES="*.sql"
for view_file in $FILES
do
  echo "Processing $view_file file..."

  # Replacing env variables in query file
  envsubst < $view_file > query.sql

  # Running query via AWS CLI
  query=$(<query.sql) \
    && query_execution_id=$(aws athena start-query-execution \
      --query-string "$query" \
      --result-configuration "OutputLocation=s3://${OUTPUT_BUCKET}/${OUTPUT_PREFIX}" \
      --profile $AWS_PROFILE \
      | jq -r '.QueryExecutionId')

  # Checking for query completion successfully
  echo "Query executionID: $query_execution_id"
  while :
  do
    query_state=$(aws athena get-query-execution \
      --query-execution-id $query_execution_id \
      --profile $AWS_PROFILE \
      | jq '.QueryExecution.Status.State')
    echo "Query state: $query_state"

    if [[ "$query_state" == '"SUCCEEDED"' ]]; then
      echo "Query ran successfully"
      break
    elif [[ "$query_state" == '"FAILED"' ]]; then
      echo "Query failed with ExecutionID: $query_execution_id"
      exit 1
    elif [ -z "$query_state" ]; then
      echo "Unexpected error. Terminating routine."
      exit 1
    else
      echo "Waiting for query to finish running..."
      sleep 1
    fi
  done
done
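As a hypothetical illustration of the placeholder mechanism, a view definition file in the directory could reference the variables from the .env file; envsubst resolves them right before the DDL is submitted to Athena:

# Hypothetical example of a view definition that uses env placeholders
# (the quoted heredoc keeps ${DATABASE} literal so envsubst can replace it at deploy time)
cat > views/orders_by_day.sql <<'EOF'
CREATE OR REPLACE VIEW ${DATABASE}.orders_by_day AS
SELECT order_date, count(*) AS total_orders
FROM ${DATABASE}.orders
GROUP BY order_date
EOF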

I think you could use AWS Glue.
When Should I Use AWS Glue?
You can use AWS Glue to build a data warehouse to organize, cleanse, validate, and format data. You can transform and move AWS Cloud data into your data store. You can also load data from disparate sources into your data warehouse for regular reporting and analysis. By storing it in a data warehouse, you integrate information from different parts of your business and provide a common source of data for decision making.
AWS Glue simplifies many tasks when you are building a data warehouse:
Discovers and catalogs metadata about your data stores into a central catalog. You can process semi-structured data, such as clickstream or process logs.
Populates the AWS Glue Data Catalog with table definitions from scheduled crawler programs. Crawlers call classifier logic to infer the schema, format, and data types of your data. This metadata is stored as tables in the AWS Glue Data Catalog and used in the authoring process of your ETL jobs.
Generates ETL scripts to transform, flatten, and enrich your data from source to target.
Detects schema changes and adapts based on your preferences.
Triggers your ETL jobs based on a schedule or event. You can initiate jobs automatically to move your data into your data warehouse. Triggers can be used to create a dependency flow between jobs.
Gathers runtime metrics to monitor the activities of your data warehouse.
Handles errors and retries automatically.
Scales resources, as needed, to run your jobs.
https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html
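If you go the Glue route, a crawler is what populates the Data Catalog tables that Athena (and therefore the views) query. A minimal, hypothetical sketch with the AWS CLI, where the crawler name, role, database, and S3 path are all placeholders:

# Hypothetical example: catalog an S3 prefix so Athena can query it
aws glue create-crawler \
  --name datalake-raw-crawler \
  --role AWSGlueServiceRole-datalake \
  --database-name datalake_db \
  --targets '{"S3Targets": [{"Path": "s3://your-bucket-name/your-prefix/"}]}'

aws glue start-crawler --name datalake-raw-crawler

Note that the crawler only manages table metadata; the view DDL would still need to be deployed separately, for example with a script like the one above.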

Related

How to schedule BigQuery DataTransfer Service using bq command

I am trying to create a Data Transfer Service using BigQuery. I used the bq command to create the DTS and was able to create it successfully.
I need to specify a custom time for scheduling using the bq command.
Is it possible to set a custom schedule while creating the Data Transfer Service? Refer to the sample bq command:
bq mk --transfer_config \
--project_id='My project' \
--target_dataset='My Dataset' \
--display_name='test_bqdts' \
--params='{"data_path":<data_path>,
"destination_table_name_template":<destination_table_name>,
"file_format":<>,
"ignore_unknown_values":"true",
"access_key_id": "access_key_id",
"secret_access_key": "secret_access_key"
}' \
--data_source=data_source_id
NOTE: When you create an Amazon S3 transfer using the command-line tool, the transfer configuration is set up using the default value for Schedule (every 24 hours).
You can use the flag --schedule as you can see here
Option 2: Use the bq mk command.
Scheduled queries are a kind of transfer. To schedule a query, you can use the BigQuery Data Transfer Service CLI to make a transfer configuration.
Queries must be in StandardSQL dialect to be scheduled.
Enter the bq mk command and supply the transfer creation flag --transfer_config. The following flags are also required:
--data_source
--target_dataset (Optional for DDL/DML queries.)
--display_name
--params
Optional flags:
--project_id is your project ID. If --project_id isn't specified, the default project is used.
--schedule is how often you want the query to run. If --schedule isn't specified, the default is 'every 24 hours' based on creation time.
For DDL/DML queries, you can also supply the --location flag to specify a particular region for processing. If --location isn't specified, the global Google Cloud location is used.
--service_account_name is for authenticating your scheduled query with a service account instead of your individual user account. Note: Using service accounts with scheduled queries is in beta.
bq mk \
--transfer_config \
--project_id=project_id \
--target_dataset=dataset \
--display_name=name \
--params='parameters' \
--data_source=data_source
If you want to set a 24-hour schedule, for example, you should use --schedule='every 24 hours'.
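For instance, the command from the question could be given an explicit schedule by adding that flag (the --params JSON is the same as in the question, abbreviated here):

bq mk --transfer_config \
  --project_id='My project' \
  --target_dataset='My Dataset' \
  --display_name='test_bqdts' \
  --schedule='every 24 hours' \
  --params='{...}' \
  --data_source=data_source_id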
You can find the complete reference for the time syntax here
I hope it helps

Why doesn't my Kinesis Analytics Application Schema Discovery work?

I am sending comma-separated data to my kinesis stream, and I want my kinesis analytics app to recognize that there are two columns (both bigints). But when I populate my stream with some records and click "Discover Schema", it always gives me a schema of one column! Here's a screenshot:
I have tried many different delimiters to indicate columns, including comma, space, and comma-space, but none of these cause aws to detect my schema properly. At one point I gave up and edited the schema manually, which caused this error:
While I know that I have the option to keep the schema as a single column and use string and date-time manipulation to structure my data, I prefer not to do it this way... Any suggestions?
While I wasn't able to get the schema discovery tool to work at first, I realized that I was able to manually edit my schema and it works fine. I was getting that error because I had only populated the stream initially and was not continuously sending data.
Schema discovery required me to send data to my input Kinesis stream while the discovery was running. To do this for my proof-of-concept application I used the AWS CLI:
# emittokinesis.sh
JSON='{
  "messageId": "31c14ee7-9bde-484d-af05-03509c2c33aa",
  "myTest": "myValue"
}'
echo "$JSON"
JSONBASE64=$(echo ${JSON} | base64)
echo 'aws kinesis put-record --stream-name logstash-input-test --partition-key 1 --data "'${JSONBASE64}'"'
aws kinesis put-record --stream-name logstash-input-test --partition-key 1 --data "${JSONBASE64}"
I clicked the "Run Schema Discovery" button in the AWS UI and then quickly ran my shell script in a CMD window.
Once my initial schema was discovered I could manually edit the schema but it mostly matched what I expected based on my input JSON.
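If discovery still misses the data, one untested option is to wrap the script in a loop so records keep arriving for the whole time the discovery runs (the one-second interval is just an example):

# keep emitting records while "Discover Schema" is running; stop with Ctrl-C
while true
do
  ./emittokinesis.sh
  sleep 1
done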

How can I export data from Bigquery to an external server in a CSV?

I need to automate a process to extract data from Google BigQuery and export it to a CSV on an external server outside of GCP.
While researching how to do that, I found some commands to run from my external server, but I would prefer to do everything in GCP to avoid possible problems.
To export the table to a compressed CSV in Google Cloud Storage:
bq --location=US extract --compression GZIP 'dataset.table' gs://example-bucket/myfile.csv
To download the CSV from Google Cloud Storage:
gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [OBJECT_DESTINATION]
But I would like to hear your suggestions
If you want to fully automate this process, I would do the following:
Create a Cloud Function to handle the export:
This is the more lightweight solution, as Cloud Functions are serverless and provide flexibility to implement code with the client libraries. See the quickstart; I recommend using the console to create the function to start with.
In this example I recommend triggering the Cloud Function from an HTTP request, i.e. when the function URL is called, it will run the code inside it.
An example Cloud Function in Python that creates the export when an HTTP request is made:
main.py
from google.cloud import bigquery


def hello_world(request):
    project_name = "MY_PROJECT"
    bucket_name = "MY_BUCKET"
    dataset_name = "MY_DATASET"
    table_name = "MY_TABLE"

    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

    bq_client = bigquery.Client(project=project_name)
    dataset = bq_client.dataset(dataset_name, project=project_name)
    table_to_export = dataset.table(table_name)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        # Location must match that of the source table.
        location="US",
        job_config=job_config,
    )

    return "Job with ID {} started exporting data from {}.{} to {}".format(
        extract_job.job_id, dataset_name, table_name, destination_uri
    )
requirements.txt
google-cloud-bigquery
Note that the job will run asynchronously in the background; you will receive a response with the job ID, which you can use to check the state of the export job from Cloud Shell by running:
bq show -j <job_id>
Create a Cloud Scheduler scheduled job:
Follow this documentation to get started. You can set the Frequency with the standard cron format, for example 0 0 * * * will run the job every day at midnight.
As a target, choose HTTP; in the URL field put the Cloud Function HTTP URL (you can find it in the console, inside the Cloud Function details, under the Trigger tab), and as the HTTP method choose GET.
Create it, and you can test it in the Cloud Scheduler by pressing the Run now button in the Console.
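For reference, the same scheduled job could also be created from the command line. This is a hypothetical sketch where the job name and function URL are placeholders:

gcloud scheduler jobs create http bq-export-daily \
  --schedule="0 0 * * *" \
  --uri="https://REGION-PROJECT_ID.cloudfunctions.net/FUNCTION_NAME" \
  --http-method=GET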
Synchronize your external server and the bucket:
Up until now you have only scheduled exports to run every 24 hours. To synchronize the bucket contents with your external server, you can use the gsutil rsync command. If you want to save the exports, let's say to the my_exports folder, you can run this on your external server:
gsutil rsync gs://BUCKET_WITH_EXPORTS /local-path-to/my_exports
To run this command periodically, you could create a standard cron job in the crontab of your external server, also running each day, just a few hours after the BigQuery export, to ensure that the export has been made.
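A hypothetical crontab entry on the external server, pulling new exports at 03:00 every day (a few hours after the midnight export):

0 3 * * * gsutil rsync gs://BUCKET_WITH_EXPORTS /local-path-to/my_exports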
Extra:
I have hard-coded most of the variables in the Cloud Function to be always the same. However, you can send parameters to the function if you make a POST request instead of a GET request and pass the parameters as data in the body.
You will have to change the Cloud Scheduler job to send a POST request to the Cloud Function HTTP URL, and in the same place you can set the body to send the parameters for the table, dataset and bucket, for example. This will allow you to run exports from different tables at different hours, and to different buckets.
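As a hypothetical sketch, the POST body could look like this (the function would then need to read these values from the request instead of using the hard-coded variables); the URL is a placeholder:

curl -X POST "https://REGION-PROJECT_ID.cloudfunctions.net/FUNCTION_NAME" \
  -H "Content-Type: application/json" \
  -d '{"dataset_name": "MY_DATASET", "table_name": "MY_TABLE", "bucket_name": "MY_BUCKET"}'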

Dataprep: job finish event

We are considering using Dataprep on an automatic schedule in order to wrangle and load a folder of GCS .gz files into BigQuery.
The challenge is: how can the source .gz files be moved to cold storage once they are processed?
I can't find an event generated by Dataprep that we could hook into in order to perform the archiving task. What would be ideal is if Dataprep could archive the source files by itself.
Any suggestions?
I don't believe there is a way to get notified when a job is done directly from Dataprep. What you could do instead is poll the underlying dataflow jobs. You could schedule a script to run whenever your scheduled dataprep job runs. Here's a simple example:
#!/bin/bash
# list running dataflow jobs, filter so that only the one with the "dataprep" string in its name is actually listed and keep its id
id=$(gcloud dataflow jobs list --status=active --filter="name:dataprep" | sed -n 2p | cut -f 1 -d " ")

# loop until the state of the job changes to done
until [ $(gcloud dataflow jobs describe $id | grep currentState | head -1 | awk '{print $2}') == "JOB_STATE_DONE" ]
do
  # sleep so that you reduce API calls
  sleep 5m
done

# send to cold storage, e.g. gsutil mv ...
echo "done"
The problem here is that the above assumes that you only run one dataprep job. If you schedule many concurrent dataprep jobs the script would be more complicated.

AWS S3 Glacier - Programmatically Initiate Restore

I have been writing a web app using S3 for storage and Glacier for backup, so I set up a lifecycle policy to archive objects. Now I want to write a web app that lists the archived files; the user should be able to initiate a restore from it and then get an email once their restore is complete.
The trouble I am running into is that I can't find a PHP SDK command I can issue to initiate the restore. It would then be nice if it notified SNS when the restore was complete, SNS would push the JSON onto SQS, and I would poll SQS and finally email the user when polling detected a complete restore.
Any help or suggestions would be nice.
Thanks.
You could also use the AWS CLI tool like so (here I'm assuming you want to restore all files in one directory):
aws s3 ls s3://myBucket/myDir/ | awk '{if ($4) print $4}' > myFiles.txt
for x in `cat myFiles.txt`
do
echo "restoring $x"
aws s3api restore-object \
--bucket myBucket \
--key "myDir/$x" \
--restore-request '{"Days":30}'
done
Regarding your desire for notification, the CLI tool will report "A client error (RestoreAlreadyInProgress) occurred: Object restore is already in progress" if the request has already been initiated, and probably a different message once the object is restored. You could run this restore command several times, looking for a "restore done" error/message. Pretty hacky of course; there's probably a better way with the AWS CLI tool.
Caveat: be careful with Glacier restores that exceed the allotted free-restore amount/period. If you restore too much data too quickly, charges can pile up very fast.
I wrote something fairly similar. I can't speak to any PHP API; however, there's a simple HTTP POST that kicks off Glacier restoration.
Since that happens asynchronously (and can take up to 5 hours), you have to set up a process to poll files that are restoring by making HEAD requests for the object, which will have restoration status info in an x-amz-restore header.
If it helps, my ruby code for parsing this header looks like this:
if restore = headers['x-amz-restore']
  if restore.first =~ /ongoing-request="(.+?)", expiry-date="(.+?)"/
    restoring = $1 == "true"
    restore_date = DateTime.parse($2)
  elsif restore.first =~ /ongoing-request="(.+?)"/
    restoring = $1 == "true"
  end
end
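For a quick check outside of application code, the same information can be inspected with the AWS CLI; the restore status shows up in the Restore field of the head-object response (bucket and key below are placeholders):

aws s3api head-object --bucket myBucket --key "myDir/myFile"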
end