Background:
I am trying to generate patch compliance data report in quicksight. In order to do it I am using terraform I have added all inventory data in S3 bucket.
I have created Athena automation document which creates database/tables in Athena using S3 bucket data. Now I want to add some terraform code which execute automation document daily on scheduled time.
For more information about this task: https://reinvent2019.awsmanagement.tools/mgt410/en/cont.html
Problem:
I can create maintenance window to define crone job for automation task but I do not have target to add.
My Athena Automation script is only creating/updating database in the Athena.There is no role of target here.
Can someone guid me on this issue?
Thank you in advance
You can create a CloudWatch Event that triggers on schedule and calls Lambda function, that in turn invokes you Athena logic. Here is the good example: https://thedataguy.in/automate-aws-athena-create-partition-on-daily-basis/
Note on QuickSight - if you are using Spice, instead of direct query - you need to manage Spice rebuild too. Which might be tricky... The default setting only allow for once-a-day rebuild on schedule.
Related
I'm trying to create the Architecture on AWS where a lambda function runs SQL Code to refresh a materialized view on AWS Redshift. I would like the materialized view to refresh after the daily ETL processes have completed on the Redshift cluster. Is there a way of setting up the lambda function to be triggered after a particular SQL command on the Redshift Cluster has completed?
Unfortunately, I've only seen examples of people scheduling the Lambda Function to run on particular intervals/at a particular time. Any help would be much appreciated.
A couple of ways that this can be done (out of many):
Have the ETL process trigger the Lambda - this is straight forward
if the ETL tool can generate the trigger but organizational factors
can make changing ETL frameworks difficult.
Use an S3 semaphore - have your ETL SQL UNLOAD some small data (like
a text string of metadata) to S3 where the objects creation will
trigger the Lambda. Insert the UNLOAD at the point in the ETL SQL
where you want the update to occur.
I Am working on one AWS POC, it uses different aws component, below are the details of each individual components.
1- java function have code to generate data, I am calling it from lambda function through cloud watch scheduler
2- datapipe-line to copy data from RDS to S3.
3- Run hive scripts using athena over s3 data.
4- quicksight for visualization.
I am done with creating individual model but not able to understand what could be best way to connect all these components,So it can run in one go.
one though is to use lambda as a connector for each step. but have no template to connect lamda with Athena.
Kindly anyone can suggest best way to connect all above component.So that it can run in one go.
I am not familiar with hive scripts or quicksight, but a cloudformation stack or a terraform stack should assist you to connect various aws components as your workflow demands.
I am trying to build out a job for extracting data from Redshift and write the same data to S3 buckets.
Till now I have explored AWS Glue, but Glue is not capable to run custom sql's on redshift. I know we can run unload commands and can be stored to S3 directly. I am looking for a solution which can be parameterised and scheduled in AWS.
Consider using AWS Data Pipeline for this.
AWS Data Pipeline is AWS service that allows you to define and schedule regular jobs. These jobs are referred to as pipelines. Pipeline contains a business logic of the work required, for example, extracting data from Redshift to S3. You can schedule a pipeline to run however often you require e.g. daily.
Pipeline is defined by you, you can even version control it. You can prepare a pipeline definition in a browser using Data Pipeline Architect or compose it using JSON file locally on your computer. Pipeline definition is composed of components, such as, Redshift database, S3 node , SQL activity, as well as parameters, for example to specifying S3 path to use for extracted data.
AWS Data Pipeline service handles scheduling, dependency between components in your pipeline, monitoring and error handling.
For your specific use case, I would consider the following options:
Option 1
Define pipeline with the following components: SQLDataNode and S3DataNode. SQLDataNode would reference your Redshift database and SELECT query to use to extract your data. S3DataNode would point to S3 path to be used to store your data. You add a CopyActivity activity to copy data from SQLDataNode to S3DataNode. When such pipeline runs, it will retrieve data from Redshift using SQLDataNode and copy that data to S3DataNode using CopyActivity. S3 path in S3DataNode can be parameterised so it is different every time you run a pipeline.
Option 2
Firstly, define SQL query with UNLOAD statement to be used to unload your data to S3. Optionally, you can save it in a file and upload to S3. Use SQLActivity component to specify SQL query to execute in Redshift database. SQL query in SQLActivity can be a reference to S3 path where you stored your query (optionally), or just a query itself. Whenever a pipeline runs, it will connect to Redshift and execute SQL query which stores the data in S3.
Constraints of option 2: in UNLOAD statement, S3 path is static. If you plan to store every data extract in a separate S3 path, you will have to modify UNLOAD statement to use another S3 path every time you run it which is not out-of-the-box function.
Where do these pipelines run?
On EC2 instance with a TaskRunner, a tool provided by AWS to run data pipelines. You can start that instance automatically at the time when pipeline runs, or you can reference already running instance with a TaskRunner installed on it. You have to make sure that EC2 instance is allowed to connect to your Redshift database.
Relevant documentation:
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftdatabase.html
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-sqldatanode.html
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-sqlactivity.html
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-using-task-runner.html
I think Pawel has answered this correctly , I'm just adding details on option two for anyone who wants to implement this:
Go to "Data Pipeline" from AWS console
Click on "New Pipeline" on top right corner page
Edit each field in this json file(after copying to your favorite editor) and update the fields which has "$NEED_TO_UPDATE_THIS_WITH_YOURS" with the correct value that pertains to your AWS environment and save it as data_pipeline_template.json some where on your computer
Go back to AWS Console again, Click on "Load Local File" for the source field and upload the json file
if you are not able to upload it because you may be getting some error related to your database instances etc then follow these steps:
Go to "Data Pipeline" from AWS console
Click on "New Pipeline" on top right corner page
Populate all the fields manually (see below)
Click on "Edit in Architect" at the bottom of the page
Implement the same activities and resources as below , again make sure your are adding the correct values such as your Database JDBC connection etc
I have my data in a table in Redshift cluster. I want to periodically run a query against the Redshift table and store the results in a S3 bucket.
I will be running some data transformations on this data in the S3 bucket to feed into another system. As per AWS documentation I can use the UNLOAD command, but is there a way to schedule this periodically? I have searched a lot but I haven't found any relevant information around this.
You can use a scheduling tool like Airflow to accomplish this task. Airflow seem-lessly connects to Redshift and S3. You can have a DAG action, which polls Redshift periodically and unloads the data from Redshift onto S3.
I don't believe Redshift has the ability to schedule queries periodically. You would need to use another service for this. You could use a Lambda function, or you could schedule a cron job on an EC2 instance.
I believe you are looking for AWS data pipeline service.
You can copy data from redshift to s3 using the RedshiftCopyActivity (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html).
I am copying the relevant content from the above URL for future purposes:
"You can also copy from Amazon Redshift to Amazon S3 using RedshiftCopyActivity. For more information, see S3DataNode.
You can use SqlActivity to perform SQL queries on the data that you've loaded into Amazon Redshift."
Let me know if this helped.
You should try AWS Data Pipelines. You can schedule them to run periodically or on demand. I am confident that it would solve your use case
As per AWS docs, there's no Redshift-Lambda integration yet.
What we would like to do is monitoring redshift activity in order to do something when a redshift table is created, a copy from S3 is made or a bulk insert is performed.
Is there a way to register this kind of activity, and then do something similar to run a lambda function ir order run a small script or so?
Redshift provides an event notification mechanism. You can find a full list of the event categories and messages here. If that covers the kind of information you are interested in you can simply have your Lambda function add the SNS topic used by Redshift for event notification as an event source and your Lambda function will get called every time an event is sent by Redshift.
You can enable audit logs that end up in s3.
All the info you want is also available in various admin tables with prefixes like stl_, stv_ and pg_. For example, COPY commands from S3 are recorded in stl_load_commits, and stl_utilitytext has info on non-select queries like CREATE.
As for triggering events, you could have S3 trigger a lambda when one of the log files lands or run occasional jobs that query the system tables and take action with something like cron jobs or airflow.