I am trying to build a job that extracts data from Redshift and writes the same data to S3 buckets.
So far I have explored AWS Glue, but Glue cannot run custom SQL on Redshift. I know we can run UNLOAD commands and store the results in S3 directly. I am looking for a solution that can be parameterised and scheduled in AWS.
Consider using AWS Data Pipeline for this.
AWS Data Pipeline is an AWS service that allows you to define and schedule regular jobs. These jobs are referred to as pipelines. A pipeline contains the business logic of the work required, for example, extracting data from Redshift to S3. You can schedule a pipeline to run as often as you require, e.g. daily.
The pipeline is defined by you, and you can even put it under version control. You can prepare a pipeline definition in a browser using Data Pipeline Architect, or compose it as a JSON file locally on your computer. A pipeline definition is composed of components, such as a Redshift database, an S3 data node and an SQL activity, as well as parameters, for example to specify the S3 path to use for the extracted data.
AWS Data Pipeline service handles scheduling, dependency between components in your pipeline, monitoring and error handling.
For your specific use case, I would consider the following options:
Option 1
Define a pipeline with the following components: SqlDataNode and S3DataNode. The SqlDataNode references your Redshift database and the SELECT query used to extract your data. The S3DataNode points to the S3 path where the data will be stored. You then add a CopyActivity to copy data from the SqlDataNode to the S3DataNode. When the pipeline runs, it retrieves data from Redshift through the SqlDataNode and copies it to the S3DataNode using the CopyActivity. The S3 path in the S3DataNode can be parameterised so that it is different every time you run the pipeline.
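As a rough illustration of Option 1, here is a minimal Python sketch that assembles such a pipeline definition as a dictionary and can dump it to JSON. The object types (RedshiftDatabase, SqlDataNode, S3DataNode, CopyActivity, Ec2Resource) are the ones named in the Data Pipeline documentation, but the function name, all ids, the cluster, database, table, user and instance values are assumptions made up for this example, so check each field against the docs before using it.

```python
import json


def build_copy_pipeline(select_query: str, s3_path: str) -> dict:
    """Build a pipeline definition that copies query results from Redshift to S3.

    All ids and connection values below are hypothetical placeholders.
    """
    return {
        "objects": [
            {
                # Defaults inherited by every other object in the pipeline.
                "id": "Default",
                "name": "Default",
                "scheduleType": "ondemand",
                "role": "DataPipelineDefaultRole",
                "resourceRole": "DataPipelineDefaultResourceRole",
            },
            {
                "id": "RedshiftDb",
                "name": "RedshiftDb",
                "type": "RedshiftDatabase",
                "clusterId": "my-cluster",        # placeholder
                "databaseName": "analytics",      # placeholder
                "username": "etl_user",           # placeholder
                "*password": "#{*myRedshiftPassword}",  # assumed secure pipeline parameter
            },
            {
                "id": "SourceQuery",
                "name": "SourceQuery",
                "type": "SqlDataNode",
                "database": {"ref": "RedshiftDb"},
                "table": "sales",                 # placeholder
                "selectQuery": select_query,
            },
            {
                "id": "S3Output",
                "name": "S3Output",
                "type": "S3DataNode",
                "filePath": s3_path,              # the parameterised part
            },
            {
                "id": "Ec2Instance",
                "name": "Ec2Instance",
                "type": "Ec2Resource",
                "instanceType": "t2.micro",       # placeholder
                "terminateAfter": "1 Hour",
            },
            {
                "id": "CopyToS3",
                "name": "CopyToS3",
                "type": "CopyActivity",
                "input": {"ref": "SourceQuery"},
                "output": {"ref": "S3Output"},
                "runsOn": {"ref": "Ec2Instance"},
            },
        ]
    }


if __name__ == "__main__":
    definition = build_copy_pipeline(
        "SELECT * FROM sales", "s3://my-bucket/exports/2020-01-15/sales.csv"
    )
    print(json.dumps(definition, indent=2))
```

The printed JSON can be saved to a file and kept in version control. Passing a different s3_path on each run is one way to get a separate output location per extract.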
Option 2
First, define an SQL query with an UNLOAD statement that unloads your data to S3. Optionally, you can save it in a file and upload it to S3. Use the SqlActivity component to specify the SQL query to execute in the Redshift database. The query in SqlActivity can either be a reference to the S3 path where you stored your query, or the query itself. Whenever the pipeline runs, it connects to Redshift and executes the SQL query, which stores the data in S3.
Constraint of option 2: in the UNLOAD statement, the S3 path is static. If you plan to store every data extract in a separate S3 path, you will have to modify the UNLOAD statement to use a different S3 path every time you run it, which is not an out-of-the-box function.
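One way to work around the static path is to generate the UNLOAD statement just before each run. The following is only a sketch: the function name, the dt=<date> prefix layout and the bucket and role values are assumptions for illustration, while UNLOAD ... TO ... IAM_ROLE ... CSV ALLOWOVERWRITE follows the Redshift UNLOAD syntax.

```python
from datetime import date


def build_unload_sql(select_query: str, s3_prefix: str,
                     iam_role_arn: str, run_date: date) -> str:
    """Return an UNLOAD statement whose S3 target contains the run date."""
    # UNLOAD takes the query as a quoted string, so single quotes inside
    # the query have to be doubled.
    escaped_query = select_query.replace("'", "''")
    # Hypothetical layout: <prefix>/dt=YYYY-MM-DD/part_
    target = f"{s3_prefix.rstrip('/')}/dt={run_date.isoformat()}/part_"
    return (
        f"UNLOAD ('{escaped_query}')\n"
        f"TO '{target}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "CSV\n"
        "ALLOWOVERWRITE;"
    )


if __name__ == "__main__":
    print(build_unload_sql(
        "SELECT * FROM sales WHERE region = 'EU'",
        "s3://my-bucket/exports/",                          # placeholder bucket
        "arn:aws:iam::123456789012:role/RedshiftUnload",    # placeholder role
        date(2020, 1, 15),
    ))
```

The generated text could then be uploaded to the S3 location that the SqlActivity reads its script from, or submitted to Redshift by whatever tool runs the job.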
Where do these pipelines run?
On an EC2 instance with Task Runner, a tool provided by AWS to run data pipelines. You can start that instance automatically when the pipeline runs, or you can reference an already running instance with Task Runner installed on it. You have to make sure that the EC2 instance is allowed to connect to your Redshift database.
Relevant documentation:
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftdatabase.html
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-sqldatanode.html
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-sqlactivity.html
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-using-task-runner.html
I think Pawel has answered this correctly. I'm just adding details on option two for anyone who wants to implement it:
Go to "Data Pipeline" from AWS console
Click on "New Pipeline" on top right corner page
Copy this JSON file to your favourite editor, replace every field value that contains "$NEED_TO_UPDATE_THIS_WITH_YOURS" with the correct value for your AWS environment, and save the file as data_pipeline_template.json somewhere on your computer
Go back to the AWS console, click on "Load Local File" for the source field and upload the JSON file
If you are not able to upload it, for example because of an error related to your database instances, then follow these steps:
Go to "Data Pipeline" from AWS console
Click on "New Pipeline" on top right corner page
Populate all the fields manually (see below)
Click on "Edit in Architect" at the bottom of the page
Implement the same activities and resources as below, again making sure you add the correct values, such as your database JDBC connection
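Before uploading an edited template, it can help to check that no placeholder was missed. The helper below is a hypothetical sketch, not part of the original template: it walks any parsed JSON document and lists the paths whose value still contains "$NEED_TO_UPDATE_THIS_WITH_YOURS". The field names in the sample are made up for the example.

```python
import json

PLACEHOLDER = "$NEED_TO_UPDATE_THIS_WITH_YOURS"


def find_unfilled(node, path="$"):
    """Return the JSON paths whose string value still contains the placeholder."""
    if isinstance(node, dict):
        return [p for key, value in node.items()
                for p in find_unfilled(value, f"{path}.{key}")]
    if isinstance(node, list):
        return [p for index, value in enumerate(node)
                for p in find_unfilled(value, f"{path}[{index}]")]
    if isinstance(node, str) and PLACEHOLDER in node:
        return [path]
    return []


if __name__ == "__main__":
    # Hypothetical fragment; the real template has its own objects and fields.
    sample = json.loads(
        '{"objects": [{"id": "db", "connectionString": '
        '"$NEED_TO_UPDATE_THIS_WITH_YOURS"}, {"id": "act", "name": "ok"}]}'
    )
    print(find_unfilled(sample))
```

For the real file, load data_pipeline_template.json with json.load and upload it only once the function returns an empty list.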
Related
I have a requirement to read a CSV batch file that was uploaded to an S3 bucket, encrypt the data in some columns and persist this data in a DynamoDB table. While persisting each row in the DynamoDB table, depending on the data in each row, I need to generate an ID and store that in the DynamoDB table too. It seems AWS Data Pipeline allows you to create a job that imports S3 bucket files into DynamoDB, but I can't find a way to add custom logic there to encrypt some of the column values in the file and to generate the ID mentioned above.
Is there any way that I can achieve this requirement using AWS Data Pipeline? If not, what would be the best approach using AWS services?
We also have a situation where we need to fetch data from S3 and populate it into DynamoDB after performing some transformations (business logic).
We also use AWS DataPipeline for this process.
We first trigger an EMR cluster from Data Pipeline, where we fetch the data from S3, transform it and populate DynamoDB (DDB). You can include all the logic you require in the EMR cluster.
We have a timer set in the pipeline which triggers the EMR cluster once a day to perform the task.
Note that this can incur additional costs.
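To make "all the logic you require" a bit more concrete, here is a small Python sketch of the kind of per-row logic that could run inside the EMR step for the question above. Everything in it is an assumption for illustration: the function name, deriving the ID as a SHA-256 hash of selected columns, and the encrypt callable, which stands in for a real encryption routine (for example one backed by AWS KMS), since the Python standard library does not ship one.

```python
import hashlib
from typing import Callable, Dict, Iterable


def transform_row(row: Dict[str, str],
                  sensitive_columns: Iterable[str],
                  id_columns: Iterable[str],
                  encrypt: Callable[[str], str]) -> Dict[str, str]:
    """Return a copy of row with a generated id and encrypted sensitive columns."""
    out = dict(row)
    # Derive the id from the original (unencrypted) values so that the same
    # input row always maps to the same id.
    id_source = "|".join(row[column] for column in id_columns)
    out["id"] = hashlib.sha256(id_source.encode("utf-8")).hexdigest()
    for column in sensitive_columns:
        out[column] = encrypt(row[column])
    return out
```

Each transformed row would then be written to DynamoDB by the job. The ID rule here is only an example and should be replaced by whatever rule your data actually requires.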
Background:
I am trying to generate a patch compliance data report in QuickSight. To do this, I am using Terraform, and I have added all the inventory data to an S3 bucket.
I have created an Athena automation document which creates databases/tables in Athena from the S3 bucket data. Now I want to add some Terraform code that executes the automation document daily at a scheduled time.
For more information about this task: https://reinvent2019.awsmanagement.tools/mgt410/en/cont.html
Problem:
I can create a maintenance window to define a cron job for the automation task, but I do not have a target to add.
My Athena automation script only creates/updates a database in Athena. A target has no role here.
Can someone guide me on this issue?
Thank you in advance
You can create a CloudWatch Event that triggers on a schedule and calls a Lambda function, which in turn invokes your Athena logic. Here is a good example: https://thedataguy.in/automate-aws-athena-create-partition-on-daily-basis/
Note on QuickSight: if you are using SPICE instead of direct query, you need to manage the SPICE refresh too, which might be tricky. The default settings only allow a scheduled refresh once a day.
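A minimal sketch of what such a Lambda function could look like in Python, assuming the Athena logic boils down to running a single statement. start_query_execution is the boto3 Athena call for submitting a query; the table, database and results bucket names are placeholders invented for this example, and the statement itself (MSCK REPAIR TABLE) is only a guess at what your automation needs.

```python
def start_athena_query(athena_client, query, database, output_location):
    """Submit a query to Athena and return its execution id."""
    response = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


def lambda_handler(event, context):
    # boto3 is provided by the Lambda Python runtime; importing it here keeps
    # the helper above usable without it.
    import boto3

    athena = boto3.client("athena")
    execution_id = start_athena_query(
        athena,
        "MSCK REPAIR TABLE patch_compliance",   # placeholder statement
        "inventory_db",                         # placeholder database
        "s3://my-athena-results/",              # placeholder results bucket
    )
    return {"QueryExecutionId": execution_id}
```

The scheduled rule then only needs this function as its target, so no instance target is involved.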
I'm designing some ETL data pipelines with Airflow. Data transformations are done by provisioning an AWS EMR Spark cluster and sending it some jobs. The jobs read data from S3, process it and write it back to S3 using the date as a partition.
For my last step, I need to load the S3 data into a data warehouse using SQL scripts that are submitted to Redshift by a Python script. However, I cannot find a clean way to retrieve which data needs to be loaded, i.e. which date partitions were generated during the Spark transformations (this can only be known during the execution of the job, not beforehand).
Note that everything is orchestrated through a Python script using boto3 library that is run from a corporate VM that cannot be accessed from outside.
What would be the best way to fetch this information from EMR?
For now I'm thinking about different solutions:
- Write the information into a log file. Get the data from Spark master node using SSH through Python script
- Write the information to an S3 file
- Write the information to a database (RDS?)
I'm struggling to determine the pros and cons of these solutions. I'm also wondering what would be the best way to signal that the data transformations are over and that the metadata can be fetched.
Thanks in advance
The most straightforward approach is to use S3 as your temporary storage. After your Spark execution finishes (writing its results to S3), you can add one more step that writes the information you need in the next step to an S3 bucket.
The approach with RDS would be similar to S3, but it requires more implementation work: you need to set up RDS, maintain a schema, write code to work with RDS, and so on.
With the S3 temporary file, once EMR has terminated and Airflow runs the next step, use Boto to fetch that file (the S3 path depends on your requirements), and that is it.
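A small sketch of the temporary-file idea, assuming the file is a JSON manifest listing the generated date partitions. The function names and the manifest layout are made up for this example; put_object and get_object are the standard boto3 S3 client calls, and the client is passed in so the same code works from the Spark driver and from the orchestrating script.

```python
import json


def write_partition_manifest(s3_client, bucket, key, partitions):
    """Last step of the Spark job: record which date partitions were written."""
    body = json.dumps({"partitions": sorted(partitions)}).encode("utf-8")
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)


def read_partition_manifest(s3_client, bucket, key):
    """Next step in the orchestrator: fetch the list of partitions to load."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return json.loads(response["Body"].read())["partitions"]
```

In real use, s3_client would be boto3.client("s3"). If the manifest is written as the very last step of the job, its presence can also serve as the signal that the transformations have finished.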
I have a use case wherein I want to take data from DynamoDB and apply some transformations to it. After this I want to create 3 CSV files (there will be 3 transformations on the same data) and dump them to 3 different S3 locations.
My architecture would be roughly as follows:
Is it possible to do so? I can't seem to find any documentation regarding it. If it's not possible using pipeline, are there any other services which could help me with my use case?
These dumps will be scheduled daily. My other consideration was using AWS Lambda. But as I understand it, Lambda is triggered by events rather than scheduled by time. Is that correct?
Yes, it is possible, but using EmrActivity instead of HiveActivity. If you look at the Data Pipeline documentation for HiveActivity, it clearly states its purpose, which does not suit your use case:
Runs a Hive query on an EMR cluster. HiveActivity makes it easier to set up an Amazon EMR activity and automatically creates Hive tables based on input data coming in from either Amazon S3 or Amazon RDS. All you need to specify is the HiveQL to run on the source data. AWS Data Pipeline automatically creates Hive tables with ${input1}, ${input2}, and so on, based on the input fields in the HiveActivity object.
Below is how your data pipeline should look. There is also a built-in template, Export DynamoDB table to S3, in the AWS Data Pipeline UI, which creates the basic structure for you; you can then extend or customise it to suit your requirements.
As for your question about Lambda: Lambda can of course be configured with either event-based or schedule-based triggering, but I wouldn't recommend AWS Lambda for ETL operations, because Lambda executions are time-bound and typical ETL jobs run longer than Lambda's time limits.
AWS has offerings optimised specifically for ETL, AWS Data Pipeline and AWS Glue, and I would always recommend choosing one of the two. If your ETL involves data sources not managed within AWS compute and storage services, or any special use case that cannot be served by those two options, then AWS Batch would be my next consideration.
Thanks amith for your answer. I have been busy for quite some time now. I did some digging after you posted your answer. Turns out we can dump the data to different s3 locations using Hive activity as well.
This is how the data pipeline would look in that case.
But I believe writing multiple Hive activities when your input source is a DynamoDB table is not a good idea, since Hive does not load any data into memory. It runs all the computations against the actual table, which could degrade the table's performance. Even the documentation suggests exporting the data in case you need to run multiple queries against the same data. Reference:
Enter a Hive command that maps a table in the Hive application to the data in DynamoDB. This table acts as a reference to the data stored in Amazon DynamoDB; the data is not stored locally in Hive and any queries using this table run against the live data in DynamoDB, consuming the table’s read or write capacity every time a command is run. If you expect to run multiple Hive commands against the same dataset, consider exporting it first.
In my case I needed to perform different types of aggregations on the same data once a day. Since DynamoDB doesn't support aggregations, I turned to Data Pipeline using Hive. In the end we ended up using Amazon Aurora, which is MySQL-based.
I have my data in a table in a Redshift cluster. I want to periodically run a query against the Redshift table and store the results in an S3 bucket.
I will be running some data transformations on this data in the S3 bucket to feed into another system. As per AWS documentation I can use the UNLOAD command, but is there a way to schedule this periodically? I have searched a lot but I haven't found any relevant information around this.
You can use a scheduling tool like Airflow to accomplish this task. Airflow connects seamlessly to Redshift and S3. You can have a DAG task that queries Redshift periodically and unloads the data from Redshift to S3.
I don't believe Redshift has the ability to schedule queries periodically. You would need to use another service for this. You could use a Lambda function, or you could schedule a cron job on an EC2 instance.
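As a sketch of the Lambda route, the function below submits an UNLOAD statement through the Redshift Data API (the boto3 "redshift-data" client and its execute_statement call). Using the Data API is an assumption of this example rather than something the answer prescribes, and the environment variable names are hypothetical. The function would be invoked by a scheduled rule.

```python
import os


def submit_unload(redshift_data_client, cluster_id, database, db_user, unload_sql):
    """Submit the UNLOAD statement asynchronously and return the statement id."""
    response = redshift_data_client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=unload_sql,
    )
    return response["Id"]


def lambda_handler(event, context):
    # boto3 is provided by the Lambda Python runtime.
    import boto3

    client = boto3.client("redshift-data")
    statement_id = submit_unload(
        client,
        os.environ["REDSHIFT_CLUSTER_ID"],   # hypothetical variable names
        os.environ["REDSHIFT_DATABASE"],
        os.environ["REDSHIFT_DB_USER"],
        os.environ["UNLOAD_SQL"],
    )
    return {"StatementId": statement_id}
```

Because the statement is submitted asynchronously, the function returns quickly. Checking whether the UNLOAD succeeded would need a separate status call, which is left out here.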
I believe you are looking for the AWS Data Pipeline service.
You can copy data from redshift to s3 using the RedshiftCopyActivity (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html).
I am copying the relevant content from the above URL for future purposes:
"You can also copy from Amazon Redshift to Amazon S3 using RedshiftCopyActivity. For more information, see S3DataNode.
You can use SqlActivity to perform SQL queries on the data that you've loaded into Amazon Redshift."
Let me know if this helped.
You should try AWS Data Pipeline. You can schedule pipelines to run periodically or on demand. I am confident that it would solve your use case.