Orchestration of Redshift stored procedures using AWS Glue - amazon-web-services

I have multiple Redshift stored procedures (~15); some depend on stored procedures that ran earlier, while others can run asynchronously.
I need to orchestrate them with proper failure handling, so that if a downstream stored procedure fails I can rerun just that one.
I tried orchestrating this using AWS EventBridge, but I ran into many limitations, such as triggering a specific stored procedure, and working within the 5 targets of an EventBridge rule to run a combination of sync and async procedures.
Is there any way to run my stored procedures in AWS Glue using the custom canvas to construct the orchestration, putting one stored procedure in each block?
How do I set up the Redshift connection in the flow diagram so that my stored procedures are executed in my Redshift cluster?

I don't know about Glue (I doubt it), but this is a great case for Step Functions. There are lots of blog posts about how to set up a serverless data orchestration process using Step Functions with proper error/exception handling. A place to start: https://aws.amazon.com/blogs/big-data/orchestrate-multiple-etl-jobs-using-aws-step-functions-and-aws-lambda/
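A minimal sketch of how the dependency handling could be expressed, assuming a hypothetical Lambda function (run-procedure) that CALLs one stored procedure and returns when it finishes. The procedure names, the ARN, and the retry settings are illustrative, not taken from the question. Procedures whose dependencies are satisfied are grouped into a stage and run as branches of a Parallel state; a failure in any branch is caught and routed to a Fail state.

```python
import json
from graphlib import TopologicalSorter

# Hypothetical procedures: each name maps to the procedures it must wait for.
DEPENDENCIES = {
    "sp_load_customers": set(),
    "sp_load_orders": set(),
    "sp_build_sales_mart": {"sp_load_customers", "sp_load_orders"},
    "sp_refresh_reports": {"sp_build_sales_mart"},
}

# Hypothetical Lambda that runs one procedure and returns when it completes.
RUNNER_ARN = "arn:aws:lambda:us-east-1:123456789012:function:run-procedure"


def stages(dependencies):
    """Group procedures into stages; members of one stage can run concurrently."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


def procedure_task(name):
    """Task state that invokes the runner Lambda for one procedure, with retries."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {"FunctionName": RUNNER_ARN, "Payload": {"procedure": name}},
        "Retry": [{
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 30,
            "MaxAttempts": 2,
            "BackoffRate": 2.0,
        }],
        "End": True,
    }


def build_state_machine(dependencies):
    """Build an Amazon States Language definition with one Parallel state per stage."""
    states = {"ProcedureFailed": {"Type": "Fail", "Error": "ProcedureFailed"}}
    layers = stages(dependencies)
    for index, layer in enumerate(layers):
        state = {
            "Type": "Parallel",
            "Branches": [
                {"StartAt": name, "States": {name: procedure_task(name)}}
                for name in layer
            ],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ProcedureFailed"}],
        }
        if index + 1 < len(layers):
            state["Next"] = f"Stage{index + 2}"
        else:
            state["End"] = True
        states[f"Stage{index + 1}"] = state
    return {"StartAt": "Stage1", "States": states}


if __name__ == "__main__":
    print(json.dumps(build_state_machine(DEPENDENCIES), indent=2))
```

The printed JSON could be pasted into a state machine definition. Keeping one branch per procedure means the execution history shows which procedure failed, which is what you need in order to rerun just that one.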

Related

AWS lambda function for copying data into Redshift

I am new to the AWS world and I am trying to implement a process where data written into S3 by AWS EMR can be loaded into AWS Redshift. I am using Terraform to create S3, Redshift, and other supporting functionality. For loading data I am using a Lambda function which gets triggered when the Redshift cluster is up. The Lambda function has the code to copy the data from S3 to Redshift. Currently the process seems to work fine. The amount of data is currently low.
My question is:
This approach seems to work right now, but I don't know how it will work once the volume of data increases, and what happens if the Lambda function times out?
Can someone please suggest an alternative way of handling this scenario, even one that does not use Lambda? One alternative I came across while searching for this topic is AWS Data Pipeline.
Thank you
A serverless approach I've recommended clients move to in this case is the Redshift Data API (and Step Functions if needed). With the Redshift Data API you can launch a SQL command (COPY) and close your Lambda function. The COPY command will run to completion, and if this is all you need to do then you're done.
If you need to take additional actions after the COPY then you need a polling Lambda that checks to see when the COPY completes. This is enabled by the Redshift Data API. Once the COPY completes you can start another Lambda to run the additional actions. All these Lambdas and their interactions are orchestrated by a Step Function that:
launches the first Lambda (initiates the COPY)
has a wait loop that calls the "status checker" Lambda every 30 sec (or whatever interval you want) and keeps looping until the checker says that the COPY completed successfully
Once the status checker lambda says the COPY is complete the step function launches the additional actions Lambda
The Step function is an action sequencer and the Lambdas are the actions. There are a number of frameworks that can set up the Lambdas and Step Function as one unit.
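The two Lambdas described above could look roughly like this. It is a sketch that assumes `client` is a boto3 `redshift-data` client (created with `boto3.client("redshift-data")`) or any object exposing the same `execute_statement` and `describe_statement` calls; the cluster, database, and user values are supplied by the caller. `wait_for_copy` is only a local stand-in for the Step Functions wait loop, so the logic can be tried outside AWS.

```python
import time


def start_copy(client, cluster_id, database, db_user, copy_sql):
    """First Lambda: submit the COPY and return its statement id without waiting."""
    response = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=copy_sql,
    )
    return response["Id"]


def check_status(client, statement_id):
    """Status checker Lambda: return the status, raising if the COPY did not succeed."""
    description = client.describe_statement(Id=statement_id)
    status = description["Status"]
    if status in ("FAILED", "ABORTED"):
        detail = description.get("Error", "no error text returned")
        raise RuntimeError(f"COPY {status}: {detail}")
    return status


def wait_for_copy(client, statement_id, interval_seconds=30, sleep=time.sleep):
    """Local stand-in for the state machine's wait loop."""
    while check_status(client, statement_id) != "FINISHED":
        sleep(interval_seconds)
    return "FINISHED"
```

In the Step Function itself, `check_status` would sit behind a Wait state and a Choice state instead of the `while` loop, so no Lambda stays running while the COPY executes.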
With bigger datasets, as you already know, Lambda may time out. But 15 minutes is still a lot of time, so you can implement an alternative solution in the meantime.
I wouldn't recommend Data Pipeline as it might be overhead (it will start an EC2 instance to run your commands). Your problem is simply the timeout, so you may use either ECS Fargate or a Glue Python Shell job. Either of them can be triggered by a CloudWatch Event triggered on an S3 event.
a. Using ECS Fargate, you'll have to take care of the Docker image and set up the ECS infrastructure, i.e. Task Definition and Cluster (simple for Fargate).
b. Using a Glue Python Shell job, you'll simply have to deploy your Python script in S3 (along with the required packages as wheel files) and link those files in the job configuration.
Both of these options are serverless and you may choose one based on ease of deployment and your comfort level with Docker.
ECS doesn't have any timeout limits, while the timeout limit for Glue is 2 days.
Note: To trigger an AWS Glue job from a CloudWatch Event, you'll have to use a Lambda function, as CloudWatch Events doesn't support Glue start job yet.
Reference: https://docs.aws.amazon.com/eventbridge/latest/APIReference/API_PutTargets.html
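A sketch of the small Lambda mentioned in the note, assuming it receives a standard S3 notification event and that `glue_client` is a boto3 Glue client (`boto3.client("glue")`). The job name and the `--source_path` argument are hypothetical; the Glue script would have to read that argument itself.

```python
from urllib.parse import unquote_plus

JOB_NAME = "load-redshift-job"  # hypothetical Glue Python Shell job name


def start_glue_job(event, glue_client, job_name=JOB_NAME):
    """Start one Glue job run per S3 object in the event and return the run ids."""
    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue_client.start_job_run(
            JobName=job_name,
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return run_ids
```

The Lambda only starts the job and returns, so its own 15-minute limit no longer matters for the load itself.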

ML Pipeline on AWS SageMaker: How to create long-running query/preprocessing tasks

I'm a software engineer transitioning toward machine learning engineering, but need some assistance.
I'm currently using AWS Lambda and Step Functions to run query and preprocessing jobs for my ML pipeline, but am constrained by Lambda's 15-minute runtime limit.
We're a strictly AWS shop, so I'm kind of stuck with SageMaker and other AWS tools for the time being. Later on we'll consider experimenting with something like Kubeflow if it looks advantageous enough.
My current process
I have my data scientists write Python scripts (in a git repo) for the query and preprocessing steps of a model, and deploy them (via Terraform) as Lambda functions, then use Step Functions to sequence the ML pipeline steps as a DAG (query -> preprocess -> train -> deploy).
The Query Lambda pulls data from our data warehouse (Redshift) and writes the unprocessed dataset to S3.
The Preprocessing Lambda loads the unprocessed dataset from S3, manipulates it as needed, and writes it as training & validation datasets to a different S3 location.
The Train and Deploy tasks use the SageMaker Python API to train and deploy the models as SageMaker Endpoints.
Do I need to be using Glue and SageMaker Processing jobs? From what I can tell, Glue seems more targeted towards ETLs than for writing to S3, and SageMaker Processing jobs seem a bit more complex to deploy to than Lambda.
There is a solution that just came out for long-running actions in Redshift: the Redshift Data API. https://aws.amazon.com/about-aws/whats-new/2020/09/announcing-data-api-for-amazon-redshift/
This allows Lambdas in a Step Function to issue SQL statements to Redshift and poll to see when the SQL is done. Now the run time of your Lambda is only as long as it needs to launch the SQL.
As for the processing steps - I'd recommend doing as much of the processing inside of Redshift before unloading the data to S3 (I hope you are not pulling lots of data through a select statement). This will be much faster than processing in Lambda and can benefit from Data API as well. Now there will likely be some processing steps that you cannot do in Redshift and Lambda is a good option. One additional benefit of UNLOAD is that you can set the output file size. This way you can launch a Lambda per file of the output and then you have many, shorter running Lambdas.
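As an illustration of the file-size point, here is a small helper that builds an UNLOAD statement with MAXFILESIZE. The query, S3 prefix, and IAM role in the usage note are placeholders, and CSV output is an assumption rather than something the answer specifies.

```python
def build_unload(query, s3_prefix, iam_role_arn, max_file_size_mb=100):
    """Return an UNLOAD statement that caps each output file at max_file_size_mb."""
    # Redshift accepts MAXFILESIZE values between 5 MB and 6.2 GB.
    if not 5 <= max_file_size_mb <= 6200:
        raise ValueError("max_file_size_mb must be between 5 and 6200")
    # Single quotes inside the quoted query are escaped by doubling them.
    escaped_query = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped_query}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS CSV\n"
        f"MAXFILESIZE {max_file_size_mb} MB;"
    )
```

For example, `build_unload("select * from events where region = 'EU'", "s3://my-bucket/unload/events_", "arn:aws:iam::123456789012:role/unload-role", 50)` produces a statement that could be submitted through the Data API, with each resulting file small enough for one short-running Lambda.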
You could attempt to break up the work and have many Lambdas running in series but processing large amounts of data at once is not a strength of Lambda. Being able to do this will depend on the data processing you are doing.
You could use Glue for this but this is likely complete overkill, a whole new service to learn, and since it is an EMR wrapper it can get costly. To be honest Glue is not my favorite AWS service as it only does the most basic things easily and anything even slightly complex becomes a battle. So if this is a tool you know and like go for it.

Is AWS Lambda preferred over AWS Glue Job?

In an AWS Glue job, we can write a script and execute it via the job.
In AWS Lambda too, we can write the same script and execute the same logic as in the job above.
So my question is not what the difference is between an AWS Glue job and AWS Lambda, but when an AWS Glue job should be preferred over AWS Lambda, especially when both do the same job. If both do the same job, then ideally I would blindly prefer using AWS Lambda, right?
Please try to understand my query.
Additional points:
Per this source and Lambda FAQ and Glue FAQ
Lambda can use a number of different languages (Node.js, Python, Go, Java, etc.) vs. Glue can only execute jobs using Scala or Python code.
Lambda can execute code from triggers by other services (SQS, Kafka, DynamoDB, Kinesis, CloudWatch, etc.) vs. Glue, which can be triggered by Lambda events, other Glue jobs, manually, or from a schedule.
Lambda runs much faster for smaller tasks vs. Glue jobs which take longer to initialize due to the fact that it's using distributed processing. That being said, Glue leverages its parallel processing to run large workloads faster than Lambda.
Lambda looks to require more complexity/code to integrate with data sources (Redshift, RDS, S3, DBs running on ECS instances, DynamoDB, etc.), while Glue can easily integrate with these. However, with the addition of Step Functions, multiple Lambda functions can be written and ordered sequentially to reduce complexity and improve modularity, where each function could integrate with an AWS service (Redshift, RDS, S3, DBs running on ECS instances, DynamoDB, etc.).
Glue looks to have a number of additional components, such as Data Catalog which is a central metadata repository to view your data, a flexible scheduler that handles dependency resolution/job monitoring/retries, AWS Glue DataBrew for cleaning and normalizing data with a visual interface, AWS Glue Elastic Views for combining and replicating data across multiple data stores, AWS Glue Schema Registry to validate streaming data schema.
There are other examples I am missing, so feel free to comment and I can update.
Lambda has a maximum execution time of fifteen minutes. It can be used to trigger a Glue job as an event-based activity. That is, when a file lands in S3, for example, we can have an event trigger which runs a Glue job. Glue is a managed service for all data processing.
If the data volume is very low, maybe you can do it in Lambda, but if for some reason the process runs beyond fifteen minutes, the data processing will fail.
The answer to this can involve some foundational design decisions. What is this job doing? What kind of data are you dealing with? Is there a decision to be made whether the task should be executed in a batch or event oriented paradigm?
Batch
This may be necessary or desirable because the task:
Is being done over large monolithic data (e.g., binary).
Relies on context of multiple records in a dataset such that they must be loaded into a single job.
Order matters.
I feel like just as often I see batch handling chosen by default because "this is the way we've always done it" but breaking from this approach could be worth consideration.
Glue is built for batch operations. With a current maximum execution time of 15 minutes and maximum memory of 10 GB, Lambda has become capable of processing fairly large datasets in a single execution as well. It can be difficult to pin down a direct cost comparison without specifics of the workload. When it comes to development, I feel that Lambda has the edge as far as tooling to build, test, and deploy.
Event
In the case where your data consists of a set of records, it might behoove you to parse and "stream" them into Lambda. Consider a flow like:
CSV lands in S3.
S3 event triggers Lambda.
Lambda reads and parses CSV into discrete events, submits to another Lambda or publishes to SNS for downstream processing. Concurrent instances of this Lambda can be employed to speed up ingest, where each instance is responsible for certain lines of the S3 object.
This pushes all logic and error handling, as well as the resources required, down to the level of the individual event/record. Often mechanisms such as dead-letter queues are employed for remediation. While the context of a given container persists across invocations (assuming the container has not been idle and torn down), Lambda should generally be considered stateless, such that the processing of an event/record is thought of as occurring within its own scope, outside that of others in the dataset.
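The flow above can be sketched as follows, assuming the CSV has a header row and that `sns_client` is a boto3 SNS client (or any object with a compatible `publish` method). The topic ARN is a placeholder, and reading the object from S3 is left out so the parsing logic stands alone.

```python
import csv
import io
import json


def csv_to_events(csv_text):
    """Yield one dict per CSV row, keyed by the header line."""
    yield from csv.DictReader(io.StringIO(csv_text))


def publish_rows(csv_text, sns_client, topic_arn):
    """Publish each row as its own JSON message and return how many were sent."""
    count = 0
    for row in csv_to_events(csv_text):
        sns_client.publish(TopicArn=topic_arn, Message=json.dumps(row))
        count += 1
    return count
```

Each downstream consumer then handles exactly one record, so a bad row fails (and can be sent to a dead-letter queue) without affecting the rest of the file.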

AWS data pipeline: dump data to 3 s3 nodes

I have a use case wherein I want to take data from DynamoDB and do some transformation on the data. After this I want to create 3 CSV files (there will be 3 transformations on the same data) and dump them to 3 different S3 locations.
My architecture would be something like the following:
Is it possible to do so? I can't seem to find any documentation regarding it. If it's not possible using pipeline, are there any other services which could help me with my use case?
These dumps will be scheduled daily. My other consideration was using AWS Lambda. But according to my understanding, it is triggered by events rather than by time-based scheduling; is that correct?
Yes, it is possible, but using EmrActivity rather than HiveActivity. If you look at the Data Pipeline documentation for HiveActivity, it clearly states its purpose, which does not suit your use case:
Runs a Hive query on an EMR cluster. HiveActivity makes it easier to set up an Amazon EMR activity and automatically creates Hive tables based on input data coming in from either Amazon S3 or Amazon RDS. All you need to specify is the HiveQL to run on the source data. AWS Data Pipeline automatically creates Hive tables with ${input1}, ${input2}, and so on, based on the input fields in the HiveActivity object.
Below is how your data pipeline should look. There is also a built-in template, Export DynamoDB table to S3, in the AWS Data Pipeline UI, which creates the basic structure for you; you can then extend/customize it to suit your requirements.
As for your next question about Lambda: of course Lambda can be configured for event-based or schedule-based triggering, but I wouldn't recommend using AWS Lambda for ETL operations, as they are time-bound and typical ETLs run longer than Lambda's time limits.
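To make the two trigger styles concrete, here is a sketch of how one handler could tell them apart. It assumes the schedule is an EventBridge (CloudWatch Events) rule with an expression such as `rate(1 day)`, which delivers a "Scheduled Event", and that event-based invocations are S3 notifications; the function and constant names are illustrative.

```python
# Example schedule expression for a daily EventBridge rule (an assumption here).
DAILY_SCHEDULE = "rate(1 day)"


def trigger_kind(event):
    """Classify a Lambda invocation as 'schedule', 's3', or 'unknown'."""
    if event.get("source") == "aws.events" and event.get("detail-type") == "Scheduled Event":
        return "schedule"
    records = event.get("Records", [])
    if records and records[0].get("eventSource") == "aws:s3":
        return "s3"
    return "unknown"
```

A daily dump would only need the schedule branch; the S3 branch is shown for contrast.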
AWS has specific offerings optimized for ETL, AWS Data Pipeline and AWS Glue, and I would always recommend choosing one of the two. If your ETL involves data sources not managed within AWS compute and storage services, or any specialty use case that can't be satisfied by the above two options, then AWS Batch would be my next consideration.
Thanks amith for your answer. I have been busy for quite some time now. I did some digging after you posted your answer. It turns out we can dump the data to different S3 locations using HiveActivity as well.
This is how the data pipeline would look in that case.
But I believe writing multiple Hive activities when your input source is a DynamoDB table is not a good idea, since Hive doesn't load any data into memory. It does all the computations on the actual table, which could degrade the performance of the table. Even the documentation suggests exporting the data in case you need to run multiple queries against the same data. Reference
Enter a Hive command that maps a table in the Hive application to the data in DynamoDB. This table acts as a reference to the data stored in Amazon DynamoDB; the data is not stored locally in Hive and any queries using this table run against the live data in DynamoDB, consuming the table’s read or write capacity every time a command is run. If you expect to run multiple Hive commands against the same dataset, consider exporting it first.
In my case I needed to perform different types of aggregations on the same data once a day. Since DynamoDB doesn't support aggregations, I turned to Data Pipeline using Hive. In the end we ended up using AWS Aurora, which is MySQL-based.

Dependency based ETL flow in AWS

We want to create a dynamic flow based on input data in S3. Based on the data available in S3, along with metadata, we want to create dynamic clusters and dynamic tasks/transformation jobs in the system. Some jobs are dependency based. Here I am sharing the expected flow; I want to know how efficiently we can do this using AWS services and environment.
I am exploring AWS SWF, Data Pipeline and Lambda, but I am not sure how to take care of dynamic tasks and dynamic dependencies. Any thoughts on this?
Data Flow is explained in the attached image (refer ETL Flow)
ETL Flow
AWS Step Functions with S3 triggers should get the job done in a cost-effective and scalable manner.
All steps are defined in the Amazon States Language.
https://states-language.net/spec.html
You can run jobs in parallel and wait for them to finish before you start your next job.
Below is one of the samples from AWS Step Functions:
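The following is not the AWS sample itself but a rough sketch of the same parallel-then-next pattern, written as a Python dict that serializes to Amazon States Language JSON. The state names and Lambda ARNs are hypothetical. The small helper checks that every Next target at the top level names an existing state.

```python
import json

# Hypothetical ARN prefix; replace with your own Lambda functions.
ARN_PREFIX = "arn:aws:lambda:us-east-1:123456789012:function:"


def task(function_name):
    """A terminal Task state that invokes one Lambda function."""
    return {"Type": "Task", "Resource": ARN_PREFIX + function_name, "End": True}


definition = {
    "Comment": "Run two jobs in parallel, then a final job",
    "StartAt": "TransformInParallel",
    "States": {
        "TransformInParallel": {
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "JobA", "States": {"JobA": task("job-a")}},
                {"StartAt": "JobB", "States": {"JobB": task("job-b")}},
            ],
            # Runs only after every branch has finished successfully.
            "Next": "FinalJob",
        },
        "FinalJob": task("final-job"),
    },
}


def dangling_targets(states):
    """Return Next targets that do not name a state at the same level."""
    return sorted(
        state["Next"]
        for state in states.values()
        if "Next" in state and state["Next"] not in states
    )


if __name__ == "__main__":
    print(json.dumps(definition, indent=2))
```

Because the definition is plain data, it can be generated at run time from the metadata found in S3, which is one way to get the dynamic tasks the question asks about.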
If you use the AWS Flow Framework that is part of the official SWF client, then modeling such a dynamic flow is pretty straightforward. You define its object model, write code that instantiates it based on your pipeline definition, and execute it using the framework. See the Deployment Sample for an example of such a dynamic workflow implementation.