We are implementing a system where we have a few Java components running on a Spark cluster on AWS EKS. We need to orchestrate these components in a pipeline fashion: a file arrives on S3 and needs to kick off a job on Component A, then, depending on the outcome, a job on Component B, then C, and so on. Also, depending on the file type, we may want to skip certain 'steps' (a job on a component). The way we kick off the job on individual components is by invoking the spark-submit command via kubectl.
Step Functions seems to be an ideal candidate for this, but we are not allowed to use it. Is there any better alternative for implementing this, whether AWS-native or something else?
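To make the mechanism concrete, here is a minimal sketch of the kind of driver described above, reacting to an S3 key, skipping steps by file type, and kicking off each job with spark-submit via kubectl. All pod, class, jar, and bucket names are hypothetical placeholders, not values from the question.

```python
# Hypothetical sketch of the pipeline driver described above: it reacts to a new
# S3 object key, decides which components to run based on file type, and kicks
# off each Spark job by shelling out to spark-submit via kubectl.
import subprocess

# Ordered pipeline: (component name, main class); adjust to your components.
PIPELINE = [
    ("component-a", "com.example.ComponentA"),
    ("component-b", "com.example.ComponentB"),
    ("component-c", "com.example.ComponentC"),
]

# File types for which a given component should be skipped (assumption for illustration).
SKIP_RULES = {"csv": {"component-b"}}

def run_step(component: str, main_class: str, s3_key: str) -> None:
    """Run one Spark job by exec-ing spark-submit inside a client pod via kubectl."""
    cmd = [
        "kubectl", "exec", "spark-client-pod", "--",   # hypothetical client pod
        "spark-submit", "--class", main_class,
        "s3://my-artifacts/jobs.jar",                  # hypothetical jar location
        s3_key,
    ]
    subprocess.run(cmd, check=True)  # raises CalledProcessError so the pipeline stops on failure

def handle_new_file(s3_key: str) -> None:
    file_type = s3_key.rsplit(".", 1)[-1].lower()
    for component, main_class in PIPELINE:
        if component in SKIP_RULES.get(file_type, set()):
            continue  # skip this step for this file type
        run_step(component, main_class, s3_key)

if __name__ == "__main__":
    handle_new_file("incoming/data-2024-01-01.parquet")
```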
How is the handling of multiple Lambda functions for a single stack/application currently done?
Considering a use case with more than one function, is it better to keep them all together in the same repository or have one repository for each?
Having a single repository for all the functions would be much easier for me, coming from classic backend development with a single codebase for all the business logic. But moving to the AWS ecosystem means I can no longer "deploy" my entire business logic with a single command, since I need to zip each function and update its archive with the AWS CLI, and that is hard to automate through standard merge requests or pipelines (every time it could be a different function, or several of them).
On the other hand, having, say, 5 or 6 repositories, one for each Lambda, alongside the ones for the frontend and the AWS stack, would be very impractical to manage.
Bundle your different Lambda functions together as a CloudFormation stack. CloudFormation allows you to create multiple AWS services and bridge them together as you wish. There are many tools you can use to achieve this: AWS CloudFormation, AWS SAM (Serverless Application Model), or third-party tools like the Serverless Framework and Terraform. The base concept is known as Infrastructure as Code (IaC).
As for repositories, you can have a single repository per stack. (AWS SAM provides sample code with a good directory structure; try sam init as an example.)
Consider the AWS Serverless Application Model for your development. It lets you script the build, package, and deploy steps with the SAM CLI based on a YAML template. SAM figures out the diff in your code by itself (because it runs CloudFormation under the hood). It allows you not only to combine several functions into one package, but also to add API Gateways, DynamoDB tables, and much more. Another nice feature is that your functions appear as an integrated application in the Lambda console, so you can monitor them all in one place.
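For completeness, the same "one stack, many functions" idea can also be driven programmatically. A minimal sketch using boto3 to create a CloudFormation stack from an already-packaged template; the stack and template names are placeholders, and sam deploy would normally do this for you.

```python
# Hedged sketch: deploy one CloudFormation stack that bundles several Lambda
# functions, assuming `packaged.yaml` was already produced by `sam package`
# (i.e. the code artifacts are uploaded to S3). Names are placeholders.
import boto3

cfn = boto3.client("cloudformation")

with open("packaged.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="my-lambda-app",   # hypothetical stack name
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_AUTO_EXPAND"],  # needed for SAM transforms / IAM roles
)

# Block until CloudFormation reports the stack as created.
cfn.get_waiter("stack_create_complete").wait(StackName="my-lambda-app")
print("Stack deployed; all functions in it are updated together.")
```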
I have multiple Redshift stored procedures (~15); some depend on previously run stored procedures, while others can run asynchronously.
I need to orchestrate this with proper failure handling, so that if any downstream stored procedure fails I can rerun that particular one.
I tried orchestrating this using AWS EventBridge, but I found many limitations there, such as triggering a specific stored procedure, and using EventBridge rules (with their limit of 5 targets) to run a combination of synchronous and asynchronous procedures.
Is there any way to run my stored procedures in AWS Glue using the custom visual canvas to construct the orchestration, putting one stored procedure in one block?
How do I make a Redshift connection in the flow diagram so that my stored procedures are executed on my Redshift cluster?
I don't know about Glue (I doubt it) but this is a great case for Step Functions. Lots of blogs about how to set up a serverless data orchestration process using Step Functions with proper error / exception handling. A place to start - https://aws.amazon.com/blogs/big-data/orchestrate-multiple-etl-jobs-using-aws-step-functions-and-aws-lambda/
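One common shape for this is a Step Functions state machine whose Task states invoke a small Lambda that calls each stored procedure through the Redshift Data API; Retry/Catch on the Task states gives you per-procedure failure handling, and a Parallel state covers the procedures that can run asynchronously. A minimal sketch of such a Lambda, with cluster, database, user, and procedure names as placeholders:

```python
# Hedged sketch of a Lambda handler that a Step Functions Task state could invoke
# to run one Redshift stored procedure via the Redshift Data API. Identifiers are
# placeholders; retries and failure routing would live in the state machine's
# Retry/Catch configuration.
import time
import boto3

rsd = boto3.client("redshift-data")

def handler(event, context):
    proc_name = event["procedure"]               # e.g. "sp_load_staging" (hypothetical)
    resp = rsd.execute_statement(
        ClusterIdentifier="my-redshift-cluster", # placeholder
        Database="analytics",                    # placeholder
        DbUser="etl_user",                       # placeholder
        Sql=f"CALL {proc_name}();",
    )
    statement_id = resp["Id"]

    # Poll until the statement finishes; for long-running procedures you would
    # usually move this into a separate "check status" state instead.
    while True:
        status = rsd.describe_statement(Id=statement_id)["Status"]
        if status in ("FINISHED", "FAILED", "ABORTED"):
            break
        time.sleep(5)

    if status != "FINISHED":
        # Raising makes the Task state fail, so the state machine's Catch can
        # route to a retry or notification branch for this specific procedure.
        raise RuntimeError(f"{proc_name} ended with status {status}")

    return {"procedure": proc_name, "status": status}
```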
Essentially, we are running a batch ML model using a Spark EMR cluster on AWS. There will be several iterations of the model, so we want to have some sort of model metadata endpoint on top of the Spark cluster. That way, other services that rely on the output of the EMR cluster can ping the Spark cluster's REST API endpoint and be informed of the latest ML system version it's using. I'm not sure if this is feasible or not.
Objective:
We want other services to be able to ping the EMR cluster which runs the latest ML model and obtain the metadata for the model, which includes ML system version.
If I have understood correctly, you want to add metadata (e.g., version, last-updated, action performed, etc.) somewhere once the Spark job is finished, right?
There are several possibilities, and all of them integrate into your data pipeline in the same way as any other task, for example by triggering the Spark job with a workflow management tool (Airflow/Luigi), a Lambda function, or even cron.
Updating metadata after the Spark job runs
For the post-Spark-job step, you can add something to your pipeline that writes this metadata to some DB or event store. I am sharing two options; you can decide which one is more feasible.
Utilize a CloudWatch event and associate a Lambda function with it; Amazon EMR automatically sends events to a CloudWatch event stream (a sketch of such a Lambda follows this list).
Add a step in your workflow management tool (Airflow/Luigi) that triggers a DB/event-store update step/operator on completion of the EMR step (for example, using EmrStepSensor in Airflow to gate the subsequent write-to-DB task).
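A minimal sketch of the first option, assuming a hypothetical DynamoDB table named ml_model_metadata and a CloudWatch Events/EventBridge rule that forwards EMR state-change events to this Lambda; the event fields and metadata attributes shown are assumptions to adapt to your setup.

```python
# Hedged sketch: Lambda subscribed to EMR state-change events (via a CloudWatch
# Events / EventBridge rule) that records model metadata in DynamoDB once a run
# completes. Table name, key schema, and metadata fields are assumptions.
import datetime
import boto3

table = boto3.resource("dynamodb").Table("ml_model_metadata")  # hypothetical table

def handler(event, context):
    detail = event.get("detail", {})
    # Only record terminal, successful runs; adjust the condition to the actual
    # detail-type / state values your rule forwards.
    if detail.get("state") != "COMPLETED":
        return {"skipped": True}

    # Overwrite the single "latest" item for this model (model_name is assumed
    # to be the partition key), so readers always get the newest metadata.
    table.put_item(
        Item={
            "model_name": "batch-ml-model",                 # assumed partition key
            "updated_at": datetime.datetime.utcnow().isoformat(),
            "cluster_id": detail.get("clusterId", "unknown"),
            "step_id": detail.get("stepId", "unknown"),
            "model_version": detail.get("name", "unknown"), # or pull from tags / an S3 manifest
        }
    )
    return {"skipped": False}
```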
A REST API on top of the DB/event store
Once you have a regular updating mechanism in place for every EMR Spark step run, you can build a normal REST API on EC2 or a serverless API using AWS Lambda. You will essentially be returning this metadata from the REST service.
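Continuing the sketch above, the serverless variant of that endpoint can be a Lambda behind API Gateway that simply reads the latest record back; same hypothetical table and key as before.

```python
# Hedged sketch of the read side: an API Gateway (proxy integration) Lambda that
# returns the latest model metadata written by the previous function.
import json
import boto3

table = boto3.resource("dynamodb").Table("ml_model_metadata")  # same hypothetical table

def handler(event, context):
    resp = table.get_item(Key={"model_name": "batch-ml-model"})  # assumed key
    item = resp.get("Item")
    if item is None:
        return {"statusCode": 404, "body": json.dumps({"error": "no metadata yet"})}
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(item, default=str),
    }
```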
I am working on a problem where we intend to perform multiple transformations on data using EMR (SparkSQL).
After going through the documentation of AWS Data Pipeline and AWS Step Functions, I am slightly confused as to what use case each tries to solve. I looked around but did not find an authoritative comparison between the two. There are multiple resources that show how I can use either of them to schedule and trigger Spark jobs on an EMR cluster.
Which one should I use for scheduling and orchestrating my processing EMR jobs?
More generally, in what situation would one be a better choice over the other as far as ETL/data processing is concerned?
Yes, there are many ways to achieve the same thing, and the difference is in the details and in your use case. I am even going to offer yet another alternative :)
If you are doing a sequence of transformations and all of them run on an EMR cluster, maybe all you need is either to create the cluster with steps, or to submit steps to it via the API. Steps will execute in order on your cluster.
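For the "just submit steps to the cluster" option, a minimal sketch using the EMR API; the cluster id, jar path, and class names are placeholders.

```python
# Hedged sketch: submit several SparkSQL transformation steps to an existing EMR
# cluster via the API; EMR executes them in order. Cluster id, jar, and classes
# are placeholders.
import boto3

emr = boto3.client("emr")

def spark_step(name: str, main_class: str) -> dict:
    return {
        "Name": name,
        "ActionOnFailure": "CANCEL_AND_WAIT",  # stop the sequence if a step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--class", main_class, "s3://my-bucket/etl.jar"],
        },
    }

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
    Steps=[
        spark_step("transform-1", "com.example.Transform1"),
        spark_step("transform-2", "com.example.Transform2"),
    ],
)
```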
If you have different sources of data, or you want to handle more complex scenarios, then both AWS Data Pipeline and AWS Step Functions would work. AWS Step Functions is a generic way of implementing workflows, while Data Pipeline is a specialized workflow for working with data.
That means Data Pipeline will be better integrated when it comes to dealing with data sources and outputs, and to working directly with tools like S3, EMR, DynamoDB, Redshift, or RDS. So for a pure data pipeline problem, chances are AWS Data Pipeline is the better candidate.
Having said so, AWS Data Pipeline is not very flexible. If the data source you need is not supported, or if you want to execute some activity which is not integrated, then you need to hack your way around with shell scripts.
On the other hand, AWS Step Functions is not specialized and has good integration with some AWS services and with AWS Lambda, meaning you can easily integrate with anything via serverless APIs.
So it really depends on what you need to achieve and the type of workload you have.
We want to create a dynamic flow based on input data in S3. Based on the data available in S3, along with its metadata, we want to create dynamic clusters and dynamic tasks/transformation jobs in the system, and some jobs depend on others. I am sharing the expected flow here and want to know how efficiently we can do this using AWS services and the AWS environment.
I am exploring AWS SWF, Data Pipeline, and Lambda, but I am not sure how to take care of the dynamic tasks and dynamic dependencies. Any thoughts around this?
The data flow is explained in the attached image (see "ETL Flow").
AWS Step Functions with S3 triggers should get the job done in a cost-effective and scalable manner.
All steps are defined with the Amazon States Language.
https://states-language.net/spec.html
You can run jobs in parallel and wait for them to finish before you start your next job.
One of the sample projects from AWS Step Functions illustrates this pattern.
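As a rough illustration of the states-language approach (not a reproduction of the AWS sample), here is a minimal sketch that defines a workflow with a Parallel state as a Python dictionary and registers it via the Step Functions API; the Lambda ARNs, role ARN, and state machine name are placeholders.

```python
# Hedged sketch: a small workflow in the Amazon States Language (built as a
# Python dict), with two jobs running in parallel before a final job, registered
# via the Step Functions API. All ARNs are placeholders.
import json
import boto3

definition = {
    "StartAt": "ParallelTransforms",
    "States": {
        "ParallelTransforms": {
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "JobA", "States": {"JobA": {"Type": "Task",
                    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:job-a", "End": True}}},
                {"StartAt": "JobB", "States": {"JobB": {"Type": "Task",
                    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:job-b", "End": True}}},
            ],
            "Next": "FinalJob",  # runs only after both branches finish
        },
        "FinalJob": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:job-final",
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="dynamic-etl-flow",                                    # placeholder name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/states-execution",  # placeholder role
)
```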
If you use the AWS Flow Framework, which is part of the official SWF client, then modeling such a dynamic flow is pretty straightforward. You define its object model, write code that instantiates it based on your pipeline definition, and execute it using the framework. See the Deployment Sample for an example of such a dynamic workflow implementation.