aws step function/state machine, can you launch and forget? - amazon-web-services

considerations:
I have a step that will probably take longer than a lambda can wait.
the step involves a s3 copy (terabytes of data)
I don't need to wait for the successful result
Do I need to wrap the state machine in something like launching it from a docker container? (because docker can wait longer than a lambda for the return value which I don't need)
or can I launch this state machine from a lambda and not wait for the result.

Step Functions Standard Workflows executions can run for up to 1 year. The workflow execution will run asynchronously during that time and you can monitor the progress from the console, CLI, or SDK. This is completely managed and you don't need to run any processes, such as a Docker container, on your side to keep it going.
That said, Step Functions will orchestrate the actions over the course of that period but you will still need to implement the tasks that it will orchestrate. For a bulk S3 copy, it seems you may be running a docker container to do the actual copy in which case you could use the Service Integration with ECS (including Fargate) to make that easier.
Alternatively, you could the new Distributed Map capability to avoid needing to write any code in Lambda or Container if the source and destination bucket are in the same region. This is an example of a bulk delete that could be easily modified to do a copy.

Related

synchronizing, scheduling and executing Node.js scripts on AWS

Usually, we store our code on github, and then deploy it on AWS lambda.
We are now challenged with a specific Node.js script.
it takes roughly an hour to run, we can't deploy it on a lambda because of that.
it needs to run just once a month.
once in a while we'll update the script in our github repository, and we want the script in AWS to stay in sync if we make changes (e.g. using a pipeline)
this scripts copies files from S3 and processes them locally. It does some heavy lifting with data.
What would be the recommended way to set this up on AWS ?
The serverless approach fits nicely since you will run the work only once per month. Data transfer between Lambda and S3 (in the same region) is free. If Lambda is comfortable for your use case except for execution time constraints and you can "track the progress" of the processing, you can create a state machine that will invoke your lambda as a step function in the loop while you will not process all S3 data chunks. Each lambda execution can take up to 15 minutes and state machine execution time is way beyond 1 hour. Regarding ops, you can have a trigger on your GitHub that will publish a new version of the lambda. You can use AWS CloudFormation, CDK or any other suitable tool for that.

Best AWS architecture solution for migrating data to cloud

Say I have 4 or 5 data sources that I access through API calls. The data aggregation and mining is all scripted in a python file. Lets say the output is all structured data. I know there are plenty of considerations, but from a high level, what would some possible solutions look like if I ultimately wanted to run analysis in BI software?
Can I host the python script in Lambda and set a daily trigger to run the python file. And then have the output stored in RDS/Aurora? Or since the applications I'm running API calls to aren't in AWS, would I need the data to be in an AWS instance before running a Lambda function?
Or host the python script in an EC2 instance, use lambda to trigger a daily refresh that just stores the data in EC2-ESB or Redshift?
Just starting to learn AWS cloud architecture so my knowledge is fairly limited. Just seems like there can be multiple solutions to any problem so not sure if the 2 ideas above are viable.
You've mentioned two approaches which are working. Ultimately it very depends on your use case, budget etc.. and you are right, usually in AWS you will have different solutions that can solve the same problem. For example, another possible solution could be to Dockerize your Python script and run it on containers services (ECS/EKS). But considering you just started with AWS I will focus on the approaches you mentioned as it's probably the most 2 common ones.
In short, based on your description, I would not suggest to go with EC2 because it adds complexity to your use case and moreover extra costs. If you can imagine the final setup, you will need to configure and manage the instance itself, its class type, AMI, your script deployment, access to internet, subnets, etc. Also a minor thing to clarify: you would probably set a cron expression on it to trigger your script (not a lambda reaching the EC2 !). As you can see, quite a big setup for poor benefits (except maybe gaining some experience with AWS ;)) and the instance would be idle most of the time which is far from optimum.
If you just have to run a daily Python script and need to store the output somewhere I would suggest to use lambda for the processing, you can simply have a scheduled event (prefered way is now Amazon EventBridge instead) that triggers your lambda function once a day. Then depending on your output and how you need to process it, you can use RDS obviously from lambda using the Python SDK but you can also use S3 as blob storage if you don't need to run specific queries - for example if you can store your output in json format.
Note that one limitation to lambda is that it can only run for 15 minutes straight per execution. The good thing is that by default lambda has internet access so you don't need to care about any gateway setup and can reach your external endpoints.
Also from a cost perspective running one lambda/day combined with S3 should be free or almost free. The pricing in lambda is very cheap. Running 24/7 an EC2 instance or RDS (which is also an instance) will cost you some money.
Lambda with storage in S3 is the way to go. EC2 / EBS costs add up over time and EC2 will limit the parallelism you can achieve.
Look into Step Functions as a way to organize and orchestrate your Lambdas. I have python code that copies 500K+ files to S3 and takes a week to run. If I copy the files in parallel (500-ish at a time) this process takes about 10 hours. The parallelism is limited by the sourcing system as I can overload it by going wider. The main Lambda launches the file copy Lambdas at a controlled rate but also terminates after a few minutes of run time but returns the last file updated to the controlling Step Function. The Step Function restarts the main Lambda where the last one left off.
Since you have multiple sources you can have multiple top level Lambdas running in parallel all from the same Step Function and each launching a controlled number of worker Lambdas. You won't overwhelm S3 but you will want to make sure you don't overload your sources.
The best part of this is that it costs pennies (at the scale I'm using it).
Once the data is in S3 I'm copying it up to Redshift and transforming it. These processes are also part of the Step Function through additional Lambda Functions.

AWS lambda function for copying data into Redshift

I am new to AWS world and I am trying to implement a process where data written into S3 by AWS EMR can be loaded into AWS Redshift. I am using terraform to create S3 and Redshift and other supported functionality. For loading data I am using lambda function which gets triggered when the redshift cluster is up . The lambda function has the code to copy the data from S3 to redshift. Currently the process seams to work fine .The amount of data is currently low
My question is
This approach seems to work right now but I don't know how it will work once the volume of data increases and what if lambda functions times out
can someone please suggest me any alternate way of handling this scenario even if it can be handled without lambda .One alternate I came across searching for this topic is AWS data pipeline.
Thank you
A server-less approach I've recommended clients move to in this case is Redshift Data API (and Step Functions if needed). With the Redshift Data API you can launch a SQL command (COPY) and close your Lambda function. The COPY command will run to completion and if this is all you need to do then your done.
If you need to take additional actions after the COPY then you need a polling Lambda that checks to see when the COPY completes. This is enabled by Redshift Data API. Once COPY completes you can start another Lambda to run the additional actions. All these Lambdas and their interactions are orchestrated by a Step Function that:
launches the first Lambda (initiates the COPY)
has a wait loop that calls the "status checker" Lambda every 30 sec (or whatever interval you want) and keeps looping until the checker says that the COPY completed successfully
Once the status checker lambda says the COPY is complete the step function launches the additional actions Lambda
The Step function is an action sequencer and the Lambdas are the actions. There are a number of frameworks that can set up the Lambdas and Step Function as one unit.
With bigger datasets, as you already know, Lambda may time out. But 15 minutes is still a lot of time, so you can implement alternative solution meanwhile.
I wouldn't recommend data pipeline as it might be an overhead (It will start an EC2 instance to run your commands). Your problem is simply time out, so you may use either ECS Fargate, or Glue Python Shell Job. Either of them can be triggered by Cloudwatch Event triggered on an S3 event.
a. Using ECS Fargate, you'll have to take care of docker image and setup ECS infrastructure i.e. Task Definition, Cluster (simple for Fargate).
b. Using Glue Python Shell job you'll simply have to deploy your python script in S3 (along with the required packages as wheel files), and link those files in the job configuration.
Both of these options are serverless and you may chose one based on ease of deployment and your comfort level with docker.
ECS doesn't have any timeout limits, while timeout limit for Glue is 2 days.
Note: To trigger AWS Glue job from Cloudwatch Event, you'll have to use a Lambda function, as Cloudwatch Event doesn't support Glue start job yet.
Reference: https://docs.aws.amazon.com/eventbridge/latest/APIReference/API_PutTargets.html

Is AWS Lambda the proper way of running a batch job?

I have a batch job that I need to run on AWS. I'm wondering what's the best service to use. The job needs to run once a day, so I think that naturally AWS Lambda with a CloudWatch Rule triggering it would do it. However, I'm starting to think that AWS Lambda is thought to be used as a service to handle requests. This AWS official library to integrate Spring-Boot is very oriented to handle HTTP requests, and when creating a lambda via AWS Console, only test cases that send an input to the lambda can be written.
Then, is this a use case for AWS Lambda? Also, these functions can run up to 15 minutes. What should I use if my job needs to run longer?
The purpose of Lambda, as compared to AWS EC2, is to simplify building smaller, on-demand applications that are responsive to events and new information.
If your batch is running within a limit of 15 minutes then you can go with a lambda function.
But if you want batch processing to be done, you should check AWS batch.
Here is nice article which demonstrates the usage of AWS batch.
If you are already using some batch framework like spring-batch, you can also take a look at ECS scheduled task with Fargate.
With ECS Fargate you can launch and stop container services that you need to run only at certain times.
Here are some related articles on Fargate event and scheduled task and Scheduled Tasks.
If you're confident that your function will only run at maximum of 15mins, AWS Lambda could be the solution. Here are the AWS Lambda limits that could help you decide on that.
Also note that lambda has cold start, it's when it will run slower at first but will eventually pick up the pace. Here are some good reads about it that could help you decide on the lambda direction, but feel free to check on any articles that could better explain at your disposal.
This one shows a brief lists that you would like to consider and the factors affecting it.
This one might have a deeper explanation of the cold start with regards to how it works internally.
What should I use if my job needs to run longer?
Depending on your infrastructure, you could maybe explore Scheduled Tasks

Running Batch Jobs on Amazon ECS

I'm very new to using AWS, and even more so for ECS. Currently, I have developed an application that can take an S3 link, download the data from that link, processes the data, and then output some information about that data. I've already packaged this application up in a docker container and now resides on the amazon container registry. What I want to do now is start up a cluster, send an S3 link to each EC2 instance running Docker, have all the container instances crunch the numbers, and return all the results back to a single node. I don't quite understand how I am supposed to change my application at this point. Do I need to make my application running in the docker container a service? Or should I just send commands to containers via ssh? Then assuming I get that far, how do I then communicate with the cluster to farm out the work for potentially hundreds of S3 links? Ideally, since my application is very compute intensive, I'd like to only run one container per EC2 instance.
Thanks!
Your story is hard to answer since it's a lot of questions without a lot of research done.
My initial thought is to make it completely stateless.
You're on the right track by making them start up and process via S3. You should expand this to use something like an SQS queue. Those SQS messages would contain an S3 link. Your application will start up, grab a message from SQS, process the link it got, and delete the message.
The next thing is to not output to a console of any kind. Output somewhere else. Like a different SQS queue, or somewhere.
This removes the requirement for the boxes to talk to each other. This will speed things up, make it infinitely scalable and remove the strange hackery around making them communicate.
Also why one container per instance? 2 threads at 50% is the same as 1 at 100% usually. Remove this requirement and you can use ECS + Lambda + Cloudwatch to scale based on the number of messages. >10000, scale up, that kind of thing. <100 scale down. This means you can throw millions of messages into SQS and just let ECS scale up to process them and output somewhere else to consume.
I agree with Marc Young, you need to make this stateless and decouple the communication layer from the app.
For an application like this I would put the S3 links into a queue (rabbitMQ is a good one, I personally don't care for SQS but it's also an option). Then have your worker nodes in ECS pull messages off the queue and process.
It sounds like you have another app that does processing. Depending on the output you could then put the result into another processing queue and use the same model or just stuff it directly in a database of some sort (or as files in S3).
In addition to what Marc said about autoscaling, consider using cloudwatch + spot instances to manage the cost of your ECS container instances. Particularly for heavy compute tasks, you can get big discounts that way.