Synchronizing, scheduling and executing Node.js scripts on AWS - amazon-web-services

Usually, we store our code on GitHub and then deploy it on AWS Lambda.
We are now challenged with a specific Node.js script:
It takes roughly an hour to run, so we can't deploy it on Lambda.
It needs to run just once a month.
Once in a while we'll update the script in our GitHub repository, and we want the script in AWS to stay in sync when we make changes (e.g. using a pipeline).
This script copies files from S3 and processes them locally. It does some heavy lifting with data.
What would be the recommended way to set this up on AWS?

The serverless approach fits nicely since you run the work only once per month. Data transfer between Lambda and S3 (in the same region) is free. If Lambda suits your use case apart from the execution time limit, and you can track the progress of the processing, you can create a Step Functions state machine that invokes your Lambda in a loop until all S3 data chunks have been processed. Each Lambda execution can take up to 15 minutes, and a state machine execution can run far longer than 1 hour. Regarding ops, you can have a trigger on your GitHub repository that publishes a new version of the Lambda. You can use AWS CloudFormation, CDK or any other suitable tool for that.
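As a rough sketch of that loop (written in Python for brevity, although the script in the question is Node.js), the state machine can be a Task state followed by a Choice state that jumps back until the Lambda reports it is finished. The state names, the "cursor" and "done" fields, and the chunk list are assumptions made for illustration; they are not part of any AWS API.

```python
import json


def build_chunk_loop_definition(lambda_arn: str) -> dict:
    """Return an Amazon States Language definition that re-invokes one Lambda."""
    return {
        "Comment": "Invoke a Lambda repeatedly until it reports that all chunks are done",
        "StartAt": "ProcessChunk",
        "States": {
            "ProcessChunk": {
                "Type": "Task",
                "Resource": lambda_arn,
                # Assumption: the Lambda returns {"cursor": int, "done": bool},
                # which becomes the input of the next state.
                "Next": "AllChunksDone",
            },
            "AllChunksDone": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": "$.done", "BooleanEquals": True, "Next": "Finished"}
                ],
                "Default": "ProcessChunk",
            },
            "Finished": {"Type": "Succeed"},
        },
    }


def process_next_chunk(event: dict, chunk_keys: list) -> dict:
    """One Lambda invocation: handle the chunk at event['cursor'], then advance."""
    cursor = event.get("cursor", 0)
    if cursor < len(chunk_keys):
        # Placeholder: download chunk_keys[cursor] from S3 and process it here.
        cursor += 1
    return {"cursor": cursor, "done": cursor >= len(chunk_keys)}


if __name__ == "__main__":
    # Hypothetical ARN, only used to print the definition.
    arn = "arn:aws:lambda:us-east-1:123456789012:function:process-chunk"
    print(json.dumps(build_chunk_loop_definition(arn), indent=2))
```

Each invocation must finish one chunk well inside the 15 minute limit, so the chunk size is the main thing to tune.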

Related

AWS lambda function for copying data into Redshift

I am new to the AWS world and I am trying to implement a process where data written into S3 by AWS EMR can be loaded into AWS Redshift. I am using Terraform to create S3, Redshift and other supporting functionality. For loading data I am using a Lambda function which gets triggered when the Redshift cluster is up. The Lambda function has the code to copy the data from S3 to Redshift. Currently the process seems to work fine. The amount of data is currently low.
My question is:
This approach seems to work right now, but I don't know how it will work once the volume of data increases, and what happens if the Lambda function times out.
Can someone please suggest an alternate way of handling this scenario, even if it can be handled without Lambda? One alternative I came across while searching for this topic is AWS Data Pipeline.
Thank you
A serverless approach I've recommended clients move to in this case is the Redshift Data API (and Step Functions if needed). With the Redshift Data API you can launch a SQL command (COPY) and close your Lambda function. The COPY command will run to completion, and if this is all you need to do then you're done.
If you need to take additional actions after the COPY, then you need a polling Lambda that checks whether the COPY has completed. This is enabled by the Redshift Data API. Once the COPY completes you can start another Lambda to run the additional actions. All these Lambdas and their interactions are orchestrated by a Step Function that:
launches the first Lambda (initiates the COPY)
has a wait loop that calls the "status checker" Lambda every 30 sec (or whatever interval you want) and keeps looping until the checker says that the COPY completed successfully
launches the additional actions Lambda once the status checker Lambda says the COPY is complete
The Step Function is an action sequencer and the Lambdas are the actions. There are a number of frameworks that can set up the Lambdas and the Step Function as one unit.
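A minimal sketch of the first two Lambdas, assuming the client passed in is a boto3 "redshift-data" client (created with boto3.client("redshift-data") in a real deployment). The function names, the "statement_id" and "done" keys, and the choice of DbUser authentication are illustrative assumptions; execute_statement and describe_statement are the Data API calls being relied on.

```python
def start_copy(client, cluster_id: str, database: str, db_user: str, copy_sql: str) -> dict:
    """First Lambda: submit the COPY and return immediately with its statement id."""
    response = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,  # Assumption: temporary credentials; a SecretArn also works
        Sql=copy_sql,
    )
    return {"statement_id": response["Id"]}


def check_copy(client, event: dict) -> dict:
    """Status checker Lambda: report whether the submitted COPY has finished."""
    status = client.describe_statement(Id=event["statement_id"])["Status"]
    if status in ("FAILED", "ABORTED"):
        # Raising lets the Step Function route to an error handler instead of looping.
        raise RuntimeError(f"COPY ended with status {status}")
    # Any other status (SUBMITTED, PICKED, STARTED) means: wait and poll again.
    return {"statement_id": event["statement_id"], "done": status == "FINISHED"}
```

The Step Function would put a Wait state between calls to check_copy and branch on the "done" flag with a Choice state.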
With bigger datasets, as you already know, Lambda may time out. But 15 minutes is still a lot of time, so you can implement an alternative solution in the meantime.
I wouldn't recommend Data Pipeline as it might be an overhead (it will start an EC2 instance to run your commands). Your problem is simply the timeout, so you may use either ECS Fargate or a Glue Python Shell job. Either of them can be triggered by a CloudWatch Event raised by an S3 event.
a. Using ECS Fargate, you'll have to take care of the Docker image and set up the ECS infrastructure, i.e. Task Definition and Cluster (simple for Fargate).
b. Using a Glue Python Shell job, you'll simply have to deploy your Python script in S3 (along with the required packages as wheel files) and link those files in the job configuration.
Both of these options are serverless and you may choose one based on ease of deployment and your comfort level with Docker.
ECS doesn't have any timeout limit, while the timeout limit for Glue is 2 days.
Note: To trigger an AWS Glue job from a CloudWatch Event, you'll have to use a Lambda function, as CloudWatch Events doesn't support Glue start job yet.
Reference: https://docs.aws.amazon.com/eventbridge/latest/APIReference/API_PutTargets.html
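The Lambda mentioned in the note can be very small. The sketch below assumes the client is a boto3 "glue" client and that the incoming event has the shape of an S3 "Object Created" event delivered through EventBridge (bucket and key under "detail"); other event sources use a different layout. The argument names "--source_bucket" and "--source_key" are hypothetical and would need to match whatever the Glue script reads.

```python
def start_glue_job(glue_client, job_name: str, event: dict) -> str:
    """Start a Glue job run for the S3 object described in the event."""
    detail = event["detail"]  # Assumption: EventBridge-style S3 event
    arguments = {
        # Glue passes job arguments as "--name": "value" pairs.
        "--source_bucket": detail["bucket"]["name"],
        "--source_key": detail["object"]["key"],
    }
    response = glue_client.start_job_run(JobName=job_name, Arguments=arguments)
    return response["JobRunId"]
```

The Lambda returns as soon as the run is started, so its own timeout is not a concern.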

AWS Lambda vs AWS Batch

I am currently working on a project where I need to merge two significantly large CSV files into one (both are a few hundred MBs). I am fairly new to AWS. I am aware of the memory allocation and execution time limitations of Lambda. Other than that, are there any advantages of using Batch jobs over Lambda for this project? Is there any other AWS component which is more suitable for this task? Either the Lambda or the Batch job will be triggered inside a Step Function using an SNS notification.
A Lambda function has some limitations:
Execution time: 15 mins
RAM: 3 GB
Disk space: /tmp is only about 500 MB, so it is difficult to store any file larger than that on Lambda
The good points are that it is cheap and boots up fast.
I suggest you use ECS (both Fargate and container instances are good).
Try using a Python function in Lambda that writes to S3 with boto3.
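Whichever service runs it, the merge itself does not need to hold either file in memory. The sketch below assumes "merge" means appending the rows of the second file to the first and that both files share the same header; a join on a key column would need different logic. It works on any text file objects, so local files or streamed S3 objects wrapped as text could both be passed in.

```python
import csv
import io


def merge_csv_streams(first, second, out) -> int:
    """Append the rows of two CSV streams with identical headers; return the row count."""
    writer = csv.writer(out)
    header = None
    rows_written = 0
    for source in (first, second):
        reader = csv.reader(source)
        source_header = next(reader, None)
        if source_header is None:
            continue  # Empty input, nothing to copy
        if header is None:
            header = source_header
            writer.writerow(header)
        elif source_header != header:
            raise ValueError("CSV headers differ, cannot append rows")
        for row in reader:  # Rows are streamed one at a time
            writer.writerow(row)
            rows_written += 1
    return rows_written


if __name__ == "__main__":
    a = io.StringIO("id,name\n1,alice\n")
    b = io.StringIO("id,name\n2,bob\n")
    merged = io.StringIO()
    print(merge_csv_streams(a, b, merged), "rows merged")
```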

Idea and guidelines on end to end AWS solution

I want to build an end to end automated system which consists of the following steps:
Getting data from source to landing bucket AWS S3 using AWS Lambda
Running some transformation job using AWS Lambda and storing in processed bucket of AWS S3
Running Redshift copy command using AWS Lambda to push the transformed/processed data from AWS S3 to AWS Redshift
Of the above points, I've completed pulling the data, transforming the data, and running a manual COPY command against Redshift from a SQL query tool.
Doubts:
I've heard AWS CloudWatch can be used to schedule/automate things, but I have never worked with it. So, if I want to achieve the steps above in a streamlined fashion, how should I go about it?
Should I use Lambda to trigger COPY and INSERT statements? Or are there better AWS services to do the same?
Any other suggestions on other AWS services and the like are most welcome.
Constraint: I want as many tasks as possible to be serverless (except for the semantic layer, Redshift).
CloudWatch:
Your options here are either to use CloudWatch Alarms or Events.
With alarms, you can respond to any metric of your system (e.g. CPU utilization, disk IOPS, count of Lambda invocations) when it crosses some threshold, and when the alarm is triggered, invoke a Lambda function (or send an SNS notification etc.) to perform a task.
With events, you can use either a cron expression or some AWS service event (e.g. an EC2 instance state change or an SNS notification) to trigger another service (e.g. Lambda). So you could, for example, run some kind of clean-up operation via Lambda on a regular schedule, or create a snapshot of an EBS volume when its instance is shut down.
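For the schedule case, CloudWatch Events cron expressions have six fields (minutes, hours, day of month, month, day of week, year), and one of the two day fields must be "?". A small sketch that builds such an expression and the parameters for a put_rule call on a boto3 "events" client follows; the helper names and the rule name are made up for illustration.

```python
def schedule_expression(minute: int, hour_utc: int, day_of_month: str = "*") -> str:
    """Build a CloudWatch Events cron expression (all times are UTC)."""
    if not 0 <= minute <= 59:
        raise ValueError("minute must be between 0 and 59")
    if not 0 <= hour_utc <= 23:
        raise ValueError("hour_utc must be between 0 and 23")
    # Fields: minutes hours day-of-month month day-of-week year.
    # Day-of-week is "?" because day-of-month is the field being specified.
    return f"cron({minute} {hour_utc} {day_of_month} * ? *)"


def schedule_rule_params(name: str, expression: str) -> dict:
    """Keyword arguments for events_client.put_rule(**params)."""
    return {"Name": name, "ScheduleExpression": expression, "State": "ENABLED"}


if __name__ == "__main__":
    # Hypothetical rule: run the pipeline every day at 06:00 UTC.
    print(schedule_rule_params("daily-pipeline", schedule_expression(0, 6)))
```

The rule still needs a target (for example the Lambda that starts the pipeline), which is added with a separate put_targets call.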
Lambda itself is a very powerful tool, and should allow you to program a decent copy/insert function in a language you are familiar with. AWS has several GitHub repos with lots of examples too, see for example the serverless examples and many samples. There may be other services which could work for you in your specific case, but part of Lambda's power is its flexibility.

Is AWS Lambda the proper way of running a batch job?

I have a batch job that I need to run on AWS. I'm wondering what's the best service to use. The job needs to run once a day, so I think that naturally AWS Lambda with a CloudWatch Rule triggering it would do it. However, I'm starting to think that AWS Lambda is meant to be used as a service that handles requests. The official AWS library for integrating Spring Boot is very oriented towards handling HTTP requests, and when creating a Lambda via the AWS Console, only test cases that send an input to the Lambda can be written.
Then, is this a use case for AWS Lambda? Also, these functions can run up to 15 minutes. What should I use if my job needs to run longer?
The purpose of Lambda, as compared to AWS EC2, is to simplify building smaller, on-demand applications that are responsive to events and new information.
If your batch job finishes within the 15 minute limit, then you can go with a Lambda function.
But if you want proper batch processing, you should check out AWS Batch.
Here is a nice article which demonstrates the usage of AWS Batch.
If you are already using a batch framework like Spring Batch, you can also take a look at an ECS scheduled task with Fargate.
With ECS Fargate you can launch and stop container services that you need to run only at certain times.
Here are some related articles on Fargate event and scheduled task and Scheduled Tasks.
If you're confident that your function will only run for a maximum of 15 minutes, AWS Lambda could be the solution. Here are the AWS Lambda limits that could help you decide on that.
Also note that Lambda has a cold start: the first invocation runs slower, and later invocations pick up the pace. Here are some good reads about it that could help you decide on the Lambda direction, but feel free to check any other articles that explain it better.
This one shows a brief list of things you may want to consider and the factors affecting them.
This one might have a deeper explanation of the cold start with regard to how it works internally.
What should I use if my job needs to run longer?
Depending on your infrastructure, you could maybe explore Scheduled Tasks.

How to execute an SQL script stored in S3 other than datapipeline service

We have been working on leveraging AWS services for scheduling a few SQL scripts on a daily basis. Data Pipeline is a good option, but we have found issues with the underlying support system, which is Task Runner. Are there any other options that we can look at? Lambda has a limitation of 300 seconds, and the query we are using will exceed 5 minutes. Any suggestions or workarounds are much appreciated!
You're on the right path. Use Lambda just to kick off the job, not to do the actual workload.
For example, pack your app in a Docker container, push it to ECR and use Lambdas to periodically add an AWS Batch job.
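A sketch of such a kick-off Lambda, assuming the client is a boto3 "batch" client and that the job queue and job definition (pointing at the image in ECR) already exist. The function name and the optional environment override are illustrative; submit_job with jobName, jobQueue and jobDefinition is the Batch call being relied on.

```python
def submit_batch_job(batch_client, job_name: str, job_queue: str,
                     job_definition: str, environment: dict = None) -> str:
    """Submit an AWS Batch job and return its id; the Lambda does no heavy work."""
    kwargs = {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
    }
    if environment:
        # Optional: pass per-run settings (e.g. the S3 key of the SQL script)
        # to the container as environment variables.
        kwargs["containerOverrides"] = {
            "environment": [
                {"name": key, "value": value} for key, value in environment.items()
            ]
        }
    response = batch_client.submit_job(**kwargs)
    return response["jobId"]
```

A scheduled rule can then invoke this Lambda daily, and the long-running query executes inside the Batch job rather than inside Lambda.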