Understanding where to begin with batch processing on AWS - amazon-web-services

I have a set of calculations that needs to run in a batch, and the workload is easily parallelized across machines. The work to be done is already done within a Docker container. I'm trying to understand the easiest way for me to run this workload in a highly parallel way on AWS. However, in trying to figure out where to begin I'm having trouble finding the right entrypoint. I read about AWS Batch and AWS Fargate, but each time I try to go down one of those paths to learn about them in more detail, more AWS services start popping up (Lamdas, Step Functions, ECS, AutoScaling groups), with each article having a different combination. Furthermore, I start thinking about the problem as a Batch vs Fargate problem, and then I find another article that talks about Batch + Fargate, or X + ECS + ....
I'm having trouble finding the appropriate introduction to the choices so I can get started with setting something up and getting some experience. Any pointers on which direction I might go or some resources for me to look at?

AWS containers services team member here. Your question triggers all my button cause I have been working on a deliverable to address some of this confusion ("where do I start with xyz?"). I can try to answer your question briefly here but if you want to read more (perhaps way more than you'd need feel free to contact me offline (mreferre at amazon dot com will work).
First and foremost it's not a Vs but it's an AND. Think of all these products you mention being distributed at different layers of the stack (this is a draft visual in the deliverable):
Fargate represents capacity (where your container is running), ECS represents a core containers orchestrator and Batch is one of the provisioners on top of the container orchestrator. Lambda is something separate and that live on its own. The options for your specific use case seem to be:
Lambda
ECS/Fargate
Batch/ECS/Fargate
Step Functions/ECS/Fargate (this one is outside of analysis and you don't see it in my visual - wondering if I should add it).
As others have hinted you probably want to use Lambda if your model is event-driven (e.g. if you want to fire up a dedicated function for every event like a new file uploaded to S3).
You probably do not want to use a naked ECS/Fargate solution because it would require more work to deal with the triggering and the scheduling of your batch jobs.
You probably want to use either Batch or Step Functions to schedule jobs on ECS/Fargate. I'd argue SF is good if you have basic workflows that you need to deal with and Batch if you need to manage complex jobs at scale. Perhaps this 35 mins presentation that I did last year can provide a bit more background on these Batch Vs SF differences.
Let me know if you have any additional questions because this discussion is super useful for the positioning I am trying to build.

Related

Is there a way to discover VMs using Terraform?

Infrastructure team members are creating, deleting and modifying resources in GCP project using console. Security team wants to scan the infra and check weather proper security measures are taken care
I am tryng to create a terraform script which will:
1. Take project ID as input and list all instances of the given project.
2. Loop all the instances and check if the security controls are in place.
3. If any security control is missing, terraform script will be modifying the resource(VM).
I have to repeat the same steps for all resoources available in project like subnet, cloud storage buckets, firewalls etc.
As per my initial investigation to do such task We will have to import the resources to terraform using "terraform import" command and after that will have to think of loops.
Now it looks like using APIs of GCP is the best fit for this task, as it looks terraform is not the good choice for this kind of tasks and I am not sure weather it is achievable using teffarform.
Can somebody provide any directions here?
Curious if by "console" you mean the gcp console (aka by hand), because if you are not already using terraform to create the resources (and do not plan to in the future), then terraform is not the correct tool for what you're describing. I'd actually argue it is increasing the complexity.
Mostly because:
The import feature is not intended for this kind of use case and we still find regular issues with it. Maybe 1 time for a few resources, but not for entire environments and not without it becoming the future source of truth. Projects such as terraforming do their best but still face wild west issues in complex environments. Not all resources even support importing
Terraform will not tell you anything about the VM's that you wouldn't know from the GCP cli already. If you need more information to make an assessment about the controls then you will need to use another tool or have some complicated provisioners. Provisioners at best would end up being a wrapper around other tooling you could probably use directly.
Honestly, I'm worried your team is trying to avoid the pain of converting older practices to IaC. It's uncomfortable and challenging, but yields better fruit in the long run then the path you're describing.
Digress, if you have infra created via terraform then I'd invest more time in some other practices that can accomplish the same results. Some other options are: 1) enforce best practices via parent modules that security has "blessed", 2) implement some CI on your terraform, 3) AWS has Config and Systems Manager, not sure if GCP has an equivalent but I would look around. Also it's worth evaluating using different technologies for different layers of abstraction. What checks your OS might be different from what checks your security groups and that's ok. Knowing is half the battle and might make for a more sane first version then automatic remediation.
With or without terraform, there is a an ecosystem of both products and opensource projects that can help with the compliance or control enforcement. Take a look at tools like inspec, sentinel, or salstack for inspiration.

How do I handle instant spikes in traffic in my APIs

I am using aws for my cloud infrastructure. I use ecs fargate as my compute machine. I am currently maintaining 10-20 apis which interact with members who have my application downloaded on their phone. Obviously one or two of these apis are my "main" apis and these are the ones which are really personalised to my users and honestly, these are the only two apis which members really access (by navigating to those screens).
My business team wants to send push notifications to members to alert them on certain new events which lands them on a screen where these APIs need to be called. Due to this, my application has mini crashes during this time period.
I've thought of a couple of ideas for the same, but since this is obviously an issue across industries and a solved problem, I wanted the standard solutions.
The ideas I have:
Sending notifications in batches. This seems like the best solution though it requires a bit of effort though I'm not sure how much.
Have a serverless machine run my requests (aws lambda functions) for those APIs which need to scale instantly. I have a lot of other APIs which I keep in fargate because I don't want my lambda function to be too heavy and then take a while to start up.
Scale machines all the time to handle the load I get during push notifications. This seems suboptimal due to cost reasons.
Scale machines up just during those periods where I want to send push notifications and them scale them back down. This seems like a decent solution if I can automate the entire process. I can have a flow which I follow for each push notification which will cause the system to scale and then start sending the notifications.
Is there a better way to do this. This seems like a relatively straightforward problem for people to have, but I don't see too much information on this topic.
I like your second option best because it's by far the easiest to manage (because you don't have to manage it). After that I'd go with your last option. I would use step functions to manage this, where the first step is to scale up the number of instances in Fargate. Once that has reached the desired level you would send the notifications. Add autoscaling to your services in Fargate to have it handle coming down automatically.

Suitability of app with long running tasks for AWS Lambda or AWS Step Functions

I have an application on an AWS EC2 instance that runs once daily. The application fetches some files from a web service, parses the files line by line, updates a database, updates S3 files based on changes in the database, sends notification emails to customers as well as a few other tasks.
This is a series of logical tasks that must take place in sequence, although some of the tasks can be thought of as sub-tasks that can be executed in parallel. All tasks are a combination of Perl scripts and Java programs, with a single Perl script acting as the manager that executes each in turn. Some tasks can take as long as 45 minutes to complete, and the whole process can take up to 3 hours in total.
I'd like to make this whole process serverless. My initial idea was to use AWS Lambda, whereby each task would execute as a Lambda function, until I discovered Lambda functions impose a 5 minute execution timeout. It seems like the AWS Step Functions service is actually a better fit for my use case, but my understanding is that this service is backed by Lambda, so the tasks will still have the 5 min execution limitation.
(I'm also aware that I would have to re-write my Perl scripts to a language supported by Lambda).
I assume that I can work around the execution time limit by refactoring my code into smaller functions that will guarantee to complete in under 5 minutes. In my particular situation though, this seems inefficient.
Currently the database update task processes lines from a file one at a time. For this to work with Lambda, a Lambda function would need to handle only a single line from the file (or a very small number of lines) in order to guarantee not spilling over 5 minutes execution time. This would involve opening and closing a connection with the database on every invocation of the Lambda function. Also, each line processed should result in an entry written to a file, to be stored in S3. Right now, I just keep a file handle in memory and write the file to S3 when all lines are processed, but with Lambda I would need to keep reading the file, updating it and writing it back to S3.
What I'm asking is:
Is my use case a bad fit for AWS Lambda and/or AWS Step Functions?
Have I misunderstood how these services work?
Is there another AWS service that would be a better fit for my use case?
After further research, I think AWS Batch might be a good idea.
What you want are called Activity Workers. Tl;dr: You register "activities" and each gets an ARN. Then you can put that ARN in the resource field of Task states and then you run some code (the "worker") somewhere (in a Lambda, on EC2, in your basement, wherever) that polls for tasks identified by that ARN, then calls back to report success or failure. Activity Workers can run for up to a year.
Step-by-step details at the AWS docs
In response to RTF's comment, here's a deeper dive: Suppose you have code to color turtles in color_turtles.pl. So what you do is call the CreateActivity API - see http://docs.aws.amazon.com/step-functions/latest/apireference/API_CreateActivity.html - giving the name "ColorTurtles" and it'll give you back an ARN, a string beginning arn:aws... Then in your state machine you make a Task state with that ARN as the value of the resource field. Then you add code to color_turtles.pl to poll the service with http://docs.aws.amazon.com/step-functions/latest/apireference/API_GetActivityTask.html - whenever a machine you're running gets to that task, it'll go look for activity workers polling. It'll give your polling worker the input for the task, then you process the input and generate some output, and call SendTaskSuccess or SendTaskFailure. All these are just REST HTTP calls, so you can run them anywhere and I mean anywhere; in a Lambda, on an EC2 instance, or on some computer anywhere on the Internet.
So to answer your questions:
1) Yeah, if you've got something that'll run for around 45 minutes, whilst you could engineer it with Lambda/Step functions you're probably better off getting a EC2 micro instance.
2)Nope you've pretty much got it.
3) As above you want to go with EC2 for this, there's a good article on using Data Pipelines to start / stop an EC2 instance here that way by starting instance only when you need it the cost(if any) is negligible.
I have jobs that run in this fashion normally you can get away with with a t2.micro instance which is free tier eligible.
You can also run your perl scripts on an EC2 instance so no need to rewrite them!
I will start with that it seems you are looking for workflow solutions on AWS. SWF and Step functions are the two most popular ones. Steps function is more recent offering and encouraged by AWS more than SWF.
SWF has native capability to handle long-running tasks, the downside is that you have to provide your own execution environment for deciders (can't use lambda).
With step functions, you can do this in two different ways. One of the approaches is suggested by Tim in his answer. There is an alternative way to achieve the same which is using job poller in step functions. Job pollers have the ability to call (poll) your resource and find out if the task is done and if not you can send execution in wait mode for the specified time. As mentioned above maximum execution time allowed currently for any workflow is 1 year. In case you have tasks which may take longer than 1 year, you can't use step functions in its current form.

Running application on a cluster

Abstract
I have my processing done using two console applications (Stage-estimate, Stage-step), each application processes files on disk, files are organized into folders. Each folder represents one step of processing which is considered completed when all files are estimated.
As an example lets consider that we are at Step 0 and the folder 0 contains the following files:
Folder 0 contains:
000.data
001.data
002.data
...
999.data
We have the data files, now we need to estimate them, we run Stage-estimate application 1000 times that result with the following directory structure:
Folder 0 contains:
000.data
000.estimate
001.data
001.estimate
002.data
002.estimate
...
999.data
999.estimate
Step 0 is now complete we have all the data/estimate pairs. In order to switch to Step 1 we run Stage-step application 1000 times on every data/estimate pair files and it results with new set of 1000 *.data files into folder 1. After Stage-step application completed, we have a folder 1 with the same structure as we had on Step 0:
Folder 1 contains:
000.data
001.data
002.data
...
999.data
From now on the process repeats until it is canceled.
The Problem
Application Stage-estimate does some pretty heavy calculations it consumes 99% of overall processing power compared to Stage-step application.
I was planing to use AWS in order to speed the things up. I don't want to start inventing special batch files that would call my applications the way described above, I know that there is special software that does some high-lifting at scheduling processes and other cluster related stuff.
Question
I was never dealing with cluster computing, off top of my head I see that application is parallelized really nice and it fits into AWS infrastructure. On the other hand I'm complete newbie in the world of cluster-computing and I don't know where to start from. I was dealing with AWS however nothing related to cluster computing, I don't know how to organize the flow I've described and how to make it run efficiently, so I would appreciate if you point me in right direction or provide some links on demos / best practices.
Thank you in advance!
__________Edit__________
Based on your comment, you can put all your jobs from stage 0 into a queue and start to process it. You can also have a logic what checks if you have only a few jobs left and tries to add new jobs from stage 1. This would speed up a bit your calculation, gives you better resource usage, but it's optional and makes your system more complex.
I suggest you to use SQS ( Or SWF) for storing the jobs, S3 for storing the files and an autoscaling group of spot instances for worker nodes.
Unfortunately Lambda doesn't support C++ at the moment. ( Node.js and Java is supported.)
________Original________
AWS supports several concepts which you may consider:
Decoupling: You can use SQS (Simple Queue Service) for job queuing, which gives you a redundant and fault tolerant job queue. You can have a fleet of worker instances, which are requesting jobs form the queue, running them and if they are finished, deleting the job from the queue. If the instances hangs/crashes during the execution of the job, after the timeout period the job goes back to the queue and an other instance will execute it again.
Other service is the SWF ( Simple Workflow Service). This service internally uses SQS queues, with this service, you may need less script to glue your entire workflow together.
Redundant storage: I would definitely use AWS S3 for storage, because it's cheap and redundant. After the first read, I don't think you need any advanced (file system like) feature. ( for example locking.)
Spot instances: For the worker nodes, I would use Spot instances which are much cheaper. The only issue with them if you need a really fast answer for your task all the time. ( If you generating daily reports, spot instances are perfect solution.)
+1: You may use AWS Lambda function to run your jobs. You can trigger your lambda function based on S3 events. For example you uploaded a new *.data file. However Lambda functions cannot run too long. But if you are able to use lambda function, then all your environment will contains only S3 buckets and lambda function. Both of them are AWS managed service, so your system would be extremely flexible, fault tolerant. I can't say any exact details about pricing, but I assume it would be cheaper then running EC2 instances.
Summary: If you can run your estimations parallel, AWS gives you a lots of power and speed. (for a good money) especially if your load is changing during the day.
A good source: White Paper on ‘Cloud Architectures’ and Best Practices of Amazon S3, EC2, SimpleDB, SQS

Are there any Schedulers for AWS/DynamoDB?

We're trying to move to AWS and to use DynamoDB. It'd be nice to keep everything under DynamoDB so there aren't extraneous types of databases, but aside from half complete research projects I'm not really finding anything to use for a scheduler. There's going to be dynamically set schedules in the range of thousands+, possibly with many running at the same time. For languages, Java or at least JVM would be awesome.
Does anyone know a good Scheduler for DynamoDB or other AWS technology?
---Addendum
When I say scheduler I'm thinking of something all purpose like quartz. I want to set a cron and it runs at that time with the code I give it. This isn't doing some AWS task, this is a task internal to our product. SWF's cron runs inside the VM, so I'm worried what happens when the VM is down. Data Pipeline seems a bit too much. I've been looking into making a dynamodb job store for quartz, consistent read might get around the transaction and consistency issues, but I'm hesitant, might be biting off a lot with a lot of hard to notice problems.
Have you looked at AWS Simple Workflow? You would use the AWS Flow Framework to program against the service, and they have a well documented Java API with lots of samples. They support continuous workflows with timers which you can use to run periodic code (see code example here). I'm using SWF and the Flow Framework for Ruby to run async code that gets kicked off from my main app, and it's been working great.
Another new option for you is to look at AWS Lambda. You can attach your Lambda function code directly to a DynamoDB table update event, and Lambda will spin up and shut down the compute resources for you, without you having to manage a server to run your code. Also, recently, AWS launched the ability to call the Lambda function directly -- e.g. you could have an external timer or other code that triggers the function on a specific schedule.
Lastly, this SO thread may have other options for you to consider.
Another option is to use AWS Lambda Scheduled Functions (newly announced on October 8th 2015 at AWS re:Invent).
Here is a relevant snippet from the blog (source):
Scheduled Functions (Cron)
You can now invoke a Lambda function on a regular, scheduled basis. You can specify a fixed rate (number of minutes, hours, or days between invocations) or you can specify a Cron-like expression: