Running an application on a cluster - amazon-web-services

Abstract
My processing is done by two console applications (Stage-estimate and Stage-step). Each application processes files on disk, and the files are organized into folders. Each folder represents one step of processing, which is considered complete when all files have been estimated.
As an example, let's consider that we are at Step 0 and folder 0 contains the following files:
Folder 0 contains:
000.data
001.data
002.data
...
999.data
We have the data files; now we need to estimate them. We run the Stage-estimate application 1000 times, which results in the following directory structure:
Folder 0 contains:
000.data
000.estimate
001.data
001.estimate
002.data
002.estimate
...
999.data
999.estimate
Step 0 is now complete: we have all the data/estimate pairs. In order to switch to Step 1, we run the Stage-step application 1000 times, once on every data/estimate pair, and it produces a new set of 1000 *.data files in folder 1. After the Stage-step application has completed, folder 1 has the same structure as folder 0 had at Step 0:
Folder 1 contains:
000.data
001.data
002.data
...
999.data
From now on the process repeats until it is canceled.
The Problem
The Stage-estimate application does some pretty heavy calculations; it consumes 99% of the overall processing power compared to the Stage-step application.
I was planning to use AWS in order to speed things up. I don't want to start inventing special batch files that would call my applications in the way described above; I know that there is special software that does the heavy lifting of scheduling processes and other cluster-related work.
Question
I have never dealt with cluster computing. Off the top of my head, I can see that the application parallelizes really nicely and fits the AWS infrastructure. On the other hand, I'm a complete newbie in the world of cluster computing and I don't know where to start. I have worked with AWS, but nothing related to cluster computing, and I don't know how to organize the flow I've described or how to make it run efficiently, so I would appreciate it if you could point me in the right direction or provide some links to demos / best practices.
Thank you in advance!

__________Edit__________
Based on your comment, you can put all your jobs from stage 0 into a queue and start processing them. You can also add logic that checks whether only a few jobs are left and then tries to add new jobs from stage 1. This would speed up your calculation a bit and give you better resource usage, but it's optional and makes your system more complex.
I suggest you use SQS (or SWF) for storing the jobs, S3 for storing the files, and an Auto Scaling group of Spot Instances for the worker nodes.
Unfortunately, Lambda doesn't support C++ at the moment. (Node.js and Java are supported.)
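For example, a minimal sketch of seeding the stage-0 queue with boto3; the queue name, bucket name, and S3 key layout are assumptions, not anything prescribed by the answer:

```python
import json
import boto3

# Hypothetical names -- substitute your own queue and bucket.
QUEUE_NAME = "estimate-jobs"
BUCKET = "my-processing-bucket"

sqs = boto3.resource("sqs")
queue = sqs.get_queue_by_name(QueueName=QUEUE_NAME)

# One message per data file of the current step; workers pick these up.
step = 0
for i in range(1000):
    job = {
        "step": step,
        "data_key": f"{step}/{i:03d}.data",        # e.g. "0/000.data" in S3
        "estimate_key": f"{step}/{i:03d}.estimate",
    }
    queue.send_message(MessageBody=json.dumps(job))
```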
________Original________
AWS supports several concepts which you may consider:
Decoupling: You can use SQS (Simple Queue Service) for job queuing, which gives you a redundant and fault-tolerant job queue. You can have a fleet of worker instances that request jobs from the queue, run them, and delete the job from the queue when they are finished. If an instance hangs or crashes during the execution of a job, the job goes back to the queue after the visibility timeout and another instance will execute it again.
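A minimal sketch of such a worker, assuming the queue/bucket names and message layout from the seeding snippet above; the Stage-estimate command line is a placeholder for however your executable is actually invoked:

```python
import json
import subprocess
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
# Placeholder queue URL and bucket -- use your own.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/estimate-jobs"
BUCKET = "my-processing-bucket"

while True:
    # Long polling: wait up to 20 s for a job to appear.
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        job = json.loads(msg["Body"])

        # Fetch the input, run Stage-estimate locally, upload the result.
        s3.download_file(BUCKET, job["data_key"], "input.data")
        subprocess.run(["./Stage-estimate", "input.data", "output.estimate"],
                       check=True)
        s3.upload_file("output.estimate", BUCKET, job["estimate_key"])

        # Delete only after success; otherwise the message reappears after
        # the visibility timeout and another worker retries it.
        sqs.delete_message(QueueUrl=QUEUE_URL,
                           ReceiptHandle=msg["ReceiptHandle"])
```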
Another service is SWF (Simple Workflow Service). This service internally uses SQS queues; with it, you may need less scripting to glue your entire workflow together.
Redundant storage: I would definitely use AWS S3 for storage, because it's cheap and redundant. After a first read, I don't think you need any advanced (file-system-like) features (for example, locking).
Spot Instances: For the worker nodes, I would use Spot Instances, which are much cheaper. The only issue with them is if you need a really fast answer for your task all the time. (If you are generating daily reports, Spot Instances are a perfect solution.)
+1: You could use an AWS Lambda function to run your jobs. You can trigger your Lambda function based on S3 events, for example when a new *.data file is uploaded. However, Lambda functions cannot run for too long. But if you are able to use Lambda functions, then your entire environment will contain only S3 buckets and Lambda functions. Both are AWS-managed services, so your system would be extremely flexible and fault tolerant. I can't give exact details about pricing, but I assume it would be cheaper than running EC2 instances.
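To illustrate the event flow (in Python, since C++ isn't a Lambda runtime), here is a minimal sketch of a handler fired by an S3 *.data upload; the "estimation" step is a dummy placeholder and the /tmp paths are assumptions:

```python
import urllib.parse
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # One invocation per S3 event notification; each record names the
    # bucket and key of the object that was just uploaded.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        if not key.endswith(".data"):
            continue

        s3.download_file(bucket, key, "/tmp/input.data")
        # Placeholder "estimation": in reality you'd run your real
        # calculation here, within Lambda's time and memory limits.
        with open("/tmp/input.data", "rb") as f, \
             open("/tmp/output.estimate", "w") as out:
            out.write(f"bytes={len(f.read())}\n")
        s3.upload_file("/tmp/output.estimate", bucket,
                       key.replace(".data", ".estimate"))
```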
Summary: If you can run your estimations in parallel, AWS gives you a lot of power and speed (for reasonable money), especially if your load changes during the day.
A good source: White Paper on ‘Cloud Architectures’ and Best Practices of Amazon S3, EC2, SimpleDB, SQS

Related

Understanding where to begin with batch processing on AWS

I have a set of calculations that needs to run in a batch, and the workload is easily parallelized across machines. The work to be done is already done within a Docker container. I'm trying to understand the easiest way for me to run this workload in a highly parallel way on AWS. However, in trying to figure out where to begin I'm having trouble finding the right entrypoint. I read about AWS Batch and AWS Fargate, but each time I try to go down one of those paths to learn about them in more detail, more AWS services start popping up (Lambdas, Step Functions, ECS, Auto Scaling groups), with each article having a different combination. Furthermore, I start thinking about the problem as a Batch vs Fargate problem, and then I find another article that talks about Batch + Fargate, or X + ECS + ....
I'm having trouble finding the appropriate introduction to the choices so I can get started with setting something up and getting some experience. Any pointers on which direction I might go or some resources for me to look at?
AWS containers services team member here. Your question pushes all my buttons because I have been working on a deliverable to address some of this confusion ("where do I start with xyz?"). I can try to answer your question briefly here, but if you want to read more (perhaps way more than you'd need) feel free to contact me offline (mreferre at amazon dot com will work).
First and foremost, it's not a "vs" but an "and". Think of all these products you mention as being distributed at different layers of the stack (this is a draft visual in the deliverable).
Fargate represents capacity (where your container is running), ECS represents the core container orchestrator, and Batch is one of the provisioners on top of the container orchestrator. Lambda is something separate that lives on its own. The options for your specific use case seem to be:
Lambda
ECS/Fargate
Batch/ECS/Fargate
Step Functions/ECS/Fargate (this one is outside of the analysis and you don't see it in my visual; I'm wondering if I should add it).
As others have hinted, you probably want to use Lambda if your model is event-driven (e.g. if you want to fire up a dedicated function for every event, like a new file uploaded to S3).
You probably do not want to use a naked ECS/Fargate solution, because it would require more work to deal with the triggering and the scheduling of your batch jobs.
You probably want to use either Batch or Step Functions to schedule jobs on ECS/Fargate. I'd argue SF is good if you have basic workflows you need to deal with, and Batch if you need to manage complex jobs at scale. Perhaps this 35-minute presentation that I did last year can provide a bit more background on these Batch vs SF differences.
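As a rough sketch of the Batch route, submitting the containerized workload as an array job with boto3; the job queue and job definition names are placeholders for resources you would create for your own image:

```python
import boto3

batch = boto3.client("batch")

# Hypothetical job queue and job definition -- whatever you registered
# for your container image.
resp = batch.submit_job(
    jobName="parallel-calculation",
    jobQueue="my-fargate-queue",
    jobDefinition="my-container-job:1",
    # 1000 child jobs; each container can read its own index from the
    # AWS_BATCH_JOB_ARRAY_INDEX environment variable to pick its work item.
    arrayProperties={"size": 1000},
)
print(resp["jobId"])
```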
Let me know if you have any additional questions because this discussion is super useful for the positioning I am trying to build.

A Global Variable(State) in AWS for Serverless Orchestration

I am writing a syncing/ETL app inside AWS. It works as follows:
The source of the data is outside of AWS
Whenever new data is changed/added AWS is alerted via API Gateway (REST)
The REST API triggers a lambda function that does ETL and stores the data in CSV format to S3
This works fine for small tables. However, we have been dealing with larger amounts of data lately and I have to switch to Fargate (EKS/ECS) instead of Lambda. As you can imagine, these will be long-running jobs and not cheap to perform. Usually when the data is changed, it changes multiple times within a period of 5 minutes, say for example 3 times. So the REST API gets pinged 3 times in a row and triggers the ETL job 3 times as well. This is very inefficient, as you can imagine.
I came up with the idea that every time the REST API is triggered, we wait for 5 minutes; if the API has not been invoked again during the waiting period, do the ETL, otherwise do nothing. I think I can do the waiting using Step Functions. However, I cannot find a suitable way to store the hash/id of the latest ping to the API in one single variable. I thought maybe I could store the hash in an S3 object and after 5 minutes check whether it is the same as the variable in my Step Function, but apparently ordering is not guaranteed. I looked into SQS, but the fact that it is a FIFO queue is not very convenient and is far more than what I actually need. I am pretty sure that other people have had a similar issue and there must be a standard solution to this problem. I could not find any by googling, hence my plea here.
Thanks
From what I understand, Amazon DynamoDB is the store you are looking for to save the state of your job.
Also, please note that SQS is not FIFO by default. Using SQS won't prevent you from storing your job state.
What I would do:
Trigger a job and store its state in DynamoDB. Do not launch further jobs until the job state is marked done (see the sketch below).
Orchestrate the ETL from Step Functions (including the 5-minute wait).
You can also expire your jobs so DynamoDB will automatically clean them up over time.
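A minimal sketch of that guard with boto3, assuming a hypothetical etl-jobs table keyed by source_id and a TTL attribute named expires_at; the conditional write refuses to start a second job while one is still running:

```python
import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("etl-jobs")   # hypothetical table, partition key "source_id"

def try_start_job(source_id):
    """Record that a job is running; refuse if one is already in flight."""
    try:
        table.put_item(
            Item={
                "source_id": source_id,
                "status": "RUNNING",
                "expires_at": int(time.time()) + 3600,  # TTL attribute
            },
            # Only succeed if no unfinished job exists for this source.
            ConditionExpression="attribute_not_exists(source_id) OR #s = :done",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={":done": "DONE"},
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False   # a job is already running; skip this trigger
        raise
```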

Suitability of app with long running tasks for AWS Lambda or AWS Step Functions

I have an application on an AWS EC2 instance that runs once daily. The application fetches some files from a web service, parses the files line by line, updates a database, updates S3 files based on changes in the database, sends notification emails to customers as well as a few other tasks.
This is a series of logical tasks that must take place in sequence, although some of the tasks can be thought of as sub-tasks that can be executed in parallel. All tasks are a combination of Perl scripts and Java programs, with a single Perl script acting as the manager that executes each in turn. Some tasks can take as long as 45 minutes to complete, and the whole process can take up to 3 hours in total.
I'd like to make this whole process serverless. My initial idea was to use AWS Lambda, whereby each task would execute as a Lambda function, until I discovered Lambda functions impose a 5 minute execution timeout. It seems like the AWS Step Functions service is actually a better fit for my use case, but my understanding is that this service is backed by Lambda, so the tasks will still have the 5 min execution limitation.
(I'm also aware that I would have to re-write my Perl scripts to a language supported by Lambda).
I assume that I can work around the execution time limit by refactoring my code into smaller functions that are guaranteed to complete in under 5 minutes. In my particular situation, though, this seems inefficient.
Currently the database update task processes lines from a file one at a time. For this to work with Lambda, a Lambda function would need to handle only a single line from the file (or a very small number of lines) in order to guarantee not spilling over 5 minutes execution time. This would involve opening and closing a connection with the database on every invocation of the Lambda function. Also, each line processed should result in an entry written to a file, to be stored in S3. Right now, I just keep a file handle in memory and write the file to S3 when all lines are processed, but with Lambda I would need to keep reading the file, updating it and writing it back to S3.
What I'm asking is:
Is my use case a bad fit for AWS Lambda and/or AWS Step Functions?
Have I misunderstood how these services work?
Is there another AWS service that would be a better fit for my use case?
After further research, I think AWS Batch might be a good idea.
What you want are called Activity Workers. Tl;dr: You register "activities" and each gets an ARN. Then you can put that ARN in the resource field of Task states and then you run some code (the "worker") somewhere (in a Lambda, on EC2, in your basement, wherever) that polls for tasks identified by that ARN, then calls back to report success or failure. Activity Workers can run for up to a year.
Step-by-step details at the AWS docs
In response to RTF's comment, here's a deeper dive: Suppose you have code to color turtles in color_turtles.pl. What you do is call the CreateActivity API - see http://docs.aws.amazon.com/step-functions/latest/apireference/API_CreateActivity.html - giving the name "ColorTurtles", and it'll give you back an ARN, a string beginning arn:aws... Then in your state machine you make a Task state with that ARN as the value of the resource field. Then you add code to color_turtles.pl to poll the service with http://docs.aws.amazon.com/step-functions/latest/apireference/API_GetActivityTask.html - whenever a state machine you're running gets to that task, it'll look for activity workers polling. It'll give your polling worker the input for the task; you process the input, generate some output, and call SendTaskSuccess or SendTaskFailure. All of these are just REST HTTP calls, so you can run them anywhere, and I mean anywhere: in a Lambda, on an EC2 instance, or on some computer anywhere on the Internet.
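A minimal sketch of such an activity worker in Python with boto3; the activity ARN is a placeholder, and color_turtles here is a stand-in for whatever your color_turtles.pl actually does:

```python
import boto3

sfn = boto3.client("stepfunctions")

# Placeholder: the ARN returned by CreateActivity for "ColorTurtles".
ACTIVITY_ARN = "arn:aws:states:us-east-1:123456789012:activity:ColorTurtles"

def color_turtles(task_input_json):
    # Stand-in for the real (possibly long-running) work.
    return task_input_json  # must return a JSON string as the task output

while True:
    # Long poll; returns an empty taskToken when no task is ready.
    task = sfn.get_activity_task(activityArn=ACTIVITY_ARN,
                                 workerName="worker-1")
    token = task.get("taskToken")
    if not token:
        continue
    try:
        output = color_turtles(task["input"])
        sfn.send_task_success(taskToken=token, output=output)
    except Exception as exc:
        sfn.send_task_failure(taskToken=token, error="ColoringFailed",
                              cause=str(exc))
```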
So to answer your questions:
1) Yeah, if you've got something that'll run for around 45 minutes, whilst you could engineer it with Lambda/Step Functions, you're probably better off getting an EC2 micro instance.
2) Nope, you've pretty much got it.
3) As above, you want to go with EC2 for this. There's a good article on using Data Pipeline to start/stop an EC2 instance here; that way, by starting the instance only when you need it, the cost (if any) is negligible.
I have jobs that run in this fashion; normally you can get away with a t2.micro instance, which is free-tier eligible.
You can also run your Perl scripts on an EC2 instance, so there's no need to rewrite them!
I will start by saying that it seems you are looking for workflow solutions on AWS. SWF and Step Functions are the two most popular ones. Step Functions is the more recent offering and is encouraged by AWS over SWF.
SWF has a native capability to handle long-running tasks; the downside is that you have to provide your own execution environment for deciders (you can't use Lambda).
With Step Functions, you can do this in two different ways. One of the approaches is suggested by Tim in his answer. There is an alternative way to achieve the same thing, which is using a job poller in Step Functions. Job pollers have the ability to call (poll) your resource and find out if the task is done; if not, you can send the execution into a wait state for a specified time. As mentioned above, the maximum execution time currently allowed for any workflow is 1 year. If you have tasks which may take longer than 1 year, you can't use Step Functions in their current form.

AWS services appropriate for concurrent access to a resource

I'm designing a system where a cluster of EC2 instances do some computing and then update a large file continually. What would be ideal is if I could have the file in S3, and have all the instances take turns writing to it one at a time, performing calculations while they wait.
As it stands, if 2 instances PUT to S3 at the same time, one will simply overwrite the other.
How can I solve this concurrency issue?
AWS has a preview service called EFS (http://aws.amazon.com/documentation/efs/), an NFSv4 file system that can be shared among EC2 instances. But such a service alone does not solve your problem, as you may still have concurrency issues. Consider something more sophisticated, such as exploiting "embarrassingly parallel" processing: have N processes create N file chunks and a single process join all the pieces together at the end, when everything is done.
As it is, Amazon states that if you receive a success code, then your S3 object is committed. Amazon also adds that there won't be any dirty writes or overlapping inconsistency; you will read one fully committed write or another.
If you need more control, you might be able to handle it at the application level, for example by implementing a critical section.
It certainly makes sense to enable versioning on the bucket so that you keep all the writes and can later specify which version to treat as the latest.
You can also leverage lifecycle rules to delete (keep deleting) all but the last n versions to save cost.
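A minimal sketch of wiring that up with boto3; the bucket name is a placeholder, and the rule below expires noncurrent versions by age (30 days) rather than by keeping an exact count of n versions:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-shared-output-bucket"   # hypothetical bucket name

# Keep every write as a distinct version instead of silently overwriting.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Expire old (noncurrent) versions after 30 days to keep storage costs down.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-old-versions",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
        }]
    },
)
```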

Are there any Schedulers for AWS/DynamoDB?

We're trying to move to AWS and to use DynamoDB. It'd be nice to keep everything under DynamoDB so there aren't extraneous types of databases, but aside from half-complete research projects I'm not really finding anything to use as a scheduler. There are going to be dynamically set schedules in the range of thousands or more, possibly with many running at the same time. For languages, Java or at least the JVM would be awesome.
Does anyone know a good Scheduler for DynamoDB or other AWS technology?
---Addendum
When I say scheduler, I'm thinking of something all-purpose like Quartz. I want to set a cron expression and have it run at that time with the code I give it. This isn't for doing some AWS task; this is a task internal to our product. SWF's cron runs inside the VM, so I'm worried about what happens when the VM is down. Data Pipeline seems a bit too much. I've been looking into making a DynamoDB job store for Quartz; consistent reads might get around the transaction and consistency issues, but I'm hesitant, since I might be biting off a lot with a lot of hard-to-notice problems.
Have you looked at AWS Simple Workflow? You would use the AWS Flow Framework to program against the service, and they have a well documented Java API with lots of samples. They support continuous workflows with timers which you can use to run periodic code (see code example here). I'm using SWF and the Flow Framework for Ruby to run async code that gets kicked off from my main app, and it's been working great.
Another new option for you is to look at AWS Lambda. You can attach your Lambda function code directly to a DynamoDB table update event, and Lambda will spin up and shut down the compute resources for you, without you having to manage a server to run your code. Also, recently, AWS launched the ability to call the Lambda function directly -- e.g. you could have an external timer or other code that triggers the function on a specific schedule.
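As a rough sketch of what such a function sees, assuming a stream is enabled on the table and a hypothetical schedule_name attribute, the handler receives batches of item-level change records:

```python
def handler(event, context):
    # Invoked by the DynamoDB Stream attached to the table; each record
    # describes one item-level change (INSERT, MODIFY, REMOVE).
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        new_image = record["dynamodb"]["NewImage"]
        schedule_name = new_image["schedule_name"]["S"]   # hypothetical attribute
        # ... kick off whatever work this new schedule entry requires ...
        print(f"New schedule row: {schedule_name}")
```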
Lastly, this SO thread may have other options for you to consider.
Another option is to use AWS Lambda Scheduled Functions (newly announced on October 8th 2015 at AWS re:Invent).
Here is a relevant snippet from the blog (source):
Scheduled Functions (Cron)
You can now invoke a Lambda function on a regular, scheduled basis. You can specify a fixed rate (number of minutes, hours, or days between invocations) or you can specify a Cron-like expression:
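A minimal sketch of creating such a schedule with boto3 via CloudWatch Events; the rule name, function name, and ARN below are placeholders:

```python
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Hypothetical function ARN -- replace with your own.
FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:my-scheduled-task"

# Fire every 5 minutes; a cron-style expression such as
# "cron(0 12 * * ? *)" (noon UTC daily) works the same way.
rule = events.put_rule(Name="my-schedule",
                       ScheduleExpression="rate(5 minutes)")

# Allow CloudWatch Events to invoke the function, then wire it up.
lambda_client.add_permission(FunctionName="my-scheduled-task",
                             StatementId="allow-events",
                             Action="lambda:InvokeFunction",
                             Principal="events.amazonaws.com",
                             SourceArn=rule["RuleArn"])
events.put_targets(Rule="my-schedule",
                   Targets=[{"Id": "1", "Arn": FUNCTION_ARN}])
```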