I was wondering what would be the recommended way to host a long running scheduled task in AWS.
Currently we have an EC2 instance (Windows) in charge of triggering our app every few hours. The task takes between 1 and 3 hours, depending on the number of items to process.
Lambda does not seem appropriate, since my task runs too long.
I found this topic about Hangfire: Scheduled Jobs in .NET Core 2 Web app hosted in AWS. It seems good, but it is outside of AWS.
Any suggestions?
Thx
Seb
I would recommend AWS Step Functions. Very easy to implement. Part of the AWS Serverless Platform.
AWS Step Functions
AWS Step Functions makes it easy to coordinate the components of
distributed applications and microservices using visual workflows.
Building applications from individual components that each perform a
discrete function lets you scale and change applications quickly. Step
Functions is a reliable way to coordinate components and step through
the functions of your application. Step Functions provides a graphical
console to arrange and visualize the components of your application as
a series of steps. This makes it simple to build and run multistep
applications. Step Functions automatically triggers and tracks each
step, and retries when there are errors, so your application executes
in order and as expected. Step Functions logs the state of each step,
so when things do go wrong, you can diagnose and debug problems
quickly. You can change and add steps without even writing code, so
you can easily evolve your application and innovate faster.
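As a rough illustration of what such a workflow definition could look like, here is a minimal sketch that builds an Amazon States Language document with a single retried task. The activity ARN, the state name and the retry numbers are assumptions made up for this example. An activity is used here because the job in the question runs for hours, and an activity worker can run anywhere (for example on the existing EC2 instance).

```python
import json

# Hypothetical ARN: an activity that a worker process polls for work.
JOB_RESOURCE_ARN = "arn:aws:states:us-east-1:123456789012:activity:process-items"


def build_state_machine(resource_arn, timeout_seconds=3 * 60 * 60):
    """Return an Amazon States Language definition with one retried task."""
    return {
        "Comment": "Sketch: run one long job with retries",
        "StartAt": "RunJob",
        "States": {
            "RunJob": {
                "Type": "Task",
                "Resource": resource_arn,
                # Fail the state if the job runs longer than expected.
                "TimeoutSeconds": timeout_seconds,
                "Retry": [
                    {
                        "ErrorEquals": ["States.ALL"],
                        "IntervalSeconds": 60,
                        "MaxAttempts": 2,
                        "BackoffRate": 2.0,
                    }
                ],
                "End": True,
            }
        },
    }


# JSON text that could be pasted into the Step Functions console.
definition_json = json.dumps(build_state_machine(JOB_RESOURCE_ARN), indent=2)
```

The schedule itself is not part of the definition; it would come from a separate scheduled rule that starts an execution.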
Using AWS Lambda with Scheduled Events would let you create an AWS Lambda that will respond to a scheduled event. This Lambda could then trigger your app. Your app doesn't need to be in a Lambda itself.
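A minimal sketch of such a trigger Lambda, assuming your app exposes an HTTP endpoint that starts the job and returns immediately. The URL and the payload fields are hypothetical, and the `opener` parameter exists only so the handler can be exercised without a network.

```python
import json
import urllib.request

# Hypothetical endpoint on your app that starts the long-running job.
APP_TRIGGER_URL = "https://example.invalid/run-job"


def build_request(event):
    """Turn the scheduled event into a POST request for the app."""
    payload = {
        "scheduled_at": event.get("time"),
        "source": event.get("source"),
    }
    return urllib.request.Request(
        APP_TRIGGER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def lambda_handler(event, context, opener=urllib.request.urlopen):
    """Entry point: fire the trigger and return quickly.

    The Lambda only starts the job; the 1-3 hour run happens in the app.
    """
    request = build_request(event)
    with opener(request, timeout=10) as response:
        return {"status": response.status}
```

The Lambda finishes in seconds, so its execution time limit no longer matters; only the app needs to stay up for the duration of the job.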
Related
I need to run a pipeline of data transformation that is composed of several scripts in distinct projects = Python repos.
I am thinking of using Compute Engine to run these scripts in VMs when needed as I can manage resources required.
I need to be able to orchestrate these scripts, in the sense that I want to run steps sequentially and sometimes asynchronously.
I see that GCP provides a Workflows component which seems to suit this case.
I am thinking of creating a specific project to orchestrate the executions of scripts.
However I cannot see how I can trigger the execution of my scripts which will not be in the same repo as the orchestrator project. From what I understand of GCE, VMs are only created when scripts are executed and provide no persistent HTTP endpoints to be called to trigger the execution from elsewhere.
To illustrate, let's say I have two projects, step_1 and step_2, which contain separate steps of my data transformation pipeline.
I would also have a project, orchestrator, whose only purpose is to trigger step_1 and step_2 sequentially in VMs with GCE. This project would not have access to the code repos of the two former projects.
What would be the best practice in this case? Should I use components other than GCE and Workflows for this, or is there a way to trigger scripts in GCE from an independent orchestration project?
One possible solution would be to not use GCE (Google Compute Engine) but instead create Docker containers that contain your task steps. These would then be registered with Cloud Run. Cloud Run spins up Docker containers on demand and charges you only for the time you spend processing a request. When the request ends, you are no longer charged, so you consume resources only while work is being done. Various events can cause a request in Cloud Run, but the most common is a REST call.
With this in mind, assume that your Python code is packaged in a container and triggered through a REST server (e.g. Flask). Effectively you have created "microservices". These services can then be orchestrated by Cloud Workflows. The microservices are invoked through REST endpoints, which can be Internet addresses protected by authorization. This would allow the microservices (tasks/steps) to be located in separate GCP projects, and the orchestrator would see them as distinct callable endpoints.
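To make the "step behind a REST endpoint" idea concrete, here is a minimal sketch of one step wrapped in an HTTP server. It uses the standard library's `http.server` in place of Flask so that it has no dependencies; `run_step` and its return shape are hypothetical placeholders for the real script. Cloud Run passes the port to listen on in the `PORT` environment variable.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_step(params):
    """Hypothetical placeholder for the real step_1 transformation."""
    return {"step": "step_1", "processed": len(params.get("items", []))}


class StepHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body sent by the orchestrator, if any.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length) if length else b"{}"
        result = run_step(json.loads(body))

        payload = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port=None):
    """Build the server; by default listen on the port given in PORT."""
    if port is None:
        port = int(os.environ.get("PORT", "8080"))
    return HTTPServer(("0.0.0.0", port), StepHandler)


def main():
    # The container's entrypoint would call this function.
    make_server().serve_forever()
```

The orchestrator only needs the service URL and permission to call it, not access to the step's repository.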
Other potential solutions to look at would be GKE (Kubernetes) and Cloud Composer (Apache Airflow).
If you DO wish to stay with Compute Engine, you can still do that using Shared VPC. Shared VPC allows distinct projects to have network connectivity between each other, and you could use Private Catalog to have the GCE instances advertise themselves to each other. You could then have a GCE instance do the orchestration or, again, orchestrate through Cloud Workflows. We would have to check whether Cloud Workflows supports parallel steps; I do not believe that it does as of the time of this post.
This is a common request: organizing automation into its own project. You can set up a service account that spans multiple projects.
See a tutorial here: https://gtseres.medium.com/using-service-accounts-across-projects-in-gcp-cf9473fef8f0
On top of that, you can also consider having Workflows in both the orchestrator and the sublevel projects. This way the orchestrator Workflow can call another Workflow. The job can then be run easily and stays encapsulated in the project that holds the code and the workflow body, and only the triggering comes from the other project.
Other devs and I are currently testing/building Lambda functions for cleaning data that flows from S3 -> SQS -> Data Router Lambda (Python), DynamoDB Rules Engine, and then a text processor in Lambda. We're currently working on the AWS platform, but I'm trying to test this part of the data pipeline locally.
Ideally I would simulate S3 and SQS, dump the zip files in, and run them through the Lambda function. I'm currently toying with the SAM CLI and Visual Studio, but nothing has stuck yet. Any tips?
There are several ways you can approach (local) testing of your AWS application:
1. Use unit tests for the different parts of your "pipeline", mocking the other parts like DynamoDB, SQS, etc.
2. Use something like LocalStack.
3. Give every developer their own "developer environment" in AWS. You could, for example, prefix every resource with the name of the developer (john_processing_lambda). You deploy to AWS and run integration tests from your local machine. You can achieve something like this with tools like Terraform, which allow you to name resources "dynamically" and, for example, add the developer's name as a prefix.
Personally, I find running "AWS on your local machine" via Docker containers or tools like LocalStack not really satisfying. We had the best results with a combination of option 1 and option 3. Both have the upside that you can use the same tests in your CI/CD pipeline.
Furthermore, not running in the actual cloud (AWS) always bears the risk of "forgetting" something. Most notably IAM permissions. So everything runs fine on your local machine, but then it does not work on AWS.
Deploying a separate environment for every developer, so that they can play around with the actual resources and run tests directly in AWS, would be my recommendation. This paired with solid unit tests should yield the best results.
The downside of developer environments in AWS is that a developer has to deploy their code to AWS every time they want to test something. So making deployments fast is important. I found that with sufficient experience, you don't need to deploy that often anymore and this becomes less of an issue. Nevertheless, developer satisfaction in your team is important, so make sure to make this as smooth as possible.
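As an illustration of option 1, here is a minimal sketch of a unit test for a router function with the DynamoDB table mocked out. The function `route_record`, the rule schema (rules keyed by file extension) and the bucket name are all hypothetical; the only assumption about AWS is the usual shape of an S3 notification delivered inside an SQS message body.

```python
import json
import unittest
from unittest.mock import MagicMock


def route_record(sqs_record, rules_table):
    """Find the S3 object in an SQS-wrapped S3 notification and look up its rule."""
    s3_event = json.loads(sqs_record["body"])
    s3_info = s3_event["Records"][0]["s3"]
    bucket = s3_info["bucket"]["name"]
    key = s3_info["object"]["key"]

    # Hypothetical schema: one rule per file extension.
    extension = key.rsplit(".", 1)[-1]
    response = rules_table.get_item(Key={"extension": extension})
    rule = response.get("Item", {"action": "skip"})
    return {"bucket": bucket, "key": key, "action": rule["action"]}


class RouteRecordTest(unittest.TestCase):
    def _record(self, key):
        s3_event = {
            "Records": [
                {"s3": {"bucket": {"name": "incoming"}, "object": {"key": key}}}
            ]
        }
        return {"body": json.dumps(s3_event)}

    def test_known_extension_uses_rule(self):
        table = MagicMock()  # stands in for a DynamoDB Table resource
        table.get_item.return_value = {
            "Item": {"extension": "zip", "action": "unzip"}
        }
        result = route_record(self._record("batch/data.zip"), table)
        self.assertEqual(result["action"], "unzip")
        table.get_item.assert_called_once_with(Key={"extension": "zip"})

    def test_unknown_extension_is_skipped(self):
        table = MagicMock()
        table.get_item.return_value = {}  # no "Item" means no rule found
        result = route_record(self._record("notes.txt"), table)
        self.assertEqual(result["action"], "skip")
```

Run it with `python -m unittest`. Because the table is passed in rather than created inside the function, the same function can be handed a real table in the per-developer environment.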
I have a daily process that needs to digest a tremendous amount of data from two external sources. It normally requires around 28 GB of RAM and a decent amount of processing power. Because of this, AWS Lambda won't work.
In the meantime, I've been running the process on an EC2 instance. In order to save resources, I've attempted to start the instance using a CloudWatch event. Since no event exists for "StartEC2," I'm kicking off an AWS Lambda instead, which in turn starts the EC2 instance using Amazon support libraries.
All of this is extremely cumbersome, and I've been looking for a library or pattern that can do what I want. Essentially, I need to start an EC2 instance on a cron/event, deliver a unit of work to it (Shell Script, Java App, whatever), have it run it, then shutdown.
I'd love any suggestions for accomplishing this.
Look into AWS Systems Manager (SSM). You can create an Automation document that launches the instance, runs any custom scripts or tasks, and shuts it down again when you're done. You can trigger the SSM Automation on a cron schedule via CloudWatch Events.
You may also want to consider AWS Batch for this type of workload.
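For reference, starting an Automation from code might look roughly like the sketch below. The document name `RunDailyJob` and its `InstanceId` parameter are hypothetical and depend on how you write your own Automation document. The client is passed in so the function can be tested without AWS credentials.

```python
def start_job_automation(ssm_client, instance_id, document_name="RunDailyJob"):
    """Start the (hypothetical) Automation document for one instance.

    ssm_client is expected to behave like boto3.client("ssm").
    Automation parameters are passed as lists of strings.
    """
    response = ssm_client.start_automation_execution(
        DocumentName=document_name,
        Parameters={"InstanceId": [instance_id]},
    )
    return response["AutomationExecutionId"]


# Example use (requires boto3 and credentials, so it is left commented out):
#   import boto3
#   execution_id = start_job_automation(boto3.client("ssm"), "i-0123456789abcdef0")
```

With a scheduled rule targeting the Automation directly, no code like this is needed at all; it is mainly useful for manual or ad hoc runs.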
I'm moving stuff from Azure to AWS, and the only thing I'm really gonna miss is the webjobs, where I can schedule command line jobs.
I know I can achieve somewhat the same with task scheduler or windows services, but I do also like the way webjobs shows logs and that stuff...
Does anybody know a tool like that, one that can run Windows command line apps on AWS?
Check out AWS Lambda. It is a new service from AWS.
AWS Lambda is a compute service that runs your code in response to events and automatically manages the compute resources for you, making it easy to build applications that respond quickly to new information.
Lambda vs WebJobs
I am designing my first Amazon AWS project and I could use some help with the queue processing.
This service accepts processing jobs, either via an ASP.net Web API service or a GUI web site (which just calls the API). Each job has one or more files associated with it and some rules about the type of job. I want to queue each job as it comes in, presumably using AWS SQS. The jobs will then be processed by a "worker" which is a python script with a .Net wrapper. The python script is an existing batch processor that cannot be altered/customized for AWS, hence the wrapper in .Net that manages the AWS portions and passing in the correct params to python.
The issue is that we will not have a huge number of jobs, but each job is somewhat compute intensive. One of the reasons to go to AWS was to minimize infrastructure costs. I plan on having the frontend web site (Web API + ASP.net MVC4 site) run on Elastic Beanstalk. But I would prefer not to have a dedicated worker machine always online polling for jobs, since these workers need a somewhat "beefier" instance (for processing), and it would cost us a lot to have it mostly sit doing nothing.
Is there a way to only run the web portion on Beanstalk and then have the worker process only spin up if there are items in the queue? I realize I could have a micro "controller" instance always online polling and then have it control the compute spinup, but even that seems like it shouldn't be needed. Can EC2 instances be started based on a non-zero SQS queue size? So basically: the web API adds a job to the queue, something watches the queue and sees it's non-zero, this triggers the EC2 worker to start, and it spins up and polls the queue on startup. It processes the queue until it is empty, then something triggers it to shut down.
You can use Auto Scaling in conjunction with SQS to dynamically start and stop EC2 instances. There is an AWS blog post that describes the architecture you are thinking of.
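The scaling decision itself is simple enough to sketch. The code below is only an illustration of the logic, not the setup from the blog post, which may rely on alarms and scaling policies instead of custom code. The function names, the one-job-per-worker ratio and the worker cap are assumptions; the clients are expected to behave like `boto3.client("sqs")` and `boto3.client("autoscaling")`.

```python
import math


def desired_workers(queue_depth, jobs_per_worker=1, max_workers=4):
    """Return how many workers should run for a given backlog (0 when idle)."""
    if queue_depth <= 0:
        return 0
    return min(max_workers, math.ceil(queue_depth / jobs_per_worker))


def read_queue_depth(sqs_client, queue_url):
    """Count waiting plus in-flight messages so busy workers are not stopped."""
    response = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )
    attributes = response["Attributes"]
    return int(attributes["ApproximateNumberOfMessages"]) + int(
        attributes["ApproximateNumberOfMessagesNotVisible"]
    )


def reconcile(sqs_client, autoscaling_client, queue_url, group_name):
    """Set the Auto Scaling group's desired capacity from the queue backlog."""
    target = desired_workers(read_queue_depth(sqs_client, queue_url))
    autoscaling_client.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=target,
        HonorCooldown=False,
    )
    return target
```

For the group to scale down to nothing, its minimum size has to be 0. In-flight messages are included in the count so that a worker in the middle of a long job is not terminated when the visible queue drops to zero.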