Use Azure Batch for non-parallel work? Is there a better option?

I have a scenario where our Azure App Service needs to run a job every night. The job cannot scale to multiple machines: it involves downloading a large data file and performing special processing on it (which only takes a couple of minutes). Special software will also need to be installed. The computation needs a lot of memory, so I was thinking of one of the Ev-series machines. For these reasons, I cannot run the job as a web job on the Azure App Service, and I need to delegate it elsewhere.
Anyway, I have experience with Azure Batch so at first I was thinking of Azure Batch. But I am not sure this makes sense for my scenario because the work cannot scale to multiple machines. Does it make sense to have a pool with a single node and single vm on the node? When I need to do the work, an Azure web job enqueues the job, and the pool automatically sizes from 0 to 1?
Are there better options out there? I was looking to see if there are any .NET libraries to spin up a single VM, start executing work on it, and then disable the VM when done, but I couldn't find anything.

For Azure Batch, the scenario of a single VM in a single pool is valid. However, Azure Container Instances or Azure Functions would appear to be a better fit, provided you can provision the appropriate VM sizes for your workload.
As you suggested, you can combine Azure Functions/Web Jobs to enqueue the work to an Azure Batch job. If you have autoscaling set on the Azure Batch pool, or an autopool set on the job, then the work will be processed and the compute resources will be deallocated afterwards (assuming you have the correct settings in place).
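To make the 0-to-1 scaling idea concrete, here is a minimal sketch in plain Python of the decision an autoscale rule would have to make. It does not call the Azure Batch API; the function name and the `max_nodes` parameter are made up for illustration, and a real pool would express the same rule in its own autoscale settings.

```python
def target_node_count(pending_tasks: int, max_nodes: int = 1) -> int:
    """Return how many nodes the pool should have for the given backlog.

    Hypothetical helper: it only models the rule "no work, no nodes;
    any work, at most max_nodes".
    """
    if pending_tasks < 0:
        raise ValueError("pending_tasks must be non-negative")
    # Scale to zero when idle, and never exceed max_nodes
    # (1 for a job that cannot be spread over several machines).
    return min(max_nodes, pending_tasks)


# One night: nothing queued, the job is enqueued and runs, then it finishes.
backlog_over_time = [0, 1, 1, 0]
print([target_node_count(p) for p in backlog_over_time])  # prints [0, 1, 1, 0]
```

With `max_nodes=1` the pool never grows beyond a single VM, which matches the "work cannot scale to multiple machines" constraint in the question.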


Recommended way to run a web server on demand, with auto-shutdown (on AWS)

I am trying to find the best way to architect a low cost solution to provide an on-demand web server for a certain amount of time.
The context is as follows: I have some large amount of data sitting on S3. From time to time, users will want to consult that data. I've written a Flask app that can display the data in a nice way for them. Being poorly written, it really only accepts a single user session at a time. Currently, therefore, they have to download the Flask app and run it on their own machine.
I would like to find a way for users to request a cloud-based web server that would run the Flask app (through a docker container for example) on-demand, and give them access to it quickly, without having to do much if anything on their own machine.
Every user wanting to view the data would have their own web server created on demand (to avoid multiple users sharing the same web server, which wouldn't work with my Flask app).
Critically, and in order to avoid cost, the web server would terminate itself automatically after some (configurable) idle time (possibly with the Flask app informing the user that it's about to shut down, so that they can "renew" the lease).
Initially I thought that maybe AWS Fargate would be good: it can run docker instances, is quite configurable in terms of CPU/disk it can get (my Flask app is resource-hungry), and at least on paper could be used in a way that there is zero cost when users are not consulting the data (bar S3 costs naturally). But it's when it comes to the detail that I'm not sure...
How to ensure that every new user gets their own Fargate instance?
How to shut-down the instance automatically after idle time?
Is Fargate quick enough in terms of boot time?
The closest I can think of is AWS App Runner. It's built on top of Fargate and it provides an intelligent scale-out mechanism (which you are probably not interested in) as well as a scale-to-(almost)-0 capability. The way it works is that while the endpoint is receiving requests and doing work, you pay for the entire Fargate task (CPU/memory) you have selected in the configuration. If the endpoint is doing nothing, you only pay for the memory (note that the memory cost is roughly 20% of the entire cost, so it's not scale to 0 but "quasi"). Check out the pricing examples at the bottom of this page.
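As a rough illustration of the "quasi scale to 0" arithmetic, the sketch below estimates a bill from active and idle hours. The hourly rate and the hours are placeholders, not real App Runner prices; only the roughly 20% idle share comes from the paragraph above.

```python
def estimated_cost(active_hours: float, idle_hours: float,
                   full_hourly_rate: float, idle_fraction: float = 0.2) -> float:
    """Estimate cost when idle time is billed at a fraction of the full rate.

    full_hourly_rate is a hypothetical CPU+memory price per hour;
    idle_fraction=0.2 reflects the "roughly 20%" memory-only share.
    """
    active_cost = active_hours * full_hourly_rate
    idle_cost = idle_hours * full_hourly_rate * idle_fraction
    return active_cost + idle_cost


# Placeholder numbers: 20 busy hours and 700 idle hours at a made-up rate.
always_on = estimated_cost(720, 0, full_hourly_rate=0.10)
mostly_idle = estimated_cost(20, 700, full_hourly_rate=0.10)
print(f"always busy: {always_on:.2f}, mostly idle: {mostly_idle:.2f}")
```

The point is only that idle hours still cost something, which is why pausing the service matters if it sits unused most of the time.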
Please note you can further optimize costs by pausing/starting the endpoint (when it's paused you pay nothing) but in that case you need to create the logic that pauses/restarts it.
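The pause/restart logic could be built around a small idle tracker like the sketch below. It is only one possible shape: the class name and methods are hypothetical, and the actual call that pauses or stops the service is deliberately left out.

```python
import time


class IdleLease:
    """Tracks the time since the last activity and reports when the lease has run out."""

    def __init__(self, idle_timeout_s: float, clock=time.monotonic):
        self._timeout = idle_timeout_s
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_activity = clock()

    def touch(self) -> None:
        """Record activity: a handled request, or the user clicking "renew"."""
        self._last_activity = self._clock()

    def seconds_left(self) -> float:
        """Seconds remaining before the lease expires (never negative)."""
        elapsed = self._clock() - self._last_activity
        return max(0.0, self._timeout - elapsed)

    def expired(self) -> bool:
        """True once the idle timeout has fully elapsed."""
        return self._clock() - self._last_activity >= self._timeout


lease = IdleLease(idle_timeout_s=600)
lease.touch()
print(lease.expired(), round(lease.seconds_left()))
```

In the Flask app, `touch()` would be called on every request, `seconds_left()` could drive the "about to shut down" warning, and a periodic check of `expired()` would trigger whatever pauses or terminates the service.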
Another option you may want to explore is using Lambda this way (which would allow you to use the same container image and benefit from the intrinsic scale to 0 of Lambda). But your comment that "Lambda doesn’t have enough power, and the timeout is not flexible" may make this a show stopper.

How does Google Cloud Run spin up instantly

So, I really like the idea of server-less. I came across Google Cloud Functions and Google Cloud Run.
So Google Cloud Functions are individual functions. From a broad perspective, I assume Google must be securely running a huge Node.js server that contains all the functions of all Google customers and fulfils requests using unique URLs. Google takes care of the cost of this one big server and charges users for every hit their function gets. So it's pay-per-use, and that makes sense.
But when it comes to Cloud Run, I fail to understand how it works. Obviously the container must not always be running, because then they would simply charge on a monthly basis instead of a per-hit basis, just like a normal VM where a Docker image is deployed. But no, in reality they charge on a per-hit basis, which means they spin up the container when a request arrives. So I don't understand how it spins up so fast. Users have the flexibility of running any sort of environment, which means the Docker container could contain literally anything, maybe a full-fledged Linux OS. How does it load the environment so quickly and fulfil the request? Well, maybe it saves the state of the machine and shuts it down when not in use, but even then it would require a decent amount of time to restore the state.
So how does Google really do it? How is it able to spin up a customer's container in literally no time?
The idea of fast-starting sandboxed containers (that run on their own kernel for security reasons) has been around for a pretty long time. For example, Intel Clear Linux Containers and Firecracker provide fast startup through various optimizations.
As you can imagine, implementing something like this would require optimizations at many layers (scheduling, traffic serving, autoscaling, image caching...).
Without giving away Google’s secrets, we can probably talk about image storage and caching: just like how VMs use initramfs to pre-cache the state of the VM, instead of reading all the files from hard disk and following the boot sequence, we can do similar tricks with containers.
Google uses a similar solution for Cloud Run, called gVisor. It's a user-space virtualization technique (not an actual VMM or hypervisor). To run containers on a Linux-like environment, gVisor doesn't need to boot a Linux kernel from scratch (because gVisor reimplements the Linux kernel in Go!).
You’ll find many optimizations on other serverless platforms across most cloud providers (such as how to keep a container instance around, or whether you should predictively schedule inactive containers before the load arrives). I recommend reading the Peeking Behind the Curtains of Serverless Platforms paper to get an idea of what the problems in this space are and what cloud providers are trying to optimize for speed and cost.
You have to decouple the containers from the VMs. Dustin's second link is great, because if you understand the principles of Kubernetes (and even more so if you have a look at Knative), it's easy to translate them to Cloud Run.
You have a pool of resources (nodes in Kubernetes, which are in fact VMs with CPU and memory), and on these resources you can run containers: 1, 2, maybe 1000 per VM; you don't know and you don't care. The power of a container is that it can be packaged with all the dependencies it needs. Yes, I said "packaged", because your container isn't an OS; it contains the dependencies for interacting with the host OS.
To prevent any problems between containers from different projects/customers, the containers run in a sandbox (gVisor, Dustin's first link).
So there is no VM to start or stop, and no VM to create when you deploy a Cloud Run service. It's only a start of your container on existing resources. It's also for this reason that you need to have a stateless container, without disks attached to it.
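To picture "starting a container on existing resources", here is a toy sketch that places containers on nodes that are already running. First-fit placement is just an assumption chosen for simplicity; it is not a description of how Google's scheduler actually works, and all names are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An already-running VM in the pool, with its spare capacity."""
    name: str
    free_cpu: float
    free_mem_mb: int
    containers: list = field(default_factory=list)


def place(nodes, container, cpu, mem_mb):
    """First-fit: start the container on the first node with enough spare capacity."""
    for node in nodes:
        if node.free_cpu >= cpu and node.free_mem_mb >= mem_mb:
            node.free_cpu -= cpu
            node.free_mem_mb -= mem_mb
            node.containers.append(container)
            return node.name
    # No spare capacity: a real platform would grow the pool, not boot a VM per request.
    return None


pool = [Node("node-a", free_cpu=1.0, free_mem_mb=512),
        Node("node-b", free_cpu=4.0, free_mem_mb=4096)]
print(place(pool, "customer-1", cpu=1.0, mem_mb=256))  # prints node-a
print(place(pool, "customer-2", cpu=1.0, mem_mb=256))  # prints node-b
```

Nothing is booted here: placing a container is only bookkeeping on machines that already exist, which is why it can be fast.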
Do you want 3 "secrets"?
It's exactly the same thing with Cloud Functions! Your code is packaged into a container and deployed exactly as it's done with Cloud Run.
The underlying platform that manages Cloud Functions and Cloud Run is the same. That's why the behavior and the features are very similar! Cloud Functions takes longer to deploy because Google needs to build the container for you. With Cloud Run the container is already built.
Your Compute Engine instance is also managed as a container on the Google infrastructure! More generally, everything is a container at Google!

Deploying distributed queue mechanism on GCP

I have a system working on Google Cloud. This system is built on a microservices architecture. The applications communicate with each other through queues. Currently, I am using Azure Storage Queues and want to move this managed service into GCP too.
Max processing time - 1 second to 1 hour.
Ack mechanism to mark tasks as completed.
Scaleable solution - should support the load of a few thousands of messages per second.
Support pulling of messages.
Nice to have:
Handle priorities - some tasks queues are more important than others.
Managed solution - I prefer to use a managed solution rather than install it myself.
Solutions I have checked
I have already checked these services and rejected them because:
Microsoft Azure Storage Queue - Outside of Google Cloud.
Google Cloud PubSub - Does not meet a mandatory requirement: processing time is limited to 10 minutes.
Google Cloud Tasks - Seems to be mostly designed for serverless applications, so it is not suitable for my application.
I also checked RabbitMQ, which seems to support all my requirements except the 'managed solution' one. It seems that GCP does not provide RabbitMQ as a service, so I would have to deploy it on virtual machines and do all the configuration myself...
The question
To be honest, it seems like the solution that I'm looking for should be pretty common. Nothing really special. But, all of the services I described here have some critical cons.
Which service did I miss here?

Cloud computing service to run thousands of containers in parallel

Is there any provider that offers such an option out of the box? I need to run at least 1K concurrent sessions (Docker containers) of headless web browsers (Firefox) for complex UI tests. I have a Docker image that I just want to deploy and scale to 1000 1CPU/1GB instances in seconds, without spending time on maintaining a cluster of servers (I need to shut them all down after the job is done), so that I can just focus on the code. The closest thing I have found so far is Amazon ECS/Fargate, but their limits make no sense to me ("Run containerized applications in production" -> max limit: 50 tasks -> production -> ok). Am I missing something?
I think that AWS Batch might be a better solution for your use case. You define a "compute environment" that provides a certain level of capacity, then submit tasks that are run on that compute environment.
I don't think that you'll find anything that can start up an environment and deploy a large number of tasks in "one second": in my experience it takes about a minute or two ramp-up time for Batch, although once the machines are up and running they are able to sequence jobs quickly. You should also give consideration to whether it makes sense to run all 1,000 jobs concurrently; that will depend on what you're trying to get out of your tests.
You'll also need to be aware of any places where you might be throttled (for example, retrieving configuration from the AWS Parameter Store). This talk from last year's NY Summit covers some of the issues that the speaker ran into when deploying multiple-thousands of concurrent tasks.
You could use Lambda layers to run headless browsers (I know there are several implementations for Chromium/Selenium on GitHub, not sure about Firefox).
Alternatively, you could try contacting the AWS team to see how far the limit for concurrent tasks on Fargate can be increased. As you can see in the documentation, the 50-task limit is a soft limit and can be raised.
Be aware that if you start the tasks via Fargate, there is an API limit on requests per second. You need to make sure you throttle your API calls, or use the ECS Create Service.
In any case, starting 1000 tasks would require 1000 seconds, which is probably not what you expect.
Those limits are not there if you use ECS, but in that case you need to manage the cluster, so it might be a good idea to explore the Lambda option.
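The "1000 tasks, 1000 seconds" estimate above can be reproduced with a little arithmetic. The sketch assumes one launch call per second and one task per call, which is what that estimate implies; the actual rate limits and batch sizes are not stated here and should be checked in the documentation.

```python
import math


def launch_seconds(total_tasks: int, calls_per_second: float,
                   tasks_per_call: int = 1) -> float:
    """Lower bound on the time needed to submit all tasks under an API rate limit.

    All three parameters are assumptions the caller supplies; nothing here
    queries AWS for the real limits.
    """
    calls_needed = math.ceil(total_tasks / tasks_per_call)
    return calls_needed / calls_per_second


print(launch_seconds(1000, calls_per_second=1))                     # prints 1000.0
print(launch_seconds(1000, calls_per_second=1, tasks_per_call=10))  # prints 100.0
```

If the API allows several tasks per call, or the rate limit can be raised, the ramp-up shrinks proportionally, but it still will not be "one second".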

Running application on a cluster

I have my processing done using two console applications (Stage-estimate, Stage-step), each application processes files on disk, files are organized into folders. Each folder represents one step of processing which is considered completed when all files are estimated.
As an example, let's consider that we are at Step 0 and the folder 0 contains the following files:
Folder 0 contains:
We have the data files; now we need to estimate them. We run the Stage-estimate application 1000 times, which results in the following directory structure:
Folder 0 contains:
Step 0 is now complete: we have all the data/estimate pairs. In order to switch to Step 1, we run the Stage-step application 1000 times, once for every data/estimate pair, and it produces a new set of 1000 *.data files in folder 1. After the Stage-step application has completed, we have a folder 1 with the same structure as we had at Step 0:
Folder 1 contains:
From now on the process repeats until it is canceled.
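The flow described above can be sketched as a loop in which each step first estimates every file in parallel and only then advances. This is a local, single-machine sketch using a thread pool; `estimate` and `step` are placeholder callables standing in for the Stage-estimate and Stage-step applications (in practice they would probably launch the console programs).

```python
from concurrent.futures import ThreadPoolExecutor


def run_step(data_files, estimate, step, max_workers=4):
    """Run one processing step: estimate every file, then derive the next data set."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Stage-estimate: each file is independent, so the calls can run in parallel.
        estimates = list(pool.map(estimate, data_files))
        # Building the list above acts as the barrier: the step is complete
        # only once every file has its estimate.
        return list(pool.map(step, data_files, estimates))


# Toy stand-ins: the "files" are numbers and the "applications" are simple functions.
data = [1, 2, 3]
for _ in range(2):  # two iterations instead of "until it is canceled"
    data = run_step(data, estimate=lambda d: d * 10, step=lambda d, e: d + e)
print(data)  # prints [121, 242, 363]
```

On a cluster, the thread pool would be replaced by a queue and a fleet of workers, but the barrier between estimating and stepping stays the same.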
The Problem
The Stage-estimate application does some pretty heavy calculations; it consumes 99% of the overall processing power compared to the Stage-step application.
I was planning to use AWS in order to speed things up. I don't want to start inventing special batch files that would call my applications in the way described above; I know that there is special software that does the heavy lifting of scheduling processes and other cluster-related work.
I have never dealt with cluster computing. Off the top of my head, I can see that the application parallelizes really nicely and fits into the AWS infrastructure. On the other hand, I'm a complete newbie in the world of cluster computing and I don't know where to start. I have worked with AWS, but on nothing related to cluster computing, and I don't know how to organize the flow I've described or how to make it run efficiently. I would appreciate it if you could point me in the right direction or provide some links to demos / best practices.
Thank you in advance!
Based on your comment, you can put all your jobs from stage 0 into a queue and start to process it. You can also add logic that checks whether only a few jobs are left and, if so, tries to add new jobs from stage 1. This would speed up your calculation a bit and give you better resource usage, but it's optional and makes your system more complex.
I suggest you use SQS (or SWF) for storing the jobs, S3 for storing the files, and an autoscaling group of spot instances for the worker nodes.
Unfortunately, Lambda doesn't support C++ at the moment. (Node.js and Java are supported.)
AWS supports several concepts which you may consider:
Decoupling: You can use SQS (Simple Queue Service) for job queuing, which gives you a redundant and fault-tolerant job queue. You can have a fleet of worker instances which request jobs from the queue, run them and, once they are finished, delete the job from the queue. If an instance hangs or crashes during the execution of a job, the job goes back to the queue after the timeout period and another instance will execute it again.
Another service is SWF (Simple Workflow Service). This service internally uses SQS queues; with it, you may need less scripting to glue your entire workflow together.
Redundant storage: I would definitely use AWS S3 for storage, because it's cheap and redundant. After the first read, I don't think you need any advanced (file-system-like) features (for example, locking).
Spot instances: For the worker nodes, I would use Spot instances, which are much cheaper. They are only an issue if you need a really fast answer for your task all the time. (If you are generating daily reports, spot instances are a perfect solution.)
+1: You may use an AWS Lambda function to run your jobs. You can trigger your Lambda function based on S3 events, for example when a new *.data file is uploaded. However, Lambda functions cannot run for too long. But if you are able to use a Lambda function, then your entire environment will contain only S3 buckets and Lambda functions. Both are AWS managed services, so your system would be extremely flexible and fault tolerant. I can't give any exact details about pricing, but I assume it would be cheaper than running EC2 instances.
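The receive/delete/timeout cycle from the decoupling point can be illustrated with a toy in-memory queue. This is not the SQS API; the class and its methods are invented for the example, and time is passed in explicitly so the behavior is easy to follow.

```python
import itertools


class VisibilityQueue:
    """Toy queue: a received job is hidden for a while and reappears unless deleted."""

    def __init__(self, visibility_timeout: float):
        self._timeout = visibility_timeout
        self._jobs = {}          # job_id -> payload
        self._hidden_until = {}  # job_id -> time at which the job is visible again
        self._ids = itertools.count(1)

    def send(self, payload):
        job_id = next(self._ids)
        self._jobs[job_id] = payload
        self._hidden_until[job_id] = 0.0
        return job_id

    def receive(self, now: float):
        """Hand out the first visible job and hide it for the timeout period."""
        for job_id, payload in self._jobs.items():
            if self._hidden_until[job_id] <= now:
                self._hidden_until[job_id] = now + self._timeout
                return job_id, payload
        return None

    def delete(self, job_id):
        """Called by a worker once the job has finished successfully."""
        self._jobs.pop(job_id, None)
        self._hidden_until.pop(job_id, None)


queue = VisibilityQueue(visibility_timeout=30)
job = queue.send("0001.data")        # hypothetical file name
print(queue.receive(now=0))          # worker A takes the job
print(queue.receive(now=10))         # prints None: the job is hidden while A works
print(queue.receive(now=30))         # A crashed without deleting, so B gets the job
queue.delete(job)                    # B finishes and removes it for good
```

A worker that crashes never calls `delete`, so the job simply becomes visible again, which is the fault tolerance described above.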
Summary: If you can run your estimations in parallel, AWS gives you a lot of power and speed (for good money), especially if your load changes during the day.
A good source: White Paper on ‘Cloud Architectures’ and Best Practices of Amazon S3, EC2, SimpleDB, SQS