Routing an activity task to a specific worker in the SWF fleet - amazon-web-services

I have a fleet of multiple worker hosts polling for the following tasks of my SWF:
Activity 1: Perform some business logic to create a large file.
Activity 2: Wait for some time (a human approval, timer, etc.)
Activity 3: Transmit the file using some protocol (governed by input parameters of the SWF).
Activity 4: Clean-up the local-generated file.
The file generated in Step-1 needs to be used again in Step-3, and then eventually discarded at the end of the workflow.
The system would work fine if there is only 1 host polling for all tasks. However, when I have multiple workers, I cannot seem to ensure that task-1 and task-3 would end up on the same host.
I would like to avoid doing the following:
Uploading the file to a central repository (say S3) on step-1 and download it in step-3; or
Having a single activity for the task-1 and task-3.
I have the following questions:
Is it possible to control that subsequent activities be run on the same host as opposed to going to any random host in my fleet?
What are specific guidelines/best practices on re-using resources generated in different activities in a workflow?

Is it possible to control that subsequent activities be run on the
same host as opposed to going to any random host in my fleet?
Yes, absolutely. The basic idea is that SWF task lists (queues used to deliver activity tasks) are dynamic. So each host can have its own task list and workflow can specify specific task list name when calling an activity. See fileprocessing sample which executes download activity on any host from the pool, then converts the file and uploads the result on the same host as the first one.
List item What are specific guidelines/best practices on re-using resources generated in different activities in a workflow?
The approach of caching result in the worker process memory or on the local disk is considered the best practice. Sometimes using external data store and getting it each times also makes sense.

Related

How does Cloud Run scaling down to zero affect long-computation jobs or external API requests?

I'm new to using Cloud Run and the idea of scaling down to zero is very appealing to me, but I have question about a few scenarios about its usage:
If I have a Cloud Run instance querying an external API endpoint, would the instance winds down while waiting for the response if no additional requests come in (i.e. I set the query time out to 60min, and no requests are received in that 60 min)?
If the Cloud Run instance is running computation that lasts for longer than 24 hour, or perhaps even days, without receiving requests, could it be trusted to carry out the computation until it's done without being randomly shutdown or restarted for servicing or other purposes (I ask this because Cloud Run is primarily intended as for stateless applications, but I have infrequent computation jobs that may take a long time that may be considered "stateful" in short-term context).
Does CPU utilization impact auto-scaling (e.g. if I have a computationally intensive job not configured for distributed computing running on one instance, would this trigger Cloud Run to spawn additional instances?)
If you deep dive in the documentation, I'm quite sure that you can find your answers. So, here a summary
(Interesting read).The Cloud Run instances are shut down only when they aren't in used (usually 15 minutes (can change at any time, no commitment, only observations) without request handling). In your case, if you are in a request handling context, no worries, your instance won't be killed, it is in use! Note: don't send an HTTP response before the end of the processing. Background process/jobs aren't considered in a request context. The context is considered from the receipt of the request to the response (OK or KO) back. Partial response/streaming is accepted.
Cloud run instance can, potentially, live more than 24h, but nothing is guaranteed. And, because the request handling is limited to 1h, you can't run process longer that that. I recommend you to have a look to GKE autopilot or to run a container on a Compute Engine and stop the VM at the end of the processing to save resources and money (or a hack to run your container on AI PLatform custom training; even if you train nothing, you run a custom container on a serverless platform!). If you can, I recommend you to design your workload to be split in several small and parallelizable jobs
Yes, it's described here. But keep in mind that only 1 request is processed on one instance. If you send a request that trigger an intensive compute job, the request will be only processed on the same instance (that can have several CPUs if your workload is compliant with that). And if another request comes in during the intensive processing, another Cloud Run instance will be spawn to handle it; only the new request.

Creating a scalable and fault tolerant system using AWS ECS

We're designing C# scheduled task (runs every few hours) that will run on AWS ECS instances that will grab batched transaction data for thousands of customers from an endpoint, modify the data then send it on to another web service. We will be maintaining the state of the last successful batch in a separate database (using some like created date of the transactions). We need the system to be scalable so as more customers are added we add additional ECS containers to process the data.
There are the options we're considering:
Each container only processes a specific subset of the data. As more customers are added more contains are added. We would need to maintain a logical separation of what contains are processing what customers data.
All the containers process all of the customers. We use some kind of locking flags on the database to let other processes know that the customers data is being processed.
Some other approach.
I think that option 2 is probably the best, but it adds a lot of complexity regarding the locking and unlocking of customers. Are there specific design patterns I could be pointed towards if that if the correct solution?
In both scenarios an important thing to consider is retries in case processing for a specific customer fails. One potential way to distribute jobs across a vast number of container with retries would be to use AWS SQS.
A single container would run periodically every few hours and be the job generator. It would create one SQS queued item for each customer that needs to be processed. In response to items appearing in the queue a number of "worker" containers would be spun up by ECS to consume items from the queue. This can be made to autoscale relative to the number of items in the queue to quickly spin up many containers that can work in parallel.
Each container would use its own high performance concurrent poller similar to this (https://www.npmjs.com/package/squiss) to start grabbing items from the queue and processing them. If a worker failed or crashed due to a bug then SQS will automatically redeliver and dropped queued items that worker had been working on to a different worker after they time out.
This approach would give you a great deal of flexibility, and would let you horizontally scale out the number of workers, while letting any of the workers process any jobs from the queue that it grabs. It would also ensure that every queued item gets processed at least once, and that none get dropped forever in case something crashes or goes wrong.

How to track distributed tasks progress

Here is my case:
When my server receieve a request, it will trigger distributed tasks, in my case many AWS lambda functions (the peek value could be 3000)
I need to track each task progress / status i.e. pending, running, success, error
My server could have many replicas
I still want to know about the task progress / status even if any of my server replica down
My current design:
I choose AWS S3 as my helper
When a task start to execute, it will create marker file in a special folder on S3 e.g. running folder
When the task fail or success, it will move the marker file from running folder to fail folder or success folder
I check the marker files on S3 to check the progress of the tasks.
The problems:
There is a limit for AWS S3 concurrent access
My case is likely to exceed the limit some day
Attempt Solutions:
I had tried my best to reduce the number of request to S3
I don't want to track the progress by storing data in my DB because my DB has already been under heavy workload.
To be honest, it is kind of wierd that using marker files on S3 to track progress of the tasks. However, it worked before.
Is there any recommendations ?
Thanks in advance !
This sounds like a perfect application of persistent event queueing, specifically Kinesis. As each Lambda starts it generates a “starting” event on Kinesis. When it succeeds or fails, it generates the appropriate event. You could even create progress events along the way if you want to see how far they have gotten.
Your server can then monitor the number of starting events against ending events (success or failure) until these two numbers are equal. It can query the error events to see which processes failed and why. All servers can query the same events without disrupting each other, and any server can go down and recover without losing data.
Make sure to put an Origination Key on events that are supposed to be grouped together so they don't get mixed up with a subsequent event. Also, each Lambda should be given its own key so you can trace progress per Lambda. Guids are perfect for this.

Akka design: How to add/remove routee from cluster aware router dynamically

I have the following use case and I am not sure if the akka toolkit provide this out of the box:
I have a number of nodes (instance/machine) that can run a finite number of long running task in the background and cannot accept more work while at max capacity.
Each instance can only process 50 tasks.
All instances are behind a load balancer.
Each task can respond to messages from the client who initiated the task, since the client sends the messages via the load balancer the instances need to route it to the correct instance that handles the task.
I have tried initially cluster sharding, but there doesn't seem to be a way to cap the maximum number of shard regions/actors per node (= #tasks).
Then I tried it with a cluster aware router, which acts as a guard for accepting or rejecting work. This seems to work reasonable well, one problem is that once it reaches capacity I need to remove it as a routee and add it back once it has capacity again.
Is there something out of the box that supports this use case or should I carry on with the routing option and if so how can I achieve this?
I'll update the description if you have further questions or something is unclear.
Your scenario sounds like a good fit for the work pulling pattern. The gist of this pattern is:
A master actor coordinates units of work among a number of worker actors.
Workers register themselves to the master, meaning that workers can be added or removed dynamically.
When the master receives work to be done, the master notifies the workers that work is available. Workers pull units of work when they're ready, do what needs to be done with their respective units of work, then ask the master for more work when they're finished.
To learn more about this pattern, read the following (the first two links are listed in the Akka documentation):
The original post (by Derek Wyatt): http://letitcrash.com/post/29044669086/balancing-workload-across-nodes-with-akka-2
A follow-on post (by Michael Pollmeier): http://www.michaelpollmeier.com/akka-work-pulling-pattern
An application of the pattern in a clustered environment with a cluster-aware router (by Ryan Tanner): https://www.conspire.com/blog/2013/10/akka-at-conspire-part-5-the-importance-of/

(AWS SWF) Is there a way to get a list of all activity workers listening on a particular tasklist?

In our beta stack, we have a single EC2 instance listening to a tasklist. Sometimes another developer in the team start's his own instance for testing purposes and forget to turn it off. This creates problems for the next developer who tries to start an activity only for it to be taken up by the last developer's machine. Is there a way to get the hostnames of all activity workers listening to a particular tasklist ?
It is not currently possible to get a list of pollers waiting on a task list through the SWF API. The workaround is to look at the identity field on the ActivityExecutionStarted event after it was picked up by the wrong worker.
One way to avoid this issue is always use a task list name that is specific to a machine or developer to avoid collisions.