Here is my case:
When my server receives a request, it triggers distributed tasks, in my case many AWS Lambda functions (the peak could be around 3000).
I need to track each task's progress / status, i.e. pending, running, success, error.
My server could have many replicas.
I still want to know the tasks' progress / status even if any of my server replicas goes down.
My current design:
I chose AWS S3 as my helper.
When a task starts to execute, it creates a marker file in a special folder on S3, e.g. a running folder.
When the task fails or succeeds, it moves the marker file from the running folder to a fail folder or a success folder.
I check the marker files on S3 to track the progress of the tasks.
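For concreteness, a minimal sketch of this marker-file scheme in Python with boto3 (the bucket name and folder layout here are assumptions, not my actual setup):

```python
import boto3

BUCKET = "task-status-bucket"  # assumed bucket name
s3 = boto3.client("s3")

def mark_running(task_id: str) -> None:
    # An empty object under running/ marks the task as started.
    s3.put_object(Bucket=BUCKET, Key=f"running/{task_id}", Body=b"")

def mark_done(task_id: str, succeeded: bool) -> None:
    # S3 has no real "move": copy to the destination folder, then delete the original.
    dest = "success" if succeeded else "fail"
    s3.copy_object(
        Bucket=BUCKET,
        Key=f"{dest}/{task_id}",
        CopySource={"Bucket": BUCKET, "Key": f"running/{task_id}"},
    )
    s3.delete_object(Bucket=BUCKET, Key=f"running/{task_id}")

def count_markers(prefix: str) -> int:
    # Progress check: count marker objects under running/, success/, or fail/.
    paginator = s3.get_paginator("list_objects_v2")
    return sum(
        page.get("KeyCount", 0)
        for page in paginator.paginate(Bucket=BUCKET, Prefix=f"{prefix}/")
    )
```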
The problems:
There is a limit on how many requests per second S3 will serve per prefix.
My case is likely to exceed that limit some day.
Attempt Solutions:
I have tried my best to reduce the number of requests to S3.
I don't want to track progress by storing data in my DB, because my DB is already under a heavy workload.
To be honest, it is kind of weird to use marker files on S3 to track the progress of tasks. However, it has worked so far.
Are there any recommendations?
Thanks in advance!
This sounds like a perfect application of persistent event queueing, specifically Kinesis. As each Lambda starts, it generates a "starting" event on Kinesis. When it succeeds or fails, it generates the appropriate event. You could even create progress events along the way if you want to see how far they have gotten.
Your server can then monitor the number of starting events against ending events (success or failure) until these two numbers are equal. It can query the error events to see which processes failed and why. All servers can query the same events without disrupting each other, and any server can go down and recover without losing data.
Make sure to put an origination key on events that are supposed to be grouped together so they don't get mixed up with a subsequent batch of events. Also, give each Lambda its own key so you can trace progress per Lambda. GUIDs are perfect for this.
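As a rough illustration of this answer (the stream name, event fields, and handler wiring below are assumptions, not a definitive implementation), each Lambda could emit its status records to Kinesis like this:

```python
import json
import uuid
import boto3

kinesis = boto3.client("kinesis")
STREAM = "task-status-events"  # assumed stream name

def emit_event(origination_key: str, lambda_id: str, status: str, detail: str = "") -> None:
    event = {
        "origination_key": origination_key,  # groups all Lambdas triggered by one request
        "lambda_id": lambda_id,              # per-Lambda GUID for tracing
        "status": status,                    # "starting", "success", "error", ...
        "detail": detail,
    }
    kinesis.put_record(
        StreamName=STREAM,
        Data=json.dumps(event).encode(),
        PartitionKey=origination_key,
    )

def handler(event, context):
    lambda_id = str(uuid.uuid4())
    origination_key = event["origination_key"]  # assumed to be passed in by the server
    emit_event(origination_key, lambda_id, "starting")
    try:
        ...  # the actual task
        emit_event(origination_key, lambda_id, "success")
    except Exception as exc:
        emit_event(origination_key, lambda_id, "error", str(exc))
        raise
```

The server side then only has to count "starting" events against "success"/"error" events per origination key.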
Trying to optimize our application by doing batch pulling. Pub/Sub seems to allow asynchronously pulling one message at a time with different client nodes, but is there no way for a single node to do a batch pull from Pub/Sub?
Streaming Pull and the Pull RPC both only seem to allow the subscriber to consume one message at a time. Right now, it looks like we would have to pull one message at a time and do application-level batching.
Any insight would be helpful. Pretty new to GCP in general.
The underlying pull and streaming pull operations can receive batches of messages in the same response. The Cloud Pub/Sub client library, which uses streaming pull, breaks these batches apart and hands them to the provided user callback one at a time. Therefore, you need not worry about optimizing the underlying receiving of messages.
If your concern is optimizing the subscriber code at the application level, e.g., you want to batch writes into a database, then you have a couple of options:
Use Pull directly, which allows one to process all of the messages in a batch at a time. Note that using pull effectively requires keeping many pull requests outstanding simultaneously and immediately replacing each request that returns with a new one; a sketch of a single pull-and-ack cycle follows this list.
In your user callback, re-batch messages and once the batch reaches a desired size (or you've waited a sufficient amount of time to fill the batch), process all of the messages together and then ack them.
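A minimal sketch of the first option (the project, subscription, batch size, and process_batch helper are assumptions): the synchronous Pull API hands back a whole batch per response, which you can process and ack together.

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "my-sub")  # assumed names

def process_batch(payloads):
    ...  # application-level batch processing, e.g. one bulk database write

def pull_one_batch(max_messages: int = 100) -> None:
    response = subscriber.pull(
        request={"subscription": subscription, "max_messages": max_messages}
    )
    if not response.received_messages:
        return
    # Handle the whole batch together...
    process_batch([m.message.data for m in response.received_messages])
    # ...then ack everything in one call.
    subscriber.acknowledge(
        request={
            "subscription": subscription,
            "ack_ids": [m.ack_id for m in response.received_messages],
        }
    )
```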
You could probably implement that using Dataflow (Apache Beam). You can have a running streaming job where you group, window, and transform messages according to your requirements. The results of the processing can be saved in batches or streamed further. It probably makes sense when the number of messages is really big.
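A rough sketch of that idea (the subscription path, window size, and print-only sink are assumptions): a streaming Beam job that windows Pub/Sub messages into fixed one-minute batches before processing them as groups. In a real pipeline you would key by something meaningful instead of a single shared key.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms import window

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-sub")  # assumed path
        | "Window" >> beam.WindowInto(window.FixedWindows(60))        # 60-second windows
        | "SingleKey" >> beam.Map(lambda msg: (None, msg))            # key everything together
        | "Batch" >> beam.GroupByKey()                                # one group per window
        | "Process" >> beam.Map(lambda kv: print(f"batch of {len(kv[1])} messages"))
    )
```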
We've got a little Java scheduler running on AWS ECS. It's doing what cron used to do on our old monolith: it fires up (Fargate) tasks in Docker containers. We've got a task that runs every hour, and it's quite important to us. I want to know if it crashes or fails to run for any reason (e.g. the Java scheduler fails, or someone turns the task off).
I'm looking for a service that will alert me if it's not notified. I want to call the notification system every time the script runs successfully. Then, if the alert system doesn't get the "OK" notification as expected, it shoots off an alert.
I figure this kind of service must exist, and I don't want to reinvent the wheel trying to build it myself. I guess my question is: what's it called? And where can I go to get that kind of thing? (We're using AWS, obviously, and we've got a PagerDuty account.)
We use this approach for these types of problems. First, the task has to write a timestamp to a file in S3 or EFS. This file is the external evidence that the task ran to completion. Then you need an HTTP-based service that will read that file and check whether the timestamp is valid, i.e. has been updated in the last hour. This could be a simple PHP or Node.js script. This process is exposed to the public web, e.g. https://example.com/heartbeat.php. The script returns an HTTP response code of 200 if the timestamp file is present and valid, or a 500 if not. Then we use StatusCake to monitor the URL and notify us via its PagerDuty integration if there is an incident. We usually include a message in the response so a human can see the nature of the error.
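A minimal sketch of that heartbeat check, written in Python rather than PHP/Node.js (the bucket, key, and one-hour threshold are assumptions): the task writes a Unix timestamp to the file, and the endpoint returns 200 only if it is fresh.

```python
import time
import boto3
from flask import Flask

app = Flask(__name__)
s3 = boto3.client("s3")
BUCKET = "ops-heartbeats"          # assumed bucket
KEY = "hourly-task/last_run.txt"   # the task writes a Unix timestamp here
MAX_AGE_SECONDS = 3600             # the task is expected to run every hour

@app.route("/heartbeat")
def heartbeat():
    try:
        body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
        last_run = float(body.decode().strip())
    except Exception as exc:
        return f"heartbeat file unreadable: {exc}", 500
    age = time.time() - last_run
    if age > MAX_AGE_SECONDS:
        return f"stale heartbeat: last run {int(age)}s ago", 500
    return "OK", 200
```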
This may seem tedious, but it is foolproof. Any failure anywhere along the line will be immediately notified. StatusCake has a great free tier. This approach can be used to monitor any critical task in the same way. We've learned the hard way that critical cron-type tasks and processes can fail for any number of reasons, and you want to know before it becomes customer-critical. 24x7x365 monitoring of these types of tasks is necessary, and it helps us sleep better at night.
Note: We always have a daily system test event that triggers a PagerDuty notification at 9 am each day. For the truly paranoid, this assures that PagerDuty itself has not failed in some way, e.g. through misconfiguration. Our support team knows that if they don't get a test alert each day, there is a problem in the notification system itself. The tech on duty has to acknowledge the incident as per SOP. If they do not acknowledge it, it escalates to the next tier, and we know we have to have a talk about response times. It keeps people on their toes. This is the final piece to ensure you have a robust monitoring infrastructure.
OpsGenie has a heartbeat service, which is basically a watchdog timer. You can configure it to alert you if you don't ping them within x number of minutes.
Unfortunately I would not recommend them. I have been using them for 4 years and they have changed their account system twice and left my paid account orphaned silently. I have to find a new vendor as soon as I have some free time.
I have a fleet of multiple worker hosts polling for the following activity tasks of my SWF workflow:
Activity 1: Perform some business logic to create a large file.
Activity 2: Wait for some time (a human approval, timer, etc.)
Activity 3: Transmit the file using some protocol (governed by input parameters of the SWF).
Activity 4: Clean up the locally generated file.
The file generated in Activity 1 needs to be used again in Activity 3, and then eventually discarded at the end of the workflow.
The system would work fine if there were only one host polling for all tasks. However, when I have multiple workers, I cannot seem to ensure that Activity 1 and Activity 3 end up on the same host.
I would like to avoid doing the following:
Uploading the file to a central repository (say S3) in Activity 1 and downloading it in Activity 3; or
Having a single activity that combines Activity 1 and Activity 3.
I have the following questions:
Is it possible to control that subsequent activities be run on the same host as opposed to going to any random host in my fleet?
What are specific guidelines/best practices on re-using resources generated in different activities in a workflow?
Is it possible to control that subsequent activities be run on the same host as opposed to going to any random host in my fleet?
Yes, absolutely. The basic idea is that SWF task lists (the queues used to deliver activity tasks) are dynamic, so each host can have its own task list, and the workflow can specify a specific task list name when scheduling an activity. See the fileprocessing sample, which executes the download activity on any host from the pool, then converts the file and uploads the result on the same host as the first one.
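A rough sketch of the host-specific task list idea with boto3 (the domain, activity names, and how the host name reaches the decider are assumptions; the fileprocessing sample mentioned above is the authoritative version):

```python
import socket
import boto3

swf = boto3.client("swf")
DOMAIN = "file-processing"             # assumed SWF domain
HOST_TASK_LIST = socket.gethostname()  # each worker also polls a task list named after its host

# Worker side: polling the host-specific task list means only this machine receives the task.
task = swf.poll_for_activity_task(
    domain=DOMAIN,
    taskList={"name": HOST_TASK_LIST},
    identity=HOST_TASK_LIST,
)

# Decider side: once "CreateFile" completes (with the worker's host task list reported back,
# e.g. in the activity result), schedule "TransmitFile" on that same per-host task list so it
# runs where the file already sits on local disk.
def schedule_transmit(decision_task_token: str, host_task_list: str, file_path: str) -> None:
    swf.respond_decision_task_completed(
        taskToken=decision_task_token,
        decisions=[{
            "decisionType": "ScheduleActivityTask",
            "scheduleActivityTaskDecisionAttributes": {
                "activityType": {"name": "TransmitFile", "version": "1.0"},
                "activityId": "transmit-1",
                "taskList": {"name": host_task_list},  # dynamic, per-host task list
                "input": file_path,
            },
        }],
    )
```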
What are specific guidelines/best practices on re-using resources generated in different activities in a workflow?
The approach of caching the result in the worker process memory or on the local disk is considered the best practice. Sometimes using an external data store and fetching the data each time also makes sense.
I have a function doWork(id) that I'm offloading to some worker servers using AWS SQS. This function can get called very frequently, but I'd like to throttle it so that for a given id, the work is done no more than once per second.
Is this possible with AWS / are there any services that offer this functionality?
EDIT: Some clarification.
doWork(id) does some expensive work on a record in a database. This work needs to be continuously updated whenever the user interacts with the record. Thus, I call doWork(id) whenever the user calls a method that edits the record. However, the user may edit the record many times very quickly (I'm building a text editor, so every character is an edit). Rather than calling doWork(id) an unnecessary number of times, I'd like to throttle that work so it happens at most once per second.
Because this work is expensive, I enqueue a message in SQS and have a set of "worker" servers that dequeue tasks and run them.
My goal here is to somehow maintain the stateless horizontal scalability of my servers while throttling doWork(id). To make matters a little more complicated, I don't want to throttle the doWork function itself -- I want to throttle the work for each individual record identified by the id passed to doWork.
You could use a Redis instance on ElastiCache and configure your workers to use a distributed rate limiter for keys based on id. There are also many packages for different languages based on this kind of idea that might be ready to run on your workers.
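A minimal sketch of that idea (the ElastiCache endpoint and key naming are assumptions): a per-id limiter using Redis SET with NX and EX, so the first caller in any one-second window for a given id does the work and later callers within that second skip it.

```python
import redis

r = redis.Redis(host="my-elasticache-endpoint", port=6379)  # assumed endpoint

def do_work(record_id: str) -> None:
    ...  # the expensive doWork(id) from the question

def maybe_do_work(record_id: str) -> None:
    # NX: only set if the key does not already exist; EX=1: expire after one second.
    if r.set(f"ratelimit:dowork:{record_id}", "1", nx=True, ex=1):
        do_work(record_id)
    # else: some worker already ran doWork for this id within the last second
```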
That's interesting. You want to delay the work in case they hit another key within a given time period. If they don't hit another key in that time period, you then want to do the work. You might also want to do it after x seconds even if they continue typing (auto-save).
The problem is that each keypress sends a message to the queue. When a worker receives the message, they have no idea whether another key has been pressed since the message was sent, and there's no way to look in the queue for other matching messages.
Amazon SQS does have the ability to delay a message, which means it will not be available for receiving for a given period, but this alone can't solve the problem because the worker doesn't know what else has happened.
Bottom line: A traditional queue is not a suitable mechanism for this use-case. You need something akin to a database/cache that can update a "last modified" timestamp each time a key is pressed. Once that timestamp is more than x seconds old, you can queue the work for a worker.
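A rough sketch of that pattern (the queue URL, Redis endpoint, and one-second quiet period are assumptions): every edit refreshes a last-modified timestamp and enqueues a delayed check, and the worker only does the expensive work if no newer edit arrived in the meantime.

```python
import json
import time
import boto3
import redis

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/dowork-queue"  # assumed
DEBOUNCE_SECONDS = 1

sqs = boto3.client("sqs")
r = redis.Redis(host="my-elasticache-endpoint", port=6379)  # assumed endpoint

def do_work(record_id: str) -> None:
    ...  # the expensive doWork(id) from the question

def on_edit(record_id: str) -> None:
    """Called on every edit/keystroke."""
    edited_at = time.time()
    r.set(f"last_edit:{record_id}", edited_at)
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"id": record_id, "edited_at": edited_at}),
        DelaySeconds=DEBOUNCE_SECONDS,  # message only becomes visible after the quiet period
    )

def on_message(body: str) -> None:
    """Worker callback for each dequeued message."""
    msg = json.loads(body)
    raw = r.get(f"last_edit:{msg['id']}")
    last_edit = float(raw) if raw else 0.0
    if last_edit > msg["edited_at"]:
        return              # a newer edit exists; its own delayed message will handle the work
    do_work(msg["id"])      # no newer edit arrived, so do the expensive work now
```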
I'm new to using AWS, so any pointers would be appreciated.
I have a need to process large files using our in-house software.
It takes about 2GB of input and generates 5GB of output, running for 2 hours on a c3.8xlarge.
For now I do it manually, starting an instance (either on-demand or via spot request), but now I want to reliably automate and scale this processing. What are good frameworks, platforms, or Amazon services to do that?
Especially regarding the possibility that a spot instance will be terminated half-way through (I'll need to detect that and restart the job).
I've heard about Python Celery, but does it work well with Amazon and spot instances?
Or are there other recommended mechanisms?
Thank you!
This is somewhat opinion-based, but you can mix and match some of the AWS pieces to make this easier:
put the input data on S3
push an entry into an SQS queue indicating a job needs to be processed, with a long visibility timeout
set up an auto-scaling policy based on the SQS queue depth, with your machine description in CloudFormation
use UserData/cloud-init to set up the machine and start your application
write code to receive the queue entry, start processing, finish processing, then delete the SQS message
the code should check for another queued entry; if there is none, the code should terminate the machine (a sketch of this worker loop follows)
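A minimal sketch of the worker side of those last two steps (the queue URL and run_job contents are assumptions): long-poll SQS, process each job, delete the message, and terminate the instance once the queue is empty.

```python
import urllib.request
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # assumed

sqs = boto3.client("sqs")
ec2 = boto3.client("ec2")

def run_job(job_description: str) -> None:
    ...  # download the input from S3, run the in-house software, upload the output

def main() -> None:
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,   # long poll
        )
        messages = resp.get("Messages", [])
        if not messages:
            break                 # queue is empty: shut this machine down
        msg = messages[0]
        run_job(msg["Body"])      # the queue's visibility timeout must exceed the ~2 h runtime
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    # Terminate this instance (IMDSv1 shown for brevity; IMDSv2 would need a session token).
    instance_id = urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).read().decode()
    ec2.terminate_instances(InstanceIds=[instance_id])

if __name__ == "__main__":
    main()
```

If a spot instance is reclaimed mid-job, the message's visibility timeout eventually expires and the job becomes visible again for another worker to pick up.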