Using AWS Lambda to periodically monitor state of a remote resource

I need to implement the following feature in the backend on AWS:
- an API endpoint which allows a user to start a particular long-running "process" in a remote system
- the process status in this remote system should be polled every few seconds, and when status == complete, an action should be triggered (the remote system does not support sending notifications or callbacks)
We primarily use Lambda functions, so I'm thinking about approaching it in the following way:
- my endpoint, triggered by the user, would call the remote system to start the process, store a record in an internal DB, and send a message to SQS (with a delivery delay of X seconds)
- a second Lambda would read messages from SQS and check the status of the process in the remote system. When status == complete, it triggers the action; when status != complete, it sends another delayed message to SQS, which the same Lambda picks up after X seconds to repeat the check, and so on (a minimal sketch of this loop follows)
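For reference, the polling Lambda in this loop might look roughly like this (the queue URL is a placeholder; check_remote_status and trigger_action stand in for the real calls):

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/process-status"  # placeholder

def poller_handler(event, context):
    for record in event["Records"]:
        body = json.loads(record["body"])
        status = check_remote_status(body["processId"])  # hypothetical call to the remote system
        if status == "complete":
            trigger_action(body["processId"])  # hypothetical follow-up action
        else:
            # re-enqueue with a delay so this same Lambda re-checks later
            sqs.send_message(
                QueueUrl=QUEUE_URL,
                MessageBody=record["body"],
                DelaySeconds=30,  # "X seconds"; SQS supports delays up to 900 seconds
            )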
I'm wondering if there is a better solution or better-suited tools to implement this kind of monitoring/notification pattern on AWS, since I'm not that familiar with all the services AWS provides.
Would anyone comment on this approach and perhaps suggest an alternative if there is one?

Take a look at AWS Step Functions, which I think is the best fit for your use case.
All you need to do is, instead of generating an SQS message, start an execution of a state machine in Step Functions.
The following tutorial explains an iterator loop with a counter, but you can use the same logic to check the status and keep looping until status == complete:
https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-create-iterate-pattern-section.html
Another useful resource that I think is very close to your use case:
https://docs.aws.amazon.com/step-functions/latest/dg/sample-project-job-poller.html
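To connect this to your existing flow: instead of sqs.send_message, your endpoint Lambda would start an execution. A minimal boto3 sketch (the state machine ARN is hypothetical; the machine itself would implement the wait/check/choice loop from the links above):

import json
import boto3

sfn = boto3.client("stepfunctions")

def start_monitoring(process_id):
    sfn.start_execution(
        # hypothetical ARN of a state machine implementing the poll loop
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:process-poller",
        input=json.dumps({"processId": process_id}),
    )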

Related

OpenTelemetry Lambda Layer

Is there any way to reduce the events dropped by the Lambda layer? It keeps dropping traces before they reach the central collector. Before exporting the traces, it fetches a token so it can send them to the central collector with authorization, but the traces never get pushed because they are dropped once the Lambda function's execution has finished.
Lambda Extension Layer Reference: https://github.com/open-telemetry/opentelemetry-lambda/tree/main/collector
Exporter Error:
Exporting failed. No more retries left. Dropping data.
{
    "kind": "exporter",
    "data_type": "traces",
    "name": "otlp",
    "error": "max elapsed time expired rpc error: code = DeadlineExceeded desc = context deadline exceeded",
    "dropped_items": 8
}
I encountered the same problem and did some research.
Unfortunately, it is a known issue that has not yet been resolved in the latest version of AWS Distro for OpenTelemetry Lambda (ADOT Lambda).
Github issue tickets:
Spans being dropped github.com/aws-observability/aws-otel-lambda issue #229
ADOT collector sends response to Runtime API instead of waiting for sending traces github.com/open-telemetry/opentelemetry-lambda issue #224
The short answer: currently the OTel collector extension does not work reliably, as it gets frozen by the Lambda environment while it is still sending data to the exporters. As a workaround, you can send the traces directly to a collector running outside the Lambda container.
The problem is:
- the lambda sends the traces to the collector extension process during its execution
- the collector queues them for sending on to the configured exporters
- the collector extension does not wait for the collector to finish processing its queue before telling the lambda environment that the extension is done; instead, it always immediately reports itself done, without looking at what the collector is doing
- when the lambda is done, the extension is already done, so the lambda container is frozen until the next lambda invocation
- the container is thawed when the next lambda invocation arrives. If the next invocation comes soon and takes long enough, the collector may be able to finish sending the traces to the exporters. If not, the connection to the backend system times out before sending is complete.
What complicates the solution is that it is very hard for an extension to detect whether the main lambda has finished processing.
Ideally, a telemetry extension would:
1. Wait for the lambda to finish processing
2. Check if the lambda sent it any data to process and forward
3. Wait for all processing and forwarding to complete (if any)
4. Signal to the lambda environment that the extension is done
The lambda extension protocol doesn't tell the extension when the main lambda has finished processing (it would be great if AWS could add that to the extension protocol as a new event type).
There is a proposed PR that tries to work around this by assuming that lambdas always send traces, so instead of waiting for the lambda to complete, it waits for a TCP request to the OTLP receiver to arrive. This works, but it makes the extension hang forever if the lambda never sends any traces.
Note: the same problem that we see here for traces also exists for metrics.
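To illustrate the workaround (exporting directly to a collector running outside the Lambda container and flushing before the handler returns), here is a minimal sketch using the OpenTelemetry Python SDK; the collector endpoint is a hypothetical placeholder:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# export straight to an external collector instead of the extension
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="collector.example.com:4317"))  # hypothetical endpoint
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def lambda_handler(event, context):
    with tracer.start_as_current_span("handler"):
        ...  # business logic
    # flush synchronously so spans leave the container before it is frozen
    provider.force_flush()
    return {"statusCode": 200}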

How can I track the progress/status of an asynchronous AWS Lambda invocation?

I have an API which I use to trigger AWS Lambda jobs. Upon request, the API invokes an AWS Lambda job with InvocationType='Event'. Afterwards, I want to periodically poll to see whether the AWS Lambda job has finished.
The way that would fit best to my architecture, is to store an identifier of the Lambda job in a database and periodically check if the job is finished and what its output is. However, I was not able to find how I can do this.
How can I periodically poll for the result of an AWS Lambda job, and view the output once it has finished?
I have looked into using InvocationType='RequestResponse', but this requires me to store a future, which I cannot do in a database.
There's no built-in way to check for the status of an asynchronous Lambda invocation.
Asynchronous Lambda invocation, using the event invocation type, is meant to be a fire-and-forget job. As such, there's no 'progress' or 'status' to get or poll for.
As you don't want to wait for the Lambda to complete, synchronous Lambda invocation is out of the picture. In this case, you need to write your own logic to keep track of the status.
One way you could do this is to store a (job) item in a DynamoDB jobs table with 2 attributes:
jobId UUID (String attribute, set as the partition key)
completed boolean flag (Boolean attribute)
Workflow is then as follows:
Within your API, create & store a new job with completed defaulting to 'false'
Pass the newly-created jobId to the Lambda being invoked in the payload
When the Lambda finishes, lookup the job associated with the passed in jobId within the jobs table & set the completed attribute of the job to true
You can then periodically poll for the result of the job within the DynamoDB table.
Or take a look at using DynamoDB Streams as a way to know when a job finishes in near-real time without polling.
As to viewing the 'output': AWS Lambda just returns a success response without additional information. There is no 'output'. Store any output you might need in persistent storage - maybe an extra output attribute as a String on each job? - and later retrieve it.
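A minimal boto3 sketch of this workflow (the table name, worker function name, and output attribute are assumptions consistent with the above):

import json
import uuid
import boto3

jobs = boto3.resource("dynamodb").Table("jobs")  # hypothetical table name
lambda_client = boto3.client("lambda")

# in the API: create the job, then invoke the worker asynchronously
def start_job(payload):
    job_id = str(uuid.uuid4())
    jobs.put_item(Item={"jobId": job_id, "completed": False})
    lambda_client.invoke(
        FunctionName="my-worker",  # hypothetical worker Lambda
        InvocationType="Event",
        Payload=json.dumps({"jobId": job_id, **payload}),
    )
    return job_id

# at the end of the worker Lambda: flag the job done and store its output
def mark_complete(job_id, output):
    jobs.update_item(
        Key={"jobId": job_id},
        UpdateExpression="SET completed = :c, #out = :o",
        ExpressionAttributeNames={"#out": "output"},  # the extra output attribute
        ExpressionAttributeValues={":c": True, ":o": output},
    )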
@Ermiya Eskandary's answer is absolutely right.
I am a DynamoDB subject matter expert, and I have implemented this status-tracking pattern (along with error handling, retries, and error logging) for many of my customers.
You could check out the pynamodb_mate library; it has the status tracker pattern implemented, and you can enable it with around 15 lines of code.
In general, when you say you want status tracking, you are talking about the following:
- Each task should be handled by only one worker; you want a concurrency lock mechanism to avoid double consumption (many people aren't aware of this; it relates to idempotency).
- For succeeded tasks, store additional information such as the output of the task, and log the success time.
- For failed tasks, log the error message for debugging, so you can fix the bug and rerun the task.
- For failed tasks, you want to fetch all of them with one simple query and rerun them with the updated business logic.
- For tasks that have failed too many times, you don't want to retry them anymore and want to ignore them (a lot of people run into an endless loop when they deploy to production, and then realize this is a necessary feature).
- Run custom queries based on task status for analytics purposes.
You can read this Jupyter notebook example.
Basically, with pynamodb_mate, your Lambda job application code becomes:
# this is your lambda application code
def lambda_handler(event, context):
    ...

# your new code should be:
with tracker.start_job():
    lambda_handler(event, context)
If your application code is not Python, then you have two options:
- create another Lambda function that invokes the original one in sync mode; however, you pay extra to run the "caller" Lambda function
- suppose your Lambda code is in Node.js: add an additional Lambda runtime as a layer and wrap your Node.js caller in a Python function. In short, you use Python to call Node.js.

Is there a way to be notified of status changes in Google AI Platform training jobs without polling the REST API?

Right now I monitor my submitted jobs on Google AI Platform (formerly ml engine) by polling the job REST API. I don't like this solution for a few reasons:
Awareness of status changes is often delayed, or missed altogether if the time between status changes is shorter than the polling interval
Lots of unnecessary network traffic
Lots of unnecessary function invocations
I would like to be notified as soon as my training jobs complete. It'd be great if there were some way to assign hooks or callbacks to run when the job status changes.
I've also considered adding calls to Cloud Functions directly within the training task Python package that runs on AI Platform. However, I don't think those calls would happen when the training job is shut down unexpectedly, such as when a job is cancelled or forced to end by GCP.
Is there a better way to go about this?
You can use a Stackdriver sink to read the logs and send them to Pub/Sub. From Pub/Sub, you can connect to a bunch of other providers:
1. Set up a Pub/Sub sink
Make sure you have access to the logs and publish rights to the topic you desire before you get started. Follow the instructions for setting up a Stackdriver -> Pub/Sub sink. You’ll want to use this query to limit the events only to Training jobs:
resource.type = "ml_job"
resource.labels.task_name = "service"
Note that Stackdriver lets you narrow the query further. For example, you can limit it to a particular job by adding a condition like resource.labels.job_id = "..." or to a certain event with a filter like jsonPayload.message : "..."
2. Respond to the Pub/Sub message
In order to tell what changed, the recipient of the Pub/Sub message can either query the job status from the ml.googleapis.com API or read the text of the message.
Reading state from ml.googleapis.com
When you receive the message, make a call to https://ml.googleapis.com/v1/projects/<project_id>/jobs/<job_id> to get the job information, replacing <project_id> and <job_id> in the URL with the values of resource.labels.project_id and resource.labels.job_id from the Pub/Sub message, respectively.
The returned Job object contains a field state that, naturally, tells the status of the job.
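A small sketch of that lookup using Application Default Credentials (the helper name is hypothetical):

import google.auth
from google.auth.transport.requests import AuthorizedSession

def get_job_state(project_id, job_id):
    credentials, _ = google.auth.default()
    session = AuthorizedSession(credentials)  # signs requests with ADC
    resp = session.get(
        f"https://ml.googleapis.com/v1/projects/{project_id}/jobs/{job_id}"
    )
    resp.raise_for_status()
    return resp.json()["state"]  # e.g. SUCCEEDED, FAILED, CANCELLED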
Reading state from the message text
The Pub/Sub message will contain a string telling you what happened to the job. You probably want to act when the job ends. Look for these strings in jsonPayload.message (a small sketch follows the list):
"Job completed successfully."
"Job cancelled."
"Job failed."
I implemented a Terraform module as @htappen said. I'm happy if it helps you. But my real hope is that Google updates AI Platform with the same feature.
https://github.com/sfujiwara/terraform-google-ai-platform-notification
I think you can programmatically publish a Pub/Sub message at the end of your training job code. Something like this:
import json

from google.cloud import pubsub_v1

# publish a job-complete message
client = pubsub_v1.PublisherClient()
topic = client.topic_path(args.gcp_project_id, 'topic-name')
data = {
    'ACTION': 'JOB_COMPLETE',
    'SAVED_MODEL_DIR': args.job_dir
}
data_bytes = json.dumps(data).encode('utf-8')
client.publish(topic, data_bytes)
Then you can set up a Cloud Function to be triggered by the same Pub/Sub topic.
You can work around the lack of a callback from the service on a custom TF training job by adding a LambdaCallback to the fit() call. In on_epoch_end, you could then send yourself a notification on job progress, and in on_train_end when it finishes.
https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/LambdaCallback
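For instance, a minimal sketch (notify_progress is a hypothetical helper that delivers your notification):

import tensorflow as tf

notify_cb = tf.keras.callbacks.LambdaCallback(
    on_epoch_end=lambda epoch, logs: notify_progress(f"epoch {epoch} done", logs),
    on_train_end=lambda logs: notify_progress("training finished", logs),
)

model.fit(x_train, y_train, epochs=10, callbacks=[notify_cb])

Note that, as the question points out, callbacks inside the training code won't fire if the job is shut down externally.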

Can I run an Azure WebJob after other WebJobs finish

say I have this:
Step 1: An Azure WebJob triggered by a timer; this job will create 1000 messages and put them in a queue.
Step 2: Another Azure WebJob triggered by the above message queue; this WebJob will process these messages.
Step 3: The final WebJob should only be triggered when all messages have been processed by step 2.
It looks like Azure Storage queues don't support ordering, and the only way is to use Service Bus. I am wondering, is it really the only way?
What I am thinking is this kind of process:
Put all these messages into an Azure table, with a GUID as the key and status set to 0.
After a message is processed in step 2, change its status to 1 (i.e. finished), and trigger step 3 once every message has been done.
Will it work? Or maybe there are some nuget packages that I can use to achieve what I want?
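For what it's worth, a minimal sketch of that bookkeeping with the azure-data-tables Python package (the connection string, table name, and step-3 trigger are hypothetical; concurrent workers would also need an atomic counter or ETag checks to avoid triggering step 3 twice):

from azure.data.tables import TableClient

table = TableClient.from_connection_string(conn_str, table_name="jobmessages")  # hypothetical

# step 1: record one entity per queue message, status 0 = pending
def record_message(batch_id, message_id):
    table.create_entity({"PartitionKey": batch_id, "RowKey": message_id, "status": 0})

# step 2: after processing a message, mark it finished,
# then trigger step 3 once nothing is left pending
def mark_done(batch_id, message_id):
    entity = table.get_entity(batch_id, message_id)
    entity["status"] = 1
    table.update_entity(entity)
    pending = table.query_entities(f"PartitionKey eq '{batch_id}' and status eq 0")
    if not any(pending):
        start_step_three(batch_id)  # hypothetical trigger for the final WebJob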
The simplest way, I think, is the combination of Azure Logic Apps and Azure Functions.
A Logic App is an automated, scalable workflow; you can trigger it with a timer, an HTTP request, etc. Azure Functions is a serverless compute service that enables you to run code on demand without having to explicitly provision or manage infrastructure.
Logic Apps support adding code via Functions, and a Function is used much like a WebJob. So I think you could create a Logic App with three Functions, and they will run one by one.
As for WebJobs: yes, the QueueTrigger doesn't support ordering. The Service Bus you mentioned does meet some of your requirements with its FIFO feature; however, you would need to make sure step 3 isn't triggered before step 1, because the queue is also empty before any messages have been created.
Hope my answer helps you.

How to track distributed task progress

Here is my case:
When my server receives a request, it triggers distributed tasks, in my case many AWS Lambda functions (the peak could be 3000)
I need to track each task's progress/status, i.e. pending, running, success, error
My server could have many replicas
I still want to know the tasks' progress/status even if any of my server replicas goes down
My current design:
I chose AWS S3 as my helper
When a task starts to execute, it creates a marker file in a special folder on S3, e.g. a running folder
When the task fails or succeeds, it moves the marker file from the running folder to a fail folder or a success folder
I check the marker files on S3 to track the progress of the tasks
The problems:
There is a limit on concurrent access (request rate) to AWS S3
My case is likely to exceed that limit some day
Attempt Solutions:
I have tried my best to reduce the number of requests to S3
I don't want to track the progress by storing data in my DB, because my DB is already under a heavy workload
To be honest, it is kind of weird to use marker files on S3 to track the progress of tasks. However, it has worked so far.
Are there any recommendations?
Thanks in advance!
This sounds like a perfect application for persistent event queueing, specifically Kinesis. As each Lambda starts, it generates a “starting” event on Kinesis. When it succeeds or fails, it generates the appropriate event. You could even create progress events along the way if you want to see how far they have gotten.
Your server can then monitor the number of starting events against ending events (success or failure) until these two numbers are equal. It can query the error events to see which processes failed and why. All servers can query the same events without disrupting each other, and any server can go down and recover without losing data.
Make sure to put an origination key on events that are supposed to be grouped together, so they don't get mixed up with events from a subsequent request. Also, give each Lambda its own key so you can trace progress per Lambda. GUIDs are perfect for this.
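A rough sketch of emitting such events from each Lambda with boto3 (the stream name and field names are assumptions):

import json
import boto3

kinesis = boto3.client("kinesis")
STREAM = "task-events"  # hypothetical stream name

def emit(batch_id, task_id, event_type, detail=None):
    payload = json.dumps({
        "batchId": batch_id,  # origination key grouping one request's tasks
        "taskId": task_id,
        "event": event_type,  # "starting" | "success" | "failure"
        "detail": detail,
    })
    kinesis.put_record(
        StreamName=STREAM,
        PartitionKey=task_id,  # per-task key, so progress is traceable per Lambda
        Data=payload.encode("utf-8"),
    )

def handler(event, context):
    emit(event["batchId"], event["taskId"], "starting")
    try:
        ...  # the actual task work
        emit(event["batchId"], event["taskId"], "success")
    except Exception as exc:
        emit(event["batchId"], event["taskId"], "failure", str(exc))
        raise

Your server can then count "starting" events against "success"/"failure" events per batchId, as described above.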