How can I process a newly uploaded object in Google Cloud Storage exactly once? - google-cloud-platform

I would like to receive files into a Google Cloud Storage bucket and have a Python job run exactly once for each file. I would like many such Python jobs to be running concurrently, in order to process many files in parallel, but each file should only be processed once.
I have considered the following:
Pub/Sub messages
Generate Pub/Sub messages for the OBJECT_FINALIZE event on the bucket. The issue here is that Pub/Sub may deliver messages more than once, so a pool of Python jobs listening to the same subscription may run more than one job for the same message, so I could either...
Use Dataflow to deduplicate messages, but in my non-streaming use case Dataflow seems to be expensive overkill, and this answer seems to suggest it's not the right tool for the job.
or
Create a locking mechanism using a transactional database (say, PostgreSQL on Cloud SQL). Any job receiving a message can attempt to acquire a lock with the same name as the file, any job that fails to acquire a lock can terminate and not ACK the message, and any job with the lock can continue processing and label the lock as done to prevent any future acquisition of that lock.
I think the second option (the locking mechanism) would work, but it also feels over-engineered.
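A minimal sketch of the locking idea described above, using Python's built-in sqlite3 module as a stand-in for PostgreSQL on Cloud SQL. The table name file_locks, the status values, and the function names are hypothetical. The primary key on the file name is what makes acquisition exclusive: only the first insert for a given file can succeed.

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) lock table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_locks ("
        " file_name TEXT PRIMARY KEY,"
        " status TEXT NOT NULL)"
    )
    conn.commit()


def try_acquire(conn: sqlite3.Connection, file_name: str) -> bool:
    """Return True only for the first caller that claims file_name."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO file_locks (file_name, status) "
                "VALUES (?, 'processing')",
                (file_name,),
            )
        return True
    except sqlite3.IntegrityError:
        # The row already exists: another job holds the lock or has finished.
        return False


def mark_done(conn: sqlite3.Connection, file_name: str) -> None:
    """Label the lock as done so it can never be acquired again."""
    with conn:
        conn.execute(
            "UPDATE file_locks SET status = 'done' WHERE file_name = ?",
            (file_name,),
        )
```

A job that gets False from try_acquire would exit without processing the file. Note that this sketch does not handle a job that crashes while holding the lock; a real version would probably need a timestamp or lease so that stale 'processing' rows can be reclaimed.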
Polling
Instead of using Pub/Sub, have jobs poll for new files in the bucket.
This feels like it would simply replace Pub/Sub with a less robust solution that would still require a locking mechanism.
Eventarc
Use Eventarc to trigger a Cloud Run container holding my code. This seems similar to Pub/Sub, and simpler, but I can find no explanation of how Eventarc deals with things like retries, or whether it comes with any exactly-once guarantees.
Single controller spawning multiple workers
Create a central controller process that handles deduplication of file events (received through Pub/Sub, polling, or Eventarc), then spawns worker jobs and allocates each file exactly once to a worker job.
I think this could also work but creates a single point of failure and potentially a throughput bottleneck.

You're on the right track, and yes, Pub/Sub push messages may be delivered more than once.
One simple technique to manage that is to rename the file as you start processing it. Only one rename of a given object can succeed (the original no longer exists afterwards), so if it succeeds, you're good to process it.
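from google.cloud import storage

PROC_PRF = "processing"

def claim_file(bucket_name, file_name):
    """Return the renamed blob if this call claimed the file, else None."""
    # bucket_name and file_name come from the Pub/Sub message.
    # Renaming the file below triggers another google.storage.object.finalize event
    if PROC_PRF in file_name:
        print("Exiting due to rename event")
        # Ack the message and exit
        return None
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.get_blob(file_name)
    duplicate_msg = "Error: File rename from " + file_name + " failed, is this a duplicate function call?"
    if blob is None:
        # The original object is gone, most likely renamed by another call
        raise RuntimeError(duplicate_msg)
    try:
        new_blob = bucket.rename_blob(blob, new_name=file_name + "." + PROC_PRF)
    except Exception as exc:
        raise RuntimeError(duplicate_msg) from exc
    # The rename worked - process the file & message
    return new_blob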

Related

How can I track the progress/status of an asynchronous AWS Lambda invocation?

I have an API which I use to trigger AWS Lambda jobs. Upon request, the API invokes an AWS Lambda job with InvocationType='Event'. Hereafter, I want to periodically poll if the AWS Lambda job has finished.
The way that would fit best to my architecture, is to store an identifier of the Lambda job in a database and periodically check if the job is finished and what its output is. However, I was not able to find how I can do this.
How can I periodically poll for the result of an AWS Lambda job, and view the output once it has finished?
I have looked into using InvocationType='RequestResponse', but this requires me to store a future, which I cannot do in a database.
There's no built-in way to check for the status of an asynchronous Lambda invocation.
Asynchronous Lambda invocation, using the event invocation type, is meant to be a fire and forget job. As such, there's no 'progress' or 'status' to get or poll for.
As you don't want to wait for the Lambda to complete, synchronous Lambda invocation is out of the picture. In this case, you need to write your own logic to keep track of the status.
One way you could do this is to store a (job) item in a DynamoDB jobs table with 2 attributes:
jobId UUID (String attribute, set as the partition key)
completed boolean flag (Boolean attribute)
Workflow is then as follows:
Within your API, create & store a new job with completed defaulting to 'false'
Pass the newly-created jobId to the Lambda being invoked in the payload
When the Lambda finishes, lookup the job associated with the passed in jobId within the jobs table & set the completed attribute of the job to true
You can then periodically poll for the result of the job within the DynamoDB table.
Or take a look at using DynamoDB Streams as a way to know when a job finishes in near-real time without polling.
As to viewing the 'output', AWS Lambda just returns a success response without additional information. There is no 'output'. Store any output you might need in persistent storage - maybe an extra output attribute as a String with each job? - & later retrieve it.
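The workflow above can be sketched as follows. This is only an illustration: InMemoryJobsTable is a hypothetical in-memory stand-in for the DynamoDB jobs table (it is not the boto3 API), and the worker function plays the role of the Lambda. The attribute names jobId, completed, and output follow the answer.

```python
import uuid


class InMemoryJobsTable:
    """Hypothetical stand-in for a DynamoDB table keyed by jobId."""

    def __init__(self):
        self._items = {}

    def put(self, item):
        self._items[item["jobId"]] = dict(item)

    def get(self, job_id):
        item = self._items.get(job_id)
        return dict(item) if item is not None else None

    def update(self, job_id, **attributes):
        self._items[job_id].update(attributes)


def create_job(table):
    """API side: store a new job with completed defaulting to False."""
    job_id = str(uuid.uuid4())
    table.put({"jobId": job_id, "completed": False, "output": None})
    return job_id


def worker(table, payload):
    """Lambda side: do the work, then mark the job as completed."""
    result = payload["value"] * 2  # placeholder for the real work
    table.update(payload["jobId"], completed=True, output=str(result))


def poll(table, job_id):
    """API side: return (completed, output) for a stored job id."""
    item = table.get(job_id)
    return item["completed"], item["output"]
```

The API would call create_job, pass the returned jobId in the invocation payload, and later call poll with the id it stored in its own database.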
@Ermiya Eskandary's answer is absolutely right.
I am a DynamoDB subject matter expert, and I have implemented this status tracking pattern (along with error handling, retry, and error logging) for many of my customers.
You could check the pynamodb_mate library. It has the status tracker pattern implemented, and you can enable it with around 15 lines of code.
In general, when you say you want status tracking, you are talking about the following:
Each task should be handled by only one worker, so you want a concurrency lock mechanism to avoid double consumption. (Many people aren't aware of this requirement; it is usually discussed under the name idempotency.)
For those succeeded tasks, store additional information such as the output of the task and log the success time.
For failed tasks, log the error message for debugging, so you can fix the bug and rerun the task.
For failed tasks, you want to get all of them with one simple query and rerun them with the updated business logic.
For tasks that have failed too many times, you don't want to retry them anymore and want to ignore them. (Many people run into an endless loop when they deploy to production and only then realize that this is a necessary feature.)
Run custom query based on task status for analytics purpose.
You can read this Jupyter notebook example.
Basically, with pynamodb_mate your Lambda application code becomes:
# this is your lambda application code
def lambda_handler(...):
    ...

# your new code should be:
with tracker.start_job():
    lambda_handler()
If your application code is not Python, then you have two options:
Create another Lambda function that invokes the original one synchronously. However, you pay more to run the "caller" Lambda function.
Suppose your Lambda code is in Node.js. Then add an additional Lambda runtime as a layer and wrap your Node.js code in a Python function. In short, you are using Python to call Node.js.

Is there a way to be notified of status changes in Google AI Platform training jobs without polling the REST API?

Right now I monitor my submitted jobs on Google AI Platform (formerly ml engine) by polling the job REST API. I don't like this solution for a few reasons:
Awareness of status changes is often delayed or missed altogether if the interval between status changes is smaller than the monitoring polling rate
Lots of unnecessary network traffic
Lots of unnecessary function invocations
I would like to be notified as soon as my training jobs complete. It'd be great if there is some way to assign hooks or callbacks to run when the job status changes.
I've also considered adding calls to Cloud Functions directly within the training task Python package that runs on AI Platform. However, I don't think those function calls will occur in cases where the training job is shut down unexpectedly, such as when a job is cancelled or forced to end by GCP.
Is there a better way to go about this?
You can use a Stackdriver sink to read the logs and send them to Pub/Sub. From Pub/Sub, you can connect to a bunch of other providers:
1. Set up a Pub/Sub sink
Make sure you have access to the logs and publish rights to the topic you desire before you get started. Follow the instructions for setting up a Stackdriver -> Pub/Sub sink. You’ll want to use this query to limit the events only to Training jobs:
resource.type = "ml_job"
resource.labels.task_name = "service"
Note that Stackdriver can further limit down the query. For example, you can limit to a particular Job by adding a condition like resource.labels.job_id = "..." or to a certain event with a filter like jsonPayload.message : "..."
2. Respond to the Pub/Sub message
In order to tell what changed, the recipient of the Pub/Sub message can either query the job status from the ml.googleapis.com API or read the text of the message
Reading state from ml.googleapis.com
When you receive the message, make a call to https://ml.googleapis.com/v1/projects/<project_id>/jobs/<job_id> to get the Job information, replacing <project_id> and <job_id> in the URL with the values of resource.labels.project_id and resource.labels.job_id from the Pub/Sub message, respectively.
The returned Job object contains a field state that, naturally, tells the status of the job.
Reading state from the message text
The Pub/Sub message will contain a string telling what happened to the job. You probably want to react when the job ends. Look for these strings in jsonPayload.message:
"Job completed successfully."
"Job cancelled."
"Job failed."
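As a rough sketch of the "reading state from the message text" approach: the function below assumes the Pub/Sub message data is the base64-encoded JSON of the exported log entry, with the resource.labels and jsonPayload.message fields named above. The function name and the state labels on the right-hand side of the mapping are choices made for this sketch.

```python
import base64
import json

# Message texts from the answer, mapped to labels chosen for this sketch.
TERMINAL_MESSAGES = {
    "Job completed successfully.": "SUCCEEDED",
    "Job cancelled.": "CANCELLED",
    "Job failed.": "FAILED",
}


def parse_job_event(pubsub_data_b64):
    """Return (job_id, state) for a terminal job message, else None.

    pubsub_data_b64 is assumed to be the base64-encoded log entry
    delivered in the Pub/Sub message's data field.
    """
    entry = json.loads(base64.b64decode(pubsub_data_b64).decode("utf-8"))
    message = entry.get("jsonPayload", {}).get("message", "")
    labels = entry.get("resource", {}).get("labels", {})
    for text, state in TERMINAL_MESSAGES.items():
        if text in message:
            return labels.get("job_id"), state
    # Not one of the end-of-job messages; nothing to report.
    return None
```

A subscriber (for example a Cloud Function triggered by the topic) could call parse_job_event on each message and send a notification whenever it returns a result.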
I implemented a Terraform module as @htappen described. I'd be happy if it helps you. But my real hope is that Google updates AI Platform with the same feature.
https://github.com/sfujiwara/terraform-google-ai-platform-notification
I think you can programmatically publish a PubSub message at the end of your training job code. Something like this:
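import json

from google.cloud import pubsub_v1

# publish job complete message
# (args holds the training task's parsed command-line arguments)
client = pubsub_v1.PublisherClient()
topic = client.topic_path(args.gcp_project_id, 'topic-name')
data = {
    'ACTION': 'JOB_COMPLETE',
    'SAVED_MODEL_DIR': args.job_dir
}
data_bytes = json.dumps(data).encode('utf-8')
# publish() returns a future; wait on it so the message is sent before the job exits
client.publish(topic, data_bytes).result()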
Then you can set up a Cloud Function to be triggered by the same Pub/Sub topic.
You can work around the lack of a callback from the service on a custom TF training job by adding a LambdaCallback to the fit() call. In on_epoch_end, you could then send yourself a notification on job progress, and in on_train_end a notification when it finishes.
https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/LambdaCallback

How to trigger AWS Lambda just once on multiple S3 notifications

We are designing a pipeline. We get a number of raw files which come into S3 buckets and then we apply a schema and then save them as parquet.
As of now we are triggering a Lambda function for each file written, but ideally we would like to start this process only after all the files are written. How can we trigger the Lambda just once?
I encourage you to use an alternative that maintains the separation between the publisher (whoever is writing) and the subscriber (you). The publisher tells you when things are written; it's your responsibility to choose when to process those things. The neat pattern here would be for the publisher to write its files in batches and publish manifests for you to trigger on: i.e. a list which says "I just wrote all these things, you can find them in these places". Since you don't have that / can't change the publisher, I suggest the following:
Send the notifications from the publisher to an SQS queue.
Run your Lambda on a schedule; how often is determined by how long you're willing to delay ingestion. If you want data to be delayed at most 5 min between being published and being ingested by your system, set your Lambda to trigger every 4 min. You can use CloudWatch scheduled events for this.
When your Lambda runs, poll the queue. Keep going until you accumulate the maximum number of notifications, X, that you want to process in one go, or until the queue is empty.
Process. If the queue wasn't empty when you stopped polling, immediately trigger another lambda execution.
Things to keep in mind on the above:
As written, it's not parallel, so if your rate of lambda execution is slower than the rate at which the queue fills up, you'll need to 1. run more frequently or 2. insert a load-balancing step: a lambda that is triggered on a schedule, polls the queue, and calls as many processing lambdas as necessary so that each one gets X notifications.
SNS in general and SQS non-FIFO queues specifically don't guarantee exactly-once delivery. They can send you duplicate notifications. Make sure you can handle duplicate processing cleanly.
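The polling step and the duplicate caveat above might look roughly like this. It is a sketch under assumptions: receive_batch is a hypothetical callable standing in for an SQS receive call, taking a limit and returning a list of (message_id, body) tuples, and the cap of 10 per call mirrors the SQS per-request maximum.

```python
def drain(receive_batch, max_notifications):
    """Collect up to max_notifications unique notification bodies.

    receive_batch(limit) is a hypothetical callable that returns at most
    `limit` (message_id, body) tuples, or an empty list when the queue
    is empty. Returns (bodies, queue_was_empty).
    """
    seen = set()
    collected = []
    while len(collected) < max_notifications:
        limit = min(10, max_notifications - len(collected))
        batch = receive_batch(limit)
        if not batch:
            return collected, True
        for message_id, body in batch:
            if message_id in seen:
                continue  # duplicate delivery within this run
            seen.add(message_id)
            collected.append(body)
    return collected, False
```

If the second return value is False, the queue was not drained and the Lambda would immediately trigger another execution, as described in the steps. The in-memory set only removes duplicates within one run; duplicates across runs still have to be tolerated by the processing itself.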
Hook your Lambda up to a Webhook (API Gateway) and then just call it from your client app once your client app is done.
Solutions:
Zip all the files together and have the Lambda unzip them.
Write UI code that sends the files one by one and triggers the Lambda when the last one is sent.
Have the Lambda check for the files: if it doesn't find all of them, it quits silently; if it finds all of them, it handles them all in one thread.

Aws lambda audio features extraction ( Not enough storage -Layers )

We have IoT sensors that upload WAV files to an S3 bucket.
We want to extract sound features from each uploaded file (on the object-created event) with AWS Lambda.
For that we need:
python librosa or pyAudio analysis package + numpy and scipy (~240 MB unzipped)
ffmpeg (~70 MB unzipped)
As you can see, there is no way to put them all together in the same Lambda package (250 MB uncompressed max). And I'm getting an error when ffmpeg is not included in the layers and the WAV file is loaded:
[ERROR] FileNotFoundError: [Errno 2] No such file or directory: 'ffprobe': 'ffprobe'
which is related to ffmpeg.
We are looking for implementation recommendation, we thought about:
Putting the ffmpeg file in S3 and fetching it on every invocation, without having to put it in the layers (if that is even possible).
Chaining two Lambdas: the first processes the input file through ffmpeg and puts the output file in another bucket; the second is then invoked and extracts features from the processed data (using SNS or another chaining mechanism, if that is even possible).
Moving to EC2, where we will have a problem with concurrent invocations occurring when two files are uploaded at the same time.
There has to be an easier way. I'll be glad to hear other opinions before diving into implementation.
Thank you all!
The scenario appears to be:
Files come in at random times
The files need to be processed, but not in real-time
The required libraries are too big for an AWS Lambda function
Suggested architecture:
Configure an Amazon S3 Event to send a message to an Amazon SQS queue when a file arrives
Configure an Amazon CloudWatch Event to trigger an AWS Lambda function at regular intervals (e.g. every hour)
The Lambda function checks whether there are messages in the queue
If there are messages, it launches an Amazon EC2 instance with a User Data script that installs and starts the processing system
The processing system will:
Grab a message from the queue
Process the message (without the limitations of Lambda)
Delete the message
If there are no messages left in the queue, it will terminate the EC2 instance
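The processing loop on the EC2 instance could be sketched like this. The four callables are hypothetical placeholders for this sketch: receive would wrap an SQS receive call and return None when the queue is empty, delete would wrap the SQS delete call, and terminate would shut down the instance.

```python
def run_worker(receive, process, delete, terminate):
    """Process messages until the queue is empty, then terminate.

    A message is deleted only after process() succeeds, so a failing
    message stays in the queue and can be retried or end up in a
    dead letter queue. Returns the number of messages processed.
    """
    processed = 0
    while True:
        message = receive()
        if message is None:
            terminate()  # no messages left in the queue
            return processed
        try:
            process(message)
        except Exception:
            # Leave the message undeleted; with SQS it becomes visible
            # again after the visibility timeout expires.
            continue
        delete(message)
        processed += 1
```

Deleting only after successful processing is what lets the queue's redelivery take care of failures.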
This can be very cost-effective because Amazon EC2 Linux instances are charged per-second. You can run several workers in parallel to process the messages (but be careful when writing the termination code, to ensure that all workers have finished processing messages). Or, if things are not time-critical, just choose the smallest usable Instance Type and single-thread it, since larger instances cost more anyway (so they are no better from a cost-efficiency standpoint).
Make sure you put monitoring in place to ensure that messages are being processed. Implement a Dead Letter Queue in Amazon SQS to catch messages that are failing to process and put a CloudWatch Alarm on the DLQ to notify you if things seem to be going wrong.

How to track distributed tasks progress

Here is my case:
When my server receives a request, it triggers distributed tasks, in my case many AWS Lambda functions (the peak value could be 3000)
I need to track each task progress / status i.e. pending, running, success, error
My server could have many replicas
I still want to know the task progress / status even if any of my server replicas go down
My current design:
I choose AWS S3 as my helper
When a task starts to execute, it creates a marker file in a special folder on S3, e.g. a running folder
When the task fails or succeeds, it moves the marker file from the running folder to a fail folder or a success folder
I check the marker files on S3 to check the progress of the tasks.
The problems:
There is a limit for AWS S3 concurrent access
My case is likely to exceed the limit some day
Attempted solutions:
I have tried my best to reduce the number of requests to S3
I don't want to track the progress by storing data in my DB because my DB is already under heavy workload.
To be honest, it is kind of weird to use marker files on S3 to track the progress of tasks. However, it has worked so far.
Are there any recommendations?
Thanks in advance!
This sounds like a perfect application of persistent event queueing, specifically Kinesis. As each Lambda starts it generates a “starting” event on Kinesis. When it succeeds or fails, it generates the appropriate event. You could even create progress events along the way if you want to see how far they have gotten.
Your server can then monitor the number of starting events against ending events (success or failure) until these two numbers are equal. It can query the error events to see which processes failed and why. All servers can query the same events without disrupting each other, and any server can go down and recover without losing data.
Make sure to put an Origination Key on events that are supposed to be grouped together so they don't get mixed up with a subsequent event. Also, each Lambda should be given its own key so you can trace progress per Lambda. GUIDs are perfect for this.
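To illustrate the counting described above, here is a small sketch of how a server might summarize the events it has read from the stream. The event shape is an assumption made for this sketch: each event is a dict with hypothetical fields origination_key, task_id, type ("starting", "success", "failure", or anything else for progress), and an optional error. Reading the records from Kinesis itself is left out.

```python
def summarize(events, origination_key):
    """Summarize task states for one group of events.

    events is an iterable of dicts with the (hypothetical) fields
    origination_key, task_id, type and, for failures, error.
    """
    status = {}  # task_id -> "running" | "success" | "failure"
    errors = {}  # task_id -> error message
    for event in events:
        if event["origination_key"] != origination_key:
            continue  # belongs to a different request
        task_id = event["task_id"]
        if event["type"] == "starting":
            # setdefault keeps a final state if events arrive out of order
            status.setdefault(task_id, "running")
        elif event["type"] in ("success", "failure"):
            status[task_id] = event["type"]
            if event["type"] == "failure":
                errors[task_id] = event.get("error", "")
    started = len(status)
    ended = sum(1 for state in status.values() if state != "running")
    return {
        "started": started,
        "ended": ended,
        "all_ended": started > 0 and started == ended,
        "errors": errors,
    }
```

Because the summary is computed purely from the events, any server replica can rebuild it after a restart. If the expected number of tasks is known, comparing ended against that number avoids reporting completion before every Lambda has started.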