We have IoT sensors that upload WAV files into an S3 bucket.
We want to extract sound features from each file as it is uploaded (object-created event) using AWS Lambda.
For that we need:
Python: librosa or pyAudioAnalysis, plus numpy and scipy (~240 MB unzipped)
ffmpeg (~70 MB unzipped)
As you can see, there is no way to fit them all into the same Lambda package (250 MB uncompressed max). And when ffmpeg is not included in the layers, I get an error while reading the WAV file:
[ERROR] FileNotFoundError: [Errno 2] No such file or directory: 'ffprobe': 'ffprobe'
which is related to ffmpeg.
We are looking for implementation recommendations; we have thought about:
Putting the ffmpeg binary in S3 and fetching it on every single invocation, so it doesn't have to go into the layers (if that is even possible).
Chaining two Lambdas: the first processes the input file through ffmpeg and puts the output file in another bucket; the second is then invoked and extracts features from the processed data (via SNS or some other chaining mechanism), if that is even possible.
Moving to EC2, where we would have a problem with concurrent invocations occurring when two files are uploaded at the same time.
There has to be an easier way; I'll be glad to hear other opinions before diving into implementation.
Thank you all!
The scenario appears to be:
Files come in at random times
The files need to be processed, but not in real-time
The required libraries are too big for an AWS Lambda function
Suggested architecture:
Configure an Amazon S3 Event to send a message to an Amazon SQS queue when a file arrives
Configure an Amazon CloudWatch Event to trigger an AWS Lambda function at regular intervals (e.g. every hour)
The Lambda function checks whether there are messages in the queue
If there are messages, it launches an Amazon EC2 instance with a User Data script that installs and starts the processing system (a sketch of this launcher function follows the list)
The processing system will:
Grab a message from the queue
Process the message (without the limitations of Lambda)
Delete the message
If there are no messages left in the queue, it will terminate the EC2 instance
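A minimal sketch of that launcher Lambda in Python with boto3; the queue URL, worker AMI, and user-data script are assumptions you would replace with your own:

import os
import boto3

sqs = boto3.client("sqs")
ec2 = boto3.client("ec2")

QUEUE_URL = os.environ["QUEUE_URL"]    # assumed: set on the Lambda
AMI_ID = os.environ["WORKER_AMI_ID"]   # assumed: AMI with your processing system

USER_DATA = """#!/bin/bash
# hypothetical: start the worker that drains the queue and shuts
# the instance down when the queue is empty
/opt/worker/start.sh
"""

def handler(event, context):
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    if int(attrs["Attributes"]["ApproximateNumberOfMessages"]) == 0:
        return  # nothing to do this interval
    ec2.run_instances(
        ImageId=AMI_ID,
        InstanceType="t3.micro",  # smallest usable type, per the advice below
        MinCount=1,
        MaxCount=1,
        UserData=USER_DATA,
        # a worker that runs `shutdown -h now` then terminates, not stops
        InstanceInitiatedShutdownBehavior="terminate",
    )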
This can be very cost-effective because Amazon EC2 Linux instances are charged per second. You can run several workers in parallel to process the messages (but be careful when writing the termination code, to ensure that all workers have finished processing before the instance shuts down). Or, if things are not time-critical, just choose the smallest usable instance type and single-thread it: larger instances cost proportionally more, so they are no cheaper per job.
Make sure you put monitoring in place to ensure that messages are being processed. Implement a Dead Letter Queue in Amazon SQS to catch messages that are failing to process and put a CloudWatch Alarm on the DLQ to notify you if things seem to be going wrong.
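Attaching the DLQ to the main queue is a single API call; a sketch with boto3, using placeholder queue URL and ARN:

import json
import boto3

sqs = boto3.client("sqs")

# Placeholder URL/ARN: substitute your real queue and DLQ.
sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/audio-jobs",
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:audio-jobs-dlq",
            "maxReceiveCount": "3",  # move a message to the DLQ after 3 failed receives
        })
    },
)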
Related
I have created a Lambda function that extracts the audio stream from a video file using ffmpeg. I have also configured API Gateway as a trigger, passing the file to the Lambda function in the request body.
The Lambda function works perfectly well with small files, but bigger files need more time and I run into the API Gateway timeout, which to my understanding is capped at 29 seconds.
So when I trigger audio extraction from a bigger file, I hit this timeout and my API request fails to return a result, even though the transcoding keeps running in the background and the file is extracted. What is the best approach to handle cases where the execution of the Lambda function takes longer than that?
I was thinking of starting the transcoding in the background and simply returning a JSON message saying the transcoding might take a couple of minutes, depending on the input file duration. But if I push the ffmpeg process to the background, I get an error that the destination file doesn't exist.
os.system(f"{ffmpeg} -loglevel panic -nostdin -i {in_video} -vn -c:a aac -ar 48000 -b:a 192K {out_audio} 2> /dev/null &")
This is the ffmpeg command extracting the audio and transcoding it to AAC.
If I remove the 2> /dev/null & part of the command, it runs just fine, but if I keep it, I get an error:
"errorMessage": "[Errno 2] No such file or directory: 'output_audio.aac'"
"errorType": "FileNotFoundError"
So I was wondering what is the preferred way to run processes in the background.
There are many options to consider.
But first, since you already have the whole flow working with Lambda behind API Gateway, you can use a Lambda function URL.
Function URLs are a good way to trigger a Lambda over HTTPS. They support multiple authorization mechanisms, such as IAM.
The interesting point is the timeout: with a function URL, the maximum is the Lambda limit of 15 minutes, which is definitely better than the 29 seconds you get with API Gateway.
Function URLs are free of charge and can be enabled on an existing Lambda function.
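Enabling a function URL on an existing function is one call; for example with boto3 (the function name here is hypothetical):

import boto3

lam = boto3.client("lambda")

resp = lam.create_function_url_config(
    FunctionName="audio-extractor",  # hypothetical function name
    AuthType="AWS_IAM",              # or "NONE" for unauthenticated access
)
print(resp["FunctionUrl"])           # the HTTPS endpoint to call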
Increasing the timeout might just push the problem back until you have a very big file to convert. In the long run it may be worth exploring other solutions, such as uploading the file to S3 and using AWS Batch, or spinning up an EC2 instance to process the file. That would require more architecture design and implementation, though.
For longer processing, it is recommended to use asynchronous invocations, where the Lambda function is triggered, runs until completion, and does not block the caller. One option: upload the file to S3, configure the Lambda function to react to the S3 event, download the file from S3, process it, and upload the result to another S3 bucket after processing completes.
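A minimal sketch of such an S3-triggered handler, assuming ffmpeg is on the PATH (e.g. provided by a layer) and the target bucket comes from an OUTPUT_BUCKET environment variable:

import os
import subprocess
import boto3

s3 = boto3.client("s3")
OUTPUT_BUCKET = os.environ["OUTPUT_BUCKET"]  # assumed env var

def handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    in_video = "/tmp/" + os.path.basename(key)
    out_audio = in_video.rsplit(".", 1)[0] + ".aac"
    s3.download_file(bucket, key, in_video)

    # Run ffmpeg in the foreground: backgrounding it (the trailing `&`)
    # lets the handler return before the output file exists, which is
    # exactly the FileNotFoundError from the question.
    subprocess.run(
        ["ffmpeg", "-loglevel", "panic", "-nostdin", "-i", in_video,
         "-vn", "-c:a", "aac", "-ar", "48000", "-b:a", "192k", out_audio],
        check=True,
    )
    s3.upload_file(out_audio, OUTPUT_BUCKET, os.path.basename(out_audio))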
I would like to receive files into a Google Cloud Storage bucket and have a Python job run exactly once for each file. I would like many such Python jobs to be running concurrently, in order to process many files in parallel, but each file should only be processed once.
I have considered the following:
Pub/Sub messages
Generate Pub/Sub messages for the OBJECT_FINALIZE event on the bucket. The issue here is that Pub/Sub may deliver messages more than once, so a pool of Python jobs listening to the same subscription may run more than one job for the same message, so I could either...
Use Dataflow to deduplicate messages, but in my non-streaming use case, Dataflow seems to be expensive overkill, and this answer seems to suggest it's not the right tool for the job.
or
Create a locking mechanism using a transactional database (say, PostgreSQL on Cloud SQL). Any job receiving a message can attempt to acquire a lock with the same name as the file, any job that fails to acquire a lock can terminate and not ACK the message, and any job with the lock can continue processing and label the lock as done to prevent any future acquisition of that lock.
I think 2 would work but it also feels over-engineered.
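For what it's worth, option 2 can be as small as one idempotent insert; a sketch with psycopg2, assuming a locks table keyed on the file name:

import psycopg2

def try_acquire_lock(conn, file_name):
    """Return True if this worker won the lock for file_name.

    Assumes: CREATE TABLE locks (file_name TEXT PRIMARY KEY, done BOOLEAN DEFAULT FALSE);
    """
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO locks (file_name) VALUES (%s) ON CONFLICT DO NOTHING",
            (file_name,),
        )
        conn.commit()
        return cur.rowcount == 1  # one row inserted means we got the lock

A job that gets False back can exit without ACKing the message; the winner processes the file and then marks the lock as done.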
Polling
Instead of using Pub/Sub, have jobs poll for new files in the bucket.
This feels like it would simply replace Pub/Sub with a less robust solution that would still require a locking mechanism.
Eventarc
Use Eventarc to trigger a Cloud Run container holding my code. This seems similar to Pub/Sub, and simpler, but I can find no explanation of how Eventarc deals with things like retries, or whether it comes with any exactly-once guarantees.
Single controller spawning multiple workers
Create a central controller process that handles deduplication of file events (received through Pub/Sub, polling, or Eventarc), then spawns worker jobs and allocates each file to exactly one worker job.
I think this could also work but creates a single point of failure and potentially a throughput bottleneck.
You're on the right track, and yes, Pub/Sub push messages may be delivered more than once.
One simple technique to manage that is to rename the file as you start processing it. A rename is an atomic transaction, so if it succeeds, you're good to process the file.
from google.cloud import storage

PROC_PRF = "processing"

def handle_finalize(event, context):
    # For a GCS-triggered Cloud Function, the event payload carries
    # the bucket and object name (assumed here; adapt to your trigger).
    bucketName = event["bucket"]
    fileName = event["name"]

    # Renaming the file below triggers another google.storage.object.finalize
    # event, so detect our own rename and exit early.
    if PROC_PRF in fileName:
        print("Exiting due to rename event")
        return  # ack the message and exit

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucketName)
    blob = bucket.get_blob(fileName)
    try:
        newBlob = bucket.rename_blob(blob, new_name=fileName + "." + PROC_PRF)
    except Exception:
        raise RuntimeError("Error: file rename from " + fileName
                           + " failed, is this a duplicate function call?")
    # The rename worked - process the file & message
We are designing a pipeline. A number of raw files arrive in S3 buckets; we then apply a schema and save them as Parquet.
As of now we trigger a Lambda function for each file written, but ideally we would like to start this process only after all the files are written. How can we trigger the Lambda just once?
I encourage you to use an alternative that maintains the separation between the publisher (whoever is writing) and the subscriber (you). The publisher tells you when things are written; it's your responsibility to choose when to process those things. The neat pattern here would be for the publisher to write its files in batches and publish manifests for you to trigger on: i.e. a list which says "I just wrote all these things, you can find them in these places". Since you don't have that / can't change the publisher, I suggest the following:
Send the notifications from the publisher to an SQS queue.
Schedule your Lambda to run on a schedule; how often is determined by how long you're willing to delay ingestion. If data should be delayed at most 5 minutes between being published and being ingested by your system, set your Lambda to trigger every 4 minutes. You can use a scheduled CloudWatch Events rule for this.
When your Lambda runs, poll the queue. Keep going until you accumulate the maximum number of notifications, X, that you want to process in one go, or the queue is empty.
Process them. If the queue wasn't empty when you stopped polling, immediately trigger another Lambda execution (see the sketch after the notes below).
Things to keep in mind on the above:
As written, this is not parallel, so if your rate of Lambda execution is slower than the rate at which the queue fills up, you'll need to 1. run more frequently, or 2. insert a load-balancing step: a Lambda that is triggered on a schedule, polls the queue, and calls as many processing Lambdas as necessary so that each one gets X notifications.
SNS in general, and SQS non-FIFO queues specifically, don't guarantee exactly-once delivery: they can send you duplicate notifications. Make sure you can handle duplicate processing cleanly.
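A sketch of the poll-process-retrigger loop described in the steps above, assuming QUEUE_URL is set on the function and process() stands in for your schema-and-Parquet step:

import json
import os
import boto3

sqs = boto3.client("sqs")
lam = boto3.client("lambda")

QUEUE_URL = os.environ["QUEUE_URL"]  # assumed env var
MAX_BATCH = 100                      # the "X" notifications per run

def process(messages):
    ...  # placeholder: apply schema, write Parquet

def handler(event, context):
    messages = []
    while len(messages) < MAX_BATCH:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=1
        )
        batch = resp.get("Messages", [])
        if not batch:
            break  # queue is empty
        messages.extend(batch)

    process(messages)
    for msg in messages:
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    # Stopped because the batch filled up? The queue may not be empty:
    # trigger another execution immediately.
    if len(messages) == MAX_BATCH:
        lam.invoke(FunctionName=context.function_name,
                   InvocationType="Event", Payload=json.dumps({}))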
Hook your Lambda up to a Webhook (API Gateway) and then just call it from your client app once your client app is done.
Solutions:
Zip all the files together and have the Lambda unzip them
Create a UI that sends the files one by one and triggers the Lambda when the last one is sent
Have the Lambda check the files: if it doesn't find all of them, quit silently; if it finds all of them, handle them in one run (see the sketch below)
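A sketch of that last option, with a hypothetical hard-coded manifest of expected files:

import boto3

s3 = boto3.client("s3")

PREFIX = "incoming/"                                   # assumed prefix
EXPECTED = {"part-1.csv", "part-2.csv", "part-3.csv"}  # hypothetical manifest

def process_all(keys):
    ...  # placeholder: handle all files in one go

def handler(event, context):
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=PREFIX)
    present = {o["Key"][len(PREFIX):] for o in resp.get("Contents", [])}
    if not EXPECTED.issubset(present):
        return  # silently quit: not everything has arrived yet
    process_all(sorted(EXPECTED))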
Here is my case:
When my server receives a request, it triggers distributed tasks, in my case many AWS Lambda functions (the peak count could be 3000)
I need to track each task's progress/status, i.e. pending, running, success, error
My server could have many replicas
I still want to know the task progress/status even if any of my server replicas goes down
My current design:
I chose AWS S3 as my helper
When a task starts executing, it creates a marker file in a special folder on S3, e.g. a running folder
When the task fails or succeeds, it moves the marker file from the running folder to a fail or success folder (sketched below)
I check the marker files on S3 to track the progress of the tasks
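For reference, the "move" on S3 is a copy followed by a delete, since S3 has no rename; a sketch with a hypothetical status bucket:

import boto3

s3 = boto3.client("s3")
BUCKET = "task-status"  # hypothetical bucket holding the marker files

def move_marker(task_id, src_folder, dst_folder):
    # e.g. move_marker("task-42", "running", "success")
    src = f"{src_folder}/{task_id}"
    s3.copy_object(Bucket=BUCKET, Key=f"{dst_folder}/{task_id}",
                   CopySource={"Bucket": BUCKET, "Key": src})
    s3.delete_object(Bucket=BUCKET, Key=src)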
The problems:
There is a limit on concurrent access to AWS S3 (the request rate per prefix)
My case is likely to exceed that limit some day
Attempted solutions:
I have tried my best to reduce the number of requests to S3
I don't want to track progress by storing data in my DB because it is already under a heavy workload.
To be honest, it is kind of weird to use marker files on S3 to track the progress of tasks. However, it has worked so far.
Are there any recommendations?
Thanks in advance!
This sounds like a perfect application for persistent event queueing, specifically Kinesis. As each Lambda starts, it generates a "starting" event on Kinesis. When it succeeds or fails, it generates the appropriate event. You could even create progress events along the way if you want to see how far each task has gotten.
Your server can then monitor the number of starting events against ending events (success or failure) until these two numbers are equal. It can query the error events to see which processes failed and why. All servers can query the same events without disrupting each other, and any server can go down and recover without losing data.
Make sure to put an origination key on events that are supposed to be grouped together so they don't get mixed up with a subsequent batch. Also, give each Lambda its own key so you can trace progress per Lambda. GUIDs are perfect for this.
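A sketch of the event emission inside each Lambda; the stream name is an assumption, and the keys are GUIDs as suggested:

import json
import uuid
import boto3

kinesis = boto3.client("kinesis")

STREAM_NAME = "task-events"          # assumed Kinesis stream
ORIGINATION_KEY = str(uuid.uuid4())  # in practice passed in, shared by one batch
TASK_ID = str(uuid.uuid4())          # unique per Lambda, traces its progress

def emit(status, detail=None):
    # status: "starting", "progress", "success" or "error"
    kinesis.put_record(
        StreamName=STREAM_NAME,
        PartitionKey=TASK_ID,  # keeps one task's events in order
        Data=json.dumps({
            "origination_key": ORIGINATION_KEY,
            "task_id": TASK_ID,
            "status": status,
            "detail": detail,
        }),
    )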
I'm new to using AWS, so any pointers would be appreciated.
I have a need to process large files using our in-house software.
It takes about 2 GB of input and generates 5 GB of output, running for 2 hours on a c3.8xlarge.
For now I do it manually, starting an instance (either on-demand or via spot request), but now I want to automate and scale this processing reliably. What are good frameworks, platforms, or Amazon services to do that?
Especially regarding the possibility that a spot instance will be terminated halfway through (I'd need to detect that and restart the job).
I have heard about Python Celery, but does it work well with Amazon and spot instances?
Or are there other recommended mechanisms?
Thank you!
This is somewhat opinion-based, but you can mix and match some of the AWS pieces to make this easier:
put the input data on S3
push an entry into an SQS queue, with a long visibility timeout, indicating a job needs to be processed
set up an auto scaling policy based on SQS, with your machine description in CloudFormation
use UserData/cloud-init to set up the machine and start your application
write code to receive the queue entry, start processing, finish processing, then delete the SQS message
the code should then check for another queued entry; if there is none, it should terminate the machine (see the sketch below)
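A sketch of the worker loop for the last two steps; the queue URL is a placeholder, and the instance ID for self-termination comes from the instance metadata service:

import urllib.request
import boto3

sqs = boto3.client("sqs")
ec2 = boto3.client("ec2")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # placeholder

def run_job(body):
    ...  # placeholder: fetch input from S3, process for ~2h, upload output

def my_instance_id():
    # EC2 instance metadata service (IMDSv1 shown for brevity)
    with urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ) as r:
        return r.read().decode()

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)
    messages = resp.get("Messages", [])
    if not messages:
        ec2.terminate_instances(InstanceIds=[my_instance_id()])
        break
    msg = messages[0]
    run_job(msg["Body"])  # the long visibility timeout hides the entry meanwhile
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])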