Tracking file SLAs with AWS S3 - amazon-web-services

I am receiving 10 files from upstream sources dropped into an S3 location. All 10 of them need to be received by a certain SLA, and if that SLA is breached I need to escalate using an eventing mechanism.
Is there a feature in S3, or an integration with another AWS service, that can help implement this functionality?

Something like this may work:
Establish the SLA with your upstream systems (let's say every day between 1 PM and 1:30 PM).
Trigger a Step Functions execution via a CloudWatch Events rule at 1 PM. Within the Step Function, keep checking (every 5 minutes) whether the files have arrived.
If the files have not arrived by 1:30 PM, trigger an event that emails you saying the files have not arrived, and end the Step Function execution.
If the files have arrived by 1:30 PM, end the Step Function execution.
OR
Establish the SLA with your upstream systems (let's say every day between 1 PM and 1:30 PM).
Trigger a Lambda function via a CloudWatch Events rule at 1:35 PM (a minimal sketch of such a function follows below).
If the files have not arrived, trigger an event that emails you saying the files have not arrived.
If the files have arrived, do nothing.
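
A minimal sketch of that second option, assuming a hypothetical bucket, prefix, set of expected file names and SNS topic (all placeholders to adjust for your environment):

```python
import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")

# All of the following names are placeholders for illustration only.
BUCKET = "incoming-bucket"
PREFIX = "daily/"
EXPECTED_FILES = {f"file_{i}.csv" for i in range(1, 11)}  # the 10 expected files
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:sla-breach"


def lambda_handler(event, context):
    """Invoked by a scheduled CloudWatch Events rule at 1:35 PM."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    received = {obj["Key"].split("/")[-1] for obj in resp.get("Contents", [])}

    missing = EXPECTED_FILES - received
    if missing:
        # SLA breached: escalate via SNS (e.g. an email subscription on the topic)
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="SLA breach: upstream files missing",
            Message=f"Files not received by 1:30 PM: {sorted(missing)}",
        )
    return {"missing": sorted(missing)}
```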

Related

AWS lambda: Execute function on timeout

I am developing a lambda function that migrates logs from an SFTP server to an S3 bucket.
Due to the size of the logs, the function sometimes times out, even though I have set the maximum timeout of 15 minutes.
try:
    logger.info(f'Migrating {log_name}...')
    transfer_to_s3(log_name, sftp)
    logger.info(f'{log_name} was migrated successfully')
If transfer_to_s3() fails due to a timeout, the logger.info(f'{log_name} was migrated successfully') line won't be executed.
I want to ensure that in this scenario, I will somehow know that a log was not migrated due to timeout.
Is there a way to force lambda to perform an action, before exiting, in the case of a timeout?
Probably a better way would be to use SQS for that:
Log info ---> SQS queue ---> Lambda function
If the Lambda successfully moves the files, it removes the log info from the SQS queue. If it fails, the log info persists in the SQS queue (or goes to a DLQ for special handling), so the next Lambda invocation can handle it.
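
A minimal sketch of that pattern, assuming the Lambda is wired to the queue through an SQS event source mapping, so that a successful return deletes the message while a failure (exception or timeout) leaves it on the queue for retry and, eventually, the DLQ. The message shape is hypothetical and transfer_to_s3 stands in for the question's SFTP-to-S3 logic:

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def transfer_to_s3(log_name, sftp_config):
    """Placeholder for the question's SFTP-to-S3 transfer logic."""
    raise NotImplementedError


def lambda_handler(event, context):
    """Triggered by SQS: each record carries the info for one log to migrate."""
    for record in event["Records"]:
        payload = json.loads(record["body"])
        log_name = payload["log_name"]

        logger.info(f"Migrating {log_name}...")
        # If this raises, or the function times out, the invocation fails:
        # the message is NOT deleted from the queue, becomes visible again
        # and is retried, eventually landing in the DLQ if it keeps failing.
        transfer_to_s3(log_name, payload["sftp_config"])
        logger.info(f"{log_name} was migrated successfully")
```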

Multiple curl calls php issue

My problem: every 20 minutes I want to execute around 25,000 (or more) curl requests and save each curl response in a database. PHP does not handle this well. Which AWS services, other than Lambda, are best suited for this?
A common technique for processing a large number of similar calls is:
Create an Amazon Simple Queue Service (SQS) queue and push each request into the queue as a separate message. In your case, the message would contain the URL that you wish to retrieve.
Create an AWS Lambda function that performs the download and stores the data in the database.
Configure the Lambda function to trigger off the SQS queue
This way, the SQS queue can trigger hundreds of Lambda functions running in parallel. The default concurrency limit is 1,000 concurrent Lambda executions, but you can request an increase.
You would then need a separate process that, every 20 minutes, queries the database for the URLs and pushes the messages into the SQS queue.
The complete process is:
Schedule -> Lambda pusher -> messages into SQS -> Lambda workers -> database
The beauty of this design is that it can scale to handle large workloads and operates in parallel, rather than each curl request having to wait. If a message cannot be processed, Lambda will automatically try again. Repeated failures will send the message to a Dead Letter Queue for later analysis and reprocessing.
If you wish to perform 25,000 queries every 20 minutes (1200 seconds), this would need a query to complete every 0.05 seconds. That's why it is important to work in parallel.
By the way, if you are attempting to scrape this information from a single website, I suggest you investigate whether they provide an API; otherwise you might be violating the website's Terms & Conditions, which I strongly advise against.
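
A minimal Python sketch of the two Lambda roles in this process, assuming a hypothetical queue URL; load_urls_from_database() and save_response() are placeholders for your own persistence code:

```python
import json
import urllib.request

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/url-queue"  # hypothetical


def pusher_handler(event, context):
    """Scheduled every 20 minutes: push one SQS message per URL."""
    urls = load_urls_from_database()  # placeholder for your database query
    for i in range(0, len(urls), 10):  # send_message_batch accepts up to 10 entries
        entries = [
            {"Id": str(i + j), "MessageBody": json.dumps({"url": u})}
            for j, u in enumerate(urls[i:i + 10])
        ]
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)


def worker_handler(event, context):
    """Triggered by the SQS queue: fetch each URL and store the response."""
    for record in event["Records"]:
        url = json.loads(record["body"])["url"]
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
        save_response(url, body)  # placeholder for your database insert


def load_urls_from_database():
    raise NotImplementedError  # replace with your own query


def save_response(url, body):
    raise NotImplementedError  # replace with your own insert
```

With the SQS trigger in place, failed fetches are retried automatically and repeated failures end up in the Dead Letter Queue, as described above.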

Once only notification with DeliveryDelay with AWS SQS

In a web application, people upload files to be processed. File processing can take anywhere between 30 seconds and 30 minutes per file depending on the size of the file. Within an upload session, people upload anywhere between 1 and 20 files and these may be uploaded within multiple batches with the time lag between the batches being up to 5 minutes.
I want to notify the uploader when processing has completed, but I do not want to send a notification when the first batch has finished processing while another batch is still going to be uploaded within the 2-5 minute window. I.e. the uploader sees multiple batches of files as one single "work period", which he may only go through every couple of days.
Instead of implementing a regular check, I have implemented the notification with AWS SQS:
- on completion of each file being processed a message is sent to the queue with a 5 minute delivery delay.
- when this message is processed, it checks if there is any other file being processed still and if not, it sends the email notification
This approach leads to multiple emails being sent, if there are multiple files that complete processing in the last 5 minutes of all file processing.
As a way to fix this, I have thought about using an AWS SQS FIFO queue with the same DeduplicationId; however, I understand I would need to pass through the last message with the same DeduplicationId, not the first.
Is there a better way to solve this with event driven systems? Ideally I want to limit the amount of queues needed, as this system is very prototype driven and also not introduce another place to store state - I already have a relational database.
You can use AWS Step Functions to control this type of workflow.
1. Upload files to S3
2. Store jobs in DynamoDB
3. Start a Step Functions flow with the job id
4. The last step of the flow is sending the email notification
...
PROFIT!
I was not able to find a way to do this without using some sort of atomic central storage, as suggested by @Ivan Shumov.
In my case, a relational database is used to store file data and various processing metrics, so I have refined the process as follows:
- on completion of each file being processed, a message is sent to the queue with a 5 minute delivery delay. The 5 minutes represents the largest time lag between upload batches in a single work session.
- when the first file is processed, a unique processing id is established, stored with the user account and linked to all files in this session
- when this message is processed, it checks whether any other file is still being processed and whether there is a processing id against the user.
- if there is a processing id against the user, it clears the processing ids from the user and file records, then sends the notification email.
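
A minimal sketch of the delayed-message part of this refined process, assuming a hypothetical queue URL; the helpers that touch the relational database and send the email are placeholders:

```python
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/completion-queue"  # hypothetical


def on_file_processed(file_id, processing_id):
    """Called when a single file finishes processing."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"file_id": file_id, "processing_id": processing_id}),
        DelaySeconds=300,  # 5 minutes: the largest gap between upload batches
    )


def on_delayed_message(body):
    """Consumer of the delayed message (e.g. the body of an SQS-triggered Lambda record)."""
    processing_id = json.loads(body)["processing_id"]

    if files_still_processing(processing_id):
        return  # a later file will send its own delayed message
    if clear_processing_id(processing_id):  # atomic update in the relational DB
        send_notification_email(processing_id)


# Placeholders for the database queries and mail sending described above.
def files_still_processing(processing_id):
    raise NotImplementedError


def clear_processing_id(processing_id):
    raise NotImplementedError


def send_notification_email(processing_id):
    raise NotImplementedError
```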

How to trigger AWS Lambda just once on multiple S3 notifications

We are designing a pipeline. We receive a number of raw files into S3 buckets, then apply a schema and save them as Parquet.
As of now we are triggering a Lambda function for each file written, but ideally we would like to start this process only after all the files are written. How can we trigger the Lambda just once?
I encourage you to use an alternative that maintains the separation between the publisher (whoever is writing) and the subscriber (you). The publisher tells you when things are written; it's your responsibility to choose when to process those things. The neat pattern here would be for the publisher to write its files in batches and publish manifests for you to trigger on: i.e. a list which says "I just wrote all these things, you can find them in these places". Since you don't have that / can't change the publisher, I suggest the following:
Send the notifications from the publisher to an SQS queue.
Set your Lambda to run on a schedule; how often is determined by how long you're willing to delay ingestion. If you want data to be delayed at most 5 minutes between being published and being ingested by your system, set your Lambda to trigger every 4 minutes. You can use CloudWatch Events for this.
When your Lambda runs, poll the queue. Keep going until you accumulate the maximum number of notifications, X, that you want to process in one go, or until the queue is empty (a sketch follows after the notes below).
Process. If the queue wasn't empty when you stopped polling, immediately trigger another lambda execution.
Things to keep in mind on the above:
As written, it's not parallel, so if your rate of lambda execution is slower than the rate at which the queue fills up, you'll need to 1. run more frequently or 2. insert a load-balancing step: a lambda that is triggered on a schedule, polls the queue, and calls as many processing lambdas as necessary so that each one gets X notifications.
SNS in general and SQS non-FIFO queues specifically don't guarantee exactly-once delivery. They can send you duplicate notifications. Make sure you can handle duplicate processing cleanly.
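
A rough sketch of the scheduled poller described above, assuming a hypothetical queue URL and batch limit X; process_notifications() stands in for the schema-apply / Parquet step:

```python
import boto3

sqs = boto3.client("sqs")
lambda_client = boto3.client("lambda")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-notifications"  # hypothetical
MAX_BATCH = 500  # the "X" above: notifications to process in one go


def process_notifications(messages):
    """Placeholder: apply the schema and write the Parquet output."""
    raise NotImplementedError


def lambda_handler(event, context):
    """Runs on a CloudWatch Events schedule (e.g. every 4 minutes)."""
    messages = []
    while len(messages) < MAX_BATCH:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=1
        )
        batch = resp.get("Messages", [])
        if not batch:
            break  # queue is (momentarily) empty
        messages.extend(batch)

    if messages:
        process_notifications(messages)
        # Delete only what was processed; remember duplicates are still possible.
        for m in messages:
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])

    if len(messages) >= MAX_BATCH:
        # Queue probably isn't empty yet: immediately trigger another run.
        lambda_client.invoke(
            FunctionName=context.function_name, InvocationType="Event"
        )
```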
Hook your Lambda up to a Webhook (API Gateway) and then just call it from your client app once your client app is done.
Solutions:
Zip all the files together and have the Lambda unzip them
create UI code that sends the files one by one and triggers the Lambda when the last one is sent
have the Lambda check the files: if it doesn't find all of them, quit silently; if it finds all of them, handle them all in one run

AWS Lambda audio feature extraction (not enough storage - Layers)

We have IoT sensors that upload wav files into an S3 bucket.
We want to be able to extract sound features from each file that gets uploaded (object-created event) with AWS Lambda.
For that we need:
Python librosa or pyAudioAnalysis package + numpy and scipy (~240 MB unzipped)
ffmpeg (~70 MB unzipped)
As you can see, there is no way to fit them all into the same Lambda package (250 MB uncompressed max). And I'm getting an error when ffmpeg is not included in the layers while fetching the wav file:
[ERROR] FileNotFoundError: [Errno 2] No such file or directory: 'ffprobe': 'ffprobe'
which is related to ffmpeg.
We are looking for implementation recommendations; we have thought about:
Putting the ffmpeg binary in S3 and downloading it on every single invocation, without having to put it in the layers (if that is even possible).
Chaining two Lambdas: the first processes the input file through ffmpeg and puts the output file in another bucket; the second is then invoked and extracts features from the processed data (using SNS or a chaining mechanism), if that is even possible.
Moving to EC2, where we would have a problem with concurrent invocations occurring when two files are uploaded at the same time.
There has to be an easier way; I'll be glad to hear other opinions before diving into implementation.
Thank you all!
The scenario appears to be:
Files come in at random times
The files need to be processed, but not in real-time
The required libraries are too big for an AWS Lambda function
Suggested architecture:
Configure an Amazon S3 Event to send a message to an Amazon SQS queue when a file arrives
Configure an Amazon CloudWatch Event to trigger an AWS Lambda function at regular intervals (eg 1 hour)
The Lambda function checks whether there are messages in the queue
If there are messages, it launches an Amazon EC2 instance with a User Data script that installs and starts the processing system
The processing system will:
Grab a message from the queue
Process the message (without the limitations of Lambda)
Delete the message
If there are no messages left in the queue, it will terminate the EC2 instance
This can be very cost-effective because Amazon EC2 Linux instances are charged per-second. You can run several workers in parallel to process the messages (but be careful when writing the termination code, to ensure that all workers have finished processing messages). Or, if things are not time-critical, just choose the smallest usable Instance Type and single-thread it since larger instances cost more anyway (so they are no better from a cost-efficient standpoint).
Make sure you put monitoring in place to ensure that messages are being processed. Implement a Dead Letter Queue in Amazon SQS to catch messages that are failing to process and put a CloudWatch Alarm on the DLQ to notify you if things seem to be going wrong.
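
A minimal sketch of the scheduled "launcher" Lambda in this architecture, assuming a hypothetical queue URL, AMI and instance profile; the User Data script is only outlined:

```python
import boto3

sqs = boto3.client("sqs")
ec2 = boto3.client("ec2")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/wav-files"  # hypothetical

# Placeholder User Data: install ffmpeg and the Python audio stack, then run a
# worker loop that polls the queue, processes each wav file, deletes the message,
# and shuts the instance down once the queue is empty.
USER_DATA = """#!/bin/bash
# ... install dependencies and start the worker script here ...
"""


def lambda_handler(event, context):
    """Runs on a CloudWatch Events schedule (e.g. hourly)."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
    )
    if int(attrs["Attributes"]["ApproximateNumberOfMessages"]) == 0:
        return  # nothing to process

    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",                    # hypothetical AMI
        InstanceType="t3.small",
        MinCount=1,
        MaxCount=1,
        IamInstanceProfile={"Name": "wav-worker-profile"},  # hypothetical profile
        InstanceInitiatedShutdownBehavior="terminate",      # shutdown == terminate
        UserData=USER_DATA,
    )
```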