I have an architecture where a customer uploads a file or set of files to S3 for processing. The files are then moved (and untarred/unzipped/etc.) to a more appropriate S3 bucket, and a Lambda places a message in SQS to be picked up by a compute engine. In most cases, only one message per customer request is generated. However, a customer might upload, say, 200 images to the same request (all 200 images being slices of a single 3D image) one at a time. This generates 200 Lambda invocations and 200 messages. My compute engine can process the same request multiple times without a problem, but I would like to avoid processing the same request 200+ times (each such run takes > 5 mins on a large EC2 instance).
Is there a way, working within the Amazon tools, to either coalesce messages in a queue that have the same message body into a single message, or to peek into a queue for a message with a specific message body?
The only thing I can think to do is have a "special" file in my destination S3 bucket that records the last time a Lambda put this message in the queue. The issue with that: say the first image slice comes in and I put "Do this guy" in the queue; 50 more images come in and I notice the "special" file is there, so I queue nothing more; the message is picked up and processing starts; the rest of the images come in; then processing finishes and fails because only 50 of the 60 needed images were present, and there are no pending messages in the queue because I blocked them all...
Or I just suck it up and let the compute run 200 times, failing quickly ~199 times and succeeding once (or more)...
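For reference, the Lambda-to-SQS step looks roughly like this. This is only a sketch: the queue URL and the way the request id is derived from the S3 key are assumptions, and the FIFO/content-based-deduplication angle (which coalesces identical bodies sent within a 5 minute window) is one possibility, not a tested design:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/requests.fifo"  # assumed

def lambda_handler(event, context):
    # Assumption: one S3 key prefix per customer request, so all 200 slices
    # produce the same request id and therefore the same message body.
    key = event["Records"][0]["s3"]["object"]["key"]
    request_id = key.split("/")[0]

    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"request": request_id}),
        MessageGroupId=request_id,
        # On a FIFO queue, identical deduplication ids within a 5 minute
        # window collapse into a single delivered message.
        MessageDeduplicationId=request_id,
    )
```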
We are trying to optimize our application by doing batch pulling. Pub/Sub seems to allow asynchronously pulling one message at a time with different client nodes, but is there no way for a single node to do a batch pull from Pub/Sub?
Both Streaming Pull and the Pull RPC only allow the subscriber to consume one message at a time. Right now, it looks like we would have to pull one message at a time and do application-level batching.
Any insight would be helpful. Pretty new to GCP in general.
The underlying pull and streaming pull operations can receive batches of messages in the same response. The Cloud Pub/Sub client library, which uses streaming pull, breaks these batches apart and hands them to the provided user callback one at a time. Therefore, you need not worry about optimizing the underlying receiving of messages.
If your concern is optimizing the subscriber code at the application level, e.g., you want to batch writes into a database, then you have a couple of options:
Use Pull directly, which allows one to process all of the messages in a batch at a time. Note that using pull effectively requires keeping many pull requests outstanding simultaneously and immediately replacing each request that returns with a new one.
In your user callback, re-batch messages and once the batch reaches a desired size (or you've waited a sufficient amount of time to fill the batch), process all of the messages together and then ack them.
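A minimal sketch of the second option, re-batching inside the streaming pull callback. The project/subscription names, batch size, and `write_to_database` helper are all illustrative assumptions:

```python
import threading
from google.cloud import pubsub_v1

BATCH_SIZE = 100  # assumed batch size

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-sub")  # assumed

_batch = []
_lock = threading.Lock()

def write_to_database(rows):
    # Hypothetical stand-in for the real bulk write.
    print(f"writing {len(rows)} rows")

def flush(messages):
    # One bulk operation for the whole batch, then ack everything so
    # Pub/Sub does not redeliver it.
    write_to_database([m.data for m in messages])
    for m in messages:
        m.ack()

def callback(message):
    to_flush = None
    with _lock:
        _batch.append(message)
        if len(_batch) >= BATCH_SIZE:
            to_flush = _batch[:]
            _batch.clear()
    if to_flush:
        flush(to_flush)
    # A real implementation would also flush on a timer, so a partially
    # filled batch is not held past its ack deadline.

future = subscriber.subscribe(subscription_path, callback=callback)
try:
    future.result()
except KeyboardInterrupt:
    future.cancel()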
You can probably implement this using Dataflow (Apache Beam). You can have a running streaming job where you group, window, and transform messages according to your requirements. The results of processing can be saved in batches or streamed further. This probably makes sense when the number of messages is really big.
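A rough sketch of that suggestion: a streaming Beam job that windows Pub/Sub messages and hands each window to a batch sink. The subscription path, window size, and `write_batch` sink are assumptions:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

def write_batch(messages):
    # Hypothetical stand-in for saving one window's worth of results.
    print(f"saving batch of {len(messages)}")

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-sub")  # assumed
        | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 60-second batches
        | "Key" >> beam.Map(lambda msg: (None, msg))  # single key per window
        | "Group" >> beam.GroupByKey()
        | "Save" >> beam.Map(lambda kv: write_batch(list(kv[1])))
    )
```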
My problem: every 20 minutes I want to execute around 25,000 (or more) curl requests and save the responses in a database. PHP does not handle this properly. Which AWS service is best for this, other than Lambda?
A common technique for processing a large number of similar calls is:
Create an Amazon Simple Queue Service (SQS) queue and push each request into the queue as a separate message. In your case, the message would contain the URL that you wish to retrieve.
Create an AWS Lambda function that performs the download and stores the data in the database.
Configure the Lambda function to trigger off the SQS queue
This way, the SQS queue can trigger hundreds of Lambda functions running in parallel. The default concurrency limit is 1000 Lambda functions, but you can request an increase.
You would then need a separate process that, every 20 minutes, queries the database for the URLs and pushes the messages into the SQS queue.
The complete process is:
Schedule -> Lambda pusher -> messages into SQS -> Lambda workers -> database
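A sketch of the "Lambda workers" stage might look like this. The DynamoDB table and item schema are illustrative assumptions; any database works:

```python
import urllib.request

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("curl-results")  # assumed table

def lambda_handler(event, context):
    # SQS can deliver up to 10 messages per invocation.
    for record in event["Records"]:
        url = record["body"]
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
        table.put_item(Item={"url": url, "response": body})
    # Raising an exception instead would make Lambda retry the batch;
    # repeated failures send the message to the Dead Letter Queue.
```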
The beauty of this design is that it can scale to handle large workloads and operates in parallel, rather than each curl request having to wait. If a message cannot be processed, Lambda will automatically try again. Repeated failures will send the message to a Dead Letter Queue for later analysis and reprocessing.
If you wish to perform 25,000 queries every 20 minutes (1200 seconds), that would require a query to complete every 0.05 seconds. That's why it is important to work in parallel.
By the way, if you are attempting to scrape this information from a single website, I suggest you investigate whether they provide an API; otherwise, you might be violating the website's Terms & Conditions, which I strongly advise against.
In a web application, people upload files to be processed. File processing can take anywhere between 30 seconds and 30 minutes per file depending on the size of the file. Within an upload session, people upload anywhere between 1 and 20 files and these may be uploaded within multiple batches with the time lag between the batches being up to 5 minutes.
I want to notify the uploader when processing has completed, but I do not want to send a notification when the first batch has completed processing before another batch has been uploaded within the 2-5 minute window. I.e., the uploader sees multiple batches of files as one single "work period", which he may only do every couple of days.
Instead of implementing a regular check, I have implemented the notification with AWS SQS:
- on completion of each file being processed, a message is sent to the queue with a 5 minute delivery delay
- when this message is processed, it checks whether any other file is still being processed; if not, it sends the email notification
This approach leads to multiple emails being sent if multiple files complete processing within the final 5 minutes of all file processing.
As a way to fix this, I have thought of using an AWS SQS FIFO queue with the same MessageDeduplicationId; however, I understand I would need to pass through the last message with a given MessageDeduplicationId, not the first.
Is there a better way to solve this with event-driven systems? Ideally I want to limit the number of queues needed, as this system is very prototype-driven, and also not introduce another place to store state - I already have a relational database.
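For reference, the delayed per-file completion message described above can be sent with plain SQS delivery delay (queue URL and payload shape are assumptions; SQS supports DelaySeconds up to 900):

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/notify-check"  # assumed

def on_file_processed(upload_session_id, file_id):
    # Deliver the "is everything done?" check 5 minutes after this
    # file finishes processing.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"session": upload_session_id, "file": file_id}),
        DelaySeconds=300,
    )
```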
You can use AWS Step Functions to control this type of workflow.
1. Upload files to s3
2. Store jobs in DynamoDB
3. Start a Step Functions flow with the job id
4. Last step of flow is sending email notification
...
PROFIT!
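If it helps, step 3 might look roughly like this with boto3 (the state machine ARN and input shape are assumptions):

```python
import json
import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:file-jobs"  # assumed

def start_job_flow(job_id):
    sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        name=job_id,  # reusing the job id also makes retried starts idempotent
        input=json.dumps({"jobId": job_id}),
    )
```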
I was not able to find a way to do this without using some sort of atomic central storage, as suggested by @Ivan Shumov.
In my case, a relational database is used to store file data and various processing metrics, so I have refined the process as:
- on completion of each file being processed, a message is sent to the queue with a 5 minute delivery delay. The 5 minutes represents the largest time lag of uploads between multiple file batches in a single work session.
- when the first file is processed, a unique processing id is established, stored with the user account and linked to all files in this session
- when this message is processed, it checks whether any other file is still being processed and whether there is a processing id against the user
- if there is a processing id against the user, it clears the processing ids from the user and file records, then sends the email notification
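A sketch of that final check, using the relational database as the atomic arbiter so exactly one delayed message sends the email. Table and column names are illustrative assumptions, and sqlite3 merely stands in for the real database:

```python
import sqlite3  # stand-in for the real relational database

def try_finish_session(conn, user_id, processing_id):
    # Atomically clear the processing id, but only if no file in this
    # session is still processing; exactly one delayed message "wins".
    cur = conn.execute(
        "UPDATE users SET processing_id = NULL "
        "WHERE id = ? AND processing_id = ? "
        "AND NOT EXISTS (SELECT 1 FROM files "
        "                WHERE processing_id = ? AND status = 'processing')",
        (user_id, processing_id, processing_id),
    )
    conn.commit()
    return cur.rowcount == 1  # True only for the message that cleared it

# The message handler sends the single completion email only when
# try_finish_session(...) returns True.
```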
We have IoT sensors that upload wav files into an S3 bucket.
We want to be able to extract sound features from each file that is uploaded (create object event) with an AWS Lambda.
For that we need:
python librosa or pyAudioAnalysis package + numpy and scipy (~ 240 MB unzipped)
ffmpeg (~ 70 MB unzipped)
As you can see, there is no way to put them all together in the same Lambda package (250 MB uncompressed max). And I'm getting an error when ffmpeg is not included in the layers while gathering the wav file:
[ERROR] FileNotFoundError: [Errno 2] No such file or directory: 'ffprobe': 'ffprobe'
which is related to ffmpeg.
We are looking for implementation recommendations; we have thought about:
Putting the ffmpeg binary in S3 and fetching it on every single invocation, without having to put it in the layers (if that is even possible).
Chaining two Lambdas: the first processes the input file through ffmpeg and puts the output file in another bucket; the second is then invoked and extracts features from the processed data (using SNS / a chaining mechanism) (if that is even possible).
Moving to EC2, where we would have a problem with concurrent invocations occurring when two files upload at the same time.
There has to be an easier way; I'll be glad to hear other opinions before diving into implementation.
Thank you all!
The scenario appears to be:
Files come in at random times
The files need to be processed, but not in real-time
The required libraries are too big for an AWS Lambda function
Suggested architecture:
Configure an Amazon S3 Event to send a message to an Amazon SQS queue when a file arrives
Configure an Amazon CloudWatch Event to trigger an AWS Lambda function at regular intervals (eg 1 hour)
The Lambda function checks whether there are messages in the queue
If there are messages, it launches an Amazon EC2 instance with a User Data script that installs and starts the processing system
The processing system will:
Grab a message from the queue
Process the message (without the limitations of Lambda)
Delete the message
If there are no messages left in the queue, it will terminate the EC2 instance
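A sketch of that processing loop, as it might be launched by the User Data script. The queue URL is an assumption and process_file() stands in for the real work:

```python
import urllib.request

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/incoming-files"  # assumed
sqs = boto3.client("sqs")

def process_file(body):
    ...  # the heavyweight processing, free of Lambda's size and time limits

def main():
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,      # long polling
            VisibilityTimeout=1800,  # keep the message hidden while working
        )
        messages = resp.get("Messages", [])
        if not messages:
            break  # queue is drained
        msg = messages[0]
        process_file(msg["Body"])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    # Self-terminate so the per-second billing stops (instance id from metadata).
    instance_id = urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).read().decode()
    boto3.client("ec2").terminate_instances(InstanceIds=[instance_id])

if __name__ == "__main__":
    main()
```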
This can be very cost-effective because Amazon EC2 Linux instances are charged per second. You can run several workers in parallel to process the messages (but be careful when writing the termination code, to ensure that all workers have finished processing messages). Or, if things are not time-critical, just choose the smallest usable instance type and single-thread it; larger instances cost more anyway, so they are no better from a cost-efficiency standpoint.
Make sure you put monitoring in place to ensure that messages are being processed. Implement a Dead Letter Queue in Amazon SQS to catch messages that are failing to process and put a CloudWatch Alarm on the DLQ to notify you if things seem to be going wrong.
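That wiring might look like this with boto3; queue names, ARNs, and the alarm action are illustrative assumptions:

```python
import json
import boto3

sqs = boto3.client("sqs")
cloudwatch = boto3.client("cloudwatch")

# Attach a Dead Letter Queue: after 3 failed receives, SQS moves the
# message to the DLQ instead of redelivering it forever.
sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/incoming-files",  # assumed
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:incoming-files-dlq",  # assumed
            "maxReceiveCount": "3",
        })
    },
)

# Alarm as soon as anything lands in the DLQ.
cloudwatch.put_metric_alarm(
    AlarmName="incoming-files-dlq-not-empty",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "incoming-files-dlq"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # assumed
)
```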
Here is my case:
- When my server receives a request, it triggers distributed tasks, in my case many AWS Lambda functions (the peak could be 3000)
- I need to track each task's progress/status, i.e. pending, running, success, error
- My server could have many replicas
- I still want to know the task progress/status even if any of my server replicas goes down
My current design:
- I chose AWS S3 as my helper
- When a task starts to execute, it creates a marker file in a special folder on S3, e.g. a running folder
- When the task fails or succeeds, it moves the marker file from the running folder to the fail or success folder
- I check the marker files on S3 to check the progress of the tasks.
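A sketch of that marker-file scheme; the bucket and folder names are illustrative assumptions, and since S3 has no real "move", the marker is copied and the original deleted:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "task-status"  # assumed

def mark_running(task_id):
    s3.put_object(Bucket=BUCKET, Key=f"running/{task_id}", Body=b"")

def mark_done(task_id, success):
    dest = "success" if success else "fail"
    s3.copy_object(
        Bucket=BUCKET,
        Key=f"{dest}/{task_id}",
        CopySource={"Bucket": BUCKET, "Key": f"running/{task_id}"},
    )
    s3.delete_object(Bucket=BUCKET, Key=f"running/{task_id}")

def progress():
    # list_objects_v2 returns at most 1000 keys per call;
    # real code would paginate. Every call here counts against
    # the S3 request limits the question worries about.
    counts = {}
    for folder in ("running", "success", "fail"):
        resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"{folder}/")
        counts[folder] = resp.get("KeyCount", 0)
    return counts
```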
The problems:
- There is a limit on AWS S3 concurrent access
- My case is likely to exceed the limit some day
Attempted solutions:
- I have tried my best to reduce the number of requests to S3
- I don't want to track the progress by storing data in my DB, because my DB is already under a heavy workload
To be honest, it is kind of weird to use marker files on S3 to track the progress of tasks, but it has worked so far.
Are there any recommendations?
Thanks in advance!
This sounds like a perfect application of persistent event queueing, specifically Kinesis. As each Lambda starts it generates a “starting” event on Kinesis. When it succeeds or fails, it generates the appropriate event. You could even create progress events along the way if you want to see how far they have gotten.
Your server can then monitor the number of starting events against ending events (success or failure) until these two numbers are equal. It can query the error events to see which processes failed and why. All servers can query the same events without disrupting each other, and any server can go down and recover without losing data.
Make sure to put an origination key on events that are supposed to be grouped together so they don't get mixed up with a subsequent event. Also, each Lambda should be given its own key so you can trace progress per Lambda. GUIDs are perfect for this.
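A sketch of how each Lambda might emit those lifecycle events; the stream name and event shape are illustrative assumptions:

```python
import json
import uuid
import boto3

kinesis = boto3.client("kinesis")
STREAM = "task-events"  # assumed

def emit(origination_key, task_id, state, detail=None):
    kinesis.put_record(
        StreamName=STREAM,
        PartitionKey=origination_key,  # keeps one request's events together
        Data=json.dumps({
            "origination": origination_key,
            "task": task_id,   # per-Lambda GUID
            "state": state,    # "starting" | "progress" | "success" | "error"
            "detail": detail,
        }).encode(),
    )

if __name__ == "__main__":
    task = str(uuid.uuid4())  # each Lambda gets its own key
    emit("request-1234", task, "starting")
```

The monitoring server then counts "starting" events against "success"/"error" events per origination key until they match.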