I'm creating a Cloud Function in GCP to automatically resize images uploaded to a bucket and transfer them to another bucket. Since the images arrive in batches and one folder might contain hundreds or thousands of images, is it better to build the handling of multiple files into the code, or is it better to let Cloud Functions be triggered on every image uploaded?
Parallel processing is really powerful with serverless products because they scale up and down automatically according to your workload.
If you can receive thousands of images in a few seconds, the serverless scaling can have difficulties and you can lose some events (serverless scales up quickly, but it's not magic!).
A better solution is to publish the Cloud Storage events to Pub/Sub. That way you can easily retry the failed messages.
If the number of images keeps growing, or if you want to optimize cost, I recommend having a look at Cloud Run.
You can plug a Pub/Sub push subscription into Cloud Run. The power of Cloud Run is its capacity to process several HTTP requests (Pub/Sub push messages -> Cloud Storage events) on the same instance, and therefore to process several images concurrently on the same instance. If the conversion process is compute intensive, you can have up to 4 CPUs on a Cloud Run instance.
And, as with Cloud Functions, you only pay for instances that are actively processing requests. With Cloud Functions you can process 1 request at a time, therefore 1 instance per file. With Cloud Run you can process up to 1000 concurrent requests, and therefore you can reduce the number of instances (and thus your cost) by up to a factor of 1000. However, take care of the CPU required for your processing: if it's compute intensive, you can't process 1000 images at the same time.
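To make that concrete, here is a minimal sketch of a Cloud Run service receiving Cloud Storage notifications through a Pub/Sub push subscription. The destination bucket name and the Pillow-based resize are placeholder choices for illustration, not something from the question:

```python
import base64
import io
import json

from flask import Flask, request
from google.cloud import storage
from PIL import Image

app = Flask(__name__)
storage_client = storage.Client()

DEST_BUCKET = "resized-images"  # placeholder destination bucket


def resize(image_bytes, size=(256, 256)):
    # Downscale with Pillow, preserving aspect ratio.
    img = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    img.thumbnail(size)
    out = io.BytesIO()
    img.save(out, format="JPEG")
    return out.getvalue()


@app.route("/", methods=["POST"])
def handle_push():
    # Pub/Sub push wraps the message in an envelope; "data" is base64-encoded.
    envelope = request.get_json()
    notification = json.loads(base64.b64decode(envelope["message"]["data"]))

    # Cloud Storage notifications include the source bucket and object name.
    src = storage_client.bucket(notification["bucket"]).blob(notification["name"])
    resized = resize(src.download_as_bytes())

    storage_client.bucket(DEST_BUCKET).blob(notification["name"]).upload_from_string(
        resized, content_type="image/jpeg"
    )
    # A 2xx response acknowledges the message; anything else makes Pub/Sub retry it.
    return ("", 204)
```

Because one instance handles many of these requests concurrently, the per-image cost stays low as long as the resize is not CPU-bound.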
The finalize event is sent when a new object is created (or an existing object is overwritten, and a new generation of that object is created) in the bucket.
A new function will be triggered for each object uploaded. You can try compressing all those images into a ZIP file on the client and uploading that, so it triggers only 1 function, then uploading the images back to storage after unzipping them. But make sure you don't hit any of the limits mentioned in the documentation.
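A rough sketch of that ZIP approach, as a 1st-gen Python background function triggered by the finalize event (the destination bucket name is a placeholder):

```python
import io
import zipfile

from google.cloud import storage

storage_client = storage.Client()
IMAGES_BUCKET = "unzipped-images"  # placeholder destination bucket


def on_zip_uploaded(event, context):
    """Triggered by google.storage.object.finalize on the upload bucket."""
    if not event["name"].endswith(".zip"):
        return  # only handle archives

    archive_bytes = (
        storage_client.bucket(event["bucket"]).blob(event["name"]).download_as_bytes()
    )
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for member in archive.namelist():
            if member.endswith("/"):
                continue  # skip directory entries
            storage_client.bucket(IMAGES_BUCKET).blob(member).upload_from_string(
                archive.read(member)
            )
```

Keep in mind the function's memory and timeout limits when the archives get large.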
I have a project with the following workflow:
Pull files from server and upload to S3
When files hit S3, a message is sent to a topic using SNS
The lambda function subscribed to said topic will then process files by doing calculations
So far, I have not experienced any issues, but I wonder if this is a use case for SQS?
How many messages can my Lambda function handle all at once? If I am pulling hundreds of files, is SQS a necessity?
By default, the parallel invocation limit is set to 1000.
You can change that limit, but I have never hit that number so far.
As soon as a Lambda is done consuming its current request, it will be reused for another, so if you upload 1000 files you will probably only need about 100 Lambdas, unless a single Lambda takes minutes to run.
AWS handles the queued triggers, so even if you upload 100,000 files, they will be consumed as soon as possible, depending on various criteria.
You can test it by creating many little files and uploading them all at once :)
For higher speed, upload them to a different bucket first and simply move them from bucket to bucket (the bucket-to-bucket transfer is faster).
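For reference, a minimal per-invocation handler looks something like this; one S3 event can carry one or more records, so iterating over event["Records"] is the safe pattern. The processing step is a placeholder:

```python
import urllib.parse

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    # An S3-triggered invocation can contain one or more records.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (e.g. spaces as '+').
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        process(body)


def process(data):
    # Placeholder for the real per-file work (calculations, etc.).
    print(f"processed {len(data)} bytes")
```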
Good luck!
So, the Pub/Sub message size limit is 10 MB.
I've been using Pub/Sub as both an input and output for data pipelines because of its low latency. The assumption here is that Pub/Sub is the fastest mechanism on Google Cloud to pull data into a Compute Engine instance and push it out of that instance one (or a few) data points at a time (not in a batch manner). Then a Cloud Function with a Pub/Sub push subscription writes the output to BigQuery.
99% of the data I process does not exceed 1 MB. But there are some outliers over 10 MB in size.
What can I do about it? Leverage some kind of compression? Write output to Cloud Storage instead of Pub/Sub? Maybe to a persistent SSD? I want to make sure that my compute instances are doing their job digesting one data point at a time and spitting the output with minimal time spent on pulling and pushing data and max time spent on transforming it.
The safest and most scalable way is to save the data to Cloud Storage and to publish only the file reference in Pub/Sub, not the content. It's also the most cost-efficient way.
You could also compress the data, if it is compressible. That could be faster than going through Cloud Storage, but it's not as scalable.
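A minimal sketch of the "publish the reference, not the content" pattern; the project, bucket, and topic names are placeholders, and the subscriber is expected to fetch the object using the bucket/name attributes on the message:

```python
from google.cloud import pubsub_v1, storage

PROJECT_ID = "my-project"            # placeholder
STAGING_BUCKET = "pipeline-staging"  # placeholder
TOPIC_ID = "pipeline-output"         # placeholder

storage_client = storage.Client()
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)


def publish_large_payload(object_name: str, payload: bytes) -> None:
    # 1. Write the (possibly >10 MB) payload to Cloud Storage.
    blob = storage_client.bucket(STAGING_BUCKET).blob(object_name)
    blob.upload_from_string(payload)

    # 2. Publish only the object reference; attribute values must be strings.
    future = publisher.publish(
        topic_path, b"", bucket=STAGING_BUCKET, name=object_name
    )
    future.result()  # block until the publish succeeds
```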
I set up my AWS workflow so that my Lambda function is triggered when a text file is added to my S3 bucket, and generally it has worked fine: when I upload a bunch of text files to the S3 bucket, a bunch of Lambdas run at the same time and process each text file.
But my issue is that occasionally, 1 or 2 files (out of 20k or so in total) did not trigger the Lambda function as expected. I have no idea why. When I checked the logs, it's NOT that the file was processed by the Lambda but failed; the logs showed that the Lambda was not triggered by those 1 or 2 files at all. I don't believe it's hitting the 1000 concurrent Lambda limit either, since my function runs quickly and the peak is around 200 Lambdas.
My question is: does AWS Lambda not guarantee it will be triggered 100% of the time? Like with S3, is there always an (albeit tiny) possibility of failure? If not, how can I debug and fix this issue?
You don't mention how long the Lambdas take to execute. The default limit of concurrent executions is 1000. If you are uploading files faster than they can be processed with 1000 Lambdas then you'll want to reach out to AWS support and get your limit increased.
Also from the docs:
Amazon S3 event notifications typically deliver events in seconds but can sometimes take a minute or longer. On very rare occasions, events might be lost.
If your application requires particular semantics (for example, ensuring that no events are missed, or that operations run only once), we recommend that you account for missed and duplicate events when designing your application. You can audit for missed events by using the LIST Objects API or Amazon S3 Inventory reports. The LIST Objects API and Amazon S3 inventory reports are subject to eventual consistency and might not reflect recently added or deleted objects.
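One way to implement that audit is to periodically list the bucket and diff it against your own record of processed objects. A rough sketch, where load_processed_keys() is a stand-in for whatever bookkeeping you already have (a DynamoDB table, your results store, logs):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-upload-bucket"  # placeholder


def load_processed_keys():
    # Hypothetical: return the set of keys your pipeline has already handled,
    # e.g. read from a DynamoDB table or your results store.
    return set()


def find_missed_objects(prefix=""):
    processed = load_processed_keys()
    missed = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"] not in processed:
                missed.append(obj["Key"])
    return missed  # re-drive these, e.g. by invoking the Lambda directly
```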
I'm trying to figure out if there is a service on GCP which would allow consuming a stream from Pub/Sub and dump/batch accumulated data to files in Cloud Storage (e.g. every X minutes). I know that this can be implemented with Dataflow, but looking for more "out of the box" solution, if any exists.
As an example, this is something one can do with AWS Kinesis Firehose - purely on configuration level - one can tell AWS to dump whatever is accumulated in the stream to files on S3, periodically, or when accumulated data reaches some size.
The reason for this is that, when no stream processing is required and I only need to accumulate data, I would like to minimize the additional costs of:
building a custom piece of software, even a simple one, if it can be avoided completely
consuming additional compute resources to execute it
To avoid confusion - I'm not looking for a free of charge solution, but the optimal one.
Google maintains a set of templates for Dataflow to perform common tasks between their services.
You can use the "Pubsub to Cloud Storage" template by simply plugging in a few config values - https://cloud.google.com/dataflow/docs/templates/provided-templates#cloudpubsubtogcstext
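If you want to launch the template programmatically instead of from the console or gcloud, something along these lines should work against the Dataflow templates API; the project, region, topic, and bucket values are placeholders, and the template parameter names should be double-checked against the linked page:

```python
from googleapiclient.discovery import build

PROJECT = "my-project"  # placeholder
REGION = "us-central1"  # placeholder

dataflow = build("dataflow", "v1b3")
response = (
    dataflow.projects()
    .locations()
    .templates()
    .launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath="gs://dataflow-templates/latest/Cloud_PubSub_to_GCS_Text",
        body={
            "jobName": "pubsub-to-gcs-text",
            "parameters": {
                # Parameter names as documented for the template (verify against the docs).
                "inputTopic": f"projects/{PROJECT}/topics/my-topic",
                "outputDirectory": "gs://my-bucket/pubsub-dumps/",
                "outputFilenamePrefix": "output-",
            },
        },
    )
    .execute()
)
print(response["job"]["id"])
```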
I'm creating a simple web app that needs to be deployed to multiple regions in AWS. The application requires some dynamic configuration which is managed by a separate service. When the configuration is changed through this service, I need those changes to propagate to all web app instances across all regions.
I considered using cross-region replication with DynamoDB to do this, but I do not want to incur the added cost of running DynamoDB in every region, nor of managing the replication. Then the thought occurred to me of using S3, which is inherently cross-region.
Basically, the configuration service would write all configurations to S3 as static JSON files. Each web app instance will periodically check S3 to see if any of the config files have changed since the last check, and download the new config if necessary. The configuration changes are not time-sensitive, so polling for changes every 5/10 mins should suffice.
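Something like this rough boto3 sketch is what I have in mind for the polling side; it compares the object's ETag so the config is only re-downloaded when it actually changed (bucket and key names are placeholders):

```python
import json
import time

import boto3

s3 = boto3.client("s3")
CONFIG_BUCKET = "my-config-bucket"  # placeholder
CONFIG_KEY = "app/config.json"      # placeholder
POLL_SECONDS = 600                  # poll every 10 minutes

_last_etag = None
config = {}


def refresh_config():
    """Reload the config only if the S3 object changed since the last check."""
    global _last_etag, config
    head = s3.head_object(Bucket=CONFIG_BUCKET, Key=CONFIG_KEY)
    if head["ETag"] != _last_etag:
        body = s3.get_object(Bucket=CONFIG_BUCKET, Key=CONFIG_KEY)["Body"].read()
        config = json.loads(body)
        _last_etag = head["ETag"]


if __name__ == "__main__":
    while True:
        refresh_config()
        time.sleep(POLL_SECONDS)
```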
Have any of you used a similar approach to manage app configurations before? Do you think this is a smart solution, or do you have any better recommendations?
The right tool for this configuration depends on the size of the configuration and the granularity at which you need to read it.
You can use both DynamoDB and S3 from a single region to serve your application in all regions. You can read a configuration file in S3 from all the regions, and you can read the configuration records from a single DynamoDB table from all the regions. There is some latency due to the distance around the globe, but for reading configuration it shouldn't be much of an issue.
If you need the whole set of configuration every time you load it, it might make more sense to use S3. But if different parts of your application need to read small parts of a large configuration at different times and on different schedules, it makes more sense to store it in DynamoDB.
In both options the cost of the configuration is tiny: a text file in S3 and a few GETs to that file should be almost free. The same low cost is expected with DynamoDB, as you probably have only a few KB of data and the number of reads per second is very low (5 read capacity units per second is more than enough). Even if you decide to replicate the data to all regions, it will still be almost free.
I have an application I wrote that works in exactly the manner you suggest, and it works great. As was pointed out, S3 is not 'inherently cross-region', but it is inherently durable across multiple availability zones, and that, combined with cross-region replication, should be more than sufficient.
In my case, my application is also not time-sensitive to config changes. Nonetheless, besides having the app poll on a regular basis (in my case once per hour or after every long-running job), I also have each application subscribed to SNS endpoints, so that when the config file changes on S3, an SNS event is raised and the applications are notified that a change occurred. In some cases the applications get the config changes right away, but if for whatever reason they are unable to process the SNS event immediately, they 'catch up' at the top of every hour, when the server reboots, or in the worst case by polling S3 for changes every 60 minutes.
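For completeness, wiring S3 up to publish those change notifications to SNS is a one-time bucket configuration, roughly like this with boto3 (bucket name and topic ARN are placeholders; the topic's access policy must also allow S3 to publish to it):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="my-config-bucket",  # placeholder
    NotificationConfiguration={
        "TopicConfigurations": [
            {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:config-changed",  # placeholder
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    # Only notify on changes under the config prefix.
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": "app/"}]}
                },
            }
        ]
    },
)
```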