I have an S3 bucket with different files. I need to read those files and publish an SQS message for each row in the file.
I cannot use S3 events, as the files need to be processed with a delay: put to SQS after a month.
I can write a scheduler to do this task, read and publish. But can I use AWS for this purpose?
AWS Batch, AWS Data Pipeline, or Lambda?
I need to pass the date (filename) of the data to be read and published.
Edit: The data volume to be dealt with is huge.
I can think of two ways to do this entirely using AWS serverless offerings without even having to write a scheduler.
You could use S3 events to start a Step Function that waits for a month before reading the S3 file and sending messages through SQS.
With a little more work, you could use S3 events to trigger a Lambda function which writes the messages to DynamoDB with a TTL of one month in the future. A second Lambda listens to the table's DynamoDB stream, and when the TTL expiry produces a delete event, it publishes the message to SQS. (A good introduction to this general strategy can be found here.)
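For the second approach, here is a minimal sketch of the stream-handler half, assuming the table's stream is configured to include old images; the queue URL is a placeholder, not something from the question:

```python
# Hypothetical sketch of the DynamoDB Streams consumer for the TTL approach.
# A month after the upload, TTL deletes the item; the stream delivers a REMOVE
# record and we forward the stored row to SQS. Assumes the stream view type
# includes OLD_IMAGE; the queue URL is a placeholder.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/delayed-rows"  # assumed

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "REMOVE":
            continue
        # Only act on TTL-driven deletions, not manual deletes.
        identity = record.get("userIdentity", {})
        if identity.get("principalId") != "dynamodb.amazonaws.com":
            continue
        # OldImage is still in DynamoDB attribute-value format.
        old_image = record["dynamodb"]["OldImage"]
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(old_image))
```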
While the second strategy might require more effort, you might find it less expensive than using Step Functions depending on the overall message throughput and whether or not the S3 uploads occur in bursts or in a smooth distribution.
At the core, you need to do two things:
Enumerate all of the objects in an S3 bucket, and perform some action on any object uploaded more than a month ago.
Can you use Lambda or Batch to do this? Sure. A Lambda could be set to trigger once a day, enumerate the files, and post the results to SQS.
Should you? No clue. A lot depends on your scale, and what you plan to do if it takes a long time to perform this work. If your S3 bucket has hundreds of objects, it won't be a problem. If it has billions, your Lambda will need to handle being interrupted and continue paging through files from a previous run.
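If you go the scheduled-Lambda route, a rough sketch of the daily enumerate-and-publish job might look like this; the bucket name, queue URL, and the one-message-per-file simplification are assumptions:

```python
# Hypothetical sketch of the once-a-day "enumerate and publish" Lambda.
# It pages through the bucket and queues a message for every object uploaded
# more than a month ago. Bucket and queue names are placeholders.
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

BUCKET = "my-bucket"  # assumed
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/old-files"  # assumed

def handler(event, context):
    cutoff = datetime.now(timezone.utc) - timedelta(days=30)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            if obj["LastModified"] < cutoff:
                sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=obj["Key"])
```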
Alternatively, you could use S3 events to trigger a simple Lambda that adds a row to a database. Then, again, some Lambda could run on a cron job that asks the database for old rows, and publishes that set to SQS for others to consume. That's slightly cleaner, maybe, and can handle scaling up to pretty big bucket sizes.
Or, you could do the paging through files, the deciding what to do, and the processing of old files all on a t2.micro if you just need to do some simple work on a few dozen files every day.
It all depends on your workload and needs.
Related
We have a simple ETL setup below:
The vendor uploads crawled Parquet data to our S3 bucket.
An S3 event triggers a Lambda function, which triggers a Glue crawler to update the existing table partitions in Glue.
This works fine most of the time, but in some cases our vendor might upload files consecutively within a short time period, for example when refreshing history data. This causes an issue, since the Glue crawler cannot run concurrently and the job will fail.
I'm wondering if there is anything we can do to avoid the potential error. I've looked into SQS but am not exactly sure if it can help me; below is what I would like to achieve:
The vendor uploads a file to S3.
S3 sends an event to SQS.
SQS holds the event and waits until there has been no other event for a given time period, say 5 minutes.
After no further events for 5 minutes, SQS triggers the Lambda function to run the Glue crawler.
Is this doable with S3 and SQS?
SQS holds the event,
Yes, you can do this, as you can set an SQS delivery delay of up to 15 minutes.
waits until there has been no other event for a given time period, say 5 minutes.
No, there is no automated way to do that. You have to develop your own custom solution. The most trivial way would be to not bundle SQS with Lambda, and instead have a Lambda running on a schedule (e.g. every 5 minutes). The Lambda would need logic to determine whether any new files have been uploaded recently and, if not, trigger your Glue crawler. This would probably involve DynamoDB to keep track of the last uploaded files between Lambda executions.
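A rough sketch of that scheduled Lambda, assuming a DynamoDB tracking item that your S3-triggered Lambda keeps updated with the latest upload time; the table, item key, and crawler name are placeholders:

```python
# Hypothetical sketch: a Lambda run on a 5-minute EventBridge schedule.
# It reads a DynamoDB item that an S3-triggered Lambda keeps updated with the
# timestamp of the most recent upload; if nothing has arrived for 5 minutes
# and the crawler is idle, it starts the crawler. All names are placeholders.
import time
import boto3

dynamodb = boto3.resource("dynamodb")
glue = boto3.client("glue")

TABLE_NAME = "upload-tracker"   # assumed
CRAWLER_NAME = "my-crawler"     # assumed
QUIET_PERIOD_SECONDS = 300

def handler(event, context):
    table = dynamodb.Table(TABLE_NAME)
    item = table.get_item(Key={"pk": "last_upload"}).get("Item")
    if not item:
        return "no uploads recorded yet"

    if time.time() - float(item["uploaded_at"]) < QUIET_PERIOD_SECONDS:
        return "uploads still arriving, waiting"

    if glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]["State"] != "READY":
        return "crawler already running"

    glue.start_crawler(Name=CRAWLER_NAME)
    return "crawler started"
```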
I have a project with the following workflow:
Pull files from a server and upload them to S3.
When files hit S3, a message is sent to a topic using SNS.
The Lambda function subscribed to said topic then processes the files by doing calculations.
So far, I have not experienced any issues, but I wonder if this is a use case for SQS?
How many messages can my Lambda function handle all at once? If I am pulling hundreds of files, is SQS a necessity?
By default, the parallel invocation limit is set to 1000.
You can change that limit, but I have never hit that number so far.
As soon as a Lambda is done consuming its current request, it is reused for another, so if you upload 1000 files you will probably only need about 100 Lambdas, unless a single Lambda takes minutes to run.
AWS handles the queued triggers, so even if you upload 100,000 files, they will be consumed as soon as possible, depending on various criteria.
You can test it by creating many little files and uploading them all at once :)
For higher speed, upload them to a different bucket and simply move them from bucket to bucket (this is faster).
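If you do want to inspect or cap concurrency yourself rather than relying on the defaults, a hedged sketch (the function name is a placeholder):

```python
# Hypothetical sketch: inspect the account-wide concurrency limit and reserve
# a slice of it for one function so a burst of uploads can't exhaust it.
# The function name is a placeholder.
import boto3

lam = boto3.client("lambda")

limits = lam.get_account_settings()["AccountLimit"]
print("Account concurrent executions:", limits["ConcurrentExecutions"])

lam.put_function_concurrency(
    FunctionName="process-uploaded-file",   # assumed
    ReservedConcurrentExecutions=100,
)
```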
Good luck!
The use case is that thousands of very small files are uploaded to S3 every minute, and all the incoming objects are to be processed and stored in a separate bucket using Lambda.
But using s3:ObjectCreated as a trigger will make many Lambda invocations, and concurrency needs to be taken care of. I am trying to batch-process the newly created objects every 5-10 minutes. S3 provides Batch Operations, but its reports are generated daily/weekly. Is there a service available that can help me?
According to the AWS documentation, S3 can publish "new object created" events to the following destinations:
Amazon SNS
Amazon SQS
AWS Lambda
In your case I would:
Create an SQS queue.
Configure the S3 bucket to publish new-object events to SQS.
Reconfigure your existing Lambda to subscribe to the SQS queue.
Configure batching for the incoming SQS events.
Currently, the maximum batch size for an SQS-Lambda subscription is 1000 events. But since your Lambda needs around 2 seconds to process a single event, you should start with something smaller; otherwise the Lambda will time out because it won't be able to process all of the events.
Thanks to this, uploading X items to S3 will produce roughly X / Y Lambda invocations, where Y is the SQS batch size. For 1000 S3 items and a batch size of 100, it will only need around 10 concurrent Lambda executions.
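A sketch of the last two steps, wiring the existing Lambda to the queue with batching configured; the queue ARN, function name, and chosen numbers are placeholders:

```python
# Hypothetical sketch of wiring the existing Lambda to the SQS queue with a
# smaller batch size plus a batching window, so events accumulate briefly
# before a single invocation. ARN, name, and numbers are placeholders.
import boto3

lam = boto3.client("lambda")

lam.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:new-objects",  # assumed
    FunctionName="process-new-objects",                               # assumed
    BatchSize=100,
    MaximumBatchingWindowInSeconds=60,
)
```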
The AWS documentation mentioned above explains how to publish S3 events to SQS; I won't cover it here, as it's more about implementation details.
Execution time
However, you might run into a problem where the processing is too slow, because the Lambda will probably process the events one by one in a loop.
The workaround would be to use asynchronous processing; the implementation depends on what runtime you use for the Lambda. For Node.js it would be very easy to achieve.
Also, if you want to speed up the processing in other ways, simply reduce the maximum batch size and increase the Lambda memory configuration, so a single execution processes a smaller number of events and has access to more CPU.
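The answer above mentions Node.js; as a rough Python analogue (the handler shape and helper name are assumptions, not from the answer), processing one SQS batch concurrently might look like this:

```python
# Hypothetical sketch: process the S3 records inside one SQS batch with a
# thread pool instead of one by one, so a large batch fits in the timeout.
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import unquote_plus

def process_object(bucket, key):
    # placeholder for the real per-object work (copy, transform, store, ...)
    ...

def handler(event, context):
    futures = []
    with ThreadPoolExecutor(max_workers=16) as pool:
        for sqs_record in event["Records"]:
            s3_event = json.loads(sqs_record["body"])
            for s3_record in s3_event.get("Records", []):
                bucket = s3_record["s3"]["bucket"]["name"]
                key = unquote_plus(s3_record["s3"]["object"]["key"])
                futures.append(pool.submit(process_object, bucket, key))
    # surface any per-object failure so the batch gets retried
    for f in futures:
        f.result()
```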
I set up my AWS workflow so that my Lambda function is triggered when a text file is added to my S3 bucket, and generally it works fine: when I upload a bunch of text files to the S3 bucket, a bunch of Lambdas run at the same time and process each text file.
But my issue is that occasionally 1 or 2 files (out of 20k or so in total) do not trigger the Lambda function as expected. I have no idea why. When I checked the logs, it's NOT that the file was processed by the Lambda but failed; the logs show that the Lambda was not triggered by those 1 or 2 files at all. I also don't believe it's hitting the 1000-concurrent-Lambda limit, since my function runs fast and the peak is around 200 Lambdas.
My question is: is this because AWS Lambda does not guarantee it will be triggered 100% of the time? Like S3, is there always an (albeit tiny) possibility of failure? If not, how can I debug and fix this issue?
You don't mention how long the Lambdas take to execute. The default limit of concurrent executions is 1000. If you are uploading files faster than they can be processed with 1000 Lambdas then you'll want to reach out to AWS support and get your limit increased.
Also from the docs:
Amazon S3 event notifications typically deliver events in seconds but can sometimes take a minute or longer. On very rare occasions, events might be lost.
If your application requires particular semantics (for example, ensuring that no events are missed, or that operations run only once), we recommend that you account for missed and duplicate events when designing your application. You can audit for missed events by using the LIST Objects API or Amazon S3 Inventory reports. The LIST Objects API and Amazon S3 inventory reports are subject to eventual consistency and might not reflect recently added or deleted objects.
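A hedged sketch of the audit the docs suggest, comparing a bucket listing against your own record of processed keys; the DynamoDB table and bucket name are purely assumed examples:

```python
# Hypothetical sketch: list the bucket and flag objects that never made it
# into your own record of processed keys, so lost events can be re-driven.
# The DynamoDB table and bucket name are assumed examples.
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("processed-keys")  # assumed

def find_missed(bucket):
    missed = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if "Item" not in table.get_item(Key={"key": obj["Key"]}):
                missed.append(obj["Key"])
    return missed

if __name__ == "__main__":
    print(find_missed("my-upload-bucket"))  # assumed bucket
```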
I have a situation where I need to poll an AWS S3 bucket for new files.
Also, it's not just one bucket; there are ~1000+ buckets, and these buckets could have a lot of files.
What are the usual strategies/designs for such a use case? I need to consume new files on each poll. I cannot delete files from the bucket.
Instead of polling, you should subscribe to S3 event notifications: http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
These can be delivered to an SNS topic, an SQS queue, or trigger a Lambda function.
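For example, a minimal sketch of enabling new-object notifications to an SQS queue for one bucket; the bucket name and queue ARN are placeholders, and the queue's access policy must also allow S3 to send messages:

```python
# Hypothetical sketch: enable "new object" notifications for one bucket,
# delivered to an SQS queue. Bucket name and queue ARN are placeholders;
# the queue's access policy must also allow s3.amazonaws.com to SendMessage.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="my-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:us-east-1:123456789012:new-files",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```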
Well, in order to best answer that question we would need to know what kind of application/architecture is doing the polling and consuming. However, the 'AWS' way to do this is to have S3 send out a notification upon creation of each file. The S3 notification contains a reference to the S3 object and can go out to SNS, SQS, or (even better) Lambda, which then triggers the application to spin up, consume the file, and shut down.
Now, if you're going to have a LOT of files, all of those SNS/SQS notifications could get costly, and some might then start looking at continuously polling S3 with the S3 SDK/CLI. However, keep in mind that there are costs associated with polling as well, and you should look at ways to decrease the number of files: for example, if you're using Kinesis Firehose to dump into S3, look at batching, or batch the SQS messages. Try your best to stick with event notifications; it's much more resilient.