In our S3 buckets we have a folder where incoming files are placed. One of our systems then picks them up and processes them.
I want to know how many files in this folder are older than a given period and then send a notification to the corresponding team.
For example, if a file arrived in the S3 bucket today and is still there after 3 hours, I want to be notified.
I am thinking of using the boto Python library to iterate through all the objects in the S3 bucket at a scheduled interval, check which files are still in the folder, and then send a notification. However, this polling solution doesn't seem good.
I would prefer an event-based solution. I know S3 has events that I can subscribe to using either a queue or Lambda. However, I don't want to take any action as soon as a file becomes available; I just want to check which files are older than a given age and send an email notification.
Can we achieve this with an event-based solution?
We are expecting around 1,000 files per hour. Once a file is processed it is moved to a different folder; however, if something goes wrong it stays where it is. So I am not expecting more than 10,000 files in one bucket in a day. Consider that I have multiple buckets.
Iterating through S3 files to do that kind of filtering is not a good idea. It can get very slow when you have more than a thousand files in there. I would suggest using a database to store those records.
You can have a DynamoDB table with 2 columns: file name and upload date. Or, if budget is a problem, you can even keep a sqlite3 file in the bucket and fetch it whenever you need to query or add data to it. I did this using Lambda, and it works just fine. Just don't forget to upload the file again when new records are inserted.
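A minimal sketch of the sqlite3 variant, assuming a single table with the two columns mentioned above. The table name `incoming_files` and the helper names are hypothetical, and downloading the database file from the bucket and uploading it again is left out:

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def open_tracking_db(path=":memory:"):
    """Open (or create) the tracking database: one row per unprocessed file."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS incoming_files ("
        "file_name TEXT PRIMARY KEY, uploaded_at REAL NOT NULL)"
    )
    return conn


def record_upload(conn, file_name, uploaded_at):
    """Store the upload time as epoch seconds so comparisons stay numeric."""
    conn.execute(
        "INSERT OR REPLACE INTO incoming_files VALUES (?, ?)",
        (file_name, uploaded_at.timestamp()),
    )
    conn.commit()


def record_processed(conn, file_name):
    """Forget a file once it has been processed and moved away."""
    conn.execute("DELETE FROM incoming_files WHERE file_name = ?", (file_name,))
    conn.commit()


def files_older_than(conn, max_age, now=None):
    """Return the names of files uploaded more than max_age ago, oldest first."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - max_age).timestamp()
    rows = conn.execute(
        "SELECT file_name FROM incoming_files "
        "WHERE uploaded_at < ? ORDER BY uploaded_at",
        (cutoff,),
    )
    return [name for (name,) in rows]
```

With this layout, the "older than 3 hours" check becomes `files_older_than(conn, timedelta(hours=3))` instead of a scan over the bucket.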
You could create an Amazon CloudWatch Event rule that triggers an AWS Lambda function at a desired time interval (eg every 5 minutes or once an hour).
The AWS Lambda function could list the desired folder looking for files older than a desired time period. It would be something like this:
import boto3
from datetime import datetime, timedelta, timezone

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(
    Bucket='my-bucket',
    Prefix='to-be-processed/'
)

# Anything last modified before this moment counts as too old
cutoff = datetime.now(tz=timezone.utc) - timedelta(hours=3)

for page in page_iterator:
    # 'Contents' is absent when a page has no matching objects
    for obj in page.get('Contents', []):
        if obj['LastModified'] < cutoff:
            # Print name of object older than given age
            print(obj['Key'])
You could then have it notify somebody. The easiest way would be to send a message to an Amazon SNS topic, and then people can subscribe to that topic via SMS or email to receive a notification.
The above code is quite simple in that it will find the same files on every run, not just the files that have newly crossed the age threshold since the previous notification.
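One way to soften that limitation and add the SNS step could look like the sketch below. It assumes the check runs at a fixed interval, so a file is only reported on the run in which it first crosses the age threshold; a delayed or skipped run would miss files, so treat it as an illustration rather than a guarantee. The function names are hypothetical, and `sns_client` is expected to be a `boto3.client('sns')`:

```python
from datetime import datetime, timedelta, timezone


def select_newly_stale(objects, max_age, check_interval, now=None):
    """Return keys that became older than max_age since the previous run.

    `objects` is an iterable of dicts shaped like the entries of
    page['Contents'] from list_objects_v2 ('Key' and 'LastModified').
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - max_age
    window_start = cutoff - check_interval
    return [
        obj["Key"]
        for obj in objects
        if window_start <= obj["LastModified"] < cutoff
    ]


def notify_stale(sns_client, topic_arn, bucket, stale_keys, max_listed=50):
    """Publish one message listing the stale keys; return False if none."""
    if not stale_keys:
        return False
    lines = list(stale_keys[:max_listed])
    if len(stale_keys) > max_listed:
        # Keep the message small; SNS limits the size of a message
        lines.append(f"... and {len(stale_keys) - max_listed} more")
    sns_client.publish(
        TopicArn=topic_arn,
        Subject=f"{len(stale_keys)} unprocessed file(s) in {bucket}",
        Message="\n".join(lines),
    )
    return True
```

Anyone subscribed to the topic by email or SMS would then receive a single message per run instead of one per file.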
Related
I have an S3 bucket that sends event notifications for new objects to SQS. Event notifications are filtered to one folder.
I want to simulate an upload of a large number of files at the same time. The problem is I need to upload faster. The fastest I got was to upload to another folder in the same s3 bucket and move the folder into the one with the trigger. It still copies files one by one.
Another thing I tried is:
disable event notification
copy files into the target folder
enable event notification
copy each file into itself (which causes the last modified date change and triggers an event notification)
Is there something faster? Or can we change the last modified date and trigger an event notification without copying?
I'm aware I can generate SQS events programmatically, but I want to do some real testing.
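For reference, the "copy each file into itself" step can be sketched as below, run from a thread pool so the copies are not issued strictly one after another. This is only an illustration of that step, not a confirmed faster method. It assumes `s3_client` is a `boto3.client('s3')`, that the notification covers copy events, and that overwriting the objects' user metadata is acceptable; the metadata value and function names are made up:

```python
from concurrent.futures import ThreadPoolExecutor


def touch_object(s3_client, bucket, key):
    """Copy an object onto itself so it gets a new LastModified timestamp."""
    s3_client.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        # S3 rejects a self-copy that changes nothing, so replace the
        # metadata. Note that REPLACE discards the existing user metadata.
        Metadata={"touched-by": "load-test"},
        MetadataDirective="REPLACE",
    )
    return key


def touch_many(s3_client, bucket, keys, max_workers=32):
    """Issue the self-copies concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda k: touch_object(s3_client, bucket, k), keys))
```

The copies happen server-side, so no object data passes through the machine running the script.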
I am working on a feature where a user can upload multiple files which need to be parsed and converted to PDF if required. For that, I'm using AWS and when the user selects N files for upload then the following happens:
The client browser is connected to an AWS WebSocket API which is responsible for sending back the parsed data to respective clients later.
A signed URL for S3 is obtained from the web server, and all of the user's files are uploaded to an S3 bucket using it.
As soon as each file is uploaded, a lambda function is triggered for it which fetches the object for that file in order to get the content and some metadata to associate the files with respective clients.
Once the files are parsed, the response data is sent back to the respective connected clients via the WebSocket and the browser JS catches the event data and renders it.
The issue I'm facing here is that the lambda function randomly times out at the line which fetches the object of the file (either just head_object or get_object). This is happening for roughly 50% of the files (Usually I test by just sending 15 files at once and 6-7 of them fail)
import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    # Object keys in S3 events are URL-encoded
    key = urllib.parse.unquote_plus(event["Records"][0]["s3"]["object"]["key"], encoding="utf-8")
    response = s3.get_object(Bucket=bucket, Key=key)  # This or head_object gets stuck for 50% of the files
What I have observed is that even if head_object or get_object is called for a file that already exists on S3, rather than for the file whose upload triggered the Lambda, it still times out at the same rate.
But if the objects are fetched in bulk via some local script using boto3 then they are fetched under a second for 15 files.
I have also tried using my own AWS Access ID and Secret key in lambda to avoid any issue caused by the temporarily generated keys.
So it seems that the multiple lambda instances are having trouble in getting the S3 file objects in parallel, which shouldn't happen though as AWS is supposed to scale well.
What should be done to get around it?
I have an S3 bucket into which clients drop data files (CSV files) each month. I was wondering whether there is a way to automatically create a new "folder" (object) every time the files are dropped each month and put the newest files into that "folder". I need the CSV files separated by month so that AWS Glue can create new partitions when I run incremental crawlers on this bucket.
For example, let's say I have a S3 bucket called "client." On December 1st, a new CSV file ("DecClientData") will be dropped into that "client" bucket. I want to know if there is a way to automate the following two processes:
Create a "folder" (let's call it "dec") within "client".
Place the "DecClientData" file in the "dec" "folder".
Thanks in advance for any assistance you can provide!
S3 doesn't have the notion of folders commonly found in file systems but instead has a flat structure, more details can be found here.
Instead, the full path of an object is stored in its Key (filename). For example, an object can be stored in Amazon S3 with a Key of files/2020-12/data.txt regardless of whether files and 2020-12 "directories" exist (S3 has no real directories; a folder created in the console is just a zero-length object whose key ends in a slash).
In your case, to solve both points you are mentioning, you should leverage S3 event notifications and use them as a Lambda trigger. When the Lambda function is triggered, it is passed the name of the object (Key) in the event, and at that point you can give it a new Key. S3 has no rename operation, so this means copying the object to the new Key and deleting the original.
For example, an object is uploaded to s3://my_bucket/uploads/file.txt, which creates an event notification that triggers a Lambda function. The function gets the object and re-uploads it to s3://my_bucket/files/dec/file.txt (and deletes the original one).
Write an AWS Lambda function to create a folder in the client bucket and move the most recent .csv file (or files) in the new folder.
Then, configure the client S3 bucket to trigger the AWS Lambda function on new uploads through the event notification settings.
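A rough sketch of such a function is below. It assumes the trigger is restricted to an `uploads/` prefix so the moved copy does not fire the function again, derives the month from the event's `eventTime` field, and uses a `files/YYYY-MM/` layout rather than month names. The prefixes and helper names are hypothetical; `s3_client` would be a `boto3.client('s3')`:

```python
import urllib.parse
from datetime import datetime


def monthly_key(key, event_time, dest_prefix="files"):
    """Build a key such as 'files/2020-12/data.csv' from the upload time."""
    file_name = key.rsplit("/", 1)[-1]
    return f"{dest_prefix}/{event_time:%Y-%m}/{file_name}"


def move_uploads(event, s3_client, source_prefix="uploads/"):
    """Copy each newly uploaded object under a month prefix, then delete it."""
    moved = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 events are URL-encoded
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        if not key.startswith(source_prefix):
            continue  # ignore anything outside the upload area
        event_time = datetime.strptime(
            record["eventTime"], "%Y-%m-%dT%H:%M:%S.%fZ"
        )
        new_key = monthly_key(key, event_time)
        s3_client.copy_object(
            Bucket=bucket,
            Key=new_key,
            CopySource={"Bucket": bucket, "Key": key},
        )
        s3_client.delete_object(Bucket=bucket, Key=key)
        moved.append(new_key)
    return moved
```

No separate "create folder" call is needed: writing an object under `files/2020-12/` is enough for that prefix to show up as a folder.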
I have an ETL application which is supposed to migrate to AWS infrastructure. The scheduler used in my application is Tivoli Workload Scheduler (TWS), and we want to use the same scheduler, with its file dependencies, in the cloud as well.
When we move to AWS, the files to be watched will land in an S3 bucket. Can we put the OPEN dependency on files in S3? If yes, what would be the hostname (HOST#Filepath)?
If not, what services should be used to serve the purpose? I have both time and file dependencies in my schedules.
E.g. the file might get uploaded to S3 at 1 AM. At 3 AM my schedule will be triggered and will look for the file in the S3 bucket. If it is present, execution starts; if not, it should wait as per the other parameters in TWS.
Any help or advice would be nice to have.
If I understand this correctly, the job triggered at 3 AM will identify all files uploaded within the last, e.g., 24 hours.
You can list all the objects in the S3 bucket and filter on their last-modified time to find everything uploaded within a specific period.
A better solution would be to create an S3 upload trigger that sends the information to SQS, and have your code inspect the queue depth (number of messages) and start processing the files one by one. An additional benefit is the assurance that all items are processed without having to worry about time overlaps.
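As a rough illustration of the SQS side, the sketch below checks the approximate queue depth and then works through the S3 notifications one by one. It assumes the bucket sends its event notifications straight to the queue, so each message body is the S3 event JSON. `handle_file` is a hypothetical callback standing in for the scheduled job's own logic, and `sqs_client` would be a `boto3.client('sqs')`:

```python
import json
import urllib.parse


def queue_depth(sqs_client, queue_url):
    """Approximate number of messages currently waiting in the queue."""
    attrs = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])


def keys_from_message(body):
    """Extract (bucket, key) pairs from one S3 notification body."""
    payload = json.loads(body)
    # Messages without 'Records' (such as the S3 test event) yield nothing
    return [
        (r["s3"]["bucket"]["name"],
         urllib.parse.unquote_plus(r["s3"]["object"]["key"]))
        for r in payload.get("Records", [])
    ]


def process_pending(sqs_client, queue_url, handle_file):
    """Drain the queue, calling handle_file(bucket, key) for each object."""
    processed = 0
    while True:
        resp = sqs_client.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=1
        )
        messages = resp.get("Messages", [])
        if not messages:
            return processed
        for msg in messages:
            for bucket, key in keys_from_message(msg["Body"]):
                handle_file(bucket, key)
                processed += 1
            # Delete only after the message has been handled
            sqs_client.delete_message(
                QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
            )
```

A 3 AM job could call `queue_depth` first and either start `process_pending` or keep waiting when the depth is zero.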
I have a large number of logfiles from a service that I need to regularly run analysis on via EMR/Hive. There are thousands of new files per day, and they can technically come out of order relative to the file name (e.g. a batch of files comes a week after the date in the file name).
I did an initial load of the files via Snowball, then set up a script that syncs the entire directory tree once per day using the 'aws s3 sync' cli command. This is good enough for now, but I will need a more realtime solution in the near future. The issue with this approach is that it takes a very long time, on the order of 30 minutes per day. And using a ton of bandwidth all at once! I assume this is because it needs to scan the entire directory tree to determine what files are new, then sends them all at once.
A realtime solution would be beneficial in 2 ways. One, I can get the analysis I need without waiting up to a day. Two, the network use would be lower and more spread out, instead of spiking once a day.
It's clear that 'aws s3 sync' isn't the right tool here. Has anyone dealt with a similar situation?
One potential solution could be:
Set up a service on the log-file side that continuously syncs (or aws s3 cp) new files based on the modified date. But wouldn't that need to scan the whole directory tree on the log server as well?
For reference, the log-file directory structure is like:
/var/log/files/done/{year}/{month}/{day}/{source}-{hour}.txt
There is also a /var/log/files/processing/ directory for files being written to.
Any advice would be appreciated. Thanks!
You could have a Lambda function triggered automatically as a new object is saved on your S3 bucket. Check Using AWS Lambda with Amazon S3 for details. The event passed to the Lambda function will contain the file name, allowing you to target only the new files in the syncing process.
If you'd like to wait until you have, say, 1,000 files in order to sync in batches, you could use AWS SQS and the following workflow (using 2 Lambda functions, 1 CloudWatch rule and 1 SQS queue):
S3 invokes Lambda whenever there's a new file to sync
Lambda stores the filename in SQS
CloudWatch triggers another Lambda function every X minutes/hours to check how many files there are in SQS for syncing. Once there are 1,000 or more, it retrieves those filenames and runs the syncing process.
Keep in mind that Lambda has a hard timeout (currently a maximum of 15 minutes). If your sync job takes too long, you'll need to break it into smaller chunks.
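The two functions in this workflow could be sketched roughly as follows. The handler names, the `sync_file` callback and the 30-second safety margin are assumptions made for illustration; `sqs_client` would be a `boto3.client('sqs')` and `context` the object Lambda passes to the handler. The second function stops pulling messages when the invocation is close to its timeout, which is one way of breaking the job into chunks:

```python
import urllib.parse


def enqueue_handler(event, sqs_client, queue_url):
    """First Lambda: store the key of each new S3 object in the queue."""
    keys = [
        urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        for record in event["Records"]
    ]
    for key in keys:
        sqs_client.send_message(QueueUrl=queue_url, MessageBody=key)
    return keys


def batch_handler(context, sqs_client, queue_url, sync_file,
                  threshold=1000, safety_margin_ms=30_000):
    """Second Lambda: sync queued files once enough have accumulated."""
    attrs = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    if int(attrs["Attributes"]["ApproximateNumberOfMessages"]) < threshold:
        return 0  # not enough files yet; try again on the next schedule

    synced = 0
    # Stop early so the function is never killed in the middle of a file;
    # whatever is left stays in the queue for the next invocation.
    while context.get_remaining_time_in_millis() > safety_margin_ms:
        resp = sqs_client.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=10
        )
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            sync_file(msg["Body"])
            sqs_client.delete_message(
                QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
            )
            synced += 1
    return synced
```

Messages are deleted only after `sync_file` returns, so a file whose sync fails becomes visible in the queue again and is retried later.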
You could set the bucket up to log HTTP requests to a separate bucket (S3 server access logging), then parse the logs to look for newly created files and their paths. One trouble spot: as well as PUT requests, you have to look for multipart uploads, which are started and completed with POST requests. It is best to log for a few days to see what gets created before putting any effort into this approach.
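A minimal sketch of the log-parsing step, assuming the standard space-delimited server access log layout (bucket owner, bucket, bracketed timestamp, remote IP, requester, request ID, operation, key, quoted request line, HTTP status, ...). The set of operation names treated as "file created" is an assumption and should be checked against a few days of real logs, as suggested above:

```python
import re

# Leading fields of one access log record; the rest of the line is ignored.
LOG_LINE = re.compile(
    r'^\S+ (?P<bucket>\S+) \[(?P<time>[^\]]+)\] \S+ \S+ \S+ '
    r'(?P<operation>\S+) (?P<key>\S+) "[^"]*" (?P<status>\S+)'
)

# Assumed: a plain PUT and the completion of a multipart upload.
CREATE_OPERATIONS = {"REST.PUT.OBJECT", "REST.POST.UPLOAD"}


def created_keys(log_lines):
    """Yield (bucket, key) for each successful object-creating request.

    Keys are returned exactly as they appear in the log, which may be
    URL-encoded.
    """
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not look like log records
        if match["operation"] in CREATE_OPERATIONS and match["status"] == "200":
            yield match["bucket"], match["key"]
```

Feeding it the lines of each downloaded log object gives the list of new paths to sync, without scanning the directory tree.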