I've got a few S3 buckets that I'm using as a storage backend for Duplicacy, which stores its metadata in chunks right alongside the backup data.
I currently have a lifecycle rule to move all objects with the prefix "chunks/" to Glacier Deep Archive. The problem is, I then can't list the contents of a backup revision, because some of those chunks contain backup metadata that's needed to list, initiate a restore, etc.
The question is: is there a method where I could apply some tag to certain objects such that, even though they are under the "chunks/" prefix, they are exempt from the lifecycle rule?
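For concreteness, the current rule looks roughly like this in boto3 (the bucket name and the 30-day delay are placeholders; as noted in the comment, lifecycle filters can combine a prefix with tags only to narrow a rule, which is why a tag-based exemption isn't straightforward):

import boto3

s3 = boto3.client("s3")

# The rule that causes the problem: everything under chunks/ -- including the
# chunks that hold backup metadata -- transitions to Deep Archive.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-duplicacy-bucket",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-chunks",
                "Status": "Enabled",
                "Filter": {"Prefix": "chunks/"},
                # A Filter can also be {"And": {"Prefix": ..., "Tags": [...]}},
                # but that only narrows the rule to tagged objects; there is no
                # way to exclude tagged objects from a prefix-based rule.
                "Transitions": [{"Days": 30, "StorageClass": "DEEP_ARCHIVE"}],
            }
        ]
    },
)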
Looking for a solution to basically the same problem.
I've seen this, which seems consistent with what I'm finding: it can't be done in a straightforward fashion. It's a few years old, though; I'll be disappointed if this is the answer.
I expected to see the exclude use case in these examples, but no luck.
I'm trying to transfer files from an onsite Drobo to S3 Glacier Deep Archive. Because of the way S3 stores things in Deep Archive, it never makes sense to archive objects that are 8 KB or smaller (because you will pay for 8 KB of Standard storage for them anyway). Lifecycle rules are not smart enough to handle this logic, so I wrote a Lambda. However, I'm not sure what trigger to use. Right now, this Lambda only responds to ObjectCreated:Put events, which works fine for my simple online testing, but I suspect it may not work when I'm doing the transfer with a Snowcone or Snowball. The Lambda itself then causes an ObjectCreated:Copy event if it archives the file.
So in order to get this to work with Snowcone/Snowball, it'd be nice to know: what event is generated when the files are transferred off those devices into S3? I've contemplated just using DynamoDB and pushing archived filenames into a table so I have a reference, but that seems unnecessary if I can get firm guidance. Another option is to be brutish about it and simply force-archive on every event received, because as far as I can tell, it is as expensive to query the current storage class of the object as it is to attempt the change in storage class.
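For reference, a stripped-down sketch of the Lambda described above, assuming a plain S3 event notification trigger (the names, the threshold handling, and the lack of error handling are illustrative only, not the exact implementation):

import urllib.parse

import boto3

s3 = boto3.client("s3")

ARCHIVE_THRESHOLD_BYTES = 8 * 1024  # objects at or below this stay in Standard


def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)

        if size <= ARCHIVE_THRESHOLD_BYTES:
            continue  # too small to be worth archiving

        # Re-copy the object onto itself with a new storage class. This is
        # what produces the ObjectCreated:Copy event mentioned above.
        # (copy_object is limited to objects up to 5 GB.)
        s3.copy_object(
            Bucket=bucket,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
            StorageClass="DEEP_ARCHIVE",
            MetadataDirective="COPY",
        )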
I've checked all the docs, including the 184-page Snowcone User Guide PDF. This blog article suggests that the Put and Post events refer back to HTTP, but I don't think the Snow family existed at the time it was written. I tweeted at Jeff Barr and haven't heard back yet. Anyone have actual experience with these devices?
As per best practice, AWS resources should be separated per account (prod, stage, ...), and it's also good to give devs their own accounts with defined limits (budget, region, ...).
I'm now wondering how I can create a fully working dev environment, especially when it comes to S3 buckets.
Most of the services are pay-per-use, so it's totally fine to spin up some Lambdas, SQS queues, etc. and use the real services for dev.
Now to the real question: what should be done with static assets like pictures, downloads, and so on, which are stored in S3 buckets?
Duplicating those buckets for every dev/environment could get expensive, as you pay for storage and/or data transfer.
What I thought of was giving the dev's S3 bucket a redirect rule, so that when a file is not found (i.e. a 404) in the dev bucket, the request is redirected to the prod bucket and images etc. are retrieved from there.
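Roughly, that routing rule can be set up like this via boto3, assuming both buckets use S3 static website hosting (bucket and host names are placeholders; routing rules only apply to the website endpoint, not the REST endpoint):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_website(
    Bucket="my-app-assets-dev",  # hypothetical dev bucket
    WebsiteConfiguration={
        "IndexDocument": {"Suffix": "index.html"},
        "RoutingRules": [
            {
                # When the dev bucket returns a 404 ...
                "Condition": {"HttpErrorCodeReturnedEquals": "404"},
                # ... send the client to the prod bucket's website endpoint.
                "Redirect": {
                    "HostName": "my-app-assets-prod.s3-website-eu-west-1.amazonaws.com",
                    "Protocol": "http",
                },
            }
        ],
    },
)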
I have tested this and it works pretty well, but it solves only part of the problem.
The other part is: how do we replace those files in a convenient way?
Currently, static assets and downloads are also in our Git repo (maybe not the best idea after all ... though how else do you handle file changes that should go live with new features? For now it's convenient to have them in Git as well), and when someone changes something, they push it and it gets deployed to prod.
We could of course sync the newly uploaded files from the dev's S3 bucket back to the prod bucket, but how do we combine this with merge requests and have a good CI/CD experience?
What are your solutions for giving every dev their own S3 buckets, so that they can spin up their own completely working dev environment with everything available to them?
My experience is that you don't want to complicate things just to save a few dollars. S3 costs are pretty cheap, so if you're just talking about website assets, like HTML, CSS, JavaScript, and some images, then you're probably going to spend more time creating, managing, and troubleshooting a solution than you'll save. Time is, after all, your most precious resource.
If you do have large items that need to be stored to make your system work, then maybe put a lifecycle policy on those large items and delete them after some reasonable amount of time. If/when a dev needs such an object, they can retrieve it again from its source and upload it again manually. You could write a script to do that pretty easily.
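As a rough sketch of that approach (the bucket name, prefix, 30-day window, and file paths are all placeholders, not a recommendation):

import boto3

s3 = boto3.client("s3")

# Expire objects under a "large-assets/" prefix in the dev bucket after 30 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-app-dev-bucket",  # hypothetical
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-large-dev-assets",
                "Status": "Enabled",
                "Filter": {"Prefix": "large-assets/"},
                "Expiration": {"Days": 30},
            }
        ]
    },
)

# When a dev needs an expired object again, re-upload it from its source.
s3.upload_file("local/path/video.mp4", "my-app-dev-bucket", "large-assets/video.mp4")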
We have an AWS S3 bucket that is used for storing a large number of files, about 50,000 a day and about 5-10 GB.
Currently we've got lifecycle rules to clear the files out after 2 days. We now need to keep these files for longer (1 year). The names are unique (they start with a GUID) and we're comfortable with the cost implications.
The question I have is whether insert or retrieval performance will be affected at all.
We don't list the contents of the bucket (obviously that would be slower). The AWS documentation is vague but seems to imply that there will be no change, but I wonder if anyone has any real-world observations.
Based on my experience, I don't believe there will be any implications, as long as you are not listing the contents of the bucket in order to find the object that you want (and you said you are not).
If you already know the object key when you are about to GET it, then there should be zero performance implications.
There will be no implications for write or read performance.
The lifecycle expiration actions are queued as a background job and executed over time, rather than happening immediately at the expiry time.
Remember that your S3 bucket is distributed across a large pool of nodes that are shared within the region, so any performance issue would affect many other tenants.
The delete actions do not count towards your quotas either.
I am currently using S3 to store large quantities of account-level data such as images, text files, and other forms of durable content that users upload in my application. I am looking to take an incremental snapshot of this data (once per week) and ship it off to another S3 bucket. I'd like to do this in order to protect against accidental data loss, i.e. one of our engineers accidentally deleting a chunk of data in the S3 browser.
Can anyone suggest a methodology for achieving this? Would we need to host our own backup application on an EC2 instance? Is there an application that will handle this out of the box? The data can go into S3 Glacier and doesn't need to be readily accessible; it's more of an insurance policy than anything else.
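For concreteness, a bare-bones sketch of the weekly incremental copy described above could look like this in boto3 (bucket names are placeholders; it only picks up new or changed objects, not deletions, and copy_object is limited to objects up to 5 GB):

from datetime import datetime, timedelta, timezone

import boto3

SOURCE_BUCKET = "my-app-user-content"      # hypothetical
BACKUP_BUCKET = "my-app-user-content-bak"  # hypothetical

s3 = boto3.client("s3")
cutoff = datetime.now(timezone.utc) - timedelta(days=7)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SOURCE_BUCKET):
    for obj in page.get("Contents", []):
        if obj["LastModified"] < cutoff:
            continue  # unchanged since the last weekly run
        s3.copy_object(
            Bucket=BACKUP_BUCKET,
            Key=obj["Key"],
            CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
            StorageClass="GLACIER",  # the backup doesn't need to be readily accessible
        )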
EDIT 1
I believe switching on versioning may be the answer (continuing to research this):
http://docs.amazonwebservices.com/AmazonS3/latest/dev/Versioning.html
EDIT 2
For others looking for answers to this question, there's a good thread on ServerFault that I only came across later:
https://serverfault.com/questions/9171/aws-s3-bucket-backups
Enabling versioning on your bucket is the right solution. It protects against both accidental deletes and accidental overwrites.
There's a question on the S3 FAQ, under "Data Protection", that discusses exactly this issue (accidental deletes/overwrites): http://aws.amazon.com/s3/faqs/#Why_should_I_use_Versioning
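For completeness, a minimal boto3 sketch of turning versioning on (the bucket name is a placeholder):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_versioning(
    Bucket="my-app-user-content",  # hypothetical
    VersioningConfiguration={"Status": "Enabled"},
)

# With versioning on, a delete only adds a delete marker; previous versions
# remain and can be listed and restored via list_object_versions. If you want
# the Glacier angle from the question, a lifecycle rule with
# NoncurrentVersionTransitions can move old versions to a Glacier storage class.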
In order to support millions of potential images we have previously followed this sort of directory structure:
/profile/avatars/44/f2/47/48px/44f247d4e3f646c66d4d0337c6d415eb.jpg
The filename is MD5-hashed, then we extract the first six characters of the hash and build the folder structure from that.
So in the above example the filename:
44f247d4e3f646c66d4d0337c6d415eb.jpg
produces a directory structure of:
/44/f2/47/
We always did this in order to minimize the number of photos in any single directory, ultimately to aid filesystem performance.
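For concreteness, a small sketch of that hashing scheme (the function name and the size segment are just illustrative):

import hashlib


def avatar_key(original_name: str, size: str = "48px") -> str:
    """Build a key like profile/avatars/44/f2/47/48px/44f247...415eb.jpg"""
    digest = hashlib.md5(original_name.encode("utf-8")).hexdigest()
    # First six hex characters become three nested "directories".
    a, b, c = digest[0:2], digest[2:4], digest[4:6]
    return f"profile/avatars/{a}/{b}/{c}/{size}/{digest}.jpg"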
However, our new app is using Amazon S3 with CloudFront.
My understanding is that any folders you create on Amazon S3 are actually just references and are not directories on the filesystem.
If that is correct, is it still recommended to split objects into folders/directories using the above or a similar method? Or can we simply remove this complexity from our application code and provide image links like so:
/profile/avatars/48px/filename.jpg
Bearing in mind that this app is intended to serve tens of millions of photos.
Any guidance would be greatly appreciated.
Although S3 folders are basically just another way of writing the key name (as @E.J.Brennan already said in his answer), there are reasons to think about the naming structure of your "folders".
With your current number of photos and probably your access patterns, it might make sense to think about a way to speed up the S3 keyname lookups, making sure that operations on photos get spread out over multiple partitions. There is a great article on the AWS blog explaining all the details.
You don't need to set up that structure on S3 unless you are doing it for your own convenience. All of the folders you create on S3 are really just an illusion for you; the files are stored in one big flat container. So if you don't have a reason to organize the files in a pseudo-folder hierarchy, then don't bother.
If you needed to control access for different groups of people based on your folder structure, that might be a reason to keep it, but besides that there probably isn't a benefit.