Rails ActiveStorage vs AWS S3 tiers

My application stores a very large number of images in S3; we use Rails 5.2 ActiveStorage for that. The images are accessed heavily for 6 to 9 months, then very rarely until they are 15 months old and are deleted automatically by ActiveStorage.
To save some money I'd like to move the files from S3 Standard to S3 Infrequent Access (S3-IA) 9 months after the file's creation (this can be done automatically in AWS).
My question is: will ActiveStorage still be able to find/display an image in S3-IA in the rare case someone wants to see it? Will ActiveStorage still be able to find the file to delete it at 15 months? Bottom line: I don't want ActiveStorage to lose track of a file when it goes from S3 Standard to S3-IA.

S3-IA just changes the pricing of an object. It doesn't change the visibility of the object or the time needed to retrieve it (unlike the GLACIER storage class).
One thing to be aware of is that IA pricing is based on a minimum billable object size of 128 KB. If you have a lot of objects that are smaller, your costs may actually increase if you store them as IA.
See the S3 storage classes docs for details.

I haven't tested this, but Active Storage should still be able to find the object as long as its key doesn't change, and a lifecycle transition changes only the storage class, not the key.
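
For reference, the automatic transition mentioned in the question can be set up with a lifecycle rule. Here is a minimal boto3 sketch, assuming the default ActiveStorage layout (keys at the bucket root); the bucket name and the day count are placeholders:

    # Minimal sketch, not tested against a real bucket: transition every object to
    # STANDARD_IA 270 days (~9 months) after creation, and leave the deletion at
    # 15 months to ActiveStorage itself.
    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="my-activestorage-bucket",  # placeholder bucket name
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "standard-to-ia-after-9-months",
                    "Filter": {"Prefix": ""},  # empty prefix = every object in the bucket
                    "Status": "Enabled",
                    "Transitions": [
                        {"Days": 270, "StorageClass": "STANDARD_IA"}
                    ],
                }
            ]
        },
    )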

Deleting millions of files from S3

I need to delete 64 million objects from a bucket, leaving about the same number of objects untouched. I have created an inventory of the bucket and used that to create a filtered inventory that has only the objects that need to be deleted.
I created a Lambda function in Node.js that asynchronously deletes the objects that are fed to it.
I have created smaller inventories (10s, 100s and 1000s of objects) from the filtered one, and used S3 Batch Operation jobs to process these, and those all seem to check out: the expected files were deleted, and all other files remained.
Now, my questions:
Am I doing this right? Is this the preferred method to delete millions of files, or did my Googling misfire?
Is it advised to just create one big batch job and let that run, or is it better to break it up into chunks of, say, a million objects?
How long will this take (approx. of course)? Will S3 Batch go through the list and do each file sequentially? Or does it automagically scale out and do a whole bunch in parallel?
What am I forgetting?
Any suggestions, thoughts or criticisms are welcome. Thanks!
You might have a look at the Step Functions Distributed Map feature. I don't know your specific use case, but it could help you get the proper scaling.
Here is a short blog entry on how you can achieve it.
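
For what it's worth, here is a rough Python sketch of the kind of Lambda handler an S3 Batch Operations job invokes, following the documented invocation/response shape as I understand it; the field names and the URL-decoding of manifest keys are worth double-checking against the current docs:

    # Rough sketch of a Lambda handler invoked by an S3 Batch Operations job.
    # Each invocation carries one or more tasks; we delete the object for each
    # task and report a per-task result code back to the job.
    from urllib.parse import unquote_plus

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        results = []
        for task in event["tasks"]:
            bucket = task["s3BucketArn"].split(":::")[-1]  # ARN -> bucket name
            key = unquote_plus(task["s3Key"])              # CSV manifest keys arrive URL-encoded
            try:
                s3.delete_object(Bucket=bucket, Key=key)
                code, msg = "Succeeded", "deleted"
            except Exception as exc:                       # let Batch Operations record the failure
                code, msg = "PermanentFailure", str(exc)
            results.append(
                {"taskId": task["taskId"], "resultCode": code, "resultString": msg}
            )
        return {
            "invocationSchemaVersion": event["invocationSchemaVersion"],
            "treatMissingKeysAs": "PermanentFailure",
            "invocationId": event["invocationId"],
            "results": results,
        }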

How to combine multiple S3 objects into a target S3 object without leaving S3

I understand that the minimum part size for uploading to an S3 bucket is 5MB
Is there any way to have this changed on a per-bucket basis?
The reason I'm asking is that there is a list of raw objects in S3 which we want to combine into a single object in S3.
Using PUT part/copy we are able to "glue" objects into a single one, provided that all objects except the last one are >= 5MB. However, sometimes our raw objects are not big enough, and in this case, when we try to complete the multipart upload, we get the famous error "Your proposed upload is smaller than the minimum allowed size" from AWS S3.
Any other idea how we could combine S3 objects without downloading them first?
"However sometimes our raw objects are not big enough... "
You can have a 5MB garbage object sitting on S3 and do concatenation with it, where part 1 = the 5MB garbage object and part 2 = the file you want to concatenate. Keep repeating this for each fragment, and finally use a range copy to strip out the 5MB garbage.
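
For context, the basic server-side "glue" that both the question and this trick rely on looks roughly like this with boto3; bucket and key names are made up, and every part except the last must still be at least 5MB:

    # Minimal sketch: concatenate existing S3 objects into a new object without
    # downloading them, using a multipart upload driven by UploadPartCopy.
    import boto3

    s3 = boto3.client("s3")
    bucket = "my-bucket"                                # placeholder
    sources = ["raw/part-a.bin", "raw/part-b.bin"]      # all but the last must be >= 5MB
    target = "combined/result.bin"

    upload = s3.create_multipart_upload(Bucket=bucket, Key=target)
    parts = []
    for number, key in enumerate(sources, start=1):
        copy = s3.upload_part_copy(
            Bucket=bucket,
            Key=target,
            UploadId=upload["UploadId"],
            PartNumber=number,
            CopySource={"Bucket": bucket, "Key": key},
        )
        parts.append({"PartNumber": number, "ETag": copy["CopyPartResult"]["ETag"]})

    s3.complete_multipart_upload(
        Bucket=bucket,
        Key=target,
        UploadId=upload["UploadId"],
        MultipartUpload={"Parts": parts},
    )

The range copy used to strip the padding back out is the same upload_part_copy call with an extra CopySourceRange argument.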
There is no way to have the minimum part size changed
You may want to either:
Stream them together to AWS (which does not seem like an option, otherwise you would already be doing this).
Pad the files so they meet the minimum size of 5MB (which may or may not be feasible for you, since it will increase your bill). You then have the option to use either Infrequent Access (if you access these files rarely) or Reduced Redundancy (if you can recover lost files) for these specific files, to reduce the cost impact.
Use an external service that will zip (or "glue" together) your files and then re-upload them to S3. I don't know if such a service exists, but I am pretty sure you can implement it yourself using a Lambda function (I have even tried something like this in the past: https://github.com/gammasoft/zipper-lambda).

What are the best practices for user uploads with S3?

I was wondering what you recommend for running a user upload system with S3. I plan on using MongoDB for storing metadata such as the uploader, size, etc. How should I go about storing the actual files in S3?
Here are some of my ideas, what do you think is the best? All of these examples would involve saving the metadata to MongoDB.
1. Should I just store all the files in a bucket?
2. Maybe organize them into dates (e.g. 6/8/2014/mypicture.png)?
3. Should I save them all in one bucket, but with an added string (such as d1JdaZ9-mypicture.png) to avoid duplicates?
4. Or should I generate a long string for a folder and store the file in that folder (to retain the original file name)? e.g. sh8sb36zkj391k4dhqk4n5e4ndsqule6/mypicture.png
This depends primarily on how you intend to use the pictures and which objects/classes/modules/etc. in your code will actually deal with retrieving them.
If you find yourself wanting to do things like "get all user uploads on a particular day", a simple naming convention with folders for the year, month and day, along with a folder at the top level for the user's unique ID, will solve the problem.
If you want to ensure uniqueness and avoid collisions in your bucket, you could generate a unique string too.
However, since you've got MongoDB, which (I'm assuming) will actually handle these queries for user uploads by date, etc., it makes the choice of your bucket layout more aesthetic than functional.
If all you're storing in MongoDB is the key/URL, it doesn't really matter what the actual structure of your bucket is. Nevertheless, it makes sense to still split this up in some coherent way: maybe group all of a user's uploads and give each a unique name (either generate a unique name or add a unique prefix to the file name).
That being said, do you think there might be a point when you might look at changing how your images are stored? You might move to a CDN. A third party might come up with an even cheaper/better product which you might want to try. In a case like that, simply storing the keys/URLs in your MongoDB is not a good idea since you'll have to update every entry.
To make this relatively future-proof, I suggest you give your uploads a definite structure. I usually opt for:
bucket_name/user_id/yyyy/mm/dd/unique_name.jpg
Your database then only needs to store the file name and the upload time stamp.
You can introduce a middle layer in your logic (a new class perhaps or just a helper function/method) which then generates the URL for a file based on this info. That way, if you change your storage method later, you only need to make a small change in this middle layer (after migrating your files of course) and not worry about MongoDB.
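
As a concrete, purely hypothetical sketch of that middle layer in Python (the bucket URL is an assumption, and the file name is assumed to carry an extension):

    # Sketch of the "middle layer": the rest of the app asks this helper for
    # keys and URLs instead of hard-coding the bucket layout anywhere else.
    import uuid
    from datetime import datetime, timezone
    from typing import Optional

    S3_BASE_URL = "https://my-upload-bucket.s3.amazonaws.com"  # assumption

    def make_key(user_id: str, original_name: str, uploaded_at: Optional[datetime] = None) -> str:
        """Build 'user_id/yyyy/mm/dd/unique_name.ext' for a new upload."""
        uploaded_at = uploaded_at or datetime.now(timezone.utc)
        ext = original_name.rsplit(".", 1)[-1].lower()
        unique = uuid.uuid4().hex
        return f"{user_id}/{uploaded_at:%Y/%m/%d}/{unique}.{ext}"

    def url_for(key: str) -> str:
        """Resolve a stored key to a URL; swap this out if storage moves to a CDN."""
        return f"{S3_BASE_URL}/{key}"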

How do I survive from Joomla K2 image handling?

I started a news website for a specific area of business one year ago. The website lists news, and every post has a featured image. Unfortunately, about 1,500 news items have been posted in a year and the website is taking up 1.07 GB of space. This seemed totally insane to me, as Joomla itself had been only a few megabytes and there were no big additions from my side (like files or graphics).
I made a HUGE mistake. I trusted the JoomlaWorks guys and installed K2. The main reason I did this was that the default Joomla article manager did not offer a featured image for each post. But this has been added in the new 3.0 version!
K2 does something extremely foolish. If you save a photo of 2 MB, it will save the original plus 4 additional copies, one for each size (Small, Large, Medium, XL). Insanely, you upload a 2 MB image and it ends up occupying 4 MB of space!
The hosting provider gives me 2 GB of space to store my files. I have started to lose sleep at night because the space used expands day by day, and if it goes beyond 2 GB I will have to upgrade the hosting plan, which I do not have the money to do.
I believe I have three choices:
Move all items, categories and images from K2 back to Joomla articles, which is much faster, and then upgrade to version 3.0, which supports featured images. This seems extremely difficult and I don't know if it's possible at all. Even if I move all the table rows from K2 to Joomla, I don't feel comfortable with 1,500 of them, and the image paths are not saved in the DB. Chaos.
Move everything to WordPress. No idea how to do that at all.
Compress the images that are in the cache, or search for ways to stop K2 from doing that.
K2 saves the images in two different folders. One folder holds exclusively the originals, and the other folder holds all the resized versions. Technically you can just delete the folder with the originals, because those are not the ones used in the articles or anywhere else on the website. Let's not speak poorly of K2 for saving the originals; I think it's a good feature. I once needed to go into that folder on my host to find a file that had been deleted from my computer. You could also easily use that folder in the future to rebuild all the resized files in case you want to change the sizing in your layout.
I would just back up that folder every once in a while and delete the copy on your host. That should save a lot of space. You can also set an option so that the resized files are saved at a lower quality and don't take up so much space. There is an option for this in the back-end; at 70-80% the photo quality is still great.
Why do you think that creating small, medium and big images is extremely bad? Do you actually have previews of the images, where they appear at a smaller size? If so, this is a wise way to do it.
If you really do not use any of the smaller images, I would recommend going line by line through the K2 plugin (or whatever it is), finding exactly which lines save these additional images, and commenting them out.
Just one other thing: how did you end up with 2 MB images for a news site? In my opinion those must be really high-resolution images, because the normal size is more like 300 KB.
In the folder www__TemplateName__\media\k2\items you will see two subfolders, "cache" and "src"; the latter is for source files. Its contents can safely be moved out to a local drive once a month. Still, I'd say that if you end up with 1.5k news items, the database will take lots of space too, and most hosts count database space as well. And you won't be able to do ANYTHING about it; you just can't throw away the DB...
Then, most likely you have an email server on the same host even though you (probably) don't use it. If you have 1.5k news items in one year, I can imagine how much spam ends up in your mail folders, and that takes space out of your 2 GB at the host... Check your mail folders and kill everything you don't need there...
You're saying "I need an answer from a Joomla expert here". A Joomla expert won't tell you much; a K2 expert is needed. And the answer was given: reduce the quality of the images cached by K2 to 70%. It'll do just great, save lots of space, and the quality drop won't be visible. This setting is set once and works for all authors...
In the case of the DB, I'd highly recommend installing http://extensions.joomla.org/extensions/access-a-security/site-security/site-protection/14087 and then clicking Clean Temp and Repair Tables; it helps too.
Then, one other thing is to batch-resize the files in K2's originals folder; there are tons of different scripts for that out there on the internet (a rough sketch follows below). Run one from time to time and the crazy big files from your users will shrink unbelievably!
But most of all: a host with 2 GB in this day and age? That's crazy low. In my case $50 per year gets me 6 GB, and that's not the cheapest host here... So... Change your host!
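
To make the batch-resize suggestion concrete, a rough sketch in Python with Pillow might look like this; the folder path and quality value are placeholders, and it should be tested on a backup copy first since it overwrites files in place:

    # Re-save every JPEG under a folder at ~75% quality to reclaim space.
    # Run it on a backup copy first; this overwrites the files in place.
    from pathlib import Path
    from PIL import Image

    FOLDER = Path("media/k2/items/src")   # placeholder path
    QUALITY = 75                          # 70-80% is usually visually fine

    for path in FOLDER.rglob("*.jpg"):
        with Image.open(path) as original:
            img = original.convert("RGB")  # forces the pixel data to load
        img.save(path, "JPEG", quality=QUALITY, optimize=True)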

Managing %SYS.PTools.SQLStats data

I need to profile an application that uses a Caché database, and I'm trying to use CacheMonitor for that.
I enabled query statistics (I suppose CacheMonitor executes DO SetSQLStats^%apiSQL(3) internally), and two days later my server ran out of disk space. I'm afraid there is too much data in %SYS.PTools.SQLQuery and %SYS.PTools.SQLStats, and I would like to free some space.
Is there any administration tool to manage these data? How can I delete data from sql statistics?
NOTE: My knowledge about Caché is almost none.
It sounds like this is a pretty general problem of how to delete a global and then reclaim the disk space.
To delete the data, you should be able to use a SQL delete statement to clear out %SYS.PTools.SQLStats (which should be larger), and/or %SYS.PTools.SQLQuery.
Since this is Caché, you might also kill the globals from the command line. I haven't used these classes, but looking at the class definition in ^oddDEF, it appears to store the data in ^%SYS.PTools.SQLQueryD, ^%SYS.PTools.SQLQueryI, and ^%SYS.PTools.SQLQueryS (which is the standard default storage, so this would be likely anyway).
If you only want to delete some of it you will need to craft your own SQL for it.
Once they are deleted, you need to actually shrink the database (like most databases, it can grow dynamically but doesn't automatically give up any space). See this reference for an example of one way to do that. The basic idea is on page 3 - you can make a new database then copy all the data into it, then delete the old once you are sure you don't need it. Don't forget to do a backup first.
To make this easier in the future, you can use the global mapping feature to store the %SYS.PTools globals in their own new database. Then, when you want to shrink that database, you can just replace it with a new one without copying all the data around (as is suggested in the class documentation for %SYS.PTools.SQLStats).
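
If you'd rather script the SQL deletes than work in the terminal, a rough sketch over ODBC could look like the following; the DSN name is made up, and the %SYS_PTools schema name is my assumption based on the usual package-to-schema mapping, so verify both against your instance first:

    # Hypothetical sketch: clear the %SYS.PTools statistics tables over ODBC.
    # "CacheDSN" is a placeholder DSN pointing at the namespace that holds the data.
    import pyodbc

    conn = pyodbc.connect("DSN=CacheDSN")
    cur = conn.cursor()

    # SQLStats is usually the bulk of the space; clear SQLQuery too if you no
    # longer need the captured query definitions.
    cur.execute("DELETE FROM %SYS_PTools.SQLStats")
    cur.execute("DELETE FROM %SYS_PTools.SQLQuery")
    conn.commit()
    conn.close()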