Find the size of S3 folders using PHP SDK - amazon-web-services

I am using an S3 bucket but I can't find a way to retrieve the size of a specific folder inside my bucket.
The scenario is:
I have a folder for every user on my website (/user1 ... /user2), and each one will have a limited amount of space (1 GB per folder). I need to show on my website the space they still have left, like:
Can he add = consumed space (what I'm looking for) + new file size
to determine whether the user still has enough space to upload a new file.
I'm aware that you can loop over the object list, but that isn't ideal for me because of the number and size of the documents.
Any new or direct solution is welcome.

There is no 'quick' way to obtain the amount of storage used in a particular 'folder'.
The correct way would be to call ListObjects(Prefix='folder/',...), iterate through the objects returned and sum the size of each object. Please note that each call returns a maximum of 1000 objects, so the code might need to make repeated calls to ListObjects.
If this method is too slow, you could maintain a database of all objects and their sizes and query the database when the app needs to determine the size. Use Amazon S3 Events to trigger an AWS Lambda function when objects are created/deleted to keep the database up-to-date. This is a rather complex method, so I would suggest the first method unless there is a specific reason why it is not feasible.
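To make the first method concrete, here is a minimal sketch using boto3 (Python); the same list-and-sum pattern applies with the PHP SDK's paginators. The bucket name, prefix, quota, and upload size below are placeholder assumptions.

import boto3

def folder_size_bytes(bucket: str, prefix: str) -> int:
    """Sum the sizes of all objects under the given prefix (can be slow for large prefixes)."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            total += obj["Size"]
    return total

# Quota check for a hypothetical user folder
used = folder_size_bytes("my-bucket", "user1/")
quota = 1 * 1024 ** 3          # 1 GB allowance per folder
new_file_size = 5 * 1024 ** 2  # incoming 5 MB upload
print(used + new_file_size <= quota)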

Related

Can't loop over a Google page HTTPIterator object twice?

I have what I hope is an easy question. I am using the Google Storage client library to loop over blobs in a bucket. After I get the list of blobs in the bucket, I am unable to loop over it again unless I re-run the command that lists the bucket.
I read the documentation on page iterators, but I still don't quite understand why this sort of thing couldn't just be stored in memory like a normal variable in Python. Why is this ValueError being thrown when I try to loop over the object again? Does anyone have any suggestions on how to interact with this data better?
For many sources of data, the number of items that could be returned is huge. While you may only have dozens or hundreds of objects in your bucket, there is absolutely nothing to prevent you from having millions (billions?) of objects. If you list a bucket, it would make no sense to return a million entries and have any hope of maintaining their state in memory. Instead, Google says you should "page" or "iterate" through them. Each time you ask for a new page, you get the next set of data and are presumed to have lost reference to the previous set, and hence you maintain only one set of data at a time on your client.
It is the back-end server that maintains your "window" into the data being returned. All you need do is say "give me more data, here is my context (the page token)" and the next chunk of data is returned.
If you want to walk through your data twice then I would suggest asking for a second iteration. Be careful though, the result of the first iteration may not be the same as the second. If new files are added or old ones removed, the results will be different between one iteration and another.
If you really believe that you can hold the results in memory then as you execute your first iteration, save the results and keep appending new values as you page through them. This may work for specific use cases but realize that you are likely setting yourself up for trouble if the number of items gets too large.
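To make those two options concrete, here is a short sketch with the google-cloud-storage Python client; the bucket and prefix names are placeholders.

from google.cloud import storage

client = storage.Client()

# Option 1: materialize the iterator so it can be traversed repeatedly.
# Only reasonable when the listing is known to be small.
blobs = list(client.list_blobs("my-bucket", prefix="reports/"))
total_bytes = sum(b.size for b in blobs)
for b in blobs:  # second pass over the same in-memory snapshot
    print(b.name, b.size)

# Option 2: ask for a fresh iterator for each pass; the two listings may
# differ if objects were added or deleted in between.
for b in client.list_blobs("my-bucket", prefix="reports/"):
    print(b.updated, b.name)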

AWS console: how to list S3 bucket contents ordered by last modified date

I am writing files to an S3 bucket. How can I see the newly added files? In the console listing, the files are not ordered by the Last modified field, and I can't find a way to sort on that field or any other field.
You cannot sort on that; it is just how the UI works.
The main reason is that, for buckets with more than 1000 objects, the UI only "knows" about the 1000 elements displayed on the current page. Sorting those would be misleading: it would appear to show the newest or oldest objects in the bucket, but in fact it would only order the 1000 objects currently displayed. That would confuse people, so it is better not to let the user sort at all than to sort incorrectly.
Showing the actual 1000 newest or oldest objects requires listing everything in the bucket, which takes time (minutes or hours for larger buckets), requires many backend requests, and incurs extra cost since LIST requests are billed. If you want the 1000 newest or oldest objects, you need to write code that does a full listing of the bucket or prefix, orders all the objects, and then displays part of the result (a sketch follows at the end of this answer).
If you can sufficiently decrease the number of displayed objects with the "Find objects by prefix" field, the sort options become available and meaningful.
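If you do need the newest or oldest objects, here is a minimal sketch of that list-then-sort approach with boto3; the bucket and prefix names are placeholders.

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Full listing of the bucket/prefix, then a client-side sort by LastModified.
# This is the work the console would have to do, which is why it does not offer it.
objects = []
for page in paginator.paginate(Bucket="my-bucket", Prefix="logs/"):
    objects.extend(page.get("Contents", []))

newest_first = sorted(objects, key=lambda o: o["LastModified"], reverse=True)
for obj in newest_first[:10]:
    print(obj["LastModified"], obj["Key"])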

Is it Possible to Delete AWS S3 Objects Based on Object Size

I can't seem to find any documentation about deleting S3 objects based on object size. For example, if an object's size is less than 5 bytes, delete it.
From your comments, it appears you want to delete objects immediately after creation if they are smaller than a given size.
To do this, you would:
Create an AWS Lambda function
Configure the S3 bucket to trigger the Lambda function when an object is created
The Lambda function will be passed the Bucket and Key of the object(s) that was/were just created. It can then call HeadObject to obtain the size of the object. If it is smaller than the desired size, it can then call DeleteObject. Make sure to loop through all passed-in Records because one Lambda function can be invoked with multiple input objects.
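A minimal sketch of such a handler in Python (boto3), assuming the function is subscribed to the bucket's ObjectCreated events and using the 5-byte threshold from the question:

import boto3
from urllib.parse import unquote_plus

s3 = boto3.client("s3")
MIN_SIZE_BYTES = 5  # assumed threshold; adjust as needed

def lambda_handler(event, context):
    """Delete any newly created object that is smaller than the threshold."""
    for record in event.get("Records", []):  # one invocation can carry several records
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # event keys are URL-encoded
        size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
        if size < MIN_SIZE_BYTES:
            s3.delete_object(Bucket=bucket, Key=key)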
If you have existing objects on which you wish to perform this operation, and since you mentioned that there are "over 1 million objects", you could use Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects, including their size. You can write a program that uses this file as input and then calls DeleteObjects to delete up to 1000 objects at a time.
Yes, it is possible to delete an S3 Object based on size.
One workaround is to get the object sizes via the AWS CLI or boto3 and run a cron job that deletes an object whenever its size is less than 5 bytes.
The DeleteObject() API call does not accept parameters such as Size or ModifiedDate.
Instead, you must provide a list of object(s) to be deleted.
If you wish to delete objects based on their size, the typical pattern would be:
Call ListObjects() to obtain a listing of objects in the bucket (and optionally within a given prefix)
In your code, loop through the returned information and examine each object's size. Where the size is smaller/larger than desired, add the Key (filename) to an array
Call DeleteObjects(), passing the array of Keys to be deleted (a sketch of this pattern follows the list)
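A hedged boto3 sketch of that three-step pattern; the bucket name, prefix, and size threshold are placeholders.

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

MIN_SIZE = 5  # bytes
to_delete = []

# Steps 1 and 2: list the objects and collect the keys of those below the threshold.
for page in paginator.paginate(Bucket="my-bucket", Prefix="incoming/"):
    for obj in page.get("Contents", []):
        if obj["Size"] < MIN_SIZE:
            to_delete.append({"Key": obj["Key"]})

# Step 3: DeleteObjects accepts at most 1000 keys per call, so delete in chunks.
for i in range(0, len(to_delete), 1000):
    s3.delete_objects(Bucket="my-bucket", Delete={"Objects": to_delete[i:i + 1000]})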

S3 limit to objects in a directory of a bucket

I want to save all of my images in a directory of a bucket. Is the number of objects in the same directory unlimited?
for example:
/imgs/10000000.jpg
/imgs/10000001.jpg
/imgs/10000002.jpg
....
/imgs/99999999.jpg
Yes, the number of objects is unlimited. As John mentioned elsewhere, the entire S3 "file path" is really just one string internally; the use of / as a path separator is only a convention.
One suggestion I'd make, to use this effectively, is to name each image with a ULID (https://github.com/ulid/spec). This gives you a couple of advantages:
you don't need to worry about uniqueness, even if you put images in from multiple servers
because ULIDs are lexicographically sortable and time-based, you can query S3 directly to see which images were uploaded when (you can generate ULIDs for the start and end timestamps and call S3's LIST to get the images between them; see the sketch after this list).
it's easier to handle reports and metrics – you can easily find out which images are new, because they'll have a ULID after that period's timestamp.
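A sketch of that time-range query, assuming images are keyed by ULID under an imgs/ prefix; the time-prefix encoding follows the ULID spec linked above, and the bucket and prefix names are placeholders.

import time
import boto3

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def ulid_time_prefix(ms: int) -> str:
    """Encode a millisecond timestamp as the 10-character time component of a ULID."""
    chars = []
    for _ in range(10):
        chars.append(CROCKFORD[ms & 0x1F])
        ms >>= 5
    return "".join(reversed(chars))

def images_between(bucket: str, prefix: str, start_ms: int, end_ms: int) -> list:
    """List keys whose ULID names fall between the two timestamps."""
    s3 = boto3.client("s3")
    start_after = prefix + ulid_time_prefix(start_ms)
    end_marker = prefix + ulid_time_prefix(end_ms) + "~"  # '~' sorts after every ULID character
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, StartAfter=start_after):
        for obj in page.get("Contents", []):
            if obj["Key"] > end_marker:
                return keys
            keys.append(obj["Key"])
    return keys

# Example: images uploaded in the last hour
now_ms = int(time.time() * 1000)
recent = images_between("my-bucket", "imgs/", now_ms - 3600 * 1000, now_ms)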
There is no limit on the number of objects stored in Amazon S3.
Amazon S3 does not actually have 'directories'. Instead, the Key (filename) of an object includes the full path of the object. This actually allows you to create/upload objects to a non-existent directory. Doing so will make the directory 'appear' to exist, but that is simply to make things easier for us humans to understand.
Therefore, there is also no limit on the number of objects stored in a 'directory'.

Amazon DynamoDB Mapper - limits to batch operations

I am trying to write a huge number of records into a DynamoDB table and I would like to know the correct way of doing that. Currently, I am using the DynamoDBMapper to do the job in one batchWrite operation, but after reading the documentation I am not sure if this is the correct way (especially regarding the limits on the size and number of written items).
Let's say, that I have an ArrayList with 10000 records and I am saving it like this:
mapper.batchWrite(recordsToSave, new ArrayList<BillingRecord>());
The first argument is the list with records to be written and the second one contains items to be deleted (no such items in this case).
Does the mapper split this write into multiple writes and handle the limits or should it be handled explicitly?
I have only found examples with batchWrite done with the AmazonDynamoDB client directly (like THIS one). Is using the client directly for the batch operations the correct way? If so, what is the point of having a mapper?
Does the mapper split your list of objects into multiple batches and then write each batch separately? Yes, it does batching for you and you can see that it splits the items to be written into batches of up to 25 items here. It then tries writing each batch and some of the items in each batch can fail. An example of a failure is given in the mapper documentation:
This method fails to save the batch if the size of an individual object in the batch exceeds 400 KB. For more information on batch restrictions see, http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html
The example is talking about the size of one record (one BillingRecord instance in your case) exceeding 400 KB, which, at the time of writing this answer, is the maximum size of an item in DynamoDB.
If a particular batch fails, it moves on to the next batch (sleeping the thread for a bit in case the failure was caused by throttling). In the end, all of the failed batches are returned in a List of FailedBatch instances. Each FailedBatch instance contains a list of unprocessed items that weren't written to DynamoDB.
Is the snippet that you provided the correct way of doing batch writes? I can think of two suggestions. The batchSave method is more appropriate if you have no items to delete. You might also want to think about what you want to do with the failed batches.
Is using the client directly the correct way? If so, what is the point of the mapper? The mapper is simply a wrapper around the low-level client. It provides an ORM layer that converts your BillingRecord instances into the nested attribute-value maps the low-level client works with. There is nothing wrong with using the client directly; this tends to happen in special cases where additional functionality needs to be coded outside of the mapper.
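For comparison only, and not part of the Java mapper API discussed above: the Python SDK (boto3) exposes a similar convenience through Table.batch_writer(), which buffers items into batches of up to 25 and resends unprocessed items automatically, much like the mapper does with your list. The table name and item shape below are placeholders.

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("BillingRecords")  # hypothetical table

records = [{"recordId": str(i), "amount": i * 10} for i in range(10000)]

# batch_writer handles the 25-item batching and retries unprocessed items.
with table.batch_writer() as writer:
    for record in records:
        writer.put_item(Item=record)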