Listing all files in a Google Cloud Storage bucket and its pricing

When using bucket.getFilesStream, which auto-paginates through the files in a bucket, is each page request counted as a separate Class A operation? Or is the entire paginated stream considered a single Class A operation?
If it's multiple operations, is there a cheaper way to get a list of all files in a bucket, assuming there are millions of files?

According to the official Cloud Storage JSON API reference, the method for listing a bucket's objects is storage.objects.list. It retrieves a list of objects matching the specified criteria, and it is the method the client libraries use under the hood. Since it is the only method available for this, there is no workaround to list a bucket's objects more cheaply.
As you can see in the Google Cloud Storage pricing documentation, each call to this method is billed as a Class A operation. The number of calls therefore depends on how the Node.js client library pages through the JSON API: each page fetched by the stream corresponds to one objects.list request, so listing millions of files means many Class A operations.
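As a rough illustration (a minimal sketch using the Python client library rather than Node.js, with a placeholder bucket name), you can reduce the number of Class A operations by requesting the maximum page size, since each underlying objects.list request returns at most 1,000 objects:

```python
from google.cloud import storage

client = storage.Client()

# Each page fetched here is one objects.list request, i.e. one Class A operation.
# With page_size=1000 (the API maximum), listing 1,000,000 objects costs
# roughly 1,000 Class A operations.
blobs = client.list_blobs("my-bucket", page_size=1000)  # "my-bucket" is a placeholder

for blob in blobs:
    print(blob.name)
```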

Related

GCS: Can we have different Storage Class objects inside a bucket?

I am aware of a similar concept in AWS, where a bucket can hold objects of multiple storage classes (e.g. a Standard object and a Coldline object).
I tried googling for the same in GCP, since the objects I will have need to be of different storage classes because they won't be accessed frequently.
Yes, GCS can hold objects of multiple storage classes within a single bucket. Refer to DOC1.
See DOC2 for detailed steps and an explanation of how to change the storage class of an individual object within a bucket.
Moreover, there are multiple storage classes available in GCP:
Standard - a normal storage class, used for frequently accessed data.
Nearline - recommended when the data needs to be accessed on average once every 30 days or less.
Coldline - used for infrequently accessed data that needs to be read on average once per quarter, i.e. 90 days.
Archive - the best storage plan when the data needs to be accessed about once per year, i.e. 365 days.
Note: pricing differs between the storage classes, based on the type you choose.
For more detailed information refer to DOC1 and DOC2.
Yes. You can set the storage classes in a number of ways:
First, when you upload an object, you can specify its storage class. It's a property of most of the client libraries' "write" or "upload" methods. If you're using the JSON API directly, check the storageClass property on the objects.insert call. If you're using the XML API, use the x-goog-storage-class header.
Second, you can also set the "default storage class" on the bucket, which will be used for all object uploads that do not specify a class.
Third, you can change an object's storage class using the objects.rewrite call. If you're using a client library such as the Python one, you can call a method like blob.update_storage_class(new_storage_class) to change the storage class (note that this counts as an object write).
Finally, you can put "lifecycle policies" on your bucket that will automatically transition storage classes for individual objects over time or in response to some change. For example, you could have a rule like "downgrade an object's storage class to coldline 60 days after its creation." See https://cloud.google.com/storage/docs/lifecycle for more.
Full documentation of storage classes can be found at: https://cloud.google.com/storage/docs/storage-classes
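As a rough sketch of the first, third, and fourth options using the Python client library (the bucket name, object names, and lifecycle age below are hypothetical), assuming the google-cloud-storage package:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")  # placeholder bucket name

# 1. Specify the storage class at upload time.
blob = bucket.blob("reports/2023.csv")  # placeholder object name
blob.storage_class = "NEARLINE"
blob.upload_from_filename("2023.csv")

# 3. Change the storage class of an existing object
#    (this performs a rewrite, which counts as an object write).
blob.update_storage_class("COLDLINE")

# 4. Add a lifecycle rule that downgrades objects to Coldline
#    60 days after creation, then save the bucket configuration.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=60)
bucket.patch()
```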

Storing S3 Urls vs calling listObjects

I have an app that has an attachments feature for users. They can upload documents to S3 and then revisit, preview, and/or download said attachments.
I was planning on storing the S3 URLs in the DB and then pre-signing them when the user needs them. A caveat I'm finding is that this can lead to edge cases between S3 and the DB.
I.e. if a file gets removed from S3 but its URL does not get removed from the DB (or vice versa), this can lead to data inconsistency and may mislead users.
I was thinking of just getting the URLs over the network by using listObjects in the S3 client SDK. I don't really need to store the URLs, and this guarantees the user gets what's actually in S3.
The only con here is that it makes an API request (as opposed to a DB hit).
Any insights?
Thanks!
Using a database to store an index to files is a good idea, especially once the volume of objects increases. The ListObjects() API only returns 1000 objects per call. This might be okay if every user has their own path (so you can use ListObjects(Prefix='user1/')), but that's not ideal if you want to allow document sharing between users.
Using a database will definitely be faster to obtain a listing, and it has the advantage that you can filter on attributes and metadata.
The two systems will only get "out of sync" if objects are created/deleted outside of your app, or if there is an error in the app. If this concerns you, then use Amazon S3 Inventory to provide a regular listing of objects in the bucket, and write some code to compare it against the database entries. This will highlight if anything is going wrong.
While Amazon S3 is an excellent NoSQL database (Key = filename, Value = contents), it isn't good for searching/listing a large quantity of objects.
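For context, a minimal sketch (bucket name and prefix are hypothetical) of what listing a user's objects directly from S3 looks like with boto3; the paginator hides the 1,000-objects-per-call limit, but each page is still a separate API request:

```python
import boto3

s3 = boto3.client("s3")

# Paginate through all objects under a user's prefix.
# Each page is one ListObjectsV2 call returning at most 1,000 keys.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-attachments-bucket", Prefix="user1/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"], obj["LastModified"])
```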

boto3 find object by metadata or tag

Is it possible to search objects in an S3 bucket by an object's metadata or tag key/value (without knowing the object name or ETag)?
I know about the head_object() method (ref), but it requires a Key in its parameters.
It seems that the get_object() method is also not a solution - it takes the same argument set as head_object(), and offers nothing for filtering by metadata.
As far as I can see, neither the get_* nor the list_* methods provide any suitable filters. But I believe that such a capability should exist in the S3 API.
No. The ListObjects() API call does not accept search criteria.
You will need to retrieve a listing of all objects, then call head_object() to obtain metadata.
Alternatively, you could use Amazon S3 Inventory, which can provide a regular CSV file containing a list of all objects and their metadata. Your program could use this as a source of information rather than calling ListObjects().
If you require something that can do real-time searching of metadata, the common practice is to store such information in a database (eg DynamoDB, RDS, Elasticsearch) and then reference the database to identify the desired Amazon S3 objects.
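A minimal sketch of the brute-force approach described above, using boto3 (the bucket name, metadata key, and tag key are hypothetical): list every object, then fetch its metadata and tags one key at a time. Note that this costs one HEAD/GET request per object, which is why a database index is preferable at scale.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"  # placeholder bucket name

matches = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        key = obj["Key"]

        # User-defined metadata comes back via HEAD (one request per object).
        head = s3.head_object(Bucket=bucket, Key=key)
        if head["Metadata"].get("category") == "invoice":  # hypothetical metadata key
            matches.append(key)
            continue

        # Tags require a separate GetObjectTagging call.
        tags = s3.get_object_tagging(Bucket=bucket, Key=key)["TagSet"]
        if any(t["Key"] == "category" and t["Value"] == "invoice" for t in tags):
            matches.append(key)

print(matches)
```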

how to restrict google cloud storage upload

I have a mobile application that uses Google Cloud Storage. The application allows each registered user to upload a specific number of files.
My question is, is there a way to do some kind of checks before the storage upload? Or do I need to implement a separate reservation API of sorts that OKs an upload step?
Any alternative suggestions are welcome too, of course.
warning: Not an authoritative answer. Happy to accept removal or update requests.
I am not aware of any GCS or Firebase Cloud Storage mechanisms that will inherently limit the number of files (objects) that a given user can create. If it were me, this is how I would approach the puzzle.
I would create a database (eg. Firestore / Datastore) that has a key for each user and a value which is the number of files they have uploaded. When a user wants to upload a new file, it would first make a REST call to a Cloud Function that I would write. This Cloud Function would implicitly know the identity of the calling user. It would look up the record in the database and determine if we are allowed to upload a new file. If no, then return an error and end of story. If yes, then increment the value in the database. Next I would create a GCS "signed URL" that can be used to permit an upload. It would be that signed URL that the Cloud Function would return. The app that now wishes to upload can use that signed URL to perform the actual upload.
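A rough sketch of that Cloud Function in Python (the collection name, field name, quota, and bucket name are all hypothetical, and authentication/identity checking is omitted); it bumps the user's counter inside a Firestore transaction and returns a V4 signed URL for the upload:

```python
import datetime
from google.cloud import firestore, storage

MAX_FILES = 10                      # hypothetical per-user quota
BUCKET_NAME = "user-uploads"        # placeholder bucket name

db = firestore.Client()
storage_client = storage.Client()


def request_upload_url(user_id: str, object_name: str) -> str:
    """Check the user's quota, bump their counter, and return a signed upload URL."""
    doc_ref = db.collection("upload_counts").document(user_id)

    @firestore.transactional
    def reserve_slot(transaction):
        snapshot = doc_ref.get(transaction=transaction)
        count = (snapshot.to_dict() or {}).get("count", 0)
        if count >= MAX_FILES:
            raise PermissionError("Upload quota exceeded")
        transaction.set(doc_ref, {"count": count + 1}, merge=True)

    reserve_slot(db.transaction())

    # Signed URL that permits a PUT of this object for 15 minutes.
    # Note: V4 signing from a Cloud Function requires credentials able to sign.
    blob = storage_client.bucket(BUCKET_NAME).blob(f"{user_id}/{object_name}")
    return blob.generate_signed_url(
        version="v4",
        expiration=datetime.timedelta(minutes=15),
        method="PUT",
    )
```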
I would also add metadata to each file uploaded to identify the logical uploader (user) of the file. That can be then used for reconciliation if needed. We could examine all the files in the bucket and re-build the database of how many files each user had uploaded.
A possible alternative to this story is for the Cloud Function to not return a signed-url but instead receive the data to be uploaded in the same request. If the check on number of files passes, then the Cloud Function could be a proxy to a GCS write to create the file directly. This alternative needs to be carefully examined as a function of the sizes of the files to be uploaded. If the files are large this may be a very poor solution. We want to be in and out of Cloud Functions as quickly as possible and holding a Cloud Function "around" to service data pass through isn't great. We may want to look at Cloud Run in that case as it supports concurrency in the instance without increasing the cost per call.

How to build an index of S3 objects when data exceeds object metadata limit?

Building an index of S3 objects can be very useful to make them quickly searchable: the natural, most obvious way is to store additional data in the object metadata and use a Lambda to write it to DynamoDB or RDS, as described here: https://aws.amazon.com/blogs/big-data/building-and-maintaining-an-amazon-s3-metadata-index-without-servers/
However, this strategy is limited by the amount of data one can store in the object metadata, which is 2 KB, as described here: https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html. Suppose you need to build a system where, every time an object is uploaded to S3, you need to add some information (not contained in the file or the object name) to a database, and this data exceeds 2 KB: you can't store it in the object metadata.
What are viable strategies to keep the bucket and the index updated?
Implement two chained API calls where each call is idempotent: if the second fails when the first succeeds, one can retry until success. But what happens if you perform a PUT of an identical object on S3 and you have versioning activated? Will S3 increase the version? In that case, implementing idempotency requires a single writer to be active at a time.
Use some sort of workflow engine to keep track of this two-step behaviour, such as AWS Step Functions. What are the gotchas with this solution?
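For reference, a minimal sketch of the event-driven half of the pattern described in the linked blog post (the table name and attribute schema are hypothetical): a Lambda triggered by S3 ObjectCreated events writes one DynamoDB item per (key, versionId) pair. Because the item key is derived from the event itself, retries of the same event are idempotent, which is the property the first strategy above relies on.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("s3-object-index")  # hypothetical table name


def handler(event, context):
    """Triggered by S3 ObjectCreated events; indexes each object in DynamoDB."""
    for record in event["Records"]:
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        key = s3_info["object"]["key"]
        version_id = s3_info["object"].get("versionId", "null")

        # put_item is idempotent for a fixed primary key, so redelivered or
        # retried events simply overwrite the same item.
        table.put_item(
            Item={
                "pk": f"{bucket}/{key}",      # partition key (hypothetical schema)
                "sk": version_id,             # sort key: one item per version
                "size": s3_info["object"].get("size", 0),
            }
        )
```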