We're looking into Google Nearline as a solution for some "warm" storage requirements. Basically we expect parts of a dataset of around 5 PB to be accessed every now and again, but the whole set very infrequently.
That said, there may be one or two times a year when we want to run something across the whole dataset (i.e. patch all the data with a new field). These jobs would run within GCP (Dataproc). Doing this against Nearline blows our budget up by around 50k each time.
Wondering whether there is a way to change the storage class without incurring the full data retrieval penalty? I see that a storage class can be changed via gsutil rewrite, but this will retrieve the data.
Perhaps we can use a lifecycle rule to change the storage class without a retrieval? Or is there any other way to do it?
gsutil rewrite works by creating new objects in the target storage class: it reads the GCS objects in one storage class and writes them back in another (i.e. new objects get created).
This operation is charged to your project.
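For illustration, here is a minimal sketch of that rewrite using the Python client library mentioned further down (google-cloud-storage); the bucket and object names are placeholders. update_storage_class() performs a rewrite, so the Nearline object is read (the retrieval charge applies) and rewritten as a Standard-class object.

```python
# Sketch only: google-cloud-storage client; bucket/object names are placeholders.
from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-warm-dataset").blob("part-00000.parquet")

# Rewrites the object in place: reads it from NEARLINE (retrieval is charged)
# and writes a new object in the STANDARD class.
blob.update_storage_class("STANDARD")
```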
I am aware of a similar concept in AWS, where a bucket can hold objects of multiple storage classes, like Standard objects and Coldline objects.
I tried googling for the same in GCP, since the objects I will have need to be in different storage classes, as they won't be accessed frequently.
Yes, a GCS bucket can hold objects of multiple storage classes. Refer to DOC1.
DOC2 has detailed steps and an explanation of how to change the storage class of an individual object within a bucket.
Moreover, there are multiple storage classes available in GCP:
Standard - A normal storage class which can be used for frequently accessed data.
Nearline - Recommended when the data needs to be accessed on average once every 30 days or less.
Coldline - For infrequently accessed data that needs to be read on average once per quarter, i.e. every 90 days.
Archive - The best storage plan when the data needs to be accessed about once per year, i.e. every 365 days.
Note: Pricing differs between storage classes, based on the type you choose.
For more detailed information refer to these documents: DOC1, DOC2.
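As a quick illustration that one bucket can mix classes, here is a minimal sketch using the google-cloud-storage Python client; the bucket name is a placeholder. Each object reports its own storage class.

```python
# Sketch only: list each object's individual storage class in one bucket.
from google.cloud import storage

client = storage.Client()
for blob in client.list_blobs("my-mixed-class-bucket"):  # placeholder bucket name
    print(blob.name, blob.storage_class)  # e.g. STANDARD, NEARLINE, COLDLINE, ARCHIVE
```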
Yes. You can set the storage classes in a number of ways:
First, when you upload an object, you can specify its storage class. It's a property of most of the client libraries' "write" or "upload" methods. If you're using the JSON API directly, set the storageClass property on the objects.insert call. If you're using the XML API, use the x-goog-storage-class header.
Second, you can also set the "default storage class" on the bucket, which will be used for all object uploads that do not specify a class.
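A minimal sketch of these first two options with the google-cloud-storage Python client (bucket, object, and file names are placeholders):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-mixed-class-bucket")

# Option 1: specify the storage class for a single upload.
blob = bucket.blob("rarely-used/report.csv")
blob.storage_class = "COLDLINE"          # must be set before the upload
blob.upload_from_filename("report.csv")

# Option 2: set the bucket's default storage class; later uploads that
# don't specify a class will inherit it.
bucket.storage_class = "NEARLINE"
bucket.patch()
```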
Third, you can change an object's storage class using the objects.rewrite call. If you're using an API like the Python API, you can use a function like blob.update_storage_class(new_storage_class) to change the storage class (note that this counts as an object write).
Finally, you can put "lifecycle policies" on your bucket that will automatically transition storage classes for individual objects over time or in response to some change. For example, you could have a rule like "downgrade an object's storage class to coldline 60 days after its creation." See https://cloud.google.com/storage/docs/lifecycle for more.
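For example, the "coldline 60 days after creation" rule above could be set up like this with the Python client (the bucket name and threshold are placeholders):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-mixed-class-bucket")

# Downgrade objects to COLDLINE 60 days after creation.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=60)
bucket.patch()  # persist the updated lifecycle configuration
```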
Full documentation of storage classes can be found at https://cloud.google.com/storage/docs/storage-classes
I want to change the storage class of existing objects in a GCP bucket based on their access pattern, e.g. the number of downloads. I found this link:
https://cloud.google.com/storage/docs/lifecycle
which is based on the object creation time. Is there any way to achieve the same based on the download pattern?
I am looking for a way to update the ACLs of several objects in one (or a few) requests to the AWS API.
My web application stores several sensitive objects in AWS S3. These objects have a default ACL of "private". I sometimes need to switch several objects' ACL to "public-read" for some time (a couple of minutes) before going back to "private".
For a couple of objects, one request per object to PutObjectAcl is OK. But when dealing with several objects (hundreds), the operation requires too much time.
My question is: how can I "mass put object ACL" or "bulk put object ACL"? The AWS API doesn't seem to offer this, unlike DeleteObjects (which allows deleting several objects at once). But maybe I didn't look in the right place?!
Any trick or workaround would be of great value!
Mixing private and public objects inside a bucket is usually a bad idea. If you only need those objects to be public for a couple of minutes, you can create a pre-signed GET URL and set a desired expiration time.
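A minimal sketch of that approach with boto3; the bucket name, keys, and the 5-minute expiry are placeholders:

```python
import boto3

s3 = boto3.client("s3")

def presign(bucket, key, expires_in=300):
    """Return a GET URL for a private object that expires after `expires_in` seconds."""
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )

# Hand these out instead of flipping hundreds of ACLs to public-read and back.
urls = [presign("my-private-bucket", key) for key in ("doc1.pdf", "doc2.pdf")]
```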
My requirement is to move the files to archive, once the (current time - last access time) is greater than a specific value. Is such an option possible?
I went through the documentation but did not see any option to change the storage class based on a last-accessed timestamp.
You can use lifecycle rules on Cloud Storage to change the storage class based on temporal conditions.
Cloud Storage lifecycle rules have a condition called "Days since custom time".
Presumably you could set the custom time whenever you access an object and this would work.
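A minimal sketch of that idea, assuming a recent google-cloud-storage Python client that exposes custom_time and the days_since_custom_time lifecycle condition; the bucket name, object name, and 90-day threshold are placeholders:

```python
import datetime
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-archive-candidate-bucket")

# One-time setup: archive objects whose custom time is more than 90 days old.
bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", days_since_custom_time=90)
bucket.patch()

# On each access, stamp the object's custom time with "now" so the 90-day
# clock restarts; objects that stop being accessed eventually get archived.
blob = bucket.blob("dataset/part-00000.parquet")
blob.custom_time = datetime.datetime.now(datetime.timezone.utc)
blob.patch()
```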
Building an index of S3 objects can be very useful for making them quickly searchable: the natural, most obvious way is to store additional data in the object metadata and use a Lambda to write it to DynamoDB or RDS, as described here: https://aws.amazon.com/blogs/big-data/building-and-maintaining-an-amazon-s3-metadata-index-without-servers/
However, this strategy is limited by the amount of data one can store in the object metadata, which is 2 KB, as described here: https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html. Suppose you need to build a system where, every time an object is uploaded to S3, you need to add some information not contained in the file or the object name to a database, and this data exceeds 2 KB: you can't store it in the object metadata.
What are viable strategies to keep the bucket and the index updated?
Implement two chained API calls where each call is idempotent: if the second fails when the first succeeds, one can retry until success (see the sketch after these options). What happens if you perform a PUT of an identical object on S3 and you have versioning activated? Will S3 create a new version? In that case, implementing idempotency requires a single writer to be active at any time.
Use some sort of workflow engine to keep track of this two-step behaviour, such as AWS Step Functions. What are the gotchas with this solution?
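A minimal sketch of the first option with boto3, under stated assumptions: the bucket has versioning enabled, the DynamoDB table name and attribute layout are hypothetical, and the index key includes the version id so that retries simply overwrite the same item:

```python
import boto3

s3 = boto3.client("s3")
index = boto3.resource("dynamodb").Table("s3-object-index")  # hypothetical table

def upload_and_index(bucket, key, body, extra_attributes):
    # Call 1: write the object. With versioning enabled, even an identical
    # body produces a new version, returned as VersionId.
    response = s3.put_object(Bucket=bucket, Key=key, Body=body)
    version_id = response.get("VersionId", "null")

    # Call 2: write the index entry. put_item with a fixed key is idempotent,
    # so if it fails it can be retried until it succeeds.
    index.put_item(Item={
        "object_id": f"{bucket}/{key}",
        "version_id": version_id,
        **extra_attributes,  # the data (>2 KB) that doesn't fit in object metadata
    })
```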