upload greater than 5TB object to google cloud storage bucket - google-cloud-platform

The maximum size for a single object upload is 5 TB. How does one back up a larger single workload? Say I have a single file of 10 TB or more that needs to be backed up to cloud storage?
Also, a related question: if the 10 TB is spread across multiple files (each file under 5 TB) in a single folder, that shouldn't affect anything, correct? A single object can't be greater than 5 TB, but there isn't a limit on the actual bucket size. Say a folder contains 3 objects totaling 10 TB - will that upload be automatically split across multiple buckets (console or gsutil upload)?
Thanks

You are right, the current limit on the size of an individual object is 5 TB, so you would need to split your file into pieces smaller than that.
As for a limit on total bucket size, there is none documented. In fact, the overview says "Cloud Storage provides worldwide, highly durable object storage that scales to exabytes of data."
You might also take a look at the best practices for GCS.
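As a rough illustration of the splitting approach, here is a minimal Python sketch (the function name and the chunk size are mine, chosen for the example; in practice you would pick a chunk size just under the 5 TB limit) that cuts a large file into numbered part files you can upload as separate objects:

```python
import os

def split_file(path, chunk_size, out_dir):
    """Split `path` into numbered chunk files of at most `chunk_size` bytes."""
    os.makedirs(out_dir, exist_ok=True)
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            data = src.read(chunk_size)
            if not data:
                break
            part = os.path.join(out_dir, f"{os.path.basename(path)}.part{index:05d}")
            with open(part, "wb") as dst:
                dst.write(data)
            parts.append(part)
            index += 1
    return parts
```

Each part can then be uploaded individually (e.g. with gsutil), and the original reassembled on download with something like cat big.bin.part* > big.bin.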

Maybe look over http://stromberg.dnsalias.org/~dstromberg/chunkup.html ?
I wrote it to backup to SRB, which is a sort of precursor to Google and Amazon bucketing. But chunkup itself doesn't depend on SRB or any other form of storage.
Usage: /usr/local/etc/chunkup -c chunkhandlerprog -n nameprefix [-t tmpdir] [-b blocksize]
-c specifies the chunk handling program to handle each chunk, and is required
-n specifies the nameprefix to use on all files created, and is required
-t specifies the temporary directory to write files to, and is optional. Defaults to $TMPDIR or /tmp
-b specifies the length of files to create, and is optional. Defaults to 1 gigabyte
-d specifies the number of digits to use in filenames. Defaults to 5
You can see example use of chunkup in http://stromberg.dnsalias.org/~strombrg/Backup.remote.html#SRB
HTH

Related

Can S3 ListObjectsV2 return the keys sorted newest to oldest?

I have AWS S3 buckets with hundreds of top-level prefixes (folders). Each prefix contains somewhere between five thousand and a few million files, most growing at a rate of 10-100k per year. 99% of the time, all I care about are the newest 1-2000 or so in each folder...
Using ListObjectsV2 returns me 1000 files, and that is the max (setting "MaxKeys" to a higher value still truncates the list at 1000). This would be reasonably fine; however (per the documentation) it returns the file list in ascending alphabetical order, which, given my keys/filenames have the date in them, effectively results in an oldest->newest sort ... which is considerably less useful than if it returned the NEWEST files (or reverse-alphabetical order).
One option is to do a continuation allowing me to pull the entire prefix, then use the tail end of the entire array of keys as needed... but that would be (most importantly) slow for large 'folders'. A prefix with 2 million files would require 2,000 separate API calls, just to get the newest few-hundred filenames. (not to mention the costs incurred by pulling the entire bucket list even though I'm only really interested in the newest 1-2000 files.)
Is there a way to have the ListObjectV2 call (or any other s3 call) give me the list of the newest (or reverse-alphabetical) files? New files come in every few minutes - and the most important file is THE most recent file, so doing an S3 Inventory doesn't seem like it would do the trick.
(or, perhaps, a call that gives me filenames in a created-by date range...?)
Using javascript - but I'm sure every language has more-or-less the same features when it comes to trying to list objects from an S3 bucket.
Edit: weird idea: if AWS doesn't offer a 'sort' option on a basic API call for one of its most popular services... would it make sense to record all the filenames/keys in a DynamoDB table and query that instead?
No. ListObjectsV2() always returns up to 1,000 objects at a time, sorted alphabetically, within the requested Prefix.
You could use Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects.
If you need real-time or fairly fast access to a list of all available objects, your other option would be to trigger an AWS Lambda function whenever objects are created/deleted. The Lambda function would store/update information in a database (eg DynamoDB) that can provide very fast access to the list of objects. You would need to code this solution.
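To illustrate the continuation approach from the question, here is a hedged Python sketch (the function name is mine): it walks every page of results, keeps only a bounded tail in memory, and returns the newest keys, relying on the fact that date-embedded keys sort oldest->newest. With boto3 the pages would come from client.get_paginator('list_objects_v2').paginate(Bucket=..., Prefix=...); here the function takes any iterable of ListObjectsV2-shaped page dicts so it can be shown standalone:

```python
from collections import deque

def newest_keys(pages, n):
    """Return the last `n` keys across ListObjectsV2-style result pages.

    Assumes keys embed the date, so lexicographic order == chronological order.
    Keeps at most `n` keys in memory, but still has to read every page.
    """
    tail = deque(maxlen=n)
    for page in pages:
        for obj in page.get("Contents", []):
            tail.append(obj["Key"])
    return list(tail)
```

Note this still costs one API call per 1,000 keys, which is exactly why the Lambda + DynamoDB index above scales better for large prefixes.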

Is it possible to increase the S3 limits for read and write?

Is it possible to increase the S3 limits for read and write per second? I only found the current values in the documentation, but no indication that this limit can be increased. Does anyone know?
A good way to improve those limits is to leverage partitions. According to the documentation, the limits apply per prefix inside your bucket, so the way you store your objects affects the maximum performance. To give one example, suppose you use the bucket to store log files. One way to store them is to put everything in the root path:
2022_02_11_log_a.txt
2022_02_11_log_b.txt
2022_02_11_log_c.txt
2022_02_11_log_d.txt
2022_02_12_log_a.txt
2022_02_12_log_b.txt
2022_02_12_log_c.txt
2022_02_12_log_d.txt
To S3, those objects live inside the same partition, so together they share the maximum throughput defined in the documentation. To improve those limits, you could change the paths to the following:
2022_02_11/log_a.txt
2022_02_11/log_b.txt
2022_02_11/log_c.txt
2022_02_11/log_d.txt
2022_02_12/log_a.txt
2022_02_12/log_b.txt
2022_02_12/log_c.txt
2022_02_12/log_d.txt
Now you have two partitions: 2022_02_11 and 2022_02_12. Each one with its own throughput limits.
You should check the access pattern of your files and define partitions that leverage it. If your access pattern is random, you could use some hash pattern as part of your object's path.
See also the official documentation on object key naming.
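A minimal sketch of the renaming idea above, turning the flat 2022_02_11_log_a.txt names into date-partitioned keys (the helper name is made up):

```python
def partitioned_key(flat_name):
    """Turn '2022_02_11_log_a.txt' into '2022_02_11/log_a.txt'.

    The first three underscore-separated fields (YYYY_MM_DD) become the
    prefix, so each day gets its own partition with its own throughput limits.
    """
    parts = flat_name.split("_")
    date = "_".join(parts[:3])
    rest = "_".join(parts[3:])
    return f"{date}/{rest}"
```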

How to set a specific compression value in aws glue? If possible, can the compression level and partitions be determined manually in aws glue?

I am looking to ingest data from a source to s3 using AWS Glue.
Is it possible to compress the ingested data in Glue to a specified value? For example: compress the data to 500 MB, and also be able to partition the data based on the compression value provided? If yes, how do I enable this? I am writing the Glue script in Python.
Compression and grouping are related but distinct things here. Compression happens with Parquet output. However, you can use 'groupSize': '31457280' (30 MB, which is also the default) to specify the target size of the dynamic frame and therefore of the output files (at least most of them; the last file will be whatever remainder is left).
You also need to be careful with, and take advantage of, the Glue worker type and count, e.g. Maximum capacity 10 with Worker type Standard.
The G.2X workers tend to create too many small files (it all depends on your situation and inputs).
If you do nothing but read many small files and write them unchanged as a large group, they will be "default compressed/grouped" into the groupSize. If you want to see drastic reductions in the written file size, format the output as Parquet.
glueContext.create_dynamic_frame_from_options(connection_type="s3", format="json", connection_options={"paths": ["s3://yourbucketname/folder_name/2021/01/"], "recurse": True, "groupFiles": "inPartition", "groupSize": "31457280"})

S3 - What Exactly Is A Prefix? And what Ratelimits apply?

I was wondering if anyone knew what exactly an s3 prefix was and how it interacts with amazon's published s3 rate limits:
Amazon S3 automatically scales to high request rates. For example,
your application can achieve at least 3,500 PUT/POST/DELETE and 5,500
GET requests per second per prefix in a bucket. There are no limits to
the number of prefixes in a bucket.
While that's really clear, I'm not quite certain what a prefix is.
Does a prefix require a delimiter?
If we have a bucket where we store all files at the "root" level (completely flat, without any prefixes/delimiters), does that count as a single "prefix", and is it subject to the rate limits posted above?
The way I'm interpreting amazon's documentation suggests to me that this IS the case, and that the flat structure would be considered a single "prefix". (ie it would be subject to the published rate limits above)
Suppose that your bucket (admin-created) has four objects with the
following object keys:
Development/Projects1.xls
Finance/statement1.pdf
Private/taxdocument.pdf
s3-dg.pdf
The s3-dg.pdf key does not have a prefix, so its object appears
directly at the root level of the bucket. If you open the Development/
folder, you see the Projects.xlsx object in it.
In the above example would s3-dg.pdf be subject to a different rate limit (5500 GET requests /second) than each of the other prefixes (Development/Finance/Private)?
What's more confusing is that I've read a couple of blogs about Amazon using the first N bytes as a partition key and encouraging the use of high-cardinality prefixes; I'm just not sure how that interacts with a bucket with a "flat file structure".
You're right, the announcement seems to contradict itself. It's just not written clearly, but the information is correct. In short:
Each prefix can achieve up to 3,500/5,500 requests per second, so for many purposes the assumption is that you wouldn't need to use several prefixes.
Prefixes are considered to be the whole path (up to the last '/') of an object's location, and are no longer hashed only by the first 6-8 characters. Therefore it would be enough to just split the data between any two "folders" to double the max requests per second (if requests are divided evenly between the two).
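The "whole path up to the last '/'" rule can be sketched in a few lines of Python (the helper name is mine; the behavior matches the AWS Support examples quoted below):

```python
def s3_prefix(key):
    """Everything up to and including the last '/' of an object key.

    Keys without a '/' (root-level objects like 's3-dg.pdf') share the
    empty prefix, i.e. the root of the bucket.
    """
    slash = key.rfind("/")
    return key[:slash + 1] if slash != -1 else ""
```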
For reference, here is a response from AWS support to my clarification request:
Hello Oren,
Thank you for contacting AWS Support.
I understand that you read AWS post on S3 request rate performance
being increased and you have additional questions regarding this
announcement.
Before this upgrade, S3 supported 100 PUT/LIST/DELETE requests per
second and 300 GET requests per second. To achieve higher performance,
a random hash / prefix schema had to be implemented. Since last year
the request rate limits increased to 3,500 PUT/POST/DELETE and 5,500
GET requests per second. This increase is often enough for
applications to mitigate 503 SlowDown errors without having to
randomize prefixes.
However, if the new limits are not sufficient, prefixes would need to
be used. A prefix has no fixed number of characters. It is any string
between a bucket name and an object name, for example:
bucket/folder1/sub1/file
bucket/folder1/sub2/file
bucket/1/file
bucket/2/file
Prefixes of the object 'file' would be: /folder1/sub1/ ,
/folder1/sub2/, /1/, /2/. In this example, if you spread reads
across all four prefixes evenly, you can achieve 22,000 requests per
second.
This looks like it is obscurely addressed in an amazon release communication
https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/
Performance scales per prefix, so you can use as many prefixes as you
need in parallel to achieve the required throughput. There are no
limits to the number of prefixes.
This S3 request rate performance increase removes any previous
guidance to randomize object prefixes to achieve faster performance.
That means you can now use logical or sequential naming patterns in S3
object naming without any performance implications. This improvement
is now available in all AWS Regions. For more information, visit the
Amazon S3 Developer Guide.
S3 prefixes used to be determined by the first 6-8 characters;
This has changed mid-2018 - see announcement
https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/
But that is half-truth. Actually prefixes (in old definition) still matter.
S3 is not traditional "storage" - each directory/filename is a separate object in a key/value object store, and the data has to be partitioned/sharded to scale to a quadrillion objects. So yes, this new sharding is kind of "automatic", but not really if you create a new process that writes to it with crazy parallelism to different subdirectories. Before S3 learns the new access pattern, you may run into S3 throttling while it reshards/repartitions the data accordingly.
Learning new access patterns takes time. Repartitioning the data takes time.
Things did improve in mid-2018 (~10x throughput-wise for a new bucket with no statistics), but it's still not what it could be if the data is partitioned properly. Although, to be fair, this may not apply to you if you don't have a ton of data, or if your access pattern is not hugely parallel (e.g. running a Hadoop/Spark cluster on many TBs of data in S3, with hundreds of tasks accessing the same bucket in parallel).
TLDR:
"Old prefixes" still do matter.
Write data to the root of your bucket, and the first-level directory there will determine the "prefix" (make it random, for example).
"New prefixes" do work, but not initially; it takes time for S3 to adapt to the load.
PS. Another approach - you can reach out to your AWS TAM (if you have one) and ask them to pre-partition a new S3 bucket if you expect a ton of data to be flooding it soon.
In order for AWS to handle billions of requests per second, they need to shard up the data so it can optimise throughput. To do this they split the data into partitions based on the first 6 to 8 characters of the object key. Remember S3 is not a hierarchical filesystem, it is only a key-value store, though the key is often used like a file path for organising data, prefix + filename.
Now this is not an issue if you expect less than 100 requests per second, but if you have serious requirements over that then you need to think about naming.
For maximum parallel throughput you should consider how your data is consumed and use the most varying characters at the beginning of your key, or even generate 8 random characters for the first 8 characters of the key.
e.g. assuming first 6 characters define the partition:
files/user/bob would be bad as all the objects would be on one partition files/.
2018-09-21/files/bob would be almost as bad if only today's data is being read from partition 2018-0, but slightly better if objects from past years are read too.
bob/users/files would be pretty good if different users are likely to be using the data at the same time from partition bob/us. But not so good if Bob is by far the busiest user.
3B6EA902/files/users/bob would be best for performance but more challenging to reference, where the first part is a random string, this would be pretty evenly spread.
Depending on your data, you need to think of any one point in time, who is reading what, and make sure that the keys start with enough variation to partition appropriately.
For your example, let's assume the partition is taken from the first 6 characters of the key:
for the key Development/Projects1.xls the partition key would be Develo
for the key Finance/statement1.pdf the partition key would be Financ
for the key Private/taxdocument.pdf the partition key would be Privat
for the key s3-dg.pdf the partition key would be s3-dg.
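Both ideas above can be sketched in Python, assuming (as this answer does) that the first 6 characters define the partition, and using an 8-hex-character random prefix to spread keys across partitions (both helper names are mine):

```python
import secrets

def partition_key(key, width=6):
    """Old-style partition key: the first `width` characters of the object key."""
    return key[:width]

def randomized_key(path):
    """Prepend an 8-hex-char random prefix, e.g. '3b6ea902/files/users/bob'."""
    return f"{secrets.token_hex(4)}/{path}"
```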
The upvoted answer on this was a bit misleading for me.
If these are the paths
bucket/folder1/sub1/file
bucket/folder1/sub2/file
bucket/1/file
bucket/2/file
Your prefixes for file would actually be
folder1/sub1/
folder1/sub2/
1/
2/
https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html
Please see the docs. I had issues with the leading '/' when trying to list keys with the Airflow S3Hook.
In case you query S3 using Athena, EMR/Hive, or Redshift Spectrum, increasing the number of prefixes could mean adding more partitions (as the partition id is part of the prefix). If you use datetime as (one of) your partition keys, the number of partitions (and prefixes) will automatically grow as new data is added over time, and the total max S3 GETs per second will grow as well.
S3 - What Exactly Is A Prefix?
S3 recently updated their document to better reflect this.
"A prefix is a string of characters at the beginning of the object key name. A prefix can be any length, subject to the maximum length of the object key name (1,024 bytes). "
From - https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html
Note: "You can use another character as a delimiter. There is nothing unique about the slash (/) character, but it is a very common prefix delimiter."
As long as two objects have different prefixes, s3 will provide the documented throughput over time.
Update: https://docs.aws.amazon.com/general/latest/gr/glos-chap.html#keyprefix reflecting the updated definition.

AWS Redshift - Set part size while unloading to s3

While unloading a large result set to S3, Redshift automatically splits the output into multiple part files. Is there a way to set the size of each part while unloading?
When unloading, you can use MAXFILESIZE to indicate the maximum size of each file.
For example:
unload ('select * from venue')
to 's3://mybucket/unload/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
maxfilesize 1 gb;
From here
Redshift by default unloads data into multiple files according to the number of slices in your cluster. So, if you have 4 slices in the cluster, 4 files are written concurrently, one per slice.
Here is short answer to your question from the Documentation. Go here for details.
"By default, UNLOAD writes data in parallel to multiple files, according to the number of slices in the cluster. The default option is ON or TRUE. If PARALLEL is OFF or FALSE, UNLOAD writes to one or more data files serially, sorted absolutely according to the ORDER BY clause, if one is used. The maximum size for a data file is 6.2 GB. So, for example, if you unload 13.4 GB of data, UNLOAD creates the following three files."
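A quick sanity check of the arithmetic in that quote (13.4 GB of data at a 6.2 GB per-file maximum), as a tiny Python sketch:

```python
import math

def num_unload_files(total_gb, max_file_gb=6.2):
    """Number of files UNLOAD needs when each file is capped at `max_file_gb`."""
    return math.ceil(total_gb / max_file_gb)
```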
I hope this helps.