I'm uploading a file that is 8.6T in size.
$ nohup gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp big_file.jsonl gs://bucket/big_file.jsonl > nohup.mv-big-file.out 2>&1 &
At some point, it just hangs, with no error messages, nothing.
Any suggestions on how I can move this large file from the box to the GS bucket?
As @John Hanley mentioned, the maximum size for an individual object stored in Cloud Storage is 5 TiB, as stated in Buckets and Objects Limits.
Here are some workarounds you can try:
Split the file into several objects, each under the limit, and upload them to a single bucket (across multiple folders if you like), since there is no limit on the total bucket size.
A second option is Parallel composite uploads, which chunk a file into up to 32 pieces that are uploaded in parallel. Keep in mind that the final composed object is still subject to the per-object limit, so this speeds up the upload but does not get an 8.6T file into one object.
Another option to consider is Transfer Appliance, for faster, higher-capacity uploads to Cloud Storage.
You might also want to take a look at the GCS best practices documentation.
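To make the first workaround concrete, here is a minimal Python sketch of how the split points could be computed. It assumes the 5 TiB per-object limit and reads the question's "8.6T" as 8.6 TiB; the function name plan_parts and the 4 TiB part size are illustrative choices, not part of any Google tool.

```python
TIB = 1024 ** 4
MAX_OBJECT_BYTES = 5 * TIB  # per-object limit discussed in the answer


def plan_parts(total_bytes, part_bytes):
    """Return (offset, length) pairs that cover total_bytes in order."""
    if not 0 < part_bytes <= MAX_OBJECT_BYTES:
        raise ValueError("part size must be positive and within the object limit")
    count = -(-total_bytes // part_bytes)  # integer ceiling division
    return [
        (i * part_bytes, min(part_bytes, total_bytes - i * part_bytes))
        for i in range(count)
    ]


# Assumption: "8.6T" is read as 8.6 TiB; 4 TiB parts are an arbitrary choice.
parts = plan_parts(int(8.6 * TIB), 4 * TIB)
for offset, length in parts:
    print(f"offset={offset} length={length}")
```

For a JSONL file you would probably want to cut on line boundaries rather than at exact byte offsets, so treat the offsets as targets rather than hard cut points.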
I have a large bucket (PiB) and I'm interested in running some regex queries to understand how many bytes certain paths take.
gsutil du -s -a gs://.... works well at a small scale, but I have two questions:
1. Is there a better way to analyze size for redundant paths in GCS that isn't gsutil du?
2. Is there an associated cost for running this command on my bucket?
I think gsutil du is the tool to use for this analysis. There is no faster way to do it.
But if you need to do it regularly, you may need to enable bucket logging.
You can read more about it here:
https://cloud.google.com/storage/docs/access-logs#delivery
As for the cost, the object listing requests it makes count as Class A operations:
https://cloud.google.com/storage/pricing
With Cloud Storage, you can't search for objects by regex, only by prefix. If you want a regex, you have to mirror the file names elsewhere and search for the pattern you want there.
How to mirror? You have to do it yourself :(
About the gsutil du command, it's pretty simple: the gsutil binary queries the Cloud Storage API to list the files. The file metadata (especially the file size) is present in that API response, and gsutil aggregates the results. That means 1 Class A operation per 1,000 files (the maximum page size).
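Once the names and sizes are mirrored somewhere, the regex part is straightforward. A minimal Python sketch, assuming the mirror is available as (name, size) pairs; the function name bytes_by_pattern and the sample listing are hypothetical:

```python
import re
from collections import defaultdict


def bytes_by_pattern(listing, patterns):
    """Sum object sizes for every regex that matches the object name.

    listing: iterable of (object_name, size_in_bytes) pairs from your mirror.
    patterns: dict mapping a label to a regex string.
    """
    compiled = {label: re.compile(rx) for label, rx in patterns.items()}
    totals = defaultdict(int)
    for name, size in listing:
        for label, rx in compiled.items():
            if rx.search(name):
                totals[label] += size
    return dict(totals)


# Hypothetical mirrored listing; real data would come from your own index.
listing = [
    ("logs/2021/01/app.log", 1200),
    ("logs/2021/02/app.log", 800),
    ("images/cat.png", 5000),
]
totals = bytes_by_pattern(listing, {"logs": r"^logs/\d{4}/", "png": r"\.png$"})
print(totals)
```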
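As rough arithmetic for the "one listing call per 1,000 files" point, here is a small Python sketch. The price passed in is a placeholder and an assumption on my part; check the pricing page for the current rate for your storage class.

```python
def list_request_count(object_count, page_size=1000):
    """Number of listing calls needed to enumerate object_count objects."""
    return -(-object_count // page_size)  # integer ceiling division


def listing_cost(object_count, price_per_1000_ops, page_size=1000):
    """Estimated cost of one full listing, in the currency of the price."""
    return list_request_count(object_count, page_size) * price_per_1000_ops / 1000


# Placeholder price per 1,000 operations; NOT taken from the pricing page.
ASSUMED_PRICE_PER_1000_OPS = 0.005
print(list_request_count(50_000_000))
print(listing_cost(50_000_000, ASSUMED_PRICE_PER_1000_OPS))
```

A single pass is cheap by this estimate; the cost adds up when du is run repeatedly over a very large bucket.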
To answer your question 2. Is there an associated cost for running this command on my bucket?, the answer is yes.
I was charged $20 today in the Class A Operations category, and the only things I did were upload files to my bucket and check the bucket size using gsutil du -s.
They explicitly mention this in their documentation:
Caution: The gsutil du command calculates the current space usage by making a series of object listing requests, which can take a long time for large buckets. If the number of objects in your bucket is hundreds of thousands or more, or if you want to monitor your bucket size over time, use Monitoring instead, as described in the Console tab.
When I run aws s3 cp local_file.csv s3://bucket_name/file.csv, the upload begins properly and runs OK until the speed slows down and it eventually times out (at around 20-30% uploaded) with the following error:
Read timeout on endpoint URL: "https://bucketname.s3.amazonaws.com/file.csv?uploadid=xxx&partNumber=65"
The file is a large one (~2 GB), but I ran this process OK in the past from another network with higher upload speeds. Now that I'm running it from home at a lower speed (max 10 Mbps, but this goes down the longer the upload takes), I want to allow more leeway before it times out.
Any idea how to set that timeout to a different threshold? Couldn't spot this in the AWS docs.
I had to add a new parameter to the command: --cli-read-timeout (in seconds; 0 means the read never times out)
For example:
aws s3 cp SOURCE_FOLDER TARGET_FOLDER --recursive --cli-read-timeout 0
More information: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-options.html
You might have to set up some configuration values for the CLI in your config file so that your large file is broken down into manageable chunks. See the link below:
https://docs.aws.amazon.com/cli/latest/topic/s3-config.html
Also make sure that your CLI version is up to date.
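For a feel of why a slow link can run into the timeout, here is a back-of-the-envelope Python sketch. The chunk size (8 MB) and concurrency (10 requests) are what I understand the AWS CLI defaults to be, so treat them as assumptions and check the S3 configuration page linked above. It also assumes the link is shared evenly between requests and does not model how the CLI actually applies the read timeout.

```python
def seconds_per_part(chunk_bytes, link_bits_per_second, concurrent_requests):
    """Time to push one multipart chunk if the link is shared evenly."""
    per_request_bps = link_bits_per_second / concurrent_requests
    return chunk_bytes * 8 / per_request_bps


# Assumed defaults: 8 MB chunks, 10 concurrent requests; 10 Mbps uplink
# is the figure from the question.
estimate = seconds_per_part(8 * 1024 * 1024, 10_000_000, 10)
print(f"about {estimate:.0f} seconds per part")
```

If the estimate lands near or above the read timeout, the knobs mentioned in the answers above apply: raise --cli-read-timeout, or lower the concurrency or chunk size in the CLI's S3 configuration.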
I have a single bucket with a large number of very small text files (between 500 bytes and 1.2 KB). This bucket currently contains over 1.7 million files and will keep growing.
The way that I add data to this bucket is by generating batches of files (on the order of 50,000 files) and transferring those files into the bucket.
Now the problem is this. If I transfer the files one by one in a loop, it takes an unbearably long time. So if all the files are in a directory origin_directory, I would do
aws s3 cp origin_directory/filename_i s3://my_bucket/filename_i
I would do this command 50000 times.
Right now I'm testing this on a set of about 280K files. Doing this would take approximately 68 hours according to my calculations. However I found out that I can sync:
aws s3 sync origin_directory s3://my_bucket/
Now this works much, much faster. (It will take about 5 hours, according to my calculations.) However, the sync needs to figure out what to copy (files present in the directory and not present in the bucket). Since the number of files in the bucket will keep growing, I'm thinking that this will take longer and longer as time goes on.
However, since I delete the information after every sync, I know that the sync operation needs to transfer all files in that directory.
So my question is, is there a way to start a "batch copy" similar to the sync, without actually doing the sync?
You can use:
aws s3 cp --recursive origin_directory/ s3://my_bucket/
This is the same as a sync, but it will not check whether the files already exist.
Also, see Use of Exclude and Include Filters to learn how to specify wildcards (e.g. all *.txt files).
When copying a large number of files using aws s3 sync or aws s3 cp --recursive, the AWS CLI will parallelize the copying, making it much faster. You can also play with the AWS CLI S3 Configuration to potentially optimize it for your typical types of files (e.g. copy more files simultaneously).
Try using https://github.com/mondain/jets3t
It performs the same function but works in parallel, so it should complete the job much faster.
Lately, we've noticed that our AWS bill has been higher than usual. It's due to adding an aws s3 sync task to our regular build process. The build process generates something around 3,000 files. After the build, we run aws s3 sync to upload them en masse into a bucket. The problem is that this is monetarily expensive. Each upload is costing us ~$2 (we think), and this adds up to a monthly bill that raises eyebrows.
Only maybe 1 or 2 of those files actually change from build to build. The rest are always the same. Yet aws s3 sync sees that they all changed and uploads the whole lot.
The documentation says that aws s3 sync compares the file's last modified date and byte size to determine if it should upload. The build server creates all those files brand-new every time, so the last modified date is always changed.
What I'd like to do is get it to compute a checksum or a hash on each file and then use that hash to compare the files. Amazon S3 already has the ETag field, which can be an MD5 hash of the file. But the aws s3 sync command doesn't use the ETag.
Is there a way to use etag? Is there some other way to do this?
The end result is that I'd only like to upload the 1 or 2 files that are actually different (and save tremendous cost)
The aws s3 sync command has a --size-only parameter.
From aws s3 sync options:
--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.
This will likely avoid re-copying files that are regenerated with the same content. Note that a file whose content changes without changing its size will be skipped.
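If size alone is too coarse, the hash comparison the question asks for can be done outside of sync. A minimal Python sketch, assuming you have already collected the remote ETags into a dict (how you gather them is up to you) and that the objects were uploaded in a way that makes the ETag a plain MD5, which is not the case for multipart uploads. The names md5_hex and files_to_upload are hypothetical.

```python
import hashlib
import os


def md5_hex(path, block_size=1024 * 1024):
    """Hex MD5 of a file, read in blocks so large files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def files_to_upload(local_dir, remote_etags):
    """Return keys whose local MD5 differs from the recorded ETag.

    remote_etags: dict mapping a key (relative path using '/') to its ETag.
    Keys missing from the dict are treated as new and always returned.
    """
    changed = []
    for root, _dirs, names in os.walk(local_dir):
        for name in names:
            full = os.path.join(root, name)
            key = os.path.relpath(full, local_dir).replace(os.sep, "/")
            # ETags are often wrapped in double quotes; strip them first.
            if remote_etags.get(key, "").strip('"') != md5_hex(full):
                changed.append(key)
    return sorted(changed)
```

The returned keys could then be passed to individual aws s3 cp calls, so only the 1 or 2 changed files are uploaded.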
As an alternative to s3 sync or cp you could use s5cmd
https://github.com/peak/s5cmd
It can sync files when the size or date differs, and it also has speeds of up to 4.6gb/s
Example of the sync command:
AWS_REGION=eu-west-1 /usr/local/bin/s5cmd -stats cp -u -s --parents s3://bucket/folder/* /home/ubuntu
S3 charges $0.005 per 1,000 PUT requests (doc), so it's extremely unlikely that uploading 3,000 files is costing you $2 per build. Maybe $2 per day if you're running 50-100 builds a day, but that's still not much.
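The arithmetic behind that estimate, as a short Python sketch. It uses the $0.005 per 1,000 PUT requests figure quoted above and counts only PUT requests, ignoring any listing calls sync makes; the 100 builds per day is just the upper end of the range mentioned.

```python
PUT_PRICE_PER_1000 = 0.005  # USD, the figure quoted in the answer


def put_cost(request_count, price_per_1000=PUT_PRICE_PER_1000):
    """Cost in USD of request_count PUT requests."""
    return request_count / 1000 * price_per_1000


per_build = put_cost(3000)       # one build uploading every file
per_day = put_cost(3000 * 100)   # assuming 100 such builds in a day
print(f"per build: ${per_build:.3f}, per 100 builds: ${per_day:.2f}")
```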
If you really are paying that much per build, you should enable CloudTrail events and see what is actually writing that much (for that matter, maybe you've created some sort of recursive CloudTrail event log).
"The end result is that I'd only like to upload the 1 or 2 files that are actually different"
Are these files the artifacts produced by your build? If yes, why not just add a build step that copies them explicitly?
The issue I ran into was with using the wildcard * in the --include option. Using one wildcard was fine, but when I added a second * such as /log., it looked like sync tried to download everything to compare, which took a lot of CPU and network bandwidth.
I have an S3 bucket in Region A structured like this:
ProviderA-1-1
    31423423.jpg
ProviderB-1-1
    32423432.jpg
The top level folder is a unique image identifier. The filename is the version of the image.
I want to copy the images to a bucket in Region B, structured like this:
ProviderA-1-1.jpg
ProviderB-1-1.jpg
I.e. I don't care about the version. I just want the folder name (which is unique) to be the filename.
The reason I'm doing this is to have a flat structure to make use of image services like Imgix / ImageKit (they provide on-the-fly image transformation for images, given a flat source origin).
So, my requirements are:
1. I need to copy lots of images (millions of them, ~10 TB)
2. The destination bucket is in another region
3. I need to 'flatten' the structure and change the name of each image to the name of the folder it is in (folder names aren't fixed)
I've seen a few answers here suggesting the AWS CLI is the best approach, but I'm not sure how I can achieve 3. with that?
Sounds like I need to loop through the images one by one, changing the name before I copy. If a script is suggested, I'm most comfortable with .NET, so perhaps the AWS .NET SDK?
This is a one-off job, where I need to move the images as quickly and cheaply as possible.
Advice please?
Thanks :)
Yes, a script is required because you are moving and renaming the files.
If you're comfortable with .NET, then use that!
The basic program would be:
Create two S3 clients, one for the source bucket (to obtain the listing) and one for the destination bucket, because the buckets are in different regions. Copy commands are sent to the destination bucket, which pulls the file from the source bucket.
Use ListObjects() to obtain a list of the source bucket. Note that it will return 1000 files at a time, so use NextMarker to request the subsequent batch.
Loop through each file and use CopyObject() to simultaneously copy and rename the file. Use your own logic to take the folder name and convert it to a filename. Each file will be copied directly between the buckets, without needing to download/upload
Continue, looping through the list of 1000 files and then get the next 1000 files, etc.
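The "use your own logic" renaming step above could look like the following. This is a sketch in Python rather than .NET, covering only the key mapping and none of the S3 calls; the function name flattened_key is hypothetical, and the key layout is assumed to be the folder/file structure shown in the question.

```python
import posixpath


def flattened_key(source_key):
    """Map 'ProviderA-1-1/31423423.jpg' to 'ProviderA-1-1.jpg'."""
    folder, filename = posixpath.split(source_key)
    if not folder or not filename:
        raise ValueError(f"expected 'folder/file', got {source_key!r}")
    extension = posixpath.splitext(filename)[1]
    # Use the immediate parent folder as the new base name.
    return posixpath.basename(folder) + extension


for key in ["ProviderA-1-1/31423423.jpg", "ProviderB-1-1/32423432.jpg"]:
    print(key, "->", flattened_key(key))
```

If a folder ever holds more than one version, every version maps to the same destination name and the last copy wins, so you may want to pick the version to keep before copying.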
The process could be sped up by using multi-threading but the logic gets a bit hard. It might be easier to simply run a few copies of the program at the same time, each handling a different Prefix range (effectively, folder names).
It's a one-off job, so optimization isn't important.
If you are adding more files in future, the best method would be to create an AWS Lambda function that is triggered whenever a new file is created in S3. The Lambda function would then copy the file to the destination, then exit.
Assuming you have no location constraints set up for your buckets, flattening would simply be:
aws s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/
This assumes you have the CLI installed and the required credentials set up correctly. Alternatively, you can pass them on the command line:
aws --profile profile_A2B --region XXX s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/ --acl yyy
You don't mention any performance requirements. There are many ways of making the transfer faster, and which one is best depends on many factors. A few blind hints I can give are:
See if transfer acceleration can help you.
In general S3 to S3 transfer is faster than S3 to/from non-S3 location.
See if you can create parallel batches by prefix like:
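for prefix in {a..z}
do
  # S3 paths do not expand wildcards, so filter with --exclude/--include instead
  aws s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/ --exclude "*" --include "${prefix}*" &
done
wait  # block until all background copies have finished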
If this is not a one time transfer and the transfer acceleration isn't cutting it for you, consider:
download from S3 (in region A) to a local HDD residing in region A.
transfer from local HDD in region A to a local HDD in region B using other methods like Aspera or FileCatalyst or whatever else you can find.
upload from local HDD in region B to S3 (in region B).
I have no practical data to share except that Aspera blows things like FTP out of the water; it's not even a competition. YMMV.
John already covered the pseudo code. I'll just make one change to it. Write two separate programs, one to fetch the list of filenames and second to copy. It takes a lot of time to list files if you have millions of them.
Once you've listed the file names in a file, say one per line, it would be pretty easy to parallelize given you can split the file (say split -l 1000 file_list splits).
Use xargs -P or GNU parallel to run multiple aws s3 cp commands at once, if you're using shell instead of .NET.
Finally don't forget to set the ACL (and other attributes like TTL etc) on target files during the copy. Doing that after the copy will take a long time.
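The list-then-split idea above can also be done in a script instead of with split. A minimal Python sketch, assuming the key list is already in memory; the function name batches and the sample keys are hypothetical, and each batch would then be handed to its own worker.

```python
def batches(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical key list; in practice this comes from the listing program.
keys = [f"folder-{i}/image.jpg" for i in range(2500)]
for number, batch in enumerate(batches(keys, 1000)):
    print(f"batch {number}: {len(batch)} keys")
```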