GCS to S3 transfer - improve speed - amazon-web-services

We perform a weekly transfer from GCS to S3 using gsutil command below. 5,000 compressed objects, ~82 MB each - combined size of ~380 GB. It exports data to be used by Redshift, if that's of any relevance
Same kind of transfer from an on-prem Hadoop cluster to S3 took under 1 hour. Now with gsutil, it takes 4-5 hours.
I'm aware that, under the hood, gsutil downloads the files from GCS and then uploads them to S3 which adds some overhead. So, hoping for faster speeds, I've tried executing gsutil on Compute Engine in the geographical location of S3 and GCS buckets but it was equally slow
I've played with parallel_process_count and parallel_thread_count parameters but it made no difference
gsutil -m rsync -r -n GCS_DIR S3_DIR
My questions are:
Is there anything else I can do to speed it up?
What combinations of parallel_process_count and parallel_thread_count would you try?
Is there any way to find out which stage creates the bottleneck (if any)? I.e. is it upload or download stage?
Looking at logs, does below mean that bandwidth is at 0% for a period of time?
Copying gcs://**s3.000000004972.gz
[Content-Type=application/octet-stream]...
[4.8k/5.0k files][367.4 GiB/381.6 GiB] 96% Done 0.0 B/s
Thanks in advance :)

The optimal values for parallel_process_count and parallel_thread_count depend on network speed, number of CPUs and available memory - it's recommended that you experiment a bit to find the optimal values.
You might try using perfdiag to get more information about the bucket on Google Cloud's side - it's a command that runs a suite of diagnostic tests for a given bucket.
The output you've shared indicates that no upload is happening for some period of time, perhaps due to the way gsutil chunks the uploads.
As a final recommendation for speeding up your transfers to Amazon, you might try using Apache Beam / Dataflow.

Related

How much would AWS ec2 cost for a project of my type

I have tried many times to install the R server on an AWS instance using terminal commands without any luck. I can install it using http://www.louisaslett.com/RStudio_AMI/
and following a Youtube video but I cannot get the dropbox sync to stop "syncing". I have tried installing a fresh version using the terminal and Putty and other methods without much success.
What I wanted to use AWS for was to use the bandwidth / computing time.
I basically wanted to run an R script to download a bunch of documents which could take 2 weeks to download. I had hoped to save these on a large dropbox account I have access to but unfortunately library("RStudioAMI")
linkDropbox()
excludeSyncDropbox("*") doesn`t seem to work for me and the whole dropbox folder gets synced onto my AWS instance and I run out of space.
So basically... I think I will forget dropbox and just use AWS storage.
I want to download appox 500GB - or perhaps 1TB worth of data (running an R script to download documents and save them), it just connects to a website and downloads a document, so no ML or high computing power needed. Just a consistent connection. Once the documents are fully downloaded I would like to then just transfer them to an external hard drive I have for further analysis.
So my question is, "approximately" how much do you think this may cost, I don't care about paying 20-30$ I just don`t want to go in with inexperience/without knowledge and rack up hundreds$.
Additionally: What other instances/servers do you suggest I pay for, I feel like I dont need that much power just consistency.
Here is another SO question I opened:
Amazon AWS Dropbox link error: "No directories are being ignored."
There will be three main costs for your scenario:
Amazon EC2, which is charged hourly. You do not need much processing power, so a t3.small would probably be adequate if you're not doing any big computations. It's only about 2c/hour, which is $7 for 2 weeks.
An Amazon EBS disk volume attached to your Amazon EC2 instance for storing the data. A General Purpose volume is 10c/GB/month. So, 1TB for 2 weeks would be $50. If you configure it to use "Cold HDD (sc1)", then it's a quarter of that price.
Data Transfer for when you download from AWS. If you are using AWS in the USA, it is 9c/GB. So, 1TB = $90. This would be your major cost.
There might be some other minor costs, but they won't be significant compared to the above.
Or, given that your basic goal is to collect and download data, you could just do it on a computer at home.
If you are not strictly limited to EC2 ( which I think you are not, considering the requirement you stated and the AMI approach failed for you) , AWS Lightsail would be a much better solution
It has bundled data transfer package and acceptable performance
Here is the 1-month plan
512 MB Memory
1 Core Processor
20 GB SSD Disk
1 TB Transfer ( Data in will cost nothing, only data Out, Ex: From LightSail to your local PC )
Additional SSD - $10 for 1 TB
Average network performance for that instance I see is about 30 Megabyte per second. You can just shutdown everything and only billed for the hours you used in the month

hadoop fs -du / gsutil du is running slow on GCP

I am trying to obtain the size of directors in Google bucket but command is running a long time.
I have tried with 8TB data having 24k subdirectory and files, it is taking around 20~25 minutes, conversely, same data on HDFS is taking less than a minute to get the size.
commands that I use to get the size
hadoop fs -du gs://mybucket
gsutil du gs://mybucket
Please suggest how can I do it faster.
1 and 2 are nearly identical in that 1 uses GCS Connector.
GCS calculates usage by making list requests, which can take a long time if you have a large number of objects.
This article suggests setting up Access Logs as alternative to gsutil du:
https://cloud.google.com/storage/docs/working-with-big-data#data
However, you will likely still incur the same 20-25 minute cost if you intend to do any analytics on the data. From GCS Best Practices guide:
Forward slashes in objects have no special meaning to Cloud Storage,
as there is no native directory support. Because of this, deeply
nested directory- like structures using slash delimiters are possible,
but won't have the performance of a native filesystem listing deeply
nested sub-directories.
Assuming that you intend to analyze this data; you may want to consider benchmarking fetch performance of different file sizes and glob expressions with time hadoop distcp.

Saving ALS latent factors in Spark ML to S3 taking too long

I am using a Python script to compute users and items latent factors using Spark ML's ALS routine as described here.
After computing latent factors, I am trying to save those to S3 using the following:
model = als.fit(ratings)
# save items latent factors
model.itemFactors.rdd.saveAsTextFile(s3path_items)
# save users latent factors
model.userFactors.rdd.saveAsTextFile(s3path_users)
There are around 150 million users. LFA is computed quickly (~15 min) but writing out the latent factors to S3 takes almost 5 hours. So clearly, something is not right. Could you please help identify the problem?
I am using 100 users blocks and 100 items blocks in computing LFA using ALS - in case this info might be relevant.
Using 100 r3.8xlarge machines for the job.
Is this EMR, the official ASF Spark version, or something else?
One issue here is that the S3 clients have tended to buffer everything locally onto disk, then only start the upload afterwards.
If it's ASF code, you could make sure you are using Hadoop 2.7.x, use s3a:// as the output schema, and play with the fast output stream options, which can do incremental writes as things get generated. It's a bit brittle in 2.7, will be way better in 2.7.
If you are on EMR, you are on your own there.
Another possible cause is that S3 throttles clients generating lots of HTTPS requests to a particular shard of S3, which means: specific bits of an S3 bucket, with the first 5-8 characters apparently determining the shard. If you can use very unique names there, then you may get throttled less.

Backup: Amazon S3 or Glacier - lots of little files?

I'm trying to understand the complicated Amazon Glacier pricing model. I don't want to store a huge amount of data, a few GB's say 10. I hope never to download the files and if I did need to I don't care how long it takes.
Is there a cost per file I upload? Is it cheaper to zip lots of tiny files and upload in a few chunks or does 10,000 say images not matter? (cannot get a straight answer to this during searching)
Am I able to request the download of a whole Archive/Bucket or is it file-by-file?
I know this is a bit old, but you may still find my answer helpful (I hope). The other answer is based on S3 which wasn't your question I believe.
Glacier is intended for rare file access. Having that in mind they sort of punish you if you need to retrieve many files at once. In your particular case I would suggest uploading 10.000 separate files instead of let's say 100 ZIP files with 100 files each. The reason is very simple. Glacier will let you download for free only 5% of the total archive and is prorated daily. So if, for example, you need to download 10 photos you took on a weekend, you would be able to get those 10 photos for free if they are spread in the vault. On the other hand, if you have a ZIP file that has 100 photos inside, you'll be forced to download that zip that will probably be more than 5% of the total archive meaning you'll be paying some fees for the retrieval.
The only reason it makes sense to upload fewer files is to avoid high upload requests (10.000 files usually mean 10.000 requests). Requests are charged $0,05 per 1000. This fees are much lower that retrieval fees (taking into account the limits imposed), that's why I would always recommend uploading separate files. Of course you may zip files that make sense to be together.
Retrieval costs are very complex in Amazon Glacier. They have a good explanation here:
http://aws.amazon.com/glacier/faqs/#How_much_data_can_I_retrieve_for_free
But even there you'll need to pay attention on the calculations to get a clear idea on how costs are billed.
Regarding this question:
Am I able to request the download of a whole Archive/Bucket or is it file-by-file?
Requests are by file-by-file, although you can select many files at once and download them altogether.
Deciding whether to use S3 or Glacier really depends on your needs on file access. If you will rearly need access to your files then Glacier is your answer. Otherwise for 10GB S3 can still be cheap and be more flexible than Glacier.
In my case I find family photos to be a very precious thing. That's why I have a 100GB backup on glacier with all my family photos. I don't intend to access it unless there is some kind of disaster at home. In that case, I think I would not mind the retrieval cost if that saved something I really care about. But that's just me.
Detailed pricing information for S3 is available here. Specifics of the API functions available are here.
For S3, you are mostly charged for upload bandwidth (bytes sent TO S3), download bandwidth (bytes received FROM S3), and storage (bytes IN S3). You are also charged for the number and type of API calls.
So, if you upload your 10GB of data to S3 in 10,000 1MB files, store it for a month, and then download each of the files once, you'll be charged:
$0.00 for upload bandwidth (this is free)
$0.10 for the 10,000 PUT requests to upload the files
$0.95 for storing the 10GB for a month
$1.08 for 10GB download bandwidth (the first is free, then $0.12/GB)
$0.01 for the 10,000 GET requests to download the files
That's $2.14. If you uploaded and downloaded once each, but kept the data for a year, only the storage cost would go up to 12 * $0.95, or $11.40. If your files averaged only 100KB, so you had 100,000 of them, you'd pay 10 times as much for the PUT and GET requests, or $1.10 instead of $0.11.
You can only upload and download a single file per operation. If you combined your files into one using Zip, you'd only save by using fewer operations, which, as you can see, are pretty cheap to start with.
There is one quirk here, though. I'm pretty sure you are charged for all bandwidth usage when uploading and downloading, including request headers, not just the bodies containing your data. So if your files were really tiny the request headers might become significant, perhaps as much as the files themselves. In that case your bandwidth costs would double.
Glacier pricing is more complicated, and I've never used it myself. Basically, it reduces storage cost by almost ten-fold, leaving the other costs the same, and adding costs to archive and restore per object. Those costs seem to be significant if you have a lot of small objects, need to get a lot of your files at a time, or get files frequently. Glacier seems to be best when you have a lot of data (terabytes or more, not just gigabytes), but few operations. Given that you only have 10GB of data, S3 is so inexpensive it doesn't seem worth it to consider Glacier.
Finally, AWS has a free usage tier for the first year, which looks like it would cover all your costs except for half the storage charges.
Better use few larger files than lot of small ones
There are two approaches to putting files into Amazon Glacier. You either interact with vaults directly, or use S3 as frontend.
I am using S3 (and Amazon Management Console) so that I am able to see content of the archive and at the same time have it stored cheaply in Glacier.
This approach has one drawback - as storing any piece of information in Glacier has some data overhead (which you pay for too), then there is logically a break even point. Before 2014-04 price reduction I made a calculation and critical size is about 16 kB, storing smaller files in Glacier (using AWS S3 as frontend) was more expensive than keeping it only on S3. With price reduction for S3 storage (Glacier did not change) the break even point went even higher.
I guess, that even without S3 as frontend, the situation will be similar, even though a bit more friendly to smaller files.
Since November 21, 2016, Amazon updated the free tier policy for Glacier retrievals and updated the "5% of your average monthly storage" policy in favor of a flat 10GB free per month. However, if your retrieval policy was set prior to that day, then you're still on the "5%" policy and the other answers here still apply to you.
If your retrieval policy was set after Nov 21, 2016, and you're in the OP's shoes:
You're only storing 10GB, so you could retrieve all of your data for free once per month using Standard retrievals. It would make no difference if all 10,000 photos are zipped into one zip file or not (for retrievals).
The only variable in this scenario is number of upload requests. 10,000 requests at a price of $0.05 per 1,000 is only $0.50 and that's a one time fee for your specific case.
More pricing info at AWS Glacier FAQ
UPDATE:
Glacier docs recommend using multipart upload for files larger than 100MB.
I came to this conclusion independently after a couple timeouts when trying to upload an 8GB file.

What is the most efficient S3 GET request method?

I can download a file from S3 using either of the following methods.
s3cmd get s3://bucket_name/DB/company_data/abc.txt
wget http://bucket_name.s3.amazonaws.com/DB/company_data/abc.txt
My question is :
1) Which one is faster?
2) Which one is cheaper?
According to some past research, the s3cmd GET operation is about 5 times slower than wget. Keep in mind that s3cmd is a utility designed to retrieve files from your S3 filesystem. It doesn't use the HTTP protocol but instead uses the s3 protocol.
The only time I can see using the s3cmd utility is for cases where you're retrieving files you cannot otherwise retrieve using standard HTTP GET methods, like when the files on S3 don't have read permissions or you're doing maintenance on your S3 buckets.
Based on your question, I'm assuming you're trying to use this utility in a production system; however, it doesn't appear that was the intention or goals of the utility.
For more details, check out performance testing spreadsheet.
As far as costs goes, I'm not an expert on Amazon pricing, but I believe they bill based on actual data transferred, so a 1GB file would cost the same regardless of whether you downloaded it quickly or slowly. It's like the question where someone asks you what is heavier, ten pounds of bricks or ten pounds of feathers.