hadoop fs -du / gsutil du is running slow on GCP - google-cloud-platform

I am trying to obtain the size of directories in a Google Cloud Storage bucket, but the command runs for a very long time.
I tried with 8 TB of data spread over 24k subdirectories and files; it takes around 20-25 minutes. Conversely, getting the size of the same data on HDFS takes less than a minute.
The commands that I use to get the size:
hadoop fs -du gs://mybucket
gsutil du gs://mybucket
Please suggest how I can do this faster.

The two commands are nearly identical; the only real difference is that the first goes through the GCS Connector.
GCS calculates usage by making list requests, which can take a long time if you have a large number of objects.
This article suggests setting up Access Logs as an alternative to gsutil du:
https://cloud.google.com/storage/docs/working-with-big-data#data
However, you will likely still incur the same 20-25 minute cost if you intend to do any analytics on the data. From the GCS Best Practices guide:
Forward slashes in objects have no special meaning to Cloud Storage, as there is no native directory support. Because of this, deeply nested directory-like structures using slash delimiters are possible, but won't have the performance of a native filesystem listing deeply nested sub-directories.
Assuming that you intend to analyze this data, you may want to consider benchmarking the fetch performance of different file sizes and glob expressions with time hadoop distcp.
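Since both commands ultimately just list the objects and sum their sizes, you can also do that directly with the GCS Python client if you want to script it or scope it to a prefix. A minimal sketch, assuming the google-cloud-storage package is installed and that mybucket and the prefix below are placeholders for your own values:

from google.cloud import storage

# Placeholder bucket and prefix - substitute your own values.
BUCKET = "mybucket"
PREFIX = "some/sub/dir/"

client = storage.Client()
# list_blobs pages through object metadata; sizes come back with each
# listing entry, so no per-object requests are needed.
total_bytes = sum(blob.size for blob in client.list_blobs(BUCKET, prefix=PREFIX))
print(f"{total_bytes / 1024**4:.2f} TiB under gs://{BUCKET}/{PREFIX}")

This won't be faster than gsutil du for the same number of objects - the listing work is identical - but it lets you scope the calculation to a prefix and cache the result between runs.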

Related

Does Dask communicate with HDFS to optimize for data locality?

In the Dask distributed documentation, there is the following information:
For example Dask developers use this ability to build in data locality when we communicate to data-local storage systems like the Hadoop File System. When users use high-level functions like dask.dataframe.read_csv('hdfs:///path/to/files.*.csv') Dask talks to the HDFS name node, finds the locations of all of the blocks of data, and sends that information to the scheduler so that it can make smarter decisions and improve load times for users.
However, it seems that get_block_locations() was removed from the HDFS fs backend, so my question is: what is the current state of Dask regarding HDFS? Is it sending computation to nodes where data is local? Is it optimizing the scheduler to take data locality on HDFS into account?
Quite right: with the appearance of Arrow's HDFS interface, which is now preferred over hdfs3, the consideration of block locations is no longer part of workloads accessing HDFS, since Arrow's implementation doesn't include the get_block_locations() method.
However, we already wanted to remove the somewhat convoluted code which made this work, because we found that the inter-node bandwidth on test HDFS deployments was good enough that considering locality made little practical difference in most workloads. The extra constraints on the size of the blocks versus the size of the partitions you would like in memory created an additional layer of complexity.
By removing the specialised code, we could avoid the very special case that was being made for HDFS as opposed to external cloud storage (s3, gcs, azure) where it didn't matter which worker accessed which part of the data.
In short, yes the docs should be updated.
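For reference, the high-level call quoted from the docs looks the same whether or not locality is considered. A minimal sketch, assuming a running distributed cluster and a hypothetical set of CSV files on HDFS (the scheduler address and column name are placeholders):

import dask.dataframe as dd
from dask.distributed import Client

# Connect to an existing scheduler; the address is a placeholder.
client = Client("tcp://scheduler-host:8786")

# Reads go through Arrow's HDFS interface; with the locality-specific code
# removed, any worker may read any block of these files.
df = dd.read_csv("hdfs:///path/to/files.*.csv")
print(df["value"].mean().compute())   # "value" is a hypothetical column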

GCS to S3 transfer - improve speed

We perform a weekly transfer from GCS to S3 using the gsutil command below: 5,000 compressed objects of ~82 MB each, with a combined size of ~380 GB. The data is exported for use by Redshift, if that's of any relevance.
The same kind of transfer from an on-prem Hadoop cluster to S3 took under 1 hour; with gsutil, it now takes 4-5 hours.
I'm aware that, under the hood, gsutil downloads the files from GCS and then uploads them to S3, which adds some overhead. So, hoping for faster speeds, I've tried executing gsutil on a Compute Engine instance in the same geographical region as the S3 and GCS buckets, but it was equally slow.
I've also played with the parallel_process_count and parallel_thread_count parameters, but it made no difference.
gsutil -m rsync -r -n GCS_DIR S3_DIR
My questions are:
Is there anything else I can do to speed it up?
What combinations of parallel_process_count and parallel_thread_count would you try?
Is there any way to find out which stage creates the bottleneck (if any)? I.e. is it upload or download stage?
Looking at the logs, does the output below mean that bandwidth sits at 0 B/s for some period of time?
Copying gcs://**s3.000000004972.gz
[Content-Type=application/octet-stream]...
[4.8k/5.0k files][367.4 GiB/381.6 GiB] 96% Done 0.0 B/s
Thanks in advance :)
The optimal values for parallel_process_count and parallel_thread_count depend on network speed, number of CPUs and available memory - it's recommended that you experiment a bit to find the optimal values.
You might try using gsutil perfdiag to get more information about the bucket on Google Cloud's side - it runs a suite of diagnostic tests for a given bucket.
The output you've shared indicates that no upload is happening for some period of time, perhaps due to the way gsutil chunks the uploads.
As a final recommendation for speeding up your transfers to Amazon, you might try using Apache Beam / Dataflow.
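If you want to pin down question 3 yourself, one option is to reproduce gsutil's download-then-upload pattern with the GCS and S3 Python clients and time each stage separately. A rough sketch, assuming google-cloud-storage and boto3 are installed and that the bucket names below are placeholders:

import time
from concurrent.futures import ThreadPoolExecutor
import boto3
from google.cloud import storage

GCS_BUCKET = "my-gcs-bucket"   # placeholder
S3_BUCKET = "my-s3-bucket"     # placeholder
gcs = storage.Client()
s3 = boto3.client("s3")

def copy_object(blob):
    t0 = time.time()
    data = blob.download_as_bytes()                             # download stage
    t1 = time.time()
    s3.put_object(Bucket=S3_BUCKET, Key=blob.name, Body=data)   # upload stage
    t2 = time.time()
    return blob.name, t1 - t0, t2 - t1

# Time both stages for a small sample of objects to see where the time goes.
sample = list(gcs.list_blobs(GCS_BUCKET))[:20]
with ThreadPoolExecutor(max_workers=8) as pool:
    for name, dl, ul in pool.map(copy_object, sample):
        print(f"{name}: download {dl:.1f}s, upload {ul:.1f}s")

If downloads dominate, the bottleneck is on the GCS side; if uploads dominate, look at the S3 side or the network path between the two.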

what is the efficient way of pulling data from s3 among boto3, athena and aws command line utils

Can someone please let me know what is the most efficient way of pulling data from S3? Basically, I want to pull out data for a given time range, apply some filters over the data (JSON), and store it in a DB. I am new to AWS and, after a little research, found that I can do it via the boto3 API, Athena queries, or the AWS CLI, but I need some advice on which one to go with.
If you are looking for the simplest and most straightforward solution, I would recommend the aws cli. It's perfect for running commands to download a file, list a bucket, etc. from the command line or a shell script.
If you are looking for a solution that is a little more robust and integrates with your application, then any of the various AWS SDKs will do fine. The SDKs are a little more feature rich IMO and much cleaner than running shell commands in your application.
If the application that is pulling the data is written in Python, then I definitely recommend boto3. Make sure to read up on the difference between a boto3 client and a boto3 resource.
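As a concrete starting point with boto3, here is a rough sketch that lists objects under a date-based prefix, parses each one as newline-delimited JSON, and applies a filter. The bucket name, prefix layout, and field names are assumptions for illustration, not anything from your setup:

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"              # placeholder
PREFIX = "events/2024-01-15/"     # assumes keys are organised by date

matched = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        for line in body.decode("utf-8").splitlines():
            record = json.loads(line)
            if record.get("status") == "error":   # hypothetical filter
                matched.append(record)

# matched can now be written to your DB of choice.
print(len(matched))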
Some options:
Download and process: Launch a temporary EC2 instance, have a script download the files of interest (eg one day's files?), use a Python program to process the data. This gives you full control over what is happening.
Amazon S3 Select: This is a simple way to extract data from CSV files, but it only operates on a single file at a time.
Amazon Athena: Provides an SQL interface to query across multiple files using Presto. Serverless, fast. Charged based on the amount of data read from disk (so it is cheaper on compressed data).
Amazon EMR: Hadoop service that provides very efficient processing of large quantities of data. Highly configurable, but quite complex for new users.
Based on your description (10 files, 300MB, 200k records) I would recommend starting with Amazon Athena since it provides a friendly SQL interface across many data files. Start by running queries across one file (this makes it faster for testing) and once you have the desired results, run it across all the data files.
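If you do go the Athena route, the query can also be driven from Python via boto3 rather than the console. A minimal sketch, assuming a table has already been defined over the JSON files and that the database, table, and output location below are placeholders:

import time
import boto3

athena = boto3.client("athena")
query = """
SELECT * FROM events   -- placeholder table
WHERE event_time BETWEEN timestamp '2024-01-15 00:00:00'
                     AND timestamp '2024-01-16 00:00:00'
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_database"},                         # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},  # placeholder
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(len(rows) - 1, "rows returned (first row is the header)")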

Saving ALS latent factors in Spark ML to S3 taking too long

I am using a Python script to compute user and item latent factors using Spark ML's ALS routine, as described here.
After computing latent factors, I am trying to save those to S3 using the following:
model = als.fit(ratings)
# save items latent factors
model.itemFactors.rdd.saveAsTextFile(s3path_items)
# save users latent factors
model.userFactors.rdd.saveAsTextFile(s3path_users)
There are around 150 million users. LFA is computed quickly (~15 min) but writing out the latent factors to S3 takes almost 5 hours. So clearly, something is not right. Could you please help identify the problem?
I am using 100 user blocks and 100 item blocks when computing the LFA with ALS, in case this info might be relevant.
Using 100 r3.8xlarge machines for the job.
Is this EMR, the official ASF Spark version, or something else?
One issue here is that the S3 clients have tended to buffer everything locally onto disk, then only start the upload afterwards.
If it's ASF code, you could make sure you are using Hadoop 2.7.x, use s3a:// as the output scheme, and play with the fast output stream options, which can do incremental writes as the data gets generated. It's a bit brittle in 2.7 and will be much better in 2.8.
If you are on EMR, you are on your own there.
Another possible cause is that S3 throttles clients generating lots of HTTPS requests to a particular shard of S3 - that is, a specific portion of a bucket's keyspace, with the first 5-8 characters of the key apparently determining the shard. If you can use highly varied names there, you may get throttled less.
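To make the s3a:// suggestion concrete: on ASF Spark the fast output stream can be switched on through Spark's Hadoop configuration, and the factors written straight to an s3a:// path. A sketch under the assumption of Hadoop 2.7.x option names, with the bucket and settings below as placeholders, continuing from the model = als.fit(ratings) snippet in the question:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("als-factor-export")
    # Hadoop 2.7.x names; later releases reorganise the s3a upload settings.
    .config("spark.hadoop.fs.s3a.fast.upload", "true")
    .config("spark.hadoop.fs.s3a.connection.maximum", "200")
    .getOrCreate()
)

# Writing the DataFrames directly (rather than .rdd.saveAsTextFile) keeps the
# schema and lets Spark parallelise the upload across all partitions.
model.itemFactors.write.mode("overwrite").parquet("s3a://my-bucket/item-factors/")
model.userFactors.write.mode("overwrite").parquet("s3a://my-bucket/user-factors/")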

What is the most efficient S3 GET request method?

I can download a file from S3 using either of the following methods.
s3cmd get s3://bucket_name/DB/company_data/abc.txt
wget http://bucket_name.s3.amazonaws.com/DB/company_data/abc.txt
My questions are:
1) Which one is faster?
2) Which one is cheaper?
According to some past research, an s3cmd GET operation is about 5 times slower than wget. Keep in mind that s3cmd is a utility designed to retrieve files from your S3 storage; rather than issuing a plain, unauthenticated HTTP GET, it makes signed requests against the S3 API (which itself runs over HTTP/HTTPS).
The only time I can see using the s3cmd utility is for cases where you're retrieving files you cannot otherwise retrieve with a standard HTTP GET, such as when the objects don't have public read permissions or you're doing maintenance on your S3 buckets.
Based on your question, I'm assuming you're trying to use this utility in a production system; however, it doesn't appear that was the intention or goal of the utility.
For more details, check out the performance testing spreadsheet.
As far as cost goes, I'm not an expert on Amazon pricing, but I believe they bill based on actual data transferred, so a 1 GB file would cost the same regardless of whether you downloaded it quickly or slowly. It's like asking which is heavier: ten pounds of bricks or ten pounds of feathers.
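To make the distinction concrete in code, here is a rough Python sketch of the two access paths - a signed request through the S3 API versus a plain HTTP GET of a publicly readable object - assuming boto3 and requests are installed and using the bucket and key from the question as placeholders:

import boto3
import requests

BUCKET = "bucket_name"            # placeholder from the question
KEY = "DB/company_data/abc.txt"

# Authenticated path: boto3 signs the request, so it works even when the
# object is not publicly readable (the s3cmd case).
boto3.client("s3").download_file(BUCKET, KEY, "abc_via_api.txt")

# Plain HTTP path: only works if the object grants public read (the wget case).
resp = requests.get(f"https://{BUCKET}.s3.amazonaws.com/{KEY}", timeout=60)
resp.raise_for_status()
with open("abc_via_http.txt", "wb") as f:
    f.write(resp.content)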