Fastest way to transfer a file from GCS to GCE - google-cloud-platform

I have a 1 TB file and I'm looking for the fastest way to transfer it from a GCS storage bucket to a GCE instance in the same region. I've tried gsutil and a few other console utilities, but I don't seem to get very fast transfers between the two (it seems similar to a curl command in that it uses the public internet, I believe). It is a large machine, with ~100 GB or more of memory.
What is the suggested way to transfer a file as fast as possible? It seems like https://cloud.google.com/network-tiers might be relevant, but I'm getting a bit lost in all the possible 'solutions' to this issue.
From this blog post, the fastest I was able to get was:
david@instance-2:~$ time gsutil -o 'GSUtil:parallel_thread_count=1' \
-o 'GSUtil:sliced_object_download_max_components=8' \
cp gs://gcp-files/Sales20M.csv .
Copying gs://gcp-files/Sales20M.csv...
/ [1 files][ 1.1 GiB/ 1.1 GiB]
Operation completed over 1 objects/1.1 GiB.
real 0m4.559s
user 0m10.787s
sys 0m5.527s
That seems pretty good to me -- about 5 s for a 1.1 GiB file, so roughly 2 Gb/s. Do you think this is the ceiling, or are there other ways to speed this up?

Network ingress from private addresses is not limited in any way, so beyond that you are probably capped by persistent disk throughput (since you are moving a large file). Based on what you wrote, the only thing that comes to mind is to check the size of your persistent disk. According to https://cloud.google.com/compute/docs/disks/performance#performance_factors, your persistent disk needs to be at least 4 TB to achieve maximum write throughput (400 MB/s) when using HDDs, or 1,667 GB to achieve 800 MB/s when using SSDs.
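
If you want the same sliced-download behaviour from code rather than from gsutil, the google-cloud-storage Python client ships a transfer_manager helper that downloads a single object in parallel chunks. A minimal sketch, assuming the bucket and object names from the example above and a recent library version that includes transfer_manager (the chunk size and worker count are illustrative assumptions):

# Hedged sketch: sliced/parallel download of one large GCS object.
# Bucket and object names come from the example above; chunk size and
# worker count are assumptions to tune against your disk throughput.
from google.cloud import storage
from google.cloud.storage import transfer_manager

client = storage.Client()
blob = client.bucket("gcp-files").blob("Sales20M.csv")

transfer_manager.download_chunks_concurrently(
    blob,
    "Sales20M.csv",
    chunk_size=256 * 1024 * 1024,  # 256 MiB slices
    max_workers=8,
)

On a large instance the download itself is rarely the limit; as noted above, writing the slices to persistent disk usually is.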

Related

How do you clear the persistent storage for a notebook instance on AWS SageMaker?

So I'm running into the following error on AWS SageMaker when trying to save:
Unexpected error while saving file: untitled.ipynb [Errno 28] No space left on device
If I remove my notebook, create a new identical one, and run it, everything works fine. However, I suspect the Jupyter checkpoint takes up too much space if I save the notebook while it's running, and that's why I'm running out of space. Sadly, getting more storage is not an option for me, so I'm wondering if there's any command I can use to clear the storage before running my notebook?
More specifically, clearing the persistent storage at the beginning and at the end of the training process.
I have googled like a maniac but there is no suggestion aside from "just increase the amount of storage bro" and that's why I'm asking the question here.
Thanks in advance!
If you don't want your data to persist across multiple notebook runs, just store it in /tmp, which is not persistent. You have at least 10 GB there. More details here.
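For example, from a notebook cell you could point any scratch output at /tmp (the directory and file names below are just placeholders):

# Write intermediate artifacts to /tmp so they do not count against the
# persistent /home/ec2-user/SageMaker volume. Names are placeholders.
import os

scratch_dir = "/tmp/scratch"
os.makedirs(scratch_dir, exist_ok=True)

with open(os.path.join(scratch_dir, "checkpoint.bin"), "wb") as f:
    f.write(b"...intermediate bytes...")

# Everything under /tmp is gone once the notebook instance is stopped.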
I had the exact same problem and was unable to find a decent answer to it online. However, I was fortunately able to resolve the issue.
I use an R kernel, so the solution might be slightly different.
You can check the storage by going into the terminal and typing df -kh.
You are likely mounted on /home/ec2-user/SageMaker and can see its "Size", "Used", "Avail", and "Use%".
There are hidden folders that function as a recycle bin. Running the R command list.dirs() revealed a folder named ./.Trash-1000/ which held a lot of random things that had supposedly been removed from storage.
I just deleted the folder with unlink('./.Trash-1000/', recursive = T) and the entire storage was freed.
Hope it helps.
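
If you are on a Python kernel rather than R, roughly the same check-and-clean steps can be sketched like this (the mount point and trash folder name are taken from the answer above and may differ on your instance):

# Check free space on the persistent SageMaker volume and remove the hidden
# trash folder if it exists. Paths follow the R answer above.
import shutil
from pathlib import Path

sagemaker_root = Path("/home/ec2-user/SageMaker")
usage = shutil.disk_usage(sagemaker_root)
print(f"used {usage.used / 1e9:.1f} GB of {usage.total / 1e9:.1f} GB")

trash = sagemaker_root / ".Trash-1000"
if trash.exists():
    shutil.rmtree(trash)  # Python equivalent of unlink(..., recursive = T)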

AWS Glue ETL: reading a huge JSON file to process, but got an OutOfMemoryError

I am working on the AWS Glue ETL part for reading a huge JSON file (only testing 1 file, around 9 GB) in an ETL process, but I got an error from AWS Glue of java.lang.OutOfMemoryError: Java heap space after it had been running and processing for a while.
My code and flow are as simple as:
# paths below are placeholders for the real S3 locations
df = spark.read.option("multiline", "true").json("s3://bucket/raw_path")
# ...
# and write it out as source_df to another object in S3
df.write.json("s3://bucket/source_path", lineSep=",\n")
From the error/log it seems it failed and the container was terminated while reading this huge file. I have already tried upgrading the worker type to G.1X with a number of worker nodes; however, I would like to find another solution that is not just vertical scaling by adding resources.
I am quite new to this area and this service, so I want to keep cost and time as low as possible :-)
Thank you all in advance.
After looking into Glue and Spark, I found that to get the benefit of parallel processing across multiple executors, in my case I had to split the (large) file into multiple smaller files, and it worked! The files are distributed across multiple executors.
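A rough sketch of that approach, assuming the large multiline JSON has already been split into part files under a prefix (bucket names, prefixes, and the partition count below are placeholders):

# Read the pre-split part files so each one becomes its own task, then
# repartition before writing so the output also consists of many small files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("glue-json-split").getOrCreate()

df = spark.read.option("multiline", "true").json("s3://bucket/raw_split/")

df.repartition(64).write.mode("overwrite").json("s3://bucket/source_path/")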

One-line edits to files in AWS S3

I have many very large files (> 6 GB) stored in an AWS S3 bucket that need very minor edits done to them.
I can edit these files by pulling them to a server, using sed or perl to edit the key word, and then pushing them back, but this is very time-consuming, especially for a one-word edit to a 6 or 7 GB text file.
I use a program that makes AWS S3 behave like a random-access file system, https://github.com/s3fs-fuse/s3fs-fuse, but it is unusably slow, so it isn't an option.
How can I edit these files, or use sed, via a script without the expensive and slow step of pulling from and pushing back to S3?
You can't.
The library you use certainly does it right: download the existing file, do the edit locally, then push back the results. It's always going to be slow.
With sed, it may be possible to make it faster, assuming your existing library does it in three separate steps. But you can't send the result straight back and overwrite the file before you're done reading it (at least I would suggest not doing so).
If this is a one time process, then the slowness should not be an issue. If that's something you are likely to perform all the time, then I'd suggest you use a different type of storage. This one may not be appropriate for your app.
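For completeness, here is a minimal sketch of that download-edit-upload cycle in Python with boto3 (the bucket, key, and the words being replaced are placeholders):

# Stream the object to local disk, apply the one-word edit line by line,
# then push the result back. S3 objects are immutable, so the upload
# replaces the key outright. Names below are placeholders.
import re
import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "big-file.txt"

s3.download_file(bucket, key, "/tmp/big-file.txt")

with open("/tmp/big-file.txt") as src, open("/tmp/big-file.edited.txt", "w") as dst:
    for line in src:
        dst.write(re.sub(r"old-word", "new-word", line))

s3.upload_file("/tmp/big-file.edited.txt", bucket, key)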

HDFS space release - optimal solution

I would like to free up some space in HDFS, so I need to identify unwanted/unused HDFS blocks/files and delete or archive them. What would be considered an optimal solution right now? I am using the Cloudera distribution. (My cluster's HDFS capacity is 900 TB, with 700 TB used.)
If you are running a licensed version of Cloudera, you can use Cloudera Navigator to see which files have not been used for a period of time and you can assign a policy to delete them.
If not, you are likely looking at writing scripts to identify the files that haven't been used and you'll have to manually delete them.
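A hedged sketch of that scripting route: list everything under a prefix with hdfs dfs -ls -R, keep the paths whose modification time is older than a cutoff, and print them for review before deleting anything (the path and cutoff are placeholders, and this uses modification time, since plain ls output does not expose access time):

# List files under /data recursively and print those not modified in ~6 months.
# Review the list before running any hdfs dfs -rm on it.
import subprocess
from datetime import datetime, timedelta

CUTOFF = datetime.now() - timedelta(days=180)

out = subprocess.run(
    ["hdfs", "dfs", "-ls", "-R", "/data"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    parts = line.split()
    # File lines look like: -rw-r--r-- 3 user group 1234 2020-01-15 10:42 /data/foo.csv
    if len(parts) < 8 or parts[0].startswith("d"):
        continue
    mtime = datetime.strptime(f"{parts[5]} {parts[6]}", "%Y-%m-%d %H:%M")
    if mtime < CUTOFF:
        print(parts[7])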

How to create a program that works similarly to RAID1 (mirroring)?

I want to create a simple program that works very similarly to RAID1. It should work like this:
First I want to give it the primary HDD's drive letter and then the secondary one's. I will only write to the primary HDD! If any new data is copied to the primary HDD, it should automatically be copied to the secondary one.
I need some help with where to start. How do I monitor the data written to the primary HDD? Obviously there are many ways to do what I want (I think), but I need the simplest way.
If this isn't too complicated, then how can I handle the case where the primary HDD has two or more partitions? In that case I would need to check the secondary HDD's partitions too, and create/resize them if necessary.
Thanks in advance!
kampi
The concept of mirroring disk writes to another disk in real time is the basis for high availability, and implementing these schemes is not trivial.
The company I work for makes DoubleTake, which does real-time mirroring & replication of file-based IO to local or remote volumes. This is a little different from what you are describing, which appears to be block-based disk/volume replication, but many of the concepts are similar.
For file-based replication there are quite a few nasty scenarios; I'll describe a couple:
Synchronizing the contents of one volume to another volume, keeping in mind that changes can occur while you are doing this. I suppose you could simplify this by requiring that volumes start out totally formatted, but for people who already have data that will not be a good solution!
Keeping up with disk changes: what if the volume you are mirroring to is slower than the source volume? Where do you buffer? To disk? Memory?
Anyway, we use a kernel-mode file system filter driver to capture the disk IO, and then our user-mode service grabs this IO and forwards it to a local or remote disk.
If you want to learn about file system filtering, one of the best books (it's old but good) is Windows NT File System Internals by Rajeev Nagar. It's a must-read for doing any serious work with file system filters.
Also take a look at the file system filter samples in the Windows 7 WDK; it's free, and they include good file-monitoring examples that will get you seeing disk changes pretty quickly.
Good Luck!
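
As a gentler starting point than a filter driver, here is a hedged, user-space sketch of the "watch the primary, copy to the secondary" idea in Python, using the third-party watchdog package (pip install watchdog). This is file-level copying, not block-level RAID1, and the drive letters/paths are placeholders:

# Mirror newly created or modified files from PRIMARY to SECONDARY.
# This is a simplified illustration, not real-time block replication.
import shutil
import time
from pathlib import Path

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

PRIMARY = Path("C:/data")      # assumed primary location
SECONDARY = Path("D:/mirror")  # assumed secondary location


class MirrorHandler(FileSystemEventHandler):
    def on_created(self, event):
        self._copy(event)

    def on_modified(self, event):
        self._copy(event)

    def _copy(self, event):
        if event.is_directory:
            return
        src = Path(event.src_path)
        dst = SECONDARY / src.relative_to(PRIMARY)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy contents and metadata to the mirror


if __name__ == "__main__":
    observer = Observer()
    observer.schedule(MirrorHandler(), str(PRIMARY), recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()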