HDFS space release - optimal solution

I would like to free up some space in HDFS, so I need to find unwanted/unused HDFS files and either delete or archive them. What would be considered an optimal solution for this right now? I am using the Cloudera distribution. (My cluster's HDFS capacity is 900 TB, of which 700 TB is used.)

If you are running a licensed version of Cloudera, you can use Cloudera Navigator to see which files have not been used for a period of time and you can assign a policy to delete them.
If not, you are likely looking at writing scripts to identify the files that haven't been used and you'll have to manually delete them.
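If you end up scripting it, a rough sketch of the kind of script that could work is below (Python, shelling out to the hdfs CLI). The scan path and age threshold are placeholders, and note that hdfs dfs -ls only reports modification time; for true last-access information you would need the fsimage (hdfs oiv) or the HDFS audit logs instead.

#!/usr/bin/env python3
# Sketch: list HDFS files not modified for N days, largest first.
# SCAN_PATH and AGE_DAYS are placeholders; review the output before deleting anything.
import subprocess
from datetime import datetime, timedelta

SCAN_PATH = "/data"     # placeholder: subtree to scan
AGE_DAYS = 180          # placeholder: "unused" threshold

cutoff = datetime.now() - timedelta(days=AGE_DAYS)
listing = subprocess.run(["hdfs", "dfs", "-ls", "-R", SCAN_PATH],
                         capture_output=True, text=True, check=True).stdout

candidates = []
for line in listing.splitlines():
    parts = line.split(None, 7)
    if len(parts) != 8 or parts[0].startswith("d"):
        continue  # skip directories and non-listing lines
    size, date, time_, path = parts[4], parts[5], parts[6], parts[7]
    mtime = datetime.strptime(f"{date} {time_}", "%Y-%m-%d %H:%M")
    if mtime < cutoff:
        candidates.append((int(size), mtime, path))

for size, mtime, path in sorted(candidates, reverse=True):
    print(f"{size:>15}  {mtime:%Y-%m-%d}  {path}")
# Then delete or archive manually, e.g. hdfs dfs -rm -skipTrash <path> or hadoop archive.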

Related

How do you clear the persistent storage for a notebook instance on AWS SageMaker?

So I'm running into the following error on AWS SageMaker when trying to save:
Unexpected error while saving file: untitled.ipynb [Errno 28] No space left on device
If I remove my notebook, create a new identical one and run it, everything works fine. However, I suspect that the Jupyter checkpoint takes up too much space if I save the notebook while it's running, and that's why I'm running out of space. Sadly, getting more storage is not an option for me, so I'm wondering if there's any command I can use to clear the storage before running my notebook?
More specifically, clearing the persistent storage in the beginning and at the end of the training process.
I have googled like a maniac but there is no suggestion aside from "just increase the amount of storage bro" and that's why I'm asking the question here.
Thanks in advance!
If you don't want your data to be persistent across multiple notebook runs, just store them in /tmp which is not persistent. You have at least 10GB. More details here.
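For example, from a notebook cell (a minimal Python sketch; the directory and file names are just placeholders):

import os, shutil

scratch = "/tmp/notebook-scratch"   # placeholder; /tmp is not persisted across stop/start
os.makedirs(scratch, exist_ok=True)

# write intermediate artifacts (checkpoints, downloads, temp files) under scratch ...
checkpoint_path = os.path.join(scratch, "model-checkpoint.bin")   # placeholder name

# ... and optionally clean it up explicitly at the end of the run
shutil.rmtree(scratch, ignore_errors=True)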
I had the exact same problem and was not able to find a decent answer to it online. Fortunately, I was able to resolve the issue.
I use an R kernel, so the solution might be slightly different.
You can check the storage by going into the terminal and typing df -kh.
The volume is most likely mounted at /home/ec2-user/SageMaker, and you can see its "Size", "Used", "Avail" and "Use%" there.
There are hidden folders that function as a recycle bin. Running the R command list.dirs() revealed a folder named ./.Trash-1000/, which held a lot of things that had supposedly been removed from storage.
I just deleted that folder with unlink('./.Trash-1000/', recursive = T) and the space was freed.
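If you are on a Python kernel rather than R, a roughly equivalent sketch (the trash path is the one reported above; this permanently deletes its contents, so check df -kh before and after):

import os, shutil

trash = "/home/ec2-user/SageMaker/.Trash-1000"   # hidden trash folder found above
if os.path.isdir(trash):
    shutil.rmtree(trash, ignore_errors=True)     # frees the space for good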
Hope it helps.

Fastest way to transfer a file from GCS to GCE

I have a 1 TB file and I'm looking for the fastest way to transfer it from a GCS storage bucket to a GCE instance in the same region. I've tried gsutil and a few other console utilities, but I don't seem to get very fast transfers between the two (it seems similar to a curl command in that, I believe, it goes over the public internet). It is a large machine, with ~100 GB or more of memory.
What is the suggested way to transfer a file as fast as possible? It seems like https://cloud.google.com/network-tiers might be relevant, but I'm getting a little lost in all the possible 'solutions' to this issue.
From this blog post, the fastest I was able to get was:
david@instance-2:~$ time gsutil -o 'GSUtil:parallel_thread_count=1' \
-o 'GSUtil:sliced_object_download_max_components=8' \
cp gs://gcp-files/Sales20M.csv .
Copying gs://gcp-files/Sales20M.csv...
/ [1 files][ 1.1 GiB/ 1.1 GiB]
Operation completed over 1 objects/1.1 GiB.
real 0m4.559s
user 0m10.787s
sys 0m5.527s
That seems pretty good to me: about 5 s for a 1.1 GiB file, so roughly 2 Gbit/s. Do you think this is the ceiling, or are there any other ways to speed this up?
Network ingress from private addresses is not limited in any way, so beyond that you are probably capped by persistent disk throughput (since you are moving a large file). Based on what you wrote, the only thing that comes to mind is to check the size of your persistent disk. According to https://cloud.google.com/compute/docs/disks/performance#performance_factors, your persistent disk needs to be at least 4 TB to reach the maximum write throughput (400 MB/s) when using HDDs, or 1,667 GB to reach 800 MB/s when using SSDs.
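If you want to script that check, something like the following (Python shelling out to gcloud; the disk name and zone are placeholders) prints the size and type that those throughput limits depend on:

import subprocess

DISK = "instance-2"      # placeholder: the boot disk usually shares the instance name
ZONE = "us-central1-a"   # placeholder

info = subprocess.run(
    ["gcloud", "compute", "disks", "describe", DISK,
     "--zone", ZONE, "--format", "value(sizeGb,type)"],
    capture_output=True, text=True, check=True).stdout
print(info.strip())      # prints sizeGb and the disk type URL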

Snapshot recreation not working with Mini Filter

I am developing a file system mini-filter driver that is being used to track SQL Server database files (namely mdf and ndf files). The goal is to track all the write operations that take place in an mdf file, record the offset and length of each write (I call this pair an extent), extract the corresponding blocks of data from the latest snapshot using those offsets and lengths, and finally try to recreate the latest snapshot by applying/merging all the extents onto the older one.
Initially I only registered IRP_MJ_WRITE in the callbacks array, to detect just the writes to the mdf file I want to track, but every time I apply the changed blocks to the older snapshot to build the newer one, the snapshots don't match. The newer snapshot (say SN2) is 648 MB in size, while the file obtained by applying the extents to the older snapshot (say SN1) comes out at 631 MB. Also, the extents I get from the mini-filter differ between runs, yet they somehow always result in the same 631 MB mdf file after I merge them with the older snapshot. What can be the reason for that? I would love to know.
As an experiment, I also added the other IRP operations that are present by default in Microsoft's code, but that did not help either; the reconstructed file is still 631 MB.
I believe the problem is something else, and I am not able to figure it out. Also, in Microsoft's code I found that they use the flag RECORD_TYPE_FLAG_EXCEED_MEMORY_ALLOWANCE in the mspyLog.c file. Could this be the reason for some buffer overflow happening while retrieving the logs?
The base code is derived from Microsoft's official repository - https://github.com/microsoft/Windows-driver-samples/tree/master/filesys/miniFilter/minispy
I don't have any experience with filter drivers and would appreciate any help. Thanks.

Minio/S3 scenarios where files have to be moved in batch

I searched but haven't found a satisfying solution.
Minio/S3 does not have directories, only keys (with prefixes). So far so good.
Now I need to change those prefixes, not for a single file but for a whole bunch of files, which can be really large (actually there is no limit).
Unfortunately, these storage servers do not seem to have a concept of (and do not support):
rename file
move file
What has to be done instead is, for each file:
copy the file to the new target location
delete the file from the old source location
My given design looks like:
users upload files to bucketname/uploads/filename.ext
a background process takes the uploaded files, generates some more files and uploads them to bucketname/temp/filename.ext
when all processings are done the uploaded file and the processed files are moved to bucketname/processed/jobid/new-filenames...
The path prefix is used when handling the object-created notification to differentiate whether it is an upload (start processing), temp (check whether all files are uploaded), or processed/jobid (hold the files until the user deletes them).
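Purely to illustrate that dispatch (the handler names are made up):

# Illustration only: route an object-created notification by key prefix.
def handle_object_created(bucket: str, key: str) -> None:
    if key.startswith("uploads/"):
        start_processing(bucket, key)        # hypothetical helper
    elif key.startswith("temp/"):
        check_if_job_complete(bucket, key)   # hypothetical helper
    elif key.startswith("processed/"):
        pass                                 # kept until the user deletes it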
Imagine a task where 1000 files have to be moved to a new location (within the same bucket): copying and deleting them one by one leaves a lot of room for errors, e.g. running out of storage space during the copy operation, or connection errors with no chance of a rollback. It doesn't get easier if the locations are in different buckets.
So, given this existing design and no way to rename/move a file:
Is there any chance to copy the files without creating new physical copies (without duplicating the used storage space)?
Could any experienced cloud developer please give me a hint on how to do this bulk copy with rollback in error cases?
Has anyone implemented something like that with a working rollback mechanism if, e.g., file 517 of 1000 fails? Copying them back and deleting again doesn't seem like the way to go.
Currently I am using the MinIO server and the MinIO dotnet library, but since they are compatible with Amazon S3, this scenario could also arise on Amazon S3.
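To make the intent concrete, here is a rough sketch of the copy-everything-first, delete-sources-last approach I have in mind, using the MinIO Python SDK (bucket names, prefixes and credentials are made up; the dotnet SDK has equivalent calls). "Rollback" here just means deleting whatever copies were already created. Note that a single server-side copy is limited to 5 GB per object on S3, so really large objects would need a multipart/compose copy instead.

# Sketch only, not production code ("pip install minio").
from minio import Minio
from minio.commonconfig import CopySource

client = Minio("localhost:9000", access_key="...", secret_key="...", secure=False)

BUCKET = "bucketname"
SRC_PREFIX = "temp/"                 # placeholder source prefix
DST_PREFIX = "processed/job-42/"     # placeholder target prefix

objects = [o.object_name for o in
           client.list_objects(BUCKET, prefix=SRC_PREFIX, recursive=True)]

copied = []
try:
    # Phase 1: copy everything server-side. No data travels through the client,
    # but the objects do occupy space twice until the sources are deleted.
    for name in objects:
        new_name = DST_PREFIX + name[len(SRC_PREFIX):]
        client.copy_object(BUCKET, new_name, CopySource(BUCKET, name))
        copied.append(new_name)
except Exception:
    # "Rollback": remove the copies made so far; the sources are still untouched.
    for name in copied:
        client.remove_object(BUCKET, name)
    raise
else:
    # Phase 2: only after every copy succeeded, delete the sources.
    for name in objects:
        client.remove_object(BUCKET, name)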

Restoration from snapshots in virtualbox

I am using VirtualBox and maintaining regular backups by taking snapshots and storing them on an external hard disk. Now the system on which my VirtualBox was installed has crashed. How can I recover my last work from the snapshots stored on the external hard disk?
Snapshots are essentially "diff files", meaning that each one documents the changes between sessions (or within a session).
You can't apply a diff to a non-existent base.
Example:
Let's look at the following set of commands:
Pick a number
Add 3
Substract 4
Multiply by 2
Now the outcome would change according to the first number you picked, so if the base is unknown, the set of "diffs" doesn't really help.
Try to get hold of the vmdk/vhd/vdi file again; that might do the trick more easily.
Kind regards,
Yaron Shahrabani.
The way I did it: copy the snapshot to the default Snapshots folder of that VM. For example, for my Windows 2000 VM the "Snapshots" folder is /home/mian/VirtualBox VMs/Windows_Professional_2000_sp4_x86/Snapshots
Once copied run this command
vboxmanage snapshot Windows_professional_2000_sp4_x86 restore Snapshot5
If the name contains a space, e.g. Snapshot 5, then run this command
vboxmanage snapshot Windows_professional_2000_sp4_x86 restore Snapshot\ 5
This is for Linux, but it is almost the same on Windows, e.g. changing vboxmanage to vboxmanage.exe.