Download checkpoint from AWS

How can I download the checkpoints and logged statistics after I run my deep learning algorithm on AWS using SageMaker?

When you created the training job (notebook or GUI, it doesn't matter) you defined an output directory, which is usually an S3 bucket. At the end of the training job, SageMaker automatically uploads everything contained in /opt/ml/model as a compressed archive to that output directory (if you use a predefined container this happens automatically; otherwise it's up to you to write your artifacts there). So simply download the archive from S3.
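As a rough illustration, here is a minimal boto3 sketch for pulling that archive down and unpacking it; the bucket and key are hypothetical placeholders for the output path you configured:

import tarfile

import boto3

# Hypothetical values -- use the output location you set when creating the training job.
# SageMaker writes a model.tar.gz under <output-path>/<job-name>/output/.
bucket = "my-sagemaker-bucket"
key = "my-training-job/output/model.tar.gz"

s3 = boto3.client("s3")
s3.download_file(bucket, key, "model.tar.gz")

# Unpack the checkpoints / artifacts locally.
with tarfile.open("model.tar.gz") as tar:
    tar.extractall("model")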

Related

Is there a way to handle files within google cloud storage?

There are many compressed files inside Google Cloud Storage.
I want to unzip the zipped files, rename them and save them back to another bucket.
I've seen a lot of posts, but I couldn't find a way other than downloading them with gsutil and handling them locally.
Is there any other way?
To modify a file, such as unzipping it, you must read, modify and then write it back. This means download, unzip and upload the extracted files.
Use gsutil or another tool, one of the SDKs, or the REST APIs. To unzip a file, use a zip tool or one of the libraries that support zip operations.
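For what it's worth, a minimal Python sketch of that download/unzip/upload flow with the google-cloud-storage client (bucket names and the destination prefix are hypothetical):

import io
import zipfile

from google.cloud import storage

client = storage.Client()
src_bucket = client.bucket("source-bucket")       # hypothetical source bucket
dst_bucket = client.bucket("destination-bucket")  # hypothetical target bucket

# Download the zipped object into memory.
data = src_bucket.blob("archive.zip").download_as_bytes()

# Unzip and upload each member, renamed with a prefix, to the other bucket.
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    for name in zf.namelist():
        dst_bucket.blob(f"unzipped/{name}").upload_from_string(zf.read(name))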
You can use Cloud Functions for this. The function gets triggered when an object is created/finalized and does the job automatically.
You can also use the same function to iterate over all the existing zip files, do the tasks and move the results to another bucket. One thing to note: it may not be able to move all the files in a single run, so for the files that already exist in the bucket you can run the program from a local machine instead.
Here is the list of libraries to connect to Cloud Storage
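As a rough sketch of that Cloud Functions approach, assuming a 1st-gen background function wired to the bucket's google.storage.object.finalize event (the destination bucket name is hypothetical):

import io
import zipfile

from google.cloud import storage

client = storage.Client()

def unzip_on_finalize(event, context):
    # Triggered by google.storage.object.finalize; event carries the bucket and object names.
    bucket_name = event["bucket"]
    object_name = event["name"]
    if not object_name.endswith(".zip"):
        return  # ignore non-zip uploads

    data = client.bucket(bucket_name).blob(object_name).download_as_bytes()
    dst = client.bucket("destination-bucket")  # hypothetical target bucket
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            dst.blob(name).upload_from_string(zf.read(name))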

Google Cloud Bucket mounted on Compute Engine Instance using gcsfuse does not create files

I have been able to mount Google Cloud Bucket using
gcsfuse --implicit-dirs production-xxx-appspot /mount
or equally
sudo mount -t gcsfuse -o implicit_dirs,allow_other,uid=1000,gid=1000,key_file=service-account.json production-xxx-appspot /mount
Mounting works fine.
When I execute the following commands after mounting, they work fine:
mkdir /mount/files/
cp -rf /home/files/* /mount/files/
However, when I use:
mcedit /mount/files/a.txt
or
vi /mount/files/a.txt
The output says that there is no file available, which makes sense.
Is there any other way to cover this situation, so that applications can directly create files on the mounted Google Cloud bucket rather than creating files locally and copying them afterwards?
If you do not want to create files locally and upload them later, you should consider using a file storage system like Google Drive.
Google Cloud Storage is an object storage system, which means objects cannot be modified; you have to write each object completely at once. Object storage also does not work well with traditional databases, because writing objects is a slow process and writing an app to use an object storage API is not as simple as using file storage.
In a file storage system, data is stored as a single piece of information inside a folder, just like you would organize pieces of paper inside a manila folder. When you need to access that piece of data, your computer needs to know the path to find it. (Beware: it can be a long, winding path.)
If you want to use Google Cloud Storage, you need to create your file locally and then push it to your bucket.
Here is an example of how to configure Google Cloud Storage with Node.js: File Upload example
Here is a tutorial on How to mount Object Storage on Cloud Server using s3fs-fuse
If you want to know more about storage formats please follow this link
More information about reading and writing to Cloud Storage in this link
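For completeness, here is a minimal sketch of the "create locally, then push" pattern with the Python client (the File Upload example linked above uses Node.js); the object path is only illustrative:

from google.cloud import storage

# Create the file locally first...
with open("a.txt", "w") as f:
    f.write("hello from the VM\n")

# ...then write the whole object to the bucket in one go.
client = storage.Client()
bucket = client.bucket("production-xxx-appspot")  # bucket name from the question
bucket.blob("files/a.txt").upload_from_filename("a.txt")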

Update Hard Drive backup on AWS S3

I would like to run an aws s3 sync command daily to update my hard drive backup on S3. Most of the time there will be no changes. The problem is that the s3 sync command takes days to check for changes (for a 4 TB HDD). What is the quickest way to update a hard drive backup on S3?
If you want to back up your own computer to Amazon S3, I would recommend using a backup utility that knows how to use S3. These utilities can do smart things like compress data, track files that have changed and set an appropriate Storage Class.
For example, I use Cloudberry Backup on a Windows computer. It does regular checking for new/changed files and uploads them to S3. If I delete a file locally, it waits 90 days before deleting it from S3. It can also handle multiple versions of files, rather than always overwriting files.
I would recommend only backing up data folders (eg My Documents). There is no benefit to backing up your Operating System or temporary files because you would not restore the OS from a remote backup.
While some backup utilities can compress files individually or in groups, experience has taught me never to do so, since it can make restoration difficult if you do not have the original backup software (and remember -- backups last years!). The great thing about S3 is that it is easy to access from many devices -- I have often grabbed documents from my S3 backup via my phone when I'm away from home.
Bottom line: Use a backup utility that knows how to do backups well. Make sure it knows how to use S3.
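If you would rather script the "only upload what changed" idea yourself instead of using a utility, here is a rough boto3 sketch; the bucket name and local path are hypothetical, and a real backup tool handles deletions, versions and very large files far more robustly:

import os

import boto3

BUCKET = "my-backup-bucket"        # hypothetical bucket
LOCAL_ROOT = "/backups/Documents"  # hypothetical folder to back up

s3 = boto3.client("s3")

# Build an index of what is already in S3 (key -> size).
remote = {}
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        remote[obj["Key"]] = obj["Size"]

# Upload only files that are missing or whose size differs.
for dirpath, _, filenames in os.walk(LOCAL_ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        key = os.path.relpath(path, LOCAL_ROOT).replace(os.sep, "/")
        if remote.get(key) != os.path.getsize(path):
            s3.upload_file(path, BUCKET, key)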
I would recommend using a backup tool that can synchronize with Amazon S3. For example, for Windows you can use Uranium Backup. It syncs with several clouds, including Amazon S3.
It can be scheduled to perform daily backups and also incremental backups (in case there are changes).
I think this is the best way, considering the tediousness of daily manual syncing. Plus, it runs in the background and notifies you of any error or success logs.
This is the solution I use; I hope it can help you.

Automate uploading files stored locally to Cloud Storage using gsutil

I'm new to GCP and I'm trying to build an ETL stream that will upload data from files to BigQuery. It seems to me that the best solution would be to use gsutil. The steps I see today are:
(done) Downloading the .zip file from the SFTP server to the virtual machine
(done) Unpacking the file
Uploading files from VM to Cloud Storage
(done) Automatically upload files from Cloud Storage to BigQuery
Steps 1 and 2 would be performed on a schedule, but I would like step 3 to be event-driven: when files are copied to a specific folder, gsutil should send them to the specified bucket in Cloud Storage. Any ideas how this can be done?
Assuming you're running on a Linux VM, you might want to check out inotifywait, as mentioned in this question. You can run it as a background process to try it out, e.g. bash /path/to/my/inotify/script.sh &, and then set it up as a daemon once you've tested it and got something working to your liking.
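If you would rather stay in Python than shell out to inotifywait, here is a rough alternative sketch using the third-party watchdog package together with the google-cloud-storage client (not something from the original answer; the watched folder and bucket name are hypothetical):

import os
import time

from google.cloud import storage
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

WATCH_DIR = "/data/unzipped"      # hypothetical folder where step 2 drops the files
BUCKET = "my-etl-landing-bucket"  # hypothetical Cloud Storage bucket

bucket = storage.Client().bucket(BUCKET)

class UploadOnCreate(FileSystemEventHandler):
    def on_created(self, event):
        # Upload every newly created file, keeping only the file name as the object name.
        if event.is_directory:
            return
        blob_name = os.path.basename(event.src_path)
        bucket.blob(blob_name).upload_from_filename(event.src_path)

observer = Observer()
observer.schedule(UploadOnCreate(), WATCH_DIR, recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()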

Apache Spark/AWS EMR and tracking of processed files

I have an AWS S3 folder where a large number of JSON files are stored. I need to ETL these files with AWS EMR over Spark and store the transformed data in AWS RDS.
I have implemented the Spark job for this purpose in Scala and everything is working fine. I plan to execute this job once a week.
From time to time external logic can add new files to the AWS S3 folder, so the next time my Spark job starts I'd like to process only the new (unprocessed) JSON files.
Right now I don't know where to store the information about the processed JSON files so the Spark job can decide which files/folders to process. Could you please advise me what the best practice is (and how) to track these changes with Spark/AWS?
If it is a Spark Streaming job, checkpointing is what you are looking for; it is discussed here.
Checkpointing stores the state information (i.e. offsets etc.) in an HDFS/S3 bucket, so when the job is started again, Spark picks up only the unprocessed files. Checkpointing also offers better fault tolerance in case of failures, since the state is handled automatically by Spark itself.
Again, checkpointing only works in the streaming mode of a Spark job.
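The answer above applies to streaming; one hedged option, if you are open to reworking the weekly batch job as Structured Streaming with a file source, is to let the checkpoint record which S3 files have already been read. A rough PySpark sketch (the original job is in Scala; the schema, paths and JDBC settings are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("json-etl").getOrCreate()

# A streaming file source needs an explicit schema; this one is only illustrative.
schema = StructType([
    StructField("id", StringType()),
    StructField("payload", StringType()),
])

# Spark records the files it has already read in the checkpoint,
# so each run only picks up new objects under the prefix.
stream = spark.readStream.schema(schema).json("s3a://my-bucket/json-input/")

def write_batch(batch_df, batch_id):
    # Transform and push each micro-batch to RDS over JDBC (connection details are placeholders).
    batch_df.write.jdbc(
        url="jdbc:postgresql://my-rds-host:5432/mydb",
        table="events",
        mode="append",
        properties={"user": "etl", "password": "secret"},
    )

query = (
    stream.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/json-etl/")
    .trigger(once=True)  # process whatever is new, then stop -- fits a weekly schedule
    .start()
)
query.awaitTermination()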