AWS SageMaker Very large Dataset - amazon-web-services

I have a 500 GB CSV file and a MySQL database holding 1.5 TB of data, and I want to run SageMaker classification and regression algorithms, as well as random forest, on it.
Can SageMaker handle this? Can the data be read and the model trained in batches or chunks? Is there an example?

Amazon SageMaker is designed for such scales and it is possible to use it to train on very large datasets. To take advantage of the scalability of the service you should consider a few modifications to your current practices, mainly around distributed training.
Distributed training makes training much faster (“100 hours of a single instance cost exactly the same as 1 hour of 100 instances, just 100 times faster”), more scalable (“if you have 10 times more data, you just add 10 times more instances and everything just works”), and more reliable, as each instance handles only a small part of the dataset or the model and doesn’t run out of disk or memory space.
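As a rough illustration of the quoted cost claim, the arithmetic can be sketched as below. The hourly price is a made-up placeholder, not a real SageMaker price, and the sketch assumes perfectly linear scaling, which real jobs only approximate because of startup and communication overhead.

```python
# Hypothetical hourly price per instance (placeholder, not an AWS price).
PRICE_PER_INSTANCE_HOUR = 1.0


def job_cost(instances, hours, price=PRICE_PER_INSTANCE_HOUR):
    """Billed cost is instance-hours multiplied by the hourly price."""
    return instances * hours * price


def wall_clock_hours(single_instance_hours, instances):
    """Idealised assumption: the work divides evenly across instances."""
    return single_instance_hours / instances


# Same total work, run on 1 instance versus 100 instances.
single = job_cost(instances=1, hours=wall_clock_hours(100, 1))
fleet = job_cost(instances=100, hours=wall_clock_hours(100, 100))
```

Under these assumptions both runs are billed for the same number of instance-hours; only the wall-clock time differs.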
It is not obvious how to implement an ML algorithm in a distributed way that is still efficient and accurate. Amazon SageMaker has modern implementations of classic ML algorithms such as Linear Learner, K-means, PCA, XGBoost etc. that support distributed training and can scale to such dataset sizes. In some benchmarks these implementations were 10 times faster than other distributed training implementations such as Spark MLlib. You can see some examples in this notebook: https://github.com/awslabs/amazon-sagemaker-workshop/blob/master/notebooks/video-game-sales-xgboost.ipynb
The other aspect of scale is the data file(s). The data shouldn’t be in a single file, as that limits the ability to distribute it across the cluster you are using for distributed training. With SageMaker you can decide how the data files in Amazon S3 are used. They can be fully replicated, where all the data is copied to every worker, or sharded by S3 key, which distributes the data across the workers and can speed up training even further. You can see some examples in this notebook: https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/data_distribution_types
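A minimal sketch of the two points above: splitting one large CSV into many part files, and a training channel that asks for sharding by S3 key. The helper split_csv, the bucket name and the part-file naming are hypothetical, and the CSV is assumed to have no header row. The channel dict follows the shape of the CreateTrainingJob InputDataConfig as I understand it; it is only built here, not sent to AWS.

```python
import csv
import os
import tempfile


def split_csv(source_path, out_dir, rows_per_part):
    """Split one CSV (assumed headerless) into numbered part files."""
    paths = []
    out = None
    writer = None
    with open(source_path, newline="") as src:
        for i, row in enumerate(csv.reader(src)):
            if i % rows_per_part == 0:
                # Start a new part file every rows_per_part rows.
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"part-{len(paths):05d}.csv")
                out = open(path, "w", newline="")
                writer = csv.writer(out)
                paths.append(path)
            writer.writerow(row)
    if out is not None:
        out.close()
    return paths


# Channel definition; the S3 URI is a hypothetical placeholder.
train_channel = {
    "ChannelName": "train",
    "DataSource": {
        "S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://example-bucket/train/",
            # "FullyReplicated" would copy every object to every instance.
            "S3DataDistributionType": "ShardedByS3Key",
        }
    },
    "ContentType": "text/csv",
}

# Tiny local demonstration of the split.
work_dir = tempfile.mkdtemp()
source = os.path.join(work_dir, "big.csv")
with open(source, "w", newline="") as f:
    csv.writer(f).writerows([[i, i * 2] for i in range(10)])
parts = split_csv(source, work_dir, rows_per_part=4)
```

The part files would then be uploaded under the S3 prefix, so that each instance can receive a different subset of objects.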

Amazon SageMaker is built to help you scale your training activities. With large datasets, consider two main aspects:
The way data is stored and accessed,
The actual training parallelism.
Data storage: S3 is the most cost-effective way to store your data for training. To get faster startup and training times, consider the following:
If your data is already stored on Amazon S3, first consider using Pipe mode, either with built-in algorithms or with your own. But Pipe mode is not always suitable, for example if your algorithm needs to backtrack or skip ahead within an epoch (the underlying FIFO cannot support lseek() operations) or if it is not easy to parse your training dataset from a streaming source.
In those cases, you may want to use the Amazon FSx for Lustre or Amazon EFS file systems. If your training data is already in Amazon EFS, I recommend using it as the data source; otherwise, choose Amazon FSx for Lustre.
Training parallelism: With large datasets, you will likely want to train on multiple GPUs. In that case, consider the following:
If your training is already Horovod ready, you can run it on Amazon SageMaker (notebook).
In December, AWS released managed data parallelism, which simplifies parallel training over multiple GPUs. As of today, it is available for TensorFlow and PyTorch.
(bonus) Cost optimisation: Do not forget to use Managed Spot Training to save up to 90% of the compute costs.
You will find other examples on the Amazon SageMaker Distributed Training documentation page.
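To make the Pipe mode limitation mentioned above concrete, here is a small sketch that uses an ordinary OS pipe as a stand-in for the FIFO. It is not SageMaker code; the helper stream_rows and the sample rows are made up for illustration. The point is that the stream can be parsed in a single forward pass but cannot be rewound.

```python
import csv
import io
import os


def stream_rows(binary_stream):
    """Parse CSV rows from a forward-only byte stream, one pass, no seeking."""
    text = io.TextIOWrapper(binary_stream, encoding="utf-8", newline="")
    for row in csv.reader(text):
        yield row


# An OS pipe behaves like the FIFO: bytes arrive in order and are consumed once.
read_fd, write_fd = os.pipe()
with os.fdopen(write_fd, "wb") as writer:
    writer.write(b"1.0,2.0,0\n3.0,4.0,1\n")  # made-up sample rows

reader = os.fdopen(read_fd, "rb")
can_seek = reader.seekable()      # pipes report that they cannot seek
rows = list(stream_rows(reader))  # a single sequential pass still works
```

An algorithm that needs random access within an epoch would have to buffer the data itself, which is why a file system source can be the better fit in that case.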

You can use SageMaker for large-scale machine learning tasks! It's designed for that. I developed the open source project sagify (https://github.com/Kenza-AI/sagify), a CLI tool that can help you train and deploy your machine learning/deep learning models on SageMaker in a very easy way. I managed to train and deploy all of my ML models whatever library I was using (Keras, TensorFlow, scikit-learn, LightFM, etc.).

Related

Clarification on Default SageMaker Distribution Strategy

Context: when using SageMaker distributed training, say I train a network without providing any distribution parameter (leaving it at the default) but set instance_count to 2 in the estimator (it could be any deep learning estimator, e.g., PyTorch).
In this scenario, would any distributed training take place? If so, what strategy is used by default?
NOTE: I can see that both instances’ GPUs are actively used, but I wonder what sort of distributed training takes place by default.
If you're using custom code (a custom Docker image, or custom code in a framework container), the answer is NO. Unless you write distributed code (Horovod, PyTorch DDP, MPI...), SageMaker will not distribute things for you. It will launch the same Docker image or Python code N times, once per instance. Think of the SageMaker Training API as a whiteboard that can create multiple connected and configured machines for you. But the code is still yours to write. The SageMaker Distributed Training Libraries can make distributed code much easier to write, though.
If you're using a built-in algorithm, the answer is that it depends. Some SageMaker built-in algorithms are natively multi-machine, like SM XGBoost or SM Random Cut Forest.
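Since the same script is started once per instance, each copy has to work out its own role. A minimal sketch, assuming the SM_HOSTS and SM_CURRENT_HOST environment variables that the SageMaker framework containers set (to my knowledge); the helper cluster_position, the fake environment and the file names are hypothetical and only there so the example runs anywhere.

```python
import json


def cluster_position(env):
    """Return (rank, world_size) for this copy of the script."""
    hosts = sorted(json.loads(env["SM_HOSTS"]))
    return hosts.index(env["SM_CURRENT_HOST"]), len(hosts)


# Simulated environment of the second of two instances.
# Inside a real training job you would pass os.environ instead.
fake_env = {"SM_HOSTS": '["algo-1", "algo-2"]', "SM_CURRENT_HOST": "algo-2"}
rank, world_size = cluster_position(fake_env)

# Without this kind of logic every instance processes the same data.
all_files = ["part-0.csv", "part-1.csv", "part-2.csv", "part-3.csv"]
my_files = all_files[rank::world_size]
```

This only divides the input. Synchronising gradients or model state between the copies still needs a distributed framework such as the ones named in the answer.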

Can SageMaker distributed training be used for training non-deep learning models?

I am following this documentation page to understand SageMaker's distributed training feature.
It says here that:
The SageMaker distributed training libraries are available only through the AWS deep learning containers for the TensorFlow, PyTorch, and HuggingFace frameworks within the SageMaker training platform. 
Does this mean that we cannot use SageMaker distributed training to train machine learning models with traditional machine learning algorithms such as linear regression, random forest or XGBoost?
I have a use case where the dataset is very large and distributed training can help with model parallelism and data parallelism. What other options can be recommended to avoid bringing large amounts of data into memory on a training instance?
SageMaker training offers various tabular built-in algorithms. The KNN, XGBoost, linear learner, and factorization machines algorithms are parallelizable (they can run on multiple devices) and support data streaming (no limit on data size).
Beyond built-in algorithms, SageMaker also supports bringing your own training script.
I guess the parallel implementation of the SageMaker XGBoost algorithm can be called a "data-parallel" approach, since the model is copied to multiple instances and the data is distributed across those instances (when used with distribution="ShardedByS3Key" in sagemaker.TrainingInput).
The "model parallel" approaches are probably more applicable to neural networks. They use smdistributed.modelparallel.torch, imported as smp, which provides the decorator @smp.step to mark the training step so that the library can split the layers of the neural network across devices.
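The data-parallel idea described above can be illustrated with a toy example in plain Python. This is not how SageMaker XGBoost is implemented; it only shows that when every worker holds the same model and an equal-sized shard of the data, averaging the per-worker gradients gives the same result as one pass over the full dataset. The data and the one-parameter model are made up.

```python
import math


def gradient(w, shard):
    """Gradient of the mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


# Made-up dataset on the line y = 3x, split into two equal shards.
data = [(float(i), 3.0 * i) for i in range(1, 9)]
shards = [data[0:4], data[4:8]]

w = 0.0  # every worker starts from the same copy of the model
worker_grads = [gradient(w, shard) for shard in shards]
averaged = sum(worker_grads) / len(worker_grads)

full = gradient(w, data)
grads_match = math.isclose(averaged, full)
```

With shards of unequal size the average would need to be weighted by shard length to stay equivalent.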

Is there a way to access and work with data stored in GCP bucket directly?

I have to do a deep learning project at my university, where I need to work with a medical image database. This database is stored in a Google Cloud Platform bucket.
However, the database's size is over 4 TB, so I can't afford to download the data using gsutil. I can't use a Google Colab notebook either, since its disk storage size is 350 GB.
Is there any way I can access the data and use it to train my network?
I think you aren't on the right track.
When you build your model, you only need a representative subset of your dataset to validate your layers and the expected behavior.
Then, when all is done and packaged, you run your training job on a dedicated VM (like a Deep Learning VM). This process can be handled automatically by AI Platform. You can also set up a hyper-parameter server and parallelize your training.
In the training phase, you usually work with batches: you load only a subset of your dataset, shuffle it, and perform several training steps on this subset (loss computation such as RMSE or cross-entropy, evaluation, gradient optimization).
Because you use a subset of your full dataset in each batch, you don't need to have the 4 TB on your VM at the same time. Your training loop does it for you (download, train, evaluate, delete).
As I said before, because you use a subset, you can also parallelize your training across several VMs to reduce your training duration.
I recommend reviewing your training loop. If you give me the name and version of the framework you work with, I could help you with tutorials and examples.
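The download, train, delete loop described above can be sketched without any framework. Everything here is illustrative: fetch_shard is a hypothetical stand-in for downloading one chunk from the bucket (it just synthesises points on y = 2x + 1), and the model is a two-parameter linear fit rather than a neural network. Shuffling and evaluation are left out to keep it short.

```python
def fetch_shard(index, shard_size=8):
    """Stand-in for downloading one shard; returns (x, y) pairs on y = 2x + 1."""
    start = index * shard_size
    xs = [(i - 16) / 16 for i in range(start, start + shard_size)]
    return [(x, 2.0 * x + 1.0) for x in xs]


def train(num_shards=4, epochs=2000, lr=0.1):
    """Fit y = w * x + b while holding only one shard in memory at a time."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for index in range(num_shards):
            shard = fetch_shard(index)  # "download" one chunk
            n = len(shard)
            # Mean squared error gradients on this chunk only.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in shard) / n
            grad_b = sum(2 * (w * x + b - y) for x, y in shard) / n
            w -= lr * grad_w            # "train": one gradient step
            b -= lr * grad_b
            del shard                   # "delete" before fetching the next one
    return w, b


w, b = train()
```

Peak memory depends on the shard size, not on the total dataset size, which is what makes the 4 TB manageable.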

Handling Very Large volume(500TB) data using spark

I have a large volume of data, nearly 500 TB, and I have to do some ETL on it.
The data is in AWS S3, so I am planning to use AWS EMR to process it, but I am not sure what configuration I should select.
What kind of cluster do I need (master and how many slaves)?
Do I need to process the data chunk by chunk (10 GB) or can I process it all at once?
How much memory (RAM) and storage should the master and the slaves (executors) have?
What kind of processor (speed) do I need?
Based on this, I want to calculate the cost of AWS EMR and start processing the data.
Based upon your question, you have little or no experience with Hadoop. Get some training first so that you understand how the Hadoop ecosystem works. Plan on spending three months to get to a starter level.
You have a lot of choices to make, and some are fundamental to a project's success. For example: which language (Scala, Java or Python)? Which tools (Spark, Hive, Pig, etc.)? What format is your data in (CSV, XML, JSON, Parquet, etc.)? Do you only need batch processing or do you require near real-time analysis? And so on.
You may find other AWS services more applicable such as Athena or Redshift depending on what format your data is in and what information you are trying to extract / process.
With 500 TB in AWS, open a ticket with support. Explain what you have, what you want and your time frame. An SA will be available to direct you on a path.

Amazon Web Service: Use non csv data and retrieve trained model

I am considering using AWS with the machine learning AMI for training some deep networks that are too slow for my hardware setup.
However, at the moment I see two possible major issues that might make this option less interesting or even impossible.
The training data is not in CSV format, but images in NIfTI format. The AWS description states that the data has to be in .csv.
Additionally, the FAQ states that trained models cannot be extracted. Does this mean that all subsequent inference and testing has to be done on instances in AWS?
Are both of these issues real?
Yes, as far as I can tell you can only use CSV format for training data:
http://docs.aws.amazon.com/machine-learning/latest/dg/step-1-download-edit-and-upload-data.html
AWS Machine Learning Datasources
and finally: "Data from other products can usually be exported into CSV files in Amazon S3, making it accessible to Amazon Machine Learning"
It seems that CSV is the only format so far. I found it a bit frustrating myself...
And yes, as the Machine Learning FAQ indicates:
Q: Can I export my models out of Amazon Machine Learning?
A: No.
So, so far, there is no way to save your model...
You can probably create a c5.large (compute optimized) instance and install all the Python libraries needed for your machine learning projects. Then use scikit-learn's model persistence feature to save your model.
If c5.large is not enough, you can easily scale it up; just use EBS storage for this instance.
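A minimal sketch of saving and reloading a model on your own instance. scikit-learn's documentation describes pickle and joblib for persistence; to keep this example runnable without scikit-learn installed, a plain dict of made-up parameters stands in for the fitted estimator, and predict is a hypothetical helper. A real estimator object can be passed to pickle.dump in the same way.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model: learned parameters with made-up values.
model = {"kind": "linear", "coef": [0.5, -1.25], "intercept": 2.0}

# Save to disk (on EC2 this could be a path on the attached EBS volume).
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Later, or on another machine: load the file and run inference.
with open(model_path, "rb") as f:
    restored = pickle.load(f)


def predict(m, features):
    """Linear prediction from the stored parameters."""
    return m["intercept"] + sum(c * x for c, x in zip(m["coef"], features))


prediction = predict(restored, [2.0, 4.0])
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.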
I hope this clarification helps.