I'm trying to design how my system uses Google ML for predictions.
I have a largish set of images I need to use for prediction. I can either:
1. Send one big dataset to Google ML, and have the model break the data up into smaller iterative datasets, or
2. Break the dataset up beforehand and send multiple Google ML prediction requests with the smaller datasets. This will ensure parallelism, but I'll end up sending about 3X the data.
If I allocate more resources for Google ML, will Google ML automatically parallelize the iterative datasets if I do option 1? Or am I better off handling this myself?
Thanks in advance!
I am planning to use AutoML for the classification of my tabular data. But there is a moderate imbalance in the target variable.
When running my own model, I would either upsample, downsample or build synthetic samples to resolve the imbalance.
Is there such a possibility on AutoML on GCP? If not, how can one resolve such cases?
Auto ML Tabular Data Classification
AutoML Tables is a supervised learning service. This means that you train a machine learning model with example data. In general, the more training examples you have, the better your outcome. The amount of example data required also scales with the complexity of the problem you're trying to solve. See the guide on how much data to use.
So, regarding the imbalance in your dataset, the only way to resolve it is to adjust the data yourself (add or remove samples) before training so that you get the best results.
For more information, you can refer to the AutoML Tables guide.
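As a rough illustration of "adding samples", here is a minimal sketch of random oversampling done on your own data before it is imported. It is plain Python and does not use any AutoML API; the `oversample` helper, the `label` column name and the toy rows are all hypothetical.

```python
import random
from collections import defaultdict


def oversample(rows, label_key, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the missing rows with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy data (hypothetical): 8 negative rows, 2 positive rows.
rows = [{"x": i, "label": "neg"} for i in range(8)]
rows += [{"x": i, "label": "pos"} for i in range(2)]
balanced = oversample(rows, "label")
```

If you resample like this, it is usually safer to apply it only to the training split, so that evaluation still reflects the real class distribution.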
I am building a classification model using AutoML and I have some basic usage questions about GCP.
1 - Data privacy question; if we save behavior data to train our model in BigQuery, does Google have access to that data? Could Google ever use that data to learn more about behavior of individuals we collected data from?
2 - Since training costs are charged by the hour, I would like to understand the relationship between data size and training time. Does the time increase linearly with the size of the training data set? For example, we trained a classification model using 1.7MB of data and it took 3 hrs. So, would training a model with 17MB of data take 30 hours?
3 - A batch prediction costs 1.16 USD per hour. However, our data is in a csv and it seems that we cannot upload a csv to do a batch prediction. So, we will try using the API. Therefore I have two questions: A) can we do a batch upload using the API and B) what are the associated costs?
4 - What exactly is an online prediction?
5 - When using the cost calculator (for machine learning), what is a node hour?
1- As mentioned in the Data Usage FAQ, Google does not use any of your content for any purpose except to provide you with the Cloud AutoML service.
2- The time required to train your model depends on the size and complexity of your training data. For a detailed explanation, take a look at the Vision documentation, for example.
3- You need to upload your csv file to Google Cloud Storage and then you can use it in the API or any of the available client libraries. See Natural Language batch prediction, for example. For costs, check the documentation for the desired product. AutoML pricing depends on what feature you are using: Vision, Natural Language, Translation, Video Intelligence.
4- After you have created (trained) a model, you can deploy the model and request online (single, low-latency and real-time) predictions. Online predictions accept one row of data and provide a predicted result based on your model for that data. You use online predictions when you need a prediction as input for your business logic flow.
5- You can think of a node as a single virtual machine whose resources are used for computing purposes. Machine types differ depending on the product and the purpose for which they are used. For example, the cost of AutoML Vision image classification model training is $3.15 per node hour, and each node is equivalent to an n1-standard-8 machine with an attached NVIDIA Tesla V100 GPU. A node hour is then the use of one such node's resources for one hour.
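To make the node-hour arithmetic concrete, here is a small sketch using the $3.15 rate quoted above. The `training_cost` helper is hypothetical, and the rate is only the example figure from this answer; check the current pricing page before relying on it.

```python
from decimal import Decimal

# Example rate from the answer above (AutoML Vision image classification training).
# Prices change, so treat this as an assumption.
RATE_PER_NODE_HOUR = Decimal("3.15")


def training_cost(nodes, hours, rate=RATE_PER_NODE_HOUR):
    """Return (node_hours, cost): node hours are nodes multiplied by wall-clock hours."""
    node_hours = Decimal(nodes) * Decimal(str(hours))
    return node_hours, node_hours * rate


# One node running for 3 hours is 3 node hours.
node_hours, cost = training_cost(nodes=1, hours=3)
print(f"{node_hours} node hours cost ${cost}")
```

Two nodes running for 1.5 hours consume the same 3 node hours, so they are billed the same amount under this model.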
I have to do a deep learning project at my university, where I need to work with a medical image database. This database is stored in a Google Cloud Platform bucket.
However, the database's size is over 4 TB, so I can't afford to download the data using gsutil. I can't use a Google Colab notebook either, since its disk storage size is 350GB.
Is there any way I can access the data and use it to teach my network?
I think you're on the wrong track.
When you build your model, you only need to have a representative subset of your dataset to validate your layers and the expected behavior.
Then, when everything is done and packaged, you run your training job on a dedicated VM (like a Deep Learning VM). This process can be handled automatically by AI Platform. You can also set up a hyper-parameter server and parallelize your training.
In the training phase, you often work with batches: you load only a subset of your dataset, shuffle it, and perform several training steps on this subset (loss computation such as RMSE or cross-entropy, evaluation, gradient optimization).
Because you use a subset of your full dataset in each batch, you don't need to have the 4 TB on your VM at the same time. Your training loop does it for you (download, train, evaluate, delete).
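The download, train, delete cycle can be sketched roughly as below. This is a framework-agnostic illustration, not a real Cloud Storage client: `train_in_shards`, `fake_download`, `fake_train_step` and the shard names are all hypothetical stand-ins for your bucket access and your model's training step.

```python
import os
import tempfile


def train_in_shards(shard_names, download, train_step, scratch_dir):
    """Process one shard at a time so only one shard sits on local disk at once."""
    peak_files = 0
    for name in shard_names:
        local_path = os.path.join(scratch_dir, name)
        download(name, local_path)  # fetch a single shard from the bucket
        peak_files = max(peak_files, len(os.listdir(scratch_dir)))
        try:
            train_step(local_path)  # run training steps on this shard
        finally:
            os.remove(local_path)  # free the disk before the next shard
    return peak_files


# Stand-ins for a real bucket and a real model (hypothetical).
fake_bucket = {f"shard-{i}.bin": bytes([i]) * 4 for i in range(3)}


def fake_download(name, local_path):
    with open(local_path, "wb") as f:
        f.write(fake_bucket[name])


seen = []


def fake_train_step(local_path):
    with open(local_path, "rb") as f:
        seen.append(len(f.read()))


with tempfile.TemporaryDirectory() as scratch:
    peak = train_in_shards(sorted(fake_bucket), fake_download, fake_train_step, scratch)
    leftover = os.listdir(scratch)
```

In a real setup, the framework's input pipeline would typically do this streaming for you; the sketch only shows why local disk usage stays bounded by the shard size rather than the dataset size.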
As I said before, because you use a subset, you can also parallelize your training across several VMs to reduce your training duration.
I recommend that you review your training loop. If you give me the name and version of the framework you are working with, I can help you with tutorials and examples.
I have a CSV file of 500GB and a MySQL database with 1.5 TB of data, and I want to run AWS SageMaker classification and regression algorithms and random forest on it.
Can AWS SageMaker support this? Can the data be read and the model trained in batches or chunks? Is there an example of this?
Amazon SageMaker is designed for such scales and it is possible to use it to train on very large datasets. To take advantage of the scalability of the service you should consider a few modifications to your current practices, mainly around distributed training.
Distributed training makes training much faster (“100 hours of a single instance cost exactly the same as 1 hour of 100 instances, just 100 times faster”), more scalable (“if you have 10 times more data, you just add 10 times more instances and everything just works”) and more reliable, as each instance only handles a small part of the dataset or the model and doesn’t run out of disk or memory space.
It is not obvious how to implement an ML algorithm in a distributed way that is still efficient and accurate. Amazon SageMaker has modern implementations of classic ML algorithms such as Linear Learner, K-means, PCA, XGBoost etc. that support distributed training and can scale to such dataset sizes. According to some benchmarks, these implementations can be 10 times faster than other distributed training implementations such as Spark MLlib. You can see some examples in this notebook: https://github.com/awslabs/amazon-sagemaker-workshop/blob/master/notebooks/video-game-sales-xgboost.ipynb
The other aspect of scale is the data file(s). The data shouldn’t be in a single file, as that limits the ability to distribute the data across the cluster that you are using for your distributed training. With SageMaker you can decide how the data files from Amazon S3 are used. They can be used in a fully replicated mode, where all the data is copied to all the workers, or they can be sharded by key, which distributes the data across the workers and can speed up the training even further. You can see some examples in this notebook: https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/data_distribution_types
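To illustrate the difference between the two modes, here is a toy sketch of how a list of S3 object keys could end up on the workers. It does not call SageMaker; `assign_files` and the key names are hypothetical, and the round-robin split is only an assumption for illustration, since the service decides the actual assignment.

```python
def assign_files(keys, num_workers, mode):
    """Return {worker_index: [keys]} for the two distribution modes described above."""
    if mode == "fully_replicated":
        # Every worker receives a copy of every file.
        return {w: list(keys) for w in range(num_workers)}
    if mode == "sharded_by_key":
        # Each file goes to exactly one worker (round-robin is an assumption here).
        return {w: list(keys[w::num_workers]) for w in range(num_workers)}
    raise ValueError(f"unknown mode: {mode}")


keys = [f"train/part-{i:04d}.csv" for i in range(10)]
replicated = assign_files(keys, 4, "fully_replicated")
sharded = assign_files(keys, 4, "sharded_by_key")

# With a single big file, sharding has nothing to split: one worker gets it all.
single = assign_files(["train/all.csv"], 4, "sharded_by_key")
```

The last case shows why a single 500GB file works against you: three of the four workers would have no data to train on.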
Amazon SageMaker is built to help you scale your training activities. With large datasets, you might consider two main aspects:
The way data are stored and accessed,
The actual training parallelism.
Data storage: S3 is the most cost-effective way to store your data for training. To get faster startup and training times, you can consider the following:
If your data is already stored on Amazon S3, you might first want to consider leveraging Pipe mode, either with the built-in algorithms or with your own. Pipe mode is not always suitable, however: for example, if your algorithm needs to backtrack or skip ahead within an epoch (the underlying FIFO cannot support lseek() operations), or if it is not easy to parse your training dataset from a streaming source.
In those cases, you may want to leverage Amazon FSx for Lustre and Amazon EFS file systems. If your training data is already in Amazon EFS, I recommend using it as a data source; otherwise, choose Amazon FSx for Lustre.
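The streaming constraint mentioned above can be pictured with a single-pass reader like the one below. This is only a sketch of the access pattern, not SageMaker's Pipe mode API: `stream_batches` is hypothetical, and `io.StringIO` merely stands in for the FIFO (a real pipe would reject seeking, which is why the reader never goes backwards).

```python
import csv
import io


def stream_batches(stream, batch_size):
    """Yield lists of parsed CSV rows from a forward-only stream, in one pass."""
    batch = []
    for row in csv.reader(stream):
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


# io.StringIO stands in for the FIFO here (assumption for illustration).
fake_fifo = io.StringIO("1,0.5\n0,1.5\n1,2.5\n0,3.5\n1,4.5\n")
batches = list(stream_batches(fake_fifo, batch_size=2))
```

If your algorithm can consume data this way, a streaming source fits; if it needs to revisit earlier records within an epoch, a file system source is the more natural choice.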
Training Parallelism: With large datasets, it is likely you'll want to train on several GPUs. In that case, consider the following:
If your training is already Horovod ready, you can do it with Amazon SageMaker (notebook).
In December, AWS released managed data parallelism, which simplifies parallel training over multiple GPUs. As of today, it is available for TensorFlow and PyTorch.
(bonus) Cost Optimisation: Do not forget to leverage Managed Spot training to save up to 90% of the compute costs.
You will find other examples on the Amazon SageMaker Distributed Training documentation page.
You can use SageMaker for large-scale Machine Learning tasks! It's designed for that. I developed this open source project https://github.com/Kenza-AI/sagify (sagify); it's a CLI tool that can help you train and deploy your Machine Learning/Deep Learning models on SageMaker in a very easy way. I managed to train and deploy all of my ML models, whatever library I was using (Keras, TensorFlow, scikit-learn, LightFM, etc.).
We have found BigQuery to work great on data sets larger than 100M rows, where the 'initialization time' doesn't really come into effect (or is negligible compared to the rest of the query).
However, on anything under that, the performance is quite slow and poor, which makes it (1) ill-suited to working in an interactive BI tool; and (2) inferior to other products, such as Redshift or even ElasticSearch, where the data size is under 100M rows. Actually, we had an engineer at our organization who was evaluating a technology for doing queries on data sizes between 1M and 100M rows for an analytics product that has about 1000 users, and his feedback was that he could not believe how slow BigQuery was.
Without a defense of the BigQuery product, I was wondering if there were any plans on improving:
The speed of BigQuery -- especially its initialization time -- on queries of non-massive data sets?
Will BigQuery ever be able to deliver sub-second response times on 'regular' queries (such as a simple aggregation group by) on datasets under a certain size?
The time is spent on metadata/initialization; the actual execution time is very small. We have work in progress that will address this, but some of the changes are complicated and will take a while.
You can imagine that in its infancy, BigQuery could have central systems for managing jobs, metadata, etc. in a manner that performed very well for all N0 entities using the service. Once you get to N1 entities, however, it may be necessary to rearchitect some things to make them have as little latency as possible. For notification about new features--which is also where we would announce API improvements related to start-up latency--keep an eye on our release notes, which you can also subscribe to as an RSS feed.
Exactly 4 years after this question, we have amazing news for BigQuery users! As stated in this BI Engine release note from 2021-02-25:
The BI Engine SQL interface expands BI Engine to integrate with other business intelligence (BI) tools such as Looker, Looqbox, Tableau, Power BI, and custom applications to accelerate data exploration and analysis. This page provides an overview of the BI Engine SQL interface, and the expanded capabilities that it brings to this preview version of BI Engine.
I believe this can solve the query latency issue mentioned in David542's question.