Amazon Machine Learning models rebuilding possibilities - amazon-web-services

There are only two kinds of built-in prediction/classification models in AWS Machine Learning: logistic regression and linear regression. Is it somehow possible in the current version of AWS ML to:
1) Rebuild what is under the hood of the logistic and linear regression models?
2) Build your own models written in Python/R, deploy them on AWS ML, and run things such as neural nets, random forests, or clustering algorithms?
In the latest version of the AWS ML Developer Guide I could not find explicit answers to these questions, nor a statement that it is impossible. Any tips?

A bit of background first...
Amazon Machine Learning can build models for three kinds of machine learning problems (binary/multiclass classification & regression). As you previously mentioned, the model selected and trained by the platform is abstracted from the user.
This "black box" implementation is perhaps the largest deficiency of Amazon's machine learning platform. You have no information on what model or how the model is trained (beyond, for ex. linear regression, stochastic gradient descent). Amazon is quite clear that this is intentional, as they want the platform to be built into an application, and not just used to train models for one. See the 47:25 and 53:30 mark of this Q&A.
So, to answer your questions:
You cannot see exactly how the models have been trained, for example the learned coefficients of a linear regression (although you may be able to deduce them by testing the model). When you query the model, the response includes a field which indicates the algorithm used for that particular model (e.g. SGD); see the sketch below. A full list of learning algorithms can be found here.
Unfortunately not. You cannot create your own models and import them into AWS Machine Learning, meaning that no decision trees or neural network models can run on the platform.
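For example, querying a model's metadata with boto3 might look like the following minimal sketch (the model ID is a placeholder; the service only exposes high-level metadata such as the model type and training parameters, never the learned weights):

import boto3

client = boto3.client("machinelearning")

response = client.get_ml_model(
    MLModelId="ml-exampleModelId",  # placeholder ID
    Verbose=True,                   # also return the full feature-processing recipe
)

print(response["MLModelType"])         # e.g. "BINARY", "MULTICLASS", or "REGRESSION"
print(response["TrainingParameters"])  # hyperparameters such as sgd.maxPasses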

Related

Is it possible to train ONNX models developed in tensorflow and pytorch with C++?

I wonder if it's possible to take TensorFlow and PyTorch models converted to ONNX models and train them with the C++ API, like it is done in e.g. https://gist.github.com/asimshankar/5c96acd1280507940bad9083370fe8dc with a TensorFlow model. I have only found examples for inference with ONNX. The idea is to be able to prototype with TensorFlow and PyTorch in Python, convert to ONNX models, and have a unified API in C++ to do inference and training. It would help quite a lot to get some information (or links to it).
ONNX's GitHub page suggests that it is aimed at inference, and it doesn't seem feasible to train arbitrary models with it (from a development perspective):
Currently we focus on the capabilities needed for inferencing (scoring).
There are also real difficulties: writing backpropagation is always harder than writing the forward pass, and supporting it would roughly double the size of the framework, which is not what ONNX is aiming for, since there are already so many frameworks for training. To train, you would need all the parameters and the derivatives of every function on both GPU and CPU (and if its performance were lower than that of other frameworks, it would be a big problem, since nobody would use it). And there are many other things that make a unified training framework difficult (supporting training on multiple GPUs over a network, for example). So from our perspective it would be great, but from theirs it is very difficult.
That said, some training functionality has been added to the framework; for example, it can train transformer models.
For training transformers in PyTorch, see this link.
ONNX Runtime does support training, but not in C++. You can train an ONNX model using ORT and PyTorch. Please see here: https://onnxruntime.ai/docs/get-started/training-pytorch.html.
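As a rough illustration, the PyTorch-side workflow might look like the sketch below, assuming the torch-ort package is installed (pip install torch-ort). Note that this wraps an existing PyTorch model so that training runs through ONNX Runtime kernels; it is not a pure-C++ training API:

import torch
from torch_ort import ORTModule

model = torch.nn.Linear(10, 1)  # any torch.nn.Module
model = ORTModule(model)        # forward and backward now run through ORT

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)
for _ in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()             # backpropagation executed by ONNX Runtime
    optimizer.step()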

Randomforest in amazon aws sagemaker?

I am looking to recreate a random forest model built locally, and deploy it through SageMaker. The model is very basic, but for comparison I would like to use the same one in SageMaker. I don't see random forest among SageMaker's built-in algorithms (which seems weird) - is my only option to go the route of deploying my own custom model? I am still learning about containers, and it seems like a lot of work for something that is just a simple RandomForestClassifier() call locally. I just want to baseline against the out-of-the-box random forest model, and show that it works the same when deployed through AWS SageMaker.
edit 03/30/2020: adding a link to the SageMaker Sklearn random forest demo
In SageMaker you have 3 options to write scientific code:
Built-in algorithms
Open-source pre-written containers (available for sklearn, TensorFlow, PyTorch, MXNet, Chainer; Keras can be written in the TensorFlow and MXNet containers)
Bring your own container (for R, for example)
At the time of writing this post there is no random forest classifier nor regressor in the built-in library. There is an algorithm called Random Cut Forest in the built-in library, but it is an unsupervised algorithm for anomaly detection, a different use-case than the scikit-learn random forest used in a supervised fashion (also answered in StackOverflow here). But it is easy to use the open-source pre-written scikit-learn container to implement your own. There is a demo showing how to use Sklearn's random forest in SageMaker, with training orchestration both from the high-level SDK and boto3. You can also use this other public sklearn-on-sagemaker demo and change the model. A benefit of the pre-written containers over the "Bring your own" option is that the dockerfile is already written, and the web serving stack too.
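To give an idea of the pre-written-container route, a minimal sketch with the SageMaker Python SDK might look like this. The role ARN, script name, and S3 path are placeholders; train.py would be your own script that fits a sklearn RandomForestClassifier and saves it with joblib to /opt/ml/model:

from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point="train.py",      # hypothetical script fitting RandomForestClassifier
    framework_version="0.23-1",  # a published sklearn container version
    instance_type="ml.m5.large",
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
)

# Launch a training job on data already uploaded to S3 (placeholder path)
estimator.fit({"train": "s3://my-bucket/train.csv"})

# Deploy the trained model behind a real-time endpoint
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type="ml.m5.large")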
Regarding your surprise that random forest is not featured in the built-in algos, the library and its 18 algos already cover a rich set of use-cases. For example, for supervised learning over structured data (the usual use-case for the random forest), if you want to stick to the built-ins, then depending on your priorities (accuracy, inference latency, training scale, costs...) you can use SageMaker XGBoost (XGBoost has been winning tons of data-mining competitions - every winning team in the top 10 of the KDDCup 2015 used XGBoost according to the XGBoost paper - and scales well) and Linear Learner, which is extremely fast at inference and can be trained at scale, in mini-batch fashion over GPU(s). Factorization Machines (linear + 2nd-degree interactions, with weights being column-embedding dot-products) and SageMaker kNN are other options. Also, things are not frozen in stone, and the list of built-in algorithms is being improved fast.
RandomForestClassifier is not supported out of the box with SageMaker, but XGBoost (gradient-boosted trees) as well as DecisionTreeClassifier from scikit-learn are both supported. You can access scikit-learn's DecisionTreeClassifier directly from the SageMaker SDK.
Here's a notebook demonstrating use of a DecisionTreeClassifier from SageMaker's built-in scikit-learn.
Deploying your own custom model via a Dockerfile is certainly possible as well (and can seem daunting at first, but isn't all that bad), but I agree in that it wouldn't be ideal for a simple algorithm that's already included in SageMaker :)
Edit: Mixed up Random Forest and Random Cut Forest in the original answer, as discussed in the comments. Random Cut Forest algorithm docs for SageMaker are available here: https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html
Random Cut Forest (RCF) example Jupyter notebook: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/random_cut_forest/random_cut_forest.ipynb

Microsoft Translator API (Switching from Statistical to Neural Models)

I am using the Microsoft Translator API (Python scripts) and I have noted that it uses the Statistical Machine Translation model by default. However, the Neural Network Translation Model is more accurate than the Statistical Model (accuracy comparison figure omitted).
I would like to use the Neural Network Model with the API. Seeking guidance on how to make the switch from the default Statistical Model to the Neural one...
Using the API, add “category=generalnn” to your call to tell our servers to use the NN models vs. SMT. If the app requests NN for a language that is not supported, it will automatically fall back on the SMT ones.
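As a hedged illustration with Python's requests library, a call might look like the sketch below. This follows the v2-era HTTP API that the question used (since retired); the access token is a placeholder that would be obtained from the Azure token service:

import requests

access_token = "YOUR_ACCESS_TOKEN"  # placeholder, from the Azure token service

endpoint = "https://api.microsofttranslator.com/v2/Http.svc/Translate"
params = {
    "text": "Hello, world!",
    "from": "en",
    "to": "de",
    "contentType": "text/plain",
    "category": "generalnn",  # request the NN models instead of SMT
}
headers = {"Authorization": "Bearer " + access_token}

response = requests.get(endpoint, params=params, headers=headers)
print(response.text)  # v2 returns the translation as XML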

Does or will H2O provide any pretrained vectors for use with h2o word2vec?

H2O recently added word2vec in its API. It is great to be able to easily train your own word vectors on a corpus you provide yourself.
However, even greater possibilities come from using big data and big computers, of the type that software vendors like Google or H2O.ai have access to, but that not so many end-users of H2O do, due to network bandwidth and compute power limitations.
Word embeddings can be seen as a type of unsupervised learning. As such, great value can be had in a data science pipeline by using pretrained word vectors that were built on a very large corpus as infrastructure in specific applications. Using general purpose pretrained word vectors can be seen as a form of transfer learning. Reusing word vectors is analogous to computer vision deep learning generic lowest layers that learn to detect edges in photographs. Higher layers detect specific kinds of objects composed from the edge layers below them.
For example, Google provides some pretrained word vectors with their word2vec package. The more examples the better is often true with unsupervised learning. Further, it is sometimes practically difficult for an individual data scientist to download a giant corpus of text on which to train their own word vectors. And there is no good reason for every user to recreate the same wheel by training word vectors themselves on the same general-purpose corpora, like Wikipedia.
Word embeddings are very important and have the potential to be the bricks and mortar of a galaxy of possible applications. TF-IDF, the old basis for many natural language data science applications, stands to be made obsolete by using word embeddings instead.
Three questions:
1 - Does H2O currently provide any general purpose pretrained word embeddings (word vectors), for example trained on text found at legal or other public-owned (government) websites, or wikipedia or twitter or craigslist, or other free or Open Commons sources of human-written text?
2 - Is there a community site where H2O users can share their trained word2vec word vectors that are built on more specialized corpuses, such as medicine and law?
3 - Can H2O import Google's pretrained word vectors from their word2vec package?
Thank you for your questions.
You are absolutely right, there are many situations where you don't need a custom model and a pre-trained model will work well. I assume people will mostly build their own models for smaller problems in their specific domain and use pre-trained models to complement the custom model.
You can import 3rd party pre-trained models into H2O as long as they are in a CSV-like format. This is true for many available GloVe models.
To do that, import the model into a Frame (just like any other dataset):
w2v.frame <- h2o.importFile("pretrained.glove.txt")
And then convert it to a regular H2O word2vec model:
w2v.model <- h2o.word2vec(pre_trained = w2v.frame, vec_size = 100)
Please note that you need to provide the size of the embeddings.
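For reference, the equivalent in H2O's Python API might look like the following sketch (the GloVe filename is a placeholder; the file must be in a CSV-like format):

import h2o
from h2o.estimators.word2vec import H2OWord2vecEstimator

h2o.init()

# Import the pretrained vectors as a regular H2O Frame
w2v_frame = h2o.import_file("pretrained.glove.txt")  # placeholder path

# Wrap the frame as a word2vec model; the embedding size must be supplied
w2v_model = H2OWord2vecEstimator(pre_trained=w2v_frame, vec_size=100)
w2v_model.train()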
H2O doesn't plan to provide a model exchange/model market for w2v models as far as I know. You can use models that are available online: https://github.com/3Top/word2vec-api
We currently do not support importing Google's binary format of word embeddings, however the support is on our road map as it makes a lot of sense for our users.

Amazon Machine Learning for sentiment analysis

How flexible or supportive is the Amazon Machine Learning platform for sentiment analysis and text analytics?
You can build a good machine learning model for sentiment analysis using Amazon ML.
Here is a link to a github project that is doing just that: https://github.com/awslabs/machine-learning-samples/tree/master/social-media
Since Amazon ML supports supervised learning and accepts text as an input attribute, you need to get a sample of tagged data and build the model with it.
The tagging can be based on Mechanical Turk, like in the example above, or done by interns ("the summer is coming"). The benefit of having your own specific tagging is that you can put your logic into the model. For example, the difference between "The beer was cold" and "The steak was cold", where one is positive and the other negative, is something that a generic system will find hard to learn.
You can also try to play with some sample data, from the project above or from this Kaggle competition for sentiment analysis on movie reviews: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews. I used Amazon ML on that data set and got fairly good results rather easily and quickly.
Note that you can also use Amazon ML to run real-time predictions based on the model that you are building, and you can use it to respond immediately to negative (or positive) input. See more here: http://docs.aws.amazon.com/machine-learning/latest/dg/interpreting_predictions.html
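A real-time prediction call through boto3 might look like this minimal sketch (the model ID and record attributes are placeholders; a real-time endpoint must have been created for the model beforehand):

import boto3

client = boto3.client("machinelearning")

response = client.predict(
    MLModelId="ml-exampleModelId",         # placeholder model ID
    Record={"text": "The beer was cold"},  # input attributes expected by the model
    PredictEndpointUrl="https://realtime.machinelearning.us-east-1.amazonaws.com",
)

# For binary classification, the prediction includes the predicted label
# and a score indicating the model's confidence
print(response["Prediction"])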
It is great for starting out. Highly recommend you explore this as an option. However, realize the limitations:
you'll want to build a pipeline, because models are immutable: you have to build a new model to incorporate new training data (or new hyperparameters, for that matter)
you are drastically limited in the tweakability of your system
it only does supervised learning
the target variable can't be free text; it must be a number, boolean, or categorical value
you can't export the model and import it into another system if you want to: the model is a black box
Benefits:
you don't have to run any infrastructure
it integrates with AWS data sources well
the UX is nice
the algorithms are chosen for you, so you can quickly test and see if it is a fit for your problem space.