I am looking to recreate a randomforest model built locally, and deploy it through sagemaker. The model is very basic, but for comparison I would like to use the same in sagemaker. I don't see randomforest among sagemaker's built in algorithms (which seems weird) - is my only option to go the route of deploying my own custom model? Still learning about containers, and it seems like a lot of work for something that is just a simple randomforestclassifier() call locally. I just want to baseline against the out of the box randomforest model, and show that it works the same when deployed through AWS sagemaker.
edit 03/30/2020: adding a link to the the SageMaker Sklearn random forest demo
in SageMaker you have 3 options to write scientific code:
Built-in algorithms
Open-source pre-written containers (available
for sklearn, tensorflow, pytorch, mxnet, chainer. Keras can be
written in the tensorflow and mxnet containers)
Bring your own container (for R for example)
At the time of writing this post there is no random forest classifier nor regressor in the built-in library. There is an algorithm called Random Cut Forest in the built-in library but it is an unsupervised algorithm for anomaly detection, a different use-case than the scikit-learn random forest used in a supervised fashion (also answered in StackOverflow here). But it is easy to use the open-source pre-written scikit-learn container to implement your own. There is a demo showing how to use Sklearn's random forest in SageMaker, with training orchestration bother from the high-level SDK and boto3. You can also use this other public sklearn-on-sagemaker demo and change the model. A benefit of the pre-written containers over the "Bring your own" option is that the dockerfile is already written, and web serving stack too.
Regarding your surprise that Random Forest is not featured in the built-in algos, the library and its 18 algos already cover a rich set of use-cases. For example for supervised learning over structured data (the usual use-case for the random forest), if you want to stick to the built-ins, depending on your priorities (accuracy, inference latency, training scale, costs...) you can use SageMaker XGBoost (XGBoost has been winning tons of datamining competitions - every winning team in the top10 of the KDDcup 2015 used XGBoost according to the XGBoost paper - and scales well) and linear learner, which is extremely fast at inference and can be trained at scale, in mini-batch fashion over GPU(s). Factorization Machines (linear + 2nd degree interaction with weights being column embedding dot-products) and SageMaker kNN are other options. Also, things are not frozen in stone, and the list of built-in algorithms is being improved fast.
RandomForestClassifier is not supported out of the box with SageMaker, but XGBoost (gradient boosted trees) as well as decisionTreeClassifier from scikit-learn are both supported. You can access scikit-learn's decisionTreeClassifier() directly from the SageMaker SDK.
Here's a notebook demonstrating use of a decisionTreeClassifier from SageMaker's built-in scikit-learn.
Deploying your own custom model via a Dockerfile is certainly possible as well (and can seem daunting at first, but isn't all that bad), but I agree in that it wouldn't be ideal for a simple algorithm that's already included in SageMaker :)
Edit: Mixed up Random Forest and Random Cut Forest in the original answer as discussed in comment. Random Cut Forest algorithm docs for SageMaker are available here: https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html
Random Cut Forest (RCF) Jupyter noetbook ex: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/random_cut_forest/random_cut_forest.ipynb
Related
I am building a time series usecase to automate the preprocess and retrain tasks.At first the data is preprocessed using numpy, pandas, statsmodels etc & later a machine learning algorithm is applied to make predictions.
The reason for using inference pipeline is that it reuses the same preprocess code for training and inference. I have checked the examples given by AWS sagemaker team with spark and sci-kit learn. In both the examples they use a sci-kit learn container to fit & transform their preprocess code. Should I also have to create a container which is not needed in my use case as I am not using any sci-kit-learn code?
Can someone give me a custom example of using these pipelines? Any help is appreciated!
Sources looked into:
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/scikit_learn_inference_pipeline
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/inference_pipeline_sparkml_blazingtext_dbpedia
Apologies for the late response.
Below is some documentation on inference pipelines:
https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html
https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipeline-real-time.html
Should I also have to create a container which is not needed in my use case as I am not using any sci-kit-learn code?
Your container is an encapsulation of the environment needed for your custom code needed to run properly. Based on the requirements listed above, numpy, pandas, statsmodels etc & later a machine learning algorithm, I would create a container if you wish to isolate your dependencies or modify an existing predefined SageMaker container, such as the scikit-learn one, and add your dependencies into that.
Can someone give me a custom example of using these pipelines? Any help is appreciated!
Unfortunately, the two example notebooks referenced above are the only examples utilizing inference pipelines. The biggest hurdle most likely is creating containers that fulfill the preprocessing and prediction task you are seeking and then combining those two together into the inference pipeline.
Blockquote
i am working on this project asssigned by university as final project. But the issue is i am not getting any help from the internet so i thought may be asking here can solve issue. i had read many articles but they had no code or guidance and i am confused what to do. Basically it is an image processing work with machine learning. Data set can be found easily but issue is python python learning algorithm and code
Blockquote
I presume if it's your final project you have to create the program yourself rather than ripping it straight from the internet. If you want a good starting point which you can customise Tensor Flow from Google is very good. You'll want to understand how it works (i.e. how machine learning works) but as a first step there's a good example of image processing on the website in the form of number recognition (which is also the "Hello World" of machine learning).
https://www.tensorflow.org/get_started/mnist/beginners
This also provides a good intro to machine learning with neural nets: https://www.youtube.com/watch?v=uXt8qF2Zzfo
One note on Tensor Flow, you'll probably have to use Python 3.5+ as in my experience it can be difficult getting it on 2.7.
First of all I need to know what type of data are you using because depending on your data, if it is a MRI or PET scan or CT, there could be different suggestion for using machine learning in python for detection.
However, I suppose your main dataset consist of MR images, I am attaching an article which I found it a great overview of different methods>
This project compares four different machine learning algorithms: Decision Tree, Majority, Nearest Neighbors, and Best Z-Score (an algorithm of my own design that is a slight variant of the Na¨ıve Bayes algorithm)
https://users.soe.ucsc.edu/~karplus/abe/Science_Fair_2012_report.pdf
Here, breast cancer and colorectal cancer have been considered and the algorithms that performed best (Best Z-Score and Nearest Neighbors) used all features in classifying a sample. Decision Tree used only 13 features for classifying a sample and gave mediocre results. Majority did not look at any features and did worst. All algorithms except Decision Tree were fast to train and test. Decision Tree was slow, because it had to look at each feature in turn, calculating the information gain of every possible choice of cutpoint.
My Solution:-
Lung Image Database Consortium provides open access dataset for Lung Cancer Images.
Download it then apply any machine learning algorithm to classify images having tumor cells or not.
I attached a link for reference paper. They applied neural network to classify the images.
For coding part, use python "OpenCV" for image pre-processing and segmentation.
When it comes for classification part, use any machine learning libraries (tensorflow, keras, torch, scikit-learn... much more) as you are compatible to work with and perform classification using any better outperforming algorithms as you wish.
That's it..
Link for Reference Journal
There is only 2 kinds of in-built prediction/classification models in AWS Machine Learning. Logistic regression and linear regression. Is it possible somehow in current version of AWS ML to:
1) Re-build this what is under the hood of logistic and linear regression models
2) Build your own models written in Python/R, implement them on AWS ML and run things such as neural nets, random forests, clustering alghoritms?
In AWS ML Developer Guide latest version I could not find answers on those questions explicite, that it is impossible to do so. Any tips?
A bit of background first...
Amazon Machine Learning can build models for three kinds of machine learning problems (binary/multiclass classification & regression). As you previously mentioned, the model selected and trained by the platform is abstracted from the user.
This "black box" implementation is perhaps the largest deficiency of Amazon's machine learning platform. You have no information on what model or how the model is trained (beyond, for ex. linear regression, stochastic gradient descent). Amazon is quite clear that this is intentional, as they want the platform to be built into an application, and not just used to train models for one. See the 47:25 and 53:30 mark of this Q&A.
So, to answer your questions:
You cannot see how the exactly models have been trained, for example what constants in a linear regression (although you may be able to deduce by testing the model). When you query the model, the response includes a field which indicates the algorithm used for that particular model (for ex. SGD). A full list of learning algorithms can be found here.
Unfortunately not. You cannot create your own models and import them into AWS Machine Learning, meaning that no decision trees or neural network models can run on the platform.
So I am creating my own classifiers using the OpenCV Machine Learning module for age estimation. I can train my classifiers but the training takes a long time so I would like to see some output (status classifier, iterations done etc.). Is this possible? I'm using ml::Boost, ml::LogisticalRegression and ml::RTrees all inheriting cv::StatModel. Just to be clear i'm not using the given application for recognizing objects in images (opencv_createsamples and opencv_traincascade). The documentation is very limited so it's very hard to find something in it.
Thanks
Looks like there's an open feature request for a "progress bar" to provide some rudimentary feedback... See https://github.com/Itseez/opencv/issues/4881. Personally, I gave up on using the OpenCV ML a while back. There are several high-quality tools available to build machine learning models. I've personally used Google's Tensorflow, but I've heard good things about Theano and Caffe as well.
How flexible or supportive is the Amazon Machine Learning platform for sentiment analysis and text analytics?
You can build a good machine learning model for sentiment analysis using Amazon ML.
Here is a link to a github project that is doing just that: https://github.com/awslabs/machine-learning-samples/tree/master/social-media
Since the Amazon ML supports supervised learning as well as text as input attribute, you need to get a sample of data that was tagged and build the model with it.
The tagging can be based on Mechanical Turk, like in the example above, or using interns ("the summer is coming") to do that tagging for you. The benefit of having your specific tagging is that you can put your logic into the model. For example, the difference between "The beer was cold" or "The steak was cold", where one is positive and one was negative, is something that a generic system will find hard to learn.
You can also try to play with some sample data, from the project above or from this Kaggle competition for sentiment analysis on movie reviews: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews. I used Amazon ML on that data set and got fairly good results rather easily and quickly.
Note that you can also use the Amazon ML to run real-time predictions based on the model that you are building, and you can use it to respond immediately to negative (or positive) input. See more here: http://docs.aws.amazon.com/machine-learning/latest/dg/interpreting_predictions.html
It is great for starting out. Highly recommend you explore this as an option. However, realize the limitations:
you'll want to build a pipeline because models are immutable--you have to build a new model to incorporate new training data (or new hyperparameters, for that matter)
you are drastically limited in the tweakability of your system
it only does supervised learning
the target variable can't be other text, only a number, boolean or categorical value
you can't export the model and import it into another system if you want--the model is a black box
Benefits:
you don't have to run any infrastructure
it integrates with AWS data sources well
the UX is nice
the algorithms are chosen for you, so you can quickly test and see if it is a fit for your problem space.