I am building a time series use case to automate the preprocessing and retraining tasks. First the data is preprocessed using numpy, pandas, statsmodels, etc., and then a machine learning algorithm is applied to make predictions.
The reason for using an inference pipeline is that it reuses the same preprocessing code for training and inference. I have checked the examples given by the AWS SageMaker team with Spark and scikit-learn. In both examples they use a scikit-learn container to fit and transform their preprocessing code. Do I also have to create a container, which seems unnecessary in my use case since I am not using any scikit-learn code?
Can someone give me a custom example of using these pipelines? Any help is appreciated!
Sources looked into:
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/scikit_learn_inference_pipeline
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/inference_pipeline_sparkml_blazingtext_dbpedia
Apologies for the late response.
Below is some documentation on inference pipelines:
https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html
https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipeline-real-time.html
Do I also have to create a container, which seems unnecessary in my use case since I am not using any scikit-learn code?
A container is an encapsulation of the environment your custom code needs in order to run properly. Based on the requirements listed above (numpy, pandas, statsmodels, etc., plus a machine learning algorithm), I would either create a container if you wish to isolate your dependencies, or modify an existing prebuilt SageMaker container, such as the scikit-learn one, and add your dependencies to that.
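If you go the prebuilt-container route, here is a rough sketch of what the training side could look like with the SageMaker Python SDK's SKLearn estimator. Everything here is illustrative: it assumes your preprocessing code lives in src/preprocessing.py with a src/requirements.txt listing numpy, pandas and statsmodels (the framework containers generally install a requirements.txt placed next to the entry point), and the framework version, instance type, role and S3 path are placeholders.

from sagemaker.sklearn.estimator import SKLearn

role = "arn:aws:iam::111122223333:role/MySageMakerRole"  # placeholder IAM role

# src/ is assumed to contain preprocessing.py plus a requirements.txt pinning
# numpy, pandas and statsmodels, so no custom container image is needed.
preprocessor = SKLearn(
    entry_point="preprocessing.py",
    source_dir="src",
    framework_version="0.23-1",   # example version
    instance_type="ml.m5.large",
    instance_count=1,
    role=role,
)

# Fit against the raw training data; the fitted preprocessing artifacts are
# what the inference pipeline reuses at prediction time.
preprocessor.fit({"train": "s3://your-bucket/raw-train-data"})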
Can someone give me a custom example of using these pipelines? Any help is appreciated!
Unfortunately, the two example notebooks referenced above are the only examples using inference pipelines. The biggest hurdle is most likely creating containers that fulfill the preprocessing and prediction tasks you need, and then combining the two in the inference pipeline.
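For the combining step itself, a rough sketch with the SageMaker Python SDK's PipelineModel would look something like the code below. It is not taken from either notebook; the names are placeholders, and preprocessor_model / predictor_model are assumed to come from estimators you have already trained (e.g. via estimator.create_model()).

from sagemaker.pipeline import PipelineModel

# The containers run in order, each passing its output to the next.
pipeline_model = PipelineModel(
    name="preprocess-then-predict",               # placeholder name
    role=role,                                    # IAM role ARN used above
    models=[preprocessor_model, predictor_model], # assumed to already exist
)

# One endpoint that performs preprocessing and prediction back to back.
pipeline_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="timeseries-inference-pipeline",  # placeholder name
)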
I've worked on two projects where DBT is used to transform data between the bronze (raw), silver (refined), and gold (serving) layers. I know that Cloud Dataprep can also be used to transform data between layers and prepare it for visualization and ML/AI.
So what are the differences between the two in terms of skills, budget, ease of use, and setup? And what are the use cases where one can't be substituted for the other?
The more direct analogue to what DBT does is a different GCP service called Dataform. Both of these services can be used to execute version-controlled, templated SQL queries to transform data in stages. To use them you need a good understanding of your data, so that you know which transformations are appropriate.
My understanding is that Dataprep is a fully fledged data exploration and manipulation tool; it's more for working with data that you don't yet understand and transforming it into a usable form.
I wish to modify an existing model implementation in order to add an additional upsampling layer to a semantic segmentation algorithm that has previously been implemented in AWS.
It appears that SageMaker refers to this repo, and I'm hoping to modify the DeepLab model to add a final upsampling layer at a higher resolution than the initial input layer, in order to boost the resolution of the output image (i.e., statistically downscale the original imagery).
(This technique has been demonstrated with UNET architectures.)
SageMaker built-in algorithms are designed to be used with the SageMaker prebuilt algorithm containers, so their architectures cannot be modified.
If you would like full flexibility over the architecture, I would suggest building and training your own model using one of the framework containers, such as MXNet.
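As a rough sketch of that route (none of this comes from the built-in algorithm), the MXNet estimator in the SageMaker Python SDK trains a script you fully control; train.py is assumed to define your modified DeepLab-style network with the extra upsampling layer, and the version numbers, instance type, role and S3 path are placeholders.

from sagemaker.mxnet import MXNet

role = "arn:aws:iam::111122223333:role/MySageMakerRole"  # placeholder IAM role

# train.py is assumed to build your own network (e.g. a DeepLab variant with an
# extra upsampling layer) and save it under the path given by SM_MODEL_DIR.
estimator = MXNet(
    entry_point="train.py",
    role=role,
    framework_version="1.6.0",      # example version
    py_version="py3",
    instance_type="ml.p3.2xlarge",  # GPU instance for segmentation training
    instance_count=1,
    hyperparameters={"epochs": 20, "learning_rate": 0.001},  # placeholders
)

estimator.fit({"train": "s3://your-bucket/segmentation-train-data"})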
I am looking to recreate a random forest model built locally and deploy it through SageMaker. The model is very basic, but for comparison I would like to use the same one in SageMaker. I don't see random forest among SageMaker's built-in algorithms (which seems weird) - is my only option to go the route of deploying my own custom model? I'm still learning about containers, and it seems like a lot of work for something that is just a simple RandomForestClassifier() call locally. I just want to baseline against the out-of-the-box random forest model and show that it works the same when deployed through AWS SageMaker.
edit 03/30/2020: adding a link to the SageMaker Sklearn random forest demo
In SageMaker you have 3 options to write scientific code:
Built-in algorithms
Open-source pre-written containers (available for sklearn, tensorflow, pytorch, mxnet, chainer; Keras can be written in the tensorflow and mxnet containers)
Bring your own container (for R for example)
At the time of writing this post there is no random forest classifier or regressor in the built-in library. There is an algorithm called Random Cut Forest in the built-in library, but it is an unsupervised algorithm for anomaly detection - a different use case from the scikit-learn random forest used in a supervised fashion (also answered on StackOverflow here). But it is easy to use the open-source pre-written scikit-learn container to implement your own. There is a demo showing how to use Sklearn's random forest in SageMaker, with training orchestration both from the high-level SDK and from boto3. You can also use this other public sklearn-on-sagemaker demo and change the model. A benefit of the pre-written containers over the "Bring your own" option is that the Dockerfile is already written, and the web serving stack too.
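As a rough sketch of the script-mode pattern those demos follow (the file names, label column and hyperparameter below are placeholders, not code from the demos), the entry point you hand to the SKLearn estimator reads its input and output locations from environment variables set by the container, fits a scikit-learn random forest, and saves it with joblib:

# train.py - assumed entry point passed to the SageMaker SKLearn estimator
import argparse
import os

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # SageMaker injects these locations as environment variables.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--n-estimators", type=int, default=100)
    args = parser.parse_args()

    # Assumes a train.csv with the label in a column named "target".
    data = pd.read_csv(os.path.join(args.train, "train.csv"))
    X, y = data.drop(columns=["target"]), data["target"]

    model = RandomForestClassifier(n_estimators=args.n_estimators)
    model.fit(X, y)

    # The model saved under SM_MODEL_DIR is what gets deployed behind the endpoint.
    joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))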
Regarding your surprise that random forest is not featured in the built-in algos, the library and its 18 algos already cover a rich set of use cases. For example, for supervised learning over structured data (the usual use case for random forest), if you want to stick to the built-ins, then depending on your priorities (accuracy, inference latency, training scale, costs...) you can use SageMaker XGBoost (XGBoost has been winning tons of data-mining competitions - every winning team in the top 10 of the KDDCup 2015 used XGBoost according to the XGBoost paper - and scales well) or Linear Learner, which is extremely fast at inference and can be trained at scale, in mini-batch fashion over GPU(s). Factorization Machines (linear + 2nd-degree interactions, with weights being column-embedding dot products) and SageMaker kNN are other options. Also, things are not set in stone, and the list of built-in algorithms is being improved fast.
RandomForestClassifier is not supported out of the box with SageMaker, but XGBoost (gradient boosted trees) as well as DecisionTreeClassifier from scikit-learn are both supported. You can access scikit-learn's DecisionTreeClassifier() directly from the SageMaker SDK.
Here's a notebook demonstrating use of a DecisionTreeClassifier from SageMaker's built-in scikit-learn.
Deploying your own custom model via a Dockerfile is certainly possible as well (and can seem daunting at first, but isn't all that bad), but I agree that it wouldn't be ideal for a simple algorithm that's already included in SageMaker :)
Edit: I mixed up Random Forest and Random Cut Forest in the original answer, as discussed in the comments. Random Cut Forest algorithm docs for SageMaker are available here: https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html
Random Cut Forest (RCF) Jupyter notebook example: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/random_cut_forest/random_cut_forest.ipynb
When using Google Cloud ML to train models:
The official example https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/census/tensorflowcore/trainer/task.py uses hooks, is_client, MonitoredTrainingSession, and some other complexity.
Is this required for Cloud ML, or is using this example enough: https://github.com/amygdala/tensorflow-workshop/tree/master/workshop_sections/wide_n_deep?
The documentation is a bit limited in terms of best practices and optimisation. Will Cloud ML handle the client/worker mode, or do we need to set devices ourselves, e.g. with replica_device_setter and so on?
Cloud ML Engine is largely agnostic to how you write your TensorFlow programs. You provide a Python program, and the service executes it for you, providing it with some environment variables you can use to perform distributed training (if necessary), e.g. the task index.
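For illustration, handling those environment variables yourself boils down to parsing TF_CONFIG; this is only a sketch of the usual cluster/task layout, not code from the census sample:

import json
import os

# Cloud ML Engine sets TF_CONFIG as a JSON string describing the cluster
# and this particular replica's role in it.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))

cluster_spec = tf_config.get("cluster", {})  # e.g. {"master": [...], "worker": [...], "ps": [...]}
task = tf_config.get("task", {})             # e.g. {"type": "worker", "index": 0}
task_type = task.get("type", "master")
task_index = task.get("index", 0)

is_chief = task_type == "master"             # the replica that saves checkpoints, etc.
print("Running as %s #%d, chief=%s" % (task_type, task_index, is_chief))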
census/tensorflowcore demonstrates how to do things with the "core" TensorFlow library -- how to do everything "from scratch", including using replica_device_setter, MonitoredTrainingSession, etc. This may sometimes be necessary for ultimate flexibility, but can be tedious.
Alongside the census/tensorflowcore example, you'll also see a sample called census/estimator. This example is based on a higher-level library which, unfortunately, is in contrib and therefore does not yet have a fully stable API (expect lots of deprecation warnings, etc.). Expect it to stabilize in a future version of TensorFlow.
That particular library (known as Estimators) is a higher-level API that takes care of a lot of the dirty work for you. It will parse TF_CONFIG for you and set up the replica_device_setter, as well as handle the MonitoredTrainingSession and necessary hooks, while remaining fairly customizable.
This is the same library that the wide-and-deep example you pointed to is based on, and both are fully supported on the service.
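For a flavour of that higher-level style once it stabilized as tf.estimator (a self-contained sketch with a toy in-memory dataset; the model, sizes and paths are made up, not taken from either sample):

import numpy as np
import tensorflow as tf

# A toy in-memory dataset so the sketch is self-contained.
def input_fn():
    features = {"x": np.random.rand(64, 4).astype(np.float32)}
    labels = np.random.randint(0, 2, size=(64,)).astype(np.int32)
    return tf.data.Dataset.from_tensor_slices((features, labels)).repeat().batch(8)

feature_columns = [tf.feature_column.numeric_column("x", shape=(4,))]

# RunConfig reads TF_CONFIG on its own, so the same code runs locally and distributed.
estimator = tf.estimator.DNNClassifier(
    hidden_units=[16, 8],
    feature_columns=feature_columns,
    model_dir="/tmp/toy_model",
    config=tf.estimator.RunConfig(save_checkpoints_steps=100),
)

train_spec = tf.estimator.TrainSpec(input_fn=input_fn, max_steps=500)
eval_spec = tf.estimator.EvalSpec(input_fn=input_fn, steps=10)

# Handles MonitoredTrainingSession, hooks, and device placement internally.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)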
How do you handle mapping Jenkins jobs to your build process, and have you been able to build in cascading configurations or inheritance?
For any given build I'll have at least three jobs (standard continuous integration/nightly, security scan, coverage) and then some downstream integration testing jobs. The configuration slicer plugin handles some aspects across jobs, but each job is still very much its own individual entity with no relationship to the other jobs in its group.
I recently saw QuickBuild, and it has job inheritance where a parent job can define a standard group of steps and its children can override and specialize them. With Jenkins, I have copies of jobs, which is fine until I need to change something. With QuickBuild the relationship between jobs allows me to spread my changes with little effort.
I've been trying to figure out how to handle this in Jenkins. I could use the parameterized build trigger plugin to allow jobs to call others and override aspects. I'd then harvest the data from the called jobs into the caller. I suspect I'll run into a series of problems where there are aspects I can't override, which will force me to implement Jenkins functionality in my own scripts, thus making Jenkins less useful.
How do you handle complexity in your build jobs in Jenkins? Have you heard of any serious problems with QuickBuild?
I would like to point you to a plugin that my team has developed and only recently published as open source.
It implements full "Inheritance between jobs".
Here are some further links that might help you:
Presentation: https://www.youtube.com/watch?v=wYi3JgyN7Xg
Wiki: https://wiki.jenkins-ci.org/display/JENKINS/inheritance-plugin
Releases: http://repo.jenkins-ci.org/releases/hudson/plugins/project-inheritance/
I had pretty much the same problem. We have a set of jobs that needs to run for our trunk as well as at least two branches. The branches represent our versions, and a new branch is created every few months. Creating new jobs by hand for this is no solution, so I checked out some possibilities.
One possibility is to use the template plugin. This lets you create a hierarchy of jobs of a kind. It provides inheritance for builders, publishers, and SCM settings. It might work for some; for me it was not enough.
The second thing I checked out was the Ant script for job cloning, and its sibling the Bash script. These are truly great. The idea is to have the script create a new job for you, copying all settings from a template job and making the changes you need. As this is a script, it is very flexible and you can do a lot with it. The only drawback is that this does not result in a real hierarchy, so changes in the template job will not be reflected in jobs already cloned, only in jobs created going forward.
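The same cloning idea can also be scripted against Jenkins's REST API rather than with the Ant or Bash scripts; here is a rough Python sketch (the URL, credentials and job names are placeholders, and depending on your security settings you may also need to send a CSRF crumb):

import requests

JENKINS = "https://jenkins.example.com"  # placeholder Jenkins URL
AUTH = ("user", "api-token")             # placeholder credentials

def clone_job(template, new_name):
    # Fetch the template job's full configuration...
    config = requests.get(
        "%s/job/%s/config.xml" % (JENKINS, template), auth=AUTH
    )
    config.raise_for_status()

    # ...and create a new job from that XML. Any per-branch tweaks could be
    # made to config.text before posting (e.g. replacing the branch name).
    resp = requests.post(
        "%s/createItem" % JENKINS,
        params={"name": new_name},
        data=config.text,
        headers={"Content-Type": "application/xml"},
        auth=AUTH,
    )
    resp.raise_for_status()

clone_job("template-build", "release-1.2-build")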
Looking at the drawbacks and virtues of those two solutions, a combination of both might work best: create a template project with some basic settings that hold true for all jobs, and then use a Bash or Ant script to create jobs based on that template.
Hope that helps.
I was asked what our eventual solution to the problem was... After many months of fighting with our purchasing system, we spent around $4,000 US on QuickBuild. In about 2-3 months we had a templated build system in place and were very happy with it. Before I left the company, we had several product groups in the system and were automating the release process as well.
QuickBuild was a great product. It should be in the $40k class but it's priced at much less. While I'm sure Jenkins could do this, it would be a bit of a kludge, whereas QuickBuild has this functionality baked in. I've implemented complex behaviors on top of products before (e.g. merge tracking in SVN 1.0) and regretted it. QuickBuild was reasonably priced and provided a solid base for our build and test systems.
At present I'm at a firm using Bamboo, and I hope its new feature-branch feature will provide much of what QuickBuild can do.
The EZ Templates plugin allows you to use any job as a template for other jobs. It is really awesome. All you need to do is set the base job as a template.
* Usually you would also disable the base job (like an "abstract class").
Then create a new job, set it to use the base job as its template, and save.
Now edit the new job - it will include everything! (And you can override existing configurations.)
Note: there's another plugin, Template Project, for configuration templates, but it hasn't been updated recently (last commit in 2016).
We use QuickBuild and it seems to work great for most things. I have even been able to use its APIs to write custom plugins. One area where QuickBuild is lacking is Sonar integration. The Sonar team has a Jenkins plugin but not one for QuickBuild.
Given that the goal is DRY (don't repeat yourself), I presently favor this approach:
Use a Jenkins shared library, with Jenkins Pipeline Unit to support TDD
Use Docker images, with Groovy/Python or whatever language you like, to execute complex actions requiring APIs etc.
Keep the actual job pipeline very spartan (basically just pulling build params and passing them to functions in the shared library, which may use Docker images to do the work)
This works really well and eliminates the DRY issues around complex build jobs.
Shared Pipeline Docker Code Example - vars/releasePipeline.groovy
/**
 * Run image
 * @param closure closure to run within image
 * @return result from execution
 */
def runRelengPipelineEphemeralDocker(closure) {
    def result
    artifactory.withArtifactoryEnvAuth {
        docker.withRegistry("https://${getDockerRegistry()}", 'docker-creds-id') {
            docker.image(getReleasePipelineImage()).inside {
                result = closure()
            }
        }
    }
    return result
}
Usage example
library 'my-shared-jenkins-library'

releasePipeline.runRelengPipelineEphemeralDocker {
    println "Running ${pythonScript}"
    def command = "${pythonInterpreter} -u ${pythonScript} --cluster=${options.clusterName}"
    sh command
}