Deploying a scikit-learn pipeline on IBM DSX - data-science-experience

How do I deploy a trained scikit-learn pipeline on IBM Data Science Experience (DSX)? Can I do that from a Jupyter notebook?

Deployment of scikit-learn pipelines will be available in Watson Machine Learning. Today it is in closed beta and only supports deployment of Spark ML pipelines, but we are adding support for scikit-learn before summer.
More information here: https://console.ng.bluemix.net/docs/#services/PredictiveModeling/index.html

Yesterday I was at IBM DevConnect Hyderabad 2017, where I used Python Jupyter notebooks on DSX for the first time; they are available by default as the standard way of using Python on the IBM DSX platform.
By default, all notebooks come pre-installed with the 20+ common libraries needed for data science work, such as scikit-learn, numpy, matplotlib, and pandas. You can open the notebook info and check the environment to get the full list of installed modules and their versions.
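If you prefer to check from code, here is a minimal sketch (run in a notebook cell) that prints the versions of a few of those libraries; the list of modules is just an example:

import sklearn, numpy, pandas, matplotlib
for module in (sklearn, numpy, pandas, matplotlib):
    print(module.__name__, module.__version__)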
You can try some of the samples from Github here to get started: github.com/IBMDevConnect17/DSX_HandsON
There is also a lot of learning material available, including this four-course learning path: https://www.ibm.com/developerworks/library/ba-1611data-science-fundamentals-learning-path-bdu-trs/index.html

Here is a tutorial on deploying scikit-learn pipelines to Watson Machine Learning (WML): https://datascience.ibm.com/exchange/public/entry/view/acba02c8efecc5218b1d65ba9b8a5bbb
WML supports deployment of scikit-learn v0.17; find more information here: https://console.bluemix.net/docs/services/PredictiveModeling/pm_service_supported_frameworks.html#supported-machine-learning-frameworks
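For reference, a minimal sketch of the kind of trained scikit-learn pipeline that tutorial deploys (the dataset and model here are illustrative placeholders; the actual WML upload and deployment steps are covered in the linked tutorial):

import pickle
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

data = load_iris()
X, y = data.data, data.target

# A simple two-step pipeline: scale the features, then fit a classifier
pipeline = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipeline.fit(X, y)

# Serialize the trained pipeline; this artifact is what gets handed to WML
with open("model.pkl", "wb") as f:
    pickle.dump(pipeline, f)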

You can install packages from a notebook by using the !pip command with the --user flag.
For example:
!pip install --user --upgrade scikit-learn

Related

Debug Pyspark on EMR using Pycharm

Does anyone have experience debugging PySpark running on AWS EMR using PyCharm?
I couldn't find any good guides or existing threads regarding this.
I know how to debug Scala Spark with IntelliJ against EMR, but I have no experience doing this with Python.
I am aware that I can connect to the remote server (the EMR master) over SSH, and maybe with the Professional edition I can use the remote deployment feature to run my Spark job from PyCharm, but I'm not sure it will work. I want to know if anyone has tried it before I go with PyCharm Pro.
I got to debug PySpark on EMR the way I wanted.
Please look at this Medium blog post, which describes how to do so:
https://medium.com/explorium-ai/debugging-pyspark-with-pycharm-and-aws-emr-d50f90077c92
It describes how to use the PyCharm Pro remote deployment feature to debug your PySpark program.

Versions of Python & Spark to work with VS Code Notebooks

I'm developing scripts for AWS Glue and trying to mimic the development environment as closely as possible to their specs here. Since it's a bit costly to run a notebook server/development endpoint, I set everything up on my local machine instead and develop the scripts in VS Code notebooks because of how convenient they are.
There are some problems with the notebook setup due to incompatible versions of the installed Python and Spark.
For Python, I went through a rough cleanup, and its version is 3.8.3 now.
For Spark, I use a manual install of version 2.4.3, since I plan to use Scala alongside it at a later time. I install the findspark package to load that version, roughly as sketched below.
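Roughly what I do to pick up that install (the Spark path below is a placeholder for wherever Spark 2.4.3 is unpacked on my machine):

import findspark
# Point findspark at the manually installed Spark 2.4.3 (placeholder path)
findspark.init("/opt/spark-2.4.3-bin-hadoop2.7")

import pyspark
print(pyspark.__version__)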
And it doesn't work! The error was: TypeError: an integer is required (got type bytes)
I've searched around, and people said to downgrade to Python 3.7 using pyenv. I installed 3.7.7 but still had the same error.
As a last resort, I tried pip install pyspark. That gives Spark 3.0.0, which works fine, but it is not the version I was aiming for.
I hope someone has experience with this.
A better approach would be to install the Glue dependencies in a Docker container and then connect to that container from VS Code, to mimic the exact Glue local development environment.
I've written a blog post about this if you'd like to refer to it:
https://towardsdatascience.com/develop-glue-jobs-locally-using-docker-containers-bffc9d95bd1

Trying Dask on AWS

I am a scientist who is exploring the use of Dask on Amazon Web Services. I have some experience with Dask, but none with AWS. I have a few large custom task graphs to execute, and a few colleagues who may want to do the same if I can show them how. I believe that I should be using Kubernetes with Helm because I fall into the "Try out Dask for the first time on a cloud-based system like Amazon, Google, or Microsoft Azure" category.
I also fall into the "Dynamically create a personal and ephemeral deployment for interactive use" category. Should I be trying native Dask-Kubernetes instead of Helm? It seems simpler, but it's hard to judge the trade-offs.
In either case, how do you provide Dask workers a uniform environment that includes your own Python packages (not on any package index)? The solution I've found suggests that packages need to be on a pip or conda index.
Thanks for any help!
Use Helm or Dask-Kubernetes?
You can use either. Generally starting with Helm is simpler.
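If you do want to try the native Dask-Kubernetes route, here is a minimal sketch following the classic dask-kubernetes docs; note that the API has changed across versions, and worker-spec.yml is a worker pod specification you supply yourself:

from dask.distributed import Client
from dask_kubernetes import KubeCluster

# Build an ephemeral cluster from your own worker pod specification
cluster = KubeCluster.from_yaml("worker-spec.yml")
cluster.scale(10)  # request ten workers

client = Client(cluster)  # connect Dask to the new cluster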
How to include custom packages
You can install custom software using pip or conda. It doesn't need to be on PyPI or the default Anaconda channel; you can point pip or conda at other sources. Here is an example of installing software with pip directly from GitHub:
pip install git+https://github.com/username/repository@branch
For small custom files you can also use the Client.upload_file method.
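A minimal sketch of that approach (the scheduler address is a placeholder for your own cluster):

from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address
# Ship a local module to every worker currently in the cluster
client.upload_file("my_custom_module.py")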

Hadoop : how to start my first project

I'm starting to work with Hadoop but I don't know where or how to begin. I'm working on OS X and followed a tutorial to install Hadoop; the installation works, but now I don't know what to do.
Is there an IDE I should install (maybe Eclipse)? I found some code samples, but nothing works, and I don't know what I need to add to my project, etc.
Can you give me some information or point me to a complete tutorial?
If you want to learn the Hadoop framework, I recommend starting by installing the Cloudera QuickStart virtual machine on your OS X system, provided your system has all the prerequisites:
http://www.cloudera.com/downloads/quickstart_vms/5-8.html
Cloudera QuickStart virtual machines include everything you need to try Hadoop, MapReduce, Hive, Pig, Impala, etc., as well as the Eclipse IDE.
The above will work well if you are interested in pursuing a career as a Hadoop developer; if you are more interested in Hadoop systems administration, then follow Alvaro's recommendation below.
There is also an Intro to Hadoop and MapReduce course on Udacity that would be a good start for beginners:
https://www.udacity.com/course/intro-to-hadoop-and-mapreduce--ud617
Hadoop: The Definitive Guide by Tom White is a great, comprehensive book to refer to: http://shop.oreilly.com/product/0636920033448.do
I would recommend installing the Cloudera pseudo-distributed example on a virtual machine running the latest Ubuntu LTS. That way you don't mess up your laptop, and the environment is closer to anything you would do in production. Have you checked vagrantup.com?
Once you have it installed, you can either work directly in Java or choose a framework like mrjob (Python) to run some custom programs; see the small example below.
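For instance, here is a minimal mrjob word-count job, the classic first MapReduce program (save it as word_count.py and run it with python word_count.py input.txt; it runs locally by default and on Hadoop with -r hadoop):

from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # Sum the counts emitted for each word
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()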
Best,
Alvaro.

Rethinkdb chef solo cookbook

Is there any RethinkDB chef-solo cookbook that allows one to install the latest RethinkDB on Ubuntu 14.04 / AWS?
I tried a couple of options, but they didn't help:
https://github.com/vFense/rethinkdb-chef - how to install latest version?
https://github.com/sprij/rethinkdb-cookbook.git - source compilation takes hours
I would appreciate any help regarding this.
Thanks
Try the cookbook that is available from the community repository first:
https://supermarket.chef.io/cookbooks/rethinkdb
It claims to be integration-tested on Ubuntu. If it doesn't work under chef-solo, then I'd advise you to switch to local-mode chef-client instead.
https://www.chef.io/blog/2013/10/31/chef-client-z-from-zero-to-chef-in-8-5-seconds/
PS
Also check out Berkshelf for managing cookbook dependencies. It's a standard tool in the Chef DK.
I updated rethinkdb-chef to work with the latest version of RethinkDB and removed the network portion of the .kitchen.yml file. I validated that this works on CentOS 6 and Ubuntu 14.04.
I still need to write tests as well as documentation. As per Mark's answer, try the community-supported version first. I created this cookbook so that I can customize it to my needs with vFense.