How to choose a specific algorithm on Amazon Machine Learning or another cloud platform? - amazon-web-services

Is it possible to run specific machine learning algorithms on Amazon Machine Learning? To me it seems to work like a black box: you put data in and get some performance out, without algorithm selection, parameter tuning, etc.
By the way, is it possible to run a specific machine learning algorithm somewhere in the cloud?

From the Amazon Machine Learning FAQ here, they state that the algorithm used by this service is logistic regression.
If you want something more sophisticated, you'd have to do more work yourself, such as setting up the necessary packages on an EC2 instance or EMR cluster.
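For a sense of what "doing the work yourself" looks like, here is a minimal sketch of picking your own algorithm on a plain EC2 box with scikit-learn (the CSV path, label column, and hyperparameters are placeholders, not anything Amazon ML prescribes):

    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    # Load your own data and pick whatever algorithm you want --
    # exactly the flexibility the managed service hides from you.
    df = pd.read_csv("data.csv")                      # placeholder path
    X, y = df.drop(columns=["label"]), df["label"]    # placeholder label column
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05)
    model.fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))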

Related

Clarification on Default SageMaker Distribution Strategy

Context: when using SageMaker distributed training, let's say I do not provide any distribution parameter (I keep it at the default) but set instance_count to 2 in the estimator (which could be any deep-learning estimator, e.g., PyTorch).
In this scenario, would any distributed training take place? If so, what strategy is used by default?
NOTE: I can see that both instances' GPUs are actively used, but I am wondering what sort of distributed training takes place by default.
If you're using custom code (a custom Docker image, or custom code in a Framework container), the answer is no. Unless you write distributed code yourself (Horovod, PyTorch DDP, MPI, ...), SageMaker will not distribute things for you. It will launch the same Docker or Python code N times, once per instance. Think of the SageMaker Training API as a whiteboard: it can create multiple connected and configured machines for you, but the code is still yours to write. The SageMaker Distributed Training Libraries can make distributed code much easier to write, though.
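As an illustration, this is roughly how opting in to distributed training looks with the SageMaker Python SDK's PyTorch estimator, as opposed to just setting instance_count=2 and getting two independent copies of your script. This is a sketch only; the entry point, role, framework/Python versions, instance type, and the exact distribution option names depend on your SDK version and are placeholders here.

    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",                                # placeholder training script
        role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder IAM role
        framework_version="1.13",
        py_version="py39",
        instance_count=2,
        instance_type="ml.p3.8xlarge",
        # Without a distribution setting, SageMaker simply runs train.py on
        # each of the two instances independently. Opting in looks like this:
        distribution={"pytorchddp": {"enabled": True}},
    )
    estimator.fit({"training": "s3://my-bucket/train/"})       # placeholder S3 path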
If you're using a built-in algorithm, the answer is: it depends. Some SageMaker built-in algorithms are natively multi-machine, like SageMaker XGBoost or SageMaker Random Cut Forest.
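For the built-in case, the sketch below shows SageMaker's XGBoost run on two instances; the algorithm container itself decides how to share work across the machines. The role, S3 path, instance type, and algorithm version are placeholders.

    import sagemaker
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator

    session = sagemaker.Session()
    image = image_uris.retrieve("xgboost", region=session.boto_region_name, version="1.5-1")

    xgb = Estimator(
        image_uri=image,
        role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder IAM role
        instance_count=2,            # the built-in algorithm distributes across both instances
        instance_type="ml.m5.xlarge",
        sagemaker_session=session,
    )
    xgb.set_hyperparameters(objective="binary:logistic", num_round=100)
    xgb.fit({"train": "s3://my-bucket/train/"})                # placeholder S3 path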

Combinatorial Optimization problems on Docker or AWS

I am attacking a combinatorial optimization problem similar to the multi-knapsack problem. The problem has an optimal solution, and I prefer not to settle for an approximate solution.
Are there any recommended tutorials on quickly prototyping and deploying combinatorial optimization solutions (for senior software engineers who are also big-data newbies)? I want to move quickly from prototype to deployment on a Docker cluster or AWS.
My background is in distributed systems (with a focus on .NET, Java, Kafka, Docker containers, etc.), so I'm typically inclined to solve complex problems by parallel processing across a cluster of machines (by scaling on a Docker cluster or AWS). However, this particular problem cannot be solved by brute force, as the problem space is too large (roughly 100^1000 combinations are possible).
I have limited experience with "big data", but I'm studying up on knapsack solvers, genetic algorithms, reinforcement learning, and other AI/ML approaches. Given my limited exposure in this area, how would you recommend I tackle a problem like this?
I tend to favor leveraging existing frameworks/libraries as much as possible. Is that a good idea, or would you recommend using Accord.NET, ML.NET, or some other library to build a custom model?
If existing frameworks are the way to go, any particular favorites? TensorFlow? Any thoughts on Google OR-Tools: https://developers.google.com/optimization/ Anything in the AWS space?
Any good tutorials, videos, or podcasts that can get me prototyping quickly? (Keeping in mind my goal of deploying and validating the model on a Docker cluster.)
Thank you for any help and guidance!
The Cloud Balancing problem in OptaPlanner (open source, Java) is a multi-knapsack problem, and there's a tutorial for it in the user guide. Many users run OptaPlanner implementations on Docker (a normal OpenJDK 8 image) and AWS. Here's an employee rostering implementation that is deployed to OpenShift Dedicated (which generates a Docker image that runs on AWS); it exposes a REST API (which is even documented with Swagger).
Thanks to all for your insight above. I'm having a look at OptaPlanner and Google OR-Tools, as well as a few other solvers.
To follow up on this question: if I were to relax the constraint that I want the optimal answer and allow for "approximate" solutions, would this change your guidance or recommended tool set (libraries/frameworks) in any way?
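To make the OR-Tools route mentioned above concrete, here is a minimal CP-SAT sketch of a multi-knapsack model (the item values, weights, and bin capacities are made-up placeholders). It also speaks to the follow-up question: CP-SAT proves optimality when it can, and with a time limit it returns the best solution found so far, i.e. an approximate answer.

    from ortools.sat.python import cp_model

    # Toy data -- replace with your real items and bins.
    values = [10, 30, 25, 50, 35]
    weights = [5, 10, 8, 15, 12]
    capacities = [20, 18]          # one capacity per knapsack/bin

    model = cp_model.CpModel()
    n_items, n_bins = len(values), len(capacities)

    # x[i, b] == 1 if item i is packed into bin b.
    x = {(i, b): model.NewBoolVar(f"x_{i}_{b}")
         for i in range(n_items) for b in range(n_bins)}

    # Each item goes into at most one bin.
    for i in range(n_items):
        model.Add(sum(x[i, b] for b in range(n_bins)) <= 1)

    # Respect each bin's capacity.
    for b in range(n_bins):
        model.Add(sum(weights[i] * x[i, b] for i in range(n_items)) <= capacities[b])

    # Maximize the total packed value.
    model.Maximize(sum(values[i] * x[i, b]
                       for i in range(n_items) for b in range(n_bins)))

    solver = cp_model.CpSolver()
    solver.parameters.max_time_in_seconds = 30  # trade the optimality proof for speed if needed
    status = solver.Solve(model)
    print("best value:", solver.ObjectiveValue(), "proved optimal:", status == cp_model.OPTIMAL)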

Deploy Hyperledger on AWS - production setup

My company is currently evaluating Hyperledger (Fabric) and we're using it for our POC. It looks very promising and we're targeting a rollout to production in the next few months.
We're targeting AWS as our production environment.
However, we're struggling to find good tutorials/practices/recommendations about operating a Hyperledger network in such an environment.
I'm aware that Cello aims to solve/ease deploying and monitoring a Hyperledger network, but I also read that it's not production ready yet. The question is, should we even consider looking at Cello at this point?
If not, what are our alternatives? Docker Swarm, Kubernetes?
I also didn't find information about recommended instance types. I understand this is application and AWS specific, but what are the minimal system requirements
(memory, CPU, and network) for, say, a 'peer' node? (Our application is not network intensive, and only a few transactions will be submitted per day.)
Another question is where to create those instances on AWS from a geographical and decentralization point of view. Does it make sense to create all of them in the same region, or must we create instances running in different regions?
Thanks a lot.
Igor.
Yes, look at Cello; if nothing else, it will help you see the AWS deployment model.
There's really nothing special to it:
design the desired system (peers, orderer, gateways, etc.),
then decide how many EC2 instances you need to support it.
As for WHERE (region), it depends on where the connecting application is and what kind of fault tolerance you need for your business model.
One of the businesses I am working with wants a minimum of 99.99999% availability, so multi-region is critical. It's just another EC2 instance with sockets open to different hosts.
AWS doesn't provide much in terms of support for Hyperledger. They have some templates that let you set up the VMs initially, but that's something you can do yourself as well.
You are right, the documentation is very light and most of the time confusing. I got to the point where I can start from scratch with a brand-new VM, get everything ready, deploy my own network definition and chaincode, and have scripts to do all of that.
IBM Cloud has much better support for Hyperledger, however. You can design your network visually, download your connection profiles, deploy and instantiate chaincode, create and join channels, and handle certificates: pretty much everything you need to run and support such a network. It's light years ahead of AWS. They even have a full CI/CD pipeline that you could replicate for your own project. If you look at their Marbles demo, you'll see what I mean.
Cello is definitely worth looking at, with the caveat that it is in incubation, meaning it is not production ready and not really useful until it becomes a fully fledged product.

How to make my Datalab machine learning run faster

I have some data: 3.2 million entries in a CSV file. I'm trying to use a CNN estimator in TensorFlow to train the model, but it's very slow. Every time I run the script, it gets stuck and the web page (localhost) just refuses to respond anymore. Any recommendations? (I've tried with 22 CPUs and I can't increase that any further.)
Can I just run it in the background, e.g. python xxx.py & on the command line, to keep the process going, and then come back to check after some time?
Google offers serverless machine learning with TensorFlow for precisely this reason. It is called Cloud ML Engine. Your workflow would basically look like this:
Develop the program to train your neural network on a small dataset that can fit in memory (iron out the bugs, make sure it works the way you want)
Upload your full data set to the cloud (Google Cloud Storage, BigQuery, etc.) (documentation reference: training steps)
Submit a package containing your training program to Cloud ML Engine (it will point to the location of your full data set in the cloud); see the packaging sketch after this list (documentation reference: packaging the trainer)
Start a training job in the cloud; this is serverless, so it will take care of scaling to as many machines as necessary, without you having to deal with setting up a cluster, etc. (documentation reference: submitting training jobs).
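A minimal packaging sketch for the "submit a package" step (the package and module names are placeholders; the packaging documentation describes the exact layout the service expects):

    # setup.py at the root of the trainer package you submit as the training job.
    from setuptools import find_packages, setup

    setup(
        name="trainer",                    # placeholder package name
        version="0.1",
        packages=find_packages(),          # picks up trainer/task.py, trainer/model.py, ...
        install_requires=["tensorflow"],   # pin the version you developed against
        description="Training application submitted as a cloud training job",
    )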
You can use this workflow to train neural networks on massive data sets - particularly useful for image recognition.
If this is a little too much information, or if this is part of a workflow that you'll be doing a lot and you want to get a stronger handle on it, Coursera offers a course on Serverless Machine Learning with TensorFlow. (I have taken it, and was really impressed with the quality of the Google Cloud offerings on Coursera.)
I am sorry for answering even though I am completely ignorant of what Datalab is, but have you tried batching?
I am not sure whether it is possible in this scenario, but maybe feed in only 10,000 entries at a time, in enough batches that eventually all entries have been processed?
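Batching is possible with TensorFlow's tf.data API; here is a minimal sketch (file name, label column, and batch size are placeholders) that streams the CSV in chunks instead of loading all 3.2 million rows at once:

    import tensorflow as tf

    def input_fn():
        # Stream the CSV in batches rather than reading it all into memory.
        return tf.data.experimental.make_csv_dataset(
            "data.csv",          # placeholder path
            batch_size=10_000,   # placeholder batch size
            label_name="label",  # placeholder label column
            num_epochs=1,
            shuffle=True,
        )

    # An Estimator can then consume this directly:
    # estimator.train(input_fn=input_fn)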

Apache Nutch on Amazon Web Services or Locally

I want to learn Apache Nutch and I have an account at Amazon Web Services (AWS). I have three machines at AWS: one is micro sized, one is small, and one is medium. I want to start with the small one and install Nutch, Hadoop, and HBase on it. My machines run CentOS 6.
There is a related question here, but it is not what I am asking: Nutch 2.1 (HBase, SOLR) with Amazon Web Services
I want to know which approach is better. I want to install everything on the small machine and then add the micro one later. On the other hand, I don't have any experience with Nutch; maybe I should work locally first, or is it possible to use both my local machine and AWS (and does that cost more, e.g. is copying data out of AWS charged)?
When I want to implement a wrapper for Nutch, should I install it locally (to have the source code) and run it on AWS?
Any ideas?
It sounds like you're facing a steep learning curve.
For one, you admit that you're just learning Nutch, so I would recommend you install CentOS on a physical box at home and play around there.
On the other hand, you are pondering the use of a micro AWS instance, which will not be useful for running a CPU/memory-intensive application like Nutch. Read about AWS micro instances here.
My suggestion is to stick to a single physical box solution at home and work on scripting your solution before moving on to an AWS instance.