I would like to setup a Ray cluster to use Rtune over 4 gpus on AWS. But each gpu belongs to a different member of our team. I have scoured available resources for an answer and found nothing. Help ?
In order to start a Ray cluster using instances that span multiple AWS accounts, you'll need to make sure that the AWS instances can communicate with each other over the relevant ports. To enable that, you will need to modify the AWS security groups for the instances (though be sure not to open up the ports to the whole world).
You can choose which ports are needed via the arguments --redis-port, --redis-shard-ports, --object-manager-port, and --node-manager-port to ray start on the head node and just --object-manager-port, and --node-manager-port on the non-head nodes. See the relevant documentation.
However, what you're trying to do sounds somewhat complex. It'd be much easier to use a single account if possible, in which case you could use the Ray autoscaler.
I have this issue: Two or more nodes on cluster and 5 deployment replicas, and I have to use one volume for them. For example I will add one file to first pod and can take it from another, and if my first pod will deleted, I still can take this data from second pod.
I tried kubernetes volumes types like hostPath, but it's didn't work.
I tried NFS but it didn't work. Because we have many instructions, but each of them not full and not correct! Can you please write full instruction, like for junior, ok - like for idiots? I never use NFS, Gluster, but in kubernetes docs information is too short about how to install it and connect to kubernetes.
Now I try using AWS EFS and kubernetes and the same story, a lot of general information, individual instructions, but not consistent. Why, it's so hard for you, explain how it works? I am in fire now, kubernetes documentation about base elements like deployment, services - ok, but about integrations, not basic volumes - awfully!
Maybe some one can help me with it?
AWS part: https://aws.amazon.com/getting-started/tutorials/create-network-file-system/
KUBERNETES part: https://github.com/kubernetes-incubator/external-storage/blob/master/aws/efs/deploy/manifest.yaml
Thanks for help.
We have a Kubernetes cluster set up on AWS VPC with 10+ nodes. We encountered an incident where one node was not accessible to others and vice-versa for ~10 minutes. Finding this out took quite a lot of time.
Is there a tool for Kubernetes or AWS to detect these kind of network problems? Maybe something like a Daemon Set where each pod pings the others in the network and logs it when the ping fails.
If you are mostly interested in being alerted when such problem happens, I would set up monitoring system and hook it up with something like alertmanager. For collecting metrics, you can look at open source project such as Prometheus. Once you set this up, it is really easy to integrate it with Grafana (for dashboard) and alertmanager (for alerting based on rules you specify in Prometheus). And they are all open source projects.
https://prometheus.io/
I create a Kubernetes (v1.6.1) cluster on AWS with one master and two slave nodes, then I spin up mysql instance using helm and deploy a simple Django web-app that queries latest five rows from the database and displays it. For my web service I specify 'type: LoadBalancer' which creates an ELB on AWS.
If I use 'weave' networking and scale my web-app to at least two replicas, then I begin experiencing inconsistent response time - most of the time it is reasonable (like 0.1-0.2 s), but 20-40% requests take significantly longer (3-5 s, sometimes even more than 15 s). However, if I switch to 'flannel' networking, everything works fast, even with 20-30 replicas of the web-app. All machines have enough resources, so that's not the problem.
I tried debugging to find out what's causing the delay, and the best explanation I have is that AWS ELB doesn't work well with 'weave'. Has anyone experienced similar issues? What could be the problem? Please let me know if I should provide some relevant information.
P.S. I'm new to using Kubernetes.
When writing a web app with Django or such, what's the best way to connect to dynamic EC2 instances, such as a cluster of Redis or memcache instances? IP addresses change between reboots, etc. Elastic IPs are limited to 5 by default - what are some other options for auto-discovering/auto-updating which machines are available?
Late answer, but use Boto: http://boto.cloudhackers.com/en/latest/index.html
You can use security groups, tags, and other means to hit the EC2 API and pick the instances/IPs for each thing (DB Server, caching server, etc.) at load-time. We do this with great success in deployment, and are moving that way with our Django settings.py, as well.
One method that I heard mentioned recently in an AWS webinar was to store this sort of information in SimpleDB. Essentially, you would use SimpleDB as the central configuration location, and each instance that you launch would register its IP etc. with this configuration, so you would always have a complete description of all of your instances in one place. I haven't seen this in practice so I don't know what the best practices would be exactly, but the idea sounds reasonable. I suppose you could use SNS or something to signal all the other instances whenever the configuration changes, so everyone could refresh their in-memory cache of the configuration.
I don't know the AWS administrative APIs yet really, but there's probably an API call to list your EC2 instances, at which point you could use some sort of custom protocol to ping each of them and ask it what it is -- part of the memcache cluster, Redis, etc.
I'm having a similar problem and didn't found a solution yet because we also need to map Load Balancers addresses.
For your problem, there are two good alternatives:
If you are not using EC2 micro instances or load balancers, you should definitely use Amazon Virtual Private Cloud, because it lets you control instances IPs and routing tables (check all limitations before using this service).
If you are only using EC2 instances, you could write a script that uses the EC2 API tools to run the command ec2-describe-instances to find all instances and their public/private IPs. Then, the script could parameterize instances names to hosts and update /etc/hosts. Finally, you should put the script in the crontab of every computer/instance that need to access the EC2 instances (see ec2-describe-instances).
If you want to stay with EC2 instances (I'm in the same boat, I've read that you can do such things with their VPC or use an S3 bucket or something like that.) but with EC2, I'm in the middle of writing stuff like this...it's all really simple up till the part where you need to contact the server with a server from your data center or something. The way I'm doing it currently is using the API to create the instance and start it...then once its ready, I contact the server to execute a powershell script that I have on the server....the powershell renames the computer and reboots it...that takes care of needing the hostname and MAC for our data center firewalls. I haven't found a way yet to remotely rename a computer.
As far as knowing the IP, the elastic IPs are the way to go. They say you're only allowed 5 and gotta apply for more but we've been regularly requesting more and they give em to us..we're up to like 15 now and they haven't complained yet.
Another option if you dont' want to do all the computer renaming and such...you could use DHCP and set your computer up so when it boots it gets the computer name and everything from DHCP....I'm not sure how to do this exactly, I've come across very smart people telling me that's the way to do it during my research for Amazon.
I would definitely recommend that you get into the Amazon API...I've been working with it for less than a month and I can do all kinds of crazy things. My code can detect areas of our system that are getting stressed, spin up 10 amazon servers all configured to act as whatever needs stress relief, and be ready to send jobs to all in less than 7 minutes. Brings a tear to my eye.
The documentation is very complete...the API itself is a work of art and a joy to program against...I've very much enjoyed working with it. (and no, i dont' work for them lol)
Do it the traditional way: with DNS. This is what it was built for, so use it! When a machine boots, have it ask for the domain name(s) related to its function, and use that for your configuration. If it stops responding, re-resolve the DNS (or just do that periodically anyway).
I think route53 and the elastic load balancing stuff can be used to do this, if you want to stick to Amazon solutions.