I would like to set up a Ray cluster to use Rtune over 4 GPUs on AWS, but each GPU belongs to a different member of our team. I have scoured the available resources for an answer and found nothing. Help?
In order to start a Ray cluster using instances that span multiple AWS accounts, you'll need to make sure that the AWS instances can communicate with each other over the relevant ports. To enable that, you will need to modify the AWS security groups for the instances (though be sure not to open up the ports to the whole world).
You can choose which ports are used via the arguments --redis-port, --redis-shard-ports, --object-manager-port, and --node-manager-port to ray start on the head node, and just --object-manager-port and --node-manager-port on the non-head nodes. See the relevant documentation.
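As a rough illustration of pinning the ports, here is a minimal Python sketch that only builds the ray start command lines for the head node and the other nodes. The port numbers and the head IP are made-up placeholders, and the --redis-address flag used to point a non-head node at the head is an assumption about the Ray release in use (flag names have changed between versions, so check ray start --help before relying on any of them).

```python
# Sketch only: builds command lines, it does not start Ray.
# All port numbers below are hypothetical; pick ones your security groups allow.
PORTS = {
    "redis": 6379,
    "redis_shards": [6380, 6381],
    "object_manager": 12345,
    "node_manager": 12346,
}


def ray_start_command(head, ports, head_address=None):
    """Return the `ray start` argument list with every port pinned."""
    cmd = ["ray", "start"]
    if head:
        shard_ports = ",".join(str(p) for p in ports["redis_shards"])
        cmd += [
            "--head",
            f"--redis-port={ports['redis']}",
            f"--redis-shard-ports={shard_ports}",
        ]
    else:
        # Assumption: older Ray releases join a cluster via --redis-address;
        # newer ones use a different flag.
        cmd.append(f"--redis-address={head_address}:{ports['redis']}")
    cmd += [
        f"--object-manager-port={ports['object_manager']}",
        f"--node-manager-port={ports['node_manager']}",
    ]
    return cmd


if __name__ == "__main__":
    print(" ".join(ray_start_command(True, PORTS)))
    # 10.0.0.1 is a placeholder for the head node's reachable IP.
    print(" ".join(ray_start_command(False, PORTS, head_address="10.0.0.1")))
```

The ports that appear in these commands are the ones the security groups of every account need to allow between the instances.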
However, what you're trying to do sounds somewhat complex. It'd be much easier to use a single account if possible, in which case you could use the Ray autoscaler.
Is this setup possible in Google Cloud Platform? It is possible and easy to set up in Amazon Web Services.
There is no way to do exactly what you're asking in GCP.
However, there is a way to map multiple public IPs to a single VM instance using forwarding rules, which was discussed here.
To give you a better understanding of GCP's networking, have a look at how you can create a VM with multiple interfaces (up to 8 per VM).
This documentation may also be useful: how to create a VM with multiple NICs.
The last piece of documentation describes some examples and use cases for VMs with multiple NICs, which may also be helpful.
We have DC/OS running on AWS with a fixed number of master nodes and agent nodes as part of a POC. However, we'd like to have the cluster (agent nodes) autoscale according to load. So far, we've been unable to find any information about scaling on DC/OS docs. I've also had no luck so far in my web-searches.
If someone's got this working already, please let us know how you did it.
Thanks for your help!
Autoscaling the number of service instances by cpu, memory, or network load is possible: https://docs.mesosphere.com/1.8/usage/tutorials/autoscaling/
Autoscaling the number of DC/OS nodes by adding/removing nodes, however, is outside of the scope of DC/OS and specific to the IaaS it is deployed on. You can imagine that this wouldn't work on bare metal for obvious reasons. It's hypothetically possible, of course, but I haven't seen any existing automation for it.
The DC/OS AWS templates use easily scaled node groups, but it's not automatic. You might try looking for IaaS specific autoscalers that aren't DC/OS specific.
If you have an autoscaling group for your "private agent" nodes and you want to scale the number of nodes in times of heavy load, pick a CloudWatch metric that suits your needs (e.g. traffic on ELB) and scale by an autoscaling scaling policy:
http://docs.aws.amazon.com/autoscaling/latest/userguide/policy_creating.html
Then you can use one of the two ways described in https://docs.mesosphere.com/1.8/usage/tutorials/autoscaling/ to scale your apps within DC/OS (on scheduler level).
I have a basic cluster, which has a master and 2 nodes. The 2 nodes are part of an aws autoscaling group - asg1. These 2 nodes are running application1.
I need to be able to add further nodes, running application2, to the cluster.
Ideally, I'd like a multi-region setup, whereby application2 can run in multiple regions but still be part of the same cluster (not sure if that is possible).
So my question is, how do I add nodes to a cluster, more specifically in AWS?
I've seen a couple of articles where people have spun up the instances and then manually logged in to install the kubelet and various other things, but I was wondering if it could be done in a more automatic way?
Thanks
If you followed these instructions, you should have an autoscaling group for your minions.
Go to the AWS panel and scale up the autoscaling group. That should do it.
If you set the cluster up manually, you can clone a machine by selecting an existing minion/slave and choosing "launch more like this".
As Pablo said, you should be able to add new nodes (in the same availability zone) by scaling up your existing ASG. This will provision new nodes that will be available for you to run application2. Unless your applications can't share the same nodes, you may also be able to run application2 on your existing nodes without provisioning new nodes if your nodes are big enough. In some cases this can be more cost effective than adding additional small nodes to your cluster.
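Scaling up the existing ASG can also be done from a script instead of the console. The sketch below assumes boto3 is installed and credentials are configured; the helper name scale_up_group is made up for illustration, and the group name asg1 comes from the question. The client is passed in so the function can be tried against a stub first.

```python
def scale_up_group(client, group_name, extra_nodes):
    """Raise an Auto Scaling group's desired capacity by `extra_nodes`.

    `client` is expected to behave like boto3.client("autoscaling").
    The new capacity is capped at the group's configured MaxSize.
    """
    response = client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    group = response["AutoScalingGroups"][0]
    target = min(group["DesiredCapacity"] + extra_nodes, group["MaxSize"])
    client.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=target,
        HonorCooldown=False,
    )
    return target


# Example usage (needs real AWS credentials; the region is a placeholder):
#
#   import boto3
#   client = boto3.client("autoscaling", region_name="us-east-1")
#   scale_up_group(client, "asg1", extra_nodes=1)
```

The new instances only join the cluster on their own if the group's launch configuration already bootstraps a node, which is the case when the group was created by the cluster setup scripts.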
To your other question, Kubernetes isn't designed to be run across regions. You can run a multi-zone configuration (in the same region) for higher availability applications (which is called Ubernetes Lite). Support for cross-region application deployments (Ubernetes) is currently being designed.
I have no experience with AWS CloudFormation Templates so I apologize for the incredibly simple question which I can't find an answer to because I think it is so basic.
I am trying to create a CloudFormation template for a single server in AWS Test Drive. Here are the criteria:
Deploy AMI
Force m3.large (no other sizes available)
Will be running in a single location (no other location available)
Utilize existing security group
Get a public IP
Spit back the public DNS or public IP address
Everything I've looked up wants to be more complex than I think I need, and I can't figure out which pieces are needed and which ones can be taken out. What is the bare minimum to deploy a single AMI with no customization (all customization is performed inside the VM during bootup)? There should also be no options for other data center locations or other sizes. All templates I've seen have a bunch of options for multiple data centers and multiple sizes, and set up a security group.
I appreciate the links to the AWS site however I have already been there and this is one of the templates that has too much info and I don't know what I can change\exclude.
Thanks for your help.
Amazon Web Services documentation includes a single-server CloudFormation template that simply creates a Linux EC2 instance and accompanying security group. This one is based in US West 2 (Oregon), but does not appear to be region-specific and should work in any region.
https://s3-us-west-2.amazonaws.com/cloudformation-templates-us-west-2/EC2InstanceWithSecurityGroupSample.template
This sample can be found along with others here:
http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/sample-templates-services-us-west-2.html
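For comparison with the sample above, here is a minimal sketch of what a stripped-down template for the stated criteria could look like, generated from Python so the structure is easy to see. It has no Parameters and no Mappings, hard-codes m3.large, attaches an existing security group by ID, and outputs the public IP and DNS name. The AMI ID and security group ID are placeholders, the resource name Server is arbitrary, and the instance only receives a public IP if the subnet it lands in auto-assigns one; none of this has been checked against Test Drive's own requirements.

```python
import json


def single_instance_template(ami_id, security_group_id):
    """Return a CloudFormation template (as a dict) for one fixed-size instance."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Single EC2 instance, fixed size, existing security group",
        "Resources": {
            "Server": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": ami_id,
                    "InstanceType": "m3.large",
                    # Reuse an existing group instead of creating one.
                    "SecurityGroupIds": [security_group_id],
                },
            }
        },
        "Outputs": {
            "PublicIp": {"Value": {"Fn::GetAtt": ["Server", "PublicIp"]}},
            "PublicDns": {"Value": {"Fn::GetAtt": ["Server", "PublicDnsName"]}},
        },
    }


if __name__ == "__main__":
    # Placeholder IDs; substitute the real AMI and security group.
    template = single_instance_template("ami-12345678", "sg-12345678")
    print(json.dumps(template, indent=2))
```

Because nothing is parameterized, whoever launches the stack gets no choice of size or location; the region is simply the one the stack is created in.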
My question is 2 fold:
**UPDATE*******
I fixed number 1.
I had to specify the region in the config. I guess this is because my keys are associated with the East region by default.
If anyone has an answer to 2 that would be great.
1) I am ultimately trying to set up a 4 node cluster (2 in each region). In the main region (us-east-1) the nodes see each other perfectly fine, but in the West they don't seem to see each other. I'd like to make sure they can see each other before I try multi-region (which I'm not entirely sure how to do yet). I've installed the plugin.
Basically, why are the nodes in a different region not seeing each other when the config is the same? I can telnet to/from each server on 9200/9300.
Here is my config:
cloud:
    aws:
        access_key:
        secret_key:
discovery:
    type: ec2
    ec2:
        groups: ELASTIC-SEARCH
2) Is there a way to designate a specific node to "Hold all the data" and then distribute it among them all?
While it's not the answer you want: Don't do that.
It'll be much easier to have two clusters in two regions and keep them in sync at your application layer. Also, Elasticsearch introduced the concept of a tribe node in 1.0 to make this a bit easier.
Elasticsearch, like any distributed database, is very sensitive to network issues. In this case you're relying on the Internet working reliably. It tends not to.
The setup you suggest will be quite prone to split brains or outages. If you configure minimum master nodes to be a quorum, which you always should, the cluster will go down whenever there's a connection problem between the regions.
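To make the quorum point concrete, here is a small arithmetic sketch in Python. It assumes all 4 nodes from the question are master-eligible and split 2 per region; the function names are made up for illustration, and the value computed is what you would set as discovery.zen.minimum_master_nodes.

```python
def quorum(master_eligible_nodes):
    """Smallest majority of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1


def can_elect_master(reachable_nodes, master_eligible_nodes):
    """True if this side of a partition still sees a majority."""
    return reachable_nodes >= quorum(master_eligible_nodes)


if __name__ == "__main__":
    total = 4        # 2 nodes in each of the two regions
    per_region = 2   # nodes each region can still see when the link drops
    print("minimum master nodes:", quorum(total))
    print("region can elect a master during a partition:",
          can_elect_master(per_region, total))
```

With 4 nodes the quorum is 3, so when the link between the regions drops, each side sees only 2 nodes and neither can elect a master. Lowering the setting to 2 would keep both sides up, but as two independent clusters, which is the split brain the quorum exists to prevent.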
We've written two articles that go much more in depth than this about this topic, which you may want to look into:
Elasticsearch in Production has a section on networking related issues.
Elasticsearch Internals: Networking Introduction describes the network topology of Elasticsearch. Specifically, you'll see just how many connections Elasticsearch needs to have working reliably.