What are cgroups and how are people using them for cluster administration?

Are there examples of how people are using cgroups to better manage research computing clusters that run parallel scientific codes and serial codes for an academic community?

The primary example I'm aware of is being able to configure the cluster scheduler (e.g. Slurm) to assign multiple jobs to a single node without worrying about a renegade job utilizing more resources than it was assigned.
Cgroups are the mechanism that ensures the different jobs are only able to use the resources assigned to them by Slurm.
Prior to having cluster schedulers capable of doing this, many HPC centers only allowed either one job per node or one user per node. Otherwise a job that requested only one core, for example, could, once running, actually use all the cores in the node, which would cause other jobs on the node to run poorly.
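As a rough illustration of what that confinement looks like from inside a job, here is a small Python sketch that reads the cgroup limits applied to the current process. It assumes a cgroup v1 layout under /sys/fs/cgroup and Slurm's task/cgroup plugin; the paths differ under cgroup v2, so treat it as a sketch rather than a portable tool.

    import os

    def cgroup_paths():
        # /proc/self/cgroup lines look like "<hierarchy-id>:<controllers>:<path>"
        paths = {}
        with open("/proc/self/cgroup") as f:
            for line in f:
                _, controllers, path = line.strip().split(":", 2)
                for ctrl in controllers.split(","):
                    paths[ctrl] = path
        return paths

    paths = cgroup_paths()

    # CPUs the job step is confined to (cpuset controller, cgroup v1 layout).
    cpus = os.path.join("/sys/fs/cgroup/cpuset", paths.get("cpuset", "").lstrip("/"), "cpuset.cpus")
    if os.path.exists(cpus):
        print("allowed CPUs:", open(cpus).read().strip())

    # Memory limit in bytes (memory controller, cgroup v1 layout).
    mem = os.path.join("/sys/fs/cgroup/memory", paths.get("memory", "").lstrip("/"), "memory.limit_in_bytes")
    if os.path.exists(mem):
        print("memory limit:", open(mem).read().strip())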


Akka Cluster: Down all when unstable & SBR's role

I have a lot of worker nodes in my akka-cluster, and their instability causes a Down all when unstable decision; but they don't have the SBR's role.
Why is the Down all when unstable decision not taken based on the SBR's role?
To solve this problem, should I have distinct clusters or use a Multi-DC cluster?
The primary constraint a split-brain resolver has to meet is that every node in the cluster reaches the same decision about which nodes need to be downed (including downing themselves). In the presence of different decisions being made, the guarantees of Cluster Sharding and Cluster Singleton no longer apply: there may be two incarnations of the same sharded entity or the singleton might not be a singleton.
Because there's latency inherent to disseminating reachability observations around the cluster, the less time has elapsed since seeing a change in reachability observations, the more likely it is that there's a node in the cluster which would disagree with our node about which nodes are reachable. That disagreement opens the door for that node to make a different SBR decision than the one our node would make. The only strategy the SBR has which guarantees that every node makes the same decision, even if there's a disagreement about membership or reachability, is down-all.
Accordingly, SBR delays making a decision until there's been a long enough time since a cluster membership or reachability change has happened. In a particularly unstable cluster, if too much time has passed without achieving stability, the SBR will then apply the down-all strategy, which does not take cluster roles into account.
If you're not using Cluster Sharding or Cluster Singleton (and haven't implemented something with similar constraints), you might be able to get away with disabling this fallback to down-all. For instance, if every bit of distributed state in your system forms a CRDT, you might be able to get away with this; if you know what a CRDT is, you know, and if you don't, that almost certainly means not all the distributed state in your system is a CRDT. The configuration setting is
akka.cluster.split-brain-resolver.down-all-when-unstable = off
Think very carefully about this in the context of your application. I would suspect that at least 99.9% of Akka clusters out there would violate correctness guarantees with this setting.
From your question about distinct clusters or Multi-DC, I take it you are spreading your cluster across multiple datacenters. In that case, note that inter-datacenter networking is typically less reliable than intra-datacenter networking, which means you basically have three options:
* have separate clusters for each datacenter and use "something(s) else" to coordinate between them
* use a Multi-DC cluster, which takes some account of the difference between inter- and intra-datacenter networking (e.g. while it's possible for node A in some datacenter and node B in that datacenter to disagree on the reachability of a node C in that datacenter, it's highly likely that nodes A and B will agree on whether node D in a different datacenter is reachable)
* configure the failure detector for the reliability of the inter-datacenter link (this effectively treats even nodes in the same rack, or even on the same physical host or VM, as if they were in separate datacenters). This will mean being very slow to declare that a node has crashed (and giving that node a lot of time to say "no, I'm not dead, I was just being quiet/sleepy/etc."). For some applications, this might be a viable strategy.
Which of those three is the right option? I think completely separate clusters communicating and coordinating over some separate channel(s), with that coordination modeled in the domain, is often useful (for instance, you might be able to balance traffic to the datacenters in such a way that it's highly unlikely you'd need your west coast datacenter to know what's happening on the east coast). Multi-DC might allow for more consistency than separate clusters. It's probably unlikely that your application requirements are such that multiple DCs within a single vanilla cluster will work well.

Configuring EMR cluster, which node to choose?

Suppose I read data from RDS and write it into S3 using an EMR cluster (Spark); should I use Task nodes only?
Example:
* 1 Master node
* 4 Task nodes
In my case I don't use HDFS to store data, so a Core node isn't necessary, if I understand it right. Or should I have at least one Core node anyway? Any ideas?
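For reference, the kind of job described above (reading from RDS over JDBC and writing to S3) looks roughly like the PySpark sketch below; the endpoint, credentials, table name, and bucket are placeholders, and you would supply the JDBC driver for your RDS engine.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rds-to-s3").getOrCreate()

    # Read a table from RDS over JDBC (placeholder endpoint and credentials).
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://my-rds-endpoint:3306/mydb")
          .option("dbtable", "my_table")
          .option("user", "username")
          .option("password", "password")
          .load())

    # Write the result to S3; the executors on Core/Task nodes do the actual work.
    df.write.mode("overwrite").parquet("s3://my-bucket/output/my_table/")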
To my knowledge, you should have at least one Core node.
I had a similar use case a very long time back, where I used Spark SQL to read data from S3 and insert it into RDS (the opposite of your use case, but that does not matter here).
Since the job was not heavy, I used only the Master node and a Core node. I did not use any Task nodes, since I did not see the need for them on a small job.
I think it is a slight misunderstanding that we should only look at using Core nodes when HDFS is used. The way I see it, at the end of the day even a Core node is an instance on which I can run an application.
So a Core node can perform the job of a task/worker node, and I have seen multiple examples where the Core node is a large instance type (say r5.24xlarge) and the executors run on that instance.
In my example above, all the tasks were performed on the Core node itself, since I did not have any Task nodes.
In my experience, I have seen a lot of EMR clusters with only a Master node and Core nodes. I have not seen anything with just a Master node and Task nodes.
One key point I want to share: please use at least one On-Demand Instance in the Core fleet. You can have a fleet of instances in the Core group (composed of both On-Demand and Spot Instances), but having at least one On-Demand instance is highly advisable.
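As a hedged sketch of what that advice looks like when creating a cluster with instance fleets through boto3 (the names, release label, subnet, and roles below are placeholders; double-check the fields against the EMR API reference for your setup), the Core fleet can mix On-Demand and Spot capacity while still guaranteeing one On-Demand instance:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="spark-rds-to-s3",           # placeholder cluster name
        ReleaseLabel="emr-6.10.0",        # placeholder release
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceFleets": [
                {"Name": "master", "InstanceFleetType": "MASTER",
                 "TargetOnDemandCapacity": 1,
                 "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]},
                # Core fleet: one guaranteed On-Demand instance, the rest on Spot.
                {"Name": "core", "InstanceFleetType": "CORE",
                 "TargetOnDemandCapacity": 1, "TargetSpotCapacity": 3,
                 "InstanceTypeConfigs": [{"InstanceType": "r5.xlarge", "WeightedCapacity": 1}]},
            ],
            "Ec2SubnetIds": ["subnet-0123456789abcdef0"],  # placeholder subnet
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])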
More reading can be found here:
Understand Node Types
Cluster Configuration Guidelines and Best Practices
So the moral of the story is:
Should I have at least one Core node anyway?
Yes, you should, in my opinion.

Optimization of the google dataproc cluster

I am using a Dataproc cluster for Spark processing. I am new to the whole Google Cloud stack. In our application we have hundreds of jobs which use Dataproc. With every job we spawn a new cluster and terminate it once the job is over. I am using PySpark for processing.
Is it safe to use a hybrid of stable nodes and preemptible nodes for cost reduction?
What is the best software configuration for improving the performance of a Dataproc cluster? I am aware of the in-house infrastructure optimisations for Hadoop/Spark clusters. Are they applicable as-is to a Dataproc cluster, or is something else needed?
Which instance type is best suited for a Dataproc cluster when we are processing Avro-formatted data around 150 GB in size?
I have tried Spark's dataframe caching/persist for time optimization, but it was not that useful. Is there any way to instruct Spark that all the resources (memory, processing power) belong to this job so that it can process it faster?
Does reading from and writing back to a GCS bucket have a performance hit? If yes, is there any way to optimize it?
Any help with time and price optimisation is appreciated. Thanks in advance.
Is it safe to use a hybrid of stable nodes and preemptible nodes for cost reduction?
That's absolutely fine. We've used that on 300+ node clusters; the only issues were with long-running clusters when nodes were getting preempted and jobs were not optimised to account for node reclamation (no RDD replication, huge long-running DAGs). Also, Tez does not like preemptible nodes getting reclaimed.
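If you do keep long-lived cached data on a cluster with preemptible workers, one possible mitigation (a sketch below; the bucket paths are placeholders and whether the extra copies pay off depends on the job) is to persist with a replicated storage level, or to checkpoint so a preempted node does not force a long recomputation:

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("preemptible-friendly").getOrCreate()

    # Needs the spark-avro package on the classpath.
    df = spark.read.format("avro").load("gs://my-bucket/input/")

    # Keep two copies of each cached partition so a preempted node does not
    # invalidate the whole cache (costs extra memory/disk and network).
    df.persist(StorageLevel.MEMORY_AND_DISK_2)

    # Alternatively, truncate a very long DAG with a checkpoint.
    spark.sparkContext.setCheckpointDir("gs://my-bucket/checkpoints/")
    checkpointed = df.checkpoint()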
Is it applicable as it is for dataroc cluster or something else is needed?
Correct. However, the Google Cloud Storage connector has different characteristics when it comes to operation latency (for example, FileOutputCommitter can take a huge amount of time when trying to do a recursive move or delete with over-partitioned output) and memory usage (writer buffers are 64 MB vs 4 KB on HDFS).
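A commonly used mitigation for the slow commit phase, assuming you can tolerate the weaker failure guarantees of the v2 commit algorithm, is to switch the FileOutputCommitter algorithm version when building the session; a minimal sketch:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("gcs-friendly-commits")
             # v2 commits task output directly instead of doing a final recursive
             # rename, which is much cheaper on object stores like GCS, at the
             # cost of leaving partial output behind if a job fails mid-commit.
             .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
             .getOrCreate())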
Which instance type is best suited for a Dataproc cluster when we are processing Avro-formatted data around 150 GB in size?
Only performance tests can help with that.
I have tried Spark's dataframe caching/persist for time optimization, but it was not that useful. Is there any way to instruct Spark that all the resources (memory, processing power) belong to this job so that it can process it faster?
Make sure to use dynamic allocation and that your cluster is sized to your workload. The Scheduling tab in the YARN UI should show utilisation close to 100% (if not, your cluster is oversized for the job, or you don't have enough partitions). In the Spark UI, it is better to have the number of running tasks close to the number of cores (if not, it again might be that there are not enough partitions, or the cluster is oversized).
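A minimal sketch of the knobs mentioned here; Dataproc typically enables dynamic allocation and the external shuffle service out of the box, so treat these as settings to verify rather than ones you necessarily have to set, and the partition multiplier is only a rule of thumb:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("sized-to-workload")
             # Let the number of executors scale with the backlog of pending tasks.
             .config("spark.dynamicAllocation.enabled", "true")
             # Dynamic allocation needs the external shuffle service.
             .config("spark.shuffle.service.enabled", "true")
             .getOrCreate())

    df = spark.read.format("avro").load("gs://my-bucket/input/")  # placeholder path

    # Aim for a few partitions per core; defaultParallelism is roughly the
    # total number of executor cores currently available.
    df = df.repartition(spark.sparkContext.defaultParallelism * 3)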
Does reading from and writing back to a GCS bucket have a performance hit? If yes, is there any way to optimize it?
From a throughput perspective, GCS is not bad, but it is much worse with many small files, both on the read side (when computing splits) and on the write side (in the FileOutputCommitter). Also, many parallel writes can result in OOMs due to the bigger write buffer size.
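If the output ends up as many small files, one workaround (a sketch; the input path, bucket, and target partition count are placeholders that depend on your data volume) is to reduce the number of output partitions just before writing:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compact-output").getOrCreate()
    df = spark.read.format("avro").load("gs://my-bucket/input/")  # placeholder path

    # Fewer, larger output files are much friendlier to GCS than thousands of
    # small ones, both for the commit phase and for later reads.
    (df.coalesce(64)                        # pick a count that yields ~128 MB-1 GB files
       .write.mode("overwrite")
       .parquet("gs://my-bucket/output/"))  # placeholder bucket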

EC2 spark master instance size

I intend to set up a Spark cluster on EC2. How many resources does the Spark master instance actually need? Since the master is not involved in processing any of the tasks, can it be the smallest EC2 instance?
This obviously depends on what kinds of jobs you're planning to run, how big the cluster is, etc., so in that sense the advice to simply try different configurations is good. However, in my purely personal experience the driver instance should be at least at the level of the slave instances. This is mainly for two reasons.
First of all, there are times when you need the result of the job in a single place. Maybe you just don't want to spend time combining files, maybe you need the results in some specific order which would be hard to achieve in a distributed way, etc. This means the driver should be able to hold all the data (as rdd.collect gathers the results to the driver instance).
Second of all, many of the shuffle-based operations seem to require a lot of memory from the driver. I'm not exactly sure about the details of why this happens (if anyone knows, please do share), but I can't count the number of times I've seen reduceByKey causing an out-of-memory error on the driver.
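If you do collect results to the driver or run heavy shuffles, the driver-side limits worth sizing for are spark.driver.memory and spark.driver.maxResultSize. A hedged sketch follows; note that in client mode spark.driver.memory is usually passed via spark-submit --driver-memory, because the driver JVM is already running by the time in-code configuration is applied.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("driver-sizing")
             # Heap for the driver process; in client mode prefer
             # `spark-submit --driver-memory 8g` so it actually takes effect.
             .config("spark.driver.memory", "8g")
             # Upper bound on the total size of results collect() may bring back.
             .config("spark.driver.maxResultSize", "4g")
             .getOrCreate())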
Edit: I had assumed you were using Spark's spark-ec2 script, which I believe installs the NameNode on the master instance. If the NameNode is not installed on the master instance, however, my answer has no validity, as correctly pointed out by @DemetriKots in the comments.
Although the master instance is not involved in data processing, it plays a major role in the management of the workload and resource allocation, e.g. (all info is taken from the sources):
NameNode
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
This Hortonworks document (look for Hardware Recommendations for Hadoop in the left index) specifies some recommendations for the master instance in a Hadoop cluster. While it might not be adequate for the slave instances (due to Spark's memory usage), I would say it can be useful in the case of the master instance in a Spark cluster.

How to distribute a program on an unreliable cluster?

What I'm looking for is any/all of the following:
* automatic discovery of worker failure (computer off, for instance)
* detection of all running (Linux) PCs in a given IP address range (computer on)
* ... and automatic worker spawning (ping+ssh?)
* load balancing so that workers do not slow down other processes (nice?)
* some form of message passing
... and I don't want to reinvent the wheel.
A C++ library, bash scripts, a standalone program ... all are welcome.
If you give an example of software, please tell us which of the above functions it has.
Check out the Spread Toolkit, a C/C++ group communication system. It will allow you to detect node/process failure and recovery/startup, in a manner that allows you to rebalance a distributed workload.
What you are looking for is called a "job scheduler". There are many job schedulers on the market; these are the ones I'm familiar with:
SGE handles any and all issues related to job scheduling on multiple machines (recovery, monitoring, priority, queuing). Your software does not have to be SGE-aware, since SGE simply provides an environment in which you submit batch jobs.
LSF is a better alternative, but not free.
To support message passing, see the MPI specification. SGE fully supports MPI-based distribution.
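To get a feel for what MPI-style message passing looks like, here is a minimal sketch using the mpi4py Python bindings (an equivalent C or C++ program against the MPI API works the same way); launch it with mpirun/mpiexec or through the scheduler:

    from mpi4py import MPI  # pip install mpi4py; requires an MPI library installed

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    if rank == 0:
        # Rank 0 hands a chunk of work to every other rank...
        for dest in range(1, size):
            comm.send({"task_id": dest, "payload": list(range(dest))}, dest=dest, tag=0)
        # ...and collects the results.
        results = [comm.recv(source=src, tag=1) for src in range(1, size)]
        print("collected:", results)
    else:
        work = comm.recv(source=0, tag=0)
        comm.send(sum(work["payload"]), dest=0, tag=1)

Run with e.g. mpirun -np 4 python script.py; under SGE or SLURM the scheduler supplies the host list.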
Depending on your application requirements, I would check out the BOINC infrastructure. They're implementing a form of client/server communication in their latest releases, and it's not clear what form of communication you need. Their API is in C, and we've written wrappers for it in C++ very easily.
The other advantage of BOINC is that it was designed to scale for large distributed computing projects like SETI or Rosetta@home, so it supports things like validation, job distribution, and management of different application versions for different platforms.
Here's the link:
BOINC website
There is Hadoop. It has MapReduce, but I'm not sure whether it has any of the other features I need. Does anybody know?
You are indeed looking for a "job scheduler". Nodes are "statically" registered with a job scheduler. This allows the job scheduler to inspect the nodes and determine the core count, RAM, available scratch disk space, OS, and much more. All of that information can be used to select the required resources for a job.
Job schedulers also provide basic health monitoring of the cluster. Nodes that are down are automatically removed from the list of available nodes. Nodes which are running jobs (through the scheduler) are also removed from the list of available nodes.
SLURM is a resource manager and job scheduler that you might consider. SLURM has integration hooks for LSF and PBS Pro. Several MPI implementations are "SLURM aware" and can use/set environment variables that allow an MPI job to run on the nodes allocated to it by SLURM.
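For example, a process launched through SLURM can read its allocation from environment variables SLURM exports; a small sketch (the exact set of variables depends on how the job was submitted):

    import os

    # A few of the variables SLURM sets for processes it launches.
    print("job id:       ", os.environ.get("SLURM_JOB_ID"))
    print("node list:    ", os.environ.get("SLURM_JOB_NODELIST"))
    print("tasks:        ", os.environ.get("SLURM_NTASKS"))
    print("cpus per task:", os.environ.get("SLURM_CPUS_PER_TASK"))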