I have a lot of worker nodes in my Akka cluster, and their instability keeps triggering the SBR's down-all-when-unstable decision; but these nodes don't have the SBR's role.
Why is the down-all-when-unstable decision not taken based on the SBR's role?
To solve this problem, should I have distinct clusters or use a Multi-DC cluster?
The primary constraint a split-brain resolver has to meet is that every node in the cluster reaches the same decision about which nodes need to be downed (including downing themselves). In the presence of different decisions being made, the guarantees of Cluster Sharding and Cluster Singleton no longer apply: there may be two incarnations of the same sharded entity or the singleton might not be a singleton.
Because there's latency inherent to disseminating reachability observations around the cluster, the less time has elapsed since seeing a change in reachability observations, the more likely it is that there's a node in the cluster which would disagree with our node about which nodes are reachable. That disagreement opens the door for that node to make a different SBR decision than the one our node would make. The only strategy the SBR has which guarantees that every node makes the same decision even if there's a disagreement about membership or reachability is down-all.
Accordingly, SBR delays making a decision until there's been a long enough time since a cluster membership or reachability change has happened. In a particularly unstable cluster, if too much time has passed without achieving stability, the SBR will then apply the down-all strategy, which does not take cluster roles into account.
If you're not using Cluster Sharding or Cluster Singleton (and haven't implemented something with similar constraints...), you might be able to get away with disabling this fallback to down-all. If every bit of distributed state in your system forms a CRDT, for instance, this can be safe; if you know what a CRDT is, you know whether that applies, and if you don't, that almost certainly means not all of the distributed state in your system is a CRDT. The fallback is disabled with the configuration setting
akka.cluster.split-brain-resolver.down-all-when-unstable = off
Think very carefully about this in the context of your application. I would suspect that at least 99.9% of Akka clusters out there would violate correctness guarantees with this setting.
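For concreteness, here's a minimal sketch of how that override might be applied, assuming Akka 2.6+ with the bundled SBR and the typed actor API; the system name and the strategy/timeout values shown are illustrative, not recommendations:

import akka.actor.typed.ActorSystem
import akka.actor.typed.scaladsl.Behaviors
import com.typesafe.config.ConfigFactory

// Illustrative override of the SBR settings discussed above; tune (or omit) values for your cluster.
val sbrConfig = ConfigFactory.parseString(
  """
  akka.cluster.downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
  akka.cluster.split-brain-resolver {
    active-strategy = keep-majority
    stable-after = 20s
    # Only safe if nothing in the system relies on sharding/singleton-style guarantees.
    down-all-when-unstable = off
  }
  """).withFallback(ConfigFactory.load())

val system = ActorSystem(Behaviors.empty[Any], "MyCluster", sbrConfig)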
From your question about distinct clusters or Multi-DC, I take it you are spreading your cluster across multiple datacenters. In that case, note that inter-datacenter networking is typically less reliable than intra-datacenter networking. So that means that you basically have three options:
have separate clusters for each datacenter and use "something(s) else" to coordinate between them
use Multi-DC cluster which takes some account of the difference between inter- and intra-datacenter networking (e.g. that while it's possible for node A in some datacenter and node B in that datacenter to disagree on the reachability of a node C in that datacenter, it's highly likely that node A and node B will agree that node D in a different datacenter is reachable or not)
configure the failure detector for the reliability of the inter-datacenter link (this effectively treats even nodes in the same rack, or even running on the same physical host or VM, as if they were in separate datacenters). This will mean being very slow to declare that a node has crashed (and giving that node a lot of time to say "no, I'm not dead, I was just being quiet/sleepy/etc."). For some applications, this might be a viable strategy.
Which of those 3 is the right option? I think completely separate clusters communicating and coordinating over some separate channel(s), with this modeled in the domain, is often useful (for instance, you might be able to balance traffic to the datacenters in such a way that it's highly unlikely you'd need your west coast datacenter to know what's happening on the east coast). Multi-DC might allow for more consistency than separate clusters. It's probably unlikely that your application requirements are such that multiple DCs within a vanilla single cluster will work well.
Recently in a system design interview I was asked a question where cities were divided into zones and data for around 100 zones was available. An API took the zone ID as input and returned all the restaurants for that zone in response. The required response time for the API was 50 ms, so the zone data was kept in memory to avoid delays.
If the zone data is approximately 25 GB, then scaling the service to, say, 5 instances would need 125 GB of RAM.
Now the requirement is to run 5 instances but use only 25 GB of RAM, with the data split between instances.
I believe that to achieve this we would need a second application acting as a config manager, to manage which instance holds which zone data. The instances can get which zones to track from the config manager service on startup. But the thing I am not able to figure out is how we redirect the request for a zone to the correct instance which holds its data, especially if we use Kubernetes. Also, if an instance holding partial data restarts, how do we track which zone data it was holding?
Splitting a dataset over several nodes: sounds like sharding.
In-memory: the interviewer might be asking about Redis or something similar.
Maybe this: https://redis.io/topics/partitioning#different-implementations-of-partitioning
Redis Cluster might fit -- keep in mind that when the docs mention "client-side partitioning", the client is a Redis client library loaded by your backends, which in turn respond to HTTP client/end-user requests.
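To make "client-side partitioning" concrete, here's a minimal sketch of the idea (in Scala; the instance addresses are made up): the client hashes the key and picks the instance that owns it, so no central router is needed.

// Hypothetical client-side partitioning: each backend knows the full list of
// shard-holding instances and routes a zone's requests by hashing the zone ID.
val instances = Vector("10.0.0.1:6379", "10.0.0.2:6379", "10.0.0.3:6379") // made-up addresses

def instanceFor(zoneId: String): String = {
  // Simple modulo hashing; a real setup would use consistent hashing (or Redis
  // Cluster's hash slots) so adding/removing an instance doesn't remap every key.
  val idx = java.lang.Math.floorMod(zoneId.hashCode, instances.size)
  instances(idx)
}

// e.g. instanceFor("zone-42") tells the backend which instance holds that zone's data.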
Answering your comment: then, I'm not sure what they were looking for.
Comparing Java hashmaps to a Redis cluster isn't completely fair, considering one is bound to your JVM, while the other is actually distributed/sharded, implying at least inter-process communication and most likely network/non-local queries.
Then again, if the question is to scale an ever-growing JVM: at some point, we need to address the elephant in the room: how do you guarantee data consistency, proper replication/sharding, what do you do when a member goes down, ...?
A distributed hashmap, using Hazelcast, may be more relevant. Some (Hazelcast) would make the argument that it is safer under heavy write load. Others say that migrating from Hazelcast to Redis helped them improve service reliability. I don't have enough background in Java myself, so I wouldn't know.
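For illustration, a minimal sketch of the distributed-map idea using Hazelcast's embedded Java API from Scala (the map name and keys are made up; each application instance that starts an embedded member joins the same cluster and shares the map's partitions):

import com.hazelcast.core.Hazelcast
import scala.jdk.CollectionConverters._

// Each app instance starts an embedded Hazelcast member; the members discover
// each other and partition the "zones" map across the cluster.
val hz = Hazelcast.newHazelcastInstance()
val zones = hz.getMap[String, java.util.List[String]]("zones")

// Writes and reads go to whichever member currently owns the key's partition.
zones.put("zone-42", List("Example Cafe", "Sample Diner").asJava)
val restaurants = zones.get("zone-42")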
As a general rule: when asked about Java, you could argue that speed and reliability very much rely on your developers' understanding of what they're doing. Which, in Java, implies a large margin of error. Though we could suppose that if they're asking such questions, they probably have some good devs on their payroll.
Whereas distributed databases (in-memory, on disk, SQL or NoSQL) are quite a complicated topic that you would need to master (on top of Java) to get right.
The broad approach they're describing was characterized by Adya in 2019 as a LInK store: Linked In-memory Key-value stores allow application objects supporting rich operations to be sharded across a cluster of instances.
I would tend to approach this by implementing a stateful application using Akka (disclaimer: I am at this writing employed by Lightbend, which employs the majority of the developers of Akka and offers support and consulting services to clients using Akka; as my SO history indicates, I was taking the same approach for years before I was employed by Lightbend) along these lines.
Akka Cluster to allow a set of JVMs running an application to form a cluster in a peer-to-peer manner and manage/track changes in the membership (including detecting instances which have crashed or are isolated by a network partition)
Akka Cluster Sharding to allow stateful objects keyed by ID to be distributed approximately evenly across a cluster and rebalanced in response to membership changes
These stateful objects are implemented as actors: they can update their state in response to messages and, since they process messages one at a time, they don't need elaborate synchronization.
Cluster sharding implies that the actor responsible for an ID might, over time, exist on different instances, so that implies some persistence of the zone's state outside of the cluster. For simplicity*, when an actor responsible for a given zone starts, it initializes itself from the datastore (could be S3, could be Dynamo or Cassandra or whatever): after this its state is in memory, so reads can be served directly from the actor's state instead of going to an underlying datastore.
By directing all writes through cluster sharding, the in-memory representation is, by definition, kept in sync with the writes. To some extent, we can say that the application is the cache: the backing datastore only exists to allow the cache to survive operational issues (and because it's only in response to issues of that sort that the datastore needs to be read, we can optimize the data store for writes vs. reads).
Cluster sharding relies on a conflict-free replicated data type (CRDT) to broadcast changes in the shard allocation to the nodes of the cluster. This allows, for instance, any instance to handle an HTTP request for any shard: it simply forwards a representation of the important parts of the request as a message to the shard which will distribute it to the correct actor.
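As a rough sketch of the shape this takes (Akka Typed APIs; the ZoneActor, its messages, and the "zone-42" ID are made up for the example, and loading from the backing store is elided):

import akka.actor.typed.{ActorRef, ActorSystem, Behavior}
import akka.actor.typed.scaladsl.Behaviors
import akka.cluster.sharding.typed.scaladsl.{ClusterSharding, Entity, EntityTypeKey}

object ZoneActor {
  sealed trait Command
  final case class GetRestaurants(replyTo: ActorRef[List[String]]) extends Command
  final case class AddRestaurant(name: String) extends Command

  val TypeKey: EntityTypeKey[Command] = EntityTypeKey[Command]("Zone")

  def apply(zoneId: String): Behavior[Command] = {
    // In the design described above, the initial state would be loaded from the
    // backing datastore (S3, Cassandra, ...) when the entity starts; elided here.
    def running(restaurants: List[String]): Behavior[Command] =
      Behaviors.receiveMessage {
        case GetRestaurants(replyTo) =>
          replyTo ! restaurants          // reads served from in-memory state
          Behaviors.same
        case AddRestaurant(name) =>
          // a real implementation would also write through to the datastore
          running(name :: restaurants)
      }
    running(Nil)
  }
}

// On each node, register the entity type with cluster sharding; any node can then
// message a zone by ID and sharding routes it to the single actor that owns that zone.
def initSharding(system: ActorSystem[_]): ClusterSharding = {
  val sharding = ClusterSharding(system)
  sharding.init(Entity(ZoneActor.TypeKey)(entityCtx => ZoneActor(entityCtx.entityId)))
  sharding
}

// usage, e.g. from an HTTP route handler:
//   sharding.entityRefFor(ZoneActor.TypeKey, "zone-42") ! ZoneActor.AddRestaurant("Example Cafe")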
From Kubernetes' perspective, the instances are stateless: no StatefulSet or similar is needed. The pods can query the Kubernetes API to find the other pods and attempt to join the cluster.
*: I have a fairly strong prior that event sourcing would be a better persistence approach, but I'll set that aside for now.
Are there examples of how people are using cgroups to better manage research computing clusters that run parallel scientific codes and serial codes for an academic community?
The primary example I'm aware of is being able to set the cluster scheduler (e.g. Slurm) to assign multiple jobs to a single node without worrying about a renegade job utilizing more resources than assigned.
Cgroups are the mechanism that ensures the different jobs are only able to use the resources assigned to them by Slurm.
Prior to having cluster schedulers capable of doing this, many HPC centers only allowed either one job per node or one user per node. Otherwise a job that requested only 1 core, for example, could, once running, actually use all the cores in the node, which would cause other jobs on the node to run poorly.
What issues can arise when I have an existing Akka application that runs on a single node, and then in the future I need to use Akka Cluster?
Are there any design considerations or best practices I should follow when transitioning from a single-node to a clustered Akka design? Or do things just work when you use Akka Cluster?
The general advice is to start your design as if you were clustering from the start. In particular, you need to pay attention to latency differences between cluster members, so make sure what you're passing through the network (between actors on different machines) is not "huge". You also need to pay attention to location transparency (Akka solves this), partial failure and concurrency. Each of these things can affect your design in significant ways, so it is best to start out considering them and then optimize for the single-node cluster case.
Related Documentation (see also referenced paper from Sun, 1994)
I've set up an AWS instance with Cassandra on it and then also set up an auto scaling group to spin up another 4-8 instances depending on alarms. But how does Cassandra know when auto scaling kicks in? How does it know what other nodes to connect to? Do I need to configure something in Cassandra in order for it to sniff the nodes?
When I run nodetool, the auto scaling nodes don't show up...
[root@ip-10-205-119-104 bin]# sh nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 127.0.0.1 107.12 MB 256 ? a50294ac-2150-4d9e-9dd2-0a56906e9531 rack1
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
The best option for auto-discovery in Cassandra is seed nodes, which are 'anchor' nodes that are supposed to always be there when a new one shows up, and which can be queried for the cluster's node list whenever it is needed.
So, you deliver every node with a list of seed nodes in its config file (the seed list in cassandra.yaml, which includes the seeds themselves), and once it goes up, it will get the node list from a seed. This, of course, requires seed nodes to be static and always running (and of course, for redundancy, you must have more than just one seed node). Cassandra requires them to be listed by their IPs as well (to avoid problems with DNS).
Nonetheless, I don't think auto-scaling Cassandra would be a good thing. Cassandra partitions its data (rows) across nodes, and every time you add or remove a node, it needs to repartition and redistribute rows, which, depending on how big your data is, takes quite a long time (and may demand other administrative actions, like repairing, etc). Even if you have enough replicas to afford a sudden node loss (which is what WILL occur using auto-scaling), that's messy. First, because Cassandra won't automatically decommission nodes - the cluster will know the node is unavailable, but it just waits for it to come back and tries to keep the cluster as healthy as possible (including a mechanism, hinted handoff, that saves the writes meant for the unavailable node on other nodes for some period).
So, you would need to watch your nodes and manage those ups and downs from outside. And you may not even have time to decommission one node and get everything (your data) in place again before another one comes up, and goes down again, and all that could really screw your cluster up.
Well, maybe there are some people out there doing this, but according to my knowledge and experience with Cassandra, it's not as simple and magical as auto-scaling a web application, and you would probably end up losing data and having a very inconsistent and unstable system.
Another issue with using auto scaling is that there is no instant gratification. You cannot really see the benefit of the new node until the cluster rebalances, and this could take a long time depending on your cluster.
While the rebalance is in progress, you end up putting additional load on the original nodes, which defeats the purpose of adding capacity.
Most NoSQL solutions only use eventual consistency, and given that DynamoDB replicates the data into three datacenters, how is read-after-write consistency maintained?
What would be a generic approach to this kind of problem? I think it is interesting since even in MySQL replication data is replicated asynchronously.
I'll use MySQL to illustrate the answer, since you mentioned it, though, obviously, neither of us is implying that DynamoDB runs on MySQL.
In a single network with one MySQL master and any number of slaves, the answer seems extremely straightforward -- for eventual consistency, fetch the answer from a randomly-selected slave; for read-after-write consistency, always fetch the answer from the master.
even in MySQL replication data is replicated asynchronously
There's an important exception to that statement, and I suspect there's a good chance that it's closer to the reality of DynamoDB than any other alternative here: In a MySQL-compatible Galera cluster, replication among the masters is synchronous, because the masters collaborate on each transaction at commit-time and a transaction that can't be committed to all of the masters will also throw an error on the master where it originated. A cluster like this technically can operate with only 2 nodes, but should not have less than three, because when there is a split in the cluster, any node that finds itself alone or in a group smaller than half of the original cluster size will roll itself up into a harmless little ball and refuse to service queries, because it knows it's in an isolated minority and its data can no longer be trusted. So three is something of a magic number in a distributed environment like this, to avoid a catastrophic split-brain condition.
If we assume the "three geographically-distributed replicas" in DynamoDB are all "master" copies, they might operate with logic along the same lines as the synchronous masters you'd find with Galera, so the solution would be essentially the same, since that setup also allows any or all of the masters to still have conventional subtended asynchronous slaves using MySQL native replication. The difference there is that you could fetch from any of the masters that is currently connected to the cluster if you wanted read-after-write consistency, since all of them are in sync; otherwise fetch from a slave.
The third scenario I can think of would be analogous to three geographically-dispersed MySQL masters in a circular replication configuration, which, again, supports subtended slaves off of each master, but has the additional problems that the masters are not synchronous and there is no conflict resolution capability -- not at all viable for this application, but for purposes of discussion, the objective could still be achieved if each "object" had some kind of highly-precise timestamp. When read-after-write consistency is needed, the solution here might be for the system serving the response to poll all of the masters to find the newest version, not returning an answer until all masters had been polled, or to read from a slave for eventual consistency.
Essentially, if there's more than one "write master" then it would seem like the masters have no choice but to either collaborate at commit-time, or collaborate at consistent-read-time.
Interestingly, I think, in spite of some whining you can find in online opinion pieces about the disparity in pricing between the two read-consistency levels in DynamoDB, this analysis -- even as divorced from the reality of DynamoDB's internals as it is -- does seem to justify that discrepancy.
Eventually-consistent read replicas are essentially infinitely scalable (even with MySQL, where a master can easily serve several slaves, each of which can also easily serve several slaves of its own, each of which can serve several... ad infinitum) but read-after-write is not infinitely scalable, since by definition it would seem to require the involvement of a "more-authoritative" server, whatever that specifically means, thus justifying a higher price for reads where that level of consistency is required.
I'll tell you exactly how DynamoDB does this. No guessing.
In order for a write request to be acknowledged to the client, the write must be durable on two of the three storage nodes for that partition. One of the two storage nodes MUST be the leader node for that partition. The third storage node is probably updated as well, but on the off chance something happened, it may not be. DynamoDB will get that one updated as soon as it can.
When you request a strongly consistent read, that read comes from the leader storage node for the partition the item(s) are stored in.
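From the client's point of view, that just means asking for it; here's a minimal sketch using the AWS SDK for Java v2 from Scala (the table name and key are made up for the example):

import software.amazon.awssdk.services.dynamodb.DynamoDbClient
import software.amazon.awssdk.services.dynamodb.model.{AttributeValue, GetItemRequest}

val dynamo = DynamoDbClient.create()

// consistentRead(true) asks DynamoDB for a strongly consistent read, which is
// served by the partition's leader storage node as described above; the default
// (false) is the cheaper, eventually consistent read.
val request = GetItemRequest.builder()
  .tableName("Zones")  // made-up table name
  .key(java.util.Map.of("zoneId", AttributeValue.builder().s("zone-42").build()))
  .consistentRead(true)
  .build()

val item = dynamo.getItem(request).item()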
I know I'm answering this question long after it has been asked, but I thought I could contribute some helpful information...
In a distributed database the concept of a "master" is not particularly relevant anymore (at least for reads/writes). Each node should be able to perform reads and writes, so that read/write performance increases as the number of machines increases. If you want reads to be correct immediately after a write, the number of machines you write to plus the number of machines you read from must be greater than the total number of machines in the system (W + R > N).
Example: if you only write to 1 machine, then you must read from all of them to ensure that your data is not stale. Or if you write to 2 machines out of 3 (a quorum), you can perform reads at quorum (2 of 3) and guarantee that your data is recent.
NOTE: these assumptions change when a subset of nodes in the system crash.
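A tiny sketch of that overlap rule, using the three-replica numbers from the example above:

// If every write is acknowledged by W replicas and every read consults R replicas,
// then R + W > N guarantees the read set overlaps the most recent write set.
def readSeesLatestWrite(n: Int, w: Int, r: Int): Boolean = r + w > n

readSeesLatestWrite(n = 3, w = 1, r = 3)  // true: write to 1, read from all 3
readSeesLatestWrite(n = 3, w = 2, r = 2)  // true: quorum writes + quorum reads
readSeesLatestWrite(n = 3, w = 1, r = 1)  // false: eventually consistent only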