I've got an Akka application that I will be deploying on many machines. I want each of these applications to communicate with each other using the distributed publish/subscribe event bus features.
However, if I set the system up for clustering, then I am worried that actors for one application may be created on a different node to the one they started on.
It's really important that an actor is only created on the machine that the application it belongs to was started on.
Basically, I don't want the elasticity or the clustering of actors, I just want the distributed pub/sub. I can see options like singleton or roles, mentioned here http://letitcrash.com/tagged/spotlight22, but I wondered what the recommended way to do this is.
There is currently no feature in Akka which would move your actors around: either you programmatically deploy to a specific machine or you put the deployment into the configuration file. Otherwise each actor will be created locally, which is what you want.
(Akka may one day get automatic actor tree partitioning, but that is not even specified yet.)
I don't think this is the best way to use elastic clustering, but we considered the same issue and found it useful to spread actors over the nodes by a hash of the entity ID (like database shards). For example, on each node we create one NodeRouterActor that proxies messages to multiple WorkerActors. When we send a message to a NodeRouterActor, it selects the target node by looking it up in a hash table using key id % nodeCount; the target NodeRouterActor then proxies the message to the specific WorkerActor which controls the entity.
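That routing idea can be sketched in plain Python (the names and message strings are hypothetical; a real implementation would live inside the router actor): pick the owning node with `entity_id % node_count`, then either handle locally or forward.

```python
# Hypothetical sketch of hash-based entity routing (not actual Akka code).
# Each node runs a router; the router picks the owning node by entity id.

NODE_COUNT = 4

def owner_node(entity_id: int, node_count: int = NODE_COUNT) -> int:
    """Return the index of the node responsible for this entity."""
    return entity_id % node_count

def route(entity_id: int, local_node: int, node_count: int = NODE_COUNT) -> str:
    """Decide whether to handle locally or forward to the owning node."""
    target = owner_node(entity_id, node_count)
    if target == local_node:
        return f"deliver to local worker for entity {entity_id}"
    return f"forward to NodeRouterActor on node {target}"
```

Note that with plain modulo hashing, changing `nodeCount` remaps almost every entity, which is why fixed-size hash tables (or consistent hashing) are usually preferred when the node set can change.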
Related
Recently in a system design interview I was asked a question where cities were divided into zones, and data for around 100 zones was available. An API took the zone ID as input and returned all the restaurants for that zone in the response. The response-time requirement for the API was 50 ms, so the zone data was kept in memory to avoid delays.
If the zone data is approximately 25 GB, then scaling the service to, say, 5 instances would need 125 GB of RAM.
Now the requirement is to run 5 instances but use only 25 GB of RAM, with the data split between instances.
I believe that to achieve this we would need a second application acting as a config manager, tracking which instance holds which zone's data. The instances can ask the config manager which zones to track on startup. What I can't figure out is how we redirect a request for a zone to the correct instance holding its data, especially if we use Kubernetes. Also, if an instance holding partial data restarts, how do we track which zone data it was holding?
Splitting dataset over several nodes: sounds like sharding.
In-memory: the interviewer might be asking about redis or something similar.
Maybe this: https://redis.io/topics/partitioning#different-implementations-of-partitioning
Redis Cluster might fit. Keep in mind that when the docs mention "client-side partitioning", the client is a Redis client library loaded by your backends, which in turn respond to requests from HTTP clients/end users.
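One way to keep the zone-to-instance mapping stable when instances restart or are replaced is a consistent hash ring. This is a language-agnostic toy sketch (all names invented), not a Redis or Kubernetes feature: removing an instance only remaps the zones that instance owned.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash (Python's built-in hash() is randomized per process).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent hash ring: zones map to instances; removing an
    instance only remaps the zones that instance owned."""

    def __init__(self, instances, vnodes=100):
        self._ring = []  # sorted list of (point, instance)
        for inst in instances:
            for v in range(vnodes):
                self._ring.append((_hash(f"{inst}#{v}"), inst))
        self._ring.sort()
        self._points = [p for p, _ in self._ring]

    def instance_for(self, zone_id: str) -> str:
        # Find the first virtual node clockwise from the zone's hash point.
        i = bisect.bisect(self._points, _hash(zone_id)) % len(self._ring)
        return self._ring[i][1]
```

With a scheme like this, routing a request for a zone is a pure function of the live instance list, so any instance (or a thin routing layer in front) can compute the owner without a central lookup per request.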
Answering your comment: then, I'm not sure what they were looking for.
Comparing Java hashmaps to a redis cluster isn't completely fair, considering one is bound to your JVM, while the other is actually distributed / sharded, implying at least inter-process communications and most likely network/non-local queries.
Then again, if the question is to scale an ever-growing JVM: at some point, we need to address the elephant in the room: how do you guarantee data consistency, proper replication/sharding, what do you do when a member goes down, ...?
A distributed hash map, using Hazelcast, may be more relevant. Some (Hazelcast) would argue it is safer under heavy write load; others report that migrating from Hazelcast to Redis helped them improve service reliability. I don't have enough background in Java myself to judge.
As a general rule, when asked about Java, you could argue that speed and reliability very much depend on your developers' understanding of what they're doing, which in Java leaves a large margin for error. Then again, if they're asking such questions, they probably have some good devs on their payroll.
Whereas distributed databases (in-memory, on disk, SQL or NoSQL) are quite a complicated topic that you would need to master (on top of Java) to get right.
The broad approach you're describing was characterized by Adya in 2019 as a LInK store: Linked In-memory Key-value stores allow application objects supporting rich operations to be sharded across a cluster of instances.
I would tend to approach this by implementing a stateful application using Akka (disclaimer: I am at this writing employed by Lightbend, which employs the majority of the developers of Akka and offers support and consulting services to clients using Akka; as my SO history indicates, I would have the same approach even multiple years before I was employed by Lightbend) along these lines.
Akka Cluster to allow a set of JVMs running an application to form a cluster in a peer-to-peer manner and manage/track changes in the membership (including detecting instances which have crashed or are isolated by a network partition)
Akka Cluster Sharding to allow stateful objects keyed by ID to be distributed approximately evenly across a cluster and rebalanced in response to membership changes
These stateful objects are implemented as actors: they can update their state in response to messages and (since they process messages one at a time) without needing elaborate synchronization.
Cluster sharding implies that the actor responsible for an ID might exist on different instances over time, so that implies some persistence of the state of the zone outside of the cluster. For simplicity*, when an actor responsible for a given zone starts, it initializes itself from the datastore (could be S3, could be Dynamo or Cassandra or whatever): after this its state is in memory, so reads can be served directly from the actor's state instead of going to an underlying datastore.
By directing all writes through cluster sharding, the in-memory representation is, by definition, kept in sync with the writes. To some extent, we can say that the application is the cache: the backing datastore only exists to allow the cache to survive operational issues (and because it's only in response to issues of that sort that the datastore needs to be read, we can optimize the data store for writes vs. reads).
Cluster sharding relies on a conflict-free replicated data type (CRDT) to broadcast changes in the shard allocation to the nodes of the cluster. This allows, for instance, any instance to handle an HTTP request for any shard: it simply forwards a representation of the important parts of the request as a message to the shard which will distribute it to the correct actor.
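As a rough illustration of the "application is the cache" idea (plain Python, not Akka; the datastore shape and message types are invented): the entity recovers its state once when it starts, then serves reads from memory while writes pass through it to the store.

```python
class ZoneEntity:
    """Sketch of a sharded, in-memory entity: reads come from memory,
    writes go through the entity so memory and store stay in sync."""

    def __init__(self, zone_id, datastore):
        self.zone_id = zone_id
        self.datastore = datastore
        # Recover state once, on start (e.g. after a crash or rebalance).
        self.restaurants = list(datastore.get(zone_id, []))

    def handle(self, message):
        kind, payload = message
        if kind == "get":
            return list(self.restaurants)  # served from memory, no store read
        if kind == "add":
            self.restaurants.append(payload)  # update memory...
            self.datastore[self.zone_id] = list(self.restaurants)  # ...then store
            return "ok"
        raise ValueError(f"unknown message {kind!r}")
```

Because every write for a zone is funneled through the single entity owning that zone, the in-memory copy cannot drift from what was written; the store is only read again after a restart or rebalance.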
From Kubernetes' perspective, the instances are stateless: no StatefulSet or similar is needed. The pods can query the Kubernetes API to find the other pods and attempt to join the cluster.
*: I have a fairly strong prior that event sourcing would be a better persistence approach, but I'll set that aside for now.
I programmed an Akka application that implements device management. Every device is an Akka actor, and I used an Akka finite state machine to control the lifecycle of each device: FUNCTIONAL, BROKEN, IN_REPAIRS, RETIRED, etc. I persist the devices with Akka Persistence to Cassandra.
Everything works like a dream, but I have a dilemma and would like to ask what the pattern is for dealing with it in Akka.
I will have nearly 1,000,000 devices. Akka is ideal for managing those individual instances, but how do I handle it when a user wants to see all devices in the system, select one, and change its state?
I can't show this from the Akka journal table; I would not be able to show anything other than the persistenceId.
So how would you handle this dilemma?
My current plan: since all events come into my system from Kafka, also consume these messages from the topic and redirect them to Solr/Elasticsearch. That way I can index some metadata along with the persistenceId, so the user can select a device to process with its Akka actor.
Do you have a better idea, or how would you solve this?
Another option is to save this information in Cassandra in another keyspace, but for some reason I don't fancy that...
Thanks for any answers!
Akka Persistence is for managing actor state so that it can be resilient to application failures (https://www.reactivemanifesto.org/). It may not be optimal for business use cases like this one. I understand that your requirement is to be able to browse the actors in the system. I see a couple of options:
Option1:
Akka supports a feature called named actors (https://doc.akka.io/docs/akka/current/general/addressing.html). In your case you have a one-to-one mapping from device to actor, so you can take advantage of the named actors feature: when creating actors in the actor system, name each one with its device ID. Now you can browse all your device IDs (as this is your use case, you can build a searchable module using Solr/Elasticsearch as you mentioned). Browsing devices then means browsing the actors in your system, and you can use the named actor path to retrieve an actor from the system and act on it.
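The one-to-one naming idea, sketched in plain Python rather than Akka (the `/user/device-<id>` path format merely mirrors Akka's `/user/<name>` convention; the registry class is invented for illustration):

```python
class ActorSystemSketch:
    """Toy registry illustrating named actors: one actor per device,
    named by its device id, addressable by a stable path."""

    def __init__(self):
        self._actors = {}

    def spawn(self, device_id: str, actor) -> str:
        # Naming the actor after the device id makes its path derivable
        # from the id alone -- no extra lookup table needed.
        path = f"/user/device-{device_id}"
        self._actors[path] = actor
        return path

    def lookup(self, device_id: str):
        return self._actors.get(f"/user/device-{device_id}")
```

The point of the pattern is that once the user picks a device ID from the search index, the actor's address can be reconstructed from the ID alone and a command sent straight to it.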
Option2:
You can use monitoring tools to trace/browse actors in the application. Beyond your immediate need, they provide several other useful metrics.
https://www.lightbend.com/blog/akka-monitoring-telemetry
https://kamon.io/solutions/monitoring-for-akka/
Akka Persistence is heavily oriented to the Command-Query Responsibility Segregation style of implementing systems. There are plenty of great outlines describing this pattern if you want more depth, but the broad idea is that you divide responsibility for changing data (the intent to change data being modeled through commands) from responsibility for querying data. In some cases this responsibility carries through to separately deployed services, but it doesn't have to (the more separated, in terms of deployment/operations or development, the less coupled they are, so there's a cost/benefit tradeoff for where you want to be on the level-of-segregation spectrum).
The portion of the system which handles commands and decides how (or even if) a given command updates state is often called the "write-side". In your application, the FSM actors modeling the state of a device and persisting changes are the write-side, and you seem to have that part down pat.
The portion handling the queries is, correspondingly, often called the "read-side", and one key benefit is that it can use a different data model than the write-side, up to and including using a different data store (e.g. Solr/Elasticsearch).
Since you're using Akka Persistence and event-sourcing (judging from mentioning the journal table), Akka Projections provides a good opinionated wrapper for publishing events from the write-side to Kafka for another service to update a Solr/Elasticsearch read-side with. It does require (at least at this time) that your write-side tag events; with some effort you can do something similar by combining the persistenceIds and eventsByPersistenceId query streams to feed events from the write-side to Kafka without having to tag.
Note that when going down the CQRS path, you are generally committing to some level of eventual consistency between the write-side and the read-side.
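A toy projection in plain Python (the event shapes are invented for illustration) showing the read-side idea: fold the journal's event stream into a query-friendly model, accepting that it lags the write-side slightly.

```python
def project(events):
    """Fold write-side events into a read-side view keyed by persistenceId.
    Event dictionaries here are hypothetical, not an Akka format."""
    view = {}
    for event in events:
        device = view.setdefault(event["persistence_id"],
                                 {"state": None, "version": 0})
        if event["type"] == "StateChanged":
            device["state"] = event["new_state"]
        device["version"] += 1  # count events applied to this device
    return view
```

In a real deployment this fold would run continuously (e.g. as an Akka Projections consumer feeding Solr/Elasticsearch), and the resulting view is what the UI queries; the write-side actors never serve the "browse all devices" query themselves.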
We're setting up the back-end architecture for our mobile application to connect to, but we need some help. Our application is built to mimic "take a number" tickets you would see at a deli or pharmacy. Users will use our mobile application to send a request to our node controller and our node controller will respond with a spot number.
We currently have our node controller set up on Amazon Elastic Beanstalk and have enabled load balancing to handle any surges in requests. Our question is: how do we persist our spotNumber across multiple instances of our node controller? We have it built now as a local variable that starts at 1 and increments with each request, but will this persist if AWS spins up a new instance of our node controller to handle increased traffic? If not, what would be the best practice for preserving our spotNumber across all potential instances of our server?
Thank you in advance!
Use a database.
Clearly you can't store the value within the node application, not only due to scaling but to prevent data loss if the application shuts down unexpectedly.
It sounds like you don't already have a database, so DynamoDB might be a good choice, as long as your only use case is to share a counter between applications. You can find an example here.
You could also use Redis on Elasticache, but I think that it's overkill for a single counter.
Keeping accurate counters at different scales may require different implementations. At small scale, a simple session variable and locking logic in the application would be enough; at larger scale, synchronization and locking are better managed by a database. In particular for your case, DynamoDB conditional writes or Redis counters seem useful. However, keep your requirements simple and clear: managing counters at scale may require algorithms and data structures with funny names, like the HyperLogLog.
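The conditional-write idea can be sketched as a compare-and-set loop (plain Python standing in for DynamoDB's conditional update or Redis's INCR; the `CasStore` API here is made up): each instance retries until its increment wins, so the store, not the application, arbitrates conflicts.

```python
import threading

class CasStore:
    """In-memory stand-in for a store with compare-and-set semantics."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key, default=0):
        return self._data.get(key, default)

    def compare_and_set(self, key, expected, new):
        # Atomically: write `new` only if the current value is `expected`.
        with self._lock:
            if self._data.get(key, 0) != expected:
                return False
            self._data[key] = new
            return True

def next_spot_number(store, key="spotNumber"):
    """Atomically claim the next counter value; safe across instances
    because a lost race just retries with the fresh value."""
    while True:
        current = store.get(key)
        if store.compare_and_set(key, current, current + 1):
            return current + 1
```

Each Node instance would call the equivalent of `next_spot_number` against the shared database instead of incrementing a local variable, so numbers stay unique and gapless no matter how many instances the load balancer spins up.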
I am following an akka tutorial demonstrating cluster sharding. In the cluster sharding example, the author starts up a shared journal and makes the following comment:
// Start the shared journal on one node (don't crash this SPOF)
// This will not be needed with a distributed journal
the journal used is:
journal.plugin = "akka.persistence.journal.leveldb-shared"
Why do shard entities share a journal? My understanding is that Akka Persistence doesn't support multiple writers to a journal but does support multiple readers, so what is the need for a shared journal? I was under the impression that each persistent actor has its own journal. Why would the non-shared LeveldbJournal not support distributed reads? Is there any difficulty with doing that?
The tutorial is based on Akka 2.4, and in this version cluster sharding uses persistence as the default for akka.cluster.sharding.state-store-mode. In this example, what component exactly uses the snapshot/journal support: is it the persistent actors in the different shards, or is it the information about the shard allocations and its replication? What exactly needs to be distributed? I find the relevant documentation vague and confusing.
If I were to have only one shard, do I need to have a distributed journal?
A somewhat related question: I have reimplemented the now-deprecated PersistentView on top of PersistenceQuery. I can query the journal for the events from a persistent actor and set up a stream to receive its persisted events. I have tested it and it works. However, I can't get it to receive the events of a sharded actor in my test environment with InMemoryJournalStorage (which I don't believe is a distributed journal). In my test scenario I have only one shard and one actor, and I use the actor's unique persistenceId to query for it, but I don't receive any events on the read side. Is there something I am missing about getting Akka Persistence to work with cluster sharding? Should I be appending/prepending something to the persistenceId used to query for events?
They shouldn't, at least not in production code, see the warning note here:
http://doc.akka.io/docs/akka/current/java/persistence.html#shared-leveldb-journal
A shared LevelDB instance is a single point of failure and should therefore only be used for testing purposes.
Both
Yes, if you wanted failover to work. If you didn't want failover and all you had was a single shard, then there would be no point using sharding at all.
Can't tell without seeing some of your code.
I want to create a topology in which one spout is there which emits words, and a bolt which based on these words create a directory named with word.
I have two supervisor nodes, and I want the directory to be created on one node if the word starts with "a" to "l", and on the other node otherwise.
e.g if word is 'acknowledgement' then one directory will be created on one node and if word is "machine" then directory will be created on another node.
Please suggest a way to configure storm to achieve this.
I would also like to know if one bolt is enough or if two bolts are deployed how can one manage that one bolt is run on one machine and another on other machine.
P.S. I am using Pyleus(https://github.com/Yelp/pyleus) for creating bolts and spout.
You can use a single bolt but two instances of it, with each instance running on a separate supervisor node. Use the custom grouping functionality to achieve this: your custom grouping logic decides to which instance of the bolt each word is dispatched.
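The decision function at the heart of that custom grouping can be sketched in a few lines of Python (this only shows the routing rule; in Storm proper you would implement a CustomStreamGrouping, and the task IDs here are placeholders):

```python
def choose_task(word: str, task_ids=(0, 1)) -> int:
    """Route words starting with 'a'..'l' to the first bolt instance,
    everything else to the second."""
    first = word[:1].lower()
    return task_ids[0] if "a" <= first <= "l" else task_ids[1]
```

A grouping like this is deterministic in the word, so every occurrence of the same word always lands on the same bolt instance, which is what keeps each directory on a single node.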
Basically, you can't be sure a bolt/spout will be present in a specific worker (JVM). That is part of Storm's design: workers on different machines are treated as interchangeable, and Storm decides which bolt/spout instances go to which worker.
Storm includes an abstraction: topologies are not tied to a cluster and can be rebalanced at runtime. It is wonderfully resilient and performant, but you can't easily make specific code run on a specific node (doing so would also be an anti-pattern in Storm's philosophy).
AFAIK Storm uses a mod-hash function to distribute tasks/executors among workers (managed by supervisors), and you can't easily override it.
So your best bet is to have tasks >= executors >= workers, as Storm will try to divide the load equally among the workers of your cluster.