Choosing an Akka persistence DB - akka

The default database for Akka persistence is LevelDB, but I saw that there are plugins for Redis, MongoDB, and others.
What factors should I take into account when choosing between them? In my use case, I want to use persistence to recover after a process fails. The actor's state is a large data structure with a high throughput of read/write operations.
Another option is for the actor to use Redis directly, without any Akka Persistence plugin. Is this option less preferable?

Related

Divide in-memory data between service instances

Recently, in a system design interview, I was asked a question where cities were divided into zones and data for around 100 zones was available. An API took the zone ID as input and returned all the restaurants for that zone in the response. The required response time for the API was 50 ms, so the zone data was kept in memory to avoid delays.
If the zone data is approximately 25 GB, then scaling the service to, say, 5 instances would need 125 GB of RAM.
The requirement is to run 5 instances but use only 25 GB of RAM, with the data split between the instances.
I believe that to achieve this we would need a second application acting as a config manager, tracking which instance holds which zone's data. The instances could learn which zones to track from the config-manager service on startup. What I can't figure out is how to redirect a request for a zone to the correct instance holding its data, especially if we use Kubernetes. Also, if an instance holding partial data restarts, how do we track which zone data it was holding?
Splitting a dataset over several nodes: that sounds like sharding.
In-memory: the interviewer might be asking about Redis or something similar.
Maybe this: https://redis.io/topics/partitioning#different-implementations-of-partitioning
Redis Cluster might fit. Keep in mind that when the docs mention "client-side partitioning", the client is a Redis client library loaded by your backends, which in turn respond to HTTP client/end-user requests.
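To make "client-side partitioning" concrete, here is a minimal sketch in Scala using the Jedis client, with naive modulo hashing and hypothetical hostnames; real Redis Cluster clients use CRC16 hash slots instead:

    import redis.clients.jedis.Jedis

    // Two assumed Redis nodes; "redis-1"/"redis-2" are made-up hostnames.
    val shards = Vector(new Jedis("redis-1", 6379), new Jedis("redis-2", 6379))

    // The backend's client picks the shard from the key, so all reads and
    // writes for a given key consistently land on the same node.
    def shardFor(key: String): Jedis =
      shards((key.hashCode & Int.MaxValue) % shards.size)

    shardFor("zone:42").set("zone:42", "{\"restaurants\": [...]}")
    val payload = shardFor("zone:42").get("zone:42")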
Answering your comment: then, I'm not sure what they were looking for.
Comparing Java hash maps to a Redis cluster isn't entirely fair, considering one is bound to your JVM while the other is actually distributed/sharded, implying at least inter-process communication and most likely network/non-local queries.
Then again, if the question is how to scale an ever-growing JVM, at some point we need to address the elephant in the room: how do you guarantee data consistency and proper replication/sharding, and what do you do when a member goes down?
A distributed hash map, using Hazelcast, may be more relevant. Some (Hazelcast) would argue it is safer under heavy write load; others say that migrating from Hazelcast to Redis helped them improve service reliability. I don't have enough background in Java myself, so I wouldn't know.
As a general rule, when asked about Java, you could argue that speed and reliability rely heavily on your developers' understanding of what they're doing, which, in Java, implies a large margin for error. Though we could suppose that if they're asking such questions, they probably have some good devs on their payroll.
Distributed databases (in-memory or on disk, SQL or NoSQL), on the other hand, are quite a complicated topic that you would need to master (on top of Java) to get right.
The broad approach they're describing was characterized by Adya in 2019 as a LInK store: Linked In-memory Key-value stores allow application objects supporting rich operations to be sharded across a cluster of instances.
I would tend to approach this by implementing a stateful application using Akka, along the following lines (disclaimer: I am, as of this writing, employed by Lightbend, which employs the majority of Akka's developers and offers support and consulting services to clients using Akka; as my SO history indicates, I would have taken the same approach even years before I was employed by Lightbend).
Akka Cluster allows a set of JVMs running an application to form a cluster in a peer-to-peer manner and manages/tracks changes in membership (including detecting instances which have crashed or are isolated by a network partition).
Akka Cluster Sharding allows stateful objects keyed by ID to be distributed approximately evenly across the cluster and rebalanced in response to membership changes.
These stateful objects are implemented as actors: they can update their state in response to messages and (since they process messages one at a time) without needing elaborate synchronization.
Cluster sharding implies that the actor responsible for a given ID might, over time, exist on different instances, so the state of the zone needs some persistence outside of the cluster. For simplicity*, when the actor responsible for a given zone starts, it initializes itself from a datastore (could be S3, could be Dynamo or Cassandra or whatever); after this, its state is in memory, so reads can be served directly from the actor's state instead of going to an underlying datastore.
By directing all writes through cluster sharding, the in-memory representation is, by definition, kept in sync with the writes. To some extent, we can say that the application is the cache: the backing datastore only exists to allow the cache to survive operational issues (and because it's only in response to issues of that sort that the datastore needs to be read, we can optimize the data store for writes vs. reads).
Cluster sharding relies on a conflict-free replicated data type (CRDT) to broadcast changes in the shard allocation to the nodes of the cluster. This allows, for instance, any instance to handle an HTTP request for any shard: it simply forwards a representation of the important parts of the request as a message to the shard which will distribute it to the correct actor.
From Kubernetes' perspective, the instances are stateless: no StatefulSet or similar is needed. The pods can query the Kubernetes API to find the other pods and attempt to join the cluster.
*: I have a fairly strong prior that event sourcing would be a better persistence approach, but I'll set that aside for now.
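As a hedged sketch of the cluster-sharding setup described above, using Akka Typed in Scala; ZoneEntity, its messages, and loadFromDatastore are illustrative names invented for this example, not part of the original answer:

    import akka.actor.typed.{ActorRef, ActorSystem, Behavior}
    import akka.actor.typed.scaladsl.Behaviors
    import akka.cluster.sharding.typed.scaladsl.{ClusterSharding, Entity, EntityTypeKey}

    sealed trait ZoneCommand
    final case class GetRestaurants(replyTo: ActorRef[Restaurants]) extends ZoneCommand
    final case class Restaurants(names: List[String])

    object ZoneEntity {
      val TypeKey: EntityTypeKey[ZoneCommand] = EntityTypeKey[ZoneCommand]("Zone")

      def apply(zoneId: String): Behavior[ZoneCommand] =
        Behaviors.setup { _ =>
          // On first start (or after a rebalance), the entity hydrates
          // itself from the backing store; afterwards, reads are served
          // straight from memory.
          val restaurants = loadFromDatastore(zoneId)
          Behaviors.receiveMessage {
            case GetRestaurants(replyTo) =>
              replyTo ! Restaurants(restaurants)
              Behaviors.same
          }
        }

      private def loadFromDatastore(zoneId: String): List[String] =
        List.empty // stand-in for an S3/Cassandra/Dynamo read
    }

    // Register the entity type; Akka distributes the ~100 zone actors
    // across the cluster and rebalances them when membership changes.
    def initSharding(system: ActorSystem[_]): Unit =
      ClusterSharding(system).init(Entity(ZoneEntity.TypeKey)(ctx => ZoneEntity(ctx.entityId)))

Any instance can then message a zone through the sharding extension, which routes to whichever node currently hosts that entity.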

What would be the pattern to display all existing actors?

I built an Akka application that implements device management. Every device is an Akka actor, and I use an Akka finite state machine to control the lifecycle of a device (FUNCTIONAL, BROKEN, IN_REPAIRS, RETIRED, etc.). I persist the devices with Akka Persistence to Cassandra.
Everything works like a dream, but I have a dilemma, and I'd like to ask what the pattern would be to deal with it in Akka.
I will have nearly 1,000,000 devices. Akka is ideal for managing these as single instances, but how do I implement things if a user wants to see all the devices in the system, select one, and change its state?
I can't show this from the Akka journal table; I wouldn't be able to show anything other than the persistenceId.
So how would you handle this dilemma?
My current plan: since all events arrive in my system via Kafka, I would also consume these messages from the topic and redirect them to Solr/Elasticsearch, so I can index some metadata along with the persistenceId; the user can then select a device to process with its Akka actor.
Do you have a better idea, or how would you solve this?
Another option is to save this information in Cassandra, in another keyspace, but for some reason I don't fancy it.
Thanks for any answers.
Akka Persistence is for managing actor state so that it is resilient to application failures (https://www.reactivemanifesto.org/). It may not be optimal for business use cases like this one. I understand that your requirement is to be able to browse the actors in the system. I see a couple of options:
Option 1:
Akka supports a feature called named actors (https://doc.akka.io/docs/akka/current/general/addressing.html). In your case you have a one-to-one mapping between devices and actors, so you can take advantage of the named-actors feature: when creating actors in the actor system, name each one with its device ID. Now browsing your device IDs (as you mentioned, you can build a searchable module on Solr/Elasticsearch for this) amounts to browsing the actors in your system, and you can use a named actor's path to retrieve it from the system and perform actions on it.
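A minimal sketch of Option 1 using the classic (untyped) Akka API; DeviceActor and the "device-42" ID are illustrative names:

    import akka.actor.{Actor, ActorSystem, Props}

    class DeviceActor(deviceId: String) extends Actor {
      def receive: Receive = {
        case _ => // FSM commands for this device would be handled here
      }
    }

    val system = ActorSystem("devices")

    // Name the actor after its device ID at creation time...
    val ref = system.actorOf(Props(new DeviceActor("device-42")), name = "device-42")

    // ...so it can later be looked up by its well-known path.
    val selection = system.actorSelection("/user/device-42")
    selection ! "some-command"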
Option 2:
You can use monitoring tools to trace/browse the actors in the application. Beyond your immediate need, they provide several other useful metrics.
https://www.lightbend.com/blog/akka-monitoring-telemetry
https://kamon.io/solutions/monitoring-for-akka/
Akka Persistence is heavily oriented toward the Command-Query Responsibility Segregation (CQRS) style of implementing systems. There are plenty of great outlines describing this pattern if you want more depth, but the broad idea is that you separate the responsibility for changing data (the intent to change data being modeled through commands) from the responsibility for querying data. In some cases this separation carries through to separately deployed services, but it doesn't have to: the more separated they are, in terms of deployment/operations or development, the less coupled they are, so there's a cost/benefit tradeoff in where you want to be on the level-of-segregation spectrum.
The portion of the system that handles commands and decides how (or even whether) a given command updates state is often called the "write-side". In your application, the FSM actors modeling the state of a device and persisting changes are the write-side, and you seem to have that part down pat.
The portion handling the queries is, correspondingly, often called the "read-side", and one key benefit is that it can use a different data model than the write-side, up to and including a different data store (e.g. Solr/Elasticsearch).
Since you're using Akka Persistence and event sourcing (judging from the mention of the journal table), Akka Projections provides a good opinionated wrapper for publishing events from the write-side to Kafka, for another service to consume and update a Solr/Elasticsearch read-side. It does require (at least at this time) that your write-side tag events; with some effort you can achieve something similar by combining the persistenceIds and eventsByPersistenceId query streams to feed events from the write-side to Kafka without having to tag.
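As a hedged sketch of what tagging on the write-side can look like with Akka Persistence Typed; the Command/Event/State types and the "device" tag are placeholders, not from the original post:

    import akka.persistence.typed.PersistenceId
    import akka.persistence.typed.scaladsl.{Effect, EventSourcedBehavior}

    sealed trait Command
    sealed trait Event
    final case class State()

    def device(deviceId: String): EventSourcedBehavior[Command, Event, State] =
      EventSourcedBehavior[Command, Event, State](
        persistenceId = PersistenceId("Device", deviceId),
        emptyState = State(),
        commandHandler = (state, cmd) => Effect.none, // real handlers go here
        eventHandler = (state, evt) => state
      ).withTagger(_ => Set("device")) // every event carries the "device" tag

A projection can then consume the tagged stream (eventsByTag) and push each event to Kafka or directly into the search index.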
Note that when going down the CQRS path, you are generally committing to some level of eventual consistency between the write-side and the read-side.

AWS ElastiCache - Redis vs Memcached

I am reading in the AWS console about Redis and Memcached:
Redis
In-memory data structure store used as database, cache and message broker. ElastiCache for Redis offers Multi-AZ with Auto-Failover and enhanced robustness.
Memcached
High-performance, distributed memory object caching system, intended for use in speeding up dynamic web applications.
Has anyone used/compared both? What are the main differences and the use cases for each?
Thanks.
Pasting my answer from another Stack Overflow question:
Select Memcached if you have these requirements:
You want the simplest model possible.
You need to run large nodes with multiple cores or threads.
You need the ability to scale out and in, adding and removing nodes as demand on your system increases and decreases.
You want to partition your data across multiple shards.
You need to cache objects, such as a database.
Select Redis if you have these requirements:
You need complex data types, such as strings, hashes, lists, and sets.
You need to sort or rank in-memory datasets (see the sketch after this list).
You want persistence of your key store.
You want to replicate your data from the primary to one or more read replicas for read intensive applications.
You need automatic failover if your primary node fails.
You want publish and subscribe (pub/sub) capabilities, to inform clients about events on the server.
You want backup and restore capabilities.
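As an illustration of the sort/rank item above, a minimal sketch using a Redis sorted set via the Jedis client; host, key, and member names are made up:

    import redis.clients.jedis.Jedis

    val jedis = new Jedis("localhost", 6379)

    // Sorted sets keep members ordered by score on every insert.
    jedis.zadd("leaderboard", 4200, "alice") // (score, member)
    jedis.zadd("leaderboard", 1700, "bob")

    // Top 10 members by descending score, computed server-side:
    val top10 = jedis.zrevrange("leaderboard", 0, 9)

Memcached has no equivalent: it stores opaque blobs, so ranking would require fetching and sorting in the application.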
Here is an interesting whitepaper by AWS: https://d0.awsstatic.com/whitepapers/performance-at-scale-with-amazon-elasticache.pdf

Difference between Zookeeper and a managed replicated database service

I just came across ZooKeeper and am wondering what the difference is between ZooKeeper and an available, consistent, durable, distributed, replicated database service like AWS DynamoDB, or even AWS S3 (a storage service) for that matter. Key features like configuration management and distributed synchronization can very well be achieved with a database offering like DynamoDB. I understand there are architectural differences between ZooKeeper and products like DynamoDB, but from a feature perspective, are there any major differences between the two?
Is there any reason to use ZooKeeper over the others?
First, let me tell you some basics about ZooKeeper, which you may already know:
ZooKeeper is not a database.
ZooKeeper is a coordination service.
ZooKeeper is highly available and capable of managing more than 4000 nodes in a cluster.
ZooKeeper stores all its information in znodes, and every znode can be at most 1 MB.
ZooKeeper provides 3 types of znodes: ephemeral, sequential, and persistent.
Now, to answer your query:
ZooKeeper is used to provide exclusive locks to services in a master-slave architecture, where you want only one service to be active and perform all the reads/writes.
ZooKeeper can be used for sessions as well: an ephemeral node can be created per user session, and when the user logs out, the node is automatically deleted from ZooKeeper's memory.
ZooKeeper is reliable and fault-tolerant, and performs its operations in memory, which makes it even faster.
These are the main reasons why ZooKeeper is preferred over other services for coordination.
ZooKeeper is, in a nutshell, a distributed kernel: it provides low-level primitives with which you can build complex distributed systems.
1) ZooKeeper provides ordered messages, which are essential for distributed locks (and distributed systems in general). DynamoDB does not provide a per-client ordered-message guarantee.
2) Sequential znodes provide an atomic way to add elements in order under a common prefix string. Combined with ephemeral nodes and ordered notifications, they let you build notification-based recipes such as locks.
Let's say you want to lock a customer, customerABCD, to perform some work. Every machine can issue:
Create('/customerABCD/lock-', Sequential)
If 2 nodes perform the Create above, the znodes formed will be
/customerABCD/lock-1 and /customerABCD/lock-2.
To decide who is the leader, you can simply query
Get('/customerABCD') and pick the leader as the node with the lowest key value. Now let's say the node that created lock-1 dies; then lock-2 will get a notification message from ZooKeeper and can claim ownership of customerABCD.
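A minimal sketch of this recipe in Scala with the raw ZooKeeper client; the connect string is an assumption, and the /customerABCD parent znode is assumed to already exist:

    import org.apache.zookeeper.{CreateMode, WatchedEvent, ZooKeeper}
    import org.apache.zookeeper.ZooDefs.Ids
    import scala.jdk.CollectionConverters._

    val zk = new ZooKeeper("localhost:2181", 5000, (_: WatchedEvent) => ())

    // Each contender creates an ephemeral sequential znode; ZooKeeper
    // appends an increasing suffix (lock-0000000000, lock-0000000001, ...).
    val myPath = zk.create("/customerABCD/lock-", Array.emptyByteArray,
      Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL)

    // The contender with the lowest sequence number holds the lock.
    // Because the nodes are ephemeral, a dead owner's node vanishes and
    // the watching successor is notified and takes over.
    val children = zk.getChildren("/customerABCD", false).asScala.sorted
    val iAmOwner = myPath.endsWith(children.head)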
More examples of such distributed tasks are in https://learning.oreilly.com/library/view/zookeeper/9781449361297/ch02.html
In Dynamo, the machine that created the /customerABCD/lock-2 entry would have to poll to know whether the lock still exists. This is a slow way to acquire a lock, since it requires timeout-based polling; it is inefficient because compute is spent on the polling itself, and it adds polling load to the system as well.
3) When znodes are added/removed, the zxid version gets incremented. This forms the basis of versioning, which distributed systems can use to implement locks with fencing, as explained in "Making the lock safe with fencing" at https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
Again, Dynamo does not seem to have a similar auto-incrementing parent sequence number facility.

What is the disadvantage of just using Redis instead of an RDBMS?

So if, for example, I am trying to implement something like Facebook's Graph API that needs to be very fast and support millions of users, what is the disadvantage of just using Redis instead of an RDBMS?
Thanks!
Jonathan
There are plenty of potential benefits and potential drawbacks of using Redis instead of a classical RDBMS. They are very different beasts indeed.
Focusing only on the potential drawbacks:
Redis is an in-memory store: all your data must fit in memory. An RDBMS usually stores data on disk and caches part of it in memory. With an RDBMS, you can manage more data than you have memory; with Redis, you cannot.
Redis is a data structure server. There is no query language (only commands) and no support for relational algebra. You cannot submit ad-hoc queries (as you can with SQL on an RDBMS). All data accesses must be anticipated by the developer, and proper data access paths must be designed (see the sketch after this list). A lot of flexibility is lost.
Redis offers 2 options for persistence: regular snapshotting and append-only files. Neither is as secure as a real transactional server providing redo/undo logging, block checksumming, point-in-time recovery, flashback capabilities, etc.
Redis only offers basic security (in terms of access rights) at the instance level, whereas RDBMSs all provide fine-grained per-object access control lists (or role management).
A single Redis instance is not scalable: it runs on one CPU core, in single-threaded mode. To get scalability, several Redis instances must be deployed and started, and distribution and sharding are done client-side (i.e. the developer has to take care of them). Compared to a single Redis instance, most RDBMSs provide more scalability, typically offering parallelism at the connection level: they are multi-process (Oracle, PostgreSQL, ...) or multi-threaded (MySQL, Microsoft SQL Server, ...), taking advantage of multi-core machines.
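A small sketch of the "access paths must be designed" point above: to query users by country in Redis you must maintain your own index set by hand, whereas SQL lets you write an ad-hoc WHERE clause. All key names here are illustrative:

    import redis.clients.jedis.Jedis

    val jedis = new Jedis("localhost", 6379)

    // Store the record itself as a hash...
    jedis.hset("user:1", "name", "alice")
    jedis.hset("user:1", "country", "FR")

    // ...and maintain a secondary index by hand on every write.
    jedis.sadd("country:FR:users", "user:1")

    // The only efficient way to answer "all users in FR" is the index you
    // built yourself; there is no SELECT ... WHERE country = 'FR'.
    val frenchUsers = jedis.smembers("country:FR:users")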
Here, I have only described the main drawbacks, but keep in mind there are also plenty of benefits to using Redis (very fast, good concurrency support, low latency, protocol pipelining, good for easily implementing optimistic concurrency patterns, good usability/complexity ratio, excellent support from Salvatore and Pieter, a pragmatic no-nonsense approach, ...).
For your specific problem (a graph), I would suggest having a look at Neo4j or OrientDB, which are specifically designed for storing graph-oriented data.
I have some additions:
There are value-size limitations in Redis (a single string value can be at most 512 MB). When using Redis, always think about the size of your keys and values, especially with Redis Cluster.