Suggested lightweight key-value store for distributing state between many hosts

I'm looking for a key-value store that will be used to share some state between multiple hosts.
Goal:
- Achieve high availability for a limited set of data that needs to be accessible on every host/node
Requirements:
- put/get/incr/decr operations
- simple numeric data - int/float values, nothing more: no JSON, blobs and so on
- full copy of the dataset on every node, or automated failure tolerance
- automatic adding/removing of hosts with no need to reconfigure the application
- small dataset - only a few megabytes of shared data
- node traffic is load balanced with user-to-node stickiness, so only one node at a time will change data related to the users stuck to that node; this only changes on node failure, but the constraint of one master for a given set of keys is kept - many readers, one master for its own small dataset
- multiple small VM instances will be used, so it should be lightweight in terms of required memory
- automated operation - configure once and forget
I've looked at Riak and CouchDB, but they look too complicated and too heavy.
Any suggestions?

After more research I'm heading toward Hazelcast: it provides a memcache-like interface, and it's easy to configure a simple cluster with automated failover.
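For anyone with the same requirements, here is a minimal embedded-member sketch, assuming Hazelcast 3.x APIs (the map and counter names are just examples): every host starts a member, discovery and failover are handled by the cluster, and IMap/IAtomicLong cover put/get/incr/decr.

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IAtomicLong;
import com.hazelcast.core.IMap;

public class SharedState {
    public static void main(String[] args) {
        // Each host starts an embedded member; members discover each other
        // (multicast by default, or a TCP/IP member list) and form one cluster.
        Config config = new Config();
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);

        // put/get of simple numeric values; maps keep 1 backup copy by default
        IMap<String, Long> counters = hz.getMap("counters"); // "counters" is an example name
        counters.put("user:42:visits", 1L);
        Long visits = counters.get("user:42:visits");

        // incr/decr via a distributed atomic counter (Hazelcast 3.x API)
        IAtomicLong hits = hz.getAtomicLong("hits"); // example name
        hits.incrementAndGet();
        hits.decrementAndGet();

        System.out.println("visits=" + visits + ", hits=" + hits.get());
    }
}
```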

Related

Divide in-memory data between service instances

Recently in a system design interview I was asked a question where cities were divided into zones and data for around 100 zones was available. An API took the zone ID as input and returned all the restaurants for that zone in the response. The required response time for the API was 50 ms, so the zone data was kept in memory to avoid delays.
If the zone data is approximately 25 GB, then if the service is scaled to, say, 5 instances, it would need 125 GB of RAM.
Now the requirement is to run 5 instances but use only 25 GB of RAM, with the data split between the instances.
I believe that to achieve this we would need a second application acting as a config manager, managing which instance holds which zone's data. The instances can get the zones to track from the config manager service on startup. But the thing I am not able to figure out is how we redirect the request for a zone to the correct instance which holds its data, especially if we use Kubernetes. Also, if an instance holding partial data restarts, how do we track which zone data it was holding?
Splitting a dataset over several nodes: sounds like sharding.
In-memory: the interviewer might be asking about Redis or something similar.
Maybe this: https://redis.io/topics/partitioning#different-implementations-of-partitioning
Redis Cluster might fit -- keep in mind that when the docs mention "client-side partitioning", the client is some Redis client library loaded by your backends, which in turn respond to HTTP client/end-user requests.
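To make the "client-side" part concrete, here is a rough sketch using the Jedis client against a Redis Cluster (the host names and key format are made up): the client library hashes the key to a slot and routes the call to the node that owns it, while your HTTP layer stays unaware of the partitioning.

```java
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

import java.util.HashSet;
import java.util.Set;

public class ZoneLookup {
    public static void main(String[] args) {
        // Seed nodes of the Redis Cluster; the client discovers the rest.
        Set<HostAndPort> seeds = new HashSet<>();
        seeds.add(new HostAndPort("redis-node-1.example.internal", 6379));
        seeds.add(new HostAndPort("redis-node-2.example.internal", 6379));

        try (JedisCluster cluster = new JedisCluster(seeds)) {
            // The key is hashed to a slot; JedisCluster routes the call
            // to whichever cluster node owns that slot.
            String restaurants = cluster.get("zone:42:restaurants");
            System.out.println(restaurants);
        }
    }
}
```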
Answering your comment: in that case, I'm not sure what they were looking for.
Comparing Java hashmaps to a Redis cluster isn't completely fair, considering one is bound to your JVM while the other is actually distributed/sharded, implying at least inter-process communication and most likely network/non-local queries.
Then again, if the question is how to scale an ever-growing JVM, at some point we need to address the elephant in the room: how do you guarantee data consistency and proper replication/sharding, and what do you do when a member goes down?
A distributed hashmap, using Hazelcast, may be more relevant. Some (Hazelcast) would argue it is safer under heavy write load; others, that migrating from Hazelcast to Redis helped them improve service reliability. I don't have enough background in Java myself, so I wouldn't know.
As a general rule, when asked about Java, you could argue that speed and reliability rely very much on your developers' understanding of what they're doing -- which, in Java, implies a large margin of error. Then again, if they're asking such questions, they probably have some good devs on their payroll.
Whereas distributed databases (in-memory or on disk, SQL or NoSQL) are quite a complicated topic that you would need to master (on top of Java) to get this right.
The broad approach you're describing was characterized by Adya et al. in 2019 as a LInK store: Linked In-memory Key-value stores allow application objects supporting rich operations to be sharded across a cluster of instances.
I would tend to approach this by implementing a stateful application using Akka, along the following lines (disclaimer: as of this writing I am employed by Lightbend, which employs the majority of the developers of Akka and offers support and consulting services to clients using Akka; as my SO history indicates, I would have taken the same approach even years before I was employed by Lightbend).
Akka Cluster to allow a set of JVMs running an application to form a cluster in a peer-to-peer manner and manage/track changes in the membership (including detecting instances which have crashed or are isolated by a network partition)
Akka Cluster Sharding to allow stateful objects keyed by ID to be distributed approximately evenly across a cluster and rebalanced in response to membership changes
These stateful objects are implemented as actors: they can update their state in response to messages and (since they process messages one at a time) without needing elaborate synchronization.
Cluster sharding implies that the actor responsible for a given ID might, over time, exist on different instances, so that implies some persistence of the zone's state outside of the cluster. For simplicity*, when an actor responsible for a given zone starts, it initializes itself from a datastore (could be S3, could be Dynamo or Cassandra or whatever): after this its state is in memory, so reads can be served directly from the actor's state instead of going to an underlying datastore.
By directing all writes through cluster sharding, the in-memory representation is, by definition, kept in sync with the writes. To some extent, we can say that the application is the cache: the backing datastore only exists to allow the cache to survive operational issues (and because it's only in response to issues of that sort that the datastore needs to be read, we can optimize the data store for writes vs. reads).
Cluster sharding relies on a conflict-free replicated data type (CRDT) to broadcast changes in the shard allocation to the nodes of the cluster. This allows, for instance, any instance to handle an HTTP request for any shard: it simply forwards a representation of the important parts of the request as a message to the shard which will distribute it to the correct actor.
From Kubernetes' perspective, the instances are stateless: no StatefulSet or similar is needed. The pods can query the Kubernetes API to find the other pods and attempt to join the cluster.
*: I have a fairly strong prior that event sourcing would be a better persistence approach, but I'll set that aside for now.
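To make the sharding part concrete, here is a rough sketch using Akka Cluster Sharding's typed Java API; the message type, entity name, and datastore-loading helper are illustrative, not a definitive implementation.

```java
import akka.actor.typed.ActorRef;
import akka.actor.typed.ActorSystem;
import akka.actor.typed.Behavior;
import akka.actor.typed.javadsl.Behaviors;
import akka.cluster.sharding.typed.javadsl.ClusterSharding;
import akka.cluster.sharding.typed.javadsl.Entity;
import akka.cluster.sharding.typed.javadsl.EntityRef;
import akka.cluster.sharding.typed.javadsl.EntityTypeKey;

import java.util.List;

public class ZoneSharding {
  // Commands handled by a zone entity (illustrative message protocol)
  public interface Command {}
  public static final class GetRestaurants implements Command {
    public final ActorRef<List<String>> replyTo;
    public GetRestaurants(ActorRef<List<String>> replyTo) { this.replyTo = replyTo; }
  }

  public static final EntityTypeKey<Command> TYPE_KEY =
      EntityTypeKey.create(Command.class, "Zone");

  // Behavior of one zone actor: load state once, then serve reads from memory.
  public static Behavior<Command> zone(String zoneId) {
    return Behaviors.setup(ctx -> {
      List<String> restaurants = loadFromDatastore(zoneId); // e.g. S3/Dynamo/Cassandra
      return Behaviors.receive(Command.class)
          .onMessage(GetRestaurants.class, msg -> {
            msg.replyTo.tell(restaurants);
            return Behaviors.same();
          })
          .build();
    });
  }

  public static void init(ActorSystem<?> system) {
    // Register the entity type; sharding distributes and rebalances the
    // zone actors across cluster members.
    ClusterSharding.get(system)
        .init(Entity.of(TYPE_KEY, entityCtx -> zone(entityCtx.getEntityId())));
  }

  public static EntityRef<Command> zoneRef(ActorSystem<?> system, String zoneId) {
    // Any node can obtain a reference; messages are routed to the owning node.
    return ClusterSharding.get(system).entityRefFor(TYPE_KEY, zoneId);
  }

  private static List<String> loadFromDatastore(String zoneId) {
    return List.of(); // placeholder for the real datastore read
  }
}
```

An HTTP handler on any instance would then call zoneRef(system, zoneId) and ask that entity for its restaurants, regardless of which instance actually hosts the zone's state.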

AWS ElastiCache - Redis vs Memcached

I am reading in AWS console about Redis and MemcacheD:
Redis
In-memory data structure store used as database, cache and message broker. ElastiCache for Redis offers Multi-AZ with Auto-Failover and enhanced robustness.
Memcached
High-performance, distributed memory object caching system, intended for use in speeding up dynamic web applications.
Has anyone used/compared both? What are the main differences and use cases for the two?
Thanks.
Pasting my answer from another Stack Overflow question:
Select Memcached if you have these requirements:
You want the simplest model possible.
You need to run large nodes with multiple cores or threads.
You need the ability to scale out/in, adding and removing nodes as demand on your system increases and decreases.
You want to partition your data across multiple shards.
You need to cache objects, such as a database.
Select Redis if you have these requirements:
You need complex data types, such as strings, hashes, lists, and sets.
You need to sort or rank in-memory data-sets.
You want persistence of your key store.
You want to replicate your data from the primary to one or more read replicas for read intensive applications.
You need automatic failover if your primary node fails.
You want publish and subscribe (pub/sub) capabilities—to inform clients about events on the server.
You want backup and restore capabilities.
Here is an interesting whitepaper from AWS: https://d0.awsstatic.com/whitepapers/performance-at-scale-with-amazon-elasticache.pdf
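As a small illustration of the data-type and ranking points above, here is a sketch using the Jedis client (the key names are made up); none of this is expressible in Memcached's flat key/value model.

```java
import redis.clients.jedis.Jedis;

public class RedisFeatures {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("my-elasticache-endpoint", 6379)) {
            // Hash: a structured value under one key
            jedis.hset("user:1001", "name", "alice");
            jedis.hset("user:1001", "plan", "premium");

            // Sorted set: rank in-memory data
            jedis.zadd("leaderboard", 4200, "alice");
            jedis.zadd("leaderboard", 3100, "bob");
            System.out.println(jedis.zrevrange("leaderboard", 0, 9)); // top 10

            // Pub/sub: notify clients about events on the server
            jedis.publish("events", "user:1001 upgraded");
        }
    }
}
```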

Replicated caching solutions compatible with AWS

My use case is as follow:
We have about 500 servers running in an autoscaling EC2 cluster that need to access the same configuration data (laid out in a key/value fashion) several million times per second.
The configuration data isn't very large (1 or 2 GB) and doesn't change much (a few dozen updates/deletes/inserts per minute during peak time).
Latency is critical for us, so the data needs to be replicated and kept in memory on every single instance running our application.
Eventual consistency is fine. However, we need to make sure that every update will be propagated at some point (knowing that the servers can be shut down at any time).
The update propagation across the servers should be reliable and easy to set up (we can't have static IPs for our servers, and we don't want to go the route of "faking" multicast on AWS, etc.).
Here are the solutions we've explored in the past:
Using regular Java maps and our custom-built system to propagate updates across the cluster (obviously, it doesn't scale that well).
Using EhCache and its replication feature, but setting it up on EC2 is very painful and somewhat unreliable.
Here are the solutions we're thinking of trying out:
Apache Ignite (https://ignite.apache.org/) with a REPLICATED strategy.
Hazelcast's Replicated Map feature. (http://docs.hazelcast.org/docs/latest/manual/html-single/index.html#replicated-map)
Apache Geode on every application node. (http://geode.apache.org/)
I would like to know whether each of those solutions would work for our use case and, ideally, what issues I'm likely to face with each of them.
Here is what I found so far:
Hazelcast's Replicated Map is somewhat recent and still a bit unreliable (async updates can be lost when scaling down).
It seems like Geode became "stable" fairly recently (even though it has supposedly been in development since the early 2000s).
Ignite looks like it could be a good fit, but I'm not too sure how its S3-based discovery system will work out if we keep adding/removing nodes regularly.
Thanks!
Geode should work for your use case. You should be able to use a Geode Replicated region on each node. You can choose to do synchronous OR asynchronous replication. In case of failures, the replicated region gets an initial copy of the data from an existing member in the system, while making sure that no in-flight operations are lost.
In terms of configuration, you will have to start a couple of member discovery processes (Geode locators) and point each member at these locators. (We recommend that you start one locator per AZ and use 3 AZs to protect against network partitioning.)
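A minimal peer-member sketch along those lines (the locator addresses and region name are placeholders): each application node joins the distributed system via the locators and hosts a full copy of the replicated region.

```java
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.RegionShortcut;

public class ConfigCacheNode {
    public static void main(String[] args) {
        // Join the distributed system through the locators (one per AZ).
        Cache cache = new CacheFactory()
                .set("locators", "locator-a[10334],locator-b[10334],locator-c[10334]")
                .create();

        // REPLICATE keeps a full copy of the region's data on this member.
        Region<String, Double> config = cache
                .<String, Double>createRegionFactory(RegionShortcut.REPLICATE)
                .create("configData");

        config.put("feature.threshold", 0.75);
        Double threshold = config.get("feature.threshold");
        System.out.println("threshold=" + threshold);
    }
}
```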
Geode/GemFire has been stable for a while, powering low-latency, high-scalability requirements for reservation systems at Indian and Chinese railways, among other users, for a very long time.
Disclosure: I am a committer on Geode.
Ignite provides native AWS integration for discovery over S3 storage: https://apacheignite-mix.readme.io/docs/amazon-aws. It solves the main issue - you don't need to change the configuration when instances are restarted. In a nutshell, any node that successfully joins the topology writes its coordinates to a bucket (and removes them when it fails or leaves). When you start a new node, it reads this bucket and connects to one of the listed addresses.
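A rough Java configuration sketch of that setup (the bucket name and credentials are placeholders), combining S3-based discovery with a REPLICATED cache so every node holds the full dataset:

```java
import com.amazonaws.auth.BasicAWSCredentials;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.s3.TcpDiscoveryS3IpFinder;

public class IgniteConfigNode {
    public static void main(String[] args) {
        // Nodes register their addresses in an S3 bucket instead of a static IP list.
        TcpDiscoveryS3IpFinder ipFinder = new TcpDiscoveryS3IpFinder();
        ipFinder.setBucketName("my-ignite-discovery-bucket");
        ipFinder.setAwsCredentials(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        TcpDiscoverySpi discovery = new TcpDiscoverySpi();
        discovery.setIpFinder(ipFinder);

        // REPLICATED cache: a full copy of the (small) dataset on every node.
        CacheConfiguration<String, Double> cacheCfg = new CacheConfiguration<>("configData");
        cacheCfg.setCacheMode(CacheMode.REPLICATED);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(discovery);
        cfg.setCacheConfiguration(cacheCfg);

        Ignite ignite = Ignition.start(cfg);
        IgniteCache<String, Double> cache = ignite.getOrCreateCache("configData");
        cache.put("feature.threshold", 0.75);
        System.out.println(cache.get("feature.threshold"));
    }
}
```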
Hazelcast's Replicated Map will not work for your use case. Note that it is a map that is replicated across all its nodes, not on the client nodes/servers. Also, as you said, it is not fully reliable yet.
Here is the Hazelcast solution:
Create a Hazelcast cluster with a set of nodes, depending upon the size of the data.
Create a distributed map (IMap) and tweak the count & eviction configurations based on the size/number of key/value pairs. The data gets partitioned across all the nodes.
Set up the backup count based on how critical the data is and how long it takes to pull the data from the actual source (DB/files). Distributed maps have 1 backup by default.
On the client side, set up a NearCache and attach it to the distributed map. This NearCache will hold the key/value pairs on the local/client side itself, so get operations complete in milliseconds.
Things to consider with NearCache solution:
The first get operation will be slower, as it has to go over the network to get the data from the cluster.
Cache invalidation is not fully reliable, as there is a delay in synchronization with the cluster, so you may end up reading stale data. Then again, this is the same across all cache solutions.
It is the client's responsibility to set up timeouts and invalidation of NearCache entries so that future pulls get fresh data from the cluster. This depends on how often the data gets refreshed or a value is replaced for a key.
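For reference, a sketch of the client side of such a setup, assuming Hazelcast 3.x APIs and a map named "configData" (both illustrative):

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.config.InMemoryFormat;
import com.hazelcast.config.NearCacheConfig;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class ConfigClient {
    public static void main(String[] args) {
        // NearCache keeps recently read entries on the client, with
        // invalidation events pushed from the cluster.
        NearCacheConfig nearCacheConfig = new NearCacheConfig("configData");
        nearCacheConfig.setInMemoryFormat(InMemoryFormat.OBJECT);
        nearCacheConfig.setInvalidateOnChange(true);
        nearCacheConfig.setTimeToLiveSeconds(300); // refresh window, tune to your data

        ClientConfig clientConfig = new ClientConfig();
        clientConfig.addNearCacheConfig(nearCacheConfig);
        clientConfig.getNetworkConfig().addAddress("hazelcast-node-1:5701");

        HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);
        IMap<String, Double> config = client.getMap("configData");

        Double v1 = config.get("feature.threshold"); // first read goes to the cluster
        Double v2 = config.get("feature.threshold"); // served from the local NearCache
        System.out.println(v1 + " " + v2);
    }
}
```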

Using & Scaling Titan Graph Database

I am figuring out my options for storing hierarchical data (parent - child relationships).
Since a tree is a graph and a forest (of trees) is also technically a graph, a graph database seems to fit the bill much better than an RDBMS, especially since I am concerned with optimizing both read and write operations.
Optimizing writes implies changes in hierarchy require minimal writes.
Optimizing reads implies that materializing the full path to a particular node consumes minimal read operations.
My use case is:
A tree per user. Should I store and use one graph across the user space or one graph per user?
Path queries starting at any node and back to root of tree for a user.
Child nodes store links to parent nodes
Since all of my resources are in AWS, being able to use the Titan DynamoDB backend seems ideal.
My real problem is in understanding how to scale and manage Titan though.
Do I need a Gremlin Server instance? In other words, do I need to stand up an EC2 instance with Gremlin Server in order to do anything with Titan? Or can I use the Java Titan API to work with graph data directly?
Do I need to explicitly shard the data? In other words, do I need to stand up more Gremlin Servers as usage increases and the amount of data and the number of operations grow? When the number of servers scales out, do I need to consistent-hash across those servers from the client in order to perform operations?
Do I need to set up an Elasticsearch cluster to be able to start traversals from any node? Or is using vertices to represent objects and edges to represent parent relationships sufficient at this point? I can guarantee that vertex IDs are unique across the user space; I can also decorate each vertex with the unique user ID. In that case, do I need Elasticsearch? My hope is that Elasticsearch is for free-form or more complex search-type queries and not for exact queries!
As the number of front-ends increases, can each front-end open the graph (single graph across the user space)? If a graph per user, then since front-ends have no affinity, the same graph may be opened by each front-end; is that OK?
I wasn't able to find much documentation on any of this. Thank you!
I will try to answer your questions below:
Both solutions are possible; it depends highly on your application whether to choose Gremlin Server or a customized data access layer with customized queries through other secondary data stores. Although I would prefer a customized data access layer, it is possible to serve all Gremlin query requirements through Gremlin Server.
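For what it's worth, here is a rough sketch of the embedded route (the storage settings, property keys, and edge label are assumptions; the DynamoDB backend class comes from the separate dynamodb-titan-storage-backend plugin):

```java
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import org.apache.tinkerpop.gremlin.process.traversal.Path;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;

import java.util.List;

import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.out;
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.outE;

public class TreePaths {
    public static void main(String[] args) throws Exception {
        // Open Titan embedded in the application; no Gremlin Server needed.
        // Storage backend class/property names are assumptions based on the
        // dynamodb-titan-storage-backend plugin.
        TitanGraph graph = TitanFactory.build()
                .set("storage.backend",
                     "com.amazon.titan.diskstorage.dynamodb.DynamoDBStoreManager")
                .open();

        GraphTraversalSource g = graph.traversal();

        // Walk from a node up to the root of its tree by following "parent" edges.
        List<Path> pathToRoot = g.V()
                .has("userId", "user-1").has("nodeId", "node-123")
                .repeat(out("parent"))
                .until(outE("parent").count().is(0L))
                .path()
                .toList();

        System.out.println(pathToRoot);
        graph.tx().commit();
        graph.close();
    }
}
```

Note that exact-match lookups like the has() steps above can be served by a Titan composite index, which does not require an external indexing backend; Elasticsearch only becomes necessary for the richer predicates discussed in point 3.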
Gremlin Server is just an interface between your application and the data stores, and due to its caching mechanism it is memory-intensive. The data itself can be stored on different machines, for example a cluster of DynamoDB nodes. It depends on the number of concurrent users, but I think vertical scaling is more than enough for most applications. If you are going to use Titan in a highly concurrent environment, beyond the resources of a single machine, you will probably have to create different Gremlin Servers on different machines and handle the load balancing yourself. The problem is that you have to route requests in such a way that similar queries hit the same Gremlin Server, for cache efficiency.
Yes, an indexing backend is only useful for queries more complicated than simple retrieval. A secondary index backend like Solr, Elasticsearch or Lucene is useful if you want conditional search or text search by similarity. This is because an indexer like Lucene provides a reverse index structure that is helpful for similarity searches. If you are going to search for all parents/children with "foo" in their names, you have to use an indexing backend. If you are going to search for all parents/children with age less than 40, you have to use an indexing backend too.
More information about indexing backends can be found at these links:
http://s3.thinkaurelius.com/docs/titan/1.0.0/indexes.html
http://s3.thinkaurelius.com/docs/titan/1.0.0/index-parameters.html
It is highly recommended to limit the number of open graphs to one for the entire application. Titan uses caching mechanisms that encourage you to have a single graph instance in the entire application for the sake of performance. Since uncommitted data is only visible within a single graph instance and transaction, if you want a real-time application it is suggested to use a single graph instance and a single transaction. However, using more than one graph instance in the entire application for read-only transactions is not wrong; it is just not efficient.
You can find lots of information about Titan graph database in the following links:
Main Titan documentation: http://s3.thinkaurelius.com/docs/titan/1.0.0/
An old but really useful document about how Titan works: https://github.com/elffersj/delftswa-aurelius-titan/tree/master/SA-doc

Greenplum Query: Best Strategy to Move Objects from Pre-Prod to Prod Env

I have two different environments, Production (new) and Pre-Production (existing). We have been given a ready cluster with GP installed in the new Production environment.
I want to know the best way to move objects from the Pre-Production environment to the Production environment.
I know:
using gp_dump
using pg_dump
Manually dumping each object (table DDL, function DDL, view DDL, sequence DDL, etc.)
I want to know the best strategy and the pros and cons of each, if only the objects need to be backed up and restored from one environment to another.
I'd appreciate your input on this.
The available strategies, ranked by priority:
Use gpcrondump and gpdbrestore. This will work only if the number of segments in Pre-Production and Production is the same and the dbids are the same. It is the fastest way to transfer the whole database with its schema, as it works as a parallel dump and parallel restore. Since it is a backup, it will lock pg_class for a short time, which might create some problems on the Production system.
If the number of objects to transfer is small, you can use the gptransfer utility; see the user guide for reference. It provides the ability to transfer data directly between the segments of Pre-Production and Production. The requirement is that all the segment servers of the Pre-Production environment should be able to see all the segments of Production, which means they should be added to the same VLAN for the data transfer.
Write custom code and use writable external tables and readable external tables over a pipe object on a shared host. You would also have to write some manual code to compare DDL. The benefit of this method is that you can reuse the external tables to transfer data between the environments many times, and if the DDL has not changed, your transfer will be fast as the data won't be written to disk. But all the data would be transferred through a single host, which might be a bottleneck (up to 10 Gbps transfer rate with dual 10GbE connections on the shared host). Another big advantage is that there would be no locks on pg_class.
Run gpcrondump on the source system and restore the data serially on the target system. This is the way to go if you want a backup-restore solution and your source and target systems have a different number of segments.
In general, everything depends on what you want to achieve: move the objects a single time, move them once a month during a period of inactivity on the clusters, move all the objects weekly without stopping Production, move selected objects daily without stopping Production, etc. The answer really depends on your needs.