The proper way of partitioning data in ElastiCache (Redis)

We are using ElastiCache (Redis) to implement a caching layer for our cloud platform, which has a MongoDB back end. We use Node.js and Java to access this data from different platform components.
Node.js example code is given below:
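// Connect to the ElastiCache (Redis) endpoint taken from the application config.
var redisClient = require('redis').createClient(config.aws.redis.port, config.aws.redis.endpoint, {no_ready_check: true});
// Keys are namespaced with a category prefix.
var redisKey = "urls_" + url;
redisClient.get(redisKey, function (redisErr, reply) {});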
In the Redis cache we store different categories of data, say X, Y and Z. Currently we keep them all on a single Redis node and use a namespace prefix to partition the data.
For example:
key ="X_"+url
key ="Y_"+money
key ="Z_"+length
But I came across this article, which says it's better to avoid namespaces: http://www.mikeperham.com/2015/09/24/storing-data-with-redis/
So what is the best solution for our use case?
Having a separate Redis node for each data partition?
X type data will be cached in A redis node
Y type data will be cached in B redis node
Z type data will be cached in C redis node
Or using multiple Redis DBs on a single Redis node?
X type data will be cached in 0th DB of A redis node
Y type data will be cached in 1st DB of A redis node
Z type data will be cached in 2nd DB of A redis node

For ElastiCache, it is generally better to store each set of data in its own cluster.
The primary downside of having multiple clusters is the management overhead, but with ElastiCache most of that is managed for you, which minimizes this downside.
While scaling your ElastiCache cluster is managed, it still needs to move data around: the more data you have in a single cluster, the longer it will take to scale, and the more performance impact you will suffer during online rescaling.
As for cost, splitting the data into multiple clusters lets each cluster use smaller instances, so in the end it may cost about the same.
Of course, there are always exceptions, and you should consider those issues within your unique context.
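As a minimal sketch of the one-cluster-per-category layout described above, the application can keep a small routing table and pick the cluster from the data category. The hostnames, the CLUSTERS map and the helper names endpointFor and locate are all hypothetical and only illustrate the idea; they are not part of any Redis client API.

```javascript
// Hypothetical category-to-cluster map; the hostnames are placeholders,
// not real ElastiCache endpoints.
const CLUSTERS = new Map([
  ['X', { host: 'cache-x.example.internal', port: 6379 }],
  ['Y', { host: 'cache-y.example.internal', port: 6379 }],
  ['Z', { host: 'cache-z.example.internal', port: 6379 }],
]);

// Return the connection settings of the cluster that owns a category.
function endpointFor(category) {
  const endpoint = CLUSTERS.get(category);
  if (!endpoint) {
    throw new Error('Unknown cache category: ' + category);
  }
  return endpoint;
}

// Resolve where a value lives. The key carries no "X_" style prefix,
// because each cluster holds exactly one category.
function locate(category, id) {
  return { endpoint: endpointFor(category), key: String(id) };
}
```

With this in place, a lookup for an X-type URL would first call locate('X', url) and then issue the GET against a client connected to the returned endpoint.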

Related

Amazon DynamoDB read latency while writing

I have an Amazon DynamoDB table which is used for both read and write operations. Write operations are performed only when the batch job runs at certain intervals whereas Read operations are happening consistently throughout the day.
I am seeing increased read latency when a significant number of write operations happen during the batch jobs. I looked into having a separate read replica for DynamoDB but found nothing of much use. Global tables are not an option because that's not what they are for.
Any ideas how to solve this?
Going by the Dynamo paper, the concept of a read replica for a record or a table does not exist in Dynamo. Within the same region, you will have multiple copies of a record depending on the replication factor (R+W > N), where N is the replication factor. However, when the client reads, one of those copies is returned, depending on the cluster health.
Depending on how the coordinator node is chosen, either in the client library or in the cluster, the client can only ask for a record (get) or send a record (put) either to the cluster coordinator (one extra hop) or to the node assigned to the record (a single hop to the record). There is just no way for the client to say 'give me a read replica from another node'. The replicas are there for fault tolerance: if one of the nodes containing the master copy of the record dies, the replicas will be used.
I am researching the same problem in the context of hot keys. Every record gets assigned to a node in Dynamo, so a million reads on the same record will lead to hot keys, loss of reads/writes, etc. How do we deal with this? A read replica would work great, because I could then manage the hot keys in the application and move all extra reads to the read replica(s). But this is again fraught with issues.
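The R+W > N condition mentioned above can be checked with a one-line function. This is only an illustration of the arithmetic from the Dynamo paper; the function name quorumsOverlap is hypothetical, and the values say nothing about how the DynamoDB service is actually configured.

```javascript
// n: number of replicas of a record
// r: replicas that must answer a read
// w: replicas that must acknowledge a write
// When r + w > n, every read quorum shares at least one replica
// with every write quorum.
function quorumsOverlap(n, r, w) {
  if (![n, r, w].every((v) => Number.isInteger(v) && v > 0)) {
    throw new RangeError('n, r and w must be positive integers');
  }
  if (r > n || w > n) {
    throw new RangeError('r and w cannot exceed n');
  }
  return r + w > n;
}
```

For example, quorumsOverlap(3, 2, 2) is true, while quorumsOverlap(3, 1, 1) is false.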

Best SymmetricDS architecture for DC to DR replication

We would like to back up a SQL Server cluster at the DC site to another standalone SQL Server at the DR site. We would like to use SymmetricDS and we want all DB objects from the source to be mirrored to the DR (including new tables, triggers and stored procedures). Some tables do not have primary keys.
We would like to know the type of architecture best suited to our needs.
The configuration for SymmetricDS would be two nodes that sync with each other. You could use one node group and link them, like "primary pushes to primary". By using bi-directional, you can use your mirror database when needed, and it will capture changes to get the other one back in sync when it becomes available.
SymmetricDS will replicate tables and data, but it does not replicate triggers and stored procedures. Also, the table replication works for most common cases, but misses details like computed columns and defaults that call functions.

Elasticache / Redis Timestamp of new data

Does ElastiCache store the time when an item is added to the cache? I want to filter data in my cache based on the time it was added, but I can't find a clear answer on whether this information is stored automatically or whether I have to add a timestamp manually for each item inserted into the cache.
Thanks!
Neither Redis nor ElastiCache's Redis-compatible service stores the timestamp automatically.
Doing so would be inefficient, as many use cases don't require it, so it's left as a client application implementation detail.
You may use a sorted set to store this information, so you can query for date ranges. And you can use Redis server time automatically if you use a Lua script. See How to store in Redis sorted set with server-side timestamp as score?.
This is particularly important if you have multiple nodes connecting, as they may have clock differences.
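A rough sketch of the sorted-set approach follows. It only builds the command arguments, so it does not depend on a particular client library; the names ADD_WITH_SERVER_TIME, addWithServerTimeArgs and addedBetweenArgs are hypothetical. The Lua script uses the Redis TIME command as the score, in milliseconds. As an assumption to verify for your engine version: older Redis releases may refuse a write that follows TIME inside a script unless the script first calls redis.replicate_commands().

```javascript
// Lua script: add ARGV[1] to the sorted set KEYS[1], scored with the
// server clock in milliseconds. TIME returns {seconds, microseconds}.
const ADD_WITH_SERVER_TIME = [
  "local t = redis.call('TIME')",
  'local ms = tonumber(t[1]) * 1000 + math.floor(tonumber(t[2]) / 1000)',
  "return redis.call('ZADD', KEYS[1], ms, ARGV[1])",
].join('\n');

// Arguments for: EVAL script numkeys key arg
function addWithServerTimeArgs(setKey, member) {
  return ['EVAL', ADD_WITH_SERVER_TIME, '1', String(setKey), String(member)];
}

// Arguments for: ZRANGEBYSCORE key min max
// Returns the members added between two millisecond timestamps (inclusive).
function addedBetweenArgs(setKey, fromMs, toMs) {
  if (fromMs > toMs) {
    throw new RangeError('fromMs must not be greater than toMs');
  }
  return ['ZRANGEBYSCORE', String(setKey), String(fromMs), String(toMs)];
}
```

The returned arrays can be handed to whatever raw-command facility your Redis client offers. Because the score comes from the server, application nodes with different clocks still produce comparable timestamps.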

DynamoDB local db limits - use for initial beta-go-live

Given DynamoDB's pricing, the thought came to mind to use DynamoDB Local on an EC2 instance for the go-live of our startup SaaS solution. I've been trying to find something like a data sheet for the local DB that specifies limits on the number of tables or records, or on the overall size of the DB file. Possibly we could even run a few local DB instances on dedicated EC2 servers, as we know at login which user needs to be connected to which DB.
Does anybody have any information on the local DB limits or on this approach? Also, does anybody know of any legal/licensing issues with using DynamoDB Local in that way?
Every item in DynamoDB Local will end up as a row in the SQLite database file. So the limits are based on SQLite's limitations.
Maximum Number Of Rows In A Table = 2^64 but the database file limit will likely be reached first (140 terabytes).
Note: because of the above, the number of items you can store in DynamoDB Local will be smaller with the preview version of local with Streams support. This is because to support Streams the update records for items are also stored. E.g. if you are only doing inserts of these items then the item will effectively be stored twice: once in a table containing item data and once in a table containing the INSERT UpdateRecord data for that item (more records will also be generated if the item is being updated over time).
Be aware that DynamoDB Local was not designed for the same performance, availability, and durability as the production service.
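To see why the file size limit is likely to be reached long before the row limit, here is a back-of-the-envelope calculation using the two figures quoted above. The average item size is an assumed input, 140 TB is taken as 140 * 10^12 bytes, and SQLite's own storage overhead is ignored, so the result is only a rough upper bound. The function name maxItemsByFileSize is hypothetical.

```javascript
// Limits quoted above, as BigInt so that 2^64 is represented exactly.
const MAX_ROWS = 2n ** 64n;
const MAX_FILE_BYTES = 140n * 10n ** 12n; // about 140 terabytes

// Rough upper bound on the number of items that fit in the file.
// avgItemBytes: assumed average stored size of one item.
// copiesPerItem: 2n approximates insert-only use with Streams support,
// where each item is stored once as data and once as an update record.
function maxItemsByFileSize(avgItemBytes, copiesPerItem = 1n) {
  if (avgItemBytes <= 0n || copiesPerItem <= 0n) {
    throw new RangeError('sizes must be positive');
  }
  return MAX_FILE_BYTES / (avgItemBytes * copiesPerItem);
}

// With an assumed 1000-byte average item, the file limit allows
// 140 billion items, far below the 2^64 row limit.
const byFile = maxItemsByFileSize(1000n);
const fileLimitHitFirst = byFile < MAX_ROWS;
```

Under these assumptions fileLimitHitFirst is true for any realistic item size, and maxItemsByFileSize(1000n, 2n) halves the estimate for the Streams case.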

Amazon Redshift optimizer (?) and distribution styles

I was studying Amazon Redshift using the Sybex Official Study Guide. On page 173 there are a couple of sentences:
You can configure the distribution style of a table to give Amazon RS hints as to how the data should be partitioned to best meet your query patterns. When you run a query, the optimizer shifts the rows to the compute node as needed to perform any joins and aggregates.
That leads me to some questions:
1) What is the role of the "optimizer"? Is data rearranged across compute nodes to boost performance for each new query?
2) If 1) is true and a completely different new query is run, what happens to the old data on the compute nodes?
3) Can you explain the 3 distribution styles (EVEN, KEY, ALL) in more detail, particularly the KEY style?
Extra questions:
1) Does the leader node hold any records?
To clarify a few things:
The Distribution Key is not a hint -- data is actually distributed according to the key
When running a query, the data is not "shifted" -- rather, copies of data might be sent to other nodes so that data can be joined on a particular node, but the data does not then reside on the destination node
The optimizer doesn't actually "do" anything -- it just computes the process that the nodes will follow (Redshift apparently writes C programs that are sent to each node)
The only thing you really need to know about the Optimizer is:
Query optimizer
The Amazon Redshift query execution engine incorporates a query optimizer that is MPP-aware and also takes advantage of the columnar-oriented data storage. The Amazon Redshift query optimizer implements significant enhancements and extensions for processing complex analytic queries that often include multi-table joins, subqueries, and aggregation.
From Data Warehouse System Architecture:
Leader node
The leader node manages communications with client programs and all communication with compute nodes. It parses and develops execution plans to carry out database operations, in particular, the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.
The leader node distributes SQL statements to the compute nodes only when a query references tables that are stored on the compute nodes. All other queries run exclusively on the leader node. Amazon Redshift is designed to implement certain SQL functions only on the leader node. A query that uses any of these functions will return an error if it references tables that reside on the compute nodes.
The Leader Node does not contain any data (unless you launch a single-node cluster, in which case the same server is used as the Leader Node and Compute Node).
For information on Distribution Styles, refer to: Distribution Styles
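As a toy illustration of the three styles (not Redshift's actual implementation), the sketch below places rows on nodes in memory. The hash function toyHash and the helper distribute are hypothetical, and the model is simplified to whole nodes. EVEN spreads rows round-robin, KEY sends every row with the same key value to the same node, and ALL puts a full copy of the table on every node.

```javascript
// Toy string hash; Redshift's real hash function is not shown here.
function toyHash(value) {
  let h = 0;
  for (const ch of String(value)) {
    h = (h * 31 + ch.codePointAt(0)) >>> 0; // keep within 32 bits
  }
  return h;
}

// Return an array with one list of rows per node.
function distribute(rows, nodeCount, style, keyColumn) {
  const nodes = Array.from({ length: nodeCount }, () => []);
  rows.forEach((row, index) => {
    if (style === 'EVEN') {
      nodes[index % nodeCount].push(row); // round-robin
    } else if (style === 'KEY') {
      nodes[toyHash(row[keyColumn]) % nodeCount].push(row); // same key, same node
    } else if (style === 'ALL') {
      nodes.forEach((node) => node.push(row)); // full copy everywhere
    } else {
      throw new Error('Unknown distribution style: ' + style);
    }
  });
  return nodes;
}
```

If two tables are both distributed with KEY on their join column, matching rows end up on the same node in this model, which is why a join on that column needs no copies sent between nodes.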
If you really want to learn about Redshift, then read the Redshift Database Developer Guide. If you are merely studying for the Solutions Architect exam, the above links will be sufficient for the level of Redshift knowledge.