When to Add Shards to a Distributed Key/Value Store - concurrency

I've been reading up on distributed systems lately, and I've seen a lot of examples of how to shard key/value stores, like a memcached system or a NoSQL database.
In general, adding shards makes intuitive sense to me when you want to support more concurrent access to the table, and most of the examples cover that sort of usage. One thing I'm not clear on though is whether you are also supposed to add shards as your total table size grows. For something like a memcache, I'd imagine this is necessary, because you need more nodes with more memory to hold more key/values. But what about databases which also keep the values on some sort of hard drive?
It seems like, if your table size is growing but the amount of concurrent access is not, it would be somewhat wasteful to keep adding nodes just to hold more data. In that case I'd think you could just add more long-term storage. But I suppose the problem is, you are increasing the chance that your data becomes "cold" when somebody needs it, causing more latency for those requests.
Is there a standard approach to scaling nodes vs. storage? Are they always linked? Thanks much for any advice.

I think it is the other way around.
Almost always, shards are added because the data has grown to the point where it can no longer be held on one machine.
Sharding makes everything so much more painful that it should only be done once vertical scaling (a bigger machine, more disk) no longer works.

Related

Fast and frequent file access while executing C++ code

I am looking for suggestions on how best to implement my code for the following requirements. During execution of my C++ code, I frequently need to access data stored in a dictionary, which itself is stored in a text file. The dictionary contains 100 million entries, and at any point in time, my code may query the data corresponding to any particular entry among those 100 million. There is no particular pattern to the queries, and not all entries are queried during the lifetime of the program. Also, the dictionary remains unchanged while the program runs, and the data for each entry varies in length. The file size of my dictionary is ~24 GB, and I have only 16 GB of RAM. I need my application to be very fast, so I would like to know how best to implement such a system so that read access times are minimized.
I am also the one who is creating the dictionary, so I have the flexibility to break it down into several smaller volumes. While thinking about what I can do, I came up with the following two ideas, but I'm not sure whether either is good.
If I store the offset of each entry from the beginning of the file, then to read the data for a given entry I can jump directly to the corresponding offset. Is there a way to do this using, say, ifstream without looping through all the lines before that one? A quick search on the web seems to suggest this is not possible, at least with ifstream; are there other ways this can be done?
The other extreme thought was to create a single file for each entry in the dictionary, so I would have 100 million files. This approach has the obvious drawback of overhead in opening and closing the file stream.
In general I am not convinced either of the approaches I have in mind are good, and so I would like some suggestions.
Well, if you only need key/value access, and if the data is larger than what can fit in memory, the answer is a NoSQL database: a hash-type index for the keys and arbitrary values. If you have no other constraints, like concurrent access from many clients or extended scalability, you can roll your own. The most important question for a custom NoSQL database is the expected number of keys, which determines the size of the index file. You can find rather good hashing algorithms around, and you will have to make a trade-off between a larger index file and a higher risk of collisions. In any case, unless you want a terabyte-sized index file, your code must be prepared for possible collisions.
A detailed explanation with examples is far beyond what I can write in an SO answer, but this should give you a starting point.
The next optimization is deciding what should be cached in memory. That depends on the query pattern you expect. If the same key is unlikely to be queried more than once, you can probably just rely on the OS and filesystem cache (a slight improvement would be memory-mapped files); otherwise, caching the index and/or the values makes sense. Here again, you can choose and implement a caching algorithm.
Or if you think that it is too complex for little gain, you can search if one of the free NoSQL databases could meet your requirement...
Once you decide on using an on-disk data structure, it becomes less of a C++ question and more of a system-design question. You want to implement a disk-based dictionary.
From now on, you should consider the following factors: What are your disk's parameters? Is it an SSD or an HDD? What is your average lookup rate per second? Are you fine with 20 µs to 10 ms latencies for your Lookup() method?
On-disk dictionaries require random disk seeks. Such seeks have a latency of dozens of microseconds on an SSD and 3-10 ms on an HDD. There is also a limit on how many such seeks you can make per second. You can read this article, for example. The CPU stops being the bottleneck and I/O becomes important.
If you want to pursue this direction, there are state-of-the-art C++ libraries that give you an on-disk key/value store (no need for an out-of-process database), or you can do something simple yourself.
If your application is a batch process and not a server/UI program, i.e. you have another finite stream of items that you want to join with your dictionary, then I recommend reading about external algorithms like hash join or MapReduce. In these cases, it is possible to organize your data in such a way that instead of one huge 24 GB dictionary you have ten dictionaries of 2.4 GB each, and you sequentially load and join each one of them. But for that, I need to understand what kind of problem you are trying to solve.
To summarize, you need to design your system before coding the solution. Using mmap, tries, or the other tricks mentioned in the comments are local optimizations (if that); they are unlikely to be game-changers. I would not rush into exploring them before doing back-of-the-envelope computations to understand the main direction.

Storing a very large array of strings in AWS

I want to store a large array of strings in AWS to be used from my application. The requirements are as follows:
During normal operations, string elements will be added to the array and the array size will continue to grow
I need to enforce uniqueness - i.e. the same string cannot be stored twice
I will have to retrieve the entire array periodically - most probably to put it in a file and use it from the application
I need to backup the data (or at least be convinced that there is a good built-in backup system as part of the features)
I looked at the following:
RDS (MySQL) - this may be overkill and also may become uncomfortably large for a single table (millions of records).
DynamoDB - This is intended for key/value pairs, but I have only a single value per record. Also, and more importantly, retrieving a large number of records seems to be an issue in DynamoDB as the scan operation needs paging and also can be expensive in terms of capacity units, etc.
Single S3 file - This could be a practical solution except that I may need to write to the file (append) concurrently, and that is not a feature that is available in S3. Also, it would be hard to enforce the element uniqueness
DocumentDB - This seems to be too expensive and overkill for this purpose
ElastiCache - I don't have a lot of experience with this and wonder if it would be a good fit for my requirement, and whether it's practical to back it up periodically. It also uses key/value pairs, and reading millions of records (the entire data set) at once is not advisable.
Any insights or recommendations would be helpful.
Update:
I don't know why people are voting to close this. It is definitely a programming related question and I have already gotten extremely useful answers and comments that will help me and hopefully others in the future. Why is there such an obsession with opinionated closure of useful posts on SO?
DynamoDB might be a good fit.
It doesn't matter that you don't have any "value" to your "key". Just use the string as the primary key. That will also enforce uniqueness.
You get on-demand and continuous backups. I don't have experience with these so I can only point you to the documentation.
The full retrieval of the data might be the biggest hassle. A full-table Scan is not recommended with DynamoDB; it can get expensive. There is a way to use Data Pipeline to do an export (I also have not used it). Alternatively, you could put together a system yourself using DynamoDB Streams, e.g. push the stream to Kinesis and then to S3.

Is it good idea to store operational data in memcached?

I'm writing a data processor in C++ that should handle a lot of requests and do a lot of calculations; the requests are connected with each other. Now I'm thinking about easy horizontal scalability.
Is it a good idea to use memcached with replication (an instance on every processor) to store the operational data, so that every processor instance could handle any request in roughly equal time?
How fast and stable is memcached replication?
Very fast, but one major potential shortcoming of memcached is that it is not persistent. While a common design consideration when using a cache layer is that "data in the cache may go away at any point", this can result in painful warm-up time and/or costly cache stampedes.
I would check out Couchbase. http://www.couchbase.com/ It stores the cached data in RAM, but also flushes it out to disk periodically so if a machine gets restarted, the data is still there.
It's very easy to add nodes on the fly as well.
Just for fun you could also check out Riak: http://basho.com/riak/. Very easy to add nodes as your cache needs grow and very easy to get up and running. Also focused on key/value storage, which is good for caching objects.

Why does the AWS DynamoDB SDK not provide a means to store objects larger than 64 KB?

I had a use case where I wanted to store objects larger than 64 KB in DynamoDB. It looks like this is relatively easy to accomplish if you implement a kind of "paging" functionality, where you partition the object into smaller chunks and store them as multiple values for the key.
This got me thinking however. Why did Amazon not implement this in their SDK? Is it somehow a bad idea to store objects bigger than 64kb? If so, what is the "correct" infrastructure to use?
In my opinion, it's an understandable trade-off DynamoDB made. To be highly available and redundant, they need to replicate data. To get super-low latency, they allowed inconsistent reads. I'm not sure of their internal implementation, but I would guess that the higher this 64KB cap is, the longer your inconsistent reads might be out of date with the actual current state of the item. And in a super low-latency system, milliseconds may matter.
This pushes the problem of an inconsistent Query returning chunk 1 and 2 (but not 3, yet) to the client-side.
As per question comments, if you want to store larger data, I recommend storing in S3 and referring to the S3 location from an attribute on an item in DynamoDB.
For the record, the maximum item size in DynamoDB is now 400K, rather than 64K as it was when the question was asked.
From a Design perspective, I think a lot of cases where you can model your problem with >64KB chunks could also be translated to models where you can split those chunks to <64KB chunks. And it is most often a better design choice to do so.
E.g. if you store a complex object, you could probably split it into a number of collections each of which encode one of the various facets of the object.
This way you probably get better, more predictable performance for large datasets as querying for an object of any size will involve a defined number of API calls with a low, predictable upper bound on latency.
Very often, service operations people struggle to get this kind of predictability out of a system in order to guarantee a given latency at the 90/95/99th percentile of the traffic. AWS just chose to build this constraint into the API, as they probably already do for their own website and internal development.
Also, from an (AWS) implementation and tuning perspective, it is of course quite comfortable to assume a 64 KB cap, as it allows for predictable memory paging in/out, upper bounds on network round trips, etc.

list/map of key-value pairs backed up by file on disk

I need to make a list of key/value pairs (similar to std::map<std::string, std::string>) that is stored on disk and can be accessed by multiple threads at once. Keys can be added or removed, values can be changed, and keys are unique. Supposedly the whole thing might not fit into memory at once, so updates to the map must be saved to disk.
The problem is that I'm not sure how to approach this. I understand how to deal with the multithreading issues, but I'm not sure which data structure is suitable for storing the data on disk. Pretty much anything I can think of can dramatically change its structure and cause a massive rewrite of the disk storage if I approach the problem head-on. On the other hand, relational databases and the Windows registry deal with this problem, so there must be a way to approach it.
Is there a data structure that is "made" for such scenario?
Or do I simply use any traditional data structure(trees or skip lists, for example) and make some kind of "memory manager" (disk-backed "heap") that allocates chunks of disk space, loads them into memory on request and unloads them onto disk, when necessary? I can imagine how to write such "disk-based heap", but that solution isn't very elegant, especially when you add multi-threading to the picture.
Ideas?
The data structure that is "made" for your scenario is B-tree or its variants, like B+ tree.
Long and short of it: once you write things to disk, you are no longer dealing with "data structures" - you are dealing with "serialization" and "databases."
The C++ STL and its data structures do not really address these issues, but, fortunately, they have already been addressed thousands of times by thousands of programmers already. Chances are 99.9% that they've already written something that will work well for you.
Based on your description, sqlite sounds like it would be a decent, balanced choice for your application.
If you only need lookups (plus insertions and deletions) by key, and not more complex field-based queries, BDB (Berkeley DB) may be a better choice for your application.