So I see that a Yelp URL looks like this:
http://www.yelp.com/biz_attribute?biz_id=doldrYLTdR9aYckHIsv55Q
The biz_id is a hash-like string rather than the more commonly seen integer or MongoDB ObjectId. Aside from obfuscation, are there other reasons why one would use a hash as an ID instead of the database's own ID?
One can imagine several reasons, but a good one that relates to MongoDB is using a hash as the shard key. A good shard key is one that distributes writes across separate shards, thereby achieving good write scaling, and a hash is a good way to ensure writes are well distributed across shards.
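As a minimal pymongo sketch of that idea (the database and collection names here are assumptions, and it requires a sharded cluster reached through mongos):

    from pymongo import MongoClient, HASHED

    client = MongoClient("mongodb://localhost:27017")
    db = client["mydb"]

    # A hashed index on the ID field is the building block for a hashed shard key.
    db["biz"].create_index([("biz_id", HASHED)])

    # Shard the collection on the hashed key so writes spread evenly across
    # shards instead of piling onto a single "hot" shard.
    client.admin.command("enableSharding", "mydb")
    client.admin.command("shardCollection", "mydb.biz", key={"biz_id": HASHED})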
I want to store a large array of strings in AWS to be used from my application. The requirements are as follows:
During normal operations, string elements will be added to the array and the array size will continue to grow
I need to enforce uniqueness - i.e. the same string cannot be stored twice
I will have to retrieve the entire array periodically - most probably to put it in a file and use it from the application
I need to backup the data (or at least be convinced that there is a good built-in backup system as part of the features)
I looked at the following:
RDS (MySQL) - this may be overkill and also may become uncomfortably large for a single table (millions of records).
DynamoDB - This is intended for key/value pairs, but I have only a single value per record. Also, and more importantly, retrieving a large number of records seems to be an issue in DynamoDB as the scan operation needs paging and also can be expensive in terms of capacity units, etc.
Single S3 file - This could be a practical solution except that I may need to write to the file (append) concurrently, and that is not a feature available in S3. Also, it would be hard to enforce element uniqueness
DocumentDB - This seems to be too expensive and overkill for this purpose
ElastiCache - I don't have a lot of experience with this and wonder if it would be a good fit for my requirements and whether it's practical to back it up periodically. It also uses key/value pairs, and it is not advisable to read millions of records (the entire data set) at once
Any insights or recommendations would be helpful.
Update:
I don't know why people are voting to close this. It is definitely a programming-related question, and I have already gotten extremely useful answers and comments that will help me and hopefully others in the future. Why is there such an obsession with opinionated closure of useful posts on SO?
DynamoDB might be a good fit.
It doesn't matter that you don't have any "value" to your "key". Just use the string as the primary key. That will also enforce uniqueness.
You get on-demand and continuous backups. I don't have experience with these so I can only point you to the documentation.
The full retrieval of the data might be the biggest hassle. A full-table Scan is not recommended with DynamoDB; it can get expensive. There's a way to use Data Pipeline to do an export (I also have not used it). Alternatively, you could put together a system yourself using DynamoDB Streams, e.g. push the stream to Kinesis and then to S3.
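As a rough boto3 sketch of that approach - the table name "unique-strings" and attribute name "string_value" are assumptions, with that string attribute as the table's partition key:

    import boto3
    from botocore.exceptions import ClientError

    table = boto3.resource("dynamodb").Table("unique-strings")

    def add_string(s):
        """Insert s; return False if it is already stored (uniqueness comes from the key)."""
        try:
            table.put_item(
                Item={"string_value": s},
                # Reject the write if an item with this key already exists.
                ConditionExpression="attribute_not_exists(string_value)",
            )
            return True
        except ClientError as e:
            if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                return False
            raise

    def dump_all():
        """Full retrieval via a paginated Scan - workable, but slow and capacity-hungry on big tables."""
        items, kwargs = [], {}
        while True:
            page = table.scan(**kwargs)
            items.extend(item["string_value"] for item in page["Items"])
            if "LastEvaluatedKey" not in page:
                return items
            kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]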
According to the listing documentation, it is possible to navigate a large number of keys as though they were hierarchical. I am planning to store a large number of keys (let's say a few hundred million), distributed over a sensibly sized 'hierarchy'.
What is the performance of using a prefix and delimiter? Does it require a full enumeration of keys on the S3 side, making it an O(n) operation? I have no idea whether keys are stored in a big hash table, in some indexing data structure, in a tree, or something else entirely.
I want to avoid the situation where I have a very large number of keys and navigating the 'hierarchy' suddenly becomes difficult.
So if I have the following keys:
abc/def/ghi/0
abc/def/ghi/1
abc/def/ghi/...
abc/def/ghi/100,000,000,000
Will it affect the speed of the query Delimiter='/', Prefix='abc/def'?
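For reference, this is what that listing looks like through boto3 (the bucket name is a placeholder):

    import boto3

    s3 = boto3.client("s3")

    # Pages through the keys under abc/def/ and the "subdirectories" directly
    # below it, up to 1000 entries per page.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="my-bucket", Prefix="abc/def/", Delimiter="/"):
        for obj in page.get("Contents", []):
            print(obj["Key"])
        for common in page.get("CommonPrefixes", []):
            print(common["Prefix"])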
Aside from the Request Rate and Performance Considerations document that Sandeep referenced (which is not applicable to your use case), AWS hasn't publicized very much regarding S3 performance. It's probably private intellectual property. So I doubt you'll find very much information unless you can get it somehow from AWS directly.
However, some things to keep in mind:
Amazon S3 is built for massive scale. Millions of companies are using S3 with millions of keys in millions of buckets.
AWS promotes prefix + delimiter listing as a perfectly valid use case.
There are common data structures and algorithms used in computer science that AWS is probably using behind the scenes to efficiently retrieve keys. One such data structure is called a Trie or Prefix Tree.
Based on all of the above, chances are that it's much better than an O(n) algorithm when you retrieve a listing of keys. I think you are safe to use prefixes and delimiters for your hierarchy.
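To make the prefix-tree intuition concrete, here is a toy Python trie over '/'-separated key segments. It is purely illustrative and says nothing about how S3 is actually built; the point is that listing under a prefix only walks the matching subtree, so the cost tracks the number of matching keys rather than the total number of keys.

    class Trie:
        def __init__(self):
            self.children = {}
            self.is_key = False

        def insert(self, key):
            node = self
            for part in key.split("/"):
                node = node.children.setdefault(part, Trie())
            node.is_key = True

        def list_prefix(self, prefix):
            # Walk down to the prefix node, then enumerate only its subtree.
            node, parts = self, [p for p in prefix.split("/") if p]
            for part in parts:
                if part not in node.children:
                    return
                node = node.children[part]
            yield from self._walk(node, "/".join(parts))

        def _walk(self, node, path):
            if node.is_key:
                yield path
            for part, child in node.children.items():
                yield from self._walk(child, path + "/" + part)

    t = Trie()
    for k in ["abc/def/ghi/0", "abc/def/ghi/1", "xyz/0"]:
        t.insert(k)
    print(list(t.list_prefix("abc/def")))  # ['abc/def/ghi/0', 'abc/def/ghi/1']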
As long as you are not using a continuous sequence (such as the dates 2016-08-13, 2016-08-14, and so on) in the prefix, you shouldn't face any problems. If your keys are auto-generated as a continuous sequence, then prepend a randomly generated hash prefix to the keys (aidk-2016-08-13, ujlk-2016-08-14).
The Amazon documentation says:
Amazon S3 maintains an index of object key names in each AWS region. Object keys are stored in UTF-8 binary ordering across multiple partitions in the index. The key name dictates which partition the key is stored in. Using a sequential prefix, such as timestamp or an alphabetical sequence, increases the likelihood that Amazon S3 will target a specific partition for a large number of your keys, overwhelming the I/O capacity of the partition. If you introduce some randomness in your key name prefixes, the key names, and therefore the I/O load, will be distributed across more than one partition.
http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
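A minimal sketch of that randomized-prefix technique (the function name is made up):

    import hashlib

    def randomized_key(sequential_key, width=4):
        # Derive a short, stable prefix from the key itself so that otherwise
        # sequential keys land in different parts of the key space.
        prefix = hashlib.md5(sequential_key.encode()).hexdigest()[:width]
        return prefix + "-" + sequential_key

    print(randomized_key("2016-08-13"))  # something like 'ab12-2016-08-13'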
Amazon indicates that prefix naming strategies, such as randomized hashing, no longer influence S3 lookup performance.
https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html
I am looking to index some data in DynamoDB and would like to key on an incrementing integer ID. The higher IDs will get most of the traffic; however, this will be spread evenly across tens of thousands of the highest IDs. Will this create the uniform data access that is important for DynamoDB?
AWS don't seem to publish details of the hashing algorithm they use to generate primary keys. I am assuming it is something akin to MD5 where, for example, the hash for 3000 is completely different from those for 3001, 3002 and 3003, and therefore it will result in a uniformly distributed workload.
The reason I ask, is that I know this is not the case in S3 where they suggest reversing auto incrementing IDs in cases like this.
DynamoDB doesn't seem to expose the internal workings of its hashing in the documentation. A lot of places seem to quote MD5, but I am not sure they can be considered authoritative.
An interesting study of the distribution of hashes for number sequences is available here. The interesting data sets are Dataset 4 and Dataset 5, which deal with sequences of numbers. Most hashing functions (and MD5 even more so) seem to be distributed satisfactorily from the viewpoint of partitioning.
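For intuition, you can see that avalanche behaviour with a few lines of Python; sequential inputs yield unrelated-looking digests, which is why a hash-based partitioner spreads consecutive IDs around (whether or not DynamoDB actually uses MD5):

    import hashlib

    for n in range(3000, 3004):
        print(n, hashlib.md5(str(n).encode()).hexdigest())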
AWS have confirmed that using an incrementing integer ID will create an even workload:
If you are using incrementing numbers as the hash key, they will be distributed equally among the hash key space.
Source: https://forums.aws.amazon.com/thread.jspa?threadID=189362&tstart=0
I had a use case where I wanted to store objects larger than 64 KB in DynamoDB. It looks like this is relatively easy to accomplish if you implement a kind of "paging" functionality, where you partition the objects into smaller chunks and store them as multiple values for the key.
This got me thinking, however. Why did Amazon not implement this in their SDK? Is it somehow a bad idea to store objects bigger than 64 KB? If so, what is the "correct" infrastructure to use?
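For reference, a rough sketch of the chunking scheme described in the question, using boto3 with a composite key; the table name, key names, and chunk size are assumptions:

    import boto3
    from boto3.dynamodb.conditions import Key

    # Assumes a table "big-objects" with partition key "object_id" (S)
    # and sort key "chunk" (N).
    CHUNK_SIZE = 60 * 1024  # stay under the per-item limit, leaving headroom
    table = boto3.resource("dynamodb").Table("big-objects")

    def put_blob(object_id, blob):
        for i in range(0, len(blob), CHUNK_SIZE):
            table.put_item(Item={
                "object_id": object_id,
                "chunk": i // CHUNK_SIZE,
                "data": blob[i:i + CHUNK_SIZE],
            })

    def get_blob(object_id):
        # A very large object may also need query pagination; omitted here.
        resp = table.query(
            KeyConditionExpression=Key("object_id").eq(object_id),
            ScanIndexForward=True,  # return chunks in order
        )
        return b"".join(item["data"].value for item in resp["Items"])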
In my opinion, it's an understandable trade-off DynamoDB made. To be highly available and redundant, they need to replicate data. To get super-low latency, they allowed inconsistent reads. I'm not sure of their internal implementation, but I would guess that the higher this 64KB cap is, the longer your inconsistent reads might be out of date with the actual current state of the item. And in a super low-latency system, milliseconds may matter.
This pushes the problem of an inconsistent Query returning chunk 1 and 2 (but not 3, yet) to the client-side.
As per question comments, if you want to store larger data, I recommend storing in S3 and referring to the S3 location from an attribute on an item in DynamoDB.
For the record, the maximum item size in DynamoDB is now 400 KB, rather than 64 KB as it was when the question was asked.
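A minimal sketch of that S3-pointer pattern (the bucket, table, and attribute names are assumptions):

    import boto3

    s3 = boto3.client("s3")
    table = boto3.resource("dynamodb").Table("objects")
    BUCKET = "my-large-objects"

    def put_large(object_id, blob):
        # Payload goes to S3; DynamoDB keeps only a small pointer item.
        key = "payloads/" + object_id
        s3.put_object(Bucket=BUCKET, Key=key, Body=blob)
        table.put_item(Item={"object_id": object_id, "s3_key": key, "size": len(blob)})

    def get_large(object_id):
        item = table.get_item(Key={"object_id": object_id})["Item"]
        return s3.get_object(Bucket=BUCKET, Key=item["s3_key"])["Body"].read()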
From a design perspective, I think a lot of cases where you can model your problem with chunks larger than 64 KB could also be translated into models where you split those chunks into pieces smaller than 64 KB. And it is most often a better design choice to do so.
E.g. if you store a complex object, you could probably split it into a number of collections, each of which encodes one of the various facets of the object.
This way you probably get better, more predictable performance for large datasets as querying for an object of any size will involve a defined number of API calls with a low, predictable upper bound on latency.
Very often, service operations people struggle to get this predictability out of the system so as to guarantee a given latency at the 90th/95th/99th percentile of the traffic. AWS just chose to build this constraint into the API, as they probably already do for their own website and internal developments.
Also, of course, from an (AWS) implementation and tuning perspective, it is quite convenient to assume a 64 KB cap, as it allows for predictable memory paging, upper bounds on network round trips, and so on.
I've got some data that I want to save on Amazon S3. Some of this data is encrypted and some is compressed. Should I be worried about single bit flips? I know of the MD5 hash header that can be added. This (from my experience) will prevent flips in the most unreliable portion of the deal (network communication); however, I'm still wondering whether I need to guard against flips on disk.
I'm almost certain the answer is "no", but if you want to be extra paranoid you can precalculate the MD5 hash before uploading, compare that to the MD5 hash you get after upload, then when downloading calculate the MD5 hash of the downloaded data and compare it to your stored hash.
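A boto3 sketch of that belt-and-braces check (bucket and key are placeholders); passing ContentMD5 makes S3 reject a PUT whose body doesn't match, and the compare after download guards the read path:

    import base64
    import hashlib
    import boto3

    s3 = boto3.client("s3")
    BUCKET, KEY = "my-bucket", "backups/data.bin"

    def upload_verified(data):
        digest = hashlib.md5(data).digest()
        # S3 recomputes the MD5 server-side and errors if it doesn't match.
        s3.put_object(
            Bucket=BUCKET,
            Key=KEY,
            Body=data,
            ContentMD5=base64.b64encode(digest).decode(),
        )
        return digest.hex()  # keep this hash in your own records

    def download_verified(expected_hex):
        body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
        if hashlib.md5(body).hexdigest() != expected_hex:
            raise ValueError("downloaded object failed MD5 check")
        return body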
I'm not sure exactly what risk you're concerned about. At some point you have to defer the risk to somebody else. Does "corrupted data" fall under Amazon's Service Level Agreement? Presumably they know what the file hash is supposed to be, and if the hash of the data they're giving you doesn't match, then it's clearly their problem.
I suppose there are other approaches too:
Store your data with an FEC so that you can detect and correct N bit errors up to your choice of N.
Store your data more than once in Amazon S3, perhaps across their US and European data centers (I think there's a new one in Singapore coming online soon too), with RAID-like redundancy so you can recover your data if some number of sources disappear or become corrupted.
It really depends on just how valuable the data you're storing is to you, and how much risk you're willing to accept.
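If you go the store-it-more-than-once route mentioned above, the simplest version is just writing each object to buckets in two regions (the bucket names here are made up):

    import boto3

    TARGETS = [
        ("us-east-1", "my-data-us"),
        ("eu-west-1", "my-data-eu"),
    ]

    def put_everywhere(key, data):
        # Write the same object to each bucket; losing one copy leaves the other.
        for region, bucket in TARGETS:
            boto3.client("s3", region_name=region).put_object(
                Bucket=bucket, Key=key, Body=data
            )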
I see your question from two points of view, a theoretical and practical.
From a theoretical point of view, yes, you should be concerned - and not only about bit flipping, but about several other possible problems. In particular, section 11.5 of the customer agreement says that Amazon:
MAKE NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, WHETHER EXPRESS, IMPLIED, STATUTORY OR OTHERWISE WITH RESPECT TO THE SERVICE OFFERINGS. [...] WE AND OUR LICENSORS DO NOT WARRANT THAT THE SERVICE OFFERINGS WILL FUNCTION AS DESCRIBED, WILL BE UNINTERRUPTED OR ERROR FREE, OR FREE OF HARMFUL COMPONENTS, OR THAT THE DATA YOU STORE WITHIN THE SERVICE OFFERINGS WILL BE SECURE OR NOT OTHERWISE LOST OR DAMAGED.
Now, in practice, I'd not be concerned. If your data gets lost, you'll blog about it, and (although they might not face any legal action) their business will be pretty much over.
On the other hand, it depends on how vital your data is. Suppose you were rolling your own stuff in your own data center(s). How would you plan for disaster recovery there? If your answer is "I'd just keep two copies in two different racks", then use the same technique with Amazon, perhaps keeping two copies in two different data centers. (Since you wrote that you are not interested in how to protect against bit flips, I'm providing only a trivial example here.)
Probably not: Amazon uses checksums to protect against bit flips, regularly combing through data at rest to ensure that none have occurred. So, unless you have corruption in all copies of the data within the interval between integrity-check passes, you should be fine.
Internally, S3 uses MD5 checksums throughout the system to detect/protect against bitflips. When you PUT an object into S3, we compute the MD5 and store that value. When you GET an object we recompute the MD5 as we stream it back. If our stored MD5 doesn't match the value we compute as we're streaming the object back we'll return an error for the GET request. You can then retry the request.
We also continually loop through all data at rest, recomputing checksums and validating them against the MD5 we saved when we originally stored the object. This allows us to detect and repair bit flips that occur in data at rest. When we find a bit flip in data at rest, we repair it using the redundant data we store for each object.
You can also protect yourself against bitflips during transmission to and from S3 by providing an MD5 checksum when you PUT the object (we'll error if the data we received doesn't match the checksum) and by validating the MD5 when you GET an object.
Source:
https://forums.aws.amazon.com/thread.jspa?threadID=38587
There are two ways of reading your question:
"Is Amazon S3 perfect?"
"How do I handle the case where Amazon S3 is not perfect?"
The answer to (1) is almost certainly "no". They might have lots of protection to get close, but there is still the possibility of failure.
That leaves (2). The fact is that devices fail, sometimes in obvious ways and other times in ways that appear to work but give an incorrect answer. To deal with this, many databases use a per-page CRC to ensure that a page read from disk is the same as the one that was written. This approach is also used in modern filesystems (for example ZFS, which can write multiple copies of a page, each with a CRC, to handle RAID controller failures; I have seen ZFS correct single-bit errors from a disk by reading a second copy - disks are not perfect).
In general you should have a check to verify that your system is operating as you expect. Using a hash function is a good approach. What approach you take when you detect a failure depends on your requirements. Storing multiple copies is probably the best approach (and certainly the easiest) because you get protection from site failures, connectivity failures, and even vendor failures (by choosing a second vendor), rather than just redundancy in the data itself as FEC provides.
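As a small illustration of that per-page checksum idea, here is a sketch that stores a CRC alongside each fixed-size chunk and verifies it on read:

    import zlib

    PAGE = 4096

    def write_pages(data):
        # Checksum each page at write time, in the spirit of ZFS / database page CRCs.
        return [(zlib.crc32(data[i:i + PAGE]), data[i:i + PAGE])
                for i in range(0, len(data), PAGE)]

    def read_pages(pages):
        out = bytearray()
        for crc, chunk in pages:
            if zlib.crc32(chunk) != crc:
                raise IOError("page failed CRC check; recover from another copy")
            out.extend(chunk)
        return bytes(out)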