My main table, Users, stores information about users. I plan to have a UserId field as the primary key of the table. I have full control of creation and assignment of these keys, and I want to ensure that I assign keys in a way that provides good performance. What should I do?
You have a few options:
1) The most generic solution is to use UUIDs, as specified in RFC 4122.
For example, you could have a STRING(36) column that stores UUIDs. Or you could store the UUID as a pair of INT64s or as a BYTES(16). There are some pitfalls to using UUIDs, so read the details of this answer.
2) If you want to save a bit of space and are absolutely sure that you will have fewer than a few billion users, then you could use an INT64 and assign UserIds using a random number generator. The reason you want to be sure you have fewer than a few billion users is the Birthday Problem: the odds of getting at least one collision reach about 50% once you have roughly 4 billion users, and they increase very quickly from there. If you assign a UserId that has already been assigned to a previous user, then your insertion transaction will fail, so you'll need to be prepared for that (by retrying the transaction after generating a new random number).
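As a rough sketch of that assign-and-retry loop (Java for illustration; tryInsertUser is a hypothetical hook standing in for your insert transaction):

```java
import java.util.concurrent.ThreadLocalRandom;

/** Minimal sketch: random INT64 UserIds with retry on collision.
 *  tryInsertUser is a hypothetical method that runs the insert transaction
 *  and returns false when the chosen id already exists. */
abstract class UserIdAssigner {
    abstract boolean tryInsertUser(long userId);

    long assignUserId() {
        while (true) {
            long candidate = ThreadLocalRandom.current().nextLong(1, Long.MAX_VALUE);
            if (tryInsertUser(candidate)) {
                return candidate; // insert succeeded, so the id is unique
            }
            // Insert failed because the id was taken: pick a new random id and retry.
        }
    }
}
```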
3) If there's some column, MyColumn, in the Users table that you would like to have as the primary key (possibly because you know you'll want to look up entries using this column frequently), but you're not sure about the tendency of this column to cause hotspots (say, because it's generated sequentially or based on timestamps), then you have two other options:
3a) You could "encrypt" MyColumn and use that as your primary key. In mathematical terms, you could apply a bijection (a one-to-one, reversible mapping) to the key values, which has the effect of chaotically scrambling them while never assigning the same value twice. In this case, you wouldn't need to store MyColumn separately at all; rather, you would only store/use the encrypted version and could decrypt it when necessary in your application code. Note that this encryption doesn't need to be secure; it just needs to guarantee that the bits of the original value are sufficiently scrambled in a reversible way. For example: if your values of MyColumn are integers assigned sequentially, you could just reverse the bits of MyColumn to create a sufficiently scrambled primary key, as in the sketch below. If you have a more interesting use case, you could use an encryption algorithm like XTEA.
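For instance, a minimal sketch of the bit-reversal variant (plain Java, nothing Spanner-specific; names are illustrative):

```java
/** Sketch of the bit-reversal trick from 3a. Long.reverse flips the bit order,
 *  so sequential inputs land far apart in key space, and reversing again
 *  recovers the original value. */
final class KeyScrambler {
    static long scramble(long myColumn) {
        // Note: inputs with the low bit set map to negative INT64 values
        // (top bit set), which is still a perfectly valid key.
        return Long.reverse(myColumn);
    }

    static long unscramble(long key) {
        return Long.reverse(key); // bit reversal is its own inverse
    }

    public static void main(String[] args) {
        long sequential = 42L;
        long key = scramble(sequential);                      // widely dispersed value
        System.out.println(key + " -> " + unscramble(key));   // prints "... -> 42"
    }
}
```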
3b) Have a compound primary key where the first part is a ShardId, computed as hash(MyColumn) % numShards, and the second part is MyColumn. The hash function will ensure that you don't create a hotspot by allocating all your rows to a single split. More information on this approach can be found here. Note that you do not need to use a cryptographic hash, although md5 or sha512 are fine functions; SpookyHash is a good option too. Picking the right number of shards is an interesting question and can depend upon the number of nodes in your instance; it's effectively a trade-off between hotspot-avoiding power (more shards) and read/scan efficiency (fewer shards). If you only have 3 nodes, then 8 shards is probably fine. If you have 100 nodes, then 32 shards is a reasonable value to try.
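A minimal sketch of that ShardId computation (Java for illustration; MD5 is used here only as a convenient, well-distributed hash, not for security):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch of option 3b: derive a ShardId from MyColumn so the compound key
 *  (ShardId, MyColumn) spreads writes across splits. */
final class Sharding {
    static int shardId(String myColumn, int numShards) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(myColumn.getBytes(StandardCharsets.UTF_8));
        // Take the first 4 bytes of the digest as an int, then reduce modulo numShards.
        int h = ((digest[0] & 0xFF) << 24) | ((digest[1] & 0xFF) << 16)
              | ((digest[2] & 0xFF) << 8)  |  (digest[3] & 0xFF);
        return Math.floorMod(h, numShards); // non-negative shard in [0, numShards)
    }
}
```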
Related
I'm trying to decide whether to use binary, number, or string for my DynamoDB table's partition key. My application is a React.js/Node.js social event-management application where as much as half of the data volume stored in DynamoDB will be used to store relationships between Items and Attributes to other Items and Attributes. For example: friends of a user, attendees at an event, etc.
Because the schema is so key-heavy, and because the maximum DynamoDB Item size is only 400KB, and for perf & cost reasons, I'm concerned about keys taking up too much space. That said, I want to use UUIDs for partition keys. There are well-known reasons to prefer UUIDs (or something with similar levels of entropy and minimal chance of collisions) for distributed, serverless apps where multiple nodes are giving out new keys.
So, I think my choices are:
Use a hex-encoded UUID (32 bytes stored after dashes are removed)
Encode the UUID using base64 (22 bytes)
Encode the UUID using z85 (20 bytes)
Use a binary-typed attribute for the key (16 bytes)
Use a number-typed attribute for the key (16-18 bytes?) - the Number type can only accommodate 127 bits, so I'd have to perform some tricks like stripping a version bit, but for my app that's probably OK. See How many bits of integer data can be stored in a DynamoDB attribute of type Number? for more info.
Obviously there's a tradeoff in developer experience. Using a hex string is the clearest but also the largest. Encoded strings are smaller but harder to deal with in logs, while debugging, etc. Binary and Number are harder than strings, but are the smallest.
I'm sure I'm not the first person to think about these tradeoffs. Is there a well-known best practice or heuristic to determine how UUID keys should be stored in DynamoDB?
If not, then I'm leaning towards using the Binary type, because it's the smallest storage and because its native representation (as a base64-encoded string) can be used everywhere humans need to view and reason about keys, including queries, logging, and client code. Other than having to transform it to/from a Buffer if I use DocumentClient, am I missing some problem with the Binary type or advantage of one of the other options in the list above?
If it matters, I'm planning for all access to DynamoDB to happen via a Lambda API, so even if there's conversion or marshalling required, that's OK because I can do it inside my API.
BTW, this question is a sequel to a 4-year-old question (UUID data type in DynamoDB) but 4 years is a looooooong time in a fast-evolving space, so I figured it was worth asking again.
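For concreteness, here's roughly the round trip I have in mind for the Binary option (sketched in Java just to show the byte layout and sizes; my stack is Node.js, so treat this as illustrative):

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.UUID;

/** Canonical UUID string -> 16 raw bytes (what the Binary attribute stores)
 *  -> base64 for logs and debugging. */
final class UuidBytes {
    static byte[] toBytes(UUID id) {
        return ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        byte[] raw = toBytes(id);                                  // 16 bytes stored
        String b64 = Base64.getEncoder().encodeToString(raw);      // 24 chars padded, 22 without padding
        System.out.println(id + " -> " + b64);
    }
}
```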
I had a similar issue and concluded that the size of the key did not matter too much, since all my options were going to be small and lightweight, with only minor tradeoffs. I decided that a programmer-friendly approach (i.e., friendly for me) would be to use the 'sub', the unique identifier Cognito creates for each user. That way, any collision issues, should they arise, would also be taken care of by Cognito. I could then encode it or not. However a user logs in, they end up with their 'sub'; I match that against the hash key in DynamoDB, which immediately grants them fine-grained access to only their own data. Three years later, I have found this to be a very reliable method.
The limits for partition and sort keys of DynamoDB are such that if I want to create a table with lots of users (e.g. the entire world population), then I can't just use a unique partition key to represent the personId; I need to use both the partition key and the sort key to represent a personId.
$util.autoId() in AppSync returns a 128-bit String. If I want to use this as the primary key in the DynamoDB table, then I need to split it into two Strings, one being the partition key and the other being the sort key.
What is the best way to perform this split? Or if this is not the best way to approach the design, how should I design it instead?
Also, do the limits on partition and sort keys apply to secondary indexes as well?
Regarding $util.autoId(), since it's generated randomly, if I call it many times, is there a chance that it will generate two id's that are exactly the same?
I think I'm misunderstanding something from your question's premise because, to my brain, using AppSync's $util.autoId() gives you back a 128-bit UUID. The point of UUIDs is that they're unique, so you can absolutely have one UUID per person in the world. And the UUID string will definitely fit within the maximum length limit of DynamoDB's partition key.
You also asked:
if I call it many times, is there a chance that it will generate two id's that are exactly the same?
It's extremely unlikely: a random (version 4) UUID has 122 random bits, so you would need to generate on the order of 2^61 IDs before a collision becomes at all likely.
I'm looking to build an algorithm that will scramble a matrix based on a 256-bit key. Given two m*n matrices A and B and a key K, I would like A and B to be scrambled in the same way. So, informally, if A==B, then scramble(A,K)==scramble(B,K).
What I'm trying to do seems to have similarities to encryption, but I'm wholly unfamiliar with the field. I feel like there must be some things I can leverage from encryption algorithms to make the process fast and computationally efficient.
To clarify, the main purpose of the scrambling is to obfuscate the matrix content while still allowing for comparisons to be made.
It sounds like you might need a cryptographic hash. Feeding your matrix/image into one generates an (almost) unique hash value for it. This hash value is convenient, as it's constant size and usually much smaller than the source data. It's practically impossible to go from the hash value back to the original data, and hashing the same image data again yields the same hash value.
If you want to add a secret key into this, you can concatenate the image data and the key and compute the hash over that. With the same data and key you'll receive the same hash value, and if you change either, the hash value changes.
(Almost unique: by the pigeonhole principle, when turning a large input into a smaller hash value, there must be multiple inputs that generate the same hash value. In practice, this is rarely a concern.)
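A minimal sketch of that keyed-hash approach, assuming the matrix is serialized row-major as doubles (that serialization is my assumption about your data; for a standard keyed construction you could also use HMAC):

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hashes the 256-bit key followed by the matrix contents with SHA-256.
 *  Equal matrices scrambled with the same key always yield the same digest.
 *  javax.crypto.Mac with "HmacSHA256" is the standard keyed-hash alternative. */
final class MatrixDigest {
    static byte[] digest(double[][] matrix, byte[] key256) throws NoSuchAlgorithmException {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        sha.update(key256); // prepend the secret key
        for (double[] row : matrix) {
            ByteBuffer buf = ByteBuffer.allocate(8 * row.length);
            for (double v : row) {
                buf.putDouble(v); // row-major serialization
            }
            sha.update(buf.array());
        }
        return sha.digest();
    }
}
```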
input
1 - - GET hm_brdr.gif
2 - - GET s102382.gif
3 - - GET bg_stars.gif
3 - - GET phrase.gif
map-reduce ->
("1", {"- - GET hm_brdr.gif"})
("2", {"- - GET s102382.gif"})
("3", {"- - GET bg_stars.gif", "- - GET phrase.gif"})
I want to make the first-column values 1, 2, 3, ... anonymous using random integers, but the mapping must be consistent: it shouldn't map 1->x in one line and 1->t in another. So my solution is to replace the "keys" with random integers (rand(1)=x, rand(2)=y, ...) in the reduce step, then ungroup the values with their new keys and write them to files again, as shown below.
output file
x - - GET hm_brdr.gif
y - - GET s102382.gif
z - - GET bg_stars.gif
z - - GET phrase.gif
My question is: is there a better way of doing this in terms of running time?
If you want to assign a random integer to a key value, then you'll have to do that in a reducer, where all key/value pairs for that key are gathered in one place. As #jason pointed out, you don't want to assign a random number, since there's no guarantee that a particular random number won't be chosen for two different keys. What you can do instead is increment a counter, held as an instance variable on the reducer, to get the next available number to associate with a key. If you have a small amount of data, then a single reducer can be used and the numbers will be unique. If you're forced to use multiple reducers, then you'll need a slightly more complicated technique. Use
Context.getTaskAttemptID().getTaskID().getId()
to get a unique reducer number with which to calculate an overall unique number for each key.
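A rough sketch of that reducer (Hadoop's newer mapreduce API; the class name and the bit budget per reducer are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/** Each reducer gets a disjoint id range by folding its task id into the
 *  high bits of a per-reducer counter, so ids are unique across reducers
 *  without any coordination. */
public class AnonymizeReducer extends Reducer<Text, Text, Text, Text> {
    private long counter = 0;
    private long taskId;

    @Override
    protected void setup(Context context) {
        taskId = context.getTaskAttemptID().getTaskID().getId();
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long anonymousId = (taskId << 40) | counter++; // assumes < 2^40 keys per reducer
        for (Text value : values) {
            context.write(new Text(Long.toString(anonymousId)), value);
        }
    }
}
```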
There is no way this is a bottleneck to your MapReduce job. More precisely, the runtime of your job is dominated by other concerns (network and disk I/O, etc.). A quick little key function? Meh.
But that's not even the biggest issue with your proposal. The biggest issue with your proposal is that it's doomed to fail. What is a key fact about keys? They serve as unique identifiers for records. Do random number generators guarantee uniqueness? No.
In fact, pretend for just a minute that your random key space has 365 possible values. It turns out that if you generate a mere 23 random keys, you are more likely than not to have a key collision; welcome to the birthday paradox. And all of a sudden you've lost the whole point of the keys in the first place, because you've started smashing together records by giving two that shouldn't share a key the same key!
And you might be thinking, well, my key space isn't as small as 365 possible keys, it's more like 2^32 possible keys, so I'm, like, totally in the clear. No. After approximately 77,000 keys you're more likely than not to have a collision.
Your idea is just completely untenable because it's the wrong tool for the job. You need unique identifiers. Random doesn't guarantee uniqueness. Get a different tool.
In your case, you need a function that is injective on your input key space (that is, it guarantees that f(x) != f(y) if x != y). You haven't given me enough details to propose anything concrete, but that's what you're looking for.
And seriously, there is no way that performance of this function will be an issue. Your job's runtime really will be completely dominated by other concerns.
Edit:
To respond to your comment:
here i am actually trying to make the ip numbers anonymous in the log files, so if you think there is a better way i ll be happy to know.
First off, we have a serious XY problem here. You should have asked about (or searched for answers to) that question instead. Anonymizing IP addresses, or anything for that matter, is hard. You haven't even told us the criteria for a "solution" (e.g., who are the attackers?). I recommend taking a look at this answer on the IT Security Stack Exchange site.
I know the original md5 algorithm produces a 128-bit hash.
Following Mark Adler's comments here, I'm interested in getting a good 64-bit hash.
Is there a way to create an md5-based 64-bit hash using OpenSSL? (md5 looks good enough for my needs.)
If not, is there another algorithm implemented in the OpenSSL library that can get this job done with quality no less than md5's (except for the length, of course)?
I claim that 'hash quality' is strongly related to the hash length.
AFAIK, OpenSSL does not have 64-bit hash algorithms, so the first idea I had is simple and most probably worthless:
halfMD5 = md5.hiQuadWord ^ md5.lowQuadWord
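A sketch of that XOR folding (shown with Java's MessageDigest for brevity; with OpenSSL you would fold the 16-byte MD5 output buffer the same way):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Folds a 128-bit MD5 digest into 64 bits by XOR-ing its two quad words. */
final class HalfMd5 {
    static long halfMd5(byte[] data) throws NoSuchAlgorithmException {
        byte[] md5 = MessageDigest.getInstance("MD5").digest(data); // 16 bytes
        ByteBuffer buf = ByteBuffer.wrap(md5);
        return buf.getLong() ^ buf.getLong(); // high quad word XOR low quad word
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(Long.toHexString(halfMd5("hello".getBytes(StandardCharsets.UTF_8))));
    }
}
```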
Finally, I'd simply use an algorithm with appropriate output, like crc64.
Some crc64 sources to verify:
http://www.backplane.com/matt/crc64.html
http://bioinfadmin.cs.ucl.ac.uk/downloads/crc64/
http://en.wikipedia.org/wiki/Computation_of_CRC
http://www.pathcom.com/~vadco/crc.html
Edit
At first glance, Jenkins looks perfect; however, I'm trying to find a friendly C++ implementation for it without luck so far. BTW, I'm wondering: since this is such a good hash for databases' duplication checking, how come none of the common open-source libraries, like OpenSSL, provides an API for it? – Subway
This might simply be due to the fact that OpenSSL is a crypto library first and foremost, using large hash values with appropriate cryptographic characteristics.
Hash algorithms for data structures have other primary goals, e.g. good distribution characteristics for hash tables, where small hash values are used as an index into a list of buckets containing zero, one, or multiple (colliding) elements.
So the point is whether, how, and where collisions are handled.
In a typical DBMS, an index on a column will handle them itself.
Corresponding containers (maps or sets):
C++: std::size_t (32 or 64 bits) for std::unordered_multimap and std::unordered_multiset
In Java, one would make a mapping with lists as buckets: HashMap<K, List<V>>
A unique constraint would additionally prohibit insertion of equal field contents:
C++: std::size_t (32 or 64 bits) for std::unordered_map and std::unordered_set
Java: int (32 bits) for HashMap and HashSet
For example, say we have a table with file contents (plaintext, non-crypto application) and a checksum or hash value for mapping or consistency checks. We want to insert a new file. For that, we precompute the hash value or checksum and query for existing files with equal hash values or checksums, respectively. If none exists, there won't be a collision, and insertion is safe. If there are one or more existing records, there is a high probability of an exact match and a lower probability of a 'real' collision.
If collisions can be ignored, one could add a unique constraint to the hash column and reuse existing records, accepting the possibility of mismatching/colliding contents. Here, you'd want a database-friendly hash algorithm like Jenkins.
If collisions need to be handled, one could add a unique constraint to the plaintext column instead. Less database-friendly checksum algorithms like CRC won't have an influence on collisions among records and can be chosen according to the types of corruption to be detected or other requirements. It is even possible to use the XOR'ed quad words of an md5 as mentioned at the beginning.
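As a sketch of the lookup-before-insert flow described above (JDBC, with illustrative table and column names):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Arrays;

/** Precompute the hash, fetch candidate rows with the same hash, and only
 *  treat a candidate as a duplicate if the actual contents match. */
final class FileDedup {
    static boolean isDuplicate(Connection db, byte[] hash, byte[] content) throws SQLException {
        try (PreparedStatement stmt =
                 db.prepareStatement("SELECT content FROM files WHERE content_hash = ?")) {
            stmt.setBytes(1, hash);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    if (Arrays.equals(rs.getBytes("content"), content)) {
                        return true; // exact match: a real duplicate
                    }
                    // otherwise it was a hash collision, not a duplicate
                }
            }
        }
        return false; // safe to insert
    }
}
```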
Some other thoughts:
If an index/constraint on plaintext columns does the mapping, any hash value can be used to do reasonably fast lookups to find potential matches.
No one will stop you from adding both a mapping-friendly hash and a checksum.
Unique constraints will also add an index, which is basically like the hash tables mentioned above.
In short, it greatly depends on what exactly you want to achieve with a 64-bit hash algorithm.