I am working with a data set that has a secondary index whose sort key ultimately contains user-entered information. For the sake of the question, consider it a "postal address" field. The model is meant to permit quick queries of this data for a particular postal address.
Because it is user entered, I find myself wanting to regularize it before using it as a key, for instance by stripping spaces and normalizing the case. My thinking is that if someone made a trivial capitalization or spacing error, it wouldn't be identified as a different address.
Is this a pattern that people typically do if they are creating a key on user entered data? Are "user entered keys" considered harmful? Any obvious pitfalls?
Just make sure you get your normalization function right. Simply stripping spaces might not be a great idea. For example, Hight Railroad and High Trail Road might both normalize to hightrailroad which probably isn't what you want. Instead, you might want to replace one or more consecutive spaces with a single dash or something along those lines.
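To make that concrete, here's a minimal sketch of the kind of normalization function I mean (the exact rules are assumptions to tune for your data):

```python
import re

def normalize_address(raw: str) -> str:
    """Normalize a user-entered address before using it as a key.

    Illustrative rules only; tune them for your data.
    """
    s = raw.strip().lower()
    # Collapse runs of whitespace into a single dash so that
    # "High Trail Road" and "Hight Railroad" remain distinct keys.
    s = re.sub(r"\s+", "-", s)
    return s

# normalize_address("High Trail Road")  -> "high-trail-road"
# normalize_address("Hight  Railroad")  -> "hight-railroad"
```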
If you get the normalization right, you should be fine. Others have mentioned vulnerabilities related to overwriting data, but you said this is a Global Secondary Index. You can't write to a GSI, so you don't need to worry about that. Also, the user-entered data is the sort key; as long as you control the hash key, you will be fine.
One thing I would be cautious of is the data distribution. Any time there is a user-influenced key, whether it is direct user input or a side effect of a user action (such as a timestamp), you need to take care to avoid unbalanced data distributions, which can lead to hot partitions and/or throttling.
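A common mitigation (a sketch, not something from the question; the shard count is an assumption) is to spread writes for a user-influenced key across N logical shards with a random suffix; reads then fan out across all N suffixed keys and merge the results:

```python
import random

NUM_SHARDS = 10  # assumption: size this to your write throughput

def sharded_key(base_key: str) -> str:
    """Append a random shard suffix so writes spread across partitions.

    A read for base_key must then query all NUM_SHARDS suffixed keys
    and merge the results.
    """
    return f"{base_key}#{random.randrange(NUM_SHARDS)}"
```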
The DynamoDB best practice documentation has this line:
You should maintain as few tables as possible in a DynamoDB application. Most well designed applications require only one table.
It's the last line that confuses me the most.
Take an example photo storage application. Does this mean that I should store user accounts (account ID, password, email) and photos (owner ID, photo location, metadata) in the same table?
If so, I assume the primary key should be the account/owner ID, and the sort key would be the type of object it is (e.g. account or photo).
Should I be using one table like this instead of two tables (one for accounts, one for photos)?
It is generally recommended to use as few tables as possible, and very often a single table, unless you have a really good reason to use more than one. Chances are you won't have a good reason, other than old habits.
It seems counter-intuitive if you are coming from a traditional database background (like me), but it is in fact best practice.
The primary key could become a combination of the 'row'/object type and another value, stored in a single field, e.g. 'account#12345' for an account object with a unique id of 12345, and 'photo#67890' for a photo object with an id of 67890.
If you are looking up an account by its id number, you would query with the 'account' prefix, and if you were looking for a photo, you would use the 'photo' prefix. This is a very simple example; your design may vary.
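A minimal sketch of one such layout with boto3 (the table name, attribute names, and the choice to key each account's photos under the account's partition key are all assumptions):

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical single-table layout:
#   pk = "account#<accountId>", sk = "account"          (account items)
#   pk = "account#<accountId>", sk = "photo#<photoId>"  (that account's photos)
table = boto3.resource("dynamodb").Table("app-table")

# Fetch the account item itself.
account = table.get_item(Key={"pk": "account#12345", "sk": "account"}).get("Item")

# Fetch all photos belonging to that account in a single query.
photos = table.query(
    KeyConditionExpression=Key("pk").eq("account#12345")
    & Key("sk").begins_with("photo#")
)["Items"]
```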
The video recommended in the first comment on your question is excellent - watch it at 0.75 speed or slower, and watch it a few times.
The short answer is yes. But the way it would be designed would be highly specific to how your application interacts with the database.
I highly recommend that anyone still confused about how to design DynamoDB/NoSQL tables watch this video from re:Invent.
In my DynamoDB table named users, I need a unique identifier that is easy for users to remember.
In an RDBMS I could use an auto-increment id to meet the requirement.
As there is no way to have an auto-increment id in DynamoDB, is there a way to meet this requirement?
If I keep the last used id in another table (lastIdTable), retrieve it before adding a new document, increment that number, and save the updated numbers in both tables (lastIdTable and users), that will be very inefficient.
UPDATE
Please note that there's no way of using an existing attribute or getting user input for this purpose.
Since it seems you must create a memorable userId without any information about the user, I’d recommend that you create a random phrase of 2-4 simple words from a standard dictionary.
For example, you might generate the phrase correct horse battery staple. (I know this is a userId and not a password, but the memorability consideration still applies.)
Whether you use a random number (which has similar memorability to a sequential number) or a random phrase (which I think is much more memorable), you will need to do a conditional write with the condition that the ID does not already exist; if it does exist, generate a new ID and try again.
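A sketch of that conditional write with boto3, assuming userId is the partition key of the users table (the word list and retry limit are illustrative):

```python
import random

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("users")

WORDS = ["correct", "horse", "battery", "staple", "river", "stone"]  # illustrative

def create_user_with_random_id(attrs: dict, max_attempts: int = 5) -> str:
    """Generate a random-phrase userId and write it only if it's unused."""
    for _ in range(max_attempts):
        user_id = "-".join(random.sample(WORDS, 3))
        try:
            table.put_item(
                Item={"userId": user_id, **attrs},
                # Fail the write if an item with this userId already exists.
                ConditionExpression="attribute_not_exists(userId)",
            )
            return user_id
        except ClientError as e:
            if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise
            # Collision: loop around and try a new phrase.
    raise RuntimeError("could not find an unused userId")
```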
Email address seems the best choice...
Either as a partition key, or use a GUID as the partition key and have a Global Secondary Index over email address.
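A sketch of that second option with boto3 (the index and attribute names are assumptions):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# GUID partition key, with a GSI over the email attribute for lookups.
dynamodb.create_table(
    TableName="users",
    AttributeDefinitions=[
        {"AttributeName": "userId", "AttributeType": "S"},
        {"AttributeName": "email", "AttributeType": "S"},
    ],
    KeySchema=[{"AttributeName": "userId", "KeyType": "HASH"}],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "email-index",
            "KeySchema": [{"AttributeName": "email", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    BillingMode="PAY_PER_REQUEST",
)
```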
Or as Matthew suggested in a comment, let the users pick a user name.
Docker's container naming strategy might give you some ideas: https://github.com/moby/moby/blob/master/pkg/namesgenerator/names-generator.go
It will result in names that are unique (within a limited namespace) yet human-friendly.
Examples
awesome_einstein
nasty_weinstein
perv_epstein
A similar one: https://github.com/jjmontesl/codenamize
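A minimal sketch of the same idea (the word lists are illustrative stand-ins for Docker's much longer ones):

```python
import random

# Illustrative stand-ins for Docker's adjective and surname lists.
ADJECTIVES = ["awesome", "brave", "clever", "dazzling", "eager"]
SURNAMES = ["einstein", "curie", "turing", "lovelace", "hopper"]

def random_name() -> str:
    """Generate a Docker-style human-friendly name, e.g. 'brave_turing'."""
    return f"{random.choice(ADJECTIVES)}_{random.choice(SURNAMES)}"
```

Because the namespace is limited, you would still pair this with a conditional write (as in the earlier answer) to handle collisions.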
I've included links to related answers along with the approaches we've tried, which seem to be the best available on the web right now.
Our records need to be categorized (eg. "horror", "thriller", "tv"), and randomly accessible both in specific categories and across all/some categories. We generally need to access about 20 - 100 items at a time. We also have a smallish number of categories (less than 100).
We write to the database for uploading/removing content, although this is done in batches and does not need to be real time.
We have tried two different approaches, with two different data structures.
Approach 1
AWS DynamoDB - Pick a record/item randomly?
Help selecting nth record in query.
In short: use the category as a hash key and a UUID as the sort key. Generate a random UUID, query Dynamo using greater-than or less-than on it, and limit to 1. This is even suggested by an AWS employee in the second link. (We've also tried increasing the limit to the number of items we need, but this increases the probability of the query failing the first time around.)
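A sketch of this approach with boto3 (the table and attribute names are assumptions; the second query covers the failure case listed below):

```python
import uuid

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("content")  # hypothetical name

def random_item(category: str):
    """Pick a pseudo-random item from a category (Approach 1)."""
    pivot = str(uuid.uuid4())
    resp = table.query(
        KeyConditionExpression=Key("category").eq(category) & Key("id").gte(pivot),
        Limit=1,
    )
    if resp["Items"]:
        return resp["Items"][0]
    # The random UUID sorted after every stored id: retry downward.
    resp = table.query(
        KeyConditionExpression=Key("category").eq(category) & Key("id").lt(pivot),
        ScanIndexForward=False,
        Limit=1,
    )
    return resp["Items"][0] if resp["Items"] else None
```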
Issues with this approach:
The first query can fail (return nothing) if the random UUID sorts before/after every stored UUID in the chosen direction
Querying on any specific category will cause throttling at scale (small number of partitions)
We've also considered adding a suffix to each category to artificially increase the number of partitions we have, as pointed out in the following link.
AWS Database Blog: Choosing the Right DynamoDB Partition Key
Approach 2
Amazon Web Services: How do we get random item from the dynamoDb's table?
Doing something similar to this: we concatenate the category with a sequential number and use this as the hash key, e.g. horror-000001.
By knowing the number of records in each category, we're able to perform random queries across our entire data set, while also avoiding hot partitions/keys.
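A sketch of the random read under this scheme (the table and attribute names are assumptions; the per-category count comes from the secondary data structure mentioned below):

```python
import random

import boto3

table = boto3.resource("dynamodb").Table("content")  # hypothetical name

def random_item(category: str, count: int):
    """Pick a uniformly random item, given the category's item count."""
    n = random.randint(1, count)  # inclusive on both ends
    return table.get_item(Key={"id": f"{category}-{n:06d}"}).get("Item")
```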
Issues with this approach
We need a secondary data structure to manage the sequential counts across each category
Writing (especially deleting) is significantly more complex, although this doesn't need to happen in real time.
Conclusion
Both approaches solve our main use case of random queries on a category or categories, but their drawbacks are really deterring us from using them. We're leaning towards approach #1 with suffixes to solve the hot-partition issue, although we would need the additional retry logic for failed queries.
Is there a better way of approaching this problem? We're specifically looking for solutions that scale well (no Scan) without requiring extra resources to be implemented. #1 fits the bill, but needing to manage suffixes and failed attempts really deters us from using it, especially when it is called inside a Lambda (billed for time used).
Thanks!
Follow Up
After more research and testing, my team has decided to move towards MySQL hosted on RDS for these tables. We learned that this is one of the few use cases where DynamoDB does not fit, and it would require rewriting our use case to fit the DB (bad).
We felt that the extra complexity required to integrate random sampling on DynamoDB wasn't worth it, and we were unable to come up with any comparable solutions. We are, however, sticking with DynamoDB for our tables that do not need random access, due to the price and response times.
For anyone wondering why we chose MySQL: it was largely due to the Node.js library available, great online resources (which DynamoDB definitely lacks), easy integration via RDS with our Lambdas, and the option to migrate to Amazon's Aurora database.
We also looked at PostgreSQL, but we weren't as happy with the client library or admin tools, and we believe that MySQL will suit our needs for these tables.
If anybody has anything else they'd like to add or a specific question please leave a comment or send me a message!
This was too long for a comment, and I guess it's pretty much a full fledged answer now.
Approach 2
I've found that my typical time to get a single item from dynamodb to a host in the same region is <10ms. As long as you're okay with at most 1-2 extra calls, you can quite easily implement approach 2.
If you use a keys-only GSI where the category is your hash key and the primary key of the table is your range key, you can quickly find the highest-numbered item within a category.
When you add a new item, find the largest number for that category from the GSI and then write the new item to the table with sequence number n+1.
When you delete, find the item with the largest sequence number for that category from the GSI, copy it over the item you are deleting, and then delete the now-duplicated item at the highest sequence number.
To randomly get an item, query the GSI to find the highest-numbered item in the category, and then pick a random number in the now-known valid range.
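A sketch of these operations with boto3 (the table, index, and attribute names are hypothetical, and this deliberately ignores concurrent writers, which matches the batch, non-real-time writes described in the question):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("content")  # hypothetical names throughout

def max_sequence(category: str) -> int:
    """Find the highest sequence number in a category via the keys-only GSI."""
    resp = table.query(
        IndexName="category-index",  # GSI: hash = category, range = table key "id"
        KeyConditionExpression=Key("category").eq(category),
        ScanIndexForward=False,  # descending, so the first item is the largest
        Limit=1,
    )
    items = resp["Items"]
    return int(items[0]["id"].rsplit("-", 1)[1]) if items else 0

def add_item(category: str, attrs: dict) -> None:
    """Write a new item with sequence number n + 1."""
    n = max_sequence(category) + 1
    table.put_item(Item={"id": f"{category}-{n:06d}", "category": category, **attrs})

def delete_item(category: str, item_id: str) -> None:
    """Delete by moving the highest-numbered item into the hole."""
    last_id = f"{category}-{max_sequence(category):06d}"
    if last_id != item_id:
        last = table.get_item(Key={"id": last_id})["Item"]
        table.put_item(Item={**last, "id": item_id})  # fill the hole
    table.delete_item(Key={"id": last_id})  # drop the now-duplicated tail
```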
Approach 1
I'm not sure exactly what you mean when you say "without requiring extra resources to be implemented". If you're okay with using a managed resource (no dev work to implement), you can also make Approach 1 work by putting a DAX cluster in front of your DynamoDB table. Then you can query to your heart's content without really worrying about hot partitions. (Though the caching layer means that new/deleted items won't be reflected right away.)
I'm considering an app which will store customer data. Given the way buckets work in Couchbase, all customer data will be in one bucket. It appears that I have two choices:
Implement multi-tenancy in views, by assigning a field to each record that indicates the customer it belongs to.
Implement it by making the customer ID a component of every key.
It seems, though, that since I will be using views, I'll really want to do both. In case 2, I need the customer in the record so that it can be indexed on (or maybe I can pull part of the key out in the map phase and index on customer), and in case 1, I'd want it to be part of the key as a check when retrieving data, to make sure I don't send the wrong customer's data down the line.
The problem is, this is a service where multiple customers will interact, and sometimes one customer will create some data and another will view it, at the first customer's request. But putting an ACL on each record that lists everyone who's authorized to view it would be problematic, to say the least.
I bet there is a common methodology or design pattern to answer this question, and would appreciate some pointers to best practices.
I'm also concerned about performance if the indexes index both on the particular piece of relevant data and on the customer id... a large number of different customers would presumably make the indexes much less efficient. (But maybe not.)
Here are my thoughts on your questions:
[Concerning items #1 and 2] - It seems, though, that since I will be using views, I'll really want to do both.
This doesn't seem to make sense to me. In Couchbase, the map phase can include content from both the key and the value. It makes little sense to store the data in both the key and the value, as you are guaranteed to have 1:1 duplication there. Store it wherever it makes the most sense to store it; in this case, probably the value.
The problem is, this is a service where multiple customers will interact, and sometimes one customer will create some data and another will view it, at the first customer's request. But putting an ACL on each record that lists everyone who's authorized to view it would be problematic, to say the least.
My site also has multi-tenant data stored in a single database. In my case, I use object unique identifiers as my keys. By default, customers can access all objects that belong to them (I have a user object, and the user is associated with a customer account). Users may also have additional permissions assigned to them, whereby a single object from another customer can be added to their user account, granting them access to view that object.
The alternative is "security through obscurity": use GUIDs as random identifiers, giving customers access to view any object that they have the GUID for.
I would not, however, try to store the permissions on the objects themselves. That would quickly become unwieldy. You need to think about your specific use case, and decide what simple approach would work for the majority of the cases, and just not support the other 1-2% of the cases.
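A sketch of that kind of check (the field names are hypothetical):

```python
def can_view(user: dict, obj: dict) -> bool:
    """Tenancy check plus per-user grants, kept off the object itself."""
    # Default: users can access all objects belonging to their customer account.
    if obj["customer_id"] == user["customer_id"]:
        return True
    # Cross-customer permissions live on the user, not as an ACL on the object.
    return obj["id"] in user.get("granted_object_ids", set())
```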
I asked a question similar to this previously (How to use RecId as a foreign key in a form) but would like to explore it a bit further in a more complex scenario.
Replacement keys work great when you have indexes set up and allow duplicates set to no, but they don't seem to work at all with multiple-field indexes or when allow duplicates is set to yes.
Is there a way, programmatically, to replace a foreign key in a grid with a translated value without using replacement keys? I tried writing a display method to override the field, but some odd behavior resulted: fields moved around in the grid, and the display method was unaware of which row to use, so all values in the entire column were the same.
Table A: Bob:1, Sally:1, Sue:3
Table B: 1:Apples, 2:Apples, 3:Oranges
The "people" are tied to their favorite "foods" by the food RecId, refererenced in the People table. Assume there is additional data in other columns that make these records unique, so consolidating "1:Apples" and "2:Apples" is not possible.
It seems there should be a way to write a display method to overwrite a field value in a grid. Any suggestions? Sample code?
Thanks
Firstly, surrogate FK replacement does (or at least should) work with composite keys (e.g., {First Name, Last Name}).
Secondly, you state that there is "additional data in other columns" that makes these records unique... then why aren't these columns being combined with the food's name to form an alternate key? The data model seems incorrect (or at least some metadata isn't being made consistent with the conditions you've stated).
Thirdly, any Field Group can be chosen as the ReplacementFieldGroup on a Reference Group control. That alone will allow you to do basically whatever you want. That said, I would strongly encourage you to use an alternate key as your replacement field group whenever possible due to the semantics of surrogate FK replacement.
Flow:
1) User types value(s) into the reference group.
2) User tabs out.
3) The user's typed value(s) are used to look up a record in the related table.
4a) If the user's typed-in value(s) map uniquely to a record, that record is chosen; else,
4b) If the user's typed-in values are not unique, a lookup is presented to allow the user to pick which record they "meant". Note that the lookup must therefore present a collection of uniquely identifiable records so that the user knows which record to pick (if the records all look the same in the lookup, they'll have no idea what in the hell they should pick).
5) Upon successful resolution of the typed values, the record is set back on the source form.
Given this flow, it is obvious that steps 3-5 will be broken if there is no unique index (key) on the table. (How is the user supposed to specify a unique reference to the record if the record has no means of being uniquely identified, assuming you don't want to display RecId to the user?)
In the exceptional case that you decide that you still want to use a non-unique index as your replacement field group you must implement resolveReference and lookupReference to provide the user a unique resolution/lookup experience (to handle steps 3-5 in the above flow). Note: The common use case for this is wanting to effectively eliminate non-selective fields from being displayed in Reference Group and instead letting some outer context or mode implicitly set that value. E.g., if the alternate key was {Size, Color}, one could potentially make "Color" a global form context--perhaps by having the user pick a color at the top of the form--and only have the user enter Size into Reference Group...The Color could then be implicitly added back via the resolveReference and lookupReference overrides.