My use case is that I need a UUID as the primary key in one of my DynamoDB tables. I am using @DynamoDBAutoGeneratedKey for this, and the UUID is generated as expected. I also understand that the auto-generated key can be read from the entity right after it has been written to DynamoDB. My concern is whether there is a clean way to retrieve the auto-generated key elsewhere in the application, or whether I need to keep it in memory. Or should I add secondary indexes to retrieve the auto-generated key?
Note: The question doesn't say where the use case needs to fetch data by primary key. I presume the ultimate aim is not to get the UUID itself but to get the item using the UUID.
Some general options are as follows:-
If you don't know the hash key, which is the auto-generated UUID:
1) Scan the table to get the auto-generated key. Please note that this is a full table scan, which is a costly operation.
2) Yes, a Global Secondary Index can be used to query the table by attributes other than the UUID field defined as the hash key of the main table. This is the more efficient option when the hash key of the main table is unknown.
3) I am not sure about the full use case. However, if the same HTTP request or process is going to read the data for the newly inserted UUID later, you can keep the UUID in memory (e.g. in a Java collection) and use it later. In that case you can in fact keep the entire inserted object in memory.
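Options 2 and 3 can be illustrated with a small in-memory stand-in. This is only a sketch in Python, not DynamoDB or DynamoDBMapper code: the InMemoryTable class and the email attribute are hypothetical, chosen purely to show the difference between remembering the generated key and recovering it through a secondary attribute.

```python
import uuid


class InMemoryTable:
    """Hypothetical stand-in for a table whose hash key is an auto-generated UUID."""

    def __init__(self, index_attr):
        self.index_attr = index_attr  # attribute a secondary index would be keyed on
        self.items = {}               # hash key (UUID string) -> item
        self.index = {}               # index_attr value -> set of hash keys

    def save(self, item):
        # Mimic auto-generation: assign the key on write and hand the item back.
        stored = dict(item, id=str(uuid.uuid4()))
        self.items[stored["id"]] = stored
        self.index.setdefault(stored[self.index_attr], set()).add(stored["id"])
        return stored

    def ids_by_index(self, value):
        # Option 2: recover the key through the secondary attribute.
        return sorted(self.index.get(value, set()))


table = InMemoryTable(index_attr="email")
saved = table.save({"email": "a@example.com", "name": "Ann"})
remembered_id = saved["id"]                      # option 3: keep it for later use
found_ids = table.ids_by_index("a@example.com")  # option 2: look it up again
```

Option 3 only works while the process that wrote the item is still alive; option 2 works from anywhere, as long as the indexed attribute is known.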
What I never understood about DynamoDB is how to design a table so that you can efficiently get all items with one particular field lying in some range. Take a time range, for example: we would like to get the data created from timestamp1 up to timestamp2. According to the key design, only the sort key can be used for such a purpose. However, that automatically means the partition key has to be the same for all items, and according to the documentation this is an anti-pattern of DynamoDB usage. How should I deal with this situation? Would it be a better solution to create an evenly distributed primary key and then a secondary index whose partition key is the same for all items but whose sort key differs for each of them?
You can use a Global Secondary Index, which the documentation describes as follows:
"A global secondary index contains a selection of attributes from the base table, but they are organized by a primary key that is different from that of the table."
So you can query on other attributes that are unique.
In case it is not clear what I meant: you can choose something else that is likely to be unique as the primary key, and use a repetitive ID as the GSI key on which you base your query.
NOTE: One of the widest applications of NoSQL DBs is to store timeseries, which you cannot expect to have a unique identifier as PK, unless you specify the timestamp.
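A minimal sketch of what such a range query against a GSI might look like, written in Python as a plain dictionary of Query parameters in the low-level DynamoDB format. The index name and the attribute names group_id and created_at are assumptions made for illustration; none of them come from the question.

```python
def time_range_query(table, index, group_id, start_ts, end_ts):
    """Build Query parameters for one GSI partition within [start_ts, end_ts]."""
    return {
        "TableName": table,
        "IndexName": index,  # hypothetical GSI: group_id (hash), created_at (range)
        "KeyConditionExpression": "#g = :g AND #ts BETWEEN :start AND :end",
        "ExpressionAttributeNames": {"#g": "group_id", "#ts": "created_at"},
        "ExpressionAttributeValues": {
            ":g": {"S": group_id},
            # The low-level format carries numbers as strings.
            ":start": {"N": str(start_ts)},
            ":end": {"N": str(end_ts)},
        },
    }


params = time_range_query("events", "by_group_and_time", "sensor-1", 1000, 2000)
```

With boto3 installed, a dictionary like this could be passed to a DynamoDB client as `client.query(**params)`. The base table keeps its own evenly distributed primary key; only the index is organized by the repeating ID.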
I want to create a DynamoDB table that allows me to save notes from users.
The attributes I have:
user_id
note_id (uuid)
type
text
The main queries I will need:
Get all notes of a certain user
Get a specific note
Get all notes of a certain type (the less used query)
I know that in terms of performance and DynamoDB partitions note_id would be the right choice, because the values are unique and would be distributed equally over the partitions. On the other hand, it is then much harder to get all notes of a user without scanning all items or using a GSI. And if the values are unique, I suppose it doesn't make any sense to have a sort key.
The other option would be to use user_id as the partition key and note_id as the sort key, but if certain users have a much larger number of notes than others, wouldn't that impact my performance?
Is it better to have a unique partition key (like note_id) that scales well with DynamoDB partitions and use GSIs for my queries, or to use a partition key that serves my main query (user_id)?
Thanks
Possibly the simplest and most cost-effective way would be a single table:
Table Structure
note_id (uuid) / hash key
user_id
type
text
Have two GSIs, one for "Get all notes of a certain user" and one for "Get all notes of a certain type (the less used query)":
GSI for "Get all notes of a certain user"
user_id / hash key
note_id (uuid) / range key
type
text
A little note on this: which of your queries is the most frequent, "Get all notes of a certain user" or "Get a specific note"? If it's the former, you could swap the GSI keys and the table keys (in essence, use user_id + note_id as the key for your table and note_id as the GSI key). This also depends on how you structure your user_id. I suspect you've already picked up on this, but make sure your user_id is not sequential; make it a UUID or similar.
GSI for "Get all notes of a certain type (the less used query)"
type / hash key
note_id (uuid) / range key
user_id
text
Depending upon the cardinality of the type field, you'll need to test whether a GSI will actually be of benefit here or not.
If the GSI is of little benefit and you need more performance, another option would be to store the type with an array of note_id values in a separate table altogether. Beware of the 400 KB item size limit with this one, and of the fact that you'll need to perform another query to get the text of the note.
With this table structure and GSIs, you're able to make a single query for the information you're after, rather than making two if you have two tables.
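The table and the two GSIs described above could be written down roughly as follows. This is a sketch in Python that only builds the CreateTable parameters as a dictionary; the table name, the index names by_user and by_type, and the capacity figures are placeholders, not recommendations. Projecting ALL attributes mirrors the "fat" indexes listed in this answer.

```python
def notes_table_definition(table_name="notes", rcu=5, wcu=5):
    """Build CreateTable parameters: note_id hash key plus two GSIs."""
    throughput = {"ReadCapacityUnits": rcu, "WriteCapacityUnits": wcu}

    def gsi(name, hash_attr):
        # Each GSI is keyed on hash_attr with note_id as its range key.
        return {
            "IndexName": name,
            "KeySchema": [
                {"AttributeName": hash_attr, "KeyType": "HASH"},
                {"AttributeName": "note_id", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
            "ProvisionedThroughput": dict(throughput),
        }

    return {
        "TableName": table_name,
        # Only attributes used in some key schema are declared here.
        "AttributeDefinitions": [
            {"AttributeName": "note_id", "AttributeType": "S"},
            {"AttributeName": "user_id", "AttributeType": "S"},
            {"AttributeName": "type", "AttributeType": "S"},
        ],
        "KeySchema": [{"AttributeName": "note_id", "KeyType": "HASH"}],
        "GlobalSecondaryIndexes": [gsi("by_user", "user_id"), gsi("by_type", "type")],
        "ProvisionedThroughput": dict(throughput),
    }


definition = notes_table_definition()
```

With boto3 available, the dictionary could be handed to `client.create_table(**definition)`. Switching a projection to KEYS_ONLY is one way to trade a second read for less indexed data.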
Of course, you know your data best: start with the design you think fits and then test it to make sure it meets your needs. DynamoDB is priced by provisioned throughput plus the amount of indexed data stored, so "fat" indexes that project many attributes, as above, add storage cost. If there is a lot of data, it could become more cost-effective to perform two queries and store less indexed data.
I would use user_id as your primary partition (hash) key and note_id as your primary range (sort) key.
You have already noted that, in an ideal situation, each partition key is accessed with equal regularity to optimise performance (see Design For Uniform Data Access Across Items In Your Tables). The use of user_id is perfectly fine as long as you have a good spread of users who regularly log in. Indeed, AWS specifically encourages this option (see the 'Choosing a Partition Key' table in the link above).
This approach will also make your application code much simpler than your alternative approach.
You then have a second choice, which is whether to apply a Global Secondary Index for your get-notes-by-type query. A GSI key, unlike a primary key, does not need to be unique (see the AWS GSI guide), so I suggest you simply use type as your GSI partition key without a range key.
The obvious plus side of using a GSI is a faster result when you perform the note type query. However, you should also be aware of the downsides. A GSI has a throughput allowance separate from your table's, so you need to provision it in addition to your table throughput (at extra cost). If you don't provision your GSI with enough read units, it could end up slower than a scan on your table. If you don't provision enough write units, your table writes could be throttled, even if your table has enough write units.
Also, AWS warns that GSIs are updated asynchronously (usually within a fraction of a second, but it can take longer). This means queries on your GSI might return the 'wrong' result if you have table writes and index reads very close together. If this is a problem, you would need to handle it in your application code.
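To make the key design in this answer concrete, here is a sketch of the three requests it implies, written in Python as plain parameter dictionaries in the low-level DynamoDB format. The table name and the index name by_type are assumptions. The `#t` placeholder is used for the type attribute to stay clear of any reserved-word clash in expressions.

```python
def notes_requests(table, user_id, note_id, note_type, type_index="by_type"):
    """Build parameters for the three access patterns on a user_id/note_id table."""
    # Get all notes of a certain user: Query on the partition key alone.
    query_user = {
        "TableName": table,
        "KeyConditionExpression": "user_id = :u",
        "ExpressionAttributeValues": {":u": {"S": user_id}},
    }
    # Get a specific note: GetItem needs both parts of the primary key.
    get_item = {
        "TableName": table,
        "Key": {"user_id": {"S": user_id}, "note_id": {"S": note_id}},
    }
    # Get all notes of a certain type: Query on the (hypothetical) GSI.
    query_type = {
        "TableName": table,
        "IndexName": type_index,
        "KeyConditionExpression": "#t = :t",
        "ExpressionAttributeNames": {"#t": "type"},
        "ExpressionAttributeValues": {":t": {"S": note_type}},
    }
    return {"query_user": query_user, "get_item": get_item, "query_type": query_type}


requests = notes_requests("notes", "user-1", "note-9", "reminder")
```

Because index reads are eventually consistent, a note written a moment ago may not yet appear in the query_type result.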
I see this as 2 tables: users and notes, with a GSI on the notes table. Not sure how else you could do it. Using user_id as the partition key and note_id as the sort key means that fetching a single note with GetItem requires both the user_id and the note_id. With DynamoDB, if you're not scanning or querying by partition key, you have to supply every element of the primary key, so both the partition key and the sort key if there is one. Below is how I would do this.
Get all notes of a certain user
When a user creates a note, I would add it to the users table in the user's notes attribute. When you want to get all of a user's notes, retrieve the user and access the array/list of note_ids stored there.
{ userId: xxx,
notes: [ note_id_1,note_id_2,note_id_3]
}
Get a specific note
A notes table with note_id as the primary key would make that easy.
{
noteId: XXXX,
note: "sfsfsfsfsfsf",
type: "standard_note"
}
Get all notes of a certain type (the less used query)
I would use a GSI on the notes table for this with the attributes of "note_type" and note_id projected onto it.
Update
You can pull this off with one table and a GSI (see the two answers below for how), but I would not do it. Your data model is so simple; why make it more complicated than users and notes?
Is it possible to Query a DynamoDB table using both the hash & range key AND a local secondary index?
I have three attributes I want to compare against in my query. Two are the main hash and range keys and the third is the range key of the local secondary index.
No, but that shouldn't be necessary based on your description of what you are trying to accomplish.
If you are trying to access an object based on the hash and range key (of the main table) as well as an additional attribute, selecting on only the hash and range of the main table (which is required to return a single record by definition) will return that record.
If your concern is that the third attribute may hold a value for which you want to ignore the entire record, you can use a query filter to have that item filtered out by DynamoDB, or you can use logic in your application to ignore that object.
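As a rough sketch of the query-filter route, the request below keys on the table's hash and range attributes and filters on the third attribute. It is written in Python as a plain parameter dictionary; hash_attr, range_attr and third_attr are placeholder names, and FilterExpression is used here as the expression-based form of a query filter.

```python
def query_with_filter(table, hash_value, range_value, third_value):
    """Build Query parameters: key condition on hash + range, filter on a third attribute."""
    return {
        "TableName": table,
        "KeyConditionExpression": "#h = :h AND #r = :r",
        # The filter is applied after the item is read, before it is returned.
        "FilterExpression": "#x = :x",
        "ExpressionAttributeNames": {
            "#h": "hash_attr",
            "#r": "range_attr",
            "#x": "third_attr",
        },
        "ExpressionAttributeValues": {
            ":h": {"S": hash_value},
            ":r": {"S": range_value},
            ":x": {"S": third_value},
        },
    }


filtered = query_with_filter("my_table", "h-1", "r-1", "wanted")
```

Since the filter runs after the read, it does not reduce the read capacity consumed; it only keeps the unwanted item out of the response.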
I've been going through AWS DynamoDB docs and cannot figure out what's the difference between batchGetItem() and Query().
My use case: I have a table which has Id as primary hash key, and attribute values are Name and Marks.
I would like to perform batch query which returns list of names and marks by providing list of Id's which are primary keys.
Should I use batchGetItem() or Query()?
BatchGetItem: Allows you to parallelize "GetItem" requests for languages that don't support parallelism (e.g. JavaScript). This includes retrieving items from different tables (it doesn't support indexes, though).
Query: Allows you to page through tables with a Hash-Range schema (where you'll have multiple results associated with a Hash key) and allows you to retrieve items from the indexes on your table. Note that you can also add a condition on the range key in your KeyConditions and add conditions on any non-primary-key attribute in your QueryFilter.
It seems that your use case calls for a BatchGetItem request, as you are trying to retrieve items from your base table by way of a Hash key.
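A sketch of how the list of Ids might be turned into BatchGetItem requests, in Python, building plain parameter dictionaries in the low-level DynamoDB format. It assumes Id is stored as a string, and the table name is a placeholder. The keys are split into chunks because a single BatchGetItem call accepts at most 100 keys, and Name is referenced through a `#n` placeholder because it is a reserved word in expressions.

```python
def batch_get_requests(table, ids, chunk_size=100):
    """Split ids into BatchGetItem requests of at most chunk_size keys each."""
    requests = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        requests.append({
            "RequestItems": {
                table: {
                    "Keys": [{"Id": {"S": item_id}} for item_id in chunk],
                    # Return only the two attributes the question asks for.
                    "ProjectionExpression": "#n, Marks",
                    "ExpressionAttributeNames": {"#n": "Name"},
                }
            }
        })
    return requests


batches = batch_get_requests("students", [str(i) for i in range(250)])
```

Each dictionary could be passed to `client.batch_get_item(**request)` with boto3. A real caller would also need to resend any UnprocessedKeys returned in the response.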
Hope that helps!
I posted a similar question over on the Adobe Community forums, but it was suggested to ask over here as well.
I'm trying to cache distinct queries associated with a particular database, and need to be able to flush all of the queries for that database while leaving other cached queries intact. So I figured I'd take advantage of ColdFusion's ehcache capabilities. I created a specific cache region to use for queries from this particular database, so I can use cacheRemoveAll(myRegionName) to flush those stored queries.
Since I need each distinct query to be cached and retrievable easily, I figured I'd hash the query parameters into a unique string that I would use for the cache key for each query. Here's the approach I've tried so far:
Create a Struct containing key value pairs of the parameters (parameter name, parameter value).
Convert the Struct to a String using SerializeJSON().
Hash the String using Hash().
Does this approach make sense? I'm wondering how others have approached cache key generation. Also, is the "MD5" algorithm adequate for this purpose, and will it guarantee unique key generation, or do I need to use "SHA"?
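The three steps above can be sketched as follows. This is Python rather than CFML, used only to illustrate the idea, and the function name cache_key is made up for the example. One assumption worth calling out: the keys are sorted during serialization, so that two structs with the same parameters in a different order produce the same cache key.

```python
import hashlib
import json


def cache_key(params, algorithm="md5"):
    """Serialize the parameters canonically, then hash the resulting string."""
    # sort_keys makes the JSON independent of the order the keys were added in.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.new(algorithm, canonical.encode("utf-8")).hexdigest()


key_a = cache_key({"customerId": 42, "status": "open"})
key_b = cache_key({"status": "open", "customerId": 42})
key_c = cache_key({"customerId": 43, "status": "open"})
key_sha = cache_key({"customerId": 42, "status": "open"}, algorithm="sha256")
```

No hash function can guarantee unique keys; it only makes collisions unlikely. For cache keys built from your own query parameters, the choice between MD5 and a SHA variant mostly affects key length.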
UPDATE: use cacheRegion attribute introduced in CF10!
http://help.adobe.com/en_US/ColdFusion/10.0/CFMLRef/WSc3ff6d0ea77859461172e0811cbec22c24-7fae.html
Then all you need to do is specify cachedAfter or cachedWithin, and forget about how to generate unique keys. CF will do it for you by 'hashing':
query "Name"
SQL statement
Datasource
Username and password
DBTYPE
reference: http://www.coldfusionmuse.com/index.cfm/2010/9/19/safe.caching
I think this would be the easiest, unless you really need to fetch a specific query by a key; in that case you can supply your own hash using cacheID, another new attribute introduced in CF10.