I'm working on a web application that uses a bunch of Amazon Web Services. I'd like to use DynamoDB for a particular part of the application but I'm not sure if it's an appropriate use-case.
When a registered user on the site performs a "job", an entry is recorded and stored for that job. The job has a bunch of details associated with it, but the most relevant thing is that each job has a unique identifier and an associated username. Usernames are unique too, but there can of course be multiple job entries for the same user, each with different job identifiers.
The only query that I need to perform on this data is: give me all the job entries (and their associated details) for username X.
I started to create a DynamoDB table but I'm not sure if it's right. My understanding is that the chosen hash key should be the key that's used for querying/indexing into the table, but it should be unique per item/row. Username is what I want to query by, but username will not be unique per item/row.
If I make the job identifier the primary hash key and the username a secondary index, will that work? Can I have duplicate values for a secondary index? But that means I will never use the primary hash key for querying/indexing into the table, which is the whole point of it, isn't it?
Is there something I'm missing, or is this just not a good fit for NoSQL.
Edit:
The accepted answer helped me find out what I was looking for as well as this question.
I'm not totally clear on what you're asking, but I'll give it a shot...
With DynamoDB, the combination of your hash key and range key must uniquely identify an item. Range key is optional; without it, hash key alone must uniquely identify an item.
You can also store a list of values (rather than just a single value) as an item's attributes. If, for example, each item represented a user, an attribute on that item could be a list of that user's job entries.
If you're concerned about hitting the size limitation of DynamoDB records, you can use S3 as backing storage for that list - essentially use the DDB item to store a reference to the S3 resource containing the complete list for a given user. This gives you flexibility to query for or store other attributes rather easily. Alternatively (as you suggested in your answer), you could put the entire user's record in S3, but you'd lose some of the flexibility and throughput of doing your querying/updating through DDB.
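For example, a rough boto3 sketch of that S3-backed approach might look like the following; the bucket name, table name and attribute names here are placeholders, not anything from your setup.
import json
import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("users")  # hypothetical table keyed on Username

def save_user_jobs(username, jobs):
    """Store the full job list in S3 and keep only a reference to it in DynamoDB."""
    s3_key = "user-jobs/{}.json".format(username)
    s3.put_object(
        Bucket="my-app-job-data",  # placeholder bucket name
        Key=s3_key,
        Body=json.dumps(jobs).encode("utf-8"),
    )
    # The DynamoDB item stays small: just the hash key and a pointer to the S3 object.
    table.put_item(Item={"Username": username, "JobsRef": s3_key})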
Perhaps a "Jobs" table would work better for you than a "User" table. Here's what I mean.
If you're worried about all of those jobs inside a user document adding up to more than the 400kb limit, why not store the jobs individually in a table like:
my_jobs_table:
[
  {
    Username: toby,
    JobId: 1234,
    Status: Active,
    CreationDate: 2014-10-05,
    FileRef: some-reference1
  },
  {
    Username: toby,
    JobId: 5678,
    Status: Closed,
    CreationDate: 2014-10-01,
    FileRef: some-reference2
  },
  {
    Username: bob,
    JobId: 1111,
    Status: Closed,
    CreationDate: 2014-09-01,
    FileRef: some-reference3
  }
]
Username is the hash and JobId is the range. You can query on the Username to get all the user's jobs.
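As a rough illustration (a boto3 sketch using the table and attribute names from the example above; not the only way to do it), the query would look something like:
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
jobs_table = dynamodb.Table("my_jobs_table")

# Fetch every job item whose hash key is the given username.
response = jobs_table.query(KeyConditionExpression=Key("Username").eq("toby"))
for job in response["Items"]:
    print(job["JobId"], job["Status"])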
Now that the size of each document is more limited, you could think about putting all the data for each job in the DynamoDB record instead of using the FileRef and looking it up in S3. This would probably save a significant amount of latency.
Each record might then look like:
{
  Username: bob,
  JobId: 1111,
  Status: Closed,
  CreationDate: 2014-09-01,
  JobCategory: housework,
  JobDescription: Doing the dishes,
  EstimatedDifficulty: Extreme,
  EstimatedDuration: 9001
}
I reckon I didn't really play with the DynamoDB console for long enough to get a good understanding before posting this question. I've only just understood that a DynamoDB table (and presumably any other NoSQL table) is really just a giant dictionary/hash data structure. So to answer my question: yes, I can use DynamoDB, and each item/row would look something like this:
{
  "Username": "SomeUser",
  "Jobs": {
    "gdjk345nj34j3nj378jh4": {
      "Status": "Active",
      "CreationDate": "2014-10-05",
      "FileRef": "some-reference"
    },
    "ghj3j76k8bg3vb44h6l22": {
      "Status": "Closed",
      "CreationDate": "2014-09-14",
      "FileRef": "another-reference"
    }
  }
}
But I'm not sure it's even worth using DynamoDB after all that. It might be simpler to just store a JSON file with the structure above in an S3 bucket, where the object key is the username (e.g. SomeUser.json).
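If I do go the plain-S3 route, reading and writing that per-user document is only a couple of calls. A rough sketch (the bucket name is made up):
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-app-user-jobs"  # placeholder bucket name

def save_user(username, doc):
    # Write the whole per-user JSON document under <username>.json.
    s3.put_object(Bucket=BUCKET, Key=username + ".json",
                  Body=json.dumps(doc).encode("utf-8"))

def load_user(username):
    obj = s3.get_object(Bucket=BUCKET, Key=username + ".json")
    return json.loads(obj["Body"].read())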
Edit:
For what it's worth, I just realized that DynamoDB has a 400 KB size limit on items. That's a huge amount of data relative to my use-case, but I can't take the chance, so I'll have to go with S3.
It seems that username as the hash key and a unique job_id as the range key, as others have already suggested, would serve you well in DynamoDB. Using a Query you can quickly retrieve all records for a given username.
Another option is to take advantage of local secondary indexes and sparse indexes. It seems that there is already a status attribute, but based upon what I've read you could add another attribute, perhaps 'not_processed': 'x', and create a local secondary index on username + not_processed. Only records that have this attribute are indexed, and once a job is complete you delete the attribute so the item drops out of the index. This effectively lets you fetch, per username, only the jobs where not_processed = x, and your index stays small.
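To make the sparse-index idea concrete, here is a hedged sketch; the index name and attribute names are invented for illustration, and it assumes the table already has a local secondary index whose range key is the not_processed attribute.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
jobs_table = dynamodb.Table("my_jobs_table")

# Query the sparse LSI: only jobs that still carry not_processed are indexed.
response = jobs_table.query(
    IndexName="username-not_processed-index",  # hypothetical LSI name
    KeyConditionExpression=Key("Username").eq("toby") & Key("not_processed").eq("x"),
)
unprocessed = response["Items"]

# Once a job completes, remove the attribute so the item falls out of the index.
jobs_table.update_item(
    Key={"Username": "toby", "JobId": 1234},
    UpdateExpression="REMOVE not_processed",
)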
All my relational DB experience seems to be getting in the way of my understanding DynamoDB. Good luck!
We use DynamoDB UpdateItem.
This acts as an "upsert", as we can learn from the documentation:
Edits an existing item's attributes, or adds a new item to the table if it does not already exist. [...]
When we make a request, to determine whether an item was created or an existing item was updated, we set ReturnValues to ALL_OLD. This works great and allows us to differentiate between update and create.
As an additional requirement we also want to return ALL_NEW, but still know the type of operation that was performed.
Question: Is this possible to do in a single request or do we have to make a second (get) request?
By default this is not supported in DynamoDB: there is no ALL or NEW_AND_OLD_IMAGES return value for UpdateItem as there is in DynamoDB Streams, but you can always go DIY.
When you do the UpdateItem call, you have the UpdateExpression, which is basically the list of changes to apply to the item. Given that you told DynamoDB to return the item as it looked before that operation, you can construct the new state locally.
Just create a copy of the ALL_OLD response and locally apply the changes from the UpdateExpression to it. That's definitely faster than two API calls at the cost of a slightly more complex implementation.
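A minimal sketch of that DIY approach, assuming a simple SET-only update (the table name and the helper function are mine, purely for illustration): send the update with ReturnValues set to ALL_OLD, then apply the same attribute values locally to a copy of the old image.
import boto3

table = boto3.resource("dynamodb").Table("my_table")  # placeholder table name

def upsert(key, new_attrs):
    """Upsert an item and return (old_image, new_image, created_flag)."""
    # Build a SET-only update expression from the attributes we want to write.
    expr = "SET " + ", ".join("#a{0} = :v{0}".format(i) for i in range(len(new_attrs)))
    names = {"#a{}".format(i): k for i, k in enumerate(new_attrs)}
    values = {":v{}".format(i): v for i, v in enumerate(new_attrs.values())}

    resp = table.update_item(
        Key=key,
        UpdateExpression=expr,
        ExpressionAttributeNames=names,
        ExpressionAttributeValues=values,
        ReturnValues="ALL_OLD",
    )
    old_image = resp.get("Attributes")  # absent means the item was just created
    # Reconstruct ALL_NEW locally: the old image (or just the key) plus what we set.
    new_image = dict(old_image or key)
    new_image.update(new_attrs)
    return old_image, new_image, old_image is None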
I’m attempting to create a multi-tenant application with DynamoDB and Cognito. The documentation is pretty clear on how to implement fine-grained authorisation so that users can access only their own records, by adding a condition to the IAM access policy like so:
"Condition": {
"ForAllValues:StringEquals": {
"dynamodb:LeadingKeys": [
"${cognito-identity.amazonaws.com:sub}"
]
}
}
This is great for allowing users to read & write their own records, when the Cognito user id is the hash key of the row, but I’m struggling with how to allow other users to have read only access to some records.
Take as an example my model for a student who has multiple courses:
{
  "student_id": "ABC-1234567",
  "course_name": "Statistics 101",
  "tutors": ["Cognito-sub-1", "Cognito-sub-2"],
  "seminar_reviews": [
    {
      "seminar_id": "XXXYYY-12345",
      "date": "2018-01-12",
      "score": "8",
      "comments": "Nice class!"
    },
    {
      "seminar_id": "ABCDEF-98765",
      "date": "2018-01-25",
      "score": "3",
      "comments": "Boring."
    }
  ]
}
(Cognito-sub-1 is the Cognito id of a tutor)
With the policy conditions above applied to the user’s IAM role, the user could read & write this document since the hash key (student_id) is the Cognito id of the user.
I’d also like the tutors listed in the document to have read-only access to certain attributes, but I can’t find any examples of how this can be done. I know that I can’t use the dynamodb:LeadingKeys condition since tutors is not the hash key of the table. Can this be done if I set up a Global Secondary Index (GSI) that uses the list of tutors as the hash key?
If this can be done with an index, I assume that this would only allow read access to that index (since an index can’t allow write operations). Is there any alternative method to allow write access based on an attribute that is not the hash key?
Alternatively, can I use a longer string as the hash key, concatenating attributes like "owner": and "read-only": that contain lists of Cognito IDs and consume this within my policy to create a more fine-grained permissions model based only on the hash key? This assumes a policy can decode lists from a string, since DynamoDB does not allow a hash key to be a list, JSON object or similar.
I haven’t been able to find any resources that consider fine-grained access control beyond allowing users to read/write only their own records, so if anyone can direct me to some, that would be a great start.
You can restrict access to specific attributes fairly easily (just the attributes, via the dynamodb:Attributes condition key).
However, in order to achieve more fine-grained access patterns you'd have to either:
offload the access control task to your own code (e.g. a Lambda function; see the sketch after this list)
or evaluate your access patterns and model your data accordingly (which is generally a good thing to do anyway, though it might be a little trickier)
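To make the first option concrete, here is a rough sketch, not a definitive implementation: every name in it (the table, the read-only attribute list, and the shape of the event) is assumed for illustration, modelled on your student example. A Lambda function between the client and DynamoDB loads the item and decides, based on the caller's Cognito sub, whether to return the full item or a read-only view for tutors.
import boto3

table = boto3.resource("dynamodb").Table("students")  # hypothetical table

READ_ONLY_ATTRIBUTES = ["course_name", "seminar_reviews"]  # what tutors may see

def handler(event, context):
    """Assumed event shape: {'caller_sub': ..., 'student_id': ...}."""
    caller = event["caller_sub"]
    item = table.get_item(Key={"student_id": event["student_id"]}).get("Item")
    if item is None:
        return {"status": 404}
    if caller == item["student_id"]:
        return {"status": 200, "item": item}  # owner: full access
    if caller in item.get("tutors", []):
        view = {k: item[k] for k in READ_ONLY_ATTRIBUTES if k in item}
        return {"status": 200, "item": view}  # tutor: read-only subset
    return {"status": 403}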
Generally speaking, when designing NoSQL applications, you should always evaluate how you will consume your data. NoSQL tables are usually tailored to a specific use-case, unlike an RDBMS, which allows very general queries.
There's a nice example regarding modeling relational data in terms of DynamoDB available here
In an S3 bucket, I have thousands and thousands of files stored with names having a structure that comes down to prefix and number:
A-0001
A-0002
A-0003
B-0001
B-0002
C-0001
C-0002
C-0003
C-0004
C-0005
New objects for a given prefix may arrive with varying frequency, or not at all. Older objects may disappear.
Is there a way to efficiently query S3 for the highest number of every prefix, i.e. without listing the entire bucket? The result I want is:
A-0003
B-0002
C-0005
The S3 API itself does not seem to offer anything usable for that. However, perhaps another service, like Athena, could do it? So far I have only found it capable of searching within objects, but all I care about are their key names. If it can report on the contents of objects in the bucket, can't it on the bucket itself?
I would be okay with the latest modification date per prefix, but I want to avoid having to switch to a versioned bucket with just the prefixes as names to achieve that.
I think this is what you are looking for:
The variable name is "$path", and you can use a regexp to extract the pattern you are querying...
WHERE regexp_extract(sp."$path", '[^/]+$') like concat('%',cast(current_date - interval '1' day as varchar),'.csv')
The S3 API itself does not seem to offer anything usable for that.
However, perhaps another service, like Athena, could do it?
Yes, at the moment there is no direct way of doing it with S3 alone. Even with Athena, it will go through the files to query their content, but it will be easier thanks to Athena's standard SQL support, and faster since the queries run in parallel.
So far I have only found it capable of searching within objects, but
all I care about are their key names.
Both Athena and S3 Select query by content, not by key names.
The best approach I can recommend is to use AWS DynamoDB to keep the metadata of the files, including file names for faster querying.
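A hedged sketch of that metadata idea (the table and attribute names are invented): keep one DynamoDB item per prefix and bump its highest number whenever you upload an object, so answering "highest number per prefix" becomes a handful of GetItem calls instead of a bucket listing.
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("s3-key-metadata")  # hypothetical table

def record_upload(prefix, number):
    """Call this alongside each S3 PutObject; keeps the running maximum per prefix."""
    try:
        table.update_item(
            Key={"prefix": prefix},
            UpdateExpression="SET highest = :n",
            ConditionExpression="attribute_not_exists(highest) OR highest < :n",
            ExpressionAttributeValues={":n": number},
        )
    except ClientError as e:
        # A number lower than the current maximum is simply ignored.
        if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise

def highest_for(prefix):
    item = table.get_item(Key={"prefix": prefix}).get("Item")
    return item["highest"] if item else None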
S3 FAQ mentions that "Amazon S3 buckets in all Regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES." However, I don't know how long it takes to get eventual consistency. I tried to search for this but couldn't find an answer in S3 documentation.
Situation:
We have a website consists of 7 steps. When user clicks on save in each step, we want to save a json document (contains information of all 7 steps) to Amazon S3. Currently we plan to:
Create a single S3 bucket to store all json documents.
When user saves step 1 we create a new item in S3.
When user saves step 2-7 we overwrite the existing item.
After user saves a step and refresh the page, he should be able to see the information he just saved. i.e. We want to make sure that we always read after write.
The full json document (all 7 steps completed) is around 20 KB.
After users clicked on save button we can freeze the page for some time and they cannot make other changes until save is finished.
Question:
How long does it take for AWS S3 to save and load an item? (We can freeze our website when document is being saved to S3)
Is there a function to calculate save/load time based on item size?
Is the save/load time gonna be different if I choose another S3 region? If so which is the best region for Seattle?
I wanted to add to @error2007s' answer.
How long does it take for AWS S3 to save and load an item? (We can freeze our website when document is being saved to S3)
It's not only that you will not find the exact time anywhere - there's actually no such thing as an exact time. That's just what "eventual consistency" is all about: consistency will be achieved eventually. You can't know when.
If somebody gave you an upper bound for how long a system would take to achieve consistency, then you wouldn't call it "eventually consistent" anymore. It would be "consistent within X amount of time".
The problem now becomes, "How do I deal with eventual consistency?" (instead of trying to "beat it")
To really find the answer to that question, you need to first understand what kind of consistency you truly need, and how exactly the eventual consistency of S3 could affect your workflow.
Based on your description, I understand that you would write a total of 7 times to S3, once for each step you have. For the first write, as you correctly cited the FAQs, you get strong consistency for any reads after that. For all the subsequent writes (which are really "replacing" the original object), you might observe eventual consistency - that is, if you try to read the overwritten object, you might get the most recent version, or you might get an older version. This is what is referred to as "eventual consistency" on S3 in this scenario.
A few alternatives for you to consider:
don't write to S3 on every single step; instead, keep the data for each step on the client side, and then only write 1 single object to S3 after the 7th step. This way, there's only 1 write, no "overwrites", so no "eventual consistency". This might or might not be possible for your specific scenario, you need to evaluate that.
alternatively, write to S3 objects with different names for each step. E.g., something like: after step 1, save that to bruno-preferences-step-1.json; then, after step 2, save the results to bruno-preferences-step-2.json; and so on, then save the final preferences file to bruno-preferences.json, or maybe even bruno-preferences-step-7.json, giving yourself the flexibility to add more steps in the future. Note that the idea here is to avoid overwrites, which could cause eventual consistency issues. Using this approach, you only write new objects, you never overwrite them.
finally, you might want to consider Amazon DynamoDB. It's a NoSQL database, you can securely connect to it directly from the browser or from your server. It provides you with replication, automatic scaling, load distribution (just like S3). And you also have the option to tell DynamoDB that you want to perform strongly consistent reads (the default is eventually consistent reads; you have to change a parameter to get strongly consistent reads). DynamoDB is typically used for "small" records, 20kB is definitely within the range -- the maximum size of a record would be 400kB as of today. You might want to check this out: DynamoDB FAQs: What is the consistency model of Amazon DynamoDB?
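To illustrate that last alternative, a minimal boto3 sketch (the table and attribute names are made up): the only change needed for strong consistency on reads is the ConsistentRead flag.
import boto3

table = boto3.resource("dynamodb").Table("user-preferences")  # hypothetical table

# Write (or overwrite) the whole preferences document for one user.
table.put_item(Item={"username": "bruno", "preferences": {"step": 3, "theme": "dark"}})

# Read it back with strong consistency so the latest write is always visible.
resp = table.get_item(Key={"username": "bruno"}, ConsistentRead=True)
print(resp["Item"]["preferences"])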
How long does it take for AWS S3 to save and load an item? (We can freeze our website when document is being saved to S3)
You will not find the exact time anywhere. If you ask AWS they will give you approximate timings. Your file is 20 KB, so in my experience of S3 usage the time will be more or less 60-90 seconds.
Is there a function to calculate save/load time based on item size?
No, there is no function with which you can calculate this.
Is the save/load time gonna be different if I choose another S3 region? If so which is the best region for Seattle?
For Seattle, US West (Oregon) will work with no problem.
You can also take a look at this experiment for comparison https://github.com/andrewgaul/are-we-consistent-yet
Is there a way to get a random object from a specific bucket by using Riak's HTTP API? Let's say that you have no knowledge about the contents of a bucket, the only thing you know is that all objects in a bucket share a common data structure. What would be a good way to get any object from a bucket, in order to show its data structure? Preferably using MapReduce over Search, since Search will flatten the response.
The best option is to use predictable keys so you don't have to find them. Since that is not always possible, secondary indexing is the next best.
If you are using eLevelDB, you can query the $BUCKET implicit index with max_results set to 1, which will return a single key. You would then issue a get request for that key.
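If I remember the HTTP 2i endpoint shape correctly (please double-check against the docs for your Riak version), that $bucket query with max_results=1, followed by a fetch of the returned key, would look roughly like this Python sketch; the bucket name is a placeholder.
import requests

RIAK = "http://localhost:8098"
BUCKET = "my_bucket"  # placeholder bucket name

# Ask the special $bucket index for a single key from the bucket.
# (Some docs show "_" as the match value instead of the bucket name; check your version.)
resp = requests.get("{}/buckets/{}/index/$bucket/{}".format(RIAK, BUCKET, BUCKET),
                    params={"max_results": 1})
keys = resp.json().get("keys", [])

if keys:
    # Fetch the object itself to inspect its data structure
    # (assuming the stored value is JSON).
    obj = requests.get("{}/buckets/{}/keys/{}".format(RIAK, BUCKET, keys[0]))
    print(obj.json())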
If you are using Bitcask, you have 2 options:
list all of the keys in the bucket
Key listing in Bitcask will need to fold over every value in all buckets in order to return the list of keys in a single bucket. Effectively this means reading your entire dataset from disk, so this is very heavy on the system and could bring a production cluster to its knees.
MapReduce
MapReduce over a full bucket uses a similar query to key listing, so it is also very heavy on the system. Since the map phase function is executed separately for each object, if your map phase returns the object, every object in the bucket would be passed over the network to the node running the reduce phase. Thus it would be more efficient (read: less disastrous) to have the map phase function return just the key with no data, and then have your reduce phase return the first item in the list, which leaves you needing to issue a get request for the object once you have the key name.
While it is technically possible to find a key in a given bucket when you have no information about the keys or the contents, if you designed your system to create a key named <<"schema">> or <<"sample">> that contains a sample object in each bucket, you could simply issue a get request for that key instead of searching, folding, or mapping.
If you are using Riak 2.x, then Search (http://docs.basho.com/riak/latest/dev/using/search/) is recommended over MapReduce or 2i queries in most use cases, and it is available via the HTTP API.