Finding the best way to put a tiered document into DynamoDB - amazon-web-services

I've been working with regular SQL databases and now want to start a new project using AWS services. I want the back-end data storage to be DynamoDB, and what I want to store is a tiered document, like an instruction booklet of all the programming tips I've learned, which can be pulled up via a React frontend.
So the data will be in a format like Python -> Classes -> General -> "Information on Classes Text Wall"
There will be more than one subdirectory at times.
Future plans would be to be able to add new subfolders, move data to different folders, "thumbs up", and eventual multi account with read access to each other's data.
I know how to do this in a SQL DB, but have never used a NoSQL before and figured this would be a great starting spot.
I am also thinking about how to pick the partition and sort keys. I doubt this side project would ever grow to more than one cluster, but I know that with NoSQL you have to plan your layout ahead of time.
If NoSQL is just a horrible fit for this style of data, let me know as well. This is mostly for practice and to learn the AWS systems.

DynamoDB is a key-value database with the option to add secondary indexes. It's good for storing documents that don't require full-scan or aggregation queries. If you design your tiered-document application to show only one document at a time, then DynamoDB would be a good choice. You can store the documents with a structure like this:
DocumentTable:
{
  "title": "Python",
  "parent_document": "root",
  "child_documents": ["Classes", "Built In", ...],
  "content": "text"
}
Where:
parent_document - the "title" of the parent document; it may be empty for "Python" in your example, and would be "Python" for a document titled "Classes"
content - text or an unstructured document with notes, thumbs up, etc., as long as you don't plan to execute conditional queries over it; otherwise you would need a global secondary index. But since you won't have many documents, a full scan of the table won't take long.
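For illustration, reading one document at a time with boto3 might look roughly like this (the DocumentTable name and the title partition key are assumptions based on the structure above):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("DocumentTable")  # hypothetical table name

def get_document(title):
    # Fetch a single document by its title (assumed to be the partition key)
    resp = table.get_item(Key={"title": title})
    return resp.get("Item")

# Render one node, then follow child_documents for navigation
doc = get_document("Classes")
if doc:
    print(doc["content"])
    print("children:", doc.get("child_documents", []))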
You can also have another table holding a table of contents for a user's tiered document, which makes navigating the documents easier; however, in this case you need to take care of keeping that table consistent.
Example:
ContentsTable:
{
  "user": "...",   // primary key for this table in case you have many users
  "root": {
    "Python": {
      "Classes": {
        "General": [
          "Information on Classes Text Wall"
        ]
      }
    }
  }
}
Where Python, Classes, General and Information on Classes Text Wall are keys into DocumentTable.title. You could also use something other than the titles to keep the keys unique. DynamoDB's maximum item size is 400 KB, so this would be enough for a pretty large table of contents.
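As a sketch, loading a user's table of contents is then a single read, again with boto3 and with user assumed to be the partition key:

import boto3

dynamodb = boto3.resource("dynamodb")
contents = dynamodb.Table("ContentsTable")  # hypothetical table name

def get_contents(user_id):
    # Fetch the whole table of contents for one user in one read
    resp = contents.get_item(Key={"user": user_id})
    return resp.get("Item", {}).get("root", {})

# Walk the nested map to build the navigation tree in the React frontend
tree = get_contents("user-123")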

Related

AWS Personalize: how to deal with a huge catalog with not enough interaction data

I'm adding a product recommendation feature with Amazon Personalize to an e-commerce website. We currently have a huge product catalog with millions of items. We want to be able to use Amazon Personalize on our item details page to recommend other items relevant to the current item.
Now, as you may be aware, Amazon Personalize relies heavily on user interactions to provide recommendations. However, since we only just started our new line of business, we're not getting enough interaction data. The majority of items in our catalog have no interactions at all. A few items (thousands), though, get interacted with a lot, which then imposes a huge influence on the recommendation results. Hence you will see those few items always get recommended even if they are not relevant to the current item at all, creating very odd recommendations.
I think this is what we usually refer to as a "cold-start" situation - except that usual cold-start problems are about item "cold-start" or user "cold-start", while the problem I am faced with now is a new-business "cold-start" - we don't have the basic amount of interaction data to support fully personalized recommendations. In the absence of interaction data for each item, we want the Amazon Personalize service to rely on the item metadata to provide the recommendations. So ideally, we want the service to recommend based on item metadata and, once it's getting more interactions, recommend based on item metadata + interactions.
So far I've done quite a bit of research, only to find one solution - increasing explorationWeight when creating the campaign. As this article indicates, higher values for explorationWeight signify higher exploration; new items with low impressions are more likely to be recommended. But it does NOT seem to do the trick for me. It improves the situation a little, but often I still see odd results being recommended because of their higher interaction rates.
I'm not sure if there are any other solutions out there to remedy my situation. How can I improve the recommendation results when I have a huge catalog with not enough interaction data?
I'd appreciate any advice. Thank you and have a good day!
The SIMS recipe is typically what is used on product detail pages to recommend similar items. However, given that SIMS only considers the user-item interactions dataset and you have very little interaction data, SIMS will not perform well in this case. At least at this time. Once you have accumulated more interaction data, you may want to revisit SIMS for your detail page.
The user-personalization recipe is a better match here since it uses item metadata to recommend cold items that the user may be interested in. You can improve the relevance of recommendations based on item metadata by adding textual data to your items dataset. This is a new Personalize feature (see blog post for details). Just add your product descriptions to your items dataset as a textual field as shown below and create a solution with the user-personalization recipe.
{
  "type": "record",
  "name": "Items",
  "namespace": "com.amazonaws.personalize.schema",
  "fields": [
    {
      "name": "ITEM_ID",
      "type": "string"
    },
    {
      "name": "BRAND",
      "type": [
        "null",
        "string"
      ],
      "categorical": true
    },
    {
      "name": "PRICE",
      "type": "float"
    },
    {
      "name": "DESCRIPTION",
      "type": [
        "null",
        "string"
      ],
      "textual": true
    }
  ],
  "version": "1.0"
}
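As a rough sketch, registering that schema and creating a solution with the aws-user-personalization recipe might look like this with boto3 (names and ARNs are placeholders, and the dataset creation and import steps are skipped):

import boto3

personalize = boto3.client("personalize")

# items_schema.json is assumed to contain the Avro schema shown above
with open("items_schema.json") as f:
    items_schema_json = f.read()

schema_resp = personalize.create_schema(
    name="items-schema-with-description",
    schema=items_schema_json,
)

# Train a solution using the user-personalization recipe
solution_resp = personalize.create_solution(
    name="detail-page-solution",
    datasetGroupArn="arn:aws:personalize:us-east-1:123456789012:dataset-group/my-dataset-group",
    recipeArn="arn:aws:personalize:::recipe/aws-user-personalization",
)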
If you're still using this recipe on your product detail page, you can also consider using a filter when calling GetRecommendations to limit recommendations to the current product's category.
INCLUDE ItemID WHERE Items.CATEGORY IN ($CATEGORY)
Where $CATEGORY is the current product's category. This may require some experimentation to see if it fits with your UX and catalog.
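For illustration, passing the filter at request time might look roughly like this with boto3 (the campaign and filter ARNs are placeholders):

import boto3

runtime = boto3.client("personalize-runtime")

resp = runtime.get_recommendations(
    campaignArn="arn:aws:personalize:us-east-1:123456789012:campaign/detail-page",
    userId="user-42",
    itemId="item-123",  # the product currently being viewed
    numResults=10,
    filterArn="arn:aws:personalize:us-east-1:123456789012:filter/same-category",
    filterValues={"CATEGORY": '"electronics"'},  # filter values are JSON-encoded strings
)
recommended = resp["itemList"]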

DynamoDB query all users sorted by name

I am modelling the data of my application to use DynamoDB.
My data model is rather simple:
I have users and projects
Each user can have multiple projects
Users can number in the millions; projects per user can number in the thousands.
My access pattern is also rather simple:
Get a user by id
Get a list of paginated users sorted by name or creation date
Get a project by id
Get projects by user sorted by date
My single table for this data model is the following:
I can easily implement all my access patterns using table PK/SK and GSIs, but I have issues with number 2.
According to the documentation and best practices, to get a sorted list of paginated users:
I can't use a scan, as sorting is not supported
I should not use a GSI with a PK that would put all my users in the same partition (e.g. GSI PK = "sorted_user", SK = "name"), as that would make my single partition hot and would not scale
I can't create a new entity of type "organisation", put all users in there, and query by PK = "org", as that would have the same hot partition issue as above
I could bucket users and use write sharding, but I don't really know how I could practically query paginated, sorted users, as bucket PKs would probably need to be random, and I would have to query all buckets to be able to sort all users together. I also thought that bucket PKs could be alphabetical letters, but that could create hot partitions as well, as the letter "A" would probably be hit quite hard.
My application model is rather simple. However, after having read all the docs and best practices and watched many online videos, I find myself stuck on the most basic use case, which DynamoDB does not seem to support well. I suppose it must be quite common to have to get lists of users in some sort of admin panel for practically any modern application.
What would others do in this case? I would really like to use DynamoDB for all the benefits it gives, especially in terms of cost.
Edit
Since I have been asked, in my app the main use case for 2) is something like this: https://stackoverflow.com/users?tab=Reputation&filter=all.
As to the sizing, it needs to scale well, at least to the tens of thousands.
I also thought that bucket PKs could be alphabetical letters, but that could create hot partitions as well, as the letter "A" would probably be hit quite hard.
I think this sounds like a reasonable approach.
The US Social Security Administration publishes data about names on its website. You can download the list of name data from as far back as 1879! I stumbled upon a website from data scientist and linguist Joshua Falk that charted the baby name data from the SSA, which can give us a hint of how names are distributed by their first letter.
Your users may not all be from the US, but this can give us an understanding of how names might be distributed if partitioned by the first letter.
While not exactly evenly distributed, perhaps it's close enough for your use case? If not, you could further distribute the data by using the first two (or three, or four...) letters of the name as your partition key.
1 million names likely amount to no more than a few MBs of data, which isn't very much. Partitioning based on name prefixes seems like a reasonable way to proceed.
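As a rough sketch, assuming a GSI (say name-index) whose partition key is the lowercased first letter of the name and whose sort key is the full name, one page of sorted users could be fetched like this with boto3 (table, index, and attribute names are made up):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("AppTable")  # hypothetical single-table name

def list_users_by_letter(letter, limit=25, start_key=None):
    # One page of users whose name starts with `letter`, sorted by name
    kwargs = {
        "IndexName": "name-index",  # hypothetical GSI
        "KeyConditionExpression": Key("name_bucket").eq(letter),
        "Limit": limit,
        "ScanIndexForward": True,  # ascending by the sort key (name)
    }
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key
    resp = table.query(**kwargs)
    return resp["Items"], resp.get("LastEvaluatedKey")

page, next_key = list_users_by_letter("a")

Walking the buckets in alphabetical order and paginating within each one gives you the globally sorted listing.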
You might also consider using a tool like ElasticSearch, which could support your second access pattern and more.
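If you do go that route, a sorted, paginated user listing is straightforward there. Here is a sketch with the Python Elasticsearch client, assuming a users index with a name keyword sub-field (index name and endpoint are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# One page of users sorted by name (assumes a `name.keyword` sub-field exists)
resp = es.search(
    index="users",
    body={
        "query": {"match_all": {}},
        "sort": [{"name.keyword": "asc"}],
        "from": 0,
        "size": 25,
    },
)
users = [hit["_source"] for hit in resp["hits"]["hits"]]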

AWS Data Structure and Stack Suggestion for highly filterable data

Firstly, let me know if I should place this in a different community. It is programming related, but less so than I would prefer.
I am creating a mobile app which I intend to base on AWS AppSync unless I can determine it is a poor fit.
I want to store a fairly large set of data, say a half million records.
From these records, I need to be able to grab all entries based on a tag and page them from the larger set.
An example of this data would be:
{
  "name": "Product123",
  "tags": [
    {
      "name": "1880",
      "type": "year",
      "value": 7092
    },
    {
      "name": "f",
      "type": "gender",
      "value": 4120692
    }
  ]
}
Various objects may or may not have a specific tag but may have up to 500 tags or more (the seed of initial data has 130 tags). My filter would ignore them if they did not match but return them if they did.
In reading about Query vs Scan on DynamoDB, I feel like my current data structure would mostly require scanning and be inefficient. Efficiency is only a real restriction due to cost.
With cost in mind, I will focus on the cost per user to access this data in filtered sets. Say 100,000 users for now each filtering and paging data many times a day.
Your concept of tags doesn't sound too different from the concept of Cognito User Pools' groups with AppSync (docs) - authentication based on groups will only return items allowed for groups that the user making the request is in. Cognito's default group limit is 25 per user pool, so while convenient out of the box, it wouldn't itself help you much. Instead, it's interesting just because it's similar conceptually, and can give you insight by looking at how it works internally.
If you go into the AppSync console and set up a request mapping template for groups auth, you'll see that it uses a scan and the contains operation. Doing something similar would probably be your best bet here, if you really want to use Dynamo. If you find that prohibitively costly, you could use a Lambda data source, which allows you to use any data store, if you have one in mind that's a little more flexible for this type of action.
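To make the scan + contains idea concrete outside the resolver template, here is a rough boto3 sketch; it assumes each item also carries a flat tag_names list (or string set) of just the tag names, so contains can match them directly:

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Products")  # hypothetical table name

def products_with_tag(tag_name, start_key=None):
    # One page of items whose (assumed) tag_names attribute contains the tag
    kwargs = {"FilterExpression": Attr("tag_names").contains(tag_name)}
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key
    resp = table.scan(**kwargs)
    return resp["Items"], resp.get("LastEvaluatedKey")

items, next_key = products_with_tag("1880")

Keep in mind the filter is applied after the read, so you still pay for scanning the whole table, which is exactly the cost concern raised in the question.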

PowerBI and nested 1:N data

I'm trying to leverage the advantages of DocumentDB / Elastic / NoSQL for retrieving and visualizing big data. I want to use PowerBI to do that, which is pretty good; however, I have no clue how to model a document which has a 1:N nested data field. E.g.
{
  name: string,
  age: int,
  children: [ { name: string }, ... ]
}
In the normal case, you would flatten the table by expanding the nested values and joining them, but how does one do that when it's 1:N / a list? Is there a way to maybe extract that into its own table?
I've been thinking about making a bridge which translates a document into data tables, but that feels like an incorrect way to go, and it further introduces complications with regard to how many endpoints and queries there would have to be.
I can't help but think this is a solved issue, as many places analyse and visualize large amounts of data stored in NoSQL. The alternative is a normalized relational database, but having millions and millions of entries in one that you analyze also seems incorrect when NoSQL is tuned for these scenarios.
If the data is 1:N but not arbitrarily deep, you can use the expand option in the Query tab. You will get one row for each instance of customer, with all the attributes of the container.
If you want to get more sophisticated, you could normalize the schema by expanding just the customer id column (assuming there is one in your data) into one table, and expanding the customer details into another one, then creating a relationship across them. That makes aggregations easier (like a count of parents). You'd just load the data twice and delete the columns you don't need.
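Outside of Power BI, the same parent/child split can be sketched in Python with pandas, just to make the shape of the two resulting tables concrete (field names follow the example document above):

import pandas as pd

# Example documents shaped like the question's schema
docs = [
    {"name": "Alice", "age": 40, "children": [{"name": "Bob"}, {"name": "Carol"}]},
    {"name": "Dave", "age": 35, "children": []},
]

# Parent table: one row per document
parents = pd.DataFrame([{"name": d["name"], "age": d["age"]} for d in docs])

# Child table: one row per nested child, keyed back to the parent
children = pd.json_normalize(docs, record_path="children",
                             meta=["name"], meta_prefix="parent_")

print(parents)
print(children)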

Elastic search + translated content in Rails 4

I would like to set up Elasticsearch on a table "content" which also has a translation table "content_translation" for localization purposes (Globalize gem). We have 10 languages.
I want to implement Elasticsearch for my data model. In SQL I would search like this:
SELECT id, content_translation.content
FROM content
LEFT JOIN content_translation on content.id = content_translation.content_id
WHERE content_translation.content LIKE '%???????%'
I wonder what the best strategy is to do a "left join"-like search with Elasticsearch.
Should I create just a "content" index with all translation data in it?
{"id": 21, "translations": {"en": {"content": "Lorem..."}, "de": {"content": "Lorem..."}, ...}}
Or should I create a "content_translation" index and just filter results for a specific locale?
{"content_id": 21, "locale": "en", "content": "Lorem ..."}
Are there some good practices for how to do this?
Should I take care of maintaining the index myself, or should I use something like the Tire gem, which takes care of indexing by itself?
I would recommend the second alternative (one document per language), assuming that you wouldn't need to show content from multiple languages.
i.e.
{"content_id":21, "locale":"en", "content": "Lorem ..."}
I recommend a gem like Tire; exploit its DSL to your advantage.
You could have your content model look like this:
class Content < ActiveRecord::Base
  include Tire::Model::Search
  include Tire::Model::Callbacks
  ...
end
Then you could have a search method that does something like this:
Content.search do
  query do
    ...
  end
  filter :terms, :locale => I18n.locale.to_s
end
Your application would need to maintain the locale at all times to serve the respective localized content. You could just use I18n's locale and look up the data. Just pass this in as a filter, and you have the separation you want. Bonus: you get fallbacks for free if you have enabled them in i18n for Rails.
However, if you have a use case where you need to show multi-lingual content side by side, then this fails and you could look at a single document holding all language content.