AWS Personalize items attributes - amazon-web-services

I'm trying to implement personalization and am having problems with the Items schema.
Imagine I'm Amazon: I have products, their brands, and their categories. What kind of Items schema should I use to include this information?
Should I include the brand name as a categorical string field? Should I instead include the brand ID, as a string or as a number? Or should I include both?
What about categories? I have the same questions.
Metadata Fields
Metadata includes string or non-string fields that aren't required or don't use a reserved keyword. Metadata schemas have the following restrictions:
Users and Items schemas require at least one metadata field.
Users and Interactions datasets can contain up to five metadata fields. An Items dataset can contain up to 50 metadata fields.
If you add your own metadata field of type string, it must include the categorical attribute. Otherwise, Amazon Personalize won't use the field when training a model.
https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html

There are essentially two ways to include your metadata in Items/Users datasets:
If it can be represented as a numeric value, provide the actual value, as long as that makes sense.
If it can be represented as a string, provide the string value and make sure that categorical is set to true.
But why does Personalize need you to categorize your string metadata? The answer is pretty simple.
Let's start with an example.
If your Items were Amazon.com products and you wanted to provide a ratings metadata field, then:
You could take all of the ratings, including the full review text sent by customers, and simply put that in as a metadata field.
You could take just the star ratings, calculate the average, and put that in as a metadata field.
The second option probably makes more sense in general. Having random, long product reviews as metadata changes pretty much nothing. Personalize doesn't understand whether the review itself is good or bad, or whether the author also recommends another product, so it doesn't really add anything to the recommendations.
However, if you simply "cut" your dataset down and calculate the average rating, as in the second option, it makes a lot more sense. Maybe some of your customers like crappy products? Maybe they want to buy them because they are famous YouTubers who make videos about them? Based on their previous interactions and much more, Personalize will be able to perform slightly better, because now it knows that this product has a rating of 5/5 or 3/5.
I wanted to show you that in some cases, providing Items metadata as a free-form string makes no sense. That's why your string metadata must be categorical. It means that the values should come from a finite set, so that the field adds some knowledge for Personalize about the given Item and why some people might want to interact with it.
Going back to your question:
Should I include the brand name as a categorical string field? Should I instead include the brand ID, as a string or as a number? Or should I include both?
I would simply go with the brand ID as a string. You could also go with the brand name, but a brand can be renamed while still being the same brand, so the ID is more stable. Also, two different brands could have the same name because they operate in different markets, and using the ID solves that as well.
The "categorical": true switch in your schema just tells Personalize:
Hey, do you see that string field? It's a categorised, finite set of values. If you train a model for me, please include this one during training, it's important!
And as the documentation says, if you provide a string metadata field that is not marked as categorical, then Personalize will "think":
Hmm... this field is a string, it has pretty random values, and it's not marked as categorical. It's probably just a leftover from the Items export job. Let's ignore it.
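To make the advice above concrete, here is a minimal sketch of what such an Items schema might look like, built as a Python dict and serialized to the JSON that Personalize schemas use. The field names BRAND_ID, CATEGORY and AVG_RATING are assumptions chosen for illustration, and the small helper is hypothetical: it only flags string fields that are missing "categorical": true.

```python
import json

# Hypothetical Items schema. BRAND_ID, CATEGORY and AVG_RATING are
# illustrative names, not something prescribed by the question.
items_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "ITEM_ID", "type": "string"},
        # String metadata: marked categorical so it is used in training.
        {"name": "BRAND_ID", "type": "string", "categorical": True},
        {"name": "CATEGORY", "type": "string", "categorical": True},
        # Numeric metadata: provided as the actual value.
        {"name": "AVG_RATING", "type": "float"},
    ],
    "version": "1.0",
}


def ignored_string_fields(schema, reserved=("ITEM_ID",)):
    """Return names of string metadata fields lacking "categorical": true."""
    return [
        f["name"]
        for f in schema["fields"]
        if f["type"] == "string"
        and f["name"] not in reserved
        and not f.get("categorical")
    ]


# JSON text of the schema, as it would be registered with Personalize.
schema_json = json.dumps(items_schema, indent=2)
```

Calling ignored_string_fields(items_schema) on the schema above returns an empty list; adding a plain string field without the categorical flag would make it show up there.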

Related

DynamoDB query all users sorted by name

I am modelling the data of my application to use DynamoDB.
My data model is rather simple:
I have users and projects
Each user can have multiple projects
There can be millions of users, and thousands of projects per user.
My access pattern is also rather simple:
Get a user by id
Get a list of paginated users sorted by name or creation date
Get a project by id
Get projects by user sorted by date
My single table for this data model is the following:
I can easily implement all my access patterns using table PK/SK and GSIs, but I have issues with number 2.
According to the documentation and best practices, to get a sorted list of paginated users:
I can't use a scan, as sorting is not supported
I should not use a GSI with a PK that would put all my users in the same partition (e.g. GSI PK = "sorted_user", SK = "name"), as that would make my single partition hot and would not scale
I can't create a new entity of type "organisation", put all users in there, and query by PK = "org", as that would have the same hot partition issue as above
I could bucket users and use write sharding, but I don't really know how I could practically query paginated sorted users, as bucket PKs would possibly need to be random, and I would have to query all buckets to be able to sort all users together. I also thought that bucket PKs could be alphabetical letters, but that could create hot partitions as well, as the letter "A" would probably be hit quite hard.
My application model is rather simple. However, after having read all docs and best practices and watched many online videos, I find myself stuck with the most basic use case that DynamoDB does not seem to be supporting well. I suppose it must be quite common to have to get lists of users in some sort of admin panel for practically any modern application.
What would others do in this case? I would really like to use DynamoDB for all the benefits that it gives, especially in terms of costs.
Edit
Since I have been asked, in my app the main use case for 2) is something like this: https://stackoverflow.com/users?tab=Reputation&filter=all.
As to the sizing, it needs to scale well, at least to the tens of thousands.
I also thought that bucket PKs could be alphabetical letters, but
that could create hot partitions as well, as the letter "A" would
probably be hit quite hard.
I think this sounds like a reasonable approach.
The US Social Security Administration publishes data about names on its website. You can download the list of name data from as far back as 1880! I stumbled upon a website from data scientist and linguist Joshua Falk that charted the baby name data from the SSA, which can give us a hint of how names are distributed by their first letter.
Your users may not all be from the US, but this can give us an understanding of how names might be distributed if partitioned by the first letter.
While not exactly evenly distributed, perhaps it's close enough for your use case? If not, you could further distribute the data by using the first two (or three, or four...) letters of the name as your partition key.
1 million names likely amount to no more than a few MBs of data, which isn't very much. Partitioning based on name prefixes seems like a reasonable way to proceed.
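As a rough illustration of the prefix idea, here is a minimal in-memory sketch. It does not call DynamoDB; the dict of buckets stands in for partitions whose items are sorted by name, and the "USERS#" key format, the function names and the cursor convention are all assumptions made for the example. Because the bucket keys are themselves alphabetical, walking the buckets in key order and reading each one in sort-key order yields a globally sorted list, so a page never needs data from every bucket at once.

```python
from collections import defaultdict


def bucket_key(name, prefix_len=1):
    """Partition key for a user: lowercased first letter(s) of the name."""
    return "USERS#" + name.lower()[:prefix_len]


def build_buckets(names, prefix_len=1):
    """Group names by bucket; each bucket is kept sorted like a sort key."""
    buckets = defaultdict(list)
    for name in names:
        buckets[bucket_key(name, prefix_len)].append(name)
    return {pk: sorted(items, key=str.lower) for pk, items in buckets.items()}


def sorted_users(buckets):
    """Walk buckets in key order; each bucket is already sorted."""
    for pk in sorted(buckets):
        yield from buckets[pk]


def page(buckets, size, after=None):
    """Return up to `size` names that sort strictly after the cursor.

    A real sort key would also carry a unique suffix (for example the
    user id) so that duplicate names are not skipped by the cursor.
    """
    out = []
    for name in sorted_users(buckets):
        if after is not None and name.lower() <= after.lower():
            continue
        out.append(name)
        if len(out) == size:
            break
    return out
```

With buckets built from a handful of names, page(buckets, 2) returns the first two names alphabetically, and passing the last name of that page as `after` returns the next two. Raising prefix_len spreads the same names over more, smaller buckets without changing the overall order.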
You might also consider using a tool like Elasticsearch, which could support your second access pattern and more.

Boolean attribute or new table (Django + PostgreSQL)

Situation: I have a set of Books. A Book can be one of three types: "Test", "Premium", and "Common". Share of the data: 2%, 15%, 83%. Share of queries per unit of time: 40%, 20%, 40%.
I see several ways to model this in the database:
Booleans: is_test, is_premium. If we need only "Test" books: Book.objects.filter(is_test=True). This could be a proxy model, for example. The same goes for premium books;
Separate tables: books_test, books_premium, books_common;
Choice field: a string in ['Test', 'Premium', 'Common'];
A combination of 1 and 2: a books_test table and a books table with an is_premium attribute.
We also need to query this data efficiently. All three Book variants are needed on one page. The queryset combinations that exist are: only tests, only common, common + premium, only premium.
If we use variant 1 or 3: one endpoint with a specific filter;
If we use variant 2: one of three endpoints without filters (the frontend has to know which endpoint to use). Or we can create one endpoint with some conditions that are checked by the backend. Either way, the logic needs to be extended;
Which way is more correct, and why?
If you need to mix different types on one page, separate models/tables would complicate things for no good reason. The same goes for mapping more than two exclusive states to a combination of boolean fields.
This leaves you with a choice field or a separate BookType model containing the choices.
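To show why a single choice field copes with all four page combinations, here is a minimal plain-Python sketch. The BookType enum, the PAGE_FILTERS names and the dict-based book records are assumptions for illustration; in Django the same idea would typically be a field declared with choices and a filter using the __in lookup on that field.

```python
from enum import Enum


class BookType(Enum):
    """The three mutually exclusive states as one choice."""
    TEST = "test"
    PREMIUM = "premium"
    COMMON = "common"


# The four queryset combinations from the question, each expressed as
# a set of allowed types. The keys are hypothetical view names.
PAGE_FILTERS = {
    "tests": {BookType.TEST},
    "common": {BookType.COMMON},
    "common_premium": {BookType.COMMON, BookType.PREMIUM},
    "premium": {BookType.PREMIUM},
}


def select_books(books, view):
    """Filter book records (dicts with a "type" string) for one view."""
    allowed = PAGE_FILTERS[view]
    return [b for b in books if BookType(b["type"]) in allowed]
```

One endpoint can then take the view name as a parameter, and adding a fourth type later only means adding an enum member and adjusting the sets.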

modeling datawarehouse multilanguage

I need your help.
I work for a survey company and I am responsible for creating its architecture and modeling a data warehouse that analyzes the results of an international survey (50 countries).
For the architecture, we decided to create a tabular model in PowerBI to analyze our data and to create our reports.
Below is the model as I designed it:
However, I have a design problem.
Since the survey is international, the wording of my dimensions differs from country to country.
My 1st question:
-Would it make more sense to create only one PowerBI embedded model for all countries or 50 PowerBI reports?
My 2nd question:
My model must be multilingual.
Across my 50 countries I have several languages (5 languages), and for the same language I have several variants.
The British English labels differ from the US English labels.
For example, for the Response dimension, IdReponse = 1 has the wording 'Vrai' for France, 'True' for the USA, and 'OK' for Britain.
Do you know how to model multi language in a data warehouse?
About question #1 - It's always better to have only one model. It will be much easier to maintain. It isn't clear from your question whether these 50 reports will show the same data (excluding the internationalization of texts like Vrai/True/OK), or whether each report/country should show its own subset of the data. If all reports will show the same data, then it will definitely be better to make one common model and have all reports use it. You can do this with Power BI by making one "master" report and publishing it, and then having the rest of your "per country" reports use it as a data source. You will still need separate reports per country, because you will need to translate the texts (column names, static texts, etc.).
About question #2 - You can create lookup tables in your model (maybe even in the database, it's up to you). The key value (1) will be linked to the key of the table, and there will be one column per language. Depending on the language of the current report, you select the appropriate column (e.g. French, British, etc.), and you can even fall back to, let's say, US English in case there is no translation entered for the current language (e.g. by making a computed column). Making a separate lookup table per language is also an option, but I think it would be more cumbersome to maintain.
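The lookup-with-fallback idea can be sketched in a few lines of Python. This is only an illustration of the logic, not Power BI or DAX code: the locale codes, the RESPONSE_LABELS structure and the second row are assumptions, and only the 'Vrai' / 'True' / 'OK' labels for key 1 come from the question.

```python
FALLBACK_LOCALE = "en-US"  # assumed default language

# One row per IdReponse, one "column" per language variant.
RESPONSE_LABELS = {
    1: {"fr-FR": "Vrai", "en-US": "True", "en-GB": "OK"},
    # Hypothetical row with missing translations, to show the fallback.
    2: {"en-US": "False"},
}


def response_label(id_reponse, locale):
    """Return the label for a locale, falling back to US English."""
    row = RESPONSE_LABELS[id_reponse]
    return row.get(locale) or row[FALLBACK_LOCALE]
```

In the model itself, the same rule would be the computed column mentioned above: take the current language's column when it is filled in, otherwise the US English one.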
About question #1: Yes, you need only one data model.
About question #2: Load each question in the language it was asked in, and store the response as-is in the response DIM. Then create a new column in your response DIM, such as Clean_response, that holds the original response transformed to a uniform value. For example, "Vrai", "OK", and "True" have the same meaning, so you may choose to put "Yes" in the Clean_response column. You can also convert the different variations of "No", "Nada", "noops", "nah" to a clean value of "No", but keep the original value too.
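A minimal sketch of that Clean_response column, written in Python rather than in any particular ETL tool: the mapping holds exactly the variants named above, and the function and column-building helper are hypothetical names for illustration. Unmapped values are returned as None so they can be reviewed rather than guessed.

```python
# Raw variants (lowercased) mapped to their uniform value.
CLEAN_VALUES = {
    "vrai": "Yes",
    "ok": "Yes",
    "true": "Yes",
    "no": "No",
    "nada": "No",
    "noops": "No",
    "nah": "No",
}


def clean_response(raw):
    """Map a raw response to its uniform value, or None if unknown."""
    return CLEAN_VALUES.get(raw.strip().lower())


def to_dim_row(raw):
    """Build a response DIM row that keeps the original value too."""
    return {"Response": raw, "Clean_response": clean_response(raw)}
```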
Labeling a column in the report should be handled in the report code. For example, a report written in French should use your dim column name "Question" and show it as "interroger" as a heading on the report.

User created custom Fields in Django

I'm working on a Django app for keeping track of collections (coins, cards, gems, stamps, cars, whatever). You can have multiple collections, each collection can have sets (Pirates cards, Cardinals cards, etc.), and then of course there are the individual items in each collection/set. Each item can contain multiple pictures, a name, and a description, but here's where I'm unsure how to proceed. Each collection will need its own set of values, or fields, that the user will need to determine (condition, dimensions in the appropriate units, coin thickness, model number, etc.). How can I make custom fields such that the user can name the field and choose the input type (text, numbers, dropdown with choices), and have those fields show up to be filled in on each item within that collection?
This would be called an Entity-Attribute-Value (EAV) model and it is quite tricky to implement in the way you want. You have to anticipate all sorts of issues with user input, how to validate field types, what happens when the user wants to change fields, etc. I would start by reading the issues raised in that question and think about ways that you could modify your schema to avoid letting users define their own metadata at runtime. Are there some fields that could be common to all collections (like condition, dimensions, model number)? How tolerant do you want to be of data type issues, and will users be allowed to change field types after creation?
The more thought you put into implementation, the more issues you can avoid down the road.
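To give a feel for the moving parts, here is a minimal plain-Python sketch of the EAV shape, assuming the three input types from the question (text, number, dropdown). The class and attribute names are hypothetical and this is not Django model code; it only shows where per-type validation has to happen once users define their own fields.

```python
from dataclasses import dataclass, field


@dataclass
class FieldDef:
    """A user-defined attribute: its name, input type and choices."""
    name: str
    kind: str  # "text", "number" or "dropdown"
    choices: tuple = ()

    def validate(self, raw):
        """Coerce and check a raw value according to the field type."""
        if self.kind == "text":
            return str(raw)
        if self.kind == "number":
            return float(raw)  # raises ValueError on bad input
        if self.kind == "dropdown":
            if raw not in self.choices:
                raise ValueError(f"{raw!r} is not a choice for {self.name}")
            return raw
        raise ValueError(f"unknown field type {self.kind!r}")


@dataclass
class Collection:
    name: str
    fields: dict = field(default_factory=dict)

    def add_field(self, field_def):
        self.fields[field_def.name] = field_def


@dataclass
class Item:
    collection: Collection
    name: str
    values: dict = field(default_factory=dict)

    def set_value(self, field_name, raw):
        """Store a value only if the collection defines and accepts it."""
        field_def = self.collection.fields[field_name]
        self.values[field_name] = field_def.validate(raw)
```

Even in this small form the open questions from the answer are visible: nothing here handles a user changing a field's type after items already hold values for it.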

SOLR query exclusions

I'm having an issue with querying an index where a common search term also happens to be part of a company name that is interspersed throughout most of the documents. How do I exclude the business name from results without affecting the ranking of a search that includes part of the business name?
Example: Bobs Automotive Supply is the business name.
How can I include relevant results when someone searches for automotive or supply without returning every document in the index?
I tried "-'Bobs Automotive Supply' +'search term'", but this seems to exclude any document containing Bobs Automotive Supply and isn't very effective when searching for 'supply' or 'automotive'.
Thanks in advance.
Second answer here, based on additional clarification from first answer.
A few options.
Add the words of the business name as stop words for the stop word filter (StopFilterFactory in Solr). This will stop Solr from indexing them at all. Searches that use them will only really search for those words that aren't in the business name.
Rely on the inherent scoring that Solr will apply due to Term frequency. It sounds like these terms will be in the index frequently. Queries for them will still return the documents, but if the user queries for other, less common terms, those will get a higher score.
Apply a low query boost (not quite negative, but less than other documents) to documents that contain the business name. This is covered in the Solr Relevancy FAQ http://wiki.apache.org/solr/SolrRelevancyFAQ#How_do_I_give_a_negative_.28or_very_low.29_boost_to_documents_that_match_a_query.3F
Do you know that the article is tied to the business name or derive this? If so, you could create another field and then just exclude entities that match on the business name using a filter query. Something like
q=search_term&fq=business_name:(NOT search_term)
It may be helpful to use subqueries for this or to just boost down rather than filter out results.
EDIT: The update to the question makes this irrelevant. Leaving it here for posterity. :)
This is why Solr Documents have different fields.
In this case, it sounds like there is a "Footer" field that is separate from your "Body" field in your documents. When searches are performed, they would only be done against the Body, which won't include data from the Footer. You could even have a third field, "OriginalContent", which contains the original copy for display purposes. You wouldn't search that, just store it for later.
The important part is to create the two separate fields in your schema and make sure that you index only those fields that you want to be able to search.