Rating Items From Ratings of Random Sets

I'm looking for an algorithm to determine ratings or tags for items when my input is giving the user random sets of these items and letting them tell me which ratings or tags are present within that set. For example, they might say whether a set of images contains any "good" photos; from several such random sets I then want to determine which individual photos are good. I can also adjust the sets of items I give the user to whichever would best refine my knowledge of the items. So if a given item had appeared in sets marked both "good" and "bad", the system might place it in a set of known good items; if the user then says that set has a bad item, I know the item of unsure status is the "bad" one. Does this make sense?

Use an artificial neural network; see Wikipedia. There are lots of links at this FAQ.
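Here is a minimal sketch of what that could look like in practice: a one-layer model (logistic regression on set membership, i.e., the simplest possible "neural network") that learns a per-item score from set-level "contains a good photo" answers. The toy data, item count, and training settings below are made up for illustration.

```python
import numpy as np

# Toy setup: 10 items; items 2 and 7 are secretly "good".
# Each observation is a random subset of items plus a single yes/no answer
# to "does this set contain any good item?".
rng = np.random.default_rng(0)
n_items = 10
good = {2, 7}

def make_observation():
    members = rng.choice(n_items, size=4, replace=False)
    x = np.zeros(n_items)
    x[members] = 1.0
    y = 1.0 if good & set(members.tolist()) else 0.0
    return x, y

X, Y = zip(*(make_observation() for _ in range(300)))
X, Y = np.array(X), np.array(Y)

# One-layer model: each item gets a weight; a high weight means the item
# tends to appear in sets the user labels "good".
w = np.zeros(n_items)
b = 0.0
lr = 0.1
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(set is "good")
    grad = p - Y                             # cross-entropy gradient
    w -= lr * (X.T @ grad) / len(Y)
    b -= lr * grad.mean()

print("Estimated per-item 'goodness' scores:", np.round(w, 2))
print("Most likely good items:", np.argsort(w)[::-1][:3])
```

The learned scores also suggest which sets to show next: items whose scores sit near zero are the ones worth placing into otherwise known-good sets, exactly as the question describes.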

How do you give negative feedback to Amazon Personalize? I.e., how do you tell Personalize that a user doesn't really like a certain item?

I watched the Amazon Personalize Deep Dive series on YouTube. At timestamp 8:33 in the video, it is mentioned that 'Personalize does not understand negative feedback' and that any interaction you submit is assumed to be a positive one.
But I think that giving negative feedback could improve the recommendations we serve as a whole. If Personalize knew that a user does not like a given item 'A', it could avoid recommending items similar to 'A' in the future.
Is there any way we can give negative feedback (e.g., the user doesn't like items x, y, z) to Amazon Personalize?
A possible way to give negative feedback that I thought of:
Let's say users can rate movies out of 5. Every time a user gives a rating >= 3, we add an additional interaction to the dataset (i.e., interactions.csv contains two interactions recording that the user rated the movie >= 3 instead of just one). However, if they give a rating <= 2 (meaning they probably don't like the movie), we keep only the single interaction in interactions.csv.
Would this in any way help convey to Personalize that ratings <= 2 are not as important/that the user did not like those items?
Negative feedback, where the user explicitly indicates that they dislike an item, is currently not supported as training input for Amazon Personalize. Furthermore, there is currently no way to add weight/reward to specific interactions by event type or event value (see this answer for details).
With that said, you can use impressions with your interactions to indicate items that were seen by the user but that they chose not to interact with. Impressions are only supported by the User-Personalization recipe. From the docs:
Unlike other recipes, which solely use positive interactions (clicking, watching, or purchasing), the User-Personalization recipe can also use impressions data. Impressions are lists of items that were visible to a user when they interacted with (clicked, watched, purchased, and so on) a particular item.
Using this information, a solution created with the User-Personalization recipe can calculate the suitability of new items based on how frequently an item has been ignored, and change recommendations accordingly. For more information see Impressions data.
Impressions are not the same as an explicitly negative interaction but they do imply that the impressed items were considered less relevant/important to the user than the item that they chose to interact with.
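For example, when streaming events with the PutEvents API you can attach the impression list alongside the interaction. This is only a sketch; the tracking ID, user/session IDs, and item IDs below are placeholders.

```python
import datetime
import boto3

# Sketch: record a click together with the list of items the user saw.
# All IDs below are placeholders.
events = boto3.client("personalize-events")

events.put_events(
    trackingId="YOUR_EVENT_TRACKER_ID",
    userId="user-123",
    sessionId="session-456",
    eventList=[
        {
            "eventType": "click",
            "itemId": "item-42",  # the item the user actually chose
            "sentAt": datetime.datetime.now(datetime.timezone.utc),
            # Items that were shown but not chosen; User-Personalization
            # treats these as "seen but ignored" signals.
            "impression": ["item-42", "item-7", "item-13", "item-99"],
        }
    ],
)
```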
Another approach that can be used to consider negative interactions when making recommendations is to have two event types: one event type for positive intent (e.g., "like", "watch", or "purchase") and one event type for dislike (e.g., "dislike"). Then create a Personalize solution that only trains on the positive event type. Finally, at inference time, use a Personalize filter to exclude items that the user has recently disliked.
EXCLUDE ItemID WHERE Interactions.EVENT_TYPE IN ("dislike")
In this case, Personalize is still training only on positive interactions but with the filter you won't recommend items that the user disliked.
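A sketch of what that setup might look like with boto3 (the ARNs, names, and IDs below are placeholders):

```python
import boto3

personalize = boto3.client("personalize")
runtime = boto3.client("personalize-runtime")

# One-time setup: a filter that excludes anything the user has disliked.
response = personalize.create_filter(
    name="exclude-disliked-items",
    datasetGroupArn="arn:aws:personalize:us-east-1:111122223333:dataset-group/placeholder",
    filterExpression='EXCLUDE ItemID WHERE Interactions.EVENT_TYPE IN ("dislike")',
)
filter_arn = response["filterArn"]

# At inference time, apply the filter to every recommendation request.
recs = runtime.get_recommendations(
    campaignArn="arn:aws:personalize:us-east-1:111122223333:campaign/placeholder",
    userId="user-123",
    filterArn=filter_arn,
)
print([item["itemId"] for item in recs["itemList"]])
```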

AWS Personalize items attributes

I'm trying to implement personalization and having problems with the Items schema.
Imagine I'm Amazon: I have products, their brands, and their categories. How should I include this information in the Items schema?
Should I include the brand name as a string categorical field? Should I rather include the brand ID as a string or as a number? Or should I include both?
What about categories? I have the same questions.
Metadata Fields
Metadata includes string or non-string fields that aren't required or don't use a reserved keyword. Metadata schemas have the following restrictions:
Users and Items schemas require at least one metadata field.
Users and Interactions datasets can contain up to five metadata fields. An Items dataset can contain up to 50 metadata fields.
If you add your own metadata field of type string, it must include the categorical attribute. Otherwise, Amazon Personalize won't use the field when training a model.
https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html
There are simply two ways to include your metadata in the Items/Users datasets:
If it can be represented as a number, provide the actual value if that makes sense.
If it can be represented as a string, provide the string value and make sure that categorical is set to true.
But let's take a look at "Why do they need me to categorize my string metadata?". The answer is pretty simple.
Let's start with an example.
If your Items were Amazon.com products and you wanted to provide a ratings metadata field, then:
You could take all of the ratings, including the full review text sent by customers, and put that in as the metadata field.
Or you could take just the star ratings, calculate the average, and put that in as the metadata field.
The second option generally makes more sense. Having random, long product reviews as metadata changes pretty much nothing. Personalize doesn't understand whether the review itself is good or bad, or whether the author also recommends another product, so it doesn't really add anything to the recommendations.
However, if you simply "cut" your dataset down and calculate the average rating, as in the second option, it makes a lot more sense. Maybe some of your customers like crappy products? Maybe they want to buy them because they are famous YouTubers and make videos about them? Based on their previous interactions and much more, Personalize will be able to perform slightly better, because now it knows that a given product has a rating of 5/5 or 3/5.
I wanted to show you that in some cases providing Items metadata as a free-form string makes no sense. That's why your string metadata must be categorical: it should be a finite set of values, so it adds some knowledge for Personalize about a given item and why some people might want to interact with it.
Going back to your question:
Should I include the brand name as a string categorical field? Should I rather include the brand ID as a string or as a number? Or should I include both?
I would simply go with the brand ID as a string. You could also go with the brand name, but a brand can be renamed while still being the same brand, so the ID is more stable. Also, two different brands could have the same name because they operate in different markets, so using the ID solves that as well.
The "categorical": true switch in your schema just tells Personalize:
Hey, do you see that string field? It's a categorized, finite set of values. If you train a model for me, please include this one during the training, it's important!
And as the documentation says, if you provide a string metadata field that is not marked as categorical, then Personalize will "think":
Hmm.. this field is a string, it has pretty random values, and it's not marked as categorical. It's probably just a leftover from an Items export job. Let's ignore it.
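To make that concrete, here's a sketch of an Items schema along those lines, registered with boto3; the field names (BRAND_ID, CATEGORY_ID, AVERAGE_RATING) and schema name are just illustrative choices:

```python
import json
import boto3

personalize = boto3.client("personalize")

# Brand and category stay strings but are marked categorical; the average
# rating goes in as a plain numeric field.
items_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "ITEM_ID", "type": "string"},
        {"name": "BRAND_ID", "type": "string", "categorical": True},
        {"name": "CATEGORY_ID", "type": "string", "categorical": True},
        {"name": "AVERAGE_RATING", "type": "float"},
    ],
    "version": "1.0",
}

personalize.create_schema(
    name="items-schema-with-categorical-brand",
    schema=json.dumps(items_schema),
)
```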

I have large file contents that I want to make searchable on AWS CloudSearch but the maximum document size is 1MB - how do I deal with this?

I could split the file contents up into separate search documents but then I would have to manually identify this in the results and only show one result to the user - otherwise it will look like there are 2 files that match their search when in fact there is only one.
Also the relevancy score would be incorrect. Any ideas?
So the response from AWS support was to split the files up into separate documents. In response to my concerns regarding relevancy scoring and multiple hits they said the following:
You do raise two very valid concerns for your more challenging use case. With regard to relevance, you already face a significant problem in that it is harder to establish a strong 'signal' and degrees of differentiation with large bodies of text. If your documents are much like reports or whitepapers, a potential workaround may be to index the first X characters (or the first identified paragraph) into a "thesis" field. This field could be weighted to better indicate the document's subject matter without manual review.
With regard to result duplication, this will require post-processing on your end if you wish to filter it. You can create a new field holding a unique "Parent" id that is shared by every chunk of the whole document. The post-processing can check whether this "Parent" id has already been returned (the first result should be seen as the most relevant), and if it has, filter out the subsequent results. What is doubly useful in such a scenario is that you can include a refinement link in your results that filters on all matches within that particular Parent id.
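A rough sketch of that split-and-dedupe flow (the chunk size, field names, and the shape of the search hits are assumptions for illustration, not CloudSearch specifics):

```python
import itertools

CHUNK_BYTES = 900 * 1024  # stay safely under the 1 MB per-document limit

def chunk_document(parent_id, title, text):
    """Yield child documents that all share a parent_id field."""
    data = text.encode("utf-8")
    for i in itertools.count():
        piece = data[i * CHUNK_BYTES:(i + 1) * CHUNK_BYTES]
        if not piece:
            break
        yield {
            "id": f"{parent_id}-part-{i}",
            "fields": {
                "parent_id": parent_id,
                "title": title,
                # The weighted "thesis" field suggested above, only on part 0.
                "thesis": text[:2000] if i == 0 else "",
                "content": piece.decode("utf-8", errors="ignore"),
            },
        }

def dedupe_hits(hits):
    """Keep only the first (most relevant) hit per parent document."""
    seen = set()
    for hit in hits:
        parent = hit["fields"]["parent_id"]
        if parent not in seen:
            seen.add(parent)
            yield hit
```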

Django Template. Use object with highest score in queryset

I have a page where I'm displaying the title of something. Users are allowed to provide their own suggested titles. The title I originally display is stored under the event's object. All suggestions are stored in their own table. Each suggestion has a vote count associated with it. I want to display the original title unless there is a suggestion that is above some threshold of votes, but if several are above that threshold, I want to display the highest one. How would I grab the highest-voted suggestion from the list, but only if it is above some arbitrary threshold, let's say 5?
Add an integer field for the number of votes to the suggestion model.
Then find the highest-voted suggestion and check whether it has more than 5 votes.
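A sketch of how that could look, assuming model names like Event and TitleSuggestion with a votes integer field (adjust to your actual models):

```python
# models.py
from django.db import models

VOTE_THRESHOLD = 5

class Event(models.Model):
    title = models.CharField(max_length=200)

    @property
    def display_title(self):
        # Highest-voted suggestion, but only if it's above the threshold.
        top = (
            self.suggestions
            .filter(votes__gt=VOTE_THRESHOLD)
            .order_by("-votes")
            .first()
        )
        return top.title if top else self.title

class TitleSuggestion(models.Model):
    event = models.ForeignKey(Event, related_name="suggestions",
                              on_delete=models.CASCADE)
    title = models.CharField(max_length=200)
    votes = models.IntegerField(default=0)
```

The template can then just use {{ event.display_title }} and never needs to know about the threshold.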

How do I find if any name in a list of names appears in a paragraph with ColdFusion?

Assume that I have a list of employee names from a database (thousands, potentially tens-of-thousands in the near future). To make the problem simpler assume that each firstname/lastname combination is unique (a big if, but a tangent).
I also have a RSS stream of news content that pertains to the business (again, could be in the hundreds of items per day).
What I would like to do is detect if an employee's name appears in the several-paragraph news item and, if so, 'tag' the item with the person it's talking about.
There may be more than one employee named in a single news item so breaking the loop after the first positive match isn't a possibility.
I can certainly brute force things: for every news item, loop over each and every employee name and if a regex expression returns a match, make note of it.
Is there a simpler way in ColdFusion or should I just get on with my nested loops?
Just throwing this out there as something you could do...
It sounds like you'll almost always have significantly more employee names than words per post. Here's how I might handle it:
Have an always-running CF app that pulls in the feeds and, in onApplicationStart:
Grab all employees from your db.
Create an app-scoped lookup struct with first names as keys and a struct of last names as values (you could also add a third tier of middle names, sibling to the last names, if desired).
So one key in the lookup might be "Vanessa", with a struct containing two keys ("Johnson" and "Forta") as its value.
Then, for each article you parse, just listToArray it with a space as the delimiter and loop through the array doing a simple structKeyExists check on each token. On a match, check the next item in the array as a last name.
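Here's the same idea sketched in Python rather than CFML (the employee names are made up); the CF version would build an application-scoped struct and call structKeyExists on each token exactly as described:

```python
from collections import defaultdict

employees = [("Vanessa", "Johnson"), ("Vanessa", "Forta"), ("Ben", "Nadel")]

# First names -> set of last names (the two-level "struct").
lookup = defaultdict(set)
for first, last in employees:
    lookup[first.lower()].add(last.lower())

def tag_employees(article_text):
    """Return the (first, last) pairs mentioned in the article."""
    tokens = [t.strip(".,;:!?\"'()") for t in article_text.split()]
    found = set()
    for token, nxt in zip(tokens, tokens[1:]):
        last_names = lookup.get(token.lower())
        if last_names and nxt.lower() in last_names:
            found.add((token, nxt))
    return found

print(tag_employees("Today Vanessa Forta and Ben Nadel announced a partnership."))
```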
I'd guess this would be much more performant, processing-wise, than running however many individual searches, and it would also take almost no time to code. You can feed in any future sources extremely simply (your checker takes one argument: any text on Earth).
Interested to see what route you go and whether your experiments expose anything new about performance in CF.
Matthew, you have a tall order there, and there are really multiple parts to the challenge/solution. But just in terms of comparing a list of values against a given chunk of text to see if any of them occur in it, you'll find that there's no single built-in CF function for that. Because of that, I created a new one, findList, available at CFLib:
http://cflib.org/index.cfm?event=page.udfbyid&udfid=1908
It's not perfect, nor as optimal as it could be, but it may be a useful first step for you, or give you some ideas. That said, it suited my need (determining whether a given blog comment referenced any of the blacklisted words). I show it comparing a list of URLs, but it could be any words at all. Hope that's a little helpful.
Another option worth exploring is leveraging the Solr engine that ships with CF now. It will do the string search heavy lifting for you and you can probably focus on dynamically keeping your collections up to date and optimized as new feed items come in.
Good luck!