Collaborative Filtering: Ways to determine implicit scores for products for each user? - data-mining

Having implemented an algorithm to recommend products with some success, I'm now looking at ways to calculate the initial input data for this algorithm.
My objective is to calculate a score for each product that a user has some sort of history with.
The data I am currently collecting:
User order history
Product pageview history for both anonymous and registered users
All of this data is timestamped.
What I'm looking for
There are a couple of things I'm looking for suggestions on, and ideally this question should be treated as a discussion rather than a hunt for a single 'right' answer.
Any additional data I can collect for a user that can directly imply an interest in a product
Algorithms/equations for turning this data into scores for each product
What I'm NOT looking for
Just to avoid this question being derailed with the wrong kind of answers, here is what I'm doing once I have this data for each user:
Generating a number of user clusters (21 at the moment) using the k-means clustering algorithm, with Pearson's coefficient as the distance score
For each user (on demand), calculating a graph of similar users by looking for their most and least similar users within their cluster, and repeating to an arbitrary depth.
Calculating a score for each product based on the preferences of other users within the user's graph
Sorting the scores to return a list of recommendations
Basically, I'm not looking for ideas on what to do once I have the input data (I may need further help with that later, but it's not the point of this question), just for ideas on how to generate this input data in the first place.
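(For reference, the clustering step described above relies on a Pearson-based distance. A minimal sketch of that metric, assuming a dense user x product score matrix; the numbers are made up:)

    import numpy as np
    from scipy.stats import pearsonr

    def pearson_distance(u, v):
        """Distance for clustering: 1 - Pearson correlation, so 0 means identical taste."""
        r, _ = pearsonr(u, v)
        return 1.0 - r

    # Toy user x product score matrix: rows are users, columns are products.
    scores = np.array([
        [5.0, 3.0, 0.0, 1.0],
        [4.0, 0.0, 0.0, 1.0],
        [1.0, 1.0, 0.0, 5.0],
    ])
    print(pearson_distance(scores[0], scores[1]))  # small -> similar users
    print(pearson_distance(scores[0], scores[2]))  # large -> dissimilar users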

Here's a haymaker of a response:
time spent looking at a product
semantic interpretation of comments left about the product
make a discussion page about a product, brand, or product category and semantically interpret the comments
whether they shared a product page (email, del.icio.us, etc.)
browser (mobile might make them spend less time on the page vis-à-vis laptop while indicating great interest) and connection speed (affects the amount of time spent on the page)
facebook profile similarity
heatmap data (e.g. à la kissmetrics)
What kind of products are you selling? That might help us answer you better. (Since this is an old question, I am addressing both @Andrew Ingram and anyone else who has the same question and found this thread through search.)
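On the first suggestion above, raw time-on-page needs taming before it becomes a score: a tab left open for an hour shouldn't count as extreme interest. A minimal sketch, assuming a log scale and an arbitrary cap (both are assumptions to tune against your own data):

    import math

    def dwell_time_score(seconds, cap=300.0):
        """Map time-on-page to a 0..1 interest signal.

        Log scaling damps outliers (e.g. an abandoned tab); the 300 s cap
        is an arbitrary assumption, not a recommendation.
        """
        clipped = min(max(seconds, 0.0), cap)
        return math.log1p(clipped) / math.log1p(cap)

    print(dwell_time_score(10))   # quick glance   -> ~0.42
    print(dwell_time_score(120))  # engaged visit  -> ~0.84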

You can allow users to explicitly state their preferences, the way Netflix allows users to assign stars.
You can assign a positive numeric value for all the stuff they bought, since you say you do have their purchase history. Assign zero for stuff they didn't buy.
You could use some sort of weighted value for stuff they bought, adjusted for what's popular (if nearly everybody bought a product, it doesn't tell you much about a person that they also bought it). See "term frequency–inverse document frequency".
You could also assign some lesser numeric value for items that users looked at but did not buy.
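Putting those last three ideas together, here is a minimal sketch of such a scoring function (the user/product IDs and all the weights are made-up assumptions):

    import math

    # Toy interaction logs -- all IDs and weights below are made up.
    purchases = {("u1", "p1"), ("u1", "p2"), ("u2", "p1"), ("u3", "p1")}
    views     = {("u2", "p2"), ("u3", "p3")}
    n_users   = 3

    # How many distinct users bought each product.
    bought_by = {}
    for _, product in purchases:
        bought_by[product] = bought_by.get(product, 0) + 1

    def implicit_score(user, product):
        """Purchase beats view; rare purchases count more (an IDF-style weight)."""
        if (user, product) in purchases:
            # Popular products carry less information about individual taste.
            return 1.0 + math.log(n_users / bought_by[product])
        if (user, product) in views:
            return 0.3  # arbitrary lesser weight for a pageview
        return 0.0

    print(implicit_score("u1", "p2"))  # rare purchase    -> ~2.10
    print(implicit_score("u3", "p1"))  # popular purchase -> 1.00
    print(implicit_score("u2", "p2"))  # view only        -> 0.30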

Related

Choose the appropriate way to deal with weights in svyset in Stata

I decided to post here a request for support that I put on Statalist yesterday. I have not yet received any hints and thought it could be useful to extend the audience by posting it here.
The link to the original post is the following:
https://www.statalist.org/forums/forum/general-stata-discussion/general/1659627-choose-the-appropriate-way-to-deal-with-weights-in-svyset?view=thread
Dear Members,
I designed a questionnaire to gather respondents' willingness to get vaccinated against COVID-19 via a discrete choice experiment. I relied on a company specialized in political opinion polls and market research to administer the survey. The company computed a weight for each respondent based on 1) the geographical location where the respondent lives (five macro-areas of Italy), 2) whether or not the respondent has a bachelor's degree, and 3) which age group he/she belongs to (five classes are considered).
The sum of the weights is equal to the number of individuals in the database. The individuals belonging to the age classes 30-39 and 40-49 are oversampled, as per our request (related to a research hypothesis): the proportion of these two classes within the sample is larger than their actual proportion in the Italian population. The weights are computed to account for this feature and guarantee that the sample is representative of the characteristics of the Italian population.
I will use the data to estimate a logit model, multinomial logit models and mixed logit models.
The issue I am facing is the proper way to declare the nature of the weights. I have no experience in using Stata to deal with this issue.
I am using Stata 17 on a PC with Windows 10 Pro 64 bit.
Combining the information from the video, the svyset manual, and the results from the help for "weight", I tried to work out the most appropriate solution.
I tried to add the code here multiple times as well, but I kept receiving an error message about how I formatted it. My apologies.

How can I restrict the output of an Amazon Machine Learning model? (Predicting cricket team results)

I am trying to predict the match winner based on the historical data set shown below.
The data set comprises IPL seasons, and Team_Name_id vs Opponent_Team are the team names in the IPL. I have set the match id as the Row id and created the model. When running real-time testing, the result is not as expected (shown below).
Target is set as Match_winner_id.
Am I missing any configurations? Please help
The model is working perfectly correctly. There are just two problems:
Your input data is not very good
There's no way for the model to know that only one of those two teams should win
Data Quality
A predictive model needs good quality input data on which to reverse-engineer a model that explains a given result. This input data should contain information that can be used to predict a result given a different set of input data.
For example, when predicting house prices, it would need to know the suburb (category), number of bedrooms/bathrooms/parking spaces, age of the building and selling price. It could then predict the selling price for other houses with a slightly different mix of variables.
However, based on your screenshot, you are giving the following information (and probably more) on which to make your prediction:
Teams: Not great, because you are separating Column C and Column D. The model will assume they are unrelated information. It doesn't realise that those two values could be swapped.
Match date: Useless information unless the outcome varies in proportion to time (eg a team continually gets better)
Season: As with Match Date, this is probably useless because you're always predicting the future -- you won't be predicting for a past season
Venue: Only relevant if a particular team always wins at a given venue
Toss Decision: Would this really influence the outcome? Also, it's only known once the game begins, so not great for predicting a future game.
Win Type: You won't know the win type until a game is over, so it's not suitable for predicting a future game.
Score: Again, not known until the actual game, so no good for future predictions.
Man of the Match: Not known for future games.
Umpire: How does an umpire influence the result of a game?
City: Yes, given that home teams often have an advantage.
You have provided very little information that could be used to predict a future game. There is really only the teams and the venue. Everything else is either part of the game itself or irrelevant.
Picking only one of the two teams
When the ML model looks at your data and tries to make a prediction, it will look at all the data you have provided. For example, it might notice that for a given venue and season, Team 8 has a higher propensity to win. Therefore, given that venue and season, it will favour a win by Team 8. The model has no concept that the only possible outcome is one of the two teams given in columns C and D.
You are predicting for two given teams and you are listing the teams in either Column C or Column D and this makes no sense -- the result is the same if you swapped the teams between columns, but the model has no concept of this. Also, information about Team 1 vs Team 2 is totally irrelevant for Team 3 vs Team 4.
What you should do is create one dataset per team, listing all their matches, plus a column that shows the outcome: either a boolean (Win/Lose) or a value that represents the number of runs by which they won (where negative is a loss). You would then ask the model to predict the result for that team, given the input data, which would be win/lose or points above/below the other team.
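A minimal sketch of that reshaping with pandas (the column names are assumptions based on the screenshot described above):

    import pandas as pd

    # Toy match-level data in the original "team vs opponent" layout.
    # Column names are assumptions, not the actual dataset schema.
    matches = pd.DataFrame({
        "Team_Name_id":     [1, 3, 1],
        "Opponent_Team_id": [2, 4, 3],
        "Venue":            ["A", "B", "A"],
        "Match_winner_id":  [1, 4, 3],
    })

    # One row per (team, match): view each match from both teams' perspectives.
    home = matches.rename(columns={"Team_Name_id": "team", "Opponent_Team_id": "opponent"})
    away = matches.rename(columns={"Opponent_Team_id": "team", "Team_Name_id": "opponent"})
    per_team = pd.concat([home, away], ignore_index=True)

    # Boolean target: did this team win this match?
    per_team["won"] = per_team["team"] == per_team["Match_winner_id"]
    per_team = per_team.drop(columns="Match_winner_id")
    print(per_team)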
But at the core, I think that your input data doesn't have enough rich content to be able to make a sensible prediction. Just ask yourself: "What data would I like to know if I were to guess which team would win?" It would probably be past results, weather conditions, which players were on each team, how many matches they played in the last week, etc. None of this information is being provided as input on each line of your input data.

Understanding "Real world modelling" program

A few days ago I got a new project to do, related to a "real world modelling" program.
Here's what it looks like:
A visit to a psychologist (use a queue). Experts provide a psychologist's advice; some of them (n) form therapeutic groups of k people (GrT - duration of group therapy in hours), while other experts (m) take individual patients (InT - duration of individual therapy in hours). Each newly arrived patient (a new patient appears with probability p1; recurring patients come back after a period of time h) can choose to go to a psychologist providing individual therapy, or to group therapy. If a group therapy session is full, patients wishing to participate in group sessions must wait. Recurring patients wishing to go to group sessions can start a session with a smaller group, but can't attend the same session as newly arrived patients. It has been observed that patients who take individual therapy recover faster than those who choose group sessions (they need fewer sessions), but there are exceptions: due to the social interaction factor, some patients (probability p2) recover h percent faster than they would in individual therapy. An individual session costs InC, a group session GrC. You need to assess which therapeutic approach a patient should choose to make the best use of their resources, and how many specialists, and of what kind, a health care facility should hire.
Here's my approach to this problem:
Read a text file containing names, surnames, and the money each patient is willing to spend, and place everything in a queue structure.
Find which group is better for a patient by generating a random number for the p2 probability; using it, we'll find out whether the patient recovers faster in individual or group therapy. IMO the factor precedence here is: money (checking whether the patient can afford individual therapy sessions) > p2 (whether the patient should take group sessions if that's better for him).
By looking at how many patients there are in the queue, we can find out how many psychologists we'll need. (Are there any other factors here? What if we are short of experts?)
Problems that I can't understand: how do I implement the p1 probability of a new patient appearing if I write every patient into a text file and put them in a queue? How many therapy sessions does it take for a patient to recover (a static number?)?
Am I missing something? Basically it's an open question and there may be no right answer. If anyone has suggestions on how to make this program better, I'd be glad to hear them!
Programming language I'm using: C++
If you want to break up a task, analyse it and prepare it for coding, you could:
First, make a block diagram representing program flow control.
Then follow with a pseudocode implementation.
P.S. Update the question following the above, and when you reach the "code stage" there will definitely be more help.
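On the p1 question specifically: one common approach is a Bernoulli trial per simulated time step. A minimal sketch (in Python rather than the asker's C++, but the logic ports directly; all constants are made-up assumptions):

    import random
    from collections import deque

    P1 = 0.3          # assumed probability that a new patient appears in a time step
    SIM_STEPS = 100   # assumed length of the simulated period
    random.seed(42)   # reproducible toy run

    queue = deque()   # waiting patients (the "queue structure" from the task)

    for step in range(SIM_STEPS):
        # Bernoulli trial each step: a new patient arrives with probability p1.
        if random.random() < P1:
            queue.append({"arrived": step, "recurring": False})
        # ... serve patients here, schedule recurring returns after h steps, etc.

    print(f"{len(queue)} patients arrived over {SIM_STEPS} steps")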

How to classify a small and peculiar subset out of a large database?

I have to perform a data mining task on a database containing information about insurance policies. Each tuple holds data about a single policy, along with information regarding the agency that issued it, the customer it refers to, and other fields. It is like a product of the hypothetical tables Policies, Customers and Agencies. The fields are the following:
Policy Type, ID Number, Policy Status, Product Description, Product Combinations, Issue Date, Effective Date, Maturity Date, Policy Duration, Loan Duration, Cancellation Date, Reason for cancellation, Total Premium, Splitter Premium, ID Partners, ID Agency, Country Agency, ID Zone, Agency potential, Sex Contractor, Birth Year Contractor, Job Contractor, Sex Insured, Job Insured, Birth Year Insured, Product Area, Legal Form, ID Claim, Year Claim, Status Claim, Provision Claim, Payments Claim
This is an academic task, and our professor wants us to identify churn rates, cross-selling and up-selling. I am not quite into the field, and therefore I looked those terms up on Wikipedia. I started with churn rate, and it appears to me that in this case I have to characterize the properties of customers whose Policy Status is set to "canceled" and whose Reason for cancellation is "customer cancellation".
With RapidMiner, I tried to apply decision trees and rule mining, but the subset of interest is so small that the output model, despite having good accuracy overall, has very poor accuracy in predicting canceled policies. This happens because the subset of canceled policies is really small. I also tried to apply the MetaCost operator with a cost matrix in which the cost of misclassifying canceled policies is outrageously high compared to the others (like a million times higher), but this did not change the result at all.
My best option now is to use the sequential covering algorithm for rule mining, but RapidMiner does not implement it and I would have to code it manually.
Do you have any suggestion on how to build a good model for that small subset of canceled policies, so that we could use it to identify customers that would potentially cancel their policy in the future?
N.B.: since it comes from a real source, albeit anonymized, I cannot disclose the database or any data contained within.
Did you try Naive Bayes? It works well with small sets of data. You can also try a variant of it, like AODE. AODE is not available in RapidMiner; you should install the Weka extension to access it.
You need to balance your dataset, so that the classes (cancelled / not cancelled) are the same size. This means (temporarily) discarding lots of data.
You can use the Sample operator with the Balance Labels checkbox to do this.
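Outside RapidMiner, the same balance-then-classify idea looks roughly like this in scikit-learn (the data here is synthetic, purely to show the mechanics):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)

    # Synthetic stand-in for the policy table: 1000 rows, ~3% canceled.
    X = rng.normal(size=(1000, 5))
    y = (rng.random(1000) < 0.03).astype(int)   # 1 = canceled

    # Downsample the majority class so both classes are the same size
    # (what the Sample operator with Balance Labels does in RapidMiner).
    pos = np.flatnonzero(y == 1)
    neg = rng.choice(np.flatnonzero(y == 0), size=len(pos), replace=False)
    idx = np.concatenate([pos, neg])
    X_bal, y_bal = X[idx], y[idx]

    X_tr, X_te, y_tr, y_te = train_test_split(X_bal, y_bal, random_state=0)
    model = GaussianNB().fit(X_tr, y_tr)
    print("balanced-set accuracy:", model.score(X_te, y_te))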

How does data mining actually work?

Suppose I want to do some data mining on the database of a supermarket. What does that actually mean?
1) What will the output/results be like?
2) Will the output be different every day or change over time?
3) Before applying data mining, do I need to know what I want or will data mining give everything I want automatically?
Data Mining is a general category of techniques that can be applied to different kinds of datasets, just like programming is a general category of techniques that can be applied using different languages to do different things.
None of your questions make any sense.
A1: Data mining will give you accurate reports about your queries on the supermarket's database.
A2: Yes, because data mining depends on analysis over time; in this case it depends on the problems or goals that you want to reach. If your database is very big and you have built the data warehouse in the right way, you will get different output over time.
A3: Yes, you should determine which problems you have to mine for, then use data mining tools to get the results or indicators automatically.
To answer your first question: for the case of supermarket customer data, I could imagine the following questions:
How many of product X are usually sold on Fridays?
(helps you to determine how many X you should have in stock)
Which customers bought product X often in the last month/year?
Useful when you introduce a new X-like product: send advertising material (which has a given cost) only to those customers.
Given that a customer buys product X (e.g. beer), what's the probability that he/she also buys product Y (e.g. chips)?
Useful for the following: make sure X and Y are never on promotional offer at the same time (X and Y are bought together often). Get the customers into the store by offering a rebate on X, knowing they'll also buy Y at the same time. Or: put a high-priced X-like product right next to Y, putting the cheaper X somewhere else.
Which neighborhoods have the smallest number of customers?
(helps to find out which neighborhoods you could target with advertising to bring more customers into the store)
Often, by 'asking certain questions to the data' one discovers some features and comes up with new questions.
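To make the beer/chips example concrete, here is a minimal market-basket sketch computing support and confidence for one rule (the baskets are made up):

    from collections import Counter
    from itertools import combinations

    # Toy baskets -- each set is one customer's checkout (made-up data).
    baskets = [
        {"beer", "chips"}, {"beer", "chips", "salsa"},
        {"beer"}, {"chips", "salsa"}, {"beer", "chips"},
    ]

    pair_counts = Counter()
    item_counts = Counter()
    for basket in baskets:
        item_counts.update(basket)
        pair_counts.update(combinations(sorted(basket), 2))

    # Rule "beer -> chips": support = P(beer and chips), confidence = P(chips | beer).
    support = pair_counts[("beer", "chips")] / len(baskets)
    confidence = pair_counts[("beer", "chips")] / item_counts["beer"]
    print(f"support={support:.2f}  confidence={confidence:.2f}")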
Data mining is a set of techniques. It refers to discovering interesting and unexpected patterns in data.
If you want to apply some data mining techniques, you need to know which one and you should know why. The answer to questions 1, 2 and 3 depends on the techniques that you choose.
For example, if I want to find associations between items sold in a supermarket, I may use association rule mining. If I want to find groups of similar customers, I might use a clustering algorithm, etc.
There is not just ONE technique in data mining.