user profiling and outlier detection - data-mining

I have a dataset including 1 million customers. They are splitted into some categories like electronics customers,food and Beverage customers etc. Group names present customers' profiles.
each customer has different behaviours. For instance suppose that an electronic customer buys one electronic devices at least when he goes shopping. This transaction repeats randomly or continuously. So that I present each transaction by numerical codes.
(Value of transaction, volume of trans., transaction type, etc..) = (100,200,1)
for each transaction I have this vector above.it means every customer has a different trade behaviour.
I want to find out whether each customer has a pattern? Do we have outliers?
it is a profiling problem basically.
which analysis do you recommend?

Can you be more specific? What are you trying to get out of the analysis exactly? Buying patterns, customers that are outliers, purchases that are outliers?
If you want to determine which items are bought together, group the transactions together, just listing the items purchased at the same time and do shopping basket analysis, using the apriori algorithm or similar.
If you want to find similar customers, using k nearest neighbor or k means against a vector representing a customer's buy patterns (probably just the items bought). You can do this on individual transactions also to compare transactions.
To determine outliers, you can use a density based clustering algorithm (e.g. DBSCAN) to cluster customers together that are close to one another, and look at those customers that are not in clusters to determine outliers also.

Related

How can I restrict the output of an Amazon Machine Learning model? (Predicting cricket team results)

I am trying to predict match winner based on the historical data set as shown below,
The data set comprises of IPL seasons and Team_Name_id vs Opponent Team are the team names in IPL. I have set the match id as Row id and created the model. When running realtime testing, the result is not as expected (shown below)
Target is set as Match_winner_id.
Am I missing any configurations? Please help
The model is working perfectly correctly. There's just two problems:
Your input data is not very good
There's no way for the model to know that only one of those two teams should win
Data Quality
A predictive model needs good quality input data on which to reverse-engineer a model that explains a given result. This input data should contain information that can be used to predict a result given a different set of input data.
For example, when predicting house prices, it would need to know the suburb (category), number of bedrooms/bathrooms/parking spaces, age of the building and selling price. It could then predict the selling price for other houses with a slightly different mix of variables.
However, based on your screenshot, you are giving the following information (and probably more) on which to make your prediction:
Teams: Not great, because you are separating Column C and Column D. The model will assume they are unrelated information. It doesn't realise that those two values could be swapped.
Match date: Useless information unless the outcome varies in proportion to time (eg a team continually gets better)
Season: As with Match Date, this is probably useless because you're always predicting the future -- you won't be predicting for a past season
Venue: Only relevant if a particular team always wins at a given venue
Toss Decision: Would this really influence the outcome? Also, it's only known once the game begins, so not great for predicting a future game.
Win Type: You won't know the win type until a game is over, so it's not suitable for predicting a future game.
Score: Again, not known until the actual game, so no good for future predictions.
Man of the Match: Not known for future games.
Umpire: How does an umpire influence the result of a game?
City: Yes, given that home teams often have an advantage.
You have provided very little information that could be used to predict a future game. There is really only the teams and the venue. Everything else is either part of the game itself or irrelevant.
Picking only one of the two teams
When the ML model looks at your data and tries to make a prediction, it will look at all the data you have provided. For example, it might notice that for a given venue and season, Team 8 has a higher propensity to win. Therefore, given that venue and season, it will favour a win by Team 8. The model has no concept that the only possible outcome is one of the two teams given in columns C and D.
You are predicting for two given teams and you are listing the teams in either Column C or Column D and this makes no sense -- the result is the same if you swapped the teams between columns, but the model has no concept of this. Also, information about Team 1 vs Team 2 is totally irrelevant for Team 3 vs Team 4.
What you should do is create one dataset per team, listing all their matches, plus a column that shows the outcome -- either a boolean (Win/Lose) or a value that represents the number of runs by which they won (where negative is a loss). You would then ask them model to predict the result for that team, given the input data, which would be win/lose or a points above/below the other team.
But at the core, I think that your input data doesn't have enough rich content to be able to make a sensible prediction. Just ask yourself: "What data would I like to know if I were to guess which team would win?" It would probably be past results, weather conditions, which players were on each team, how many matches they played in the last week, etc. None of this information is being provided as input on each line of your input data.

Using an element in a data set to drive If/Then logic decision

I'm working with our IT group to develop an optimizer for logistics operations. The basic design is that it will look at shipments, run a search for additional shipments originating with in XX miles of the previous shipments destination, and link them together in a loop. It will continue to do this until it hits a user defined set of shipment legs where the loop ends at or close to 1st shipment origin.
The issue we are facing is that the materials we ship are chemicals, which can have interactions if placed in a tank that contained XX chemical before it. The obvious solution is to use a different tank or wash it out, but we also need it to compute solutions prior to that.
My problem is, currently, there is no way on the market to do that prior product optimization.
The question is: Is there some kind of logic table function I can write that will allow the optimizer to see an element in the data set (say, Product Family of 1) that will pull from a product database containing predefined product families (i.e. PF 1 = Chemicals A1-B7, PF 2 = Chemical B8-J8, etc.) and then ping off of a logic table that defines a do not ship with list (i.e. PF 1 cannot ship if PF 2 was on the previous leg.

How to use Amazon MWS to indicate two different shipping times on items?

I have a bit of a unique problem here. I currently have two warehouses that I ship items out of for selling on Amazon, my primary warehouse and my secondary warehouse. Shipping out of the secondary warehouse takes significantly longer than shipping from the main warehouse, hence why it is referred to as the "secondary" warehouse.
Some of our inventory is split between the two warehouses. Usually this is not an issue, but we keep having a particular issue. Allow me to explain:
Let's say that I have 10 red cups in the main warehouse, and an additional 300 in the secondary warehouse. Let's also say it's Christmas time, so I have all 310 listed. However, from what I've seen, Amazon only allows one shipping time to be listed for the inventory, so the entire 310 get listed as under the primary warehouse's shipping time (2 days) and doesn't account for the secondary warehouse's ship time, rather than split the way that they should be, 10 at 2 days and 300 at 15 days.
The problem comes in when someone orders an amount that would have to be split across the two warehouses, such as if someone were to order 12 of said red cups. The first 10 would come out of the primary warehouse, and the remaining two would come out of the secondary warehouse. Due to the secondary warehouse's shipping time, the remaining two cups would have to be shipped out at a significantly different date, but Amazon marks the entire order as needing to be shipped within those two days.
For a variety of reasons, it is not practical to keep all of one product in one warehouse, nor is it practical to increase the secondary warehouse's shipping time. Changing the overall shipping date for the product to the longest ship time causes us to lose the buy box for the listing, which really defeats the purpose of us trying to sell it.
So my question is this: is there some way in MWS to indicate that the inventory is split up in terms of shipping times? If so, how?
Any assistance in this matter would be appreciated.
Short answer: No.
There is no way to specify two values for FulfillmentLatency, in the same way as there is no way to specify two values for Quantity in stock. You can only ever have one inventory with them (plus FBA stock)
Longer answer: You could.
Sign up twice with Amazon:
"MySellerName" has an inventory of 10 and a fulfillment latency of 2 days
"MySellerName Overseas Warehouse" has an inventory of 300 and a fulfillment latency of 30 days
I haven't tried by I believe Amazon will then automatically direct the customer to the best seller for them, which should be "MySellerName" for small orders and "MySellerName Overseas Warehouse" for larger quantities.

Data Mining and Frequent Datasets

I've been doing some work for my exams in a few days and I'm going through some past papers but unfortunately there are no corresponding answers. I've answered the question and I was wondering if someone could tell me if I am correct.
My question is
(c) A transactional dataset, T, is given below:
t1: Milk, Chicken, Beer
t2: Chicken, Cheese
t3: Cheese, Boots
t4: Cheese, Chicken, Beer,
t5: Chicken, Beer, Clothes, Cheese, Milk
t6: Clothes, Beer, Milk
t7: Beer, Milk, Clothes
Assume that minimum support is 0.5 (minsup = 0.5).
(i) Find all frequent itemsets.
Here is how I worked it out:
Item : Amount
Milk : 4
Chicken : 4
Beer : 5
Cheese : 4
Boots : 1
Clothes : 3
Now because the minsup is 0.5 you eliminate boots and clothes and make a combo of the remaining giving:
{items} : Amount
{Milk, Chicken} : 2
{Milk, Beer} : 4
{Milk, Cheese} : 1
{Chicken, Beer} : 3
{Chicken, Cheese} : 3
{Beer, Cheese} : 2
Which leaves milk and beer as the only frequent item set then as it is the only one above the minsup?
I agree you should go for the Apriori Algorithm.
The Apriori algorithm is based on the idea that for a pair o items to be frequent, each individual item should also be frequent.
If the hamburguer-ketchup pair is frequent, the hamburger itself must also appear frequently in the baskets. The same can be said about the ketchup.
So for the algorithm, it is established a "threshold X" to define what is or it is not frequent. If an item appears more than X times, it is considered frequent.
The first step of the algorithm is to pass for each item in each basket, and calculate their frequency (count how many time it appears).
This can be done with a hash of size N, where the position y of the hash, refers to the frequency of Y.
If item y has a frequency greater than X, it is said to be frequent.
In the second step of the algorithm, we iterate through the items again, computing the frequency of pairs in the baskets. The catch is that
we compute only for items that are individually frequent. So if item y and item z are frequent on itselves,
we then compute the frequency of the pair. This condition greatly reduces the pairs to compute, and the amount of memory taken.
Once this is calculated, the frequencies greater than the threshold are said frequent itemset.
(http://girlincomputerscience.blogspot.com.br/2013/01/frequent-itemset-problem-for-mapreduce.html)
There are two ways to solve the problem:
using Apriori algorithm
Using FP counting
Assuming that you are using Apriori, the answer you got is correct.
The algorithm is simple:
First you count frequent 1-item sets and exclude the item-sets below minimum support.
Then count frequent 2-item sets by combining frequent items from previous iteration and exclude the item-sets below support threshold.
The algorithm can go on until no item-sets are greater than threshold.
In the problem given to you, you only get 1 set of 2 items greater than threshold so you can't move further.
There is a solved example of further steps on Wikipedia here.
You can refer "Data Mining Concepts and Techniques" by Han and Kamber for more examples.
OK to start, you must first understand, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cuts costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
Now, the amount of raw data stored in corporate databases is exploding. From trillions of point-of-sale transactions and credit card purchases to pixel-by-pixel images of galaxies, databases are now measured in gigabytes and terabytes. (One terabyte = one trillion bytes. A terabyte is equivalent to about 2 million books!) For instance, every day, Wal-Mart uploads 20 million point-of-sale transactions to an A&T massively parallel system with 483 processors running a centralized database. Raw data by itself, however, does not provide much information. In today's fiercely competitive business environment, companies need to rapidly turn these terabytes of raw data into significant insights into their customers and markets to guide their marketing, investment, and management strategies.
Now you must understand that association rule mining is an important model in data mining. Its mining algorithms discover all item associations (or rules) in the data that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf) constraints. Minsup controls the minimum number of data cases that a rule must cover. Minconf controls the predictive strength of the rule. Since only one minsup is used for the whole database, the model implicitly assumes that all items in the data are of the same nature and/or have similar frequencies in the data. This is, however, seldom the case in real- life applications. In many applications, some items appear very frequently in the data, while others rarely appear. If minsup is set too high, those rules that involve rare items will not be found. To find rules that involve both frequent and rare items, minsup has to be set very low. This may cause combinatorial explosion because those frequent items will be associated with one another in all possible ways. This dilemma is called the rare item problem. This paper proposes a novel technique to solve this problem. The technique allows the user to specify multiple minimum supports to reflect the natures of the items and their varied frequencies in the database. In rule mining, different rules may need to satisfy different minimum supports depending on what items are in the rules.
Given a set of transactions T (the database), the problem of mining association rules is to discover all association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf).
I hope that once you understand the very basics of data mining that the answer to this question shall become apparent.

How does data mining actually work?

Suppose I want to do some data mining on the database of a supermarket. What does that actually mean?
1) What will the output/results be like?
2) Will the output be different every day or change over time?
3) Before applying data mining, do I need to know what I want or will data mining give everything I want automatically?
Data Mining is a general category of techniques that can be applied to different kinds of datasets, just like programming is a general category of techniques that can be applied using different languages to do different things.
None of your questions make any sense.
A1: Data mining will give us an accurate reports about your queries of database of supermarket.
A2: Sure, because Data mining depend on analyzing during time, in this case it depend on your problems or goals that you want to reach it. if your database was very big also you built data warehouse in right way you will get the different output over time.
A3: yes you should determine what are the problems you have to mine then use tools of Data mining to get the results or indicators automatically.
To answer your first question: For the case of supermarket customer data, I could image the following questions:
how many products X are usually sold on Fridays ?
(helps you to determine how many X you should have in stock)
which customers bought product X often in the last month/year ?
Useful when when you introduce a new X-like product: send advertising material (which has a given cost) only to those customers.
given a customer buys product X (e.g. beer) what's the probability that he/she also buys product Y (e.g. chips) ?
useful for the following: make sure X and Y never are on promotional offer at the same time (X and Y are bought together often). Get the customers into the store by offering a rebate on X knowing they'll also by Y at the same time. Or: put a high price X-like product right next to Y, putting the cheaper X somewhere else.
which neighborhoods have the smallest number of customers ?
helps to find out which neighborhoods you could target with advertising to bring more customers into the store.
Often, by 'asking certain questions to the data' one discovers some features and comes up with new questions.
Data mining is a set of techniques. It refers to discovering interesting and unexpected patterns in data.
If you want to apply some data mining techniques, you need to know which one and you should know why. The answer to questions 1, 2 and 3 depends on the techniques that you choose.
For example, if i want to find associations between items sold in a supermarket, i may use association rule mining. If i want to find groups of similar customers, I might use a clustering algorithm. etc.
There is not just ONE technique in data mining.