Quantity in market basket analysis

In market basket analysis, does the quantity of an item affect the result? For example:
Example 1:
T1 = A, B, C
T2 = A, A, B
T3 = A, B, B, C
Example 2:
T1 = A, B, C
T2 = A, B
T3 = A, B, C
Is the result the same for the two examples or not?

It depends on your data encoding.
If you encode the itemsets as classic sets, then quantities are ignored and both examples give the same result.
If you encode repeated items as a1, a2, a3, ... then you can capture quantities.
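A minimal base-R sketch of the second encoding (the helper name encode_qty is mine): repeated items are renamed into numbered pseudo-items, so T2 = A, A, B becomes {A_1, A_2, B_1} and the second A is no longer lost when the transaction is treated as a set.

# Hypothetical helper: turn repeated items into numbered pseudo-items
encode_qty <- function(items) {
  idx <- ave(seq_along(items), items, FUN = seq_along)  # running count per item
  paste0(items, "_", idx)
}

encode_qty(c("A", "B", "C"))        # "A_1" "B_1" "C_1"
encode_qty(c("A", "A", "B"))        # "A_1" "A_2" "B_1"  -> the second A is kept
encode_qty(c("A", "B", "B", "C"))   # "A_1" "B_1" "B_2" "C_1"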

Related

Weka J48 output

I am confused about the numbers at the end of the branches of a J48 tree. For example, using the weather.nominal data, the tree looks the same whether the Test options are set to Use training set, Cross-validation, or Percentage split.
This is the output:
J48 pruned tree
------------------
outlook = sunny
| humidity = high: no (3.0)
| humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = TRUE: no (2.0)
| windy = FALSE: yes (3.0)
According to the textbook by the authors of this software, in an example using this exact data they say, "In the tree structure, a colon introduces the class label that has been assigned to a particular leaf, followed by the number of instances that reach that leaf, expressed as a decimal number because of the way the algorithm uses fractional instances to handle missing values. If there were incorrectly classified instances (there aren’t in this example) their number would appear, too: thus 2.0/1.0 means that two instances reached that leaf, of which one is classified incorrectly"
So this means that no instances were incorrectly classified in the above tree with the weather.nominal dataset.
On the other hand, when the Test options are set to either 'Use training set' or 'Percentage split' (with the default random seed), there are incorrectly classified instances. For example, with a 60% percentage split, it shows the following:
=== Evaluation on test split ===
=== Summary ===
Correctly Classified Instances 2 40 %
Incorrectly Classified Instances 3 60 %
There seems to be a contradiction here but I must be missing something. Is the tree shown initially not the tree that is built with the 60 percentage split?
That is not stated anywhere as far as I have seen but I can't think of any other explanation.
Just for completeness, the data is here:
outlook,temperature,humidity,windy,play
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
If you take a closer look at the output, you will see the following:
=== Classifier model (full training set) ===
The model that is being depicted there is the model that was trained on the full dataset, not your split.
The next section has the following heading:
=== Evaluation on test split ===
The statistics that you are referring to are based on a model trained and evaluated on your dataset split.
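To make the distinction concrete, here is a small sketch using the RWeka package in R (my choice of tool; the random split below only approximates Weka's percentage split): the printed tree comes from a model fit on all 14 rows, while the split statistics come from a second model trained on the training portion only and scored on the held-out rows.

library(RWeka)   # R interface to Weka's J48 (needs rJava)

weather <- read.csv(text = "
outlook,temperature,humidity,windy,play
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
", stringsAsFactors = TRUE)
weather$windy <- factor(weather$windy)   # J48 expects nominal attributes

# 1) The tree Weka prints under "Classifier model (full training set)":
#    fitted on ALL 14 instances
full_model <- J48(play ~ ., data = weather)
print(full_model)

# 2) A percentage-split style evaluation: train on ~60%, score the rest
set.seed(1)
idx <- sample(nrow(weather), round(0.6 * nrow(weather)))
split_model <- J48(play ~ ., data = weather[idx, ])
pred <- predict(split_model, newdata = weather[-idx, ])
table(predicted = pred, actual = weather$play[-idx])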

decision trees using R, rpart, fragile families

So, I am using the Fragile Families Challenge for my dataset to see which individual- and family-level predictors predict adolescent academic performance (measured by GPA). Information about my dataset:
FFCWS is a longitudinal panel study in which baseline interviews were conducted in 1998-2000 with both the mothers and the fathers. Follow-up interviews were conducted when the children were aged 1, 3, 5, 9, and 15. Interviews with the parent, primary caregiver(s), teachers, and children were conducted either in-home or via telephone (FFCWS, 2021). In the 15th year, children/adolescents are asked to report their grades in four subjects - history, mathematics, English, and science. These grades are averaged for each student to measure their individual academic performance at age 15. A series of individual-level and family-level predictors that are known to impact academic performance, as mentioned earlier, are also captured at different time points in the life of the child.
I am very new to machine learning and need some guidance. In order to do this, I first create a dataset that contains all the theoretically relevant variables. It is 4,898 x 15. My final dataset looks like this (all are continuous except:
final <- ffc %>% select(Gender, PPVT, WJ10, Grit, `Self-control`, Attention, Externalization, Anxiety, Depression, PCG_Income, PCG_Education, Teen_Mom, PCG_Exp, School_connectedness, GPA)
Then, I split into test and train as follows:
library(rsample)   # initial_split(), training(), testing()
final_split <- initial_split(final, prop = .7)
final_train <- training(final_split)
final_test <- testing(final_split)
Next, I run the models:
library(rpart)
train <- rpart(GPA ~ ., method = "anova", data = final_train,
               control = rpart.control(cp = 0.2, minsplit = 5, minbucket = 5, maxdepth = 10))
test <- rpart(GPA ~ ., method = "anova", data = final_test,
              control = rpart.control(cp = 0.2, minsplit = 5, minbucket = 5, maxdepth = 10))
Next, I visualize cross validation results:
library(rpart.plot)
rpart.plot(train, type = 3, digits = 3, fallen.leaves = TRUE)
rpart.plot(test, type = 3, digits = 3, fallen.leaves = TRUE)
Next, I run predictions:
pred_train <- predict(train, final_train)
pred_test <- predict(test, final_test)
Next, I calculate accuracy:
MAE <- function(actual, predicted) { mean(abs(actual - predicted)) }
MAE(final_train$GPA, pred_train)
MAE(final_test$GPA, pred_test)
Following are my questions:
Now, I am not sure if I should use rpart, random forest, or XGBoost, so my first question is: how do I decide which algorithm to use? I settled on rpart, but I want to have sound reasoning for that choice.
Are these steps in the right order? What is the point of splitting my dataset into training and testing? I ultimately get two trees (one for train and the other for test). Which one should I be using, and what do I make of them? A step-by-step procedure after understanding my dataset would be quite helpful. Thanks!
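For what it is worth, the conventional pattern is to fit a single tree on the training split and evaluate that same tree on the held-out split, rather than growing a second tree on the test data. A minimal sketch reusing the variable names from the question:

library(rpart)

fit <- rpart(GPA ~ ., method = "anova", data = final_train,
             control = rpart.control(cp = 0.2, minsplit = 5,
                                     minbucket = 5, maxdepth = 10))

# One model, two predictions: on the data it saw and on the held-out data
pred_train <- predict(fit, newdata = final_train)
pred_test  <- predict(fit, newdata = final_test)

MAE <- function(actual, predicted) mean(abs(actual - predicted))
MAE(final_train$GPA, pred_train)   # optimistic, in-sample error
MAE(final_test$GPA, pred_test)     # estimate of out-of-sample error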

Opencart change product price upon user order

I hope my question will make sense, but I am really out of ideas.
I will explain by example: http://www.xtdeco.ro/fototapet/texturat/Bloom-R12241-6
There is a standard product, with some attributes.
What I need to do is make this product configurable, as you may notice from the two text inputs. The plan is to calculate Lățime × Înălțime (width × height), multiply by the price per square metre, verify against the actual product price (no problem this far), and then add or subtract a value to the product price, or add an option to the product for the current cart that would do the same.
Does anyone have an idea of how this could be done without hacking too much of the sources?
Thank you.
The easiest way is not to let the user input exact dimensions but to let them choose from prepared ones.
If this is wallpaper and you know the roll is always 1 m wide (just for simplicity), then sell only that 1 m2 and let the user enter the number of pieces, which results in a correspondingly long piece cut from the roll (so 8 pieces (m2) ordered results in an 8 m long piece of roll that is 1 m wide). In this case you may change the word 'pieces' or 'quantity' to 'm2'.
If this is a wall print with fixed dimensions (or aspect ratio), let the user choose from some predefined sizes, e.g.
XS (120 x 170 cm) + $0
S (150 x 212.5 cm) + $10
M (200 x 283 cm) + $20
L (250 x 354 cm) + $35
XL (300 x 425 cm) + $50
This could be handled by product options, which is again easier than what you are requesting... don't you think?
EDIT based on comment:
Then there is only one possibility that comes to my mind:
hide the quantity field (don't remove it, just make it hidden)
create a JS onChange handling function that listens for onChange, onBlur, onKeyUp (whatever) events on both text fields (for the dimensions) and calculates the resulting area in m2, which is shown to the customer along with the price for that area, while the price per 1 m2 is also known and displayed to the customer
this function also fills the calculated float value into the hidden quantity field, so after adding to the cart the cart should contain something like
4.73m2 WallPrint1 $18.92
(because I was calculating a price of $4 per 1 m2, thus 4.73 m2 x $4/m2 = $18.92)
I'm not sure, but you may also have to edit some other pieces of code to allow float quantity values in the cart and in orders...

How to match Amazon / CJ / Linkshare Products

I need to create a database with the Amazon, Commission Junction, and LinkShare APIs and data feeds, and then match the same products to create comparisons of product information.
My problem is related to the matching process.
I start by matching products via SKU/UPC/ASIN, but this does not perform well because many of the products don't contain this information.
I did some research, and the most popular techniques I found are:
- measuring cosine similarity via TF-IDF
- measuring edit distance / Levenshtein / Jaro-Winkler
Of these, I used cosine similarity and Jaro-Winkler.
How I do the matching:
Step 1: Preprocessing
Preprocessing transforms the strings into a normal form:
- lowercase
- filter stop words (new, by, the, ...)
- strip whitespace
- replace all whitespace runs with a single space character
Step 2: Indexing
Index the Amazon products in one Solr core [core A] and the CJ/LinkShare products in another core [core B]. The goal of indexing is to limit the number of string comparisons (via TF-IDF and Jaro-Winkler).
Step 3: Matching
I start by retrieving a product title from core B, run a Solr search in core A with this title, and take the top 30 results.
I measure the TF-IDF similarity between the product I want to match (the query) and the 30 results retrieved by the Solr search, and keep the products with similarity > 80%.
I sort the tokens of each product alphabetically and then compare the transformed strings with the Jaro-Winkler distance, keeping the products with similarity > 80% (this performs a Jaro-Winkler similarity between phrases).
Here, I tokenize both strings (query and product to match) and perform a comparison between tokens.
But this technique also doesn't perform well. Example:
Product 1: Orange by Hugo Boss, 3 Ounce Eau de toilette Spray
Product 2: In Motion Orange By Hugo Boss Eau De Toilette Spray 3 Ounces
Products 1 and 2 come out as similar under this technique, but they are actually different.
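To illustrate the failure mode on the two titles above, here is a small sketch in R using the stringdist package (an assumption on my part; the question does not say which library is used), reproducing the lowercase / stop-word / whitespace preprocessing and the sorted-token Jaro-Winkler comparison from step 3:

library(stringdist)   # stringsim() with method = "jw" gives Jaro-Winkler similarity

normalize <- function(s) {
  s <- tolower(s)
  s <- gsub("\\s+", " ", trimws(s))                  # collapse whitespace
  tokens <- strsplit(s, " ")[[1]]
  tokens <- setdiff(tokens, c("new", "by", "the"))   # tiny stop-word list from the question
  paste(sort(tokens), collapse = " ")                # alphabetical token sort
}

p1 <- "Orange by Hugo Boss, 3 Ounce Eau de toilette Spray"
p2 <- "In Motion Orange By Hugo Boss Eau De Toilette Spray 3 Ounces"

stringsim(normalize(p1), normalize(p2), method = "jw", p = 0.1)
# high score, even though the products differ ("In Motion", "Ounce" vs "Ounces")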
How can I improve this algorithm? Is this the right way to match products?
What if I train a classifier on token weights (using Jaro-Winkler), with training data taken from products already matched via UPC, and use this classifier to match products in a final step?
PS: I have products from different categories (health, beauty, electronics, books, movies...) and the data is very unstructured or incomplete.
Any advice will be helpful.
Thanks
Smail

Data Mining situation

Suppose I have the data as mentioned below.
11AM user1 Brush
11:05AM user1 Prep Breakfast
11:10AM user1 eat Breakfast
11:15AM user1 Take bath
11:30AM user1 Leave for office
12PM user2 Brush
12:05PM user2 Prep Breakfast
12:10PM user2 eat Breakfast
12:15PM user2 Take bath
12:30PM user2 Leave for office
11AM user3 Take bath
11:05AM user3 Prep Breakfast
11:10AM user3 Brush
11:15AM user3 eat Breakfast
11:30AM user3 Leave for office
12PM user4 Take bath
12:05PM user4 Prep Breakfast
12:10PM user4 Brush
12:15PM user4 eat Breakfast
12:30PM user4 Leave for office
This data tells me about the daily routines of different people. From this data, it seems user1 and user2 behave similarly (though they perform the activities at different times, they follow the same sequence). For the same reason, user3 and user4 behave similarly.
Now I have to group such users into different groups. In this example, group 1 would contain user1 and user2, and group 2 would contain user3 and user4.
How should I approach this kind of situation? I am trying to learn data mining, and this is an example I thought of as a data mining problem. I am trying to find an approach for the solution but cannot think of one. I believe this data has a pattern in it, but I am not able to think of the approach that can reveal it.
Also, I have to apply this approach to the dataset I have, which is pretty huge but similar to this :) The data consists of logs recording the occurrence of events over time, and I want to find groups representing similar sequences of events.
Any pointers would be appreciated.
It looks like clustering on top of association mining, more precisely the Apriori algorithm. Something like this:
Mine all possible associations between actions, i.e. sequences Brush -> Prep Breakfast, Prep Breakfast -> Eat Breakfast, ..., Brush -> Prep Breakfast -> Eat Breakfast, etc. - every pair, triplet, quadruple, etc. you can find in your data.
Make a separate attribute from each such sequence. For better performance, add a boost of 2 for pair attributes, 3 for triplets, and so on.
At this point you have an attribute vector with a corresponding boost vector. You can calculate a feature vector for each user: set 1 * boost at each position in the vector if that sequence exists in the user's actions, and 0 otherwise. You will get a vector representation of each user.
On these vectors, use the clustering algorithm that best fits your needs. Each cluster found is one of the groups you are looking for.
Example:
Let's mark all actions as letters:
a - Brush
b - Prep Breakfast
c - Eat Breakfast
d - Take Bath
...
Your attributes will look like
a1: a->b
a2: a->c
a3: a->d
...
a10: b->a
a11: b->c
a12: b->d
...
a30: a->b->c->d
a31: a->b->d->c
...
User feature vectors in this case will be:
attributes = a1, a2, a3, a4, ..., a10, a11, a12, ..., a30, a31, ...
user1 = 1, 0, 0, 0, ..., 0, 1, 0, ..., 4, 0, ...
user2 = 1, 0, 0, 0, ..., 0, 1, 0, ..., 4, 0, ...
user3 = 0, 0, 0, 0, ..., 0, 0, 0, ..., 0, 0, ...
To compare two users, some distance measure is needed. The simplest one is cosine similarity, which is just the cosine of the angle between the two feature vectors. If two users have exactly the same sequence of actions, their similarity will equal 1; if they have nothing in common, their similarity will be 0.
With this similarity measure, use a clustering algorithm (say, k-means) to make groups of users.
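A minimal base-R sketch of this recipe on the four users above (simplified on purpose: only adjacent pairs are used as attributes and no boosts are applied; the helper names are my own):

# time-ordered actions per user, taken from the example data in the question
seqs <- list(
  user1 = c("Brush", "Prep Breakfast", "Eat Breakfast", "Take bath", "Leave for office"),
  user2 = c("Brush", "Prep Breakfast", "Eat Breakfast", "Take bath", "Leave for office"),
  user3 = c("Take bath", "Prep Breakfast", "Brush", "Eat Breakfast", "Leave for office"),
  user4 = c("Take bath", "Prep Breakfast", "Brush", "Eat Breakfast", "Leave for office")
)

# adjacent-pair "sequence attributes" per user, e.g. "Brush -> Prep Breakfast"
pairs_per_user <- lapply(seqs, function(a) paste(head(a, -1), tail(a, -1), sep = " -> "))

# one column per observed pair: 1 if the user has it, 0 otherwise
all_pairs <- unique(unlist(pairs_per_user))
X <- t(sapply(pairs_per_user, function(p) as.integer(all_pairs %in% p)))
colnames(X) <- all_pairs

# cosine similarity between two users
cosine <- function(u, v) sum(u * v) / (sqrt(sum(u * u)) * sqrt(sum(v * v)))
cosine(X["user1", ], X["user2", ])   # 1: identical routines
cosine(X["user1", ], X["user3", ])   # 0 here: no shared adjacent pair

# k-means on the feature vectors recovers the two groups
set.seed(1)
kmeans(X, centers = 2)$cluster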
Using an itemset mining algorithm like Apriori, as proposed in the other answer, is not the best solution, because Apriori does not consider time or sequential ordering. It therefore requires an additional pre-processing step to account for the ordering.
A better solution is to use a sequential pattern mining algorithm such as PrefixSpan, SPADE, or CM-SPADE directly. A sequential pattern mining algorithm will directly find subsequences that appear often in a set of sequences.
Then you can still apply clustering on the sequential patterns found!
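For example, in R the arulesSequences package wraps the cSPADE algorithm; here is a small sketch on the question's data (the space-separated basket input format forces underscores in the item names, and the 50% support threshold is an arbitrary choice of mine):

library(arulesSequences)   # provides cSPADE via cspade()

# sequenceID eventID SIZE item  (one event per line)
tmp <- tempfile()
writeLines(c(
  "1 1 1 Brush",     "1 2 1 Prep_Breakfast", "1 3 1 Eat_Breakfast", "1 4 1 Take_bath",
  "2 1 1 Brush",     "2 2 1 Prep_Breakfast", "2 3 1 Eat_Breakfast", "2 4 1 Take_bath",
  "3 1 1 Take_bath", "3 2 1 Prep_Breakfast", "3 3 1 Brush",         "3 4 1 Eat_Breakfast",
  "4 1 1 Take_bath", "4 2 1 Prep_Breakfast", "4 3 1 Brush",         "4 4 1 Eat_Breakfast"
), tmp)

seqs <- read_baskets(tmp, info = c("sequenceID", "eventID", "SIZE"))

# frequent subsequences appearing in at least half of the users
patterns <- cspade(seqs, parameter = list(support = 0.5))
as(patterns, "data.frame")   # e.g. <{Prep_Breakfast},{Eat_Breakfast}> with its support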