Weka Apriori Algorithm - data-mining

I would like to use Apriori to carry out affinity analysis on transaction data. I have a table with a list of orders and their information. I mainly need to use the OrderID and ProductID attributes, which are in the following format:
OrderID ProductID
1 A
1 B
1 C
2 A
2 C
3 A
Weka requires you to create a nominal attribute for every product ID and to specify whether the item is present in the order using a TRUE or FALSE value, like this:
1, TRUE, TRUE, TRUE
2, TRUE, FALSE, TRUE
3, TRUE, FALSE, FALSE
My dataset contains about 10k records and about 3k different products. Can anyone suggest a way to create the dataset in this format (besides a manual, time-consuming way)?

How about writing a script to convert it?
Should be less than 10 lines in a good scripting language such as Python.
Or you may look into options of pivoting the relation as desired.
Either way, it is a straightforward programming task, so I don't really see the question here.
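For example, a minimal Python sketch, assuming the pairs sit in a whitespace-separated file called orders.txt (a hypothetical name) with the header shown above, could pivot them into the wide TRUE/FALSE layout:
import csv
from collections import defaultdict

orders = defaultdict(set)   # order ID -> set of product IDs in that order
products = set()
with open("orders.txt") as f:
    next(f)                 # skip the "OrderID ProductID" header line
    for line in f:
        order_id, product_id = line.split()
        orders[order_id].add(product_id)
        products.add(product_id)

product_list = sorted(products)
with open("orders_wide.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["OrderID"] + product_list)
    for order_id, items in orders.items():
        writer.writerow([order_id] + ["TRUE" if p in items else "FALSE" for p in product_list])
With roughly 10k orders and 3k products this still runs in seconds, and the resulting CSV should load into Weka with the TRUE/FALSE columns read as nominal attributes.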

You obviously need to convert your data. Easiest way: write a program in the programming language you are most familiar with that reads the file and writes it back out in the appropriate format. Since these are text files, it should not be too complicated.
By the way, if you want more algorithms for pattern mining and association mining than just Apriori in Weka, you could check my software SPMF ( http://www.philippe-fournier-viger.com/spmf/ ), which is also in Java, can read ARFF files too, and offers about 50 algorithms specialized in pattern mining (Apriori, FPGrowth, and many others).
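If you would rather write an ARFF file directly, the header just declares one nominal {FALSE, TRUE} attribute per product. A rough sketch, reusing the orders dictionary and product_list from the earlier snippet (product IDs containing spaces or commas would need quoting in the header):
# Reuses `orders` (order ID -> set of products) and `product_list` from the earlier sketch.
with open("orders.arff", "w") as f:
    f.write("@relation transactions\n\n")
    for p in product_list:
        f.write("@attribute %s {FALSE, TRUE}\n" % p)
    f.write("\n@data\n")
    for items in orders.values():
        f.write(",".join("TRUE" if p in items else "FALSE" for p in product_list) + "\n")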

Your data is formatted correctly as-is for use in R with the arules package (and its apriori function). You might consider checking it out, especially if you're not able to get into script coding.

Related

Can I make Amazon SageMaker deliver a recommendation based on historic data instead of a probability score?

We have a huge set of data in CSV format, containing a few numeric elements, like this:
Year,BinaryDigit,NumberToPredict,JustANumber, ...other stuff
1954,1,762,16, ...other stuff
1965,0,142,16, ...other stuff
1977,1,172,16, ...other stuff
The thing here is that there is a strong correlation between the third column and the columns before that. So I have pre-processed the data and it's now available in a format I think is perfect:
1954,1,762
1965,0,142
1977,1,172
What I want is a prediction of the value in the third column, using the first two as input. So in the case above, I want the input 1965,0 to return 142. In real life this file is thousands of rows long, but since there's a pattern, I'd like to retrieve the most likely value.
So far I've setup a train job on the CSV file using the Linear Learner algorithm, with the following settings:
label_size = 1
feature_dim = 2
predictor_type = regression
I've also created a model from it and set up an endpoint. When I invoke it, I get a score in return.
import boto3
runtime = boto3.client('sagemaker-runtime')  # SageMaker runtime client
payload = '1965,0'  # the two feature columns as CSV
response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                   ContentType='text/csv',
                                   Body=payload)
My goal here is to get the third column prediction instead. How can I achieve that? I have read a lot of the documentation regarding this, but since I'm not very familiar with AWS, I may well have used the wrong algorithm for what I am trying to do.
(Please feel free to edit this question to better suit AWS terminology)
For CSV input, the label should be in the first column, as mentioned here. So you should preprocess your data to put the label (the column you want to predict) in the first column.
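For instance, a small pandas sketch (file and column names are assumptions based on the example above) that moves the target to the front before training:
import pandas as pd
# The preprocessed file has no header row: Year, BinaryDigit, NumberToPredict.
df = pd.read_csv("data.csv", header=None, names=["Year", "BinaryDigit", "NumberToPredict"])
df = df[["NumberToPredict", "Year", "BinaryDigit"]]   # label first, then the two features
df.to_csv("train.csv", header=False, index=False)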
Next, you need to decide whether this is a regression problem or a classification problem.
If you want to predict a number that's as close as possible to the true number, that's regression. For example, the truth might be 4, and the model might predict 4.15. If you need an integer prediction, you could round the model's output.
If you want the prediction to be one of a few categories, then you have a classification problem. For example, we might encode 'North America' = 0, 'Europe' = 1, 'Africa' = 2, and so on. In this case, a fractional prediction wouldn't make sense.
For regression, use 'predictor_type' = 'regressor' and for classification with more than 2 classes, use 'predictor_type' = 'multiclass_classifier' as documented here.
The output of regression will contain only a 'score' field, which is the model's prediction. The output of multiclass classification will contain a 'predicted_label' field, which is the model's prediction, as well as a 'score' field, which is a vector of probabilities representing the model's confidence. The index with the highest probability will be the one that's predicted as the 'predicted_label'. The output formats are documented here.
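Reading the prediction back out of the endpoint response therefore differs slightly between the two cases; a minimal sketch, assuming the JSON response shapes described above and the response object from the invoke_endpoint call in the question:
import json
result = json.loads(response['Body'].read())
prediction = result['predictions'][0]
value = prediction['score']             # regression: the predicted number
# For multiclass classification you would instead read:
# label = prediction['predicted_label']
# probabilities = prediction['score']   # one probability per class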
predictor_type = regression is not able to return the predicted label, according to
the linear-learner documentation:
For inference, the linear learner algorithm supports the application/json, application/x-recordio-protobuf, and text/csv formats. For binary classification models, it returns both the score and the predicted label. For regression, it returns only the score.
For more information on input and output file formats, see Linear Learner Response Formats for inference, and the Linear Learner Sample Notebooks.

How do I change this data into proper format to do association rule mining in WEKA?

This is how the data is stored in the file provided; there are 8 attributes.
I need association rule mining done using Apriori Algorithm in WEKA.
For example: if item 1 & item 2 are bought --> item 4 is also bought, or something reasonable like that.
What I tried:
Converting the file into .arff format and loading into weka.
Turning all attributes in Nominal and running the algorithm Apriori.
But the rules generated are very weird.
This is how the result comes out. It has no useful information and no rules like what I want, which would actually describe what a user will buy with what.
That is, the rules generated here tell me nothing; there is no relation or rule given as to which item will be bought with which.
Any advice on how I should preprocess this data to format it properly, or on any other mistake I am making, would be really appreciated.

What are the steps of preprocessing anonymized data for predictive analysis?

Suppose we have a large dataset of anonymized data. The dataset consists of a certain number of variables and observations. All we can learn about the data is the type (numeric, char, date, etc.) of each variable, which we can do by looking at the data manually.
What are the best-practice steps for preprocessing the dataset for further analysis?
Just for instance, let this dataset be a single table, so we don't need to check any relations between tables.
This link gives the complete set of validations currently in practice. Still, to start with:
wherever possible, have your data written in such a way that you can parse it as fast and as easily as possible, using your preferred programming language's methods/constructors;
you can verify if all the data types match correctly - like int fields do not contain string data etc;
you can verify that your values are in acceptable range;
check if a non-nullable field has null values;
check if dates are in expected ranges;
check if data follows correct set-membership constraints wherever applicable;
if you have pattern-following data like phone numbers, make sure they are in the (XXX) XXX-XXXX format, if you prefer them that way;
are the zip codes at the correct accuracy level (in the US you may have 5 or 9 digits);
if your data is time-series, is it complete (i.e. you have values for all dates)?
is there any unwanted duplication?
Hope this is good enough to get you started...
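A minimal pandas sketch of a few of the checks above (file name, column names, and ranges are all hypothetical placeholders):
import pandas as pd
df = pd.read_csv("anonymized.csv")
# Data types match: a numeric field should not contain strings.
assert pd.to_numeric(df["amount"], errors="coerce").notna().all()
# Values and dates fall in acceptable ranges.
assert df["amount"].between(0, 1_000_000).all()
assert pd.to_datetime(df["signup_date"], errors="coerce").between("1990-01-01", "2025-12-31").all()
# Non-nullable fields, set membership, and unwanted duplication.
assert df["customer_id"].notna().all()
assert df["country"].isin(["US", "CA", "MX"]).all()
assert not df.duplicated().any()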

Weka: Classifier and ReplaceMissingValues

I am relatively new to the data mining area and have been experimenting with Weka.
I have a dataset which consists of almost 8000 records related to customers and items they have purchased. 58% of this data set has missing values for the "Gender" attribute.
I want to find the missing gender values based on the other data I do have.
I first thought I could do this using a classifier algorithm in Weka, using a training set to build a model. Based on examples I saw online, I tried this with pretty much all the algorithms available in Weka, using a training set that consisted of 60-80% of the data that did not have missing values. This gave me a lower accuracy rate than I wanted (80-86%, depending on the algorithm used).
Did I go about this correctly? Is there a way to improve this accuracy? I experimented with using different attributes, different pre-processing of the data etc.
I also tried using the ReplaceMissingValues filter on the complete dataset to see how that would handle the missing values. However, it just changed all the missing values to "Female", which obviously cannot be the case. So I'm also wondering whether I need to use this filter in my situation or not.
It sounds like you went about it in the correct way. The ReplaceMissingValues filter replaces the missing values with the most frequent of the non-missing values I think, so it is not what you want in this case.
A better way to get an idea of the true accuracy of your gender-predictor would be to use cross-validation instead of the training/test split (Weka has a separate option for that). 80-86% may seem low, but keep in mind that random guessing will only get you about 50%, so it's still a lot better than that. To try to get better performance, pick a classifier that performs well and then play with its parameters until you get better performance. This is likely to be quite labour-intensive (although you could of course use automated methods for tuning, see e.g. Auto-WEKA), but the only way to improve the performance.
You can also combine the algorithm you choose with a separate feature selection step (Weka has a special meta-classifier for this). This may improve performance, but again you'll have to experiment to find the particular configuration that works for you.
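Weka handles all of this from the Explorer/Experimenter GUI; if you prefer scripting, a rough Python analogue of the same idea (scikit-learn instead of Weka's meta-classifier, with a random stand-in purchase matrix since the real data isn't available here) would look something like this:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Stand-in data: 500 customers, 40 binary purchase features, known gender labels.
X = np.random.randint(0, 2, size=(500, 40))
y = np.random.choice(["Female", "Male"], size=500)
model = Pipeline([
    ("select", SelectKBest(chi2, k=10)),                    # attribute-selection step
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
scores = cross_val_score(model, X, y, cv=10)                # 10-fold cross-validation
print("accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
On the random stand-in data this will hover around 50%; the point is only the structure: feature selection wrapped inside the classifier, evaluated by cross-validation rather than a single train/test split.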

Input arff file for Weka Apriori

I am trying to do association mining on version history. I have my transaction data in MySQL. Weka's Apriori algorithm requires an ARFF or CSV file in a certain format: it has to have a column for each item, and the values are specified as TRUE or FALSE for each item in a transaction. I am looking for a way to create this file using Weka's InstanceQuery. Also, what are the options if the transaction data is huge?
I can answer the second part: options if the transaction data is huge. Weka is good software, but its Apriori implementation is horribly slow. I recommend the implementations at http://fimi.ua.ac.be/src/ (I used the first one in the list, from Ferenc Bodon).
Bodon's implementation uses a trie data structure instead of the hash tables that Weka uses. Because of this, I found in my work that Weka would take 3 days to finish something that Bodon's implementation could do in less than an hour (yes, the difference is that huge!).
Plus, Bodon's implementation uses a simple input format: one line for each transaction, with items separated by spaces.
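A small Python sketch of producing that format straight from a MySQL transactions table (connection details, table, and column names are all assumptions):
from collections import defaultdict
import mysql.connector  # the mysql-connector-python package
conn = mysql.connector.connect(host="localhost", user="user",
                               password="secret", database="versions")
cur = conn.cursor()
cur.execute("SELECT transaction_id, item_id FROM transactions")
baskets = defaultdict(list)
for transaction_id, item_id in cur:
    baskets[transaction_id].append(str(item_id))
with open("transactions.txt", "w") as f:   # one line per transaction, items space-separated
    for items in baskets.values():
        f.write(" ".join(items) + "\n")
If your item IDs are not integers, you may need to map them to integers first, since most of the FIMI implementations expect numeric items.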
If you want a fast Java implementation of FPGrowth or Apriori, have a look at my project SPMF. The FPGrowth implementation in SPMF beats Weka's implementation by up to two orders of magnitude on some datasets. For example, you can see this performance comparison:
http://www.philippe-fournier-viger.com/spmf/performance/chess_fpgrowth_spmf_vs_weka.png
This is the main project webpage:
http://www.philippe-fournier-viger.com/spmf/index.php
Moreover, note that SPMF offers more than 50 algorithms for itemset mining, association rule mining, sequential pattern mining, etc. The GUI version of SPMF also supports the ARFF format used by Weka.