Handling missing values for association rules using Weka - data-mining

I'm new to Weka and stuck with a problem. I have a dataset with about 13 features (all binary). Some of the features are applicable only for a small set of data. When I run association rule mining using Weka, it identifies strong co-relations between attributes based on the feature value being 0 (0 implies the feature does not apply).
I would like the co-relation to be identified only for positive features. How do I do this?

This should be default behavior, IMHO.
In typical APRIORI use cases, most items are missing from most transactions.
Maybe convert your items to a non-numeric type and subsitute 0 for missing value?
The classic example uses this format:
#relation supermarket
#attribute 'department1' { t}
...
#data
?,?,...,t,...
where ? indicates missing, t indicates presence.

Related

number expected.read Token[2015-02-02 14:19:00] weka project

i hope you all are doing well!
I have a project at data mining class.Τhe data consists of numerical data and many algorithms do not work.I have to do this:"you should compare the performance of the following categorization algorithms:
RandomForest, C4.5, JRip, Bayesian Network. Where necessary use them
Weka filters to replace or create values ​​for some properties
new properties. For comparison, adopt the Train / Test Percentage Split type with
percentage for training data equal to 80%.Describe your observations by giving tables with the results and
presenting the performance of the algorithms. Repeat the experiment by putting
percentage for training data equal to 70% and 50% presenting the results."
So my first try was to transform the data inside weka with preprocessing data numeric to nominal but a friend of mine suggest that is statistical wrong.So my second try was to use excel to transform all data even the date to numeric,remove the first row(id) and pass it to the weka(I leave double quotes only at date)
.But i have the error that i mention on the title.The dataset is:https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+
Thank you for the time.
If you define date-like data as a DATE attribute in the ARFF file (using the right format for parsing the strings), then WEKA will treat it as a numeric attribute internally (Java epoch, ie milli-seconds since 1970-01-01).
Instead of using NumericToNominal, use either the supervised or unsupervised Discretize filter if the algorithm cannot handle numeric attributes.
Converting nominal attributes to numeric ones is not a recommended approach. Instead, try the supervised or unsupervised NominalToBinary filter.

How do I change this data into proper format to do association rule mining in WEKA?

This is how the data is stored in the file given. There are 8 attributes given.
I need association rule mining done using Apriori Algorithm in WEKA.
Such as, if item 1 & item 2 are bought --> item 4 is also bought or something reasonable as that.
What I tried:
Converting the file into .arff format and loading into weka.
Turning all attributes in Nominal and running the algorithm Apriori.
But the rules generated are very weird.
This is how the result comes. It has no proper information. No rules like what i want, which actually define what a user will buy with what or anything.
i.e. the rules generated here are of no information to me, there is no relation/rules given as to which item will be bought with what.
How should I preprocess this data to format it well or if am making any other mistake would be really appreciated.

RapidMiner: Can I use a wildcard as an attribute value for training a decision tree model?

I am working on a fairly simple process in RapidMiner 5.3.013, which reads a CSV file and uses it as a training set to train the decision tree classifier. The result of the process is the model. A second CSV is read and used as the unlabeled set. The model (calculated earlier) is applied to the unlabeled test set, in an effort to label it properly.
Each line of the CSVs contains a few attributes, for example:
15, 0, 1555, abc*15, label1
but some lines of the training set may be like this:
15, 0, *, abc*15, label2
This is done because the third value may take various values, so the creator of the training set used a star as a wildcard in the place of the value.
What I would like to do is let the decision tree know that the star there means "match anything", so that it does not literally only match a star.
Notes:
the star in the 4th field (abc*15) should be matched literally and not as a wildcard.
if the 3rd field always contained stars, I could just not include it in the attributes, but that's not the case. Sometimes the 3rd field contains integer values, which should be matched literally.
I tried leaving the field blank, but it doesn't work
So, is there a way to use regular expressions, or at least a simple wildcard while training the classifier or using the model?
A different way to put it is: Can I instruct the classifier to not use some of the attributes in some of the entries (lines in the CSV)?
Thanks!
I would process the data so the missing value is valid in its own right and I would discretize the valid numbers to be in ranges.
In more detail, what I meant by missing is the situation where the value of an attribute is something like *. I would simply allow this to be one valid value that the attribute takes. For all the other values of this attribute, these are numerical so they need to be converted to a nominal value to be compatible with the now valid *.
It's fairly fiddly to do this and I haven't tried this but I would start with the operator Declare Missing Value to detect the * and make them missing. From there, I would use the operator Discretize by Binning to convert numbers into nominal values. Finally, I would use Replace Missing Values to change the missing values to a nominal value like Missing. You might ask why bother with the first Declare Missing step above? The reason is that it will allow the Discretizing operation to work because it will be working on numbers alone given that non-numbers are marked as missing.
The resulting example set then be passed to a model in the normal way. Obviously, the model has to be able to cope with nominal attributes (Decision trees does).
It occurred to me that some modelling operators are more tolerant of missing data. I think k-nearest-neighbours may be one. In this case, you could simply mark the missing ones as above and not bother with the discretizing step.
The whole area of missing data does need care because it's important to understand the source of missingness. If missing data is correlated with other attributes or with the label itself, handling it inappropriately can skew results.

WEKA - weather prediction

I am pretty new to the concepts of machine learning and clustering. I have installed Weka and am trying to figure out how it works. Currently, I have my training data as below.
#relation weather
#attribute year real
#attribute temperature real
#attribute warmer {yes,no}
#data
1956 , 68.98585 , yes
1957 , 67.52131 , yes
1958 , 65.853386 , no
1959 , 66.32705 , yes
1960 , 65.89773 , no
So, I am trying to build a model which should predict if it is getting warmer each and every year.
If I have to predict if 1961 is warmer or cooler, should I provide my test data like below?
#relation weather
#attribute year real
#attribute temperature real
#data
1961 , 70.98585
I have removed the column warmer which I want to predict using the training set I provided earlier. I can use any algorithm that Weka provides me (J48, BayesNet etc). Can someone please help me out in figuring how to understand the concepts?
You don't need to make the training and test sets yourself, Weka will do that for you. Even if you do, don't delete the value to predict from the test set -- Weka will make sure that everything happens properly, but needs the actual value to determine whether a prediction is correct or not and tell you how your model performs.
Your problem is a classification problem, i.e. you want to predict the label "yes" or "no". Not all of the algorithms in Weka are applicable, but the ones that are not are greyed out (if you use the GUI).
On a more general note, you're unlikely to get good results with the data that you have. This is more of a time series prediction task (i.e. given these past values, how will it develop in the future), for which Weka doesn't really offer the algorithms. You can find some more information on Wikipedia.
To get better models with Weka, you could add the temperature value from the previous year (or the previous 2 years) as a feature, but ultimately it sounds like you want to use something that can do time series analysis and predictions.

Weka Apriori Algorithm

I would like to use Apriori to carry out affinity analysis on transaction data. I have a table with a list of orders and their information. I mainly need to use the OrderID and ProductID attributes which are in the following format
OrderID ProductID
1 A
1 B
1 C
2 A
2 C
3 A
Weka requires you to create a nominal attribute for every product ID and to specify whether the item is present in the order using a true or false value like like this:
1, TRUE, TRUE, TRUE
2, TRUE, FALSE, TRUE
3, TRUE, FALSE, FALSE
My dataset contains about 10k records... about 3k different products. Can anyone suggest a way to create the dataset in this format? (Besides a manually time consuming way...)
How about writing a script to convert it?
Should be less than 10 lines in a good scripting language such as Python.
Or you may look into options of pivoting the relation as desired.
Either way, it is a straight forward programming task, so I don't see your question here.
You obviously need to convert your data. Easiest way: write a software that read the file in the programming language that you are most familiar with and then write the file in the appropriate format. Since it is text files, it should not be too complicated.
By the way, if you want more algorithms for pattern mining and association mining than just Apriori in Weka, you could check my software SPMF ( http://www.philippe-fournier-viger.com/spmf/ ) which is also in Java, can read ARFF files too and offers about 50 algorithms specialized in pattern mining (Apriori FPGrowth, and many others.
Your data is formatted correctly as-is for implementation in R using the ARULES package (and apriori function). You might consider checking it out, esp. if you're not able to get into script coding.