Why doesn't this data work with the Naïve Bayes algorithm? - weka

It is in ARFF format. If you're not familiar with ARFF, it's basically that everything under the @data marker is in CSV format.
For clarification, I am trying to use the dataset on Weka but the option to use Naïve Bayes is greyed out.

Every classifier, clusterer, filter, etc. in Weka can only handle certain types of data; these are its capabilities (which you can inspect in the GUI). The capabilities are compared against the data, and in case of a mismatch the GUI won't let you apply the algorithm.
Long story short: the dport attribute is of type string, which NaiveBayes can't handle. You can convert that attribute into a nominal one using the StringToNominal filter.
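For illustration, a minimal Java sketch of applying StringToNominal programmatically (the file name is a placeholder, and "-R last" assumes dport is the last attribute):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToNominal;

public class ConvertDport {
    public static void main(String[] args) throws Exception {
        // load the ARFF file (path is a placeholder)
        Instances data = DataSource.read("network_traffic.arff");

        // convert the string attribute to nominal; adjust the range if dport is not the last attribute
        StringToNominal filter = new StringToNominal();
        filter.setOptions(new String[] {"-R", "last"});
        filter.setInputFormat(data);
        Instances converted = Filter.useFilter(data, filter);

        System.out.println(converted.numAttributes() + " attributes after conversion");
    }
}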


number expected, read Token[2015-02-02 14:19:00] - Weka project

I hope you are all doing well!
I have a project for my data mining class. The data consists of numerical values and many algorithms do not work on it. The task is: "You should compare the performance of the following classification algorithms:
RandomForest, C4.5, JRip, Bayesian Network. Where necessary, use
Weka filters to replace attribute values or to create
new attributes. For the comparison, use a Train/Test Percentage Split with
the percentage of training data equal to 80%. Describe your observations, giving tables with the results and
presenting the performance of the algorithms. Repeat the experiment with
the percentage of training data equal to 70% and then 50%, presenting the results."
My first try was to transform the data inside Weka with the NumericToNominal preprocessing filter, but a friend of mine suggested that this is statistically wrong. My second try was to use Excel to transform all the data, even the date, to numeric, remove the first row (id), and pass it to Weka (I left double quotes only around the date). But I get the error that I mention in the title. The dataset is: https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+
Thank you for your time.
If you define date-like data as a DATE attribute in the ARFF file (using the right format for parsing the strings), then WEKA will treat it as a numeric attribute internally (Java epoch, i.e., milliseconds since 1970-01-01).
Instead of using NumericToNominal, use either the supervised or unsupervised Discretize filter if the algorithm cannot handle numeric attributes.
Converting nominal attributes to numeric ones is not a recommended approach. Instead, try the supervised or unsupervised NominalToBinary filter.
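As a rough sketch, the unsupervised Discretize filter can be applied via the Java API like this (the file name and the choice of 10 bins are example assumptions; the comment shows how a date attribute can be declared in the ARFF header):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeNumeric {
    public static void main(String[] args) throws Exception {
        // in the ARFF header the timestamp can be declared as a date attribute, e.g.:
        // @attribute date date "yyyy-MM-dd HH:mm:ss"
        Instances data = DataSource.read("occupancy.arff");   // placeholder path

        Discretize filter = new Discretize();
        filter.setBins(10);              // example: 10 equal-width bins per numeric attribute
        filter.setInputFormat(data);
        Instances discretized = Filter.useFilter(data, filter);

        System.out.println(discretized.toSummaryString());
    }
}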

Input Arrays Into Weka

I am fairly new to machine learning and I am trying to use WEKA (GUI) to implement a neural network on a sports data set. My issue is that I want my inputs to be arrays (each array is a contestant with stats such as speed, winrate, etc.). I am wondering how I can tell WEKA that each input is an array of values.
You can define it in an .arff file; the ARFF documentation describes the format in detail.
Alternatively, after opening your data in Weka, you can convert it with the help of some filters. I do not know the current format of your data, but if you can open it in Weka, you can edit it with many filters. Keep in mind that artificial neural networks only accept numerical values; among these filters there are some that convert nominal data to numerical data. I share an image of these filters below. If you are new to this area, I recommend watching the WekaMOOC videos (made by the Weka developers); I think they will be very useful. Good luck.
[Screenshot: list of Weka filters]
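For illustration, a minimal sketch of building such a dataset with the Weka Java API, where each contestant's array of stats becomes one instance (the class attribute "won" and the example values are made up):

import java.util.ArrayList;
import java.util.Arrays;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class ContestantDataset {
    public static void main(String[] args) {
        // one numeric attribute per statistic; "won" is a hypothetical nominal class
        ArrayList<Attribute> attrs = new ArrayList<>();
        attrs.add(new Attribute("speed"));
        attrs.add(new Attribute("winrate"));
        attrs.add(new Attribute("won", Arrays.asList("yes", "no")));

        Instances data = new Instances("contestants", attrs, 0);
        data.setClassIndex(data.numAttributes() - 1);

        // each contestant's array of stats becomes one row (instance)
        double[] row = new double[data.numAttributes()];
        row[0] = 7.5;                                        // speed
        row[1] = 0.62;                                       // winrate
        row[2] = data.attribute("won").indexOfValue("yes");  // class label
        data.add(new DenseInstance(1.0, row));

        System.out.println(data);
    }
}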

StringToWordVector in Weka

What is StringToWordVector? All I know about it is that it converts a string attribute into multiple attributes. But what is the advantage of doing so, and how does an object of the StringToWordVector class serve as a filter for FilteredClassifier? How has it become a filter?
StringToWordVector is the filter class in Weka that converts string attributes into N-grams using the WordTokenizer class. This lets us provide strings as N-grams to a classifier. Besides tokenizing, it also provides other functionality such as removing stopwords, weighting words with TF-IDF, outputting word counts rather than just indicating whether a word is present, pruning, stemming, lowercase conversion of words, etc. A detailed explanation of this class can be found at http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html So basically it provides the functionality needed to fine-tune the training set to your requirements before training.
However, anyone who wants to perform testing along with training must use batch filtering or the FilteredClassifier to ensure compatibility of the train and test sets. This is because if we pass the train and test sets separately through StringToWordVector, it will generate a different vocabulary for each set. To decide which technique to use out of batch filtering and FilteredClassifier, follow the post by Nihil Obstat at http://jmgomezhidalgo.blogspot.in/2013/01/text-mining-in-weka-chaining-filters.html
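For illustration, a minimal sketch of chaining the two (the file names, the filter options, and the choice of NaiveBayesMultinomial are just example assumptions):

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TextClassification {
    public static void main(String[] args) throws Exception {
        // placeholder files; the class attribute is assumed to be the last one
        Instances train = DataSource.read("train.arff");
        Instances test = DataSource.read("test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        StringToWordVector s2wv = new StringToWordVector();
        s2wv.setLowerCaseTokens(true);   // lowercase conversion
        s2wv.setTFTransform(true);       // TF weighting
        s2wv.setIDFTransform(true);      // IDF weighting

        // FilteredClassifier applies the filter learned on the training data to the test data,
        // so both share the same vocabulary
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(s2wv);
        fc.setClassifier(new NaiveBayesMultinomial());
        fc.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(fc, test);
        System.out.println(eval.toSummaryString());
    }
}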
Hope this helps.

Weka: Classifier and ReplaceMissingValues

I am relatively new to the data mining area and have been experimenting with Weka.
I have a dataset which consists of almost 8000 records related to customers and items they have purchased. 58% of this data set has missing values for the "Gender" attribute.
I want to find the missing gender values based on the other data I do have.
I first thought I could do this using a classifier algorithm in Weka, using a training set to build a model. Based on examples I saw online, I tried this with pretty much all the algorithms available in Weka, using a training set that consisted of 60-80% of the data which did not have missing values. This gave me a lower accuracy rate than I wanted (80-86%, depending on the algorithm used).
Did I go about this correctly? Is there a way to improve this accuracy? I experimented with using different attributes, different pre-processing of the data etc.
I also tried using the ReplaceMissingValues filter on the complete dataset to see how that would handle the missing values. However, it just changed all the missing values to "Female", which obviously cannot be the case. So I'm also wondering whether I need to use this filter in my situation or not.
It sounds like you went about it in the correct way. The ReplaceMissingValues filter replaces missing nominal values with the most frequent non-missing value (the mode), so it is not what you want in this case.
A better way to get an idea of the true accuracy of your gender predictor would be to use cross-validation instead of the training/test split (Weka has a separate option for that). 80-86% may seem low, but keep in mind that random guessing will only get you about 50%, so it's still a lot better than that. To get better performance, pick a classifier that performs well and then play with its parameters until you get better results. This is likely to be quite labour-intensive (although you could of course use automated methods for tuning, see e.g. Auto-WEKA), but it is the only way to improve the performance.
You can also combine the algorithm you choose with a separate feature selection step (Weka has a special meta-classifier for this). This may improve performance, but again you'll have to experiment to find the particular configuration that works for you.
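As a sketch, cross-validation and the feature-selection meta-classifier can be combined via the Java API like this (the file name, the "Gender" attribute name, and the choice of J48 are example assumptions):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class GenderPrediction {
    public static void main(String[] args) throws Exception {
        // placeholder file containing only the records with a known Gender value
        Instances data = DataSource.read("customers_known_gender.arff");
        data.setClassIndex(data.attribute("Gender").index());

        // wrap a base classifier in a feature-selection meta-classifier (defaults: CFS + BestFirst)
        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setClassifier(new J48());

        // 10-fold cross-validation gives a more reliable accuracy estimate than a single split
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(asc, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}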

Input arff file for Weka Apriori

I am trying to do association mining on version history. I have my transaction data in MySQL. The Weka Apriori algorithm requires an ARFF or CSV file in a certain format: it has to have a column for each item, and the values are specified as TRUE or FALSE for each item in a transaction. I am looking for a way to create this file using Weka's InstanceQuery. Also, what are the options if the transaction data is huge?
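For reference, this is roughly what I have in mind with InstanceQuery and ArffSaver (the connection details and the query are placeholders, and I assume the SQL already pivots items into one column per item):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.experiment.InstanceQuery;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericToNominal;

public class ExportTransactions {
    public static void main(String[] args) throws Exception {
        // placeholder connection details and query
        InstanceQuery query = new InstanceQuery();
        query.setDatabaseURL("jdbc:mysql://localhost:3306/versiondb");
        query.setUsername("user");
        query.setPassword("secret");
        query.setQuery("SELECT * FROM transaction_matrix");
        Instances data = query.retrieveInstances();
        query.disconnectFromDatabase();

        // Apriori needs nominal attributes, so convert the 0/1 numeric item columns
        NumericToNominal toNominal = new NumericToNominal();
        toNominal.setInputFormat(data);
        data = Filter.useFilter(data, toNominal);

        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("transactions.arff"));
        saver.writeBatch();
    }
}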
I can answer the second part: the options if the transaction data is huge. Weka is good software, but its Apriori implementation is horribly slow. I recommend the implementations at http://fimi.ua.ac.be/src/ (I used the first one in the list, from Ferenc Bodon).
Bodon's implementation uses a trie data structure instead of the hash tables that Weka uses. Because of this, I found in my work that Weka would take 3 days to finish something that Bodon's implementation could do in less than an hour (yes, the difference is that huge!).
Plus, Bodon's implementation uses a simple input format: one line for each transaction, with items separated by spaces.
If you want a fast Java implementation of FPGrowth or Apriori, have a look at my project SPMF. The FPGrowth implementation in SPMF beats the Weka implementation by up to two orders of magnitude on some datasets. For example, you can see this performance comparison:
http://www.philippe-fournier-viger.com/spmf/performance/chess_fpgrowth_spmf_vs_weka.png
This is the main project webpage:
http://www.philippe-fournier-viger.com/spmf/index.php
Moreover, note that SPMF offers more than 50 algorithms for itemset mining, association rule mining, sequential pattern mining, etc. The GUI version of SPMF also supports the ARFF format used by Weka.