WEKA - weather prediction - weka

I am pretty new to the concepts of machine learning and clustering. I have installed Weka and am trying to figure out how it works. Currently, I have my training data as below.
@relation weather
@attribute year real
@attribute temperature real
@attribute warmer {yes,no}
@data
1956 , 68.98585 , yes
1957 , 67.52131 , yes
1958 , 65.853386 , no
1959 , 66.32705 , yes
1960 , 65.89773 , no
So, I am trying to build a model which should predict if it is getting warmer each and every year.
If I have to predict if 1961 is warmer or cooler, should I provide my test data like below?
@relation weather
@attribute year real
@attribute temperature real
@data
1961 , 70.98585
I have removed the column warmer which I want to predict using the training set I provided earlier. I can use any algorithm that Weka provides me (J48, BayesNet etc). Can someone please help me out in figuring how to understand the concepts?

You don't need to make the training and test sets yourself; Weka will do that for you. Even if you do, don't delete the value to predict from the test set. Weka will make sure that everything happens properly, but it needs the actual value to determine whether a prediction is correct and to tell you how your model performs.
Your problem is a classification problem, i.e. you want to predict the label "yes" or "no". Not all of the algorithms in Weka are applicable, but the ones that are not are greyed out (if you use the GUI).
On a more general note, you're unlikely to get good results with the data that you have. This is more of a time series prediction task (i.e. given these past values, how will it develop in the future), for which Weka doesn't really offer the algorithms. You can find some more information on Wikipedia.
To get better models with Weka, you could add the temperature value from the previous year (or the previous 2 years) as a feature, but ultimately it sounds like you want to use something that can do time series analysis and predictions.
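As a minimal sketch of the lag-feature idea, the snippet below adds the previous year's temperature to each row and writes the result as ARFF text. The helper names `add_lags` and `to_arff`, the relation name, and the attribute name `temperature_prev1` are made up for illustration, and the code assumes the rows are sorted by year with no gaps.

```python
# Training rows from the question: (year, temperature, warmer).
rows = [
    (1956, 68.98585, "yes"),
    (1957, 67.52131, "yes"),
    (1958, 65.853386, "no"),
    (1959, 66.32705, "yes"),
    (1960, 65.89773, "no"),
]


def add_lags(rows, n_lags=1):
    """Append the temperatures of the previous n_lags years to each row.

    Assumes rows are sorted by year without gaps. The first n_lags rows
    are dropped because they have no complete history.
    """
    out = []
    for i in range(n_lags, len(rows)):
        year, temp, label = rows[i]
        lags = [rows[i - k][1] for k in range(1, n_lags + 1)]
        out.append((year, temp, *lags, label))
    return out


def to_arff(lagged, n_lags=1):
    """Render the lagged rows as ARFF text (hypothetical attribute names)."""
    header = [
        "@relation weather_lagged",
        "@attribute year real",
        "@attribute temperature real",
    ]
    header += [f"@attribute temperature_prev{k} real" for k in range(1, n_lags + 1)]
    header += ["@attribute warmer {yes,no}", "@data"]
    body = [",".join(str(v) for v in row) for row in lagged]
    return "\n".join(header + body)


lagged = add_lags(rows, n_lags=1)
print(to_arff(lagged, n_lags=1))
```

With only five years of data, one lag already costs a fifth of the training set, so this mainly pays off on a longer series.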

Related

WEKA classification using embeddings

I have been working on code snippet classification using code2vec models that I trained on a set of Python code snippets. My idea was to produce an embedding for each code snippet, attach the label to it, and use it for the final classification. For instance, the ARFF file for Weka would look like the following:
relation XYZ
#Attributes #Class# {buggy,non_buggy}
#Attributes index1 real
.........
.........
.........
.........
.........
#Attributes index380 real
#data
buggy, 0.28600096702575684, -0.03643874451518059, -0.06801733374595642,.......
..................
..................
..................
non_buggy, 0.4966501295566559, -0.38083720207214355, -0.378182053565979,.......
For the classification, I split my full dataset into 80% for training and 20% for testing using the percentage split option provided by WEKA and got a precision of 99%, which surprised me. I then tried other splits, for instance 1% for training and 99% for testing, but the performance is still good, at almost 99% precision, which does not seem logical to me.
Do I have to change anything before the second split? Has anybody experienced this issue while working with embeddings in WEKA?

Using Logistic Regression For Timeseries Data in Amazon SageMaker

For a project I am working on, which uses annual financial report data (of multiple categories) from companies which have been successful or gone bust/into liquidation, I previously created a (fairly well performing) model on AWS SageMaker using a multiple linear regression algorithm (specifically, the AWS stock algorithm for logistic regression/classification problems, the 'Linear Learner' algorithm).
This model just produces a simple "company is in good health" or "company looks like it will go bust" binary prediction, based on one set of annual data fed in; e.g.
query input: {data:[{
"Gross Revenue": -4000,
"Balance Sheet": 10000,
"Creditors": 4000,
"Debts": 1000000
}]}
inference output: "in good health" / "in bad health"
I trained this model by ignoring which year each company's values were from and piling in all of the annual financial report data (i.e. one year's financial data for one company = one input line) for the training, along with the label of "good" or "bad". A good company is one which has existed for a while but hasn't gone bust; a bad company is one which was found to have eventually gone bust. For example:
label | Gross Revenue | Balance Sheet | Creditors | Debts
----- | ------------- | ------------- | --------- | ---------
good  | 10000         | 20000         | 0         | 0
bad   | 0             | 5             | 100       | 10000
bad   | 20000         | 0             | 4         | 100000000
I hence used these multiple features (gross revenue, balance sheet...) along with the label (good/bad) in my training input, to create my first model.
I would like to use the same features as before as input (gross revenue, balance sheet...) but over multiple years; e.g. take the values from 2020 and 2019 and use these (along with the eventual company status of "good" or "bad") as the single input for my new model. However, I'm unsure of the following:
1. Is this an inappropriate use of logistic regression machine learning? i.e. is there a more suitable algorithm I should consider?
2. Is it fine, or terribly wrong, to just use the same technique as before, but combine the data for both years into one input line like:
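label | Gross Revenue(2019) | Balance Sheet(2019) | Creditors(2019) | Debts(2019) | Gross Revenue(2020) | Balance Sheet(2020) | Creditors(2020) | Debts(2020)
----- | ------------------- | ------------------- | --------------- | ----------- | ------------------- | ------------------- | --------------- | -----------
good  | 10000               | 20000               | 0               | 0           | 30000               | 10000               | 40              | 500
bad   | 100                 | 50                  | 200             | 50000       | 100                 | 5                   | 100             | 10000
bad   | 5000                | 0                   | 2000            | 800000      | 2000                | 0                   | 4               | 100000000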
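The combined one-line-per-company layout above can be sketched as a small reshaping step. This is only an illustration: the company identifiers "A" and "B", the record layout, and the helper name `to_wide` are assumptions, while the numbers are taken from the first two example rows.

```python
from collections import defaultdict

FEATURES = ["Gross Revenue", "Balance Sheet", "Creditors", "Debts"]

# Hypothetical long-format input: one record per company per year.
records = [
    {"company": "A", "year": 2019, "Gross Revenue": 10000, "Balance Sheet": 20000, "Creditors": 0, "Debts": 0},
    {"company": "A", "year": 2020, "Gross Revenue": 30000, "Balance Sheet": 10000, "Creditors": 40, "Debts": 500},
    {"company": "B", "year": 2019, "Gross Revenue": 100, "Balance Sheet": 50, "Creditors": 200, "Debts": 50000},
    {"company": "B", "year": 2020, "Gross Revenue": 100, "Balance Sheet": 5, "Creditors": 100, "Debts": 10000},
]
labels = {"A": "good", "B": "bad"}  # eventual company status


def to_wide(records, labels, years=(2019, 2020)):
    """Build one training line per company, earlier year's columns first."""
    by_company = defaultdict(dict)
    for record in records:
        by_company[record["company"]][record["year"]] = record

    header = ["label"] + [f"{f}({y})" for y in years for f in FEATURES]
    rows = []
    for company in sorted(by_company):
        per_year = by_company[company]
        if any(y not in per_year for y in years):
            continue  # skip companies that lack one of the requested years
        rows.append([labels[company]] + [per_year[y][f] for y in years for f in FEATURES])
    return header, rows


header, rows = to_wide(records, labels)
print(",".join(header))
for row in rows:
    print(",".join(str(v) for v in row))
```

Note that a linear model treats each column independently of its position on the line, so placing 2019 before 2020 does not by itself weight one year more than the other.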
I would personally expect that a company which has gotten worse over time (i.e. the company's finances are worse in 2020 than in 2019) should be more likely to be found "bad"/likely to go bust. So I would hope that, if I feed in data like in the above example (i.e. earlier years' data comes before later years' data on an input line), my training job ends up creating a model which gives greater weighting to the earlier years' data when making predictions.
Any advice or tips would be greatly appreciated - I'm pretty new to machine learning and would like to learn more
UPDATE:
Using Long Short-Term Memory recurrent neural networks (LSTM RNN) is one potential route I think I could try, but this seems to be commonly used with multivariate data over many dates; my data only has 2 or 3 dates' worth of multivariate data per company. I would want to use the data I have for all the companies, over the few dates' worth of data there are, in training.
I once developed a so-called genetic time series in R. I used a genetic algorithm that selected the best solutions from multivariate data, which were fitted on a VAR in differences or a VECM. Your data seems more macroeconomic or financial than user-centric, so a VAR or VECM seems appropriate. (It is surely possible to treat time-series data in such a way that we can use LSTM or other approaches, but these are very common.) However, I do not know whether a VAR in differences or a VECM works with binary class labels. Perhaps if you calculate a metric outcome, which you later encode as a categorical feature (or label it as categorical first), then VAR or VECM may also be appropriate.
However, you may combine all yearly data points into one data point per firm to forecast its survival, but you would lose a lot of insight. If you are interested in time series ML, which works a little differently than neural networks or elastic net (which could also be used with time series), let me know and we can work something out. Or I'll paste you some sources.
Summary:
1.) It is possible to use LSTM, elastic net (time points may be dummies or treated as a cross-sectional panel), or VAR in differences and VECM with a slightly different outcome variable.
2.) It is possible, but you will lose information over time.
All the best,
Patrick

Train program to understand high and low value in machine learning

I am generating alerts by reading a dataset of KPIs (key performance indicators). My algorithm looks at historical data and, based on that, I am able to detect a sudden spike in the data. But I am generating false alarms. For example, KPI1 is historically at .5 but reaches a value of 12, which is a kind of spike.
In the same way, KPI2 also goes from .5 to 12. But I know that KPI1 going from .5 to 12 is not a big deal and I need not capture that, whereas KPI2 going from .5 to 12 is a big deal and I need to capture that.
I want to train my program to understand what is a high value, a low value, or a normal value for each KPI.
Could you experts tell me which ML algorithm is best for this, and which Python package I need to explore?
This is a classification problem. You can use the classic logistic regression algorithm to classify any given sample as a high, low, or normal value.
Quoting from Wikipedia:
In statistics, multinomial logistic regression is a classification
method that generalizes logistic regression to multiclass problems,
i.e. with more than two possible discrete outcomes. That is, it is
a model that is used to predict the probabilities of the different
possible outcomes of a categorically distributed dependent variable,
given a set of independent variables (which may be real-valued,
binary-valued, categorical-valued, etc.)
To perform multi-class classification in Python, the sklearn library can be useful.
http://scikit-learn.org/stable/modules/multiclass.html

Finding a correlation between variable and class variable

I have a dataset which contains 7 numerical attributes and one nominal attribute, which is the class variable. I was wondering how I can find the best attribute to use for predicting the class attribute. Would finding the attribute with the largest information gain be the solution?
So the problem you are asking about falls under the domain of feature selection, and more broadly, feature engineering. There is a lot of literature online regarding this, and there are definitely a lot of blogs/tutorials/resources online for how to do this.
To give you a good link I just read through, here is a blog with a tutorial on some ways to do feature selection in Weka, and the same blog's general introduction on feature selection. Naturally there are a lot of different approaches, as knb's answer pointed out.
To give a short description though, there are a few ways to go about it: you can assign a score to each of your features (like information gain, etc.) and filter out features with 'bad' scores; you can treat finding the best feature subset as a search problem, where you take different subsets of the features and assess the accuracy of each in turn; and you can use embedded methods, which learn which features contribute most to the accuracy as the model is being built. Examples of embedded methods are regularization algorithms like LASSO and ridge regression.
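The scoring approach can be sketched for numeric attributes as follows: each attribute is scored by the best information gain achievable with a single threshold split, and the attributes are ranked by that score. This is a plain-Python illustration, not a reproduction of Weka's attribute evaluators; the attribute names and toy values are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(values, labels, threshold):
    """Information gain of splitting a numeric attribute at `threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder


def best_gain(values, labels):
    """Best gain over midpoints between consecutive distinct values."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max((info_gain(values, labels, t) for t in candidates), default=0.0)


# Hypothetical toy data: attr_a separates the classes perfectly, attr_b does not.
data = {
    "attr_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "attr_b": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
}
labels = ["neg", "neg", "neg", "pos", "pos", "pos"]

ranking = sorted(data, key=lambda name: best_gain(data[name], labels), reverse=True)
for name in ranking:
    print(name, round(best_gain(data[name], labels), 4))
```

Ranking each attribute on its own ignores interactions between attributes, which is where the subset-search and embedded approaches come in.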
Do you just want that attribute's name, or do you also want a quantifiable metric (like a t-value) for this "best" attribute?
For a qualitative approach, you can generate a classification tree with just one split, two leaves.
For example, with Weka's "diabetes.arff" sample dataset (n = 768), which has a similar structure to your dataset (all attributes numeric, and the class attribute has only two distinct categorical outcomes), I can set the minNumObj parameter to, say, 200. This means: create a tree with a minimum of 200 instances in each leaf.
java -cp $WEKA_JAR/weka.jar weka.classifiers.trees.J48 -C 0.25 -M 200 -t data/diabetes.arff
Output:
J48 pruned tree
------------------
plas <= 127: tested_negative (485.0/94.0)
plas > 127: tested_positive (283.0/109.0)
Number of Leaves : 2
Size of the tree : 3
Time taken to build model: 0.11 seconds
Time taken to test model on training data: 0.04 seconds
=== Error on training data ===
Correctly Classified Instances 565 73.5677 %
This creates a tree with one split on the "plas" attribute. For interpretation, this makes sense, because indeed, patients with diabetes have an elevated concentration of glucose in their blood plasma. So "plas" is the most important attribute, as it was chosen for the first split. But this does not tell you how important.
For a more quantitative approach, maybe you can use (Multinomial) Logistic Regression. I'm not so familiar with this, but anyway:
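In the Explorer GUI tool, choose "Classify" > Functions > Logistic.
Run the model. The coefficients and odds ratios might contain what you need in a quantifiable manner. Each odds ratio is exp(coefficient), so values further from 1 (in either direction) indicate a stronger effect per unit change of that attribute, but the magnitude depends on the attribute's scale, and I'm not sure how far this can be read as significance. Maybe read on here, this answer by someone else.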
java -cp $WEKA_JAR/weka.jar weka.classifiers.functions.Logistic -R 1.0E-8 -M -1 -t data/diabetes.arff
Here's the command line output
Options: -R 1.0E-8 -M -1
Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
Class
Variable tested_negative
============================
preg -0.1232
plas -0.0352
pres 0.0133
skin -0.0006
insu 0.0012
mass -0.0897
pedi -0.9452
age -0.0149
Intercept 8.4047
Odds Ratios...
Class
Variable tested_negative
============================
preg 0.8841
plas 0.9654
pres 1.0134
skin 0.9994
insu 1.0012
mass 0.9142
pedi 0.3886
age 0.9852
=== Error on training data ===
Correctly Classified Instances 601 78.2552 %
Incorrectly Classified Instances 167 21.7448 %
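To connect the two tables in this output: each odds ratio is the exponential of the corresponding coefficient. The short check below uses the values copied from the output above; the ranking by absolute coefficient is only an illustration and is sensitive to each attribute's units.

```python
import math

# Coefficients and odds ratios copied from the Weka output above
# (class tested_negative).
coefficients = {
    "preg": -0.1232, "plas": -0.0352, "pres": 0.0133, "skin": -0.0006,
    "insu": 0.0012, "mass": -0.0897, "pedi": -0.9452, "age": -0.0149,
}
odds_ratios = {
    "preg": 0.8841, "plas": 0.9654, "pres": 1.0134, "skin": 0.9994,
    "insu": 1.0012, "mass": 0.9142, "pedi": 0.3886, "age": 0.9852,
}

for name, coef in coefficients.items():
    print(f"{name:5s} coef={coef:+.4f}  exp(coef)={math.exp(coef):.4f}  "
          f"reported={odds_ratios[name]:.4f}")

# Rank by effect per unit change. This depends on each attribute's scale,
# so it is not a scale-free measure of importance.
by_strength = sorted(coefficients, key=lambda k: abs(coefficients[k]), reverse=True)
print(by_strength)
```

Here "pedi" comes first and "plas" only fourth, even though "plas" was chosen for the single tree split, because the two attributes are measured on very different scales. Standardizing the attributes before fitting would make the coefficients more comparable.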

How to do prediction with weka

I'm using Weka to do some text mining and I'm a little confused, so I'm here to ask: given a set of comments that are in some way classified (as notes, status of work, non-conformity, or warning), how can I predict whether a new comment belongs to a specific class? I preprocessed all the comments (9551) with the "StringToWordVector" filter to obtain a vector of tokens, and then used SimpleKMeans to obtain a number of clusters.
So the question is: if a user posts a new comment, can I predict from this data whether it belongs to a category of comment?
Sorry if my question is a little confused, but so am I.
Thank you.
Trivial Training-validation-test
Create two datasets from your labelled instances. One will be training set and the other will be validation set. The training set will contain about 60% of the labelled data and the validation will contain 40% of the labelled data. There is no hard and fast rule for this split, but a 60-40 split is a good choice.
Use K-means (or any other clustering algorithm) on your training data. Develop a model. Record the model's error on training set. If the error is low and acceptable, you are fine. Save the model.
For now, your validation set will be your test dataset. Apply the model you saved on your validation set. Record the error. What is the difference between training error and validation error? If they both are low, the model's generalization is "seemingly" good.
Prepare a test dataset that has all the features of your training and validation datasets but where the class/cluster is unknown.
Apply the model on the test data.
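The 60-40 split in the first step can be sketched as below. The helper name `train_validation_split` and the fixed seed are assumptions, and the integers stand in for the 9551 labelled comments from the question.

```python
import random


def train_validation_split(instances, train_fraction=0.6, seed=42):
    """Shuffle a copy of the instances and cut it into two parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Placeholder instances standing in for the 9551 labelled comments.
instances = list(range(9551))
train, validation = train_validation_split(instances)
print(len(train), len(validation))
```

Shuffling before cutting matters when the labelled data is ordered, for example by class or by date, since otherwise one part may miss whole categories.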
10-fold cross validation
Use all of your labelled data instances for this task.
Apply K-means (or any other algorithm of your choice) with a 10-fold CV setup.
Record the training error and CV error. Are they low? Is the difference between the errors low? If yes, then save the model and apply it on the test data whose class/cluster is unknown.
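For reference, the fold bookkeeping behind a 10-fold setup looks roughly like this. Weka handles it internally when cross-validation is selected, so the generator `k_fold_indices` is a hypothetical illustration that assumes the instances were already shuffled.

```python
def k_fold_indices(n, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds.

    Assumes the n instances are already in random order. Fold sizes
    differ by at most one instance.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(9551, k=10))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Each instance is used for testing exactly once and for training in the other nine rounds; the CV error is the average of the ten test errors.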
NB: The training/test/validation errors and their differences will give you a "very initial" idea of whether your model overfits or underfits. They are sanity tests. You need to perform other tests, like learning curves, to see whether your model overfits, underfits, or fits well. If there appears to be an overfitting or underfitting problem, you need to try different techniques to overcome it.