I am trying to test a machine learning model produced from a training dataset that is triple the size of my test dataset. When I load my test dataset into Weka, I get a message asking if I would like to use the "InputMappedClassifier". I clicked yes, and my results showed multiple question marks and a confusion matrix with very few entries in it. Is there a way to get around this and improve the output?
Thank you for your help!
I have searched for suggestions and solutions, but I have been unable to find any.
After reading some blogs, I am able to build a time series anomaly detection model using BigQuery ML (ARIMA_PLUS).
My question is: how do I put such a model in production?
Probably I need to:
program the re-training of the model every X days
check whether there are new anomalies on the object table every X hours
record those anomalies in another table
But I am also open to other suggestions on how to proceed.
Can anyone give me a hint?
Thank you!
The best way I found is to create "scheduled queries":
schedule a query for re-training of the model every X days:
CREATE OR REPLACE MODEL `mymodel`
OPTIONS(
  MODEL_TYPE = 'ARIMA_PLUS',
  TIME_SERIES_DATA_COL = 'events',
  TIME_SERIES_TIMESTAMP_COL = 'approx_hour',
  HOLIDAY_REGION = 'GLOBAL',
  CLEAN_SPIKES_AND_DIPS = FALSE,
  DECOMPOSE_TIME_SERIES = TRUE
) AS (
  SELECT
    TIMESTAMP_TRUNC(PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%E*SZ', start_time), HOUR) AS approx_hour,
    COUNT(1) AS events
  FROM `mytable`
  GROUP BY approx_hour
);
schedule a query to perform anomaly detection on the latest events, and write any it finds to a table:
INSERT INTO `events_anomalies_table`
SELECT
  approx_hour AS hour,
  CAST(events AS INT64) AS actual_events,
  CAST(lower_bound AS INT64) AS expected_min_events,
  CAST(upper_bound AS INT64) AS expected_max_events,
  CURRENT_TIMESTAMP() AS execution_timestamp
FROM ML.DETECT_ANOMALIES(
  MODEL `mymodel`,
  STRUCT(0.98 AS anomaly_prob_threshold),
  (
    SELECT
      TIMESTAMP_TRUNC(PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%E*SZ', start_time), HOUR) AS approx_hour,
      COUNT(1) AS events
    FROM `mytable`
    WHERE PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%E*SZ', start_time) > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
    GROUP BY approx_hour
    LIMIT 1
  )
)
WHERE is_anomaly = TRUE;
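For intuition, this is (roughly) the filter the second query applies after ML.DETECT_ANOMALIES: rows whose actual count falls outside the model's prediction interval are kept as anomalies. A minimal pure-Python sketch, with column names mirroring the query's aliases (the sample rows are made up for illustration):

```python
# Illustrative only: mimics the row filter applied after ML.DETECT_ANOMALIES,
# flagging hourly event counts outside the model's [lower_bound, upper_bound].

def find_anomalies(rows):
    """Keep rows whose actual_events fall outside [lower_bound, upper_bound]."""
    anomalies = []
    for row in rows:
        if not (row["lower_bound"] <= row["actual_events"] <= row["upper_bound"]):
            anomalies.append({
                "hour": row["hour"],
                "actual_events": row["actual_events"],
                "expected_min_events": int(row["lower_bound"]),
                "expected_max_events": int(row["upper_bound"]),
            })
    return anomalies

rows = [
    {"hour": "2023-05-01T10:00", "actual_events": 120, "lower_bound": 80.0, "upper_bound": 140.0},
    {"hour": "2023-05-01T11:00", "actual_events": 310, "lower_bound": 85.0, "upper_bound": 150.0},
]
print(find_anomalies(rows))  # only the 11:00 row is outside its interval
```

In BigQuery the interval width is controlled by `anomaly_prob_threshold` (0.98 above): a higher threshold widens the expected band and flags fewer rows.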
I am fairly new to machine learning and I am trying to use WEKA (GUI) to implement a neural network on a sports data set. My issue is that I want my inputs to be arrays (each array is a contestant with stats such as speed, win rate, etc.). How can I tell WEKA that each input is an array of values?
You can define the attributes in an .arff file; see the ARFF documentation on the Weka site for details.
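Note that Weka has no "array" attribute type: you flatten each array into one attribute per element. A minimal .arff sketch (the attribute names here are made-up examples, not anything Weka requires):

```
@relation contestants

@attribute speed    numeric
@attribute winrate  numeric
@attribute outcome  {win,lose}

@data
7.2,0.61,win
5.4,0.48,lose
```

Each @attribute line declares one element of your "array", and each row of @data is one contestant.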
Or, after opening your data in Weka, you can convert it with the help of some filters. I do not know the current format of your data, but if you can open it in Weka, you can edit it with many filters. Note that artificial neural networks only accept numerical values, and among these filters there are ones that convert nominal data to numerical data. If you are new to this area, I recommend watching the WekaMOOC videos (made by the Weka developers); I think they will be very useful. Good luck.
I am developing an analysis procedure using Abaqus & Fortran as follows:
Generate the FEM model in Abaqus and run a modal analysis there, which produces a .dat results file.
With Fortran, read the .dat file to get the FEM model and modal results, then calculate the displacement history for each node using my own algorithm.
Final purpose (which I do not know how to achieve yet): get the strain/stress time history of each node, or of several chosen nodes.
The displacement history (from step 2) can be written to a binary file. Can I use Python to read it and call Abaqus to calculate stress or strain, then write them into an ODB database? Which type of analysis step should I use?
Or can it be done entirely in a post-processing procedure, and if so, how?
Does anyone have similar experience who could give me some advice on this issue?
Thanks a lot!
I am trying to learn Weka. I am using a data set which has three classes of activity. I am trying to build a classifier, use ten-fold cross-validation, and tabulate the accuracy. However, I can't tell which data belongs to which class. How do I proceed? I am not sure how to upload the data set here. Any help would be appreciated.
In order to get results using k-fold cross-validation, your data points must have class labels. For instance, if I give you a set of data and ask you to classify it into three classes, but I do not know the true classes of the data points, then when you classify them and return them to me, how do I calculate your classification accuracy?
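The point can be made concrete in a few lines of Python: accuracy is defined as a comparison between predicted labels and known true labels, so without the true labels there is nothing to compare against (the activity labels below are made up for illustration):

```python
# Accuracy needs ground-truth labels: it is the fraction of predictions
# that match the known true class.

def accuracy(true_labels, predicted_labels):
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

y_true = ["walk", "run", "sit", "run"]
y_pred = ["walk", "sit", "sit", "run"]
print(accuracy(y_true, y_pred))  # 0.75
```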
I'm using Weka to do some text mining. I'm a little bit confused, so I'm here to ask: given a set of comments that are classified in some way (notes, status of work, non-conformity, warning), how can I predict whether a new comment belongs to a specific class? With all the comments (9,551) I have done a preprocessing step, obtaining a vector of tokens with the StringToWordVector filter, and then I used SimpleKMeans to obtain a number of clusters.
So the question is: if a user posts a new comment, can I predict with those data whether it belongs to a category of comment?
Sorry if my question is a little bit confused, but so am I.
Thank you!
Trivial Training-validation-test
Create two datasets from your labelled instances: one will be the training set and the other will be the validation set. The training set will contain about 60% of the labelled data and the validation set the remaining 40%. There is no hard and fast rule for this split, but 60-40 is a good choice.
Use K-means (or any other clustering algorithm) on your training data to develop a model. Record the model's error on the training set. If the error is low and acceptable, you are fine. Save the model.
For now, your validation set will act as your test dataset. Apply the model you saved to your validation set and record the error. What is the difference between the training error and the validation error? If both are low, the model's generalization is "seemingly" good.
Prepare a test dataset that has all the features of your training and validation data but whose class/cluster is unknown.
Apply the model on the test data.
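The split-train-validate recipe above can be sketched in pure Python with a tiny one-dimensional k-means (no Weka involved; the data and the error measure, mean squared distance to the nearest centroid, are chosen just for illustration):

```python
# Minimal sketch: 60-40 split, fit k-means on the training part,
# then compare training error vs validation error.
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on 1-D data."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centroids[c]) ** 2)
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def error(points, centroids):
    """Mean squared distance of each point to its nearest centroid."""
    return sum(min((p - c) ** 2 for c in centroids) for p in points) / len(points)

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 1.1, 5.1, 0.9, 5.2]
random.Random(42).shuffle(data)
split = int(len(data) * 0.6)          # 60-40 split
train, valid = data[:split], data[split:]

model = kmeans_1d(train, k=2)
print(error(train, model), error(valid, model))  # both should be small here
```

If the validation error were much larger than the training error, that would be the first hint of poor generalization.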
10-fold cross validation
Use all of your labelled data instances for this task.
Apply K-means (or any other algorithm of your choice) with a 10-fold CV setup.
Record the training error and CV error. Are they low? Is the difference between the errors low? If yes, then save the model and apply it to the test data whose class/cluster is unknown.
NB: The training/validation/test errors and their differences give you only a "very initial" idea of whether your model overfits or underfits; they are sanity tests. You need to perform other checks, such as learning curves, to see whether your model overfits, underfits, or fits well. If an overfitting or underfitting problem appears, you will need to try different techniques to overcome it.
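The bookkeeping behind k-fold CV is simple to sketch: partition the instance indices into k folds, train on k-1 of them, evaluate on the held-out fold, and average the errors. A minimal index generator (the fold sizes here are a simple equal split, with the remainder going to the last fold):

```python
# k-fold cross-validation index bookkeeping (here n=20 instances, k=10).

def kfold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    fold = n // k
    for i in range(k):
        test = list(range(i * fold, (i + 1) * fold if i < k - 1 else n))
        train = [j for j in range(n) if j not in test]
        yield train, test

folds = list(kfold_indices(20, 10))
print(len(folds))   # 10 folds
print(folds[0][1])  # first held-out fold: [0, 1]
```

Weka does this for you when you select "Cross-validation, Folds: 10" in the Explorer; the sketch just shows what that option means.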