I am trying to select the best attributes for my training dataset, which contains numeric attributes. Which attribute evaluator/method would yield the best results for about 10 or so attributes? The training dataset is about 1,400 lines of population statistics data.
I am trying to build a binary classifier based on a tabular dataset that is rather sparse, but training is failing with the following message:
Training pipeline failed with error message: Too few input rows passed validation. Of 1169548 inputs, 194 were valid. At least 50% of rows must pass validation.
My understanding was that tabular AutoML should be able to handle Null values, so I'm not sure what's happening here, and I would appreciate any suggestions. The documentation explicitly mentions reviewing each column's nullability, but I don't see any way to set or check a column's nullability on the dataset tab (perhaps the documentation is out of date?). Additionally, the documentation explicitly mentions that missing values are treated as null, which is how I've set up my CSV. The documentation for numeric however does not explicitly list support for missing values, just NaN and inf.
The dataset is 1 million rows, 34 columns, and only 189 rows are null-free. My most sparse column has data in 5,000 unique rows, with the next rarest having data in 72k and 274k rows respectively. Columns are a mix of categorical and numeric, with only a handful of columns without nulls.
The data is stored as a CSV, and the Dataset import seems to run without issue. Generate statistics ran on the dataset, but for some reason the missing % column failed to populate. What might be the best way to address this? I'm not sure whether I need to change my null representation in the CSV, change some dataset/training setting, or whether it's an AutoML bug (less likely). Thanks!
To allow invalid and null values during training and prediction, you have to explicitly set the allow invalid values flag to Yes during training. You can find this setting under the model training settings on the dataset page. The flag has to be set on a column-by-column basis.
I tried @Kabilan Mohanraj's suggestion and it resolved my issue. What I had to do was click the dropdown for each column to allow invalid values into training. After making this change, all rows passed validation and my model was able to train without issue. I'd initially assumed that missing values would not count as invalid, which was incorrect.
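Since the missing % column did not populate in the console, the sparsity of each column can be checked locally before training. The following is only a minimal sketch using Python's standard library: the function name `missing_report`, the sample data, and the assumption that an empty field is the null representation are all illustrative and not taken from the Vertex AI documentation.

```python
import csv
import io

def missing_report(csv_text, null_tokens=("",)):
    """Return (fraction of missing values per column, number of null-free rows).

    Assumption: a field counts as missing when, after stripping whitespace,
    it equals one of `null_tokens` (by default only the empty string).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    total_rows = 0
    complete_rows = 0
    for row in reader:
        total_rows += 1
        row_is_complete = True
        for name in missing:
            value = row.get(name)
            if value is None or value.strip() in null_tokens:
                missing[name] += 1
                row_is_complete = False
        if row_is_complete:
            complete_rows += 1
    if total_rows == 0:
        return {name: 0.0 for name in missing}, 0
    fractions = {name: count / total_rows for name, count in missing.items()}
    return fractions, complete_rows

# Made-up sample with the same kind of sparsity described in the question.
sample = "age,income,label\n34,,1\n,52000,0\n41,61000,1\n,,0\n"
fractions, complete_rows = missing_report(sample)
print(fractions)
print(complete_rows)
```

Columns with a high missing fraction are the ones that need the allow invalid values setting described in the answer above.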
I am testing out Google Cloud Vertex AI with a time series AutoML model.
I have created a dataset from a BigQuery table with 2 columns, one a timestamp and the other a numeric value I want to predict:
salesorderdate is my TIMESTAMP column and orders is the value I want to predict.
When I proceed to the next step I cannot select orders as my value to predict; there are no available options for this field.
What am I missing here? Surely the time series value is the target value in this case? Is there an expectation of more fields here, and can one actually add additional features as columns to a time series model in this way?
I guess from your question that you are using "forecasting models". Please note that this feature is in the "Preview" product launch stage, with all the consequences of that.
In the documentation, under Training data structure, you can find the following information:
There must be at least two and no more than 1,000 columns.
For datasets that train AutoML models, one column must be the target,
and there must be at least one feature available to train the model.
If the training data does not include the target column, Vertex AI
cannot associate the training data with the desired result.
I suppose you are using AutoML models, so in this situation you need to have 3 columns in the dataset:
Time column - used to place the observation represented by that row in time
Time series identifier column - needed because "Forecasting training data usually includes multiple time series"
Target column - the value that the model should learn to predict.
If you want to predict orders, that should be the target column. However, the time series identifier column is chosen in a previous step, before the target, so with only two columns there is no column left to select as the target.
So you need to add at least one additional column to your BigQuery table, which will be used as the time series identifier column. You can add a column with the same value in each row. This concept is presented in Forecasting data preparation best practices:
You can train a forecasting model on a single time series (in other
words, the time series identifier column contains the same value for
all rows). However, Vertex AI is a better fit for training data that
contains two or more time series. For best results, you should have at
least 10 time series for every column used to train the model.
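The idea of a single time series with a constant identifier can be sketched as follows. This is an illustration only: the column name `series_id`, the value `all_orders`, and the sample rows are hypothetical, and in practice the column would be added to the BigQuery table itself rather than to in-memory rows.

```python
def add_series_id(rows, column="series_id", value="all_orders"):
    """Return copies of the rows with a constant identifier column added.

    Every row gets the same value, so the whole table is treated as one
    time series. The input rows are not modified.
    """
    return [{**row, column: value} for row in rows]

# Made-up rows using the two column names from the question.
rows = [
    {"salesorderdate": "2021-01-01 00:00:00", "orders": 12},
    {"salesorderdate": "2021-01-02 00:00:00", "orders": 17},
]
with_id = add_series_id(rows)
for row in with_id:
    print(row)
```

With the third column in place, the identifier can be selected in the earlier step and orders remains available as the target.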
I have a dataset of about 7,000 records. After cleaning, I performed normalization and discretization operations on it. Then I trained a J48 model on it and saved the model to my computer. Now I want to test this model on a dataset of 500 records. All columns in this dataset are the same as in the original dataset. However, the "class" column in the test dataset has no value, and I got an error. For this reason, I also applied the normalization and discretization operations to the test dataset, but I still get this error. Note that I specified the class attribute in both datasets, but the error was still displayed.
This is a screenshot of my test file: [image: test.arff screenshot]
This is a screenshot of my training dataset file: [image]
And these are screenshots of the errors: [image]
Thanks for the screenshots. The attribute "code" does not have the same values in the training and test set.
It looks like that is a case identifier, so you wouldn't expect the values to be the same. So, instead of having this as a nominal attribute, treat it as a numeric attribute.
@attribute code numeric
Let me know if this fixes the problem.
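As a rough sketch of the suggested change, the header line could be rewritten with a small script instead of by hand. The helper name `make_attribute_numeric` and the sample file contents are made up for illustration. It assumes an unquoted attribute name and values that are valid numbers. Because Weka expects the training and test headers to match, the same change would presumably have to be applied to the training file and the model retrained.

```python
import re

def make_attribute_numeric(arff_text, name):
    """Rewrite the declaration of attribute `name` to type numeric.

    Only header lines (those before @data) are touched. Assumes the
    attribute name is unquoted and contains no spaces.
    """
    pattern = re.compile(
        r"^(\s*@attribute\s+" + re.escape(name) + r")\s+.*$",
        re.IGNORECASE,
    )
    out = []
    in_header = True
    for line in arff_text.splitlines():
        if in_header and line.strip().lower().startswith("@data"):
            in_header = False
        elif in_header:
            line = pattern.sub(r"\1 numeric", line)
        out.append(line)
    return "\n".join(out)

# Made-up test file in which "code" was declared as a nominal attribute.
sample = "\n".join([
    "@relation test",
    "@attribute code {101,102}",
    "@attribute age numeric",
    "@attribute class {yes,no}",
    "@data",
    "101,30,?",
    "102,45,?",
])
fixed = make_attribute_numeric(sample, "code")
print(fixed)
```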
I am doing multiple linear regression using SAS. I have divided the data into train and test sets in a 70%/30% ratio. I have used PROC REG to build a model on the training data. I want to use this model to get predicted values on the test data. How would I do that?