I am trying to build a binary classifier based on a tabular dataset that is rather sparse, but training is failing with the following message:
Training pipeline failed with error message: Too few input rows passed validation. Of 1169548 inputs, 194 were valid. At least 50% of rows must pass validation.
My understanding was that tabular AutoML should be able to handle null values, so I'm not sure what's happening here, and I would appreciate any suggestions. The documentation explicitly mentions reviewing each column's nullability, but I don't see any way to set or check a column's nullability on the dataset tab (perhaps the documentation is out of date?). Additionally, the documentation explicitly says that missing values are treated as null, which is how I've set up my CSV. The documentation for numeric columns, however, does not explicitly list support for missing values, only NaN and inf.
The dataset is 1 million rows, 34 columns, and only 189 rows are null-free. My sparsest column has data in 5,000 unique rows, and the next two rarest have data in 72k and 274k rows respectively. Columns are a mix of categorical and numeric, with only a handful of columns without nulls.
The data is stored as a CSV, and the Dataset import seems to run without issue. Generate statistics ran on the dataset, but for some reason the missing % column failed to populate. What might be the best way to address this? I'm not sure if this is a case where I need to change my null representation in the CSV, change some dataset/training setting, or if it's an AutoML bug (less likely). Thanks!
To allow invalid & null values during training & prediction, we have to explicitly set the allow invalid values flag to Yes during training as shown in the image below. You can find this setting under model training settings on the dataset page. The flag has to be set on a column by column basis.
I tried @Kabilan Mohanraj's suggestion and it resolved my issue. What I had to do was click the dropdown to allow invalid values into training. After making this change, all rows passed validation and my model was able to train without issue. I'd initially assumed that missing values would not count as invalid, which was incorrect.
Related
So, I've got 3 xlsx files full of data that is already processed, so I pretty much just have to display the data using graphs. The problem seems to be that Power BI aggregates all numeric data (using count, sum, etc.). In their community they suggest creating new measures, but in that case I would have to create a lot of measures. Also, I tried converting the data to text, and even so, Power BI counts it!
Any help, please?
There are several ways to tackle this:
When you pull a field into the field well for a visualisation, you can click the drop-down in the field well and select "Don't summarize".
In the data model, select the column and, on the ribbon, select "Don't summarize" as the summarization option in the Properties group.
The screenshot shows the field well option on the left and the data model options on the right, one for a numeric and one for a text field.
And, yes, you never want to use the implicit measures, i.e. the automatic calculations that Power BI creates. If you want to keep on top of what is being calculated, create your own measures, and yes, there will be many.
Edit: If by "aggregating" you are referring to the fact that text values will be grouped in a table (you don't see any duplicates), then you need to add a column with unique values to the table so all the duplicates of the text values show up. This can be done in the data source by adding an Index column, then using that Index column in the table and setting it to a very narrow with to make it invisible.
I have a dataset of about 7,000 records. After cleaning it, I performed normalization and discretization operations on it. Then I applied a J48 model to it and saved the model to my computer. Now I want to test this model on a dataset of 500 records. All columns in this dataset are the same as in the original dataset; however, the "class" column in the test dataset has no values, and I got an error. For this reason, I also applied normalization and discretization operations to the test dataset, but I still get the error. Note that I specified the class attribute in both datasets, and the error was still displayed.
(Screenshots of the test file test.arff, the training dataset file, and the resulting errors were attached to the question.)
Thanks for the screenshots. The attribute "code" does not have the same values in the training and test set.
It looks like that is a case identifier, so you wouldn't expect the values to be the same. So, instead of having this as a nominal attribute, treat it as a numeric attribute.
@attribute code numeric
Let me know if this fixes the problem.
I am working on a project where I'm reading raw census data into SAS Enterprise Guide to be processed into a different merged output. The first few columns are character fields serving as geographic identifiers.
The rest of the raw data contains numeric fields, all named like "HD01_VD01" and so on up through "HD01_VD78". However, with census data numbers occasionally get suppressed, and some observations have "*****" in the raw data. Whenever that happens, SAS reads the field in as character instead of numeric.
What would be a good way to ensure that any "HD01_VD(whatever number)" field is always numeric, converting "*****" to a missing value (".") so the field stays numeric?
I don't want to hard-code every instance of a field being read in as a character back to numeric because my code is working with many different census tables. Would a macro variable be the way to do this? An if statement in each census table's data step?
Using arrays and looping over them would be the best option, as mentioned in the comment by david25272.
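A minimal sketch of that array approach, assuming the suppressed fields have already been read in as character; the dataset name (work.census_raw), the n_ prefix, and the short list of HD01_VDxx variables are placeholders for your actual names:

data work.census_clean;
    set work.census_raw;
    /* census fields that came in as character because of the "*****" values */
    array rawchar {*} HD01_VD01 HD01_VD02 HD01_VD78;
    /* matching numeric variables to hold the converted values */
    array outnum  {*} n_HD01_VD01 n_HD01_VD02 n_HD01_VD78;
    do i = 1 to dim(rawchar);
        /* ?? suppresses the invalid-data notes, so "*****" quietly becomes . */
        outnum{i} = input(rawchar{i}, ?? best12.);
    end;
    drop i;
run;

If you want to keep the original column names, drop the character versions afterwards and rename the n_ variables back.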
Another option is to change the format of the fields in Enterprise Guide either in:
Import Task that reads the files: change the field to numeric
or
Add a Query Builder Task: create a calculated field and use an advanced expression such as input(HD02_V36, 11.)
In Power BI, I've got some query tables generated from imported data. All the data comes in as type 'Any', and I'm trying to automatically detect the type of the data in each column.
Some of the queries generate tables with columns based on the incoming data - I don't know what the columns are going to be until the query runs and sets up the table (the data comes from an Azure blob). As I will have quite a few tables to maintain, whose columns can change (possibly with new columns being added) with any data refresh, it would be unmanageable to go through all of them each time and press 'Detect Data Type' on the columns.
So I'm trying to figure out how I can do a 'Detect Data Type' in the query formula language to attach to the end of the query that generates the table columns. I've tried grabbing the first entry in a column and doing Value.Type(column{0}); however, this seems to come out as 'Text' for a column that has integers in it. Pressing 'Detect Data Type' does, however, correctly identify the type as 'Whole Number'.
Does anyone know how to detect a column's entry types?
P.S. I'm not too worried about a column possibly holding values of different data types
You seem to have multiple issues here, and your solution will be fragile; there's a better way. But let's first deal with column type detection. Power Query uses the 'any' data type as its go-to data type. You can write a function that samples the rows of a column in a table, does a best-match data type detection, and then explicitly sets the data type of the column. This is probably messy and tricky, since you need to do it once per column.
It might be workable for a fixed schema, but with a dynamic schema you'll run into a couple of things very quickly. First, you'll need to write some crazy PQ code to list all the columns and run your function on each. This will work the first time, but it might break in subsequent refreshes because data model changes are not allowed during refresh. If you're using a tool like Power BI Desktop, you'll be able to fix things up. If you publish your report to the Power BI service, you'll just see refresh errors.
Dynamic Schemas will suffer the same data model change issue I mentioned above.
The alternate solution that you won't have problems with is using a Direct Query data source instead of using Power Query. If you load your data into Azure SQL or a Tabular Model, the reporting layer will get the updated fields automatically so you don't have to try to work around using PQ.
I have a problem. I am pulling data from a Teradata database directly into SAS. The data looks like this:
id fragmentId fragment
1 34 (some text)
2 67 (some text)
3 89 (some text)
.......
The problem is that the fragment field contains text of 10 pages or even more (up to 30,000,000 characters). Thus in SAS the column gets truncated and I lose data.
How can I increase the limit for a SAS column that would contain text?
(PS: I have looked up the dbmax_text option as @Joe suggested. However, it appears that this option applies to every DBMS except Teradata.)
How can I code it?
Teradata indeed does not support DBMAX_TEXT. It also does not seem to support character sizes nearly as large as you list; the documentation page for Teradata lists a maximum of 64,000 bytes, and further, SAS can only hold a maximum of 32,767 characters in one column.
In your case, you may want to consider splitting the column in-database into 32,767-byte chunks (or whatever makes logical sense for your needs). Do that in pass-through in a view, and then read the data in from that view.
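A rough sketch of the pass-through idea, done inline rather than through a permanent view; the connection options and the table name mydb.fragments are placeholders, and you would extend the substr() pattern for as many chunks as your longest fragment needs:

proc sql;
    connect to teradata (user=myuser password=mypass server=myserver);
    create table work.fragments as
    select * from connection to teradata (
        select id,
               fragmentId,
               /* each piece fits within SAS's 32,767-character limit */
               substr(fragment, 1,     32000) as fragment_part1,
               substr(fragment, 32001, 32000) as fragment_part2,
               substr(fragment, 64001, 32000) as fragment_part3
        from mydb.fragments
    );
    disconnect from teradata;
quit;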
-- Previous information (helpful for DBMSs other than Teradata, not helpful here) --
Odds are you need to change the dbmax_text option to something larger - it tends to default to 1024.
You can change it in the pull (the DATA step or SQL query) as a dataset option, or change it in the database LIBNAME statement.
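For example, a small sketch of both forms for an engine that supports the option (the libref, connection details, and table names below are placeholders, and as noted above this does not apply to Teradata):

/* raise the default for everything read through this libref */
libname mydb oracle user=myuser password=mypass path=mypath dbmax_text=32767;

/* or override it for a single pull as a dataset option */
data work.fragments;
    set mydb.fragments (dbmax_text=32767);
run;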
See the documentation page for more information.