SAS: How to overcome the cell storage limit issue?

I have a problem. I am pulling data from Teradata Database directly into SAS. The data looks like this:
id fragmentId fragment
1 34 (some text)
2 67 (some text)
3 89 (some text)
.......
The problem is that the fragment field contains text of 10 pages and even more (30,000,000 characters). Thus in SAS the column gets truncated and I lose data.
How can I increase the limit for a SAS column that would contain this text?
(PS: I have looked up the dbmax_text option as @Joe suggested. However, it appears that this option applies to any DBMS except Teradata.)
How can I code it?

Teradata indeed does not support DBMAX_TEXT. It also does not seem to support character sizes nearly as large as you list; the doc page for Teradata lists a maximum of 64,000 bytes. Further, SAS is only able to hold a maximum of 32,767 characters in one column.
In your case, you may want to consider splitting the column in-database into 32,767-byte chunks (or whatever makes logical sense for your needs). Do that in pass-through in a view, and then read the data in from that view.
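As a rough illustration, a minimal pass-through sketch might look like the following. The connection options, the database and table names, and the two-chunk split are assumptions for illustration only; adjust them to your environment, or wrap the inner query in a database view as suggested above.
proc sql;
   /* connection options are placeholders */
   connect to teradata (user=xxxx password=xxxx server=xxxx);
   create table work.fragments as
   select * from connection to teradata (
      select id,
             fragmentId,
             substr(fragment, 1, 32000)     as fragment_part1,
             substr(fragment, 32001, 32000) as fragment_part2
      from mydb.fragment_table
   );
   disconnect from teradata;
quit;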
-- Previous information (helpful for DBMSs other than Teradata, not helpful here) --
Odds are you need to change the dbmax_text option to something larger - it tends to default to 1024.
You can change it in the pull (the data step or sql query) as a dataset option, or change it in the database libname statement.
See the documentation page for more information.
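For a DBMS that does support it, a sketch of both approaches might look like this (the Oracle engine, credentials, and table names below are placeholders, not from the original question):
/* as a LIBNAME option */
libname mydb oracle user=xxxx password=xxxx path=xxxx dbmax_text=32767;

/* or as a dataset option on the pull itself */
data work.pull;
   set mydb.bigtable (dbmax_text=32767);
run;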

Related

VertexAI Tabular AutoML rejecting rows containing nulls

I am trying to build a binary classifier based on a tabular dataset that is rather sparse, but training is failing with the following message:
Training pipeline failed with error message: Too few input rows passed validation. Of 1169548 inputs, 194 were valid. At least 50% of rows must pass validation.
My understanding was that tabular AutoML should be able to handle Null values, so I'm not sure what's happening here, and I would appreciate any suggestions. The documentation explicitly mentions reviewing each column's nullability, but I don't see any way to set or check a column's nullability on the dataset tab (perhaps the documentation is out of date?). Additionally, the documentation explicitly mentions that missing values are treated as null, which is how I've set up my CSV. The documentation for numeric however does not explicitly list support for missing values, just NaN and inf.
The dataset is 1 million rows, 34 columns, and only 189 rows are null-free. My most sparse column has data in 5,000 unique rows, with the next two rarest having data in 72k and 274k rows, respectively. Columns are a mix of categorical and numeric, with only a handful of columns without nulls.
The data is stored as a CSV, and the Dataset import seems to run without issue. Generate statistics ran on the dataset, but for some reason the missing % column failed to populate. What might be the best way to address this? I'm not sure if this is a case where I need to change my null representation in the CSV, change some dataset/training setting, or if it's an AutoML bug (less likely). Thanks!
To allow invalid & null values during training & prediction, we have to explicitly set the allow invalid values flag to Yes during training as shown in the image below. You can find this setting under model training settings on the dataset page. The flag has to be set on a column by column basis.
I tried @Kabilan Mohanraj's suggestion and it resolved my issue. What I had to do was click the dropdown to allow invalid values into training. After making this change, all rows passed validation and my model was able to train without issue. I'd initially assumed that missing values would not count as invalid, which was incorrect.

Power Query Preview Top and Order By

Is there a way to keep Power Query from using ORDER BY when getting preview data?
I am trying to work with a table in SQL Server that contains 373 million records. The Power Query Editor wants to produce a preview. The M code for the "Navigate" step looks like...
Source{[Schema="dbo",Item="TableName"]}[Data]
...and it produces SQL that looks like...
select top 4096 [$Ordered].[ThisID]
    , [$Ordered].[ThatID]
    , [$Ordered].<bunch of other columns>
from [dbo].[TableName] as [$Ordered]
order by [$Ordered].[ThisID]
    , [$Ordered].[ThatID]
ThisID and ThatID are ints. These two columns make up the primary key, so the combination of them is indexed. But indexed or not, this query wants to order 373 million records by 2 variables before returning a handful of rows that may as well be random given the context. This query could take a very long time (hours?) to run, so it times out and I have nothing to work with (not even column names). What I need would take less than 1 second. (Basically, remove the ORDER BY clause.)
How can I change the number of rows returned? I think I need 100 - 400, not 4096.
How can I tell the Power Query Editor to not care about which records are in the preview (no ORDER BY)?
These feel like they should be settings (not M code) but I am not seeing anything in the options to control this behavior.
I have a bunch of other tables in my model. Will this one table need to be done using pass-thru SQL? Will that still work in Direct Query mode?
Can this problem be solved by changing the M code in the Advanced Editor?

How to convert several fields in SAS to numeric?

I am working on a project where I'm reading raw census data into SAS enterprise guide to be processed as a different merged output. The first few columns are character fields, serving as geographic identifiers.
The rest of the raw data contains numeric fields, all named like "HD01_VD01" and so on up through numbers like "HD01_VD78". However, occasionally with census data numbers get suppressed, and some observations have "*****" in the raw data, as in the picture below. Whenever that happens, SAS reads the numeric field in as character.
What would be a good way to ensure that anytime an "HD01_VD(whatevernumber)" is always numeric and converts "*****" to a blank/missing value like "." thus keeping the field numeric?
I don't want to hard-code every instance of a field being read in as a character back to numeric because my code is working with many different census tables. Would a macro variable be the way to do this? An if statement in each census table's data step?
Using arrays and looping over them would be the best option, as mentioned in the comment by david25272.
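A minimal sketch of that approach, assuming all of the HD01_VD fields were read in as character and that the input table is called raw_census (the table name, the new variable names, and the count of 78 are illustrative):
data want;
   set raw_census;                              /* assumed input table name    */
   array rawchar {*} HD01_VD01-HD01_VD78;       /* character fields as read in */
   array asnum   {*} vd_num1-vd_num78;          /* numeric versions to create  */
   do i = 1 to dim(rawchar);
      /* the ?? modifier suppresses invalid-data messages, so "*****" becomes missing (.) */
      asnum{i} = input(strip(rawchar{i}), ?? best32.);
   end;
   drop i HD01_VD01-HD01_VD78;
run;
If you need the original HD01_VD names back, rename the converted variables afterwards.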
Another option is to change the format of the fields in Enterprise Guide, either in:
the Import Task that reads the files: change the field to numeric
or
a Query Builder task: create a calculated field and use an advanced expression such as input(HD02_V36,11.)

Prevent SAS EG from outputting every dataset in datastep

I'm new to SAS EG, I usually use BASE SAS when I actually need the program, but my company is moving heavily toward EG. I'm helping some areas with some code to get data they need on an ad-hoc basis (the code won't change though).
However, during processing, we create many temporary files that are just iterations across months. I.e., if the user wants data from 2002 - 2016, we have to pull all those libraries and then concatenate them with our results. This is due to high transactional volume; the final dataset is limited to a small number of observations. Whenever I run this program, though, SAS outputs the data from all 183 of the data steps created in the macro, making it very ugly, and sometimes the "Output Data" that appears isn't even output from the last data step, but from an intermediary step, making it annoying to search through for the final output dataset.
Is there a way to limit the datasets written to "Output Data" so that it only shows the final dataset - so that our end user doesn't need to worry about being confused?
Above is an example - There's a ton of output data sets that I don't care to see. I just want the final, which is located (somewhere) in that list...
Version is SAS E.G. 7.1
EG will always automatically show every dataset that was created after the program ends. If you don't want it to show any intermediate tables, delete them at the very last step in your process.
In your case, it looks as if your temporary tables all share the prefix TRN. You can clean them up as such:
/* Start of process flow */
<program statements>;
/* End of process flow */
proc datasets lib=work nolist nowarn nodetails;
delete TRN:;
quit;
Be careful if you do this. Make sure that all of your temporary tables follow the same prefix naming scheme, otherwise you may accidentally delete tables that you need.
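If your temporary tables do not share a convenient prefix, one possible alternative (the final table name FINAL_RESULTS below is just a placeholder) is to keep only the table you want with the SAVE statement, which deletes everything else in the library:
proc datasets lib=work nolist nowarn nodetails;
   save FINAL_RESULTS;
quit;
This inverts the logic of DELETE: you list what to keep rather than what to remove.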
Another solution is to limit the number of datasets generated, and have a user-created link to the final dataset. There's an article about it here.
The alternate solution here is to add the output dataset explicitly as an entry on your process flow, and disregard the OUTPUT window unless you need to investigate something from the intermediary datasets.
This has the advantage that it lets you look at the intermediary datasets if something goes wrong, but also lets you not have to look through all of them to see the final dataset.
Once the final output dataset has been created one time, you should be able to add it to the process flow easily, and after that it will be there for you to select and look at.

Blocking the values after a specific date

I've got the following question.
I'm trying to run a partial least squares forecast on a data model I have. The issue is that I need to block certain lines in order to get the forecast for a specific time.
What I want would be the following. For June, every line before May 2014 will be blocked (see the screenshot below).
For May, every line before April 2014 will be blocked (see the screenshot below).
I was thinking of using a delete through PROC SQL to do this, but that solution seems very brutal and I wish to keep my table intact.
Question: Is there a way to block the lines for a specific date without needing a deletion?
Many thanks for any insight you can give me as I've never done that before and don't know if there is a way to do that (I did not find anything on the net).
Edit: The aim of the blocking is to use the missing values and to run the forecast on the missing month, namely June 2014 here, and May 2014 in the second example.
I'm not sure what proc you are planning to use, but you should be able to do something like the below.
It builds a control data set, based on a distinct set of dates, including a filter value and building a text data set name. This data set is then called from a data null step.
CALL EXECUTE is a ridiculously powerful routine for this sort of looping behaviour: it requires you to build strings that it will then pass as if they were code. Note that column names in the control set are "outside" the string and concatenated with it using ||. The alternative is probably using quite a lot of macro code.
proc sql;
   /* one row per distinct date, plus a dataset name to generate */
   create table control_dates as
   select distinct
      nuov_date,
      put(nuov_date, monname3.) || '_results' as out_name
   from [csv_import];
quit;

data _null_;
   set control_dates;
   /* generate one data step per date, keeping only rows before that date */
   call execute(
      'data ' || strip(out_name) || ';
          set [csv_import] (where=(nuov_date < ' || strip(put(nuov_date, best12.)) || '));
       run;');
   /* then run the analysis proc against each generated dataset */
   call execute('proc [analysis proc] data=' || strip(out_name) || '; run;');
run;