Out of Memory using PROC FREQ - sas

I have approximately 1,000,000 rows and 25 columns of data and I'm trying to return a list of column names, the number of distinct values and whether there are missing values.
I am not able to directly code in column names in PROC SQL and count distinct as I have numerous data sets with different column names and I'm trying to automatically return the desired outcome for all tables with one piece of code.
I've tried running the following code
proc freq nlevels data= &DATASET_NAME;
ods output nlevels=nlevels ;
tables _all_ / NOPRINT;
run;
This returns an out of memory error. Is there another way to achieve the result that avoids the out of memory error?

With tables _all_ you do not have to type the column names, but requesting every column in a single step is probably what exhausts memory. Try running PROC FREQ on one column at a time and then combining the results:
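proc sql;
/* List the columns of the target table (SASHELP.CLASS as an example) */
create table name as
select name from dictionary.columns where libname='SASHELP' and memname='CLASS';
quit;
/* Start from an empty data set so no blank row ends up in the result */
data want;
stop;
run;
/* Generate one PROC FREQ per column and append each NLEVELS table to WANT */
data _null_;
set name;
call execute(
'proc freq data=sashelp.class nlevels;
table '||strip(name)||' / noprint;
ods output nlevels=nlevels;
run;
data want;
set want nlevels;
run;'
);
run;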
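Another way to get the same summary without PROC FREQ is to let PROC SQL count the distinct values. This is only a sketch: SASHELP.CLASS stands in for the real table, the macro variable names (lib, mem, select_list) and the d_/m_ output column prefixes are made up for illustration, and it assumes ordinary column names (no embedded blanks) that leave room for the two-character prefix. Note that count(distinct ...) ignores missing values, so the missing count is reported separately.

```sas
/* Assumption: SASHELP.CLASS stands in for the real table */
%let lib = SASHELP;
%let mem = CLASS;

proc sql noprint;
  /* Build "count(distinct x) as d_x, sum(missing(x)) as m_x" for every column */
  select catx(' ',
           'count(distinct', name, ') as', cats('d_', name), ',',
           'sum(missing(', name, ')) as', cats('m_', name))
    into :select_list separated by ', '
    from dictionary.columns
    where libname = "&lib" and memname = "&mem";

  /* One row: d_<column> = number of distinct non-missing values,
     m_<column> = number of rows where the column is missing */
  create table levels_wide as
    select &select_list
    from &lib..&mem;
quit;
```

The result is one wide row rather than one row per column; PROC TRANSPOSE can turn it into a long table if that is the shape you need.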

This question is very similar to SAS summary statistic from a dataset
The answers cover techniques for
transpose + freq
hash
freq w/ ODS exclude+output

Related

SAS dropping more than 400 columns during proc transpose?? (I do not want it to drop these columns)

I have some code that reads in some data and then transposes it. However, when I transpose the data, the resulting dataset is missing over 400 columns that were in the input dataset. I have never seen this happen before and can't find any information online as to why this would happen. Any help is much, much appreciated!
The input dataset (mydata) has 716 unique entries in the column "MyCol1". This is the column that becomes the column header in the transposed dataset. The transposed dataset has only 269 columns!
libname in "\\folder1\folder2";
data mydata;
set in.mydata; run;
PROC SORT DATA=mydata nodupkey;
BY MyCol3 MyCol4;
RUN;
PROC TRANSPOSE DATA=mydata OUT=mydata_WIDE(DROP=_NAME_);
BY MyCol3 MyCol4;
ID MyCol1;
VAR MyCol2;
RUN;
OK update for anybody else who may have this error: during my proc sort statement I was accidentally deleting a bunch of rows.
You should include MYCOL1 in the BY statement of the PROC SORT. Otherwise only one of the MYCOL1 values will exist per group.
But you probably need to keep the NODUPKEY as duplicate values of MYCOL1 per group will cause trouble for the PROC TRANSPOSE as you cannot make two new variables with the same name.
PROC SORT DATA=mydata nodupkey;
BY MyCol3 MyCol4 MyCol1;
RUN;
PROC TRANSPOSE DATA=mydata OUT=mydata_WIDE(DROP=_NAME_);
BY MyCol3 MyCol4;
ID MyCol1;
VAR MyCol2;
RUN;
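A quick way to confirm that the sort is what loses the values is to count the distinct MyCol1 values before and after it. This is a minimal sketch using the dataset and column names from the question; the output column names n_ids_before and n_ids_after are made up for illustration.

```sas
proc sql;
  /* Distinct ID values in the source data */
  select count(distinct MyCol1) as n_ids_before
    from in.mydata;

  /* Distinct ID values left after PROC SORT ... NODUPKEY */
  select count(distinct MyCol1) as n_ids_after
    from mydata;
quit;
```

If the second number is smaller than the first, rows were dropped by NODUPKEY before PROC TRANSPOSE ever ran.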

SAS: print the name related to the most little value

I'm a beginner in SAS and I'm having difficulty with this exercise:
I have a very simple table with 2 columns and three rows.
I'm trying to find the query that returns the name of the shortest person (so it must return titi).
All I have found is how to return the smallest size (157), but that's not what I want; I want the name associated with the smallest value!
Could you help me please?
Larapa
A SQL HAVING clause works well for this. SAS will automatically summarize the data and remerge the result with the original table, giving you a one-line table (assuming no ties) with the nom that has the smallest value of taille.
proc sql noprint;
create table want as
select nom
from have
having taille = min(taille)
;
quit;
Here are some other ways you can do it:
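Using PROC MEANS with MINID (an ID statement alone would return the largest nom, not the nom that goes with the smallest taille):
proc means data=have noprint;
var taille;
/* minid(taille(nom)) keeps the nom from the observation with the minimum taille */
output out=want(drop=_type_ _freq_)
min(taille) = min_taille
minid(taille(nom)) = nom;
run;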
Using sort and a data step to keep only the first observation:
proc sort data=have;
by taille;
run;
data want;
set have;
if(_N_ = 1);
run;

PROC FREQ on multiple variables combined into one table

I have the following problem. I need to run PROC FREQ on multiple variables, but I want the output to all be on the same table. Currently, a PROC FREQ statement with something like TABLES ERstatus Age Race, InsuranceStatus; will calculate frequencies for each variable and print them all on separate tables. I just want the data on ONE table.
Any help would be appreciated. Thanks!
P.S. I tried using PROC TABULATE, but it did not calculate N correctly, so I'm not sure what I did wrong. Here is my code for PROC TABULATE. My variables are all categorical, so I just need to know N and percentages.
PROC TABULATE DATA = BCanalysis;
CLASS ERstatus PRstatus Race TumorStage InsuranceStatus;
TABLE (ERstatus PRstatus Race TumorStage) * (N COLPCTN), InsuranceStatus;
RUN;
The above code does not return the correct frequencies based on InsuranceStatus where 0 = insured and 1 = uninsured, but PROC FREQ does. Also doesn't calculate correctly with ROWPCTN. So any way that I can get PROC FREQ to calculate multiple variables on one table, or PROC TABULATE to return the correct frequencies, would be appreciated.
Here is a nice image of my output in a simplified analysis of only ERstatus and InsuranceStatus. You can see that PROC FREQ returns 204 people with an ERstatus of 1 and InsuranceStatus of 1. That's correct. The values in PROC TABULATE are not.
[Image: PROC FREQ and PROC TABULATE output for ERstatus by InsuranceStatus]
I'll answer this separately as this is answering the other possible interpretation of the question; when it's clarified I'll delete one or the other.
If you want this in a single printed table, then you either need to use proc tabulate or you need to normalize your data - meaning put it in the form of variable | value. PROC FREQ is not capable of doing multiple one-way frequencies in a single table.
For PROC TABULATE, likely your issue is missing data. Any variable that is on the class statement will be checked for missingness, and if any rows are missing data for any of the class variables, those rows are entirely excluded from the tabulation for all variables.
You can override this by adding the missing option on the class statement, or in the table statement, or in the proc tabulate statement. So:
PROC TABULATE DATA = BCanalysis;
CLASS ERstatus PRstatus Race TumorStage InsuranceStatus/missing;
TABLE (ERstatus PRstatus Race TumorStage) * (N COLPCTN), InsuranceStatus;
RUN;
This will result in a slightly different appearance than on your table, though, as it will include the missing rows in places you probably do not want them, and they'll be factored against the colpctn when again you probably don't want them.
Typically some manipulation is then necessary; the easiest is to normalize your data and then run a tabulation (using PROC TABULATE or PROC FREQ, whichever is more appropriate; TABULATE has better percentaging options though) against that normalized dataset.
Let's say we have this:
data class;
set sashelp.class;
if _n_=5 then call missing(age);
if _n_=3 then call missing(sex);
run;
And we want these two tables in one table.
proc freq data=class;
tables age sex;
run;
If we do this:
proc tabulate data=class;
class age sex;
tables (age sex),(N colpctn);
run;
Then we get an N=17 total for both subtables - that's not what we want, we want N=18. Then we can do:
proc tabulate data=class;
class age sex/missing;
tables (age sex),(N colpctn);
run;
But that's not quite right either; I want F to have 8/18 = 44.44% and M 10/18 = 55.56%, not 42% and 53% with 5% allocated to the missing row.
The way I do this is to normalize the data. This means you get a dataset with 2 variables, varname and val, or whatever makes sense for your data, plus whatever identifier/demographic/whatnot variables you might have. val has to be character unless all of your values are numeric.
So, for example, here I normalize class with the age and sex variables. I don't keep any identifiers, but you certainly could in your data; I imagine InsuranceStatus would be kept there, if I understand what you're doing in that table. Once I have the normalized table, I just use those two variables and carefully construct a denominator definition in proc tabulate to get the right basis for my pctn value. It's not quite the same as the single table before (the variable name is in its own column, not on top of the list of values), but honestly that looks better in my opinion.
data class_norm;
set class;
length val $2;
varname='age';
val=put(age,2. -l);
if not missing(age) then output;
varname='sex';
val=sex;
if not missing(sex) then output;
keep varname val;
run;
proc tabulate data=class_norm;
class varname val;
tables varname=' '*val=' ',n pctn<val>;
run;
If you want something better than this, you'll probably have to construct it in proc report. That gives you the most flexibility, but is the most onerous to program in also.
You can use ODS OUTPUT to get all of the PROC FREQ output to one dataset.
ods output onewayfreqs=class_freqs;
proc freq data=sashelp.class;
tables age sex;
run;
ods output close;
or
ods output crosstabfreqs=class_tabs;
proc freq data=sashelp.class;
tables sex*(height weight);
run;
ods output close;
Crosstabfreqs is the name of the cross-tab output, while one-way frequencies are onewayfreqs. You can use ods trace to find out the name if you forget it.
You may (probably will) still need to manipulate this dataset some to get the structure you want ultimately.
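If you need to look up a table name, a minimal sketch of the ods trace approach looks like this. It reuses the sashelp.class example from above; the table names are written to the log, not to the output window.

```sas
/* Turn tracing on, run the procedure, then turn tracing off again */
ods trace on;
proc freq data=sashelp.class;
  tables age sex;
run;
ods trace off;
```

Each output object is listed in the log with a Name: line, and that name is what goes on the left-hand side of the ods output statement.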

How do I put conditions around Proc Freq statements in SAS?

I have the following statement
Proc Freq data =test;
tables gender;
run;
I want this to generate output based on a condition applied to the gender variable. For example: only output a gender value if its count is greater than 2.
How can I do this in SAS?
Thanks
If you mean an output dataset, you can put a where clause directly in the output dataset options.
Proc Freq data =sashelp.class;
tables sex/out=sex_freq(where=(count>9));
run;
I'm not aware of how you can accomplish this only using proc freq but you can redirect the output to a data set and then print the results.
proc freq data=test;
tables gender / noprint out=tmp;
run;
proc print data=tmp;
where count > 2;
run;
Alternatively you could use proc summary, but this still requires two steps.
proc summary data=test nway;
class gender;
output out=tmp(where=(_freq_ > 2));
run;
proc print data=tmp;
run;
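If a single step matters more than using PROC FREQ itself, the same filter can be sketched in PROC SQL with a HAVING clause. This swaps in a different procedure and assumes the test dataset and gender variable from the question; the output column name count is chosen only to mirror the PROC FREQ output.

```sas
proc sql;
  /* Count rows per gender and print only the groups with more than 2 rows */
  select gender, count(*) as count
    from test
    group by gender
    having count(*) > 2;
quit;
```

Unlike PROC FREQ, count(*) also counts rows where gender is missing, so the two can differ if the data has missing values.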

Extract data using SQL, analyse and write back to SQL table

I am a newbie to SAS Base, and I am struggling to create a simple program that extracts data from a table on my database, runs e.g. PROC MEANS, and writes the data back to the table.
I know how to use PROC SQL (read and update tables) and PROC MEANS, but I can't figure out how to combine the steps.
PROC SQL;
SELECT make,model,type,invoice,horsepower
FROM
SASHELP.CARS
;
QUIT;
PROC Means;
RUN;
What I want to accomplish is to create an additional column in the dataset with e.g. the mean of the horsepower, and in the end I want to write that computed column to the table on the database.
Edit
What I was looking for is this:
PROC SQL;
create table want as
select make,model,type,invoice,horsepower
, mean(horsepower) as mean_horsepower
from sashelp.cars
;
QUIT;
PROC MEANS DATA=want;
RUN;
SAS makes this very easy to do with SQL since it will automatically remerge summary statistics back to detailed records.
proc sql;
create table want as
select make,model,type,invoice,horsepower
, mean(horsepower) as mean_horsepower
from sashelp.cars
;
quit;
Or using normal SAS code.
proc means data=sashelp.cars nway noprint ;
var horsepower ;
output out=mean_horsepower mean=mean_horsepower ;
run;
data want ;
set sashelp.cars ;
if _n_=1 then set mean_horsepower (keep=mean_horsepower);
run;
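For the last part of the question, writing the result back to the database, a rough sketch follows. Everything about the target is an assumption here: mydb is a hypothetical libref that you have already assigned to your database with a LIBNAME statement and the appropriate SAS/ACCESS engine, and cars_with_mean is a made-up table name.

```sas
/* Hypothetical: MYDB is a libref already pointing at the database,
   CARS_WITH_MEAN is an invented name for the new table */
proc sql;
  create table mydb.cars_with_mean as
    select *
    from want;
quit;
```

This creates a new table rather than adding a column to the existing one, which is usually the simpler route; whether it is allowed depends on your permissions on the database.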