I need to find top 5 Transaction_Due_Date in Proc sql - sas

This code works for the top value, but I need the top 5 values:
proc sql;
create table cash.gO5 as
select * , max(Transaction_Due_Date) as max1 format = date9.
from cash.Orders_Dim65
group by Customer_Name;
quit;

PROC SQL does not support ordered analytical functions such as rank(), which are found in other flavors of SQL; however, there are numerous ways to get a rank by group. Here are a few options you can use.
Option 1: PROC RANK
proc rank does exactly what it sounds like: it ranks values. Note that when you use a BY statement in SAS 9 or SPRE, your data must be sorted by the BY variable.
proc rank data=sashelp.cars
out=want(where=(msrp_rank LE 5))
descending;
by make;
var msrp; /* Variable to rank */
ranks msrp_rank; /* Name of variable holding ranks */
run;
Option 2: Data Step
You can rank using a data step. Note that your data must be sorted by the BY variables if using SAS 9 or SPRE.
proc sort data=sashelp.cars
out=cars;
by make descending msrp;
run;
data want;
set cars;
by make descending msrp;
if(first.make) then Rank = 0;
Rank+1;
if(Rank LE 5);
run;
Option 3: simple.topK CAS Action
If you have Viya, you can use CAS actions to quickly rank large datasets. This can be used in both SAS and Python with the SWAT package.
/* Load sashelp.cars into CAS */
data casuser.cars;
set sashelp.cars;
run;
proc cas;
simple.topk result=r /
table = {caslib='casuser' name='cars' groupby='make'}
casout = {caslib='casuser' name='cars_top_5' replace=true}
aggregator ='max'
bottomK = 0
topK = 5
inputs = {{name='msrp'}}
;
quit;
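Applied to the table and columns named in the question, Option 2 might look roughly like the sketch below. It assumes `cash.Orders_Dim65` has `Customer_Name` and `Transaction_Due_Date` as in the question; `work.orders_sorted` and `rank` are names made up for illustration. It keeps the five latest rows per customer, so tied dates count as separate rows.

```sas
/* Sort so that each customer's latest due dates come first */
proc sort data=cash.Orders_Dim65 out=work.orders_sorted;
    by Customer_Name descending Transaction_Due_Date;
run;

/* Number the rows within each customer and keep the first five */
data cash.gO5;
    set work.orders_sorted;
    by Customer_Name;
    if first.Customer_Name then rank = 0;
    rank + 1;
    if rank <= 5;
    format Transaction_Due_Date date9.;
run;
```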

Related

Logistic regression with BY statement ERROR message

I'm currently working on a SAS program that runs 50 logistic regressions on 50 different samples. I previously had help on this thread (How to loop a logistic regression n number of times?), where people advised me to use a BY statement to avoid looping the process n times. It works really well, but I get this error message:
ERROR: No valid observations due either to missing values in the response, explanatory, frequency, or weight variable, or to
nonpositive frequency or weight values.
NOTE: The above message was for the following BY group:
Sample Replicate Number=.
You'll find my code below. If any of you have an idea of where this comes from, I'm open to anything. Thank you in advance!
proc surveyselect data=TOP_1 NOPRINT out=ALEA_1
seed=0
method=urs
outhits
reps=5
n=300;
run;
proc surveyselect data=TOP_0 NOPRINT out=ALEA_0
seed=0
method=urs
outhits
reps=5
n=300;
run;
PROC SQL;
CREATE TABLE APPEND_TABLE As
SELECT * FROM ALEA_1
OUTER UNION CORR
SELECT * FROM ALEA_0;
QUIT;
/* Logistic regression */
DATA WORK.TMP0TempTableAddtnlPredictData;
SET WORK.APPEND_TABLE(IN=__ORIG) WORK.BASE_PREDICT_2;
__FLAG=__ORIG;
__DEP=TOP_CREDIT_HABITAT_2017;
if not __FLAG then TOP_CREDIT_HABITAT_2017=.;
RUN;
PROC SQL;
CREATE VIEW WORK.SORTTempTableSorted AS
SELECT *
FROM WORK.TMP0TempTableAddtnlPredictData
ORDER BY REPLICATE;
QUIT;
TITLE;
TITLE1 "Résultats de la régression logistique";
FOOTNOTE;
FOOTNOTE1 "Généré par le Système SAS (&_SASSERVERNAME, &SYSSCPL) le %TRIM(%QSYSFUNC(DATE(), NLDATE20.)) à %TRIM(%SYSFUNC(TIME(), TIMEAMPM12.))";
PROC LOGISTIC DATA=WORK.SORTTempTableSorted
PLOTS(ONLY)=ROC
;
By Replicate;
CLASS age_classe (PARAM=EFFECT) Flag_bq_principale (PARAM=EFFECT) flag_univers_detenus (PARAM=EFFECT) csp_1 (PARAM=EFFECT) SGMT_FIDELITE (PARAM=EFFECT) situ_fam_1 (PARAM=EFFECT);
MODEL TOP_CREDIT_HABITAT_2017 (Event = '1') [...6## Heading ##] /
SELECTION=STEPWISE
SLE=0.1
SLS=0.1
INCLUDE=0
LINK=LOGIT
;
OUTPUT OUT=WORK.PREDLogRegPredictions(LABEL="Statistiques et prédictions de régression logistique pour WORK.APPEND_TABLE" WHERE=(NOT __FLAG))
PREDPROBS=INDIVIDUAL;
RUN;
QUIT;
DATA WORK.PREDLogRegPredictions;
set WORK.PREDLogRegPredictions;
TOP_CREDIT_HABITAT_2017=__DEP;
_FROM_=__DEP;
DROP __DEP;
DROP __FLAG;
RUN ;
QUIT ;
/* Create the final output file */
PROC SQL;
CREATE TABLE MODELE_RESULTS As
SELECT IDCLI_CALCULE, IP_1
FROM PREDLogRegPredictions;
RUN;
QUIT;
ODS GRAPHICS OFF;
Probably from this:
DATA WORK.TMP0TempTableAddtnlPredictData;
SET WORK.APPEND_TABLE(IN=__ORIG) WORK.BASE_PREDICT_2;
__FLAG=__ORIG;
__DEP=TOP_CREDIT_HABITAT_2017;
if not __FLAG then TOP_CREDIT_HABITAT_2017=.;
RUN;
You're appending a dataset that does not have a replicate number on it here. I'm not really sure I follow what this dataset is - are you intending this to be added to each replicate perhaps? Then you might do something like this (untested):
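DATA WORK.TMP0TempTableAddtnlPredictData;
/* Pass 1: write every sampled row, flagged as original data */
do until (eof);
SET WORK.APPEND_TABLE end=eof;
__FLAG=1;
__DEP=TOP_CREDIT_HABITAT_2017;
output;
end;
/* Pass 2: write one copy of the prediction rows for each replicate */
__FLAG=0;
do replicate = 1 to 5;
do n_predict = 1 to nobs_predict;
set WORK.BASE_PREDICT_2 nobs=nobs_predict point=n_predict;
__DEP=TOP_CREDIT_HABITAT_2017;
TOP_CREDIT_HABITAT_2017=.; /* blank the response so these rows are scored, not fitted */
output;
end;
end;
stop;
RUN;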
This is the complicated way to get 5 copies of that, one for each replicate. But I'm not sure that's actually what you want - does it even have all of the variables you need? Are you sure you didn't mean to MERGE instead of SET?
Separately, I don't understand why you use the SQL step to append the two samples. I'd either do that in the same data step here or use PROC APPEND; both would be faster than the SQL union followed immediately by appending more to the dataset.
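As a rough sketch of those two alternatives, using the dataset names from the question (ALEA_1, ALEA_0, APPEND_TABLE): the first stacks both samples in one data step, the second appends one sample onto the other in place. The PROC APPEND variant assumes both samples have the same variables; if they differ, the FORCE option or the data step version would be needed.

```sas
/* Alternative 1: stack the two samples in a single data step */
data work.APPEND_TABLE;
    set work.ALEA_1 work.ALEA_0;
run;

/* Alternative 2: append ALEA_0 onto ALEA_1 in place.
   Note that this modifies ALEA_1 instead of creating a new table. */
proc append base=work.ALEA_1 data=work.ALEA_0;
run;
```

Only one of the two would be used; with the second, the later steps would read WORK.ALEA_1 instead of WORK.APPEND_TABLE.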

Average number of rows per variable in SAS

I have the following dataset :
data test;
input business_ID $;
datalines;
'busi1'
'busi1'
'busi1'
'busi2'
'busi3'
'busi3'
;
run;
proc freq data = test ;
table business_ID;
run;
I would like the average number of lines per business, that is, the total number of observations divided by the number of distinct businesses.
In my example: 6 observations, 3 businesses -> 6/3 = 2 lines per business.
I was thinking about using a proc freq or a proc means step, but so far I only get the number of lines (the frequency) per business and do not know how to get to my goal.
Any idea?
You could use PROC FREQ to get the counts and then run PROC MEANS on the output.
proc freq data=test ;
tables business_id / noprint out=counts ;
run;
proc means data=counts;
var count;
run;
Or you could count them directly with PROC SQL code.
proc sql ;
select count(*)/count(distinct business_id) as mean_count
from test
;
quit;

Create new variables from format values

What I want to do: I need to create a new variable for each value label of a variable and do some recoding. I have all the value labels output from an SPSS file (see sample).
Sample:
proc format library = library ;
value SEXF
1 = 'Homme'
2 = 'Femme' ;
value FUMERT1F
0 = 'Non'
1 = 'Oui , occasionnellement'
2 = 'Oui , régulièrement'
3 = 'Non mais j''ai déjà fumé' ;
value ... (many more with different amount of levels)
The new variable name would be the actual one without F and with underscore+level (example: FUMERT1F level 0 would become FUMERT1_0).
After that i need to recode the variables on this pattern:
data ds; set ds;
FUMERT1_0=0;
if FUMERT1=0 then FUMERT1_0=1;
FUMERT1_1=0;
if FUMERT1=1 then FUMERT1_1=1;
FUMERT1_2=0;
if FUMERT1=2 then FUMERT1_2=1;
FUMERT1_3=0;
if FUMERT1=3 then FUMERT1_3=1;
run;
Any help will be appreciated :)
EDIT: The answers from Joe and data_null_ both work, but Stack Overflow won't let me mark more than one answer as accepted.
Update to add an underscore (_) to the end of each name. It looks like there is no option in PROC TRANSREG to put an underscore between the variable name and the value of the class variable, so we can just do a temporary rename: create name=newname pairs to rename each class variable to end in an underscore, and matching pairs to rename them back. The pairs are built with the CAT functions and read into macro variables with PROC SQL.
data have;
call streaminit(1234);
do caseID = 1 to 1e4;
fumert1 = rand('table',.2,.2,.2) - 1;
sex = first(substrn('MF',rand('table',.5),1));
output;
end;
stop;
run;
%let class=sex fumert1;
proc transpose data=have(obs=0) out=vnames;
var &class;
run;
proc print;
run;
proc sql noprint;
select catx('=',_name_,cats(_name_,'_')), catx('=',cats(_name_,'_'),_name_), cats(_name_,'_')
into :rename1 separated by ' ', :rename2 separated by ' ', :class2 separated by ' '
from vnames;
quit;
%put NOTE: &=rename1;
%put NOTE: &=rename2;
%put NOTE: &=class2;
proc transreg data=have(rename=(&rename1));
model class(&class2 / zero=none);
id caseid;
output out=design(drop=_: inter: rename=(&rename2)) design;
run;
%put NOTE: _TRGIND(&_trgindn)=&_trgind;
First try:
Looking at the code you supplied and the output from Joe's I don't really understand the need for the formats. It looks to me like you just want to create dummies for a list of class variables. That can be done with TRANSREG.
data have;
call streaminit(1234);
do caseID = 1 to 1e4;
fumert1 = rand('table',.2,.2,.2) - 1;
sex = first(substrn('MF',rand('table',.5),1));
output;
end;
stop;
run;
proc transreg data=have;
model class(sex fumert1 / zero=none);
id caseid;
output out=design(drop=_: inter:) design;
run;
proc contents;
run;
proc print data=design(obs=40);
run;
One good alternative to your code is to use proc transpose. It won't get you 0's in the non-1 cells, but those are easy enough to get. It does have the disadvantage that it makes it harder to get your variables in a particular order.
Basically, transpose once to vertical, then transpose back using the old variable name concatenated to the variable value as the new variable name. Hat tip to Data null for showing this feature in a recent SAS-L post. If your version of SAS doesn't support concatenation in PROC TRANSPOSE, do it in the data step beforehand.
I show using PROC EXPAND to then set the missings to 0, but you can do this in a data step as well if you don't have ETS or if PROC EXPAND is too slow. There are other ways to do this - including setting up the dataset with 0s pre-proc-transpose - and if you have a complicated scenario where that would be needed, this might make a good separate question.
data have;
do caseID = 1 to 1e4;
fumert1 = rand('Binomial',.3,3);
sex = rand('Binomial',.5,1)+1;
output;
end;
run;
proc transpose data=have out=want_pre;
by caseID;
var fumert1 sex;
copy fumert1 sex;
run;
data want_pre_t;
set want_pre;
x=1; *dummy variable;
run;
proc transpose data=want_pre_t out=want delim=_;
by caseID;
var x;
id _name_ col1;
copy fumert1 sex;
run;
proc expand data=want out=want_e method=none;
convert _numeric_ /transformin=(setmiss 0);
run;
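If PROC EXPAND is not available, the data step route mentioned above could look roughly like this. It is a sketch that assumes the `want` dataset produced by the second PROC TRANSPOSE; the index variable `i` is just an illustrative name.

```sas
/* Replace missing values with 0 in every numeric variable of WANT */
data want_e;
    set want;
    /* _numeric_ covers the numeric variables defined so far,
       so the loop index i (created below) is not part of the array */
    array nums {*} _numeric_;
    do i = 1 to dim(nums);
        if missing(nums{i}) then nums{i} = 0;
    end;
    drop i;
run;
```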
For this method, you need to use two concepts: the cntlout dataset from proc format, and code generation. This method will likely be faster than the other option I presented (as it passes through the data only once), but it does rely on the variable name <-> format relationship being straightforward. If it's not, a slightly more complex variation will be required; you should post to that effect, and this can be modified.
First, the cntlout option in proc format makes a dataset of the contents of the format catalog. This is not the only way to do this, but it's a very easy one. Specify the appropriate libname as you would when you create a format, but instead of making one, it will dump the dataset out, and you can use it for other purposes.
Second, we create a macro that performs your action one time (creating a variable with the name_value name and then assigning it to the appropriate value) and then use proc sql to make a bunch of calls to that macro, once for each row in your cntlout dataset. Note - you may need a where clause here, or some other modifications, if your format library includes formats for variables that aren't in your dataset - or if it doesn't have the nice neat relationship your example does. Then we just make those calls in a data step.
*Set up formats and dataset;
proc format;
value SEXF
1 = 'Homme'
2 = 'Femme' ;
value FUMERT1F
0 = 'Non'
1 = 'Oui , occasionnellement'
2 = 'Oui , régulièrement'
3 = 'Non mais j''ai déjà fumé' ;
quit;
data have;
do caseID = 1 to 1e4;
fumert1 = rand('Binomial',.3,3);
sex = rand('Binomial',.5,1)+1;
output;
end;
run;
*Dump formats into table;
proc format cntlout=formats;
quit;
*Macro that does the above assignment once;
%macro spread_var(var=, val=);
&var._&val.= (&var.=&val.); *result of boolean expression is 1 or 0 (T=1 F=0);
%mend spread_var;
*make the list. May want NOPRINT option here as it will make a lot of calls in your output window otherwise, but I like to see them as output.;
proc sql;
select cats('%spread_var(var=',substr(fmtname,1,length(Fmtname)-1),',val=',start,')')
into :spreadlist separated by ' '
from formats;
quit;
*Actually use the macro call list generated above;
data want;
set have;
&spreadlist.;
run;
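If the format catalog also holds formats for variables that are not in the dataset, the WHERE clause mentioned above might be sketched as follows. This assumes the same naming rule as in the example (format name = variable name plus a trailing F) and that the data is WORK.HAVE; it restricts the generated calls to numeric formats whose stem matches a column of that dataset.

```sas
proc sql noprint;
    select cats('%spread_var(var=',
                substr(fmtname, 1, length(fmtname) - 1),
                ',val=', start, ')')
        into :spreadlist separated by ' '
        from formats
        where type = 'N' /* numeric formats only */
          and upcase(substr(fmtname, 1, length(fmtname) - 1)) in
              (select upcase(name)
                   from dictionary.columns
                   where libname = 'WORK' and memname = 'HAVE');
quit;
```

The final data step that resolves &spreadlist. stays the same.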

Is there a way to name proc rank groups based on values within the group?

So I have multiple continuous variables that I have used proc rank to divide into 10 groups, i.e. for each observation there is now a "GPA" and a "GRP_GPA" value, and likewise for Hmwrk_Hrs and GRP_Hmwrk_Hrs. But for each of the new group columns the values are between 1 and 10. Is there a way to change that value so that, rather than 1 for instance, it would be 1.2-2.8 if those were the min and max values within the group? I know I can do it by hand using proc format, if-then logic, or CASE in SQL, but since I have something like 40 different columns, that would be very time-intensive.
It's not clear from your question whether you want to store the min-max values or just format the rank columns with them. My solution below formats the rank column and uses the ability of SAS to create formats from a dataset. I've only used one variable to rank; for your data it should be a simple matter to wrap a macro around the code and run it for each of your 40 or so variables. Hope this helps.
/* create ranked dataset */
proc rank data=sashelp.steel groups=10 out=want;
var steel;
ranks steel_rank;
run;
/* calculate minimum and maximum values per rank */
proc summary data=want nway;
class steel_rank;
var steel;
output out=want_min_max (drop=_:) min= max= / autoname;
run;
/* create dataset with formatted values */
data steel_rank_fmt;
set want_min_max (rename=(steel_rank=start));
retain fmtname 'stl_fmt' type 'N';
label=catx('-',steel_min,steel_max);
run;
/* create format from previous dataset */
proc format cntlin=steel_rank_fmt;
run;
/* apply formatted value to rank column */
proc datasets lib=work nodetails nolist;
modify want;
format steel_rank stl_fmt10.;
quit;
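The macro wrapper mentioned above might be sketched like this. The macro name `rank_fmt`, its parameters, the temporary datasets `_mm` and `_fmt`, and the `<var>_rank` / `<var>_f` naming are all made up for illustration; the sketch also assumes the output dataset lives in WORK and that variable names are short enough for the generated format name to stay within 32 characters.

```sas
%macro rank_fmt(data=, var=, out=);
    /* Rank the variable into 10 groups */
    proc rank data=&data groups=10 out=&out;
        var &var;
        ranks &var._rank;
    run;

    /* Minimum and maximum of the variable within each rank group */
    proc summary data=&out nway;
        class &var._rank;
        var &var;
        output out=_mm (drop=_:) min=lo max=hi;
    run;

    /* Build a CNTLIN dataset: one label "min-max" per rank value */
    data _fmt;
        set _mm (rename=(&var._rank=start));
        retain fmtname "&var._f" type 'N';
        label = catx('-', lo, hi);
    run;

    proc format cntlin=_fmt;
    run;

    /* Attach the new format to the rank column */
    proc datasets lib=work nodetails nolist;
        modify &out;
        format &var._rank &var._f.;
    quit;
%mend rank_fmt;

/* Example calls; the second call adds another rank column to WANT */
%rank_fmt(data=sashelp.cars, var=enginesize, out=want);
%rank_fmt(data=want, var=msrp, out=want);
```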
In addition to Keith's good answer, you can also do the following:
proc rank data = sashelp.cars groups = 10 out = test;
var enginesize;
ranks es;
run;
proc sql ;
select *, catx('-',min(enginesize), max(enginesize)) as esrange, es from test
group by es
order by make, model
;
quit;

group by in sas

I've the below dataset as input
ID
--
1
2
2
3
4
4
4
5
And need a new dataset as below
ID count of ID
-- -----------
1 1
2 2
3 1
4 3
5 1
Could you please tell me how to do this in SAS without using PROC SQL?
How about PROC FREQ or PROC SUMMARY? These avoid having to presort the data.
proc freq data=have noprint;
table id / out=want1 (drop=percent);
run;
proc summary data=have nway;
class id;
output out=want2 (drop=_type_);
run;
proc sql noprint;
create table test as select distinct id, count(id)
from your_table
group by ID
order by ID
;
quit;
Try this:
DATA Have;
input id ;
datalines;
1
2
2
3
4
4
4
5
;
Proc Sort data=Have;
by ID;
run;
Data Want;
Set Have;
By ID;
If first.ID then Count=0;
Count+1;
If Last.ID then Output;
Run;
PROC SORT DATA=YOURS;
BY ID; RUN;
PROC MEANS DATA=YOURS NOPRINT;
VAR ID;
BY ID;
OUTPUT OUT=NEWDATASET N=COUNT; RUN;
You can also choose to keep only the ID and COUNT variables in NEWDATASET.
We can use simple PROC SQL count to do this:
proc sql;
create table want as
select id, count(id) as count_of_id
from have
group by id;
quit;
Here is yet another possibility, often known as a DoW construction:
Data want;
do count=1 by 1 until(last.ID);
set have;
by id;
end;
run;
If the aggregation you want to do is complex, go with PROC SQL, since GROUP BY in SQL is more familiar to most people:
proc sql ;
create table solution_1 as select distinct ID, count(ID)
from table_1
group by ID
order by ID
;
quit;
OR
If you are using SAS EG, the Query Builder is very useful for small analyses.
Just drag and drop the columns you want to aggregate and, in the summary option, select whatever operation you want to perform, such as Avg, Count, Miss, NMiss, etc.