How can I reproduce PROC SQL NOT IN (SELECT ...) with a SAS merge? - sas

Good afternoon,
I have just started learning to code in SAS.
I want to get exactly the same result as this query:
proc sql;
create table award_print_new as
select *
from awards_try2
where BOR_ITEM_TYPE NOT IN (SELECT BOR_ITEM_TYPE from FADSFUND);
run;
I thought the following was the equivalent DATA step code, but the results are different, so I must have something wrong.
proc sort data=awards_try2;
by BOR_ITEM_TYPE;
run;
proc sort data=FADSFUND;
by BOR_ITEM_TYPE;
run;
data award_print;
set awards_try2 (in=a) FADSFUND (in=b);
by BOR_ITEM_TYPE;
if a and not b;
run;
As the log below shows, the DATA step produces 9525 observations instead of 681. How can I get 681 with DATA step code?
1665 data award_print;
1666 set awards_try2 (in=a) FADSFUND (in=b);
1667 by BOR_ITEM_TYPE;
1668 if a and not b;
1669 run;
NOTE: There were 9525 observations read from the data set WORK.AWARDS_TRY2.
NOTE: There were 1226 observations read from the data set WORK.FADSFUND.
NOTE: The data set WORK.AWARD_PRINT has 9525 observations and 22 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
1670
1671 proc sql;
1672 create table award_print_new as
1673 select *
1674 from awards_try2
1675 where BOR_ITEM_TYPE NOT IN (SELECT BOR_ITEM_TYPE from FADSFUND);
NOTE: Table WORK.AWARD_PRINT_NEW created, with 681 rows and 15 columns.
1676 run;

You didn't merge the datasets. Because you used a SET statement you merely interleaved them. You can tell from the notes:
NOTE: There were 9525 observations read from the data set WORK.AWARDS_TRY2.
NOTE: There were 1226 observations read from the data set WORK.FADSFUND.
NOTE: The data set WORK.AWARD_PRINT has 9525 observations and 22 variables.
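With an interleave, every observation comes from exactly one dataset, so "a and not b" is true for every observation read from the first dataset and you are just writing all of them out. Use a MERGE statement instead: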
data award_print;
merge awards_try2 (in=a) FADSFUND (in=b);
by BOR_ITEM_TYPE;
if a and not b;
run;
Note that you don't need the first part of the IF condition. If you are merging only two datasets, any observation that is not in B must have come from A.
Another source of differences could be when B has more copies of a particular value of the BY variable(s) than A. Also, if B has other non-BY variables in common with A, their values could overwrite the values read from A. In this case, however, no data will be overwritten, since you are not writing the observations where B contributed data. But if B has variables that are not in A, they will be added to the output dataset with missing values on all observations (the log above shows 22 variables in the DATA step output versus 15 columns from PROC SQL).
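One way to sidestep both issues is to reduce FADSFUND to a de-duplicated list of keys before merging. This is only a minimal sketch, assuming BOR_ITEM_TYPE has the same type and length in both datasets; fund_keys is a hypothetical dataset name that does not appear in the question.

```sas
/* Keep only the key from FADSFUND, one row per value (fund_keys is a hypothetical name) */
proc sort data=FADSFUND(keep=BOR_ITEM_TYPE) out=fund_keys nodupkey;
  by BOR_ITEM_TYPE;
run;

proc sort data=awards_try2;
  by BOR_ITEM_TYPE;
run;

data award_print;
  /* fund_keys adds no new variables and has at most one row per BY value */
  merge awards_try2 fund_keys(in=b);
  by BOR_ITEM_TYPE;
  if not b;  /* keep rows whose key never appears in FADSFUND */
run;
```

Because fund_keys carries only the BY variable, the output should keep just the columns of awards_try2, and each of its rows is either kept or dropped according to whether its key exists in FADSFUND, which is what the NOT IN subquery does.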

Related

SAS dropping more than 400 columns during proc transpose?? (I do not want it to drop these columns)

I have some code that reads in some data and then transposes it. However, when I transpose the data, the resulting dataset is missing over 400 columns that were in the input dataset. I have never seen this happen before and can't find any information online as to why this would happen. Any help is much, much appreciated!
The input dataset (mydata) has 716 unique entries in the column "MyCol1". This is the column that becomes the column header in the transposed dataset. The transposed dataset has only 269 columns!
libname in "\\folder1\folder2";
data mydata;
set in.mydata;
run;
PROC SORT DATA=mydata nodupkey;
BY MyCol3 MyCol4;
RUN;
PROC TRANSPOSE DATA=mydata OUT=mydata_WIDE(DROP=_NAME_);
BY MyCol3 MyCol4;
ID MyCol1;
VAR MyCol2;
RUN;
OK update for anybody else who may have this error: during my proc sort statement I was accidentally deleting a bunch of rows.
You should include MYCOL1 in the BY statement of the PROC SORT. Otherwise only one of the MYCOL1 values will exist per group.
But you probably need to keep the NODUPKEY as duplicate values of MYCOL1 per group will cause trouble for the PROC TRANSPOSE as you cannot make two new variables with the same name.
PROC SORT DATA=mydata nodupkey;
BY MyCol3 MyCol4 MyCol1;
RUN;
PROC TRANSPOSE DATA=mydata OUT=mydata_WIDE(DROP=_NAME_);
BY MyCol3 MyCol4;
ID MyCol1;
VAR MyCol2;
RUN;
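To see which rows NODUPKEY would discard before transposing, PROC SORT can write them to a separate dataset with the DUPOUT= option. A minimal sketch; mydata_sorted and dropped_rows are hypothetical dataset names chosen for illustration:

```sas
/* Sort a copy and capture every row that NODUPKEY removes */
proc sort data=mydata out=mydata_sorted nodupkey dupout=dropped_rows;
  by MyCol3 MyCol4 MyCol1;
run;
```

If dropped_rows is not empty, those rows repeat a MyCol3/MyCol4/MyCol1 combination and their MyCol2 values will not reach the wide dataset, so it is worth inspecting them before running PROC TRANSPOSE on mydata_sorted.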

Missing values in VARMAX

I have a dataset with visitors and weather variables, and I'm trying to forecast visitors based on the weather variables. Since the dataset only contains visitors in season, there are missing values and gaps in every year. PROC REG in SAS runs fine, but PROC VARMAX will not run the regression because of the missing values. How can I tackle this?
proc varmax data=tivoli4 printall plots=forecast(all);
id obs interval=day;
model lvisitors = rain sunshine averagetemp
dfebruary dmarch dmay djune djuly daugust doctober dnovember ddecember
dwednesday dthursday dfriday dsaturday dsunday
d_24Dec2016 d_05Dec2013 d_24Dec2017 d_24Dec2014 d_24Dec2015 d_24Dec2019
d_24Dec2018 d_24Sep2012 d_06Jul2015
d_08feb2019 d_16oct2014 d_15oct2019 d_20oct2016 d_15oct2015 d_22sep2017 d_08jul2015
d_20Sep2019 d_08jul2016 d_16oct2013 d_01aug2012 d_18oct2012 d_23dec2012 d_30nov2013 d_20sep2014 d_17oct2012 d_17jun2014
dFrock2012 dFrock2013 dFrock2014 dFrock2015 dFrock2016 dFrock2017 dFrock2018 dFrock2019
dYear2015 dYear2016 dYear2017
/p=7 q=2 Method=ml dftest;
garch p=1 q=1 form=ccc OUTHT=CONDITIONAL;
restrict
ar(3,1,1)=0, ar(4,1,1)=0, ar(5,1,1)=0,
XL(0,1,13)=0, XL(0,1,14)=0, XL(0,1,13)=0, XL(0,1,27)=0, XL(0,1,38)=0, XL(0,1,42)=0;
output lead=10 out=forecast;
run;
As with any forecast, you will first need to prepare your time series. Run your data through PROC TIMESERIES to fill in or impute missing values. The most appropriate imputation depends on your variables. The code below will:
Sum lvisitors by day and set missing values to 0
Set missing values of averagetemp to the series average
Set missing values of rain, sunshine, and your variables starting with d to 0 (assuming these are indicators)
Code:
proc timeseries data=have out=want;
id obs interval = day
setmissing = 0
notsorted
;
var lvisitors / accumulate=total;
crossvar averagetemp / accumulate=none setmissing=average;
crossvar rain sunshine d: / accumulate=none;
run;
Important Time Interval Consideration
Depending on your data, imputing the off-season could bias your error rate and estimates, since you always know no one will be around in the off-season. If you have many missing values for off-season data, you will want to remove those rows.
Since PROC VARMAX does not support custom time intervals, you can instead create a simple time identifier. Alternatively, you can turn this into a format with PROC FORMAT and convert time_id back at the end.
data want;
set have;
time_id+1;
run;
proc varmax data=want;
id time_id interval=day;
...
output lead=10 out=myforecast;
run;
data myforecast;
merge myforecast
want(keep=time_id date)
;
by time_id;
run;
Or, if you made a format:
data myforecast;
set myforecast;
date = put(time_id, timeid.);
drop time_id;
run;
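The timeid. format used above is not defined in the answer. One way it could be built is from the prepared data with the CNTLIN= option of PROC FORMAT. This is a sketch under assumptions: WANT is taken to contain time_id plus a numeric SAS date variable named date, and timeid_fmt is a hypothetical name for the control dataset.

```sas
/* Build a control dataset: one row per time_id, labelled with its date */
data timeid_fmt;
  set want(keep=time_id date);
  retain fmtname 'timeid' type 'N';  /* numeric format named TIMEID */
  start = time_id;
  label = put(date, date9.);         /* assumes DATE is a numeric SAS date */
  keep fmtname type start label;
run;

/* Create the TIMEID. format from the control dataset */
proc format cntlin=timeid_fmt;
run;
```

Note that PUT returns text, so applying this format yields a character value; forecast periods beyond the last observed time_id have no entry in the format and are simply printed as their number.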

Formats export with variable names in SAS

I created a format for a variable as follows
proc format;
value now 0='M'
1='F'
;
run;
and now I apply this to a dataset.
Data X;
set X2;
format Var1 now.;
run;
and I want to export this format using cntlout
proc format library=work cntlout=form; run;
This gives me the list of formats in the library catalog, but it does not give me the variable name each one is attached to.
How can I create a dataset listing the formats and the variables they are attached to, so I can see which format is linked to which variable?
If you just want to look up the variables in a specific dataset, PROC CONTENTS is often faster than using SASHELP.VCOLUMN or DICTIONARY.COLUMNS, particularly when there are lots of libraries/datasets defined.
57 proc contents data=x out=myvars(keep=name format) noprint;
58 run;
NOTE: The data set WORK.MYVARS has 1 observations and 2 variables.
59
60 data _null_;
61 set myvars;
62 put _all_;
63 run;
NAME=Var1 FORMAT=NOW _ERROR_=0 _N_=1
NOTE: There were 1 observations read from the data set WORK.MYVARS.
Assuming you want this for a specific library, you can use the SASHELP.VCOLUMN view. It lists the format of every variable, and you can filter it as desired.
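A query along these lines would list every variable in one library that has a format attached. It is a minimal sketch: fmt_vars is a hypothetical output name, and WORK stands in for whichever library you care about (LIBNAME values in this view are stored in upper case).

```sas
proc sql;
  create table fmt_vars as
  select libname, memname, name, format
  from sashelp.vcolumn
  where libname = 'WORK'   /* library to inspect */
    and format ne '';      /* only variables with a format attached */
quit;
```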

PROC CONTENTS is truncating the values in the output dataset. How do I get the full values?

I'm trying to get all the dataset names in a library into a dataset.
proc datasets library=LIB1 memtype=data ;
contents data=_all_ noprint out=Datasets_in_Lib1(keep=memname) ;
run;
The final dataset (Datasets_in_Lib1) has all the dataset names that are in LIB1, but the names are truncated to 6 characters. Is there any way to get the full dataset names without truncation?
Ex: If the dataset name is x123456789, Datasets_in_Lib1 will have x12345 only.
Thanks in advance,
Sam.
Agree with the comment, in 9.3 memname is $32. I don't think there was a version of SAS where data set names were limited to 6 characters. They went from 8 characters to 32 characters in v7 (I think).
Here's a log from running your code, showing that it works as you want in 9.3:
51 data work.x123456789;
52 x=1;
53 run;
NOTE: The data set WORK.X123456789 has 1 observations and 1 variables.
54
55 proc datasets library=work memtype=data nolist;
56 contents data=_all_ noprint out=Datasets_in_Lib1(keep=memname) ;
57 run;
NOTE: The data set WORK.DATASETS_IN_LIB1 has 1 observations and 1 variables.
58
59 data _null_;
60 set Datasets_in_Lib1;
61 put _all_;
62 run;
MEMNAME=X123456789 _ERROR_=0 _N_=1
NOTE: There were 1 observations read from the data set WORK.DATASETS_IN_LIB1.
You can also query the sashelp.vtable to obtain the list of datasets:
proc sql;
create table mem_list as
select memname
from sashelp.vtable
where libname='LIB1' and memtype='DATA';
quit;

SAS proc sql SELECT DISTINCT fails to remove duplicates on already sorted dataset

Given the following test data:
data test;
input A B;
cards;
1 2
1 1
1 2
run;
NOTE: The data set WORK.TEST has 3 observations and 2 variables.
I am aware that PROC SORT can behave unexpectedly if you don't sort by a whole key, or even when you appear to sort by the whole key but a KEEP= dataset option is in force:
proc sort data=test out=test_dedup_works nodup;
by a _all_;
run;
NOTE: There were 3 observations read from the data set WORK.TEST.
NOTE: Duplicate BY variable(s) specified. Duplicates will be ignored.
NOTE: 1 duplicate observations were deleted.
NOTE: The data set WORK.TEST_DEDUP_WORKS has 2 observations and 2 variables.
proc sort data=test out=test_dedup_fails nodup;
by a;
run;
NOTE: There were 3 observations read from the data set WORK.TEST.
NOTE: 0 duplicate observations were deleted.
NOTE: The data set WORK.TEST_DEDUP_FAILS has 3 observations and 2 variables.
proc sort data=test (keep=a) out=test_dedup_alsofails nodup;
by a;
run;
NOTE: There were 3 observations read from the data set WORK.TEST.
NOTE: 0 duplicate observations were deleted.
NOTE: The data set WORK.TEST_DEDUP_ALSOFAILS has 3 observations and 1 variables.
What's new to me is that trying to deduplicate the resulting not-actually-deduplicated dataset using PROC SQL also fails to remove duplicates:
proc sql;
create table test_dedup_eventhisfails as
select distinct a
from test_dedup_alsofails;
quit;
NOTE: Table WORK.TEST_DEDUP_EVENTHISFAILS created, with 3 rows and 1 columns.
Is this a bug which is documented somewhere or am I doing something wrong?