How can I prevent duplicates from printing - sas

In the following code, I would like to remove the duplicates inside the columns cd,id,se,nt,dd. Usually, when duplicates appear, the appear in the NT column with a "-" first. But in general they are duplicates in all columns. Thanks in advance!
PROC PRINT DATA=data.data2;
var cd id SE NT DD;
format notional commax32.;
run;

You could just add a proc sort before the print with the nodupkey option to remove any duplicates:
proc sort data=data.data2 nodupkey;
by cd id se nt dd;
run;
Or, if you want to preserve your original data, you can output the result of the proc sort to a new table:
proc sort data=data.data2 out=data3 nodupkey;
by cd id se nt dd;
run;
PROC PRINT DATA=data3;
var cd id SE NT DD;
format notional commax32.;
run;

Related

Using proc format for columns in SAS output

I have the following data
DATA HAVE;
input year dz $8. area;
cards;
2000 stroke 08
2000 stroke 06
2000 stroke 06
;
run;
After using the proc freq
proc freq data=have;
table area*dz/ list nocum ;
run;
I get the below output
I want to replace any values below 5 in "frequency" column and 0 in "percent" column with "<5" and "0", respectively.
I have the below proc format code
proc format;
picture count (round)
0-4 = ' <5' (NOEDIT);
picture pcnt (round)
0 = ' - '
other = '009.9%';
But I am not understanding how to use it in the data step to get the desired results. Please guide.
Thanks!
You can save the result of proc freq as a dataset then report it as your wish.
proc freq data=have noprint;
table area*dz/ list nocum out = want;
run;
proc print;
format COUNT count. PERCENT pcnt.;
run;
If you want your proc freq output result just as your wish, not using an extra report or print procedure, you need to code a little more. The answer is about ods style and there is a very nice article about it: Using Styles and Templates to Customize SAS® ODS Output

Formating proc freq output table in SAS

I have the following data
DATA HAVE;
input year dz area;
cards;
2000 1 08
2000 1 06
2000 1 06
;
run;
proc freq data=have;
table area*dz / norow nocol;
run;
I get the following output
I would like to format it to put frequency in one column and percent in another column and I don't want the total column. Is there a way to do it?
Thank you!
Try adding the LIST option to get a different layout:
proc freq data=have;
table area*dz / norow nocol LIST;
run;
Pipe it to a data set and format as desired:
proc freq data=have;
table area*dz / norow nocol LIST out=want;
run;
proc print data=want;run;
Use PROC TABULATE instead (not shown), which allows you more control over your layout and formats.

Logistic regression with BY statement ERROR message

I'm currently working on a SAS program that processes 50 logistic regression for 50 different samples. I previously had help on this thread (How to loop a logistic regression n number of times?), people advised me to use a BY statement to avoid looping this process n times. Works really well but I get this ERROR MESSAGE:
ERROR: No valid observations due either to missing values in the response, explanatory, frequency, or weight variable, or to
nonpositive frequency or weight values.
NOTE: The above message was for the following BY group:
Sample Replicate Number=.
You'll find my code below, if any of you have an idea of where does it come from, I'm open to anything, thank you in advance!
proc surveyselect data=TOP_1 NOPRINT out=ALEA_1
seed=0
method=urs
outhits
reps=5
n=300;
run;
proc surveyselect data=TOP_0 NOPRINT out=ALEA_0
seed=0
method=urs
outhits
reps=5
n=300;
run;
PROC SQL;
CREATE TABLE APPEND_TABLE As
SELECT * FROM ALEA_1
OUTER UNION CORR
SELECT * FROM ALEA_0;
QUIT;
/* Régression logistique*/
DATA WORK.TMP0TempTableAddtnlPredictData;
SET WORK.APPEND_TABLE(IN=__ORIG) WORK.BASE_PREDICT_2;
__FLAG=__ORIG;
__DEP=TOP_CREDIT_HABITAT_2017;
if not __FLAG then TOP_CREDIT_HABITAT_2017=.;
RUN;
PROC SQL;
CREATE VIEW WORK.SORTTempTableSorted AS
SELECT *
FROM WORK.TMP0TempTableAddtnlPredictData
ORDER BY REPLICATE;
QUIT;
TITLE;
TITLE1 "Résultats de la régression logistique";
FOOTNOTE;
FOOTNOTE1 "Généré par le Système SAS (&_SASSERVERNAME, &SYSSCPL) le %TRIM(%QSYSFUNC(DATE(), NLDATE20.)) à %TRIM(%SYSFUNC(TIME(), TIMEAMPM12.))";
PROC LOGISTIC DATA=WORK.SORTTempTableSorted
PLOTS(ONLY)=ROC
;
By Replicate;
CLASS age_classe (PARAM=EFFECT) Flag_bq_principale (PARAM=EFFECT) flag_univers_detenus (PARAM=EFFECT) csp_1 (PARAM=EFFECT) SGMT_FIDELITE (PARAM=EFFECT) situ_fam_1 (PARAM=EFFECT);
MODEL TOP_CREDIT_HABITAT_2017 (Event = '1') [...6## Heading ##] /
SELECTION=STEPWISE
SLE=0.1
SLS=0.1
INCLUDE=0
LINK=LOGIT
;
OUTPUT OUT=WORK.PREDLogRegPredictions(LABEL="Statistiques et prédictions de régression logistique pour WORK.APPEND_TABLE" WHERE=(NOT ws__FLAG))
PREDPROBS=INDIVIDUAL;
RUN;
QUIT;
DATA WORK.PREDLogRegPredictions;
set WORK.PREDLogRegPredictions;
TOP_CREDIT_HABITAT_2017=__DEP;
_FROM_=__DEP;
DROP __DEP;
DROP __FLAG;
RUN ;
QUIT ;
/* Création du fichier de sorti final*/
PROC SQL;
CREATE TABLE MODELE_RESULTS As
SELECT IDCLI_CALCULE, IP_1
FROM PREDLogRegPredictions;
RUN;
QUIT;
ODS GRAPHICS OFF;
Probably from this:
DATA WORK.TMP0TempTableAddtnlPredictData;
SET WORK.APPEND_TABLE(IN=__ORIG) WORK.BASE_PREDICT_2;
__FLAG=__ORIG;
__DEP=TOP_CREDIT_HABITAT_2017;
if not __FLAG then TOP_CREDIT_HABITAT_2017=.;
RUN;
You're appending a dataset that does not have a replicate number on it here. I'm not really sure I follow what this dataset is - are you intending this to be added to each replicate perhaps? Then you might do something like this (untested):
DATA WORK.TMP0TempTableAddtnlPredictData;
do _n_ = 1 by 1 until (eof);
SET WORK.APPEND_TABLE(IN=__ORIG) end=eof;
output;
end;
do replicate = 1 to 5;
do n_predict = 1 to nobs_predict;
set WORK.BASE_PREDICT_2 nobs=nobs_predict point=n_predict;
__FLAG=__ORIG;
__DEP=TOP_CREDIT_HABITAT_2017;
if not __FLAG then TOP_CREDIT_HABITAT_2017=.;
output;
end;
end;
stop;
RUN;
This is the complicated way to get 5 copies of that, one for each replicate. But I'm not sure that's actually what you want - does it even have all of the variables you need? Are you sure you didn't mean to MERGE instead of SET?
Separately, I don't understand why you use the SQL step to append the two samples. I'd either do that in the same data step here or I'd use PROC APPEND, both would be faster than the SQL union and then immediately appending more to the dataset.

Value labels to be created using data from another data set

I am having two data sets. The first data set has airport codes (JFK, LGA, EWR) in a variable 'airport'. The second dataset has the list of all major airports in the world. This dataset has two variables 'faa' holding the FAA Code (like JFG, LGA, EWR) and 'name' holding the actual name of the airport (John. F Kennedy, Le Guardia etc.).
My requirement is to create value labels for in the first data set, so that instead of airport code, the actual name of the airport comes up. I know I can use custom formats to achieve this. But can I write SAS code which can read the unique airport codes, then get the names from another data set and create a value label automatically?
PS: Other wise, the only option I see is to use MS Excel to get the unique list of FAA codes in dataset 1, and then use VLOOKUP to get the names of the airports. And then create one custom format by listing each unique FAA code and the airport name.
I think "value label" is SPSS terminology. Looks like you want to create a format. Just use your lookup table to create an input dataset for PROC FORMAT.
So if your second table looks like this:
data table2;
length FAA $4 Name $40 ;
input FAA Name $40. ;
cards;
JFK John F. Kennedy (NYC)
LGA Laguardia (NYC)
EWR Newark (NJ)
;
You can use this code to convert it into a dataset that PROC FORMAT can use to create a format.
data fmt ;
fmtname='$FAA';
hlo=' ';
set table2 (rename=(faa=start name=label));
run;
proc format cntlin=fmt lib=work.formats;
run;
Now you can use that format with your other data.
proc freq data=table1 ;
tables airport ;
format airport faa. ;
run;
Firstly, consider if it is really a format what is needed. For example, you may just do a left join to retrieve the column (airport) name from table2 (FAA-Name table).
Anyway, I believe the following macro does the trick:
Create auxiliary tables:
data have1;
input airport $;
datalines;
a
d
e
;
run;
data have2;
input faa $ name $;
datalines;
a aaaa
b bbbb
c cccc
d dddd
;
run;
Macro to create Format:
%macro create_format;
*count number of faa;
proc sql noprint;
select distinct count(faa) into:n
from have2;
quit;
*create macro variables for each faa and name;
proc sql noprint;
select faa, name
into:faa1-:faa%left(&n),:name1-:name%left(&n)
from have2;
quit;
*create format;
proc format;
value $airport
%do i=1 %to &n;
"&faa%left(&i)" = "&name%left(&i)"
%end;
other = "Unknown FAA code";
run;
%mend create_format;
%create_format;
Apply format:
data want;
set have1;
format airport $airport.;
run;

SAS Keep maximum value by ID

Each ID has several instances, and each instance has a different value. I would like the final output to be the maximum value per ID. So the initial dataset is:
ID Value
1 100
1 7
1 65
2 12
2 97
3 82
3 54
And the output will be:
ID Value
1 100
2 97
3 82
I tried running proc sort twice thinking that the first sort would get things in the proper order so that nodupkey on the second sort would get rid of the right values. This did not work.
proc sort work.data; by id value descending; run;
proc sort work.data nodupkey; by id; run;
Thanks!
Your approach should have worked fine but it looks like you have a syntax error - did you forget to check your log? The descending keyword needs to go before the variable you want to sort in descending order.
proc sort data=sashelp.class out=tmp;
by sex descending height;
run;
proc sort data=tmp out=final nodupkey;
by sex;
run;
Also - in case you're not familiar with SQL, I strongly suggest that you should learn it as it will simplify many data manipulation tasks. This can also be solved in a single SQL step:
proc sql noprint;
create table want as
select sex,
max(height) as height
from sashelp.class
group by sex
;
quit;
My preferred solution:
proc means data=have noprint;
class id;
var value;
output out=want max(value)=;
run;
Should be a lot faster than two sorts.