BY statement in SAS

I have the following data with three groups.
data draft_us;
input group$ ID$ SEX$ Value;
datalines;
G1 010 Male 30
G1 009 Female 35
G1 009 Male 20
G3 009 Male 44
G3 009 Female 0
G3 009 Male 45
G2 009 Male 50
G2 009 Female 100
;
run;
Why is the BY statement not working in the data step? I have tried this code:
data set_draft;
set draft_us;
by SEX;
run;
Here is the error message
ERROR: BY variable ascending is not on input data set
WORK.DRAFT_US.
NOTE: The SAS System stopped processing this step because of
errors.
WARNING: The data set WORK.SET_DRAFT may be incomplete. When
this step was stopped there were 0 observations and 4
variables.

Your example data is not sorted by SEX. The first observation is Male, the second is Female which comes before Male in sort order. If you tell the data step the data is sorted and it isn't then you get a different error message than the one you posted.
ERROR: BY variables are not properly sorted on data set WORK.DRAFT_US.
Example LOG:
2957 data set_draft;
2958 set draft_us;
2959 by SEX;
2960 run;
ERROR: BY variables are not properly sorted on data set WORK.DRAFT_US.
group=G1 ID=010 SEX=Male Value=30 FIRST.SEX=1 LAST.SEX=1 _ERROR_=1 _N_=1
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.SET_DRAFT may be incomplete. When this step was stopped there were 0 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
If you want to process the data using BY groups (which your simple example is NOT doing) then it needs to actually be organized into those groups. For example by adding a PROC SORT step.
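As a hedged illustration of what BY-group processing looks like once the data are sorted, the sketch below totals Value within each SEX. The data set names draft_sorted and sex_totals and the variable total are made up for this example; they do not come from the question.

```sas
/* Sort a copy so the original order of draft_us is left untouched */
proc sort data=draft_us out=draft_sorted;
  by sex;
run;

data sex_totals;
  set draft_sorted;
  by sex;                       /* creates FIRST.sex and LAST.sex */
  if first.sex then total = 0;  /* reset at the start of each BY group */
  total + value;                /* sum statement: total is retained across rows */
  if last.sex then output;      /* write one row per BY group */
  keep sex total;
run;
```

With the sample data from the question this should leave two rows, Female with a total of 135 and Male with a total of 189.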

The data set must be sorted before you can use a BY statement. However, the BY statement by itself does nothing visible: it only checks that the data are in the stated order and creates the temporary FIRST. and LAST. variables.
data draft_us;
input group$ ID$ SEX$ Value;
datalines;
G1 010 Male 30
G1 009 Female 35
G1 009 Male 20
G3 009 Male 44
G3 009 Female 0
G3 009 Male 45
G2 009 Male 50
G2 009 Female 100
;
run;
proc sort data=draft_us;
by sex;
run;
*does nothing yet but create a new data set with a new name;
data set_draft;
set draft_us;
by SEX;
run;

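As others have pointed out, you need to use PROC SORT first. But you can also achieve the same result using PROC SQL, and in one step at that! Note the ORDER BY clause: without a summary function, SAS turns a GROUP BY into an ORDER BY anyway and writes a warning to the log.
proc SQL;
create table set_draft as
select *
from draft_us
order by sex
;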
quit;

Related

In SAS how to display unique counts when applying proc tabulate.

I am looking for a way to produce a frequency table displaying the number of unique values of variable ID per unique value of variable Subclass.
I would like to order the results by variable Class.
Preferably I would like to display the number of unique values of ID per Subclass as a fraction of n for ID. In the want-example below this value is displayed under %totalID.
In addition I would like to display the number of unique values of ID per Subclass as a fraction of the sum of unique ID values found within each Class. In the want-example below this value is displayed under %withinclassID.
Have:
ID Class Subclass
-------------------------------
ID1 1 1a
ID1 1 1b
ID1 1 1c
ID1 2 2a
ID2 1 1a
ID2 1 1b
ID2 2 2a
ID2 2 2b
ID2 3 3a
ID3 1 1a
ID3 1 1d
ID3 2 2a
ID3 3 3a
ID3 3 3b
Want:
Unique number
Class Subclass of IDs %totalID %withinclassID
--------------------------------------------------------------------
1
1a 3 100.0 50.00
1b 2 66.67 33.33
1c 1 33.33 16.67
SUM 6
2
2a 3 100.0 75.00
2b 1 33.33 25.00
SUM 4
3
3a 2 66.67 66.67
3b 1 33.33 33.33
SUM 3
My initial approach was to perform a PROC FREQ on NLEVELS producing a frequency table for number of unique IDs per subclass. Here however I lose information on class. I therefore cannot order the results by class.
My second approach involved using PROC TABULATE. I however cannot produce any percentage calculations based on unique counts in such a table.
Is there a direct way to tabulate the frequencies of one variable according to a second variable, grouped by a third variable--displaying overall and within group percentages?
You can do a double proc freq or SQL.
/*This demonstrates how to count the number of unique occurrences of a variable
across groups. It uses the SASHELP.CARS dataset which is available with any SAS installation.
The objective is to determine the number of unique car makers by origin.
Note: The SQL solution can be off if you have a large data set and these are not the only two ways to calculate distinct counts.
If you're dealing with a large data set other methods may be appropriate.*/
*Count distinct IDs;
proc sql;
create table distinct_sql as
select origin, count(distinct make) as n_make
from sashelp.cars
group by origin;
quit;
*Double PROC FREQ;
proc freq data=sashelp.cars noprint;
table origin * make / out=origin_make;
run;
proc freq data=origin_make noprint;
table origin / out= distinct_freq outpct;
run;
title 'PROC FREQ';
proc print data=distinct_freq;
run;
title 'PROC SQL';
proc print data=distinct_sql;
run;
The nlevels option in proc freq can produce the unique count you're after without losing data, providing you include Class and Subclass variables in the by statement. That also means you'll have to presort the data by the same variables.
Then you could try proc tabulate to get the rest of your requirement.
data have;
input ID $ Class Subclass $;
datalines;
ID1 1 1a
ID1 1 1b
ID1 1 1c
ID1 2 2a
ID2 1 1a
ID2 1 1b
ID2 2 2a
ID2 2 2b
ID2 3 3a
ID3 1 1a
ID3 1 1d
ID3 2 2a
ID3 3 3a
ID3 3 3b
;
run;
proc sort data=have;
by class subclass;
run;
ods output nlevels = unique_id_count;
proc freq data=have nlevels;
by class subclass;
run;
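For the percentage columns, one possible route is PROC SQL on the have data set above. This is only a sketch: the table names counts and want and the column names n_id, pct_total_id and pct_within_class are invented for the example, and the second query relies on SAS remerging the per-class sum back onto the detail rows (SAS notes this in the log).

```sas
proc sql;
  /* Step 1: number of distinct IDs per class/subclass */
  create table counts as
  select class, subclass, count(distinct id) as n_id
  from have
  group by class, subclass;

  /* Step 2: percentages relative to all IDs and to the class sum */
  create table want as
  select class, subclass, n_id,
         100 * n_id / (select count(distinct id) from have)
           as pct_total_id format=6.2,
         100 * n_id / sum(n_id)
           as pct_within_class format=6.2
  from counts
  group by class            /* sum(n_id) is computed per class and remerged */
  order by class, subclass;
quit;
```

The result is a plain data set, so the layout with a SUM line per class would still have to come from a reporting procedure such as PROC TABULATE or PROC REPORT.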

SAS: Exclude patients based on diagnoses on multiple lines and calculate incidence rates

I have large dataset of a few million patient encounters that include a diagnosis, timestamp, patientID, and demographic information.
For each patient, their diagnoses are listed on multiple lines. I need to exclude patients who have a certain diagnosis (282.1) and calculate incidence rates of other diseases in the year 2014.
IF diagnosis NE 282.1;
This in the data step does not work, because it does not take into account the other diagnoses on the other lines.
If possible, I would also like to calculate the incidence rates by disease.
This is an example of what the data looks like. There are multiple lines with multiple diagnoses.
PatientID Diagnosis Date Gender Age
1 282.1 1/2/10 F 25
1 232.1 1/2/10 F 87
1 250.02 1/2/10 F 41
1 125.1 1/2/10 F 46
1 90.1 1/2/10 F 58
2 140 12/15/13 M 57
2 132.3 12/15/13 M 41
2 149.1 12/15/13 M 66
3 601.1 11/19/13 F 58
3 231.1 11/19/13 F 76
3 123.1 11/19/13 F 29
4 282.1 12/30/14 F 81
4 130.1 12/30/14 F 86
5 230.1 1/22/14 M 60
5 282.1 1/22/14 M 46
5 250.02 1/22/14 M 53
Dual reading solution
Straightforward version
You said you sorted the data first, probably like this
proc sort data=MYLIB.DIAGNOSES;
by PatientID;
run;
Assuming your data is ordered by PatientID, you can process each patient with the diagnosis to exclude first.
data WORK.NOT_HAVING_282_1;
set MYLIB.DIAGNOSES (where=(diagnosis EQ 282.1))
MYLIB.DIAGNOSES (where=(diagnosis NE 282.1));
by PatientID;
As we need to report by year, not by date:
year = year(Date);
The next step is to exclude those you don't need, so you need to remember whether the unwanted diagnosis occurred:
retain has_282_1;
if first.PatientID then has_282_1 = 0;
if diagnosis EQ 282.1 then has_282_1 = 1;
and then keep the other diagnoses in 2014 for patients that do not have 282.1
else if not has_282_1 then output;
run;
Next you could use SQL to count what you need
proc sql;
create table MYLIB.STATISTICS as
select year, Diagnosis, count(distinct PatientID) as incidence
from WORK.NOT_HAVING_282_1
group by year, Diagnosis;
quit;
Improvements
The above solution takes more processing power than needed:
you read DIAGNOSES from disk, then write NOT_HAVING_282_1 to disk, just to read it back in again
you can keep multiple observations of the same diagnosis at different dates in the same year for the same patient, so you need count(distinct PatientID), which is a costly operation.
About diagnosis 282.1, we only need to know who was ever diagnosed:
proc sort noduplicates
data=MYLIB.DIAGNOSES (where=(diagnosis EQ 282.1))
out=WORK.HAVING_282_1 (keep=PatientID);
by PatientID;
run;
About the other diagnoses, we also need the year, which we calculate here:
data WORK.VIEW_OTHER / view=WORK.VIEW_OTHER;
set MYLIB.DIAGNOSES (where=(diagnosis NE 282.1));
year = year(Date);
keep PatientID year Diagnosis;
run;
but as we use a view, we do not really read and calculate anything before the view is used in this sort:
proc sort noduplicates
data=WORK.VIEW_OTHER
out=WORK.OTHER_DIAGNOSES;
by PatientID year Diagnosis;
run;
Now things become simpler. We use the temporary variables exclude and other to indicate where the data came from:
data WORK.NOT_HAVING_282_1;
set WORK.HAVING_282_1 (in=exclude)
WORK.OTHER_DIAGNOSES (in=other);
by PatientID;
retain has_282_1;
if first.PatientID then has_282_1 = exclude;
if other and not has_282_1 then output;
run;
proc sql;
create table MYLIB.STATISTICS as
select year, Diagnosis, count(*) as incidence
from WORK.NOT_HAVING_282_1
group by year, Diagnosis;
quit;
Remark: this code is not tested
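A different technique from the two above, shown only as an untested sketch: a single PROC SQL query with a subquery can both exclude the patients and restrict to 2014. It assumes that Diagnosis is numeric and that Date is a SAS date value; the table name counts_2014 and the column n_patients are invented for the example.

```sas
proc sql;
  create table WORK.counts_2014 as
  select Diagnosis,
         count(distinct PatientID) as n_patients
  from MYLIB.DIAGNOSES
  where year(Date) = 2014
    /* drop every patient who ever had diagnosis 282.1 */
    and PatientID not in (select PatientID
                          from MYLIB.DIAGNOSES
                          where Diagnosis = 282.1)
  group by Diagnosis;
quit;
```

This gives patient counts per diagnosis, not rates; an incidence rate still needs a denominator (population at risk), which is not part of the data shown in the question.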

SAS Proc GLM and Npar1way

1. (a) Examine the significance of the differences between the mean worm counts for the four control groups using an ANOVA test. Alpha=0.05
(b) If part (a) is significant, find pairs of groups that are significantly different in mean worm counts. Alpha=0.05
2. (a) Examine the significance of the differences between the means for the four control groups using the Kruskal-Wallis test (nonparametric test). Alpha=0.05
(b) Rank your response (all worm counts) in ascending order (choose "mean" for ties) and save the ranks in a dataset "rankworms".
(c) Examine the significance of the differences between the mean ranks of the four control groups using an ANOVA test. Compare your result with the Kruskal-Wallis test. Use the "rankworms" dataset. Alpha=0.05
My Code
data why;
input group $ worm @@;
datalines;
1 279 2 378 3 172 4 381
1 238 2 275 3 335 4 346
1 234 2 412 3 335 4 340
1 198 2 265 3 282 4 471
1 303 2 286 3 250 4 318
;
*Part 1A*;
Proc GLM data=why Alpha=0.05;
Class group;
Model worm = group;
Means group;
Run;
Quit;
*Part 2A*;
Proc Npar1way data=why Alpha=0.05 Wilcoxon;
Class group;
Var worm;
Run;
Quit;
*Part 2B*;
Proc Rank data=why Ties=Mean out=rankworms;
By worm;
Ranks newworm;
Var worm;
Run; Quit;
*Part 2C*;
Proc Npar1way data=why Alpha=0.05 Anova;
Class group;
Var worm;
Run;Quit;
For part 2B I keep getting the message "data may be incomplete". I tried using PROC SORT in order to use a BY statement, but I kept getting an error that the data was not sorted in ascending order. I was under the impression that PROC SORT automatically sorts everything in ascending order. For everything else I would just like to make sure that I am on the right track; the questions are kinda confusing to me. Thanks in advance!
Remove the following from 2B.
BY WORM ;
since you don't want each worm to be ranked, but the groups of worm. It should probably be
BY GROUP;
You'll probably need to sort it by groups first as well.
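A minimal sketch of that suggestion, assuming the why data set from the question. The names why_sorted and rankworms_bygroup are invented here; newworm is the rank variable from the original code.

```sas
/* Ranking within each group: the BY variable must be sorted first */
proc sort data=why out=why_sorted;
  by group;
run;

proc rank data=why_sorted ties=mean out=rankworms_bygroup;
  by group;
  var worm;
  ranks newworm;
run;

/* Without a BY statement, all worm counts are ranked together
   and no sort is needed */
proc rank data=why ties=mean out=rankworms;
  var worm;
  ranks newworm;
run;
```

Which of the two fits depends on how the assignment is read: with BY GROUP the ranks restart in every group, while without BY every count gets one overall rank.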

Combining data from different rows into one variable

I have a table as below:
id sprvsr phone name
2 123 5232 ali
2 128 5458 ali
3 145 7845 oya
3 125 4785 oya
I would like to collapse the rows with the same id and name into one row, with the sprvsr values and the phone values each concatenated into one column, as below:
id sprvsr phone name
2 123-128 5232-5458 ali
3 145-125 7845-4785 oya
Edit:
I have one more question related to this one.
I followed the way you showed me and it works. Thank you! Another problem is, for example:
sprvsr name
5232-5458 ali
5232-5458 ali
5458-5232 ali
Is there any way to put them in the same order?
If you need the values in the same order, you'll need to use a temporary array and sort it. This requires having some idea of how many rows you might have per id. It also requires the data to be sorted by id. This is a bit more complicated than the previous solution (in a previous revision).
data have;
input id sprvsr $ phone $ name $;
datalines;
2 123 5232 ali
2 128 5458 ali
3 145 7845 oya
3 125 4785 oya
4 128 5458 ali
4 123 5232 ali
;
run;
data want;
array phones[99] $8 _temporary_; *initialize these two to some reasonably high number;
array sprvsrs[99] $3 _temporary_;
length phone_all sprvsr_all $200; *same;
set have;
by id;
if first.id then do; *for each id, start out clearing the arrays;
call missing(of phones[*] sprvsrs[*]);
_counter=0;
end;
_counter+1; *increment counter;
phones[_counter]=phone; *assign current phone/sprvsr to array elements;
sprvsrs[_counter]=sprvsr;
if last.id then do; *now, create concatenated list and output;
call sortc(of phones[*]); *sort the lists;
call sortc(of sprvsrs[*]);
phone_all = catx('-',of phones[*]); *concatenate them together;
sprvsr_all= catx('-',of sprvsrs[*]);
output;
end;
drop phone sprvsr;
rename
phone_all=phone
sprvsr_all=sprvsr;
run;
The construction array[*] means "All variables of that array". So catx('-',of phones[*]) means put all phones elements in the catx (fortunately, missing ones are ignored by catx).
This is a way to do that:
data have;
input id sprvsr $ phone $ name $;
datalines;
2 123 5232 ali
2 128 5458 ali
3 145 7845 oya
3 125 4785 oya
;
run;
data want (drop=lag_sprvsr lag_phone);
format id;
length sprvsr $7 phone $9;
set have;
by id;
lag_sprvsr=lag(sprvsr);
lag_phone=lag(phone);
if lag(id)=id then do;
sprvsr=catx('-',lag_sprvsr,sprvsr);
phone=catx('-',lag_phone,phone);
end;
if last.id then output;
run;
Just pay attention to the possible lengths of the input variables and that of the concatenated string. The input dataset must be sorted by id.
The catx() function removes the leading and trailing blanks and concatenates with a delimiter.
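A small standalone check of that behaviour; the variable name joined and the literal values are invented for the illustration:

```sas
data _null_;
  length joined $20;
  /* leading/trailing blanks are stripped and the blank argument is skipped */
  joined = catx('-', '  123 ', ' ', '128');
  put joined=;   /* writes joined=123-128 to the log */
run;
```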

Why does the output delivery system fail to select the output I specify?

I have a short piece of SAS 9.3 code that outputs some Chi-squared statistics. I'm using ods select ChiSq to select only the tables of output I need, but SAS returns this warning:
WARNING: Output 'ChiSq' was not created. Make sure that the output
object name, label, or path is spelled correctly. Also, verify that
the appropriate procedure options are used to produce the requested
output object. For example, verify that the NOPRINT option is not
used.
This is the code:
ods html;
data flu;
input gender treatmentcourse improvement;
datalines;
012
000
110
112
002
002
012
000
000
012
111
011
002
101
002
011
100
100
001
110
110
011
001
010
101
001
000
001
110
100
;
proc format;
value treatmentformat 0 = "minor"
1 = "minor"
2 = "major";
ods trace on;
ods select ChiSq;
proc freq data = flu order = data;
tables treatmentcourse * improvement /chisq cmh;
format treatmentcourse treatmentformat.;
run;
ods trace off;
ods select all;
ods html close;
I can clearly see as a result of ods trace on; that the ChiSq table is included in the output, but SAS ignores the ods select statement and displays all the tables.
In the datalines, the first column is gender, 0 for males and 1 for females. The second column is 0 if the patient took the drug and 1 if the patient took the placebo. The third column is a scale of 0 - 2, 0 being no improvement in flu symptoms and 2 being much improvement.
I think your problem is that you put your custom format on the wrong variable. Try this instead:
proc freq data = flu order = data;
tables treatmentcourse * improvement /chisq cmh;
format improvement treatmentformat.;
run;
If you look at your log more carefully, you should see a SAS NOTE that no statistics were created.
UPDATE: You also should put a run; statement after your proc format step, just to be safe.
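To check that ODS SELECT works as expected independently of the flu data, a small test against SASHELP.CARS may help. This is only a sketch; the choice of the Origin and DriveTrain variables is arbitrary and not taken from the question.

```sas
ods trace on;                   /* log the names of the output objects */
ods select ChiSq;               /* keep only the chi-square table */
proc freq data=sashelp.cars;
  tables origin * drivetrain / chisq;
run;
ods trace off;
```

If only the chi-square table appears here, the selection itself is fine and the problem lies in the data or the format of the original step.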