Sum over all rows and add as a variable (data step) - sas

I have the following table
Row1, 3
Row2, 5
Row3, 8
and I now want to sum over all rows and place the result as a new variable on all rows, i.e.
Row1, 3, SUM(Row1,Row2,Row3)
Row2, 5, SUM(Row1,Row2,Row3)
Row3, 8, SUM(Row1,Row2,Row3)
Just like sum in proc sql would work... I've tried the simple sum, but that only sums the row. Any tips?

First: the SQL solution, or the PROC solution (where you run PROC MEANS to get the sum and then just incorporate it), is usually preferable to the data step solution. Using built-in tools is typically better than writing your own tool to replicate something that already exists.
However, the data step solution isn't terribly complicated. You just need to use what's known colloquially as a DoW loop (after two of the people who popularized it) and iterate over the dataset twice, once to get the sum and a second time to output the rows. You can adapt this easily to summing over by-groups by changing until(eof) to until(last.byvar) (byvar being whatever by variable you are summing over) and adding a by statement with that byvar inside both of the loops.
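data want;
  /* first pass: accumulate the total over every row */
  do _n_ = 1 by 1 until (eof);
    set sashelp.class end=eof;
    sumvar = sum(sumvar,age);
  end;
  /* second pass: re-read the rows and write each one out with the total */
  do _n_ = 1 by 1 until (eof1);
    set sashelp.class end=eof1;
    output;
  end;
run;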
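The by-group adaptation described above might look like the following sketch. It assumes sex as the by variable; class_sorted and want_by are hypothetical dataset names chosen for illustration.

```sas
/* BY-group processing needs the input sorted (or grouped) by the by variable */
proc sort data=sashelp.class out=class_sorted;
  by sex;
run;

data want_by;
  /* first pass: total age within the current by group */
  do _n_ = 1 by 1 until (last.sex);
    set class_sorted;
    by sex;
    sumvar = sum(sumvar, age);
  end;
  /* second pass: output the same group's rows with the group total attached */
  do _n_ = 1 by 1 until (last.sex);
    set class_sorted;
    by sex;
    output;
  end;
run;
```

Because sumvar is not retained, it is reset to missing at the top of each data step iteration, so every by group starts from a fresh total.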

A SQL sum will accomplish this and merge the result back into the dataset automatically.
You will see a note in the log about remerging summary statistics with the original data.
PROC SQL;
Create table want as
Select *, sum(variable2sum) as total
From have;
Quit;
EDIT:
Since SQL wasn't an option, a more common answer is to create the sum in proc means and merge it in. Here's the code for that solution as well:
proc means data=sashelp.class noprint;
output out=summary sum(age)=sum_age;
run;
data class;
set sashelp.class;
if _n_=1 then
set summary;
drop _type_ _freq_;
run;
proc print data=class;
run;

Related

Select many columns and other non-continuous columns to find duplicate?

I have a dataset with many columns like this:
ID Indicator Name C1 C2 C3....C90
A 0001 Black 0 1 1.....0
B 0001 Blue 1 0 0.....1
B 0002 Blue 1 0 0.....1
Some of the IDs are duplicates because the indicator is different, but they're essentially the same record. To find duplicates, I want to select distinct ID, Name and then C1 through C90 to check because some claims who have the same Id and indicator have different C1...C90 values.
Is there a way to select c1...c90 either through proc sql or a sas data step? It seems the only way I can think of is to set the dataset and then drop the non essential columns, but in the actual dataset, it's not only Indicator but at least 15 other columns.
It would be nice if PROC SQL used the : variable name wildcard like other Procs do. When no other alternative is reasonable, I usually use a macro to select bulk columns. This might work for you:
%macro sel_C(n);
%do i=1 %to %eval(&n.-1);
C&i.,
%end;
C&n.
%mend sel_C;
proc sql;
select ID,
Indicator,
Name,
%sel_C(90)
from have_data;
quit;
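Another option, offered as a sketch rather than a drop-in answer: dataset options are honored in the FROM clause of PROC SQL, so a keep= list with the C1-C90 range can do the column selection before the query runs. The table name distinct_rows is hypothetical, and this assumes the columns really are named C1 through C90.

```sas
proc sql;
  /* keep= trims the columns first, so "select *" only sees ID, Name and C1-C90 */
  create table distinct_rows as
  select distinct *
  from have_data(keep=ID Name C1-C90);
quit;
```

Comparing the row count of distinct_rows with that of have_data then shows how many rows were duplicates on those columns.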
If I understand the question properly, the easiest way would be to concatenate the columns into one value. RETAIN that value from row to row, and you can compare it across rows to see whether it is the same.
data want;
set have;
by id indicator;
length cols last_cols $500; *give both variables the same explicit length so neither is truncated;
retain last_cols;
cols = catx('|',of c1-c90);
if first.id then call missing(last_cols);
else do;
identical = (cols = last_cols); *or whatever check you need to perform;
end;
output;
last_cols = cols;
run;
There are a few different ways you can do this and it will be much easier if the actual column names are C1 - C90. If you're just looking to remove anything that you know is a duplicate you can use proc sort.
proc sort data=dups out=nodups nodupkey;
by ID Name C1-C90;
run;
The nodupkey option keeps the first observation for each combination of the by variables and removes the rest.
Alternatively, if you want to know which records contain duplicates, you could use proc summary.
proc summary data=dups nway missing;
class ID Name C1-C90;
output out=onlydups(where=(_freq_ > 1));
run;
proc summary creates two new variables, _type_ and _freq_. If you specify _freq_ > 1 you will only output the duplicate records. Also, note that this will remove the Indicator variable.

Delete N highest from a dataset in sas

I have a bunch of sas datasets of various lengths and I need to trim the nth highest and lowest values by a variable value.
To do this for when I needed to trim the highest and lowest I did this
DATA VDBP273_first_night_Systolic;
SET VDBP273_first_night end=eof;
IF _N_ =1 then delete;
if eof then delete;
run;
And it worked fine.
Now I need to do something more like this
PROC SORT DATA=foo OUT=foo_sorted;
BY bar;
run;
DATA foo_out;
SET foo_sorted end=eof;
IF _N_ <= 5 then delete;
if eof *OR THE 4 right before it* then delete;
run;
I'm sure this is easy but it's stumping me. How can I say the last 5 of this sorted data set delete those?
Since you are presorting your data and then trying to eliminate the top n and bottom n records, you can solve your problem with the FIRSTOBS= and OBS= dataset options.
proc sql noprint;
select count(*) - 5 into :counter from sashelp.class; /* last row to keep, so the final 5 are dropped */
quit;
proc sort data=sashelp.class out=have;by height;run;
proc print data=have;run;
data want;
set have(firstobs=6 obs=&counter);
run;
proc print data=want;run;
You can use the nobs= option on the set statement to store the total number of observations, which then means you can do something similar to your code to exclude the top/bottom n records.
I'd recommend putting the number of records to be excluded in a macro variable; it is easier to read and change than a hard-coded value.
%let excl = 6;
data want;
set sashelp.class nobs=numobs;
if &excl.< _n_ <=(numobs-&excl.);
run;
Or simply repeat the earlier step after sorting in descending order, placing descending before the variable in the by statement:
proc sort data=have out=want; by descending var1; run;

sas create a variable that is equal to obs column

I have a file with 10 obs. and different parameters. I need to add to my data a new variable of 'ID' for each observation- i.e a column of numbers 1-10.
How can I add a variable that is simply equal to the obs column?
I thought about doing it with a loop: define an empty var, run over the var, and each time add 1 to the previous observation. However, it seems kind of complicated. Is there a better way to do it?
You can use the Data Step automatic variable _n_. This is the iteration count of the Data Step loop.
Data want;
set have;
ID = _n_;
run;
If you opt for a Proc SQL solution, there are two ways:
1. Undocumented:
proc sql;
create table want as
select monotonic() as row, *
from sashelp.class
;
quit;
2. Documented:
ods listing close;
ods output sql_results=want;
proc sql number;
select * from sashelp.class;
quit;
ods listing;
@DomPazz's answer would definitely work! Just in case you would like to number the observations within groups defined by some attribute(s), try this:
proc sort data=dataset out=sort_data;
  by attribute; /* replace attribute with your by variable(s) */
run;
data sort_data;
  set sort_data;
  by attribute; /* same variable(s) as in the proc sort above */
  if first.attribute then i = 0; /* reset the counter at the start of each by group */
  i + 1; /* sum statement: i is retained automatically */
  /* if last.attribute and ... then ...; optional end-of-group processing */
run;
first. and last. are very helpful for row-wise operations within by groups.

Frequency of a value across multiple variables?

I have a data set of patient information where I want to count how many patients (observations) have a given diagnostic code. I have 9 possible variables where it can be, in diag1, diag2... diag9. The code is V271. I cannot figure out how to do this with the "WHERE" clause or proc freq.
Any help would be appreciated!!
The basic strategy is to create a dataset that is not at patient level but has one observation per patient-diagnostic code (so up to 9 observations per patient). Something like this:
data want;
set have;
array diag[9];
do _i = 1 to dim(diag);
if not missing(diag[_i]) then do;
diagnosis_Code = diag[_i];
output;
end;
end;
keep diagnosis_code patient_id; /* add any other variables you might want */
run;
You could then run a proc freq on the resulting dataset. You could also change the criteria from not missing to if diag[_i] = 'V271' then do; to get only V271s in the data.
An alternate way to reshape the data that can match Joe's method is to use proc transpose like so.
proc transpose data=have out=want(keep=patient_id col1
rename=(col1=diag)
where=(diag is not missing));
by patient_id;
var diag1-diag9;
run;
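If only the single code V271 matters, reshaping may not be needed at all. The following is a minimal sketch that assumes diag1-diag9 are character variables and that have holds one observation per patient; flagged and has_v271 are hypothetical names.

```sas
data flagged;
  set have;
  /* whichc returns the position of the first match, or 0 when V271 is absent */
  has_v271 = (whichc('V271', of diag1-diag9) > 0);
run;

proc freq data=flagged;
  tables has_v271; /* the has_v271 = 1 row counts patients with the code */
run;
```

This counts each patient once even when the code appears in more than one of the nine variables.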

How do I eliminate variables with missing results in SAS?

Here are my results. Since PPD has a missing result I'd like to eliminate all results for PPD. I.e. I'd like to eliminate all records where ticker='PPD' if any record where ticker='PPD' has a missing result (corr).
How can I program this in SAS? I don't want to just eliminate that missing observation but eliminate PPD altogether. Thanks.
Ticker Day Corr
PPD 7 -1
PPD 8
PTP 7 0.547561231
PTP 8 0.183279038
Lots of ways to do this, and what is most efficient depends on your data. If you don't have too much data, then I'd use the easiest method that fits with your knowledge and other habits.
*SQL delete;
proc sql;
delete from have H where exists (
select 1 from have V where H.ticker=V.ticker and V.corr is null);
quit;
*FREQ for missing (or means or whatever) then delete from that;
*Requires have to be sorted.;
proc freq data=have;
tables ticker*corr/missing out=ismiss(where=(missing(corr)));
run;
data want;
merge have(in=_h) ismiss(in=_m);
by ticker;
if _h and not _m;
run;
*double DoW. Requires either that the dataset is sorted by ticker,;
*or that it is grouped by ticker (the tickers need not be in alphabetical order);
*with notsorted added to the by statement;
data want;
do _n_=1 by 1 until (last.ticker);
set have;
by ticker;
if missing(corr) then _miss=1;
end;
do _n_=1 by 1 until (last.ticker);
set have;
by ticker;
if _miss ne 1 then output;
end;
run;
This is easily accomplished in PROC SQL...
proc sql ;
create table to_delete as
select distinct ticker
from mydata
where missing(corr) ;
delete from mydata
where ticker in(select ticker from to_delete) ;
quit ;
Unfortunately, it can't be done in a single SQL statement as the delete from statement would recursively reference the source dataset.
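If modifying mydata in place is not required, one possible alternative is to build a new table that leaves out the affected tickers. This is only a sketch; the output name want is an assumption.

```sas
proc sql;
  /* keep only tickers that have no record with a missing corr */
  create table want as
  select *
  from mydata
  where ticker not in (select ticker from mydata where corr is missing);
quit;
```

The original table stays untouched, which makes the result easy to check before anything is discarded.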