Delete N highest from a dataset in sas - sas

I have a bunch of sas datasets of various lengths and I need to trim the nth highest and lowest values by a variable value.
To do this for when I needed to trim the highest and lowest I did this
DATA VDBP273_first_night_Systolic;
SET VDBP273_first_night end=eof;
IF _N_ =1 then delete;
if eof then delete;
run;
And it worked fine.
Now I need to do something more like this
PROC SORT DATA=foo OUT=foo_sorted;
BY bar;
run;
DATA foo_out;
SET foo_sorted end=eof;
IF _N_ <= 5 then delete;
if eof *OR THE 4 right before it* then delete;
run;
I'm sure this is easy but it's stumping me. How can I say the last 5 of this sorted data set delete those?

Since you are presorting your data and then trying to eliminate top n and bottom n record, You can easily solve your problem using OBS= and FIRSTOBS= dataset option.
proc sql noprint;
select count(*) -4 into:counter from sashelp.class ;
quit;
proc sort data=sashelp.class out=have;by height;run;
proc print data=have;run;
data want;
set have(firstobs=6 obs=&counter);
run;
proc print data=want;run;

You can use the nobs= dataset option to store the total number of observations, which then means you can do something similar to your code to exclude the top/bottom n records.
I'd recommend putting the number of records to be excluded in a macro variable, it makes it easier to read and change than hard coding it.
%let excl = 6;
data want;
set sashelp.class nobs=numobs;
if &excl.< _n_ <=(numobs-&excl.);
run;

or simply do the same step done before, adding descending to the proc sort variable
proc sort data=have out=want; by var1 descending; run;

Related

Sum over all rows and add as a variable (data step)

I have the following table
Row1, 3
Row2, 5
Row3, 8
and I now want to sum over all rows and place the result as a new variable on all rows, i.e.
Row1, 3, SUM(Row1,Row2,Row3)
Row2, 5, SUM(Row1,Row2,Row3)
Row3, 8, SUM(Row1,Row2,Row3)
Just like sum in proc sql would work... I've tried the simple sum, but that only sums the row. Any tips?
First: the SQL solution, or the PROC solution (where you run PROC MEANS to get the sum and then just incorporate it), is generally substantially preferred to the data step solution in most cases. Using built-in tools is typically better than writing your own tool to replicate something already extant.
However, the data step solution isn't terribly complicated. You just need to use what's known colloquially as a DoW loop (after two of the people who popularized it) and iterate over the dataset twice, once to get the sums and then the second time to output the rows. You can adapt this easily to summing over by-groups by changing until(eof) to until(last.byvar) (byvar being whatever by variable you are summing over) and adding a by group with that byvar inside both of the loops.
data want;
do _n_ = 1 by 1 until (eof);
set sashelp.class end=eof;
sumvar = sum(sumvar,age);
end;
do _n_ = 1 by 1 until (eof1);
set sashelp.class end=eof1;
output;
end;
run;
A SQL sum will accomplish this and merge it back in to the dataset automatically.
You will see a note in the LOG regarding merging the data.
PROC SQL;
Create table want as
Select *, sum(variable2sum) as total
From have;
Quit;
EDIT:
Since SQL wasn't an option, a more common answer is to create the sum in proc means and merge it in. Here's the code for that solution as well:
proc means data=sashelp.class noprint;
output out=summary mean(age)=avg_age;
run;
data class;
set sashelp.class;
if _n_=1 then
set summary;
drop _type_ _freq_;
run;
proc print data=class;
run;

sas create a variable that is equal to obs column

I have a file with 10 obs. and different parameters. I need to add to my data a new variable of 'ID' for each observation- i.e a column of numbers 1-10.
How can I add a variable that is simply equal to the obs column?
I thought about doing it with a loop, define an empty vat, run over the var and each time add '1' to previous observation, however, it seems kind of complicated. Is there a better way to do it?
You can use the Data Step automatic variable _n_. This is the iteration count of the Data Step loop.
Data want;
set have;
ID = _n_;
run;
If you opt for a Proc SQL solution, there are two ways:
1. Undocumented:
proc sql;
create table want as
select monotonic() as row, *
from sashelp.class
;
quit;
Documented:
ods listing close;
ods output sql_results=want;
proc sql number;
select * from sashelp.class;
quit;
ods listing;
#DomPazz answer would definitely work! Just in case you would like return the number of observations according to attributes, Try this:
proc sort data= dataset out= sort_data;
by * your attribute(s) *;
data sort_data;
set sort_data;
by * your attribute(s) that is listed in above proc sort statement *;
if first.attribute then i=1; <=== first by group observation, number =1
i + 1; <==== sum statement (retaining)
if last.attribute and .... then ....; <=== whatever you want to do . Not necessary
run;
first / Last is very helpful in doing row operation.

Set the labels of a SAS Dataset equal to their variable name

I'm working with a rather large several dataset that are provided to me as a CSV files. When I attempt to import one of the files the data will come in fine but, the number of variables in the file is too large for SAS, so it stops reading the variable names and starts assigning them sequential numbers. In order to maintain the variable names off of the data set I read in the file with the data row starting on 1 so it did not read the first row as variable names -
proc import file="X:\xxx\xxx\xxx\Extract\Live\Live.xlsx" out=raw_names dbms=xlsx replace;
SHEET="live";
GETNAMES=no;
DATAROW=1;
run;
I then run a macro to start breaking down the dataset and rename the variables based on the first observations in each variable -
%macro raw_sas_datasets(lib,output,start,end);
data raw_names2;
raw_names;
if _n_ ne 1 then delete;
keep A -- E &start. -- &end.;
run;
proc transpose data=raw_names2 out=raw_names2;
var A -- &end.;
run;
data raw_names2;
set raw_names2;
col1=compress(col1);
run;
data raw_values;
set raw;
keep A -- E &start. -- &end.;
run;
%macro rename(old,new);
data raw_values;
set raw_values;
rename &old.=&new.;
run;
%mend rename;
data _null_;
set raw_names2;
call execute('%rename('||_name_||","||col1||")");
run;
%macro freq(var);
proc freq data=raw_values noprint;
tables &var. / out=&var.;
run;
%mend freq;
data raw_names3;
set raw_names2;
if _n_ < 6 then delete;
run;
data _null_;
set raw_names3;
call execute('%freq('||col1||")");
run;
proc sort data=raw_values;
by StudySubjectID;
run;
data &lib..&output.;
set raw_values;
run;
%mend raw_sas_datasets;
The problem I'm running into is that the variable names are now all set properly and the data is lined up correctly, but the labels are still the original SAS assigned sequential numbers. Is there any way to set all of the labels equal to the variable names?
If you just want to remove the variable labels (at which point they default to the variable name), that's easy. From the SAS Documentation:
proc datasets lib=&lib.;
modify &output.;
attrib _all_ label=' ';
run;
I suspect you have a simpler solution than the above, though.
The actual renaming step needs to be done differently. Right now it's rewriting the entire dataset over and over again - for a lot of variables that is a terrible idea. Get your rename statements all into one datastep, or into a PROC DATASETS, or something else. Look up 'list processing SAS' for details on how to do that; on this site or on google you will find lots of solutions.
You likely can get SAS to read in the whole first line. The number of variables isn't the problem; it is probably the length of the line. There's another question that I'll find if I can on this site from a few months ago that deals with this exact problem.
My preferred option is not to use PROC IMPORT for CSVs anyway; I would suggest writing a metadata table that stores the variable names and the length/types for the variables, then using that to write import code. A little more work at first, but only has to be done once per study and you guarantee PROC IMPORT isn't making silly decisions for you.
In the library sashelp is a table vcolumn. vcolumn contains all the names of your variables for each library by table. You could write a macro that puts all your variable names into macro variables and then from there set the label.
Here's some code that I put together (not very pretty) but it does what you're looking for:
data test.label_var;
x=1;
y=1;
label x = 'xx';
label y = 'yy';
run;
proc sql noprint;
select count(*) into: cnt
from sashelp.vcolumn
where memname = 'LABEL_VAR';quit;
%let cnt = &cnt;
proc sql noprint;
select name into: name1 - :name&cnt
from sashelp.vcolumn
where memname = 'LABEL_VAR';quit;
%macro test;
%do i = 1 %to &cnt;
proc datasets library=test nolist;
modify label_var;
label &&name&i=&&name&i;
quit;
%end;
%mend test;
%test;

How do I eliminate variables with missing results in SAS?

Here are my results. Since PPD has a missing result I'd like to eliminate all results for PPD. I.e. I'd like to eliminate all records where ticker='PPD' if any record where ticker='PPD' has a missing result (corr).
How can I program this in SAS? I don't want to just eliminate that missing observation but eliminate PPD altogether. Thanks.
Ticker Day Corr
PPD 7 -1
PPD 8
PTP 7 0.547561231
PTP 8 0.183279038
Lots of ways to do this, and what is most efficient depends on your data. If you don't have too much data, then I'd use the easiest method that fits with your knowledge and other habits.
*SQL delete;
proc sql;
delete from have H where exists (
select 1 from have V where H.ticker=V.ticker and V.corr is null);
quit;
*FREQ for missing (or means or whatever) then delete from that;
*Requires have to be sorted.;
proc freq data=have;
tables ticker*corr/missing out=ismiss(where=(missing(corr)));
run;
data want;
merge have(in=_h) ismiss(in=_m);
by ticker;
if _h and not _m;
run;
*double DoW. Requires either dataset is sorted by ticker,;
*or requires it to be organized by ticker (but tickers can be not alphabetically sorted); *and use norsorted on by statement;
data want;
do _n_=1 by 1 until (last.ticker);
set have;
by ticker;
if missing(corr) then _miss=1;
end;
do _n_=1 by 1 until (last.ticker);
set have;
by ticker;
if _miss ne 1 then output;
end;
run;
This is easily accomplished in PROC SQL...
proc sql ;
create table to_delete as
select distinct ticker
from mydata
where missing(corr) ;
delete from mydata
where ticker in(select ticker from to_delete) ;
quit ;
Unfortunately, it can't be done in a single SQL step as the delete from statement would recursively reference the source dataset.

SAS - How to get last 'n' observations from a dataset?

How can you create a SAS data set from another dataset using only the last n observations from original dataset. This is easy when you know the value of n. If I don't know 'n' how can this be done?
This assumes you have a macro variable that says how many observations you want. NOBS tells you the number of observations in the dataset currently without reading the whole thing.
%let obswant=5;
data want;
set sashelp.class nobs=obscount;
if _n_ gt (obscount-&obswant.);
run;
Using Joe's example of a macro variable to specify the number of observations you want, here is another answer:
%let obswant = 10;
data want;
do _i_=nobs-(&obswant-1) to nobs;
set have point=_i_ nobs=nobs;
output;
end;
stop; /* Needed to stop data step */
run;
This should perform better since it only reads the specific observations you want.
If the dataset is large, you might not want to read the whole dataset. Instead you could try a construction that reads the total number of Observations in the dataset first. So if you want to have the last of observations:
data t;
input x;
datalines;
1
2
3
4
;
%let dsid=%sysfunc(open(t));
%let num=%sysfunc(attrn(&dsid,nlobs));
%let rc=%sysfunc(close(&dsid));
%let number = 2;
data tt;
set t (firstobs = %eval(&num.-&number.+1));
run;
For the sake of variety, here's another approach (not necessarily a better one)
%let obswant=5;
proc sql noprint;
select nlobs-&obswant.+1 into :obscalc
from dictionary.tables
where libname='SASHELP' and upcase(memname)='CLASS';
quit;
data want;
set sashelp.class (firstobs=&obscalc.);
run;
You can achive this using the
_nobs_ and _n_ variables. First, create a temporary variable to store the total no of obs. Then compare the automatic variable N to nobs.
data a;
set sashelp.class nobs=_nobs_;
if _N_ gt _nobs_ -5;
run;