sas create a variable that is equal to obs column - sas

I have a file with 10 obs. and different parameters. I need to add to my data a new variable of 'ID' for each observation- i.e a column of numbers 1-10.
How can I add a variable that is simply equal to the obs column?
I thought about doing it with a loop, define an empty vat, run over the var and each time add '1' to previous observation, however, it seems kind of complicated. Is there a better way to do it?

You can use the Data Step automatic variable _n_. This is the iteration count of the Data Step loop.
Data want;
set have;
ID = _n_;
run;

If you opt for a Proc SQL solution, there are two ways:
1. Undocumented:
proc sql;
create table want as
select monotonic() as row, *
from sashelp.class
;
quit;
Documented:
ods listing close;
ods output sql_results=want;
proc sql number;
select * from sashelp.class;
quit;
ods listing;

#DomPazz answer would definitely work! Just in case you would like return the number of observations according to attributes, Try this:
proc sort data= dataset out= sort_data;
by * your attribute(s) *;
data sort_data;
set sort_data;
by * your attribute(s) that is listed in above proc sort statement *;
if first.attribute then i=1; <=== first by group observation, number =1
i + 1; <==== sum statement (retaining)
if last.attribute and .... then ....; <=== whatever you want to do . Not necessary
run;
first / Last is very helpful in doing row operation.

Related

SAS targed encored

Hi,
Can someone explain to me what a given code sequence does step by step?**
I must describe it in detail what is happening in turn
%macro frequency_encoding(dataset, var);
proc sql noprint;
create table freq as
select distinct(&var) as values, count(&var) as number
from &dataset
group by Values ;
create table new as select *, round(freq.number/count(&var),00.01) As freq_encode
from &dataset left join freq on &var=freq.values;
quit;
data new(drop=values number &var);
set new;
rename freq_encode=&var;
run;
data new;
set new;
keep &var;
run;
data dane(drop = &var);
set dane;
run;
data dane;
set dane;
set new;
run;
The SQL is first finding the frequency of each value of the variable. Then it divides those counts by the total number of non-missing values and rounds that percentage to two decimal places (or integers when you think of the ratio as a percentage).
This could be done in one step with:
proc sql noprint;
create table new as
select *,round(number/count(&var),0.01) as freq_encode
from (select *,&var as values,count(&var) as number
from &dataset
group by &var
)
;
quit;
It is not clear what the DANE dataset is supposed to be. If &DATESET does not equal DANE then those last four data steps make no sense. If it does then it is a convoluted way to replace the original variable with the percentage.
The first one is basically trying to rename the calculated percentage as the original variable and eliminate the original variable and the other two intermediate variables used in calculating the percentage.
The second one is dropping all of the variables except the new percentage.
The third one is dropping the original variable from "dane".
The last one is adding the new variable back to "dane".
Assuming DANE should be replaced with &DATASET then those four data steps could be reduced to one:
data &dataset;
set &dataset(drop=&var);
set new(keep=freq_encode rename=(freq_encode=&var));
run;
It is probably best not to overwrite your original dataset in that way. So perhaps you should add an OUT parameter to your macro to name to new dataset you want to create.
You could have avoided all of those data steps by just adding the DROP= and RENAME= dataset options to the dataset generated by the SQL query.
So perhaps you want something like this:
%macro frequency_encoding(dataset, var,out);
proc sql noprint;
create table &out(drop=&var number rename=(freq_encode=&var)) as
select *,round(number/count(&var),0.01) as freq_encode
from (select *,count(&var) as number
from &dataset
group by &var
)
;
quit;
%mend ;
%frequency_encoding(sashelp.class,sex,work.class);

Select many columns and other non-continuous columns to find duplicate?

I have a dataset with many columns like this:
ID Indicator Name C1 C2 C3....C90
A 0001 Black 0 1 1.....0
B 0001 Blue 1 0 0.....1
B 0002 Blue 1 0 0.....1
Some of the IDs are duplicates because the indicator is different, but they're essentially the same record. To find duplicates, I want to select distinct ID, Name and then C1 through C90 to check because some claims who have the same Id and indicator have different C1...C90 values.
Is there a way to select c1...c90 either through proc sql or a sas data step? It seems the only way I can think of is to set the dataset and then drop the non essential columns, but in the actual dataset, it's not only Indicator but at least 15 other columns.
It would be nice if PROC SQL used the : variable name wildcard like other Procs do. When no other alternative is reasonable, I usually use a macro to select bulk columns. This might work for you:
%macro sel_C(n);
%do i=1 %to %eval(&n.-1);
C&i.,
%end;
C&n.
%mend sel_C;
proc sql;
select ID,
Indicator,
Name,
%sel_C(90)
from have_data;
quit;
If I understand the question properly, the easiest way would be to concatenate the columns to one. RETAIN that value from row to row, and you can compare it across rows to see if it's the same or not.
data want;
set have;
by id indicator;
retain last_cols;
length last_cols $500;
cols = catx('|',of c1-c90);
if first.id then call missing(last_cols);
else do;
identical = (cols = last_cols); *or whatever check you need to perform;
end;
output;
last_cols = cols;
run;
There are a few different ways you can do this and it will be much easier if the actual column names are C1 - C90. If you're just looking to remove anything that you know is a duplicate you can use proc sort.
proc sort data=dups out=nodups nodupkey;
by ID Name C1-C90;
run;
The nodupkey option will automatically remove any duplicates in the by statement.
Alternatively, if you want to know which records contain duplicates, you could use proc summary.
proc summary data=dups nway missing;
class ID Name C1-C90;
output out=onlydups(where=(_freq_ > 1));
run;
proc summary creates two new variables, _type_ and _freq_. If you specify _freq_ > 1 you will only output the duplicate records. Also, note that this will remove the Indicator variable.

SAS equivalent to R’s is.element()

It’s the first time that I’ve opened sas today and I’m looking at some code a colleague wrote.
So let’s say I have some data (import) where duplicates occur but I want only those which have a unique number named VTNR.
First she looks for unique numbers:
data M.import;
set M.import;
by VTNR;
if first.VTNR=1 then unique=1;
run;
Then she creates a table with the duplicated numbers:
data M.import_dup1;
set M.import;
where unique^=1;
run;
And finally a table with all duplicates.
But here she is really hardcoding the numbers, so for example:
data M.import_dup2;
set M.import;
where VTNR in (130001292951,130100975613,130107546425,130108026864,130131307133,130134696722,130136267001,130137413257,130137839451,130138291041);
run;
I’m sure there must be a better way.
Since I’m only familiar with R I would write something like:
import_dup2 <- subset(import, is.element(import$VTNR, import_dup1$VTNR))
I guess there must be something like the $ also for sas?
To me it looks like the most direct translation of the R code
import_dup2 <- subset(import, is.element(import$VTNR, import_dup1$VTNR))
Would be to use SQL code
proc sql;
create table import_dup2 as
select * from import
where VTNR in (select VTNR from import_dup1)
;
quit;
But if your intent is to find the observations in IMPORT that have more than one observation per VTNR value there is no need to first create some other table.
data import_dup2 ;
set import;
by VTNR ;
if not (first.VTNR and last.VTNR);
run;
I would use the options in PROC SORT.
Make sure to specify an OUT= dataset otherwise you'll overwrite your original data.
/*Generate fake data with dups*/
data class;
set sashelp.class sashelp.class(obs=5);
run;
/*Create unique and dup dataset*/
proc sort data=class nouniquekey uniqueout=uniquerecs out=dups;
by name;
run;
/*Display results - for demo*/
proc print data=uniquerecs;
title 'Unique Records';
run;
proc print data=dups;
title 'Duplicate Records';
run;
Above solution can give you duplicates but not unique values. There are many possible ways to do both in SAS. Very easy to understand would be a SQL solution.
proc sql;
create table no_duplicates as
select *
from import
group by VTNR
having count(*) = 1
;
create table all_duplicates as
select *
from import
group by VTNR
having count(*) > 1
;
quit;
I would use Reeza's or Tom's solution, but for completeness, the solution most similar to R (and your preexisting code) would be three steps. Again, I wouldn't use this here, it's excess work for something you can do more easily, but the concept is helpful in other situations.
First, get the dataset of duplicates - either her method, or proc sort.
proc sort nodupkey data=have out=nodups dupout=dups;
by byvar;
run;
Then pull those into a macro list:
proc sql;
select byvar
into :duplist separated by ','
from dups;
quit;
Then you have them in &duplist. and can use them like so:
data want;
set have;
if not (byvar in &duplist.);
run;
data want;
set import;
where VTNR in import_dup1;
run;

SAS sum variables using name after a proc transpose

I have a table with postings by category (a number) that I transposed. I got a table with each column name as _number for example _16, _881, _853 etc. (they aren't in order).
I need to do the sum of all of them in a proc sql, but I don't want to create the variable in a data step, and I don't want to write all of the columns names either . I tried this but doesn't work:
proc sql;
select sum(_815-_16) as nnl
from craw.xxxx;
quit;
I tried going to the first number to the last and also from the number corresponding to the first place to the one corresponding to the last place. Gives me a number that it's not correct.
Any ideas?
Thanks!
You can't use variable lists in SQL, so _: and var1-var6 and var1--var8 don't work.
The easiest way to do this is a data step view.
proc sort data=sashelp.class out=class;
by sex;
run;
*Make transposed dataset with similar looking names;
proc transpose data=class out=transposed;
by sex;
id height;
var height;
run;
*Make view;
data transpose_forsql/view=transpose_forsql;
set transposed;
sumvar = sum(of _:); *I confirmed this does not include _N_ for some reason - not sure why!;
run;
proc sql;
select sum(sumvar) from transpose_Forsql;
quit;
I have no documentation to support this but from my experience, I believe SAS will assume that any sum() statement in SQL is the sql-aggregate statement, unless it has reason to believe otherwise.
The only way I can see for SAS to differentiate between the two is by the way arguments are passed into it. In the below example you can see that the internal sum() function has 3 arguments being passed in so SAS will treat this as the SAS sum() function (as the sql-aggregate statement only allows for a single argument). The result of the SAS function is then passed in as the single parameter to the sql-aggregate sum function:
proc sql noprint;
create table test as
select sex,
sum(sum(height,weight,0)) as sum_height_and_weight
from sashelp.class
group by 1
;
quit;
Result:
proc print data=test;
run;
sum_height_
Obs Sex and_weight
1 F 1356.3
2 M 1728.6
Also note a trick I've used in the code by passing in 0 to the SAS function - this is an easy way to add an additional parameter without changing the intended result. Depending on your data, you may want to swap out the 0 for a null value (ie. .).
EDIT: To address the issue of unknown column names, you can create a macro variable that contains the list of column names you want to sum together:
proc sql noprint;
select name into :varlist separated by ','
from sashelp.vcolumn
where libname='SASHELP'
and memname='CLASS'
and upcase(name) like '%T' /* MATCHES HEIGHT AND WEIGHT */
;
quit;
%put &varlist;
Result:
Height,Weight
Note that you would need to change the above wildcard to match your scenario - ie. matching fields that begin with an underscore, instead of fields that end with the letter T. So your final SQL statement will look something like this:
proc sql noprint;
create table test as
select sex,
sum(sum(&varlist,0)) as sum_of_fields_ending_with_t
from sashelp.class
group by 1
;
quit;
This provides an alternate approach to Joe's answer - though I believe using the view as he suggests is a cleaner way to go.

SAS can I make sas process observations backwards

I know that Sas starts with the observation at the top of a dataset when processing and proceeds to the next until it reaches the bottom observation, but is there an easy way to make sas process the bottom observation first and then work its way to the top?
You can use nobs and point to process it backwards without having to do any intermediate steps. Here's an example:
data backwards;
do k= nobs to 1 by -1;
set sashelp.class nobs = nobs point=k;
output;
end;
stop;
run;
proc print data=sashelp.class;run;
proc print data=backwards;run;
See page 2 of this pdf for all the juicy details.
You certainly can change your data to be in reverse order, then process top down. Add a variable to the data set which acts as an index..then sort the data set descending by that variable.
data work.myData ;
set work.myData ;
indx = _n_ ;
run ;
proc sort data=work.myData ;
by descending indx ;
run ;
You can flip your observations order in a single step using PROC SQL:
proc sql;
create table work.cars as
select *
from sashelp.cars
order by monotonic() desc;
quit;
The key here is order by monotonic() desc which translates as "sort by descending observation numbers".
Alternatively, you can create a view (instead of creating a table) which will refer to the original table, but in a reverse observation number order:
proc sql;
create view work.cars_rev as
select *
from sashelp.cars
order by monotonic() desc;
quit;