SAS differences in outcome between PROC SQL and PROC MEANS

I want to store a weighted average of a variable in a macro variable. My variable is zindi&aa and my weight is wprm&aa.
I am trying to make sense of two ways of doing it.
The first uses PROC SQL:
proc sql noprint;
select mean(zindi&aa. * wprm&aa.)/sum(wprm&aa.) into :Mean_zindi_aa
from Panel(where=(annee&ap.<="&nais18" ));
quit;
This gives me an average of 0.77.
The second uses PROC MEANS:
proc means data=Panel(where=(annee&ap.<="&nais18" ));
var zindi&ap. ;
weight wprm&ap ;
output out=mean_zindi&ap mean=moy&ap;
run;
followed by a data _null_ step. This yields an average of around 20200, which seems to be the correct one.
So my questions are:
What am I missing in the PROC SQL version that makes it give an absurd result?
Is there a better way to obtain my macro variable &mean_zindi_aa?

proc sql noprint;
select sum(zindi&aa. * wprm&aa.)/sum(wprm&aa.) into :Mean_zindi_aa
from Panel(where=(annee&ap.<="&nais18" ));
quit;
Try this. It looks like you are taking the mean of (zindi&aa. * wprm&aa.). If you need the weighted average, the code above should work, because weighted average = sum(weight*variable)/sum(weights).

Change your PROC SQL to:
proc sql noprint;
select SUM(zindi&aa. * wprm&aa.)/sum(wprm&aa.) into :Mean_zindi_aa
from Panel(where=(annee&ap.<="&nais18" ));
quit;
You need to SUM the product, not take the MEAN.
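To see why the two expressions differ, here is a minimal sketch on toy data (the dataset toy and the variables z and w are made up for illustration, not taken from the question). MEAN(z*w) is SUM(z*w) divided by the number of rows n, so the original expression comes out n times too small:

```sas
/* Toy data: three observations with a value z and a weight w (hypothetical names) */
data toy;
  input z w;
datalines;
10 1
20 2
30 3
;
run;

proc sql noprint;
  select mean(z*w)/sum(w),   /* (140/3)/6, about 7.78: too small by a factor of n = 3 */
         sum(z*w)/sum(w)     /* 140/6, about 23.33: the weighted average */
    into :wrong_mean, :weighted_mean
    from toy;
quit;

%put wrong=&wrong_mean weighted=&weighted_mean;
```

With three rows the first value is a third of the second. On a table with thousands of rows the same factor would produce a gap of the kind described in the question.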

Related

SAS-How to calculate the difference between max and min value?

I used the following code to calculate the difference between the max and min values in a column, but it doesn't look like a smart way to do it. Could anyone give me some suggestions?
P.S. I need to put the difference back into the dataset as a new variable, because I want to delete datasets based on this difference.
proc univariate noprint data=test;
var time_l_;
output out=result max=max min=min;
run;
data test;
set result test;
run;
data test;
set test;
gap=max-min;
run;
You're pretty close, actually, to what I'd consider a good result. This isn't the absolute fastest way to do it, but it's probably the best when you don't need amazing performance because it's a lot less complicated than the faster methods.
Create the max/min dataset, then use if _n_ = 1 then set result; which will bring it in once. The variables are automatically RETAINed, because they are brought in on the SET statement. Then calculate the gap in the same data step.
proc univariate noprint data=sashelp.class;
var age;
output out=result max=max min=min;
run;
data test;
if _n_=1 then set result;
set sashelp.class;
gap = max-min;
run;
The SQL solution is straightforward but leaves a message in your log regarding remerging.
Proc SQL;
Create table want as
Select *, max(age) as max_age, min(age) as min_age, calculated max_age - calculated min_age as age_diff
From have;
Quit;
A simpler SQL solution using the range function:
proc sql;
create table want as
select *,range(age) as age_range
from sashelp.class;
quit;

finding max of many columns using proc sql statement

I am trying to write a PROC SQL query in SAS to determine the maximum of many columns whose names start with particular letters (say RF*). The existing PROC MEANS step I have looks like this:
proc means data = input_table nway noprint missing;
var age x y z RF: ST: ;
class a b c;
output out = output_table (drop = _type_ _freq_) max=;
run;
Here RF: refers to all columns starting with RF, and likewise for ST:. I was wondering if there is something similar in PROC SQL that I can use?
Thanks!
Dynamic SQL is indeed the way to go with this, if you must use SQL. The good news is that you can do it all in one proc sql call using only one macro variable, e.g.:
proc sql noprint;
select catx(' ','max(',name,') as',name) into :MAX_LIST separated by ','
from dictionary.columns
where libname = 'SASHELP'
and memname = 'CLASS'
and type = 'num'
/*eq: is not available in proc sql in my version of SAS, but we can use substr to match partial variable names*/
and upcase(substr(name,1,1)) in ('A','W') /*Match all numeric vars that have names starting with A or W*/
;
create table want as select SEX, &MAX_LIST
from sashelp.class
group by SEX;
quit;
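Applied to the question's own names, the same idea might look like the sketch below. It assumes input_table lives in the WORK library and that at least one numeric column starts with RF or ST (otherwise &max_list is never created and the second statement fails); both are assumptions, not something stated in the question.

```sas
proc sql noprint;
  /* Build "max( RF1 ) as RF1, max( ST1 ) as ST1, ..." from the column metadata */
  select catx(' ', 'max(', name, ') as', name)
    into :max_list separated by ', '
    from dictionary.columns
    where libname = 'WORK'            /* assumption: the table is in WORK */
      and memname = 'INPUT_TABLE'     /* libname and memname are stored in upper case */
      and type = 'num'
      and (upcase(name) like 'RF%' or upcase(name) like 'ST%');

  /* Fixed columns are written out by hand; the RF and ST columns come from the macro variable */
  create table output_table as
    select a, b, c,
           max(age) as age, max(x) as x, max(y) as y, max(z) as z,
           &max_list
      from input_table
      group by a, b, c;
quit;
```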

SAS equivalent to R’s is.element()

It’s the first time that I’ve opened sas today and I’m looking at some code a colleague wrote.
So let’s say I have some data (import) where duplicates occur but I want only those which have a unique number named VTNR.
First she looks for unique numbers:
data M.import;
set M.import;
by VTNR;
if first.VTNR=1 then unique=1;
run;
Then she creates a table with the duplicated numbers:
data M.import_dup1;
set M.import;
where unique^=1;
run;
And finally a table with all duplicates.
But here she is really hardcoding the numbers, so for example:
data M.import_dup2;
set M.import;
where VTNR in (130001292951,130100975613,130107546425,130108026864,130131307133,130134696722,130136267001,130137413257,130137839451,130138291041);
run;
I’m sure there must be a better way.
Since I’m only familiar with R I would write something like:
import_dup2 <- subset(import, is.element(import$VTNR, import_dup1$VTNR))
I guess there must be something like the $ also for sas?
To me it looks like the most direct translation of the R code
import_dup2 <- subset(import, is.element(import$VTNR, import_dup1$VTNR))
would be to use SQL code:
proc sql;
create table import_dup2 as
select * from import
where VTNR in (select VTNR from import_dup1)
;
quit;
But if your intent is to find the observations in IMPORT that have more than one observation per VTNR value there is no need to first create some other table.
data import_dup2 ;
set import;
by VTNR ;
if not (first.VTNR and last.VTNR);
run;
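Note that BY-group processing needs the input sorted (or indexed) by VTNR. A minimal sketch that sorts first and then writes both the unique and the duplicated records in one pass (the dataset names import_sorted and import_unique are made up for illustration):

```sas
/* first.VTNR and last.VTNR are only available when the data is in VTNR order */
proc sort data=import out=import_sorted;
  by VTNR;
run;

data import_unique import_dup2;
  set import_sorted;
  by VTNR;
  if first.VTNR and last.VTNR then output import_unique;  /* VTNR occurs exactly once */
  else output import_dup2;                                /* VTNR occurs more than once */
run;
```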
I would use the options in PROC SORT.
Make sure to specify an OUT= dataset otherwise you'll overwrite your original data.
/*Generate fake data with dups*/
data class;
set sashelp.class sashelp.class(obs=5);
run;
/*Create unique and dup dataset*/
proc sort data=class nouniquekey uniqueout=uniquerecs out=dups;
by name;
run;
/*Display results - for demo*/
proc print data=uniquerecs;
title 'Unique Records';
run;
proc print data=dups;
title 'Duplicate Records';
run;
The solution above can give you duplicates but not unique values. There are many possible ways to do both in SAS. An SQL solution is very easy to understand.
proc sql;
create table no_duplicates as
select *
from import
group by VTNR
having count(*) = 1
;
create table all_duplicates as
select *
from import
group by VTNR
having count(*) > 1
;
quit;
I would use Reeza's or Tom's solution, but for completeness, the solution most similar to R (and your preexisting code) would be three steps. Again, I wouldn't use this here, it's excess work for something you can do more easily, but the concept is helpful in other situations.
First, get the dataset of duplicates - either her method, or proc sort.
proc sort nodupkey data=have out=nodups dupout=dups;
by byvar;
run;
Then pull those into a macro list:
proc sql;
select byvar
into :duplist separated by ','
from dups;
quit;
Then you have them in &duplist. and can use them like so:
data want;
set have;
if not (byvar in (&duplist.)); /* the IN operator needs parentheses around the value list */
run;
data want;
set import;
where VTNR in import_dup1;
run;

SAS sum variables using name after a proc transpose

I have a table with postings by category (a number) that I transposed. I got a table with each column name as _number for example _16, _881, _853 etc. (they aren't in order).
I need the sum of all of them in a PROC SQL step, but I don't want to create the variable in a data step, and I don't want to write out all of the column names either. I tried this, but it doesn't work:
proc sql;
select sum(_815-_16) as nnl
from craw.xxxx;
quit;
I tried going from the first number to the last, and also from the column in the first position to the column in the last position. Both give me a number that is not correct.
Any ideas?
Thanks!
You can't use variable lists in SQL, so _: and var1-var6 and var1--var8 don't work.
The easiest way to do this is a data step view.
proc sort data=sashelp.class out=class;
by sex;
run;
*Make transposed dataset with similar looking names;
proc transpose data=class out=transposed(drop=_name_); /* drop the character _NAME_ column so sum(of _:) only sees numeric columns */
by sex;
id height;
var height;
run;
*Make view;
data transpose_forsql/view=transpose_forsql;
set transposed;
sumvar = sum(of _:); *I confirmed this does not include _N_ for some reason - not sure why!;
run;
proc sql;
select sum(sumvar) from transpose_Forsql;
quit;
I have no documentation to support this, but from my experience, I believe SAS assumes that any sum() call in SQL is the SQL aggregate function unless it has reason to believe otherwise.
The only way I can see for SAS to differentiate between the two is by the number of arguments passed in. In the example below, the inner sum() receives 3 arguments, so SAS treats it as the SAS sum() function (the SQL aggregate function accepts only a single argument). The result of the SAS function is then passed as the single argument to the SQL aggregate sum():
proc sql noprint;
create table test as
select sex,
sum(sum(height,weight,0)) as sum_height_and_weight
from sashelp.class
group by 1
;
quit;
Result:
proc print data=test;
run;
              sum_height_
Obs    Sex    and_weight
  1     F       1356.3
  2     M       1728.6
Also note a trick I've used in the code by passing in 0 to the SAS function - this is an easy way to add an additional parameter without changing the intended result. Depending on your data, you may want to swap out the 0 for a null value (ie. .).
EDIT: To address the issue of unknown column names, you can create a macro variable that contains the list of column names you want to sum together:
proc sql noprint;
select name into :varlist separated by ','
from sashelp.vcolumn
where libname='SASHELP'
and memname='CLASS'
and upcase(name) like '%T' /* MATCHES HEIGHT AND WEIGHT */
;
quit;
%put &varlist;
Result:
Height,Weight
Note that you would need to change the above wildcard to match your scenario - ie. matching fields that begin with an underscore, instead of fields that end with the letter T. So your final SQL statement will look something like this:
proc sql noprint;
create table test as
select sex,
sum(sum(&varlist,0)) as sum_of_fields_ending_with_t
from sashelp.class
group by 1
;
quit;
This provides an alternate approach to Joe's answer - though I believe using the view as he suggests is a cleaner way to go.
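For the table in the question, the two steps might be combined as in the sketch below. craw.xxxx is the placeholder name from the question, so the LIBNAME and MEMNAME values are placeholders too; the sketch also assumes every numeric column whose name starts with an underscore should be summed.

```sas
proc sql noprint;
  /* Collect the numeric columns whose names start with an underscore */
  select name into :varlist separated by ','
    from dictionary.columns
    where libname = 'CRAW'            /* upper case, as stored in the dictionary table */
      and memname = 'XXXX'            /* placeholder table name from the question */
      and type = 'num'
      and substr(name, 1, 1) = '_';

  /* Inner sum(): SAS function across columns; outer sum(): aggregate across rows */
  select sum(sum(&varlist, 0)) as nnl
    from craw.xxxx;
quit;
```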

Can this multi-step process be reduced to one proc sql statement?

I've been trying to make my code more efficient and this is the original code, but I think it can be written in one step.
data TABLE;
set ORIGINAL_DATA;
Multi=percent*total_units;
keep Multi Type;
run;
proc sort data=TABLE;
by Type;
run;
proc means noprint data=TABLE;
by Type;
var Multi;
output out=TABLE2(drop=_type_ _freq_) sum=Multi;
run;
proc means noprint data=TABLE;
var Multi;
output out=TABLE3(drop=_type_ _freq_) sum=total;
run;
proc sql;
create table TABLE4 as
select a.Type, a.Multi label="Multi", b.total label="total"
from TABLE2 a, TABLE3 b
order by Type;
quit;
data TABLE5;
set TABLE4;
pct=(MULTI/total)*100;
run;
I am able to split up part of it, but I can't figure out how to get the PCT part in my code. This is what I have.
proc sql;
create table TABLE1 as
select distinct type, sum(percent*total_units) as MULTI label "MULTI",
MULTI/(percent*total_units)) as PCT
from ORIGINAL_DATA
group by type;
quit;
I had to edit some of the code but I think the general idea should make sense.
The main problem is I cannot call upon the MULTI column because it is just being created but I want to create a percentage of the total for each type.
The "SAS" way to do something like this is to use a CLASS statement with PROC MEANS. That will calculate statistics on all the interaction levels in the data (identified by the _TYPE_ variable). The row where _TYPE_=0 will be the "total" value, representing the value of that statistic for the entire data set.
In your case, we can take advantage of the fact that PROC MEANS will create the output data set sorted by _TYPE_ and by the variables listed in the CLASS statement. That means we can just read the first observation and save its value for calculating percentages.
It's probably easier to just show some code:
data TABLE;
set ORIGINAL_DATA;
Multi = percent * total_units;
keep Multi Type;
run;
proc means noprint data=TABLE;
class Type;
var multi;
output out=next sum=;
run;
data want;
retain total;
set next;
if _n_ = 1 then do;
/* The first obs will be the _TYPE_=0 record */
total = multi;
delete;
end;
pct = (multi / total) * 100;
drop total _freq_ _type_;
run;
Notice that you do not need to sort the data before using PROC MEANS. That's because we are using a CLASS statement rather than a BY statement. The data step uses the first observation in the data set created by MEANS (the _TYPE_=0 record) to retain the total sum of your variable. The delete statement keeps it out of the result.
CLASS statements with PROC MEANS are very useful. Take a few minutes to read up on how the _TYPE_ variable is calculated, especially if you try using more than one class variable.
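As a small illustration of _TYPE_ with two class variables, here is a sketch using SASHELP.CLASS (chosen only because it ships with SAS; the output name stats is made up):

```sas
proc means data=sashelp.class noprint;
  class sex age;
  var height;
  output out=stats sum=;
run;

/* _TYPE_ = 0: whole data set
   _TYPE_ = 1: one row per age (sex collapsed)
   _TYPE_ = 2: one row per sex (age collapsed)
   _TYPE_ = 3: one row per sex*age combination */
proc print data=stats;
  var _type_ sex age _freq_ height;
run;
```

Adding the NWAY option to the PROC MEANS statement keeps only the rows with the highest _TYPE_ value, which is why it is not used in the answer above: the _TYPE_=0 total row is needed there.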
You can skip the initial data step by using the WEIGHT= option in the VAR statement of PROC MEANS (this will effectively do the multiplication for you). You can also use PROC TABULATE instead of PROC MEANS, as TABULATE can calculate the percentage. I believe the following code will produce your required output in one go.
ods noresults;
proc tabulate data=have out=want (drop=_: rename=(total_units_sum=total total_units_pctsum_0=pct));
class type;
var total_units / weight=percent;
table type, total_units*(sum pctsum);
run;
ods results;
If you need one step, maybe this will work, but it's not actually efficient, since it processes data twice, once for detail by TYPE, once for total.
proc sql;
create table TABLE1 as
select
d.type
, sum(d.percent*d.total_units) as MULTI label "MULTI"
, calculated MULTI/s.total as PCT
from ORIGINAL_DATA d,
( select sum(percent*total_units) as total
from ORIGINAL_DATA) s
group by type
;
quit;
For more efficiency, at the cost of more than one step, you could simply replace the tables with views in your original code:
data TABLE; => data TABLE / view=TABLE;
create table TABLE4 => create view TABLE4