Ive got 50 columns of data, with 4 different measurements in each, as well as designation tags (groups C, D, and E). Ive averaged the 4 measurements... So every data point now has an average. Now, I am supposed to take the average of all the data points averages of each specific group.
So I want all the data in group C to be averaged, and so on for D and E.... and I dont know how to do that.
avg1=(MEAS1+MEAS2+MEAS3+MEAS4)/4;
avg_score=round(avg1, .1);
run;
proc print;
run;
This is what I have so far.
There are several procedures, and SQL that can average values over a group.
I'll guess you meant to say 50 rows of data.
Example:
Proc MEANS
data have;
call streaminit(314159);
do _n_ = 1 to 50;
group = substr('CDE', rand('integer',3),1);
array v meas1-meas4;
do _i_ = 1 to dim(v);
num + 2;
v(_i_) = num;
end;
output;
end;
drop num;
run;
data rowwise_means;
set have;
avg_meas = mean (of meas:);
run;
* group wise means of row means;
proc means noprint data=rowwise_means nway;
class group;
var avg_meas;
output out=want mean=meas_grandmean;
run;
rowwise_means
want (grandmean, or mean of means)
Related
I can't find a way to summarize the same variable using different weights.
I try to explain it with an example (of 3 records):
data pippo;
a=10;
wgt1=0.5;
wgt2=1;
wgt3=0;
output;
a=3;
wgt1=0;
wgt2=0;
wgt3=1;
output;
a=8.9;
wgt1=1.2;
wgt2=0.3;
wgt3=0.1;
output;
run;
I tried the following:
proc summary data=pippo missing nway;
var a /weight=wgt1;
var a /weight=wgt2;
var a /weight=wgt3;
output out=pluto (drop=_freq_ _type_) sum()=;
run;
Obviously it gives me a warning because I used the same variable "a" (I can't rename it!).
I've to save a huge amount of data and not so much physical space and I should construct like 120 field (a0-a6,b0-b6 etc) that are the same variables just with fixed weight (wgt0-wgt5).
I want to store a dataset with 20 columns (a,b,c..) and 6 weight (wgt0-wgt5) and, on demand, processing a "summary" without an intermediate datastep that oblige me to create 120 fields.
Due to the huge amount of data (more or less 55Gb every month) I'd like also not to use proc sql statement:
proc sql;
create table pluto
as select sum(db.a * wgt1) as a0, sum(db.a * wgt1) as a1 , etc.
quit;
There is a "Super proc summary" that can summarize the same field with different weights?
Thanks in advance,
Paolo
I think there are a few options. One is the data step view that data_null_ mentions. Another is just running the proc summary however many times you have weights, and either using ods output with the persist=proc or 20 output datasets and then setting them together.
A third option, though, is to roll your own summarization. This is advantageous in that it only sees the data once - so it's faster. It's disadvantageous in that there's a bit of work involved and it's more complicated.
Here's an example of doing this with sashelp.baseball. In your actual case you'll want to use code to generate the array reference for the variables, and possibly for the weights, if they're not easily creatable using a variable list or similar. This assumes you have no CLASS variable, but it's easy to add that into the key if you do have a single (set of) class variable(s) that you want NWAY combinations of only.
data test;
set sashelp.baseball;
array w[5];
do _i = 1 to dim(w);
w[_i] = rand('Uniform')*100+50;
end;
output;
run;
data want;
set test end=eof;
i = .;
length varname $32;
sumval = 0 ;
sum=0;
if _n_ eq 1 then do;
declare hash h_summary(suminc:'sumval',keysum:'sum',ordered:'a');;
h_summary.defineKey('i','varname'); *also would use any CLASS variable in the key;
h_summary.defineData('i','varname'); *also would include any CLASS variable in the key;
h_summary.defineDone();
end;
array w[5]; *if weights are not named in easy fashion like this generate this with code;
array vars[*] nHits nHome nRuns; *generate this with code for the real dataset;
do i = 1 to dim(w);
do j = 1 to dim(vars);
varname = vname(vars[j]);
sumval = vars[j]*w[i];
rc = h_summary.ref();
if i=1 then put varname= sumval= vars[j]= w[i]=;
end;
end;
if eof then do;
rc = h_summary.output(dataset:'summary_output');
end;
run;
One other thing to mention though... if you're doing this because you're doing something like jackknife variance estimation or that sort of thing, or anything that uses replicate weights, consider using PROC SURVEYMEANS which can handle replicate weights for you.
You can SCORE your data set using a customized SCORE data set that you can generate
with a data step.
options center=0;
data pippo;
retain a 10 b 1.75 c 5 d 3 e 32;
run;
data score;
if 0 then set pippo;
array v[*] _numeric_;
retain _TYPE_ 'SCORE';
length _name_ $32;
array wt[3] _temporary_ (.5 1 .333);
do i = 1 to dim(v);
call missing(of v[*]);
do j = 1 to dim(wt);
_name_ = catx('_',vname(v[i]),'WGT',j);
v[i] = wt[j];
output;
end;
end;
drop i j;
run;
proc print;[enter image description here][1]
run;
proc score data=pippo score=score;
id a--e;
var a--e;
run;
proc print;
run;
proc means stackods sum;
ods exclude summary;
ods output summary=summary;
run;
proc print;
run;
enter image description here
I have a process flow in SAS Enterprise Guide which is comprised mainly of Data views rather than tables, for the sake of storage in the work library.
The problem is that I need to calculate percentiles (using proc univariate) from one of the data views and left join this to the final table (shown in the screenshot of my process flow).
Is there any way that I can specify the outfile in the univariate procedure as being a data view, so that the procedure doesn't calculate everything prior to it in the flow? When the percentiles are left joined to the final table, the flow is calculated again so I'm effectively doubling my processing time.
Please find the code for the univariate procedure below
proc univariate data=WORK.QUERY_FOR_SGFIX noprint;
var CSA_Price;
by product_id;
output out= work.CSA_Percentiles_Prod
pctlpre= P
pctlpts= 40 to 60 by 10;
run;
In SAS, my understanding is that procs such as proc univariate cannot generally produce views as output. The only workaround I can think of would be for you to replicate the proc logic within a data step and produce a view from the data step. You could do this e.g. by transposing your variables into temporary arrays and using the pctl function.
Here's a simple example:
data example /view = example;
array _height[19]; /*Number of rows in sashelp.class dataset*/
/*Populate array*/
do _n_ = 1 by 1 until(eof);
set sashelp.class end = eof;
_height[_n_] = height;
end;
/*Calculate quantiles*/
array quantiles[3] q40 q50 q60;
array points[3] (40 50 60);
do i = 1 to 3;
quantiles[i] = pctl(points[i], of _height{*});
end;
/*Keep only the quantiles we calculated*/
keep q40--q60;
run;
With a bit more work, you could also make this approach return percentiles for individual by groups rather than for the whole dataset at once. You would need to write a double-DOW loop to do this, e.g.:
data example;
array _height[19];
array quantiles[3] q40 q50 q60;
array points[3] _temporary_ (40 50 60);
/*Clear heights array between by groups*/
call missing(of _height[*]);
/*Populate heights array*/
do _n_ = 1 by 1 until(last.sex);
set class end = eof;
by sex;
_height[_n_] = height;
end;
/*Calculate quantiles*/
do i = 1 to 3;
quantiles[i] = pctl(points[i], of _height{*});
end;
/* Output all rows from input dataset, with by-group quantiles attached*/
do _n_ = 1 to _n_;
set class;
output;
end;
keep name sex q40--q60;
run;
I'm not very familiar with Do Loops in SAS and was hoping to get some help. I have data that looks like this:
Product A: 1
Product A: 2
Product A: 4
I'd like to transpose (easy) and flag that Product A: 3 is missing, but I need to do this iteratively to the i-th degree since the number of products is large.
If I run the transpose part in SAS, my first column will be 1, second column will be 2, and third column will be 4 - but I'd really like the third column to be missing and the fourth column to be 4.
Any thoughts? Thanks.
Get some sample data:
proc sort data=sashelp.iris out=sorted;
by species;
run;
Determine the largest column we will need to transpose to. Depending on your situation you may just want to hardcode this value using a %let max=somevalue; statement:
proc sql noprint;
select cats(max(sepallength)) into :max from sorted;
quit;
%put &=max;
Transpose the data using a data step:
data want;
set sorted;
by species;
retain _1-_&max;
array a[1:&max] _1-_&max;
if first.species then do;
do cnt = lbound(a) to hbound(a);
a[cnt] = .;
end;
end;
a[sepallength] = sepallength;
if last.species then do;
output;
end;
keep species _1-_&max;
run;
Notice we are defining an array of columns: _1,_2,_3,..._max. This happens in our array statement.
We then use by-group processing to populate these newly created columns for a single species at a time. For each species, on the first record, we clear the array. For each record of the species, we populate the appropriate element of the array. On the final record for the species output the array contents.
You need a way to tell SAS that you have 4 products and the values are 1-4. In this example I create dummy ID with the needed information then transpose using ID statement to name new variables using the value of product.
data product;
input id product ##;
cards;
1 1 1 2 1 4
2 2 2 3
;;;;
run;
proc print;
run;
data productspace;
if 0 then set product;
do product = 1 to 4;
output;
end;
stop;
run;
data productV / view=productV;
set productspace product;
run;
proc transpose data=productV out=wide(where=(not missing(id))) prefix=P;
by id;
var product;
id product;
run;
proc print;
run;
I have a dataset looks like the following:
Name Number
a 1
b 2
c 9
d 6
e 5.5
Total ???
I want to calculate the sum of variable Number and record the sum in the last row (corresponding with Name = 'total'). I know I can do this using proc means then merge the output backto this file. But this seems not very efficient. Can anyone tell me whether there is any better way please.
you can do the following in a dataset:
data test2;
drop sum;
set test end = last;
retain sum;
if _n_ = 1 then sum = 0;
sum = sum + number;
output;
if last then do;
NAME = 'TOTAL';
number = sum;
output;
end;
run;
it takes just one pass through the dataset
It is easy to get by report procedure.
data have;
input Name $ Number ;
cards;
a 1
b 2
c 9
d 6
e 5.5
;
proc report data=have out=want(drop=_:);
rbreak after/ summarize ;
compute after;
name='Total';
endcomp;
run;
The following code uses the DOW-Loop (DO-Whitlock) to achieve the result by reading through the observations once, outputting each one, then lastly outputting the total:
data want(drop=tot);
do until(lastrec);
set have end=lastrec;
tot+number;
output;
end;
name='Total';
number=tot;
output;
run;
For all of the data step solutions offered, it is important to keep in mind the 'Length' factor. Make sure it will accommodate both 'Total' and original values.
proc sql;
select max(5,length) into :len trimmed from dictionary.columns WHERE LIBNAME='WORK' AND MEMNAME='TEST' AND UPCASE(NAME)='NAME';
QUIT;
data test2;
length name $ &len;
set test end=last;
...
run;
County...AgeGrp...Population
A.............1..........200
A.............2..........100
A.............3..........100
A............All.........400
B.............1..........200
So, I have a list of counties and I'd like to find the under 18 population as a percent of the population for each county, so as an example from the table above I'd like to add only the population of agegrp 1 and 2 and divide by the 'all' population. In this case it would be 300/400. I'm wondering if this can be done for every county.
Let's call your SAS data set "HAVE" and say it has two character variables (County and AgeGrp) and one numeric variable (Population). And let's say you always have one observation in your data set for a each County with AgeGrp='All' on which the value of Population is the total for the county.
To be safe, let's sort the data set by County and process it in another data step to, creating a new data set named "WANT" with new variables for the county population (TOT_POP), the sum of the two Age Group values you want (TOT_GRP) and calculate the proportion (AgeGrpPct):
proc sort data=HAVE;
by County;
run;
data WANT;
retain TOT_POP TOT_GRP 0;
set HAVE;
by County;
if first.County then do;
TOT_POP = 0;
TOT_GRP = 0;
end;
if AgeGrp in ('1','2') then TOT_GRP + Population;
else if AgeGrp = 'All' then TOT_POP = Population;
if last.County;
AgeGrpPct = TOT_GRP / TOT_POP;
keep County TOT_POP TOT_GRP AgeGrpPct;
output;
run;
Notice that the observation containing AgeGrp='All' is not really needed; you could just as well have created another variable to collect a running total for all age groups.
If you want a procedural approach, create a format for the under 18's, then use PROC FREQ to calculate the percentage. It is necessary to exclude the 'All' values from the dataset with this method (it's generally bad practice to include summary rows in the source data).
PROC TABULATE could also be used for this.
data have;
input County $ AgeGrp $ Population;
datalines;
A 1 200
A 2 100
A 3 100
A All 400
B 1 200
B 2 300
B 3 500
B All 1000
;
run;
proc format;
value $age_fmt '1','2' = '<18'
other = '18+';
run;
proc sort data=have;
by county;
run;
proc freq data=have (where=(agegrp ne 'All')) noprint;
by county;
table agegrp / out=want (drop=COUNT where=(agegrp in ('1','2')));
format agegrp $age_fmt.;
weight population;
run;