Create grid data in SAS using info from another dataset - sas

I need to get a dataset of a uniform grid 20x20 using info from SASHELP.CARS so that x and y variables are obtained as follows:
do y = min(weight) to max(weight) by (min(weight)+max(weight))/20;
do x = min(horsepower) to max(horsepower) by (min(horsepower)+max(horsepower))/20;
output;
end;
end;
Weight and HorsePower are variables of SASHELP.CARS. Furthermore the grid dataset has to have two more columns EnginSizeMean LengthMean with the same value in each row that equals mean(EnginSize) and mean(Length) from SASHELP.CARS (need all this to build dependency graph for regression model).

First calculate the statistics you need to use.
proc summary data=sashelp.cars ;
var weight horsepower enginesize length ;
output out=stats
min(weight horsepower)=
max(weight horsepower)=
mean(enginesize length)=
/ autoname
;
run;
Then use those values to generate your "grid".
data want;
set stats;
do y = 1 to 20 ;
weight= weight_min + (y-1)*(weight_min+weight_max)/20;
do x = 1 to 20 ;
horsepower = horsepower_min + (x-1)*(horsepower_min+horsepower_max)/20;
output;
end;
end;
run;

Related

SAS Array Variable Name Based on Another Array

I have data in the following format:
data have;
input id rtl_apples rtl_oranges rtl_berries;
datalines;
1 50 60 10
2 10 30 80
3 40 8 1
;
I'm trying to create new variables that represent the percent of the sum of the RTL variables, PCT_APPLES, PCT_ORANGES, PCT_BERRIES. The problem is I'm doing this within a macro so the names and number of RTL variables with vary with each iteration so the new variable names need to be generated dynamically.
This data step essentially gets what I need, but the new variables are in the format PCT1, PCT2, PCTn format so it's difficult to know which RTL variable the PCT corresponds too.
data want;
set have;
array rtls[*] rtl_:;
total_sales = sum(of rtl_:);
call symput("dim",dim(rtls));
array pct[&dim.];
do i=1 to dim(rtls);
pct[i] = rtls[i] / total_sales;
end;
drop i;
run;
I also tried creating the new variable name by using a macro variable, but only the last variable in the array is created. In this case, PCT_BERRIES.
data want;
set have;
array rtls[*] rtl_:;
total_sales = sum(of rtl_:);
do i=1 to dim(rtls);
var_name = compress(tranwrd(upcase(vname(rtls[i])),'RTL','PCT'));
call symput("var_name",var_name);
&var_name. = rtls[i] / total_sales;
end;
drop i var_name;
run;
I have a feeling I'm over complicating this so any help would be appreciated.
If you have the list of names in data already then use the list to create the names you need for your arrays.
proc sql noprint;
select distinct cats('RTL_',name),cats('PCT_',name)
into :rtl_list separated by ' '
, :pct_list separated by ' '
from dataset_with_names
;
quit;
data want;
set have;
array rtls &rtl_list;
array pcts &pct_list;
total_sales = sum(of rtls[*]);
do index=1 to dim(rtls);
pcts[index] = rtls[index] / total_sales;
end;
drop index ;
run;
You can't create variables while a data step is executing. This program uses PROC TRANSPOSE to create a new data using the RTL_ variables "renamed" PCT_.
data have;
input id rtl_apples rtl_oranges rtl_berries;
datalines;
1 50 60 10
2 10 30 80
3 40 8 1
;;;;
run;
proc transpose data=have(obs=0) out=names;
var rtl_:;
run;
data pct;
set names;
_name_ = transtrn(_name_,'rtl_','PCT_');
y = .;
run;
proc transpose data=pct out=pct2;
id _name_;
var y;
run;
data want;
set have;
if 0 then set pct2(drop=_name_);
array _rtl[*] rtl_:;
array _pct[*] pct_:;
call missing(of _pct[*]);
total = sum(of _rtl[*]);
do i = 1 to dim(_rtl);
_pct[i] = _rtl[i]/total*1e2;
end;
drop i;
run;
proc print;
run;
You may want to just report the row percents
proc transpose data=&data out=&data.T;
by id;
var rtl_:;
run;
proc tabulate data=&data.T;
class id _name_;
var col1;
table
id=''
, _name_='Result'*col1=''*sum=''
_name_='Percent'*col1=''*rowpctsum=''
/ nocellmerge;
run;

SAS: proc hpbin function

The data I have is
Year Score
2020 100
2020 45
2020 82
.
.
.
2020 91
2020 14
2020 35
And the output I want is
Score_Ranking Count_Percent Cumulative_count_percent Sum
top100 x y z
101-200
.
.
.
800-900
900-989
The dataset has a total of 989 observations for the same year. I want to divide the whole dataset into 10 bins but set the size to 100. However, if I use the proc hpbin function, my results get divided into 989/10 bins. Is there a way I can determine the bin size?
Also, I want additional rows that show proportion, cumulative proportion, and the sum of the scores. How can I print these next to the bins?
Thank you in advance.
Sort your data
Classify into bins
Use PROC FREQ for #/Cumulative Count
Use PROC FREQ for SUM by using WEIGHT
Merge results
Or do 3-4 in same data step.
I'm not actually sure what the first two columns will tell you as they will all be the same except for the last one.
First generate some fake data to work with, the sort is important!
*generate fake data;
data have;
do score=1 to 998;
output;
end;
run;
proc sort data=have;
by score;
run;
Method #1
Note that I use a view here, not a data set which can help if efficiency may be an issue.
*create bins;
data binned / view=binned;
set have ;
if mod(_n_, 100) = 1 then bin+1;
run;
*calculate counts/percentages;
proc freq data=binned noprint;
table bin / out=binned_counts outcum;
run;
*calculate sums - not addition of WEIGHT;
proc freq data=binned noprint;
table bin / out=binned_sum outcum;
weight score;
run;
*merge results together;
data want_merged;
merge binned_counts binned_sum (keep = bin count rename = count= sum);
by bin;
run;
Method #2
And another method, which requires a single pass of your data rather than multiple as in the PROC FREQ approach:
*manual approach;
data want;
set have
nobs = _nobs /*Total number of observations in data set*/
End=last /*flag for last record*/;
*holds values across rows and sets initial value;
retain bin 1 count cum_count cum_sum 0 percent cum_percent ;
*increments bins and resets count at start of each 100;
if mod(_n_, 100) = 1 and _n_ ne 1 then do;
*output only when end of bin;
output;
bin+1;
count=0;
end;
*increment counters and calculate percents;
count+1;
percent = count / _nobs;
cum_count + 1;
cum_percent = cum_count / _nobs;
cum_sum + score;
*output last record/final stats;
if last then output;
*format percents;
format percent cum_percent percent12.1;
run;

SAS averaging into groups

Ive got 50 columns of data, with 4 different measurements in each, as well as designation tags (groups C, D, and E). Ive averaged the 4 measurements... So every data point now has an average. Now, I am supposed to take the average of all the data points averages of each specific group.
So I want all the data in group C to be averaged, and so on for D and E.... and I dont know how to do that.
avg1=(MEAS1+MEAS2+MEAS3+MEAS4)/4;
avg_score=round(avg1, .1);
run;
proc print;
run;
This is what I have so far.
There are several procedures, and SQL that can average values over a group.
I'll guess you meant to say 50 rows of data.
Example:
Proc MEANS
data have;
call streaminit(314159);
do _n_ = 1 to 50;
group = substr('CDE', rand('integer',3),1);
array v meas1-meas4;
do _i_ = 1 to dim(v);
num + 2;
v(_i_) = num;
end;
output;
end;
drop num;
run;
data rowwise_means;
set have;
avg_meas = mean (of meas:);
run;
* group wise means of row means;
proc means noprint data=rowwise_means nway;
class group;
var avg_meas;
output out=want mean=meas_grandmean;
run;
rowwise_means
want (grandmean, or mean of means)

How do I create a SAS data view from an 'Out=' option?

I have a process flow in SAS Enterprise Guide which is comprised mainly of Data views rather than tables, for the sake of storage in the work library.
The problem is that I need to calculate percentiles (using proc univariate) from one of the data views and left join this to the final table (shown in the screenshot of my process flow).
Is there any way that I can specify the outfile in the univariate procedure as being a data view, so that the procedure doesn't calculate everything prior to it in the flow? When the percentiles are left joined to the final table, the flow is calculated again so I'm effectively doubling my processing time.
Please find the code for the univariate procedure below
proc univariate data=WORK.QUERY_FOR_SGFIX noprint;
var CSA_Price;
by product_id;
output out= work.CSA_Percentiles_Prod
pctlpre= P
pctlpts= 40 to 60 by 10;
run;
In SAS, my understanding is that procs such as proc univariate cannot generally produce views as output. The only workaround I can think of would be for you to replicate the proc logic within a data step and produce a view from the data step. You could do this e.g. by transposing your variables into temporary arrays and using the pctl function.
Here's a simple example:
data example /view = example;
array _height[19]; /*Number of rows in sashelp.class dataset*/
/*Populate array*/
do _n_ = 1 by 1 until(eof);
set sashelp.class end = eof;
_height[_n_] = height;
end;
/*Calculate quantiles*/
array quantiles[3] q40 q50 q60;
array points[3] (40 50 60);
do i = 1 to 3;
quantiles[i] = pctl(points[i], of _height{*});
end;
/*Keep only the quantiles we calculated*/
keep q40--q60;
run;
With a bit more work, you could also make this approach return percentiles for individual by groups rather than for the whole dataset at once. You would need to write a double-DOW loop to do this, e.g.:
data example;
array _height[19];
array quantiles[3] q40 q50 q60;
array points[3] _temporary_ (40 50 60);
/*Clear heights array between by groups*/
call missing(of _height[*]);
/*Populate heights array*/
do _n_ = 1 by 1 until(last.sex);
set class end = eof;
by sex;
_height[_n_] = height;
end;
/*Calculate quantiles*/
do i = 1 to 3;
quantiles[i] = pctl(points[i], of _height{*});
end;
/* Output all rows from input dataset, with by-group quantiles attached*/
do _n_ = 1 to _n_;
set class;
output;
end;
keep name sex q40--q60;
run;

SAS summary statistic from a dataset

The dataset looks like this:
colx coly colz
0 1 0
0 1 1
0 1 0
Required output:
Colname value count
colx 0 3
coly 1 3
colz 0 2
colz 1 1
The following code works perfectly...
ods output onewayfreqs=outfreq;
proc freq data=final;
tables colx coly colz / nocum nofreq;
run;
data freq;
retain colname column_value;
set outfreq;
colname = scan(tables, 2, ' ');
column_Value = trim(left(vvaluex(colname)));
keep colname column_value frequency percent;
run;
... but I believe that's not efficient. Say I have 1000 columns, running prof freq on all 1000 columns is not efficient. Is there any other efficient way with out using the proc freq that accomplishes my desired output?
One of the most efficient mechanisms for computing frequency counts is through a hash object set up for reference counting via the suminc tag.
The SAS documentation for "Hash Object - Maintaining Key Summaries" demonstrates the technique for a single variable. The following example goes one step further and computes for each variable specified in an array. The suminc:'one' specifies that each use of ref will add the value of one to an internal reference sum. While iterating over the distinct keys for output, the frequency count is extracted via the sum method.
* one million data values;
data have;
array v(1000);
do row = 1 to 1000;
do index = 1 to dim(v);
v(index) = ceil(100*ranuni(123));
end;
output;
end;
keep v:;
format v: 4.;
run;
* compute frequency counts via .ref();
data freak_out(keep=name value count);
length name $32 value 8;
declare hash bins(ordered:'a', suminc:'one');
bins.defineKey('name', 'value');
bins.defineData('name', 'value');
bins.defineDone();
one = 1;
do until (end_of_data);
set have end=end_of_data;
array v v1-v1000;
do index = 1 to dim(v);
name = vname(v(index));
value = v(index);
bins.ref();
end;
end;
declare hiter out('bins');
do while (out.next() = 0);
bins.sum(sum:count);
output;
end;
run;
Note Proc FREQ uses standard grammars, variables can be a mixed of character and numeric, and has lots of additional features that are specified through options.
I think the most time consuming part in your code is generation of the ODS report. You can transpose the data before applying the freq. The below example does the task for 1000 rows with 1000 variables in few seconds. If you do it using ODS it may take much longer.
data dummy;
array colNames [1000] col1-col1000;
do line = 1 to 1000;
do j = 1 to dim(colNames);
colNames[j] = int(rand("uniform")*100);
end;
output;
end;
drop j;
run;
proc transpose
data = dummy
out = dummyTransposed (drop = line rename = (_name_ = colName col1 = value))
;
var col1-col1000;
by line;
run;
proc freq data = dummyTransposed noprint;
tables colName*value / out = result(drop = percent);
run;
Perhaps this statement from the comments is the real problem.
I felt like the odsoutput with proc freq is slowing down and creating
huge logs and outputs. think of 10,000 variables and million records.
I felt there should be another way of accomplishing this and arrays
seems to be a great fit
You can tell ODS not to produce the printed output if you don't want it.
ods exclude all ;
ods output onewayfreqs=outfreq;
proc freq data=final;
tables colx coly colz / nocum nofreq;
run;
ods exclude none ;