I have three columns in a dataset: spend, age_bucket, and multiplier. The data looks something like...
spend age_bucket multiplier
10 18-24 2x
120 18-24 2x
1 35-54 3x
I'd like a dataset with the columns as the age buckets, the rows as the multipliers, and the entries as the sum (or other aggregate function) of the spend column. Is there a proc to do this? Can I accomplish it easily using proc SQL?
There are a few ways to do this.
data have;
input spend age_bucket $ multiplier $;
datalines;
10 18-24 2x
120 18-24 2x
1 35-54 3x
10 35-54 2x
;
proc summary data=have;
var spend;
class age_bucket multiplier;
output out=temp sum=;
run;
First you can use PROC SUMMARY to calculate the aggregation, sum in this case, for the variable in question. The CLASS statement gives you things to sum by. This will calculate the N-Way sums and the output data set will contain them all. Run the code and look at data set temp.
Next you can use PROC TRANSPOSE to pivot the table. We need to use a BY statement so a PROC SORT is necessary. I also filter to the aggregations you care about.
proc sort data=temp(where=(_type_=3));
by multiplier;
run;
proc transpose data=temp out=want(drop=_name_);
by multiplier;
var spend;
id age_bucket;
idlabel age_bucket;
run;
In traditional mode 35-54 is not a valid SAS variable name. SAS will convert your columns to proper names. The label on the variable will retain the original value. Just be aware if you need to reference the variable later, the name has changed to be valid.
Related
For an if-query I would like to create a macro varibale giving the respective frequency of the underlying time
series. I tried to get some descriptive statistics from proc time series. However, they unfortunately do not include the figure for the frequency.
The underlying times series does not necessarily conclude all periods of the frequency. That excludes a selected count by proc sql from my point of view.
Does anyone know an efficient procedure to determine the frequency without computing the frequency on my own (in a data step or a proc sql code)?
You can use the outspectra statement to help learn what kind of seasonality it has. Based on the data, give PROC TIMESERIES your best guess of day, month, etc. In the example below, we know we want to forecast by month but we do not know what seasonality it has.
proc timeseries data=sashelp.air outspectra=spectra;
id date interval=month;
var air;
run;
Plot this spectra dataset in proc sgplot and you'll see something that looks like this:
proc sgplot data=spectra;
where NOT missing(period);
series x=period y=p;
run;
This line will naturally increase over time, but we're looking for a bumps in the line. Notice the large bump somewhere between 0 and 24 months and the several smaller bumps before it. Let's zoom in on that by filtering out the longer periods.
proc sgplot data=spectra;
where period < 24 and NOT missing(period);
series x=period y=p;
run;
It's pretty clear that there is a strong seasonality of 12, with potentially smaller cycles at 3 and 6 months. From this plot, we can conclude that our seasonality should be 12 based on our spectra plot.
You can turn this into a macro to help identify the season if you'd like. Simply search for the largest bump within a reasonable timeframe. In our case we'll choose 36 because we do not suspect that we have any seasonality > 36 months.
proc sort data=spectra;
by period;
run;
data identify_period;
set spectra;
by period;
where NOT missing(period) AND period LE 36;
delta = abs(p - lag(p) );
run;
proc sql;
select period, max(delta) as max_delta
from identify_period
having delta = max(delta)
;
quit;
Output:
PERIOD max_delta
12 163712
I don't know how to do this without data step logic, but you could wrap the data step in a macro as follows:
%macro get_frequency(data,date_variable,output_variable);
proc sort data=&data (keep=&date_variable) out=__tempsorted;
by &date_variable;
run;
data _null_;
set __tempsorted end=lastobs;
prevdate=lag(&date_variable);
if _n_ > 1 then do;
interval_number+1;
interval_total + (&date_variable - prevdate);
end;
if lastobs then do;
average_interval = interval_total/interval_number;
frequency = round(365.25/average_interval);
call symput ("&output_variable",left(put(frequency,best32.)));
end;
run;
proc datasets nolist;
delete __tempsorted;
run;
quit;
%mend get_frequency;
Then you can call the macro on your original data set timeseries to examine the variable date and create a new macro variable frequency1 with the required frequency.
data work.timeseries;
input date date. value;
format date date9.;
datalines;
01Oct18 3000
01Nov18 4000
01Dec18 6500
01Jan19 7000
01Feb19 4000
01Mar19 5000
01Apr19 7500
01May19 4800
01Jun19 4500
;
run;
%get_frequency(timeseries,date,freqency1)
%put &=frequency1;
This seems to work on your sample data where each date is the first of the month. If your dates are evenly distributed (e.g. always near month start/end, or always near mid-month etc.) then this macro should work ok. Obviously if you have multiple observations per date then it will give the completely incorrect frequency.
I have a dataset with visitors and weather variables. I'm trying to forecast visitors based on the weather variables. Since the dataset only consists of visitors in season there is missing values and gaps for every year. When running proc reg in sas it's all okay but the issue comes when i'm using proc VARMAX. I cannot run the regression due to missing values. How can i tackle this?
proc varmax data=tivoli4 printall plots=forecast(all);
id obs interval=day;
model lvisitors = rain sunshine averagetemp
dfebruary dmarch dmay djune djuly daugust doctober dnovember ddecember
dwednesday dthursday dfriday dsaturday dsunday
d_24Dec2016 d_05Dec2013 d_24Dec2017 d_24Dec2014 d_24Dec2015 d_24Dec2019
d_24Dec2018 d_24Sep2012 d_06Jul2015
d_08feb2019 d_16oct2014 d_15oct2019 d_20oct2016 d_15oct2015 d_22sep2017 d_08jul2015
d_20Sep2019 d_08jul2016 d_16oct2013 d_01aug2012 d_18oct2012 d_23dec2012 d_30nov2013 d_20sep2014 d_17oct2012 d_17jun2014
dFrock2012 dFrock2013 dFrock2014 dFrock2015 dFrock2016 dFrock2017 dFrock2018 dFrock2019
dYear2015 dYear2016 dYear2017
/p=7 q=2 Method=ml dftest;
garch p=1 q=1 form=ccc OUTHT=CONDITIONAL;
restrict
ar(3,1,1)=0, ar(4,1,1)=0, ar(5,1,1)=0,
XL(0,1,13)=0, XL(0,1,14)=0, XL(0,1,13)=0, XL(0,1,27)=0, XL(0,1,38)=0, XL(0,1,42)=0;
output lead=10 out=forecast;
run;
As with any forecast, you will first need to prepare your time-series. You should first run through your data through PROC TIMESERIES to fill-in or impute missing values. The impute choice that is most appropriate is dependent on your variables. The below code will:
Sum lvisitors by day and set missing values to 0
Set missing values of averagetemp to average
Set missing values of rain, sunshine, and your variables starting with d to 0 (assuming these are indicators)
Code:
proc timeseries data=have out=want;
id obs interval = day
setmissing = 0
notsorted
;
var lvisitors / accumulate=total;
crossvar averagetemp / accumulate=none setmissing=average;
crossvar rain sunshine d: / accumulate=none;
run;
Important Time Interval Consideration
Depending on your data, this could bias your error rate and estimates since you always know no one will be around in the off-season. If you have many missing values for off-season data, you will want to remove those rows.
Since PROC VARMAX does not support custom time intervals, you can instead create a simple time identifier. You can alternatively turn this into a format for proc format and converttime_id at the end.
data want;
set have;
time_id+1;
run;
proc varmax data=want;
id time_id interval=day;
...
output lead=10 out=myforecast;
run;
data myforecast;
merge myforecast
want(keep=time_id date)
;
by time_id;
run;
Or, if you made a format:
data myforecast;
set myforecast;
date = put(time_id, timeid.);
drop time_id;
run;
I have the following problem. I need to run PROC FREQ on multiple variables, but I want the output to all be on the same table. Currently, a PROC FREQ statement with something like TABLES ERstatus Age Race, InsuranceStatus; will calculate frequencies for each variable and print them all on separate tables. I just want the data on ONE table.
Any help would be appreciated. Thanks!
P.S. I tried using PROC TABULATE, but it didn't not calculate N correctly, so I'm not sure what I did wrong. Here is my code for PROC TABULATE. My variables are all categorical, so I just need to know N and percentages.
PROC TABULATE DATA = BCanalysis;
CLASS ERstatus PRstatus Race TumorStage InsuranceStatus;
TABLE (ERstatus PRstatus Race TumorStage) * (N COLPCTN), InsuranceStatus;
RUN;
The above code does not return the correct frequencies based on InsuranceStatus where 0 = insured and 1 = uninsured, but PROC FREQ does. Also doesn't calculate correctly with ROWPCTN. So any way that I can get PROC FREQ to calculate multiple variables on one table, or PROC TABULATE to return the correct frequencies, would be appreciated.
Here is a nice image of my output in a simplified analysis of only ERstatus and InsuranceStatus. You can see that PROC FREQ returns 204 people with an ERstatus of 1 and InsuranceStatus of 1. That's correct. The values in PROC TABULATE are not.
OUTPUT
I'll answer this separately as this is answering the other possible interpretation of the question; when it's clarified I'll delete one or the other.
If you want this in a single printed table, then you either need to use proc tabulate or you need to normalize your data - meaning put it in the form of variable | value. PROC FREQ is not capable of doing multiple one-way frequencies in a single table.
For PROC TABULATE, likely your issue is missing data. Any variable that is on the class statement will be checked for missingness, and if any rows are missing data for any of the class variables, those rows are entirely excluded from the tabulation for all variables.
You can override this by adding the missing option on the class statement, or in the table statement, or in the proc tabulate statement. So:
PROC TABULATE DATA = BCanalysis;
CLASS ERstatus PRstatus Race TumorStage InsuranceStatus/missing;
TABLE (ERstatus PRstatus Race TumorStage) * (N COLPCTN), InsuranceStatus;
RUN;
This will result in a slightly different appearance than on your table, though, as it will include the missing rows in places you probably do not want them, and they'll be factored against the colpctn when again you probably don't want them.
Typically some manipulation is then necessary; the easiest is to normalize your data and then run a tabulation (using PROC TABULATE or PROC FREQ, whichever is more appropriate; TABULATE has better percentaging options though) against that normalized dataset.
Let's say we have this:
data class;
set sashelp.class;
if _n_=5 then call missing(age);
if _n_=3 then call missing(sex);
run;
And we want these two tables in one table.
proc freq data=class;
tables age sex;
run;
If we do this:
proc tabulate data=class;
class age sex;
tables (age sex),(N colpctn);
run;
Then we get an N=17 total for both subtables - that's not what we want, we want N=18. Then we can do:
proc tabulate data=class;
class age sex/missing;
tables (age sex),(N colpctn);
run;
But that's not quite right either; I want F to have 8/18 = 44.44% and M 10/18 = 55.55%, not 42% and 53% with 5% allocated to the missing row.
The way I do this is to normalize the data. This means you get a dataset with 2 variables, varname and val, or whatever makes sense for your data, plus whatever identifier/demographic/whatnot variables you might have. val has to be character unless all of your values are numeric.
So for example here I normalize class with age and sex variables. I don't keep any identifiers, but you certainly could in your data, I imagine InsuranceStatus would be kept there if I understand what you're doing in that table. Once I have the normalized table, I just use those two variables, and carefully construct a denominator definition in proc tabulate to have the right basis for my pctn value. It's not quite the same as the single table before - the variable name is in its own column, not on top of the list of values - but honestly that looks better in my opinion.
data class_norm;
set class;
length val $2;
varname='age';
val=put(age,2. -l);
if not missing(age) then output;
varname='sex';
val=sex;
if not missing(sex) then output;
keep varname val;
run;
proc tabulate data=class_norm;
class varname val;
tables varname=' '*val=' ',n pctn<val>;
run;
If you want something better than this, you'll probably have to construct it in proc report. That gives you the most flexibility, but is the most onerous to program in also.
You can use ODS OUTPUT to get all of the PROC FREQ output to one dataset.
ods output onewayfreqs=class_freqs;
proc freq data=sashelp.class;
tables age sex;
run;
ods output close;
or
ods output crosstabfreqs=class_tabs;
proc freq data=sashelp.class;
tables sex*(height weight);
run;
ods output close;
Crosstabfreqs is the name of the cross-tab output, while one-way frequencies are onewayfreqs. You can use ods trace to find out the name if you forget it.
You may (probably will) still need to manipulate this dataset some to get the structure you want ultimately.
Hell everyone --
I have some sales data which looks like this:
data have;
input order_id item $;
cards;
1 A
1 B
2 A
2 C
3 B
4 A
4 B
;
run;
What I'm trying to find out is what are the most popular combinations of items ordered. For example in the above case, there were 2 orders that contained items A&B, 1 order of A&C, and 1 order of B. What would be the best way to output the different combinations along with the numbers of orders placed?
It seems there is no permutation issue, you could try this:
proc sort data=have;
by order_id item;
run;
data temp;
set have;
by order_id;
retain comb;
length comb $4;
comb=cats(comb,item);
if last.order_id then do;
output;
call missing(comb);
end;
run;
proc freq data=temp;
table comb/norow nopercent nocol nocum;
run;
There are many possible approaches to this problem, and I would not presume to say which is the best. Here's a fairly simple method you could use:
Transpose your data so that you only have 1 row for each order, with an indicator variable for each product.
Feed the transposed dataset into proc corr to produce a correlation matrix for the indicator variables, and look for the strongest correlations.
I have data on exam results for 2 years for a number of students. I have a column with the year, the students name and the mark. Some students don't appear in year 2 because they don't sit any exams in the second year. I want to show whether the performance of students persists or whether there's any pattern in their subsequent performance. I can split the data into two halves of equal size to account for the 'first-half' and 'second-half' marks. I can also split the first half into quintiles according to the exam results using 'proc rank'
I know the output I want is a 5 X 5 table that has the original 5 quintiles on one axis and the 5 subsequent quintiles plus a 'dropped out' category as well, so a 5 x 6 matrix. There will obviously be around 20% of the total number of students in each quintile in the first exam, and if there's no relationship there should be 16.67% in each of the 6 susequent categories. But I don't know how to proceed to show whether this is the case of not with this data.
How can I go about doing this in SAS, please? Could someone point me towards a good tutorial that would show how to set this up? I've been searching for terms like 'performance persistence' etc, but to no avail. . .
I've been proceeding like this to set up my dataset. I've added a column with 0 or 1 for the first or second half of the data using the first procedure below. I've also added a column with the quintile rank in terms of marks for all the students. But I think I've gone about this the wrong way. Shoudn't I be dividing the data into quintiles in each half, rather than across the whole two periods?
Proc rank groups=2;
var yearquarter;
ranks ExamRank;
run;
Proc rank groups=5;
var percentageResult;
ranks PerformanceRank;
run;
Thanks in advance.
Why are you dividing the data into quintiles?
I would leave the scores as they are, then make a scatterplot with
PROC SGPLOT data = dataset;
x = year1;
y = year2;
loess x = year1 y = year2;
run;
Here's a fairly basic example of the simple tabulation. I transpose your quintile data and then make a table. Here there is basically no relationship, except that I only allow a 5% DNF so you have more like 19% 19% 19% 19% 19% 5%.
data have;
do i = 1 to 10000;
do year = 1 to 2;
if year=2 and ranuni(7) < 0.05 then call missing(quintile);
else quintile = ceil(5*ranuni(7));
output;
end;
end;
run;
proc transpose data=have prefix=year out=have_t;
by i;
var quintile;
id year;
run;
proc tabulate data=have_t missing;
class year1 year2;
tables year1,year2*rowpctn;
run;
PROC CORRESP might be helpful for the analysis, though it doesn't look like it exactly does what you want.
proc corresp data=have_t outc=want outf=want2 missing;
tables year1,year2;
run;