Calculating proportion and cumulative data in SAS - sas

I have a dataset called stores. I want to extract total sales (Retail_Price),
proportion of sales, and cumulative proportion of sales for each store in
SAS.
Sample dataset (Stores):
Date Store_Postcode Retail_Price month Distance
08/31/2013 CR7 8LE 470 8 7057.8
10/26/2013 CR7 8LE 640 10 7057.8
08/19/2013 CR7 8LE 500 8 7057.8
08/17/2013 E2 0RY 365 8 1702.2
09/22/2013 W4 3PH 395.5 12 2522
06/19/2013 W4 3PH 360.5 6 1280.9
11/15/2013 W10 6HQ 475 12 3213.5
06/20/2013 W10 6HQ 500 1 3213.5
09/18/2013 E7 8NW 315 9 2154.8
10/23/2013 E7 8NW 570 10 5777.9
11/18/2013 W10 6HQ 455 11 3213.5
08/21/2013 W10 6HQ 530 8 3213.5
Code I tried:
Proc sql;
Create table work.Top_sellers as
Select Store_postcode as Stores,SUM(Retail_price) as Total_Sales,Round((Retail_price/Sum(Retail_price)),0.01) as Proportion_of_sales
From work.stores
Group by Store_postcode
Order by total_sales;
Quit;
I have no idea how to calculate a cumulative variable in PROC SQL.
Please help me improve my code!

Computing a cumulative result in SQL requires the data to have an explicit, unique, ordered key. The query is a reflexive (self) join with 'triangular' criteria: each row is joined to every row whose key is less than or equal to its own, and those rows are summed.
data have;
do id = 100 to 120;
sales = ceil (10 + 25 * ranuni(123));
output;
end;
run;
proc sql;
create table want as
select
have1.id
, have1.sales
, sum(have2.sales) as sales_cusum
from
have as have1
join
have as have2
on
have1.id >= have2.id /* 'triangle' criteria */
group by
have1.id, have1.sales
order by
have1.id
;
quit;
A second way is to recompute the cusum row by row with a correlated subquery:
proc sql;
create table want as
select have.id, have.sales,
( select sum(h2.sales)
from have as h2 /* alias renamed: INNER is a join keyword in PROC SQL */
where h2.id <= have.id
)
as cusum
from
have;
quit;
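For the stores table in the question, a non-SQL alternative is PROC FREQ with a WEIGHT statement, which reports the weighted sum, percent, and cumulative percent per store in one step. This is only a sketch: it assumes work.stores has the Store_Postcode and Retail_Price columns shown in the question, and the names freq_out and Cumulative_proportion are made up for illustration. ORDER=FREQ is intended to list stores from largest to smallest total, so check the ordering on your own data.

```sas
/* Weighted one-way table: COUNT is the sum of Retail_Price per store */
proc freq data=work.stores order=freq noprint;
  tables Store_Postcode / out=freq_out outcum;  /* OUTCUM adds CUM_FREQ and CUM_PCT */
  weight Retail_Price;
run;

/* PERCENT and CUM_PCT are on a 0-100 scale; convert them to proportions */
data work.top_sellers;
  set freq_out;
  Total_Sales           = count;
  Proportion_of_sales   = round(percent / 100, 0.01);
  Cumulative_proportion = round(cum_pct / 100, 0.01);
  keep Store_Postcode Total_Sales Proportion_of_sales Cumulative_proportion;
run;
```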

I changed my mind: a CDF is a different calculation.
Here's how to do this with a data step. First calculate the cumulative totals (I used a data step here, but you could use PROC EXPAND if you have SAS/ETS).
*sort demo data;
proc sort data=sashelp.shoes out=shoes;
by region sales;
run;
data cTotal last (keep = region cTotal);
set shoes;
by region;
*calculate running total;
if first.region then cTotal=0;
cTotal = cTotal + sales;
*output records, everything to cTotal but only the last record which is total to Last dataset;
if last.region then output last;
output cTotal;
retain cTotal;
run;
*merge in results and calculate percentages;
data calcs;
merge cTotal Last (rename=(cTotal=Total));
by region;
percent = cTotal/Total;
run;
If you need a more efficient solution, I'd try a DoW solution.
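A double DoW loop for the same calculation might look like the sketch below. It assumes the sorted shoes data set created above; calcs_dow and total are names chosen here for illustration. The first loop reads one region to accumulate its total, and the second loop re-reads the same rows to compute the running total and its share.

```sas
data calcs_dow;
  /* Pass 1 over the BY group: accumulate the region total */
  do until (last.region);
    set shoes;
    by region;
    total = sum(total, sales);
  end;
  /* Pass 2 over the same rows: running total and cumulative share */
  do until (last.region);
    set shoes;
    by region;
    cTotal  = sum(cTotal, sales);
    percent = cTotal / total;
    output;
  end;
  /* total and cTotal are not retained, so both reset to missing
     at the start of the next data step iteration (the next region) */
run;
```

This avoids writing the intermediate cTotal and Last data sets and the merge that follows them.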

Related

SAS: proc hpbin function

The data I have is
Year Score
2020 100
2020 45
2020 82
.
.
.
2020 91
2020 14
2020 35
And the output I want is
Score_Ranking Count_Percent Cumulative_count_percent Sum
top100 x y z
101-200
.
.
.
800-900
900-989
The dataset has a total of 989 observations for the same year. I want to divide the whole dataset into 10 bins but set the size to 100. However, if I use the proc hpbin function, my results get divided into 989/10 bins. Is there a way I can determine the bin size?
Also, I want additional rows that show proportion, cumulative proportion, and the sum of the scores. How can I print these next to the bins?
Thank you in advance.
1. Sort your data
2. Classify into bins
3. Use PROC FREQ for the count and cumulative count
4. Use PROC FREQ for SUM by using WEIGHT
5. Merge results
Or do 3-4 in the same data step.
I'm not actually sure what the first two columns will tell you as they will all be the same except for the last one.
First, generate some fake data to work with. The sort is important!
*generate fake data;
data have;
do score=1 to 998;
output;
end;
run;
proc sort data=have;
by score;
run;
Method #1
Note that I use a view here rather than a data set, which can help if efficiency is a concern.
*create bins;
data binned / view=binned;
set have ;
if mod(_n_, 100) = 1 then bin+1;
run;
*calculate counts/percentages;
proc freq data=binned noprint;
table bin / out=binned_counts outcum;
run;
*calculate sums - note addition of WEIGHT;
proc freq data=binned noprint;
table bin / out=binned_sum outcum;
weight score;
run;
*merge results together;
data want_merged;
merge binned_counts binned_sum (keep=bin count rename=(count=sum));
by bin;
run;
Method #2
And another method, which requires a single pass of your data rather than multiple as in the PROC FREQ approach:
*manual approach;
data want;
set have
nobs = _nobs /*Total number of observations in data set*/
End=last /*flag for last record*/;
*holds values across rows and sets initial value;
retain bin 1 count cum_count cum_sum 0 percent cum_percent ;
*increments bins and resets count at start of each 100;
if mod(_n_, 100) = 1 and _n_ ne 1 then do;
*output only when end of bin;
output;
bin+1;
count=0;
end;
*increment counters and calculate percents;
count+1;
percent = count / _nobs;
cum_count + 1;
cum_percent = cum_count / _nobs;
cum_sum + score;
*output last record/final stats;
if last then output;
*format percents;
format percent cum_percent percent12.1;
run;

Deleting and adding specific row/ column in SAS output

I have the following data
DATA HAVE;
input year dz :$8. area;
cards;
2000 stroke 08
2000 stroke 06
2000 stroke 06
;
run;
After using proc freq
proc freq data=have;
table area*dz/ list nocum ;
run;
I get the below output
In this output
I want to delete the 'dz', what can I do to delete this column?
I want a row in the end that gives 'total', what can I do to get a 'total' row?
Thank you!
There must be a better way of doing this, but the following code creates the desired table:
data have;
input year dz :$8. area; /* colon modifier stops dz at the blank instead of reading 8 fixed columns */
cards;
2000 stroke 08
2000 stroke 06
2000 stroke 06
;
run;
ods output List=list;
proc freq data=have;
table area*dz / list;
run;
data stage1;
set list(keep= area frequency percent CumFrequency CumPercent) end=eof;
area_char = put(area,best.-l); /* Convert it to char to add the Total row */
if eof then do;
call symputx("cumFreq", cumfrequency);
call symputx("cumPerc", cumpercent);
end;
drop area;
run;
data want;
retain area frequency percent; /* Put the variables in the desired order */
set stage1(rename=(area_char=area) drop=cumfrequency cumpercent) end=eof;
output;
if eof then do; /* Manually create the Total row */
area = "Total";
Frequency = &cumfreq.;
Percent = &cumperc.;
output;
end;
run;
Output (want table):
You should subset your data with a where clause and use a title statement if an important partitioning variable is to be removed from the output. If you didn't subset, how would your audience know whether a count included, say, episodes of both stroke and ministroke, if ministroke was also in the data?
Compute the frequencies with PROC FREQ and use a reporting procedure (print, report, tabulate) that summarizes to show a total line.
Example:
data have;
input year dz $ area;
cards;
2000 stroke 08
2000 stroke 06
2000 stroke 06
;
proc freq noprint data=have;
where dz = 'stroke';
table area / out=freqs;
run;
title 'Stroke dz';
title2 'print';
proc print data=freqs noobs label;
var area;
sum count percent;
run;
title2 'report';
proc report data=freqs;
columns area count percent;
define area / display;
define count / analysis;
rbreak after / summarize;
run;
title2 'tabulate';
proc tabulate data=freqs;
class area;
var count percent;
table area all, count percent;
run;
Thank you all for your valuable responses. The following code gives me the desired output in a concise way
proc freq data=HAVE;
tables area / list nocum out=a;
run;
proc sql;
create table b as
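select put(area, best12. -L) as area, count, percent from a /* area is numeric in HAVE; convert to character so it can share a column with 'Total' */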
union
select
'Total' as area,
sum(count) as count,
sum(percent) as percent
FROM a
;
quit;
proc print data=b; run;

Are there any better ways to compare cases between different rows in SAS?

During some data cleaning process, there is a need to compare the data between different rows. For example, if the rows have the same countryID and subjectID then keep the largest temperature:
CountryID SubjectID Temperature
1001 501 36
1001 501 38
1001 510 37
1013 501 36
1013 501 39
1095 532 36
In a case like this, I use the lag() function as follows.
proc sort data=table;
by CountryID SubjectID descending Temperature;
run;
data table_laged;
set table;
CountryID_lag = lag(CountryID);
SubjectID_lag = lag(SubjectID);
Temperature_lag = lag(Temperature);
if CountryID = CountryID_lag and SubjectID = SubjectID_lag then do;
if Temperature < Temperature_lag then delete;
end;
drop CountryID_lag SubjectID_lag Temperature_lag;
run;
The code above may work.
But I would still like to know whether there is a better way to solve this kind of problem.
I think you are overcomplicating the task. You can use PROC SQL and the max function:
proc sql noprint;
create table table_laged as
select CountryID,SubjectID,max(Temperature) as Temperature
from table
group by CountryID,SubjectID;
quit;
I don't know if you want it that way, but your code keeps every row tied for the highest temperature.
So when you have 2 1 3 for one subject it keeps 3, but when you have 1 4 3 4 4 it keeps 4 4 4. It is simpler to keep just the first row for each subject, which is the highest because of the descending sort order.
proc sort data = table;
by CountryID SubjectID descending Temperature;
run;
data table_laged;
set table;
by CountryID SubjectID;
if first.SubjectID;
run;
You can use the double DOW technique to:
1. Compute a measure over a group,
2. Apply the measure to items in the group.
The benefit of DOW looping is a single pass over the data set when incoming data is already grouped.
In this question, 1. is to identify the row in the group with the first highest temperature, and 2. is to select the row for output.
data want;
do _n_ = 1 by 1 until (last.SubjectId);
set have;
by CountryId SubjectId;
if temperature > _max_temp then do;
_max_temp = temperature;
_max_at_n = _n_;
end;
end;
do _n_ = 1 to _n_;
set have;
if _n_ = _max_at_n then OUTPUT;
end;
drop _:;
run;
The traditional procedural technique is Proc MEANS
data have;
input CountryID SubjectID Temperature;
datalines;
1001 501 36
1001 501 38
1001 510 37
1013 501 36
1013 501 39
1095 532 36
run;
proc means noprint data=have;
by countryid subjectid;
output out=want(drop=_:) max(temperature)=temperature;
run;
If the data is not ordered by CountryID and SubjectID going into the data step, a hash object can be used, or SQL as in @Aurieli's answer.
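A hash-object version for unsorted input could be sketched as follows. It assumes the have data set from the PROC MEANS example; max_temp and want_hash are names invented for this sketch. Each row looks up its CountryID/SubjectID key, updates the stored maximum, and the table is written out once the last row has been read.

```sas
data _null_;
  if _n_ = 1 then do;
    /* ordered:'a' makes the output data set come out sorted by key */
    declare hash h(ordered: 'a');
    h.defineKey('CountryID', 'SubjectID');
    h.defineData('CountryID', 'SubjectID', 'max_temp');
    h.defineDone();
  end;
  set have end=last;
  /* find() loads the stored max_temp when the key exists;
     otherwise max_temp is still missing for this iteration */
  rc = h.find();
  max_temp = max(max_temp, Temperature);
  h.replace();
  if last then h.output(dataset: 'want_hash');
run;
```

The input needs no sort because the hash table holds one entry per group in memory, at the cost of memory proportional to the number of groups.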

Using the sum of the columns to create a new variable

I have a data set that has State, Corn, and Cotton. I want to create a new variable in SAS, Corn_Pct (the state's corn output as a percentage of the country's corn output), and the same for Cotton_Pct.
Sample of data (numbers are not real):
State Corn Cotton
TX 135 500
AK 120 350
...
Can anyone help?
You can do this using a simple Proc SQL. Let the dataset be "Test",
Proc sql ;
create table test_percent as
select *,
Corn/sum(corn) as Corn_Pct format=percent7.1,
Cotton/sum(Cotton) as Cotton_Pct format=percent7.1
from test
;
quit;
If you have many columns, you can use Arrays and do loops to automatically generate percentages everytime.
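The array idea could be sketched like this, assuming the same test data set with Corn and Cotton columns. The names totals, corn_sum, cotton_sum, val, tot, and pct are made up for illustration. The column totals are computed once, attached to every row with "if _n_ = 1 then set", and a DO loop divides each column by its total.

```sas
/* One row holding the total of each column */
proc means data=test noprint;
  var corn cotton;
  output out=totals(keep=corn_sum cotton_sum) sum=corn_sum cotton_sum;
run;

data test_percent;
  /* Read the totals once; variables read by SET are retained across rows */
  if _n_ = 1 then set totals;
  set test;
  array val {*} corn cotton;
  array tot {*} corn_sum cotton_sum;
  array pct {*} Corn_Pct Cotton_Pct;
  do i = 1 to dim(val);
    pct{i} = val{i} / tot{i};
  end;
  format Corn_Pct Cotton_Pct percent7.1;
  drop i corn_sum cotton_sum;
run;
```

Adding another crop then only means extending the VAR statement and the three array lists in the same order.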
I calculated the column totals in an inner query and then used those totals in the outer query via a cross join.
Try this:
/*My Dataset */
Data Test;
input State $ Corn Cotton ;
cards;
TK 135 500
AK 120 350
CK 100 250
FG 200 300
run;
/*Code*/
Proc sql;
create table test_percent as
Select a.*, (corn * 100/sm_corn) as Corn_pct, (Cotton * 100/sm_cotton) as Cotton_pct
from test a
cross join
(
select sum(corn) as sm_corn ,
sum(Cotton) as sm_cotton
from test
) b ;
quit;
/*My Output*/
State Corn Cotton Corn_pct Cotton_pct
TK 135 500 24.32432432 35.71428571
AK 120 350 21.62162162 25
CK 100 250 18.01801802 17.85714286
FG 200 300 36.03603604 21.42857143
Here you have an alternative using proc means and data step:
proc means data=test sum noprint;
output out=test2(keep=corn cotton) sum=corn cotton;
run;
data test_percent (drop=corn_sum cotton_sum);
set test2(rename=(corn=corn_sum cotton=cotton_sum) in=in1) test(in=in2);
if (in1=1) then do;
call symput('corn_sum',corn_sum);
call symput('cotton_sum',cotton_sum);
end;
else do;
Corn_pct = corn/symget('corn_sum');
Cotton_pct = cotton/symget('cotton_sum');
output;
end;
run;

SAS: Insert Blank Rows

I'm calculating some interval statistics (standard deviation of one minute intervals for example) of financial time series data. My code managed to get results for all intervals that contain data, but for intervals that do not contain any observations in the time series, I'd like to insert an empty row just to maintain the timestamp consistency.
For example, if there's data between 10:00 to 10:01, 10:02 to 10:03, but not 10:01 to 10:02, my output would be:
10:01 stat1 stat2 stat3
10:03 stat1 stat2 stat3
It would be ideal if the result could be (I want some values to be 0, some missing '.'):
10:01 stat1 stat2 stat3
10:02 0 0 .
10:03 stat1 stat2 stat3
What I did:
data v_temp/view = v_temp;
set &taq_ds;
where TIME_M between &start_time and &end_time;
INTV = hms(00, ceil(TIME_M/'00:01:00't),00); *create one minute interval;
format INTV tod.; *format hh:mm:ss;
run;
proc means data = sorted noprint;
by SYM_ROOT DATE INTV;
var PRICE;
weight SIZE;
output
out=oneMinStats(drop=_TYPE_ _FREQ_)
n=NTRADES mean=VWAP sumwgt=SUMSHS max=HI min=LO std=SIGMAPRC
idgroup(max(TIME_M) last out(price size ex time_m)=LASTTRD LASTSIZE LASTEX LASTTIME);
run;
For some non-active stocks, there're many gaps like this. What would be an efficient way to generate those filling rows?
If you have SAS/ETS licensed, PROC EXPAND is a good choice for adding blank rows in a time series. Here's a very short example:
data mydata;
input timevar stat1 stat2 stat3;
format timevar TIME5.;
informat timevar HHMMSS5.;
datalines;
10:01 1 3 5
10:03 2 4 6
;;;;
run;
proc expand data=mydata out=mydata_exp from=minute to=minute observed=beginning method=none;
id timevar;
run;
The documentation has more details if you want to perform inter/extrapolation or anything like that. The important options are from=minute, observed=beginning, method=none (no extrapolation or interpolation), and id (which identifies the time variable).
If you don't have ETS, then a data step should suffice. You can either merge to a known dataset or add your own rows; the size of your dataset largely determines which is easier. Here's the merge variation. The add-your-own-rows variation in a data step is similar to how I create the extra rows here.
*Select the maximum time available.;
proc sql noprint;
select max(timevar) into :endtime from mydata;
quit;
*Create the empty dataset with just times;
data mydata_tomerge;
set mydata(obs=1); *start from the first time in the data (assumes mydata is sorted by timevar);
do timevar = timevar to &endtime by 60; *by 60 = minutes!;
output;
end;
keep timevar;
run;
*Now merge the one with all the times to the one with all the data!;
data mydata_fin;
merge mydata_tomerge(in=a) mydata;
by timevar;
if a;
run;