Percent split with where condition in SAS - sas

I am new to SAS and data analytics in general, so sorry if my question sounds too dumb.
I have a dataset of brand medicine with three variables. Variable 1 contains the drug name, variable 2 contains whether that drug is Branded, Generic or Branded-Generic, and variable 3 contains the total sale of that drug.
What I want is the percent split of Branded, Generic and Branded-Generic drugs in the total drug sale. The final output should look like
Branded : 35%
Generic : 25%
Branded-Generic : 40%
Any help with SAS code that would do this is greatly appreciated. Thank you.

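So you want a % sale split! You can try using SQL (proc sql) to get your desired answer.
proc sql;
/* Step 1: total sale for each drug type */
create table totals as
select drug_type, sum(total_sale) as tot_sale
from have
group by drug_type;
/* Step 2: each type's share of the grand total */
create table want as
select *, tot_sale/sum(tot_sale) as percent_sale format=percent10.2
from totals;
quit;
I created a table 'totals' that has the total sale for each drug type. Using that table, I created 'want' with a column that holds the calculated sale percentage, formatted as a percent (for easy viewing). The two steps write to different tables because a CREATE TABLE that reads from its own target table makes SAS log a warning about a recursive reference.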
Of course, there are other ways of doing it, like using proc summary or proc freq or even a data step. But as a beginner, I guess starting out with SQL would be a good decision.
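As one of those other ways, here is a minimal sketch with PROC FREQ. It assumes the same placeholder dataset `have` and the variable names `drug_type` and `total_sale` used in the SQL above (the question does not give the real names), and that no sale value is negative. The WEIGHT statement makes each row count as its sale amount rather than as one observation, so the Percent column is each type's share of the total sale.

```sas
proc freq data=have;
  weight total_sale;          /* each row contributes its sale amount instead of a count of 1 */
  tables drug_type / nocum;   /* Percent = share of total sale; NOCUM drops the cumulative columns */
run;
```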

Related

Getting average price across stores and across months

I am trying to use the proc tabulate procedure to arrive at the average price of some configurable items, across stores and across months. Below is the sample data set, which I need to process
Configuration|Store_Postcode|Retail Price|month
163|SE1 2BN|455|1
320|SW12 9HD|545|1
23|E2 0RY|515|1
The below code is displaying the month wise average price for each configuration.
proc tabulate data=cs2.pos_data_raw;
class configuration store_postcode month;
var retail_price;
table configuration,month*MEAN*retail_price;
run;
But can I get this grouped one more level, at the store postcode level? I modified the code as shown below, but executing it crashes the system!
proc tabulate data=cs2.pos_data_raw;
class configuration store_postcode month;
var retail_price;
table configuration,store_postcode*month*MEAN*retail_price;
run;
Please advise whether my approach is incorrect, or what I am doing wrong in proc tabulate that makes it crash the system.
I am not sure if this exactly answers your question since I am new to SAS, but when I switched store_postcode*month*MEAN*retail_price to month*store_postcode*MEAN*retail_price, it worked without crashing. I am only guessing, but the reason may be that your data contains just one value for month and multiple values for postcode, so month is the most general level of categorization and postcode is the more specific one.
On a side note, I also tried formatting the table another way, to segment the data by postcode:
proc tabulate data=pos_data_raw;
class configuration store_postcode month;
var retail_price;
table store_postcode*configuration, month*MEAN*retail_price;
run;
The output table has postcode and configuration id on the left, and month and retail price on top.

using by group processing First. and Last

I just started learning SAS and would like some help understanding the following chunk of code. The following program computes the annual payroll by department.
proc sort data = company.usa out=work.temp;
by dept;
run;
data company.budget(keep=dept payroll);
set work.temp;
by dept;
if wagecat ='S' then yearly = wagerate *12;
else if wagecat = 'H' then yearly = wagerate *2000;
if first.dept then payroll=0;
payroll+yearly;
if last.dept;
run;
Questions:
What does out = work.temp do in the first line of this code?
I understand the data step creates 2 temporary variables for each BY variable (first.variable/last.variable) and that the values are either 1 or 0, but what exactly do first.dept and last.dept do here in the code?
Why do we need payroll=0 after first.dept in the second to the last line?
This code takes the data for salaries and calculates the payroll amount for each department for a year, assuming salary is the same for all 12 months and that an hourly worker works 2000 hours.
It creates a copy of the data set which is sorted and stored in the work library. RTM.
From the docs:
OUT= SAS-data-set
names the output data set. If SAS-data-set does not exist, then PROC SORT creates it.
CAUTION: Use care when you use PROC SORT without OUT=. Without the OUT= option, PROC SORT replaces the original data set with the sorted observations when the procedure executes without errors.
Default: Without OUT=, PROC SORT overwrites the original data set.
Tips: With in-database sorts, the output data set cannot refer to the input table on the DBMS. You can use data set options with OUT=.
See: SAS Data Set Options: Reference
Example: Sorting by the Values of Multiple Variables
First.DEPT is an indicator variable that flags the first observation of a specific BY group, so the first record for each department is identified. Last.DEPT flags the last record for that department, which means the next record will be the first record of a different department.
It sets PAYROLL to 0 at the first record of each department. Because payroll+yearly; is a sum statement, PAYROLL is retained from one record to the next, so without the reset the total would carry over into the next department. Since you have if last.dept; only the last record for each department is output, and by then PAYROLL holds that department's total. This code is not intuitive - it's a manual way to sum the wages for people in each department. The common way would be to use a summary procedure, such as MEANS/SUMMARY, but I assume they were trying to avoid two passes over the data. Though if you're not sorting, it may be just as fast anyway.
Again, RTM here. The SAS documentation is quite thorough on these beginner topics.
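To see the First./Last. flags described above for yourself, a small sketch like the following writes them to the log for every record. It assumes work.temp has already been sorted by dept, as in the PROC SORT from the question; nothing is written to a dataset.

```sas
/* Assumes work.temp is already sorted by dept */
data _null_;
  set work.temp;
  by dept;
  /* first.dept is 1 on the first record of a department, last.dept is 1 on the last */
  put dept= first.dept= last.dept=;
run;
```

A department with a single record shows both flags as 1 on the same line.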
Here's an alternative method that should generate the exact same results but is more intuitive IMO.
data temp;
set company.usa;
if wagecat='S' then factor=12; *salary in months;
else if wagecat='H' then factor=2000; *salary in hours;
run;
proc means data=temp noprint NWAY;
class dept;
var wagerate;
weight factor;
output out=company.budget sum(wagerate)=payroll;
run;

Averages in SAS with dates using months

Let's say I have 50 years of data for each day and month. I also have a column which lists the max rainfall for each day of that dataset. I want to be able to compute the average monthly rainfall and standard deviation for each of those 50 years. How would I accomplish this task? I've considered using PROC MEANS:
PROC MEANS DATA = WORK.rainfall;
BY DATE;
VAR AVG(max_rainfall);
RUN;
but I don't know how to tell SAS that I want to use the MM part of the MMDDYY format to decide where to start and stop calculating the averages for each month. I also do not know how to tell SAS, within this PROC MEANS step, to format the data correctly using MMDDYY10. This is why my code fails.
Update: I've also tried using this statement,
proc sql;
create table new as
select date,count(max_rainfall) as rainfall
from WORK.rainfall
group by date;
create table average as
select year(date) as year,month(date) as month,avg(rainfall) as avg
from new
group by year,month;
quit;
but that doesn't solve the problem either, unfortunately. It gives me the wrong values, although it does create a table. Where in my code could I have gone wrong? Am I correctly telling SAS to add up all the rainfall values in a month and then divide by the number of days in that month? Here's a snippet of my table.
You can use a format to group the dates for you. But you should use a CLASS statement instead of a BY statement. Here is an example using the dataset SASHELP.STOCKS.
proc means data=sashelp.stocks nway;
where date between '01JAN2005'd and '31DEC2005'd ;
class date ;
format date yymon. ;
var close ;
run;
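Applied to the rainfall question, the same idea might look like the sketch below. It assumes that date in WORK.rainfall is a numeric SAS date variable (not a character string) and that max_rainfall is numeric; both names come from the question. MEAN and STD request the average and standard deviation, and the YYMON. format makes CLASS group the days by year and month.

```sas
proc means data=work.rainfall nway mean std;
  class date;            /* grouped by the formatted value, i.e. one group per year-month */
  format date yymon.;    /* e.g. all days of January 2005 fall into the same group */
  var max_rainfall;
run;
```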

calculate market weighted return using SAS

I have four variables Name, Date, MarketCap and Return. Name is the company name. Date is the time stamp. MarketCap shows the size of the company. Return is its return at day Date.
I want to create an additional variable MarketReturn, which is the value-weighted return of the market at each point in time. For each day t, MarketCap weighted return = sum over i of [ return(i) * MarketCap(i) / Total(MarketCap) ], where return(i) is company i's return on day t and Total(MarketCap) is the sum of MarketCap over all companies on day t.
The way I do this is very inefficient. I guess there must be some function that can easily achieve this target in SAS, so I want to ask if anyone can improve my code, please.
step1: sort data by date
step2: calculate total market value at each day TotalMV = sum(MarketCap).
step3: calculate the weight for each company (weight = MarketCap/TotalMV)
step4: create a new variable 'Contribution' = Return * weight for each company
step5: sum up Contribution at each day. Sum(Contribution)
Weighted averages are supported in a number of SAS PROCs. One of the more common, all-around useful ones is PROC SUMMARY:
PROC SUMMARY NWAY DATA = my_data_set ;
CLASS Date ;
VAR Return / WEIGHT = MarketCap ;
OUTPUT
OUT = my_result_set
MEAN (Return) = MarketReturn
;
RUN;
The NWAY piece tells the PROC that the observations should be grouped only by what is stated in the CLASS statement - it shouldn't also provide an ungrouped grand total, etc.
The CLASS Date piece tells the PROC to group the observations by date. You do not need to pre-sort the data when you use CLASS. You do have to pre-sort if you say BY Date instead. The only rationale for using BY is if your dataset is very large and naturally ordered, you can gain some performance. Stick to CLASS in most cases.
VAR Return / WEIGHT = MarketCap tells the proc that any weighted calculations on Return should use MarketCap as the weight.
Lastly, the OUTPUT statement specifies the data set to write the results to (using the OUT option), and specifies the calculation of a mean on Return that will be written as MarketReturn.
There are many, many more things you can do with PROC SUMMARY. The documentation for PROC SUMMARY is sparse, but only because it is the nearly identical sibling of PROC MEANS, and SAS did not want to produce reams of mostly identical documentation for both; see the SAS 9.4 PROC MEANS documentation. The main difference between the two PROCs is that SUMMARY by default only writes to a dataset (unless you add the PRINT option), while MEANS by default prints to the output window. Try PROC MEANS if you want to see the result pop up on the screen right away.
The MEAN keyword in the OUTPUT statement comes from SAS's list of statistic keywords, which is described in the same documentation.
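As a cross-check on the weighted mean, the five manual steps from the question collapse into a single query, since the sum of Return * weight equals sum(Return * MarketCap) / sum(MarketCap). This is only a sketch: my_data_set is the placeholder name from the PROC SUMMARY example, my_result_check is a hypothetical output name, and the result should agree with the PROC SUMMARY output as long as MarketCap is positive.

```sas
proc sql;
  create table my_result_check as
  select Date,
         /* weighted mean: sum of (return x weight) with weight = MarketCap / total MarketCap */
         sum(Return * MarketCap) / sum(MarketCap) as MarketReturn
  from my_data_set
  /* drop rows that cannot contribute to both the numerator and the denominator */
  where Return is not missing and MarketCap is not missing
  group by Date;
quit;
```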

New SAS variable conditional on observations

(first time posting)
I have a data set where I need to create a new variable (in SAS), based on meeting a condition related to another variable. So, the data contains three variables from a survey: Site, IDnumb (person), and Date. There can be multiple responses from different people but at the same site (see person 1 and 3 from site A).
Site IDnumb Date
a 1 6/12
b 2 3/4
c 4 5/1
a 3 .
d 5 .
I want to create a new variable called Complete, but it can't contain duplicates. So, when I go to proc freq, I want site A to be counted once, using the 6/12 Date of the Completed Survey. So basically, if a site is represented twice and contains a Date in one, I want to only count that one and ignore the duplicate site without a date.
N %
Complete 3 75%
Last Month 1 25%
My question may be around the NODUP and NODUPKEY possibilities. If I do a Proc Sort (nodupkey) by Site and Date, would that eliminate obs "a 3 ."?
Any help would be greatly appreciated. Sorry for the jumbled "table", as this is my first post (hints on making that better are also welcomed).
You can do this a number of ways.
First off, you need a complete/not complete binary variable. If you're in the datastep anyway, might as well just do it all there.
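proc sort data=yourdata;
by site descending date; /* DESCENDING goes before the variable it applies to */
run;
data yourdata_want;
set yourdata;
by site descending date;
/* the first record per site is the one with the latest date, if the site has any date */
if first.site then do;
comp = ifn(date>0,1,0); /* 1 = has a date, 0 = missing date */
output;
end;
run;
proc freq data=yourdata_want;
tables comp;
run;
If you used NODUPKEY, you'd first sort by SITE DESCENDING DATE, then by SITE with NODUPKEY. That way the record with the latest date comes first within each site, and it is the one that is kept. You could also format COMP to show the text labels you list rather than just 1/0.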
You can also do it with a format on DATE, so you can skip the data step (still need the sort/sort nodupkey). Format all nonmissing values of DATE to "Complete" and missing value of date to "Last Month", then include the missing option in your proc freq.
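A minimal sketch of the sort / sort NODUPKEY / format route, assuming DATE is a numeric SAS date. The dataset names sorted and one_per_site and the format name compfmt are made up for illustration. The second sort relies on PROC SORT keeping the existing order within each site (its default EQUALS behaviour), so NODUPKEY retains the first record per site.

```sas
/* Records with a date sort ahead of missing dates within each site */
proc sort data=yourdata out=sorted;
  by site descending date;
run;

/* Keep only the first record per site */
proc sort data=sorted out=one_per_site nodupkey;
  by site;
run;

/* Map missing dates to one label and every other date to the other */
proc format;
  value compfmt
    .     = 'Last Month'
    other = 'Complete';
run;

proc freq data=one_per_site;
  format date compfmt.;
  tables date / missing;   /* MISSING keeps the missing-date row in the table and percentages */
run;
```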
Finally, you could do the table in SQL (though getting two rows like that is a bit harder, you have to UNION two queries together).