I run a program like this:
data january;
set allmonths (keep=product month num_sold cost);
if month='jan' then output january;
sales=num_sold*cost;
put sales;
keep product sales;
run;
The dataset january contains two variables, product and sales, but the value of sales is missing.
product sales
a .
I kind of understand why the sales value is missing: it is not in the dataset allmonths and has not been calculated before the output statement.
But then why is this variable included in the dataset january if it did not exist when the output statement ran? If the keep statement adds every listed variable to the output dataset, why is no value written for it?
I think I might know the reason: the keep statement is about variables, while output is about observations.
But I still want to ask and learn.
Thank you!
Keep is one of those SAS statements that is processed during the compilation phase of the data step, before any data is processed. Which variables to keep in the output table january has therefore already been decided (by the keep statement) before your if and output statements execute. An equivalent way of writing your code, which may make this clearer, is:
data january (keep= product sales);
set allmonths (keep=product month num_sold cost);
if month='jan' then output january;
sales=num_sold*cost;
put sales;
run;
To simplify it and get it to do what you probably want:
data january(keep=product sales);
set allmonths(keep=product month num_sold cost);
where month='jan';
sales=num_sold*cost;
run;
The OUTPUT statement runs immediately. So it is writing the record before you have calculated a value for SALES. Try adding another PUT statement before the IF/THEN/OUTPUT statement and you can see the values that will be output.
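The timing can be seen with a minimal sketch, assuming the same allmonths dataset as in the question. The text labels in the PUT statements are only illustrative.

```sas
data january;
  set allmonths (keep=product month num_sold cost);
  /* SALES has not been assigned yet in this iteration, so the log shows it as missing */
  put 'Before OUTPUT: ' product= sales=;
  if month='jan' then output january;
  sales=num_sold*cost;
  /* SALES now has a value, but the observation has already been written */
  put 'After assignment: ' product= sales=;
  keep product sales;
run;
```

Moving the assignment above the IF/THEN/OUTPUT line would make the first PUT, and the output dataset, show the calculated value.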
For this problem you probably want to use a subsetting IF statement or WHERE statement instead of explicitly running an OUTPUT statement. If you remove the OUTPUT statement then the data step will automatically output the records at the end of the data step.
data january;
set allmonths;
if month='jan';
sales=num_sold*cost;
keep product sales;
run;
I have a dataset containing information by an account number. I'm trying to add a new variable, called product_type, populated with the same value for every record. It is within a SAS macro.
data CC_database;
set cc_base_v2 (keep=accnum date product_type);
product_type="CC";
where date>=%sysevalf("&start_date."d) and file_date<=%sysevalf("&end_date."d);
run;
However, I keep getting the error "The variable product_type in the DROP, KEEP, or RENAME list has never been referenced" and the output dataset only shows a blank column called product_type. What is happening here?
You are using the KEEP= dataset option on the input dataset. So the error is saying PRODUCT_TYPE does not exist in the dataset CC_BASE_V2.
If you want to control what variables are written by the data step just use a KEEP statement.
keep accnum date product_type;
In your case you could use the KEEP= dataset option but only list the variables that are coming from CC_BASE_V2.
data CC_database;
set cc_base_v2(keep=accnum date file_date);
where date>= "&start_date."d and file_date<="&end_date."d;
product_type="CC";
drop file_date;
run;
I have just started learning SAS and would like some help understanding the following chunk of code. The following program computes the annual payroll by department.
proc sort data = company.usa out=work.temp;
by dept;
run;
data company.budget(keep=dept payroll);
set work.temp;
by dept;
if wagecat ='S' then yearly = wagerate *12;
else if wagecat = 'H' then yearly = wagerate *2000;
if first.dept then payroll=0;
payroll+yearly;
if last.dept;
run;
Questions:
What does out = work.temp do in the first line of this code?
I understand that the data step creates 2 temporary variables for each BY variable (first.variable/last.variable) and that their values are either 1 or 0, but what exactly do first.dept and last.dept do here in the code?
Why do we need payroll=0 after first.dept in the second to the last line?
This code takes the data for salaries and calculates the payroll amount for each department for a year, assuming salary is the same for all 12 months and that an hourly worker works 2000 hours.
It creates a copy of the data set which is sorted and stored in the work library. RTM.
From the docs
OUT= SAS-data-set
names the output data set. If SAS-data-set does not exist, then PROC SORT creates it.
CAUTION: Use care when you use PROC SORT without OUT=. Without the OUT= option, PROC SORT replaces the original data set with the sorted observations when the procedure executes without errors.
Default: Without OUT=, PROC SORT overwrites the original data set.
Tips: With in-database sorts, the output data set cannot refer to the input table on the DBMS. You can use data set options with OUT=.
See: SAS Data Set Options: Reference
Example: Sorting by the Values of Multiple Variables
First.DEPT is an indicator variable that flags the first observation of a specific BY group. So when you encounter the first record for a department it is identified. Last.DEPT flags the last record for that specific department. It means the next record would be the first record for a different department.
It resets PAYROLL to 0 at the first record of each department. The sum statement payroll+yearly; retains PAYROLL from one observation to the next, so without the reset the total would keep accumulating across departments. Since you have if last.dept; only the last record for each department, which holds the department total, is output. This code is not intuitive - it's a manual way to sum the wages for people in each department. The common way would be to use a summary procedure, such as MEANS/SUMMARY, but I assume they were trying to avoid having two passes of the data. Though if you're not sorting it may be just as fast anyway.
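To watch the flags and the running total change, here is a minimal sketch that assumes work.temp has already been created by the PROC SORT in the question. The names is_first and is_last are made up for illustration, and nothing is written to a dataset.

```sas
data _null_;
  set work.temp;
  by dept;
  if wagecat='S' then yearly=wagerate*12;
  else if wagecat='H' then yearly=wagerate*2000;
  if first.dept then payroll=0;   /* reset the total at the start of each department */
  payroll+yearly;                 /* sum statement: PAYROLL is retained across observations */
  is_first=first.dept;            /* copy the temporary flags so they can be printed */
  is_last=last.dept;
  put dept= is_first= is_last= yearly= payroll=;
run;
```

In the log, the row with is_last=1 for each department carries the total that the original step writes to company.budget.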
Again, RTM here. The SAS documentation is quite thorough on these beginner topics.
Here's an alternative method that should generate the exact same results but is more intuitive IMO.
data temp;
set company.usa;
if wagecat='S' then factor=12; *salary in months;
else if wagecat='H' then factor=2000; *salary in hours;
run;
proc means data=temp noprint NWAY;
class dept;
var wagerate;
weight factor;
output out=company.budget sum(wagerate)=payroll;
run;
Let's say I have 50 years of data for each day and month. I also have a column which lists the max rainfall for each day of that dataset. I want to be able to compute the average monthly rainfall and standard deviation for each of those 50 years. How would I accomplish this task? I've considered using PROC MEANS:
PROC MEANS DATA = WORK.rainfall;
BY DATE;
VAR AVG(max_rainfall);
RUN;
but I'm unfamiliar on how to let SAS understand that I want to be using the MM of the MMDDYY format to indicate where to start and stop calculating those averages for each month. I also do not know how I can tell SAS within this PROC MEANS statement on how to format the data correctly, using MMDDYY10. This is why my code fails.
Update: I've also tried using this statement,
proc sql;
create table new as
select date,count(max_rainfall) as rainfall
from WORK.rainfall
group by date;
create table average as
select year(date) as year,month(date) as month,avg(rainfall) as avg
from new
group by year,month;
quit;
but that doesn't solve the problem either, unfortunately. It gives me the wrong values, although it does create a table. Where in my code could I have gone wrong? Am I correctly telling SAS to add up all the rainfall values in a month and then divide by the number of days in that month? Here's a snippet of my table.
You can use a format to group the dates for you. But you should use a CLASS statement instead of a BY statement. Here is an example using the dataset SASHELP.STOCKS.
proc means data=sashelp.stocks nway;
where date between '01JAN2005'd and '31DEC2005'd ;
class date ;
format date yymon. ;
var close ;
run;
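Applied to the dataset in the question, the same idea might look like the sketch below. It assumes that date in WORK.rainfall is a numeric SAS date value and that max_rainfall is numeric. The MEAN and STD keywords limit the output to the two statistics asked for.

```sas
proc means data=work.rainfall nway mean std;
  class date;
  /* YYMON. displays and groups the daily dates as year-month values such as 1975JAN */
  format date yymon.;
  var max_rainfall;
run;
```

Because the format includes the year, each month of each of the 50 years gets its own row.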
I am confused as to why this code works and I'm afraid that this is due to a misunderstanding of how SAS processes datasets. I thought that the PDV is initialized to missing each iteration for variables created in an assignment statement, while all other variables are overwritten with each iteration. So, I expected Promotion to be missing with the second iteration because there is no RETAIN statement. When is a RETAIN statement needed and why is it not needed here?
DATA work.extended;
SET Orion.Discount;
WHERE Month(Start_Date)=7 OR Month(End_Date)=7;
DROP Unit_Sales_Price;
Promotion="Happy Holidays";
Season="Winter";
Output;
Season="Summer";
Output;
RUN;
Here is a sample of the output:
Obs Product_ID Start_Date End_Date Discount Promotion Season
1 210100100038 01JUL2007 31JUL2007 60% Happy Holidays Winter
2 210100100038 01JUL2007 31JUL2007 60% Happy Holidays Summer
3 210200100010 01JUL2007 31JUL2007 60% Happy Holidays Winter
4 210200100010 01JUL2007 31JUL2007 60% Happy Holidays Summer
Thank you.
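You're defining iteration incorrectly. One iteration of the data step is not one output, but one pass from the DATA statement to the RUN statement (or to a RETURN, which RUN implies). OUTPUT has no relation to data step iterations, other than the implied output at the end of an iteration when no explicit OUTPUT statement is included. Here both OUTPUT statements execute within the same iteration, after Promotion has been assigned, so no RETAIN is needed.
I suggest the data step debugger if you want to see this in action.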
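A quick way to confirm this without the debugger is to store the automatic variable _N_, which counts data step iterations, on each row. This is a sketch based on the code in the question. The names work.iteration_demo and iteration are made up for illustration.

```sas
data work.iteration_demo;
  set orion.discount;
  where month(start_date)=7 or month(end_date)=7;
  promotion="Happy Holidays";
  iteration=_n_;     /* _N_ is the current iteration number of the data step */
  season="Winter";
  output;            /* first row written in this iteration */
  season="Summer";
  output;            /* second row written in the same iteration */
run;
```

Each Winter/Summer pair should show the same value of iteration, since both rows are written before control returns to the top of the step.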
(first time posting)
I have a data set where I need to create a new variable (in SAS), based on meeting a condition related to another variable. So, the data contains three variables from a survey: Site, IDnumb (person), and Date. There can be multiple responses from different people but at the same site (see person 1 and 3 from site A).
Site IDnumb Date
a 1 6/12
b 2 3/4
c 4 5/1
a 3 .
d 5 .
I want to create a new variable called Complete, but it can't contain duplicates. So, when I go to proc freq, I want site A to be counted once, using the 6/12 Date of the Completed Survey. So basically, if a site is represented twice and contains a Date in one, I want to only count that one and ignore the duplicate site without a date.
N %
Complete 3 75%
Last Month 1 25%
My question may be around the NODUP and NODUPKEY possibilities. If I do a Proc Sort (nodupkey) by Site and Date, would that eliminate obs "a 3 ."?
Any help would be greatly appreciated. Sorry for the jumbled "table", as this is my first post (hints on making that better are also welcomed).
You can do this a number of ways.
First off, you need a complete/not complete binary variable. If you're in the datastep anyway, might as well just do it all there.
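proc sort data=yourdata;
by site descending date; *DESCENDING goes before the variable it applies to;
run;
data yourdata_want;
set yourdata;
by site descending date;
*the first record per site has the latest date, or a missing date if the site has none;
if first.site then do;
comp = ifn(date>0,1,0);
output;
end;
run;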
proc freq data=yourdata_want;
tables comp;
run;
If you used NODUPKEY, you'd first sort it by SITE DESCENDING DATE, then by SITE with NODUPKEY. That way the latest date is up top. You also could format COMP to have the text labels you list rather than just 1/0.
You can also do it with a format on DATE, so you can skip the data step (still need the sort/sort nodupkey). Format all nonmissing values of DATE to "Complete" and missing value of date to "Last Month", then include the missing option in your proc freq.
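A minimal sketch of that format-based route follows, assuming DATE is a numeric SAS date variable. The format name compfmt and the dataset names sorted and one_per_site are made up for illustration.

```sas
proc format;
  value compfmt
    .     = 'Last Month'   /* missing date: survey not completed */
    other = 'Complete';    /* any nonmissing date */
run;

/* Nonmissing dates sort ahead of missing ones within each site */
proc sort data=yourdata out=sorted;
  by site descending date;
run;

/* Keep only the first record per site */
proc sort data=sorted out=one_per_site nodupkey;
  by site;
run;

proc freq data=one_per_site;
  tables date / missing;   /* MISSING treats the missing dates as a valid category */
  format date compfmt.;
run;
```

PROC FREQ groups by the formatted value, so the table should have one row per label rather than one row per distinct date.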
Finally, you could do the table in SQL (though getting two rows like that is a bit harder, you have to UNION two queries together).