Using BY-group processing with FIRST. and LAST. - SAS

I just started learning SAS and would like some help understanding the following chunk of code. The following program computes the annual payroll by department.
proc sort data = company.usa out=work.temp;
by dept;
run;
data company.budget(keep=dept payroll);
set work.temp;
by dept;
if wagecat = 'S' then yearly = wagerate * 12;
else if wagecat = 'H' then yearly = wagerate * 2000;
if first.dept then payroll=0;
payroll+yearly;
if last.dept;
run;
Questions:
What does out = work.temp do in the first line of this code?
I understand the data step creates two temporary variables for each BY variable (first.variable/last.variable) and that their values are either 1 or 0, but what exactly do first.dept and last.dept do here in the code?
Why do we need payroll=0 after first.dept in the second-to-last line?

This code takes the data for salaries and calculates the payroll amount for each department for a year, assuming salary is the same for all 12 months and that an hourly worker works 2000 hours.
It creates a copy of the data set, sorted and stored in the WORK library, leaving the original untouched. RTM.
From the docs
OUT=SAS-data-set
names the output data set. If SAS-data-set does not exist, then PROC SORT creates it.
CAUTION: Use care when you use PROC SORT without OUT=. Without the OUT= option, PROC SORT replaces the original data set with the sorted observations when the procedure executes without errors.
Default: Without OUT=, PROC SORT overwrites the original data set.
Tip: With in-database sorts, the output data set cannot refer to the input table on the DBMS. You can use data set options with OUT=.
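In other words, with the data set names from the question (a minimal sketch):
* sorts in place: company.usa itself is replaced by the sorted observations;
proc sort data=company.usa;
by dept;
run;
* sorts into a copy: company.usa is left untouched and work.temp receives the sorted observations;
proc sort data=company.usa out=work.temp;
by dept;
run;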
First.DEPT is an indicator variable that flags the first observation of a specific BY group. So when you encounter the first record for a department, it is identified. Last.DEPT flags the last record for that specific department; it means the next record would be the first record for a different department.
It sets PAYROLL to 0 at the first record of each department. Since you have if last.dept; that means only the last record for each department is output. This code is not intuitive - it's a manual way to sum the wages for the people in each department. The common way would be to use a summary procedure, such as MEANS/SUMMARY, but I assume they were trying to avoid having two passes of the data. Though if you skip the sort, the summary procedure may be just as fast anyway.
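If you want to see the indicators in action, here is a minimal sketch (using the sorted work.temp from the question) that just writes the flags to the log:
data _null_;
set work.temp;
by dept;
putlog dept= first.dept= last.dept=; * 1 on the first/last record of each department, 0 otherwise;
run;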
Again, RTM here. The SAS documentation is quite thorough on these beginner topics.
Here's an alternative method that should generate the exact same results but is more intuitive IMO.
data temp;
set company.usa;
if wagecat='S' then factor=12; *salary in months;
else if wagecat='H' then factor=2000; *salary in hours;
run;
proc means data=temp noprint NWAY;
class dept;
var wagerate;
weight factor; * with a WEIGHT statement, sum(wagerate) totals wagerate*factor, i.e. the annual pay;
output out=company.budget sum(wagerate)=payroll;
run;

Related

Proc SQL running total

I am building a process in SAS EG and came to a sticking point when I needed a running total. This would be very easy to do in Excel but my table is 22M records long. I have VBA experience but not Proc SQL. Can someone show me how to do a running total of dollars by item? The data is sorted by Market/Segment/Item/Month.
Thanks
Jeff
Your hierarchy is Market / Segment / Item, and from the question one can perhaps presume an Item is unique across all Markets and Segments.
A running total is easiest in a DATA Step. You will want to use first. automatic variables that are prepared when the step has a BY statement.
data want;
set have;
by Market Segment Item Month; * add month to make sure incoming data is ordered timewise, if not an error will appear in the log;
if first.Item then RunningDollars = 0;
RunningDollars + Dollars; * the + here is a SUM statement, which automatically retains RunningDollars so its value is available for the next record;
run;
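For a quick sanity check, here is a minimal sketch with made-up values for the variables named in the question:
data have;
input Market $ Segment $ Item $ Month Dollars;
datalines;
East A X 1 10
East A X 2 15
East A Y 1 7
East A Y 2 3
;
run;
data want;
set have;
by Market Segment Item Month;
if first.Item then RunningDollars = 0; * reset at the start of each Item;
RunningDollars + Dollars; * yields 10, 25 for Item X and 7, 10 for Item Y;
run;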

"BY variables are not properly sorted" error although it was sorted already

I am using SAS with a large dataset (>20 GB). When I run a DATA step, I receive the "BY variables are not properly sorted" error although I sorted the dataset by the same variables. When I ran PROC SORT again, SAS even said "Input dataset is already sorted, no sorting done".
My code is:
proc sort data=output.TAQ;
by market ric date miliseconds descending type order;
run;
options nomprint;
data markers (keep=market ric date miliseconds type order);
set output.TAQ;
by market ric date;
if first.date;
* ie do the following once per stock-day;
* Make 1-second markers;
/*Type="AMARK"; Order=0; * Set order to zero to ensure that markers get placed before trades and quotes that occur at the same milisecond;
do i=((9*60*60)+(30*60)) to (16*60*60); miliseconds=i*1000; output; end;*/
run;
And the error message was:
ERROR: BY variables are not properly sorted on data set OUTPUT.TAQ.
RIC=CXR.CCP Date=20160914 Time=13:47:18.125 Type=Quote Price=. Volume=. BidPrice=9.03 BidSize=400
AskPrice=9.04 AskSize=100 Qualifiers= order=116458952 Miliseconds=49638125 exchange=CCP market=1
FIRST.market=0 LAST.market=0 FIRST.RIC=0 LAST.RIC=0 FIRST.Date=0 LAST.Date=1 i=. _ERROR_=1
_N_=43297873
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 43297874 observations read from the data set OUTPUT.TAQ.
WARNING: The data set WORK.MARKERS may be incomplete. When this step was stopped there were
56770826 observations and 6 variables.
WARNING: Data set WORK.MARKERS was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 1:14.21
cpu time 26.71 seconds
The error is occurring deep into your data step, at _N_=43297873. That suggests to me that the PROC SORT is working up to a point, but then fails. It is hard to know what the reason is without knowing your SAS environment or how OUTPUT.TAQ is stored.
Some people have reported resource problems or file system limitations when sorting large data sets.
From SAS FAQ: Sorting Very Large Datasets with SAS (not an official source):
When sorting in a WORK folder, you must have free storage equal to 4x the size of the data set (or 5x if under Unix)
You may be running out of RAM
You may be able to use options MSGLEVEL=i and FULLSTIMER to get a fuller picture
Also using options sastraceloc=saslog; can produce helpful messages.
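For reference, a minimal sketch of the diagnostic options mentioned above, set before re-running the sort (SASTRACELOC= only matters when a DBMS engine and SASTRACE= are involved):
options msglevel=i fullstimer; * extra notes plus full memory/CPU statistics in the log;
options sastraceloc=saslog; * route any DBMS trace messages to the SAS log;
proc sort data=output.TAQ;
by market ric date miliseconds descending type order;
run;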
Maybe instead of sorting it, you could break it up into a few steps, something like:
/* Get your market ~ ric ~ date pairs */
proc sql;
create table market_ric_date as
select distinct market, ric, date
from output.TAQ
/* Possibly an order by clause here on market, ric, date */
; quit;
data millisecond_stuff;
set market_ric_date;
*Possibly add type/order in this step as well?;
do i=((9*60*60)+(30*60)) to (16*60*60); miliseconds=i*1000; output; end;
run;
/* Possibly a third step here to add type / order if you need to get from original data source */
If your source dataset is in a database, it may be sorted in a different collation.
Try the following before your sort:
options sortpgm=sas;
I had the same error, and the solution was to make a copy of the original table in the WORK directory, do the sort, and then the BY processing worked.
In your case something like below:
data tmp_TAQ;
set output.TAQ;
run;
proc sort data=tmp_TAQ;
by market ric date miliseconds descending type order;
run;
data markers (keep=market ric date miliseconds type order);
set tmp_TAQ;
by market ric date;
if first.date;
* ie do the following once per stock-day;
* Make 1-second markers;
/*Type="AMARK"; Order=0; * Set order to zero to ensure that markers get placed before trades and quotes that occur at the same milisecond;
do i=((9*60*60)+(30*60)) to (16*60*60); miliseconds=i*1000; output; end;*/
run;

Macro that outputs table with testing results of SAS table

Problem
I'm not a very experienced SAS user, but unfortunately the lab where I can access data is restricted to SAS. Also, I don't currently have access to the data since it is only available in the lab, so I've created simulated data for testing.
I need to create a macro that gets the values and dimensions from a PROC MEANS table and performs some tests that check whether or not the top two values from the data make up 90% of the results.
As an example, assume I have panel data that lists firms' revenue, costs, and profits. I've created a table that lists n, sum, mean, median, and std. Now I need to check whether or not the top two firms make up 90% of the results and, if so, flag whether it's profit, revenue, or costs that makes up 90%.
I'm not sure how to get started.
Here are the steps:
Read the data
Read the PROC MEANS table created, and get its dimensions and variables
Get the top two firms for each variable and perform the check
Create a new table that lists the variable, its value from the table read in, the largest and second largest values, and the flag
Then print the table
Simulated data:
https://www.dropbox.com/s/ypmri8s6i8irn8a/dataset.csv?dl=0
PROC MEANS Table
proc import datafile="/folders/myfolders/dataset.csv"
out=dt
dbms=csv
replace;
getnames=yes;
run;
TITLE "Macro Project Sample";
PROC MEANS n sum mean median std;
VAR V1 V2 V3;
RUN;
Desired Results:
          Value      Largest   Sec. Largest   Flag
V1    463138.09      9888.09        9847.13
V2       148.92         1.99           1.99
V3     11503375      9999900        1000000      Y
At the moment I can't open your simulated dataset, but I can give you some advice; I hope it helps.
You can get the n extreme values of given variables using the OUTPUT OUT= statement with the IDGROUP option.
Here's an example using the Charity data set (run the code at http://support.sas.com/documentation/cdl/en/proc/65145/HTML/default/viewer.htm#p1oii7oi6k9gfxn19hxiiszb70ms.htm to create it):
proc means data=Charity;
var MoneyRaised HoursVolunteered;
output out=try sum=
IDGROUP ( MAX (Moneyraised HoursVolunteered) OUT[2] (moneyraised hoursvolunteered)=max1 max2);
run;
data var1 (keep=name1 _freq_ moneyraised max1_1 max1_2 rename=(moneyraised=value max1_1=largest max1_2=seclargest name1=name))
var2 (keep=name2 _freq_ HoursVolunteered max2_1 max2_2 rename=(HoursVolunteered=value max2_1=largest max2_2=seclargest name2=name));
length name1 name2 $4;
set try ;
name1='VAR1';
name2='VAR2';
run;
data finalmerge;
length flag $1;
set var1 var2;
if largest+seclargest > value*0.9 then flag='Y';
run;
In the PROC MEANS I chose the two variables MoneyRaised and HoursVolunteered; you will choose your V1, V2 and V3 and make the corresponding changes throughout the program.
IDGROUP outputs the max value for both variables, as you see in the parentheses, but with OUT[2] it keeps the two largest, i.e. the largest and second largest.
You must rename them; I chose the names max1 and max2, and SAS automatically appends _1 and _2 to the first and second max values.
All the output ends up on a single line, so I use a data step that writes two data sets (DATA var1 var2), keeping the variables needed and renaming them for the later combination; I also chose a naming scheme, as you can see.
Finally I combine the two data sets created and add the flag.
Here are some initial steps and pointers in a non-macro approach which restructures the data in such a manner that no array processing is required. This approach should be good for teaching you a bit about manipulating data in SAS, but it will not be as fast as a single-pass approach (like the macros you originally posted) since it transposes and sorts the data.
First create some nice looking dummy data.
/* Create some dummy data with three variables to assess */
data have;
do firm = 1 to 3;
revenue = rand("uniform");
costs = rand("uniform");
profits = rand("uniform");
output;
end;
run;
Transpose the data so all the values are in one column (with the variable names in another).
/* Move from wide to deep table */
proc transpose
data = have
out = trans
name = Variable;
by firm;
var revenue costs profits;
run;
Sort the data so each variable is in a contiguous group of rows and the highest values are at the end of each Variable group.
/* Sort by Variable and then value
so the biggest values are at the end of each Variable group */
proc sort data = trans;
by Variable COL1;
run;
Because of the structure of this data, you could go down through each observation in turn, creating a running total, which when you get to the final observation in a Variable group would be the Variable total. In this observation you also have the largest value (the second largest was in the previous observation).
At this point you can create a data step (sketched after this list) that:
Is aware when it is in the first and last values of each variable group
    - by statement to make the data step aware of your groups
    - first.Variable temporary variable so you can initialise your total variable to 0
    - last.Variable temporary variable so you can output only the last line of each group
Sums up the values in each group
    - retain statement so SAS doesn't empty your total with each new observation
    - sum() function or + operator to create your total
Creates and populates new variables for the largest and second largest values in each group
    - lag() function or retain statement to keep the previous value (the second largest)
Creates your flag
Outputs your new variables at the end of each group
    - output statement to request an observation be stored
    - keep statement to select which variables you want
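Here is a minimal sketch of that data step, assuming the transposed and sorted trans data set from above (the names total, largest, seclargest and flag are illustrative):
data want (keep=Variable total largest seclargest flag);
length flag $1;
set trans;
by Variable; * makes first.Variable and last.Variable available;
if first.Variable then total = 0;
total + COL1; * sum statement: implicitly retains total across observations;
largest = COL1; * ascending sort, so the last row in each group holds the maximum;
seclargest = lag(COL1); * the previous observation, i.e. the second largest on the last row;
if last.Variable then do;
if largest + seclargest > 0.9*total then flag = 'Y';
output; * one observation per Variable group;
end;
run;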
The macros you posted originally looked like they were meant to perform the analysis you are describing but with some extras (only positive values contributed to the Total, an arbitrary number of values could be included rather than just the top 2, the total was multiplied by another variable k1198, negative values were caught in the second largest, extra flags and values were calculated).

SAS Transpose Comma Separated Field

This is a follow-up to an earlier question of mine.
Transposing Comma-delimited field
The answer I got worked for the specific case, but now I have a much larger dataset, so reading it in a datalines statement is not an option. I have a dataset similar to the one created by this process:
data MAIN;
input ID STATUS STATE $;
cards;
123 7 AL,NC,SC,NY
456 6 AL,NC
789 7 ALL
;
run;
There are two problems here:
1: I need a separate row for each state in the STATE column
2: Notice the third observation says 'ALL'. I need to replace that with a list of the specific states, which I can get from a separate dataset (below).
data STATES;
input STATE $;
cards;
AL
NC
SC
NY
TX
;
run;
So, here is the process I am attempting that doesn't seem to be working.
First, I create a list of the STATES needed for the imputation, and a count of said states.
proc sql;
select distinct STATE into :all_states separated by ','
from STATES;
select count(distinct STATE) into :count_states
from STATES;
quit;
Second, I try to impute that list where the 'ALL' value appears for STATE. This is where the first error appears. How can I ensure that the variable STATE is long enough for the new value? Also, how do I handle the commas?
data x_MAIN;
set MAIN;
if STATE='ALL' then STATE="&all_states.";
run;
Finally, I use the SCAN function to read in one state at a time. I'm also getting an error here, but I think fixing the above part may solve it.
data x_MAIN_mod;
set x_MAIN;
array state(&count_states.) state:;
do i=1 to dim(state);
state(i) = scan(STATE,i,',');
end;
run;
Thanks in advance for the help!
Looks like you are almost there. Try this on the last Data Step.
data x_MAIN_mod;
set x_MAIN;
format out_state $2.;
nstate = countw(state,",");
do i=1 to nstate;
out_state = scan(state,i,",");
output;
end;
run;
Do you actually have to have two steps like that? You can use a 'big number' for a temporary variable without much effect on things, and then you don't need the intermediate dataset.
data x_MAIN;
length STATE_temp $150;
set MAIN;
if STATE='ALL' then STATE_temp="&all_states.";
else STATE_temp=STATE;
array states(&count_states.) $ 2 state1-state&count_states.; * an array cannot share its name with the existing variable STATE;
do i=1 to dim(states);
states(i) = scan(STATE_temp,i,','); * scan the imputed list, not the original STATE;
end;
drop STATE_temp i;
run;
If you actually do need the STATE, then honestly I'd go with the big number (=50*3, so not all that big) and then add OPTIONS COMPRESS=CHAR; which will (give or take) turn your CHAR fields into VARCHAR (at the cost of a tiny bit of CPU time, but usually far less than the disk read/write time saved).
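A minimal sketch of how that might look, assuming a fixed 150-character upper bound as above:
options compress=char; * store long character variables more compactly;
data x_MAIN;
length STATE $150; * declared before SET, so the longer length wins (SAS will note the multiple lengths);
set MAIN;
if STATE='ALL' then STATE="&all_states.";
run;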

New SAS variable conditional on observations

(first time posting)
I have a data set where I need to create a new variable (in SAS), based on meeting a condition related to another variable. So, the data contains three variables from a survey: Site, IDnumb (person), and Date. There can be multiple responses from different people but at the same site (see person 1 and 3 from site A).
Site  IDnumb  Date
a     1       6/12
b     2       3/4
c     4       5/1
a     3       .
d     5       .
I want to create a new variable called Complete, but it can't contain duplicates. So, when I go to proc freq, I want site A to be counted once, using the 6/12 Date of the Completed Survey. So basically, if a site is represented twice and contains a Date in one, I want to only count that one and ignore the duplicate site without a date.
            N    %
Complete    3    75%
Last Month  1    25%
My question may be around the NODUP and NODUPKEY possibilities. If I do a Proc Sort (nodupkey) by Site and Date, would that eliminate obs "a 3 ."?
Any help would be greatly appreciated. Sorry for the jumbled "table", as this is my first post (hints on making that better are also welcomed).
You can do this a number of ways.
First off, you need a complete/not complete binary variable. If you're in the datastep anyway, might as well just do it all there.
proc sort data=yourdata;
by site descending date;
run;
data yourdata_want;
set yourdata;
by site descending date;
if first.site then do;
comp = ifn(date>0,1,0);
output;
end;
run;
proc freq data=yourdata_want;
tables comp;
run;
If you used NODUPKEY, you'd first sort it by SITE and DESCENDING DATE, then by SITE with NODUPKEY. That way the latest date is up top. You also could format COMP to have the text labels you list rather than just 1/0.
You can also do it with a format on DATE, so you can skip the data step (still need the sort/sort nodupkey). Format all nonmissing values of DATE to "Complete" and missing value of date to "Last Month", then include the missing option in your proc freq.
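A minimal sketch of that format-based route, assuming Date is numeric as in the ifn() example above (the format name compfmt and the intermediate data set names are illustrative):
proc format;
value compfmt . = 'Last Month' other = 'Complete';
run;
proc sort data=yourdata out=ordered;
by site descending date;
run;
proc sort data=ordered out=site_first nodupkey; * keeps the first (latest-date) row per site;
by site;
run;
proc freq data=site_first;
tables date / missing;
format date compfmt.; * PROC FREQ groups on the formatted values;
run;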
Finally, you could do the table in SQL (though getting two rows like that is a bit harder, you have to UNION two queries together).
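And a hedged sketch of that SQL route, building on the yourdata_want / comp data set created above (the padded 'Complete  ' literal keeps the UNION from truncating the longer 'Last Month' label):
proc sql;
create table completion as
select 'Complete  ' as status, count(*) as n,
count(*) / (select count(*) from yourdata_want) as pct format=percent8.0
from yourdata_want
where comp = 1
union all
select 'Last Month' as status, count(*) as n,
count(*) / (select count(*) from yourdata_want) as pct format=percent8.0
from yourdata_want
where comp = 0;
quit;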