Proc SQL running total - sas

I am building a process in SAS EG and came to a sticking point when I needed a running total. This would be very easy to do in Excel but my table is 22M records long. I have VBA experience but not Proc SQL. Can someone show me how to do a running total of dollars by item? The data is sorted by Market/Segment/Item/Month.
Thanks
Jeff

Your hierarchy is Market / Segment / Item, and from the question one can presume that an Item is unique across all Markets and Segments.
A running total is easiest in a DATA step. You will want to use the FIRST. automatic variables, which SAS prepares whenever the step has a BY statement.
data want;
set have;
by Market Segment Item Month; * add month to make sure incoming data is ordered timewise, if not an error will appear in the log;
if first.Item then RunningDollars = 0;
RunningDollars + Dollars; * The + syntax here is a SUM statement, which causes the RunningDollars variable to be automatically retained, meaning the value is available for the next record;
run;
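Since the question asks about PROC SQL specifically, here is a rough sketch of the same running total done with a self-join. It reuses the names from the DATA step above (have, Market, Segment, Item, Month, Dollars) and assumes there is exactly one row per Market/Segment/Item/Month and that Month sorts chronologically; want_sql is a made-up output name. On 22M rows the join compares every month of an item with every earlier month, so the single-pass DATA step is likely to be much faster.

```sas
proc sql;
  /* For each row a, add up Dollars from every row b of the same   */
  /* Market/Segment/Item whose Month is on or before a's Month.    */
  create table want_sql as
  select a.Market, a.Segment, a.Item, a.Month, a.Dollars,
         sum(b.Dollars) as RunningDollars
  from have as a
       inner join have as b
       on  a.Market  = b.Market
       and a.Segment = b.Segment
       and a.Item    = b.Item
       and b.Month  <= a.Month
  /* Assumes one row per Market/Segment/Item/Month, so this        */
  /* grouping returns exactly one output row per input row.        */
  group by a.Market, a.Segment, a.Item, a.Month, a.Dollars
  order by a.Market, a.Segment, a.Item, a.Month;
quit;
```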

Related

Getting average price across stores and across months

I am trying to use the proc tabulate procedure to arrive at the average price of some configurable items, across stores and across months. Below is the sample data set, which I need to process
Configuration|Store_Postcode|Retail Price|month
163|SE1 2BN|455|1
320|SW12 9HD|545|1
23|E2 0RY|515|1
The below code is displaying the month wise average price for each configuration.
proc tabulate data=cs2.pos_data_raw;
class configuration store_postcode month;
var retail_price;
table configuration,month*MEAN*retail_price;
run;
But can I get this grouped one more level - at the Store Post code level? I modified the code to read as shown below, but executing this is crashing the system!
proc tabulate data=cs2.pos_data_raw;
class configuration store_postcode month;
var retail_price;
table configuration,store_postcode*month*MEAN*retail_price;
run;
Please advise if my approach is incorrect, or what I am doing wrong in proc tabulate that crashes the system.
I am not sure this exactly answers your question since I am new to SAS, but when I switched store_postcode*month*MEAN*retail_price to month*store_postcode*MEAN*retail_price, it ran without crashing. My guess is that this works because your data contains only one value for month and many for postcode, so month is the most general level of categorization and postcode is nested beneath it.
On a side note, I tried to format the table in another way also to segment the data by postal code:
proc tabulate data=pos_data_raw;
class configuration store_postcode month;
var retail_price;
table store_postcode*configuration, month*MEAN*retail_price;
run;
The output table has postal code and configuration ID down the left side, with month and mean retail price across the top.

using by group processing First. and Last

I just started learning SAS and would like some help understanding the following chunk of code. The following program computes the annual payroll by department.
proc sort data = company.usa out=work.temp;
by dept;
run;
data company.budget(keep=dept payroll);
set work.temp;
by dept;
if wagecat ='S' then yearly = wagerate *12;
else if wagecat = 'H' then yearly = wagerate *2000;
if first.dept then payroll=0;
payroll+yearly;
if last.dept;
run;
Questions:
What does out = work.temp do in the first line of this code?
I understand the data step creates 2 temporary variables for each BY variable (first.variable/last.variable) and the values are either 1 or 0, but what exactly do first.dept and last.dept do here in the code?
Why do we need payroll=0 after first.dept in the second to the last line?
This code takes the data for salaries and calculates the payroll amount for each department for a year, assuming salary is the same for all 12 months and that an hourly worker works 2000 hours.
It creates a copy of the data set which is sorted and stored in the work library. RTM.
From the docs
OUT= SAS-data-set
names the output data set. If SAS-data-set does not exist, then PROC SORT creates it.
CAUTION:
Use care when you use PROC SORT without OUT=.
Without the OUT= option, PROC SORT replaces the original data set with the sorted observations when the procedure executes without errors.
Default: Without OUT=, PROC SORT overwrites the original data set.
Tips: With in-database sorts, the output data set cannot refer to the input table on the DBMS. You can use data set options with OUT=.
See: SAS Data Set Options: Reference
Example: Sorting by the Values of Multiple Variables
First.DEPT is an indicator variable that flags the first observation of a specific BY group. So when you encounter the first record for a department it is identified. Last.DEPT flags the last record for that specific department. It means the next record would be the first record for a different department.
It resets PAYROLL to 0 at the first record of each department, so the total does not carry over from the previous department. Since you have if last.dept; only the last record for each department, which by then holds the department's full total, is output. This code is not intuitive - it's a manual way to sum the wages for people in each department. The common way would be to use a summary procedure, such as MEANS/SUMMARY, but I assume they were trying to avoid having two passes of the data. Though if you're not sorting it may be just as fast anyway.
Again, RTM here. The SAS documentation is quite thorough on these beginner topics.
Here's an alternative method that should generate the exact same results but is more intuitive IMO.
data temp;
set company.usa;
if wagecat='S' then factor=12; *salary in months;
else if wagecat='H' then factor=2000; *salary in hours;
run;
proc means data=temp noprint NWAY;
class dept;
var wagerate;
weight factor;
output out=company.budget sum(wagerate)=payroll;
run;
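To see the FIRST./LAST. flags described above, it can help to copy them into ordinary variables, since the automatic ones are never written to the output data set. This is only a small sketch: work.demo, work.flags, is_first and is_last are made-up names, and the three rows are invented test data that is already in dept order.

```sas
/* Tiny invented data set, already sorted by dept */
data work.demo;
  input dept $ wagecat $ wagerate;
  datalines;
ACC S 3000
ACC H 20
ENG S 4000
;
run;

data work.flags;
  set work.demo;
  by dept;
  /* FIRST.dept and LAST.dept are dropped automatically,   */
  /* so copy them to regular variables to inspect them.    */
  is_first = first.dept;
  is_last  = last.dept;
run;

proc print data=work.flags noobs;
run;
```

The two ACC rows should show is_first/is_last of 1/0 and 0/1, while the single ENG row is both first and last in its group, so it shows 1/1.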

"BY variables are not properly sorted" error although it was sorted already

I am using SAS for a large dataset (>20gb). When I run a DATA step, I receive the "BY variables are not properly sorted ......" error although I sorted the dataset by the same variables. When I ran the PROC SORT again, SAS even said "Input dataset is already sorted, No sorting done"
My code is:
proc sort data=output.TAQ;
by market ric date miliseconds descending type order;
run;
options nomprint;
data markers (keep=market ric date miliseconds type order);
set output.TAQ;
by market ric date;
if first.date;
* ie do the following once per stock-day;
* Make 1-second markers;
/*Type="AMARK"; Order=0; * Set order to zero to ensure that markers get placed before trades and quotes that occur at the same milisecond;
do i=((9*60*60)+(30*60)) to (16*60*60); miliseconds=i*1000; output; end;*/
run;
And the error message was:
ERROR: BY variables are not properly sorted on data set OUTPUT.TAQ.
RIC=CXR.CCP Date=20160914 Time=13:47:18.125 Type=Quote Price=. Volume=. BidPrice=9.03 BidSize=400
AskPrice=9.04 AskSize=100 Qualifiers= order=116458952 Miliseconds=49638125 exchange=CCP market=1
FIRST.market=0 LAST.market=0 FIRST.RIC=0 LAST.RIC=0 FIRST.Date=0 LAST.Date=1 i=. _ERROR_=1
_N_=43297873
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 43297874 observations read from the data set OUTPUT.TAQ.
WARNING: The data set WORK.MARKERS may be incomplete. When this step was stopped there were
56770826 observations and 6 variables.
WARNING: Data set WORK.MARKERS was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 1:14.21
cpu time 26.71 seconds
The error is occurring deep into your data step, at _N_=43297873. That suggests to me that the PROC SORT is working up to a point, but then fails. It is hard to know what the reason is without knowing your SAS environment or how OUTPUT.TAQ is stored.
Some people have reported resource problems or file system limitations when sorting large data sets.
From SAS FAQ: Sorting Very Large Datasets with SAS (not an official source):
When sorting in a WORK folder, you must have free storage equal to 4x the size of the data set (or 5x if under Unix)
You may be running out of RAM
You may be able to use options MSGLEVEL=i and FULLSTIMER to get a fuller picture
Also using options sastraceloc=saslog; can produce helpful messages.
Maybe instead of sorting it, you could break it up into a few steps, something like:
/* Get your market ~ ric ~ date pairs */
proc sql;
create table market_ric_date as
select distinct market, ric, date
from output.TAQ
/* Possibly an order by clause here on market, ric, date */
; quit;
data millisecond_stuff;
set market_ric_date;
*Possibly add type/order in this step as well?;
do i=((9*60*60)+(30*60)) to (16*60*60); miliseconds=i*1000; output; end;
run;
/* Possibly a third step here to add type / order if you need to get from original data source */
If your source dataset is in a database, it may be sorted in a different collation.
Try the following before your sort:
options sortpgm=sas;
I had the same error, and the solution was to make a copy of the original table in the work directory, do the sort, and then the "by" was working.
In your case something like below:
data tmp_TAQ;
set output.TAQ;
run;
proc sort data=tmp_TAQ;
by market ric date miliseconds descending type order;
run;
data markers (keep=market ric date miliseconds type order);
set tmp_TAQ;
by market ric date;
if first.date;
* ie do the following once per stock-day;
* Make 1-second markers;
/*Type="AMARK"; Order=0; * Set order to zero to ensure that markers get placed before trades and quotes that occur at the same milisecond;
do i=((9*60*60)+(30*60)) to (16*60*60); miliseconds=i*1000; output; end;*/
run;
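Another thing that may be worth checking, given the "Input dataset is already sorted" note, is whether the data set merely carries a sort indicator that no longer matches the actual row order. The sketch below is an assumption about the cause, not a confirmed diagnosis: PROC CONTENTS reports the stored sort information, and the FORCE option asks PROC SORT to sort again even though the indicator says the data is already in order. Re-sorting a 20gb table in place needs plenty of free disk space.

```sas
/* Look at the "Sorted" attribute and the sort information     */
/* section to see what order SAS believes the data set is in.  */
proc contents data=output.TAQ;
run;

/* Re-sort even though the sort indicator is already set.      */
proc sort data=output.TAQ force;
  by market ric date miliseconds descending type order;
run;
```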

Fill in missing values with mode in SAS

I think the logic to replace missingness is quite clear but when I dump it to SAS I find it too complicated to start with.
Given no code was provided, I'll give you some rough directions to get you started, but put it on you to determine any specifics.
First, let's create a month column for the data and then calculate the mode for each key in each month. Additionally, let's put this new data in its own dataset.
data temp;
set original_data;
month = month(date);
run;
proc sort data=temp; * BY-group processing needs sorted input;
by key month;
run;
proc univariate data=temp noprint;
by key month; * one mode per key per month;
var value;
output out=mode_data mode=mode_value; * write the modes to their own dataset;
run;
However, this procedure calculates the mode in a very specific way that you may not want: it defaults to the lowest value in the case of a tie and produces no mode if nothing occurs at least twice. Documentation: http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_univariate_sect027.htm
If that doesn't work for you, I recommend using proc sql to get a count of each key, month, value combination and calculating your own mode from there.
proc sql;
create table mode_data as select distinct
key, month, value, count(*) as distinct_count
from temp
group by key, month, value;
quit;
From there you might want to create a table containing all months in the data.
proc sql;
create table all_months as select distinct month
from temp;
quit;
Don't forget to merge any missing months back into the mode data and use the LAG function or a RETAIN statement to search previous months for "old modes".
Then simply merge your fully populated mode data back to the temp dataset we created above and set value to the mode wherever it is missing (i.e. value = .).
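As a rough sketch of how those pieces could fit together, the steps below pick the most frequent value per key and month and merge it back onto temp. The names value_counts, mode_pick, filled and mode_value are made up for illustration, value is assumed to be numeric, and ties are broken by taking the lowest value. It does not look back to earlier months for an "old mode", so a key/month with no non-missing values stays missing.

```sas
/* Count how often each non-missing value occurs per key and month */
proc sql;
  create table value_counts as
  select key, month, value, count(*) as n
  from temp
  where not missing(value)
  group by key, month, value;
quit;

/* Highest count first; among ties the lowest value comes first */
proc sort data=value_counts;
  by key month descending n value;
run;

/* Keep the top row of each key/month group as that group's mode */
data mode_pick (keep=key month mode_value);
  set value_counts;
  by key month;
  if first.month;
  mode_value = value;
run;

proc sort data=temp;
  by key month;
run;

/* Attach the mode to every row and fill in the missing values */
data filled;
  merge temp (in=in_temp) mode_pick;
  by key month;
  if in_temp;
  if missing(value) then value = mode_value;
  drop mode_value;
run;
```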
Hope that helps get you started.

New SAS variable conditional on observations

(first time posting)
I have a data set where I need to create a new variable (in SAS), based on meeting a condition related to another variable. So, the data contains three variables from a survey: Site, IDnumb (person), and Date. There can be multiple responses from different people but at the same site (see person 1 and 3 from site A).
Site IDnumb Date
a 1 6/12
b 2 3/4
c 4 5/1
a 3 .
d 5 .
I want to create a new variable called Complete, but it can't contain duplicates. So, when I go to proc freq, I want site A to be counted once, using the 6/12 Date of the Completed Survey. So basically, if a site is represented twice and contains a Date in one, I want to only count that one and ignore the duplicate site without a date.
N %
Complete 3 75%
Last Month 1 25%
My question may be around the NODUP and NODUPKEY possibilities. If I do a Proc Sort (nodupkey) by Site and Date, would that eliminate obs "a 3 ."?
Any help would be greatly appreciated. Sorry for the jumbled "table", as this is my first post (hints on making that better are also welcomed).
You can do this a number of ways.
First off, you need a complete/not complete binary variable. If you're in the datastep anyway, might as well just do it all there.
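proc sort data=yourdata;
by site descending date; * DESCENDING goes before the variable it applies to;
run;
data yourdata_want;
set yourdata;
by site descending date;
if first.site then do; * missing dates sort last, so the first row has a date if any row does;
comp = ifn(date>0,1,0);
output;
end;
run;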
proc freq data=yourdata_want;
tables comp;
run;
If you used NODUPKEY, you'd first sort it by SITE DESCENDING DATE, then by SITE with NODUPKEY. That way the latest date comes first within each site and is the record that is kept. You also could format COMP to have the text labels you list rather than just 1/0.
You can also do it with a format on DATE, so you can skip the data step (still need the sort/sort nodupkey). Format all nonmissing values of DATE to "Complete" and missing value of date to "Last Month", then include the missing option in your proc freq.
Finally, you could do the table in SQL (though getting two rows like that is a bit harder, you have to UNION two queries together).
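The format-based route might look something like the sketch below. It assumes DATE is a numeric SAS date, and compfmt, sorted and one_per_site are made-up names. Because the missing dates are still missing internally even after formatting, the MISSING option is what puts the "Last Month" row into the table and the percentages.

```sas
/* Label missing dates "Last Month" and everything else "Complete" */
proc format;
  value compfmt
    .     = 'Last Month'
    other = 'Complete';
run;

/* Within each site, put the rows that have a date first */
proc sort data=yourdata out=sorted;
  by site descending date;
run;

/* Keep only the first row per site */
proc sort data=sorted out=one_per_site nodupkey;
  by site;
run;

/* PROC FREQ groups by the formatted value of date */
proc freq data=one_per_site;
  tables date / missing;
  format date compfmt.;
run;
```

With the five sample rows in the question, this should count sites a, b and c as Complete and site d as Last Month.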