how many observations will the data set contain? - sas

The following is the code:
data work.homework;
infile 'file-specification';
input name$ age height;
if age le 10;
run;
The raw data file is listed as the following:
A 35 71
B 10 43
C 9 12
I thought the correct answer should be 2, but according to the answer sheet it is 3. Could anyone explain the reason? Many thanks for your time and attention.

data work.homework;
infile datalines;
input name$ age height;
if age le 10;
datalines;
A 35 71
B 10 43
C 9 12
;;;;
run;
NOTE: The data set WORK.HOMEWORK has 2 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
Now, as for how the answer might be three, I would look very carefully at the problem. There are two potential pitfalls.
One: is it possible a fourth record is read in, one that is all blanks? If there is a blank line in the file, this could occur. A missing age is indeed less than or equal to ten (missing values sort below any number in SAS), so that record would qualify.
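A minimal sketch of that first pitfall, with the blank record forced to read as all missing via the MISSOVER option (under the default FLOWOVER behavior, a blank line may instead just push INPUT to the next line):
data work.homework_blank;
infile datalines missover; * MISSOVER leaves variables missing on a blank/short record instead of flowing to the next line;
input name$ age height;
if age le 10; * a missing age satisfies age le 10, so the blank record is kept;
datalines;
A 35 71
B 10 43

C 9 12
;;;;
run;
* this version should produce 3 observations: B, the all-missing record, and C;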
Two: if the line is actually
if age le 10 then ... ;
then it is not a subsetting IF, and the automatic output at the end of the data step is unaffected, so all three records are written.
As long as the code and data are exactly as above, though, two rows is the correct answer to "How many observations will the dataset contain" (not how many observations will be processed in the data step loop, of course).

Related

Modifying a list of production goals based on constraints SAS

I have a list of production goals. Here are the data.
data goals;
input product $ w1 w2 w3 w4 w5;
cards;
A 800 400 200 400 100
B 300 400 200 100 100
C 50 25 25 25 25
;
run;
Now, what I need to do is redistribute the goal if production is stopped for some reason. For example, if the plant is closed in week 1, then I am unable to produce 800 As, 300 Bs, and 50 Cs. Now I cannot make up all of that lost productivity in week 2, only 50% of it. So, since I needed 800 As in week1, I will make 400 of those As in week2 (along with what I already had scheduled in week2) and do the same for week3. So I want my new goals to be:
product w1 w2 w3 w4 w5
A 0 800 600 400 100
B 0 550 350 100 100
C 0 50 50 25 25
So I'm redistributing the goal over the next two weeks, making up 50% of week1 goals each week.
I have experimented with different ways to do this, but have been unable to make any progress with the syntax.
This is SAS by the way. I am open to different ways to accomplish this.
The more interesting question revolves around how you communicate the W1 values and the change to the program. The problem of converting the input dataset to the desired output is trivial, and how you solve it is connected to the larger question.
One good way to do this is with a macro with parameters appropriate to your needs. One example:
%macro fix_goals(week=,newweek=,percent=50,max=constant('BIG')); *max set arbitrarily high if not used;
adjusted = min(&newweek.+(&week.*&percent./100),&max.); *assumes MAX is the maximum it could accomplish in a week, including both the old and new targets; *PERCENT is given as a percentage, hence the division by 100;
&week.=&week.-(adjusted-&newweek.); *how much was added;
&newweek.=adjusted;
%mend fix_goals;
Then:
data want;
set have;
%fix_goals(week=w1,newweek=w2,percent=50);
%fix_goals(week=w1,newweek=w3,percent=100); *the rest;
run;
You could do this a number of similar ways depending on the criteria and how you want to communicate the change to the program. If you always give it to the next two weeks at exactly 50/50, the above is slightly more complicated than needed (but accomplishes it as written).
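As a check, tracing product A through the two calls: the first call computes adjusted = min(400 + 800*50/100, BIG) = 800, so w2 becomes 800 and w1 drops to 400; the second call computes adjusted = min(200 + 400*100/100, BIG) = 600, so w3 becomes 600 and w1 drops to 0, matching the desired row A 0 800 600 400 100.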

SAS macros to average between a range of dates with missing dates in the data

I'm completely new to SAS and its macros. I have this dataset, named mydata:
Obs SYMBOL DATE x y
1 A 20120128 5 6
2 B 20120128 10 7
3 C 20120128 20 9
4 D 20120128 6 10
5 E 20120128 9 20
My problem is to find this function:
New_i = ( Σ_{j ∈ [i-10, i-2]} (x+y)_j ) / N
where:
i = any user-defined date,
[i-10, i-2] = the window from 10 days before i to 2 days before i,
N = the total number of days with data available for (x+y) in that window.
There can be missing dates in the available data.
Can anyone help me with a possible SAS macro for this problem?
Thanks in Advance!!
I'm assuming your date data are stored as SAS dates and accept numeric calculations. I'm also assuming that you want the average of X and Y over a window of dates around d, where d is user defined. Last, I'm assuming that if you have two unique ids on the same day, you keep the first one at random. Obviously those assumptions might need to be tweaked a bit but, from what I believe you are asking (I confess I'm only mostly sure I understand your question), hopefully this is close enough to what you need that you can tweak the rest pretty easily.
Okay...
PROC SORT DATA = in;
BY date uniqueid;
RUN;
%MACRO summarize( userdate );
DATA out;
SET in (where = (date >= &userdate - 10 and date <= &userdate - 2));
BY date uniqueid;
xy = sum(x, y);
IF first.uniqueid;
RUN;
PROC SUMMARY DATA = out;
OUTPUT OUT = averages MEAN(xy) = ; * a fixed output name; suffixing it with &userdate would produce an invalid dataset name for a date literal like '20120128'd;
RUN;
%MEND summarize;
%summarize('20120128'd);
What's going on here? Well, I sort the data first by date and uniqueid. I could use NODUPKEY, but I imagine you might want to control how duplicate uniqueids on a given date are handled. The data step is throwing out the dups by keeping the first one it comes across, but you could modify the deduping logic (which comes from the BY statement in the DATA step and the subsetting IF first.uniqueid statement in the same).
You want a set of dates around a particular user-defined date, d. So get d and filter the dataset with WHERE. You could also do this in your PROC SORT step, and there might be reasons for doing so if your raw data will be updated frequently. If you don't need to run the sort every time a user defines a date range, keep it outside the macro and only run it when needed. Sorts can be slow.
In the data step, I'm getting sum(x,y) to account for the fact that either x or y might be missing, or both, or neither. x + y would return missing in those cases; for example, sum(3, .) returns 3, while 3 + . returns missing. I assume that's not what you want, but do keep in mind that we'll be averaging sum(x,y) over N, where N counts rows in which either x or y is not missing. If you wanted to ignore those rows entirely, use x + y and add if xy ne . in your DATA step.
The last part, the PROC SUMMARY that computes the mean, should be pretty self-explanatory.
Hope this helps.

insert columns that are missing in a range

I've created panel data by transposing columns, based on weeks, and some of the weeks never had observations, so those weeks never showed up as columns. Is there a reasonable way to insert the weeks that had no observations?
I need week0-week61, but currently I am missing week0, week4, week8... It seems silly to do this by hand in excel.
The simplest way is like this:
data ttt;
input id week0 week4;
datalines;
1 10 20
2 11 21
;
data ttt1;
set ttt;
array a{*} week0-week61; * listing the full range in the ARRAY statement creates any weekN variables that don't already exist (all missing);
run;
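To verify that the full range of columns now exists, a quick check against the TTT1 dataset created above:
proc contents data=ttt1 short;
run;
* the SHORT option prints just the variable names: id and week0 through week61;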

proc transpose using SPDE takes ~60x longer to run than v9 library

I've been moving all of my datasets into SPDE libraries because I've experienced wonderful performance gains in everything. Everything, that is, until running proc transpose. This takes ~60x longer to execute on the SPDE dataset than on the same dataset stored in a normal v9 library. The data set is sorted by item_id and is being read from and written to the same library.
Does anyone have an idea why this is the case? Am I missing something important about SPDE and Proc Transpose not playing well together?
SPDE Library
MPRINT(XMLIMPORT_VANTAGE): proc transpose data = smplus.links_response_mechanism out = smplus.response_mechanism (drop = _NAME_)
prefix = rm_;
MPRINT(XMLIMPORT_VANTAGE): by item_id;
MPRINT(XMLIMPORT_VANTAGE): id lookup_code;
MPRINT(XMLIMPORT_VANTAGE): var x;
MPRINT(XMLIMPORT_VANTAGE): run;
NOTE: There were 5866747 observations read from the data set SMPLUS.LINKS_RESPONSE_MECHANISM.
NOTE: The data set SMPLUS.RESPONSE_MECHANISM has 3209353 observations and 14 variables.
NOTE: Compressing data set SMPLUS.RESPONSE_MECHANISM decreased size by 37.98 percent.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
real time 28:27.63
cpu time 28:34.64
V9 Library
MPRINT(XMLIMPORT_VANTAGE): proc transpose data = mplus.links_response_mechanism out = mplus.response_mechanism (drop = _NAME_)
prefix = rm_;
MPRINT(XMLIMPORT_VANTAGE): by item_id;
MPRINT(XMLIMPORT_VANTAGE): id lookup_code;
MPRINT(XMLIMPORT_VANTAGE): var x;
MPRINT(XMLIMPORT_VANTAGE): run;
NOTE: There were 5866747 observations read from the data set MPLUS.LINKS_RESPONSE_MECHANISM.
NOTE: The data set MPLUS.RESPONSE_MECHANISM has 3209353 observations and 14 variables.
NOTE: Compressing data set MPLUS.RESPONSE_MECHANISM decreased size by 27.60 percent.
Compressed is 32271 pages; un-compressed would require 44572 pages.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
real time 28.76 seconds
cpu time 28.79 seconds
Looks to me like there is some issue with PROC TRANSPOSE and SPDE. Here's a simple SSCCE, which shows significant differences; not as significant as yours, but to some extent that may be because this is on a desktop without particularly substantial performance tuning in the first place. Sounds like a call to SAS tech support is in order.
libname spdelib spde 'c:\temp\SPDE Main'
datapath=('c:\temp\SPDE Data' 'd:\temp\SPDE Data')
indexpath=('d:\temp\SPDE Index')
partsize=512;
libname mainlib 'c:\temp\';
data mainlib.bigdata;
do ID = 1 to 1500000;
do _varn=1 to 10;
varname=cats("Var_",_varn);
vardata=ranuni(7);
output;
end;
end;
run;
data spdelib.bigdata;
do ID = 1 to 1500000;
do _varn=1 to 10;
varname=cats("Var_",_varn);
vardata=ranuni(7);
output;
end;
end;
run;
*These data steps take roughly the same amount of time, around 30 seconds each;
proc transpose data=spdelib.bigdata out=spdelib.transdata;
by id;
id varname;
var vardata;
run;
*Run a few times, this takes around 3 to 4 minutes, with 1.5 minutes CPU time;
proc transpose data=mainlib.bigdata out=mainlib.transdata;
by id;
id varname;
var vardata;
run;
*Run a few times, this takes around 30 to 45 seconds, with 20 seconds CPU time;
There have been known issues with SPDE and proc compare in the past (unrelated to multi-threading), at least up to version 4.1. What version are you using? (It can be seen in the "!install/logs" folder.)
This is definitely something to raise with SAS support. To "speed" things along, I would recommend submitting a log with the following options:
proc setinit noalias; run;
proc options; run;
%put _ALL_;
options fullstimer msglevel=i;
Also:
options spdedebug='DA_TRACEIO_OCR CJNL=Trace.txt';
(The CJNL option simply routes the trace message output to a text file.)
In the meantime, you may be able to take advantage of some of the following SPDE-specific options:
http://support.sas.com/kb/11/349.html
This issue usually occurs when PROC TRANSPOSE is used with BY-group processing on compressed datasets. SAS is forced to read the same block of rows repeatedly, decompressing it every time, until all the records in each BY group have been processed.
Set the Compress=No option and it will work. See the logs below: one run has Compress=Yes and the other Compress=No; the former took almost 57 minutes versus about 27 seconds.
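A minimal sketch of that workaround, assuming it is the compressed input that triggers the repeated decompression (dataset and variable names taken from the logs below):
options compress=no;
* rewrite the input uncompressed so the BY-group reads no longer re-decompress the same blocks;
data spdelib.balancewalkoutput;
set spdelib.balancewalkoutput;
run;
proc transpose data=spdelib.balancewalkoutput out=spdelib.spdelib_to_spdelib;
var metric;
by balancewalk facility_id isretained isexisting isicaapnpl monthofmaturity vintage;
run;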
OPTIONS COMPRESS=YES;
50 **transpose from spde to spde;
51 proc transpose data=spdelib.balancewalkoutput out=spdelib.spdelib_to_spdelib;
52 var metric ;
53 by balancewalk facility_id isretained isexisting isicaapnpl monthofmaturity vintage;
54 run;
NOTE: There were 10000000 observations read from the data set SPDELIB.BALANCEWALKOUTPUT.
NOTE: The data set SPDELIB.SPDELIB_TO_SPDELIB has 160981 observations and 74 variables.
NOTE: Compressing data set SPDELIB.SPDELIB_TO_SPDELIB decreased size by 69.96 percent.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
real time 56:58.54
user cpu time 52:03.65
system cpu time 4:03.00
memory 19028.75k
OS Memory 34208.00k
Timestamp 09/16/2019 06:19:55 PM
Step Count 9 Switch Count 22476
Page Faults 0
Page Reclaims 4056
Page Swaps 0
Voluntary Context Switches 142316
Involuntary Context Switches 5726
Block Input Operations 88
Block Output Operations 569200
OPTIONS COMPRESS=NO;
50 **transpose from spde to spde;
51 proc transpose data=spdelib.balancewalkoutput out=spdelib.spdelib_to_spdelib;
52 var metric ;
53 by balancewalk facility_id isretained isexisting isicaapnpl monthofmaturity vintage;
54 run;
NOTE: There were 10000000 observations read from the data set SPDELIB.BALANCEWALKOUTPUT.
NOTE: The data set SPDELIB.SPDELIB_TO_SPDELIB has 160981 observations and 74 variables.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
real time 26.73 seconds
user cpu time 14.52 seconds
system cpu time 11.99 seconds
memory 13016.71k
OS Memory 27556.00k
Timestamp 09/16/2019 04:13:06 PM
Step Count 9 Switch Count 24827
Page Faults 0
Page Reclaims 2662
Page Swaps 0
Voluntary Context Switches 162653
Involuntary Context Switches 1678
Block Input Operations 96
Block Output Operations 1510040

SAS: backward looking data step to compute the average

Sorry for the "not really informative" title of this post.
I have the following data set in SAS:
time Add time_delete
5 3.00 5
5 3.15 11
5 3.11 11
8 4.21 8
8 3.42 8
8 4.20 11
11 3.12 .
where time corresponds to a new added (Add) price in an auction every 3 minutes. This price can get deleted within the same time interval or later, as shown in time_delete. My objective is to compute, standing at every time, the average price from the Add field. For instance, my average price at time=5 is (3.15+3.11)/2, since the 3.00 gets deleted within the interval. Then the average price standing at time=8 is (4.20+3.15+3.11)/3. As you can see, standing at the current time I have to look back and see which prices are still valid. Also, for every time, I would like a field with the highest price available that has not been deleted.
Any help?
You have a variant of a rolling sum here. There's no single straightforward solution (especially as you undoubtedly have a few complications not mentioned), but here are a few pointers.
First, you may want to change the shape of your data. This is actually a relatively easy problem to solve if you have one row for each time point at which a price is standing, rather than a single row per add.
data have;
input time Add time_delete;
datalines;
5 3.00 5
5 3.15 11
5 3.11 11
8 4.21 8
8 3.42 8
8 4.20 11
11 3.12 .
;;;;
run;
data want;
set have;
if time=time_delete then delete; * deleted within the same interval, so it never counts;
else do time=time to time_delete-1; * one row per time point the price is standing (no rows if time_delete is missing);
output;
end;
keep time add;
run;
proc means data=want mean max n;
class time;
var add;
run;
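As a check against the example in the question, this reports a mean of (3.15+3.11)/2 = 3.13 and a max of 3.15 at time=5, and a mean of (3.15+3.11+4.20)/3 ≈ 3.49 and a max of 4.20 at time=8.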
You could output the proc means results to a dataset to get your maximum value plus the average value, and then either put that back on the main dataset or whatever you need.
The main downside to this is that it produces a much larger dataset, so if you're looking at hundreds of thousands of data points, this is likely not your best option.
You can also perform this in SQL without the extra rows, although this is where those "other complications" would potentially throw a wrench in things.
proc sql;
select H.time, mean(V.add) as avg_add, max(V.add) as max_add
from (select distinct time from have) H
left join have V
on V.time le H.time
and V.time_delete gt H.time
group by H.time;
quit;
Fairly straightforward and quick query, except that if you have a lot of time values it might take some time to execute the join.
Other options:
Read the data into an array, with a second array tracking the delete points. This can get a bit complex, as you probably need to keep your array sorted by delete point; rather than just adding a new record at the end, you have to move a bunch of records down. SAS isn't quite as friendly to this sort of operation as a C-type language would be.
Use a hash table solution. Somewhat less messy than an array, particularly as you can sort a hash table more easily than two separate arrays; a rough sketch follows after this list.
Use IML and vectors. Similar to the array solution but with more powerful manipulation techniques available.
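For the hash table option, here is a rough, untested sketch reusing the HAVE dataset from above (avg_add, max_add, and the renames are my own names; a missing time_delete is treated as already deleted, consistent with the solutions above):
data averages(keep=t avg_add max_add);
if _n_ = 1 then do;
call missing(add_time, add, time_delete); * host variables for the hash lookups;
declare hash h(dataset:'have(rename=(time=add_time))', multidata:'yes');
h.defineKey('add_time');
h.defineData('add_time', 'add', 'time_delete');
h.defineDone();
declare hiter hi('h');
end;
set have(keep=time rename=(time=t)); * one pass over the standing times; assumes HAVE is sorted by time;
by t;
if last.t then do; * at each distinct time, rescan every add still alive;
n = 0; total = 0; max_add = .; avg_add = .;
rc = hi.first();
do while (rc = 0);
if add_time le t and time_delete gt t then do;
n = n + 1;
total = total + add;
max_add = max(max_add, add);
end;
rc = hi.next();
end;
if n then avg_add = total / n;
output;
end;
run;
The hash holds every (add, time_delete) pair once and is rescanned for each distinct standing time, so this avoids the row explosion of the first approach at the cost of an O(times × adds) scan.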