I am doing a practice assignment as part of my Base SAS certification prep, to see when a data step ends.
Below is the code:
data first;
input x;
datalines;
1
2
9
;
run;
data second;
input x;
datalines;
3
4
5
6
;
run;
data third;
set first;
output;
set second;
output;
run;
Output is:
1
3
2
4
9
5
But when the first dataset has only the two values 1 and 2, the output is 1 2 3 4
and not 1 3 2 4. Why is that?
The data step is processed as an implicit loop. So when you consider your data step...
data third;
set first;
output;
set second;
output;
run;
...your two set statements both act as a drip feed, each providing one observation from its own dataset on each iteration through the data step loop.
If you wanted observations in third to be in the order of:
1, 2, 9, 3, 4, 5, 6
Then you need to change the data step to use just one set statement that drip-feeds in both datasets, one after the other:
data third;
set first second ;
output;
run;
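The two set statements each read from their own dataset, one observation per iteration, and both write to the same variable x in the PDV.
So in the first iteration (_n_=1), x is missing at the top and becomes 1 after set first. In the second iteration (_n_=2), x still holds 3 at the top (retained from second) and becomes 2 after set first, and so on.
Each value is written out because of the two explicit output statements.
This becomes clearer if you add put statements: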
data third;
put _all_;
set first;
output;
put _all_;
set second;
output;
run;
The same happens when you read the second dataset followed by the first.
Because that is what you told it to do?
SAS executes the data step until it reads past the end of an input file (or it detects an infinite loop). In your case it stops when it tries to read a fourth observation from the first SET statement. Hence it never gets to the second SET statement on that fourth iteration.
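To make the stopping rule concrete, here is a minimal sketch (not part of the original answers) that guards each SET with its own END= flag, so the step never reads past the end of either file and no observation from the longer dataset is lost. The names third_all, end1 and end2 are made up for illustration.

```sas
/* Recreate the two inputs from the question */
data first;
  input x @@;
  datalines;
1 2 9
;
run;

data second;
  input x @@;
  datalines;
3 4 5 6
;
run;

data third_all;
  /* Only execute a SET while its dataset still has observations left */
  if not end1 then do;
    set first end=end1;
    output;
  end;
  if not end2 then do;
    set second end=end2;
    output;
  end;
  /* Both inputs exhausted: end the step explicitly */
  if end1 and end2 then stop;
run;

/* third_all should contain x = 1 3 2 4 9 5 6 */
proc print data=third_all;
run;
```

Because the fourth iteration skips the exhausted SET instead of executing it, the step keeps going until the explicit STOP.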
Let's say I have stores all around the world and I want to know the sum of the 5 largest losses for each store. What is the code for that?
Here is my attempt:
proc sort data= store out=sorted_store;
by store descending amount;
run;
and
data calc1;
do _n_=1 by 1 until(last.store);
set sorted_store;
by store;
if _n_ <= 5 then "Sum_5Largest_Losses"n=sum(amount);
end;
run;
But this just returns the 5th amount and not the sum of amounts 1 to 5, and I really don't know how to select the top 5 of EACH store. I think a kind of group by would be a perfect fit. But first things first: how do I select i = 1...5, and not just i = 5?
There is also a way of doing it with proc sql:
data have;
input store$ amount;
datalines;
A 100
A 200
A 300
A 400
A 500
A 600
A 700
B 1000
B 1100
C 1200
C 1300
C 1400
D 600
D 700
E 1000
E 1100
F 1200
;
run;
proc sql outobs=4; /* limit to the first 4 results */
select store, sum(amount) as TOTAL_AMT
from have
group by 1
order by 2 desc; /* order them to the TOP selection*/
quit;
The data step sum(,) function adds up its arguments. If you only give it one argument then there is nothing to actually sum so it just returns the input value.
data calc1;
do _n_=1 by 1 until(last.store);
set sorted_store;
by store;
if _n_ <= 5 then Sum_5Largest_Losses=sum(Sum_5Largest_Losses,amount);
end;
run;
I would highly recommend learning the basic methods before getting into DOW loops.
Add a counter so you can find the first 5 of each store
As the data step loops the sum accumulates
Output sum for counter=5
proc sort data= store out=sorted_store;
by store descending amount;
run;
data calc1;
set sorted_store;
by store;
*keep the running total across observations (the sum statement already retains counter);
retain total_sum;
*if first store then set counter to 1 and total sum to 0;
if first.store then do ;
counter=1;
total_sum=0;
end;
*otherwise increment the counter;
else counter+1;
*accumulate the sum if counter <= 5;
if counter <=5 then total_sum = sum(total_sum, amount);
*output on the 5th record, or on the last record of a store with fewer than 5;
if counter=5 or (last.store and counter<5) then output;
run;
data output;
set input;
by id;
if first.id = 1 then do;
call symputx('i', 1); ------ also tried %let i = 1;
a_&i = a;
end;
else do;
call symputx('i', &i + 1); ------ also tried %let i = %sysevalf (&i + 1);
a_&i = a;
end;
run;
Example Data:
ID A
1 2
1 3
2 2
2 4
Want output:
ID A A_1 A_2
1 2 2 .
1 3 . 3
2 2 2 .
2 4 . 4
I know that you can do this using transpose, but I'm just curious why this way does not work. The macro variable does not retain its value for the next observation.
Thanks!
Edit: Since %let is resolved at compile time and call symput runs at execution time, %let will run only once and call symput will always be one step behind.
why does this way not work
The sequence of processing in SAS is:
1. resolve macro expressions
2. process steps
   a. automatically compile the proc or data step (compile time)
   b. run the compiled step (run time)
A running data step cannot modify its PDV layout (which is fixed during compilation) while it is running.
call symput() is performed at run time, so any change it makes to &i cannot be applied to source code such as a_&i = a; because that text was already resolved and compiled before the step started running.
Array based transposition
You will need to determine the maximum number of items in the groups prior to coding the data step. Use array addressing to place the a value in the desired array slot:
* Hand coded transpose requires a scan over the data first;
* determine largest group size;
data _null_;
set have end=lastrecord_flag;
by id;
if first.id
then seq=1;
else seq+1;
retain maxseq 0;
if last.id then maxseq = max(seq,maxseq);
if lastrecord_flag then call symputx('maxseq', maxseq);
run;
* Use maxseq macro variable computed during scan to define array size;
data want (drop=seq);
set have;
by id;
array a_[&maxseq]; %* <--- set array size to max group size;
if first.id
then seq=1;
else seq+1;
a_[seq] = a; * 'triangular' transpose;
run;
Note: Your 'want' is a triangular reshaping of the data. To achieve a row per id reshaping the a_ elements would have to be cleared (call missing()) at first.id and output at last.id.
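The row-per-id variant described in the note could look roughly like the sketch below. It is an illustration only: the dataset name want_wide is made up, and maxseq is hard-coded to 2 here as an assumption (normally it would come from the scan step above).

```sas
data have;
  input id a;
  datalines;
1 2
1 3
2 2
2 4
;
run;

%let maxseq = 2;  /* assumed for this sketch; normally computed by the scan step */

data want_wide (drop=seq a);
  set have;
  by id;
  array a_[&maxseq];
  retain a_1-a_&maxseq;           /* keep slot values across the rows of a group */
  if first.id then do;
    seq = 0;
    call missing(of a_[*]);       /* clear the slots at the start of each group */
  end;
  seq + 1;
  a_[seq] = a;                    /* place the value in its slot */
  if last.id then output;         /* one row per id */
run;

/* want_wide should contain: id=1 a_1=2 a_2=3 and id=2 a_1=2 a_2=4 */
proc print data=want_wide;
run;
```

The retain is what distinguishes this from the triangular version: without it the array elements are reset to missing on every observation.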
I am trying to test how accumulator variables work and I created the following program.
data numbers;
input n;
cards;
10
20
40
50
;
data newnums;
infile numbers;
input tens;
count+tens;
run;
proc print data=newnums;
run;
I purposely put blank rows. However besides that I thought that the program would execute.
I want to figure out the last value of the variable count, but I cannot... may I have some help please?
There are multiple things in your code that you need to change:
A missing numeric value is represented by the . character.
A data set is referenced with a set statement, not infile.
The accumulator variable you are talking about is created by the sum statement. It retains its value across observations and ignores missing values; there is more on the sum statement in the link below.
Difference between SUM statement and sum variable in SAS?
data numbers;
input n;
cards;
10
20
40
.
50
;
data newnums;
set numbers;
count+n;
run;
proc print data=newnums;
run;
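As a small illustration of how the sum statement treats the missing value, the sketch below compares it with a retained variable updated by plain assignment. The dataset name compare and the variable plain_total are made up for this example.

```sas
data compare;
  input n;
  /* Sum statement: count is retained, starts at 0, and ignores missing n */
  count + n;
  /* Plain assignment on a retained variable: a missing n makes the result */
  /* missing, and it stays missing for every later observation             */
  retain plain_total 0;
  plain_total = plain_total + n;
  datalines;
10
20
40
.
50
;
run;

/* count should be 10 30 70 70 120; plain_total should be 10 30 70 . . */
proc print data=compare;
run;
```

The last value of count is therefore the total of the non-missing inputs.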
Edit1: if you had something like the below (with blank data lines), you would get missing values for those lines by using truncover
data numbers;
infile datalines truncover;
input n;
cards;
10
20
40
50
;
I had posted this earlier, and got help on it. My interest was piqued, and I ventured into this a little further to see what I could do with it. I am fascinated with simulations, but am just an average SAS programmer. I wonder if somebody might help here.
data out;
call streaminit(7); *seed better random number engine;
do pointvar = 1 by 1 until (outs=27); *iterate starting at 1 and stop when 27 outs;
randvar = rand('Uniform'); *better random number engine;
if pointvar > 9 then pointvar=1; *reset to 1 if over 9;
set in point=pointvar; *pull the row we need;
if randvar < cutoff then do;
outs+1;
outs_inning+1;
end;
output;
if outs_inning=3 then outs_inning=0;
end;
stop;
run;
The data set in has one variable, cutoff, with one observation for each of the 9 hitters:
.73
.75
.72
.78
.81
.69
.74
.72
.75
With the help of Joe and others, the above did what I wanted, which was to simulate primarily the counting of outs involved in ONE baseball game.
I have been playing around with this (to no avail) and trying to get it to repeat a game, so to speak, where it would start at the top of the lineup after 27 outs. So for what I have right now, assume the 27th out is achieved with the 5th batter. I would like to put this whole code inside of a loop where it starts the process again at the beginning of the data set (1st observation, i.e, first batter).
So, assume I want to complete 3 iterations here. 3 games of 27 outs. Is there a way to do this? I tried doing the following.
%macro replicate(new,out,n) / des='&new1 is &out repeated &n times';
Data &new;
%do i=1 to &n;
Set &out;
Output;
%end;
%mend;
%replicate(new,out,3);
Proc print;
I was hoping I could do this with a do loop, but the problem is that it reads each observation 3 times. With do i=1 to 3 followed by set out, it takes the first observation from data set 'out' three times, then the second observation three times, and so on.
i.e.
Outs randvar cutoff outs_inning
0 0.84 0.73 0
0 0.84 0.73 0
0 0.84 0.73 0
1 0.61 0.75 0
1 0.61 0.75 0
1 0.61 0.75 0
Can anybody help? I appreciate that this is a little outside the realm of what is typically discussed here, but a few of my students are also interested in simulations, and a baseball example has certainly interested them. It has become a fun problem. thanks for getting me this far.
You don't need a macro. You should be able to add an outer DO loop which is do game=1 to 3;
Below I changed the variable POINTVAR to be BATTER, and added a PUT statement to write messages to the log.
data in;
input cutoff @@;
cards;
.73 .75 .72 .78 .81 .69 .74 .72 .75
;
data play;
call streaminit(7);
do game=1 to 3;
outs=0;
outs_inning=0;
do batter = 1 by 1 until (outs=27);
randvar = rand('Uniform');
if batter > 9 then batter=1;
set in point=batter;
if randvar < cutoff then do;
outs+1;
outs_inning+1;
end;
output;
put (game batter cutoff randvar outs_inning outs)(=);
if outs_inning=3 then outs_inning=0;
end;
end;
stop;
run;
For the dataset,
data testing;
input key $ output $;
datalines;
1 A
1 B
1 C
2 A
2 B
2 C
3 A
3 B
3 C
;
run;
Desired Output,
1 A
2 B
3 C
The logic is if either key or output appear within the column before then delete the observation.
1 A (as 1 and A never appear then keep the obs)
1 B (as 1 appear already then delete)
1 C (as 1 appear then delete)
2 A (as A appear then delete)
2 B (as 2 and B never appear then keep the obs)
2 C (as 2 appear then delete)
3 A (as A appear then delete)
3 B (as B appear then delete)
3 C (as 3 and C never appear then keep the obs)
My effort:
The basic idea here is you keep a dictionary of what's already been used, and search that. Here's a simple array based method; a hash table might be better, certainly less memory intensive, anyway, and likely faster - I would leave that to your imagination.
data want;
set testing;
array _keys[30000] $ _temporary_; *temporary arrays to store 'used' values (key is character in testing);
array _outputs[30000] $ _temporary_;
retain _keysCounter 1 _outputsCounter 1; *counters to help us store the values;
if whichc(key, of _keys[*]) = 0 and whichc(output,of _outputs[*]) = 0 /* whichc searches a list (or array) of character values for a value. */
then do;
_keys[_keysCounter] = key; *store the key in the next spot in the dictionary;
_keysCounter+1; *increment its counter;
_outputs[_outputsCounter] = output; *store the output in the next spot in the dictionary;
_outputsCounter+1; *increment its counter;
output; *output the actual datarow;
end;
keep key output;
run;
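For the hash table alternative mentioned above, a minimal sketch might look like the following. It is not the original answerer's code; the names want_hash, seen_keys and seen_outputs are made up, and it assumes the testing dataset from the question.

```sas
data want_hash;
  set testing;
  if _n_ = 1 then do;
    /* One hash per column acts as the dictionary of values already used */
    declare hash seen_keys();
    seen_keys.defineKey('key');
    seen_keys.defineDone();
    declare hash seen_outputs();
    seen_outputs.defineKey('output');
    seen_outputs.defineDone();
  end;
  /* check() returns 0 when the current value is already in the hash */
  if seen_keys.check() ne 0 and seen_outputs.check() ne 0 then do;
    seen_keys.add();      /* remember this key */
    seen_outputs.add();   /* remember this output value */
    output;               /* keep the row */
  end;
run;

/* For the sample data this should keep 1 A, 2 B and 3 C */
proc print data=want_hash;
run;
```

Compared with the array version, the hashes grow as needed, so there is no fixed 30000-element limit to choose up front.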