I am new to SAS. How do I fix a "variable not found" error when simulating a binomial random variable?
DATA additional (KEEP=X);
DO REPEAT = 1 TO 1000;
CALL STREAMINIT(1234);
DO I=1 TO 1000;
X=RAND("BINOMIAL",0.6,10); /*NUMBER OF WINS IN TEN TOSSES*/
END;
IF X GE 5 THEN WINNER + 1;
ELSE LOSER + 1;
OUTPUT;
END;
RUN;
PROC PRINT DATA=additional;
VAR WINNER LOSER;
RUN;
I am creating a binomial random variable: if X is greater than 5 it counts one for the winner, and if it is less than 5 it counts one for the loser. The question asks how many times I am a winner and how many times I am a loser. I keep getting a "variable not found" error. Am I doing something wrong in generating the binomial distribution?
Edit: this is the problem I was given.
You are given $10. Let the variable money = 10.
You play a game 10 times. The probability that you win a game is 0.4,
and the probability that you lose a game is 0.6.
If you win a game, you win $1. If you lose a game, you lose $1. So if
you win the first game, money becomes 11. But if you lose the first
game, money becomes 9.
After you have played the game 10 times, money is the amount that you
go home with. If you end up with at least $10, call yourself a winner.
Otherwise, call yourself a loser. Define the variable result as winner
or loser.
(a) Write a data step to generate random numbers and simulate your
result 1000 times. So that I can easily check your outputs, use
1234 as your seed for the random number generator. (You do not
need to show me the 1000 results.)
(b) Write a proc step to show how many times you are a winner, and
how many times you are a loser.
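I don't fully understand what you want to do with the simulation. In your code, the inner loop ends before the OUTPUT statement, so you write only 1000 records, each holding the last X drawn in the inner loop. CALL STREAMINIT should be the first statement, outside the loops. And because the data set keeps only X (KEEP=X), the WINNER and LOSER variables are not written out, which is why PROC PRINT cannot find them.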
I guess maybe you could try this.
DATA additional;
CALL STREAMINIT(1234);
DO REPEAT= 1 TO 1000; *numbers of sample;
DO I=1 TO 100; *size of sample;
X=RAND("BINOMIAL",0.6,10); /*NUMBER OF WINS IN TEN TOSSES*/
IF X GE 5 THEN results='WINNER';
ELSE results='LOSER';
OUTPUT;
END;
END;
RUN;
proc freq data=additional;
by repeat;
table results;
run;
Edit: It seems that you want the final result. You can get it from the code above by making results a numeric variable. Here is the modified code, where a win adds 1 and a loss subtracts 1.
DATA additional;
CALL STREAMINIT(1234);
DO REPEAT= 1 TO 100; *numbers of sample;
DO I=1 TO 10; *size of sample;
X=RAND("BINOMIAL",0.6,10); /*NUMBER OF WINS IN TEN TOSSES*/
IF X GE 5 THEN results+1;
ELSE results=results-1;
OUTPUT;
END;
results=0;
END;
RUN;
proc freq data=additional;
by repeat;
table results;
run;
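For comparison, here is a minimal sketch that follows the problem statement more literally: each of the 10 games is a single Bernoulli trial with win probability 0.4, and money moves up or down by $1. The names money and result come from the problem text; rep, game and the data set name games are assumptions made for illustration.

```sas
data games (keep=rep money result);
  call streaminit(1234);          /* seed requested in the problem */
  length result $6;
  do rep = 1 to 1000;             /* 1000 simulated evenings */
    money = 10;                   /* start each evening with $10 */
    do game = 1 to 10;            /* ten games per evening */
      if rand("BERNOULLI", 0.4) = 1 then money = money + 1;  /* win */
      else money = money - 1;                                /* lose */
    end;
    if money >= 10 then result = "winner";
    else result = "loser";
    output;                       /* one record per evening */
  end;
run;

/* Part (b): count winners and losers */
proc freq data=games;
  tables result;
run;
```

Because result is kept in the output data set, the PROC step can find it, which avoids the "variable not found" error from the original code.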
Related
For a project, I need to create a program where 9 dice are rolled. All the dice that rolled the lowest (best) score of that group are recorded in the scorecard and removed for the next round, and this continues until all 9 dice are gone. This is then done twice. The scores of the front nine and the back nine are then summed.
For example:
Round 1: 9 dice rolled, the best score of the round was -1 and 2 dice landed on it. Hole1 and Hole2 are marked with a score of -1.
Round 2: 7 dice (9-2 removed) rolled, the best score of the round was 0 with 1 die. Hole3 is marked with the score 0.
This continues until 0 dice remain. The scores are added up and the game is played again for another 9 holes.
The output based on the example given should look something like this:
hole1  hole2  hole3  ...  hole18  frontnine     backnine        score
-1     -1     0      ...  best    sum(hole1-9)  sum(hole10-18)  total
Below is the program for a single die with the lowest value being recorded and removed. How do I modify it to get what I want (listed above).
data work.project;
array scorecard(18) hole1-hole18;
array dice(9) dice1-dice9;
seed= 12345;
do frontback=0,1;
do rolls=1 to 9;
do die=1 to 9;
dup=0;
dice(die)=rantbl(seed, 1/12,3/12,2/12,2/12,3/12,1/12)-2;
if die=rolls then best=dice(die);
if dice(die) < best then best=dice(die);
end;
end;
end;
scorecard((rolls+dup)+frontback*9)=best;
end;
frontnine=sum( of hole1-hole9);
backnine=sum(of hole10-hole18);
score= frontnine + backnine;
output;
keep hole1-hole18 frontnine backnine score;
run;
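One possible way to implement the rounds described above is sketched below. It keeps the question's array names and RANTBL call, and introduces two hypothetical helper variables, filled (holes already scored in the current nine) and remaining (dice still in play). It is an untested sketch of the described rules, not a verified solution.

```sas
data work.project (keep=hole1-hole18 frontnine backnine score);
  array scorecard(18) hole1-hole18;
  array dice(9) dice1-dice9;
  seed = 12345;
  do frontback = 0, 1;                 /* 0 = front nine, 1 = back nine */
    filled = 0;                        /* holes scored so far in this nine */
    do while (filled < 9);
      remaining = 9 - filled;          /* dice still in play this round */
      best = .;
      do die = 1 to remaining;         /* roll the remaining dice */
        dice(die) = rantbl(seed, 1/12, 3/12, 2/12, 2/12, 3/12, 1/12) - 2;
        if missing(best) or dice(die) < best then best = dice(die);
      end;
      do die = 1 to remaining;         /* every die showing the best score fills a hole */
        if dice(die) = best then do;
          filled = filled + 1;
          scorecard(filled + frontback*9) = best;
        end;
      end;
    end;
  end;
  frontnine = sum(of hole1-hole9);
  backnine = sum(of hole10-hole18);
  score = frontnine + backnine;
  output;
run;
```

At least one die always matches the round's best score, so filled grows every round and the DO WHILE loop ends after at most nine rounds per nine holes.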
Let's say I have stores all around the world and I want to know the sum of the five largest losses for each store. What is the code for that?
Here is my attempt:
proc sort data= store out=sorted_store;
by store descending amount;
run;
and
data calc1;
do _n_=1 by 1 until(last.store);
set sorted_store;
by store;
if _n_ <= 5 then "Sum_5Largest_Losses"n=sum(amount);
end;
run;
But this just gives me the 5th amount, not the sum of amounts 1 to 5, and I really don't know how to select the top 5 of each store. I think some kind of GROUP BY would be a perfect fit. But first things first: how do I select i = 1, ..., 5 and not just i = 5?
There is also a way of doing it with PROC SQL:
data have;
input store$ amount;
datalines;
A 100
A 200
A 300
A 400
A 500
A 600
A 700
B 1000
B 1100
C 1200
C 1300
C 1400
D 600
D 700
E 1000
E 1100
F 1200
;
run;
proc sql outobs=4; /* limit output to the first 4 rows */
select store, sum(amount) as TOTAL_AMT
from have
group by 1
order by 2 desc; /* order them to the TOP selection*/
quit;
The data step sum(,) function adds up its arguments. If you only give it one argument then there is nothing to actually sum so it just returns the input value.
data calc1;
do _n_=1 by 1 until(last.store);
set sorted_store;
by store;
if _n_ <= 5 then Sum_5Largest_Losses=sum(Sum_5Largest_Losses,amount);
end;
run;
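The single-argument behaviour of SUM described above can be checked with a small standalone step. The variable names here are made up for the illustration.

```sas
data _null_;
  total = .;                        /* running total starts out missing */
  amount = 100;
  one_arg = sum(amount);            /* one argument: just returns amount */
  two_args = sum(total, amount);    /* missing values are ignored */
  total = two_args;
  again = sum(total, amount);       /* now actually accumulates */
  put one_arg= two_args= again=;    /* one_arg=100 two_args=100 again=200 */
run;
```

This is why the original step only ever held the latest amount, while passing the accumulator as a second argument builds up the total.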
I would highly recommend learning the basic methods before getting into DOW loops.
Add a counter so you can find the first 5 of each store
As the data step loops the sum accumulates
Output sum for counter=5
proc sort data= store out=sorted_store;
by store descending amount;
run;
data calc1;
set sorted_store;
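by store;
*keep the running total across observations (counter is retained by the sum statement);
retain total_sum;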
*if first store then set counter to 1 and total sum to 0;
if first.store then do ;
counter=1;
total_sum=0;
end;
*otherwise increment the counter;
else counter+1;
*accumulate the sum if counter <= 5;
if counter <=5 then total_sum = sum(total_sum, amount);
*output on the 5th record, or on the last record for stores with fewer than 5;
if counter=5 or (last.store and counter<5) then output;
run;
I'd like to set all values in an array to 1 if some sort of condition is met, and perform a calculation if the condition isn't met. I'm using a do loop at the moment which is very slow.
I was wondering if there was a faster way.
data test2;
set test1;
array blah_{*} blah1-blah100;
array a_{*} a1-a100;
array b_{*} b1-b100;
do i=1 to 100;
blah_{i}=a_{i}/b_{i};
if b1=0 then blah_{i}=1;
end;
run;
I feel like the if statement is inefficient as I am setting the value 1 cell at a time. Is there a better way?
There are already several good answers, but for the sake of completeness, here is an extremely silly and dangerous way of changing all the array values at once without using a loop:
data test2;
set test1;
array blah_{*} blah1-blah100 (100*1);
array a_{*} a1-a100;
array b_{*} b1-b100;
/*Make a character copy of what an array of 100 1s looks like*/
length temp $800; *Allow 8 bytes per numeric variable;
retain temp;
if _n_ = 1 then temp = peekclong(addrlong(blah1), 800);
do i=1 to 100;
blah_{i}=a_{i}/b_{i};
end;
/*Overwrite the array using the stored value from earlier*/
if b1=0 then call pokelong(temp,addrlong(blah1),800);
run;
You have 100*NOBS assignments to do. I don't see how using a DO loop over an ARRAY is any more inefficient than any other way.
But there is no need to do the calculation when you know it will not be needed.
do i=1 to 100;
if b1=0 then blah_{i}=1;
else blah_{i}=a_{i}/b_{i};
end;
This example uses a data set to "set" all values of an array without looping over the array. Note that because BLAH is now read with a SET statement, its variables are no longer reset to missing at the start of each iteration (they are implicitly retained). I cannot comment on performance; you will need to do your own testing.
data one;
array blah[10];
retain blah 1;
run;
proc print;
run;
data test1;
do b1=0,1,0;
output;
end;
run;
data test2;
set test1;
array blah[10];
array a[10];
array b[10];
if b1 eq 0 then set one nobs=nobs point=nobs;
else do i = 1 to dim(blah);
blah[i] = i;
end;
run;
proc print;
run;
This is not a response to the original question but to the discussion about the efficiency of loops versus SET for assigning values to multiple variables.
Here is a simple experiment that I ran:
%let size = 100; /* Controls size of dataset */
%let iter = 1; /* Just to emulate different number of records in the base dataset */
data static;
array aa{&size} aa1 - aa&size (&size * 1);
run;
data inp;
do ii = 1 to &iter;
x = ranuni(234234);
output;
end;
run;
data eg1;
set inp;
array aa{&size} aa1 - aa&size;
set static nobs=nobs point=nobs;
run;
data eg2;
set inp;
array aa{&size} aa1 - aa&size;
do ii = 1 to &size;
aa(ii) = 1;
end;
run;
What I see when I run this with various values of &iter and &size is as follows:
With &iter set to 1, the assignment loop is faster than SET as &size increases.
However, for a given &size, as &iter increases (i.e. the number of times the SET statement or the loop executes), the SET approach gains on the assignment loop, and at a certain point the two cross over. I think this is because the transfer from physical disk to buffer happens just once (since static is a relatively small dataset), whereas the cost of the assignment loop is the same fixed amount on every record.
For this use case, where the fixed dataset used to set values is small, I admit that SET will be faster, especially when the logic has to run on many input records and relatively few variables need to be assigned. This will not be the case, however, if the dataset cannot be cached in memory between two records; the extra overhead of reading it into the buffer each time can then slow SET down.
I think this test isolates the statements of interest.
SUMMARY:
SET+create init array 0.40 sec. + 0.03 sec,
DO OVER array 11.64 sec.
NOTE: Additional host information:
X64_SRV12 WIN 6.2.9200 Server
NOTE: SAS initialization used:
real time 4.70 seconds
cpu time 0.07 seconds
1 options fullstimer=1;
2 %let d=1e4; /*array size*/
3 %let s=1e5; /*reps (obs)*/
4 data one;
5 array blah[%sysevalf(&d,integer)];
6 retain blah 1;
7 run;
NOTE: The data set WORK.ONE has 1 observations and 10000 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
user cpu time 0.03 seconds
system cpu time 0.00 seconds
memory 7788.90k
OS Memory 15232.00k
Timestamp 08/17/2019 06:57:48 AM
Step Count 1 Switch Count 0
8
9 sasfile one open;
NOTE: The file WORK.ONE.DATA has been opened by the SASFILE statement.
10 data _null_;
11 array blah[%sysevalf(&d,integer)];
12 do _n_ = 1 to &s;
13 set one nobs=nobs point=nobs;
14 end;
15 stop;
16 run;
NOTE: DATA statement used (Total process time):
real time 0.40 seconds
user cpu time 0.40 seconds
system cpu time 0.00 seconds
memory 7615.31k
OS Memory 16980.00k
Timestamp 08/17/2019 06:57:48 AM
Step Count 2 Switch Count 0
2 The SAS System 06:57 Saturday, August 17, 2019
17 sasfile one close;
NOTE: The file WORK.ONE.DATA has been closed by the SASFILE statement.
18
19 data _null_;
20 array blah[%sysevalf(&d,integer)];
21 do _n_ = 1 to &s;
22 do i=1 to dim(blah); blah[i]=1; end;
23 end;
24 stop;
25 run;
NOTE: DATA statement used (Total process time):
real time 11.64 seconds
user cpu time 11.64 seconds
system cpu time 0.00 seconds
memory 3540.65k
OS Memory 11084.00k
Timestamp 08/17/2019 06:58:00 AM
Step Count 3 Switch Count 0
NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
NOTE: The SAS System used:
real time 16.78 seconds
user cpu time 12.10 seconds
system cpu time 0.04 seconds
memory 15840.62k
OS Memory 16980.00k
Timestamp 08/17/2019 06:58:00 AM
Step Count 3 Switch Count 16
Some more interesting test results, based on data null's original test. I also added the following test:
%macro loop;
data _null_;
array blah[%sysevalf(&d,integer)] blah1 - blah&d;
do _n_ = 1 to &s;
%do i = 1 %to &d;
blah&i = 1;
%end;
end;
stop;
run;
%mend;
%loop;
d     s    SET Method (real/cpu)  %Loop (real/cpu)  Array based (real/cpu)
100   1e5  0.03/0.01              0.00/0.00         0.07/0.07
100   1e8  11.16/9.51             4.78/4.78         1:22.38/1:21.81
500   1e5  0.03/0.04              0.02/0.01         Did not measure
500   1e8  16.53/15.18            32.17/31.62       Did not measure
1000  1e5  0.03/0.03              0.04/0.03         0.74/0.70
1000  1e8  20.24/18.65            42.58/42.46       Did not measure
So with array based assignments, it is not the assignment that is the big culprit itself. Since arrays use a memory map to map the original memory locations, it appears that the memory location lookup for a given subscript is what really impacts performance. A direct assignment avoids this and significantly improves performance.
So if your array size is in the low hundreds, direct assignment may not be a bad way to go. SET becomes effective when the array size goes beyond a few hundred.
Hello, I am trying to solve a problem using iteration with DO UNTIL, but I don't get any results. I am learning SAS on my own at the moment using books, documentation and videos, so I am new to this language. My problem is:
A car delivers a mileage of 20 miles per gallon. Write a program so that the program stops generating observations when distance reaches 250 miles or when 10 gallons of fuel have been used.
Hint: miles = gallon * mpg
I used the following code:
data mileage;
mpg = 20;
do until (miles le 250);
miles +1;
do until (gallon le 10);
gallon + 1;
miles = gallon * mpg;
end;
end;
output;
run;
Please tell me what am I doing wrong here?
Many thanks for your time and attention !
Because you waited until after the DO loops finished to write out any observations. If you want to write multiple observations you should move your output statement inside the do loop.
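Also, your UNTIL conditions are the wrong way round. An UNTIL condition is checked at the bottom of the loop, and after the first pass gallon is 1 and miles is 20, so both gallon le 10 and miles le 250 are already true, which means each DO loop will only execute once. Use GE instead of LE.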
Your question as written can be answered without a program since 10*20 is less than 250. Assuming that you also want to change the mpg values then perhaps this is more what you wanted?
data mileage;
do mpg = 20 by 1 until (miles ge 250);
do gallon=1 to 10 until (miles ge 250);
miles = gallon * mpg;
output;
end;
end;
run;
The ability to combine an iterative loop with an UNTIL condition is one of the many nice features of the data step DO loop.
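If mpg is meant to stay fixed at 20, as the problem statement reads, a single loop with both stopping rules is enough. This is a minimal sketch under that assumption, reusing the variable names from the question.

```sas
data mileage;
  mpg = 20;
  /* stop after 10 gallons, or earlier if 250 miles is reached */
  do gallon = 1 to 10 until (miles ge 250);
    miles = gallon * mpg;
    output;                 /* one observation per gallon used */
  end;
run;
```

With mpg = 20 the gallon limit is hit first, so the step writes 10 observations and the last one has miles = 200.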
I've been trying to find the simplest way to generate simulated time series datasets in SAS. I initially experimented with the LAG operator, but this requires input data, so it is probably not the best way to go. (See this question: SAS: Using the lag function without a set statement (to simulate time series data.))
Has anyone developed a macro or dataset that enables time series to be generated with an arbitrary number of AR and MA terms? What is the best way to do this?
To be specific, I'm looking to generate what SAS calls an ARMA(p,q) process, where p denotes the autoregressive component (lagged values of the dependent variable), and q is the moving average component (lagged values of the error term).
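For reference, the process described above is commonly written as follows. The coefficient symbols $\phi_i$ and $\theta_j$ are notation introduced here for illustration, with $y_t$ the series and $e_t$ the error term:

$$
y_t = \sum_{i=1}^{p} \phi_i \, y_{t-i} + e_t + \sum_{j=1}^{q} \theta_j \, e_{t-j}
$$

Setting $q = 0$ gives a pure AR(p) process and setting $p = 0$ gives a pure MA(q) process.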
Thanks very much.
I have developed a macro to attempt to answer this question, but I'm not sure whether this is the most efficient way of doing this. Anyway, I thought it might be useful to someone:
%macro TimeSeriesSimulation(numDataPoints=100, model=y=e,outputDataSetName=ts, maxLags=10);
data &outputDataSetName (drop=j);
array lagy(&maxlags) _temporary_;
array lage(&maxlags) _temporary_;
/*Initialise values*/
e = 0;
y=0;
t=1;
do j = 1 to &maxlags; /*Use the macro parameter so the loop matches the array size*/
lagy(j) = 0;
lage(j) = 0;
end;
output;
do t = 2 to &numDataPoints; /*Change this for number of observations*/
/*SPECIFY MODEL HERE*/
e = rannorm(-1); /*Draw from a N(0,1)*/
&model;
/*Update values of lags on the moving average and autoregressive terms*/
do j = &maxlags-1 to 1 by -1; /*Note you have to do this backwards because otherwise you cascade the current value to all past values!*/
lagy(j+1) = lagy(j);
lage(j+1) = lage(j);
end;
lagy(1) = y;
lage(1) = e;
output;
end;
run;
%mend;
/*Example 1: Unit root*/
%TimeSeriesSimulation(numDataPoints=1000, model=y=lagy(1)+e)
/*Example 2: Simple process with AR and MA components*/
%TimeSeriesSimulation(numDataPoints=1000, model=y=0.5*lagy(1)+0.5*lage(1)+e)