Add new columns of random numbers to existing table - sas

I'm totally new to SAS.
I would like to add two columns of randomly generated 0s and 1s to my existing table.
Can anyone give me some feedback?
I know how to do this in R but absolutely not in SAS.
Thank you in advance

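You need the RAND() function with the BERNOULLI distribution. I've also included CALL STREAMINIT(), which sets the random seed so the results are reproducible. Call RAND() once per new column to get two independent columns.
data want;
set have;
call streaminit(123); *seed, so the results are reproducible;
new_var1 = rand('bernoulli', 0.5); *1 with probability 0.5, otherwise 0;
new_var2 = rand('bernoulli', 0.5);
run;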
More details here: https://blogs.sas.com/content/iml/2011/08/24/how-to-generate-random-numbers-in-sas.html

Related

Missing values in a FREQ (SAS)

I'm going to ask this with an example...
Suppose I have a data set where each observation represents a person. Two of the variables are AGE and HASADOG (say this has values 1 for yes and 2 for no). Is there a way to run a PROC FREQ (by AGE*HASADOG) that forces SAS to include in the report a line for instances where the count is zero?
By this I mean: if there is a particular value for AGE such that no observation with this AGE value has a 1 in the HASADOG variable, the report will still include a row for this combination (with a row percent of 0.)
Is this possible?
The SPARSE option in PROC FREQ is likely all you need.
proc freq data=sashelp.class;
table sex*age / sparse list;
run;
If the value is nowhere in your data set at all, then there's no way for SAS to know it exists. In this case you'd need a more complex solution: a way to tell SAS ahead of time all the values you will be using. This can be done via the PRELOADFMT or CLASSDATA option on several procs. There are asked and answered questions on this topic here on SO, so I won't provide a full solution for this option, which seems beyond the scope of your question.
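For reference, a minimal sketch of the PRELOADFMT route, using PROC MEANS rather than PROC FREQ. The format name $sexfmt and the extra 'U' level are made up for illustration; sashelp.class contains only 'F' and 'M'.

```sas
/* Hypothetical format listing every level to report, including 'U',
   which never occurs in sashelp.class. */
proc format;
  value $sexfmt 'F' = 'Female'
                'M' = 'Male'
                'U' = 'Unknown';
run;

/* COMPLETETYPES with PRELOADFMT makes PROC MEANS report every level
   defined in the format, even when no observation has it. */
proc means data=sashelp.class completetypes nway n;
  class sex / preloadfmt;
  format sex $sexfmt.;
  var height;
run;
```

The 'Unknown' row should appear with N = 0. The same idea carries over to the other procs that accept PRELOADFMT.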

Summing Multiple lags in SAS using LAG

I'm trying to make a data step that creates a column in my table that has the sum of ten, fifteen, twenty and forty-five lagged variables. What I have below works, but it is not practical to write this code for the twenty and forty-five summed lags. I'm new to SAS and can't find a good way to write the code. Any help would be greatly appreciated.
Here's what I have:
data averages;
set work.cuts;
sum_lag_ten = (lag10(col) + lag9(col) + lag8(col) + lag7(col) + lag6(col) + lag5(col) + lag4(col) + lag3(col) + lag2(col) + lag1(col));
run;
PROC EXPAND makes it easy to calculate moving statistics.
Technically it requires a time component, but if you don't have one you can make one up, just make sure it's consecutive. A row number would work.
Given this, I'm not sure it's less code, but it's easier to read and type. And if you're calculating for multiple variables it's much more scalable.
TRANSFORMOUT= specifies the transformation, in this case a moving sum with a window of 10 periods. TRIMLEFT/TRIMRIGHT can be used to ensure that only records with a full 10-period window are included.
You may need to tweak these depending on what exactly you want. The third example under PROC EXPAND has examples.
Data have;
Set have;
RowNum = _n_;
Run;
Proc EXPAND data=have out=want;
ID rownum;
Convert col=col_lag10 / transformout=(MOVSUM 10 trimleft 9);
Run;
Documentation (SAS/ETS 14.1)
http://support.sas.com/documentation/cdl/en/etsug/68148/HTML/default/viewer.htm#etsug_expand_examples04.htm
If you must do this in the datastep (and if you do things like this regularly, SAS/ETS has better tools for sure), I would do it like this.
data want;
set sashelp.steel;
array lags[20];
retain lags1-lags20;
*move everything up one;
do _i = dim(lags) to 2 by -1;
lags[_i] = lags[_i-1];
end;
*assign the current record value;
lags[1] = steel;
*now calculate sums;
*if you want only earlier records and NOT this record, then use lags2-lags11, or do the sum before the move everything up one step;
lag_sum_10 = sum(of lags1-lags10);
lag_sum_15 = sum(of lags1-lags15); *etc.;
run;
Note - this is not the best solution (I think a hash table is better), but this is better for a more intermediate level programmer as it uses data step variables.
I don't use a temporary array because the sums rely on variable-list shortcuts such as lags1-lags10. A temporary array has no variable names, so with sum(of ...) you can only sum the whole array with [*], not just elements 1-10.
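If a temporary array is still preferred, the partial sums can be built with a DO loop instead of a variable list. This is only a sketch under assumptions: it reuses work.cuts and col from the question, the buffer name buf and the output names are made up, and it sums earlier rows only so that it lines up with lag1(col) + ... + lagN(col).

```sas
data averages;
  set work.cuts;
  array buf[45] _temporary_;   /* temporary arrays keep their values across rows */

  /* Total the stored values first, so each sum covers earlier rows only. */
  sum_lag_10 = 0;
  sum_lag_15 = 0;
  sum_lag_20 = 0;
  sum_lag_45 = 0;
  do _i = 1 to dim(buf);
    if _i <= 10 then sum_lag_10 = sum_lag_10 + buf[_i];
    if _i <= 15 then sum_lag_15 = sum_lag_15 + buf[_i];
    if _i <= 20 then sum_lag_20 = sum_lag_20 + buf[_i];
    sum_lag_45 = sum_lag_45 + buf[_i];
  end;

  /* Shift the buffer by one slot and store the current row in slot 1. */
  do _i = dim(buf) to 2 by -1;
    buf[_i] = buf[_i - 1];
  end;
  buf[1] = col;

  drop _i;
run;
```

Because the sums use the + operator, a sum stays missing until its window is full, which is the same behaviour as adding the LAGn() calls directly. SAS will write a note about missing values being generated for those early rows.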

Split a dataset into two new dataset based on percentage of split

I want to split my large dataset randomly into two new datasets in the ratio of 70% - 30%.
Basically I need to allocate 70% of random values from large dataset to the newdataset1 and 30% of the random values from largedataset to the newdataset2.
Can you please help with a SAS code that will help me achieve it.
A dummy code will really help..
Proc SQL or a SAS data step, anything will work for me.
For a complex sample design (e.g. stratified randomization), PROC SURVEYSELECT is the way to go, as @Keith said.
But for simple random splitting, the RANTBL function will do the trick. Each row is assigned independently, so the split is only approximately 70/30:
data newdataset1 newdataset2;
set have;
flag=rantbl(-1, 0.7, 0.3);
if flag=1 then output newdataset1;
else output newdataset2;
run;
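When the split has to be as close to 70/30 as the row count allows, PROC SURVEYSELECT can flag the sampled rows even without a complex design. A sketch, assuming the input is called have as above; the intermediate name flagged and the seed value are arbitrary.

```sas
/* OUTALL keeps every row and adds a 0/1 variable named Selected. */
proc surveyselect data=have out=flagged
                  method=srs samprate=0.7 seed=12345 outall;
run;

data newdataset1 newdataset2;
  set flagged;
  if selected = 1 then output newdataset1;   /* the 70% sample */
  else output newdataset2;                   /* the remaining 30% */
run;
```

Fixing SEED= makes the split reproducible from run to run.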

Picking rows in SAS based on a specific value in a specific column

So I'm working with a data set that has millions of rows. I'm trying to cut down the number of rows, so that I can merge this data set and another data set by zipcode.
What I'm trying to do is take a specific column "X6" and search through it for the value of "357". Then every row that has that value I want to move into a new data set.
I'm assuming that I'm going to have to use some form of if/then statement, but I can't get anything to work successfully. If needed I can post a snapshot of some of my data or what SAS code I currently have. I've seen other things that are similar, but none of them involve SAS.
Thanks for all of your help in advance.
RamB gave a great way to parse into two datasets.
If you just want a new dataset that is a subset of the original, the following will work well
DATA NEW;
SET ORIGINAL;
IF X6="357"; *NOTE: THIS ASSUMES X6 IS DEFINED AS CHARACTER;
RUN;
The IN operator can also test multiple criteria. Say you wanted to keep records where X6 = 357 or 588.
DATA NEW;
SET ORIGINAL;
IF X6 IN("357","588"); *NOTE: THIS ASSUMES X6 IS DEFINED AS CHARACTER;
RUN;
Lastly, NOT IN (also written NOTIN) works the same way to exclude values.
With data step this is really simple. I'll give you an example.
data dataset_with_357
original_without_357;
set original_dataset;
if compress(x6) = "357" then output dataset_with_357;
else output original_without_357;
run;
As I said, there are several ways of doing this, and it wasn't clear to me which is better for you.
Just use Proc SQL to create your data set, then reference the value you're looking for in your query -
Proc SQL;
Create table new as
Select *
From dataset
Where x6 = 357
;
Quit;
Assuming your x6 variable is numeric...
On a mobile device...sorry for no code text
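With millions of rows, another option is to filter while the data is being read, using the WHERE= data set option. A sketch reusing the ORIGINAL and NEW names from the answer above, and again assuming X6 is character:

```sas
data new;
  /* WHERE= filters rows as they are read, so non-matching rows
     never enter the data step. Use x6 = 357 if x6 is numeric. */
  set original(where=(x6 = "357"));
run;
```

PROC CONTENTS on the input data set shows whether X6 is character or numeric.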

Naming variable using _n_, a column for each iteration of a datastep

I need to declare a variable for each iteration of a datastep (for each _n_), but when I run the code, SAS outputs only the last variable declared, the one for the greatest _n_.
It seems stupid declaring a variable for each row, but I need to achieve this result, I'm working on a dataset created by a proc freq, and I need a column for each group (each row of the dataset).
The result will be in a macro, so it has to be completely flexible.
proc freq data=&data noprint ;
table &group / out=frgroup;
run;
data group1;
set group (keep=&group count ) end=eof;
call symput('gr', _n_);
*REQUESTED code will go here;
run;
I tried these:
var&gr.=.;
call missing(var&gr.);
and a lot of other statements, but none worked.
Always the same result, the ds includes only var&gr where &gr is the maximum n.
It seems that the PDV is overwriting the new variable each iteration, but the name is different.
Please include the result in a single datastep or, at least, make the code take as little time as possible.
Any idea on how can I achieve the requested result?
Thanks.
Macro variables don't work like you think they do. Any macro variable reference is resolved at compile time, so your call symput is changing the value of the macro variable after all the references have been resolved. The reason you are getting results where the &gr is the maximum n is because that is what &gr was as a result of the last time you ran the code.
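The timing can be seen in a small sketch. It uses sashelp.class as stand-in data, and the variable names at_compile and at_run are made up for illustration.

```sas
%let gr = 0;   /* placeholder so &gr. resolves when the step is compiled */

data _null_;
  set sashelp.class(obs=3);
  call symputx('gr', _n_);
  at_compile = &gr.;           /* text substituted before the step runs */
  at_run     = symget('gr');   /* looked up while the step executes */
  put _n_= at_compile= at_run=;
run;
```

In the log, at_compile stays at 0 on every row while at_run follows _n_. This is why var&gr. can only ever name one variable per data step.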
If you know you can determine the maximum _n_, you can put the max value into a macro variable and declare an array like so:
Find max _n_ and assign value to maxn:
data _null_;
set have end=eof;
if eof then call symputx('maxn',_n_); *symputx strips the leading blanks;
run;
Create variables:
data want;
set have;
array var (&maxn);
run;
If you don't like proc transpose (if you need 3 columns you can always use it once for every column and then put together the outputs) what you ask can be done with arrays.
First thing you need to determine the number of groups (i.e. rows) in the input dataset and then define an array with dimension equal to that number.
Then the i-th element of your array can be recalled using _n_ as index.
In the following code &gr. contains the number of groups:
data group1;
set group;
array arr_counts(&gr.) var1-var&gr.;
arr_counts(_n_)= count;
run;
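In SAS there are several methods to determine the number of obs in a dataset. My favorite is the following (it doesn't work with views). Run it before the step above so that &gr. exists:
data _null_;
if 0 then set group nobs=n; *nobs is filled in at compile time, no rows are read;
call symputx('gr',n);
stop; *avoids the "DATA STEP stopped due to looping" note;
run;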