Random Permutation of Rows of a Large Dataset - sas

I need to create a random permutation of a data set with over 3 million rows.
I have tried to use PROC PLAN, based on this example: http://support.sas.com/kb/23/977.html According to that article, setting n = (number of rows)! allows the random selection of a permutation. Unfortunately, for a set this size that is about 4.3*10^19668676 permutations, so I run into a memory problem.
I also found an example using PROC IML, but it looks like my company does not have the necessary licence/software package to use it.
Can anyone supply me with a good way to shuffle this data set?

It sounds like you want to take the rows and put them in a random order. If so, the most common method is to create a random value for each row and then sort by the random values:
DATA augmented ;
SET original ;
sortval = RAND('UNIFORM') ;
RUN ;
PROC SORT DATA=augmented OUT=permuted(DROP=sortval) ;
BY sortval ;
RUN ;
You can use the CALL STREAMINIT(seedval) call routine if you want to be able to precisely reproduce the random sequence at a later time.
You could also probably do this with PROC SQL [untested code]:
PROC SQL ;
CREATE TABLE permuted(DROP=sortval) AS
SELECT a.*, RAND('UNIFORM') AS sortval
FROM original a
ORDER BY sortval
;
QUIT ;

Similar to what Ludwig61 said, I would do the following
DATA augmented ;
SET original ;
call streaminit(12072015) ; /* optional, if you want to set a seed */
sortval1 = RAND('UNIFORM') ;
sortval2 = RAND('UNIFORM') ;
sortval3 = RAND('UNIFORM') ;
RUN ;
PROC SORT DATA=augmented OUT=permuted(DROP=sortval1 sortval2 sortval3) ;
BY sortval1 sortval2 sortval3;
RUN ;
A single random key can produce collisions (ties), so you can add more random keys until you are comfortable that no two rows share the full key, then sort by all of them. Given that RAND('UNIFORM') has a period of 2^19937-1, three keys should be plenty; your only enemy in this case is that SAS stores numeric values with 53 bits of precision.
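To put a rough number on the collision risk, the birthday approximation gives about n(n-1)/2 / 2^b expected tied pairs for n rows and keys with b bits of resolution. The sketch below evaluates that for 3 million rows. The 32-bit case is only an assumption about how many distinct values a single RAND('UNIFORM') call might return (check the documentation for your SAS release); 53 bits is the ceiling set by double precision.

```sas
/* Expected number of tied key pairs among n rows (birthday approximation). */
/* The bit counts are assumptions for illustration, not documented facts.  */
data _null_;
  n = 3e6;
  do bits = 32, 53;
    expected_pairs = n * (n - 1) / 2 / 2**bits;
    put bits= expected_pairs= best12.;
  end;
run;
```

Tied rows keep their original relative order under PROC SORT's default EQUALS behavior, which is why adding a second or third key helps when ties are likely.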

This will create a random shuffle of the data and save the random number seed both as a column in the data and as metadata (the data set label). You could omit one or both, but I like to let SAS generate the seed and then save it so I can reproduce the sample/shuffle. Use a VIEW so the observations are piped directly to PROC SORT.
data shuffle / view=shuffle;
obs = _n_;
set sashelp.cars;
if _n_ eq 1 then call streaminit(0);
r = rand('uniform');
retain seed;
if _n_ eq 1 then seed=symgetn('sysrandom');
run;
proc sort data=shuffle out=shuffle01;
by r;
run;
%put NOTE: &=sysrandom;
proc datasets nolist;
modify shuffle01(label="SEED=&sysrandom");
run;
quit;
proc contents;
run;
proc print;
run;

Related

SAS select random samples from a dataset

I understand that to select a random sample, I can use
proc surveyselect data = raw_data method = srs n=200000 out=sample_data;
run;
However, sometimes my raw_data has fewer than 200000 records. If raw_data is small, I would like to just keep all of it; if it has more than a million records, I would like to randomly select 200k records from it. How should I do this?
Thank you!
Just create a macro variable for n. You can do this below, or you can use dictionary.tables or proc contents to get the count without actually counting all of the rows if you don't have reason to disbelieve those values.
proc sql;
select
case when count(1) < 1000000 then count(1) else 200000 end
into :sampcount
from yourdataset
;
quit;
proc surveyselect n=&sampcount. .... ;
run;
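The dictionary.tables route mentioned above might look like the sketch below. It assumes the data set is WORK.RAW_DATA (library and member names must be uppercase in the WHERE clause) and that it is a real table rather than a view, since NOBS is not populated for views. NOBS can also include rows marked as deleted, so treat it as the stored count rather than a fresh count.

```sas
/* Read the stored row count instead of scanning the data.            */
/* WORK and RAW_DATA are assumed names; adjust them to your library.  */
proc sql noprint;
  select case when nobs < 1000000 then nobs else 200000 end
    into :sampcount trimmed
  from dictionary.tables
  where libname = 'WORK' and memname = 'RAW_DATA';
quit;

proc surveyselect data=raw_data method=srs n=&sampcount. out=sample_data;
run;
```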

randomly select two observations and calculate the distance

I have a data set have with numerical column x. I want to randomly select any two distinct points and then calculate the distance between them.
If I only do it once, then I just use proc surveyselect to generate another data set with two obs.
proc surveyselect data=have out=want method=srs
sampsize=2;
run;
data out;
set want end=eof; /* read the two sampled rows, not the full data set */
dist = abs(dif1(x));
if eof;
run;
But how can I do it multiple times, say 1000000? Each time two points are selected with equal prob, then finally I have 1000000 dist records.
How about you reorder your input dataset into a random order and then calculate the distance for every second observation?
proc sql ;
create table random as
select *, ranuni(0) as randorder
from have
order by randorder
;quit ;
data want ;
set random;
dist=abs(dif1(x)) ;
if _n_/2=int(_n_/2) ;
run;
If you need a specific number of samples, change set random to set random(obs=100000), for example. Note that this counts observations rather than pairs, so 100,000 would give your want dataset 50,000 observations.
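The shuffle approach yields at most one pair per two input rows. If you need more pairs than that (say 1,000,000 from a smaller data set), one alternative, which is not the method above, is to draw each pair independently with random access through POINT=. This is a minimal sketch: the seed is arbitrary, the names rep, i, j and x1 are made up for illustration, and it assumes have contains at least two rows.

```sas
/* Draw 1,000,000 pairs of distinct rows, each pair chosen independently. */
data want(keep=rep dist);
  call streaminit(20240101);            /* arbitrary seed, for reproducibility */
  do rep = 1 to 1000000;
    i = ceil(n * rand('uniform'));      /* first row number, 1..n */
    do until (j ne i);                  /* redraw until the second row differs */
      j = ceil(n * rand('uniform'));
    end;
    set have(keep=x) point=i nobs=n;    /* n is filled in at compile time */
    x1 = x;
    set have(keep=x) point=j;
    dist = abs(x - x1);
    output;
  end;
  stop;                                 /* required when reading with POINT= */
run;
```

Random access is slower per row than a sequential read, so for very large inputs the shuffle-and-pair approach may still be preferable.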

Create 100 copies of a dataset in SAS

I need to create 100 copies of a data set (which has 3 variables), but one of the variables needs to be assigned randomly (1 through 1000).
I know I can use 100 DATA statements, but I don't want to go down that road!
Let's say I have data set A and want to create data sets A1 to A100. I used the following code:
data A1--A100;
set A;
do i=1 to 1000;
var3=int(ranuni(0) * 1000 + 1);
output A1--A1000;
end;
run;
but SAS does not generate anything at all
You can't do it via any shortcut like that. You could use the macro language to create the 1000 dataset names and 1000 output statements.
However, more than likely you shouldn't do this. Instead, have one dataset with a BY variable, and then in whatever you're going to do (MCMC or whatever) use that BY variable with the BY statement.
data want;
set have;
do byvar=1 to 1000;
var3 = int(ranuni(7)*1000+1);
output;
end;
run;
Also, don't use ranuni(0). Always use a positive seed (and save it) so you can replicate your results.
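For completeness, the macro route mentioned above might look roughly like this. The macro name make_copies and its n= parameter are made up for illustration, and the seed 7 is simply reused from the example above. The macro generates one DATA statement listing A1 to A100 and one assignment plus OUTPUT statement per copy.

```sas
%macro make_copies(n=100);
  %local i;
  /* Generates: data A1 A2 ... A&n; */
  data %do i = 1 %to &n; A&i %end; ;
    set A;
    %do i = 1 %to &n;
      /* a fresh random value for each copy of the current row */
      var3 = int(ranuni(7) * 1000 + 1);
      output A&i;
    %end;
  run;
%mend make_copies;

%make_copies(n=100)
```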
Here is the answer, hope it could help;
data want;
set have;
do dset=1 to 101;
rand=ranuni(4011120);
if dset=1 then real=1; else real=0;
output;
end;
run;
proc sort data=want;
by dset rand;
run;
data want2;
set want;
if real=0 then rank= mod(_N_,366);
if real then realrank=rank;
run;
proc sort data=want2;
by dset dayofyear;
run;

SAS creating subtables ordered by year

I have a table in SAS which consists of data from stock exchange. One of its columns holds information about date. I would like to create subtables, each one hold data from only one specific year.
Assuming you really want to do this (it is often the inferior option, since analyses run separately by year can usually be run from one dataset using by year;, but it is sometimes appropriate), the gold standard method is the hash table, because a hash object can write out an unlimited number of tables based on the data. This is the 'hashing' method described on this page.
Hashing code, adapted from the sascommunity.org page above:
data have;
call streaminit(7);
do year=1998 to 2014;
do id= 1 to 10;
x=rand('Uniform');
output;
end;
end;
run;
data _null_ ;
dcl hash byyear () ;
byyear.definekey ('k') ; /* if `id` or similar is a safe unique ID you could use that here, otherwise `k` is your unique identifier - hash requires unique */
byyear.definedata ('year','id','x') ;
byyear.definedone () ;
do k = 1 by 1 until ( last.year ) ;
set have;
by year ;
byyear.add () ;
end ;
dsetname=cats('year',year);
byyear.output (dataset: dsetname) ;
run ;
There is a similar set of methods that revolve around using a macro to generate the code. This paper goes into detail about one method to do that; I won't explain it in detail as I consider it inferior to the hash method (even if it is lower CPU time, it is more complicated to write than either a pure macro method or a pure hash method) but in certain cases it could be better.
A simple example of the macro method, assuming a conceptual have dataset with a date variable:
proc sql;
select distinct(cats('year',year(date))) into :dsetlist
separated by ' '
from have;
select distinct(cats('%outputyear(year=',year(date),')')) into :outputlist
separated by ' '
from have;
quit;
%macro outputyear(year=);
if year(date)=&year. then output year&year.;
%mend outputyear;
data &dsetlist.;
set have;
&outputlist.;
run;
data year1 year2 year3 yearN;
set stockdata;
if year(date) = 2014 then
output year1;
else if year(date) = 2013 then
output year2;
else if year(date) = 2012 then
output year3;
else
output yearN;
run;
You could also use a SELECT/WHEN block (the DATA step's version of a case statement), I guess.
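The DATA step's closest analogue to a case statement is the SELECT/WHEN block. A minimal sketch of the same split written that way, reusing the data set names from the IF/ELSE version above:

```sas
data year1 year2 year3 yearN;
  set stockdata;
  /* route each row to a data set according to the year of its date */
  select (year(date));
    when (2014) output year1;
    when (2013) output year2;
    when (2012) output year3;
    otherwise   output yearN;
  end;
run;
```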

several regressions on a single dataset in SAS

I have a dataset of the following format:
a table of M rows and 2K columns.
My columns are pairs of variables: X_i, Y_i and the rows are observations.
I would like to perform many linear regressions: one for each pair of columns (Y_i ~ X_i)
and obtain the results.
I know how to access specific columns using arrays, like so:
data Xs_Ys_data (drop=i);
array Xs[60] X1-X60;
array Ys[60] Y1-Y60;
I also know how to fit a single linear regression model, like so:
proc reg data=some_data;
model y = x;
output out=out_lin_reg;
run;
And I am familiar with the concept of loops:
do i=1 to 60;
Xs[i] .......;
end;
How do I combine these three to get what I need?
Thanks!
P.S - I asked a similar question on a different format here:
SAS reading a file in long format
Update:
I have managed to create the regressions using a macro like so:
%macro mylogit();
%do i = 1 %to 60;
proc reg data=Xs_Ys_data;
model Y&i = X&i;
run;
%end;
%mend;
%mylogit()
Now I am not sure how to export the results into a single table...
You have this in your macro:
proc reg data=Xs_Ys_data;
model Y&i = X&i;
run;
So instead create:
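data x_y_data;
set Xs_Ys_data;
array xs x1-x60;
array ys y1-y60;
/* write one row per (observation, pair) so each pair becomes a BY group */
do iter = 1 to dim(xs);
x=xs[iter];
y=ys[iter];
output;
end;
run;
/* BY-group processing needs the data sorted by iter */
proc sort data=x_y_data;
by iter;
run;
proc reg data=x_y_data;
by iter;
model y = x;
run;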
And then add an output statement however you normally would to get your resulting dataset. Now you get 1 output table with all 60 iterations (still 60 printed outputs), and if you want to create one printed output you can construct that from the output dataset.
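One way to collect the fitted coefficients in a single table is the OUTEST= option on the PROC REG statement. The sketch below assumes x_y_data has already been sorted by iter; the output name all_estimates is made up for illustration. NOPRINT suppresses the 60 printed reports.

```sas
/* One row of estimates (Intercept and x) per BY group. */
proc reg data=x_y_data outest=all_estimates noprint;
  by iter;
  model y = x;
run;
quit;

proc print data=all_estimates;
run;
```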