Randomly select two observations and calculate the distance - SAS

I have a data set have with numerical column x. I want to randomly select any two distinct points and then calculate the distance between them.
If I only do it once, then I just use proc surveyselect to generate another data set with two obs.
proc surveyselect data=have out=want method=srs
sampsize=2;
run;
data out;
set want end=eof;
dist = abs(dif1(x));
if eof;
run;
But how can I do it multiple times, say 1,000,000? Each time, two points are selected with equal probability, so that in the end I have 1,000,000 dist records.

How about you reorder your input dataset into a random order and then calculate the distance for every second observation?
proc sql ;
create table random as
select *, ranuni(0) as randorder
from have
order by randorder
;quit ;
data want ;
set random;
dist=abs(dif1(x)) ;
if _n_/2=int(_n_/2) ;
run;
If you need a specific number of samples, change set random to set random(obs=100000), for example. Note that this counts observations rather than pairs, so obs=100000 would give your want dataset 50,000 observations. Each observation is also used in at most one pair, so a dataset with n rows yields at most n/2 distances.
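If more distances are needed than a single shuffle can supply (1,000,000 in the question), one alternative is to draw each pair directly by observation number using the POINT= option of the SET statement. This is only a minimal sketch, not the method from the answer above. It assumes have contains at least two observations; the output name dist_sample, the seed, and the helper variables rep, i, j and x1 are made up for illustration.

```sas
data dist_sample(keep=dist);
  call streaminit(2024);                 /* illustrative seed, any value works */
  do rep = 1 to 1000000;
    /* first observation number, uniform on 1..n (n comes from NOBS=) */
    i = ceil(n * rand('uniform'));
    /* redraw until the second observation differs from the first;
       this loops forever if HAVE has fewer than two observations */
    do until (j ne i);
      j = ceil(n * rand('uniform'));
    end;
    set have(keep=x) point=i nobs=n;     /* read observation i */
    x1 = x;
    set have(keep=x) point=j;            /* read observation j */
    dist = abs(x - x1);
    output;
  end;
  stop;                                  /* required with POINT= to end the step */
run;
```

Here every draw is independent of the others, so the same observation can appear in many pairs, which is what allows more distances than the dataset has rows.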

Related

SAS target encoding

Hi,
Can someone explain to me what the following code does, step by step?
I need to describe in detail what happens at each stage.
%macro frequency_encoding(dataset, var);
proc sql noprint;
create table freq as
select distinct(&var) as values, count(&var) as number
from &dataset
group by Values ;
create table new as select *, round(freq.number/count(&var),00.01) As freq_encode
from &dataset left join freq on &var=freq.values;
quit;
data new(drop=values number &var);
set new;
rename freq_encode=&var;
run;
data new;
set new;
keep &var;
run;
data dane(drop = &var);
set dane;
run;
data dane;
set dane;
set new;
run;
%mend;
The SQL is first finding the frequency of each value of the variable. Then it divides those counts by the total number of non-missing values and rounds that percentage to two decimal places (or integers when you think of the ratio as a percentage).
This could be done in one step with:
proc sql noprint;
create table new as
select *,round(number/count(&var),0.01) as freq_encode
from (select *,&var as values,count(&var) as number
from &dataset
group by &var
)
;
quit;
It is not clear what the DANE dataset is supposed to be. If &DATASET does not equal DANE then those last four data steps make no sense. If it does then it is a convoluted way to replace the original variable with the percentage.
The first one is basically trying to rename the calculated percentage as the original variable and eliminate the original variable and the other two intermediate variables used in calculating the percentage.
The second one is dropping all of the variables except the new percentage.
The third one is dropping the original variable from "dane".
The last one is adding the new variable back to "dane".
Assuming DANE should be replaced with &DATASET then those four data steps could be reduced to one:
data &dataset;
set &dataset(drop=&var);
set new(keep=freq_encode rename=(freq_encode=&var));
run;
It is probably best not to overwrite your original dataset in that way. So perhaps you should add an OUT parameter to your macro to name the new dataset you want to create.
You could have avoided all of those data steps by just adding the DROP= and RENAME= dataset options to the dataset generated by the SQL query.
So perhaps you want something like this:
%macro frequency_encoding(dataset, var,out);
proc sql noprint;
create table &out(drop=&var number rename=(freq_encode=&var)) as
select *,round(number/count(&var),0.01) as freq_encode
from (select *,count(&var) as number
from &dataset
group by &var
)
;
quit;
%mend ;
%frequency_encoding(sashelp.class,sex,work.class);

How to perform addition of column values in SAS for large dataset

I have a large data set with millions of records and more than 200 columns. Is there any way to sum the columns in SAS? Below is the sample data.
I want to display the data as below. I need to sum all the values in each column, e.g. for col1: 100+100+100+100+100+100=600.
A simple proc summary will suffice.
proc summary data=want nway ;
var col1-col7 ;
output out=sum (drop=_:) sum= ;
run ;

Storing median of a data set and using it for doing calculation

I have a data set x=(1,4,7,8,10,...................) which has these random values. I want to find the median of the data set and then store it in a macro variable with call symput, so that I can use it to find the difference of each observation from the median.
What is the function to find the median of this data set by just mentioning the variable name?
I want the data in the below format:
X X-median
1 1-median
4 4-median
7 7-median
8 8-median
10 10-median
Don't store it in a macro variable. Just keep it in a dataset and combine that with your data.
proc means data=sashelp.class noprint;
var age;
output out=age_median median(age)=age_median;
run;
data class_fin;
if _n_=1 then set age_median(keep=age_median);
set sashelp.class;
age_minus_median = age - age_median;
run;
When you set a dataset this way (a one-row dataset with if _n_=1) you get its value(s) copied onto every row of the dataset, sort of like if you merged it with some universal by value.
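This can be done easily in PROC SQL, as long as MEDIAN is treated as a summary function there (it is in SAS 9.4; earlier releases evaluate it across its arguments within each row).
Let's say you have the x values in table A and want to create a new table B: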
PROC SQL;
CREATE TABLE B AS SELECT x,median(x) AS median,x-median(x) AS offset FROM A;
QUIT;

Random Permutation of Rows of a Large Dataset

I need to create a random permutation of a data set with over 3 million rows.
I have tried to use PROC PLAN, based on this example: http://support.sas.com/kb/23/977.html. According to this article, having n = (number of rows)! allows the random selection of a permutation. Unfortunately for a set this size, that is 4.3*10^19668676 permutations. Obviously I run into a bit of a memory problem here.
I also found an example using PROC IML, but it looks like my company does not have the necessary licence/software package to use it.
Can anyone supply me with a good way to shuffle this data set?
It sounds like you want to take the rows and put them in a random order. If so, the most common method is to create a random value for each row and then sort by the random values:
DATA augmented ;
SET original ;
sortval = RAND('UNIFORM') ;
RUN ;
PROC SORT DATA=augmented OUT=permuted(DROP=sortval) ;
BY sortval ;
RUN ;
You can use the CALL STREAMINIT(seedval) call routine if you want to be able to precisely reproduce the random sequence at a later time.
You could also probably do this with PROC SQL [untested code]:
PROC SQL ;
CREATE TABLE permuted(DROP=sortval) AS
SELECT a.*, RAND('UNIFORM') AS sortval
FROM original a
ORDER BY sortval
;
QUIT ;
Similar to what Ludwig61 said, I would do the following
DATA augmented ;
SET original ;
call streaminit(12072015); /* optional, if you want to set a seed */
sortval1 = RAND('UNIFORM') ;
sortval2 = RAND('UNIFORM') ;
sortval3 = RAND('UNIFORM') ;
RUN ;
PROC SORT DATA=augmented OUT=permuted(DROP=sortval1 sortval2 sortval3) ;
BY sortval1 sortval2 sortval3;
RUN ;
Since a single random number could produce collisions, you can add more random numbers until you feel comfortable that no two rows share the same sort key, then sort by all of them. Given that the Rand('Uniform') function has a period of 2^19937-1, three should be plenty; your only enemy in this case is SAS's truncation after 53 bits.
This will create a random shuffle of the data and save the random number seed both as a column in the data and as metadata. You could omit one or both, but I like to let SAS generate the seed and save it so I can reproduce the sample/shuffle. Use a VIEW so the observations are piped directly to PROC SORT.
data shuffle / view=shuffle;
obs = _n_;
set sashelp.cars;
if _n_ eq 1 then call streaminit(0);
r = rand('uniform');
retain seed;
if _n_ eq 1 then seed=symgetn('sysrandom');
run;
proc sort data=shuffle out=shuffle01;
by r;
run;
%put NOTE: &=sysrandom;
proc datasets nolist;
modify shuffle01(label="SEED=&sysrandom");
run;
quit;
proc contents;
run;
proc print;
run;

Is there a way to name proc rank groups based on values within the group?

I have multiple continuous variables that I have used proc rank to divide into 10 groups, i.e. for each observation there is now a "GPA" and a "GRP_GPA" value, and likewise Hmwrk_Hrs and GRP_Hmwrk_Hrs. But in each of the new group columns the values run from 1 to 10. Is there a way to change that value so that, rather than 1 for instance, it would be 1.2-2.8 if those were the min and max values within the group? I know I can do it by hand using proc format, if/then logic, or case in SQL, but since I have something like 40 different columns that would be very time intensive.
It's not clear from your question if you want to store the min-max values or just format the rank columns with them. My solution below formats the rank column and utilises the ability of SAS to create formats from a dataset. I've only used one variable to rank; for your data it will be a simple matter to wrap a macro around the code and run it for each of your 40 or so variables. Hope this helps.
/* create ranked dataset */
proc rank data=sashelp.steel groups=10 out=want;
var steel;
ranks steel_rank;
run;
/* calculate minimum and maximum values per rank */
proc summary data=want nway;
class steel_rank;
var steel;
output out=want_min_max (drop=_:) min= max= / autoname;
run;
/* create dataset with formatted values */
data steel_rank_fmt;
set want_min_max (rename=(steel_rank=start));
retain fmtname 'stl_fmt' type 'N';
label=catx('-',steel_min,steel_max);
run;
/* create format from previous dataset */
proc format cntlin=steel_rank_fmt;
run;
/* apply formatted value to rank column */
proc datasets lib=work nodetails nolist;
modify want;
format steel_rank stl_fmt10.;
quit;
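The macro wrapper mentioned above could look roughly like the following. This is a sketch under stated assumptions, not tested against your data: the macro name rank_with_range, its parameters, the temporary datasets minmax_tmp and cntlin_tmp, and the format name built as the variable name plus _f are all made up for illustration. It assumes the output dataset has a one-level name in WORK and that the variable name is short enough for grp_ plus the name to stay within 32 characters.

```sas
%macro rank_with_range(data=, var=, out=, groups=10);
  /* rank &var into &groups groups; the rank goes into grp_&var */
  proc rank data=&data groups=&groups out=&out;
    var &var;
    ranks grp_&var;
  run;

  /* minimum and maximum of &var within each rank group */
  proc summary data=&out nway;
    class grp_&var;
    var &var;
    output out=minmax_tmp(drop=_:) min=lo_val max=hi_val;
  run;

  /* control dataset for PROC FORMAT: one row per rank value */
  data cntlin_tmp(keep=fmtname type start label);
    set minmax_tmp(rename=(grp_&var=start));
    retain fmtname "&var._f" type 'N';
    length label $40;
    label = catx('-', lo_val, hi_val);
  run;

  proc format cntlin=cntlin_tmp;
  run;

  /* attach the new format to the rank column (assumes &out is in WORK) */
  proc datasets lib=work nodetails nolist;
    modify &out;
    format grp_&var &var._f.;
  quit;
%mend rank_with_range;

/* example call on the same data as above */
%rank_with_range(data=sashelp.steel, var=steel, out=want);
```

To process several variables, later calls can read and write the same dataset, for example data=want, out=want, so that each call adds one more formatted rank column.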
In addition to Keith's good answer, you can also do the following:
proc rank data = sashelp.cars groups = 10 out = test;
var enginesize;
ranks es;
run;
proc sql ;
select *, catx('-',min(enginesize), max(enginesize)) as esrange, es from test
group by es
order by make, model
;
quit;