I am trying to come up with code that will select a random column from a group of columns of interest. The group of columns changes depending on the values in the columns for each observation. Each observation is a subject.
Let me explain to be more clear:
I have 8 columns, named V1-V8. Each column has 3 potential responses ('Small', 'Medium', 'High'). Due to certain circumstances in our project, I need to "combine" all this information into 1 column.
Key factor 1: We only want the columns per subject where he/she selected 'High' (lots of combinations here). This is what I refer to when I say the columns of interest changes per subject.
Key factor 2: Once I have identified the columns where 'High' was selected for the subject, select one of the columns at random.
At the end, I need a new variable (New_V) with values V1-V8 (NOT 'Small','Medium','High') indicating which column was selected for each subject.
Any advice would be great. I have tried ARRAYs and macro variables, but I can't seem to tackle this the right way.
This method uses macro variables and a loop. There are three main steps: First, find all variables that are "high." Second, select a random value from 1 to the number of variables that are "high." Third, pick that variable and call it selected_var.
data temp;
input subject $ v1 $ v2 $ v3 $ v4 $ v5 $ v6 $ v7 $ v8 $;
datalines;
1 high medium small high medium small high medium
2 medium small high medium small high medium high
3 small high high medium small high medium high
4 medium medium high medium small small medium medium
5 medium medium high small small high medium small
6 small small high medium small high high high
7 small small small small small small small small
8 high high high high high high high high
;
run;
%let vars = v1 v2 v3 v4 v5 v6 v7 v8;
%macro find_vars;
data temp2;
set temp;
/*find possible variables*/
format possible_vars $20.;
%do i = 1 %to %sysfunc(countw(&vars.));
%let this_var = %scan(&vars., &i.);
/*note: cats/substr here rely on every variable name being exactly 2 characters (v1-v8)*/
if &this_var. = "high" then possible_vars = cats(possible_vars, "&this_var.");
%end;
/*create a random integer between 1 and number of variables to select from*/
rand = 1 + floor((length(possible_vars) / 2) * rand("Uniform"));
/*pick that one!*/
selected_var = substr(possible_vars, (rand * 2 - 1), 2);
run;
%mend find_vars;
%find_vars;
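For clarity, the same three-step idea can be sketched outside SAS. This is a minimal Python illustration only; the function name and sample data are hypothetical, not part of the question:

```python
import random

# Hypothetical sketch (not the SAS macro itself): for each subject,
# collect the names of the columns whose value is "high", then pick
# one of those column names at random.
def pick_high_column(row, columns, rng):
    """Return the name of one randomly chosen column whose value is 'high'."""
    candidates = [c for c in columns if row[c] == "high"]
    if not candidates:          # subject never answered 'high'
        return None
    return rng.choice(candidates)

rng = random.Random(42)          # fixed seed for reproducibility
columns = [f"v{i}" for i in range(1, 9)]
subject = {"v1": "high", "v2": "medium", "v3": "small", "v4": "high",
           "v5": "medium", "v6": "small", "v7": "high", "v8": "medium"}
chosen = pick_high_column(subject, columns, rng)
# chosen is always one of the 'high' columns: v1, v4, or v7
```
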
You're on the right track with arrays. The vname function will be helpful here. The want datastep shows how to do this (the rest just sets up example data):
proc format;
value smh
1='Small'
2='Medium'
3='High'
other=' '
;
quit;
data have;
call streaminit(5);
array v[8] $;
do _i = 1 to 1000;
do _j = 1 to 8;
__rand = ceil(1+rand('Binomial',.7,2));
v[_j] = put(__rand,smh6.);
end;
if whichc('High',of v[*]) = 0 then v8 = 'High'; *guarantee have one high;
output;
end;
drop _:;
run;
data want;
call streaminit(7); *arbitrary seed here, pick any positive number;
set have;
array v[8] ;
do until (v[_rand] = 'High'); *repeat this loop until one is picked that is High;
_rand = ceil(8*rand('Uniform'));
end;
chosen_v = vname(v[_rand]); *assign the chosen name to chosen_v variable;
drop _:;
run;
proc freq data=want;
tables chosen_v;
run;
I want to do a sum of the 250 previous rows for each row, starting from the 250th row.
X= lag1(VWRETD)+ lag2(VWRETD)+ ... +lag250(VWRETD)
X = sum ( lag1(VWRETD), lag2(VWRETD), ... ,lag250(VWRETD) )
I tried to use the lag function, but it does not work well with so many lags.
I also want to calculate the sum of the 250 rows following each row.
What you're looking for is a moving sum both forwards and backwards where the sum is missing until that 250th observation. The easiest way to do this is with PROC EXPAND.
Sample data:
data have;
do MKDate = '01JAN1993'd to '31DEC2000'd;
VWRET = rand('uniform');
output;
end;
format MKDate mmddyy10.;
run;
Code:
proc expand data=have out=want;
id MKDate;
convert VWRET = x_backwards_250 / transform=(movsum 250 trimleft 250);
convert VWRET = x_forwards_250 / transform=(reverse movsum 250 trimleft 250 reverse);
run;
Here's what the transformation operations are doing:
Creating a backwards moving sum of 250 observations, then setting the initial 250 to missing.
Reversing VWRET, creating a moving sum of 250 observations, setting the initial 250 to missing, then reversing it again. This effectively creates a forward moving sum.
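Under the hood, those two CONVERT transformations behave like simple list operations. Here is a minimal Python sketch of that behaviour; the helper names and the tiny window of 3 are illustrative assumptions, not PROC EXPAND itself:

```python
# Sketch of the two PROC EXPAND transformations on a plain Python list.
#   movsum k   -> sum of the current value and the k-1 preceding values
#   trimleft k -> set the first k results to missing (None here)
#   reverse ... reverse -> turns the backward window into a forward one
def movsum_trimmed(x, k):
    """Backward moving sum of k observations, first k results set to None."""
    out = []
    for i in range(len(x)):
        if i < k:                     # not enough history yet -> trimmed
            out.append(None)
        else:
            out.append(sum(x[i - k + 1:i + 1]))
    return out

def forward_movsum_trimmed(x, k):
    """Forward moving sum: reverse, apply the backward version, reverse back."""
    return movsum_trimmed(x[::-1], k)[::-1]

x = [1, 2, 3, 4, 5, 6]
back = movsum_trimmed(x, 3)          # [None, None, None, 9, 12, 15]
fwd = forward_movsum_trimmed(x, 3)   # [6, 9, 12, None, None, None]
```
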
The key is how to read observations from preceding and following rows. As for your sum(n1, n2, ..., nx) function, you can replace it with iterative summation.
This example uses the multiple SET technique to sum a variable over the 25 preceding and 25 following rows:
data test;
set sashelp.air nobs=nobs;
if 25<_n_<nobs-25+1 then do;
do i=_n_-25 to _n_-1;
set sashelp.air(keep=air rename=(air=pre_air)) point=i;
sum_pre=sum(sum_pre,pre_air);
end;
do j=_n_+1 to _n_+25;
set sashelp.air(keep=air rename=(air=post_air)) point=j;
sum_post=sum(sum_post,post_air);
end;
end;
drop pre_air post_air;
run;
Only the 26th through (nobs-25)th rows are calculated, where nobs stands for the number of observations in the input dataset sashelp.air.
The multiple SET technique may take a long time on a big dataset. To be more efficient, you can use a temporary array and a DOW-loop instead of the multiple SET technique:
data test;
array _val_[1024] _temporary_;
if _n_=1 then do i=1 by 1 until(eof);
set sashelp.air end=eof;
_val_[i]=air;
end;
set sashelp.air nobs=nobs;
if 25<_n_<nobs-25+1 then do;
do i=_n_-25 to _n_-1;
sum_pre=sum(sum_pre,_val_[i]);
end;
do j=_n_+1 to _n_+25;
sum_post=sum(sum_post,_val_[j]);
end;
end;
drop i j;
run;
The weakness is that you have to give the array a dimension, which must be greater than or equal to nobs.
These techniques come from a concept called "Table Look-Up". For the SAS context, read "Table Look-Up by Direct Addressing: Key-Indexing -- Bitmapping -- Hashing", Paul Dorfman, SUGI 26.
You don't want to use normal arithmetic with missing values because then the result is always a missing value. Use the SUM() function instead.
You don't need to spell out all of the lags. Just keep a normal running sum, with the added wrinkle of removing the oldest value by subtraction. That way your equation only needs to reference a single lagged value.
Here is a simple example of a running sum of 5, using the SASHELP.CLASS data:
%let n=5 ;
data step1;
set sashelp.class(keep=name age);
retain running_sum ;
running_sum=sum(running_sum,age,-(sum(0,lag&n.(age))));
if _n_ >= &n then want=running_sum;
run;
So the sum of the first 5 observations is 68. But for the next observation the sum goes down to 66 since the age on the 6th observation is 2 less than the age on the first observation.
To calculate the other variable, sort the dataset in descending order and use the same logic to create it.
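The add-new-value, subtract-oldest-value trick generalizes beyond SAS. Here is a minimal Python sketch of the same sliding-window sum; the function name is hypothetical, and the ages listed are the first six from SASHELP.CLASS:

```python
# Sketch of the running-sum trick: keep one accumulator, add the new
# value, and subtract the value that just left the n-wide window.
# Like the SAS version, this references only a single "lagged" value.
def windowed_sums(values, n):
    """Sum of each trailing window of n values; None until the window fills."""
    out = []
    running = 0
    for i, v in enumerate(values):
        running += v
        if i >= n:
            running -= values[i - n]   # drop the value that left the window
        out.append(running if i >= n - 1 else None)
    return out

ages = [14, 13, 13, 14, 14, 12]        # first six ages in sashelp.class
sums = windowed_sums(ages, 5)
# the first five ages sum to 68; the window then slides to 66
```
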
I have a large dataset in SAS which I know is almost sorted; I know the first and second levels are sorted, but the third level is not. Furthermore, the first and second levels contain a large number of distinct values and so it is even less desirable to sort the first two columns again when I know it is already in the correct order. An example of the data is shown below:
ID Label Frequency
1 Jon 20
1 John 5
2 Mathieu 2
2 Mathhew 7
2 Matt 5
3 Nat 1
3 Natalie 4
Using the "presorted" option on a proc sort seems to only check if the data is sorted on every key, otherwise it does a full sort of the data. Is there any way to tell SAS that the first two columns are already sorted?
If you've previously sorted the dataset by the first 2 variables, then regardless of the sortedby information on the dataset, SAS will take less CPU time to sort it*. This is a natural property of most decent sorting algorithms - it's much less work to sort something that's already nearly sorted.
* As long as you don't use the force option in the proc sort statement, which forces it to do redundant sorting.
Here's a little test I ran:
option fullstimer;
/*Make sure we have plenty of rows with the same 1 + 2 values, so that sorting by 1 + 2 doesn't imply that the dataset is already sorted by 1 + 2 + 3*/
data test;
do _n_ = 1 to 10000000;
var1 = round(rand('uniform'),0.0001);
var2 = round(rand('uniform'),0.0001);
var3 = round(rand('uniform'),0.0001);
output;
end;
run;
/*Sort by all 3 vars at once*/
proc sort data = test out = sort_all;
by var1 var2 var3;
run;
/*Create a baseline dataset already sorted by 2/3 vars*/
/*N.B. proc sort adds sortedby information to the output dataset*/
proc sort data = test out = baseline;
by var1 var2;
run;
/*Sort baseline by all 3 vars*/
proc sort data = baseline out = sort_3a;
by var1 var2 var3;
run;
/*Remove sort information from baseline dataset (leaving the order of observations unchanged)*/
proc datasets lib = work nolist nodetails;
modify baseline (sortedby = _NULL_);
run;
quit;
/*Sort baseline dataset again*/
proc sort data = baseline out = sort_3b;
by var1 var2 var3;
run;
The relevant results I got were as follows:
SAS took 8 seconds to sort the original completely unsorted dataset by all 3 variables.
SAS took 4 seconds to sort by 3/3 starting from the baseline dataset already sorted by 2/3 variables.
SAS took 4 seconds to sort by 3/3 starting from the same baseline dataset after removing the sort information from it.
The relevant metric from the log output is the amount of user CPU time.
Of course, if the almost-sorted dataset is very large and contains lots of other variables, you may wish to avoid the sort due to the write overhead when replacing it. Another approach you could take would be to create a composite index - this would allow you to do things involving by group processing, for example.
/*Alternative option - index the 2/3 sorted dataset on all 3 vars rather than sorting it*/
proc datasets lib = work nolist nodetails;
/*Replace the sort information*/
modify baseline(sortedby = var1 var2);
run;
/*Create composite index*/
modify baseline;
index create index1 = (var1 var2 var3);
run;
quit;
Creating an index requires a read of the whole dataset, as does the sort, but only a fraction of the work involved in writing it out again, and might be faster than a 2/3 to 3/3 sort in some situations.
I have a file with 25 rows like:
Model Cena (zl) Nagrywanie fimow HD Optyka - krotnosc zoomu swiatlo obiektywu przy najkrotszej ogniskowej Wielkosc LCD (cale)
Lumix DMC-LX3 1699 tak 2.5 2 3
Lumix DMC-GH1 + LUMIX G VARIO HD 14-140mm/F4.0-5.8 ASPH./MEGA O.I.S 5199 tak 10 4 3
And I wrote:
DATA lab_1;
INFILE 'X:\aparaty.txt' delimiter='09'X;
INPUT Model $ Cena Nagrywanie $ Optyka Wielkosc_LCD Nagr_film;
f_skal = MAX(Cena - 1500, Optyka - 10, Wielkosc_LCD - 1, Nagr_film - 1) + 1/1000*(Cena - 1500 + Optyka - 10 + Wielkosc_LCD - 1 + Nagr_film - 1);
*rozw = MIN(f_skal);
*rozw = f_skal[,<:>];
PROC SORT;
BY DESCENDING f_skal;
PROC PRINT DATA = lab_1;
data _null_;
set lab_1;
FILE 'X:\aparatyNOWE.txt'; DLM='09'x;
PUT Model= $ Cena Nagrywanie $ Optyka Wielkosc_LCD Nagr_film f_skal;
RUN;
I need to find the lowest value of f_skal and I don't know how because min(f_skal) doesn't work.
In a data step, the min function only looks at one row at a time - if you feed it several variables, it will give you the minimum value of those variables for that row, but you can't use it to compare values across multiple rows (unless you first get data from multiple rows into one row, e.g. via retain or lag).
One way of calculating statistics in SAS across a whole dataset is to use proc means / proc summary, e.g.:
proc summary data = lab_1;
var f_skal;
output out = min_val min=;
run;
This will create a dataset called min_val in your work library, and the value of f_skal in that dataset will be the minimum from anywhere in the dataset lab_1.
If you would rather create a macro variable containing the minimum value, so that you can use it in subsequent code, one way of doing that is to use proc sql instead:
proc sql noprint;
select min(f_skal) into :min_value from lab_1;
quit;
%put Minimum value = &min_value;
In proc sql the behaviour of min is different - here it compares values across rows, the way you were trying to use it.
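To make the contrast concrete, here is a minimal Python sketch of the two behaviours; the variable names and values are hypothetical:

```python
# Sketch of the two 'min' behaviours. Each dict is one observation (row).
rows = [
    {"a": 3.0, "b": 7.0, "f_skal": 5.2},
    {"a": 9.0, "b": 1.0, "f_skal": 4.8},
    {"a": 6.0, "b": 2.0, "f_skal": 6.1},
]

# Data-step style: min() sees one row at a time, across that row's variables
row_mins = [min(r["a"], r["b"], r["f_skal"]) for r in rows]   # per-row minimum

# SQL style: MIN(f_skal) aggregates one column across all rows
column_min = min(r["f_skal"] for r in rows)                   # minimum of the column
```
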
I calculate a ratio for 40 stocks. I need to sort those into three groups high, medium and low based on the value of the ratio. The ratios are fractions of one and there aren't many repetitions. What I need is to create three groups of about 13 stocks each, in group 1 to have the high ratios, in group 2 medium ratios and group 3 low ratios. I have the below code but it just assigns rank 1 to all my stocks.
How can I correct this?
data sourceh.combinedfreq2;
merge sourceh.nonnfreq2 sourceh.nofreq2 sourcet.caps;
by symbol;
ratio=(freqnn/freq);
run;
proc rank data=sourceh.combinedFreq2 out=sourceh.ranked groups=3;
by symbol notsorted;
var ratio;
ranks rank;
run;
If you want to automatically partition into three relatively even groups, you can use PROC RANK (See example using sashelp.stocks):
data have;
set sashelp.stocks;
ratio=high/low;
run;
proc rank data=have out=want groups=3;
by stock notsorted;
var ratio;
ranks rank;
run;
That partitions them into three groups. As long as you have 40 different values (i.e., not a lot of repeats of one value), it will make 3 evenly split groups (with ~13 in each).
In your case, do not use by anything - by creates a separate set of ranks for each by-group (here I'm ranking dates within each stock, but you want to rank the stocks themselves).
I think people are making this more complicated than it needs to be. Let's do this on easy mode.
First, we'll create the dataset and compute our ratios.
Second, We'll sort the data by ratio.
Lastly, we'll assign a group based on observation number.
WARNING! UNTESTED CODE!
/*Make the dataset. I stole this from your code above*/
data sourceh.combinedfreq2;
merge sourceh.nonnfreq2 sourceh.nofreq2 sourcet.caps;
by symbol;
ratio=(freqnn/freq);
run;
/*sort the data so that its ordered by ratio*/
PROC SORT DATA=sourceh.combinedfreq2 OUT=sourceh.combinedfreq2 ;
BY DESCENDING ratio ;
RUN ;
/*Assign a value based on observation number*/
Data sourceh.combinedfreq2;
Set sourceh.combinedfreq2;
length Group $6;
if _N_ <= 13 then Group = "High";
else if _N_ <= 26 then Group = "Medium";
else Group = "Low";
run;
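The sort-then-label steps above can be sketched in a few lines of Python; the symbols and ratios here are fabricated stand-ins for the real data:

```python
# Sketch of the three steps: compute ratios, sort descending, assign a
# group label from the position in the sorted order (13 / 13 / rest).
ratios = {f"STOCK{i:02d}": (i * 7 % 40 + 1) / 41 for i in range(40)}  # fake data

ranked = sorted(ratios, key=ratios.get, reverse=True)  # symbols, highest ratio first

groups = {}
for pos, symbol in enumerate(ranked, start=1):
    if pos <= 13:
        groups[symbol] = "High"
    elif pos <= 26:
        groups[symbol] = "Medium"
    else:
        groups[symbol] = "Low"
# 13 High, 13 Medium, 14 Low
```
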
I have data on exam results for 2 years for a number of students. I have a column with the year, the students name and the mark. Some students don't appear in year 2 because they don't sit any exams in the second year. I want to show whether the performance of students persists or whether there's any pattern in their subsequent performance. I can split the data into two halves of equal size to account for the 'first-half' and 'second-half' marks. I can also split the first half into quintiles according to the exam results using 'proc rank'
I know the output I want is a 5 x 5 table that has the original 5 quintiles on one axis and the 5 subsequent quintiles, plus a 'dropped out' category, so really a 5 x 6 matrix. There will obviously be around 20% of the total number of students in each quintile in the first exam, and if there's no relationship there should be 16.67% in each of the 6 subsequent categories. But I don't know how to proceed to show whether this is the case or not with this data.
How can I go about doing this in SAS, please? Could someone point me towards a good tutorial that would show how to set this up? I've been searching for terms like 'performance persistence' etc., but to no avail...
I've been proceeding like this to set up my dataset. I've added a column with 0 or 1 for the first or second half of the data using the first procedure below. I've also added a column with the quintile rank in terms of marks for all the students. But I think I've gone about this the wrong way. Shouldn't I be dividing the data into quintiles in each half, rather than across the whole two periods?
Proc rank groups=2;
var yearquarter;
ranks ExamRank;
run;
Proc rank groups=5;
var percentageResult;
ranks PerformanceRank;
run;
Thanks in advance.
Why are you dividing the data into quintiles?
I would leave the scores as they are, then make a scatterplot with a loess fit:
proc sgplot data = dataset;
scatter x = year1 y = year2;
loess x = year1 y = year2 / nomarkers;
run;
Here's a fairly basic example of the simple tabulation. I transpose your quintile data and then make a table. Here there is basically no relationship, except that I only allow a 5% drop-out rate, so you get more like 19% 19% 19% 19% 19% 5%.
data have;
do i = 1 to 10000;
do year = 1 to 2;
if year=2 and ranuni(7) < 0.05 then call missing(quintile);
else quintile = ceil(5*ranuni(7));
output;
end;
end;
run;
proc transpose data=have prefix=year out=have_t;
by i;
var quintile;
id year;
run;
proc tabulate data=have_t missing;
class year1 year2;
tables year1,year2*rowpctn;
run;
PROC CORRESP might be helpful for the analysis, though it doesn't look like it exactly does what you want.
proc corresp data=have_t outc=want outf=want2 missing;
tables year1,year2;
run;