I am used to creating count variables within a group, where the count goes up by 1 on each row, using:
data objective ;
set eg ;
count + 1 ;
by id age ;
if first.age then count = 1 ;
run ;
However, I would like to do the reverse, i.e. the first value of age in each id group gets a value of 10 and each subsequent line gets a value 1 less than the preceding line (the desire column below):
data eg ;
input id age desire ;
cards;
1 5 10
1 4 9
1 3 8
1 2 7
1 1 6
2 10 10
2 9 9
2 8 8
2 7 7
2 6 6
2 5 5
2 4 4
2 3 3
2 2 2
2 1 1
3 7 10
3 6 9
3 5 8
3 4 7
3 3 6
3 2 5
3 1 4
;
run;
data objective ;
set eg ;
count - 1 ;
by id age ;
if first.age_ar then count = 10 ;
run ;
Is there a way to do this? The statement count - 1 is not recognised.
You can add -1 without using retain as follows:
data objective;
set eg;
count + -1;
by id descending age;
if first.id then count = 10;
run;
Try this (see comments in code for explanation):
data objective ;
retain count 10; /*retain the last value of count across observations; the initial value 10 is optional*/
set eg ;
count=count - 1 ; /*count - 1 on its own is not a valid statement, but count=count-1 works when count is retained*/
by id age notsorted;/*notsorted because age is in descending order*/
if first.id then count = 10 ;/*not sure why you had age_ar here; it should be first.id to get your desired output*/
run ;
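Both answers rely on the same mechanism: a value that survives from one observation to the next. As a rough sketch (the dataset name check and the variables count2 and agree are made up for illustration, and only the first id group of the sample data is reused), the sum statement and an explicit RETAIN can be run side by side to see that they agree with desire:

```sas
data eg;
  input id age desire;
  cards;
1 5 10
1 4 9
1 3 8
1 2 7
1 1 6
;
run;

data check;
  set eg;
  by id descending age;
  retain count2;              /* explicit retain, for comparison */
  if first.id then do;
    count = 10;               /* restart at 10 for each id */
    count2 = 10;
  end;
  else do;
    count + -1;               /* sum statement: count is retained implicitly */
    count2 = count2 - 1;      /* plain assignment: needs the RETAIN above */
  end;
  agree = (count = count2 and count = desire);  /* 1 when all three match */
run;
```

If agree is 1 on every row, the two forms are interchangeable for this data.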
From this data, I need to draw a subset in which each group makes up a certain percentage.
For example,
Obs Group Score
1 A 1
2 A 2
3 B 1
4 B 1
5 C 3
6 C 1
7 C 1
8 A 1
9 A 3
10 A 1
11 A 2
12 B 3
13 C 2
I would need to subset 10 obs.
The sample must consist of all groups, and score of 1 takes higher priority.
Each group is given certain percent.
Let's say 50% for A, 20% for B and 30% for C.
I tried using proc surveyselect but it failed: the number of alloc values is not the same as the number of strata.
proc surveyselect data=example out=test sampsize=10;
strata group score/alloc=(0.5 0.2 0.3);
run;
I don't know proc surveyselect very well, so here is a data step version.
data have;
input Obs Group$ Score;
cards;
1 A 1
2 A 2
3 B 1
4 B 1
5 C 3
6 C 1
7 C 1
8 A 1
9 A 3
10 A 1
11 A 2
12 B 3
13 C 2
;
run;
proc sort;
by Group Score;
run;
data want;
array _Dist_[3]$ _temporary_('A','B','C');
array _Upper_[3] _temporary_(5,2,3);
array _Count_[3] _temporary_;
do i = 1 to rec;
set have nobs=rec point=i;
do j = 1 to dim(_Dist_);
_Count_[j] + (Group=_Dist_[j]);
if _Count_[j] <= _Upper_[j] and Group = _Dist_[j] then output;
end;
end;
stop;
drop j;
run;
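The same selection can also be written without POINT= and the nested loop. The sketch below only illustrates the idea: it sorts by Group and Score as above, hard-codes the per-group limits of 5, 2 and 3 (50%, 20% and 30% of 10), and the names want2, n and limit are invented for the example.

```sas
data have;
  input Obs Group $ Score;
  cards;
1 A 1
2 A 2
3 B 1
4 B 1
5 C 3
6 C 1
7 C 1
8 A 1
9 A 3
10 A 1
11 A 2
12 B 3
13 C 2
;
run;

proc sort data=have;
  by Group Score;             /* lowest scores first within each group */
run;

data want2;
  set have;
  by Group;
  if first.Group then n = 0;  /* restart the counter for each group */
  n + 1;
  select (Group);             /* assumed per-group quotas */
    when ('A') limit = 5;
    when ('B') limit = 2;
    when ('C') limit = 3;
    otherwise limit = 0;
  end;
  if n <= limit;              /* keep only the first rows of each group */
  drop n limit;
run;
```

Because the data are sorted by Score within Group, rows with a score of 1 are kept first.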
I have an array t that specifies numbers of rows that I want to read from file.txt. So my code should look like this:
data a;
do i = 1 to dim(t);
infile "C:\sas\file.txt" firstobs = t(i) obs = t(i);
input x1-x10;
output;
end;
run;
Of course, this solution (firstobs) works only if the row number is a constant. How can I do this using an array (which is also read from the same file, from its first row)?
For example if the file.txt looked like this:
2 4 6 . . . . . . .
2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6
Then I want the output to be:
2 2 2 2 2 2 2 2 2 2
4 4 4 4 4 4 4 4 4 4
6 6 6 6 6 6 6 6 6 6
Here's an answer similar to Tom's, but which does not attempt to read in off-path data. This may be superior for cases where your skipped rows have data which are not formatted in the same manner as your on-path data. It uses Tom's parmcards and structure so you can more easily see the differences.
options parmcards=tempdata ;
filename tempdata temp;
parmcards;
2 4 6 . . . . . . .
2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6
;
%let ncol=9 ;
%let maxrows=1000;
data want ;
infile tempdata truncover end=eof;
array rows (&maxrows) _temporary_;
do i=1 by 1 until (rows(i)=.); *read in first line, just like Toms answer;
input rows(i) @;
drop i;
end;
input ; * stop inputting on the first line;
* Here you may need to use CALL SORTN to sort row array if it is not already sorted;
_currow = 2; * current row indicator;
do _i = 1 to dim(rows); * iterate through row array;
if rows[_i]=. then leave; * leave if row array is empty;
do while (_currow lt rows[_i] and not eof); * skip rows not in row array;
input;
_currow = _currow + 1;
end;
input x1-x&ncol; * now you know you are on a desired row, so input it;
output; * and output it;
_currow = _currow + 1;
end;
stop; * all wanted rows have been read, so do not start another iteration;
run;
As noted above, you may have to use CALL SORTN if the array is not already sorted (i.e., if the missings are not at the end and the numbers are out of order).
Sounds like the first row contains the list of rows to keep. It would probably be easier to read that from a separate file, but you could make it work with a single file. You did not mention how to know the number of columns of data or the maximum number of row numbers that could be in the first row. For now let's assume that you can set these numbers in macro variables.
Let's get your example data into a file:
options parmcards=tempdata ;
filename tempdata temp;
parmcards;
2 4 6 . . . . . . .
2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6
;
Now let's read it into a dataset.
%let ncol=9 ;
%let maxrows=1000;
data want ;
infile tempdata truncover ;
array rows (&maxrows) _temporary_;
if _n_=1 then do i=1 by 1 until (rows(i)=.);
input rows(i) @;
drop i;
end;
else do;
input x1-x&ncol;
if whichn(_n_,of rows(*)) then output;
end;
run;
If the other rows of the file have invalid data such that the INPUT statement would cause errors you can skip trying to read the data from those rows with a minor modification in the ELSE block.
else do;
input @;
if whichn(_n_,of rows(*)) then do;
input x1-x&ncol;
output;
end;
end;
If you find that you frequently want to not read lots of records at the end of the file you could add this line to the end of the data step to stop when you have read past the last line you want.
if _n_ > max(of rows(*)) then stop;
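Putting those pieces together, a minimal sketch might look like the following. It reads the example inline with DATALINES instead of a file, and the array size of 10 and the 9 data columns are assumptions taken from the example rather than anything general.

```sas
data want;
  infile datalines truncover;
  array rows (10) _temporary_;          /* assumed maximum of 10 row numbers */
  if _n_ = 1 then do i = 1 to dim(rows);
    input rows(i) @;                    /* read the row list from line 1 */
  end;
  else do;
    input @;                            /* load the line without parsing it */
    if whichn(_n_, of rows(*)) then do;
      input x1-x9;                      /* parse only the wanted lines */
      output;
    end;
    if _n_ > max(of rows(*)) then stop; /* past the last wanted line */
  end;
  drop i;
datalines;
2 4 6 . . . . . . .
2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6
;
```

With this input the step keeps lines 2, 4 and 6 of the data.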
If your file is structured (i.e. same delimiter/one continuous 'row' of input data) then the approach below should work. I'm sure you can tweak it to make it a bit more efficient, but I put some comments in to explain what each section is doing. I also suggest reading through the infile documentation for an explanation of the _infile_ automatic variable and other ways to manipulate the input data buffer. Also, if your input data file needs to be split up into individual rows itself then you will need to adjust for that.
filename in_data 'C:\sas\file.txt';
data out_data (keep=x1-x10);
infile in_data;
input fn;
/*get the number of vars based on delimiter*/
count = count(strip(_infile_), ' ') + 1;
/*iterate through vars*/
do i =1 to count;
/*set new value to current var*/
rec = scan(strip(_infile_), i, ' ');
/*set array values to new value*/
array obs(10) x1-x10;
do j=1 to dim(obs);
obs(j) = rec;
end;
/*output to dataset*/
output out_data;
end;
run;
Input
2 4 6 7 8 9 10 11 2 3
Output
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
2 2 2 2 2 2 2 2 2 2
4 4 4 4 4 4 4 4 4 4
6 6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8 8 8
9 9 9 9 9 9 9 9 9 9
10 10 10 10 10 10 10 10 10 10
11 11 11 11 11 11 11 11 11 11
2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3
Hope that this helps.
OK, I figured it out. Assuming I know the number of columns (10) and number of rows (10) I can get what I wanted using the following code:
data a;
w=1;
infile "C:\sas\file.txt" n=10;
input #w x1-x10;
array x(*) x1-x10;
array t(10) _temporary_;
do i=1 to 10;
if(x(i)^=.) then t(i)=x(i);
else leave;
end;
do j=1 to i-1;
w=t(j);
input #w x1-x10;
output;
end;
stop;
run;
What is left is to do the same without knowing numbers of rows and columns. This way I only read the rows I'm interested in as opposed to reading all rows and only outputting the ones I need.
It would probably be a much easier program to maintain if you just read the whole matrix into a dataset and then used the row numbers to pick the data you want. Your file would probably need to have hundreds of thousands of observations before the time saved is worth the programming effort of avoiding reading the full file.
Here is one way using the POINT= option of the SET statement to select the rows.
options parmcards=tempdata ;
filename tempdata temp;
parmcards;
2 4 6
2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6
;
data rows;
infile tempdata obs=1 ;
input row @@;
row=row-1;
run;
proc import datafile="%sysfunc(pathname(tempdata))" dbms=dlm out=full replace;
getnames=no;
delimiter=' ';
datarow=2;
run;
data want ;
set rows ;
pointer=row ;
set full point=pointer ;
run;
proc print; run;
I would like to know if it's possible to select the 5 minimum or maximum values by rows with IML ?
This is my code :
Proc iml ;
use table;
read all var {&varlist} into matrix ;
n=nrow(matrix) ; /* n=369 here*/
p=ncol(matrix); /* p=38 here*/
test=J(n,5,.) ;
Do i=1 to n ;
test[i,1]=MIN(matrix[i,]);
End;
Quit ;
So I would like to obtain a matrix test whose 1st column contains the minimum value of each row, whose 2nd column contains the minimum of the row excluding that 1st value, and so on.
If you have any idea! :)
Even if it's not with IML (but with Base SAS, SQL, ...).
So for example :
Data test; input x1-x8 ; cards;
1 9 8 7 3 4 2 6
9 3 2 1 4 7 12 -2
;run;
And I would like to obtain the results sorted by row:
1 2 3 4 6 7 8 9
-2 1 2 3 4 7 9 12
in order to select my 5 minimum values in another table :
y1 y2 y3 y4 y5
1 2 3 4 6
-2 1 2 3 4
Read the article "Compute the kth smallest data value in SAS"
Define the modules as in the article. Then use the following:
have = {1 9 8 7 3 4 2 6,
9 3 2 1 4 7 12 -2};
x = have`; /* transpose */
ord = j(5,ncol(x));
do j = 1 to ncol(x);
ord[,j] = ordinal(1:5, x[,j]);
end;
print ord;
If you have missing values in your data and want to exclude them, use the SMALLEST module instead of the ORDINAL module.
You can use call sort() in PROC IML to sort a column. Because you want to separate the columns and not sort the whole matrix, extract the column, sort it, and then update the original.
You want to sort rows, so transpose your matrix, do the sorting, and then transpose back.
proc iml;
have = {1 9 8 7 3 4 2 6,
9 3 2 1 4 7 12 -2};
print have;
n = nrow(have);
have = have`; /*Transpose because sort works on columns*/
do i=1 to n;
tmp = have[,i];
call sort(tmp,1);
have[,i]=tmp;
end;
have = have`;
want = have[,1:5];
print want;
quit;
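Since the question also allows Base SAS, here is a small sketch of the same idea in a data step, using CALL SORTN to sort each row's values in place. The dataset name want_base and the variables y1-y5 are chosen here for illustration only.

```sas
data test;
  input x1-x8;
  cards;
1 9 8 7 3 4 2 6
9 3 2 1 4 7 12 -2
;
run;

data want_base;
  set test;
  array x[8] x1-x8;
  array y[5] y1-y5;
  call sortn(of x[*]);     /* sorts the values of this row in ascending order */
  do i = 1 to 5;
    y[i] = x[i];           /* the five smallest values */
  end;
  keep y1-y5;
run;

proc print data=want_base; run;
```

CALL SORTN places missing values first, so rows containing missing values would need extra handling before taking the first five.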
data have;
input patient level timepoint;
datalines;
1 0 1
1 0 2
1 0 3
1 3 4
1 0 5
1 0 6
2 0 1
2 4 2
2 0 3
2 3 4
2 0 5
2 0 6
2 0 7
2 2 8
2 0 9
2 0 10
3 3 1
3 0 2
3 0 3
4 0 1
4 0 2
4 0 3
4 0 4
4 1 5
4 0 6
4 0 7
4 0 8
4 0 9
4 0 10
;;
proc print; run;
/*
Condition 1: If there is one non-zero numeric value, in level, sorted by timepoint for a patient, set level to 2.5 for the record that is immediately prior to this time point; and set level = 1.5 for the next prior time point; set level to 2.5 for the record that is immediate post this time point; and set level to 1.5 for the next post record. The levels by timepoint should look like, ... 1.5, 2.5, non-zero numeric value, 2.5, 1.5 ... (Note: ... are kept as 0s).
Condition 2: If there are two or more non-zero numeric values, in level, sorted by timepoint for a patient, find the FIRST non-zero numeric value, and set level to 2.5 for the record that is immediate prior this time point; and set level to 1.5 for the next prior time point; then find the LAST non-zero numeric value record, set level to 2.5 for the record that is immediate post this last non-zero numeric value, and set level to 1.5 for the next post record; Set all zero values (i.e. level=0) to level = 2.5 for records between the first and last non-zero numeric values; The levels by timepoint should look like: ... 1.5, 2.5, FIRST Non-zero Numeric value, 2.5, Non-zero Numeric value, 2.5, LAST Non-zero Numeric value, 2.5, 1.5 ....
*/
I've tried data steps using N-1, N-2, N+1, N+2, and arrays/do loops (my first thought was to use multiple arrays so that I could use the index i to go to the previous or next records, i-1/i+1 or i-2/i+2, but it was hard to grasp how to even code it). All of this has to be done BY patient, so there may be instances where there is only one record before the first non-zero value and not two. The same could be true for the records after it as well. I searched all different types of examples and help, but found none that fit my needs. Thanks in advance for any help.
This is how I want the data to look:
data want;
input patient level timepoint;
datalines;
1 0 1
1 1.5 2
1 2.5 3
1 3 4
1 2.5 5
1 1.5 6
2 2.5 1
2 4 2
2 2.5 3
2 3 4
2 2.5 5
2 2.5 6
2 2.5 7
2 2 8
2 2.5 9
2 1.5 10
3 3 1
3 2.5 2
3 1.5 3
4 0 1
4 0 2
4 1.5 3
4 2.5 4
4 1 5
4 2.5 6
4 1.5 7
4 0 8
4 0 9
4 0 10
;;
proc print; run;
I approached this by first finding the timepoints of the first and last non-zero levels. Then I merged those into the original set, and changed levels based on the rules you mentioned.
proc sort data = have;
by patient timepoint;
run;
data have2;
retain first 0 last 0;
set have;
by patient timepoint;
if level ne 0 and first = 0 then first = timepoint;
if level ne 0 then last = timepoint;
if last.patient then do;
output;
first = 0;
last = 0;
end;
keep patient first last;
run;
proc sort data=have2;
by patient;
run;
data merged;
merge have have2;
by patient;
if level = 0 then do;
if first-timepoint = 1 then level = 2.5;
if first-timepoint = 2 then level = 1.5;
if last-timepoint = -1 then level = 2.5;
if last-timepoint = -2 then level = 1.5;
if first < timepoint < last then level = 2.5;
end;
drop first last;
run;
I have a data set that has a person's name and how many times they scored a 1-10. For example, Bob scored 7 1s, 8 2s, and 7 4s, but did not receive any other scores.
Name 1 2 3 4 5 6 7 8 9 10
Bob 7 8 0 7 0 0 0 0 0 0
Hal 9 3 1 0 0 0 0 0 0 0
I want a data set that has a row for Bob that looks like this
Bob 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 4 4 4 4 4 4 4
Hal 1 1 1 1 1 1 1 1 1 2 2 2 3
I'm doing this in SAS by the way.
I know I can write a macro to create variables named score1, score2, ..., scoreN.
I am having trouble populating the cells. Any help would be appreciated. Thanks.
Such things (changing the structure of the dataset) are sometimes easier to do with PROC TRANSPOSE:
data have;
input Name $ v1 v2 v3 v4 v5 v6 v7 v8 v9 v10;
datalines;
Bob 7 8 0 7 0 0 0 0 0 0
;
run;
/*convert original wide dataset into long one*/
proc transpose data=have out=have_long;
var v:;
by Name;
run;
data want;
set have_long;
substr(_NAME_,1,1)=""; *to get rid of first 'v' in variables' names;
do i=1 to COL1;
new_var=_NAME_;
output;
end;
drop _NAME_ COL1 i;
run;
/*convert back to wide dataset*/
proc transpose data=want out=want(drop=_NAME_);
var new_var;
by Name;
run;
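For comparison, the same expansion can be sketched with arrays in a single data step. This is only an illustration under stated assumptions: the upper bound of 30 scores per person is arbitrary, the output dataset name want_wide and the helper variables k, s and r are invented here, and the Bob row is written to match the description in the question (seven 1s, eight 2s, seven 4s).

```sas
data have;
  input Name $ v1-v10;
  datalines;
Bob 7 8 0 7 0 0 0 0 0 0
Hal 9 3 1 0 0 0 0 0 0 0
;
run;

data want_wide;
  set have;
  array v[10] v1-v10;
  array score[30];         /* creates score1-score30; 30 is an assumed maximum */
  k = 0;                   /* position of the next score to fill */
  do s = 1 to 10;          /* s is the score value */
    do r = 1 to v[s];      /* repeat it as many times as it was received */
      k + 1;
      score[k] = s;
    end;
  end;
  drop v1-v10 k s r;
run;
```

Unused positions stay missing, so each row has as many filled score variables as that person has scores.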