Please help me interpret this SAS code (I am quite new to SAS):
DATA sample;
SET sample;
v_eq = mve;
est_v_eq = v_eq;
sig_eq = sige;
WHERE optosey > 0 AND
optprcey > 0;
RUN;
My interpretation: use the "sample" dataset and define "v_eq = mve", "est_v_eq = v_eq", "sig_eq = sige" only for observations that have optosey > 0 AND optprcey > 0. Am I right? What confuses me is why "they" define "v_eq = mve" and then "est_v_eq = v_eq" rather than assigning from mve directly.
Your interpretation is broadly right, and your question is one I'd ask as well. I would say it depends on the purpose of this code; it may be written this way for readability, mirroring how you'd describe the steps if you were stating the purpose of the code in English.
I'd warn that this is fairly bad form, though, in particular this part:
data sample;
set sample;
where ... ;
Normally, when you are doing something irreversible, it is best not to write to the same dataset that you're reading from, since you lose data. Also note that WHERE does not merely restrict which rows the above assignments apply to; it filters the rows coming in, so only rows that satisfy the WHERE condition end up in the output dataset.
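As a minimal sketch of the safer pattern (the toy values and the output name sample_filtered are made up for illustration; only the variable names come from the question), write the filtered result to a new dataset so the original rows survive:

```sas
/* Toy input; the values are invented for illustration */
data sample;
  input mve sige optosey optprcey;
  datalines;
100 0.2 5 1.5
200 0.3 0 2.0
300 0.4 7 0
400 0.5 3 2.5
;
run;

/* Read from sample, write to a different dataset (hypothetical name) */
data sample_filtered;
  set sample;
  /* WHERE filters incoming rows: only rows 1 and 4 pass both conditions */
  where optosey > 0 and optprcey > 0;
  v_eq = mve;
  est_v_eq = v_eq;
  sig_eq = sige;
run;
```

Here sample still holds all four rows afterwards, while sample_filtered holds only the two that passed the WHERE condition.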
Long time reader first-time questioner.
Using SAS Data Integration studio, when you create a summary transformation in the table options advanced tab you can add a where statement to your code automatically. Unfortunately, it adds some code that makes this resolve incorrectly. Putting the following in the where text box:
TESTFIELD = "TESTVALUE"
creates
%let _INPUT_options = %nrquote(WHERE = %(TESTFIELD = %"TESTVALUE%"%));
In the generated code, this is used as
proc tabulate data = &_INPUT (&_INPUT_options)
but it resolves to
WHERE = (TESTFIELD = "TESTVALUE")
_
22
ERROR: Syntax error while parsing WHERE clause. ERROR 22-322: Syntax
error, expecting one of the following: a name, a quoted string, a
numeric constant, a datetime constant,
a missing value, (, *, +, -, :, INPUT, NOT, PUT, ^, ~.
My question is this: Is there a way to add a function to the where statement box that would allow this quotation mark to be properly added here?
Note that all functions get the preceding % when added to the where statement automatically and I have no control over that. This seems like something that should be relatively easy to fix but I haven't found a simple way yet.
The % are simply escaping the " and () characters; they're perfectly harmless, really. The bigger problem is the %NRQUOTE "quotes" (which are nonprinting characters that tell SAS this is macro-quoted); they mess up the WHERE processing.
Use %UNQUOTE( ... ) to remove these.
Example:
data have;
testfield="TESTVALUE";
output;
testfield="AMBASDF";
output;
run;
%let _INPUT_options = %nrquote(WHERE = %(TESTFIELD = %"TESTVALUE%"%));
%put &=_input_options;
data want;
set have(%unquote(&_INPUT_options.));
run;
Thank you all for your responses. Long story short, I ended up creating a SAS Troubleshooting ticket. The analyst told me that they have now documented the issue, which should now be resolved in a future iteration of DI.
The temporary solution was to create a new transformation with a slight alteration: adding a %UNQUOTE (as mentioned above by Joe) around the input options in the source code:
proc tabulate data = &_INPUT (%unquote(&_INPUT_options)) %unquote(&procOptions);
For those interested you will need to create the transformation in a public subfolder of your project so others can use it. Not what I was hoping for, but a workable solution while waiting for the version update.
I am trying to build a custom transformation in SAS DI. This transformation will "act" on columns in an input data set, producing the desired output. For simplicity let's assume the transformation will use input_col1 to compute output_col1, input_col2 to compute output_col2, and so on up to some specified number of columns to act on (let's say 2).
In the Code Options section of the custom transformation users are able to specify (via prompts) the names of the columns to be acted on; for example, a user could specify that input_col1 should refer to the column named "order_datetime" in the input dataset, and either make a similar specification for input_col2 or else leave that prompt blank.
Here is the code I am using to generate the output for the custom transformation:
data cust_trans;
set &_INPUT0;
i=1;
do while(i<3);
call symputx('index',i);
result = myfunc("&&input_col&index");
output_col&index = result; /*what is proper syntax here?*/
i = i+1;
end;
run;
Here myfunc refers to a custom function I made using proc fcmp which works fine.
The custom transformation works fine if I do not try to take into account the variable number of input columns to act on (i.e. if I use "&&input_col&i" instead of "&&input_col&index" and just use the column result on the output table).
However, I'm having two issues with trying to make the approach more dynamic:
I get the following warning on the line containing
result = myfunc("&&input_col&index"):
WARNING: Apparent symbolic reference INDEX not resolved.
I do not know how to have the assignment to the desired output column happen dynamically; i.e., depending on the iteration of the do loop I'd like to assign the output value to the corresponding output column.
I feel confident that the solution to this must be well known amongst experts, but I cannot find anything explaining how to do this.
Any help is greatly appreciated!
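You can't use macro variables that depend on data step variables in this manner. Macro variable references are resolved when the data step is compiled, before any rows are processed, whereas CALL SYMPUTX only runs at execution time; that is why &index is not yet defined when the reference is resolved.
So you either have to use a macro loop such as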
%do i = 1 %to .. ;
which is fine if you're in a macro (it won't work outside of an actual macro), or you need to use an array.
data cust_trans;
set &_INPUT0;
array in[2] &input_col1 &input_col2; *or however you determine the input columns;
array output_col[2]; *automatically names the results;
do i = 1 to dim(in);
result = myfunc(in[i]); *You quote the input - I cannot see what your function is doing, but it is probably wrong to do so;
output_col[i] = result; /* assigns to output_col1, output_col2, ... in turn */
end;
run;
That's the way you'd normally do that. I don't know what myfunc does, and I also don't know why you quote "&&input_col&index." when you pass it in; that would be a strange way to operate unless you want the name of the input column as text (and don't care what data is in that variable). If you do, then pass vname(in[i]), which returns the name of the variable as a character string.
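For completeness, the %do alternative mentioned above might look roughly like this. It is only a sketch: the demo dataset, the %let assignments, and the macro name cust_trans are stand-ins for what DI Studio and the prompts would supply, and sqrt() stands in for myfunc, which I can't see.

```sas
/* Hypothetical stand-ins for the DI Studio input table and prompts */
data work.demo_in;
  a = 4;
  b = 9;
run;
%let _INPUT0 = work.demo_in;
%let input_col1 = a;
%let input_col2 = b;

%macro cust_trans(n=2);
  data cust_trans;
    set &_INPUT0;
    /* The macro loop runs before the data step compiles and generates */
    /* one assignment statement per column                             */
    %do i = 1 %to &n;
      /* sqrt() is a placeholder for the custom myfunc */
      output_col&i = sqrt(&&input_col&i);
    %end;
  run;
%mend cust_trans;

%cust_trans(n=2)
```

On each pass, &&input_col&i resolves first to &input_col1 and then to the column name, so the generated statements are plain data step code. If a prompt is left blank, the generated call would have no argument, so you'd need to guard against that.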
I've come across a program that uses not = as if it had the same meaning as ne or ^=. It seems to work fine, and doesn't raise so much as a note to log. But I can't find any official documentation confirming that this is supported syntax.
Is not = really the same as ne?
Yes. Look at the documentation here http://support.sas.com/documentation/cdl/en/lrcon/67885/HTML/default/viewer.htm#p00iah2thp63bmn1lt20esag14lh.htm
data _null_;
x = 1;
if x not = 0 then
put x=;
run;
Please refer to the link below:
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a001001956.htm
Hope this helps
I am trying to write a data step in SAS, for later use with proc rank, that creates six groups (the group variable) of eight subjects each (the subject variable) with a random number assigned to each subject (the cohort variable, which is used later in proc rank). This is pretty straightforward except I want to have my subjects numbered 1-48 while still being split into six groups (A, B, C, etc.). Just writing a nested do loop would be fine for having groups A, B, etc. each containing a subject 1 through subject 8, but I want group A to have subjects 1-8, B to have 9-16, and so on. Right now, I have the following code to do this:
data treatment;
do group = 'A', 'B', 'C', 'D', 'E', 'F';
if group = 'A' then do subject = 1 to 8;
cohort = ranuni(1234);
output;
end;
else if group = 'B' then do subject = 9 to 16;
cohort = ranuni(1234);
output;
end;
else if group = 'C' then do subject = 17 to 24;
cohort = ranuni(1234);
output;
end;
else if group = 'D' then do subject = 25 to 32;
cohort = ranuni(1234);
output;
end;
else if group = 'E' then do subject = 33 to 40;
cohort = ranuni(1234);
output;
end;
else if group = 'F' then do subject = 41 to 48;
cohort = ranuni(1234);
output;
end;
end;
run;
This does work, but it's a mess. Is there a way to have "subject" index from 1 to 8 for group A, then 9 to 16 for group B, and so on, WITHOUT having all the conditionals? I imagine there are other tools in SAS (macros? proc sql?) that would be much easier to work with, but I'm limited to do loops in the data step right now.
Disclaimer: This is for a homework assignment for a first-year SAS class. My code is working and does what I need it to do right now (and I'll submit it as-is if I can't figure anything else out), but I know it's extremely inefficient and I can't seem to find anything on how to get rid of all these if-else statements. (It is possible I just don't know what to search for--I've read several pages on using nested do loops, but nothing that would seem to help with my problem. Everything here seems to concern do loops in macros, and I'm not there yet.)
I do not want my code rewritten entirely--it's homework, I need to do it myself!--but I would appreciate any pointers in the right direction, even if they're just search terms. I'm completely stuck on what I'd need to look up to get this to work at this point.
There are, as you expect, a million and a half ways to solve this problem in SAS.
So I assume you want a dataset where you have
A 1
A 2
A 3
A 4
A 5
A 6
A 7
A 8
B 9
B 10
B 11
...
F 48
plus some random piece afterwards. The way I'd do that is to calculate the pieces separately.
You in effect have a single loop, which is 1 to 48, where the A-F grouping is effectively applied to the loop, right? So you should try to structure it this way:
data want;
do subject = 1 to 48;
group=<logic to determine group>;
cohort=<logic to determine cohort>;
output;
end;
run;
There are a few different ways to do <logic to determine group>; the 'worst' way is a series of if statements, ie:
if subject le 8 then group='A';
else if subject le 16 then group='B';
...
else group='F';
There are several good options I could see for determining this in one single statement without conditional logic. If you want to figure this out for yourself, do so; if you want a hint or an explanation, comment such and I'm happy to explain how I'd do it, but I think it's better left unsaid for now (particularly as the exact method might depend on what you've learned to date).
A second option is to not use a loop for your subject at all, but a counter.
do class='A','B',...;
subjID+1;
cohort=...;
end;
That is basically how you would keep an 'external to the loop' counter; it's not a true programming loop itself, but it allows you to keep track of the ID. This is something you'll very commonly see used in other locations, and may be what your instructor was getting at. In your particular example I prefer the single loop 1:48 solution, as it avoids quite so much hardcoding of letters, but this is a common solution as well.
One side note: I strongly recommend not learning ranuni and instead learning to use the rand function. ranuni is based on an inferior PRNG; rand is strictly superior, and also has the bonus advantage that you don't have to keep uselessly repeating the seed (as the seed doesn't actually have any effect after the first call!). If your teacher has instructed you to use ranuni, I suggest learning both and only including ranuni in homework assignments that are submitted back to class. If your teacher is interested in learning why, Rick Wicklin has a good explanation here.
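A minimal sketch of the rand pattern, for comparison (the seed 1234 is just carried over from your code): the seed is set once with call streaminit, and rand is then called without any seed argument.

```sas
data _null_;
  /* Set the seed once, before the first call to rand */
  call streaminit(1234);
  do i = 1 to 3;
    /* Uniform draw on (0,1); no seed argument needed */
    u = rand('uniform');
    put u=;
  end;
run;
```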
If you really like the double loop, there is a way to do this with two loops - but it requires the same basic concept that the above 1:48 loop does. (Don't read further if you want a completely spoiler free attempt at solving the first problem.) To read the spoiler, click 'improve' or 'edit' on this answer, as I hid it in angle braces (why doesn't SO have spoiler tags ...)
Background: When we test the significance of a categorical variable that has been coded as dummy variables, we need to simultaneously test all dummy variables are 0. For example, if X takes on values of 0, 1, 2, 3 and 4, I would fit dummy variables for levels 1-4 (assuming I want 0 to be baseline), then want to simultaneously test B1=B2=B3=B4=0.
If this is the only variable in my data set, I can use the overall F-statistic to achieve this. However, if I have other covariates, the overall F-test doesn't work.
In Stata, for example, this is (very, very) simply carried out by the testparm command as:
testparm i.x (after fitting the desired regression model), where the i. prefix tells Stata that X is a categorical variable to be treated as dummy variables.
Question/issue: I'm wondering how I can do this in SAS with a CONTRAST (or ESTIMATE?) statement while fitting a regression model with PROC GLM. Since I have scoured the internet and haven't found what I'm looking for, I'm guessing I'm missing something very obvious. However, all of the examples I've seen are NOT for categorical (class) variables, but rather two separate (say continuous) variables. The contrast statement in that case would simply be something like
CONTRAST 'Contrast1' y 1 z 1;
Otherwise, they're for calculating hypotheses like H_0: B1-B2=0.
I feel like I need to break down the hypothesis into smaller pieces and determine the set that defines the whole relationship, but I'm not doing it correctly. For example, for B1=B2=B3=B4=0, I thought I might say B1=B2=B3=-B4, then define (1) B1=-B4, (2) B2=-B4 and (3) B2=B3. I was trying to code this as a CONTRAST statement as (say X is in descending order in the data set: 4-0):
CONTRAST 'Contrast' x -1 0 0 1 0
x -1 0 1 0 0
x 0 1 1 0 0;
I know this is not correct, and I tried many, many variations and whatever random logic I could come up with. My problem is I have relatively novice-level knowledge of CONTRAST (and unfortunately have not found great documentation to help with this) and also of how this hypothesis test should really be formulated for the sake of estimation (do I try to split it up into pieces as I did above, or...?).
From my note above, you actually can get SAS to do this for you with PROC GENMOD and the CLASS statement and a TYPE3 specification.
proc genmod data=input;
class classvar ;
model slope= classvar othervar/ type3;
run;
quit;
In the example above, my class levels are in the classvar variable. The othervar is my other covariate.
At the end of the output, you see a table labeled LR Statistics For Type 3 Analysis. The row for classvar is the LR test of all the class effects=0.
Another option is PROC REG with a TEST statement (e.g., TEST x1=0, x2=0, x3=0, x4=0). That doesn't answer my initial question about PROC GLM, but it works if PROC REG gets the job done for your type of model.
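For PROC GLM itself, one way to state the joint hypothesis is a single CONTRAST with several rows separated by commas, each row comparing one level of x with another. The sketch below uses simulated data, and the names demo, y, x, and z are hypothetical; it assumes x is on the CLASS statement with its five levels in the default sorted order 0-4.

```sas
/* Simulated data for illustration only */
data demo;
  call streaminit(2024);
  do x = 0 to 4;
    do rep = 1 to 20;
      z = rand('normal');
      y = 1 + 0.5*x + 0.3*z + rand('normal');
      output;
    end;
  end;
run;

proc glm data=demo;
  class x;
  model y = x z / solution;
  /* Four independent differences between levels of x, tested jointly: */
  /* together they say that all five levels have the same effect       */
  contrast 'x overall (4 df)' x 1 -1  0  0  0,
                              x 1  0 -1  0  0,
                              x 1  0  0 -1  0,
                              x 1  0  0  0 -1;
run;
quit;
```

With x as a CLASS effect, the F test from this contrast should match the Type III test for x that PROC GLM already prints, so the CONTRAST statement is mainly useful when you want the test stated explicitly or want to test only a subset of levels.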