SAS - How to determine the number of variables in a used range?

I imagine what I'm asking is pretty basic, but I'm not entirely certain how to do it in SAS.
Let's say that I have a range of variables, or an array, x1-xn. I want to be able to run a program that uses the number of variables within that range as part of its calculation. But I want to write it in such a way that, if I add variables to that range, it will still function.
Essentially, I want to be able to create a variable that if I have x1-x6, the variable value is '6', but if I have x1-x7, the value is '7'.
I know that
var1=n(of x1-x6);
will return the number of non-missing numeric variables, but I want this to work even if there are missing values.
I hope I explained that clearly and that it makes sense.

Couple of things.
First off, when you put a range like you did:
x1-x7
That will always evaluate to seven items, whether or not those variables exist. That simply evaluates to
x1 x2 x3 x4 x5 x6 x7
So it's not very interesting to ask how many items are in that, unless you're generating that through a macro (and if you are, you probably can have that macro indicate how many items are in it).
But the range x1--x7 or x: both are more interesting problems, so we'll continue.
The easiest way to do this, if the variables are all of a single type (even if you don't know which type), is to create an array and then use the dim function.
data _null_;
/* every variable in the array must be the same type (all numeric or all character) */
array _temp x1-x7;
count = dim(_temp);
put count=;
run;
That doesn't work, though, if there are multiple types (numeric and character) at hand. If there are, then you need to do something more complex.
The next easiest solution is to combine nmiss and n. This works if they're all numeric, or if you're tolerant of the log messages this will create.
data _null_;
x3='ABC';
count = nmiss(of x1-x7) + n(of x1-x7);
put count=;
run;
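nmiss counts the missing values and n counts the nonmissing numeric values, so their sum is the total number of variables. Here x3 is counted in the nmiss group, because 'ABC' cannot be converted to a number.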
Unfortunately, there is not a c version of n, or we'd have an easier time with this (combining c and cmiss). You could potentially do this in a macro function, but that would get a bit messy.
Fortunately, there is a third option that is tolerant of character variables: combining countw with catq:
data _null_;
x3='ABC';
x4=' ';
count = countw(catq('dm','|',of x1-x7),'|','q');
put count=;
run;
This will count all variables, numeric or character, with no conversion notes.
What you're doing here is concatenating all of the variables with a delimiter between them, so [x1]|[x2]|[x3]..., and then counting the number of "words" in that string, where a word is anything delimited by "|". Even missing values produce something, so .|.|ABC|.|.|.|. has 7 "words".
The 'm' argument to CATQ tells it to even include missing values (spaces) in the concatenation. The 'q' argument to COUNTW tells it to ignore delimiters inside quotes (which CATQ adds by default).
If you use a version from before CATQ was available (I believe it was added sometime in 9.2), you can use CATX instead, but you lose the modifiers, so you will have more trouble with empty strings and embedded delimiters.
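When the variables live in an existing data set and the list is a name prefix such as x:, another route is to ask the dictionary tables rather than count inside the data step. This is a minimal sketch, not part of the answer above; the data set WORK.HAVE, its variables, and the macro variable nx are all hypothetical names made up for illustration.

```sas
/* Hypothetical data set with mixed-type x variables and one unrelated variable */
data have;
  x1 = 1; x2 = .; x3 = 'ABC'; x4 = ' '; y = 0;
run;

proc sql noprint;
  /* One row per variable; LIBNAME and MEMNAME are stored in upper case */
  select count(*) into :nx
  from dictionary.columns
  where libname = 'WORK'
    and memname = 'HAVE'
    and upcase(name) like 'X%';
quit;

%let nx = &nx;   /* re-assigning strips the leading blanks left by INTO */
%put Number of x variables: &nx;
```

Because the count comes from metadata, missing values and mixed types do not matter. Note that the LIKE pattern matches every name starting with x, so a variable such as xyz would be counted as well.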

Looping over macro of macros

I've defined a macro of macros:
local my_macros "`macro1' `macro2' `macro3'"
Each of the individual macros has a list of covariates, e.g.
local macro1 "cov1 cov2 cov3"
local macro2 "cov4 cov5 cov6"
local macro3 "cov7 cov8 cov9"
When I loop over my_macros, I want to extract each individual macro. So for example, if I have
for each m in my_macros{
di `m'
}
then it would ideally print the three macros, something like
`macro1'
`macro2'
`macro3'
or
cov1 cov2 cov3
cov4 cov5 cov6
cov7 cov8 cov9
This is because the actual loop I'm running is a regression, and each macro is a list of covariates I want to run. However, the output instead looks like
for each m in my_macros{
di `m'
}
0
0
0
0
0
0
0
0
0
0
So in the full regression loop, only one covariate is being included in a regression at a time. Does anyone know what's going on and how to get each macro as a line of output when I print `my_macros'?
Solution: What you want can be done by nesting macro references.
local macro1 "cov1 cov2 cov3"
local macro2 "cov4 cov5 cov6"
local macro3 "cov7 cov8 cov9"
That's fine. But now the crucial step to loop over such macros could be
forval j = 1/3 {
... `macro`j'' ...
}
where the dots indicate whatever else is needed. Evaluation of macros is exactly like evaluation in elementary algebra or arithmetic whenever parentheses, brackets or braces are used: innermost references are evaluated first, so a reference to macro j is evaluated first.
Misunderstandings: The question contains various small and large misunderstandings.
M1. for each is a repeated typo for foreach.
M2. in my_macros is written where only of local my_macros makes sense.
M3. Once you define a macro from three macros each containing three words, the original macros no longer have any identity as three separate entities. The levels are the new macro; its constituent words (here variable names); and the individual characters (not relevant here). To retain such identities you would need to introduce punctuation, say commas, and parse the contents using that punctuation. But here it is easier to use nested references, and not to define a wider macro at all.
M4. Assuming that you really defined my_macros in two steps so that it eventually contained nine variable names, then a loop like
foreach m of local my_macros {
di `m'
}
would be issuing in turn nine commands like
di cov1
Each such command displays the value of each variable in the first observation (it's not obvious that Stata does that, but it's true). That is,
di `m'
(where local macro m contains a variable name) is exactly equivalent to
di `m'[1]
To see the name, i.e. the text inside the macro, here a variable name, and not the value, you would need the statement inside the loop to be
di "`m'"
Hence the double quotes " " insist on the name, not the value, being displayed. Although you don't give a data example or reproducible code, a series of nine (not ten) zeros would be displayed if and only if all those nine variables contain zeros in the first observation.
The same confusion between name and value occurred in your previous thread, "Stata type mismatch with local macro?"

SAS: adding character variables in data step without setting the length in advance

In a SAS data step, anyone who creates a character variable has to be careful to choose the right length in advance. The following data step returns a wrong result when var1='case2', since var2 is truncated to 2 characters and is equal to 'ab', which is obviously not what we want. The same happens when var2='  ' is replaced with length var2 $2. This kind of procedure is quite prone to errors.
data b; set a;
var2 = '  ';
if var1 = 'case1' then var2='xy';
if var1 = 'case2' then var2='abcdefg';
run;
I was unable to find a way to just define 'var2' as a character, without having to care for its length (side note: if left unspecified, the length is 8).
Do you know if it is possible?
If not, can you perhaps suggest a more robust workaround, something similar to an SQL "case", "decode", etc., to allocate different values to a new string variable that does not suffer from this length issue?
SAS data step code is very flexible compared to most computer languages (and certainly compared to other languages created in the early 1970s) in that you are not forced to define variables before you start using them. The data step compiler waits to define a variable until it needs to. But like any computer program it has rules that it follows. When it cannot tell anything about the variable, the variable is defined as numeric. If it sees that the variable should be character, it chooses the length from the information available at the first reference. So if the first place you use the variable in your code is an assignment of a string constant that is 2 bytes long, then the variable has a length of 2. If it is the result of a character function whose result length is unknown, then the default length is 200. If the reference uses a format or informat, then the length is set to match the width of the format/informat. If there is no additional information, then the length is 8.
You can also use PROC SQL code if you want. In that case the rules of ANSI SQL apply for how variable types are determined.
In your particular example the assignment of blanks to the variable is not needed, since all newly created variables are set to missing (all blanks in the case of character variables) when the data step iteration starts. Note that if VAR2 is not new (i.e., it is already defined in dataset A) then you cannot change its length anyway.
So just replace the assignment statement with a length statement.
data b;
set a;
length var2 $20;
if var1 = 'case1' then var2='ab';
if var1 = 'case2' then var2='abcdefg';
run;
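To see the first-reference rule in action, here is a small sketch that compares a variable whose length is inferred from its first assignment with one declared up front. The variable names are made up for illustration; vlength() reports the length the compiler assigned.

```sas
data _null_;
  length declared $20;          /* explicit length, set before first use */
  inferred = 'xy';              /* first reference: length becomes 2 */
  inferred = 'abcdefg';         /* silently truncated to 'ab' */
  declared = 'xy';
  declared = 'abcdefg';         /* fits, nothing is lost */
  len_inferred = vlength(inferred);
  len_declared = vlength(declared);
  put inferred= len_inferred= declared= len_declared=;
run;
```

The log line should show inferred=ab with a length of 2 and declared=abcdefg with a length of 20, which is the truncation described in the question.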
SAS is not going to change the language at this point; they have too many users with existing code bases. Perhaps they will make a new language at some point in the future.

SAS adding whitespace in a CALL SYMPUT variable name?

New SAS user.
I'm learning to use/write macros right now. I'm trying to loop through the variable ZONE in a data set "zonelist", as well as count the number of observations in the data set. Here's my code:
data _null_;
set zonelist;
call symput ('zone'||_n_, zone);
call symput ('numzones', _n_);
run;
I expected this to create the variables 'zone1', 'zone2' etc to call them in a do loop. This is a reasonable way to do this, right? Anyway, SAS seems to be adding whitespace to my variable names. I get this error when I run it:
ERROR: Symbolic variable name ZONE 1 must contain only letters, digits, and underscores.
NOTE: Invalid argument to function SYMPUT('zone '[12 of 16 characters shown],'100 '[12 of 16 characters shown]) at line 567 column 10.
zone=100 _ERROR_=1 _N_=1
And of course I get the same error for each observation in my dataset. It makes sense why the ZONE value from the table would have a bunch of whitespace (the variable is $16 I think), but why is it adding all of that space to my variable name? What am I missing here?
This happens due to the numeric / character conversion of the _n_ variable. When numeric values are converted to character, they are right aligned.
Try the following instead:
data _null_;
set zonelist;
call symputx(cats('zone',_n_), zone);
call symputx('numzones', _n_);
run;
The cats function will perform the numeric / character conversion and also strip the leading blanks.
If you have SAS 9, the call symputx routine (used above in place of call symput) also strips leading and trailing blanks from the macro variable VALUES.
I solved it using "compress", to just delete all of the spaces:
data _null_;
set zonelist;
call symputx(compress('zone'||_n_), zone);
call symputx('numzones', _n_);
run;
However, this doesn't help me understand why I needed to do this at all. Any enlightenment would be appreciated!
SAS has two types of variables, fixed length character strings and floating point numbers. Let's look at your first statement.
call symput ('zone'||_n_, zone);
In there you are referencing two variables and one string literal. Since the || operator works on character values, SAS needs to do an implicit conversion of the numeric variable _n_ to a character string. SAS uses the best12. format for this, so the result is right-aligned in 12 characters, like '           1'. You therefore end up with an invalid value for call symput() to use as the macro variable name.
But what about that third value, the variable zone? If zone is a number then the same implicit conversion will happen and the macro variable will end up containing leading spaces. Or the zone variable is a character string, in which case your macro variable will most likely end up having trailing spaces, unless the length of the value of zone happens to exactly match the maximum length that the variable zone is defined to hold.
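The padding can be checked directly. The following sketch (variable names are made up for illustration) builds the name both ways and compares the lengths; 4 characters for 'zone' plus 12 from best12. gives 16, which matches the "16 characters" in the error message.

```sas
data _null_;
  n = 1;
  padded  = 'zone' || n;        /* implicit conversion: 'zone' followed by n right-aligned in 12 characters */
  compact = cats('zone', n);    /* converts and strips blanks: 'zone1' */
  len_padded  = length(padded);   /* 16 */
  len_compact = length(compact);  /* 5 */
  put padded= len_padded=;
  put compact= len_compact=;
run;
```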
As others have suggested there are two things to do to fix this.
First use the call symputx() instead of call symput() (unless you really want those trailing spaces stored in your macro variables) which will automatically strip() the input values. It will also silence the note about implicit numeric to character conversion.
The second is to use some method of generating the macro variable name that does not insert spaces. The easiest way is to just use the cats() function instead of the || operator. But you could also use combinations of other functions like put(), compress(), strip(), etc.
call symputx(cats('zone',_n_),zone);
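For completeness, here is a sketch of the whole round trip, including the do loop the question was aiming for. The contents of zonelist and the macro name show_zones are assumptions for illustration only.

```sas
/* Hypothetical input data */
data zonelist;
  length zone $16;
  input zone $;
  datalines;
100
200
300
;
run;

/* Create zone1, zone2, ... and numzones without stray blanks */
data _null_;
  set zonelist;
  call symputx(cats('zone', _n_), zone);
  call symputx('numzones', _n_);
run;

%macro show_zones;
  %local i;
  %do i = 1 %to &numzones;
    /* &&zone&i resolves in two passes: first to &zone1, then to its value */
    %put zone&i = &&zone&i;
  %end;
%mend show_zones;
%show_zones
```

With this input the log should list zone1 = 100, zone2 = 200 and zone3 = 300.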

TCL: check if variable is list

set var1 A
set var2 {A}
Is it possible to check if variable is list in TCL? For var1 and var2 llength gives 1. I am thinking that these 2 variables are considered same. They are both lists with 1 element. Am I right?
Those two things are considered to be entirely identical, and will produce identical bytecode (except for any byte offsets used for indicating where the contents of constants are located, which is not information normally exposed to scripts at all so you can ignore it, plus the obvious differences due to variable names). Semantically, braces are a quoting mechanism and not an indicator of a list (or a script, or …).
You need to write your code to not assume that it can look things up by inspecting the type of a value. The type of 123 could be many different things, such as an integer, a list (of length 1), a unicode string or a command name. Tcl's semantics are based on you not asking what the type of a value is, but rather just using commands and having them coerce the values to the right type as required. Tcl's different to many other languages in this regard.
Because of this different approach, it's not easy to answer questions about this in general: the answers get too long with all the different possible cases to be considered in general yet most of it will be irrelevant to what you're really seeking to do. Ask about something specific though, and we'll be able to tell you much more easily.
You can try string is list $var1 but that will accept both of these forms. It will only return false on something that can't syntactically be interpreted as a list, e.g. because there is an unmatched brace as in "aa { bb".

Is the md5 function safe to use for merging datasets?

We are about to promote a piece of code which uses the SAS md5() hash function to efficiently track changes in a large dataset.
format md5 $hex32.;
md5=md5(cats(of _all_));
As per the documentation:
The MD5 function converts a string, based on the MD5 algorithm, into a 128-bit hash value. This hash value is referred to as a message digest (digital signature), which is nearly unique for each string that is passed to the function.
At approximately what stage does 'nearly unique' begin to pose a data integrity risk (if at all)?
I have seen an example where the md5 comparison goes wrong.
If you have the values "AB" and "CD" in the (two columns of the) first row and "A" and "BCD" in the second row, they get the same md5 value. See this example:
data md5;
attrib a b length=$3 informat=$3.;
infile datalines;
input a b;
format md5 $hex32.;
md5=md5(cats(of _all_));
datalines;
AB CD
A BCD
;
run;
This is, of course, because CATS(of _all_) concatenates and strips the variables (converting numbers to strings using the "best" format) without a delimiter. If you use CAT instead, this will not happen, because the leading and trailing blanks are not removed. This error is not very far-fetched. If you have missing values, it could occur more often. If, for example, you have a lot of binary values in text variables, some of which are missing, it could occur very often.
One could do this manually, adding a delimiter in between the values. Of course, you would still have the case when you have ("AB!" and "CD") and ("AB" and "!CD") and you use "!" as delimiter...
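A small sketch of the delimiter idea, using the same two rows as the example above. The data set and variable names are made up for illustration, and "|" is an arbitrary choice of delimiter that carries the caveat just mentioned. Also note that CATX skips blank values entirely, so on its own it is not a complete fix when values can be missing.

```sas
data md5_demo;
  length a b $3 plain delimited $16;   /* an MD5 digest is 16 bytes */
  input a b;
  plain     = md5(cats(a, b));          /* both rows hash 'ABCD' */
  delimited = md5(catx('|', a, b));     /* 'AB|CD' versus 'A|BCD' */
  /* compare each digest with the one from the previous row */
  same_plain = (plain = lag(plain));
  same_delim = (delimited = lag(delimited));
  format plain delimited $hex32.;
  put a= b= same_plain= same_delim=;
  datalines;
AB CD
A BCD
;
run;
```

On the second row same_plain should be 1 and same_delim should be 0: the undelimited digests are identical even though the rows differ.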
MD5 has 2^128 distinct values, and from what I've read, at around 2^64 different inputs (roughly 1.8 x 10^19) you begin to have a high likelihood of finding a collision.
However, as a result of how MD5 is generated, you have some risks of collisions from very similar preimages which only differ in as little as two bytes. As such, it's hard to say how risky this would be for your particular process. It's certainly possible for a collision to occur on as few as two messages. It's not likely. Does saving [some] computing time benefit you enough to outweigh a small risk?
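For a sense of scale, the usual birthday approximation puts the chance of at least one accidental collision among n inputs at roughly n^2 / 2^(b+1) for a b-bit hash, as long as that quantity is small. The sketch below does the arithmetic for MD5 (b = 128). It assumes the hash values behave as if uniformly random and says nothing about deliberately constructed collisions or about the concatenation problem described earlier; the row counts are arbitrary examples.

```sas
data _null_;
  do n = 1e6, 1e9, 1e12;
    /* birthday approximation: p is about n^2 / 2^129 while p is small */
    p_approx = n**2 / 2**129;
    put n= p_approx=;
  end;
run;
```

Even at a trillion rows the estimate stays many orders of magnitude below anything that would matter in practice, which fits the view that accidental collisions only become likely near 2^64 inputs.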