I have a dataset with missing values coded "missing". How do I recode these so Stata recognizes them as missing values? When I have numeric missing values, I have been using e.g.:
mvdecode _all, mv(99=. )
However, when I run this with a character in it, e.g.:
mvdecode _all, mv("missing"=. )
I get the error missing is not a valid numlist.
mvdecode is for numeric variables only: the banner in the help is "Change numeric values to missing values" (emphasis added). So the error message should make sense: the string "missing" is certainly not a numeric value, so Stata stops you there. It makes no sense to say to Stata that numeric values "missing" should be changed to system missing, as you requested.
As for what you should do, that depends on what you mean in Stata terms by coded "missing".
If you are referring to string variables with literal values "missing" which should just be replaced by the empty string "", then that would be a loop over all string variables:
ds, has(type string)
quietly foreach v in `r(varlist)' {
replace `v' = "" if `v' == "missing"
}
If you are referring to numeric variables for which there is a value label "missing" then you need to find out the corresponding numeric value and use that in your call to mvdecode. Use label list to look up the association between values and value labels.
mvdecode works with numlists, not strings (clearly stated in help mvdecode). The missing value for strings in Stata is denoted by "".
clear
set more off
*----- example dataset -----
sysuse auto
keep make mpg
keep in 1/5
replace make = "missing" in 2
list
*----- what you want -----
ds, has(type string)
foreach var in `r(varlist)' {
replace `var' = "" if `var' == "missing"
}
list
list if missing(make)
You can verify that Stata now recognizes one missing value for the string variable using the missing() function.
In one line, I would like to perform an operation on a specific row, where the row number is given by a local macro minus a number.
Here is an MWE:
sysuse auto2, clear
*save the number of observations in a local (is there a quicker way to do this?)
count
local N = r(N)
*make sure it works
di `N'
*make sure that the subtraction works
di `N'-1
*make the replacement, which works when referencing row with `N'
replace make = "def" in `N'
*here is the problem - subtracting from `N' doesn't work
replace make = "abc" in `N'-1
Error:
'74-1' invalid observation number
How can I solve this problem?
There are at least two ways to do this.
Create another local macro:
local Nm1 = `N' - 1
replace make = "abc" in `Nm1'
Force evaluation of an expression on the fly:
replace make = "abc" in `=`N'-1'
I am new to SAS and I am currently trying to create a macro which will automatically replace any special characters in a variable's name with an underscore. I am currently using PRXCHANGE to perform the replacement, yet I notice that when the variable gets renamed, there are extra underscores being placed at the end of the new variable name.
Suppose we were to have two variables "dummy?" and "te!st". When I perform the replacement, the new variables are "dummy___________________________" and "te_st___________________________", when the replacement should just be "dummy_" and "te_st", respectively.
In the sample code below, I know that if I were to add "TRIM(name)" in the PRXCHANGE function then there would not be any extra replacements occurring. The issue with doing this is that if I were to have a variable named "example! ", with a space as the final character, then I would want the variable to be renamed to "example__", with two underscores at the end. Yet by using TRIM(name), I would get "example_", with a single underscore.
N.B. I know if I change the SAS variable name policy to V7, then this would not be a problem. I am solely doing this to improve upon my SAS skills.
/* Generate dummy data */
option validvarname = any;
data dummy_data;
input "dummy?"n "te!st"n;
datalines;
1 1
2 2
3 3
;
run;
/* Generate variables with the old and new variable names as entries */
data test (keep = name new_name);
set sashelp.vcolumn;
where libname = "WORK" and memname = "DUMMY_DATA";
new_name = prxchange("s/[^a-zA-Z0-9]/_/", -1, name);
run;
Your issue is how SAS pads character variables to their declared length. While most languages have variable length strings, SAS is more akin to the SQL char type, without the accompanying varchar type. This gives SAS very good performance in some ways, due to predictable row sizes, but has some consequences. Note that you can actually get effectively variable length strings on datasets using options compress, but during a data step the dataset is uncompressed.
In SAS, a string of length 10 that is assigned "A" will actually have value "A         ": A, plus 9 spaces. Not null characters, actual space characters. That usually doesn't matter, as SAS is written in many ways to ignore those trailing spaces (so "A" = "A " = "A  "), but in this particular case it does matter (since you're transforming the space character).
You can use the trim function to remove the spaces during execution, though it will still be stored with the spaces afterwards of course.
new_name = prxchange("s/[^a-zA-Z0-9]/_/", -1, trim(name));
Note that trim cannot return a zero-length string; for a blank argument it always returns a single space. So if a blank name is a possibility, you should wrap this in a check for missing (a string variable with only spaces = missing).
if not missing(name) then do;
new_name = prxchange("s/[^a-zA-Z0-9]/_/", -1, trim(name));
end;
else new_name = ' ';
There is a trimn function that can return a length 0 string, but there's no reason to do the prxchange if it's missing - this will save time.
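To see why the padding turns into underscores, here is a rough imitation in Python rather than SAS. It is only a sketch: sas_store and replace_specials are made-up helper names, and the width of 32 is an assumption about how wide the name column is.

```python
import re

NAME_WIDTH = 32  # assumption: width of the fixed-length name column


def sas_store(value: str, width: int = NAME_WIDTH) -> str:
    """Imitate fixed-width character storage: truncate, then pad with spaces."""
    return value[:width].ljust(width)


def replace_specials(value: str) -> str:
    """Replace every character that is not an ASCII letter or digit with '_'."""
    return re.sub(r"[^a-zA-Z0-9]", "_", value)


padded = sas_store("dummy?")

# The padding spaces are not letters or digits, so each one becomes '_'.
print(repr(replace_specials(padded)))

# Removing the trailing spaces first leaves only the real characters.
print(repr(replace_specials(padded.rstrip(" "))))  # 'dummy_'
```

The first result is as long as the padded field, which matches the run of underscores in the question; the second is what trim(name) buys you.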
Your concern about trailing spaces on variable names is not valid. Trailing spaces on variable names are not significant. This data step creates only one variable.
options validvarname=any;
data test;
'xxx'n = 1;
'xxx 'n= 2;
run;
NOTE: The data set WORK.TEST has 1 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.07 seconds
cpu time 0.03 seconds
Why does this code not need two trim statements, one for first and one for last name? Does the length statement remove blanks?
data work.maillist; set cert.maillist;
length FullName $ 40;
fullname=trim(firstname)||' '||lastname;
run;
length is a declarative statement and introduces a variable to the Program Data Vector (PDV) with the specific length you specify. When an undeclared variable is used in a formula SAS will assign it a default length depending on the formula or usage context.
Character variables in SAS have a fixed length and are padded with spaces on the right. That is why the trim(firstname) is needed when the || lastname concatenation occurs. Without it, the right padding of firstname would be part of the value in the concatenation, and the result might exceed the length of the variable receiving it.
There are concatenation functions that can simplify string operations
CAT same as using <var>|| operator
CATT same as using trim(<var>)||
CATS same as using trim(left(<var>))||
CATX same as using CATS with a delimiter.
STRIP same as trim(left(<var>))
Your expression could be re-coded as:
fullname = catx(' ', firstname, lastname);
Is there a reason you think it should? Can you see trailing spaces in the surname, have you tried a length() function?
I could be wrong here but sometimes when you apply a function (put especially) or import data you can inadvertently store leading or trailing spaces. Trailing spaces are a mystery because you don't realise they are there until you try to do something else with the data.
A length statement should allow you to store exactly the data you give it providing you use a number/character variable correctly with truncation only occurring if the length value is too short.
I've found the compress() function to be the most convenient for dealing with white space and punctuation, particularly if you are concatenating variables.
https://www.geeksforgeeks.org/sas-compress-function-with-examples/
All the best,
Phil
Because SAS will truncate the value when it is too long to fit into FULLNAME. And when it is too short it will fill in the rest of FULLNAME with spaces anyway so there is no need to remove them.
It would only be an issue if the length of FULLNAME is smaller than the sum of the lengths of FIRSTNAME and LASTNAME plus one. Otherwise the result cannot be too long to fit into FULLNAME, even if there are no trailing spaces in either FIRSTNAME or LASTNAME.
Try it yourself with non-blank values so it is easier to see what is happening.
data test;
length one $1 two $2 three $3 ;
one = 'ABCD';
two = 'ABCD';
three='ABCD';
put (_all_) (=);
run;
one=A two=AB three=ABC
NOTE: The data set WORK.TEST has 1 observations and 3 variables.
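The same truncate-or-pad rule can be imitated in Python if that helps to picture it. This is only a sketch of the behaviour described above, not SAS itself; assign_fixed is a made-up helper, and the names and widths in the second part are invented for illustration.

```python
def assign_fixed(value: str, length: int) -> str:
    """Imitate assignment to a fixed-length character variable:
    cut off anything past `length`, then pad with spaces on the right."""
    return value[:length].ljust(length)


# Counterpart of the data step above: the same value stored in three widths.
for name, length in (("one", 1), ("two", 2), ("three", 3)):
    print(f"{name}={assign_fixed('ABCD', length)}")

# Hypothetical names and widths: firstname $10, lastname $15, fullname $40.
firstname = assign_fixed("Ann", 10)
lastname = assign_fixed("Lee", 15)

# Trimming only firstname is enough: lastname's padding lands at the end,
# where the target variable would be padded with spaces anyway.
fullname = assign_fixed(firstname.rstrip(" ") + " " + lastname, 40)
print(repr(fullname.rstrip(" ")))  # 'Ann Lee'
```

Leaving out the rstrip on firstname would still fit in 40 characters here, but the padding would then sit between the two names.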
I learned that to change lower case variable names to upper case variables I need to do the following:
foreach var of varlist * {
rename `var' `=upper("`var'")'
}
But I can't comprehend how this can really work.
First, rename does not require = to change variable names.
Second, I understand that I need to embrace var with ` and '
But what do the ` and ' mean surrounding =upper("`var'")?
You don't need to do that. You don't need a loop and you don't need that syntax. Consider
. sysuse auto, clear
(1978 Automobile Data)
. ds
make mpg headroom weight turn gear_ratio
price rep78 trunk length displacement foreign
. rename *, upper
. ds
MAKE MPG HEADROOM WEIGHT TURN GEAR_RATIO
PRICE REP78 TRUNK LENGTH DISPLACEMENT FOREIGN
Otherwise you are puzzled at the `= ' because indeed that is nothing to do with rename. That syntax obliges Stata to evaluate a scalar expression on the fly so that rename sees only the result of that expression. In your case the string expression
upper("`var'")
yields an upper-case version of the variable name contained in local macro var.
This syntax is documented at help macro and [P] macro (e.g. in this version p.13) as one kind of expansion operator.
All that said, all variable names upper case is horrible style....
So one of my variables was coded in a messy mix of numeric values, text, parentheses and so on. I actually only need to extract the numeric values, which are recorded as 12345 (for example; they are not limited to a specific number of digits, I mean it could be anything from an n-k-digit to an n-digit number) followed by || and then a description that might also contain some numeric values. So when I applied the SAS compress function newvar = compress(oldvar, '', 'a'), the newvar extracted ALL the numbers from the oldvar. Thus it looks like 12345|||(789)|| etc. The number of '|' signs (which is a control character to indicate line breaks etc.?) varies though.
I only need to extract the first numeric values before the '|' sign. Any help please?
Thanks in advance.
Use the SCAN() function to extract the values. It will result in a character value and converting to a numeric should be straightforward.
new_var = input(scan(old_var, 1, "|"), best12.);
This should do it:
substr("12345||45||89||...",1,find("12345||45||89||...","|")-1)
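For comparison, the same "take everything before the first |, then convert" idea looks like this in Python. It is just a sketch with a made-up function name, first_number. Unlike SCAN(), it does not skip leading delimiters, so a value starting with | yields no number.

```python
from typing import Optional


def first_number(raw: str, delimiter: str = "|") -> Optional[int]:
    """Return the integer before the first delimiter, or None if that
    piece is not made up purely of ASCII digits."""
    head = raw.split(delimiter, 1)[0].strip()
    if head.isascii() and head.isdigit():
        return int(head)
    return None


print(first_number("12345||45||89"))    # 12345
print(first_number("12345|||(789)||"))  # 12345
print(first_number("n/a||12"))          # None
```

Returning None for anything that is not a clean number plays roughly the role of a missing value here.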