Creating a new variable based on observations in existing variables - if-statement

I am new to coding and SAS, so I hope I am framing my question sufficiently. I am trying to create a new_column that indicates whether an observation is "VALID" in any of variables 2-18 but " " in variable 1. I have used if-then statements, and they appear to work for some observations but not others. The new_column has "yes" for many rows that lack "VALID" in var1 and have "VALID" in any of the other variables (which is the goal), but other rows with "VALID" in var1 also have "yes" in new_column. I have shared my code below. Hopefully someone can help. Thanks in advance!
data mydata_1;
set mydata;
if var2 = "VALID"
or var3 = "VALID"
or var4 = "VALID"
or var5 = "VALID"
or var6 = "VALID"
or var7 = "VALID"
or var8 = "VALID"
or var9 = "VALID"
or var10 = "VALID"
or var11 = "VALID"
or var12 = "VALID"
or var13 = "VALID"
or var14 = "VALID"
or var15 = "VALID"
or var16 = "VALID"
or var17 = "VALID"
or var18 = "VALID"
and var1 = " " then new_column = "yes";
run;

The way you wrote the expression, the AND applies only to the last two comparisons, because SAS evaluates AND before OR. I suspect you meant the AND to be between the result of the OR across the whole series of tests for 'VALID' and the other comparison. Add parentheses.
if (var2 = "VALID"
or var3 = "VALID"
or var4 = "VALID"
...
or var18 = "VALID")
and var1 = " "
then ...
Or just make it easier and use the WHICHC() function to test if VALID appears in any of the list of variables.
if whichc('VALID', of var2 - Var18) and var1 = ' ' then ...
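The precedence issue is not specific to SAS. As a rough illustration in Python rather than SAS, using a made-up row with only three variables, `and` binds more tightly than `or` there too, so the unparenthesized test flags a row even though var1 is filled in:

```python
# Hypothetical row: var1 is filled in, and var2 is "VALID".
row = {"var1": "VALID", "var2": "VALID", "var3": ""}

def flag_without_parens(r):
    # Parsed as: var2 == "VALID" or (var3 == "VALID" and var1 == "")
    return r["var2"] == "VALID" or r["var3"] == "VALID" and r["var1"] == ""

def flag_with_parens(r):
    # The OR over the other variables is evaluated first, then ANDed with var1.
    return (r["var2"] == "VALID" or r["var3"] == "VALID") and r["var1"] == ""

print(flag_without_parens(row))  # True: flagged although var1 is not blank
print(flag_with_parens(row))     # False
```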


Generating dummy variable based on two string variables

I want to generate a dummy variable which is 1 if there is any match in two variables. These two variables are generated by egen concat and each contains a group of languages used in a country.
For example, var1 has the value apc apc apc apc and var2 has the value apc, or var1 is apc fra nya and var2 is apc. In either case, fndmtch2 or egen anymatch would not give me 1. Is there any way I can get 1 for each case?
Your data example can be simplified to
sysuse auto
egen var1 = concat(mpg foreign), punct(" ")
egen var2 = concat(trunk foreign), punct(" ")
as mapping to string in this instance is not needed for mpg trunk any more than it was needed for foreign. concat() maps to string on the fly, and the only issues with numeric variables (neither applying here) are if fractional parts are present or you want to see value labels.
Now that it is confirmed that multiple words can be present, we can work with a slightly more interesting example.
Here are two methods. One is to loop over the words in one variable and also the words in the other variable to check if there are any matches.
Stata's definition of a word here is that words are delimited by spaces. That being so, we can check for the occurrence of " word " within " variable ", where the leading and trailing spaces are needed because in say "frog toad newt" neither "frog" nor "newt" occurs with both leading and trailing spaces. In the OP's example the check may not be needed, but it often is, just as a search for "1" or "2" or "3" finds any of those within "11 12 13", which is wrong if you seek any as a word and not as a single character.
More is said on search for words within strings in a paper in press at the Stata Journal and likely to appear in 22(4) 2022.
* Example generated by -dataex-. For more info, type help dataex
clear
input str8 var1 str5 var2
"FR DE" "FR"
"FR DE GB" "GB"
"GB" "FR"
"IT FR" "GB DE"
end
gen wc = wordcount(var1)
su wc, meanonly
local max1 = r(max)
replace wc = wordcount(var2)
su wc, meanonly
local max2 = r(max)
drop wc
gen match = 0
quietly forval i = 1/`max1' {
    forval j = 1/`max2' {
        replace match = 1 if word(var1, `i') == word(var2, `j') & word(var1, `i') != ""
    }
}
gen MATCH = 0
forval i = 1/`max1' {
    replace MATCH = 1 if strpos(" " + var2 + " ", " " + word(var1, `i') + " ")
}
list
     +----------------------------------+
     |     var1    var2   match   MATCH |
     |----------------------------------|
  1. |    FR DE      FR       1       1 |
  2. | FR DE GB      GB       1       1 |
  3. |       GB      FR       0       0 |
  4. |    IT FR   GB DE       0       0 |
     +----------------------------------+
EDIT
replace MATCH = 1 if strpos(" " + var2 + " ", " " + word(var1, `i') + " ") & !missing(var1, var2)
is better code to avoid the uninteresting match of " " with " ".
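The padding idea can be sketched outside Stata as well. The following Python version is only an illustration (the function name any_word_match is made up here): it pads both the haystack and each word with spaces so that only whole words match, and it returns no match when either string is empty.

```python
def any_word_match(a: str, b: str) -> bool:
    """Return True if any space-delimited word of a occurs as a whole word in b."""
    padded = " " + " ".join(b.split()) + " "
    return any(" " + w + " " in padded for w in a.split())

rows = [("FR DE", "FR"), ("FR DE GB", "GB"), ("GB", "FR"), ("IT FR", "GB DE")]
print([int(any_word_match(a, b)) for a, b in rows])  # [1, 1, 0, 0]

# Without the padding, a plain substring test is fooled by partial matches.
print("1" in "11 12 13")                # True
print(any_word_match("1", "11 12 13"))  # False
```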

SAS Retain not working for 1 string variable

The code below doesn't seem to be working for the variable all_s when there is more than 1 record with the same urn. Var1,2,3 work fine but that one doesn't, and I can't figure out why. I am trying to have all_s equal to single_var1,2,3 concatenated with no spaces if it's first.urn, but I want it to be
all_s = all_s + ',' + single_var1 + single_var2 + single_var3
when it's not the first instance of that urn.
data dataset_2;
set dataset_1;
by URN;
retain count var1 var2 var3 all_s;
format var1 $40. var2 $40. var3 $40. all_s $50.;
if first.urn then do;
count=0;
var1 = ' ';
var2 = ' ';
var3 = ' ';
all_s = ' ';
end;
var1 = catx(',',var1,single_var1);
var2 = catx(',',var2,single_var2);
var3 = catx(',',var3,single_var3);
all_s = cat(all_s,',',single_var1,single_var2,single_var3);
count = count+1;
if first.urn then do;
all_s = cat(single_var1,single_var2,single_var3);
end;
run;
all_s is not large enough to contain the concatenation if the total length of the var1-var3 values within the group exceeds $50. Such a scenario seems likely with var1-var3 being $40.
I recommend using the length statement to specify variable lengths. format will create a variable of a certain length only as a side effect.
catx removes blank arguments from the concatenation, so if you want spaces in the concatenation when you have a blank single_varN, you won't be able to use catx.
A requirement that specifies a concatenation such that non-blank values are stripped and blank values are a single blank will likely have to fall back to the old-school trim(left(… approach.
Sample code
data have;
length group 8 v1-v3 $5;
input group (v1-v3) (&);
datalines;
1 111 222 333
1 . 444 555
1 . . 666
1 . . .
1 777 888 999
2 . . .
2 . b c
2 x . z
run;
data want(keep=group vlist: all_list);
length group 8 vlist1-vlist3 $40 all_list $50;
length comma1-comma3 comma $2;
do until (last.group);
set have;
by group;
vlist1 = trim(vlist1)||trim(comma1)||trim(left(v1));
vlist2 = trim(vlist2)||trim(comma2)||trim(left(v2));
vlist3 = trim(vlist3)||trim(comma3)||trim(left(v3));
comma1 = ifc(missing(v1), ' ,', ',');
comma2 = ifc(missing(v2), ' ,', ',');
comma3 = ifc(missing(v3), ' ,', ',');
all_list =
trim(all_list)
|| trim(comma)
|| trim(left(v1))
|| ','
|| trim(left(v2))
|| ','
|| trim(left(v3))
;
comma = ifc(missing(v3),' ,',',');
end;
run;
Reference
SAS has operators and multiple functions for string concatenation
|| concatenate
cat concatenate
catt concatenate, trimming (remove trailing spaces) of each argument
cats concatenate, stripping (remove leading and trailing spaces) of each argument
catx concatenate, stripping each argument and delimiting
catq concatenate with delimiter and quote arguments containing the delimiter
From SAS 9.2 documentation
Comparisons
The results of the CAT, CATS, CATT, and CATX functions are usually equivalent to results that are produced by certain combinations of the concatenation operator (||) and the TRIM and LEFT functions. However, the default length for the CAT, CATS, CATT, and CATX functions is different from the length that is obtained when you use the concatenation operator. For more information, see Length of Returned Variable.
Note: In the case of variables that have missing values, the concatenation produces different results. See Concatenating Strings That Have Missing Values.
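To make the differences in the list above concrete, here is a small Python sketch. These are hypothetical stand-ins written from the one-line descriptions above, not the SAS functions themselves, and they ignore SAS details such as returned lengths and missing-value handling.

```python
# Hypothetical Python stand-ins for the behaviours described above; not SAS itself.
def cat(*args):
    # Plain concatenation, blanks kept.
    return "".join(args)

def catt(*args):
    # Trailing blanks removed from each argument.
    return "".join(a.rstrip(" ") for a in args)

def cats(*args):
    # Leading and trailing blanks removed from each argument.
    return "".join(a.strip(" ") for a in args)

def catx(sep, *args):
    # Strip each argument, drop the ones that are entirely blank, then delimit.
    return sep.join(a.strip(" ") for a in args if a.strip(" "))

values = ["111  ", "     ", " 333 "]
print(repr(cat(*values)))
print(repr(catt(*values)))
print(repr(cats(*values)))
print(repr(catx(",", *values)))
```

The last call shows why catx cannot keep a placeholder for a blank value: the blank argument disappears along with its delimiter.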
Some example data would be helpful, but I'm going to give it a shot and ask you to try
all_s = cat(strip(All_s),',',single_var1,single_var2,single_var3);

Stata: label variables using forvalue loop

I am trying to label a batch of variables using a loop as follows, but it fails with the Stata error "invalid syntax". I couldn't find out where it went wrong.
local myvars "basicenumerator" "basicfr_gpslatitude" "basicfr_gpslongitude"
local mylabels "Name of enumerator" "the latitude of the farmers house" "the longtitude of the farmers house"
local n : word count `mylabels'
forvalues i = 1/`n'{
local a: word `i' of `mylabels'
local b: word `i' of `myvars'
label var `b' "`a'"
}
To debug this, the main trick is to get Stata to show you what it thinks the local macros are. This script makes your code reproducible and also fixes it.
clear
set obs 1
gen basicenumerator = 42
gen basicfr_gpslatitude = 42
gen basicfr_gpslongitude = 42
local myvars `" "basicenumerator" "basicfr_gpslatitude" "basicfr_gpslongitude" "'
local mylabels `" "Name of enumerator" "the latitude of the farmers house" "the longtitude of the farmers house" "'
local n : word count `mylabels'
mac li
forvalues i = 1/`n'{
local a: word `i' of `mylabels'
local b: word `i' of `myvars'
label var `b' "`a'"
}
The problem is that the outer " " get stripped when your locals are defined, so to keep the " " as desired, you need to wrap the whole list in compound double quotes, `" and "'.
For explanation, see http://www.stata.com/manuals14/u12.pdf 12.4.6.
Picky correction: spelling is longitude.

SAS : How do I find nth instance of a character/group of characters within a string?

I'm trying to find a function that will index the nth instance of a character(s).
For example, if I have the string ABABABBABSSSDDEE and I want to find the 3rd instance of A, how do I do that? What if I want to find the 4th instance of AB?
ABABABBABSSSDDEE
data HAVE;
input STRING $;
datalines;
ABABABBASSSDDEE
;
RUN;
Here is a much simplified implementation of finding N-th instance of a group of characters in a SAS character string using SAS find() function:
data a;
s='AB bhdf +BA s Ab fs ABC Nfm AB ';
x='AB';
n=3;
/* from left to right */
p = 0;
do i=1 to n until(p=0);
p = find(s, x, p+1);
end;
put p=;
/* from right to left */
p = length(s) + 1;
do i=1 to n until(p=0);
p = find(s, x, -p+1);
end;
put p=;
run;
As you can see it allows for both, left-to-right and right-to-left searches.
You can combine these two into a SAS user-defined function (negative n will indicate search from right to left as it is in find function):
proc fcmp outlib=sasuser.functions.findnth;
function findnth(str $, sub $, n);
p = ifn(n>=0,0,length(str)+1);
do i=1 to abs(n) until(p=0);
p = find(str,sub,sign(n)*p+1);
end;
return (p);
endsub;
run;
Note that the above solutions with FIND() and FINDNTH() functions assume that the searched substring can overlap with its prior instance. For example, if we search for a substring ‘AAA’ within a string ‘ABAAAA’, then the first instance of the ‘AAA’ will be found in position 3, and the second instance – in position 4. That is, the first and second instances are overlapping. For that reason, when we find an instance we increment position p by 1 (p+1) to start the next iteration (instance) of the search.
However, if overlapping is not valid in your searches and you want to continue after the end of the previous instance, increment p by the length of the substring x rather than by 1. That also speeds up the search (more so the longer x is), because more characters are skipped on each pass through the string s. In the search code, replace p+1 with p+w, where w=length(x). Note that for the left-to-right search p should then start at 1-w rather than 0, so that the first search still begins at position 1.
A detailed discussion of this problem is in my recent SAS blog post, Finding n-th instance of a substring within a string. I also found that the find() function works considerably faster than the regular expression functions in SAS.
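The overlapping versus non-overlapping distinction can be sketched in Python as well. This is only an illustration of the stepping logic, not the SAS code above; find_nth and its overlap parameter are names invented for this example, and positions are reported 1-based to match SAS.

```python
def find_nth(s: str, sub: str, n: int, overlap: bool = True) -> int:
    """Return the 1-based position of the n-th occurrence of sub in s, or 0."""
    if n < 1 or not sub:
        return 0
    step = 1 if overlap else len(sub)  # how far to move past each hit
    start = 0                          # 0-based index where the next search begins
    pos = -1
    for _ in range(n):
        pos = s.find(sub, start)
        if pos < 0:
            return 0
        start = pos + step
    return pos + 1

print(find_nth("ABAAAA", "AAA", 1))                 # 3
print(find_nth("ABAAAA", "AAA", 2))                 # 4
print(find_nth("ABAAAA", "AAA", 2, overlap=False))  # 0
print(find_nth("ABABABBABSSSDDEE", "AB", 4))        # 8
```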
I realize I'm late to the party here, but in the interest of adding to the collection of answers, here's what I've come up with.
DATA test;
input = "ABABABBABSSSDDEE";
A_3 = find(prxchange("s/A/#/", 2, input), "A");
AB_4 = find(prxchange("s/AB/##/", 3, input), "AB");
RUN;
Breaking it down, prxchange() just does a pattern matching replacement, but the great thing about it is that you can tell it how many times to replace that pattern. So, prxchange("s/A/#/", 2, input) replaces the first two A's in input with #. Once you've replaced the first two A's, you can wrap it in a find() function to find the "first A", which is actually the third A of the original string.
One thing to note about this approach is that, ideally, the replacement string should be the same length as the string you're replacing. For instance, notice the difference between
find(prxchange("s/AB/##/", 3, input), "AB") /* gives 8 (correct) */
and
find(prxchange("s/AB/#/", 3, input), "AB") /* gives 5 (incorrect) */
That's because we've replaced a string of length 2 with a string of length 1 three times. In other words:
(length("#") - length("AB")) * 3 = -3
so 8 + (-3) = 5.
Hopefully that helps someone out there!
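For readers who want to experiment with the masking idea outside SAS, here is a rough Python equivalent using re.sub with a count. The helper name nth_by_masking is hypothetical, and the sketch assumes the fill character does not occur in the substring being searched for. Like a regular expression replacement in general, it counts non-overlapping occurrences.

```python
import re

def nth_by_masking(s: str, sub: str, n: int, fill: str = "#") -> int:
    """1-based position of the n-th sub in s (0 if absent), masking the first n-1."""
    mask = fill * len(sub)  # same length as sub, so later positions do not shift
    if n > 1:
        # count=0 would mean "replace all" in re.sub, hence the guard on n.
        s = re.sub(re.escape(sub), lambda m: mask, s, count=n - 1)
    return s.find(sub) + 1

text = "ABABABBABSSSDDEE"
print(nth_by_masking(text, "A", 3))   # 5
print(nth_by_masking(text, "AB", 4))  # 8

# A shorter replacement shifts everything left and gives the wrong position.
print(re.sub("AB", "#", text, count=3).find("AB") + 1)  # 5
```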
data _null_;
findThis = 'A'; *** substring to find;
findIn = 'ADABAACABAAE'; **** the string to search;
instanceOf=1; *** and the instance of the substring we want to find;
pos = 0;
len = 0;
startHere = 1;
endAt = length(findIn);
n = 0; *** count occurrences of the pattern;
pattern = '/' || findThis || '/';
rx = prxparse(pattern);
CALL PRXNEXT(rx, startHere, endAt, findIn, pos, len);
if pos le 0 then do;
put 'Could not find ' findThis ' in ' findIn;
end;
else do while (pos gt 0);
n+1;
if n eq instanceOf then leave;
CALL PRXNEXT(rx, startHere, endAt, findIn, pos, len);
end;
if n eq instanceOf then do;
put 'found ' instanceOf 'th instance of ' findThis ' at position ' pos ' in ' findIn;
end;
else do;
put 'No ' instanceOf 'th instance of ' findThis ' found';
end;
run;
Here is a solution using the find() function and a do loop within a DATA step. I then take that code and place it into proc fcmp to create my own function called find_n(). This should greatly simplify whatever task uses it and allows for code reuse.
Define the data:
data have;
length string $50;
input string $;
datalines;
ABABABBABSSSDDEE
;
run;
Do-loop solution:
data want;
set have;
search_term = 'AB';
nth_time = 4;
counter = 0;
last_find = 0;
start = 1;
pos = find(string,search_term,'',start);
do while (pos gt 0 and nth_time gt counter);
last_find = pos;
start = pos + 1;
counter = counter + 1;
pos = find(string,search_term,'',start); /* start is already pos + 1; start+1 would skip a position */
end;
if nth_time eq counter then do;
put "The nth occurrence was found at position " last_find;
end;
else do;
put "Could not find the nth occurrence";
end;
run;
Define the proc fcmp function:
Note: If the nth-occurrence cannot be found return 0.
options cmplib=work.temp.temp;
proc fcmp outlib=work.temp.temp;
function find_n(string $, search_term $, nth_time) ;
counter = 0;
last_find = 0;
start = 1;
pos = find(string,search_term,'',start);
do while (pos gt 0 and nth_time gt counter);
last_find = pos;
start = pos + 1;
counter = counter + 1;
pos = find(string,search_term,'',start); /* start is already pos + 1; start+1 would skip a position */
end;
result = ifn(nth_time eq counter, last_find, 0);
return (result);
endsub;
run;
Example proc fcmp usage:
Note that this calls the function twice. The first example is showing the original request solution. The second example shows what happens when a match cannot be found.
data want;
set have;
nth_position = find_n(string, "AB", 4);
put nth_position =;
nth_position = find_n(string, "AB", 5);
put nth_position =;
run;

How to pad out character fields in SAS?

I am creating a SAS dataset from a database that includes a VARCHAR(5) key field.
This field includes some entries that use all 5 characters and some that use fewer.
When I import this data, I would prefer to pad all the shorter entries out to use all five characters. For this example, I want to pad on the left with 0, the character zero. So, 114 would become 00114, ABCD would become 0ABCD, and EA222 would stay as it is.
I've attempted this with a simple data statement, but of course the following does not work:
data test;
set databaseinput;
format key $5.;
run;
I've tried to do this with a user-defined informat, but I don't think it's possible to specify the ranges correctly on character fields, per this SAS KB answer. Plus, I'm fairly sure proc format won't let me define the result dynamically in terms of the incoming variable.
I'm sure there's an obvious solution here, but I'm just missing it.
Here is an alternative:
data padded_data_dsn;
length key $5;
drop raw_data;
set raw_data_dsn(rename=(key=raw_data));
key = translate(right(raw_data),'0',' ');
run;
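The right-justify-then-translate idea is easy to check in another language. A minimal Python sketch follows; pad_left is a name made up for this example. Note that, as with translate(), any blank inside the key would also be turned into the fill character.

```python
def pad_left(key: str, width: int = 5, fill: str = "0") -> str:
    # Right-justify in a field of `width`, then turn the blanks into the fill character.
    return key.strip().rjust(width).replace(" ", fill)

for key in ["114", "ABCD", "EA222"]:
    print(pad_left(key))
# 00114
# 0ABCD
# EA222
```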
data raw_data_dsn;
length key key1 $5;
do key = '4', 'A114', 'A1140';
/* REPEAT('0', n) returns n+1 zeros, so pad with 4-length(key), and only when key is short */
if length(key) < 5 then key1 = cats(repeat('0', 4-length(key)), key);
else key1 = key;
output;
end;
run;
I'm sure someone will have a more elegant solution, but the following code works. Essentially it is padding the variable with five leading zeros, then reversing the order of this text string so that the zeros are to the right, then reversing this text string again and limiting the size to five characters, in the original order but left-padded with zeros.
data raw_data_dsn;
format key $varying5.;
key = '114'; output;
key = 'ABCD'; output;
key = 'EA222'; output;
run;
data padded_data_dsn;
format key $5.;
drop raw_data;
set raw_data_dsn(rename=(key=raw_data));
key = put(put('00000' || raw_data ,$revers10.),$revers5.);
run;
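The reverse, truncate, reverse sequence described above can be traced step by step in Python. This is only a sketch of the logic with a made-up function name, not a translation of the $revers formats.

```python
def pad_by_reversing(key: str, width: int = 5) -> str:
    padded = "0" * width + key.strip()  # e.g. "00000114"
    reversed_once = padded[::-1]        # "41100000": zeros now on the right
    kept = reversed_once[:width]        # "41100": keep the first five characters
    return kept[::-1]                   # "00114": back in the original order

for key in ["114", "ABCD", "EA222"]:
    print(pad_by_reversing(key))
# 00114
# 0ABCD
# EA222
```

The net effect is the same as keeping the last five characters of the zero-prefixed string.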
Here's what worked for me.
data b (keep = str2);
format str2 $5. ;
set a;
catlength = 4 - length(str);
cat = repeat('0', catlength);
str2 = catt(cat, str);
run;
It works by counting the length of the existing string, and then creating a cat string of length 4 - that, and then appending the cat value and the original string together.
Notice that it screws up if the original string is length 5.
Also - it won't work if the input string has a $5. format on it.
data a; /*input dataset*/
input str $;
datalines;
a
aa
aaa
aaaa
aaaaa
;
run;
input:
a
aa
aaa
aaaa
aaaaa
output:
0000a
000aa
00aaa
0aaaa
0aaaa
I use this, but it only works with numeric values. Try other informats in the INPUT call.
data work.prueba;
format xx $5.;
xx='1234';
vv=PUT(INPUT(xx,best5.),z5.);
run;