I'm trying to subset some data with the following code:
data want;
set have;
array fx(12) fx1-fx12;
do i=1 to 12;
if substr(dx(i),1,4) in ('1115')
or substr(fx(i),1,5) in ('1146%')
then output;
end;
run;
I cross-referenced the output against the original dataset using proc freq. The frequency counts for '1115' match, as they should, but the counts for '1146%' don't. I thought '%' was a wildcard that I could use?
I also tried '/^1146\d*/'
The % wildcard is recognized only by the LIKE operator in a WHERE clause. In an IF statement, use the string prefix equality (i.e. starts with) operator =: or the prefix in-set operator IN: instead.
Also, since you are only taking 5 characters with SUBSTR, you could take 4 characters and check = '1146'. Furthermore, since the substring starts at position 1 (the 1st character), you don't need SUBSTR at all when using =: or IN: (see the 3rd example).
In order to use Perl regular expression pattern matching use the PRXMATCH function. Your pattern '/^1146\d*/' does not need \d* (0 or more digits). '/^1146/' will match anything that '/^1146\d*/' does.
Example(s):
if substr(dx(i),1,4) in ('1115') or fx(i) =: '1146' then output;
if substr(dx(i),1,4) in ('1115') or substr(fx(i),1,4) = '1146' then output;
/* expanded example for case of checking two prefix possibilities */
if dx(i) in: ('1115') or fx(i) in: ('1146', '124') then output;
if dx(i) =: '1115' or prxmatch('/^1146/', fx(i)) then output;
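A minimal, self-contained sketch of how these tests behave side by side. The dataset `demo` and the variable `code` are made up for illustration and are not part of the question's data.

```sas
/* Hypothetical demo data: DEMO and CODE are assumed names */
data demo;
  input code $8.;
  datalines;
1146
11467
1146%
2146
;
run;

data _null_;
  set demo;
  literal  = (substr(code,1,5) in ('1146%'));  /* compares against the literal text 1146% */
  colon_eq = (code =: '1146');                 /* starts-with comparison */
  colon_in = (code in: ('1146', '124'));       /* starts with any listed prefix */
  regex    = (prxmatch('/^1146/', code) > 0);  /* Perl regex anchored at the start */
  put code= literal= colon_eq= colon_in= regex=;
run;
```

Only the row that contains an actual % character sets `literal` to 1, which is why the counts in the question did not match; the other three tests flag every value beginning with 1146.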
I have a long text string in SAS, and within it is a value of variable length that is always preceded by a '#' and followed by a ','.
Is there a way I can extract this value and store it as a new variable, please?
e.g:
word word, word, #12.34, word, word
And I want to get the 12.34
Thanks!
Double scan should also work if you only have a single #:
data _null_;
var1 = 'word word, word, #12.34, word, word';
var2 = scan(scan(var1,2,'#'),1,',');
put var2=;
run;
You can make use of the substr and index functions to do this. The index function returns the first position of the character specified.
data _null_;
var1 = 'word word, word, #12.34, word, word';
pos1 = index(var1,'#'); *Get the position of the first # sign;
tmp = substr(var1,pos1+1); *Create a string that returns only characters after the # sign;
put tmp;
pos2 = index(tmp,','); *Get the position of the first "," in the tmp variable;
var2 = substr(tmp,1,pos2-1);
put var2;
run;
Note that this method picks up only the value after the first "#"; if the string contains several, the later ones are ignored.
One way is to use index to locate the two 'sentinels' delimiting the value and retrieve the innards with substr. If the value is supposed to be numeric, an additional call to the input function is needed.
A second way is to use the regular expression routines prxmatch and prxposn to locate and extract the embedded value.
data have;
input;
longtext = _infile_;
datalines;
some thing #12.34, wicked
#, oops
#5a64, oops
# oops
oops ,
oops #
ok #1234,
who wants be a #1e6,aire
space # , the final frontier
double #12, jeopardy #34, alex
run;
data want;
set have;
* locate with index;
_p1 = index(longtext,'#');
if _p1 then _p2 = index(substr(longtext,_p1),',');
if _p2 > 2 then num_in_text = input (substr(longtext,_p1+1,_p2-2), ?? best.);
* locate with regular expression;
if _n_ = 1 then _rx = prxparse('/#(\d*\.?\d*)?,/'); retain _rx;
if prxmatch(_rx,longtext) then do;
call prxposn(_rx,1,_start,_length);
if _length > 0 then num_in_text_2 = input (substr(longtext,_start, _length), ?? best.);
end;
* drop _: ;
run;
The regex approach looks only for values of the form ##.##, while the index approach takes whatever lies between # and the following comma. As a result, the input function will decipher scientific-notation values (such as 1e6) that the example regex pattern will not locate. The ?? modifier in the input function suppresses the invalid-argument NOTEs in the log when the enclosed value cannot be parsed as a number.
Another way is to use a regular expression with look-around assertions; the code is given below.
data have;
infile datalines truncover ;
input var $200.;
datalines;
word word, word, #12.34, word, word
word1 #12.34, hello hi hello hi
word1 #970000 hello hi hello hi #970022, hi
word1 123, hello hi hello hi #97.99
#99456, this is cool
;
A small note about below regular expression and functions
(?<=#) Zero-width positive look-behind assertion and looking for # before the pattern of interest
(\d+\.?\d+) matches one or more digits, an optional ., and one or more further digits, so the value must contain at least two digits
(?=,) Zero-width positive look-ahead assertion and looking for , after the pattern of interest
call prxsubstr finds the position and length of pattern and substr extracts the required values.
data want( drop=pattern position length);
retain pattern;
IF _N_ = 1 THEN PATTERN = PRXPARSE("/(?<=#)(\d+\.?\d+)(?=,)/");
set have;
call prxsubstr(pattern, var, position, length);
if position then
match = substr(var, position, length);
run;
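Because `\d+\.?\d+` needs at least two digits, a single-digit value such as `#7,` is not picked up. The sketch below compares the answer's pattern with a slightly looser variation, `\d+(?:\.\d+)?`. That variation, the sample lines, and the variable names are assumptions made for illustration, not part of the original answer.

```sas
data _null_;
  infile datalines truncover;
  input var $40.;
  retain rx_orig rx_alt;
  if _n_ = 1 then do;
    rx_orig = prxparse('/(?<=#)(\d+\.?\d+)(?=,)/');      /* pattern from the answer */
    rx_alt  = prxparse('/(?<=#)(\d+(?:\.\d+)?)(?=,)/');  /* assumed variation: decimals optional */
  end;
  length m_orig m_alt $20;
  call prxsubstr(rx_orig, var, pos, len);
  if pos then m_orig = substr(var, pos, len);
  call prxsubstr(rx_alt, var, pos, len);
  if pos then m_alt = substr(var, pos, len);
  put var= m_orig= m_alt=;
  datalines;
word, #12.34, word
word, #7, word
;
run;
```

For the first line both patterns extract 12.34; for the second line only the variation finds a value.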
If you want to get really lazy (and the string contains no other digits or periods), you can just do
want = compress(have,".","kd");
Suppose I have two strings to convert from SAS program names to table numbers.
My goal is to convert the first, "f-2-2-7-5-vcb", to "2.2.7.5".
This should be done dynamically: for "f-2-2-12-1-2-hbd87q",
the result needs to be "2.2.12.1.2".
How can I accomplish this?
data input;
input str $ 1-20;
datalines;
f-2-3-1-5-vcb
f-2-4-1-6-rtg
f-2-3-11-1-3-hb17
;
run;
data want;
set input;
Sub=compress(substr(str,3,length(str)),,'kd') ;
run;
Bit of a longer way, but this works fine for me.
Use FIND() to find the first '-'
Use REVERSE() and FIND() to find the last '-'
Use SUBSTR() and the positions from above to remove the first and last components
Use TRANSLATE() to convert the dashes to periods.
z=find(str, '-');
end=find(strip(reverse(str)), '-');
string = translate(substr(str, z+1, length(str) - z - end), ".", "-");
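The three statements above can be assembled into a complete step, sketched here with two of the question's sample values. The variable `last_dash` stands in for `end` (renamed only to avoid confusion with the END statement), and the step reads its own datalines rather than an existing dataset.

```sas
data want;
  length str string $20;
  input str $20.;
  z         = find(str, '-');                   /* position of the first dash */
  last_dash = find(strip(reverse(str)), '-');   /* distance of the last dash from the end */
  /* keep what lies between the first and last dash, then swap dashes for periods */
  string = translate(substr(str, z + 1, length(str) - z - last_dash), '.', '-');
  put str= string=;
  datalines;
f-2-3-1-5-vcb
f-2-3-11-1-3-hb17
;
run;
```

The log should show string=2.3.1.5 and string=2.3.11.1.3. STRIP is needed because REVERSE moves the trailing padding blanks of `str` to the front.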
A regular expression can match the dash delimited digits only sequence. The match, when extracted, can be transformed using translate.
data input;
input str $ 1-20;
rx = prxparse ("/^.*?((\d+)(-\d+)*)/");
if prxmatch(rx,str) then do;
call prxposn (rx,1,s,e);
name = substr(str,s,e);
name = translate(name,'.','-');
end;
datalines;
f-2-3-1-5-vcb
f-2-4-1-6-rtg
f-2-3-11-1-3-hb17
funky2-2-1funky
f-2-hb17
a2bfunky
;
run;
A funky situation occurs if the digits only token sequence is preceded by a token ending with digits, or succeeded by a token starting with digits.
data input;
input str $ 1-20;
string=translate(prxchange('s/\w+?\-(.*)\-\w+/$1/',-1,strip(str)),'.','-');
datalines;
f-2-3-1-5-vcb
f-2-4-1-6-rtg
f-2-3-11-1-3-hb17
;
run;
You can do this in one line. Use substr to keep the text between the second word and the last word:
translate(substr(str,find(str,scan(str,2,'-')),find(str,scan(str,-1,'-'))-find(str,scan(str,2,'-'))-1),'.','-')
step1 = find(str,scan(str,2,'-')) : starting position of the second word.
step2 = find(str,scan(str,-1,'-')) : starting position of the last word.
step3 = step2 - step1 - 1 : length of the text to copy, which ends with the second-to-last word.
substr(str,step1,step3) : copies the text from the second word through the second-to-last word.
translate(...,'.','-') : replaces each '-' with '.'.
Note that find returns the first occurrence of the text, so this assumes the second and last words do not also appear earlier in the string.
Code:
data want;
set input;
Sub=translate(substr(str,find(str,scan(str,2,'-')),find(str,scan(str,-1,'-'))-find(str,scan(str,2,'-'))-1),'.','-');
put _all_;
run;
Output:
str=f-2-3-1-5-vcb Sub=2.3.1.5
str=f-2-4-1-6-rtg Sub=2.4.1.6
str=f-2-3-11-1-3-hb17 Sub=2.3.11.1.3
I have data with an ID field that is structured like this:
XX0000X
7 characters total, with the first 2 and last letters only, and numbers in between.
How can I check that the ID is structured specifically and exactly like this?
I'm not sure of how to approach checking this - one possibility was the CATs function but not sure how to apply that.
You can use a combination of functions to check this, including:
CHAR()
ANYDIGIT()
ANYALPHA()
data have;
input x $10.;
cards;
AB0000X
AO000BF
1234556
ABCDEFG
AB0123Y
AB
ABCDEFGHI
;
run;
data check;
set have;
flag=0;
if lengthn(x) ne 7 then flag=1;
length letter $1;
if flag=0 then do i=1 to 7;
letter = char(x, i);
if ( i in (1,2, 7) and anyalpha(letter) ne 1 )
or i in (3:6) and anydigit(letter) ne 1 then do;
flag=1;
leave;
end;
end;
run;
Regular expressions are obviously more succinct and likely a better approach.
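Here is an approach using a regular expression. ^ and $ anchor the pattern to the whole value, [A-Z]{2} matches the first two letters, [0-9]{4} matches the four digits in the middle, [A-Z] matches the last letter, \s* allows for the trailing blanks SAS pads character values with, and the i modifier ignores case. Without the anchors, a longer value that merely contains a valid ID would also match. Note that flag is 1 for a valid ID here, the opposite of the data step above.
data want;
set have;
flag=prxmatch("/^[A-Z]{2}[0-9]{4}[A-Z]\s*$/i",x);
run;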
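To see why anchoring matters when the structure must match exactly, the sketch below runs an unanchored and an anchored pattern over a few made-up test values. The values and the variable names `loose` and `strict` are assumptions for illustration.

```sas
data _null_;
  length x $10;
  do x = 'AB0123Y', 'ab0123y', 'XAB0123YZ', 'AB012Y';
    /* unanchored: returns the position of a match anywhere in x */
    loose  = prxmatch('/[A-Z]{2}[0-9]{4}[A-Z]/i', x);
    /* anchored: the whole value, apart from trailing blanks, must fit the pattern */
    strict = (prxmatch('/^[A-Z]{2}[0-9]{4}[A-Z]\s*$/i', x) > 0);
    put x= loose= strict=;
  end;
run;
```

The value XAB0123YZ gets a nonzero `loose` result because a valid ID is embedded in it, while `strict` rejects it; AB012Y fails both tests because it has only three digits.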
I made a dataset in SAS that reads a text file line by line. As I read those lines into my dataset, I want to eliminate special characters like *, %, - and ; from the beginning and end of each line.
What function should I use? The characters may occur in any sequence, and I have to replace them with spaces.
Please help!
data forAditi;
infile datalines truncover;
format aLine translated parced $80.;
input #1 aLine $char80.;
** The old school translate function does a good job but also translates characters in the middle **;
translated = translate(aLine,' ','* % - ;');
** Therefore you might prefer regular expressions **;
retain prx_nr;
if _N_ EQ 1 then prx_nr = prxparse('/[ *%;-]*(.+[^ *%;-])/') ;
match = prxmatch(prx_nr, aLine);
call prxposn(prx_nr, 1, pos, len);
if match then substr(parced,pos) = prxposn(prx_nr, 1, aLine);
/* [ *%;-]* looks for zero or more special characters, .+ looks for one or more
   characters of any kind, and [^ *%;-] looks for any non-special character.
   The hyphen is placed last in the class so that it is a literal hyphen; written
   as %-; it would define a range that also covers the digits.
   prxmatch looks for the longest possible match, starting at the first character,
   special or not, and ending at the last non-special character. prxposn, however,
   sets the position and length to the part of the match enclosed in (...), i.e.
   from the first non-special character to the last. Because SAS reinitializes all
   variables that are not explicitly retained, we only have to copy that part into
   parced at the right position. */
datalines4;
This is text;
--That should be cleaned up,
And here- you have *% special characters in the middle.
Blanks at the start should be preserved. Right?
;;;;
run;
Please take a look at the translate function in SAS.
The first argument is your variable, the second argument is a blank (the replacement character), and the third argument is the list of special characters to be replaced with the second argument. Note that this replaces those characters everywhere in the string, not only at the start and end.
translate(variable,' ','*%-');
You can use the compress function to remove special characters, either using a defined list of characters or the 'p' modifier (remove all punctuation marks). To ensure they're only removed at the start/end, also use substr (the code below handles a single character at each end):
/* Assuming 'text' is always 3 or more characters */
data want ;
set have ;
strStart = substr(text,1,1) ;
strEnd = substr(text,length(text),1) ;
strMid = substr(text,2,length(text)-2) ;
newStart = compress(strStart,,'p') ; /* remove punctuation marks */
newEnd = compress(strEnd ,,'p') ;
newStr = cats(newStart,strMid,newEnd) ;
run ;
You could consolidate all those operations into a single statement.
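A sketch of what that consolidation might look like, using a made-up sample value. It keeps the same assumptions as the step above: `text` is at least 3 characters long, and only one character is examined at each end.

```sas
data _null_;
  length text newStr $40;
  text = '*hello, world;';   /* hypothetical sample value */
  /* strip punctuation from the first and the last character only */
  newStr = cats(compress(substr(text, 1, 1), , 'p'),
                substr(text, 2, length(text) - 2),
                compress(substr(text, length(text), 1), , 'p'));
  put newStr=;
run;
```

This should print newStr=hello, world, with the comma in the middle left alone. Note that cats removes the characters rather than replacing them with blanks, and it also strips leading and trailing blanks from each piece.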
I'm new to this field, and I'm trying to use prxmatch and rxmatch to match some strings.
The pattern is a., which should match a string of at least 2 characters in which a is not the last one.
I run prxmatch('/a./', 'a') and rxmatch('/a./', 'a'), and the result should be 0. But the system returns 1.
So how can I get 0 in this case?
If you write an MCVE for this, you get no match.
data test;
x='a';
rc=prxmatch('~a.~',x);
put x= rc=;
run;
However, if x is not length 1, it will match!
data test;
length x $5;
x='a';
rc=prxmatch('~a.~',x);
put x= rc=;
run;
Why?
Because in SAS, strings are not varchar, they are char: the value is padded with spaces out to the variable's full length, so the . in the pattern matches the first padding space. You would need to do either
data test;
length x $5;
x='a';
rc=prxmatch('~a[^ ]~',x);
put x= rc=;
run;
or, better,
data test;
length x $5;
x='a';
rc=prxmatch('~a.~',trim(x));
put x= rc=;
run;
(Note, I use ~ for my regex delimiter - you're free to use slash, or any other character, for that, it makes no difference.)