I made a dataset in SAS that reads a text file line by line. While reading those lines, I want to eliminate special characters like *, %, - and ; from the beginning and end of each line.
What function should I use? The characters may occur in any sequence, and I have to replace them with spaces.
Please help!
data forAditi;
infile datalines truncover;
format aLine translated parced $80.;
input #1 aLine $char80.;
** The old-school translate function does a good job but also translates characters in the middle **;
translated = translate(aLine,' ','*%-;');
** Therefore you might prefer regular expressions. The hyphen goes last in each character class so that it is a literal and not a range **;
retain prx_nr;
if _N_ EQ 1 then prx_nr = prxparse('/[ *%;-]*(.*[^ *%;-])/');
match = prxmatch(prx_nr, aLine);
if match then do;
call prxposn(prx_nr, 1, pos, len);
substr(parced,pos) = prxposn(prx_nr, 1, aLine);
end;
/* The leading character class matches zero or more special characters or blanks, the .
   followed by a star matches any characters whatsoever, and the negated class matches one
   non-special character. The quantifiers are greedy, so the match starts at the first
   character, special or not, and ends at the last non-special character. prxposn, however,
   returns the position and length of the part of the match enclosed in (...), i.e. from
   the first non-special character to the last. Because SAS reinitializes all variables
   that are not explicitly retained, parced is blank at the start of each iteration, so we
   only have to copy that part into parced at the right position. */
datalines4;
This is text;
--That should be cleaned up,
And here- you have *% special characters in the middle.
Blanks at the start should be preserved. Right?
;;;;
run;
Please take a look at the TRANSLATE function in SAS.
The first argument is your variable, the second is the replacement character (here a blank), and the third is the list of special characters to be replaced by it. Note that it replaces those characters everywhere in the string, not only at the start and end.
translate(variable,' ','*%-');
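You can use the compress function to remove special characters, either from a defined list of characters or with the 'p' modifier (which removes punctuation marks). To ensure they're only removed at the start/end, also use substr. Note that this removes the characters rather than replacing them with spaces, handles only one character at each end, and that cats also strips leading and trailing blanks from each piece:
/* Assuming 'text' is always 3 or more characters */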
data want ;
set have ;
strStart = substr(text,1,1) ;
strEnd = substr(text,length(text),1) ;
strMid = substr(text,2,length(text)-2) ;
newStart = compress(strStart,,'p') ; /* 'p' modifier: remove punctuation marks */
newEnd = compress(strEnd ,,'p') ;
newStr = cats(newStart,strMid,newEnd) ;
run ;
You could consolidate all those operations into a single statement.
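As an alternative to the single-character approach above, here is a minimal sketch that blanks out whole runs of special characters at both ends and leaves the middle untouched. It swaps in verify and findc (with the 'b' and 'k' modifiers) for compress; the dataset name cleaned_demo, the variable names and the sample lines are made up for illustration.

```sas
data cleaned_demo;
  infile datalines truncover;
  length text cleaned $80;
  input text $char80.;
  /* first position holding a character that is neither special nor blank (0 if none) */
  p_first = verify(text, ' *%-;');
  /* last such position: 'k' searches for characters NOT in the list, 'b' searches backward */
  p_last = findc(text, ' *%-;', 'bk');
  cleaned = ' ';
  /* copy the kept span to its original columns, so stripped characters become blanks */
  if p_first > 0 then
    substr(cleaned, p_first, p_last - p_first + 1) = substr(text, p_first, p_last - p_first + 1);
  put cleaned= $char40.;
datalines4;
**--Keep this; and *this* too--;;
  - leading blanks and a dash
*-*-
;;;;
run;
```

Because the kept text is written back at its original position, the leading special characters end up as blanks rather than being squeezed out, which is what the question asked for.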
I'm trying to subset some data with the following code:
data want;
set have;
array fx(12) fx1-fx12;
do i=1 to 12;
if substr(dx(i),1,4) in ('1115')
or substr(fx(i),1,5) in ('1146%')
then output;
end;
run;
I cross-referenced the output against the original dataset using proc freq. The frequency counts for '1115' match as they should, but those for '1146%' don't. I thought '%' was a wildcard that I could use?
I also tried '/^1146\d*/'.
The % wildcard is recognized only by the LIKE operator in a WHERE clause. In an IF statement, use the string prefix equality (i.e. starts with) operator =: or the prefix-in-set operator IN:.
Also, since the wildcard only stood for whatever follows the first four characters, you could substr 4 characters and check = '1146'. Furthermore, since you are taking the substring from position 1 (the first character), you don't need substr at all when using IN: (see the 3rd example).
To use Perl regular expression pattern matching, use the PRXMATCH function. Your pattern '/^1146\d*/' does not need \d* (0 or more digits): '/^1146/' will match anything that '/^1146\d*/' does.
Example(s):
if substr(dx(i),1,4) in ('1115') or fx(i) =: '1146' then output;
if substr(dx(i),1,4) in ('1115') or substr(fx(i),1,4) = '1146' then output;
/* expanded example for case of checking two prefix possibilities */
if dx(i) in: ('1115') or fx(i) in: ('1146', '124') then output;
if dx(i) =: '1115' or prxmatch('/^1146/', fx(i)) then output;
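To see how the three prefix tests behave side by side, here is a small self-contained sketch. The variable code and the four sample values are invented for illustration; only the operators and PRXMATCH come from the examples above.

```sas
data _null_;
  length code $8;
  do code = '1146', '11467', '1147', '1246';
    starts_1146 = (code =: '1146');               /* prefix equality */
    in_prefixes = (code in: ('1146', '124'));     /* prefix match against a set */
    rx_match    = (prxmatch('/^1146/', code) > 0); /* regex anchored at the start */
    put code= starts_1146= in_prefixes= rx_match=;
  end;
run;
```

Each comparison yields 1 or 0, so the log shows directly which values each test accepts; '1246' should be accepted only by the IN: test, through the '124' prefix.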
I am looking for a function that checks whether a variable contains non-alphabetic characters.
I found the function
notalpha
data test;
set final_step1;
f_test = notalpha(first_name);
l_test = notalpha(last_name);
keep emplid first_name last_name f_test l_test;
run;
but the output looks like this:
Last_name Abate f_test
John 4
It is supposed to show 0.
notalpha("%%%%%"); is supposed to show 1, according to
https://books.google.com/books?id=d58uBZPO0IwC&pg=PA28&lpg=PA28&dq=notalpha+sas&source=bl&ots=XKM3DlDol-&sig=ACfU3U1SReZzc5zjsXcCdls3twlUReOxBA&hl=en&sa=X&ved=2ahUKEwjV_Pmb_vXiAhXkna0KHWrmBYgQ6AEwB3oECAkQAQ#v=onepage&q=notalpha%20sas&f=false
Is there a SAS function that finds non-alphabetic values, or did I make a mistake in my code?
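NOTALPHA treats the trailing blanks that pad a character variable as non-alphabetic, so it reports the position just past the end of the name. Use the TRIMN function to remove trailing spaces; it returns a 0-length string (if necessary) when name is blank.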
pos_notalpha = notalpha ( TRIMN ( name )) ;
If you have leading spaces as well, use STRIP
leftedpos_notalpha = notalpha ( STRIP ( name )) ;
From the help:
NOTALPHA Function
Searches a character string for a nonalphabetic character, and returns
the first position at which the character is found.
and
TRIMN Function
Removes trailing blanks from character expressions, and returns a
string with a length of zero if the expression is missing.
and
STRIP Function
Returns a character string with all leading and trailing blanks removed.
…
The STRIP function returns the argument with all leading and trailing
blanks removed. If the argument is blank, STRIP returns a string with a
length of zero.
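A minimal sketch of the effect described above, assuming a hypothetical 10-character variable called name:

```sas
data _null_;
  length name $10;
  name = 'John';                    /* stored as 'John' plus six trailing blanks */
  raw     = notalpha(name);         /* the padding blank at position 5 is reported */
  trimmed = notalpha(trimn(name));  /* 0: nothing non-alphabetic is left */
  put raw= trimmed=;
run;
```

The same applies to STRIP when the value may also carry leading blanks.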
You can refer to anyalpha function for this purpose, see code below:
data have;
input name $10.;
anyalp=anyalpha(name);
if anyalp=0 then notalpha=1;
else if anyalp>0 then notalpha=0;
drop anyalp;
datalines;
%%%%%
01233
abcdef
#bc
abc123
;
run;
proc print data=have; run;
Documentation: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002194060.htm
I'd use
lengthn(compress(first_name,".",'a'))
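With the 'a' modifier, compress removes all alphabetic characters (and, here, periods as well). If the length of the resulting string is greater than zero, the name contains non-alphabetic characters. lengthn ignores trailing blanks, so a value that is left with only blanks counts as zero.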
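A short sketch of this test turned into a 0/1 flag. The variable name first_name is taken from the question; the flag name has_nonalpha and the sample values are assumptions for illustration.

```sas
data _null_;
  length first_name $10;
  do first_name = 'John', 'J0hn', 'O''Neil';
    /* drop letters and periods, then check whether anything other than blanks remains */
    has_nonalpha = (lengthn(compress(first_name, '.', 'a')) > 0);
    put first_name= has_nonalpha=;
  end;
run;
```

Note that embedded blanks, as in a two-part first name, are not flagged by this test, because a remainder consisting only of blanks has length zero.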
I have a long text string in SAS. Within it is a value of variable length that is always preceded by a '#' and followed by a ','.
Is there a way I can extract this and store it as a new variable, please?
e.g:
word word, word, #12.34, word, word
And I want to get the 12.34.
Thanks!
Double scan should also work if you only have a single #:
data _null_;
var1 = 'word word, word, #12.34, word, word';
var2 = scan(scan(var1,2,'#'),1,',');
put var2=;
run;
You can make use of the substr and index functions to do this. The index function returns the first position of the character specified.
data _null_;
var1 = 'word word, word, #12.34, word, word';
pos1 = index(var1,'#'); *Get the position of the first # sign;
tmp = substr(var1,pos1+1); *Create a string that returns only characters after the # sign;
put tmp;
pos2 = index(tmp,','); *Get the position of the first "," in the tmp variable;
var2 = substr(tmp,1,pos2-1);
put var2;
run;
Note that this method only picks up the value after the first "#" in the string, so it is reliable only when there is a single "#".
One way is to use index to locate the two 'sentinels' delimiting the value and retrieve the innards with substr. If the value is supposed to be numeric, an additional use of input function is needed.
A second way is to use the regular expression routines prxmatch and prxposn to locate and extract the embedded value.
data have;
input;
longtext = _infile_;
datalines;
some thing #12.34, wicked
#, oops
#5a64, oops
# oops
oops ,
oops #
ok #1234,
who wants be a #1e6,aire
space # , the final frontier
double #12, jeopardy #34, alex
run;
data want;
set have;
* locate with index;
_p1 = index(longtext,'#');
if _p1 then _p2 = index(substr(longtext,_p1),',');
if _p2 > 2 then num_in_text = input (substr(longtext,_p1+1,_p2-2), ?? best.);
* locate with regular expression;
if _n_ = 1 then _rx = prxparse('/#(\d*\.?\d*)?,/'); retain _rx;
if prxmatch(_rx,longtext) then do;
call prxposn(_rx,1,_start,_length);
if _length > 0 then num_in_text_2 = input (substr(longtext,_start, _length), ?? best.);
end;
* drop _: ;
run;
The regex way looks only for plain decimal values between # and , (digits with an optional decimal point), whereas the index way picks up whatever lies between # and , and leaves the parsing to the input function. As a result, input will decipher scientific-notation values such as 1e6 that the example regex pattern will not locate. The ?? modifier in the input function suppresses the invalid-data NOTEs in the log when the enclosed value cannot be parsed as a number.
Another way is to use a regular expression; the code is given below.
data have;
infile datalines truncover ;
input var $200.;
datalines;
word word, word, #12.34, word, word
word1 #12.34, hello hi hello hi
word1 #970000 hello hi hello hi #970022, hi
word1 123, hello hi hello hi #97.99
#99456, this is cool
;
A short note on the regular expression and functions used below:
(?<=#) is a zero-width positive look-behind assertion that requires a # before the pattern of interest
(\d+(?:\.\d+)?) matches one or more digits, optionally followed by a . and more digits
(?=,) is a zero-width positive look-ahead assertion that requires a , after the pattern of interest
call prxsubstr finds the position and length of the match, and substr extracts the required value.
data want( drop=pattern position length);
retain pattern;
IF _N_ = 1 THEN PATTERN = PRXPARSE("/(?<=#)(\d+(?:\.\d+)?)(?=,)/");
set have;
call prxsubstr(pattern, var, position, length);
if position then
match = substr(var, position, length);
run;
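If you want to get really lazy, you can just do the following. It keeps every digit and period in the whole string, so it is reliable only when the value is the only number in the text: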
want = compress(have,".","kd");
Suppose I have two strings to convert from SAS program names to table numbers.
My goal is to convert the first, "f-2-2-7-5-vcb", to "2.2.7.5".
This should be done dynamically: for "f-2-2-12-1-2-hbd87q",
the result needs to be "2.2.12.1.2".
How can I accomplish this?
data input;
input str $ 1-20;
datalines;
f-2-3-1-5-vcb
f-2-4-1-6-rtg
f-2-3-11-1-3-hb17
;
run;
data want;
set input;
Sub=compress(substr(str,3,length(str)),,'kd') ;
run;
Bit of a longer way, but this works fine for me.
Use FIND() to find the first '-'
Use REVERSE() and FIND() to find the last '-'
Use SUBSTR() and the positions found above to remove the first and last components
Use TRANSLATE() to convert the - to periods.
z=find(str, '-');
end=find(strip(reverse(str)), '-');
string = translate(substr(str, z+1, length(str) - z - end), ".", "-");
A regular expression can match the dash delimited digits only sequence. The match, when extracted, can be transformed using translate.
data input;
input str $ 1-20;
rx = prxparse ("/^.*?((\d+)(-\d+)*)/");
if prxmatch(rx,str) then do;
call prxposn (rx,1,s,e);
name = substr(str,s,e);
name = translate(name,'.','-');
end;
datalines;
f-2-3-1-5-vcb
f-2-4-1-6-rtg
f-2-3-11-1-3-hb17
funky2-2-1funky
f-2-hb17
a2bfunky
;
run;
A funky situation occurs if the digits only token sequence is preceded by a token ending with digits, or succeeded by a token starting with digits.
data input;
input str $ 1-20;
string=translate(prxchange('s/\w+?\-(.*)\-\w+/$1/',-1,strip(str)),'.','-');
datalines;
f-2-3-1-5-vcb
f-2-4-1-6-rtg
f-2-3-11-1-3-hb17
;
run;
You can do this in one line. Use substr to keep the text between the second word and the last word:
translate(substr(str,find(str,scan(str,2,'-')),find(str,scan(str,-1,'-'))-find(str,scan(str,2,'-'))-1),'.','-')
step1 - find(str,scan(str,2,'-')) : finds the starting position of the second word.
step2 - find(str,scan(str,-1,'-')) : finds the starting position of the last word.
step3 - step2 - step1 - 1 : the length of the text to copy, i.e. up to the end of the second-to-last word.
substr(str,step1,step3) : copies the text from the second word through the second-to-last word.
translate : replaces '-' with '.'
Code:
data want;
set input;
Sub=translate(substr(str,find(str,scan(str,2,'-')),find(str,scan(str,-1,'-'))-find(str,scan(str,2,'-'))-1),'.','-');
put _all_;
run;
Output:
str=f-2-3-1-5-vcb Sub=2.3.1.5
str=f-2-4-1-6-rtg Sub=2.4.1.6
str=f-2-3-11-1-3-hb17 Sub=2.3.11.1.3
I have a variable that contains a number of firms separated by the | symbol. I would like to count how many firms there are (i.e., the number of | characters plus 1) and, ideally, identify the location of each | symbol in the string. Note that there will not be more than five firms in a single variable. I was trying to use the following approach but ran into the fact that SAS treats the | symbol as a special operator.
pattern1 = prxparse('/|/'); /* I can't seem to get SAS to treat this as a text to compare */
start = 1;
stop = length(reassignment2); /* my list of firms is in the variable reassignment2 */
call prxnext(pattern1, start, stop, reassignment2, position, length);
ARRAY Y[5];
do J=1 to 5 while (position > 0);
Y[J]=position;
call prxnext(pattern1, start, stop, reassignment2, position, length);
end;
nfirms=j+1;
run;
I would do it somewhat differently. What you really want is not the number of | characters but the actual firms, right? So search for those. Your code had a number of minor issues. First, you must prxmatch before using call prxnext. Second, your j+1 is wrong because the loop iterator increments one beyond the last qualifying loop value (I use j-1 because I find one more element than you do). Third, | is a regular expression metacharacter and must be escaped if you want to match it literally, unless it is inside [] as in my pattern.
data test;
infile datalines truncover;
input #1 reassignment2 $50.;
pattern1 = prxparse('/[^|]+/io'); /* Look for non-| characters */
start = 1;
stop = length(reassignment2); /* my list of firms is in the variable reassignment2 */
rc=prxmatch(pattern1,reassignment2);
if rc>0 then do;
ARRAY Y[5];
do J=1 by 1 until (position = 0);
call prxnext(pattern1, start, stop, reassignment2, position, length);
if position > 0 then Y[J]=position; /* guard: with five firms the final, failing pass has J=6 */
end;
nfirms=j-1;
end;
else nfirms=0;
put nfirms=;
datalines;
Firm1|Firm2|Firm3
Firm1|Firm2|Firm3|Firm4
Firm1
Firm1|Firm2
;;;;
run;
For completeness' sake, you could also do this easily without regular expressions, using call scan.
data test;
infile datalines truncover;
input #1 reassignment2 $50.;
array y[5];
do nfirms=1 by 1 until (position le 0);
call scan(reassignment2,nfirms,position,length,'|');
if position > 0 then y[nfirms]=position; /* guard: with five firms the final pass has nfirms=6 */
end;
nfirms=nfirms-1; *loop ends one iteration too late;
put nfirms=;
datalines;
Firm1|Firm2|Firm3
Firm1|Firm2|Firm3|Firm4
Firm1
Firm1|Firm2
;;;;
run;
I agree with @Joe that this could be done more simply without regular expressions, though I would simplify his code a little further to exclude the use of an array.
data test;
infile datalines truncover length = reclen;
input firmlist $varying256. reclen;
i = 0;
do until(scan(firmlist,i,"|") = "");
i + 1;
end;
nfirms = i - 1;
drop i;
datalines;
Firm1|Firm2|Firm3
Firm1|Firm2|Firm3|Firm4
Firm1
Firm1|Firm2
;
run;
You said you'd also like to capture the position of the "|" character in the string, but if there are multiple firms per record there will be multiple "|" characters in the string. If you want the position of each one, an array might be a better route, though if you only want one, the index function will get you what you want. You'd use delimpos = index(firmlist,"|");.
I hope that helps!
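For the case where both the count and every delimiter position are wanted, here is a minimal sketch that swaps in countw for the scan loop and walks the string with find. The dataset name firm_counts and the variables barpos1-barpos4, nbars and p are made up for illustration. It assumes a non-blank list with no empty entries, since countw treats consecutive delimiters as one by default.

```sas
data firm_counts;
  infile datalines truncover;
  input firmlist $50.;
  array barpos[4];                     /* at most five firms means at most four delimiters */
  nfirms = countw(firmlist, '|');      /* words separated by | only */
  nbars = 0;
  p = find(firmlist, '|');
  do while (p > 0 and nbars < dim(barpos));
    nbars = nbars + 1;
    barpos[nbars] = p;                 /* position of the nbars-th | in the string */
    p = find(firmlist, '|', p + 1);    /* continue searching after the previous hit */
  end;
  drop p;
  put nfirms= barpos1= barpos2= barpos3= barpos4=;
datalines;
Firm1|Firm2|Firm3
Firm1|Firm2|Firm3|Firm4
Firm1
Firm1|Firm2
;
run;
```

Unused array elements stay missing, so the number of non-missing barpos values plus one should agree with nfirms.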