Is there an alternative to regexp_replace() and regexp_extract() in SAS?

I have
select
regexp_replace(city_name,',','') as city_name
, regexp_extract(regexp_replace(postal_cd,',','') ,'^(.*?)(?:-)(.*)$',1) as zip5
This works in Hue, but I want to get the same output in SAS. What can replace the regexp_replace and regexp_extract functions in SAS?
I tried using replace, but that does not work in SAS.

Use the SAS function PRXCHANGE to replace, and the CALL PRXSUBSTR routine (or PRXPOSN after a PRXMATCH) to extract.
Replacing a matching character with nothing can also be done with COMPRESS.
Extracting words from a delimited string can also be done with SCAN.
The non-regular-expression ways (COMPRESS, SCAN) are generally faster because they are very specific in their implementation.
Example:
Use COMPRESS and SCAN
data have;
city = 'Spring,field';
zip = '1,2,3,4,5-6,7,8,9';
run;
proc sql;
create table want as
select
compress(city,',') as city
, scan(compress(zip,','),1,'-') as zip5
from
have
;
quit;
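For comparison, here is a minimal sketch of the regular-expression route with PRXCHANGE, assuming the same HAVE data set as above. The names want_prx, city_clean, zip_nocomma and zip5 are made up for illustration. Note one difference from Hive's regexp_extract: when the value contains no hyphen, the second PRXCHANGE finds no match and returns the input unchanged rather than an empty string.

```sas
data want_prx;
  set have;
  length city_clean zip_nocomma zip5 $40;
  /* -1 means: replace every occurrence of the comma */
  city_clean  = prxchange('s/,//', -1, city);
  zip_nocomma = prxchange('s/,//', -1, zip);
  /* keep only the text before the first hyphen (capture group 1) */
  zip5 = prxchange('s/^(.*?)-.*$/$1/', 1, zip_nocomma);
  put city_clean= zip5=;
run;
```

With the sample row this should give Springfield and 12345, the same as the COMPRESS and SCAN version.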

Related

Unable to convert a character variable with numbers with a comma into numeric

I have a set of variables in SAS that should be numeric but are character. The values use a comma as the decimal separator and I need a point. For example, I need 19,000417537 to be 19.000417537. I tried translate without success: the comma is still there and I'm not able to convert the variable to numeric using input(). Can anyone help me please?
Thank you in advance
Best
Use INPUT() with the COMMAX informat.
data have;
length have $20.;
have = "19,000417537";
want = input(have, commax32.);
format want 32.8;
run;
proc print data=have;
run;
Obs    have            want
  1    19,000417537    19.00041754
In two steps you can replace the , with . with tranwrd and then use input to convert it to numeric.
data yourdf;
set df;
charnum2=tranwrd(charnum, ",", "."); /*replace , with .*/
numvar = input(charnum2, 12.); /*convert to numeric*/
run;
You can use the COMMA informat to read strings with commas in them. But if you want it to treat the commas as decimal points instead of ignoring them, then you probably need to use COMMAX instead (or perhaps use the NLNUM informat, so that the meaning of commas and periods in the text will depend on your LOCALE settings).
So if the current dataset is named HAVE and the text you want to convert is in the variable named STRING you can create a new dataset named WANT with a new numeric variable named NUMBER with code like this:
data want;
set have;
number = input(string,commax32.);
run;
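To see the difference between the two informats side by side, here is a small sketch that reads the same text both ways. The variable names as_comma and as_commax are illustrative only.

```sas
data _null_;
  string = '19,000417537';
  /* COMMA ignores the comma, treating it as a thousands separator */
  as_comma  = input(string, comma32.);
  /* COMMAX treats the comma as the decimal point */
  as_commax = input(string, commax32.);
  put as_comma= best32. as_commax= best32.;
run;
```

The first value should come out as the integer 19000417537 and the second as 19.000417537.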

Find Dot Separated Words in a String

I need to parse a log file to pick out strings that match the following case-insensitive pattern:
libname.data <--- Okay
libname.* <--- Not okay
For those with SAS experience, I'm trying to get SAS dataset names out of a large log.
All strings are space-separated. Some examples of lines:
NOTE: The data set LIBNAME.DATA has 428 observations and 15 variables.
MPRINT(MYMACRO): data libname.data;
MPRINT(MYMACRO): create table libname.data(rename=(var1 = var2)) as select distinct var1, var2 as
MPRINT(MYMACRO): format=date. from libname.data where ^missing(var1) and ^missing(var2) and
What I've tried
This PERL regular expression:
/^(?!.*[.*]{2})[a-z0-9*_:-]+(?:\.[a-z0-9;_:-]+)+$/mi
https://regex101.com/r/jYkXn5/1
In SAS code:
data test;
line = 'words and stuff libname.data';
test = prxmatch('/^(?!.*[.*]{2})[a-z0-9*_:-]+(?:\.[a-z0-9;_:-]+)+$/mi', line);
run;
Problem
This will work when the line only contains this exact string, but it will not work if the line contains other strings.
Solution
Thanks, Blindy!
The regex that worked for me to parse SAS datasets from a log is:
/(?!.*[.*]{3})[a-z_]+[a-z0-9_]+(?:\.[a-z0-9_]+)/mi
data test;
line = 'NOTE: COMPRESSING DATA SET LIBNAME.DATA DECREASED SIZE BY 46.44 PERCENT';
prxID = prxparse('/(?!.*[.*]{3})[a-z_]+[a-z0-9_]+(?:\.[a-z0-9_]+)/mi');
call prxsubstr(prxID, line, position, length);
dataset = substr(line, position, length);
run;
This will still pick up some SQL select statements but that is easily solvable through post-processing.
You anchored your expression at the beginning. Simply remove the first ^ and you're set.
/(?!.*[.*]{2})[a-z0-9*_:-]+(?:\.[a-z0-9;_:-]+)+$/mi
You can get by just locating the following landmark text in a log file line.
... data set <LIBNAME>.<MEMNAME> ...
If the data set name is in the log you can presume it was correctly formed.
data want;
length line $1000;
infile LOG_FILE lrecl=1000 length=L;
input line $VARYING. L;
* literally "data set <name>" followed by space or period;
rx = prxparse('/data set (.*?)\.(.*?)[. ]/');
if prxmatch(rx,line) then do;
length libname $8 memname $32;
libname = prxposn(rx,1,line);
memname = prxposn(rx,2,line);
line_number = _n_;
output;
end;
keep libname memname line_number;
run;
Some adjustment would be needed if the data set names are name literals of the form '<anything>'N
There are also a plethora of existing SAS Log file parsers and analyzers out on the web that you can utilize.
The lookahead at the start prevents matching .. but the pattern by itself will not match that, as the character classes are repeated 1 or more times and do not contain a dot.
If you don't want to match ** as well, and the string should not start with *, you can add that to a character class [*.] together with the dot, and take it out of the first character class.
In that case, you could omit the negative lookahead and the anchor:
/[a-z0-9_:-]+(?:[.*][a-z0-9_:-]+)+/i
As the pattern does not contain any anchors, you could omit the m flag.
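A minimal sketch of how that unanchored pattern might be used in a DATA step, assuming a single log line held in a variable. The names pos, len and dataset are illustrative. CALL PRXSUBSTR sets the position to 0 when nothing matches, so the SUBSTR call is guarded.

```sas
data test;
  length line $200 dataset $41;
  line = 'MPRINT(MYMACRO): data libname.data;';
  prxID = prxparse('/[a-z0-9_:-]+(?:[.*][a-z0-9_:-]+)+/i');
  /* pos and len receive the start and length of the first match */
  call prxsubstr(prxID, line, pos, len);
  /* pos is 0 when there is no match, so only extract on a hit */
  if pos > 0 then dataset = substr(line, pos, len);
  put dataset=;
run;
```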

One function to replace different text with other in SAS

I want to replace one combination of text with another. For example
data test;
a='raja\ram{work}italic';
if index(a,'\') then b=tranwrd(a,'\','\\');
if index(a,'{') then b=tranwrd(a,'{','\{');
if index(a,'}') then b=tranwrd(a,'}','\}');
if index(upcase(a),'ITALIC') then b=tranwrd(a,substr(a,index(upcase(a),'ITALIC'),length('ITALIC')),'\i');
run;
Required Result: b=raja\\ram\{work\}\i;
These are the kinds of combinations I want to replace. I would rather not use a macro, FCMP, or if/else conditions.
Is there any function to do it all at once? I tried a Perl regular expression, which also works, but only for one replacement at a time: b= prxchange('s/\\/\\\\/', -1, a)
Your regular expression is on the right track. You have a set of characters that you always want to prepend a \ to, right? So search for one of that set of characters, which you do with [...], put it in a capturing group, and then write a \ in front of the captured text in the replacement. Because \ is the escape character, you have to write two of them any time you want one (\\ escapes itself to \).
data test;
a='Hello\Goodbye{stuff}';
b= prxchange('s/([\\{}])/\\$1/',-1,a);
put b=;
run;
You should do the italic bit in a second expression (or just use tranwrd). That's a totally different replacement and while theoretically possible to put in one, would make it too messy.
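Putting the two steps together for the string from the question, a sketch might look like this. It assumes the italic marker should be matched case-insensitively, which is what the upcase() calls in the question suggest.

```sas
data test;
  length b $200;
  a = 'raja\ram{work}italic';
  /* step 1: prefix each \, { or } with a backslash */
  b = prxchange('s/([\\{}])/\\$1/', -1, a);
  /* step 2: replace the word italic (any case) with \i */
  b = prxchange('s/italic/\\i/i', -1, b);
  put b=;
run;
```

The order matters: doing the italic replacement first would introduce a backslash that step 1 would then double.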
This question is almost identical to the other question: Multiple search and replace within a string through regular expression in SAS
Is that a coincidence?
Here is the code that worked for the other question.
%let text = abc\pqr{work};
data _null_;
var=prxchange("s/\\/\\\\/",-1,"&text");
var=prxchange("s/\{/\\\{/",-1,var);
var=prxchange("s/\}/\\\}/",-1,var);
put var;
run;
Result: abc\\pqr\{work\};
%let text = BOLD\ITALIC\ITALICBOLD\BOLDITALIC\B\I\IB\BI;
data _null_;
var=prxchange("s/BOLD/b/",-1,"&text");
var=prxchange("s/ITALIC/i/",-1,var);
var=lowcase(var);
put var;
run;
RESULT: b\i\ib\bi\b\i\ib\bi

Convert string into numeric and change period to comma separator in SAS

I have a string called weight that is 85.5
I would like to convert it into a numeric 85,5 and replace the decimal separator with a comma using SAS.
So far I am using this (messy) two step approach
weight_num= (weight*1);
format weight_num COMMAX13.2;
How can this be achieved in a less clumsy way?
Your sample code is the recommended method of changing a variable type.
Another way is the TRANSTRN function, which replaces the . with a comma. This is only a good method if you don't plan to do any calculations on the values.
data have;
set sashelp.class;
keep name weight:;
weight_char=put(weight, 8.1);
run;
data want;
set have;
weight_char=transtrn(weight_char, ".", ",");
run;
proc print data=want;
run;
If you just want to change it so that commas are used for the decimal point instead of periods, then why not just use a simple character substitution? Do you also want to change the thousands separator from comma to period? TRANSLATE() is good for that.
weight = translate(weight,',.','.,');
If you want to convert it to a number then use the INPUT() function on the original string (before any substitution, since the COMMA informat ignores commas) rather than forcing SAS to convert for you.
weight_num = input(weight,comma32.);
You can then attach whatever format you want to the new numeric variable.
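As a small sketch of that last route, assuming weight holds the text 85.5 as in the question, the conversion and the display format can be kept separate. The stored value stays numeric; only its printed form uses the comma.

```sas
data _null_;
  weight = '85.5';
  /* explicit character-to-numeric conversion */
  weight_num = input(weight, comma32.);
  /* COMMAX displays the decimal point as a comma */
  put weight_num= commax13.1;
run;
```

This should write weight_num=85,5 to the log.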

Hive - regexp_replace function for multiple strings

I am using hive 0.13! I want to find multiple tokens like "hip hop" and "rock music" in my data and replace them with "hiphop" and "rockmusic" - basically replace them without white space. I have used the regexp_replace function in hive. Below is my query and it works great for above 2 examples.
drop table vp_hiphop;
create table vp_hiphop as
select userid, ntext,
regexp_replace(regexp_replace(ntext, 'hip hop', 'hiphop'), 'rock music', 'rockmusic') as ntext1
from vp_nlp_protext_males
;
But I have 100 such bigrams/ngrams and want to be able to do replace efficiently where I just remove the whitespace. I can pattern match the phrase - hip hop and rock music but in the replace I want to simply trim the white spaces. Below is what I tried. I also tried using trim with regexp_replace but it wants the third argument in the regexp_replace function.
drop table vp_hiphop;
create table vp_hiphop as
select userid, ntext,
regexp_replace(ntext, '(hip hop)|(rock music)') as ntext1
from vp_nlp_protext_males
;
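You can use the TRANSLATE function, which works character by character: each character in the second argument is replaced by the character at the same position in the third, and characters with no counterpart are deleted. Note that this removes every space in ntext, not only the spaces inside the listed phrases. For your query it would become this: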
drop table vp_hiphop;
create table vp_hiphop as
select userid, ntext,
translate(ntext, ' ', '') as ntext1
from vp_nlp_protext_males
;