Changing spaces with "prxchange", but not all spaces - regex

I need to change the spaces in my text to underscores, but only the spaces between words, not the ones between digits. For example,
"The quick brown fox 99 07 3475"
Would become
"The_quick_brown_fox 99 07 3475"
I tried using this in a data step:
mytext = prxchange('s/\w\s\w/_/',-1,mytext);
But the result was not what I wanted:
"Th_uic_row_ox 99 07 3475"
Any ideas on what I could do?
Thanks in advance.

Data One ;
X = "The quick brown fox 99 07 3475" ;
Y = PrxChange( 's/(?<=[a-z])\s+(?=[a-z])/_/i' , -1 , X ) ;
Put X= Y= ;
Run ;

You are changing
"W W"
to
"_"
when you want to change
"W W"
to
"W_W"
so
prxchange('s/(\w)\s(\w)/$1_$2/',-1,mytext);
Full example:
data test;
mytext='The quick brown fox 99 07 3475';
newtext = prxchange('s/([A-Za-z])\s([A-Za-z])/$1_$2/',-1,mytext);
put _all_;
run;
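For comparison, here is a minimal sketch of the lookaround and capture-group approaches in Python rather than SAS. Python is an assumption made purely so the patterns can be tried outside SAS; its re module accepts the same Perl-style syntax for these particular patterns. The "a b c" input is made up to show one difference: the capture-group version consumes the letters on both sides of the space, so it can skip a space that follows a one-letter word, whereas the lookaround version only consumes the whitespace.

```python
import re

text = "The quick brown fox 99 07 3475"

# Lookarounds: only the whitespace is matched; the letters are just checked.
lookaround = re.sub(r"(?<=[a-z])\s+(?=[a-z])", "_", text, flags=re.I)

# Capture groups: the letters are matched too and written back via \1 and \2.
captured = re.sub(r"([A-Za-z])\s([A-Za-z])", r"\1_\2", text)

print(lookaround)
print(captured)

# Hypothetical input with one-letter words, to show where the two differ.
short_lookaround = re.sub(r"(?<=[a-z])\s+(?=[a-z])", "_", "a b c", flags=re.I)
short_captured = re.sub(r"([A-Za-z])\s([A-Za-z])", r"\1_\2", "a b c")

print(short_lookaround)
print(short_captured)
```

For the sample sentence both versions give the same result; they only diverge when a word is a single letter long.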

You can use the CALL PRXNEXT routine to find the position of each match, then use the SUBSTR function to replace the space with an underscore. I've changed your regular expression because \w matches any alphanumeric character, so it would also match spaces between numbers. I'm not sure how you got your result using that expression.
Anyway, the code below should give you what you want.
data have;
mytext='The quick brown fox 99 07 3475';
_re=prxparse('/[a-z]\s[a-z]/i'); /* match a letter followed by a space followed by a letter, ignore case */
_start=1 /* starting position for search */;
call prxnext(_re,_start,-1,mytext,_position,_length); /* find position of 1st match */
do while(_position>0); /* loop through all matches */
substr(mytext,_position+1,1)='_'; /* replace ' ' with '_' for matches */
_start=_start-2; /* prevents the next start position jumping 3 ahead (the length of the regex search string) */
call prxnext(_re,_start,-1,mytext,_position,_length); /* find position of next match */
end;
drop _: ;
run;

Related

Stata Regex for 'standalone' numbers in string

I am trying to remove a specific pattern of numbers from a string using the regexr function in Stata. I want to remove any run of digits that is not attached to a letter or to another non-whitespace character. For example, if the string contained t370 or 6-test, I would want those to remain; only standalone numbers should be removed.
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
I would like to end up with:
ID string
1 7-test
2 67-tty
3 j3782 3hty
I've tried different regex statements to find numbers wrapped in a word boundary, such as regexr(string, "\b[0-9]+\b", ""), as well as manually adding the white space, " [0-9]+", which only replaces the pattern when it occurs in the middle of a string, not at the start. If it's easier to do this without regular expressions that's fine; I was just trying to become more familiar with them.
Following up on the loop suggestion from the comments, you could do something like the following:
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
gen N_words = wordcount(string) // # words in each string
qui sum N_words
global max_words = r(max) // max # words in all strings
split string, gen(part) parse(" ") // split string at space (p.s. space is the default)
gen string2 = ""
forval i = 1/$max_words {
* add in parts that contain at least one letter
replace string2 = string2 + " " + part`i' if regexm(part`i', "[a-zA-Z]") & !missing(string2)
replace string2 = part`i' if regexm(part`i', "[a-zA-Z]") & missing(string2)
}
drop part* N_words
where the result would be
. list
     +----------------------------------------+
     | id   string                 string2    |
     |----------------------------------------|
  1. |  1   9884 7-test 58 - 489   7-test     |
  2. |  2   67-tty 783 444         67-tty     |
  3. |  3   j3782 3hty             j3782 3hty |
     +----------------------------------------+
Note that I have assumed that you want all words that contain at least one letter. You may need to adjust the regexm here for your specific use case.
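The same "keep every word that contains at least one letter" rule can be sketched outside Stata. This is a rough Python illustration, not a translation of the Stata code; the function name keep_words_with_letters is made up, and it makes the same assumption as above about which words to keep.

```python
import re

def keep_words_with_letters(s):
    # Split on whitespace and keep only the tokens that contain a letter.
    return " ".join(w for w in s.split() if re.search(r"[A-Za-z]", w))

rows = ["9884 7-test 58 - 489", "67-tty 783 444", "j3782 3hty"]
cleaned = [keep_words_with_letters(r) for r in rows]
print(cleaned)
```

Note that the lone "-" in the first row is dropped as well, because it contains no letter.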

Get string between two specific char positions

I have a long text string in SAS. It contains a value of variable length that is always preceded by a '#' and followed by a ','.
Is there a way I can extract this and store it as a new variable, please?
e.g:
word word, word, #12.34, word, word
And I want to get the 12.34
Thanks!
Double scan should also work if you only have a single #:
data _null_;
var1 = 'word word, word, #12.34, word, word';
var2 = scan(scan(var1,2,'#'),1,',');
put var2=;
run;
You can make use of the substr and index functions to do this. The index function returns the first position of the character specified.
data _null_;
var1 = 'word word, word, #12.34, word, word';
pos1 = index(var1,'#'); *Get the position of the first # sign;
tmp = substr(var1,pos1+1); *Create a string that returns only characters after the # sign;
put tmp;
pos2 = index(tmp,','); *Get the position of the first "," in the tmp variable;
var2 = substr(tmp,1,pos2-1);
put var2;
run;
Note that this method only works if there is only one "#" in the string.
One way is to use index to locate the two 'sentinels' delimiting the value and retrieve the innards with substr. If the value is supposed to be numeric, an additional use of input function is needed.
A second way is to use the regular expression routines prxmatch and prxposn to locate and extract the embedded value.
data have;
input;
longtext = _infile_;
datalines;
some thing #12.34, wicked
#, oops
#5a64, oops
# oops
oops ,
oops #
ok #1234,
who wants be a #1e6,aire
space # , the final frontier
double #12, jeopardy #34, alex
run;
data want;
set have;
* locate with index;
_p1 = index(longtext,'#');
if _p1 then _p2 = index(substr(longtext,_p1),',');
if _p2 > 2 then num_in_text = input (substr(longtext,_p1+1,_p2-2), ?? best.);
* locate with regular expression;
if _n_ = 1 then _rx = prxparse('/#(\d*\.?\d*)?,/'); retain _rx;
if prxmatch(_rx,longtext) then do;
call prxposn(_rx,1,_start,_length);
if _length > 0 then num_in_text_2 = input (substr(longtext,_start, _length), ?? best.);
end;
* drop _: ;
run;
The regex way looks only for ##.## variants, whereas the index way looks for anything of the form #...,. As a result, the input function will decipher scientific notation values (such as 1e6) that the example regex pattern will not locate. The ?? modifier in the input function suppresses the invalid-argument NOTEs in the log when the enclosed value cannot be parsed as a number.
Another way is to use a regular expression; the code is given below.
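To make the difference between the two approaches concrete, here is a small sketch in Python (not SAS). The function names by_index and by_regex are made up, and Python's float() only stands in for input(..., ?? best.); the two do not accept exactly the same strings, so treat this as an illustration of the idea rather than of SAS behaviour.

```python
import re

def by_index(text):
    # Locate '#' and the first ',' after it, then try to parse what lies between.
    start = text.find("#")
    if start == -1:
        return None
    end = text.find(",", start)
    if end == -1:
        return None
    try:
        return float(text[start + 1:end])
    except ValueError:
        return None  # not a number, e.g. "5a64" or an empty value

def by_regex(text):
    # Same pattern as the SAS example: digits with an optional decimal point.
    m = re.search(r"#(\d*\.?\d*),", text)
    if not m or not m.group(1):
        return None
    try:
        return float(m.group(1))
    except ValueError:
        return None  # e.g. a lone "."

for line in ["some thing #12.34, wicked", "who wants be a #1e6,aire", "#5a64, oops"]:
    print(line, by_index(line), by_regex(line))
```

The second line is the interesting one: the index-style lookup hands "1e6" to the parser, while the digit-only pattern never matches it.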
data have;
infile datalines truncover ;
input var $200.;
datalines;
word word, word, #12.34, word, word
word1 #12.34, hello hi hello hi
word1 #970000 hello hi hello hi #970022, hi
word1 123, hello hi hello hi #97.99
#99456, this is cool
;
A small note about below regular expression and functions
(?<=#) Zero-width positive look-behind assertion and looking for # before the pattern of interest
(\d+\.?\d+) matches one or more digits, an optional ., and then one or more further digits (so the value needs at least two digits)
(?=,) Zero-width positive look-ahead assertion and looking for , after the pattern of interest
call prxsubstr finds the position and length of pattern and substr extracts the required values.
data want( drop=pattern position length);
retain pattern;
IF _N_ = 1 THEN PATTERN = PRXPARSE("/(?<=#)(\d+\.?\d+)(?=,)/");
set have;
call prxsubstr(pattern, var, position, length);
if position then
match = substr(var, position, length);
run;
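The lookbehind/lookahead pattern can be tried on the sample lines with a short Python sketch (Python is used here only as a convenient way to test the pattern; the helper name extract is made up). Because the lookarounds are zero-width, the match itself contains neither the # nor the comma.

```python
import re

# Same pattern as in the PRXPARSE call above.
PATTERN = re.compile(r"(?<=#)\d+\.?\d+(?=,)")

def extract(line):
    m = PATTERN.search(line)
    return m.group(0) if m else None

lines = [
    "word word, word, #12.34, word, word",
    "word1 #970000 hello hi hello hi #970022, hi",
    "word1 123, hello hi hello hi #97.99",
    "#99456, this is cool",
]
for line in lines:
    print(extract(line))
```

The third line yields nothing because #97.99 is not followed by a comma, and a single-digit value such as "#5," would not match either, since the pattern requires at least two digits.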
If you want to get really lazy you can just do
want = compress(have,".","kd");

How to keep only a specific date pattern within a string in SAS?

I have a text string that contains words, numbers, and dates (in mm/dd/yy format) and want to keep only the date values.
So far, using the compress function I'm able to keep all the numbers and "/" characters:
data _null_;
string = 'text 2342 12/11/15 text 54 11/01/14 49 text 10/23/16 423';
parsed = compress(string, '0123456789/ ', 'k');
put parsed;
run;
Which returns:
12/11/15 54 11/01/14 49 10/23/16 423
What I want is: 12/11/15 11/01/14 10/23/16
How can I accomplish this?
(Adapted from SAS' documentation on CALL PRXNEXT Routine)
data dates(keep = Text Dates);
ExpressionID = prxparse('/\d{2}\/\d{2}\/\d{2}/');
text = 'text 2342 12/11/15 text 54 11/01/14 49 text 10/23/16 423';
length Dates $ 120;
start = 1;
stop = length(text);
/* Use PRXNEXT to find the first instance of the pattern, */
/* then use DO WHILE to find all further instances. */
/* PRXNEXT changes the start parameter so that searching */
/* begins again after the last match. */
call prxnext(ExpressionID, start, stop, text, position, length);
do while (position > 0);
found = substr(text, position, length);
put found= position= length=;
Dates = catx(" ", dates, found);
call prxnext(ExpressionID, start, stop, text, position, length);
end;
run;
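As a quick cross-check of the pattern itself, the same regular expression can be run in Python (used here only because it is easy to try; this is not part of the SAS solution). re.findall returns every non-overlapping match, which plays the role of the PRXNEXT loop.

```python
import re

text = "text 2342 12/11/15 text 54 11/01/14 49 text 10/23/16 423"

# Two digits, slash, two digits, slash, two digits.
found = re.findall(r"\d{2}/\d{2}/\d{2}", text)
dates = " ".join(found)
print(dates)
```

Like the SAS version, this only checks the shape of the value; it does not verify that each match is a valid calendar date.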

R: gsub of exact full string with fixed = T

I am trying to gsub exact FULL string - I know I need to use ^ and $. The problem is that I have special characters in strings (could be [, or .) so I need to use fixed=T. This overrides the ^ and $. Any solution is appreciated.
Need to replace 1st, 2nd element in exact_orig with 1st, 2nd element from exact_change but only if full string is matched from beginning to end.
exact_orig = c("oz","32 oz")
exact_change = c("20 oz","32 ct")
gsub_FixedTrue <- function(i) {
for(k in seq_along(exact_orig)) i = gsub(exact_orig[k],exact_change[k],i,fixed=TRUE)
return(i)
}
Test cases:
print(gsub_FixedTrue("32 oz")) #gives me "32 20 oz" - wrong! Must be "32 ct"
print(gsub_FixedTrue("oz oz")) # gives me "20 oz 20 oz" - wrong! Must remain as "oz oz"
I read a somewhat similar thread, but could not make it work for full string (grep at the beginning of the string with fixed =T in R?)
If you want to match full strings exactly, I don't think you really want to use regular expressions in this case. How about just using the match() function?
fixedTrue<-function(x) {
m <- match(x, exact_orig)
x[!is.na(m)] <- exact_change[m[!is.na(m)]]
x
}
fixedTrue(c("32 oz","oz oz"))
# [1] "32 ct" "oz oz"
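The idea of looking up whole strings instead of pattern matching is not specific to R. As a rough sketch in Python (the names lookup and replace_exact are made up for illustration), a dictionary does the same job as match():

```python
exact_orig = ["oz", "32 oz"]
exact_change = ["20 oz", "32 ct"]

# Map each full original string to its replacement.
lookup = dict(zip(exact_orig, exact_change))

def replace_exact(values):
    # Replace a value only when the whole string is a key; otherwise keep it.
    return [lookup.get(v, v) for v in values]

result = replace_exact(["32 oz", "oz oz", "oz"])
print(result)
```

Because no regular expression is involved, special characters such as [ or . in the strings need no escaping.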

Regex - Match length based on value inside match (using variables ?)

I'd like to know if it's possible to use a value inside the expression as a variable for a second part of the expression.
The goal is to extract some specific strings from a memory dump. One part of the string is based on a (more or less) fixed structure that can be described well using regular expressions. The problem is the second part of the string, which has a variable length and no "footer" or anything else that can be matched as an "END".
Instead there is a length indicator at position 2 of the first part.
Here is a simplified example of a string that I'd like to find (along with all others like it) inside a large file:
00 24 AA BB AA DD EE FF GG HH II JJ ########### (# being unwanted data)
Let's assume that the main structure is always 00 XX AA BB AA, but the last part (starting from DD) varies in length for each string, based on the value of XX.
I know that this can be done in code outside regex, but I am curious whether it's possible :)
Short answer: NO
Long answer:
You can achieve what you want in two steps:
Extract the value inside string
Build dynamically a regexp for matching
PSEUDO CODE
s := '00 24 AA BB AA DD EE FF GG HH II JJ ###########'
re := /00 (\d{2}) AA BB AA/
if s::matches(re) then
    match := re::match(s)
    len := match(1)   // the captured length field, here '24'
    dynamicRE := new Regexp(re::toString() + ' (?:[A-Z]{2} ){' + len + '}')
    // dynamicRE == /00 (\d{2}) AA BB AA (?:[A-Z]{2} ){24}/
    if s::matches(dynamicRE) then
        // MATCH !!
    else
        // NO MATCH !!
    end if
end if
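The two-step idea above can be sketched as runnable Python. Several things here are assumptions, not facts from the question: the length field is read as a decimal count of the two-character groups that follow the header, each group is two characters from 0-9 or A-Z, and the sample string (with length 07) is made up so that the count agrees with the payload.

```python
import re

# Step 1: the fixed header, capturing the length field.
HEADER = re.compile(r"00 (\d{2}) AA BB AA")

def extract(dump):
    results = []
    for m in HEADER.finditer(dump):
        n = int(m.group(1))  # assumed: decimal number of payload groups
        # Step 2: build a second pattern whose repeat count comes from the data.
        body = re.compile(r"(?: [0-9A-Z]{2}){%d}" % n)
        tail = body.match(dump, m.end())  # must start right after the header
        if tail:
            results.append(dump[m.start():tail.end()])
    return results

sample = "## ## 00 07 AA BB AA DD EE FF GG HH II JJ ## ## ##"
print(extract(sample))
```

A header whose payload is shorter than the announced length is simply skipped, which matches the NO MATCH branch of the pseudocode.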