SAS: Remove duplicated expressions from given list using REGEX

I would like to remove duplicated expressions from a given string using SAS code. Each expression is delimited by a space and respects the following REGEX /[A-Z]_\d{2}.\d{2}(.[a-z])?/.
Here is the code:
data want;
text = "X_99.99.a X_99.99.a A_12.00 A_12.00 A_13.00 A_12.00 X_99.99.a";
do i=1 to countw(text);
Nondups=prxchange('s/\b(\w+)\s\1/$1/',-1,compbl(text));
end;
run;
The desired result should be:
Nondups ="X_99.99.a A_12.00 A_13.00"
What should be the regular expression to be used inside the function prxchange?
Any help appreciated.

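You may use the following (strip, unlike trim, also removes the leading blank that is left over after the replacements):
Nondups=strip(prxchange('s/\s*([A-Z]_\d{2}\.\d{2}(?:\.[a-z])?)(?=.*\1)//',-1, text));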
See the regex demo
The pattern matches:
\s* - 0+ whitespaces
([A-Z]_\d{2}\.\d{2}(?:\.[a-z])?) - Group 1:
[A-Z] - an uppercase ASCII letter
_ - an underscore
\d{2} - two digits
\. - a dot (must be escaped)
\d{2} - two digits
(?:\.[a-z])? - an optional group matching 1 or 0 sequences of a . and a lowercase ASCII letter
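(?=.*\1) - a positive lookahead that requires the value stored in Group 1 to appear again somewhere to the right of the current location, after any 0+ chars other than line break chars. Every occurrence that is followed by a later copy is removed, so only the last occurrence of each expression is kept.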
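To check the behaviour outside SAS, here is a minimal Python sketch of the same pattern. Python's re module and the names DEDUP, keep_last and keep_first are assumptions for illustration, not part of the original answer. It shows which occurrence survives, and adds an order-preserving alternative that does not use a regex at all.

```python
import re

text = "X_99.99.a X_99.99.a A_12.00 A_12.00 A_13.00 A_12.00 X_99.99.a"

# Same pattern as in the prxchange call: drop a token when it occurs again later.
DEDUP = re.compile(r"\s*([A-Z]_\d{2}\.\d{2}(?:\.[a-z])?)(?=.*\1)")
keep_last = DEDUP.sub("", text).strip()

# Alternative (not from the answer): keep the first occurrence of each token.
# dict.fromkeys preserves insertion order in Python 3.7+.
keep_first = " ".join(dict.fromkeys(text.split()))

print(keep_last)   # A_13.00 A_12.00 X_99.99.a
print(keep_first)  # X_99.99.a A_12.00 A_13.00
```

The second variant reproduces the order given in the question's desired result; the regex variant returns the same set of expressions in a different order.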

Related

Regex exclude whitespaces from a group to select only a number

I need to take only a number (a float number) from a text, but I can't remove the whitespaces...
Update: I have a problem with this method. As a rule, I only need to consider the digits and ',' between '- EUR' and 'Fee'.
You can use
- EUR\W*(.*?)\W*Fee
See the regex demo.
Variations of the regex that might work in different regex engines:
- EUR\W*\K.*?(?=\W*Fee)
(?<=- EUR\W*).*?(?=\W*Fee)
Details:
- EUR - literal text
\W* - zero or more non-word chars
(.*?) - Group 1: any zero or more chars other than line break chars as few as possible
\W* - zero or more non-word chars
Fee - literal text.
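As a quick check of the capturing-group version, here is a small Python sketch. The sample line is hypothetical, since the question's input text is not shown, and Python's re module is only one of the engines this pattern would work in.

```python
import re

# Hypothetical sample line; the original question's text is not shown.
sample = "Transfer - EUR   12,50   Fee"

# Group 1 holds whatever sits between "- EUR" and "Fee",
# with the surrounding non-word chars (here: spaces) left out.
m = re.search(r"- EUR\W*(.*?)\W*Fee", sample)
amount = m.group(1) if m else None

print(amount)  # 12,50
```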
You could also match the number format in capture group 1
- EUR\b\D*(\d+(?:,\d+)?)\s+Fee\b
- EUR\b Match - EUR and a word boundary
\D* Match 0+ times any char except a digit
( Capture group 1
\d+(?:,\d+)? Match 1+ digits with an optional decimal part
) Close group 1
\s+Fee\b Match 1+ whitespace chars, Fee and a word boundary
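The stricter pattern can be tried the same way. This is a sketch on a hypothetical sample line; converting the comma to a dot before calling float is an assumption about the number format, not something stated in the question.

```python
import re

# Hypothetical sample line; the original question's text is not shown.
sample = "Transfer - EUR   12,50   Fee"

m = re.search(r"- EUR\b\D*(\d+(?:,\d+)?)\s+Fee\b", sample)

# Assumes the comma is a decimal separator; swap it for a dot before parsing.
value = float(m.group(1).replace(",", ".")) if m else None

print(value)  # 12.5
```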
Regex demo
This is working; I removed the , from (.) in the test string.
Regex example - working

Extracting a number in string inside brackets with Regexextract

I am trying to extract only the number (a float?) from accounting numbers in Google Sheets. The numbers carry abbreviated units like K, M, B and are sometimes in brackets when negative. Sorry, I am new to regex: how do I write a regular expression covering different possibilities like (213M), (31.23B)?
\(([0-9.]+\.\[0-9.]+)\)
You may use
\((-?\d+(?:\.\d+)?)[KMB]\)
Details
\( - a literal ( char
(-?\d+(?:\.\d+)?) - Group 1:
-? - an optional -
\d+ - 1+ digits
(?:\.\d+)? - an optional non-capturing group matching one or zero occurrences of a dot followed with 1+ digits
[KMB] - a character class matching K, M or B
\) - a literal ) char.
See the regex demo.
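A short Python sketch of the pattern applied to the two inputs from the question. The helper name extract_number is hypothetical, and Python's re module stands in for REGEXEXTRACT here.

```python
import re

PATTERN = re.compile(r"\((-?\d+(?:\.\d+)?)[KMB]\)")

def extract_number(cell):
    """Return the bracketed number as a string, or None if the cell has no match."""
    m = PATTERN.search(cell)
    return m.group(1) if m else None

print(extract_number("(213M)"))    # 213
print(extract_number("(31.23B)"))  # 31.23
print(extract_number("213M"))      # None, since there are no brackets
```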

Regex 2 Numbers OR Special Characters

I'm trying to create a regex which does the following:
Min 12 characters, requires uppercase, requires lowercase, requires 2 numeric values OR 2 special characters.
At the moment I have the following:
~^(?=\P{Ll}*\p{Ll})(?=\P{Lu}*\p{Lu})(?=.*[!##$%^&*()]|\D*\d).{12,}~u
This requires 1 numeric OR 1 special character, not 2. I've tried adding {2} to the OR condition, but that requires a combination of the two, which is incorrect.
Any help would be appreciated.
You should replace (?=.*[!##$%^&*()]|\D*\d) lookahead with (?:(?=(?:[^!##$%^&*()]*[!##$%^&*()]){2})|(?=(?:\D*\d){2})). The regex will look like
'~^(?=\P{Ll}*\p{Ll})(?=\P{Lu}*\p{Lu})(?:(?=(?:[^!##$%^&*()]*[!##$%^&*()]){2})|(?=(?:\D*\d){2})).{12,}$~u'
See the regex demo.
Each of the two alternative lookaheads matches a location that is immediately followed with
(?:[^!##$%^&*()]*[!##$%^&*()]){2} - two repetitions of any 0+ chars other than !##$%^&*() chars followed with a char from the !##$%^&*() list
| - or
(?:\D*\d){2} - two repetitions of any 0+ non-digit chars followed with a digit
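The "two digits OR two special characters" logic can be exercised with a small Python sketch. Python's built-in re module does not support \p{Ll} and \p{Lu}, so this version swaps in the ASCII classes [a-z] and [A-Z]; that substitution, and the helper name is_valid, are assumptions for illustration. The special-character class is copied from the answer as written.

```python
import re

PASSWORD = re.compile(
    r"^(?=[^a-z]*[a-z])"                         # at least one lowercase letter (ASCII stand-in)
    r"(?=[^A-Z]*[A-Z])"                          # at least one uppercase letter (ASCII stand-in)
    r"(?:(?=(?:[^!##$%^&*()]*[!##$%^&*()]){2})"  # two special characters ...
    r"|(?=(?:\D*\d){2}))"                        # ... or two digits
    r".{12,}$"                                   # 12 or more characters in total
)

def is_valid(password):
    return PASSWORD.search(password) is not None

print(is_valid("Abcdefghij12"))   # True: two digits
print(is_valid("Abcdefghij!#"))   # True: two special characters
print(is_valid("Abcdefghijk1!"))  # False: one digit plus one special character
```

A password with one digit and one special character is rejected, which is the behaviour the question asks for.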

How can I write a regexp that will allow only digits and commas, and only digits at the beginning and the end of the string?

How can I write a regexp that will check if a string starts and ends with digits and in between contains only digits and commas? Commas must also be separated from each other by at least one digit. For the conditions above I have the following regexp: ^\d(,?\d)*$ but I have the following additional condition: all comma-separated integers, which are composed of sequences of digits, must be different from each other. What would be the regexp that allows only this kind of string?
Thank you
First of all, your regex contains unquantified \d, and that matches only single digits. You need to add + after \d to match 1 or more digits.
To avoid having duplicate values, you may use
^(?!.*\b(\d+)\b.*\b\1\b)\d+(?:,\d+)*$
^^^^^^^^^^^^^^^^^^^^^^^
See the regex demo
The (?!.*\b(\d+)\b.*\b\1\b) is a negative lookahead that fails the match if, after any 0+ chars other than line break chars, there is a whole group of digits that appears again later in the string (after another 0+ chars other than line break chars). The word boundaries make sure whole numbers are compared, so 1 is not treated as a duplicate of 12.
Details
^ - start of string
(?!.*\b(\d+)\b.*\b\1\b) - a negative lookahead that fails the match if identical values appear in the text
\d+ - 1+ digits
(?:,\d+)* - zero or more occurrences of
, - a comma
\d+ - 1+ digits
$ - end of string.
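A few inputs run through the pattern in Python illustrate the behaviour described above. This is a sketch; the helper name is_unique_list is hypothetical.

```python
import re

UNIQUE_INTS = re.compile(r"^(?!.*\b(\d+)\b.*\b\1\b)\d+(?:,\d+)*$")

def is_unique_list(s):
    """True if s is a comma-separated list of integers with no repeated value."""
    return UNIQUE_INTS.match(s) is not None

print(is_unique_list("1,2,3"))   # True
print(is_unique_list("1,2,1"))   # False: 1 appears twice
print(is_unique_list("12,1,2"))  # True: 12, 1 and 2 are different whole numbers
print(is_unique_list("1,,2"))    # False: two commas in a row
```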

Replacing last occurrence of character group with Oracle REGEXP_REPLACE

I have strings like the following in my Oracle 11g table:
ABCDEF000xyz12345abcdefgh
GHIJK0000def67890abcdefgh
I.e., the strings begin with capital letters followed by a series of zeros, followed by three characters, digits and characters again.
How can I replace the xyz12345abcdefgh and def67890abcdefgh with a certain string using REGEXP_REPLACE in Oracle?
If you need to only select the records of the type you mentioned, consider using
select REGEXP_REPLACE(col, '^([[:upper:]]+0+)[[:alpha:]]{3}\d+[[:alpha:]]+$', '\1NEW_STRING') from table_name;
Details:
^ - a start of string
([[:upper:]]+0+) - capturing group #1 matching:
[[:upper:]]+ - 1 or more uppercase letters
0+ - one or more 0 chars
[[:alpha:]]{3} - 3 alphabetic chars
\d+ - 1 or more digits
[[:alpha:]]+ - 1 or more alphabetic chars
$ - end of string.
The \1 in the replacement string is a backreference that inserts the value stored in the capturing group #1 buffer.
See the online demo.
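The capture-and-reinsert idea can also be checked outside Oracle. The Python sketch below is an approximation: Python's re module has no POSIX classes, so [[:upper:]] and [[:alpha:]] are replaced with the ASCII ranges [A-Z] and [A-Za-z], which is an assumption that holds for the sample strings only. The helper name replace_tail is hypothetical.

```python
import re

# ASCII stand-ins for the POSIX classes used in the Oracle pattern.
TAIL = re.compile(r"^([A-Z]+0+)[A-Za-z]{3}\d+[A-Za-z]+$")

def replace_tail(value, new_string="NEW_STRING"):
    # \g<1> re-inserts the captured prefix (uppercase letters plus zeros).
    return TAIL.sub(r"\g<1>" + new_string, value)

print(replace_tail("ABCDEF000xyz12345abcdefgh"))  # ABCDEF000NEW_STRING
print(replace_tail("GHIJK0000def67890abcdefgh"))  # GHIJK0000NEW_STRING
print(replace_tail("no-match"))                   # no-match (left unchanged)
```

Strings that do not follow the described layout are returned unchanged, which mirrors how REGEXP_REPLACE behaves when the pattern does not match.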
select regexp_replace(column_name,'(.*)([0]{2,})(.*)','\1\2xxxx') from table_name;