Unable to use two regular expressions in SAS Content Categorization Studio - regex

I'm working in SAS Content Categorization Studio. I'm trying to get two concepts, consisting of one regular expression each, to return a number of matches. One is supposed to find dates, the other a particularly formatted number.
(0[1-9]|[12][0-9]|3[01])[.](0[1-9]|1[012])[.](?:[0-9]{2})?[0-9]{2}
[1-9](?:(?:[ -.])?\d){10,10}
The regex that is supposed to find the formatted number (latter) doesn't return any hits as long as the regex that is supposed to find dates (former) is active or not commented out. As soon as I comment out the regex for the date, the latter continues to work again. They seem to be mutually exclusive. Can anyone tell me what I'm doing wrong?

If your pattern on the end (?:(?:[ -.])?\d){10,10} is the date match portion, it seems to be a little off to me. What that appears to match is 10 iterations of "Some optional character (literally anything because of the '.') followed by a digit". First, it seems like you would want 8 iterations, not ten to match a date. But I think what you really want is something like \d{1,2}([\.-])\d{1,2}\1\d{4}. This would match "One or two digits, followed by a literal . or -, followed by one or two more digits, followed by whatever special character you matched before (. or -), followed by four digits".

Related

How do I create a regex expression that does not allow the same 9 duplicate numbers in a social security number, with or without hyphens?

The first thing I tried to do, is get the regex matching what I DON'T want. This way, I could just flip it to NOT accept that same input. This is where I came up with the first part of this regex.
Accept all 9 digit numbers, where all 9 digits are identical (without dashes): "^(\d)\1{8}$". This expression works as expected (as seen here: (https://regex101.com/r/Ez8YC3/1)).
The second expression should do the same, with dashes formatted as follows xxx-xx-xxxx: "^(\d)\1{8}$". This expressions works as expected (as seen here: https://regex101.com/r/bodzIX/1).
Now what I want to do at this point, is combine them together to look for BOTH conditions. However when I do that it seems to break, and only match 9 digit numbers that are identical throughout WITH dashes: "^(\d)\1{2}-(\d)\1{1}-(\d)\1{3}$|^(\d)\1{8}$". This can be seen here: https://regex101.com/r/lPnksf/1.
I may be getting a little ahead of myself here, but in order to show my work as much as possible, I also tried flipping those regex separately, which also did not work as expected.
Condition #1 flipped: "^(?!(\d)\1{8})$". Can be seen here: https://regex101.com/r/ed51yk/1.
Condition #2 flipped: "^(?!(\d)\1{2}-(\d)\1{1}-(\d)\1{3})$". Can be seen here: https://regex101.com/r/UYfoMK/1.
I would expect the two expressions (when flipped) to match any 9 digit number (with or without dashes) where all numbers are not identical. How ever this does not happen at all.
This is the final regex that I came up with, which is clearly not doing what I would expect it to: "^(?!(\d)\1{2}-(\d)\1{1}-(\d)\1{3})$|^(?!(\d)\1{8})$". Can be seen here: https://regex101.com/r/9eHhF5/1
At the end of the day, I want to combine these 2 expressions, with this one (that already works as intended): "^(?!000|666|9\d\d)\d{3}-(?!00)\d\d-(?!0000)\d\d\d\d$". Can be seen here: https://regex101.com/r/AdRI8i/1.
I am still pretty new to regex, and really want to understand why I can't simply wrap the condition in (?!...) in order to match the opposite condition.
Thank you in advance
What you want to do is not flip, but reverse the regex logic.
Yes, to reverse the pattern logic, you should use a negative lookahead, but there are caveats.
First, the $ end of string anchor: if it was at the end of the "positive" regex, it must also be moved to the lookahead in the reverse pattern. So, your ^(?!(\d)\1{8})$ regex must be written as ^(?!(\d)\1{8}$). Same goes for your second regex.
Next, mind that each subsequent capturing group gets an incremented ID number, so you cannot keep the same backreferences when you "join" patterns with OR | operator. You must adjust these IDs to reflect their new values in the new regex.
So, you want to match a string that matches ^(?!000|666|9\d\d)\d{3}-(?!00)\d\d-(?!0000)\d\d\d\d$ first (let's note \d\d\d\d = \d{4}), then you can add restrictions with lookaheads:
(?!(\d)\1{8}$) - fails the match if, immediately from the current position, it matches identical 9 digits and then the string end comes
(?!(\d)\2\2-(\d)\2-(\d)\2{3}$) - (note the ID incrementing continuation) fails the match if, immediately from the current position, it matches identical to the first one 3 digits, -, identical 2 digits, -, identical 5 digits, and then the string end comes.
So, to follow your logic, you can use
^(?!(\d)\1{8}$)(?!(\d)\2\2-(\d)\2-(\d)\2{3}$)(?!000|666|9\d\d)\d{3}-(?!00)\d\d-(?!0000)\d{4}$
See the regex demo
As the lookaheads are non-consuming patterns, i.e. the regex index remains at the same position after matching their pattern sequences where it was before, the 3 lookaheads will all be tried at the start of the string (see the ^ anchor). If any of the three negative lookaheads at the start fails, the whole string match will be failed right away.
By this Regex you match what you dont want as social security number:
^(?:(\d)\1{8})|(?:(\d)\2{2}-\2{2}-\2{4})$
Demo
By this regex you match only what you want:
^(?!(?:(\d)\1{8})|(?:(\d)\2{2}-\2{2}-\2{4})).*$
Demo

Regex how to get a full match of nth word (without using non-capturing groups)

I am trying to use Regex to return the nth word in a string. This would be simple enough using other answers to similar questions; however, I do not have access to any of the code. I can only access a regex input field and the server only returns the 'full match' and cannot be made to return any captured groups such as 'group 1'
EDIT:
From the developers explaining the version of regex used:
"...its javascript regex so should mostly be compatible with perl i
believe but not as advanced, its fairly low level so wasn't really
intended for use by end users when originally implemented - i added
the dropdown with the intention of having some presets going
forwards."
/EDIT
Sample String:
One Two Three Four Five
Attempted solution (which is meant to get just the 2nd word):
^(?:\w+ ){1}(\S+)$
The result is:
One Two
I have also tried other variations of the regex:
(?:\w+ ){1}(\S+)$
^(?:\w+ ){1}(\S+)
But these just return the entire string.
I have tried replicating the behaviour that I see using regex101 but the results seem to be different, particularly when changing around the ^ and $.
For example, I get the same output on regex101 if I use the altered regex:
^(?:\w+ ){1}(\S+)
In any case, none of the comparing has helped me actually achieve my stated aim.
I am hoping that I have just missed something basic!
===EDIT===
Thanks to all of you who have contributed thus far, however, I am still running into issues. I am afraid that I do not know the language or restrictions on the regex other than what I can ascertain through trial and error, therefore here is a list of attempts and results all of which are trying to return "Two" from a sample of:
One Two Three Four Five
\w+(?=( \w+){1}$)
returns all words
^(\w+ ){1}\K(\w+)
returns no words atall (so I assume that \K does not work)
(\w+? ){1}\K(\w+?)(?= )
returns no words at all
\w+(?=\s\w+\s\w+\s\w+$)
returns all words
^(?:\w+\s){1}\K\w+
returns all words
====
With all of the above not working, I thought I would test out some others to see the limitations of the system
Attempting to return the last word:
\w+$
returns all words
This leads me to believe that something strange is going on with the start ^ and end $ characters, perhaps the server puts these in automatically if they are omitted? Any more ideas greatly appreciated.
I don't known if your language supports positive lookbehind, so using your example,
One Two Three Four Five
here is a solution which should work in every language :
\w+ match the first word
\w+$ match the last word
\w+(?=\s\w+$) match the 4th word
\w+(?=\s\w+\s\w+$) match the 3rd word
\w+(?=\s\w+\s\w+\s\w+$) match the 2nd word
So if a string contains 10 words :
The first and the last word are easy to find. To find a word at a position, then you simply have to use this rule :
\w+(?= followed by \s\w+ (10 - position) times followed by $)
Example
In this string :
One Two Three Four Five Six Seven Height Nine Ten
I want to find the 6th word.
10 - 6 = 4
\w+(?= followed by \s\w+ 4 times followed by $)
Our final regex is
\w+(?=\s\w+\s\w+\s\w+\s\w+$)
Demo
It's possible to use reset match (\K) to reset the position of the match and obtain the third word of a string as follows:
(\w+? ){2}\K(\w+?)(?= )
I'm not sure what language you're working in, so you may or may not have access to this feature.
I'm not sure if your language does support \K, but still sharing this anyway in case it does support:
^(?:\w+\s){3}\K\w+
to get the 4th word.
^ represents starting anchor
(?:\w+\s){3} is a non-capturing group that matches three words (ending with spaces)
\K is a match reset, so it resets the match and the previously matched characters aren't included
\w+ helps consume the nth word
Regex101 Demo
And similarly,
^(?:\w+\s){1}\K\w+ for the 2nd word
^(?:\w+\s){2}\K\w+ for the 3rd word
^(?:\w+\s){3}\K\w+ for the 4th word
and so on...
So, on the down side, you can't use look behind because that has to be a fixed width pattern, but the "full match" is just the last thing that "full matches", so you just need something whose last match is your word.
With Positive look-ahead, you can get the nth word from the right
\w+(?=( \w+){n}$)
If your server has extended regex, \K can "clear matched items", but most regex engines don't support this.
^(\w+ ){n}\K(\w+)
Unfortunately, Regex doesn't have a standard "match only n'th occurrence", So counting from the right is the best you can do. (Also, Regex101 has a searchable quick reference in the bottom right corner for looking up special characters, just remember that most of those characters are not supported by all regex engines)

Using Regex to find repeating groups in phone numbers

I'm looking for a way to use regex to search for obviously false phone numbers that have the same digit repeating. The numbers are all formatted and stored as follows:
(111)111-1111
I'm not able to alter the text in any way.
I've tried modifying a few of the regex lines I've seen such as:
^([0-9])\1{2}.\1{3}.\1{4}$
which was for finding repeating digits with a period in between the numbers. However, I haven't figured out how to get around the first character as a parenthesis.
Any help would be appreciated!
You misunderstand the purpose of the . Dot Operator. It is not to match a period, it matches anything. In that (quite badly) regex, it serves only to skip the - – and because it matches anything, it will also match something like 11121113111.
Use this regexp instead:
^\(?([0-9])\1{2}\)?\1{3}-?\1{4}$
This checks for parentheses around the first group, optionally so it will still work without; and specifically checks for the presence of a dash between the second and third group of digits, also optionally.

Verifying that a string starts with a number (easy) OR exactly 3 letters?

I'm trying to make a RegEx expression to verify that a field starts with either the number 3 - the easy part - or starts with three letters, then continues to be numbers
My expression so far is
^((3)[\d])|([a-zA-Z]{3}[\d])$
The expression stops you from doing anything BELOW 3, but it still lets you go over...
I've done some searching and can't find a topic that relates to the issue of having an exact amount of characters
And I'm having trouble with limiting it to exactly 3 letter characters. Unfortunately what I'm working with, it HAS to be RegEx and not another language.
^(?:3|[a-zA-Z]{3})\d+$
verifies, that your string starts with either 3 or 3 letters and then is only followed by numbers (at least one) until the end of the string
See https://regex101.com/r/tD2nK4/3 for some positive and negative examples
This regex should do exactly what you want:
^((3)[\d])|([a-zA-Z]{3}[^a-zA-Z])
Please note that this regex can only cope with the ASCII alphabet.

positive look ahead and replace

Recently I'm writing/testing regexps on https://regex101.com/.
My question is: Is it possible to do a positive look-ahead AND a replacement in the same "replacement"? Or just limited kind of replacement is possible.
Input is several lines with phone numbers. Let's say the correct phone number where the number of "numbers" are 11. No matter how the numbers are divided/group together with - / characters, no matter if starts with + 00 or it is omitted.
Some example lines:
+48301234567
+48/30/1234567
+48-30-12-345-67
+483011223344556677
0048301234567
+(48)30/1234567
Positive look-ahead able to check if from the beginning until the end of line there are only 11 digits, regardless how many other, above specified character separating them. This works perfectly.
Where the positive look-ahead check is fine, I would like to delete every character but numbers. The replacement works fine until I'm not involving look-ahead.
Checking the regexp itself working perfectly ("gm" modes):
^(?:\+|00)?(?:[\-\/\(\)]?\d){11}$
Checking the replace part works perfectly (replace to nothing):
[^\d\n]
Put this into look-ahead, after the deletion of non new-line and non-digit characters from the matching lines:
(?=^(?:\+|00)?(?:[\-\/\(\)]?\d){11}$)[^\d\n]
Even I put the ^ $ into look-ahead, seems the replacement working only from beginning of the lines until the very first digit.
I know in real life the replacement and the check should/would go separate ways, however I'm curious if I could mix look-ahead/look-behind with string operations like replace, delete, take the string apart and put together as I like.
UPDATE: This is what would do the trick, however I feel this one "ugly" a bit. Is there any prettier solution?
https://regex101.com/r/yT5dA4/2
Or the version which I asked originally, where only digits remains: regex101.com/r/yT5dA4/3
You cannot replace/delete text with regex. Regex is just a tool for matching certain strings and then taking certain action depending on the matching text, eg. perform a substitution, retrieve the second capture group.
However it is possible to perform certain decisions within a regex engine, by using conditionals. The common syntax for this, with a lookahead assertion, is (?(?=regex)then|else).
With conditionals you can change the behaviour depending on how the text matches the regex. For your example you could do something like:
^(\+)?(?(1)\(|\d)
If the phone number starts with a plus it must be followed by a bracket, else it should start with a digit. Although in your situation, this is not very useful.
If you want to read up more on conditionals in regex you can do so here.