Regular expression for Medicare MBI number format - regex

I'm wondering if someone can help me to create a regular expression to check if a string matches the new Medicare MBI number format. Here are the specifics in regards to character position and what they can contain.
I'm using Cache ObjectScript, but any language would be helpful just so I can get the idea.

If PCRE is an option, you could leverage subroutines:
(?(DEFINE)
(?P<numeric>\d) # numbers
(?P<abc>(?![SLOIBZ])[A-Z]) # A-Z without S,L,O,I,B,Z
(?P<both>(?&numeric)|(?&abc)) # combined
)
^ # start of line/string
(?&numeric)(?&abc)(?&both) # in packs of three
(?&numeric)(?&abc)(?&both)
(?&numeric)(?&abc)(?&abc)
(?&numeric)(?&numeric)
$ # end of line/string
Paste your IDs into the demo on regex101.com (but don't save it on regex101 or you'll expose those IDs to the public permanently).
Of course, you don't have to use subroutines; they just make the expression clearer, more readable and more maintainable.
You could just as well go for the following (in free-spacing mode, i.e. with the x flag):
^
\d
(?![SLOIBZ])[A-Z]
(?:\d|(?![SLOIBZ])[A-Z])
\d
(?![SLOIBZ])[A-Z]
(?:\d|(?![SLOIBZ])[A-Z])
\d
(?![SLOIBZ])[A-Z]
(?![SLOIBZ])[A-Z]
\d
\d
$
Or condensed (just copy and paste it):
^\d(?![SLOIBZ])[A-Z](?:\d|(?![SLOIBZ])[A-Z])\d(?![SLOIBZ])[A-Z](?:\d|(?![SLOIBZ])[A-Z])\d(?![SLOIBZ])[A-Z](?![SLOIBZ])[A-Z]\d\d$

First position should be 1-9
https://www.cms.gov/Outreach-and-Education/Medicare-Learning-Network-MLN/MLNProducts/Downloads/MedicareCard-FactSheet-TextOnly-909365.pdf
\b[1-9][AC-HJKMNP-RT-Yac-hjkmnp-rt-y][AC-HJKMNP-RT-Yac-hjkmnp-rt-y0-9][0-9]-?[AC-HJKMNP-RT-Yac-hjkmnp-rt-y][AC-HJKMNP-RT-Yac-hjkmnp-rt-y0-9][0-9]-?[AC-HJKMNP-RT-Yac-hjkmnp-rt-y]{2}\d{2}\b
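As a rough illustration, the character-class pattern above could be wrapped in a small Python helper. This is only a sketch under a few assumptions: the names ALPHA, MBI_PATTERN and is_valid_mbi are made up for the example, re.fullmatch is used instead of the \b anchors because the helper validates a whole string rather than searching inside text, and the IDs in the usage note are dummy values.

```python
import re

# Letters allowed in an MBI: A-Z without S, L, O, I, B, Z.
ALPHA = "AC-HJKMNP-RT-Y"

# Same layout as the pattern above, with optional dashes after
# positions 4 and 7. IGNORECASE replaces the duplicated lowercase ranges.
MBI_PATTERN = re.compile(
    rf"[1-9][{ALPHA}][{ALPHA}0-9][0-9]-?"
    rf"[{ALPHA}][{ALPHA}0-9][0-9]-?"
    rf"[{ALPHA}]{{2}}[0-9]{{2}}",
    re.IGNORECASE,
)


def is_valid_mbi(candidate: str) -> bool:
    """Return True if the whole string has the MBI shape."""
    return MBI_PATTERN.fullmatch(candidate) is not None
```

With this sketch, is_valid_mbi("1EG4-TE5-MK73") is accepted, while a value starting with 0 or containing one of the excluded letters in a letter position is rejected.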

Using a negative look-ahead, (?!...), with one of the previous answers:
\b[1-9](?![sloibzSLOIBZ])[a-zA-Z](?![sloibzSLOIBZ])[a-zA-Z0-9][0-9]-?(?![sloibzSLOIBZ])[a-zA-Z](?![sloibzSLOIBZ])[a-zA-Z0-9][0-9]-?(?:(?![sloibzSLOIBZ])[a-zA-Z]){2}\d{2}\b
Or, even shorter:
\b[1-9](?![sloibzSLOIBZ])[a-zA-Z](?![sloibzSLOIBZ])[a-zA-Z\d]\d-?(?![sloibzSLOIBZ])[a-zA-Z](?![sloibzSLOIBZ])[a-zA-Z\d]\d-?(?:(?![sloibzSLOIBZ])[a-zA-Z]){2}\d{2}\b

public static boolean isValidHICN(String mHICN) {
    // Positions 3 and 6 are alphanumeric: a digit or one of the allowed letters.
    String mPatternHICN = "[1-9][ACDEFGHJKMNPQRTUVWXY][0-9ACDEFGHJKMNPQRTUVWXY][0-9][ACDEFGHJKMNPQRTUVWXY][0-9ACDEFGHJKMNPQRTUVWXY][0-9][ACDEFGHJKMNPQRTUVWXY]{2}[0-9]{2}";
    return mHICN.matches(mPatternHICN);
}

Related

Regex to match(extract) string between dot(.)

I want to select some string combination (with dots(.)) from a very long string (sql). The full string could be a single line or multiple line with new line separator, and this combination could be in start (at first line) or a next line (new line) or at both place.
I need help in writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
Also, can I add a prefix or suffix word (or a space) that must be present wherever this logic is applied? In the case above, if the statement is an UPDATE the regex should pick test.table, but if the statement is something like select test.table the regex should not pick up this combination. The same applies to suffixes.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected Output: test.table, test.table (Key words are INTO and FROM)
Things not needed in the selection: as_t.numb, as_t.old, as_t.date
Once I have the regex, I can use it in a program to extract these words.
Note: The words before and after the combination could be anything, such as update, select, { or (, so we have to find every occurrence of words joined together with . (dot), however many there are.
I tried something like this:
(?<=.)(.?)(?=.)(.?) -: This only selected the word between two .dot and not all.
.(?<=.)(.?)(?=.)(.?). - This everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
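To see the lookbehind version in action, here is a minimal Python sketch that runs it over the Example 3 statement from the question. The names PATTERN, SQL and tables are made up for the illustration. One caveat that is specific to Python's re module: lookbehinds must be fixed-width, which happens to hold here because INTO and FROM have the same length; adding a keyword of a different length (UPDATE, say) would need a different approach, such as a capture group after the keyword.

```python
import re

# Keyword-anchored pattern from the answer above; the inline (?i) flag
# makes INTO/FROM case-insensitive.
PATTERN = re.compile(r"(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+")

# Example 3 from the question.
SQL = """INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )"""

# findall returns the full match because the pattern has no capture groups.
tables = PATTERN.findall(SQL)
print(tables)  # ['test.table', 'test.table']
```

The dotted names that are not preceded by a keyword (wu_id.Item_Nbr, as_t.old and so on) are skipped, which is the behaviour asked for.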
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that matches any word character \w between 0 and unlimited times, followed by a dot, followed by another series of word characters repeated between 0 and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w]*[.][\w]*)+/gm;
in Java:
final String regex = "([\\w]*[.][\\w]*)+";
in Python:
regex = r"([\w]*[.][\w]*)+"
in PHP:
$re = '/([\w]*[.][\w]*)+/m';
Note that this solution is written for your use case (SQL strings): it will also catch something like '.word' or 'word..word', which I assume you won't have.
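The caveat about '.word' and 'word..word' can be checked with a short Python sketch. The loose pattern follows the description above (zero or more word characters on either side of each dot); the strict pattern is a hypothetical alternative, not part of the answer, that requires at least one word character around every dot. A non-capturing group is used so that findall returns whole matches.

```python
import re

# Pattern as described: zero or more word characters around each dot.
loose = re.compile(r"(?:\w*\.\w*)+")
# Hypothetical stricter variant: at least one word character around every dot.
strict = re.compile(r"\w+(?:\.\w+)+")

sql = "SELECT name FROM project.dataset.tablename WHERE name = 'test'"

print(strict.findall(sql))           # ['project.dataset.tablename']
print(loose.findall(".word"))        # ['.word']
print(strict.findall(".word"))       # []
print(strict.findall("word..word"))  # []
```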

PHP preg_match_all trouble

I have written a regular expression that I tested in rubular.com and it returned 4 matches. The subject of testing can be found here http://pastebin.com/49ERrzJN and the PHP code is below. For some reason the PHP code returns only the first 2 matches. How to make it to match all 4? It seems it has something to do with greediness or so.
$file = file_get_contents('x.txt');
preg_match_all('~[0-9]+\s+(((?!\d{7,}).){2,20})\s{2,30}(((?!\d{7,}).){2,30})\s+([0-9]+)-([0-9]+)-([0-9]+)\s+(F|M)\s+(.{3,25})\s+(((?!\d{7,}).){2,50})~', $file, $m, PREG_SET_ORDER);
foreach($m as $v) echo 'S: '. $v[1]. '; N: '. $v[3]. '; D:'. $v[7]. '<br>';
Your regex is very slooooooow. After trying it on regex101.com, I found it would timeout on PHP (but not JS, for whatever reason). I'm pretty sure the timeout happens at around 50,000 steps. Actually, it makes sense now why you're not using an online PHP regex tester.
I'm not sure if this is the source of your problem, but there is a default memory limit in PHP:
memory_limit [default:] "128M"
[history:] "8M" before PHP 5.2.0, "16M" in PHP 5.2.0
If you use the multiline modifier (I assume that preg_match_all essentially adds the global modifier), you can use this regex that only takes 1282 steps to find all 4 matches:
^ [0-9]+\s+(((?!\d{7,}).){2,20})\s{2,30}(((?!\d{7,}).){2,30})\s+([0-9]+)-([0-9]+)-([0-9]+)\s+(F|M)\s+(.{3,25})\s+(((?!\d{7,}).){2,50})
Actually, there are only 2 characters that I added. They're at the beginning, the anchor ^ and the literal space.
If you have to write a long pattern, the first thing to do is to make it readable. To do that, use the verbose mode (x modifier) that allows comments and free-spacing, and use named captures.
Then you need to make a precise description of what you are looking for:
your target takes a whole line => use the anchors ^ and $ with the modifier m, and use the \h class (that only contains horizontal whitespace) instead of the \s class.
instead of using this kind of inefficient sub-pattern (?:(?!.....).){m,n} to describe what your field must not contain, describe what the field can contain.
use atomic groups (?>...) when needed instead of non-capturing groups to avoid useless backtracking.
in general, using precise character classes avoids a lot of problems
pattern:
~
^ \h*+ # start of the line
# named captures # field separators
(?<VOTERNO> [0-9]+ ) \h+
(?<SURNAME> \S+ (?>\h\S+)*? ) \h{2,}
(?<OTHERNAMES> \S+ (?>\h\S+)*? ) \h{2,}
(?<DOB> [0-9]{2}-[0-9]{2}-[0-9]{4} ) \h+
(?<SEX> [FM] ) \h+
(?<APPID_RECNO> [0-9A-Z/]+ ) \h+
(?<VILLAGE> \S+ (?>\h\S+)* )
\h* $ # end of the line
~mx
If you want to know what goes wrong with a pattern, you can use the function preg_last_error()
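The same idea (free-spacing mode, named captures, line anchors, describing what each field can contain) carries over to other regex engines. Below is a rough Python sketch of it, with some assumptions stated up front: Python's re has no \h, so [ \t] stands in for it; plain non-capturing groups replace the atomic groups and the possessive quantifier; and the two sample lines are invented for the illustration, since the real input file is not shown here.

```python
import re

# Verbose pattern with named captures; [ \t] stands in for PCRE's \h.
LINE = re.compile(
    r"""
    ^ [ \t]*                                        # start of the line
    (?P<VOTERNO>     [0-9]+ )                       [ \t]+
    (?P<SURNAME>     \S+ (?:[ \t]\S+)*? )           [ \t]{2,}
    (?P<OTHERNAMES>  \S+ (?:[ \t]\S+)*? )           [ \t]{2,}
    (?P<DOB>         [0-9]{2}-[0-9]{2}-[0-9]{4} )   [ \t]+
    (?P<SEX>         [FM] )                         [ \t]+
    (?P<APPID_RECNO> [0-9A-Z/]+ )                   [ \t]+
    (?P<VILLAGE>     \S+ (?:[ \t]\S+)* )
    [ \t]* $                                        # end of the line
    """,
    re.VERBOSE | re.MULTILINE,
)

# Hypothetical sample data: names are separated by two or more spaces.
SAMPLE = (
    "  1234  DOE JANE  MARY ANN  01-02-1980 F 12/AB/345 SPRING VALLEY\n"
    " 5678  SMITH  JOHN  11-12-1975 M 99/XY/1 HILLTOP\n"
)

rows = [m.groupdict() for m in LINE.finditer(SAMPLE)]
for row in rows:
    print(row["SURNAME"], "|", row["OTHERNAMES"], "|", row["DOB"])
```

Each row comes back as a dictionary keyed by the capture names, which is easier to work with than numbered groups.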

Regular Expression to exclude numerical emailids

I have the below set of sample email IDs:
EmailAddress
1123
123.123
123_123
123#123.123
123#123.com
123#abc.com
123mbc#abc.com
123mbc#123abc.com
123mbc#123abc.123com
123mbc123#cc123abc.c123com
I need to eliminate the email IDs that are entirely numeric before the #.
Expected output:
123mbc#abc.com
123mbc#123abc.com
123mbc#123abc.123com
123mbc123#cc123abc.c123com
I used the Java regex below, but it's eliminating everything. I only have basic knowledge of writing these expressions. Please help me correct it. Thanks in advance.
[^0-9]*#.*
Do you mean something like this? (.*[a-zA-Z].*[#]\w*\.\w*)
Breakdown:
.* = 0 or more characters
[a-zA-Z] = one letter
.* = 0 or more characters
[#] = the # sign
\w*\.\w* = any number of word characters (a-z, A-Z, 0-9, _) with a single . in between
This way you only get the emails that contain at least one letter before the #.
See the test at https://regex101.com/r/qV1bU4/3
Edited as suggested by ccf, with an updated breakdown.
The following regex only lets email addresses pass that meet your specs:
(?m)^.*[^0-9#\r\n].*#
Note that you have to specify multi-line matching (the m flag). The solution employs the embedded flag syntax (?m); you can also call Pattern.compile with the Pattern.MULTILINE argument.
Live demo at regex101.
Explanation
Strategy: Define a basically sound email address as a single-line string containing a #, exclude strictly numerical prefixes.
^: start-of-line anchor
#: a basically sound email address must match the at-sign
[^...]: before the at-sign, at least one character must be neither a digit nor a CR/LF. # is also in the class, because the non-digit character tested for must not be the first at-sign itself.
.*: before and after the non-digit tested for, arbitrary strings are permitted ( well, actually they aren't, but true syntactic validation of the email address should probably not happen here and should definitely not be regex based for reasons of reliability and code maintainability ). The strings need to be represented in the pattern, because the pattern is anchored.
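As a quick check of this strategy, here is a minimal Python sketch that applies the pattern to the sample list from the question. It is an illustration only: the names HAS_NON_DIGIT_BEFORE_HASH, samples and kept are made up, and because each address is tested on its own, the (?m) flag is left out.

```python
import re

# Pattern from the answer above, applied to one address at a time.
HAS_NON_DIGIT_BEFORE_HASH = re.compile(r"^.*[^0-9#\r\n].*#")

samples = [
    "EmailAddress",
    "1123",
    "123.123",
    "123_123",
    "123#123.123",
    "123#123.com",
    "123#abc.com",
    "123mbc#abc.com",
    "123mbc#123abc.com",
    "123mbc#123abc.123com",
    "123mbc123#cc123abc.c123com",
]

# Keep only the entries with a non-digit character somewhere before a #.
kept = [s for s in samples if HAS_NON_DIGIT_BEFORE_HASH.search(s)]
for address in kept:
    print(address)
```

The four addresses that survive are the ones listed as the expected output in the question.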
Try this one:
[^\d\s].*#.+
It will match emails that have at least one letter or symbol before the # sign.

Regex to match a string not followed by anything

I am trying to figure out a regex that will match the first item in the list below but not the others. {Some-Folder} is variable.
http://www.url.com/{Some-Folder}/
http://www.url.com/{Some-Folder}/thing/key/
http://www.url.com/{Some-Folder}/thing/119487302/
http://www.url.com/{Some-Folder}/{something-else}
Essentially I want to be able to detect anything that is of the form:
http://www.url.com/{Some-Folder}/
or
http://www.url.com/{Some-Folder}
but not
http://www.url.com/{Some-Folder}/{something-else}
So far I have
http://www.url.com/[A-Z,-]*\/^.
but this doesn't match anything
http://www.url.com/[^/]+/?$
Or, in the few parsers that use \Z as end of text,
http://www.url.com/[^/]+/?\Z
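For illustration, the end-anchored pattern can be tried out in Python like this. It is a sketch with a couple of assumptions: the helper name is_folder_url and the folder names are invented, and the dots in the host name are escaped so that they only match literal dots.

```python
import re

# One path segment, an optional trailing slash, then the end of the string.
FOLDER_URL = re.compile(r"http://www\.url\.com/[^/]+/?$")


def is_folder_url(url: str) -> bool:
    """True for http://www.url.com/<folder> with or without a trailing slash."""
    return FOLDER_URL.match(url) is not None


print(is_folder_url("http://www.url.com/some-folder/"))            # True
print(is_folder_url("http://www.url.com/some-folder"))             # True
print(is_folder_url("http://www.url.com/some-folder/thing/key/"))  # False
print(is_folder_url("http://www.url.com/some-folder/other"))       # False
```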
I customized a regex I've used for URL parsing before. It's not perfect, and it will need even more work once gTLDs become more widely used. Anyway, here it is:
\bhttps?:\/\/[a-z0-9.-]+\.(?:[a-z]{2,4}|museum|travel)\/[^\/\s]+(?:\/\b)?
You may want to add case insensitive flag, for whichever language you're using.
Demo: http://rubular.com/r/HyVXU30Hvp
You may use the following regex:
(?m)http:\/\/www\.example\.com\/[^\/]+\/?$
Explanation:
(?m) : Set the m modifier which makes ^ and $ match start and end of line respectively
http:\/\/www\.example\.com\/ : match http://www.example.com/
[^\/]+ : match anything except / one or more times
\/? : optionally match /
$ : declare end of line
I've been looking for an answer to this exact problem. aaaaaa123456789's answer almost worked for me. But the $ and \Z didn't work. My solution is:
http://www.url.com/[^/]+/?.{0}

How to ignore whitespace in a regular expression subject string?

Is there a simple way to ignore the white space in a target string when searching for matches using a regular expression pattern? For example, if my search is for "cats", I would want "c ats" or "ca ts" to match. I can't strip out the whitespace beforehand because I need to find the begin and end index of the match (including any whitespace) in order to highlight that match and any whitespace needs to be there for formatting purposes.
You can stick optional whitespace characters \s* in between every other character in your regex. Although granted, it will get a bit lengthy.
/cats/ -> /c\s*a\s*t\s*s/
While the accepted answer is technically correct, a more practical approach, if possible, is to just strip whitespace out of both the regular expression and the search string.
If you want to search for "my cats", instead of:
myString.match(/m\s*y\s*c\s*a\s*t\s*s\s*/g)
Just do:
myString.replace(/\s+/g,"").match(/mycats/g)
Warning: You can't automate this on the regular expression by just replacing all spaces with empty strings because they may occur in a negation or otherwise make your regular expression invalid.
Addressing Steven's comment to Sam Dufel's answer
Thanks, sounds like that's the way to go. But I just realized that I only want the optional whitespace characters if they follow a newline. So for example, "c\n ats" or "ca\n ts" should match. But wouldn't want "c ats" to match if there is no newline. Any ideas on how that might be done?
This should do the trick:
/c(?:\n\s*)?a(?:\n\s*)?t(?:\n\s*)?s/
See this page for all the different variations of 'cats' that this matches.
You can also solve this using conditionals, but they are not supported in the JavaScript flavor of regex.
You could put \s* in between every character in your search string, so if you were looking for cats you would use c\s*a\s*t\s*s
It's long but you could build the string dynamically of course.
You can see it working here: http://www.rubular.com/r/zzWwvppSpE
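Building the pattern dynamically might look like the following Python sketch. The function name spaced_pattern and the sample sentence are made up for the example; re.escape is applied to each character so that search terms containing regex metacharacters stay literal.

```python
import re


def spaced_pattern(word: str) -> re.Pattern:
    """Compile a pattern that allows optional whitespace between characters."""
    # "cats" becomes c\s*a\s*t\s*s
    return re.compile(r"\s*".join(re.escape(ch) for ch in word))


text = "I like c a ts a lot"
m = spaced_pattern("cats").search(text)
print(m.span())   # (7, 13)
print(m.group())  # c a ts
```

Because the search runs on the original string, the span can be used directly for highlighting, whitespace included.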
If you only want to allow spaces, then
\bc *a *t *s\b
should do it. To also allow tabs, use
\bc[ \t]*a[ \t]*t[ \t]*s\b
Remove the \b anchors if you also want to find cats within words like bobcats or catsup.
This approach can be used to automate this (the following example solution is in Python, although it can obviously be ported to any language):
You can strip the whitespace beforehand AND save the positions of the non-whitespace characters, so that you can use them later to find the boundaries of the matched string in the original string, like this:
import re

def regex_search_ignore_space(regex, string):
    no_spaces = ''
    char_positions = []
    for pos, char in enumerate(string):
        if re.match(r'\S', char):  # upper \S matches non-whitespace chars
            no_spaces += char
            char_positions.append(pos)
    match = re.search(regex, no_spaces)
    if not match or not match.group():
        return None
    # match.start() and match.end() are indices of start and end
    # of the found string in the spaceless string
    # (as we have searched in it).
    start = char_positions[match.start()]  # in the original string
    # match.end() is exclusive, so map the last matched character and add 1;
    # this also avoids an IndexError when the match ends the string.
    end = char_positions[match.end() - 1] + 1  # in the original string
    matched_string = string[start:end]
    # the match WITH spaces is returned.
    return matched_string

with_spaces = 'a li on and a cat'
print(regex_search_ignore_space('lion', with_spaces))
# prints 'li on'
If you want to go further you can construct the match object and return it instead, so the use of this helper will be more handy.
And the performance of this function can of course also be optimized, this example is just to show the path to a solution.
The accepted answer will not work if and when you are passing a dynamic value (such as "current value" in an array loop) as the regex test value. You would not be able to input the optional white spaces without getting some really ugly regex.
Konrad Hoffner's solution is therefore better in such cases, as it strips whitespace from both the regex and the test string. The test is then conducted as though neither contains whitespace.