Custom regex for SSN - regex

Need a regex that will identify SSN so that we can mask the numbers, but want to skip masking of work order numbers (starts with WO-).
Details are below:
Current regex used: \b(?!000|666|9)\d{3}[- ]?(?!00)\d{2}[- ]?(?!0000)\d{4}\b
Sample test data:
WO-011493479
011493479
778-62-8144
030 72 7381
757-85-7495
149-13-7317
401318448
003 06 8815
790714615
805 14 1893
In the sample data above, WO-011493479 is a work order number, so the digits after WO- should be left unmasked. With the current regex, however, a work order string like WO-011493479 is changed to WO-*********.
For checking the regex quickly, please use this link.
Note: The regex should work with all major browsers

You're one negative lookbehind away from getting it yourself:
\b(?<!WO-)(?!000|666|9)\d{3}[- ]?(?!00)\d{2}[- ]?(?!0000)\d{4}\b
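As a quick way to check the lookbehind version against the sample data, here is a minimal sketch in Python rather than in a browser. The helper name mask_ssn is made up for illustration, and masking each digit with "*" is an assumption about the desired output. Lookbehind is a relatively recent addition to JavaScript (ES2018), so it is worth confirming support on the browsers you target.

```python
import re

# Pattern from the answer above, unchanged: the (?<!WO-) lookbehind rejects
# digit runs that directly follow "WO-".
SSN_PATTERN = re.compile(
    r"\b(?<!WO-)(?!000|666|9)\d{3}[- ]?(?!00)\d{2}[- ]?(?!0000)\d{4}\b"
)


def mask_ssn(text: str) -> str:
    """Replace every digit of each SSN-like match with '*', keeping separators."""
    # mask_ssn is a hypothetical helper, not part of the original question.
    return SSN_PATTERN.sub(lambda m: re.sub(r"\d", "*", m.group(0)), text)


if __name__ == "__main__":
    for sample in ["WO-011493479", "011493479", "778-62-8144", "030 72 7381"]:
        print(sample, "->", mask_ssn(sample))
```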

Related

Regex matching issue to Test-String

I have a problem that I can't figure out.
My Regex:
My Test-String:
I have two issues and one general question :)
As you can see in my test string, the very last (German) phone number (the big yellow one in the Test-String attachment) does not match my regex pattern correctly. I don't understand what the problem is here: the "0049" ends up in group 5, but it should end up in group 2. Why is that?
My second problem: how can I get rid of the spaces before and after every match? (The 7 small yellow circles in the Test-String attachment.)
For copy/paste purposes, here is the Regex and Test-String again:
Regex:
((\+\d{2}|00\d{2})?([ ])?(\()?(\d{2,4})(\))?([-| |/])?(\d{3,})([ ])?(\d+)?([ ])?(\d+)?)
Test-String:
Vorwahl 089, die E.123 ebenfalls , also (089) 1234567. Die DIN 5008, also +49 89 1234567 respectivly 0049 89 1234567. Die E.123 empfiehlt, also +49 89 123456 0 respectivly 0049 89 123456 0 oder +49 89 123456 789. Also +49 89 123 456 789. Klammern 089/1234567 und 0151 19406041. Test +49 151 123 456 789 respectivly 0049 151 123 456 789
Last but not least, my general question:
Is it a good approach to group each logical part as I did in my example?
One last piece of information: I validate my regex with https://regex101.com/ and use it in Python with the re module.
What makes it unpredictable is the large number of optional groups (..)?.
As a first step, I recommend replacing ([ ])?(\d+)? with the coupled expression ([ ]?\d+)?, which avoids spaces at the end of the match (your point #2).
As a second step, I recommend coupling the first optional space with the expression for the international dialling prefix: ((\+|00)\d{2}([ ])?)?. This solves both the space at the beginning and the recognition of the whole number, because there are fewer possible ways to match: a match can no longer begin at a stray space with the prefix group skipped.
The new expression now looks like this:
(((\+|00)\d{2}([ ])?)?(\()?(\d{2,4})(\))?([-| |/])?(\d{3,})([ ]?\d+)?([ ]?\d+)?)
I now recommend simplifying the last part, if you don't need the individual group values:
(((\+|00)\d{2}([ ])?)?(\()?(\d{2,4})(\))?([-| |/])?(\d{3,})([ ]?\d+){0,2})
For better performance, I suggest removing the parentheses/groups where possible, or marking them as non-capturing, if you don't need the specific group values.
In some programming languages you will not need the outermost parentheses, as the whole match is always group 0.
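Since the question mentions Python's re module, here is a small sketch that runs the simplified expression over the test string from the question. It relies on group 0 for the whole match, so the individual capture groups are ignored; the variable names are illustrative only.

```python
import re

# Simplified expression from the answer, copied as-is.
PHONE = re.compile(
    r"(((\+|00)\d{2}([ ])?)?(\()?(\d{2,4})(\))?([-| |/])?(\d{3,})([ ]?\d+){0,2})"
)

# Test string from the question (spelling left untouched).
TEST_STRING = (
    "Vorwahl 089, die E.123 ebenfalls , also (089) 1234567. "
    "Die DIN 5008, also +49 89 1234567 respectivly 0049 89 1234567. "
    "Die E.123 empfiehlt, also +49 89 123456 0 respectivly 0049 89 123456 0 "
    "oder +49 89 123456 789. Also +49 89 123 456 789. "
    "Klammern 089/1234567 und 0151 19406041. "
    "Test +49 151 123 456 789 respectivly 0049 151 123 456 789"
)

# finditer + group(0) yields whole matches regardless of the capture groups.
matches = [m.group(0) for m in PHONE.finditer(TEST_STRING)]

if __name__ == "__main__":
    for number in matches:
        print(repr(number))
```

Printing with repr makes any leading or trailing space visible, which is a convenient way to confirm that point #2 is resolved.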

Italian phone 10-digit number regex issue

I'm trying to use the regex from this site
/^([+]39)?((38[{8,9}|0])|(34[{7-9}|0])|(36[6|8|0])|(33[{3-9}|0])|(32[{8,9}]))([\d]{7})$/
for Italian mobile phone numbers, but a simple number such as 3491234567 is reported as invalid.
(Don't worry about spaces, as I'll trim them.)
should pass:
349 1234567
+39 349 1234567
TODO: 0039 349 1234567
TODO: (+39) 349 1234567
TODO: (0039) 349 1234567
regex101 and regexr both pass the validation..what's wrong?
UPDATE:
To clarify:
The regex should match any number that starts with either
388/389/380 (38[{8,9}|0])|
or
347/348/349/340 (34[{7-9}|0])|
or
366/368/360 (36[6|8|0])|
or
333/334/335/336/337/338/339/330 (33[{3-9}|0])|
328/329 (32[{8,9}])
plus 7 digits ([\d]{7})
and the +39 at the start optionally ([+]39)?
The following regex appears to fulfill your requirements. I took out the syntax errors and guessed a bit, and added the missing parts to cover your TODO comments.
^(\((00|\+)39\)|(00|\+)39)?(38[890]|34[7-90]|36[680]|33[3-90]|32[89])\d{7}$
Demo: https://regex101.com/r/yF7bZ0/1
Your test cases fail to cover many of the variations captured by the regex; perhaps you'll want to beef up the test set to make sure it does what you want.
The beginning allows for an optional international prefix with or without the parentheses. The basic pattern is (00|\+)39 and it is repeated with or without parentheses around it. (Perhaps a better overall approach would be to trim parentheses and punctuation as well as whitespace before processing begins; you'll want to keep the plus as significant, of course.)
Updated with information from @Edoardo's answer; wrapped for legibility and added comments:
^ # beginning of line
(\((00|\+)39\)|(00|\+)39)? # country code or trunk code, with or without parentheses
( # followed by one of the following
32[89]| # 328 or 329
33[013-9]| # 33x where x != 2
34[04-9]| # 34x where x not in 1,2,3
35[01]| # 350 or 351
36[068]| # 360 or 366 or 368
37[019]| # 370 or 371 or 379
38[089]) # 380 or 388 or 389
\d{6,7} # ... followed by 6 or 7 digits
$ # and end of line
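The commented layout above maps directly onto a verbose-mode pattern in engines that support one. Below is a minimal sketch in Python using re.VERBOSE; the function name is_italian_mobile is hypothetical, and stripping all whitespace before matching is an assumption based on the question's remark that spaces are trimmed.

```python
import re

ITALIAN_MOBILE = re.compile(
    r"""
    ^                            # beginning of string
    (\((00|\+)39\)|(00|\+)39)?   # optional country code, with or without parentheses
    (                            # followed by one of the operator prefixes
        32[89]    |              # 328 or 329
        33[013-9] |              # 33x where x != 2
        34[04-9]  |              # 34x where x not in 1,2,3
        35[01]    |              # 350 or 351
        36[068]   |              # 360 or 366 or 368
        37[019]   |              # 370 or 371 or 379
        38[089]                  # 380 or 388 or 389
    )
    \d{6,7}                      # followed by 6 or 7 digits
    $                            # end of string
    """,
    re.VERBOSE,
)


def is_italian_mobile(raw: str) -> bool:
    """Return True if raw looks like an Italian mobile number (hypothetical helper)."""
    compact = re.sub(r"\s+", "", raw)  # assumption: whitespace is not significant
    return ITALIAN_MOBILE.match(compact) is not None


if __name__ == "__main__":
    for candidate in ["349 1234567", "+39 349 1234567", "(0039) 349 1234567", "332 1234567"]:
        print(candidate, "->", is_italian_mobile(candidate))
```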
There are obvious accidental gaps which will probably also get filled over time. Generalizing this further is likely to improve resilience toward future changes, but of course may at the same time increase the risk of false positives. Make up your mind about which is worse.
I found this and updated it with new operators and MVNO prefixes (Iliad, ho.):
^(\((00|\+)39\)|(00|\+)39)?(38[890]|34[4-90]|36[680]|33[13-90]|32[89]|35[01]|37[019])\d{6,7}$
I improved the regex by adding a case that handles spaces between the digit groups:
^(\((00|\+)39\)|(00|\+)39)?\s?(38[890]|34[4-90]|36[680]|33[13-90]|32[89]|35[01]|37[019])(\s?\d{3}\s?\d{3,4}|\d{6,7})$
So, for example, it can match phone numbers like (0039) 349 123 4567 or 349 123 4567. The \s? after the optional prefix is what allows the space following (0039).
According to this document:
https://it.qaz.wiki/wiki/Telephone_numbers_in_Italy
a simple regex for Italian MOBILE numbers without special characters is:
/^3[0-9]{8,9}$/
It matches a string starting with the digit '3' followed by 8 or 9 digits, e.g.:
3345678103
You can then add the Italian prefix, such as '+39 ' or '0039 ' (the plus sign must be escaped):
/^\+39 3[0-9]{8,9}$/ --- match --> +39 3345678103
/^0039 3[0-9]{8,9}$/ --- match --> 0039 3345678103

Why does this regex prefix not work?

I've got a sample data I'm looking to find matches within. I'm testing in Notepad++.
QA 44
ABQ DAL 280
ABQ HOU 290
HOU PHX 210
DAL PHX 102
When I use the following regular expression, I get matches as expected on the last four lines
([A-Z]{3}\s){2}[0-9]{3}
But when I try to hone in on the 3-digit number at the end and move everything else to a prefix, no matches come back.
(?<=([A-Z]{3}\s){2})[0-9]{3}
What exactly am I doing wrong with the prefix? I want all those 3-digit numbers to match and qualify on the letter codes before it, but it is not working.
If you just want the final digits, you need a non-capturing group inside the lookbehind:
(?<=(?:[A-Z]{3}\s){2})[0-9]{3}
see demo
Is there a reason you can't use a lookbehind together with an end-of-line anchor?
(?<=\s)\d{3}$
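For comparison, here is a short sketch of both suggestions run against the sample data in Python. Python's re engine is not the one Notepad++ uses, so this only illustrates what the two patterns select and does not reproduce the original failure. Note that the anchor version needs multiline mode so that $ matches at each line end, and that it no longer checks the letter codes.

```python
import re

DATA = """QA 44
ABQ DAL 280
ABQ HOU 290
HOU PHX 210
DAL PHX 102"""

# Fixed-width lookbehind with a non-capturing group: two 3-letter codes, each
# followed by one whitespace character, must precede the digits.
with_lookbehind = re.findall(r"(?<=(?:[A-Z]{3}\s){2})[0-9]{3}", DATA)

# Simpler alternative: any 3 digits at the end of a line, preceded by whitespace.
with_anchor = re.findall(r"(?<=\s)\d{3}$", DATA, flags=re.MULTILINE)

if __name__ == "__main__":
    print(with_lookbehind)
    print(with_anchor)
```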

Regular expression hex bytes in string

I am trying to validate the input of a LineEdit widget in Qt. I am using regular expressions for the first time so I could use some help. I want to allow 1 to 32 hex bytes separated by a space, for example this should be valid:
"0a 45 bc 78 e2 34 71 af"
And here are some examples of invalid input:
"1 34 bc 4e" -> They need to be written in pairs, so 1 must be 01.
"8a cb3 58 11" -> cb3 invalid.
"56 f2 a8 69 " -> No trailing space is allowed.
After some head scratching I came up with this regex which seems to work:
"([0-9A-Fa-f]{2}[ ]){0,31}([0-9A-Fa-f]{2})"
Now on to my questions:
Do you see any problems with my regex that my tests have failed to show? If so how can I improve it?
Is there a cleaner way to write it?
Thanks in advance
I am not sure which method you used for validation, but one possible problem is that the method searches the string for a substring that matches the pattern, rather than checking that the whole string matches the pattern. Use exactMatch to validate a string against a regular expression.
In any case, adding anchors ^ and $ is safer (not necessary when exactMatch is used, though):
"^([0-9A-Fa-f]{2}[ ]){0,31}([0-9A-Fa-f]{2})$"
Since you are doing validation, you don't need capturing. And you don't need to put space in []
"^(?:[0-9A-Fa-f]{2} ){0,31}[0-9A-Fa-f]{2}$"
You can set case-sensitivity with setCaseSensitivity method. If you set it to Qt::CaseInsensitive, you can shorten the regex a bit:
"^(?:[0-9a-f]{2} ){0,31}[0-9a-f]{2}$"

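To try the final pattern outside Qt, here is a minimal sketch in Python. re.fullmatch plays the role that exactMatch plays in the answer above, and re.IGNORECASE stands in for Qt::CaseInsensitive; the function name is_valid_hex_bytes is invented for this example.

```python
import re

# Final pattern from the answer; the anchors are omitted because fullmatch
# already requires the whole string to match.
HEX_BYTES = re.compile(r"(?:[0-9a-f]{2} ){0,31}[0-9a-f]{2}", re.IGNORECASE)


def is_valid_hex_bytes(text: str) -> bool:
    """Return True for 1 to 32 two-digit hex bytes separated by single spaces."""
    return HEX_BYTES.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["0a 45 bc 78 e2 34 71 af", "1 34 bc 4e", "8a cb3 58 11", "56 f2 a8 69 "]:
        print(repr(candidate), "->", is_valid_hex_bytes(candidate))
```
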
Extract a portion of text using RegEx

I would like to extract portion of a text using a regular expression. So for example, I have an address and want to return just the number and streets and exclude the rest:
2222 Main at King Edward Vancouver BC CA
But the addresses vary in format most of the time. I tried using a lookahead regex and came up with this expression:
.*?(?=\w* \w* \w{2}$)
The above expression handles the above example nicely, but it gets way too messy as soon as commas come into the text, or postal codes, which can be a 6-character string or two 3-character strings with a space in the middle, etc.
Is there a more elegant way of extracting a portion of text than a lookahead regex?
Any suggestion or a point in another direction is greatly appreciated.
Thanks!
Regular expressions are for data that is REGULAR, that follows a pattern. So if your data is completely random, no, there's no elegant way to do this with regex.
On the other hand, if you know what values you want, you can probably write a few simple regexes, and then just test them all on each string.
Ex.
regex1= address # grabber, regex2 = street type grabber, regex3 = name grabber.
Attempt a match on string1 with regex1, regex2, and finally regex3. Move on to the next string.
Well, I thought I'd throw my hat into the ring:
.*(?=,? ([a-zA-Z]+,?\s){3}([\d-]*\s)?)
You might want ^ or \d+ at the front for good measure.
I didn't bother specifying lengths for the postal codes; this one accepts any number of digits and hyphens.
It works for these inputs so far, and for variations on commas within the city/state/country area:
2222 Main at King Edward Vancouver, BC, CA, 333-333
555 road and street place CA US 95000
2222 Main at King Edward Vancouver BC CA 333
555 road and street place CA US
It counts on there being three words at the end for the city, state and country. Other than that, it's like ryansstack said: if it's random, it won't work. If the city is two words, like New York, it won't work. Yeah... regex isn't the tool for this one.
btw: tested on regexhero.net
I can think of 2 ways you can do this:
1) If you know that "the rest" of your data after the address is exactly 2 fields, i.e. BC and CA, you can split your string using space as the delimiter and remove the last 2 items.
2) Split on the delimiter /[A-Z][A-Z]/ and store the result in an array, then print out the array (this assumes the address doesn't contain 2 or more consecutive capital letters).
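Both ideas can be sketched in a few lines of Python. The function names are made up for illustration, and the number of trailing fields to drop is an assumption you would adapt to your data. For option 2 the sketch keeps only the first element of the split result.

```python
import re

ADDRESS = "2222 Main at King Edward Vancouver BC CA"


def drop_trailing_fields(address: str, count: int = 2) -> str:
    """Option 1: split on whitespace and remove the last `count` items."""
    return " ".join(address.split()[:-count])


def before_first_code(address: str) -> str:
    """Option 2: keep the text before the first pair of consecutive capitals."""
    return re.split(r"[A-Z][A-Z]", address)[0].rstrip()


if __name__ == "__main__":
    print(drop_trailing_fields(ADDRESS))
    print(before_first_code(ADDRESS))
```

Passing count=3 to the first function would also drop the city, which is closer to what the regex in the question extracts.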