Regex matching issue to Test-String

I have a problem that I don't understand.
My Regex:
My Test-String:
I have two issues and one general question :)
As you can see in my Test-String, the very last (German) phone number (the big yellow one in the Test-String attachment) does not match my regex pattern correctly. I don't get it; what is the problem here? The "0049" fits Group 5 but should fit Group 2. Why is that?
My second problem: how can I get rid of the spaces before and after every match? (The 7 small yellow circles in the Test-String attachment.)
For copy/paste purposes, here is the Regex and Test-String again:
Regex:
((\+\d{2}|00\d{2})?([ ])?(\()?(\d{2,4})(\))?([-| |/])?(\d{3,})([ ])?(\d+)?([ ])?(\d+)?)
Test-String:
Vorwahl 089, die E.123 ebenfalls , also (089) 1234567. Die DIN 5008, also +49 89 1234567 respectivly 0049 89 1234567. Die E.123 empfiehlt, also +49 89 123456 0 respectivly 0049 89 123456 0 oder +49 89 123456 789. Also +49 89 123 456 789. Klammern 089/1234567 und 0151 19406041. Test +49 151 123 456 789 respectivly 0049 151 123 456 789
Last but not least, my general question:
Is it a good approach to group each logical part as I did in my example?
One last piece of information: I validate my regex with https://regex101.com/ and use it in Python with the re module.

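What makes the matching unpredictable is the large number of optional groups (..)?. In your last number the match can start one character early, at the space before 0049: Group 3 takes that space, Group 5 takes 0049 as a block of 2 to 4 digits, and Group 8 takes 151, so the match ends after 456 and 789 is left over. With the two-digit area code 89 this alternative fails, which is why the other numbers match as expected.
As a first step I recommend replacing ([ ])?(\d+)? with the coupled expression ([ ]?\d+)?, which avoids spaces at the end of the match (your point #2).
As a second step I recommend coupling the first optional space with the expression for the international dialling prefix: ((\+|00)\d{2}([ ])?)?. Now we are lucky, because this solves both the space at the beginning and the recognition of the whole number, since there are fewer possible ways to match.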
The new expression now looks like this:
(((\+|00)\d{2}([ ])?)?(\()?(\d{2,4})(\))?([-| |/])?(\d{3,})([ ]?\d+)?([ ]?\d+)?)
I also recommend simplifying the last part if you don't need the individual group values:
(((\+|00)\d{2}([ ])?)?(\()?(\d{2,4})(\))?([-| |/])?(\d{3,})([ ]?\d+){0,2})
For better performance I suggest you remove the parentheses/groups where possible, or mark them as non-capturing, if you don't need the specific group values.
In some programming languages you will not need the outermost parentheses, as the whole match is always available as group 0.
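Since the question mentions Python's re module, here is a minimal sketch of the simplified pattern with every group made non-capturing. It assumes the literal | inside the character class [-| |/] was not intended (inside [...] a | is an ordinary character), so the class is written as [- /]. The name PHONE and the shortened sample text are only illustrative.

```python
import re

# Simplified pattern from the answer, with all groups non-capturing.
# Assumption: "[-| |/]" was meant as "dash, space or slash", written here as "[- /]".
PHONE = re.compile(r"(?:(?:\+|00)\d{2} ?)?\(?\d{2,4}\)?[- /]?\d{3,}(?: ?\d+){0,2}")

# Tail of the Test-String from the question, including the problematic last number.
text = (
    "Klammern 089/1234567 und 0151 19406041. "
    "Test +49 151 123 456 789 respectivly 0049 151 123 456 789"
)

# With no capturing groups, findall returns the full matches (group 0).
matches = PHONE.findall(text)
for number in matches:
    print(repr(number))
```

Because the pattern can no longer start or end on an optional space, the matches come back without surrounding whitespace.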

Related

Custom regex for SSN

Need a regex that will identify SSN so that we can mask the numbers, but want to skip masking of work order numbers (starts with WO-).
Details are below:
Current regex used: \b(?!000|666|9)\d{3}[- ]?(?!00)\d{2}[- ]?(?!0000)\d{4}\b
Sample Test Data
WO-011493479
011493479
778-62-8144
030 72 7381
757-85-7495
149-13-7317
401318448
003 06 8815
790714615
805 14 1893
In the sample data above, the number after WO- (as in WO-011493479) should be skipped. With the regex above, a work order string like WO-011493479 gets changed to WO-*********.
For checking the regex quickly, please use this link.
Note: The regex should work with all major browsers
You're one negative lookbehind away from getting it yourself:
\b(?<!WO-)(?!000|666|9)\d{3}[- ]?(?!00)\d{2}[- ]?(?!0000)\d{4}\b
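A minimal sketch of how the pattern behaves on part of the sample data. It is written in Python rather than browser JavaScript, so it only illustrates the lookbehind logic; the helper name mask_ssn and the choice of '*' as the mask character are assumptions taken from the WO-********* example in the question.

```python
import re

# Pattern from the answer: the negative lookbehind rejects numbers preceded by "WO-".
SSN = re.compile(r"\b(?<!WO-)(?!000|666|9)\d{3}[- ]?(?!00)\d{2}[- ]?(?!0000)\d{4}\b")


def mask_ssn(text: str) -> str:
    """Replace every digit of each SSN-like match with '*', keeping separators."""
    return SSN.sub(lambda m: re.sub(r"\d", "*", m.group()), text)


samples = ["WO-011493479", "011493479", "778-62-8144", "030 72 7381"]
masked = [mask_ssn(s) for s in samples]
for before, after in zip(samples, masked):
    print(f"{before} -> {after}")
```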

Find a String from a varying number block to the end

I have nearly 8000 lines of the following text:
DIL 2 M 006 SC SCHÜTZ 083 1 Stck
25215-1 BIN-SORT 2152310251724-1 BIN-SORT getestet 048 133 Stck
RBBE60-T3dsg 21S003 SEALING 6X8.9X2.4 MM 082 3 Stck
I am only interested in the 3-digit block at the end and the number after it.
So this should be the output:
083 1
048 133
082 3
The same number, e.g. 048, may also appear at the beginning of the line. That shouldn't be a hit.
Unfortunately I have no idea how to extract these strings with Notepad++.
This expression,
.*(\d{3}\s+\d+).*
with a replacement of $1 is likely to work here.
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
You may try the following find and replace, in regex mode:
Find: ^.*?(\d+ \d+) \S*$
Replace: $1
The logic here is to use the lazy .*? to consume everything up to the last two space-separated numbers in the line, which must be followed by one final word (here Stck). Then we replace the whole line with only the two captured numbers.
Demo
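The same find-and-replace can be tried outside Notepad++. This is a minimal Python sketch using the second expression on the three sample lines; note that Python writes the replacement as \1 where Notepad++ uses $1, and the names LAST_TWO_NUMBERS and extracted are only illustrative.

```python
import re

# Second expression from above: lazy prefix, two space-separated numbers, one trailing word.
LAST_TWO_NUMBERS = re.compile(r"^.*?(\d+ \d+) \S*$")

lines = [
    "DIL 2 M 006 SC SCHÜTZ 083 1 Stck",
    "25215-1 BIN-SORT 2152310251724-1 BIN-SORT getestet 048 133 Stck",
    "RBBE60-T3dsg 21S003 SEALING 6X8.9X2.4 MM 082 3 Stck",
]

# Replace each whole line with the captured pair of numbers.
extracted = [LAST_TWO_NUMBERS.sub(r"\1", line) for line in lines]
for value in extracted:
    print(value)
```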

Italian phone 10-digit number regex issue

I'm trying to use the regex from this site
/^([+]39)?((38[{8,9}|0])|(34[{7-9}|0])|(36[6|8|0])|(33[{3-9}|0])|(32[{8,9}]))([\d]{7})$/
for Italian mobile phone numbers, but a simple number such as 3491234567 is reported as invalid.
(Don't worry about spaces; I'll trim them.)
should pass:
349 1234567
+39 349 1234567
TODO: 0039 349 1234567
TODO: (+39) 349 1234567
TODO: (0039) 349 1234567
regex101 and regexr both pass the validation. What's wrong?
UPDATE:
To clarify:
The regex should match any number that starts with either
388/389/380 (38[{8,9}|0])|
or
347/348/349/340 (34[{7-9}|0])|
or
366/368/360 (36[6|8|0])|
or
333/334/335/336/337/338/339/330 (33[{3-9}|0])|
or
328/329 (32[{8,9}])
plus 7 digits ([\d]{7})
and the +39 at the start optionally ([+]39)?
The following regex appears to fulfill your requirements. I took out the syntax errors and guessed a bit, and added the missing parts to cover your TODO comments.
^(\((00|\+)39\)|(00|\+)39)?(38[890]|34[7-90]|36[680]|33[3-90]|32[89])\d{7}$
Demo: https://regex101.com/r/yF7bZ0/1
Your test cases fail to cover many of the variations captured by the regex; perhaps you'll want to beef up the test set to make sure it does what you want.
The beginning allows for an optional international prefix with or without the parentheses. The basic pattern is (00|\+)39 and it is repeated with or without parentheses around it. (Perhaps a better overall approach would be to trim parentheses and punctuation as well as whitespace before processing begins; you'll want to keep the plus as significant, of course.)
Updated with information from @Edoardo's answer; wrapped for legibility and added comments:
^ # beginning of line
(\((00|\+)39\)|(00|\+)39)? # country code or trunk code, with or without parentheses
( # followed by one of the following
32[89]| # 328 or 329
33[013-9]| # 33x where x != 2
34[04-9]| # 34x where x not in 1,2,3
35[01]| # 350 or 351
36[068]| # 360 or 366 or 368
37[019]| # 370 or 371 or 379
38[089]) # 380 or 388 or 389
\d{6,7} # ... followed by 6 or 7 digits
$ # and end of line
There are obvious accidental gaps which will probably also get filled over time. Generalizing this further is likely to improve resilience toward future changes, but of course may at the same time increase the risk of false positives. Make up your mind about which is worse.
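In Python, a wrapped and commented pattern like this can be used directly with the re.VERBOSE flag. The following is a minimal sketch that strips whitespace first, as the question says it will; the names ITALIAN_MOBILE and is_italian_mobile are hypothetical, and the prefix list is simply the one shown above, not an authoritative numbering plan.

```python
import re

# Commented pattern in verbose mode; whitespace and "# ..." comments are ignored.
ITALIAN_MOBILE = re.compile(
    r"""
    ^                           # beginning of string
    (\((00|\+)39\)|(00|\+)39)?  # optional country code, with or without parentheses
    (                           # followed by one of these prefixes
        32[89]    |             # 328 or 329
        33[013-9] |             # 33x where x != 2
        34[04-9]  |             # 34x where x not in 1, 2, 3
        35[01]    |             # 350 or 351
        36[068]   |             # 360 or 366 or 368
        37[019]   |             # 370 or 371 or 379
        38[089]                 # 380 or 388 or 389
    )
    \d{6,7}                     # followed by 6 or 7 digits
    $                           # end of string
    """,
    re.VERBOSE,
)


def is_italian_mobile(raw: str) -> bool:
    """Remove all whitespace, then test the compact number against the pattern."""
    compact = re.sub(r"\s+", "", raw)
    return ITALIAN_MOBILE.match(compact) is not None


for candidate in ["349 1234567", "(0039) 349 1234567", "321 1234567"]:
    print(candidate, is_italian_mobile(candidate))
```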
I found this and updated it with new operators and MVNO prefixes (Iliad, ho.):
^(\((00|\+)39\)|(00|\+)39)?(38[890]|34[4-90]|36[680]|33[13-90]|32[89]|35[01]|37[019])\d{6,7}$
I improved the regex to handle spaces between the digit groups and after the international prefix:
^(\((00|\+)39\)|(00|\+)39)?\s?(38[890]|34[4-90]|36[680]|33[13-90]|32[89]|35[01]|37[019])(\s?\d{3}\s?\d{3,4}|\d{6,7})$
So, for example, I can match a phone number like (0039) 349 123 4567 or 349 123 4567.
Following this doc:
https://it.qaz.wiki/wiki/Telephone_numbers_in_Italy
A simple regex for Italian MOBILE numbers without special characters is:
/^3[0-9]{8,9}$/
It matches a string starting with the digit '3' followed by 8 or 9 digits, e.g.:
3345678103
You can then add the Italian prefix, like '+39 ' or '0039 ' (the + has to be escaped):
/^\+39 3[0-9]{8,9}$/ --- match --> +39 3345678103
/^0039 3[0-9]{8,9}$/ --- match --> 0039 3345678103

SSN issue in SIEBEL

I am working on Siebel CRM. I have issues with spaces in my regex.
I have SSN numbers in these formats:
123 456 789
123-456-789
123 45 6789
I need to display my SSN like XXX-XX-4567. My regex looks like this:
([\s.:])(?!000)(?!666)(?!9[0-9][0-9])\d{3}[- ]?(?!00)\d{2}[- ]?(?!0000)\d{4})([\s.:]) |
([\s.:])(?!000)(?!666)(?!9[0-9][0-9])\d{3}[- ]?(?!00)\d{3}[- ]?(?!00)\d{3})([\s.:]).
How can I remove all blank spaces in the above expression and display the format as I mentioned above?
It looks like there are syntax errors in your regex. There are a couple of unmatched brackets: for example, at (?!0000)\d{4}) in the first section, the last bracket is unmatched.
I think I've managed to write the regex you're looking for, and it is a bit shorter than the one you were using:
([\s.:])((?!000)(?!666)(?!9[0-9]{2})\d{3})[- ]?((?!00)\d{2,3})[- ]?((?!00)\d{3,4})([\s.:])
This will match the following strings:
123-12-1234
123 456 789
123-456-789
123 45 6789
But will not match the following:
666-45-1234
abc-12-1232
123-00-1233
123-224-0011
123 224 0000
There are several capture groups here:
1. Matches the leading delimiter: a whitespace character, a period or a colon (you may want to change this).
2. Matches the first three-digit number.
3. Matches the second, two- or three-digit number.
4. Matches the third, three- or four-digit number.
5. Matches the trailing delimiter, again a whitespace character, a period or a colon (you may want to change this).
You should be able to reconstruct the SSN in the format you need with the result of this RegEx.
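As a minimal sketch of that reconstruction, shown in Python rather than in Siebel: the helper name mask_ssn is hypothetical, and it assumes the goal is to keep only the last four digits of whatever the three number groups captured, as in the XXX-XX-4567 example from the question.

```python
import re

# Regex from the answer; groups 2-4 hold the digit blocks, groups 1 and 5 the delimiters.
SSN = re.compile(
    r"([\s.:])((?!000)(?!666)(?!9[0-9]{2})\d{3})[- ]?"
    r"((?!00)\d{2,3})[- ]?((?!00)\d{3,4})([\s.:])"
)


def mask_ssn(text: str) -> str:
    """Rewrite each match as XXX-XX-dddd, keeping the surrounding delimiters."""

    def rebuild(match: re.Match) -> str:
        # Joining the groups drops any spaces or dashes between the blocks.
        digits = match.group(2) + match.group(3) + match.group(4)
        return f"{match.group(1)}XXX-XX-{digits[-4:]}{match.group(5)}"

    return SSN.sub(rebuild, text)


print(mask_ssn("SSN: 123 45 6789 on file"))
print(mask_ssn("id 123-456-789."))
```

Note that the pattern needs a delimiter character on both sides of the number, so a bare string such as 123-45-6789 with nothing around it is left unchanged.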

Extract a portion of text using RegEx

I would like to extract portion of a text using a regular expression. So for example, I have an address and want to return just the number and streets and exclude the rest:
2222 Main at King Edward Vancouver BC CA
But the addresses vary in format most of the time. I tried using a lookahead regex and came up with this expression:
.*?(?=\w* \w* \w{2}$)
The above expression handles the above example nicely, but it gets way too messy as soon as commas come into the text, or postal codes, which can be a 6-character string or two 3-character strings with a space in the middle, etc.
Is there a more elegant way of extracting a portion of text than a lookahead regex?
Any suggestion or a point in another direction is greatly appreciated.
Thanks!
Regular expressions are for data that is REGULAR, that follows a pattern. So if your data is completely random, no, there's no elegant way to do this with regex.
On the other hand, if you know what values you want, you can probably write a few simple regexes, and then just test them all on each string.
Ex.
regex1 = address-number grabber, regex2 = street-type grabber, regex3 = name grabber.
Attempt a match on string1 with regex1, regex2, and finally regex3. Move on to the next string.
Well, I thought I'd throw my hat into the ring:
.*(?=,? ([a-zA-Z]+,?\s){3}([\d-]*\s)?)
You might want ^ or \d+ at the front for good measure.
I didn't bother specifying lengths for the postal codes; this one accepts any number of digits and hyphens.
It works for these inputs so far, and for variations on commas within the city/state/country area:
2222 Main at King Edward Vancouver, BC, CA, 333-333
555 road and street place CA US 95000
2222 Main at King Edward Vancouver BC CA 333
555 road and street place CA US
It counts on there being three words at the end for the city, state and country, but other than that it's like ryansstack said: if it's random, it won't work. If the city is two words, like New York, it won't work. Yeah... regex isn't the tool for this one.
BTW: tested on regexhero.net
I can think of 2 ways you can do this:
1) If you know that "the rest" of your data after the address is exactly 2 fields, i.e. BC and CA, you can split your string using space as the delimiter and remove the last 2 items.
2) Split on the delimiter /[A-Z][A-Z]/ and store the result in an array, then print out the array (provided that the address doesn't contain 2 or more consecutive capital letters).
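The first option can be sketched in a few lines of Python. The function name drop_trailing_fields and its n_fields parameter are hypothetical; the sketch assumes the trailing fields are single space-separated words, which, as noted above, breaks for a city such as New York.

```python
def drop_trailing_fields(address: str, n_fields: int = 2) -> str:
    """Split on whitespace and drop the last n_fields items."""
    parts = address.split()
    if n_fields <= 0:
        return " ".join(parts)
    return " ".join(parts[:-n_fields])


address = "2222 Main at King Edward Vancouver BC CA"
print(drop_trailing_fields(address))     # drops BC and CA
print(drop_trailing_fields(address, 3))  # also drops the one-word city
```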