How do I write more maintainable regular expressions?

I have started to feel that using regular expressions decreases code maintainability. There is something evil about the terseness and power of regular expressions. Perl compounds this with side effects like default operators.
I DO have a habit of documenting regular expressions with at least one sentence giving the basic intent and at least one example of what would match.
Because regular expressions are built up I feel it is an absolute necessity to comment on the largest components of each element in the expression. Despite this even my own regular expressions have me scratching my head as though I am reading Klingon.
Do you intentionally dumb down your regular expressions? Do you decompose possibly shorter and more powerful ones into simpler steps? I have given up on nesting regular expressions. Are there regular expression constructs that you avoid due to maintainability issues?
Do not let this example cloud the question.
If the following by Michael Ash had some sort of bug in it would you have any prospects of doing anything but throwing it away entirely?
^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
Per request the exact purpose can be found using Mr. Ash's link above.
Matches 01.1.02 | 11-30-2001 | 2/29/2000
Non-Matches 02/29/01 | 13/01/2002 | 11/00/02

Use Expresso, which gives a hierarchical, English breakdown of a regex.
Or
This tip from Darren Neimke:
.NET allows regular expression
patterns to be authored with embedded
comments via the
RegexOptions.IgnorePatternWhitespace
compiler option and the (?#...) syntax
embedded within each line of the
pattern string.
This allows for pseudo-code-like
comments to be embedded in each line
and has the following effect on
readability:
Dim re As New Regex ( _
"(?<= (?# Start a positive lookBEHIND assertion ) " & _
"(\#|@) (?# Find a # or a @ symbol, with # escaped ) " & _
") (?# End the lookBEHIND assertion ) " & _
"(?= (?# Start a positive lookAHEAD assertion ) " & _
" \w+ (?# Find at least one word character ) " & _
") (?# End the lookAHEAD assertion ) " & _
"\w+\b (?# Match multiple word characters leading up to a word boundary)", _
RegexOptions.Multiline Or RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace _
)
Here's another .NET example (requires the RegexOptions.Multiline and RegexOptions.IgnorePatternWhitespace options):
static string validEmail = @"\b # Find a word boundary
(?<Username> # Begin group: Username
[a-zA-Z0-9._%+-]+ # Characters allowed in username, 1 or more
) # End group: Username
@ # The e-mail '@' character
(?<Domainname> # Begin group: Domain name
[a-zA-Z0-9.-]+ # Domain name(s), we include a dot so that
# mail.somewhere is also possible
\.[a-zA-Z]{2,4} # The top level domain can only be 2 to 4 letters
# So .info works, .telephone doesn't.
) # End group: Domain name
\b # Ending on a word boundary
";
If your RegEx is applicable to a common problem, another option is to document it and submit to RegExLib, where it will be rated and commented upon. Nothing beats many pairs of eyes...
Another RegEx tool is The Regulator
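The embedded-comment approach in the two .NET examples above is not specific to .NET. As a rough sketch in Python (not part of the original tip), re.VERBOSE plays the role of IgnorePatternWhitespace. The name VALID_EMAIL is made up for illustration, and the pattern is the same deliberately simple one, not a complete e-mail validator.

```python
import re

# Hypothetical name; mirrors the commented .NET pattern, with the dot before the TLD escaped.
VALID_EMAIL = re.compile(
    r"""
    \b                      # find a word boundary
    (?P<username>           # begin group: username
        [a-zA-Z0-9._%+-]+   #   characters allowed in the username, 1 or more
    )                       # end group: username
    @                       # the e-mail '@' character
    (?P<domainname>         # begin group: domain name
        [a-zA-Z0-9.-]+      #   domain label(s); the dot allows mail.somewhere
        \.[a-zA-Z]{2,4}     #   top-level domain of 2 to 4 letters
    )                       # end group: domain name
    \b                      # end on a word boundary
    """,
    re.VERBOSE,
)

match = VALID_EMAIL.search("Contact: jane.doe@mail.example.info today")
```

The named groups can then be read back with match.group("username") and match.group("domainname").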

I usually just try to wrap all my regular expression calls inside their own function, with a meaningful name and some basic comments. I like to think of regular expressions as a write-only language, readable only by the one who wrote it (unless it's really simple). I fully expect that someone would probably need to rewrite the expression completely if they had to change its intent, and this is probably for the better, as it keeps the regular expression training alive.
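A minimal sketch of that habit, in Python. The ZIP code check and the names _US_ZIP and is_us_zip_code are invented purely for illustration; the point is that callers see a descriptive function name and never the pattern itself.

```python
import re

# Hypothetical example: the pattern is private, the function name carries the intent.
_US_ZIP = re.compile(r"[0-9]{5}(?:-[0-9]{4})?")


def is_us_zip_code(text: str) -> bool:
    """Return True for a 5-digit ZIP or ZIP+4, such as 12345 or 12345-6789."""
    return _US_ZIP.fullmatch(text) is not None
```

If the rule ever changes, only this one function has to be rewritten and retested.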

Well, the entire purpose in life of the PCRE /x modifier is to allow you to write regexes more readably, as in this trivial example:
my $expr = qr/
[a-z] # match a lower-case letter
\d{3,5} # followed by 3-5 digits
/x;

Some people use REs for the wrong things (I'm waiting for the first SO question on how to detect a valid C++ program using a single RE).
I usually find that, if I can't fit my RE within 60 characters, it's better off being a piece of code since that will almost always be more readable.
In any case, I always document, in the code, what the RE is supposed to achieve, in great detail. This is because I know, from bitter experience, how hard it is for someone else (or even me, six months later) to come in and try to understand.
I don't believe they're evil, although I do believe some people who use them are evil (not looking at you, Michael Ash :-). They're a great tool but, like a chainsaw, you'll cut your legs off if you don't know how to use them properly.
UPDATE: Actually, I've just followed the link to that monstrosity, and it's to validate m/d/y format dates between the years 1600 and 9999. That is a classic case of where full-blown code would be more readable and maintainable.
You just split it up into three fields and check the individual values. I'd almost consider it an offense worthy of termination if one of my minions brought this to me. I'd certainly send them back to write it properly.
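As a sketch of what "split it up into three fields" could look like (Python, names invented for illustration): one small regex only separates the fields, and ordinary code checks the values. Two assumptions are baked in to mirror the big pattern: four-digit years must lie between 1600 and 9999, and a two-digit year of 00 is treated as a non-leap year.

```python
import re

DAYS_IN_MONTH = [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]  # Feb allows 29 here


def is_leap_year(year_text: str) -> bool:
    year = int(year_text)
    if len(year_text) == 2:
        # Assumption: without a century, 00 counts as a non-leap year.
        return year != 0 and year % 4 == 0
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def is_valid_mdy(text: str) -> bool:
    # The regex only splits the fields; \2 forces the same separator twice.
    m = re.fullmatch(r"([0-9]{1,2})([/.-])([0-9]{1,2})\2([0-9]{2}|[0-9]{4})", text)
    if m is None:
        return False
    month, day, year_text = int(m.group(1)), int(m.group(3)), m.group(4)
    if len(year_text) == 4 and not 1600 <= int(year_text) <= 9999:
        return False
    if not 1 <= month <= 12:
        return False
    if not 1 <= day <= DAYS_IN_MONTH[month - 1]:
        return False
    if month == 2 and day == 29:
        return is_leap_year(year_text)
    return True
```

Each rule now sits on its own line, so a bug in, say, the leap-year handling can be fixed without touching the rest.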

Here is the same regex broken down into digestible pieces. In addition to being more readable, some of the sub-regexes can be useful on their own. It is also significantly easier to change the allowed separators.
#!/usr/local/ActivePerl-5.10/bin/perl
use 5.010; #only 5.10 and above
use strict;
use warnings;
my $sep = qr{ [/.-] }x; #allowed separators
my $any_century = qr/ 1[6-9] | [2-9][0-9] /x; #match the century
my $any_decade = qr/ [0-9]{2} /x; #match any decade or 2 digit year
my $any_year = qr/ $any_century? $any_decade /x; #match a 2 or 4 digit year
#match the 1st through 28th for any month of any year
my $start_of_month = qr/
(?: #match
0?[1-9] | #Jan - Sep or
1[0-2] #Oct - Dec
)
($sep) #the separator
(?:
0?[1-9] | # 1st - 9th or
1[0-9] | #10th - 19th or
2[0-8] #20th - 28th
)
\g{-1} #and the separator again
/x;
#match 29th - 31st for any month but Feb for any year
my $end_of_month = qr/
(?:
(?: 0?[13578] | 1[02] ) #match Jan, Mar, May, Jul, Aug, Oct, Dec
($sep) #the separator
31 #the 31st
\g{-1} #and the separator again
| #or
(?: 0?[13-9] | 1[0-2] ) #match all months but Feb
($sep) #the separator
(?:29|30) #the 29th or the 30th
\g{-1} #and the separator again
)
/x;
#match any non-leap year date and the first part of Feb in leap years
my $non_leap_year = qr/ (?: $start_of_month | $end_of_month ) $any_year/x;
#match 29th of Feb in leap years
#BUG: 00 is treated as a non leap year
#even though 2000, 2400, etc are leap years
my $feb_in_leap = qr/
0?2 #match Feb
($sep) #the separator
29 #the 29th
\g{-1} #the separator again
(?:
$any_century? #any century
(?: #and decades divisible by 4 but not 100
0[48] |
[2468][048] |
[13579][26]
)
|
(?: #or match centuries that are divisible by 4
16 |
[2468][048] |
[3579][26]
)
00
)
/x;
my $any_date = qr/$non_leap_year|$feb_in_leap/;
my $only_date = qr/^$any_date$/;
say "test against garbage";
for my $date (qw(022900 foo 1/1/1)) {
say "\t$date ", $date ~~ $only_date ? "matched" : "didn't match";
}
say '';
#comprehensive test
my @code = qw/good unmatch month day year leap/;
for my $sep (qw( / - . )) {
say "testing $sep";
my $i = 0;
for my $y ("00" .. "99", 1600 .. 9999) {
say "\t", int $i/8500*100, "% done" if $i++ and not $i % 850;
for my $m ("00" .. "09", 0 .. 13) {
for my $d ("00" .. "09", 1 .. 31) {
my $date = join $sep, $m, $d, $y;
my $re = $date ~~ $only_date || 0;
my $code = not_valid($date);
unless ($re == !$code) {
die "error $date re $re code $code[$code]\n"
}
}
}
}
}
sub not_valid {
state $end = [undef, 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31];
my $date = shift;
my ($m,$d,$y) = $date =~ m{([0-9]+)[-./]([0-9]+)[-./]([0-9]+)};
return 1 unless defined $m; #if $m is set, the rest will be too
#components are in roughly the right ranges
return 2 unless $m >= 1 and $m <= 12;
return 3 unless $d >= 1 and $d <= $end->[$m];
return 4 unless ($y >= 0 and $y <= 99) or ($y >= 1600 and $y <= 9999);
#handle the non leap year case
return 5 if $m == 2 and $d == 29 and not leap_year($y);
return 0;
}
sub leap_year {
my $y = shift;
$y = "19$y" if $y < 1600;
return 1 if 0 == $y % 4 and 0 != $y % 100 or 0 == $y % 400;
return 0;
}

I have found a nice method is to simply break up the matching process into several phases. It probably does not execute as fast but you have the added bonus of also being able to tell at a finer grain level why the match is not occurring.
Another route is to use LL or LR parsing. Some languages cannot be expressed as regular expressions, probably not even with Perl's non-FSM extensions.

I have learned to avoid all but the simplest regexp. I far prefer other models such as Icon's string scanning or Haskell's parsing combinators. In both of these models you can write user-defined code that has the same privileges and status as the built-in string ops. If I were programming in Perl I would probably rig up some parsing combinators in Perl---I've done it for other languages.
A very nice alternative is to use Parsing Expression Grammars as Roberto Ierusalimschy has done with his LPEG package, but unlike parser combinators this is something you can't whip up in an afternoon. But if somebody has already done PEGs for your platform it's a very nice alternative to regular expressions.

Wow, that is ugly. It looks like it should work, modulo an unavoidable bug dealing with 00 as a two digit year (it should be a leap year one quarter of the time, but without the century you have no way of knowing what it should be). There is a lot of redundancy that should probably be factored out into sub-regexes and I would create three sub-regexes for the three main cases (that is my next project tonight). I also used a different character for the delimiter to avoid having to escape forward slashes, changed the single character alternations into character classes (which happily lets us avoid having to escape period), and changed \d to [0-9] since the former matches any digit character (including U+1815 MONGOLIAN DIGIT FIVE: ᠕) in Perl 5.8 and 5.10.
Warning, untested code:
#!/usr/bin/perl
use strict;
use warnings;
my $match_date = qr{
#match 29th - 31st of all months but 2 for the years 1600 - 9999
#with optionally leaving off the first two digits of the year
^
(?:
#match the 31st of 1, 3, 5, 7, 8, 10, and 12
(?: (?: 0? [13578] | 1[02] ) ([/.-]) 31) \1
|
#or match the 29th and 30th of all months but 2
(?: (?: 0? [13-9] | 1[0-2] ) ([/.-]) (?:29|30) \2)
)
(?:
(?: #optionally match the century
1[6-9] | #16 - 19
[2-9][0-9] #20 - 99
)?
[0-9]{2} #match the decade
)
$
|
#or match 29 for 2 for leap years
^
(?:
#FIXME: 00 is treated as a non leap year
#even though 2000, 2400, etc are leap years
0?2 #month 2
([/.-]) #separator
29 #29th
\3 #separator from before
(?: #leap years
(?:
#match rule 1 (div 4) minus rule 2 (div 100)
(?: #match any century
1[6-9] |
[2-9][0-9]
)?
(?: #match decades divisible by 4 but not 100
0[48] |
[2468][048] |
[13579][26]
)
|
#or match rule 3 (div 400)
(?:
(?: #match centuries that are divisible by 4
16 |
[2468][048] |
[3579][26]
)
00
)
)
)
)
$
|
#or match 1st through 28th for all months between 1600 and 9999
^
(?: (?: 0?[1-9]) | (?:1[0-2] ) ) #all months
([/.-]) #separator
(?:
0?[1-9] | #1st - 9th or
1[0-9] | #10th - 19th or
2[0-8] #20th - 28th
)
\4 #separator from before
(?:
(?: #optionally match the century
1[6-9] | #16 - 19
[2-9][0-9] #20 - 99
)?
[0-9]{2} #match the decade
)
$
}x;

Some people, when confronted with a
problem, think "I know, I’ll use
regular expressions." Now they have
two problems. — Jamie Zawinski in
alt.religion.emacs.
Keep the regular expressions as simple as they can possibly be (KISS). In your date example, I'd likely use one regular expression for each date-type.
Or even better, replace it with a library (e.g. a date-parsing library).
I'd also take steps to ensure that the input source had some restrictions (i.e. only one type of date-strings, ideally ISO-8601).
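For instance, if the input really is restricted to ISO-8601 calendar dates, the standard library can do the whole job. This is a sketch in Python; the helper name parse_iso_date is made up, while date.fromisoformat is a real standard-library call (Python 3.7 and later).

```python
from datetime import date
from typing import Optional


def parse_iso_date(text: str) -> Optional[date]:
    """Return a date for a valid YYYY-MM-DD string, otherwise None."""
    try:
        return date.fromisoformat(text)
    except ValueError:
        # Raised for malformed strings and for impossible dates such as Feb 29 in 2001.
        return None
```

Leap years, month lengths and range checks are all handled by the library, so there is no regular expression left to maintain.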
Also,
One thing at a time (with the possible exception of extracting values)
Advanced constructs are OK if used correctly (as in simplifying the expression and hence reducing maintenance)
EDIT:
"advanced constructs lead to
maintenance issues"
My original point was that if used correctly it should lead to simpler expressions, not more difficult ones. Simpler expressions should reduce maintenance.
I've updated the text above to say as much.
I would point out that regular expressions hardly qualify as advanced constructs in and of themselves. Not being familiar with a certain construct does not make it an advanced construct, merely an unfamiliar one. Which does not change the fact that regular expressions are powerful, compact and- if used properly- elegant. Much like a scalpel, it lies entirely in the hands of the one who wields it.

I think the answer to maintaining regular expression is not so much with commenting or regex constructs.
If I were tasked with debugging the example you gave, I would sit down in front of a regex debug tool (like Regex Coach) and step through the regular expression on the data that it has to process.

I could still work with it. I'd just use Regulator. One thing it allows you to do is save the regex along with test data for it.
Of course, I might also add comments.
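The idea of keeping test data next to the regex does not depend on a particular tool. A minimal sketch in Python, using the pattern and the match and non-match examples from the question (the names MDY_DATE, SHOULD_MATCH, SHOULD_NOT_MATCH and failures are invented for illustration):

```python
import re

# Michael Ash's pattern from the question, split across lines only for width.
MDY_DATE = re.compile(
    r"^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|"
    r"(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))"
    r"(?:(?:1[6-9]|[2-9]\d)?\d{2})$"
    r"|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?"
    r"(?:0[48]|[2468][048]|[13579][26])|"
    r"(?:(?:16|[2468][048]|[3579][26])00))))$"
    r"|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)"
    r"(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$"
)

# Test data stored right beside the pattern it documents.
SHOULD_MATCH = ["01.1.02", "11-30-2001", "2/29/2000"]
SHOULD_NOT_MATCH = ["02/29/01", "13/01/2002", "11/00/02"]


def failures():
    """Return every test string whose result differs from the expectation."""
    bad = [s for s in SHOULD_MATCH if not MDY_DATE.search(s)]
    bad += [s for s in SHOULD_NOT_MATCH if MDY_DATE.search(s)]
    return bad
```

Running failures() after every change to the pattern gives a quick regression check, whatever tool was used to edit it.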
Here's what Expresso produced. I had never used it before, but now, Regulator is out of a job:
// using System.Text.RegularExpressions;
///
/// Regular expression built for C# on: Thu, Apr 2, 2009, 12:51:56 AM
/// Using Expresso Version: 3.0.3276, http://www.ultrapico.com
///
/// A description of the regular expression:
///
/// Select from 3 alternatives
/// ^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$
/// Beginning of line or string
/// Match expression but don't capture it. [(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2)]
/// Select from 2 alternatives
/// (?:(?:0?[13578]|1[02])(\/|-|\.)31)\1
/// Match expression but don't capture it. [(?:0?[13578]|1[02])(\/|-|\.)31]
/// (?:0?[13578]|1[02])(\/|-|\.)31
/// Match expression but don't capture it. [0?[13578]|1[02]]
/// Select from 2 alternatives
/// 0?[13578]
/// 0, zero or one repetitions
/// Any character in this class: [13578]
/// 1[02]
/// 1
/// Any character in this class: [02]
/// [1]: A numbered capture group. [\/|-|\.]
/// Select from 3 alternatives
/// Literal /
/// -
/// Literal .
/// 31
/// Backreference to capture number: 1
/// (?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2)
/// Return
/// New line
/// Match expression but don't capture it. [(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2]
/// (?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2
/// Match expression but don't capture it. [0?[13-9]|1[0-2]]
/// Select from 2 alternatives
/// 0?[13-9]
/// 0, zero or one repetitions
/// Any character in this class: [13-9]
/// 1[0-2]
/// 1
/// Any character in this class: [0-2]
/// [2]: A numbered capture group. [\/|-|\.]
/// Select from 3 alternatives
/// Literal /
/// -
/// Literal .
/// Match expression but don't capture it. [29|30]
/// Select from 2 alternatives
/// 29
/// 29
/// 30
/// 30
/// Backreference to capture number: 2
/// Return
/// New line
/// Match expression but don't capture it. [(?:1[6-9]|[2-9]\d)?\d{2}]
/// (?:1[6-9]|[2-9]\d)?\d{2}
/// Match expression but don't capture it. [1[6-9]|[2-9]\d], zero or one repetitions
/// Select from 2 alternatives
/// 1[6-9]
/// 1
/// Any character in this class: [6-9]
/// [2-9]\d
/// Any character in this class: [2-9]
/// Any digit
/// Any digit, exactly 2 repetitions
/// End of line or string
/// ^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$
/// Beginning of line or string
/// Match expression but don't capture it. [0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00)))]
/// 0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00)))
/// 0, zero or one repetitions2
/// [3]: A numbered capture group. [\/|-|\.]
/// Select from 3 alternatives
/// Literal /
/// -
/// Literal .
/// 29
/// Backreference to capture number: 3
/// Match expression but don't capture it. [(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))]
/// Match expression but don't capture it. [(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00)]
/// Select from 2 alternatives
/// (?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])
/// Match expression but don't capture it. [1[6-9]|[2-9]\d], zero or one repetitions
/// Select from 2 alternatives
/// 1[6-9]
/// 1
/// Any character in this class: [6-9]
/// [2-9]\d
/// Any character in this class: [2-9]
/// Any digit
/// Match expression but don't capture it. [0[48]|[2468][048]|[13579][26]]
/// Select from 3 alternatives
/// 0[48]
/// 0
/// Any character in this class: [48]
/// [2468][048]
/// Any character in this class: [2468]
/// Any character in this class: [048]
/// [13579][26]
/// Any character in this class: [13579]
/// Any character in this class: [26]
/// (?:(?:16|[2468][048]|[3579][26])00)
/// Return
/// New line
/// Match expression but don't capture it. [(?:16|[2468][048]|[3579][26])00]
/// (?:16|[2468][048]|[3579][26])00
/// Match expression but don't capture it. [16|[2468][048]|[3579][26]]
/// Select from 3 alternatives
/// 16
/// 16
/// [2468][048]
/// Any character in this class: [2468]
/// Any character in this class: [048]
/// [3579][26]
/// Any character in this class: [3579]
/// Any character in this class: [26]
/// 00
/// End of line or string
/// ^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
/// Beginning of line or string
/// Match expression but don't capture it. [(?:0?[1-9])|(?:1[0-2])]
/// Select from 2 alternatives
/// Match expression but don't capture it. [0?[1-9]]
/// 0?[1-9]
/// 0, zero or one repetitions
/// Any character in this class: [1-9]
/// Match expression but don't capture it. [1[0-2]]
/// 1[0-2]
/// 1
/// Any character in this class: [0-2]
/// Return
/// New line
/// [4]: A numbered capture group. [\/|-|\.]
/// Select from 3 alternatives
/// Literal /
/// -
/// Literal .
/// Match expression but don't capture it. [0?[1-9]|1\d|2[0-8]]
/// Select from 3 alternatives
/// 0?[1-9]
/// 0, zero or one repetitions
/// Any character in this class: [1-9]
/// 1\d
/// 1
/// Any digit
/// 2[0-8]
/// 2
/// Any character in this class: [0-8]
/// Backreference to capture number: 4
/// Match expression but don't capture it. [(?:1[6-9]|[2-9]\d)?\d{2}]
/// (?:1[6-9]|[2-9]\d)?\d{2}
/// Match expression but don't capture it. [1[6-9]|[2-9]\d], zero or one repetitions
/// Select from 2 alternatives
/// 1[6-9]
/// 1
/// Any character in this class: [6-9]
/// [2-9]\d
/// Any character in this class: [2-9]
/// Any digit
/// Any digit, exactly 2 repetitions
/// End of line or string
///
///
///
public static Regex regex = new Regex(
"^(?:(?:(?:0?[13578]|1[02])(\\/|-|\\.)31)\\1|\r\n(?:(?:0?[13-9]"+
"|1[0-2])(\\/|-|\\.)(?:29|30)\\2))\r\n(?:(?:1[6-9]|[2-9]\\d)?\\d"+
"{2})$|^(?:0?2(\\/|-|\\.)29\\3(?:(?:(?:1[6-9]|[2-9]\\d)?(?:0["+
"48]|[2468][048]|[13579][26])|\r\n(?:(?:16|[2468][048]|[3579][2"+
"6])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))\r\n(\\/|-|\\.)(?:0?[1-9"+
"]|1\\d|2[0-8])\\4(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$",
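RegexOptions.CultureInvariant
| RegexOptions.Compiled
| RegexOptions.IgnorePatternWhitespace // the pattern string embeds \r\n line breaks, which must be ignored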
);

I posted a question recently about commenting regexes with embedded comments. There were useful answers, particularly one from @mikej:
See the post by Martin Fowler on
ComposedRegex for some more ideas on
improving regexp readability. In
summary, he advocates breaking down a
complex regexp into smaller parts
which can be given meaningful variable
names. e.g.
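Fowler's own example is not reproduced here. As a stand-in, this is a small Python sketch of the same idea, applied to one case of the date pattern discussed in this thread (the 1st through 28th of any month). All variable names are invented for illustration.

```python
import re

# Small named pieces, each readable on its own.
SEP = r"[/.-]"
CENTURY = r"(?:1[6-9]|[2-9][0-9])"
YEAR = rf"(?:{CENTURY}?[0-9]{{2}})"            # 2-digit year, or 4-digit from 1600 to 9999
MONTH = r"(?:0?[1-9]|1[0-2])"
DAY_1_TO_28 = r"(?:0?[1-9]|1[0-9]|2[0-8])"

# The composed pattern reads almost like a sentence; (?P=sep) repeats the first separator.
START_OF_MONTH = re.compile(
    rf"{MONTH}(?P<sep>{SEP}){DAY_1_TO_28}(?P=sep){YEAR}"
)
```

Changing the allowed separators, for example, now means editing SEP only.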

I do not expect regular expressions to be readable, so I just leave them as they are, and rewrite if needed.

Related

Excluding text at the beginning of a string

I'm new to using RegEx and I'm still stumbling around a bit, so I'm sorry if this is a basic question. I'm trying to extract a string from between two parentheses and I can't seem to figure out how to exclude the first part from my match.
This is my regex pattern:
(.+?)(?= -)
I want to extract a birth date, for example, excluding the "b." and the trailing "-". Here's a sample set:
( b. circa 1883 - d. Mar 03, 1960 )
( b. May 21, 1887 - d. Jan 24, 1979 )
( b. May 28, 1902 Zembin, BELARUS - d. Dec 22, 1998 Florida, USA )
( b. Jan 09, 1886 Philadelphia, Pennsylvania, USA - d. May 17, 1969 New York, New York, USA )
My regex matches ( b. Jan 09, 1886 Philadelphia, Pennsylvania, USA (for example) but also includes "( b. " prefix, which I want to exclude.
The regex also matches the following text, which I would like to exclude as well:
Husband of Sarah Wilder (August 2000
Also, I cannot get the following string to match, presumably because of the dot and space in St. Louis.
( b. Jun 28, 1920 St. Louis, Missouri, USA )
I've been banging my head for several hours and just can't quite get the rest of it. Any help or guidance would be very much appreciated. I've already gotten a lot of help from reading many of the posts here.
Thanks so much!

Assuming that your data always contains a hyphen followed by d., you can try this: (?<=b\. )(.*) - d\.
(?<=b\. ) matches the b. text without it being added to the matching text.
(.*) is a capturing group that contains the match. It captures everything until the terminating - d. is hit. Note that the . characters must be escaped to match correctly as they are regex special characters.

If it always starts with ( b. and end with - d. <something> ), you can simply do
(?<=^\( b\. ).*(?= - d\..*\))
Which actually means you match any characters (.*), with <start of line>( b. in front of them ((?<=^\( b\. )), and with - d. <something>) behind them ((?= - d\..*\))). https://regex101.com/r/vB2fmP/1
Or, if you don't mind using a capture group:
^\( b\. (.*) - d\..*\)$
^ start of line
\( b\. open parenthesis, space, b, dot, space
( ) capture group
.* any char, any occurrence
 - d\..*\) space, hyphen, space, d, dot,
then any char any occurrence,
close parenthesis,
$ end of line
and capture group 1 is the value you need (personally I prefer this one instead).
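To try the capture-group variant against the sample data, here is a short Python sketch. The language was not stated in the question, so Python is an arbitrary choice, and the names BIRTH and birth_info are invented. As in the answer, it assumes every wanted line contains " - d.".

```python
import re

# Capture-group variant: group 1 holds everything between "( b. " and " - d.".
BIRTH = re.compile(r"^\( b\. (.*) - d\..*\)$")


def birth_info(line: str):
    """Return the birth part of a line, or None when the line has a different shape."""
    m = BIRTH.match(line)
    return m.group(1) if m else None


samples = [
    "( b. circa 1883 - d. Mar 03, 1960 )",
    "( b. May 28, 1902 Zembin, BELARUS - d. Dec 22, 1998 Florida, USA )",
    "Husband of Sarah Wilder (August 2000",
]
results = [birth_info(s) for s in samples]
```

The unwanted "Husband of Sarah Wilder (August 2000" line yields None because it does not start with "( b. ".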

To prevent capturing the leading ( b. you could prefix your regex with \(\s*b\.\s* which will match the ( and the b. surrounded by zero or more whitespace characters \s*.
Then from that point you would capture your values in a group (.*?) and you could update your positive lookahead (?= (?:\-|\))) to include a whitespace with either a - or a ).
\(\s*b\.\s*(.*?)(?= (?:\-|\)))

You can do this by making two passes through the search string. On the first pass you capture all text inside brackets, and on the second you clean up your results by removing the unwanted expressions. You don't say what language you are using, so I will use PHP.
$want = "/\(.+?\)/";
$dontWant = '/(b\.|-)/'; // "b." or a hyphen; the dot is escaped with a backslash
$desiredResult = array();
$result = preg_match_all($want, $searchText, $matches); // Get all text inside brackets
if (count($matches[0])>0) { // $matches[0] holds all the matches
foreach ($matches[0] as $match) { // Loop through the matches
$desiredResult[] = preg_replace( $dontWant, "", $match); // Remove unwanted text
}
}
You can adjust this to whatever language you are using.

Extract nested string from text column

I have following SQL result entries.
Result
---------
TW - 5657980 Due Date updated : to <strong>2017-08-13 10:21:00</strong> by <strong>System</strong>
TW - 5657980 Priority updated from <strong> Medium</strong> to <strong>Low</strong> by <strong>System</strong>
TW - 5657980 Material added: <strong>1000 : Cash in Bank - Operating (Old)/ QTY:2</strong> by <strong>System</strong>#9243
TW - 5657980 Labor added <strong>Kelsey Franks / 14:00 hours </strong> by <strong>System</strong>#65197
Now I am trying to extract a short description from this result and migrate it to another column in the same table.
Expected result
--------------
Due Date Updated
Priority Updated
Material Added
Labor Added
Ignore first 13 characters. For most of the cases it ends with 'updated'. Few ends with 'added'. It should be case insensitive.
Is there any way to get the expected result?

Solution with substring() using a regular expression. It skips the first 13 characters, then takes the string up to the first ' updated' or ' added', case-insensitive, with leading blank. Else NULL:
SELECT substring(result, '(?i)^.{13}(.*? (?:updated|added))')
FROM tbl;
The regexp explained:
(?i) .. meta-syntax to switch to case-insensitive matching
^ .. start of string
.{13} .. skip the first 13 characters
() .. capturing parenthesis (captures payload)
.*? .. any number of characters (non-greedy)
(?:) .. non-capturing parenthesis
(?:updated|added) .. 2 branches (string ends in 'updated' or 'added')
If we cannot rely on 13 leading characters, as you later commented, we need some other reliable definition instead. Your difficulty seems to lie with hazy requirements more than with the actual implementation.
Say, we are dealing with 1 or more non-digits, followed by 1 or more digits, a space and then the payload as defined above:
SELECT substring(result, '(?i)^\D+\d+ (.*? (?:updated|added))') ...
\d .. class shorthand for digits
\D .. non-digits, the opposite of \d
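To experiment with the first pattern outside the database, it can be tried in Python. This is only a sketch: Python's re module is a different regex flavor from PostgreSQL's, but the constructs used here behave the same way in both, and re.IGNORECASE stands in for the leading (?i). The names SHORT_DESC and short_description are invented.

```python
import re

# Same pattern as in the SQL above; the case-insensitive switch is passed as a flag.
SHORT_DESC = re.compile(r"^.{13}(.*? (?:updated|added))", re.IGNORECASE)


def short_description(row: str):
    """Return the captured payload, or None when nothing matches (NULL in SQL)."""
    m = SHORT_DESC.match(row)
    return m.group(1) if m else None


rows = [
    "TW - 5657980 Due Date updated : to <strong>2017-08-13 10:21:00</strong> by <strong>System</strong>",
    "TW - 5657980 Material added: <strong>1000 : Cash in Bank - Operating (Old)/ QTY:2</strong> by <strong>System</strong>#9243",
    "TW - 5657980 Nothing relevant here",
]
descriptions = [short_description(r) for r in rows]
```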

String Split AND Replace

I am trying to replace a string based on the split portion. This string is a date, where the year should be formatted as a superscript.
Eg. Jan 24, 2014 needs to be split at 2014 then replaced with Jan 24, ^2014^ where 2014 is the superscript.
Example pseudo:
mydate.Split(" ", 2).Replace("^2014^")
But, instead of replacing the new split string, it should be the original (or copy of original). I can't just edit based on index because the formatting may not always be the same, at times the date may be expanded to January 24th, 2014 which would then break the traditional replace by index.

You can try
(?<=[A-Z][a-z]{2} \d{2}, )(\d{4})
Replaced with ^$1^ or ^\1^
Here is an online demo; I also tested it on regexstorm.
If you want to match January 24th, 2014 as well then try
([A-Z][a-z]{2,9} \d{2}[a-z]{0,2}, )(\d{4})
Replaced with $1^$2^
Here is a demo.
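The second pattern can be checked quickly in Python as well. This is a sketch rather than the .NET code the question asks about; note that Python writes the group references in the replacement as \1 and \2 instead of $1 and $2. The function name superscript_year is invented.

```python
import re


def superscript_year(text: str) -> str:
    """Wrap the four-digit year after 'Mon DD, ' or 'Month DDth, ' in carets."""
    return re.sub(
        r"([A-Z][a-z]{2,9} \d{2}[a-z]{0,2}, )(\d{4})",
        r"\1^\2^",
        text,
    )
```

The original string is left untouched; re.sub returns a new string with only the year wrapped.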

You can use a combination of lookarounds to achieve your result.
Regex.Replace(input, "(?<=\d{4})|(?=\d{4})", "^")
Explanation:
(?<= # look behind to see if there is:
\d{4} # digits (0-9) (4 times)
) # end of look-behind
| # OR
(?= # look ahead to see if there is:
\d{4} # digits (0-9) (4 times)
) # end of look-ahead
Live Demo

Normalize your date string by assigning it to a Date variable, then do the formatting from there.
Dim dt As Date = "Jan 24, 2014"
Dim s As String = dt.ToShortDateString.Replace("2014", "^2014^")
MsgBox(s)
' or '
s = dt.Month.ToString & "/" & dt.Day.ToString & "/^" & dt.Year.ToString & "^"
MsgBox(s)
IMO RegEx is write-once code and is difficult to debug/maintain.

Regex failing to match number and dash with letter (or space and letter)

In the tester this works ... but not in PostgreSQL.
My data is like this: usually a series of letters, followed by 2 numbers and a POSSIBLE '-' or space with only ONE letter following. I am trying to isolate the 2 numbers and the possible '-' or space AND the ONE letter with my regex:
For ex:
AJ 50-R Busboys ## should return 50-R
APPLES 30 F ## should return 30 F
FOOBAR 30 Apple ## should return 30
Regex's (that have worked in the tester, but not in PostgreSQL) that I've tried:
substring(REF from '([0-9]+)-?([:space:])?([A-Za-z])?')
&
substring(REF from '([0-9]+)-?([A-Za-z])?')
So far everything tests out in the tester, but not in PostgreSQL. I just keep getting the numbers returned, AND NOTHING AFTER THEM.
What I am getting now(for ex):
AJ 50-R Busboys ## returns as "50" NOT as "50-R"

You're looking for: substring(REF from '([0-9]+(-| )([A-Za-z]\y)?)')
In SQLFiddle. Your primary problem is that substring returns the first or outermost matching group (ie., pattern surrounded with ()), which is why you get 50 for your '50-R'. If you were to surround the entire pattern with (), this would give you '50-R'. However, the pattern you have fails to return what you want on the other strings, even after accounting for this issue, so I had to modify the entire regex.

This matches your description and examples.
Your description is slightly ambiguous. Leading letters are followed by a space and then two digits in your examples, as opposed to your description.
SELECT t, substring(t, '^[[:alpha:] ]+(\d\d(?:[\s-]?[[:alpha:]]\M)?)')
FROM (
VALUES
('AJ 50-R Busboys') -- should return: 50-R
,('APPLES 30 F') -- should return: 30 F
,('FOOBAR 30 Apple') -- should return: 30
,('FOOBAR 30x Apple') -- should return: 30x
,('sadfgag30 D 66 X foo') -- should return: 30 D - not: 66 X
) r(t);
->SQLfiddle
Explanation
^ .. start of string (last row could fail without anchoring to start and global flag 'g'). Also: faster.
[[:alpha:] ]+ .. one or more letters or spaces (like in your examples).
( .. capturing parenthesis
\d\d .. two digits
(?: .. non-capturing parenthesis
[\s-]? .. '-' or 'white space' (character class), 0 or 1 times
[[:alpha:]] .. 1 letter
\M .. followed by end of word (can be end of string, too)
)? .. the pattern in non-capturing parentheses 0 or 1 times
Letters as defined by the character class alpha according to the current locale! The poor man's substitute [a-zA-Z] only works for basic ASCII letters and fails for anything more. Consider this simple demo:
SELECT substring('oö','[[:alpha:]]*')
,substring('oö','[a-zA-Z]*');
More about character classes in Postgres regular expressions in the manual.

It's because of the parentheses.
I've looked everywhere in the documentation and found an interesting sentence on this page:
[...] if the pattern contains any parentheses, the portion of the text that matched the first parenthesized subexpression (the one whose left parenthesis comes first) is returned.
I took your first expression:
([0-9]+)-?([:space:])?([A-Za-z])?
and wrapped it in parentheses:
(([0-9]+)-?([:space:])?([A-Za-z])?)
and it works fine (see SQLFiddle).
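The effect can be reproduced outside PostgreSQL. In this Python sketch (an analogy only, since Python has no substring(... from ...)), group(1) plays the role of the value that substring returns when the pattern contains parentheses. The question's second pattern is used because it contains no POSIX class.

```python
import re

text = "AJ 50-R Busboys"

# First parenthesized group is just the digits.
inner = re.search(r"([0-9]+)-?([A-Za-z])?", text)

# Wrapping the whole pattern makes the outer group the first one.
outer = re.search(r"(([0-9]+)-?([A-Za-z])?)", text)

first_group_before = inner.group(1)
first_group_after = outer.group(1)
```

Both searches match the same text overall; only the content of the first group changes from the digits alone to the full "50-R".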
Update:
Also, because you're looking for - or space, you could rewrite your middle expression to [-\s]? (thanks Matthew for pointing that out; inside a character class a | would be a literal pipe, so it is left out), which leads to the following possible REGEX:
(([0-9]+)[-\s]?([A-Za-z])?)
(SQLFiddle)
Update 2:
While my answer provides the explanation as to why the result represented a partial match of your expression, the expression I presented above fails your third test case.
You should use the regex provided by Matthew in his answer.

RegEx: UK Landlines, Mobile phone numbers

I've been struggling with finding a suitable solution :-
I need a regular expression that will match all UK landline and mobile phone numbers.
So far this one appears to cover most of the UK numbers:
^0\d{2,4}[ -]{1}[\d]{3}[\d -]{1}[\d -]{1}[\d]{1,4}$
However, mobile numbers do not work with this regex, nor do phone numbers written in a single solid block such as 01234567890.
Could anyone help me create the required regular expression?

[\d -]{1}
is blatantly incorrect: a digit OR a space OR a hyphen.
01000 123456
01000 is not a valid UK area code. 123456 is not a valid local number.
It is important that test data be real area codes and real number ranges.
^\s*(?(020[7,8]{1})?[ ]?[1-9]{1}[0-9{2}[ ]?[0-9]{4})|(0[1-8]{1}[0-9]{3})?[ ]?[1-9]{1}[0-9]{2}[ ]?[0-9]{3})\s*|[0-9]+[ ]?[0-9]+$
The above pattern is garbage for many different reasons.
[7,8] matches 7 or comma or 8. You don't need to match a comma.
London numbers also begin with 3 not just 7 or 8.
London 020 numbers aren't the only 2+8 format numbers; see also 023, 024, 028 and 029.
[1-9]{1} simplifies to [1-9]
[ ]? simplifies to \s?
Having found the initial 0 once, why keep searching for it again and again?
^(0....|0....|0....|0....)$ simplifies to ^0(....|....|....|....)$
Seriously. ([1]|[2]|[3]|[7]){1} simplifies to [1237] here.
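A couple of the points above are easy to check by running the character classes on their own. This is only a small JavaScript sketch; the patterns are cut down from the ones quoted, and the variable names are made up for the illustration:

```javascript
// [7,8] is a character class holding three characters: 7, comma and 8.
const withComma = /^[7,8]$/;
const withoutComma = /^[78]$/;

console.log(withComma.test(","));    // true: the comma is matched literally
console.log(withoutComma.test(",")); // false

// ([1]|[2]|[3]|[7]){1} and [1237] accept exactly the same single digits.
const verbose = /^([1]|[2]|[3]|[7]){1}$/;
const compact = /^[1237]$/;
const digits = "0123456789".split("");
const sameBehaviour = digits.every((d) => verbose.test(d) === compact.test(d));
console.log(sameBehaviour); // true
```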
UK phone numbers use a variety of formats: 2+8, 3+7, 3+6, 4+6, 4+5, 5+5, 5+4. Some users don't know which format goes with which number range and might use the wrong one on input. Let them do that; you're interested in the DIGITS.
Step 1: Check the input format looks valid
Make sure that the input looks like a UK phone number. Accept various dial prefixes, +44, 011 44, 00 44 with or without parentheses, hyphens or spaces; or national format with a leading 0. Let the user use any format they want for the remainder of the number: (020) 3555 7788 or 00 (44) 203 555 7788 or 02035-557-788 even if it is the wrong format for that particular number. Don't worry about unbalanced parentheses. The important part of the input is making sure it's the correct number of digits. Punctuation and spaces don't matter.
^\(?(?:(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?\(?(?:0\)?[\s-]?\(?)?|0)(?:\d{5}\)?[\s-]?\d{4,5}|\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3})|\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4}|\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}|8(?:00[\s-]?11[\s-]?11|45[\s-]?46[\s-]?4\d))(?:(?:[\s-]?(?:x|ext\.?\s?|\#)\d+)?)$
The above pattern matches an optional opening parenthesis, followed by 00 or 011 and an optional closing parenthesis, followed by an optional space or hyphen, followed by an optional opening parenthesis. Alternatively, the initial opening parenthesis is followed by a literal + without a following space or hyphen. Either of the previous two options is then followed by 44 with an optional closing parenthesis, followed by an optional space or hyphen, followed by an optional 0 in optional parentheses, followed by an optional space or hyphen, followed by an optional opening parenthesis (international format). Alternatively, the pattern matches an optional initial opening parenthesis followed by the 0 trunk code (national format).
The previous part is then followed by the NDC (area code) and the subscriber phone number in 2+8, 3+7, 3+6, 4+6, 4+5, 5+5 or 5+4 format with or without spaces and/or hyphens. This also includes provision for optional closing parentheses and/or optional space or hyphen after where the user thinks the area code ends and the local subscriber number begins. The pattern allows any format to be used with any GB number. The display format must be corrected by later logic if the wrong format for this number has been used by the user on input.
The pattern ends with an optional extension number arranged as an optional space or hyphen followed by x, ext and optional period, or #, followed by the extension number digits. The entire pattern does not bother to check for balanced parentheses as these will be removed from the number in the next step.
At this point you don't care whether the number begins 01 or 07 or something else. You don't care whether it's a valid area code. Later steps will deal with those issues.
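As a rough illustration of Step 1, the pattern can be dropped into a JavaScript regex literal unchanged. The function name looksLikeGbNumber is made up for this sketch, and trimming the input before testing is my own assumption rather than part of the answer:

```javascript
// Step 1 pattern, copied verbatim from the answer above.
const GB_INPUT_FORMAT = /^\(?(?:(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?\(?(?:0\)?[\s-]?\(?)?|0)(?:\d{5}\)?[\s-]?\d{4,5}|\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3})|\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4}|\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}|8(?:00[\s-]?11[\s-]?11|45[\s-]?46[\s-]?4\d))(?:(?:[\s-]?(?:x|ext\.?\s?|\#)\d+)?)$/;

// Hypothetical helper: only checks that the input is shaped like a GB number.
// It says nothing about whether the area code or number range really exists.
function looksLikeGbNumber(input) {
  return GB_INPUT_FORMAT.test(input.trim());
}

// The three example inputs from the text, in three different layouts.
console.log(looksLikeGbNumber("(020) 3555 7788"));      // true
console.log(looksLikeGbNumber("00 (44) 203 555 7788")); // true
console.log(looksLikeGbNumber("02035-557-788"));        // true
console.log(looksLikeGbNumber("020 3555"));             // false: too few digits
```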
Step 2: Extract the NSN so it can be checked in more detail for length and range
After checking the input looks like a GB telephone number using the pattern above, the next step is to extract the NSN part so that it can be checked in greater detail for validity and then formatted in the right way for the applicable number range.
^\(?(?:(?:0(?:0|11)\)?[\s-]?\(?|\+)(44)\)?[\s-]?\(?(?:0\)?[\s-]?\(?)?|0)([1-9]\d{1,4}\)?[\s\d-]+)(?:((?:x|ext\.?\s?|\#)\d+)?)$
Use the above pattern to extract the '44' from $1 to know that international format was used, otherwise assume national format if $1 is null.
Extract the optional extension number details from $3 and store them for later use.
Extract the NSN (including spaces, hyphens and parentheses) from $2.
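The extraction in Step 2 might look something like this in JavaScript, where $1, $2 and $3 become m[1], m[2] and m[3]. The function name extractParts and the shape of the returned object are assumptions made for the sketch:

```javascript
// Step 2 pattern, copied verbatim from the answer above.
const GB_EXTRACT = /^\(?(?:(?:0(?:0|11)\)?[\s-]?\(?|\+)(44)\)?[\s-]?\(?(?:0\)?[\s-]?\(?)?|0)([1-9]\d{1,4}\)?[\s\d-]+)(?:((?:x|ext\.?\s?|\#)\d+)?)$/;

// Hypothetical helper: returns null when the input does not match at all.
function extractParts(input) {
  const m = GB_EXTRACT.exec(input.trim());
  if (m === null) return null;
  return {
    international: m[1] === "44", // group 1 is only set for +44 / 00 44 / 011 44
    rawNsn: m[2],                 // still contains spaces, hyphens, parentheses
    extension: m[3] ?? null,      // e.g. "x123", or null when absent
  };
}

console.log(extractParts("(020) 3555 7788"));
console.log(extractParts("+44 20 3555 7788 x123"));
```

The raw NSN is deliberately left untouched here; cleaning it up belongs to the next step.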
Step 3: Validate the NSN
Remove the spaces, hyphens and parentheses from $2 and use further RegEx patterns to check the length and range and identify the number type.
These patterns will be much simpler, since they will not have to deal with various dial prefixes or country codes.
The pattern to match valid mobile numbers is therefore as simple as
^7([45789]\d{2}|624)\d{6}$
Premium rate is
^9[018]\d{8}$
There will be a number of other patterns for each number type: landlines, business rate, non-geographic, VoIP, etc.
By breaking the problem into several steps, a very wide range of input formats can be allowed, and the number range and length for the NSN checked in very great detail.
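A minimal sketch of Step 3 in JavaScript, using only the two patterns shown above. The function name classifyNsn and the "unknown" fallback are assumptions; a real implementation would carry one pattern per number type:

```javascript
// Patterns copied from the answer above; the list is deliberately incomplete.
const NSN_TYPES = [
  { type: "mobile", pattern: /^7([45789]\d{2}|624)\d{6}$/ },
  { type: "premium", pattern: /^9[018]\d{8}$/ },
];

// Hypothetical helper: strips punctuation from the raw NSN, then classifies it.
function classifyNsn(rawNsn) {
  const nsn = rawNsn.replace(/[\s()-]/g, ""); // drop spaces, hyphens, parentheses
  const hit = NSN_TYPES.find((entry) => entry.pattern.test(nsn));
  return { nsn, type: hit ? hit.type : "unknown" };
}

console.log(classifyNsn("7700 900123"));   // { nsn: '7700900123', type: 'mobile' }
console.log(classifyNsn("909 879 0123"));  // { nsn: '9098790123', type: 'premium' }
console.log(classifyNsn("20) 3555 7788")); // { nsn: '2035557788', type: 'unknown' }
```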
Step 4: Store the number
Once the NSN has been extracted and validated, store the number with country code and all the other digits with no spaces or punctuation, e.g. 442035557788.
Step 5: Format the number for display
Another set of simple rules can be used to format the number with the requisite +44 or 0 added at the beginning.
The rule for numbers beginning 03 is
^44(3\d{2})(\d{3})(\d{4})$
formatted as
0$1 $2 $3 or as +44 $1 $2 $3
and for numbers beginning 02 is
^44(2\d)(\d{4})(\d{4})$
formatted as
(0$1) $2 $3 or as +44 $1 $2 $3
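Steps 4 and 5 could be sketched in JavaScript roughly as follows. Only the 02 and 03 rules quoted here are included, with the quantifier in the 03 rule written as \d{3}. The names toStoredForm, DISPLAY_RULES and formatStored are invented for the example:

```javascript
// Step 4: store as country code plus NSN, digits only.
function toStoredForm(nsn) {
  return "44" + nsn;
}

// Step 5: one entry per number range; only the two rules shown above.
const DISPLAY_RULES = [
  { pattern: /^44(3\d{2})(\d{3})(\d{4})$/, national: "0$1 $2 $3", international: "+44 $1 $2 $3" },
  { pattern: /^44(2\d)(\d{4})(\d{4})$/, national: "(0$1) $2 $3", international: "+44 $1 $2 $3" },
];

// Hypothetical helper: style is "national" or "international".
function formatStored(stored, style = "national") {
  for (const rule of DISPLAY_RULES) {
    if (rule.pattern.test(stored)) {
      return stored.replace(rule.pattern, rule[style]);
    }
  }
  return stored; // no rule for this range in the sketch: leave unformatted
}

const stored = toStoredForm("2035557788");
console.log(stored);                                // 442035557788
console.log(formatStored(stored));                  // (020) 3555 7788
console.log(formatStored(stored, "international")); // +44 20 3555 7788
console.log(formatStored("443301234567"));          // 0330 123 4567
```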
The full list is quite long. I could copy and paste it all into this thread, but it would be hard to maintain that information in multiple places over time. For the present the complete list can be found at: http://aa-asterisk.org.uk/index.php/Regular_Expressions_for_Validating_and_Formatting_GB_Telephone_Numbers

Given that people sometimes write their numbers with spaces in random places, you might be better off ignoring the spaces altogether. You could then use a regex as simple as this:
^0(\d ?){10}$
This matches:
01234567890
01234 234567
0121 3423 456
01213 423456
01000 123456
But it would also match:
01 2 3 4 5 6 7 8 9 0
So you may not like it, but it's certainly simpler.
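To see that behaviour for yourself, here is a quick JavaScript check of the pattern against the examples listed above (the variable names are just for illustration):

```javascript
// A leading 0, then exactly ten digits, each optionally followed by one space.
const SPACED_UK = /^0(\d ?){10}$/;

const samples = [
  "01234567890",
  "01234 234567",
  "0121 3423 456",
  "01213 423456",
  "01000 123456",
  "01 2 3 4 5 6 7 8 9 0", // the over-permissive case
];

// Every sample is accepted, including the last one.
for (const s of samples) {
  console.log(s, SPACED_UK.test(s));
}

// A number that is one digit short is rejected.
console.log(SPACED_UK.test("0123456789")); // false
```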

Would this regex do?
// using System.Text.RegularExpressions;
/// <summary>
/// Regular expression built for C# on: Wed, Sep 8, 2010, 06:38:28
/// Using Expresso Version: 3.0.2766, http://www.ultrapico.com
///
/// A description of the regular expression:
///
/// [1]: A numbered capture group. [\+44], zero or one repetitions
/// \+44
/// Literal +
/// 44
/// [2]: A numbered capture group. [\s+], zero or one repetitions
/// Whitespace, one or more repetitions
/// [3]: A numbered capture group. [\(?]
/// Literal (, zero or one repetitions
/// [area_code]: A named capture group. [(\d{1,5}|\d{4}\s+?\d{1,2})]
/// [4]: A numbered capture group. [\d{1,5}|\d{4}\s+?\d{1,2}]
/// Select from 2 alternatives
/// Any digit, between 1 and 5 repetitions
/// \d{4}\s+?\d{1,2}
/// Any digit, exactly 4 repetitions
/// Whitespace, one or more repetitions, as few as possible
/// Any digit, between 1 and 2 repetitions
/// [5]: A numbered capture group. [\)?]
/// Literal ), zero or one repetitions
/// [6]: A numbered capture group. [\s+|-], zero or one repetitions
/// Select from 2 alternatives
/// Whitespace, one or more repetitions
/// -
/// [tel_no]: A named capture group. [(\d{1,4}(\s+|-)?\d{1,4}|(\d{6}))]
/// [7]: A numbered capture group. [\d{1,4}(\s+|-)?\d{1,4}|(\d{6})]
/// Select from 2 alternatives
/// \d{1,4}(\s+|-)?\d{1,4}
/// Any digit, between 1 and 4 repetitions
/// [8]: A numbered capture group. [\s+|-], zero or one repetitions
/// Select from 2 alternatives
/// Whitespace, one or more repetitions
/// -
/// Any digit, between 1 and 4 repetitions
/// [9]: A numbered capture group. [\d{6}]
/// Any digit, exactly 6 repetitions
///
///
/// </summary>
public Regex MyRegex = new Regex(
"(\\+44)?\r\n(\\s+)?\r\n(\\(?)\r\n(?<area_code>(\\d{1,5}|\\d{4}\\s+"+
"?\\d{1,2}))(\\)?)\r\n(\\s+|-)?\r\n(?<tel_no>\r\n(\\d{1,4}\r\n(\\s+|-"+
")?\\d{1,4}\r\n|(\\d{6})\r\n))",
RegexOptions.IgnoreCase
| RegexOptions.Singleline
| RegexOptions.ExplicitCapture
| RegexOptions.CultureInvariant
| RegexOptions.IgnorePatternWhitespace
| RegexOptions.Compiled
);
//// Replace the matched text in the InputText using the replacement pattern
// string result = MyRegex.Replace(InputText,MyRegexReplace);
//// Split the InputText wherever the regex matches
// string[] results = MyRegex.Split(InputText);
//// Capture the first Match, if any, in the InputText
// Match m = MyRegex.Match(InputText);
//// Capture all Matches in the InputText
// MatchCollection ms = MyRegex.Matches(InputText);
//// Test to see if there is a match in the InputText
// bool IsMatch = MyRegex.IsMatch(InputText);
//// Get the names of all the named and numbered capture groups
// string[] GroupNames = MyRegex.GetGroupNames();
//// Get the numbers of all the named and numbered capture groups
// int[] GroupNumbers = MyRegex.GetGroupNumbers();
Notice how the spaces and dashes are optional and can be part of the number. The pattern is also divided into two named capture groups, area_code and tel_no, which breaks it down and makes the parts easier to extract.

Strip all whitespace and non-numeric characters and then do the test. It'll be much, much easier than trying to account for all the possible options around brackets, spaces, etc.
Try the following:
@"^(([0]{1})|([\+][4]{2}))([1]|[2]|[3]|[7]){1}\d{8,9}$"
Starts with 0 or +44 (for international) - I'm sure you could add 0044 if you wanted.
It then has a 1, 2, 3 or 7.
It then has either 8 or 9 digits.
If you want to be even smarter, the following may be a useful reference: http://en.wikipedia.org/wiki/Telephone_numbers_in_the_United_Kingdom
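Put together, the strip-then-test idea might look like this in JavaScript. Keeping a leading + while stripping everything else is my assumption (the pattern needs it to recognise +44), and the function name isPlausibleUk is made up:

```javascript
// Pattern from the answer above, written as a JavaScript regex literal.
const SIMPLE_UK = /^(([0]{1})|([\+][4]{2}))([1]|[2]|[3]|[7]){1}\d{8,9}$/;

// Hypothetical helper: remove everything except digits (and one leading +).
function isPlausibleUk(input) {
  const trimmed = input.trim();
  const plus = trimmed.startsWith("+") ? "+" : "";
  const cleaned = plus + trimmed.replace(/\D/g, "");
  return SIMPLE_UK.test(cleaned);
}

console.log(isPlausibleUk("(020) 3555 7788"));  // true
console.log(isPlausibleUk("+44 20 3555 7788")); // true
console.log(isPlausibleUk("0500 123456"));      // false: 5 is not in 1, 2, 3, 7
```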

It's not a single regex, but there's sample code from Braemoor Software that is simple to follow and fairly thorough.
The JS version is probably easiest to read. It strips out spaces and hyphens (which I realise you said you can't do) then applies a number of positive and negative regexp checks.

Start by stripping the non-numerics, excepting a + as the first character.
(Javascript)
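var tel = document.getElementById("tel").value;
// Keep a leading + or digit, then keep only digits in the rest; assign the result back.
tel = tel.substr(0,1).replace(/[^+0-9]/g,'') + tel.substr(1).replace(/[^0-9]/g,'');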
The regex below allows, after the international indicator +, any combination of between 7 and 15 digits (the ITU maximum) UNLESS the code is +44 (UK). Otherwise if the string either begins with +44, +440 or 0, it is followed by 2 or 7 and then by nine of any digit, or it is followed by 1, then any digit except 0, then either seven or eight of any digit. (So 0203 is valid, 0703 is valid but 0103 is not valid). There is currently no such code as 025 (or in London 0205), but those could one day be allocated.
/(^\+(?!44)[0-9]{7,15}$)|(^(\+440?|0)(([27][0-9]{9}$)|(1[1-9][0-9]{7,8}$)))/
Its primary purpose is to identify a correct starting digit for a non-corporate number, followed by the correct number of digits to follow. It doesn't deduce if the subscriber's local number is 5, 6, 7 or 8 digits. It does not enforce the prohibition on initial '1' or '0' in the subscriber number, about which I can't find any information as to whether those old rules are still enforced. UK phone rules are not enforced on properly formatted international phone numbers from outside the UK.
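Outside the browser, the same two steps can be wrapped up and tried against the cases mentioned above. This is only a sketch: the function names are invented, and slice is used in place of substr purely as a matter of taste.

```javascript
// Pattern copied verbatim from the answer above.
const PHONE_CHECK = /(^\+(?!44)[0-9]{7,15}$)|(^(\+440?|0)(([27][0-9]{9}$)|(1[1-9][0-9]{7,8}$)))/;

// Hypothetical helper: keep a leading + or digit, then digits only.
function stripNonNumerics(raw) {
  return raw.slice(0, 1).replace(/[^+0-9]/g, "") + raw.slice(1).replace(/[^0-9]/g, "");
}

function isAcceptedNumber(raw) {
  return PHONE_CHECK.test(stripNonNumerics(raw));
}

console.log(isAcceptedNumber("0203 555 7788"));       // true
console.log(isAcceptedNumber("0703 555 7788"));       // true
console.log(isAcceptedNumber("0103 555 7788"));       // false: 1 followed by 0
console.log(isAcceptedNumber("+44 (0)20 3555 7788")); // true: +440 prefix allowed
console.log(isAcceptedNumber("+1 212 555 0123"));     // true: non-UK, 7 to 15 digits
```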

After a long search for valid regexen to cover UK cases, I found that the best way (if you're using client side javascript) to validate UK phone numbers is to use libphonenumber-js along with custom config to reduce bundle size:
If you're using NodeJS, generate UK metadata by running:
npx libphonenumber-metadata-generator metadata.custom.json --countries GB --extended
then import and use the metadata with libphonenumber-js/core:
import { isValidPhoneNumber } from "libphonenumber-js/core";
import data from "./metadata.custom.json";
isValidPhoneNumber("01234567890", "GB", data);

How do I write more maintainable regular expressions? - regex

Well, the entire purpose in life of the PCRE /x modifier is to allow you to write regexes more readably, as in this trivial example:
my $expr = qr/
    [a-z]     # match a lower-case letter
    \d{3,5}   # followed by 3-5 digits
/x;
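JavaScript regex literals have no /x flag, so one rough analogue of the same idea there is to build the pattern from small commented pieces. This is a sketch of one possible approach, not an equivalent of /x, and the variable names are made up:

```javascript
// Each fragment is a tiny regex with its own comment.
const parts = [
  /[a-z]/,   // match a lower-case letter
  /\d{3,5}/, // followed by 3-5 digits
];

// Join the fragment sources into a single pattern.
const expr = new RegExp(parts.map((p) => p.source).join(""));

console.log(expr.source);        // [a-z]\d{3,5}
console.log(expr.test("x1234")); // true
console.log(expr.test("x12"));   // false: only two digits
```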

I do not expect regular expressions to be readable, so I just leave them as they are, and rewrite if needed.
