PERL: Using regex to subtract columns

I have a file like this:
3107 0.9 0.0 0.0 chr1 29312346 29312694 (219937927) C L1HS LINE/L1 (4) 6151 5803 54360
8095 0.5 0.0 0.0 chr1 31040661 31041597 (218209024) + L1HS LINE/L1 5203 6139 (16) 57249
...
When the 9th column is C, I need to subtract column 14 from 13, and when the 9th column is +, I need to subtract column 12 from 13.
I understand I can create arrays, but how can I use a regex, such as ($line =~/(\w+)\s+(\w+)/), to solve this instead?

You can split on whitespace into the @F array (the first value being $F[0]), subtract the columns, and print the values separated by spaces.
perl -lane'
$F[12] -= $F[13] if $F[8] eq "C";
$F[12] -= $F[11] if $F[8] eq "+";
print "@F";
' file

Since you wanted to use a regex, here is another solution. It is perhaps a bit imprecise, because you did not define your line format cleanly and gave only two example lines; for those, it works. I commented the regex so that you can see which part of the expression matches which column and which parts are captured.
#!/usr/bin/perl
use strict;
use warnings;
use v5.10;
while( <DATA> )
{
if( $_ =~ /[0-9]+ # 1
\s+
[0-9.]+ # 2
\s+
[0-9.]+ # 3
\s+
[0-9.]+ # 4
\s+
[a-z0-9]+ # 5
\s+
[0-9]+ # 6
\s+
[0-9]+ # 7
\s+
\([a-z0-9]+\) # 8
\s+
([c+]) # 9 -> capture group 1
\s+
[a-z0-9]+ # 10
\s+
[a-z0-9\/]+ # 11
\s+
\(?([0-9]+)\)? # 12 -> capture group 2
\s+
([0-9]+) # 13 -> capture group 3
\s+
\(?([0-9]+)\)? # 14 -> capture group 4
\s+
[0-9]+? # 15
/ix )
{
say "Matched: $_";
say "Operation: $1";
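if( $1 eq "+" )
{
# column 13 minus column 12
say "$3 - $2 = ".( $3 - $2 );
}
elsif( uc($1) eq "C" )
{
# column 13 minus column 14; uc() because the i flag also lets "c" match
say "$3 - $4 = ".( $3 - $4 );
}
else
{
say "Nothing to do here...";
}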
}
}
exit;
#1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
__DATA__
3107 0.9 0.0 0.0 chr1 29312346 29312694 (219937927) C L1HS LINE/L1 (4) 6151 5803 54360
8095 0.5 0.0 0.0 chr1 31040661 31041597 (218209024) + L1HS LINE/L1 5203 6139 (16) 57249
Update:
As you can see in the Perl documentation, I used the x flag to allow comments in my regex. The i flag makes it case-insensitive.
Furthermore, I didn't just divide the columns by whitespace but also by their types, which is an advantage of using a regular expression. The \s+ expressions are the column separators here and allow arbitrary amounts of whitespace, while each column is restricted to the characters it may contain. That makes it possible to detect non-conforming lines. For example, by defining capture group $1 as ([c+]), I reduced the characters that trigger an operation to C and + (and c, because of case-insensitivity).
Binding a group to a variable (capturing it) is done with parentheses.
This way, I was able to pick only the columns I really need (see the comments).

Do not use a regex for a problem like this.
If you're just working with columns separated by whitespace, the proper tool is split.
my @cols = split ' ', $line;
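A minimal sketch of the split-based approach applied to the original question, assuming columns are numbered from 1 as in the question (so column 9 is $cols[8]). The helper name column_difference is made up for this illustration:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answers above): returns column 13 minus
# column 14 for "C" lines, column 13 minus column 12 for "+" lines, and
# undef for anything else.
sub column_difference {
    my ($line) = @_;
    my @cols = split ' ', $line;    # ' ' splits on runs of whitespace
    return undef if @cols < 14;     # too few columns to work with
    return $cols[12] - $cols[13] if $cols[8] eq 'C';
    return $cols[12] - $cols[11] if $cols[8] eq '+';
    return undef;
}

my @lines = (
    '3107 0.9 0.0 0.0 chr1 29312346 29312694 (219937927) C L1HS LINE/L1 (4) 6151 5803 54360',
    '8095 0.5 0.0 0.0 chr1 31040661 31041597 (218209024) + L1HS LINE/L1 5203 6139 (16) 57249',
);

for my $line (@lines) {
    my $diff = column_difference($line);
    print defined $diff ? "$diff\n" : "skipped\n";
}
# prints 348, then 936
```

Returning undef for unknown strand markers keeps the caller in charge of what to do with lines that fit neither rule.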

Related

Match all consecutive numbers of length n [duplicate]

Where n=4 in my example.
I'm very new to Regex and have searched for 20 minutes now. There are some helpful websites out there that simplify things but I can't work out how to proceed with this.
I wish to extract every combination of 4 consecutive digits from this:
12345
to get:
1234 - possible with ^\d{4}/g - Starts at the beginning
2345 - possible with \d{4}$/g - Starts at the end
But I can't get both! The input could be any length.

Your expression isn't working as expected because those two sub-strings are overlapping.
Aside from zero-length assertions, any characters in the input string will be consumed in the matching process, which results in the overlapping matches not being found.
You could work around this by using a lookahead and a capturing group to retrieve the overlapping matches. This works because lookahead assertions (as well as lookbehind assertions) are classified as zero-length assertions, which means that they don't consume the matches; thereby allowing you to find any overlapping matches.
(?=(\d{4}))
Here is a quick snippet demonstrating this:
var regex = /(?=(\d{4}))/g;
var input = '12345678';
var match;
while ((match = regex.exec(input)) !== null) {
if (match.index === regex.lastIndex) {
regex.lastIndex++;
}
console.log(match[1]);
}
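The same lookahead trick carries over to Perl. The following is only a sketch (the helper name overlapping_digit_runs is an assumption for illustration); in list context with the g flag, Perl returns every capture and steps past the zero-length matches on its own:

```perl
use strict;
use warnings;

# Hypothetical helper: every run of $n consecutive digits, overlaps included.
sub overlapping_digit_runs {
    my ($input, $n) = @_;
    # The lookahead itself matches the empty string, so the engine advances
    # one character at a time; the inner group captures $n digits each time.
    my @runs = $input =~ /(?=(\d{$n}))/g;
    return @runs;
}

print join(' ', overlapping_digit_runs('12345', 4)), "\n";    # 1234 2345
```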

You can use a lookahead with a capturing group:
(?=(\d{4}))

Use a look ahead assertion with all the possibilities
(?=(0123|1234|2345|3456|4567|5678|6789))
(?=
( # (1 start)
0123
| 1234
| 2345
| 3456
| 4567
| 5678
| 6789
) # (1 end)
)
Output
** Grp 0 - ( pos 0 , len 0 ) EMPTY
** Grp 1 - ( pos 0 , len 4 )
1234
------------------
** Grp 0 - ( pos 1 , len 0 ) EMPTY
** Grp 1 - ( pos 1 , len 4 )
2345

overlapping pattern matching in Perl

A beginner's question. In the code:
$a = 'aaagggaaa';
(@b) = ($a =~ /(a.+)(g.+)/);
print "$b[0]\n";
Why is $b[0] equal to aaagg and not aaa? In other words, why does the second group, (g.+), match only from the last g?

Because the first .+ is "greedy", which means that it will try to match as many characters as possible.
If you want to turn off this "greedy" behaviour, you can replace .+ with .+?, so /(a.+?)(g.+)/ will return ('aaa', 'gggaaa').
Maybe you meant to write /(a+)(g+)/ (only 'a's in the first group and only 'g's in the second).

The regular expression you wrote:
($a =~ /(a.+)(g.+)/);
matches an "a" followed by as many characters as it can, while still leaving a "g" followed by at least one more character for the second group. So the first group, (a.+), matches "aaagg", and the second group, (g.+), matches the remaining "gaaa".
The @b array receives the two captures, "aaagg" and "gaaa", so $b[0] prints "aaagg".

The problem is that the first .+ is causing the g to be matched as far to the right as possible.
To show you what is really happening I modified your code to output more illustrative debug information.
$ perl -Mre=debug -e'q[aaagggaaa] =~ /a.+[g ]/'
Compiling REx "a.+[g ]"
Final program:
1: EXACT <a> (3)
3: PLUS (5)
4: REG_ANY (0)
5: ANYOF[ g][] (16)
16: END (0)
anchored "a" at 0 (checking anchored) minlen 3
Guessing start of match in sv for REx "a.+[g ]" against "aaagggaaa"
Found anchored substr "a" at offset 0...
Guessed: match at offset 0
Matching REx "a.+[g ]" against "aaagggaaa"
0 <> <aaagggaaa> | 1:EXACT <a>(3)
1 <a> <aagggaaa> | 3:PLUS(5)
REG_ANY can match 8 times out of 2147483647...
9 <aaagggaaa> <> | 5: ANYOF[ g][](16)
failed...
8 <aaagggaa> <a> | 5: ANYOF[ g][](16)
failed...
7 <aaaggga> <aa> | 5: ANYOF[ g][](16)
failed...
6 <aaaggg> <aaa> | 5: ANYOF[ g][](16)
failed...
5 <aaagg> <gaaa> | 5: ANYOF[ g][](16)
6 <aaaggg> <aaa> | 16: END(0)
Match successful!
Freeing REx: "a.+[g ]"
Notice that the first .+ is capturing everything it can to start out with.
Then it has to backtrack until the g can be matched.
What you probably want is one of:
/( a+ )( g+ )/x;
/( a.+? )( g.+ )/x;
/( a+ )( g.+ )/x;
/( a[^g]+ )( g.+ )/x;
/( a[^g]+ )( g+ )/x;
# etc.
Without more information from you, it is impossible to know which regex you want.
Regular expressions really are a language in their own right, one that is more complicated than the rest of Perl.

Perl's quantifiers are greedy by default: each one matches as much as it can while still letting the overall pattern succeed.
In your code the second group therefore starts at the last g, and $b[0] is aaagg. If you want aaa instead, you need non-greedy matching. Use this code:
$a = 'aaagggaaa';
(@b) = ($a =~ /(a.+?)(g.+)/);
print "$b[0]\n";
It will output:
aaa
The question mark after the quantifier makes the match non-greedy.

Quantifiers in a regex are greedy by default. You can make one non-greedy by appending the ? character:
$a = 'aaagggaaa';
my @b = ($a =~ /(a.+)(g.+)/);
my @c = ($a =~ /(a.+?)(g.+)/);
print "@b\n";
print "@c\n";
Output:
aaagg gaaa
aaa gggaaa
But I'm not sure this is what you want! What about abagggbb? Do you need aba?
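To make the abagggbb question concrete, here is a small sketch that runs several of the candidate patterns mentioned in the answers against that input. The table layout and variable names are assumptions for illustration only:

```perl
use strict;
use warnings;

my $input = 'abagggbb';

# Pattern source => compiled regex, so each result can be labelled.
my @patterns = (
    [ '(a.+)(g.+)'   => qr/(a.+)(g.+)/   ],
    [ '(a.+?)(g.+)'  => qr/(a.+?)(g.+)/  ],
    [ '(a+)(g+)'     => qr/(a+)(g+)/     ],
    [ '(a[^g]+)(g+)' => qr/(a[^g]+)(g+)/ ],
);

my %result;
for my $entry (@patterns) {
    my ($label, $re) = @$entry;
    my @groups = $input =~ $re;          # list context returns the captures
    $result{$label} = "@groups";
    printf "%-14s => %s\n", $label, $result{$label};
}
# (a.+)(g.+)     => abagg gbb
# (a.+?)(g.+)    => aba gggbb
# (a+)(g+)       => a ggg
# (a[^g]+)(g+)   => aba ggg
```

Which of these is "right" depends entirely on what the two groups are supposed to mean, which the question does not say.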

Pattern matching dates

I'm having troubles trying to match a pattern of dates. Any of the following dates are legal:
- 121212
- 4 9 12
- 5-3-2000
- 62502
- 3/3/11
- 09-08-2001
- 8 6 07
- 12 10 2004
- 4-16-08
- 3/7/2005
What makes this date matching really challenging is that the year doesn't have to be 4 digits (a 2-digit year is assumed to be in the 21st century, i.e. 02 = 2002), a one-digit month or day may or may not be written with a leading 0, and the parts may or may not be separated by spaces, dashes, or slashes.
This is what I currently have:
/((((0[13578])|([13578])|(1[02]))[\/-]?\s*(([1-9])|(0[1-9])|([12][0-9])|(3[01])))|(((0[469])|([469])|(11))[\/-]?\s*(([1-9])|(0[1-9])|([12][0-9])|(30)))|((2|02)[\/](([1-9])|(0[1-9])|([12][0-9])))[\/-]?\s*(20[0-9]{2})|([0-9]{2}))/g
This almost works, except right now I'm not exactly sure if I'm assuming the length of the dates and months. For example, in the case 121212, I might be assuming the month is 1 instead of 12. Also, for some reason when I'm printing out $1 and $2, it is the same value. In the case of 121212, $1 is 1212, $2 is 1212 and $3 is 12. However, I just want $1 to be 121212.

Your task is ambiguous, since you may not be able to tell mmd from mdd or mdccyy from mmddyy.
You left off the option for spaces or dashes in one place where you match /.
You aren't checking for leap years.
This is doable, but it's awfully easy to make a mistake; how about not trying to do it with a regex?
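One way to read that advice, sketched under stated assumptions: split on the separators, refuse separator-less input unless it is exactly mmddyy or mmddyyyy, and check ranges and leap years in ordinary code. The helper name parse_mdy and the choice to reject ambiguous strings such as 62502 are assumptions of this sketch, not something the question asked for:

```perl
use strict;
use warnings;

# Hypothetical helper: returns "YYYY-MM-DD" for an unambiguous m/d/y string,
# or undef when the input is ambiguous or not a real calendar date.
sub parse_mdy {
    my ($str) = @_;
    my @parts = split m{[ /-]}, $str;
    if (@parts == 1) {
        # No separators: accept only mmddyy or mmddyyyy, which cannot be
        # read two ways. Something like 62502 is rejected as ambiguous.
        return undef
            unless $str =~ /\A([0-9]{2})([0-9]{2})([0-9]{2}|[0-9]{4})\z/;
        @parts = ($1, $2, $3);
    }
    return undef unless @parts == 3;
    return undef if grep { !/\A[0-9]+\z/ } @parts;

    my ($m, $d, $y) = @parts;
    return undef unless length($y) == 2 || length($y) == 4;
    $y += 2000 if length($y) == 2;    # two-digit years are 21st century

    my @days = (0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);
    $days[2] = 29 if ($y % 4 == 0 && $y % 100 != 0) || $y % 400 == 0;

    return undef unless $m >= 1 && $m <= 12;
    return undef unless $d >= 1 && $d <= $days[$m];
    return sprintf '%04d-%02d-%02d', $y, $m, $d;
}

for my $s ('121212', '4 9 12', '5-3-2000', '62502', '2/30/2004') {
    my $date = parse_mdy($s);
    printf "%-10s => %s\n", $s, defined $date ? $date : 'rejected';
}
```

The only regexes left are trivial shape checks; the calendar rules live in code where they can be read and tested one line at a time.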

The CPAN modules Time::ParseDate and DateTime are probably what you're looking for, except for the 62502 pattern:
use DateTime;
use Time::ParseDate;
foreach my $str (<DATA>) {
chomp $str;
$str =~ tr{ }{/};
my $epoch = parsedate($str, GMT => 1);
next unless $epoch; # skip 62502
my $dt = DateTime->from_epoch ( epoch => $epoch );
print $dt->ymd, "\n";
}
__DATA__
121212
4 9 12
5-3-2000
62502
3/3/11
09-08-2001
8 6 07
12 10 2004
4-16-08
3/7/2005
Once you have the DateTime object, you can extract year, month, and day information easily.

This solution handles all of the cases that you provided. But the solution isn't foolproof because the problem has ambiguities. E.g. how do we interpret the date 12502? Is it 1/25/02 or 12/5/02?
use 5.010;
while (my $line = <DATA>) {
chomp $line;
my @date = $line =~ /
\A
([01]?\d) # month is 1-2 digits, but the first digit may only be 0 or 1
[ \-\/]? # may or may not have a separator
([0123]?\d) # day is 1-2 digits
[ \-\/]?
(\d{2,4}) # year is 2-4 digits
\z
/x;
say join '_', @date;
}
__DATA__
121212
4 9 12
5-3-2000
12502
3/3/11
09-08-2001
8 6 07
12 10 2004
4-16-08
3/7/2005

This is the best I could come up with based on what info you've given. It matches all possibilities, and has error checking for month/day ranges and also the year (from 1900 to 2099)
/(1[012]|0?\d)([-\/ ]?)([12]\d|3[01]|0?\d)\2((19|20)?\d\d)/

regular expression to remove white spaces in a line and extract specific columns

Perl regular expression to remove white spaces in a line and extract specific columns
My line looks like this:
324446 psharma jobA 2 0 435529 0 0 0 435531
I can split the line with the split function and save the values in an array with the command below:
@strings = split(/\s+/);
I do not want an extra variable.
With a regular expression, I want to extract the values in columns 1, 2, 3, and 10 as $1, $2, $3 and $10.

Welcome to stackexchange.
There's no need for an extra variable:
use strict;
use warnings;
my $line = ' 324446 psharma jobA 2 0 435529 0 0 0 435531 ';
# Remove leading and trailing white space (the g flag is needed so that
# both alternatives get a chance to match)
$line =~ s/^ \s+ | \s+ $//gx;
# Split by consecutive white space and keep certain fields:
my ($col1, $col2, $col3, $col10) = (split m/\s+/, $line)[0, 1, 2, 9];
print "1: $col1 2: $col2 3: $col3 10: $col10\n";
Output:
1: 324446 2: psharma 3: jobA 10: 435531
Meaning you really don't need any extra variables, even with split. For example, if you only want to pass those fields down to another function your split line would look like this:
some_func((split m/\s+/, $line)[0, 1, 2, 9]);
Note that I'm assuming that your column number count starts at 1 and not at 0 (meaning your "column 1" is the number "324446" etc.). That's how I named the variables in my example, too.
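If a regex with captures is still preferred, a rough sketch could look like the following. Note that the tenth column ends up in the fourth capture group ($4), not $10, because the columns in between are skipped with a non-capturing group. The helper name pick_columns is an assumption for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns columns 1, 2, 3 and 10 of a
# whitespace-separated line, or an empty list if there are fewer
# than ten columns.
sub pick_columns {
    my ($line) = @_;
    return $line =~ /
        \A \s*
        (\S+) \s+ (\S+) \s+ (\S+)   # columns 1, 2, 3 -> $1, $2, $3
        (?: \s+ \S+ ){6}            # columns 4 to 9, not captured
        \s+ (\S+)                   # column 10 -> $4
    /x;
}

my ($c1, $c2, $c3, $c10) =
    pick_columns(' 324446 psharma jobA 2 0 435529 0 0 0 435531 ');
print "1: $c1 2: $c2 3: $c3 10: $c10\n";
# 1: 324446 2: psharma 3: jobA 10: 435531
```

Compared with the split version, this is longer and harder to adjust when the wanted columns change, which is the main argument for split here.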

How do I write more maintainable regular expressions?

I have started to feel that using regular expressions decreases code maintainability. There is something evil about the terseness and power of regular expressions. Perl compounds this with side effects like default operators.
I DO have a habit of documenting regular expressions with at least one sentence giving the basic intent and at least one example of what would match.
Because regular expressions are built up I feel it is an absolute necessity to comment on the largest components of each element in the expression. Despite this even my own regular expressions have me scratching my head as though I am reading Klingon.
Do you intentionally dumb down your regular expressions? Do you decompose possibly shorter and more powerful ones into simpler steps? I have given up on nesting regular expressions. Are there regular expression constructs that you avoid due to maintainability issues?
Do not let this example cloud the question.
If the following by Michael Ash had some sort of bug in it would you have any prospects of doing anything but throwing it away entirely?
^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
Per request the exact purpose can be found using Mr. Ash's link above.
Matches 01.1.02 | 11-30-2001 | 2/29/2000
Non-Matches 02/29/01 | 13/01/2002 | 11/00/02

Use Expresso, which gives a hierarchical, English breakdown of a regex.
Or
This tip from Darren Neimke:
.NET allows regular expression patterns to be authored with embedded comments via the RegexOptions.IgnorePatternWhitespace compiler option and the (?#...) syntax embedded within each line of the pattern string.
This allows for pseudo-code-like comments to be embedded in each line and has the following effect on readability:
Dim re As New Regex ( _
"(?<= (?# Start a positive lookBEHIND assertion ) " & _
"(#|@) (?# Find a # or a @ symbol ) " & _
") (?# End the lookBEHIND assertion ) " & _
"(?= (?# Start a positive lookAHEAD assertion ) " & _
" \w+ (?# Find at least one word character ) " & _
") (?# End the lookAHEAD assertion ) " & _
"\w+\b (?# Match multiple word characters leading up to a word boundary)", _
RegexOptions.Multiline Or RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace _
)
Here's another .NET example (requires the RegexOptions.Multiline and RegexOptions.IgnorePatternWhitespace options):
static string validEmail = @"\b # Find a word boundary
(?<Username> # Begin group: Username
[a-zA-Z0-9._%+-]+ # Characters allowed in username, 1 or more
) # End group: Username
@ # The e-mail '@' character
(?<Domainname> # Begin group: Domain name
[a-zA-Z0-9.-]+ # Domain name(s), we include a dot so that
# mail.somewhere is also possible
\.[a-zA-Z]{2,4} # The top level domain can only be 2 to 4 characters
# So .info works, .telephone doesn't.
) # End group: Domain name
\b # Ending on a word boundary
";
If your RegEx is applicable to a common problem, another option is to document it and submit to RegExLib, where it will be rated and commented upon. Nothing beats many pairs of eyes...
Another RegEx tool is The Regulator

I usually just try to wrap all my regular expression calls inside their own function, with a meaningful name and some basic comments. I like to think of regular expressions as a write-only language, readable only by the one who wrote them (unless they're really simple). I fully expect that someone would probably need to rewrite the expression completely if they had to change its intent, and this is probably for the better, to keep the regular expression training alive.
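As an illustration of that habit, here is a small sketch in Perl. The function name and the choice of an ISO 8601 date shape are assumptions made up for this example:

```perl
use strict;
use warnings;

# Hypothetical example: true if $text looks like an ISO 8601 calendar date
# (YYYY-MM-DD). Checks the shape only, not whether the day exists.
#   matches:     2009-04-02
#   non-matches: 2009-4-2, 02/04/2009
sub looks_like_iso_date {
    my ($text) = @_;
    return $text =~ /\A [0-9]{4} - [0-9]{2} - [0-9]{2} \z/x ? 1 : 0;
}

print looks_like_iso_date('2009-04-02') ? "ok\n" : "not ok\n";    # ok
```

Callers then read `looks_like_iso_date($input)` and never see the pattern; if the rule changes, only this one function has to be rewritten.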

Well, the entire purpose in life of the PCRE /x modifier is to allow you to write regexes more readably, as in this trivial example:
my $expr = qr/
[a-z] # match a lower-case letter
\d{3,5} # followed by 3-5 digits
/x;

Some people use REs for the wrong things (I'm waiting for the first SO question on how to detect a valid C++ program using a single RE).
I usually find that, if I can't fit my RE within 60 characters, it's better off being a piece of code since that will almost always be more readable.
In any case, I always document, in the code, what the RE is supposed to achieve, in great detail. This is because I know, from bitter experience, how hard it is for someone else (or even me, six months later) to come in and try to understand.
I don't believe they're evil, although I do believe some people who use them are evil (not looking at you, Michael Ash :-). They're a great tool but, like a chainsaw, you'll cut your legs off if you don't know how to use them properly.
UPDATE: Actually, I've just followed the link to that monstrosity, and it's to validate m/d/y format dates between the years 1600 and 9999. That is a classic case of where full-blown code would be more readable and maintainable.
You just split it up into three fields and check the individual values. I'd almost consider it an offense worthy of termination if one of my minions brought this to me. I'd certainly send them back to write it properly.

Here is the same regex broken down into digestible pieces. In addition to being more readable, some of the sub-regexes can be useful on their own. It is also significantly easier to change the allowed separators.
#!/usr/local/ActivePerl-5.10/bin/perl
use 5.010; #only 5.10 and above
use strict;
use warnings;
my $sep = qr{ [/.-] }x; #allowed separators
my $any_century = qr/ 1[6-9] | [2-9][0-9] /x; #match the century
my $any_decade = qr/ [0-9]{2} /x; #match any decade or 2 digit year
my $any_year = qr/ $any_century? $any_decade /x; #match a 2 or 4 digit year
#match the 1st through 28th for any month of any year
my $start_of_month = qr/
(?: #match
0?[1-9] | #Jan - Sep or
1[0-2] #Oct - Dec
)
($sep) #the separator
(?:
0?[1-9] | # 1st - 9th or
1[0-9] | #10th - 19th or
2[0-8] #20th - 28th
)
\g{-1} #and the separator again
/x;
#match 28th - 31st for any month but Feb for any year
my $end_of_month = qr/
(?:
(?: 0?[13578] | 1[02] ) #match Jan, Mar, May, Jul, Aug, Oct, Dec
($sep) #the separator
31 #the 31st
\g{-1} #and the separator again
| #or
(?: 0?[13-9] | 1[0-2] ) #match all months but Feb
($sep) #the separator
(?:29|30) #the 29th or the 30th
\g{-1} #and the separator again
)
/x;
#match any non-leap year date and the first part of Feb in leap years
my $non_leap_year = qr/ (?: $start_of_month | $end_of_month ) $any_year/x;
#match 29th of Feb in leap years
#BUG: 00 is treated as a non leap year
#even though 2000, 2400, etc are leap years
my $feb_in_leap = qr/
0?2 #match Feb
($sep) #the separator
29 #the 29th
\g{-1} #the separator again
(?:
$any_century? #any century
(?: #and decades divisible by 4 but not 100
0[48] |
[2468][048] |
[13579][26]
)
|
(?: #or match centuries that are divisible by 4
16 |
[2468][048] |
[3579][26]
)
00
)
/x;
my $any_date = qr/$non_leap_year|$feb_in_leap/;
my $only_date = qr/^$any_date$/;
say "test against garbage";
for my $date (qw(022900 foo 1/1/1)) {
say "\t$date ", $date =~ $only_date ? "matched" : "didn't match";
}
say '';
#comprehensive test
my @code = qw/good unmatch month day year leap/;
for my $sep (qw( / - . )) {
say "testing $sep";
my $i = 0;
for my $y ("00" .. "99", 1600 .. 9999) {
say "\t", int $i/8500*100, "% done" if $i++ and not $i % 850;
for my $m ("00" .. "09", 0 .. 13) {
for my $d ("00" .. "09", 1 .. 31) {
my $date = join $sep, $m, $d, $y;
my $re = $date =~ $only_date || 0;
my $code = not_valid($date);
unless ($re == !$code) {
die "error $date re $re code $code[$code]\n"
}
}
}
}
}
sub not_valid {
state $end = [undef, 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31];
my $date = shift;
my ($m,$d,$y) = $date =~ m{([0-9]+)[-./]([0-9]+)[-./]([0-9]+)};
return 1 unless defined $m; #if $m is set, the rest will be too
#components are in roughly the right ranges
return 2 unless $m >= 1 and $m <= 12;
return 3 unless $d >= 1 and $d <= $end->[$m];
return 4 unless ($y >= 0 and $y <= 99) or ($y >= 1600 and $y <= 9999);
#handle the non leap year case
return 5 if $m == 2 and $d == 29 and not leap_year($y);
return 0;
}
sub leap_year {
my $y = shift;
$y = "19$y" if $y < 1600;
return 1 if 0 == $y % 4 and 0 != $y % 100 or 0 == $y % 400;
return 0;
}

I have found a nice method is to simply break up the matching process into several phases. It probably does not execute as fast but you have the added bonus of also being able to tell at a finer grain level why the match is not occurring.
Another route is to use LL or LR parsing. Some languages cannot be expressed as regular expressions, probably not even with Perl's non-FSM extensions.

I have learned to avoid all but the simplest regexp. I far prefer other models such as Icon's string scanning or Haskell's parsing combinators. In both of these models you can write user-defined code that has the same privileges and status as the built-in string ops. If I were programming in Perl I would probably rig up some parsing combinators in Perl---I've done it for other languages.
A very nice alternative is to use Parsing Expression Grammars as Roberto Ierusalimschy has done with his LPEG package, but unlike parser combinators this is something you can't whip up in an afternoon. But if somebody has already done PEGs for your platform it's a very nice alternative to regular expressions.

Wow, that is ugly. It looks like it should work, modulo an unavoidable bug dealing with 00 as a two digit year (it should be a leap year one quarter of the time, but without the century you have no way of knowing what it should be). There is a lot of redundancy that should probably be factored out into sub-regexes and I would create three sub-regexes for the three main cases (that is my next project tonight). I also used a different character for the delimiter to avoid having to escape forward slashes, changed the single character alternations into character classes (which happily lets us avoid having to escape period), and changed \d to [0-9] since the former matches any digit character (including U+1815 MONGOLIAN DIGIT FIVE: ᠕) in Perl 5.8 and 5.10.
Warning, untested code:
#!/usr/bin/perl
use strict;
use warnings;
my $match_date = qr{
    #match 29th - 31st of all months but 2 for the years 1600 - 9999
    #with optionally leaving off the first two digits of the year
    ^
    (?:
        #match the 31st of 1, 3, 5, 7, 8, 10, and 12
        (?: (?: 0? [13578] | 1[02] ) ([/.-]) 31) \1
        |
        #or match the 29th and 30th of all months but 2
        (?: (?: 0? [13-9] | 1[0-2] ) ([/.-]) (?:29|30) \2)
    )
    (?:
        (?: #optionally match the century
            1[6-9] | #16 - 19
            [2-9][0-9] #20 - 99
        )?
        [0-9]{2} #match the decade
    )
    $
    |
    #or match 29 for 2 for leap years
    ^
    (?:
        #FIXME: 00 is treated as a non leap year
        #even though 2000, 2400, etc are leap years
        0?2 #month 2
        ([/.-]) #separator
        29 #29th
        \3 #separator from before
        (?: #leap years
            (?:
                #match rule 1 (div 4) minus rule 2 (div 100)
                (?: #match any century
                    1[6-9] |
                    [2-9][0-9]
                )?
                (?: #match decades divisible by 4 but not 100
                    0[48] |
                    [2468][048] |
                    [13579][26]
                )
                |
                #or match rule 3 (div 400)
                (?:
                    (?: #match centuries that are divisible by 4
                        16 |
                        [2468][048] |
                        [3579][26]
                    )
                    00
                )
            )
        )
    )
    $
    |
    #or match 1st through 28th for all months between 1600 and 9999
    ^
    (?: (?: 0?[1-9]) | (?:1[0-2] ) ) #all months
    ([/.-]) #separator; the hyphen goes last so it is not read as a range
    (?:
        0?[1-9] | #1st - 9th or
        1[0-9] | #10th - 19th or
        2[0-8] #20th - 28th
    )
    \4 #separator from before
    (?:
        (?: #optionally match the century
            1[6-9] | #16 - 19
            [2-9][0-9] #20 - 99
        )?
        [0-9]{2} #match the decade
    )
    $
}x;

Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems. (Jamie Zawinski, in comp.lang.emacs)
Keep the regular expressions as simple as they can possibly be (KISS). In your date example, I'd likely use one regular expression for each date type.
Or even better, replace it with a library (i.e. a date-parsing library).
I'd also take steps to ensure that the input source had some restrictions (i.e. only one type of date string, ideally ISO-8601).
Also,
- One thing at a time (with the possible exception of extracting values)
- Advanced constructs are OK if used correctly (as in simplifying the expression and hence reducing maintenance)
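The "one regular expression for each date type" suggestion could look roughly like the sketch below. The format names, the table layout, and the helper identify_date are all assumptions made for illustration, and the patterns check shape only, not calendar validity:

```perl
use strict;
use warnings;

# Hypothetical format table: one small, named regex per accepted layout.
my @date_formats = (
    [ iso      => qr/\A (?<year>[0-9]{4}) - (?<month>[0-9]{2}) - (?<day>[0-9]{2}) \z/x ],
    [ us_slash => qr{\A (?<month>[0-9]{1,2}) / (?<day>[0-9]{1,2}) / (?<year>[0-9]{4}) \z}x ],
    [ us_dash  => qr/\A (?<month>[0-9]{1,2}) - (?<day>[0-9]{1,2}) - (?<year>[0-9]{4}) \z/x ],
);

# Returns (format name, year, month, day) for the first layout that matches,
# or an empty list when none does.
sub identify_date {
    my ($text) = @_;
    for my $format (@date_formats) {
        my ($name, $re) = @$format;
        return ($name, $+{year}, $+{month}, $+{day}) if $text =~ $re;
    }
    return;
}

print join(' ', identify_date('3/7/2005')), "\n";      # us_slash 2005 3 7
print join(' ', identify_date('2001-09-08')), "\n";    # iso 2001 09 08
```

Each pattern stays short enough to read at a glance, and supporting a new layout means adding one row rather than editing a single large expression.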
EDIT:
"advanced constructs lead to maintenance issues"
My original point was that if used correctly it should lead to simpler expressions, not more difficult ones. Simpler expressions should reduce maintenance.
I've updated the text above to say as much.
I would point out that regular expressions hardly qualify as advanced constructs in and of themselves. Not being familiar with a certain construct does not make it an advanced construct, merely an unfamiliar one. Which does not change the fact that regular expressions are powerful, compact and- if used properly- elegant. Much like a scalpel, it lies entirely in the hands of the one who wields it.

I think the answer to maintaining regular expression is not so much with commenting or regex constructs.
If I were tasked with debugging the example you gave, I would sit down in front of a regex debugging tool (like Regex Coach) and step through the regular expression on the data that it has to process.

I could still work with it. I'd just use Regulator. One thing it allows you to do is save the regex along with test data for it.
Of course, I might also add comments.
Here's what Expresso produced. I had never used it before, but now, Regulator is out of a job:
// using System.Text.RegularExpressions;
///
/// Regular expression built for C# on: Thu, Apr 2, 2009, 12:51:56 AM
/// Using Expresso Version: 3.0.3276, http://www.ultrapico.com
///
/// A description of the regular expression:
///
/// Select from 3 alternatives
/// ^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$
/// Beginning of line or string
/// Match expression but don't capture it. [(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2)]
/// Select from 2 alternatives
/// (?:(?:0?[13578]|1[02])(\/|-|\.)31)\1
/// Match expression but don't capture it. [(?:0?[13578]|1[02])(\/|-|\.)31]
/// (?:0?[13578]|1[02])(\/|-|\.)31
/// Match expression but don't capture it. [0?[13578]|1[02]]
/// Select from 2 alternatives
/// 0?[13578]
/// 0, zero or one repetitions
/// Any character in this class: [13578]
/// 1[02]
/// 1
/// Any character in this class: [02]
/// [1]: A numbered capture group. [\/|-|\.]
/// Select from 3 alternatives
/// Literal /
/// -
/// Literal .
/// 31
/// Backreference to capture number: 1
/// (?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2)
/// Return
/// New line
/// Match expression but don't capture it. [(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2]
/// (?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2
/// Match expression but don't capture it. [0?[13-9]|1[0-2]]
/// Select from 2 alternatives
/// 0?[13-9]
/// 0, zero or one repetitions
/// Any character in this class: [13-9]
/// 1[0-2]
/// 1
/// Any character in this class: [0-2]
/// [2]: A numbered capture group. [\/|-|\.]
/// Select from 3 alternatives
/// Literal /
/// -
/// Literal .
/// Match expression but don't capture it. [29|30]
/// Select from 2 alternatives
/// 29
/// 29
/// 30
/// 30
/// Backreference to capture number: 2
/// Return
/// New line
/// Match expression but don't capture it. [(?:1[6-9]|[2-9]\d)?\d{2}]
/// (?:1[6-9]|[2-9]\d)?\d{2}
/// Match expression but don't capture it. [1[6-9]|[2-9]\d], zero or one repetitions
/// Select from 2 alternatives
/// 1[6-9]
/// 1
/// Any character in this class: [6-9]
/// [2-9]\d
/// Any character in this class: [2-9]
/// Any digit
/// Any digit, exactly 2 repetitions
/// End of line or string
/// ^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$
/// Beginning of line or string
/// Match expression but don't capture it. [0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00)))]
/// 0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00)))
/// 0, zero or one repetitions
/// 2
/// [3]: A numbered capture group. [\/|-|\.]
/// Select from 3 alternatives
/// Literal /
/// -
/// Literal .
/// 29
/// Backreference to capture number: 3
/// Match expression but don't capture it. [(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))]
/// Match expression but don't capture it. [(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00)]
/// Select from 2 alternatives
/// (?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])
/// Match expression but don't capture it. [1[6-9]|[2-9]\d], zero or one repetitions
/// Select from 2 alternatives
/// 1[6-9]
/// 1
/// Any character in this class: [6-9]
/// [2-9]\d
/// Any character in this class: [2-9]
/// Any digit
/// Match expression but don't capture it. [0[48]|[2468][048]|[13579][26]]
/// Select from 3 alternatives
/// 0[48]
/// 0
/// Any character in this class: [48]
/// [2468][048]
/// Any character in this class: [2468]
/// Any character in this class: [048]
/// [13579][26]
/// Any character in this class: [13579]
/// Any character in this class: [26]
/// (?:(?:16|[2468][048]|[3579][26])00)
/// Return
/// New line
/// Match expression but don't capture it. [(?:16|[2468][048]|[3579][26])00]
/// (?:16|[2468][048]|[3579][26])00
/// Match expression but don't capture it. [16|[2468][048]|[3579][26]]
/// Select from 3 alternatives
/// 16
/// 16
/// [2468][048]
/// Any character in this class: [2468]
/// Any character in this class: [048]
/// [3579][26]
/// Any character in this class: [3579]
/// Any character in this class: [26]
/// 00
/// End of line or string
/// ^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
/// Beginning of line or string
/// Match expression but don't capture it. [(?:0?[1-9])|(?:1[0-2])]
/// Select from 2 alternatives
/// Match expression but don't capture it. [0?[1-9]]
/// 0?[1-9]
/// 0, zero or one repetitions
/// Any character in this class: [1-9]
/// Match expression but don't capture it. [1[0-2]]
/// 1[0-2]
/// 1
/// Any character in this class: [0-2]
/// Return
/// New line
/// [4]: A numbered capture group. [\/|-|\.]
/// Select from 3 alternatives
/// Literal /
/// -
/// Literal .
/// Match expression but don't capture it. [0?[1-9]|1\d|2[0-8]]
/// Select from 3 alternatives
/// 0?[1-9]
/// 0, zero or one repetitions
/// Any character in this class: [1-9]
/// 1\d
/// 1
/// Any digit
/// 2[0-8]
/// 2
/// Any character in this class: [0-8]
/// Backreference to capture number: 4
/// Match expression but don't capture it. [(?:1[6-9]|[2-9]\d)?\d{2}]
/// (?:1[6-9]|[2-9]\d)?\d{2}
/// Match expression but don't capture it. [1[6-9]|[2-9]\d], zero or one repetitions
/// Select from 2 alternatives
/// 1[6-9]
/// 1
/// Any character in this class: [6-9]
/// [2-9]\d
/// Any character in this class: [2-9]
/// Any digit
/// Any digit, exactly 2 repetitions
/// End of line or string
///
///
///
public static Regex regex = new Regex(
"^(?:(?:(?:0?[13578]|1[02])(\\/|-|\\.)31)\\1|\r\n(?:(?:0?[13-9]"+
"|1[0-2])(\\/|-|\\.)(?:29|30)\\2))\r\n(?:(?:1[6-9]|[2-9]\\d)?\\d"+
"{2})$|^(?:0?2(\\/|-|\\.)29\\3(?:(?:(?:1[6-9]|[2-9]\\d)?(?:0["+
"48]|[2468][048]|[13579][26])|\r\n(?:(?:16|[2468][048]|[3579][2"+
"6])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))\r\n(\\/|-|\\.)(?:0?[1-9"+
"]|1\\d|2[0-8])\\4(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$",
RegexOptions.CultureInvariant
| RegexOptions.Compiled
);

I posted a question recently about commenting regexes with embedded comments. There were useful answers, particularly one from @mikej:
See the post by Martin Fowler on ComposedRegex for some more ideas on improving regexp readability. In summary, he advocates breaking down a complex regexp into smaller parts which can be given meaningful variable names. e.g.

I do not expect regular expressions to be readable, so I just leave them as they are, and rewrite if needed.
