With this example, I am trying to match only the operators in C-like code so that I can add a space before and after each one.
For instance, a+b should become a + b, but a + b, which is already spaced, should be left unchanged. The other tricky part is that I shouldn't add a space for negative numbers like -2. Of course, I also need to handle exceptions such as text inside comments, like // work-around.
Here's the regex that I'm working on:
(?!= |=|&|\||%)(\+|-)(?! |=|\1)
Unfortunately the negative-lookaround doesn't work as expected. How can I fix it?
This is just to help your regex a little.
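The first (negative) lookahead should be a (negative) lookbehind: (?!= ...) is read as a lookahead whose body starts with a literal =, whereas a lookbehind is written (?<!...).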
I don't think this is going to help with parsing math symbols though.
http://regex101.com/r/lX3aF6/1
# (?<!=[ ])(?<![=&|%])(\+|-)(?!\1|[ =])
(?<! = [ ] )
(?<! [=&|%] )
( \+ | - )
(?! \1 | [ =] )
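A minimal sketch of how the corrected pattern could be applied, assuming Python's re module (the question names no language). The helper name space_operators is made up for illustration. It only covers the + and - cases in the pattern, does nothing about comments, and still mishandles doubled operators such as ++ and a leading -2 that does not follow an =.

```python
import re

# Pattern from the answer above: two negative lookbehinds, the operator,
# then a negative lookahead.
OPERATOR = re.compile(r'(?<!=[ ])(?<![=&|%])(\+|-)(?!\1|[ =])')

def space_operators(code):
    """Surround each matched + or - with single spaces (hypothetical helper)."""
    return OPERATOR.sub(r' \1 ', code)

if __name__ == "__main__":
    for sample in ["a+b", "a + b", "x=-2", "a+=b"]:
        print(repr(sample), "->", repr(space_operators(sample)))
```

Already-spaced operators are left alone because the lookahead rejects a following space, and the lookbehinds keep the minus in x=-2 and x = -2 untouched.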
I am making a compiler and need to match one or two of two different patterns, e.g. +, =, += or else, if, else if.
So far I can do:
/\b(else( if)?|(else )?if)\b/
The regex above works, but the patterns if and else are each mentioned twice.
Is there a better way that doesn't require making a copy of each of the words?
A general way to factor everything you describe is this:
(?:\+=?|=|else(?:[ ]if)?|if)
No other boundary conditions are asserted around this,
and exactly one space is assumed between else and if.
(?:
\+ =?
| =
| else
(?: [ ] if )?
| if
)
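To check the factored alternation against the tokens from the question, here is a small sketch assuming Python's re module; is_token is an illustrative name, not something from the thread. It uses fullmatch so that the whole string has to be one token.

```python
import re

# Factored alternation from the answer; each keyword is spelled only once.
TOKEN = re.compile(r'(?:\+=?|=|else(?:[ ]if)?|if)')

def is_token(text):
    """True when the whole string is one of: +  =  +=  else  if  'else if'."""
    return TOKEN.fullmatch(text) is not None

if __name__ == "__main__":
    for candidate in ["+", "=", "+=", "else", "if", "else if", "elif"]:
        print(repr(candidate), is_token(candidate))
```

In a real tokenizer you would probably still want word boundaries around the keywords, since none are asserted here.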
Given the following string
2010-01-01XD2010-01-02XX2010-01-03NX2010-01-04XD2010-01-05DN
I am trying to find all instances of a date followed by one or two characters, i.e. 2010-01-01XD, but not where the characters are XX.
I have tried
(2010-01-02[^X]{2})|(2010-01-08[^X]{2})|(2010-01-07[^X]{2})|(2010-01-05[^X]{2})|(2010-01-15[^X]{2})
This works only when neither character is X. I have also tried
(2010-01-02[^X]{1,2})|(2010-01-08[^X]{1,2})|(2010-01-07[^X]{1,2})|(2010-01-05[^X]{1,2})|(2010-01-15[^X]{1,2})
This works for DX but not XD.
To be a little clearer:
2010-01-01XD
2010-01-01DX
2010-01-01ND
All above should be picked up
2010-01-01XX
And this ignored
You can use this regex based on negative lookahead:
(20[0-9]{2}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12][0-9]|3[01])(?!XX)[A-Z]{2})
Easiest way is to use a lookahead assertion (if available).
# (2010-01-01|2010-01-02|2010-01-08|2010-01-07|2010-01-05|2010-01-15)(?!XX)(?i:([a-z]){1,2})
( # (1 start), One of these dates
2010-01-01
| 2010-01-02
| 2010-01-08
| 2010-01-07
| 2010-01-05
| 2010-01-15
) # (1 end)
(?! XX ) # Look ahead assertion, cannot match XX here
(?i: # 1 or 2 of any U/L case letter
( [a-z] ){1,2} # (2)
)
You could likely use a simple pattern with a negative lookahead such as this:
\d{4}-\d{2}-\d{2}(?!XX)[A-Z]{1,2}
example: http://regex101.com/r/dI1nW4/2
To allow Unicode characters (with the exception of XX) you could use:
\d{4}-\d{2}-\d{2}(?!XX)\D{1,2}
example: http://regex101.com/r/yB5fI0/1
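As a quick check of the lookahead approach on the string from the question, here is a sketch assuming Python's re module; find_dated_codes is a made-up helper name.

```python
import re

# Date, then a negative lookahead rejecting XX, then one or two capital letters.
DATED = re.compile(r'\d{4}-\d{2}-\d{2}(?!XX)[A-Z]{1,2}')

SAMPLE = '2010-01-01XD2010-01-02XX2010-01-03NX2010-01-04XD2010-01-05DN'

def find_dated_codes(text):
    """Return every date+code match whose code is not XX."""
    return DATED.findall(text)

if __name__ == "__main__":
    print(find_dated_codes(SAMPLE))
```

The lookahead is checked right after the date, so XD, DX and NX pass while XX is rejected before any letter is consumed.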
20[0-9]{2}-[01][0-9]-[0-3][0-9]([A-Z][A-WYZ]|[A-WYZ][A-Z])
A negative look ahead is the easiest way to assert the letters not being XX, but there are some simplifications you can make to the alternation by recognising the parts of the date shared by all dates you're trying to match, making this shorter regex:
2010-01-(02|08|07|05|15)(?!XX)[A-Z]{1,2}
I'm trying to select commas that are not preceded by a 4-digit number or the word "id". I tried this:
( ? < ! [ \ d { 5 } | id ] ) ,
The problem:
For example, if the input string is "1999," that comma is not selected, and I don't understand why.
Try this pattern:
(?<!\d{5}|id),
Your pattern, (?<![\d{5}|id]), looks for a comma that is not preceded by a digit, {, }, |, i, or d. Those tokens should not be inside a character class ([]). (?<![\d]{5}|id), would also work, but the brackets are redundant.
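To see the character-class behaviour for yourself, here is a small sketch assuming Python's re module; in_class is just an illustrative helper name.

```python
import re

# Inside [...] every item is a single-character choice, so this class means
# "a digit, or one of { } | i d" (the literal 5 is already covered by \d).
CLASS = re.compile(r'[\d{5}|id]')

def in_class(ch):
    """True when the whole string is exactly one character from the class."""
    return CLASS.fullmatch(ch) is not None

if __name__ == "__main__":
    for ch in ["7", "{", "|", "i", "id", ","]:
        print(repr(ch), in_class(ch))
```

The two-character string "id" is not matched as a unit, which is why the original lookbehind blocks the comma in "1999,": the single character before it is a digit.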
First of all, unless you're using the /x flag, each space will attempt to match a space. So take those out.
Second, you're using [...] presumably to group an alternation (|), but square brackets actually indicate a character class, i.e. [\d{5}|id] is equivalent to [\d{}|id] and matches any one of those characters, but no more. What you mean is this:
(?<!\d{5}|id),
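The final problem might be that many regex implementations (you haven't specified which you're using) don't support variable-width lookbehind assertions. In that case you can pad the shorter alternative with three wildcard characters so that both branches are five characters wide, at the cost of requiring three characters before id: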
(?<!\d{5}|...id),
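Another way around the fixed-width restriction is to write two lookbehinds back to back, so that each one has its own fixed width. This is a sketch assuming Python's re module, which as far as I know rejects lookbehind alternatives of different widths; comma_positions is a made-up helper name.

```python
import re

# Two separate negative lookbehinds instead of one alternation:
# the comma must not follow five digits, and must not follow "id".
COMMA = re.compile(r'(?<!\d{5})(?<!id),')

def comma_positions(text):
    """Return the index of every comma that passes both lookbehinds."""
    return [m.start() for m in COMMA.finditer(text)]

if __name__ == "__main__":
    for sample in ["1999,", "12345,", "id,", "a,b"]:
        print(repr(sample), comma_positions(sample))
```

Unlike the padded version, this does not need any extra characters in front of id.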
If I have some HTML like this:
<b>1<i>2</i>3</b>
And the following regex:
\<[^\>\/]+\>(.*?)\<\/[^\>]+\>
Then it will match:
<b>1<i>2</i>
I want it to only match HTML where the start and end tags are the same. Is there a way to do this?
Thanks,
Joe
Is there a way to do this?
Yes, certainly. Ignore those flippant non-answers that tell you it can’t be done. It most certainly can. You just may not wish to do so, as I explain below.
Numbered Captures
Pretending for the nonce that HTML <i> and <b> tags are always devoid of attributes, and moreover, neither overlap nor nest, we have this simple solution:
#!/usr/bin/env perl
#
# solution A: numbered captures
#
use v5.10;
while (<>) {
say "$1: $2" while m{
< ( [ib] ) >
(
(?:
(?! < /? \1 > ) .
) *
)
</ \1 >
}gsix;
}
Which when run, produces this:
$ echo 'i got <i>foo</i> and <b>bar</b> bits go here' | perl solution-A
i: foo
b: bar
Named Captures
It would be better to use named captures, which leads to this equivalent solution:
#!/usr/bin/env perl
#
# Solution B: named captures
#
use v5.10;
while (<>) {
say "$+{name}: $+{contents}" while m{
< (?<name> [ib] ) >
(?<contents>
(?:
(?! < /? \k<name> > ) .
) *
)
</ \k<name> >
}gsix;
}
Recursive Captures
Of course, it is not reasonable to assume that such tags neither overlap nor nest. Since this is recursive data, it therefore requires a recursive pattern to solve. Remembering that the trivial pattern to parse nested parens recursively is simply:
( \( (?: [^()]++ | (?-1) )*+ \) )
I’ll build that sort of recursive matching into the previous solution, and I’ll further toss in a bit of iterative processing to unwrap the inner bits, too.
#!/usr/bin/perl
use v5.10;
# Solution C: recursive captures, plus bonus iteration
while (my $line = <>) {
my @input = ( $line );
while (@input) {
my $cur = shift @input;
while ($cur =~ m{
< (?<name> [ib] ) >
(?<contents>
(?:
[^<]++
| (?0)
| (?! </ \k<name> > )
.
) *+
)
</ \k<name> >
}gsix)
{
say "$+{name}: $+{contents}";
push @input, $+{contents};
}
}
}
Which when demo’d produces this:
$ echo 'i got <i>foo <i>nested</i> and <b>bar</b> bits</i> go here' | perl Solution-C
i: foo <i>nested</i> and <b>bar</b> bits
i: nested
b: bar
That’s still fairly simple, so if it works on your data, go for it.
Grammatical Patterns
However, it doesn’t actually know about proper HTML syntax, which admits tag attributes to things like <i> and <b>.
As explained in this answer, one can certainly use regexes to parse markup languages, provided one is careful about it.
For example, this knows the attributes germane to the <i> (or <b>) tag. Here we defined regex subroutines used to build up a grammatical regex. These are definitions only, just like defining regular subs but now for regexes:
(?(DEFINE) # begin regex subroutine defs for grammatical regex
(?<i_tag_end> < / i > )
(?<i_tag_start> < i (?&attributes) > )
(?<attributes> (?: \s* (?&one_attribute) ) *)
(?<one_attribute>
\b
(?&legal_attribute)
\s* = \s*
(?:
(?&quoted_value)
| (?&unquoted_value)
)
)
(?<legal_attribute>
(?&standard_attribute)
| (?&event_attribute)
)
(?<standard_attribute>
class
| dir
| ltr
| id
| lang
| style
| title
| xml:lang
)
# NB: The white space in string literals
# below DOES NOT COUNT! It's just
# there for legibility.
(?<event_attribute>
on click
| on dbl click
| on mouse down
| on mouse move
| on mouse out
| on mouse over
| on mouse up
| on key down
| on key press
| on key up
)
(?<nv_pair> (?&name) (?&equals) (?&value) )
(?<name> \b (?= \pL ) [\w\-] + (?<= \pL ) \b )
(?<equals> (?&might_white) = (?&might_white) )
(?<value> (?&quoted_value) | (?&unquoted_value) )
(?<unwhite_chunk> (?: (?! > ) \S ) + )
(?<unquoted_value> [\w\-] * )
(?<might_white> \s * )
(?<quoted_value>
(?<quote> ["'] )
(?: (?! \k<quote> ) . ) *
\k<quote>
)
(?<start_tag> < (?&might_white) )
(?<end_tag>
(?&might_white)
(?: (?&html_end_tag)
| (?&xhtml_end_tag)
)
)
(?<html_end_tag> > )
(?<xhtml_end_tag> / > )
)
Once you have the pieces of your grammar assembled, you could incorporate those definitions into the recursive solution already given to do a much better job.
However, there are still things that haven’t been considered, and which in the more general case must be. Those are demonstrated in the longer solution already provided.
SUMMARY
I can think of only three possible reasons why you might not care to use regexes for parsing general HTML:
You are using an impoverished regex language, not a modern one, and so you have no recourse to essential modern conveniences like recursive matching or grammatical patterns.
You might find such concepts as recursive and grammatical patterns too complicated to easily understand.
You prefer for someone else to do all the heavy lifting for you, including the heavy testing, and so you would rather use a separate HTML parsing module instead of rolling your own.
Any one or more of those might well apply. In which case, don’t do it this way.
For simple canned examples, this route is easy. The more robust you want this to work on things you’ve never seen before, the harder this route becomes.
Certainly you can’t do any of it if you are using the inferior, impoverished pattern matching bolted onto the side of languages like Python or even worse, Javascript. Those are barely any better than the Unix grep program, and in some ways, are even worse. No, you need a modern pattern matching engine such as found in Perl or PHP to even start down this road.
But honestly, it’s probably easier just to get somebody else to do it for you, by which I mean that you should probably use an already-written parsing module.
Still, understanding why not to bother with these regex-based approaches (at least, not more than once) requires that you first correctly implement proper HTML parsing using regexes. You need to understand what it is all about. Therefore, little exercises like this are useful for improving your overall understanding of the problem-space, and of modern pattern matching in general.
This forum isn’t really in the right format for explaining all these things about modern pattern-matching. There are books, though, that do so equitably well.
You probably don't want to use regular expressions with HTML.
But if you still want to do this you need to take a look at backreferences.
Basically it's a way to capture a group (such as "b" or "i") to use it later in the same regular expression.
Related issues:
RegEx match open tags except XHTML self-contained tags
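To make the backreference idea concrete, here is a minimal sketch assuming Python's re module. Like the simplest Perl solution above, it pretends that <i> and <b> carry no attributes and that a tag never nests inside itself; paired_tags is an illustrative name.

```python
import re

# \1 refers back to whatever the first group (the tag name) captured,
# so the closing tag must use the same name as the opening tag.
PAIRED = re.compile(
    r'<([ib])>((?:(?!</?\1>).)*)</\1>',
    re.IGNORECASE | re.DOTALL,
)

def paired_tags(html):
    """Return (tag name, contents) for each matching open/close pair."""
    return [(m.group(1), m.group(2)) for m in PAIRED.finditer(html)]

if __name__ == "__main__":
    print(paired_tags('<b>1<i>2</i>3</b>'))
    print(paired_tags('i got <i>foo</i> and <b>bar</b> bits go here'))
```

On the HTML from the question this pairs the outer <b> with its own </b> instead of stopping at the first closing tag.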
If I have a lot of matches, for example in multi line mode, and I want to replace them with part of the match as well as a counter number that increments.
I was wondering if any regex flavor has such a variable. I couldn't find one, but I seem to remember something like that exists...
I'm not talking about scripting languages in which you can use callbacks for replacement. It's about being able to do this in tools like RegexBuddy, sublime text, gskinner.com/RegExr, ... much in the same way you can refer to captured substrings with \1 or $1.
FMTEYEWTK about Fancy Regexes
Ok, I’m going to go from the simple to the sublime. Enjoy!
Simple s///e Solution
Given this:
#!/usr/bin/perl
$_ = <<"End_of_G&S";
This particularly rapid,
unintelligible patter
isn't generally heard,
and if it is it doesn't matter!
End_of_G&S
my $count = 0;
Then this:
s{
\b ( [\w']+ ) \b
}{
sprintf "(%s)[%d]", $1, ++$count;
}gsex;
produces this
(This)[1] (particularly)[2] (rapid)[3],
(unintelligible)[4] (patter)[5]
(isn't)[6] (generally)[7] (heard)[8],
(and)[9] (if)[10] (it)[11] (is)[12] (it)[13] (doesn't)[14] (matter)[15]!
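For readers outside Perl, roughly the same effect can be sketched with a replacement function, here assuming Python's re.sub. This is still a callback in a scripting language, not a counter variable built into the regex flavor, and number_words is a made-up name.

```python
import itertools
import re

WORD = re.compile(r"\b([\w']+)\b")

def number_words(text):
    """Wrap each word as (word)[n], with n counting up from 1."""
    counter = itertools.count(1)  # fresh counter for every call

    def replace(match):
        return "(%s)[%d]" % (match.group(1), next(counter))

    return WORD.sub(replace, text)

if __name__ == "__main__":
    print(number_words("and if it is it doesn't matter!"))
```

The counter lives in the enclosing function, so repeated calls start again from 1.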
Interpolated Code in Anon Array Solution
Whereas this:
s/\b([\w']+)\b/#@{[++$count]}=$1/g;
produces this:
#1=This #2=particularly #3=rapid,
#4=unintelligible #5=patter
#6=isn't #7=generally #8=heard,
#9=and #10=if #11=it #12=is #13=it #14=doesn't #15=matter!
Solution with code in LHS instead of RHS
This puts the incrementation within the match itself:
s/ \b ( [\w']+ ) \b (?{ $count++ }) /#$count=$1/gx;
yields this:
#1=This #2=particularly #3=rapid,
#4=unintelligible #5=patter
#6=isn't #7=generally #8=heard,
#9=and #10=if #11=it #12=is #13=it #14=doesn't #15=matter!
A Stuttering Stuttering Solution Solution Solution
This
s{ \b ( [\w'] + ) \b }
{ join " " => ($1) x ++$count }gsex;
generates this delightful answer:
This particularly particularly rapid rapid rapid,
unintelligible unintelligible unintelligible unintelligible patter patter patter patter patter
isn't isn't isn't isn't isn't isn't generally generally generally generally generally generally generally heard heard heard heard heard heard heard heard,
and and and and and and and and and if if if if if if if if if if it it it it it it it it it it it is is is is is is is is is is is is it it it it it it it it it it it it it doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't matter matter matter matter matter matter matter matter matter matter matter matter matter matter matter!
Exploring Boundaries
There are more robust approaches to word boundaries that work for plural possessives (the previous approaches don’t), but I suspect your mystery lies in getting the ++$count to fire, not with the subtleties of \b behavior.
I really wish people understood that \b isn’t what they think it is.
They always think it means there's white space or the edge of the string
there. They never think of it as \w\W or \W\w transitions.
# same as using a \b before:
(?(?=\w) (?<!\w) | (?<!\W) )
# same as using a \b after:
(?(?<=\w) (?!\w) | (?!\W) )
As you see, it's conditional depending on what it's touching. That’s what the (?(COND)THEN|ELSE) clause is for.
This becomes an issue with things like:
$_ = qq('Tis Paul's parents' summer-house, isn't it?\n);
my $count = 0;
s{
(?(?=[\-\w']) (?<![\-\w']) | (?<![^\-\w']) )
( [\-\w'] + )
(?(?<=[\-\w']) (?![\-\w']) | (?![^\-\w']) )
}{
sprintf "(%s)[%d]", $1, ++$count
}gsex;
print;
which correctly prints
('Tis)[1] (Paul's)[2] (parents')[3] (summer-house)[4], (isn't)[5] (it)[6]?
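The point that \b is a \w/\W transition rather than a whitespace test can be checked in other engines too. The sketch below assumes Python's re module, which as far as I know has no lookaround conditionals, so the two cases are written as a plain alternation of lookarounds instead of (?(COND)THEN|ELSE). The names are illustrative.

```python
import re

WORD_BOUNDARY = re.compile(r'\b')

# The same test spelled out: a non-word side followed by a word character,
# or a word character followed by a non-word side.
EXPANDED = re.compile(r'(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))')

def positions(pattern, text):
    """Return every offset at which the zero-width pattern matches."""
    return [m.start() for m in pattern.finditer(text)]

if __name__ == "__main__":
    text = "Paul's"
    print(positions(WORD_BOUNDARY, text))
    print(positions(EXPANDED, text))
```

Both patterns report a boundary on each side of the apostrophe in Paul's, which is exactly why a bare \b splits possessives and contractions.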
Worrying about Unicode
1960s-style ASCII is about 50 years out of date. Just as whenever you see anyone write [a-z], it’s nearly always wrong, it turns out that things like dashes and quotation marks shouldn’t show up as literals in patterns, either. While we’re at it, you probably don’t want to use \w, because that includes numbers and underscores as well, not just alphabetics.
Imagine this string:
$_ = qq(\x{2019}Tis Ren\x{E9}e\x{2019}s great\x{2010}grandparents\x{2019} summer\x{2010}house, isn\x{2019}t it?\n);
which you could have as a literal with use utf8:
use utf8;
$_ = qq(’Tis Renée’s great‐grandparents’ summer‐house, isn’t it?\n);
This time I’ll go at the pattern a bit differently, separating out my definition of terms from their execution to try to make it more readable and thence maintainable:
#!/usr/bin/perl -l
use 5.10.0;
use utf8;
use open qw< :std :utf8 >;
use strict;
use warnings qw< FATAL all >;
use autodie;
$_ = q(’Tis Renée’s great‐grandparents’ summer‐house, isn’t it?);
my $count = 0;
s{ (?<WORD> (?&full_word) )
# the rest is just definition
(?(DEFINE)
(?<word_char> [\p{Alphabetic}\p{Quotation_Mark}] )
(?<full_word>
# next line won't compile cause
# fears variable-width lookbehind
#### (?<! (?&word_char) ) )
# so must inline it
(?<! [\p{Alphabetic}\p{Quotation_Mark}] )
(?&word_char)
(?:
\p{Dash}
| (?&word_char)
) *
(?! (?&word_char) )
)
) # end DEFINE declaration block
}{
sprintf "(%s)[%d]", $+{WORD}, ++$count;
}gsex;
print;
That code when run produces this:
(’Tis)[1] (Renée’s)[2] (great‐grandparents’)[3] (summer‐house)[4], (isn’t)[5] (it)[6]?
Ok, so that may have been FMTEYEWTK about fancy regexes, but aren’t you glad you asked? ☺
In plain regular expressions there isn't, as far as I know.
On the other hand, there are several tools which offer it as an extension, for example grepWin. In the tool's help (press F1):
Internally it uses Boost's Perl Regular Expression engine but the ${count} is implemented within (as with other extensions).