Java Regular Expression Two Question marks (??) - regex

I know that /? means the / is optional, so "toys?" will match both toy and toys. My understanding is that if I make it lazy and use "toys??", I will match both toy and toys and always return toy. So, a quick test:
package regextest;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegExTest {
    private final static Pattern TEST_PATTERN = Pattern.compile("toys??", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        for (String arg : args) {
            Matcher m = TEST_PATTERN.matcher(arg);
            System.out.print("Arg: " + arg);
            // Print every match: group 0 (the whole match) followed by any capture groups.
            while (m.find()) {
                System.out.print(" {");
                for (int i = 0; i <= m.groupCount(); ++i) {
                    System.out.print("[" + m.group(i) + "]");
                }
                System.out.print("}");
            }
            System.out.println();
        }
    }
}
Yep, it looks like it works as expected
java -cp .. regextest.RegExTest toy toys
Arg: toy {[toy]}
Arg: toys {[toy]}
Now, change the regular expression to "toys??2" and it still matches toys2 and toy2. In both cases, it returns the entire string without the s removed. Is there any functional difference between searching for "toys?2" and "toys??2"?
The reason I am asking is because I found an example like the following:
private final static Pattern TEST_PATTERN = Pattern.compile("</??tag(\\s+?.*?)??>", Pattern.CASE_INSENSITIVE);
and although I see no apparent reason for using ?? rather than ?, I thought that perhaps the original author (who is not known to me) might know something that I don't. I expect the latter.

?? is lazy while ? is greedy.
Given (pattern)??, it will first test for empty string, then if the rest of the pattern can't match, it will test for pattern.
In contrast, (pattern)? will test for pattern first, then it will test for empty string on backtrack.
Now, change the regular expression to "toys??2" and it still matches toys2 and toy2. In both cases, it returns the entire string without the s removed. Is there any functional difference between searching for "toys?2" and "toys??2"?
The difference is in the order of searching:
"toys?2" searches for toys2, then toy2
"toys??2" searches for toy2, then toys2
But for these 2 patterns, the result will be the same regardless of the input string, since the 2 that follows (after s? or s??) must be matched.
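To make the search order concrete, here is a small sketch in Python, whose re module treats ? and ?? the same way as Java for these patterns. The variable names are just for illustration.

```python
import re

# Greedy s? tries to take the "s" first; lazy s?? tries the empty string first.
greedy = re.search(r"toys?", "toys").group()
lazy = re.search(r"toys??", "toys").group()

# With a mandatory 2 after the optional s, both variants have to
# consume the s to reach the 2, so the final matches coincide.
inputs = ("toy2", "toys2")
greedy_2 = [re.search(r"toys?2", s).group() for s in inputs]
lazy_2 = [re.search(r"toys??2", s).group() for s in inputs]

print(greedy, lazy)        # toys toy
print(greedy_2 == lazy_2)  # True
```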
As for the pattern you found:
Pattern.compile("</??tag(\\s+?.*?)??>", Pattern.CASE_INSENSITIVE)
Both ?? can be changed to ? without affecting the result:
/ and t (in tag) are mutually exclusive. You either match one or the other.
> and \s are also mutually exclusive. The at least 1 in \s+? is important to this conclusion: the result might be different otherwise.
This is probably micro-optimization from the author. He probably thinks that the open tag must be there, while the closing tag might be forgotten, and that open/close tags without attributes/random spaces appear more often than those with some.
By the way, the engine might run into some expensive backtracking attempt due to \\s+?.*? when the input has <tag followed by lots of spaces without > anywhere near.
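If you want to convince yourself that swapping ?? for ? changes nothing here, a quick spot check along these lines may help. It is written in Python rather than Java, the sample inputs are made up, and a handful of samples is of course not a proof.

```python
import re

lazy = re.compile(r"</??tag(\s+?.*?)??>", re.IGNORECASE)
greedy = re.compile(r"</?tag(\s+?.*?)?>", re.IGNORECASE)

# Hypothetical sample inputs covering open/close tags, attributes and a non-match.
samples = ["<tag>", "</tag>", '<tag a="1">', "</TAG  >", "<tag  x y> trailing", "<tagx>"]

def describe(pattern, text):
    """Return the match span and captured groups, or None when nothing matches."""
    m = pattern.search(text)
    return None if m is None else (m.span(), m.groups())

same = all(describe(lazy, s) == describe(greedy, s) for s in samples)
print(same)  # True
```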

Error while compiling regex function, why am I getting this issue?

My RAKU Code:
sub comments {
    if ($DEBUG) { say "<filtering comments>\n"; }
    my @filteredtitles = ();
    # This loops through each track
    for @tracks -> $title {
        ##########################
        # LAB 1 TASK 2 #
        ##########################
        ## Add regex substitutions to remove superfluous comments and all that follows them
        ## Assign to $_ with smartmatcher (~~)
        ##########################
        $_ = $title;
        if ($_) ~~ s:g:mrx/ .*<?[\(^.*]> / {
            # Repeat for the other symbols
            ########################## End Task 2
            # Add the edited $title to the new array of titles
            @filteredtitles.push: $_;
        }
    }
    # Updates @tracks
    return @filteredtitles;
}
Result when compiling:
Error Compiling! Placeholder variable '@_' may not be used here because the surrounding block doesn't take a signature.
Is there something obvious that I am missing? Any help is appreciated.
So, in contrast with @raiph's answer, here's what I have:
my @tracks = <Foo Ba(r B^az>.map: { S:g / <[\(^]> // };
Just that. Nothing else. Let's dissect it, from the inside out:
This part: / <[\(^]> / is a regular expression that will match one character, as long as it is an open parenthesis (represented by the \() or a caret (^). Placing them inside the angle bracket/square bracket combination makes it an enumerated character class.
Then, the S introduces the non-destructive substitution, i.e., a quoting construct that makes regex-based substitutions over the topic variable $_ but does not modify it; it just returns its value with the requested modifications. In the code above, S:g brings the adverb :g or :global (see the global adverb in the adverbs section of the documentation) into play, meaning (in the case of the substitution) "please make as many of these substitutions as possible". The final / marks the end of the substitution text, and as it is adjacent to the second /, that means that
S:g / <[\(^]> //
means "please return the contents of $_, but modified in such a way that all its characters matching the regex <[\(^]> are deleted (substituted for the empty string)"
At this point, I should emphasize that regular expressions in Raku are really powerful, and that reading the entire page (and probably the best practices and gotchas page too) is a good idea.
Next, the: .map method, documented here, will be applied to any Iterable (List, Array and all their alikes) and will return a sequence based on each element of the Iterable, altered by a Code passed to it. So, something like:
@x.map({ S:g / foo /bar/ })
essentially means "please return a Sequence of every item in @x, modified by substituting any appearance of the substring foo with bar" (nothing will be altered in @x). A nice place to start to learn about sequences and iterables would be here.
Finally, my one-liner
my @tracks = <Foo Ba(r B^az>.map: { S:g / <[\(^]> // };
can be translated as:
I have a List with three string elements
Foo
Ba(r
B^az
(This would be a placeholder for your "list of titles"). Take that list and generate a second one, that contains every element on it, but with all instances of the chars "open parenthesis" and "caret" removed.
Ah, and store the result in the variable @tracks (which has my scope).
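For readers more at home in Python than in Raku, the one-liner corresponds roughly to the sketch below. This is only an analogy: re.sub returns a new string and leaves its input untouched, much like S///, and the list comprehension plays the role of .map. The variable names are mine.

```python
import re

titles = ["Foo", "Ba(r", "B^az"]

# The character class [(^] matches one open parenthesis or one caret;
# replacing every match with "" deletes those characters.
tracks = [re.sub(r"[(^]", "", title) for title in titles]

print(tracks)  # ['Foo', 'Bar', 'Baz']
print(titles)  # the original list is unchanged
```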
Here's what I ended up with:
my @tracks = <Foo Ba(r B^az>;
sub comments {
    my @filteredtitles;
    for @tracks -> $_ is copy {
        s:g / <[\(^]> //;
        @filteredtitles.push: $_;
    }
    return @filteredtitles;
}
The is copy ensures the variable set up by the for loop is mutable.
The s:g/...//; is all that's needed to strip the unwanted characters.
One thing no one can help you with is the error you reported. I currently think you just got confused.
Here's an example of code that generates that error:
do { @_ }
But there is no way the code you've shared could generate that error because it requires that there is an @_ variable in your code, and there isn't one.
One way I can help in relation to future problems you may report on StackOverflow is to encourage you to read and apply the guidance in Minimal Reproducible Example.
While your code did not generate the error you reported, it will perhaps help you if you know about some of the other compile time and run time errors there were in the code you shared.
Compile-time errors:
You wrote s:g:mrx. That's invalid: Adverb mrx not allowed on substitution.
You missed out the third slash of the s///. That causes mayhem (see below).
There were several run-time errors, once I got past the compile-time errors. I'll discuss just one, the regex:
.*<?[...]> will match any sub-string with a final character that's one of the ones listed in the [...], and will then capture that sub-string except without the final character. In the context of an s:g/...// substitution this will strip ordinary characters (captured by the .*) but leave the special characters.
This makes no sense.
So I dropped the .*, and also the ? from the special character pattern, changing it from <?[...]> (which just tries to match against the character, but does not capture it if it succeeds) to just <[...]> (which also tries to match against the character, but, if it succeeds, does capture it as well).
A final comment is about an error you made that may well have seriously confused you.
In a nutshell, the s/// construct must have three slashes.
In your question you had code of the form s/.../ (or s:g/.../ etc), without the final slash. If you try to compile such code the parser gets utterly confused because it will think you're just writing a long replacement string.
For example, if you wrote this code:
if s/foo/ { say 'foo' }
if m/bar/ { say 'bar' }
it'd be as if you'd written:
if s/foo/ { say 'foo' }\nif m/...
which in turn would mean you'd get the compile-time error:
Missing block
------> if m/⏏bar/ { ... }
expecting any of:
block or pointy block
...
because Raku(do) would have interpreted the part between the second and third /s as the replacement double quoted string of what it interpreted as an s/.../.../ construct, leading it to barf when it encountered bar.
So, to recap, the s/// construct requires three slashes, not two.
(I'm ignoring syntactic variants of the construct such as, say, s [...] = '...'.)

Optimizing the regex "unrolling the loop" pattern [duplicate]

Requirement : Two expressions, exp1 and exp2, we need to match one or more of both. So I came up with,
(exp1 | exp2)*
However in some places, I see the below being used,
(exp1 * (exp2 exp1*)*)
What is the difference between the two? When would you use one over the other?
Hopefully a fiddle will make this more clear,
var regex1 = /^"([\x00-!#-[\]-\x7f]|\\")*"$/;
var regex2 = /^"([\x00-!#-[\]-\x7f]*(\\"[\x00-!#-[\]-\x7f]*)*)"$/;
var str = '"foo \\"bar\\" baz"';
var r1 = regex1.exec(str);
var r2 = regex2.exec(str);
EDIT: It looks like there is a difference in behavior between the two approaches when we capture the groups. The second approach captures the entire string while the first approach captures only the last matching group. See updated fiddle.
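The capture difference is easy to reproduce outside the fiddle. The sketch below is in Python and, as an assumption on my part, uses the simpler quoted-string pattern ([^"\\] as the normal case, \\. as the special case) instead of the exact character class above. A repeated group only remembers its last iteration, which is why the first form ends up holding a single character.

```python
import re

text = r'"foo \"bar\" baz"'

alternation = re.compile(r'^"([^"\\]|\\.)*"$')
unrolled = re.compile(r'^"([^"\\]*(\\.[^"\\]*)*)"$')

# Group 1 of the alternation form is overwritten on every iteration of the star.
last_iteration = alternation.match(text).group(1)
# Group 1 of the unrolled form wraps the whole body of the string literal.
whole_body = unrolled.match(text).group(1)

print(last_iteration)  # z
print(whole_body)      # foo \"bar\" baz
```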
The difference between the two patterns is potential efficiency.
The (exp1 | exp2)* pattern contains an alternation that automatically disables some internal regex matching optimization. Also, this regex tries to match the pattern at each location in the string.
The (exp1 * (exp2 exp1*)*) expression is written according to the unroll-the-loop principle:
This optimisation technique is used to optimize repeated alternation of the form (expr1|expr2|...)*. These expressions are not uncommon, and the use of another repetition inside an alternation may also lead to a super-linear match. Super-linear matching arises from nondeterministic expressions such as (a*)*.
The unrolling-the-loop technique is based on the hypothesis that in most cases you know, in a repeated alternation, which case should be the most usual and which one is exceptional. We will call the first one the normal case and the second one the special case. The general syntax of the unrolling-the-loop technique could then be written as:
normal* ( special normal* )*
So, the exp1 in your example is the normal part that is most common, and exp2 is expected to be less frequent. In that case, the efficiency of the unrolled pattern can be much higher than that of the other regex, since the normal* part will grab whole chunks of input without any need to stop and check each location.
Let's see a simple "([^"\\]|\\.)*" regex test against "some text here": there are 35 steps involved:
Unrolling it as "[^"\\]*(\\.[^"\\]*)*" gives a boost to 6 steps as there is much less backtracking.
NOTE that the number of steps at regex101.com does not directly mean one regex is more efficient than another, however, the debug table shows where backtracking occurs, and backtracking is resource consuming.
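If you prefer to measure on your own machine rather than count steps, a rough sketch with Python's re and timeit modules could look like this. The sample strings and the iteration count are arbitrary choices of mine, and absolute timings will vary by engine and hardware, so no numbers are quoted here.

```python
import re
import timeit

alternation = re.compile(r'"([^"\\]|\\.)*"')
unrolled = re.compile(r'"[^"\\]*(\\.[^"\\]*)*"')

# Made-up inputs: plain, with escapes, unterminated, and without quotes.
samples = ['"some text here"', r'"foo \"bar\" baz"', '"unterminated', 'no quotes', r'say "a\\" now']

def matched(pattern, text):
    """Return the matched substring, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None

# Both forms should find exactly the same substrings.
agree = all(matched(alternation, s) == matched(unrolled, s) for s in samples)
print("same matches:", agree)

for name, pattern in (("alternation", alternation), ("unrolled", unrolled)):
    seconds = timeit.timeit(lambda: pattern.search('"some text here"'), number=100_000)
    print(name, round(seconds, 3))
```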
Let's then test the pattern efficiency with JS benchmark.js:
Benchmark = window.Benchmark;
var suite = new Benchmark.Suite();
suite
  .add('Regular RegExp test', function() {
    '"some text here"'.match(/"([^"\\]|\\.)*"/);
  })
  .add('Unrolled RegExp test', function() {
    '"some text here"'.match(/"[^"\\]*(\\.[^"\\]*)*"/);
  })
  .on('cycle', function(event) {
    console.log(String(event.target));
  })
  .on('complete', function() {
    console.log('Fastest is ' + this.filter('fastest').map('name'));
  })
  .run({ 'async': true });
<script src="https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.13.1/lodash.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/platform/1.3.1/platform.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/benchmark/2.1.0/benchmark.js"></script>
Results:
Regular RegExp test x 9,295,393 ops/sec ±0.69% (64 runs sampled)
Unrolled RegExp test x 12,176,227 ops/sec ±1.17% (64 runs sampled)
Fastest is Unrolled RegExp test
Also, since unroll the loop concept is not language specific, here is an online PHP test (regular pattern yielding ~0.45, and unrolled one yielding ~0.22 results).
Also see Unroll Loop, when to use.
What is the difference between the two?
The difference between them is how exactly they match a particular given input. If you think of these as two functions in terms of input and output, they are equivalent, but how each function works to produce the output (match) is different. Both of these regular expressions, (exp1 | exp2)* and (exp1 * (exp2 exp1*)*), will match the exact same input. In other words, you can say they are semantically equivalent in terms of the given input and a match (output).
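A brute-force check can back this up for short inputs. The sketch below (Python, taking a = exp1 and b = exp2, with an extra letter c thrown in so that rejections are exercised too) compares both expressions on every string up to length 6. It is an illustration, not a proof.

```python
import re
from itertools import product

alternation = re.compile(r"(a|b)*")
unrolled = re.compile(r"a*(ba*)*")

def accepts(pattern, text):
    """True when the pattern matches the whole string."""
    return pattern.fullmatch(text) is not None

# Every string over {a, b, c} of length 0 through 6.
words = ("".join(p) for n in range(7) for p in product("abc", repeat=n))
equivalent = all(accepts(alternation, w) == accepts(unrolled, w) for w in words)

print(equivalent)  # True
```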
When would you use one over the other?
Edit
The second regular expression (exp1 * (exp2 exp1*)*) is more optimal due to the loop unrolling technique. See @Wiktor Stribiżew's answer.
Proof
One way to prove if two regular expressions are equivalent is to see if they have the same DFA. Using this converter, here are the following DFAs of the regular expressions.
(Note: a = exp1 and b = exp2)
(a*(ba*)*)
(a|b)*
Notice that the first DFA is the same as the second one? The only difference is that the first one isn't minimized. Here is a crude fix to show the minimization of the first DFA:

Combining look aheads not matching

I have the following test string I'm working with:
__level__:,Undergraduate,;__subject__:,Maths,Art,;
This is actually a stringified object of { level: ["Undergraduate"], subject: ["Maths", "Art"] } that I figured converting to a string and using a regular expression might be quicker than looping through each level|subject and each value within those properties.
I can match a single value within a list of a property (e.g. level) like so:
(?=(__subject__:[^;]*(,Maths,).*?;))
And I can match two like so:
(?=(__subject__:[^;]*(,Maths,).*?;))(?=(__subject__:[^;]*(,Art,).*?;))
However, I can't guarantee the order that level and subject lists will be. Below is also possible:
__subject__:,Maths,Art,;__level__:,Undergraduate,;
Notice I've put subject before level now. Now the regular expression doesn't match. I'm pretty new to look aheads so I can't figure out what I've done wrong. Would appreciate any help on the matter.
I also want to combine the properties being matched, so something like:
(?=(__level__:[^;]*(,Undergraduate,).*?;))(?=(__subject__:[^;]*(,Maths,).*?;))(?=(__subject__:[^;]*(,Art,).*?;))
..doesn't work for me either but I'm trying to match two values from the subject property and a value from the level property. Again, I can't guarantee the order of properties (e.g. level, subject) and/or values (e.g. Maths, Art OR Art, Maths)
Class [A-Z] & Positive Lookahead (?=)
The targets are letters [A-Z]+? and to exclude the words surrounded by underscore use the positive lookahead to ensure the target is followed by a comma (?=,)
/([A-Z]+?)(?=,)/gi;
Demo
let str = `__level__:,Undergraduate,;__subject__:,Maths,Art,;`;
let rgx = /([A-Z]+?)(?=,)/gi;
let mch = rgx.exec(str);
let res = [];
while (mch !== null) {
  res.push(mch[0]);
  mch = rgx.exec(str);
}
console.log(res.join(', '));
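On the ordering problem raised in the question: all lookaheads at one position start from that same position, so a lookahead beginning with __level__: and another beginning with __subject__: can never both succeed there. One possible way around it, which is my own sketch and not part of the answer above, is to anchor at the start of the string and let each lookahead scan forward with .* first. It is shown in Python, but the same pattern should work as a JavaScript regex.

```python
import re

# Each lookahead searches the whole string independently, so the order
# of properties and of values inside a property no longer matters.
wanted = re.compile(
    r"^(?=.*__level__:[^;]*,Undergraduate,)"
    r"(?=.*__subject__:[^;]*,Maths,)"
    r"(?=.*__subject__:[^;]*,Art,)"
)

level_first = "__level__:,Undergraduate,;__subject__:,Maths,Art,;"
subject_first = "__subject__:,Art,Maths,;__level__:,Undergraduate,;"
art_in_wrong_list = "__subject__:,Maths,;__level__:,Undergraduate,Art,;"

results = [bool(wanted.search(s)) for s in (level_first, subject_first, art_in_wrong_list)]
print(results)  # [True, True, False]
```

The [^;]* keeps each lookahead inside a single property list, which is why Art listed under level does not count as a subject.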

How to replace parts of a string in lua "in a single pass"?

I have the following string of anchors (where I want to change the contents of the href) and a lua table of replacements, which tells which word should be replaced for:
s1 = '<a href="word7">'
replacementTable = {}
replacementTable["word1"] = "potato1"
replacementTable["word2"] = "potato2"
replacementTable["word3"] = "potato3"
replacementTable["word4"] = "potato4"
replacementTable["word5"] = "potato5"
The expected result should be:
<a href="word7">
I know I could do this by iterating over each element in the replacementTable and processing the string each time, but my gut feeling tells me that if by any chance the string is very big and/or the replacement table becomes big, this approach is going to perform poorly.
So I thought it would be best if I could do the following: apply the regular expression to find all the matches, get an iterator over the matches, and replace each match with its value in the replacementTable.
Something like this would be great (writing it in Javascript because I don't know yet how to write lambdas in Lua):
var newString = patternReplacement(s1, '<a[^>]* href="([^"]*)"', function(match) { return replacementTable[match] })
Where the first parameter is the string, the second one the regular expression and the third one a function that is executed for each match to get the replacement. This way I think s1 gets parsed once, being more efficient.
Is there any way to do this in Lua?
In your example, this simple code works:
print((s1:gsub("%w+",replacementTable)))
The point is that gsub already accepts a table of replacements.
In the end, the solution that worked for me was the following one:
local updatedBody = string.gsub(body, '(<a[^>]* href=")(/[^"%?]*)([^"]*")', function(leftSide, url, rightSide)
    local replacedUrl = url
    if (urlsToReplace[url]) then replacedUrl = urlsToReplace[url] end
    return leftSide .. replacedUrl .. rightSide
end)
It kept out any querystring parameter giving me just the URI. I know it's a bad idea to parse HTML bodies with regular expressions but for my case, where I required a lot of performance, this was performing a lot faster and just did the job.
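For comparison, the same single-pass idea (one scan of the string, with a callback choosing each replacement) looks like this in Python. The table contents and the input string are made up for the example, and swap_href is a hypothetical helper name.

```python
import re

replacement_table = {"word1": "potato1", "word2": "potato2"}
s1 = '<a href="word1"> <a href="word7"> <a class="x" href="word2">'

def swap_href(match):
    """Rebuild the tag, swapping the href value only when the table has an entry."""
    url = match.group(2)
    return match.group(1) + replacement_table.get(url, url) + match.group(3)

# The string is scanned once; the callback runs for each anchor that is found.
result = re.sub(r'(<a[^>]* href=")([^"]*)(")', swap_href, s1)
print(result)
# <a href="potato1"> <a href="word7"> <a class="x" href="potato2">
```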

Regex Pattern to Match, Excluding when... / Except between

--Edit-- The current answers have some useful ideas but I want something more complete that I can 100% understand and reuse; that's why I set a bounty. Also ideas that work everywhere are better for me than not standard syntax like \K
This question is about how I can match a pattern except some situations s1 s2 s3. I give a specific example to show my meaning but prefer a general answer I can 100% understand so I can reuse it in other situations.
Example
I want to match five digits using \b\d{5}\b but not in three situations s1 s2 s3:
s1: Not on a line that ends with a period like this sentence.
s2: Not anywhere inside parens.
s3: Not inside a block that starts with if( and ends with //endif
I know how to solve any one of s1 s2 s3 with a lookahead and lookbehind, especially in C# lookbehind or \K in PHP.
For instance
s1 (?m)(?!\d+.*?\.$)\d+
s3 with C# lookbehind (?<!if\(\D*(?=\d+.*?//endif))\b\d+\b
s3 with PHP \K (?:(?:if\(.*?//endif)\D*)*\K\d+
But the mix of conditions together makes my head explode. Even more bad news is that I may need to add other conditions s4 s5 at another time.
The good news is, I don't care if I process the files using most common languages like PHP, C#, Python or my neighbor's washing machine. :) I'm pretty much a beginner in Python & Java but interested to learn if it has a solution.
So I came here to see if someone think of a flexible recipe.
Hints are okay: you don't need to give me full code. :)
Thank you.
Hans, I'll take the bait and flesh out my earlier answer. You said you want "something more complete" so I hope you won't mind the long answer—just trying to please. Let's start with some background.
First off, this is an excellent question. There are often questions about matching certain patterns except in certain contexts (for instance, within a code block or inside parentheses). These questions often give rise to fairly awkward solutions. So your question about multiple contexts is a special challenge.
Surprise
Surprisingly, there is at least one efficient solution that is general, easy to implement and a pleasure to maintain. It works with all regex flavors that allow you to inspect capture groups in your code. And it happens to answer a number of common questions that may at first sound different from yours: "match everything except Donuts", "replace all but...", "match all words except those on my mom's black list", "ignore tags", "match temperature unless italicized"...
Sadly, the technique is not well known: I estimate that in twenty SO questions that could use it, only one has one answer that mentions it—which means maybe one in fifty or sixty answers. See my exchange with Kobi in the comments. The technique is described in some depth in this article which calls it (optimistically) the "best regex trick ever". Without going into as much detail, I'll try to give you a firm grasp of how the technique works. For more detail and code samples in various languages I encourage you to consult that resource.
A Better-Known Variation
There is a variation using syntax specific to Perl and PHP that accomplishes the same. You'll see it on SO in the hands of regex masters such as CasimiretHippolyte and HamZa. I'll tell you more about this below, but my focus here is on the general solution that works with all regex flavors (as long as you can inspect capture groups in your code).
Thanks for all the background, zx81... But what's the recipe?
Key Fact
The method returns the match in Group 1 capture. It does not care at all about the overall match.
In fact, the trick is to match the various contexts we don't want (chaining these contexts using the | OR / alternation) so as to "neutralize them". After matching all the unwanted contexts, the final part of the alternation matches what we do want and captures it to Group 1.
The general recipe is
Not_this_context|Not_this_either|StayAway|(WhatYouWant)
This will match Not_this_context, but in a sense that match goes into a garbage bin, because we won't look at the overall matches: we only look at Group 1 captures.
In your case, with your digits and your three contexts to ignore, we can do:
s1|s2|s3|(\b\d+\b)
Note that because we actually match s1, s2 and s3 instead of trying to avoid them with lookarounds, the individual expressions for s1, s2 and s3 can remain clear as day. (They are the subexpressions on each side of a | )
The whole expression can be written like this:
(?m)^.*\.$|\([^\)]*\)|if\(.*?//endif|(\b\d+\b)
See this demo (but focus on the capture groups in the lower right pane.)
If you mentally try to split this regex at each | delimiter, it is actually only a series of four very simple expressions.
For flavors that support free-spacing, this reads particularly well.
(?mx)
### s1: Match line that ends with a period ###
^.*\.$
| ### OR s2: Match anything between parentheses ###
\([^\)]*\)
| ### OR s3: Match any if(...//endif block ###
if\(.*?//endif
| ### OR capture digits to Group 1 ###
(\b\d+\b)
This is exceptionally easy to read and maintain.
Extending the regex
When you want to ignore more situations s4 and s5, you add them in more alternations on the left:
s4|s5|s1|s2|s3|(\b\d+\b)
How does this work?
The contexts you don't want are added to a list of alternations on the left: they will match, but these overall matches are never examined, so matching them is a way to put them in a "garbage bin".
The content you do want, however, is captured to Group 1. You then have to check programmatically that Group 1 is set and not empty. This is a trivial programming task (and we'll later talk about how it's done), especially considering that it leaves you with a simple regex that you can understand at a glance and revise or extend as required.
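To give an idea of how little code the Group 1 check takes, here is a sketch in Python using the exact regex from above. The sample text is made up, and the if(...//endif block is kept on one line because . does not cross newlines in this pattern.

```python
import re

pattern = re.compile(r"(?m)^.*\.$|\([^\)]*\)|if\(.*?//endif|(\b\d+\b)")

text = ("12345 and (67890) then if(11111 //endif 22222\n"
        "line with 33333 ends with a period.\n"
        "44444")

# Overall matches for s1, s2 and s3 leave Group 1 unset; keep only the
# matches where Group 1 actually captured something.
wanted = [m.group(1) for m in pattern.finditer(text) if m.group(1)]
print(wanted)  # ['12345', '22222', '44444']
```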
I'm not always a fan of visualizations, but this one does a good job of showing how simple the method is. Each "line" corresponds to a potential match, but only the bottom line is captured into Group 1.
Debuggex Demo
Perl/PCRE Variation
In contrast to the general solution above, there exists a variation for Perl and PCRE that is often seen on SO, at least in the hands of regex Gods such as @CasimiretHippolyte and @HamZa. It is:
(?:s1|s2|s3)(*SKIP)(*F)|whatYouWant
In your case:
(?m)(?:^.*\.$|\([^()]*\)|if\(.*?//endif)(*SKIP)(*F)|\b\d+\b
This variation is a bit easier to use because the content matched in contexts s1, s2 and s3 is simply skipped, so you don't need to inspect Group 1 captures (notice the parentheses are gone). The matches only contain whatYouWant
Note that (*F), (*FAIL) and (?!) are all the same thing. If you wanted to be more obscure, you could use (*SKIP)(?!)
demo for this version
Applications
Here are some common problems that this technique can often easily solve. You'll notice that the word choice can make some of these problems sound different while in fact they are virtually identical.
How can I match foo except anywhere in a tag like <a stuff...>...</a>?
How can I match foo except in an <i> tag or a javascript snippet (more conditions)?
How can I match all words that are not on this black list?
How can I ignore anything inside a SUB... END SUB block?
How can I match everything except... s1 s2 s3?
How to Program the Group 1 Captures
You didn't ask for code, but, for completeness... The code to inspect Group 1 will obviously depend on your language of choice. At any rate it shouldn't add more than a couple of lines to the code you would use to inspect matches.
If in doubt, I recommend you look at the code samples section of the article mentioned earlier, which presents code for quite a few languages.
Alternatives
Depending on the complexity of the question, and on the regex engine used, there are several alternatives. Here are the two that can apply to most situations, including multiple conditions. In my view, neither is nearly as attractive as the s1|s2|s3|(whatYouWant) recipe, if only because clarity always wins out.
1. Replace then Match.
A good solution that sounds hacky but works well in many environments is to work in two steps. A first regex neutralizes the context you want to ignore by replacing potentially conflicting strings. If you only want to match, then you can replace with an empty string, then run your match in the second step. If you want to replace, you can first replace the strings to be ignored with something distinctive, for instance surrounding your digits with a fixed-width chain of ###. After this replacement, you are free to replace what you really wanted, then you'll have to revert your distinctive ### strings.
2. Lookarounds.
Your original post showed that you understand how to exclude a single condition using lookarounds. You said that C# is great for this, and you are right, but it is not the only option. The .NET regex flavors found in C#, VB.NET and Visual C++ for example, as well as the still-experimental regex module to replace re in Python, are the only two engines I know that support infinite-width lookbehind. With these tools, one condition in one lookbehind can take care of looking not only behind but also at the match and beyond the match, avoiding the need to coordinate with a lookahead. More conditions? More lookarounds.
Recycling the regex you had for s3 in C#, the whole pattern would look like this.
(?!.*\.)(?<!\([^()]*(?=\d+[^)]*\)))(?<!if\(\D*(?=\d+.*?//endif))\b\d+\b
But by now you know I'm not recommending this, right?
Deletions
@HamZa and @Jerry have suggested I mention an additional trick for cases when you seek to just delete WhatYouWant. You remember that the recipe to match WhatYouWant (capturing it into Group 1) was s1|s2|s3|(WhatYouWant), right? To delete all instances of WhatYouWant, you change the regex to
(s1|s2|s3)|WhatYouWant
For the replacement string, you use $1. What happens here is that for each instance of s1|s2|s3 that is matched, the replacement $1 replaces that instance with itself (referenced by $1). On the other hand, when WhatYouWant is matched, it is replaced by an empty group and nothing else, and is therefore deleted. See this demo, thank you @HamZa and @Jerry for suggesting this wonderful addition.
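A sketch of the deletion trick in Python, as an illustration under my own assumptions: the sample text is made up, the multiline flag is passed as an argument instead of (?m), and a small lambda stands in for the $1 replacement so that an unset Group 1 becomes an empty string.

```python
import re

# (s1|s2|s3)|WhatYouWant, with the contexts to keep captured in Group 1.
pattern = re.compile(r"(^.*\.$|\([^)]*\)|if\(.*?//endif)|\b\d+\b", re.MULTILINE)

text = "12345 and (67890) then if(11111 //endif 22222\nkeep 33333 here.\n44444"

# Protected contexts are put back unchanged; bare digit runs are dropped.
cleaned = pattern.sub(lambda m: m.group(1) or "", text)
print(repr(cleaned))
# ' and (67890) then if(11111 //endif \nkeep 33333 here.\n'
```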
Replacements
This brings us to replacements, on which I'll touch briefly.
When replacing with nothing, see the "Deletions" trick above.
When replacing, if using Perl or PCRE, use the (*SKIP)(*F) variation mentioned above to match exactly what you want, and do a straight replacement.
In other flavors, within the replacement function call, inspect the match using a callback or lambda, and replace if Group 1 is set. If you need help with this, the article already referenced will give you code in various languages.
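For the callback route, a minimal Python sketch might look like this. The mask function and the sample text are hypothetical; the callback returns the overall match untouched whenever Group 1 is not set.

```python
import re

pattern = re.compile(r"^.*\.$|\([^)]*\)|if\(.*?//endif|(\b\d+\b)", re.MULTILINE)

def mask(match):
    """Replace wanted digits with # marks; leave ignored contexts as they are."""
    if match.group(1) is None:
        return match.group(0)
    return "#" * len(match.group(1))

text = "12345 and (67890) then if(11111 //endif 22\nkeep 33333 here.\n444"
masked = pattern.sub(mask, text)
print(masked)
```

With this input, only the digit runs outside the three contexts (12345, 22 and 444) are masked.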
Have fun!
No, wait, there's more!
Ah, nah, I'll save that for my memoirs in twenty volumes, to be released next Spring.
Do three different matches and handle the combination of the three situations using in-program conditional logic. You don't need to handle everything in one giant regex.
EDIT: let me expand a bit because the question just became more interesting :-)
The general idea you are trying to capture here is to match against a certain regex pattern, but not when there are certain other (could be any number) patterns present in the test string. Fortunately, you can take advantage of your programming language: keep the regexes simple and just use a compound conditional. A best practice would be to capture this idea in a reusable component, so let's create a class and a method that implement it:
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public class MatcherWithExceptions {
    private string m_searchStr;
    private Regex m_searchRegex;
    private IEnumerable<Regex> m_exceptionRegexes;

    public string SearchString {
        get { return m_searchStr; }
        set {
            m_searchStr = value;
            m_searchRegex = new Regex(value);
        }
    }

    public string[] ExceptionStrings {
        set { m_exceptionRegexes = from es in value select new Regex(es); }
    }

    public bool IsMatch(string testStr) {
        return (
            m_searchRegex.IsMatch(testStr)
            && !m_exceptionRegexes.Any(er => er.IsMatch(testStr))
        );
    }
}

public class App {
    public static void Main() {
        var mwe = new MatcherWithExceptions();
        // Set up the matcher object.
        mwe.SearchString = @"\b\d{5}\b";
        mwe.ExceptionStrings = new string[] {
            @"\.$"
            , @"\(.*" + mwe.SearchString + @".*\)"
            , @"if\(.*" + mwe.SearchString + @".*//endif"
        };
        var testStrs = new string[] {
            "1." // False
            , "11111." // False
            , "(11111)" // False
            , "if(11111//endif" // False
            , "if(11111" // True
            , "11111" // True
        };
        // Perform the tests.
        foreach (var ts in testStrs) {
            System.Console.WriteLine(mwe.IsMatch(ts));
        }
    }
}
So above, we set up the search string (the five digits), multiple exception strings (your s1, s2 and s3), and then try to match against several test strings. The printed results should be as shown in the comments next to each test string.
Your requirement that it's not inside parens is impossible to satisfy for all cases.
Namely, if you can somehow find a ( to the left and ) to the right, it doesn't always mean you are inside parens. Eg.
(....) + 55555 + (.....) - not inside parens yet there are ( and ) to left and right
Now you might think yourself clever and look for ( to the left only if you don't encounter ) before and vice versa to the right. This won't work for this case:
((.....) + 55555 + (.....)) - inside parens even though there are closing ) and ( to left and to right.
It is impossible to find out if you are inside parens using regex, as regex can't count how many parens have been opened and how many closed.
Consider this easier task: using regex, find out if all (possibly nested) parens in a string are closed, that is, for every ( you need to find a ). You will find out that it's impossible to solve, and if you can't solve that with regex then you can't figure out if a word is inside parens for all cases, since you can't figure out at some position in the string whether all preceding ( have a corresponding ).
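Outside of regex the counting itself is trivial, which is one reason to do this part in ordinary code. A minimal sketch in Python, assuming the parentheses in the input are balanced overall (paren_depth_before is a hypothetical helper name):

```python
def paren_depth_before(text, index):
    """Count how many parentheses are still open just before text[index]."""
    depth = 0
    for ch in text[:index]:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
    return depth

# The two examples from above: same neighbours, different nesting.
outside = "(....) + 55555 + (.....)"
inside = "((.....) + 55555 + (.....))"

print(paren_depth_before(outside, outside.index("55555")) > 0)  # False
print(paren_depth_before(inside, inside.index("55555")) > 0)    # True
```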
Hans if you don't mind I used your neighbor's washing machine called perl :)
Edited:
Below is the pseudocode:
loop through input
    if line contains 'if(' set skip=true
    if skip=true do nothing
    else
        if line matches '\b\d{5}\b' set s0=true
        if line does not match s1 condition set s1=true
        if line does not match s2 condition set s2=true
        if s0,s1,s2 are true print line
    if line contains '//endif' set skip=false
Given the file input.txt:
tiago@dell:~$ cat input.txt
this is a text
it should match 12345
if(
it should not match 12345
//endif
it should match 12345
it should not match 12345.
it should not match ( blabla 12345 blablabla )
it should not match ( 12345 )
it should match 12345
And the script validator.pl:
tiago@dell:~$ cat validator.pl
#! /usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
sub validate_s0 {
my $line = $_[0];
if ( $line =~ /\b\d{5}\b/ ){
return "true";
}
return "false";
}
sub validate_s1 {
my $line = $_[0];
if ( $line =~ /\.$/ ){
return "false";
}
return "true";
}
sub validate_s2 {
my $line = $_[0];
if ( $line =~ /.*?\(.*\d{5}.*?\).*/ ){
return "false";
}
return "true";
}
my $skip = "false";
while (<>){
my $line = $_;
if( $line =~ /if\(/ ){
$skip = "true";
}
if ( $skip eq "false" ) {
my $s0_status = validate_s0 "$line";
my $s1_status = validate_s1 "$line";
my $s2_status = validate_s2 "$line";
if ( $s0_status eq "true"){
if ( $s1_status eq "true"){
if ( $s2_status eq "true"){
print "$line";
}
}
}
}
if ( $line =~ /\/\/endif/) {
$skip="false";
}
}
Execution:
tiago@dell:~$ cat input.txt | perl validator.pl
it should match 12345
it should match 12345
it should match 12345
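For comparison, the same skip-flag logic can be sketched in Python. This is only an illustrative port of the pseudocode above, not the script that produced the output; the names `matching_lines`, `FIVE_DIGITS`, `ENDS_WITH_DOT` and `IN_PARENS` are invented for the example.

```python
import re

FIVE_DIGITS = re.compile(r"\b\d{5}\b")    # s0: the number we are looking for
ENDS_WITH_DOT = re.compile(r"\.$")        # s1: reject lines ending in a period
IN_PARENS = re.compile(r"\(.*\d{5}.*\)")  # s2: reject numbers wrapped in ( ... )


def matching_lines(lines):
    """Yield lines that contain a five-digit number and pass every exception."""
    skip = False
    for raw in lines:
        line = raw.rstrip("\n")
        if "if(" in line:
            skip = True  # entering an if( ... //endif block
        if not skip:
            if (FIVE_DIGITS.search(line)
                    and not ENDS_WITH_DOT.search(line)
                    and not IN_PARENS.search(line)):
                yield line
        if "//endif" in line:
            skip = False  # block closed; resume checking on the next line


sample = """this is a text
it should match 12345
if(
it should not match 12345
//endif
it should match 12345
it should not match 12345.
it should not match ( blabla 12345 blablabla )
it should not match ( 12345 )
it should match 12345""".splitlines()

for line in matching_lines(sample):
    print(line)
```

Run against the same sample input, this prints the three "it should match 12345" lines.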
Not sure if this would help you or not, but I am providing a solution considering the following assumptions -
You need an elegant solution to check all the conditions
Conditions can change in future and anytime.
One condition should not depend on others.
However I considered also the following -
The file given has minimal errors in it. If it does have errors, my code might need some modifications to cope with them.
I used Stack to keep track of if( blocks.
Ok here is the solution -
I used C# with MEF (Managed Extensibility Framework) to implement configurable parsers. The idea is to use a single parser to read the file and a list of pluggable validator classes, each of which checks a line and returns true or false. You can then add or remove any validator at any time, or write new ones. So far I have implemented validators for the S1, S2 and S3 you mentioned (see the checker classes further down). You would add classes for s4, s5 and so on if you need them in future.
First, Create the Interfaces -
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace FileParserDemo.Contracts
{
public interface IParser
{
String[] GetMatchedLines(String filename);
}
public interface IPatternMatcher
{
Boolean IsMatched(String line, Stack<string> stack);
}
}
Then comes the file reader and checker -
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using FileParserDemo.Contracts;
using System.ComponentModel.Composition.Hosting;
using System.ComponentModel.Composition;
using System.IO;
using System.Collections;
namespace FileParserDemo.Parsers
{
public class Parser : IParser
{
[ImportMany]
IEnumerable<Lazy<IPatternMatcher>> parsers;
private CompositionContainer _container;
public void ComposeParts()
{
var catalog = new AggregateCatalog();
catalog.Catalogs.Add(new AssemblyCatalog(typeof(IParser).Assembly));
_container = new CompositionContainer(catalog);
try
{
this._container.ComposeParts(this);
}
catch
{
}
}
public String[] GetMatchedLines(String filename)
{
var matched = new List<String>();
var stack = new Stack<string>();
using (StreamReader sr = File.OpenText(filename))
{
String line = "";
while (!sr.EndOfStream)
{
line = sr.ReadLine();
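var m = true;
foreach(var matcher in this.parsers){
// Call IsMatched before "&& m" so stateful matchers (the if( stack) see every line,
// even when an earlier matcher has already rejected it.
m = matcher.Value.IsMatched(line, stack) && m;
}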
if (m)
{
matched.Add(line);
}
}
}
return matched.ToArray();
}
}
}
Then comes the implementation of individual checkers, the class names are self explanatory, so I don't think they need more descriptions.
using FileParserDemo.Contracts;
using System;
using System.Collections.Generic;
using System.ComponentModel.Composition;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
namespace FileParserDemo.PatternMatchers
{
[Export(typeof(IPatternMatcher))]
public class MatchAllNumbers : IPatternMatcher
{
public Boolean IsMatched(String line, Stack<string> stack)
{
var regex = new Regex("\\d+");
return regex.IsMatch(line);
}
}
[Export(typeof(IPatternMatcher))]
public class RemoveIfBlock : IPatternMatcher
{
public Boolean IsMatched(String line, Stack<string> stack)
{
var regex = new Regex("if\\(");
if (regex.IsMatch(line))
{
foreach (var m in regex.Matches(line))
{
//push the if
stack.Push(m.ToString());
}
//ignore current line, and will validate on next line with stack
return true;
}
regex = new Regex("//endif");
if (regex.IsMatch(line))
{
foreach (var m in regex.Matches(line))
{
stack.Pop();
}
}
return stack.Count == 0; //if stack has an item then ignoring this block
}
}
[Export(typeof(IPatternMatcher))]
public class RemoveWithEndPeriod : IPatternMatcher
{
public Boolean IsMatched(String line, Stack<string> stack)
{
var regex = new Regex("(?m)(?!\\d+.*?\\.$)\\d+");
return regex.IsMatch(line);
}
}
[Export(typeof(IPatternMatcher))]
public class RemoveWithInParenthesis : IPatternMatcher
{
public Boolean IsMatched(String line, Stack<string> stack)
{
var regex = new Regex("\\(.*\\d+.*\\)");
return !regex.IsMatch(line);
}
}
}
The program -
using FileParserDemo.Contracts;
using FileParserDemo.Parsers;
using System;
using System.Collections.Generic;
using System.ComponentModel.Composition;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace FileParserDemo
{
class Program
{
static void Main(string[] args)
{
var parser = new Parser();
parser.ComposeParts();
var matches = parser.GetMatchedLines(Path.GetFullPath("test.txt"));
foreach (var s in matches)
{
Console.WriteLine(s);
}
Console.ReadLine();
}
}
}
For testing I took @Tiago's sample file as Test.txt which had the following lines -
this is a text
it should match 12345
if(
it should not match 12345
//endif
it should match 12345
it should not match 12345.
it should not match ( blabla 12345 blablabla )
it should not match ( 12345 )
it should match 12345
Gives the output -
it should match 12345
it should match 12345
it should match 12345
I don't know whether this will help you or not, but I had fun playing with it. :)
The best part is that to add a new condition, all you have to do is provide an implementation of IPatternMatcher exported with [Export(typeof(IPatternMatcher))]; it will be picked up and called automatically.
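The plug-in idea is not tied to MEF. As a rough sketch under that assumption, the same design can be mimicked in Python with a plain list and a decorator; `VALIDATORS`, `validator` and the individual check functions below are hypothetical names for this example, not part of the C# code above.

```python
import re

VALIDATORS = []  # hypothetical registry, standing in for MEF's [ImportMany] collection


def validator(func):
    """Register func as a line validator (plays the role of [Export])."""
    VALIDATORS.append(func)
    return func


@validator
def has_number(line, stack):
    return re.search(r"\d+", line) is not None


@validator
def outside_if_block(line, stack):
    if "if(" in line:
        stack.append("if(")  # remember the open block
        return False
    if "//endif" in line:
        if stack:
            stack.pop()
        return False
    return not stack  # reject while any if( block is still open


@validator
def no_trailing_period(line, stack):
    return not line.rstrip().endswith(".")


@validator
def not_in_parentheses(line, stack):
    return re.search(r"\(.*\d+.*\)", line) is None


def matched_lines(lines):
    stack, result = [], []
    for line in lines:
        # Run every validator (no short-circuit) so stateful ones see each line.
        verdicts = [check(line, stack) for check in VALIDATORS]
        if all(verdicts):
            result.append(line)
    return result


sample = ["this is a text", "it should match 12345", "if(",
          "it should not match 12345", "//endif", "it should match 12345",
          "it should not match 12345.", "it should not match ( 12345 )"]
print(matched_lines(sample))
```

Adding a new condition then amounts to writing one more function decorated with `@validator`; nothing else in the loop has to change.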
Same as @zx81's (*SKIP)(*F) approach, but using a negative lookahead assertion.
(?m)(?:if\(.*?\/\/endif|\([^()]*\))(*SKIP)(*F)|\b\d+\b(?!.*\.$)
DEMO
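Python's standard `re` module does not support the (*SKIP)(*F) verbs, so the pattern above cannot be used there as written. A rough equivalent, sketched here as an assumption rather than a drop-in translation, is to let the unwanted contexts match first and capture the number only in the last alternative; the name `PATTERN` is made up for the example.

```python
import re

# Alternatives are tried left to right: the first two consume the contexts
# to ignore, and only the last one captures a number into group 1.
PATTERN = re.compile(
    r"if\(.*?//endif"          # an if( ... //endif block on one line
    r"|\([^()]*\)"             # a parenthesised chunk without nesting
    r"|\b(\d+)\b(?!.*\.$)",    # a number on a line that does not end with '.'
    re.MULTILINE,
)

text = """cat 123 sat.
I like 000 not (456) though 111 is fine
222 if( //endif if(cat==789 stuff //endif 333"""

# Matches of the first two alternatives leave group 1 empty, so drop them.
numbers = [m.group(1) for m in PATTERN.finditer(text) if m.group(1) is not None]
print(numbers)  # ['000', '111', '222', '333']
```

Matches that come from the first two alternatives are simply thrown away, which is the job (*SKIP)(*F) does inside the regex engine.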
In Python, I would do it like this:
import re
string = """cat 123 sat.
I like 000 not (456) though 111 is fine
222 if( //endif if(cat==789 stuff //endif 333"""
for line in string.split('\n'):  # Split the input on newlines and iterate over the lines.
    if not line.endswith('.'):  # Skip any line that ends with a dot.
        for i in re.split(r'\([^()]*\)|if\(.*?//endif', line):  # Split the line on bracketed chunks and on if( ... //endif blocks, then iterate over the remaining parts.
            for j in re.findall(r'\b\d+\b', i):  # Find all the numbers in each remaining part.
                print(j)  # Print the numbers one by one.
Output:
000
111
222
333