Perl regexp partial match if the string were longer? - regex

I have a tree of nested hashes, each of which contains a name, like nested directories with files. If I get foreign supplied regexps at runtime (which I don't want to analyze) how can I find in which subtrees to look for matches. The path to match might be of the form
"$x{name}/$x{subdir}{name}/$x{subdir}{subdir}{name}"
but, because there can be thousands of hashes, I want to try it only if both of these partially match:
"$x{name}"
"$x{name}/$x{subdir}{name}"
Or even better, if the 1st part matches then try to continue directly with the 2nd and then with the 3rd, sort of like /\G.../g, except the regexp comes from elsewhere in one piece. And I'd need backtracking to also look in all other partially matching subdirs.
PCRE g_match_info_is_partial_match sounds just what I'm looking for, but despite the "Perl" in that name even the 5.18 source doesn't seem to contain this. And I actually want something backward compatible to 5.8.0.
Background to this question is introducing regexp syntax to makepp. We essentially do that for patterns, but due to their trivial syntax, that is easy. Note that we cache what files we find and can cope with more files as they appear. This enables makepp to match files which might be built later, because it puts the rules' outputs into the tree as well.

Perl regexes and PCRE inspire each other, but are not really compatible and are totally not the same. Perl uses a custom regex engine.
Either a regex matches, or it doesn't. If a regex fails, it is impossible to tell where the match failed except when the regex was written in such a way to report the position.
The only viable solution would be to require a list of regexes, one for each level.
Otherwise you could require users to write regexes in such a way that partial matches work as well. In this case, the regex qr|foo/bar\.txt$| would have to be rewritten
qr|\A / # anchor at start
(?: [^/]*/ )* # match as many directories as neccessary
(?: foo/bar\.txt )? # maybe match an ending foo/bar.txt
\z|x # anchor at end
Example:
for ("/a/", "/a/b/", "/a/b/foo/", "/a/b/foo/bar.txt", "/a/b/foo/baz.txt", "/a/bar.txt") {
say qq("$_" -- ), /$regex/ ? "matches" : "doesn't match";
}
Output:
"/a/" -- matches
"/a/b/" -- matches
"/a/b/foo/" -- matches
"/a/b/foo/bar.txt" -- matches
"/a/b/foo/baz.txt" -- doesn't match
"/a/bar.txt" -- doesn't match
Obviously, this doesn't reduce the search space in any way for this regex.
You may be able to spin this in a way that works for your application. Depending on the guarantees your app provides, you can transform the original regex automatically to something that “always” matches.

Related

Can this be parsed by regular expression [duplicate]

I keep bumping into situations where I need to capture a number of tokens from a string and after countless tries I couldn't find a way to simplify the process.
So let's say the text is:
start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end
This example has 8 items inside, but say it could have between 3 and 10 items.
I'd ideally like something like this:
start:(?:(\w+)-?){3,10}:end nice and clean BUT it only captures the last match. see here
I usually use something like this in simple situations:
start:(\w+)-(\w+)-(\w+)-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?:end
3 groups mandatory and another 7 optional because of the max 10 limit, but this doesn't look 'nice' and it would be a pain to write and track if the max limit was 100 and the matches were more complex. demo
And the best I could do so far:
start:(\w+)-((?1))-((?1))-?((?1))?-?((?1))?-?((?1))?-?((?1))?-?((?1))?:end
shorter especially if the matches are complex but still long. demo
Anyone managed to make it work as a 1 regex-only solution without programming?
I'm mostly interested on how can this be done in PCRE but other flavors would be ok too.
Update:
The purpose is to validate a match and capture individual tokens inside match 0 by RegEx alone, without any OS/Software/Programming-Language limitation
Update 2 (bounty):
With #nhahtdh's help I got to the RegExp below by using \G:
(?:start:(?=(?:[\w]+(?:-|(?=:end))){3,10}:end)|(?!^)\G-)([\w]+)
demo even shorter, but can be described without repeating code
I'm also interested in the ECMA flavor and as it doesn't support \G wondering if there's another way, especially without using /g modifier.
Read this first!
This post is to show the possibility rather than endorsing the "everything regex" approach to problem. The author has written 3-4 variations, each has subtle bug that are tricky to detect, before reaching the current solution.
For your specific example, there are other better solution that is more maintainable, such as matching and splitting the match along the delimiters.
This post deals with your specific example. I really doubt a full generalization is possible, but the idea behind is reusable for similar cases.
Summary
.NET supports capturing repeating pattern with CaptureCollection class.
For languages that supports \G and look-behind, we may be able to construct a regex that works with global matching function. It is not easy to write it completely correct and easy to write a subtly buggy regex.
For languages without \G and look-behind support: it is possible to emulate \G with ^, by chomping the input string after a single match. (Not covered in this answer).
Solution
This solution assumes the regex engine supports \G match boundary, look-ahead (?=pattern), and look-behind (?<=pattern). Java, Perl, PCRE, .NET, Ruby regex flavors support all those advanced features above.
However, you can go with your regex in .NET. Since .NET supports capturing all instances of that is matched by a capturing group that is repeated via CaptureCollection class.
For your case, it can be done in one regex, with the use of \G match boundary, and look-ahead to constrain the number of repetitions:
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end)
DEMO. The construction is \w+- repeated, then \w+:end.
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?!^)\G-)(\w+)
DEMO. The construction is \w+ for the first item, then -\w+ repeated. (Thanks to ka ᵠ for the suggestion). This construction is simpler to reason about its correctness, since there are less alternations.
\G match boundary is especially useful when you need to do tokenization, where you need to make sure the engine not skipping ahead and matching stuffs that should have been invalid.
Explanation
Let us break down the regex:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?<=-)\G
)
(\w+)
(?:-|:end)
The easiest part to recognize is (\w+) in the line before last, which is the word that you want to capture.
The last line is also quite easy to recognize: the word to be matched may be followed by - or :end.
I allow the regex to freely start matching anywhere in the string. In other words, start:...:end can appear anywhere in the string, and any number of times; the regex will simply match all the words. You only need to process the array returned to separate where the matched tokens actually come from.
As for the explanation, the beginning of the regex checks for the presence of the string start:, and the following look-ahead checks that the number of words is within specified limit and it ends with :end. Either that, or we check that the character before the previous match is a -, and continue from previous match.
For the other construction:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?!^)\G-
)
(\w+)
Everything is almost the same, except that we match start:\w+ first before matching the repetition of the form -\w+. In contrast to the first construction, where we match start:\w+- first, and the repeated instances of \w+- (or \w+:end for the last repetition).
It is quite tricky to make this regex works for matching in middle of the string:
We need to check the number of words between start: and :end (as part of the requirement of the original regex).
\G matches the beginning of the string also! (?!^) is needed to prevent this behavior. Without taking care of this, the regex may produce a match when there isn't any start:.
For the first construction, the look-behind (?<=-) already prevent this case ((?!^) is implied by (?<=-)).
For the first construction (?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end), we need to make sure that we don't match anything funny after :end. The look-behind is for that purpose: it prevents any garbage after :end from matching.
The second construction doesn't run into this problem, since we will get stuck at : (of :end) after we have matched all the tokens in between.
Validation Version
If you want to do validation that the input string follows the format (no extra stuff in front and behind), and extract the data, you can add anchors as such:
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G-)(\w+)
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G)(\w+)(?:-|:end)
(Look-behind is also not needed, but we still need (?!^) to prevent \G from matching the start of the string).
Construction
For all the problems where you want to capture all instances of a repetition, I don't think there exists a general way to modify the regex. One example of a "hard" (or impossible?) case to convert is when a repetition has to backtrack one or more loop to fulfill certain condition to match.
When the original regex describes the whole input string (validation type), it is usually easier to convert compared to a regex that tries to match from the middle of the string (matching type). However, you can always do a match with the original regex, and we convert matching type problem back to validation type problem.
We build such regex by going through these steps:
Write a regex that covers the part before the repetition (e.g. start:). Let us call this prefix regex.
Match and capture the first instance. (e.g. (\w+))
(At this point, the first instance and delimiter should have been matched)
Add the \G as an alternation. Usually also need to prevent it from matching the start of the string.
Add the delimiter (if any). (e.g. -)
(After this step, the rest of the tokens should have also been matched, except the last maybe)
Add the part that covers the part after the repetition (if necessary) (e.g. :end). Let us call the part after the repetition suffix regex (whether we add it to the construction doesn't matter).
Now the hard part. You need to check that:
There is no other way to start a match, apart from the prefix regex. Take note of the \G branch.
There is no way to start any match after the suffix regex has been matched. Take note of how \G branch starts a match.
For the first construction, if you mix the suffix regex (e.g. :end) with delimiter (e.g. -) in an alternation, make sure you don't end up allowing the suffix regex as delimiter.
Although it might theoretically be possible to write a single expression, it's a lot more practical to match the outer boundaries first and then perform a split on the inner part.
In ECMAScript I would write it like this:
'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end'
.match(/^start:([\w-]+):end$/)[1] // match the inner part
.split('-') // split inner part (this could be a split regex as well)
In PHP:
$txt = 'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end';
if (preg_match('/^start:([\w-]+):end$/', $txt, $matches)) {
print_r(explode('-', $matches[1]));
}
Of course you can use the regex in this quoted string.
"(?<a>\\w+)-(?<b>\\w+)-(?:(?<c>\\w+)" \
"(?:-(?<d>\\w+)(?:-(?<e>\\w+)(?:-(?<f>\\w+)" \
"(?:-(?<g>\\w+)(?:-(?<h>\\w+)(?:-(?<i>\\w+)" \
"(?:-(?<j>\\w+))?" \
")?)?)?" \
")?)?)?" \
")"
Is it a good idea? No, I don't think so.
Not sure you can do it in that way, but you can use the global flag to find all of the words between the colons, see:
http://regex101.com/r/gK0lX1
You'd have to validate the number of groups yourself though. Without the global flag you're only getting a single match, not all matches - change {3,10} to {1,5} and you get the result 'sir' instead.
import re
s = "start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end"
print re.findall(r"(\b\w+?\b)(?:-|:end)", s)
produces
['test', 'test', 'lorem', 'ipsum', 'sir', 'doloret', 'etc', 'etc', 'something']
When you combine:
Your observation: any kind of repitition of a single capture group will result in an overwrite of the last capture, thus returning only the last capture of the capture group.
The knowledge: Any kind of capturing based on the parts, instead of the whole, makes it impossible to set a limit on the amount of times the regex engine will repeat. The limit would have to be metadata (not regex).
With a requirement that the answer cannot involve programming (looping), nor an answer that involves simply copy-pasting capturegroups as you've done in your question.
It can be deduced that it cannot be done.
Update: There are some regex engines for which p. 1 is not necessarily true. In that case the regex you have indicated start:(?:(\w+)-?){3,10}:end will do the job (source).

Regex to find two words on the page

I'm trying to find all pages which contain words "text1" and "text2".
My regex:
text1(.|\n)*text2
it doesn't work..
If your IDE supports the s (single-line) flag (so the . character can match newlines), you can search for your items with:
(text1).*(text2)|\2.*\1
Example with s flag
If the IDE does not support the s flag, you will need to use [\s\S] in place of .:
(text1)[\s\S]*(text2)|\2[\s\S]*\1
Example with [\s\S]
Some languages use $1 and $2 in place of \1 and \2, so you may need to change that.
EDIT:
Alternately, if you want to simply match that a file contains both strings (but not actually select anything), you can utilize look-aheads:
(?s)^(?=.*?text1)(?=.*?text2)
This doesn't care about the order (or number) of the arguments, and for each additional text that you want to search for, you simply append another (?=.*?text_here). This approach is nice, since you can even include regex instead of just plain strings.
text0[\s\S]*text1
Try this.This should do it for you.
What this does is match all including multiline .similar to having .*? with s flag.
\s takes care of spaces,newlines,tabs
\S takes care any non space character.
If you want the regex to match over several lines I would try:
text1[\w\W]*text2
Using . is not a good choice, because it usually doesn't match over multiple lines. Also, for matching single characters I think using square brackets is more idiomatic than using ( ... | ... )
If you want the match to be order-independent then use this:
(?:text1[\w\W]*text2)|(?:text2[\w\W]*text1)
Adding a response for IntelliJ
Building on #OnlineCop's answer, to swap the order of two expressions in IntelliJ,you would style the search as in the accepted response, but since IntelliJ doesn't allow a one-line version, you have to put the replace statement in a separate field. Also, IntelliJ uses $ to identify expressions instead of \.
For example, I tend to put my nulls at the end of my comparisons, but some people prefer it otherwise. So, to keep things consistent at work, I used this regex pattern to swap the order of my comparisons:
Notice that IntelliJ shows in a tooltip what the result of the replacement will be.
For me works text1*{0,}(text2){0,}.
With {0,} you can decide to get your keyword zero or more times OR you set {1,x} to get your keyword 1 or x-times (how often you want).

Collapse and Capture a Repeating Pattern in a Single Regex Expression

I keep bumping into situations where I need to capture a number of tokens from a string and after countless tries I couldn't find a way to simplify the process.
So let's say the text is:
start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end
This example has 8 items inside, but say it could have between 3 and 10 items.
I'd ideally like something like this:
start:(?:(\w+)-?){3,10}:end nice and clean BUT it only captures the last match. see here
I usually use something like this in simple situations:
start:(\w+)-(\w+)-(\w+)-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?:end
3 groups mandatory and another 7 optional because of the max 10 limit, but this doesn't look 'nice' and it would be a pain to write and track if the max limit was 100 and the matches were more complex. demo
And the best I could do so far:
start:(\w+)-((?1))-((?1))-?((?1))?-?((?1))?-?((?1))?-?((?1))?-?((?1))?:end
shorter especially if the matches are complex but still long. demo
Anyone managed to make it work as a 1 regex-only solution without programming?
I'm mostly interested on how can this be done in PCRE but other flavors would be ok too.
Update:
The purpose is to validate a match and capture individual tokens inside match 0 by RegEx alone, without any OS/Software/Programming-Language limitation
Update 2 (bounty):
With #nhahtdh's help I got to the RegExp below by using \G:
(?:start:(?=(?:[\w]+(?:-|(?=:end))){3,10}:end)|(?!^)\G-)([\w]+)
demo even shorter, but can be described without repeating code
I'm also interested in the ECMA flavor and as it doesn't support \G wondering if there's another way, especially without using /g modifier.
Read this first!
This post is to show the possibility rather than endorsing the "everything regex" approach to problem. The author has written 3-4 variations, each has subtle bug that are tricky to detect, before reaching the current solution.
For your specific example, there are other better solution that is more maintainable, such as matching and splitting the match along the delimiters.
This post deals with your specific example. I really doubt a full generalization is possible, but the idea behind is reusable for similar cases.
Summary
.NET supports capturing repeating pattern with CaptureCollection class.
For languages that supports \G and look-behind, we may be able to construct a regex that works with global matching function. It is not easy to write it completely correct and easy to write a subtly buggy regex.
For languages without \G and look-behind support: it is possible to emulate \G with ^, by chomping the input string after a single match. (Not covered in this answer).
Solution
This solution assumes the regex engine supports \G match boundary, look-ahead (?=pattern), and look-behind (?<=pattern). Java, Perl, PCRE, .NET, Ruby regex flavors support all those advanced features above.
However, you can go with your regex in .NET. Since .NET supports capturing all instances of that is matched by a capturing group that is repeated via CaptureCollection class.
For your case, it can be done in one regex, with the use of \G match boundary, and look-ahead to constrain the number of repetitions:
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end)
DEMO. The construction is \w+- repeated, then \w+:end.
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?!^)\G-)(\w+)
DEMO. The construction is \w+ for the first item, then -\w+ repeated. (Thanks to ka ᵠ for the suggestion). This construction is simpler to reason about its correctness, since there are less alternations.
\G match boundary is especially useful when you need to do tokenization, where you need to make sure the engine not skipping ahead and matching stuffs that should have been invalid.
Explanation
Let us break down the regex:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?<=-)\G
)
(\w+)
(?:-|:end)
The easiest part to recognize is (\w+) in the line before last, which is the word that you want to capture.
The last line is also quite easy to recognize: the word to be matched may be followed by - or :end.
I allow the regex to freely start matching anywhere in the string. In other words, start:...:end can appear anywhere in the string, and any number of times; the regex will simply match all the words. You only need to process the array returned to separate where the matched tokens actually come from.
As for the explanation, the beginning of the regex checks for the presence of the string start:, and the following look-ahead checks that the number of words is within specified limit and it ends with :end. Either that, or we check that the character before the previous match is a -, and continue from previous match.
For the other construction:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?!^)\G-
)
(\w+)
Everything is almost the same, except that we match start:\w+ first before matching the repetition of the form -\w+. In contrast to the first construction, where we match start:\w+- first, and the repeated instances of \w+- (or \w+:end for the last repetition).
It is quite tricky to make this regex works for matching in middle of the string:
We need to check the number of words between start: and :end (as part of the requirement of the original regex).
\G matches the beginning of the string also! (?!^) is needed to prevent this behavior. Without taking care of this, the regex may produce a match when there isn't any start:.
For the first construction, the look-behind (?<=-) already prevent this case ((?!^) is implied by (?<=-)).
For the first construction (?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end), we need to make sure that we don't match anything funny after :end. The look-behind is for that purpose: it prevents any garbage after :end from matching.
The second construction doesn't run into this problem, since we will get stuck at : (of :end) after we have matched all the tokens in between.
Validation Version
If you want to do validation that the input string follows the format (no extra stuff in front and behind), and extract the data, you can add anchors as such:
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G-)(\w+)
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G)(\w+)(?:-|:end)
(Look-behind is also not needed, but we still need (?!^) to prevent \G from matching the start of the string).
Construction
For all the problems where you want to capture all instances of a repetition, I don't think there exists a general way to modify the regex. One example of a "hard" (or impossible?) case to convert is when a repetition has to backtrack one or more loop to fulfill certain condition to match.
When the original regex describes the whole input string (validation type), it is usually easier to convert compared to a regex that tries to match from the middle of the string (matching type). However, you can always do a match with the original regex, and we convert matching type problem back to validation type problem.
We build such regex by going through these steps:
Write a regex that covers the part before the repetition (e.g. start:). Let us call this prefix regex.
Match and capture the first instance. (e.g. (\w+))
(At this point, the first instance and delimiter should have been matched)
Add the \G as an alternation. Usually also need to prevent it from matching the start of the string.
Add the delimiter (if any). (e.g. -)
(After this step, the rest of the tokens should have also been matched, except the last maybe)
Add the part that covers the part after the repetition (if necessary) (e.g. :end). Let us call the part after the repetition suffix regex (whether we add it to the construction doesn't matter).
Now the hard part. You need to check that:
There is no other way to start a match, apart from the prefix regex. Take note of the \G branch.
There is no way to start any match after the suffix regex has been matched. Take note of how \G branch starts a match.
For the first construction, if you mix the suffix regex (e.g. :end) with delimiter (e.g. -) in an alternation, make sure you don't end up allowing the suffix regex as delimiter.
Although it might theoretically be possible to write a single expression, it's a lot more practical to match the outer boundaries first and then perform a split on the inner part.
In ECMAScript I would write it like this:
'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end'
.match(/^start:([\w-]+):end$/)[1] // match the inner part
.split('-') // split inner part (this could be a split regex as well)
In PHP:
$txt = 'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end';
if (preg_match('/^start:([\w-]+):end$/', $txt, $matches)) {
print_r(explode('-', $matches[1]));
}
Of course you can use the regex in this quoted string.
"(?<a>\\w+)-(?<b>\\w+)-(?:(?<c>\\w+)" \
"(?:-(?<d>\\w+)(?:-(?<e>\\w+)(?:-(?<f>\\w+)" \
"(?:-(?<g>\\w+)(?:-(?<h>\\w+)(?:-(?<i>\\w+)" \
"(?:-(?<j>\\w+))?" \
")?)?)?" \
")?)?)?" \
")"
Is it a good idea? No, I don't think so.
Not sure you can do it in that way, but you can use the global flag to find all of the words between the colons, see:
http://regex101.com/r/gK0lX1
You'd have to validate the number of groups yourself though. Without the global flag you're only getting a single match, not all matches - change {3,10} to {1,5} and you get the result 'sir' instead.
import re
s = "start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end"
print re.findall(r"(\b\w+?\b)(?:-|:end)", s)
produces
['test', 'test', 'lorem', 'ipsum', 'sir', 'doloret', 'etc', 'etc', 'something']
When you combine:
Your observation: any kind of repitition of a single capture group will result in an overwrite of the last capture, thus returning only the last capture of the capture group.
The knowledge: Any kind of capturing based on the parts, instead of the whole, makes it impossible to set a limit on the amount of times the regex engine will repeat. The limit would have to be metadata (not regex).
With a requirement that the answer cannot involve programming (looping), nor an answer that involves simply copy-pasting capturegroups as you've done in your question.
It can be deduced that it cannot be done.
Update: There are some regex engines for which p. 1 is not necessarily true. In that case the regex you have indicated start:(?:(\w+)-?){3,10}:end will do the job (source).

Regex to match when a string is present twice

I am horrible at RegEx expressions and I just don't use them often enough for me to remember the syntax between uses.
I am using grepWin to search my files. I need to do a search that will return the files that have a given string twice.
So, for example, if I was searching on the word "how", then file one would not match:
Hello
how are you today?
but file two would:
Hello
how are you today?
I am fine, how are you?
Any one know how to make a RegEx that will match that?
something like this (depends on language and your specific task)
\(how.*){2}\
Edit:
according to #CodeJockey
\^(([^h]|h[^o]|ho[^w])*how([^h]|h[^o]|ho[^w])*){2,2}$\
(it become more complicated)
#CodeJockey: Thanks for comments
I don't know what grepWin supports, but here's what I came up with to make something match exactly twice.
/^((?!how).)*how((?!how).)*how((?!how).)*$/
Explanation:
/^ # start of subject
((?!how).)* # any text that does not contain "how"
how # the word "how"
((?!how).)* # any text that does not contain "how"
how # the word "how"
((?!how).)* # any text that does not contain "how"
$/ # end of subject
This ensures that you find two "how"s, but the texts between the "how"s and to either side of them do not contain "how".
Of course, you can substitute any string for "how" in the expression.
If you want to "simplify" by only writing the search expression twice, you can use backreferences thus:
/^(?:(?!how).)*(how)(?:(?!\1).)*\1(?:(?!\1).)*$/
Refiddle with this expression
Explanation:
I added ?: to make the negative lookaheads' text non-capturing. Then I added parentheses around the regular how to make that a capturing subpattern (the first and only one).
I had to include "how" again in the first lookahead because it's a negative lookahead (meaning any capture would not contain "how") and the captured "how" is not captured yet at that point.
This is significantly harder than I originally thought it would be, and requires variable-length lookbehind, which grepWin does not support...
this expression:
(?<!blah.{0,99999})blah(?=.*?blah)(?!.*blah.*blah)
was successfully used in Eclipse, using the "Search > File" dialog to exclude files with one and three instances of blah and to include files with exactly two instances of blah.
Eclipse does not permit a .* in lookbehind, so I used .{0,99999} instead.
It is possible, with the right tool, but It isn't pretty to get it to work with grepWin (see answer above). Can you use other tools (such as Eclipse) and what did you want to do with the files afterwards?
This works for grep || python, it will return a match only if "how" exists twice in a your_file:
grep "how.*how" your_file
in python (re imported):
re.search(r"how.*how","your_text")
It will return everything in between,(the dot means any character and the star means any number of characters), and you can customize your own script.

making a small regular expression a bit more readable

I've got a working regular expression, but I'd like to make it a tad more readable, and I'm far from a regex guru, so I was humbly hoping for some tips.
This is designed to scrape the output of several different compilers, linkers, and other build tools, and is used to build a nice little summery report. It does it's job great, but I'm left feeling like I wrote it in a clunky fashion, and I'd sooner learn than keep it the wrong way.
(.*?)\s?:?\s?(informational|warning|error|fatal error)?\s([A-Z]+[0-9][0-9][0-9][0-9]):\s(.*)$
Which, broken down simply, is as follows:
(.*?) # non-greedily match up until...
\s?:?\s? # we come across a possible " : "
(informational|warning|error|fatal error)? # possibly followed by one of these
\s([A-Z]+[0-9][0-9][0-9][0-9]):\s # but 100% followed by this alphanum
(.*)$ # and then capture the rest
I'm mostly interested in making the 2nd and 4th entry above more... beautiful. For some reason, the regex tester I was using (The Regulator) didn't match plain spaces, so I had to use the \s... but it is not meant to match any other whitespace.
Any schooling will be greatly appreciated.
The easiest way to make a long regex more readable is to use the "free-spacing" (or \x) modifier, which would let you write your regex just like you did in the second block of code -- it makes whitespace ignored. This isn't supported by all engines, however (according to the page linked above, .NET, Java, Perl, PCRE, Python, Ruby and XPath support it).
Note also that in free-spacing mode, you can use [ ] instead of \s if you want to only match a space character (unless you're using Java, in which case you have to use \ , which is an escaped space).
There's not really anything you can do for the second line, if you want each element to be optional independently of the other elements, but the fourth can be shortened:
\s([A-Z]+\d{4}):\s
\d is a shorthand class equivalent to [0-9], and {4} specifies that it should appear exactly four times.
The third line can be slightly shortened as well ((?:…) specifies a non-capturing group):
(informational|warning|(?:fatal )? error)?
From an efficiency standpoint, unless you actually need to capture subpatterns each time you use brackets, you can remove all of them, except for on the third line, where the group is needed for the alternation) -- but that one can be made non-capturing. Putting this all together you'd get:
.*?
\s?:?\s?
(?:informational|warning|(?:fatal )?error)?
\s[A-Z]+\d{4}:\s
.*$
Line 2
I think your regular expression doesn't match with the comment. You probably want this instead:
(\s:\s)?
To make it non-capturing:
(?:\s:\s)?
You should be able to use a literal space instead of \s. This must be a restriction in the tool you are using.
Line 4
[0-9][0-9][0-9][0-9] can be replaced with [0-9]{4}.
In some languages [0-9] is equivalent to \d.
Perhaps you can build the RE from sub-expressions, so that your end RE would look something like this:
/$preamble$possible_colon$keyword$alphanum$trailer/