Stop Regex Engine at first match [duplicate] - regex

I am learning about Cucumber's step definitions, which use regex. I came across the following two usages and would like to know whether there is any material difference between these approaches to capturing a group within a pair of double quotes:
approach one: "([^"]*)"
approach two: "(.*?)"
For example, consider the input string 'the output should be "pass!"'. Both approaches would capture pass!. Are there inputs where the two approaches capture differently, or are they equivalent?
Thanks

To the naked eye they look the same, but they are slightly different. Have a look at this example:
input:
a " regex
example is
here" please
Output for "([^"]*)":
regex
example is
here
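And the output for "(.*?)" is empty (no match).
.*? means any character except \n (zero or more times, as few as possible), and there are newlines between the quotes ("). To use this pattern here, we need to tell the regex engine to let the dot match newlines. That is the single-line ("dotall") mode, not the multiline mode, which only changes the meaning of ^ and $.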

"([^"]*)" will also capture newlines, so if you have
"Something
that goes on two lines"
then it will match it.
"(.*?)" does not span newlines, so it will not match that phrase, unless you use the single-line modifier (?s), in which case . also matches newline characters. The expression (?s)"(.*?)" would then match and capture.
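The newline behaviour described above can be checked with a short script. This is a minimal sketch in Python's re module (an assumption; the thread is about Cucumber step definitions, and other engines may name the flag differently), run on the sample input from the first answer:

```python
import re

# Sample input from the answer above: the quoted part spans three lines.
text = 'a " regex\nexample is\nhere" please'

# The negated class [^"] matches any character except a quote, newlines included.
negated = re.search(r'"([^"]*)"', text)

# By default the dot does not match \n, so no quoted span can be completed.
lazy = re.search(r'"(.*?)"', text)

# With the inline (?s) flag the dot also matches \n.
lazy_dotall = re.search(r'(?s)"(.*?)"', text)

print(repr(negated.group(1)))
print(lazy)
print(repr(lazy_dotall.group(1)))
```

With the flag set, both patterns capture the same text here; without it, only the negated character class finds a match.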

Difference between "(.*?)" and "([^"]*)"
It depends upon where this regex fragment appears within the larger context of the overall pattern. It also depends upon the target string that is being searched. For example, given the following input string:
'foo "quote1" bar "quote2"'
The expression: /"(.*?)"$/ (note the added end of string anchor) will match: "quote1" bar "quote2" but the /"([^"]*)"$/ expression will match: "quote2".
The dot will match a double quote if it has to in order to get a successful overall match.
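A small sketch of the anchored case, again assuming Python's re module rather than any particular Cucumber implementation. It uses the input string from this answer and shows how far each pattern reaches once $ is added:

```python
import re

text = 'foo "quote1" bar "quote2"'

# The lazy dot starts at the first quote and is forced to run across the
# inner quotes, because that is the only way to reach the end of the string.
lazy = re.search(r'"(.*?)"$', text)

# The negated class can never cross a quote, so only the last quoted
# span satisfies the $ anchor.
negated = re.search(r'"([^"]*)"$', text)

print(lazy.group(0))     # whole match of the lazy version
print(negated.group(0))  # whole match of the negated-class version
```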

Related

How can I get the second part of a hyphenated word using regex?

For example, I have the word: sh0rt-t3rm.
How can I get the t3rm part using perl regex?
I could get sh0rt by using [(a-zA-Z0-9)+]\[-\], but \[-\][(a-zA-Z0-9)+] doesn't work to get t3rm.
The syntax used for the regex is not correct for getting either sh0rt or t3rm.
You swapped the square brackets and the parentheses, and the hyphen does not have to be inside square brackets.
To get sh0rt in sh0rt-t3rm you might use, for example, one of:
\b([a-zA-Z0-9]+)-
  \b is a word boundary to prevent a partial word match; the value is in capture group 1.
\b[a-zA-Z0-9]+(?=-)
  Match the allowed chars in the character class, and assert a - to the right using a positive lookahead (?=-).
To get t3rm in sh0rt-t3rm you might use for example one of:
-([a-zA-Z0-9]+)\b
  The other way around, with a leading -; get the value from capture group 1.
-\K[a-zA-Z0-9]+\b
  Match - and use \K to keep out what is matched so far. Then match 1 or more times the allowed chars in the character class.
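For comparison, here is a rough sketch of the same ideas in Python rather than Perl. Python's standard re module does not support \K, so the last pattern swaps in a lookbehind (?<=-) as a stand-in; that substitution is an assumption of this sketch, not part of the answer above.

```python
import re

word = 'sh0rt-t3rm'

# Part before the hyphen: capture group, then positive lookahead.
first_by_group = re.search(r'\b([a-zA-Z0-9]+)-', word).group(1)
first_by_lookahead = re.search(r'\b[a-zA-Z0-9]+(?=-)', word).group(0)

# Part after the hyphen: capture group, then lookbehind in place of \K.
second_by_group = re.search(r'-([a-zA-Z0-9]+)\b', word).group(1)
second_by_lookbehind = re.search(r'(?<=-)[a-zA-Z0-9]+\b', word).group(0)

print(first_by_group, first_by_lookahead)
print(second_by_group, second_by_lookbehind)
```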
If your whole target string is literally just sh0rt-t3rm then you want all that comes after the -.
So the barest and minimal version, cut precisely for this description, is
my ($capture) = $string =~ /-(.+)/;
We need parentheses on the left-hand side to make the regex run in list context, because that is when it returns the matches (otherwise it returns true/false, normally 1 or '').
But what if the preceding text may itself contain a -? Then make sure to match everything up to that last -
my ($capture) = $string =~ /.*-(.+)/;
Here the "greedy" nature of the * quantifier makes the previous . match all it possibly can so that the whole pattern still matches; thus it goes up until the very last -.
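The effect of the greedy .* can be seen by comparing the two patterns on a string with several hyphens. The following is a small Python rendering of the idea (the answer itself is in Perl), and the input string is made up for illustration:

```python
import re

# Hypothetical input with more than one hyphen.
s = 'long-and-sh0rt-t3rm'

# Without a leading .*, the match starts at the first hyphen.
after_first = re.search(r'-(.+)', s).group(1)

# The greedy .* consumes as much as it can, so the hyphen that is
# matched literally is the last one in the string.
after_last = re.search(r'.*-(.+)', s).group(1)

print(after_first)
print(after_last)
```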
There are of course many other variations in what the data may look like, other than a single hyphenated word. In particular, if it is part of a larger text, you may want to include word boundaries
my ($capture) = $string =~ /\b.*?-(.+?)\b/;
Here we also need to adjust our "wild-card"-like pattern .+ by limiting it using ? so that it is not greedy. This matches the first such hyphenated word in the $string. But if indeed only "word" characters fly then we can just use \w (instead of . and word-boundary anchors)
my ($capture) = $string =~ /\w*?-(\w+)/;
Note that \w matches [a-zA-Z0-9_] only, which excludes some characters that may appear in normal text (English, not to mention all other writing systems).
But this is clearly getting pickier and pickier, and it would need careful inspection and testing, and more complete knowledge of what the data may look like.
Perl offers its own tutorial, perlretut, and the main full reference is perlre.
-([a-zA-Z0-9]+) will match a - followed by a word, with just the word being captured.

Regex: ignore characters that follow

I'd like to know how I can ignore characters that follow a particular pattern in a regex.
I tried positive lookaheads, but they do not work because they preserve those characters for other matches, while I want them to be just... discarded.
For example, a part of my regex is: (?<DoubleQ>\"\".*?\"\")|(?<SingleQ>\".*?\")
in order to match some "key-parts" of this string:
This is a ""sample text"" just for "testing purposes": not to be used anywhere else.
I want to capture the entire ""sample text"", but then I want to "extract" only sample text and the same with testing purposes. That is, I want the group to match to be ""sample text"", but then I want the full match to be sample text. I partially achieved that with the use of the \K option:
(?<DoubleQ>\"\"\K.*?\"\")|(?<SingleQ>\"\K.*?\")
Which ignores the first "" (or ") from the full match but takes it into account when matching the group. How can I ignore the following "" (")?
Note: positive lookahead does not work: it does not ignore characters from the following matches, it just does not include them in the current match.
Thanks a lot.
I hope I got your question right. So you want to match the whole string including the quotes, but replace/extract only the expression without the quotes, right?
You can typically use the regex replace functionality to extract just a part of the match.
This is the regex:
""?(.*?)""?
And this is the replace expression:
$1
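As a rough illustration of this suggestion, here is a sketch in Python (the question does not name a language, so that choice is an assumption; Python writes the replacement as \1 where other tools use $1). The full match still includes the quotes, while group 1 holds only the inner text:

```python
import re

text = ('This is a ""sample text"" just for "testing purposes": '
        'not to be used anywhere else.')

pattern = re.compile(r'""?(.*?)""?')

# Full matches keep the surrounding quotes; group 1 drops them.
full = [m.group(0) for m in pattern.finditer(text)]
inner = [m.group(1) for m in pattern.finditer(text)]

# Replacing each match with its group 1 strips the quotes in place.
stripped = pattern.sub(r'\1', text)

print(full)
print(inner)
print(stripped)
```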

RegEx: Match everything up to the last space without including it

I'd like to match everything in a string up to the last space but without including it. For the sake of example, I would like to match the RENATA T. part of the following:
RENATA T. GROCHAL
So far I have ^(.+\s)(.+). However, it matches the last space and I don't want it to. The regex should also work for languages other than English, as mine does.
EDIT: I didn't mention that the second capturing group should not contain a space – it should be GROCHAL not GROCHAL with a space before it.
EDIT 2: My new RegEx based on what the two answers have provided is: ^((.+)(?=\s))\s(.+) and the RegEx used to replace the matches is \3, \1. It produces the expected result:
GROCHAL, RENATA T.
Any improvements would be desirable.
^(.+)\s(.+)
with substitution string:
\2, \1
Update:
Another version that can collapse extra spaces between the 2 capturing groups:
^(.+?)\s+(\S+)$
Use a positive lookahead assertion:
^(.+)(?=\s)
Capturing group 1 will contain the match.
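The patterns from the answers above can be tried side by side. This is a minimal sketch assuming Python's re module (the question does not say which engine is used); the input with extra spaces is made up to exercise the \s+ variant:

```python
import re

name = 'RENATA T. GROCHAL'

# Greedy (.+) backtracks to the last whitespace, so the split
# happens at the final space.
swapped = re.sub(r'^(.+)\s(.+)', r'\2, \1', name)

# The lazy/anchored variant also collapses a run of spaces.
collapsed = re.sub(r'^(.+?)\s+(\S+)$', r'\2, \1', 'RENATA T.   GROCHAL')

# The lookahead variant matches up to, but not including, the last space.
before_last_space = re.search(r'^(.+)(?=\s)', name).group(1)

print(swapped)
print(collapsed)
print(before_last_space)
```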
I like using named capturing groups:
rawName = RENATA T. GROCHAL
RegexMatch(rawName, "O)^(?P<firstName>.+)\s(?P<lastName>.+)", match)
MsgBox, % match.lastName ", " match.firstName

Regex - how to match everything except a particular pattern

How do I write a regex to match any string that doesn't meet a particular pattern? I'm faced with a situation where I have to match an (A and ~B) pattern.
You could use a look-ahead assertion:
(?!999)\d{3}
This example matches three digits other than 999.
But if you happen not to have a regular expression implementation with this feature (see Comparison of Regular Expression Flavors), you probably have to build a regular expression with the basic features on your own.
A compatible regular expression with basic syntax only would be:
[0-8]\d\d|\d[0-8]\d|\d\d[0-8]
This also matches any three-digit sequence other than 999.
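The claim that the two expressions accept the same three-digit strings can be checked exhaustively. A small sketch in Python, assuming whole-string matching with fullmatch:

```python
import re

lookahead = re.compile(r'(?!999)\d{3}')
basic = re.compile(r'[0-8]\d\d|\d[0-8]\d|\d\d[0-8]')

# Every three-digit string from 000 to 999.
candidates = [f'{n:03d}' for n in range(1000)]

accepted_by_lookahead = {c for c in candidates if lookahead.fullmatch(c)}
accepted_by_basic = {c for c in candidates if basic.fullmatch(c)}

# Both sets should contain everything except '999'.
print(accepted_by_lookahead == accepted_by_basic)
print(sorted(set(candidates) - accepted_by_basic))
```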
Suppose you want to match lines that contain a word A but do not contain a word B. For example, given the text:
1. I have a two pets - dog and a cat
2. I have a pet - dog
If you want to find the lines that HAVE a dog for a pet and DON'T have a cat, you can use this regular expression:
^(?=.*?\bdog\b)((?!cat).)*$
It will find only the second line:
2. I have a pet - dog
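A quick way to try this expression is to run it over both sample lines. The sketch below assumes Python's re module and tests each line separately, so no multiline flag is needed:

```python
import re

lines = [
    '1. I have a two pets - dog and a cat',
    '2. I have a pet - dog',
]

# The lookahead requires the word "dog" somewhere in the line; the
# repeated ((?!cat).) consumes one character at a time and refuses to
# step onto any position where "cat" begins.
pattern = re.compile(r'^(?=.*?\bdog\b)((?!cat).)*$')

matching = [line for line in lines if pattern.search(line)]
print(matching)
```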
Match against the pattern and use the host language to invert the boolean result of the match. This will be much more legible and maintainable.
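As a sketch of that approach, the (A and ~B) condition can be written as two plain searches combined in code. Python is assumed here, and the helper name has_dog_but_no_cat is hypothetical, reusing the dog/cat example from the earlier answer:

```python
import re

def has_dog_but_no_cat(line: str) -> bool:
    """Return True when the line matches A (dog) and does not match B (cat)."""
    has_dog = re.search(r'\bdog\b', line) is not None
    has_cat = re.search(r'cat', line) is not None
    # The inversion happens in the host language, not inside the regex.
    return has_dog and not has_cat

lines = [
    '1. I have a two pets - dog and a cat',
    '2. I have a pet - dog',
]
kept = [line for line in lines if has_dog_but_no_cat(line)]
print(kept)
```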
notnot, resurrecting this ancient question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
I'm faced with a situation where I have to match an (A and ~B) pattern.
The basic regex for this is frighteningly simple: B|(A)
You just ignore the overall matches and examine the Group 1 captures, which will contain A.
An example (with all the disclaimers about parsing HTML with regex): A is digits, B is digits within an <a> tag
The regex: <a.*?<\/a>|(\d+)
Demo (look at Group 1 in the lower right pane)
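The B|(A) trick can be sketched as follows, assuming Python's re module. The HTML snippet is invented for illustration; matches of the unwanted alternative leave Group 1 empty and are simply skipped:

```python
import re

# Hypothetical input: digits both inside and outside an <a> element.
html = 'call 555 or see <a href="/page/42">item 7</a>, then 99'

# Left alternative (B): swallow a whole <a ...>...</a> element.
# Right alternative (A): capture a run of digits in group 1.
pattern = re.compile(r'<a.*?</a>|(\d+)')

digits_outside_links = [
    m.group(1) for m in pattern.finditer(html) if m.group(1) is not None
]
print(digits_outside_links)
```

The digits 42 and 7 are consumed by the left alternative, so they never reach Group 1.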
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
The complement of a regular language is also a regular language, but to construct it you have to build the DFA for the regular language, and make any valid state change into an error. See this for an example. What the page doesn't say is that it converted /(ac|bd)/ into /(a[^c]?|b[^d]?|[^ab])/. The conversion from a DFA back to a regular expression is not trivial. It is easier if you can use the regular expression unchanged and change the semantics in code, as suggested before.
If the pattern is re, then
str.split(/re/g)
will return everything except the pattern.
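The same idea in Python, as a minimal sketch: re.split plays the role of the JavaScript split call above. The pattern and input string are borrowed from another answer in this thread purely for illustration:

```python
import re

text = '50% of 50% is 25%'

# Split on the pattern to keep only what lies between its matches.
# Empty strings appear where a match touches the start or end of the text.
pieces = re.split(r'\d+%', text)
non_empty = [p for p in pieces if p]

print(pieces)
print(non_empty)
```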
My answer here might solve your problem as well:
https://stackoverflow.com/a/27967674/543814
Instead of Replace, you would use Match.
Instead of group $1, you would read group $2.
Group $2 was made non-capturing there, which you would avoid.
Example:
Regex.Match("50% of 50% is 25%", @"(\d+\%)|(.+?)");
The first capturing group specifies the pattern that you wish to avoid. The last capturing group captures everything else. Simply read out that group, $2.
(B)|(A)
then use what group 2 captures...