qt regular expression look behind - regex

my regular expression :(?<=defining\s)[^;]*
should try to find var in the following cases:
defining var;
some text defining var;
I tested the regular expression using some online tools. Unfortunately it does not work with Qt for some reason. Is there something wrong with the regular expression or should I look for the error somewhere else in the code? It is strange because regular expressions without lookbehind work.
Extra:
A bit off topic, but since I'm already writing the question: how do I have to change my regular expression so that it finds var in the following case, with more than one whitespace character:
some text defining var;
Right now it finds all but one whitespace and var.

You can match the context to the left of the value you want to obtain and capture the latter into a capturing group:
QRegularExpression regex("defining\\s+([^;]+)");
QRegularExpressionMatch match = regex.match(str);
QString textYouWant = match.captured(1);
Here, defining\\s+([^;]+) matches defining, then 1+ whitespace chars, and then captures 1+ chars other than ; into Group 1 (that you can access using .captured(1)).
See this regex demo.
Note that QRegularExpression is PCRE-powered, so you may use the PCRE \K operator to get the value you need in the match itself:
QRegularExpression regex("defining\\s+\\K[^;]+");
                                      ^^^
The \K operator discards the text matched so far, so you will only get the text matched with [^;]+ after the match is returned.
See another regex demo.
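For comparison, here is a minimal sketch of the capture-group approach in Python rather than Qt. Python's built-in re module does not support \K, so only the capturing-group variant is shown. The third sample string (with several spaces) is an assumption added to illustrate the "more than one whitespace" case from the question.

```python
import re

# Match "defining", one or more whitespace characters, then capture
# everything up to (but not including) the next semicolon.
pattern = re.compile(r"defining\s+([^;]+)")

samples = [
    "defining var;",
    "some text defining var;",
    "some text defining    var;",  # assumed sample with extra spaces
]

# group(1) holds only the captured name, without the leading whitespace.
results = [pattern.search(s).group(1) for s in samples]
print(results)
```

Because \s+ consumes all of the whitespace before the group starts, the captured value has no leading spaces.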

Related

With regex, how to select first 3 words (commas/other characters included)?

Practicing some regex.
Trying to only get Regular, Expressions, and abbreviated
from the below data
Regular Expressions, abbreviated as Regex or Regexp, are a string of characters created within the framework of Regex syntax rules.
With (\w+\S?), I get all words including a nonwhitespace character if present.
How would I get just Regular, Expressions, and abbreviated?
Edit:
To clarify, I'm looking for
Regular, Expressions, and abbreviated as separate matches, without spaces
not Regular Expressions, abbreviated as a single match (spaces included here)
Regex can't "select". It can only match and capture.
This captures the first 3 words (including optional trailing comma) as groups 1, 2 and 3:
^(\w+,?)\s+(\w+,?)\s+(\w+,?)
See live demo.
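As a rough illustration, the three-group pattern above can be tried in Python like this. The variable names are made up for the example; the sample text is the sentence from the question.

```python
import re

text = ("Regular Expressions, abbreviated as Regex or Regexp, are a string "
        "of characters created within the framework of Regex syntax rules.")

# Three groups, each a word with an optional trailing comma,
# separated by whitespace that stays outside the groups.
m = re.search(r"^(\w+,?)\s+(\w+,?)\s+(\w+,?)", text)
first_three = m.groups()
print(first_three)
```

Note that the comma after "Expressions" is part of group 2, since ,? sits inside the parentheses.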
As @Bohemian has pointed out, a regex cannot select; it can only match and capture. If the regex implementation you use supports it, the captured groups are returned as part of the match. In JavaScript, for example, this gives you the results separately.
Capturing groups are created by wrapping in parentheses the part of the match that you want to extract.
To match those three specific words the regex would be the following
/(Regular) (Expressions), (abbreviated)/
Note that the words you care about are inside the parentheses, while the parts of the string you don't want (like spaces and commas) are outside them.
You would use it like this (javascript code)
const string = "Regular Expressions, abbreviated as Regex or Regexp, are a string of characters created within the framework of Regex syntax rules."
const regex = /(Regular) (Expressions), (abbreviated)/;
string.match(regex); // returns [ "Regular Expressions, abbreviated", "Regular", "Expressions", "abbreviated" ]
Note that in the result the first element is the whole match, and the 2nd, 3rd and 4th elements are your capture groups, which you can use as if you had selected them from the string.
To match any three words separated by a space or comma you could use
/(\w+),?\s?(\w+),?\s?(\w+),?\s?/
\w matches a word character (letter, digit or underscore)
\s matches a whitespace character
? indicates that there may be 0 or 1 occurrences of the preceding element
and finally the parentheses group each word and leave out everything else, the same as in the example above
You would use it like this (javascript code)
const string = "Regular Expressions, abbreviated as Regex or Regexp, are a string of characters created within the framework of Regex syntax rules."
const regex = /(\w+),?\s?(\w+),?\s?(\w+),?\s?/;
string.match(regex); // returns [ "Regular Expressions, abbreviated ", "Regular", "Expressions", "abbreviated" ] (the final \s? also consumes the space after "abbreviated")

Combine two regular expressions with a logical "and" operator

I am trying to build a combined regular expression, but I don't know how to combine the two sub expressions
I have an input string like this: 4711_001.doc
I want to match the following: 4711.doc
I am able to match 4711 with this expression: [^\_\.]*
I am able to match .doc with this expression: \.[^.]+
Is there some kind of logical AND to combine the two expressions and match 4711.doc? What would the expression look like?
You can use groups to do it in one regular expression. Check out this code for reference:
import re
s = "4711_001.doc"
match = re.search(r"(.+?)_\d+(\..+)", s)
print(match.group(1) + match.group(2))
Output:
4711.doc
Another possibility would be to match the part you don't want:
_\d+
And replace this with "":
import re
s = "4711_001.doc"
result = re.sub(r"_\d+", "", s)
print(result)
See the online demo
For this example string 4711_001.doc, using [^_.]* and \.[^.]+ is quite a broad match as it can match any character except what is listed in the character class.
Perhaps you could make the pattern a bit more specific, matching digits at the start and word characters as the extension.
In the replacement use capture group 1 and 2, often denoted as $1$2 or \1\2
(\d+)_\d+(\.\w+)
Regex demo
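Since no language was tagged in the question, the following is only a sketch of the two-group replacement in Python, where the replacement string uses the \1\2 form of the group references.

```python
import re

s = "4711_001.doc"

# Group 1: leading digits; group 2: the extension.
# The "_001" part is matched but not captured, so it is dropped.
result = re.sub(r"(\d+)_\d+(\.\w+)", r"\1\2", s)
print(result)
```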
There is no language tagged, but if for example \K is supported to clear the match buffer this might also be an option (including the parts that you tried)
In the replacement use an empty string.
[^_.]*\K_[^._]+(?=\.[^.]+$)
In parts
[^_.]*\K Match the part before the underscore, then forget what is matched so far
_[^._]+ Match the underscore, followed by 1+ chars other than . and _
(?=\.[^.]+$) A positive lookahead assertion to make sure what is at the right is a . followed by any char other than a . until the end of the string.
Regex demo

How to exclude the beginning string in regex match

I have a string containing the following variable: "nonce=1ff7de7518b9a52080489ecd7629796d&". How do I get the value between the equals sign and the "&" with a regular expression? I have tried nonce=(.*?).+?(?=&); the lookahead at the end excludes the "&", but I could not exclude "nonce=".
Note: trying to match the value between "=" and "&" will not work as there are many "=" and "&" characters which will result in more than 1 match, the unique string is "nonce"
here is an example https://regexr.com/48vmd
You can use nonce=([^&]+) to match and capture your intended string in group 1.
Here nonce= matches literally, and then ([^&]+) matches all text before the & and captures it in group 1.
Demo
In case your regex flavor supports \K match reset operator, you can use this regex nonce=\K[^&]+ to have your intended text as full match without requiring any group text capture.
Demo without any grouped capture
If you're using Java, which supports lookbehind, you can use this regex:
(?<=nonce=)[^&]+
Demo using look behind
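The capture-group and lookbehind variants can be compared side by side. This is a small Python sketch (Python's re module accepts fixed-width lookbehinds such as this one); the surrounding a=1& and &b=2 parameters are hypothetical, added only to show that other = and & characters do not interfere.

```python
import re

# Hypothetical query string; only the nonce value comes from the question.
query = "a=1&nonce=1ff7de7518b9a52080489ecd7629796d&b=2"

# Variant 1: match "nonce=" literally and capture the value in group 1.
captured = re.search(r"nonce=([^&]+)", query).group(1)

# Variant 2: a fixed-width lookbehind, so the whole match is the value.
looked_behind = re.search(r"(?<=nonce=)[^&]+", query).group(0)

print(captured)
print(looked_behind)
```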
If you're looking for the regular expression it would be as simple as nonce=(\w+)&
Demo (assumes RegExp Tester mode of the View Results Tree listener)
An even easier way would be to use the Boundary Extractor, which extracts everything between the given "left" and "right" boundaries.

conditional group matching using regex

How do I match a group except if it starts with a certain character?
e.g. I have the following sentence:
just _checking any _string.
I have the regex ([\w]+) which matches all the words {just, _checking, any, _string}. But what I want is to match all the words that don't start with the character _, i.e. {just, any}.
The above example is a watered down version of what I'm actually trying to parse.
I'm parsing a code file, which contains string in the following format :
package1.class1<package2.class2 <? extends package3.class3> , package4.class4 <package5.package6.class5<?>.class6.class7<class8> >.class9.class10
The output that I require is a match result containing all the fully qualified names (having at least one . in the middle), but matching should stop when a < is encountered.
So, the result should be :
{ package1.class1, package2.class2, package3.class3, package4.class4, package5.package6.class5 }
I wrote ([\w]+\.)+([\w]+) to parse it but it also matches class6.class7 and class9.class10 which I don't want. I know it's way off the mark and I apologize for that.
Hence, I earlier asked if I can ignore a capture group starting from a specific character.
Here's the link where I tried : regex101
there everything that it is matching is correct except the part where it matches class6.class7 and class9.class10.
I'm not sure how to proceed on this. I'm using C++14 and it supports ECMAScript grammar as well along with POSIX style.
EDIT : as suggested by @Corion, I've added more details.
EDIT2 : added regex101 link
Just use a word boundary \b and make sure that the first character is not an underscore (but still a word character):
(\b(?=[^_])[\w]+)
Using the following Perl script to validate that:
perl -wlne "print qq(Matched <$_>) for /(\b(?=[^_])[\w]+)/g"
Matched <just>
Matched <any>
regex101 playground
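The same check can be sketched in Python instead of Perl. The pattern is copied unchanged from above; re.findall returns the contents of the single capture group for each match.

```python
import re

sentence = "just _checking any _string."

# \b anchors at a word boundary; the lookahead rejects a leading underscore.
# Inside "_checking" there is no boundary between "_" and "c" (both are
# word characters), so "checking" cannot start a match either.
words = re.findall(r"(\b(?=[^_])[\w]+)", sentence)
print(words)
```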
In response to the expansion of the question in the comment, the following regular expression will also capture dots in the "middle" of the word (but still disallow them at the start of a word):
(\b(?=[^_.])[\w.]+)
perl -wlne "print qq(Matched <$_>) for /(\b(?=[^_.])[\w.]+)/g"
just _checking any _string. and. this. inclu.ding dots
Matched <just>
Matched <any>
Matched <and.>
Matched <this.>
Matched <inclu.ding>
Matched <dots>
regex101 playground
After the third expansion of the question, I've expanded the regular expression to match the class names but exclude the extends keyword, and to start a new match only at the beginning of the string or after whitespace (\s) or an angle bracket (< or >). The fully qualified matches are achieved by forcing a dot ( \. ) to appear in the match:
(?:^|[<>\s])(?:(?![_.]|\bextends\b)([\w]+\.[\w.]+))
perl -nwle "print qq(Matched <$_>) for /(?:^|[<>\s])(?:(?![_.]|\bextends\b)([\w]+\.[\w.]+))/g"
Matched <package1.class1>
Matched <package2.class2>
Matched <package3.class3>
Matched <package4.class4>
Matched <package5.package6.class5>
regex 101 playground
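The question targets C++14, but as a quick cross-check the final pattern can also be run in Python. This is only a sketch: the pattern is copied verbatim, and the input is the type string from the question, split across lines for readability.

```python
import re

source = ("package1.class1<package2.class2 <? extends package3.class3> , "
          "package4.class4 <package5.package6.class5<?>.class6.class7<class8> >"
          ".class9.class10")

# A match may only start at the beginning of the string or right after
# "<", ">" or whitespace, and must not begin with "_", "." or "extends".
pattern = re.compile(r"(?:^|[<>\s])(?:(?![_.]|\bextends\b)([\w]+\.[\w.]+))")

names = pattern.findall(source)
print(names)
```

The names class6.class7 and class9.class10 are skipped because each directly follows a dot after a >, which the negative lookahead rejects.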

How to get first match of string by Regular Expression?

I have the following text string:
$ABCD(file="somefile.txt")$' />Some more text followed by a dollar like this one)$. Some more random text
I am trying to match the $ABCD(file="somefile.txt")$ part of the string using a regular expression.
I am using this (?=[$]ABCD[(]file=).*(?<=[)][$]) regular expression pattern to make the intended match. It's not working as expected because I am getting a match all the way to the second )$ in the string.
For example, the match will be as follows:
$ABCD(file="somefile.txt")$' />Some more text followed by a dollar like this one)$
How should I modify the pattern to match to the end of the first occurrence of the )$?
Here is a good online regular expression engine tester:
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
Try appending a ? to the greedy *:
(?=[$]ABCD[(]file=).*?(?<=[)][$])
Lazy quantification
The standard quantifiers in regular expressions are greedy, meaning they match as much as they can. Modern regular expression tools allow a quantifier to be specified as lazy (also known as non-greedy, reluctant, minimal, or ungreedy) by putting a question mark after the quantifier.
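The difference between the greedy and lazy versions can be seen in a short Python sketch (the question appears to use .NET; Python is used here only for illustration, and both patterns are copied from above).

```python
import re

text = ('$ABCD(file="somefile.txt")$\' />Some more text followed by '
        'a dollar like this one)$. Some more random text')

# Greedy: .* grabs as much as possible, so the match ends at the LAST ")$".
greedy = re.search(r'(?=[$]ABCD[(]file=).*(?<=[)][$])', text).group(0)

# Lazy: .*? grabs as little as possible, so the match ends at the FIRST ")$".
lazy = re.search(r'(?=[$]ABCD[(]file=).*?(?<=[)][$])', text).group(0)

print(greedy)
print(lazy)
```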
You could just use this:
\$ABCD\(file="[a-z.]+"\)\$
to get $ABCD(file="somefile.txt")$.
Your problem was the .* bit, it was too general and thus matched everything up to the last $.
I would advise you to use the second quote to define the end of the searched pattern: [^"]* will match anything except ".
So the pattern for the file name would be: \$ABCD\(file="([^"]*)
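A minimal Python sketch of that negated-character-class pattern, assuming the goal is to pull out just the file name as group 1:

```python
import re

text = ('$ABCD(file="somefile.txt")$\' />Some more text followed by '
        'a dollar like this one)$. Some more random text')

# [^"]* stops at the closing quote, so no lazy quantifier is needed.
file_name = re.search(r'\$ABCD\(file="([^"]*)', text).group(1)
print(file_name)
```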