QRegularExpression find and capture all quoted and non-quoted parts in string - c++

I am fairly new to using regexes.
I got a string which can contain quoted and not quoted substrings.
Here are examples of how they could look:
"path/to/program.exe" -a -b -c
"path/to/program.exe" -a -b -c
path/to/program.exe "-a" "-b" "-c"
path/to/program.exe "-a" -b -c
My regex looks like this: (("[^"]+")|([^"\t ]+))+
With ("[^"]+") I attempt to find every quoted substring and capture it.
With ([^"\t ]+) I attempt to find every substring without quotes.
My code to test this behaviour looks like this:
QString toMatch = R"del( "path/to/program.exe" -a -b -c)del";
qDebug() << "String to Match against: " << toMatch << "\n";
QRegularExpression re(R"del((("[^"]+")|([^"\t ]+))+)del");
QRegularExpressionMatchIterator it = re.globalMatch(toMatch);
int i = 0;
while (it.hasNext())
{
QRegularExpressionMatch match = it.next();
qDebug() << "iteration: " << i << " captured: " << match.captured(i) << "\n";
i++;
}
Output:
String to Match against: " \"path/to/program.exe\" -a -b -c"
iteration: 0 captured: "\"path/to/program.exe\""
iteration: 1 captured: "-a"
iteration: 2 captured: ""
iteration: 3 captured: "-c"
Testing it in Regex101 shows me the result I want.
I also tested it on some other websites, e.g. this.
I guess I am doing something wrong; could anyone point me in the right direction?
Thanks in advance.

You assume that the IDs of the groups you need will shift with each new match, while in fact all group IDs are fixed by the pattern itself: captured(1), captured(2) and captured(3) always refer to the same pairs of parentheses, no matter which match you are looking at.
I suggest removing all groups and just extracting the whole match value:
QString toMatch = R"del( "path/to/program.exe" -a -b -c)del";
qDebug() << "String to Match against: " << toMatch << "\n";
QRegularExpression re(R"del("[^"]+"|[^"\s]+)del");
QRegularExpressionMatchIterator it = re.globalMatch(toMatch);
while (it.hasNext())
{
QRegularExpressionMatch match = it.next();
qDebug() << " matched: " << match.captured(0) << "\n";
}
Note the "[^"]+"|[^"\s]+ pattern matches either
"[^"]+" - ", then one or more chars other than " and then a "
| - or
[^"\s]+ - one or more chars other than " and whitespace.
See the updated pattern demo.
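To see the fixed group numbering outside of Qt, here is a minimal sketch using Python's re module as a stand-in (an assumption on my part: both engines treat these two simple patterns the same way). The variable names are made up for illustration.

```python
import re

to_match = ' "path/to/program.exe" -a -b -c'

# Original pattern: group 2 is always the quoted alternative and group 3 is
# always the unquoted one, whichever match we are currently looking at.
grouped = re.compile(r'(("[^"]+")|([^"\t ]+))+')
per_match_groups = [(m.group(2), m.group(3)) for m in grouped.finditer(to_match)]

# Simplified pattern without groups: the whole match (group 0) is the token.
simple = re.compile(r'"[^"]+"|[^"\s]+')
tokens = [m.group(0) for m in simple.finditer(to_match)]

print(per_match_groups)
print(tokens)
```

The group that did not take part in a match comes back as None in Python, which is the counterpart of the empty string seen in the question's output.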

Related

Regex to match strings not enclosed in macro

In a development context, I would like to make sure all strings in source files within certain directories are enclosed in some macro "STR_MACRO". For this I will be using a Python script parsing the source files, and I would like to design a regex for detecting non-commented lines with strings not enclosed in this macro.
For instance, the regex should match the following strings:
std::cout << "Hello World!" << std::endl;
load_file("Hello World!");
But not the following ones:
std::cout << STR_MACRO("Hello World!") << std::endl;
load_file(STR_MACRO("Hello World!"));
// "foo" bar
Excluding commented lines containing strings seems to work well using the regex ^(?!\s*//).*"([^"]+)". However, when I try to also exclude non-commented strings already enclosed in the macro, using the regex ^(?!\s*//).*(?!STR_MACRO\()"([^"]+)", nothing changes (seemingly due to the opening parenthesis after STR_MACRO).
Any hints on how to achieve this?
With the PyPI regex module (which you can install with pip install regex in the terminal) you can use
import regex
pattern = r'''(?:^//.*|STR_MACRO\("[^"\\]*(?:\\.[^"\\]*)*"\))(*SKIP)(*F)|"[^"\\]*(?:\\.[^"\\]*)*"'''
text = r'''For instance, the regex should match the following strings:
std::cout << "Hello World!" << std::endl;
load_file("Hello World!");
But not the following ones:
std::cout << STR_MACRO("Hello World!") << std::endl;
load_file(STR_MACRO("Hello World!"));
// "foo" bar'''
print( regex.sub(pattern, r'STR_MACRO(\g<0>)', text, flags=regex.M) )
Details:
(?:^//.*|STR_MACRO\("[^"\\]*(?:\\.[^"\\]*)*"\))(*SKIP)(*F) - // at the line start and the rest of the line, or STR_MACRO( + a double quoted string literal pattern + ), and then the match is skipped, and the next match search starts at the failure location
| - or
"[^"\\]*(?:\\.[^"\\]*)*" - a ", zero or more chars other than " and \, then zero or more repetitions of a \ followed by any single char and then zero or more chars other than " and \, and then a " char
See the Python demo. Output:
For instance, the regex should match the following strings:
std::cout << STR_MACRO("Hello World!") << std::endl;
load_file(STR_MACRO("Hello World!"));
But not the following ones:
std::cout << STR_MACRO("Hello World!") << std::endl;
load_file(STR_MACRO("Hello World!"));
// "foo" bar
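If installing a third-party module is not an option, a similar effect can be sketched with the standard re module. This is a different technique from (*SKIP)(*F), not the method above: the unwanted alternatives are matched first and returned unchanged by a replacement callback, and only a bare string literal lands in group 1. The names STRING and wrap are hypothetical, and allowing leading spaces or tabs before // is my own assumption.

```python
import re

# Double-quoted string literal with backslash escapes (same sub-pattern as above).
STRING = r'"[^"\\]*(?:\\.[^"\\]*)*"'

pattern = re.compile(
    rf'^[ \t]*//.*|STR_MACRO\({STRING}\)|({STRING})',
    flags=re.MULTILINE,
)

def wrap(match):
    # Only a bare literal populates group 1; comment lines and literals that
    # are already wrapped are matched too, but returned unchanged.
    if match.group(1) is not None:
        return f'STR_MACRO({match.group(1)})'
    return match.group(0)

text = '''std::cout << "Hello World!" << std::endl;
load_file("Hello World!");
std::cout << STR_MACRO("Hello World!") << std::endl;
load_file(STR_MACRO("Hello World!"));
// "foo" bar'''

result = pattern.sub(wrap, text)
print(result)
```

Because alternatives are tried left to right at each position, a literal inside STR_MACRO(...) or on a comment line is consumed by one of the first two branches before the third branch can see it.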

Find regex matches & remove outer part of the match

I have a string
content = "std::cout << func(some_val) << std::endl; auto i = func(some_other_val);"
and I want to find all instances of func(...) and remove the function call, so that I would get
content = "std::cout << some_val << std::endl; auto i = some_other_val;"
So I've tried this:
import re
content = "std::cout << func(some_val) << std::endl; auto i = func(some_other_val);"
c = re.compile(r'func\([a-zA-Z0-9_]+\)')
print(c.sub('', content)) # gives "std::cout << << std::endl; auto i = ;"
but this removes the entire match, not just the func( and ).
Basically, how do I keep whatever matched with [a-zA-Z0-9_]+?
You can use re.sub to replace every func(...) call with just its argument, as shown below (see regex here). I've used [\w]+ for the argument; you can change it to suit your input.
import re
regex = r"func\(([\w]+)\)"
test_str = "std::cout << func(some_val) << std::endl; auto i = func(some_other_val);"
subst = "\\1"
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
if result:
    print(result)
Demo: https://rextester.com/QZJLF65281
Output:
std::cout << some_val << std::endl; auto i = some_other_val;
You should capture the part of the match that you want to keep into a group:
re.compile(r'func\(([a-zA-Z0-9_]+)\)')
Here I captured it into group 1.
And then you can refer to group 1 with \1:
print(c.sub(r'\1', content))
Note that, in general, you should not use regex to parse source code of a non-regular language (such as C in this case). It might work in a few very specific cases where the input is very limited, but you should still use a C parser to parse C code. I have found libraries such as this and this.
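Putting the capture-group answer into one runnable piece, and showing one of the limited cases mentioned above, a small sketch (the nested call func(other(x)) is an input I made up for illustration):

```python
import re

# Group 1 captures the argument that should survive the substitution.
call = re.compile(r'func\(([a-zA-Z0-9_]+)\)')

content = "std::cout << func(some_val) << std::endl; auto i = func(some_other_val);"
unwrapped = call.sub(r'\1', content)
print(unwrapped)

# A nested call is not matched at all: after "other" the pattern expects ")"
# but finds "(", so the text is left untouched.
nested = "auto j = func(other(x));"
unchanged = call.sub(r'\1', nested)
print(unchanged)
```

The second case is the kind of input where a real parser is needed, since balanced parentheses cannot be described by this sort of pattern.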

Regex matching groups boost c++

[Noob Corner]
Hello,
I'm trying to catch a group with boost regex depending on the string that matched and I think I'm using a wrong way.
boost::regex expr(R"(:?(:?\busername *(\S*))|(:?\bserver *(\S*))|(:?\bpassword *(\S*)))");
std::vector<std::string > vec = { "server my.server.eu", "username myusername", "password mypassword" };
for (auto &elem : vec)
{
if (boost::regex_match(elem, expr, boost::match_extra))
{
boost::smatch what;
boost::regex_search(elem, what, expr);
std::cout << "Match 1 (username) : " << what[1].str() << std::endl;
std::cout << "Match 2 (server) : " << what[2].str() << std::endl;
std::cout << "Match 3 (password) : " << what[3].str() << std::endl;
}
}
I want something like :
server my.server.eu
Match 1 (username) : NULL
Match 2 (server) : my.server.eu
Match 3 (password) : NULL
I searched on internet but I have not found clear answers regarding the identification of capturing groups.
Thanks
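You actually have 7 capturing groups, not 3. (:? is not the non-capturing syntax (that would be (?:); it opens an ordinary capturing group that starts with an optional colon.
Your regular expression is organized in such a manner that group 1 is the outer group, groups 2, 4 and 6 match a key-value pair (i.e.: username myusername), and groups 3, 5 and 7 match the actual value (i.e.: myusername).
So you have to look at groups 3, 5 and 7 to get the username, server and password values, or write the groups as (?: so that only the three value groups are numbered.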
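Note that (:? in the question's pattern opens a capturing group that begins with an optional colon, whereas the non-capturing form is spelled (?:. As a rough sketch of what the pattern does once that is swapped, here it is in Python's re module, used only as a stand-in for Boost (group numbering by opening parenthesis works the same way in both):

```python
import re

# Same structure as the question's pattern, with (:? replaced by (?: so that
# only the three value groups are numbered.
expr = re.compile(
    r'(?:(?:\busername *(\S*))|(?:\bserver *(\S*))|(?:\bpassword *(\S*)))'
)

lines = ["server my.server.eu", "username myusername", "password mypassword"]

results = {}
for line in lines:
    m = expr.fullmatch(line)
    if m:
        # groups() is (username, server, password); unmatched groups are None.
        results[line] = m.groups()
        print(line, m.groups())

print(expr.groups)  # number of capturing groups in the pattern
```

With this spelling, group 1 is the username, group 2 the server and group 3 the password, which is the numbering the question was aiming for.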

QRegExp not extracting text as expected

I am trying to extract text from between square brackets on a line of text. I've been messing with the regex for some time now, and cannot get what I need. (I can't even explain why the output is what it is). Here's the code:
QRegExp rx_timestamp("\[(.*?)\]");
int pos = rx_timestamp.indexIn(line);
if (pos > -1) {
qDebug() << "Captured texts: " << rx_timestamp.capturedTexts();
qDebug() << "timestamp cap: " <<rx_timestamp.cap(0);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(1);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(2);
} else qDebug() << "No indexin";
The input line is:
messages:[2013-10-08 09:13:41] NOTICE[2366] chan_sip.c: Registration from '"xx000 <sip:xx000#183.229.164.42:5060>' failed for '192.187.100.170' - No matching peer found
And the output is:
Captured texts: (".")
timestamp cap: "."
timestamp cap: ""
timestamp cap: ""
Can someone explain what is going on? Why is cap returning "." when no such character exists between square brackets?
Can someone correct the regex to extract the timestamp from between the square brackets?
You are missing two things: escaping the backslash, and using setMinimal. See below.
QString line = "messages:[2013-10-08 09:13:41] NOTICE[2366] chan_sip.c: Registration from '\"xx000 <sip:xx000#183.229.164.42:5060>' failed for '192.187.100.170' - No matching peer found";
QRegExp rx_timestamp("\\[(.*)\\]");
rx_timestamp.setMinimal(true);
int pos = rx_timestamp.indexIn(line);
if (pos > -1) {
qDebug() << "Captured texts: " << rx_timestamp.capturedTexts();
qDebug() << "timestamp cap: " <<rx_timestamp.cap(0);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(1);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(2);
} else qDebug() << "No indexin";
Output:
Captured texts: ("[2013-10-08 09:13:41]", "2013-10-08 09:13:41")
timestamp cap: "[2013-10-08 09:13:41]"
timestamp cap: "2013-10-08 09:13:41"
timestamp cap: ""
UPDATE: What is going on:
A backslash in C++ source code starts an escape sequence, such as \n. To get a backslash into a regular expression you have to escape the backslash itself, like so: \\ The regular expression engine then sees a single \, which is what you would write in Ruby, Perl or Python.
The square brackets should be escaped, too, because in regex they normally delimit a character class (a set or range of characters).
So for the regular expression engine to see a square bracket character you need to send it
\[
but an ordinary C++ string literal needs two backslashes in a row to produce one \ character, so it turns into
\\[
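The two layers of escaping can be checked quickly in Python, whose ordinary string literals treat \\ the same way. This sketch also suggests where the "." in the question's output comes from, under the assumption that the C++ compiler drops the backslash from the unrecognized escapes \[ and \], so the engine receives [(.*?)]. The line used here is only the beginning of the question's input.

```python
import re

line = "messages:[2013-10-08 09:13:41] NOTICE[2366] chan_sip.c:"

# Two backslashes in the source give one backslash in the string, so the
# regex engine receives \[ ; a raw string spells the same pattern directly.
same = ("\\[" == r"\[")
escaped_len = len("\\[")

# Without the backslashes the engine sees [(.*?)], a character class that
# matches a single one of the characters ( . * ? ) and ".", in chan_sip.c,
# is the first of those to appear in the line.
class_match = re.search(r"[(.*?)]", line).group(0)

# With the bracket escaped it is a literal character again.
first_bracket = re.search(r"\[", line).start()

print(same, escaped_len, class_match, first_bracket)
```

C++11 raw string literals such as R"(\[)" avoid the doubling in the same way that Python's r"..." does.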
While learning regex, I liked using this regex tool by GSkinner. It has a listing on the right hand side of the page of unique codes and characters.
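QRegExp doesn't follow common regex syntax exactly. If you study the documentation you will find a lot of small differences, such as how it handles greedy vs. lazy matching: it has no per-quantifier lazy form like .*?, so non-greedy matching is switched on for the whole pattern with setMinimal(true).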
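To make the greedy vs. lazy difference concrete, here is a small sketch in Python (not QRegExp), run on the beginning of the question's line. The third pattern, a negated character class, is my own suggestion rather than something from the answer; it does not rely on lazy quantifiers at all, though I have not tried it in QRegExp.

```python
import re

line = "messages:[2013-10-08 09:13:41] NOTICE[2366] chan_sip.c:"

# Greedy: .* runs to the last "]" in the line.
greedy = re.search(r"\[(.*)\]", line).group(1)

# Lazy: .*? stops at the first "]".
lazy = re.search(r"\[(.*?)\]", line).group(1)

# Negated class: "anything except ]" cannot run past the first "]".
negated = re.search(r"\[([^\]]*)\]", line).group(1)

print(greedy)
print(lazy)
print(negated)
```

The greedy result spans both bracket pairs, which is why the answer switches on minimal matching.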
QRegExp and double-quoted text for QSyntaxHighlighter
How the captures are listed is pretty typical as far as I have seen from regex parsers: the list starts with the entire match (cap(0)), followed by the first capture group (whatever was enclosed by the first set of parentheses), and so on.
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#cap
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#capturedTexts
To find more matches, you have to iteratively call indexIn.
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#indexIn
QString str = "offsets: 1.23 .50 71.00 6.00";
QRegExp rx("\\d*\\.\\d+"); // primitive floating point matching
int count = 0;
int pos = 0;
while ((pos = rx.indexIn(str, pos)) != -1) {
++count;
pos += rx.matchedLength();
}
// pos will be 9, 14, 18 and finally 24; count will end up as 4
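The same search-and-advance loop, sketched in Python for comparison (re.Pattern.search accepts a start position, which plays the role of the second argument to indexIn):

```python
import re

text = "offsets: 1.23 .50 71.00 6.00"
rx = re.compile(r"\d*\.\d+")  # primitive floating point matching

positions = []
pos = 0
while True:
    m = rx.search(text, pos)
    if m is None:
        break
    positions.append(m.start())
    pos = m.end()  # continue right after this match, like pos += matchedLength()

print(positions, len(positions))
```

This loop relies on the pattern never matching an empty string; with a pattern that can, pos would have to be advanced by at least one character to avoid looping forever.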
Hope that helps.

Qt 4.8.4 MAC Address QRegExp

I'm trying to get Qt to match a MAC Address ( 1a:2b:3c:4d:5e:6f ) using a QRegExp. I can't seem to get it to match - what am I doing wrong?
I am forcing it to try and match the string:
"48:C1:AC:55:86:F3"
Here are my attempts:
// Define a RegEx to match the mac address
//QRegExp regExMacAddress("[0-9a-F]{1,2}[\.:-]){5}([0-9a-F]{1,2}");
//QRegExp regExMacAddress("[0-9a-F]{0,2}:[0-9a-F]{0,2}:[0-9a-F]{0,2}:[0-9a-F]{0,2}:[0-9a-F]{0,2}:[0-9a-F]{0,2}");
//regExMacAddress.setPatternSyntax(QRegExp::RegExp);
// Ensure that the hexadecimal characters are upper case
hwAddress = hwAddress.toUpper();
qDebug() << "STRING TO MATCH: " << hwAddress << "MATCHED IT: " << regExMacAddress.indexIn(hwAddress) << " Exact Match: " << regExMacAddress.exactMatch(hwAddress);
// Check the mac address format
if ( regExMacAddress.indexIn(hwAddress) == -1 ) {
In your first example the opening parenthesis is missing and \. is incorrect (read the help for explanations); in both, a-F matches nothing, because 'a' > 'F'.
The correct answer can be found in kenrogers' comment, but I'll repeat it here:
([0-9A-F]{2}[:-]){5}([0-9A-F]{2})
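If you want to match . as a separator too, add it to the character class. Keep the hyphen last so that it stays a literal hyphen instead of forming a range; the dot needs no escaping inside a character class:
([0-9A-F]{2}[:.-]){5}([0-9A-F]{2})
If you also want to match lower case characters, you should use:
([0-9A-Fa-f]{2}[:.-]){5}([0-9A-Fa-f]{2})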
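A quick way to sanity-check patterns like these is to try them in Python, sketched below with hypothetical helper names is_mac and compiles. Two caveats: Python rejects a reversed range such as a-F with an error instead of silently matching nothing, and the separator class here lists the hyphen last because a hyphen between two characters inside a class forms a range.

```python
import re

# Six pairs of hex digits separated by ":", "." or "-".
mac = re.compile(r"(?:[0-9A-Fa-f]{2}[:.-]){5}[0-9A-Fa-f]{2}")

def is_mac(s):
    # fullmatch requires the whole string to match, like exactMatch in QRegExp.
    return mac.fullmatch(s) is not None

def compiles(pattern):
    # True if Python's re accepts the pattern, False if it reports an error.
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(is_mac("48:C1:AC:55:86:F3"))
print(is_mac("1a-2b-3c-4d-5e-6f"))
print(is_mac("48:C1:AC:55:86"))
print(compiles("[0-9a-F]"))
print(compiles(r"[:-\.]"))
```

Note that this pattern accepts mixed separators within one address; enforcing a single consistent separator would need a backreference or separate alternatives.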