QRegExp not extracting text as expected - regex

I am trying to extract text from between square brackets on a line of text. I've been messing with the regex for some time now, and cannot get what I need. (I can't even explain why the output is what it is). Here's the code:
QRegExp rx_timestamp("\[(.*?)\]");
int pos = rx_timestamp.indexIn(line);
if (pos > -1) {
qDebug() << "Captured texts: " << rx_timestamp.capturedTexts();
qDebug() << "timestamp cap: " <<rx_timestamp.cap(0);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(1);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(2);
} else qDebug() << "No indexin";
The input line is:
messages:[2013-10-08 09:13:41] NOTICE[2366] chan_sip.c: Registration from '"xx000 <sip:xx000#183.229.164.42:5060>' failed for '192.187.100.170' - No matching peer found
And the output is:
Captured texts: (".")
timestamp cap: "."
timestamp cap: ""
timestamp cap: ""
Can someone explain what is going on? Why is cap returning "." when no such character exists between square brackets
Can someone correct the regex to extract the timestamp from between the square brackets?

You are missing two things. Escaping the backslash, and using setMinimal. See below.
QString line = "messages:[2013-10-08 09:13:41] NOTICE[2366] chan_sip.c: Registration from '\"xx000 <sip:xx000#183.229.164.42:5060>' failed for '192.187.100.170' - No matching peer found";
QRegExp rx_timestamp("\\[(.*)\\]");
rx_timestamp.setMinimal(true);
int pos = rx_timestamp.indexIn(line);
if (pos > -1) {
qDebug() << "Captured texts: " << rx_timestamp.capturedTexts();
qDebug() << "timestamp cap: " <<rx_timestamp.cap(0);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(1);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(2);
} else qDebug() << "No indexin";
Output:
Captured texts: ("[2013-10-08 09:13:41]", "2013-10-08 09:13:41")
timestamp cap: "[2013-10-08 09:13:41]"
timestamp cap: "2013-10-08 09:13:41"
timestamp cap: ""
UPDATE: What is going on:
A backslash in c++ source code indicates that the next character is an escape character, such as \n. To have a backslash show up in a regular expression you have to escape a backslash like so: \\ That will make it so that the Regular Expression engine sees \, like what Ruby, Perl or Python would use.
The square brackets should be escaped, too, because they are used to indicate a range of elements normally in regex.
So for the Regular expression engine to see a square bracket character you need to send it
\[
but a c++ source file can't get a \ character into a string without two of them in a row so it turns into
\\[
While learning regex, I liked using this regex tool by GSkinner. It has a listing on the right hand side of the page of unique codes and characters.
QRegEx doesn't match regex exactly. If you study the documentation you find a lot of little things. Such as how it does Greedy v. Lazy matching.
QRegExp and double-quoted text for QSyntaxHighlighter
How the captures are listed is pretty typical as far as I have seen from regex parsers. The capture listing first lists all of them, then it lists the first capture group (or what was enclosed by the first set of parentheses.
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#cap
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#capturedTexts
To find more matches, you have to iteratively call indexIn.
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#indexIn
QString str = "offsets: 1.23 .50 71.00 6.00";
QRegExp rx("\\d*\\.\\d+"); // primitive floating point matching
int count = 0;
int pos = 0;
while ((pos = rx.indexIn(str, pos)) != -1) {
++count;
pos += rx.matchedLength();
}
// pos will be 9, 14, 18 and finally 24; count will end up as 4
Hope that helps.

Related

QRegExp to extract array name and index

I am parsing some strings. If I encounter something like "Foo(bar)", I want to extract "Foo" and "bar"
How do I do it using QRegExp?
First thing, if you are using Qt 5 then rather use QRegularExpression class
The QRegularExpression class introduced in Qt 5 is a big improvement upon QRegExp, in terms of APIs offered, supported pattern syntax and speed of execution.
Secondly, get a visual tool that helps when testing/defining regular expressions, I use an online website.
To get the "Foo" and "Bar" from your example, I can suggest the following pattern:
(\w+)\((\w+)\)
--------------
The above means:
(\w+) - Capture one or more word characters (capture group 1)
\( - followed by a opening brace
(\w+) - then capture one or more word characters (capture group 2)
\) - followed by a closing brace
This pattern must be escaped for direct usage in the Qt regular expression:
const QRegularExpression expression( "(\\w+)\\((\\w+)\\)" );
QRegularExpressionMatch match = expression.match( "Foo(bar)" );
if( match.hasMatch() ) {
qDebug() << "0: " << match.captured( 0 ); // 0 is the complete match
qDebug() << "1: " << match.captured( 1 ); // First capture group
qDebug() << "2: " << match.captured( 2 ); // Second capture group
}
Output is:
0: "Foo(bar)"
1: "Foo"
2: "bar"
See the pattern in action online here. Hover the mouse over the parts in the "Expression" box to see the explanations or over the "Text" part to see the result.

Getting the index of a slice

I want to do some processing on a string in Scala. The first stage of that is finding the index of articles such as: "A ", " A ", "a ", " a ". I am trying to do that like this:
"A house is in front of us".indexOfSlice("\\s+[Aa] ")
I think this should return 0, as the substring is first matched in the first position of the string.
However, this returns -1.
Why does it return -1? Is the regex I am using incorrect?
The other answers as I type this are just missing the point. Your problem is that indexOfSlice doesn't take a regexp, but a sub-sequence to seach for in the sequence. So fixing the regexp won't help at all.
Try this:
val pattern = "\\b[Aa]\\b".r.unanchored
for (mo <- pattern.findAllMatchIn("A house is in front of us, a house is in front of us all")) {
println("pattern starts at " + mo.start)
}
//> pattern starts at 0
//| pattern starts at 27
(with fixed regex, too)
Edit: counter-example for the popular but wrong suggestion of "\\s*[Aa] "
val pattern2 = "\\s*[Aa] ".r.unanchored
for (mo <- pattern2.findAllMatchIn("The agenda is hidden")) {
println("pattern starts at " + mo.start)
}
//> pattern starts at 9
I see a mistake in your regex. your regex is searching for
at least once space (\s+)
a letter (either A or a)
but string you are matching doesn't contain space in beginning. that's why It's not returning you index 0 but -1.
you could write your regex as "^\\s*[Aa] "
Here is example:
val text = "A house is in front of us";
val matcher = Pattern.compile("^\\s*[Aa] ").matcher(text)
var idx = 0;
if(matcher.find()){
idx = matcher.start()
}
println(idx)
it should return 0 as expected.

Regex grouping matches with C++ 11 regex library

I'm trying to use a regex for group matching. I want to extract two strings from one big string.
The input string looks something like this:
tХB:Username!Username#Username.tcc.domain.com Connected
tХB:Username!Username#Username.tcc.domain.com WEBMSG #Username :this is a message
tХB:Username!Username#Username.tcc.domain.com Status: visible
The Username can be anything. Same goes for the end part this is a message.
What I want to do is extract the Username that comes after the pound sign #. Not from any other place in the string, since that can vary aswell. I also want to get the message from the string that comes after the semicolon :.
I tried that with the following regex. But it never outputs any results.
regex rgx("WEBMSG #([a-zA-Z0-9]) :(.*?)");
smatch matches;
for(size_t i=0; i<matches.size(); ++i) {
cout << "MATCH: " << matches[i] << endl;
}
I'm not getting any matches. What is wrong with my regex?
Your regular expression is incorrect because neither capture group does what you want. The first is looking to match a single character from the set [a-zA-Z0-9] followed by <space>:, which works for single character usernames, but nothing else. The second capture group will always be empty because you're looking for zero or more characters, but also specifying the match should not be greedy, which means a zero character match is a valid result.
Fixing both of these your regex becomes
std::regex rgx("WEBMSG #([a-zA-Z0-9]+) :(.*)");
But simply instantiating a regex and a match_results object does not produce matches, you need to apply a regex algorithm. Since you only want to match part of the input string the appropriate algorithm to use in this case is regex_search.
std::regex_search(s, matches, rgx);
Putting it all together
std::string s{R"(
tХB:Username!Username#Username.tcc.domain.com Connected
tХB:Username!Username#Username.tcc.domain.com WEBMSG #Username :this is a message
tХB:Username!Username#Username.tcc.domain.com Status: visible
)"};
std::regex rgx("WEBMSG #([a-zA-Z0-9]+) :(.*)");
std::smatch matches;
if(std::regex_search(s, matches, rgx)) {
std::cout << "Match found\n";
for (size_t i = 0; i < matches.size(); ++i) {
std::cout << i << ": '" << matches[i].str() << "'\n";
}
} else {
std::cout << "Match not found\n";
}
Live demo
"WEBMSG #([a-zA-Z0-9]) :(.*?)"
This regex will match only strings, which contain username of 1 character length and any message after semicolon, but second group will be always empty, because tries to find the less non-greedy match of any characters from 0 to unlimited.
This should work:
"WEBMSG #([a-zA-Z0-9]+) :(.*)"

Qt 4.8.4 MAC Address QRegExp

I'm trying to get Qt to match a MAC Address ( 1a:2b:3c:4d:5e:6f ) using a QRegExp. I can't seem to get it to match - what am I doing wrong?
I am forcing it to try and match the string:
"48:C1:AC:55:86:F3"
Here are my attempts:
// Define a RegEx to match the mac address
//QRegExp regExMacAddress("[0-9a-F]{1,2}[\.:-]){5}([0-9a-F]{1,2}");
//QRegExp regExMacAddress("[0-9a-F]{0,2}:[0-9a-F]{0,2}:[0-9a-F]{0,2}:[0-9a-F]{0,2}:[0-9a-F]{0,2}:[0-9a-F]{0,2}");
//regExMacAddress.setPatternSyntax(QRegExp::RegExp);
// Ensure that the hexadecimal characters are upper case
hwAddress = hwAddress.toUpper();
qDebug() << "STRING TO MATCH: " << hwAddress << "MATCHED IT: " << regExMacAddress.indexIn(hwAddress) << " Exact Match: " << regExMacAddress.exactMatch(hwAddress);
// Check the mac address format
if ( regExMacAddress.indexIn(hwAddress) == -1 ) {
In your first example opening bracket is missing and \. is incorrect (read help for explanations), in both a-F matches nothing, due to 'a' > 'F'.
The correct answer you can find in the comment of kenrogers, but I'll duplicate it for you:
([0-9A-F]{2}[:-]){5}([0-9A-F]{2})
If you want to match . you should use:
([0-9A-F]{2}[:-\\.]){5}([0-9A-F]{2})
If you also want to match lower case characters, you should use:
([0-9A-Fa-f]{2}[:-\\.]){5}([0-9A-Fa-f]{2})

boost regex whitespace followed by one or more asterisk

why does the following boost regex not return the results I am looking for (starts with 0 ore more whitespace followed by one or more asterisk)?
boost::regex tmpCommentRegex("(^\\s*)\\*+");
for (std::vector<std::string>::iterator vect_it =
tmpInputStringLines.begin(); vect_it != tmpInputStringLines.end();
++vect_it) {
boost::match_results<std::string::const_iterator> tmpMatch;
if (boost::regex_match((*vect_it), tmpMatch, tmpCommentRegex,
boost::match_default) == 0) {
std::cout << "Found comment " << (*vect_it) << std::endl;
} else {
std::cout << "No comment" << std::endl;
}
}
On the following input:
* Script 7
[P]%OMO * change
[P]%QMS * change
[T]%OMO * change
[T]%QMM * change
[S]%G1 * Resume
[]
This should read
Found comment * Script 7
No comment
No comment
No comment
No comment
No comment
No comment
Quoting from the documentation for regex_match:
Note that the result is true only if the expression matches the whole of the input sequence. If you want to search for an expression somewhere within the sequence then use regex_search. If you want to match a prefix of the character string then use regex_search with the flag match_continuous set.
None of your input lines are matched by your regular expression as a whole, so the program works as expected. You should use regex_search to get the desired behavior.
Besides, regex_match and regex_search both return bool and not int, so testing for == 0 is wrong.