boost regex whitespace followed by one or more asterisk - c++

why does the following boost regex not return the results I am looking for (starts with 0 ore more whitespace followed by one or more asterisk)?
boost::regex tmpCommentRegex("(^\\s*)\\*+");
for (std::vector<std::string>::iterator vect_it =
tmpInputStringLines.begin(); vect_it != tmpInputStringLines.end();
++vect_it) {
boost::match_results<std::string::const_iterator> tmpMatch;
if (boost::regex_match((*vect_it), tmpMatch, tmpCommentRegex,
boost::match_default) == 0) {
std::cout << "Found comment " << (*vect_it) << std::endl;
} else {
std::cout << "No comment" << std::endl;
}
}
On the following input:
* Script 7
[P]%OMO * change
[P]%QMS * change
[T]%OMO * change
[T]%QMM * change
[S]%G1 * Resume
[]
This should read
Found comment * Script 7
No comment
No comment
No comment
No comment
No comment
No comment

Quoting from the documentation for regex_match:
Note that the result is true only if the expression matches the whole of the input sequence. If you want to search for an expression somewhere within the sequence then use regex_search. If you want to match a prefix of the character string then use regex_search with the flag match_continuous set.
None of your input lines are matched by your regular expression as a whole, so the program works as expected. You should use regex_search to get the desired behavior.
Besides, regex_match and regex_search both return bool and not int, so testing for == 0 is wrong.

Related

How to find the index of the first character in a matching regex?

This is my first question here, I hope that It doesn't sound stupid. So I'm trying to find a way to get the index of the first character from a string that matches the regular expression. I made my research in the regex reference in cplusplus.com but I wasn't able to find anything (probably my fault). For anyone that still don't understand what I want to do, let's make a small example, I have the following code:
int main () {
auto re = regex("(\\d+)");
smatch rematch;
string str("The num is: 123481");
regex_search(str, rematch, re);
if (rematch.size() > 0)
cout << "It matches!!!\n";
else
cout << "It doesn't match!!!\n";
}
The following example will match the number in the str string. I want to get the index that this match appears. How can I do that?
What you are looking for is the position() function of match_results, so something like that should work:
if (rematch.size() > 0)
cout << "It matches at position " << rematch.position() << "\n";

Regex grouping matches with C++ 11 regex library

I'm trying to use a regex for group matching. I want to extract two strings from one big string.
The input string looks something like this:
tХB:Username!Username#Username.tcc.domain.com Connected
tХB:Username!Username#Username.tcc.domain.com WEBMSG #Username :this is a message
tХB:Username!Username#Username.tcc.domain.com Status: visible
The Username can be anything. Same goes for the end part this is a message.
What I want to do is extract the Username that comes after the pound sign #. Not from any other place in the string, since that can vary aswell. I also want to get the message from the string that comes after the semicolon :.
I tried that with the following regex. But it never outputs any results.
regex rgx("WEBMSG #([a-zA-Z0-9]) :(.*?)");
smatch matches;
for(size_t i=0; i<matches.size(); ++i) {
cout << "MATCH: " << matches[i] << endl;
}
I'm not getting any matches. What is wrong with my regex?
Your regular expression is incorrect because neither capture group does what you want. The first is looking to match a single character from the set [a-zA-Z0-9] followed by <space>:, which works for single character usernames, but nothing else. The second capture group will always be empty because you're looking for zero or more characters, but also specifying the match should not be greedy, which means a zero character match is a valid result.
Fixing both of these your regex becomes
std::regex rgx("WEBMSG #([a-zA-Z0-9]+) :(.*)");
But simply instantiating a regex and a match_results object does not produce matches, you need to apply a regex algorithm. Since you only want to match part of the input string the appropriate algorithm to use in this case is regex_search.
std::regex_search(s, matches, rgx);
Putting it all together
std::string s{R"(
tХB:Username!Username#Username.tcc.domain.com Connected
tХB:Username!Username#Username.tcc.domain.com WEBMSG #Username :this is a message
tХB:Username!Username#Username.tcc.domain.com Status: visible
)"};
std::regex rgx("WEBMSG #([a-zA-Z0-9]+) :(.*)");
std::smatch matches;
if(std::regex_search(s, matches, rgx)) {
std::cout << "Match found\n";
for (size_t i = 0; i < matches.size(); ++i) {
std::cout << i << ": '" << matches[i].str() << "'\n";
}
} else {
std::cout << "Match not found\n";
}
Live demo
"WEBMSG #([a-zA-Z0-9]) :(.*?)"
This regex will match only strings, which contain username of 1 character length and any message after semicolon, but second group will be always empty, because tries to find the less non-greedy match of any characters from 0 to unlimited.
This should work:
"WEBMSG #([a-zA-Z0-9]+) :(.*)"

QRegExp not extracting text as expected

I am trying to extract text from between square brackets on a line of text. I've been messing with the regex for some time now, and cannot get what I need. (I can't even explain why the output is what it is). Here's the code:
QRegExp rx_timestamp("\[(.*?)\]");
int pos = rx_timestamp.indexIn(line);
if (pos > -1) {
qDebug() << "Captured texts: " << rx_timestamp.capturedTexts();
qDebug() << "timestamp cap: " <<rx_timestamp.cap(0);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(1);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(2);
} else qDebug() << "No indexin";
The input line is:
messages:[2013-10-08 09:13:41] NOTICE[2366] chan_sip.c: Registration from '"xx000 <sip:xx000#183.229.164.42:5060>' failed for '192.187.100.170' - No matching peer found
And the output is:
Captured texts: (".")
timestamp cap: "."
timestamp cap: ""
timestamp cap: ""
Can someone explain what is going on? Why is cap returning "." when no such character exists between square brackets
Can someone correct the regex to extract the timestamp from between the square brackets?
You are missing two things. Escaping the backslash, and using setMinimal. See below.
QString line = "messages:[2013-10-08 09:13:41] NOTICE[2366] chan_sip.c: Registration from '\"xx000 <sip:xx000#183.229.164.42:5060>' failed for '192.187.100.170' - No matching peer found";
QRegExp rx_timestamp("\\[(.*)\\]");
rx_timestamp.setMinimal(true);
int pos = rx_timestamp.indexIn(line);
if (pos > -1) {
qDebug() << "Captured texts: " << rx_timestamp.capturedTexts();
qDebug() << "timestamp cap: " <<rx_timestamp.cap(0);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(1);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(2);
} else qDebug() << "No indexin";
Output:
Captured texts: ("[2013-10-08 09:13:41]", "2013-10-08 09:13:41")
timestamp cap: "[2013-10-08 09:13:41]"
timestamp cap: "2013-10-08 09:13:41"
timestamp cap: ""
UPDATE: What is going on:
A backslash in c++ source code indicates that the next character is an escape character, such as \n. To have a backslash show up in a regular expression you have to escape a backslash like so: \\ That will make it so that the Regular Expression engine sees \, like what Ruby, Perl or Python would use.
The square brackets should be escaped, too, because they are used to indicate a range of elements normally in regex.
So for the Regular expression engine to see a square bracket character you need to send it
\[
but a c++ source file can't get a \ character into a string without two of them in a row so it turns into
\\[
While learning regex, I liked using this regex tool by GSkinner. It has a listing on the right hand side of the page of unique codes and characters.
QRegEx doesn't match regex exactly. If you study the documentation you find a lot of little things. Such as how it does Greedy v. Lazy matching.
QRegExp and double-quoted text for QSyntaxHighlighter
How the captures are listed is pretty typical as far as I have seen from regex parsers. The capture listing first lists all of them, then it lists the first capture group (or what was enclosed by the first set of parentheses.
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#cap
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#capturedTexts
To find more matches, you have to iteratively call indexIn.
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#indexIn
QString str = "offsets: 1.23 .50 71.00 6.00";
QRegExp rx("\\d*\\.\\d+"); // primitive floating point matching
int count = 0;
int pos = 0;
while ((pos = rx.indexIn(str, pos)) != -1) {
++count;
pos += rx.matchedLength();
}
// pos will be 9, 14, 18 and finally 24; count will end up as 4
Hope that helps.

Why am I getting multiple regex matches?

I'm trying to write a processor for GLSL shader code that will allow me to analyze the code and dynamically determine what inputs and outputs I need to handle for each shader.
To accomplish that, I decided to use some regex to parse the shader code before I compile it via OpenGL.
I've written some test code to verify that the regex is working as I expect.
Code:
#include <iostream>
#include <string>
#include <regex>
using namespace std;
int main()
{
string strInput = " in vec3 i_vPosition; ";
smatch match;
// Will appear in regex as:
// \bin\s+[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*(\[[0-9]+\])?\s*;
regex rgx("\\bin\\s+[a-zA-Z0-9]+\\s+[a-zA-Z0-9_]+\\s*(\\[[0-9]+\\])?\\s*;");
bool bMatchFound = regex_search(strInput, match, rgx);
cout << "Match found: " << bMatchFound << endl;
for (int i = 0; i < match.size(); ++i)
{
cout << "match " << i << " (" << match[i] << ") ";
cout << "at position " << match.position(i) << std::endl;
}
}
The only problem is that the above code generates two results instead of one. Though one of the results is empty.
Output:
Match found: 1
match 0 (in vec3 i_vPosition;) at position 6
match 1 () at position 34
I ultimately want to generate multiple results when I provide a whole file as input, but I'd like to get some consistency so that I can process the results in a consistent manner.
Any ideas as to why I'm getting multiple results when I'm only expecting one?
Your regex appears to contain a back reference
(\[[0-9]+\])?
which would contain square brackets surrounding 1 or more digits, but the ? makes it optional.
When applying the regex, the leading and trailing spaces are trimmed by the
\s+ ... \s*
The remainder of the string is matched by
[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*
And the backreference bit matches the empty string.
If you want to match strings that optionally contain that bit, but not return it as a backreference, make it passive with ?: like:
\bin\s+[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*(?:\[[0-9]+\])?\s*
I ultimately want to generate multiple results
The regex_search only finds the first match of the complete regular expression.
If you want to find the other places in your source text that the complete regular expression matches,
you must run regex_search repeatedly.
See
" C++ Regex to match words without punctuation "
for an example of repeatedly running the search.
the above code generates two results instead of one.
Confusing, isn't it?
The regular expression
\bin\s+[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*(\[[0-9]+\])?\s*;
includes round brackets().
The round brackets create a "group" aka "sub-expression".
Because the sub-expression is optional "(....)?",
the expression as a whole is allowed to match even if the sub-expression doesn't really match anything.
When the sub-expression doesn't match anything, the value of that sub-expression is an empty string.
See "Regular-expressions: Use Round Brackets for Grouping" for far more information on "capturing parenthesis" and "non-capturing parenthesis".
According to the documentation for regex_search,
match.size() is the number of subexpressions plus 1,
match[0] is the part of the source string that matches the complete regular expression.
match[1] is the part of the source string that matches the first sub-expression inside the regular expression.
match[n] is the part of the source string that matches the n'th sub-expression inside the regular expression.
A regular expression with only 1 sub-expression, as in the above example, will always return a match.size() of 2 -- one match for the complete regular expression, and one match for the sub-expression -- even when that sub-expression doesn't really match anything and is therefore the empty string.

Get String Between 2 Strings

How can I get a string that is between two other declared strings, for example:
String 1 = "[STRING1]"
String 2 = "[STRING2]"
Source:
"832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"
How can I get the "I need this text here"?
Since this is homework, only clues:
Find index1 of occurrence of String1
Find index2 of occurrence of String2
Substring from index1+lengthOf(String1) (inclusive) to index2 (exclusive) is what you need
Copy this to a result buffer if necessary (don't forget to null-terminate)
Might be a good case for std::regex, which is part of C++11.
#include <iostream>
#include <string>
#include <regex>
int main()
{
using namespace std::string_literals;
auto start = "\\[STRING1\\]"s;
auto end = "\\[STRING2\\]"s;
std::regex base_regex(start + "(.*)" + end);
auto example = "832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"s;
std::smatch base_match;
std::string matched;
if (std::regex_search(example, base_match, base_regex)) {
// The first sub_match is the whole string; the next
// sub_match is the first parenthesized expression.
if (base_match.size() == 2) {
matched = base_match[1].str();
}
}
std::cout << "example: \""<<example << "\"\n";
std::cout << "matched: \""<<matched << "\"\n";
}
Prints:
example: "832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"
matched: "I need this text here"
What I did was create a program that creates two strings, start and end that serve as my start and end matches. I then use a regular expression string that will look for those, and match against anything in-between (including nothing). Then I use regex_match to find the matching part of the expression, and set matched as the matched string.
For more info, see http://en.cppreference.com/w/cpp/regex and http://en.cppreference.com/w/cpp/regex/regex_search
Use strstr http://www.cplusplus.com/reference/clibrary/cstring/strstr/ , with that function you will get 2 pointers, now you should compare them (if pointer1 < pointer2) if so, read all chars between them.