How to stop at first match in C++ regular expression? - c++

I'm trying to remove comments in C++ with flex.
This is my example code:
cout << "Pulapka \" \
// ma \
/* ma */ \
" << endl;
cout << /*Proba*/"Zabawa \" // ala i kot " << endl;
I want to match everything between the quotation marks.
My regular expression:
(\"[^[]*]*")
I want the match to stop at the closing quotation mark. That means I need only this fragment:
"Pulapka \" \
// ma \
/* ma */ \
"

The following regex works for me:
(\".*?(?<!\\)("))
You can then extract the first group which is exactly what you want.
Note: I have not tried this in C++, but on regex101 I had to enable the s (single-line) flag so that . also matches newlines.
Demo: http://regex101.com/r/xL4kU8
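For C++ itself, note that the ECMAScript grammar used by std::regex has no lookbehind, so the pattern above cannot be used there unchanged. A minimal sketch of the same idea, assuming std::regex rather than flex and a hypothetical helper name first_string_literal, consumes each escaped character explicitly instead of looking back:

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

// Returns the first double-quoted literal in `src`, quotes included,
// or std::nullopt when no complete literal is found.
std::optional<std::string> first_string_literal(const std::string& src) {
    // "            opening quote
    // [^"\\]       any char except a quote or a backslash (newlines allowed)
    // \\(?:.|\n)   a backslash plus the char it escapes, including a newline
    // "            closing quote
    static const std::regex literal(R"re("(?:[^"\\]|\\(?:.|\n))*")re");

    std::smatch m;
    if (!std::regex_search(src, m, literal)) return std::nullopt;
    assert(!m.empty());  // a successful search always fills in match 0
    return m.str(0);
}
```

Because an escaped quote is consumed together with its backslash, the match can only end at an unescaped quote. This sketch does not skip comments, so a quote inside a comment that precedes the literal would still confuse it.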

Related

Regex (JS Notation): Select spaces not in [ [], {}, "" ] to tokenize string

So I need to tokenize a string by all spaces not between quotes, I am using regex in Javascript notation.
For example:
" Test Test " ab c " Test" "Test " "Test" "T e s t"
becomes
[" Test Test ",ab,c," Test","Test ","Test","T e s t"]
For my use case however, the solution should work in the following test setting:
https://www.regextester.com/
All Spaces not within quotes should be highlighted in the above setting. If they are highlighted in the above setting they would be parsed correctly in my program.
For more specificity, I am using Boost::Regex C++ to do the parsing as follows:
...
std::string test_string("\" Test Test \" ab c \" Test\" \"Test \" \"Test\" \"T e s t\"");
// (,|;)?\\s+ : Split on ,\s or ;\s
// (?![^\\[]*\\]) : Ignore spaces inside []
// (?![^\\{]*\\}) : Ignore spaces inside {}
// (?![^\"].*\") : Ignore spaces inside "" !!! MY ATTEMPT DOESN'T WORK !!!
//Note the below regex delimiter declaration does not include the erroneous regex.
boost::regex delimiter("(,|;\\s|\\s)+(?![^\\[]*\\])(?![^\\(]*\\))(?![^\\{]*\\})");
std::vector<std::string> string_vector;
boost::split_regex(string_vector, test_string, delimiter);
For those of you who do not use Boost::regex or C++ the above link should enable testing of viable regex for the above use case.
Thank you all for your assistance. I hope you can help me with the above problem.
I would 100% not use regular expressions for this. First off, because it's way easier to express as a PEG grammar instead. E.g.:
std::vector<std::string> tokens(std::string_view input) {
namespace x3 = boost::spirit::x3;
std::vector<std::string> r;
auto atom //
= '[' >> *~x3::char_(']') >> ']' //
| '{' >> *~x3::char_('}') >> '}' //
| '"' >> *~x3::char_('"') >> '"' //
| x3::graph;
auto token = x3::raw[*atom];
parse(input.begin(), input.end(), token % +x3::space, r);
return r;
}
This, off the bat, already performs as you intend:
int main() {
for (std::string const input : {R"(" Test Test " ab c " Test" "Test " "Test" "T e s t")"}) {
std::cout << input << "\n";
for (auto& tok : tokens(input))
std::cout << " - " << quoted(tok, '\'') << "\n";
}
}
Output:
" Test Test " ab c " Test" "Test " "Test" "T e s t"
- '" Test Test "'
- 'ab'
- 'c'
- '" Test"'
- '"Test "'
- '"Test"'
- '"T e s t"'
BONUS
Where this really makes the difference, is when you realize that you wanted to be able to handle nested constructs (e.g. "string" [ {1,2,"3,4", [true,"more [string]"], 9 }, "bye ]).
Regular expressions are notoriously bad at this. Spirit grammar rules can be recursive though. If you make your grammar description more explicit I could show you examples.
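If pulling in Boost.Spirit is not an option, the non-nested version of the same idea can be sketched as a plain loop using only the standard library. The function name split_outside_groups is hypothetical, and like the grammar above it does not handle nesting or escaped quotes:

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <string_view>
#include <vector>

// Splits on whitespace, except inside "...", [...] or {...}.
std::vector<std::string> split_outside_groups(std::string_view input) {
    std::vector<std::string> tokens;
    std::string current;
    char closer = '\0';  // character that ends the open group, '\0' when outside

    for (char c : input) {
        if (closer != '\0') {
            // Inside a group: keep everything, including spaces.
            current += c;
            if (c == closer) closer = '\0';
        } else if (std::isspace(static_cast<unsigned char>(c))) {
            // Whitespace outside a group ends the current token.
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
        } else {
            current += c;
            if (c == '"') closer = '"';
            else if (c == '[') closer = ']';
            else if (c == '{') closer = '}';
        }
    }
    // An unterminated group simply runs to the end of the input.
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```

For the test string from the question this should produce the same seven tokens as the Spirit version.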
You can use multiple regexes if you are ok with that. The idea is to replace spaces inside quotes with a non-printable char (\x01), and restore them after the split:
const input = `" Test Test " ab c " Test" "Test " "Test" "T e s t"`;
let result = input
.replace(/"[^"]*"/g, m => m.replace(/ /g, '\x01')) // replace spaces inside quotes
.split(/ +/) // split on spaces
.map(s => s.replace(/\x01/g, ' ')); // restore spaces inside quotes
console.log(result);
If you have escaped quotes within a string, such as "a \"quoted\" token" you can use this regex instead:
const input = String.raw`"A \"quoted\" token" " Test Test " ab c " Test" "Test " "Test" "T e s t"`; // String.raw keeps the backslashes that a plain template literal would drop
let result = input
.replace(/".*?[^\\]"/g, m => m.replace(/ /g, '\x01')) // replace spaces inside quotes
.split(/ +/) // split on spaces
.map(s => s.replace(/\x01/g, ' ')); // restore spaces inside quotes
console.log(result);
If you want to parse nested brackets you need a proper language parser. You can also do that with regexes however: Parsing JavaScript objects with functions as JSON
Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex

Regex to match strings not enclosed in macro

In a development context, I would like to make sure all strings in source files within certain directories are enclosed in some macro "STR_MACRO". For this I will be using a Python script parsing the source files, and I would like to design a regex for detecting non-commented lines with strings not enclosed in this macro.
For instance, the regex should match the following strings:
std::cout << "Hello World!" << std::endl;
load_file("Hello World!");
But not the following ones:
std::cout << STR_MACRO("Hello World!") << std::endl;
load_file(STR_MACRO("Hello World!"));
// "foo" bar
Excluding commented lines containing strings seems to work well using the regex ^(?!\s*//).*"([^"]+)". However, when I try to also exclude strings already enclosed in the macro, using the regex ^(?!\s*//).*(?!STR_MACRO\()"([^"]+)", nothing changes (seemingly because of the opening parenthesis after STR_MACRO).
Any hints on how to achieve this?
With the PyPI regex module (which you can install with pip install regex in the terminal) you can use
import regex
pattern = r'''(?:^//.*|STR_MACRO\("[^"\\]*(?:\\.[^"\\]*)*"\))(*SKIP)(*F)|"[^"\\]*(?:\\.[^"\\]*)*"'''
text = r'''For instance, the regex should match the following strings:
std::cout << "Hello World!" << std::endl;
load_file("Hello World!");
But not the following ones:
std::cout << STR_MACRO("Hello World!") << std::endl;
load_file(STR_MACRO("Hello World!"));
// "foo" bar'''
print( regex.sub(pattern, r'STR_MACRO(\g<0>)', text, flags=regex.M) )
Details:
(?:^//.*|STR_MACRO\("[^"\\]*(?:\\.[^"\\]*)*"\))(*SKIP)(*F) - // at the line start and the rest of the line, or STR_MACRO( + a double quoted string literal pattern + ), and then the match is skipped, and the next match search starts at the failure location
| - or
"[^"\\]*(?:\\.[^"\\]*)*" - ", zero or more chars other than " and \, then zero or more repetitions of a \ followed by any single char and then zero or more chars other than " and \, and then a " char
See the Python demo. Output:
For instance, the regex should match the following strings:
std::cout << STR_MACRO("Hello World!") << std::endl;
load_file(STR_MACRO("Hello World!"));
But not the following ones:
std::cout << STR_MACRO("Hello World!") << std::endl;
load_file(STR_MACRO("Hello World!"));
// "foo" bar
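Engines without (*SKIP)(*F), such as std::regex in C++, can approximate this with a common workaround: let the first alternatives consume what should be ignored, and wrap only the wanted part in a capture group. The sketch below assumes std::regex and a hypothetical helper name unwrapped_literals. It treats any // up to the end of the line as a comment and expects STR_MACRO( to be followed directly by the literal:

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Collects string literals that are neither in a // comment nor wrapped in STR_MACRO(...).
std::vector<std::string> unwrapped_literals(const std::string& source) {
    // Alternative 1: a // comment up to the end of the line (ignored)
    // Alternative 2: STR_MACRO("...") as a whole (ignored)
    // Alternative 3: a bare string literal, captured as group 1
    static const std::regex scanner(
        R"re(//.*|STR_MACRO\("(?:[^"\\]|\\.)*"\)|("(?:[^"\\]|\\.)*"))re");

    std::vector<std::string> found;
    for (std::sregex_iterator it(source.begin(), source.end(), scanner), end;
         it != end; ++it) {
        // Group 1 only takes part when the third alternative matched.
        if ((*it)[1].matched) found.push_back((*it)[1].str());
    }
    return found;
}
```

Matches of the first two alternatives are still found, but they are dropped because group 1 did not participate, which plays the role of the skipped match in the PCRE-style pattern.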

QRegularExpression find and capture all quoted and non-quoted parts in string

I am fairly new to using regexes.
I got a string which can contain quoted and not quoted substrings.
Here are examples of how they could look:
"path/to/program.exe" -a -b -c
"path/to/program.exe" -a -b -c
path/to/program.exe "-a" "-b" "-c"
path/to/program.exe "-a" -b -c
My regex looks like this: (("[^"]*")|([^"\t ]+))+
With ("[^"]+") I attempt to find every quoted substring and capture it.
With ([^"\t ]+) I attempt to find every substring without quotes.
My code to test this behaviour looks like this:
QString toMatch = R"del( "path/to/program.exe" -a -b -c)del";
qDebug() << "String to Match against: " << toMatch << "\n";
QRegularExpression re(R"del((("[^"]+")|([^"\t ]+))+)del");
QRegularExpressionMatchIterator it = re.globalMatch(toMatch);
int i = 0;
while (it.hasNext())
{
QRegularExpressionMatch match = it.next();
qDebug() << "iteration: " << i << " captured: " << match.captured(i) << "\n";
i++;
}
Output:
String to Match against: " \"path/to/program.exe\" -a -b -c"
iteration: 0 captured: "\"path/to/program.exe\""
iteration: 1 captured: "-a"
iteration: 2 captured: ""
iteration: 3 captured: "-c"
Testing it in Regex101 shows me the result I want.
I also tested it on some other websites e.g this.
I guess I am doing something wrong, could anyone point in the right direction?
Thanks in advance.
You assume that the groups you need to read will change their IDs with each new match, while in fact all group IDs are fixed by the pattern itself.
I suggest removing all groups and just extracting the whole match value:
QString toMatch = R"del( "path/to/program.exe" -a -b -c)del";
qDebug() << "String to Match against: " << toMatch << "\n";
QRegularExpression re(R"del("[^"]+"|[^"\s]+)del");
QRegularExpressionMatchIterator it = re.globalMatch(toMatch);
while (it.hasNext())
{
QRegularExpressionMatch match = it.next();
qDebug() << " matched: " << match.captured(0) << "\n";
}
Note the "[^"]+"|[^"\s]+ pattern matches either
"[^"]+" - ", then one or more chars other than " and then a "
| - or
[^"\s]+ - one or more chars other than " and whitespace.
See the updated pattern demo.
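To make the point about fixed group IDs concrete, here is a small sketch that uses std::regex instead of QRegularExpression (an assumption made so that it builds without Qt) and a hypothetical helper name classify_arguments. It keeps two groups and reports, for every match, which of them took part:

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// For each match, returns the group that took part (1 = quoted, 2 = unquoted)
// together with the matched text.
std::vector<std::pair<int, std::string>> classify_arguments(const std::string& line) {
    static const std::regex token(R"re(("[^"]+")|([^"\s]+))re");

    std::vector<std::pair<int, std::string>> result;
    for (std::sregex_iterator it(line.begin(), line.end(), token), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        const int group = m[1].matched ? 1 : 2;
        assert(m[group].str() == m.str(0));  // one alternative spans the whole match
        result.emplace_back(group, m.str(0));
    }
    return result;
}
```

The group number depends on which alternative matched, never on how many matches came before, so indexing the captures with the loop counter cannot work.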

how can I print "\' in c++?

I have a homework assignment where part of the menu has to have "R\C" printed, but when I run the program the console just prints "RC". Does anyone know why is this happening and how I can fix it?
This is what I have in Visual Studio:
cout << "R\C" << endl;
The \C is being interpreted as an (invalid) escape sequence. You need to escape the \ character as \\ in order to print it as a single \, eg:
cout << "R\\C" << endl;
Alternatively, in C++11 and later, you can use a raw string literal instead, so you do not need to escape the \ character:
cout << R"(R\C)" << endl;
Escape \ with another \:
cout << "R\\C" << endl;
C++ reserves some characters, so you cannot type them directly into a string literal. You usually have to put \ in front of them, and to get a literal \ you write \\.
You have to use escape sequences for certain characters. For the character you asked about, you would write "\\" and the output would be \. Other escape sequences are:
\' For a single quote
\t For Tab
\n For newline
\? For a question mark
See this for more information.
You can use escape sequences such as \t, \n and \a.
If you want to print a single \, you have to write it like this:
cout<<"\\";

QRegExp not extracting text as expected

I am trying to extract text from between square brackets on a line of text. I've been messing with the regex for some time now, and cannot get what I need. (I can't even explain why the output is what it is). Here's the code:
QRegExp rx_timestamp("\[(.*?)\]");
int pos = rx_timestamp.indexIn(line);
if (pos > -1) {
qDebug() << "Captured texts: " << rx_timestamp.capturedTexts();
qDebug() << "timestamp cap: " <<rx_timestamp.cap(0);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(1);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(2);
} else qDebug() << "No indexin";
The input line is:
messages:[2013-10-08 09:13:41] NOTICE[2366] chan_sip.c: Registration from '"xx000 <sip:xx000#183.229.164.42:5060>' failed for '192.187.100.170' - No matching peer found
And the output is:
Captured texts: (".")
timestamp cap: "."
timestamp cap: ""
timestamp cap: ""
Can someone explain what is going on? Why is cap returning "." when no such character exists between the square brackets?
Can someone correct the regex to extract the timestamp from between the square brackets?
You are missing two things. Escaping the backslash, and using setMinimal. See below.
QString line = "messages:[2013-10-08 09:13:41] NOTICE[2366] chan_sip.c: Registration from '\"xx000 <sip:xx000#183.229.164.42:5060>' failed for '192.187.100.170' - No matching peer found";
QRegExp rx_timestamp("\\[(.*)\\]");
rx_timestamp.setMinimal(true);
int pos = rx_timestamp.indexIn(line);
if (pos > -1) {
qDebug() << "Captured texts: " << rx_timestamp.capturedTexts();
qDebug() << "timestamp cap: " <<rx_timestamp.cap(0);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(1);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(2);
} else qDebug() << "No indexin";
Output:
Captured texts: ("[2013-10-08 09:13:41]", "2013-10-08 09:13:41")
timestamp cap: "[2013-10-08 09:13:41]"
timestamp cap: "2013-10-08 09:13:41"
timestamp cap: ""
UPDATE: What is going on:
A backslash in C++ source code starts an escape sequence, such as \n. To get a backslash into the regular expression you have to escape it in the string literal, like so: \\ That way the regular expression engine sees a single \, which is what you would write directly in Ruby, Perl or Python.
The square brackets must be escaped too, because in a regex they normally delimit a character class.
So for the regular expression engine to see a literal square bracket you need to send it
\[
but an ordinary C++ string literal needs two backslashes in a row to produce one, so in the source it turns into
\\[
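The two levels of escaping can be checked with a short sketch. It uses std::regex rather than QRegExp (an assumption, so that it builds without Qt), and the helper names are hypothetical. Unlike QRegExp, std::regex accepts the lazy quantifier .*? directly:

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

// The escaped literal and the raw literal spell the same pattern text: \[(.*?)\]
bool spellings_agree() {
    return std::string("\\[(.*?)\\]") == R"re(\[(.*?)\])re";
}

// Returns the text inside the first [...] pair, or std::nullopt if there is none.
std::optional<std::string> first_bracketed(const std::string& line) {
    // The compiler turns "\\[" into the two characters \[ before the
    // regex engine parses them as a literal opening bracket.
    static const std::regex bracketed("\\[(.*?)\\]");

    std::smatch m;
    if (!std::regex_search(line, m, bracketed)) return std::nullopt;
    return m.str(1);  // group 1 is the part between the brackets
}
```

A raw string literal (C++11 and later) avoids the doubling entirely, at the cost of choosing a delimiter that does not occur in the pattern.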
While learning regex, I liked using this regex tool by GSkinner. It has a listing on the right hand side of the page of unique codes and characters.
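QRegExp does not behave exactly like other regex engines. If you study the documentation you will find a lot of small differences, such as how it handles greedy versus lazy matching: it has no lazy quantifiers like *?, so you call setMinimal(true) instead.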
QRegExp and double-quoted text for QSyntaxHighlighter
How the captures are listed is pretty typical of the regex parsers I have seen. Index 0 holds the entire match, index 1 holds the first capture group (whatever was enclosed by the first set of parentheses), and so on.
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#cap
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#capturedTexts
To find more matches, you have to iteratively call indexIn.
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#indexIn
QString str = "offsets: 1.23 .50 71.00 6.00";
QRegExp rx("\\d*\\.\\d+"); // primitive floating point matching
int count = 0;
int pos = 0;
while ((pos = rx.indexIn(str, pos)) != -1) {
++count;
pos += rx.matchedLength();
}
// pos will be 9, 14, 18 and finally 24; count will end up as 4
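For comparison, the same loop can be sketched with the standard library, where std::sregex_iterator takes over the job of advancing pos. This assumes std::regex instead of QRegExp, and the helper name float_offsets is hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns the start offset of every primitive floating point number in `text`.
std::vector<std::size_t> float_offsets(const std::string& text) {
    static const std::regex number(R"(\d*\.\d+)");

    std::vector<std::size_t> offsets;
    // Each increment resumes the search right after the previous match.
    for (std::sregex_iterator it(text.begin(), text.end(), number), end;
         it != end; ++it) {
        offsets.push_back(static_cast<std::size_t>(it->position()));
    }
    return offsets;
}
```

The size of the returned vector corresponds to count in the loop above.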
Hope that helps.