Regex for replacing printf-style calls with ostream left-shift syntax - c++

The logging facility for our C++ project is about to be refactored to use repeated left-shift operators (in the manner of Qt's qDebug() syntax) instead of printf-style variadic functions.
Suppose the logging object is called logger. Let's say we want to show the ip and port of the server we connected to. In the current implementation, the usage is:
logger.logf("connected to %s:%d", ip, port);
After the refactor, the above call would become:
logger() << "connected to " << ip << ":" << port;
Manually replacing all these calls would be extremely tedious and error-prone, so naturally, I want to use a regex. As a first pass, I could replace the .logf(...) call, yielding
logger() "connected to %s:%d", ip, port;
However, reformatting this string to the left-shift syntax is where I have trouble. I managed to create the separate regexes for capturing printf placeholders and comma-delimited arguments. However, I don't know how to properly correlate the two.
In order to avoid repetition of the fairly unwieldy regexes, I will use the placeholder (printf) to refer to the printf placeholder regex (returning the named group token), and (args) to refer to the comma-delimited arguments regex (returning the named group arg). Below, I will give the outputs of various attempts applied to the relevant part of the above line, i.e.:
"connected to %s:%d", ip, port
/(printf)(args)/g produces no match.
/(printf)*(args)/g produces two matches, containing ip and port in the named group arg (but nothing in token).
/(printf)(args)*/g achieves the opposite result: it produces two matches, containing %s and %d in the named group token, but nothing in arg.
/(printf)*(args)*/g returns 3 matches: the first two contain %s and %d in token, the third contains port in arg. However, regex101 reports "20 matches - 207 steps" and seems to match before every character.
I figured that perhaps I need to specify that the first capturing group is always between double quotes. However, neither /"(printf)"(args)/g nor /"(printf)(args)/g produces any matches.
/(printf)"(args)/g produces one (incorrect) match, containing %d in group token and ip in arg, and substitution consumes the entire string between those two strings (so entering # for the substitution string results in "connected to %s:#, port). Obviously, this is not the desired outcome, but it's the only version where I could at least get both named groups in a single match.
Any help is greatly appreciated.

Disclaimer: This is a workaround; it is far from perfect and may introduce errors. Be careful when you commit the changes and, if you can, have a colleague proofread the diff to reduce the chance of breakage.
You may try a multi-step replacement, working from the maximum number of arguments in your code base down to the minimum (here I'll go from 3 to 0).
Let's consider logger.logf("connected to %s:%d some %s random text", ip, port, test);
You can match this with the following regex:
logger\.logf\("(.*?)(%[a-z])(.*?)(%[a-z])(.*?)(%[a-z])(.*?)",(.*?)(?:, (.*?))?(?:, (.*?))?\);
which gives the following groups:
1. [75-88] `connected to `
2. [88-90] `%s`
3. [90-91] `:`
4. [91-93] `%d`
5. [93-99] ` some `
6. [99-101] `%s`
7. [101-113] ` random text`
8. [115-118] ` ip`
9. [120-124] `port`
10. [126-130] `test`
Replacing with
logger() << "\1" << \8 << "\3" << \9 << "\5" << \10 << "\7";
gives:
logger() << "connected to " << ip << ":" << port << " some " << test << " random text";
Now the step with 2 arguments. The example string is
logger.logf("connected to %s:%d some random text", ip, port);
and the corresponding regex is
logger\.logf\("(.*?)(%[a-z])(.*?)(%[a-z])(.*?)",(.*?)(?:, (.*?))?\);
The matching is the following:
1. [13-26] `connected to `
2. [26-28] `%s`
3. [28-29] `:`
4. [29-31] `%d`
5. [31-48] ` some random text`
6. [50-53] ` ip`
7. [55-59] `port`
The replacement string
logger() << "\1" << \6 << "\3" << \7 << "\5";
outputs:
logger() << "connected to " << ip << ":" << port << " some random text";
Finally, the step with 1 argument.
Input:
logger.logf("Some %s text", port);
Regex:
logger\.logf\("(.*?)(%[a-z])(.*?)",(.*?)\);
Replacement:
logger() << "\1" << \4 << "\3";
Output:
logger() << "Some " << port << " text";
What about empty groups?
Let's say input is not logger.logf("Some %s text", port); but logger.logf("Some %s", port);. The output will then be:
logger() << "Some " << port << "";
You'll have to remove the leftover << "" (for example with one more search-and-replace pass) to get clean output.
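The same idea can be scripted so that one pass handles any number of arguments. The following Python sketch is not part of the answer above: the helper name convert and both patterns are assumptions made for illustration. Like the regexes above, it only recognises single-letter placeholders such as %s and %d, and it leaves a call untouched when an argument contains commas or parentheses, or when the placeholder and argument counts differ.

```python
import re

# Hypothetical patterns, chosen for this sketch only.
# Group 1: the format string body; group 2: the raw ", arg, arg" tail.
CALL = re.compile(r'logger\.logf\("((?:[^"\\]|\\.)*)"((?:\s*,\s*[^,()]+)*)\s*\);')
PLACEHOLDER = re.compile(r'%[a-z]')  # same simplification as in the answer


def convert(source):
    """Rewrite logger.logf("fmt", args...) calls into logger() << ... form."""
    def rewrite(m):
        fmt, arg_text = m.group(1), m.group(2)
        args = [a.strip() for a in arg_text.split(",")[1:]]
        chunks = PLACEHOLDER.split(fmt)
        if len(chunks) != len(args) + 1:
            return m.group(0)  # counts disagree: leave for manual review
        parts = []
        for i, chunk in enumerate(chunks):
            if chunk:  # skip empty literals, so no stray << "" is produced
                parts.append(f'"{chunk}"')
            if i < len(args):
                parts.append(args[i])
        if not parts:
            return m.group(0)  # nothing to stream: leave untouched
        return "logger() << " + " << ".join(parts) + ";"

    return CALL.sub(rewrite, source)


print(convert('logger.logf("connected to %s:%d", ip, port);'))
print(convert('logger.logf("Some %s", port);'))
```

Because calls that do not fit the pattern are returned unchanged, they remain easy to find afterwards by searching for logger.logf, and the diff should still be proofread as advised in the disclaimer.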


Regex to match strings not enclosed in macro

In a development context, I would like to make sure all strings in source files within certain directories are enclosed in some macro "STR_MACRO". For this I will be using a Python script parsing the source files, and I would like to design a regex for detecting non-commented lines with strings not enclosed in this macro.
For instance, the regex should match the following strings:
std::cout << "Hello World!" << std::endl;
load_file("Hello World!");
But not the following ones:
std::cout << STR_MACRO("Hello World!") << std::endl;
load_file(STR_MACRO("Hello World!"));
// "foo" bar
Excluding commented lines containing strings seems to work well using the regex ^(?!\s*//).*"([^"]+)". However, when I try to also exclude non-commented strings already enclosed in the macro, using the regex ^(?!\s*//).*(?!STR_MACRO\()"([^"]+)", it changes nothing (seemingly because of the opening parenthesis after STR_MACRO).
Any hints on how to achieve this?
With the PyPI regex module (which you can install with pip install regex in the terminal) you can use
import regex
pattern = r'''(?:^//.*|STR_MACRO\("[^"\\]*(?:\\.[^"\\]*)*"\))(*SKIP)(*F)|"[^"\\]*(?:\\.[^"\\]*)*"'''
text = r'''For instance, the regex should match the following strings:
std::cout << "Hello World!" << std::endl;
load_file("Hello World!");
But not the following ones:
std::cout << STR_MACRO("Hello World!") << std::endl;
load_file(STR_MACRO("Hello World!"));
// "foo" bar'''
print( regex.sub(pattern, r'STR_MACRO(\g<0>)', text, flags=regex.M) )
Details:
(?:^//.*|STR_MACRO\("[^"\\]*(?:\\.[^"\\]*)*"\))(*SKIP)(*F) - // at the line start and the rest of the line, or STR_MACRO( + a double quoted string literal pattern + ), and then the match is skipped, and the next match search starts at the failure location
| - or
"[^"\\]*(?:\\.[^"\\]*)*" - ", zero or more chars other than " and \, then zero or more repetitions of a \ followed by any single char and then zero or more chars other than " and \, and then a " char
See the Python demo. Output:
For instance, the regex should match the following strings:
std::cout << STR_MACRO("Hello World!") << std::endl;
load_file(STR_MACRO("Hello World!"));
But not the following ones:
std::cout << STR_MACRO("Hello World!") << std::endl;
load_file(STR_MACRO("Hello World!"));
// "foo" bar
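If installing the third-party regex module is not an option, a similar effect can be approximated with the standard re module, which has no (*SKIP)(*F). The sketch below is an alternative technique, not the answer's method: the unwanted cases are matched first and returned unchanged by a replacement callback, and only a bare string literal lands in group 1. The names STRING, PATTERN and wrap_bare_strings are made up for this example, and the comment branch allows leading spaces or tabs, which is an assumption based on the ^(?!\s*//) lookahead in the question.

```python
import re

# A double-quoted literal with backslash escapes (same sub-pattern as above).
STRING = r'"[^"\\]*(?:\\.[^"\\]*)*"'

PATTERN = re.compile(
    # 1st branch: whole-line comment, 2nd: already wrapped, 3rd: bare literal.
    rf'^[ \t]*//.*|STR_MACRO\({STRING}\)|({STRING})',
    re.MULTILINE,
)


def wrap_bare_strings(source):
    """Wrap string literals that are not yet inside STR_MACRO(...)."""
    def repl(m):
        if m.group(1) is None:
            return m.group(0)  # comment or wrapped literal: keep as is
        return f'STR_MACRO({m.group(1)})'

    return PATTERN.sub(repl, source)


print(wrap_bare_strings('std::cout << "Hello World!" << std::endl;'))
```

Trailing comments after code on the same line are not recognised by this sketch, so a string inside such a comment would still be wrapped.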

Google Sheets RegexpReplace with computable replacers

I'm trying to replace a pattern with some string computed with other GSheets functions. For example, I want to make all the int numbers in the string ten times larger: "I want to multiply 2 numbers in this string by 10" should turn into "I want to multiply 20 numbers in this string by 100".
Assuming, for brevity, that my string is in cell A1, I tried the following:
REGEXREPLACE(A1, "([0-9]+)", TEXT(10*VALUE("$1"),"###"))
But it seems REGEXREPLACE first evaluates its arguments and only then applies the regular expression. So it converts the 3rd argument
TEXT(10*VALUE("$1"),"###") => TEXT(10*1,"###") => "10"
and then just replaces all integers in the string with 10.
It turns out I need to substitute the group $1 BEFORE the outer functions in the 3rd argument are evaluated. Is there any way to do such a thing?
Maybe there's another way. See if this works
=join(" ", ArrayFormula(if(isnumber(split(A1, " ")), split(A1, " ")*10, split(A1, " "))))
try:
=ARRAYFORMULA(JOIN(" ", IFERROR(SPLIT(A1, " ")*10, SPLIT(A1, " "))))
or:
=ARRAYFORMULA(JOIN(" ", IF(ISNUMBER(SPLIT(A1, " ")), SPLIT(A1, " ")*10, SPLIT(A1, " "))))
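For comparison, here is the behaviour the question is after, written outside of Sheets. This is a Python sketch, not a Sheets formula, and the function name scale_integers is made up for illustration. The point is that the replacement is a callback that runs once per match, so the computation sees the actual captured digits rather than the literal text $1.

```python
import re


def scale_integers(text, factor=10):
    """Multiply every run of digits in text by factor."""
    # The lambda is called for each match, after the match is known.
    return re.sub(r"[0-9]+", lambda m: str(int(m.group(0)) * factor), text)


print(scale_integers("I want to multiply 2 numbers in this string by 10"))
```

Since this works per match rather than per space-separated word, digits that touch punctuation (as in "2,") are scaled as well.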

QRegExp not extracting text as expected

I am trying to extract text from between square brackets on a line of text. I've been messing with the regex for some time now, and cannot get what I need. (I can't even explain why the output is what it is). Here's the code:
QRegExp rx_timestamp("\[(.*?)\]");
int pos = rx_timestamp.indexIn(line);
if (pos > -1) {
    qDebug() << "Captured texts: " << rx_timestamp.capturedTexts();
    qDebug() << "timestamp cap: " << rx_timestamp.cap(0);
    qDebug() << "timestamp cap: " << rx_timestamp.cap(1);
    qDebug() << "timestamp cap: " << rx_timestamp.cap(2);
} else {
    qDebug() << "No indexin";
}
The input line is:
messages:[2013-10-08 09:13:41] NOTICE[2366] chan_sip.c: Registration from '"xx000 <sip:xx000#183.229.164.42:5060>' failed for '192.187.100.170' - No matching peer found
And the output is:
Captured texts: (".")
timestamp cap: "."
timestamp cap: ""
timestamp cap: ""
Can someone explain what is going on? Why is cap returning "." when no such character exists between the square brackets?
Can someone correct the regex to extract the timestamp from between the square brackets?
You are missing two things. Escaping the backslash, and using setMinimal. See below.
QString line = "messages:[2013-10-08 09:13:41] NOTICE[2366] chan_sip.c: Registration from '\"xx000 <sip:xx000#183.229.164.42:5060>' failed for '192.187.100.170' - No matching peer found";
QRegExp rx_timestamp("\\[(.*)\\]");
rx_timestamp.setMinimal(true);
int pos = rx_timestamp.indexIn(line);
if (pos > -1) {
    qDebug() << "Captured texts: " << rx_timestamp.capturedTexts();
    qDebug() << "timestamp cap: " << rx_timestamp.cap(0);
    qDebug() << "timestamp cap: " << rx_timestamp.cap(1);
    qDebug() << "timestamp cap: " << rx_timestamp.cap(2);
} else {
    qDebug() << "No indexin";
}
Output:
Captured texts: ("[2013-10-08 09:13:41]", "2013-10-08 09:13:41")
timestamp cap: "[2013-10-08 09:13:41]"
timestamp cap: "2013-10-08 09:13:41"
timestamp cap: ""
UPDATE: What is going on:
A backslash in C++ source code starts an escape sequence, such as \n. To get a literal backslash into a regular expression, you have to escape it in the string literal as \\. The regular expression engine then sees a single \, just as you would write it in Ruby, Perl or Python.
The square brackets must be escaped too, because in a regex they normally delimit a character class.
So for the regular expression engine to see a literal square bracket, you need to send it
\[
but a C++ string literal needs two backslashes to produce one, so in the source it becomes
\\[
This also explains the "." in your output. \[ is not a valid C++ escape sequence, so the compiler typically warns and drops the backslash. The engine then receives [(.*?)], a character class matching any one of the characters ( . * ? ), and the first such character in your line is the . in chan_sip.c.
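The effect of losing the backslashes can be reproduced in any regex engine. The Python sketch below is an illustration only, not Qt code, and uses a shortened copy of the input line. It feeds the engine the pattern as it arrives without the backslashes, and then the pattern with them. Note that Python's re supports the lazy .*? directly, which QRegExp does not.

```python
import re

# Shortened copy of the log line from the question.
line = "messages:[2013-10-08 09:13:41] NOTICE[2366] chan_sip.c: Registration from"

# Without backslashes the brackets form a character class of ( . * ? )
unescaped = re.search(r"[(.*?)]", line)

# With backslashes the brackets are literal and the group is lazy.
escaped = re.search(r"\[(.*?)\]", line)

print(unescaped.group(0))  # .
print(escaped.group(1))    # 2013-10-08 09:13:41
```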
While learning regex, I liked using this regex tool by GSkinner. It has a listing on the right hand side of the page of unique codes and characters.
QRegExp does not behave exactly like other regex engines, and studying the documentation turns up a lot of little differences. One of them is greedy versus lazy matching: QRegExp does not support lazy quantifiers such as .*?, so you call setMinimal(true) on the whole pattern instead. See also the question "QRegExp and double-quoted text for QSyntaxHighlighter".
How the captures are listed is pretty typical of regex parsers, as far as I have seen. The list starts with the entire match, cap(0), followed by the first capture group, cap(1), i.e. whatever was enclosed by the first set of parentheses.
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#cap
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#capturedTexts
To find more matches, you have to iteratively call indexIn.
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#indexIn
QString str = "offsets: 1.23 .50 71.00 6.00";
QRegExp rx("\\d*\\.\\d+"); // primitive floating point matching
int count = 0;
int pos = 0;
while ((pos = rx.indexIn(str, pos)) != -1) {
    ++count;
    pos += rx.matchedLength();
}
// pos will be 9, 14, 18 and finally 24; count will end up as 4
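The loop can be cross-checked outside of Qt. As a rough Python equivalent (a sketch for comparison, not Qt code), re.finditer performs the same "search, then continue after the match" iteration and reports the same start positions:

```python
import re

text = "offsets: 1.23 .50 71.00 6.00"

# Each match object records where the match began, like pos in the Qt loop.
starts = [m.start() for m in re.finditer(r"\d*\.\d+", text)]

print(starts)       # [9, 14, 18, 24]
print(len(starts))  # 4
```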
Hope that helps.

Qt 4.8.4 MAC Address QRegExp

I'm trying to get Qt to match a MAC Address ( 1a:2b:3c:4d:5e:6f ) using a QRegExp. I can't seem to get it to match - what am I doing wrong?
I am forcing it to try and match the string:
"48:C1:AC:55:86:F3"
Here are my attempts:
// Define a RegEx to match the mac address
//QRegExp regExMacAddress("[0-9a-F]{1,2}[\.:-]){5}([0-9a-F]{1,2}");
//QRegExp regExMacAddress("[0-9a-F]{0,2}:[0-9a-F]{0,2}:[0-9a-F]{0,2}:[0-9a-F]{0,2}:[0-9a-F]{0,2}:[0-9a-F]{0,2}");
//regExMacAddress.setPatternSyntax(QRegExp::RegExp);
// Ensure that the hexadecimal characters are upper case
hwAddress = hwAddress.toUpper();
qDebug() << "STRING TO MATCH: " << hwAddress << "MATCHED IT: " << regExMacAddress.indexIn(hwAddress) << " Exact Match: " << regExMacAddress.exactMatch(hwAddress);
// Check the mac address format
if ( regExMacAddress.indexIn(hwAddress) == -1 ) {
In your first example the opening parenthesis is missing and \. is incorrect: in a C++ string literal the backslash has to be doubled (see the QRegExp documentation for details). In both examples a-F matches nothing, because 'a' > 'F'.
The correct answer is in kenrogers' comment, but I'll repeat it here:
([0-9A-F]{2}[:-]){5}([0-9A-F]{2})
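If you want to match . as a separator as well, put the hyphen last in the character class so that it is not read as a range (inside a character class the dot needs no escaping):
([0-9A-F]{2}[:.-]){5}([0-9A-F]{2})
If you also want to match lower case characters, use:
([0-9A-Fa-f]{2}[:.-]){5}([0-9A-Fa-f]{2})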
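A pattern like this is easy to sanity-check before wiring it into Qt. The Python sketch below is illustrative only: the helper name is_mac is made up, and the separator class is written with the hyphen last, [:.-], so that it cannot be read as a range. fullmatch plays the role of QRegExp::exactMatch here.

```python
import re

# Five "hex pair + separator" units followed by a final hex pair.
MAC = re.compile(r"(?:[0-9A-Fa-f]{2}[:.-]){5}[0-9A-Fa-f]{2}")


def is_mac(candidate):
    """Return True if the whole string looks like a MAC address."""
    return MAC.fullmatch(candidate) is not None


print(is_mac("48:C1:AC:55:86:F3"))
print(is_mac("48:C1:AC:55:86"))
```

The pattern accepts mixed separators within one address (for example 48:C1-AC.55:86:F3); rejecting those would need a backreference to the first separator.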

How to declare a variable that spans multiple lines

I'm attempting to initialise a string variable in C++, and the value is so long that it's going to exceed the 80 character per line limit I'm working to, so I'd like to split it to the next line, but I'm not sure how to do that.
I know that when splitting the contents of a stream across multiple lines, the syntax goes like
cout << "This is a string"
<< "This is another string";
Is there an equivalent for variable assignment, or do I have to declare multiple variables and concatenate them?
Edit: I misspoke when I wrote the initial question. When I say 'next line', I'm just meaning the next line of the script. When it is printed upon execution, I would like it to be on the same line.
You can simply break the line like this:
string longText("This is a "
"very very very "
"long text");
In C and C++, adjacent string literals separated only by whitespace (including newlines) are concatenated by the compiler, so you can freely split a string literal across multiple lines this way.
It can also simply be
cout << "This is a string"
"This is another string";
You can write this:
const char * str = "First phrase, "
"Second phrase, "
"Third phrase";