Extract string matching a specific format - regex

Given a QString, I want to extract a substring from the main string input.
e.g. I have a QString reading something like:
\\\\?\\Volume{db41aa6a-c0b8-11e9-bc8a-806e6f6e6963}\\
I need to extract the substring (if one exists) that matches a regex such as (\w){8}([-](\w){4}){3}[-](\w){12}, i.e. the format shown below:
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
and it should return
db41aa6a-c0b8-11e9-bc8a-806e6f6e6963
if found, else an empty QString.
Currently, I can achieve this by doing something like:
string.replace("{", "").replace("}", "").replace("\\", "").replace("?", "").replace("Volume", "");
But this is tedious and inefficient, and tailored to a specific request.
Is there a generalized function that enables me to extract a substring using a regex format or other?
Update
To clarify after @Emma's answer: I want something like QString::extract("(\w){8}([-](\w){4}){3}[-](\w){12}"), which would return db41aa6a-c0b8-11e9-bc8a-806e6f6e6963.

Here are several ways to extract part of a string like the one in the question. I don't know how much of the string format is fixed and how much is variable, so not all of these examples may be practical. Some of the examples use the QStringRef class, which can be more efficient, but the original string (the one being referenced) must stay alive and unmodified while any references to it are in use (see the warning in the docs).
const QString str("\\\\?\\Volume{db41aa6a-c0b8-11e9-bc8a-806e6f6e6963}\\");
// Treat str as a list delimited by "{" and "}" chars.
const QString sectResult = str.section('{', 1, 1).section('}', 0, 0); // = "db41aa6a-c0b8-11e9-bc8a-806e6f6e6963"
const QString sectRxResult = str.section(QRegExp("\\{|\\}"), 1, 1); // = "db41aa6a-c0b8-11e9-bc8a-806e6f6e6963"
// Example using QStringRef, though this could also be just QString::split() which returns QString copies.
const QVector<QStringRef> splitRef = str.splitRef(QRegExp("\\{|\\}"));
const QStringRef splitRefResult = splitRef.value(1); // = "db41aa6a-c0b8-11e9-bc8a-806e6f6e6963"
// Use regular expressions to find/extract matching string
const QRegularExpression rx("\\w{8}(?:-(\\w){4}){3}-\\w{12}"); // match a UUID string
const QRegularExpressionMatch match = rx.match(str);
const QString rxResultStr = match.captured(0); // = "db41aa6a-c0b8-11e9-bc8a-806e6f6e6963"
const QStringRef rxResultRef = match.capturedRef(0); // = "db41aa6a-c0b8-11e9-bc8a-806e6f6e6963"
const QRegularExpression rx2(".+\\{([^{\\}]+)\\}.+"); // capture anything inside { } brackets
const QRegularExpressionMatch match2 = rx2.match(str);
const QString rx2ResultStr = match2.captured(1); // = "db41aa6a-c0b8-11e9-bc8a-806e6f6e6963"
// Make a copy for replace so that our references to the original string remain valid.
const QString replaceResult = QString(str).replace(rx2, "\\1"); // = "db41aa6a-c0b8-11e9-bc8a-806e6f6e6963"
qDebug() << sectResult << sectRxResult << splitRefResult << rxResultStr
<< rxResultRef << rx2ResultStr << replaceResult;
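The "extract the first match or return an empty string" behaviour asked for in the question can also be sketched without Qt. This is a standard-library analogue using std::regex rather than QRegularExpression; the names extractFirstMatch and kUuidPattern are hypothetical, and the pattern assumes hexadecimal UUID digits.

```cpp
#include <regex>
#include <string>

// Hypothetical pattern: 8-4-4-4-12 hexadecimal digits separated by hyphens.
const char* const kUuidPattern =
    "[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}";

// Hypothetical helper: returns the first substring of `text` that matches
// `pattern`, or an empty string when nothing matches.
std::string extractFirstMatch(const std::string& text, const std::string& pattern)
{
    const std::regex rx(pattern);
    std::smatch match;
    if (std::regex_search(text, match, rx))
        return match.str(0); // group 0 is the whole match
    return std::string();
}
```

Calling extractFirstMatch(str, kUuidPattern) on the volume path from the question should yield the UUID, and an empty string for input that contains no UUID.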

Maybe,
Volume{(\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b)}
or just,
\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b
for a full match might be a bit closer.
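To see why the hexadecimal character class is "a bit closer" than \w, here is a small sketch comparing the two. It uses std::regex from the standard library as a stand-in for the Qt classes, and both function names are made up for illustration.

```cpp
#include <regex>
#include <string>

// Loose pattern in the spirit of the question: \w also accepts
// underscores and non-hex letters such as 'z'.
bool matchesLoosePattern(const std::string& s)
{
    static const std::regex rx("\\w{8}(?:-\\w{4}){3}-\\w{12}");
    return std::regex_match(s, rx); // the whole string must match
}

// Stricter pattern: lowercase hexadecimal digits only.
bool matchesHexPattern(const std::string& s)
{
    static const std::regex rx(
        "[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}");
    return std::regex_match(s, rx);
}
```

A string such as not_a_id-zzzz-zzzz-zzzz-zzzzzzzzzzzz passes the loose check but not the hexadecimal one, while the UUID from the question passes both.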
If you wish to simplify, update, or explore the expression, it is explained in the top-right panel of regex101.com. You can also watch or modify the matching steps in the regex101 debugger, which demonstrates how a regex engine might consume some sample input strings step by step.
Source: Searching for UUIDs in text with regex

Related

How to get number of partial matches using re2

I want to get the number of substring matches in a given string using re2.
I have read the re2 header (https://github.com/google/re2/blob/master/re2/re2.h) but do not see an easy way to do that.
I have following sample code:
std::string regexPunc = "[\\p{P}]"; // matches any punctuations;
re2::RE2 re2Punc(regexPunc);
std::string sampleString = "test...test";
if (re2::RE2::PartialMatch(sampleString, re2Punc)) {
std::cout << re2Punc.numOfMatches();
}
I want it to output 3, as there are three punctuation characters in the string.
Use FindAndConsume, and count the matches yourself. It won't be inefficient, because in order to know the number of matches, those matches would have to be performed and counted anyway.
Example:
std::string regexPunc = "[\\p{P}]"; // matches any punctuation character
re2::RE2 re2Punc(regexPunc);
std::string sampleString = "test...test";
re2::StringPiece input(sampleString); // FindAndConsume advances this view past each match
int numberOfMatches = 0;
while (re2::RE2::FindAndConsume(&input, re2Punc)) {
    ++numberOfMatches;
}
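The same "find repeatedly and count" idea can be sketched with the standard library for readers without RE2 at hand. This is an analogue, not the RE2 API: it uses std::sregex_iterator, the name countMatches is hypothetical, and [[:punct:]] stands in for \p{P} because std::regex has no Unicode property classes (so only ASCII punctuation is covered in the default locale).

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts the non-overlapping matches of `rx` in `text`
// by walking a regex iterator from the first match to the end.
std::size_t countMatches(const std::string& text, const std::regex& rx)
{
    const auto n = std::distance(
        std::sregex_iterator(text.begin(), text.end(), rx),
        std::sregex_iterator()); // default-constructed iterator marks the end
    return static_cast<std::size_t>(n);
}
```

As with FindAndConsume, every match still has to be performed once in order to be counted.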

In Qt, what takes the least amount of code to replace string matches with regular expression captures?

I was hoping that QString would allow this:
QString myString("School is LameCoolLame and LameRadLame");
myString.replace(QRegularExpression("Lame(.+?)Lame"),"\1");
Leaving
"School is Cool and Rad"
Instead from what I saw in the docs, doing this is a lot more convoluted requiring you to do (from the docs):
QRegularExpression re("\\d\\d \\w+");
QRegularExpressionMatch match = re.match("abc123 def");
if (match.hasMatch()) {
QString matched = match.captured(0); // matched == "23 def"
// ...
}
Or in my case something like this:
QString myString("School is LameCoolLame and LameRadLame");
QRegularExpression re("Lame(.+?)Lame");
QRegularExpressionMatch match = re.match(myString);
if (match.hasMatch()) {
for (int i = 0; i < myString.count(re); i++) {
QString newString(match.captured(i));
myString.replace(myString.indexOf(re),re.pattern().size, match.captured(i));
}
}
And that doesn't even seem to work, (I gave up actually). There must be an easier more convenient way. For the sake of simplicity and code readability, I'd like to know the methods which take the least lines of code to accomplish this.
Thanks.
QString myString("School is LameCoolLame and LameRadLame");
myString.replace(QRegularExpression("Lame(.+?)Lame"),"\\1");
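The code above works as you expected. In your version, you forgot to escape the escape character itself: in a C++ string literal, "\1" is the single character with code 1, so the \1 backreference never reaches the regex engine.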
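The escaping issue is not specific to Qt. As a rough standard-library sketch (std::regex_replace rather than QString::replace; the function name stripLameMarkers is made up), the same one-line replacement looks like this. Note that std::regex uses $1 instead of \1 for the first capture group in the replacement string.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces every "Lame<something>Lame" with just the
// captured <something>. regex_replace substitutes all matches by default.
std::string stripLameMarkers(const std::string& text)
{
    static const std::regex rx("Lame(.+?)Lame"); // non-greedy capture
    return std::regex_replace(text, rx, "$1");   // $1 = first capture group
}
```

For comparison, std::string("\1") has length 1 (a single control character), while std::string("\\1") has length 2 (a backslash followed by the digit 1), which is what the Qt replacement string needs.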

Qt Using QRegularExpression multiline option

I'm writing a program that uses QRegularExpression with MultilineOption. I wrote the code below, but matching stops on the first line. Why? What am I doing wrong?
QString recv = "AUTH-<username>-<password>\nINFO-ID:45\nREG-<username>-<password>-<name>-<status>\nSEND-ID:195-DATE:12:30 2/02/2015 <esempio>\nUPDATEN-<newname>\nUPDATES-<newstatus>\n";
QRegularExpression exp = QRegularExpression("(SEND)-ID:(\\d{1,4})-DATE:(\\d{1,2}):(\\d) (\\d{1,2})\/(\\d)\/(\\d{2,4}) <(.+)>\\n|(AUTH)-<(.+)>-<(.+)>\\n|(INFO)-ID:(\\d{1,4})\\n|(REG)-<(.+)>-<(.+)>-<(.+)>-<(.+)>\\n|(UPDATEN)-<(.+)>\\n|(UPDATES)-<(.+)>\\n", QRegularExpression::MultilineOption);
qDebug() << exp.pattern();
QRegularExpressionMatch match = exp.match(recv);
qDebug() << match.lastCapturedIndex();
for (int i = 0; i <= match.lastCapturedIndex(); ++i) {
qDebug() << match.captured(i);
}
Can someone help me?
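You should use the globalMatch() method rather than match(): match() returns only the first match, while globalMatch() returns an iterator over all of them.
See the QRegularExpression documentation for globalMatch():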
Attempts to perform a global match of the regular expression against
the given subject string, starting at the position offset inside the
subject, using a match of type matchType and honoring the given
matchOptions. The returned QRegularExpressionMatchIterator is
positioned before the first match result (if any).
Also, you can remove the QRegularExpression::MultilineOption option, as it has no effect here: it only changes how ^ and $ behave, and the pattern uses neither.
Sample code:
QRegularExpressionMatchIterator i = exp.globalMatch(recv);
while (i.hasNext()) {
    QRegularExpressionMatch match = i.next();
    // ...
}
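The difference between a single match and a global match can be illustrated with the standard library as well. This is a simplified sketch, not the Qt API: it uses std::sregex_iterator, a much shorter pattern than the one in the question, and a hypothetical function name commandKeywords.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: visits every newline-terminated command in `input`
// and collects its leading keyword (capture group 1).
std::vector<std::string> commandKeywords(const std::string& input)
{
    // One alternation for the keyword, then the rest of the line.
    static const std::regex rx("(AUTH|INFO|REG|SEND|UPDATEN|UPDATES)-[^\n]*\n");
    std::vector<std::string> keywords;
    for (std::sregex_iterator it(input.begin(), input.end(), rx), end; it != end; ++it)
        keywords.push_back((*it)[1].str());
    return keywords;
}
```

A single std::regex_search call would stop after the first command, just as match() does; the iterator keeps searching from the end of the previous match, which is the role globalMatch() plays in Qt.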
I actually googled this question while having a similar issue, but I don't completely agree with the answer above. I think most questions about multi-line matching with the new QRegularExpression can be answered as follows:
Use the QRegularExpression::DotMatchesEverythingOption option, which allows the dot (.) to match newline characters. This is extremely useful when porting from QRegExp.
Your pattern is an "or" expression (an alternation): as soon as the first alternative matches, the job is done.
I think you need to split the string and loop over the resulting array, comparing each element against the expression.
If the data always has the same structure, you can use something like this:
"(AUTH)-<([^>]+?)>-<([^>]+?)>\\nINFO-ID:(\\d+)\\n(REG)-<([^>]+?)>-<([^>]+?)>-<([^>]+?)>-<([^>]+?)>\\n(SEND)-ID:(\\d+)-DATE:(\\d+):(\\d+) (\\d+)/(\\d+)/(\\d+) <([^>]+?)>\\n(UPDATEN)-<([^>]+?)>\\n(UPDATES)-<([^>]+?)>"
This pattern contains 21 capture groups.

C++11 regex replace

I have an XML string that I wish to write to a log. This XML contains some sensitive data that I'd like to mask before sending it to the log file. I am currently using std::regex to do this:
std::regex reg("<SensitiveData>(\\d*)</SensitiveData>");
return std::regex_replace(xml, reg, "<SensitiveData>......</SensitiveData>");
Currently the data is replaced by exactly six '.' characters, but what I really want is to replace the sensitive data with the matching number of dots. That is, I'd like to get the length of the capture group and emit exactly that many dots.
Can this be done?
regex_replace of C++11 regular expressions does not have the capability you are asking for — the replacement format argument must be a string. Some regular expression APIs allow replacement to be a function that receives a match, and which could perform exactly the substitution you need.
But regexps are not the only way to solve the problem, and in C++ it's not exactly hard to look for two fixed strings and replace the characters in between:
#include <cstring> // strlen
#include <string>
const char* const PREFIX = "<SensitiveData>";
const char* const SUFFIX = "</SensitiveData>";
void replace_sensitive(std::string& xml) {
    size_t start = 0;
    while (true) {
        size_t pref, suff;
        // find the next opening tag, then the closing tag that follows it
        if ((pref = xml.find(PREFIX, start)) == std::string::npos)
            break;
        if ((suff = xml.find(SUFFIX, pref + strlen(PREFIX))) == std::string::npos)
            break;
        // replace stuff between prefix and suffix with '.'
        for (size_t i = pref + strlen(PREFIX); i < suff; i++)
            xml[i] = '.';
        start = suff + strlen(SUFFIX);
    }
}
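If you would rather stay with std::regex, the "replacement as a function" idea mentioned above can be emulated by iterating over the matches and building the output by hand. This is only a sketch under the question's assumption that the sensitive data is all digits; the name maskSensitive is hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: returns a copy of `xml` in which the digits inside
// each <SensitiveData> element are replaced by the same number of dots.
std::string maskSensitive(const std::string& xml)
{
    static const std::regex rx("(<SensitiveData>)(\\d*)(</SensitiveData>)");
    std::string out;
    auto last = xml.cbegin(); // end of the previous match
    for (std::sregex_iterator it(xml.cbegin(), xml.cend(), rx), end; it != end; ++it) {
        const std::smatch& m = *it;
        out.append(last, m[0].first);                                // text before the match
        out += m[1].str();                                           // opening tag
        out.append(static_cast<std::size_t>(m[2].length()), '.');    // one dot per digit
        out += m[3].str();                                           // closing tag
        last = m[0].second;
    }
    out.append(last, xml.cend()); // trailing text after the last match
    return out;
}
```

Unlike replace_sensitive above, this version leaves the input untouched and returns a new string, which may be convenient when the original XML is still needed after logging.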

Comparing regex in qt

I have a regex which I hope matches any file with one of the listed extensions:
((\\.cpp$)|(\\.cxx$)|(\\.c$)|(\\.hpp$)|(\\.h$))
How to compare it in Qt against selected file?
Your actual RegEx itself doesn't have double backslashes (just when you fit it into a string literal). And you'll need some kind of wildcard if you want to use it to match full filenames. There's a semantic issue of whether you want a file called just ".cpp" to match or not. What about case sensitivity?
I'll assume for the moment that you want at least one other character in the beginning and use .+:
.+((\.cpp$)|(\.cxx$)|(\.c$)|(\.hpp$)|(\.h$))
So this should work:
QRegExp rx (".+((\\.cpp$)|(\\.cxx$)|(\\.c$)|(\\.hpp$)|(\\.h$))");
bool isMatch = rx.exactMatch(filename);
But with the expressive power of a whole C++ compiler at your beck and call, it can be a bit stifling to use regular expressions. You might have an easier time adapting code if you write it more like:
bool isMatch = false;
QStringList fileExtensionList;
fileExtensionList << "CPP" << "CXX" << "C" << "HPP" << "H";
QStringList splitFilenameList = filename.split(".");
if (splitFilenameList.size() > 1) {
    // the extension is whatever follows the last dot
    QString fileExtension = splitFilenameList[splitFilenameList.size() - 1];
    isMatch = fileExtensionList.contains(fileExtension.toUpper());
}
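For comparison, the same non-regex check can be sketched in plain standard C++ without the Qt containers. The function name hasSourceExtension is hypothetical, and like the snippet above it is case-insensitive and treats a file named just ".cpp" as a match.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: true when the text after the last '.' in `filename`
// is one of the listed C/C++ extensions, compared case-insensitively.
bool hasSourceExtension(const std::string& filename)
{
    static const std::array<const char*, 5> extensions = {{"cpp", "cxx", "c", "hpp", "h"}};
    const std::size_t dot = filename.rfind('.');
    if (dot == std::string::npos)
        return false; // no extension at all
    std::string ext = filename.substr(dot + 1);
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char ch) { return static_cast<char>(std::tolower(ch)); });
    return std::find(extensions.begin(), extensions.end(), ext) != extensions.end();
}
```

Adding or removing an extension is then a one-line change to the array rather than an edit to a regular expression.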