Problem with special characters with RegEx in C++

I have a problem handling special characters in a string (from IIS SharePoint log files) that contains a domain name followed by a backslash and a user name starting with t, n, or r, which seems to confuse the regular expression. My code is as follows:
std::setlocale(LC_ALL, ".ACP"); //Sets the locale to the ANSI code page obtained from the operating system. FR characters
std::string subject("2018-08-26 11:38:20 172.20.1.148 GET /BaseDocumentaire/Documents+de+la+page+Notes+de+services/Rappel+du+dispositif+de+Sécurité+relatif+aux+Moyens+de+paiement+et+d’épargne+en+agence.pdf - 80 0#.w|domainname\tonzaro 10.12.105.24 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:61.0)+Gecko/20100101+Firefox/61.0 200 0 0 29984");
std::string result;
std::string g1, g2, g5, g9, g10; //str groups in regex
try {
std::regex re("(\\d{4}-\\d{2}-\\d{2})( \\d{2}:\\d{2}:\\d{2})( 172.20.1.148)( GET | POST | HEAD )((/.*){1,4}/.*.(pdf|aspx))( -.*)(domainname.[a-zA-Z0-9]*)( \\d+.\\d+.\\d+.\\d+)");
std::sregex_iterator next(subject.begin(), subject.end(), re);
std::sregex_iterator end;
while (next != end) {
std::smatch match = *next;
std::cout << match.str() << "\n";
std::cout << "-------------------------------------------" << "\n";
g1 = match.str(1);
g2 = match.str(2);
g5 = match.str(5);
g9 = match.str(9);
g10 = match.str(10);
next++;
}
std::cout << "Date: " + g1 << "\n";
std::cout << "Time: " + g2 << "\n";
std::replace(g5.begin(), g5.end(), '+', ' ');
std::cout << "Link Document : " + g5 << "\n";
std::cout << "User: " + g9 << "\n";
std::cout << "IP: " + g10 << "\n";
}
catch (std::regex_error& e) {
std::cout << "Syntax error in the regular expression" << "\n";
}
My output for domain name is: domainname onzaro
Any help please for this problem with \, \t, \n or \r ?

I'd urge you to use raw string literals. This is a solution designed for cases where the literal should not be processed in any way, such as yours.
The syntax is R "delimiter( raw_characters )delimiter", so in your case it could be:
std::string subject(R"raw(2018-08-26 11:38:20 172.20.1.148 GET /BaseDocumentaire/Documents+de+la+page+Notes+de+services/Rappel+du+dispositif+de+Sécurité+relatif+aux+Moyens+de+paiement+et+d’épargne+en+agence.pdf - 80 0#.w|domainname\tonzaro 10.12.105.24 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:61.0)+Gecko/20100101+Firefox/61.0 200 0 0 29984)raw");
std::regex re( R"raw((\d{4}-\d{2}-\d{2})( \d{2}:\d{2}:\d{2})( 172.20.1.148)( GET | POST | HEAD )((/.*){1,4}/.*.(pdf|aspx))( -.*)(domainname.[a-zA-Z0-9]*)( \d+.\d+.\d+.\d+))raw");
(I might have missed some superfluous \ above). See it live.
Those special characters are called escape sequences, and they are processed in string literals at compile time (in translation phase 5, to be precise). For raw string literals this transformation is suppressed.
You don't care about any special character handling. You just need to take care that ")delimiter" doesn't appear in your literal, which I imagine could happen in regex.
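To illustrate the point about raw literals, here is a minimal sketch. The helper name extractDomainUser and the reduced pattern are assumptions made for this example, not part of the original code. Inside a raw literal the two characters \\ reach the regex engine unchanged, and the regex grammar reads them as one literal backslash.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the "domainname\user" field of a log line,
// or an empty string when the field is not present.
std::string extractDomainUser(const std::string& line) {
    // Raw literal: the compiler leaves \\ alone, so the regex engine sees
    // an escaped (literal) backslash followed by the user name characters.
    static const std::regex re(R"re(domainname\\[A-Za-z0-9]+)re");
    std::smatch m;
    return std::regex_search(line, m, re) ? m.str() : std::string();
}
```

If the subject string is written as an ordinary literal, the \t in it becomes a tab at compile time and the helper finds nothing, which matches the behavior described in the question.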

'\t' is one character, a horizontal tab. If you want the characters \ and t, you need to escape the backslash: "\\t".
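A small sketch of the difference, using hypothetical helper names chosen only for this example. Each function returns the length of one way of writing the literal.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only.
std::size_t tabLength()     { return std::string("\t").size(); }     // one character: a tab
std::size_t escapedLength() { return std::string("\\t").size(); }    // two characters: \ and t
std::size_t rawLength()     { return std::string(R"(\t)").size(); }  // two characters: \ and t
```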

Related

std::regex - lookahead assertion not always working

I'm writing a module that's making some string substitutions into text to give to a scripting language. The language's syntax is vaguely lisp-y, so expressions are bounded by parentheses and symbols separated by spaces, most of them starting with '$'. A regular expression like this seems like it should give matches at the appropriate symbol boundaries:
auto re_match_abc = std::regex{ "(?=.*[[:space:]()])\\$abc(?=[()[:space:]].*)" };
But in my environment (Visual C++ 2017, 15.9.19, targeting C++17) it can match strings without a suitable boundary in front of them:
std::cout << " $abc -> " << std::regex_replace(" $abc ", re_match_abc, "***") << std::endl;
std::cout << " ($abc) -> " << std::regex_replace("($abc)", re_match_abc, "***") << std::endl;
std::cout << "xyz$abc -> " << std::regex_replace("xyz$abc ", re_match_abc, "***") << std::endl;
std::cout << " $abcdef -> " << std::regex_replace(" $abcdef", re_match_abc, "***") << std::endl;
// Result from VC++ 2017:
//
// $abc -> ***
// ($abc) -> (***)
// xyz$abc -> xyz*** <= What's going wrong here?
// $abcdef -> $abcdef
Why is that regex ignoring the positive-lookahead requirement to have at least one space or parenthesis before the matching text?
[I realize that there are other ways to do this job and to do it really robustly maybe I should use something to turn the string into a token stream, but for the immediate job I have (and because the person authoring the strings that get processed is sitting next to me, so we can coordinate) I thought that regex replacements would do for now.]

You need to use a positive lookbehind instead. What you really want is this:
auto re_match_abc = std::regex{ "(?<=[[:space:]()])\\$abc(?=[()[:space:]])" };
You can try it out on a website like https://regex101.com/ (just remove the escaped backslash that's required for the C++ string). It explains what each piece of the regex is doing and shows you everything that matches.
Keep in mind that this will also match things like )$abc)
Edit: std::regex apparently does not support lookbehind. For your specific case you might try something like this:
auto re_match_abc = std::regex{ "([[:space:]()])\\$abc(?=[()[:space:]])" };
std::cout << " $abc -> " << std::regex_replace(" $abc ", re_match_abc, "$1***") << std::endl;
std::cout << " ($abc) -> " << std::regex_replace("($abc)", re_match_abc, "$1***") << std::endl;
std::cout << "xyz$abc -> " << std::regex_replace("xyz$abc ", re_match_abc, "$1***") << std::endl;
std::cout << " $abcdef -> " << std::regex_replace(" $abcdef", re_match_abc, "$1***") << std::endl;
output:
$abc -> ***
($abc) -> (***)
xyz$abc -> xyz$abc
$abcdef -> $abcdef
try it here
Here instead of a lookbehind we have a normal capture group. In the replacement we're emitting whatever we captured (a parenthesis or space) followed by the actual string we want to replace $abc with.
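One gap in the capture-group version is a symbol at the very start or very end of the input, where there is no boundary character to capture or look ahead to. A possible variation, sketched here under the assumption that start-of-input and end-of-input should also count as boundaries, adds ^ and $ as alternatives. The function name replaceAbc is made up for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces $abc with *** when it is delimited by
// whitespace, parentheses, or the start/end of the input.
std::string replaceAbc(const std::string& text) {
    // Group 1 captures the leading boundary (possibly empty at the start)
    // so that "$1" can put it back in the replacement.
    static const std::regex re(
        R"re((^|[[:space:]()])\$abc(?=[[:space:]()]|$))re");
    return std::regex_replace(text, re, "$1***");
}
```

The original pattern misbehaved because (?=...) always looks forward from the current position, so the leading lookahead was satisfied by the space after $abc rather than by anything before it.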

Why is my c++ regex matching correctly, but not returning the correct value?

I have a regex for matching terms of a polynomial, which I will use to implement a function for turning a string into a polynomial class. You can see the regex demoed here with the correct matches being generated. However, when I try to implement it, my program finds the matches properly, but prints them bizarrely to the screen. For instance:
-21323x^5+1233x4+123x^2-1232
Trying to match: -21323x^5+1233x4+123x^2-1232
-21323x^5
Trying to match: +1233x4+123x^2-1232
1233x4
Trying to match: +123x^2-1232
12xx^2
Trying to match: -1232
-1232
In this case, for some reason it prints 12xx^2 rather than 123x^2
And another:
-1234x^5+789x4+6x^2-567+123x
Trying to match: -1234x^5+789x4+6x^2-567+123x
-1234x^5
Trying to match: +789x4+6x^2-567+123x
789x4
Trying to match: +6x^2-567+123x
x^22
Trying to match: -567+123x
-567
Trying to match: +123x
23xx
In this case it shows x^22 instead of 6x^2 and 23xx instead of 123x.
This is my code:
Poly* Poly::fromString(std::string str) {
Poly* re = new Poly;
bool returnNull = true;
std::regex r_term("((-?[0-9]*)x(\\^?([0-9]+))?|-?[0-9]+)");
std::smatch sm;
while(std::regex_search(str, sm, r_term)) {
returnNull = false;
std::cout << "Trying to match: " << str << std::endl;
str = sm.suffix().str();
std::cout << sm.str() << std::endl;
}
if(returnNull) {
delete re;
return nullptr;
} else return re;
}

While Igor correctly notices the issue with your current code, I think all you need is to get full pattern matches, and for this purpose, I'd rather suggest using a regex iterator:
std::regex r("(-?[0-9]*)x(\\^?([0-9]+))?|-?[0-9]+");
std::string s = "-21323x^5+1233x4+123x^2-1232";
for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
i != std::sregex_iterator();
++i)
{
std::smatch m = *i;
std::cout << "Match value: " << m.str() << '\n';
std::cout << "Group 1 value: " << m.str(1) << '\n';
std::cout << "Group 2 value: " << m.str(2) << '\n';
std::cout << "Group 3 value: " << m.str(3) << '\n';
}
See the C++ demo online.
The pattern details:
-? - 1 or 0 hyphens
[0-9]* - 0 or more digits
x - a literal char x
(\\^?([0-9]+))? - 1 or 0 sequences of:
\\^? - an optional (1 or 0) ^ symbols
[0-9]+ - 1 or more digits
| - or
-? - an optional hyphen/minus
[0-9]+ - 1 or more digits.
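As far as I can tell, the garbled output in the question comes from reading sm.str() after str has been reassigned: a std::smatch stores iterators into the searched string, so changing that string first leaves the match pointing at modified data. A minimal sketch that avoids this by never modifying the input is shown here. The function name splitTerms is an assumption for the example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns each polynomial term found in the input.
std::vector<std::string> splitTerms(const std::string& poly) {
    static const std::regex term(R"re((-?[0-9]*)x(\^?([0-9]+))?|-?[0-9]+)re");
    std::vector<std::string> terms;
    // The iterator walks over the unmodified input, so every match stays
    // valid for as long as poly itself is alive.
    for (std::sregex_iterator it(poly.begin(), poly.end(), term), end;
         it != end; ++it) {
        terms.push_back(it->str());  // copy the text out of the match
    }
    return terms;
}
```

With the first input from the question, "-21323x^5+1233x4+123x^2-1232", this yields the four terms -21323x^5, 1233x4, 123x^2 and -1232.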

Filter out url from string

I'm trying to filter out URLs from a string that contains lots of special characters, blank space, and URLs. I have tried to use a regex, but it fails: it sometimes manages to find the URL, but the output still contains special characters and blank space, so here I am. Best regards P
string str;
std::ifstream in("c:/Users/Petrus/Documents/History", std::ios::binary);
std::stringstream buffer;
if (!in.is_open()){
cout << "Failed to open" << endl;
}
else{
cout << "Opened OK" << endl;
}
buffer << in.rdbuf();
std::string contents(buffer.str());
std::ofstream out("urls.txt");
unsigned counter = 0;
std::regex word_regex(
R"(^(([^:\/?#]+):)?(//([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)",
std::regex::extended
);
auto words_begin = std::sregex_iterator(contents.begin(), contents.end(), word_regex);
auto words_end = std::sregex_iterator();
for (std::sregex_iterator i = words_begin; i != words_end; ++i) {
std::smatch match = *i;
std::string match_str = match.str();
for (const auto& res : match) {
counter++;
std::cout << counter++ << ": " << res << std::endl;
}
std::cout << " " << match_str << '\n';
}
system("PAUSE");
return 0;
}

a few steps to simplify (and debug) the regex:
use named groups (?<groupname>regex) to help identify what's what and access results.
for 'grouping only' ()'s, use (?:regex) to "not remember" captures, also helps clarify what's going on
once done, just a few tweaks "fixes" this regex for all your inputs:
(?<protocol>https?:\/\/)(?:(?<urlroot>[^\/?#\n\s]+))?(?<urlResource>[^?#\n\s]+)?(?<queryString>\?(?:[^#\n\s]*))?(?:#(?<fragment>[^\n\s]))?
I changed the negated char classes to not match newlines or spaces: [^#\n\s]
specified that any segment after urlRoot is optional.
added the string "https?" to limit results to valid urls
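Note that the ECMAScript grammar used by std::regex has no named groups, so the pattern above cannot be passed to std::regex as written. A rough sketch of the same idea with unnamed, non-capturing groups follows. The function name extractUrls and the exact character classes are assumptions for this example, and the pattern is deliberately loose rather than a full URL validator.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every http/https URL found in the text.
std::vector<std::string> extractUrls(const std::string& text) {
    // scheme, host, optional path, optional ?query, optional #fragment;
    // none of the parts may contain whitespace.
    static const std::regex url(
        R"re(https?://[^[:space:]/?#]+[^[:space:]?#]*(?:\?[^[:space:]#]*)?(?:#[^[:space:]]*)?)re");
    std::vector<std::string> urls;
    for (std::sregex_iterator it(text.begin(), text.end(), url), end;
         it != end; ++it) {
        urls.push_back(it->str());
    }
    return urls;
}
```

Each returned string could then be written to urls.txt on its own line.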

C++ Boost:regex_search expression - Issue combining expressions to catch all sequences

I'm trying to write a template parser and need to pickup (3) distinct sets of sequences for string replacement.
// Each of These Expressions Work Perfect Separately!
// All Sequences start with | pipe. Followed by
boost::regex expr {"(\\|[0-9]{2})"}; // 2 Digits only.
boost::regex expr {"(\\|[A-Z]{1,2}+[0-9]{1,2})"}; // 1 or 2 Uppercase Chars and 1 or 2 Digits.
boost::regex expr {"(\\|[A-Z]{2})(?!\\d)"}; // 2 Uppercase Chars with no following digits.
However, once I try to combine them into a single statement, I can't get them to work properly to catch all sequences. I must be missing something. Can anyone shed some light on what I'm missing?
Here is what I have so far:
// Each sequence is separated with a | for or between parenthesis.
boost::regex expr {"(\\|[0-9]{2})|(\\|[A-Z]{1,2}+[0-9]{1,2})|(\\|[A-Z]{2})(?!\\d)"};
I'm using the following string for testing. It's probably a little more than needed, but here is the code as well.
#include <boost/regex.hpp>
#include <string>
#include <iostream>
std::string str = "|MC01 |U1 |s |A22 |12 |04 |2 |EW |SSAADASD |15";
boost::regex expr {"(\\|[0-9]{2})|(\\|[A-Z]{1,2}+[0-9]{1,2})|(\\|[A-Z]{2})(?!\\d)"};
boost::smatch matches;
std::string::const_iterator start = str.begin(), end = str.end();
while(boost::regex_search(start, end, matches, expr))
{
std::cout << "Matched Sub '" << matches.str()
<< "' following ' " << matches.prefix().str()
<< "' preceeding ' " << matches.suffix().str()
<< std::endl;
start = matches[0].second;
for(size_t s = 1; s < matches.size(); ++s)
{
std::cout << "+ Matched Sub " << matches[s].str()
<< " at offset " << matches[s].first - str.begin()
<< " of length " << matches[s].length()
<< std::endl;
}
}

I believe this is what you want:
const boost::regex expr {"(\\|[0-9]{2})|(\\|[A-Z]{1,2}+[0-9]{1,2})|(\\|[A-Z]{2})"}; // basically, remove the constraint on the last sub
I also suggest being explicit in your flags for expr and passed to regex_search.

I also found that adding an extra check on matched removes the half-matched patterns that were throwing me off.
for(size_t s = 1; s < matches.size(); ++s)
{
if (matches[s].matched) // Check for bool True/False
{
std::cout << "+ Matched Sub " << matches[s].str()
<< " at offset " << matches[s].first - str.begin()
<< " of length " << matches[s].length()
<< std::endl;
}
}
Without it, unmatched groups were showing up with an offset at the end of the string and a length of 0. I hope this helps anyone else who runs into this.
Another tip: in the loop, the value of s (1, 2, or 3) tells you which alternative matched. Since I have three expressions, s is 1 when matched is true for the first alternative, and 2 or 3 for the others. Pretty nice!
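The matched check and the group index can be combined into one small routine. The sketch below uses std::regex instead of Boost so that it has no external dependency, which is a substitution on my part. For that reason it also drops the possessive {1,2}+ quantifier, which std::regex does not accept. The function name findSequences is hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: for every match, records which alternative (1, 2 or 3)
// matched together with the matched text.
std::vector<std::pair<std::size_t, std::string>>
findSequences(const std::string& text) {
    static const std::regex expr(
        R"re((\|[0-9]{2})|(\|[A-Z]{1,2}[0-9]{1,2})|(\|[A-Z]{2}))re");
    std::vector<std::pair<std::size_t, std::string>> found;
    for (std::sregex_iterator it(text.begin(), text.end(), expr), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        for (std::size_t g = 1; g < m.size(); ++g) {
            if (m[g].matched) {  // only one alternative participates per match
                found.emplace_back(g, m[g].str());
            }
        }
    }
    return found;
}
```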

Anything like substr but instead of stopping at the byte you specified, it stops at a specific string [duplicate]

I have a client for a pre-existing server. Let's say I get some packets "MC123, 456!##".
I store these packets in a char called message. To print out a specific part of them, in this case the numbers part of them, I would do something like "cout << message.substr(3, 7) << endl;".
But what if I receive another message "MC123, 456, 789!##". "cout << message.substr(3,7)" would only print out "123, 456", whereas I want "123, 456, 789". How would I do this assuming I know that every message ends with "!##".

First - Sketch out the indexing.
std::string packet1 = "MC123, 456!##";
// 0123456789012345678
// ^------^ desired text
std::string packet2 = "MC123, 456, 789!##";
// 0123456789012345678
// ^-----------^ desired text
The other answers are OK. If you wish to use std::string find,
consider rfind and find_first_not_of, as in the following code:
// forward
void messageShow(std::string packet,
size_t startIndx = 2);
// /////////////////////////////////////////////////////////////////////////////
int main (int, char** )
{
// 012345678901234567
// |
messageShow("MC123, 456!##");
messageShow("MC123, 456, 789!##");
messageShow("MC123, 456, 789, 987, 654!##");
// error test cases
messageShow("MC123, 456, 789##!"); // missing !##
messageShow("MC123x 456, 789!##"); // extraneous char in packet
return(0);
}
void messageShow(std::string packet,
size_t startIndx) // default value 2
{
static size_t seq = 0;
seq += 1;
std::cout << packet.size() << " packet" << seq << ": '"
<< packet << "'" << std::endl;
do
{
size_t bangAtPound_Indx = packet.rfind("!##");
if(bangAtPound_Indx == std::string::npos){ // not found, can't do anything more
std::cerr << " '!##' not found in packet " << seq << std::endl;
break;
}
size_t printLength = bangAtPound_Indx - startIndx;
const std::string DIGIT_SPACE = "0123456789, ";
size_t allDigitSpace = packet.find_first_not_of(DIGIT_SPACE, startIndx);
if(allDigitSpace != bangAtPound_Indx) {
std::cerr << " extraneous char found in packet " << seq << std::endl;
break; // something extraneous in string
}
std::cout << bangAtPound_Indx << " message" << seq << ": '"
<< packet.substr(startIndx, printLength) << "'" << std::endl;
}while(0);
std::cout << std::endl;
}
This outputs
13 packet1: 'MC123, 456!##'
10 message1: '123, 456'
18 packet2: 'MC123, 456, 789!##'
15 message2: '123, 456, 789'
28 packet3: 'MC123, 456, 789, 987, 654!##'
25 message3: '123, 456, 789, 987, 654'
18 packet4: 'MC123, 456, 789##!'
'!##' not found in packet 4
18 packet5: 'MC123x 456, 789!##'
extraneous char found in packet 5
Note: String indexes start at 0. The index of the digit '1' is 2.

The correct approach is to look for existence / location of the "known termination" string, then take the substring up to (but not including) that substring.
Something like
std::string termination = "!#$";
std::size_t position = message.find(termination);
std::string importantBit = message.substr(0, position);
You could check the front of the string separately as well. Combining these, you could use regular expressions to make your code more robust, using a regex like
MC([0-9,]+)!#\$
This will return the bit between MC and !#$ but only if it consists entirely of numbers and commas. Obviously you can adapt this as needed.
UPDATE: you asked in your comment how to use the regular expression. Here is a very simple program. Note: this uses C++11, so you need to make sure your compiler supports it.
#include <iostream>
#include <regex>
int main(void) {
std::string s ("ABC123,456,789!#$");
std::smatch m;
std::regex e ("ABC([0-9,]+)!#\\$"); // matches the kind of pattern you are looking for
if (std::regex_search (s,m,e)) {
std::cout << "match[0] = " << m[0] << std::endl;
std::cout << "match[1] = " << m[1] << std::endl;
}
}
On my Mac, I can compile the above program with
clang++ -std=c++0x -stdlib=libc++ match.cpp -o match
If instead of just digits and commas you want "anything" in your expression (but it's still got fixed characters in front and behind) you can simply do
std::regex e ("ABC(.*)!#\\$");
Here, .* means "zero or more of 'anything'", but followed by !#$. The double backslash has to be there to "escape" the dollar sign, which has special meaning in regular expressions (it means "the end of the string").
The more accurately your regular expression reflects exactly what you expect, the better you will be able to trap any errors. This is usually a very good thing in programming. "Always check your inputs".
One more thing - I just noticed you mentioned that you might have "more stuff" in your string. This is where using regular expressions quickly becomes the best. You mentioned a string
MC123, 456!##*USRChester.
and wanted to extract 123, 456 and Chester. That is - stuff between MC and !#$, and more stuff after USR (if that is even there). Here is the code that shows how that is done:
#include <iostream>
#include <regex>
int main(void) {
std::string s1 ("MC123, 456!#$");
std::string s2 ("MC123, 456!#$USRChester");
std::smatch m;
std::regex e ("MC([0-9, ]+)!#\\$(?:USR)?(.*)$"); // matches the kind of pattern you are looking for
if (std::regex_search (s1,m,e)) {
std::cout << "match[0] = " << m[0] << std::endl;
std::cout << "match[1] = " << m[1] << std::endl;
std::cout << "match[2] = " << m[2] << std::endl;
}
if (std::regex_search (s2,m,e)) {
std::cout << "match[0] = " << m[0] << std::endl;
std::cout << "match[1] = " << m[1] << std::endl;
std::cout << "match[2] = " << m[2] << std::endl;
if (m[2].length() > 0) {
std::cout << m[2] << ": " << m[1] << std::endl;
}
}
}
Output:
match[0] = MC123, 456!#$
match[1] = 123, 456
match[2] =
match[0] = MC123, 456!#$USRChester
match[1] = 123, 456
match[2] = Chester
Chester: 123, 456
The matches are:
match[0] : "everything in the input string that was consumed by the Regex"
match[1] : "the thing in the first set of parentheses"
match[2] : "The thing in the second set of parentheses"
Note the use of the slightly tricky (?:USR)? expression. This says "This might (that's the ()? ) be followed by the characters USR. If it is, skip them (that's the ?: part) and match what follows."
As you can see, simply testing whether m[2] is empty will tell you whether you have just numbers, or number plus "the thing after the USR". I hope this gives you an inkling of the power of regular expressions for chomping through strings like yours.

If you are sure about the ending of the message, message.substr(3, message.size()-6) will do the trick.
However, it is good practice to check everything, just to avoid surprises.
Something like this:
if (message.size() < 6)
throw error;
if (message.substr(0,3) != "MCX") //the exact numbers do not match in your example, but you get the point...
throw error;
if (message.substr(message.size()-3) != "!##")
throw error;
string data = message.substr(3, message.size()-6);

Just calculate the offset first.
string str = ...;
size_t start = 3;
size_t end = str.find("!##");
assert(end != string::npos);
return str.substr(start, end - start);

You can get the index of "!##" by using:
message.find("!##")
Then use that answer instead of 7. You should also check for it equalling std::string::npos which indicates that the substring was not found, and take some different action.
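Put together, that could look like the following minimal sketch. The function name extractPayload, the default start offset of 2 (skipping "MC"), and the bool-plus-output-parameter interface are choices made for this example only.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copies the text between the prefix and the terminator
// into payload. Returns false (leaving payload untouched) when the
// terminator is missing.
bool extractPayload(const std::string& message, std::string& payload,
                    std::size_t start = 2,
                    const std::string& terminator = "!##") {
    const std::size_t end = message.find(terminator, start);
    if (end == std::string::npos) {
        return false;  // terminator not found: let the caller decide what to do
    }
    payload = message.substr(start, end - start);
    return true;
}
```

The caller can then print payload on success, or log and skip the packet when the function returns false.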

string msg = "MC4,512,541,3123!##";
// Skip the 2-character "MC" prefix and stop before the 3-character "!##" terminator.
for (size_t i = 2; i + 3 < msg.length(); i++) {
cout << msg[i];
}
or use char[]
char msg[] = "MC4,123,54!##";
sizeof(msg) - 1; // instead of msg.length()
// -1 for the null byte at the end (each char takes 1 byte, so the size - 1 == number of chars)
