C++ regex escaping punctuation characters like "." - c++

Matching a "." in a string with the std::tr1::regex class makes me use a weird workaround.
Why do I need to check for "\\\\." instead of "\\."?
regex(".") // Matches everything (but "\n") as expected.
regex("\\.") // Matches everything (but "\n").
regex("\\\\.") // Matches only ".".
Can someone explain to me why? It's really bothering me, since I had my code written using the boost::regex classes, which didn't need this syntax.
Edit: Sorry, regex("\\\\.") seems to match nothing.
Edit2: Some code
void parser::lex(regex& token)
{
// Skipping whitespaces
{
regex ws("\\s*");
sregex_token_iterator wit(source.begin() + pos, source.end(), ws, regex_constants::match_default), wend;
if(wit != wend)
pos += (*wit).length();
}
sregex_token_iterator it(source.begin() + pos, source.end(), token, regex_constants::match_default), end;
if (it != end)
temp = *it;
else
temp = "";
}

This is because, inside a C++ string literal, \. is treated as an escape sequence: the compiler tries to turn it into a single character before the regex engine ever sees it. What you want is for your regex to contain the actual two-character string \. and, in source code, that is written "\\." because \\ is the escape sequence for the backslash character (\).
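The two layers of escaping described above can be sketched as follows. This is a minimal illustration using std::regex (C++11) rather than std::tr1::regex, and the function name is_literal_dot is made up for the example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `text` is exactly one literal dot.
bool is_literal_dot(const std::string& text) {
    // "\\." in source code is the two-character string \. at run time,
    // which the regex engine reads as "a literal dot".
    static const std::regex escaped("\\.");
    // A raw string literal (C++11) avoids doubling the backslash.
    static const std::regex raw(R"(\.)");
    return std::regex_match(text, escaped) && std::regex_match(text, raw);
}
```

Both patterns are the same regex; only the way they are spelled in the source file differs.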

As it turns out, the actual problem was due to the way sregex_token_iterator was used. Using match_default meant it was always finding the next match in the string, if any, even if there is a non-match in-between. That is,
string source = "AAA.BBB";
regex dot("\\.");
sregex_token_iterator wit(source.begin(), source.end(), dot, regex_constants::match_default);
would give a match at the dot, rather than reporting that there was no match.
The solution is to use match_continuous instead.
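A minimal sketch of the difference, assuming std::regex instead of std::tr1::regex; the helper name matches_at is hypothetical. In the std::regex_token_iterator constructor the match flags are the fifth parameter, after the submatch index, so the sketch uses std::regex_search, where the flag position is unambiguous.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: true only if `re` matches starting exactly at
// offset `pos` of `source` (no skipping ahead to a later match).
bool matches_at(const std::string& source, std::size_t pos, const std::regex& re) {
    std::smatch m;
    return std::regex_search(
        source.begin() + static_cast<std::string::difference_type>(pos),
        source.end(), m, re,
        std::regex_constants::match_continuous);
}
```

With source "AAA.BBB" and the pattern "\\.", this reports no match at offset 0 and a match at offset 3, which is the behaviour a lexer needs.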

Try to escape the dot by its ASCII code:
regex("\\x2E")
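As a small sketch of this suggestion, assuming std::regex with its default ECMAScript grammar (the function name is_dot_hex is made up for the example):

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `text` is exactly one '.' character.
bool is_dot_hex(const std::string& text) {
    // The regex engine sees \x2E, the hex escape for the character '.'.
    static const std::regex dot("\\x2E");
    return std::regex_match(text, dot);
}
```

The backslash still has to be doubled in the string literal, so this form does not avoid the C++ escaping layer; it only avoids writing the dot itself.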

Related

tokenize a c++ string with regex having special characters

I am trying to find the tokens in a string, which has words, numbers, and special chars. I tried the following code:
#include <iostream>
#include <regex>
#include <string>
#include <vector>
using namespace std;
int main() {
string str("The ,quick brown. fox \"99\" named quick_joe!");
regex reg("[\\s,.!\"]+");
sregex_token_iterator iter(str.begin(), str.end(), reg, -1), end;
vector<string> vec(iter, end);
for (auto a : vec) {
cout << a << ":";
}
cout << endl;
}
And got the following output:
The:quick:brown:fox:99:named:quick_joe:
But I wanted the output:
The:,:quick:brown:.:fox:":99:":named:quick_joe:!:
What regex should I use for that? I would like to stick to standard C++ if possible, i.e. I would not like a solution with boost.
(See 43594465 for a Java version of this question, but now I am looking for a C++ solution. So essentially, the question is how to map Java's Matcher and Pattern to C++.)
You're asking to interleave non-matched substrings (submatch -1) with the whole matched substrings (submatch 0), which is slightly different:
sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,0}), end;
This yields:
The: ,:quick: :brown:. :fox: ":99:" :named: :quick_joe:!:
Since you're looking to just drop whitespace, change the regex to consume surrounding whitespace, and add a capture group for the non-whitespace chars. Then, just specify submatch 1 in the iterator, instead of submatch 0:
regex reg("\\s*([,.!\"]+)\\s*");
sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,1}), end;
Yields:
The:,:quick brown:.:fox:":99:":named quick_joe:!:
Splitting the spaces between adjoining words requires splitting on 'just spaces' too:
regex reg("\\s*\\s|([,.!\"]+)\\s*");
However, you'll end up with empty submatches:
The:::,:quick::brown:.:fox:::":99:":named::quick_joe:!:
Easy enough to drop those:
regex reg("\\s*\\s|([,.!\"]+)\\s*");
sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,1}), end;
vector<string> vec;
copy_if(iter, end, back_inserter(vec), [](const string& x) { return x.size(); });
Finally:
The:,:quick:brown:.:fox:":99:":named:quick_joe:!:
If you want to use the approach used in the Java related question, just use a matching approach here, too.
regex reg(R"(\d+|[^\W\d]+|[^\w\s])");
sregex_token_iterator iter(str.begin(), str.end(), reg), end;
vector<string> vec(iter, end);
Result: The:,:quick:brown:.:fox:":99:":named:quick_joe:!:. Note that this won't match Unicode letters, as \w (and \d and \s, too) is not Unicode-aware in std::regex.
Pattern details:
\d+ - 1 or more digits
| - or
[^\W\d]+ - 1 or more ASCII letters or _
| - or
[^\w\s] - 1 char other than an ASCII letter/digit,_ and whitespace.
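Wrapped into a function, the matching approach above might look like the following sketch. The function name tokenize is an assumption made for the example; the pattern is the one from the answer.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` into runs of digits, runs of
// ASCII letters/underscores, and single punctuation characters.
// Whitespace is skipped because no alternative matches it.
std::vector<std::string> tokenize(const std::string& text) {
    static const std::regex token(R"(\d+|[^\W\d]+|[^\w\s])");
    // Default submatch 0: iterate over the whole matches.
    std::sregex_token_iterator it(text.begin(), text.end(), token), end;
    return std::vector<std::string>(it, end);
}
```

Because every token is matched explicitly, there are no empty strings to filter out afterwards, unlike the splitting approach.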

Remove spaces from string before period and comma

I could have a string like:
During this time , Bond meets a stunning IRS agent , whom he seduces .
I need to remove the extra spaces before the comma and before the period in my whole string. I tried throwing this into a char vector and only not push_back if the current char was " " and the following char was a "." or "," but it did not work. I know there is a simple way to do it maybe using trim(), find(), or erase() or some kind of regex but I am not the most familiar with regex.
A solution could be (using regex library):
std::string fix_string(const std::string& str) {
static const std::regex rgx_pattern("\\s+(?=[\\.,])");
std::string rtn;
rtn.reserve(str.size());
std::regex_replace(std::back_insert_iterator<std::string>(rtn),
str.cbegin(),
str.cend(),
rgx_pattern,
"");
return rtn;
}
This function takes a string as input and removes the whitespace in front of each comma and period.
In a loop, search for the string " ," and, whenever you find one, replace it with ",":
std::string str = "...";
while( true ) {
auto pos = str.find( " ," );
if( pos == std::string::npos )
break;
str.replace( pos, 2, "," );
}
Do the same for " .". If you need to handle other whitespace characters such as tabs, use a regex with an appropriate character class.
I don't know how to use regex in C++, and I'm not sure whether C++ supports PCRE-style regex, but I'm posting the regex anyway (I can delete this answer if it doesn't work in C++).
You can use this regex:
\s+(?=[,.])
First, there is no need to use a vector of char: you could very well do the same by using an std::string.
Then, your approach can't work because your copy does not depend on where the space is. You have to remove only the spaces around the punctuation, not those between words.
Modifying your code slightly, you could delay copying spaces until you see the first non-space character: if it's not punctuation, you copy a space before the character; otherwise you copy only the non-space char (thus getting rid of the spaces).
Similarly, once you've copied a punctuation character, just loop and ignore the following spaces until the first non-space char.
I could have written the code. It would have been shorter. But I prefer letting you finish your homework with full understanding of the approach.
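One possible reading of the "delay the spaces" idea is sketched below. It implements only the first half (dropping whitespace before ',' and '.'), since that is what the question asks for; the function name and the choice to buffer pending whitespace in a string are assumptions of this sketch.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: drops whitespace that directly precedes ',' or '.';
// all other whitespace is copied unchanged.
std::string drop_space_before_punct(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    std::string pending;  // whitespace seen but not yet copied
    for (char c : in) {
        if (std::isspace(static_cast<unsigned char>(c))) {
            pending += c;  // delay the decision until the next non-space
        } else {
            if (c != ',' && c != '.') {
                out += pending;  // ordinary character: keep the spaces
            }
            pending.clear();     // punctuation: the spaces are discarded
            out += c;
        }
    }
    out += pending;  // trailing whitespace, if any, is kept
    return out;
}
```

Applied to the sentence from the question, this yields "During this time, Bond meets a stunning IRS agent, whom he seduces."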

C++ boost::regex multiple captures

I'm trying to extract multiple substrings with boost::regex and store each one in a variable. Here is my code:
unsigned int i = 0;
std::string string = "--perspective=45.0,1.33,0.1,1000";
std::string::const_iterator start = string.begin();
std::string::const_iterator end = string.end();
std::vector<std::string> matches;
boost::smatch what;
boost::regex const ex(R"(^-?\d*\.?\d+),(^-?\d*\.?\d+),(^-?\d*\.?\d+),(^-?\d*\.?\d+))");
matches.resize(4);
while (boost::regex_search(start, end, what, ex))
{
std::string stest(what[1].first, what[1].second);
matches[i] = stest;
start = what[0].second;
++i;
}
I'm trying to extract each float of my string and put it in my vector variable matches. My result, at the moment, is that I can extract the first one (in my vector var, I can see "45" without double quotes) but the second one in my vector var is empty (matches[1] is "").
I can't figure out why or how to correct this. So my question is: how do I correct this? Is my regex incorrect? Is my smatch incorrect?
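Firstly, ^ is the symbol for the beginning of a line, so it cannot match in the middle of the string. Secondly, in an ordinary (non-raw) string literal \ must be escaped; in a raw string literal R"(...)" the outer parentheses are delimiters and are not part of the pattern. So you should change each (^-?\d*\.?\d+) group to (-?\\d*\\.?\\d+), keeping the \\.? optional so that integers such as 1000 still match. (Probably, (-?\\d+(?:\\.\\d+)?) is better.)
Your regular expression searches for the number,number,number,number pattern, not for each individual number. You add only the first substring to matches and ignore the others. To fix this, you can replace your expression with (-?\\d*\\.?\\d+) or just add all the matches stored in what to your matches vector: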
while (boost::regex_search(start, end, what, ex))
{
for(int j = 1; j < what.size(); ++j)
{
std::string stest(what[j].first, what[j].second);
matches.push_back(stest);
}
start = what[0].second;
}
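The other option mentioned above, a pattern that matches one number and is applied repeatedly, could be sketched like this. The sketch uses std::regex and std::sregex_iterator instead of boost so that it is self-contained, and the function name extract_numbers is made up for the example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every number found in `text`.
std::vector<std::string> extract_numbers(const std::string& text) {
    // One number: optional sign, digits, optional fractional part.
    static const std::regex number(R"(-?\d+(?:\.\d+)?)");
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), number), end;
         it != end; ++it) {
        out.push_back(it->str());  // whole match of this iteration
    }
    return out;
}
```

Using push_back also removes the need to size the vector in advance.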
You are using ^ several times in your regex. That's why it didn't match: ^ means the beginning of the string. Also, you have an extra ) at the end of the regex. I don't know what that closing parenthesis is doing there.
Here is your regex after correction:
(-?\d*\.?\d+),(-?\d*\.?\d+),(-?\d*\.?\d+),(-?\d*\.?\d+)
A better version of your regex could be (only if you want to avoid matching numbers like .01, .1):
(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)
A repeated search in combination with a regular expression that apparently is built to match all of the target string is pointless.
If you are searching repeatedly in a moving window delimited by a moving iterator and string.end() then you should reduce the pattern to something that matches a single fraction.
If you know that the number of fractions in your string is/must be constant, match once, not in a loop and extract the matched substrings from what.
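For the fixed-count case, a "match once" version might look like the sketch below. It again assumes std::regex rather than boost, and parse_four_numbers is a hypothetical name.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the four comma-separated numbers found in
// `arg`, or an empty vector if the pattern does not occur.
std::vector<std::string> parse_four_numbers(const std::string& arg) {
    static const std::regex ex(
        R"((-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?))");
    std::vector<std::string> out;
    std::smatch what;
    if (std::regex_search(arg, what, ex)) {
        // what[0] is the whole match; groups 1..4 are the numbers.
        for (std::size_t j = 1; j < what.size(); ++j) {
            out.push_back(what[j].str());
        }
    }
    return out;
}
```

A single search fills all four capture groups, so no loop over the input and no moving iterator is needed.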

Ignore String containing special words (Months)

I am trying to find alphanumeric strings by using the following regular expression:
^(?=.*\d)(?=.*[a-zA-Z]).{3,90}$
Alphanumeric string: an alphanumeric string is any string that contains at least a number and a letter plus any other special characters it can be # - _ [] () {} ç _ \ ù %
I want to add an extra constraint to ignore all alphanumerical strings containing the following month formats :
JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre
One solution is to actually match an alphanumerical string. Then check if this string contains one of these names by using the following function:
vector<string> findString(string s)
{
vector<string> vec;
boost::regex rgx("JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre");
boost::smatch match;
boost::sregex_iterator begin {s.begin(), s.end(), rgx},
end {};
for (boost::sregex_iterator& i = begin; i != end; ++i)
{
boost::smatch m = *i;
vec.push_back(m.str());
}
return vec;
}
Question: How can I add this constraint directly into the regular expression instead of using this function.
One solution is to use negative lookahead as mentioned in How to ignore words in string using Regular Expressions.
I used it as follows:
String : 2-hello-001
Regular expression : ^(?=.*\d)(?=.*[a-zA-Z]^(?!Jan|Feb|Mar)).{3,90}$
Result: no match
Test website: http://regexlib.com/
The edit provided by @Robin and @RyanCarlson : ^[][\w#_(){}ç\\ù%-]{3,90}$ works perfectly in detecting alphanumeric strings with special characters. It's just the negative lookahead part that isn't working.
You can use negative look ahead, the same way you're using positive lookahead:
(?=.*\d)(?=.*[a-zA-Z])
(?!.*(?:JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre)).{3,90}$
Also, your regex is pretty unclear. If you want alphanumeric strings with a length between 3 and 90, you can just do:
/^(?!.*(?:JANVIER|F[Eé]VRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AO[Uù]T|SEPTEMBRE|OCTOBRE|NOVEMBRE|D[Eé]CEMBRE|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))
[][\w#_(){}ç\\ù%-]{3,90}$/i
The i flag means it will match upper and lower case (so you can shorten your forbidden list), and \w is a shortcut for [0-9a-zA-Z_]. Careful if you copy-paste: there's a line break here, for readability, between (?! ) and [ ]. Just add to the final [...] whatever special characters you want to match.
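In C++ the same lookahead structure can be tried with std::regex, whose default ECMAScript grammar supports (?=...) and (?!...). The sketch below is only an illustration: the function name is_allowed is made up, and the month list is shortened to three ASCII names to keep the example small.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `s` has 3 to 90 characters, contains at
// least one digit and one ASCII letter, and none of the listed months.
bool is_allowed(const std::string& s) {
    static const std::regex re(
        R"(^(?=.*\d)(?=.*[a-zA-Z])(?!.*(?:Jan|Feb|Mar)).{3,90}$)");
    return std::regex_match(s, re);
}
```

The negative lookahead sits next to the positive ones, anchored at the start of the string, so it scans the whole input for a forbidden word before the final .{3,90} is attempted.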

How do I add a backslash after every character in a string?

I need to transform a literal filepath (C:/example.txt) to one that is compatible with the various WinAPI Registry functions (C://example.txt) and I have no idea on how to go about doing it.
I've broken it down to having to add a backslash after a certain character (/ in this case), but I'm completely stuck after that.
Guidance and Code Examples will be greatly appreciated.
I'm using C++ and VS2012.
In C++, strings are made up of individual characters, like "foo". Strings can be composed of printable characters, such as the letters of the alphabet, or non-printable characters, such as the enter key or other control characters.
You cannot type one of these non-printable characters in the normal way when populating a string. For example, if you want a string that contains "foo" then a tab, and then "bar", you can't create this by typing:
fooTABbar
because this will simply insert that many spaces -- it won't actually insert the TAB character.
You can specify these non-printable characters by "escaping" them. This is done by writing a backslash character (\) followed by the character's code. In the case of the string above, TAB is represented by the escape sequence \t, so you would write: "foo\tbar".
The character \ is not itself a non-printable character, but C++ (and C) treat it as special: it always denotes the beginning of an escape sequence. To include the character "\" in a string, it has to be escaped itself, with \\.
So in C++ if you want a string that contains:
c:\windows\foo\bar
You code this using escape sequences:
string s = "c:\\windows\\foo\\bar";
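A quick way to convince yourself that each \\ in the source is a single character in the resulting string is to compare it with a raw string literal, as in this small sketch (both function names are made up for the example):

```cpp
#include <string>

// Hypothetical helper: the path written with escape sequences.
std::string escaped_path() {
    return "c:\\windows\\foo\\bar";
}

// Hypothetical helper: the same path as a C++11 raw string literal,
// where backslashes are taken literally.
std::string raw_path() {
    return R"(c:\windows\foo\bar)";
}
```

Both functions return the same 18-character string, containing three backslashes in total.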
\\ is not two chars; it is one char:
for(size_t i = 0, sz = sPath.size() ; i < sz ; i++)
if(sPath[i]=='/') sPath[i] = '\\';
But be aware that some APIs work with \ and some with /, so you need to check in which cases to use this replacement.
If replacing every occurrence of a forward slash with two backslashes is really what you want, then this should do the job:
size_t i = str.find('/');
while (i != string::npos)
{
string part1 = str.substr(0, i);
string part2 = str.substr(i + 1);
str = part1 + R"(\\)" + part2; // Use "\\\\" instead of R"(\\)" if your compiler doesn't support C++11's raw string literals
i = str.find('/', i + 1);
}
EDIT:
P.S. If I misunderstood the question and your intention is actually to replace every occurrence of a forward slash with just one backslash, then there is a simpler and more efficient solution (as @RemyLebeau points out in a comment):
size_t i = str.find('/');
while (i != string::npos)
{
str[i] = '\\';
i = str.find('/', i + 1);
}
Or, even better:
std::replace_if(str.begin(), str.end(), [] (char c) { return (c == '/'); }, '\\');
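Both variants can also be written as small self-contained functions. The sketch below uses std::replace, which takes the old and new value directly and needs no lambda, for the one-to-one case, and a single pass that builds a new string for the one-to-two case; the function names are assumptions of this example.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: every '/' becomes one backslash.
std::string to_backslashes(std::string path) {
    std::replace(path.begin(), path.end(), '/', '\\');
    return path;
}

// Hypothetical helper: every '/' becomes two backslashes.
std::string to_double_backslashes(const std::string& path) {
    std::string out;
    out.reserve(path.size() * 2);
    for (char c : path) {
        if (c == '/') {
            out += "\\\\";  // two backslash characters at run time
        } else {
            out += c;
        }
    }
    return out;
}
```

Building a new string in one pass avoids the repeated substr and concatenation of the find-based loop.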