Using regular expressions with C++ on Unix - c++

I'm familiar with Regex itself, but whenever I try to find any examples or documentation to use regex with Unix computers, I just get tutorials on how to write regex or how to use the .NET specific libraries available for Windows. I've been searching for a while and I can't find any good tutorials on C++ regex on Unix machines.
What I'm trying to do:
Parse a string using regex by breaking it up and then reading the different subgroups. To make a PHP analogy, something like preg_match that returns all $matches.

Consider using Boost.Regex.
An example (from the website):
bool validate_card_format(const std::string& s)
{
    static const boost::regex e("(\\d{4}[- ]){3}\\d{4}");
    return regex_match(s, e);
}
Another example:
// match any format with the regular expression:
const boost::regex e("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z");
const std::string machine_format("\\1\\2\\3\\4");
const std::string human_format("\\1-\\2-\\3-\\4");
std::string machine_readable_card_number(const std::string s)
{
    return regex_replace(s, e, machine_format, boost::match_default | boost::format_sed);
}
std::string human_readable_card_number(const std::string s)
{
    return regex_replace(s, e, human_format, boost::match_default | boost::format_sed);
}

Look up the documentation for TR1 regexes or (almost equivalently) Boost.Regex. Both work quite nicely on various Unix systems. The TR1 regex classes were accepted into C++0x and have since been standardized in C++11 as std::regex (header <regex>), so with a C++11 compiler you can drop the tr1:: qualifier used below.
Edit: To break a string into subgroups, you can use an sregex_token_iterator. You can specify either what you want matched as tokens, or what you want matched as separators. Here's a quickie demo of both:
#include <algorithm>
#include <iterator>
#include <regex>
#include <string>
#include <iostream>
int main() {
    std::string line;
    std::cout << "Please enter some words: " << std::flush;
    std::getline(std::cin, line);
    // r matches runs of separators, w matches words
    std::tr1::regex r("[ .,:;\\t\\n]+");
    std::tr1::regex w("[A-Za-z]+");
    std::cout << "Matching words:\n";
    std::copy(std::tr1::sregex_token_iterator(line.begin(), line.end(), w),
              std::tr1::sregex_token_iterator(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
    std::cout << "\nMatching separators:\n";
    // -1 selects the text between matches instead of the matches themselves
    std::copy(std::tr1::sregex_token_iterator(line.begin(), line.end(), r, -1),
              std::tr1::sregex_token_iterator(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
    return 0;
}
If you give it input like this: "This is some 999 text", the result is like this:
Matching words:
This
is
some
text
Matching separators:
This
is
some
999
text
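With a C++11 or later compiler the same classes are available as std::regex, without the tr1:: qualifier. For the preg_match-style use case from the question (one match plus its subgroups), a minimal sketch might look like this. match_groups is a hypothetical helper name, not a library function:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the whole match (index 0) followed by each
// capture group, similar in spirit to the $matches array of PHP's preg_match.
// Returns an empty vector when the pattern does not match.
std::vector<std::string> match_groups(const std::string& text,
                                      const std::string& pattern) {
    std::vector<std::string> out;
    const std::regex re(pattern);      // ECMAScript grammar by default
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        for (const auto& sub : m) {
            out.push_back(sub.str());  // empty string for groups that did not participate
        }
    }
    return out;
}
```

For example, match_groups("key=value", "(\\w+)=(\\w+)") should give key=value, key and value, in that order.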

You are looking for regcomp, regexec and regfree.
One thing to be careful about is that the POSIX regular expressions actually implement two different languages: basic (the default) and extended (pass the flag REG_EXTENDED to regcomp). If you are coming from the PHP world, the extended language is closer to what you are used to.
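A minimal sketch of how these three calls fit together, assuming a POSIX system that provides <regex.h>. posix_match is a hypothetical helper name; it compiles the pattern with REG_EXTENDED and returns the whole match followed by each subgroup:

```cpp
#include <regex.h>   // POSIX regex API, not the C++ <regex> header
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the whole match followed by each subgroup,
// or an empty vector if the pattern fails to compile or does not match.
std::vector<std::string> posix_match(const std::string& pattern,
                                     const std::string& text) {
    std::vector<std::string> groups;
    regex_t re;
    // REG_EXTENDED selects the extended syntax (unescaped parentheses, +, |).
    if (regcomp(&re, pattern.c_str(), REG_EXTENDED) != 0) {
        return groups;
    }
    // re_nsub is the number of parenthesised subexpressions; slot 0 holds the whole match.
    std::vector<regmatch_t> m(re.re_nsub + 1);
    if (regexec(&re, text.c_str(), m.size(), m.data(), 0) == 0) {
        for (const regmatch_t& g : m) {
            if (g.rm_so == -1) {
                groups.push_back("");  // this subgroup did not participate in the match
            } else {
                groups.push_back(text.substr(static_cast<std::size_t>(g.rm_so),
                                             static_cast<std::size_t>(g.rm_eo - g.rm_so)));
            }
        }
    }
    regfree(&re);  // release the memory allocated by regcomp
    return groups;
}
```

Calling posix_match("([0-9]+)-([0-9]+)", "range 10-25 end") should give 10-25, 10 and 25. Note that the POSIX syntax has no \d shorthand, hence the explicit [0-9].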

For perl-compatible regular expressions (pcre/preg), I'd suggest boost.regex.

My best bet would be boost::regex.

Try pcre. And pcrepp.

Feel free to have a look at this small color grep tool I wrote.
At github
It uses regcomp, regexec and regfree that R Samuel Klatchko refers to.

I use "GNU regex": http://www.gnu.org/s/libc/manual/html_node/Regular-Expressions.html
It works well, but I can't find a clear solution for UTF-8 regexps.

Related

Split a mathematical expression using regex

I want to split the mathematical expression -1+33+4.4+sin(3)-2-x^2 into tokens using a regex. I tested my regex on an online regex tester, which reports nothing wrong. When I use the same regex in my C++ code, it throws the error "Invalid special open parenthesis". I looked for a solution and found a related Stack Overflow question, but it did not help me solve my problem.
My regex is (?<=[-+*\/^()])|(?=[-+*\/^()]). In the C++ code I do not use the \ escapes.
The other problem is that I do not know how to determine whether a minus sign is a unary or a binary operator. If the minus is unary, I want the token to look like this: {-1}
I want the tokens to look like this: {-1,+,33,+,4.4,+,sin,(,3,),-,2,-,x,^,2}
The unary minus can be anywhere in the string.
Even if I leave ^ out of the regex, it still fails.
Code:
#include <iostream>
#include <regex>
#include <string>
#include <vector>
std::vector<std::string> split(const std::string& s, std::string rgx_str) {
    std::vector<std::string> elems;
    std::regex rgx (rgx_str);
    std::sregex_token_iterator iter(s.begin(), s.end(), rgx);
    std::sregex_token_iterator end;
    while (iter != end) {
        elems.push_back(*iter);
        ++iter;
    }
    return elems;
}
int main() {
    std::string str = "-1+33+4.4+sin(3)-2-x^2";
    std::string reg = "(?<=[-+*/()^])|(?=[-+*/()^])";
    std::vector<std::string> s = split(str,reg);
    for(auto& a : s)
        std::cout << a << std::endl;
    return 0;
}
C++ uses a modified ECMAScript regular expression grammar for its std::regex by default. It supports the lookaheads (?=...) and (?!...), but not lookbehinds, so (?<=...) is not valid std::regex syntax.
There is a proposal to add this in C++23, but it is not currently implemented.
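One way around the missing lookbehind is to match the tokens themselves rather than split at the positions between them. The following is only a sketch under a few assumptions: tokenize is a hypothetical helper, tokens are limited to unsigned numbers, identifiers and the single-character operators from the question, and a minus counts as unary when it appears at the start or directly after another operator or an opening parenthesis:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical tokenizer sketch. Characters that match none of the
// alternatives (for example spaces) are silently skipped.
std::vector<std::string> tokenize(const std::string& expr) {
    // number | identifier | single-character operator or parenthesis
    static const std::regex token_re(R"(\d+(?:\.\d+)?|[A-Za-z_]\w*|[-+*/^()])");
    static const std::string operators = "+-*/^()";
    std::vector<std::string> tokens;
    bool prev_is_operand = false;  // previous token was a number, a name or ")"
    bool pending_minus = false;    // a unary minus waiting for its operand
    for (std::sregex_iterator it(expr.begin(), expr.end(), token_re), end;
         it != end; ++it) {
        std::string tok = it->str();
        const bool is_op =
            tok.size() == 1 && operators.find(tok[0]) != std::string::npos;
        if (tok == "-" && !prev_is_operand) {
            if (pending_minus) tokens.push_back("-");  // "--x": keep the earlier minus
            pending_minus = true;
            continue;
        }
        if (pending_minus) {
            if (is_op) tokens.push_back("-");  // "-(": minus stays a separate token
            else tok = "-" + tok;              // "-1": attach the sign to the operand
            pending_minus = false;
        }
        tokens.push_back(tok);
        prev_is_operand = !is_op || tok == ")";
    }
    if (pending_minus) tokens.push_back("-");  // dangling minus at the end of input
    return tokens;
}
```

For the expression from the question this should produce -1, +, 33, +, 4.4, +, sin, (, 3, ), -, 2, -, x, ^, 2. Whether a sign should be folded into the operand or kept as its own unary-operator token is a design choice that depends on the parser consuming the tokens.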

<regex> having trouble with Cyrillic characters

I'm trying to use the standard <regex> library to match some Cyrillic words:
// This is a UTF-8 file.
std::locale::global(std::locale("en_US.UTF-8"));
string s {"Каждый охотник желает знать где сидит фазан."};
regex re {"[А-Яа-яЁё]+"};
for (sregex_iterator it {s.begin(), s.end(), re}, end {}; it != end; it++) {
    cout << it->str() << "#";
}
However, that doesn't seem to work. The code above results in the following:
Кажд�#й#о�#о�#ник#желае�#зна�#�#где#�#иди�#�#азан#
rather than the expected:
Каждый#охотник#желает#знать#где#сидит#фазан
The code of the '�' symbol above is \321.
I've checked the regular expression I used with grep and it works as expected. My locale is en_US.UTF-8. Both GCC and Clang produce the same result.
Is there anything I'm missing? Is there a way to "tame" <regex> so it would work with Cyrillic characters?
For ranges like А-Я to work properly, you must use std::regex::collate. From the reference for the syntax option constants:
collate: Character ranges of the form "[a-b]" will be locale sensitive.
Changing the regular expression to
std::regex re{"[А-Яа-яЁё]+", std::regex::collate};
gives the expected result.
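Depending on the encoding of your source file, you might need to prefix the regular expression string with u8 (this applies before C++20; since C++20, u8 literals have type const char8_t[] and cannot be passed to std::regex directly):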
std::regex re{u8"[А-Яа-яЁё]+", std::regex::collate};
Cyrillic letters are represented as multibyte sequences in UTF-8. Therefore, one way of handling the problem is to use the "wide" version of string, called wstring. Other functions and types that work with characters need to be replaced with their wide-character versions as well; generally this is done by prepending w to their name. This works:
std::locale::global(std::locale("en_US.UTF-8"));
wstring s {L"Каждый охотник желает знать где сидит фазан."};
wregex re {L"[А-Яа-яЁё]+"};
for (wsregex_iterator it {s.begin(), s.end(), re}, end {}; it != end; it++) {
    wcout << it->str() << "#";
}
Output:
Каждый#охотник#желает#знать#где#сидит#фазан#
(Thanks @JohnDing for pitching this solution.)
An alternative solution is to use regex::collate to make regexes locale-sensitive with ordinary strings; see the answer by @OlafDietsche for details. This topic will shed some light on which solution might be preferable in your circumstances. (Turns out in my case collate was a better idea!)

Replace single backslash with double in a string c++

I am trying to replace one backslash with two. To do that I tried using the following code
str = "d:\test\text.txt"
str.replace("\\","\\\\");
The code does not work. The whole idea is to pass str to the deletefile function, which requires double backslashes.
Since C++11, you can try using regex:
#include <regex>
#include <iostream>
int main() {
    auto s = std::string(R"(\tmp\)");
    // the pattern R"(\\)" matches one backslash; the replacement is two literal backslashes
    s = std::regex_replace(s, std::regex(R"(\\)"), R"(\\)");
    std::cout << s << std::endl;
}
A bit overkill, but it does the trick if you want a "quick" solution.
There are two errors in your code.
First line: you forgot to double the \ in the literal string.
It happens that \t is a valid escape representing the tab character, so you get no compiler error, but your string doesn't contain what you expect.
Second line: according to the reference for string::replace, you can replace a substring with another substring based on the substring's position.
However, there is no overload that performs a substitution, i.e. replaces all occurrences of a given substring with another one.
This doesn't exist in the standard library. It does exist, for example, in the Boost library; see the Boost String Algorithms. The algorithm you are looking for is called replace_all.
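If Boost is not an option, a replace-all can be sketched with std::string::find and std::string::replace. replace_all below is a hypothetical helper written for illustration, not the Boost function of the same name:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns a copy of s in which every occurrence of
// `from` has been replaced by `to`.
std::string replace_all(std::string s, const std::string& from,
                        const std::string& to) {
    if (from.empty()) {
        return s;  // an empty search string would otherwise loop forever
    }
    std::size_t pos = 0;
    while ((pos = s.find(from, pos)) != std::string::npos) {
        s.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text so it is not rescanned
    }
    return s;
}
```

With the path from the question written correctly as "d:\\test\\text.txt" in source code, replace_all(str, "\\", "\\\\") would double every backslash in the string.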

Referring to whole match from replacement pattern using std::regex_replace

The code below doesn't give the same result in Visual Studio 2015 and on IDEOne.com (C++14). Stranger still, in both cases the results are incorrect!
#include <iostream>
#include <regex>
int main()
{
    const char* pszTestString = "ENDRESS+HAUSER*ST-DELL!HP||BESTMATCH&&ABCD\\ABCD";
    const char* pszExpectedString = "ENDRESS\\+HAUSER\\*ST\\-DELL\\!HP\\||BESTMATCH\\&&ABCD\\\\ABCD";
    std::cout << std::regex_replace(pszTestString, std::regex("[-+!\"\\[\\](){}^~*?:]|&&|\\|\\|"), "\\$0") << std::endl;
    std::cout << pszExpectedString << std::endl;
    return 0;
}
Under Visual Studio 2015 I got this strange result (the second line contains the expected result, for both compilers):
ENDRESS\$0HAUSER\$0ST\$0DELL\$0HP\$0BESTMATCH\$0ABCD\ABCD
ENDRESS\+HAUSER\*ST\-DELL\!HP\||BESTMATCH\&&ABCD\\ABCD
With IDEOne (C++14 compiler):
ENDRESS\+HAUSER\*ST\-DELL\!HP\||BESTMATCH\&&ABCD\ABCD
ENDRESS\+HAUSER\*ST\-DELL\!HP\||BESTMATCH\&&ABCD\\ABCD
We can see in the latter that there is a mistake: before the last "ABCD" there must be two backslashes, not a single one.
What the heck is going on? I wrote a manual parser instead of using std::regex_replace for the moment, but I really want to make it work under VS2015 (and ideally any other IDE) and run a benchmark before settling on the manual parsing solution.
The VS2015 default compiler does not treat $0 as a zeroth backreference. You need to use the "native" ECMAScript $& backreference to refer to the whole match from inside the replacement pattern.
Also, revo is right: in order to match \ you need to add it to the character class.
And note that in VS2015 you can use raw string literals. It is best practice to use raw string literals to define regex patterns, as they help avoid overescaping (also called backslash hell).
Solution:
std::cout << std::regex_replace(pszTestString,
    std::regex(R"([-+!\\\"\[\](){}^~*?:]|&&|\|\|)"), "\\$&") << std::endl;
// Changes: \\ added to the character class, and $& used instead of $0

c++ regex convert regex to c++ code

This is my first time using regex (in C++, that is).
I have a hard time writing
(?<=name=")(?:[^\\"]+|\\.)*(?=")
as C++ code. It should match, for example, blabla in name="blabla" xyz.
How do I write
std::regex TheName("(?<=name=")(?:[^\\"]+|\\.)*(?=")");
correctly, please?
You need to use capturing rather than positive lookbehind in C++ regex. Also, it is advisable to use the unroll-the-loop principle to unroll your ([^"\\]|\\.)* subpattern to make the regex as fast as it can be, see [^\"\\]*(?:\\.[^\"\\]*)*. Also, it is advisable to use raw string literals (see R"(<PATTERN>)") when defining regex patterns in order to avoid overescaping.
See the C++ demo:
#include <iostream>
#include <regex>
using namespace std;
int main() {
    std::string s = "name=\"bla \\\"bla\\\"\"";
    std::regex TheName(R"(name=\"([^\"\\]*(?:\\.[^\"\\]*)*)\")");
    std::smatch m;
    if (regex_search(s, m, TheName)) {
        std::cout << m[1].str() << std::endl;
    }
    return 0;
}
Result: bla \"bla\"