Let
exp = ^[0-9!##$%^&*()_+-=[]{};':"\|,.<>/?\s]*$
be a regular expression that allows me to find all sequences of numbers with or without special characters.
Using exp, I manage to extract all sequences of numbers that are greater than 5, but the number 98200 cannot be extracted. I am not placing any limit on how long the sequence of numbers may be.
Source code:
#include <boost/regex.hpp>
#include <iostream>
using namespace std;
int main()
{
string s = "16000";
string exp = "^[0-9!##$%^&*()_+-=[]{};':\"\\|,.<>\\/?\\s]*$";
const boost::regex e(exp);
bool isSequence = boost::regex_match(s,e);
//isSequence is boolean and should be equal to 1
cout << isSequence << endl;
return 0;
}
In C#, you need to escape the ]. You don't need to escape [ {} () when they are inside a character class. Also, if you want to include the dash as an included character in the character class, it should be at the beginning or end of the list. The sequence that you have of +-= translates to [+,-./0123456789:;<=] which makes your regex redundant. Finally, because of the terminal quantifier, you are allowing matching of zero length strings. This may be what you want, but if not, consider the '+' quantifier.
What about simply
[^A-Za-z]+
with or without the ^ $ anchors at the beginning/end
Indiscriminately escaping everything works for me.. :)
string exp = "^[0-9\\!##\\$\\%\\^&*\\(\\)_\\+\\-=\\[\\]\\{\\};\\\':\\\"\\\\|,\\.<>\\/?\\s]*$";
Note the double backslashes. I'm sure you can work out which of the characters in your list have a special meaning and escape only those. I didn't have time to look up what is special in this context, so I escaped everything, and this works fine for the few cases I tested:
16000 => returns 1
16A000 => returns 0
16#000 => returns 1
Which I'm guessing is what you want...
I moved the brackets to the front of the character class, and with that change I get the output 1 for 98200 using the following code:
#include <string>
#include <boost/regex.hpp>
#include <iostream>
using namespace std;
int main()
{
std::cout << "main()\n";
string s = "98200";
string exp = "^[][0-9!##$%^&*()_+-={};':\"\\|,.<>\\/?\\s]*$";
const boost::regex e(exp);
bool isSequence = boost::regex_match(s,e);
//isSequence is boolean and should be equal to 1
cout << isSequence << endl;
return 0;
}
/**
Local Variables:
compile-command: "g++ -g test.cc -o test.exe -lboost_regex-mt; ./test.exe"
End:
*/
EDIT: Note, that I used my experience with emacs regular
expressions. The info pages of emacs explain: "To include a ] in a
character set, you must make it the first character." I tried this
with boost::regex and it worked. Later on, when I had more time, I read
in the boost manual
http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.character_sets
that this is not specified for the perl regular expression syntax.
The perl syntax is the standard setting for boost::regex. According to the
specification the comment by
https://stackoverflow.com/users/2872922/ron-rosenfeld is the best
answer.
In the following program I eliminate the character range that was inadvertently encoded in your regular expression.
Testing shows that a closing bracket at the beginning of the character set is included in the character set. So my statement turns out to be right, even though it is not specified in the official manual of boost::regex.
Nevertheless, I suggest that https://stackoverflow.com/users/2872922/ron-rosenfeld inserts his comment as an answer and you mark it as the solution. This will help others reading this thread.
#include <string>
#include <boost/regex.hpp>
#include <iostream>
using namespace std;
int main()
{
std::cout << "main()\n";
string s = "98-[2]00";
string exp = "^[][0-9!##$%^&*()_+={};':\"|,.<>/?\\s-]*$";
const boost::regex e(exp);
bool isSequence = boost::regex_match(s,e);
//isSequence is boolean and should be equal to 1
cout << isSequence << endl;
return 0;
}
/**
Local Variables:
compile-command: "g++ -g test.cc -o test.exe -lboost_regex-mt; ./test.exe"
End:
*/
I asked at http://lists.boost.org/boost-users/2013/12/80707.php
The answer of John Maddock (the author of the boost::regex library) is:
>I discovered that if one uses an closing bracket as the first character of a
>character class the character class includes this bracket.
>This works with the standard setting of boost::regex (i.e., perl-regular
>expressions) but it is not documented in the manual page
>
>http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.character_sets
>
>Is this an undocumented feature, a bug or did I misinterpret something in
>the manual?
It's a feature, both Perl and POSIX extended regular expression behave the
same way.
John.
Related
I am trying to replace one backslash with two. To do that I tried using the following code
str = "d:\test\text.txt"
str.replace("\\","\\\\");
The code does not work. The whole idea is to pass str to the deletefile function, which requires double backslashes.
Since C++11, you can use std::regex:
#include <regex>
#include <string>
#include <iostream>
int main() {
auto s = std::string(R"(\tmp\)");
s = std::regex_replace(s, std::regex(R"(\\)"), R"(\\)");
std::cout << s << std::endl;
}
A bit overkill, but it does the trick if you want a "quick" solution.
There are two errors in your code.
First line: you forgot to double the \ in the literal string.
It happens that \t is a valid escape representing the tab character, so you get no compiler error, but your string doesn't contain what you expect.
Second line: according to the reference for string::replace, you can replace a substring with another substring based on the substring's position.
However, there is no overload that performs a substitution, i.e. replaces all occurrences of a given substring with another one.
This doesn't exist in the standard library. It exists for example in the boost library, see boost string algorithms. The algorithm you are looking for is called replace_all.
Is there a simple way to escape all occurrences of \ in a string? I start with the following string:
#include <string>
#include <iostream>
std::string escapeSlashes(std::string str) {
// I have no idea what to do here
return str;
}
int main () {
std::string str = "a\b\c\d";
std::cout << escapeSlashes(str) << "\n";
// Desired output:
// a\\b\\c\\d
return 0;
}
Basically, I am looking for the inverse of this question. The problem is that I cannot search for \ in the string, because C++ already treats it as an escape sequence.
NOTE: I am not able to change the string str in the first place. It is parsed from a LaTeX file. Thus, this answer to a similar question does not apply. Edit: The parsing failed due to an unrelated problem; the question here is about string literals.
Edit: There are nice solutions to find and replace known escape sequences, such as this answer. Another option is to use boost::regex("\p{cntrl}"). However, I haven't found one that works for unknown (erroneous) escape sequences.
You can use raw string literal. See http://en.cppreference.com/w/cpp/language/string_literal
#include <string>
#include <iostream>
int main() {
std::string str = R"(a\b\c\d)";
std::cout << str << "\n";
return 0;
}
Output:
a\b\c\d
It is not possible to convert the string literal a\b\c\d to a\\b\\c\\d, i.e. escaping the backslashes.
Why? Because the compiler converts \c and \d directly to c and d, respectively, giving you a warning about Unknown escape sequence \c and Unknown escape sequence \d (\b is fine as it is a valid escape sequence). This happens directly to the string literal before you have any chance to work with it.
To see this, you can compile to assembler
gcc -S main.cpp
and you will find the following line somewhere in your assembler code:
.string "a\bcd"
Thus, your problem is either in your parsing function or you use string literals for experimenting and you should use raw strings R"(a\b\c\d)" instead.
In the following code, I tried to use $1 to refer to the first submatch:
#include <iostream>
#include <string>
#include <regex>
using namespace std;
int main()
{
string str {"1-2-3 4-5-6 7-8-9"};
int r = 1;
str = regex_replace(str, regex{R"((\d*-\d*-)\d*)"}, "$1" + to_string(r));
cout << str << "\n";
return 0;
}
What I expect is:
1-2-1 4-5-1 7-8-1
But it doesn't work because the actual format string passed to regex_replace() is $11 as if I were trying to refer to the 11th submatch.
So when using regex_replace(), what is the correct way to back-reference a submatch which is followed directly by another digit in the format string?
I tried using ${1} but it didn't work for any of the mainstream implementations that I tried.
According to Standard N3337, §28.5.2, Table 139:
format_default: When a regular expression match is to be replaced by a new string, the new string shall be constructed using the rules used
by the ECMAScript replace function in ECMA-262, part 15.5.4.11
String.prototype.replace. In addition, during search and replace
operations all non-overlapping occurrences of the regular expression
shall be located and replaced, and sections of the input that did not
match the expression shall be copied unchanged to the output string.
And according to ECMA-262 part 15.5.4.11 String.prototype.replace, Table 22
$nn: The nn-th capture, where nn is a two-digit decimal number in the range
01 to 99. If nn≤m and the nnth capture is undefined, use the empty
String instead. If nn>m, the result is implementation-defined.
So at most two decimal digits after $ are taken as the capture-group number. You can therefore use the two-digit form:
"$01" + to_string(r)
I need to match the text '\0' with the same regex that I would match 'a' or 'b'. (a regex for a character constant in C++). I've tried a bunch of different regexes, but haven't gotten a successful one yet. My latest attempt:
^['].|\\0[']
Most of the other things I've tried have given seg faults, so this is really the closest I've gotten.
This works pretty nicely with what I've tested ('a','b','\0').
If you don't have std::regex or boost::regex, the part to take away is the regex itself: ('.'|'\\0').
#include <boost/regex.hpp>
#include <string>
#include <iostream>
#include <vector>
int main() {
std::vector<std::string> strings;
strings.push_back(R"('a')");
strings.push_back(R"('b')");
strings.push_back(R"('\0')");
boost::regex rgx(R"(('.'|'\\0'))");
boost::smatch match;
for(auto& i : strings) {
if(boost::regex_match(i,match, rgx)) {
boost::ssub_match submatch = match[1];
std::cout << submatch.str() << '\n';
}
}
}
There's nothing magic about '\0'; it's just a character like any other, and there's (almost) nothing special you have to do to use it in a regular expression. The only problem you might run into is if it sits in the middle of a string literal that you pass to a function that treats it as the end of the string. To avoid that, force it into a std::string with an explicit length:
const char s[] = "a\0b";
std::string not_my_str(s); // not_my_str holds "a"
std::string str(s, 3); // str holds "a\0b"
Once you've constructed the string object, the embedded '\0' gets no special treatment. Except, of course, if you copy the contents with a function that treats it specially.
The regex that works (in this instance, using the C header ) is:
^('(.|([\\]0))')
Thanks to @WhozCraig for the help!
I'm writing a small command-line program that asks the user for polynomials in the form ax^2+bx^1+cx^0. I'm going to parse the data later, but for now I'm just trying to see if I can match the polynomial with the regular expression
(\+|-|^)(\d*)x\^([0-9*]*)
My problem is, it doesn't match multiple terms in the user-entered polynomial unless I change it to
((\+|-|^)(\d*)x\^([0-9*]*))*
(the difference is that the entire expression is grouped and has an asterisk at the end). The first expression works if I type something such as "4x^2" but not "4x^2+3x^1+2x^0", since it doesn't check multiple times.
My question is, why won't Boost.Regex's regex_match() find multiple matches within the same string? It does in the regular expression editor I used (Expresso) but not in the actual C++ code. Is it supposed to be like that?
Let me know if something doesn't make sense and I'll try to clarify. Thanks for the help.
Edit1: Here's my code (I'm following the tutorial here: http://onlamp.com/pub/a/onlamp/2006/04/06/boostregex.html?page=3)
int main()
{
string polynomial;
cmatch matches; // matches
regex re("((\\+|-|^)(\\d*)x\\^([0-9*]*))*");
cout << "Please enter your polynomials in the form ax^2+bx^1+cx^0." << endl;
cout << "Polynomial:";
getline(cin, polynomial);
if(regex_match(polynomial.c_str(), matches, re))
{
for(int i = 0; i < matches.size(); i++)
{
string match(matches[i].first, matches[i].second);
cout << "\tmatches[" << i << "] = " << match << endl;
}
}
system("PAUSE");
return 0;
}
You're using the wrong thing -- regex_match is intended to check whether a (single) regex matches the entirety of a sequence of characters. As such, you need to either specify a regex that matches the whole input, or use something else. For your situation, it probably makes the most sense to just modify the regex as you've already done (group it and add a Kleene star). If you wanted to iterate over the individual terms of the polynomial, you'd probably want to use something like a regex_token_iterator.
Edit: Of course, since you're embedding this into C++, you also have to double all your backslashes. Looking at it, I'm also a little confused about the regex you're using -- it doesn't look to me like it should really work quite right. Just for example, it seems to require a "+", "-" or "^" at the beginning of a term, but the first term won't normally have that. I'm also somewhat uncertain why there would be a "^" at the beginning of a term. Since the exponent is normally omitted when it's zero, it's probably better to allow it to be omitted. Taking those into account, I get something like: "[-+]?(\d*)x(\^([0-9])*)".
Incorporating that into some code, we can get something like this:
#include <iterator>
#include <regex>
#include <string>
#include <iostream>
int main() {
std::string poly = "4x^2+3x^1+2x";
std::regex term("[-+]?(\\d*)x(\\^[0-9])*");
std::copy(std::sregex_token_iterator(poly.begin(), poly.end(), term),
std::sregex_token_iterator(),
std::ostream_iterator<std::string>(std::cout, "\n"));
return 0;
}
At least for me, that prints out each term individually:
4x^2
+3x^1
+2x
Note that for the moment, I've just printed out each complete term, and modified your input to show off the ability to recognize a term that doesn't include a power (explicitly, anyway).
Edit: to collect the results into a vector instead of sending them to std::cout, you'd do something like this:
#include <algorithm>
#include <iterator>
#include <regex>
#include <string>
#include <vector>
#include <iostream>
int main() {
std::string poly = "4x^2+3x^1+2x";
std::regex term("[-+]?(\\d*)x(\\^[0-9])*");
std::vector<std::string> terms;
std::copy(std::sregex_token_iterator(poly.begin(), poly.end(), term),
std::sregex_token_iterator(),
std::back_inserter(terms));
// Now terms[0] is the first term, terms[1] the second, and so on.
return 0;
}