I'm just getting my head around regular expressions, and I'm using the Boost Regex library.
I have a need to use a regex that includes a specific URL, and it chokes because obviously there are characters in the URL that are reserved for regex and need to be escaped.
Is there any function or method in the Boost library to escape a string for this kind of usage? I know there are such methods in most other regex implementations, but I don't see one in Boost.
Alternatively, is there a list of all characters that would need to be escaped?
. ^ $ | ( ) [ ] { } * + ? \
Ironically, you could use a regex to escape your URL so that it can be inserted into a regex.
const boost::regex esc("[.^$|()\\[\\]{}*+?\\\\]");
const std::string rep("\\\\&");
std::string result = regex_replace(url_to_escape, esc, rep,
boost::match_default | boost::format_sed);
(The boost::format_sed flag selects sed's replacement string format. In sed, an unescaped & outputs whatever was matched by the whole expression.)
Or, if you are not comfortable with sed's replacement string format, change the flag to boost::format_perl and use the familiar $& to refer to whatever was matched by the whole expression.
const std::string rep("\\\\$&");
std::string result = regex_replace(url_to_escape, esc, rep,
boost::match_default | boost::format_perl);
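For anyone who wants to try the idea without Boost, here is a minimal sketch that swaps Boost.Regex for std::regex from C++11. It assumes the default ECMAScript grammar, where $& in the replacement stands for the whole match and a backslash is copied as-is. The names regex_escape and contains_literal, and the example.com URL in the note below, are made up for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: prefixes every regex metacharacter with a backslash.
std::string regex_escape(const std::string& text) {
    // Same character class as in the answer above, written as a raw string.
    static const std::regex special(R"([.^$|()\[\]{}*+?\\])");
    // In the default ECMAScript format, "$&" is the whole match and a
    // backslash is copied literally, so each hit becomes "\" + matched char.
    return std::regex_replace(text, special, R"(\$&)");
}

// Hypothetical helper: true if `haystack` contains `literal` verbatim.
bool contains_literal(const std::string& haystack, const std::string& literal) {
    const std::regex re(regex_escape(literal));
    return std::regex_search(haystack, re);
}
```

With this sketch, regex_escape("http://example.com/?q=1+2") gives http://example\.com/\?q=1\+2, and the escaped text can then be embedded in a larger pattern.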
Using code from Dav (plus a fix from the comments), I created a wide-character regex_escape() function:
std::wstring regex_escape(const std::wstring& string_to_escape) {
// L"..." literals keep this compiling whether or not UNICODE is defined
static const boost::wregex re_boostRegexEscape( L"[.^$|()\\[\\]{}*+?\\\\]" );
const std::wstring rep( L"\\\\&" );
std::wstring result = regex_replace(string_to_escape, re_boostRegexEscape, rep, boost::match_default | boost::format_sed);
return result;
}
For a narrow-character (ASCII) version, use std::string/boost::regex instead of std::wstring/boost::wregex and drop the L prefix.
Same with boost::xpressive:
const boost::xpressive::sregex re_escape_text = boost::xpressive::sregex::compile("([\\^\\.\\$\\|\\(\\)\\[\\]\\{\\}\\*\\+\\?\\/\\\\])");
std::string regex_escape(std::string text){
text = boost::xpressive::regex_replace( text, re_escape_text, std::string("\\$1") );
return text;
}
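In C++11, you can use raw string literals so that backslashes in the regex string do not have to be doubled for the compiler (regex metacharacters still need their single backslash):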
std::string myRegex = R"(something\.com)";
See http://en.cppreference.com/w/cpp/language/string_literal, item (6).
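A small sketch of what the raw literal buys you, using std::regex so it compiles without Boost. The variable names and the mentions_site helper are invented for the example. Both spellings produce the same pattern, the raw one is just easier to read.

```cpp
#include <cassert>
#include <regex>
#include <string>

// The same pattern written twice: once with doubled backslashes,
// once as a C++11 raw string literal.
const std::string escaped_pattern = "something\\.com";
const std::string raw_pattern = R"(something\.com)";

// Hypothetical helper: true if `text` contains the literal "something.com".
bool mentions_site(const std::string& text) {
    static const std::regex re(raw_pattern);
    return std::regex_search(text, re);
}
```

The dot is still escaped for the regex engine in both versions. Only the extra layer of C++ escaping goes away.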
I want to remove all special symbols from a string and keep only the words.
I tried this, but the output is the same as the input:
main() {
String s = "Hello, world! i am 'foo'";
print(s.replaceAll(new RegExp('\W+'),''));
}
output : Hello, world! i am 'foo'
expected : Hello world i am foo
There are two issues:
'\W' is not a valid escape sequence in a regular string literal. To get a backslash into the pattern, write \\ or use a raw string literal (r'...').
The \W regex pattern matches any character that is not a word character, and that includes whitespace. To keep the spaces, use a negated character class that contains both the word and whitespace classes: [^\w\s].
Use
void main() {
String s = "Hello, world! i am 'foo'";
print(s.replaceAll(new RegExp(r'[^\w\s]+'),''));
}
Output: Hello world i am foo.
Fully Unicode-aware solution
Based on the post "What's the correct regex range for javascript's regexes to match all the non word characters in any script?", and bearing in mind that \w in a Unicode-aware regex is equal to [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}], you can use the following in Dart:
void main() {
String s = "Hęllo, wórld! i am 'foo'";
String regex = r'[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}\s]+';
print(s.replaceAll(RegExp(regex, unicode: true),''));
}
// => Hęllo wórld i am foo
The docs for the RegExp class state that you should use raw strings (a string literal prefixed with an r, like r"Hello world") if you're constructing a regular expression that way. This is particularly necessary where you're using escapes.
In addition, your regex is going to catch spaces as well, so you'll need to modify that. You can use RegExp(r"[^\s\w]") instead, which matches any character that is neither whitespace nor a word character.
I found this question looking for how to remove a symbol from a string. For others who come here wanting to do that:
final myString = 'abc=';
final withoutEquals = myString.replaceAll(RegExp('='), ''); // abc
First solution
s.replaceAll(RegExp(",|!|'"), ""); // The | operator works as OR
Second solution
s.replaceAll(",", "").replaceAll("!", "").replaceAll("'", "");
Removing every "," from a string:
String myString = "s, t, r";
myString = myString.replaceAll(",", ""); // myString is "s t r"
There may be a similar question here on Stack Overflow already.
But my question is:
First I tried to match every x in the string, so I wrote the following code, and it works well:
string str = line;
regex rx("x");
vector<int> index_matches; // results saved here
for (auto it = std::sregex_iterator(str.begin(), str.end(), rx);
it != std::sregex_iterator();
++it)
{
index_matches.push_back(it->position());
}
Now I want to match every {, so I tried replacing
regex rx("x"); with regex rx("{"); and with regex rx("\{");.
Both threw an exception. I think that is expected, because { is sometimes part of regular expression syntax, and the engine expects a matching } later in the pattern.
So first, is my explanation correct?
Second, I need to match every { using the same code as above. Can regex rx("{"); be changed to something that works?
You need to escape characters that have a special meaning in regular expressions, i.e. use the regular expression \{. But \ also has a special meaning in C++ string literals, so next you need to escape it for the compiler as well, i.e. write:
regex rx("\\{");
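Putting that together with the loop from the question, a minimal sketch might look like this. The function name brace_positions is made up here; the rest follows the question's code.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper wrapping the loop from the question:
// returns the index of every '{' in `line`.
std::vector<int> brace_positions(const std::string& line) {
    // "\\{" in C++ source is the two-character regex \{ (a literal brace).
    static const std::regex rx("\\{");
    std::vector<int> index_matches;
    for (auto it = std::sregex_iterator(line.begin(), line.end(), rx);
         it != std::sregex_iterator(); ++it) {
        index_matches.push_back(static_cast<int>(it->position()));
    }
    return index_matches;
}
```

For the input a{b{c} this returns the positions 1 and 3. A raw string literal, R"(\{)", would be an equivalent way to spell the pattern.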
I need to determine if a file is PCL encoded. So I am looking at the first line to see if it begins with an ESC character. If you know a better way feel free to suggest. Here is my code:
bool pclFlag = false;
if (containStr(jobLine, "^\\e")) {
pclFlag=true;
}
bool containStr(const string& s, const string& re)
{
static const boost::regex e(re);
return regex_match(s, e);
}
pclFlag does not get set to true.
You've declared boost::regex e to be static, which means it is initialized only the very first time your function is called. If this search is not the first call, it will still be searching for whatever pattern was passed in the first call.
regex_match must match the entire string. Try adding ".*" (dot star) to the end of your regex.
Important
Note that the result is true only if the expression matches the whole of the input sequence. If you want to search for an expression somewhere within the sequence then use regex_search. If you want to match a prefix of the character string then use regex_search with the flag match_continuous set.
http://www.boost.org/doc/libs/1_51_0/libs/regex/doc/html/boost_regex/ref/regex_match.html
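To make the difference concrete, here is a rough sketch using std::regex rather than Boost.Regex, so it can be compiled without Boost. The helper names are invented. It also writes the ESC character as \x1B, on the assumption that the default std::regex grammar does not understand \e the way Boost's Perl syntax does.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if `s` begins with a match for `pattern`.
// The regex is built on each call (not static), so a different pattern
// passed on a later call is honoured.
bool starts_with_pattern(const std::string& s, const std::string& pattern) {
    const std::regex re(pattern);
    // match_continuous only accepts a match that starts at the first character.
    return std::regex_search(s, re, std::regex_constants::match_continuous);
}

// Hypothetical helper: true only if `pattern` matches the whole of `s`.
bool matches_whole(const std::string& s, const std::string& pattern) {
    return std::regex_match(s, std::regex(pattern));
}
```

A line that starts with ESC passes starts_with_pattern(line, R"(\x1B)"), while matches_whole needs the trailing .* to succeed on the same line.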
@JoachimPileborg is right... if (jobLine[0] == 0x1B) {} is much easier.
Boost.Regex seems like overkill if all you want to do is see if a string starts with a certain character.
bool pclFlag = jobLine.length() > 0 && jobLine[0] == '\033';
You could also use the Boost string algorithms:
#include <boost/algorithm/string.hpp>
bool pclFlag = boost::starts_with(jobLine, "\033");
If you're looking to see if a string contains an escape anywhere in the string:
bool pclFlag = jobLine.find('\033') != std::string::npos;
What is the regular expression for removing the suffix of file names? For example, if I have a file name in a string such as "vnb.txt", what is the regular expression to remove ".txt"?
Thanks.
Do you really need a regular expression to do this? Why not just look for the last period in the string, and trim the string up to that point? Frankly, there's a lot of overhead for a regular expression, and I don't think you need it in this case.
As suggested by tstenner, you can try one of the following, depending on what kinds of strings you're using:
std::strrchr
std::string::find_last_of
First example:
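const char* str = "Directory/file.txt";
size_t index = 0; // stays 0 if there is no '.'
const char* pStr = strrchr(str,'.'); // last '.', or nullptr if none
if(nullptr != pStr)
{
index = pStr - str; // offset of the last '.'
}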
Second example:
size_t index = string("Directory/file.txt").find_last_of('.'); // string::npos if there is no '.'
If you are using Qt already, you could use QFileInfo, and use the baseName() function to get just the name (if one exists), or the suffix() function to get the extension (if one exists).
If you're looking for a solution that will give you anything except for the suffix, you should use string::find_last_of.
Your code could look like this:
const std::string removesuffix(const std::string& s) {
size_t suffixbegin = s.find_last_of('.');
//This will handle cases like "directory.foo/bar"
size_t dir = s.find_last_of('/');
if(dir != std::string::npos && dir > suffixbegin) return s;
if(suffixbegin == std::string::npos) return s;
else return s.substr(0,suffixbegin);
}
If you're looking for a regular expression, use \.[^.]+$.
You have to escape the first ., otherwise it will match any character, and put a $ at the end, so it will only match at the end of a string.
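As a rough sketch, the regex could be applied with std::regex_replace like this. The function name remove_suffix is made up, and the character class also excludes '/' (an addition to the pattern above) so that a dot inside a directory name is left alone.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: drops the final ".ext" from a file name.
std::string remove_suffix(const std::string& name) {
    // \.      a literal dot
    // [^./]+  one or more characters that are neither '.' nor '/'
    // $       end of the string
    static const std::regex suffix(R"(\.[^./]+$)");
    return std::regex_replace(name, suffix, "");
}
```

So "vnb.txt" becomes "vnb", "archive.tar.gz" becomes "archive.tar", and "directory.foo/bar" is returned unchanged.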
Different operating systems may allow different characters in filenames, so the simplest regex might be (.+)\.txt$. Take the first capture group to get the filename without its extension.