Removing expressions from QString using QRegExp - c++

I'm having an issue removing expressions from a QString using QRegExp. I have tried countless regexes to no avail. What am I doing wrong?
Sample Text (QString myString) In this instance, myString contains "\u0006\u0007\u0013Hello".
myString.remove(QRegExp("\\[u][0-9]{4}"));
It does not remove any instances of \uXXXX, where each X is a digit.
However, when I am specific such as:
myString.remove("\u0006");
It does remove it.

A string literal is not always the same as the character sequence it produces:
for (char c : "\u0006\u0007\u0013Hello".toCharArray()) {
System.out.println( c + " (" + (int)c + ")" );
}
System.out.println( "--------------" );
for (char c : "\\u0006\\u0007\\u0013Hello".toCharArray()) {
System.out.println( c + " (" + (int)c + ")" );
}
In the first example, \u0006 encodes a Unicode code point, whereas in the second the string actually contains a backslash.
String literals exist only at compile time; at runtime they are character sequences.
Regexes operate on character sequences, not on string literals. In addition, the backslash has a special meaning in a regex and needs to be escaped.
Also note that \u0041 is another way to encode A.
Maybe what you are looking for are Unicode categories; the following may help:
string.replaceAll( "\\p{Cc}", "" )
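For the C++ side of the question, the same distinction can be sketched with the standard library alone. This is a minimal illustration, not Qt code: std::regex stands in for QRegExp, and removeEscapedSequences and removeControlChars are hypothetical helper names. The first function targets the literal text \uXXXX (a real backslash in the data), the second targets actual control characters, which is what a literal such as "\u0006" denotes at runtime.

```cpp
#include <algorithm>
#include <regex>
#include <string>

// Hypothetical helper: removes the literal six-character text \uXXXX
// (backslash, 'u', four decimal digits) from the input.
std::string removeEscapedSequences(const std::string& text) {
    // Raw string literal: the regex engine receives \\u[0-9]{4},
    // i.e. an escaped backslash followed by 'u' and four digits.
    static const std::regex escaped(R"(\\u[0-9]{4})");
    return std::regex_replace(text, escaped, "");
}

// Hypothetical helper: removes real ASCII control characters
// (values below 0x20, and 0x7F) from the input.
std::string removeControlChars(std::string text) {
    text.erase(std::remove_if(text.begin(), text.end(),
                              [](unsigned char c) { return c < 0x20 || c == 0x7F; }),
               text.end());
    return text;
}
```

If the data really contains control characters, only the second approach removes anything; the regex in the first function never sees a backslash in that case.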

Related

Str.global_replace in OCaml putting carets where they shouldn't be

I am working to convert multiline strings into a list of tokens that might be easier for me to work with.
In accordance with the specific needs of my project, I'm padding any caret symbol that appears in my input with spaces, so that "^" gets turned into " ^ ". I'm using something like the following function to do so:
let bad_function string = Str.global_replace (Str.regexp "^") " ^ " (string)
I then use something like the below function to then turn this multiline string into a list of tokens (ignoring whitespace).
let string_to_tokens string = (Str.split (Str.regexp "[ \n\r\x0c\t]+") (string));;
For some reason, bad_function adds carets to places where they shouldn't be. Take the following line of code:
(bad_function " This is some
multiline input
with newline characters
and tabs. When I convert this string
into a list of tokens I get ^s showing up where
they shouldn't. ")
The first line of the string turns into:
^ This is some \n ^
When I feed the output from bad_function into string_to_tokens I get the following list:
string_to_tokens (bad_function " This is some
multiline input
with newline characters
and tabs. When I convert this string
into a list of tokens I get ^s showing up where
they shouldn't. ")
["^"; "This"; "is"; "some"; "^"; "multiline"; "input"; "^"; "with";
"newline"; "characters"; "^"; "and"; "tabs."; "When"; "I"; "convert";
"this"; "string"; "^"; "into"; "a"; "list"; "of"; "tokens"; "I"; "get";
"^s"; "showing"; "up"; "where"; "^"; "they"; "shouldn't."]
Why is this happening, and how can I fix so these functions behave like I want them to?
As explained in the documentation of the Str module:
^ Matches at beginning of line: either at the beginning of the
matched string, or just after a '\n' character.
So you have to quote the '^' character using the escape character "\".
However, note that (also from the doc)
any backslash character in the regular expression must be doubled to
make it past the OCaml string parser.
This means you have to put a double '\' to do what you want without getting a warning.
This should do the job:
let bad_function string = Str.global_replace (Str.regexp "\\^") " ^ " (string);;
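The same double escaping applies in other languages whose string literals use the backslash. As a rough comparison only, here is a sketch in C++ with std::regex (not OCaml's Str); padCarets is a hypothetical name chosen for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: surrounds every literal '^' with spaces.
std::string padCarets(const std::string& text) {
    // "\\^" in source code is the two-character regex \^ : a literal caret,
    // not the beginning-of-line anchor.
    static const std::regex caret("\\^");
    return std::regex_replace(text, caret, " ^ ");
}
```

With the caret escaped, newlines in the input no longer attract extra " ^ " insertions, because the pattern only matches the character itself.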

Matching of strings with special characters

I need to generate a string that can match another both containing special characters. I wrote what I thought would be a simple method, but so far nothing has given me a successful match.
I know that special characters in C++ are preceded by a "\". For example, a single quote would be written as "\'".
string json_string(const string& incoming_str)
{
string str = "\\\"" + incoming_str + "\\\"";
return str;
}
And this is the string I have to compare to:
bool comp = json_string("hello world") == "\"hello world\"";
I can see in the cout stream that in fact I'm generating the string as needed but the comparison still gives a false value.
What am I missing? Any help would be appreciated.
One way is to filter one string and compare this filtered string. For example:
#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>
// Returns a copy of unfiltered with every character listed in specialChars removed.
std::string filterBy(const std::string& unfiltered, const std::string& specialChars)
{
    std::string filtered;
    std::copy_if(unfiltered.begin(), unfiltered.end(),
                 std::back_inserter(filtered),
                 // find() returns std::string::npos when c is not a special character
                 [&specialChars](char c){ return specialChars.find(c) == std::string::npos; });
    return filtered;
}
int main() {
    std::string specialChars = "\"";
    std::string string1 = "test";
    std::string string2 = "\"test\"";
    std::cout << (string1 == filterBy(string2, specialChars) ? "match" : "no match");
    return 0;
}
Output is match. This code also works if you add an arbitrary number of characters to specialChars.
If both strings contain special characters, you can also put string1 through the filterBy function. Then, something like:
"\"hello \"world \"" == "\"hello world "
will also match once both sides have been filtered.
If the comparison is performance-critical, you might also write a comparison that walks both strings with two iterators and skips special characters on the fly, giving a complexity of O(N+M), where N and M are the sizes of the two strings, respectively.
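The two-iterator comparison mentioned above could look roughly like this. It is a sketch under the assumption that the set of special characters is small; equalIgnoring is a hypothetical name, and each lookup in specialChars costs time proportional to its length.

```cpp
#include <string>

// Hypothetical helper: compares a and b while ignoring every character
// that appears in specialChars, without building filtered copies.
bool equalIgnoring(const std::string& a, const std::string& b,
                   const std::string& specialChars) {
    // Advances it past any run of special characters.
    auto skip = [&specialChars](std::string::const_iterator it,
                                std::string::const_iterator end) {
        while (it != end && specialChars.find(*it) != std::string::npos) ++it;
        return it;
    };
    auto ia = skip(a.begin(), a.end());
    auto ib = skip(b.begin(), b.end());
    while (ia != a.end() && ib != b.end()) {
        if (*ia != *ib) return false;   // first differing ordinary character
        ia = skip(ia + 1, a.end());
        ib = skip(ib + 1, b.end());
    }
    // Equal only if both strings ran out of ordinary characters together.
    return ia == a.end() && ib == b.end();
}
```

Each string is traversed once, so no temporary strings are allocated and the comparison can stop at the first mismatch.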
bool comp = json_string("hello world") == "\"hello world\"";
This will definitely yield false. json_string("hello world") creates the string \"hello world\" (backslashes included), but you are comparing it to "hello world" (quotes only, no backslashes).
The problem is here:
string str = "\\\"" + incoming_str + "\\\"";
In the first string literal of str, the first backslash, which you assume is treated as an escape character, is itself escaped by the backslash before it, so it ends up as a plain backslash in your string. The same happens in the last string literal.
Do this:
string str = "\"" + incoming_str + "\"";
In C++ string literals are delimited by quotes.
Then the problem arises: How can I define a string literal that does itself contain quotes? In Python (for comparison), this can get easy (but there are other drawbacks with this approach not of interest here): 'a string with " (quote)'.
C++ doesn't have this alternative string representation [1]; instead, you are limited to using escape sequences (which are available in Python, too, just for completeness): within a string (or character) literal (but nowhere else!), the sequence \" will be replaced by a single double-quote character in the resulting string.
So "\"hello world\"" defined as character array would be:
{ '"', 'h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '"', 0 };
Note that now the escape character is not necessary...
Within your json_string function, you append additional backslashes, though:
"\\\""
{ '\', '"', 0 }
//^^^
Note that I wrote '\' just for illustration! How would you define a single quote as a character literal? By escaping again: '\''. But now you need to escape the escape character, too, so a single backslash actually needs to be written as '\\' here (whereas, in comparison, you don't have to escape the single quote in a string literal: "i am 'singly quoted'", just as you don't have to escape the double quote in a character literal).
As JSON uses double quotes for strings, too, you'd most likely want to change your function:
return "\"" + incoming_str + "\"";
or even much simpler:
return '"' + incoming_str + '"';
Now
json_string("hello world") == "\"hello world\""
would yield true...
[1] Side note (stolen from an answer deleted in the meantime): Since C++11, there are raw string literals, too. Using these, you don't have to escape either.
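To make the difference concrete, here is a small sketch that puts the corrected function next to the version from the question. The name json_string_with_backslashes is made up for this comparison; everything else follows the code discussed above.

```cpp
#include <string>

// Corrected version: wraps the input in plain double quotes.
std::string json_string(const std::string& incoming_str) {
    return '"' + incoming_str + '"';
}

// The question's version, renamed and kept for contrast: each "\\\"" literal
// is two characters, a backslash followed by a double quote.
std::string json_string_with_backslashes(const std::string& incoming_str) {
    return "\\\"" + incoming_str + "\\\"";
}
```

For the input hello world, the first function returns 13 characters and compares equal to both "\"hello world\"" and the raw string literal R"("hello world")". The second returns 15 characters, which is why the comparison in the question fails.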

C++ Qt QString replace double backslash with one

I have a QString with following content:
"MXTP24\\x00\\x00\\xF4\\xF9\\x80\r\n"
I want it to become:
"MXTP24\x00\x00\xF4\xF9\x80\r\n"
I need to replace each "\\x" with "\x" so that I can start parsing the values. But the following code, which I think should do the job, is not doing anything, as I get the same string before and after:
qDebug() << "BEFORE: " << data;
data = data.replace("\\\\x", "\\x", Qt::CaseSensitivity::CaseInsensitive);
qDebug() << "AFTER: " << data;
Here, no change!
Then I tried like this:
data = data.replace("\\x", "\x", Qt::CaseSensitivity::CaseInsensitive);
Then the compiler complains that \x is used with no following hex digits!
Any ideas?
First let's look at what this piece of code does:
data.replace("\\\\x", "\\x", ....
The first string becomes \\x (two backslashes followed by x) in compiled code. This overload of QString::replace takes plain strings, not regular expressions, so the backslash has no further special meaning here: the call searches for the literal three-character sequence \\x.
It then replaces each occurrence with the literal two-character string \x. If the QString really contained doubled backslashes, this call would remove one of each pair; if it only contains single backslashes, nothing matches and nothing changes.
However, this is not your problem, your problem is how qDebug() prints strings. It prints them escaped. That \" at start of output means just plain double quote, 1 char, in the actual string because double quote is escaped. And those \\ also are single backslash char, because literal backslash is also escaped (because it is the escape char and has special meaning for the next char).
So it seems you don't need to do any search replace at all, just remove it.
Try printing the QString in one of these ways to get it shown literally:
std::cout << data.toStdString() << std::endl;
qDebug() << data.toLatin1().constData();
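The effect can be reproduced without Qt. The sketch below imitates an escaping debug printer with a hypothetical function, escapeForDisplay; it is an illustration of the idea only and not Qt's actual implementation.

```cpp
#include <string>

// Hypothetical helper that imitates an escaping debug printer: it wraps the
// text in quotes and prefixes each backslash and double quote with a backslash.
std::string escapeForDisplay(const std::string& text) {
    std::string shown = "\"";
    for (char c : text) {
        if (c == '\\' || c == '"') shown += '\\';  // escape the escape character
        shown += c;
    }
    shown += '"';
    return shown;
}
```

A ten-character string holding MXTP24\x00 (one real backslash) is displayed as "MXTP24\\x00" by such a printer, even though the stored data contains no doubled backslash.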

c++11/regex - search for exact string, escape [duplicate]

Say you have a string which is provided by the user. It can contain any kind of character. Examples are:
std::string s1{"hello world"};
std::string s1{".*"};
std::string s1{"*{}97(}{.}}\\testing___just a --%#$%# literal%$#%^"};
...
Now I want to search in some text for occurrences of >> followed by the input string s1 followed by <<. For this, I have the following code:
std::string input; // the input text
std::regex regex{">> " + s1 + " <<"};
if (std::regex_match(input, regex)) {
// add logic here
}
This works fine if s1 did not contain any special characters. However, if s1 had some special characters, which are recognized by the regex engine, it doesn't work.
How can I escape s1 such that std::regex considers it as a literal, and therefore does not interpret s1? In other words, the regex should be:
std::regex regex{">> " + ESCAPE(s1) + " <<"};
Is there a function like ESCAPE() in std?
Important: I simplified my question. In my real case, the regex is much more complex. As I am only having trouble with the fact that s1 is interpreted, I left these details out.
You will have to escape all special characters in the string with \. The most straightforward approach would be to use another expression to sanitize the input string before creating the expression regex.
// matches any characters that need to be escaped in RegEx
std::regex specialChars { R"([-[\]{}()*+?.,\\^$|#\s])" }; // \\ also covers a backslash inside s1
std::string input = ">> "+ s1 +" <<";
std::string sanitized = std::regex_replace( input, specialChars, R"(\$&)" );
// "sanitized" can now safely be used in another expression
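Packaged as a reusable function, the approach might look like the sketch below. escapeRegex and containsMarked are hypothetical names, the character class is limited to the ECMAScript metacharacters (an assumption about which dialect is in use, since that is the std::regex default), and std::regex_search is used instead of std::regex_match because the goal is to find an occurrence inside a longer text.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: backslash-escapes the ECMAScript regex metacharacters
// so that the result matches the input text literally.
std::string escapeRegex(const std::string& literal) {
    static const std::regex metaChars(R"([.^$|()\[\]{}*+?\\])");
    // In the replacement format, $& stands for the matched character.
    return std::regex_replace(literal, metaChars, R"(\$&)");
}

// Hypothetical helper: true if text contains ">> ", then s1 verbatim, then " <<".
bool containsMarked(const std::string& text, const std::string& s1) {
    const std::regex pattern(">> " + escapeRegex(s1) + " <<");
    return std::regex_search(text, pattern);
}
```

Escaping only s1, rather than the whole assembled pattern, leaves the surrounding expression free to use real regex syntax, which matters when the full pattern is more complex than the simplified one in the question.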

Is this regular expression?

This is how to split a string in UnityScript, according to the Unity Wiki. However, I don't recognize " "[0]. Is this a regular expression? If so, is there any reference for it? I'm familiar with regular expressions generally and have used them a lot, but this syntax is a little confusing.
var qualifiedName = "System.Integer myInt";
var name = qualifiedName.Split(" "[0]);
Wiki Reference
On any string, whether it is a variable or a literal (" "), you can use an indexer to get the char at the nth position.
Your code sample is a very weird way of literally defining a char with a space, and could be simplified by using this:
' '
note the single quotes instead of double quotes
As many have already mentioned, " "[0] is the first character of the " " string (which is a System.String, or an array of System.Chars). The problem with UnityScript is that ' ' is interpreted as a String too, so the only way to provide a Char is by indexing.
" "[0] is the first character of the string " ".
typeof " "[0]; // "string"
Your example is strange, because " "[0] and " " are strictly equal.
" "[0] === " "; // true
Reading reference:
Mono Types When a Mono function requires a char as an input, you can
obtain one by simply indexing a string. E.g. if you wanted to pass the
lowercase a as a char, you'd write: "a"[0]
I suppose it's because UnityScript is implemented in Boo and String is provided by mono.