Modifying a QString that contains a "\" - c++

I'm trying to modify a QString. The QString that I'm trying to modify is
"\002"
However, when I try to modify it, the string either gets entirely deleted or shows no change.
I've tried
String.split("\"");
String.remove("\"");
String.remove(QChar('\''));
For some reason Qt requires that I add an extra " or ' in order to compile without errors.
What I currently have is this:
string = pointer->data.info.get_type();
which according to the debugger returns "\002"
string = string.remove(QChar('\''));
The remove call does nothing afterwards.
I'm expecting to remove the \ from the string, but either the string gets entirely deleted or nothing happens. What could be the problem, and how do I modify the QString so that it contains just the numerical values?

You're currently asking Qt to remove " from your string, not \. To remove \, you'll have to escape it, just like you escaped ", i.e. remove("\\").

First of all, your string "\002" does not contain any backslash, quotes, or apostrophes.
Read about C++ string literals: this is an escape sequence.
Note that \nnn represents an arbitrary octal value!
So your literal contains only one character, with decimal value 2. This is the ASCII control code STX (start of text).
As a result this code:
String.split("\"");
String.remove("\"");
String.remove(QChar('\''));
won't split or remove anything, since this string does not contain quote characters or apostrophes. Nor does it try to split on or remove a backslash character, since \" and \' are again escape sequences, just of a different kind.
Now remember that the debugger shows you unprintable characters in escaped form so that you can see the actual content. In a live application the user will see nothing or some strange glyph.
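
Python string literals follow the same octal and backslash escape conventions, so a minimal sketch in Python (used only for illustration; the question itself is about C++/Qt) can show that the literal holds a single character with value 2, and that a real backslash has to be written as \\:

```python
# Assumption: Python stands in for C++ here; both treat "\002" as one
# character and "\\" as a single backslash.
s = "\002"             # octal escape: one character, code point 2 (STX)
print(len(s), ord(s))  # 1 2
print("\\" in s)       # False: there is no backslash to remove

# Removing the control character itself rather than a backslash:
cleaned = ("\002" + "abc").replace("\x02", "")
print(cleaned)         # abc

# A string that really does contain a backslash followed by digits:
literal = "\\002"      # four characters: \, 0, 0, 2
digits = literal.replace("\\", "")
print(digits)          # 002
```

If the goal is the numeric value of the control character, converting its code point to text (here `str(ord(s))`) is likely closer to what is wanted than removing anything.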

Related

English and arabic mixed string not ordered correctly Qt

I have some Qt code that joins the strings “john جونسون” and “(جيد جداً)”, but when I concatenate them I get the result in the wrong order, e.g.:
john (جيد جداً) جونسون
This has nothing to do with Qt. It is a Unicode thing. Qt just adds the characters.
The problem arises because the string starts off LTR (left to right) with 'john', which is in the Latin alphabet. When you then add an Arabic word, the first letter of that word has to be displayed on the right, because Arabic is an RTL script. This means that the last letter (represented by the last bytes) is displayed on the left. So the second string is added at what is, in memory, the end of the string, even though that position is displayed to the left of the Arabic word.
The string you append begins with '(', which Arabic also uses, and thus you stay in RTL mode.
So you need to explicitly mark the switchover back to LTR:
QString ltr{"\u200e"};
QString a {"john جونسون"};
QString b {"(جيد جداً)"};
std::cout << (a+ltr+b).toStdString()<< std::endl;
This adds a zero-width left-to-right mark (U+200E) in between, which tells whatever is displaying your string that, from that point onwards, the end of the string is on the right again (until it reaches Arabic characters again).
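
As a rough illustration of why the mark helps, the sketch below inspects the Unicode bidirectional class of the characters involved. It is written in Python only because its standard `unicodedata` module exposes these classes; it is not part of the Qt answer, and the variable names are made up for the example.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, the same character used in the Qt snippet

a = "john جونسون"
b = "(جيد جداً)"

# Bidirectional classes: L = strong left-to-right, AL = Arabic letter
# (strong right-to-left), ON = other neutral.
print(unicodedata.bidirectional("j"))    # L
print(unicodedata.bidirectional(a[-1]))  # AL
print(unicodedata.bidirectional(b[0]))   # ON
print(unicodedata.bidirectional(LRM))    # L

# The neutral "(" takes its direction from its neighbours; placing a strong
# L character right before it gives it a left-to-right neighbour.
joined = a + LRM + b
print(len(joined) == len(a) + len(b) + 1)  # True: exactly one extra character
```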

Calling function during regular expression replacement

I need to decode a string coming from json. Special characters are encoded as hex unicode (e.g. the apostrophe is /u0027).
I'm trying to accomplish this with this expression:
regexprep('Can/u0027t add the category','/u(\d{4})',native2unicode(hex2dec(strrep('$1','/u',''))))
but I get the following error
Error using hex2dec (line 38)
Input string found with characters other than 0-9, a-f, or A-F.
because hex2dec receives '$1' as value and not the result of strrep('$1','/u','').
If I try
regexprep('Can/u0027t add the category','/u(\d{4})',strrep('$1','/u',''))
I get, correctly, 'Can0027t add the category'. If I try with
regexprep('Can/u0027t add the category','/u(\d{4})',native2unicode(hex2dec(strrep('/u0027','/u',''))))
I get the right result (but with a fixed decoding, obviously).
I don't understand why the result of strrep is not the input argument of hex2dec.
Your debugging test is misleading you. The $1 expansion in the replacement string is performed by regexprep itself, on the string it receives. It is not expanded by the MATLAB parser before any functions are called, so those functions just see the string '$1'. If the result of those functions contains a $1, it gets passed into regexprep and expanded there. So, for example, your test case with the bare strrep replaces nothing (since its input is the string '$1') and passes the bare $1 string right back into regexprep.
You have two issues. One is easy: you don't need strrep at all, since the parentheses mark just the hex digits as the token. $1 expands with no /u. Test it:
regexprep('Can/u0027t add the category','/u(\d{4})','$1')
results in 'Can0027t add the category'.
Now for the harder one. As previously noted, you can't call normal functions on the $1 and have them do anything. However, MATLAB provides a special regexp syntax to call functions from inside the replacement string. Here is the documentation:
http://www.mathworks.com/help/matlab/matlab_prog/dynamic-regular-expressions.html
In summary, ${cmd($1)} expands to calling the MATLAB function cmd on the replacement token to generate the replacement string. So putting it all together:
regexprep('Can/u0027t add the category', '/u(\d{4})', '${native2unicode(hex2dec($1))}')
ans = Can't add the category
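
For comparison only, the same idea (computing the replacement from the captured token) can be sketched in Python, where `re.sub` accepts a function as the replacement. This is an analogue, not MATLAB code: the helper name `decode_match` is made up for the example, and the pattern is widened to `[0-9a-fA-F]{4}` on the assumption that the hex codes may contain letters.

```python
import re

def decode_match(match):
    """Turn the four captured hex digits into the character they encode."""
    return chr(int(match.group(1), 16))

text = "Can/u0027t add the category"
# The function is called once per match, with the match object as argument.
decoded = re.sub(r"/u([0-9a-fA-F]{4})", decode_match, text)
print(decoded)  # Can't add the category
```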

find if string starts with \U in Python 3.3

I have a string and I want to find out if it starts with \U.
Here is an example
myStr = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
I was trying this:
myStr.startswith('\\U')
but I get False.
How can I detect \U in a string?
The larger picture:
I have a list of strings, most of them are normal English word strings, but there are a few that are similar to what I have shown in myStr, how can I distinguish them?
The original string does not have the character \U. It has the unicode escape sequence \U0001f64c, which is a single Unicode character.
Therefore, it does not make sense to try to detect \U in the string you have given.
Trying to detect the \U in that string is similar to trying to detect \x in the C string "\x90".
It makes no sense because the interpreter has read the sequence and converted it. Of course, if you want to detect the first Unicode character in that string, that works fine.
myStr.startswith('\U0001f64c')
Note that if you define the string with a real \U, like this, you can detect it just fine. Based on some experimentation, I believe Python 2.7.6 defaults to this behavior.
myStr = r'\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
myStr.startswith('\\U') # Returns True.
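
A small sketch of the difference, counting the characters in each form (the variable names are only for illustration):

```python
escaped = '\U0001f64c'  # escape sequence: a single character
raw = r'\U0001f64c'     # raw string: backslash, 'U', and eight hex digits

print(len(escaped))               # 1
print(len(raw))                   # 10
print(escaped.startswith('\\U'))  # False
print(raw.startswith('\\U'))      # True
```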
Update: The OP requested a way to convert from the Unicode string into the raw string above.
I will show the solution in two steps.
First observe that we can view the raw hex for each character like this.
>>> [hex(ord(x)) for x in myStr]
['0x1f64c', '0x1f60d', '0x1f4a6', '0x1f445', '0x1f4af']
Next, we format it by using a format string.
myChars = [ord(x) for x in myStr]
formatString = "".join(r'\U%08x' for x in myStr)
output = formatString % tuple(myChars)
output.startswith("\\U") # Returns True.
Note, of course, that since we are converting a Unicode string and formatting it this way deliberately, it is guaranteed to start with \U. However, I assume your actual application is not just to detect whether it starts with \U.
Update2: If the OP is trying to differentiate between "normal English" strings and "Unicode Strings", the above approach will not work, because all characters have a corresponding Unicode representation.
However, one heuristic you might use to check whether a string looks like ASCII is to check whether any character's value falls outside the normal ASCII range. Assuming that you consider the normal ASCII range to be between 32 and 127 (you can look at an ASCII table and decide what you want to include), you can do something like the following.
def isNormal(myStr):
    myChars = [ord(x) for x in myStr]
    return all(x < 128 and x > 31 for x in myChars)
This can be done in one line, but I separated it to make it more readable.
Your string:
myStr = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
is not foreign-language text. It is 5 Unicode characters, which are (in order):
PERSON RAISING BOTH HANDS IN CELEBRATION
SMILING FACE WITH HEART-SHAPED EYES
SPLASHING SWEAT SYMBOL
TONGUE
HUNDRED POINTS SYMBOL
If you want to get strings that only contain 'normal' characters, you can use something like this:
import re
if re.search(r'[^A-Za-z0-9\s]', myStr):
    # String contained 'weird' characters.
    pass
Note that this will also trip on characters like é, which will sometimes be used in English on words with a French origin.
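
If accented letters such as é should still count as normal, one possible refinement (a sketch, not something proposed in the answers above) is to look at Unicode general categories through the standard `unicodedata` module and flag only characters in the "So" (Symbol, other) category, which is where the five characters listed above fall. The function name `contains_symbols` is hypothetical.

```python
import unicodedata

def contains_symbols(text):
    """Return True if any character is in the 'So' (Symbol, other) category."""
    return any(unicodedata.category(ch) == "So" for ch in text)

print(contains_symbols("café"))                  # False: é is a letter (Ll)
print(contains_symbols("\U0001f64c\U0001f60d"))  # True
print(unicodedata.name("\U0001f4af"))            # HUNDRED POINTS SYMBOL
```

This only separates pictographic symbols from letters; text in other scripts (Arabic, Cyrillic, and so on) would still pass as letters, so whether it fits depends on what "normal" should mean for the list.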

I see a character called xDB on notepad++. What character is this?

What is this character
All I really need to know is what is this character. I have not seen anything like this before.
How do I remove this using VB.NET:
data = data.Replace(Chr(???????), "")
Is there a specific control character decimal number, or something similar for this character, that I can use in place of the ??
Please help.
I tried looking through HTML, ASCII, and regex references to find this character, but I did not find it anywhere.
To prevent possible bugs related to the encoding of your source files, you should use a hex editor (such as this Notepad++ plugin) to find the hexadecimal code of the character, then use that to reference the character in your code:
data = data.Replace(ChrW(&HDB), "")
as opposed to:
data = data.Replace("Û", "")
Note: In this case the hex editor is unnecessary because xDB is already a hex code, but other control characters, such as CR and LF, are not displayed as their hex values [in Notepad++].
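
The same principle, referring to the character by its numeric code rather than pasting the glyph into source code, can be sketched in Python (used here only for illustration; the sample data is made up). Whether the file holds the character U+00DB or a stray raw byte 0xDB depends on its encoding, which is an assumption either way, so both variants are shown.

```python
# Variant 1: the data is already decoded text containing U+00DB.
data = "abc\u00dbdef"
cleaned = data.replace(chr(0xDB), "")
print(cleaned)  # abcdef

# Variant 2: the data is raw bytes containing a single 0xDB byte.
raw = b"abc\xdbdef"
cleaned_raw = raw.replace(b"\xdb", b"")
print(cleaned_raw)  # b'abcdef'
```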

Unrecognizable character in C++

I'm programming an application that converts .txt files to bags of words for text mining. However, I keep getting non-alphabetic characters (like ¾ and =) even though my application filters out non-alphabetic characters.
My vector passes through a loop which erases strings that begin with a char whose ASCII value is outside [65,90] (from A to Z). These characters also pass the isalpha test. It seems like these characters can't be distinguished from alphabetic characters.
I don't see how I can remove these weird strings dynamically from my vector of strings. I need help.
My code is not included in full because it is quite long for a forum post.
This part of my code fails to get rid of the strings beginning with non-alphabetic characters:
for (unsigned int i=0; i<token24.size();i++){
    string temp = token24[i];
    char c = temp[0];
    if(c>90||c<65){
        token24.erase(token24.begin()+i);
        i--;
    }
}
I also tried with the condition
(c>'Z'||c<'A')
You could always replace those specific characters with whitespace, but that only handles those specific characters, not the larger problem.
I don't think we can do anything for you until we see the code.
The most important part of programs like yours is handling the content of the .txt file. Such a file can be Unicode text, which in turn can be encoded, for example, with UTF-8. In that case a single byte can be only part of a character, not a character itself. Are you sure you load (and possibly decode) the file in a proper way?
Also, don't you think that lowercase letters are also valid alphabetic characters?
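
To illustrate the point about bytes versus characters, here is a small Python sketch (Python rather than C++ purely for brevity; the token list is made up). It shows that ¾ occupies two bytes in UTF-8, and one way to keep only tokens that start with an ASCII letter, upper- or lowercase, once the text has been decoded.

```python
import string

# In UTF-8, the single character U+00BE (¾) is stored as two bytes.
raw = "¾".encode("utf-8")
print(raw)       # b'\xc2\xbe'
print(len(raw))  # 2

# Hypothetical tokens, already decoded to text.
tokens = ["Apple", "banana", "¾cup", "=sign", "Zebra", ""]

# Keep tokens whose first character is A-Z or a-z; empty tokens are dropped.
kept = [t for t in tokens if t and t[0] in string.ascii_letters]
print(kept)  # ['Apple', 'banana', 'Zebra']
```

Reading the file byte by byte, as a plain char loop does, would see 0xC2 and 0xBE separately, which is one way such tokens can slip past a per-char filter.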