How to decode a NCR to unicode character in C++ - c++

I want to decode a NCR value something like &# 35686; to its equivalent chinese character. Example:test direct(&# 35686;&# 23519;) should be converted to test direct(警察) . I have tried the below algorithm similar to java
find the decimal value that is between &# and ; --ie. 35686
Convert to int
get the equivalent char by using char(35686) which will give the unicode char
In java , it produces expected output but in C++ it produces the string as test direct( fß) instead of chinese.
Please help me out of this.

Related

How to convert accented character to Hexadecimal Unicode in VBScript? [duplicate]

I'd like to create a .properties file to be used in a Java program from a VBScript. I'm going to use some strings in languages that use characters outside the ASCII map. So, I need to replace these characters for its UTF code. This would be \u0061 for a, \u0062 fro b and so on.
Is there a way to get the UTF code for a char in VBScript?
VBScript has the AscW function that returns the Unicode (wide) code of the first character in the specified string.
Note that AscW returns the character code as a decimal number, so if you need it in a specific format, you'll have to write some additional code for that (and the problem is, VBScript doesn't have decent string formatting functions). For example, if you need the code formatted as \unnnn, you could use a function like this:
WScript.Echo ToUnicodeChar("✈") ''# \u2708
Function ToUnicodeChar(Char)
str = Hex(AscW(Char))
ToUnicodeChar = "\u" & String(4 - Len(str), "0") & str
End Function

Decoding %E6%B0%94%E6%97%8B%E5%93%88%E5%88%A9.txt to a valid string

I am trying to decode a filename*= field of content disposition header. I get a string something like:
%E6%B0%94%E6%97%8B%E5%93%88%E5%88%A9.txt
What I have figured out that replacing % to \x works fine and I get the correct file name:
气旋哈利.txt
Is there a standard way of doing this in C++? Is there any library available to decode this?
I tried
boost::replace_all(name, "%x","\\x");
std::locale::generator gen;
std::locale locl = gen.generate("en_US.utf-8");
decoded_data = boost::locale::conv::from_utf( encoded_data, locl);
But it prints the replaced string instead of chinese characters.
\xE6\xB0\x94\xE6\x97\x8B\xE5\x93\x88\xE5\x88\xA9.txt
Any Idea where am I going wrong?
Replacing escape code like "\xE6" only work in string and character literals, not generally in strings. That's because it's handled by the compiler when it compiles the program.
However, it's not very hard to do yourself, using a simple loop that check for the '%' character, gets the next two characters and convert them to a number and use that number as a "character".

Unicode to integer with preceding zeros in python

I am new to python. I am using python 2.7 version I have two unicode string variables,
a = u'0125', b = u'1234'
Now i want to convert this variables into integer and append it into a List like [0125, 1234]. This is my expected output.
I have tried to convert these variables into integer and appended it into the List and got the output as [125, 1234]. Preceeding zero is missing in that value. Can someone give better solution for this?.

Converting Hexadecimal(\x) in a string to unicode (\u)

I'm encountering a problem currently.
I'm getting a string from a url, I'm decoding this string via curl_easy_unescape and I'm getting a decoded string. So far so good.
Now is where the problem is. For example, when the url had the "counterpart" to ü inside his header, curl_easy_unescape turns the counterpart of ü in \xfc. Now my String has \xfc.
I need it as a "ü".
I need a written "ü" in my string, or I'm getting an error that my string is not utf8 formatted. And i need it inside a string. For example
"Hallü howre yoü"
with curl_easy_escape this turns into
"Hall\xfc+howre+you\xfc"
And i want to revert the \xfc into "ü"s or into "\u00fc"s
My solutions i tried have been:
changing the \x to \u00 . That would work and do the trick. But replacing doesn't work.
encoding the string in utf 8
getting the decimal value of xFC and doing char = valueofFC.
I don't have a clue, how i could resolve that issue.

find if string starts with \U in Python 3.3

I have a string and I want to find out if it starts with \U.
Here is an example
myStr = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
I was trying this:
myStr.startswith('\\U')
but I get False.
How can I detect \U in a string?
The larger picture:
I have a list of strings, most of them are normal English word strings, but there are a few that are similar to what I have shown in myStr, how can I distinguish them?
The original string does not have the character \U. It has the unicode escape sequence \U0001f64c, which is a single Unicode character.
Therefore, it does not make sense to try to detect \U in the string you have given.
Trying to detect the \U in that string is similar to trying to detect \x in the C string "\x90".
It makes no sense because the interpreter has read the sequence and converted it. Of course, if you want to detect the first Unicode character in that string, that works fine.
myStr.startswith('\U0001f64c')
Note that if you define the string with a real \U, like this, you can detect it just fine. Based on some experimentation, I believe Python 2.7.6 defaults to this behavior.
myStr = r'\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
myStr.startswith('\\U') # Returns True.
Update: The OP requested a way to convert from the Unicode string into the raw string above.
I will show the solution in two steps.
First observe that we can view the raw hex for each character like this.
>>> [hex(ord(x)) for x in myStr]
['0x1f64c', '0x1f60d', '0x1f4a6', '0x1f445', '0x1f4af']
Next, we format it by using a format string.
formatString = "".join(r'\U%08x' for x in myStr)
output = formatString % tuple(myChars)
output.startswith("\\U") # Returns True.
Note of course that since we are converting a Unicode string and we are formatting it this way deliberately, it guaranteed to start with \U. However, I assume your actual application is not just to detect whether it starts with \U.
Update2: If the OP is trying to differentiate between "normal English" strings and "Unicode Strings", the above approach will not work, because all characters have a corresponding Unicode representation.
However, one heuristic you might use to check whether a string looks like ASCII is to just check whether the values of each character are outside the normal ASCII range. Assuming that you consider the normal ASCII range to be between 32 and 127 (You can take a look here and decide what you want to include.), you can do something like the following.
def isNormal(myStr):
myChars = [ord(x) for x in myStr]
return all(x < 128 and x > 31 for x in myChars)
This can be done in one line, but I separated it to make it more readable.
Your string:
myStr = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
is not a foraign language text. It is 5 Unicode characters, which are (in order):
PERSON RAISING BOTH HANDS IN CELEBRATION
SMILING FACE WITH HEART-SHAPED EYES
SPLASHING SWEAT SYMBOL
TONGUE
HUNDRED POINTS SYMBOL
If you want to get strings that only contain 'normal' characters, you can use something like this:
if re.search(r'[^A-Za-z0-9\s]', myStr):
# String contained 'weird' characters.
Note that this will also trip on characters like é, which will sometimes be used in English on words with a French origin.