How to convert accented character to Hexadecimal Unicode in VBScript?

I'd like to create a .properties file from a VBScript, to be used in a Java program. Some of the strings are in languages that use characters outside the ASCII range, so I need to replace those characters with their Unicode escape codes: \u0061 for a, \u0062 for b, and so on.
Is there a way to get the Unicode code of a character in VBScript?

VBScript has the AscW function that returns the Unicode (wide) code of the first character in the specified string.
Note that AscW returns the character code as a decimal number, so if you need it in a specific format, you'll have to write some additional code for that (and the problem is, VBScript doesn't have decent string formatting functions). For example, if you need the code formatted as \unnnn, you could use a function like this:
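WScript.Echo ToUnicodeChar("✈") ' \u2708

' Returns the \uXXXX escape for the first character of Char.
Function ToUnicodeChar(Char)
    Dim str
    str = Hex(AscW(Char))
    ' Left-pad the hex code with zeros to four digits.
    ToUnicodeChar = "\u" & String(4 - Len(str), "0") & str
End Function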
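The same idea can be applied to a whole string rather than a single character. The following is only a rough sketch, written in C++ rather than VBScript, and `toUnicodeEscapes` is a hypothetical helper name. It assumes the input is already a sequence of UTF-16 code units, which is what Java's \uXXXX escapes expect.

```cpp
#include <cstdio>
#include <string>

// toUnicodeEscapes is a hypothetical helper name used for illustration.
// Printable ASCII is copied as-is; every other UTF-16 code unit becomes \uXXXX.
// Limitation: a literal backslash in the input is NOT escaped here, although
// a .properties file would need that as well.
std::string toUnicodeEscapes(const std::u16string& in) {
    std::string out;
    for (char16_t unit : in) {
        if (unit >= 0x20 && unit < 0x7F) {
            out += static_cast<char>(unit);
        } else {
            char buf[7];  // "\uXXXX" plus the terminating NUL
            std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(unit));
            out += buf;
        }
    }
    return out;
}
```

Characters outside the Basic Multilingual Plane arrive as two code units (a surrogate pair), so they come out as two consecutive \uXXXX escapes.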

Related

How to convert this '\x5b\x5b\x5b' to this '[[['

It's easy in JavaScript, but how do I convert it in Power Query (PQ)?
Is there an easy way?
Actually, in JavaScript source code, '\x5b\x5b\x5b\x220BwP4bPODhDZVZzVcXdmZlVnenc\x22' compiles to the same value as '[[["0BwP4bPODhDZVZzVcXdmZlVnenc"' so there is no conversion.
JavaScript strings are counted sequences of UTF-16 code units. UTF-16 is an encoding for the Unicode Character set. A JavaScript literal string allows several types of escape sequences. One is \xHH, where HH is the hexadecimal number (0 to 255) for an ISO 8859-1 code unit. ISO 8859-1 is a subset of Unicode and has the same codepoints as the first 256 Unicode codepoints. UTF-16 encodes those codepoints to the same values.
Power Query strings are also counted sequences of UTF-16 code units. (As are strings in Java, C#, VB4/5/6, VBA, VBScript, VB, F#, …, for that matter.) So, your string is almost there, except for the escapes. We can convert the JavaScript literal string to a Power Query text value in a few steps.
JavaScript also has \uHHHH escapes, where HHHH is the hexadecimal number for a UTF-16 code unit. Because of the similarity between Unicode and ISO 8859-1, \xHH is effectively shorthand for \u00HH.
JSON simplifies JavaScript literal strings, allowing \uHHHH escapes but not \xHH. Power Query has data transformation functions for JSON. So, we need to convert JavaScript to JSON and then transform.
In a blank query, open the Advanced Editor and paste:
let
JavaScriptLiteral = "\x5b\x5b\x5b\x220BwP4bPODhDZVZzVcXdmZlVnenc\x22",
JsonLiteral = """" & Text.Replace(JavaScriptLiteral, "\x","\u00") & """",
Value = Json.Document(JsonLiteral)
in
Value
(This will break if given a source with other escapes like "\\x" or with double quotes.)
You can, of course, turn this into a Power Query function if you need to apply it to more than one string, or convert it to a table, and so on.
This seems to work, specific to your \x5b\x5b\x5b portion.
Start with a table named Table1 as the source for your Power Query query, with your text in Column1.
Then use Transform -> Replace Values. [screenshots of the settings and the result are missing]
Then use Add Column -> Custom Column. [screenshots of the formula and the result are missing]
Then extract the values in the new Custom column. [screenshot of the result is missing]

Decoding %E6%B0%94%E6%97%8B%E5%93%88%E5%88%A9.txt to a valid string

I am trying to decode a filename*= field of content disposition header. I get a string something like:
%E6%B0%94%E6%97%8B%E5%93%88%E5%88%A9.txt
I have figured out that replacing % with \x works and gives the correct file name:
气旋哈利.txt
Is there a standard way of doing this in C++? Is there any library available to decode this?
I tried
boost::replace_all(name, "%x","\\x");
std::locale::generator gen;
std::locale locl = gen.generate("en_US.utf-8");
decoded_data = boost::locale::conv::from_utf( encoded_data, locl);
But it prints the replaced string instead of the Chinese characters:
\xE6\xB0\x94\xE6\x97\x8B\xE5\x93\x88\xE5\x88\xA9.txt
Any idea where I am going wrong?
Escape codes like "\xE6" only work in string and character literals in your source code, not in strings built at run time, because the compiler translates them when it compiles the program.
However, it's not very hard to do yourself: use a simple loop that checks for the '%' character, takes the next two characters, converts them from hexadecimal to a number, and uses that number as a "character" (byte).
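A minimal sketch of such a loop might look like the following. `percentDecode` and `hexValue` are hypothetical helper names, not Boost or standard library functions, and the sketch assumes that a '%' not followed by two hex digits should simply be copied through unchanged.

```cpp
#include <cctype>
#include <string>

// Convert one hexadecimal digit to its value; the caller has already
// checked the character with std::isxdigit.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return c - 'A' + 10;
}

// percentDecode is a hypothetical helper name used for illustration.
// Each "%HH" triple becomes the single byte 0xHH; everything else is copied.
std::string percentDecode(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size() &&
            std::isxdigit(static_cast<unsigned char>(in[i + 1])) &&
            std::isxdigit(static_cast<unsigned char>(in[i + 2]))) {
            out += static_cast<char>(hexValue(in[i + 1]) * 16 + hexValue(in[i + 2]));
            i += 2;  // skip the two hex digits just consumed
        } else {
            out += in[i];
        }
    }
    return out;
}
```

The result is a sequence of raw bytes. For the file name in the question those bytes are UTF-8, so whether they display as Chinese characters depends on how the string is later printed or converted.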

find if string starts with \U in Python 3.3

I have a string and I want to find out if it starts with \U.
Here is an example
myStr = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
I was trying this:
myStr.startswith('\\U')
but I get False.
How can I detect \U in a string?
The larger picture:
I have a list of strings. Most of them are normal English words, but a few are similar to what I have shown in myStr. How can I distinguish them?
The original string does not have the character \U. It has the unicode escape sequence \U0001f64c, which is a single Unicode character.
Therefore, it does not make sense to try to detect \U in the string you have given.
Trying to detect the \U in that string is similar to trying to detect \x in the C string "\x90".
It makes no sense because the interpreter has read the sequence and converted it. Of course, if you want to detect the first Unicode character in that string, that works fine.
myStr.startswith('\U0001f64c')
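The C/C++ analogy mentioned above can be checked directly. This is only an illustrative sketch, and both function names are hypothetical: in an ordinary literal the compiler has already turned \x90 into one byte, while a C++11 raw literal keeps the backslash as a real character.

```cpp
#include <cstring>
#include <string>

// Both helpers are hypothetical names used only for this illustration.

// In an ordinary literal the escape is resolved at compile time,
// so the string holds a single byte and no backslash at all.
std::size_t cookedLength() {
    return std::strlen("\x90");
}

// In a raw literal the four characters \ x 9 0 are stored as written,
// so looking for a leading backslash followed by 'x' succeeds.
bool rawStartsWithBackslashX() {
    const std::string raw = R"(\x90)";
    return raw.size() == 4 && raw.compare(0, 2, "\\x") == 0;
}
```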
Note that if you define the string with a real \U, as in the raw string below, you can detect it just fine. Based on some experimentation, I believe Python 2.7.6 defaults to this behavior for plain (non-unicode) string literals.
myStr = r'\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
myStr.startswith('\\U') # Returns True.
Update: The OP requested a way to convert from the Unicode string into the raw string above.
I will show the solution in two steps.
First observe that we can view the raw hex for each character like this.
>>> [hex(ord(x)) for x in myStr]
['0x1f64c', '0x1f60d', '0x1f4a6', '0x1f445', '0x1f4af']
Next, we format it by using a format string.
myChars = [ord(x) for x in myStr]
formatString = "".join(r'\U%08x' for x in myStr)
output = formatString % tuple(myChars)
output.startswith("\\U") # Returns True.
Note of course that since we are converting a Unicode string and formatting it this way deliberately, it is guaranteed to start with \U. However, I assume your actual application is not just to detect whether it starts with \U.
Update2: If the OP is trying to differentiate between "normal English" strings and "Unicode Strings", the above approach will not work, because all characters have a corresponding Unicode representation.
However, one heuristic for checking whether a string looks like plain ASCII is to test whether any character's value falls outside the normal ASCII range. Assuming you consider the normal range to be 32 through 127 (consult an ASCII table and decide what you want to include), you can do something like the following.
def isNormal(myStr):
    myChars = [ord(x) for x in myStr]
    return all(x < 128 and x > 31 for x in myChars)
This can be done in one line, but I separated it to make it more readable.
Your string:
myStr = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
is not foreign-language text. It is 5 Unicode characters, which are (in order):
PERSON RAISING BOTH HANDS IN CELEBRATION
SMILING FACE WITH HEART-SHAPED EYES
SPLASHING SWEAT SYMBOL
TONGUE
HUNDRED POINTS SYMBOL
If you want to get strings that only contain 'normal' characters, you can use something like this:
import re
if re.search(r'[^A-Za-z0-9\s]', myStr):
    # String contained 'weird' characters.
    pass
Note that this will also trip on characters like é, which will sometimes be used in English on words with a French origin.

I see a character called xDB on notepad++. What character is this?

What is this character
All I really need to know is what is this character. I have not seen anything like this before.
How do I remove this using VB.NET:
data = data.Replace(Chr(???????), "")
Is there a specific decimal character code or something similar that I can use in place of the ??????? above?
Please help.
I tried looking through HTML, ASCII, and regex references to find this character, but I did not find it anywhere.
To prevent possible bugs related to the encoding of your source files, you should use a hex editor (such as a Notepad++ hex editor plugin) to find the hexadecimal code of the character, then use that to reference the character in your code:
data = data.Replace(ChrW(&HDB), "")
as opposed to:
data = data.Replace("Û", "")
Note: In this case the hex editor is unnecessary because xDB is already a hex code, but other control characters, such as CR and LF, are not displayed as their hex values [in Notepad++].

C++ new line not translating

First off, I'm a complete beginner at C++.
I'm coding something using an API, and would like to pass text containing new lines to it, and have it print out the new lines at the other end.
If I hardcode whatever I want it to print out, like so
printInApp("Hello\nWorld");
it does come out as separate lines at the other end, but if I retrieve the text from the app using a method that returns a const char* and then pass it straight to printInApp (which takes a const char* as its argument), it comes out as a single line.
Why is this, and how would I go about fixing it?
It is the compiler that processes escape codes in string literals, not the runtime functions. This is why you can, for example, write "char c = '\n';": the compiler simply compiles it as "char c = 10".
If your string contains a '\' and an 'n' as two separate characters (e.g. because it was read that way from a file), you will need to write (or find) a function that locates such escape sequences and converts them to the intended values, e.g. converting a '\' followed by an 'n' into a newline (ASCII value 10).
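A small sketch of such a function is shown below. `unescape` is a hypothetical helper name, and the set of escapes it understands (\n, \t, \r and \\) is an assumption chosen for illustration; any other backslash sequence is copied through unchanged.

```cpp
#include <string>

// unescape is a hypothetical helper name used for illustration.
// It rewrites the two-character sequences \n, \t, \r and \\ into the
// single characters they stand for and leaves everything else alone.
std::string unescape(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 1 < in.size()) {
            switch (in[i + 1]) {
                case 'n':  out += '\n'; ++i; continue;
                case 't':  out += '\t'; ++i; continue;
                case 'r':  out += '\r'; ++i; continue;
                case '\\': out += '\\'; ++i; continue;
                default:   break;  // unknown escape: keep the backslash as-is
            }
        }
        out += in[i];
    }
    return out;
}
```

With something like this in place, the retrieved text could be passed on as `printInApp(unescape(text).c_str());`, where `text` stands for whatever string the app returned.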