Convert extended ASCII characters to Latin1 - c++

I can find different values for the same extended ASCII characters at the links below:
1. http://www.theasciicode.com.ar/
(extended ASCII in non-Latin format)
2. https://www.ascii-code.com/
(shows extended ASCII in Latin-1 format)
In the first link the value of á (a with acute accent) = 160,
in the second link the value of á (a with acute accent) = 225,
and similarly there is a seemingly random difference between the values in the range 128-255.
I have a C++ application where I receive the character values in the non-Latin format (1), which I need to output as Latin-1 values (2). Is there a formula that could help? Please help. Thanks

The first table looks like code page 437, which you can find on Windows.
If you have a string encoded with this code page, you can decode it like this:
text = b"your\xa0text".decode("cp437")
You get: 'yourátext'
To convert it to ISO Latin-1, you can write:
latin1 = text.encode("latin1")  # or "iso-8859-1"
You get the byte string: b'your\xe1text'
Notice: some characters won't convert from cp437 to Latin-1.
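Putting both steps together, here is a minimal transcoding sketch (the function name cp437_to_latin1 is mine, not from the thread); errors="replace" covers the cp437 characters, such as box-drawing glyphs, that have no Latin-1 equivalent:

```python
def cp437_to_latin1(data: bytes) -> bytes:
    """Transcode a cp437-encoded byte string to Latin-1 (ISO 8859-1).

    Characters with no Latin-1 equivalent (e.g. box-drawing glyphs)
    are replaced with '?' instead of raising UnicodeEncodeError.
    """
    return data.decode("cp437").encode("latin1", errors="replace")

# The example from the question: á is 160 (0xA0) in cp437, 225 (0xE1) in Latin-1.
print(cp437_to_latin1(b"your\xa0text"))  # b'your\xe1text'
```

The codec tables do the per-character mapping, so no arithmetic formula is needed; there is no simple offset because the two code pages arrange the accented letters differently.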

Related

How to create a Python (2.7) regular expression to find ASCII 06 followed by any two extended ASCII characters

I'm parsing a file using Python 2.7 and I'm trying to find all occurrences of the following pattern
ASCII 06 followed by any two characters in the range from ASCII 0 to ASCII 255
Naive try #1 - [chr(6)][chr(0)-chr(255)][chr(0)-chr(255)]
fails with a message that indicates the range cannot be strings.
I've tried several other combos - no success.
The record that I'm parsing was read in with
sF = open('D:\Scratch\xxxxx.01', 'r')
record = sF.read()
Any help will be gratefully appreciated.
Thanx,
Doug
Given you're using python2.7, your open().read() will return bytes. You can use the following bytes regex to match ASCII 06 followed by 2 [0-255] bytes:
reg = re.compile(b'\x06..')
Note that I used . here, as any byte will satisfy 0-255, except the newline character \n (\x0a)
If you wanted to also match the newline character you could do one of the following things:
# Uses the DOTALL flag to force `.` to match `\n`
reg = re.compile(b'\x06..', re.DOTALL)
# Makes a character class which matches all bytes
reg = re.compile(b'\x06[\x00-\xff]{2}')
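A quick self-check of all three variants on a small synthetic record (the sample bytes are made up for illustration):

```python
import re

# A fabricated record: ASCII 06 followed by two arbitrary bytes, twice;
# the second occurrence contains a newline byte.
record = b"header\x06\x41\xff-middle-\x06\x0a\x80-tail"

plain = re.compile(b"\x06..")                # '.' does not match b'\n'
dotall = re.compile(b"\x06..", re.DOTALL)    # '.' matches any byte
byclass = re.compile(b"\x06[\x00-\xff]{2}")  # explicit byte class

print(plain.findall(record))    # [b'\x06A\xff']
print(dotall.findall(record))   # [b'\x06A\xff', b'\x06\n\x80']
print(byclass.findall(record))  # [b'\x06A\xff', b'\x06\n\x80']
```

On Python 3 the file would need to be opened in binary mode (`open(path, 'rb')`) for these bytes patterns to apply; on Python 2.7, as the answer notes, plain `'r'` mode already yields a byte string.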

Storing crypted data with libconfig

I'm using libconfig to create a configuration file, and one of the fields is the content of an encrypted file. The problem occurs because the file contains some escape characters, which causes only part of the content to be stored. What is the best way to store this data to avoid accidental escape characters? Convert to Unicode?
Any suggestion?
You can use either URL encoding, where each non-ASCII character is encoded as a % character followed by two hex digits, or you can use base64 encoding, where each set of 3 bytes is encoded as 4 ASCII characters (3x8 bits -> 4x6 bits).
For example, if you have the following bytes:
00 01 41 31 80 FE
You can URL encode it as follows:
%00%01A1%80%FE
Or you can base64 encode it like this, with 0-25 = A-Z, 26-51 = a-z, 52-61 = 0-9, 62 = ., 63 = /:
(00000000 00000001 01000001) (00110001 10000000 11111110) -->
(000000 000000 000101 000001) (001100 011000 000011 111110)
AAFBMYD.
The standard for encoding binary data as text used to be uuencode and is now base64. Both use the same paradigm: a byte uses 8 bits, so 3 bytes use 24 bits, or four 6-bit characters.
uuencode just used the 6 bits with an offset of 32 (the ASCII code for space), so characters are in the range 32-96: all in the printable ASCII range, but including space and possibly other characters that could have special meanings.
base64 chose these 64 characters to represent the values from 0 to 63 (no =:;,'"\*(){}[] that could have special meaning...):
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
with the equals sign (=) serving as a placeholder for empty positions at the end of an encoded string, to ensure that the encoded length is a multiple of 4.
Unfortunately, neither the C nor the C++ standard library offers functions for uuencode or base64 conversion, but you can find nice implementations around, with many pointers in this other SO answer: How do I base64 encode (decode) in C?
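For reference, both encodings are in the Python standard library (which other answers in this thread already use), so the six example bytes from the earlier answer can be checked directly; note that standard base64 uses + and / for 62 and 63, so the result ends in + rather than .:

```python
import base64
from urllib.parse import quote, unquote_to_bytes

raw = bytes([0x00, 0x01, 0x41, 0x31, 0x80, 0xFE])

# URL (percent) encoding: unsafe bytes become %XX escapes,
# printable ASCII bytes (0x41 'A', 0x31 '1') pass through.
url = quote(raw, safe="")
print(url)        # %00%01A1%80%FE

# Base64: 6 bytes -> 8 characters, no '=' padding (6 is a multiple of 3).
b64 = base64.b64encode(raw)
print(b64)        # b'AAFBMYD+'

# Both round-trip back to the original bytes.
assert unquote_to_bytes(url) == raw
assert base64.b64decode(b64) == raw
```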

How can I get a random unicode string

I am testing a REST based service and one of the inputs is a text string. So I am sending it random unicode strings from my python code. So far the unicode strings that I sent were in the ascii range, so everything worked.
Now I am attempting to send characters beyond the ascii range and I am getting an encoding error. Here is my code. I have been through this link and still unable to wrap my head around it.
# coding=utf-8
import os, random, string
import json
junk_len = 512
junk = (("%%0%dX" % junk_len) % random.getrandbits(junk_len * 8))
for i in xrange(1, 5):
    if(len(junk) % 8 == 0):
        print u'decoding to hex'
        message = junk.decode("hex")
        print 'Hex chars %s' % message
        print u' '.join(message.encode("utf-8").strip())
The first line prints without any issues, but I can't send it to the REST service without encoding it. Hence the second line where I am attempting to encode it to utf-8. This is the line of code that fails with the following message.
UnicodeDecodeError: 'ascii' codec can't decode byte 0x81 in position
7: ordinal not in range(128)
As others have said, it's very difficult to make valid random UTF-8 bytes as the byte sequences have to be correct.
As Unicode maps all characters to a number between 0x0000 and 0x10FFFF, all one needs to do is randomly generate a number in that range to get a valid Unicode code point. Passing the random number to unichr (or chr on Py3) will return a Unicode string of the character at that code point.
Then all you need to do is ask Python to encode to UTF-8 to create a valid UTF-8 sequence.
Because there are many gaps and unprintable characters (due to font limitations) in the full Unicode range, using the range 0000-D7FF will return characters in the Basic Multilingual Plane, which are more likely to be printable on your system. When encoded to UTF-8, this results in sequences of up to 3 bytes per character.
Plain Random
import random

def random_unicode(length):
    # Create a list of unicode characters within the range 0000-D7FF
    random_unicodes = [unichr(random.randrange(0xD7FF)) for _ in xrange(0, length)]
    return u"".join(random_unicodes)

my_random_unicode_str = random_unicode(length=512)
my_random_utf_8_str = my_random_unicode_str.encode('utf-8')
Unique random
import random

def unique_random_unicode(length):
    # Create a list of unique random code points.
    random_ints = random.sample(xrange(0xD7FF), length)
    # For each random int, generate the corresponding Unicode character,
    # then join the list into a single string.
    random_unicodes = [unichr(x) for x in random_ints]
    return u"".join(random_unicodes)

my_random_unicode_str = unique_random_unicode(length=512)
my_random_utf_8_str = my_random_unicode_str.encode('utf-8')
UTF-8 only allows certain bit patterns. You appear to be using UTF-8 in your code, so you will need to conform to the allowed UTF-8 patterns.
1 byte: 0b0xxxxxxx
2 byte: 0b110xxxxx 0b10xxxxxx
3 byte: 0b1110xxxx 0b10xxxxxx 0b10xxxxxx
4 byte: 0b11110xxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx
In the multi-byte patterns, the first byte indicates the number of bytes in the whole pattern with leading 1s followed by 0 and data bits x. The non-leading bytes all follow the same pattern: 0b10xxxxxx with two leading indicator bits 10 and six data bits xxxxxx.
In general, randomly generated bytes will not follow these patterns. You can only generate the data bits x randomly.
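On Python 3 the same approach needs only the built-in chr (unichr and xrange are gone); a minimal sketch, with a demonstration that arbitrary random bytes will generally violate the bit patterns above:

```python
import random

def random_unicode(length):
    # Code points below 0xD7FF stay in the BMP and avoid the surrogate
    # range (0xD800-0xDFFF), so every character encodes cleanly to
    # at most 3 UTF-8 bytes.
    return "".join(chr(random.randrange(0xD7FF)) for _ in range(length))

s = random_unicode(512)
data = s.encode("utf-8")            # always succeeds
assert data.decode("utf-8") == s    # and round-trips

# By contrast, raw random bytes almost never form valid UTF-8.
try:
    bytes(random.randrange(256) for _ in range(512)).decode("utf-8")
    print("decoded (very unlikely)")
except UnicodeDecodeError as e:
    print("invalid UTF-8:", e.reason)
```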

Remove Crazy Characters in R/Shiny

I have a long list in the format like such:
group1 » group2 » group3
Within R, I can use a gsub('»', '-', x) where x is the vector structured like above.
However, I am running into errors when trying to utilize this functionality when loading this into a shiny app. I've tried multiple ways to use gsub, chartr, and some other ones.
Also, the Â character is not captured when using [[:punct:]].
Any suggestions?
group1 » group2 » group3 is a UTF-8 encoded string, so it would be best if the R application were coded to read the strings with conversion from UTF-8 to Latin-1, as explained in Read or Set the Declared Encodings for a Character Vector and Read text as UTF-8 encoding.
» is the UTF-8 encoded right-pointing double angle quotation mark, whereby the 2 bytes with the hexadecimal values C2 BB are interpreted and displayed (wrongly) under code page Windows-1252 or ISO 8859-1 (Latin-1).
gsub("\\xC2?\\xBB", "-", x) can be used to find all right-pointing guillemets in a UTF-8 encoded string or a single-byte encoded string (Latin-1 or Windows-1252) and replace each of them with a hyphen character.
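The byte-level story behind the mojibake can be seen in a couple of lines (sketched in Python, since the fix itself is the R gsub call above):

```python
# '»' (U+00BB) encodes to two bytes in UTF-8 ...
assert "\u00bb".encode("utf-8") == b"\xc2\xbb"

# ... and reading those two bytes back as Windows-1252 yields the
# mojibake pair 'Â»' seen in misdecoded strings: 0xC2 -> 'Â', 0xBB -> '»'.
print(b"\xc2\xbb".decode("cp1252"))  # Â»
```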

Simulate a comma / dot keypress

Using SendInput from the Windows API (header windows.h), I can simulate key presses in C++:
void keyDownZ()
{
    keyboardInput.ki.wVk = 0x05A;
    keyboardInput.ki.dwFlags = KEYDOWN;
    SendInput(1, &keyboardInput, sizeof(INPUT));
}
But I can't find anywhere how to simulate a press of the comma key or the dot key. What is the hexadecimal code for those keys?
I mean, according to http://msdn.microsoft.com/en-us/library/windows/desktop/dd375731%28v=vs.85%29.aspx
VK_OEM_COMMA ( 0xBC )
Virtual-Key Codes The following table shows the symbolic constant
names, hexadecimal values, and mouse or keyboard equivalents for the
virtual-key codes used by the system. The codes are listed in numeric
order.
Try with these:
HTML Entity (decimal) &#44;
HTML Entity (hex) &#x2C;
UTF-8 (hex) 0x2C (2c)
UTF-8 (binary) 00101100
UTF-16 (hex) 0x002C (002c)
UTF-16 (decimal) 44
UTF-32 (hex) 0x0000002C (2c)
UTF-32 (decimal) 44
C/C++/Java source code "\u002C"
Python source code u"\u002C"
I think this is what you need:
UTF-8 (hex) 0x2C (2c)
There are 2 values:
VK_OEM_COMMA 0xBC
VK_OEM_PERIOD 0xBE
According to this thread, you can also test VK_DELETE.