I'm using libconfig to create a configuration file, and one of the fields is the content of an encrypted file. The problem is that the content contains some escape characters, which causes only part of it to be stored. What is the best way to store this data so that accidental escape characters are avoided? Convert it to Unicode?
Any suggestion?
You can use either URL encoding, where each unsafe byte is encoded as a % character followed by two hex digits, or you can use base64 encoding, where each group of 3 bytes is encoded as 4 ASCII characters (3x8 bits -> 4x6 bits).
For example, if you have the following bytes:
00 01 41 31 80 FE
You can URL encode it as follows:
%00%01A1%80%FE
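To make the percent-encoding option concrete, here is a minimal sketch in C++ (illustrative only, not tied to libconfig): it passes unreserved ASCII characters through unchanged and emits %XX for every other byte, reproducing the output above for those six bytes.
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <string>

// Percent-encode a byte buffer: unreserved ASCII characters pass through,
// everything else becomes %XX (two uppercase hex digits).
std::string url_encode(const unsigned char *data, size_t len) {
    static const char *hex = "0123456789ABCDEF";
    std::string out;
    for (size_t i = 0; i < len; ++i) {
        unsigned char c = data[i];
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

int main() {
    const unsigned char bytes[] = {0x00, 0x01, 0x41, 0x31, 0x80, 0xFE};
    std::printf("%s\n", url_encode(bytes, sizeof bytes).c_str());  // %00%01A1%80%FE
}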
Or you can base64 encode it like this, with 0-25 = A-Z, 26-51 = a-z, 52-61 = 0-9, 62 = ., 63 = /:
(00000000 00000001 01000001) (00110001 10000000 11111110) -->
(000000 000000 000101 000001) (001100 011000 000011 111110)
AAFBMYD.
The standard for encoding binary data in text used to be uuencode and is now base64. Both use the same paradigm: a byte is 8 bits, so 3 bytes make 24 bits, or four 6-bit characters.
uuencode simply took each 6-bit value and added an offset of 32 (the ASCII code for space), so characters fall in the range 32-96: all in the printable ASCII range, but including space and possibly other characters that could have special meanings.
base64 chose these 64 characters to represent the values 0 to 63 (none of =:;,'"\*(){}[] that could have special meaning...):
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
with the equal sign (=) serving as a placeholder for empty positions at the end of an encoded string, ensuring that the encoded length is a multiple of 4.
Unfortunately, neither the C nor the C++ standard library offers functions for uuencode or base64 conversion, but you can find nice implementations around, with many pointers in this other SO answer: How do I base64 encode (decode) in C?
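As a sketch of what such an implementation looks like (illustrative, not production code), here is a minimal base64 encoder in C++ using the standard alphabet and = padding; fed the six example bytes from above it prints AAFBMYD+ (the final + would be . in the alternative alphabet mentioned earlier).
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Minimal base64 encoder (standard RFC 4648 alphabet, '=' padding).
std::string base64_encode(const std::vector<unsigned char> &in) {
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    size_t i = 0;
    for (; i + 2 < in.size(); i += 3) {               // full 3-byte groups
        unsigned v = (in[i] << 16) | (in[i + 1] << 8) | in[i + 2];
        out += tbl[(v >> 18) & 63];
        out += tbl[(v >> 12) & 63];
        out += tbl[(v >> 6) & 63];
        out += tbl[v & 63];
    }
    size_t rem = in.size() - i;                       // 0, 1 or 2 leftover bytes
    if (rem == 1) {
        unsigned v = in[i] << 16;
        out += tbl[(v >> 18) & 63];
        out += tbl[(v >> 12) & 63];
        out += "==";
    } else if (rem == 2) {
        unsigned v = (in[i] << 16) | (in[i + 1] << 8);
        out += tbl[(v >> 18) & 63];
        out += tbl[(v >> 12) & 63];
        out += tbl[(v >> 6) & 63];
        out += '=';
    }
    return out;
}

int main() {
    std::vector<unsigned char> bytes{0x00, 0x01, 0x41, 0x31, 0x80, 0xFE};
    std::printf("%s\n", base64_encode(bytes).c_str());   // AAFBMYD+
}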
I have a source of data that was converted from an Oracle database and loaded into a Hadoop storage point. One of the columns was a BLOB and therefore had lots of control characters and unreadable/undetectable ASCII characters outside of the available codeset. I am using Impala to write a regex replace function to parse out some of the unicode characters that the regex library cannot understand. I would like to remove the offending 2-character hex codes BEFORE I use the unhex query function so that I can do the rest of the regex parsing with a "clean" string.
Here's the code I've used so far, which doesn't quite work:
'[2-7]{1}([A-Fa-f]|[0-9]{1})'
I've determined that I only need to capture \u0020-\u007f - or, represented as two-character hex codes, 20-7f.
If my string looks like this:
010A000000153020405C00000000143020405CBC000000F53320405C4C010000E12F204058540100002D01
I would like to be able to capture 2 characters at a time (e.g. 01, 0A, 00), evaluate whether or not that pair fits the acceptable hex range mentioned above, and return only what is acceptable.
The correct output should be:
30 20 40 5C 30 20 40 5C 33 20 40 5C 4C 2F 20 40 58 and 54
However, my expression finds the first acceptable digit for my first range (the 5) and starts the capture from there, which throws the position/indexing off for the rest of the string... and this is the return from my expression -
010A0000001**53**0**20****40****5C**000000001**43**0**20****40****5C**BC000000F**53****32**0**40****5C****4C**010000E1**2F****20****40****58****54**010000**2D**01
I just don't know how to evaluate only two characters at a time in a mixed-length string and, if they don't fit the expression, iterate to the next two characters - but only in two-character increments.
My example: https://regex101.com/r/BZL7t0/1
I have added a positive lookbehind to it, which anchors at the beginning of the string and then matches 2 characters at a time. This ensures that the pair you're matching always has whole groups of 2 characters before it.
Positive lookbehind:
(?<=^(..)*)
Updated regex:
(?<=^(..)*)([2-7]{1}[A-Fa-f0-9]{1})
Preview:
Regex101
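If you ever need the same filtering outside a regex engine that supports variable-length lookbehind, the "two characters at a time" idea can also be written as a plain loop. Here is an illustrative C++ sketch (not the Impala solution above) that walks the hex string in steps of two and keeps only the pairs whose value falls in 0x20-0x7F:
#include <cstdio>
#include <cstdlib>
#include <string>

int main() {
    // The example string from the question.
    std::string s = "010A000000153020405C00000000143020405C"
                    "BC000000F53320405C4C010000E12F204058540100002D01";
    for (size_t i = 0; i + 1 < s.size(); i += 2) {        // step two hex digits at a time
        std::string pair = s.substr(i, 2);
        long value = std::strtol(pair.c_str(), nullptr, 16);
        if (value >= 0x20 && value <= 0x7F)               // keep only the printable ASCII range
            std::printf("%s ", pair.c_str());
    }
    std::printf("\n");
}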
I am testing a REST based service and one of the inputs is a text string. So I am sending it random unicode strings from my python code. So far the unicode strings that I sent were in the ascii range, so everything worked.
Now I am attempting to send characters beyond the ascii range and I am getting an encoding error. Here is my code. I have been through this link and still unable to wrap my head around it.
# coding=utf-8
import os, random, string
import json
junk_len = 512
junk = (("%%0%dX" % junk_len) % random.getrandbits(junk_len * 8))
for i in xrange(1,5):
    if(len(junk) % 8 == 0):
        print u'decoding to hex'
        message = junk.decode("hex")
        print 'Hex chars %s' %message
        print u' '.join(message.encode("utf-8").strip())
The first line prints without any issues, but I can't send it to the REST service without encoding it. Hence the second line where I am attempting to encode it to utf-8. This is the line of code that fails with the following message.
UnicodeDecodeError: 'ascii' codec can't decode byte 0x81 in position
7: ordinal not in range(128)
As others have said, it's very difficult to make valid random UTF-8 bytes as the byte sequences have to be correct.
As Unicode maps all characters to a number between 0x0000 and 0x10FFFF, all one needs to do is randomly generate a number in that range to get a valid code point. Passing the random number to unichr (or chr on Py3) will return a Unicode string of the character at that random code point.
Then all you need to do is ask Python to encode to UTF-8 to create a valid UTF-8 sequence.
Because there are many gaps and unprintable characters (due to font limitations) in the full Unicode range, using the range 0000-D7FF will return characters from the Basic Multilingual Plane, which are more likely to be printable on your system. When encoded to UTF-8, this results in sequences of up to 3 bytes per character.
Plain Random
import random
def random_unicode(length):
    # Create a list of unicode characters within the range 0000-D7FF
    random_unicodes = [unichr(random.randrange(0xD7FF)) for _ in xrange(0, length)]
    return u"".join(random_unicodes)
my_random_unicode_str = random_unicode(length=512)
my_random_utf_8_str = my_random_unicode_str.encode('utf-8')
Unique random
import random
def unique_random_unicode(length):
    # create a list of unique randoms.
    random_ints = random.sample(xrange(0xD7FF), length)
    ## convert ints into Unicode characters
    # for each random int, generate a list of Unicode characters
    random_unicodes = [unichr(x) for x in random_ints]
    # join the list
    return u"".join(random_unicodes)
my_random_unicode_str = unique_random_unicode(length=512)
my_random_utf_8_str = my_random_unicode_str.encode('utf-8')
UTF-8 only allows certain bit patterns. You appear to be using UTF-8 in your code, so you will need to conform to the allowed UTF-8 patterns.
1 byte: 0b0xxxxxxx
2 byte: 0b110xxxxx 0b10xxxxxx
3 byte: 0b1110xxxx 0b10xxxxxx 0b10xxxxxx
4 byte: 0b11110xxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx
In the multi-byte patterns, the first byte indicates the number of bytes in the whole pattern with leading 1s followed by 0 and data bits x. The non-leading bytes all follow the same pattern: 0b10xxxxxx with two leading indicator bits 10 and six data bits xxxxxx.
In general, randomly generated bytes will not follow these patterns. You can only generate the data bits x randomly.
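To make those bit patterns concrete, here is an illustrative sketch (written in C++, since it is pure bit manipulation) that encodes a single code point into its 1-4 byte UTF-8 sequence; only the x data bits come from the code point, while the leading indicator bits are fixed.
#include <cstdio>
#include <string>

// Encode one Unicode code point (0x0000..0x10FFFF) as a UTF-8 byte sequence,
// following the bit patterns shown above. Surrogate handling is omitted.
std::string utf8_encode(unsigned cp) {
    std::string out;
    if (cp < 0x80) {                         // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                 // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {               // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                 // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

int main() {
    for (unsigned char b : utf8_encode(0x20AC))   // U+20AC (euro sign) -> E2 82 AC
        std::printf("%02X ", b);
    std::printf("\n");
}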
Is there a way to decrement a character value alphabetically in C++?
For example, changing a variable containing
'b' to the value 'a' or a variable containing
'd' to the value 'c' ?
I tried looking at character sequence but couldn't find anything useful.
Characters are essentially one-byte integers (although the representation may vary between compilers). While there are many encodings that map integer values to characters, almost all of them map the characters 'a' to 'z' in successive numerical order. So, if you wanted to change the string "aaab" to "aaaa", you could do something like the following:
char letters [4] = {'a','a','a','b'};
letters[3]--;
Alphabet characters are part of the ASCII character table. 65 is the uppercase letter A, and 32 positions later, at 97, is the lowercase letter a. (Letters B through Z and b through z are 66 through 90 and 98 through 122, respectively.) The original designers made the two cases 32 apart in the ASCII chart, rather than 26 (the number of letters in the alphabet), because bit manipulation can then easily switch from lowercase to uppercase (and vice versa), or ignore case altogether, by manipulating the bit with value 32 (0010 0000).
This way, for example, code 84 on the ASCII chart, which represents the letter T, has the bit pattern 0101 0100. Lowercase t is 116, which is 0111 0100. When ignoring case, the bit with value 32 (6th position from the right) is ignored; you can see all the other bits are exactly the same for uppercase and lowercase. This makes things more convenient for everyone and more efficient for the computer.
To decrement, just treat the character as its ASCII value, subtract 1, and treat the resulting integer as a character again. Be careful when you have an 'A' (or 'a'), though: decrementing those takes you out of the alphabet ('@' and '`' respectively), so that is a special case you need to decide how to handle.
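A small sketch of that idea in C++ (the wrap-around from 'a' back to 'z' is just an assumption here, since the question doesn't say what should happen at the start of the alphabet):
#include <cstdio>

// Decrement a letter alphabetically; 'a' wraps to 'z' and 'A' wraps to 'Z'
// (that wrap-around rule is only an assumption for this sketch).
char prev_letter(char c) {
    if (c == 'a') return 'z';
    if (c == 'A') return 'Z';
    if ((c > 'a' && c <= 'z') || (c > 'A' && c <= 'Z'))
        return c - 1;      // works because a-z and A-Z are contiguous in ASCII
    return c;              // not a letter: leave unchanged
}

int main() {
    std::printf("%c %c %c\n", prev_letter('b'), prev_letter('d'), prev_letter('a'));  // a c z
}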
When using Base64 decode, why doesn't it remove the extra padded bytes that were added during Base64 encoding?
Consider the scenario where I give data of size 50 (not a multiple of 3) to the encode function; this returns encoded data of size 68.
When I decode that encoded data (input of 68 bytes), the decode function returns 51 bytes of data, where I was expecting the extra padded byte to be removed (i.e. the original 50 bytes).
How should base64 encode/decode be handled properly when the data size is not a multiple of 3?
I have used an open source Base64 encode/decode library which is compliant with RFC 4648.
Base64 encoding uses a special marker at the end to indicate that padding was added.
It always generates a multiple of four output characters, each corresponding to three octets of input data except possibly the last one.
For that last one, if there are only two octets left, it encodes those into three characters (each taking six bits, so eighteen bits: sixteen bits of real data and two bits of junk), then adds a single padding = character to give four characters.
If there is only one octet left, it encodes that into two characters (twelve bits: eight bits of real data and four bits of junk), then adds == as padding to give four characters.
Hence, during decoding, it's the number of = characters at the end that tells you how to handle the last section so as to end up with exactly the same data you encoded.
In other words, the input data AAAA (each A holding bits abcdef) gives:
decoding input: abcdef abcdef abcdef abcdef
|
V
output: abcdefab cdefabcd efabcdef
For a slightly short block AAA= (irrelevant bits being + and padding bits being =):
decoding input: abcdef abcdef abcd++ ======
|
V
output: abcdefab cdefabcd
And a very short block AA==:
decoding input: abcdef ab++++ ====== ======
|
V
output: abcdefab
So here '=' is used as the padding character for base64 encoding.
By determining the number of '=' characters at the end of the encoded data, I can figure out whether the input data was a multiple of 3 or not.
In other words, if there is a single '=' character then two octets were left over, and if there is '==' then a single octet was left in the last group of three.
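Concretely, a decoder should use that trailing '=' count to trim the output: every four encoded characters yield three octets, minus one octet per padding character. That is exactly where the stray 51st byte in the question comes from if padding is ignored. An illustrative C++ sketch of the size calculation:
#include <cstddef>
#include <cstdio>
#include <string>

// Expected decoded size of a padded base64 string: every 4 characters yield
// 3 octets, minus one octet per trailing '=' character.
size_t base64_decoded_size(const std::string &enc) {
    if (enc.empty() || enc.size() % 4 != 0) return 0;   // not a valid padded encoding
    size_t pad = 0;
    if (enc[enc.size() - 1] == '=') ++pad;
    if (enc.size() > 1 && enc[enc.size() - 2] == '=') ++pad;
    return (enc.size() / 4) * 3 - pad;
}

int main() {
    // 50 input bytes -> 68 encoded characters ending in a single '=',
    // so the decoder should hand back 17 * 3 - 1 = 50 bytes, not 51.
    std::string enc(67, 'A');
    enc += '=';
    std::printf("%zu\n", base64_decoded_size(enc));     // 50
}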
Mads Kristensen got one down to 00amyWGct0y_ze4lIsj2Mw
Can it go smaller than that?
Looks like there are only 73 characters that can be used unescaped in a URL. IF that's the case, you could convert the 128-bit number to base 73, and have a 21 character URL.
IF you can find 85 legal characters, you can get down to a 20 character URL.
A GUID looks like this c9a646d3-9c61-4cb7-bfcd-ee2522c8f633 - that's 32 hex digits, each encoding 4 bits, so 128 bits in total
A base64 encoding uses 6 bits per symbol, which is easy to achieve with URL-safe chars, giving a 22-character encoded string. As others have noted, you could go with 73 URL-safe symbols and encode the GUID as a base-73 number to get down to 21 chars.
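The 22-character form appears to be what Mads Kristensen's 00amyWGct0y_ze4lIsj2Mw is: the 16 GUID bytes base64-encoded with the URL-safe alphabet (- and _ instead of + and /) and the trailing == padding dropped. A minimal illustrative sketch of that idea in C++ (byte-order details of any particular GUID library are ignored here):
#include <cstdio>
#include <string>

// Encode 16 GUID bytes with the URL-safe base64 alphabet ('-' and '_')
// and drop the trailing "==" padding, giving a 22-character string.
std::string short_guid(const unsigned char (&g)[16]) {
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
    std::string out;
    for (int i = 0; i < 15; i += 3) {                 // five full 3-byte groups
        unsigned v = (g[i] << 16) | (g[i + 1] << 8) | g[i + 2];
        out += tbl[(v >> 18) & 63];
        out += tbl[(v >> 12) & 63];
        out += tbl[(v >> 6) & 63];
        out += tbl[v & 63];
    }
    unsigned v = g[15] << 16;                         // one byte left over
    out += tbl[(v >> 18) & 63];
    out += tbl[(v >> 12) & 63];
    return out;                                       // 5*4 + 2 = 22 characters
}

int main() {
    const unsigned char guid[16] = {0xc9, 0xa6, 0x46, 0xd3, 0x9c, 0x61, 0x4c, 0xb7,
                                    0xbf, 0xcd, 0xee, 0x25, 0x22, 0xc8, 0xf6, 0x33};
    std::printf("%s\n", short_guid(guid).c_str());    // 22-char URL-safe string
}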