Calculate the size of a Base64-decoded message - C++

I have a Base64-encoded string:
static const unsigned char base64_test_enc[] =
"VGVzdCBzdHJpbmcgZm9yIGEgc3RhY2tvdmVyZmxvdy5jb20gcXVlc3Rpb24=";
It does not contain CRLF line breaks every 72 characters.
How do I calculate the length of the decoded message?

Well, base64 represents 3 bytes in 4 characters... so to start with you just need to divide by 4 and multiply by 3.
You then need to account for padding:
If the text ends with "==" you need to subtract 2 bytes (as the last group of 4 characters only represents 1 byte)
If the text ends with just "=" you need to subtract 1 byte (as the last group of 4 characters represents 2 bytes)
If the text doesn't end with padding at all, you don't need to subtract anything (as the last group of 4 characters represents 3 bytes as normal)
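Putting those rules together, a minimal sketch (mine, not from the answer) for a padded, newline-free input could look like this:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper, not from the original answer: decoded size of a
// padded Base64 string with no embedded line breaks.
size_t base64_decoded_size(const char *b64) {
    size_t len = std::strlen(b64);
    if (len == 0 || len % 4 != 0) return 0;   // padded Base64 is a multiple of 4
    size_t pad = 0;
    if (b64[len - 1] == '=') pad++;           // "="  -> last group holds 2 bytes
    if (b64[len - 2] == '=') pad++;           // "==" -> last group holds 1 byte
    return (len / 4) * 3 - pad;
}
```

For the question's string ("Test string for a stackoverflow.com question", 44 bytes) this returns 44.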

Base64 uses 4 characters per 3 bytes. If it uses padding, the encoded length is always a multiple of 4 characters.
Furthermore, there are three padding possibilities:
two data characters and two padding characters (==) for one encoded byte
three data characters and one padding character (=) for two encoded bytes
and of course no padding characters, making 3 bytes.
So you can simply divide the number of characters by 4, then multiply by 3 and finally subtract the number of padding characters.
Possible C code could be (if I wasn't extremely rusty in C, please adjust):
#include <string.h> /* strlen, strchr */

size_t encoded_base64_bytes(const char *input)
{
    size_t len, padlen;
    const char *last, *first_pad;

    len = strlen(input);
    if (len < 4 || len % 4 != 0) return 0;
    last = input + len - 4;          /* start of the final 4-character group */
    first_pad = strchr(last, '=');   /* first padding character, if any */
    padlen = (first_pad == NULL) ? 0 : (size_t)(input + len - first_pad);
    return (len / 4) * 3 - padlen;
}
Note that this code assumes that the input is valid Base64. A good observer will notice that there are spare bits, usually set to 0, in the final characters when padding is used.

Related

Why does char occupy 7 bits when the length is 1 byte ie 8 bits?

I've seen that the program below takes only 7 bits of memory to store a character, but everywhere I've studied it says that a char occupies 1 byte of memory, i.e. 8 bits.
Does a single character require 8 bits or 7 bits?
If it requires 8 bits, what will be stored in the other bit?
#include <iostream>
using namespace std;

int main()
{
    char ch = 'a';
    int val = ch;
    while (val > 0)
    {
        (val % 2) ? cout << 1 << " " : cout << 0 << " ";
        val /= 2;
    }
    return 0;
}
Output:
1 0 0 0 0 1 1
The below code shows the memory gap between the character, i.e. is 7 bits:
9e9 <-> 9f0 <->......<-> a13
#include <iostream>
using namespace std;

int main()
{
    char arr[] = {'k','r','i','s','h','n','a'};
    for (int i = 0; i < 7; i++)
        cout << &arr + i << endl;
    return 0;
}
Output:
0x7fff999019e9
0x7fff999019f0
0x7fff999019f7
0x7fff999019fe
0x7fff99901a05
0x7fff99901a0c
0x7fff99901a13
Your first code sample doesn't print leading zero bits; since ASCII characters all have the top bit set to zero, you'll only get at most seven bits printed when using ASCII characters. Extended ASCII character sets and UTF-8 use the top bit for characters outside the basic ASCII character set.
Your second example is actually printing addresses seven bytes apart, making it look like each character is seven bytes long, which is obviously incorrect. If you change the size of the array so that it isn't seven characters long, you'll see different results.
&arr + i is equivalent to (&arr) + i: as &arr is a pointer to char[7], which has a size of 7, the + i adds 7 * i bytes to the pointer. (&arr) + 1 points one byte past the end of the array; if you try printing the values these pointers point to, **(&arr + i), you'll get junk or a crash.
Your code should use static_cast<void*>(&arr[i]); you'll then see the pointer go up by one on each iteration. The cast to void* is necessary to stop the standard library from trying to print the pointer as a null-terminated string.
It has nothing to do with the space assigned to a char. You are simply converting the ASCII representation of the char into binary.
ASCII is a 7-bit character set, normally stored in C in an 8-bit char. If the highest bit of an 8-bit byte is set, the value is not an ASCII character. The eighth bit was historically used for parity when communicating information between computers using different encodings.
ASCII stands for American Standard Code for Information Interchange, with the emphasis on American. The character set could not represent accented letters (things with umlauts, for example) or non-Latin scripts such as Arabic.
To "extend" ASCII, the extra 128 values that became available by using all 8 bits were mapped differently by different vendors, which caused problems. Eventually Unicode came along, which can represent every character, but the 8-bit char remained the standard unit.

General algorithm for reading n bits and padding with zeros

I need a function to read n bits starting from bit x (the bit index starts from zero) and, if the result is not byte-aligned, pad it with zeros. The function will receive a uint8_t array as input and should return a uint8_t array as well. For example, I have a file with the following contents:
1011 0011 0110 0000
Read three bits from the third bit(x=2,n=3); Result:
1100 0000
There's no (theoretical) limit on input and bit pattern lengths
Implementing such a bitfield extraction efficiently, beyond the direct bit-serial algorithm, isn't precisely hard but a tad cumbersome.
Effectively it boils down to an inner loop reading a pair of bytes from the input for each output byte, shifting the resulting word into place based on the source bit offset, and writing back the upper or lower byte. In addition, the final output byte is masked based on the length.
Below is my (poorly-tested) attempt at an implementation:
#include <limits.h> /* CHAR_BIT, UCHAR_MAX */
#include <stddef.h> /* size_t */

void extract_bitfield(unsigned char *dstptr, const unsigned char *srcptr,
                      size_t bitpos, size_t bitlen) {
    // Skip to the source byte covering the first bit of the range
    srcptr += bitpos / CHAR_BIT;
    // Similarly work out the expected, inclusive, final output byte
    unsigned char *endptr = &dstptr[bitlen / CHAR_BIT];
    // Truncate the bit-positions to offsets within a byte
    bitpos %= CHAR_BIT;
    bitlen %= CHAR_BIT;
    // Scan through and write out a correctly shifted version of every
    // destination byte via an intermediate shifter register
    unsigned long accum = *srcptr++;
    while (dstptr <= endptr) {
        accum = accum << CHAR_BIT | *srcptr++;
        *dstptr++ = accum << bitpos >> CHAR_BIT;
    }
    // Mask out the unwanted LSB bits not covered by the length
    *endptr &= ~(UCHAR_MAX >> bitlen);
}
Beware that the code above may read past the end of the source buffer and somewhat messy special handling is required if you can't set up the overhead to allow this. It also assumes sizeof(long) != 1.
Of course, to get efficiency out of this you will want to use as wide a native word as possible. However, if the target buffer isn't necessarily word-aligned then things get even messier. Furthermore, little-endian systems will need byte-swizzling fix-ups.
Another subtlety to take heed of is the potential inability to shift a whole word at once; shift counts are frequently interpreted modulo the word length.
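As a small illustration of that last point (my addition, not part of the answer above), guarding the shift count sidesteps the undefined behavior:

```cpp
#include <cstdint>

// Shifting a value by >= its bit width is undefined behavior in C and C++;
// many CPUs (e.g. x86) take the shift count modulo the width, so 1u << 32
// may surprisingly yield 1 rather than 0. A guarded shift avoids this:
uint32_t shl32(uint32_t v, unsigned n) {
    return n >= 32 ? 0 : v << n;
}
```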
Anyway, happy bit-hacking!
Basically it's still a bunch of shift and addition operations.
I'll use a slightly larger example to demonstrate this.
Suppose we are given an input of 4 bytes, and x = 10, n = 18.
00101011 10001001 10101110 01011100
First we need to locate the byte containing our first bit, by x / 8, which gives us 1 (the second byte) in this case. We also need the offset within that byte, x % 8, which equals 2.
Now we can get the first byte of the solution in three operations.
Left shift the second byte 10001001 by 2 bits, which gives us 00100100.
Right shift the third byte 10101110 by 6 (which comes from 8 - 2) bits, which gives us 00000010.
ORing these two together gives us the first byte of the result: 00100110.
Loop this routine for n / 8 rounds. If n % 8 is not 0, extract that many bits from the next byte; you can do this in many ways.
So in this example, our second round gives us 10111001, and in the last step we get 01, then pad the remaining bits with 0s.
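The routine described above can be sketched as follows (my own rough translation of the steps, assuming MSB-first bit numbering and that the caller guarantees x + n does not exceed the input):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Read n bits starting at bit x (bit 0 = MSB of the first byte) and return
// them left-aligned and zero-padded. inlen guards the one-byte lookahead.
std::vector<uint8_t> read_bits(const uint8_t *in, size_t inlen,
                               size_t x, size_t n) {
    std::vector<uint8_t> out((n + 7) / 8, 0);
    size_t byte = x / 8;   // byte containing the first bit (x / 8)
    size_t off  = x % 8;   // offset of that bit within the byte (x % 8)
    for (size_t i = 0; i < out.size(); i++) {
        // current byte in the high half, next byte (if any) in the low half
        uint16_t w = uint16_t(in[byte + i]) << 8;
        if (byte + i + 1 < inlen)
            w |= in[byte + i + 1];
        out[i] = uint8_t(w >> (8 - off));   // the shift-and-OR in one step
    }
    if (n % 8)             // zero the padding bits of the final byte
        out.back() &= uint8_t(0xFF << (8 - n % 8));
    return out;
}
```

With the question's input (x = 2, n = 3 on 1011 0011 0110 0000) this yields 1100 0000, and with the 4-byte example above it yields 00100110 10111001 01000000.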

What does this mean? C program for RLE

I am new to C, so I do not understand what is happening in this line:
out[counter++] = recurring_count + '0';
What does + '0' mean?
Additionally, can you please help me by writing comments for most of the code? I don't understand it well, so I hope you can help me. Thank you.
#include "stdafx.h"
#include <iostream>

void encode(char mass[], char* out, int size)
{
    int counter = 0;
    int recurring_count = 0;
    for (int i = 0; i < size - 1; i++)
    {
        if (mass[i] != mass[i + 1])
        {
            recurring_count++;
            out[counter++] = mass[i];
            out[counter++] = recurring_count + '0';
            recurring_count = 0;
        }
        else
        {
            recurring_count++;
        }
    }
}
int main()
{
    char data[] = "yyyyyyttttt";
    int size = sizeof(data) / sizeof(data[0]);
    char * out = new char[size + 1]();
    encode(data, out, size);
    std::cout << out;
    delete[] out;
    std::cin.get();
    return 0;
}
It adds the character-encoding value of '0' to the value in recurring_count. If we assume ASCII-encoded characters, that means adding 48.
This is common practice for making a "readable" digit from an integer value in the range 0..9 - in other words, converting a single-digit number to its character representation. And as long as all the digit codes are "in sequence" (contiguous from '0' to '9'), it works for any encoding, not just ASCII - so a computer using EBCDIC encoding would get the same effect.
recurring_count + '0' is a simple way of converting the int recurring_count value into an ASCII character.
As you can see on Wikipedia, the ASCII character code of '0' is 48. Adding the value to that takes you to the corresponding character code for that value.
You see, computers may not really know about letters, digits, symbols; like the letter a, or the digit 1, or the symbol ?. All they know is zeroes and ones. True or not. To exist or not.
Here's one bit: 1
Here's another one: 0
These two are only things that a bit can be, existence or absence.
Yet computers can know about, say, 5. How? Well, 5 is 5 only in base 10; in base 4 it would be 11, and in base 2 it would be 101. You don't have to know about base 4, but let's examine the base 2 case, to make sure you know about that:
How would you represent 0 if you had only 0s and 1s? 0, right? You probably would also represent the 1 as 1. Then for 2? Well, you'd write 2 if you could, but you can't... So you write 10 instead.
This is exactly analogous to what you do while advancing from 9 to 10 in base 10. You cannot write 10 inside a single digit, so you rather reset the last digit to zero, and increase the next digit by one. Same thing while advancing from 19 to 20, you attempt to increase 9 by one, but you can't, because there is no single digit representation of 10 in base 10, so you rather reset that digit, and increase the next digit.
This is how you represent numbers with just 0s and 1s.
Now that you have numbers, how would you represent letters and symbols and character-digits, like the 4 and 3 inside the silly string L4M3 for example? You could map them; map them so, for example, that the number 1 would from then on represent the character A, and then 2 would represent B.
Of course, it would be a little problematic; because when you do that the number 1 would represent both the number 1 and the character A. This is exactly the reason why if you write...
printf( "%d %c", 65, 65 );
You will have the output "65 A", provided that the environment you're on uses ASCII encoding, because in ASCII 65 has been mapped to represent A when interpreted as a character. A full list can be found in any ASCII table.
In short
'A' with single quotes around it delivers the message: "hey, this A over here is to be replaced with whatever the representative integer value of A is", and in most environments that will just be 65. The same goes for '0', which evaluates to 48 under ASCII encoding.

Compress 21 Alphanumeric Characters in to 16 Bytes

I'm trying to take 21 bytes of data which uniquely identifies a trade and store it in a 16 byte char array. I'm having trouble coming up with the right algorithm for this.
The trade ID which I'm trying to compress consists of 2 fields:
18 alphanumeric characters, consisting of the ASCII characters 0x20 to 0x7E, inclusive (32-126)
A 3-character numeric string, "000" to "999"
So a C++ class that would encompass this data looks like this:
class ID
{
public:
    char trade_num_[18];
    char broker_[3];
};
This data needs to be stored in a 16-char data structure, which looks like this:
class Compressed
{
public:
    char sku_[16];
};
I tried to take advantage of the fact that since the characters in trade_num_ are only 0-127 there was 1 unused bit in each character. Similarly, 999 in binary is 1111100111, which is only 10 bits -- 6 bits short of a 2-byte word. But when I work out how much I can squeeze this down, the smallest I can make it is 17 bytes; one byte too big.
Any ideas?
By the way, trade_num_ is a misnomer. It can contain letters and other characters. That's what the spec says.
EDIT: Sorry for the confusion. The trade_num_ field is indeed 18 bytes and not 16. After I posted this thread my internet connection died and I could not get back to this thread until just now.
EDIT2: I think it is safe to make an assumption about the dataset. For the trade_num_ field, we can assume that the non-printable ASCII characters 0-31 will not be present. Nor will ASCII codes 127 or 126 (~). All the others might be present, including upper and lower case letters, numbers and punctuations. This leaves a total of 94 characters in the set that trade_num_ will be comprised of, ASCII codes 32 through 125, inclusive.
If you have 18 characters in the range 0 - 127 and a number in the range 0 - 999 and compact this as much as possible then it will require 17 bytes.
>>> math.log(128**18 * 1000, 256)
16.995723035582763
You may be able to take advantage of the fact that some characters are most likely not used. In particular it is unlikely that there are any characters below value 32, and 127 is also probably unused. If you can find one more unused character, you can first convert the characters into base 94 and then pack them into the bytes as closely as possible.
>>> math.log(94**18 * 1000, 256)
15.993547951857446
This just fits into 16 bytes!
Example code
Here is some example code written in Python (but written in a very imperative style so that it can easily be understood by non-Python programmers). I'm assuming that there are no tildes (~) in the input. If there are you should substitute them with another character before encoding the string.
def encodeChar(c):
    return ord(c) - 32

def encode(s, n):
    t = 0
    for c in s:
        t = t * 94 + encodeChar(c)
    t = t * 1000 + n
    r = []
    for i in range(16):
        r.append(t % 256)
        t //= 256
    return r

print(encode(' ' * 18, 0))  # smallest possible value
print(encode('abcdefghijklmnopqr', 123))
print(encode('}}}}}}}}}}}}}}}}}}', 999))  # largest possible value
Output:
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[ 59, 118, 192, 166, 108, 50, 131, 135, 174, 93, 87, 215, 177, 56, 170, 172]
[255, 255, 159, 243, 182, 100, 36, 102, 214, 109, 171, 77, 211, 183, 0, 247]
This algorithm uses Python's ability to handle very large numbers. To convert this code to C++ you could use a big integer library.
You will of course need an equivalent decoding function, the principle is the same - the operations are performed in reverse order.
That makes (18*7 + 10) = 136 bits, or 17 bytes. You wrote that trade_num is alphanumeric? If that means the usual [a-zA-Z0-9_] set of characters, then you'd have only 6 bits per character, needing (18*6 + 10) = 118 bits = 15 bytes for the whole thing.
(Assuming 8 bits = 1 byte.)
Or, coming from another direction: you have 128 bits of storage and need ~10 bits for the number part, so there are 118 left for trade_num. 18 characters means 118/18 = 6.555 bits per character, which means you only have the space to encode 2^6.555 ≈ 94 different characters - unless there is a hidden structure in trade_num that we could exploit to save more bits.
This is something that should work, assuming you only need characters from allowedchars, and there are at most 94 characters in it. This is Python, but it is written trying not to use fancy shortcuts, so that you'll be able to translate it to your destination language more easily. It assumes that the number variable may contain integers up to 2**128 - in C++ you would use some kind of big-number class.
allowedchars = ' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}'
alphabase = len(allowedchars)

def compress(code):
    alphanumeric = code[0:18]
    number = int(code[18:21])
    for character in alphanumeric:
        # find returns the index of the character in the allowedchars string
        number = alphabase * number + allowedchars.find(character)
    compressed = ''
    for i in range(16):
        compressed += chr(number % 256)
        number //= 256
    return compressed

def decompress(compressed):
    number = 0
    for byte in reversed(compressed):
        number = 256 * number + ord(byte)
    alphanumeric = ''
    for i in range(18):
        alphanumeric = allowedchars[number % alphabase] + alphanumeric
        number //= alphabase
    # make a string padded with zeros
    number = '%03d' % number
    return alphanumeric + number
You can do this in ~15 bytes (14 bytes and 6 bits). You need to save 5 bytes to get from 21 down to 16.
For each character of trade_num_ you can save 1 bit by storing its ASCII value in 7 bits. That frees 2 bytes and 2 bits, but you need 5 bytes.
Next, the number: each of its characters is one of ten values (0 to 9), so 4 bits per digit suffice; the 3-digit number then fits in 1 byte and 4 bits, saving half of it.
Now you have 3 bytes and 6 bits free, but you still need 5 bytes.
If you restrict the alphabet to qwertyuioplkjhgfdsazxcvbnmQWERTYUIOPLKJHGFDSAZXCVBNM1234567890[]
you can store each character in 6 bits. That frees another 2 bytes and 2 bits.
Now you have 6 bytes free, and your string fits in 15 bytes +
null termination = 16 bytes.
And if you store the number as a 10-bit integer instead of digits, you can fit the whole thing into 14 bytes and 6 bits.
There are 95 characters between the space (0x20) and the tilde (0x7E), inclusive. (The 94 in other answers suffers from an off-by-one error.)
Hence the number of distinct IDs is 95^18 × 1000 ≈ 3.97×10^38.
But that compressed structure can only hold (2^8)^16 = 2^128 ≈ 3.40×10^38 distinct values.
Therefore it is impossible to represent all IDs by that structure, unless:
There is 1 unused character in ≥15 digits of trade_num_, or
There are ≥14 unused characters in 1 digit of trade_num_, or
There are only ≤856 brokers, or
You're using a PDP-10, which has a 9-bit char.
Key questions are:
There appears to be some contradiction in your post over whether the trade number is 16 or 18 characters; you need to clear that up. You say the total is 21, consisting of 16+3. :-(
You say the trade-num characters are in the range 0x00-0x7F. Can they really be any character in that range, including tab, newline, Ctrl-C, etc.? Or are they limited to printable characters, or maybe even to alphanumerics?
Do the output 16 bytes have to be printable characters, or is it basically a binary number?
EDIT, after updates to original post:
In that case, if the output can be any character in the character set, it's possible. If it can only be printable characters, it's not.
Demonstration of the mathematical possibility is straightforward enough. There are 94 possible values for each of 18 characters, and 10 possible values for each of 3. Total number of possible combinations = 94 ^ 18 * 10 ^ 3 ~= 3.28e38. This requires 128 bits: 2 ^ 127 ~= 1.70e38, which is too small, while 2 ^ 128 ~= 3.40e38, which is big enough. 128 bits is 16 bytes, so it will just barely fit if we can use every possible bit combination.
Given the tight fit, I think the most practical way to generate the value is to think of it as a double-long number, and then run the input through an algorithm to generate a unique integer for every possible input.
Conceptually, then, let's imagine we had a "huge integer" data type that is 16 bytes long. The algorithm would be something like this:
huge out = 0;
for (int p = 0; p < 18; ++p)
{
    out = out * 94 + tradenum[p] - 32;
}
for (int p = 0; p < 3; ++p)
{
    out = out * 10 + broker[p] - '0';
}
// Convert output to char[16]
unsigned char out16[16];
for (int p = 15; p >= 0; --p)
{
    out16[p] = out & 0xff;
    out = out >> 8;
}
return out16;
Of course we don't have a "huge" data type in C. Are you using pure C or C++? Isn't there some kind of big-number class in C++? Sorry, I haven't done C++ in a while. If not, we could easily create a little library to implement one.
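On GCC and Clang you can stand in for the "huge" type with the unsigned __int128 extension, since 94^18 × 1000 just fits in 128 bits (this is my illustration, not part of the original answer):

```cpp
#include <cstring>

// Pack 18 chars in [0x20, 0x7D] plus a 3-digit broker number into 16 bytes,
// and unpack again, using unsigned __int128 as the big-number type
// (a GCC/Clang extension, not standard C++).
void pack(const char trade_num[18], const char broker[3],
          unsigned char out16[16]) {
    unsigned __int128 v = 0;
    for (int p = 0; p < 18; ++p)
        v = v * 94 + (trade_num[p] - 32);
    for (int p = 0; p < 3; ++p)
        v = v * 10 + (broker[p] - '0');
    for (int p = 15; p >= 0; --p) {   // big-endian: LSB lands in out16[15]
        out16[p] = (unsigned char)(v & 0xff);
        v >>= 8;
    }
}

void unpack(const unsigned char in16[16],
            char trade_num[18], char broker[3]) {
    unsigned __int128 v = 0;
    for (int p = 0; p < 16; ++p)
        v = (v << 8) | in16[p];
    for (int p = 2; p >= 0; --p) { broker[p] = (char)(v % 10 + '0'); v /= 10; }
    for (int p = 17; p >= 0; --p) { trade_num[p] = (char)(v % 94 + 32); v /= 94; }
}
```

A round trip through pack and unpack reproduces the original fields exactly.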
If it can only contain letters, then you have less than 64 possibilities per character (26 upper case, 26 lower case, leaving you 12 for space, terminator, underscore, etc). With 6 bits per character, you should get there - in 15 characters. Assuming you don't support special characters.
Use the first 10 bits for the 3-character numeric string (encode the bits like they represent a number and then pad with zeros as appropriate when decoding).
Okay, this leaves you with 118 bits and 16 alphanumeric characters to store.
0x00 to 0x7F (if you mean inclusive) comprises 128 possible characters to represent. That means that each character can be identified by a combination of 7 bits. Come up with an index mapping each number those 7 bits can represent to the actual character. To represent 16 of your "alphanumeric" characters in this way, you need a total of 112 bits.
We now have 122 bits (or 15.25 bytes) representing our data. Add an easter egg to fill in the remaining unused bits and you have your 16 character array.

format specifier for short integer

I'm not using the format specifiers in C correctly. A few lines of code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    char dest[] = "stack";
    unsigned short val = 500;
    char c = 'a';
    char* final = (char*) malloc(strlen(dest) + 6);
    snprintf(final, strlen(dest) + 6, "%c%c%hd%c%c%s", c, c, val, c, c, dest);
    printf("%s\n", final);
    return 0;
}
What I want is to copy:
final[0] = a random char
final[1] = a random char
final[2] and final[3] = the two bytes of the short
final[4] = another char ...
My problem is that I want to copy the two bytes of the short int into 2 bytes of the final array.
Thanks.
I'm confused - the problem is that you are passing strlen(dest)+6, which limits the length of the final string to 10 chars (plus a null terminator). If you pass strlen(dest)+8 then there will be enough space for the full string.
Update
Even though a short may only be 2 bytes in size, when it is printed as a string each digit character takes up a byte. So it can require up to 5 bytes of space to write a short to a string, if you are writing a number of 10000 or above.
Now, if you write the short to a string as a hexadecimal number using the %x format specifier, it will take up no more than 4 characters.
You need to allocate space for 13 characters - not 11. Don't forget the terminating null character.
When formatted, the number (500) takes up three characters, not one. So your snprintf should be given the final length as strlen(dest) + 5 + 3; then also fix your malloc call to match. If you want to compute the string length of the number, you can do that with a call like strlen(itoa(val)) (note that itoa is non-standard). Also, don't forget the null terminator - strlen does not count it, so add 1 for it.
The simple answer is that you only allocated enough space for strlen(dest) + 6 characters when in reality you're going to have more: 2 chars + 3 chars for the number + 2 chars after + dest (5 chars) + a null terminator = 13 chars, when you allocated 11.
Unsigned shorts can take up to 5 characters, right? (0 - 65535)
Seems like you'd need to allocate 5 characters for your unsigned short to cover all of the values.
Which would point to using this:
char* final = (char*) malloc(strlen(dest) + 10);
You lose bytes because you think the short variable takes 2 bytes, but when printed as text it takes three: one for each digit character ('5', '0', '0'). You also need a '\0' terminator (+1 byte).
==> You need strlen(dest) + 8
Use 8 instead of 6 on:
char* final = (char*) malloc(strlen(dest) + 6);
and
snprintf(final, strlen(dest)+6, "%c%c%hd%c%c%s", c, c, val, c, c, dest);
Seems like the primary misunderstanding is that a "2-byte" short can't be represented on-screen as 2 1-byte characters.
First, leave enough room:
char* final = (char*) malloc(strlen(dest) + 9);
The entire range of possible values of a 1-byte character is not printable. If you want to display this on screen and have it be readable, you'll have to encode the 2-byte short as 4 hex characters, such as:
/* as hex, 4 characters; note that sizeof(final) would only give the size of
   the pointer, so pass the allocated buffer size instead */
snprintf(final, strlen(dest) + 9, "%c%c%04hx%c%c%s", c, c, val, c, c, dest);
If you are writing this to a file, raw bytes are OK, and you might try the following:
/* print raw bytes: upper byte, then lower byte (a zero byte will cut the string short) */
snprintf(final, strlen(dest) + 9, "%c%c%c%c%c%c%s", c, c, (val >> 8) & 0xFF, val & 0xFF, c, c, dest);
But that won't make sense to a human looking at it, and is sensitive to endianness. I'd strongly recommend against it.