Windows API base64 encode/decode - C++

I want to base64-encode a big file (500 MB).
I use this code, but it doesn't work for a large file.
I also tried CryptStringToBinary, but it doesn't work either.
What should I do?

The issue is clearly that there is not enough memory to store a 500-megabyte string in a 32-bit application.
One solution is alluded to by this link, which writes the data to a string. Assuming that the code works correctly, it is not hard to adjust it to write to a file stream instead.
#include <windows.h>
#include <fstream>

static const wchar_t *Base64Digits = L"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

int Base64Encode(const BYTE* pSrc, int nLenSrc, std::wostream& pDstStrm, int nLenDst)
{
    wchar_t pDst[4];
    int nLenOut = 0;
    while (nLenSrc > 0) {
        if (nLenDst < 4) return(0);
        // read up to three source bytes (24 bits)
        int len = 0;
        BYTE s1 = pSrc[len++];
        BYTE s2 = (nLenSrc > 1) ? pSrc[len++] : 0;
        BYTE s3 = (nLenSrc > 2) ? pSrc[len++] : 0;
        pSrc += len;
        nLenSrc -= len;
        //------------------ lookup the right digits for output
        pDst[0] = Base64Digits[(s1 >> 2) & 0x3F];
        pDst[1] = Base64Digits[(((s1 & 0x3) << 4) | ((s2 >> 4) & 0xF)) & 0x3F];
        pDst[2] = Base64Digits[(((s2 & 0xF) << 2) | ((s3 >> 6) & 0x3)) & 0x3F];
        pDst[3] = Base64Digits[s3 & 0x3F];
        //--------- end of input handling
        if (len < 3) { // less than 24 src bits encoded, pad with '='
            pDst[3] = L'=';
            if (len == 1)
                pDst[2] = L'=';
        }
        nLenOut += 4;
        // write the four digits to the output stream
        pDstStrm.write(pDst, 4);
        nLenDst -= 4;
    }
    return (nLenOut);
}
The only change made was to write the four characters to a wide output stream instead of appending them to a string.
Here is an example call:
int main()
{
    std::wofstream ofs(L"testfile.out");
    Base64Encode((BYTE*)"This is a test", strlen("This is a test"), ofs, 1000);
}
The above produces a file with the base64 string VGhpcyBpcyBhIHRlc3Q=, which when decoded, produces This is a test.
Note that the parameter is std::wostream, which means any wide output stream class (such as std::wostringstream) will work also.
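To actually handle a 500 MB file, you can feed the file through this function in chunks whose length is a multiple of 3, so that '=' padding can only appear on the final block. A minimal sketch of such a driver (the helper name and chunk size are just illustrative, not part of the original answer):
#include <windows.h>
#include <fstream>
#include <vector>

// Sketch: stream a large binary file through Base64Encode() above,
// 12 KB (a multiple of 3) at a time, so only the last call can pad.
int EncodeFileToBase64(const char* inPath, const char* outPath)
{
    std::ifstream in(inPath, std::ios::binary);
    std::wofstream out(outPath);
    if (!in || !out) return 0;

    std::vector<char> buf(3 * 4096);
    while (in) {
        in.read(buf.data(), (std::streamsize)buf.size());
        std::streamsize got = in.gcount();
        if (got <= 0) break;
        int quads = (int)((got + 2) / 3) * 4;   // exact output length for this chunk
        Base64Encode((const BYTE*)buf.data(), (int)got, out, quads);
    }
    return 1;
}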

Related

How to calculate the length of a string by characters, not by code units (UTF-8, UTF-16)? [duplicate]

My std::string is UTF-8 encoded, so obviously str.length() returns the wrong result.
I found this information but I'm not sure how I can use it to do this:
The following byte sequences are used to represent a character. The sequence to be used depends on the UCS code number of the character:
0x00000000 - 0x0000007F: 0xxxxxxx
0x00000080 - 0x000007FF: 110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
How can I find the actual length of a UTF-8 encoded std::string? Thanks
Count all first-bytes (the ones that don't match 10xxxxxx).
// s is a pointer (e.g. const char*) into a NUL-terminated UTF-8 string
int len = 0;
while (*s) len += (*s++ & 0xc0) != 0x80;
C++ knows nothing about encodings, so you can't expect to use a standard function to do this.
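Wrapped as a small std::string helper, the same idea might look like this (a sketch; utf8_codepoints is just an illustrative name):
#include <cstddef>
#include <string>

// Count code points by counting every byte that is not a UTF-8
// continuation byte (10xxxxxx), as described above.
std::size_t utf8_codepoints(const std::string& s)
{
    std::size_t len = 0;
    for (unsigned char c : s)
        len += (c & 0xC0) != 0x80;
    return len;
}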
The standard library indeed does acknowledge the existence of character encodings, in the form of locales. If your system supports a locale, it is very easy to use the standard library to compute the length of a string. In the example code below I assume your system supports the locale en_US.utf8. If I compile the code and execute it as "./a.out ソニーSony", the output is that there were 13 char-values and 7 characters. And all without any reference to the internal representation of UTF-8 character codes or having to use 3rd party libraries.
#include <clocale>
#include <cstdlib>
#include <iostream>
#include <string>

using namespace std;

int main(int argc, char *argv[])
{
    string str(argv[1]);
    unsigned int strLen = str.length();
    cout << "Length (char-values): " << strLen << '\n';

    setlocale(LC_ALL, "en_US.utf8");
    unsigned int u = 0;
    const char *c_str = str.c_str();
    unsigned int charCount = 0;
    while (u < strLen)
    {
        u += mblen(&c_str[u], strLen - u);
        charCount += 1;
    }
    cout << "Length (characters): " << charCount << endl;
}
This is a naive implementation, but it should be helpful for you to see how this is done:
#include <cstddef>
#include <stdexcept>
#include <string>

std::size_t utf8_length(std::string const &s) {
    std::size_t len = 0;
    std::string::const_iterator begin = s.begin(), end = s.end();
    while (begin != end) {
        unsigned char c = *begin;
        int n;
        if      ((c & 0x80) == 0)    n = 1;
        else if ((c & 0xE0) == 0xC0) n = 2;
        else if ((c & 0xF0) == 0xE0) n = 3;
        else if ((c & 0xF8) == 0xF0) n = 4;
        else throw std::runtime_error("utf8_length: invalid UTF-8");

        if (end - begin < n) {
            throw std::runtime_error("utf8_length: string too short");
        }
        for (int i = 1; i < n; ++i) {
            if ((begin[i] & 0xC0) != 0x80) {
                throw std::runtime_error("utf8_length: expected continuation byte");
            }
        }
        ++len;       // one code point per lead byte
        begin += n;
    }
    return len;
}
You should probably take the advice of Omry and look into a specialized library for this. That said, if you just want to understand the algorithm to do this, I'll post it below.
Basically, you can convert your string into a wider-element format, such as wchar_t. Note that wchar_t has a few portability issues, because wchar_t is of varying size depending on your platform. On Windows, wchar_t is 2 bytes, and therefore ideal for representing UTF-16. But on UNIX/Linux, it's four-bytes and is therefore used to represent UTF-32. Therefore, for Windows this will only work if you don't include any Unicode codepoints above 0xFFFF. For Linux you can include the entire range of codepoints in a wchar_t. (Fortunately, this issue will be mitigated with the C++0x Unicode character types.)
With that caveat noted, you can create a conversion function using the following algorithm:
#include <iostream>
#include <iterator>
#include <string>

template <class OutputIterator>
inline OutputIterator convert(const unsigned char* it, const unsigned char* end, OutputIterator out)
{
    while (it != end)
    {
        if (*it < 192) *out++ = *it++;   // single byte character
        else if (*it < 224 && it + 1 < end && *(it+1) > 127) {
            // double byte character
            *out++ = ((*it & 0x1F) << 6) | (*(it+1) & 0x3F);
            it += 2;
        }
        else if (*it < 240 && it + 2 < end && *(it+1) > 127 && *(it+2) > 127) {
            // triple byte character
            *out++ = ((*it & 0x0F) << 12) | ((*(it+1) & 0x3F) << 6) | (*(it+2) & 0x3F);
            it += 3;
        }
        else if (*it < 248 && it + 3 < end && *(it+1) > 127 && *(it+2) > 127 && *(it+3) > 127) {
            // 4-byte character
            *out++ = ((*it & 0x07) << 18) | ((*(it+1) & 0x3F) << 12) |
                     ((*(it+2) & 0x3F) << 6) | (*(it+3) & 0x3F);
            it += 4;
        }
        else ++it;   // Invalid byte sequence (throw an exception here if you want)
    }
    return out;
}

int main()
{
    std::string s = "\u00EAtre";
    std::cout << s.length() << std::endl;

    std::wstring output;
    convert(reinterpret_cast<const unsigned char*>(s.c_str()),
            reinterpret_cast<const unsigned char*>(s.c_str()) + s.length(),
            std::back_inserter(output));
    std::cout << output.length() << std::endl;   // Actual length
}
The algorithm isn't fully generic, because the InputIterator needs to be an unsigned char, so you can interpret each byte as having a value between 0 and 0xFF. The OutputIterator is generic, (just so you can use an std::back_inserter and not worry about memory allocation), but its use as a generic parameter is limited: basically, it has to output to an array of elements large enough to represent a UTF-16 or UTF-32 character, such as wchar_t, uint32_t or the C++0x char32_t types. Also, I didn't include code to convert character byte sequences greater than 4 bytes, but you should get the point of how the algorithm works from what's posted.
Also, if you just want to count the number of characters, rather than output to a new wide-character buffer, you can modify the algorithm to include a counter rather than an OutputIterator. Or better yet, just use Marcelo Cantos' answer to count the first-bytes.
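For instance, a counting variant of the routine above might look like this (a sketch following the same lead-byte classification; utf8_count is an illustrative name):
#include <cstddef>

// Count code points by skipping over each sequence based on its lead byte,
// using the same thresholds (192, 224, 240, 248) as convert() above.
std::size_t utf8_count(const unsigned char* it, const unsigned char* end)
{
    std::size_t count = 0;
    while (it < end)
    {
        if      (*it < 192) it += 1;   // ASCII byte (or stray continuation byte)
        else if (*it < 224) it += 2;   // two-byte sequence
        else if (*it < 240) it += 3;   // three-byte sequence
        else                it += 4;   // four-byte sequence
        ++count;
    }
    return count;
}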
I recommend you use UTF8-CPP. It's a header-only library for working with UTF-8 in C++. With this lib, it would look something like this:
int LenghtOfUtf8String( const std::string &utf8_string )
{
    return utf8::distance( utf8_string.begin(), utf8_string.end() );
}
(Code is from the top of my head.)
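A hedged usage sketch (assuming the single header utf8.h from the utfcpp project is on the include path; the sample string is the same two-code-point example used in another answer below):
#include <iostream>
#include <string>
#include "utf8.h"   // UTF8-CPP (utfcpp)

int main()
{
    std::string s = "\xe6\x97\xa5\xd1\x88";                   // 2 code points
    std::cout << utf8::distance(s.begin(), s.end()) << '\n';  // prints 2
}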
Most of my personal C library code has only really been tested in English, but here is how I've implemented my UTF-8 string length function. I originally based it on the bit pattern described in this wiki page table. It isn't the most readable code, but it benchmarks better with my compiler. Also, sorry for this being C code; it should translate over to std::string in C++ pretty easily with some slight modifications :).
size_t utf8len(const char* const str) {
    size_t len = 0;
    unsigned char c = str[0];
    for (size_t i = 0; c != 0; ++len) {
        int v0 = (c & 0x80) >> 7;
        int v1 = (c & 0x40) >> 6;
        int v2 = (c & 0x20) >> 5;
        int v3 = (c & 0x10) >> 4;
        i += 1 + v0 * v1 + v0 * v1 * v2 + v0 * v1 * v2 * v3;
        c = str[i];
    }
    return len;
}
Note that this does not validate any of the bytes (much like all the other suggested answers here). Personally, I would separate string validation from my string length function, as that is not its responsibility. If we were to move string validation to another function, it could be done something like the following:
bool utf8valid(const char* const str) {
    if (str == NULL)
        return false;
    const char* c = str;
    bool valid = true;
    for (size_t i = 0; c[0] != 0 && valid;) {
        valid = (c[0] & 0x80) == 0
            || ((c[0] & 0xE0) == 0xC0 && (c[1] & 0xC0) == 0x80)
            || ((c[0] & 0xF0) == 0xE0 && (c[1] & 0xC0) == 0x80 && (c[2] & 0xC0) == 0x80)
            || ((c[0] & 0xF8) == 0xF0 && (c[1] & 0xC0) == 0x80 && (c[2] & 0xC0) == 0x80 && (c[3] & 0xC0) == 0x80);
        int v0 = (c[0] & 0x80) >> 7;
        int v1 = (c[0] & 0x40) >> 6;
        int v2 = (c[0] & 0x20) >> 5;
        int v3 = (c[0] & 0x10) >> 4;
        i += 1 + v0 * v1 + v0 * v1 * v2 + v0 * v1 * v2 * v3;
        c = str + i;
    }
    return valid;
}
If you are going for readability, I'll admit that the other suggestions are quite a bit more readable, haha!
Try using an encoding library like iconv. It probably has the API you want.
An alternative is to implement your own utf8strlen, which determines the length of each codepoint and iterates over codepoints instead of char values.
A slightly lazy approach would be to only count lead bytes, but visit every byte. This saves the complexity of decoding the various lead byte sizes, but obviously you pay to visit all the bytes, though there usually aren't that many (2x-3x):
#include <algorithm>
#include <cstddef>
#include <string>

size_t utf8Len(const std::string& s)
{
    return std::count_if(s.begin(), s.end(),
        [](char c) { return (static_cast<unsigned char>(c) & 0xC0) != 0x80; });
}
Note that certain byte values are illegal as lead bytes, for example those that would represent values bigger than the 21 bits needed for extended Unicode, but then the other approach would not know how to deal with such bytes anyway.
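If you want to reject those impossible lead bytes while counting, a variant of the predicate could look like this (a sketch; the byte ranges follow RFC 3629, which limits UTF-8 to U+10FFFF):
#include <cstddef>
#include <stdexcept>
#include <string>

// Same lead-byte counting as utf8Len, but throws on byte values that can
// never appear in RFC 3629 UTF-8: 0xC0/0xC1 (overlong) and 0xF5..0xFF.
size_t utf8LenChecked(const std::string& s)
{
    size_t len = 0;
    for (unsigned char c : s) {
        if (c == 0xC0 || c == 0xC1 || c >= 0xF5)
            throw std::runtime_error("invalid UTF-8 byte");
        len += (c & 0xC0) != 0x80;
    }
    return len;
}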
UTF-8 CPP library has a function that does just that. You can either include the library into your project (it is small) or just look at the function. http://utfcpp.sourceforge.net/
const char* twochars = "\xe6\x97\xa5\xd1\x88";
size_t dist = utf8::distance(twochars, twochars + 5);
assert(dist == 2);
This code is ported from php-iconv to C++; you need to use iconv. Hope it's useful:
// ported from PHP
// http://lxr.php.net/xref/PHP_5_4/ext/iconv/iconv.c#_php_iconv_strlen
#include <iconv.h>
#include <string>

#define GENERIC_SUPERSET_NBYTES 4
#define GENERIC_SUPERSET_NAME   "UCS-4LE"

// UInt32 is assumed to be an unsigned 32-bit integer typedef from the author's codebase
UInt32 iconvStrlen(const char *str, size_t nbytes, const char* encode)
{
    UInt32 retVal = (unsigned int)-1;
    unsigned int cnt = 0;
    iconv_t cd = iconv_open(GENERIC_SUPERSET_NAME, encode);
    if (cd == (iconv_t)(-1))
        return retVal;

    const char* in;
    size_t inLeft;
    char *out;
    size_t outLeft;
    char buf[GENERIC_SUPERSET_NBYTES * 2] = {0};

    for (in = str, inLeft = nbytes, cnt = 0; inLeft > 0; cnt += 2)
    {
        size_t prev_in_left;
        out = buf;
        outLeft = sizeof(buf);
        prev_in_left = inLeft;
        if (iconv(cd, &in, &inLeft, (char **) &out, &outLeft) == (size_t)-1) {
            if (prev_in_left == inLeft) {
                break;
            }
        }
    }
    iconv_close(cd);

    if (outLeft > 0)
        cnt -= outLeft / GENERIC_SUPERSET_NBYTES;

    retVal = cnt;
    return retVal;
}

UInt32 utf8StrLen(const std::string& src)
{
    return iconvStrlen(src.c_str(), src.length(), "UTF-8");
}
Just another naive implementation to count chars in a UTF-8 string:
int utf8_strlen(const std::string& str)
{
    int c, i, ix, q;
    for (q = 0, i = 0, ix = str.length(); i < ix; i++, q++)
    {
        c = (unsigned char)str[i];
        if (c >= 0 && c <= 127) i += 0;
        else if ((c & 0xE0) == 0xC0) i += 1;
        else if ((c & 0xF0) == 0xE0) i += 2;
        else if ((c & 0xF8) == 0xF0) i += 3;
        //else if ((c & 0xFC) == 0xF8) i += 4; // 111110bb // byte 5, unnecessary in 4-byte UTF-8
        //else if ((c & 0xFE) == 0xFC) i += 5; // 1111110b // byte 6, unnecessary in 4-byte UTF-8
        else return 0; // invalid utf8
    }
    return q;
}
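A quick usage check of the function above (the sample string is the same two-code-point example from the utfcpp answer):
#include <cassert>
#include <string>

int main()
{
    std::string s = "\xe6\x97\xa5\xd1\x88";   // 2 code points
    assert(utf8_strlen(s) == 2);
}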

How to read N bytes from a file continuously until EOF

I am writing a WAV-to-Base64 converter program.
I am trying the following code snippet:
vector<char> in(3);
std::string out = "abcd"; // four-letter garbage value as initializer
ifstream file_ptr(filename.c_str(), ios::in | ios::binary);
unsigned int threebytes = 0;
// Apply the Base64 encoding algorithm
do {
    threebytes = (unsigned int) file_ptr.rdbuf()->sgetn(&in[0], 3);
    if (threebytes > 0) {
        EncodeBlock(in, out, (int)threebytes); // Apply conversion algorithm to convert 3 bytes into 4
        outbuff = outbuff + out;               // Append the 4 bytes got from above step to the output
    }
} while (threebytes == in.size());
file_ptr.close();
In EncodeBlock, where the Base64 encoding algorithm is written:
void EncodeBlock(const std::vector<char>& in, std::string& out, int len) {
    using namespace std;
    cb64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    out[0] = cb64[(int) (in[0] >> 2)];
    out[1] = cb64[(int) (((in[0] << 6) >> 2) | (in[1] >> 4))];
    out[2] = (len > 1) ?
        cb64[(int) (((in[1] << 4) >> 2) | (in[2] >> 6))] :
        '=';
    out[3] = (len > 2) ?
        cb64[(int) ((in[2] << 2) >> 2)] :
        '=';
}
cb64 is a 64-character string, but the index generated by the bit manipulation sometimes falls out of range (0 to 63).
Why?
The resolution was to handle the bit manipulation correctly: the char's 8 bits are operated on and then cast to unsigned int, which introduces 24 extra bits that need to be masked off.
So,
out[0] = cb64[(unsigned int) ((in[0] >> 2) & 0x003f)];
out[1] = cb64[(unsigned int) ((((in[0] << 6) >> 2) | (in[1] >> 4)) & 0x003f)];
...and so on; the & 0x003f mask keeps every index in range.
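For completeness, here is a minimal sketch of an EncodeBlock with every index computed on unsigned values and masked to 6 bits (this is not the poster's original; cb64 is assumed to be the standard alphabet shown above, and out is assumed to hold at least 4 characters, as in the question):
#include <string>
#include <vector>

void EncodeBlock(const std::vector<char>& in, std::string& out, int len) {
    static const std::string cb64 =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    // Promote to unsigned char so shifts never sign-extend; zero unread bytes.
    unsigned char b0 = (unsigned char)in[0];
    unsigned char b1 = (len > 1) ? (unsigned char)in[1] : 0;
    unsigned char b2 = (len > 2) ? (unsigned char)in[2] : 0;
    out[0] = cb64[(b0 >> 2) & 0x3F];
    out[1] = cb64[((b0 << 4) | (b1 >> 4)) & 0x3F];
    out[2] = (len > 1) ? cb64[((b1 << 2) | (b2 >> 6)) & 0x3F] : '=';
    out[3] = (len > 2) ? cb64[b2 & 0x3F] : '=';
}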

C++ Base64 Unicode - null bytes

I am trying to base64-encode a Unicode string. I am running into problems: after the encoding, the output is my string base64'd, but there are null bytes at random places throughout it, and I don't know why or how to get rid of them.
Here is my Base64Encode function:
static char Base64Digits[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

int Base64Encode(const BYTE* pSrc, int nLenSrc, wchar_t* pDst, int nLenDst)
{
    int nLenOut = 0;
    while ( nLenSrc > 0 ) {
        if (nLenOut+4 > nLenDst) return(0); // error
        // read three source bytes (24 bits)
        BYTE s1 = pSrc[0]; // (but avoid reading past the end)
        BYTE s2 = 0; if (nLenSrc > 1) s2 = pSrc[1]; //------ corrected, thanks to jprichey
        BYTE s3 = 0; if (nLenSrc > 2) s3 = pSrc[2];
        DWORD n;
        n = s1;    // xxx1
        n <<= 8;   // xx1x
        n |= s2;   // xx12
        n <<= 8;   // x12x
        n |= s3;   // x123
        //-------------- get four 6-bit values for lookups
        BYTE m4 = n & 0x3f; n >>= 6;
        BYTE m3 = n & 0x3f; n >>= 6;
        BYTE m2 = n & 0x3f; n >>= 6;
        BYTE m1 = n & 0x3f;
        //------------------ lookup the right digits for output
        BYTE b1 = Base64Digits[m1];
        BYTE b2 = Base64Digits[m2];
        BYTE b3 = Base64Digits[m3];
        BYTE b4 = Base64Digits[m4];
        //--------- end of input handling
        *pDst++ = b1;
        *pDst++ = b2;
        if ( nLenSrc >= 3 ) { // 24 src bits left to encode, output xxxx
            *pDst++ = b3;
            *pDst++ = b4;
        }
        if ( nLenSrc == 2 ) { // 16 src bits left to encode, output xxx=
            *pDst++ = b3;
            *pDst++ = '=';
        }
        if ( nLenSrc == 1 ) { // 8 src bits left to encode, output xx==
            *pDst++ = '=';
            *pDst++ = '=';
        }
        pSrc += 3;
        nLenSrc -= 3;
        nLenOut += 4;
    }
    // Could optionally append a NULL byte like so:
    // *pDst++= 0; nLenOut++;
    return( nLenOut );
}
Not to fool anyone, but I copied the function from here
Here is how I call the function:
wchar_t base64[256];
Base64Encode((const unsigned char *)UserLoginHash, lstrlenW(UserLoginHash) * 2, base64, 256);
So, why are there random null bytes or "whitespaces" in the generated hash? What should be changed so that I can get rid of them?
Try something more like this. Portions copied from my own base64 encoder:
static const wchar_t *Base64Digits = L"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

int Base64Encode(const BYTE* pSrc, int nLenSrc, wchar_t* pDst, int nLenDst)
{
    int nLenOut = 0;
    while (nLenSrc > 0) {
        if (nLenDst < 4) return(0); // error
        // read up to three source bytes (24 bits)
        int len = 0;
        BYTE s1 = pSrc[len++];
        BYTE s2 = (nLenSrc > 1) ? pSrc[len++] : 0;
        BYTE s3 = (nLenSrc > 2) ? pSrc[len++] : 0;
        pSrc += len;
        nLenSrc -= len;
        //------------------ lookup the right digits for output
        pDst[0] = Base64Digits[(s1 >> 2) & 0x3F];
        pDst[1] = Base64Digits[(((s1 & 0x3) << 4) | ((s2 >> 4) & 0xF)) & 0x3F];
        pDst[2] = Base64Digits[(((s2 & 0xF) << 2) | ((s3 >> 6) & 0x3)) & 0x3F];
        pDst[3] = Base64Digits[s3 & 0x3F];
        //--------- end of input handling
        if (len < 3) { // less than 24 src bits encoded, pad with '='
            pDst[3] = L'=';
            if (len == 1)
                pDst[2] = L'=';
        }
        nLenOut += 4;
        pDst += 4;
        nLenDst -= 4;
    }
    if (nLenDst > 0) *pDst = 0;
    return (nLenOut);
}
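A hedged usage sketch (assuming, as in the question, that UserLoginHash is a NUL-terminated wchar_t string whose raw bytes you want to encode):
wchar_t base64[256];
int written = Base64Encode((const BYTE*)UserLoginHash,
                           lstrlenW(UserLoginHash) * (int)sizeof(wchar_t),
                           base64, 256);
// 'written' is the number of Base64 characters produced; the function
// NUL-terminates the buffer when there is room left over.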
The problem, from what I can see, is that as the encoder works, occasionally it is adding a value to a certain character value, for example, let's say U+0070 + U+0066 (this is just an example). At some point, these values equal the null terminator (\0) or something equivalent to it, making it so the program doesn't read past that point when outputting the string and making it appear shorter than it should be.
I've encountered this problem with my own encoding algorithm before, and the best solution appears to be to add more variability to your algorithm; so, instead of only adding characters to the string, subtract some, multiply or XOR some at some point in the algorithm. This should remove (or at least reduce the chances of) null terminators appearing where you don't want them. This may, however, take some trial-and-error on your part to see what works and what doesn't.

Base 64 Encoding Losing data

This is my fourth attempt at doing Base64 encoding. My first tries worked, but they weren't standard. They were also extremely slow! I used vectors, push_back, and erase a lot.
So I decided to rewrite it, and this is much, much faster! Except that it loses data. -__-
I need as much speed as I can possibly get because I'm compressing a pixel buffer and Base64-encoding the compressed string. I'm using ZLib. The images are 1366 x 768, so yeah.
I do not want to copy any code I find online because... well, I like to write things myself, and I don't like worrying about copyright or having to put a ton of credits from different sources all over my code.
Anyway, my code is as follows below. It's very short and simple.
const static std::string Base64Chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

inline bool IsBase64(std::uint8_t C)
{
    return (isalnum(C) || (C == '+') || (C == '/'));
}

std::string Copy(std::string Str, int FirstChar, int Count)
{
    if (FirstChar <= 0)
        FirstChar = 0;
    else
        FirstChar -= 1;
    return Str.substr(FirstChar, Count);
}

std::string DecToBinStr(int Num, int Padding)
{
    int Bin = 0, Pos = 1;
    std::stringstream SS;
    while (Num > 0)
    {
        Bin += (Num % 2) * Pos;
        Num /= 2;
        Pos *= 10;
    }
    SS.fill('0');
    SS.width(Padding);
    SS << Bin;
    return SS.str();
}

int DecToBinStr(std::string DecNumber)
{
    int Bin = 0, Pos = 1;
    int Dec = strtol(DecNumber.c_str(), NULL, 10);
    while (Dec > 0)
    {
        Bin += (Dec % 2) * Pos;
        Dec /= 2;
        Pos *= 10;
    }
    return Bin;
}

int BinToDecStr(std::string BinNumber)
{
    int Dec = 0;
    int Bin = strtol(BinNumber.c_str(), NULL, 10);
    for (int I = 0; Bin > 0; ++I)
    {
        if (Bin % 10 == 1)
        {
            Dec += (1 << I);
        }
        Bin /= 10;
    }
    return Dec;
}

std::string EncodeBase64(std::string Data)
{
    std::string Binary = std::string();
    std::string Result = std::string();
    for (std::size_t I = 0; I < Data.size(); ++I)
    {
        Binary += DecToBinStr(Data[I], 8);
    }
    for (std::size_t I = 0; I < Binary.size(); I += 6)
    {
        Result += Base64Chars[BinToDecStr(Copy(Binary, I, 6))];
        if (I == 0) ++I;
    }
    int PaddingAmount = ((-Result.size() * 3) & 3);
    for (int I = 0; I < PaddingAmount; ++I)
        Result += '=';
    return Result;
}

std::string DecodeBase64(std::string Data)
{
    std::string Binary = std::string();
    std::string Result = std::string();
    for (std::size_t I = Data.size(); I > 0; --I)
    {
        if (Data[I - 1] != '=')
        {
            std::string Characters = Copy(Data, 0, I);
            for (std::size_t J = 0; J < Characters.size(); ++J)
                Binary += DecToBinStr(Base64Chars.find(Characters[J]), 6);
            break;
        }
    }
    for (std::size_t I = 0; I < Binary.size(); I += 8)
    {
        Result += (char)BinToDecStr(Copy(Binary, I, 8));
        if (I == 0) ++I;
    }
    return Result;
}
I've been using the above like this:
int main()
{
    std::string Data = EncodeBase64("IMG." + ::ToString(677) + "*" + ::ToString(604)); // IMG.677*604
    std::cout << DecodeBase64(Data); // Prints IMG.677*601
}
As you can see in the above, it prints the wrong string. It's fairly close but for some reason, the 4 is turned into a 1!
Now if I do:
int main()
{
    std::string Data = EncodeBase64("IMG." + ::ToString(1366) + "*" + ::ToString(768)); // IMG.1366*768
    std::cout << DecodeBase64(Data); // Prints IMG.1366*768
}
It prints correctly. I'm not sure what is going on at all or where to begin looking.
Just in-case anyone is curious and want to see my other attempts (the slow ones): http://pastebin.com/Xcv03KwE
I'm really hoping someone could shed some light on speeding things up or at least figuring out what's wrong with my code :l
The main encoding issue is that you are not accounting for data that is not a multiple of 6 bits. In this case, the final 4 you have is being converted into 0100 instead of 010000 because there are no more bits to read. You are supposed to pad with 0s.
After changing your Copy like this, the final encoded character is Q, instead of the original E.
std::string data = Str.substr(FirstChar, Count);
while(data.size() < Count) data += '0';
return data;
Also, it appears that your logic for adding padding = is off because it is adding one too many = in this case.
As far as comments on speed, I'd focus primarily on trying to reduce your usage of std::string. The way you are currently converting the data into a string with 0 and 1 is pretty inefficent considering that the source could be read directly with bitwise operators.
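For reference, the number of '=' characters depends only on the input length, so it can be computed directly (a sketch, not the poster's code):
#include <cstddef>

// '=' padding for n input bytes: none when n is a multiple of 3,
// otherwise 3 - n % 3 (i.e. 1 or 2, never 3).
int Base64Padding(std::size_t n)
{
    return (n % 3 == 0) ? 0 : (int)(3 - n % 3);
}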
I'm not sure whether I could easily come up with a slower method of doing Base-64 conversions.
The code requires 4 headers (on Mac OS X 10.7.5 with G++ 4.7.1) and the compiler option -std=c++11 to make the #include <cstdint> acceptable:
#include <string>
#include <iostream>
#include <sstream>
#include <cstdint>
It also requires a function ToString() that was not defined; I created:
std::string ToString(int value)
{
    std::stringstream ss;
    ss << value;
    return ss.str();
}
The code in your main() — which is what uses the ToString() function — is a little odd: why do you need to build a string from pieces instead of simply using "IMG.677*604"?
Also, it is worth printing out the intermediate result:
int main()
{
    std::string Data = EncodeBase64("IMG." + ::ToString(677) + "*" + ::ToString(604));
    std::cout << Data << std::endl;
    std::cout << DecodeBase64(Data) << std::endl; // Prints IMG.677*601
}
This yields:
SU1HLjY3Nyo2MDE===
IMG.677*601
The output string (SU1HLjY3Nyo2MDE===) is 18 bytes long; that has to be wrong as a valid Base-64 encoded string has to be a multiple of 4 bytes long (as three 8-bit bytes are encoded into four bytes each containing 6 bits of the original data). This immediately tells us there are problems. You should only get zero, one or two pad (=) characters; never three. This also confirms that there are problems.
Removing two of the pad characters leaves a valid Base-64 string. When I use my own home-brew Base-64 encoding and decoding functions to decode your (truncated) output, it gives me:
Base64:
0x0000: SU1HLjY3Nyo2MDE=
Binary:
0x0000: 49 4D 47 2E 36 37 37 2A 36 30 31 00 IMG.677*601.
Thus it appears you have encoded the null that terminates the string. When I encode IMG.677*604, the output I get is:
Binary:
0x0000: 49 4D 47 2E 36 37 37 2A 36 30 34 IMG.677*604
Base64: SU1HLjY3Nyo2MDQ=
You say you want to speed up your code. Quite apart from fixing it so that it encodes correctly (I've not really studied the decoding), you will want to avoid all the string manipulation you do. It should be a bit manipulation exercise, not a string manipulation exercise.
I have 3 small encoding routines in my code, to encode triplets, doublets and singlets:
/* Encode 3 bytes of data into 4 */
static void encode_triplet(const char *triplet, char *quad)
{
    quad[0] = base_64_map[(triplet[0] >> 2) & 0x3F];
    quad[1] = base_64_map[((triplet[0] & 0x03) << 4) | ((triplet[1] >> 4) & 0x0F)];
    quad[2] = base_64_map[((triplet[1] & 0x0F) << 2) | ((triplet[2] >> 6) & 0x03)];
    quad[3] = base_64_map[triplet[2] & 0x3F];
}

/* Encode 2 bytes of data into 4 */
static void encode_doublet(const char *doublet, char *quad, char pad)
{
    quad[0] = base_64_map[(doublet[0] >> 2) & 0x3F];
    quad[1] = base_64_map[((doublet[0] & 0x03) << 4) | ((doublet[1] >> 4) & 0x0F)];
    quad[2] = base_64_map[((doublet[1] & 0x0F) << 2)];
    quad[3] = pad;
}

/* Encode 1 byte of data into 4 */
static void encode_singlet(const char *singlet, char *quad, char pad)
{
    quad[0] = base_64_map[(singlet[0] >> 2) & 0x3F];
    quad[1] = base_64_map[((singlet[0] & 0x03) << 4)];
    quad[2] = pad;
    quad[3] = pad;
}
This is written as C code rather than using native C++ idioms, but the code shown should compile with C++ (unlike the C99 initializers elsewhere in the source). The base_64_map[] array corresponds to your Base64Chars string. The pad character passed in is normally '=', but can be '\0' since the system I work with has eccentric ideas about not needing padding (pre-dating my involvement in the code, and it uses a non-standard alphabet to boot) and the code handles both the non-standard and the RFC 3548 standard.
The driving code is:
/* Encode input data as Base-64 string. Output length returned, or negative error */
static int base64_encode_internal(const char *data, size_t datalen, char *buffer, size_t buflen, char pad)
{
    size_t outlen = BASE64_ENCLENGTH(datalen);
    const char *bin_data = (const void *)data;
    char *b64_data = (void *)buffer;

    if (outlen > buflen)
        return(B64_ERR_OUTPUT_BUFFER_TOO_SMALL);

    while (datalen >= 3)
    {
        encode_triplet(bin_data, b64_data);
        bin_data += 3;
        b64_data += 4;
        datalen -= 3;
    }
    b64_data[0] = '\0';

    if (datalen == 2)
        encode_doublet(bin_data, b64_data, pad);
    else if (datalen == 1)
        encode_singlet(bin_data, b64_data, pad);
    b64_data[4] = '\0';

    return((b64_data - buffer) + strlen(b64_data));
}

/* Encode input data as Base-64 string. Output length returned, or negative error */
int base64_encode(const char *data, size_t datalen, char *buffer, size_t buflen)
{
    return(base64_encode_internal(data, datalen, buffer, buflen, base64_pad));
}
The base64_pad constant is the '='; there's also a base64_encode_nopad() function that supplies '\0' instead. The errors are somewhat arbitrary but relevant to the code.
The main point to take away from this is that you should be doing bit manipulation and building up a string that is an exact multiple of 4 bytes for a given input.
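As a concrete check on that length claim, the encoded size for n input bytes (padding included) is 4 * ceil(n / 3); a one-liner sketch:
#include <cstddef>

// Output characters produced for n input bytes, '=' padding included.
std::size_t Base64EncodedLength(std::size_t n)
{
    return 4 * ((n + 2) / 3);
}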
std::string EncodeBase64(std::string Data)
{
    std::string Binary = std::string();
    std::string Result = std::string();
    for (std::size_t I = 0; I < Data.size(); ++I)
    {
        Binary += DecToBinStr(Data[I], 8);
    }
    if (Binary.size() % 6)
    {
        Binary.resize(Binary.size() + 6 - Binary.size() % 6, '0');
    }
    for (std::size_t I = 0; I < Binary.size(); I += 6)
    {
        Result += Base64Chars[BinToDecStr(Copy(Binary, I, 6))];
        if (I == 0) ++I;
    }
    if (Result.size() % 4)
    {
        Result.resize(Result.size() + 4 - Result.size() % 4, '=');
    }
    return Result;
}