I am trying to base64-encode a Unicode string. I am running into problems: after the encoding, the output is my string base64'd, but there are null bytes at random places throughout it, and I don't know why or how to get rid of them.
Here is my Base64Encode function:
static char Base64Digits[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

int Base64Encode(const BYTE* pSrc, int nLenSrc, wchar_t* pDst, int nLenDst)
{
    int nLenOut = 0;
    while ( nLenSrc > 0 ) {
        if (nLenOut+4 > nLenDst) return(0); // error
        // read three source bytes (24 bits)
        BYTE s1 = pSrc[0]; // (but avoid reading past the end)
        BYTE s2 = 0; if (nLenSrc > 1) s2 = pSrc[1]; //------ corrected, thanks to jprichey
        BYTE s3 = 0; if (nLenSrc > 2) s3 = pSrc[2];
        DWORD n;
        n = s1;   // xxx1
        n <<= 8;  // xx1x
        n |= s2;  // xx12
        n <<= 8;  // x12x
        n |= s3;  // x123
        //-------------- get four 6-bit values for lookups
        BYTE m4 = n & 0x3f; n >>= 6;
        BYTE m3 = n & 0x3f; n >>= 6;
        BYTE m2 = n & 0x3f; n >>= 6;
        BYTE m1 = n & 0x3f;
        //------------------ lookup the right digits for output
        BYTE b1 = Base64Digits[m1];
        BYTE b2 = Base64Digits[m2];
        BYTE b3 = Base64Digits[m3];
        BYTE b4 = Base64Digits[m4];
        //--------- end of input handling
        *pDst++ = b1;
        *pDst++ = b2;
        if ( nLenSrc >= 3 ) { // 24 src bits left to encode, output xxxx
            *pDst++ = b3;
            *pDst++ = b4;
        }
        if ( nLenSrc == 2 ) { // 16 src bits left to encode, output xxx=
            *pDst++ = b3;
            *pDst++ = '=';
        }
        if ( nLenSrc == 1 ) { // 8 src bits left to encode, output xx==
            *pDst++ = '=';
            *pDst++ = '=';
        }
        pSrc += 3;
        nLenSrc -= 3;
        nLenOut += 4;
    }
    // Could optionally append a NULL byte like so:
    // *pDst++ = 0; nLenOut++;
    return( nLenOut );
}
Not to fool anyone, but I copied the function from here
Here is how I call the function:
wchar_t base64[256];
Base64Encode((const unsigned char *)UserLoginHash, lstrlenW(UserLoginHash) * 2, base64, 256);
So why are there random null bytes or "whitespaces" in the generated hash? What should be changed so that I can get rid of them?
Try something more like this. Portions copied from my own base64 encoder:
static const wchar_t *Base64Digits = L"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

int Base64Encode(const BYTE* pSrc, int nLenSrc, wchar_t* pDst, int nLenDst)
{
    int nLenOut = 0;
    while (nLenSrc > 0) {
        if (nLenDst < 4) return(0); // error
        // read up to three source bytes (24 bits)
        int len = 0;
        BYTE s1 = pSrc[len++];
        BYTE s2 = (nLenSrc > 1) ? pSrc[len++] : 0;
        BYTE s3 = (nLenSrc > 2) ? pSrc[len++] : 0;
        pSrc += len;
        nLenSrc -= len;
        //------------------ lookup the right digits for output
        pDst[0] = Base64Digits[(s1 >> 2) & 0x3F];
        pDst[1] = Base64Digits[(((s1 & 0x3) << 4) | ((s2 >> 4) & 0xF)) & 0x3F];
        pDst[2] = Base64Digits[(((s2 & 0xF) << 2) | ((s3 >> 6) & 0x3)) & 0x3F];
        pDst[3] = Base64Digits[s3 & 0x3F];
        //--------- end of input handling
        if (len < 3) { // less than 24 src bits encoded, pad with '='
            pDst[3] = L'=';
            if (len == 1)
                pDst[2] = L'=';
        }
        nLenOut += 4;
        pDst += 4;
        nLenDst -= 4;
    }
    if (nLenDst > 0) *pDst = 0;
    return (nLenOut);
}
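For illustration, a minimal call sketch against this corrected function (UserLoginHash is assumed to be the NUL-terminated wide string from the question; sizeof(wchar_t) replaces the hard-coded 2):

wchar_t base64[256];
int nLen = Base64Encode((const BYTE*)UserLoginHash,
                        lstrlenW(UserLoginHash) * sizeof(wchar_t), // byte count of the UTF-16 data
                        base64, 256);
// base64 is now NUL-terminated and can be printed or stored as a normal wide string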
The problem, from what I can see, is that as the encoder works it occasionally adds one character value to another, for example U+0070 + U+0066 (just an example). At some point these sums can equal the null terminator (\0), or something equivalent to it, so anything reading the output stops at that point and the string appears shorter than it should be.
I've encountered this problem with my own encoding algorithms before, and the best solution appears to be to add more variability to the algorithm: instead of only adding character values, subtract, multiply, or XOR some of them at some point. This should remove (or at least reduce the chance of) null terminators appearing where you don't want them. It may, however, take some trial and error on your part to see what works and what doesn't.
I have a std::string as INPUT: "(#1476710203) éf.pdf"
I want the UTF-8 percent-encoded result as OUTPUT: "%28#1476710203%29%20%C3%A9f.pdf"
I have tried
std::codecvt_utf8
the Win32 MultiByteToWideChar() function to convert the data from CP437 to UTF-16, and then the WideCharToMultiByte() function to convert it from UTF-16 to UTF-8.
But when I print the bytes after both conversions, it still shows the input string, while I want the output string.
First, we need to parse the UTF-8 characters in the string so that we don't encode the whole string; this code was taken from Dear ImGui with a few modifications:
#include <cstdint>

int WideCharFromUtf8(unsigned int* out_char, const char* in_text, const char* in_text_end)
{
    const char lengths[32] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 4, 0 };
    const int masks[] = { 0x00, 0x7f, 0x1f, 0x0f, 0x07 };
    const uint32_t mins[] = { 0x400000, 0, 0x80, 0x800, 0x10000 };
    const int shiftc[] = { 0, 18, 12, 6, 0 };
    const int shifte[] = { 0, 6, 4, 2, 0 };

    int len = lengths[*(const unsigned char*)in_text >> 3];
    int wanted = len + (!len ? 1 : 0);

    if (in_text_end == NULL)
        in_text_end = in_text + wanted; // Max length, nulls will be taken into account.

    // Copy at most 'len' bytes, stop copying at 0 or past in_text_end. Branch predictor does a good job here,
    // so it is fast even with excessive branching.
    unsigned char s[4];
    s[0] = in_text + 0 < in_text_end ? in_text[0] : 0;
    s[1] = in_text + 1 < in_text_end ? in_text[1] : 0;
    s[2] = in_text + 2 < in_text_end ? in_text[2] : 0;
    s[3] = in_text + 3 < in_text_end ? in_text[3] : 0;

    // Assume a four-byte character and load four bytes. Unused bits are shifted out.
    *out_char  = (uint32_t)(s[0] & masks[len]) << 18;
    *out_char |= (uint32_t)(s[1] & 0x3f) << 12;
    *out_char |= (uint32_t)(s[2] & 0x3f) << 6;
    *out_char |= (uint32_t)(s[3] & 0x3f) << 0;
    *out_char >>= shiftc[len];

    // Accumulate the various error conditions.
    int e = 0;
    e  = (*out_char < mins[len]) << 6;     // non-canonical encoding
    e |= ((*out_char >> 11) == 0x1b) << 7; // surrogate half?
    e |= (*out_char > 0x10FFFF) << 8;      // out of range?
    e |= (s[1] & 0xc0) >> 2;
    e |= (s[2] & 0xc0) >> 4;
    e |= (s[3]) >> 6;
    e ^= 0x2a;                             // top two bits of each tail byte correct?
    e >>= shifte[len];

    if (e)
    {
        // No bytes are consumed when *in_text == 0 || in_text == in_text_end.
        // One byte is consumed in case of invalid first byte of in_text.
        // All available bytes (at most `len` bytes) are consumed on incomplete/invalid second to last bytes.
        // Invalid or incomplete input may consume less bytes than wanted, therefore every byte has to be inspected in s.
        int cmp = !!s[0] + !!s[1] + !!s[2] + !!s[3];
        wanted = wanted < cmp ? wanted : cmp;
        *out_char = 0xFFFD;
    }
    return wanted;
}
Then loop through the string and percent-encode each character that is part of a multi-byte UTF-8 sequence:
#include <cstdio>
#include <string>

std::string FormatUtf8(const char* text, size_t size); // defined below

int main()
{
    std::string utf8_string(u8"(#1476710203) éf.pdf");
    std::string result;
    result.reserve(utf8_string.size());

    const char* text_begin = utf8_string.c_str();
    const char* text_end = text_begin + utf8_string.size();
    for (const char* cursor = text_begin; cursor < text_end;)
    {
        unsigned int wide_char = (unsigned char)*cursor; // cast avoids sign extension for bytes >= 0x80
        // this is an ASCII char
        if (wide_char < 0x80)
        {
            result.push_back(*cursor);
            cursor += 1;
        }
        // this is a multi-byte UTF-8 char
        else
        {
            size_t utf8_len = WideCharFromUtf8(&wide_char, cursor, text_end);
            if (wide_char == 0) // Malformed UTF-8?
                break;
            result += FormatUtf8(cursor, utf8_len);
            cursor += utf8_len;
        }
    }
    printf("%s\n", result.c_str());
}
And the FormatUtf8 function, which converts a string to a hex string, was taken from this answer:
// convert array of char to hex, thanks to this answer https://stackoverflow.com/a/3382894/13253010
std::string FormatUtf8(const char* text, size_t size)
{
    const char hex_digits[] = "0123456789ABCDEF";

    std::string output;
    output.reserve(size * 3);
    for (const char* cursor = text; cursor < text + size; cursor++)
    {
        unsigned char chr = *cursor;
        output.push_back('%');
        output.push_back(hex_digits[chr >> 4]);
        output.push_back(hex_digits[chr & 15]);
    }
    return output;
}
The output:
(#1476710203) %C3%A9f.pdf
This should be enough to solve the problem as asked, since you wanted the UTF-8 characters encoded rather than full URL encoding; handling the full set of URL-reserved characters is beyond the scope of this answer.
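That said, if you also want the ASCII side percent-encoded as in the example output, a minimal sketch is to escape every byte that is not in the RFC 3986 unreserved set (the IsUnreserved helper below is my name, not part of the answer above):

bool IsUnreserved(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~';
}

Then, in the ASCII branch of the loop above, replace the plain result.push_back(*cursor) with:

if (IsUnreserved((unsigned char)*cursor))
    result.push_back(*cursor);       // emit as-is
else
    result += FormatUtf8(cursor, 1); // e.g. '(' -> %28, ' ' -> %20

Note that strictly applying the unreserved set would also escape '#' (to %23), while the example output keeps it literal; whether reserved characters like '#' stay raw depends on which URL component the string lands in.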
I am writing some simple code to encode files to base64. I have a short C++ program that reads a file into a vector and converts it to an unsigned char*. I do this so I can properly use the encoding function I got.
The problem: it works with text files (of different sizes), but it won't work with image files. And I can't figure out why. What gives?
For a simple text.txt containing the text abcd, the output of both my code and bash's $( base64 text.txt ) is the same.
On the other hand, when I input an image, the output is something like iVBORwOKGgoAAAAAAA......AAA==, or sometimes it ends with a corrupted size vs prev_size Aborted (core dumped); the first few bytes are correct.
The code:
static std::vector<char> readBytes(char const* filename)
{
    std::ifstream ifs(filename, std::ios::binary|std::ios::ate);
    std::ifstream::pos_type pos = ifs.tellg();

    std::vector<char> result(pos);
    ifs.seekg(0, std::ios::beg);
    ifs.read(&result[0], pos);
    return result;
}
static char Base64Digits[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

int ToBase64Simple( const BYTE* pSrc, int nLenSrc, char* pDst, int nLenDst )
{
    int nLenOut = 0;
    while ( nLenSrc > 0 ) {
        if (nLenOut+4 > nLenDst) {
            cout << "error\n";
            return(0); // error
        }
        // read three source bytes (24 bits)
        BYTE s1 = pSrc[0]; // (but avoid reading past the end)
        BYTE s2 = 0; if (nLenSrc > 1) s2 = pSrc[1]; //------ corrected, thanks to jprichey
        BYTE s3 = 0; if (nLenSrc > 2) s3 = pSrc[2];
        DWORD n;
        n = s1;   // xxx1
        n <<= 8;  // xx1x
        n |= s2;  // xx12
        n <<= 8;  // x12x
        n |= s3;  // x123
        //-------------- get four 6-bit values for lookups
        BYTE m4 = n & 0x3f; n >>= 6;
        BYTE m3 = n & 0x3f; n >>= 6;
        BYTE m2 = n & 0x3f; n >>= 6;
        BYTE m1 = n & 0x3f;
        //------------------ lookup the right digits for output
        BYTE b1 = Base64Digits[m1];
        BYTE b2 = Base64Digits[m2];
        BYTE b3 = Base64Digits[m3];
        BYTE b4 = Base64Digits[m4];
        //--------- end of input handling
        *pDst++ = b1;
        *pDst++ = b2;
        if ( nLenSrc >= 3 ) { // 24 src bits left to encode, output xxxx
            *pDst++ = b3;
            *pDst++ = b4;
        }
        if ( nLenSrc == 2 ) { // 16 src bits left to encode, output xxx=
            *pDst++ = b3;
            *pDst++ = '=';
        }
        if ( nLenSrc == 1 ) { // 8 src bits left to encode, output xx==
            *pDst++ = '=';
            *pDst++ = '=';
        }
        pSrc += 3;
        nLenSrc -= 3;
        nLenOut += 4;
    }
    // Could optionally append a NULL byte like so:
    *pDst++ = 0; nLenOut++;
    return( nLenOut );
}
int main(int argc, char* argv[])
{
    std::vector<char> mymsg;
    mymsg = readBytes(argv[1]);
    char* arr = &mymsg[0];
    int len = mymsg.size();
    int lendst = ((len+2)/3)*4;

    unsigned char* uarr = (unsigned char *) malloc(len*sizeof(unsigned char));
    char* dst = (char *) malloc(lendst*sizeof(char));

    mymsg.clear(); //free()

    // convert to unsigned char
    strncpy((char*)uarr, arr, len);

    int lenOut = ToBase64Simple(uarr, len, dst, lendst);
    free(uarr);

    int cont = 0;
    while (cont < lenOut) //(dst[cont] != 0)
        cout << dst[cont++];
    cout << "\n";
}
Any insight is welcome.
I see three problems.
First, you are clearing your mymsg vector before you're done using it. This leaves the arr pointer dangling (pointing at memory that is no longer allocated). When you access arr to get the data out, you end up with undefined behavior.
Then you use strncpy to copy (potentially) binary data. This copy will stop when it reaches the first nul (0) byte within the file, so not all of your data will be copied. You should use memcpy instead.
Finally, ToBase64Simple appends a NUL terminator, so it writes lendst + 1 bytes into a dst buffer that you allocated with only lendst bytes. That one-byte heap overrun is the likely cause of the corrupted size vs prev_size abort; allocate lendst + 1 bytes instead.
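With those fixes applied, a minimal corrected main might look like this (a sketch; it reuses readBytes and ToBase64Simple unchanged, drops the pointless intermediate copy, and allocates one extra byte for the NUL the encoder appends):

int main(int argc, char* argv[])
{
    std::vector<char> mymsg = readBytes(argv[1]);
    int len = (int)mymsg.size();
    int lendst = ((len + 2) / 3) * 4 + 1; // +1 for the NUL that ToBase64Simple appends

    std::vector<char> dst(lendst);
    int lenOut = ToBase64Simple(reinterpret_cast<const BYTE*>(mymsg.data()),
                                len, dst.data(), lendst);
    if (lenOut > 0)
        cout << dst.data() << "\n"; // buffer was NUL-terminated by the encoder
}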
I am trying to automate the process of adding software policy hash rules to Windows, and am currently having a problem adding valid hashes to the registry. This code creates a key and adds the hash to the registry:
HKEY* m_hKey;
string md5Digest;
string valueName = "ItemData";
vector<BYTE> itemData;

/*
use Crypto++ to get file hash
*/

//convert string to format that can be loaded into registry
for (int i = 1; i < md5Digest.length(); i += 2)
    itemData.push_back('0x' + md5Digest[i - 1] + md5Digest[i]);

// Total data size, in bytes
const DWORD dataSize = static_cast<DWORD>(itemData.size());

::RegSetValueEx(
    m_hKey,
    valueName.c_str(),
    0, // reserved
    REG_BINARY,
    &itemData[0],
    dataSize
);
This works fine and adds the key to the registry. But when comparing my registry key to a rule added by Group Policy, there is a very important difference: the ItemData values do not match, and the Group Policy rule's ItemData value is the correct output. When debugging the program I can clearly see that md5Digest has the correct value, so the problem lies in the conversion of the md5Digest string into the ItemData vector of BYTEs (unsigned chars).
What is the problem with my code? Why is the data being entered into the registry incorrectly?
You have a string that you want to convert to a byte array. You can write a helper function to convert 2 chars to a BYTE:
using BYTE = unsigned char;

BYTE convert(char a, char b)
{
    // Convert hex char to byte
    // Needs fixing for lower-case input
    if (a >= '0' && a <= '9') a -= '0';
    else a -= 55; // 55 = 'A' - 10
    if (b >= '0' && b <= '9') b -= '0';
    else b -= 55;
    return (a << 4) | b;
}
....
vector<BYTE> v;
string s = "3D65B8EBDD0E";

for (size_t i = 0; i < s.length(); i += 2) {
    v.push_back(convert(s[i], s[i+1]));
}
v now contains {0x3D, 0x65, 0xB8, 0xEB, 0xDD, 0x0E}
Or, as mentioned by @RbMm, you can use the CryptStringToBinary Windows function:
#include <wincrypt.h>
...
std::string s = "3D65B8EBDD0E";
DWORD hex_len = s.length() / 2;
BYTE *buffer = new BYTE[hex_len];

CryptStringToBinary(s.c_str(),
                    s.length(),
                    CRYPT_STRING_HEX,
                    buffer,
                    &hex_len,
                    NULL,
                    NULL);
You have the '0x' two-character literal summed with md5Digest[i - 1] + md5Digest[i] and then truncated to a BYTE. It looks like you were trying to build a "0xFF"-style byte value out of them, but that is not what the expression does. If you just want the hex text, you can store the md5 string directly:
const DWORD dataSize = static_cast<DWORD>(md5Digest.size());

::RegSetValueEx(
    m_hKey,
    valueName.c_str(),
    0, // reserved
    REG_BINARY,
    reinterpret_cast<BYTE const *>(md5Digest.data()),
    dataSize
);
If you actually need to store the binary representation of the hex digits from the md5 string, then you need to convert them to bytes like this:
BYTE char_to_halfbyte(char const c)
{
    if (('0' <= c) && (c <= '9'))
    {
        return(static_cast<BYTE>(c - '0'));
    }
    else
    {
        assert(('A' <= c) && (c <= 'F'));
        return(static_cast<BYTE>(10 + c - 'A'));
    }
}
for (std::size_t i = 0; i < md5Digest.length(); i += 2)
{
    assert((i + 1) < md5Digest.length());
    itemData.push_back(
        (char_to_halfbyte(md5Digest[i]) << 4) |
         char_to_halfbyte(md5Digest[i + 1]));
}
I want to base64-encode a big file (500MB).
I use this code, but it doesn't work for a large file.
I tested CryptStringToBinary, but it doesn't work either.
What should I do?
The issue is clearly that there is not enough memory to store a 500 megabyte string in a 32-bit application.
One solution is alluded to by this link, which writes the data to a string. Assuming that code works correctly, it is not hard to adjust it to write to a file stream instead.
#include <windows.h>
#include <cstring> // for strlen in the example call below
#include <fstream>

static const wchar_t *Base64Digits = L"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

int Base64Encode(const BYTE* pSrc, int nLenSrc, std::wostream& pDstStrm, int nLenDst)
{
    wchar_t pDst[4];
    int nLenOut = 0;
    while (nLenSrc > 0) {
        if (nLenDst < 4) return(0);
        int len = 0;
        BYTE s1 = pSrc[len++];
        BYTE s2 = (nLenSrc > 1) ? pSrc[len++] : 0;
        BYTE s3 = (nLenSrc > 2) ? pSrc[len++] : 0;
        pSrc += len;
        nLenSrc -= len;
        //------------------ lookup the right digits for output
        pDst[0] = Base64Digits[(s1 >> 2) & 0x3F];
        pDst[1] = Base64Digits[(((s1 & 0x3) << 4) | ((s2 >> 4) & 0xF)) & 0x3F];
        pDst[2] = Base64Digits[(((s2 & 0xF) << 2) | ((s3 >> 6) & 0x3)) & 0x3F];
        pDst[3] = Base64Digits[s3 & 0x3F];
        //--------- end of input handling
        if (len < 3) { // less than 24 src bits encoded, pad with '='
            pDst[3] = L'=';
            if (len == 1)
                pDst[2] = L'=';
        }
        nLenOut += 4;
        // write the data to the stream
        pDstStrm.write(pDst, 4);
        nLenDst -= 4;
    }
    return (nLenOut);
}
The only changes were to write each group of four characters to a wide stream instead of appending them to a string, and to drop the final NUL write, which no longer has a destination buffer to terminate.
Here is an example call:
int main()
{
    std::wofstream ofs(L"testfile.out");
    Base64Encode((BYTE*)"This is a test", strlen("This is a test"), ofs, 1000);
}
The above produces a file with the base64 string VGhpcyBpcyBhIHRlc3Q=, which, when decoded, produces This is a test.
Note that the parameter is std::wostream, which means any wide output stream class (such as std::wostringstream) will work also.
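For the actual 500MB case, a driver can then feed the encoder in chunks whose size is a multiple of 3, so that '=' padding is only ever produced at the true end of the input (a sketch; the file names and the 3 * 4096 chunk size are arbitrary choices, and it needs these headers in addition to the ones above):

#include <climits> // INT_MAX
#include <vector>

int main()
{
    std::ifstream ifs("bigfile.bin", std::ios::binary);
    std::wofstream ofs(L"bigfile.b64");
    std::vector<char> buf(3 * 4096); // multiple of 3: no mid-stream padding

    while (ifs) {
        ifs.read(buf.data(), buf.size());
        std::streamsize got = ifs.gcount();
        if (got <= 0) break;
        Base64Encode((const BYTE*)buf.data(), (int)got, ofs, INT_MAX); // output limit effectively unbounded
    }
}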
A 200 byte message has one random byte corrupted.
What's the most efficient way to fix the corrupt byte?
A Hamming(255,247) code has 8 bytes of overhead, but is simple to implement.
Reed-Solomon error correction has 2 bytes of overhead, but is complex to implement.
Is there a simpler method that I'm overlooking?
I found a paper with a method that's perfect for this case: two bytes of overhead, simple to implement. Here's the code:
// Single-byte error correction for messages <255 bytes long
// using two check bytes. Based on "CA-based byte error-correcting code"
// by Chowdhury et al.
//
// rmmh 2013
#include <stdint.h>

uint8_t lfsr(uint8_t x) {
    return (x >> 1) ^ (-(x & 1) & 0x8E);
}

void eccComputeChecks(uint8_t *data, int data_len, uint8_t *out_c0, uint8_t *out_c1) {
    uint8_t c0 = 0; // parity: m_0 ^ m_1 ^ ... ^ m_n-1
    uint8_t c1 = 0; // lfsr: f^n(m_0) ^ f^(n-1)(m_1) ^ ... ^ f(m_n-1)
    for (int i = 0; i < data_len; ++i) {
        c0 ^= data[i];
        c1 = lfsr(c1 ^ data[i]);
    }
    *out_c0 = c0;
    *out_c1 = c1;
}

void eccEncode(uint8_t *data, int data_len, uint8_t check[2]) {
    eccComputeChecks(data, data_len, &check[0], &check[1]);
}

bool eccDecode(uint8_t *data, int data_len, uint8_t check[2]) {
    uint8_t s0, s1;
    eccComputeChecks(data, data_len, &s0, &s1);
    s0 ^= check[0];
    s1 ^= check[1];
    if (s0 && s1) {
        int error_index = data_len - 255;
        while (s1 != s0) { // find i s.t. s1 = lfsr^i(s0)
            s1 = lfsr(s1);
            error_index++;
        }
        if (error_index < 0 || error_index >= data_len) {
            // multi-byte error?
            return false;
        }
        data[error_index] ^= s0;
    } else if (s0 || s1) {
        // only one check byte mismatched: the corruption hit a check byte itself, the data is intact
    }
    return true;
}
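A quick self-test sketch (my own harness, not from the paper; it matches the 200-byte scenario in the question and needs the functions above in scope):

#include <cassert>

int main()
{
    uint8_t msg[200];
    for (int i = 0; i < 200; ++i) msg[i] = (uint8_t)(i * 7 + 1); // arbitrary test data

    uint8_t check[2];
    eccEncode(msg, 200, check);

    msg[123] ^= 0x5A; // corrupt one random byte
    bool ok = eccDecode(msg, 200, check);
    assert(ok && msg[123] == (uint8_t)(123 * 7 + 1)); // error located and repaired
    return 0;
}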
Using Reed-Solomon to correct a single-byte error would not be that complicated. Use a generator polynomial of the form (using ⊕ to mean xor):
g(x) = (x ⊕ 1)(x ⊕ 2) = x^2 + 3x + 2.
Encode the message as usual.
For decode, generate the two syndromes S(0) and S(1) in the normal way.
if (S(0) != 0) {
    error value = S(0)
    error location = log2(S(1)/S(0))
}
Error location is counted from right to left (0 == rightmost byte). If the code is shortened and the location is out of range, an uncorrectable error has been detected.
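To make that concrete, here is a sketch over GF(2^8) with the common primitive polynomial 0x11D and generator α = 2 (the table-based arithmetic and the function names are my choices, not part of the answer above). Encoding appends the remainder of M(x)·x^2 mod g(x); decoding evaluates the received word at 1 and 2 and applies the rule above, with log2 realized as a discrete-log table lookup. Call gf_init() once first:

#include <cstdint>

static uint8_t gf_exp[512], gf_log[256];

static void gf_init() {
    int x = 1;
    for (int i = 0; i < 255; ++i) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;
        if (x & 0x100) x ^= 0x11D; // reduce by the primitive polynomial
    }
    for (int i = 255; i < 512; ++i) gf_exp[i] = gf_exp[i - 255];
}

static uint8_t gf_mul(uint8_t a, uint8_t b) {
    return (a && b) ? gf_exp[gf_log[a] + gf_log[b]] : 0;
}

// Append check[0], check[1]: the remainder of M(x)*x^2 mod g(x) = x^2 + 3x + 2.
static void rs2_encode(const uint8_t* msg, int len, uint8_t check[2]) {
    uint8_t r1 = 0, r0 = 0;
    for (int i = 0; i < len; ++i) {
        uint8_t f = (uint8_t)(msg[i] ^ r1); // feedback coefficient
        r1 = (uint8_t)(r0 ^ gf_mul(f, 3));
        r0 = gf_mul(f, 2);
    }
    check[0] = r1; check[1] = r0;
}

// code = msg followed by the two check bytes, n = len + 2. Fixes one byte error in place.
static bool rs2_decode(uint8_t* code, int n) {
    uint8_t s0 = 0, s1 = 0; // S(0) = C(1), S(1) = C(2), via Horner evaluation
    for (int i = 0; i < n; ++i) {
        s0 ^= code[i];
        s1 = (uint8_t)(gf_mul(s1, 2) ^ code[i]);
    }
    if (!s0 && !s1) return true;                   // no error
    if (!s0 || !s1) return false;                  // cannot come from a single-byte error
    int p = (gf_log[s1] - gf_log[s0] + 255) % 255; // log2(S(1)/S(0)): power of x at the error
    int idx = n - 1 - p;                           // convert right-to-left power to array index
    if (idx < 0 || idx >= n) return false;         // shortened code: location out of range
    code[idx] ^= s0;                               // error value is S(0)
    return true;
}

For a 200-byte message, rs2_encode(msg, 200, check) followed by rs2_decode on the 202-byte codeword gives exactly the two bytes of overhead mentioned in the question.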