Windows Clipboard won't preserve ASCII character - c++

I am writing a plugin for the TS3 Client, but I ran into an issue...
One of the channel names has a special sign (╠) in it, which I think is a character from the extended ASCII table.
When logging it INSIDE TeamSpeak, the character shows fine, but when I try to copy it to the Windows clipboard using its C interface, a completely different character (â) comes out.
I have tried converting it to WCHAR after I read that the extended ASCII table uses more bytes than a regular char, but that didn't work either.
I use the following code to copy the char* to the clipboard, which I found somewhere and altered with some other code I found for using WCHAR:
void SaveClipboard(char* tx)
{
    WCHAR text[140];
    swprintf(text, 140, L"%hs", tx);
    if(OpenClipboard(NULL))
    {
        EmptyClipboard();
        HGLOBAL global = GlobalAlloc(GMEM_DDESHARE, 2 * (wcslen(text) + 1)); //text size + \0 character
        WCHAR* pchData;
        pchData = (WCHAR*)GlobalLock(global);
        wcscpy(pchData, text);
        GlobalUnlock(pchData);
        SetClipboardData(CF_UNICODETEXT, global);
        CloseClipboard();
    }
}

wchar_t strings on Windows are UTF-16 encoded, but the data you get is UTF-8 encoded. Your code doesn't convert between these two encodings, it simply reinterprets the bytes.
Looking at the encodings of those characters, it should become obvious what's happening: the UTF-8 encoding of ╠ (U+2560) is 0xE2 0x95 0xA0, the UTF-16 code unit for â is 0x00E2, and the UTF-16 code unit for ╠ is 0x2560.
swprintf(text, 140, L"%hs", tx); <- This simply converts each char into a wchar_t, turning the 3-byte UTF-8 sequence 0xE2 0x95 0xA0 into three 2-byte UTF-16 code units: 0x00E2, 0x0095 and 0x00A0 - and 0x00E2 is â.
To get 0x2560 from 0xE2 0x95 0xA0 you need to actually convert the data:
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
std::wstring text = converter.from_bytes(tx);
Alternatively, since you are using WINAPI already, you can use MultiByteToWideChar:
WCHAR text[140];
int length = MultiByteToWideChar(CP_UTF8, 0, tx, -1, text, 140);
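Putting it together, a corrected version of the original function might look like this; it is only a sketch that assumes the input from the TS3 client is UTF-8 and omits most error handling:
void SaveClipboard(const char* tx)
{
    // Ask MultiByteToWideChar for the required size first (in WCHARs, including the terminator)
    int len = MultiByteToWideChar(CP_UTF8, 0, tx, -1, NULL, 0);
    if (len <= 0 || !OpenClipboard(NULL))
        return;

    EmptyClipboard();
    HGLOBAL global = GlobalAlloc(GMEM_MOVEABLE, sizeof(WCHAR) * len);
    if (global)
    {
        WCHAR* pchData = (WCHAR*)GlobalLock(global);
        MultiByteToWideChar(CP_UTF8, 0, tx, -1, pchData, len);
        GlobalUnlock(global); // note: GlobalUnlock takes the HGLOBAL, not the locked pointer
        SetClipboardData(CF_UNICODETEXT, global);
    }
    CloseClipboard();
}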

Related

Conversion from WCHAR to const unsigned char

I am a WiX application packager. I am quite new to C++ and I am stuck with the issue below.
In my code, I am trying to convert a WCHAR string to const unsigned char. I have tried quite a few solutions that I found on the Internet, but I am unable to do it.
WCHAR szabc[888] = L"Example";
const unsigned char* pText = (const unsigned char*)szabc;
For your reference, the value of szabc is hard-coded here, but ideally it is fetched as user input during installation. szabc needs to be converted to const unsigned char*, as operator= doesn't work for this conversion.
I am not getting any compilation error, but when I run this code, only the first character of szabc seems to be assigned to pText; I want the whole value of szabc to be assigned to pText.
The value of pText is a user account password in the real scenario, and it will be passed to a method that encrypts it.
Since you neglected to mention your OS, I am assuming it is Windows. You need WideCharToMultiByte or the standard wcstombs functions.
Note that both will determine the target encoding using system settings, so results will vary across computers. If possible, convert to UTF-8 or tell your users to stay away from special characters.
operator= cannot assign a value to a variable of an unrelated type, which is why you cannot assign a WCHAR[] directly to an unsigned char*.
However, the real problem is with how the pointed-to data is being interpreted. You have a 16-bit Unicode string, and you are trying to pass it to a method that clearly wants a null-terminated 8-bit string instead.
On Windows, WCHAR is 2 bytes, so for ASCII characters the 2nd byte of each WCHAR in your string is 0x00, eg:
WCHAR szabc[] = {L'E', L'x', L'a', L'm', L'p', L'l', L'e', L'\0'};
Has the same memory layout as this:
BYTE szabc[] = {'E', 0x00, 'x', 0x00, 'a', 0x00, 'm', 0x00, 'p', 0x00, 'l', 0x00, 'e', 0x00, '\0', 0x00};
This is why the method appears to see only 1 "character". It stops reading when it encounters the 1st 0x00 byte.
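To see this concretely, here is a small illustration (hypothetical, not from the original answer): scanning the wide buffer as if it were a narrow string stops at the first 0x00 byte:
#include <cstdio>
#include <cstring>
#include <windows.h>

int main()
{
    WCHAR szabc[] = L"Example";
    // The second byte of L'E' is 0x00, so a narrow-string scan sees a 1-character string.
    printf("%zu\n", strlen((const char*)szabc)); // prints 1
    return 0;
}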
Thus, a simple pointer type-cast will not suffice. You will need to either:
use an 8-bit string to begin with, eg:
CHAR szabc[888] = "Example";
unsigned char* pText = (unsigned char*)szabc;
// use pText as needed...
convert the Unicode data at runtime, using WideCharToMultiByte() or equivalent, eg:
WCHAR szabc[888] = L"Example";
int len = WideCharToMultiByte(CP_ACP, 0, szabc, -1, NULL, 0, NULL, NULL);
CHAR* szConverted = new char[len];
WideCharToMultiByte(CP_ACP, 0, szabc, -1, szConverted, len, NULL, NULL);
unsigned char* pText = (unsigned char*)szConverted;
// use pText as needed...
delete[] szConverted;
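If the manual new/delete feels error-prone, the same conversion can be sketched with a std::string owning the buffer (and with CP_UTF8, as the other answer suggests); this is an illustrative variation, not the original answer's code:
#include <string>
#include <windows.h>

int main()
{
    WCHAR szabc[888] = L"Example";

    // First call: query the required size in bytes (includes the null terminator)
    int len = WideCharToMultiByte(CP_UTF8, 0, szabc, -1, NULL, 0, NULL, NULL);
    std::string converted(len > 0 ? len : 1, '\0');
    // Second call: do the actual conversion into the string's buffer
    WideCharToMultiByte(CP_UTF8, 0, szabc, -1, &converted[0], len, NULL, NULL);

    const unsigned char* pText = (const unsigned char*)converted.c_str();
    // use pText as needed; the buffer lives as long as 'converted' does
    return 0;
}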

Unicode to UTF-8 Conversion

I am trying to convert a Unicode string to a UTF-8 string:
#include <stdio.h>
#include <string>
#include <atlconv.h>
#include <atlstr.h>
using namespace std;
CStringA ConvertUnicodeToUTF8(const CStringW& uni)
{
    if (uni.IsEmpty()) return "";
    CStringA utf8;
    int cc = 0;
    if ((cc = WideCharToMultiByte(CP_UTF8, 0, uni, -1, NULL, 0, 0, 0) - 1) > 0)
    {
        char *buf = utf8.GetBuffer(cc);
        if (buf) WideCharToMultiByte(CP_UTF8, 0, uni, -1, buf, cc, 0, 0);
        utf8.ReleaseBuffer();
    }
    return utf8;
}

int main(void)
{
    string u8str = ConvertUnicodeToUTF8(L"gökhan");
    printf("%d\n", u8str.size());
    return 0;
}
My question is: shouldn't u8str.size() return 6? It prints 7 now!
7 is correct. The non-ASCII character ö is encoded with two bytes.
By definition, "multi-byte" means that a Unicode code point may occupy more than one byte; in UTF-8 that is up to 4 bytes nowadays (the original design allowed up to 6), see here: How many bytes does one Unicode character take?
Further reading: http://www.joelonsoftware.com/articles/Unicode.html
A Unicode codepoint uses 2 or 4 bytes in UTF-16, but 1-4 bytes in UTF-8, depending on its value. A codepoint that fits in 2 bytes in UTF-16 can take up to 3 bytes in UTF-8, so a UTF-8 string may use more bytes than the corresponding UTF-16 string. UTF-8 tends to be more compact for Latin/Western languages, while UTF-16 tends to be more compact for East Asian languages.
std::(w)string::size() and CStringT::GetLength() count the number of encoded codeunits, not the number of codepoints. In your example, "gökhan" is encoded as:
UTF-16LE: 0x0067 0x00f6 0x006b 0x0068 0x0061 0x006e
UTF-16BE: 0x6700 0xf600 0x6b00 0x6800 0x6100 0x6e00
UTF-8: 0x67 0xc3 0xb6 0x6b 0x68 0x61 0x6e
Notice that ö is encoded using 1 codeunit in UTF-16 (LE: 0x00f6, BE: 0xf600) but uses 2 codeunits in UTF-8 (0xc3 0xb6). That is why your UTF-8 string has a size of 7 instead of 6.
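A quick way to see the difference (a sketch assuming the source file is saved as UTF-8 and a pre-C++20 compiler, where u8 literals are plain char):
#include <cstdio>
#include <string>

int main()
{
    std::u16string utf16 = u"gökhan";  // 6 UTF-16 code units
    std::string    utf8  = u8"gökhan"; // 7 UTF-8 code units: "ö" takes two bytes
    printf("%zu %zu\n", utf16.size(), utf8.size()); // prints 6 7
    return 0;
}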
That being said, when calling WideCharToMultiByte() and MultiByteToWideChar() with -1 as the source length, the function has to manually count the characters, and the return value will include room for a null terminator when the destination pointer is NULL. You don't need that extra space when using CStringA/W, std::(w)string, etc, and you don't need the overhead of counting characters when the source already knows its length. You should always specify the actual source length when you know it, eg:
CStringA ConvertUnicodeToUTF8(const CStringW& uni)
{
    CStringA utf8;
    int cc = WideCharToMultiByte(CP_UTF8, 0, uni, uni.GetLength(), NULL, 0, 0, 0);
    if (cc > 0)
    {
        char *buf = utf8.GetBuffer(cc);
        if (buf)
        {
            cc = WideCharToMultiByte(CP_UTF8, 0, uni, uni.GetLength(), buf, cc, 0, 0);
            utf8.ReleaseBuffer(cc);
        }
    }
    return utf8;
}

Simplify a c++ expression in string encoding answer

In this question: Convert ISO-8859-1 strings to UTF-8 in C/C++
There is a really nice concise piece of c++ code that converts ISO-8859-1 strings to UTF-8.
In this answer: https://stackoverflow.com/a/4059934/3426514
I'm still a beginner at C++ and I'm struggling to understand how this works. I have read up on the encoding sequences of UTF-8, and I understand that below 128 the characters stay the same, while above 128 the first byte gets a prefix and the remaining bits are spread over a couple of bytes starting with 10xx, but I see no bit shifting in this answer.
If someone could help me to decompose it into a function that only processes 1 character, it would really help me understand.
Code, commented.
This relies on the fact that the Latin-1 characters 0x00 through 0xff map to the consecutive UTF-8 sequences 0x00-0x7f, 0xc2 0x80-0xbf, and 0xc3 0x80-0xbf.
// converting one byte (latin-1 character) of input
while (*in)
{
    if ( *in < 0x80 )
    {
        // just copy
        *out++ = *in++;
    }
    else
    {
        // first byte is 0xc2 for 0x80-0xbf, 0xc3 for 0xc0-0xff
        // (the condition in () evaluates to true / 1)
        *out++ = 0xc2 + ( *in > 0xbf ),
        // second byte is the lower six bits of the input byte
        // with the highest bit set (and, implicitly, the second-
        // highest bit unset)
        *out++ = ( *in++ & 0x3f ) + 0x80;
    }
}
The problem with a function processing a single (input) character is that the output could be either one or two bytes, making the function a bit awkward to use. You are usually better off (both in performance and cleanliness of code) with processing whole strings.
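Purely as an illustration (a hypothetical helper, not part of the answer above), a single-character version would have to report how many output bytes it produced:
// Convert one Latin-1 byte to UTF-8. Writes 1 or 2 bytes into 'out'
// (which must have room for 2) and returns the number of bytes written.
int latin1_char_to_utf8(unsigned char in, unsigned char* out)
{
    if (in < 0x80)
    {
        out[0] = in;                 // ASCII range: copy unchanged
        return 1;
    }
    out[0] = 0xc2 + (in > 0xbf);     // lead byte: 0xc2 for 0x80-0xbf, 0xc3 for 0xc0-0xff
    out[1] = (in & 0x3f) + 0x80;     // continuation byte: low six bits with the high bit set
    return 2;
}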
Note that the assumption of Latin-1 as input encoding is very likely to be wrong. For example, Latin-1 doesn't have the Euro sign (€), or any of these characters ŠšŽžŒœŸ, which makes most people in Europe use either Latin-9 or CP-1252, even if they are not aware of it. ("Encoding? No idea. Latin-1? Yea, that sounds about right.")
All that being said, that's the C way to do it. The C++ way would (probably, hopefully) look more like this:
#include <unistr.h>
#include <bytestream.h>
// ...
icu::UnicodeString ustr( in, "ISO-8859-1" );
// ...work with a properly Unicode-aware string class...
// ...convert to UTF-8 if necessary.
char buffer[ BUFSIZE ];
icu::CheckedArrayByteSink bs( buffer, BUFSIZE );
ustr.toUTF8( bs );
That is using the International Components for Unicode (ICU) library. Note how easily this is adapted to a different input encoding. Different output encodings, iostream operators, character iterators, and even a C API are readily available from the library.

char type and re-encoding ASCII text into UTF-16

I am using libiconv to convert my char array into a UTF-16 string. I have some doubts.
The signature of the iconv function:
size_t iconv(iconv_t cd,
             const char** inbuf, size_t* inbytesleft,
             char** outbuf, size_t* outbytesleft);
Does that mean char is used to hold whatever type of character is being converted (char vs. wide char)?
My C teacher at school taught me that for odd or unreadable characters we should use wchar_t, so I'm rather confused now.
I tested this on the input "KOTEX" (ASCII encoded), expecting to get back a string of double the length encoded as UTF-16. It fails immediately. But if I change the destination code page to UTF-8, it works, although the returned data is lost. Why is that?
The buffer arguments to iconv are, in effect, char*, but that is not intended to imply that they actually represent C strings. (It might have been less confusing had the interface used uint8_t* instead, but that's anachronistic; iconv was around before stdint.h.)
The POSIX standard (and the Linux manpage) tries to make this clear:
The type of inbuf and outbuf, char **, does not imply that the objects pointed to are interpreted as null-terminated C strings or arrays of characters. Any interpretation of a byte sequence that represents a character in a given character set encoding scheme is done internally within the codeset converters. (POSIX.2008)
So if you are planning on converting to UTF-16, you should provide an output buffer with an appropriate datatype for UTF-16. wchar_t is not an appropriate datatype; on many systems, it will be too big. uint16_t would be fine.
Note that there are actually three different UTF-16 encodings (the names are system-dependent; the ones here are recognized by Gnu iconv):
UTF16LE (or UTF-16LE): "Little endian" UTF-16. In this format, the low-order byte of each character is first, followed by the high-order byte. KOTEX is
{0x4B, 0x00, 0x4F, 0x00, 0x54, 0x00, 0x45, 0x00, 0x58, 0x00}
UTF16BE (or UTF-16BE): "Big endian" UTF-16. In this format, the high-order byte of each character is first, followed by the low-order byte. KOTEX is:
{0x00, 0x4B, 0x00, 0x4F, 0x00, 0x54, 0x00, 0x45, 0x00, 0x58}
UTF16 (or UTF-16): either UTF16BE or UTF16LE, depending on whether the machine is big-endian or little-endian; converted strings start with a Byte Order Mark (BOM). On a little-endian machine (mine), KOTEX is
{0xFF, 0xFE, 0x4B, 0x00, 0x4F, 0x00, 0x54, 0x00, 0x45, 0x00, 0x58, 0x00}
On a big-endian machine, it would be:
{0xFE, 0xFF, 0x00, 0x4B, 0x00, 0x4F, 0x00, 0x54, 0x00, 0x45, 0x00, 0x58}
The fact that UTF16 (unadorned with endian specification) always starts with a BOM means that you have to remember to provide an extra (2-byte) character in the output buffer. Otherwise, you'll end up with E2BIG.
In all three of these encodings, characters outside the basic multilingual plane (BMP) require two (two-byte) code unit positions, a so-called surrogate pair. All ASCII characters are in the BMP, so you don't need to worry about this for ASCII-to-UTF-16 conversion, but you would if you were doing UTF-8-to-UTF-16.
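As a concrete sketch of the ASCII-to-UTF-16 case (assuming GNU iconv; depending on your headers, inbuf may need to be declared const char*, and error handling is kept minimal):
#include <iconv.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char in[] = "KOTEX";
    uint16_t out[32];                        // UTF-16 code units, not wchar_t

    iconv_t cd = iconv_open("UTF-16LE", "ASCII");
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    char* inbuf = (char*)in;
    char* outbuf = (char*)out;
    size_t inleft = strlen(in);              // 5 input bytes
    size_t outleft = sizeof(out);

    if (iconv(cd, &inbuf, &inleft, &outbuf, &outleft) == (size_t)-1)
        perror("iconv");

    printf("wrote %zu bytes\n", sizeof(out) - outleft); // 10 bytes for "KOTEX"

    iconv_close(cd);
    return 0;
}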

Duplicate Windows Cryptographic Service Provider results in Python w/ Pycrypto

Edits and Updates
3/24/2013:
My output hash from Python now matches the hash from C++ after converting to UTF-16 and stopping before hitting any 'e' or 'm' bytes. However, the decrypted results do not match. I know that my SHA1 hash is 20 bytes = 160 bits and RC4 keys can vary in length from 40 to 2048 bits, so perhaps there is some default salting going on in WinCrypt that I will need to mimic. CryptGetKeyParam KP_LENGTH or KP_SALT
3/24/2013:
CryptGetKeyParam KP_LENGTH is telling me that my key length is 128 bits. I'm feeding it a 160-bit hash. So perhaps it's just discarding the last 32 bits... or 4 bytes. Testing now.
3/24/2013:
Yep, that was it. If I discard the last 4 bytes of my SHA1 hash in Python... I get the same decryption results.
Quick Info:
I have a C++ program to decrypt a data block. It uses the Windows Cryptographic Service Provider, so it only works on Windows. I would like it to work on other platforms.
Method Overview:
In Windows Crypto API
An ASCII-encoded password (a byte string) is converted to a wide-character representation and then hashed with SHA1 to make a key for an RC4 stream cipher.
In Python PyCrypto
An ASCII-encoded byte string is decoded to a Python string. It is truncated based on empirically observed bytes which cause mbstowcs to stop converting in C++. This truncated string is then encoded as UTF-16, effectively padding it with 0x00 bytes between the characters. This new truncated, padded byte string is passed to a SHA1 hash and the first 128 bits of the digest are passed to a PyCrypto RC4 object.
Problem [SOLVED]
I can't seem to get the same results with Python 3.x w/ PyCrypto
C++ Code Skeleton:
HCRYPTPROV hProv = 0x00;
HCRYPTHASH hHash = 0x00;
HCRYPTKEY hKey = 0x00;
wchar_t sBuf[256] = {0};
CryptAcquireContextW(&hProv, L"FileContainer", L"Microsoft Enhanced RSA and AES Cryptographic Provider", 0x18u, 0);
CryptCreateHash(hProv, 0x8004u, 0, 0, &hHash);
//0x8004u is SHA1 flag
int len = mbstowcs(sBuf, iRec->desc, sizeof(sBuf));
//iRec is my "Record" class
//iRec->desc is 33 bytes within header of my encrypted file
//this will be used to create the hash key. (So this is the password)
CryptHashData(hHash, (const BYTE*)sBuf, len, 0);
CryptDeriveKey(hProv, 0x6801, hHash, 0, &hKey);
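//0x6801 is the CALG_RC4 flag (RC4 stream cipher)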
DWORD dataLen = iRec->compLen;
//iRec->compLen is the length of encrypted datablock
//it's also compressed that's why it's called compLen
CryptDecrypt(hKey, 0, 0, 0, (BYTE*)iRec->decrypt, &dataLen);
// iRec is my record that i'm decrypting
// iRec->decrypt is where I store the decrypted data
//&dataLen is how long the encrypted data block is.
//I get this from file header info
Python Code Skeleton:
from Crypto.Cipher import ARC4
from Crypto.Hash import SHA
#this is the Decipher method from my record class
def Decipher(self):
    #get string representation of 33byte password
    key_string = self.desc.decode('ASCII')
    #so far, these characters fail, possibly others but
    #for now I will make it a list
    stop_chars = ['e', 'm']
    #slice off anything beyond where mbstowcs will stop
    for char in stop_chars:
        wc_stop = key_string.find(char)
        if wc_stop != -1:
            #slice operation
            key_string = key_string[:wc_stop]
    #make "wide character"
    #this is equivalent to padding bytes with 0x00
    #Slice off the two byte "Byte Order Mark" 0xff 0xfe
    wc_byte_string = key_string.encode('utf-16')[2:]
    #slice off the trailing 0x00
    wc_byte_string = wc_byte_string[:len(wc_byte_string)-1]
    #hash the "wchar" byte string
    #this is the equivalent to sBuf in c++ code above
    #as determined by writing sBuf to file in tests
    my_key = SHA.new(wc_byte_string).digest()
    #create a PyCrypto cipher object
    RC4_Cipher = ARC4.new(my_key[:16])
    #store the decrypted data..these results NOW MATCH
    self.decrypt = RC4_Cipher.decrypt(self.datablock)
Suspected [EDIT: Confirmed] Causes
1. The mbstowcs conversion of the password resulted in the "original data" fed to the SHA1 hash not being the same in Python and C++. mbstowcs was stopping conversion at 0x65 and 0x6D bytes, so the original data ended up being a wide-char encoding of only part of the original 33-byte password.
2. RC4 can have variable-length keys. In the Enhanced Windows Crypto Service provider, the default length is 128 bits. Leaving the key length unspecified meant taking only the first 128 bits of the 160-bit SHA1 digest of the "original data" (see the sketch just below).
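For reference, a minimal sketch (reusing the hKey handle from the C++ skeleton above) of querying the derived key length with CryptGetKeyParam KP_LENGTH, as described in the edits:
DWORD keyLenBits = 0;
DWORD cbData = sizeof(keyLenBits);
// KP_LENGTH reports the key length in bits; here it comes back as 128
if (CryptGetKeyParam(hKey, KP_LENGTH, (BYTE*)&keyLenBits, &cbData, 0))
    printf("derived key length: %lu bits\n", keyLenBits);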
How I investigated
edit: based on my own experimenting and the suggestions of @RolandSmith, I now know that one of my problems was mbstowcs behaving in a way I wasn't expecting. It seems to stop writing to sBuf on "e" (0x65) and "m" (0x6d) (probably others). So the password "Monkey" in my description (ASCII-encoded bytes) would look like "M o n k" in sBuf, because mbstowcs stopped at the e and placed 0x00 between the bytes based on the 2-byte wchar_t typedef on my system. I found this by writing the results of the conversion to a text file.
BYTE pbHash[256]; //buffer we will store the hash digest in
DWORD dwHashLen; //store the length of the hash
DWORD dwCount;
dwCount = sizeof(DWORD); //how big is a dword on this system?

//see above: "len" is the return value from mbstowcs that tells how
//many multibyte characters were converted from the original
//iRec->desc and placed into sBuf. In some cases it's 3, 7, 9
//and always seems to stop on "e" or "m"
fstream outFile4("C:/desc_mbstowcs.txt", ios::out | ios::trunc | ios::binary);
outFile4.write((const CHAR*)sBuf, int(len));
outFile4.close();

//now get the hash size from CryptGetHashParam
//and get the actual hash from the hash object hHash
//write it to a file.
if(CryptGetHashParam(hHash, HP_HASHSIZE, (BYTE *)&dwHashLen, &dwCount, 0)) {
    if(CryptGetHashParam(hHash, 0x0002, pbHash, &dwHashLen, 0)){
        fstream outFile3("C:/test_hash.txt", ios::out | ios::trunc | ios::binary);
        outFile3.write((const CHAR*)pbHash, int(dwHashLen));
        outFile3.close();
    }
}
References:
wide characters cause problems depending on environment definition
Difference in Windows Cryptography Service between VC++ 6.0 and VS 2008
convert a utf-8 to utf-16 string
Python - converting wide-char strings from a binary file to Python unicode strings
PyCrypto RC4 example
https://www.dlitz.net/software/pycrypto/api/current/Crypto.Cipher.ARC4-module.html
Hashing a string with Sha256
http://msdn.microsoft.com/en-us/library/windows/desktop/aa379916(v=vs.85).aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/aa375599(v=vs.85).aspx
You can test the size of wchar_t with a small test program (in C):
#include <stdio.h> /* for printf */
#include <stddef.h> /* for wchar_t */
int main(int argc, char *argv[]) {
    printf("The size of wchar_t is %zu bytes.\n", sizeof(wchar_t));
    return 0;
}
You could also use printf() calls in your C++ code to write e.g. iRec->desc, the converted sBuf, and the resulting hash to the screen if you can run the C++ program from a terminal. Otherwise use fprintf() to dump them to a file.
To better mimic the behavior of the C++ program, you could even use ctypes to call mbstowcs() in your Python code.
Edit: You wrote:
One problem is definitely with mbstowcs. It seems that it's transferring an unpredictable (to me) number of bytes into my buffer to be hashed.
Keep in mind that mbstowcs returns the number of wide characters converted. In other words, a 33-byte buffer in a multi-byte encoding can contain anything from 5 characters (with 6-byte UTF-8 sequences) up to 33 characters, depending on the encoding used.
Edit2: You are using 0 as the dwFlags parameter for CryptDeriveKey. According to its documentation, the upper 16 bits should contain the key length. You should check CryptDeriveKey's return value to see if the call succeeded.
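For example, a sketch (reusing the handles from the skeleton above) that requests a 128-bit RC4 key explicitly and checks the return value:
// Key length goes in the upper 16 bits of dwFlags; the Enhanced provider
// caps RC4 at 128 bits, matching the default length observed above.
if (!CryptDeriveKey(hProv, 0x6801 /* CALG_RC4 */, hHash, 128 << 16, &hKey))
    printf("CryptDeriveKey failed: 0x%08lx\n", GetLastError());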
Edit3: You could test mbstowcs in Python (I'm using IPython here.):
In [1]: from ctypes import *
In [2]: libc = CDLL('libc.so.7')
In [3]: monkey = c_char_p(u'Monkey')
In [4]: test = c_char_p(u'This is a test')
In [5]: wo = create_unicode_buffer(256)
In [6]: nref = c_size_t(250)
In [7]: libc.mbstowcs(wo, monkey, nref)
Out[7]: 6
In [8]: print wo.value
Monkey
In [9]: libc.mbstowcs(wo, test, nref)
Out[9]: 14
In [10]: print wo.value
This is a test
Note that in Windows you should probably use libc = cdll.msvcrt instead of libc = CDLL('libc.so.7').