Convert wchar_t to char


I was wondering whether it is safe to do this:
wchar_t wide = /* something */;
assert(wide >= 0 && wide < 256 &&);
char myChar = static_cast<char>(wide);
This assumes I am pretty sure the wide char will fall within the ASCII range.

Why not just use a library routine such as wcstombs?
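A minimal sketch of what using wcstombs could look like. The helper name wide_to_narrow is made up for illustration, and the result depends on the current C locale (in the default "C" locale only plain ASCII is certain to convert):

```cpp
#include <cassert>
#include <cstdlib>
#include <cwchar>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): converts a wide string to a
// multibyte string using the current C locale.
// Returns false if some character cannot be represented in that locale.
bool wide_to_narrow(const wchar_t* src, std::string& out)
{
    assert(src != nullptr);
    // Each wide character needs at most MB_CUR_MAX bytes, plus one
    // byte for the terminating null.
    std::size_t length = std::wcslen(src);
    std::vector<char> buffer(length * MB_CUR_MAX + 1);
    std::size_t written = std::wcstombs(buffer.data(), src, buffer.size());
    if (written == static_cast<std::size_t>(-1)) {
        return false; // an unconvertible character was found
    }
    out.assign(buffer.data(), written);
    return true;
}
```

Calling wide_to_narrow(L"hello", s) should leave "hello" in s. For anything beyond ASCII you would first have to select a suitable locale with setlocale.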

assert is for ensuring that something is true in a debug mode, without it having any effect in a release build. Better to use an if statement and have an alternate plan for characters that are outside the range, unless the only way to get characters outside the range is through a program bug.
Also, depending on your character encoding, you might find a difference between the Unicode characters 0x80 through 0xff and their char version.
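As a rough sketch of the "if statement with an alternate plan" idea: the function name narrow_or and the fallback parameter are my own inventions, and the range is limited to 0-127 because that is the part that is ASCII.

```cpp
#include <cassert>

// Hypothetical helper (not from the thread): narrows a wide character that
// is expected to be ASCII, returning `fallback` for anything else.
char narrow_or(wchar_t wide, char fallback)
{
    // A bad fallback is a bug in the calling code, so a debug-only
    // assert is a reasonable fit for it.
    assert(static_cast<unsigned char>(fallback) < 128);
    // Out-of-range input can come from real data, so it gets a runtime
    // check that is still present in release builds.
    if (wide >= 0 && wide < 128) {
        return static_cast<char>(wide);
    }
    return fallback;
}
```

With this, narrow_or(L'A', '?') gives 'A', while a character such as L'\u00e9' gives '?' instead of a silently truncated value.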

You are looking for wctomb(): it's in the ANSI standard, so you can count on it. It works even when the wchar_t uses a code above 255. You almost certainly do not want to use it.
wchar_t is an integral type, so your compiler won't complain if you actually do:
char x = (char)wc;
but because it's an integral type, there's absolutely no reason to do this. If you accidentally read Herbert Schildt's C: The Complete Reference, or any C book based on it, then you're completely and grossly misinformed. Characters should be of type int or better. That means you should be writing this:
int x = getchar();
and not this:
char x = getchar(); /* <- WRONG! */
As far as integral types go, char is worthless. You shouldn't make functions that take parameters of type char, and you should not create temporary variables of type char, and the same advice goes for wchar_t as well.
char* may be a convenient typedef for a character string, but it is a novice mistake to think of this as an "array of characters" or a "pointer to an array of characters" - despite what the cdecl tool says. Treating it as an actual array of characters with nonsense like this:
for(int i = 0; s[i]; ++i) {
wchar_t wc = s[i];
char c = doit(wc);
out[i] = c;
}
is absurdly wrong. It will not do what you want; it will break in subtle and serious ways, behave differently on different platforms, and you will most certainly confuse the hell out of your users. If you see this, you are trying to reimplement wcstombs(), which is part of ANSI C already, but it's still wrong.
You're really looking for iconv(), which converts a character string from one encoding (even if it's packed into a wchar_t array), into a character string of another encoding.
Now go read this, to learn what's wrong with iconv.

An easy way is:
wstring your_wchar_in_ws(<your wchar>);
string your_wchar_in_str(your_wchar_in_ws.begin(), your_wchar_in_ws.end());
// c_str() returns a pointer to const data that is only valid while the string lives
const char* your_wchar_in_char = your_wchar_in_str.c_str();
I have been using this method for years :)

A short function I wrote a while back to pack a wchar_t array into a char array. Characters that aren't in the ASCII range (0-127) are replaced by '?' characters, and a surrogate pair is replaced by a single '?'.
size_t to_narrow(const wchar_t * src, char * dest, size_t dest_len){
    size_t i = 0; // read index into src
    size_t j = 0; // write index into dest
    if (dest_len == 0)
        return 0; // no room even for the terminator
    while (src[i] != L'\0' && j < (dest_len - 1)){
        wchar_t code = src[i];
        if (code >= 0 && code < 128)
            dest[j++] = char(code);
        else{
            dest[j++] = '?';
            // lead surrogate (0xD800-0xDBFF): skip the next code unit, which is the trail
            if (code >= 0xD800 && code <= 0xDBFF && src[i + 1] != L'\0')
                i++;
        }
        i++;
    }
    dest[j] = '\0';
    return j; // number of chars written, not counting the terminator
}

Technically, 'char' could have the same range as either 'signed char' or 'unsigned char'. For the unsigned characters, your range is correct; theoretically, for signed characters, your condition is wrong. In practice, very few compilers will object - and the result will be the same.
Nitpick: the last && in the assert is a syntax error.
Whether the assertion is appropriate depends on whether you can afford to crash when the code gets to the customer, and what you could or should do if the assertion condition is violated but the assertion is not compiled into the code. For debug work, it seems fine, but you might want an active test after it for run-time checking too.

Here's another way of doing it. Remember to use free() on the result.
#include <stdlib.h>
char* wchar_to_char(const wchar_t* pwchar)
{
    // count the wide characters in the string, not including the terminator
    size_t length = 0;
    while (pwchar[length] != L'\0')
    {
        length++;
    }
    // allocate one char per wide character, plus one for the terminator
    char* result = (char*)malloc(length + 1);
    if (result == NULL)
    {
        return NULL;
    }
    for (size_t i = 0; i < length; i++)
    {
        // narrowing cast: only meaningful for values that fit in a char (e.g. ASCII)
        result[i] = (char)pwchar[i];
    }
    result[length] = '\0';
    return result;
}

One could also convert wchar_t --> wstring --> string --> char:
wchar_t wide = /* something */;
wstring wstrValue(1, wide); // a wstring holding exactly one character
string strValue;
strValue.assign(wstrValue.begin(), wstrValue.end()); // convert wstring to string
char char_value = strValue[0];

In general, no. int(wchar_t(255)) == int(char(255)) when char is unsigned, but that just means they have the same int value. They may not represent the same characters.
You would see such a discrepancy even on the majority of Windows PCs. For instance, on Windows Code page 1250, char(0xFF) is the same character as wchar_t(0x02D9) (dot above), not wchar_t(0x00FF) (small y with diaeresis).
Note that it does not even hold for the ASCII range, as C++ doesn't even require ASCII. On IBM systems in particular you may see that 'A' != 65.

Related

C++ Pointers to arrays

#include <stdio.h>
char strA[80] = "A string to be used for demonstration purposes";
char strB[80];
char *my_strcpy(char *destination, char *source)
{
char *p = destination;
while (*source != '\0')
{
*p++ = *source++;
}
*p = '\0';
return destination;
}
int main(void)
{
my_strcpy(strB, strA);
puts(strB);
}
So my question is: when I take out this portion:
//*p= '\0';
it prints the exact same answer, so why is it necessary? From my understanding, \0 is a nul byte in memory after a string, but since the array strA already contains the nul because it is initialized from a "" literal, is it really necessary?

It seems you already know the importance of the null terminator. The point is that you defined char strB[80]; at namespace scope (with static storage duration), which causes the array strB to be zero-initialized. That's why you can't observe the difference (even if you don't append a null character, the rest of strB is already zero).
Moving the definition of strB into main, where it is not zero-initialized, makes the difference visible. strA doesn't need moving because its initial contents don't matter here.
Look at what this code actually does:
while (*source != '\0')
{
*p++ = *source++;
}
// *p = '\0';
When *source reaches a null character, the loop ends before that character is copied to *p, so you need to manually add a terminator.
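To make the effect visible without relying on uninitialized memory, here is a small sketch. The function names are made up for the demo, and the buffer is pre-filled with 'X' on purpose to stand in for "whatever was there before":

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Same loop as in the question, deliberately without the final "*p = '\0';".
char* copy_without_terminator(char* destination, const char* source)
{
    assert(destination != nullptr && source != nullptr);
    char* p = destination;
    while (*source != '\0') {
        *p++ = *source++;
    }
    return destination;
}

// Hypothetical demo: copies "hi" into a buffer that was not zeroed
// and reports how long the resulting C string is.
std::size_t length_after_unterminated_copy()
{
    char buffer[16];
    std::memset(buffer, 'X', sizeof(buffer) - 1); // simulate leftover data
    buffer[sizeof(buffer) - 1] = '\0';            // keeps the demo well-defined
    copy_without_terminator(buffer, "hi");
    return std::strlen(buffer); // 15 ("hi" followed by 13 'X'), not 2
}
```

With a zero-initialized buffer, as in the question, the same call would report 2, which is why the missing terminator goes unnoticed there.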

Your loop stops when it sees the \0 and so it is not copied to the destination and the destination is not NUL terminated. Is that a problem?
Not if your destination buffer is initialized to all 0s
Not if your code is willing to deal with fixed length strings (so the my_strcpy signature would need to change to return the length)
In general YES - the 0 terminated C string is such a common thing that not following the convention is asking for trouble.
Whether you 0 terminate or not the rest of the values will be the same as they were when you started. The 0 termination just makes your character array a "standard C string".
For argument's sake: assuming you knew every string had space for 80 chars, you could just do
for(int i = 0; i < 80; i++)
{
dest[i] = src[i];
}
The effect is the same and assuming the source is 0 terminated the destination will be too.

Converting an unsigned char (BYTE) array to const wchar_t* (LPCWSTR)

Alright so I have a BYTE array that I need to ultimately convert into a LPCWSTR or const WCHAR* to use in a built in function. I have been able to print out the BYTE array with printf but now that I need to convert it into a string I am having problems... mainly that I have no idea how to convert something like this into a non array type.
BYTE ba[0x10];
for(int i = 0; i < 0x10; i++)
{
printf("%02X", ba[i]); // Outputs: F1BD2CC7F2361159578EE22305827ECF
}
So I need to have this same thing basically but instead of printing the array I need it transformed into a LPCWSTR or WCHAR or even a string. The main problem I am having is converting the array into a non array form.

LPCWSTR represents a UTF-16 encoded string. The array contents you have shown are outside the 7bit ASCII range, so unless the BYTE array is already encoded in UTF-16 (the array you showed is not, but if it were, you could just use a simple type-cast), you will need to do a conversion to UTF-16. You need to know the particular encoding of the array before you can do that conversion, such as with the Win32 API MultiByteToWideChar() function, or third-party libraries like iconv or ICU, or built-in locale convertors in C++11, etc. So what is the actual encoding of the array, and where is the array data coming from? It is not UTF-8, for instance, so it has to be something else.

Alright, I got it working. Now I can convert the BYTE array to a char* variable. Thanks for the help, guys, but the encoding wasn't a large problem in this instance. I appreciate the help though; it's always nice to have some extra input.
// Helper function to convert
void Char2Hex(unsigned char ch, char* szHex)
{
unsigned char byte[2];
byte[0] = ch/16;
byte[1] = ch%16;
for(int i = 0; i < 2; i++)
{
if(byte[i] >= 0 && byte[i] <= 9)
{
szHex[i] = '0' + byte[i];
}
else
szHex[i] = 'A' + byte[i] - 10;
}
szHex[2] = 0;
}
// Function used throughout code to convert
void CharStr2HexStr(unsigned char const* pucCharStr, char* pszHexStr, int iSize)
{
int i;
char szHex[3];
pszHexStr[0] = 0;
for(i = 0; i < iSize; i++)
{
Char2Hex(pucCharStr[i], szHex);
strcat(pszHexStr, szHex);
}
}
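Since the question asked for a wide string, the same hex formatting can also be sketched directly on top of std::wstring. The name bytes_to_hex_wstring is made up for illustration; on Windows, where LPCWSTR is a const wchar_t*, the result's c_str() can be passed wherever an LPCWSTR is expected.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): formats each byte as two
// uppercase hexadecimal digits and returns the result as a wide string.
std::wstring bytes_to_hex_wstring(const unsigned char* bytes, std::size_t count)
{
    assert(bytes != nullptr || count == 0);
    static const wchar_t digits[] = L"0123456789ABCDEF";
    std::wstring result;
    result.reserve(count * 2);
    for (std::size_t i = 0; i < count; ++i) {
        result.push_back(digits[bytes[i] >> 4]);   // high nibble
        result.push_back(digits[bytes[i] & 0x0F]); // low nibble
    }
    return result;
}
```

For the first three bytes shown in the question (F1, BD, 2C) this produces L"F1BD2C". Keep the std::wstring alive for as long as the pointer from c_str() is in use.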

substitute strlen with sizeof for c-string

I want to use the mbstowcs_s function but without the iostream header. Therefore I cannot use strlen to predict the size of my buffer. The following function simply has to change a c-string to a wide c-string and return it:
wchar_t* changeToWide(char* value)
{
wchar_t* vOut = new wchar_t[strlen(value)+1];
mbstowcs_s(NULL,vOut,strlen(value)+1,value,strlen(value));
return vOut;
}
As soon as I change it to
wchar_t* changeToWide(char* value)
{
wchar_t* vOut = new wchar_t[sizeof(value)];
mbstowcs_s(NULL,vOut,sizeof(value),value,sizeof(value)-1);
return vOut;
}
I get wrong results (the values are not the same in both arrays). What is the best way to work this out?
I am also open to other ideas on how to make that conversion without using strings, just pure arrays.

Given a char* or const char* you cannot use sizeof() to get the size of the string being pointed by your char* variable. In this case, sizeof() will return you the number of bytes a pointer uses in memory (commonly 4 bytes in 32-bit architectures and 8 bytes in 64-bit architectures).
If you have an array of characters defined as array, you can use sizeof:
char text[] = "test";
auto size = sizeof(text); //will return you 5 because it includes the '\0' character.
But if you have something like this:
char text[] = "test";
const char* ptext = text;
auto size2 = sizeof(ptext); //will return you probably 4 or 8 depending on the architecture you are working on.
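For what it's worth, strlen is declared in <cstring> (or <string.h>), so it does not require iostream at all. Below is a minimal sketch of the conversion using the standard mbstowcs instead of the Microsoft-specific mbstowcs_s; the name change_to_wide and the std::wstring return type are my own choices, and the conversion follows the current C locale.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical replacement for changeToWide (not from the thread).
// Returns an empty string if the input is not valid in the current locale.
std::wstring change_to_wide(const char* value)
{
    assert(value != nullptr);
    // strlen counts the bytes; sizeof(value) would only give the pointer size.
    std::size_t length = std::strlen(value);
    // A wide string never needs more elements than the multibyte string
    // has bytes, plus one element for the terminator.
    std::vector<wchar_t> buffer(length + 1);
    std::size_t written = std::mbstowcs(buffer.data(), value, buffer.size());
    if (written == static_cast<std::size_t>(-1)) {
        return std::wstring();
    }
    return std::wstring(buffer.data(), written);
}
```

Returning a std::wstring also removes the need for the caller to remember delete[].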

Not that I am an expert on this matter, but the char to wchar_t conversion being made here seemingly does nothing but use a wider space for the exact same values, in other words, prefixing each char with some set of zeroes.
I don't know C++ either, just C, but I can derive what it probably would look like in C++ by looking at your code, so here it goes:
wchar_t * changeToWide( char* value )
{
    //counts the length of the value-array, not including the 0
    int i = 0;
    while ( value[i] != '\0' ) i++;
    //allocates enough memory for the characters plus the terminating 0
    wchar_t * vOut = new wchar_t[i + 1];
    //assigns values including the 0
    i = 0;
    while ( ( vOut[i] = 0 | value[i] ) != '\0' ) i++;
    return vOut;
}
The 0 | part looks truly redundant to me, but I felt like including it, don't really know why...

Why am i getting two different strings?

I wrote a very simple encryption program to practice C++ and I came across this weird behavior. When I convert my char* array to a string by setting the string equal to the array, I get a wrong string. However, when I create an empty string and append the chars in the array individually, it creates the correct string. Could someone please explain why this is happening? I just started programming in C++ last week and I cannot figure out why this is not working.
Btw, I checked online and these are apparently both valid ways of converting a char array to a string.
void expandPassword(string* pass)
{
int pHash = hashCode(pass);
int pLen = pass->size();
char* expPass = new char[264];
for (int i = 0; i < 264; i++)
{
expPass[i] = (*pass)[i % pLen] * (char) rand();
}
string str;
for (int i = 0; i < 264; i++)
{
str += expPass[i];// This creates the string version correctly
}
string str2 = expPass;// This creates much shorter string
cout <<str<<"\n--------------\n"<<str2<<"\n---------------\n";
delete[] expPass;
}
EDIT: I removed all of the zeros from the array and it did not change anything

When copying from a char* to a std::string, the copy stops when it reaches the first null character ('\0'). This points to a problem with your "encryption", which is producing embedded null characters.
This is one of the main reasons why encoding is used with encrypted data. After encryption, the resulting data should be encoded using an algorithm such as hex (base16) or base64.

A c-string like the one you are constructing is a series of characters ending with a \0 (zero) value.
In the case of
expPass[i] = (*pass)[i % pLen] * (char) rand();
you may be inserting a \0 into the array if the expression evaluates to 0. You also do not append a \0 at the end of the array to ensure that it is a valid c-string.
When you do
string str2 = expPass;
it can very well be that the string gets shorter, since it is truncated at the first \0 found somewhere in the array.

This is because str2 = expPass interprets expPass as a C-style string, meaning that a zero-valued ("null") byte '\0' indicates the end of the string. So, for example, this:
char p[2];
p[0] = 'a';
p[1] = '\0';
std::string s = p;
will cause s to have length 1, since p has only one nonzero byte before its terminating '\0'. But this:
char p[2];
p[0] = 'a';
p[1] = '\0';
std::string s;
s += p[0];
s += p[1];
will cause s to have length 2, because it explicitly adds both bytes to s. (A std::string, unlike a C-style string, can contain actual null bytes — though it's not always a good idea to take advantage of that.)
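If the buffer really is meant to hold arbitrary bytes, one option is the std::string constructor that takes a pointer and an explicit length, which does not stop at zero bytes. A small sketch (the helper name bytes_to_string is made up for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): builds a std::string from
// exactly `count` bytes, including any embedded '\0' bytes.
std::string bytes_to_string(const char* data, std::size_t count)
{
    assert(data != nullptr || count == 0);
    // The (pointer, length) constructor copies count bytes and does not
    // look for a terminator.
    return std::string(data, count);
}
```

In the question's code that would correspond to something like string str2(expPass, 264); which should then match the string built with +=.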

I guess the following line cuts your string:
expPass[i] = (*pass)[i % pLen] * (char) rand();
If (char) rand() evaluates to 0, or the product wraps around to 0 when it is stored back into a char, you get a string terminator at position i.

Convert wchar_t to int

How can I convert a wchar_t ('9') to a digit in the form of an int (9)?
I have the following code where I check whether or not peek is a digit:
if (iswdigit(peek)) {
// store peek as numeric
}
Can I just subtract '0' or is there some Unicode specifics I should worry about?

If the question concerns just '9' (or one of the Roman digits, i.e. ASCII '0' through '9'), just subtracting '0' is the correct solution. If you're concerned with anything for which iswdigit returns non-zero, however, the issue may be far more complex. The standard says that iswdigit returns a non-zero value if its argument is "a decimal digit wide-character code [in the current locale]", which is vague and leaves it up to the locale to define exactly what is meant. In the "C" locale or the "Posix" locale, the "Posix" standard, at least, guarantees that only the Roman digits zero through nine are considered decimal digits (if I understand it correctly), so if you're in the "C" or "Posix" locale, just subtracting '0' should work.
Presumably, in a Unicode locale, this would be any character which has the general category Nd. There are a number of these. The safest solution would be simply to create something like (variables here with static lifetime):
#include <algorithm>
#include <iterator>
wchar_t const* const digitTables[] =
{
    L"0123456789",
    L"\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669",
    // ...
};
//! \return
//! wch as a numeric digit, or -1 if it is not a digit
int asNumeric( wchar_t wch )
{
    int result = -1;
    // each table holds the ten digits of one script, in order zero to nine
    for ( wchar_t const* const* p = std::begin( digitTables );
            p != std::end( digitTables ) && result == -1;
            ++ p ) {
        wchar_t const* q = std::find( *p, *p + 10, wch );
        if ( q != *p + 10 ) {
            result = static_cast<int>( q - *p );
        }
    }
    return result;
}
If you go this way, you'll definitely want to download the UnicodeData.txt file from the Unicode consortium ("Unicode Character Database"; this page has links to both the Unicode data file and an explanation of the encodings used in it), and possibly write a simple parser of this file to extract the information automatically (e.g. when there is a new version of Unicode). The file is designed for simple programmatic parsing.
Finally, note that solutions based on ostringstream and istringstream (this includes boost::lexical_cast) will not work, since the conversions used in streams are defined to only use the Roman digits. On the other hand, it might be reasonable to restrict your code to just the Roman digits. In which case, the test becomes if ( wch >= L'0' && wch <= L'9' ), and the conversion is done by simply subtracting L'0', always supposing that the native encoding of wide character constants in your compiler is Unicode (the case, I'm pretty sure, of both VC++ and g++). Or just ensure that the locale is "C" (or "Posix", on a Unix machine).
EDIT: I forgot to mention: if you're doing any serious Unicode programming, you should look into ICU. Handling Unicode correctly is extremely non-trivial, and they've a lot of functionality already implemented.

Look into the atoi class of functions: http://msdn.microsoft.com/en-us/library/hc25t012(v=vs.71).aspx
Especially _wtoi(const wchar_t *string); seems to be what you're looking for. You would have to make sure your wchar_t is properly null terminated, though, so try something like this:
if (iswdigit(peek)) {
// store peek as numeric
wchar_t s[2];
s[0] = peek;
s[1] = 0;
int numeric_peek = _wtoi(s);
}

You could use boost::lexical_cast:
const wchar_t c = '9';
int n = boost::lexical_cast<int>( c );

Despite the MSDN documentation, a simple test suggests that iswdigit returns true for more than just the range L'0'-L'9'.
for(wchar_t i = 0; i < 0xFFFF; ++i)
{
if (iswdigit(i))
{
wprintf(L"%d : %c\n", i, i);
}
}
That means that subtracting L'0' probably won't work as you might expect.

For most purposes you can just subtract the code for '0'.
However, the Wikipedia article on Unicode numerals mentions that the decimal digits are represented in 23 separate blocks (including twice in Arabic).
If you are not worried about that, then just subtract the code for '0'.
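A minimal sketch of the "just subtract" approach, restricted to L'0' through L'9' so that other characters for which iswdigit might return true are rejected rather than mis-converted. The function name ascii_digit_value is made up for illustration:

```cpp
#include <cassert>

// Hypothetical helper (not from the thread): returns 0-9 for the wide
// characters L'0'..L'9' and -1 for anything else, including digits from
// other scripts.
int ascii_digit_value(wchar_t wch)
{
    if (wch < L'0' || wch > L'9') {
        return -1;
    }
    // The C++ standard requires '0'..'9' to be contiguous and in order,
    // so the difference is the digit's numeric value.
    int value = static_cast<int>(wch - L'0');
    assert(value >= 0 && value <= 9);
    return value;
}
```

With this, ascii_digit_value(L'9') gives 9, and an Arabic-Indic digit such as U+0669 gives -1 instead of a meaningless number.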
