Converting an ANSI C-String to UNICODE - c++

Note: I am trying to write my own function that performs this conversion
I understand that a char is 1 byte, while a wchar_t is 2 bytes.
So this is how a conversion would happen:
1) Input some text
Hello, world!
2) Get the bytes of the string
48 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21
3) Allocate memory twice the number of bytes
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
4) Fill a byte with the ANSI value, skipping one byte at a time
48 00 65 00 6c 00 6c 00 6f 00 2c 00 20 00 77 00 6f 00 72 00 6c 00 64 00 21 00
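For reference, here is a minimal sketch of what steps 1-4 describe, assuming a Windows-style 2-byte wchar_t and plain ASCII input (the function name widen is just for illustration; this is the naive widening the question describes, not a general codepage conversion):

#include <cstddef>

// Naive widening: copy each ANSI byte into the low byte of a wchar_t.
// Only correct for plain ASCII (0..127); see the answers below.
wchar_t* widen(const char* ansi, std::size_t len) {
    wchar_t* wide = new wchar_t[len + 1];          // step 3: twice the bytes (2 per char)
    for (std::size_t i = 0; i < len; ++i)
        wide[i] = static_cast<wchar_t>(
            static_cast<unsigned char>(ansi[i]));  // step 4: ANSI value goes in the low byte
    wide[len] = L'\0';
    return wide;                                   // caller must delete[] the result
}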
I have a couple of questions about this process:
1) Can I simply cast an ANSI string to UNICODE and have it replicate the exact process above, or will it simply fill the first half of the bytes with the ANSI bytes and leave the rest to 0?
char a[] = { "Hello, world!" };
wchar_t* b = reinterpret_cast<wchar_t*>(a);
2) Looking at the MultiByteToWideChar function, I see a CodePage argument and I wonder what it is. Isn't the conversion always the same (as I understand it and wrote it out above)? I thought the ASCII character codes were the same everywhere, but this argument seems to say otherwise, given that it has separate values for things like Mac and Windows.

I thought the ASCII character codes were the same everywhere, but this argument seems to say otherwise, given that it has separate values for things like Mac and Windows.
The ASCII codes are, yes, but anything using the high bit of a so-called "Extended ASCII" string (spoiler: there's no such single thing) can map to any of a large number of codepages, all different encodings intended for use mostly in different geographic locales. The approach you've taken is fine for the simple, plain ASCII case, but it doesn't work in general, and MultiByteToWideChar knows this. It will re-encode properly from whatever codepage you're using to what Windows confusingly calls "Unicode" (not "UNICODE"), which is actually, more specifically, the UTF-16 encoding.
Can I simply cast an ANSI string to UNICODE and have it replicate the exact process above, or will it simply fill the first half of the bytes with the ANSI bytes and leave the rest to 0?
No. A cast does not re-encode anything or change values. There you are just saying "I promise that a is a bunch of wchar_ts, even though it has type char*" (it doesn't, it has array type, but close enough for today).
That code actually has undefined behaviour, if you use b, because you've broken aliasing rules (you can examine a T through a char*, but you can't treat a char[] as some T that you never created). But, if it didn't, you'd find that your "string" were now half the length, and more than likely an invalid UTF-16 sequence that would not render correctly anywhere.
So if I wanted to support UTF-32, I would have to create my own wrapper for strings since wchar_t is only 2 bytes long and I need 4 bytes, and also I would not be able to print it with printf for example, correct?
Technically, sort of yes (though you'd use a library like libicu rather than rolling your own).
But, in reality, you don't want to use UTF-32. Working with the Windows API you're stuck with UTF-16, but other than that we generally prefer UTF-8 over char, which is nice and portable and flexible. (You will again want a library for this, though.)
It'd then be up to you as to where you perform the relevant conversions, and/or whether you have a switch that flips from UTF-8 to UTF-16 depending on the platform (like Windows's old UNICODE macro) or just run UTF-8 everywhere until you hit a Windows API boundary.
Or, if all your input is ASCII as you imply, then you don't really need to do anything other than what you are already doing: either keep ASCII throughout the program and convert it to UTF-16 only when using the Windows API, or use UTF-16 (and wchar_ts) throughout your whole program and have no conversions. Make sure to use the wide-char versions of your favourite functions (like wprintf), though, if you go down that route.
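For concreteness, a minimal sketch of the first approach: keep narrow strings (ASCII/UTF-8) everywhere and convert only at the Windows API boundary. Windows-only, and the helper name to_utf16 is just an illustration:

#include <windows.h>
#include <string>

// Convert a UTF-8 (or plain ASCII) string to UTF-16 for Windows API calls.
std::wstring to_utf16(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring out(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &out[0], len);
    return out;
}

int main() {
    std::string msg = "Hello, world!";              // narrow throughout the program
    MessageBoxW(nullptr, to_utf16(msg).c_str(),     // widen only at the API boundary
                L"Demo", MB_OK);
}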

What you are attempting to do will only work for ASCII character codes in the range of 0..127. Those characters have the same numeric values in Unicode, and thus can be copied as-is between char and wchar_t strings.
And no, you can't just reinterpret_cast the memory address of the char data to wchar_t*, you need to allocate a new wchar_t array and copy the values, eg:
char a[] = { "Hello, world!" };
wchar_t* b = new wchar_t[sizeof(a)]; // one wchar_t per char, including the null terminator
for(size_t i = 0; i < sizeof(a); ++i) {
b[i] = static_cast<wchar_t>(a[i]);
}
...
delete[] b;
This type of copying would be better handled using std::string and std::wstring iterator-based constructors instead, eg:
std::string a = "Hello, world!";
std::wstring b(a.begin(), a.end());
...
However, beyond the ASCII range, you need to convert the data between char and wchar_t via charset/codepage lookups. Different charsets/codepages encode Unicode characters in different ways. MultiByteToWideChar() (and WideCharToMultiByte()) handle those conversions for you, using the codepage that you tell them to use. There are also many 3rd party libraries that can handle these conversions, such as ICONV, ICU, etc. To an extent, even C++'s own std::wstring_convert and std::wbuffer_convert can, too (though they are deprecated from C++17 onwards).
For example, let's look at codepoint U+20AC EURO SIGN (€):
in a wchar_t string, it takes up a single wchar_t whose numeric value is 0x20AC.
in a UTF-8 encoded char string, it takes up 3 chars whose numeric values are 0xE2 0x82 0xAC.
in a Windows-1252 encoded char string, it takes up a single char whose numeric value is 0x80.
in a Latin-1 (ISO-8859-1) encoded char string, the Euro sign doesn't even have a numeric value assigned!
So, a simple value copy will not suffice for non-ASCII characters.
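A small Windows-only demonstration of that last point: a plain value copy of the Windows-1252 byte 0x80 gives U+0080 (a control character), while MultiByteToWideChar with codepage 1252 produces the correct U+20AC.

#include <windows.h>
#include <cstdio>

int main() {
    const char euro1252[] = "\x80";                 // the Euro sign in Windows-1252

    wchar_t naive = static_cast<wchar_t>(
        static_cast<unsigned char>(euro1252[0]));   // simple value copy: U+0080, wrong

    wchar_t converted[2] = {};
    MultiByteToWideChar(1252, 0, euro1252, -1, converted, 2);  // proper conversion: U+20AC

    std::printf("naive: U+%04X  converted: U+%04X\n",
                static_cast<unsigned>(naive), static_cast<unsigned>(converted[0]));
}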

Related

C++ alignment of class - member call on misaligned address

I'm using UBSAN and am getting the following error. Note that I'm compiling with clang 6.0.1 with -fsanitize=undefined. I've read a number of background questions on SO and still can't solve my particular issue. Here are the background questions for reference:
What is the recommended way to align memory in C++11
Misaligned address using virtual inheritance
runtime error: member call on misaligned address 0x000001f67230 for type 'const A *', which requires 64 byte alignment
0x000001f67230: note: pointer points here
00 00 00 00 c0 72 f6 01 00 00 00 00 08 00 00 00 00 02 00 00 40 02 00 00 00 00 00 00 00 00 00 00
Here are some things to note about class C:
the object of type C is created using new (C* o = new C();)
type C has a member of type A that has 64 byte alignment. I verified this using alignof.
C is declared using class alignas(64) C -- but that doesn't solve my problem
My current hypothesis is that I need to use the C++11 equivalent of the C++17 std::aligned_alloc to create the object using aligned storage. But, I'm not sure how to best do this or if it will actually solve my problem. I would prefer to solve the problem once in the definition of class C as opposed to every time I create a C, if possible. What is the recommended approach to solve this issue to remove the UBSAN error?
If your class already has a member that requires 64-byte alignment, then the class will already have 64-byte alignment out of necessity. So adding an explicit alignas(64) is not really going to change anything.
The basic problem here is that allocation functions (in C++11) are only required to return memory aligned to fundamental alignment. C++11 left it implementation-defined whether over-aligned types are supported by new or not [expr.new]/1. C++17 introduced new-extended alignment and additional allocation functions to deal with that (if and which new-extended alignments are supported, however, is still implementation-defined).
If you can switch to a compiler that supports C++17, chances are that your code will just work. Otherwise you will probably have to either use some implementation-specific function to allocate aligned memory or just roll your own solution, e.g., based on std::align and placement new (which would work in C++11 too)…
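For what it's worth, here is one possible C++11-era workaround, sketched under the assumption of a POSIX platform (posix_memalign): give C class-specific allocation functions so that new C returns suitably aligned storage. The member type A below is just a stand-in for the real one.

#include <cstddef>   // std::size_t
#include <cstdlib>   // posix_memalign, std::free (POSIX assumption)
#include <new>       // std::bad_alloc

struct alignas(64) A { float data[16]; };   // stand-in for the 64-byte-aligned member

class C {
public:
    A a;   // forces alignof(C) == 64

    // Class-specific operator new/delete: every `new C` goes through here,
    // so the fix lives in the class definition, not at each call site.
    static void* operator new(std::size_t size) {
        void* p = nullptr;
        if (posix_memalign(&p, alignof(C), size) != 0)
            throw std::bad_alloc();
        return p;
    }
    static void operator delete(void* p) noexcept {
        std::free(p);
    }
};

int main() {
    C* o = new C();   // storage is now 64-byte aligned even under C++11 rules
    delete o;
}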

Why is my linked list "next" pointer dereferencing to the wrong memory (XCode, C++)

Can someone please help me figure out why the "next" pointer in my linked list is dereferencing to the wrong memory address in code on 32-bit platform, but works fine on 64-bit platform? My program is built as a universal binary on Xcode 7.3 and written in C++.
I've got a linked list, and dereferencing the "next" pointer in the debugger shows the correct memory, but dereferencing it in code reads the memory that is 4-bytes beyond where it should read. I will try to explain..
The objects on the list are 4144 bytes each, the last 4-bytes are a 32-bit pointer to the "next" item on the list. Looking at the "next" pointer in memory (0xBFFD63AC), we see that it is 4 zeros (NULL), this is correct. BUT notice that memory at 0xBFFD63B0 is a 0x01. This is the byte beyond the "next" pointer. When I ask the debugger to print the next variable, it prints the proper value (NULL):
(lldb) print l_pObject->Next
(Object__t *) $0 = 0x00000000
(lldb) memory read &(l_pObject->Next)
0xbffd63ac: 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................
However, if I execute the code that dereferences the "next" pointer, it actually reads the data from 0xBFFD63B0 instead of 0xBFFD63AC:
l_pObject = l_pObject->Next;
(lldb) print l_pObject
(Object_t *) $3 = 0x00000001
(lldb) memory read &(l_pObject)
0xbffd2504: 01 00 00 00 80 53 fd bf 00 00 00 00 6c 82 2e
I am positive it is reading the 0x01 from 0xBFFD63B0. The debugger seems to know that "next" indicates the memory at 0xBFFD63AC, but for some reason dereferencing "next" in code actually reads from 0xBFFD63B0, and I'm not sure how to figure out why. I've tried adding __attribute__((packed)) to the struct, but that made no difference. It's difficult to determine where things are going wrong because the debugger is telling me something different from what is really happening. Any tips on how to proceed from here would be very greatly appreciated!
EDITED TO ADD MORE INFO:
At the very least there is a debugger error here! I ask the debugger to print the sizeof of the struct and it gives me 4144, but in code sizeof() gives me 4148! So there is definitely padding happening in the struct, but the debugger and apparently this section of code are blind to it. This is the root of my problem.
Debugger:
(lldb) print sizeof(*l_pObject)
(unsigned long) $0 = 4144
Code:
unsigned int iSizeOf = sizeof(*l_pObject); /* iSizeOf will equal 4148! */
Something funny is happening...
Without seeing the code it is impossible to tell. In general though, pointers tend to be 8 bytes wide on 64-bit systems and 4 bytes wide on 32-bit systems. Obviously you observed that it's jumping 4 bytes beyond where it should be, which is a strong indicator: (0xBFFD63B0 - 0xBFFD63AC) == 4.
OK I found the problem, I should have realized it right away. The code that was crashing is in a library. Both the library and the calling application share definitions for variable types. Due to a missing preprocessor flag on the application side, one of those types was being allocated 4-bytes smaller on the application side than in the library code. So when the application created a linked list of objects and passed a reference to the library, each object on that list was 4-bytes smaller than the library was expecting them to be. What made it less obvious was the debugger was showing me addresses and offsets based on what was allocated in the application, so at first glance everything appeared to be sized correctly. I decided to not trust the debugger and wrote my own code to check addresses and offsets, and that's when it became obvious what was happening. So anyway, long story short, be sure your application and library allocate types to be the same size and then things will work much better. :)
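To illustrate the kind of mismatch described here, a purely hypothetical sketch (the flag name and array size are made up to match the 4144/4148 figures above, assuming 4-byte pointers as in the 32-bit build): a struct whose layout depends on a preprocessor flag that is defined in the library build but missing in the application build.

#include <cstdio>

// #define EXTENDED_OBJECT   // set when building the library, missing in the app build

typedef struct Object_t {
    char payload[4140];
#ifdef EXTENDED_OBJECT
    int  extra;              // only present in the library's view of the type
#endif
    struct Object_t* Next;   // sits at a different offset in each build
} Object_t;

int main() {
    // The library sees sizeof(Object_t) == 4148 with Next at offset 4144;
    // the application sees 4144 with Next at offset 4140, so the library
    // reads 4 bytes past the pointer the application actually wrote.
    std::printf("sizeof(Object_t) = %zu\n", sizeof(Object_t));
}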

Visual studio 2010 - data segment and stack memory are same

I figured out that a string literal gets placed in the data segment of the program (from SO) and is read-only, and hence the line "s[0] = 'a'" would cause an error, which actually did happen when I uncommented that line and ran. However, when I looked into the memory window in MS VS, the variables are all placed together in memory. I am curious as to how they (the compiler) enforce read-only access to 's'?
#include <cstdio>
#include <cstdlib>
int main(void)
{
    char *s = "1023";
    char s_arr[] = "4237";
    char *d = "5067";
    char s_arr_1[] = "9999";
    char *e = "6789";
    printf("%c\n", s[0]);
    // s[0] = 'a'; This line would error out since s should point to data segment of the program
    printf("%s\n", s);
    system("pause");
}
0x002E54F4 31 30 32 33 00 00 00 00 34 32 33 37 00 00 00 00 1023....4237....
0x002E5504 35 30 36 37 00 00 00 00 39 39 39 39 00 00 00 00 5067....9999....
0x002E5514 36 37 38 39 00 00 00 00 25 63 0a 00 25 73 0a 00 6789....%c..%s..
0x002E5524 70 61 75 73 65 00 00 00 00 00 00 00 43 00 3a 00 pause.......C.:.
Edit 1:
Updating the value stored in s_arr (which should be placed in stack space) to make it clear that it is placed adjacent to the string constants.
Edit 2: Since I am seeing answers regarding ro/rw access based on pages:
Here address 0x...4f4 is rw, 0x...4fc is ro, and again 0x...504 is rw. How do they achieve this granularity? Since each page could be a minimum of 4 KB, one could argue that 0x...4fb could be the last address of the previous ro page. But I have now added a few more variables to show that they are all placed contiguously in memory, and the apparent granularity is every 8 bytes, far smaller than the 4 KB page size mentioned.
I don't know what made you think that your example shows modifiable memory next to non-modifiable memory. What "granularity" are you talking about? Your memory dump does not show anything like that.
The string "4237" that you see in your memory dump is not your s_arr. That "4237" that you see there is a read-only string literal that was used as an initializer for the s_arr. That initializer was copied to s_arr. Meanwhile, the actual s_arr resides somewhere else (in the stack) and is perfectly modifiable. It contains "4237" as well (as its initial value), but that's a completely different "4237", which you don't see in your memory dump. Ask your program to print the address of s_arr and you will see that its is nowhere near the memory range that you dumped.
Again, your claim about "0x...4f4 is rw, 0x...4fc is ro and again 0x...504 is rw" is completely incorrect. All these addresses are read-only. None of them are read-write. There is no "granularity" there whatsoever.
Remember that a declaration like this
char s_arr[] = "4237";
is really equivalent to
const char *unnamed = "4237";
char s_arr[5];
memcpy(s_arr, unnamed, 5);
In your memory dump, you are looking at that unnamed address from my example above. That memory region is read-only. Your s_arr resides in completely different memory region, which is read-write.
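A minimal sketch of that suggestion: print both addresses and it becomes obvious that the literal and the array live in different regions (the exact addresses will of course vary).

#include <cstdio>

int main() {
    const char* s = "4237";      // points at the read-only literal
    char s_arr[]  = "4237";      // writable copy of that literal, on the stack

    std::printf("literal at %p\n", static_cast<const void*>(s));
    std::printf("array   at %p\n", static_cast<const void*>(s_arr));

    s_arr[0] = 'X';              // fine: the array is writable
    // s[0] = 'X';               // crashes/UB: the literal is read-only
}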
Since 32 bit platforms were introduced, everything is placed into the same segment (This is not exactly so, but it is easier to think that this is so. There are minor caveats that require several pages to explain and they apply to operating system design).
The 32-bit address space is split into several pages. Intel allows RO bits to be assigned with page granularity. Debuggers display only the 32-bit (64-bit) address, which technically is an offset in the segment. It is fine to call this offset simply an address; there is no mistake in that.
Nevertheless, linkers refer to different memory areas as segments. These segments have nothing to do with Intel memory segments. Linker segments (code, data, stack, etc.) are loaded into different pages. These pages get different attributes (RO/RW, execution permission, etc.).
The block of memory you are showing is the area where string constants are stored (as you can see, all 4 values are directly there, one next to the other). This area is marked as read-only. On Windows each 4 KB block of memory (page) can have its own attributes (read/write/execute), so even 2 adjacent locations can have different access flags.
The area where the variables live is in a different location (the stack in your case). You can see this by checking the value of &s in the immediate window (or watch window).

Exception handler

There is this code:
char text[] = "zim";
int x = 777;
If I look at the stack where x and text are placed, the output is:
09 03 00 00 7a 69 6d 00
Where:
09 03 00 00 = 0x309 = 777 <- int x = 777
7a 69 6d 00 = char text[] = "zim" (ASCII code)
Now here is the code with try..catch:
char text[] = "zim";
try{
int x = 777;
}
catch(int){
}
Stack:
09 03 00 00 **97 85 04 08** 7a 69 6d 00
Now between text and x is placed new 4 byte value. If I add another catch, then there will be something like:
09 03 00 00 **97 85 04 08** **xx xx xx xx** 7a 69 6d 00
and so on. I think that this is some value connected with exception handling, and that it is used during stack unwinding to find the appropriate catch when an exception is thrown in the try block. The question, however, is: what exactly is this 4-byte value (maybe some address of an exception handler structure, or some id)?
I use g++ 4.6 on 32 bit Linux machine.
AFAICT, that's a pointer to an "unwind table". Per the Itanium ABI implementation suggestions, the process "[uses] an unwind table, [to] find information on how to handle exceptions that occur at that PC, and in particular, get the address of the personality routine for that address range."
The idea behind unwind tables is that the data needed for stack unwinding is rarely used. Therefore, it's more efficient to put a pointer on the stack and store the rest of the data in another page. In the best case, that page can remain on disk and doesn't even need to be loaded into RAM. In comparison, C-style error handling often ends up in the L1 cache because it's all inline.
Needless to say all this is platform-dependent and etc.
This may be an address. It may point to either a code section (some handler address), or data section (pointer to a build-time-generated structure with frame info), or the stack of the same thread (pointer to a run-time-generated table of frame info).
Or it may also just be garbage, left over due to an alignment requirement that EH may impose.
For instance, on Win32/x86 there's no such gap. For every function that uses exception handling (has either try/catch or __try/__except/__finally, or objects with destructors) the compiler generates an EXCEPTION_REGISTRATION record that is allocated on the stack (by the function prolog code). Then, whenever something changes within the function (an object is created/destroyed, a try/catch block is entered/exited), the compiler adds an instruction that modifies this structure (more correctly, modifies its extension). But nothing more is allocated on the stack.
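If you want to observe the gap from code rather than the debugger, a small sketch along the lines of the original experiment (where the compiler places locals, and whether it inserts EH or alignment slots between them, is of course entirely up to it):

#include <cstdio>

int main() {
    char text[] = "zim";
    int x = 777;

    // Print the addresses of the two locals; any EH or alignment slots the
    // compiler inserts between them show up as a gap larger than sizeof(x).
    std::printf("&x    = %p\n", static_cast<void*>(&x));
    std::printf("&text = %p\n", static_cast<void*>(text));
}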

binary protocol - byte swap trick

Let's say we have a binary protocol with fields in network order (big endian).
struct msg1
{
    int32 a;
    int16 b;
    uint32 c;
};
If, instead of copying the network buffer to my msg1 and then using the "networkToHost" functions to read msg1, I rearrange/reverse msg1 to
struct msg1
{
    uint32 c;
    int16 b;
    int32 a;
};
and simply do a reverse copy from the network buffer to create msg1, then there is no need for the networkToHost functions. This approach doesn't work on big-endian machines, but for me that is not a problem. Apart from that, is there any other drawback that I miss?
thanks
P.S. For the above we enforce strict packing (#pragma pack(1), etc.).
Apart from that, is there any other drawback that I miss?
I'm afraid you've misunderstood the nature of endian conversion problems. "Big endian" doesn't mean your fields are laid out in reverse, so that a
struct msg1_bigendian
{
    int32 a;
    int16 b;
    uint32 c;
};
on a big endian architecture is equivalent to a
struct msg1_littleendian
{
    uint32 c;
    int16 b;
    int32 a;
};
on a little endian architecture. Rather, it means that the byte-order within each field is reversed. Let's assume:
a = 0x1000000a;
b = 0xb;
c = 0xc;
On a big-endian architecture, this will be laid out as:
10 00 00 0a
00 0b
00 00 00 0c
The high-order (most significant) byte comes first.
On a little-endian machine, this will be laid out as:
0a 00 00 10
0b 00
0c 00 00 00
The lowest order byte comes first, the highest order last.
Serialize them and overlay the serialized form of the messages on top of each other, and you will discover the incompatibility:
10 00 00 0a 00 0b 00 00 00 0c (big endian)
0a 00 00 10 0b 00 0c 00 00 00 (little endian)
int32 a | int16 b | uint32 c
Note that this isn't simply a case of the fields running in reverse. Your proposal would result in a little-endian machine misinterpreting the big-endian representation as:
a = 0xc000000;
b = 0xb00;
c = 0xa000010;
Certainly not what was transmitted!
You really do have to convert every individual field to network byte order and back again, for every field transmitted.
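For reference, a minimal sketch of that field-by-field conversion on the receiving side, assuming a POSIX platform (<arpa/inet.h> for ntohl/ntohs) and <cstdint> fixed-width types standing in for the question's int32/int16/uint32:

#include <arpa/inet.h>   // ntohl, ntohs
#include <cstdint>
#include <cstring>

struct msg1 {
    int32_t  a;
    int16_t  b;
    uint32_t c;
};

// Decode a 10-byte big-endian wire buffer (no padding on the wire) into msg1.
msg1 decode(const unsigned char* buf) {
    msg1 m;
    uint32_t u32;
    uint16_t u16;

    std::memcpy(&u32, buf + 0, 4);
    m.a = static_cast<int32_t>(ntohl(u32));

    std::memcpy(&u16, buf + 4, 2);
    m.b = static_cast<int16_t>(ntohs(u16));

    std::memcpy(&u32, buf + 6, 4);
    m.c = ntohl(u32);

    return m;
}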
UPDATE:
Ok, I understand what you are trying to do now. You want to define the struct in reverse, then memcpy from the end of the byte string to the beginning (reverse copy) and reverse the byte order that way. In which case I would say, yes, this is a hack, and yes, it makes your code un-portable, and yes, it isn't worth it. Converting between byte orders is not, in fact, a very expensive operation and it is far easier to deal with than reversing the layout of every structure.
Are you sure this is required? More than likely, your network traffic is going to be your bottleneck, rather than CPU speed.
Agree with #ribond -
This has great potential to be very confusing to developers, since they'll have to work to keep these two semantically identical structures separate.
Given that network latency is on the order of 10,000,000x longer than the time the CPU needs to do the conversion, I'd just keep them the same.
Depending on how your compiler packs the bytes inside a struct, the 16-bit number in the middle might not end up in the right place: it might be padded out to a 32-bit field, and when you reverse the bytes it will "vanish".
Seriously, tricks like this may seem cute when you write them but in the long term they simply aren't worth it.
edit
You added the "pack 1" information so the bug goes away but the thing about "cute tricks" still stands - not worth it. Write a function to reverse 32-bit and 16-bit numbers.
inline void reverse(int16 &n)
{
...
}
inline void reverse(int32 &n)
{
...
}
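For completeness, one possible way to fill in those bodies, assuming uint16/uint32 are the unsigned counterparts of the question's types (swap through unsigned values to avoid shifting negative numbers):

inline void reverse(int16 &n)
{
    // Swap the two bytes of a 16-bit value.
    uint16 u = static_cast<uint16>(n);
    n = static_cast<int16>(static_cast<uint16>((u >> 8) | (u << 8)));
}

inline void reverse(int32 &n)
{
    // Reverse the four bytes of a 32-bit value.
    uint32 u = static_cast<uint32>(n);
    n = static_cast<int32>( (u >> 24)
                          | ((u >> 8) & 0x0000FF00u)
                          | ((u << 8) & 0x00FF0000u)
                          | (u << 24));
}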
Unless you can demonstrate that there is a significant performance penalty, you should use the same code to transfer data onto and off the network regardless of the endian-ness of the machine. As an optimization, for the platforms where the network order is the same as the hardware byte order, you can use tricks, but remember about alignment requirements and the like.
In the example, many machines (especially, as it happens, big-endian ones) will require a 2-byte pad between the end of the int16 member and the following 32-bit member. So, although you can read into a 10-byte buffer, you cannot treat that buffer as an image of the structure, which will be 12 bytes on most platforms.
As you say, this is not portable to big-endian machines. That is an absolute dealbreaker if you ever expect your code to be used outside of the x86 world. Do the rest of us a favor and just use the ntoh/hton routines, or you'll probably find yourself featured on thedailywtf someday.
Please do the programmers that come after you a favor and write explicit conversions to and from a sequence of bytes in some buffer. Trickery with structures will lead you straight into endianness and alignment hell (been there).