How would I convert a string object to a UTF8CHAR pointer? - c++

I'm integrating a new system, and the old system had a char* in a method. Now there is a UTF8CHAR * instead.
I have a string object:
string data("test set");
and wanted to pass it into the function:
my_method(UTF8CHAR* text, ENUM extra, newStruct &item);
My first attempt was:
newStruct param("hi", 0,0);
my_method(data.c_str(), extra::OPEN,param);
I don't get an error, but instead an EXC_BAD_ACCESS.

A string and a char array each contain a sequence of bytes. It depends on the library in question, but common sense indicates that a UTF8CHAR array is a sequence of bytes as well, with the added understanding that certain byte combinations describe certain Unicode code points, and certain other byte combinations are illegal. So every UTF-8 char array is a char array, but not necessarily the other way round. As the distinction is not something the compiler can check, except for ensuring proper data type handling, passing a char pointer should work. If it does not, perhaps something else went wrong, which we cannot determine from the code you posted.
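As a rough sketch of that idea: the question does not show how the library defines UTF8CHAR, so the typedef below (unsigned char) is only a guess, and my_method_stub is a hypothetical stand-in for the real function. Since c_str() returns a const pointer and the function takes a non-const one, one cautious option is to copy the bytes into a mutable, null-terminated buffer and pass that instead.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// ASSUMPTION: the real library's typedef is not shown in the question.
typedef unsigned char UTF8CHAR;

// Copy the bytes of a std::string into a mutable, null-terminated buffer.
std::vector<UTF8CHAR> to_utf8_buffer(const std::string& s) {
    std::vector<UTF8CHAR> buf(s.begin(), s.end());
    buf.push_back(0);  // terminator expected by C-style APIs
    return buf;
}

// HYPOTHETICAL stand-in for my_method: counts bytes up to the terminator.
std::size_t my_method_stub(UTF8CHAR* text) {
    std::size_t n = 0;
    while (text[n] != 0) ++n;
    return n;
}
```

The buffer has to stay alive for as long as the callee uses the pointer, so keep the vector in a named variable rather than passing a temporary's data().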

Related

better approach to copy portion of char array than strncpy

I used std::strncpy in C++98 to copy a portion of a char array to another char array. It seems that I have to add the terminating '\0' manually in order to properly terminate the string.
As below, if not explicitly appending '\0' to num1, the char array may have other characters in the later portion.
char buffer[] = "tag1=123456789!!!tag2=111222333!!!10=240";
char num1[10];
std::strncpy(num1, buffer+5, 9);
num1[9] = '\0';
Is there a better approach than this? I'd like to have a one-step operation to reach this goal.
Yes, working with "strings" in C was rather verbose, wasn't it!
Fortunately, C++ is not so limited:
const char* in = "tag1=123456789!!!tag2=111222333!!!10=240";
std::string num1{in+5, in+14}; // 9 characters: "123456789"
If you can't use a std::string, or don't want to, then simply wrap the logic you have described into a function, and call that function.
"As below, if not explicitly appending '\0' to num1, the char array may have other characters in the later portion."
Not quite correct. There is no "later portion". The "later portion" you thought you observed was other parts of memory that you had no right to view. By failing to null-terminate your would-be C-string, your program has undefined behaviour and the computer could have done anything, like travelling back in time and murdering my great-great-grandmother. Thanks a lot, pal!
It's worth noting, then, that because it's the C library functions doing that out-of-bounds memory access, you would not need to null-terminate num1 if you never passed it to those functions. Termination is only required if you want to treat it as a C-style string later. If you just consider it to be an array of 10 bytes, then everything is still fine.
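For the case where std::string is not an option, the "wrap it in a function" suggestion might look roughly like this. The name copy_terminated and its signature are made up for illustration, and the code sticks to what C++98 offers.

```cpp
#include <cstddef>
#include <cstring>

// Copy `count` characters from `src` into `dest` and always terminate.
// `dest_size` is the full capacity of `dest`, including room for '\0'.
// Returns false (and copies nothing) if the destination is too small.
bool copy_terminated(char* dest, std::size_t dest_size,
                     const char* src, std::size_t count) {
    if (dest_size == 0 || count >= dest_size) return false;
    std::strncpy(dest, src, count);
    dest[count] = '\0';  // strncpy does not terminate when src is longer
    return true;
}
```

With the buffer from the question, copy_terminated(num1, sizeof num1, buffer + 5, 9) performs the copy and the termination in one call.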

converting a string to a c string

I'm working on some homework but don't even know where to start on this one. If you could, can you point me in the right direction? This is what I'm supposed to do:
Write your own version of the str_c function that takes a C++ string as an argument (with the parameter set as a constant reference variable) and returns a pointer to the equivalent C-string. Be sure to test it with an appropriate driver.
There are different possibilities to write such a function.
First, take a look at the C++ reference for std::string, which is the starting point for your problem.
In the Iterator section on that page, you might find some methods which can help you to get the string character by character.
It can also help to read the documentation for the method you'd like to imitate, std::string::c_str. It's important to understand how the system works with normal C-strings (char*):
Because a C-string has no length or size attribute, a trick is used to determine the end of the string: the last character in the string has to be '\0'.
Make sure you understand that a char* string can also be seen as an array of characters (char[]). This might help you understand and solve your problem.
As we know, a C-string is a null-terminated array of char. You can copy the characters one by one from the std::string into an array of char, and then close it with '\0'. Remember that a pointer to char (char*) can also refer to an array of char. You can use this concept.

how to make a not null-terminated c string?

I am wondering: given char *cs = .....; what will happen to strlen() and printf("%s",cs) if cs points to a memory block which is huge but has no '\0' in it?
I wrote these lines:
char s2[3] = {'a','a','a'};
printf("str is %s,length is %d",s2,strlen(s2));
I get the result "aaa" and "3", but I think this is only because a '\0' (a 0 byte) happens to reside at the location s2+3.
How do I make a non-null-terminated C string? strlen and other C string functions rely heavily on the '\0' byte. What if there is no '\0'? I just want to understand this rule more deeply.
PS: my curiosity was aroused by studying the following post on SO.
How to convert a const char * to std::string
and these words in that post:
"This is actually trickier than it looks, because you can't call strlen unless the string is actually nul terminated."
If it's not null-terminated, then it's not a C string, and you can't use functions like strlen - they will march off the end of the array, causing undefined behaviour. You'll need to keep track of the length some other way.
You can still print a non-terminated character array with printf, as long as you give the length:
printf("str is %.3s",s2);
printf("str is %.*s",s2_length,s2);
or, if you have access to the array itself, not a pointer:
printf("str is %.*s", (int)(sizeof s2), s2);
You've also tagged the question C++: in that language, you usually want to avoid all this error-prone malarkey and use std::string instead.
A "C string" is, by definition, null-terminated. The name comes from the C convention of having null-terminated strings. If you want something else, it's not a C string.
So if you have a string that is not null-terminated, you cannot use the C string manipulation routines on it. You can't use strlen, strcpy or strcat. Basically, any function that takes a char* but no separate length is not usable.
Then what can you do? If you have a string that is not null-terminated, you will have the length separately. (If you don't, you're screwed. You need some way to find the length, either by a terminator or by storing it separately.) What you can do is allocate a buffer of the appropriate size, copy the string over, and append a null. Or you can write your own set of string manipulation functions that work with pointer and length. In C++ you can use std::string's constructor that takes a char* and a length; that one doesn't need the terminator.
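The std::string route mentioned above can be sketched like this. The helper name from_counted is invented for the example; the constructor it calls is the standard pointer-plus-count one.

```cpp
#include <cstddef>
#include <string>

// Build a std::string from a pointer and an explicit length.
// No terminator is read, so `data` need not be null-terminated.
std::string from_counted(const char* data, std::size_t length) {
    return std::string(data, length);
}
```

Once the characters are in a std::string, c_str() yields a properly terminated copy that is safe to hand to the C string functions.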
Your supposition is correct: your strlen is returning the correct value out of sheer luck, because there happens to be a zero on the stack right after your improperly terminated string. It probably helps that the string is 3 bytes, and the compiler is likely aligning stuff on the stack to 4-byte boundaries.
You cannot depend on this. C strings need NUL characters (zeroes) at the end to work correctly. C string handling is messy, and error-prone; there are libraries and APIs that help make it less so… but it's still easy to screw up. :)
In this particular case, your string could be initialized as one of these:
A: char s2[4] = { 'a','a','a', 0 }; // good if string MUST be 3 chars long
B: const char *s2 = "aaa"; // if you don't need to modify the string after creation
C: char s2[]="aaa"; // if you DO need to modify the string afterwards
Also note that declarations B and C are 'safer' in the sense that if someone comes along later and changes the string declaration in a way that alters the length, B and C are still correct automatically, whereas A depends on the programmer remembering to change the array size and keeping the explicit null terminator at the end.
What happens is that strlen keeps going, reading memory values until it eventually gets to a null. It then assumes that is the terminator and returns a length that could be massively large. If you're using strlen in an environment that expects C-strings to be used, you could then copy this huge buffer of data into another one that is just not big enough, causing buffer overrun problems, or at best, you could copy a large amount of garbage data into your buffer.
Copying a non-null-terminated C string into a std::string will do this. If you then decide that you know this string is only 3 characters long and discard the rest, you will still have a massively long std::string that contains the first 3 good characters and then a load of wastage. That's inefficient.
The moral is, if you're using the CRT functions to operate on C strings, they must be null-terminated. It's no different from any other API: you must follow the rules that the API sets down for correct usage.
Of course, there is no reason you cannot use the CRT functions if you always use the specific-length versions (e.g. strncpy), but you will have to limit yourself to just those, always, and manually keep track of the correct lengths.
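One way to stay inside the buffer when measuring a possibly unterminated array is to bound the search with std::memchr. This is only a sketch, and bounded_length is a made-up name rather than a CRT function.

```cpp
#include <cstddef>
#include <cstring>

// Length of the text in `buf`, never reading past `capacity` bytes.
// Returns `capacity` if no '\0' is found inside the buffer.
std::size_t bounded_length(const char* buf, std::size_t capacity) {
    const void* end = std::memchr(buf, '\0', capacity);
    if (end == nullptr) return capacity;
    return static_cast<std::size_t>(static_cast<const char*>(end) - buf);
}
```

For the three-character array from the question, bounded_length(s2, sizeof s2) reports 3 without touching the byte at s2+3.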
Convention states that a char array with a terminating \0 is a null terminated string. This means that all str*() functions expect to find a null-terminator at the end of the char-array. But that's it, it's convention only.
By convention also strings should contain printable characters.
If you create an array like you did char arr[3] = {'a', 'a', 'a'}; you have created a char array. Since it is not terminated by a \0 it is not called a string in C, although its contents can be printed to stdout.
The C standard does not define the term string until the section 7 - Library functions. The definition in C11 7.1.1p1 reads:
A string is a contiguous sequence of characters terminated by and including the first null character.
(emphasis mine)
If the definition of string is a sequence of characters terminated by a null character, a sequence of non-null characters not terminated by a null is not a string, period.
What you have done is undefined behavior.
You are trying to read a memory location that is not yours.
Change it to
char s2[] = {'a','a','a','\0'};

Trouble creating char * object to send to c++/c shared object library

First off I am new to ctypes and did search for an answer to my question. Definitely will appreciate any insight from here.
I have a byte string supplied to me by another tool. It contains what appears to be hex and other values. I'm creating the c_char_p object as follows:
mybytestring = b'something with a lot of hex \x00\x00\xc7\x87\x9bb and other alphanumeric and non-word characters' # Length of this is very long let's say 480
mycharp = c_char_p(mybytestring)
I also create a c_char_Array as follows:
mybuff = create_string_buffer(mybytestring)
The problem is when I send either mycharp or mybuff to a c++ library .so function, the string gets cut off at the NULL terminator (first occurrence of '\x00')
I'm loading the c++ library and calling the function as follows:
lib_handle = cdll.LoadLibrary("mylib.so")
lib_handle.myfunction(mycharp)
lib_handle.myfunction(mybuff)
The c++ function expects a char *
Does someone know how to be able to send the whole string with NULL terminators ('\x00') included?
Thanks
Add your original data to a vector<char> vec, and send vec.data().
But the actual problem is
The c++ function expects a char *.
You will need to change this (to accept a second argument giving the length of the buffer or, for example, to accept a vector<char>) if you want it to accept an array of char that includes nulls.
Alternatively, you can figure out what you actually want the C++ function to do, preprocess the char array yourself, adding a null terminator to each new array, and then send the results to the C++ function.
For example, you may decide that the "input" array is actually a set of C strings: you will need to do a simple parse to split it and send the pieces to the C++ function in a loop, one after another.
Or maybe you decide that the input could be a string in UTF-16 and not UTF-8. Then you need to convert it to UTF-8 as well as possible and send it to the C++ function.
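The "second argument" option could be sketched on the C++ side as follows. myfunction_with_length is a hypothetical replacement for the library function, and counting zero bytes is only a placeholder body showing that embedded nulls are treated as ordinary data.

```cpp
#include <cstddef>

// HYPOTHETICAL replacement for myfunction: the caller passes the byte
// count explicitly, so embedded '\0' bytes do not end the input.
// As a demonstration it returns how many zero bytes the buffer contains.
extern "C" std::size_t myfunction_with_length(const char* data,
                                              std::size_t length) {
    std::size_t zeros = 0;
    for (std::size_t i = 0; i < length; ++i) {
        if (data[i] == '\0') ++zeros;
    }
    return zeros;
}
```

The Python caller would then pass the buffer together with len(mybytestring) as the second argument.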

Difference between unsigned char and char pointers

I'm a bit confused with differences between unsigned char (which is also BYTE in WinAPI) and char pointers.
Currently I'm working with some ATL-based legacy code and I see a lot of expressions like the following:
CAtlArray<BYTE> rawContent;
CALL_THE_FUNCTION_WHICH_FILLS_RAW_CONTENT(rawContent);
return ArrayToUnicodeString(rawContent);
// or return ArrayToAnsiString(rawContent);
Now, the implementations of ArrayToXXString look the following way:
CStringA ArrayToAnsiString(const CAtlArray<BYTE>& array)
{
CAtlArray<BYTE> copiedArray;
copiedArray.Copy(array);
copiedArray.Add('\0');
// Casting from BYTE* -> LPCSTR (const char*).
return CStringA((LPCSTR)copiedArray.GetData());
}
CStringW ArrayToUnicodeString(const CAtlArray<BYTE>& array)
{
CAtlArray<BYTE> copiedArray;
copiedArray.Copy(array);
copiedArray.Add('\0');
copiedArray.Add('\0');
// Same here.
return CStringW((LPCWSTR)copiedArray.GetData());
}
So, the questions:
Is the C-style cast from BYTE* to LPCSTR (const char*) safe for all possible cases?
Is it really necessary to add double null-termination when converting array data to wide-character string?
The conversion routine CStringW((LPCWSTR)copiedArray.GetData()) seems invalid to me, is that true?
Any way to make all this code easier to understand and to maintain?
The C standard is kind of weird when it comes to the definition of a byte. You do have a couple of guarantees though.
A byte will always be one char in size
sizeof(char) always returns 1
A byte will be at least 8 bits in size
This definition doesn't mesh well with older platforms where a byte was 6 or 7 bits long, but it does mean BYTE*, and char * are guaranteed to be equivalent.
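These guarantees can be checked at compile time. In this small sketch the BYTE typedef is written out by hand to mirror the WinAPI definition (unsigned char), so that it builds without the Windows headers.

```cpp
#include <climits>

typedef unsigned char BYTE;  // ASSUMPTION: mirrors the WinAPI typedef

static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");
static_assert(sizeof(BYTE) == 1, "unsigned char is also one byte");
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");
```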
Two null bytes are needed at the end of a UTF-16 string because many valid UTF-16 code units contain a zero (null) byte, so a single zero byte cannot mark the end.
As for making the code easier to read, that is completely a matter of style. This code appears to be written in a style used by a lot of old C Windows code, which has definitely fallen out of favor. There are probably a ton of ways to make it clearer for you, but how to make it clearer has no clear answer.
Yes, it is always safe. Because they both point to an array of single-byte memory locations.
LPCSTR: Long Pointer to Const (single-byte) String
LPCWSTR : Long Pointer to Const Wide (multi-byte) String
LPCTSTR : Long Pointer to Const context-dependent (single-byte or multi-byte) String
In wide character strings, every single character occupies 2 bytes of memory, and the length of the memory location containing the string must be a multiple of 2. So if you want to add a wide '\0' to the end of a string, you should add two bytes.
Sorry for this part, I do not know ATL and I cannot help you on this part, but actually I see no complexity here, and I think it is easy to maintain. What code do you really want to make easier to understand and maintain?
If the BYTE* behaves like a proper string (i.e. the last BYTE is 0), you can cast a BYTE* to a LPCSTR, yes. Functions working with LPCSTR assume zero-terminated strings.
I think the multiple zeroes are only necessary when dealing with some multibyte character sets. The most common 8-bit encodings (like ordinary Windows Western and also UTF-8) don't require them.
The CString is Microsoft's best attempt at user-friendly strings. For instance, its constructor can handle both char and wchar_t type input, regardless of whether the CString itself is wide or not, so you don't have to worry about the conversion much.
Edit: wait, now I see that they are abusing a BYTE array for storing wide chars in. I couldn't recommend that.
An LPCWSTR is a string with 2 bytes per character, while a char is one byte per character. That means you cannot convert with a C-style cast, because you have to change the data in memory (add a zero byte to each standard ASCII character), not just read the same memory in a different way (which is what a C cast would do).
So I would say the cast is not so safe.
The double null termination: one character always takes 2 bytes, so your end-of-string marker must be 2 bytes long.
To make that code easier to understand, take a look at lexical_cast in Boost (http://www.boost.org/doc/libs/1_48_0/doc/html/boost_lexical_cast.html)
Another way would be to use the standard strings (such as std::basic_string), on which you can perform string operations.
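If the byte array really does hold 16-bit code units, a copy with an explicit length avoids both the pointer cast and the hand-appended terminator. The sketch below uses std::vector and std::u16string as portable stand-ins for CAtlArray<BYTE> and CStringW, and bytes_to_utf16 is an invented name; it assumes the bytes are in the machine's native byte order.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Interpret `bytes` as UTF-16 code units in native byte order.
// A trailing odd byte, if any, is ignored. No terminator is required
// because the length comes from the vector.
std::u16string bytes_to_utf16(const std::vector<unsigned char>& bytes) {
    const std::size_t units = bytes.size() / sizeof(char16_t);
    std::u16string out(units, u'\0');
    if (units != 0) {
        // memcpy sidesteps the alignment and aliasing concerns of a cast.
        std::memcpy(&out[0], bytes.data(), units * sizeof(char16_t));
    }
    return out;
}
```

In the ATL code itself, the analogous move would be to construct the string from a pointer and a character count instead of relying on the two appended zero bytes.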