Convert between signed char & unsigned char representing UTF8 - c++

I am using libxml2 and ICU in the same project. They represent
UTF8 differently. libxml2 uses unsigned char*, and ICU constructors take in plain char* (which on my Pentium 64-bit is equivalent to signed char).
Question: how do I convert between the two? Can I just
use static_cast?
I understand that UTF8 only cares that the underlying data
type be at least 8 bits long. Both signed char and unsigned
char satisfy this. I am just wondering if there is any
gotcha here? Any corner cases?
EDIT: at my compiler's (g++/Gentoo) insistence, only reinterpret_cast can do this conversion (without relying on the C-style cast). Let's say we have two unsigned char strings: 0000 and 1000. The conversion will turn them both into 0. Is this possible under UTF8?

Some libraries use char for storing UTF-8, others use unsigned char.
In this case you may need to cast between char* and unsigned char* using reinterpret_cast, since these types have the same storage unit size and alignment. E.g.:
char const* s = ...;
unsigned char const* p = reinterpret_cast<unsigned char const*>(s);
For object pointer types such as these, static_cast can simulate reinterpret_cast through an intermediate conversion to void*, i.e. char* -> void* -> unsigned char*:
char const* s = ...;
void const* intermediate = s;
unsigned char const* p = static_cast<unsigned char const*>(intermediate);
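To make the cast explicit at call sites, it could be wrapped in two small helpers. This is only a sketch: the names as_char and as_uchar are hypothetical and belong to neither libxml2 nor ICU. The cast changes the pointer type only, so every byte keeps its bit pattern and two different bytes never collapse into the same value.

```cpp
// Hypothetical helpers (illustrative names, not part of libxml2 or ICU).

// View an unsigned char buffer as plain char, for an API that takes char const*.
inline char const* as_char(unsigned char const* p) {
    return reinterpret_cast<char const*>(p);
}

// View a plain char buffer as unsigned char, for an API that takes unsigned char const*.
inline unsigned char const* as_uchar(char const* p) {
    return reinterpret_cast<unsigned char const*>(p);
}
```

Converting there and back yields the original pointer, and reading a byte through either view gives the same bit pattern; only its interpretation as a signed or unsigned number differs.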

If unsigned char* is just a pointer to a string it should not cause any problem.

It should not matter. In any case as soon as you need to extract a char from the char * or unsigned char * stream you will need a function provided by the library that will extract an int and update the pointer/iterator in a manner that is opaque to you (the caller)
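For illustration, such an "extract an int and advance the pointer" function might look roughly like the sketch below. It is not the libxml2 or ICU API; next_code_point is a made-up name, and the code assumes well-formed UTF-8 and does no validation. It also shows why decoders read the bytes as unsigned char: a lead byte such as 0xC3 would be negative if read as a signed char.

```cpp
#include <cstdint>

// Illustrative decoder, NOT a library API: reads one code point from
// well-formed UTF-8 and advances p past it. No validation is performed.
inline std::uint32_t next_code_point(unsigned char const*& p) {
    std::uint32_t lead = *p++;
    if (lead < 0x80) return lead;                 // 1 byte:  0xxxxxxx
    int extra = (lead >= 0xF0) ? 3                // 4 bytes: 11110xxx
              : (lead >= 0xE0) ? 2                // 3 bytes: 1110xxxx
              : 1;                                // 2 bytes: 110xxxxx
    std::uint32_t cp = lead & (0x3F >> extra);    // payload bits of the lead byte
    while (extra-- > 0)
        cp = (cp << 6) | (*p++ & 0x3F);           // each continuation byte adds 6 bits
    return cp;
}
```

In real code the library's own iteration facilities are the better choice, since they also handle malformed input.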

Thanks all. Mike said it best: the difference that makes no difference, and "a byte is a byte is a byte".

Related

Taking an index out of const char* argument

I have the following code:
int some_array[256] = { ... };
int do_stuff(const char* str)
{
int index = *str;
return some_array[index];
}
Apparently the above code causes a bug in some platforms, because *str can in fact be negative.
So I thought of two possible solutions:
Casting the value on assignment (unsigned int index = (unsigned char)*str;).
Passing const unsigned char* instead.
Edit: The rest of this question did not get a treatment, so I moved it to a new thread.
The signedness of char is indeed platform-dependent, but what you do know is that there are as many values of char as there are of unsigned char, and the conversion is injective. So you can absolutely cast the value to associate a lookup index with each character:
unsigned char idx = *str;
return arr[idx];
You should of course make sure that the arr has at least UCHAR_MAX + 1 elements. (This may cause hilarious edge cases when sizeof(unsigned long long int) == 1, which is fortunately rare.)
Characters are allowed to be signed or unsigned, depending on the platform. An assumption of unsigned range is what causes your bug.
Your do_stuff code does not treat const char* as a string representation. It uses it as a sequence of byte-sized indexes into a look-up table. Therefore, there is nothing wrong with forcing unsigned char type on the characters of your string inside do_stuff (i.e. use your solution #1). This keeps re-interpretation of char as an index localized to the implementation of do_stuff function.
Of course, this assumes that other parts of your code do treat str as a C string.
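A minimal sketch of solution #1, keeping the reinterpretation local to the lookup. The function name lookup and the table parameter are hypothetical; the array-reference parameter is just one way to make the compiler enforce the UCHAR_MAX + 1 size requirement.

```cpp
#include <climits>

// Hypothetical lookup function: the table must have one slot per byte value.
inline int lookup(int const (&table)[UCHAR_MAX + 1], char const* str) {
    // Convert through unsigned char so that a byte such as 0xE9 selects
    // slot 233 instead of a negative offset where plain char is signed.
    unsigned char const idx = static_cast<unsigned char>(*str);
    return table[idx];
}
```

Callers keep passing an ordinary C string; only the index computation sees the unsigned value.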

How to work with uint8_t instead of char?

I wish to understand the situation regarding uint8_t vs char, portability, bit-manipulation, the best practices, state of affairs, etc. Do you know a good reading on the topic?
I wish to do byte-IO. But of course char has a more complicated and subtle definition than uint8_t; which I assume was one of the reasons for introducing stdint header.
However, I had problems using uint8_t on multiple occasions. A few months ago, once, because iostreams are not defined for uint8_t. Isn't there a C++ library doing really-well-defined-byte-IO i.e. read and write uint8_t? If not, I assume there is no demand for it. Why?
My latest headache stems from the failure of this code to compile:
uint8_t read(decltype(cin) & s)
{
char c;
s.get(c);
return reinterpret_cast<uint8_t>(c);
}
error: invalid cast from type 'char' to type 'uint8_t {aka unsigned char}'
Why the error? How to make this work?
The general, portable, roundtrip-correct way would be to:
demand in your API that all byte values can be expressed with at most 8 bits,
use the layout-compatibility of char, signed char and unsigned char for I/O, and
convert unsigned char to uint8_t as needed.
For example:
bool read_one_byte(std::istream & is, uint8_t * out)
{
unsigned char x; // a "byte" on your system
if (is.get(reinterpret_cast<char &>(x))) // get() takes a char reference, not a pointer
{
*out = x;
return true;
}
return false;
}
bool write_one_byte(std::ostream & os, uint8_t val)
{
unsigned char x = val;
return static_cast<bool>(os.write(reinterpret_cast<char const *>(&x), 1)); // the stream's operator bool is explicit since C++11
}
Some explanation: Rule 1 guarantees that values can be round-trip converted between uint8_t and unsigned char without losing information. Rule 2 means that we can use the iostream I/O operations on unsigned char variables, even though they're expressed in terms of chars.
We could also have used is.read(reinterpret_cast<char *>(&x), 1) instead of is.get() for symmetry. (Using read in general, for stream counts larger than 1, also requires the use of gcount() on error, but that doesn't apply here.)
As always, you must never ignore the return value of I/O operations. Doing so is always a bug in your program.
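For counts larger than 1, the read-plus-gcount() pattern mentioned above could be sketched as follows. The name read_bytes is hypothetical, and the std::string overload exists only to make the helper easy to try on in-memory data.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: reads up to n bytes into out and returns how many
// bytes were actually read, which may be fewer than n at end of input.
inline std::size_t read_bytes(std::istream& is, std::uint8_t* out, std::size_t n) {
    is.read(reinterpret_cast<char*>(out), static_cast<std::streamsize>(n));
    return static_cast<std::size_t>(is.gcount());
}

// Convenience overload for in-memory data.
inline std::size_t read_bytes(std::string const& data, std::uint8_t* out, std::size_t n) {
    std::istringstream is(data);
    return read_bytes(is, out, n);
}
```

A short read leaves the stream in a failed state, so the caller should compare the returned count with n rather than only testing the stream.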
A few months ago, once, because iostreams are not defined for uint8_t.
uint8_t is pretty much just a typedef for unsigned char. In fact, I doubt you could find a machine where it isn't.
uint8_t read(decltype(cin) & s)
{
char c;
s.get(c);
return reinterpret_cast<uint8_t>(c);
}
Using decltype(cin) instead of std::istream has no advantage at all, it is just a potential source of confusion.
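The cast in the return statement isn't necessary; converting a char into an unsigned char works implicitly. It is also the source of the error: reinterpret_cast cannot convert one integer type into a different integer type.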
A few months ago, once, because iostreams are not defined for uint8_t.
They are. Not for uint8_t itself, but most certainly for the type it actually represents. operator>> is overloaded for unsigned char. This code works:
uint8_t read(istream& s)
{
return s.get();
}
Since unsigned char and char can alias each other you can also just reinterpret_cast any pointer to a char string to an unsigned char* and work with that.
In case you want the most portable way possible, take a look at Kerrek's answer.

How do I convert the contents of an unsigned char * to a const char *?

I came across reinterpret casts, and most of the time they were brought up, a warning was given, so I am wondering if there are other alternatives (or clean implementations of reinterpret cast, of course).
You don't say what warning was given or what the problem was, but casting to char* with reinterpret_cast should work without warnings:
unsigned char *a;
const char *b = reinterpret_cast<char*>(a);
It depends on what you're trying to do.
If you just want to access the contents as char, then a simple
static_cast or using the value in a context where a char is expected
will do the trick.
If you need to pass the buffer to a function expecting a char const*,
a reinterpret_cast is about the only solution.
If you want a string, using the pointers into the buffer will be fine:
std::string
bufferToString( unsigned char const* buffer, size_t length )
{
return std::string( buffer, buffer + length );
}
or you can copy into an existing string:
myString.assign( buffer, buffer + length );
myString.append( buffer, buffer + length );
// etc.
Any string function (or algorithm, like std::copy) which takes two
iterators can be used. All that is required is that dereferencing the
iterator result in a type which converts implicitly to char, which is
the case of unsigned char.
(You cannot use the string functions which take a buffer address and a
length, as these are not templates, and require the buffer address to
have type char const*. And while unsigned char converts implicitly
to char, unsigned char* requires a reinterpret_cast to convert it
to char*.)
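The two routes described above can be put side by side. This is a sketch with hypothetical function names (from_range, from_cast); both are expected to produce the same string on common platforms.

```cpp
#include <cstddef>
#include <string>

// Iterator-pair constructor: each unsigned char converts to char as it is copied.
inline std::string from_range(unsigned char const* buffer, std::size_t length) {
    return std::string(buffer, buffer + length);
}

// Pointer-plus-length constructor: requires char const*, hence the reinterpret_cast.
inline std::string from_cast(unsigned char const* buffer, std::size_t length) {
    return std::string(reinterpret_cast<char const*>(buffer), length);
}
```

Because a length is passed explicitly, embedded zero bytes are copied like any other byte.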

Best way to create a string buffer for binary data

When I try the following, I get an error:
unsigned char * data = "00000000"; //error: cannot convert const char to unsigned char
Is there a special way to do this which I'm missing?
Update
For the sake of brevity, I'll explain what I'm trying to achieve:
I'd like to create a StringBuffer in C++ which uses unsigned values for raw binary data. It seems that an unsigned char is the best way to accomplish this. Is there a better method?
std::vector<unsigned char> data(8, '0');
Or, if the data is not uniform:
auto & arr = "abcdefg";
std::vector<unsigned char> data(arr, arr + sizeof(arr) - 1);
Or, so you can assign directly from a literal:
std::basic_string<unsigned char> data = (const unsigned char *)"abcdefg";
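The vector approach can be wrapped in a small template so that the literal's length is deduced instead of computed with sizeof at each call site. A sketch only; bytes_from_literal is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: copies a string literal into a mutable byte buffer,
// dropping the terminating '\0' that the literal's array type includes.
template <std::size_t N>
std::vector<unsigned char> bytes_from_literal(char const (&lit)[N]) {
    return std::vector<unsigned char>(lit, lit + N - 1);
}
```

The result owns its storage, so it can be modified freely, unlike the literal it was built from.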
Yes, do this:
const char *data = "00000000";
A string literal is an array of char, not unsigned char.
If you need to pass this to a function that takes const unsigned char *, well, you'll need to cast it:
foo(reinterpret_cast<const unsigned char *>(data)); // static_cast cannot convert between these pointer types
You have many ways. One is to write:
const unsigned char *data = (const unsigned char *)"00000000";
Another, which is more recommended is to declare data as it should be:
const char *data = "00000000";
And when you pass it to your function:
myFunc((const unsigned char *)data);
Note that, in general a string of unsigned char is unusual. An array of unsigned chars is more common, but you wouldn't initialize it with a string ("00000000")
Response to your update
If you want raw binary data, first let me tell you that instead of unsigned char, you may be better off using bigger elements, such as long int or long long. When you perform operations on the binary data (which is an array), working on 4 or 8 bytes at a time cuts the number of operations by that factor, which can be a speed boost.
Second, if you want your class to represent binary values, don't initialize it with a string, but with individual values. In your case would be:
unsigned char data[] = {0x30, 0x30, 0x30, 0x30, /* etc */};
Note that I assume you are storing binary as binary! That is, you get 8 bits in an unsigned char. If you, on the other hand, mean binary as in string of 0s and 1s, which is not really a good idea, but either way, you don't really need unsigned char and just char is sufficient.
unsigned char data[] = "00000000";
This will copy "00000000" into an unsigned char[] buffer, which also means that the buffer won't be read-only like a string literal.
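A small sketch of what that array initialisation gives you. The function demo_mutable_copy is hypothetical and exists only to make the behaviour observable.

```cpp
#include <cstddef>

// Hypothetical demo: initialises an array from the literal, modifies it,
// and reports its size to show that the terminating '\0' was copied too.
inline std::size_t demo_mutable_copy() {
    unsigned char data[] = "00000000"; // 8 digits plus '\0' = 9 bytes
    data[0] = 0xFF;                    // allowed: the array is our own copy
    return sizeof(data);
}
```

Writing to data here does not touch the literal itself, which is why no cast and no const are needed.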
The way you're doing it doesn't work because you're pointing data at a string literal, which is an array of plain char (const char[9] in C++), so data would have to be of type const char*. You can't point an unsigned char pointer at it without explicitly casting "00000000", such as: (const unsigned char*)"00000000".
Note that in C string literals have type char[] rather than const char[], while in C++ they are const char[]. In either language, modifying one causes undefined behaviour, which often shows up as an access violation error.
You're trying to assign string value to pointer to unsigned char. You cannot do that. If you have pointer, you can assign only memory address or NULL to that.
Use const char * instead.
Your target variable is a pointer to an unsigned char. "00000000" is a string literal. Its type is const char[9]. You have two type mismatches here. One is that unsigned char and char are different types. The lack of a const qualifier is also a big problem.
You can do this:
unsigned char * data = (unsigned char *)"00000000";
But this is something you should not do. Ever. Casting away the constness of a string literal will get you in big trouble some day.
The following is better. It preserves constness, and reading the bytes of a char array through a pointer to unsigned char is permitted by the language's aliasing rules:
const unsigned char * data = (const unsigned char *)"00000000";
Here you are preserving the constness but you are changing the pointer type from char* to unsigned char*.
@Holland -
unsigned char * data = "00000000";
One very important point I'm not sure we're making clear: the string "00000000\0" (9 bytes, including delimiter) might be in READ-ONLY MEMORY (depending on your platform).
In other words, if you defined your variable ("data") this way, and you passed it to a function that might try to CHANGE "data" ... then you could get an ACCESS VIOLATION.
The solution is:
1) declare as "const char *" (as the others have already said)
... and ...
2) TREAT it as "const char *" (do NOT modify its contents, or pass it to a function that might modify its contents).

How do I specify an integer literal of type unsigned char in C++?

I can specify an integer literal of type unsigned long as follows:
const unsigned long example = 9UL;
How do I do likewise for an unsigned char?
const unsigned char example = 9U?;
This is needed to avoid compiler warning:
unsigned char example2 = 0;
...
min(9U?, example2);
I'm hoping to avoid the verbose workaround I currently have and not have 'unsigned char' appear in the line calling min without declaring 9 in a variable on a separate line:
min(static_cast<unsigned char>(9), example2);
C++11 introduced user defined literals. It can be used like this:
inline constexpr unsigned char operator "" _uchar( unsigned long long arg ) noexcept
{
return static_cast< unsigned char >( arg );
}
unsigned char answer()
{
return 42;
}
int main()
{
std::cout << std::min( 42, answer() ); // Compile time error!
std::cout << std::min( 42_uchar, answer() ); // OK
}
C provides no standard way to designate an integer constant with a width less than that of type int.
However, stdint.h does provide the UINT8_C() macro to do something that's pretty much as close to what you're looking for as you'll get in C.
But most people just use either no suffix (to get an int constant) or a U suffix (to get an unsigned int constant). They work fine for char-sized values, and that's pretty much all you'll get from the stdint.h macro anyway.
You can cast the constant. For example:
min(static_cast<unsigned char>(9), example2);
You can also use the constructor syntax:
typedef unsigned char uchar;
min(uchar(9), example2);
The typedef isn't required on all compilers.
If you are using Visual C++ and have no need for interoperability between compilers, you can use the ui8 suffix on a number to make it into an unsigned 8-bit constant.
min(9ui8, example2);
You can't do this with actual char constants like '9' though.
Assuming that you are using std::min what you actually should do is explicitly specify what type min should be using as such
unsigned char example2 = 0;
min<unsigned char>(9, example2);
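Assuming std::min is the function in question, the explicit template argument could be used like this. The wrapper cap_at_nine is a hypothetical name, added only to give the call a context.

```cpp
#include <algorithm>

// Hypothetical wrapper: caps a value at 9 without a separately named constant.
inline unsigned char cap_at_nine(unsigned char value) {
    // With the template argument given explicitly, no deduction takes place,
    // so the int literal 9 is simply converted to unsigned char.
    return std::min<unsigned char>(9, value);
}
```

This keeps unsigned char in the call, but it appears once as a template argument rather than as a cast around the literal.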
Simply const unsigned char example = 0; will do fine.
I suppose '\0' would be a char literal with the value 0, but I don't see the point either.
There is no suffix for unsigned char types. Integer constants are either int or long (signed or unsigned) and in C99 long long. You can use the plain 'U' suffix without worry as long as the value is within the valid range of unsigned chars.
The question was how to "specify an integer 'literal' of type unsigned char in C++?". Not how to declare an identifier.
You use the escape backslash and octal digits in apostrophes. (eg. '\177')
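Note, however, that such a character literal has type char, not unsigned char, so its value can be negative on platforms where plain char is signed.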