Working with binary data and unsigned char - C++

I'm writing a program that reads the contents of a binary file (specifically a Windows PE file; see the Wikipedia page and the detailed PE structure).
Because of the binary data in the file, the characters often "fall out" of the ASCII range (0-127), which results in negative values.
To make sure I won't work with unwanted negative values, I can either pass const unsigned char * or convert the resulting char in the calculation to unsigned char.
On one hand, passing const unsigned char * makes sense because the data is non-ASCII data with a numeric value and thus should be treated as positive.
In addition, it'll let me perform calculations without the need to cast the result to unsigned char.
On the other hand, I can't pass constant strings (const char *, such as pre-defined strings "MZ", "PE\0\0" etc.) to functions without first casting them to const unsigned char *.
What would be the better approach or best-practice in this scenario?

I think I'd use unsigned char, but avoid casting, and instead define a little class named ustring (or something similar). You have a couple of choices with that. One would be to instantiate std::basic_string over unsigned char. This can be useful (it gives you all of std::string's functionality, but with unsigned chars instead of chars). The obvious disadvantage is that it's probably overkill, and has essentially no compatibility with std::string, even though it's almost exactly the same thing. It also relies on the library supplying std::char_traits<unsigned char>, which the standard does not require.
The other obvious possibility would be to define your own class. Since you apparently care mostly about string literals, I'd probably go this way. The class would be initialized with a string literal, and it would just hold a pointer to the string, but as unsigned char const * instead of just char const *.
Then there's one more step to make life better: define a user defined literal operator named something like _us, so creating an object of your type from a string literal will look something like this: auto DOS_sig = "MZ"_us;
#include <cstddef>
#include <cstring>

class ustring {
    unsigned char const *data;
    std::size_t len;
public:
    ustring(unsigned char const *s, std::size_t len)
        : data(s)
        , len(len)
    {}
    // data is unsigned char const *, so the conversion has to return that type.
    operator unsigned char const *() const { return data; }
    bool operator==(ustring const &other) const {
        // note: memcmp treats what you pass it as unsigned chars.
        return len == other.len && 0 == std::memcmp(data, other.data, len);
    }
    // you probably want to add more stuff here.
};
// The length parameter of a string literal operator must be std::size_t.
ustring operator""_us(char const * const s, std::size_t len) {
    return ustring(reinterpret_cast<unsigned char const *>(s), len);
}
If I'm not mistaken, this should be pretty easy to work with. For example, let's assume you've memory mapped what you think is a PE file, with its base address at mapped_file. To see if it has a DOS signature, you might do something like this:
if (ustring(&mapped_file[0], 2) == "MZ"_us)
    std::cerr << "File appears to be an executable.\n";
else
    std::cerr << "File does not appear to be an executable.\n";
Caution: I haven't tested this, so fencepost errors and such are likely--for example, I don't remember whether the length passed to the user defined literal operator includes the NUL terminator or not. This isn't intended to represent finished code, just a sketch of a general direction that might be useful to explore.
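On the fencepost question: the length handed to a string literal operator does not include the terminating NUL, but it does count NULs written inside the literal. Here is a minimal sketch that checks this, assuming a plain in-memory buffer rather than a real memory-mapped file; has_dos_signature, size() and ustring_example are names made up for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Non-owning view over a run of bytes, along the lines of the sketch above.
class ustring {
    unsigned char const *data_;
    std::size_t len_;
public:
    ustring(unsigned char const *s, std::size_t len) : data_(s), len_(len) {}
    std::size_t size() const { return len_; }
    bool operator==(ustring const &other) const {
        return len_ == other.len_ && 0 == std::memcmp(data_, other.data_, len_);
    }
};

// len is the number of characters in the literal, terminator excluded.
ustring operator""_us(char const *s, std::size_t len) {
    return ustring(reinterpret_cast<unsigned char const *>(s), len);
}

// Hypothetical helper: true if the buffer starts with the two-byte DOS signature.
bool has_dos_signature(unsigned char const *buf, std::size_t size) {
    return size >= 2 && ustring(buf, 2) == "MZ"_us;
}

void ustring_example() {
    unsigned char const header[] = {0x4D, 0x5A, 0x90, 0x00};  // 'M', 'Z', ...
    assert(has_dos_signature(header, sizeof header));
    assert("MZ"_us.size() == 2);      // terminating NUL is not counted
    assert("PE\0\0"_us.size() == 4);  // NULs inside the literal are counted
}
```

So "PE\0\0"_us compares all four bytes of the PE signature, which a strlen-based approach would not.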

Related

Taking an index out of const char* argument

I have the following code:
int some_array[256] = { ... };
int do_stuff(const char* str)
{
    int index = *str;
    return some_array[index];
}
Apparently the above code causes a bug in some platforms, because *str can in fact be negative.
So I thought of two possible solutions:
Casting the value on assignment (unsigned int index = (unsigned char)*str;).
Passing const unsigned char* instead.
Edit: The rest of this question did not get a treatment, so I moved it to a new thread.
The signedness of char is indeed platform-dependent, but what you do know is that there are as many values of char as there are of unsigned char, and the conversion is injective. So you can absolutely cast the value to associate a lookup index with each character:
unsigned char idx = *str;
return arr[idx];
You should of course make sure that the arr has at least UCHAR_MAX + 1 elements. (This may cause hilarious edge cases when sizeof(unsigned long long int) == 1, which is fortunately rare.)
Characters are allowed to be signed or unsigned, depending on the platform. An assumption of unsigned range is what causes your bug.
Your do_stuff code does not treat const char* as a string representation. It uses it as a sequence of byte-sized indexes into a look-up table. Therefore, there is nothing wrong with forcing unsigned char type on the characters of your string inside do_stuff (i.e. use your solution #1). This keeps re-interpretation of char as an index localized to the implementation of do_stuff function.
Of course, this assumes that other parts of your code do treat str as a C string.
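To make solution #1 concrete, here is a small sketch. The names table, lookup and lookup_example are placeholders, and the table is sized from UCHAR_MAX rather than a hard-coded 256.

```cpp
#include <cassert>
#include <climits>

// Hypothetical lookup table with one slot per possible unsigned char value.
static int table[UCHAR_MAX + 1];

int lookup(const char *str) {
    // Convert to unsigned char first, so the index is always in
    // [0, UCHAR_MAX] whether plain char is signed or unsigned here.
    const unsigned char idx = static_cast<unsigned char>(*str);
    return table[idx];
}

void lookup_example() {
    table[0xE9] = 42;
    const char text[] = "\xE9";  // a negative value where plain char is signed
    assert(lookup(text) == 42);
}
```

The caller keeps passing ordinary const char* strings; only the one line that turns a character into an index knows about the reinterpretation.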

How to work with uint8_t instead of char?

I wish to understand the situation regarding uint8_t vs char, portability, bit-manipulation, the best practices, state of affairs, etc. Do you know a good reading on the topic?
I wish to do byte-IO. But of course char has a more complicated and subtle definition than uint8_t; which I assume was one of the reasons for introducing stdint header.
However, I had problems using uint8_t on multiple occasions. A few months ago, once, because iostreams are not defined for uint8_t. Isn't there a C++ library doing really-well-defined-byte-IO i.e. read and write uint8_t? If not, I assume there is no demand for it. Why?
My latest headache stems from the failure of this code to compile:
uint8_t read(decltype(cin) & s)
{
    char c;
    s.get(c);
    return reinterpret_cast<uint8_t>(c);
}
error: invalid cast from type 'char' to type 'uint8_t {aka unsigned char}'
Why the error? How to make this work?
The general, portable, roundtrip-correct way would be to:
1) demand in your API that all byte values can be expressed with at most 8 bits,
2) use the layout-compatibility of char, signed char and unsigned char for I/O, and
3) convert unsigned char to uint8_t as needed.
For example:
bool read_one_byte(std::istream & is, uint8_t * out)
{
    unsigned char x; // a "byte" on your system
    // get() takes a char &, so view x as a char for the duration of the call
    if (is.get(reinterpret_cast<char &>(x)))
    {
        *out = x;
        return true;
    }
    return false;
}
bool write_one_byte(std::ostream & os, uint8_t val)
{
    unsigned char x = val;
    // the stream's conversion to bool is explicit, so spell it out
    return static_cast<bool>(os.write(reinterpret_cast<char const *>(&x), 1));
}
Some explanation: Rule 1 guarantees that values can be round-trip converted between uint8_t and unsigned char without losing information. Rule 2 means that we can use the iostream I/O operations on unsigned char variables, even though they're expressed in terms of chars.
We could also have used is.read(reinterpret_cast<char *>(&x), 1) instead of is.get() for symmetry. (Using read in general, for stream counts larger than 1, also requires the use of gcount() on error, but that doesn't apply here.)
As always, you must never ignore the return value of I/O operations. Doing so is always a bug in your program.
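As a quick sanity check of the approach, here is a sketch that pushes a byte through an in-memory stream and reads it back. It uses a std::stringstream in place of a real file, takes the output by reference instead of by pointer, and round_trips and byte_io_example are made-up names.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>

// Reads one byte; returns false on EOF or error and leaves out untouched.
bool read_one_byte(std::istream &is, std::uint8_t &out) {
    unsigned char x;
    if (!is.get(reinterpret_cast<char &>(x))) return false;
    out = x;
    return true;
}

// Writes one byte; returns false if the stream reports a failure.
bool write_one_byte(std::ostream &os, std::uint8_t val) {
    unsigned char x = val;
    return static_cast<bool>(os.write(reinterpret_cast<char const *>(&x), 1));
}

// Hypothetical check: write a value to an in-memory stream and read it back.
bool round_trips(std::uint8_t value) {
    std::stringstream buf;
    std::uint8_t back = 0;
    return write_one_byte(buf, value) && read_one_byte(buf, back) && back == value;
}

void byte_io_example() {
    assert(round_trips(0x00));
    assert(round_trips(0xFF));  // would be negative as a signed char
    std::istringstream empty;
    std::uint8_t unused = 0;
    assert(!read_one_byte(empty, unused));  // EOF is reported, not ignored
}
```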
"A few months ago, once, because iostreams are not defined for uint8_t."
uint8_t is pretty much just a typedef for unsigned char. In fact, I doubt you could find a machine where it isn't.
uint8_t read(decltype(cin) & s)
{
    char c;
    s.get(c);
    return reinterpret_cast<uint8_t>(c);
}
Using decltype(cin) instead of std::istream has no advantage at all, it is just a potential source of confusion.
The cast in the return-statement isn't necessary; converting a char into an unsigned char works implicitly.
"A few months ago, once, because iostreams are not defined for uint8_t."
They are. Not for uint8_t itself, but most certainly for the type it actually represents. operator>> is overloaded for unsigned char. This code works:
uint8_t read(istream& s)
{
    return s.get();
}
Since unsigned char and char can alias each other you can also just reinterpret_cast any pointer to a char string to an unsigned char* and work with that.
In case you want the most portable way possible, take a look at Kerrek's answer.
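One thing to watch with the no-argument get(): it returns an int_type, and end-of-file comes back as traits_type::eof(), which gets lost if the result is narrowed to uint8_t straight away. Here is a hedged sketch of a loop that checks for it before narrowing; read_all_bytes and read_all_example are names invented for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads every remaining byte from the stream into a vector.
std::vector<std::uint8_t> read_all_bytes(std::istream &s) {
    std::vector<std::uint8_t> bytes;
    for (;;) {
        const std::istream::int_type c = s.get();
        // Compare against eof() while the value is still an int_type;
        // a real 0xFF byte arrives as 255 and is not mistaken for EOF.
        if (c == std::istream::traits_type::eof()) break;
        bytes.push_back(static_cast<std::uint8_t>(c));
    }
    return bytes;
}

void read_all_example() {
    // "\xff" and "A" are separate literals so 'A' is not read as a hex digit.
    std::istringstream in(std::string("\x00\xff" "A", 3));
    const std::vector<std::uint8_t> bytes = read_all_bytes(in);
    assert(bytes.size() == 3);
    assert(bytes[0] == 0x00 && bytes[1] == 0xFF && bytes[2] == 'A');
}
```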

Global typecast operator overload?

I'm writing some 'portable' code (meaning that it targets 32- and 64-bit MSVC2k10 and GCC on Linux) in which I have, more or less:
typedef unsigned char uint8;
C-strings are always uint8; this is for string-processing reasons. Legacy code needs char compiled as signed, so I can't set compiler switches to default it to unsigned. But if I'm processing a string I can't very well index an array:
char foo[500];
char *ptr = (foo + 4);
*ptr = some_array_that_normalizes_it[*ptr];
You can't index an array with a negative number at run-time without serious consequences. Keeping C-strings unsigned makes it much easier to guard against such bugs.
I would really like to not have to keep casting (char *) every time I use a function that takes char *'s, and also stop duplicating class functions so that they take either. This is especially a pain because a string constant is implicitly passed as a char *:
int foo = strlen("Hello"); // "Hello" is passed as a char *
I want all of these to work:
char foo[500] = "Hello!"; // Works
uint8 foo2[500] = "Hello!"; // Works
uint32 len = strlen(foo); // Works
uint32 len2 = strlen(foo2); // Doesn't work
uint32 len3 = strlen((char *)foo2); // Works
There are probably caveats to allowing implicit type conversions of this nature, however, it'd be nice to use functions that take a char * without a cast every time.
So, I figured something like this would work:
operator char* (const uint8* foo) { return (char *)foo; }
However it does not. I can't figure out any way to make it work. I also can't find anything to tell me why there seems to be no way to do this. I can see the possible logic - implicit conversions like that could be a cause of FAR too many bugs - but I can't find anything that says "this will not work in C++" or why, or how to make it work (short of making uint8 a class which is ridiculous).
Overloading the cast (typecast) operator, the assignment operator, the array subscript operator or the function call operator at global scope is not allowed in C++.
MSVC will generate C2801 errors for them. See Wikipedia's list of C++ operators and their overloading rules.
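I'm not a big fan of operator [ab]using, and it would not help here anyway: an overloaded operator must take at least one operand of class or enumeration type, so something like a unary operator+ on a plain uint8* will not compile.
You can get the same effect with a pair of short helper functions (the name cs is just a placeholder):
const char* cs(const uint8* foo)
{
    return reinterpret_cast<const char *>(foo);
}
char* cs(uint8* foo)
{
    return reinterpret_cast<char *>(foo);
}
With those defined, your example from above:
uint32 len2 = strlen(foo2);
will become
uint32 len2 = strlen(cs(foo2));
It is not an automatic cast, but this way you have an easy, yet explicit way of doing it.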
Both compilers you mention do have a "treat chars as unsigned" switch. Why not use that?

Best way to create a string buffer for binary data

When I try the following, I get an error:
unsigned char * data = "00000000"; //error: cannot convert const char to unsigned char
Is there a special way to do this which I'm missing?
Update
For the sake of brevity, I'll explain what I'm trying to achieve:
I'd like to create a StringBuffer in C++ which uses unsigned values for raw binary data. It seems that an unsigned char is the best way to accomplish this. Is there a better method?
std::vector<unsigned char> data(8, '0');
Or, if the data is not uniform:
auto & arr = "abcdefg";
std::vector<unsigned char> data(arr, arr + sizeof(arr) - 1);
Or, so you can assign directly from a literal:
std::basic_string<unsigned char> data = (const unsigned char *)"abcdefg";
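If you do this in more than one place, the sizeof arithmetic can be wrapped up once. This is only a sketch, and bytes_of and bytes_of_example are made-up names; the array reference lets the compiler deduce the literal's length, and the terminating NUL is dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies the characters of a string literal (without its terminating NUL)
// into a modifiable buffer of unsigned char.
template <std::size_t N>
std::vector<unsigned char> bytes_of(const char (&literal)[N]) {
    return std::vector<unsigned char>(literal, literal + N - 1);
}

void bytes_of_example() {
    std::vector<unsigned char> data = bytes_of("00000000");
    assert(data.size() == 8);  // the terminating NUL is not copied
    data[0] = 0xFF;            // the copy is writable, unlike the literal
    assert(data[0] == 0xFF);
}
```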
Yes, do this:
const char *data = "00000000";
A string literal is an array of char, not unsigned char.
If you need to pass this to a function that takes const unsigned char *, well, you'll need to cast it:
foo(reinterpret_cast<const unsigned char *>(data));
You have many ways. One is to write:
const unsigned char *data = (const unsigned char *)"00000000";
Another, which is more recommended is to declare data as it should be:
const char *data = "00000000";
And when you pass it to your function:
myFunc((const unsigned char *)data);
Note that, in general, a string of unsigned char is unusual. An array of unsigned chars is more common, but you wouldn't initialize it with a string ("00000000").
Response to your update
If you want raw binary data, first let me tell you that instead of unsigned char, you may be better off using bigger units, such as long int or long long. This is because when you perform bulk operations on the data (which is an array), each operation handles 4 or 8 bytes at a time, so the number of operations is cut by a factor of 4 or 8, which is a speed boost.
Second, if you want your class to represent binary values, don't initialize it with a string, but with individual values. In your case that would be:
unsigned char data[] = {0x30, 0x30, 0x30, 0x30, /* etc */};
Note that I assume you are storing binary as binary! That is, you get 8 bits in an unsigned char. If, on the other hand, you mean binary as in a string of 0s and 1s, that is not really a good idea, but in that case you don't need unsigned char; plain char is sufficient.
unsigned char data[] = "00000000";
This will copy "00000000" into an unsigned char[] buffer, which also means that the buffer won't be read-only like a string literal.
The reason why the way you're doing it won't work is that you're pointing data at a string literal, which is an array of plain char (const char[] in C++), so data has to be of type const char*. You can't point an unsigned char* at it without an explicit cast, such as: (unsigned char*)"00000000".
Note that in C string literals aren't explicitly of type const char[] (in C++ they are). Either way, if you try to modify them you will cause undefined behaviour, which a lot of the time shows up as an access violation.
You're trying to assign a string value to a pointer to unsigned char. You cannot do that. A pointer can only be assigned a memory address or NULL.
Use const char * instead.
Your target variable is a pointer to an unsigned char. "00000000" is a string literal. Its type is const char[9]. You have two type mismatches here. One is that unsigned char and char are different types. The lack of a const qualifier is also a big problem.
You can do this:
unsigned char * data = (unsigned char *)"00000000";
But this is something you should not do. Ever. Casting away the constness of a string literal will get you in big trouble some day.
The following is better. Reading char objects through a pointer to unsigned char is explicitly allowed by the language, so this is well-defined as long as you only read through the pointer:
const unsigned char * data = (const unsigned char *)"00000000";
Here you are preserving the constness but you are changing the pointer type from const char* to const unsigned char*.
@Holland -
unsigned char * data = "00000000";
One very important point I'm not sure we're making clear: the string "00000000\0" (9 bytes, including delimiter) might be in READ-ONLY MEMORY (depending on your platform).
In other words, if you defined your variable ("data") this way, and you passed it to a function that might try to CHANGE "data" ... then you could get an ACCESS VIOLATION.
The solution is:
1) declare as "const char *" (as the others have already said)
... and ...
2) TREAT it as "const char *" (do NOT modify its contents, or pass it to a function that might modify its contents).

What can I do with an unsigned char* when I needed a string?

Suppose that I have an unsigned char*, let's call it: some_data
unsigned char* some_data;
And some_data has url-like data in it. for example:
"aasdASDASsdfasdfasdf&Foo=cow&asdfasasdfadsfdsafasd"
I have a function that can grab the value of 'foo' as follows:
// looks for the value of 'foo'
bool grabFooValue(const std::string& p_string, std::string& p_foo_value)
{
    size_t start = p_string.find("Foo="), end;
    if(start == std::string::npos)
        return false;
    start += 4;
    end = p_string.find_first_of("& ", start);
    p_foo_value = p_string.substr(start, end - start);
    return true;
}
The trouble is that I need a string to pass to this function, or at least a char* (which can be converted to a string no problem).
I can solve this problem by casting:
reinterpret_cast<char *>(some_data)
And then pass it to the function all okie-dokie
...
Until I used valgrind and found out that this can lead to a subtle memory leak.
Conditional jump or move depends on uninitialised value(s) __GI_strlen
From what I gathered, it has to do with the reinterpret casting messing up the null indicating the end of the string. Thus when C++ tries to figure out the length of the string things get screwy.
Given that I can't change the fact that some_data is represented by an unsigned char*, is there a way to go about using my grabFooValue function without having these subtle problems?
I'd prefer to keep the value-finding function that I already have, unless there is clearly a better way to rip the foo-value out of this (sometimes large) unsigned char*.
And despite the unsigned char* some_data 's varying, and sometimes large size, I can assume that the value of 'foo' will be somewhere early on, so my thoughts were to try and get a char* of the first X characters of the unsigned char*. This could potentially get rid of the string-length issue by having me set where the char* ends.
I tried using a combination of strncpy and casting but so far no dice. Any thoughts?
You need to know the length of the data your unsigned char * points to, since it isn't 0-terminated.
Then, use e.g:
std::string s((char *) some_data, (char *) some_data + len);
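Put together with the function from the question, that might look like the sketch below. grabFooValueFromBytes and grab_example are names I made up, and len is assumed to be supplied by whatever produced the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// grabFooValue as given in the question.
bool grabFooValue(const std::string &p_string, std::string &p_foo_value) {
    std::size_t start = p_string.find("Foo="), end;
    if (start == std::string::npos)
        return false;
    start += 4;
    end = p_string.find_first_of("& ", start);
    p_foo_value = p_string.substr(start, end - start);
    return true;
}

// Hypothetical wrapper: len must be the number of valid bytes at data.
// The std::string is built from an explicit range, so no terminator is needed
// and strlen is never called on the raw buffer.
bool grabFooValueFromBytes(const unsigned char *data, std::size_t len,
                           std::string &p_foo_value) {
    const char *first = reinterpret_cast<const char *>(data);
    return grabFooValue(std::string(first, first + len), p_foo_value);
}

void grab_example() {
    // Deliberately not NUL-terminated.
    const unsigned char raw[] = {'a', '&', 'F', 'o', 'o', '=',
                                 'c', 'o', 'w', '&', 'z'};
    std::string value;
    assert(grabFooValueFromBytes(raw, sizeof raw, value));
    assert(value == "cow");
}
```

If the value is known to sit near the start, passing a smaller len limits how much of the buffer gets copied.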