I have the following code:
int some_array[256] = { ... };
int do_stuff(const char* str)
{
int index = *str;
return some_array[index];
}
Apparently the above code causes a bug on some platforms, because *str can in fact be negative.
So I thought of two possible solutions:
Casting the value on assignment (unsigned int index = (unsigned char)*str;).
Passing const unsigned char* instead.
Edit: The rest of this question did not get a treatment, so I moved it to a new thread.
The signedness of char is indeed platform-dependent, but what you do know is that there are as many values of char as there are of unsigned char, and the conversion is injective. So you can absolutely cast the value to associate a lookup index with each character:
unsigned char idx = *str;
return arr[idx];
You should of course make sure that the arr has at least UCHAR_MAX + 1 elements. (This may cause hilarious edge cases when sizeof(unsigned long long int) == 1, which is fortunately rare.)
Characters are allowed to be signed or unsigned, depending on the platform. An assumption of unsigned range is what causes your bug.
Your do_stuff code does not treat const char* as a string representation. It uses it as a sequence of byte-sized indexes into a look-up table. Therefore, there is nothing wrong with forcing unsigned char type on the characters of your string inside do_stuff (i.e. use your solution #1). This keeps re-interpretation of char as an index localized to the implementation of do_stuff function.
Of course, this assumes that other parts of your code do treat str as a C string.
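A minimal sketch of solution #1, assuming the table is sized from UCHAR_MAX rather than a hard-coded 256 (the name kTableSize is made up for illustration; some_array and do_stuff come from the question):

```cpp
#include <climits>
#include <cstddef>

// One slot for every possible unsigned char value.
constexpr std::size_t kTableSize = UCHAR_MAX + 1;

int some_array[kTableSize] = {};

int do_stuff(const char* str)
{
    // Convert to unsigned char first, so the index is never negative
    // even on platforms where plain char is signed.
    const unsigned char index = static_cast<unsigned char>(*str);
    return some_array[index];
}
```

The cast stays inside do_stuff, so callers can keep passing ordinary const char* strings.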
I want to calculate a hash of a structure by passing it as a string. Although the vlanId values are different, the hash value is still the same. The StringHash() function calculates the hash. I haven't assigned any value to portId and vsi.
#include <stdio.h>
#include <functional>
#include <cstring>
#include <string>
using namespace std;
unsigned long StringHash(unsigned char *Arr)
{
hash<string> str_hash;
string Str((const char *)Arr);
unsigned long str_hash_value = str_hash(Str);
printf("Hash=%lu\n", str_hash_value);
return str_hash_value;
}
typedef struct
{
unsigned char portId;
unsigned short vlanId;
unsigned short vsi;
}VlanConfig;
int main()
{
VlanConfig v1;
memset(&v1,0,sizeof(VlanConfig));
unsigned char *index = (unsigned char *)&v1 + sizeof(unsigned char);
v1.vlanId = 10;
StringHash(index);
StringHash((unsigned char *)&v1);
v1.vlanId = 12;
StringHash(index);
StringHash((unsigned char *)&v1);
return 0;
}
Output:
Hash=6142509188972423790
Hash=6142509188972423790
Hash=6142509188972423790
Hash=6142509188972423790
You pass the bytes of your structure to a function expecting a zero terminated string. Well, the first byte of your structure already is zero, so you calculate the same hash every time.
Now, that is the explanation why, but not the solution to your problem. Passing a random sequence of bytes to a function expecting a zero-terminated sequence of characters is going to fail spectacularly, no matter how you do it.
Find another way to hash your structure. You are already using hash<>, why not use it for your case:
namespace std
{
template<> struct hash<VlanConfig>
{
std::size_t operator()(VlanConfig const& c) const noexcept
{
// Hash each field with the hasher for its declared type.
std::size_t h1 = std::hash<unsigned char>{}(c.portId);
std::size_t h2 = std::hash<unsigned short>{}(c.vlanId);
std::size_t h3 = std::hash<unsigned short>{}(c.vsi);
return h1 ^ (h2 << 1) ^ (h3 << 2); // or use boost::hash_combine
}
};
}
Then you can do this:
VlanConfig myVariable;
// fill myVariable
std::cout << std::hash<VlanConfig>{}(myVariable) << std::endl;
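For the "or use boost::hash_combine" remark, here is a rough sketch of a combiner in that spirit. The mixing formula is a commonly cited one and is an assumption here, not Boost's actual implementation; hash_combine and hash_vlan are illustrative names, and VlanConfig is repeated so the snippet stands alone.

```cpp
#include <cstddef>
#include <functional>

struct VlanConfig {
    unsigned char portId;
    unsigned short vlanId;
    unsigned short vsi;
};

// Mixes one more hash value into seed. Unlike a plain XOR of shifted
// values, the result depends on the order of the fields.
inline void hash_combine(std::size_t& seed, std::size_t value)
{
    seed ^= value + 0x9e3779b9u + (seed << 6) + (seed >> 2);
}

// Hashes the three fields one by one; padding bytes are never touched.
inline std::size_t hash_vlan(const VlanConfig& c)
{
    std::size_t seed = 0;
    hash_combine(seed, std::hash<unsigned char>{}(c.portId));
    hash_combine(seed, std::hash<unsigned short>{}(c.vlanId));
    hash_combine(seed, std::hash<unsigned short>{}(c.vsi));
    return seed;
}
```

Because only the named members are read, the result does not depend on how the compiler pads the struct.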
I can't say for certain, but most likely your issue is structure padding. Unless explicitly told to pack members and ignore alignment, most compilers will lay out the struct as follows:
Byte 0: portId
Byte 1: padding
Bytes 2,3: vlanId
Bytes 4,5: vsi
So when you figure the address of index, it'll point to the padding byte, which is always zero. Thus you're always hashing an empty string.
You should be able to check this in a debugger by inspecting index and comparing it to the address of vlanId.
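If a debugger is not handy, the same check can be done in code. This is a small sketch using offsetof; vlan_offset, has_padding_after_port and vlan_bytes are illustrative names, and the struct is repeated from the question.

```cpp
#include <cstddef>

struct VlanConfig {
    unsigned char portId;
    unsigned short vlanId;
    unsigned short vsi;
};

// Byte offset of vlanId from the start of the struct, as chosen by the compiler.
constexpr std::size_t vlan_offset = offsetof(VlanConfig, vlanId);

// True when at least one padding byte sits between portId and vlanId.
constexpr bool has_padding_after_port = vlan_offset > sizeof(unsigned char);

// Pointer to the first byte of vlanId, computed from the real offset
// instead of assuming it follows portId directly.
inline const unsigned char* vlan_bytes(const VlanConfig& c)
{
    return reinterpret_cast<const unsigned char*>(&c) + vlan_offset;
}
```

On a compiler that uses the layout listed above, vlan_offset is 2 and has_padding_after_port is true, which is why `&v1 + sizeof(unsigned char)` lands on the padding byte.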
-- Edit --
After giving this some more thought, I have to say that in my extremely humble opinion, this isn't a good way to get a hash value. Trying to treat several numeric values that might, or might not, be contiguous in memory as a std::string, has too many possibilities for error.
Even if you do get the address right, consider what happens when you hash two different configurations, one with vlanId set to 256 and the other with it set to 512. On a little-endian machine, both have a zero byte as the first character of the string, so you're right back here again.
Worse yet is the case when all four bytes in vlanId and vsi are non zero. In that case, you'll read right off the end of your struct, and keep on going, reading who knows what. There's no way that's going to end well.
One possible solution is to figure the size of data, and use the following ctor for std::string: string (char const *s, size_t n); which has the advantage of forcing the string to exactly the size you want.
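A sketch of that idea, assuming you copy the fields into a small buffer first so that padding bytes never enter the hash (field_bytes and hash_config are made-up names):

```cpp
#include <cstddef>
#include <cstring>
#include <functional>
#include <string>

struct VlanConfig {
    unsigned char portId;
    unsigned short vlanId;
    unsigned short vsi;
};

// Packs the three fields back to back and returns them as a string with an
// explicit length, so embedded zero bytes are kept instead of ending the string.
inline std::string field_bytes(const VlanConfig& c)
{
    char buf[sizeof c.portId + sizeof c.vlanId + sizeof c.vsi];
    std::size_t pos = 0;
    std::memcpy(buf + pos, &c.portId, sizeof c.portId);
    pos += sizeof c.portId;
    std::memcpy(buf + pos, &c.vlanId, sizeof c.vlanId);
    pos += sizeof c.vlanId;
    std::memcpy(buf + pos, &c.vsi, sizeof c.vsi);
    pos += sizeof c.vsi;
    return std::string(buf, pos); // the (pointer, length) constructor
}

inline std::size_t hash_config(const VlanConfig& c)
{
    return std::hash<std::string>{}(field_bytes(c));
}
```

With the length given explicitly, vlanId values of 256 and 512 produce different byte strings, and nothing is read past the end of the struct.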
I'm writing a program that reads the content of a binary file (specifically, a Windows PE file; see the Wikipedia page and the detailed PE structure).
Because of the binary data in the file, the characters often "fall out" of the ASCII range (0-127), which results in negative values.
To make sure I won't work with unwanted negative values, I can either pass const unsigned char * or convert the resulting char in the calculation to unsigned char.
On one hand, passing const unsigned char * makes sense because the data is non-ASCII data with a numeric value and thus should be treated as positive.
In addition, it'll let me perform calculations without the need to cast the result to unsigned char.
On the other hand, I can't pass constant strings (const char *, such as pre-defined strings "MZ", "PE\0\0" etc.) to functions without first casting them to const unsigned char *.
What would be the better approach or best-practice in this scenario?
I think I'd use unsigned char, but avoid casting, and instead define a little class named ustring (or something similar). You have a couple of choices with that. One would be to instantiate std::basic_string over unsigned char. This can be useful (it gives you all of std::string's functionality, but with unsigned chars instead of chars). The obvious disadvantages are that it's probably overkill, that it has essentially no compatibility with std::string even though it's almost exactly the same thing, and that the standard only guarantees std::char_traits for the built-in character types, so you may have to supply your own traits for unsigned char.
The other obvious possibility would be to define your own class. Since you apparently care mostly about string literals, I'd probably go this way. The class would be initialized with a string literal, and it would just hold a pointer to the string, but as unsigned char * instead of just char *.
Then there's one more step to make life better: define a user defined literal operator named something like _us, so creating an object of your type from a string literal will look something like this: auto DOS_sig = "MZ"_us;
#include <cstddef>
#include <cstring>

class ustring {
    unsigned char const *data;
    std::size_t len;
public:
    ustring(unsigned char const *s, std::size_t len)
        : data(s)
        , len(len)
    {}
    // data is unsigned char const *, so the conversion must return that type.
    operator unsigned char const *() const { return data; }
    bool operator==(ustring const &other) const {
        // note: memcmp treats what you pass it as unsigned chars.
        return len == other.len && 0 == std::memcmp(data, other.data, len);
    }
    // you probably want to add more stuff here.
};
// A string literal operator must take (char const *, std::size_t).
ustring operator""_us(char const * const s, std::size_t len) {
    return ustring(reinterpret_cast<unsigned char const *>(s), len);
}
If I'm not mistaken, this should be pretty easy to work with. For example, let's assume you've memory mapped what you think is a PE file, with its base address at mapped_file. To see if it has a DOS signature, you might do something like this:
if (ustring(&mapped_file[0], 2) == "MZ"_us)
std::cerr << "File appears to be an executable.\n";
else
std::cerr << "file does not appear to be an executable.\n";
Caution: I haven't tested this, so fencepost errors and such are likely--for example, I don't remember whether the length passed to the user defined literal operator includes the NUL terminator or not. This isn't intended to represent finished code, just a sketch of a general direction that might be useful to explore.
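On the open question about the length: the value passed to a string literal operator does not count the terminating NUL. A tiny sketch to confirm it, using a throwaway operator named _len (an illustrative name, not part of the answer's class):

```cpp
#include <cstddef>

// Returns the length the compiler passes to a string literal operator.
constexpr std::size_t operator""_len(const char*, std::size_t len)
{
    return len;
}

// "MZ" has two characters; sizeof would report 3 because of the terminator.
static_assert("MZ"_len == 2, "length excludes the terminating NUL");

// Embedded NULs written in the literal are counted: P, E, \0, \0.
static_assert("PE\0\0"_len == 4, "embedded NULs are part of the length");
```

So a signature such as "PE\0\0" keeps all four of its bytes when it goes through the literal operator.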
One of the functions in a third-party class returns a wchar_t* that holds a resource ID (I don't know why it uses the wchar_t* type). I need to convert this pointer to short int.
This method, using the AND operator, works for me, but it doesn't seem like the correct way. Is there a proper way to do this?
wchar_t* s;
short int b = (unsigned long)(s) & 0xFFFF;
wchar_t* s; // I assume this is what you meant
short int b = static_cast<short int>(reinterpret_cast<std::intptr_t>(s)); // std::intptr_t is declared in <cstdint>
You could also replace short int b with auto b, and it will be deduced as short int from the type of the right-hand expression.
It returns the resource ID as a wchar_t* because that is the data type that Windows uses to carry resource identifiers. Resources can be identified by either numeric ID or by name. If numeric, the pointer itself contains the actual ID number encoded in its lower 16 bits. Otherwise it is a normal pointer to a null-terminated string elsewhere in memory. There is an IS_INTRESOURCE() macro to differentiate which is the actual case, eg:
wchar_t *s = ...;
if (IS_INTRESOURCE(s))
{
// s is a numeric ID...
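// Go through a pointer-sized integer; a direct pointer-to-WORD cast is
// rejected or warned about by compilers when pointers are wider than 16 bits.
WORD b = LOWORD(reinterpret_cast<ULONG_PTR>(s));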
...
}
else
{
// s is a null-terminated name string
...
}
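For readers without the Windows headers at hand, the check can be sketched portably. The two helpers below are hypothetical stand-ins, not Windows APIs; is_int_resource applies the test described above (nothing set above the low 16 bits), and resource_id extracts those 16 bits.

```cpp
#include <cstdint>

// Hypothetical stand-in for IS_INTRESOURCE(): the pointer carries a numeric
// ID when every bit above the low 16 is zero.
inline bool is_int_resource(const wchar_t* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) >> 16) == 0;
}

// Extracts the 16-bit ID. Only meaningful when is_int_resource(p) is true.
inline std::uint16_t resource_id(const wchar_t* p)
{
    return static_cast<std::uint16_t>(
        reinterpret_cast<std::uintptr_t>(p) & 0xFFFFu);
}
```

Going through std::uintptr_t keeps the conversion well-formed on platforms where a pointer is wider than 16 bits. In real Windows code, prefer the macro itself.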
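Did you mean wchar_t *s; in your code?
I'd make the conversion explicit. A direct reinterpret_cast<short int>(s) is ill-formed wherever a pointer is wider than short int, so go through a pointer-sized integer first:
short int b = static_cast<short int>(reinterpret_cast<intptr_t>(s));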
If it fits your application's needs, I suggest using a data type with a fixed number of bits, e.g. uint16_t. Using short int means you only know for sure that your variable has at least 16 bits. An additional question: why not use unsigned short int instead of (signed) short int?
In general, knowing the exact number of bits makes things a little more predictable, and makes it easier to know exactly what happens when you cast or use bitmasks.
So, std::string comes with a value type of char. I want a string with a value type of unsigned char. The reason is that I am currently writing a program which converts large hexadecimal input to decimal, and I am using strings to calculate the result. But the range of char, which is typically -128 to 127, is too small; unsigned char, with a range of 0 to 255, would work perfectly instead. Consider this code:
#include<iostream>
using namespace std;
int main()
{
typedef basic_string<unsigned char> u_string;
u_string x= "Hello!";
return 0;
}
But when I try to compile, it shows two errors: one is _invalid conversion from const char* to unsigned const char*_ and the other is initializing argument 1 of std::basic_string<_CharT, _Traits, _Alloc>::basic_string...(it goes on)
EDIT:
"Why does the problem "converts large input of hexadecimal to decimal" require initializing a u_string with a string literal?"
While calculating, each time I shift to the left in the hexadecimal number, I multiply by 16. At most the result is going to be 16x9=144, which exceeds the limit of 127 and makes the value negative.
Also, I have to initialize it like this:
x="0"; x[0] -='0';
Because I want it to be 0 in value. If the variable is null, then I can't perform operations on it; if it is 0, then I can.
So, what should I do?
String literals are arrays of const char, and you are trying to initialize a string of unsigned char from one.
You have two solutions:
First, copy the characters from a standard string into your string, element by element.
Second, Write your own user-literal for your string class:
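// A string literal operator must take (const char *, std::size_t). The cast
// is not a constant expression, so the operator is not constexpr.
inline const unsigned char * operator""_us(const char *s, std::size_t)
{
    return reinterpret_cast<const unsigned char *>(s);
}
// OR (define only one of the two; they differ only in return type)
u_string operator""_us(const char *s, std::size_t len)
{
    return u_string(s, s + len);
}
u_string x = "Hello!"_us;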
An alternative solution would be to make your compiler treat char as unsigned. There are compiler flags for this:
MSVC: /J
GCC, Clang, ICC: -funsigned-char
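For the underlying task (converting a large hexadecimal number to decimal), here is a sketch that sidesteps the string type question. It assumes the decimal digits live in a std::vector<unsigned char> and the arithmetic is done in unsigned int, so a value such as 16x9=144 never has to fit in a char. hex_to_decimal is an illustrative name.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>
#include <vector>

// Converts a hexadecimal string of any length to its decimal representation.
std::string hex_to_decimal(const std::string& hex)
{
    // Decimal digits, least significant first; starts as the number 0.
    std::vector<unsigned char> digits{0};

    for (char ch : hex) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (!std::isxdigit(c))
            throw std::invalid_argument("not a hexadecimal digit");

        // Value of this hex digit; it is added in as the initial carry.
        unsigned carry = std::isdigit(c) ? c - '0' : std::tolower(c) - 'a' + 10;

        // digits = digits * 16 + value, one decimal digit at a time.
        for (unsigned char& d : digits) {
            const unsigned v = d * 16u + carry;
            d = static_cast<unsigned char>(v % 10);
            carry = v / 10;
        }
        while (carry > 0) {
            digits.push_back(static_cast<unsigned char>(carry % 10));
            carry /= 10;
        }
    }

    // Emit most significant digit first.
    std::string out;
    for (auto it = digits.rbegin(); it != digits.rend(); ++it)
        out.push_back(static_cast<char>('0' + *it));
    return out;
}
```

Each stored digit stays in the range 0 to 9; only the intermediate product is large, and that is held in an unsigned int.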
When I try the following, I get an error:
unsigned char * data = "00000000"; //error: cannot convert const char to unsigned char
Is there a special way to do this which I'm missing?
Update
For the sake of brevity, I'll explain what I'm trying to achieve:
I'd like to create a StringBuffer in C++ which uses unsigned values for raw binary data. It seems that an unsigned char is the best way to accomplish this. Is there a better method?
std::vector<unsigned char> data(8, '0');
Or, if the data is not uniform:
auto & arr = "abcdefg";
std::vector<unsigned char> data(arr, arr + sizeof(arr) - 1);
Or, so you can assign directly from a literal:
std::basic_string<unsigned char> data = (const unsigned char *)"abcdefg";
Yes, do this:
const char *data = "00000000";
A string literal is an array of char, not unsigned char.
If you need to pass this to a function that takes const unsigned char *, well, you'll need to cast it:
foo(reinterpret_cast<const unsigned char *>(data)); // static_cast cannot convert between char* and unsigned char*
You have many ways. One is to write:
const unsigned char *data = (const unsigned char *)"00000000";
Another, which I would recommend more, is to declare data as it should be:
const char *data = "00000000";
And when you pass it to your function:
myFunc((const unsigned char *)data);
Note that, in general a string of unsigned char is unusual. An array of unsigned chars is more common, but you wouldn't initialize it with a string ("00000000")
Response to your update
If you want raw binary data, first let me tell you that instead of unsigned char, you may be better off using bigger units, such as long int or long long. When you perform operations over the whole buffer (which is an array), working on 4 or 8 bytes at a time cuts the number of operations by a factor of 4 or 8, which can be a speed boost.
Second, if you want your class to represent binary values, don't initialize it with a string, but with individual values. In your case would be:
unsigned char data[] = {0x30, 0x30, 0x30, 0x30, /* etc */};
Note that I assume you are storing binary as binary! That is, you get 8 bits in an unsigned char. If, on the other hand, you mean binary as in a string of 0s and 1s (which is not really a good idea), then you don't need unsigned char at all; plain char is sufficient.
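The point about wider units can be sketched as follows. The example XORs all bytes of a buffer together, once byte by byte and once eight bytes at a time. The function names are made up, and whether the wide version is actually faster is an assumption that depends on the compiler, since optimizers often vectorize the simple loop on their own.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Reference version: one XOR per byte.
unsigned char xor_bytes(const unsigned char* data, std::size_t n)
{
    unsigned char acc = 0;
    for (std::size_t i = 0; i < n; ++i)
        acc ^= data[i];
    return acc;
}

// Wide version: one XOR per 8 bytes, then fold the 64-bit result down.
unsigned char xor_words(const unsigned char* data, std::size_t n)
{
    std::uint64_t wide = 0;
    std::size_t i = 0;
    for (; i + sizeof wide <= n; i += sizeof wide) {
        std::uint64_t w;
        std::memcpy(&w, data + i, sizeof w); // avoids alignment and aliasing problems
        wide ^= w;
    }
    unsigned char acc = 0;
    for (unsigned shift = 0; shift < 64; shift += 8)
        acc ^= static_cast<unsigned char>(wide >> shift);
    for (; i < n; ++i) // leftover bytes that do not fill a whole word
        acc ^= data[i];
    return acc;
}
```

The storage is still an array of unsigned char; only the processing step works in larger units.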
unsigned char data[] = "00000000";
This will copy "00000000" into an unsigned char[] buffer, which also means that the buffer won't be read-only like a string literal.
The reason the way you're doing it won't work is that you're pointing data at a string literal (an array of char), so data has to be of type const char*. You can't do otherwise without explicitly casting "00000000", such as: (unsigned char*)"00000000".
Note that in C, string literals aren't explicitly of type const char[] (in C++ they are). Either way, if you try to modify one, you will cause undefined behaviour, which often shows up as an access violation error.
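A short sketch contrasting the two options: an array initialized from the literal (a writable copy) and a read-only view of the literal itself. The names buffer, buffer_size and view_literal are illustrative.

```cpp
#include <cstddef>

// The literal's characters, including the terminating NUL, are copied into
// this array, so it can be modified freely.
unsigned char buffer[] = "00000000";

// Eight '0' characters plus the terminator.
constexpr std::size_t buffer_size = sizeof(buffer);

// A read-only view of the literal itself: constness is kept and only the
// element type of the pointer changes.
inline const unsigned char* view_literal()
{
    return reinterpret_cast<const unsigned char*>("00000000");
}
```

Writing to buffer is fine; the pointer returned by view_literal must only be read.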
You're trying to assign a string value to a pointer to unsigned char. You cannot do that. A pointer to unsigned char can only be assigned the address of an unsigned char, or NULL.
Use const char * instead.
Your target variable is a pointer to an unsigned char. "00000000" is a string literal. Its type is const char[9]. You have two type mismatches here. One is that unsigned char and char are different types. The lack of a const qualifier is also a big problem.
You can do this:
unsigned char * data = (unsigned char *)"00000000";
But this is something you should not do. Ever. Casting away the constness of a string literal will get you in big trouble some day.
The following is better. It keeps the const qualifier, and reading char data through a pointer to unsigned char is permitted by the aliasing rules:
const unsigned char * data = (const unsigned char *)"00000000";
Here you are preserving the constness but you are changing the pointer type from char* to unsigned char*.
@Holland -
unsigned char * data = "00000000";
One very important point I'm not sure we're making clear: the string "00000000\0" (9 bytes, including delimiter) might be in READ-ONLY MEMORY (depending on your platform).
In other words, if you defined your variable ("data") this way, and you passed it to a function that might try to CHANGE "data" ... then you could get an ACCESS VIOLATION.
The solution is:
1) declare as "const char *" (as the others have already said)
... and ...
2) TREAT it as "const char *" (do NOT modify its contents, or pass it to a function that might modify its contents).