Why hash values are same for string?


I want to calculate a hash of a structure by passing it as a string. Although the vlanId values are different, the hash value is still the same. The StringHash() function calculates the hash. I haven't assigned any value to portId or vsi.
#include <stdio.h>
#include <functional>
#include <cstring>
#include <string>
using namespace std;
unsigned long StringHash(unsigned char *Arr)
{
hash<string> str_hash;
string Str((const char *)Arr);
unsigned long str_hash_value = str_hash(Str);
printf("Hash=%lu\n", str_hash_value);
return str_hash_value;
}
typedef struct
{
unsigned char portId;
unsigned short vlanId;
unsigned short vsi;
}VlanConfig;
int main()
{
VlanConfig v1;
memset(&v1,0,sizeof(VlanConfig));
unsigned char *index = (unsigned char *)&v1 + sizeof(unsigned char);
v1.vlanId = 10;
StringHash(index);
StringHash((unsigned char *)&v1);
v1.vlanId = 12;
StringHash(index);
StringHash((unsigned char *)&v1);
return 0;
}
Output:
Hash=6142509188972423790
Hash=6142509188972423790
Hash=6142509188972423790
Hash=6142509188972423790

You pass the bytes of your structure to a function expecting a zero-terminated string. The first byte it reads is zero in every call (portId for &v1, and most likely a padding byte for index), so you calculate the hash of an empty string every time.
Now, that is the explanation why, but not the solution to your problem. Passing an arbitrary sequence of bytes to a function expecting a zero-terminated sequence of characters is going to fail spectacularly, no matter how you do it.
Find another way to hash your structure. You are already using hash<>, so why not specialize it for your structure:
namespace std
{
template<> struct hash<VlanConfig>
{
std::size_t operator()(VlanConfig const& c) const noexcept
{
std::size_t h1 = std::hash<unsigned char>{}(c.portId);
std::size_t h2 = std::hash<unsigned short>{}(c.vlanId);
std::size_t h3 = std::hash<unsigned short>{}(c.vsi);
return h1 ^ (h2 << 1) ^ (h3 << 2); // or use boost::hash_combine
}
};
}
Then you can do this:
VlanConfig myVariable;
// fill myVariable
std::cout << std::hash<VlanConfig>{}(myVariable) << std::endl;

I can't say for certain, but most likely your issue is structure padding. Unless explicitly told to pack members and ignore alignment, most compilers will lay out the struct as follows:
Byte 0: portId
Byte 1: padding
Bytes 2,3: vlanId
Bytes 4,5: vsi
So when you compute the address for index, it points to the padding byte, which your memset set to zero. Thus you're always hashing an empty string.
You should be able to check this in a debugger by inspecting index and comparing it to the address of vlanId.
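If a debugger is not at hand, the same check can be sketched in code with offsetof. This is only an illustration: the struct is redeclared here with a tag, and the helper names VlanIdOffset and PrintLayout are made up for the example. The actual offsets depend on the compiler and target.

```cpp
#include <cstddef>
#include <cstdio>

struct VlanConfig {
    unsigned char portId;
    unsigned short vlanId;
    unsigned short vsi;
};

// Hypothetical helper: byte offset of vlanId from the start of the struct.
std::size_t VlanIdOffset() {
    return offsetof(VlanConfig, vlanId);
}

// Hypothetical helper: prints the layout so any padding becomes visible.
void PrintLayout() {
    std::printf("sizeof=%zu portId@%zu vlanId@%zu vsi@%zu\n",
                sizeof(VlanConfig),
                offsetof(VlanConfig, portId),
                offsetof(VlanConfig, vlanId),
                offsetof(VlanConfig, vsi));
}
```

If VlanIdOffset() returns 2 rather than 1, as it does on most common compilers, the index pointer in the question lands on a padding byte and not on vlanId.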
-- Edit --
After giving this some more thought, I have to say that in my extremely humble opinion, this isn't a good way to get a hash value. Trying to treat several numeric values that might, or might not, be contiguous in memory as a std::string, has too many possibilities for error.
To start with, even if you do get the address right, consider what happens when you hash two different configurations, one of which has vlanId set to 256 while the other has it set to 512. Assuming a little-endian machine, both of those will have a zero byte as the first character of the string, and so you're right back here again.
Worse yet is the case when all four bytes in vlanId and vsi are non zero. In that case, you'll read right off the end of your struct, and keep on going, reading who knows what. There's no way that's going to end well.
One possible solution is to work out the size of the data and use the following constructor of std::string: string (char const *s, size_t n); which has the advantage of forcing the string to exactly the size you want, embedded zero bytes included.
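A minimal sketch of that length-based constructor follows. It copies the two fields into a small buffer first so that padding never enters the key. The helper names VlanKey and HashVlan are invented for this illustration, and the struct is redeclared with a tag.

```cpp
#include <cstddef>
#include <cstring>
#include <functional>
#include <string>

struct VlanConfig {
    unsigned char portId;
    unsigned short vlanId;
    unsigned short vsi;
};

// Hypothetical helper: packs the raw bytes of vlanId and vsi into a string
// whose length is given explicitly, so zero bytes do not end it early.
std::string VlanKey(const VlanConfig& c) {
    char buf[sizeof c.vlanId + sizeof c.vsi];
    std::memcpy(buf, &c.vlanId, sizeof c.vlanId);
    std::memcpy(buf + sizeof c.vlanId, &c.vsi, sizeof c.vsi);
    return std::string(buf, sizeof buf);
}

// Hypothetical helper: hashes the fixed-length key.
std::size_t HashVlan(const VlanConfig& c) {
    return std::hash<std::string>{}(VlanKey(c));
}
```

With this, configurations with vlanId 256 and 512 produce different keys even though each key starts with a zero byte on a little-endian machine, so std::hash receives different input.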

Related

memcpy with initialized variable and negative numbers with cast

I have
QByteArray bytes; // filled earlier
char id_c = bytes[7];
int _id;
_id = 0; // If I comment this out, the result is different
memcpy(&_id, &id_c, 1);
int result = _id;
If I comment out "_id = 0", result is different and can be a negative number. Why? Why does initializing _id with 0 make a difference?
How can I get the same result as with "_id = 0", but without memcpy and unwanted casts?
This is not my code. I am interested in how to get the same result correctly without stupid casts.

Correct.
Because this statement:
memcpy(&_id, &id_c, 1);
copies only a single byte from &id_c into the memory of _id, which is typically a 4-byte integer. Only the first byte of the memory occupied by _id is written. Without zero-initializing _id first, the remaining bytes of that value are left indeterminate (presumably garbage values off the stack).
What's wrong with an "unwanted casting"? A cast is just as fine, and the compiler generates the most efficient code for it. Keep in mind that char may be signed, so this conversion yields a negative value for bytes of 0x80 and above on such platforms:
QByteArray bytes; // filled earlier
int _id = (int)(bytes[7]);
int result = _id;
If you want the sign-extended result regardless of whether plain char is signed, then this:
int _id = (signed char)(bytes[7]);
To reproduce what the zero-initialized memcpy gives on a little-endian machine (a value from 0 to 255), convert through unsigned char instead of signed char.
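The conversions discussed in this answer can be compared side by side. The sketch below uses a plain char in place of bytes[7] so that it does not depend on Qt. The function names are made up for the illustration, and the memcpy variant only returns a value from 0 to 255 on a little-endian machine.

```cpp
#include <cstring>

// Zero-extends the byte: the result is always in 0..255.
int ByteAsUnsigned(char c) {
    return static_cast<unsigned char>(c);
}

// Sign-extends the byte: the result is always in -128..127.
int ByteAsSigned(char c) {
    return static_cast<signed char>(c);
}

// The approach from the question, with the zero initialization kept.
// Only one byte of id is overwritten; which byte depends on endianness.
int ByteViaMemcpy(char c) {
    int id = 0;
    std::memcpy(&id, &c, 1);
    return id;
}
```

For a byte such as 0xF0, ByteAsUnsigned yields 240 and ByteAsSigned yields -16. Which of the two is wanted depends on what the byte means in the protocol, and that is not stated in the question.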

_id = 0 assigns the value 0 to the variable _id. If you comment that out, there is no telling what is stored in _id, and the memcpy updates only one byte of it; since _id is an int, it is more than one byte in size.

You might try these net/host byte order conversions:
on linux
on windows
The only difference is the header file to use; you can use the preprocessor to detect the platform and choose the proper header if the code is meant to be cross-platform. A better approach is to use the C++20 feature std::endian, but then you need to handle the conversion yourself:
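#include <bit>
#include <climits>
#include <cstddef>
// Converts between host and big-endian (network) byte order.
int int_cvt(int x){
    if constexpr (std::endian::native == std::endian::big)
        return x;
    // Work on an unsigned copy: right-shifting a negative int never reaches zero.
    unsigned u = static_cast<unsigned>(x);
    unsigned y = 0;
    // Reverse every byte, including zero bytes, so the width is preserved.
    for (std::size_t i = 0; i < sizeof u; ++i) {
        y = (y << CHAR_BIT) | (u & UCHAR_MAX);
        u >>= CHAR_BIT;
    }
    return static_cast<int>(y);
}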
cheers,
FM.

Taking an index out of const char* argument

I have the following code:
int some_array[256] = { ... };
int do_stuff(const char* str)
{
int index = *str;
return some_array[index];
}
Apparently the above code causes a bug on some platforms, because *str can in fact be negative.
So I thought of two possible solutions:
Casting the value on assignment (unsigned int index = (unsigned char)*str;).
Passing const unsigned char* instead.
Edit: The rest of this question did not get a treatment, so I moved it to a new thread.

The signedness of char is indeed platform-dependent, but what you do know is that there are as many values of char as there are of unsigned char, and the conversion is injective. So you can absolutely cast the value to associate a lookup index with each character:
unsigned char idx = *str;
return arr[idx];
You should of course make sure that arr has at least UCHAR_MAX + 1 elements. (This may cause hilarious edge cases when sizeof(unsigned long long int) == 1, which is fortunately rare.)
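Put together with the function from the question, that might look like the sketch below. The table contents are elided in the question, so the array is simply zero-filled here as a placeholder.

```cpp
#include <climits>

// One slot for every possible unsigned char value.
// Placeholder contents: the real table is not shown in the question.
int some_array[UCHAR_MAX + 1] = {};

int do_stuff(const char* str) {
    // Convert to unsigned char first so the index is never negative,
    // whether plain char is signed or unsigned on this platform.
    unsigned char idx = static_cast<unsigned char>(*str);
    return some_array[idx];
}
```

The conversion stays inside do_stuff, so callers can keep passing ordinary const char* strings.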

Characters are allowed to be signed or unsigned, depending on the platform. An assumption of unsigned range is what causes your bug.
Your do_stuff code does not treat const char* as a string representation. It uses it as a sequence of byte-sized indexes into a look-up table. Therefore, there is nothing wrong with forcing unsigned char type on the characters of your string inside do_stuff (i.e. use your solution #1). This keeps re-interpretation of char as an index localized to the implementation of do_stuff function.
Of course, this assumes that other parts of your code do treat str as a C string.

Pointer to char vs String

Consider these two pieces of code. They convert a base-10 number to a base-N number, where N is the number of characters in a given alphabet. Actually, they generate permutations of the letters of the given alphabet. It's assumed that 1 corresponds to the first letter of the alphabet.
#include <iostream>
#include <string>
typedef unsigned long long ull;
using namespace std;
void conv(ull num, const string alpha, string *word){
int base=alpha.size();
*word="";
while (num) {
*word+=alpha[(num-1)%base];
num=(num-1)/base;
}
}
int main(){
ull nu;
const string alpha="abcdef";
string word;
for (nu=1;nu<=10;++nu) {
conv(nu,alpha,&word);
cout << word << endl;
}
return 0;
}
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
typedef unsigned long long ull;
void conv(ull num, const char* alpha, char *word){
int base=strlen(alpha);
while (num) {
(*word++)=alpha[(num-1)%base];
num=(num-1)/base;
}
}
int main() {
char *a=calloc(10,sizeof(char));
const char *alpha="abcdef";
ull h;
for (h=1;h<=10;++h) {
conv(h,alpha,a);
printf("%s\n", a);
}
}
Output is the same:
a
b
c
d
e
f
aa
ba
ca
da
No, I didn't forget to reverse the strings; the reversal was removed to keep the code clear.
For some reason speed is very important to me. I tested the speed of the executables compiled from the examples above and noticed that the one written in C++ using string is more than 10 times slower than the one written in C using char *.
Each executable was compiled with GCC's -O2 flag. I ran the tests with much bigger numbers to convert, such as 1e8 and more.
The question is: why is string slower than char * in this case?

Your code snippets are not equivalent. *a='n' does not append to the char array. It changes the first char in the array to 'n'.
In C++, std::string should be preferred to char arrays because it is a lot easier to use; for example, appending is done simply with the += operator.
It also manages its memory automatically, which char arrays don't do. As a result, std::string is much less error-prone than manually managed char arrays.

Doing a trace of your code you get:
*a='n';
// 'n0000'
//  ^
//  a
++a;
// 'n0000'
//   ^
//   a
*a='o';
// 'no000'
//   ^
//   a
In the end, a points to its original address + 1, which holds 'o'. If you print a, you will get "o".
Anyway, what if you need "nothing" instead of "no"? It won't fit in 5 chars, so you would need to reallocate memory, and so on. That kind of thing is what the string class does for you behind the scenes, and fast enough that it's not a problem in almost every scenario.

It's possible to use both char * and string to handle some text in C++. It seems to me that string addition is much slower than pointer addition. Why does this happen?
That is because when you use a char array or a pointer to it (char*), the memory is allocated only once. What you describe as "addition" is just advancing a pointer through the array.
// Both allocate memory one time:
char test[4];
char* ptrTest = new char[4];
// This will just set the values which already exist in the array and will
// not append anything.
*(ptrTest++) = 't';
*(ptrTest++) = 'e';
*(ptrTest++) = 's';
*(ptrTest) = 't';
When you use a string instead, the += operator actually appends characters to the end of your string. Each append has to check the capacity and update the length, and whenever the capacity runs out the string allocates a larger buffer and copies its contents over. Implementations typically grow the capacity geometrically and store short strings inline, so not every append allocates, but the bookkeeping still takes longer than just advancing a pointer.
// Each += may trigger a reallocation when the capacity is exhausted
std::string test;
test += 't';
test += 'e';
test += 's';
test += 't';
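How often a string actually reallocates can be observed rather than guessed. The sketch below counts changes of capacity() as a rough proxy for reallocations; the function name CountCapacityChanges is made up for this example, and the exact numbers depend on the standard library in use.

```cpp
#include <cstddef>
#include <string>

// Appends n characters one at a time and counts how often capacity()
// changes, which is a rough proxy for the number of reallocations.
std::size_t CountCapacityChanges(std::size_t n, bool reserve_first) {
    std::string s;
    if (reserve_first) {
        s.reserve(n);  // request room for all n characters up front
    }
    std::size_t changes = 0;
    std::size_t last = s.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        s += 'x';
        if (s.capacity() != last) {
            ++changes;
            last = s.capacity();
        }
    }
    return changes;
}
```

Without the reserve call the count should come out far below n on common implementations, and with it the capacity does not change at all during the loop.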

std::string a(2,' ');
a[0] = 'n';
a[1] = 'o';
Set the size of your string in the constructor, or use the reserve or resize methods; that is your choice.
You are mixing different things in your question. One is a raw sequence of bytes that can be interpreted as a string, with no semantics or checks; the other is a string abstraction with checks. Believe me, security and avoiding segfaults that can lead to code injection and privilege escalation are a lot more important than 2 ms.

From the std::string documentation you can see that
basic_string& operator+=(charT c)
is equivalent to calling push_back(c) on that string, so
string a;
a+='n';
a+='o';
is equivalent to:
string a;
a.push_back('n');
a.push_back('o');
push_back takes care of a lot more than the raw pointer operations and is thus slower. For instance, it handles the automatic memory management of the string class.
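For the conv function in the question, some of that overhead can probably be trimmed without giving up std::string. The sketch below is an untested guess at a cheaper variant, not a measured fix: it takes alpha by const reference so the alphabet is not copied on every call, and it clears word so that its buffer can be reused.

```cpp
#include <string>

typedef unsigned long long ull;

// Same digit loop as in the question. Assumes alpha is not empty.
void conv(ull num, const std::string& alpha, std::string& word) {
    const ull base = alpha.size();
    word.clear();  // common implementations keep the allocated buffer here
    while (num) {
        word += alpha[(num - 1) % base];
        num = (num - 1) / base;
    }
}
```

Calling conv(7, alpha, word) with the alphabet "abcdef" leaves "aa" in word, the same as the original version. Whether this closes the gap to the C version would have to be measured.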

memcpy behaving in an unexpected way

Given below is my sample code :
int function1(unsigned char *out, int length){
unsigned long crypto_out_len = 16;
unsigned char crypto_out[16] = {0};
.......
//produces 16 bytes output & stores in crypto_out
crypto_function(crypto_out, crypto_out_len);
//lets say crypto_output contents after are : "abcdefghijklmnop"
.......
memcpy(out, crypto_out,length);
return 0;
}
function2(){
unsigned char out[10] = {0};
function1(out, 10);
std::pair<unsigned char *,int> map_entry;
map_entry.first = out;
map_entry.second = 10;
}
Now, map_entry.first should contain : "abcdefghij", right?
But it contains "abcdefghij#$%f1^", with some garbage appended. How should I avoid such unexpected behaviour so that map_entry.first contains exactly "abcdefghij"?

Since you haven't pasted the whole code, I can't be 100% sure but I think I know what's wrong. memcpy() is behaving correctly here, and everything is 100% defined behavior.
In this case, out is a 10-character string without a null terminator. You assign it to an unsigned char* that carries no length information, and I suspect you simply don't use the number ten when you refer to map_entry.first.
If you print it as unsigned char* or construct a std::string from it, C++ expects it to be a null-terminated string. Therefore, it reads up to the first null character. Since out doesn't have one, it runs past the end and starts reading whatever is on the stack after out, which is the garbage you see.
What you need to do is make sure either that the string is null-terminated or that you always refer to it with the correct length. For the former, you'd want to make out 11 bytes long and leave the last byte as 0:
function2(){
unsigned char out[11] = {0};
function1(out, 10);
std::pair<unsigned char *,int> map_entry;
map_entry.first = out;
map_entry.second = 10;
}
Please also note that C++ will actually stop at the first null character it encounters. If your crypto_function() may output zero bytes in the middle of the string, you should be aware that the string will be truncated at that point.
For the latter, you'd have to use functions that actually allow you to specify the string length, and always pass the length of 10 to those. If you always work with it like this, you don't have to worry about zero bytes from crypto_function().
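One way to carry the length along is to build a std::string from the pointer and the length together. This is a sketch only, and the helper name BytesToString is invented for the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copies exactly `length` bytes into a std::string.
// No terminator is needed, and zero bytes inside the data are preserved.
std::string BytesToString(const unsigned char* data, std::size_t length) {
    return std::string(reinterpret_cast<const char*>(data), length);
}
```

In function2 this would be called as BytesToString(map_entry.first, map_entry.second), which uses the stored length of 10 rather than searching for a terminator.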

You are confusing char[] with strings. out does contain your expected data, but it's not 0-terminated, so if you try to display it as a string it may look like it contains extra data. If the data is actually strings, you need to 0-terminate them correctly.

char array to uint8_t array

this is one area of C/C++ that i have never been good at.
my problem is that i have a string that will need to eventually contain some null characters. treating everything as a char array (or string) won't work, as things tend to crap out when they find the first null. so i thought, ok, i'll switch over to uint8_t, so everything is just a number. i can move things around as needed, and cast it back to a char when i'm ready.
my main question right now is: how can i copy a portion of a string to an uint8_t buffer?
effectively, i'd like to do something like:
std::string s = "abcdefghi";
uint8_t *val = (uint8_t*)malloc(s.length() + 1);
memset(val, 0, s.length() + 1);
// Assume offset is just some number
memcpy(val + offset, s.substr(1, 5).c_str(), 5);
obviously, i get an error when i try this. there is probably some sort of trickery that can be done in the first argument of the memcpy (i see stuff like (*(uint8_t*)) online, and have no clue what that means).
any help on what to do?
and while i am here, how can i easily cast this back to a char array? just static_cast the uint8_t pointer to a char pointer?
thanks a lot.

i thought, ok, i'll switch over to uint8_t, so everything is just a number.
That's not going to make algorithms that look for a '\0' suddenly stop doing it, nor do algorithms that use char have to pay attention to '\0'. Signaling the end with a null character is a convention of C strings, not char arrays. uint8_t is typically just a typedef for unsigned char anyway.
As Nicol Bolas points out, std::string is already capable of storing strings that contain the null character without treating the null character specially.
As for your question, I'm not sure what error you're referring to, as the following works just fine:
#include <iostream>
#include <string>
#include <cstdint>
#include <cstdlib>
#include <cstring>
int main() {
    std::string s = "abcdefghi";
    std::uint8_t *val = (std::uint8_t*)std::malloc(s.length() + 1);
    std::memset(val, 0, s.length() + 1);
    int offset = 2;
    std::memcpy(val + offset, s.substr(1, 5).c_str(), 5);
    std::cout << (val+offset) << '\n';
    std::free(val); // release the buffer obtained from malloc
}
The memcpy line takes the second through sixth characters from the string s and copies them into val. The line with cout then prints "bcdef".
Of course this is C++, so if you want to manually allocate some memory and zero it out you can do so like:
std::unique_ptr<uint8_t[]> val(new uint8_t[s.length()+1]());
or use a vector:
std::vector<uint8_t> val(s.length()+1,0);
To cast from an array of uint8_t you could (but typically shouldn't) do the following:
char *c = reinterpret_cast<char*>(val);
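The vector variant can be combined with the copy from the question without malloc or memcpy. The following is a sketch under the same assumptions as the question (a buffer of s.length() + 1 zeroed bytes); the function name CopyAtOffset and its clamping behaviour are choices made for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: copies up to `count` characters of s, starting at
// `pos`, into a zeroed buffer at `offset`. Out-of-range requests are
// clamped so the copy never runs past either end.
std::vector<std::uint8_t> CopyAtOffset(const std::string& s, std::size_t pos,
                                       std::size_t count, std::size_t offset) {
    std::vector<std::uint8_t> buf(s.size() + 1, 0);
    if (pos > s.size() || offset > buf.size()) {
        return buf;  // nothing to copy
    }
    count = std::min({count, s.size() - pos, buf.size() - offset});
    std::copy_n(s.begin() + pos, count, buf.begin() + offset);
    return buf;
}
```

CopyAtOffset("abcdefghi", 1, 5, 2) gives a 10-byte buffer holding two zero bytes, then b, c, d, e, f, then zeros. The vector knows its own size, so the leading zeros do not hide the data the way they do when the buffer is printed as a C string.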

Well, the code works ok, it copies the substring in val. However, you will have 0s on all the positions until the offset.
e.g. for offset=2 val would be {0, 0, b, c, d, e, f, 0, 0, 0}
If you print this, it will show nothing because the string is null terminated on the first position (I guess this is the error you were talking about...).
