concatenating uint16_t and uint32_t values for hashing - c++

I am trying to concatenate (not add) two uint16_t struct members and two uint32_t struct members and assign the result to a const void *p for the purpose of hashing. The struct and the concatenation function that I am trying to implement are as follows.
struct xyz {
....
uint32_t a;
uint32_t b;
....
uint16_t c;
uint16_t d;
....
}
const void *p=concatenation(xyz.a,xyz.b,xyz.c,xyz.d)
Edited:
I have to use pre-defined hash functions. The most suitable hash function for my task seems to be this.
uint32_t hash(const uint32_t p[], size_t n)
{
//Returns the hash of the 'n' 32-bit words at 'p'
}
or
uint32_t hash64(const uint64_t p[], size_t n)
{
//Returns the hash of the 'n' 64-bit words at 'p'
}

for the purpose of hashing
In this case, I'd prefer providing a custom hash function, or specialising std::hash for xyz. For use with standard templates, this might look like this:
namespace std // any extension of std namespace is UB
// sole exception: specialising templates, which we are going to do
{
template <>
struct hash<xyz>
{
size_t operator()(xyz const& i) const
{
// TODO: need to calculate the value from a, b, c, and d appropriately
return 0;
};
};
// if xyz is polymorphic, you might need to operate on pointers
// no problem either:
template <>
struct hash<xyz*>
{
size_t operator()(xyz const* i) const
{
return hash<xyz>()(*i);
// or, if the hash value is type dependent (use instead of the line above):
// return i->hash(); // custom virtual hash member function needs to be implemented
}
};
} // namespace std
// now you can have
std::unordered_set<xyz> someSet;
void demo()
{
someSet.insert(xyz());
}
(Untested code, in case of errors please fix yourself.)
A list of hashing algorithms which might be used can be found at wikipedia.
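One way to fill in the TODO above is to feed the four members, in a fixed order, through a simple byte-oriented hash. The following is only a sketch using 64-bit FNV-1a, one of the algorithms on that list. The struct layout and the names fnv1a_mix and fnv1a_xyz are assumptions made for illustration; they are not part of the question.

```cpp
#include <cstdint>

// Hypothetical stand-in for the question's struct (the elided members are left out).
struct xyz {
    std::uint32_t a;
    std::uint32_t b;
    std::uint16_t c;
    std::uint16_t d;
};

// One FNV-1a step per byte: mixes the low `bytes` bytes of `value` into `h`,
// least significant byte first, so the result does not depend on host endianness.
inline std::uint64_t fnv1a_mix(std::uint64_t h, std::uint64_t value, int bytes) {
    for (int i = 0; i < bytes; ++i) {
        h ^= (value >> (8 * i)) & 0xFFu;
        h *= 1099511628211ull;              // 64-bit FNV prime
    }
    return h;
}

// Hashes a, b, c and d in that order.
inline std::uint64_t fnv1a_xyz(const xyz& v) {
    std::uint64_t h = 14695981039346656037ull;  // 64-bit FNV offset basis
    h = fnv1a_mix(h, v.a, 4);
    h = fnv1a_mix(h, v.b, 4);
    h = fnv1a_mix(h, v.c, 2);
    h = fnv1a_mix(h, v.d, 2);
    return h;
}
```

Inside the std::hash<xyz> specialisation, operator() could then return static_cast<size_t>(fnv1a_xyz(i)).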

If you want the value to fit into a pointer, the full value can be 32 bits on x86 or 64 bits on x64. I'm going to assume you are compiling for 64 bit machines.
This means you can only fit 2 uint16 and one uint32, or 2 uint32s.
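Either way, you would shift the values into a uint64, for example (uint64_t(c) | (uint64_t(d) << 16) | (uint64_t(a) << 32)), and then convert that value to a void*. The casts matter: without them the shifts are performed on 32-bit operands on typical platforms, and shifting those by 32 is undefined.
Edit: for clarification, you cannot fit all of the struct's members, bit-shifted one after another, into a single pointer. You need a minimum of 96 bits to hold the packed struct, which means at least two 64-bit pointers.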
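Since 96 bits do not fit into one pointer, another option is to pack the fields into three 32-bit words, which matches the shape of the hash(const uint32_t p[], size_t n) function quoted in the question. This is a minimal sketch; pack_words is a hypothetical helper name, and the choice of putting c in the high half of the last word is arbitrary.

```cpp
#include <array>
#include <cstdint>

// Packs the four fields into three 32-bit words: a, b, then c in the
// high half and d in the low half of the last word.
inline std::array<std::uint32_t, 3> pack_words(std::uint32_t a, std::uint32_t b,
                                               std::uint16_t c, std::uint16_t d) {
    return {a, b, (static_cast<std::uint32_t>(c) << 16) | d};
}
```

The result could then be passed on as hash(words.data(), words.size()), assuming the pre-defined hash from the question is available.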

There are a few things to consider:
Does that hash value need to be portable across systems? If it does, then you will need to be careful to order the bytes the same way on different systems. If not, then the implementation can be simpler.
Do you want to hash every member of the class, does the class have no padding, and is there no pair of different member values that should be treated as equal for hashing purposes?
If both of these simplifications apply, then your function is fast and easy to implement, but violating any of these preconditions will break the hash. If not, then you must serialise the data into a buffer, which in practice means that you cannot simply return a pointer.
Here is a super simple implementation for the case that you don't need portability, and you hash all members, and there is no padding:
xyz example;
static_assert(std::has_unique_object_representations_v<xyz>);
const void* p = &example;
Note that this doesn't work with (IEEE-754) float members due to peculiarities of NaN.
A more robust solution that can produce hashes that are portable across systems is to use a general purpose serialisation scheme, and hash the serialised result. There is no standard serialisation functionality in C++.
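As a rough illustration of what such a serialisation step could look like for the four fields in the question, the sketch below writes each field in little-endian order into a fixed 12-byte buffer. The names put_le and serialise are made up for this example, and little-endian is just one possible convention.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Writes the low `bytes` bytes of `value` to `out`, least significant byte
// first, and returns the position just past the written bytes.
inline std::uint8_t* put_le(std::uint8_t* out, std::uint32_t value, std::size_t bytes) {
    for (std::size_t i = 0; i < bytes; ++i)
        *out++ = static_cast<std::uint8_t>(value >> (8 * i));
    return out;
}

// Serialises a, b, c, d into 12 bytes whose order does not depend on the host CPU.
inline std::array<std::uint8_t, 12> serialise(std::uint32_t a, std::uint32_t b,
                                              std::uint16_t c, std::uint16_t d) {
    std::array<std::uint8_t, 12> buf{};
    std::uint8_t* p = buf.data();
    p = put_le(p, a, 4);
    p = put_le(p, b, 4);
    p = put_le(p, c, 2);
    put_le(p, d, 2);
    return buf;
}
```

The buffer, not the struct, is then what gets hashed, so padding and host byte order no longer influence the result.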

void* has problems like: Who owns the memory? What's the type you are going to reinterpret the pointer as?
A more typed solution would be to use a std::array of std::byte; then you at least know that you're looking at an array of raw bytes and nothing else:
#include <cstdint>
#include <array>
#include <cstddef>
#include <cstring>
auto concat(std::uint32_t a, std::uint32_t b, std::uint16_t c, std::uint16_t d) {
std::array<std::byte, sizeof a + sizeof b + sizeof c + sizeof d> res;
std::byte* p = res.data();
std::memcpy(p, &a, sizeof a);
std::memcpy(p += sizeof a, &b, sizeof b);
std::memcpy(p += sizeof b, &c, sizeof c);
std::memcpy(p += sizeof c, &d, sizeof d);
return res;
}
int main() {
std::uint32_t a = 1, b = 0;
std::uint16_t c = 1, d = 0;
auto res = concat(a, b, c, d);
return 0;
}

Related

Dynamic type cast to the variable type

Suppose, I have a class, like:
struct A
{
uint8_t f1;
int16_t f2;
};
And I need to set its members' values from a memory buffer, like:
uint8_t * memory=device.getBufferedDataFromDevice();
A a;
a.f1=*((uint8_t*)&memory[someAddress]);
a.f2=*((int16_t*)&memory[someOtherAddress]);
But I'd like to make it more flexible, and avoid the explicit type cast, to have a possibility to change the type in the declaration without changing the rest of the code. Of course, I could achieve it with something like:
memcpy((void*)&a.f1, (void*)&memory[someAddress], sizeof(A::f1));
But I'd also like to avoid calling a function for simple types like 1-4 byte integers (which is what I have), as a simple assignment could be compiled to a single CPU instruction. Please advise: what is the C++ way to implement this?
Thank you!
memcpy is fully understood by every modern C++ compiler, and there is not going to be an actual function call unless you take its address, store that in a pointer, then confuse the compiler enough that it no longer knows the pointer points at memcpy.
Or, you know, turn off optimizations.
memcpy((void*)&a.f1, (void*)&memory[someAddress], sizeof(A::f1));
There is no reason to cast to void* here, nor to use dangerous C-style casts.
std::memcpy(&a.f1, &memory[someAddress], sizeof(a.f1));
This is a standards-compliant way to move memory that represents data of the same type as a.f1 over a.f1, assuming a.f1 is trivially copyable. (Note I used the same token sequence -- a.f1 -- for both the written-to stuff and the size.)
The compiler will optimize this into appropriate assembly, and there will be no function-call overhead.
Live example, you can see the generated assembly.
Now, you may object "but there is no guarantee!".
The C++ standard does not include a guarantee that a+b won't be implemented as a loop int r = 0; for (int i = 0; i < a; ++i){++r;} for (int i = 0; i < b; ++i){++r;}.
You cannot presume your C++ compiler is hostile.
Existing C++ compilers optimize calls to memcpy. Writing code assuming it won't happen is a waste of time.
You can also write a slightly safer memcpy
template<class Dest>
void memcpyT( Dest* dest, void const* src ) {
static_assert( std::is_trivially_copyable_v<Dest> );
memcpy( dest, src, sizeof(Dest) );
}
which I included as an alternative in the above live example.
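Applied to the struct from the question, the same idea might look like the sketch below. The helper name load_field and the buffer/offset parameters are assumptions for illustration; the point is only that the copied type is deduced from the destination, so changing a member's declared type needs no other edits.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

struct A {
    std::uint8_t f1;
    std::int16_t f2;
};

// Copies sizeof(Dest) bytes from buffer + offset into *dest.
// memcpy also copes with offsets that are not suitably aligned for Dest.
template <class Dest>
void load_field(Dest* dest, const std::uint8_t* buffer, std::size_t offset) {
    static_assert(std::is_trivially_copyable<Dest>::value,
                  "memcpy needs a trivially copyable type");
    std::memcpy(dest, buffer + offset, sizeof(Dest));
}
```

A call such as load_field(&a.f2, memory, someOtherAddress) then replaces the explicit cast in the question.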
You can have code similar to this:
template<typename T>
void mymemcopy(T* a, void* b) {
memcpy((void*)a, b, sizeof(T));
}
template<typename T>
constexpr void mymemcopy(T** a, void* b) {
*a = static_cast<T*>(b);
}
constexpr void mymemcopy(int* a, void* b) {
*a = *(int*)b;
}
constexpr void mymemcopy(unsigned char* a, void* b) {
*a = *(unsigned char*)b;
}
int main()
{
int a, b =10;
mymemcopy(&a, &b);
double a1, b1 =10;
mymemcopy(&a1, &b1);
unsigned char a2, b2 =10;
mymemcopy(&a2, &b2);
unsigned char *a3, *b3 =nullptr;
mymemcopy(&a3, &b3);
}
I suspect your use case is embedded programming, in which I'm no expert. I know that in embedded programming you need to reduce both memory usage and code size, but what you are asking for will obviously increase code size.

Practical use of Anonymous union in real world C++ programing

I know that we can access the members of an anonymous union without creating an object of it (without the dot),
but could anybody please explain what the use of anonymous unions is in real-world C++ programming?
I have mostly used unions to store multiple different types of elements in the same contiguous storage without resorting to dynamic polymorphism. Thus, every element of my union is a struct describing the data for the corresponding node type. Using an anonymous union mostly gives a more convenient notation, i.e. instead of object.union_member.struct_member, I can just write object.struct_member, since there is no other member of that name anyways.
A recent example where I used them would be a rooted (mostly binary) tree which has different kinds of nodes:
struct multitree_node {
multitree_node_type type;
...
union {
node_type_1 n1;
node_type_2 n2;
...
};
};
Using the type tag type, I am able to determine which element of the union to use. All of the structs node_type_x have roughly the same size, which is why I used the union in the first place (no unused storage).
With C++17, you would be able to do this using std::variant, but for now, using anonymous unions is a convenient way of implementing such 'polymorphic' types without virtual functions.
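For comparison, a std::variant version might look roughly like this (C++17). The payload structs leaf_node and inner_node are hypothetical stand-ins for node_type_1 and node_type_2, and describe is just a demo function.

```cpp
#include <string>
#include <variant>

// Hypothetical node payloads standing in for node_type_1 and node_type_2.
struct leaf_node  { int value; };
struct inner_node { int left; int right; };

// The variant keeps track of the active alternative itself,
// so no separate type tag member is needed.
using node_data = std::variant<leaf_node, inner_node>;

inline std::string describe(const node_data& n) {
    if (std::holds_alternative<leaf_node>(n))
        return "leaf " + std::to_string(std::get<leaf_node>(n).value);
    const inner_node& in = std::get<inner_node>(n);
    return "inner " + std::to_string(in.left) + "," + std::to_string(in.right);
}
```

The trade-off is that members are reached through std::get or std::visit rather than through the short object.struct_member notation that the anonymous union allows.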
Here's a real-world example:
struct Point3D {
union {
struct {
float x, y, z;
};
struct {
float c[3];
};
};
};
Point3D p;
You can access p's x/y/z coordinates with p.x, p.y, p.z. This is convenient.
But sometimes you want to access point as a float[3] array. You can use p.c for that.
Note: Using this construct is undefined behavior according to the standard. But it works on all compilers I've met so far. So, if you want to use such a construct, be aware that it may break some day.
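If the undefined behavior is a concern, similar convenience can be had without a union by giving the struct an index operator. This is only a sketch of that alternative, not the construct described above; the fallback to z for out-of-range indices is an arbitrary choice made to keep it short.

```cpp
#include <cstddef>

struct Point3D {
    float x, y, z;

    // Index 0, 1, 2 selects x, y, z; any other index falls back to z in this sketch.
    float& operator[](std::size_t i) {
        return i == 0 ? x : (i == 1 ? y : z);
    }
    const float& operator[](std::size_t i) const {
        return i == 0 ? x : (i == 1 ? y : z);
    }
};
```

Both p.x and p[0] then refer to the same float, without relying on how the compiler lays out the union.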
I actually remembered a use case I came across a while back. You know bit-fields? The standard makes very few guarantees about their layout in memory. If you want to pack binary data into an integer of a specific size, you are usually better off doing bit-wise arithmetic yourself.
However, with unions and the common initial sequence guarantee, you can put all the boilerplate behind member access syntax. So your code will look like it's using a bit-field, but will in fact just be packing bits into a predictable memory location.
Here's a Live Example
#include <cstdint>
#include <type_traits>
#include <climits>
#include <iostream>
template<typename UInt, std::size_t Pos, std::size_t Width>
struct BitField {
static_assert(std::is_integral<UInt>::value && std::is_unsigned<UInt>::value,
"To avoid UB, only unsigned integral type are supported");
static_assert(Width > 0 && Pos < sizeof(UInt) * CHAR_BIT && Width < sizeof(UInt) * CHAR_BIT - Pos,
"Position and/or width cannot be supported");
UInt mem;
BitField& operator=(UInt val) {
if((val & ((UInt(1) << Width) - 1)) == val) {
mem &= ~(((UInt(1) << Width) - 1) << Pos);
mem |= val << Pos;
}
// Should probably handle the error somehow
return *this;
}
operator UInt() const {
return (mem >> Pos) & ((UInt(1) << Width) - 1); // mask with Width one-bits, not with Width itself
}
};
struct MyColor {
union {
std::uint32_t raw;
BitField<std::uint32_t, 0, 8> r;
BitField<std::uint32_t, 8, 8> g;
BitField<std::uint32_t, 16, 8> b;
};
MyColor() : raw(0) {}
};
int main() {
MyColor c;
c.r = 0xF;
c.g = 0xA;
c.b = 0xD;
std::cout << std::hex << c.raw;
}

C++ understanding Unions and Structs

I've come to work on an ongoing project where some unions are defined as follows:
/* header.h */
typedef union my_union_t {
float data[4];
struct {
float varA;
float varB;
float varC;
float varD;
};
} my_union;
If I understand well, unions are for saving space, so sizeof(my_union_t) = MAX of the variables in it. What are the advantages of using the statement above instead of this one:
typedef struct my_struct {
float varA;
float varB;
float varC;
float varD;
};
Won't the space allocated for both of them be the same?
And how can I initialize varA,varB... from my_union?
Unions are often used when implementing a variant like object (a type field and a union of data types), or in implementing serialisation.
The way you are using a union is a recipe for disaster.
You are assuming that the struct in the union packs the floats with no gaps between them!
The standard guarantees that float data[4]; is contiguous, but not the structure elements. The only other thing you know is that the address of varA; is the same as the address of data[0].
Never use a union in this way.
As for your question: "And how can I initialize varA,varB... from my_union?". The answer is, access the structure members in the normal long-winded way not via the data[] array.
Unions are mostly not for saving space, but for implementing sum types (for that, you'll put the union in some struct or class that also has a discriminating field which keeps the run-time tag). Also, I suggest you use a recent standard of C++, at least C++11, since it has better support for unions (e.g. it more easily permits unions of objects and their construction or initialization).
The advantage of using your union is to be able to index the n-th floating point (with 0 <= n <= 3) as u.data[n]
To assign a union field in some variable declared my_union u; just code e.g. u.varB = 3.14; which in your case has the same effect as u.data[1] = 3.14;
A good example of well deserved union is a mutable object which can hold either an int or a string (you could not use derived classes in that case):
#include <new> // placement new
#include <string>
#include <utility> // std::move
class IntOrString {
bool isint;
union {
int num; // when isint is true
std::string str; // when isint is false
};
public:
IntOrString(int n=0) : isint(true), num(n) {}
IntOrString(std::string s) : isint(false), str(std::move(s)) {}
// str is not constructed yet in these constructors, so use placement new
IntOrString(const IntOrString& o) : isint(o.isint)
{ if (isint) num = o.num; else new (&str) std::string(o.str); }
IntOrString(IntOrString&& p) : isint(p.isint)
{ if (isint) num = p.num;
else new (&str) std::string(std::move(p.str)); }
~IntOrString() { if (!isint) str.~basic_string(); }
void set (int n)
{ if (!isint) str.~basic_string(); isint=true; num=n; }
void set (std::string s)
{ if (isint) { new (&str) std::string(std::move(s)); isint=false; }
else str = std::move(s); }
bool is_int() const { return isint; }
int as_int() const { return isint ? num : 0; }
std::string as_string() const { return isint ? std::string() : str; }
};
Notice the explicit calls of destructor of str field. Notice also that you can safely use IntOrString in a standard container (std::vector<IntOrString>)
See also std::optional in future versions of C++ (which conceptually is a tagged union with void)
BTW, in Ocaml, you simply code:
type intorstring = Integer of int | String of string;;
and you'll use pattern matching. If you wanted to make that mutable, you'll need to make a record or a reference of it.
You had better use unions in an idiomatic C++ way (see this for general advice).
I think the best way to understand unions is just to give two common practical examples.
The first example is working with images. Imagine you have an RGB image that is arranged in a long buffer.
What most people would do, is represent the buffer as a char* and then loop it by 3's to get the R,G,B.
What you could do instead, is make a little union, and use that to loop over the image buffer:
union RGB
{
char raw[3];
struct
{
char R;
char G;
char B;
} colors;
};
RGB* pixel = reinterpret_cast<RGB*>(&buffer[0]); // buffer is the char* image buffer
// pixel->colors.R == the red value of the first pixel
Another very useful use for unions is using registers and bitfields.
Let's say you have a 32-bit value that represents some HW register, or something.
Sometimes, to save space, you can split the 32 bits into bit fields, but you also want the whole representation of that register as a 32 bit type.
This obviously saves bit shift calculation that a lot of programmers use for no reason at all.
union MySpecialRegister
{
uint32_t reg; // 'register' is a keyword, so it cannot be used as a member name
struct
{
unsigned int firstField : 5;
unsigned int somethingInTheMiddle : 21; // 5 + 21 + 6 = 32 bits in total
unsigned int lastField : 6;
} data;
};
// Now you can read the raw register into the reg field
// then you can read the fields using the inner data struct
The advantage is that with a union you can access the same memory in two different ways.
In your example the union contains four floats. You can access those floats as varA, varB... which might be more descriptive names or you can access the same variables as an array data[0], data[1]... which might be more useful in loops.
With a union you can also use the same memory for different kinds of data, you might find that useful for things like writing a function to tell you if you are on a big endian or little endian CPU.
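Such an endianness check could look like the sketch below. Note that it uses std::memcpy to look at the first byte rather than a union, because reading a union member other than the one last written is not guaranteed to work in C++; the function name is made up for this example.

```cpp
#include <cstdint>
#include <cstring>

// Returns true when the least significant byte of a multi-byte integer
// is stored at the lowest address (little endian).
inline bool is_little_endian() {
    const std::uint32_t probe = 1;
    unsigned char first_byte = 0;
    std::memcpy(&first_byte, &probe, 1);  // inspect the byte at the lowest address
    return first_byte == 1;
}
```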
No, it is not for saving space. It is for ability to represent some binary data as various data types.
for example
#include <iostream>
#include <stdint.h>
union Foo{
int x;
struct
{
unsigned char b0, b1, b2, b3;
} y; // a member named y; 'struct y { ... };' would only declare a type
char z[sizeof(int)];
};
int main()
{
Foo bar;
bar.x = 100;
std::cout << std::hex; // to show number in hexadec repr;
for(size_t i = 0; i < sizeof(int); i++)
{
std::cout << "0x" << (int)bar.z[i] << " "; // int is just to show values as numbers, not a characters
}
return 0;
}
Output: 0x64 0x0 0x0 0x0. The same values are stored in the struct bar.y, not as an array but as structure members. That is because my machine is little-endian. If it were big-endian, then the output would be reversed: 0x0 0x0 0x0 0x64
You can achieve the same using reinterpret_cast:
#include <iostream>
#include <stdint.h>
int main()
{
int x = 100;
char * xBytes = reinterpret_cast<char*>(&x);
std::cout << std::hex; // to show number in hexadec repr;
for (size_t i = 0; i < sizeof(int); i++)
{
std::cout << "0x" << (int)xBytes[i] << " "; // (int) is just to show values as numbers, not a characters
}
return 0;
}
It's useful, for example, when you need to read a binary file that was written on a machine with a different endianness than yours. You can just access the values as a byte array and swap those bytes as you wish.
Also, it is useful when you have to deal with bit fields, but that's a whole different story :)
First of all: Avoid unions where the access goes to the same memory but through different types!
Unions do not save space at all. They only define multiple names for the same memory area! And you can only store one of the elements at a time in a union.
if you have
union X
{
int x;
char y[4];
};
you can store an int OR 4 chars, but not both! The general problem is that nobody knows which data is actually stored in a union. If you store an int and read the chars, the compiler will not check that, and there is no runtime check either. A common solution is to wrap the union in a struct with an additional member, an enum that records which data type is actually stored.
struct Y
{
enum { IS_CHAR, IS_INT } tinfo;
union
{
int x;
char y[4];
};
};
But in C++ you should always use classes or structs which can derive from a (maybe empty) parent class, like this:
class Base
{
};
class Int_Type: public Base
{
...
int x;
};
class Char_Type: public Base
{
...
char y[4];
};
So you can define pointers to Base which can actually point to an Int_Type or a Char_Type. With virtual functions you can access the members in an object-oriented way.
As already mentioned in Basile's answer, a useful case can be access via different names to the same type.
union X
{
struct
{
float a;
float b;
} data; // a member named data, not just a type declaration
float arr[2];
};
which allows different ways of accessing the same data with the same type. Using different types which are stored in the same memory should be avoided altogether!

c++ stringstream into variables of different types

I'm getting a string containing raw binary data which needs to be converted to integers. The problem is that these values are not always in the same order and do not always appear. So the format of the binary data is described in a config file, and the type of the values read from the binary data is not known at compile time.
I'm thinking of a solution similar to this:
enum BinaryType {
TYPE_UINT16,
TYPE_UINT32,
TYPE_INT32
};
// the stream is taken by reference: streams cannot be copied
long convert(BinaryType t, std::stringstream& ss) {
long return_value = 0;
switch(t) {
case TYPE_UINT16:
unsigned short us_value;
ss.read(reinterpret_cast<char*>(&us_value), sizeof(unsigned short));
return_value = us_value;
break;
case TYPE_UINT32:
unsigned int ui_value;
ss.read(reinterpret_cast<char*>(&ui_value), sizeof(unsigned int));
return_value = ui_value;
break;
case TYPE_INT32:
signed int si_value;
ss.read(reinterpret_cast<char*>(&si_value), sizeof(signed int));
return_value = si_value;
break;
}
return return_value;
}
The goal is to output these values in decimal.
My Questions are:
This code is very repetitive. Is there a simpler solution? (Templates?)
Should I use standard types like signed int if the value needs to be 32 bits? What should I use instead? What about endianness?
A simple solution: define a base class for converters:
class Converter {
public:
virtual ~Converter() {}
virtual int64_t convert(std::stringstream& ss) = 0;
};
Next define a concrete converter for each binary type. Have a map/array mapping from binary types identifiers to your converters, e.g.:
Converter* converters[MAX_BINARY_TYPES];
converters[TYPE_UINT16] = new ConverterUINT16;
...
Now, you can use it like this (variables defined like in your function convert):
cout << converters[t]->convert(ss);
For portability, instead of basic types like int, long, etc., you should use int32_t, int64_t, which have the same width on every system that provides them.
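Since the question also asked about templates: the concrete converters can come from a single class template, so that only the type parameter differs between them. The following is a sketch under that assumption; ConverterT is a hypothetical name, the interface takes std::istream& rather than std::stringstream& for generality, and the value is read in host byte order.

```cpp
#include <cstdint>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>

class Converter {
public:
    virtual ~Converter() = default;
    virtual std::int64_t convert(std::istream& in) = 0;
};

// One template covers every fixed-width integer type; T is read in host byte order.
template <typename T>
class ConverterT : public Converter {
public:
    std::int64_t convert(std::istream& in) override {
        char raw[sizeof(T)] = {};
        in.read(raw, sizeof raw);
        T value;
        std::memcpy(&value, raw, sizeof value);  // avoids aliasing and alignment problems
        return static_cast<std::int64_t>(value);
    }
};
```

The converter table would then hold entries such as new ConverterT<std::uint16_t> for TYPE_UINT16, and so on.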
Of course, if your code is meant to deal with different endianness, you need to deal with it explicitly. For the above example code you can have two different converters' sets, one for little endian data decoding, another for big endian. Another thing you can do is to write a wrapper class for std::stringstream, let's call it StringStream, which defines functions for reading int32, uint32, etc., and swaps the bytes if the endianness is different than the architecture of the system your code is running on. You can make the class a template and instantiate it with one of the two:
class SameByteOrder {
public:
template<typename T> static void swap(T &) {}
};
class OtherByteOrder {
public:
template<typename T> static void swap(T &o)
{
char *p = reinterpret_cast<char *>(&o);
size_t size = sizeof(T);
for (size_t i=0; i < size / 2; ++i)
std::swap(p[i], p[size - i - 1]);
}
};
then use the swap function inside your StringStream's functions to swap (or not) the bytes.
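A sketch of how such a wrapper might be put together is shown below. The two policy classes are repeated so the example is self-contained (with std::reverse doing the byte swap), and Reader is a hypothetical name for the wrapper described above.

```cpp
#include <algorithm>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

struct SameByteOrder {
    template <typename T> static void swap(T&) {}
};
struct OtherByteOrder {
    template <typename T> static void swap(T& o) {
        char* p = reinterpret_cast<char*>(&o);
        std::reverse(p, p + sizeof(T));  // reverse the object's bytes in place
    }
};

// Hypothetical wrapper: ByteOrder decides whether the bytes are reversed after reading.
template <typename ByteOrder>
class Reader {
public:
    explicit Reader(std::istream& in) : in_(in) {}

    template <typename T>
    T read() {
        T value{};
        in_.read(reinterpret_cast<char*>(&value), sizeof value);
        ByteOrder::swap(value);
        return value;
    }

private:
    std::istream& in_;
};
```

Which of the two policies to instantiate would be decided once, from the byte order named in the config file compared with that of the host.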

what is the purpose and return type of the __builtin_offsetof operator?

What is the purpose of __builtin_offsetof operator (or _FOFF operator in Symbian) in C++?
In addition what does it return? Pointer? Number of bytes?
It's a builtin provided by the GCC compiler to implement the offsetof macro that is specified by the C and C++ Standard:
GCC - offsetof
It returns the offset in bytes that a member of a POD struct/union is at.
Sample:
struct abc1 { int a, b, c; };
union abc2 { int a, b, c; };
struct abc3 { abc3() { } int a, b, c; }; // non-POD
union abc4 { abc4() { } int a, b, c; }; // non-POD
assert(offsetof(abc1, a) == 0); // always, because there's no padding before a.
assert(offsetof(abc1, b) == 4); // here, on my system
assert(offsetof(abc2, a) == offsetof(abc2, b)); // (members overlap)
assert(offsetof(abc3, c) == 8); // undefined behavior. GCC outputs warnings
assert(offsetof(abc4, a) == 0); // undefined behavior. GCC outputs warnings
@Jonathan provides a nice example of where you can use it. I remember having seen it used to implement intrusive lists (lists whose data items include next and prev pointers themselves), but I can't remember where it was helpful in implementing it, sadly.
As @litb points out and @JesperE shows, offsetof() provides an integer offset in bytes (as a size_t value).
When might you use it?
One case where it might be relevant is a table-driven operation for reading an enormous number of diverse configuration parameters from a file and stuffing the values into an equally enormous data structure. Reducing enormous down to SO trivial (and ignoring a wide variety of necessary real-world practices, such as defining structure types in headers), I mean that some parameters could be integers and others strings, and the code might look faintly like:
#include <stddef.h>
typedef struct config_info config_info;
struct config_info
{
int parameter1;
int parameter2;
int parameter3;
char *string1;
char *string2;
char *string3;
int parameter4;
} main_configuration;
typedef struct config_desc config_desc;
static const struct config_desc
{
char *name;
enum paramtype { PT_INT, PT_STR } type;
size_t offset;
int min_val;
int max_val;
int max_len;
} desc_configuration[] =
{
{ "GIZMOTRON_RATING", PT_INT, offsetof(config_info, parameter1), 0, 100, 0 },
{ "NECROSIS_FACTOR", PT_INT, offsetof(config_info, parameter2), -20, +20, 0 },
{ "GILLYWEED_LEAVES", PT_INT, offsetof(config_info, parameter3), 1, 3, 0 },
{ "INFLATION_FACTOR", PT_INT, offsetof(config_info, parameter4), 1000, 10000, 0 },
{ "EXTRA_CONFIG", PT_STR, offsetof(config_info, string1), 0, 0, 64 },
{ "USER_NAME", PT_STR, offsetof(config_info, string2), 0, 0, 16 },
{ "GIZMOTRON_LABEL", PT_STR, offsetof(config_info, string3), 0, 0, 32 },
};
You can now write a general function that reads lines from the config file, discarding comments and blank lines. It then isolates the parameter name, and looks that up in the desc_configuration table (which you might sort so that you can do a binary search - multiple SO questions address that). When it finds the correct config_desc record, it can pass the value it found and the config_desc entry to one of two routines - one for processing strings, the other for processing integers.
The key part of those functions is:
static int validate_set_int_config(const config_desc *desc, char *value)
{
int *data = (int *)((char *)&main_configuration + desc->offset);
...
*data = atoi(value);
...
}
static int validate_set_str_config(const config_desc *desc, char *value)
{
char **data = (char **)((char *)&main_configuration + desc->offset);
...
*data = strdup(value);
...
}
This avoids having to write a separate function for each separate member of the structure.
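To make the idea concrete, here is a cut-down version that compiles as C++ and handles integer parameters only. It is a sketch with a two-entry table; the names desc_table and set_int_config are made up for this example, and std::atoi is kept for brevity even though it does no error reporting.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

struct config_info {
    int parameter1;
    int parameter2;
};

struct config_desc {
    const char* name;
    std::size_t offset;   // byte offset of the member inside config_info
    int min_val;
    int max_val;
};

static const config_desc desc_table[] = {
    { "GIZMOTRON_RATING", offsetof(config_info, parameter1),   0, 100 },
    { "NECROSIS_FACTOR",  offsetof(config_info, parameter2), -20,  20 },
};

// Looks up `name`, range-checks the parsed value, and stores it through
// the recorded offset. Returns true on success.
static bool set_int_config(config_info& cfg, const char* name, const char* text) {
    for (const config_desc& d : desc_table) {
        if (std::strcmp(d.name, name) != 0) continue;
        const int value = std::atoi(text);
        if (value < d.min_val || value > d.max_val) return false;
        int* field = reinterpret_cast<int*>(reinterpret_cast<char*>(&cfg) + d.offset);
        *field = value;
        return true;
    }
    return false;
}
```

Adding a new integer parameter then means adding one member and one table row, with no new setter function.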
The purpose of a built-in __offsetof operator is that the compiler vendor can continue to #define an offsetof() macro, yet have it work with classes that define unary operator&. The typical C macro definition of offsetof() only worked when (&lvalue) returned the address of that lvalue. I.e.
#define offsetof(type, member) (int)(&((type *)0)->member) // C definition, not C++
struct CFoo {
struct Evil {
int operator&() { return 42; }
};
Evil foo;
};
ptrdiff_t t = offsetof(CFoo, foo); // Would call Evil::operator& and return 42
As @litb said: the offset in bytes of a struct/class member. In C++ there are cases where it is undefined, in which case the compiler will complain. IIRC, one way to implement it (in C, at least) is to do
#define offsetof(type, member) (int)(&((type *)0)->member)
But I'm sure there are problems with this, but I'll leave that to the interested reader to point out...