Storing integral values in a byte sequence in C++ - c++

I am implementing an LZW compression/decompression utility library and am in need of returning the compressed output in what I am using as:
using ByteSequence = std::vector<std::uint8_t>;
The output format for the compressor will include the positions in the compressor's dictionary of various sequences found by the algorithm. For example, having 16-bit positions in the output would look like:
std::vector<std::uint16_t> pos{123, 385, /* ... */};
The output, however, needs to be a ByteSequence, and it needs to be portable among architectures. What I am currently doing to convert the pos vector to the desired format is:
for (auto p : pos)
{
std::uint8_t *bytes = (std::uint8_t *) &p;
output.push_back(bytes[0]);
output.push_back(bytes[1]);
}
This works, but only under the assumption that the keys will be 16-bit each and to be honest, it looks like a cheap trick to me.
How should I do this in a better, cleaner way? Thank you!

The way you extract bytes is undefined behaviour. The C++ standard [basic.lval] reads:
If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:
. . .
a char, unsigned char, or std::byte type.
std::uint8_t is not in this list, and AFAIK there is no guarantee that std::uint8_t and unsigned char are the same type.
A conversion function might look like:
template<typename T>
void convert_forward(const std::vector<T>& in, std::vector<std::uint8_t>& out) {
out.reserve(out.size() + in.size() * sizeof(T));
for (const T& i : in) {
std::uint8_t buff[sizeof(T)];
std::memcpy(buff, &i, sizeof(T));
std::copy(std::begin(buff), std::end(buff), std::back_inserter(out));
}
}
Alternative implementation without back_inserter:
template<typename T>
void convert_forward(const std::vector<T>& in, std::vector<std::uint8_t>& out) {
const auto old_size = out.size();
out.resize(old_size + in.size() * sizeof(T));
auto dest = out.data() + old_size;
for (const T& i : in) {
std::memcpy(dest, &i, sizeof(T));
dest += sizeof(T);
}
}
Beware of endianness: memcpy copies the bytes in the host's native order, so it must be taken into account in either the forward conversion or the backward one.

This should be portable, though possibly not so efficient as direct byte manipulation:
template<class T>
void number2bytes(std::vector<uint8_t>& bytes, T x)
{
static_assert(std::is_integral<T>::value, "Integral required.");
for (size_t i = 0; i < sizeof(T); ++i)
{
bytes.push_back(x & 0xFF);
x >>= 8;
}
}
The static_assert is there to protect against accidentally passing some weird non-number type that overloads & and >>=.
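The reverse direction can be sketched in the same style. This is only an illustration: bytes2number is a hypothetical name, it assumes the bytes were written least significant byte first (as number2bytes does), and it is restricted to unsigned types to keep the shifts well defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical counterpart to number2bytes: reads sizeof(T) bytes starting
// at `offset`, least significant byte first.
template <class T>
T bytes2number(const std::vector<std::uint8_t>& bytes, std::size_t offset)
{
    static_assert(std::is_unsigned<T>::value, "Unsigned integral required.");
    assert(offset + sizeof(T) <= bytes.size()); // caller must stay in range
    T result = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i)
    {
        // Widen the byte to T before shifting it into position i.
        result = static_cast<T>(result | (static_cast<T>(bytes[offset + i]) << (8 * i)));
    }
    return result;
}
```

Because the byte order is fixed by the shifts rather than by the host, the same byte sequence decodes to the same values on little-endian and big-endian machines.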

Related

compile-time variable-length objects based on string

Related SO questions:
variable size struct
string-based generator
Sadly neither one (or other similar ones) provide the solution I'm looking for.
Background
USB descriptors are (generally) byte-array structures. A "string descriptor" is defined as an array of bytes, that begins with a standard "header" of 2 bytes, followed by a string of UNICODE (16-bit) characters.
For example a USB string descriptor of value "AB" would have the following sequence of bytes:
0x06 0x03 0x41 0x00 0x42 0x00
where 0x06 is the total size of the descriptor (including the header), 0x03 is its "type" (defined by the standard)
Current (unsatisfactory) approach:
// other types omitted for clarity
enum UsbDescriptorType: uint8_t { USB_DESCR_STRING = 0x03 };
struct UsbDescrStd {
uint8_t bLength;
UsbDescriptorType bDescriptorType;
};
template<size_t N>
struct UsbDescrString final: UsbDescrStd {
char str[N * 2];
constexpr UsbDescrString(const char s[N]) noexcept
: UsbDescrStd{sizeof(*this), UsbDescriptorType::USB_DESCR_STRING}
, str {}
{
for(size_t i = 0; i < N; ++i)
str[i * 2] = s[i];
}
};
Below are the examples of its usage and short comments on why they are "not good enough" for me:
// requires size information
constexpr UsbDescrString<9> uds9{"Descr str"};
// string duplication
constexpr UsbDescrString<sizeof("Descr str")-1> udsa{"Descr str"};
// requires an explicit string storage
constexpr auto UsbDescrStrTxt{"Descr str"};
constexpr UsbDescrString<sizeof(UsbDescrStrTxt)-1> udsa2{UsbDescrStrTxt};
// ugly use of a macro
#define MAKE_UDS(name, s) UsbDescrString<sizeof(s)-1> name{s}
constexpr MAKE_UDS(udsm, "Descr str");
"String argument to template" is explicitly prohibited as of C++20, cutting that solution off as well.
What I'm trying to achieve
Ideally I'd love to be able to write code like the following:
constexpr UsbDescrString uds{"Descr str"}; // or a similar "terse" approach
It is simple, terse, error-resistant, and to the point. And I need help writing my UsbDescrString in a way that allows me to create compile-time objects without unnecessary code bloat.
Adding a deduction guide (CTAD) to UsbDescrString should be enough:
template<size_t N>
struct UsbDescrString final: UsbDescrStd {
char str[N * 2];
constexpr UsbDescrString(const char (&s)[N+1]) noexcept
: UsbDescrStd{sizeof(*this), UsbDescriptorType::USB_DESCR_STRING}
, str {}
{
for(size_t i = 0; i < N; ++i)
str[i * 2] = s[i];
}
};
template<size_t N>
UsbDescrString(const char (&)[N]) -> UsbDescrString<N-1>;
Note that in order to prevent array to pointer decay, const char (&) needs to be used as the constructor parameter.
"String argument to template" is explicitly prohibited as of C++20,
cutting that solution off as well.
However, thanks to P0732 (class types as non-type template parameters), with a helper class such as basic_fixed_string, in C++20 you can write:
template<fixed_string>
struct UsbDescrString final: UsbDescrStd;
constexpr UsbDescrString<"Descr str"> uds9;

concatenating uint16_t and uint32_t values for hashing

I am trying to concatenate (not add) 2 uint16_t struct members and 2 uint32_t struct members and assign the result to const void *p for the purpose of hashing. The struct and the concatenation function that I am trying to implement are as follows.
struct xyz {
....
uint32_t a;
uint32_t b;
....
uint16_t c;
uint16_t d;
....
};
const void *p=concatenation(xyz.a,xyz.b,xyz.c,xyz.d);
Edited:
I have to use pre-defined hash functions. The most suitable hash function for my task seems to be this.
uint32_t hash(const uint32_t p[], size_t n)
{
//Returns the hash of the 'n' 32-bit words at 'p'
}
or
uint32_t hash64(const uint64_t p[], size_t n)
{
//Returns the hash of the 'n' 64-bit words at 'p'
}
for the purpose of hashing
In this case, I'd prefer providing a custom hash function, or specialising std::hash for the type. For use with standard templates, this might look like this:
namespace std // any extension of std namespace is UB
// sole exception: specialising templates, which we are going to do
{
template <>
struct hash<xyz>
{
size_t operator()(xyz const& i) const
{
// TODO: need to calculate the value from a, b, c, and d appropriately
return 0;
}
};
// if xyz is polymorphic, you might need to operate on pointers
// no problem either:
template <>
struct hash<xyz*>
{
size_t operator()(xyz const* i) const
{
return hash<xyz>()(*i);
// or, if the hash value is type dependent, use this instead:
// return i->hash(); // custom virtual hash member function needs to be implemented
}
};
} // namespace std
// now you can have (xyz also needs an operator== for this to compile)
std::unordered_set<xyz> someSet;
void demo()
{
someSet.insert(xyz());
}
(Untested code, in case of errors please fix yourself.)
A list of hashing algorithms which might be used can be found on Wikipedia.
If you want the value to fit into a pointer, the full value can be 32 bits on x86 or 64 bits on x64. I'm going to assume you are compiling for 64 bit machines.
This means you can only fit 2 uint16 and one uint32, or 2 uint32s.
Either way, you would shift the values into a uint64, for example ((uint64_t)c | ((uint64_t)d << 16) | ((uint64_t)a << 32)), and then convert that value to a void*. The casts matter: shifting a 16-bit or 32-bit operand by 32 without widening it first is undefined behaviour.
Edit: for clarification, you cannot fit all of the struct's members, bit-shifted one after another, into a single pointer. You need a minimum of 96 bits to hold the packed struct, which means at least two 64-bit pointers.
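Since the pre-defined hash from the question takes an array of 32-bit words, the 96 bits can instead be packed into three words and passed by pointer and count. A minimal sketch, assuming the order a, b, then c and d sharing the last word; pack_words is a hypothetical name, and the layout is an arbitrary choice that only has to be used consistently:

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: packs a, b, c and d into three 32-bit words.
// c occupies the upper 16 bits and d the lower 16 bits of the last word.
inline std::array<std::uint32_t, 3> pack_words(std::uint32_t a, std::uint32_t b,
                                               std::uint16_t c, std::uint16_t d)
{
    // Widen c before shifting so the shift is done on a 32-bit unsigned value.
    const std::uint32_t cd = (static_cast<std::uint32_t>(c) << 16) | d;
    return {{a, b, cd}};
}
```

With the question's hash(const uint32_t p[], size_t n), the call would then be hash(words.data(), words.size()), where words is the returned array. The array owns its storage, which avoids the question of who owns the memory behind a void*.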
There are a few things to consider:
Does that hash value need to be portable across systems? If it does, then you will need to be careful to order the bytes the same way on different systems. If not, then the implementation can be simpler.
Do you want to hash every member of the class, does the class have no padding, and do equal member values always have identical byte representations?
If both of these simplifications apply, then your function is fast and easy to implement, but violating either precondition will break the hash. If not, then you must serialise the data into a buffer, which practically means that you cannot simply return a pointer.
Here is a super simple implementation for the case that you don't need portability, and you hash all members, and there is no padding:
xyz example;
static_assert(std::has_unique_object_representations_v<xyz>);
const void* p = &example;
Note that this doesn't work with (IEEE-754) float members due to peculiarities of NaN.
A more robust solution that can produce hashes that are portable across systems is to use a general purpose serialisation scheme, and hash the serialised result. There is no standard serialisation functionality in C++.
void* has problems like: Who owns the memory? What's the type you are going to reinterpret the pointer as?
A more typed solution would be to use a std::array of std::byte; then you at least know that you're looking at an array of raw bytes and nothing else:
#include <cstdint>
#include <array>
#include <cstddef>
#include <cstring>
auto concat(std::uint32_t a, std::uint32_t b, std::uint16_t c, std::uint16_t d) {
std::array<std::byte, sizeof a + sizeof b + sizeof c + sizeof d> res;
std::byte* p = res.data();
std::memcpy(p, &a, sizeof a);
std::memcpy(p += sizeof a, &b, sizeof b);
std::memcpy(p += sizeof b, &c, sizeof c);
std::memcpy(p += sizeof c, &d, sizeof d);
return res;
}
int main() {
std::uint32_t a = 1, b = 0;
std::uint16_t c = 1, d = 0;
auto res = concat(a, b, c, d);
return 0;
}

C++ Cast double to char and replace into std::array

For RAM optimization purpose, I need to store my data as a std::array<char, N> where N is the "templated size_t" size of the array.
I need to manage a huge amount of "Line" objects that contain any kind of data (numerical and char).
So my Line class is:
template<size_t NByte>
class Line {
public:
Line (std::array<char, NByte> data, std::vector<size_t> offset) :
_data(data), _offset(offset) {}
template<typename T>
T getValue (const size_t& index) const {
return *reinterpret_cast<const T*>(_data.data() + _offset[index]);
}
template<typename T>
void setValue (const size_t& index, const T value) const {
char * new_value = const_cast<char *>(reinterpret_cast<const char *>(&value));
std::move(new_value, new_value + sizeof(T), const_cast<char *>(_data.data() + _offset[index]));
}
private:
std::array<char, NByte> _data;
std::vector<size_t> _offset;
};
My questions are:
Is there a better way to do the setter and getter function?
Is this robust against memory leak?
Is there any problem to use this code in production/release?
Edit: The question behind those is: Is there any other way to work with binary data in memory and provide "human understandable" interface for final user through setter and getter?
Is there any problem in using this code in production/release?
Yes, this code is platform-dependent.
The data will be stored differently in Big-Endian platforms and in Little-Endian platforms.
If you're counting on two systems communicating with each other (transmitting and receiving this data), then you will have to make sure that both sides use platforms of the same endianness.
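One way around that restriction is to fix the byte order in the code instead of relying on the host. The sketch below is only an illustration and is not part of the Line class from the question: store_u32_le and load_u32_le are hypothetical helpers that always write and read a 32-bit value least significant byte first.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: writes `value` at `offset`, least significant byte
// first, regardless of the endianness of the host.
template <std::size_t N>
void store_u32_le(std::array<char, N>& data, std::size_t offset, std::uint32_t value)
{
    assert(offset + 4 <= N); // caller must stay inside the array
    for (std::size_t i = 0; i < 4; ++i)
        data[offset + i] = static_cast<char>(static_cast<unsigned char>(value >> (8 * i)));
}

// Hypothetical helper: reads back a value written by store_u32_le.
template <std::size_t N>
std::uint32_t load_u32_le(const std::array<char, N>& data, std::size_t offset)
{
    assert(offset + 4 <= N);
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 4; ++i)
        value |= static_cast<std::uint32_t>(static_cast<unsigned char>(data[offset + i])) << (8 * i);
    return value;
}
```

Going through unsigned char also sidesteps the alignment requirement that reinterpret_cast<const T*> on an arbitrary offset would otherwise impose.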

C++ variable length arrays in struct

I am writing a program for creating, sending, receiving and interpreting ARP packets. I have a structure representing the ARP header like this:
struct ArpHeader
{
unsigned short hardwareType;
unsigned short protocolType;
unsigned char hardwareAddressLength;
unsigned char protocolAddressLength;
unsigned short operationCode;
unsigned char senderHardwareAddress[6];
unsigned char senderProtocolAddress[4];
unsigned char targetHardwareAddress[6];
unsigned char targetProtocolAddress[4];
};
This only works for hardware addresses with length 6 and protocol addresses with length 4. The address lengths are given in the header as well, so to be correct the structure would have to look something like this:
struct ArpHeader
{
unsigned short hardwareType;
unsigned short protocolType;
unsigned char hardwareAddressLength;
unsigned char protocolAddressLength;
unsigned short operationCode;
unsigned char senderHardwareAddress[hardwareAddressLength];
unsigned char senderProtocolAddress[protocolAddressLength];
unsigned char targetHardwareAddress[hardwareAddressLength];
unsigned char targetProtocolAddress[protocolAddressLength];
};
This obviously won't work since the address lengths are not known at compile time. Template structures aren't an option either since I would like to fill in values for the structure and then just cast it from (ArpHeader*) to (char*) in order to get a byte array which can be sent on the network or cast a received byte array from (char*) to (ArpHeader*) in order to interpret it.
One solution would be to create a class with all header fields as member variables, a function to create a byte array representing the ARP header which can be sent on the network and a constructor which would take only a byte array (received on the network) and interpret it by reading all header fields and writing them to the member variables. This is not a nice solution though since it would require a LOT more code.
In contrast, a similar structure for a UDP header, for example, is simple since all header fields are of known constant size. I use
#pragma pack(push, 1)
#pragma pack(pop)
around the structure declaration so that I can actually do a simple C-style cast to get a byte array to be sent on the network.
Is there any solution I could use here which would be close to a structure or at least not require a lot more code than a structure?
I know the last field in a structure (if it is an array) does not need a specific compile-time size, can I use something similar like that for my problem? Just leaving the sizes of those 4 arrays empty will compile, but I have no idea how that would actually function. Just logically speaking it cannot work since the compiler would have no idea where the second array starts if the size of the first array is unknown.
You want a fairly low-level thing, an ARP packet, and you are trying to find a way to define a data structure properly so you can cast the blob into that structure. Instead, you can use an interface over the blob.
struct ArpHeader {
mutable std::vector<uint8_t> buf_;
template <typename T>
struct ref {
uint8_t * const p_;
ref (uint8_t *p) : p_(p) {}
operator T () const { T t; memcpy(&t, p_, sizeof(t)); return t; }
T operator = (T t) const { memcpy(p_, &t, sizeof(t)); return t; }
};
template <typename T>
ref<T> get (size_t offset) const {
if (offset + sizeof(T) > buf_.size()) throw SOMETHING;
return ref<T>(&buf_[0] + offset);
}
ref<uint16_t> hwType() const { return get<uint16_t>(0); }
ref<uint16_t> protType () const { return get<uint16_t>(2); }
ref<uint8_t> hwAddrLen () const { return get<uint8_t>(4); }
ref<uint8_t> protAddrLen () const { return get<uint8_t>(5); }
ref<uint16_t> opCode () const { return get<uint16_t>(6); }
uint8_t *senderHwAddr () const { return &buf_[0] + 8; }
uint8_t *senderProtAddr () const { return senderHwAddr() + hwAddrLen(); }
uint8_t *targetHwAddr () const { return senderProtAddr() + protAddrLen(); }
uint8_t *targetProtAddr () const { return targetHwAddr() + hwAddrLen(); }
};
If you need const correctness, you remove mutable, create a const_ref, and duplicate the accessors into non-const versions, and make the const versions return const_ref and const uint8_t *.
Short answer: you just cannot have variable-sized types in C++.
Every type in C++ must have a known (and stable) size during compilation. I.e., the sizeof operator must give a consistent answer. Note that you can have types that hold a variable amount of data (e.g. std::vector<int>) by using the heap, yet the size of the actual object is always constant.
So, you can never produce a type declaration that you would cast and get the fields magically adjusted. This goes deeply into the fundamental object layout - every member (aka field) must have a known (and stable) offset.
Usually, the issue is solved by writing (or generating) member functions that parse the input data and initialize the object's data. This is basically the age-old data serialization problem, which has been solved countless times in the last 30 or so years.
Here is a mockup of a basic solution:
class packet {
public:
// simple things
uint16_t hardware_type() const;
// variable-sized things
size_t sender_address_len() const;
bool copy_sender_address_out(char *dest, size_t dest_size) const;
// initialization
bool parse_in(const char *src, size_t len);
private:
uint16_t hardware_type_;
std::vector<char> sender_address_;
};
Notes:
the code above shows the very basic structure that would let you do the following:
packet p;
if (!p.parse_in(input, sz))
return false;
the modern way of doing the same thing via RAII would look like this:
if (!packet::validate(input, sz))
return false;
packet p = packet::parse_in(input, sz); // static function
// returns an instance or throws
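To make the mockup concrete, here is a minimal sketch of what a parse_in for the ARP layout from the question could look like. The type ArpPacket and its member names are hypothetical; the offsets follow the field order of the question's struct (8 fixed bytes, then the four addresses), and multi-byte fields are assumed to be in network (big-endian) byte order.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type for illustration; names are assumptions.
struct ArpPacket
{
    std::uint16_t hardwareType = 0;
    std::uint16_t protocolType = 0;
    std::uint16_t operationCode = 0;
    std::vector<std::uint8_t> senderHardwareAddress;
    std::vector<std::uint8_t> senderProtocolAddress;
    std::vector<std::uint8_t> targetHardwareAddress;
    std::vector<std::uint8_t> targetProtocolAddress;

    // Returns false if the buffer is shorter than the lengths it announces.
    bool parse_in(const std::uint8_t* src, std::size_t len)
    {
        const std::size_t fixed = 8; // bytes before the variable-length part
        if (len < fixed) return false;
        const std::size_t hlen = src[4]; // hardware address length
        const std::size_t plen = src[5]; // protocol address length
        if (len < fixed + 2 * (hlen + plen)) return false;

        // Assemble 16-bit fields from two bytes, most significant first.
        hardwareType  = static_cast<std::uint16_t>((src[0] << 8) | src[1]);
        protocolType  = static_cast<std::uint16_t>((src[2] << 8) | src[3]);
        operationCode = static_cast<std::uint16_t>((src[6] << 8) | src[7]);

        // Copy the four addresses; each start depends on the lengths above.
        const std::uint8_t* p = src + fixed;
        senderHardwareAddress.assign(p, p + hlen); p += hlen;
        senderProtocolAddress.assign(p, p + plen); p += plen;
        targetHardwareAddress.assign(p, p + hlen); p += hlen;
        targetProtocolAddress.assign(p, p + plen);
        return true;
    }
};
```

The length check happens before any address is copied, so a truncated packet is rejected rather than read past its end.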
If you want to keep access to the data simple and the data itself public, there is a way to achieve what you want without changing the way you access data. First, you can use std::string instead of the char arrays to store the addresses:
#include <string>
using namespace std; // using this to shorten notation. Preferably put 'std::'
// everywhere you need it instead.
struct ArpHeader
{
unsigned char hardwareAddressLength;
unsigned char protocolAddressLength;
string senderHardwareAddress;
string senderProtocolAddress;
string targetHardwareAddress;
string targetProtocolAddress;
};
Then, you can overload the conversion operator operator const char*() and the constructor ArpHeader(const char*) (and of course, preferably, operator=(const char*) too), in order to keep your current sending/receiving functions working, if that's what you need.
A simplified conversion operator (skipped some fields, to make it less complicated, but you should have no problem in adding them back), would look like this:
operator const char*(){
char* myRepresentation;
unsigned char mySize
= 2+ senderHardwareAddress.length()
+ senderProtocolAddress.length()
+ targetHardwareAddress.length()
+ targetProtocolAddress.length();
// We need to store the size, since it varies
myRepresentation = new char[mySize+1];
myRepresentation[0] = mySize;
myRepresentation[1] = hardwareAddressLength;
myRepresentation[2] = protocolAddressLength;
unsigned int offset = 3; // just to shorten notation
memcpy(myRepresentation+offset, senderHardwareAddress.c_str(), senderHardwareAddress.size());
offset += senderHardwareAddress.size();
memcpy(myRepresentation+offset, senderProtocolAddress.c_str(), senderProtocolAddress.size());
offset += senderProtocolAddress.size();
memcpy(myRepresentation+offset, targetHardwareAddress.c_str(), targetHardwareAddress.size());
offset += targetHardwareAddress.size();
memcpy(myRepresentation+offset, targetProtocolAddress.c_str(), targetProtocolAddress.size());
return myRepresentation;
}
While the constructor can be defined as such:
ArpHeader& operator=(const char* buffer){
hardwareAddressLength = buffer[1];
protocolAddressLength = buffer[2];
unsigned int offset = 3; // just to shorten notation
senderHardwareAddress = string(buffer+offset, hardwareAddressLength);
offset += hardwareAddressLength;
senderProtocolAddress = string(buffer+offset, protocolAddressLength);
offset += protocolAddressLength;
targetHardwareAddress = string(buffer+offset, hardwareAddressLength);
offset += hardwareAddressLength;
targetProtocolAddress = string(buffer+offset, protocolAddressLength);
return *this;
}
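ArpHeader() = default; // still needed for "ArpHeader h1, h2;" once another constructor is declared
ArpHeader(const char* buffer){
*this = buffer; // Re-using the operator=
}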
Then using your class is as simple as:
ArpHeader h1, h2;
h1.hardwareAddressLength = 3;
h1.protocolAddressLength = 10;
h1.senderHardwareAddress = "foo";
h1.senderProtocolAddress = "something1";
h1.targetHardwareAddress = "bar";
h1.targetProtocolAddress = "something2";
cout << h1.senderHardwareAddress << ", " << h1.senderProtocolAddress
<< " => " << h1.targetHardwareAddress << ", " << h1.targetProtocolAddress << endl;
const char* gottaSendThisSomewhere = h1;
h2 = gottaSendThisSomewhere;
cout << h2.senderHardwareAddress << ", " << h2.senderProtocolAddress
<< " => " << h2.targetHardwareAddress << ", " << h2.targetProtocolAddress << endl;
delete[] gottaSendThisSomewhere;
Which should offer you the utility needed, and keep your code working without changing anything out of the class.
Note, however, that if you're willing to change the rest of the code a bit (talking here about the code you've written already, outside of the class), jxh's answer should work as fast as this, and is more elegant on the inner side.

Simplest way to read binary data from a std::vector<unsigned char>?

I have a lump of binary data in the form of const std::vector<unsigned char>, and want to be able to extract individual fields from that, such as 4 bytes for an integer, 1 for a boolean, etc. This needs to be, as far as possible, both efficient and simple. eg. It should be able to read the data in place without needing to copy it (eg. into a string or array). And it should be able to read one field at a time, like a parser, since the lump of data does not have a fixed format. I already know how to determine what type of field to read in each case - the problem is getting a usable interface on top of an std::vector for doing this.
However, I can't find a simple way to get this data into an easily usable form that gives me useful read functionality. E.g. std::basic_istringstream<unsigned char> gives me a reading interface, but it seems like I need to copy the data into a temporary std::basic_string<unsigned char> first, which is not ideal for bigger blocks of data.
Maybe there is some way I can use a streambuf in this situation to read the data in place, but it would appear that I'd need to derive my own streambuf class to do that.
It occurs to me that I can probably just use sscanf on the vector's data(), and that would seem to be both more succinct and more efficient than the C++ standard library alternatives. EDIT: Having been reminded that sscanf doesn't do what I wrongly thought it did, I actually don't know a clean way to do this in C or C++. But am I missing something, and if so, what?
You have access to the data in a vector through its operator[]. A vector's data is guaranteed to be stored in a single contiguous array, and [] returns a reference to a member of that array. You may use that reference directly, or through a memcpy.
std::vector<unsigned char> v;
...
byteField = v[12];
memcpy(&intField, &v[13], sizeof intField);
memcpy(charArray, &v[20], lengthOfCharArray);
EDIT 1:
If you want something "more convenient" than that, you could try:
template <class T>
void ReadFromVector(T& t, std::size_t offset,
const std::vector<unsigned char>& v) {
memcpy(&t, &v[offset], sizeof(T));
}
Usage would be:
std::vector<unsigned char> v;
...
char c;
int i;
uint64_t ull;
ReadFromVector(c, 17, v);
ReadFromVector(i, 99, v);
ReadFromVector(ull, 43, v);
EDIT 2:
struct Reader {
const std::vector<unsigned char>& v;
std::size_t offset;
Reader(const std::vector<unsigned char>& v) : v(v), offset() {}
template <class T>
Reader& operator>>(T& t) {
memcpy(&t, &v[offset], sizeof t);
offset += sizeof t;
return *this;
}
void operator+=(int i) { offset += i; }
// the vector is const, so only a pointer to const can be handed out
const char *getStringPointer() const { return reinterpret_cast<const char *>(&v[offset]); }
};
Usage:
std::vector<unsigned char> v;
Reader r(v);
int i; uint64_t ull;
r >> i >> ull;
const char *companyName = r.getStringPointer();
r += strlen(companyName);
If your vector stores binary data, you can't use sscanf or similar, they work on text.
Converting a byte to a bool is simple enough:
bool b = my_vec[10];
For extracting an unsigned int that's stored in big endian order (assuming your ints are 32 bits):
unsigned int i = (unsigned int)my_vec[10] << 24 | my_vec[11] << 16 | my_vec[12] << 8 | my_vec[13];
A 16 bit unsigned short would be similar:
unsigned short s = my_vec[10] << 8 | my_vec[11];
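The same shifts can be wrapped in small helpers that take the position as a parameter and check it against the vector's size. This is a sketch only; read_u32_be and read_u16_be are hypothetical names, and the fixed-width types replace the assumption that unsigned int is 32 bits wide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: reads a big-endian 32-bit value starting at `pos`.
inline std::uint32_t read_u32_be(const std::vector<unsigned char>& v, std::size_t pos)
{
    assert(pos + 4 <= v.size()); // caller must stay in range
    // Each byte is widened to 32 bits before shifting, so no shift overflows.
    return (static_cast<std::uint32_t>(v[pos]) << 24) |
           (static_cast<std::uint32_t>(v[pos + 1]) << 16) |
           (static_cast<std::uint32_t>(v[pos + 2]) << 8) |
            static_cast<std::uint32_t>(v[pos + 3]);
}

// Hypothetical helper: reads a big-endian 16-bit value starting at `pos`.
inline std::uint16_t read_u16_be(const std::vector<unsigned char>& v, std::size_t pos)
{
    assert(pos + 2 <= v.size());
    return static_cast<std::uint16_t>((v[pos] << 8) | v[pos + 1]);
}
```

Both helpers read the data in place; nothing is copied out of the vector except the value being returned.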
If you can afford the Qt dependency, QByteArray has the fromRawData() named constructor, which wraps existing data buffers in a QByteArray without copying the data. With that byte array, you can then feed a QTextStream.
I'm not aware of any such function in the standard streams library (short of implementing your own streambuf, of course), but I'd love to be proved wrong :)
You can use a struct that describes the data you are trying to extract. You can move data from your vector into the struct like this:
struct MyData {
int intVal;
bool boolVal;
char stringVal[15];
} __attribute__((__packed__));
// assuming all extracted types are prefixed with a one byte indicator.
// Also assumes "vec" is your populated vector
std::size_t pos = 0;
while (pos + 1 < vec.size()) {
switch(vec[pos++]) {
case 0: { // handle int
int intValue;
memcpy(&intValue, &vec[pos], sizeof(int)); // destination first, then source
pos += sizeof(int);
// do something with handled value
break;
}
case 1: { // handle double
double doubleValue;
memcpy(&doubleValue, &vec[pos], sizeof(double)); // destination first, then source
pos += sizeof(double);
// do something with handled value
break;
}
case 2: { // handle MyData
struct MyData data;
memcpy(&data, &vec[pos], sizeof(struct MyData)); // destination first, then source
pos += sizeof(struct MyData);
// do something with handled value
break;
}
default: {
// ERROR: unknown type indicator
break;
}
}
}
Use a for loop to iterate over the vector and use bitwise operators to access each bit group. For example, to access the upper four bits of the first unsigned char in your vector:
int myInt = vec[0] & 0xF0;
To read the fourth bit from the right, right after the chunk we just read:
bool myBool = vec[0] & 0x08;
The three least significant (lowest) bits can be accessed like so:
int myInt2 = vec[0] & 0x07;
You can then repeat this process (using a for loop) for every element in your vector.
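Note that masking alone leaves the bits where they are, so vec[0] & 0xF0 yields a multiple of 16 rather than a number from 0 to 15. If the field is meant to be a small number, shift it down as well. A minimal sketch, assuming the 4/1/3-bit layout described above; Fields and unpack are hypothetical names:

```cpp
#include <cstdint>

// Hypothetical layout for illustration: bits 7..4 hold a number,
// bit 3 holds a flag, bits 2..0 hold a second number.
struct Fields
{
    unsigned high;
    bool flag;
    unsigned low;
};

inline Fields unpack(std::uint8_t byte)
{
    Fields f;
    f.high = (byte >> 4) & 0x0F;  // shift down so the result is in 0..15
    f.flag = (byte & 0x08) != 0;  // fourth bit from the right
    f.low  = byte & 0x07;         // lowest three bits, already in 0..7
    return f;
}
```

Calling unpack on each element inside the for loop gives the three fields for every byte of the vector.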