Nice representation of byte array and its size - c++

How would you represent a byte array and its size nicely? I'd like to store (in main memory or within a file) raw byte arrays (unsigned chars) in which the first 2/4 bytes represent the size. But operations on such an array do not look nice:
void func(unsigned char *bytearray)
{
    int size;
    memcpy(&size, bytearray, sizeof(int));
    // rest of the operation, now that we know the bytearray size
}
How can I avoid that? I think about a simple structure:
struct bytearray
{
    int size;
    unsigned char *data;
};
bytearray *b = reinterpret_cast<bytearray*>(new unsigned char[10]);
b->data = reinterpret_cast<unsigned char*>(&(b->size) + 1);
And I've got access to the size and data parts of the bytearray. But it still looks ugly. Could you recommend another approach?

Unless you have some overwhelming reason to do otherwise, just do the idiomatic thing and use std::vector<unsigned char>.
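For instance, the length prefix itself can live inside the vector; here is a minimal sketch of packing and unpacking (the uint32_t prefix width and native byte order are assumptions, not anything from the question):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Prepend a 4-byte native-order length, then the payload.
std::vector<unsigned char> pack(const std::vector<unsigned char>& payload)
{
    std::vector<unsigned char> out;
    uint32_t size = static_cast<uint32_t>(payload.size());
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&size);
    out.insert(out.end(), p, p + sizeof size);
    out.insert(out.end(), payload.begin(), payload.end());
    return out;
}

// Read the prefix back and return just the payload bytes.
std::vector<unsigned char> unpack(const std::vector<unsigned char>& blob)
{
    uint32_t size = 0;
    std::memcpy(&size, blob.data(), sizeof size);
    return std::vector<unsigned char>(blob.begin() + sizeof size,
                                      blob.begin() + sizeof size + size);
}
```

The vector owns the memory, so there is no manual delete, and the prefix travels with the data.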

You're effectively re-inventing the "Pascal string". However
b->data = reinterpret_cast<unsigned char*>(&(b->size) + 1);
won't work at all: &(b->size) + 1 is the address of the data member itself, so the pointer points at itself and will be overwritten by the very bytes it is meant to address.
You should be able to use an array with unspecified size for the last element of a structure:
struct bytearray
{
    int size;
    unsigned char data[];
};
bytearray *b = reinterpret_cast<bytearray*>(::operator new(sizeof(bytearray) + 10));
b->size = 10;
//...
::operator delete(b);
Unlike std::vector, this actually stores the size and data together, so you can, for example, write it to a file in one operation. And memory locality is better.
Still, the fact that std::vector is already tested and many useful algorithms are implemented for you makes it very attractive.
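The single-write claim can be sketched like this (flexible array members are a compiler extension in C++, so this assumes GCC/Clang; error handling is omitted, and make_bytearray/dump are hypothetical helper names):

```cpp
#include <cstdio>
#include <new>

struct bytearray
{
    int size;
    unsigned char data[];  // flexible array member: a compiler extension in C++
};

// Allocate header plus n payload bytes in one contiguous block.
bytearray* make_bytearray(int n)
{
    bytearray* b = static_cast<bytearray*>(::operator new(sizeof(bytearray) + n));
    b->size = n;
    return b;
}

// Because size and data are contiguous, one fwrite stores the whole record.
bool dump(const bytearray* b, std::FILE* f)
{
    return std::fwrite(b, sizeof(bytearray) + b->size, 1, f) == 1;
}
```

Reading it back is the mirror image: fread the fixed header first, then the payload (or the whole record if the size is known).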

I would use std::vector<unsigned char> to manage the memory, and write a conversion function to create some iovec like structure for you at the time that you need such a thing.
iovec make_iovec (std::vector<unsigned char> &v) {
    iovec iv = { &v[0], v.size() };
    return iv;
}
Using iovec, if you need to write both the length and data in a single system call, you can use the writev call to accomplish it.
ssize_t write_vector(int fd, std::vector<unsigned char> &v) {
    uint32_t len = htonl(v.size());
    iovec iv[2] = { { &len, sizeof(uint32_t) }, make_iovec(v) };
    return writev(fd, iv, 2);
}
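The receiving side can mirror this: read the 4-byte network-order length first, then the payload (a sketch; short reads and EINTR are not handled, and read_vector is a hypothetical name):

```cpp
#include <arpa/inet.h>   // ntohl
#include <unistd.h>      // read
#include <cstdint>
#include <vector>

// Read a 4-byte network-order length, then that many payload bytes.
bool read_vector(int fd, std::vector<unsigned char>& v)
{
    uint32_t len = 0;
    if (read(fd, &len, sizeof len) != (ssize_t)sizeof len)
        return false;
    v.resize(ntohl(len));
    return read(fd, v.data(), v.size()) == (ssize_t)v.size();
}
```

A production version would loop until all bytes arrive, since read may return fewer bytes than requested on sockets and pipes.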


Function that dynamically construct a byte array and return length

I need to create an encoder function in a class
bool encodeMsg(unsigned char* buffer, unsigned short& len);
This class has some fixed length members and some variable length vectors (of different structures).
I have to encode a Byte stream based on some sequence of these member variables.
Here is a scaled-down version:
class test
{
public:
    test();
    ~test();
    bool encodeMsg(unsigned char* buffer);
    bool decodeMsg(const unsigned char* buffer, unsigned short len);
private:
    unsigned char a; // 0x12
    unsigned char b; // 0x34
    unsigned char c; // 0x56
};
What I want is 0x123456 in my buffer when I encode.
Questions:
How should I allocate memory, as the size is not known before calling this function?
Is there a way to map the class object's memory that basically gives me what I want?
I know this is a very basic question, but I want to know the optimal and conventional method to do it.
How should I allocate memory, as the size is not known before calling this function?
Given your current code, the caller should allocate the memory:
unsigned char buffer[3];
unsigned short len = sizeof buffer;
my_test_object.encodeMsg(buffer, len);
Is there a way to map the class object's memory that basically gives me what I want?
That's very vague. If you use a (possibly compiler-specific) #pragma or attribute to ensure the character values occupy 3 contiguous bytes in memory, and as long as you don't add any virtual functions to the class, you can implement encodeMsg() using:
memcpy(buffer, (unsigned char*)this + offsetof(test, a), 3);
But, what's the point? At best, I can't imagine that memcpy being any faster than the "nice" way to write it:
buffer[0] = a;
buffer[1] = b;
buffer[2] = c;
If you actually mean something more akin to:
test* p = reinterpret_cast<test*>(buffer);
*p = *this;
That will have undefined behaviour, and may write up to sizeof(test) bytes into the buffer, which is quite likely to be 4 rather than 3, and that could cause some client code buffer overruns, remove an already-set NUL terminator etc.. Hackish and dangerous.
Taking a step back, if you have to ask these sorts of questions you should be worrying about adopting good programming practice - only once you're a master of this kind of thing should you be worrying about what's optimal. For developing good habits, you might want to look at the boost serialisation library and get comfortable with it first.
If you can change the interface of your encodeMsg() function you could store the byte stream in a vector.
bool test::encodeMsg(std::vector<unsigned char>& buffer)
{
    // if speed is important you can fill the buffer some other way
    buffer.push_back(a);
    buffer.push_back(b);
    buffer.push_back(c);
    return true;
}
If encodeMsg() can't fail (does not need to return bool) you can create and return the vector in it like this:
std::vector<unsigned char> test::encodeMsg()
{
    std::vector<unsigned char> buffer;
    // if speed is important you can fill the buffer some other way
    buffer.push_back(a);
    buffer.push_back(b);
    buffer.push_back(c);
    return buffer;
}
The C++ way would be to use streams. Just implement the insertion operator << for encoding like this
std::ostream& operator<<(std::ostream& os, const test& t)
{
    os << t.a;
    os << t.b;
    os << t.c;
    return os;
}
Same with extraction operator >> for decoding
std::istream& operator>>(std::istream& is, test& t)
{
    is >> std::noskipws; // binary bytes must not be skipped as whitespace
    is >> t.a;
    is >> t.b;
    is >> t.c;
    return is;
}
This moves memory management to the stream and caller. If you need a special encoding for the types then derive your codec from istream and ostream and use those.
The memory and the size can be retrieved from the stream when using a stringstream like this
test t;
std::ostringstream strm;
strm << t;
std::string result = strm.str();
auto size = result.length(); // the size
auto array = result.data();  // the byte array
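One caveat on the byte-wise decode: formatted extraction with >> skips whitespace by default, so a payload byte such as 0x20 would silently be dropped unless skipping is disabled. A self-contained round-trip sketch (the three-member test struct is an assumed stand-in for the question's class):

```cpp
#include <istream>
#include <sstream>

// Minimal stand-in for the test class in the question (assumed layout).
struct test
{
    unsigned char a = 0x12, b = 0x34, c = 0x56;
};

std::ostream& operator<<(std::ostream& os, const test& t)
{
    return os << t.a << t.b << t.c;
}

std::istream& operator>>(std::istream& is, test& t)
{
    is >> std::noskipws;            // payload bytes must not be eaten as whitespace
    return is >> t.a >> t.b >> t.c;
}
```

With these operators, an ostringstream gives you the encoded bytes and an istringstream decodes them back, exactly as shown above.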
For classes that are trivially copyable (std::is_trivially_copyable<test>::value == true), encoding and decoding are actually straightforward (assuming you have already allocated the memory for buffer):
bool encodeMsg(unsigned char* buffer, unsigned short& len) {
    auto* ptr = reinterpret_cast<unsigned char*>(this);
    len = sizeof(test);
    memcpy(buffer, ptr, len);
    return true;
}
bool decodeMsg(const unsigned char* buffer) {
    auto* ptr = reinterpret_cast<unsigned char*>(this);
    memcpy(ptr, buffer, sizeof(test));
    return true;
}
or shorter
bool encodeMsg(unsigned char* buffer, unsigned short& len) {
    len = sizeof(test);
    memcpy(buffer, (unsigned char*)this, len);
    return true;
}
bool decodeMsg(const unsigned char* buffer) {
    memcpy((unsigned char*)this, buffer, sizeof(test));
    return true;
}
Depending on the members, sizeof(test) may include padding bytes, so you might copy more bytes than the sum of the fields.
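If you go the memcpy route, a static_assert makes the trivially-copyable precondition explicit, so the code stops compiling instead of silently miscopying if the class ever gains a vtable or non-trivial members. A sketch (the three-byte class layout is assumed from the question):

```cpp
#include <cstring>
#include <type_traits>

class test
{
public:
    bool encodeMsg(unsigned char* buffer, unsigned short& len)
    {
        len = sizeof(test);
        std::memcpy(buffer, this, len);
        return true;
    }
private:
    unsigned char a = 0x12, b = 0x34, c = 0x56;
};

// If this ever fails, the memcpy-based encoding above is no longer valid.
static_assert(std::is_trivially_copyable<test>::value,
              "test must stay trivially copyable for memcpy-based encoding");
```

The assertion costs nothing at runtime; it simply documents and enforces the assumption the raw copy relies on.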
As far as interpreting something directly as a byte array goes: casting a pointer from test* to unsigned char* and accessing the object through it is legal, but not the other way round. So what you could write is:
unsigned char* encodeMsg(unsigned short& len) {
    len = sizeof(test);
    return reinterpret_cast<unsigned char*>(this);
}
bool decodeMsg(const unsigned char* buffer) {
    auto* ptr = reinterpret_cast<unsigned char*>(this);
    memcpy(ptr, buffer, sizeof(test));
    return true;
}

Safe use of a function that writes data after a pointer

I have a function foo(void* buffer, size_t len) that calculates a hash from the data at buffer (of size len) and appends it at the end of buffer.
Usually I have a vector that I would pass to foo(&myVec[0], myVec.size())
How would I safely use this function with a vector? Resize it before passing it?
void foo(void* buffer, size_t len)
{
    if(buffer == NULL)
    {
        printf("Error\n");
        return;
    }
    std::vector<unsigned char> hash(128);
    gethash(buffer, len, &hash[0]);
    unsigned char *data = ((unsigned char*) buffer) + len;
    memcpy(data, &hash[0], hash.size());
}
Assuming the vector is a vector<char>, you could avoid all the messyness and do:
void foo(vector<char>& buffer)
{
    std::vector<unsigned char> hash(128);
    gethash(buffer.data(), buffer.size(), hash.data());
    size_t oldSize = buffer.size();
    buffer.resize(oldSize + hash.size());
    memcpy(buffer.data() + oldSize, hash.data(), hash.size());
}
This is still a bit "messy", but a lot less than the code you posted.
As suggested in the comments, something like:
buffer.insert(buffer.end(), hash.begin(), hash.end());
is probably better than the last three lines I wrote.
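Put together, the insert-based version is very short; here is a compilable sketch with a stand-in gethash (the real gethash and its 128-byte digest size are assumptions taken from the question, and the XOR body is purely illustrative):

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the question's gethash: XORs the input into a 128-byte digest.
void gethash(const void* buffer, std::size_t len, unsigned char* out)
{
    const unsigned char* p = static_cast<const unsigned char*>(buffer);
    for (std::size_t i = 0; i < 128; ++i) out[i] = 0;
    for (std::size_t i = 0; i < len; ++i) out[i % 128] ^= p[i];
}

// Hash the current contents, then append the digest to the same vector.
void foo(std::vector<unsigned char>& buffer)
{
    std::vector<unsigned char> hash(128);
    gethash(buffer.data(), buffer.size(), hash.data());
    buffer.insert(buffer.end(), hash.begin(), hash.end());
}
```

Note that the hash is computed before the vector grows, so the digest covers only the original payload.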
You can't do it!
If buffer points to memory of the given len, you can't reallocate more memory at exactly that place.
There are two ways to deal with it:
The buffer already comes with enough room for the data and the hash - though I would prefer a struct for this solution!
or
Allocate new memory, copy the data and hash value into it, and return a pointer to the new memory. But don't forget to free this memory later, and also the input memory.
The second solution can be done with a vector
void foo( vector<char> &vec )
{
    ...
    gethash(&vec[0], len, &hash[0]);
    ...
    vec.resize(...); // reallocate and copy data if needed
    memcpy // which I do not want to use with a vector :-)
}
A resize of a vector allocates new memory, copies the data from the old buffer to the new one, and frees the old allocation. The vector may already hold (much) more memory than requested, in which case no reallocation happens, but you shouldn't rely on that behaviour unless you have measured it and speed is a criterion. You can also create the vector with enough capacity up front to prevent the automatic allocate-and-copy.
You can do this before calling foo:
int main() {
    std::vector<int> buffer;
    size_t sz = buffer.size();
    size_t tsz = sizeof(decltype(buffer)::value_type);
    buffer.resize(128 / tsz + buffer.size());
    foo(buffer.data(), sz * tsz);
    return 0;
}
You can do it like this:
void foo(vector<unsigned char>& data)
{
    if(data.empty())
        return;
    vector<unsigned char> result(128);
    gethash(&data[0], data.size(), &result[0]);
    data.insert(data.end(), result.begin(), result.end());
}

Contents of an untyped object copied into vector<unsigned char>

I'm trying to write the contents of an untyped object that holds the bytes of an image into a vector of unsigned char. Sadly, I cannot get it to work. Maybe someone could point me in the right direction?
Here is what I have at the moment:
vector<unsigned char> SQLiteDB::BlobData(int clmNum){
    // get the data of the image
    const void* data = sqlite3_column_blob(pSQLiteConn->pRes, clmNum);
    vector<unsigned char> bytes;
    // the size of the image in bytes
    int size = getBytes(clmNum);
    unsigned char b[size];
    memcpy(b, data, size);
    for(int j = 0; j < size; j++){
        bytes.push_back(b[j]);
    }
    return bytes;
}
If I try to trace the contents of the bytes vector, it's all empty.
So the question is, how can I get the data into the vector?
You should use the vector's constructor that takes a couple of iterators:
const unsigned char* data = static_cast<const unsigned char*>(sqlite3_column_blob(pSQLiteConn->pRes, clmNum));
vector<unsigned char> bytes(data, data + getBytes(clmNum));
Directly write into the vector, no need for additional useless copies:
bytes.resize(size);
memcpy(bytes.data(), data, size);
Instead of a copy, this still has a zero-initialisation, so using the constructor as Maxim demonstrates, or vector::insert, is better:
const unsigned char* data = static_cast<const unsigned char*>(sqlite3_column_blob(pSQLiteConn->pRes, clmNum));
bytes.insert(bytes.end(), data, data + getBytes(clmNum));

Casting an unsigned int + a string to an unsigned char vector

I'm working with the NetLink socket library ( https://sourceforge.net/apps/wordpress/netlinksockets/ ), and I want to send some binary data over the network in a format that I specify.
The format I have planned is pretty simple and is as follows:
Bytes 0 and 1: an opcode of the type uint16_t (i.e., an unsigned integer always 2 bytes long)
Bytes 2 onward: any other data necessary, such as a string, an integer, a combination of each, etc.. the other party will interpret this data according to the opcode. For example, if the opcode is 0 which represents "log in", this data will consist of one byte integer telling you how long the username is, followed by a string containing the username, followed by a string containing the password. For opcode 1, "send a chat message", the entire data here could be just a string for the chat message.
Here's what the library gives me to work with for sending data, though:
void send(const string& data);
void send(const char* data);
void rawSend(const vector<unsigned char>* data);
I'm assuming I want to use rawSend() for this.. but rawSend() takes unsigned chars, not a void* pointer to memory? Isn't there going to be some loss of data here if I try to cast certain types of data to an array of unsigned chars? Please correct me if I'm wrong.. but if I'm right, does this mean I should be looking at another library that has support for real binary data transfer?
Assuming this library does serve my purposes, how exactly would I cast and concatenate my various data types into one std::vector? What I've tried is something like this:
#define OPCODE_LOGINREQUEST 0
std::vector<unsigned char>* loginRequestData = new std::vector<unsigned char>();
uint16_t opcode = OPCODE_LOGINREQUEST;
loginRequestData->push_back(opcode);
// and at this point (not shown), I would push_back() the individual characters of the strings of the username and password.. after one byte worth of integer telling you how many characters long the username is (so you know when the username stops and the password begins)
socket->rawSend(loginRequestData);
Ran into some exceptions, though, on the other end when I tried to interpret the data. Am I approaching the casting all wrong? Am I going to lose data by casting to unsigned chars?
Thanks in advance.
I like how they make you create a vector (which must use the heap and thus execute in unpredictable time) instead of just falling back to the C standard (const void* buffer, size_t len) tuple, which is compatible with everything and can't be beat for performance. Oh, well.
You could try this:
void send_message(uint16_t opcode, const void* rawData, size_t rawDataSize)
{
    vector<unsigned char> buffer;
    buffer.reserve(sizeof(uint16_t) + rawDataSize);
#if BIG_ENDIAN_OPCODE
    buffer.push_back(opcode >> 8);
    buffer.push_back(opcode & 0xFF);
#elif LITTLE_ENDIAN_OPCODE
    buffer.push_back(opcode & 0xFF);
    buffer.push_back(opcode >> 8);
#else
    // Native order opcode
    buffer.insert(buffer.end(), reinterpret_cast<const unsigned char*>(&opcode),
                  reinterpret_cast<const unsigned char*>(&opcode) + sizeof(uint16_t));
#endif
    const unsigned char* base(reinterpret_cast<const unsigned char*>(rawData));
    buffer.insert(buffer.end(), base, base + rawDataSize);
    socket->rawSend(&buffer); // Why isn't this API using a reference?!
}
This uses insert which should optimize better than a hand-written loop with push_back(). It also won't leak the buffer if rawSend tosses an exception.
NOTE: Byte order must match for the platforms on both ends of this connection. If it does not, you'll need to either pick one byte order and stick with it (Internet standards usually do this, and you use the htonl and htons functions) or you need to detect byte order ("native" or "backwards" from the receiver's POV) and fix it if "backwards".
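A quick round trip through the htons/ntohs helpers mentioned above shows the idea (a sketch; assumes POSIX <arpa/inet.h>, and put_u16/get_u16 are hypothetical helper names):

```cpp
#include <arpa/inet.h>  // htons, ntohs
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Append a 16-bit value in network (big-endian) order.
void put_u16(std::vector<unsigned char>& buf, uint16_t value)
{
    uint16_t be = htons(value);
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&be);
    buf.insert(buf.end(), p, p + sizeof be);
}

// Read it back, converting from network to host order.
uint16_t get_u16(const std::vector<unsigned char>& buf, std::size_t pos)
{
    uint16_t be;
    std::memcpy(&be, buf.data() + pos, sizeof be);
    return ntohs(be);
}
```

Because the wire format is fixed big-endian, the encoded bytes are the same no matter which host order either peer uses.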
I would use something like this:
#define OPCODE_LOGINREQUEST 0
#define OPCODE_MESSAGE 1
void addRaw(std::vector<unsigned char> &v, const void *data, const size_t len)
{
    const unsigned char *ptr = static_cast<const unsigned char*>(data);
    v.insert(v.end(), ptr, ptr + len);
}
void addUint8(std::vector<unsigned char> &v, uint8_t val)
{
    v.push_back(val);
}
void addUint16(std::vector<unsigned char> &v, uint16_t val)
{
    val = htons(val);
    addRaw(v, &val, sizeof(uint16_t));
}
void addStringLen(std::vector<unsigned char> &v, const std::string &val)
{
    uint8_t len = static_cast<uint8_t>(std::min<size_t>(val.length(), 255));
    addUint8(v, len);
    addRaw(v, val.c_str(), len);
}
void addStringRaw(std::vector<unsigned char> &v, const std::string &val)
{
    addRaw(v, val.c_str(), val.length());
}
void sendLogin(const std::string &user, const std::string &pass)
{
    std::vector<unsigned char> data;
    data.reserve(
        sizeof(uint16_t) +
        sizeof(uint8_t) + std::min<size_t>(user.length(), 255) +
        sizeof(uint8_t) + std::min<size_t>(pass.length(), 255)
    );
    addUint16(data, OPCODE_LOGINREQUEST);
    addStringLen(data, user);
    addStringLen(data, pass);
    socket->rawSend(&data);
}
void sendMsg(const std::string &msg)
{
    std::vector<unsigned char> data;
    data.reserve(sizeof(uint16_t) + msg.length());
    addUint16(data, OPCODE_MESSAGE);
    addStringRaw(data, msg);
    socket->rawSend(&data);
}
std::vector<unsigned char>* loginRequestData = new std::vector<unsigned char>();
uint16_t opcode = OPCODE_LOGINREQUEST;
loginRequestData->push_back(opcode);
If unsigned char is 8 bits long - which in most systems it is - you will be losing the higher 8 bits from opcode every time you push. You should be getting a warning for this.
The decision for rawSend to take a vector is quite odd; a general library would work at a different level of abstraction. I can only guess that it is this way because rawSend makes a copy of the passed data and guarantees its lifetime until the operation has completed. If not, then it is just a poor design choice; add to that the fact that it's taking the argument by pointer... You should see this data as a container of raw memory. There are some quirks to get right, but here is how you would be expected to work with POD types in this scenario:
data->insert( data->end(), reinterpret_cast< char const* >( &opcode ), reinterpret_cast< char const* >( &opcode ) + sizeof( opcode ) );
This will work:
#define OPCODE_LOGINREQUEST 0
std::vector<unsigned char>* loginRequestData = new std::vector<unsigned char>();
uint16_t opcode = OPCODE_LOGINREQUEST;
unsigned char *opcode_data = (unsigned char *)&opcode;
for(int i = 0; i < sizeof(opcode); i++)
loginRequestData->push_back(opcode_data[i]);
socket->rawSend(loginRequestData);
This will also work for any POD type.
Yeah, go with rawSend since send probably expects a NULL terminator.
You don't lose anything by casting to char instead of void*. Memory is memory. Types are never stored in memory in C++ except for RTTI info. You can recover your data by casting to the type indicated by your opcode.
If you can decide the format of all your sends at compile time, I recommend using structs to represent them. I've done this before professionally, and this is simply the best way to clearly store the formats for a wide variety of messages. And it's super easy to unpack on the other side; just cast the raw buffer into the struct based on the opcode!
struct MessageType1 {
    uint16_t opcode;
    int myData1;
    int myData2;
};
MessageType1 msg;
std::vector<char> vec;
const char* begin = (const char*)&msg;
vec.insert( vec.end(), begin, begin + sizeof(msg) );
send(vec);
The struct approach is the best, neatest way to send and receive, but the layout is fixed at compile time.
If the format of the messages is not decided until runtime, use a char array:
char buffer[2048];
*((uint16_t*)buffer) = opcode;
// now memcpy into it
// or placement-new to construct objects in the buffer memory
int usedBufferSpace = 24; // or whatever
std::vector<char> vec;
const char* end = buffer + usedBufferSpace;
vec.insert( vec.end(), buffer, end );
send(vec);

Dereferencing Variable Size Arrays in Structs

Structs seem like a useful way to parse a binary blob of data (ie a file or network packet). This is fine and dandy until you have variable size arrays in the blob. For instance:
struct nodeheader{
    int flags;
    int data_size;
    char data[];
};
This allows me to find the last data character:
nodeheader b;
cout << b.data[b.data_size-1];
Problem being, I want to have multiple variable length arrays:
struct nodeheader{
    int friend_size;
    int data_size;
    char data[];
    char friend[];
};
I'm not manually allocating these structures. I have a file like so:
char file_data[1024];
nodeheader* node = (nodeheader*)&(file_data[10]);
I'm trying to parse a binary file (more specifically, a class file). I've written an implementation in Java (which was my class assignment); now I'm doing a personal version in C++ and was hoping to get away without having to write 100 lines of code. Any ideas?
Thanks,
Stefan
You cannot have multiple variable-sized arrays. How would the compiler know at compile time where friend[] is located? Its location depends on the size of data[], and the size of data is unknown at compile time.
This is a very dangerous construct, and I'd advise against it. You can only include a variable-length array in a struct when it is the LAST element, and when you do so, you have to make sure you allocate enough memory, e.g.:
nodeheader *nh = (nodeheader *)malloc(sizeof(nodeheader) + max_data_size);
What you want to do is just use regular dynamically allocated arrays:
struct nodeheader
{
    char *data;
    size_t data_size;
    char *friend_buf; // 'friend' is a C++ keyword, so the member needs another name
    size_t friend_size;
};
nodeheader AllocNodeHeader(size_t data_size, size_t friend_size)
{
    nodeheader nh;
    nh.data = (char *)malloc(data_size); // check for NULL return
    nh.data_size = data_size;
    nh.friend_buf = (char *)malloc(friend_size); // check for NULL return
    nh.friend_size = friend_size;
    return nh;
}
void FreeNodeHeader(nodeheader *nh)
{
    free(nh->data);
    nh->data = NULL;
    free(nh->friend_buf);
    nh->friend_buf = NULL;
}
You can't - at least not in the simple way that you're attempting. The unsized array at the end of a structure is basically an offset to the end of the structure, with no built-in way to find the end.
All the fields are converted to numeric offsets at compile time, so they need to be calculable at that time.
The answers so far are seriously over-complicating a simple problem. Mecki is right about why it can't be done the way you are trying to do it, however you can do it very similarly:
struct nodeheader
{
    int friend_size;
    int data_size;
};
struct nodefile
{
    nodeheader *header;
    char *data;
    char *friend_buf; // 'friend' is a C++ keyword
};
char file_data[1024];
// .. file in file_data ..
nodefile file;
file.header = (nodeheader *)&file_data[0];
file.data = (char *)&file.header[1];
file.friend_buf = &file.data[file.header->data_size];
For what you are doing you need an encoder/decoder for the format. The decoder takes the raw data and fills out your structure (in your case allocating space for a copy of each section of the data), and the encoder writes the raw binary back out.
(Was 'Use std::vector')
Edit:
On reading feedback, I suppose I should expand my answer. You can effectively fit two variable length arrays in your structure as follows, and the storage will be freed for you automatically when file_data goes out of scope:
struct nodeheader {
    std::vector<unsigned char> data;
    std::vector<unsigned char> friend_buf; // 'friend' is a keyword!
    // etc...
};
nodeheader file_data;
Now file_data.data.size(), etc. gives you the length, and &file_data.data[0] gives you a raw pointer to the data if you need it.
You'll have to fill file data from the file piecemeal - read the length of each buffer, call resize() on the destination vector, then read in the data. (There are ways to do this slightly more efficiently. In the context of disk file I/O, I'm assuming it doesn't matter).
Incidentally, the OP's technique is incorrect even for his 'fine and dandy' cases, e.g. with only one VLA at the end.
char file_data[1024];
nodeheader* node = (nodeheader*)&(file_data[10]);
There's no guarantee that file_data is properly aligned for the nodeheader type. Prefer to obtain file_data by malloc() - which guarantees to return a pointer aligned for any type - or else (better) declare the buffer to be of the correct type in the first place:
struct biggestnodeheader {
    int flags;
    int data_size;
    char data[ENOUGH_SPACE_FOR_LARGEST_HEADER_I_EVER_NEED];
};
biggestnodeheader file_data;
// etc...
// etc...