I am writing a program for creating, sending, receiving and interpreting ARP packets. I have a structure representing the ARP header like this:
struct ArpHeader
{
unsigned short hardwareType;
unsigned short protocolType;
unsigned char hardwareAddressLength;
unsigned char protocolAddressLength;
unsigned short operationCode;
unsigned char senderHardwareAddress[6];
unsigned char senderProtocolAddress[4];
unsigned char targetHardwareAddress[6];
unsigned char targetProtocolAddress[4];
};
This only works for hardware addresses with length 6 and protocol addresses with length 4. The address lengths are given in the header as well, so to be correct the structure would have to look something like this:
struct ArpHeader
{
unsigned short hardwareType;
unsigned short protocolType;
unsigned char hardwareAddressLength;
unsigned char protocolAddressLength;
unsigned short operationCode;
unsigned char senderHardwareAddress[hardwareAddressLength];
unsigned char senderProtocolAddress[protocolAddressLength];
unsigned char targetHardwareAddress[hardwareAddressLength];
unsigned char targetProtocolAddress[protocolAddressLength];
};
This obviously won't work since the address lengths are not known at compile time. Template structures aren't an option either since I would like to fill in values for the structure and then just cast it from (ArpHeader*) to (char*) in order to get a byte array which can be sent on the network or cast a received byte array from (char*) to (ArpHeader*) in order to interpret it.
One solution would be to create a class with all header fields as member variables, a function to create a byte array representing the ARP header which can be sent on the network and a constructor which would take only a byte array (received on the network) and interpret it by reading all header fields and writing them to the member variables. This is not a nice solution though since it would require a LOT more code.
In contrast, a similar structure for a UDP header, for example, is simple since all header fields have a known, constant size. I use
#pragma pack(push, 1)
#pragma pack(pop)
around the structure declaration so that I can actually do a simple C-style cast to get a byte array to be sent on the network.
Is there any solution I could use here which would be close to a structure or at least not require a lot more code than a structure?
I know the last field in a structure (if it is an array) does not need a specific compile-time size, can I use something similar like that for my problem? Just leaving the sizes of those 4 arrays empty will compile, but I have no idea how that would actually function. Just logically speaking it cannot work since the compiler would have no idea where the second array starts if the size of the first array is unknown.
You want a fairly low-level thing, an ARP packet, and you are trying to find a way to define a data structure so that you can cast the blob into that structure. Instead, you can use an interface over the blob.
struct ArpHeader {
mutable std::vector<uint8_t> buf_;
template <typename T>
struct ref {
uint8_t * const p_;
ref (uint8_t *p) : p_(p) {}
operator T () const { T t; memcpy(&t, p_, sizeof(t)); return t; }
T operator = (T t) const { memcpy(p_, &t, sizeof(t)); return t; }
};
template <typename T>
ref<T> get (size_t offset) const {
if (offset + sizeof(T) > buf_.size()) throw SOMETHING;
return ref<T>(&buf_[0] + offset);
}
ref<uint16_t> hwType() const { return get<uint16_t>(0); }
ref<uint16_t> protType () const { return get<uint16_t>(2); }
ref<uint8_t> hwAddrLen () const { return get<uint8_t>(4); }
ref<uint8_t> protAddrLen () const { return get<uint8_t>(5); }
ref<uint16_t> opCode () const { return get<uint16_t>(6); }
uint8_t *senderHwAddr () const { return &buf_[0] + 8; }
uint8_t *senderProtAddr () const { return senderHwAddr() + hwAddrLen(); }
uint8_t *targetHwAddr () const { return senderProtAddr() + protAddrLen(); }
uint8_t *targetProtAddr () const { return targetHwAddr() + hwAddrLen(); }
};
If you need const correctness, you remove mutable, create a const_ref, and duplicate the accessors into non-const versions, and make the const versions return const_ref and const uint8_t *.
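As a rough illustration of the const-correct variant, here is a minimal read-only sketch. The name ArpView and its constructor are my own invention, not part of the answer above, and unlike the ref<T> helper it decodes the 16-bit fields as big-endian (network order) instead of copying them in host order. It assumes the buffer outlives the view.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical read-only view over a raw ARP packet; nothing is copied.
class ArpView {
public:
    ArpView(const std::uint8_t* data, std::size_t size) : data_(data), size_(size) {
        // The fixed part is 8 bytes; the four variable-length addresses follow.
        if (size_ < 8 || size_ < totalLength())
            throw std::length_error("truncated ARP packet");
    }

    std::uint16_t hwType() const      { return be16(0); }
    std::uint16_t protType() const    { return be16(2); }
    std::uint8_t  hwAddrLen() const   { return data_[4]; }
    std::uint8_t  protAddrLen() const { return data_[5]; }
    std::uint16_t opCode() const      { return be16(6); }

    // Each address starts where the previous one ends.
    const std::uint8_t* senderHwAddr() const   { return data_ + 8; }
    const std::uint8_t* senderProtAddr() const { return senderHwAddr() + hwAddrLen(); }
    const std::uint8_t* targetHwAddr() const   { return senderProtAddr() + protAddrLen(); }
    const std::uint8_t* targetProtAddr() const { return targetHwAddr() + hwAddrLen(); }

    std::size_t totalLength() const {
        return 8 + 2u * hwAddrLen() + 2u * protAddrLen();
    }

private:
    // Assumption: multi-byte fields are in network (big-endian) byte order.
    std::uint16_t be16(std::size_t off) const {
        return static_cast<std::uint16_t>((data_[off] << 8) | data_[off + 1]);
    }

    const std::uint8_t* data_;
    std::size_t size_;
};
```

Because the offsets are computed from the length fields on every call, the same class handles any hardware or protocol address length.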
Short answer: you just cannot have variable-sized types in C++.
Every type in C++ must have a known (and stable) size during compilation. That is, sizeof must give a consistent answer. Note that you can have types that hold a variable amount of data (e.g. std::vector<int>) by using the heap, yet the size of the object itself is always constant.
So, you can never produce a type declaration that you would cast and get the fields magically adjusted. This goes deeply into the fundamental object layout - every member (aka field) must have a known (and stable) offset.
Usually, this issue is solved by writing (or generating) member functions that parse the input data and initialize the object's data. This is basically the age-old data serialization problem, which has been solved countless times in the last 30 or so years.
Here is a mockup of a basic solution:
class packet {
public:
// simple things
uint16_t hardware_type() const;
// variable-sized things
size_t sender_address_len() const;
bool copy_sender_address_out(char *dest, size_t dest_size) const;
// initialization
bool parse_in(const char *src, size_t len);
private:
uint16_t hardware_type_;
std::vector<char> sender_address_;
};
Notes:
the code above shows the very basic structure that would let you do the following:
packet p;
if (!p.parse_in(input, sz))
return false;
the modern way of doing the same thing via RAII would look like this:
if (!packet::validate(input, sz))
return false;
packet p = packet::parse_in(input, sz); // static function
// returns an instance or throws
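To make the mockup a little more concrete, here is one possible sketch of the parsing step for the ARP case. ArpPacket and parseArp are hypothetical names, std::optional (C++17) stands in for the "returns an instance or throws" variant, and the 16-bit fields are assumed to be big-endian on the wire.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical owning representation of an ARP packet.
struct ArpPacket {
    std::uint16_t hardwareType = 0;
    std::uint16_t protocolType = 0;
    std::uint16_t operationCode = 0;
    std::vector<std::uint8_t> senderHardwareAddress;
    std::vector<std::uint8_t> senderProtocolAddress;
    std::vector<std::uint8_t> targetHardwareAddress;
    std::vector<std::uint8_t> targetProtocolAddress;
};

// Returns std::nullopt if the input is shorter than the lengths it declares.
inline std::optional<ArpPacket> parseArp(const std::uint8_t* src, std::size_t len) {
    if (len < 8) return std::nullopt;
    const std::size_t hlen = src[4], plen = src[5];
    if (len < 8 + 2 * hlen + 2 * plen) return std::nullopt;

    // Decode a big-endian 16-bit field at the given offset.
    auto be16 = [src](std::size_t off) {
        return static_cast<std::uint16_t>((src[off] << 8) | src[off + 1]);
    };

    ArpPacket p;
    p.hardwareType = be16(0);
    p.protocolType = be16(2);
    p.operationCode = be16(6);

    // Copy the next n bytes and advance the cursor.
    const std::uint8_t* cur = src + 8;
    auto take = [&cur](std::size_t n) {
        std::vector<std::uint8_t> out(cur, cur + n);
        cur += n;
        return out;
    };
    p.senderHardwareAddress = take(hlen);
    p.senderProtocolAddress = take(plen);
    p.targetHardwareAddress = take(hlen);
    p.targetProtocolAddress = take(plen);
    return p;
}
```

Validation and construction happen in one call here, so there is never a half-initialized packet object.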
If you want to keep access to the data simple and the data itself public, there is a way to achieve what you want without changing the way you access data. First, you can use std::string instead of the char arrays to store the addresses:
#include <string>
using namespace std; // using this to shorten notation. Preferably put 'std::'
// everywhere you need it instead.
struct ArpHeader
{
unsigned char hardwareAddressLength;
unsigned char protocolAddressLength;
string senderHardwareAddress;
string senderProtocolAddress;
string targetHardwareAddress;
string targetProtocolAddress;
};
Then, you can overload the conversion operator operator const char*() and the constructor ArpHeader(const char*) (and preferably operator=(const char*) too), in order to keep your current sending/receiving functions working, if that's what you need.
A simplified conversion operator (skipped some fields, to make it less complicated, but you should have no problem in adding them back), would look like this:
operator const char*(){
char* myRepresentation;
unsigned char mySize
= 2+ senderHardwareAddress.length()
+ senderProtocolAddress.length()
+ targetHardwareAddress.length()
+ targetProtocolAddress.length();
// We need to store the size, since it varies
myRepresentation = new char[mySize+1];
myRepresentation[0] = mySize;
myRepresentation[1] = hardwareAddressLength;
myRepresentation[2] = protocolAddressLength;
unsigned int offset = 3; // just to shorten notation
memcpy(myRepresentation+offset, senderHardwareAddress.c_str(), senderHardwareAddress.size());
offset += senderHardwareAddress.size();
memcpy(myRepresentation+offset, senderProtocolAddress.c_str(), senderProtocolAddress.size());
offset += senderProtocolAddress.size();
memcpy(myRepresentation+offset, targetHardwareAddress.c_str(), targetHardwareAddress.size());
offset += targetHardwareAddress.size();
memcpy(myRepresentation+offset, targetProtocolAddress.c_str(), targetProtocolAddress.size());
return myRepresentation;
}
The assignment operator and the constructor can be defined like this:
ArpHeader& operator=(const char* buffer){
hardwareAddressLength = buffer[1];
protocolAddressLength = buffer[2];
unsigned int offset = 3; // just to shorten notation
senderHardwareAddress = string(buffer+offset, hardwareAddressLength);
offset += hardwareAddressLength;
senderProtocolAddress = string(buffer+offset, protocolAddressLength);
offset += protocolAddressLength;
targetHardwareAddress = string(buffer+offset, hardwareAddressLength);
offset += hardwareAddressLength;
targetProtocolAddress = string(buffer+offset, protocolAddressLength);
return *this;
}
ArpHeader() {} // default constructor, needed for 'ArpHeader h1, h2;'
ArpHeader(const char* buffer){
*this = buffer; // Re-using the operator=
}
Then using your class is as simple as:
ArpHeader h1, h2;
h1.hardwareAddressLength = 3;
h1.protocolAddressLength = 10;
h1.senderHardwareAddress = "foo";
h1.senderProtocolAddress = "something1";
h1.targetHardwareAddress = "bar";
h1.targetProtocolAddress = "something2";
cout << h1.senderHardwareAddress << ", " << h1.senderProtocolAddress
<< " => " << h1.targetHardwareAddress << ", " << h1.targetProtocolAddress << endl;
const char* gottaSendThisSomewhere = h1;
h2 = gottaSendThisSomewhere;
cout << h2.senderHardwareAddress << ", " << h2.senderProtocolAddress
<< " => " << h2.targetHardwareAddress << ", " << h2.targetProtocolAddress << endl;
delete[] gottaSendThisSomewhere;
Which should offer you the utility needed, and keep your code working without changing anything out of the class.
Note, however, that if you're willing to change the rest of the code a bit (the code you've already written outside of the class), jxh's answer should be as fast as this, and is more elegant on the inside.
Related
I want to read/write binary data, essentially a vector of plain old data struct.
I managed to do so with arrays of char, but now I need to convert those arrays back and forth to a vector of struct.
I already managed to do this by reading/writing directly into files:
int main() {
struct thing { float f; char c; };
std::vector<thing> write_this = { {2,'c'},{5,'f'},{543,'e'} };
std::ofstream outfile{ "test.bin", std::ios::binary };
outfile.write(reinterpret_cast<const char *>(write_this.data()),
write_this.size() * sizeof(decltype(write_this)::value_type));
outfile.close();
std::vector<thing> result(3);
std::ifstream infile{ "test.bin", std::ios::binary };
infile.read(reinterpret_cast<char *>(result.data()),
result.size() * sizeof(thing));
for (auto& i : result)
std::cout << i.f << " " << i.c << ' ';
std::cout << '\n';
system("PAUSE");
return 0;
}
Since I want to store several different segments of data into a file, I'm using vectors of unsigned char as an intermediary. I now need to cast an array of char or vector of char to a vector of struct, and vice versa.
What's the simplest/cleanest/fastest way to do so?
Foreword: Note that the written data is not portable to other systems, and for some PODs (ones that contain bit fields) not even to other processes on same system compiled by a different compiler - unless those systems are somehow guaranteed to have identical binary representation.
Also, memory references in the written data will be meaningless to other processes, even to separate executions of the same program.
quickly cast a vector of unsigned char into a vector of POD struct and vice versa
You could do this:
static_assert(std::is_trivial_v<thing>);
constexpr int s = sizeof(thing);
int n = 3;
std::vector<unsigned char> result_raw(n * s);
unsigned char* data = result_raw.data();
infile.read(reinterpret_cast<char*>(data), n * s);
// ->
for(int i = 0; i < n; i++) {
unsigned char temp[s];
unsigned char* addr = data + i * s;
std::memcpy(temp, addr, s);
new (addr) thing;
std::memcpy(addr, temp, s);
}
thing* result = std::launder(reinterpret_cast<thing*>(data));
// <-
Now, you have a pointer to the first thing in the vector. Beautiful part about this is that the part between the arrow comments that creates the objects and makes the program well-defined compiles to zero instructions (as long as you enable optimisation).
You don't get a std::vector<thing> though. To get that, you must copy all of the elements from one vector to another. Or you could read directly into the vector of things like in your example. But you didn't want to do the latter, and the former is slower than not copying.
In future, if p0593rX proposal is adopted, this block of code that does nothing could be greatly simplified.
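If an actual std::vector<thing> is needed and one copy is acceptable, the copying route mentioned above could look roughly like this. to_bytes and from_bytes are hypothetical helper names; the sketch assumes the element type is trivially copyable and default-constructible, and that the bytes were produced by the same compiler and platform.

```cpp
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <type_traits>
#include <vector>

struct thing { float f; char c; };  // same layout as in the question

// Copy raw bytes into a vector of T (one memcpy, no per-element work).
template <class T>
std::vector<T> from_bytes(const std::vector<unsigned char>& raw) {
    static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
    if (raw.size() % sizeof(T) != 0)
        throw std::invalid_argument("byte count is not a multiple of sizeof(T)");
    std::vector<T> out(raw.size() / sizeof(T));
    if (!raw.empty())
        std::memcpy(out.data(), raw.data(), raw.size());
    return out;
}

// Copy a vector of T into raw bytes (padding bytes are copied as-is).
template <class T>
std::vector<unsigned char> to_bytes(const std::vector<T>& items) {
    static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
    std::vector<unsigned char> raw(items.size() * sizeof(T));
    if (!items.empty())
        std::memcpy(raw.data(), items.data(), raw.size());
    return raw;
}
```

This avoids the placement-new and std::launder dance entirely, at the cost of the copy.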
I have to admit that I am a bit confused at the moment, so sorry if the question isn't quite clear or is trivial (actually I hope it is the latter).
I am sending an array of bytes across the network and would like to do something like this on the sender side:
const size_t max_size = 100;
uint8_t buffer[max_size];
idontknowwhat_t x{buffer};
uint16_t size = 11; // total number of bytes in the buffer
uint16_t id_a,id_b,id_c; // some ids
uint8_t a,b,c; // some data
x << size << id_a << a << id_b << b << id_c << c;
someMethodToSend(buffer,size);
and on the receiver side something like this:
const size_t max_size = 100;
uint8_t buffer[max_size];
someMethodToReceive(buffer);
idontknowwhat_t x{buffer};
uint16_t size;
x >> size;
for (uint16_t i=0; i<size-2; i++) {
uint16_t id;
uint8_t data;
x >> id >> data;
std::cout << id << " " << data;
}
So my aim is basically to avoid ugly casts and manually incrementing a pointer while being able to have uint8_t and uint16_t (and possibly also uint32_t) in the buffer. The data I put in the buffer here is just an example, and I am aware that I need to take care of the byte order when sending over the network (and it would be fine if I had to do this "manually").
Is there something that I can use in place of my hypothetical idontknowwhat_t ?
You cannot really avoid doing ugly casts, but at least you can hide them into the idontknowwhat_t class's operator>> and operator<< functions. And using templates, you could limit the number of casts in your code to the bare minimum.
class idontknowwhat_t
{
uint8_t* _data;
public:
idontknowwhat_t(uint8_t* buffer)
: _data(buffer)
{}
template<typename insert_type>
idontknowwhat_t& operator<<(insert_type value)
{
*reinterpret_cast<insert_type*>(_data) = value;
_data += sizeof(insert_type);
return *this;
}
template<typename extract_type>
idontknowwhat_t& operator>>(extract_type& value)
{
value = *reinterpret_cast<extract_type*>(_data);
_data += sizeof(extract_type);
return *this;
}
};
I think this will actually work directly with your code. In this example, the idontknowwhat_t class does not own the buffer and simply keeps a raw pointer to the next bit of data it expects to read or write. For real-life purposes I would recommend letting the idontknowwhat_t class manage the buffer memory.
In addition, none of the code on this page actually takes care of the data's endianness, which would definitely be the idontknowwhat_t class's responsibility. There is a boost library for that. I'm not documenting that library's use here, since I think it distracts from the question's real point.
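For what it's worth, the casts can be avoided altogether by writing the bytes one at a time, which also settles the endianness question. The following is only a sketch: ByteWriter and ByteReader are made-up names, only unsigned integer types are supported, and the byte order is assumed to be big-endian (network order).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <type_traits>

// Hypothetical non-owning writer: appends unsigned integers in big-endian order.
class ByteWriter {
public:
    ByteWriter(std::uint8_t* buf, std::size_t cap) : buf_(buf), cap_(cap), pos_(0) {}

    template <class T>
    ByteWriter& operator<<(T value) {
        static_assert(std::is_unsigned<T>::value, "unsigned integers only");
        if (sizeof(T) > cap_ - pos_) throw std::out_of_range("buffer full");
        for (std::size_t i = 0; i < sizeof(T); ++i)  // most significant byte first
            buf_[pos_ + i] = static_cast<std::uint8_t>(value >> (8 * (sizeof(T) - 1 - i)));
        pos_ += sizeof(T);
        return *this;
    }

    std::size_t size() const { return pos_; }  // bytes written so far

private:
    std::uint8_t* buf_;
    std::size_t cap_, pos_;
};

// Hypothetical non-owning reader: the mirror image of ByteWriter.
class ByteReader {
public:
    ByteReader(const std::uint8_t* buf, std::size_t size) : buf_(buf), size_(size), pos_(0) {}

    template <class T>
    ByteReader& operator>>(T& value) {
        static_assert(std::is_unsigned<T>::value, "unsigned integers only");
        if (sizeof(T) > size_ - pos_) throw std::out_of_range("read past end");
        T v = 0;
        for (std::size_t i = 0; i < sizeof(T); ++i)
            v = static_cast<T>((v << 8) | buf_[pos_ + i]);
        value = v;
        pos_ += sizeof(T);
        return *this;
    }

private:
    const std::uint8_t* buf_;
    std::size_t size_, pos_;
};
```

Since no pointer is ever reinterpreted, there are no alignment or aliasing concerns, and the bounds checks turn overruns into exceptions.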
Have you tried std::list? You could group the elements into types and put them into lists with the appropriate type. Then you could create an std::list of std::lists.
I have to decode a byte array (raw data). It can consist of basic data types (int, unsigned int, char, short, etc.). I need to interpret the bytes according to a defined structure. Below is an example:
struct testData
{
int a;
char c;
};
unsigned char** buf = {0x01,0x00,0x00,0x00,0x41}
example byte array(in little endian) : 0100000041
should give decoding like : a = 1, c = 'A'
The sample data can be very big and the sample structure( e.g. testData) can contain 200 - 3000 fields.
If I use casts to read the data from **buf and advance the pointer like below:
int a = *(reinterpret_cast<int*>(*buf));
*buf += 4;
char c = **buf;
*buf += 1;
My CPU usage is quite high if the number of fields to be decoded is high. Example:
struct testData
{
int element1;
char element2;
int element3;
... ...
... ...
short element200;
char element201;
char element202;
};
Is there a way to reduce the CPU load as well as keep decoding very fast?
I have two constraints:
"Structure can contain padding byte."
I do not have control on how structure will be defined. Structure can contain nested elements as well.
int a = *(reinterpret_cast<int*>(*buf));
Don't use reinterpret_cast. You are lying to the compiler and forcing unaligned accesses. Worse, you are hiding from the compiler the very information it needs to optimize your code -- that the pointer is actually to characters. Instead, code what you mean as straightforward as possible, which is:
int a=static_cast<int>((*buf)[0]) |
(static_cast<int>((*buf)[1])<<8) |
(static_cast<int>((*buf)[2])<<16) |
(static_cast<int>((*buf)[3])<<24);
This is simple, clear, and what you actually want. The compiler will have no problem optimizing it. (And, it will work regardless of your platform's endianness.)
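A slightly more defensive way to write the same idea is to do the shifts on unsigned values, so that a high byte of 0x80 or above never shifts into the sign bit. This is only a sketch: load_le32 and decode are hypothetical helpers, and the buffer is assumed to hold a little-endian 32-bit integer followed by one char, as in the question's example.

```cpp
#include <cstdint>

struct TestData {
    std::int32_t a;
    char c;
};

// Assemble a little-endian 32-bit value from four bytes, using unsigned arithmetic.
inline std::uint32_t load_le32(const unsigned char* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Decode one TestData from the wire format and return the position after it.
inline const unsigned char* decode(const unsigned char* p, TestData& out) {
    out.a = static_cast<std::int32_t>(load_le32(p));
    out.c = static_cast<char>(p[4]);
    return p + 5;
}
```

Decoding field by field like this does not depend on the struct's padding, which matters given the constraint that the structure may contain padding bytes.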
You should be able to simply map the struct to the buffer, as long as the struct is properly packed:
#pragma pack(push, 1)
struct testData
{
int element1;
char element2;
int element3;
... ...
... ...
short element200;
char element201;
char element202;
};
#pragma pack(pop)
You should also declare the structure in an alignment-sensible way: don't mix an int followed by a char followed by an int... Then, if you read the data into an aligned buffer, a simple cast of the buffer to testData* would give you access to all members. This way you would avoid all those gratuitous copies. If you read the structure in a forward fashion (p->element1, then p->element2, then p->element3 and so on), hardware prefetch should kick in and give a big boost.
Further enhancements would require actual measurements of hot spots. Also, check this book out from the library and read it: The Software Optimization Cookbook.
Further to David Schwartz's response, you can tidy this up by writing some helper template functions. I'd suggest something like this (untested).
template<typename T>
const unsigned char * read_from_buffer( T* value, const unsigned char * buffer);
template<>
const unsigned char * read_from_buffer<int>( int* value, const unsigned char * buffer)
{
*value = static_cast<int>(buffer[0]) |
(static_cast<int>(buffer[1])<<8) |
(static_cast<int>(buffer[2])<<16) |
(static_cast<int>(buffer[3])<<24);
return buffer+4;
}
template<>
const unsigned char * read_from_buffer<char>( char * value, const unsigned char * buffer )
{
*value = *buffer;
return buffer+1;
}
struct TestData
{
int a;
char c;
};
int main()
{
unsigned char buf[] = {0x01,0x00,0x00,0x00,0x41};
const unsigned char * ptr = buf;
TestData data;
ptr = read_from_buffer( &data.a, ptr );
ptr = read_from_buffer( &data.c, ptr );
}
You could encapsulate this even further and add error checking etc and you'd have a nice binary stream like interface.
I have a lump of binary data in the form of const std::vector<unsigned char>, and want to be able to extract individual fields from that, such as 4 bytes for an integer, 1 for a boolean, etc. This needs to be, as far as possible, both efficient and simple. eg. It should be able to read the data in place without needing to copy it (eg. into a string or array). And it should be able to read one field at a time, like a parser, since the lump of data does not have a fixed format. I already know how to determine what type of field to read in each case - the problem is getting a usable interface on top of an std::vector for doing this.
However, I can't find a simple way to get this data into an easily usable form that gives me useful read functionality. E.g. std::basic_istringstream<unsigned char> gives me a reading interface, but it seems like I need to copy the data into a temporary std::basic_string<unsigned char> first, which is not ideal for bigger blocks of data.
Maybe there is some way I can use a streambuf in this situation to read the data in place, but it would appear that I'd need to derive my own streambuf class to do that.
It occurs to me that I can probably just use sscanf on the vector's data(), and that would seem to be both more succinct and more efficient than the C++ standard library alternatives. EDIT: Having been reminded that sscanf doesn't do what I wrongly thought it did, I actually don't know a clean way to do this in C or C++. But am I missing something, and if so, what?
You have access to the data in a vector through its operator[]. A vector's data is guaranteed to be stored in a single contiguous array, and [] returns a reference to a member of that array. You may use that reference directly, or through a memcpy.
std::vector<unsigned char> v;
...
byteField = v[12];
memcpy(&intField, &v[13], sizeof intField);
memcpy(charArray, &v[20], lengthOfCharArray);
EDIT 1:
If you want something "more convenient" than that, you could try:
template <class T>
void ReadFromVector(T& t, std::size_t offset,
const std::vector<unsigned char>& v) {
memcpy(&t, &v[offset], sizeof(T));
}
Usage would be:
std::vector<unsigned char> v;
...
char c;
int i;
uint64_t ull;
ReadFromVector(c, 17, v);
ReadFromVector(i, 99, v);
ReadFromVector(ull, 43, v);
EDIT 2:
struct Reader {
const std::vector<unsigned char>& v;
std::size_t offset;
Reader(const std::vector<unsigned char>& v) : v(v), offset() {}
template <class T>
Reader& operator>>(T&t) {
memcpy(&t, &v[offset], sizeof t);
offset += sizeof t;
return *this;
}
void operator+=(int i) { offset += i; }
const char *getStringPointer() const { return reinterpret_cast<const char*>(&v[offset]); }
};
Usage:
std::vector<unsigned char> v;
Reader r(v);
int i; uint64_t ull;
r >> i >> ull;
const char *companyName = r.getStringPointer();
r += strlen(companyName);
If your vector stores binary data, you can't use sscanf or similar, they work on text.
Converting a byte to a bool is simple enough:
bool b = my_vec[10];
For extracting an unsigned int that's stored in big endian order (assuming your ints are 32 bits):
unsigned int i = static_cast<unsigned int>(my_vec[10]) << 24 | my_vec[11] << 16 | my_vec[12] << 8 | my_vec[13];
A 16 bit unsigned short would be similar:
unsigned short s = my_vec[10] << 8 | my_vec[11];
If you can afford the Qt dependency, QByteArray has the fromRawData() named constructor, which wraps existing data buffers in a QByteArray without copying the data. With that byte array, you can then feed a QTextStream.
I'm not aware of any such function in the standard streams library (short of implementing your own streambuf, of course), but I'd love to be proved wrong :)
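For reference, the "implement your own streambuf" route can be quite short when the stream only needs to read. The sketch below is one possible shape, not a standard facility: VectorBuf is a made-up name, it assumes the vector outlives the stream, and it relies on the get area never being written to (the const_cast is only there because setg takes non-const pointers).

```cpp
#include <istream>
#include <streambuf>
#include <vector>

// Hypothetical read-only stream buffer that exposes existing bytes in place.
class VectorBuf : public std::streambuf {
public:
    explicit VectorBuf(const std::vector<unsigned char>& v) {
        // Point the get area directly at the vector's storage; no copy is made.
        char* begin = const_cast<char*>(reinterpret_cast<const char*>(v.data()));
        setg(begin, begin, begin + v.size());
    }
};

// Convenience helper (also hypothetical): read sizeof(T) raw bytes into a value.
template <class T>
bool read_raw(std::istream& in, T& value) {
    in.read(reinterpret_cast<char*>(&value), sizeof value);
    return static_cast<bool>(in);
}
```

Usage would be along the lines of constructing a VectorBuf over the vector, wrapping it in a std::istream, and then calling read() or read_raw() field by field; hitting the end of the data sets the stream's fail state as usual.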
You can use a struct that describes the data you are trying to extract. You can move data from your vector into the struct like this:
struct MyData {
int intVal;
bool boolVal;
char stringVal[15];
} __attribute__((__packed__));
// assuming all extracted types are prefixed with a one byte indicator.
// Also assumes "vec" is your populated vector
int pos = 0;
while (pos < vec.size()-1) {
switch(vec[pos++]) {
case 0: { // handle int
int intValue;
memcpy(&intValue, &vec[pos], sizeof(int)); // destination first, then source
pos += sizeof(int);
// do something with handled value
break;
}
case 1: { // handle double
double doubleValue;
memcpy(&doubleValue, &vec[pos], sizeof(double));
pos += sizeof(double);
// do something with handled value
break;
}
case 2: { // handle MyData
struct MyData data;
memcpy(&data, &vec[pos], sizeof(struct MyData));
pos += sizeof(struct MyData);
// do something with handled value
break;
}
default: {
// ERROR: unknown type indicator
break;
}
}
}
Use a for loop to iterate over the vector and use bitwise operators to access each bit group. For example, to access the upper four bits of the first unsigned char in your vector:
int myInt = vec[0] & 0xF0;
To read the fourth bit from the right (bit 3), right after the chunk we just read:
bool myBool = vec[0] & 0x08;
The three least significant (lowest) bits can be accessed like so:
int myInt2 = vec[0] & 0x07;
You can then repeat this process (using a for loop) for every element in your vector.
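Note that a bare mask such as vec[0] & 0xF0 leaves the bits in their original position (the result is a multiple of 16). If the field's numeric value is wanted, a shift is needed as well. A small sketch, with extract_bits being a hypothetical helper name:

```cpp
#include <cstdint>

// Return `width` bits of `byte`, starting at bit `shift` (bit 0 is the lowest),
// moved down so the result is the field's numeric value.
// Assumes shift + width <= 8.
constexpr unsigned extract_bits(std::uint8_t byte, unsigned shift, unsigned width) {
    return (byte >> shift) & ((1u << width) - 1u);
}
```

With this, the three fields from the example would be extract_bits(vec[0], 4, 4), extract_bits(vec[0], 3, 1) and extract_bits(vec[0], 0, 3).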
I have a single buffer, and several pointers into it. I want to sort the pointers based upon the bytes in the buffer they point at.
qsort() and std::sort() can be given custom comparison functions. For example, if the buffer were zero-terminated I could use strcmp:
int my_strcmp(const void* a,const void* b) {
const char* const one = *(const char* const*)a;
const char* const two = *(const char* const*)b;
return ::strcmp(one,two);
}
However, if the buffer is not zero-terminated, I have to use memcmp(), which requires a length parameter.
Is there a tidy, efficient way to get the length of the buffer into my comparison function without a global variable?
With std::sort, you can use a Functor like this:
struct CompString {
CompString(int len) : m_Len(len) {}
bool operator()(const char *a, const char *b) const {
return std::memcmp(a, b, m_Len) < 0;
}
private:
int m_Len;
};
Then you can do this:
std::sort(begin(), end(), CompString(4)); // all strings are 4 chars long
EDIT: from the comment suggestions (I guess both strings are in a common buffer?):
struct CompString {
CompString (const unsigned char* e) : end(e) {}
bool operator()(const unsigned char *a, const unsigned char *b) const {
return std::memcmp(a, b, std::min(end - a, end - b)) < 0;
}
private:
const unsigned char* const end;
};
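With C++11 or later, the same thing can be written without a named functor by capturing the length in a lambda. A minimal sketch, assuming every pointer has at least len readable bytes after it (sort_by_bytes is a hypothetical wrapper name):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Sort pointers by the first `len` bytes they point at.
inline void sort_by_bytes(std::vector<const unsigned char*>& ptrs, std::size_t len) {
    std::sort(ptrs.begin(), ptrs.end(),
              [len](const unsigned char* a, const unsigned char* b) {
                  // memcmp returns <0, 0 or >0; std::sort wants a strict "less than".
                  return std::memcmp(a, b, len) < 0;
              });
}
```

The captured len plays the role of m_Len in the functor above, so no global variable is involved.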
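With the C function qsort(), no, there is no way to pass the length to your comparison function without using a global variable, which means it can't be done in a thread-safe manner. Some systems have a qsort_r() function (r stands for reentrant) which allows you to pass an extra context parameter, which then gets passed on to your comparison function. The example below uses the BSD argument order; glibc's qsort_r() takes the comparison function before the context argument and passes the context to the comparison function as its last parameter: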
int my_comparison_func(void *context, const void *a, const void *b)
{
return memcmp(*(const void **)a, *(const void **)b, (size_t)context);
}
qsort_r(data, n, sizeof(void*), (void*)number_of_bytes_to_compare, &my_comparison_func);
Is there a reason you can't null-terminate your buffers?
If not, since you're using C++ you can write your own function object:
struct MyStrCmp {
MyStrCmp (int n): length(n) { }
inline bool operator() (const char *lhs, const char *rhs) const {
return ::strncmp (lhs, rhs, length) < 0;
}
int length;
};
// ...
std::sort (myList.begin (), myList.end (), MyStrCmp (STR_LENGTH));
Can you pack your buffer pointer + length into a structure and pass a pointer of that structure as void *?
You could use a hack like:
int buffcmp(const void *b1, const void *b2)
{
static int bsize=-1;
if(b2==NULL) {bsize=*(int*)(b1); return 0;}
return memcmp(b1, b2, bsize);
}
which you would first call as buffcmp(&bsize, NULL) and then pass it as the comparison function to qsort.
You could of course make the comparison behave more naturally in the case of buffcmp(NULL, NULL) etc by adding more if statements.
You could use functors (give the length to the functor's constructor) or Boost.Lambda (use the length in-place).
I'm not clear on what you're asking. But I'll try, assuming that
You have a single buffer
You have an array of pointers of some kind which has been processed in some way so that some or all of its contents point into the buffer
That is code equivalent to:
char *buf = (char*)malloc(sizeof(char)*bufsize);
for (int i=0; i<bufsize; ++i){
buf[i] = some_cleverly_chosen_value(i);
}
char *ary[arraysize] = {0};
for(int i=0; i<arraysize; ++i){
ary[i] = buf + some_clever_function(i);
}
/* ...do the sort here */
Now if you control the allocation of the buffer, you could substitute
char *buf = (char*)malloc(sizeof(char)*(bufsize+1));
buf[bufsize]='\0';
and go ahead using strcmp. This may be possible even if you don't control the filling of the buffer.
If you have to live with a buffer handed you by someone else you can
Use some global storage (which you asked to avoid and good thinking).
Hand the sort function something more complicated than a raw pointer (the address of a struct or class that supports the extra data). For this you need to control the definition of ary in the above code.
Use a sort function which supports an extra input. Either qsort_r as suggested by Adam, or a home-rolled solution (which I do recommend as an exercise for the student, and don't recommend in real life). In either case the extra data is probably a pointer to the end of the buffer.
memcmp should stop on the first byte that is unequal, so the length should be large, i.e. to-the-end-of-the-buffer. Then the only way it can return zero is if it does go to the end of the buffer.
(BTW, I lean toward merge sort myself. It's stable and well-behaved.)
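If the extra data is a pointer to the end of the buffer, one way to use it from C++ is std::lexicographical_compare, which never reads past the end and orders a shorter run before a longer one that it is a prefix of. This is a sketch under the assumption that all pointers lie within the same buffer; sort_suffixes is a hypothetical name.

```cpp
#include <algorithm>
#include <vector>

// Sort pointers into one buffer by the bytes from each pointer up to `end`.
inline void sort_suffixes(std::vector<const unsigned char*>& ptrs,
                          const unsigned char* end) {
    std::sort(ptrs.begin(), ptrs.end(),
              [end](const unsigned char* a, const unsigned char* b) {
                  return std::lexicographical_compare(a, end, b, end);
              });
}
```

Unlike a fixed-length memcmp, this comparison does not depend on where a mismatch happens to occur, so it stays within bounds even for pointers near the end of the buffer.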