Swap two variables with XOR - C++

With the following method we can swap two variables A and B:
A = A XOR B
B = A XOR B
A = A XOR B
I want to implement such a method in C++ that works with all types (int, float, char, ...) as well as structures. As we know, every type of data, including structures, occupies a specific amount of memory, for example 4 bytes or 8 bytes.
In my opinion this swapping method should work with all types except pointer-based ones. It should swap the memory contents, that is the bits, of the two variables.
My Question
I have no idea how to implement such a method in C++ that works with structures (ones that do not contain any pointers). Can anyone please help me?

Your problem is easily reduced to XOR-swapping buffers of raw memory, with something like this:
void xorswap(void *a, void *b, size_t size);
That can be implemented in terms of xorswaps of primitive types. For example:
void xorswap(void *a, void *b, size_t size)
{
    if (a == b)
        return; // nothing to do

    size_t qwords = size / 8;
    size_t rest = size % 8;
    uint64_t *a64 = (uint64_t *)a;
    uint64_t *b64 = (uint64_t *)b;
    for (size_t i = 0; i < qwords; ++i)
        xorswap64(a64++, b64++);

    uint8_t *a8 = (uint8_t *)a64;
    uint8_t *b8 = (uint8_t *)b64;
    for (size_t i = 0; i < rest; ++i)
        xorswap8(a8++, b8++);
}
I leave the implementation of xorswap64() and xorswap8() as an exercise to the reader.
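For completeness, a minimal sketch of those two helpers using the usual three-step XOR trick, assuming each pair of pointers refers to distinct objects (XOR-swapping a value with itself would zero it); uint64_t and uint8_t come from <cstdint>:
void xorswap64(uint64_t *a, uint64_t *b)
{
    *a ^= *b;  // a now holds a^b
    *b ^= *a;  // b now holds the original a
    *a ^= *b;  // a now holds the original b
}

void xorswap8(uint8_t *a, uint8_t *b)
{
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}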
Also note that to be efficient, the original buffers should be 8-byte aligned. If that's not the case, depending on the architecture, the code may work suboptimally or not work at all (again, an exercise to the reader ;-).
Other optimizations are possible. You can even use Duff's device to unroll the last loop, but I don't know if it is worth it. You'll have to profile it to know for sure.

You could use the bitwise XOR operator "^" in C to XOR two values. See here and here. Now, to XOR 'a' and 'b', start XORing from the least significant bit to the most significant bit.
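For example, the three steps from the question written with "^" for plain ints (a sketch only; it misbehaves if a and b refer to the same object):
void xorswap_int(int &a, int &b)
{
    a ^= b;  // A = A XOR B
    b ^= a;  // B = A XOR B  (now the original A)
    a ^= b;  // A = A XOR B  (now the original B)
}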

Related

C++ convert int array to int16_t array

How do I convert an int array to an int16_t array in C++ or C at low cost? Assume that all the values in the int array are within the range of int16_t.
I know two options:
i. Use for loop to assign each element in the int array to corresponding element in int16_t array.
int *a = new int[2];
a[0] = 1;
a[1] = 2;
int16_t *b = new int16_t[2];
for (int i = 0; i < 2; i++) {
    b[i] = a[i];
}
But it needs to do copy and has overhead.
ii. Use cast
int16_t* c = reinterpret_cast<int16_t*>(a);
// prints 1 0 2 0
for (int i = 0; i < 4; i++) {
    cout << (int)c[i] << endl;
}
But I do not want those 0.
Is there any other low-cost way to transfer int array[2] to int16_t array[2] and keep the values?
Since you cannot make any assumptions about the size of int (the standard only guarantees minimum sizes for the primitive data types), you can't do any fancy tricks using casts here. Your example:
int16_t* c = reinterpret_cast<int16_t*>(a);
does not work because on your system the type int happens to be 32 bits long, so for each int you get two int16_t values. In your case, it so happens that all values fit in the low half, so the second int16_t is always 0.
I would suggest just copying your integers. Anything else is premature optimization.
Because your target int16_t is represented by fewer bits than your source int (typically 32 bits), the values must be narrowed from 32/64 bits to 16 bits. Your first solution is correct, but compilers will warn about the assignment to a smaller integer type. A static_cast to int16_t on the right-hand side of the assignment will silence the warning.
Yes, by transform I mean copy.
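A small sketch of that, reusing the arrays a and b from the question:
for (int i = 0; i < 2; i++) {
    // static_cast makes the narrowing explicit and silences the conversion warning
    b[i] = static_cast<int16_t>(a[i]);
}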
Can you simply avoid the copy and just use an int16_t array?
As you are changing the size of the array elements (by how much depends on the platform), you'd either have to copy, or use a helper function/cast on each access. As the values still fit in the smaller type, you do not gain anything by having them as int16_t unless you need to combine them with other int16_t values.
I guess it really depends on what you need to do with the values once they are an int16_t.
int16_t getArrVal(const unsigned index) {
    return array[index];
}

How to read sequence of bytes from pointer in C++ as long?

I have a pointer to a char array, and I need to go along and XOR each byte with a 64 bit mask. I thought the easiest way to do this would be to read each 8 bytes as one long long or uint64_t and XOR with that, but I'm unsure how. Maybe casting to a long long* and dereferencing? I'm still quite unsure about pointers in general, so any example code would be much appreciated as well. Thanks!
EDIT: Example code (just to show what I want, I know it doesn't work):
void encrypt(char* in, uint64_t len, uint64_t key) {
    for (int i = 0; i < (len >> 3); i++) {
        (uint64_t*)in ^= key; // does not compile: the result of the cast is not an lvalue
        in += 8;
    }
}
The straightforward way to do your XOR-masking is by bytes:
void encrypt(uint8_t* in, size_t len, const uint8_t key[8])
{
    for (size_t i = 0; i < len; i++) {
        in[i] ^= key[i % 8];
    }
}
Note: here the key is an array of 8 bytes, not a 64-bit number. This code is straightforward - no tricks needed, easy to debug. Measure its performance, and be done with it if the performance is good enough.
Some (most?) compilers optimize such simple code by vectorizing it. That is, all the details (casting to uint64_t and such) are performed by the compiler. However, if you try to be "clever" in your code, you may inadvertently prevent the compiler from doing the optimization. So try to write simple code.
P.S. You should probably also use the restrict keyword, which is non-standard in C++ (it is standard in C99) but may be required for best performance. I have no experience with using it, so I didn't add it to my example.
If you have a bad compiler, cannot enable the vectorization option, or just want to play around, you can use this version with casting:
void encrypt(uint8_t* in, size_t len, uint64_t key)
{
    uint64_t* in64 = reinterpret_cast<uint64_t*>(in);
    for (size_t i = 0; i < len / 8; i++) {
        in64[i] ^= key;
    }
}
It has some limitations:
Requires the length to be divisible by 8 (a sketch of handling the remainder follows this list)
Requires the processor to support unaligned pointers (not sure about x86 - will probably work)
Compiler may refuse to vectorize this one, leading to worse performance
As noted by Hurkyl, the order of the 8 bytes in the mask is not clear (on x86, little-endian, the least significant byte will mask the first byte of the input array)
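To address the first limitation, the trailing len % 8 bytes can be masked byte by byte after the 64-bit loop. A hedged sketch (encrypt_mixed is a hypothetical name; it assumes the key bytes should be applied in little-endian order, which matches what the 64-bit XOR does on a little-endian machine):
void encrypt_mixed(uint8_t* in, size_t len, uint64_t key)
{
    size_t whole = len / 8;  // number of complete 64-bit words
    uint64_t* in64 = reinterpret_cast<uint64_t*>(in);
    for (size_t i = 0; i < whole; i++) {
        in64[i] ^= key;
    }
    // Mask the remaining 0..7 bytes one at a time.
    for (size_t i = whole * 8; i < len; i++) {
        in[i] ^= static_cast<uint8_t>(key >> (8 * (i % 8)));
    }
}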

What's the fastest way to switch endianness when reading from a file with C++?

I've been provided a binary file to read, which holds a sequence of raw values. For the sake of simplicity suppose they're unsigned integral values, either 4 or 8 bytes long. Unfortunately for me, the byte order of these values is incompatible with my processor's endianness (little vs big or vice versa; never mind about weird PDP-endianness etc.), and I want this data in memory with the proper endianness.
What's the fastest way to do this, considering the fact that I'm reading the data from a file? If it's not worth exploiting this fact, please explain why that is.
Considering that you're reading the data from a file, the way you switch endianness will have an insignificant effect on the runtime compared to the file I/O itself.
What could make a significant difference is how you read the data. Trying to read the bytes out of order would not be a good idea. Simply read the bytes in order, and switch endianness afterwards. This separates the reading and the byte swapping.
What I typically want from byte-swapping code, and certainly when reading a file, is that it works for any endianness and doesn't depend on architecture-specific instructions.
unsigned char* buf = read(); // let buf be a pointer to the read buffer (unsigned to avoid sign extension)
uint32_t v;

// little endian to native
v = 0;
for (unsigned i = 0; i < sizeof v; i++)
    v |= static_cast<uint32_t>(buf[i]) << CHAR_BIT * i;

// big endian to native
v = 0;
for (unsigned i = 0; i < sizeof v; i++)
    v |= static_cast<uint32_t>(buf[i]) << CHAR_BIT * (sizeof v - 1 - i);
This works whether the native is big, little, or one of the middle endian variety.
Of course, Boost has already implemented these for you, so there is no need to re-implement them. Also, there is the ntoh? family of functions provided both by POSIX and by the Windows C library, which can be used to convert big endian to/from native.
Not the fastest, but a portable way would be to read the file into an (unsigned) int array, alias the int array to a char one (allowed per strict aliasing rule) and swap bytes in memory.
Fully portable way:
void swapints(unsigned int *arr, size_t l) {
    for (size_t i = 0; i < l; i++) {
        unsigned int cur;
        // Copy the bytes of arr[i] into cur in reverse order.
        char *dest = reinterpret_cast<char *>(&cur) + sizeof(unsigned int);
        const char *src = reinterpret_cast<const char *>(&arr[i]);
        for (size_t j = 0; j < sizeof(unsigned int); j++) *(--dest) = *(src++);
        arr[i] = cur;
    }
}
But if you do not need portability, some systems offer swapping functions. For example, BSD systems have bswap16, bswap32 and bswap64 to swap the bytes of uint16_t, uint32_t and uint64_t respectively. Equivalent functions exist in the Microsoft and GNU/Linux worlds.
Alternatively, if you know that the file is in network order (big endian) and your processor is not, you can use the ntohs and ntohl functions for respectively uint16_t and uint32_t.
Remark (per AndrewHenle's comment): whatever the host endianness, ntohs and ntohl can always be used - they are simply no-ops on big-endian systems.
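As a hedged illustration of that last point (read_be32 is a hypothetical helper; it assumes a POSIX system, where ntohl lives in <arpa/inet.h>, and a file known to hold 32-bit big-endian values):
#include <arpa/inet.h>  // ntohl on POSIX; on Windows it is in winsock2.h
#include <cstdint>
#include <cstdio>

// Read one 32-bit big-endian value from an open file and convert it to native order.
bool read_be32(std::FILE *f, uint32_t &out)
{
    uint32_t raw;
    if (std::fread(&raw, sizeof raw, 1, f) != 1)
        return false;   // short read or EOF
    out = ntohl(raw);   // no-op on big-endian hosts, byte swap elsewhere
    return true;
}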

C++ Vector data access

I've got an array of bytes, declared like so:
typedef unsigned char byte;
vector<byte> myBytes = {255, 0, 76, ...}; // individual bytes no larger in value than 255
The problem I have is I need to access the raw data of the vector (without any copying of course), but I need to assign an arbitrary amount of bits to any given pointer to an element.
In other words, I need to assign, say an unsigned int to a certain position in the vector.
So given the example above, I am looking to do something like below:
myBytes[0] = static_cast<unsigned int>(76535); //assign n-bit (here 32-bit) value to any index in the vector
So that the vector data would now look like:
{247, 42, 1, 0} //raw little-endian representation of the 32-bit int 76535 (0x00012AF7)
Is this possible? I kind of need to use a vector and am just wondering whether the raw data can be accessed in this way, or does how the vector stores raw data make this impossible or worse - unsafe?
Thanks in advance!
EDIT
I didn't want to add complication, but I'm constructing variously sized integer as follows:
//**N_TYPES
u16& VMTypes::u8sto16(u8& first, u8& last) {
    return *new u16((first << 8) | (last & 0xffff));
}
u8* VMTypes::u16to8s(u16& orig) {
    u8 first = (u8)orig;
    u8 last = (u8)(orig >> 8);
    return new u8[2]{ first, last };
}
What's terrible about this is that I'm not sure of the endianness of the numbers generated. But I know that I am constructing and destructing them the same way everywhere (I'm writing a stack machine), so if I'm not mistaken, endianness is not affected by what I'm trying to do.
EDIT 2
I am constructing ints in the following horrible way:
u32 a = 76535;
u16* b = VMTypes::u32to16s(a);
u8 aa[4] = { VMTypes::u16to8s(b[0])[0], VMTypes::u16to8s(b[0])[1], VMTypes::u16to8s(b[1])[0], VMTypes::u16to8s(b[1])[1] };
Could this then work?:
memcpy(&_stack[0], aa, sizeof(u32));
Yes, it is possible. Take the starting address with &myVector[n] and memcpy your int to that location. Make sure that you stay within the bounds of your vector.
The other way around works too: take the location and memcpy out of it into your int.
As suggested: by using memcpy you will copy the byte representation of your integer into the vector. That byte representation or byte order may be different from your expectation. Keywords are big and little endian.
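A minimal sketch of both directions (memcpy comes from <cstring>; this assumes myBytes has at least sizeof(unsigned int) elements from the chosen index onwards):
unsigned int value = 76535;

// Copy the int's byte representation into the vector starting at index 0.
memcpy(&myBytes[0], &value, sizeof(value));

// And back out again into another int.
unsigned int readBack;
memcpy(&readBack, &myBytes[0], sizeof(readBack));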
As knivil says, memcpy will work if you know the endianness of your system. However, if you want to be safe, you can do this with bitwise arithmetic:
unsigned int myInt = 76535;
const int ratio = sizeof(int) / sizeof(byte);
for (int b = 0; b < ratio; b++)
{
    // store the most significant byte first (big-endian order, independent of the host)
    myBytes[b] = byte(myInt >> (8 * sizeof(byte) * (ratio - 1 - b)));
}
The int can be read out of the vector using a similar pattern, if you want me to show you how let me know.
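For reference, a hedged sketch of the read-back that matches the byte order used above (most significant byte first):
unsigned int readBack = 0;
for (int b = 0; b < ratio; b++)
{
    // Reassemble the bytes in the same order they were stored.
    readBack = (readBack << (8 * sizeof(byte))) | myBytes[b];
}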

C++ SSE Optimisation with multiple functions

I have some code that is structurally similar to the below. There are a bunch of small SSE helper functions, a larger one that does most of the work, and the public function that organises the data, runs the large function in a loop and deals with any left-over data.
This gave about a 2x speed boost over the scalar implementation; however, I would like to obtain more if possible. As well as some conceptual issues, there were some things in the disassembly (I only looked at x86 VC++ 2010 in detail, but I support x86 and GCC) that I did not like.
For at least some targets I can only use SSE and SSE2 here, but if it is worth a separate build I could possibly use newer instruction sets as well.
Problem 1:
All the small helpers got inlined into the large helper nicely, but the large one did not get inlined into the public function.
However, even though it is only referenced by one function in one source file and there are plenty of registers (looking at the algorithm, I'm pretty sure it needs at most 12 XMM registers apart from loading the data arrays), the compiler seems to want to follow the normal calling convention for fooHelper.
So after putting data into XMM registers in foo, it spills them back onto the stack and passes pointers; then after the loops and tidy-up, it loads that stack data back into XMM registers so I can unload it again...
I guess I could force it to inline fooHelper, but that would duplicate a very large number of instructions because it wouldn't use 4 XMM registers to do the job. I could also not use SSE in foo itself, which would remove the load/store issue, but fooHelper would still be doing completely unneeded loads and stores on those 4 state variables...
Ideally, since this is a private function, a way to ignore the normal calling conventions would be nice, and I am sure this will come up in lots of other larger pieces of SSE code where I don't really want everything fully inlined.
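As a hedged side note on the "force it to inline" option above: both major compilers have non-standard hints for exactly this (MSVC's __forceinline, GCC/Clang's always_inline attribute); FORCE_INLINE below is just a hypothetical wrapper macro, and whether the duplication is worth it would still need profiling.
#if defined(_MSC_VER)
#define FORCE_INLINE __forceinline
#else
#define FORCE_INLINE inline __attribute__((always_inline))
#endif

// Applied to the large helper from the example below (declaration only shown here).
FORCE_INLINE void fooHelper(
    const int *data1, const int *data2, const int *data3, const int *data4,
    __m128i &a, __m128i &b, __m128i &c, __m128i &d);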
Problem 2:
The implementation is basically working on 4 state vectors organised as AAAA, BBBB, CCCC, DDDD, such that the code can simply be written as if it is working with A, B, C and D as separate variables, while processing all 4 data streams at once.
However, the output itself is in the form ABCD, ABCD, ABCD, ABCD, and the input is also 4 separate buffers, requiring _mm_set_epi32 to load it.
Is there a better way to deal with these inputs and outputs (the format of which can not practically be changed)?
namespace
{
void fooHelperA(__m128i &a, __m128i b, __m128i x, int s)
{
    ...small function (<5 sse operations)...
}
...bunch of other small functions...
//
void fooHelper(
    const int *data1, const int *data2, const int *data3, const int *data4,
    __m128i &a, __m128i &b, __m128i &c, __m128i &d)
{
    // Get the current piece of data (named x so it does not shadow the state vector c)
    __m128i x = _mm_set_epi32(data1[0], data2[0], data3[0], data4[0]);
    ...do stuff with data...
    fooHelperA(a, b, x, 5);
    ...
    x = _mm_set_epi32(data1[1], data2[1], data3[1], data4[1]);
    ...
    fooHelperA(b, a, x, 7);
    ... lots more code ...
    x = _mm_set_epi32(data1[3], data2[3], data3[3], data4[3]);
    ...
}
}
void foo(
    const char *data1, const char *data2, const float *data3, const char *data4,
    int *out1, int *out2, int *out3, int *out4,
    size_t len)
{
    __m128i a = _mm_setzero_si128();
    __m128i b = _mm_setzero_si128();
    __m128i c = _mm_setzero_si128();
    __m128i d = _mm_setzero_si128();
    while (len >= 16) //expected to loop <25 times for datasets in question
    {
        fooHelper((const int*)data1, (const int*)data2, (const int*)data3, (const int*)data4, a, b, c, d);
        data1 += 16;
        data2 += 16;
        data3 += 16;
        data4 += 16;
        len -= 16;
    }
    if (len)
    {
        int buffer[4][4];
        ...pad data into buffer...
        fooHelper(buffer[0], buffer[1], buffer[2], buffer[3], a, b, c, d);
    }
    ALIGNED(16, int[4][4]) tmp;
    _mm_store_si128((__m128i*)tmp[0], a);
    _mm_store_si128((__m128i*)tmp[1], b);
    _mm_store_si128((__m128i*)tmp[2], c);
    _mm_store_si128((__m128i*)tmp[3], d);
    out1[0] = tmp[0][0];
    out2[0] = tmp[0][1];
    out3[0] = tmp[0][2];
    out4[0] = tmp[0][3];
    out1[1] = tmp[1][0];
    out2[1] = tmp[1][1];
    out3[1] = tmp[1][2];
    out4[1] = tmp[1][3];
    out1[2] = tmp[2][0];
    out2[2] = tmp[2][1];
    out3[2] = tmp[2][2];
    out4[2] = tmp[2][3];
    out1[3] = tmp[3][0];
    out2[3] = tmp[3][1];
    out3[3] = tmp[3][2];
    out4[3] = tmp[3][3];
}
Some advice:
1) Looking at your code and data description, it seems you could gain a lot by moving your data organisation from SoA (structure of arrays, your AAAA vectors) to AoS (array of structures), where your input data would already be organised as ABCD; you would then have one big input vector (4x bigger)!
2) Take care with your data alignment. Right now it doesn't matter much, since you already pay a penalty for the _mm_set_epi32 gathers, but if you switch to AoS you should be able to use a fast aligned load (memory to XMM).
3) The end of the function is a bit strange (I cannot test it right now); I don't really understand why you need the tmp 2D array.
4) The interleaving (and the inverse operation) can be done using standard SoA/AoS conversion patterns; Intel wrote a lot of papers on this topic when promoting its SIMD instruction sets. A sketch follows below.
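To illustrate point 4, here is a hedged SSE2 sketch (using <emmintrin.h>; transposeAndStore is a hypothetical helper) that transposes the four SoA state vectors into the ABCD rows the output wants, reproducing the same ordering as the 16 scalar stores at the end of foo and assuming each out pointer has room for 4 ints:
#include <emmintrin.h>

// Transpose four SoA state vectors (AAAA, BBBB, CCCC, DDDD) into AoS rows
// and store one row per output stream, using SSE2 unpack instructions.
static void transposeAndStore(__m128i a, __m128i b, __m128i c, __m128i d,
                              int *out1, int *out2, int *out3, int *out4)
{
    __m128i t0 = _mm_unpacklo_epi32(a, b); // A0 B0 A1 B1
    __m128i t1 = _mm_unpacklo_epi32(c, d); // C0 D0 C1 D1
    __m128i t2 = _mm_unpackhi_epi32(a, b); // A2 B2 A3 B3
    __m128i t3 = _mm_unpackhi_epi32(c, d); // C2 D2 C3 D3

    // Each row now holds the A, B, C, D values of one data stream.
    _mm_storeu_si128((__m128i*)out1, _mm_unpacklo_epi64(t0, t1)); // A0 B0 C0 D0
    _mm_storeu_si128((__m128i*)out2, _mm_unpackhi_epi64(t0, t1)); // A1 B1 C1 D1
    _mm_storeu_si128((__m128i*)out3, _mm_unpacklo_epi64(t2, t3)); // A2 B2 C2 D2
    _mm_storeu_si128((__m128i*)out4, _mm_unpackhi_epi64(t2, t3)); // A3 B3 C3 D3
}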
good luck,
alex