I have a pointer to a char array, and I need to go along and XOR each byte with a 64 bit mask. I thought the easiest way to do this would be to read each 8 bytes as one long long or uint64_t and XOR with that, but I'm unsure how. Maybe casting to a long long* and dereferencing? I'm still quite unsure about pointers in general, so any example code would be much appreciated as well. Thanks!
EDIT: Example code (just to show what I want, I know it doesn't work):
void encrypt(char* in, uint64_t len, uint64_t key) {
    for (int i = 0; i < (len>>3); i++) {
        (uint64_t*)in ^= key;
        in += 8;
    }
}
The straightforward way to do your XOR-masking is by bytes:
void encrypt(uint8_t* in, size_t len, const uint8_t key[8])
{
    for (size_t i = 0; i < len; i++) {
        in[i] ^= key[i % 8];
    }
}
Note: here the key is an array of 8 bytes, not a 64-bit number. This code is straightforward - no tricks needed, easy to debug. Measure its performance, and be done with it if the performance is good enough.
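If your key actually arrives as a uint64_t, one way to produce the 8-byte array is a plain memcpy (a sketch; make_key_bytes is just an illustrative name, and the resulting byte order follows the host's endianness, so key[0] is the least significant byte on a little-endian machine):

#include <cstdint>
#include <cstring>

void make_key_bytes(uint64_t key64, uint8_t key[8])
{
    std::memcpy(key, &key64, sizeof key64);  // raw copy; byte order = host byte order
}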
Some (most?) compilers optimize such simple code by vectorizing it. That is, all the details (casting to uint64_t and such) are performed by the compiler. However, if you try to be "clever" in your code, you may inadvertently prevent the compiler from doing the optimization. So try to write simple code.
P.S. You should probably also use the restrict keyword, which is currently non-standard, but may be required for best performance. I have no experience with using it, so didn't add it to my example.
If you have a bad compiler, cannot enable the vectorization option, or just want to play around, you can use this version with casting:
void encrypt(uint8_t* in, size_t len, uint64_t key)
{
    uint64_t* in64 = reinterpret_cast<uint64_t*>(in);
    for (size_t i = 0; i < len / 8; i++) {
        in64[i] ^= key;
    }
}
It has some limitations:
Requires the length to be divisible by 8
Requires the processor to support unaligned loads and stores (x86 does; some stricter architectures will fault)
Compiler may refuse to vectorize this one, leading to worse performance
As noted by Hurkyl, the order of the 8 bytes in the mask is not clear (on x86, little-endian, the least significant byte will mask the first byte of the input array)
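If you want the explicit 8-bytes-at-a-time approach without those caveats, a memcpy-based sketch (illustrative only, not from the original answer) avoids the alignment and aliasing problems and also handles a length that is not a multiple of 8:

#include <cstddef>
#include <cstdint>
#include <cstring>

void encrypt(uint8_t* in, size_t len, uint64_t key)
{
    size_t i = 0;
    // whole 64-bit blocks: memcpy sidesteps alignment and strict-aliasing problems,
    // and compilers typically lower it to a single load/store anyway
    for (; i + 8 <= len; i += 8) {
        uint64_t block;
        std::memcpy(&block, in + i, 8);
        block ^= key;
        std::memcpy(in + i, &block, 8);
    }
    // leftover tail bytes: XOR against the key's bytes as they sit in memory,
    // so the ordering matches what the block loop above does (host byte order)
    uint8_t key_bytes[8];
    std::memcpy(key_bytes, &key, 8);
    for (size_t j = 0; i < len; ++i, ++j) {
        in[i] ^= key_bytes[j];
    }
}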
Related
While reading this question, I've seen the first comment saying that:
size_t for length is not a great idea, the proper types are signed ones for optimization/UB reasons.
followed by another comment supporting the reasoning. Is it true?
The question is important, because if I were to write e.g. a matrix library, the dimensions could be size_t, just to avoid checking if they are negative. But then all loops would naturally use size_t. Could this impact optimization?
size_t being unsigned is mostly an historical accident - if your world is 16 bit, going from 32767 to 65535 maximum object size is a big win; in current-day mainstream computing (where 64 and 32 bit are the norm) the fact that size_t is unsigned is mostly a nuisance.
Although unsigned types have less undefined behavior (as wraparound is guaranteed), the fact that they have mostly "bitfield" semantics is often cause of bugs and other bad surprises; in particular:
the difference between unsigned values is unsigned as well, with the usual wraparound semantics, so if the result may be negative you have to cast beforehand;
unsigned a = 10, b = 20;
// prints 4294967286 (i.e. UINT_MAX - 9) if unsigned is 32 bit
std::cout << a-b << "\n";
more generally, in mixed signed/unsigned comparisons and arithmetic the unsigned type wins (the signed value is implicitly converted to unsigned), which, again, leads to surprises;
unsigned a = 10;
int b = -2;
if(a < b) std::cout<<"a < b\n"; // prints "a < b"
in common situations (e.g. iterating backwards) the unsigned semantics are often problematic, as you'd like the index to go negative for the boundary condition
// This works fine if T is signed, loops forever if T is unsigned
for(T idx = c.size() - 1; idx >= 0; idx--) {
    // ...
}
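A common workaround, for what it's worth, is to fold the decrement into the condition, so the index never needs to go below zero:

// works whether T is signed or unsigned, and also when c is empty
for(T idx = c.size(); idx-- > 0; ) {
    // ...
}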
Also, the fact that an unsigned value cannot assume a negative value is mostly a strawman; you may avoid checking for negative values, but due to implicit signed-unsigned conversions it won't stop any error - you are just shifting the blame. If the user passes a negative value to your library function taking a size_t, it will just become a very big number, which will be just as wrong if not worse.
int sum_arr(int *arr, unsigned len) {
    int ret = 0;
    for(unsigned i = 0; i < len; ++i) {
        ret += arr[i];
    }
    return ret;
}

// compiles successfully and overflows the array; if len was signed,
// it would just return 0
sum_arr(some_array, -10);
For the optimization part: the advantages of signed types in this regard are overrated; yes, the compiler can assume that overflow will never happen, so it can be extra smart in some situations, but generally this won't be game-changing (wraparound semantics generally come "for free" on current-day architectures); most importantly, as usual, if your profiler finds that a particular zone is a bottleneck you can modify just that part to make it go faster (including switching types locally to make the compiler generate better code, if you find it advantageous).
Long story short: I'd go for signed, not for performance reasons, but because the semantics is generally way less surprising/hostile in most common scenarios.
That comment is simply wrong. When working with native pointer-sized operands on any reasonable architecture, there is no difference at the machine level between signed and unsigned offsets, and thus no room for them to have different performance properties.
As you've noted, use of size_t has some nice properties like not having to account for the possibility that a value might be negative (although accounting for it might be as simple as forbidding that in your interface contract). It also ensures that you can handle any size that a caller is requesting using the standard type for sizes/counts, without truncation or bounds checks. On the other hand, it precludes using the same type for index-offsets when the offset might need to be negative, and in some ways makes it difficult to perform certain types of comparisons (you have to write them arranged algebraically so that neither side is negative), but the same issue comes up when using signed types, in that you have to do algebraic rearrangements to ensure that no subexpression can overflow.
Ultimately you should initially always use the type that makes sense semantically to you, rather than trying to choose a type for performance properties. Only if there's a serious measured performance problem that looks like it might be improved by tradeoffs involving choice of types should you consider changing them.
I stand by my comment.
There is a simple way to check this: checking what the compiler generates.
void test1(double* data, size_t size)
{
    for(size_t i = 0; i < size; i += 4)
    {
        data[i] = 0;
        data[i+1] = 1;
        data[i+2] = 2;
        data[i+3] = 3;
    }
}

void test2(double* data, int size)
{
    for(int i = 0; i < size; i += 4)
    {
        data[i] = 0;
        data[i+1] = 1;
        data[i+2] = 2;
        data[i+3] = 3;
    }
}
So what does the compiler generate? I would expect loop unrolling, SIMD... for something that simple:
Let's check godbolt.
Well, the signed version gets unrolling and SIMD; the unsigned one does not. Most likely this is because signed overflow is undefined behaviour: with int the compiler may assume i never wraps, so the trip count is known and the stores to i, i+1, i+2, i+3 are provably contiguous, whereas with size_t the increment and the index expressions may legally wrap around, and the compiler has to account for that.
I'm not going to show any benchmark, because in this example, the bottleneck is going to be on memory access, not on CPU computation. But you get the idea.
Second example, just keep the first assignment:
void test1(double* data, size_t size)
{
    for(size_t i = 0; i < size; i += 4)
    {
        data[i] = 0;
    }
}

void test2(double* data, int size)
{
    for(int i = 0; i < size; i += 4)
    {
        data[i] = 0;
    }
}
And the same with gcc:
OK, not as impressive as for clang, but it still generates different code.
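For completeness, one way to keep size_t and still give the compiler something to work with is to do the block arithmetic on pointers, since pointer arithmetic cannot legally wrap. This is only a sketch; whether it actually restores the unrolling/SIMD depends on the compiler and is worth re-checking on godbolt:

void test3(double* data, size_t size)
{
    // iterate over whole blocks of 4; p[0]..p[3] are provably contiguous
    size_t blocks = size / 4;
    for(size_t b = 0; b < blocks; ++b)
    {
        double* p = data + b * 4;
        p[0] = 0;
        p[1] = 1;
        p[2] = 2;
        p[3] = 3;
    }
}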
I'm starting to get into the mild depths of C, using arduinos and such like, and just wanted some advice on how I'm generating random noise using a For loop.
The important bit:
void testdrawnoise() {
    int j = 0;
    for (uint8_t i=0; i<display.width(); i++) {
        if (i == display.width()-1) {
            j++;
            i=0;
        }
        M = random(0, 2);            // Random 0/1
        display.drawPixel(i, j, M);  // (Width, Height, Pixel on/off)
        display.refresh();
    }
}
The function draws one pixel at a time across the screen, moving to the next line down once i has reached display.width()-1. Whether the pixel appears on (black) or off (white) is determined by M.
The code is working fine, but I feel like it could be done better, or at least neater, and perhaps more efficiently.
Input and critiques greatly appreciated.
First of all, your loop never ends, and goes on incrementing j without bounds, so, after you filled the screen once, you go on looping outside of the screen height; although your library does bounds checking, it's certainly not a productive use of CPU to keep on looping without actually doing useful work until j overflows and goes back to zero.
Also, signed overflow is undefined behavior in C++, so you are technically on shaky grounds (I originally thought that Arduino always compiles with -fwrapv which guarantees wraparound on signed integer overflow, but apparently I was mistaken).
Given that the library you are using keeps the whole framebuffer in memory and sends it all on refresh calls, it doesn't make much sense to re-send it at each pixel - especially since the frame transmission is probably going to be by far the slowest part of this loop. So, you can move it out of the loop.
Putting this together (plus caching width and height and using the simpler overload of random), you can change this to:
void testdrawnoise() {
    int w = display.width(), h = display.height();
    for (int j=0; j<h; ++j) {
        for (int i=0; i<w; ++i) {
            display.drawPixel(i, j, random(2));
        }
    }
    display.refresh();
}
(if your screen dimensions are smaller than 256 on AVR Arduinos you may gain something by changing all those int to byte, but don't take my word for it)
Notice that this will do it just once, you can put it into your loop() function or in an infinite loop to make it keep generating random patterns.
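For instance, with the usual Arduino sketch structure (assuming nothing else needs to run in the meantime), that could simply be:

void loop() {
    testdrawnoise();  // redraw a fresh random frame on every pass through loop()
}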
This is what you can do with the provided interface; now, going into undocumented territory we can go faster.
As stated above, the library you are using keeps the whole framebuffer in memory, packed (as expected) at 8 bits per byte, in a single global variable named sharpmem_buffer, initialized with a malloc of the obvious size.
It should also be noted that, when you ask for a random bit in your code, the PRNG generates a full 31-bit random number and takes just the low bit. Why waste all the other perfectly good random bits?
At the same time, when you call drawPixel, the library performs a series of boolean operations on the corresponding byte in memory to set just the bit you asked for without touching the rest of the bits. Quite stupid, given that you are going to overwrite the other ones with random anyway.
So, putting together these two facts, we can do something like:
void testdrawnoise() {
    // access the buffer defined in another .cpp
    extern byte *sharpmem_buffer;
    byte *ptr = sharpmem_buffer;  // pointer to current position
    // end position
    byte *end = ptr + display.width()*display.height()/8;
    for (; ptr!=end; ++ptr) {
        // store a full byte of random
        *ptr = random(256);
    }
    display.refresh();
}
which, once you subtract the refresh() time, should be at the very least 8 times faster than the previous version (I actually expect significantly more, given that not only does the core of the loop execute 1/8th of the iterations, it's also way simpler - no function calls besides random, no branches, no boolean operations on memory).
On AVR Arduinos the only point that can be optimized further is probably the RNG - we are still using only 8 bits of a 31-bit RNG (if it really is 31 bits? Arduino documentation as usual sucks badly at providing useful technical information), so we could probably generate 3 bytes of random out of a single RNG call, or 4 if we switched to a hand-rolled LCG that didn't mess with the sign bit. On ARM Arduinos, in this last case, we could even gain something by performing full 32-bit stores in memory instead of writing single bytes.
However, these further optimizations are (1) tedious to write (if you have to handle screens where the number of pixels is not multiple of 24/32) and (2) probably not particularly profitable, given that most of the time will be spent in transmission over the SPI anyway. Worth mentioning them anyway, as they may be useful in other cases where there's no transmission bottleneck to slow everything down.
Given that OP's MCU is actually a Cortex M0 (so, a 32 bit ARM), it's worth trying to make it even faster using a full 32 bit PRNG and 32 bit stores.
As said above, built-in random returns a signed value, and it's not exactly clear what range it provides; for this reason, we'll have to roll our own PRNG that is guaranteed to provide 32 full bits of randomness.
A decent and very fast PRNG that provides 32 random bits with minimal state is xorshift; we'll just use the xorshift32 straight from Wikipedia, as we don't really need the improved "*" or "+" versions (nor do we really care about the bigger period provided by the larger counterparts).
struct XorShift32 {
    uint32_t state = 0x12345678;

    uint32_t next() {
        uint32_t x = state;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        state = x;
        return x;
    }
};
XorShift32 xorShift;
Now we can rewrite testdrawnoise():
void testdrawnoise() {
    int size = display.width()*display.height();
    // access the buffer defined in another .cpp
    extern byte *sharpmem_buffer;
    /*
       we can access the framebuffer as if it was an array of 32-bit words;
       this is fine, since it was alloc-ed with malloc, which guarantees memory
       aligned for the most restrictive built-in type, and the library only
       uses it with byte pointers, so there should be no strict aliasing problem
    */
    uint32_t *ptr = (uint32_t *)sharpmem_buffer;
    /*
       notice that the division is an integer division, which truncates; so, we
       are filling the framebuffer up to the last multiple of 4 bytes; with
       "strange" sizes we may be leaving out up to 3 bytes (see later)
    */
    uint32_t *end = ptr + size/32;
    for (; ptr!=end; ++ptr) {
        // store a full 32-bit word of random
        *ptr = xorShift.next();
    }
    // now fill the possibly missing last three bytes,
    // picking up where we left off
    byte *final_ptr = (byte *)end;
    byte *final_end = sharpmem_buffer + size/8;
    // generate 32 random bits; it's ok, we'll need at most 24
    uint32_t r = xorShift.next();
    for(; final_ptr!=final_end; ++final_ptr) {
        // take the lower 8 bits
        *final_ptr = r;
        // throw away the bits we used, bring in the upper ones
        r = r>>8;
    }
    display.refresh();
}
I've been provided a binary file to read, which holds a sequence of raw values. For the sake of simplicity suppose they're unsigned integral values, either 4 or 8 bytes long. Unfortunately for me, the byte order for these values is incompatible with my processor's endianness (little vs big or vice-versa; never mind about weird PDP-endianness etc.); and I want this data in memory with the proper endianness.
What's the fastest way to do this, considering the fact that I'm reading the data from a file? If it's not worth exploiting this fact, please explain why that is.
Considering the fact that you're reading the data from a file, the way you switch endianness is going to have an insignificant effect on the runtime compared to what the file I/O does.
What could make a significant difference is how you read the data. Trying to read the bytes out of order would not be a good idea. Simply read the bytes in order, and switch endianness afterwards. This separates the reading and the byte swapping.
What I typically want from byte-swapping code, and certainly in the case of reading a file, is that it works for any endianness and doesn't depend on architecture-specific instructions.
unsigned char* buf = read(); // let buf be a pointer to the read buffer
uint32_t v;

// little to native
v = 0;
for(unsigned i = 0; i < sizeof v; i++)
    v |= (uint32_t)buf[i] << CHAR_BIT * i;

// big to native
v = 0;
for(unsigned i = 0; i < sizeof v; i++)
    v |= (uint32_t)buf[i] << CHAR_BIT * (sizeof v - 1 - i);
This works whether the native is big, little, or one of the middle endian variety.
Of course, boost has already implemented these for you, so there is no need to re-implement. Also, there are the ntoh? family of functions provided by both POSIX and by the windows C library, which can be used to convert big endian to/from native.
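For instance, a minimal sketch using ntohl (POSIX; on Windows the same function lives in <winsock2.h>) to pull a big-endian 32-bit value out of the read buffer:

#include <arpa/inet.h>  // ntohl (POSIX)
#include <cstdint>
#include <cstring>

uint32_t read_be32(const unsigned char* buf)
{
    uint32_t v;
    std::memcpy(&v, buf, sizeof v);  // copy the 4 raw bytes exactly as stored in the file
    return ntohl(v);                 // big-endian (network) order -> native order
}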
Not the fastest, but a portable way would be to read the file into an (unsigned) int array, alias the int array to a char one (allowed per strict aliasing rule) and swap bytes in memory.
Fully portable way:
void swapints(unsigned int *arr, size_t l) {
    for (size_t i=0; i<l; i++) {
        unsigned int cur;
        char *dest = reinterpret_cast<char *>(&cur) + sizeof(unsigned int);
        const char *src = reinterpret_cast<const char *>(&arr[i]);
        for(size_t j=0; j<sizeof(unsigned int); j++) *(--dest) = *(src++);
        arr[i] = cur;
    }
}
But if you do not need portability, some systems offer swapping functions. For example BSD systems have bswap16, bswap32 and bswap64 to swap bytes in a uint16_t, uint32_t and uint64_t respectively. No doubt equivalent functions exist in the Microsoft and GNU/Linux worlds.
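For example (a sketch; the intrinsic names below are the usual ones on GCC/Clang and MSVC, so treat them as an assumption about your toolchain):

#include <cstdint>
#ifdef _MSC_VER
#include <stdlib.h>  // _byteswap_ulong
#endif

inline uint32_t bswap32(uint32_t v)
{
#ifdef _MSC_VER
    return _byteswap_ulong(v);    // MSVC intrinsic
#else
    return __builtin_bswap32(v);  // GCC/Clang builtin
#endif
}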
Alternatively, if you know that the file is in network order (big endian) and your processor is not, you can use the ntohs and ntohl functions for respectively uint16_t and uint32_t.
Remark (per AndrewHenle's comment): whatever the host endianness, ntohs and ntohl can always be used - they are simply no-ops on big-endian systems.
With the following method we can swap two variables A and B:
A = A XOR B
B = A XOR B
A = A XOR B
I want to implement such a method in C++ that operates on all types (int, float, char, ...) as well as structures. As we know, all types of data, including structures, take up a specific amount of memory, for example 4 bytes or 8 bytes.
In my opinion this method for swapping must work with all types excluding pointer-based types. It should swap the memory contents, that is the bits, of the two variables.
My Question
I have no idea how I can implement such a method in C++ that works with structures (ones that do not contain any pointers). Can anyone please help me?
Your problem is easily reduced to XOR-swapping buffers of raw memory. Something like this:
void xorswap(void *a, void *b, size_t size);
That can be implemented in terms of xorswaps of primitive types. For example:
void xorswap(void *a, void *b, size_t size)
{
    if (a == b)
        return; //nothing to do

    size_t qwords = size / 8;
    size_t rest = size % 8;

    uint64_t *a64 = (uint64_t *)a;
    uint64_t *b64 = (uint64_t *)b;
    for (size_t i = 0; i < qwords; ++i)
        xorswap64(a64++, b64++);

    uint8_t *a8 = (uint8_t*)a64;
    uint8_t *b8 = (uint8_t*)b64;
    for (size_t i = 0; i < rest; ++i)
        xorswap8(a8++, b8++);
}
I leave the implementation of xorswap64() and xorswap8() as an exercise to the reader.
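For reference, a minimal sketch of those helpers could be just the textbook XOR swap applied to one element (note that XOR-swapping an object with itself zeroes it, which is why xorswap() bails out early when a == b):

static void xorswap64(uint64_t *a, uint64_t *b)
{
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}

static void xorswap8(uint8_t *a, uint8_t *b)
{
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}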
Also note that to be efficient, the original buffers should be 8-byte aligned. If that's not the case, depending on the architecture, the code may work suboptimally or not work at all (again, an exercise to the reader ;-).
Other optimizations are possible. You can even use Duff's device to unroll the last loop, but I don't know if it is worth it. You'll have to profile it to know for sure.
You can use the bitwise XOR operator "^" in C to XOR two bits. To XOR 'a' and 'b', start XORing from the least significant bit and work up to the most significant bit.
I have a hex pattern stored in a variable; how do I know what the size of the hex pattern is?
E.g. --
#define MY_PATTERN 0xFFFF
now I want to know the size of MY_PATTERN, to use somewhere in my code.
sizeof (MY_PATTERN)
this is giving me a warning -- "integer conversion resulted in truncation".
How can I fix this ? What is the way I should write it ?
The pattern can increase or decrease in size so I can't hard code it.
Don't do it.
There's no such thing in C++ as a "hex pattern". What you actually use is an integer literal. See paragraph "The type of the literal". Thus, sizeof (0xffff) is equal to sizeof(int). And the bad thing is: the exact size may vary.
From the design point of view, I can't really think of a situation where such a solution is acceptable. You're not even deriving a type from a literal value, which would be suspicious as well, but at least a typesafe solution. Sizes of values are mostly used in operations working with memory buffers directly, like memcpy() or fwrite(). Sizes defined in such indirect ways lead to a very brittle binary interface and maintenance difficulties. What if you compile a program on both x86 and Motorola 68000 machines and want them to interoperate via a network protocol, or want to write some files on the first machine and read them on another? sizeof(int) is 4 for the first and 2 for the second. It will break.
Instead, explicitly use the exactly sized types, like int8_t, uint32_t, etc. They're defined in the <cstdint> header.
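For instance, a minimal sketch (assuming C++11 for constexpr and static_assert; MY_PATTERN is the name from the question):

#include <cstdint>

constexpr std::uint16_t MY_PATTERN = 0xFFFF;  // exactly 2 bytes wherever uint16_t exists

static_assert(sizeof(MY_PATTERN) == 2, "pattern is 16 bits wide by construction");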
This will solve your problem:
#define MY_PATTERN 0xFFFF

struct TypeInfo
{
    template<typename T>
    static size_t SizeOfType(T) { return sizeof(T); }
};

int main()
{
    size_t size_of_type = TypeInfo::SizeOfType(MY_PATTERN);
}
As pointed out by Nighthawk441, you can just do:
sizeof(MY_PATTERN);
Just make sure to use a size_t wherever you are getting a warning and that should solve your problem.
You could explicitly typedef various types to hold hex numbers with restricted sizes such that:
typedef unsigned char one_byte_hex;
typedef unsigned short two_byte_hex;
typedef unsigned int four_byte_hex;
one_byte_hex pattern = 0xFF;
two_byte_hex bigger_pattern = 0xFFFF;
four_byte_hex big_pattern = 0xFFFFFFFF;
//sizeof(pattern) == 1
//sizeof(bigger_pattern) == 2
//sizeof(big_pattern) == 4
four_byte_hex new_pattern = static_cast<four_byte_hex>(pattern);
//sizeof(new_pattern) == 4
It would be easier to just treat all hex numbers as unsigned ints regardless of pattern used though.
Alternatively, you could put together a function which checks how many times it can shift the bits of the pattern until it's 0.
size_t sizeof_pattern(unsigned int pattern)
{
    size_t bits = 0;
    size_t bytes = 0;
    unsigned int tmp = pattern;

    while(tmp >> 1 != 0){
        bits++;
        tmp = tmp >> 1;
    }
    bytes = (bits + 1) / 8; // add 1 to bits to shift range from 0-31 to 1-32 so we can divide properly; 8 bits per byte
    if((bits + 1) % 8 != 0){
        bytes++; // requires one more byte to store the value since we have remaining bits
    }
    return bytes;
}