Each datatype has a certain range, based on the hardware. For example, on a 32-bit machine an int has the range -2147483648 to 2147483647.
C++ compilers 'pad' object memory to fit into certain sizes. I'm pretty sure it's 2, 4, 8, 16, 32, 64 etc. This also probably depends on the machine.
I want to manually align my objects to meet padding requirements. Is there a way to:
Determine what machine a program is running on
Determine padding sizes
Set custom data-type based on bitsize
I've used bitsets before in Java, but I'm not familiar with C++. As for machine requirements, I know programs for different hardware sets are usually compiled differently in C++, so I'm wondering if it's even possible at all.
Example->
/*getHardwarePackSize obviously doesn't exist, just here to explain. What I'm trying to get
here would be the minimum alignment size for the machine the program is running on*/
#define PACK_SIZE = getHardwarePackSize();
#define MONTHS = 12;
class date{
private:
//Pseudo code that represents making a custom type
customType monthType = MONTHS/PACK_SIZE;
monthType.remainder = MONTHS % PACK_SIZE;
monthType months = 12;
};
The idea is to be able to fit every variable into the minimum bit size and track how many bits are left over.
Theoretically, it would be possible to make use of every unused bit and improve memory efficiency. Obviously this would never work anything like this, but the example is just to explain the concept.
This is a lot more complex than what you are trying to describe, as there are requirements for alignment on objects and items within objects. For example, if the compiler decides that an integer item is 16 bytes into a struct or class, it may well decide that "ah, I can use an aligned SSE instruction to load this data, because it is aligned at 16 bytes" (or something similar in ARM, PowerPC, etc). So if you do not satisfy AT LEAST that alignment in your code, you will cause the program to go wrong (crash or misread the data, depending on the architecture).
Typically, the alignment used and given by the compiler will be "right" for whatever architecture the compiler is targeting. Changing it will often lead to worse performance. Not always, of course, but you'd better know exactly what you are doing before you fiddle with it. And measure the performance before/after, and test thoroughly that nothing has been broken.
The padding is typically just to the next "minimum alignment for the largest type" - e.g. if a struct contains only int and a couple of char variables, it will be padded to 4 bytes [inside the struct and at the end, as required]. For double, padding to 8 bytes is done to ensure correct alignment, but three doubles will, typically, take up 8 * 3 = 24 bytes with no further padding.
Also, determining what hardware you are executing on (or will execute on) is probably better done during compilation, than during runtime. At runtime, your code will have been generated, and the code is already loaded. You can't really change the offsets and alignments of things at this point.
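As a rough illustration of the compile-time route, you can branch on the compiler's predefined architecture macros. This is only a sketch (the usual GCC/Clang macro names are shown; kVectorBytes is a made-up name for the example):
#include <cstddef>

#if defined(__AVX2__)
constexpr std::size_t kVectorBytes = 32;   // built with AVX2: 32-byte vectors available
#elif defined(__SSE2__)
constexpr std::size_t kVectorBytes = 16;   // SSE2 baseline, e.g. any x86-64 target
#else
constexpr std::size_t kVectorBytes = alignof(std::max_align_t);  // conservative fallback
#endif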
If you are using the gcc or clang compilers, you can use __attribute__((aligned(n))), e.g. int x[4] __attribute__((aligned(32))); would create a 16-byte array that is aligned to 32 bytes. This can be done inside structures or classes as well as for any variable you are using. But this is a compile-time option; it cannot be applied at runtime.
It is also possible, in C++11 onwards, to find out the alignment of a type or variable with alignof.
Note that it gives the alignment required for the type, so if you do something daft like:
int x;
char buf[4 * sizeof(int)];
int *p = (int *)(buf + 7);
std::cout << alignof(*p) << std::endl;
the code will print 4 (the alignment required by the type int), even though the address buf + 7 is almost certainly not 4-byte aligned (it is off by 7 modulo 4 = 3 bytes).
Types can not be chosen at runtime. C++ is a statically typed language: the type of something is determined at compile time - sure, classes that derive from a base class can be created at runtime, but for any given object, it has ONE TYPE, always and forever until it is no longer allocated.
It is better to make such choices at compile-time, as it makes the code much more straightforward for the compiler, and will allow better optimisation than if the choices are made at runtime, since you then have to make a runtime decision to use branch A or branch B of some piece of code.
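For instance, a compile-time type choice can be written with std::conditional; here is a minimal sketch (the alias name smallest_uint is made up for the example):
#include <cstdint>
#include <type_traits>

// Pick the smallest standard unsigned type able to hold MaxValue, at compile time.
template <std::uint64_t MaxValue>
using smallest_uint = typename std::conditional<
    MaxValue <= UINT8_MAX, std::uint8_t,
    typename std::conditional<
        MaxValue <= UINT16_MAX, std::uint16_t,
        typename std::conditional<
            MaxValue <= UINT32_MAX, std::uint32_t,
            std::uint64_t>::type>::type>::type;

smallest_uint<12> months = 0;   // resolves to uint8_t; no runtime branching involved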
As an example of aligned vs. unaligned access:
#include <cstdio>
#include <cstdlib>
#include <vector>
#define LOOP_COUNT 1000
unsigned long long rdtscl(void)
{
unsigned int lo, hi;
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
struct A
{
long a;
long b;
long d;
char c;
};
struct B
{
long a;
long b;
long d;
char c;
} __attribute__((packed));
std::vector<A> arr1(LOOP_COUNT);
std::vector<B> arr2(LOOP_COUNT);
int main()
{
for (int i = 0; i < LOOP_COUNT; i++)
{
arr1[i].a = arr2[i].a = rand();
arr1[i].b = arr2[i].b = rand();
arr1[i].c = arr2[i].c = rand();
arr1[i].d = arr2[i].d = rand();
}
printf("align A %zd, size %zd\n", alignof(A), sizeof(A));
printf("align B %zd, size %zd\n", alignof(B), sizeof(B));
for(int loops = 0; loops < 10; loops++)
{
printf("Run %d\n", loops);
size_t sum = 0;
size_t sum2 = 0;
unsigned long long before = rdtscl();
for (int i = 0; i < LOOP_COUNT; i++)
sum += arr1[i].a + arr1[i].b + arr1[i].c + arr1[i].d;
unsigned long long after = rdtscl();
printf("ARR1 %lld sum=%zd\n",(after - before), sum);
before = rdtscl();
for (int i = 0; i < LOOP_COUNT; i++)
sum2 += arr2[i].a + arr2[i].b + arr2[i].c + arr2[i].d;
after = rdtscl();
printf("ARR2 %lld sum=%zd\n",(after - before), sum2);
}
}
[Part of that code is taken from another project, so it's perhaps not the neatest C++ code ever written, but it saved me writing code from scratch, that isn't relevant to the project]
Then the results:
$ ./a.out
align A 8, size 32
align B 1, size 25
Run 0
ARR1 5091 sum=3218410893518
ARR2 5051 sum=3218410893518
Run 1
ARR1 3922 sum=3218410893518
ARR2 4258 sum=3218410893518
Run 2
ARR1 3898 sum=3218410893518
ARR2 4241 sum=3218410893518
Run 3
ARR1 3876 sum=3218410893518
ARR2 4184 sum=3218410893518
Run 4
ARR1 3875 sum=3218410893518
ARR2 4191 sum=3218410893518
Run 5
ARR1 3876 sum=3218410893518
ARR2 4186 sum=3218410893518
Run 6
ARR1 3875 sum=3218410893518
ARR2 4189 sum=3218410893518
Run 7
ARR1 3925 sum=3218410893518
ARR2 4229 sum=3218410893518
Run 8
ARR1 3884 sum=3218410893518
ARR2 4210 sum=3218410893518
Run 9
ARR1 3876 sum=3218410893518
ARR2 4186 sum=3218410893518
As you can see, the code that is aligned, using arr1, takes around 3900 clock-cycles, and the one using arr2 takes around 4200 cycles. So about 300 cycles out of roughly 4000, some 7.5%, if my mental arithmetic works correctly.
Of course, like so many different things, it really depends on the exact situation: how the objects are used, what the cache size is, exactly what processor it is, and how much other code and data around it is also using cache space. The only way to be certain is to experiment with YOUR code.
[I ran the code several times, and although I didn't always get the same results, I always got similar proportional results]
Related
I'm a SIMD beginner, I've read this article about the topic (since I'm using an AVX2-compatible machine).
Now, I've read in this question to check if your pointer is aligned.
I'm testing it with this toy example main.cpp:
#include <iostream>
#include <cstdint>     // for uintptr_t
#include <immintrin.h>
#define is_aligned(POINTER, BYTE_COUNT) \
(((uintptr_t)(const void *)(POINTER)) % (BYTE_COUNT) == 0)
int main()
{
float a[8];
for(int i=0; i<8; i++){
a[i]=i;
}
__m256 evens = _mm256_set_ps(2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0);
std::cout<<is_aligned(a, 16)<<" "<<is_aligned(&evens, 16)<<std::endl;
std::cout<<is_aligned(a, 32)<<" "<<is_aligned(&evens, 32)<<std::endl;
}
And compile it with icpc -std=c++11 -o main main.cpp.
The resulting printing is:
1 1
1 1
However, if I add these 3 lines before the 4 prints:
for(int i=0; i<8; i++)
std::cout<<a[i]<<" ";
std::cout<<std::endl;
This is the result:
0 1 2 3 4 5 6 7
1 1
0 1
In particular, I don't understand that last 0. Why is it different from the last printing? What am I missing?
Your is_aligned (which is a macro, not a function) determines whether the object has been aligned to a particular boundary. It does not determine the alignment requirement of the type of the object.
The compiler will guarantee for a float array that it is aligned to at least the alignment requirement of float, which is typically 4 bytes. An address that is a multiple of 4 is not necessarily a multiple of 32, so there is no guarantee that the array is aligned to a 32-byte boundary. However, many addresses are divisible by both 4 and 32, so an address on a 4-byte boundary can happen to fall on a 32-byte boundary as well. This is what happened in your first test, but, as explained, there is no guarantee of it. In your latter test you added some local variables, and the array ended up at another memory location, which happened not to lie on a 32-byte boundary.
To request a stricter alignment that may be required by SIMD instructions, you can use the alignas specifier:
alignas(32) float a[8];
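A minimal sketch of the fixed toy example, assuming AVX is available (compile with e.g. -mavx); with alignas(32) the aligned load is now safe:
#include <iostream>
#include <cstdint>
#include <immintrin.h>

#define is_aligned(POINTER, BYTE_COUNT) \
    (((uintptr_t)(const void *)(POINTER)) % (BYTE_COUNT) == 0)

int main()
{
    alignas(32) float a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    std::cout << is_aligned(a, 32) << std::endl;   // now guaranteed to print 1
    __m256 v = _mm256_load_ps(a);                  // aligned load, no risk of faulting
    (void)v;
}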
I need to read a file in C++ that has this specific format:
10 5
1 2 3 4 1 5 1 5 2 1
All the values are separated by a space. The first two values on the first line are the variables N and M respectively, and the N values on the second line need to go into an array called S with a size of N. The code I have written has no problem with files like these, but it does not work with really big files containing millions of values, which I need it to handle. Here is the code:
int N,M;
FILE *read = fopen("file.in", "r");
fscanf(read, "%d %d ", &N, &M);
int S[N];
for (int i = 0; i < N; i++){
fscanf(read, "%d ", &S[i]);
}
What should I change?
There are multiple potential issues when getting in the range of millions of integers:
int is most often 32 bits; a 32-bit signed integer has a range of -2^31 to 2^31 - 1, and thus a maximum of 2,147,483,647. You should switch to a 64-bit integer type.
You are using int S[N], a Variable Length Array (VLA), which is not Standard C++ (it is Standard C99, but... there are discussions as to whether it was a good idea or not). The important detail, though, is that a VLA is stored on the stack: 1 million 32-bit ints is 4 MB, 2 million is 8 MB, etc... check your default stack size, but it is likely less than 8 MB, and thus you get a stack overflow (you're on the right site for help!).
So, let's switch to C++ and do away with those issues:
#include <cstdint> // for int64_t
#include <fstream>
#include <vector>
int main(int argc, char* argv[]) {
std::ifstream stream("data.txt");
int64_t n = 0, m = 0;
stream >> n >> m;
std::vector<int> data;
for (int64_t c = 0; c != n; ++c) {
int i = 0;
stream >> i;
data.push_back(i);
}
// do your best :)
}
First of all, we use int64_t from <cstdint> to do away with the integer overflow issue. Second, we use a stream (input file stream: ifstream) to avoid having to remember the format specifier associated with each and every integral type (it's a pain). Third, we use a vector to store the data we read, and do away with the stack overflow issue.
You are using variable sized arrays. This is not standard and not supported by all compilers. If your compiler supports it and you go into the millions, you'll run out of stack space (stack overflow).
Alternatively, you could define S as being a vector with vector<int> S(N);
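Sketched against the original fragment (keeping the FILE*/fscanf style of the question), that replacement could look roughly like this:
#include <cstdio>
#include <vector>

int main() {
    int N = 0, M = 0;
    FILE *read = fopen("file.in", "r");
    if (!read) return 1;
    if (fscanf(read, "%d %d", &N, &M) != 2) return 1;
    std::vector<int> S(N);              // heap storage: no stack-overflow risk
    for (int i = 0; i < N; i++) {
        fscanf(read, "%d", &S[i]);
    }
    fclose(read);
}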
I need to replicate a 6-byte integer value into a memory region, starting with its beginning and as quickly as possible. If such an operation is supported in hardware, I'd like to use it (I'm on x64 processors now, compiler is GCC 4.6.3).
memset doesn't suit the job, because it can only replicate single bytes. std::fill isn't good either, because I can't even define an iterator that jumps between 6-byte-wide positions in the memory region.
So, I'd like to have a function:
void myMemset(void* ptr, uint64_t value, uint8_t width, size_t num)
This looks like memset, but there is an additional argument width to define how many bytes from the value to replicate. If something like that could be expressed in C++, that would be even better.
I already know about the obvious myMemset implementation, which would call memcpy in a loop with the last argument (bytes to copy) equal to the width. I also know that I can define a temporary memory region of size 6 * 8 = 48 bytes, fill it up with 6-byte integers and then memcpy it to the destination area.
Can we do better?
Something along the lines of @Mark Ransom's comment:
Copy 6 bytes, then 6, 12, 24, 48, 96, etc.
#include <string.h>   /* for memcpy and size_t */

void memcpy6(void *dest, const void *src, size_t n /* number of 6 byte blocks */) {
if (n-- == 0) {
return;
}
memcpy(dest, src, 6);
size_t width = 1;
while (n >= width) {
memcpy(&((char *) dest)[width * 6], dest, width * 6);
n -= width;
width <<= 1; // double w
}
if (n > 0) {
memcpy(&((char *) dest)[width * 6], dest, n * 6);
}
}
Optimization: scale n and width by 6.
[Edit]
Corrected destination (thanks @SchighSchagh).
Added (char *) casts.
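For example (values and names made up), filling a region with 1000 copies of a 6-byte pattern with the function above could look like:
unsigned char pattern[6] = {0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF};
unsigned char region[6 * 1000];
memcpy6(region, pattern, 1000);   /* region now holds the 6-byte pattern repeated 1000 times */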
Determine the most efficient write size that the CPU supports; then find the smallest number that can be evenly divided by both 6 and that write size and call that "block size".
Now split the memory region up into blocks of that size. Each block will be identical and all writes will be correctly aligned (assuming the memory region itself is correctly aligned).
For example, if the most efficient write size that the CPU supports is 4 bytes (e.g. ancient 80486) then the "size of block" would be 12 bytes. You'd set 3 general purpose registers and do 3 stores per block.
For another example, if the most efficient write size that the CPU supports is 16 bytes (e.g. SSE) then the "size of block" would be 48 bytes. You'd set 3 SSE registers and do 3 stores per block.
Also, I'd recommend rounding the size of the memory region up to ensure it is a multiple of the block size (with some "not strictly necessary" padding). A few unnecessary writes are less expensive than code to fill a "partial block".
The second most efficient method might be to use a memory copy (but not memcpy() or memmove()). In this case you'd write the initial 6 bytes (or 12 bytes or 48 bytes or whatever), then copy from (e.g.) &area[0] to &area[6] (working from lowest to highest) until you reach the end. For this memmove() will not work because it will notice the area is overlapping and work from highest to lowest instead; and memcpy() will not work because it assumes the source and destination do not overlap; so you'd have to create your own memory copy to suit. The main problem with this is that you double the number of memory accesses - "reading and writing" is slower than "writing alone".
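A rough sketch of that forward self-copy, written as a plain byte loop so the idea is visible (fill6_forward is a made-up name; this is an illustration, not a tuned implementation):
#include <stddef.h>
#include <string.h>

/* Fill dest with `num` copies of a 6-byte pattern by copying forward,
   lowest address to highest, so earlier writes feed later reads. */
static void fill6_forward(void *dest, const void *pattern, size_t num)
{
    unsigned char *d = (unsigned char *) dest;
    size_t total = num * 6;
    if (total == 0)
        return;
    memcpy(d, pattern, 6);        /* seed the first 6 bytes */
    for (size_t i = 6; i < total; i++)
        d[i] = d[i - 6];          /* forward overlapping copy */
}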
If your num is large enough, you can try using the AVX vector instructions that will handle 32 bytes at a time (_mm256_load_si256/_mm256_store_si256 or their unaligned variants).
As 32 is not a multiple of 6, you will first have to replicate the 6-byte pattern 16 times using short memcpy's or 32/64-bit moves.
ABCDEF
ABCDEF|ABCDEF
ABCD EFAB CDEF|ABCD EFAB CDEF
ABCDEFAB CDEFABCD EFABCDEF|ABCDEFAB CDEFABCD EFABCDEF
ABCDEFABCDEFABCD EFABCDEFABCDEFAB CDEFABCDEFABCDEF|ABCDEFABCDEFABCD EFABCDEFABCDEFAB CDEFABCDEFABCDEF
You will also finish with a short memcpy.
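A hedged sketch of that approach (fill6_avx is a made-up name; assumes AVX support, e.g. compile with -mavx, and keeps the tail handling deliberately simple):
#include <immintrin.h>
#include <string.h>

/* 96 = lcm(6, 32): a 96-byte staging buffer holds the 6-byte pattern exactly
   16 times and can be streamed out three 32-byte registers at a time. */
static void fill6_avx(unsigned char *dest, const unsigned char *pattern, size_t num)
{
    unsigned char stage[96];
    memcpy(stage, pattern, 6);
    for (size_t have = 6; have < 96; have += have)   /* 6 -> 12 -> 24 -> 48 -> 96 */
        memcpy(stage + have, stage, have);

    __m256i v0 = _mm256_loadu_si256((const __m256i *) (stage +  0));
    __m256i v1 = _mm256_loadu_si256((const __m256i *) (stage + 32));
    __m256i v2 = _mm256_loadu_si256((const __m256i *) (stage + 64));

    size_t total = num * 6, i = 0;
    for (; i + 96 <= total; i += 96) {
        _mm256_storeu_si256((__m256i *) (dest + i +  0), v0);
        _mm256_storeu_si256((__m256i *) (dest + i + 32), v1);
        _mm256_storeu_si256((__m256i *) (dest + i + 64), v2);
    }
    memcpy(dest + i, stage, total - i);              /* short tail, fewer than 96 bytes */
}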
Try the __movsq intrinsic (x64 only; in assembly, rep movsq) that will move 8 bytes at a time, with a suitable repetition factor, and setting the destination address 6 bytes after the source. Check that overlapping addresses are handled smartly.
Write 8 bytes at a time.
Being on a 64-bit machine, the generated code can certainly work well with 8-byte writes. After dealing with some set-up issues, in a tight loop, write 8 bytes per iteration, about num times. Assumptions apply - see code.
// assume little endian
#include <assert.h>
#include <stdint.h>
#include <string.h>

void myMemset(void* ptr, uint64_t value, uint8_t width, size_t num) {
assert(width > 0 && width <= 8);
uint64_t *ptr64 = (uint64_t *) ptr;
// # to stop early to prevent writing past array end
static const unsigned stop_early[8 + 1] = { 0, 8, 3, 2, 1, 1, 1, 1, 0 };
size_t se = stop_early[width];
if (num > se) {
num -= se;
// assume no bus-fault with 64-bit write # `ptr64, ptr64+1, ... ptr64+7`
while (num > 0) { // tight loop
num--;
*ptr64 = value;
ptr64 = (uint64_t *) ((char *) ptr64 + width);
}
ptr = ptr64;
num = se;
}
// Cope with last few writes
while (num-- > 0) {
memcpy(ptr, &value, width);
ptr = (char *) ptr + width;
}
}
Further optimization includes writing 2 blocks at a time when width == 3 or 4, 4 blocks at a time when width == 2, and 8 blocks at a time when width == 1.
I'm working on an x86 or x86_64 machine. I have an array unsigned int a[32] all of whose elements have value either 0 or 1. I want to set the single variable unsigned int b so that ((b >> i) & 1) == a[i] will hold for all 32 elements of a. I'm working with GCC on Linux (shouldn't matter much I guess).
What's the fastest way to do this in C?
The fastest way on recent x86 processors is probably to make use of the MOVMSKB family of instructions which extract the MSBs of a SIMD word and pack them into a normal integer register.
I fear SIMD intrinsics are not really my thing but something along these lines ought to work if you've got an AVX2-equipped processor:
#include <cstdint>
#include <immintrin.h>

uint32_t bitpack(const bool array[32]) {
__m256i tmp = _mm256_loadu_si256((const __m256i *) array);
tmp = _mm256_cmpgt_epi8(tmp, _mm256_setzero_si256());
return _mm256_movemask_epi8(tmp);
}
Assuming sizeof(bool) == 1. For older SSE2 systems you will have to string together a pair of 128-bit operations instead, as sketched below. Aligning the array on a 32-byte boundary should save another cycle or so.
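Something along these lines for SSE2, again hedged as a sketch (same sizeof(bool) == 1 assumption; bitpack_sse2 is a made-up name):
#include <cstdint>
#include <emmintrin.h>   // SSE2

uint32_t bitpack_sse2(const bool array[32]) {
    __m128i lo = _mm_loadu_si128((const __m128i *) array);
    __m128i hi = _mm_loadu_si128((const __m128i *) (array + 16));
    lo = _mm_cmpgt_epi8(lo, _mm_setzero_si128());    // 0x00 -> 0x00, 0x01 -> 0xFF
    hi = _mm_cmpgt_epi8(hi, _mm_setzero_si128());
    return (uint32_t) _mm_movemask_epi8(lo)
         | ((uint32_t) _mm_movemask_epi8(hi) << 16);
}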
If sizeof(bool) == 1 then you can pack 8 bools at a time into 8 bits (more with 128-bit multiplications) using the technique discussed here, on a computer with fast multiplication, like this:
#include <cstdint>

inline int pack8b(bool* a)
{
uint64_t t = *((uint64_t*)a);
return (0x8040201008040201*t >> 56) & 0xFF;
}
int pack32b(bool* a)
{
return (pack8b(a + 0) << 24) | (pack8b(a + 8) << 16) |
(pack8b(a + 16) << 8) | (pack8b(a + 24) << 0);
}
Explanation:
Suppose the bools a[0] to a[7] have their least significant bits named a-h respectively. Treating those 8 consecutive bools as one 64-bit word and loading it, on a little-endian machine we'll get the bits in reversed order. Now we'll do a multiplication (here dots are zero bits):
| a7 || a6 || a5 || a4 || a3 || a2 || a1 || a0 |
.......h.......g.......f.......e.......d.......c.......b.......a
× 1000000001000000001000000001000000001000000001000000001000000001
────────────────────────────────────────────────────────────────
↑......h.↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
↑....f...↑...e....↑..d.....↑.c......↑b.......a
+ ↑...e....↑..d.....↑.c......↑b.......a
↑..d.....↑.c......↑b.......a
↑.c......↑b.......a
↑b.......a
a
────────────────────────────────────────────────────────────────
= abcdefghxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
The arrows are added so it's easier to see the position of the set bits in the magic number. At this point the 8 least significant bits have been put in the top byte; we'll just need to mask the remaining bits out.
So by using the magic number 0b1000000001000000001000000001000000001000000001000000001000000001 or 0x8040201008040201 we have the above code
Of course you need to make sure that the bool array is correctly 8-byte aligned. You can also unroll the code and optimize it, for example by shifting only once instead of shifting by 56 bits each time.
Sorry, I overlooked the question, saw doynax's bool array, and misread "32 0/1 values" as meaning 32 bools. Of course the same technique can also be used to pack multiple uint32_t or uint16_t values (or other distributions of bits) at the same time, but it's a lot less efficient than packing bytes.
On newer x86 CPUs with BMI2 the PEXT instruction can be used. The pack8b function above can be replaced with
_pext_u64(*((uint64_t*)a), 0x0101010101010101ULL);
And to pack 2 uint32_t as the question requires use
_pext_u64(*((uint64_t*)a), (1ULL << 32) | 1ULL);
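Putting those pieces together, a sketch of packing all 32 byte-sized flags with PEXT (requires BMI2, e.g. compile with -mbmi2; pack32_pext is a made-up name, and memcpy is used to sidestep the strict-aliasing cast):
#include <immintrin.h>
#include <cstdint>
#include <cstring>

uint32_t pack32_pext(const bool *a) {
    uint32_t out = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t chunk;
        std::memcpy(&chunk, a + 8 * i, 8);   // 8 bools, one byte each
        out |= (uint32_t) _pext_u64(chunk, 0x0101010101010101ULL) << (8 * i);
    }
    return out;                              // bit k of out == a[k]
}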
Other answers contain an obvious loop implementation.
Here's a first variant:
unsigned int result=0;
for(unsigned i = 0; i < 32; ++i)
result = (result<<1) + a[i];
On modern x86 CPUs, I think shifts of any distance in a register take constant time, and if so this solution won't be better. Your CPU might not be so nice; this code minimizes the cost of long-distance shifts: it does 32 1-bit shifts, which every CPU can do (you can always add result to itself to get the same effect). The obvious loop implementation shown by others does about 900 (the sum over the 32 iterations) 1-bit shifts, by virtue of shifting a distance equal to the loop index. (See @Jongware's measurements of the differences in the comments; apparently long shifts on x86 are not unit time.)
Let us try something more radical.
Assume you can pack m booleans into an int somehow (trivially you can do this for m==1), and that you have two instance variables i1 and i2 containing such m packed bits.
Then the following code packs m*2 booleans into an int:
(i1<<m)+i2
Using this we can pack 2^n bits as follows:
unsigned int a2[16],a4[8],a8[4],a16[2], a32[1]; // each "aN" will hold N bits of the answer
a2[0]=(a1[0]<<1)+a1[1]; // the original bits are a1[k]; can be scalar variables or ints
a2[1]=(a1[2]<<1)+a1[3]; // yes, you can use "|" instead of "+"
...
a2[15]=(a1[30]<<1)+a1[31];
a4[0]=(a2[0]<<2)+a2[1];
a4[1]=(a2[2]<<2)+a2[3];
...
a4[7]=(a2[14]<<2)+a2[15];
a8[0]=(a4[0]<<4)+a4[1];
a8[1]=(a4[2]<<4)+a4[3];
a8[2]=(a4[4]<<4)+a4[5];
a8[3]=(a4[6]<<4)+a4[7];
a16[0]=(a8[0]<<8)+a8[1];
a16[1]=(a8[2]<<8)+a8[3];
a32[0]=(a16[0]<<16)+a16[1];
Assuming our friendly compiler resolves an[k] into a (scalar) direct memory access (if not, you can simply replace the variable an[k] with an_k), the above code does (abstractly) 63 fetches, 31 writes, 31 shifts and 31 adds. (There's an obvious extension to 64 bits).
On modern x86 CPUs, I think shifts of any distance in a register take constant time. If not, this code minimizes the cost of long-distance shifts; it in effect does 64 1-bit shifts.
On an x64 machine, other than the fetches of the original booleans a1[k], I'd expect all the rest of the scalars to be schedulable by the compiler to fit in the registers, thus 32 memory fetches, 31 shifts and 31 adds. It's pretty hard to avoid the fetches (if the original booleans are scattered around) and the shifts/adds match the obvious simple loop. But there is no loop, so we avoid 32 increment/compare/index operations.
If the starting booleans are really in an array, with each value occupying the bottom bit of an otherwise zeroed byte:
bool a1[32];
then we can abuse our knowledge of memory layout to fetch several at a time:
a4[0]=((unsigned int*)a1)[0]; // picks up 4 bools in one fetch
a4[1]=((unsigned int*)a1)[1];
...
a4[7]=((unsigned int*)a1)[7];
a8[0]=(a4[0]<<1)+a4[1];
a8[1]=(a4[2]<<1)+a4[3];
a8[2]=(a4[4]<<1)+a4[5];
a8[3]=(a4[6]<<1)+a4[7];
a16[0]=(a8[0]<<2)+a8[1];
a16[1]=(a8[2]<<2)+a8[3];
a32[0]=(a16[0]<<4)+a16[1];
Here our cost is 8 fetches of (sets of 4) booleans, 7 shifts and 7 adds. Again, no loop overhead. (Again there is an obvious generalization to 64 bits).
To get faster than this, you probably have to drop into assembler and use some of the many wonderful and weird instructions available there (the vector registers probably have scatter/gather ops that might work nicely).
As always, these solutions need to be performance tested.
I would probably go for this:
#include <cstdio>

unsigned a[32] =
{
1, 0, 0, 1, 1, 1, 0 ,0, 1, 0, 0, 0, 1, 1, 0, 0
, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1
};
int main()
{
unsigned b = 0;
for(unsigned i = 0; i < sizeof(a) / sizeof(*a); ++i)
b |= a[i] << i;
printf("b: %u\n", b);
}
Compiler optimization may well unroll that but just in case you can always try:
int main()
{
unsigned b = 0;
b |= a[0];
b |= a[1] << 1;
b |= a[2] << 2;
b |= a[3] << 3;
// ... etc
b |= a[31] << 31;
printf("b: %u\n", b);
}
To determine what the fastest way is, time all of the various suggestions. Here is one that may well end up as "the" fastest (using standard C, no processor-dependent SSE or the like):
unsigned int bits[32][2] = {
{0,0x80000000},{0,0x40000000},{0,0x20000000},{0,0x10000000},
{0,0x8000000},{0,0x4000000},{0,0x2000000},{0,0x1000000},
{0,0x800000},{0,0x400000},{0,0x200000},{0,0x100000},
{0,0x80000},{0,0x40000},{0,0x20000},{0,0x10000},
{0,0x8000},{0,0x4000},{0,0x2000},{0,0x1000},
{0,0x800},{0,0x400},{0,0x200},{0,0x100},
{0,0x80},{0,0x40},{0,0x20},{0,0x10},
{0,8},{0,4},{0,2},{0,1}
};
unsigned int b = 0;
for (int i = 0; i < 32; i++)
b |= bits[i][a[i]];
The first value in the array is to be the leftmost bit: the highest possible value.
Testing a proof-of-concept with some rough timings shows this is indeed not orders of magnitude better than the straightforward loop with b |= (a[i]<<(31-i)):
Ira 3618 ticks
naive, unrolled 5620 ticks
Ira, 1-shifted 10044 ticks
Galik 10265 ticks
Jongware, using adds 12536 ticks
Jongware 12682 ticks
naive 13373 ticks
(Relative timings, with the same compiler options.)
(The 'adds' routine is mine with indexing replaced with a pointer-to and an explicit add for both indexed arrays. It is 10% slower, meaning my compiler is efficiently optimizing indexed access. Good to know.)
unsigned b=0;
for(int i=31; i>=0; --i){
b<<=1;
b|=a[i];
}
Your problem is a good opportunity to use -->, also called the downto operator:
unsigned int a[32];
unsigned int b = 0;
for (unsigned int i = 32; i --> 0;) {
b += b + a[i];
}
The advantage of using --> is that it works with both signed and unsigned loop index types.
This approach is portable and readable, it might not produce the fastest code, but clang does unroll the loop and produce decent performance, see https://godbolt.org/g/6xgwLJ
Using integer math alone, I'd like to "safely" average two unsigned ints in C++.
What I mean by "safely" is avoiding overflows (and anything else that can be thought of).
For instance, averaging 200 and 5000 is easy:
unsigned int a = 200;
unsigned int b = 5000;
unsigned int average = (a + b) / 2; // Equals: 2600 as intended
But in the case of 4294967295 and 5000 then:
unsigned int a = 4294967295;
unsigned int b = 5000;
unsigned int average = (a + b) / 2; // Equals: 2499 instead of 2147486147
The best I've come up with is:
unsigned int a = 4294967295;
unsigned int b = 5000;
unsigned int average = (a / 2) + (b / 2); // Equals: 2147486147 as expected
Are there better ways?
Your last approach seems promising. You can improve on that by manually considering the lowest bits of a and b:
unsigned int average = (a / 2) + (b / 2) + (a & b & 1);
This gives the correct result in the case where both a and b are odd.
If you know ahead of time which one is higher, a very efficient way is possible. Otherwise you're better off using one of the other strategies, instead of conditionally swapping to use this.
unsigned int average = low + ((high - low) / 2);
Here's a related article: http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html
Your method is not correct if both numbers are odd, e.g. 5 and 7: the average is 6, but your method #3 returns 5.
Try this:
average = (a>>1) + (b>>1) + (a & b & 1)
with math operators only:
average = a/2 + b/2 + (a%2) * (b%2)
And the correct answer is...
(A&B)+((A^B)>>1)
If you don't mind a little x86 inline assembly (GNU C syntax), you can take advantage of supercat's suggestion to use rotate-with-carry after an add to put the high 32 bits of the full 33-bit result into a register.
Of course, you usually should mind using inline-asm, because it defeats some optimizations (https://gcc.gnu.org/wiki/DontUseInlineAsm). But here we go anyway:
// works for 64-bit long as well on x86-64, and doesn't depend on calling convention
unsigned average(unsigned x, unsigned y)
{
unsigned result;
asm("add %[x], %[res]\n\t"
"rcr %[res]"
: [res] "=r" (result) // output
: [y] "%0"(y), // input: in the same reg as results output. Commutative with next operand
[x] "rme"(x) // input: reg, mem, or immediate
: // no clobbers. ("cc" is implicit on x86)
);
return result;
}
The % modifier to tell the compiler the args are commutative doesn't actually help make better asm in the case I tried, calling the function with y being a constant or pointer-deref (memory operand). Probably using a matching constraint for an output operand defeats that, since you can't use it with read-write operands.
As you can see on the Godbolt compiler explorer, this compiles correctly, and so does a version where we change the operands to unsigned long, with the same inline asm. clang3.9 makes a mess of it, though, and decides to use the "m" option for the "rme" constraint, so it stores to memory and uses a memory operand.
RCR-by-one is not too slow, but it's still 3 uops on Skylake, with 2 cycle latency. It's great on AMD CPUs, where RCR has single-cycle latency. (Source: Agner Fog's instruction tables, see also the x86 tag wiki for x86 performance links). It's still better than #sellibitze's version, but worse than #Sheldon's order-dependent version. (See code on Godbolt)
But remember that inline-asm defeats optimizations like constant-propagation, so any pure-C++ version will be better in that case.
What you have is fine, with the minor detail that it will claim that the average of 3 and 3 is 2. I'm guessing that you don't want that; fortunately, there's an easy fix:
unsigned int average = a/2 + b/2 + (a & b & 1);
This just bumps the average back up in the case that both divisions were truncated.
In C++20, you can use std::midpoint:
template <class T>
constexpr T midpoint(T a, T b) noexcept;
The paper P0811R3 that introduced std::midpoint recommended this snippet (slightly adapted to work with C++11):
#include <type_traits>
template <typename Integer>
constexpr Integer midpoint(Integer a, Integer b) noexcept {
using U = typename std::make_unsigned<Integer>::type;
return a>b ? a-(U(a)-b)/2 : a+(U(b)-a)/2;
}
For completeness, here is the unmodified C++20 implementation from the paper:
constexpr Integer midpoint(Integer a, Integer b) noexcept {
using U = make_unsigned_t<Integer>;
return a>b ? a-(U(a)-b)/2 : a+(U(b)-a)/2;
}
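Usage with the question's values would be roughly as follows (C++20, std::midpoint lives in <numeric>); note that it rounds towards the first argument when the exact midpoint is not an integer:
#include <cstdio>
#include <numeric>

int main() {
    unsigned int a = 4294967295u;
    unsigned int b = 5000u;
    std::printf("%u\n", std::midpoint(a, b));   // 2147486148 (rounds towards a)
    std::printf("%u\n", std::midpoint(b, a));   // 2147486147 (rounds towards b)
}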
If the code is for an embedded micro, and if speed is critical, assembly language may be helpful. On many microcontrollers, the result of the add would naturally go into the carry flag, and instructions exist to shift it back into a register. On an ARM, the average operation (source and dest. in registers) could be done in two instructions; any C-language equivalent would likely yield at least 5, and probably a fair bit more than that.
Incidentally, on machines with shorter word sizes, the differences can be even more substantial. On an 8-bit PIC-18 series, averaging two 32-bit numbers would take twelve instructions. Doing the shifts, add, and correction, would take 5 instructions for each shift, eight for the add, and eight for the correction, so 26 (not quite a 2.5x difference, but probably more significant in absolute terms).
int[] array = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
decimal avg = 0;
for (int i = 0; i < array.Length; i++){
avg = (array[i] - avg) / (i+1) + avg;
}
expects avg == 5.0 for this test
((((a & b) << 1) + (a ^ b)) >> 1) is also a nice way.
Courtesy: http://www.ragestorm.net/blogs/?p=29