While attempting to write the fastest factorial function that evaluates at run-time, I found myself questioning whether it is a good idea to declare the constant array f[] at the function level or at the unit level.
#include <stdint.h>
#include <cassert>
// uint64_t const f[21] = { 1,1,2,6,24,120,720,5040,40320,362880,3628800,39916800,479001600,6227020800, 87178291200, 1307674368000, 20922789888000, 355687428096000, 6402373705728000, 121645100408832000, 2432902008176640000 };
const uint64_t factorial(const uint8_t n) {
static uint64_t const f[21] = { 1,1,2,6,24,120,720,5040,40320,362880,3628800,39916800,479001600,6227020800, 87178291200, 1307674368000, 20922789888000, 355687428096000, 6402373705728000, 121645100408832000, 2432902008176640000 };
assert(n <= 20);
return f[n];
}
What are the pros and cons of each placement, assuming that f[] will be used only by the factorial() function?
Is a constant, which is declared at the function level, created and destroyed each time the function is executed, like a non-const variable, or does the linker collect and put all constants in the .rodata section at compile-time?
In my humble opinion the best approach is to let the compiler decide and do everything with constexpr. The compiler will make the best decision for you.
And because very few factorials fit into an unsigned 64-bit value (only 21, from 0! to 20!), a compile-time array needs just 21*8 = 168 bytes.
That number is so low that we can easily build a compile-time constexpr std::array and stop all further considerations.
Really everything can be done at compile time.
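As for the storage question itself, here is a minimal sketch (the name factorialLookup is made up for illustration). A function-local array declared static constexpr has static storage duration and a constant initializer, so it is not created and destroyed on each call. Compilers typically place such data in a read-only section, but the exact placement is an implementation detail.

```cpp
#include <cassert>
#include <cstdint>

// Sketch only: a function-local lookup table with static storage duration.
// "static constexpr" means the array is initialized once, from constants,
// and is not rebuilt when the function is called.
inline std::uint64_t factorialLookup(std::uint8_t n) {
    static constexpr std::uint64_t table[21] = {
        1ull, 1ull, 2ull, 6ull, 24ull, 120ull, 720ull, 5040ull, 40320ull,
        362880ull, 3628800ull, 39916800ull, 479001600ull, 6227020800ull,
        87178291200ull, 1307674368000ull, 20922789888000ull,
        355687428096000ull, 6402373705728000ull, 121645100408832000ull,
        2432902008176640000ull
    };
    assert(n <= 20);  // 21! does not fit into 64 bits
    return table[n];
}
```

Keeping the table inside the function limits its visibility to the only code that uses it, which is the main practical difference from the unit-level placement.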
We will first define the default approach for calculating a factorial as a constexpr function:
constexpr unsigned long long factorial(unsigned long long n) noexcept {
return n == 0ull ? 1 : n * factorial(n - 1ull);
}
With that, factorials can easily be calculated at compile time. Then, we fill a std::array with all factorials. For that we use another constexpr function, a template with a variadic parameter pack.
We use std::integer_sequence to create the factorials for the indices 0,1,2,3,4,5, ....
That is straightforward and not complicated:
template <size_t... ManyIndices>
constexpr auto generateArrayHelper(std::integer_sequence<size_t, ManyIndices...>) noexcept {
return std::array<unsigned long long, sizeof...(ManyIndices)>{ { factorial(ManyIndices)... } };
}
This function will be fed with an integer sequence 0,1,2,3,4,... and return a std::array<unsigned long long, ...> with the corresponding factorials.
We know that we can store a maximum of 21 values. Therefore we write a second function that calls the one above with the integer sequence 0,1,2,3,...,19,20 (which is what std::make_integer_sequence<size_t, 21> produces), like so:
constexpr auto generateArray()noexcept {
return generateArrayHelper(std::make_integer_sequence<size_t, MaxIndexFor64BitValue>());
}
And now, finally,
constexpr auto Factorial = generateArray();
will give us a compile-time std::array<unsigned long long, 21> with the name Factorial containing all factorials. And if we need the i'th factorial, then we can simply write Factorial[i]. There will be no calculation at runtime.
I do not think that there is a faster way to calculate a factorial.
Please see the complete program below:
#include <iostream>
#include <array>
#include <utility>
// ----------------------------------------------------------------------
// All the below will be calculated at compile time
// constexpr factorial function
constexpr unsigned long long factorial(unsigned long long n) noexcept {
return n == 0ull ? 1 : n * factorial(n - 1ull);
}
// We will automatically build an array of factorials at compile time
// Generate a std::array with n elements
template <size_t... ManyIndices>
constexpr auto generateArrayHelper(std::integer_sequence<size_t, ManyIndices...>) noexcept {
return std::array<unsigned long long, sizeof...(ManyIndices)>{ { factorial(ManyIndices)... } };
}
// Number of factorials (0! .. 20!) that fit into a 64-bit unsigned value
constexpr size_t MaxIndexFor64BitValue = 21;
// Generate the required number of elements
constexpr auto generateArray()noexcept {
return generateArrayHelper(std::make_integer_sequence<size_t, MaxIndexFor64BitValue>());
}
// This is a constexpr array of all factorials
constexpr auto Factorial = generateArray();
// All the above was compile time
// ----------------------------------------------------------------------
// Test function
int main() {
for (size_t i{}; i < MaxIndexFor64BitValue; ++i)
std::cout << i << '\t' << Factorial[i] << '\n';
return 0;
}
Developed, compiled and tested with Microsoft Visual Studio Community 2019, Version 16.8.2
Additionally compiled and tested with gcc 10.2 and clang 11.0.1
Language: C++17
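As a possible alternative to the index-sequence approach, the table can also be filled with a plain loop inside a constexpr function. This is only a sketch under the assumption of C++17 or later; the names kFactorialCount, makeFactorialTable and kFactorials are invented for this example. The last static_assert also shows why the table stops at 20!.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical names, chosen for this sketch only.
constexpr std::size_t kFactorialCount = 21;  // 0! .. 20!

// Fill the table with a loop. Needs C++17, where writing to a std::array
// element is allowed inside a constant expression.
constexpr std::array<std::uint64_t, kFactorialCount> makeFactorialTable() noexcept {
    std::array<std::uint64_t, kFactorialCount> table{};
    table[0] = 1;
    for (std::size_t i = 1; i < kFactorialCount; ++i)
        table[i] = table[i - 1] * i;
    return table;
}

constexpr auto kFactorials = makeFactorialTable();

static_assert(kFactorials[0] == 1);
static_assert(kFactorials[5] == 120);
static_assert(kFactorials[20] == 2432902008176640000ull);
// 21! would overflow: 20! is already larger than (max 64-bit value) / 21.
static_assert(kFactorials[20] > std::numeric_limits<std::uint64_t>::max() / 21);
```

Each entry is derived from the previous one, so the compiler performs 20 multiplications in total instead of one recursive evaluation per element.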
I was curious how far I could push gcc as far as compile-time evaluation is concerned, so I made it compute the Ackermann function, specifically with input values of 4 and 1 (anything higher than that is impractical):
consteval unsigned int A(unsigned int x, unsigned int y)
{
if(x == 0)
return y+1;
else if(y == 0)
return A(x-1, 1);
else
return A(x-1, A(x, y-1));
}
unsigned int result = A(4, 1);
(I think the recursion depth is bounded at ~16K but just to be safe I compiled this with -std=c++20 -fconstexpr-depth=100000 -fconstexpr-ops-limit=12800000000)
Not surprisingly, this takes up an obscene amount of stack space (in fact, it causes the compiler to crash if run with the default process stack size of 8mb) and takes several minutes to compute. However, it does eventually get there so evidently the compiler could handle it.
After that I decided to try implementing the Ackermann function using templates, with metafunctions and partial specialization pattern matching. Amazingly, the following implementation only takes a few seconds to evaluate:
template<unsigned int x, unsigned int y>
struct A {
static constexpr unsigned int value = A<x-1, A<x, y-1>::value>::value;
};
template<unsigned int y>
struct A<0, y> {
static constexpr unsigned int value = y+1;
};
template<unsigned int x>
struct A<x, 0> {
static constexpr unsigned int value = A<x-1, 1>::value;
};
unsigned int result = A<4,1>::value;
(compile with -ftemplate-depth=17000)
Why is there such a dramatic difference in evaluation time? Aren't these essentially equivalent? I guess I can understand the consteval solution requiring slightly more memory and evaluation time because semantically it consists of a bunch of function calls, but that doesn't explain why this exact same (non-consteval) function computed at runtime only takes slightly longer than the metafunction version (compiled without optimizations).
Why is consteval so slow? And how can the metafunction version be so fast? It's actually not much slower than optimized machine-code.
In the template version of A, when a particular specialization, say A<2,3>, is instantiated, the compiler remembers this type, and never needs to instantiate it again. This comes from the fact that types are unique, and each "call" to this meta-function is just computing a type.
The consteval function version is not optimized to do this, and so A(2,3) may be evaluated multiple times, depending on the control flow, resulting in the performance difference you observe. There's nothing stopping compilers from "caching" the results of function calls, but these optimizations likely just haven't been implemented yet.
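The caching effect is easier to see on a smaller recurrence than Ackermann. The following sketch uses Fibonacci numbers purely as an illustration (the struct name Fib is not from the question): every Fib<N> is a distinct type, so it is instantiated once and its value is reused wherever it is needed again.

```cpp
// Metafunction version: Fib<N> is a distinct type for every N, so the
// compiler instantiates each one once and reuses the stored value.
template <unsigned N>
struct Fib {
    static constexpr unsigned long long value =
        Fib<N - 1>::value + Fib<N - 2>::value;
};

template <>
struct Fib<0> {
    static constexpr unsigned long long value = 0;
};

template <>
struct Fib<1> {
    static constexpr unsigned long long value = 1;
};

// Only about 90 instantiations are needed here, although the naive call
// tree of the same recurrence has an exponential number of nodes.
static_assert(Fib<90>::value == 2880067194370816120ull, "unexpected Fib<90>");
```

A function with the same doubly recursive body would need a number of calls that grows exponentially in N unless the compiler caches results itself, which is the difference described above.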
As I understand it, the keyword constexpr tells the compiler that an expression can be evaluated at compile time. Specifically, constexpr on a variable means that its value can be evaluated at compile time, whereas constexpr on a function means that the function may be invoked and its return value evaluated at compile time. If the function is invoked at runtime, it just acts as an ordinary function.
Today, I wrote a piece of code to try to use constexpr:
#include <iostream>
using namespace std;
constexpr long int fib(int n)
{
return (n <= 1)? n : fib(n-1) + fib(n-2);
}
int main ()
{
constexpr long int res = fib(32);
// const long int res = fib(32);
cout << res << endl;
return 0;
}
I was expecting the compilation to take a long time, but I was wrong. It took only 0.290s:
$ time g++ test.cpp
real 0m0.290s
user 0m0.252s
sys 0m0.035s
But if I change constexpr long int res = fib(32); into const long int res = fib(32);, to my surprise, it spent much more time on the compilation:
$ time g++ test.cpp
real 0m5.830s
user 0m5.568s
sys 0m0.233s
In short, it seems that const causes fib(32) to be evaluated at compile time, whereas constexpr causes it to be evaluated at runtime. I'm really confused.
My system: Ubuntu 18.04
My gcc: g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
By inspecting the generated assembly, we can confirm that in both cases G++ 7.5 computed the fib(32) value at compile time:
movl $2178309, %esi
The reason G++ evaluates constexpr context so fast is due to memoization which it performs when evaluating constexpr and template contexts.
Memoization removes the exponential cost of the naive recursive Fibonacci and reduces it to O(N), because each fib(k) is computed only once.
So why then is non-constexpr evaluation so much slower? I presume that's a bug/shortcoming in the optimizer. If I try with G++ 8.1 or later, there's no difference in compilation times, so presumably it had already been addressed.
A short compilation time is not proof that the calls weren't evaluated at compile time. You can look at the compiled assembly to see that it was in fact evaluated at compile time using constexpr in my test here: https://godbolt.org/z/vbWaxe
In my tests with a newer compiler, const wasn't measurably slower than constexpr. It could be a quality of implementation issue with your version of the compiler.
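Independently of timing measurements, a static_assert is one way to prove that a value was produced during compilation. The sketch below uses fib(20) instead of fib(32) so that it also stays within the default constexpr evaluation limits of compilers that do not memoize; the names kFib20 and fibAtRunTime are made up for the example.

```cpp
constexpr long fib(int n) {
    return (n <= 1) ? n : fib(n - 1) + fib(n - 2);
}

// constexpr variable: the initializer has to be a constant expression,
// so the compiler must compute it; otherwise the code does not compile.
constexpr long kFib20 = fib(20);
static_assert(kFib20 == 6765, "fib(20) was computed by the compiler");

// The same function can still be called with a value that is only
// known at run time; then it behaves like an ordinary function.
inline long fibAtRunTime(int n) {
    return fib(n);
}
```

If the static_assert compiles, the evaluation happened at compile time, no matter how long or short the compilation took.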
The secret of the fast compile-time evaluation is that only very few Fibonacci numbers fit into unsigned long long, currently the largest standard integer type.
The basic message for Fibonacci number calculation is: Any calculation at runtime is not necessary at all! Everything can and should be done at compile time. And then, a simple lookup mechanism can be used. That will always be the most efficient and fastest solution.
So, with Binet's formula, we can calculate that only very few Fibonacci numbers fit into a C++ unsigned long long, which usually has 64 bits (as of 2021) and is the largest standard integer type: roughly 93. That is a really low number nowadays.
With modern C++17 (and above) features, we can easily create a std::array of all Fibonacci numbers for a 64-bit data type at compile time, because there are, as mentioned above, only 93 numbers.
So, we spend only 93*8 = 744 bytes of static, read-only memory for our lookup array. That is really negligible. And the compiler can compute those values quickly.
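The count can also be checked by the compiler rather than taken from Binet's formula. The following is a sketch assuming C++14 or later; the function name countFibonacciThatFit is invented here. Counting from F(0) = 0, the values F(0) to F(93) fit, which is 94 numbers; the generator in this answer starts at 1 instead of 0, which gives the 93 entries used here.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Counts how many Fibonacci numbers F(0), F(1), ... fit into 64 bits by
// stopping before the first addition that would overflow.
constexpr std::size_t countFibonacciThatFit() noexcept {
    constexpr std::uint64_t maxValue = std::numeric_limits<std::uint64_t>::max();
    std::uint64_t previous = 0;  // F(0)
    std::uint64_t current = 1;   // F(1)
    std::size_t count = 2;       // F(0) and F(1) are already counted
    while (previous <= maxValue - current) {
        const std::uint64_t next = previous + current;
        previous = current;
        current = next;
        ++count;
    }
    return count;
}

// F(0) .. F(93) fit: 94 values, or 93 when the leading 0 is left out.
static_assert(countFibonacciThatFit() == 94, "unexpected count");
```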
After the compile-time calculation, getting a Fibonacci number is a plain array lookup such as FIB[n]. For a detailed explanation, please see below.
And, if we want to know, if a number is Fibonacci, then we use std::binary_search for finding the value. So, this function will be for example:
bool isFib(const unsigned long long numberToBeChecked) {
return std::binary_search(FIB.begin(), FIB.end(), numberToBeChecked);
}
FIB (of course any other name possible) is a compile time, constexpr std::array. So, how to build that array?
We will first define the default approach for calculating a Fibonacci number as a constexpr function (non-recursive):
// Constexpr function to calculate the nth Fibonacci number
constexpr unsigned long long getFibonacciNumber(size_t index) noexcept {
// Initialize the first two Fibonacci numbers
unsigned long long f1{ 0 }, f2{ 1 };
// Calculating Fibonacci value
while (index--) {
// get next value of Fibonacci sequence
unsigned long long f3 = f2 + f1;
// Move to next number
f1 = f2;
f2 = f3;
}
return f2;
}
With that, Fibonacci numbers can easily be calculated at compile time as constexpr values. Then, we fill a std::array with all Fibonacci numbers. For that we use another constexpr function, a template with a variadic parameter pack.
We use std::integer_sequence to create a Fibonacci number for each of the indices 0,1,2,3,4,5, ....
That is straightforward and not complicated:
template <size_t... ManyIndices>
constexpr auto generateArrayHelper(std::integer_sequence<size_t, ManyIndices...>) noexcept {
return std::array<unsigned long long, sizeof...(ManyIndices)>{ { getFibonacciNumber(ManyIndices)... } };
}
This function will be fed with an integer sequence 0,1,2,3,4,... and return a std::array<unsigned long long, ...> with the corresponding Fibonacci numbers.
We know that we can store a maximum of 93 values. Therefore we write a second function that calls the one above with the integer sequence 0,1,2,3,...,91,92 (which is what std::make_integer_sequence<size_t, 93> produces), like so:
constexpr auto generateArray() noexcept {
return generateArrayHelper(std::make_integer_sequence<size_t, MaxIndexFor64BitValue>());
}
And now, finally,
constexpr auto FIB = generateArray();
will give us a compile-time std::array<unsigned long long, 93> with the name FIB containing the Fibonacci numbers. Because getFibonacciNumber(0) returns 1, the array starts with 1, 1, 2, 3, ... and does not store the leading 0 of the sequence, so FIB[i] holds the Fibonacci number F(i+1). The lookup involves no calculation at runtime.
The whole example program will look like this:
#include <iostream>
#include <array>
#include <utility>
#include <algorithm>
#include <iomanip>
// ----------------------------------------------------------------------
// All the following will be done during compile time
// Constexpr function to calculate the nth Fibonacci number
constexpr unsigned long long getFibonacciNumber(size_t index) noexcept {
// Initialize the first two Fibonacci numbers
unsigned long long f1{ 0 }, f2{ 1 };
// calculating Fibonacci value
while (index--) {
// get next value of Fibonacci sequence
unsigned long long f3 = f2 + f1;
// Move to next number
f1 = f2;
f2 = f3;
}
return f2;
}
// We will automatically build an array of Fibonacci numbers at compile time
// Generate a std::array with n elements
template <size_t... ManyIndices>
constexpr auto generateArrayHelper(std::integer_sequence<size_t, ManyIndices...>) noexcept {
return std::array<unsigned long long, sizeof...(ManyIndices)>{ { getFibonacciNumber(ManyIndices)... } };
}
// Number of Fibonacci numbers stored for a 64-bit unsigned value (Binet's formula)
constexpr size_t MaxIndexFor64BitValue = 93;
// Generate the required number of elements
constexpr auto generateArray()noexcept {
return generateArrayHelper(std::make_integer_sequence<size_t, MaxIndexFor64BitValue>());
}
// This is a constexpr array of all Fibonacci numbers
constexpr auto FIB = generateArray();
// All the above was compile time
// ----------------------------------------------------------------------
// Check, if a number belongs to the Fibonacci series
bool isFib(const unsigned long long numberToBeChecked) {
return std::binary_search(FIB.begin(), FIB.end(), numberToBeChecked);
}
// Test
int main() {
const unsigned long long testValue{ 498454011879264ull };
std::cout << std::boolalpha << "Does '" <<testValue << "' belong to Fibonacci series? --> " << isFib(testValue) << "\n\n";
for (size_t i{}; i < 10u; ++i)
std::cout << i << '\t' << FIB[i] << '\n';
return 0;
}
Developed and tested with Microsoft Visual Studio Community 2019, Version 16.8.2
Additionally tested with gcc 10.2 and clang 11.0.1
Language: C++ 17
At our company we use 64-bit flag enums in the code:
enum Flags : unsigned long long {
Flag1 = 1uLL<<0, // 1
//...
Flag40 = 1uLL<<40 // 1099511627776
};
We add comments that show each flag's decimal value, so that it is visible even when the code is read in a plain text viewer. The problem is that nothing prevents a developer from putting a wrong number in a comment.
There is a solution for this problem: a template with static_assert plus a macro that makes the approach easy to use, with no need to add parentheses and ::val everywhere:
template <unsigned long long i, unsigned long long j>
struct SNChecker{
static_assert(i == j, "Numbers not same!");
static const unsigned long long val = i;
};
#define SAMENUM(i, j) SNChecker<(i), (j)>::val
enum ET : unsigned long long {
ET1 = SAMENUM(1uLL<<2, 4),
ET2fail = SAMENUM(1uLL<<3, 4), // compile time error
ET4 = SAMENUM(1uLL<<40, 1099511627776uLL),
};
It all looks good, but we are not really fond of macros.
A question: can we do the same with a constexpr function, without making the error message less readable?
The closest solution I could think of is:
constexpr unsigned long long SameNum(unsigned long long i, unsigned long long j)
{
return (i == j) ? i : (throw "Numbers not same!");
}
but it generates a compile-time error
error: expression '<throw-expression>' is not a constant-expression
instead of whatever I write in the static_assert
Edit:
The answer below is almost perfect except for one small regression: the call is a bit less pretty than using the macro.
One more approach (still worse than using static_assert but "prettier" in usage)
int NumbersNotSame() { return 0; }
constexpr unsigned long long SameNum(unsigned long long i, unsigned long long j)
{
return (i == j) ? i : (NumbersNotSame());
}
static_assert in constexpr function:
template<unsigned long long I, unsigned long long J>
constexpr unsigned long long SameNum()
{
static_assert(I == J, "numbers don't match");
return I;
}
enum ET : unsigned long long {
ET1 = SameNum<1uLL<<2, 4>(),
ET2fail = SameNum<1uLL<<3, 4>(), // compile time error
ET4 = SameNum<1uLL<<40, 1099511627776uLL>(),
};
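If the trailing () is what makes the call less pretty, a variable template might be worth trying. This is only a sketch, assuming C++14 or later; SameNumChecker, same_num and the enumerator names are invented for the example. The static_assert message stays the one written in the source.

```cpp
template <unsigned long long I, unsigned long long J>
struct SameNumChecker {
    static_assert(I == J, "Numbers not same!");
    static constexpr unsigned long long value = I;
};

// Hypothetical helper: a variable template, so no "()" and no "::value"
// are needed at the point of use. Requires C++14.
template <unsigned long long I, unsigned long long J>
constexpr unsigned long long same_num = SameNumChecker<I, J>::value;

enum CheckedFlags : unsigned long long {
    CheckedFlag2  = same_num<1uLL << 2, 4>,
    CheckedFlag40 = same_num<1uLL << 40, 1099511627776uLL>,
    // CheckedBad = same_num<1uLL << 3, 4>,  // would fail: "Numbers not same!"
};

static_assert(CheckedFlag40 == 1099511627776uLL, "flag value mismatch");
```

Note that an expression containing >> would still need parentheses inside the template argument list, just as with the function template version.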
Dear all,
I've a program in C++ that should be as fast as possible.
In particular, there is a crucial part that works as follows:
There is a variable D ranging in [0,255], based on its value
it calls the function whose pointer is stored in a array F (with 255 elements).
i.e., F[D] gives the pointer to the function to call.
The functions are very simple: they execute a few expressions and/or assignments (no loops).
How can I do this faster?
I can replicate the code of the functions. I do not need the features of function calls (I'm using them only because it was the simplest way to do it).
My goal is to remove the inefficiencies caused by the function calls.
I am considering a switch/case,
where the code of each case is the code of the corresponding function.
Is there a faster way to do it?
A switch/case may be faster, as the compiler can make more specific optimizations, and there are other things like code locality compared to calling a function pointer. However, it's unlikely that you'll notice a substantial performance increase.
Convert it to a switch
Make sure the tiny functions can be inlined (how to do this is compiler dependent, but explicitly marking them inline and placing them in the same compilation unit is a good bet)
Finally, you might try profile-guided optimizations if (as seems quite possible) your switch statement calls the branches in a non-uniform fashion.
The aim is to avoid function-call overhead and have the compiler reorder the switch statement to reduce the number of branches typically encountered.
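To make the two variants concrete, here is a reduced sketch with four toy operations standing in for the 256 real ones. All names are made up, and the functions are marked constexpr only so that their equivalence can be checked with static_assert (the switch version assumes C++14 or later). Which variant is faster has to be measured on the real code.

```cpp
// Four toy operations standing in for the real ones (hypothetical names).
constexpr unsigned addOne(unsigned v)  { return v + 1u; }
constexpr unsigned subOne(unsigned v)  { return v - 1u; }
constexpr unsigned flipLow(unsigned v) { return v ^ 1u; }
constexpr unsigned twice(unsigned v)   { return v * 2u; }

using Op = unsigned (*)(unsigned);
constexpr Op kOps[4] = { addOne, subOne, flipLow, twice };

// Variant 1: indirect call through the lookup table.
constexpr unsigned viaTable(unsigned char d, unsigned v) {
    return kOps[d & 3u](v);
}

// Variant 2: switch with the function bodies written in place, so the
// compiler sees all the code and no call is needed.
constexpr unsigned viaSwitch(unsigned char d, unsigned v) {
    switch (d & 3u) {
        case 0:  return v + 1u;
        case 1:  return v - 1u;
        case 2:  return v ^ 1u;
        default: return v * 2u;
    }
}

static_assert(viaTable(2, 10) == viaSwitch(2, 10), "variants must agree");
```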
Edit: since there are some skeptical voices as to how much this helps, I gave it a shot. In the following program, the function-pointer loop needs 1.07 seconds, and the inlined switch statement takes 0.79 seconds on my machine - YMMV:
template<unsigned n0,unsigned n> struct F { static inline unsigned func(unsigned val); };
template<unsigned n> struct F<0,n> { static inline unsigned func(unsigned val) { return val + n;} };
template<unsigned n> struct F<1,n> { static inline unsigned func(unsigned val) { return val - n; } };
template<unsigned n> struct F<2,n> { static inline unsigned func(unsigned val) { return val ^ n; } };
template<unsigned n> struct F<3,n> { static inline unsigned func(unsigned val) { return val * n; } };
template<unsigned n> struct F<4,n> { static inline unsigned func(unsigned val) { return (val << ( n %16)) + n*(n&0xff); } };
template<unsigned n> struct F<5,n> { static inline unsigned func(unsigned val) { return (val >> ( n %16)) + (n*(n&0xff) << 16); } };
template<unsigned n> struct F<6,n> { static inline unsigned func(unsigned val) { return val / (n|1) + val; } };
template<unsigned n> struct F<7,n> { static inline unsigned func(unsigned val) { return (val <<16) + (val>>16); } };
template<unsigned n> struct f { static inline unsigned func(unsigned val) { return F<n%8,n>::func(val); } };
typedef unsigned (*fPtr)(unsigned);
fPtr funcs[256];
template<unsigned n0,unsigned n1> inline void fAssign() {
if(n0==n1-1 || n0==n1) //||n0==n1 just to avoid compiler warning
funcs[n0] = f<n0>::func;
else {
fAssign<n0,(n0 + n1)/2>();
fAssign<(n0 + n1)/2,n1>();
}
}
__forceinline unsigned funcSwitch(unsigned char type,unsigned val);//huge function elided
__declspec(noinline) unsigned doloop(unsigned val,unsigned start,unsigned end) {
for(unsigned x=start;x<end;++x)
val = funcs[x*37&0xff](val);
return val;
}
__declspec(noinline) unsigned doloop2(unsigned val,unsigned start,unsigned end) {
for(unsigned x=start;x<end;++x)
val = funcSwitch(x*37&0xff,val);
return val;
}
I verified that all function calls are inlined except those to doloop, doloop2 and funcs[?] to ensure I'm not measuring odd compiler choices.
So, on this machine, with MSC 10, this (thoroughly artificial) benchmark shows that the huge-switch version is about a third faster than the function-pointer lookup-based version. PGO slowed both versions down, probably because they're too small to exhibit cache effects and the program is small enough to be fully inlined/optimized even without PGO.
You're not going to beat the lookup table based indirect call with any other method, including switch. Switch is going to take at best logarithmic time (well, it might hash to get constant time, but that'll likely be slower than logarithmic time with ints for most input sizes).
This is assuming your code looks like this:
typedef void (*myProc_t)();
myProc_t functionArray[255] = { ... };
void CallSomepin(unsigned char D)
{
functionArray[D]();
}
If you're creating the array each time the function is called however, it might be a good idea to amortize the cost of the construction by doing the initialization of the function array once, rather than every time.
EDIT: That said, the best way to avoid the inefficiency of the indirect call is simply to not do it. Look at your code and see if there are places you can replace the indirect lookup based call with a direct call.
There is a variable D ranging in [0,255], based on its value it calls the function whose pointer is stored in a array F (with 255 elements)
Warning, this will result in a buffer overflow if D has the value 255.
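One way to guard against this off-by-one is to derive the table size from the index type. A small sketch with invented names (Handler, kHandlerCount, handlers), assuming the index is an unsigned char:

```cpp
#include <cstddef>
#include <limits>

using Handler = void (*)();

// One slot for every value an unsigned char index can take.
constexpr std::size_t kHandlerCount =
    static_cast<std::size_t>(std::numeric_limits<unsigned char>::max()) + 1;

// Placeholder table; the real function pointers would be filled in here.
Handler handlers[kHandlerCount] = {};

static_assert(kHandlerCount == 256, "assumes an 8-bit unsigned char");
```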
I'm using a well known template to allow binary constants
template< unsigned long long N >
struct binary
{
enum { value = (N % 10) + 2 * binary< N / 10 > :: value } ;
};
template<>
struct binary< 0 >
{
enum { value = 0 } ;
};
So you can do something like binary<101011011>::value. Unfortunately this has a limit of 20 digits for an unsigned long long.
Does anyone have a better solution?
Does this work if you have a leading zero on your binary value? A leading zero makes the constant octal rather than decimal.
Which leads to a way to squeeze a couple more digits out of this solution - always start your binary constant with a zero! Then replace the 10's in your template with 8's.
The approaches I've always used, though not as elegant as yours:
1/ Just use hex. After a while, you just get to know which hex digits represent which bit patterns.
2/ Use constants and OR or ADD them. For example (may need qualifiers on the bit patterns to make them unsigned or long):
#define b0 0x00000001
#define b1 0x00000002
: : :
#define b31 0x80000000
unsigned long x = b2 | b7;
3/ If performance isn't critical and readability is important, you can just do it at runtime with a function such as "x = fromBin("101011011");".
4/ As a sneaky solution, you could write a pre-pre-processor that goes through your *.cppme files and creates the *.cpp ones by replacing all "0b101011011"-type strings with their equivalent "0x15b" strings). I wouldn't do this lightly since there's all sorts of tricky combinations of syntax you may have to worry about. But it would allow you to write your string as you want to without having to worry about the vagaries of the compiler, and you could limit the syntax trickiness by careful coding.
Of course, the next step after that would be patching GCC to recognize "0b" constants but that may be an overkill :-)
C++0x has user-defined literals, which could be used to implement what you're talking about.
Otherwise, I don't know how to improve this template.
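For what it is worth, such a user-defined literal could look roughly like the sketch below. The suffix _b and the helper BinaryValue are made up for this example. The literal operator template receives the digits as individual characters, so a leading zero does not turn the number into an octal constant, and the decimal digit limit of the template trick does not apply. The last line additionally assumes C++14, which added built-in 0b literals.

```cpp
// Recursive helper: consumes one character per step, most significant first.
template <char... Digits>
struct BinaryValue;

template <>
struct BinaryValue<> {
    static constexpr unsigned long long value(unsigned long long acc) {
        return acc;
    }
};

template <char D, char... Rest>
struct BinaryValue<D, Rest...> {
    static_assert(D == '0' || D == '1', "binary literal may contain only 0 and 1");
    static constexpr unsigned long long value(unsigned long long acc) {
        return BinaryValue<Rest...>::value(
            acc * 2 + static_cast<unsigned long long>(D - '0'));
    }
};

// Hypothetical suffix _b: the digits arrive as template arguments.
template <char... Digits>
constexpr unsigned long long operator""_b() {
    static_assert(sizeof...(Digits) <= 64, "too many digits for 64 bits");
    return BinaryValue<Digits...>::value(0);
}

static_assert(101011011_b == 0x15b, "user-defined literal");
static_assert(0b101011011 == 0x15b, "built-in binary literal, C++14");
```

Usage is then simply `unsigned long long mask = 101011011_b;`, with up to 64 digits.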
template<unsigned int p,unsigned int i> struct BinaryDigit
{
enum { value = p*2+i };
typedef BinaryDigit<value,0> O;
typedef BinaryDigit<value,1> I;
};
struct Bin
{
typedef BinaryDigit<0,0> O;
typedef BinaryDigit<0,1> I;
};
Allowing:
Bin::O::I::I::O::O::value
much more verbose, but no limits (until you hit the size of an unsigned int of course).
You can add more non-type template parameters to "simulate" additional bits:
// Utility metafunction used by top_bit<N>.
template <unsigned long long N1, unsigned long long N2>
struct compare {
enum { value = N1 > N2 ? N1 >> 1 : compare<N1 << 1, N2>::value };
};
// This is hit when N1 grows beyond the size representable
// in an unsigned long long. Its value is never actually used.
template<unsigned long long N2>
struct compare<0, N2> {
enum { value = 42 };
};
// Determine the highest 1-bit in an integer. Returns 0 for N == 0.
template <unsigned long long N>
struct top_bit {
enum { value = compare<1, N>::value };
};
template <unsigned long long N1, unsigned long long N2 = 0>
struct binary {
enum {
value =
(top_bit<binary<N2>::value>::value << 1) * binary<N1>::value +
binary<N2>::value
};
};
template <unsigned long long N1>
struct binary<N1, 0> {
enum { value = (N1 % 10) + 2 * binary<N1 / 10>::value };
};
template <>
struct binary<0> {
enum { value = 0 } ;
};
You can use this as before, e.g.:
binary<1001101>::value
But you can also use the following equivalent forms:
binary<100,1101>::value
binary<1001,101>::value
binary<100110,1>::value
Basically, the extra parameter gives you another 20 bits to play with. You could add even more parameters if necessary.
Because the place value of the second number is used to figure out how far to the left the first number needs to be shifted, the second number must begin with a 1. (This is required anyway, since starting it with a 0 would cause the number to be interpreted as an octal number.)
Technically it is not C nor C++, it is a GCC specific extension, but GCC allows binary constants as seen here:
The following statements are identical:
i = 42;
i = 0x2a;
i = 052;
i = 0b101010;
Hope that helps. Some Intel compilers and I am sure others, implement some of the GNU extensions. Maybe you are lucky.
A simple #define works very well:
#define HEX__(n) 0x##n##LU
#define B8__(x) ((x&0x0000000FLU)?1:0)\
+((x&0x000000F0LU)?2:0)\
+((x&0x00000F00LU)?4:0)\
+((x&0x0000F000LU)?8:0)\
+((x&0x000F0000LU)?16:0)\
+((x&0x00F00000LU)?32:0)\
+((x&0x0F000000LU)?64:0)\
+((x&0xF0000000LU)?128:0)
#define B8(d) ((unsigned char)B8__(HEX__(d)))
#define B16(dmsb,dlsb) (((unsigned short)B8(dmsb)<<8) + B8(dlsb))
#define B32(dmsb,db2,db3,dlsb) (((unsigned long)B8(dmsb)<<24) + ((unsigned long)B8(db2)<<16) + ((unsigned long)B8(db3)<<8) + B8(dlsb))
B8(011100111)
B16(10011011,10011011)
B32(10011011,10011011,10011011,10011011)
Not my invention, I saw it on a forum a long time ago.