I need to store a 128 bits long UUID in a variable. Is there a 128-bit datatype in C++? I do not need arithmetic operations, I just want to easily store and read the value very fast.
A new feature from C++11 would be fine, too.
Although GCC does provide __int128, it is supported only for targets (processors) which have an integer mode wide enough to hold 128 bits. On a given system, sizeof(intmax_t) and sizeof(uintmax_t) tell you the size of the widest integer type that the compiler and platform declare through the standard headers.
GCC and Clang support __int128
Check out Boost's implementation:
#include <boost/multiprecision/cpp_int.hpp>
using namespace boost::multiprecision;
int128_t v = 1;
This is better than strings and arrays, especially if you need to do arithmetic operations with it.
Your question has two parts.
1. 128-bit integer. As suggested by @PatrikBeck, boost::multiprecision is a good way to handle really big integers.
2. A variable to store a UUID / GUID / CLSID or whatever you call it. In this case boost::multiprecision is not a good idea. You need a GUID structure, which is designed for that purpose. Since the cross-platform tag was added, you can simply copy that structure into your code and make it like:
struct GUID
{
    uint32_t Data1;
    uint16_t Data2;
    uint16_t Data3;
    uint8_t Data4[8];
};
This format is defined by Microsoft for internal reasons; you can even simplify it to:
struct GUID
{
    uint8_t Data[16];
};
You will get better performance from a simple structure than from an object that can handle a bunch of different things. You don't need to do math with GUIDs anyway, so you don't need any fancy object.
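As a rough sketch of what the simplified structure might look like in use, here is a 16-byte wrapper with comparison and hashing. The names Uuid and UuidHash are made up for illustration and do not come from any library; the hash mixing constant is an arbitrary choice, not a recommendation.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>

// Hypothetical 16-byte UUID wrapper (illustrative name and layout).
struct Uuid {
    std::array<std::uint8_t, 16> bytes{};

    // Byte-wise equality and lexicographic ordering come from std::array.
    friend bool operator==(const Uuid& a, const Uuid& b) { return a.bytes == b.bytes; }
    friend bool operator!=(const Uuid& a, const Uuid& b) { return !(a == b); }
    friend bool operator<(const Uuid& a, const Uuid& b) { return a.bytes < b.bytes; }
};

// Hypothetical hasher so Uuid can be a key in unordered containers:
// load the two 64-bit halves with memcpy and mix them.
struct UuidHash {
    std::size_t operator()(const Uuid& u) const noexcept {
        std::uint64_t hi = 0, lo = 0;
        std::memcpy(&hi, u.bytes.data(), 8);
        std::memcpy(&lo, u.bytes.data() + 8, 8);
        return std::hash<std::uint64_t>{}(hi ^ (lo * 0x9E3779B97F4A7C15ULL));
    }
};
```

With this, something like std::unordered_map<Uuid, Value, UuidHash> or std::set<Uuid> should work without any big-integer machinery.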
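I would recommend using std::bitset<128> (you can always do something like using UUID = std::bitset<128>;). It will probably have a similar memory layout to the custom struct proposed in the other answers, but you won't need to define your own equality comparison or hash, since std::bitset provides operator== and a std::hash specialization. Note that it has no operator<, so ordered containers still need a custom comparator.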
There is no 128-bit integer in Visual C++ because the Microsoft x64 calling convention returns integer values in a single 64-bit register (RAX), so there is no register pair in which to return a 128-bit value. This presents a constant headache, because when you multiply two integers together the result is a two-word integer. Most load-and-store machines support working with integers two CPU words wide, but working with four words requires a software hack, so a 32-bit CPU cannot process 128-bit integers natively, and 8-bit and 16-bit CPUs can't do 64-bit integers without a rather costly software hack. 64-bit CPUs can and regularly do work with 128-bit values, because multiplying two 64-bit integers yields a 128-bit integer, so GCC from version 4.6 onward does support 128-bit integers on such targets. This presents a problem when writing portable code, because you have to do an ugly hack where you return one 64-bit word in the return register and pass the other one out through a reference or pointer. For example, in order to print a floating-point number fast with Grisu we use 128-bit unsigned multiplication as follows:
#include <cstdint>
#if defined(_MSC_VER) && defined(_M_AMD64)
#define USING_VISUAL_CPP_X64 1
#include <intrin.h>
#include <intrin0.h>
#pragma intrinsic(_umul128)
#elif (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6))
#if defined(__x86_64__)
#define USING_GCC 1  // unsigned __int128 is available on this target
#endif
#endif

// Fragment of the multiplication routine. UI8 is evidently a 64-bit unsigned
// type and UIH a 128-bit unsigned type; f/e and rhs_f/rhs_e are presumably the
// significand and exponent of the two operands.
#if USING_VISUAL_CPP_X64
    UI8 h;
    UI8 l = _umul128(f, rhs_f, &h);  // low word returned, high word via &h
    if (l & (UI8(1) << 63)) // rounding
        h++;
    return TBinary(h, e + rhs_e + 64);
#elif USING_GCC
    UIH p = static_cast<UIH>(f) * static_cast<UIH>(rhs_f);
    UI8 h = p >> 64;
    UI8 l = static_cast<UI8>(p);
    if (l & (UI8(1) << 63)) // rounding
        h++;
    return TBinary(h, e + rhs_e + 64);
#else
    // Portable fallback: build the high word from four 32x32-bit products.
    const UI8 M32 = 0xFFFFFFFF;
    const UI8 a = f >> 32;
    const UI8 b = f & M32;
    const UI8 c = rhs_f >> 32;
    const UI8 d = rhs_f & M32;
    const UI8 ac = a * c;
    const UI8 bc = b * c;
    const UI8 ad = a * d;
    const UI8 bd = b * d;
    UI8 tmp = (bd >> 32) + (ad & M32) + (bc & M32);
    tmp += 1U << 31; /// mult_round
    return TBinary(ac + (ad >> 32) + (bc >> 32) + (tmp >> 32), e + rhs_e + 64);
#endif
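To make the idea easier to try in isolation, here is a minimal sketch of "high 64 bits of a 64x64-bit product", once with only 64-bit arithmetic and once through unsigned __int128 where the compiler advertises it via __SIZEOF_INT128__. The function names are hypothetical, and unlike the Grisu fragment this sketch does no rounding.

```cpp
#include <cstdint>

// High 64 bits of x * y using only 64-bit arithmetic: split both operands
// into 32-bit halves and add up the four partial products.
inline std::uint64_t mulhi64_portable(std::uint64_t x, std::uint64_t y) {
    const std::uint64_t M32 = 0xFFFFFFFFu;
    const std::uint64_t a = x >> 32, b = x & M32;
    const std::uint64_t c = y >> 32, d = y & M32;
    const std::uint64_t ac = a * c, bc = b * c, ad = a * d, bd = b * d;
    // Carry out of the middle 32-bit column; the sum cannot overflow 64 bits.
    const std::uint64_t mid = (bd >> 32) + (ad & M32) + (bc & M32);
    return ac + (ad >> 32) + (bc >> 32) + (mid >> 32);
}

// Same result through the compiler's 128-bit type when it is available
// (GCC and Clang define __SIZEOF_INT128__ on such targets).
inline std::uint64_t mulhi64(std::uint64_t x, std::uint64_t y) {
#if defined(__SIZEOF_INT128__)
    return static_cast<std::uint64_t>((static_cast<unsigned __int128>(x) * y) >> 64);
#else
    return mulhi64_portable(x, y);
#endif
}
```

Comparing the two functions on a few inputs is a cheap way to check the fallback path before relying on it on a compiler without a 128-bit type.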
Use the TBigInt template and set the bit width in the template arguments: TBigInt<128,true> is a signed 128-bit integer and TBigInt<128,false> is an unsigned 128-bit integer.
Hope that helps; this may be a late reply, and someone else may have found this method already.
TBigInt is a structure defined by Unreal Engine. It provides a multi-bit integer override.
Basic usage (as far as I can tell):
#include <Math/BigInt.h>
void foo() {
    TBigInt<128, true> signed128bInt = 0;
    TBigInt<128, false> unsigned128bInt = 0;
}
I want to multiply a 57-bit integer by an 11-bit integer. The result can be up to 68 bits, so I'm planning to split it into two separate integers. I cannot use any library, and it should be as simple as possible because the code will be translated to VHDL.
I found some approaches online, but none of them meet my criteria. I want to split the result into a 60-bit lower part and an 8-bit higher part.
int main() {
    unsigned long long int log2 = 0b101100010111001000010111111101111101000111001111011110011;
    unsigned short int absE;
    unsigned int result_M;
    unsigned long long int result_L;
    result_L = absE * log2;
    result_M = 0;
}
signal absE : std_logic_vector(10 downto 0);
signal log2 : std_logic_vector(57 downto 0) := "101100010111001000010111111101111101000111001111011110011";
signal result: std_logic_vector(67 downto 0);
result <= absE * log2;
You can split the 57-bit value into smaller chunks to perform the multiplications and recombine into the required parts, for example 8+49 bits:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
int main() {
#define MASK(n) ((1ULL << (n)) - 1)
uint64_t log2 = MASK(57); // 57 bits
uint16_t absE = MASK(11); // 11 bits
uint32_t m1 = (log2 >> 49) * absE; // middle 19 bits at offset 49;
uint64_t m0 = (log2 & MASK(49)) * absE + ((m1 & MASK(11)) << 49); // low 61 bits
uint16_t result_H = (uint16_t)(m1 >> 11) + (uint16_t)(m0 >> 60); // final high 8 bits
uint64_t result_L = m0 & MASK(60);
printf("%#"PRIx64" * %#"PRIx16" = %#"PRIx16"%015"PRIx64"\n", // low part is 60 bits = 15 hex digits
log2, absE, result_H, result_L);
return 0;
}
Output: 0x1ffffffffffffff * 0x7ff = 0xffdfffffffffff801
You may need more steps if you cannot use the 64-bit multiplication used for the 49-bit by 11-bit step.
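The same 8+49 split can be wrapped in a small function, which may be easier to check against a VHDL translation. This is only a sketch: Product68 and mul57x11 are names invented here, and the asserts simply document the assumed input widths.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: the 68-bit product as an 8-bit high part
// and a 60-bit low part.
struct Product68 {
    std::uint8_t high;   // bits 67..60
    std::uint64_t low;   // bits 59..0
};

inline Product68 mul57x11(std::uint64_t x57, std::uint16_t y11) {
    assert(x57 < (1ULL << 57));  // assumed: at most 57 significant bits
    assert(y11 < (1u << 11));    // assumed: at most 11 significant bits
    const std::uint64_t mask49 = (1ULL << 49) - 1;
    const std::uint64_t mask60 = (1ULL << 60) - 1;
    // Top 8 bits of x times y: a 19-bit value with weight 2^49.
    const std::uint32_t top = static_cast<std::uint32_t>((x57 >> 49) * y11);
    // Low 49 bits of x times y (60 bits), plus the low 11 bits of `top`
    // moved to their place; the sum needs at most 61 bits.
    const std::uint64_t bottom =
        (x57 & mask49) * y11 + (static_cast<std::uint64_t>(top & 0x7FFu) << 49);
    Product68 r;
    r.high = static_cast<std::uint8_t>((top >> 11) + (bottom >> 60));
    r.low = bottom & mask60;
    return r;
}
```

For the all-ones inputs used in the answer, this gives high = 0xff and low = 0xdfffffffffff801, matching the printed output.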
__int128 a;
__int128 b;
__int128 c;
uint64_t c_lo;
uint8_t c_hi;
a = 0x789;
b = 0x123456789ABCDEF;
c = a * b;
c_lo = (uint64_t)c & ((UINT64_C(1) << 60) - 1);
c_hi = (unsigned __int128)c >> 60;
You will need the standard library for this. You will need the header file <stdint.h> (<cstdint> in C++), but that shouldn't be a problem when translating into VHDL.
VHDL is different from C; here is a paper on how to implement multiplication. Expand it to as many bits as you need:
If you (or anyone else reading this) are not dealing with arbitrary lengths, you can use a library like this: int512.h
I have 3 unsigned bytes that are coming over the wire separately.
[byte1, byte2, byte3]
I need to convert these to a signed 32-bit value but I am not quite sure how to handle the sign of the negative values.
I thought of copying the bytes to the upper 3 bytes in the int32 and then shifting everything to the right but I read this may have unexpected behavior.
Is there an easier way to handle this?
The representation is using two's complement.
You could use:
uint32_t sign_extend_24_32(uint32_t x) {
    const int bits = 24;
    uint32_t m = 1u << (bits - 1);
    return (x ^ m) - m;
}
This works because:
if the old sign was 1, then the XOR makes it zero and the subtraction will set it and borrow through all higher bits, setting them as well.
if the old sign was 0, the XOR will set it, the subtract resets it again and doesn't borrow so the upper bits stay 0.
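A small sketch that ties the XOR-and-subtract trick back to the three bytes from the question. The byte order (most significant byte first) and the function name are assumptions; swap the arguments if your wire format differs. The subtraction is done in signed arithmetic so that no out-of-range conversion is needed.

```cpp
#include <cstdint>

// Assemble three bytes into a 24-bit two's complement value and
// sign-extend it to 32 bits.
inline std::int32_t int24_to_int32(std::uint8_t msb, std::uint8_t mid, std::uint8_t lsb) {
    const std::uint32_t raw = (static_cast<std::uint32_t>(msb) << 16) |
                              (static_cast<std::uint32_t>(mid) << 8) |
                              static_cast<std::uint32_t>(lsb);
    const std::uint32_t sign = 1u << 23;  // position of the 24-bit sign bit
    // raw ^ sign lies in [0, 2^24), so both operands fit in int32_t and the
    // difference lies in [-2^23, 2^23).
    return static_cast<std::int32_t>(raw ^ sign) - static_cast<std::int32_t>(sign);
}
```

For example, the bytes FF FF FF come out as -1 and 80 00 00 as -8388608.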
Templated version
template<class T>
T sign_extend(T x, const int bits) {
    T m = 1;
    m <<= bits - 1;
    return (x ^ m) - m;
}
Assuming both representations are two's complement, simply
upper_byte = (Signed_byte(incoming_msb) >= 0? 0 : Byte(-1));
where
using Signed_byte = signed char;
using Byte = unsigned char;
and upper_byte is a variable representing the missing fourth byte.
The conversion to Signed_byte is formally implementation-dependent, but a two's complement implementation doesn't have a choice, really.
You could let the compiler handle the sign extension itself, assuming that the least significant byte is byte1 and the most significant byte is byte3:
int val = (signed char) byte3; // C guarantees the sign extension
val *= 65536; // same as shifting left by 16, but well defined for negative values before C++20
val |= ((int) (unsigned char) byte2) << 8; // place the second byte
val |= ((int) (unsigned char) byte1); // and the least significant one
I have used C style cast here when static_cast would have been more C++ish, but as an old dinosaur (and Java programmer) I find C style cast more readable for integer conversions.
This is a pretty old question, but I recently had to do the same (while dealing with 24-bit audio samples), and wrote my own solution for it. It's using a similar principle as this answer, but more generic, and potentially generates better code after compiling.
template <size_t Bits, typename T>
inline constexpr T sign_extend(const T& v) noexcept {
    static_assert(std::is_integral<T>::value, "T is not integral");
    static_assert((sizeof(T) * 8u) >= Bits, "T is smaller than the specified width");
    if constexpr ((sizeof(T) * 8u) == Bits) return v;
    else {
        using S = struct { signed Val : Bits; };
        return reinterpret_cast<const S*>(&v)->Val;
    }
}
This has no hard-coded math, it simply lets the compiler do the work and figure out the best way to sign-extend the number. With certain widths, this can even generate a native sign-extension instruction in the assembly, such as MOVSX on x86.
This function assumes you copied your N-bit number into the lower N bits of the type you want to extend it to. So for example:
int16_t a = -42;
int32_t b{};
memcpy(&b, &a, sizeof(a));
b = sign_extend<16>(b);
Of course it works for any number of bits, extending it to the full width of the type that contained the data.
Here's a method that works for any bit count, even if it's not a multiple of 8. This assumes you've already assembled the 3 bytes into an integer value.
const int bits = 24;
int mask = (1 << bits) - 1;
bool is_negative = (value & ~(mask >> 1)) != 0;
value |= -is_negative & ~mask;
You can use a bitfield
template<size_t L>
inline int32_t sign_extend_to_32(const char *x)
{
    struct {int32_t i: L;} s;
    memcpy(&s, x, 3);
    return s.i;
    // or, building the value from the bytes (assume little endian):
    // return s.i = ((unsigned char)x[2] << 16) | ((unsigned char)x[1] << 8) | (unsigned char)x[0];
}
Easy and no undefined behavior invoked
int32_t r = sign_extend_to_32<24>(your_3byte_array);
Of course copying the bytes to the upper 3 bytes in the int32 and then shifting everything to the right as you thought is also a good idea. There's no undefined behavior if you use memcpy like above. An alternative is reinterpret_cast in C++ and union in C, which can avoid the use of memcpy. However there's an implementation defined behavior because right shift is not always a sign-extension shift (although almost all modern compilers do that)
Assuming your 24bit value is stored in variable int32_t val, you can easily extend the sign by following:
val = (val << 8) >> 8;
I'm working on generating different types of Gradient Noise. One of the things that this noise requires is the generation of random vectors given a position vector.
This position vector could be anything from a single int, or a 2D position, 3D position, 4D position etc.
On top of this, an additional "seed" value is needed.
What's required is a hash of these n+1 integers into a unique integer with which I can seed a PRNG. It's important that it's these values as I need to be able to retrieve the original seed every time the same values are used.
So far I've tried an implementation of Fowler–Noll–Vo; but it was way too slow for my purposes.
I've also tried using successive calls to a pairing function:
int pairing_function(int x, int y)
{
    return(0.5*(x+y)*(x+y+1) + x);
}
int hash = pairing_function(pairing_function(x,y),seed);
But what seems to happen is that with a large enough seed, the values overflow the size of an int (or even larger types).
What's a good method to achieve what I'm trying to do here? What's important is speed over any cryptographic concerns as well as not returning numbers larger than my original data types.
I'm using C++ but so long as any code is readable I can nut it out.
It is strange that FNV would be way too slow, because it is just one XOR and one integer multiplication per byte of data. According to Wikipedia, it is "designed to be fast to compute".
If you want something really quick, you can try these implementations, where the multiplication is coded as shifts and additions :
Dan Bernstein's implementation:
unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;
    while ((c = *str++))
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
    return hash;
}
sdbm implementation (hash(i) = hash(i - 1) * 65599 + str[i]):
static unsigned long
sdbm(unsigned char *str)
{
    unsigned long hash = 0;
    int c;
    while ((c = *str++))
        hash = c + (hash << 6) + (hash << 16) - hash;
    return hash;
}
References "Hash Functions" from cse.yorku.ca
It sounds like FNV you used might have been inefficient because of the way it was used. Here's (I think, I haven't tested it) the same thing in a way that could be trivially inlined.
inline uint32_t hash(uint32_t h, uint32_t x) {
    for (int i = 0; i < 4; i++) {
        h ^= x & 255;
        x >>= 8;
        h = (h << 24) + h * 0x193;  // h * 16777619, the 32-bit FNV prime
    }
    return h;
}
I think calling hash(hash(2166136261, seed), x) or hash(hash(hash(2166136261, seed), x), y) should give you the same result (assuming little-endian) as a library function.
However, to speed that up at the cost of hash quality, you might try a change like this:
inline uint32_t hash(uint32_t h, uint32_t x) {
    for (int i = 0; i < 2; i++) {
        h ^= x & 65535;
        x >>= 16;
        h = (h << 24) + h * 0x193;
    }
    return h;
}
or even:
inline uint32_t hash(uint32_t h, uint32_t x) {
    h ^= x;
    h = (h << 24) + h * 0x193;
    return h;
}
These changes weaken the low-order bits somewhat, so you'll want to follow standard practice and use the high-order bits preferentially. For example, if you need only 16 bits, then shift the final result right by 16 rather than masking it with 0xffff.
The h = ... line will regularly overflow an int, though, and it relies on the standard mod-2**32 behaviour. If that's a problem then you'll want to replace that line with something different and perhaps accept fewer useful bits in your hash. Maybe h = (h >> 4) + (h & 0x7fffff) * 0x193; but that's just a random tweak and I haven't checked it for hash quality.
I will challenge you on
So far I've tried an implementation of Fowler–Noll–Vo; but it was way too slow for my purposes.
as in some simple benchmarks I've done the FNV hash is the fastest. I assume you have benchmarks for all hashes you've tried?
For the benchmark I simply measured the time taken for 1 billion hashes of various algorithms in MSVC++ 2013, using two 32-bit unsigned ints as input:
FNV (32-bit) = 222M hashes/sec
Your pairing_function() = 175M hashes/sec
Simple Hash x + (y << 10) = 170M hashes/sec
Your hash() function using pairing_function() = 167M hashes/sec
Dan Bernstein = 101M hashes/sec
Obviously these are very basic benchmark results and I wouldn't necessarily trust them all that much. I wouldn't be surprised to see some algorithms run faster/slower on different platforms and compilers.
Overall though, while FNV is the fastest in this case there is only a factor of two difference between the fastest and slowest. If this really makes a difference in your case I would suggest taking another look at your problem to see if it can be redesigned to not need the hash or at least reduce the dependence on the hash speed.
Note: I changed your pairing function to:
int pairing_function(int x, int y)
{
    return((x+y)*(x+y+1)/2 + x);
}
for the above benchmarks. Using your version results in a conversion to/from double which makes it x5 slower and your hash() function x8 slower.
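If the overflow mentioned in the question is the main worry, one option is to do the same pairing in unsigned 64-bit arithmetic, where overflow wraps around instead of being undefined. This is only a sketch with invented names; once values wrap, the result is no longer a unique pairing, which is probably acceptable when it only seeds a PRNG.

```cpp
#include <cstdint>

// Cantor-style pairing in unsigned 64-bit arithmetic: integer-only, and any
// overflow wraps modulo 2^64 rather than invoking undefined behaviour.
inline std::uint64_t pair_u64(std::uint64_t x, std::uint64_t y) {
    const std::uint64_t s = x + y;
    return s * (s + 1) / 2 + x;
}

// Hypothetical helper mirroring the question's nested call:
// pair the coordinates first, then fold in the seed.
inline std::uint64_t hash_xy_seed(std::uint32_t x, std::uint32_t y, std::uint32_t seed) {
    return pair_u64(pair_u64(x, y), seed);
}
```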
For the FNV hash I found a source online and modified it from there to work directly on 2 integers (assumes a 32-bit integer):
#define FNV_32_PRIME 16777619u
unsigned int FNVHash32(const int input1, const int input2)
{
    unsigned int hash = 2166136261u;
    const unsigned char* pBuf = (const unsigned char *) &input1;
    for (int i = 0; i < 4; ++i)
    {
        hash *= FNV_32_PRIME;
        hash ^= *pBuf++;
    }
    pBuf = (const unsigned char *) &input2;
    for (int i = 0; i < 4; ++i)
    {
        hash *= FNV_32_PRIME;
        hash ^= *pBuf++;
    }
    return hash;
}
Since FNV just works on bytes you can extend this to work with any number of integers or other data.
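As a sketch of that extension, here is a version that hashes a seed plus any number of coordinates. Note that it uses the FNV-1a ordering (XOR first, then multiply), which differs from the multiply-then-XOR loop above, and it feeds each integer least significant byte first so the result does not depend on host endianness. The function names are made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <initializer_list>

// 32-bit FNV-1a over a byte range.
inline std::uint32_t fnv1a_bytes(const unsigned char* data, std::size_t n) {
    std::uint32_t h = 2166136261u;      // FNV offset basis
    for (std::size_t i = 0; i < n; ++i) {
        h ^= data[i];
        h *= 16777619u;                 // FNV prime
    }
    return h;
}

// Hash a seed followed by n coordinates (1D, 2D, 3D, ...).
inline std::uint32_t hash_coords(std::uint32_t seed, std::initializer_list<std::int32_t> coords) {
    std::uint32_t h = 2166136261u;
    auto mix = [&h](std::uint32_t v) {
        for (int i = 0; i < 4; ++i) {   // least significant byte first
            h ^= (v >> (8 * i)) & 0xFFu;
            h *= 16777619u;
        }
    };
    mix(seed);
    for (std::int32_t c : coords) mix(static_cast<std::uint32_t>(c));
    return h;
}
```

A call such as hash_coords(seed, {x, y, z}) then gives a repeatable 32-bit value to seed the PRNG with.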
I have 8 bool variables, and I want to "merge" them into a byte.
Is there an easy/preferred method to do this?
How about the other way around, decoding a byte into 8 separate boolean values?
I come in assuming it's not an unreasonable question, but since I couldn't find relevant documentation via Google, it's probably another one of those "nonono all your intuition is wrong" cases.
The hard way:
unsigned char ToByte(bool b[8])
{
    unsigned char c = 0;
    for (int i=0; i < 8; ++i)
        if (b[i])
            c |= 1 << i;
    return c;
}
void FromByte(unsigned char c, bool b[8])
{
    for (int i=0; i < 8; ++i)
        b[i] = (c & (1<<i)) != 0;
}
Or the cool way:
struct Bits
{
    unsigned b0:1, b1:1, b2:1, b3:1, b4:1, b5:1, b6:1, b7:1;
};
union CBits
{
    Bits bits;
    unsigned char byte;
};
Then you can assign to one member of the union and read from another. But note that the order of the bits in Bits is implementation defined.
Note that reading one union member after writing another is well-defined in ISO C99, and as an extension in several major C++ implementations (including MSVC and GNU-compatible C++ compilers), but is Undefined Behaviour in ISO C++. memcpy or C++20 std::bit_cast are the safe ways to type-pun in portable C++.
(Also, the bit-order of bitfields within a char is implementation defined, as is possible padding between bitfield members.)
You might want to look into std::bitset. It allows you to compactly store booleans as bits, with all of the operators you would expect.
No point fooling around with bit-flipping and whatnot when you can abstract away.
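A minimal sketch of the std::bitset route, assuming flag i should map to bit i of the byte. The helper names are invented for this example.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Pack eight bools into a byte; flags[i] becomes bit i.
inline std::uint8_t pack_bits(const bool (&flags)[8]) {
    std::bitset<8> bits;
    for (std::size_t i = 0; i < 8; ++i) bits[i] = flags[i];
    return static_cast<std::uint8_t>(bits.to_ulong());
}

// Unpack a byte back into eight bools using the same bit order.
inline void unpack_bits(std::uint8_t byte, bool (&flags)[8]) {
    const std::bitset<8> bits(byte);
    for (std::size_t i = 0; i < 8; ++i) flags[i] = bits[i];
}
```

If the flags never need to exist as separate bool variables, keeping a std::bitset<8> as the only storage avoids the conversion entirely.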
The cool way (using the multiplication technique)
inline uint8_t pack8bools(bool* a)
{
    uint64_t t;
    memcpy(&t, a, sizeof t); // strict-aliasing & alignment safe load
    return 0x8040201008040201ULL*t >> 56;
    // bit order: a[0]<<7 | a[1]<<6 | ... | a[7]<<0 on little-endian
    // for a[0] => LSB, use 0x0102040810204080ULL on little-endian
}
void unpack8bools(uint8_t b, bool* a)
{
    // on little-endian, a[0] = (b>>7) & 1 like printing order
    auto MAGIC = 0x8040201008040201ULL; // for opposite order, byte-reverse this
    auto MASK = 0x8080808080808080ULL;
    uint64_t t = ((MAGIC*b) & MASK) >> 7;
    memcpy(a, &t, sizeof t); // store 8 bytes without UB
}
Assuming sizeof(bool) == 1
To portably do LSB <-> a[0] (like the pext/pdep version below) instead of using the opposite of host endianness, use htole64(0x0102040810204080ULL) as the magic multiplier in both versions. (htole64 is from BSD / GNU <endian.h>). That arranges the multiplier bytes to match little-endian order for the bool array. htobe64 with the same constant gives the other order, MSB-first like you'd use for printing a number in base 2.
You may want to make sure that the bool array is 8-byte aligned (alignas(8)) for performance, and that the compiler knows this. memcpy is always safe for any alignment, but on ISAs that require alignment, a compiler can only inline memcpy as a single load or store instruction if it knows the pointer is sufficiently aligned. *(uint64_t*)a would promise alignment, but also violate the strict-aliasing rule. Even on ISAs that allow unaligned loads, they can be faster when naturally aligned. But the compiler can still inline memcpy without seeing that guarantee at compile time.
How they work
Suppose we have 8 bools b[0] to b[7] whose least significant bits are named a-h respectively, and that we want to pack into a single byte. Treating those 8 consecutive bools as one 64-bit word and loading it, we get the bits in reversed order on a little-endian machine. Now we do a multiplication (here dots are zero bits):
  |  b7  ||  b6  ||  b5  ||  b4  ||  b3  ||  b2  ||  b1  ||  b0  |
  .......h.......g.......f.......e.......d.......c.......b.......a
× 1000000001000000001000000001000000001000000001000000001000000001
+ ↑...e....↑..d.....↑.c......↑b.......a
= abcdefghxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
The arrows are added so it's easier to see the position of the set bits in the magic number. At this point the 8 least significant bits have been put in the top byte, so we just need to shift the remaining bits out.
So the magic number for packing would be 0b1000000001000000001000000001000000001000000001000000001000000001 or 0x8040201008040201. If you're on a big endian machine you'll need to use the magic number 0x0102040810204080 which is calculated in a similar manner
For unpacking we can do a similar multiplication
  |  b7  ||  b6  ||  b5  ||  b4  ||  b3  ||  b2  ||  b1  ||  b0  |
× 1000000001000000001000000001000000001000000001000000001000000001
= h0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh
& 1000000010000000100000001000000010000000100000001000000010000000
= h0000000g0000000f0000000e0000000d0000000c0000000b0000000a0000000
After multiplying, the needed bits sit at the most significant position of each byte, so we mask out the irrelevant bits and shift the remaining ones down to the least significant positions. The output is the eight bytes containing a to h, in little-endian order.
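To check the arithmetic independently of host endianness, here is a sketch that builds the 64-bit word with shifts instead of memcpy, so a[i] always lands in byte i. The loops defeat the performance purpose of the trick; the point is only to verify the magic constants. The function names are invented for this example.

```cpp
#include <cstdint>

constexpr std::uint64_t kMagic = 0x8040201008040201ULL;
constexpr std::uint64_t kMask  = 0x8080808080808080ULL;

// Pack: a[0] ends up in bit 7 of the result, a[7] in bit 0.
inline std::uint8_t pack8_mul(const bool (&a)[8]) {
    std::uint64_t t = 0;
    for (int i = 0; i < 8; ++i)
        t |= static_cast<std::uint64_t>(a[i]) << (8 * i);  // a[i] in byte i
    // No two partial products land on the same bit, so no carries occur.
    return static_cast<std::uint8_t>((kMagic * t) >> 56);
}

// Unpack: the inverse mapping, a[0] = bit 7 of b, a[7] = bit 0 of b.
inline void unpack8_mul(std::uint8_t b, bool (&a)[8]) {
    const std::uint64_t t = ((kMagic * b) & kMask) >> 7;
    for (int i = 0; i < 8; ++i)
        a[i] = ((t >> (8 * i)) & 1u) != 0;                 // read byte i
}
```

Running unpack followed by pack over all 256 byte values should reproduce every input.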
The efficient way
On newer x86 CPUs with BMI2 there are PEXT and PDEP instructions for this purpose. The pack8bools function above can be replaced with
_pext_u64(*((uint64_t*)a), 0x0101010101010101ULL);
And the unpack8bools function can be implemented as
_pdep_u64(b, 0x0101010101010101ULL);
(This maps LSB -> LSB, like a 0x0102040810204080ULL multiplier constant, opposite of 0x8040201008040201ULL. x86 is little-endian: a[0] = (b>>0) & 1; after memcpy.)
Unfortunately those instructions are very slow on AMD before Zen 3 so you may need to compare with the multiplication method above to see which is better
The other fast way is SSE2
x86 SIMD has an operation that takes the high bit of every byte (or float or double) in a vector register, and gives it to you as an integer. The instruction for bytes is pmovmskb. This can of course do 16 bytes at a time with the same number of instructions, so it gets better than the multiply trick if you have lots of this to do.
#include <immintrin.h>
inline uint8_t pack8bools_SSE2(const bool* a)
{
    __m128i v = _mm_loadl_epi64( (const __m128i*)a ); // 8-byte load, despite the pointer type.
    // __m128i v = _mm_cvtsi64_si128( uint64 ); // alternative if you already have an 8-byte integer
    v = _mm_slli_epi32(v, 7); // low bit of each byte becomes the highest
    return _mm_movemask_epi8(v);
}
There isn't a single instruction to unpack until AVX-512, which has mask-to-vector instructions. It is doable with SIMD, but likely not as efficiently as the multiply trick. See Convert 16 bits mask to 16 bytes mask and more generally is there an inverse instruction to the movemask instruction in intel avx2? for unpacking bitmaps to other element sizes.
How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD has some answers specifically for 8-bits -> 8-bytes, but if you can't do 16 bits at a time for that direction, the multiply trick is probably better, and pext certainly is (except on CPUs where it's disastrously slow, like AMD before Zen 3).
#include <stdint.h> // to get the uint8_t type
uint8_t GetByteFromBools(const bool eightBools[8])
{
    uint8_t ret = 0;
    for (int i=0; i<8; i++) if (eightBools[i] == true) ret |= (1<<i);
    return ret;
}
void DecodeByteIntoEightBools(uint8_t theByte, bool eightBools[8])
{
    for (int i=0; i<8; i++) eightBools[i] = ((theByte & (1<<i)) != 0);
}
bool a,b,c,d,e,f,g,h;
//do stuff
char y= a<<7 | b<<6 | c<<5 | d<<4 | e <<3 | f<<2 | g<<1 | h;//merge
although you are probably better off using a bitset
I'd like to note that type punning through unions is UB in C++ (as rodrigo does in his answer). The safest way to do that is memcpy():
struct Bits
{
    unsigned b0:1, b1:1, b2:1, b3:1, b4:1, b5:1, b6:1, b7:1;
};
unsigned char toByte(Bits b){
    unsigned char ret;
    memcpy(&ret, &b, 1);
    return ret;
}
As others have said, the compiler is smart enough to optimize out memcpy().
BTW, this is the way that Boost does type punning.
There is no way to pack 8 bool variables into one byte. There is a way to pack 8 logical true/false states into a single byte using bitmasking.
You would use the bitwise shift operation and casting to achieve it. A function could work like this:
unsigned char toByte(bool *bools)
{
    unsigned char byte = 0;
    for(int i = 0; i < 8; ++i) byte |= ((unsigned char) bools[i]) << i;
    return byte;
}
Thanks Christian Rau for the corrections!
I am stuck with a problem. I am working on a hardware which only does support 32 bit operations.
sizeof(int64_t) is 4. Sizeof(int) is 4.
and I am porting an application which assumes size of int64_t to be 8 bytes. The problem is it has this macro
#define BIG_MULL(a,b) ( (int64_t)(a) * (int64_t)(b) >> 23)
The result is always a 32-bit integer, but since my system doesn't support 64-bit operations, it always returns the low 32 bits of the product, truncating all the results and making my system crash.
Can someone help me out?
Vikas Gupta
You simply cannot reliably store 64 bits of data in a 32-bit integer. You either have to redesign the software to work with 32-bit integers as the maximum size available or provide a way of providing 64 bits of storage for the 64-bit integers. Neither is simple - to be polite about it.
One possibility - not an easy one - is to create a structure:
typedef struct { uint32_t msw; uint32_t lsw; } INT64_t;
You can then store the data in the two 32-bit integers, and do arithmetic with components of the structure. Of course, in general, a 32-bit by 32-bit multiply produces a 64-bit answer; to do full multiplication without overflowing, you may be forced to store 4 16-bit unsigned numbers (because 16-bit numbers can be multiplied to give 32-bit results w/o overflowing). You will use functions to do the hard work - so the macro becomes a call to a function that accepts two (pointers to?) the INT64_t structure and returns one.
It won't be as fast as before...but it has some chance of working if they used the macros everywhere that was necessary.
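As a sketch of the "functions do the hard work" idea, here is one way to compute (a * b) >> 23 for signed 32-bit operands using only 32-bit arithmetic. It builds the full 64-bit product from 16-bit halves, then corrects the high word for negative operands. The names umul32 and big_mull are invented, and the final conversion back to a signed value assumes two's complement, which is guaranteed from C++20 and is what mainstream compilers do anyway.

```cpp
#include <cstdint>

// Full 32x32 -> 64-bit unsigned product, returned as two 32-bit words.
// Every intermediate fits in 32 bits because the factors are 16-bit halves.
inline void umul32(std::uint32_t a, std::uint32_t b, std::uint32_t& hi, std::uint32_t& lo) {
    const std::uint32_t al = a & 0xFFFFu, ah = a >> 16;
    const std::uint32_t bl = b & 0xFFFFu, bh = b >> 16;
    const std::uint32_t ll = al * bl, lh = al * bh, hl = ah * bl, hh = ah * bh;
    const std::uint32_t mid = (ll >> 16) + (lh & 0xFFFFu) + (hl & 0xFFFFu);
    lo = (ll & 0xFFFFu) | (mid << 16);
    hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);
}

// Replacement for ((int64_t)a * (int64_t)b) >> 23, valid when the shifted
// result fits in 32 bits (as the question states).
inline std::int32_t big_mull(std::int32_t a, std::int32_t b) {
    const std::uint32_t ua = static_cast<std::uint32_t>(a);
    const std::uint32_t ub = static_cast<std::uint32_t>(b);
    std::uint32_t hi, lo;
    umul32(ua, ub, hi, lo);
    // Turn the unsigned product into the two's complement signed product.
    if (a < 0) hi -= ub;
    if (b < 0) hi -= ua;
    // Bits 23..54 of the 64-bit product.
    const std::uint32_t r = (hi << 9) | (lo >> 23);
    return static_cast<std::int32_t>(r);
}
```

Like the original macro with an arithmetic shift, this rounds toward negative infinity, so big_mull(-1, 1) gives -1 rather than 0.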
I assume that the numbers that you are trying to multiply together are 32-bit integers. You just want to generate a product that may be larger than 32 bits. You then want to drop some known number of least significant bits from the product.
As a start, this will multiply the two integers together and overflow.
#define WORD_MASK ((1<<16) - 1)
#define LOW_WORD(x) (x & WORD_MASK)
#define HIGH_WORD(x) ((x & (WORD_MASK<<16)) >> 16)
#define BIG_MULL(a, b) \
((LOW_WORD(a) * LOW_WORD(b)) << 0) + \
((LOW_WORD(a) * HIGH_WORD(b)) << 16) + \
((HIGH_WORD(a) * LOW_WORD(b)) << 16) + \
((HIGH_WORD(a) * HIGH_WORD(b)) << 32)
If you want to drop the 23 least-significant bits from this, you could adjust it like so.
#define WORD_MASK ((1<<16) - 1)
#define LOW_WORD(x) (x & WORD_MASK)
#define HIGH_WORD(x) ((x & (WORD_MASK<<16)) >> 16)
#define BIG_MULL(a, b) \
((LOW_WORD(a) * HIGH_WORD(b)) >> 7) + \
((HIGH_WORD(a) * LOW_WORD(b)) >> 7) + \
((HIGH_WORD(a) * HIGH_WORD(b)) << 9)
Note that this will still overflow if the actual product of the multiplication is greater than 41 (=64-23) bits.
I have adjusted the code to handle signed integers.
#define LOW_WORD(x) (((x) << 16) >> 16)
#define HIGH_WORD(x) ((x) >> 16)
#define ABS(x) (((x) >= 0) ? (x) : -(x))
#define SIGN(x) (((x) >= 0) ? 1 : -1)
#define UNSIGNED_BIG_MULT(a, b) \
(((LOW_WORD((a)) * HIGH_WORD((b))) >> 7) + \
((HIGH_WORD((a)) * LOW_WORD((b))) >> 7) + \
((HIGH_WORD((a)) * HIGH_WORD((b))) << 9))
#define BIG_MULT(a, b) \
(UNSIGNED_BIG_MULT(ABS((a)), ABS((b))) * \
SIGN((a)) * \
SIGN((b)))
If you change your macro to
#define BIG_MULL(a,b) ( (int64_t)(a) * (int64_t)(b))
since it looks like int64_t is defined for you it should work
While there are other questions raised by sizeof(int64_t) == 4, this is wrong:
#define BIG_MULL(a,b) ( (int64_t)(a) * (int64_t)(b) >> 23)
The standard requires intN_t types for values of N = 8, 16, 32, and 64... if the platform supports them.
The type you should use is intmax_t, which is defined to be the largest integral type the platform supports. If your platform doesn't have 64-bit integers, your code won't break with intmax_t.
You might want to look at a bignum library such as GNU GMP. In one sense a bignum library is overkill, since they typically support arbitrary sized numbers, not just a increased in fixed size numbers. However, since it's already done, the fact that it does more than you want might not be an issue.
The alternative is to pack a couple 32-bit ints into a struct similar to Microsoft's LARGE_INTEGER:
typedef union _LARGE_INTEGER {
    struct {
        DWORD LowPart;
        LONG HighPart;
    };
    struct {
        DWORD LowPart;
        LONG HighPart;
    } u;
    LONGLONG QuadPart;
} LARGE_INTEGER;
And create functions that take parameters of this type and return results in structs of this type. You could also wrap these operations in a C++ class that will let you define operator overloads that let the expressions look more natural. But I'd look at the already made libraries (like GMP) to see if they can be used - it may save you a lot of work.
I just hope you don't need to implement division using structures like this in straight C - it's painful and runs slow.