The AArch64 crc32{b,h,w,x} instructions take as input a CRC-32 value and a data value (8, 16, 32, or 64 bits, respectively), and output a new CRC-32 value which, presumably, should be passed in as the input to the next crc32 instruction.
To get the same value at the end that the crc32 program produces for a given set of bytes, what does the initial input value have to be? Is there anything else I have to do?
The algorithm is well-described elsewhere, but I can't find examples of using the instructions anywhere.
Through the wonders of trial and error, it seems the initial value for the crc32 accumulator is 0xffffffff (or -1), and to get the standard crc32 value you invert the value returned, so ~crc32.
For example (dwords holds 64-bit values, and the array must be 8-byte aligned):
uint32_t crc32 = 0xffffffff;
for (int i = 0; i < number_of_dwords; i++) {
asm volatile ( "crc32x %w[crc], %w[crcin], %x[data]" : [crc] "=r" (crc32) : [crcin] "r" (crc32), [data] "r" (dwords[i]) );
}
return ~crc32;
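To check results from the instruction-based loop, a software reference can help. The following is a minimal sketch, assuming the crc32 program in question computes the standard zlib-style CRC-32 (reflected polynomial 0xEDB88320). The function name crc32_reference is made up for this example. It uses the same convention as above: start from 0xffffffff and invert at the end.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bit-at-a-time CRC-32 over a byte buffer (slow, but easy to verify).
// Assumption: standard CRC-32 with reflected polynomial 0xEDB88320.
std::uint32_t crc32_reference(const unsigned char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    std::uint32_t crc = 0xFFFFFFFFu;              // initial accumulator value
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // mask is all ones when the low bit is set, otherwise zero
            std::uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;                                  // final inversion
}
```

Feeding the same bytes to this function and to the crc32x loop should give identical values if the initial value and final inversion are right.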
I have a uint64_t variable with some value (for example 0x700a06fffff48517). I want to get a char holding the first (most significant) bit of each byte in the uint64_t (so from 0x700a06fffff48517 I want 0b00011110). Is there a better way than this?
#include <stddef.h>
#include <stdint.h>
char getFirstBits(uint64_t x) {
x >>= 7; // shift to put first bits to last bits in byte
char c = 0;
for (size_t i = 0; i < 8; i++) {
c <<= 1;
c |= x & 1;
x >>= 8;
}
return c;
}
The fastest I can think of on (recent) x86 is
#include <immintrin.h>
uint8_t getFirstBits(uint64_t val) {
return _pext_u64(val, 0x8080808080808080ULL);
}
This is a generic solution that doesn't depend on any particular CPU architecture:
char getFirstBits(uint64_t x) {
x = (ntohll(x) >> 7) & 0x0101010101010101; // get the first bits
return 0x8040201008040201*x >> 56; // move them together
}
This is basically the multiplication technique where bits are moved around using a single multiplication with a magic number. The remaining bitwise operations are for removing the unnecessary bits. ntohll should be htobe64 on *nix. For more details about that technique and what the magic number means read
How to create a byte out of 8 bool values (and vice versa)?
What's the fastest way to pack 32 0/1 values into the bits of a single 32-bit variable?
You can also use SIMD to do it:
How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD
How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?
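As a sketch of the multiplication technique from this answer: the version below replaces ntohll with a hand-written byte swap (byteswap64 is a helper name made up here), so the result does not depend on host byte order. A plain loop is included as a reference to compare against. Result bit i holds the top bit of byte i, counting from the least significant byte, which matches what the _pext_u64 version returns.

```cpp
#include <cstdint>

// Hypothetical helper: unconditional 64-bit byte swap, written portably.
static inline std::uint64_t byteswap64(std::uint64_t v) {
    v = ((v & 0x00FF00FF00FF00FFULL) << 8)  | ((v >> 8)  & 0x00FF00FF00FF00FFULL);
    v = ((v & 0x0000FFFF0000FFFFULL) << 16) | ((v >> 16) & 0x0000FFFF0000FFFFULL);
    return (v << 32) | (v >> 32);
}

// Multiplication technique: isolate the top bit of each byte, then let one
// multiply by the magic constant gather them into the top byte of the product.
std::uint8_t first_bits_mul(std::uint64_t x) {
    x = (byteswap64(x) >> 7) & 0x0101010101010101ULL;
    return static_cast<std::uint8_t>((0x8040201008040201ULL * x) >> 56);
}

// Straightforward reference: bit i of the result is bit 7 of byte i.
std::uint8_t first_bits_loop(std::uint64_t x) {
    std::uint8_t r = 0;
    for (int i = 0; i < 8; ++i)
        r |= static_cast<std::uint8_t>(((x >> (8 * i + 7)) & 1u) << i);
    return r;
}
```

For the value from the question, 0x700a06fffff48517, both functions return 0b00011110.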
It found immintrin.h, but it cannot find _pext_u64 (it found _pext_u32), I guess it's because I'm on 32-bit windows. However, when I use _pext_u32 to process both halves of uint64, it crashes with unknown instruction (seems like my processor doesn't have the instruction).
PEXT is a new instruction in the BMI2 extension, so if your CPU doesn't support BMI2 then you can't use it. In 32-bit mode only the 32-bit version of PEXT is supported, which is why _pext_u64 isn't available.
I am polling a 32-bit register in a motor driver for a value.
Only bits 0-9 are required, the rest need to be ignored.
How do I ignore bits 10-31?
In order to poll the motor driver for a value, I send the location of the register, which sends back the entire 32-bit number. But I only need bits 0-9 to display.
Serial.println(sendData(0x35, 0));
If you want to extract such bits then you must mask the whole integer with a value that keeps just the bits you are interested in.
This can be done with bitwise AND (&) operator, eg:
uint32_t value = reg & 0x3ff;
uint32_t value = reg & 0b1111111111; // binary literals need C++14 (or a compiler extension)
Rather than Serial.println() I'd go with Serial.print().
You can then just print out the specific bits that you're interested in with a for loop.
auto data = sendData(0x35, 0);
for (int i = 9; i >= 0; --i)
Serial.print((data >> i) & 1); // bitwise AND (not &&), most significant bit first
Any other method will result in extra bits being printed since there's no data structure that holds 10 bits.
Do a bitwise AND with a number whose lowest 10 bits are set to 1. This sets all the other bits to 0. For example:
value = value & ((1<<10) - 1);
Or
value = value & 0x3FF;
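The masking approach from these answers can be wrapped in a small helper. This is only a sketch; low_bits is a name invented for the example, and the n >= 32 branch is there because shifting a 32-bit value by 32 is undefined.

```cpp
#include <cstdint>

// Keep only the lowest n bits of value (n in 0..32); everything above is cleared.
constexpr std::uint32_t low_bits(std::uint32_t value, unsigned n) {
    return n >= 32 ? value : value & ((std::uint32_t{1} << n) - 1u);
}
```

With this, the register read would become something like low_bits(sendData(0x35, 0), 10), assuming sendData returns the 32-bit register value.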
I have the following function:
int GetGroup(unsigned bitResult, int iStartPos, int iNumOfBites)
{
return (bitResult >> (iStartPos + 1- iNumOfBites)) & ~(~0 << iNumOfBites);
}
The function returns group of bits from a byte.
i.e. if bitResult=102 (01100110)2, iStartPos=5, iNumOfBites=3
Output: 4 (100)2
For iStartPos=7, iNumOfBites=4
Output: 6 (0110)2
I'm looking for a better / "friendlier" way to do that, e.g. with bitset or something like that. Any suggestions?
(src >> start) & ((1UL << len)-1) // or 1ULL << if you need a 64-bit mask
is one way to express extraction of len bits, starting at start. (In this case, start is the LSB of the range you want. Your function requires the MSB as input.) This expression is from Wikipedia's article on the x86 BMI1 instruction set extensions.
Both ways of producing the mask look risky when len is the full width of the type, though (the corner case of extracting all the bits). A shift by the full width of the type can produce either zero or the unchanged value. (It actually invokes undefined behaviour, but this is what happens in practice if the compiler can't see the count at compile time; x86, for example, masks the shift count down to the 0-31 range for 32-bit shifts.) With 32-bit ints:
If 1 << 32 produces 1, then 1-1 = 0, so the result will be zero.
If ~0 << 32 produces ~0, rather than 0, the mask will be zero.
Remember that 1<<len is undefined behaviour for len too large: unlike writing it as 0x3ffffffffff or whatever, no automatic promotion to long long happens, so the type of the 1 matters.
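One way to sidestep the full-width corner case is to branch on it explicitly. The following is a sketch under that assumption; bit_extract and bit_extract_msb are names chosen for this example, and out-of-range arguments are clamped instead of being left undefined.

```cpp
#include <cstdint>

// Extract len bits of src starting at bit lsb (LSB convention).
// Never shifts by 32 or more, so len == 32 is well defined.
std::uint32_t bit_extract(std::uint32_t src, unsigned lsb, unsigned len) {
    if (lsb >= 32 || len == 0) return 0;
    const std::uint32_t shifted = src >> lsb;
    if (len >= 32) return shifted;                 // all remaining bits requested
    return shifted & ((std::uint32_t{1} << len) - 1u);
}

// Same operation with the MSB convention used in the question:
// bits [msb : msb-len+1], with len clamped to the bits available below msb.
std::uint32_t bit_extract_msb(std::uint32_t src, unsigned msb, unsigned len) {
    if (msb >= 32) msb = 31;
    if (len > msb + 1) len = msb + 1;
    return bit_extract(src, msb + 1 - len, len);
}
```

The extra branches cost a few instructions, so this trades a little speed for defined behaviour on every input.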
I think from your examples you want the bits [iStartPos : iStartPos - iNumOfBites + 1], where bits are numbered from zero.
The main thing I'd change in your function is the naming of the function and variables, and add a comment.
bitResult is the input to the function; don't use "result" in its name.
iStartPos ok, but a little verbose
iNumOfBites Computers have bits and bytes. If you're dealing with bites, you need a doctor (or a dentist).
Also, the return type should probably be unsigned.
// extract bits [msb : msb-len+1] from input into the low bits of the result
unsigned BitExtract(unsigned input, int msb, int len)
{
return (input >> (msb-len + 1)) & ~(~0 << len);
}
If your start-position parameter was the lsb, rather than msb, the expression would be simpler, and the code would be smaller and faster (unless that just makes extra work for the caller). With LSB as a param, BitExtract is 7 instructions, vs. 9 if it's MSB (on x86-64, gcc 5.2).
There's also a machine instruction (introduced with Intel Haswell, and AMD Piledriver) that does this operation. You will get somewhat smaller and slightly faster code by using it. It also uses the LSB, len position convention, not MSB, so you get shorter code with LSB as an argument.
Intel CPUs only know the version that would require loading an immediate into a register first, so when the values are compile-time constants, it doesn't save much compared to simply shifting and masking. e.g. see this post about using it or pextr for RGB32 -> RGB16. And of course it doesn't matter whether the parameter is the MSB or LSB of the desired range, if start and len are both compile time constants.
Only AMD implements a version of bextr that can take the control mask as an immediate constant, but unfortunately it seems gcc 5.2 doesn't use the immediate version for code that uses the intrinsic, even with -march=bdver2 (i.e. bulldozer v2, aka piledriver). (It will generate bextr with an immediate argument on its own in some cases with -march=bdver2.)
I tested it out on godbolt to see what kind of code you'd get with or without bextr.
#include <immintrin.h>
// Intel ICC uses different intrinsics for bextr
// extract bits [msb : msb-len+1] from input into the low bits of the result
unsigned BitExtract(unsigned input, int msb, int len)
{
#ifdef __BMI__ // probably also need to check for __GNUC__
return __builtin_ia32_bextr_u32(input, (len<<8) | (msb-len+1) );
#else
return (input >> (msb-len + 1)) & ~(~0 << len);
#endif
}
It would take an extra instruction (a movzx) to implement a (msb-len+1)&0xff safety check to keep the start byte from spilling into the length byte. I left it out because it's nonsense to ask for a starting bit outside the 0-31 range, let alone the 0-255 range. Since it won't crash, but will just return some other nonsense result, there's not much point.
Anyway, bextr saves quite a few instructions (if BMI2 shlx / shrx isn't available either! -march=native on godbolt is Haswell, and thus includes BMI2 as well.)
But bextr on Intel CPUs decodes to 2 uops (http://agner.org/optimize/), so it's not very useful at all compared to shrx / and, except for saving some code size. pext is actually better for throughput (1 uop / 3c latency), even though it's a way more powerful instruction. It is worse for latency, though. And AMD CPUs run pext very slowly, but bextr as a single uop.
I would probably do something like the following in order to provide additional protections around errors in arguments and to reduce the amount of shifting.
I am not sure if I understood the meaning of the arguments you are using so this may require a bit of tweaking.
And I am not sure if this is necessarily any more efficient since there are a number of decisions and range checks made in the interests of safety.
/*
* Arguments: const unsigned bitResult byte containing the bit field to extract
* const int iStartPos zero based offset from the least significant bit
* const int iNumOfBites number of bits to the right of the starting position
*
* Description: Extract a bitfield beginning at the specified position for the specified
* number of bits returning the extracted bit field right justified.
*/
int GetGroup(const unsigned bitResult, const int iStartPos, const int iNumOfBites)
{
// masks to remove any leading bits that need to disappear.
// we change starting position to be one based so the first element is unused.
const static unsigned bitMasks[] = {0x01, 0x01, 0x03, 0x07, 0x0f, 0x1f, 0x3f, 0x7f, 0xff};
int iStart = (iStartPos > 7) ? 8 : iStartPos + 1;
int iNum = (iNumOfBites > 8) ? 8 : iNumOfBites;
unsigned retVal = (bitResult & bitMasks[iStart]);
if (iStart > iNum) {
retVal >>= (iStart - iNum);
}
return retVal;
}
#pragma pack(push, 1)
struct Bit
{
union
{
uint8_t _value;
struct {
uint8_t _bit0:1; // a named bit-field must be at least 1 bit wide
uint8_t _bit1:1;
uint8_t _bit2:1;
uint8_t _bit3:1;
uint8_t _bit4:1;
uint8_t _bit5:1;
uint8_t _bit6:1;
uint8_t _bit7:1;
};
};
};
#pragma pack(pop)
typedef Bit bit;
struct B
{
union
{
uint32_t _value;
bit bytes[1]; // 1 for Single Byte
};
};
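With a struct and a union you can set struct B's _value to your result, then access bytes[0]._bit0 through bytes[0]._bit7 for each 0 or 1, and vice versa: set each bit, and the result will be in _value. (Reading a union member other than the one last written is formally undefined in C++, although common compilers support it.)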
Sorry about the clumsy title; I couldn't find a better way of expressing what I'm trying to do.
I am getting an input from the user of multiple 32-bit integers. For example, the user may enter the following values (showing in hex for ease of explanation):
0x00001234
0x00005678
0x0000abcd
In this particular case, the first 2 bytes of each input are constant, and the last 2 bytes are variable. For efficiency purposes, I could store 0x0000 as a single constant, and create a vector of uint16_t values to store the variable portion of the input (0x1234, 0x5678, 0xabcd).
Now let's say the user enters the following:
0x00000234
0x56780000
0x00001000
In this case I would need a vector of uint32_t values to store the variable portion of the input as each value affects different bytes.
My current thought is to do the following:
uint32_t myVal = 0;
myVal |= input1;
myVal |= input2;
// ...
And then at the end find the distance between the first and last "toggled" (i.e. 1) bit in myVal. The distance will give me required field size for the variable portion of all of the inputs.
However, this doesn't sound like it would scale well for a large number of user inputs. Any recommendations about an elegant and efficient way of determining this?
Update:
I simplified the problem in my above explanation.
Just to be clear, I am not doing this to save memory (I have better things to do than to try and conserve a few bytes and this isn't for optimization purposes).
In summary, component A provides component B in my system with values. Sometimes these values are 128-bit, but component B only supports 32-bit values.
If the variable portion of the 128-bit value can be expressed with a 32-bit value, I can accept it. Otherwise I will need to reject it with an error.
I'm not in a position to modify component B to allow 128-bit values, or modify component A to prevent its use of 128-bit values (there are hardware limitations here too).
Though I can't see a reason for all that... Why not just compare the input with std::numeric_limits<uint16_t>::max()? If the input is larger, then you need to use uint32_t.
Answering your edit:
I suppose that for better performance you should use hardware-specific low-level instructions. You could iterate over the 32-bit parts of the 128-bit input, add each one to an accumulator, and check whether subtracting the next part leaves the accumulator unchanged (that is, whether the next part is zero). If it does not, skip this 128-bit value; otherwise you end up with the required result. A sample follows:
#include <cstdint>
#include <stdexcept>
uint32_t get_value( uint32_t v1, uint32_t v2, uint32_t v3, uint32_t v4)
{
uint32_t temp = v1;
// each check fails when the next part is non-zero
if ( temp - v2 != temp ) throw std::range_error("value does not fit in 32 bits");
temp += v2; if ( temp - v3 != temp ) throw std::range_error("value does not fit in 32 bits");
temp += v3; if ( temp - v4 != temp ) throw std::range_error("value does not fit in 32 bits");
temp += v4;
return temp;
}
In this C++ example it may look silly, but I believe that in assembly this should process the input stream efficiently.
Store the first full 128 bit number you encounter, then push the lower order 32 bits of it onto a vector, set bool reject_all = false. For each remaining number, if high-order (128-32=96) bits differ from the first number's then set reject_all = true, otherwise push their lower-order bits on the vector. At the end of the loop, use reject_all to decide whether to use the vector of values.
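The approach in the previous paragraph could look roughly like this. It is a sketch only: Word128 (four 32-bit words, index 0 being the low 32 bits) and collect_low_words are hypothetical names, since the question does not say how the 128-bit values are represented.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical layout: [0] is the low 32 bits, [1]..[3] are the upper 96 bits.
using Word128 = std::array<std::uint32_t, 4>;

// Collects the low 32 bits of every input into low_out.
// Returns false (and leaves low_out empty) if any input's upper 96 bits
// differ from those of the first input.
bool collect_low_words(const std::vector<Word128>& inputs,
                       std::vector<std::uint32_t>& low_out) {
    low_out.clear();
    bool reject_all = false;
    for (const Word128& v : inputs) {
        const Word128& first = inputs.front();
        if (v[1] != first[1] || v[2] != first[2] || v[3] != first[3]) {
            reject_all = true;
            break;
        }
        low_out.push_back(v[0]);
    }
    if (reject_all) low_out.clear();
    return !reject_all;
}
```

Note that this only accepts inputs whose variable part lies entirely in the low 32 bits; a variable field elsewhere in the value would need the OR/AND mask approach instead.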
The most efficient way to store a series of unsigned integers in the range [0, (2^32)-1] is by just using uint32_t. Jumping through hoops to save 2 bytes from user input is not worth your time--the user cannot possibly, in his lifetime, enter enough integers that your code would have to start compressing them. He or she would die of old age long before memory constraints became apparent on any modern system.
It looks like you have to come up with a cumulative bitmask -- which you can then look at to see whether you have trailing or leading constant bits. An algorithm that operates on each input will be required (making it an O(n) algorithm, where n is the number of values to inspect).
The algorithm would be similar to something like what you've already done:
unsigned long long bitmask = 0ULL;
std::size_t count = val.size();
for (std::size_t i = 0; i < count; ++i)
bitmask |= val[i];
You can then check to see how many bits/bytes leading/trailing can be made constant, and whether you're going to use the full 32 bits. If you have access to SSE instructions, you can vectorize this using OpenMP.
There's also a possible optimization by short-circuiting to see if the distance between the first 1 bit and the last 1 bit is already greater than 32, in which case you can stop.
For this algorithm to scale better, you're going to have to do it in parallel. Your friend would be vector processing (maybe using CUDA for Nvidia GPUs, or OpenCL if you're on the Mac or on platforms that already support OpenCL, or just OpenMP annotations).
Use
uint32_t ORVal = 0;
uint32_t ANDVal = 0xFFFFFFFF;
ORVal |= input1;
ANDVal &= input1;
ORVal |= input2;
ANDVal &= input2;
ORVal |= input3;
ANDVal &= input3; // etc.
// At end of input...
uint32_t mask = ORVal ^ ANDVal;
// bit positions set to 0 were constant, bit positions set to 1 changed
A bit position in ORVal will be 1 if at least one input had 1 in that position and 0 if ALL inputs had 0 in that position. A bit position in ANDVal will be 0 if at least one input had 0 in that bit position and 1 if ALL inputs had 1 in that position.
If a bit position in inputs was always 1, then ORVal and ANDVal will both be set to 1.
If a bit position in inputs was always 0, then ORVal and ANDVal will both be set to 0.
If there was a mix of 0 and 1 in a bit position then ORVal will be set to 1 and ANDVal set to 0, hence the XOR at the end gives the mask for bit positions that changed.
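Put together as a loop, the OR/AND idea might look like the sketch below. The names changed_bits and span_width are invented here; span_width measures the distance between the lowest and highest changed bit, which is the field size the question asks about.

```cpp
#include <cstdint>
#include <vector>

// Mask of bit positions that were not constant across all inputs.
std::uint32_t changed_bits(const std::vector<std::uint32_t>& inputs) {
    if (inputs.empty()) return 0;
    std::uint32_t or_val = 0;
    std::uint32_t and_val = 0xFFFFFFFFu;
    for (std::uint32_t v : inputs) {
        or_val |= v;    // 1 where at least one input had a 1
        and_val &= v;   // 1 only where every input had a 1
    }
    return or_val ^ and_val;
}

// Number of bits from the lowest to the highest set bit of mask, inclusive.
int span_width(std::uint32_t mask) {
    if (mask == 0) return 0;
    int lo = 0;
    while (((mask >> lo) & 1u) == 0) ++lo;
    int hi = 31;
    while (((mask >> hi) & 1u) == 0) --hi;
    return hi - lo + 1;
}
```

For the first set of inputs in the question (0x00001234, 0x00005678, 0x0000abcd) the span is 16 bits; for the second set (0x00000234, 0x56780000, 0x00001000) it is 29 bits.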
I have a UINT8 pointer mArray, which is being assigned information via a *(UINT16 *) casting. EG:
int offset = someValue;
UINT16 mUINT16 = 0xAAFF;
*(UINT16 *)&mArray[offset] = mUINT16;
for(int i = 0; i < mArrayLength; i++)
{
printf("%02X",*(mArray + i));
}
output: ... FF AA ...
expected: ... AA FF ...
The value I am expecting to be printed when it reaches offset is to be AA FF, but the value that is printed is FF AA, and for the life of me I can't figure out why.
You are using a little endian machine.
You didn't specify but I'm guessing your mArray is an array of bytes instead of an array of UINT16s.
You're also running on a little-endian machine. On little endian machines the bytes are stored in the opposite order of big-endian machines. Big endians store them pretty much the way humans read them.
You are probably using a computer that uses a "little-endian" representation of numbers in memory (such as the Intel x86 architecture). Basically this means that the least significant byte of any value will be stored at the lowest address of the memory location that is used to store the value. See Wikipedia for details.
In your case, the number 0xAAFF consists of the two bytes 0xAA and 0xFF, with 0xFF being the least significant one. Hence, a little-endian machine will store 0xFF at the lowest address and then 0xAA. So if you interpret the memory location to which you have written a UINT16 value as a UINT8, you will get the byte written to that location, which happens to be 0xFF.
If you want to write an array of UINT16 values into an appropriately sized array of UINT8 values such that the output will match your expectations you could do it in the following way:
/* copy inItems UINT16 values from inArray to outArray in
* MSB first (big-endian) order
*/
void copyBigEndianArray(UINT16 *inArray, size_t inItems, UINT8 *outArray)
{
for (size_t i = 0; i < inItems; i++)
{
// shift one byte right: AAFF -> 00AA
outArray[2*i] = inArray[i] >> 8;
// keep only the low byte: AAFF -> FF
outArray[2*i + 1] = inArray[i] & 0xFF;
}
}
You might also want to check out the hton*/ntoh*-family of functions if they are available on your platform.
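For a single value, as in the question, the same idea can be written as a pair of helpers that store and load a 16-bit value in big-endian order byte by byte. This is a sketch; store_be16 and load_be16 are names made up for the example. Writing the bytes explicitly also avoids the pointer cast, so the result is the same on little-endian and big-endian machines.

```cpp
#include <cassert>
#include <cstdint>

// Write v into dst[0..1], most significant byte first.
void store_be16(std::uint8_t* dst, std::uint16_t v) {
    assert(dst != nullptr);
    dst[0] = static_cast<std::uint8_t>(v >> 8);
    dst[1] = static_cast<std::uint8_t>(v & 0xFF);
}

// Read a 16-bit value stored most significant byte first.
std::uint16_t load_be16(const std::uint8_t* src) {
    assert(src != nullptr);
    return static_cast<std::uint16_t>((src[0] << 8) | src[1]);
}
```

With store_be16(&mArray[offset], 0xAAFF), the bytes at offset and offset + 1 would be AA and FF, which is the order the question expected.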
It's because your computer's CPU uses a little-endian representation of integers in memory.