I have a stream of 16-bit values, and I need to adjust the 4 least significant bits of each sample. The new values are different for each short, but repeat every X shorts - essentially tagging each short with an ID.
Are there any bit twiddling tricks to do this faster than just a for-loop?
More details
I'm converting a file from one format to another. Currently implemented with FILE* but I could use Windows specific APIs if helpful.
[while data remaining]
{
read X shorts from input
tag 4 LSB's
write modified data to output
}
In addition to bulk operations, I guess I was looking for opinions on the best way to stomp those last 4 bits.
Shift right 4, shift left 4, | in the new values
& in my zero bits, then | in the 1 bits
modulus 16, add new value
We're only supporting win7 (32 or 64) right now, so hardware would be whatever people choose for that.
If you're working on e.g. a 32-bit platform, you can do them 2 at a time. Or on a modern x86 (or equivalent), you could use SIMD instructions to operate on 128 bits at a time.
Other than that, there are no bit-twiddling methods to avoid looping through your entire data set, given that it sounds like you must modify every element!
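For instance, a minimal SSE2 sketch (assuming the sample count is a multiple of 8 and that the repeating tag pattern has been pre-packed into tags8; tag_samples is an illustrative name, not from the question):

#include <emmintrin.h>
#include <cstdint>
#include <cstddef>

void tag_samples(uint16_t *data, std::size_t count, __m128i tags8)
{
    const __m128i mask = _mm_set1_epi16(static_cast<short>(0xFFF0));
    for (std::size_t i = 0; i < count; i += 8) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<__m128i *>(data + i));
        v = _mm_and_si128(v, mask);   // zero the 4 least significant bits
        v = _mm_or_si128(v, tags8);   // OR in the eight tag values
        _mm_storeu_si128(reinterpret_cast<__m128i *>(data + i), v);
    }
}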
Best way to stomp those last 4 bits is your option 2 (note that the variable should be a 16-bit type; with a 32-bit int, the mask 0xFFF0 would also clear bits 16-31):
unsigned short i;  // one 16-bit sample
i &= 0xFFF0;       // clear the 4 least significant bits
i |= tag;          // OR in the new value
Doing this on a 64-bit integer would be faster if you know the tag values in advance (note that long is only 32 bits on Windows, so use a fixed-width type such as uint64_t).
You can memcpy 4 shorts into one uint64_t and then do the same operations as above on 4 shorts at a time:
uint64_t l;
l &= 0xFFF0FFF0FFF0FFF0ULL;
l |= tags;
where
uint64_t tags = ((uint64_t)tag1 << 48) | ((uint64_t)tag2 << 32) | ((uint64_t)tag3 << 16) | (uint64_t)tag4;
(the parentheses matter: + binds tighter than <<, so an unparenthesized tag1 << 48 + tag2 would parse as tag1 << (48 + tag2)). This makes sense if you are reusing this tags value often, not if you have to build it differently for each set of 4 shorts.
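Putting that together, a minimal sketch of the inner step (assuming a little-endian x86 target, where the tag for samples[0] sits in the low 16 bits of tags; tag4_samples is an illustrative name):

#include <cstdint>
#include <cstring>

void tag4_samples(uint16_t *samples, uint64_t tags)
{
    uint64_t l;
    std::memcpy(&l, samples, sizeof l);  // load 4 shorts at once
    l &= 0xFFF0FFF0FFF0FFF0ULL;          // clear the 4 LSBs of each short
    l |= tags;                           // OR in the 4 tag values
    std::memcpy(samples, &l, sizeof l);  // store them back
}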
I am working on a toy file system, and I am using a bitset to keep track of used and unused pages. I am using an array of ints to represent the bitset (in order to use GCC's built-in bit operations); I am not using std::bitset because it will not be available in the final environment (an embedded system).
Now, according to Linux perf, during the tests allocating files takes 35% of the runtime, and 45% of that time is lost setting bits using
#define BIT_SET(a,b) ((a) |= (1ULL<<(b)))
inside a loop. According to perf, 42% of the time is lost in the or. Deleting is a bit faster, but then most of the time is lost in the and operation that clears the bits; toggling the bits using xor did not make any difference.
Basically, I am wondering if there are smarter ways to set multiple bits in one go. If the user requests 10 pages of space, just set all 10 bits in one go - but the problem is that the space can span word boundaries. Are there any GCC/Clang intrinsics that I should be aware of?
You should be able to use a function like this to set multiple bits in a bitset at once:
#include <limits.h>  /* CHAR_BIT */

void set_mask(word_t* bitset, word_t mask, int lowbit)
{
    const int word_bits = sizeof(word_t) * CHAR_BIT;  /* bits per word, not bytes */
    int index  = lowbit / word_bits;
    int offset = lowbit % word_bits;
    bitset[index] |= (mask << offset);
    /* The double shift avoids shifting by the full word width
       (undefined behaviour) when offset is 0. */
    bitset[index + 1] |= (mask >> 1) >> (word_bits - offset - 1);
}
If the mask does not span a boundary, the 2nd word is ORed with 0, so it is unchanged. Doing it unconditionally may be faster than the test to see whether it needs to be done. If profiling shows otherwise, add an if (mask) before the last line.
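For example, to satisfy a 10-page request starting at page 27 (a hypothetical call, reusing bitset and word_t from above):

set_mask(bitset, ((word_t)1 << 10) - 1, 27);  // ten one-bits, starting at bit 27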
I'm trying to create a simple DBMS, and although I've read a lot about it and have already designed the system, I have some questions about the implementation.
I need to know the best method in C++ for using a series of bits whose length will be dynamic. This series of bits will be saved in order to figure out which pages in the files are free and not free. For a single file the number of pages used will be fixed, so I can probably use a bitset for that. However, the number of records per page and per file will not be fixed, so I don't think a bitset would be the best way to do this.
I thought maybe I could just use a sequence of characters: since each character is 1 byte = 8 bits, an array of them should let me build the bit map that I want.
I've never had to manipulate bits at such a low level, so I don't really know if there is some better method, or even whether this method would work at all.
Thanks in advance.
If you are just wanting the basics on the bit twiddling, the following is one way of doing it using an array of characters.
Assume you have an array for the bits (the length needs to be at least (totalitems + 7) / 8, i.e. totalitems / 8 rounded up):
unsigned char *bits; // this of course needs to be allocated somewhere
You can compute the index into the array and the specific bit within that position as follows:
// compute array position
int pos = item / 8; // 8 bits per byte
// compute the bit within the byte. Could use "item & 7" for the same
// result, however modern compilers will typically already make
// that optimization.
int bit = item % 8;
And then you can check if a bit is set with the following (assumes zero-based indexing):
if ( bits[pos] & ( 1 << bit ))
return 1; // it is set
else
return 0; // it is not set
The following will set a specific bit:
bits[pos] |= ( 1 << bit );
And the following can be used to clear a specific bit:
bits[pos] &= ~( 1 << bit );
I would implement a wrapper class and simply store your bitmap in a linked list of chunks, where each chunk holds a fixed-size array (I would use a <stdint.h> type like uint32_t to guarantee a known number of bits); then you simply add links to your list to expand. I'll leave contracting as an exercise for the reader. A sketch of the idea follows.
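A minimal sketch of that chunked bitmap (all names here are illustrative; 32 words of 32 bits gives 1024 bits per chunk):

#include <cstdint>
#include <cstddef>
#include <memory>

struct Chunk {
    uint32_t words[32] = {};      // 1024 bits per chunk
    std::unique_ptr<Chunk> next;  // link to the next chunk, if any
};

class Bitmap {
    std::unique_ptr<Chunk> head = std::make_unique<Chunk>();
public:
    void set(std::size_t bit) {
        Chunk *c = head.get();
        while (bit >= 1024) {     // walk to the chunk holding this bit,
            if (!c->next)         // growing the list on demand
                c->next = std::make_unique<Chunk>();
            c = c->next.get();
            bit -= 1024;
        }
        c->words[bit / 32] |= (1u << (bit % 32));
    }
};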
I am in the process of building an assembler for a rather unusual machine that I and a few other people are building. This machine takes 18-bit instructions, and I am writing the assembler in C++.
I have collected all of the instructions into a vector of 32-bit unsigned integers, none of which is any larger than what can be represented with an 18-bit unsigned number.
However, there does not appear to be any way (as far as I can tell) to output such an unusual number of bits to a binary file in C++. Can anyone help me with this?
(I would also be willing to use C's stdio and FILE structures; however, there still does not appear to be any way to output such an arbitrary number of bits.)
Thank you for your help.
Edit: It looks like I didn't specify well enough how the instructions will be stored in memory.
Instructions are contiguous in memory. Say the instructions start at location 0 in memory:
The first instruction will be at 0. The second instruction will be at 18, the third instruction will be at 36, and so on.
There are no gaps and no padding in the instructions. There can be a few superfluous 0s at the end of the program if needed.
The machine uses big endian instructions. So an instruction stored as 3 should map to: 000000000000000011
Keep an eight-bit accumulator.
Shift bits from the current instruction into to the accumulator until either:
The accumulator is full; or
No bits remain of the current instruction.
Whenever the accumulator is full:
Write its contents to the file and clear it.
Whenever no bits remain of the current instruction:
Move to the next instruction.
When no instructions remain:
Shift zeros into the accumulator until it is full.
Write its contents.
End.
For n instructions, this will leave (8 - (18n mod 8)) mod 8 zero bits after the last instruction.
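A minimal sketch of that accumulator (assuming the instructions sit in a std::vector<uint32_t> as in the question and out is a std::ofstream opened in binary mode; write_packed is an illustrative name):

#include <cstdint>
#include <fstream>
#include <vector>

void write_packed(std::ofstream &out, const std::vector<uint32_t> &insns)
{
    unsigned acc = 0;   // the eight-bit accumulator
    int nbits = 0;      // number of bits currently held in it
    for (uint32_t insn : insns) {
        for (int b = 17; b >= 0; --b) {      // most significant bit first
            acc = (acc << 1) | ((insn >> b) & 1);
            if (++nbits == 8) {              // accumulator full: flush it
                out.put(static_cast<char>(acc));
                acc = 0;
                nbits = 0;
            }
        }
    }
    if (nbits > 0)                           // pad the final byte with zeros
        out.put(static_cast<char>(acc << (8 - nbits)));
}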
There are a lot of ways you can achieve the same end result (I am assuming the end result is a tight packing of these 18 bits).
A simple method would be to create a bit-packer class that accepts the 32-bit words, and generates a buffer that packs the 18-bit words from each entry. The class would need to do some bit shifting, but I don't expect it to be particularly difficult. The last byte can have a few zero bits at the end if the original vector length is not a multiple of 4. Once you give all your words to this class, you can get a packed data buffer, and write it to a file.
You could maybe represent your data in a bitset and then write the bitset to a file.
It wouldn't work with fstream's write function, but there is a way that is described here...
The short answer: Your C++ program should output the 18-bit values in the format expected by your unusual machine.
We need more information - specifically, the format that your "unusual machine" expects, or more precisely, the format that your assembler should be outputting. Once you understand the format of the output you're generating, the answer should be straightforward.
One possible format — I'm making things up here — is that we could take two of your 18-bit instructions:
instruction 1 instruction 2 ...
MSB LSB MSB LSB ...
bits → ABCDEFGHIJKLMNOPQR abcdefghijklmnopqr ...
...and write them in an 8-bits/byte file thus:
KLMNOPQR CDEFGHIJ 000000AB klmnopqr cdefghij 000000ab ...
...this is basically arranging the values in "little-endian" form, with 6 zero bits padding the 18-bit values out to 24 bits.
But I'm assuming: the padding, the little-endianness, the number of bits / byte, etc. Without more information, it's hard to say if this answer is even remotely near correct, or if it is exactly what you want.
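As a sketch, a writer for that made-up padded little-endian layout might look like this (write_padded is an illustrative name; out is assumed to be a binary-mode stream):

#include <cstdint>
#include <fstream>

void write_padded(std::ofstream &out, uint32_t insn)
{
    out.put(static_cast<char>(insn & 0xFF));          // KLMNOPQR
    out.put(static_cast<char>((insn >> 8) & 0xFF));   // CDEFGHIJ
    out.put(static_cast<char>((insn >> 16) & 0x03));  // 000000AB
}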
Another possibility is a tight packing:
ABCDEFGH IJKLMNOP QRabcdef ghijklmn opqr0000
or
ABCDEFGH IJKLMNOP abcdefQR ghijklmn 0000opqr
...but I've made assumptions about where the corner cases go here.
Just output them to the file as 32 bit unsigned integers, just as you have in memory, with the endianness that you prefer.
And then, when the loader / EEPROM writer / JTAG tool or whatever you use to send the code to the machine reads each 32-bit word, it can simply omit the 14 most significant bits and send the real 18 bits to the target.
Unless, of course, you have written a FAT driver for your machine...
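A minimal sketch of that approach (assuming the vector from the question and a FILE* opened with "wb"; this writes in the host's native endianness):

#include <cstdint>
#include <cstdio>
#include <vector>

void write_raw(std::FILE *f, const std::vector<uint32_t> &insns)
{
    // Dump the instruction words exactly as they sit in memory.
    std::fwrite(insns.data(), sizeof(uint32_t), insns.size(), f);
}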
I'm having some trouble figuring out the NEON equivalents of a couple of Intel SSE operations. It seems that NEON is not capable of handling an entire Q register (a 128-bit value) at once. I haven't found anything in the arm_neon.h header or in the NEON intrinsics reference.
What I want to do is the following:
// Intel SSE
// shift the entire 128-bit value right by 2 bytes; this is done
// without sign extension, by shifting in zeros
__m128i val = _mm_srli_si128(vector_of_8_s16, 2);
// insert the least significant 16 bits of "some_16_bit_val"
// (the whole thing in this case) into the selected 16-bit
// element of vector "val" (the element with index 7 in this case)
val = _mm_insert_epi16(val, some_16_bit_val, 7);
I've looked at the shifting operations provided by NEON but could not find an equivalent way of doing the above (I don't have much experience with NEON). Is it possible to do the above? I guess it is; I just don't know how.
Any pointers greatly appreciated.
You want the VEXT instruction. Your example would look something like:
int16x8_t val = vextq_s16(vector_of_8_s16, another_vector_s16, 1);
After this, bits 0-111 of val will contain bits 16-127 of vector_of_8_s16, and bits 112-127 of val will contain bits 0-15 of another_vector_s16.
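If the new element comes from a scalar rather than another vector, a sketch of one option (using the standard arm_neon.h intrinsics; shift_and_insert is an illustrative name) is to follow the extract with vsetq_lane_s16:

#include <arm_neon.h>

int16x8_t shift_and_insert(int16x8_t vector_of_8_s16, int16_t some_16_bit_val)
{
    // Shift all lanes down by one (lane 7 temporarily gets a copy of lane 0)...
    int16x8_t val = vextq_s16(vector_of_8_s16, vector_of_8_s16, 1);
    // ...then overwrite lane 7 with the scalar, mirroring _mm_insert_epi16.
    return vsetq_lane_s16(some_16_bit_val, val, 7);
}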
I have some code here, and don't really understand the ">>" and the "&". Can someone clarify?
buttons[0] = indata[byteindex]&1;
buttons[1] = (indata[byteindex]>>1)&1;
rawaxes[7] = (indata[byteindex]>>4)&0xf;
These are bitwise operators, meaning they operate on the binary bits that make up a value. See Bitwise operation on Wikipedia for more detail.
& is for AND
If indata[byteindex] is the number 4, then in binary it would look like 00000100. ANDing this number with 1 gives 0, because bit 0 (the least significant bit) is not set:
00000100 AND 00000001 = 0
If the value is 5 however, then you will get this:
00000101 AND 00000001 = 1
Any bit matched with the mask is allowed through.
>> is for right-shifting
Right-shifting shifts bits along to the right!
00010000 >> 4 = 00000001
One of the standard patterns for extracting a bit field is (reg >> offset) & mask, where reg is the register (or other memory location) you're reading, offset is how many least-significant bits you skip over, and mask is the set of bits that matter. The >> offset step can be omitted if offset is 0. mask is usually equal to 2^width - 1, or (1 << width) - 1 in C, where width is the number of bits in the field.
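As a sketch, that pattern wrapped in a helper function (extract_field is an illustrative name; width is assumed to be smaller than the register width, so the shift is well defined):

unsigned extract_field(unsigned reg, int offset, int width)
{
    unsigned mask = (1u << width) - 1;  // width one-bits
    return (reg >> offset) & mask;      // skip offset bits, keep width bits
}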
So, looking at what you have:
buttons[0] = indata[byteindex]&1;
Here, offset is 0 (it was omitted) and mask is 1. So this gets just the least-significant bit in indata[byteindex]:
bit number -> 7 6 5 4 3 2 1 0
+-+-+-+-+-+-+-+-+
indata[byteindex] | | | | | | | |*|
+-+-+-+-+-+-+-+-+
|
\----> buttons[0]
Next:
buttons[1] = (indata[byteindex]>>1)&1;
Here, offset is 1 and width is 1...
bit number -> 7 6 5 4 3 2 1 0
+-+-+-+-+-+-+-+-+
indata[byteindex] | | | | | | |*| |
+-+-+-+-+-+-+-+-+
|
\------> buttons[1]
And, finally:
rawaxes[7] = (indata[byteindex]>>4)&0xf;
Here, offset is 4 and width is 4 (2^4 - 1 = 16 - 1 = 15 = 0xf):
bit number -> 7 6 5 4 3 2 1 0
+-+-+-+-+-+-+-+-+
indata[byteindex] |*|*|*|*| | | | |
+-+-+-+-+-+-+-+-+
| | | |
\--v--/
|
\---------------> rawaxes[7]
EDIT...
but I don't understand what the point of it is...
Mike pulls up a rocking chair and sits down.
Back in the old days of 8-bit CPUs, a computer typically had 64K (65 536 bytes) of address space. Now we wanted to do as much as we could with our fancy whiz-bang machines, so we would do things like buy 64K of RAM and map everything to RAM. Shazam, 64K of RAM and bragging rights all around.
But a computer that can only access RAM isn't much good. It needs some ROM for an OS (or at least a BIOS), and some addresses for I/O. (You in the back--siddown. I know Intel chips had separate address space for I/O, but it doesn't help here because the I/O space was much, much smaller than the memory space, so you ran into the same constraints.)
Address space used for ROM and I/O was space that wasn't accessible as RAM, so you wanted to minimize how much space wasn't used for RAM. So, for example, when your I/O peripheral had five different things whose status amounted to a single bit each, rather than give each one of those bits its own byte (and, hence, address), they got the brilliant idea of packing all five of those bits into one byte, leaving three bits that did nothing. Voila, the Interrupt Status Register was born.
The hardware designers were also impressed with how fewer addresses resulted in fewer address bits (since address bits is ceiling of log-base-2 of number of addresses), meaning fewer address pins on the chip, freeing pins for other purposes. (These were the days when 48-pin chips were considered large, and 64-pins huge, and grid array packages were out of the question because multi-layer circuit boards were prohibitively expensive. These were also the days before multiplexing the address and data on the same pins became commonplace.)
So the chips were taped out and fabricated, and hardware was built, and then it fell to the programmers to make the hardware work. And lo, the programmers said, "WTF? I just want to know if there is a byte to read in the bloody serial port, but there are all these other bits like "receiver overrun" in the way." And the hardware guys considered this, and said, "tough cookies, deal with it."
So the programmers went to the Guru, the guy who hadn't forgotten his Boolean algebra and was happy not to be writing COBOL. And the Guru said, "use the Bit AND operation to force those bits you don't care about to 0. If you need a number, and not just a zero-or-nonzero, use a logical shift right (LSR) on the result." And they tried it. It worked, and there was much rejoicing, though the wiser ones started wondering about things like race conditions in a read-modify-write cycle, but that's a story for another time.
And so the technique of packing loosely or completely unrelated bits into registers became commonplace. People developing protocols, which always want to use fewer bits, jumped on these techniques as well. And so, even today, with our gigabytes of RAM and gigabits of bandwidth, we still pack and unpack bitfields with expressions whose legibility borders on keyboard head banging.
(Yes, I know bit fields probably go back to the ENIAC, and maybe even the Difference Engine if Lady Ada needed to stuff two data elements into one register, but I haven't been alive that long, okay? I'm sticking with what I know.)
(Note to hardware designers out there: There really isn't much justification anymore for packing things like status flags and control bits that a driver writer will want to use independently. I've done several designs with one bit per 32-bit register in many cases. No bit shifting or masking, no races, driver code is simpler to write and understand, and the address decode logic is trivially more complex. If the driver software is complex, simplifying flag and bitfield handling can save you a lot of ROM and CPU cycles.)
(More random trivia: The Atmel AVR architecture (used in the Arduino, among many other places) has some specialized bit-set and bit-clear instructions. The avr-libc library used to provide macros for these instructions, but now the gcc compiler is smart enough to recognize that reg |= (1 << bitNum); is a bit set and reg &= ~(1 << bitNum); is a bit clear, and puts in the proper instruction. I'm sure other architectures have similar optimizations.)
These are bitwise operators.
& ANDs two arguments bit by bit.
>> shifts the first argument's bit string to the right by the number of places given by the second argument.
<< does the opposite. | is bitwise OR and ^ is bitwise XOR, just as & is bitwise AND.
In English, the first line is grabbing the lowest bit (bit 0) of indata[byteindex] and storing it in buttons[0]. Basically, if the value is odd, buttons[0] will be 1; if even, it will be 0.
The second line is grabbing the next bit (bit 1). If that bit is set, it stores 1, else 0. It could also have been written as
buttons[1] = (indata[byteindex]&2)>>1;
and it would have done the same thing.
The last (3rd) line is grabbing the 5th through 8th bits (bits 4-7). Basically, it will be a number from 0 to 15 when it is complete. It could also have been written as
rawaxes[7] = (indata[byteindex]&0xf0) >> 4;
and done the same thing. I'd also guess from context that these arrays are unsigned char arrays. Just a guess though.
The '&' (in this case) is a bitwise AND operator and ">>" is the bit-shift operator (so x>>y yields x shifted right by y bits).
So, they're taking the least significant bit of indata[byteindex] and putting it into buttons[0]. They're taking the next least significant bit and putting it into buttons[1].
The last one probably needs to be looked at in binary to make a lot of sense. 0xf is 1111 in binary, so they're taking the input, shifting it right 4 bits, then retaining the 4 least significant bits of that result.