I need an efficient function to copy a sequence of bits of arbitrary length from an arbitrary position in a packed bit array to another packed bit array. (A bit-by-bit copy won't be good enough; the function should be fast for short sequences, say 5 to 25 bits in length, but any length should be supported.)
Example:
Copy 12 bits from position 7 in
       _ ________ ___
11111000 01000000 01101011 11110110 01100100
to position 21 in
                       ___ ________ _
11011111 11001010 00000100 11001101 10111111 00101101,
yielding
                       ___ ________ _
11011111 11001010 00000001 00000001 10111111 00101101.
I plan to work with byte/word/doubleword/quadword accesses where possible and beneficial. I am sufficiently familiar with the shifting/bitwise masking techniques to know that this is not trivial and there are many corner cases. So I am looking for a (nearly) ready-made solution, in pseudocode or any language, provided it has been validated.
Any methodological hint to ease this development is welcome.
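As a starting point for validation, here is a minimal reference sketch in Python. It is not the fast word-oriented routine the question asks for; it only pins down the expected behaviour so an optimized version can be checked against it. The name `copy_bits` and the MSB-first bit numbering (position 0 is the leftmost bit of byte 0, as in the example above) are assumptions made for illustration.

```python
def copy_bits(src, src_pos, dst, dst_pos, nbits):
    """Copy nbits from bit src_pos of src into dst at bit dst_pos.

    Assumption: bits are numbered MSB-first, so position 0 is the
    leftmost bit of byte 0. dst must be a mutable bytearray.
    """
    if nbits <= 0:
        return
    field_mask = (1 << nbits) - 1

    # Load only the source bytes that cover the field, as one big integer.
    s_first = src_pos // 8
    s_last = (src_pos + nbits - 1) // 8
    chunk = int.from_bytes(src[s_first:s_last + 1], "big")
    s_tail = (s_last + 1) * 8 - (src_pos + nbits)  # unused bits right of the field
    field = (chunk >> s_tail) & field_mask

    # Read-modify-write the destination bytes that cover the field.
    d_first = dst_pos // 8
    d_last = (dst_pos + nbits - 1) // 8
    width = d_last - d_first + 1
    window = int.from_bytes(dst[d_first:d_last + 1], "big")
    d_tail = (d_last + 1) * 8 - (dst_pos + nbits)
    window = (window & ~(field_mask << d_tail)) | (field << d_tail)
    dst[d_first:d_last + 1] = window.to_bytes(width, "big")


src = bytes([0b11111000, 0b01000000, 0b01101011, 0b11110110, 0b01100100])
dst = bytearray([0b11011111, 0b11001010, 0b00000100,
                 0b11001101, 0b10111111, 0b00101101])
copy_bits(src, 7, dst, 21, 12)
print(" ".join(format(byte, "08b") for byte in dst))
```

The masks for the partial first and last bytes fall out of a single shift (`s_tail`, `d_tail`), which is the same bookkeeping a byte/word-based version has to do at the edges.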
I have a byte b. I am looking for the most efficient bit manipulation to
convert each bit in b to the first bit of each nibble in a 32 bit int x.
For example, if b = 01010111, then x = 0x01010111
I know I can do a brute force approach:
x = (b&1) | (((b>>1)&1)<<4) | ......
Edit: this is for an OpenCL kernel for a GPU
PDEP
As user harold mentioned in the comments, PDEP is the instruction that does exactly what you want - but it's only available on x86 (as far as I know), and it has terrible [1] performance on the newest AMD chips.
LUT
Barring that, a lookup table of 256 x 4-byte entries seems reasonable - at the cost of 1K of pressure on your cache subsystem. You'll find a lot of smart people advocate against LUTs due to the hidden cost of cache misses - but if this particular operation is in fact "hot" then it may turn out to be the fastest even when factoring in any additional misses.
As with any LUT solution, you should be especially careful to benchmark it not only with micro-benchmarks, but in the full application to evaluate the effect of memory pressure.
You could also consider a compromise split-LUT solution that uses one or two 16-entry LUTs for each nibble of the byte, where the result is calculated something like:
int32 x = high_lut[(b & 0xF0) >> 4] | low_lut[b & 0xF]
This cuts the size of the LUTs down by a factor of between ~11 and 32 [2], since we have far fewer entries and some entries can be 2 bytes rather than 4 bytes.
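The full and split tables can be sketched as follows, in Python rather than OpenCL for brevity. The sketch assumes the mapping from the question's brute-force expression (bit i of b goes to bit 4*i of x); the names `FULL_LUT`, `LOW_LUT`, `HIGH_LUT` and `spread_split` are made up for illustration.

```python
def spread_reference(b):
    """Brute-force mapping from the question: bit i of b goes to bit 4*i of x."""
    x = 0
    for i in range(8):
        x |= ((b >> i) & 1) << (4 * i)
    return x


# Full table: 256 entries, each needing 4 bytes in a C-like language.
FULL_LUT = [spread_reference(b) for b in range(256)]

# Split tables: 16 entries each. LOW_LUT values fit in 16 bits;
# HIGH_LUT is stored pre-shifted, so no shift is needed at lookup time.
LOW_LUT = [spread_reference(n) for n in range(16)]
HIGH_LUT = [spread_reference(n) << 16 for n in range(16)]


def spread_split(b):
    """Two small lookups combined with OR, as in the split-LUT expression."""
    return HIGH_LUT[(b & 0xF0) >> 4] | LOW_LUT[b & 0x0F]


print(hex(spread_split(0b01010111)))
```

Sharing a single table for both nibbles would mean dropping `HIGH_LUT` and shifting the `LOW_LUT` result for the high nibble left by 16 instead.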
Bit Manipulation
If you really want a bit manipulation solution, to impress your in-laws or something, you can try something like the following:
Split the byte into nibbles and use multiplication by 0x00001111 (low nibble, b & 0x0F) and 0x01111000 (high nibble, left in place as b & 0xF0) to splat the low (resp. high) nibble into the low (resp. high) half of the 4-byte word, and combine the results with or or add. So if your byte had bits abcd efgh you'll have a word like abcd abcd abcd abcd efgh efgh efgh efgh.
AND this result with a mask that picks out the bit that belongs in each nibble (although it usually won't be in the right place). The mask is something like 0x84218421 and the result (in binary) will be something like a000 0b00 00c0 000d e000 0f00 00g0 000h.
Now move the 6 out of 8 bits that aren't in the high bit to the right position using the borrow behavior of subtraction, something like: ((x | 0x08880888) - 0x01110111) & 0x88888888.
The basic idea in the last step is that you set the high bit of each nibble, and subtract 1 from the nibble. So for example, you have the 0b00 nibble, which becomes 1b00 - 1 - the subtraction borrows through all the zeros, and stops at the first one, which is either the high bit (if b is zero) or b itself (if it is one). So you effectively set the high bit to the value of the selected bit. Note that you don't need to do this for a or e since they are already in the right place.
The final AND is needed because the subtraction leaves stray one bits below the high bit of each nibble (for example, 1000 - 1 = 0111), so we need to clear them.
I didn't try it out, so there are no doubt bugs, but the basic idea should be sound. There are probably various ways to optimize it further, but it's not too bad as is: a couple of multiplications and perhaps a half-dozen bit-operations. On platforms with slow multiplications you can probably find another approach for the first step that uses only 1 multiplication combined with a few more primitive operations, or zero at the cost of several more operations.
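The multiply, mask and subtract steps can be checked exhaustively with a short Python sketch. The function name `spread_mul` is hypothetical, and Python integers are unbounded, so the 32-bit width is only implied by the constants. The final step in this sketch is an AND with 0x88888888, which keeps only the high bit of each nibble; running all 256 inputs shows that a final XOR with 0x08880888 would invert the moved bits instead.

```python
def spread_mul(b):
    """Place bit i of b in the HIGH bit of nibble i of a 32-bit word."""
    lo = (b & 0x0F) * 0x00001111   # efgh efgh efgh efgh in the low half
    hi = (b & 0xF0) * 0x01111000   # abcd abcd abcd abcd in the high half
    x = (lo | hi) & 0x84218421     # a000 0b00 00c0 000d e000 0f00 00g0 000h
    # Set the high bit of the six nibbles whose bit is misplaced, subtract 1
    # from each: the borrow clears the high bit exactly when the selected
    # bit was 0. Then drop the leftover low bits.
    return ((x | 0x08880888) - 0x01110111) & 0x88888888


def spread_high_reference(b):
    """Loop version used only to check spread_mul."""
    return sum(((b >> i) & 1) << (4 * i + 3) for i in range(8))


mismatches = [b for b in range(256) if spread_mul(b) != spread_high_reference(b)]
print(mismatches)
print(hex(spread_mul(0b01010111) >> 3))
```

Shifting the result right by 3 moves each bit from the high bit to the low bit of its nibble, which is the placement produced by the brute-force expression in the question.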
[1] Fully 18x worse throughput than Intel - evidently AMD opted not to implement the circuit to do PDEP in hardware and instead implement it via a series of more elementary operations.
[2] The largest reduction is if you share a single 16-entry LUT for both the high and low nibble, although this requires an additional shift for the result of the high nibble lookup. The smaller reduction, shown in the example, uses two 16-entry LUTs: one 4-byte one for the high nibble, and a 2-byte one for the low nibble, and avoids the shift.
I started learning C++ on the website cplusplus.com and there is a tutorial about the language. In that tutorial the first lesson is on compilers and in that lesson, that can be found at http://www.cplusplus.com/doc/tutorial/introduction/, they give the following example:
A single instruction to a computer could look like this:
00000 10011110
A particular computer's machine language program that allows a user to input two numbers, adds the two numbers together, and displays the total could include these machine code instructions:
00000 10011110
00001 11110100
00010 10011110
00011 11010100
00100 10111111
00101 00000000
My question is why do they put 5 bits in front (on the left side) separate from the other 8 bits on the right side? What does the group of 5 bits on the left mean? Does that group tell the computer how to interpret the 8 bits on the right? For example does it tell the computer that what's following on the right side is a number or a character or an operator? I have tried to find an answer to this question on the Internet, but I couldn't find anything that would clear things up for me. If anyone could provide me with a clear answer in simple terms that would be much appreciated.
As noted, the grouping seems to be arbitrary. One possible explanation is that it separates operators from operands, but since the left-hand group is sequential, the best guess is that it's just the instruction address:
00000 => address 0
00001 => address 1
00010 => address 2
00011 => address 3
00100 => address 4
00101 => address 5
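Under that reading, each line of the tutorial's listing is a 5-bit address followed by an 8-bit memory word. A small Python sketch of this interpretation (the names `LISTING` and `parse_line` are made up, and the address/contents split is the guess described above, not something the tutorial states):

```python
LISTING = """\
00000 10011110
00001 11110100
00010 10011110
00011 11010100
00100 10111111
00101 00000000
"""


def parse_line(line):
    """Split 'aaaaa dddddddd' into (address, byte), both as integers."""
    address_bits, data_bits = line.split()
    return int(address_bits, 2), int(data_bits, 2)


program = [parse_line(line) for line in LISTING.splitlines()]
for address, byte in program:
    print(f"address {address}: 0x{byte:02X}")
```

The left-hand column decodes to 0, 1, 2, 3, 4, 5, which is what you would expect from consecutive addresses and not from anything describing the type of the right-hand byte.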
Machine code instructions are hardware dependent. Here is an example separating operator and operand:
[   op   |        target address          ]
    2                  1024                   decimal
  000010   00000 00000 00000 10000 000000     binary
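To make the field layout concrete, here is a hedged Python sketch that packs and unpacks a 32-bit word with a 6-bit operator and a 26-bit target address, as in the example above. The layout has the same shape as a MIPS J-type jump, but the answer does not name an architecture, so treat that as an assumption; `encode` and `decode` are illustrative names.

```python
OP_BITS = 6
TARGET_BITS = 26
TARGET_MASK = (1 << TARGET_BITS) - 1


def encode(op, target):
    """Pack a 6-bit operator and a 26-bit target into one 32-bit word."""
    return ((op & 0x3F) << TARGET_BITS) | (target & TARGET_MASK)


def decode(word):
    """Split a 32-bit word back into (op, target)."""
    return word >> TARGET_BITS, word & TARGET_MASK


word = encode(2, 1024)
print(format(word, "032b"))
print(decode(word))
```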
I know the CRC calculation algorithm from Wikipedia. I read about the structure of a RAR file here. For example, it says:
The file has the magic number of:
0x 52 61 72 21 1A 07 00
Which is a break down of the following to describe an Archive Header:
0x6152 - HEAD_CRC
0x72 - HEAD_TYPE
0x1A21 - HEAD_FLAGS
0x0007 - HEAD_SIZE
If I understand correctly, HEAD_CRC (0x6152) is the CRC value of the Marker Block (MARK_HEAD). I read somewhere that the CRC of a WinRAR file is calculated with the standard polynomial 0xEDB88320, but when the size of the CRC is less than 4 bytes, only the least significant bytes are used. In this case (if I understand correctly) the CRC value is 0x6152, so it has 2 bytes. Now I don't know which bytes I have to take as the least significant ones. Those of the standard polynomial (0xEDB88320)? Then 0x8320 would presumably be the least significant bytes of this polynomial. Next, how do I calculate the CRC of the Marker Block (i.e. of the following bytes: 0x 52 61 72 21 1A 07 00) once we have the right polynomial?
There was likely a 16-bit check for an older format that is not derived from a 32-bit CRC. The standard 32-bit CRC, used by zip and rar, applied to the last five bytes of the header has no portion equal to the first two bytes. The Polish page appears to be incorrect in claiming that the two-byte check is the low two-bytes of a 32-bit CRC.
It does appear from the documentation that the header is constructed in the same standard way as other blocks in the older format, so that the author, for fun, arranged for his format to give the check value "Ra" so that it could spell out "Rar!" followed by a text-terminating control-Z.
I found another 16-bit check in the unrar source code, but that check does not result in those values either.
Oh, and no, you can't take part of a CRC polynomial and expect that to be a good CRC polynomial for a smaller check. What the page in Polish is saying is that you would compute the full 32-bit CRC, and then take the low two bytes of the result. However that doesn't work for the magic number header.
Per WinRAR TechNote.txt file included with the install:
The marker block is actually considered as a fixed byte sequence: 0x52 0x61 0x72 0x21 0x1a 0x07 0x00
And as you already indicated, at the very end you can read:
The CRC is calculated using the standard polynomial 0xEDB88320. In case the size of the CRC is less than 4 bytes, only the low order bytes are used.
In Python, the calculation and grabbing of the 2 low order bytes goes like this:
zlib.crc32(correct_byte_range) & 0xffff
rerar has some code that does this, just like the rarfile library that it uses. ReScene .NET source code has an algorithm in C# for calculating the CRC32 hash. See also How do I calculate CRC32 mathematically?
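As a rough illustration of "compute the full 32-bit CRC, then keep the low-order bytes", the sketch below implements the reflected CRC-32 with polynomial 0xEDB88320 bit by bit and compares it with `zlib.crc32`. The helper names `crc32_bitwise` and `low16_crc` are made up, and which bytes of a block the check actually covers is not settled here, so `block` is a placeholder for that range.

```python
import zlib


def crc32_bitwise(data, crc=0):
    """Bit-by-bit reflected CRC-32 using polynomial 0xEDB88320."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right; XOR in the polynomial when the dropped bit was 1.
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def low16_crc(block):
    """Full 32-bit CRC of block, truncated to its two low-order bytes."""
    return zlib.crc32(block) & 0xFFFF


sample = b"123456789"  # conventional CRC test string, not RAR data
print(hex(crc32_bitwise(sample)), hex(zlib.crc32(sample)))
print(hex(low16_crc(sample)))
```

Note that the truncation is applied to the CRC result, never to the polynomial.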
I have an ISA which is "kind of" little endian.
The basic memory unit is an integer, not a byte. For example,
00000000: BEFC03FF 00008000
means that the "low" integer is BEFC03FF and the "high" integer is 00008000.
I need to read the value represented by some bits, for example bits 31 through 47.
What I do in VS10 (C++) is generate uint64_t var = 0x00008000BEFC03FF
and then apply the relevant mask and check the value of var & mask.
Is it legal to do it that way? I am making an assumption about the bit arrangement of uint64_t - is that legal?
Can I assume that for every compiler and every OS (independent of the hardware) the bits in a uint64_t will be arranged this way?
You are right to be concerned; it does matter.
However, in this particular case the ISA is little endian, i.e. if it has AD[31:0], the least significant bit of an integer is packed into bit 0. Assuming your processor is also little endian, there is nothing to worry about. When the data is written to memory, it should have the right byte order:
0000 FF
0001 03
0002 ..
Suppose instead that your external bus protocol is big endian and your processor is little endian. Then a 16-bit integer in your processor, say 0x1234, would be 0001_0010_0011_0100 in native format, but 0010_1100_0100_1000 on the bus (assuming it's 16 bits wide).
In this case multi-byte data crosses an endian boundary, and the hardware will only swap the bits inside each byte, because it must preserve memory contiguity between bytes. After the hardware swap, it becomes:
0000 0001_0010
0001 0011_0100
Then it is up to the software to swap the byte order.
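The combine-and-mask approach from the question can be sketched in Python as follows. Building the 64-bit value with shifts works on values rather than on the in-memory byte order, so the extraction itself does not depend on endianness; byte order only matters when the words are first read from raw bytes. The names `combine_words` and `extract_bits` are illustrative, and bit 0 is assumed to be the least significant bit of the low word.

```python
def combine_words(low, high):
    """Build a 64-bit value from two 32-bit words by arithmetic."""
    return ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)


def extract_bits(value, first, last):
    """Return bits first..last (inclusive, bit 0 = least significant)."""
    width = last - first + 1
    return (value >> first) & ((1 << width) - 1)


value = combine_words(0xBEFC03FF, 0x00008000)
print(hex(value))
print(hex(extract_bits(value, 31, 47)))

# Reading the same value from raw bytes is where byte order has to be stated.
raw = bytes.fromhex("FF03FCBE00800000")
print(int.from_bytes(raw, "little") == value)
```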
With unsigned char you can store a number from 0 to 255
255(b10) = 11111111(b2) <= that's 1 byte
This will make it easy to perform operations like +,-,*...
Now how about:
255(b10) = 10101101(b2)
Would following this method make it possible to represent up to 399 using an unsigned char?
399(b10) = 11111111(b2)
Can someone propose an algorithm to perform addition using the last method?
With eight bits there are only 256 possible values (2^8), no matter how you slice and dice it.
Your scheme to encode digits in a 2-3-3 form like:
255 = 10 101 101
399 = 11 111 111
ignores the fact that those three-bit sequences in there can only represent eight values (0-7), not ten (ie, that second one would be 377, not 399).
The trade-off is that this means you gain the numbers '25[6-7]' (2 values) '2[6-7][0-7]' (16 values) and '3[0-7][0-7]' (64 values) for a total of 82 values.
Your sacrifice for that gain is that you can no longer represent any numbers containing 8 or 9: '[8-9]' (2 values), '[1-7][8-9]' (14 values), '[8-9][0-9]' (20 values), '1[0-7][8-9]' (16 values), '1[8-9][0-9]' (20 values) or '2[0-4][8-9]' (10 values), for a total of 82 values.
The balance there (82 vs. 82) shows that there are still only 256 possible values for an eight-bit data type.
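The counts above can be reproduced by decoding every 8-bit pattern under the 2-3-3 scheme. This is a small Python sketch; `decode_233` is a made-up name for the encoding described in the question.

```python
def decode_233(byte):
    """Read a byte as three 'decimal digits' of 2, 3 and 3 bits."""
    hundreds = (byte >> 6) & 0b11
    tens = (byte >> 3) & 0b111
    ones = byte & 0b111
    return 100 * hundreds + 10 * tens + ones


values = {decode_233(b) for b in range(256)}
gained = sorted(v for v in values if v > 255)      # newly representable
lost = [n for n in range(256) if n not in values]  # no longer representable

print(len(values), len(gained), len(lost))
print(decode_233(0b10101101), decode_233(0b11111111))
```

Every one of the 256 patterns decodes to a distinct number, so each value gained above 255 is paid for by exactly one value lost below it.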
So your encoding scheme is based on a flawed premise, which makes the second part of your question (how to add them) irrelevant, I'm afraid.
An unsigned char can only hold values between 0 and 255, as determined by the rule that 2^n - 1 is the maximum unsigned value that n bits can represent. There is no way to "improve" the range of a char; you probably want to use an unsigned short, which holds two bytes, instead.
You're mistaken.
In your scheme, 255 would be 010101101, which is 9 bits. The leading zero is important. I'm assuming here you're using something that looks like the octal representation. 3 bits/digit. Any other alternative means you cannot represent all the other digits.
|0|000|
|1|001|
|2|010|
|3|011|
|4|100|
|5|101|
|6|110|
|7|111|
|8|???|
|9|???|
9 in binary is 1001.
So you can't use 3 bits per digit. You need to use 4 bits if you want to represent 8 and 9. Again, I'm trying to assume here that you're encoding each digit separately.
So, 399 according to you would be: 001110011001 - 12 bits.
By comparison, binary does 399 in 110001111 - 9 bits.
So binary is the most efficient, because encoding digits from 0 to 9 in your system means that the maximum number you can store without any information loss in 8 bits is 99 - 10011001 :)
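The comparison between four-bits-per-digit encoding and plain binary can be sketched in Python. The helper `to_bcd` is hypothetical and returns both the encoded value and the number of bits it occupies.

```python
def to_bcd(n):
    """Encode each decimal digit of n in 4 bits; return (value, bit count)."""
    result = 0
    shift = 0
    while True:
        n, digit = divmod(n, 10)
        result |= digit << shift
        shift += 4
        if n == 0:
            return result, shift


encoded, bits = to_bcd(399)
print(format(encoded, f"0{bits}b"), bits)        # digit-by-digit encoding
print(format(399, "b"), (399).bit_length())      # plain binary
print(to_bcd(99))                                # largest value that fits in 8 bits
```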
One way to think of a binary representation is as the path taken by a binary (logarithmic) search to find the number.
If you really want to condense the number of bits needed to represent a number, what you're really after is some sort of compression and not the way binary is done.
What you want to do is mathematically impossible. You can only represent 256 discrete values with 8 boolean values.
To test this, make a chart of all possible values, in decimal and binary. I.e.
000 = 00000000
001 = 00000001
002 = 00000010
003 = 00000011
004 = 00000100
...
254 = 11111110
255 = 11111111
You will see that after 255, you need a ninth bit.
You can let 255 = 10101101, but if you work backwards from that, you will run out before you reach 0.
You seem to hope you can somehow use a different counting mechanism to store more values. This is not mathematically possible. See the Pigeonhole Principle.