Why does zeroing the last 12 bits of an mmap offset ensure it is a multiple of __SC_PAGE_SIZE?
For example:
offset = address & ~(PAGE_SIZE - 1);
Here PAGE_SIZE = 4096.
4096dec = 00..001000000000000bin
To build a mask covering all the bits below that 1, compute PAGE_SIZE - 1:
00..000111111111111
The NOT operator flips every bit, so the mask now selects all the bits above those positions:
~00..000111111111111 = 11..11000000000000
and you simply AND the address bits with the above mask, which clears the low 12 bits and keeps the rest.
This is a commonly used bit trick to get a value which is a multiple of a power-of-two number.
One thing you should notice: the code you posted may decrease the address value to make the offset a multiple of the power of two. I.e. if you enter 4500, you'll get an offset of 4096 (you drop to the closest multiple of that power of two at or below the input).
The round-up version of this alignment is used far more often:
aligned_address = (address + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
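To make the difference between the two forms concrete, here is a minimal sketch in C. The helper names (is_power_of_two, align_down, align_up) are made up for illustration, and both functions assume the alignment is a nonzero power of two; align_up also assumes value + alignment - 1 does not overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: true when x has exactly one bit set. */
static int is_power_of_two(uint64_t x) {
    return x != 0 && (x & (x - 1)) == 0;
}

/* Round down: clear the low bits, as in `address & ~(PAGE_SIZE - 1)`. */
static uint64_t align_down(uint64_t value, uint64_t alignment) {
    assert(is_power_of_two(alignment));
    return value & ~(alignment - 1);
}

/* Round up: step past the next boundary first, then clear the low bits.
   Assumes value + alignment - 1 does not overflow. */
static uint64_t align_up(uint64_t value, uint64_t alignment) {
    assert(is_power_of_two(alignment));
    return (value + alignment - 1) & ~(alignment - 1);
}
```

With a 4096-byte page, align_down(4500, 4096) gives 4096 while align_up(4500, 4096) gives 8192; a value that is already aligned, such as 4096, is returned unchanged by both.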
What does & ~(minOffsetAlignment - 1) mean?
Does it mean address of destructor?
This is the code snippet I got it from.
VkDeviceSize is uint64_t.
VkDeviceSize getAlignment(VkDeviceSize instanceSize, VkDeviceSize minOffsetAlignment)
{
    if (minOffsetAlignment > 0)
    {
        return (instanceSize + minOffsetAlignment - 1) & ~(minOffsetAlignment - 1);
    }
    return instanceSize;
}
It's a common trick to align a certain number to a certain boundary. It assumes that minOffsetAlignment is a power of two.
If minOffsetAlignment is a power of two, its binary form will be a one followed by zeros. For example if 8, the binary will be b00001000.
If you subtract one, it will become a mask where all the bits that can be changed will be flagged as one. For the same example, it will be b00000111 (all numbers from 0 to 7).
If you take the complement of this number, it becomes a mask for clearing. In the example, b11111000.
If you AND (&) any number with this mask, it clears all the bits below the alignment, i.e. it rounds the number down to a multiple of the alignment.
For example, take the number 9, b00001001. Computing 9 & ~7 is b00001001 & b11111000, which results in b00001000, or 8.
Because the snippet adds minOffsetAlignment - 1 before masking, the resulting value is instanceSize rounded up to the next multiple of the alignment.
What does &~ mean?
In this case, & is the bitwise and operator. It is a binary operator. Each bit of the result is set if the corresponding bit is set for both operands, otherwise the bit is unset. In this case, the left hand operand is (instanceSize + minOffsetAlignment - 1) and the right hand operand is ~(minOffsetAlignment - 1)
~ is the bitwise not operator. It is a unary operator. Each bit of the result is set if the corresponding bit is unset in the operand, otherwise the bit is unset. In this case, the operand is (minOffsetAlignment - 1)
The idea of this code is to say, "Given a particular (object) instance size, how many bytes of memory are used if the allocator has to round-up object sizes to some power-of-two multiple." Note that if the instance size is already a multiple of the power of two (e.g. 8), then no rounding-up is needed.
It is often the case that memory needs to be allocated in multiples of the hardware's word size in bytes. On a 32-bit platform that would mean a multiple of 4, on 64-bit, a multiple of 8. Note however that the multiple (and thus minOffsetAlignment) must be a power of 2 (the author of this code, for better or worse, is 100% assuming the person calling this function knows this).
Let's assume for the rest of my answer that we're working with a hardware word size of 64-bit, and that our platform must allocate memory in chunks of 8 bytes (64 bits divided by 8 bits per byte == 8 bytes). So minOffsetAlignment will be passed as 8.
Consequently, if the instance size is 1-8 bytes, our function must return 8. If it's 9-16 bytes, it must return 16, and so on.
This answer shows how to round up a value to the nearest multiple of some integer (a value that is already a multiple of that integer is left unchanged).
In the case of rounding up to the nearest multiple of 8, the procedure is to add 7, then do integer division to divide by 8, then multiply by 8.
So:
(0 + 7) / 8 * 8 == 0 // a zero-byte instance requires zero memory bytes
(1 + 7) / 8 * 8 == 8 // a one-byte instance requires eight memory bytes
(2 + 7) / 8 * 8 == 8 // a two-byte instance requires eight memory bytes
...
(8 + 7) / 8 * 8 == 8 // an eight-byte instance requires eight memory bytes
(9 + 7) / 8 * 8 == 16 // a nine-byte instance requires sixteen memory bytes
...
What the code in your question is doing is exactly the above algorithm, but it's using bitwise operations instead of division and multiplication, which on some hardware is faster.
Now, how does it do this?
First we have to get the (instanceSize + 7) part. That's here:
(instanceSize + minOffsetAlignment - 1)
Then we have to divide it by 8, truncate the remainder, and multiply the result of the division by 8.
The last part does that all in one step, and this explains why minOffsetAlignment had to be a power of two.
First, we see:
minOffsetAlignment - 1
If minOffsetAlignment is 8, then its binary value is 0b00001000.
Subtracting 1 from 8 gives 7, which in binary is 0b00000111.
Now the complement of 7 is taken:
~(minOffsetAlignment - 1)
This inverts all the bits so we get ...1111000 (for a 64-bit integer such as VkDeviceSize there are 61 leading 1's and 3 trailing 0's).
Now, putting the whole statement together:
(instanceSize + minOffsetAlignment - 1) & ~(minOffsetAlignment - 1)
We see the & operator will clear the last three bits of whatever the result of (instanceSize + minOffsetAlignment - 1) is.
This forces the return value to be a multiple of 8 (since any binary integer whose last three bits are 0 is a multiple of 8). And since we already added 7, the value is also rounded up to the next multiple of 8, unless instanceSize was already a multiple of 8, in which case it is returned unchanged.
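As a quick sanity check of the claim that the mask form is the same algorithm as add, divide, multiply, here is a small sketch in C. The function names are hypothetical; the division form works for any nonzero alignment, while the mask form only agrees with it when the alignment is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Division form: add (alignment - 1), integer-divide, multiply back. */
static uint64_t round_up_div(uint64_t size, uint64_t alignment) {
    assert(alignment != 0);
    return (size + alignment - 1) / alignment * alignment;
}

/* Mask form, as in the question's getAlignment(). */
static uint64_t round_up_mask(uint64_t size, uint64_t alignment) {
    return (size + alignment - 1) & ~(alignment - 1);
}

/* Returns 1 if both forms agree for every size in [0, limit], else 0. */
static int forms_agree(uint64_t alignment, uint64_t limit) {
    for (uint64_t size = 0; size <= limit; ++size) {
        if (round_up_div(size, alignment) != round_up_mask(size, alignment)) {
            return 0;
        }
    }
    return 1;
}
```

forms_agree(8, 1000) holds, but forms_agree(6, 100) does not: for size 1 the division form gives 6 while the mask form gives 2, which is why the power-of-two assumption matters.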
I found this piece of code:
void* aligned_malloc(size_t required_bytes, size_t alignment) {
    int offset = alignment - 1;
    void* p = (void * ) malloc(required_bytes + offset);
    void* q = (void * ) (((size_t)(p) + offset) & ~(alignment - 1));
    return q;
}
that is the implementation of aligned malloc in C++. Aligned malloc is a function that supports allocating memory such that the memory address returned is divisible by a specific power of two.
Example:
aligned_malloc(1000, 128) will return a memory address that is a multiple of 128 and that points to memory of size 1000 bytes.
But I don't understand line 4. Why sum twice the offset?
Thanks
Why sum twice the offset?
offset isn't exactly being summed twice. First use of offset is for the size to allocate:
void* p = (void * ) malloc(required_bytes + offset);
Second time is for the alignment:
void* q = (void * ) (((size_t)(p) + offset) & ~(alignment - 1));
Explanation:
~(alignment - 1) is the bitwise complement of offset (remember, int offset = alignment - 1;), which gives you the mask you need to satisfy the requested alignment. Arithmetic-wise, adding the offset and doing a bitwise AND (&) with its complement gives you the address of the aligned pointer.
How does this arithmetic work? First, remember that the internal call to malloc() is for required_bytes + offset bytes, not just the size you asked for. For example, say you want to allocate 10 bytes with an alignment of 16 (so the desired behavior is to get 10 bytes starting at an address that is divisible by 16). The malloc() call above then gives you 10+16-1=25 bytes, not necessarily starting at an address divisible by 16. Now, 16-1 is 0x000F and its complement (~) is 0xFFF0 (with all higher bits set as well). Applying the bitwise AND as (p + 15) & 0xFFF0 turns any pointer p into the nearest multiple of 16 at or above p.
But wait, why add this offset of alignment - 1 in the first place? Because once you get the pointer p returned by malloc(), the one thing you cannot do to find the nearest address that is a multiple of the requested alignment is look for it before p, as this could cross into memory belonging to something allocated before p. So you begin by adding alignment - 1, which, if you think about it, is exactly the maximum by which you'd have to advance to reach an aligned address.
* Thanks to user DevSolar for some additional phrasing.
Note 1: For this approach to work, the alignment must be a power of 2. This snippet does not enforce that, so any other value could cause unexpected behavior.
Note 2: An interesting question is how could you implement a free() version for such an allocation, with the return value from this function.
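One possible way to address both notes, sketched in C: reject alignments that are not a power of two, and stash the pointer returned by malloc() just below the aligned block so that a matching free function can recover it. The names aligned_malloc_sketch and aligned_free_sketch are hypothetical, and this is only an illustration of the idea, not a replacement for platform facilities such as aligned_alloc or posix_memalign.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch: returns NULL on bad alignment, overflow, or OOM. */
static void *aligned_malloc_sketch(size_t required_bytes, size_t alignment) {
    /* The mask trick only works for a nonzero power of two. */
    if (alignment == 0 || (alignment & (alignment - 1)) != 0) {
        return NULL;
    }
    /* Extra room: up to alignment - 1 bytes of slack plus one pointer slot. */
    size_t extra = (alignment - 1) + sizeof(void *);
    if (required_bytes > SIZE_MAX - extra) {
        return NULL;
    }
    void *raw = malloc(required_bytes + extra);
    if (raw == NULL) {
        return NULL;
    }
    /* Skip the pointer slot, then round up to the requested alignment. */
    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + (alignment - 1)) & ~(uintptr_t)(alignment - 1);
    assert(aligned % alignment == 0);

    /* Remember the original pointer in the slot just below the block. */
    ((void **)aligned)[-1] = raw;
    return (void *)aligned;
}

/* Frees a block obtained from aligned_malloc_sketch(); NULL is a no-op. */
static void aligned_free_sketch(void *ptr) {
    if (ptr != NULL) {
        free(((void **)ptr)[-1]);
    }
}
```

The cost of this design is one extra pointer per allocation; the benefit is that the caller only has to keep the aligned pointer around.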
Started working on screen capturing software specifically targeted for Windows. While looking through an example on MSDN for Capturing an Image I found myself a bit confused.
Keep in mind when I refer to the size of the bitmap that does not include headers and so forth associated with an actual file. I'm talking about raw pixel data. I would have thought that the formula should be (width*height)*bits-per-pixel. However, according to the example this is the proper way to calculate the size:
DWORD dwBmpSize = ((bmpScreen.bmWidth * bi.biBitCount + 31) / 32) * 4 * bmpScreen.bmHeight;
or, in general terms: ((width*bits-per-pixel + 31) / 32) * 4 * height
I don't understand why there's the extra calculations involving 31, 32 and 4. Perhaps padding? I'm not sure but any explanations would be quite appreciated. I've already tried Googling and didn't find any particularly helpful results.
The bits representing the bitmap pixels are packed in rows. The size of each row is rounded up to a multiple of 4 bytes (a 32-bit DWORD) by padding.
(bits_per_row + 31)/32 * 4 ensures the round up to the next multiple of 32 bits. The answer is in bytes, rather than bits hence *4 rather than *32.
See: https://en.wikipedia.org/wiki/BMP_file_format
Under Bitmap Header Types you'll find the following:
The scan lines are DWORD aligned [...]. They must be padded for scan line widths, in bytes, that are not evenly divisible by four [...]. For example, a 10- by 10-pixel 24-bpp bitmap will have two padding bytes at the end of each scan line.
The formula
((bmpScreen.bmWidth * bi.biBitCount + 31) / 32) * 4
establishes DWORD-alignment (in bytes). The trailing * 4 is really the result of * 32 / 8, where the multiplication with 32 produces a value that's a multiple of 32 (in bits), and the division by 8 translates it back to bytes.
Although this does produce the desired result, I prefer a different implementation. A DWORD is 32 bits, i.e. a power of 2. Rounding up to a multiple of a power of 2 can be implemented using the following formula:
(value + ((1 << n) - 1)) & ~((1 << n) - 1)
Adding (1 << n) - 1 pushes the initial value up to or past the next multiple of 2^n (unless it already is a multiple of 2^n, in which case it stays below the following one). (1 << n) - 1 evaluates to a value where the n least significant bits are set; ~((1 << n) - 1) inverts that, i.e. all bits but the n least significant bits are set. This serves as a mask to remove the n least significant bits of the adjusted initial value.
Applied to this specific case, a DWORD is 32 bits, so n is 5 and (1 << n) - 1 evaluates to 31. The value is the raw scanline width in bits:
auto raw_scanline_width_in_bits{ bmpScreen.bmWidth * bi.biBitCount };
auto aligned_scanline_width_in_bits{ (raw_scanline_width_in_bits + 31) & ~31 };
auto aligned_scanline_width_in_bytes{ aligned_scanline_width_in_bits / 8 };
This produces the same results, but provides a different perspective, that may be more accessible to some.
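For anyone who wants to convince themselves that the two formulas agree, here is a small sketch in C. The function names are invented for this example, widths and bit depths are plain unsigned integers rather than the BITMAP / BITMAPINFOHEADER fields, and the assert is only there to rule out 32-bit overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Raw row width in bits; the assert guards against 32-bit overflow. */
static uint32_t row_bits(uint32_t width, uint32_t bits_per_pixel) {
    assert(bits_per_pixel != 0 && width <= (UINT32_MAX - 31) / bits_per_pixel);
    return width * bits_per_pixel;
}

/* Row size in bytes using the division formula from the MSDN example. */
static uint32_t stride_div(uint32_t width, uint32_t bits_per_pixel) {
    return (row_bits(width, bits_per_pixel) + 31) / 32 * 4;
}

/* Row size in bytes using the mask formula: round bits up to 32, then /8. */
static uint32_t stride_mask(uint32_t width, uint32_t bits_per_pixel) {
    uint32_t aligned_bits = (row_bits(width, bits_per_pixel) + 31) & ~UINT32_C(31);
    return aligned_bits / 8;
}
```

For the 10-pixel-wide, 24-bpp case quoted above, both return 32 bytes per row: 30 bytes of pixel data plus two padding bytes. The total pixel buffer size is then the row size multiplied by the height.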
I need a function to read n bits starting from bit x (bit indices start from zero) and, if the result is not byte-aligned, pad it with zeros. The function will receive a uint8_t array as input and should return a uint8_t array as well. For example, I have a file with the following contents:
1011 0011 0110 0000
Read three bits starting from the third bit (x=2, n=3). Result:
1100 0000
There's no (theoretical) limit on the input and bit pattern lengths.
Implementing such a bitfield extraction efficiently, going beyond the direct bit-serial algorithm, isn't precisely hard but is a tad cumbersome.
Effectively it boils down to an innerloop reading a pair of bytes from the input for each output byte, shifting the resulting word into place based on the source bit-offset, and writing back the upper or lower byte. In addition the final output byte is masked based on the length.
Below is my (poorly-tested) attempt at an implementation:
#include <limits.h>  /* CHAR_BIT, UCHAR_MAX */
#include <stddef.h>  /* size_t */

void extract_bitfield(unsigned char *dstptr, const unsigned char *srcptr, size_t bitpos, size_t bitlen) {
    // Skip to the source byte covering the first bit of the range
    srcptr += bitpos / CHAR_BIT;
    // Similarly work out the expected, inclusive, final output byte
    // (so dstptr must have room for bitlen / CHAR_BIT + 1 bytes)
    unsigned char *endptr = &dstptr[bitlen / CHAR_BIT];
    // Truncate the bit-positions to offsets within a byte
    bitpos %= CHAR_BIT;
    bitlen %= CHAR_BIT;
    // Scan through and write out a correctly shifted version of every destination byte
    // via an intermediate shifter register
    unsigned long accum = *srcptr++;
    while(dstptr <= endptr) {
        accum = accum << CHAR_BIT | *srcptr++;
        *dstptr++ = accum << bitpos >> CHAR_BIT;
    }
    // Mask out the unwanted LSB bits not covered by the length
    *endptr &= ~(UCHAR_MAX >> bitlen);
}
Beware that the code above may read past the end of the source buffer, and somewhat messy special handling is required if you can't arrange extra padding to allow this. It also assumes sizeof(long) != 1.
Of course, to get efficiency out of this you will want to use as wide a native word as possible. However, if the target buffer isn't necessarily word-aligned, things get even messier. Furthermore, little-endian systems will need byte-swizzling fix-ups.
Another subtlety to take heed of is the potential inability to shift a whole word, that is shift counts are frequently interpreted modulo the word length.
Anyway, happy bit-hacking!
Basically it's still a bunch of shift and addition operations.
I'll use a slightly larger example to demonstrate this.
Suppose we are given an input of 4 characters, and x = 10, n = 18.
00101011 10001001 10101110 01011100
First we need to locate the character that contains our first bit, by x / 8, which gives us 1 (the second character) in this case. We also need the offset within that character, by x % 8, which equals 2.
Now we can get our first character of the solution in three operations.
Left-shifting the second character 10001001 by 2 bits gives us 00100100.
Right-shifting the third character 10101110 by 6 (which comes from 8 - 2) bits gives us 00000010.
Adding (or ORing) these two characters gives us the first character of the return string, 00100110.
Loop this routine for n / 8 rounds. If n % 8 is not 0, extract that many bits from the next character; there are many ways to do this.
So in this example, our second round will give us 10111001, and in the last step we get 01, then pad the remaining bits with 0s, giving 01000000.
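For reference, the direct bit-serial version is short enough to sketch in C and is handy for checking the faster variants against. The name extract_bits_serial is made up; the sketch assumes bit 0 is the most significant bit of the first byte (as in the examples above), that dst has room for (bitlen + 7) / 8 bytes, and that src covers every bit in the requested range.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies bitlen bits starting at bit index bitpos
   into dst, left-aligned, with the unused tail of the last byte zeroed. */
static void extract_bits_serial(unsigned char *dst, const unsigned char *src,
                                size_t bitpos, size_t bitlen) {
    /* Zero the output first so the padding bits are already 0. */
    memset(dst, 0, (bitlen + CHAR_BIT - 1) / CHAR_BIT);
    for (size_t i = 0; i < bitlen; ++i) {
        size_t s = bitpos + i;
        /* Read source bit s, counting from the MSB of each byte. */
        unsigned bit = (src[s / CHAR_BIT] >> (CHAR_BIT - 1 - s % CHAR_BIT)) & 1u;
        /* Write it to output bit i, again MSB first. */
        dst[i / CHAR_BIT] |= (unsigned char)(bit << (CHAR_BIT - 1 - i % CHAR_BIT));
    }
}
```

Unlike the word-at-a-time approaches, this never reads past the last source byte it actually needs, at the price of one loop iteration per bit.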
My key is a 64-bit address and the output is a 1-byte number (0-255). Collisions are allowed, but the probability of them occurring should be low. Also, assume that the number of elements to be inserted is low, let's say not more than 255, so as to minimize the pigeonhole effect.
The addresses are addresses of the functions in the program.
uint64_t addr = ...
uint8_t hash = addr & 0xFF;
I think that meets all of your requirements.
I would XOR together the 2 LSBs (least significant bytes). If this distributes badly, then add a 3rd one, and so forth.
The rationale behind this is the following: function addresses do not distribute uniformly. The problem normally lies in the lower (lsb) bits. Functions usually need to begin in addresses divisible by 4/8/16 so the 2-4 lsb are probably meaningless. By XORing with the next byte, you should get rid of most of these problems and it's still pretty fast.
Function addresses are, I think, quite likely to be aligned (see this question, for instance). That seems to indicate that you want to skip least significant bits, depending on the alignment.
So, perhaps take the 8 bits starting from bit 3, i.e. skipping the least significant 3 bits (bits 0 through 2):
const uint8_t hash = (address >> 3);
This should be obvious from inspection of your set of addresses. In hex, watch the rightmost digit.
How about:
uint64_t data = 0x1213121212121B12;
uint32_t d1 = (data >> 32) ^ (uint32_t)(data);
uint16_t d2 = (d1 >> 16) ^ (uint16_t)(d1);
uint8_t d3 = (d2 >> 8) ^ (uint8_t)(d2);
return d3;
It combines all bits of your 8 bytes with three shifts and three XOR instructions.
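To compare the suggestions side by side, here is a sketch in C that wraps three of them as functions. The function names are invented, and the shift of 3 in the second one is just the example value from the answer above; the right amount depends on how your functions are actually aligned. Which variant distributes best is something to measure on your own set of addresses.

```c
#include <stdint.h>

/* Lowest byte only; weak if the low bits are always zero due to alignment. */
static uint8_t hash_low_byte(uint64_t addr) {
    return (uint8_t)(addr & 0xFF);
}

/* Skip the 3 least significant bits, then take the next 8 bits. */
static uint8_t hash_skip_low_bits(uint64_t addr) {
    return (uint8_t)(addr >> 3);
}

/* XOR-fold 64 -> 32 -> 16 -> 8 bits; equals the XOR of all eight bytes. */
static uint8_t hash_xor_fold(uint64_t addr) {
    uint32_t d1 = (uint32_t)(addr >> 32) ^ (uint32_t)addr;
    uint16_t d2 = (uint16_t)((d1 >> 16) ^ d1);
    uint8_t d3 = (uint8_t)((d2 >> 8) ^ d2);
    return d3;
}
```

For example, hash_xor_fold(0x0102030405060708) is 0x08, the XOR of the bytes 01 through 08.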