struct s11
{
alignas(16) char s;
int i;
};
Does alignas(n) X x mean that n bytes are allocated to hold an X even when X is much smaller than n bytes, with the rest of the memory set to 0?
When s11.s is at address 0x0, does that mean s11.i would be at 0x10? However, the following pdiff is 0x4:
s11 o;
ptrdiff_t pdiff = (char*)&o.i - (char*)&o.s;
Please explain why sizeof(s11) is not 16 times 2, while sizeof(s1) is 8 times 3?
struct s1
{
char s;
double d; // alignof(double) is 8.
int i;
};
EDIT:
This is what I found, after a lot of digging and experimenting with code snippets, about how alignas affects struct (class) size.
alignas(n) X x means that x is placed in memory at an offset that is a multiple of n bytes, counting from 0x00. Therefore, s11.s can be located at 0x00, 0x10, 0x20 and so on. Because s11.s is the first field, it is at 0x00, and because it is a char, it occupies 1 byte.
The same goes for the second field. int has an alignment of 4, so it can be located at 0x00, 0x04, 0x08, 0x0c, etc. Because the first byte is taken by s11.s, s11.i is located at the next suitable offset, 0x04.
when alignof operator used in struct does it affect sizeof
You aren't using the alignof operator. You're using the alignas specifier.
Using the alignas specifier doesn't necessarily affect the size of a class, but it can.
does alignas(n) X x mean it would allocate n bytes to hold a X, when X is actually much smaller than n bytes, the rest memory would be all 0s?
No. If there is padding, there's no guarantee that it would be all 0s. Furthermore, there are cases where the memory can be occupied by other members.
when s11.s is in memory 0x0, does that mean s11.i would be at 0x10?
No.
Please explain why sizeof(s11) is not 16 times 2
Probably because it doesn't need to be. Here is an example layout that would satisfy the alignment requirements:
offset  member   alignment  offset % alignment must equal 0
0       s        16         0 % 16 == 0
1       padding
2       padding
3       padding
4       i        4          4 % 4 == 0
5       i
6       i
7       i
8       padding
9       padding
10      padding
11      padding
12      padding
13      padding
14      padding
15      padding
---
super alignment  16         16 % 16 == 0
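A layout like the one in the table can be checked at compile time. This is only a sketch, assuming a typical ABI where int is 4 bytes with 4-byte alignment; typical_int is a name made up here to carry that assumption, and other platforms may lay the struct out differently.

```cpp
#include <cstddef>  // offsetof

struct s11 {
    alignas(16) char s;  // s must start at an offset that is a multiple of 16
    int i;               // i only needs the ordinary alignment of int
};

// Assumption (hypothetical name): int is 4 bytes and 4-byte aligned,
// as on common x86-64 ABIs.
constexpr bool typical_int = sizeof(int) == 4 && alignof(int) == 4;

// The struct is at least as strictly aligned as its most strictly aligned member.
static_assert(alignof(s11) == 16, "struct adopts the 16-byte alignment of s");
static_assert(offsetof(s11, s) == 0, "first member is at offset 0");

// i lands on the first multiple of 4 after s, not on the next multiple of 16.
static_assert(!typical_int || offsetof(s11, i) == 4, "i follows s after 3 padding bytes");

// The size is rounded up to a multiple of the struct alignment: 16, not 32.
static_assert(!typical_int || sizeof(s11) == 16, "both members fit in one 16-byte unit");
```

If the file compiles, the compiler in use agrees with the table.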
while sizeof(s1) is 8 times 3?
Here is an example layout that would satisfy the alignment requirements:
offset  member   alignment  offset % alignment must equal 0
0       s        1          0 % 1 == 0
1       padding
2       padding
3       padding
4       padding
5       padding
6       padding
7       padding
8       d        8          8 % 8 == 0
9       d
10      d
11      d
12      d
13      d
14      d
15      d
16      i        4          16 % 4 == 0
17      i
18      i
19      i
20      padding
21      padding
22      padding
23      padding
---
super alignment  8          24 % 8 == 0
My examples apply to a system where double is 8 bytes and int is 4.
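The same kind of compile-time check works for s1. Again a sketch under the assumption stated above (8-byte double, 4-byte int, each aligned to its size); typical_abi is a made-up name for that assumption.

```cpp
#include <cstddef>  // offsetof

struct s1 {
    char s;
    double d;
    int i;
};

// Assumption (hypothetical name): double is 8 bytes, int is 4 bytes,
// and each is aligned to its own size.
constexpr bool typical_abi = sizeof(double) == 8 && alignof(double) == 8 &&
                             sizeof(int) == 4 && alignof(int) == 4;

// 7 padding bytes after s so that d starts on a multiple of 8.
static_assert(!typical_abi || offsetof(s1, d) == 8, "d is at offset 8");
// i follows d directly, since 16 is already a multiple of 4.
static_assert(!typical_abi || offsetof(s1, i) == 16, "i is at offset 16");
// 4 trailing padding bytes round the size up to a multiple of alignof(s1).
static_assert(!typical_abi || alignof(s1) == 8, "struct alignment is 8");
static_assert(!typical_abi || sizeof(s1) == 24, "size is 8 times 3");
```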
1. No, it means the starting address of the member s must be aligned on a 16-byte boundary. Because s is a member of a struct, the struct itself adopts that alignment; the size of s is not affected.
2. No, there is no alignment specified for the member i, so the normal alignment rules apply. If you're getting 4, then your system uses integers that are 4 bytes in size and are aligned accordingly. The value of padding bytes is unspecified.
3. The first explanation is in my answer to #2. Following up on the answer to #1, the size will probably be 16 due to the required structure alignment. And sizeof(s1) is 8 times 3 because the alignment requirement of the double member causes padding elsewhere, combined with the alignment of the entire structure. Moving the char member to the end of the structure will reduce the total size.
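To illustrate the last point, here is a small sketch comparing the original member order with one where the char comes last. The struct names and typical_abi are invented for the example, and the numbers assume an 8-byte double and a 4-byte int, each aligned to its size.

```cpp
#include <cstddef>  // offsetof

struct s1_original  { char s; double d; int i; };  // order from the question
struct s1_reordered { double d; int i; char s; };  // char moved to the end

// Assumption (hypothetical name): 8-byte double, 4-byte int, naturally aligned.
constexpr bool typical_abi = sizeof(double) == 8 && alignof(double) == 8 &&
                             sizeof(int) == 4 && alignof(int) == 4;

static_assert(!typical_abi || sizeof(s1_original) == 24, "7 + 4 bytes of padding");

// After reordering, i and s fill the space right behind d,
// so only 3 trailing padding bytes remain.
static_assert(!typical_abi || offsetof(s1_reordered, i) == 8, "i directly after d");
static_assert(!typical_abi || offsetof(s1_reordered, s) == 12, "s directly after i");
static_assert(!typical_abi || sizeof(s1_reordered) == 16, "8 bytes smaller");
```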
I have two functions that print a 32-bit number in binary.
First one divides the number into bytes and starts printing from the last byte (from the 25th bit of the whole integer).
Second one is more straightforward and starts from the 1st bit of the number.
It seems to me that these functions should have different outputs, because they process the bits in different orders. However the outputs are the same. Why?
#include <stdio.h>
void printBits(size_t const size, void const * const ptr)
{
unsigned char *b = (unsigned char*) ptr;
unsigned char byte;
int i, j;
for (i=size-1;i>=0;i--)
{
for (j=7;j>=0;j--)
{
byte = (b[i] >> j) & 1;
printf("%u", byte);
}
}
puts("");
}
void printBits_2( unsigned *A) {
for (int i=31;i>=0;i--)
{
printf("%u", (A[0] >> i ) & 1u );
}
puts("");
}
int main()
{
unsigned a = 1014750;
printBits(sizeof(a), &a); // ->00000000000011110111101111011110
printBits_2(&a); // ->00000000000011110111101111011110
return 0;
}
Both of your functions print the binary representation of the number from the most significant bit to the least significant bit. Today's PCs (and the majority of other computer architectures) use the so-called little-endian format, in which multi-byte values are stored with the least significant byte first.
That means that the 32-bit value 0x01020304 stored at address 0x1000 will look like this in memory:
+--------++--------+--------+--------+--------+
|Address || 0x1000 | 0x1001 | 0x1002 | 0x1003 |
+--------++--------+--------+--------+--------+
|Data    ||  0x04  |  0x03  |  0x02  |  0x01  |
+--------++--------+--------+--------+--------+
Therefore, on little-endian architectures, printing a value's bits from MSB to LSB is equivalent to taking its bytes in reverse order and printing each byte's bits from MSB to LSB.
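The byte order in the table can be observed directly. The following is a minimal sketch; bytes_in_memory, host_is_little_endian and self_check are names invented for this example, and the check only makes a claim when the host turns out to be little-endian.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Copy the object representation of a 32-bit value into a byte array,
// in increasing address order.
std::array<unsigned char, 4> bytes_in_memory(std::uint32_t value) {
    std::array<unsigned char, 4> bytes{};
    std::memcpy(bytes.data(), &value, sizeof value);
    return bytes;
}

// True when the lowest address holds the least significant byte.
bool host_is_little_endian() {
    return bytes_in_memory(0x01020304u)[0] == 0x04;
}

// On a little-endian host the bytes appear as in the table: 04 03 02 01.
void self_check() {
    const auto b = bytes_in_memory(0x01020304u);
    if (host_is_little_endian()) {
        assert(b[0] == 0x04 && b[1] == 0x03 && b[2] == 0x02 && b[3] == 0x01);
    }
}
```

Calling self_check() from main is enough to try it; std::memcpy is used because it is a well-defined way to look at the bytes of an object.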
This is the expected result when:
1) You use both functions to print a single integer, in binary.
2) Your C or C++ implementation targets a little-endian hardware platform.
Change either one of these factors (with printBits_2 appropriately adjusted), and the results will be different.
They don't process the bits in different orders. Here's a visual:
Bytes:            4                       3                       2                       1
Bits:   8  7  6  5  4  3  2  1  8  7  6  5  4  3  2  1  8  7  6  5  4  3  2  1  8  7  6  5  4  3  2  1
Bits:  32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1
The fact that the output is the same from both of these functions tells you that your platform uses Little-Endian encoding, which means the most significant byte comes last.
The first two rows show how the first function works on your program, and the last row shows how the second function works.
However, the first function will fail on platforms that use big-endian encoding; there it outputs the bits in the order shown in the third row:
Bytes:            4                       3                       2                       1
Bits:   8  7  6  5  4  3  2  1  8  7  6  5  4  3  2  1  8  7  6  5  4  3  2  1  8  7  6  5  4  3  2  1
Bits:   8  7  6  5  4  3  2  1 16 15 14 13 12 11 10  9 24 23 22 21 20 19 18 17 32 31 30 29 28 27 26 25
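The difference between the two strategies can be sketched as two string-building functions. The names bits_by_value, bits_by_memory and self_check are invented for illustration; they mirror printBits_2 and printBits from the question but return strings instead of printing.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Value-based, like printBits_2: shifts operate on the numeric value,
// so the result does not depend on the byte order of the host.
std::string bits_by_value(std::uint32_t v) {
    std::string out;
    for (int i = 31; i >= 0; --i) out += ((v >> i) & 1u) ? '1' : '0';
    return out;
}

// Memory-based, like printBits: walk the bytes from the highest address
// down to the lowest, so the result depends on the byte order of the host.
std::string bits_by_memory(std::uint32_t v) {
    unsigned char b[sizeof v];
    std::memcpy(b, &v, sizeof v);
    std::string out;
    for (int i = static_cast<int>(sizeof v) - 1; i >= 0; --i)
        for (int j = 7; j >= 0; --j) out += ((b[i] >> j) & 1u) ? '1' : '0';
    return out;
}

// 1014750 is the value from the question, with the output shown there.
void self_check() {
    assert(bits_by_value(1014750u) == "00000000000011110111101111011110");
}
```

On a little-endian host both functions return the same string for every input; on a big-endian host they only agree for values whose byte sequence reads the same in both directions.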
The printBits function takes the pointer to the unsigned value and assigns it to an unsigned char pointer:
unsigned char *b = (unsigned char*) ptr;
On a big-endian processor, b[0] is the most significant byte of the value; on a little-endian processor, b[size-1] is. The outer loop starts at b[size-1] and works down to b[0], and the inner loop prints each byte in binary, so on a little-endian machine this method prints the value most significant byte first.
As for printBits_2, you are using
unsigned *A
i.e. a pointer to unsigned int. This loop runs from bit 31 down to bit 0 and prints the value in binary, most significant bit first, regardless of byte order.
If an SSE/AVX register's value is such that all its bytes are either 0 or 1, is there any way to efficiently get the indices of all non zero elements?
For example, if xmm value is
| r0=0 | r1=1 | r2=0 | r3=1 | r4=0 | r5=1 | r6=0 |...| r14=0 | r15=1 |
the result should be something like (1, 3, 5, ... , 15). The result should be placed in another __m128i variable or a char[16] array.
If it helps, we can assume that the register's value is such that all bytes are either 0 or some constant nonzero value (not necessarily 1).
I am wondering whether there is an instruction for that, or preferably a C/C++ intrinsic, in any SSE or AVX instruction set.
EDIT 1:
It was correctly observed by @zx485 that the original question was not clear enough. I was looking for any "consecutive" solution.
The example 0 1 0 1 0 1 0 1... above should result in either of the following:
If we assume that indices start from 1, then 0 would be a termination byte and the result might be
002 004 006 008 010 012 014 016 000 000 000 000 000 000 000 000
If we assume that negative byte is a termination byte the result might be
001 003 005 007 009 011 013 015 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF
Anything that gives us consecutive bytes which we can interpret as indices of the non-zero elements in the original value.
EDIT 2:
Indeed, as @harold and @Peter Cordes suggest in the comments to the original post, one possible solution is to create a mask first (e.g. with pmovmskb) and check for non-zero indices there. But that will lead to a loop.
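For reference, the mask-then-loop idea could look roughly like this. It is a sketch rather than the commenters' actual code: nonzero_indices_loop is a name made up for illustration, only SSE2 intrinsics are used, and the loop simply tests each of the 16 mask bits.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Writes the indices of the nonzero bytes of x to out, in ascending order,
// and returns how many indices were written.
int nonzero_indices_loop(__m128i x, std::uint8_t out[16]) {
    // 0xFF in every byte position where x holds 0.
    const __m128i is_zero = _mm_cmpeq_epi8(x, _mm_setzero_si128());
    // pmovmskb: collect the top bit of each byte into a 16-bit mask.
    const unsigned zero_mask = static_cast<unsigned>(_mm_movemask_epi8(is_zero));
    // Invert so that a set bit marks a nonzero byte.
    const unsigned nonzero_mask = ~zero_mask & 0xFFFFu;

    int count = 0;
    for (int i = 0; i < 16; ++i) {
        if (nonzero_mask & (1u << i)) out[count++] = static_cast<std::uint8_t>(i);
    }
    return count;
}
```

Comparing against zero first means the function also works when the nonzero bytes hold a value other than 1.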
Your question was unclear as to whether you want the result array to be "compressed". By "compressed" I mean that the result should be consecutive. So, for example, for 0 1 0 1 0 1 0 1..., there are two possibilities:
Non-consecutive:
XMM0: 000 001 000 003 000 005 000 007 000 009 000 011 000 013 000 015
Consecutive:
XMM0: 001 003 005 007 009 011 013 015 000 000 000 000 000 000 000 000
One problem of the consecutive approach is: how do you decide if it's index 0 or a termination value?
I'm offering a simple solution to the first, non-consecutive approach, which should be quite fast:
.data
ddqZeroToFifteen db 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
ddqTestValue: db 0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1
.code
movdqa xmm0, xmmword ptr [ddqTestValue]
pxor xmm1, xmm1 ; zero XMM1
pcmpeqb xmm0, xmm1 ; set to -1 for all matching
pandn xmm0, xmmword ptr [ddqZeroToFifteen] ; invert and apply indices
Just for the sake of completeness: the second, the consecutive approach, is not covered in this answer.
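The same non-consecutive approach can be written with SSE2 intrinsics. This is a sketch of an equivalent, not a translation checked against any particular assembler; nonzero_indices_in_place is a hypothetical name.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Returns a vector in which every nonzero byte of x is replaced by its own
// index and every zero byte stays 0 (so index 0 cannot be told apart from
// "no element").
__m128i nonzero_indices_in_place(__m128i x) {
    // Byte k holds the value k (setr takes the arguments in memory order).
    const __m128i indices =
        _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
    // pcmpeqb: 0xFF where the byte of x is 0.
    const __m128i is_zero = _mm_cmpeq_epi8(x, _mm_setzero_si128());
    // pandn: (~is_zero) & indices keeps the index only where x was nonzero.
    return _mm_andnot_si128(is_zero, indices);
}
```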
Updated answer: the new solution is slightly more efficient.
You can do this without a loop by using the pext instruction from the Bit Manipulation Instruction Set 2 (BMI2), in combination with a few other SSE instructions.
/*
gcc -O3 -Wall -m64 -mavx2 -march=broadwell ind_nonz_avx.c
*/
#include <stdio.h>
#include <immintrin.h>
#include <stdint.h>
__m128i nonz_index(__m128i x){
/* Set some constants that will (hopefully) be hoisted out of a loop after inlining. */
uint64_t indx_const = 0xFEDCBA9876543210; /* 16 4-bit integers, all possible indices from 0 to 15 */
__m128i cntr = _mm_set_epi8(64,60,56,52,48,44,40,36,32,28,24,20,16,12,8,4);
__m128i pshufbcnst = _mm_set_epi8(0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80, 0x0E,0x0C,0x0A,0x08,0x06,0x04,0x02,0x00);
__m128i cnst0F = _mm_set1_epi8(0x0F);
__m128i msk = _mm_cmpeq_epi8(x,_mm_setzero_si128()); /* Generate 16x8 bit mask. */
msk = _mm_srli_epi64(msk,4); /* Pack 16x8 bit mask to 16x4 bit mask. */
msk = _mm_shuffle_epi8(msk,pshufbcnst); /* Pack 16x8 bit mask to 16x4 bit mask, continued. */
uint64_t msk64 = ~ _mm_cvtsi128_si64x(msk); /* Move to general purpose register and invert 16x4 bit mask. */
/* Compute the termination byte nonzmsk separately. */
int64_t nnz64 = _mm_popcnt_u64(msk64); /* Count the nonzero bits in msk64. */
__m128i nnz = _mm_set1_epi8(nnz64); /* May generate vmovd + vpbroadcastb if AVX2 is enabled. */
__m128i nonzmsk = _mm_cmpgt_epi8(cntr,nnz); /* nonzmsk is a mask of the form 0xFF, 0xFF, ..., 0xFF, 0, 0, ...,0 to mark the output positions without an index */
uint64_t indx64 = _pext_u64(indx_const,msk64); /* parallel bits extract. pext shuffles indx_const such that indx64 contains the nnz64 4-bit indices that we want.*/
__m128i indx = _mm_cvtsi64x_si128(indx64); /* Use a few integer instructions to unpack 4-bit integers to 8-bit integers. */
__m128i indx_024 = indx; /* Even indices. */
__m128i indx_135 = _mm_srli_epi64(indx,4); /* Odd indices. */
indx = _mm_unpacklo_epi8(indx_024,indx_135); /* Merge odd and even indices. */
indx = _mm_and_si128(indx,cnst0F); /* Mask out the high bits 4,5,6,7 of every byte. */
return _mm_or_si128(indx,nonzmsk); /* Merge indx with nonzmsk . */
}
int main(){
int i;
char w[16],xa[16];
__m128i x;
/* Example with bytes 15, 12, 7, 5, 4, 3, 2, 1, 0 set. */
x = _mm_set_epi8(1,0,0,1, 0,0,0,0, 1,0,1,1, 1,1,1,1);
/* Other examples. */
/*
x = _mm_set_epi8(1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1);
x = _mm_set_epi8(0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0);
x = _mm_set_epi8(1,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0);
x = _mm_set_epi8(0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,1);
*/
__m128i indices = nonz_index(x);
_mm_storeu_si128((__m128i *)w,indices);
_mm_storeu_si128((__m128i *)xa,x);
printf("counter 15..0 ");for (i=15;i>-1;i--) printf(" %2d ",i); printf("\n\n");
printf("example xmm: ");for (i=15;i>-1;i--) printf(" %2d ",xa[i]); printf("\n");
printf("result in dec ");for (i=15;i>-1;i--) printf(" %2hhd ",w[i]); printf("\n");
printf("result in hex ");for (i=15;i>-1;i--) printf(" %2hhX ",w[i]); printf("\n");
return 0;
}
It takes about five instructions to get 0xFF (the termination byte) at the unwanted positions.
Note that a function nonz_index that returns the indices and only the position of the termination byte, without actually
inserting the termination byte(s), would be much cheaper to compute and might be as suitable in a particular application.
The position of the first termination byte is nnz64>>2.
The result is:
$ ./a.out
counter 15..0   15  14  13  12  11  10   9   8   7   6   5   4   3   2   1   0
example xmm:     1   0   0   1   0   0   0   0   1   0   1   1   1   1   1   1
result in dec   -1  -1  -1  -1  -1  -1  -1  15  12   7   5   4   3   2   1   0
result in hex   FF  FF  FF  FF  FF  FF  FF   F   C   7   5   4   3   2   1   0
The pext instruction is supported on Intel Haswell processors or newer.
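To see what pext does with indx_const, a slow bit-by-bit reference version helps. This is a sketch for illustration only: pext_reference is a made-up name, it is not how the hardware instruction is implemented, and it needs C++14 or later because of the loop in a constexpr function.

```cpp
#include <cstdint>

// For every set bit in mask, copy the corresponding bit of src into the
// next free low bit of the result. Bits of src under a clear mask bit
// are dropped.
constexpr std::uint64_t pext_reference(std::uint64_t src, std::uint64_t mask) {
    std::uint64_t result = 0;
    int out_bit = 0;
    for (int bit = 0; bit < 64; ++bit) {
        if ((mask >> bit) & 1u) {
            result |= ((src >> bit) & 1u) << out_bit;
            ++out_bit;
        }
    }
    return result;
}

// Example from the answer: bytes 15, 12, 7, 5, 4, 3, 2, 1, 0 are nonzero, so
// the inverted 16x4-bit mask has an F nibble at exactly those positions.
// The selected nibbles of indx_const are packed together, lowest index first.
static_assert(pext_reference(0xFEDCBA9876543210ULL, 0xF00F0000F0FFFFFFULL) ==
                  0xFC7543210ULL,
              "nine 4-bit indices: F C 7 5 4 3 2 1 0");
```

The packed value 0xFC7543210 corresponds to the "result in hex" row once each 4-bit index has been unpacked into its own byte.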
On 64 bit, the size of SP_DEVINFO_LIST_DETAIL_DATA_W is 560. Shouldn't it be 554?
typedef struct _SP_DEVINFO_LIST_DETAIL_DATA_W {
DWORD cbSize;
GUID ClassGuid;
HANDLE RemoteMachineHandle;
WCHAR RemoteMachineName[SP_MAX_MACHINENAME_LENGTH];
} SP_DEVINFO_LIST_DETAIL_DATA_W, *PSP_DEVINFO_LIST_DETAIL_DATA_W;
cbSize is 4, ClassGuid is 16, RemoteMachineHandle is 8 (64 bit), RemoteMachineName is 2*(260+3) (SP_MAX_MACHINENAME_LENGTH is MAX_PATH + 3)
So, 4+16+8+2*263=554. Why does sizeof(_SP_DEVINFO_LIST_DETAIL_DATA_W) return 560?
You are overlooking the requirement to align the fields, which is important to ensure that the processor can access them efficiently. The HANDLE type is 64 bits (8 bytes) when you target x64, so the RemoteMachineHandle member is aligned to an offset that is a multiple of 8. That moves it from offset 20 to offset 24, the next offset divisible by 8. The extra 4 bytes are padding and are unused.
That makes the structure size 4 + 16 + 4 + 8 + 2*263 = 558 bytes.
There's an additional problem: in an array of this struct, the handle would be misaligned again. The element at index 1 would have the handle at offset 558 + 4 + 16 + 4 = 582, which is not a multiple of 8.
So the compiler adds an additional 2 bytes of padding to the end of the struct so that its total size is a multiple of 8. Thus:
Offset  Size  Member
     0     4  cbSize
     4    16  ClassGuid
    20     4  -
    24     8  RemoteMachineHandle
    32   526  RemoteMachineName
   558     2  -
-------------
   560
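The offsets in this table can be reproduced without the Windows headers by modelling the struct with stand-in types. Everything named here (GuidModel, DetailDataModel, is_64bit) is hypothetical and only mimics the sizes of DWORD, GUID, HANDLE and WCHAR; the checks apply only when pointers are 8 bytes with 8-byte alignment.

```cpp
#include <cstddef>  // offsetof
#include <cstdint>

// Stand-in for GUID: 16 bytes, 4-byte aligned (assumption, not the real type).
struct GuidModel {
    std::uint32_t Data1;
    std::uint16_t Data2;
    std::uint16_t Data3;
    std::uint8_t  Data4[8];
};

// Stand-in for the wide-character struct from the question.
struct DetailDataModel {
    std::uint32_t cbSize;                  // DWORD, 4 bytes
    GuidModel     ClassGuid;               // 16 bytes
    void*         RemoteMachineHandle;     // HANDLE, pointer-sized
    char16_t      RemoteMachineName[263];  // WCHAR[MAX_PATH + 3], 2 bytes each
};

// Assumption (hypothetical name): a 64-bit target with 8-byte aligned pointers.
constexpr bool is_64bit = sizeof(void*) == 8 && alignof(void*) == 8;

static_assert(sizeof(GuidModel) == 16, "GUID stand-in is 16 bytes");
// 4 bytes of padding push the handle from offset 20 to 24.
static_assert(!is_64bit || offsetof(DetailDataModel, RemoteMachineHandle) == 24, "handle at 24");
static_assert(!is_64bit || offsetof(DetailDataModel, RemoteMachineName) == 32, "name at 32");
// 558 bytes of members and inner padding, rounded up to a multiple of 8.
static_assert(!is_64bit || sizeof(DetailDataModel) == 560, "total size is 560");
```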
I'm guessing it is padding to accommodate the alignment requirement of some member. I am not familiar with these types, however, so I can't explain the alignment of this structure.
If you really want to pack your struct in the most efficient way, you can order the members by size (decreasing). The compiler isn't normally allowed to reorder the members.
I tried to work with SetupDiGetDeviceInfoListDetail and got the wrong size too. Finally I found the solution.
This is the definition of the struct:
typedef struct _SP_DEVINFO_LIST_DETAIL_DATA {
DWORD cbSize;
GUID ClassGuid;
HANDLE RemoteMachineHandle;
TCHAR RemoteMachineName[SP_MAX_MACHINENAME_LENGTH];
} SP_DEVINFO_LIST_DETAIL_DATA, *PSP_DEVINFO_LIST_DETAIL_DATA;
in setupapi.h
#define SP_MAX_MACHINENAME_LENGTH (MAX_PATH + 3)
https://msdn.microsoft.com/en-us/library/cc249520.aspx
MAX_PATH 0x00000104
Therefore:
[StructLayout(LayoutKind.Sequential,Pack = 1, CharSet = CharSet.Ansi)]
public struct SP_DEVINFO_LIST_DETAIL_DATA
{
public uint cbSize;
public Guid classGuid;
public int RemoteMachineHandle;
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 263)]public string RemoteMachineName;
};