Efficient zero padding using cudaMemcpy3D - c++

I would like to transfer a 3D array stored in linear memory on the host into a larger (3D) array on the device. As an example (see below), I tried to transfer a (3x3x3) array into a (5x5x3) array.
I expect that on the host I get 2D slices with the following pattern:
x x x 0 0
x x x 0 0
x x x 0 0
0 0 0 0 0
0 0 0 0 0
where x are the values of my array. However, I get something like this, where y are the values of the next 2D slice:
x x x 0 0
x x x 0 0
x x x 0 0
y y y 0 0
y y y 0 0
According to the cudaMemcpy3D documentation, I would have expected the extent parameter to take the padding along the vertical axis into account, but apparently it does not.
Am I misunderstanding the documentation? If so, is there another way to perform this operation? The final transfer will be a 60x60x900 array into an array of size 1100x1500x900. I use the zero padding to prepare a Fourier transform.
Here is the simplified code that I used:
cudaError_t cuda_status;
cudaPitchedPtr d_ptr;
cudaExtent d_extent = make_cudaExtent(sizeof(int)*5,sizeof(int)*5,sizeof(int)*3);
cudaExtent h_extent = make_cudaExtent(sizeof(int)*3,sizeof(int)*3,sizeof(int)*3);
int* h_array = (int*) malloc(27*sizeof(int));
int* h_result = (int*) malloc(512*sizeof(int)*5*3);
for (int i = 0; i<27; i++)
{
h_array[i] = i;
}
cuda_status = cudaMalloc3D(&d_ptr, d_extent);
cout << cudaGetErrorString(cuda_status) << endl;
cudaMemcpy3DParms myParms = {0};
myParms.extent = h_extent;
myParms.srcPtr.ptr = h_array;
myParms.srcPtr.pitch = 3*sizeof(int);
myParms.srcPtr.xsize = 3*sizeof(int);
myParms.srcPtr.ysize = 3*sizeof(int);
myParms.dstPtr = d_ptr;
myParms.kind = cudaMemcpyHostToDevice;
cuda_status = cudaMemcpy3D(&myParms);
cout << cudaGetErrorString(cuda_status) << endl;
cout << "Pitch: " << d_ptr.pitch << " / xsize:" << d_ptr.xsize << " / ysize:" << d_ptr.ysize << endl; // returns Pitch: 512 / xsize:20 / ysize:20 which is as expected
// Copy array to host to be able to print the values - may not be necessary
cout << cudaMemcpy(h_result, (int*) d_ptr.ptr, 512*5*3, cudaMemcpyDeviceToHost) << endl;
cout << h_result[128] << " " << h_result[3*128] << " " << h_result[5*128] << " " << endl; // output : 3 9 15 / expected 3 0 9

The problems here have to do with your extents and sizes.
When an extent is used with cudaMemcpy3D for the non-cudaArray case, it is intended to provide the size of the region in bytes. A way to think about this is that the product of the 3 dimensions of the extent should yield the size of the region in bytes.
What you're doing however is scaling each of the 3 dimensions by the element size, which is not correct:
cudaExtent h_extent = make_cudaExtent(sizeof(int)*3,sizeof(int)*3,sizeof(int)*3);
                                      ^^^^^^^^^^^^^
                                      this is the only element scaling expected
You've made a similar error here:
myParms.srcPtr.xsize = 3*sizeof(int); // correct
myParms.srcPtr.ysize = 3*sizeof(int); // incorrect
We only scale the x (width) dimension by the element size; we don't scale the y (height) or z (depth) dimensions.
I haven't fully verified your code, but with those 2 changes, your code produces the output you indicate is expected:
$ cat t1593.cu
#include <iostream>
using namespace std;
int main(){
    cudaError_t cuda_status;
    cudaPitchedPtr d_ptr;
    cudaExtent d_extent = make_cudaExtent(sizeof(int)*5,5,3);
    cudaExtent h_extent = make_cudaExtent(sizeof(int)*3,3,3);
    int* h_array = (int*) malloc(27*sizeof(int));
    int* h_result = (int*) malloc(512*sizeof(int)*5*3);
    for (int i = 0; i<27; i++)
    {
        h_array[i] = i;
    }
    cuda_status = cudaMalloc3D(&d_ptr, d_extent);
    cout << cudaGetErrorString(cuda_status) << endl;
    cudaMemcpy3DParms myParms = {0};
    myParms.extent = h_extent;
    myParms.srcPtr.ptr = h_array;
    myParms.srcPtr.pitch = 3*sizeof(int);
    myParms.srcPtr.xsize = 3*sizeof(int);
    myParms.srcPtr.ysize = 3;
    myParms.dstPtr = d_ptr;
    myParms.kind = cudaMemcpyHostToDevice;
    cuda_status = cudaMemcpy3D(&myParms);
    cout << cudaGetErrorString(cuda_status) << endl;
    cout << "Pitch: " << d_ptr.pitch << " / xsize:" << d_ptr.xsize << " / ysize:" << d_ptr.ysize << endl; // returns Pitch: 512 / xsize:20 / ysize:5 which is as expected
    // Copy array to host to be able to print the values - may not be necessary
    cout << cudaMemcpy(h_result, (int*) d_ptr.ptr, d_ptr.pitch*5*3, cudaMemcpyDeviceToHost) << endl;
    cout << h_result[128] << " " << h_result[3*128] << " " << h_result[5*128] << " " << endl; // output : 3 0 9, as expected
}
$ nvcc -o t1593 t1593.cu
$ cuda-memcheck ./t1593
========= CUDA-MEMCHECK
no error
no error
Pitch: 512 / xsize:20 / ysize:5
0
3 0 9
========= ERROR SUMMARY: 0 errors
$
I should also point out that the strided memcpy operations in CUDA (e.g. cudaMemcpy2D, cudaMemcpy3D) are not necessarily the fastest way to conduct such a transfer. You can find writeups of this characteristic in various questions about cudaMemcpy2D here in the SO cuda tag.
The net of it is that it may be faster to transfer the data to the device in an unstrided, unpadded linear transfer, then write a CUDA kernel to take the data that is now on the device, and place it in the array of interest, with appropriate striding/padding.
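To make that concrete, here is a minimal sketch of such a kernel (my own illustration, not tested code from this answer; pad_copy, d_lin and the launch configuration are hypothetical choices). It assumes the destination was allocated with cudaMalloc3D as above and zeroed beforehand, e.g. with cudaMemset3D:
// Hypothetical helper: scatter a contiguous w x h x d source into a pitched destination.
__global__ void pad_copy(const int *src, cudaPitchedPtr dst, int w, int h, int d)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= w || y >= h || z >= d) return;
    char *slice = (char *)dst.ptr + z * dst.pitch * dst.ysize;  // slice pitch = pitch * ysize
    int  *row   = (int *)(slice + y * dst.pitch);
    row[x] = src[(z * h + y) * w + x];
}
// usage sketch for the 3x3x3 example above:
// int *d_lin;
// cudaMalloc(&d_lin, 27 * sizeof(int));
// cudaMemcpy(d_lin, h_array, 27 * sizeof(int), cudaMemcpyHostToDevice);  // plain linear copy
// pad_copy<<<dim3(1, 1, 1), dim3(8, 8, 4)>>>(d_lin, d_ptr, 3, 3, 3);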

Related

Pointer Exception while getting RGB values from (video) frame Intel Realsense

I'm trying to get the different RGB values from a frame with the Realsense SDK. This is for a 3D depth camera with RGB. According to https://github.com/IntelRealSense/librealsense/issues/3364 I need to use
int i = 100, j = 100; // fetch pixel 100,100
rs2::frame rgb = ...
auto ptr = (uint8_t*)rgb.get_data();
auto stride = rgb.as<rs2::video_frame>().stride();
cout << "R=" << ptr[3*(i * stride + j)];
cout << ", G=" << ptr[3*(i * stride + j) + 1];
cout << ", B=" << ptr[3*(i * stride + j) + 2];
In my code I'm getting a pointer exception if I want to get the values for pixel (x,y)=1000,1000. With (x,y)=100,100 it works every time... Error: Exception thrown: read access violation. ptr was 0x11103131EB9192A.
I set the enable_stream to cfg.enable_stream(RS2_STREAM_COLOR, WIDTH_COLOR_FRAME, HEIGTH_COLOR_FRAME, RS2_FORMAT_RGB8, 15); where in the .h file are:
#define WIDTH_COLOR_FRAME 1920
#define HEIGTH_COLOR_FRAME 1080
This is my code. Maybe it has something to do with the RS2_FORMAT_RGB8?
frameset frames = pl.wait_for_frames();
frame color = frames.get_color_frame();
uint8_t* ptr = (uint8_t*)color.get_data();
int stride = color.as<video_frame>().get_stride_in_bytes();
int i = 1000, j = 1000; // fetch pixel 100,100
cout << "R=" << int(ptr[3 * (i * stride + j)]);
cout << ", G=" << int(ptr[3 * (i * stride + j) + 1]);
cout << ", B=" << int(ptr[3 * (i * stride + j) + 2]);
cout << endl;
Thanks in advance!
stride is in bytes (the length of a row in bytes), so multiplying it by 3 is not required; the factor of 3 only belongs on the column index.
cout << " R= " << int(ptr[i * stride + (3*j) ]);
cout << ", G= " << int(ptr[i * stride + (3*j) + 1]);
cout << ", B= " << int(ptr[i * stride + (3*j) + 2]);
I had the same problem and even with the last answers I still got segfaults.
I found out that when you do
uint8_t *ptr = (uint8_t *)color.get_data();
the RealSense SDK won't increase or track some internal reference for you, and the pointer becomes invalid after some time, causing the segfaults.
My fix is to copy the contents into a local buffer: allocate a new buffer of the RGB frame size and, right after get_data(), copy the data into it. That fixed all my issues.
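For reference, a minimal sketch of that workaround (my own illustration; sizing the buffer from get_height() and get_stride_in_bytes() is my assumption of what "RGB size" should be, and it reuses the question's color frame):
// assumes the question's `color` frame plus <vector> and <cstring>
video_frame vf = color.as<video_frame>();
const int stride = vf.get_stride_in_bytes();             // bytes per row
std::vector<uint8_t> local(vf.get_height() * stride);    // buffer we own
std::memcpy(local.data(), vf.get_data(), local.size());  // copy right after get_data()
// from now on, read pixels out of `local`, e.g.
// int r = local[i * stride + 3 * j];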
all the best.

CUDA cufft 2D example

I am currently working on a program that has to implement a 2D FFT (for cross correlation). I did a 1D FFT with CUDA which gave me the correct results; I am now trying to implement a 2D version. With few examples and little documentation online, I find it hard to figure out what the error is.
So far I have been using the cuFFT manual only.
Anyway, I have created two 5x5 arrays and filled them with 1's. I have copied them onto the GPU memory and done the forward FFT, multiplied them and then done an IFFT on the result. This gives me a 5x5 array with values 650. I would expect to get a DC signal with the value 25 in only one slot in the 5x5 array. Instead I get 650 in the entire array.
Furthermore, I am not allowed to print out the value of the signal after it has been copied onto the GPU memory. Writing
cout << d_signal[1].x << endl;
gives me an access violation. I have done the same thing in other CUDA programs, where this has not been an issue. Does it have something to do with how the complex variable works, or is it human error?
If anyone has any pointers to what is going wrong, I would greatly appreciate it. Here is the code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <helper_functions.h>
#include <helper_cuda.h>
#include <ctime>
#include <time.h>
#include <stdio.h>
#include <iostream>
#include <math.h>
#include <cufft.h>
#include <fstream>
using namespace std;
typedef float2 Complex;
__global__ void ComplexMUL(Complex *a, Complex *b)
{
int i = threadIdx.x;
a[i].x = a[i].x * b[i].x - a[i].y*b[i].y;
a[i].y = a[i].x * b[i].y + a[i].y*b[i].x;
}
int main()
{
int N = 5;
int SIZE = N*N;
Complex *fg = new Complex[SIZE];
for (int i = 0; i < SIZE; i++){
fg[i].x = 1;
fg[i].y = 0;
}
Complex *fig = new Complex[SIZE];
for (int i = 0; i < SIZE; i++){
fig[i].x = 1; //
fig[i].y = 0;
}
for (int i = 0; i < 24; i=i+5)
{
cout << fg[i].x << " " << fg[i + 1].x << " " << fg[i + 2].x << " " << fg[i + 3].x << " " << fg[i + 4].x << endl;
}
cout << "----------------" << endl;
for (int i = 0; i < 24; i = i + 5)
{
cout << fig[i].x << " " << fig[i + 1].x << " " << fig[i + 2].x << " " << fig[i + 3].x << " " << fig[i + 4].x << endl;
}
cout << "----------------" << endl;
int mem_size = sizeof(Complex)* SIZE;
cufftComplex *d_signal;
checkCudaErrors(cudaMalloc((void **) &d_signal, mem_size));
checkCudaErrors(cudaMemcpy(d_signal, fg, mem_size, cudaMemcpyHostToDevice));
cufftComplex *d_filter_kernel;
checkCudaErrors(cudaMalloc((void **)&d_filter_kernel, mem_size));
checkCudaErrors(cudaMemcpy(d_filter_kernel, fig, mem_size, cudaMemcpyHostToDevice));
// cout << d_signal[1].x << endl;
// CUFFT plan
cufftHandle plan;
cufftPlan2d(&plan, N, N, CUFFT_C2C);
// Transform signal and filter
printf("Transforming signal cufftExecR2C\n");
cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_FORWARD);
cufftExecC2C(plan, (cufftComplex *)d_filter_kernel, (cufftComplex *)d_filter_kernel, CUFFT_FORWARD);
printf("Launching Complex multiplication<<< >>>\n");
ComplexMUL <<< 32, 256 >>>(d_signal, d_filter_kernel);
// Transform signal back
printf("Transforming signal back cufftExecC2C\n");
cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_INVERSE);
Complex *result = new Complex[SIZE];
cudaMemcpy(result, d_signal, sizeof(Complex)*SIZE, cudaMemcpyDeviceToHost);
for (int i = 0; i < SIZE; i=i+5)
{
cout << result[i].x << " " << result[i + 1].x << " " << result[i + 2].x << " " << result[i + 3].x << " " << result[i + 4].x << endl;
}
delete result, fg, fig;
cufftDestroy(plan);
//cufftDestroy(plan2);
cudaFree(d_signal);
cudaFree(d_filter_kernel);
}
The above code gives the following terminal output:
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
----------------
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
----------------
Transforming signal cufftExecR2C
Launching Complex multiplication<<< >>>
Transforming signal back cufftExecC2C
625 625 625 625 625
625 625 625 625 625
625 625 625 625 625
625 625 625 625 625
625 625 625 625 625
This gives me a 5x5 array with values 650: it reads 625, which is 5*5*5*5. The convolution algorithm you are using requires a supplemental divide by N*N. Indeed, in cuFFT, there is no normalization coefficient in the forward transform. Hence, your convolution cannot be the simple multiplication of the two fields in the frequency domain. (Some would call it the mathematician's DFT and not the physicist's DFT.)
Furthermore, I am not allowed to print out the value of the signal after it has been copied onto the GPU memory: This is standard CUDA behavior. When allocating memory on the device, the data exists in the device memory address space and cannot be accessed by the CPU without additional effort. Search for managed memory or zero-copy to have data accessible from both sides of the PCI Express bus (this is discussed in many other posts).
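For illustration only (this is my sketch of the managed-memory suggestion, not code from the answer), allocating the buffer with cudaMallocManaged makes it legal to read from the host:
cufftComplex *d_signal;
checkCudaErrors(cudaMallocManaged((void **)&d_signal, mem_size)); // visible to host and device
memcpy(d_signal, fg, mem_size);             // plain memcpy; no cudaMemcpy needed
// ... FFTs / kernel launches as before ...
checkCudaErrors(cudaDeviceSynchronize());   // make sure device work has finished
cout << d_signal[1].x << endl;              // now valid on the host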
There are several problems here:
You are launching far too many threads for the size of the input arrays in the multiplication kernel, so that should be failing with out-of-bounds memory errors. I am surprised you are not receiving any sort of runtime error.
Your expected solution from the fft/fft - dot product - ifft sequence is, I believe, incorrect. The correct solution would be a 5x5 matrix with 25 in each entry.
As clearly described in the cuFFT documentation, the library performs unnormalised FFTs:
cuFFT performs un-normalized FFTs; that is, performing a forward FFT on an input data set followed by an inverse FFT on the resulting set yields data that is equal to the input, scaled by the number of elements. Scaling either transform by the reciprocal of the size of the data set is left for the user to perform as seen fit.
So by my reckoning, the correct output for your code should be a 5x5 matrix with 625 in each entry, which would be normalised to a 5x5 matrix with 25 in each entry, i.e. the expected result. I don't understand how the problem at (1) isn't producing different results, as the multiplication kernel should be failing.
TLDR; nothing to see here, move along...
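To make the scaling concrete, a minimal sketch (my addition) of normalising the result after the inverse transform:
// after copying `result` back to the host:
for (int i = 0; i < SIZE; i++) {
    result[i].x /= (float)SIZE;   // SIZE == N*N is the factor cuFFT leaves in
    result[i].y /= (float)SIZE;
}
// each entry of the 5x5 result should then print as 25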
Just to add to the other things mentioned already: I think your complex multiplication kernel is not doing the right thing. You are overwriting a[i].x in the first line and then using the new value of a[i].x in the second line to calculate a[i].y. I think you need to back up a[i].x before you overwrite it, something like:
float aReal_bk = a[i].x;
a[i].x = a[i].x * b[i].x - a[i].y * b[i].y;
a[i].y = aReal_bk * b[i].y + a[i].y * b[i].x;
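Putting this together with the thread-count problem raised in the previous answer, a guarded kernel and a matching launch might look like the following sketch (my own assembly, not code from any of the answers):
__global__ void ComplexMUL(Complex *a, const Complex *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;                        // guard against surplus threads
    float aReal_bk = a[i].x;                   // back up the real part before overwriting it
    a[i].x = aReal_bk * b[i].x - a[i].y * b[i].y;
    a[i].y = aReal_bk * b[i].y + a[i].y * b[i].x;
}
// launch with just enough threads for the N*N elements:
// ComplexMUL<<<(SIZE + 255) / 256, 256>>>(d_signal, d_filter_kernel, SIZE);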

Verifying essential matrix

I'm trying to code a simple structure from motion scenario, using only 2 images taken from the same camera.
I use SIFT to find matching points between the images (total of 72 matches), out of which 62 are correct.
I use OpenCV to calculate the fundamental matrix, then the essential matrix. When I try to verify the essential matrix by computing p2^T * E * p1, I get very large values instead of values close to zero.
Am I doing something wrong?
Here's the code: (pts1, pts2 are std::vector<Point2f>. dmat is Mat_<double>)
int n = pts1.size();
std::cout << "Total point matches: " << n << std::endl;
std::vector<unsigned char> status(n);
std::cout << "K=" << K << std::endl;
F = findFundamentalMat(pts1, pts2,FM_RANSAC,3,0.99,status);
std::cout << "F=" << F << std::endl;
std::cout << "Total inliers: " << std::accumulate(status.begin(),status.end(),0) << std::endl;
E = K.t() * F * K;
std::cout << "E=" << E << std::endl;
for (int i = 0; i < n;++i)
{
dmat p1(3,1), p2(3,1);
p1 << pts1[i].x, pts1[i].y, 1;
p2 << pts2[i].x, pts2[i].y, 1;
dmat mv = p2.t() * E * p1;
double v = mv(0, 0);
std::cout << v << std::endl;
}
and here is the output from this code:
Total point matches: 72
K=[390.0703661671206, 0, 319.5;
0, 390.0703661671206, 239.5;
0, 0, 1]
F=[-2.723736291531157e-007, 7.660367616625481e-005, -0.01766345189507435;
-4.219955880897177e-005, 9.025976628215733e-006, -0.04376995849516735;
0.009562535474535394, 0.03723116011143099, 1]
Total inliers: 62
E=[-0.04144297973569942, 11.65562396370436, 0.2325229628055823;
-6.420869252333299, 1.373346486079092, -21.48936503378938;
-0.2462444924550576, 24.91291898830852, -0.03174504032716108]
188648
-38467.5
-34880.7
289671
257263
87504.7
462472
-30138.1
-30569.3
174520
-32342.8
-32342.8
-37543.4
241378
-36875.4
-36899
-38796.4
-38225.2
-38120.9
394285
-440986
396805
455397
543629
14281.6
630398
-29714.6
191699
-37854.1
-39295.8
-3395.93
-3088.56
629769
-28132.9
178537
878596
-58957.9
-31034.5
-30677.3
-29854.5
165689
-13575.9
-13294.3
-6607.96
-3446.41
622355
-31803
-35149
-38455.4
2068.12
82164.6
-35731.2
-36252.7
-36746.9
-35325.3
414185
-35216.3
-126107
-5551.84
100196
2.29755e+006
177785
-31991.8
-31991.8
100340
108897
108897
84660.4
-7828.65
225817
225817
295423
The relation v2^T * E * v1 = 0 holds for the essential matrix only when v2 and v1 are in normalized image coordinates, i.e. v1 = K^(-1)*p1, with p1 the observed point in pixels. The same goes for v2 and p2.
If you have it, you can refer to definition 9.16 on page 257 of Hartley and Zisserman's book. But note that this makes sense, given the relation E = K.t() * F * K.
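For illustration, a minimal sketch of the same check in normalized coordinates (my addition; Kinv, v1 and v2 are names I introduce, everything else follows the question's code):
dmat Kinv = K.inv();                     // K is invertible, being a calibration matrix
for (int i = 0; i < n; ++i)
{
    dmat p1(3, 1), p2(3, 1);
    p1 << pts1[i].x, pts1[i].y, 1;
    p2 << pts2[i].x, pts2[i].y, 1;
    dmat v1 = Kinv * p1;                 // normalized image coordinates
    dmat v2 = Kinv * p2;
    dmat mv = v2.t() * E * v1;
    std::cout << mv(0, 0) << std::endl;  // should now be close to zero for inliers
}
Equivalently, since E = K^T * F * K, checking p2^T * F * p1 with the original pixel coordinates should also give values close to zero for the RANSAC inliers.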

Values not written to vector

I'm trying to read pairs of values from a file in the constructor of an object.
The file looks like this:
4
1 1
2 2
3 3
4 4
The first number is the number of pairs to read.
On some of the lines the values seem to have been correctly written into the vector; on the next they are gone. I am totally confused.
inline
BaseInterpolator::BaseInterpolator(std::string data_file_name)
{
std::ifstream in_file(data_file_name);
if (!in_file) {
std::cerr << "Can't open input file " << data_file_name << std::endl;
exit(EXIT_FAILURE);
}
size_t n;
in_file >> n;
xs_.reserve(n);
ys_.reserve(n);
size_t i = 0;
while(in_file >> xs_[i] >> ys_[i])
{
// this line prints correct values i.e. 1 1, 2 2, 3 3, 4 4
std::cout << xs_[i] << " " << ys_[i] << std::endl;
// this lines prints xs_.size() = 0
std::cout << "xs_.size() = " << xs_.size() << std::endl;
if(i + 1 < n)
i += 1;
else
break;
// this line prints 0 0, 0 0, 0 0
std::cout << xs_[i] << " " << ys_[i] << std::endl;
}
// this line prints correct values i.e. 4 4
std::cout << xs_[i] << " " << ys_[i] << std::endl;
// this lines prints xs_.size() = 0
std::cout << "xs_.size() = " << xs_.size() << std::endl;
}
The class is defined thus:
class BaseInterpolator
{
public:
~BaseInterpolator();
BaseInterpolator();
BaseInterpolator(std::vector<double> &xs, std::vector<double> &ys);
BaseInterpolator(std::string data_file_name);
virtual int interpolate(std::vector<double> &x, std::vector<double> &fx) = 0;
virtual int interpolate(std::string input_file_name,
std::string output_file_name) = 0;
protected:
std::vector<double> xs_;
std::vector<double> ys_;
};
You're experiencing undefined behaviour. It seems like it's half working, but that's twice as bad as not working at all.
The problem is this:
xs_.reserve(n);
ys_.reserve(n);
You are only reserving capacity; you are not creating any elements.
Replace it by :
xs_.resize(n);
ys_.resize(n);
Now, xs_[i] with i < n is actually valid.
If in doubt, use xs_.at(i) instead of xs_[i]. It performs an additional bounds check, which saves you the trouble of debugging without knowing where to start.
You're using reserve(), which increases capacity (storage space), but does not increase the size of the vector (i.e. it does not add any objects into it). You should use resize() instead. This will take care of size() being 0.
You're printing the xs_[i] and ys_[i] after you increment i. It's natural those will be 0 (or perhaps a random value) - you haven't initialised them yet.
vector::reserve reserves space for further operations but doesn't change the size of the vector; you should use vector::resize.
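For completeness, here is a minimal sketch of the reading loop with the fix applied (my own illustration; it uses push_back so that reserve is sufficient and size() grows as values are read):
size_t n;
in_file >> n;
xs_.reserve(n);              // capacity only; size() is still 0
ys_.reserve(n);
double x, y;
while (xs_.size() < n && in_file >> x >> y)
{
    xs_.push_back(x);        // push_back actually grows size()
    ys_.push_back(y);
}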

Object/Struct Alignment in C/C++

#include <iostream>
using namespace std;
struct test
{
int i;
double h;
int j;
};
int main()
{
test te;
te.i = 5;
te.h = 6.5;
te.j = 10;
cout << "size of an int: " << sizeof(int) << endl; // Should be 4
cout << "size of a double: " << sizeof(double) << endl; //Should be 8
cout << "size of test: " << sizeof(test) << endl; // Should be 24 (word size of 8 for double)
//These two should be the same
cout << "start address of the object: " << &te << endl;
cout << "address of i member: " << &te.i << endl;
//These two should be the same
cout << "start address of the double field: " << &te.h << endl;
cout << "calculate the offset of the double field: " << (&te + sizeof(double)) << endl; //NOT THE SAME
return 0;
}
Output:
size of an int: 4
size of a double: 8
size of test: 24
start address of the object: 0x7fffb9fd44e0
address of i member: 0x7fffb9fd44e0
start address of the double field: 0x7fffb9fd44e8
calculate the offset of the double field: 0x7fffb9fd45a0
Why do the last two lines produce different values? Something I am doing wrong with pointer arithmetic?
(&te + sizeof(double))
This is the same as:
&((&te)[sizeof(double)])
You should do:
(char*)(&te) + sizeof(int)
You are correct -- the problem is with pointer arithmetic.
When you add an integer to a pointer, the pointer is advanced by that many elements of the pointed-to type.
Therefore, &te + 1 will be 24 bytes after &te.
Your code &te + sizeof(double) will add 24 * sizeof(double), or 192 bytes.
Firstly, your code is wrong: you'd want to add the size of the fields before h (i.e. an int); there's no reason to assume double. Second, you need to normalise everything to char * first (pointer arithmetic is done in units of the thing being pointed to).
More generally, you can't rely on code like this to work. The compiler is free to insert padding between fields to align things to word boundaries and so on. If you really want to know the offset of a particular field, there's an offsetof macro that you can use. It's defined in <stddef.h> in C, <cstddef> in C++.
Most compilers offer an option to remove all padding (e.g. GCC's __attribute__ ((packed))).
I believe it's only well-defined to use offsetof on POD types.
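As a small illustration of the offsetof approach (my own sketch, not part of the original answer):
#include <cstddef>   // offsetof
// ...
// compute the address of te.h from the object's base address:
cout << "calculated address of the double field: "
     << static_cast<void *>(reinterpret_cast<char *>(&te) + offsetof(test, h)) << endl;
// this matches &te.h, regardless of how much padding the compiler inserts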
struct test
{
int i;
int j;
double h;
};
Since your largest data type is 8 bytes, the struct adds padding around your ints. Either put the largest data type first, or account for the padding on your end. Hope this helps!
&te + sizeof(double) is equivalent to &te + 8, which is equivalent to &((&te)[8]). That is — since &te has type test *, &te + 8 adds eight times the size of a test.
You can see what's going on more clearly using the offsetof() macro:
#include <iostream>
#include <cstddef>
using namespace std;
struct test
{
int i;
double h;
int j;
};
int main()
{
test te;
te.i = 5;
te.h = 6.5;
te.j = 10;
cout << "size of an int: " << sizeof(int) << endl; // Should be 4
cout << "size of a double: " << sizeof(double) << endl; // Should be 8
cout << "size of test: " << sizeof(test) << endl; // Should be 24 (word size of 8 for double)
cout << "i: size = " << sizeof te.i << ", offset = " << offsetof(test, i) << endl;
cout << "h: size = " << sizeof te.h << ", offset = " << offsetof(test, h) << endl;
cout << "j: size = " << sizeof te.j << ", offset = " << offsetof(test, j) << endl;
return 0;
}
On my system (x86), I get the following output:
size of an int: 4
size of a double: 8
size of test: 16
i: size = 4, offset = 0
h: size = 8, offset = 4
j: size = 4, offset = 12
On another system (SPARC), I get:
size of an int: 4
size of a double: 8
size of test: 24
i: size = 4, offset = 0
h: size = 8, offset = 8
j: size = 4, offset = 16
The compiler will insert padding bytes between struct members to ensure that each member is aligned properly. As you can see, alignment requirements vary from system to system; on one system (x86), double is 8 bytes but only requires 4-byte alignment, and on another system (SPARC), double is 8 bytes and requires 8-byte alignment.
Padding can also be added at the end of a struct to ensure that everything is aligned properly when you have an array of the struct type. On SPARC, for example, the compiler adds 4 bytes of padding at the end of the struct.
The language guarantees that the first declared member will be at an offset of 0, and that members are laid out in the order in which they're declared. (At least that's true for simple structs; C++ metadata might complicate things.)
Compilers are free to space out structs however they want past the first member, and usually use padding to align to word boundaries for speed.
See these:
C struct sizes inconsistence
Struct varies in memory size?
et al.