What's "pitch" in cudaMemcpy2DToArray and cudaMemcpy2DFromArray - c++

I'm converting the deprecated cudaMemcpyToArray and cudaMemcpyFromArray into cudaMemcpy2DToArray and cudaMemcpy2DFromArray. Instead of the size argument of the deprecated calls, the new API calls for width, height, and pitch. The descriptions of spitch and dpitch are, respectively, "Pitch of source memory" and "Pitch of destination memory". I wonder what those values are: the size of a data item, or something else?
More specifically, if I were to copy W*H floats, should I have pitch=sizeof(float), width=W, height=H, or pitch=sizeof(float)*W, width=sizeof(float)*W, height=H, or something else?

It should be:
pitch=sizeof(float)*W
width = sizeof(float)*W
height = H
The above is for cudaMemcpy2DToArray, and assumes you are transferring from host to device, which would most likely involve an unpitched allocation in host memory as the source.
The pitch of a pitched allocation is the size in bytes of one line of a 2D allocation, including padding bytes at the end of the line. It is the value returned by cudaMallocPitch, for example. For unpitched allocations, it is still the width of the line in bytes, and it is given by W*sizeof(element), where the 2D allocation is W elements wide and each element has size sizeof(element).
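As a minimal sketch of the arithmetic above, the three size arguments for an unpitched host buffer could be computed as follows. The struct and helper names are hypothetical and not part of the CUDA API; the code only does the byte arithmetic and makes no CUDA calls.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: the size arguments passed for the linear-memory
// side of cudaMemcpy2DToArray / cudaMemcpy2DFromArray.
struct Copy2DSizes {
    std::size_t pitchBytes;  // bytes from the start of one row to the start of the next
    std::size_t widthBytes;  // bytes actually copied per row
    std::size_t height;      // number of rows
};

// For an unpitched (tightly packed) buffer, the pitch equals the row width in bytes.
template <typename T>
Copy2DSizes unpitchedSizes(std::size_t widthElems, std::size_t heightRows) {
    assert(widthElems > 0 && heightRows > 0);
    const std::size_t rowBytes = widthElems * sizeof(T);
    return Copy2DSizes{rowBytes, rowBytes, heightRows};
}
```

For example, unpitchedSizes<float>(W, H) yields pitch = width = W*sizeof(float) and height = H. If the buffer came from cudaMallocPitch instead, pitchBytes would be the pitch that call returned.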
This question and the link it refers to may also be of interest.

Related

OpenCL: How to find out maximum size of Image3D that fits on GPU

When using OpenCL one can check the maximum theoretical size of an image3D on a device by calling
CL_DEVICE_IMAGE3D_MAX_WIDTH
CL_DEVICE_IMAGE3D_MAX_DEPTH
CL_DEVICE_IMAGE3D_MAX_HEIGHT
but for example my GPU does not provide enough memory to allocate a 3D RGBA image of that size. So obviously, when I try to push such an image, an OutOfResources error occurs.
My question: Given a vector<float> which contains an image with dimensions <2048, how do I check if it fits on my GPU?
The background of my question is, that I would like to split up the image otherwise in order to process it in parts.
You can use CL_DEVICE_MAX_MEM_ALLOC_SIZE which returns the maximum size of memory object allocation in bytes.
Although this doesn't take into account the currently used memory, you could try to do it yourself by keeping track of the allocations you're doing and checking against CL_DEVICE_GLOBAL_MEM_SIZE.
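A rough sketch of that check, assuming the limit has already been queried from the device (for instance with clGetDeviceInfo and CL_DEVICE_MAX_MEM_ALLOC_SIZE). The function names are hypothetical, and the code makes no OpenCL calls itself.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Bytes needed for a width x height x depth image.
// Returns 0 if any dimension is 0 or the product overflows 64 bits.
std::uint64_t imageBytes(std::uint64_t width, std::uint64_t height,
                         std::uint64_t depth, std::uint64_t bytesPerPixel) {
    assert(bytesPerPixel > 0);
    const std::uint64_t factors[] = {width, height, depth, bytesPerPixel};
    std::uint64_t total = 1;
    for (std::uint64_t f : factors) {
        if (f == 0) return 0;
        if (total > std::numeric_limits<std::uint64_t>::max() / f) return 0;
        total *= f;
    }
    return total;
}

// maxAllocBytes is assumed to be the device's reported maximum allocation size.
bool fitsInSingleAllocation(std::uint64_t bytes, std::uint64_t maxAllocBytes) {
    return bytes != 0 && bytes <= maxAllocBytes;
}
```

If the check fails, the image can be split along one dimension (for example depth) until each part passes. This only compares against the per-allocation limit; memory already in use on the device still has to be tracked separately.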

clCreateImage2D (..., void* hst_ptr,..) how to use it?

I read the formal definition on the official site but I still don't understand: what is void* host_ptr used for? For example here. Why is buffer so useful here, why is buffer a pointer to char, and why is its size 4*width*height?
host_ptr is used when either of the flags CL_MEM_USE_HOST_PTR or CL_MEM_COPY_HOST_PTR is specified. In those cases, host_ptr points to CPU memory containing an image to use or copy.
In the example code you linked to, buffer is the host-side (CPU memory) image being copied to the device-side (typically GPU) image, since they are using the CL_MEM_COPY_HOST_PTR flag.
It's not important that they made it a pointer to char since they are using memcpy to fill it in, but it helps the allocation using new char since a char is 1 byte in size.
4 * width * height is because that is the total number of bytes (chars) needed.
4 is the size of one CL_RGBA CL_UNORM_INT8 pixel (R, G, B, A are each one byte).
width * height because that is the total count of pixels.
The code is allocating host-side memory for an image, filling it in with something, then creating a device-side image using those bytes, then processing it using a kernel, then copying the bytes back to the host-side image.
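The size calculation can be sketched in plain C++ as follows. This is only an illustration of the 4 * width * height arithmetic for a tightly packed CL_RGBA / CL_UNORM_INT8 host buffer; the helper names are hypothetical and nothing here calls OpenCL.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kBytesPerPixel = 4;  // R, G, B, A at one byte each

// Host-side buffer large enough for a width x height RGBA8 image.
std::vector<unsigned char> makeRgba8Buffer(std::size_t width, std::size_t height) {
    return std::vector<unsigned char>(kBytesPerPixel * width * height, 0);
}

// Byte offset of the R component of pixel (x, y) in a tightly packed buffer.
std::size_t pixelOffset(std::size_t x, std::size_t y, std::size_t width) {
    assert(x < width);
    return kBytesPerPixel * (y * width + x);
}
```

The pointer returned by data() on such a vector is the kind of pointer that would be passed as host_ptr together with CL_MEM_COPY_HOST_PTR.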

How to get the number of elements [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 9 years ago.
I'm developing an application in C++ and I need to know the number of elements of a variable. I'm looking for how to do this but I'm not able to find a solution for this. The variable is defined in this way:
unsigned char *values = (unsigned char *) some_function(some_parameter);
// "some_function" takes "some_parameter" and fills "values" correctly
Thanks in advance for any help you can provide.
Best regards.
Since you told us which function you're using (FreeImage::GetBits), we now know that you're querying the raw data of an image. Its size is the product of the pitch and height of the image, as seen in this formula:
int size = image.GetPitch() * image.GetHeight();
This is the size in bytes, which is the number of elements if you access a char pointer. But speaking of "number of elements" in such a case (where we speak of some low-level memory, a bit stream with no high-level types) is a bit misleading, as when reading the question one might think it's about a higher level array.
In case you wonder: Raw image data is typically laid out in rows of size pitch, one pixel after another, from left to right, where the size per pixel can depend on some storage format (for example 1 byte grayscale, 3 bytes RGB with 8 bit per channel, 1 bit for monochrome bitmaps, and many more formats).
These rows are laid out from top to bottom (in most cases) or sometimes from bottom to top (in the case of the BMP file format, for example). The pitch is at least the width of the image times the size per pixel, so that all pixels fit in one "scan line", which is what the memory for a single line of the image is called. It's rounded up to some alignment, so every line can start at an aligned address within the memory for the whole image. The unused space is called "padding" and is ignored.
Depending on the library, sometimes "pitch" means "pixels per line" in the memory, not "bytes per line", but in this case it's already given in bytes so you only have to multiply by the image height. Note that typically the height is not padded like the width, since there's no advantage of doing so.
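To make the rounding concrete, here is a small sketch that computes a padded pitch and the resulting raw data size. The function names are hypothetical, and the alignment is left as a parameter because the actual value depends on the library.

```cpp
#include <cassert>
#include <cstddef>

// Bytes per scan line: the unpadded row size rounded up to a multiple of `alignment`.
std::size_t paddedPitch(std::size_t widthPixels, std::size_t bitsPerPixel,
                        std::size_t alignment) {
    assert(alignment > 0);
    const std::size_t rowBytes = (widthPixels * bitsPerPixel + 7) / 8;  // whole bytes per row
    return (rowBytes + alignment - 1) / alignment * alignment;
}

// Total size of the raw data: the height is not padded, only the rows are.
std::size_t rawImageBytes(std::size_t pitch, std::size_t height) {
    return pitch * height;
}
```

With a 4-byte alignment, a 3-pixel-wide RGB image at 24 bits per pixel needs 9 bytes per row but gets a pitch of 12, so each row carries 3 bytes of padding.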
You can never deterministically know the length of an array given a pointer to the beginning of the array. You must pass some information along with the array.
That extra information may be in the form of:
another return value specifying the length
an agreed encoding that encodes the length into the beginning of the array
an agreed encoding that marks the end of the array (e.g. \0 at the end of a string)
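The first option, passing the length along with the pointer, might look like the sketch below. ByteView is a hypothetical type used for illustration; in practice std::vector or, from C++20, std::span serve the same purpose.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical view type: a pointer that travels together with its element count.
struct ByteView {
    const unsigned char* data;
    std::size_t size;
};

// Sums all bytes; iterating is only possible because the length is carried along.
unsigned long sumBytes(ByteView view) {
    assert(view.data != nullptr || view.size == 0);
    unsigned long total = 0;
    for (std::size_t i = 0; i < view.size; ++i) {
        total += view.data[i];
    }
    return total;
}
```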

How and when should I use pitched pointer with the cuda API?

I have quite a good understanding about how to allocate and copy linear memory with cudaMalloc() and cudaMemcpy(). However, when I want to use the CUDA functions to allocate and copy 2D or 3D matrices, I am often befuddled by the various arguments, especially concerning pitched pointers which are always present when dealing with 2D/3D arrays. The documentation is good for providing a couple examples on how to use them but it assumes that I am familiar with the notion of padding and pitch, which I am not.
I usually end up tweaking the various examples I find in the documentation or somewhere else on the web, but the blind debugging that follows is quite painful, so my question is:
What is a pitch? How do I use it? How do I allocate and copy 2D and 3D arrays in CUDA?
Here is an explanation about pitched pointer and padding in CUDA.
Linear memory vs padded memory
First, let's start with the reason for the existence of non-linear memory. When allocating memory with cudaMalloc, the result is like an allocation with malloc: we get a contiguous memory chunk of the size specified and we can put anything we want in it. If we want to allocate a vector of 10000 floats, we simply do:
float* myVector;
cudaMalloc(&myVector, 10000*sizeof(float));
and then access the i-th element of myVector by classic indexing:
float element = myVector[i];
and if we want to access the next element, we just do:
float next_element = myVector[i+1];
This works well because accessing an element right next to the first one is (for reasons I am not aware of and I don't wish to be for now) cheap.
Things become a little bit different when we use our memory as a 2D array. Let's say our 10000 float vector is in fact a 100x100 array. We can allocate it by using the same cudaMalloc function, and if we want to read the i-th row, we do:
float* myArray;
cudaMalloc(&myArray, 10000*sizeof(float));
float row[100]; // one row holds 100 columns
for (int j=0; j<100; ++j)
    row[j] = myArray[i*100+j];
Word alignment
So we have to read memory from myArray+100*i to myArray+100*(i+1)-1. The number of memory access operations this takes depends on the number of memory words the row spans. The number of bytes in a memory word depends on the implementation. To minimize the number of memory accesses when reading a single row, we must ensure that the row starts at the start of a word, hence we must pad each row so that the next one begins on a word boundary.
Bank conflicts
Another reason for padding arrays is the bank mechanism in CUDA, which concerns shared memory access. When the array is in shared memory, it is split into several memory banks. Two CUDA threads can access it simultaneously, provided they don't access memory belonging to the same memory bank. Since we usually want to treat each row in parallel, we can ensure that rows can be accessed simultaneously by padding each row to the start of a new bank.
Now, instead of allocating the 2D array with cudaMalloc, we will use cudaMallocPitch:
size_t pitch;
float* myArray;
cudaMallocPitch(&myArray, &pitch, 100*sizeof(float), 100); // width in bytes by height
Note that the pitch here is an output of the function: cudaMallocPitch checks what it should be on your system and writes the appropriate value into pitch. What cudaMallocPitch does is the following:
Allocate the first row.
Check if the number of bytes allocated makes it correctly aligned. For example that it is a multiple of 128.
If not, allocate further bytes to reach the next multiple of 128. The pitch is then the number of bytes allocated for a single row, including the extra bytes (padding bytes).
Reiterate for each row.
At the end, we have typically allocated more memory than necessary because each row is now the size of pitch, and not the size of w*sizeof(float).
But now, when we want to access an element in a column, we must do:
float* row_start = (float*)((char*)myArray + row * pitch);
float column_element = row_start[column];
The offset in bytes between two successive rows can no longer be deduced from the size of our array, which is why we want to keep the pitch returned by cudaMallocPitch. And since the pitch is a multiple of the alignment size (typically, the larger of the word size and the bank size), it works great. Yay.
Copying data to/from pitched memory
Now that we know how to create an array with cudaMallocPitch and access a single element in it, we might want to copy all or part of it to and from other memory, linear or not.
Let's say we want to copy our array into a 100x100 array allocated on our host with malloc:
float* host_memory = (float*)malloc(100*100*sizeof(float));
If we use cudaMemcpy, we will copy all the memory allocated with cudaMallocPitch, including the padding bytes between rows. To avoid copying the padding, we must copy each row one by one. We can do it manually:
for (size_t i=0; i<100; ++i) {
    // destination: row i of the packed host array; source: row i of the pitched device array
    cudaMemcpy(host_memory + i*100, (char*)myArray + pitch*i,
               100*sizeof(float), cudaMemcpyDeviceToHost);
}
Or we can tell the CUDA API that we want only the useful memory from the memory we allocated with padding bytes for its convenience so if it could deal with its own mess automatically it would be very nice indeed, thank you. And here enters cudaMemcpy2D:
cudaMemcpy2D(host_memory, 100*sizeof(float)/*no pitch on host*/,
             myArray, pitch/*CUDA pitch*/,
             100*sizeof(float)/*width in bytes*/, 100/*height*/,
             cudaMemcpyDeviceToHost);
Now the copy is done automatically. It copies the number of bytes specified in width (here: 100*sizeof(float)), height times (here: 100), advancing the source pointer by the source pitch and the destination pointer by the destination pitch each time it moves to the next row. Note that we must still provide a pitch for the destination memory because it could be padded, too. Here it is not, so the pitch is equal to the pitch of a non-padded array: it is the size of a row in bytes. Note also that the width parameter in the memcpy function is expressed in bytes, but the height parameter is expressed in number of rows. That is because of the way the copy is done, much like the manual copy I wrote above: the width is the size of each copy along a row (elements that are contiguous in memory) and the height is the number of times this operation must be performed. (As a physicist, I find these inconsistencies in units very annoying.)
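To illustrate the row-by-row semantics, here is a host-only emulation in plain C++. It is a sketch of the behaviour described above, not how the CUDA runtime implements it; copy2D is a hypothetical name whose argument order mirrors cudaMemcpy2D without the copy kind.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies `height` rows of `widthBytes` bytes each. After every row, the source
// advances by `spitch` bytes and the destination by `dpitch` bytes, so padding
// bytes at the end of a row are never read or written.
void copy2D(void* dst, std::size_t dpitch, const void* src, std::size_t spitch,
            std::size_t widthBytes, std::size_t height) {
    assert(widthBytes <= dpitch && widthBytes <= spitch);
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(d + row * dpitch, s + row * spitch, widthBytes);
    }
}
```

Copying from a padded source into a packed destination means passing the padded pitch as spitch and the plain row size as dpitch, exactly as in the cudaMemcpy2D call above.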
Dealing with 3D arrays
3D arrays are actually no different from 2D arrays: there is no additional padding involved. A 3D array is just a stack of classical 2D arrays of padded rows. That is why, when allocating a 3D array, you only get one pitch, which is the difference in bytes between the starts of two successive rows. If you want to step between two successive points along the depth dimension, you can safely multiply the pitch by the number of rows in a slice, which gives you the slicePitch.
The CUDA API for accessing 3D memory is slightly different from the one for 2D memory, but the idea is the same:
When using cudaMalloc3D, you receive a pitch value that you must carefully keep for subsequent access to the memory.
When copying a 3D memory chunk, you cannot use cudaMemcpy unless you are copying a single row. You must use one of the other copy functions provided by the CUDA API that take the pitch into account.
When you copy your data to/from linear memory, you must provide a pitch for your pointer even though there is no padding: this pitch is the size of a row, expressed in bytes.
The size parameters are expressed in bytes for the row size, and in number of elements for the height and depth dimensions.
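The addressing rule for a pitched 3D allocation can be written out as a small sketch. offset3D is a hypothetical helper that only does the byte arithmetic; the pitch would come from the allocation call (for example the pitch field of the cudaPitchedPtr filled in by cudaMalloc3D).

```cpp
#include <cassert>
#include <cstddef>

// Byte offset of element (x, y, z) in a pitched 3D allocation.
// pitch:    bytes per padded row
// height:   number of rows in one 2D slice
// elemSize: bytes per element
std::size_t offset3D(std::size_t x, std::size_t y, std::size_t z,
                     std::size_t pitch, std::size_t height, std::size_t elemSize) {
    assert(y < height && (x + 1) * elemSize <= pitch);
    const std::size_t slicePitch = pitch * height;  // bytes per 2D slice
    return z * slicePitch + y * pitch + x * elemSize;
}
```

Adding the returned offset to the base pointer, cast to char*, gives the address of the element.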

Layout of Pixel-data in Memory?

I'm writing a C++ library for an image format that is based on PNG. One stopping point for me is that I'm unsure as to how I ought to lay out the pixel data in memory; as far as I'm aware, there are two practical approaches:
An array of size (width * height); each pixel can be accessed by array[y*width + x].
An array of size (height), containing pointers to arrays of size (width).
The standard reference implementation for PNG (libpng) uses method 2 of the above, while I've seen others use method 1. Is one better than the other, or is each a method with its own pros and cons, to where a compromise must be made? Further, which format do most graphical display systems use (perhaps for ease of using the output of my library into other APIs)?
Off the top of my head:
The one thing that would make me choose #2 is the fact that your memory requirements are a little relaxed. If you were to go for #1, the system will need to be able to allocate height * width amount of contiguous memory. Whereas, in case of #2, it has the freedom to allocate smaller chunks of contiguous memory of size width (could as well be height) off of areas that are free. (When you factor in the channels per pixel, the #1 may fail for even moderately sized images.)
Further, it may be slightly better when swapping rows (or columns) if required for image manipulation purposes (pointer swap suffices).
The downside of #2 is of course the extra level of indirection that creeps in on every access, and the array of pointers that has to be maintained. But this hardly matters given today's processor speed and memory.
The second downside of #2 is that the rows aren't necessarily next to each other in memory, which makes it harder for the processor to load the right memory pages into the cache.
The advantage of method 2 (cutting up the array in rows) is that you can perform memory operations in steps, e.g. resizing or shuffling the image without reallocating the entire chunk of memory at once. For really large images this may be an advantage.
The advantage of a single array is that your calculations are simpler, i.e. to go one row down you do
pos += width;
instead of having to dereference pointers. For small to medium images this is probably faster. Unless you're dealing with images of hundreds of MB, I would go with method 1.
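For comparison, here is a minimal sketch of both layouts for a one-byte-per-pixel image. The type names are hypothetical and are not taken from libpng or any other library.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Method 1: one contiguous allocation, indexed as y * width + x.
struct FlatImage {
    std::size_t width, height;
    std::vector<unsigned char> pixels;
    FlatImage(std::size_t w, std::size_t h) : width(w), height(h), pixels(w * h) {}
    unsigned char& at(std::size_t x, std::size_t y) {
        assert(x < width && y < height);
        return pixels[y * width + x];
    }
};

// Method 2: one allocation per row. Rows can be swapped without copying pixels,
// at the cost of an extra indirection on every access.
struct RowImage {
    std::vector<std::vector<unsigned char>> rows;
    RowImage(std::size_t w, std::size_t h) : rows(h, std::vector<unsigned char>(w)) {}
    unsigned char& at(std::size_t x, std::size_t y) { return rows.at(y).at(x); }
};
```

Swapping two rows of a RowImage with std::swap only exchanges the internal pointers of the two row vectors, whereas the flat layout would have to copy width bytes each way.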
I suspect that libpng does that (style 2) for a few possible reasons:
Avoid large allocations (as mentioned), and may ease handling VERY large PNGs, especially in systems without VM
(perhaps) allow for interleaved decode à la interleaved JPEG (if PNG supports that)
Ease of certain transformations (vertical flip) (unlikely)
Ease of scaling (insert or remove lines without needing a full second buffer, or widen/narrow lines) (unlikely but possible)
The problem with this approach (assuming each line is a separate allocation) is a LOT more allocation/free overhead, and it may encourage memory fragmentation.
Unless you have a good reason, use style 1 (single allocation), and perhaps round to a "good" boundary for the architecture you're using (could be 4, 8, 16 or perhaps even more bytes). Note that many library functions may look for style 1 with no padding - think about how you'll be using this and where you'll be passing them to.
Windows itself uses a variation of method 1. Each row of the image is padded to a multiple of 4 bytes, and the order of the colors is B,G,R instead of the more normal R,G,B. Also the first row of the buffer is the bottom row of the image.
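The layout just described can be sketched as follows for a 24-bit image. The function names are hypothetical; the stride expression is the usual round-up to a multiple of 4 bytes, and y is counted from the top of the image.

```cpp
#include <cassert>
#include <cstddef>

// Row stride in bytes, padded to a multiple of 4.
std::size_t dibStride(std::size_t widthPixels, std::size_t bitsPerPixel) {
    assert(bitsPerPixel > 0);
    return ((widthPixels * bitsPerPixel + 31) / 32) * 4;
}

// Byte offset of the blue byte of pixel (x, y) in a bottom-up 24-bit buffer.
// The bytes of each pixel are stored as B, G, R, and the first row in memory
// is the bottom row of the image.
std::size_t bottomUpOffset24(std::size_t x, std::size_t y,
                             std::size_t width, std::size_t height) {
    assert(x < width && y < height);
    return (height - 1 - y) * dibStride(width, 24) + x * 3;
}
```

A 3-pixel-wide image therefore uses 12 bytes per row instead of 9, and the top-left pixel of a 2-row image sits at byte offset 12 rather than 0.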