Wrong pixel values when using padded local buffer OpenCL - c++

I'm facing an unexpected result when I use a local buffer to copy data in an OpenCL kernel. The code presented here is quite simple (and useless since I don't need to use a local buffer for such an operation), but this is a first step for convolution-like processes.
Here is my code:
std::string implementCopyFromLocalKernel()
{
return BOOST_COMPUTE_STRINGIZE_SOURCE(
__kernel void copyFromLocal_knl(__global const float* in,
const ulong sizeX, const ulong sizeY,
const int filterRadiusX, const int filterRadiusY,
__local float* localImage,
const ulong localSizeX, const ulong localSizeY,
__global float* out)
{
// Store each work-item’s unique row and column
const int x = get_global_id(0);
const int y = get_global_id(1);
// Group size
int groupSizeX = get_local_size(0);
int groupSizeY = get_local_size(1);
// Determine the size of the work group output region
int groupIdX = get_group_id(0);
int groupIdY = get_group_id(1);
// Determine the local ID of each work item
int localX = get_local_id(0);
int localY = get_local_id(1);
// Padding
int paddingX = filterRadiusX;
int paddingY = filterRadiusY;
// Cache the data to local memory
// Copy the data for the current coordinates
localImage[localX + localY*localSizeX] = in[x + y * sizeX];
barrier(CLK_LOCAL_MEM_FENCE);
out[x + y * sizeX] = localImage[localX + localY*localSizeX];
return;
}
);
}
void copyLocalBuffer(const boost::compute::context& context, boost::compute::command_queue& queue, const boost::compute::buffer& bufIn, boost::compute::buffer& bufOut, const size_t sizeX, const size_t sizeY)
{
const size_t nbPx = sizeX * sizeY;
const size_t maxSize = (sizeX > sizeY ? sizeX : sizeY);
// Prepare to launch the kernel
std::string kernel_src = implementCopyFromLocalKernel();
boost::compute::program program;
try {
program = boost::compute::program::create_with_source(kernel_src, context);
program.build();
}
catch (const boost::compute::opencl_error& e) {
std::cout << "Error bulding program from source : " << std::endl << e.what() << std::endl
<< program.build_log() << std::endl;
return;
}
boost::compute::kernel kernel;
try {
kernel = program.create_kernel("copyFromLocal_knl");
}
catch (const boost::compute::opencl_error& e) {
std::cout << "Error creating kernel : " << std::endl << e.what() << std::endl;
return;
}
try {
int localSizeX = 16;
int localSizeY = 16;
int paddingPixelsX = 2;// 0; // <- Changing to 0 works
int paddingPixelsY = paddingPixelsX;
int localWidth = localSizeX + 2 * paddingPixelsX;
int localHeight = localSizeY + 2 * paddingPixelsY;
boost::compute::buffer localImage(context, localWidth*localHeight * sizeof(float));
kernel.set_arg(0, bufIn);
kernel.set_arg(1, sizeX);
kernel.set_arg(2, sizeY);
kernel.set_arg(3, paddingPixelsX);
kernel.set_arg(4, paddingPixelsY);
kernel.set_arg(5, localImage);
kernel.set_arg(6, localWidth);
kernel.set_arg(7, localHeight);
kernel.set_arg(8, bufOut);
}
catch (const boost::compute::opencl_error& e) {
std::cout << "Error setting kernel arguments: " << std::endl << e.what() << std::endl;
return;
}
try {
size_t origin[2] = { 0, 0 };
size_t region[2] = { 256, 256 };// { sizeX, sizeY };
size_t localSize[2] = { 16, 16 };
queue.enqueue_nd_range_kernel(kernel, 2, origin, region, localSize);
}
catch (const boost::compute::opencl_error& e) {
std::cout << "Error executing kernel : " << std::endl << e.what() << std::endl;
return;
}
}
I reduced the code to simply copy the pixel corresponding to each work item into the associated local coordinate of the local image. Hence, the local image buffer should have 2*paddingPixelsX unused columns on each line and 2*paddingPixelsY unused lines.
It works if I don't add padding data (paddingPixelsX and paddingPixelsY = 0), but it seems that some work items don't read the data from the input buffer or write the data into the output buffer (or the local buffer?) at the correct place. Moreover, when I run my program several times, I never get the same result.
This is an example of the result I get (right) for the mandrill image as input (left):
I ensure that the threads are synchronized with barrier(CLK_LOCAL_MEM_FENCE); and each work item reads and writes a specific location, so if my code is buggy, I don't understand why the version without padding gives no errors.
Does someone have an idea?
Thanks,

As already confirmed, the problem was that the dynamically allocated local buffer passed to the kernel was created only once, not once per work group.
One of the solutions is to create the local buffer statically inside the kernel, for example:
__local float localImage[16*16];
If the size of the buffer cannot be hard-coded then it can be set via the preprocessor:
__local float localImage[SIZE_X*SIZE_Y];
where these parameters are passed as build options when the kernel is compiled.
From what I remember, using kernel parameters to define the size of a static local buffer may not work for every GPU (compilation will fail).
I'm not familiar with Boost.Compute, but I presume something similar should be possible by passing parameters to implementCopyFromLocalKernel(), which would then be converted into values during stringizing.
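For reference, Boost.Compute lets you pass such definitions as build options; a minimal sketch reusing kernel_src and context from the question's code (the macro names SIZE_X/SIZE_Y and their values are illustrative, not from the original question):
// Build the program with preprocessor definitions that size the static __local array
boost::compute::program program =
    boost::compute::program::create_with_source(kernel_src, context);
program.build("-DSIZE_X=20 -DSIZE_Y=20"); // 16 + 2*2 padding in each dimension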

Thanks to #doqtor, I understood that the issue came from the buffer passed as a kernel parameter. Because of that, all work groups used the same buffer.
Since I don't know the padding size I will need for convolution operations, I need this buffer as a parameter. I modified the kernel argument setup so that a different local buffer is allocated for each work group:
kernel.set_arg(5, localWidth*localHeight*sizeof(float), NULL);
I missed the important part when I read the documentation of clSetKernelArg:
If the argument is declared with the __local qualifier, the arg_value entry must be NULL.
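For the convolution case the question is building toward, the padded border of localImage also has to be filled; here is a kernel-side sketch (not part of the original answer, assuming clamp-to-edge handling at the image borders and reusing the variable names from the kernel above):
// Each work item loads several pixels so the whole padded tile, halo included,
// ends up in local memory before the barrier.
for (int ly = localY; ly < (int)localSizeY; ly += groupSizeY) {
    for (int lx = localX; lx < (int)localSizeX; lx += groupSizeX) {
        int gx = groupIdX * groupSizeX + lx - filterRadiusX;
        int gy = groupIdY * groupSizeY + ly - filterRadiusY;
        gx = clamp(gx, 0, (int)sizeX - 1); // clamp-to-edge border handling
        gy = clamp(gy, 0, (int)sizeY - 1);
        localImage[lx + ly * (int)localSizeX] = in[(ulong)gx + (ulong)gy * sizeX];
    }
}
barrier(CLK_LOCAL_MEM_FENCE);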

Related

(C++) Fastest way possible for reading in matrix files (arbitrary size)

I'm developing a bioinformatic tool, which requires reading in millions of matrix files (average dimension = (20k, 20k)). They are tab-delimited text files, and they look something like:
0.53 0.11
0.24 0.33
Because the software reads the matrix files one at a time, memory is not an issue, but it's very slow. The following is my current function for reading in a matrix file. I first make a matrix object using a double pointer, then fill in the matrix by looping through an input file.
#include <fstream>
#include <string>
using namespace std;

float** make_matrix(int nrow, int ncol, float val){
float** M = new float *[nrow];
for(int i = 0; i < nrow; i++) {
M[i] = new float[ncol];
for(int j = 0; j < ncol; j++) {
M[i][j] = val;
}
}
return M;
}
float** read_matrix(string fname, int dim_1, int dim_2){
float** K = make_matrix(dim_1, dim_2, 0);
ifstream ifile(fname);
for (int i = 0; i < dim_1; ++i) {
for (int j = 0; j < dim_2; ++j) {
ifile >> K[i][j];
}
}
ifile.clear();
ifile.seekg(0, ios::beg);
return K;
}
Is there a much faster way to do this? From my experience with Python, reading in a matrix file using pandas is so much faster than using Python for-loops. Is there a trick like that in C++?
(added)
Thanks so much everyone for all your suggestions and comments!
The fastest way, by far, is to change the way you write those files: write in binary format, two ints first (width, height), then just dump your values.
You will be able to load it in just three read calls.
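A minimal sketch of that idea with plain stdio (hypothetical helper functions, not from the original answer; error handling omitted):
#include <cstddef>
#include <cstdio>

// Write: two ints (rows, cols) followed by the raw float values
void write_matrix_bin(const char* fname, int nrow, int ncol, const float* data) {
    FILE* f = std::fopen(fname, "wb");
    std::fwrite(&nrow, sizeof(int), 1, f);
    std::fwrite(&ncol, sizeof(int), 1, f);
    std::fwrite(data, sizeof(float), (std::size_t)nrow * ncol, f);
    std::fclose(f);
}

// Read: three fread calls bring the whole matrix back
float* read_matrix_bin(const char* fname, int& nrow, int& ncol) {
    FILE* f = std::fopen(fname, "rb");
    std::fread(&nrow, sizeof(int), 1, f);
    std::fread(&ncol, sizeof(int), 1, f);
    float* data = new float[(std::size_t)nrow * ncol];
    std::fread(data, sizeof(float), (std::size_t)nrow * ncol, f);
    std::fclose(f);
    return data;
}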
Just for fun, I measured the program posted above (using a 20,000x20,000 ASCII input file, as described) on my Mac Mini (3.2GHz i7 with SSD drive) and found that it took about 102 seconds to parse in the file using the posted code.
Then I wrote a version of the same function that uses the C stdio API (fopen()/fread()/fclose()) and does character-by-character parsing into a 1D float array. This implementation takes about 13 seconds to parse in the file on the same hardware, so it's about 7 times faster.
Both programs were compiled with g++ -O3 test_read_matrix.cpp.
float* faster_read_matrix(string fname, int numRows, int numCols)
{
FILE * fpIn = fopen(fname.c_str(), "r");
if (fpIn == NULL)
{
printf("Couldn't open file [%s] for input!\n", fname.c_str());
return NULL;
}
float* K = new float[numRows*numCols];
// We'll hold the current number in (numberBuf) until we're ready to parse it
char numberBuf[128] = {'\0'};
int numCharsInBuffer = 0;
int curRow = 0, curCol = 0;
while(curRow < numRows)
{
char tempBuf[4*1024]; // an arbitrary size
const size_t bytesRead = fread(tempBuf, 1, sizeof(tempBuf), fpIn);
if (bytesRead == 0)
{
if (ferror(fpIn)) perror("fread"); // fread() returns a size_t, so check ferror() for read errors
break;
}
for (size_t i=0; i<bytesRead; i++)
{
const char c = tempBuf[i];
if ((c=='.')||(c=='+')||(c=='-')||(isdigit(c)))
{
if ((numCharsInBuffer+1) < sizeof(numberBuf)) numberBuf[numCharsInBuffer++] = c;
else
{
printf("Error, number string was too long for numberBuf!\n");
}
}
else
{
if (numCharsInBuffer > 0)
{
// Parse the current number-chars we have assembled into (numberBuf) and reset (numberBuf) to empty
numberBuf[numCharsInBuffer] = '\0';
if (curCol < numCols) K[curRow*numCols+curCol] = strtod(numberBuf, NULL);
else
{
printf("Error, too many values in row %i! (Expected %i, found at least %i)\n", curRow, numCols, curCol);
}
curCol++;
}
numCharsInBuffer = 0;
if (c == '\n')
{
curRow++;
curCol = 0;
if (curRow >= numRows) break;
}
}
}
}
fclose(fpIn);
if (curRow != numRows) printf("Warning: I read %i lines in the file, but I expected there would be %i!\n", curRow, numRows);
return K;
}
I am dissatisfied with Jeremy Friesner’s otherwise excellent answer because it:
blames the problem on C++'s I/O system (which is not at fault)
fixes the problem by circumventing the actual I/O problem without being explicit about how it is a significant contributor to speed
modifies the memory access pattern, which may or may not contribute to speed, and does so in a way that very large matrices may not be supported
The reason his code runs so much faster is that he removes the single most important bottleneck: unoptimized disk access. JWO's original code can be brought to match with three extra lines of code:
float** read_matrix(std::string fname, int dim_1, int dim_2){
float** K = make_matrix(dim_1, dim_2, 0);
constexpr std::size_t buffer_size = 4*1024; // 1
char buffer[buffer_size]; // 2
std::ifstream ifile(fname);
ifile.rdbuf()->pubsetbuf(buffer, buffer_size); // 3
for (int i = 0; i < dim_1; ++i) {
for (int j = 0; j < dim_2; ++j) {
ifile >> K[i][j];
}
}
// ifile.clear();
// ifile.seekg(0, std::ios::beg);
return K;
}
The addition exactly replicates Friesner’s design, but using the C++ library capabilities without all the extra programming grief on our end.
You’ll notice I also removed a couple lines at the bottom that should be inconsequential to program function and correctness, but which may cause a minor cumulative time issue as well. (If they are not inconsequential, that is a bug and should be fixed!)
How much difference this all makes depends entirely on the quality of the C++ Standard Library implementation. AFAIK the big three modern C++ compilers (MSVC, GCC, and Clang) all have sufficiently-optimized I/O handling to make the issue moot.
locale
One other thing that may also make a difference is to .imbue() the stream with the default "C" locale, which avoids a lot of special handling for numbers in locale-dependent formats other than what your files use. You only need to bother to do this if you have changed your global locale, though.
ifile.imbue(std::locale::classic()); // the "C" locale
redundant initialization
Another thing that is killing your time is the effort to zero-initialize the array when you create it. Don’t do that if you don’t need it! (You don’t need it here because you know the total extents and will fill them properly. C++17 and later is nice enough to give you a zero value if the input stream goes bad, too. So you get zeros for unread values either way.)
dynamic memory block size
Finally, keeping the data as an array of arrays should not significantly affect speed, but it might still be worth testing a single contiguous block if you can change it. This is assuming that the resulting matrix will never be too large for the memory manager to return as a single block (and consequently crash your program).
A common design is to allocate the entire array as a single block, with the requested size plus size for the array of pointers to the rest of the block. This allows you to delete the array in a single delete[] statement. Again, I don’t believe this should be an optimization issue you need to care about until your profiler says so.
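A sketch of that single-block design (a hypothetical replacement for make_matrix, assuming the row-pointer table is placed in front of the float data):
#include <cstddef>

float** make_matrix_block(int nrow, int ncol) {
    // One allocation: nrow row pointers followed by nrow*ncol floats.
    // The pointer table is a multiple of sizeof(float*), so the floats stay aligned.
    char* block = new char[nrow * sizeof(float*) + (std::size_t)nrow * ncol * sizeof(float)];
    float** rows = reinterpret_cast<float**>(block);
    float*  data = reinterpret_cast<float*>(block + nrow * sizeof(float*));
    for (int i = 0; i < nrow; ++i)
        rows[i] = data + (std::size_t)i * ncol;
    return rows; // free later with: delete[] reinterpret_cast<char*>(rows);
}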
At the risk of the answer being considered incomplete (no code examples), I would like to add to the other answers some additional options for how to tackle the problem:
Use a binary format (width, height, values...) as the file format and then use file mapping (MapViewOfFile() on Windows, mmap() on POSIX/Unix systems).
Then, you can simply point your "matrix structure" pointer to the mapped address space and you are done. In case you do something like sparse access to the matrix, it can even save some real IO. If you always do full access to all elements of the matrix (no sparse matrices etc.), it is still quite elegant and probably faster than malloc/read.
Replacements for C++ iostream, which is known to be quite slow and should not be used for performance-critical stuff:
Have a look at the {fmt} library, which has become quite popular in recent years and claims to be quite fast.
Back in the days when I did a lot of numerics on large data sets, I always opted for binary files for storage. (It was back when the fastest CPU you could get your hands on was the Pentium 1, with the floating point bug :)). Back then, everything was slower, memory was much more limited (we had MB, not GB, as units for RAM in our systems), and all in all, nearly 20 years have passed since.
So, as a refresher, I wrote some code to show how much faster than iostream and text files you can be if you do not have extra constraints (such as the endianness of different CPUs etc.).
So far, my little test only has an iostream and a binary file version with a) stdio fread() style loading and b) mmap(). Since I sit in front of a Debian Bullseye computer, my code uses Linux-specific stuff for the mmap() approach. To run it on Windows, you have to change a few lines of code and some includes.
Edit: I added a save function using {fmt} now as well.
Edit: I added a load function with stdio now as well.
Edit: To reduce memory workload, I reordered the code somewhat
and now only keep 2 matrix instances in memory at any given time.
The program does the following:
create a 20k x 20k matrix in ram (in a struct named Matrix_t). With random values, slowly generated by std::random.
Write the matrix with iostream to a text file.
Write the matrix with stdio to a binary file.
Create a new matrix textMatrix by loading its data from the text file.
Create a new matrix inMemoryMatrix by loading its data from the binary file with a few fread() calls.
mmap() the binary file and use it under the name mappedMatrix.
Compare each of the loaded matrices to the original randomMatrix to see if the round-trip worked.
Here are the results I got on my machine after compiling this work of wonder with clang++ -O3 -o fmatio fast-matrix-io.cpp -lfmt:
./fmatio
creating random matrix (20k x 20k) (27.0775seconds)
the first 10 floating values in randomMatrix are:
57970.2 -365700 -986079 44657.8 826968 -506928 668277 398241 -828176 394645
saveMatrixAsText_IOSTREAM()
saving matrix with iostream. (192.749seconds)
saveMatrixAsText_FMT(mat0_fmt.txt)
saving matrix with {fmt}. (34.4932seconds)
saveMatrixAsBinary()
saving matrix into a binary file. (30.7591seconds)
loadMatrixFromText_IOSTREAM()
loading matrix from text file with iostream. (102.074seconds)
randomMatrix == textMatrix
comparing randomMatrix with textMatrix. (0.125328seconds)
loadMatrixFromText_STDIO(mat0_fmt.txt)
loading matrix from text file with stdio. (71.2746seconds)
randomMatrix == textMatrix
comparing randomMatrix with textMatrix (stdio). (0.124684seconds)
loadMatrixFromBinary(mat0.bin)
loading matrix from binary file into memory. (0.495685seconds)
randomMatrix == inMemoryMatrix
comparing randomMatrix with inMemoryMatrix. (0.124206seconds)
mapMatrixFromBinaryFile(mat0.bin)
mapping a view to a matrix in a binary file. (4.5883e-05seconds)
randomMatrix == mappedMatrix
comparing randomMatrix with mappedMatrix. (0.158459seconds)
And here is the code:
#include <cinttypes>
#include <memory>
#include <random>
#include <iostream>
#include <fstream>
#include <cstring>
#include <string>
#include <chrono>
#include <limits>
#include <iomanip>
// includes for mmap()...
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <cstdio>
#include <cstdlib>
#include <unistd.h>
// includes for {fmt}...
#include <fmt/core.h>
#include <fmt/os.h>
struct StopWatch {
using Clock = std::chrono::high_resolution_clock;
using TimePoint =
std::chrono::time_point<Clock>;
using Duration =
std::chrono::duration<double>;
void start(const char* description) {
this->description = std::string(description);
tstart = Clock::now();
}
void stop() {
TimePoint tend = Clock::now();
Duration elapsed = tend - tstart;
std::cout << description << " (" << elapsed.count()
<< "seconds)" << std::endl;
}
TimePoint tstart;
std::string description;
};
struct Matrix_t {
uint32_t ncol;
uint32_t nrow;
float values[];
inline uint32_t to_index(uint32_t col, uint32_t row) const {
return ncol * row + col;
}
};
template <class Initializer>
Matrix_t *createMatrix
( uint32_t ncol,
uint32_t nrow,
Initializer initFn
) {
size_t nfloats = ncol*nrow;
size_t nbytes = UINTMAX_C(8) + nfloats * sizeof(float);
Matrix_t * result =
reinterpret_cast<Matrix_t*>(operator new(nbytes));
if (nullptr != result) {
result->ncol = ncol;
result->nrow = nrow;
for (uint32_t row = 0; row < nrow; row++) {
for (uint32_t col = 0; col < ncol; col++) {
result->values[result->to_index(col,row)] =
initFn(ncol,nrow,col,row);
}
}
}
return result;
}
void saveMatrixAsText_IOSTREAM(const char* filePath,
const Matrix_t* matrix) {
std::cout << "saveMatrixAsText_IOSTREAM()" << std::endl;
if (nullptr == matrix) {
std::cout << "cannot save matrix - no matrix!" << std::endl;
}
std::ofstream outFile(filePath);
if (outFile) {
outFile << matrix->ncol << " " << matrix->nrow << std::endl;
const auto defaultPrecision = outFile.precision();
outFile.precision
(std::numeric_limits<float>::max_digits10);
for (uint32_t row = 0; row < matrix->nrow; row++) {
for (uint32_t col = 0; col < matrix->ncol; col++) {
outFile << matrix->values[matrix->to_index(col,row)]
<< " ";
}
outFile << std::endl;
}
} else {
std::cout << "could not open " << filePath << " for writing."
<< std::endl;
}
}
void saveMatrixAsText_FMT(const char* filePath,
const Matrix_t* matrix) {
std::cout << "saveMatrixAsText_FMT(" << filePath << ")"
<< std::endl;
if (nullptr == matrix) {
std::cout << "cannot save matrix - no matrix!" << std::endl;
}
auto outFile = fmt::output_file(filePath);
outFile.print("{} {}\n", matrix->ncol, matrix->nrow);
for (uint32_t row = 0; row < matrix->nrow; row++) {
outFile.print("{}", matrix->values[matrix->to_index(0,row)]);
for (uint32_t col = 1; col < matrix->ncol; col++) {
outFile.print(" {}",
matrix->values[matrix->to_index(col,row)]);
}
outFile.print("\n");
}
}
void saveMatrixAsBinary(const char* filePath,
const Matrix_t* matrix) {
std::cout << "saveMatrixAsBinary()" << std::endl;
FILE * outFile = fopen(filePath, "wb");
if (nullptr != outFile) {
fwrite( &matrix->ncol, 4, 1, outFile);
fwrite( &matrix->nrow, 4, 1, outFile);
size_t nfloats = matrix->ncol * matrix->nrow;
fwrite( &matrix->values, sizeof(float), nfloats, outFile);
fclose(outFile);
} else {
std::cout << "could not open " << filePath << " for writing."
<< std::endl;
}
}
Matrix_t* loadMatrixFromText_IOSTREAM(const char* filePath) {
std::cout << "loadMatrixFromText_IOSTREAM()" << std::endl;
std::ifstream inFile(filePath);
if (inFile) {
uint32_t ncol;
uint32_t nrow;
inFile >> ncol;
inFile >> nrow;
uint32_t nfloats = ncol * nrow;
auto loader =
[&inFile]
(uint32_t , uint32_t , uint32_t , uint32_t )
-> float
{
float value;
inFile >> value;
return value;
};
Matrix_t * matrix = createMatrix( ncol, nrow, loader);
return matrix;
} else {
std::cout << "could not open " << filePath << "for reading."
<< std::endl;
}
return nullptr;
}
Matrix_t* loadMatrixFromText_STDIO(const char* filePath) {
std::cout << "loadMatrixFromText_STDIO(" << filePath << ")"
<< std::endl;
Matrix_t* matrix = nullptr;
FILE * inFile = fopen(filePath, "rt");
if (nullptr != inFile) {
uint32_t ncol;
uint32_t nrow;
fscanf(inFile, "%d %d", &ncol, &nrow);
auto loader =
[&inFile]
(uint32_t , uint32_t , uint32_t , uint32_t )
-> float
{
float value;
fscanf(inFile, "%f", &value);
return value;
};
matrix = createMatrix( ncol, nrow, loader);
fclose(inFile);
} else {
std::cout << "could not open " << filePath << "for reading."
<< std::endl;
}
return matrix;
}
Matrix_t* loadMatrixFromBinary(const char* filePath) {
std::cout << "loadMatrixFromBinary(" << filePath << ")"
<< std::endl;
FILE * inFile = fopen(filePath, "rb");
if (nullptr != inFile) {
uint32_t ncol;
uint32_t nrow;
fread( &ncol, 4, 1, inFile);
fread( &nrow, 4, 1, inFile);
uint32_t nfloats = ncol * nrow;
uint32_t nbytes = nfloats * sizeof(float) + UINT32_C(8);
Matrix_t* matrix =
reinterpret_cast<Matrix_t*>
(operator new (nbytes));
if (nullptr != matrix) {
matrix->ncol = ncol;
matrix->nrow = nrow;
fread( &matrix->values[0], sizeof(float), nfloats, inFile);
fclose(inFile); // close before returning so the handle is not leaked
return matrix;
} else {
std::cout << "could not find memory for the matrix."
<< std::endl;
}
fclose(inFile);
} else {
std::cout << "could not open file "
<< filePath << " for reading." << std::endl;
}
return nullptr;
}
void freeMatrix(Matrix_t* matrix) {
operator delete(matrix);
}
Matrix_t* mapMatrixFromBinaryFile(const char* filePath) {
std::cout << "mapMatrixFromBinaryFile(" << filePath << ")"
<< std::endl;
Matrix_t * matrix = nullptr;
int fd = open( filePath, O_RDONLY);
if (-1 != fd) {
struct stat sb;
if (-1 != fstat(fd, &sb)) {
auto fileSize = sb.st_size;
void* mapped =
mmap(nullptr, fileSize, PROT_READ, MAP_PRIVATE, fd, 0);
if (MAP_FAILED == mapped) {
// mmap() reports failure with MAP_FAILED, not nullptr
std::cout << "mmap() failed!" << std::endl;
} else {
matrix = reinterpret_cast<Matrix_t*>(mapped);
}
} else {
std::cout << "fstat() failed!" << std::endl;
}
close(fd);
} else {
std::cout << "open() failed!" << std::endl;
}
return matrix;
}
void unmapMatrix(Matrix_t* matrix) {
if (nullptr == matrix)
return;
size_t nbytes =
UINTMAX_C(8) +
sizeof(float) * matrix->ncol * matrix->nrow;
munmap(matrix, nbytes);
}
bool areMatricesEqual( const Matrix_t* m1, const Matrix_t* m2) {
if (nullptr == m1) return false;
if (nullptr == m2) return false;
if (m1->ncol != m2->ncol) return false;
if (m1->nrow != m2->nrow) return false;
// both exist and have same size...
size_t nfloats = m1->ncol * m1->nrow;
size_t nbytes = nfloats * sizeof(float);
return 0 == memcmp( m1->values, m2->values, nbytes);
}
int main(int argc, const char* argv[]) {
std::random_device rdev;
std::default_random_engine reng(rdev());
std::uniform_real_distribution<> rdist(-1.0E6F, 1.0E6F);
StopWatch sw;
auto randomInitFunction =
[&reng,&rdist]
(uint32_t ncol, uint32_t nrow, uint32_t col, uint32_t row)
-> float
{
return rdist(reng);
};
sw.start("creating random matrix (20k x 20k)");
Matrix_t * randomMatrix =
createMatrix(UINT32_C(20000),
UINT32_C(20000),
randomInitFunction);
sw.stop();
if (nullptr != randomMatrix) {
std::cout
<< "the first 10 floating values in randomMatrix are: "
<< std::endl;
std::cout << randomMatrix->values[0];
for (size_t i = 1; i < 10; i++) {
std::cout << " " << randomMatrix->values[i];
}
std::cout << std::endl;
sw.start("saving matrix with iostream.");
saveMatrixAsText_IOSTREAM("mat0_iostream.txt", randomMatrix);
sw.stop();
sw.start("saving matrix with {fmt}.");
saveMatrixAsText_FMT("mat0_fmt.txt", randomMatrix);
sw.stop();
sw.start("saving matrix into a binary file.");
saveMatrixAsBinary("mat0.bin", randomMatrix);
sw.stop();
sw.start("loading matrix from text file with iostream.");
Matrix_t* textMatrix =
loadMatrixFromText_IOSTREAM("mat0_iostream.txt");
sw.stop();
sw.start("comparing randomMatrix with textMatrix.");
if (!areMatricesEqual(randomMatrix, textMatrix)) {
std::cout << "randomMatrix != textMatrix!" << std::endl;
} else {
std::cout << "randomMatrix == textMatrix" << std::endl;
}
sw.stop();
freeMatrix(textMatrix);
textMatrix = nullptr;
sw.start("loading matrix from text file with stdio.");
textMatrix =
loadMatrixFromText_STDIO("mat0_fmt.txt");
sw.stop();
sw.start("comparing randomMatrix with textMatrix (stdio).");
if (!areMatricesEqual(randomMatrix, textMatrix)) {
std::cout << "randomMatrix != textMatrix!" << std::endl;
} else {
std::cout << "randomMatrix == textMatrix" << std::endl;
}
sw.stop();
freeMatrix(textMatrix);
textMatrix = nullptr;
sw.start("loading matrix from binary file into memory.");
Matrix_t* inMemoryMatrix =
loadMatrixFromBinary("mat0.bin");
sw.stop();
sw.start("comparing randomMatrix with inMemoryMatrix.");
if (!areMatricesEqual(randomMatrix, inMemoryMatrix)) {
std::cout << "randomMatrix != inMemoryMatrix!"
<< std::endl;
} else {
std::cout << "randomMatrix == inMemoryMatrix" << std::endl;
}
sw.stop();
freeMatrix(inMemoryMatrix);
inMemoryMatrix = nullptr;
sw.start("mapping a view to a matrix in a binary file.");
Matrix_t* mappedMatrix =
mapMatrixFromBinaryFile("mat0.bin");
sw.stop();
sw.start("comparing randomMatrix with mappedMatrix.");
if (!areMatricesEqual(randomMatrix, mappedMatrix)) {
std::cout << "randomMatrix != mappedMatrix!"
<< std::endl;
} else {
std::cout << "randomMatrix == mappedMatrix" << std::endl;
}
sw.stop();
unmapMatrix(mappedMatrix);
mappedMatrix = nullptr;
freeMatrix(randomMatrix);
} else {
std::cout << "could not create random matrix!" << std::endl;
}
return 0;
}
Please note that binary formats where you simply cast to a struct pointer also depend on how the compiler does alignment and padding within structures. In my case, I was lucky and it worked. On other systems, you might have to tweak a little (#pragma pack(4) or something along those lines) to make it work.
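One cheap safeguard for that assumption is a compile-time check of the header offset; a sketch (the assertion and its message are mine, not from the original answer):
#include <cstddef>

// If this fires, padding/alignment changed the layout and the 8-byte offsets
// used when reading and writing the binary file would need to be adjusted.
static_assert(offsetof(Matrix_t, values) == 8,
              "Matrix_t layout no longer matches the binary file header");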

Why does loading a block of memory from a DLL only crash at the second call to memmove?

This is the class in question (only functions that pertain to this question) and everything it depends on (all written myself). It provides an interface to a DLL.
struct MemRegion {
const uint64_t address;
const uint64_t size;
};
enum Version {
VERSION_US,
VERSION_JP
};
const struct MemRegion SEGMENTS[2][2] = {
{{1302528, 2836576},
{14045184, 4897408}},
{{1294336, 2406112},
{13594624, 4897632}},
};
using Slot = array<vector<uint8_t>, 2>;
class Game {
private:
Version m_version;
HMODULE m_dll;
const MemRegion* m_regions;
public:
Game(Version version, cstr dll_path) {
m_version = version;
m_dll = LoadLibraryA(dll_path);
if (m_dll == NULL) {
unsigned int lastError = GetLastError();
cerr << "Last error is " << lastError << endl;
exit(-2);
}
// this is a custom macro which calls a function in the dll
call_void_fn(m_dll, "sm64_init");
m_regions = SEGMENTS[version];
}
~Game() {
FreeLibrary(m_dll);
}
void advance() {
call_void_fn(m_dll, "sm64_update");
}
Slot alloc_slot() {
Slot buffers = {
vector<uint8_t>(m_regions[0].size),
vector<uint8_t>(m_regions[1].size)
};
return buffers;
}
void save_slot(Slot& slot) {
for (int i = 0; i < 2; i++) {
const MemRegion& region = m_regions[i];
vector<uint8_t>& buffer = slot[i];
cerr << "before memmove for savestate" << endl;
memmove(buffer.data(), reinterpret_cast<void* const>(m_dll + region.address), region.size);
cerr << "after memmove for savestate" << endl;
}
}
};
When I call save_slot(), it should copy two blocks of memory into a couple of vector<uint8_t>s. This does not seem to be the case, though. The function finishes the first copy, but throws a segmentation fault at the second memmove. Why does it only happen at the second copy, and how can I get around this sort of issue?
Edit 1: This is what GDB gives me when the program terminates:
Thread 1 received signal SIGSEGV, Segmentation fault.
0x00007ffac2164452 in msvcrt!memmove () from C:\Windows\System32\msvcrt.dll
Edit 2: I tried accessing the segments individually. It works, but for some reason, I can't access both segments in the same program.
I found out that HMODULE is equivalent to void*. Since you can't really use pointer arithmetic on void*s, you have to cast it to a uint8_t* or equivalent to properly get an offset.
Here's what that looks like in practice:
void save_state(Slot& slot) {
uint8_t* const _dll = (uint8_t*)((void*)m_dll);
for (int i = 0; i < 2; i++) {
MemRegion segment = m_regions[i];
std::vector<uint8_t>& buffer = slot[i];
memmove(&buffer[0], _dll + segment.address, segment.size);
}
}
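The same offset can also be expressed with reinterpret_cast instead of the double C-style cast (a minor style note, not from the original answer):
// reinterpret_cast states the byte-level reinterpretation explicitly
uint8_t* const _dll = reinterpret_cast<uint8_t*>(m_dll);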

Use data allocated dynamically in CUDA kernel on host

I am trying to build a container class on the device which manages some memory.
This memory is allocated dynamically and filled during object construction in the kernel.
According to the documentation that can be done with a simple new[] in the kernel (using CUDA 8.0 with compute capability 5.0 in Visual Studio 2012).
Afterwards I want to access the data inside the containers in host code (e.g. for testing if all values are correct).
A minimal version of the DeviceContainer class looks like this:
class DeviceContainer
{
public:
__device__ DeviceContainer(unsigned int size);
__host__ __device__ ~DeviceContainer();
__host__ __device__ DeviceContainer(const DeviceContainer & other);
__host__ __device__ DeviceContainer & operator=(const DeviceContainer & other);
__host__ __device__ unsigned int getSize() const { return m_sizeData; }
__device__ int * getDataDevice() const { return mp_dev_data; }
__host__ int* getDataHost() const;
private:
int * mp_dev_data;
unsigned int m_sizeData;
};
__device__ DeviceContainer::DeviceContainer(unsigned int size) :
m_sizeData(size), mp_dev_data(nullptr)
{
mp_dev_data = new int[m_sizeData];
for(unsigned int i = 0; i < m_sizeData; ++i) {
mp_dev_data[i] = i;
}
}
__host__ __device__ DeviceContainer::DeviceContainer(const DeviceContainer & other) :
m_sizeData(other.m_sizeData)
{
#ifndef __CUDA_ARCH__
cudaSafeCall( cudaMalloc((void**)&mp_dev_data, m_sizeData * sizeof(int)) );
cudaSafeCall( cudaMemcpy(mp_dev_data, other.mp_dev_data, m_sizeData * sizeof(int), cudaMemcpyDeviceToDevice) );
#else
mp_dev_data = new int[m_sizeData];
memcpy(mp_dev_data, other.mp_dev_data, m_sizeData * sizeof(int));
#endif
}
__host__ __device__ DeviceContainer::~DeviceContainer()
{
#ifndef __CUDA_ARCH__
cudaSafeCall( cudaFree(mp_dev_data) );
#else
delete[] mp_dev_data;
#endif
mp_dev_data = nullptr;
}
__host__ __device__ DeviceContainer & DeviceContainer::operator=(const DeviceContainer & other)
{
m_sizeData = other.m_sizeData;
#ifndef __CUDA_ARCH__
cudaSafeCall( cudaMalloc((void**)&mp_dev_data, m_sizeData * sizeof(int)) );
cudaSafeCall( cudaMemcpy(mp_dev_data, other.mp_dev_data, m_sizeData * sizeof(int), cudaMemcpyDeviceToDevice) );
#else
mp_dev_data = new int[m_sizeData];
memcpy(mp_dev_data, other.mp_dev_data, m_sizeData * sizeof(int));
#endif
return *this;
}
__host__ int* DeviceContainer::getDataHost() const
{
int * pDataHost = new int[m_sizeData];
cudaSafeCall( cudaMemcpy(pDataHost, mp_dev_data, m_sizeData * sizeof(int), cudaMemcpyDeviceToHost) );
return pDataHost;
}
It just manages the array mp_dev_data.
The array is created and filled with consecutive values during construction, which should only be possible on the device. (Note that in reality the size of the containers might be different from each other.)
I think I need to provide a copy constructor and an assignment operator since I don't know any other way to fill the array in the kernel. (See question No. 3 below.)
Since copy and deletion can also happen on the host, __CUDA_ARCH__ is used to determine for which execution path we're compiling. On the host cudaMemcpy and cudaFree are used, on the device we can just use memcpy and delete[].
The kernel for object creation is rather simple:
__global__ void createContainer(DeviceContainer * pContainer, unsigned int numContainer, unsigned int containerSize)
{
unsigned int offset = blockIdx.x * blockDim.x + threadIdx.x;
if(offset < numContainer)
{
pContainer[offset] = DeviceContainer(containerSize);
}
}
Each thread in a one-dimensional grid that is in range creates a single container object.
The main-function then allocates arrays for the container (90000 in this case) on the device and host, calls the kernel and attempts to use the objects:
void main()
{
const unsigned int numContainer = 90000;
const unsigned int containerSize = 5;
DeviceContainer * pDevContainer;
cudaSafeCall( cudaMalloc((void**)&pDevContainer, numContainer * sizeof(DeviceContainer)) );
dim3 blockSize(1024, 1, 1);
dim3 gridSize((numContainer + blockSize.x - 1)/blockSize.x , 1, 1);
createContainer<<<gridSize, blockSize>>>(pDevContainer, numContainer, containerSize);
cudaCheckError();
DeviceContainer * pHostContainer = (DeviceContainer *)malloc(numContainer * sizeof(DeviceContainer));
cudaSafeCall( cudaMemcpy(pHostContainer, pDevContainer, numContainer * sizeof(DeviceContainer), cudaMemcpyDeviceToHost) );
for(unsigned int i = 0; i < numContainer; ++i)
{
const DeviceContainer & dc = pHostContainer[i];
int * pData = dc.getDataHost();
for(unsigned int j = 0; j < dc.getSize(); ++j)
{
std::cout << pData[j];
}
std::cout << std::endl;
delete[] pData;
}
free(pHostContainer);
cudaSafeCall( cudaFree(pDevContainer) );
}
I have to use malloc for array creation on the host, since I don't want to have a default constructor for the DeviceContainer.
I try to access the data inside a container via getDataHost(), which internally just calls cudaMemcpy.
cudaSafeCall and cudaCheckError are simple macros that evaluate the cudaError returned by the function or actively poll the last error. For the sake of completeness:
#define cudaSafeCall(error) __cudaSafeCall(error, __FILE__, __LINE__)
#define cudaCheckError() __cudaCheckError(__FILE__, __LINE__)
inline void __cudaSafeCall(cudaError error, const char *file, const int line)
{
if (error != cudaSuccess)
{
std::cerr << "cudaSafeCall() returned:" << std::endl;
std::cerr << "\tFile: " << file << ",\nLine: " << line << " - CudaError " << error << ":" << std::endl;
std::cerr << "\t" << cudaGetErrorString(error) << std::endl;
system("PAUSE");
exit( -1 );
}
}
inline void __cudaCheckError(const char *file, const int line)
{
cudaError error = cudaDeviceSynchronize();
if (error != cudaSuccess)
{
std::cerr << "cudaCheckError() returned:" << std::endl;
std::cerr << "\tFile: " << file << ",\tLine: " << line << " - CudaError " << error << ":" << std::endl;
std::cerr << "\t" << cudaGetErrorString(error) << std::endl;
system("PAUSE");
exit( -1 );
}
}
I have 3 problems with this code:
If it is executed as presented here, I receive an "unspecified launch failure" of the kernel. The Nsight debugger stops me on the line mp_dev_data = new int[m_sizeData]; (either in the constructor or the assignment operator) and reports several access violations on global memory. The number of violations appears to be random between 4 and 11, and they occur in non-consecutive threads but always near the upper end of the grid (blocks 85 and 86).
If I reduce numContainer to 10, the kernel runs smoothly; however, the cudaMemcpy in getDataHost() fails with an invalid argument error, even though mp_dev_data is not 0. (I suspect that the assignment is faulty and the memory has already been deleted by another object.)
Even though I would like to know how to correctly implement the DeviceContainer with proper memory management, in my case it would also be sufficient to make it non-copyable and non-assignable. However, I don't know how to properly fill the container-array in the kernel. Maybe something like
DeviceContainer dc(5);
memcpy(&pContainer[offset], &dc, sizeof(DeviceContainer));
This would lead to problems with deleting mp_dev_data in the destructor. I would need to manage the memory deletion manually, which feels rather dirty.
I also tried to use malloc and free in kernel code instead of new and delete but the results were the same.
I am sorry that I wasn't able to frame my question in a shorter manner.
TL;DR: How to implement a class that dynamically allocates memory in a kernel and can also be used in host code? How can I initialize an array in a kernel with objects that can not be copied or assigned?
Any help is appreciated. Thank You.
Apparently the answer is: What I am trying to do is more or less impossible.
Memory allocated with new or malloc in the kernel is not placed in global memory but rather in a special heap memory which is inaccessible from the host.
The only option to access all memory on the host is to first allocate an array in global memory which is big enough to hold all elements on the heap and then write a kernel that copies all elements from the heap to global memory.
The access violations are caused by the limited heap size (which can be changed with cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size)).
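A minimal sketch of both points; the helper kernel name and the sizes are illustrative, not from the original post:
// 1) Raise the device-side heap limit before the first kernel that uses new/malloc.
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 256 * 1024 * 1024); // e.g. 256 MB

// 2) Copy each container's heap-allocated data into ordinary global memory,
//    which the host can then read back with cudaMemcpy.
__global__ void gatherToGlobal(const DeviceContainer* containers,
                               unsigned int numContainer,
                               unsigned int containerSize,
                               int* globalOut)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < numContainer) {
        const int* src = containers[idx].getDataDevice();
        for (unsigned int j = 0; j < containerSize; ++j)
            globalOut[idx * containerSize + j] = src[j];
    }
}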

Implementing cdbpp library for string values

I am trying to implement the cdbpp library from chokkan. I am facing some problems when trying to make it work for values of type string.
The original code and documentation can be found here:
http://www.chokkan.org/software/cdbpp/ and the git source code is here: https://github.com/chokkan/cdbpp
This is what I have so far:
In sample.cpp (from where I am calling the main function), I modified the build() function:
bool build()
{
// Open a database file for writing (with binary mode).
std::ofstream ofs(DBNAME, std::ios_base::binary);
if (ofs.fail()) {
std::cerr << "ERROR: Failed to open a database file." << std::endl;
return false;
}
try {
// Create an instance of CDB++ writer.
cdbpp::builder dbw(ofs);
// Insert key/value pairs to the CDB++ writer.
for (int i = 1;i < N;++i) {
std::string key = int2str(i);
const char* val = "foobar"; //string value here
dbw.put(key.c_str(), key.length(), &val, sizeof(i));
}
} catch (const cdbpp::builder_exception& e) {
// Abort if something went wrong...
std::cerr << "ERROR: " << e.what() << std::endl;
return false;
}
return true;
}
and in the cdbpp.h file, I modified the put() function as:
void put(const key_t *key, size_t ksize, const value_t *value, size_t vsize)
{
// Write out the current record.
std::string temp2 = *value;
const char* temp = temp2.c_str();
write_uint32((uint32_t)ksize);
m_os.write(reinterpret_cast<const char *>(key), ksize);
write_uint32((uint32_t)vsize);
m_os.write(reinterpret_cast<const char *>(temp), vsize);
// Compute the hash value and choose a hash table.
uint32_t hv = hash_function()(static_cast<const void *>(key), ksize);
hashtable& ht = m_ht[hv % NUM_TABLES];
// Store the hash value and offset to the hash table.
ht.push_back(bucket(hv, m_cur));
// Increment the current position.
m_cur += sizeof(uint32_t) + ksize + sizeof(uint32_t) + vsize;
}
Now I get the correct value if the string is less than or equal to 3 characters (e.g. foo returns foo). If it is longer than 3 characters, it gives me the correct string up to 3 characters and then a garbage value (e.g. foobar gives me foo�`).
I am a little new to C++ and I would appreciate any help you could give me.
(moving possible answer in comment to real answer)
vsize as passed into put is the size of an integer when it should be the length of the value string.
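Concretely, the call in build() could pass the value bytes and their real length instead (a sketch, assuming the stock put() that writes the given bytes verbatim):
const char* val = "foobar";
// pass the characters themselves and their length (+1 keeps the terminating '\0'); std::strlen is from <cstring>
dbw.put(key.c_str(), key.length(), val, std::strlen(val) + 1);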

Heap corruption - Loading file (StaticMesh)

I have the following class declared:
class StaticMesh
{
public:
unsigned int v_count;
float* vertices;
unsigned int n_count;
float* normals;
void Load_lin(const char* file);
void Draw(void);
void Release(void);
};
This class (as its name indicates) represents a static mesh, which can load .lin files.
.lin files are generated by another application I made using C#. This application reads .obj files and generates the .lin file that has this structure:
v_count v
n_count n
a#a#a
b#b#b
a#a#a
b#b#b
Where v is the number of vertices, n the number of normals, and a/b represent coordinates.
Load_lin(const char*) is the function that loads these files, and here it is:
void StaticMesh::Load_lin(const char* file)
{
std::ifstream in (file);
if (!in)
{
std::cout << "Error: Failed to load staticmesh from '" << file << "'." << std::endl;
return;
}
char buffer[256];
in.getline(buffer, 256);
sscanf_s(buffer, "v_count %i", &v_count);
in.getline(buffer, 256);
sscanf_s(buffer, "n_count %i", &n_count);
vertices = new float[v_count];
normals = new float[n_count];
unsigned int a = 0;
unsigned int p = 0;
float x, y, z;
do
{
in.getline(buffer, 256);
if (buffer[0] == '\n' || buffer[0] == '\r') break;
sscanf_s(buffer, "%f#%f#%f", &x, &y, &z);
vertices[a++] = x;
vertices[a++] = z;
vertices[a++] = y;
in.getline(buffer, 256);
sscanf_s(buffer, "%f#%f#%f", &x, &y, &z);
normals[p++] = x;
normals[p++] = z;
normals[p++] = y;
} while (!in.eof());
in.close();
}
I've narrowed down the cause of the error to this function; however, the error only shows when the application is closed, and sometimes it doesn't happen at all.
So the line where the error occurs is actually the end of WinMain:
return msn.message;
I went further and used std::cout to print the variables 'a' and 'p'; this causes a heap corruption error, but this time in malloc.c line 55:
__forceinline void * __cdecl _heap_alloc (size_t size)
{
if (_crtheap == 0) {
_FF_MSGBANNER(); /* write run-time error banner */
_NMSG_WRITE(_RT_CRT_NOTINIT); /* write message */
__crtExitProcess(255); /* normally _exit(255) */
}
return HeapAlloc(_crtheap, 0, size ? size : 1);
} // LINE 55
I've searched for this last error to no avail.
Thank you for your time. :)
I think v_count and n_count are the numbers of vertices and normals. According to the code, each vertex has 3 components (x/y/z) and each component is stored in a float variable. This means that you need to allocate 3 times v_count and 3 times n_count floats for vertices and normals respectively,
i.e. modify your allocation as:
vertices = new float[v_count * 3];
normals = new float[n_count * 3];
The header line v_count v means that the number you read into v_count is v. And then you have vertices = new float[v_count];. It's likely you don't allocate enough storage and write outside the bounds of the arrays vertices and normals, resulting in undefined behavior.
Just try rewriting your code to use std::vector instead of raw pointers. This will really help you avoid issues with heap corruption.
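A sketch of what the loading loop could look like with vectors (hypothetical; it assumes vertices and normals become std::vector<float> members):
// std::vector<float> vertices, normals;   // members instead of float*
vertices.reserve(v_count * 3);
normals.reserve(n_count * 3);
float x, y, z;
while (in.getline(buffer, 256) && buffer[0] != '\0' && buffer[0] != '\r')
{
    sscanf_s(buffer, "%f#%f#%f", &x, &y, &z);
    vertices.push_back(x); vertices.push_back(z); vertices.push_back(y);
    if (!in.getline(buffer, 256)) break;
    sscanf_s(buffer, "%f#%f#%f", &x, &y, &z);
    normals.push_back(x); normals.push_back(z); normals.push_back(y);
}
// The vectors grow as needed and release their memory automatically,
// so mismatched counts can no longer write past the end of an allocation.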