I am working with 3D volumetric images, possibly large (256x256x256). I have 3 such volumes that I want to read in and operate on. Presently, each volume is stored as a text file of numbers, which I read in using ifstream. I save it as a matrix (a class I have written that dynamically allocates a 3D array). Then I perform operations on these 3 matrices: addition, multiplication and even a Fourier transform. So far everything works, but it takes an enormous amount of time, especially the Fourier transform, since it has 6 nested loops.
I want to know how I can speed this up. I'd also like to know whether the fact that I have stored the images as text files makes a difference. Should I save them as binary, or in some other format that is easier/faster to read in? Is fstream the fastest way I can read them in? I use the same 3 matrices each time without changing them; does that make a difference? Also, is a pointer to pointer to pointer the best way to store a 3D volume? If not, what else can I do?
Also, is a pointer to pointer to pointer the best way to store a 3D volume?
Nope, that's usually very inefficient.
If not, what else can I do?
It's likely that you will get better performance if you store the volume in a contiguous block and use computed offsets into the block.
I'd usually use a structure like this:
class DataBlock {
public:
    unsigned int nx;
    unsigned int ny;
    unsigned int nz;
    std::vector<double> data;

    DataBlock(unsigned int in_nx, unsigned int in_ny, unsigned int in_nz) :
        nx(in_nx), ny(in_ny), nz(in_nz), data(in_nx*in_ny*in_nz, 0)
    {}

    // You may want to make this check bounds in debug builds
    double& at(unsigned int x, unsigned int y, unsigned int z) {
        return data[ x + y*nx + z*nx*ny ];
    }

    const double& at(unsigned int x, unsigned int y, unsigned int z) const {
        return data[ x + y*nx + z*nx*ny ];
    }

private:
    // Don't want this class copied, so remove the copy constructor and assignment.
    DataBlock(const DataBlock&);
    DataBlock& operator=(const DataBlock&);
};
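A volume like the ones in the question is then created and indexed like this (purely illustrative):

    DataBlock volume(256, 256, 256);      // one 256x256x256 volume
    volume.at(10, 20, 30) = 1.5;          // write a voxel
    double v = volume.at(10, 20, 30);     // read it back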
Storing a large (256³ elements) 3D image as plain text is a waste of resources.
Suppose, for example, that each line of your text file holds one value: you have to read characters until you find the end of the line (for a 3-digit number that is 4 bytes: 3 bytes for the digits, 1 byte for the newline), and afterwards you have to convert those digits into a number. With binary, you directly read a fixed number of bytes and you have your number. You could and should write and read the data as a binary image.
There are several formats for doing so; the one I would recommend is the MetaImage file format of VTK. In this format you have a plain-text header file and a binary file with the actual image data. From the header file you know how large your image is and what datatype it uses. In your program, you then read the binary data directly and save it to a 3D array.
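Ignoring the header parsing, reading the raw binary part of such a file straight into a contiguous array might look like this (a sketch only; it assumes the voxels were written as doubles in the machine's native byte order, and the file name is made up):

    #include <cstddef>
    #include <fstream>
    #include <vector>

    int main() {
        const std::size_t nx = 256, ny = 256, nz = 256;
        std::vector<double> volume(nx * ny * nz);

        // one bulk read instead of millions of text-to-number conversions
        std::ifstream in("volume1.raw", std::ios::binary);
        in.read(reinterpret_cast<char*>(volume.data()),
                static_cast<std::streamsize>(volume.size() * sizeof(double)));
        if (!in)
            return 1;   // missing file or short read

        // volume[x + y*nx + z*nx*ny] now addresses voxel (x, y, z)
        return 0;
    }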
If you really want to speed things up, use CUDA or OpenCL which will be pretty fast for your applications.
There are several C++ libraries that can help you with reading, writing and manipulating image data, including the aforementioned VTK as well as ITK.
256³ is a rather large number. Parsing 256³ text strings will take a considerable amount of time. Using binary makes reading/writing much faster because no number-to-string conversion is required, and the file takes much less space. For example, to read the number 123 as a char from a text file, the program has to read it as a string and convert it from decimal to binary with a series of multiplies by 10, whereas if you had written it directly as the binary value 0b01111011 you only need to read that byte back into memory, with no conversion at all.
Using hexadecimal numbers may also increase reading speed, since each hex digit maps directly to a binary value, but if you need more speed a binary file is the way to go. A single fread call is enough to load the whole 256³ bytes = 16 MB file into memory in well under a second, and when you're done you just fwrite it back to the file. To speed up the processing you can use SIMD (SSE/AVX), CUDA or another parallel-processing technique. You can improve the speed even further with multithreading, or by only saving the non-zero values, because in many cases most of the values will be 0.
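For instance, the fread/fwrite round trip boils down to something like this (a sketch; one byte per voxel and a made-up file name):

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t n = 256u * 256u * 256u;      // 16 MiB of 1-byte voxels
        std::vector<unsigned char> volume(n, 0);

        // write the whole volume with a single call
        if (std::FILE* out = std::fopen("volume.bin", "wb")) {
            std::fwrite(volume.data(), 1, n, out);
            std::fclose(out);
        }

        // ...and read it back with a single call
        if (std::FILE* in = std::fopen("volume.bin", "rb")) {
            std::fread(volume.data(), 1, n, in);
            std::fclose(in);
        }
        return 0;
    }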
Another reason may be that your array is large and each dimension is a power of 2. This has been discussed in many questions on SO:
Why is there huge performance hit in 2048x2048 versus 2047x2047 array multiplication?
Why is my program slow when looping over exactly 8192 elements?
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
You may consider changing the last dimension to 257 and trying again. Or, better, use a more cache-friendly algorithm, such as one based on divide and conquer.
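If you want to experiment with that, one way is to give the storage a padded stride while keeping the logical 256³ size, roughly like this (a sketch built on the DataBlock idea from the first answer):

    #include <vector>

    // Same idea as DataBlock, but the x rows are stored with a stride of nx + 1
    // (e.g. 257 instead of 256) so consecutive rows don't map to the same cache sets.
    class PaddedBlock {
    public:
        PaddedBlock(unsigned int in_nx, unsigned int in_ny, unsigned int in_nz)
            : nx(in_nx), ny(in_ny), nz(in_nz),
              sx(in_nx + 1),
              data(sx * in_ny * in_nz, 0.0) {}

        double& at(unsigned int x, unsigned int y, unsigned int z) {
            return data[x + y*sx + z*sx*ny];   // index with the padded stride sx
        }

    private:
        unsigned int nx, ny, nz;
        unsigned int sx;                       // storage stride of the x dimension
        std::vector<double> data;
    };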
You should add timers around the load and the processing so you know which is taking the most time, and focus your optimization efforts on it. If you control the file format, make one that is more efficient to read. If it is the processing, I'll echo what previous folks have said: investigate efficient memory layout as well as GPGPU computing. Good luck.
Related
I have a netcdf file which contains a float array (21600, 43200). I don't want to read in the entire array to RAM because it's too large, so I'm using the Dataset object from the netCDF4 library to read in the array.
I would like to calculate the mean of a subset of this array using two 1D numpy arrays (x_coords, y_coords) of 300-400 coordinates.
I don't think I can use basic indexing, because the coordinates I have aren't contiguous. What I'm currently doing is just feeding the arrays directly into the object, like so:
ncdf_data = Dataset(file, 'r')
mean = np.mean(ncdf_data.variables['q'][x_coords, y_coords])
The above code takes far too long for my liking (~3-4 seconds depending on the coordinates I'm using), and I'd like to speed this up somehow. Is there a pythonic way that I can use to directly work out the mean from such a subset without triggering fancy indexing?
I know h5py warns about the slow speed of fancy indexing,
docs.h5py.org/en/latest/high/dataset.html#fancy-indexing.
netcdf probably has the same problem.
Can you load a contiguous slice that contains all the values, and apply the faster numpy advanced indexing to that in-memory subset? Or you may have to work with chunks.
numpy advanced indexing is slower than its basic slicing, but it is still quite a bit faster than fancy indexing directly off the file.
However you do it, np.mean will be operating on data in memory, not directly on data in the file. The slowness of fancy indexing is because it has to access data scattered throughout the file. Loading the data into an array in memory isn't the slow part; the slow part is seeking and reading from the file.
Putting the file on a faster drive (e.g. a solid state one) might help.
Hmm, hello! I am trying to read a binary file that contains a number of float values at a specific position. As seemingly must be done with binary files, they were saved as arrays of bytes, and I have been searching for a way to convert them back to floats with no success. Basically I have a char* memory block, and I am attempting to extract the floats stored at a particular location and seamlessly insert them into a vector. I wonder, would that be possible, or would I be forced to rely on arrays instead if I wanted to avoid copying the data? And how could it possibly be done? Thank you ^_^
If you know where the floats are you can read them back:
float a = *(float*)(buffer + position);   // float stored at byte offset 'position'
Then you can do whatever you need of a, including 'push_back'ing it into a vector.
Make sure you read the file in binary mode, and if you know the positions of the float in the file it should work.
I'd need to see the code that generated the file to be more specific.
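If alignment is a concern (the cast above technically requires the float to be suitably aligned), a memcpy-based sketch does the same job and usually compiles down to the same code; the function and parameter names here are mine, not from your program:

    #include <cstddef>
    #include <cstring>
    #include <vector>

    // Extract `count` floats stored at byte offset `position` in a raw buffer
    // and append them to a vector.
    void append_floats(const char* buffer, std::size_t position, std::size_t count,
                       std::vector<float>& out)
    {
        out.reserve(out.size() + count);
        for (std::size_t i = 0; i < count; ++i) {
            float value;
            std::memcpy(&value, buffer + position + i * sizeof(float), sizeof(float));
            out.push_back(value);
        }
    }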
I'm writing a data compression library and I need to write a sequence of encodings of integers (in a variety of integer encoders) in memory, store it in a file and then read all of them later.
Integer encodings have to be stored consecutively.
Since generally their size in bits isn't a multiple of 8, I don't have them aligned in memory.
In short, what I need is something which exposes functions like these:
unsigned int BitReader::read_bits(size_t bits);
unsigned int BitWriter::write_bits(unsigned int num, size_t bits);
void BitWriter::get_array(char** array);
BitReader::BitReader(char *array);
Since I need to invoke those functions in a (very) tight loop, efficiency is of paramount interest (especially in reading).
Do you know of any C++ libraries which do what I want? Thanks.
If efficiency is your only requirement, then get the address of the in-memory storage for the data and write it out directly. Then, on restore, allocate the same storage and perform the reverse operation. It's simple, fast, and has no learning curve.
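To make that concrete, here is a minimal sketch of the interface from the question built on one contiguous buffer that can be dumped and restored in a single call. It processes one bit at a time for clarity, whereas a serious implementation would move whole words at once, and there is no bounds checking; all names are illustrative:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Minimal bit packer: appends the low `bits` bits of `num` to a byte buffer.
    class BitWriter {
    public:
        void write_bits(unsigned int num, std::size_t bits) {
            for (std::size_t i = 0; i < bits; ++i) {
                if (bitpos_ == 0) buf_.push_back(0);          // start a new byte
                if (num & (1u << i))
                    buf_.back() |= static_cast<uint8_t>(1u << bitpos_);
                bitpos_ = (bitpos_ + 1) % 8;
            }
        }
        // The whole encoded stream; dump it with a single write call.
        const std::vector<uint8_t>& buffer() const { return buf_; }
    private:
        std::vector<uint8_t> buf_;
        std::size_t bitpos_ = 0;   // next free bit in buf_.back()
    };

    // Minimal bit reader over a contiguous buffer produced by BitWriter.
    class BitReader {
    public:
        explicit BitReader(const uint8_t* data) : data_(data) {}
        unsigned int read_bits(std::size_t bits) {
            unsigned int value = 0;
            for (std::size_t i = 0; i < bits; ++i, ++pos_)
                if (data_[pos_ / 8] & (1u << (pos_ % 8)))
                    value |= 1u << i;
            return value;
        }
    private:
        const uint8_t* data_;
        std::size_t pos_ = 0;      // absolute bit position in the buffer
    };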
Opening a stream to whatever input/output you need is the most effective approach, although not the most efficient for huge amounts of data. Streams provide a portable way of carrying out read/write operations, which is why you can open a stream in memory or on disk. If you want to take control of disk access itself, I would recommend the _bios_disk function; google "_bios_disk" for more info.
Greetings,
I need to multiply two extremely long integer values stored in a text file (exported via GMP (MPIR, to be exact), so they can be in any base). Now, I would usually just import these integers via the mpz_inp_str() function and perform the multiplication in RAM; however, these values are so long that I can't really load them (about 1 GB of data each). What would be the fastest way to do this? Perhaps there are some external libraries that do this sort of thing already? Are there any easily implementable methods for this (performance is not incredibly important, as this operation would only be performed once or twice)?
tl;dr: I need to multiply values so large they don't fit into process memory limits (Windows).
Thank you for your time.
I don't know if there is a library that supports this, but you could use GMP/MPIR on parts of each really big number (RBN). That is, start by breaking each RBN into manageable, uniformly sized chunks (e.g. 10M-digit chunks; expect an undersized chunk for the most significant digits; also see below).
RBN1 --> A B C D E
RBN2 --> F G H I J
The chunking can be done in base 10, so just read <chunk_size> characters from the file for each piece. Then multiply chunks from each number one at a time.
AxF BxF CxF DxF ExF
+ AxG BxG CxG DxG ExG
+ AxH BxH CxH DxH ExH
+ AxI BxI CxI DxI ExI
+ AxJ BxJ CxJ DxJ ExJ
Perform each column of the final sum in memory. Then, keeping the carry in memory, write the column out to disk, and repeat for the next column... For carries, convert each column sum result to a string with GMP, write out the bottom <chunk_size> portion of the result and read the top portion back in as a GMP int for the carry.
I'd suggest selecting a chunk size dynamically for each multiplication in order to keep each column addition in memory; the larger the numbers, the more column additions will need to be done, and the smaller the chunk size will need to be.
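To make the column scheme concrete, here is a small in-memory sketch using GMP's C++ interface (gmpxx). It keeps every column in RAM, whereas the scheme above streams each column and its carry through disk, and all the names here are made up:

    #include <gmpxx.h>
    #include <cstddef>
    #include <string>
    #include <vector>

    // Split a base-10 digit string into chunks of `digits` decimal digits,
    // least-significant chunk first.
    std::vector<mpz_class> split_chunks(const std::string& s, std::size_t digits)
    {
        std::vector<mpz_class> chunks;
        for (std::size_t end = s.size(); end > 0; ) {
            std::size_t begin = (end > digits) ? end - digits : 0;
            chunks.emplace_back(s.substr(begin, end - begin), 10);
            end = begin;
        }
        return chunks;
    }

    // Schoolbook multiplication over the chunks; each chunk is a "digit" in
    // base 10^digits, and carries are propagated column by column.
    std::vector<mpz_class> chunked_mul(const std::vector<mpz_class>& a,
                                       const std::vector<mpz_class>& b,
                                       std::size_t digits)
    {
        mpz_class base;
        mpz_ui_pow_ui(base.get_mpz_t(), 10, digits);           // base = 10^digits

        std::vector<mpz_class> result(a.size() + b.size(), 0);
        for (std::size_t col = 0; col < result.size(); ++col) {
            mpz_class column = result[col];                     // carry from earlier columns
            for (std::size_t i = 0; i < a.size() && i <= col; ++i) {
                std::size_t j = col - i;
                if (j < b.size())
                    column += a[i] * b[j];                      // one AxF-style product
            }
            result[col] = column % base;                        // chunk for this column
            if (col + 1 < result.size())
                result[col + 1] += column / base;               // carry to the next column
        }
        return result;                                          // least-significant chunk first
    }

The result chunks come out least-significant first; when writing them back out, every chunk except the most significant one needs to be zero-padded to the full chunk width.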
For both reading and writing, I'd suggest using memory-mapped files; Boost has a nice interface for this (note that this does not load the entire file, it basically just buffers the I/O through virtual memory). Open one mapped file for each input RBN, and one for the output with size = size(RBN1) + size(RBN2) + 1. With memory-mapped files, file access is treated as a raw char*, so you can read/write chunks directly using the GMP C-string I/O methods. You will probably need to read into an intermediate buffer in order to get NULL-terminated strings for GMP (unless you want to temporarily alter the memory-mapped file).
This isn't very easy to implement correctly, but then again this isn't a particularly easy problem (maybe just tedious). This approach has the advantage that it exactly mirrors what GMP is doing in memory, so the algorithms are well known.
I need to load large models and other structured binary data on an older CD-based game console as efficiently as possible. What's the best way to do it? The data will be exported from a Python application. This is a pretty elaborate hobby project.
Requirements:
no reliance on a fully standard-compliant STL - I might use uSTL though.
as little overhead as possible. Aim for a solution so good that it could be used on the original PlayStation, and yet as modern and elegant as possible.
no backward/forward compatibility necessary.
no copying of large chunks around - preferably files get loaded into RAM in background, and all large chunks accessed directly from there later.
should not rely on the target having the same endianness and alignment, i.e. a C plugin in Python which dumps its structs to disc would not be a very good idea.
should allow to move the loaded data around, as with individual files 1/3 the RAM size, fragmentation might be an issue. No MMU to abuse.
robustness is a great bonus, as my attention span is very short, i.e. I'd change the saving part of the code and forget the loading one, or vice versa, so at least a dumb safeguard would be nice.
exchangeability between loaded data and runtime-generated data without runtime overhead and without severe memory management issues would be a nice bonus.
I kind of have a semi-plan of parsing, in Python, trivial limited-syntax C headers which would use structs with offsets instead of pointers, plus convenience wrapper structs/classes in the main app with getters that would convert the offsets to properly typed pointers/references, but I'd like to hear your suggestions.
Clarification: the request is primarily about data loading framework and memory management issues.
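Roughly, the kind of offset-to-pointer wrapper I have in mind would look something like this (just a sketch, nothing final):

    #include <cstdint>

    // Stored in the cooked file in place of a pointer: a byte offset relative
    // to the start of the loaded block. A getter turns it into a typed pointer.
    template <typename T>
    struct FileOffset {
        int32_t offset;

        T* resolve(void* block_base) const {
            return reinterpret_cast<T*>(static_cast<char*>(block_base) + offset);
        }
    };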
On platforms like the Nintendo GameCube and DS, 3D models are usually stored in a very simple custom format:
A brief header, containing a magic number identifying the file, the number of vertices, normals, etc., and optionally a checksum of the data following the header (Adler-32, CRC-16, etc).
A possibly compressed list of 32-bit floating-point 3-tuples for each vertex and normal.
A possibly compressed list of edges or faces.
All of the data is in the native endian format of the target platform.
The compression format is often trivial (Huffman), simple (Arithmetic), or standard (gzip). All of these require very little memory or computational power.
You could take formats like that as a cue: it's quite a compact representation.
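As an illustration, such a header could be declared like this (the field names and sizes are made up, not taken from any actual console format):

    #include <cstdint>

    // Header at the start of the model file, as described above.
    struct ModelHeader {
        char     magic[4];      // identifies the file type, e.g. "MDL0"
        uint32_t num_vertices;
        uint32_t num_normals;
        uint32_t num_faces;
        uint32_t checksum;      // Adler-32 / CRC of the data that follows
    };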
My suggestion is to use a format most similar to your in-memory data structures, to minimize post-processing and copying. If that means you create the format yourself, so be it. You have extreme needs, so extreme measures are needed.
This is a common game development pattern.
The usual approach is to cook the data in an offline pre-processing step. The resulting blobs can be streamed in with minimal overhead. The blobs are platform-dependent and should have the proper alignment and endianness for the target platform.
At runtime, you can simply cast a pointer to the in-memory blob file. You can deal with nested structures as well. If you keep a table of contents with offsets to all the pointer values within the blob, you can then fix up the pointers to point to the proper addresses. This is similar to how DLL loading works.
I've been working on a Ruby library, bbq, that I use to cook data for my iPhone game.
Here's the memory layout I use for the blob header:
// Memory layout
//
// p : beginning of file in memory.
// p + 0 : num_pointers
// p + 4 : offset 0
// p + 8 : offset 1
// ...
// p + ((num_pointers - 1) * 4) : offset n-1
// p + (num_pointers * 4) : num_pointers // again so we can figure out
// what memory to free.
// p + ((num_pointers + 1) * 4) : start of cooked data
//
Here's how I load the binary blob file and fix up the pointers:
void* bbq_load(const char* filename)
{
    unsigned char* p;
    int size = LoadFileToMemory(filename, &p);
    if(size <= 0)
        return 0;

    // get the start of the pointer table
    unsigned int* ptr_table = (unsigned int*)p;
    unsigned int num_ptrs = *ptr_table;
    ptr_table++;

    // get the start of the actual data
    // the 2 is to skip past both num_pointer values
    unsigned char* base = p + ((num_ptrs + 2) * sizeof(unsigned int));

    // fix up the pointers
    while ((ptr_table + 1) < (unsigned int*)base)
    {
        unsigned int* ptr = (unsigned int*)(base + *ptr_table);
        *ptr = (unsigned int)((unsigned char*)ptr + *ptr);
        ptr_table++;
    }

    return base;
}
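A hypothetical caller then just casts the returned pointer to the first cooked structure; every embedded pointer already points back into the blob (the struct layout and file name below are made up):

    // Hypothetical layout of the first cooked structure in the blob; `verts`
    // was stored as an offset and turned into a real pointer by bbq_load().
    struct ModelBlob {
        unsigned int num_verts;
        float*       verts;
    };

    ModelBlob* model = (ModelBlob*)bbq_load("model.bbq");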
My bbq library isn't quite ready for prime time, but it could give you some ideas on how to write one yourself in python.
Good Luck!
I note that nowhere in your description do you ask for "ease of programming". :-)
Thus, here's what comes to mind for me as a way of creating this:
The data should be in the same on-disk format as it would be in the target's memory, such that the target can simply pull blobs from disk into memory with no reformatting. Depending on how much freedom you want in putting things into memory, the "blobs" could be the whole file, or could be smaller bits within it; I don't understand your data well enough to suggest how to subdivide it, but presumably you can. Because we can't rely on the same endianness and alignment on the host, you'll need to be somewhat clever about translating things when writing the files on the host side, but at least this way you only need the cleverness on one side of the transfer rather than on both.
In order to provide a bit of assurance that the target-side and host-side code matches, you should write this in a form where you provide a single data description and have some generation code that will generate both the target-side C code and the host-side Python code from it. You could even have your generator generate a small random "version" number in the process, and have the host-side code write this into the file header and the target-side check it, and give you an error if they don't match. (The point of using a random value is that the only information bit you care about is whether they match, and you don't want to have to increment it manually.)
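For example, the generated target-side check could be as small as this (a sketch; the constant stands in for whatever random value the generator emitted, and the names are illustrative):

    #include <cstring>

    // Written by the generator into both the Python exporter and this C++ loader.
    static const unsigned int kFormatTag = 0x7f3a91c4u;   // illustrative random value

    // Returns true if the blob was cooked by a matching generator run.
    bool check_format_tag(const unsigned char* header)
    {
        unsigned int tag;
        std::memcpy(&tag, header, sizeof(tag));
        return tag == kFormatTag;
    }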
Consider storing your data as BLOBs in a SQLite DB. SQLite is extremely portable and lightweight, written in ANSI C, and has both C++ and Python interfaces. This will take care of large files, no fragmentation, variable-length records with fast access, and so on. The rest is just serialization of structs to these BLOBs.
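A minimal sketch of storing one cooked blob with the SQLite C API might look like this (the table, column and function names are made up, and error handling is trimmed):

    #include <sqlite3.h>
    #include <vector>

    // Store one binary blob under a name in a SQLite database.
    bool store_blob(const char* db_path, const char* name,
                    const std::vector<unsigned char>& blob)
    {
        sqlite3* db = 0;
        if (sqlite3_open(db_path, &db) != SQLITE_OK)
            return false;

        sqlite3_exec(db,
                     "CREATE TABLE IF NOT EXISTS assets(name TEXT PRIMARY KEY, data BLOB)",
                     0, 0, 0);

        sqlite3_stmt* stmt = 0;
        sqlite3_prepare_v2(db, "INSERT OR REPLACE INTO assets(name, data) VALUES(?, ?)",
                           -1, &stmt, 0);
        sqlite3_bind_text(stmt, 1, name, -1, SQLITE_STATIC);
        sqlite3_bind_blob(stmt, 2, blob.data(), (int)blob.size(), SQLITE_STATIC);

        bool ok = (sqlite3_step(stmt) == SQLITE_DONE);
        sqlite3_finalize(stmt);
        sqlite3_close(db);
        return ok;
    }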