Reading and storing large matrix file for GPU - c++

Goal: Storing a large matrix in memory (Radon matrix), and transferring it into GPU memory for massively parallel operations.
Problem: horrible reading time, and potentially sub-optimal use of space (though that is not a limiting factor for the program's usage).
I have the possibility of doing this in either C or C++.
The files which I'm receiving are parsed as follows:
0.70316,0.71267,0.72221,0.73177,0.74135,0.75094,0.76053,0.77011,0.77967,0.7892,0.79868,0.80811,0.81747
and this goes on for at least 50MB.
My naïve implementation:
float **Radon;
Radon = (float **)malloc(HeightxNproj * sizeof(float *));
for (int i = 0; i < HeightxNproj; i++)
    Radon[i] = (float *)malloc(WidthSquared * sizeof(float));

FILE *radonFile;
radonFile = fopen("radon.txt", "r");
if (radonFile == NULL)
{
    printf("Radon file opening failed.");
    return -1;
}

for (int i = 0; i < HeightxNproj; i++)
{
    for (int j = 0; j < WidthSquared; j++)
    {
        fscanf(radonFile, "%f,", &Radon[i][j]);
    }
}

fclose(radonFile);
printf("Radon loaded.");
I'm programming for Windows. I've read a bit about file memory mapping, but I don't know whether that method, which doesn't actually store the matrix in memory, is compatible with GPGPU programming. I'm using CUDA, and I'll have to transfer this matrix into GPU memory for parallel operations.
This file-reading method performs terribly: it takes roughly a minute to read and parse the 50MB file. Is there a way to shorten the reading and parsing time? The matrix is also sparse; are there common ways of dealing with such matrices?

The more separate accesses you make to a file, the more performance you lose. The first step you should take is to estimate how much information you need to read from the file and read it in one go. That alone will improve your performance by a huge amount. You can also use memory-mapped files.
and this goes on for at least 50MB.
This is not that much.
The files I'm receiving are parsed as follows:
0.70316,0.71267,0.72221,0.73177,0.74135,0.75094,0.76053,0.77011,0.77967,0.7892,0.79868,0.80811,0.81747
Save it in binary to save about half of the memory (maybe even more). This will also increase reading speed.
Read the whole file at one time.
An example will make you realize how naive and slow your approach is:
Once I was implementing an algorithm that read an .obj 3D model. The model was about 10 MB and it took around 1-2 minutes to load. This was very strange, because Blender could load it almost immediately, in maybe 1 or 2 seconds. Mapping the whole file to memory and pre-allocating buffers allowed me to load the file in less than 5 seconds.
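Putting that advice together, here is a minimal sketch of reading the whole text file with one fread and parsing it with strtof. The dimensions (HeightxNproj, WidthSquared) come from your question; the flat one-dimensional allocation is my own assumption, chosen because it can later be copied to the GPU with a single cudaMemcpy:
// Sketch: one fread for the whole file, then strtof over the buffer.
#include <cstdio>
#include <cstdlib>

float *loadRadon(const char *path, size_t rows, size_t cols)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;

    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);

    char *text = (char *)malloc(size + 1);
    if (fread(text, 1, size, f) != (size_t)size) { fclose(f); free(text); return NULL; }
    text[size] = '\0';
    fclose(f);

    float *radon = (float *)malloc(rows * cols * sizeof(float));
    char *p = text;
    for (size_t i = 0; i < rows * cols; ++i) {
        radon[i] = strtof(p, &p);                            // parse one value
        while (*p == ',' || *p == '\n' || *p == '\r') ++p;   // skip separators
    }
    free(text);
    return radon;   // element (i, j) lives at radon[i * cols + j]
}
Once the values are parsed, you can also dump them with a single fwrite and reload them with a single fread on later runs, which skips the text parsing entirely; the flat buffer then goes to the GPU with one cudaMemcpy.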
Note:
I can do this in either C or C++, both are ok.
Don't ever mix C with C++ when it comes to memory management, unless you are sure what you are doing. C++ exceptions can cause huge memory leaks if you don't protect C dynamically allocated memory using RAII.

Related

C++ Binary File I/O Operations Slow Down... How DB Handle Binary Files?

I am trying to make a simple file-based hash table. Here is my insert member function:
private: std::fstream f; // std::ios::in | std::ios::out | std::ios::binary
public: void insert(const char* this_key, long this_value) {
    char that_key[BLOCK_SIZE] = {0}; // buffer for the stored key (was an uninitialized char*)
    long this_hash = std::hash<std::string>{}(this_key) % M;
    long that_hash; // also block status
    long block = this_hash;
    long offset = block * BLOCK_SIZE;
    while (true) {
        this->f.seekg(offset);
        this->f.read((char*) &that_hash, sizeof(long));
        if (that_hash > -1) { // -1 (by default) indicates a never allocated block
            this->f.read(that_key, BLOCK_SIZE);
            if (strcmp(this_key, that_key) == 0) {
                this->f.seekp(this->f.tellg());
                this->f.write((char*) &this_value, sizeof(long));
                break;
            } else {
                block = (block + 1) % M; // linear probing
                offset = block * BLOCK_SIZE;
                continue;
            }
        } else {
            this->f.seekp(offset);
            this->f.write((char*) &this_hash, sizeof(long)); // as block status
            this->f.write(this_key, KEY_SIZE);
            this->f.write((char*) &this_value, sizeof(long));
            break;
        }
    }
}
Tests with up to 10M key/value pairs and 50,000,017 blocks went reasonably well (the binary file size was about 3.8GB).
However, a test with 50M keys and 250,000,013 blocks slows down dramatically (the binary file size was more than 19GB in this case). 1,000 inserts usually take 4~5ms, but occasionally take more than 2,000ms. It gets slower and slower, eventually taking 40~150ms per 1,000 inserts (x10 ~ x30 slower). I definitely have no idea why.
What causes this exceptional binary file I/O slowdown?
Are seekg/seekp and other I/O operations affected by file size? (I could not find any references on this question though...)
How do key/value stores and databases avoid this I/O slowdown?
How can I solve this problem?
Data size
Disk drive block sizes are usually a power of 2, so if your data block size is also a power of 2, you can essentially eliminate the case where a data block crosses a disk block boundary.
In your case, a value of 64 bytes (or 32 bytes if you don't really need to store the hash) might perform a bit better.
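For illustration only, here is what a 64-byte, power-of-two block could look like; the 48-byte key size is an assumption, not taken from your code:
#include <cstddef>
#include <cstdint>

// Hypothetical 64-byte block: 8-byte hash/status + 48-byte key + 8-byte value.
constexpr std::size_t KEY_SIZE = 48;
struct Block {
    std::int64_t hash;           // also serves as the block status (-1 = never allocated)
    char         key[KEY_SIZE];
    std::int64_t value;
};
static_assert(sizeof(Block) == 64, "block size should stay a power of two");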
Insertion order
The other thing you could do to improve performance is to do your insertions in increasing hash order, to reduce the number of times data must be loaded from the disk.
Generally, when data is read from or written to the disk, the OS will read/write a large chunk at a time (maybe 4k), so if your algorithm writes data that is local in time, you will reduce the number of times data must actually be read from or written to the disk.
Given that you make a lot of insertions, one possibility would be to process insertions in batches of, say, 1,000 or even 10,000 key/value pairs at a time. Essentially, you accumulate data in memory and sort it, and once you have enough items (or when you are done inserting), you write the data in order.
That way, you should be able to reduce disk accesses, which are very slow. This is probably even more important if you are using a traditional hard drive, as moving the head is slow (in which case it might be useful to defragment it). Also, be sure that your hard drive has more than enough free space.
In some cases, local caching (in your application) might also be helpful, particularly if you are aware of how your data is used.
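A rough sketch of that batching idea, assuming a HashFile class that wraps the insert member function from your question (the class name and batch handling are illustrative):
#include <algorithm>
#include <string>
#include <vector>

struct Pending { long hash; std::string key; long value; };

// Accumulate pairs in memory, sort them by target block (their hash), then insert in
// that order so consecutive inserts touch neighbouring parts of the file.
void insert_batch(HashFile &table, std::vector<Pending> &batch)
{
    std::sort(batch.begin(), batch.end(),
              [](const Pending &a, const Pending &b) { return a.hash < b.hash; });
    for (const Pending &p : batch)
        table.insert(p.key.c_str(), p.value);
    batch.clear();
}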
File size VS collisions
When you use a hash, you want to find the sweet spot between file size and collisions. If you have too many collisions, you will waste a lot of time, and at some point the table may degenerate when it becomes hard to find a free place for almost every insertion.
On the other hand, if your file is really very large, you might end up filling your RAM with data that is mainly empty and still needing to replace it with data from the disk on almost every insertion.
For example, if your data is 20GB and you are able to load, say, 2GB in memory, then if inserts are really random, 90% of the time you might need real access to the hard drive.
Configuration
The options here depend on the OS, and that is beyond the scope of a programming forum. If you want to optimize your computer, you should look elsewhere.
Reading
It might be helpful to read about operating systems (file system, cache layer…) and algorithms (external sorting algorithms, B-Tree and other structures) to get a better understanding.
Alternatives
Extra RAM
Fast SSD
Multithreading (for ex. input and output threads)
Rewriting of the algorithm (for ex. to read/write a whole disk page at once)
Faster CPU / 64 bit computer
Using algorithms designed for such scenarios.
Using a database.
Profiling code
Tuning parameters

Write data from C++ Vector to text file fast

I would like to know the best way to write data from a vector<string> to a text file fast, as the data would involve a few million lines.
I have tried ofstream (<<) in C++ as well as fprintf in C, yet the performance difference between them is small, as I have recorded the time used to generate the required file.
vector<string> OBJdata;
OBJdata = assembleOBJ(pointer, vertexCount, facePointer);

FILE * objOutput;
objOutput = fopen("sample.obj", "wt");
for (size_t i = 0; i < OBJdata.size(); i++)
{
    // write the string's characters, not the string object itself
    fwrite(OBJdata[i].data(), 1, OBJdata[i].length(), objOutput);
}
fclose(objOutput);
There is no "best". There are only options with different advantages and disadvantages, both of which vary with your host hardware (e.g. writing to a high performance drive will be faster than a slower on), file system, and device drivers (implementation of disk drivers can trade-off performance to increase chances of data being correctly written to the drive).
Generally, however, manipulating data in memory is faster than transferring it to or from a device like a hard drive. There are limitations on this as, with virtual memory, data in physical memory may be transferred in some circumstances to virtual memory - on disk.
So, assuming you have sufficient RAM and a fast CPU, an approach like
// assume your_stream is an object of type derived from ostream
// THRESHOLD is a large-ish positive integer
std::string buffer;
buffer.reserve(THRESHOLD);
for (std::vector<string>::const_iterator i = yourvec.begin(), end = yourvec.end(); i != end; ++i)
{
    if (buffer.length() + i->length() + 1 >= THRESHOLD)
    {
        your_stream << buffer;
        buffer.resize(0);
    }
    buffer.append(*i);
    buffer.append(1, '\n');
}
your_stream << buffer;
The strategy here is reducing the number of distinct operations that write to the stream. As a rule of thumb, a larger value of THRESHOLD will reduce the number of distinct output operations, but will also consume more memory, so there is usually a sweet spot somewhere in terms of performance. The problem is, that sweet spot depends on the factors I mentioned above (hardware, file system, device drivers, etc). So this approach is worth some effort to find the sweet spot only if you KNOW the exact hardware and host system configuration your program will run on (or you KNOW that the program will only be executed in a small range of configurations). It is not worth the effort if you don't know these things, since what works with one configuration will often not work for another.
Under windows, you might want to use win API functions to work with the file (CreateFile(), WriteFile(), etc) rather than C++ streams. That might give small performance gains, but I wouldn't hold my breath.
You may want to take a look at writev that allows you to write multiple elements at once - thus taking better advantage of the buffering. See: http://linux.die.net/man/2/writev
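For what it's worth, a minimal sketch of the writev() approach on POSIX systems; the batch size is arbitrary, and a real implementation would have to check the return value for partial writes:
#include <cstddef>
#include <string>
#include <sys/uio.h>
#include <unistd.h>
#include <vector>

// Gather up to IOV_BATCH strings per writev() call, so many lines go out in one syscall.
void write_all(int fd, const std::vector<std::string> &lines)
{
    const std::size_t IOV_BATCH = 64;   // arbitrary; systems cap iovcnt at IOV_MAX
    std::vector<iovec> iov;
    iov.reserve(IOV_BATCH);
    for (std::size_t i = 0; i < lines.size(); ++i) {
        iovec v;
        v.iov_base = const_cast<char *>(lines[i].data());
        v.iov_len  = lines[i].size();
        iov.push_back(v);
        if (iov.size() == IOV_BATCH || i + 1 == lines.size()) {
            writev(fd, iov.data(), (int)iov.size());   // should also handle short writes
            iov.clear();
        }
    }
}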

Memory with Vectors and Pointers in MPI+OpenMP App?

I have been reading about vectors with respect to memory allocation and trying to work through some hybrid parallelized code that seems to be heavily chewing through memory unexpectedly. The program used to use only OpenMP so was limited in processing the data on a per node basis, but code was recently added to utilize MPI as well. The data itself is of a fixed size for the current problem (15GB) and is distributed equally amongst all of the MPI processes.
The program overall uses vectors almost everywhere, but prior to the MPI code it was able to process much, much larger amounts of data without running out of memory.
The current data it is trying to process is only 15GB in size, but any attempt to run is riddled with bad_alloc errors, even when requesting nodes with >64GB of memory.
I am shortening down the code to explain the critical points and adding some inline comments. It is mixed MPI+openmp.
Here are my questions:
If I am running out of memory on a per-node basis, does that definitively mean that it's the OpenMP threads that are causing the problem?
Could it be that growing these vectors on the fly is causing the memory issue? It seems like the vectors are only used for storing data and then for the allocation of other memory; could I get away with using a std::list instead?
The vectors are never used after this step, but that memory remains allocated. I could certainly free the memory associated with them, but these vectors were not previously a problem when they ran only on a single node over a series of ~50 hours on the full set of data (not a smaller chunk). Is there anything about the MPI code below that could be causing it to ask for too much memory in some way? Previously, the mallocs for the pointers were not in the code.
svec<int> ind_leftFront;
svec<int> ind_rightFront;
svec<int> ind_leftRC;
svec<int> ind_rightRC;
svec<int> ind_toasted;
// For additional examples, I will be using only leftFront and rightFront but all of these have similar behavior

/* Searching through previously created [outside of function] vectors for things + some math */
#pragma omp critical
{
    // first use of declared vectors
    ind_leftFront.push_back(i);
    ind_rightFront.push_back(c);
}
/* Other push_backs into the other vector structures */

int* all_indexFront_sizes;
int* all_IndexFront_left;
int* all_IndexFront_right;
int* leftFront;
int* rightFront;
int numFront;

numFront = (int)ind_leftFront.size();
leftFront = (int*)malloc(sizeof(int)*numFront);
rightFront = (int*)malloc(sizeof(int)*numFront);
for(int count=0; count<numFront; count++) {
    leftFront[count] = ind_leftFront[count];
    rightFront[count] = ind_rightFront[count];
}

all_indexFront_sizes = (int*)malloc(sizeof(int)*numranks);
MPI_Allgather(&numFront,1,MPI_INT,all_indexFront_sizes,1,MPI_INT,MPI_COMM_WORLD);

int* Front_displs;
Front_displs = (int*)malloc(sizeof(int)*numranks);
Front_displs[0] = 0;
for(int count=1;count<numranks;count++)
    Front_displs[count] = Front_displs[count-1] + all_indexFront_sizes[count-1];

// total_indexFront is the sum of the per-rank counts
int total_indexFront = 0;
for(int count=0;count<numranks;count++)
    total_indexFront += all_indexFront_sizes[count];

all_IndexFront_left  = (int*)malloc(sizeof(int)*total_indexFront);
all_IndexFront_right = (int*)malloc(sizeof(int)*total_indexFront);
MPI_Allgatherv(leftFront, numFront,MPI_INT,all_IndexFront_left ,all_indexFront_sizes, Front_displs, MPI_INT,MPI_COMM_WORLD);
MPI_Allgatherv(rightFront,numFront,MPI_INT,all_IndexFront_right,all_indexFront_sizes, Front_displs, MPI_INT,MPI_COMM_WORLD);

free(Front_displs);
free(all_indexFront_sizes);
free(all_IndexFront_left);
free(all_IndexFront_right);
free(leftFront);
free(rightFront);
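On the point about freeing the index vectors: if svec has vector-like swap semantics, the usual way to actually release a vector's capacity is the swap (or shrink_to_fit) idiom, sketched here with std::vector as an assumption about svec's interface:
#include <vector>

void release_capacity(std::vector<int> &v)
{
    std::vector<int>().swap(v);   // swapping with an empty temporary frees the old buffer
    // In C++11 and later, v.clear(); v.shrink_to_fit(); expresses the same intent.
}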

How to increase speed of reading data on Windows using c++

I am reading block of data from volume snapshot using CreateFile/ReadFile and buffersize of 4096 bytes.
The problem I am facing is that ReadFile is too slow: I am able to read 68439 blocks, i.e. 267 MB, in 45 seconds. How can I increase the speed? Below is the part of my code that I am using:
block_handle = CreateFile(block_file,GENERIC_READ,FILE_SHARE_READ,0,OPEN_EXISTING,FILE_FLAG_SEQUENTIAL_SCAN,NULL);
if(block_handle != INVALID_HANDLE_VALUE)
{
    DWORD pos = -1;
    for(ULONG i = 0; i < 68439; i++)
    {
        sectorno = (i*8);
        distance = sectorno * sectorsize;
        phyoff.QuadPart = distance;
        if(pos != phyoff.u.LowPart)
        {
            pos = SetFilePointer(block_handle, phyoff.u.LowPart, &phyoff.u.HighPart, FILE_BEGIN);
            if (pos == INVALID_SET_FILE_POINTER && GetLastError() != NO_ERROR) // check the returned position, not LowPart
            {
                printf("SetFilePointer Error: %lu\n", GetLastError());
                phyoff.QuadPart = -1;
                return;
            }
        }
        ret = ReadFile(block_handle, data, 4096, &dwRead, 0);
        if(ret == FALSE)
        {
            printf("Error Read");
            return;
        }
        pos += 4096;
    }
}
Should I use the OVERLAPPED structure, or what else could be a possible solution?
Note: The code is not threaded.
Awaiting a positive response.
I'm not quite sure why you're using these extremely low level system functions for this.
Personally I have used C-style file operations (using fopen and fread) as well as C++-style operations (using fstream and read) to read raw binary files. From a local disk the read speed is on the order of 100MB/second.
In your case, if you don't want to use the standard C or C++ file operations, my guess is that the reason your code is slower is due to the fact that you're performing a seek after each block. Do you really need to call SetFilePointer for every block? If the blocks are sequential you shouldn't need to do this.
Also, experiment with different block sizes, don't be afraid to go up and beyond 1MB.
Your problem is the fragmented data reads. You cannot solve this by fiddling with the ReadFile parameters. You need to defragment your reads. Here are three approaches:
Defragment the data on the disk
Defragment the reads. That is, collect all the reads you need, but do not read anything yet. Sort the reads into order, then read everything in order, skipping the SetFilePointer call wherever possible (i.e. for sequential blocks); see the sketch below. This will speed up the total read greatly, but introduces a lag before the first read starts.
Memory map the data. Copy ALL the data into memory and do random access reads from memory. Whether or not this is possible depends on how much data there is in total.
Also, you might want to get fancy, and experiment with caching. When you read a block of data, it might be that although the next read is not sequential, it may well have a high probability of being close by. So when you read a block, sequentially read an enormous block of nearby data into memory. Before the next read, check if the new read is already in memory - thus saving a seek and a disk access. Testing, debugging and tuning this is a lot of work, so I do not really recommend it unless this is a mission critical optimization. Also note that your OS and/or your disk hardware may already be doing something along these lines, so be prepared to see no improvement whatsoever.
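A minimal sketch of the second approach; the PendingRead structure, names, and error handling are illustrative assumptions:
// Gather the (offset, length) pairs first, sort by offset, then issue the reads in
// ascending order so the disk head moves mostly forward.
#include <windows.h>
#include <algorithm>
#include <vector>

struct PendingRead { LONGLONG offset; DWORD length; char* dest; };

void run_reads(HANDLE h, std::vector<PendingRead>& reads)
{
    std::sort(reads.begin(), reads.end(),
              [](const PendingRead& a, const PendingRead& b) { return a.offset < b.offset; });

    LONGLONG pos = -1;
    for (const PendingRead& r : reads) {
        if (r.offset != pos) {                       // seek only when the next block is not adjacent
            LARGE_INTEGER li; li.QuadPart = r.offset;
            SetFilePointerEx(h, li, NULL, FILE_BEGIN);
        }
        DWORD got = 0;
        ReadFile(h, r.dest, r.length, &got, NULL);   // a real version should check both return values
        pos = r.offset + got;
    }
}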
If possible, read sequentially (and tell CreateFile you intend to read sequentially with FILE_FLAG_SEQUENTIAL_SCAN).
Avoid unnecessary seeks. If you're reading sequentially, you shouldn't need any seeks.
Read larger chunks (like an integer multiple of the typical cluster size). I believe Windows's own file copy uses reads on the order of 8 MB rather than 4 KB. Consider using an integer multiple of the system's allocation granularity (available from GetSystemInfo).
Read from aligned offsets (you seem to be doing this).
Read to a page-aligned buffer. Consider using VirtualAlloc to allocate the buffer.
Be aware that fragmentation of the file can cause expensive seeking. There's not much you can do about this.
Be aware that volume compression can make seeks especially expensive because it may have to decompress the file from the beginning to find the starting point in the middle of the file.
Be aware that volume encryption might slow things down. Not much you can do but be aware.
Be aware that other software, like anti-malware, may be scanning the entire file every time you touch it. Fewer operations will minimize this hit.
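A hedged sketch pulling a few of these points together; the 8 MB chunk size is illustrative, not a measured optimum, and error handling is trimmed:
// Sequential reads in large, page-aligned chunks through a VirtualAlloc'd buffer.
#include <windows.h>

void read_sequential(const wchar_t* path)
{
    HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    if (h == INVALID_HANDLE_VALUE) return;

    const DWORD CHUNK = 8 * 1024 * 1024;
    // VirtualAlloc returns memory aligned to the allocation granularity (page-aligned).
    char* buffer = (char*)VirtualAlloc(NULL, CHUNK, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

    DWORD got = 0;
    while (ReadFile(h, buffer, CHUNK, &got, NULL) && got > 0) {
        // consume `got` bytes of buffer here
    }

    VirtualFree(buffer, 0, MEM_RELEASE);
    CloseHandle(h);
}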

Concurrent Processing From File

Consider the following code:
std::vector<int> indices = /* Non overlapping ranges. */;
std::istream& in = /*...*/;

for(std::size_t i = 0; i < indices.size() - 1; ++i)
{
    in.seekg(indices[i]);
    std::vector<int> data(indices[i+1] - indices[i]);
    in.read(reinterpret_cast<char*>(data.data()), data.size()*sizeof(int));
    process_data(data);
}
I would like to make this code parallel and as fast as possible.
One way of parallelizing it using PPL would be:
std::vector<int> indices = /* Non overlapping ranges. */;
std::istream& in = /*...*/;
std::vector<concurrency::task<void>> tasks;

for(std::size_t i = 0; i < indices.size() - 1; ++i)
{
    in.seekg(indices[i]);
    std::vector<int> data(indices[i+1] - indices[i]);
    in.read(reinterpret_cast<char*>(data.data()), data.size()*sizeof(int));
    tasks.emplace_back(std::bind(&process_data, std::move(data)));
}
concurrency::when_all(tasks.begin(), tasks.end()).wait();
The problem with this approach is that I want to process the data (which fits into the CPU cache) on the same thread that read it into memory (where the data is hot in cache), which is not the case here; it simply wastes the opportunity to use hot data.
I have two ideas how to improve this, however, I have not been able to realize either.
Start the next iteration on a separate task.
/* ??? */
{
    in.seekg(indices[i]);
    std::vector<int> data(indices[i+1] - indices[i]); // data size will fit into CPU cache.
    in.read(reinterpret_cast<char*>(data.data()), data.size()*sizeof(int));
    /* Start a task that begins the next iteration? */
    process_data(data);
}
Use memory-mapped files: map the required region of the file and, instead of seeking, just read from the pointer at the correct offset. Process the data ranges using a parallel_for_each. However, I don't understand the performance implications of memory-mapped files in terms of when data is read into memory, and the cache considerations. Maybe I don't even have to consider the cache, since the file is simply DMA'd to system memory without ever passing through the CPU?
Any suggestions or comments?
It's most likely that you are pursuing the wrong goal.
As already noted, any advantage of 'hot data' will be dwarfed by disk speed. Beyond that, there are important details you didn't tell us:
1) Whether the file is 'big'
2) Whether a single record is 'big'
3) Whether the processing is 'slow'
If the file is 'big', your biggest priority is ensuring that the file is read sequentially. Your "indices" make me think otherwise. A recent example from my own experience: 6 seconds vs 20 minutes depending on sequential vs random reads. No kidding.
If the file is 'small' and you're positive that it is cached entirely, you just need a synchronized queue to deliver tasks to your threads; then it won't be a problem to process in the same thread.
The other approach is splitting 'indices' into halves, one for each thread, as in the sketch below.
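A sketch of that last suggestion: split the ranges between two threads, each with its own stream, so reading and processing stay on the same thread. process_data is the function from your question; the interpretation of indices as byte offsets is an assumption on my part:
#include <fstream>
#include <functional>
#include <string>
#include <thread>
#include <vector>

void process_data(const std::vector<int> &data);   // from the question

// Each thread owns its own ifstream (no shared seek position) and processes what it reads.
void process_half(const std::string &path, const std::vector<int> &indices,
                  std::size_t first, std::size_t last)
{
    std::ifstream in(path, std::ios::binary);
    for (std::size_t i = first; i < last; ++i) {
        in.seekg(indices[i]);
        std::vector<int> data((indices[i + 1] - indices[i]) / sizeof(int));  // assumes byte offsets
        in.read(reinterpret_cast<char *>(data.data()), data.size() * sizeof(int));
        process_data(data);    // processed on the thread that read it, while the data is hot
    }
}

void process_all(const std::string &path, const std::vector<int> &indices)
{
    std::size_t mid = (indices.size() - 1) / 2;
    std::thread t1(process_half, path, std::cref(indices), (std::size_t)0, mid);
    std::thread t2(process_half, path, std::cref(indices), mid, indices.size() - 1);
    t1.join();
    t2.join();
}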