I just spent quite some time trying to parallelize this loop with OpenMP, but with 2 threads it doubles the wall time! Am I missing something important?
The overall task is to read in a big file (~1 GB) in parallel: an ifstream is divided into several string buffers, and these are used to insert the data into Symbol structs. Up to this point everything is fast. Giving the loop private variables str and locVec to operate on doesn't change anything either.
vector<string> strbuf;                // filled from the ifstream beforehand
vector< vector<Symbol> > symVec(2);   // to be filled, one inner vector per buffer
#pragma omp parallel for num_threads(2) default(none) shared(strbuf, symVec)
for (int i = 0; i < 2; i++)
{
    string str = strbuf[i];
    std::stringstream ss(str);
    // no problem until here
    // this is where it slows down:
    vector<Symbol> locVec;
    std::copy(std::istream_iterator<Symbol>(ss), std::istream_iterator<Symbol>(),
              std::back_inserter(locVec));
    symVec[i] = locVec;
}
EDIT:
Sorry for being imprecise, but the file content has already been read in sequentially and divided into the strbufs at this point. The file is closed. There is no file access within the loop.
It's much better to do sequential I/O on a file than to do I/O at different sections of it. Reading from scattered positions essentially causes a lot of seeks on the underlying device (I'm assuming a disk here) and increases the number of system calls required to read the file into said buffers. You're better off using one thread to read the file in its totality sequentially (maybe mmap() with MAP_POPULATE) and assigning the processing to different threads.
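For illustration, here is a minimal sketch of that approach, assuming Linux (MAP_POPULATE is Linux-specific) and a placeholder file name; error handling is omitted:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    int fd = open("big_input.dat", O_RDONLY);
    struct stat sb;
    fstat(fd, &sb);

    // MAP_POPULATE asks the kernel to fault the whole file in up front,
    // so the worker threads don't stall on page faults later.
    const char *data = static_cast<const char*>(
        mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0));

    // ... hand out slices of [data, data + sb.st_size) to worker threads ...

    munmap(const_cast<char*>(data), sb.st_size);
    close(fd);
    return 0;
}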
Another option is to use calls such as aio_read() to handle reading in different sections if for some reason you really do not want to read the file all at once.
Without all the code I cannot be completely sure, but remember that simply opening a file does not guarantee that its contents are in memory. Reading from it will cause page faults, which in turn cause the actual file contents to be read; so even if you're not explicitly reading from the file with read/write calls, the OS takes care of that for you.
Related
What is the best solution if I just want to parallelize the loop while keeping the saving to file sequential, using OpenMP? I have a file with a large volume of information that I would like to split into equal chunks (16 bytes each) and encrypt using OpenMP (multithreaded programming in C++). After the encryption process completes, these chunks should be stored in a single file, in the same sequence as the original.
i_count = meta->subchunk2_size - meta->subchunk2_size % 16; // TO GET THE EXACT LENGTH, A MULTIPLE OF 16
// Get the number of processors in this system
int iCPU = omp_get_num_procs();
// Now set the number of threads
omp_set_num_threads(iCPU);
#pragma omp parallel for ordered
for( i = 0; i < i_count; i += 16 )
{
    fread(Store, sizeof(char), 16, Rfile);   // read
    ENCRYPT();
    #pragma omp ordered
    fwrite(Store, sizeof(char), 16, Wfile);  // write
}
The program is supposed to run the encryption in parallel while the saving to file stays sequential, but running the program shows that it actually works entirely in sequential order.
You're much better off reading the whole file into a buffer in one thread, processing the buffer in parallel without using ordered, and then writing the buffer in one thread. Something like this:
fread(Store, sizeof(char), i_count, Rfile);   // one sequential read
#pragma omp parallel for schedule(static)
for( i = 0; i < i_count; i += 16 ) {
    ENCRYPT(&Store[i]);   // ENCRYPT acts on 16 bytes at a time
}
fwrite(Store, sizeof(char), i_count, Wfile);  // one sequential write
If the file is too big to read in all at once, then do it in chunks in a loop. The main point is that the ENCRYPT function should be much slower than reading/writing the file. Otherwise, there is no point in using multiple threads anyway, because you can't really speed up file reads and writes with multiple threads.
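If you do go the chunked route, a sketch might look like this (CHUNK is an arbitrary multiple of 16 that I picked, and Store is assumed to have room for at least CHUNK bytes; error handling omitted):

const long CHUNK = 1 << 22;        // 4 MB per pass, a multiple of 16
long remaining = i_count;          // total payload, already a multiple of 16
while (remaining > 0)
{
    long n = remaining < CHUNK ? remaining : CHUNK;
    fread(Store, sizeof(char), n, Rfile);        // one sequential read
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i += 16)
        ENCRYPT(&Store[i]);                      // each 16-byte block is independent
    fwrite(Store, sizeof(char), n, Wfile);       // one sequential write
    remaining -= n;
}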
I've been coding a multi-threaded simulation that stores its outputs in files. So far I've assigned one file per core (with an ofstream myfiles[NUMBER_OF_CORES]) from the start, but it's getting messy as I'm working with several computers having 20+ cores. I've been doing that to avoid the overhead of contending on a single file, but could I instead keep a stream per core and, at the end, use something like:
for(int i =0; i < NUMBER_OF_CORES; i++){
myfile << CORE_STREAM[i];
}
starting from a CORE_STREAM[NUMBER_OF_CORES] array? I've never manipulated streams in this way. Which class should I construct these streams from, if such a thing exists?
You could use an ostringstream to store intermediate results in memory. Like ofstream, it implements the ostream interface, so your existing code will probably work as-is.
To dump one stream onto another, you'd do myfile << core_stream[i].rdbuf() (rdbuf = read buffer).
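A small sketch of how the pieces could fit together (the OpenMP loop here is just a stand-in for your worker threads, and the file name is made up):

#include <fstream>
#include <sstream>

static const int NUMBER_OF_CORES = 4;          // placeholder value

int main()
{
    std::ostringstream core_stream[NUMBER_OF_CORES];

    // each worker writes only to its own stream, so no locking is needed
    #pragma omp parallel for num_threads(NUMBER_OF_CORES)
    for (int i = 0; i < NUMBER_OF_CORES; ++i)
        core_stream[i] << "results from core " << i << '\n';

    // single-threaded merge at the end
    std::ofstream myfile("simulation.out");
    for (int i = 0; i < NUMBER_OF_CORES; ++i)
        myfile << core_stream[i].rdbuf();      // dumps each accumulated buffer
    return 0;
}

One thing to watch: operator<< with a stream buffer that yields no characters sets failbit on myfile, so if a core can end up producing no output you may want to skip its stream (or clear() the file stream afterwards).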
Have you considered using a ZMQ pipeline? Your simulation threads could write to a ZMQ_PUSH socket (see zmq_socket(3)) and whatever is writing to the file (another thread or process, ZMQ doesn't care) could read from a ZMQ_PULL socket. That way your simulation threads can potentially get out of doing any blocking IO without staging results in memory. I can't imagine working on a distributed computing project these days and not using ZMQ.
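If it helps to see the shape of it, here is a bare-bones sketch using the plain libzmq C API; the inproc endpoint name and the message are made up, and in your case each simulation thread would own a PUSH socket while the writer thread or process owns the PULL socket:

#include <zmq.h>
#include <cstdio>
#include <cstring>

int main()
{
    void *ctx  = zmq_ctx_new();

    void *pull = zmq_socket(ctx, ZMQ_PULL);    // writer side
    zmq_bind(pull, "inproc://results");

    void *push = zmq_socket(ctx, ZMQ_PUSH);    // one of these per simulation thread
    zmq_connect(push, "inproc://results");

    const char msg[] = "simulation output line\n";
    zmq_send(push, msg, strlen(msg), 0);       // hand the result to the writer

    char buf[256];
    int n = zmq_recv(pull, buf, sizeof(buf), 0);
    if (n > 0)
        fwrite(buf, 1, n, stdout);             // the real writer would fwrite to the file

    zmq_close(push);
    zmq_close(pull);
    zmq_ctx_destroy(ctx);
    return 0;
}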
I have one big file. It is a text file so I am reading one line at a time.
std::ifstream inFile( "big_file.txt" );
std::string line;
while( getline( inFile, line ) )
{
}
I want to distribute the lines that I read from 'big_file.txt' to several files. The file count depends on the number of cores available on the machine.
Edit: The target files might be on different physical devices, or the content might be sent to a different machine.
My (unsuccessful) attempt so far is as follows:
// list of writer objects each running in its own thread
std::vector<FileWriter> writers;
// create as many threads as there are cores
unsigned long const cores = boost::thread::hardware_concurrency();
for( unsigned long i = 0; i < cores; ++i)
{
std::ostringstream ss;
ss << i;
FileWriter rt(ss.str());
writers.push_back(rt);
}
Then, as I call getline(inFile, line), I want to be able to send each line to the threads in a round-robin fashion. It really does not have to be round-robin; whatever method best distributes the work among the threads is fine.
I have run out of ideas.
Please suggest solutions using Boost and the pre-C++11 STL, as I don't have a complete C++11 environment yet.
Unless each new file is on a separate physical device, it is unlikely that there would be a performance gain simply from using multiple threads to write the individual files. This type of process will typically be I/O bound rather than CPU bound.
One important thing to make sure of is that you use buffered I/O (which appears to be the case, since you show ifstream). Without buffered I/O, the latency of writing individual lines to different files would be a huge bottleneck.
Edit: Given that the individual lines may be written to separate devices, using multiple threads might yield a performance gain. If there is a long latency (e.g., on a network send call when sending to another machine via some mechanism), other threads could still be writing to other locations, so that would definitely help.
I might not completely understand the question, but then it would just make sense to use a thread pool. One possibility would be to use threadpool. I have not used it, but it seems to have a good reputation.
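If a sketch helps, here is one way to wire that up using Boost.Asio's io_service as the thread pool instead of the threadpool library (pre-C++11 friendly). FileWriter here is a simplified stand-in for the asker's class, and the strand-per-file detail is my addition so that writes to the same file are never interleaved:

#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/scoped_ptr.hpp>
#include <boost/shared_ptr.hpp>
#include <boost/thread.hpp>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

struct FileWriter // simplified stand-in for the asker's class
{
    explicit FileWriter(const std::string& name) : out(name.c_str()) {}
    void write(const std::string& line) { out << line << '\n'; }
    std::ofstream out;
};

void run_service(boost::asio::io_service* s) { s->run(); } // pool thread body

int main()
{
    boost::asio::io_service service;
    boost::scoped_ptr<boost::asio::io_service::work> work(
        new boost::asio::io_service::work(service)); // keeps run() from returning early

    unsigned long const cores = boost::thread::hardware_concurrency();

    // One writer and one strand per slot; a strand serializes the handlers
    // posted to it, so a given file is never written from two threads at once.
    std::vector<boost::shared_ptr<FileWriter> > writers;
    std::vector<boost::shared_ptr<boost::asio::io_service::strand> > strands;
    for (unsigned long i = 0; i < cores; ++i)
    {
        std::ostringstream ss;
        ss << i;
        writers.push_back(boost::shared_ptr<FileWriter>(new FileWriter(ss.str())));
        strands.push_back(boost::shared_ptr<boost::asio::io_service::strand>(
            new boost::asio::io_service::strand(service)));
    }

    boost::thread_group pool;
    for (unsigned long i = 0; i < cores; ++i)
        pool.create_thread(boost::bind(&run_service, &service));

    std::ifstream inFile("big_file.txt");
    std::string line;
    unsigned long next = 0;
    while (std::getline(inFile, line))
    {
        unsigned long slot = next++ % cores; // round-robin distribution
        strands[slot]->post(boost::bind(&FileWriter::write, writers[slot], line));
    }

    work.reset();     // let run() return once the queued writes drain
    pool.join_all();
    return 0;
}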
I would like to know whether there is any possibility of a performance gain on file reads by using OpenMP.
Example code:
fstream file;
file.open("test.txt",ios::in);
file.seekg(0,ios::end);
int len = file.tellg();
char *arr = new char[len];
char *temp = new char[1];
int i;
#pragma omp parallel for shared(arr, len) private(temp, i)
for(i = 0; i < len; i++)
{
file.seekg(i);
file.read(temp,1);
arr[i] = temp[0];
}
I guess using multiple threads for an I/O operation is a bad option because the file read will ultimately be serialized. But still, I would like to know whether one can expect a performance gain. Moreover, I would also like to know how OpenMP handles parallel file read operations.
As you mentioned, you're not likely to get any speedup parallelizing any sort of I/O bound task like this. However, there is a much bigger problem. The code isn't even correct.
The seekg() and read() methods modify the file variable, so your iterations aren't independent and you get race conditions on the stream. In other words, the loop isn't parallelizable.
So don't expect that code to work at all - let alone with better performance.
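For contrast, a minimal sketch of a correct (and sequential) way to get the file into the array, keeping the seekg/tellg size estimate from the question but doing a single buffered read instead of one seek+read per byte (error handling omitted):

#include <fstream>

int main()
{
    std::ifstream file("test.txt", std::ios::in | std::ios::binary);
    file.seekg(0, std::ios::end);
    std::streamsize len = file.tellg();
    file.seekg(0, std::ios::beg);

    char *arr = new char[len];
    file.read(arr, len);        // one call, no race conditions

    // ... use arr ...
    delete[] arr;
    return 0;
}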
Although there are lots of possible performance improvements for file streams, those you are proposing are not among them:
std::streambuf is stateful, and trying to access it simultaneously from multiple threads of execution will thoroughly mess it up.
Processing individual characters is essentially a worst-case scenario for a contemporary processor. If you really end up doing it in parallel, you'd have multiple processors messing with the same cache lines, which will actually degrade performance dramatically compared to a single thread of execution.
I don't know why people are so fond of using seeks: each seek essentially kills any current buffer and may cause a system call just to position the stream to a defined state. The key problem with seeking is that it sets the stream up either for reading or for writing, depending on what the next operation is. Yes, the open mode may be taken into account, but it probably isn't.
If you want a fast approach to reading a file using std::ifstream, you should:
imbue() a std::locale which advertises not to do any conversion
open the file in binary mode (std::ios_base::binary)
skip trying to get what may be a wrong estimate of the size of the file (seeking to the end and hoping that this somehow gives you the number of characters in the file is futile)
read the file into a suitable std::ostream, e.g. std::ostringstream (if you can provide the destination buffer you can use a faster output stream), using the output operator for stream buffers: out << in.rdbuf()
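Put together, a minimal sketch of those steps (interpreting "a locale which does no conversion" as the classic "C" locale, which is my assumption):

#include <fstream>
#include <locale>
#include <sstream>
#include <string>

std::string read_whole_file(const char *name)
{
    std::ifstream in;
    in.imbue(std::locale::classic());    // no code conversion in the file buffer
    in.open(name, std::ios_base::binary);

    std::ostringstream out;
    out << in.rdbuf();                   // bulk copy through the stream buffers
    return out.str();
}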
I don't see how concurrency would help you with reading a stream.
I normally use the method described in csv parser to read spreadsheet files. However, when reading a 64MB file which has around 40 columns and 250K rows of data, it takes about 4 minutes. In the original method, a CSVRow class is used to read the file row by row, and a private vector is used to store all the data in a row.
Several things to note:
I did reserve enough capacity for the vector, but it didn't help much.
I also need to create instances of some class when reading each line, but even when the code just reads in the data without creating any instances, it takes a long time.
The file is tab-delimited instead of comma-delimited, but I don't think it matters.
Since some columns in that file are not useful data, I changed the method to have a private string member to store all the data and then find the position of the (n-1)th and the nth delimiter to get the useful data (of course there are many useful columns). By doing so, I avoid some push_back operations, and cut the time to a little more than 2 minutes. However, that still seems too long to me.
Here are my questions:
Is there a way to read such a spreadsheet file more efficiently?
Shall I read the file by buffer instead of line by line? If so, how do I read by buffer and use the CSVRow class?
I haven't tried boost tokenizer; is that more efficient?
Thank you for your help!
It looks like you're being bottlenecked by I/O. Instead of reading the file line by line, read it in blocks of maybe 8 MB. Parse the block you read for records and determine whether the end of the block is a partial record. If it is, copy that partial last record from the block and prepend it to the next block. Repeat until the whole file is read. This way, for a 64 MB file you're only making 8 I/O requests. You can experiment with the block size to determine what gives the best trade-off between performance and memory usage.
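A sketch of that block-and-carry loop (the block size, file name, and the use of '\n' as the record terminator are assumptions, and error handling is omitted):

#include <cstdio>
#include <string>
#include <vector>

int main()
{
    const size_t BLOCK = 8 * 1024 * 1024;        // 8 MB reads
    std::vector<char> block(BLOCK);
    std::string carry;                           // partial record from the previous block

    FILE *f = fopen("data.tsv", "rb");
    size_t n;
    while ((n = fread(&block[0], 1, BLOCK, f)) > 0)
    {
        std::string chunk = carry;
        chunk.append(&block[0], n);

        size_t last_nl = chunk.rfind('\n');
        if (last_nl == std::string::npos)        // no complete record in this block
        {
            carry = chunk;
            continue;
        }
        carry.assign(chunk, last_nl + 1, std::string::npos);

        // ... parse chunk[0 .. last_nl] record by record here ...
    }
    if (!carry.empty())
    {
        // ... parse the final, unterminated record ...
    }
    fclose(f);
    return 0;
}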
If reading the whole data into memory is acceptable (and apparently it is), then I'd do this:
Read the whole file into a std::vector<char>
Populate a vector<vector<vector<char>::size_type> > which contains the start positions of all newline characters and commas in the data. These positions denote the start/end of each cell.
Some code sketch to demonstrate the idea:
vector<char> data;   // the whole file, read in as in step 1
vector<vector<vector<char>::size_type> > rows;
for ( vector<char>::size_type i = 0; i < data.size(); ++i ) {
    vector<vector<char>::size_type> currentRow;
    currentRow.push_back( i );
    while ( i < data.size() && data[i] != '\n' ) {
        if ( data[i] == ',' ) { // XXX consider comma at end of line
            currentRow.push_back( i );
        }
        ++i;   // advance to the next character
    }
    rows.push_back( currentRow );
}
// XXX consider files which don't end in a newline
Thus, you know the positions of all newlines and all commas, and you have the complete CSV data available as one contiguous memory block. So you can easily extract a cell's text like this:
// XXX error checking omitted for simplicity
string getCellText( int row, int col )
{
    // XXX needs handling for the last cell of a line
    const vector<char>::size_type start = rows[row][col];
    const vector<char>::size_type end = rows[row][col + 1];
    return string( data.begin() + start, data.begin() + end );
}
This article should be helpful.
In short:
1. Either use memory-mapped files OR read the file in 4 KB blocks to access the data. Memory-mapped files will be faster.
2. Try to avoid using push_back, std::string operations (like +), and similar routines from the STL within the parsing loop. They are nice, but they ALL use dynamically allocated memory, and dynamic memory allocation is slow. Anything that is frequently allocated dynamically will make your program slower. Try to preallocate all buffers before parsing. Counting all the tokens in order to preallocate memory for them shouldn't be difficult (see the sketch below).
3. Use a profiler to identify what causes the slowdown.
4. You may want to try to avoid using iostream's << and >> operators, and parse the file yourself.
In general, an efficient C/C++ parser implementation should be able to parse a 20 MB text file within 3 seconds.
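As a small illustration of point 2, counting the separators first makes it easy to reserve before the hot loop. This assumes the whole file is already in one contiguous std::string and that it is tab-delimited, as in the question; the function name is hypothetical:

#include <algorithm>
#include <string>
#include <vector>

void indexCells(const std::string &data, std::vector<size_t> &cellStarts)
{
    const size_t rows  = std::count(data.begin(), data.end(), '\n');
    const size_t cells = std::count(data.begin(), data.end(), '\t') + rows;
    cellStarts.reserve(cells);               // no reallocation while parsing

    size_t start = 0;
    for (size_t i = 0; i < data.size(); ++i)
    {
        if (data[i] == '\t' || data[i] == '\n')
        {
            cellStarts.push_back(start);     // record where each cell begins
            start = i + 1;
        }
    }
}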