I have one big file. It is a text file so I am reading one line at a time.
std::ifstream inFile( "big_file.txt" );
std::string line;
while( getline( inFile, line ) )
{
}
I want to distribute the lines that I read from 'big_file.txt' to several files. The file count depends on the number of cores available on the machine.
Edit: The target files might be on different physical devices, or the content might be sent to a different machine.
My (unsuccessful) attempt so far is as follows:
// list of writer objects each running in its own thread
std::vector<FileWriter> writers;
// create as many threads as there are cores
unsigned long const cores = boost::thread::hardware_concurrency();
for( unsigned long i = 0; i < cores; ++i)
{
    std::ostringstream ss;
    ss << i;
    FileWriter rt(ss.str());
    writers.push_back(rt);
}
Then, as I call getline(inFile, line), I want to send each line to the threads in a round-robin fashion. It does not really have to be round-robin; whatever method best distributes the work among the threads is fine.
I have run out of ideas.
Please suggest Boost and pre-C++11 STL approaches, as I don't have a complete C++11 environment yet.
Unless each new file is on a separate physical device, it is unlikely that there would be a performance gain simply by using multiple threads to write the individual files. This type of process will typically be I/O bound rather than CPU bound.
One important thing to make sure of is to use buffered I/O (which appears to be the case, since you show an ifstream). Without buffered I/O, the latency of writing individual lines to different files would be a huge bottleneck.
Edit: Given that the individual lines may be written to separate devices, there might be a performance gain from using multiple threads. If there is long latency (e.g., a network send call when sending to another machine via some mechanism), other threads could still be writing to other locations, so that would definitely help.
I might not completely understand the question, but then it would just make sense to use a thread pool. One possibility would be to use the threadpool library. I have not used it, but it seems to have a good reputation.
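A rough sketch of that pattern with Boost.Thread (pre-C++11): one small thread-safe queue per writer, filled round-robin by the reading loop. LineQueue and writeLines are illustrative names I made up for this sketch, not an existing library API.
// Rough sketch only: a minimal thread-safe queue plus round-robin distribution
// using Boost.Thread (pre-C++11). Names (LineQueue, writeLines) are illustrative.
#include <boost/thread.hpp>
#include <boost/bind.hpp>
#include <boost/shared_ptr.hpp>
#include <deque>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

class LineQueue {
public:
    LineQueue() : closed_(false) {}
    void push(const std::string& line) {
        boost::mutex::scoped_lock lock(mutex_);
        queue_.push_back(line);
        cond_.notify_one();
    }
    // Blocks until a line is available; returns false once closed and drained.
    bool pop(std::string& line) {
        boost::mutex::scoped_lock lock(mutex_);
        while (queue_.empty() && !closed_) cond_.wait(lock);
        if (queue_.empty()) return false;
        line = queue_.front();
        queue_.pop_front();
        return true;
    }
    void close() {
        boost::mutex::scoped_lock lock(mutex_);
        closed_ = true;
        cond_.notify_all();
    }
private:
    std::deque<std::string> queue_;
    boost::mutex mutex_;
    boost::condition_variable cond_;
    bool closed_;
};

// One writer per core: drain the queue and write each line to its own file.
void writeLines(LineQueue* queue, std::string fileName) {
    std::ofstream out(fileName.c_str());
    std::string line;
    while (queue->pop(line)) out << line << '\n';
}

void distribute(std::ifstream& inFile) {
    unsigned long cores = boost::thread::hardware_concurrency();
    if (cores == 0) cores = 2;                 // hardware_concurrency may return 0
    std::vector< boost::shared_ptr<LineQueue> > queues;
    boost::thread_group threads;
    for (unsigned long i = 0; i < cores; ++i) {
        std::ostringstream ss;
        ss << i;
        queues.push_back(boost::shared_ptr<LineQueue>(new LineQueue));
        threads.create_thread(boost::bind(&writeLines, queues.back().get(), ss.str()));
    }
    std::string line;
    unsigned long next = 0;
    while (std::getline(inFile, line)) {       // round-robin distribution
        queues[next]->push(line);
        next = (next + 1) % cores;
    }
    for (unsigned long i = 0; i < cores; ++i) queues[i]->close();
    threads.join_all();                        // wait for writers to drain their queues
}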
Related
Suppose a program has a caching mechanism where, at the end of some specific calculation, the program writes the output of that calculation to disk to avoid re-computing it later, when the program is re-run. It does so for a large number of calculations, and saves each output to a separate file (one per calculation, with filenames determined by hashing the computation parameters). The data is written to the file with standard C++ streams:
void* data = /* result of computation */;
std::size_t dataSize = /* size of the result in bytes */;
std::string cacheFile = /* unique filename for this computation */;
std::ofstream out(cacheFile, std::ios::binary);
out << dataSize;
out.write(static_cast<const char *>(data), dataSize);
The calculation is deterministic, hence the data written to a given file will always be the same.
Question: is it safe for multiple threads (or processes) to attempt this simultaneously, for the same calculation, and with the same output file? It does not matter if some threads or processes fail to write the file, as long as at least one succeeds, and as long as all programs are left in a valid state.
In the manual tests I ran, no program failure or data corruption occurred, and the file was always created with the correct content, but this may be platform-dependent. For reference, in our specific case, the size of the data ranges from 2 to 50 kilobytes.
is it safe for multiple threads (or processes) to attempt this simultaneously, for the same calculation, and with the same output file?
It is a race condition when multiple threads try to write into the same file, so you may end up with a corrupted file. There is no guarantee that ofstream::write is atomic, and whether it is depends on the particular filesystem.
The robust solution for your problem (it works with both multiple threads and multiple processes):
Write into a temporary file with a unique name in the destination directory (so that the temporary and the final file are on the same filesystem and rename does not have to move data).
rename the temporary file to its final name. This replaces the existing file if one is already there. The non-portable renameat2 is more flexible.
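A minimal sketch of that pattern, reusing the question's variables; the unique-suffix scheme is illustrative. Note that std::rename atomically replaces an existing destination on POSIX, while on Windows you would need MoveFileEx with MOVEFILE_REPLACE_EXISTING instead.
// Sketch of write-to-temp-then-rename; the uniqueId scheme is illustrative
// (e.g. derived from pid + thread id). POSIX rename replaces atomically.
#include <cstddef>
#include <cstdio>    // std::rename, std::remove
#include <fstream>
#include <sstream>
#include <string>

bool writeCacheFile(const std::string& cacheFile,
                    const void* data, std::size_t dataSize,
                    unsigned long uniqueId)
{
    std::ostringstream tmpName;
    tmpName << cacheFile << ".tmp." << uniqueId;   // same directory => same filesystem

    {
        std::ofstream out(tmpName.str().c_str(), std::ios::binary);
        out << dataSize;
        out.write(static_cast<const char*>(data), dataSize);
        if (!out) { std::remove(tmpName.str().c_str()); return false; }
    }   // stream closed (and flushed) before the rename

    if (std::rename(tmpName.str().c_str(), cacheFile.c_str()) != 0) {
        std::remove(tmpName.str().c_str());        // another writer may have won the race
        return false;
    }
    return true;
}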
It is possible to synchronise threads within the same process so that they take turns writing the file. However, this isn't possible between different processes, so it is better to avoid it. There isn't anything in the C++ standard library that you can use for that.
Operating systems do provide special functions for locking files that are guaranteed to be atomic (like lockf on Linux or LockFile(Ex) on Windows). You might like to check them out.
I've been coding a multi-threaded simulation that stores the outputs in files. So far, I've assigned one file per core (with an ofstream myfiles[NUMBER_OF_CORES]) from the beginning, but it's a bit messy as I'm working with several computers having 20+ cores. I've been doing that to avoid the overhead of using one file, but could I use something like a stream per core and, in the end, use something like:
for(int i =0; i < NUMBER_OF_CORES; i++){
    myfile << CORE_STREAM[i];
}
starting with a CORE_STREAM[NUMBER_OF_CORES] array? I've never manipulated streams in this way. Which class should I construct this from if it exists?
You could use an ostringstream to store intermediate results in memory. Like ofstream, it implements the ostream interface, so your existing code will probably work as-is.
To dump one stream onto another, you'd do myfile << core_stream[i].rdbuf() (rdbuf = read buffer).
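A minimal sketch of that pattern; it uses std::stringstream rather than std::ostringstream so that rdbuf() has a readable buffer (with a plain ostringstream you would stream .str() instead). The file name and core count are placeholders.
// One in-memory stream per core; each thread writes only to its own stream,
// so no locking is needed. At the end, dump every buffer into the single file.
#include <fstream>
#include <sstream>

const int NUMBER_OF_CORES = 20;                  // placeholder value
std::stringstream core_stream[NUMBER_OF_CORES];

void simulate(int core) {
    core_stream[core] << "result for core " << core << '\n';
}

void flushToFile() {
    std::ofstream myfile("results.txt");
    for (int i = 0; i < NUMBER_OF_CORES; ++i)
        myfile << core_stream[i].rdbuf();        // dump the whole per-core buffer
}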
Have you considered using a ZMQ pipeline? Your simulation threads could write to a ZMQ_PUSH socket (see zmq_socket(3)), and whatever writes the file (another thread or process; ZMQ doesn't care) could read from a ZMQ_PULL socket. That way your simulation threads can potentially get out of doing any blocking I/O without staging results in memory. I can't imagine working on a distributed computing project these days and not using ZMQ.
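For illustration, a bare-bones PUSH/PULL sketch with the plain C API; the endpoint name and buffer handling are illustrative, and the actual file writing is left as a comment.
// Sketch of a ZMQ PUSH/PULL pipeline over inproc (same context for all sockets).
#include <zmq.h>
#include <string>

void* ctx = zmq_ctx_new();

// Writer: binds a PULL socket and appends everything it receives to the file.
void writerThread() {
    void* pull = zmq_socket(ctx, ZMQ_PULL);
    zmq_bind(pull, "inproc://sim-results");      // bind before the PUSH sockets connect
    char buf[4096];
    for (;;) {
        int n = zmq_recv(pull, buf, sizeof buf, 0);
        if (n < 0) break;                        // error or shutdown; sketch only
        // append buf[0..n) to the output file here
    }
    zmq_close(pull);
}

// Simulation thread: connects a PUSH socket and sends each result line.
void simulationThread() {
    void* push = zmq_socket(ctx, ZMQ_PUSH);
    zmq_connect(push, "inproc://sim-results");
    std::string line = "some result\n";
    zmq_send(push, line.data(), line.size(), 0);
    zmq_close(push);
}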
I have written a program (using FFTW) to perform Fourier transforms of some data files written in OpenFOAM.
The program first finds the paths to each data file (501 files in my current example), then splits the paths between threads, such that thread0 gets paths 0-61, thread1 gets 62-123 or so, etc., and then runs the remaining files serially at the end.
I have implemented timers throughout the code to try to see where it bottlenecks, since run serially each file takes around 3.5 s, while 8 files in parallel take around 21 s (a reduction from the 28 s of 8 × 3.5 s serial time, but not by much).
The problematic section of my code is below
if (DIAG_timers) {readTimer = timerNow();}
for (yindex=0; yindex<ycells; yindex++)
{
    for (xindex=0; xindex<xcells; xindex++)
    {
        getline(alphaFile, alphaStringValue);
        convertToNumber(alphaStringValue, alphaValue[xindex][yindex]);
    }
}
if (DIAG_timers) {endTimerP(readTimer, tid, "reading value and converting", false);}
Here, timerNow() returns the clock value, and endTimerP calculates the time that has passed in ms. (The remaining arguments relate to it running in a parallel thread, to avoid outputting 8 lines for each loop etc, and a description of what the timer measures).
convertToNumber takes the value in alphaStringValue and converts it to a double, which is then stored in the alphaValue array.
alphaFile is a std::ifstream object, and alphaStringValue is a std::string which stores the text on each line.
The files to be read are approximately 40 MB each (just over 5,120,000 lines, each containing only one value between 0 and 1, in most cases exactly 0 or 1), and I have 16 GB of RAM, so copying all the files to memory would certainly be possible, since only 8 (one per thread) should be open at once. I am unsure whether mmap would do this better. Several threads on Stack Overflow argue about the merits of mmap vs. more straightforward read operations, in particular for sequential access, so I don't know whether that would be beneficial.
I tried surrounding the code block with a mutex so that only one thread could run the block at once, in case reading multiple files was leading to slow I/O via vaguely random access, but that just reduced the process to roughly serial speed.
Any suggestions allowing me to run this section more quickly, possibly via copying the file, or indeed anything else, would be appreciated.
Edit:
template<class T> inline void convertToNumber(std::string const& s, T &result)
{
    std::istringstream i(s);
    T x;
    if (!(i >> x))
        throw BadConversion("convertToNumber(\"" + s + "\")");
    result = x;
}
turns out to have been the slow section. I assume this is due to the creation of 5 million stringstreams per file, followed by the testing of 5 million if conditions. Replacing it with TonyD's suggestion presumably removes the possibility of catching an error, but saves a vast number of operations that are (at least in this controlled case) unnecessary.
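For reference, a stringstream-free conversion for the double case might look like the sketch below (using std::strtod; whether this matches TonyD's actual suggestion is an assumption on my part).
// Sketch of a stringstream-free conversion for the double case, using
// std::strtod; no per-value stream construction and no error checking.
#include <cstdlib>
#include <string>

inline void convertToNumber(const std::string& s, double& result)
{
    result = std::strtod(s.c_str(), 0);
}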
The files to be read are approximately 40 MB each (just over 5,120,000 lines, each containing only one value between 0 and 1, in most cases exactly 0 or 1), and I have 16 GB of RAM, so copying all the files to memory would certainly be possible,
Yes. But loading them there will still count towards your process's wall-clock time unless they were already read by another process shortly before.
since only 8 (1 per thread) should be open at once.
Since any files that were not loaded into memory before the process started will have to be loaded, and that loading will count towards the process's wall-clock time, it does not matter how many are open at once. Any that are not cached will slow down the process.
I am unsure if mmap would do this better?
No, it wouldn't. mmap is faster, but only because it saves the copy from the kernel buffer to the application buffer and some system-call overhead (with read you make a kernel entry for each read call, while with mmap pages brought in by read-ahead won't cause further page faults). But it will not save you the time needed to read the files from disk if they are not already cached.
mmap does not load anything into memory. The kernel loads data from disk into its internal buffers, the page cache. read copies the data from there to your application buffer, while mmap exposes parts of the page cache directly in your address space. In either case the data are fetched on first access and remain there until the memory manager drops them to reuse the memory. The page cache is global, so if one process causes some data to be cached, the next process will get them faster. But if it is the first access after a long time, the data will have to be read from disk, and this affects read and mmap exactly the same way.
Since parallelizing the process didn't improve the time much, it seems the majority of the time is the actual I/O. So you can optimize a bit more, and mmap can help, but don't expect much. The only way to improve I/O time is to get a faster disk.
You should be able to ask the system how much time was spent on the CPU and how much was spent waiting for data (I/O) using getrusage(2) (call it at the end of each thread to get data for that thread). So you can confirm how much time was spent on I/O.
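A minimal sketch of that measurement, assuming Linux (RUSAGE_THREAD is Linux-specific); comparing the reported CPU time against the thread's wall-clock time shows roughly how long it spent blocked on I/O.
// Per-thread CPU time via getrusage; RUSAGE_THREAD is Linux-specific.
#include <sys/resource.h>
#include <cstdio>

void reportThreadCpuTime()
{
    struct rusage ru;
    if (getrusage(RUSAGE_THREAD, &ru) == 0) {
        double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
        double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
        std::printf("cpu time: %.3fs user, %.3fs system\n", user, sys);
    }
}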
mmap is certainly the most efficient way to get large amounts of data into memory. The main benefit here is that there is no extra copying involved.
It does, however, make the code slightly more complex, since you can't directly use the file I/O functions on an mmap'ed region (and the main benefit is sort of lost if you use the "m" mode of the stdio functions, as you are then getting at least one copy again). From past experiments I've made, mmap beats all other file-reading variants by some amount. How much depends on what proportion of the overall time is spent waiting for the disk, and how much is spent actually processing the file content.
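For illustration, a rough POSIX sketch of mapping a whole file read-only (error handling trimmed, function name illustrative):
// Map an entire file read-only; parse it in place, then munmap() when done.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>

const char* mapFile(const char* path, std::size_t& length)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return 0; }
    length = st.st_size;
    void* p = mmap(0, length, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                                 // the mapping stays valid after close
    if (p == MAP_FAILED) return 0;
    return static_cast<const char*>(p);
}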
I just spent quite some time trying to get this loop parallelized with OpenMP, but for 2 threads it doubles the wall time! Am I missing something important?
The overall task is to read a big file (~1 GB) in parallel: an ifstream is divided into several string buffers, and these are used to insert the data into Symbol structs. Up to here everything is fast. Giving the loop private variables str and locVec to operate on doesn't change anything either.
vector<string> strbuf; // filled from ifstream
vector< vector <Symbol> > symVec; // to be filled
#pragma omp parallel for num_threads(2) default(none) shared(strbuf, symVec)
for (int i=0; i<2; i++)
{
    string str = strbuf[i];
    std::stringstream ss(str);
    // no problem until here
    // this is where it slows down:
    vector<Symbol> locVec;
    std::copy(std::istream_iterator<Symbol>(ss), std::istream_iterator<Symbol>(), std::back_inserter(locVec));
    symVec[i] = locVec;
}
EDIT:
Sorry for being imprecise, but the file content is already read in sequentially and divided into the strbufs at this point. The file is closed. Within the loop there is no file access.
It's much better to do sequential I/O on a file than I/O at different sections of a file. The latter essentially boils down to causing a lot of seeks on the underlying device (I'm assuming a disk here). It also increases the number of underlying system calls required to read the file into said buffers. You're better off using one thread to read the file in its totality sequentially (maybe mmap() with MAP_POPULATE) and assigning the processing to different threads.
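For concreteness, this is roughly the structure being suggested; Symbol, strbuf and symVec are assumed from the question, and the chunk-splitting step is elided.
// One sequential read of the whole file, then purely in-memory parsing in parallel.
#include <algorithm>
#include <fstream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

void readThenParse(const char* path,
                   std::vector<std::string>& strbuf,              // one chunk per thread
                   std::vector< std::vector<Symbol> >& symVec)
{
    std::ifstream in(path, std::ios::binary);
    std::ostringstream whole;
    whole << in.rdbuf();                       // single sequential read of the file
    // ... split whole.str() at suitable record boundaries into strbuf[0..n) ...

    #pragma omp parallel for shared(strbuf, symVec)
    for (int i = 0; i < (int)strbuf.size(); i++) {
        std::istringstream ss(strbuf[i]);      // parsing is CPU-only work here
        std::vector<Symbol> locVec;
        std::copy(std::istream_iterator<Symbol>(ss), std::istream_iterator<Symbol>(),
                  std::back_inserter(locVec));
        symVec[i].swap(locVec);
    }
}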
Another option is to use calls such as aio_read() to handle reading in different sections if for some reason you really do not want to read the file all at once.
Without all the code I cannot be completely sure, but remember that simply opening a file does not guarantee that its content is in memory, and reading from the file will cause page faults that then cause the actual file contents to be read. So even if you're not explicitly trying to read from the file using read/write, the OS will take care of that for you.
I would like to know whether there might be any possibility of a performance gain on file reads by using OpenMP.
Example code:
fstream file;
file.open("test.txt",ios::in);
file.seekg(0,ios::end);
int len = file.tellg();
char *arr = new char[len];
char *temp = new char[1];
int i;
#pragma omp parallel for shared(arr, len) private(temp, i)
for(i = 0; i < len; i++)
{
    file.seekg(i);
    file.read(temp,1);
    arr[i] = temp[0];
}
I guess using multiple threads for an I/O operation is a bad option because the file read operations will ultimately be serialized. But still, I would like to know whether one can expect a performance gain. Moreover, I would also like to know how OpenMP handles parallel file read operations.
As you mentioned, you're not likely to get any speedup parallelizing any sort of I/O bound task like this. However, there is a much bigger problem. The code isn't even correct.
The seekg() and read() methods modify the file variable, so your iterations aren't independent and you will have race conditions on the stream. In other words, the loop isn't parallelizable.
So don't expect that code to work at all - let alone with better performance.
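For comparison, once the per-character seeks are dropped, the whole loop collapses to a single sequential read, roughly like the sketch below (no OpenMP involved).
// Single buffered, sequential read of the whole file into the array.
#include <fstream>

std::ifstream file("test.txt", std::ios::in | std::ios::binary);
file.seekg(0, std::ios::end);
std::streamsize len = file.tellg();
file.seekg(0, std::ios::beg);

char* arr = new char[len];
file.read(arr, len);           // one read instead of len seek/read pairs
// ... use arr, then delete[] arr ...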
Although there are lots of possible performance improvements for file streams, those you are proposing are not among them:
std::streambuf is stateful, and trying to access it simultaneously from multiple threads of execution will thoroughly mess it up.
Processing individual characters is essentially a worst case scenario for a contemporary processor. If you really end up doing it in parallel you'd have multiple processors messing with the same cache lines. This will actually dramatically degrade performance compared to a single thread of execution.
I don't know why people are so fond of using seeks: each seek essentially kills any current buffer and may cause a system call just to position the stream to a defined state. The key problem with seeking is that it sets the stream up for either reading or writing, depending on what the next operation is. Yes, the open mode may be taken into account, but it probably isn't.
If you want a fast approach to reading a file using std::ifstream, you should:
imbue() a std::locale which advertises not to do any conversion
open the file in std::ios::binary mode
skip trying to get what may be a wrong estimate on the size of the file (seeking to the end and hoping that this somehow gives you the number of characters in a file is futile)
read the file into a suitable std::ostream, e.g. std::ostringstream (if you can provide the destination buffer, you can use a faster output stream), using the output operator for stream buffers: out << in.rdbuf()
I don't see that concurrency would help you with reading a stream.
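A minimal sketch of the binary-open and rdbuf-copy steps above; the no-conversion locale from the first step is left as a commented placeholder, since the codecvt boilerplate is platform-dependent. The function name is illustrative.
// Read an entire file into a string via the stream buffers.
#include <fstream>
#include <sstream>
#include <string>

std::string slurp(const char* path)
{
    std::ifstream in(path, std::ios::binary);        // binary: no newline translation
    // in.imbue(std::locale(std::locale(), new NoConversionCodecvt)); // see step 1 above
    std::ostringstream out;
    out << in.rdbuf();                               // bulk copy via the stream buffers
    return out.str();
}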