Should we do nested goroutines? - concurrency

I'm trying to build a parser for a large number of files, and I can't find information about what might be called "nested goroutines" (maybe this is not the right name?).
Given a lot of files, each of them with a lot of lines, should I do:
for file in folder:
    go do1
def do1:
    for line in file:
        go do2
def do2:
    do_something
Or should I use only "one level" of goroutines, and do the following:
for file in folder:
    for line in file:
        go do_something
My question primarily targets performance.
Thanks for reading this far!

If you go through with the architecture you've specified, you have a good chance of running out of CPU, memory, etc., because you're going to be creating an arbitrary number of workers. I suggest instead going with an architecture that allows you to throttle via channels. For example:
In your main process feed the files into a channel:
for _, file := range folder {
    fileChan <- file
}
Then, in another goroutine, break the files into lines and feed those into a channel:
for {
    select {
    case file := <-fileChan:
        for _, line := range file {
            lineChan <- line
        }
    }
}
Then, in a third goroutine, pop out the lines and do what you will with them:
for {
    select {
    case line := <-lineChan:
        // process the line
    }
}
The main advantage of this is that you can create as many or as few goroutines as your system can handle and pass them all the same channels; whichever goroutine gets to the channel first will handle the work, so you're able to throttle the amount of resources you're using.
Here is a working example: http://play.golang.org/p/-Qjd0sTtyP

The answer depends on how processor-intensive the operation on each line is.
If the line operation is short-lived, definitely don't bother to spawn a goroutine for each line.
If it's expensive (think ~5 secs or more), proceed with caution: you may run out of memory. As of Go 1.4, spawning a goroutine allocates a 2048-byte stack, so for 2 million lines that is roughly 4 GB of RAM for the goroutine stacks alone. Consider whether it's worth allocating this memory.
In short, you will probably get the best results with the following setup:
for file in folder:
    go process_file(file)
If the number of files exceeds the number of CPUs, you're likely to have enough concurrency to mask the disk I/O latency involved in reading the files from disk.

Related

Query on boost interprocess::file_lock on NFS

We have an application which a user can run to generate some data at a user-specified path. This unique output data is generated with respect to one unique input data-set, which is provided by the user.
When we initially developed the application, we never anticipated that the number of unique input data-sets would be large (due to the nature of the application). We expected the number of unique input data-sets to be on the order of 10, whereas one user has 1000. So that particular user started 1000 jobs of our application on a grid, all writing data to the same path. Note: these 1000 jobs were not fired from within our application; rather, he spawned 1000 processes of our application on different machines.
This led to some collisions and data loss.
To guard against this, I am planning to add synchronization using boost::interprocess. This is what I am planning:
// usual processing of input data ...
boost::filesystem::path reportLockFilePath(boost::filesystem::system_complete(userDir));
reportLockFilePath.append("report.lock");
// if the lock file does not exist, create one
if (!boost::filesystem::exists(reportLockFilePath)) {
    boost::interprocess::named_mutex reportLockMutex(boost::interprocess::open_or_create, "report_mutex");
    boost::interprocess::scoped_lock< boost::interprocess::named_mutex > lock(reportLockMutex);
    std::ofstream lockStrm(reportLockFilePath.string().c_str());
    lockStrm << "## report lock file ##" << std::endl;
    lockStrm.flush();
}
boost::interprocess::file_lock reportFileLock(reportLockFilePath.string().c_str());
boost::interprocess::scoped_lock< boost::interprocess::file_lock > lock(reportFileLock);
// usual reporting code that we already have ...
Now, the questions are:
Is this correct synchronization for the problem at hand?
Will this synchronization scheme work when the jobs are on different machines and the path is on NFS?
If this is not going to work on NFS etc., what are the C++ alternatives? I prefer to avoid lower-level C functions, to avoid race conditions caused by the lock being held when one instance of execution crashes, etc.
I just removed the named-mutex part (it was causing problems on a few machines due to a permission issue, probably related to the umask issue discussed in this context in some other post) and replaced it with:
std::ofstream lockStrm(reportLockFilePath.string().c_str(), std::ios_base::app);
And it worked, at least in our internal testing.
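For completeness, here is a minimal sketch of the revised flow described above, with the named mutex removed and the lock file created via an append-mode ofstream (userDir and the actual reporting code are placeholders from the original post, so treat this as an illustration rather than production code):

#include <fstream>
#include <boost/filesystem.hpp>
#include <boost/interprocess/sync/file_lock.hpp>
#include <boost/interprocess/sync/scoped_lock.hpp>

void writeReport(const boost::filesystem::path& userDir)
{
    boost::filesystem::path reportLockFilePath(boost::filesystem::system_complete(userDir));
    reportLockFilePath /= "report.lock";

    // Opening in append mode creates the file if it is missing without truncating
    // an existing one, so no separate existence check or named mutex is needed.
    std::ofstream lockStrm(reportLockFilePath.string().c_str(), std::ios_base::app);
    lockStrm << "## report lock file ##" << std::endl;
    lockStrm.flush();

    // Serialize the report-writing step across processes via the lock file.
    boost::interprocess::file_lock reportFileLock(reportLockFilePath.string().c_str());
    boost::interprocess::scoped_lock<boost::interprocess::file_lock> lock(reportFileLock);

    // ... usual reporting code that writes into userDir goes here ...
}

Note that whether file_lock is honoured across machines still depends on the NFS server and client locking support, which is exactly the open question above.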

RAM consumption regarding cores

I am working on a Linux Kubuntu machine with 12 cores and 31 GB of available RAM.
I produce simulations which calculate functions over 4 dimensions (x, y, z, t).
I define my dimensions as arrays that I pass through numpy.meshgrid. So, for each point in time, I calculate the result for each (x, y, z) point. These are heavy calculations over heavy data.
First, I learned how to run it on only one core. It works well whatever the size of my "boxes" (x, y, z). Because I work a lot with Fourier transforms, I define x, y, z, t as powers of 2: 64, 128, 256, ...
I can go up to x = y = z = t = 512 without difficulty, even if it takes a long time to run (which makes sense). When I do that, I use around 20-30% of the computer's available RAM. Great.
Then I wanted to use more cores, so I implemented this code:
import multiprocessing as mp
pool = mp.Pool(processes=8)
results = [pool.apply_async(conv_green, args=(tstep, S_, )) for tstep in t]
Here I ask my script to use 8 cores, and define my results as calls to the function conv_green with the args (tstep, S_) for each tstep in t.
It works pretty well and uses 8 cores as expected, BUT I can no longer run simulations that use values of 512 or above for x, y, z, t.
This is where my problem is. Technically, switching from the single-core version to the multi-core one changed nothing in the routine of my calculations. I do not understand why I have enough RAM for 512 on one core and why, suddenly, when I switch to multiple cores, the computer does not even want to launch it (the error occurs at the "results = pool.apply..." line).
So if you know how this works and why I hit this "threshold", thanks for helping me sort it out!
Best regards.
PS: this is the error that pops out when it crashes with 512 on multiple cores:
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python2.7/dist
packages/spyderlib/widgets/externalshell/sitecustomize.py", line 540, in runfile
execfile(filename, namespace)
File "/home/alexis/Heat/Simu⁄Lecture Propre/Test Tkinter/Simulation N spots SCAN Tkinter.py", line 280, in
XYslice = array([p.get()[0] for p in results])
File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
raise self._value
SystemError: NULL result without error in PyObject_Call
For multiprocessing in any language, each worker needs private storage which it can write to without interference from the other workers. As soon as interference is possible, the data structure has to be locked, which (in the worst case) takes us back to single threading.
It would appear that your large data structure is being copied for each of the worker processes, effectively multiplying your memory usage by eight when you use eight processes — on the order of 160-240% of your available RAM, given the 20-30% you report for a single core.
The best solution would be to prevent the unnecessary copying.
If that's not feasible, then all you can do is limit the number of processes it can run on; four should be OK in your case, but make sure your machine has lots of swap space. The swap space also gives you some room to let virtual memory exceed physical RAM; if the "working set" is small enough, you may be able to significantly exceed your physical RAM given enough swap.

Parse very large CSV files with C++

My goal is to parse large CSV files with C++ in a Qt project in an OS X environment.
(When I say CSV I mean TSV and other variants, 1 GB ~ 5 GB.)
It seems like a simple task, but things get complicated when file sizes grow. I don't want to write my own parser because of the many edge cases related to parsing CSV files.
I have found various CSV-processing libraries to handle this job, but parsing a 1 GB file takes about 90 ~ 120 seconds on my machine, which is not acceptable. I am not doing anything with the data right now; I just process and discard it for testing purposes.
cccsvparser is one of the libraries I have tried, but the only library fast enough was fast-cpp-csv-parser, which gives acceptable results: 15 seconds on my machine. However, it works only when the file structure is known.
Example using: fast-cpp-csv-parser
#include "csv.h"
int main(){
io::CSVReader<3> in("ram.csv");
in.read_header(io::ignore_extra_column, "vendor", "size", "speed");
std::string vendor; int size; double speed;
while(in.read_row(vendor, size, speed)){
// do stuff with the data
}
}
As you can see, I cannot load arbitrary files and I must specifically define variables to match my file structure. I'm not aware of any method that allows me to create those variables dynamically at runtime.
The other approach I have tried is to read the CSV file line by line with the fast-cpp-csv-parser LineReader class, which is really fast (about 7 seconds to read the whole file), and then parse each line with the cccsvparser library, which can process strings. This takes about 40 seconds in total, an improvement compared to the first attempts, but still unacceptable.
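For reference, that two-stage approach looks roughly like this (io::LineReader and next_line() are from fast-cpp-csv-parser, as far as I can tell from its header, and parseLine() stands in for the cccsvparser call):

#include "csv.h"

int main()
{
    io::LineReader in("ram.csv");
    while (char* line = in.next_line()) {
        (void)line;   // replace with e.g. parseLine(line) from the string-based parser
    }
}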
I have seen various Stack Overflow questions related to CSV file parsing, but none of them takes large-file processing into account.
Also, I spent a lot of time googling for a solution to this problem, and I really miss the freedom that package managers like npm or pip offer when searching for out-of-the-box solutions.
I will appreciate any suggestions about how to handle this problem.
Edit:
When using #fbucek's approach, processing time was reduced to 25 seconds, which is a great improvement.
Can we optimize this even more?
I am assuming you are using only one thread.
Multithreading can speed up your process.
The best result so far is 40 seconds, so let's stick to that.
I have assumed that you first read and then process (about 7 seconds to read the whole file):
7 sec for reading
33 sec for processing
First of all, you can divide your file into chunks of, say, 50 MB.
That means you can start processing after reading the first 50 MB of the file. You do not need to wait until the whole file has been read.
That's 0.35 sec for reading the first chunk (so now it is 0.35 + 33 seconds of processing ≈ 34 seconds).
When you use multithreading, you can process multiple chunks at a time. That can theoretically speed up processing by up to the number of cores you have. Let's say you have 4 cores.
That's 33/4 = 8.25 sec.
I think that with 4 cores you can get the total down to about 9 seconds.
Look at QThreadPool and QRunnable or QtConcurrent.
I would prefer QThreadPool.
Divide the task into parts:
First, try to loop over the file and divide it into chunks, doing nothing else with them.
Then create a "ChunkProcessor" class which can process a chunk.
Make "ChunkProcessor" a subclass of QRunnable and execute your processing in the reimplemented run() function.
Once you have chunks, a class which can process them, and that class is QThreadPool-compatible, you can pass it into the pool.
It could look like this:
// pseudocode: loop over the file
while (/* more of the file is left to read */) {
    // whenever a chunk is ready:
    ChunkProcessor *chunkprocessor = new ChunkProcessor(chunk);
    // connect before start() so the finished() signal cannot be missed
    connect(chunkprocessor, SIGNAL(finished(std::shared_ptr<ProcessedData>)), this, SLOT(readingFinished(std::shared_ptr<ProcessedData>)));
    QThreadPool::globalInstance()->start(chunkprocessor);
}
You can use std::shared_ptr to pass the processed data around so that you do not need QMutex or anything else, and you avoid serialization problems with multiple threads accessing the same resource.
Note: in order to use a custom type in a signal, you have to register it before use:
qRegisterMetaType<std::shared_ptr<ProcessedData>>("std::shared_ptr<ProcessedData>");
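A rough sketch of what such a ChunkProcessor could look like (ProcessedData is a placeholder type; note that QRunnable does not derive from QObject, so the class has to inherit both in order to emit signals):

#include <memory>
#include <QByteArray>
#include <QObject>
#include <QRunnable>

struct ProcessedData { /* whatever your parser produces for one chunk */ };

class ChunkProcessor : public QObject, public QRunnable
{
    Q_OBJECT
public:
    explicit ChunkProcessor(QByteArray *chunk) : m_chunk(chunk) { setAutoDelete(true); }

    void run() override
    {
        auto result = std::make_shared<ProcessedData>();
        // ... parse the lines contained in *m_chunk and fill *result ...
        delete m_chunk;              // the chunk is owned by the processor
        emit finished(result);
    }

signals:
    void finished(std::shared_ptr<ProcessedData> result);

private:
    QByteArray *m_chunk;
};

Remember the qRegisterMetaType call above, since the signal crosses thread boundaries via a queued connection.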
Edit (based on discussion, my answer was not clear about this):
It does not matter what disk you use or how fast it is. Reading is a single-threaded operation.
This solution was suggested only because reading took 7 seconds, and again, it does not matter what disk it is; 7 seconds is what counts. The only purpose is to start processing as soon as possible, without waiting until reading is finished.
You can use:
QByteArray data = file.readAll();
Or you can use the principal idea (I do not know why it takes 7 seconds to read, or what is behind that):
QFile file("in.txt");
if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
return;
QByteArray* data = new QByteArray;
int count = 0;
while (!file.atEnd()) {
++count;
data->append(file.readLine());
if ( count > 10000 ) {
ChunkProcessor *chunkprocessor = new ChunkProcessor(data);
QThreadPool::globalInstance()->start(chunkprocessor);
connect(chunkprocessor, SIGNAL(finished(std::shared_ptr<ProcessedData>)), this, SLOT(readingFinished(std::shared_ptr<ProcessedData>)));
data = new QByteArray;
count = 0;
}
}
One file, one thread, read almost as fast as reading line by line "without" interruption.
What you do with the data is another problem, but it has nothing to do with I/O; it is already in memory.
So the only concern would be a 5 GB file and the amount of RAM on the machine.
It is a very simple solution: all you need is to subclass QRunnable, reimplement the run() function, emit a signal when it is finished, pass the processed data using a shared pointer, and, in the main thread, join that data into one structure or whatever. A simple thread-safe solution.
I would propose a multi-threaded approach with a slight variation: one thread is dedicated to reading the file in chunks of a predefined (configurable) size and keeps feeding data to a set of processing threads (more than one, based on the CPU cores). Let us say the configuration looks like this:
chunk size = 50 MB
Disk Thread = 1
Process Threads = 5
Create a class for reading data from the file. This class holds a data structure used to communicate with the process threads; for example, the structure would contain the starting offset and ending offset of the read buffer for each process thread. For reading file data, the reader class holds 2 buffers, each of chunk size (50 MB in this case).
Create a process class which holds (shared) pointers to the read buffers and to the offsets data structure.
Now create a driver (probably the main thread) which creates all the threads, waits for their completion, and handles the signals.
The reader thread is invoked with the reader class, reads 50 MB of data, and, based on the number of threads, creates an offsets data-structure object. In this case t1 handles 0-10 MB, t2 handles 10-20 MB, and so on. Once ready, it notifies the processor threads. It then immediately reads the next chunk from disk and waits for the completion notification from the processor threads.
On this notification, each processor thread reads data from its part of the buffer and processes it. Once done, it notifies the reader thread of completion and waits for the next chunk.
This continues until all the data is read and processed. The reader thread then notifies the main thread about completion; the main thread sends PROCESS_COMPLETION and all threads exit, or the main thread chooses to process the next file in the queue.
Note that the offsets are used for ease of explanation; mapping offsets to line delimiters needs to be handled programmatically.
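Here is a simplified single-queue sketch of this idea, not the exact double-buffer/offset scheme described above, using standard C++11 threads rather than Qt (Chunk, the chunk size, and the processing step are placeholders, and the line-delimiter handling just mentioned is omitted):

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct Chunk { std::string data; };          // raw bytes of one chunk

std::queue<Chunk> chunkQueue;
std::mutex queueMutex;
std::condition_variable queueCv;
bool readingDone = false;

void readerThread(const std::string& path, std::size_t chunkSize)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    while (in) {
        Chunk c;
        c.data.resize(chunkSize);
        in.read(&c.data[0], static_cast<std::streamsize>(chunkSize));
        c.data.resize(static_cast<std::size_t>(in.gcount()));
        if (c.data.empty())
            break;
        {
            std::lock_guard<std::mutex> lock(queueMutex);
            chunkQueue.push(c);              // hand the chunk to the processors
        }
        queueCv.notify_one();
    }
    { std::lock_guard<std::mutex> lock(queueMutex); readingDone = true; }
    queueCv.notify_all();
}

void processorThread()
{
    for (;;) {
        std::unique_lock<std::mutex> lock(queueMutex);
        queueCv.wait(lock, [] { return readingDone || !chunkQueue.empty(); });
        if (chunkQueue.empty())
            return;                          // reader finished and queue drained
        Chunk c = chunkQueue.front();
        chunkQueue.pop();
        lock.unlock();
        // NOTE: a chunk boundary will usually split a line; the offset-to-delimiter
        // mapping mentioned above is omitted. Parse the lines of c.data here.
    }
}

int main()
{
    std::thread reader(readerThread, std::string("in.csv"), std::size_t(50) * 1024 * 1024);
    std::vector<std::thread> workers;
    for (int i = 0; i < 5; ++i)
        workers.emplace_back(processorThread);
    reader.join();
    for (std::size_t i = 0; i < workers.size(); ++i)
        workers[i].join();
    return 0;
}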
If the parser you are using cannot be distributed, the approach is obviously not scalable.
I would vote for a technique like the one below:
chunk up the file into a size that can be handled by one machine / within your time constraint
distribute the chunks to a cluster of machines (1..*) that can meet your time/space requirements
consider dealing in block sizes for a given chunk
avoid threads on the same resource (i.e. a given block) to save yourself from all the thread-related issues
use threads for non-competing (on a resource) operations, such as reading on one thread and writing to a different file on another thread
do your parsing (now you can invoke your parser for this small chunk)
do your operations
merge the results back, or distribute them as they are if you can
Now, having said that, why can't you use a Hadoop-like framework?

Irregular file writing performance in C++

I am writing an app which receives a binary data stream with a simple function call like put(DataBLock, dateTime); where each data package is 4 MB.
I have to write these data blocks to separate files for future use, together with some additional data like id, insertion time, tag, etc.
So I tried both of these methods:
First with FILE:
data.id = seedFileId;
seedFileId++;
std::string fileName = getFileName(data.id);
char *fNameArray = (char*)fileName.c_str();
FILE* pFile;
pFile = fopen(fNameArray,"wb");
fwrite(reinterpret_cast<const char *>(&data.dataTime), 1, sizeof(data.dataTime), pFile);
data.dataInsertionTime = time(0);
fwrite(reinterpret_cast<const char *>(&data.dataInsertionTime), 1, sizeof(data.dataInsertionTime), pFile);
fwrite(reinterpret_cast<const char *>(&data.id), 1, sizeof(long), pFile);
fwrite(reinterpret_cast<const char *>(&data.tag), 1, sizeof(data.tag), pFile);
fwrite(reinterpret_cast<const char *>(&data.data_block[0]), 1, data.data_block.size() * sizeof(int), pFile);
fclose(pFile);
Second with ofstream:
ofstream fout;
data.id = seedFileId;
seedFileId++;
std::string fileName = getFileName(data.id);
char *fNameArray = (char*)fileName.c_str();
fout.open(fNameArray, ios::out| ios::binary | ios::app);
fout.write(reinterpret_cast<const char *>(&data.dataTime), sizeof(data.dataTime));
data.dataInsertionTime = time(0);
fout.write(reinterpret_cast<const char *>(&data.dataInsertionTime), sizeof(data.dataInsertionTime));
fout.write(reinterpret_cast<const char *>(&data.id), sizeof(long));
fout.write(reinterpret_cast<const char *>(&data.tag), sizeof(data.tag));
fout.write(reinterpret_cast<const char *>(&data.data_block[0]), data.data_block.size() * sizeof(int));
fout.close();
In my tests the first method looks faster, but my main problem is that in both cases everything goes fine at first: every file-write operation takes about the same time (around 20 milliseconds). But after the 250th-300th package it starts to show peaks of 150 to 300 milliseconds, then goes back down to 20 milliseconds, then up to 150 ms again, and so on. So it becomes very unpredictable.
When I put some timers into the code I figured out that the main reason for these peaks is the fout.open(...) and pFile = fopen(...) lines. I have no idea whether this is because of the operating system, the hard drive, some kind of cache or buffering mechanism, etc.
So the question is: why do these file-opening lines become problematic after some time, and is there a way to make the file-writing operation stable, i.e. take a fixed time?
Thanks.
NOTE: I'm using Visual Studio 2008 VC++ on Windows 7 x64. (I also tried a 32-bit configuration, but the result is the same.)
EDIT: After some point the writing speed slows down as well, even when the file-open time is at its minimum. I tried different package sizes, and here are the results:
For 2 MB packages it takes twice as long to slow down; the slowdown begins after about the 600th item.
For 4 MB packages, around the 300th item.
For 8 MB packages, around the 150th item.
(In each case that is roughly 1.2 GB written before the slowdown starts.) So it seems to me to be some sort of caching problem (in the hard drive or the OS)? But I also tried disabling the hard drive cache and nothing changed...
Any idea?
This is all perfectly normal; you are observing the behavior of the file system cache, a chunk of RAM that is set aside by the operating system to buffer disk data. It is normally a fat gigabyte, and can be much more if your machine has lots of RAM. It sounds like you've got 4 GB installed, not that much for a 64-bit operating system. It depends, however, on the RAM needs of the other processes that run on the machine.
Your calls to fwrite() or ofstream::write() write to a small buffer created by the CRT, which in turn makes operating system calls to flush full buffers. The OS writes normally complete very quickly; it is a simple memory-to-memory copy from the CRT buffer to the file system cache. Effective write speed is in excess of a gigabyte per second.
The file system driver lazily writes the file system cache data to the disk, optimized to minimize the seek time of the write head, by far the most expensive operation on a disk drive. Effective write speed is determined by the rotational speed of the disk platter and the time needed to position the write head. Typical is around 30 megabytes per second for consumer-level drives, give or take a factor of 2.
Perhaps you see the fire-hose problem here: you are writing to the file cache a lot faster than it can be emptied. At 4 MB every ~20 msec you are feeding it roughly 200 megabytes per second, several times faster than the drive can drain it. This does hit the wall eventually; you'll manage to fill the cache to capacity and suddenly see the perf of your program fall off a cliff. Your program must now wait until space opens up in the cache so the write can complete; effective write speed is now throttled by disk write speeds.
The 20 msec delays you observe are normal as well. That's typically how long it takes to open a file. That time is completely dominated by disk head seek times; the head needs to travel to the file system index to write the directory entry. Nominal times are between 20 and 50 msec; you are on the low end of that already.
Clearly there is very little you can do in your code to improve this. Which CRT functions you use certainly doesn't make any difference, as you found out. At best you could increase the size of the files you write, which reduces the overhead spent on creating files.
Buying more RAM is always a good idea, but of course it merely delays the moment when the fire hose overflows the bucket. You need better drive hardware to get ahead. An SSD is pretty nice; so is a striped RAID array. The best thing to do is simply not to wait for your program to complete :)
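To illustrate the "increase the size of the files you write" suggestion, here is one hypothetical way to do it: append each 4 MB block to one big container file instead of creating a new file per block, so the expensive create/open step happens only once per container. The DataBlock fields mirror what the question writes, but their exact types are assumptions:

#include <cstdio>
#include <ctime>
#include <vector>

struct DataBlock {                       // assumed layout, mirroring the question
    long long dataTime;
    long long dataInsertionTime;
    long id;
    int tag;
    std::vector<int> data_block;
};

void appendBlock(std::FILE* container, DataBlock& data)
{
    data.dataInsertionTime = std::time(0);
    std::fwrite(&data.dataTime, sizeof(data.dataTime), 1, container);
    std::fwrite(&data.dataInsertionTime, sizeof(data.dataInsertionTime), 1, container);
    std::fwrite(&data.id, sizeof(data.id), 1, container);
    std::fwrite(&data.tag, sizeof(data.tag), 1, container);
    std::fwrite(&data.data_block[0], sizeof(int), data.data_block.size(), container);
    // no fclose() here: the container file stays open across many put() calls,
    // so the per-package cost is just the copy into the file system cache
}

An index mapping id to file offset would then let you find an individual block later; that part is not shown.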
So the question is: why do these file-opening lines become problematic after some time, and is there a way to make the file-writing operation stable, i.e. take a fixed time?
This observation (i.e. varying time taken by the write operation) does not mean that there is a problem in the OS or the file system. There could be various reasons behind your observation. One possible reason is delayed write: the kernel may defer writing the data to disk, caching (buffering) it in case another process will read or write it soon, so that an extra disk operation can be avoided.
This situation can lead to inconsistency in the time taken by different write calls for the same size of data/buffer.
File I/O is a complex and complicated topic and depends on various other factors. For complete information on the internal algorithms of the file system, you may want to refer to the classic book "The Design of the UNIX Operating System" by Maurice J. Bach, which describes these concepts and their implementation in detail.
Having said that, you may want to use a flush call immediately after your write call in both versions of your program (i.e. C and C++). This way you may get consistent file I/O write times. Otherwise your program's behaviour looks correct to me.
// C program
fwrite(data, 1, size, fp);   // size = number of bytes in data
fflush(fp);

// C++ program
fout.write(data, size);
fout.flush();
It's possible that the spikes are not related to the I/O itself but to NTFS metadata: when your file count reaches some limit, some NTFS AVL-like data structure needs some refactoring and... bump!
To check this you should preallocate the file entries, for example by creating all the files with zero size and then opening them for writing, just for testing: if my theory is correct you shouldn't see the spikes anymore.
Oh, and you must disable file indexing (the Windows Search service) there! I just remembered that... see here.
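If you want to try the preallocation idea, a quick hypothetical test could look like this (getFileName() stands in for the helper used in the question; the name format here is just a placeholder):

#include <cstdio>
#include <string>

// Placeholder for the getFileName() helper from the question.
std::string getFileName(long id)
{
    char name[64];
    std::sprintf(name, "block_%ld.bin", id);
    return name;
}

void preallocateFiles(long firstId, long count)
{
    for (long id = firstId; id < firstId + count; ++id) {
        // "wb" creates the file, leaving a zero-byte directory entry on disk
        std::FILE* pFile = std::fopen(getFileName(id).c_str(), "wb");
        if (pFile)
            std::fclose(pFile);
    }
}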

Reading binary files, Linux Buffer Cache

I am busy writing something to test the read speeds for disk IO on Linux.
At the moment I have something like this to read the files:
Edited to change code to this:
const int segsize = 1048576;
char buffer[segsize];
ifstream file;
file.open(sFile.c_str());
while(file.readsome(buffer,segsize)) {}
For foo.dat, which is 150 GB, the first time I read it in, it takes around 2 minutes.
However, if I run it again within 60 seconds of the first run, it takes around 3 seconds. How is that possible? Surely the only place it could be read from that fast is the buffer cache in RAM, and the file is too big to fit in RAM.
The machine has 50 GB of RAM, and the drive is an NFS mount with all the default settings. Please let me know where I could look to confirm that this file is actually being read at this speed. Is my code wrong? It appears to take the correct amount of time the first time the file is read.
Edited to add:
I found out that my files were only being read up to a random point. I've managed to fix this by changing segsize down to 1024 from 1048576. I have no idea why changing this allows the ifstream to read the whole file instead of stopping at a random point.
Thanks for the answers.
On Linux, you can do this for a quick throughput test:
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.863904 s, 243 MB/s
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.0748273 s, 2.8 GB/s
$ sync && echo 3 > /proc/sys/vm/drop_caches
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.919688 s, 228 MB/s
echo 3 > /proc/sys/vm/drop_caches will flush the cache properly.
in_avail doesn't give the length of the file, but a lower bound on what is available (in particular, if the buffer has already been used, it returns the size available in the buffer). Its goal is to report what can be read without blocking.
unsigned int is most probably unable to hold a length of more than 4 GB, so what is read can very well fit in the cache.
C++0x stream positioning may be interesting to you if you are using large files.
in_avail returns a lower bound on how much is available to read in the stream's read buffer, not the size of the file. To read the whole file via the stream, just keep calling the stream's readsome() method and checking how much was read with the gcount() method; when that returns zero, you have read everything.
It appears to take a correct amount of time the first time the file is read.
On that first read, you're reading 150 GB in about 2 minutes. That works out to about 10 gigabits per second. Is that what you're expecting (based on the network to your NFS mount)?
One possibility is that the file could be at least partly sparse. A sparse file has regions that are truly empty; they don't even have disk space allocated to them. These sparse regions also don't consume much cache space, so reading them essentially only requires the time to zero out the userspace pages they're being read into.
You can check with ls -lsh. The first column is the on-disk size; if it's less than the file size, the file is indeed sparse. To de-sparse the file, just write to every page of it.
If you would like to test true disk speeds, one option is to use the O_DIRECT flag to open(2) to bypass the cache. Note that all I/O using O_DIRECT must be page-aligned, and some filesystems do not support it (in particular, it won't work over NFS). Also, it's a bad idea for anything other than benchmarking. See some of Linus's rants in this thread.
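For example, a minimal O_DIRECT read loop might look like the sketch below (local filesystem only, since O_DIRECT won't work over NFS; the 1 MiB block size and 4096-byte alignment are assumptions, and error handling is kept to a minimum):

#define _GNU_SOURCE          // for O_DIRECT
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
    const size_t blockSize = 1 << 20;                  // 1 MiB, a multiple of the page size
    void* buffer = NULL;
    if (posix_memalign(&buffer, 4096, blockSize) != 0) // O_DIRECT needs an aligned buffer
        return 1;

    int fd = open("foo.dat", O_RDONLY | O_DIRECT);     // bypass the page cache
    if (fd < 0) {
        perror("open");
        return 1;
    }

    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buffer, blockSize)) > 0)
        total += n;

    printf("read %lld bytes\n", total);
    close(fd);
    free(buffer);
    return 0;
}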
Finally, to drop all caches on a Linux system for testing, you can do:
echo 3 > /proc/sys/vm/drop_caches
If you do this on both client and server, you will force the file out of memory. Of course, this will have a negative performance impact on anything else running at the time.