Erlang processes and message-passing architecture - concurrency

The task I have at hand is to read the lines of a large file, process them, and return ordered results.
My algorithm is:
Start with a master process that evaluates the workload (written in the first line of the file).
Spawn worker processes: each worker reads a part of the file using pread/3, processes that part, and sends its results to the master.
The master receives all sub-results, sorts them, and returns them.
So basically, no communication is needed between the workers.
My questions:
How do I find the optimal balance between the number of Erlang processes and the number of cores? If I spawn one process for each processor core I have, would that underutilize my CPU?
How does pread/3 reach the specified line? Does it iterate over all the lines in the file? And is pread/3 a good plan for parallel file reading?
Is it better to send one big message from process A to B, or to send N small messages? I have found part of the answer in the link below, but I would appreciate further elaboration:
erlang message passing architecture

Erlang processes are cheap. You're free (and encouraged) to use more than however many cores you have. There might be an upper limit to what is practical for your problem (loading 1TB of data with one process per line is asking a bit much, depending on line size).
The easiest way to do it when you don't know is to let the user decide. This means you could decide to spawn N workers and distribute work between them, waiting to hear back (see the sketch below). Re-run the program with a different N if you don't like how it runs.
A trickier way is to benchmark a bunch of times, pick what you think makes sense as a maximal value, stick it in a pool library (if you want to; some pools go for preallocated resources, some for a resizable amount), and settle on what would be a one-size-fits-all solution.
But really, there is no easy 'optimal number'. You can run it on 50 processes as well as on 65,000 of them if you want; if the task is embarrassingly parallel, the VM should be able to make use of most of them and saturate the cores anyway.
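For instance, a minimal sketch of that "spawn N workers and wait to hear back" shape (all names are illustrative; process_chunk/1 stands in for the real work, and the caller picks N by deciding how to chunk the input):

    %% Spawn one worker per chunk; selectively receive the replies so the
    %% results come back in the same order the chunks were handed out.
    run(Chunks) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn_link(fun() -> Parent ! {Ref, process_chunk(C)} end),
                    Ref
                end || C <- Chunks],
        [receive {Ref, Result} -> Result end || Ref <- Refs].

    process_chunk(Chunk) ->
        Chunk. % placeholder for the real per-chunk work

Re-running with a different chunking (and therefore a different N) is then just a matter of how you split the input.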
-
Parallel file reading is an interesting question. It may or may not be faster (as the direct comments have mentioned), and it may only represent a speed-up if the work on each line is small enough that reading the file is the biggest cost.
The tricky bit is really that functions like pread/2-3 take a byte offset, while your question is worded as if you are worried about lines of the file. The byte offsets you hand off to workers may therefore end up straddling a line. If your block boundary lands at the word my in this is my line\nhere it goes\n, one worker will see an incomplete line, while the other will report only on my line\n, missing the prior this is.
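To make the straddling concrete, here is what two byte-range reads over that example text would return (offsets are picked for the illustration; assume Path names a file containing exactly this is my line\nhere it goes\n, which is 29 bytes):

    {ok, Io} = file:open(Path, [read, raw, binary]).
    %% worker 1 gets bytes 0..7, worker 2 gets bytes 8..23:
    {ok, <<"this is ">>} = file:pread(Io, 0, 8).
    {ok, <<"my line\nhere it ">>} = file:pread(Io, 8, 16).
    %% neither range is aligned on "\n": worker 2 sees "my line" without
    %% its beginning, and both ranges stop mid-line.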
Generally, this kind of annoying stuff is what will lead you to have the first process own the file and sift through it, handing off only bits of text to the workers to process; that process then acts as a sort of coordinator.
The nice aspect of this strategy is that if the main process knows everything that was sent as a message, it also knows when all responses have been received, making it easy to know when to return the results. If everything is disjoint, you instead have to trust both the starter and the workers to tell you "we're all out of work" as a distinct set of independent messages.
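A sketch of that coordinator shape (illustrative names; do_line/1 is a placeholder for the real work). Because the coordinator tags each line with a sequence number and counts what it sent, it knows exactly how many replies to wait for and can restore file order at the end:

    coordinate(Path, Workers) ->
        {ok, Io} = file:open(Path, [read, raw, binary, {read_ahead, 65536}]),
        Sent = dispatch(Io, Workers, 0),
        ok = file:close(Io),
        collect(Sent, []).

    %% round-robin whole lines to the workers
    dispatch(Io, [W | Rest], Seq) ->
        case file:read_line(Io) of
            {ok, Line} ->
                W ! {line, self(), Seq, Line},
                dispatch(Io, Rest ++ [W], Seq + 1);
            eof ->
                Seq
        end.

    collect(0, Acc) ->
        [R || {_, R} <- lists:keysort(1, Acc)]; % back into file order
    collect(N, Acc) ->
        receive {done, Seq, Result} -> collect(N - 1, [{Seq, Result} | Acc]) end.

    %% each worker just answers the coordinator and loops
    worker() ->
        receive
            {line, From, Seq, Line} ->
                From ! {done, Seq, do_line(Line)},
                worker()
        end.

    do_line(Line) -> Line. % placeholder for the real per-line work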
In practice, you'll probably find that what helps the most is doing file operations in a way that is kind to your hardware, more than asking "how many processes can read the file at once". There's only one hard disk (or SSD), and all data has to go through it anyway; the parallelism of access there may be limited in the end.
-
Use messages that make sense for your program. The most performant program would have a lot of processes able to do work without ever needing to pass messages, communicate, or acquire locks.
A more realistic very performant program would use very few messages of a very small size.
The fun thing here is that your problem is inherently data-based. So there are a few things you can do:
Make sure you read text in a binary format; large binaries (> 64 bytes) get allocated on a global binary heap, are shared around, and are GC'd with reference counting
Hand in information on what needs to be done rather than the data for doing it. This one would need measuring, but the lead process could go over the file, note where lines end, and just hand byte offsets to the workers so they can go read the file themselves. Do note that you'll end up reading the file twice that way, so if memory allocation is not your main overhead, this will likely be slower
Make sure the file is read in raw or ram mode; other modes use a middle-man process to read and forward data (this is useful for reading files over the network between clustered Erlang nodes), whereas raw and ram modes give the file descriptor directly to the calling process and are a lot faster (a short sketch follows this list)
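As a quick sketch of that last point (Path is assumed to be bound to the file name):

    %% Default open: reads go through a middle-man process, which is
    %% useful across clustered nodes but slower locally.
    {ok, Io1} = file:open(Path, [read]).

    %% raw hands the descriptor straight to the calling process, binary
    %% avoids building lists of bytes, and read_ahead buffers reads in
    %% large blocks so read_line/1 rarely touches the disk.
    {ok, Io2} = file:open(Path, [read, raw, binary, {read_ahead, 256 * 1024}]).
    {ok, Line} = file:read_line(Io2).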
First worry about writing a clear, readable and correct program. Only if it is too slow should you attempt to refactor and optimize it; you may very well find it good enough on the first try.
I hope this helps.
P.S. You can try the really simple stuff at first:
either:
read the whole file at once with {ok, Bin} = file:read_file(Path) and split it into lines (with binary:split(Bin, <<"\n">>, [global])),
use {ok, Io} = file:open(Data, [read,ram]) (note: in ram mode the first argument is the file's contents as in-memory iodata, not a file name) and then use file:read_line(Io) on the file descriptor repeatedly
use {ok, Io} = file:open(File, [read,raw,{read_ahead,BlockSize}]) and then use file:read_line(Io) on the file descriptor repeatedly
call rpc:pmap({?MODULE, Function}, ExtraArgs, Lines) to run everything in parallel automatically (it will spawn one process per line)
call lists:sort/1 on the result.
Then, from there, you can refine each step if you identify it as problematic.
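Put together, that simple version is only a handful of lines; a sketch (process_line/1 has to be exported so rpc:pmap/3 can call it, and is a placeholder for the real work):

    -module(simple).
    -export([run/1, process_line/1]).

    %% read everything, split into lines, process each line in its own
    %% process, then sort the results
    run(Path) ->
        {ok, Bin} = file:read_file(Path),
        Lines = binary:split(Bin, <<"\n">>, [global]),
        Results = rpc:pmap({?MODULE, process_line}, [], Lines),
        lists:sort(Results).

    process_line(Line) ->
        Line. % placeholder for the real per-line work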

Related

Thread Optimization [duplicate]

I have an input file in my application that contains a vast amount of information. Reading over it sequentially, at only a single file offset at a time, is not sufficient for my application's usage. Ideally, I'd like to have two threads with separate and distinct ifstreams reading from two unique file offsets of the same file. I can't just start one ifstream up and then make a copy of it using its copy constructor (since it's non-copyable). So, how do I handle this?
Immediately I can think of two ways,
Construct a new ifstream for the second thread, open it on the same file.
Share a single instance of an open ifstream across both threads (using, for instance, boost::shared_ptr<>). Seek to the file offset that the current thread is interested in whenever that thread gets a time slice.
Is one of these two methods preferred?
Is there a third (or fourth) option that I have not yet thought of?
Obviously I am ultimately limited by the hard drive having to spin back and forth, but what I am interested in taking advantage of (if possible) is some OS-level disk caching at both file offsets simultaneously.
Thanks.
Two std::ifstream instances will probably be the best option here. Modern HDDs are optimized for a large queue of I/O requests, so reading from two std::ifstream instances concurrently should give quite nice performance.
If you have a single std::ifstream you'll have to worry about synchronizing access to it, plus it might defeat the operating system's automatic sequential access read-ahead caching, resulting in poorer performance.
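(Since this thread is listed as related to the Erlang question above, it may be worth noting that the same "one descriptor per reader" idea is how you would write it there too: each reader process opens the file itself, and file:pread/3 reads at an absolute offset, so there is nothing to synchronize. A sketch with illustrative names:)

    read_range(Path, Offset, Len) ->
        {ok, Io} = file:open(Path, [read, raw, binary]),
        {ok, Data} = file:pread(Io, Offset, Len),
        ok = file:close(Io),
        Data.

    %% two concurrent readers over disjoint halves of the file
    read_both_halves(Path, Size) ->
        Parent = self(),
        Half = Size div 2,
        spawn_link(fun() -> Parent ! {front, read_range(Path, 0, Half)} end),
        spawn_link(fun() -> Parent ! {back, read_range(Path, Half, Size - Half)} end),
        {receive {front, F} -> F end,
         receive {back, B} -> B end}.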
Between the two, I would prefer the second. Having two openings of the same file might cause an inconsistent view between them, depending on the underlying OS.
For a third option, pass a reference or raw pointer into the other thread. So long as the semantics are that one thread "owns" the istream, a raw pointer or reference is fine.
Finally note that on the vast majority of hardware, the disk is the bottleneck, not CPU, when loading large files. Using two threads will make this worse because you're turning a sequential file access into a random access. Typical hard disks can do maybe 100MB/s sequentially, but top out at 3 or 4 MB/s random access.
Other option:
Memory-map the file, create as many memory istream objects as you want. (istrstream is good for this, istringstream is not).
It really depends on your system. A modern system will generally read ahead; seeking within the file is likely to inhibit this, so it should definitely be avoided.
It might be worth experimenting to see how read-ahead works on your system: open the file, then read the first half of it sequentially, and see how long that takes. Then open it, seek to the middle, and read the second half sequentially. (On some systems I've seen in the past, a simple seek, at any time, will turn off read-ahead.) Finally, open it, then read every other record; this will simulate two threads using the same file descriptor. (For all of these tests, use fixed-length records, and open in binary mode. Also take whatever steps are necessary to ensure that any data from the file is purged from the OS's cache before starting the test; under Unix, copying a file of 10 or 20 gigabytes to /dev/null is usually sufficient for this.)
That will give you some ideas, but to be really certain, the best solution would be to test the real cases. I'd be surprised if sharing a single ifstream (and thus a single file descriptor) and constantly seeking won, but you never know.
I'd also recommend system-specific solutions like mmap, but if you've got that much data, there's a good chance you won't be able to map it all in one go anyway. (You can still use mmap, mapping sections of it at a time, but it becomes a lot more complicated.)
Finally, would it be possible to get the data already cut up into smaller files? That might be the fastest solution of all. (Ideally, this would be done where the data is generated or imported into the system.)
My vote would be a single reader, which hands the data to multiple worker threads.
If your file is on a single disk, then multiple readers will kill your read performance. Yes, your kernel may have some fantastic caching or queuing capabilities, but it is going to be spending more time seeking than reading data.

Performance bottleneck at _L_unlock_16

I am trying to use the gperftools (Google Performance Tools) CPU profiler to debug performance issues in a multi-threaded program. With a single thread it takes 250 ms, while 4 threads take around 900 ms.
My program has an mmap'ed file which is shared across threads, and all operations are read-only. My program also creates a large number of objects which are not shared across threads. (Specifically, my program uses the CRF++ library to do some querying.) I am trying to figure out how to make my program perform better with multiple threads. The call graph produced by the gperftools CPU profiler shows that my program spends a lot of its time (around 50%) in _L_unlock_16.
Searching the web for _L_unlock_16 pointed to some bug reports with Canonical suggesting that it's associated with libpthread. But other than that, I was not able to find any useful information for debugging.
A brief description of what my program does: I have a few words in a file (4). In my program I have processWord(), which processes a single word using CRF++. This processWord() is what each thread executes. My main() reads words from the file, and each thread runs processWord() in parallel. If I process a single word (hence only 1 thread) it takes 250 ms, so if I process all 4 words (and hence 4 threads) I expected it to finish in the same 250 ms; however, as I mentioned above, it's taking around 900 ms.
This is the callgraph of the execution - https://www.dropbox.com/s/o1mkh477i7e9s4m/cgout_n2.png
I want to understand why my program is spending a lot of time at _L_unlock_16 and what I can do to mitigate it.
Yet again, _L_unlock_16 is not a function of your code. Have you looked at the stacktraces above that function? What are its callers when the program waits? You've said that the program wastes 50% of its time waiting inside. But which part of the program orders that operation? Is it again from memory alloc/dealloc ops?
The function seems to come from libpthread. Does CRF++ handle threads/libpthread in any way? If yes, then maybe the library is ill-configured? Or maybe it implements some 'basic thread safety' by adding locks everywhere and simply is not built well for multithreading? What do the docs say about that?
Personally, I'd guess that it ignores threads and that you have added all the threading. I may be wrong, but if that's true, then CRF++ probably does not call that 'unlock' function at all, and the 'unlock' is somehow called from your code that orchestrates the threads/locks/queues/messages, etc. Halt the program a few times and look at who called the unlock. If it really spends 50% sitting in the unlock, you will very quickly know who causes the lock to be used, and you will be able to either eliminate it or at least perform more refined research.
EDIT #1:
Eh... when I said "stacktrace" I meant stacktrace, not callgraph. A callgraph may look nice in trivial cases, but in more complex ones it will be mangled and unreadable and will hide the precious details in a "compacted" form. But fortunately, here the case looks simple enough.
Please observe the beginning: "Process word, 99x". I assume that the "99x" is the call count. Then, look at "tagger-parse": 97x. From that:
61x goes into rebuildFeatures, from which 41x goes right into unlock and 20(13) goes indirectly into it
23x goes to buildLattice, from which 21x goes into unlock
I'd guess that CRF++ uses locking quite heavily. To me, it seems that you are simply observing the effects of CRF++'s internal locking. It certainly is not lockless internally.
It seems to lock at least once per processWord. It's hard to say without looking at the code (is it open source? I've not checked), and stacktraces would make it more obvious, but IF it really locks once per processWord, then it could even be a sort of "global lock" that protects "everything" from "all threads" and causes all jobs to serialize. Whatever. Anyways, clearly it's CRF++'s internals that lock and wait.
If your CRF objects are really (really) not shared across threads, then remove the threading configuration flags from CRF, pray that they were sane enough not to use any static variables or global objects, add some locking of your own (if needed) at the topmost job/result level, and retry. You should have it much faster now.
If the CRF objects are shared, unshare them and see above.
But if they are shared behind the scenes, then there's little you can do. Change your library to one that has better threading support, OR fix the library, OR ignore it and live with the current performance.
The last advice may sound strange (it works slowly, right? so why ignore it?), but in fact it is the most important one, and you should try it first. If the parallel tasks have a similar "data profile", then it is very probable that they will try to hit the same locks at roughly the same moments. Imagine a medium-sized cache that holds words sorted by their first letter. At the top level there's an array of, say, 26 entries. Each entry has a lock and a list of words inside. If you run 100 threads that will each first check "mom", then "dad", then "son", then all 100 threads will first hit and wait for each other at "M", then at "D", then at "S". Well, approximately/probably, of course. But you get the idea. If the data profile were more random, they'd block each other far less. Mind that processing ONE word is a small task and that you are trying to process the same word(s). Even if the internal CRF++ locking is smart, it is just bound to hit the same areas. Try again with more dispersed data.
Add to that the fact that threading costs. If something is guarded against races with locks, then every lock/unlock costs, because at the very least the thread has to "halt and check if the lock is open" (sorry, very imprecise wording). If the data to process is small in relation to the amount of lock checks, then adding more threads will not help and will just waste time. For checking one word, it may even happen that the sole handling of a single lock takes longer than processing the word! But if the amount of data to be processed were larger, then the cost of flipping a lock compared to processing the data might start being negligible.
Prepare a set of 100 or more words. Run and measure it on one thread. Then partition the words at random and run it on 2 and 4 threads. And measure. If it's not better, try 1,000 and 10,000 words. The more the better, of course, keeping in mind that the test should not last till your next birthday ;)
If you notice that 10k words split over 4 threads (2500 words per thread) works about 40%-30% or even 25% faster than on one thread, there you go! You simply gave it too small a job. It was tailored and optimized for bigger ones!
But, on the other hand, it may happen that 10k words split over 4 threads does not work faster, or worse, works slower. That might indicate that the library handles multithreading very badly. Then try the other things, like stripping threading from it or repairing it.

The fastest way to write data while producing it

In my program I am simulating a N-body system for a large number of iterations. For each iteration I produce a set of 6N coordinates which I need to append to a file and then use for executing the next iteration. The code is written in C++ and currently makes use of ofstream's method write() to write the data in binary format at each iteration.
I am not an expert in this field, but I would like to improve this part of the program, since I am in the process of optimizing the whole code. I feel that the latency associated with writing the result of the computation at each cycle significantly slows down the performance of the software.
I'm confused because I have no experience in actual parallel programming and low level file I/O. I thought of some abstract techniques that I imagined I could implement, since I am programming for modern (possibly multi-core) machines with Unix OSes:
Writing the data in the file in chunks of n iterations (there seem to be better ways to proceed...)
Parallelizing the code with OpenMP (how to actually implement a buffer so that the threads are synchronized appropriately, and do not overlap?)
Using mmap (the file size could be huge, on the order of GBs, is this approach robust enough?)
However, I don't know how to best implement them and combine them appropriately.
Of course, writing to a file at each iteration is inefficient and most likely slows down your computation (as a rule of thumb; it depends on your actual case).
You have to use a producer -> consumer design pattern. They will be linked by a queue, like a conveyor belt.
The producer will try to produce as fast as it can, only slowing if the consumer can't handle it.
The consumer will try to "consume" as fast as it can.
By splitting the two, you can increase performance more easily, because each process is simpler and has less interference from the other.
If the producer is faster, you need to improve the consumer; in your case, by writing to the file in the most efficient way, chunk by chunk most likely (as you said).
If the consumer is faster, you need to improve the producer, most likely by parallelizing it as you said.
There is no need to optimize both. Only optimize the slowest (the bottleneck).
Practically, you use threads and a synchronized queue between them. For implementation hints, have a look here, especially §18.12 "The Producer-Consumer Pattern".
About flow management: you'll have to add a little more complexity by selecting a "max queue size" and making the producer(s) wait if the queue doesn't have enough space. Beware of deadlocks then; code it carefully (see the Wikipedia link I gave about that).
Note: it's a good idea to use Boost threads, because plain threads are not very portable (well, they are since C++0x, but C++0x availability is not yet good).
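(A language-neutral sketch of the split, written here in Erlang to match the main question on this page: the writer's mailbox plays the part of the queue, and an ack per chunk implements the "max queue size" so the producer never runs more than a fixed number of chunks ahead. next_chunk/0 is a placeholder for one iteration of the simulation.)

    start(Path) ->
        Writer = spawn_link(fun() -> writer(Path) end),
        produce(Writer, 8, 0). % at most 8 chunks in flight

    produce(Writer, Max, Pending) when Pending < Max ->
        case next_chunk() of
            {ok, Chunk} ->
                Writer ! {chunk, self(), Chunk},
                produce(Writer, Max, Pending + 1);
            done ->
                Writer ! eof
        end;
    produce(Writer, Max, Pending) ->
        %% queue full: wait for the writer to acknowledge one chunk
        receive written -> produce(Writer, Max, Pending - 1) end.

    writer(Path) ->
        {ok, Io} = file:open(Path, [append, raw, binary]),
        writer_loop(Io).

    writer_loop(Io) ->
        receive
            {chunk, From, Chunk} ->
                ok = file:write(Io, Chunk),
                From ! written,
                writer_loop(Io);
            eof ->
                ok = file:close(Io)
        end.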
It's better to split the operation into two independent processes: data-producing and file-writing. Data-producing would use some buffer for iteration-wise data passing, and file-writing would use a queue to store write requests. Then, data-producing would just post a write request and go on, while file-writing would cope with the writing in the background.
Essentially, if the data is produced much faster than it can possibly be stored, you'll quickly end up holding most of it in the buffer. In that case your current approach seems quite reasonable as is, since little can be done programmatically to improve the situation.
If you don't want to play with doing stuff in different threads, you could try using aio_write(), which allows asynchronous writes. Essentially, you give the OS the buffer to write, and the function returns immediately; the write finishes while your program continues, and you can check later to see whether it has completed.
This solution still suffers from the producer/consumer problem mentioned in other answers: if your algorithm produces data faster than it can be written, eventually you will run out of memory to store the results between the algorithm and the write, so you'd have to try it and see how it works out.
"Using mmap (the file size could be huge, on the order of GBs, is this
approach robust enough?)"
mmap is the OS's method of loading programs, shared libraries and the page/swap file - it's as robust as any other file I/O and generally higher performance.
BUT on most OSes it's bad/difficult/impossible to expand the size of a mapped file while it's being used. So if you know the size of the data, or you are only reading, it's great. For a log/dump that you are continually adding to, it's less suitable, unless you know some maximum size.

Performance when reading a file line by line vs reading the whole file

Is there a noticeable difference (in theory) between reading a file line by line and reading the whole file in one go?
Reading the whole file does have a negative impact on the amount of memory used, but does it work faster?
I need to read a file and process each line. I don't know whether I should read one line at a time and process it, or read the whole file, process all, then write to output.
I've already set up the program to read line by line, and I want to know whether it is worth the effort to change it to read the whole file (not easy, given my setup).
Thanks,
Reading the whole file will be slightly faster -- but not much!
But be careful: reading the whole file is not scalable, because you are limited by the memory available in the system. Once the file size exceeds the RAM available to your program, it will start using swap space and be much slower. If the file size exceeds the available virtual memory, your program will crash.
I think it would depend on the needs of your application (like most things, I know). Reading a 1 MB file in Node.js is ~3-4x faster with fs.readFile() than using a readable stream or line reader, as far as just file reading goes. Streams may offer some additional performance if the file is very large and you are processing input on the fly. They may also be ideal if your application is already consuming a lot of memory, as a Node process has a ~1.5 GB memory limit on 64-bit systems. Processing chunks as they come in may also be more performant if the source of the data is slow relative to how fast the CPU can process it (archives on HDD or tape, network connections like TCP). As far as reading a file into memory vs. streaming it into memory goes, I am guessing the function-call overhead of emitting data events and switching to the processing callback slows down the process.
Like others, I believe doing bigger reads will improve the performance of your application some, but don't expect miracles: I/O is already buffered at the OS layer, so you'll only gain by reducing the overhead of having too many read calls. Reading the whole file in one go is dangerous unless you know the maximum possible size of your input files. The most reasonable approach is to read the file in large blocks.
If you wanted to improve even more, you should consider overlapping the I/O with the processing. Let's say you read the input file in blocks of 128MB. On your main thread you read the first 128MB block and then pass it on to a worker thread for processing. While the worker thread gets to work the main thread reads the second 128MB block. From that point on, while the worker thread is processing block N, the main thread is reading block N+1 from disk.
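Sketched in Erlang (matching the main question on this page; work/1 and handle/1 are placeholders for the per-block processing and for consuming its result), the overlap looks like this: the main process reads block N+1 while a spawned worker is still busy with block N:

    overlap(Io, BlockSize) ->
        case file:read(Io, BlockSize) of
            {ok, Block} -> overlap(Io, BlockSize, start_worker(Block));
            eof         -> ok
        end.

    overlap(Io, BlockSize, Worker) ->
        Next = file:read(Io, BlockSize),   % read block N+1 ...
        handle(wait_for(Worker)),          % ... while block N is processed
        case Next of
            {ok, Block} -> overlap(Io, BlockSize, start_worker(Block));
            eof         -> ok
        end.

    start_worker(Block) ->
        Parent = self(),
        Ref = make_ref(),
        spawn_link(fun() -> Parent ! {Ref, work(Block)} end),
        Ref.

    wait_for(Ref) ->
        receive {Ref, Result} -> Result end.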
Reading the entire file into memory is generally not a good idea, because the files can be huge and may take up a lot of memory, and in the worst case you'll run out of memory. So, to balance performance and memory usage, you read a block of the file into a buffer and parse through the buffer. When you are done processing the block, you read the next block, until EOF.
Deciding on a good block size will have to be done based on what you want to achieve.
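The loop itself is small; a sketch (again in Erlang, to match the main question; handle_block/1 is the placeholder, and note that if you parse lines, a line may straddle two blocks, so real code would carry the unparsed tail of one block over into the next):

    process_file(Path, BlockSize) ->
        {ok, Io} = file:open(Path, [read, raw, binary]),
        ok = block_loop(Io, BlockSize),
        file:close(Io).

    block_loop(Io, BlockSize) ->
        case file:read(Io, BlockSize) of
            {ok, Block} ->
                handle_block(Block), % placeholder for the real parsing
                block_loop(Io, BlockSize);
            eof ->
                ok
        end.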
To be honest, after studying efficiency for a while during my degree, I came to this conclusion about your question: it depends on how often the file is going to be read. If you're reading it once, then do the whole thing, because that frees the process for other tasks.
Again, one more thing to keep in mind: is the file going to be edited later and require updates (as in, reading only the updated part)? If so, you might need to set a marker to recognise where to read from (and then again, how often is it updated?). But yes, if it is a one-time job, go ahead and read it as a whole, as long as you do not require tokens to be created from certain literals in the file.
hope this helps.
One factor is how much data you are going to be reading, and so how long the program initially takes to run, i.e. whether there's any benefit in working on the performance.
See the book quotes in this answer for some good, general advice on thinking about software performance.
(I know you're after an answer in theory, but this aspect of when to worry about performance is also important whenever you have a finite amount of time to spend.)

Many small files or one big file? (Or, Overhead of opening and closing file handles) (C++)

I have created an application that does the following:
Make some calculations, write the calculated data to a file; repeat 500,000 times (overall, 500,000 files are written one after the other); then repeat 2 more times (overall, 1.5 million files were written).
Read data from a file, make some intense calculations with the data; repeat for 1,500,000 iterations (iterating over all the files written in step 1).
Repeat step 2 for 200 iterations.
Each file is ~212k, so overall I have ~300GB of data. It looks like the entire process takes ~40 days on a 2.8 GHz Core 2 Duo CPU.
My problem (as you can probably guess) is the time it takes to complete the entire process. All the calculations are serial (each calculation depends on the one before), so I can't parallelize this process across different CPUs or PCs. I'm trying to think how to make the process more efficient, and I'm pretty sure most of the overhead goes to file system access (duh...). Every time I access a file I open a handle to it and then close it once I finish reading the data.
One of my ideas to improve the run time was to use one big 300GB file (or several big files of 50GB each); then I would use only one open file handle and simply seek to each relevant piece of data and read it. But I'm not sure what the overhead of opening and closing file handles is. Can someone shed some light on this?
Another idea I had was to try to group the files into bigger ~100MB files, and then read 100MB each time instead of doing many 212k reads, but this is much more complicated to implement than the idea above.
Anyway, if anyone can give me advice on this or has any idea how to improve the run time, I would appreciate it!
Thanks.
Profiler update:
I ran a profiler on the process; it looks like the calculations take 62% of the runtime and the file reads take 34%. Meaning that even if I miraculously cut the file I/O cost to nothing, I'm still left with the ~24 days of pure calculation, which is quite an improvement, but still a long time :)
Opening a file handle isn't likely to be the bottleneck; actual disk I/O is. If you can parallelize disk access (e.g. by using multiple disks, faster disks, a RAM disk, ...) you may benefit far more. Also, make sure I/O doesn't block the application: read from disk, and process while waiting for I/O, e.g. with a reader thread and a processor thread.
Another thing: if the next step depends on the current calculation, why go through the effort of saving it to disk? Maybe with another view on the process' dependencies you can rework the data flow and get rid of a lot of IO.
Oh yes, and measure it :)
Each file is ~212k, so overall I have ~300GB of data. It looks like the entire process takes ~40 days ... all the calculations are serial (each calculation is dependent on the one before), so I can't parallel this process to different CPUs or PCs. ... pretty sure most of the overhead goes to file system access ... Every time I access a file I open a handle to it and then close it once I finish reading the data.
Writing 300GB of data serially might take 40 minutes, only a tiny fraction of 40 days. Disk-write performance shouldn't be an issue here.
Your idea of opening the file only once is spot-on. Probably closing the file after every operation is causing your processing to block until the disk has completely written out all the data, negating the benefits of disk caching.
My bet is that the fastest implementation of this application will use a memory-mapped file; all modern operating systems have this capability. It can end up being the simplest code, too. You'll need a 64-bit processor and operating system; you should not need 300GB of RAM. Map the whole file into address space at once and just read and write your data with pointers.
From your brief explanation, it sounds like xtofl's suggestion of threads is the correct way to go. I would recommend you profile your application first, though, to ensure that the time really is divided between I/O and CPU.
Then I would consider three threads joined by two queues.
Thread 1 reads files and loads them into RAM, then places the data/pointers in the queue. If the queue goes over a certain size, the thread sleeps; if it goes below a certain size, it starts again.
Thread 2 reads the data off the queue, does the calculations, then writes the results to the second queue.
Thread 3 reads the second queue and writes the data to disk.
You could consider merging threads 1 and 3; this might reduce contention on the disk, as your app would only do one disk op at a time.
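(For what it's worth, this three-stage pipeline maps almost one-to-one onto the Erlang question at the top of the page: the two queues become the mailboxes of the two downstream processes. A sketch with illustrative names, leaving out the sleep-based backpressure described above; calculate/1 is the placeholder and must return something writable (iodata):)

    start(Files, OutPath) ->
        Writer = spawn_link(fun() -> writer(OutPath) end),
        Compute = spawn_link(fun() -> compute(Writer) end),
        [begin
             {ok, Bin} = file:read_file(F), % stage 1: load into RAM
             Compute ! {data, Bin}
         end || F <- Files],
        Compute ! eof.

    compute(Writer) -> % stage 2: calculate
        receive
            {data, Bin} ->
                Writer ! {result, calculate(Bin)},
                compute(Writer);
            eof ->
                Writer ! eof
        end.

    writer(OutPath) -> % stage 3: write to disk
        {ok, Io} = file:open(OutPath, [append, raw, binary]),
        writer_loop(Io).

    writer_loop(Io) ->
        receive
            {result, R} -> ok = file:write(Io, R), writer_loop(Io);
            eof         -> ok = file:close(Io)
        end.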
Also, how does the operating system handle all the files? Are they all in one directory? What is performance like when you browse the directory (GUI file manager / dir / ls)? If this performance is bad, you might be working outside your file system's comfort zone. Although you can only change this on Unix, some file systems are optimized for different types of file usage, e.g. large files, lots of small files, etc. You could also consider splitting the files across different directories.
Before making any changes it might be useful to run a profiler trace to figure out where most of the time is spent to make sure you actually optimize the real problem.
What about using SQLite? I think you can get away with a single table.
Using memory-mapped files should be investigated, as it will reduce the number of system calls.