I've got very limited knowledge about Erlang, but as far as I understand, it can spawn "processes" with a very low cost.
So I wonder, what are those "processes" behind the scenes?
Are they Fibers? Threads? Continuations?
They are Lightweight Processes.
Also see my question Technically why is processes in Erlang more efficient than OS threads.
Also, from the Erlang doc:
Erlang processes are light-weight
(grow and shrink dynamically) with
small memory footprint, fast to create
and terminate and the scheduling
overhead is low.
Source: http://www.erlang.org/doc/reference_manual/processes.html
You might also want to have a look at this:
http://www.defmacro.org/ramblings/concurrency.html
When talking about Erlang processes, it says:
Erlang processes are lightweight
threads. They're very cheap to start
up and destroy and are very fast to
switch between because under the hood
they're simply functions. A typical
Erlang system running on a modern
desktop computer can switch between
many tens of thousands such processes.
Processes are switched every couple of
dozen function calls which makes
switches less granular but saves a
tremendous amount of time normally
wasted on context switching.
I haven't found a definitive source, but from what I understand:
There is a scheduler (or multiple schedulers that act
cooperatively) that determines which Erlang process to run on which OS
thread.
These processes have a growable stack (perhaps a preamble in
each function that allocates stack if needed) so they don't consume
too much memory unless they need it.
They yield back to the scheduler depending on whether they are
waiting on data or have executed for a sufficient amount of time
(maybe preamble code in some functions checks to see how much time
has elapsed). Unlike threads, they don't get preempted?
Each process allocates memory from different pages or from a
different allocator, so it is not possible to share memory (in
similar way that OS processes avoid sharing memory).
Presumably, having separate allocators or pages per Erlang process would also help with
garbage collection, and in the case that the process ends, then the
pages can be returned without having to do any garbage collection:
http://prog21.dadgum.com/16.html
Basically they are Threads ;) One address space for them.
Related
I have an online service which is single-threaded. Since the whole program shares one memory pool (using new or malloc), one module might corrupt memory and cause another module to work incorrectly. I want to split the program into two parts, each running on its own thread. Is it possible to isolate thread memory the way separate processes are isolated, so I can find where the problem is? (Splitting it into multiple processes would cost a lot of time and is risky, so I don't want to try that.)
As long as you use threads, memory can easily be corrupted since, BY DESIGN, threads share the same memory. Splitting your program across two threads won't help in ANY manner for security - it can greatly help with CPU load, latency, performance, etc. but in no way as an anti memory corruption mechanism.
So either you ensure through careful development that your code won't plow memory where it must not, or you use multiple processes - those are isolated from each other by the operating system.
You can try to sanitize your code by using tools designed for this purpose, but it depends on your platform - you didn't give it in your question.
It can go from a simple Debug compilation with MSVC under Windows, up to a Valgrind analysis under Linux, and so on - the list can be huge, so please give additional information.
Also, if it's your threading code that contains the error, maybe rethink what you're doing: switching it to multiprocess may be the cheapest solution in the end - don't be fooled by sunk cost, especially since threading can NOT protect part 1 against part 2 and reciprocally...
Isolation like this is quite often done by using separate processes.
The downsides of doing that are
harder communication between the processes (but that's kind of the point)
the overhead of starting a process is typically a lot larger than that of starting a thread. So you would not typically spawn a new process for each request, for example.
A common model is a lead process that starts a child process to do the request serving. The lead process just monitors the health of the worker.
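As a rough illustration of the lead/worker model, here is a minimal sketch assuming a POSIX system (the question doesn't state the platform). The names run_isolated, healthy_worker and crashing_worker are made up for this example; the "worker" is just a function run in a forked child, and the lead process only looks at how the child ended.

```cpp
#include <cassert>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical workers: one finishes normally, one simulates a module
// that corrupts memory badly enough to kill the process.
int healthy_worker() { return 0; }
int crashing_worker() { std::abort(); }

// Runs `work` in a child process. Returns the child's exit code, or -1 if
// fork failed or the child was killed by a signal. Whatever the child does
// to its own memory cannot touch the lead process.
int run_isolated(int (*work)()) {
    assert(work != nullptr);
    pid_t pid = fork();
    if (pid < 0) return -1;          // fork failed
    if (pid == 0) _exit(work());     // child: do the work, then exit
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_isolated(crashing_worker) returns -1 in the lead process, which is still alive and could restart the worker. A real service would also need some IPC channel to hand requests to the child, which is left out here.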
Or you could fix your code so that it doesn't get corrupted (an easy thing to say, I know)
I have a C++ program running on Linux in which a new thread is created to do some computationally expensive work independent of the main thread (The computational work completes by writing the results to files, which end up being very large). However, I'm getting relatively poor performance.
If I implement the program straightforward (without introducing other threads), it completes the task in roughly 2 hours. With the multi-threaded program it takes around 12 hours to do the same task (this was tested with only one thread spawned).
I've tried a couple of things, including pthread_setaffinity_np to set the thread to a single CPU (out of the 24 available on the server I'm using), as well as pthread_setschedparam to set the scheduling policy (I've only tried SCHED_BATCH). But the effects of these have so far been negligible.
Are there any general causes for this kind of problem?
EDIT: I've added some example code that I'm using, which is hopefully the most relevant parts. The function process_job() is what actually does the computational work, but it would be too much to include here. Basically, it reads in two files of data, and uses these to perform queries on an in-memory graph database, in which the results are written to two large files over a period of hours.
EDIT part 2: Just to clarify, the problem is not that I want to use threads to increase the performance of an algorithm I have. But rather, I want to run many instances of my algorithm simultaneously. Therefore, I expect the algorithm would run at a similar speed when put in a thread as it would if I didn't use multi-threads at all.
EDIT part 3: Thanks for the suggestions all. I'm currently doing some unit tests (seeing which parts are slowing down) as some have suggested. As the program takes a while to load and execute, it is taking time to see any results from the tests and therefore I apologize for late responses. I think the main point I wanted to clarify is possible reasons why threading could cause a program to run slowly. From what I gather from the comments, it simply shouldn't be. I'll post when I can find a reasonable resolution, thanks again.
(FINAL) EDIT part 4: It turns out that the problem was not related to threading after all. Describing it would be too cumbersome at this point (including the use of compiler optimization levels), but the ideas posted here were very useful and appreciated.
struct sched_param sched_param = {
sched_get_priority_min(SCHED_BATCH)
};
// Pin the given thread to a single core; returns 0 on success.
int set_thread_to_core(const pthread_t tid, const int core_id) {
cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(core_id, &mask);
return pthread_setaffinity_np(tid, sizeof(mask), &mask);
}
void *worker_thread(void *arg) {
job_data *temp = (job_data *)arg; // get the information for the task passed in
...
pthread_t tid = pthread_self(); // pthread_self() returns a pthread_t, not a long
int set_thread = set_thread_to_core(tid, slot_id); // assume slot_id is 1 (it is in the test case I run)
pthread_setschedparam(tid, SCHED_BATCH, &sched_param); // priority already set in sched_param above
int success = process_job(...); // this is where all the work actually happens
pthread_exit(NULL);
}
int main(int argc, char* argv[]) {
...
pthread_t temp;
pthread_create(&temp, NULL, worker_thread, (void *) &jobs[i]); // jobs is a vector of a class type containing information for the task
...
return 0;
}
If you have plenty of CPU cores, and have plenty of work to do, it should not take longer to run in multithreaded than single threaded mode - the actual CPU time may be a fraction longer, but the "wall-clock time" should be shorter. I'm pretty sure that your code has some sort of bottleneck where one thread is blocking the other.
This is because of one or more of these things - I'll list them first, then go into detail below:
Some lock in a thread is blocking the second thread from running.
Sharing of data between threads (either true or "false" sharing)
Cache thrashing.
Competition for some external resource causing thrashing and/or blocking.
Badly designed code in general...
Some lock in a thread is blocking the second thread from running.
If there is a thread that takes a lock, and another thread wants to use the resource that is locked by this thread, it will have to wait. This obviously means the thread isn't doing anything useful. Locks should be kept to a minimum by only taking the lock for a short period. You can use some code to identify whether locks are holding up your code, such as:
while (!tryLock(some_lock))
{
tried_locking_failed[lock_id][thread_id]++;
}
total_locks[some_lock]++;
Printing some stats of the locks would help to identify where the locking is contentious - or you can try the old trick of "Press break in the debugger and see where you are" - if a thread is constantly waiting for some lock, then that's what's preventing progress...
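For what it's worth, the counting idea above can be sketched in standard C++ roughly like this. CountingMutex and run_contended_workload are names I made up for the example; the wrapper tries the lock once, records a miss if that fails, and then blocks normally instead of spinning. Note that try_lock() is allowed to fail spuriously, so treat the contended count as an estimate.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical wrapper around std::mutex that keeps contention statistics.
struct CountingMutex {
    std::mutex m;
    std::atomic<long> acquisitions{0};  // total successful lock() calls
    std::atomic<long> contended{0};     // lock() calls whose first try failed

    void lock() {
        if (!m.try_lock()) {  // first attempt failed: count it as contended
            ++contended;
            m.lock();         // then wait for the lock in the usual way
        }
        ++acquisitions;
    }
    void unlock() { m.unlock(); }
};

// Starts `threads` workers that each take the lock `iterations` times to
// bump a shared counter. Returns the final counter value.
long run_contended_workload(CountingMutex& cm, int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    long shared = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&cm, &shared, iterations] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<CountingMutex> guard(cm);
                ++shared;
            }
        });
    }
    for (auto& th : pool) th.join();
    return shared;
}
```

After a run, comparing cm.contended against cm.acquisitions gives a rough idea of how often threads had to wait for that particular lock.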
Sharing of data between threads (either true or "false" sharing)
If two threads use the same variable (and update its value frequently), then the two threads will have to swap "I've updated this" messages, and each CPU has to fetch the data from the other CPU before it can continue with its use of the variable. Since data is shared on a per-cache-line level, and a cache line is typically 64 bytes on current x86 CPUs (32 bytes on some older ones), something like:
int var[NUM_THREADS];
...
var[thread_id]++;
would classify as something called "false sharing" - the ACTUAL data updated is unique per CPU, but since the data is within the same cache line, the cores will still have updated the same area of memory.
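One common way around false sharing is to pad each per-thread counter out to a full cache line. The sketch below assumes a 64-byte line (adjust kCacheLine for your CPU); PaddedCounter and count_with_padding are illustrative names, not anything from the question's code.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines

// alignas rounds sizeof(PaddedCounter) up to kCacheLine, so neighbouring
// counters in an array are a full cache line apart.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
};

// Each thread increments only its own padded counter; the per-thread
// results are summed after all threads have been joined.
long count_with_padding(int threads, long increments) {
    assert(threads > 0 && increments >= 0);
    std::vector<PaddedCounter> counters(threads);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counters, t, increments] {
            for (long i = 0; i < increments; ++i)
                counters[t].value++;
        });
    }
    for (auto& th : pool) th.join();
    long total = 0;
    for (const auto& c : counters) total += c.value;
    return total;
}
```

The cost is memory: every counter now occupies 64 bytes instead of the size of a long. For counters this simple, accumulating in a local variable and writing once at the end avoids the problem just as well.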
Cache thrashing.
If two threads do a lot of memory reading and writing, the cache of the CPU may be constantly throwing away good data to fill it with data for the other thread. There are some techniques available to ensure that two threads don't run in "lockstep" on which part of cache the CPU uses. If the size of the data is a power of two and fairly large (a multiple of the cache size), it's a good idea to "add an offset" for each thread - for example 1KB or 2KB. That way, when the second thread reads the same distance into the data region, it will not overwrite exactly the same area of cache that the first thread is currently using.
Competition for some external resource causing thrashing and/or blocking.
If two threads are reading or writing from/to the hard-disk, network card, or some other shared resource, this can lead to one thread blocking another thread, which in turn means lower performance. It is also possible that the code detects different threads and does some extra flushing to ensure that data is written in the correct order or similar, before starting work with the other thread.
It is also possible that there are locks internally in the code that deals with the resource (user-mode library or kernel mode drivers) that block when more than one thread is using the same resource.
Generally bad design
This is a "catchall" for "lots of other things that can be wrong". If the result from one calculation in one thread is needed to progress the other, obviously, not a lot of work can be done in that thread.
Another example is too small a work unit, so that all the time is spent starting and stopping the thread, and not enough work is being done. Say for example that you dole out numbers to each thread with the task "calculate if this is a prime", one number at a time: it will probably take a lot longer to give the number to the thread than to calculate "is this actually a prime number". The solution is to give a set of numbers (perhaps 10, 20, 32, 64 or such) to each thread, and then report back the result for the whole lot in one go.
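To make the batching point concrete, here is a small sketch. It goes further than batches of 32 or 64 and simply gives each thread one contiguous range, which is the same idea taken to its limit. The names is_prime and count_primes_batched are mine, and trial division is used only to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Simple trial division; fine for a demonstration, slow for large n.
bool is_prime(unsigned n) {
    if (n < 2) return false;
    for (unsigned d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Counts the primes in [0, limit). Each thread gets one contiguous batch
// and reports a single result for the whole batch.
unsigned count_primes_batched(unsigned limit, unsigned threads) {
    assert(threads > 0);
    std::vector<unsigned> partial(threads, 0);
    std::vector<std::thread> pool;
    const unsigned chunk = (limit + threads - 1) / threads;  // round up
    for (unsigned t = 0; t < threads; ++t) {
        const unsigned begin = std::min(limit, t * chunk);
        const unsigned end = std::min(limit, begin + chunk);
        pool.emplace_back([&partial, t, begin, end] {
            unsigned local = 0;
            for (unsigned n = begin; n < end; ++n)
                if (is_prime(n)) ++local;
            partial[t] = local;  // one write per batch, not one per number
        });
    }
    for (auto& th : pool) th.join();
    unsigned total = 0;
    for (unsigned c : partial) total += c;
    return total;
}
```

The per-thread overhead (creating the thread, writing the result) is paid once per batch, so it shrinks relative to the useful work as the batch grows.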
There are plenty of other "bad design". Without understanding your code it's quite hard to say for sure.
It is entirely possible that your problem is none of the ones I've mentioned here, but most likely it is one of these. Hopefully this answer is helpful to identify the cause.
Read CPU Caches and Why You Care to understand why a naive port of an algorithm from one thread to multiple threads will more often than not result in greatly reduced performance and negative scalability. Algorithms that are specifically designed for parallelism take care of overactive interlocked operations, false sharing and other causes of cache pollution.
Here are a few things you might wanna look into.
1°) Do you enter any critical section (locks, semaphores, etc.) between your worker thread and your main thread? (This should be the case if your queries modify the graph.) If so, that could be one of the sources of the multithreading overhead: threads competing for a lock usually degrade performance.
2°) You're using a 24-core machine, which I assume would be NUMA (Non-Uniform Memory Access). Since you set the thread affinities during your tests, you should pay close attention to the memory topology of your hardware. Looking at the files in /sys/devices/system/cpu/cpuX/ can help you with that (beware that cpu0 and cpu1 aren't necessarily close together, and thus do not necessarily share memory). Threads heavily using memory should use local memory (allocated in the same NUMA node as the core they're executing on).
3°) You are heavily using disk I/O. Which kind of I/O is that? If every thread performs synchronous I/O every time, you might wanna consider asynchronous system calls, so that the OS stays in charge of scheduling those requests to the disk.
4°) Some cache issues have already been mentioned in other answers. From experience, false sharing can hurt performance as much as you're observing. My last recommendation (which should have been my first) is to use a profiler tool, such as Linux Perf or OProfile. With the kind of performance degradation you're experiencing, the cause will certainly appear quite clearly.
The other answers have all addressed the general guidelines that can cause your symptoms. I will give my own, hopefully not excessively redundant version. Then I will talk a bit about how you can get to the bottom of the problem with everything discussed in mind.
In general, there's a few reasons you'd expect multiple threads to perform better:
A piece of work is dependent on some resources (disk, memory, cache, etc.) while other pieces can proceed independently of these resources or said workload.
You have multiple CPU cores that can process your workload in parallel.
The main reasons, enumerated above, you'd expect multiple threads to perform less well are all based on resource contention:
Disk contention: already explained in detail and can be a possible issue, especially if you are writing small buffers at a time instead of batching
CPU time contention if the threads are scheduled onto the same core: probably not your issue if you're setting affinity. However, you should still double check
Cache thrashing: similarly probably not your problem if you have affinity, though this can be very expensive if it is your problem.
Shared memory: again talked about in detail and doesn't seem to be your issue, but it wouldn't hurt to audit the code to check it out.
NUMA: again talked about. If your worker thread is pinned to a different core, you will want to check whether the work it needs to access is local to the main core.
Ok so far not much new. It can be any or none of the above. The question is, for your case, how can you detect where the extra time is coming from. There's a few strategies:
Audit the code and look for obvious areas. Don't spend too much time doing this as it's generally unfruitful if you wrote the program to begin with.
Refactor the single threaded code and the multi-threaded code to isolate one process() function, then profile at key checkpoints to try to account for the difference. Then narrow it down.
Refactor the resource access into batches, then profile each batch on both the control and the experiment to account for the difference. Not only will this tell you which areas (disk access vs memory access vs spending time in some tight loop) you need to focus your efforts on, doing this refactor might even improve your running time overall. Example:
First copy the graph structure to thread-local memory (perform a straight-up copy in the single-threaded case)
Then perform the query
Then setup an asynchronous write to disk
Try to find a minimally reproducible workload with the same symptoms. This means changing your algorithm to do a subset of what it already does.
Make sure there's no other noise in the system that could've caused the difference (if some other user is running a similar system on the work core).
My own intuition for your case:
Your graph structure is not NUMA friendly for your worker core.
The kernel can still schedule other tasks onto the core your worker is pinned to, which takes CPU time away from it. This can happen if you haven't isolated that core with the isolcpus boot parameter.
I can't tell you what's wrong with your program because you haven't shared enough of it to do a detailed analysis.
What I can tell you is if this was my problem the first thing I would try is to run two profiler sessions on my application, one on the single threaded version and another on the dual thread configuration. The profiler report should give you a pretty good idea of where the extra time is going. Note that you may not need to profile the entire application run, depending on the problem the time difference may become obvious after you profile for a few seconds or minutes.
As far as profiler choices for Linux you may want to consider oprofile or as a second choice gprof.
If you find you need help interpreting the profiler output feel free to add that to your question.
It can be a right pain in the rear to track down why threads aren't working as planned. One can do so analytically, or one can use a tool to show what's going on. I've had very good mileage out of ftrace, Linux's clone of Solaris's dtrace (which in turn is based on what VxWorks, Greenhill's Integrity OS and Mercury Computer Systems Inc have been doing for a looong time.)
In particular I found this page very useful: http://www.omappedia.com/wiki/Installing_and_Using_Ftrace, particularly this and this section. Don't worry about it being an OMAP orientated website; I've used it on X86 Linuxes just fine (though you may have to build a kernel to include it). Also remember that the GTKWave viewer is primarily intended for looking at log traces from VHDL developments, which is why it looks 'odd'. It's just that someone realised that it would be a usable viewer for sched_switch data too, and that saved them writing one.
Using the sched_switch tracer you can see when (but not necessarily why) your threads are running, and that might be enough to give you a clue. The 'why' can be revealed by careful examination of some of the other tracers.
If you are getting slowdown from using 1 thread, it is likely due to overhead from using thread safe library functions, or from thread setup. Creating a thread for each job will cause significant overhead, but probably not as much as you refer to.
In other words, it is probably some overhead from some thread safe library function.
The best thing to do, is to profile your code to find out where time is spent. If it is in a library call, try to find a replacement library or implement it yourself. If the bottleneck is thread creation/destruction try reusing threads, for instance using OpenMP tasks or std::async in C++11.
Some libraries are really nasty wrt thread safe overhead. For instance, many rand() implementations use a global lock, rather than using thread-local PRNGs. Such locking overhead is much larger than generating a number, and is hard to track without a profiler.
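If rand() does turn out to be the culprit, a thread-local generator from the C++11 <random> header is one possible replacement. This is only a sketch; thread_local_random and sum_dice_rolls are names invented for the example, and seeding from std::random_device is an assumption about what randomness quality you need.

```cpp
#include <cassert>
#include <random>
#include <thread>
#include <vector>

// Returns a uniformly distributed integer in [lo, hi]. Each thread lazily
// creates its own engine on first use, so no lock is shared between threads.
int thread_local_random(int lo, int hi) {
    assert(lo <= hi);
    thread_local std::mt19937 engine{std::random_device{}()};
    std::uniform_int_distribution<int> dist(lo, hi);
    return dist(engine);
}

// Hypothetical driver: every thread rolls a six-sided die `rolls` times;
// the per-thread sums are added up after all threads have been joined.
long sum_dice_rolls(int threads, int rolls) {
    assert(threads > 0 && rolls >= 0);
    std::vector<long> partial(threads, 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&partial, t, rolls] {
            long local = 0;
            for (int i = 0; i < rolls; ++i)
                local += thread_local_random(1, 6);
            partial[t] = local;
        });
    }
    for (auto& th : pool) th.join();
    long total = 0;
    for (long p : partial) total += p;
    return total;
}
```

Whether this actually helps depends on your library's rand() implementation, so profile before and after.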
The slowdown could also stem from small changes you have made, for instance declaring variables volatile, which generally should not be necessary.
I suspect you're running on a machine with one single-core processor. This problem is not parallelizable on that kind of system. Your code is constantly using the processor, which has a fixed number of cycles to offer to it. It actually runs more slowly because the additional thread adds expensive context switching to the problem.
The only kinds of problems that parallelize well on a single-processor machine are those that allow one path of execution to run while another is blocked waiting for I/O, and situations (such as keeping a responsive GUI) where allowing one thread to get some processor time is more important than executing your code as quickly as possible.
If you only want to run many independent instances of your algorithm can you just submit multiple jobs (with different parameters, can be handled by a single script) to your cluster? That would eliminate the need to profile and debug your multithreaded program. I don't have much experience with multithreaded programming but if you use MPI or OpenMP then you'd have to write less code for the book keeping too. For example, if some common initialization routine is needed and the processes can run independently thereafter you can just do that by initializing in one thread and doing a broadcast. No need for maintaining locks and such.
I have a Win32 C++ application. I have to load 330,000 objects into memory. With a sequential approach it takes around 16 min. In the threaded approach I divide the 330,000 objects equally among 10 containers. I create 10 threads and assign each thread one container of 33,000 objects to load into memory. This approach took around 9 min.
Increasing the number of threads did not help.
Will I get any further improvement if I use a thread pool?
As always without specifics, it depends.
Are you loading objects from disk or creating them in memory? If you're loading them from disk then it's probably IO bound so increasing the number of threads probably won't help very much.
In the comment you mentioned you are loading from a database. I presume when you use threads you are making N queries simultaneously? It might be worth investigating the database console to understand how it's coping with many concurrent queries.
On the other hand, if the objects are created as the result of some CPU bound process (e.g. calculating pi) then chances are that increasing the number of threads beyond the number of CPUs won't increase performance (and as ronag points out in the comments, will probably hurt performance due to the increased context switching).
Are there dependencies between the objects? That would again affect how things go.
You'd typically use a thread pool if you have a collection of independent tasks that you want to run with a configurable way of running them. It sounds like using a thread pool would be a good way to run lots of benchmarks with various thread settings. You could also make the number of threads configurable which would help when running on different architectures/systems.
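For benchmarking with different thread counts, a very small fixed-size pool is enough. The sketch below uses only the standard library so it isn't tied to Win32; ThreadPool and run_jobs_on_pool are names made up for this example, and the "job" is just a counter increment standing in for loading one object.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal fixed-size pool: workers pull tasks from a shared queue.
class ThreadPool {
public:
    explicit ThreadPool(std::size_t workers) {
        assert(workers > 0);
        for (std::size_t i = 0; i < workers; ++i)
            threads_.emplace_back([this] { run(); });
    }
    // Finishes every queued task, then joins the workers.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (auto& t : threads_) t.join();
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so other workers can proceed
        }
    }
    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};

// Hypothetical benchmark helper: runs `jobs` trivial tasks on `workers`
// threads and returns how many of them completed.
int run_jobs_on_pool(std::size_t workers, int jobs) {
    std::atomic<int> done{0};
    {
        ThreadPool pool(workers);
        for (int i = 0; i < jobs; ++i)
            pool.submit([&done] { ++done; });
    }  // pool destructor drains the queue and joins the workers
    return done.load();
}
```

Timing run_jobs_on_pool-style code for 1, 2, 4, 8... workers with a real load function in place of the counter is a quick way to see where the returns stop.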
IME, and yours, a few threads will speed up this kind of task. I'm guessing that the overall throughput is improved because of better use of the 'intelligent' disk caching available on modern controllers - the disk/controller spends less time idle because there are always threads wanting to read something. Diminishing returns set in, however, after only a few threads are loaded in and you are disk-bound. In a slightly similar app, I found that any more than 6 threads provided no additional advantage & just used up more memory.
I can't see how pooling, or otherwise, of these threads would make any difference to performance - it's just a big job that has to be done :(
Tell your customers that they have to install an SSD
A few years ago, in the Windows environment, I did some testing, by letting multiple instances of CPU computation intensive + memory access intensive + I/O access intensive application run. I developed 2 versions: One is running under multi-processing, another is running under multi-threading.
I found that the performance is much better for multi-processing. I read somewhere (but I can't remember the site) that the reason is that under multi-threading, the threads are "fighting" for a single memory pipeline and I/O pipeline, which makes the performance worse compared to multi-processing.
However, I can't find that article anymore. I have been wondering, to this day, whether the below still holds true:
In Windows, having the algorithm
code run under multi-processing, there is a high
chance that the performance will be
better than multi-threading.
It depends on how much the various threads or processes (I'll be using the collective term "tasks" for both of them) need to communicate, especially by sharing memory: that's easy, cheap and fast for threads, but not at all for processes, so, if a lot of it is going on, I bet processes' performance is not going to beat threads'.
Also, processes (esp. on Windows) are "heavier" to get started, so if a lot of "task starts" occur, again threads can easily beat processes in terms of performance.
Next, you can have CPUs with "hyperthreading", which can run (at least) two threads on a core very rapidly -- but, not processes (since the "hyperthreaded" threads cannot be using distinct address spaces) -- yet another case in which threads can win performance-wise.
If none of these considerations apply, then the race should be no better than a tie, anyway.
I'm not sure what the quote even means. It's very close to nonsense.
The primary thing that in-proc threads share is virtual memory address space.
I found this to be true as well, but I think it has something to do with the scheduling, because if you run it long enough, multi-processing is just as fast as multi-threading. That threshold is about 10 seconds: if the algorithm needs to run for 10 seconds, multi-processing is as fast as multi-threading. But if it only needs to run for less than 1 second, multi-processing is much, much faster than multi-threading.
My unix/windows C++ app is already parallelized using MPI: the job is split across N CPUs and each chunk is executed in parallel, quite efficient, very good speed scaling, the job is done right.
But some of the data is repeated in each process, and for technical reasons this data cannot be easily split over MPI (...).
For example:
5 Gb of static data, exact same thing loaded for each process
4 Gb of data that can be distributed in MPI; the more CPUs are used, the smaller this per-CPU RAM is.
On a 4 CPU job, this would mean at least a 20Gb RAM load, most of memory 'wasted', this is awful.
I'm thinking of using shared memory to reduce the overall load; the "static" chunk would be loaded only once per computer.
So, main question is:
Is there any standard MPI way to share memory on a node? Some kind of readily available + free library ?
If not, I would use boost.interprocess and use MPI calls to distribute local shared memory identifiers.
The shared memory would be read by a "local master" on each node, and shared read-only. No need for any kind of semaphore/synchronization, because it won't change.
Any performance hit or particular issues to be wary of?
(There won't be any "strings" or overly weird data structures, everything can be brought down to arrays and structure pointers)
The job will be executed in a PBS (or SGE) queuing system; if a process exits uncleanly, I wonder if those will clean up the node-specific shared memory.
One increasingly common approach in High Performance Computing (HPC) is hybrid MPI/OpenMP programs. I.e. you have N MPI processes, and each MPI process has M threads. This approach maps well to clusters consisting of shared memory multiprocessor nodes.
Changing to such a hierarchical parallelization scheme obviously requires some more or less invasive changes, OTOH if done properly it can increase the performance and scalability of the code in addition to reducing memory consumption for replicated data.
Depending on the MPI implementation, you may or may not be able to make MPI calls from all threads. This is specified by the required and provided arguments to the MPI_Init_thread() function that you must call instead of MPI_Init(). Possible values are
MPI_THREAD_SINGLE
Only one thread will execute.
MPI_THREAD_FUNNELED
The process may be multi-threaded, but only the main thread will make MPI calls (all MPI calls are "funneled" to the main thread).
MPI_THREAD_SERIALIZED
The process may be multi-threaded, and multiple threads may make MPI calls, but only one at a time: MPI calls are not made concurrently from two distinct threads (all MPI calls are "serialized").
MPI_THREAD_MULTIPLE
Multiple threads may call MPI, with no restrictions.
In my experience, modern MPI implementations like Open MPI support the most flexible MPI_THREAD_MULTIPLE. If you use older MPI libraries, or some specialized architecture, you might be worse off.
Of course, you don't need to do your threading with OpenMP, that's just the most popular option in HPC. You could use e.g. the Boost threads library, the Intel TBB library, or straight pthreads or windows threads for that matter.
I haven't worked with MPI, but if it's like other IPC libraries I've seen that hide whether other threads/processes/whatever are on the same or different machines, then it won't be able to guarantee shared memory. Yes, it could handle shared memory between two nodes on the same machine, if that machine provided shared memory itself. But trying to share memory between nodes on different machines would be very difficult at best, due to the complex coherency issues raised. I'd expect it to simply be unimplemented.
In all practicality, if you need to share memory between nodes, your best bet is to do that outside MPI. I don't think you need to use boost.interprocess-style shared memory, since you aren't describing a situation where the different nodes are making fine-grained changes to the shared memory; it's either read-only or partitioned.
John's and deus's answers cover how to map in a file, which is definitely what you want to do for the 5 Gb (gigabit?) static data. The per-CPU data sounds like the same thing, and you just need to send a message to each node telling it what part of the file it should grab. The OS should take care of mapping virtual memory to physical memory to the files.
As for cleanup... I would assume it doesn't do any cleanup of shared memory, but mmapped files should be cleaned up since files are closed (which should release their memory mappings) when a process is cleaned up. I have no idea what caveats CreateFileMapping etc. have.
Actual "shared memory" (i.e. boost.interprocess) is not cleaned up when a process dies. If possible, I'd recommend trying killing a process and seeing what is left behind.
With MPI-2 you have RMA (remote memory access) via functions such as MPI_Put and MPI_Get. Using these features, if your MPI installation supports them, would certainly help you reduce the total memory consumption of your program. The cost is added complexity in coding but that's part of the fun of parallel programming. Then again, it does keep you in the domain of MPI.
MPI-3 offers shared memory windows (see e.g. MPI_Win_allocate_shared()), which allows usage of on-node shared memory without any additional dependencies.
I don't know much about unix, and I don't know what MPI is. But in Windows, what you are describing is an exact match for a file mapping object.
If this data is embedded in your .EXE or a .DLL that it loads, then it will automatically be shared between all processes. Teardown of your process, even as a result of a crash, will not cause any leaks or unreleased locks of your data. However, a 9Gb .dll sounds a bit iffy, so this probably doesn't work for you.
However, you could put your data into a file, then CreateFileMapping and MapViewOfFile on it. The mapping can be readonly, and you can map all or part of the file into memory. All processes will share pages that are mapped from the same underlying CreateFileMapping object. It's good practice to unmap views and close handles, but if you don't, the OS will do it for you on teardown.
Note that unless you are running x64, you won't be able to map a 5Gb file into a single view (or even a 2Gb file, 1Gb might work). But given that you are talking about having this already working, I'm guessing that you are already x64 only.
If you store your static data in a file, you can use mmap on unix to get random access to the data. Data will be paged in as and when you need access to a particular bit of the data. All that you will need to do is overlay any binary structures over the file data. This is the unix equivalent of CreateFileMapping and MapViewOfFile mentioned above.
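A minimal sketch of the mmap approach, assuming a POSIX system and a file that holds a plain binary array. MappedFile, map_read_only and unmap_file are names invented for this example; error handling is reduced to returning an empty mapping.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical handle for a read-only mapping; data is nullptr on failure.
struct MappedFile {
    const void* data;
    std::size_t size;
};

// Maps the whole file read-only. Pages are loaded on first access, and
// processes on the same machine that map the same file share the
// physical pages.
MappedFile map_read_only(const char* path) {
    assert(path != nullptr);
    MappedFile result{nullptr, 0};
    int fd = open(path, O_RDONLY);
    if (fd < 0) return result;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        const std::size_t size = static_cast<std::size_t>(st.st_size);
        void* p = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            result.data = p;
            result.size = size;
        }
    }
    close(fd);  // the mapping stays valid after the descriptor is closed
    return result;
}

// Releases the mapping and resets the handle.
void unmap_file(MappedFile& f) {
    if (f.data != nullptr) {
        munmap(const_cast<void*>(f.data), f.size);
        f.data = nullptr;
        f.size = 0;
    }
}
```

Overlaying a structure is then just a cast, e.g. static_cast<const int*>(mapped.data) for a file of ints, provided the file was written with the same layout and endianness that the reader expects.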
Incidentally, glibc uses mmap when one calls malloc to request a large block (by default, more than 128 KB).
I did some projects with MPI at SHUT.
As far as I know, there are many ways to distribute a problem using MPI; maybe you can find another solution that does not require shared memory.
My project was solving a system of 7,000,000 equations in 7,000,000 variables.
If you can explain your problem, I will try to help you.
I ran into this problem in the small when I used MPI a few years ago.
I am not certain that the SGE understands memory mapped files. If you are distributing against a beowulf cluster, I suspect you're going to have coherency issues. Could you discuss a little about your multiprocessor architecture?
My draft approach would be to set up an architecture where each part of the data is owned by a defined CPU. There would be two threads: one thread being an MPI two-way talker and one thread for computing the result. Note that MPI and threads don't always play well together.