Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 2 years ago.
I have been exploring parallel I/O in Chapel. The Chapel documentation mentions a parallel I/O flag and says that channels can work in parallel, but I can't find anything beyond that.
I don't have a particular question in mind; I just want to know more about it.
Could the Chapel team or experienced Chapel practitioners share an example of proper use of Chapel's parallel I/O paradigm?
Parallel I/O can mean different things to different people, which makes it challenging to have a single, simple answer to this (though it might suggest that the Chapel project should add a parallel I/O landing page to its documentation which would point to other resources?). For example, "parallel I/O" could mean:
using multiple tasks (on a single node or across multiple nodes) to write to a single file
using multiple tasks to write to multiple files
using a parallel file system of some sort
Another important factor is the desired file format: text, binary, or a specific file format like HDF5, NetCDF, etc.
Generally speaking, the explicit way of doing parallel I/O in Chapel is to create a number of tasks using Chapel's language features for expressing parallelism (e.g., coforall, cobegin, or begin), and then to give each task its own channel to read from / write to. If all of the channels refer to a single file, the tasks would likely need to coordinate between themselves to make sure they were writing to / reading from disjoint segments of the file. If each channel refers to its own file, such coordination wouldn't be necessary.
The other major way to get parallel I/O in Chapel is implicit, by invoking a library routine where the parallelism is created and managed within the routine itself—either using techniques like the above for routines written in Chapel, or by calling out to an external parallel function (e.g., a parallel I/O routine from a C library).
Finally, you could create multiple tasks that call serial (or parallel) I/O library routines simultaneously.
For an example of the first, explicit approach, see this sample program that I recently put together in response to a similar question. It declares a 2D array whose rows are block-distributed and then uses a task per locale (compute node) to write that locale's sub-array out to a single/shared binary-format file. It then does a similar thing to read the data back into a second array and verifies that the two arrays match. In both cases, each task advances its channel to the appropriate file offset corresponding to the values it wants to write/read.
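The offset-per-task pattern described above is not specific to Chapel. As a rough analogue only, here is a minimal C++ sketch (not Chapel code and not the sample program mentioned above) in which each thread opens its own stream on one shared binary file and seeks to its own disjoint byte range. The names `writeInParallel` and `readAll` are made up for illustration, and the sketch assumes the platform allows several independent streams to write to the same file, which is the case on typical POSIX systems.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: writes `data` to one shared file using several threads.
// Each thread owns a disjoint slice, so no coordination is needed while writing.
void writeInParallel(const std::string& path,
                     const std::vector<double>& data,
                     std::size_t numWorkers) {
    assert(numWorkers > 0);
    // Create the file and give it its final size up front.
    { std::ofstream create(path, std::ios::binary | std::ios::trunc); }
    std::filesystem::resize_file(path, data.size() * sizeof(double));

    const std::size_t chunk = (data.size() + numWorkers - 1) / numWorkers;
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < numWorkers; ++w) {
        const std::size_t begin = std::min(w * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        workers.emplace_back([&path, &data, begin, end] {
            // Each worker has its own stream; in|out avoids truncating the file.
            std::fstream out(path, std::ios::binary | std::ios::in | std::ios::out);
            // Advance this worker's stream to the start of its own segment.
            out.seekp(static_cast<std::streamoff>(begin * sizeof(double)));
            out.write(reinterpret_cast<const char*>(data.data() + begin),
                      static_cast<std::streamsize>((end - begin) * sizeof(double)));
        });
    }
    for (auto& worker : workers) worker.join();
}

// Reads the whole file back serially so the result can be compared.
std::vector<double> readAll(const std::string& path, std::size_t count) {
    std::vector<double> result(count);
    std::ifstream in(path, std::ios::binary);
    in.read(reinterpret_cast<char*>(result.data()),
            static_cast<std::streamsize>(count * sizeof(double)));
    return result;
}
```

Reading could be parallelized the same way, with each thread calling `seekg` on its own stream. If every worker wrote to its own file instead, the offset arithmetic would disappear entirely.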
Examples of the library-based approach to parallel I/O include the hdf5WriteDistributedArray() routine which logically does something very similar to the previous example, yet using the HDF5 file format. Or, the readAllHDF5Files() routine is an example of a library routine that reads from multiple files in parallel.
I think it's safe to say that Chapel should support many more library routines to help with parallel I/O than it has today. The main challenge is knowing which patterns and formats from the space outlined above will be most important to users. We're always open to requests and input in this regard.
I want to implement a fast database alternative that only needs to handle binary data.
To specify, I want something close to a database that will be securely stored even in case of a forced termination (task manager) during execution, whilst also being accessed directly from memory in C++. Like a vector of structs that is mirrored to the hard disk. It should be able to handle hundreds of thousands of read accesses and at least 1000 write accesses per second. In case of a forced termination, at most the last command can be lost. It does not need to support multithreading and the database file will only be accessed by a single instance of the program. Only needs to run on Windows. These are the solutions I've thought of so far:
SQL Databases
Advantages
Easy to implement, since lots of libraries are available
Disadvantages
The server runs in a different process, so inter-process communication may be slow
Necessity of parsing SQL queries
Built for multithreaded environments, so lots of unnecessary synchronization
Rows can't be directly accessed using pointers but need to be copied at least twice per change
Unnecessary delays on the UPDATE query, since the whole table needs to be searched and the WHERE clause checked
These are just a few off the top of my head; there might be a lot more
Memory Mapped Files
Advantages
Direct memory mapping, so direct pointer access possible
Very fast compared to databases
Disadvantages
Forceful termination could lead to a whole page not being written
Lots of code (I don't actually mind that)
No forced synchronization possible
Increasing file size might take a lot of time
C++ vector*
Advantages
Direct pointer access possible, however, needs to manually notify of changes
Very fast compared to databases
Total programming freedom
Disadvantages
Possibly slow because of many calls to WriteFile
Lots of code (I don't actually mind that)
C++ vector with complete write every few seconds
Advantages
Direct pointer access possible
Very fast compared to databases
Total programming freedom
Disadvantages
Lots of unchanged data being rewritten to file, alternatively lots of RAM wasted on preventing unnecessary writes
Inaccessibility during writes, or lots of RAM wasted on a copy
Could lose multiple seconds worth of data
Multiple threads and therefore synchronization needed
*Basically, a wrapper class that either exposes only per-row read/write functionality of a vector, or allows direct writes to memory but relies on the caller to notify it of changes. All reads are served from a copy in memory; all writes go to that in-memory copy and to the file itself on a per-command basis.
Also, is it possible to write to different parts of a file without flushing, and then flushing all changes at once with a guarantee that the file will be written either completely or not at all even in case of a forced termination during write? All I can think of is the following workflow:
Duplicate target file on startup, then for every set of data:
Write all changes to duplicate -> Flush by replacing original with duplicate
However, I feel like this would be a horrible waste of hard disk space for big files.
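For reference, the duplicate-then-replace workflow described above can be sketched in portable C++17 as follows. This is only an illustration under stated assumptions: `replaceFileContents` is a made-up name, the ".tmp" suffix is arbitrary, and `std::filesystem::rename` is atomic with respect to an existing target on POSIX systems, while on Windows the behaviour depends on the standard library implementation (the Win32 calls `ReplaceFile` or `MoveFileEx` with `MOVEFILE_REPLACE_EXISTING` are the usual native route). The stream flush only hands the data to the operating system; surviving a power loss would additionally need an OS-level sync such as `fsync` or `FlushFileBuffers`, which is not shown.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

// Hypothetical helper: writes `bytes` to a sibling temporary file, then swaps
// it into place so readers see either the old contents or the new contents.
bool replaceFileContents(const std::filesystem::path& target,
                         const std::string& bytes) {
    assert(!target.empty());
    std::filesystem::path tmp = target;
    tmp += ".tmp";  // assumption: same directory, hence same volume as target
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out) return false;
        out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
        out.flush();  // pushes to the OS, not necessarily to the disk
        if (!out) return false;
    }
    std::error_code ec;
    std::filesystem::rename(tmp, target, ec);  // the "flush by replacing" step
    return !ec;
}

// Reads a whole file into a string; used to inspect the result.
std::string readFile(const std::filesystem::path& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

If the process is killed before the rename, the original file is untouched and only the temporary file is incomplete. The cost is the one the question already names: a full second copy of the data for every commit.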
Thanks in advance for any input!
I want to create an application that stores data in memory, but I don't want the data to be lost even if my app crashes.
What concept should I use?
Should I use shared memory, or is there some other concept that suits my requirement better?
You are asking for persistence (or even orthogonal persistence) and/or for application checkpointing.
This is not possible in the general case (at least through portable C++ code) for arbitrary existing C++ code, for several reasons: ASLR, pointers on or into the local call stack, multi-threading, external resources (sockets, opened files, ...), and the fact that the current continuation cannot be accessed, restored, or handled in standard C++.
However, you might design your application with persistence in mind. This is a strong architectural requirement. You could for instance have every class contain some dumping method and its load factory function. Beware of shared pointers, and take into account that you could have cyclic references. Study garbage collection algorithms (e.g. in the GC Handbook), which are similar to those needed for persistence (a copying GC is quite similar to a checkpointing algorithm).
Look also into serialization libraries (like libs11n). You might also consider persisting into a textual format (e.g. JSON), perhaps inside some SQLite database (or some real database like PostgreSQL or MongoDB...). I am doing this (in C) in my monimelt software.
You might also consider checkpointing libraries like BLCR.
The important thing is to think about persistence & checkpointing very early at design time. Thinking of your application as some specialized bytecode interpreter or VM might help (notably if you want to persist continuations, or some form of "call stack").
You could fork your process (assuming you are on Linux or POSIX) before persisting, so that the child writes out a snapshot while the parent keeps running. Hence, persistence time does not matter that much (e.g. if you persist every hour or every ten minutes).
Some language implementations are able to persist their entire state (notably their heap), e.g. SBCL (a good Common Lisp implementation) with its save-lisp-and-die, or Poly/ML -an ML dialect- with its SaveState, or Squeak (a Smalltalk implementation).
See also this answer & that one. J.Pitrat's blog has a related entry: CAIA as a sleeping beauty.
Persisting data together with code (e.g. vtables of objects, function pointers) might be technically difficult. dladdr(3), together with dlsym, might help (and, if you are able to code machine-specific things, consider the old getcontext(3), but I don't recommend that). Avoid name mangling (for dlsym) by declaring all code related to persistence extern "C". If you want to persist some data and be able to restart from it with a slightly modified program (e.g. a small bugfix), things are much more complex.
More pragmatically, you could have a class representing your entire persistable state, and implement methods to persist (and reload it). You would then persist only at certain steps of your algorithm (e.g. if you have a main loop or an event loop, at start of that loop). You probably don't want to persist too often (e.g. because of the time and disk space required to persist), e.g. perhaps every ten minutes. You might perhaps consider some transaction log if it fits in the overall picture of your application.
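As a minimal sketch of the "one class for the whole persistable state" idea, assuming a trivially simple state (an iteration counter plus a vector of doubles) and a whitespace-separated text format chosen purely for illustration. The names `AppState`, `dump`, `load`, `checkpointToString` and `restoreFromString` are hypothetical and do not come from any library mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <limits>
#include <optional>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical persistable state: everything that must survive a restart.
struct AppState {
    std::uint64_t iteration = 0;
    std::vector<double> values;

    // Dumping method: writes the state as "iteration count v0 v1 ...".
    void dump(std::ostream& out) const {
        // max_digits10 significant digits let doubles round-trip through text.
        out.precision(std::numeric_limits<double>::max_digits10);
        out << iteration << ' ' << values.size();
        for (double v : values) out << ' ' << v;
        out << '\n';
    }

    // Load factory: returns nothing if the input is truncated or malformed.
    static std::optional<AppState> load(std::istream& in) {
        AppState state;
        std::size_t count = 0;
        if (!(in >> state.iteration >> count)) return std::nullopt;
        for (std::size_t i = 0; i < count; ++i) {
            double v = 0.0;
            if (!(in >> v)) return std::nullopt;
            state.values.push_back(v);
        }
        return state;
    }
};

// Convenience wrappers so a checkpoint can be held or inspected as a string.
std::string checkpointToString(const AppState& state) {
    std::ostringstream out;
    state.dump(out);
    assert(out.good());
    return out.str();
}

std::optional<AppState> restoreFromString(const std::string& text) {
    std::istringstream in(text);
    return AppState::load(in);
}
```

In an event loop you would call `dump` on a file stream at the start of an iteration, every so often, and call `load` once at startup. State containing pointers or cycles needs considerably more care than this flat example suggests.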
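Use memory-mapped files, i.e. mmap (https://en.wikipedia.org/wiki/Mmap), and allocate all your structures inside the mapped memory region. With a shared, file-backed mapping, the system will still write the modified pages to the file when your app crashes; only an OS crash or power loss can lose pages that have not yet been synced (see msync).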
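A minimal POSIX-only sketch of that idea, assuming a single fixed-size struct kept in a file-backed `MAP_SHARED` mapping. The names `Counter` and `incrementPersistentCounter` are made up for illustration; real code would keep the mapping open for the lifetime of the program rather than mapping and unmapping per call, and Windows would need `CreateFileMapping`/`MapViewOfFile` instead.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#include <cassert>
#include <cstdint>

// Hypothetical persistent structure that lives directly in the mapped file.
struct Counter {
    std::uint64_t value;
};

// Maps the file, increments the counter in place, and reports the new value.
// Returns false if any system call fails.
bool incrementPersistentCounter(const char* path, std::uint64_t& result) {
    assert(path != nullptr);
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return false;
    // Make sure the file is large enough; a newly created file is zero-filled.
    if (ftruncate(fd, sizeof(Counter)) != 0) {
        close(fd);
        return false;
    }
    void* mem = mmap(nullptr, sizeof(Counter), PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (mem == MAP_FAILED) return false;

    auto* counter = static_cast<Counter*>(mem);
    result = ++counter->value;  // an ordinary memory write, backed by the file

    // Optional: force the page to disk now instead of whenever the OS decides.
    msync(mem, sizeof(Counter), MS_SYNC);
    munmap(mem, sizeof(Counter));
    return true;
}
```

Calling the function from two successive runs of a program continues counting where the previous run stopped. Pointers stored inside the mapped region would not survive a restart unless the mapping address is fixed, so offsets are the safer choice there.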
Can someone recommend approaches to parallelization in C++ when the data to be acted upon is huge? I have been reading about OpenMP and Intel's TBB for parallelization in C++, but have not experimented with them yet. Which of these is better for parallel data processing? Are there any other libraries or approaches?
"large" and "data processing" cover a lot of ground here, and it's hard to give a sensible answer without more information.
If the data processing is "embarrassingly parallel" -- if it involves doing lots and lots of calculations that are completely independent of each other -- then there are a million things that will work and it's just a matter of finding something that matches your code and background.
If it isn't embarrassingly parallel, but nearly so - the computations take a big chunk of data but just distill it into a handful of numbers - there are fewer, but still lots of, options.
If the calculation is more tightly coupled than this - where you need the processors to work in tandem on big chunks of data - then you're probably stuck with the standbys: the OpenMP features of your compiler if it will work on a single machine (there's TBB, too, but for number crunching OpenMP is usually faster and easier), or MPI if it needs several machines simultaneously. You mentioned C++; Boost has a very nice MPI layer.
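To make the "distill a big chunk of data into a handful of numbers" case concrete, here is a small sketch using OpenMP's reduction clause. The function names are made up for illustration. When the compiler is invoked without OpenMP support (e.g. without `-fopenmp` on GCC or Clang), the pragma is ignored and the loop simply runs serially with the same result.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sums the squares of all elements. Each OpenMP thread accumulates a private
// partial sum, and the reduction clause combines them at the end of the loop.
double sumOfSquares(const std::vector<double>& data) {
    double total = 0.0;
    const long long n = static_cast<long long>(data.size());
#pragma omp parallel for reduction(+ : total)
    for (long long i = 0; i < n; ++i) {
        const double x = data[static_cast<std::size_t>(i)];
        total += x * x;
    }
    return total;
}

// Distills the whole data set into a single number: the root mean square.
double rootMeanSquare(const std::vector<double>& data) {
    assert(!data.empty());
    return std::sqrt(sumOfSquares(data) / static_cast<double>(data.size()));
}
```

Because floating-point addition is not exactly associative, the parallel result can differ from the serial one in the last few bits for general inputs.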
But thinking about which library to use for parallelization is probably thinking about the wrong end of the problem first. In many cases, you don't necessarily need to deal with these layers directly. If the number crunching involves lots of linear algebra (for instance), then PLASMA (for multicore machines - http://icl.cs.utk.edu/plasma/ ) or PETSc, which has support for distributed-memory machines, e.g. multiple computers ( http://www.mcs.anl.gov/petsc/petsc-as/ ), are good choices that can completely hide the actual details of the parallel implementation from you. Other sorts of techniques have other libraries, too. It's probably best to think about what sort of analysis you need to do, and look to see if existing toolkits have the amount of parallelization you need. Only once you've determined the answer is no should you start to worry about how to roll your own.
Both OpenMP and Intel TBB are for local use as they help in writing multithreaded applications.
If you have truly huge datasets, you may need to split load over several machines -- and then libraries like Open MPI for parallel programming with MPI come into play. Open MPI has a C++ interface, but you now also face a networking component and some administrative issues you do not have with a single computer.
MPI is also useful on a single local machine. It will run a job across multiple cores/CPUs; while this is probably overkill compared to threading, it does mean you can move the job to a cluster with no changes. Most MPI implementations also optimize a local job to use shared memory instead of TCP for data connections.
I'm looking for a framework / approach to do message passing distributed computation in C++.
I've currently got an iterative, single-threaded algorithm that incrementally updates some data model. The updates are literally additive, and I'd like to distribute (or at least parallelize) the computation over as many machines and cores as possible. The data model can be viewed as a big array of (independent) floating point values.
Since the updates are all additive (i.e. commutative and associative), it's OK to merge in updates from other nodes in arbitrary order or even to batch merge updates. When it comes to applying updates, the map/reduce paradigm would work fine.
On the other hand, the updates are computed with respect to the current model state. Each step "corrects" some flaw, so it's important that the model used for computing the update is as fresh as possible (the more out of date the model, the less useful the update). Worst case, the updates are fully dependent, and parallelism doesn't do any good.
I've never implemented anything flexibly distributable, but this looks like a prime candidate. So, I'm looking for some framework or approach to distribute the updates (which consist mostly of floating point numbers and a few indexes into the array to pinpoint where to add the update). But, I'm unsure as to how:
I can broadcast updates to all connected processes. But that means massive network traffic, so I'd realistically need to batch updates; and then updates will be less current. This doesn't look scalable anyhow.
I can do some kind of ring topology. Basically, a machine sends the next machine the sum of its own updates and those of its predecessors. But then I'd need to figure out how not to duplicate updates; after all, the ring is circular, and eventually its own updates will arrive as part of the sum of its predecessors.
or some kind of tree structure...
To recap, to get decent convergence performance, low latency is critical; the longer the delay between update computation and update application, the less useful the update is. Updates need to be distributed to all nodes as quickly as possible; but because of the commutative and associative nature of the updates, it doesn't matter whether these updates are individually broadcast (probably inefficient) or arrive as part of a merged batch.
Does anybody know of any existing frameworks or approaches to speed up development? Or even just general pointers? I've never done anything quite like this...
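The property the question relies on, that additive updates can be batched and merged in any order, can be sketched independently of any transport. The representation below (a sparse map from array index to delta) and the names `Update`, `mergeUpdates` and `applyUpdate` are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical sparse update: array index -> amount to add at that index.
using Update = std::map<std::size_t, double>;

// Combines two updates into one batch. Since the operation is plain addition,
// mergeUpdates(a, b) and mergeUpdates(b, a) describe the same change.
Update mergeUpdates(const Update& a, const Update& b) {
    Update merged = a;
    for (const auto& [index, delta] : b) {
        merged[index] += delta;  // a missing entry starts at 0.0
    }
    return merged;
}

// Adds every delta of an update (single or merged) into the model.
void applyUpdate(std::vector<double>& model, const Update& update) {
    for (const auto& [index, delta] : update) {
        assert(index < model.size());
        model[index] += delta;
    }
}
```

Applying two updates one after the other, or applying their merged batch once, leaves the model in the same state up to floating-point rounding, which is what makes reduce-style collectives and batching legitimate here. How stale a node's model may become before its updates stop being useful is a separate question that this sketch does not address.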
You probably want MPI (Message Passing Interface). It's essentially the industry standard for distributed computing. There are many implementations, but I would recommend Open MPI because it's both free and highly regarded. It provides you with a C API to pass messages between nodes, and also provides higher-level functionality like broadcast, all-to-all, reduce, scatter-gather, etc. It works over TCP, as well as over faster, lower-latency interconnects like InfiniBand or Myrinet, and supports various topologies.
There is also a Boost wrapper around MPI (Boost.MPI) that will provide you with a more C++ friendly interface.
Are you looking for something like Boost.MPI?
I have a serial solver in C++ for optimization problems, and I am supposed to parallelize it, running it with different parameters, to see whether that improves its performance. I am not sure whether I should use TBB or MPI. From a TBB book I read, I get the impression that TBB is more suitable for loops or fine-grained code. Since I do not have much experience with TBB, I find it difficult to divide my code into small parts in order to parallelize it. In addition, I find in the literature that many authors have used MPI to run several solvers in parallel and make them cooperate, so I guess MPI may fit my needs better. Since I do not have much knowledge of either TBB or MPI, can anyone tell me whether my feeling is right? Will MPI fit me better? If so, what material is good for starting to learn MPI? I have no experience with MPI, and I use Windows and C++. Thanks a lot.
The basic thing you need to have in mind is to choose between shared-memory and distributed-memory.
Shared-memory is when you have more than one process (normally more than one thread within a process) that can access a common memory. This can be quite fine-grained and it is normally simpler to adapt a single-threaded program to have several threads. You will need to design the program in a way that the threads work most of the time in separate parts of the memory (exploit data parallelism) and that the shared part is protected against concurrent accesses using locks.
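A minimal sketch of that shared-memory pattern using only the C++ standard library: each thread works lock-free on its own slice of a common vector, and only the small shared result is protected by a mutex. The function name `scaleAndSum` is made up for illustration, and plain `std::thread` stands in for what TBB or OpenMP would manage for you.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Multiplies every element by `factor` in place and returns the new sum.
double scaleAndSum(std::vector<double>& data, double factor,
                   std::size_t numThreads) {
    assert(numThreads > 0);
    double total = 0.0;     // shared by all threads
    std::mutex totalMutex;  // protects `total` against concurrent access

    const std::size_t chunk = (data.size() + numThreads - 1) / numThreads;
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < numThreads; ++t) {
        const std::size_t begin = std::min(t * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        threads.emplace_back([&data, &total, &totalMutex, factor, begin, end] {
            double local = 0.0;
            // Data parallelism: this slice belongs to this thread alone,
            // so no lock is needed while working on it.
            for (std::size_t i = begin; i < end; ++i) {
                data[i] *= factor;
                local += data[i];
            }
            // The shared part is touched once per thread, under the lock.
            std::lock_guard<std::mutex> lock(totalMutex);
            total += local;
        });
    }
    for (auto& thread : threads) thread.join();
    return total;
}
```

Accumulating into a thread-local variable first keeps the time spent holding the lock negligible compared with the time spent on the private slice.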
Distributed-memory means that you have different processes that might be executed on one or several distributed computers, but these processes have a common goal and share data through message-passing (data communication). There is no common memory space, and all the data one process needs from another process will require communication.
It is a more general approach but, because of communication requirements, it requires coarse grains.
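The message-passing style can be imitated inside a single program to show its shape, with the loud caveat that the sketch below uses threads as stand-ins for separate processes and a hand-written `Mailbox` class (a hypothetical name) instead of a real library such as MPI. Each worker receives a private copy of its data, computes on it in isolation, and communicates only by sending one message back.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <numeric>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical one-way channel: the only way workers talk to the coordinator.
class Mailbox {
public:
    void send(double message) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(message);
        }
        ready_.notify_one();
    }
    double receive() {  // blocks until a message is available
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        const double message = queue_.front();
        queue_.pop();
        return message;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<double> queue_;
};

// Sums `data` by handing coarse chunks to workers and collecting their replies.
double sumByMessagePassing(const std::vector<double>& data,
                           std::size_t numWorkers) {
    assert(numWorkers > 0);
    Mailbox toCoordinator;
    const std::size_t chunk = (data.size() + numWorkers - 1) / numWorkers;
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < numWorkers; ++w) {
        const std::size_t begin = std::min(w * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        // "Send" the input: the worker owns a private copy, no shared array.
        std::vector<double> local(data.begin() + static_cast<std::ptrdiff_t>(begin),
                                  data.begin() + static_cast<std::ptrdiff_t>(end));
        workers.emplace_back([local = std::move(local), &toCoordinator] {
            toCoordinator.send(std::accumulate(local.begin(), local.end(), 0.0));
        });
    }
    double total = 0.0;
    for (std::size_t w = 0; w < numWorkers; ++w) total += toCoordinator.receive();
    for (auto& worker : workers) worker.join();
    return total;
}
```

Copying each chunk and sending a reply are pure overhead compared with the shared-memory version, which is why this style pays off only when each message carries a coarse grain of work.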
TBB is a library for thread-based shared-memory parallelism, while MPI is a library for distributed-memory parallelism (it has simple primitives for communication and also scripts for launching several processes on different nodes).
The most important thing is for you to identify the parallelism within your solver and then choose the best solution. Do you have data parallelism (different threads/processes could be working in parallel on different chunks of data without the need for communication or for sharing parts of this data)? Task parallelism (different threads/processes could be performing a different transformation on your data or a different step in the data processing, in a pipeline or graph fashion)?