I'm looking for guidance on how to solve the concurrency problem that arises when two processes try to access a shared resource (a text file). The solution must use an algorithm and an array as the data structure to coordinate the execution of the two processes.
I'm using C++ and the POSIX API. I have read about several solutions to concurrency problems, but they use semaphores, locks and other methods, not arrays. Any guidance on how to do it using arrays?
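I suspect you are being asked to produce an implementation of Peterson's algorithm (or similar). This uses an integer and an array of booleans to implement a mutex, without requiring any platform support for locks or atomic read-modify-write operations. Note that on modern compilers and CPUs the reads and writes of those variables still have to be sequentially consistent (for example, std::atomic with its default ordering), otherwise reordering can break the algorithm.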
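A minimal sketch of Peterson's algorithm for two participants, assuming C++11 std::atomic with its default sequentially consistent ordering. For brevity it uses two threads; for two processes the flag array and the turn variable would have to live in shared memory, which is not shown here. The names PetersonLock and run_demo are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Peterson's algorithm for exactly two participants, numbered 0 and 1.
// State: an array of two flags plus one integer, as the answer describes.
struct PetersonLock {
    std::atomic<bool> wants_in[2];
    std::atomic<int> turn;

    PetersonLock() : turn(0) {
        wants_in[0].store(false);
        wants_in[1].store(false);
    }

    void lock(int me) {
        assert(me == 0 || me == 1);
        const int other = 1 - me;
        wants_in[me].store(true);   // announce interest
        turn.store(other);          // give priority to the other side
        // Wait while the other side is interested and has priority.
        while (wants_in[other].load() && turn.load() == other) {
            std::this_thread::yield();
        }
    }

    void unlock(int me) { wants_in[me].store(false); }
};

// Hypothetical demo: two threads increment a plain (non-atomic) counter,
// protected only by the Peterson lock.
long run_demo(int iterations) {
    PetersonLock lock;
    long counter = 0;
    auto worker = [&](int me) {
        for (int i = 0; i < iterations; ++i) {
            lock.lock(me);
            ++counter;              // critical section
            lock.unlock(me);
        }
    };
    std::thread t0(worker, 0);
    std::thread t1(worker, 1);
    t0.join();
    t1.join();
    return counter;
}
```

If mutual exclusion holds, run_demo(n) returns 2 * n. In the file-access scenario, the write to the text file would take the place of the counter increment.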
I need to synchronize intermediate solutions of an optimization problem solved distributively over a number of worker processors. The solution vector is known to be sparse.
I have noticed that if I use MPI_Allreduce, the performance is good compared to my own allreduce implementation.
However, I believe, the performance can be further improved if AllReduce could communicate only the nonzero entries in the solution vector. I could not find any such implementation of AllReduce.
Any ideas?
It seems that MPI_Type_indexed cannot be used, as the indices of the nonzero entries are not known in advance.
There aren't sparse collectives in MPI. It's something that the MPI Forum has discussed in the past (to what end, I don't know), but there has also been research in the area. Usually though, when discussing these sorts of things in the forum, I believe they relate more to collectives that don't involve all processes rather than all of the data.
As Hristo said in the comments, the goal of MPI (according to some) has always been to enable more optimized tricks on top of MPI and to just use it as a low-level library to abstract the communication calls. Obviously, this isn't how MPI has actually been used most of the time, but you can still write your own sparse collectives. Sounds like a good paper to me.
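As a rough illustration of what a hand-written sparse collective has to do locally, here is a sketch of the combine step: summing two sparse vectors stored as (index, value) pairs sorted by index. The communication itself is deliberately left out (each rank could, for instance, exchange its pairs with MPI_Allgatherv and then merge them); SparseVec and sparse_sum are hypothetical names, and the sorted-pair representation is an assumption.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sparse vector as (index, value) pairs, assumed sorted by index
// with no duplicate indices.
using SparseVec = std::vector<std::pair<std::size_t, double>>;

// Element-wise sum of two sparse vectors; only nonzero entries are touched.
SparseVec sparse_sum(const SparseVec& a, const SparseVec& b) {
    SparseVec out;
    out.reserve(a.size() + b.size());
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].first < b[j].first) {
            out.push_back(a[i++]);
        } else if (b[j].first < a[i].first) {
            out.push_back(b[j++]);
        } else {
            // Same index on both sides: add the values.
            out.emplace_back(a[i].first, a[i].second + b[j].second);
            ++i;
            ++j;
        }
    }
    // Copy whatever remains of either input.
    for (; i < a.size(); ++i) out.push_back(a[i]);
    for (; j < b.size(); ++j) out.push_back(b[j]);
    return out;
}
```

The cost is proportional to the number of nonzeros rather than to the full vector length, which is where any gain over a dense MPI_Allreduce would have to come from.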
Similar problem here. Most likely you will need to implement your custom MPI_Allreduce().
There is an optimized implementation here. Very possibly you have already found this link: https://fs.hlrs.de/projects/par/mpi//myreduce.html
If you want ideas for a better-performing implementation, here are some:
https://dl.acm.org/citation.cfm?id=2642791
https://dl.acm.org/citation.cfm?id=2642773
Note that they don't provide an implementation, and you may need to pay a small fee.
Good luck
I am looking for a way to write fast code and be able to use builtin vector operations (for the sake of readability).
Fortran seems to be a good candidate. However, almost all the resources I find on the web are about writing code without array expressions and have only trivial examples of vector operations.
I feel a strong need for a good resource that covers the caveats and gives some insight into optimizing code with vector expressions.
Example:
Currently I am not even able to predict the behavior of code like this:
! a = [0], indices = [1, 1]
a(indices) = a(indices) + 1
After compiling I get a = [2], but is this correct? If I use OpenMP, will it behave the same way?
Personally, I would be very happy to have something like the following NumPy resources:
100 numpy exercises
numpy: tips and tricks to work with data
Getting the Best Performance out of NumPy
Your code is not standard conforming:
Fortran 2008 6.5.3.3.2.3:
If a vector subscript has two or more elements with the same value,
an array section with that vector subscript shall not appear in a
variable definition context (16.6.7). NOTE 6.15
Therefore the result of your operation is not defined by the standard.
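To see why duplicate indices are a problem, here is a small C++ sketch (not Fortran, and not what any particular compiler is known to do) of two plausible ways of evaluating a(indices) = a(indices) + 1. Both function names are made up, and indices are 0-based here.

```cpp
#include <cstddef>
#include <vector>

// Strategy 1: evaluate the whole right-hand side first, then scatter.
// With duplicate indices the last write wins.
std::vector<int> gather_then_scatter(std::vector<int> a,
                                     const std::vector<std::size_t>& idx) {
    std::vector<int> rhs;
    rhs.reserve(idx.size());
    for (std::size_t i : idx) rhs.push_back(a[i] + 1);            // gather
    for (std::size_t k = 0; k < idx.size(); ++k) a[idx[k]] = rhs[k];  // scatter
    return a;
}

// Strategy 2: update element by element, in order.
// With duplicate indices every occurrence is accumulated.
std::vector<int> accumulate_in_order(std::vector<int> a,
                                     const std::vector<std::size_t>& idx) {
    for (std::size_t i : idx) a[i] += 1;
    return a;
}
```

For a = [0] and indices = [0, 0], the first strategy yields [1] and the second yields [2], so the statement has no single obvious meaning once an index repeats.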
Other parts of your question appear to be too broad to treat them here. There are many books about scientific programming in Fortran 90 and later.
Also be aware that by vectorization most people in Fortran, C or C++ mean the use of SIMD instructions, not the vectorized expressions of NumPy. Those are just array expressions in Fortran.
I have scanned many sources (~20 books and dozens of web pages), so it is unlikely that I missed something really important. The question I posted is indeed incorrect and comes from my initially high expectations of array operations in Fortran.
The answer I would expect is: there are no tools for writing short, readable code in Fortran with automatic parallelization (to be more precise: there are, but those are proprietary libraries).
The list of intrinsic functions available in Fortran is quite short (link) and consists only of functions that map easily to SIMD operations. Many functions one would want are missing.
While this could be resolved by a separate library with a separate implementation for each platform, Fortran doesn't provide one. There are commercial options (see this thread).
Brief examples of missing functions:
No built-in array sort or unique. The proposed way is to use this library, which provides single-threaded code (forget threads and CUDA).
Cumulative sum / running sum. One can trivially implement it, but the resulting code will never run well on threads/CUDA/Xeon Phi/whatever comes next.
bincount, numpy.ufunc.at, numpy.ufunc.reduceat (which is very useful in many applications)
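For comparison, this is roughly what the "trivial" running sum looks like when a standard library supplies it, sketched in C++ with std::partial_sum. This version is sequential; running_sum is a hypothetical helper name. C++17 also has std::inclusive_scan, which has overloads taking an execution policy, but that is not used here.

```cpp
#include <numeric>
#include <vector>

// Inclusive running sum: out[k] = in[0] + ... + in[k].
std::vector<double> running_sum(const std::vector<double>& in) {
    std::vector<double> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin());
    return out;
}
```

For the input {1, 2, 3, 4} this gives {1, 3, 6, 10}.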
In most cases Fortran provides a 2x speedup even with simple implementations, but the code written will always be single-threaded, while MATLAB/NumPy functions can be reimplemented for a GPU or another parallel platform without any effort on the user's side (as has occasionally happened with MATLAB; also see gnumpy, theano and parakeet).
To conclude, this is bad news for me. Fortran developers really care about having fast programs today, not in the future. I also can't lock my code into proprietary software, so I'm still looking for an appropriate tool. (Julia is the current candidate.)
See also:
STL analogue in fortran, where ready-to-use algorithms are asked for.
Numerical recipes: the art of parallel programming, whose author implements basic MATLAB-like operations to make the code more expressive.
I also find these notes useful for seeing the recommended ways of optimizing code (and for seeing that there is no place for vector operations).
numpy, fortran, blitz++: a case study
A discussion about implementing unique in Fortran, where proprietary tools are recommended.
I'm fond of dispatch_data_t. It provides a useful abstraction on top of a range of memory: it provides reference counting, allows consumers to create arbitrary sub-ranges (which participate in the ref counting of the parent range), concatenate sub-ranges, etc. (I won't bother to get into the gory details -- the docs are right over here: Managing Dispatch Data Objects)
I've been trying to find out if there's a C++11 equivalent, but the terms "range", "memory" and "reference counting" are pretty generic, which is making googling for this a bit of a challenge. I suspect that someone who spends more time with the C++ Standard Library than I do might know off the top of their head.
Yes, I'm aware that I can use the dispatch_data_t API from C++ code, and yes, I'm aware that it would not be difficult to crank out a naive first pass implementation of such a thing, but I'm specifically looking for something idiomatic to C++, and with a high degree of polish/reliability. (Boost maybe?)
No.
Range views are being proposed for future standard revisions, but they are non-owning.
dispatch_data_t is highly tied to GCD in that cleanup occurs in a specified queue determined at creation: to duplicate that behaviour, we would need thread pools and queues in std, which we do not have.
As you have noted, an owning, overlapping, immutable range type into sparse or contiguous memory would not be hard to write up. Fully polished, it would have to support allocators and some kind of raw input buffer system (type erasure on the owning/destruction mechanism?), have utilities for asynchronous iteration by block (with tuned block size), deal with errors and exceptions carefully, and offer some way to efficiently turn views with a reference count of 1 into mutable versions.
Something that complex would first have to show up in a library like Boost and go through iterative improvements. And as it has so many facets, something with enough of its properties for your purposes may already be there.
If you roll your own I encourage you to submit it for boost consideration.
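For reference, a naive first pass at the owning part might look like the sketch below. It assumes a std::shared_ptr keeps the underlying buffer alive for every sub-range, and SharedRange is a made-up name. Concatenation, non-contiguous storage, allocators and queue-based cleanup are all left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Immutable, reference-counted view over a byte buffer. Sub-ranges share
// ownership of the parent buffer, so they remain valid after the parent
// object itself is destroyed.
class SharedRange {
public:
    explicit SharedRange(std::vector<std::uint8_t> bytes)
        : owner_(std::make_shared<std::vector<std::uint8_t>>(std::move(bytes))),
          offset_(0),
          size_(owner_->size()) {}

    // Returns a view of [offset, offset + length) relative to this range.
    SharedRange subrange(std::size_t offset, std::size_t length) const {
        assert(offset <= size_ && length <= size_ - offset);
        return SharedRange(owner_, offset_ + offset, length);
    }

    const std::uint8_t* data() const { return owner_->data() + offset_; }
    std::size_t size() const { return size_; }
    long owners() const { return owner_.use_count(); }

private:
    SharedRange(std::shared_ptr<const std::vector<std::uint8_t>> owner,
                std::size_t offset, std::size_t size)
        : owner_(std::move(owner)), offset_(offset), size_(size) {}

    std::shared_ptr<const std::vector<std::uint8_t>> owner_;  // keeps buffer alive
    std::size_t offset_;
    std::size_t size_;
};
```

The buffer is freed when the last range referring to it goes away, which mirrors the reference-counting behaviour described in the question but none of the GCD integration.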
I'm doing some linear algebra math, and was looking for some really lightweight and simple to use matrix class that could handle different dimensions: 2x2, 2x1, 3x1 and 1x2 basically.
I presume such a class could be implemented with templates, using specialization in some cases for performance.
Does anybody know of a simple implementation available for use? I don't want "bloated" implementations, as I'll be running this in an embedded environment where memory is constrained.
Thanks
You could try Blitz++ -- or Boost's uBLAS
I've recently looked at a variety of C++ matrix libraries, and my vote goes to Armadillo.
The library is heavily templated and header-only.
Armadillo also leverages templates to implement a delayed evaluation framework (resolved at compile time) to minimize temporaries in the generated code (resulting in reduced memory usage and increased performance).
However, these advanced features are only a burden to the compiler and not your implementation running in the embedded environment, because most Armadillo code 'evaporates' during compilation due to its design approach based on templates.
And despite all that, one of its main design goals has been ease of use - the API is deliberately similar in style to Matlab syntax (see the comparison table on the site).
Additionally, although Armadillo can work standalone, you might want to consider using it with LAPACK (and BLAS) implementations available to improve performance. A good option would be for instance OpenBLAS (or ATLAS). Check Armadillo's FAQ, it covers some important topics.
A quick search on Google dug up this presentation showing that Armadillo has already been used in embedded systems.
std::valarray is pretty lightweight.
I use the Newmat library for matrix computations. It's open source and easy to use, although I'm not sure it fits your definition of lightweight (it includes over 50 source files, which Visual Studio compiles into a 1.8MB static library).
CML matrix is pretty good, but may not be lightweight enough for an embedded environment. Check it out anyway: http://cmldev.net/?p=418
Another option, although it may be too late, is:
https://launchpad.net/lwmatrix
I for one wasn't able to find a simple enough library, so I wrote one myself: http://koti.welho.com/aarpikar/lib/
I think it should be able to handle different matrix dimensions (2x2, 3x3, 3x1, etc.) by simply setting some rows or columns to zero. It won't be the fastest approach, since internally all operations are done with 4x4 matrices, although in theory there might be processors that can handle 4x4 operations in one tick. At least I would much rather believe in the existence of such processors than go optimizing those low-level matrix calculations. :)
How about just storing the matrix in an array, like
2x3 matrix = {2,3,val1,val2,...,val6}
This is really simple, and addition operations are trivial. However, you need to write your own multiplication function.
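A sketch of how the flat-array idea could be combined with templates, so that the dimensions are checked at compile time and no heap memory is used. The Matrix type here is hypothetical and is not taken from any of the libraries mentioned above.

```cpp
#include <array>
#include <cstddef>

// Fixed-size matrix stored row-major in a flat std::array (no heap use).
template <typename T, std::size_t Rows, std::size_t Cols>
struct Matrix {
    std::array<T, Rows * Cols> v;

    T& operator()(std::size_t r, std::size_t c) { return v[r * Cols + c]; }
    const T& operator()(std::size_t r, std::size_t c) const {
        return v[r * Cols + c];
    }
};

// Element-wise addition; the dimensions must match at compile time.
template <typename T, std::size_t R, std::size_t C>
Matrix<T, R, C> operator+(const Matrix<T, R, C>& a, const Matrix<T, R, C>& b) {
    Matrix<T, R, C> out{};
    for (std::size_t i = 0; i < R * C; ++i) out.v[i] = a.v[i] + b.v[i];
    return out;
}

// Matrix product; the inner dimension K must agree, or it will not compile.
template <typename T, std::size_t R, std::size_t K, std::size_t C>
Matrix<T, R, C> operator*(const Matrix<T, R, K>& a, const Matrix<T, K, C>& b) {
    Matrix<T, R, C> out{};  // value-initialized, so all elements start at zero
    for (std::size_t r = 0; r < R; ++r)
        for (std::size_t c = 0; c < C; ++c)
            for (std::size_t k = 0; k < K; ++k)
                out(r, c) += a(r, k) * b(k, c);
    return out;
}
```

With this, a Matrix<float, 2, 2> times a Matrix<float, 2, 1> yields a Matrix<float, 2, 1>, while multiplying a 2x1 by a 2x2 is rejected by the compiler.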
I'm building a distributed C++ application that needs to do lots of serialization and deserialization of data stored in std containers.
Currently Boost.Serialization is used, but it performs terribly. Our B-tree also uses Boost.Serialization to store key-value pairs; if we replace Boost.Serialization with memcpy, access speed improves 10 times or more. Because the distributed platform requires so much data exchange, we need ease of programming together with high performance. I know Protocol Buffers could also be used as a serialization mechanism, but I am not sure how its performance compares with Boost.Serialization. Also, is there a better solution that provides performance as close to memcpy as possible?
Thanks
Someone asked a very similar question:
C++ Serialization Performance
It looks like Protocol Buffers are a good way to go, though without knowing your application's requirements, it is hard to recommend any particular library or technique.
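If the data is mostly std::vector of plain structs or numbers, a memcpy-style encoding is easy to sketch. The version below assumes the element type is trivially copyable and that sender and receiver share the same architecture, endianness and struct layout, which a real distributed system may not guarantee. to_bytes and from_bytes are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Layout: 8-byte element count, followed by the raw element bytes.
template <typename T>
std::vector<char> to_bytes(const std::vector<T>& items) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy serialization needs trivially copyable elements");
    const std::uint64_t count = items.size();
    std::vector<char> buf(sizeof(count) + count * sizeof(T));
    std::memcpy(buf.data(), &count, sizeof(count));
    if (count != 0) {
        std::memcpy(buf.data() + sizeof(count), items.data(), count * sizeof(T));
    }
    return buf;
}

template <typename T>
std::vector<T> from_bytes(const std::vector<char>& buf) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy serialization needs trivially copyable elements");
    std::uint64_t count = 0;
    assert(buf.size() >= sizeof(count));
    std::memcpy(&count, buf.data(), sizeof(count));
    // Sanity check only; untrusted input would need real error handling.
    assert(buf.size() == sizeof(count) + count * sizeof(T));
    std::vector<T> items(count);
    if (count != 0) {
        std::memcpy(items.data(), buf.data() + sizeof(count), count * sizeof(T));
    }
    return items;
}
```

This does not handle containers of non-trivial types such as std::string or std::map, and it has no versioning, which is part of what Boost.Serialization and Protocol Buffers pay for.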