How to access multiple elements of an array efficienty in C++? - c++

This if my first post, I hope I'll meet the standards...
I'm translating into c++ (at which I'm quite new) a program originally written in MATLAB for reasons of efficiency. The piece of code I am actually working on resumes accesses to various indexes of a vector (matrix) in one step. For example, if M1 is a matrix of size, let's say, 10x15, the program would define a new one as follows:
idxs1 = [1 2 3];
idxs2 = [1 2 3 4 5];
M2 = M1 (idxs1 , idxs2);
resulting M2 as a matrix of size 3x5. Now, I guess what MATLAB actually does is access one by one the various places of M1 given by the indexes and then construct M2 by rearranging the many contents acquired, all very efficiently.
My question is, how can I reproduce such mechanism in c++? As far as I know there is no direct way to access in a row various indexes of an array, and the for loop I'm using seems rather cumbersome. Maybe there's some intelligent way to do it without demanding 'too much' processor time? Also, for the sake of educational purposes, I would be grateful if someone could explain what MATLAB actually does when such operation is performed.
Thanks in advance and sorry for the eventual inconveniences!
P.S: Just in case it adds anything to the question, I'm working with MEX files to link both languages.
P.S2: By the way, I found some related questions but regarding other languages:
python: Access multiple elements of an array
perl: How can I use an array slice to access several elements of an array simultaneously?
c: How to access multiple elements of array in one go?

"Armadillo is a high quality C++ linear algebra library, aiming towards a good balance between speed and ease of use
Useful for algorithm development directly in C++, or quick conversion of research code into production environments; the syntax (API) is deliberately similar to Matlab"
Link: http://arma.sourceforge.net/

Math program data structures can be some of the most elaborate out there. I can't even figure out what the 3rd line of your example actually does, so I can't even guess how MATLAB implements anything.
What I can tell you is, that one line of MATLAB is almost certainly hiding a ton of operations. If you want to recreate it you just need to make a utility function with a couple of for-loops that copy over all the correct indices one-by-one. Ultimately, this can't be too much different than one MATLAB does.
If you have a ton of matrix operations you need to support and you're on a large project, you might want to look into finding a C++ matrixes library. I don't have one to recommend, but Boost is a popular C++ library for many purposes including matrixes. (You can also make your own data structure, but not recommended for a novice.)

What MATLAB exactly does is left unspecified, and might very well differ from case to case depending on the indices, and even for a given set of indices it could differ from machine to machine. So let's not speculate.
In particular, it's left unspecified whether MATLAB physically copies M1. Such a copy can be faked, which saves time. The technique is known as "copy on write".
In C++, this would be possible as well, but harder. Furthermore, none of the existing container classes supports it.
If you're going to copy the elements, the CPU won't be the bottleneck. Instead, the memory bus will limit you. This is particularly the case when the indices aren't contiguous. For a 3x5 matrix, the time will be dominated by overhead - contiguity doesn't matter yet.

Related

Create a huge flag map in C++

So basically, what I need is to have 28,800 values which can be accessed by an index and can all be set to true or false. Using an array of bools or integers isn't an option, because the size needs to be set during rumtime using a parameter. Using a vector is way too slow and memory intensive. I'm new to C++ and therefore don't have any idea on how to solve this, can anyone help?
EDIT: Thank you to all who commented! Like I said, Im new to C++ programming and your answers really helped me to understand the functionality behind vectors.
So, after everyone said that vector isn't slow I checked again and it turns out that my program was running so slow because of another bug I had while filling the vector. But especially midor's and Some programmer dude's answers helped me to make the program run a bit faster than before, so thanks!
Using a vector is way too slow and memory intensive.
C++ specializes std::vector<bool> so it uses only as much memory as it needs. One bit per "flag" (+ bookkeeping overhead of course).
You could only optimize that implementation if you knew its size a priori (which you don't according to your question), or if you know that the bitmap will only contain very few set bits (e.g. 1bit in a 50'000, but you'd need to measure if a more complex implementation would be worth it). For sparse bitmaps an std::unordered_set<std::uint32_t> that stores the set bits could be an option.
But 28'800 is a very small number, so don't waste your time on optimizations. You won't get any benefits from it.

Writing optimal code in FORTRAN using array expressions

I am looking for a way to write fast code and be able to use builtin vector operations (for the sake of readability).
FORTRAN seems to be the good candidate. However, almost all resources I find on the web are about writing code without array expressions, and have only trivial examples of vector operations.
I feel strong need in some good resource which can cover caveats and give some insight into optimizations of code with vector expressions.
Example:
currently I am not even able to predict the behavior of such code:
! a = [0], indices = [1, 1]
a(indices) = a(indices) + 1
After compiling I get a = [2], but it this correct? If I use openmp, will it behave like this?
Personally, I would be very happy to have something like following examples on numpy:
100 numpy excercises
numpy: tips and tricks to work with data
Getting the Best Performance out of NumPy
Your code is not standard conforming:
Fortran 2008 6.5.3.3.2.3:
If a vector subscript has two or more elements with the same value,
an array section with that vector subscript shall not appear in a
variable definition context (16.6.7). NOTE 6.15
Therefore the result of your operation is not defined by the standard.
Other parts of your question appear to be too broad to treat them here. There are many books about scientific programming in Fortran 90 and later.
Also be aware that by vectorization most people in Fortran and C or C++ mean the usage of SIMD instructions simd and not the vectorized expressions from NumPy. These are just array expressions in Fortran.
I have scanned many sources (~20 books and dozens of web pages). Hard luck I missed something really important. The question I posted is indeed incorrect and comes from my initial high expectation about array operations in fortran.
The answer I would expect is: there are no tools to write short, readable code in fortran with automatic parallelization (to be more precise: there are, but those are proprietary libraries).
The list of intrinsic functions available in fortran is quite short
(link), and consists only of functions easily mapped to SIMD ops.
There are lots of functions that one will be missing.
while this could be resolved by separate library with separate implementation for each platform, fortran doesn't provide such. There are commercial options (see this thread)
Brief examples of missing functions:
no built-in array sort or unique. The proposed way is to use this library, which provides single-threaded code (forget threads and CUDA)
cumulative sum / running sum. One trivially can implement it, but the resulting code will never work fine on threads/CUDA/Xeon Phi/whatever comes next.
bincount, numpy.ufunc.at, numpy.ufunc.reduceat (which is very useful in many applications)
In most cases fortran provides 2x speed up even with simple implementations, but the code written will always be one-threaded, while matlab/numpy functions can be reimplemented for GPU or other parallel platform without any effort from user side (which occasionally happened to MATLAB, also see gnumpy, theano and parakeet)
To conclude, this is bad news for me. Fortran developers really care about having fast programs today, not in the future. I also can't lock my code on proprietary software. And I'm still looking for appropriate tool. (Julia is current candidate)
See also:
STL analogue in fortran
where ready-to-use algorithms are asked.
Numerical recipes: the art of parallel programming author implements basic MATLAB-like operations to have more expressive code
I also find useful these notes to see recommended ways of code optimizations (to see there is no place for vector operations)
numpy, fortran, blitz++: a case study
dicussion about implementing unique in fortran, where proprietary tools are recommended.

Can most of the data structures be implemented using vectors?

I used C++ vectors to implement stacks, queue, heaps, priority queue and directed weighted graphs. In the books and references, I have seen big classes for these data structures, all of which can be implemented in short using vectors. (May be there is more flexibility in using pointers)
Can we also implement even advanced data structures using vectors ?
If yes, why do C++ books still explain concepts with the long classes using pointers ?
Is it to keep in mind the lower level idea, if it is more vivid that way or it makes students equipped with such usage of pointers ?
It's true that many data structures can be implemented on top of a vector (array, for the sake of this answer), essentially all of them can, since every computation task can be implemented to run on a turing-machine which has a far more basic data access capability (or, in the real world, you may say that any program you implement with pointers eventually runs on a CPU with a simply array-like virtual memory space, so you could just call that a huge array). However, it's not always clever. Two main reasons :
performance / time complexity - a vector simply can't provide all basic operations that in O(1). There's a solution for fast initialization, but try to randomly insert values into a large vector and see how bad you perform - that's because you have to move all the elements by one place over and over. A list could do that in a single operation. Of course other structures have their own performance shortcomings, but that's the beauty of designing complicated data structures with these basic building blocks.
structural complexity - you can think of a list along the same line of a vector as an ordered container, and perhaps extend this into multidimensional matrices that can be implemented on top of them since they still retain some basic ordering, but there are more complicated structures. Take for e.g. a tree, a simple full binary tree one can be implemented with a vector very easily since the parent-child relations can be easily converted to index arithmetics, but what if the tree isn't full and has varying number of children per node? Now, you may say it can still be done (any graph can be implemented with vectors either through adjacency matrix or adjacency list for e.g.), but there's almost no sense in doing so when you can have a much simpler implementation using pointer links. Just think of doing an AVL roll with an array. :shudder:
Mind you that the second argument may very well boil down to performance ("hey, it's an awkward approach but I still managed to use a vector!"), but it's more than that - it would complicate your code, clutter your data structure design, and could make it far more prone to bugs.
Now, here comes the "but" - even though there's much sense in using all the possible tools the language provides you, it's very widely accepted to use vector-based structures for performance critical tasks. See almost all scientific CPU benchmarks, most of them ultimately rely on vectors (uncited, but I can elaborate further if anyone is interested. Suffice to say that even the well-known *graph*500 does that).
The reason is not that it's best programming practice, but that it's more suited with CPU internal structure and gets more "juice" out of the HW. That's due to spatial locality - CPUs are very fond of that as it allows the memory unit to parallelize accesses (in an array you always know where's the next element, in a list you have to wait until the current one is fetched), and also issue stream/stride prefetches that reduce latency of future requests.
I can't say this is always a good practice, when you run through a graph the accesses are still pretty irregular even if you use an array implementation, but it's still a very common practice.
To summarize, taking the question literally - most of them can, of sorts (for a given definition of "most", ok?), but if the intention was "why teach pointers", I believe you can see that in order to understand your limits and what you can and should use - you need to know a great deal more than just arrays and even pointers. A good programmer should know a bit about everything - OS design, CPU design, etc. You can't do anything decent unless you really understand the fabric you're running on, and that unfortunately (or not) includes lots of pointers
You can implement a kind of allocator using an std::vector as the backing store. If you do that, all the standard data structures from elementary computer science can be implemented on top of vectors. It will hardly free you from using pointers, though: vectors are really just chunks of memory with a few useful additional operations, most notably the ability to expand.
More to the point: if you don't understand pointers, you won't understand how to do use vector for advanced data structures either. vector is a useful abstraction, but it follows the C++ rule that "you don't get what you don't pay for", so it's also a very "thin" abstraction, and you do pay for the cost of abstraction in terms of the amount of code you have to write.
(Jonathan Wakely points out, in the comments, that you won't get the exact guarantees that the C++ standard library requires of allocators data structures when you implement them on top of vector. Put in principle, vectors are just a way of handling blocks of memory.)
If you are learning C++ you need to be familiar with pointers and how to use them even if there are more higher level concepts that does that job for you.
Yes, it is possible to implement most data structures with vectors or lists and if you just started learning programming it's probably a good idea that you'll know how to write these data structures yourself.
With that being said, production code should always use the standard library unless there is a good reason not to do so.

Matrix Libraries vs For Loops in C++

Could you tell me if using a matrix library results in a faster run-time than regular for-loops? Currently, I have some methods that use for-loops that iterate through multidimensional vectors to calculate matrix products and element-wise products, where the matrix size is roughly 1000 columns by 400 rows. This method is the most called method in my program and I would like to know if using a matrix library would increase the program's speed. Also, which library would you recommend (from http://eigen.tuxfamily.org/index.php?title=Benchmark, Eigen seems best to me)?
Thank You
Yes -- a fair number of C++ matrix libraries (E.g., MTL, uBLAS, Blitz++) use template metaprogramming to optimize their behavior. Absent a reason to do otherwise, I'd start with Boost uBlas. You might also want to look at the OO numerics libraries list for other possibilities.
I am trying to answer the question "should I" instead of "which one" because it isn't clear that you actually need such a library.
Would a matrix library improve execution time? Probably. The methods they teach you in high school are certainly not the fastest. However there are other issues to consider.
First, are you optimizing prematurely? Trying to make your program as fast as possible as soon as possible is tempting, but not always the right thing to do. You have to make the determination if doing so is really a valid way to spend your time.
Second, will speed have any significant effect on usability? Making a program work in 2 seconds instead of 4 seconds isn't really worth the effort.... but 30 hours instead of 60 hours? Maybe so. I like to put emphasis on getting everything working before doing the polishing.
Finally, I have met several examples of code somebody else wrote several years before which was utterly useless. Old libraries that couldn't be found or compiled with a new OS or compiler or something different meant that I had to completely rewrite something wasting weeks of my time. It may have seemed like a good idea originally to get that extra few percent performance, but it meant that their code had a limited life span, especially because of poor documentation.
Keep It Simple Stupid is an excellent mantra for so many things. I am a strong advocate for only using libraries when absolutely necessary, and then only using those which seem to be long lived and stable.

Matrix implementation benchmarks, should I whip myself?

I'm trying to find out some matrix multiplication/inversion benchmarks online. My C++ implementation can currently invert a 100 x 100 matrix in 38 seconds, but compared to this benchmark I found, my implementation's performances really suck. I don't know if it's a super-optimized something or if really you can easily invert a 200 x 200 matrix in about 0.11 seconds, so I'm looking for more benchmarks to compare the results. Have you god some good link?
UPDATE
I spotted a bug in my multiplication code, that didn't affect the result but was causing useless cycle waste. Now my inversion executes in 20 seconds. It's still a lot of time, and any idea is welcome.
Thank you folks
This sort of operation is extremely cache sensitive. You want to be doing most of your work on variables that are in your L1 & L2 cache. Check out section 6 of this doc:
http://people.redhat.com/drepper/cpumemory.pdf
He walks you through optimizing a matrix multiply in a cache-optimized way and gets some big perf improvements.
Check if you are passing huge matrix objects by value (As this could be costly if copying the whole matrix).
If possable pass by reference.
The thing about matricies and C++ is that you want to avoid copying as much as possable.
So your main object should probably not conatain the "matrix data" but rather contain meta data about the matrix and a pointer (wrapped in by somthing smart) to the data portion. Thus when copying an object you only copy a small chunk of data not the whole thing (see string implementation for an example).
Why do you need to implement your own matrix library in the first place? As you've already discovered, there are already extremely efficient libraries available doing the same thing. And as much as people like to think of C++ as a performance language, that's only true if you're really good at the language. It is extremely easy to write terribly slow code in C++.
I don't know if it's a super-optimized
something or if really you can easily
invert a 200 x 200 matrix in about
0.11 seconds
MATLAB does that without breaking a sweat either. Are you implementing the LAPACK routines for matrix inversion (e.g. LU decomposition)?
Have you tried profiling it?
Following this paper (pdf), the calculation for a 100x100 matrix with LU decomposition will need 1348250 (floating point operations). A core 2 can do around 20 Gflops (processor metrics). So theoretically speaking you can do an inversion in 1 ms.
Without the code is pretty difficult to assert what is the cause of the large gap. From my experience trying micro-optimization like loop unrolling, caching values, SEE, threading, etc, you only will get a speed up, which at best is only a constant factor of you current (which maybe enough for you).
But if you want an order of magnitude speed increase you should take a look at your algorithm, perhaps your implementation of LU decomposition have a bug. Another place to take a look is the organization of your data, try different organization, put row/columns elements together.
The LINPACK benchmarks are based on solving linear algebra problems. They're available for different machines and languages. Maybe they can help you, too.
LINPACK C++ libraries available here, too.
I actually gained about 7 seconds using **double**s instead of **long double**s, but that's not a great deal since I lost half of my precision.