Pointer versus non-pointer - C++

I read in many places, including Effective C++, that it is better to store data on the stack rather than as a pointer to the data.
I can understand doing this with small objects, because the number of new and delete calls is reduced, which reduces the chance of a memory leak. Also, the pointer can take more space than the object itself.
But with large objects, where copying them is expensive, is it not better to store them in a smart pointer?
Because with many operations on a large object there will be at least a few copies, and those copies are very expensive (I am not counting the getters and setters).

Let's focus purely on efficiency. There's no one-size-fits-all, unfortunately. It depends on what you are optimizing for. There's a saying, always optimize the common case. But what is the common case? Sometimes the answer lies in understanding your software's design inside out. Sometimes it's unknowable even at the high level in advance because your users will discover new ways to use it that you didn't anticipate. Sometimes you will extend the design and reveal new common cases. So optimization, but especially micro-optimization, is almost always best applied in hindsight, based on both this user-end knowledge and with a profiler in your hand.
The few times you can usually have really good foresight about the common case is when your design is forcing it rather than responding to it. For example, if you are designing a class like std::deque, then you're forcing the common case write usage to be push_fronts and push_backs rather than insertions to the middle, so the requirements give you decent foresight as to what to optimize. The common case is embedded into the design, and there's no way the design would ever want to be any different. For higher-level designs, you're usually not so lucky. And even in the cases where you know the broad common case in advance, knowing the micro-level instructions that cause slowdowns is too often incorrectly guessed, even by experts, without a profiler. So the first thing any developer should be interested in when thinking about efficiency is a profiler.
But here are some tips if you do run into a hotspot with a profiler.
Memory Access
Most of the time, the biggest micro-level hotspots, if you have any, will relate to memory access. So if you have a large object whose members are all accessed in some tight loop, keeping it as one contiguous block will aid performance.
For example, if you have an array of 4-component mathematical vectors you're sequentially accessing in a tight algorithm, you'll generally fare far, far better if they're contiguous like so:
x1,y1,z1,w1,x2,y2,z2,w2...xn,yn,zn,wn
... with a single-block structure like this (all in one contiguous block):
x
y
z
w
This is because the machine will fetch this data into a cache line which will have the adjacent vectors' data inside of it when it's all tightly packed and contiguous in memory like this.
You can very quickly slow down the algorithm if you used something like std::vector here to represent each individual 4-component mathematical vector, where every single individual one stores the mathematical components in a potentially completely different place in memory. Now you could potentially have a cache miss with each vector. In addition, you're paying for additional members since it's a variable-sized container.
std::vector is a "2-block" object that often looks like this when we use it for a mathematical 4-vector:
size
capacity
ptr --> [x y z w] another block
It also stores an allocator but I'll omit that for simplicity.
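As a rough illustration of the two layouts (the Vec4 type and the summing loops are hypothetical examples, not anything from the text above):

#include <vector>

// One contiguous block per vector: an array of these is laid out as
// x1,y1,z1,w1,x2,y2,z2,w2,... and is cache-friendly to traverse.
struct Vec4 {
    float x, y, z, w;
};

// Summing the x components of many vectors: sequential, contiguous accesses.
float sum_x(const std::vector<Vec4>& vs) {
    float sum = 0.0f;
    for (const Vec4& v : vs)
        sum += v.x;
    return sum;
}

// The "2-block" alternative: every element owns a separate heap block, so each
// access follows a pointer and can miss the cache.
float sum_x_indirect(const std::vector<std::vector<float>>& vs) {
    float sum = 0.0f;
    for (const std::vector<float>& v : vs)
        sum += v[0];
    return sum;
}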
On the flip side, if you have a big "1-block" object where only some of its members get accessed in those tight, performance-critical loops, then it might be better to make it into a "2-block" structure. Say you have some Vertex structure whose most-accessed part is the x/y/z position, but which also carries a less commonly accessed list of adjacent vertices. In that case, it can pay to hoist the adjacency data out and store it elsewhere in memory, perhaps even completely outside the Vertex class itself (or keep merely a pointer to it). Your common-case, performance-critical algorithms, which don't touch that data, will then fit more contiguous vertices into a single cache line, since each vertex is smaller and merely points to the rarely-accessed data stored elsewhere.
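Here is a minimal sketch of that hot/cold split; the Vertex, VertexAdjacency, and Mesh names are purely illustrative:

#include <cstdint>
#include <vector>

// Hot data: small and contiguous - what the tight loops actually touch.
struct Vertex {
    float x, y, z;
    std::uint32_t adjacency_index;   // index into the cold data stored elsewhere
};

// Cold data: the rarely-accessed adjacency lists, kept out of the hot array
// so that more Vertex objects fit in each cache line.
struct VertexAdjacency {
    std::vector<std::uint32_t> neighbors;
};

struct Mesh {
    std::vector<Vertex> vertices;             // traversed in the hot loops
    std::vector<VertexAdjacency> adjacency;   // consulted only occasionally
};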
Creation/Destruction Overhead
When rapid creation and destruction of objects is a concern, you also do better to create each object in a single contiguous memory block. The fewer separate memory blocks per object, the faster it will generally go (whether this stuff is going on the heap or the stack, there will be fewer blocks to allocate/deallocate).
Free Store/Heap Overhead
So far I've been talking more about contiguity than stack vs. heap, and it's because stack vs. heap relates more to client-side usage of an object rather than an object's design. When you're designing the representation of an object, you don't know whether it's going on the stack or heap. What you do know is whether it's going to be fully contiguous (1 block) or not (multiple blocks).
But naturally, if it's not contiguous, at least part of it is going on the heap, and heap allocations and deallocations can be enormously expensive if you compare their cost to the hardware stack. However, you can often mitigate this overhead with efficient O(1) fixed allocators. They serve a more specialized purpose than malloc or free, but I would suggest concerning yourself less with the stack vs. heap distinction and more with the contiguity of an object's memory layout.
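For illustration, here is a minimal (non-thread-safe, assumption-laden) sketch of one such fixed allocator: a pool of equally sized blocks backed by one contiguous slab, with O(1) allocate and deallocate:

#include <cstddef>
#include <vector>

// Minimal fixed-size pool: every block is BlockSize bytes, and allocation and
// deallocation are O(1) pushes/pops on a free list. Not thread-safe; BlockSize
// should be a multiple of the alignment of whatever you store in the blocks.
template <std::size_t BlockSize, std::size_t BlockCount>
class FixedPool {
public:
    FixedPool() : storage_(BlockSize * BlockCount) {
        free_list_.reserve(BlockCount);
        for (std::size_t i = 0; i < BlockCount; ++i)
            free_list_.push_back(storage_.data() + i * BlockSize);
    }
    void* allocate() {
        if (free_list_.empty()) return nullptr;   // pool exhausted
        void* p = free_list_.back();
        free_list_.pop_back();
        return p;
    }
    void deallocate(void* p) {
        free_list_.push_back(static_cast<char*>(p));
    }
private:
    std::vector<char> storage_;     // one contiguous slab
    std::vector<char*> free_list_;  // addresses of the unused blocks
};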
Copy/Move Overhead
Last but not least, if you are copying/swapping/moving objects a lot, then the smaller they are, the cheaper this is going to be. So you might sometimes want to sort pointers or indices to big objects, for example, instead of the objects themselves, since for a type T where sizeof(T) is large, even the move constructor has to shuffle all those bytes around.
So move constructing something like the "2-block" std::vector here, which is not contiguous (its dynamic contents are contiguous, but they live in a separate block) and stores its bulky data in a separate memory block, is actually going to be cheaper than move constructing a "1-block" 4x4 matrix that is contiguous. That's because there's no such thing as a cheap shallow copy when the object is one big memory block rather than a tiny one with a pointer to another. One of the funny trends that arises is that objects which are cheap to copy are comparatively expensive to move, and ones which are very expensive to copy are cheap to move.
However, I would not let copy/move overhead drive your object implementation choices, because the client can always add a level of indirection for a particular use case that taxes copies and moves. When you're designing for memory-layout micro-efficiency, the first thing to focus on is contiguity.
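A small sketch of the "sort indices rather than big objects" idea mentioned above (BigRecord and its key field are made up for the example):

#include <algorithm>
#include <cstdint>
#include <vector>

struct BigRecord {
    double payload[64];   // large: swapping/moving whole records is costly
    int key;
};

// Sort a separate array of indices; the big records themselves never move.
std::vector<std::uint32_t> sorted_order(const std::vector<BigRecord>& records) {
    std::vector<std::uint32_t> order(records.size());
    for (std::uint32_t i = 0; i < order.size(); ++i)
        order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](std::uint32_t a, std::uint32_t b) {
                  return records[a].key < records[b].key;
              });
    return order;   // records[order[0]], records[order[1]], ... is the sorted order
}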
Optimization
The rule for optimization is this: if you have no code, no tests, or no profiling measurements, don't do it. As others have wisely suggested, your number one concern is always productivity (which includes maintainability, safety, clarity, etc.). So instead of trapping yourself in hypothetical what-if scenarios, the first thing to do is to write the code, measure it, and change it only if you really have to. It's better to focus on how to design your interfaces appropriately, so that if you do have to change anything, it will only affect one local source file.

The reality is that this is a micro-optimisation. You should write the code to make it readable, maintainable and robust. If you worry about speed, you use a profiling tool to measure the speed. You find things that take more time than they should, and then and only then do you worry about speed optimisation.
An object's data should obviously only exist once. If you make multiple copies of an object that is expensive to copy, you are wasting time. You also end up with different copies of the same object, which is in itself not a good thing.
"Move semantics" avoids expensive copying in cases where you didn't really want to copy anything but just move an object from here to there. Google for it; it is quite an important thing to understand.

What you said is essentially correct. However, move semantics alleviate the concern about object copying in a large number of cases.

Related

Performance of container of objects vs performance of container of pointers

class C { ... };
std::vector<C> vc;
std::vector<C*> pvc;
std::vector<std::unique_ptr<C>> upvc;
Depending on the size of C, either the approach storing by value or the approach storing by pointer will be more efficient.
Is it possible to approximately know what this size is (on both a 32 and 64 bit platform)?
Yes, it is possible - benchmark it. Due to how CPU caches work these days, things are not simple anymore.
Check out this lecture about linked lists by Bjarne Stroustrup:
https://www.youtube.com/watch?v=YQs6IC-vgmo
Here is an excellent lecture by Scott Meyers about CPU caches: https://www.youtube.com/watch?v=WDIkqP4JbkE
Let's look at the details of each example before drawing any conclusions.
Vector of Objects
A vector of objects has an initial performance hit: when an object is added to the vector, it makes a copy. The vector will also make copies when it needs to expand the reserved memory. Larger objects will take more time to copy, as will complex or compound objects.
Accessing the objects is very efficient - only one dereference. If your vector can fit inside a processor's data cache, this will be very efficient.
Vector of Raw Pointers
This may have an initialization performance hit: if the objects live in dynamic memory, that memory must be allocated first.
Copying a pointer into a vector does not depend on the object size. This may be a performance saving, depending on the object size.
Accessing the objects takes a performance hit. There are 2 dereferences before you get to the object. Most processors don't follow pointers when loading their data cache, so the processor may have to reload the data cache when dereferencing the pointer to the object.
Vector of Smart Pointers
A little bit more costly in performance than a raw pointer. However, the items will automatically be deleted when the vector is destructed. The raw pointers must be deleted before the vector can be destructed; or a memory leak is created.
Summary
The safest version is to have copies in the vector, but has performance hits depending on the size of the object and the frequency of reallocating the reserved memory area. A vector of pointers takes performance hits because of the double dereferencing, but doesn't incur extra performance hits when copying because pointers are a consistent size. A vector of smart pointers may take additional performance hits compared to a vector of raw pointers.
The real truth can be found by profiling the code. The performance savings of one data structure versus another may disappear when waiting for I/O operations, such as networking or file I/O.
Operations with the data structures may need to be performed a huge number of times for the savings to be significant. For example, if the difference between the worst-performing data structure and the best is 10 nanoseconds, you will need to perform the operation at least 1E+6 times for the savings to become noticeable. If a second is significant, expect to access the data structures far more often (1E+9 times).
I suggest picking one data structure and moving on. Your time developing the code is worth more than the time that the program runs. Safety and Robustness are also more important. An unsafe program will consume more of your time fixing issues than a safe and robust version.
For a Plain Old Data (POD) type, a vector of that type is always more efficient than a vector of pointers to that type at least until sizeof(POD) > sizeof(POD*).
Almost always, the same is true for a POD type at least until sizeof(POD) > 2 * sizeof(POD*) due to superior memory locality and lower total memory usage compared to when you are dynamically allocating the objects at which to be pointed.
This kind of analysis will hold true up until sizeof(POD) crosses some threshold for your architecture, compiler and usage that you would need to discover experimentally through benchmarking. The above only puts lower bounds on that size for POD types.
It is difficult to say anything definitive about all non-POD types as their operations (e.g. - default constructor, copy constructors, assignment, etc.) can be as inexpensive as a POD's or arbitrarily more expensive.
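If you want to find that threshold, a rough benchmark along the following lines is one way to start (Payload, the sizes, and the counts are placeholders; real measurements need repeated runs, warm caches, and a look at the generated code):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <memory>
#include <vector>

struct Payload { double v[8]; };   // vary this size to find the crossover point

template <typename F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const std::size_t n = 1000000;
    std::vector<Payload> by_value(n);
    std::vector<std::unique_ptr<Payload>> by_pointer;
    by_pointer.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        by_pointer.push_back(std::make_unique<Payload>());

    volatile double sink = 0;  // keeps the optimizer from discarding the loops
    double t_value = time_ms([&] {
        double s = 0;
        for (const auto& p : by_value) s += p.v[0];
        sink = s;
    });
    double t_pointer = time_ms([&] {
        double s = 0;
        for (const auto& p : by_pointer) s += p->v[0];
        sink = s;
    });
    std::printf("by value: %.2f ms, by pointer: %.2f ms\n", t_value, t_pointer);
}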

Performance impact of objects

I am a beginner programmer with some experience in C and C++ programming. I was assigned by the university to make a physics simulator, so as you might imagine there's a big emphasis on performance.
My questions are the following:
1. How many assembly instructions does an instance data member access through a pointer translate to (e.g. vector->x)?
2. Is it much more than, say, simply accessing the memory through a char* (at the same memory location as the variable x), or is it the same?
3. Is there a big impact on performance, compiler-wise, if I use an object to access that memory location rather than accessing it directly?
4. And a related question: is accessing heap memory faster than accessing stack memory?
C++ is a compiled language. Accessing a memory location through a pointer is the same regardless of whether that's a pointer to an object or a char* - it's one instruction in either case. There are a couple of spots where C++ adds overhead, but it always buys you some flexibility. For example, invoking a virtual function requires an extra level of indirection. However, you would need the same indirection anyway if you were to emulate the virtual function with function pointers, and you would spend a comparable number of CPU cycles if you were to emulate it with a switch or a sequence of ifs.
In general, you should not start optimizing before you know what part of your code to optimize. Usually only a small part of your code is responsible for the bulk of the CPU time used by your program. You do not know which part to optimize until you profile your code. Almost universally it's the programmer's code, not the language features of C++, that is responsible for the slowdown. The only way to know for sure is to profile.
1. On x86, a pointer access is typically one extra instruction, above and beyond what you normally need to perform the operation (e.g. y = object->x; would be one load of the address in object, one load of the value of x, and one store to y - in x86 assembler both loads and stores are mov instructions with a memory operand). Sometimes it's "zero" instructions, because the compiler can optimise away the load of the object pointer. On other architectures it really comes down to how the architecture works - some architectures have very limited ways of accessing memory and/or loading addresses into pointers, which makes it awkward to access data through pointers.
2. Exactly the same number of instructions - this applies across the board.
3. As #2 - objects in themselves have no impact at all.
4. Heap memory and stack memory are the same kind of memory. One answer says that "stack memory is always in the cache", which is true if it's near the top of the stack, where all the activity goes on. But if you have an object that was created in main and a pointer to it is passed down through several layers of function calls before finally being dereferenced, there is an obvious chance that this memory hasn't been used for a long while, so there is no real difference there either. The big differences are that heap space is plentiful while the stack is limited, and that running out of heap allows some limited recovery while running out of stack is the immediate end of execution (without tricks that aren't very portable).
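To make the first point concrete, here is a hedged sketch of the kind of code a typical optimizing x86-64 compiler emits for a member access through a pointer (the exact registers, and whether the pointer load is elided, depend on context):

#include <cstring>

struct Particle {
    float x, y, z;
};

float read_x(const Particle* p) {
    // With p already in a register (e.g. rdi), this is typically one load:
    //   movss xmm0, dword ptr [rdi]
    // If p itself first had to be fetched from memory, that would add one
    // extra mov to load the pointer before this load.
    return p->x;
}

float read_x_via_char(const char* raw) {
    // Reading the same bytes through a char* compiles to the same kind of load;
    // the pointer's static type changes nothing at the machine level.
    float out;
    std::memcpy(&out, raw, sizeof out);  // a 4-byte memcpy is folded into a load
    return out;
}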
If you look at class as a synonym for struct in C (which aside from some details, they really are), then you will realize that class and objects are not really adding any extra "effort" to the code generated.
Of course, used correctly, C++ can make it much easier to write code where you deal with things that are "do this in a very similar way, but subtly differently". In C, you often end up with:
void drawStuff(Shape *shapes, int count)
{
    for (int i = 0; i < count; i++)
    {
        switch (shapes[i].shapeType)
        {
        case Circle:
            // ... code to draw a circle ...
            break;
        case Rectangle:
            // ... code to draw a rectangle ...
            break;
        case Square:
            // ...
            break;
        case Triangle:
            // ...
            break;
        }
    }
}
In C++, we can resolve this at object creation time, and your "drawStuff" becomes:
void drawStuff(std::vector<Shape*> shapes)
{
    for (auto s : shapes)
    {
        s->Draw();
    }
}
"Look Ma, no switch..." ;)
(Of course, you do need a switch or something to select which object to create, but once the choice is made, assuming your objects and the surrounding architecture are well defined, everything should work "magically" like the above example.)
Finally, if performance is IMPORTANT, then run benchmarks, run profiling and check where the code is spending its time. Don't optimise too early (but if you have strict performance criteria for something, keep an eye on it, because deciding in the last week of a project that you need to dramatically reorganise your data and code because performance sucks due to some bad decision is also not the best of ideas!). And don't optimise for individual instructions; look at where the time is spent, and come up with better algorithms WHERE you need to. (In the above example, using const std::vector<Shape*>& shapes will effectively pass a pointer to the shapes vector instead of copying the entire thing - which may make a difference if there are a few thousand elements in shapes.)
It depends on your target architecture. A struct in C (and a class in C++) is just a block of memory containing the members in sequence. An access to such a field through a pointer means adding an offset to the pointer and loading from there. Many architectures allow a load to specify an offset to the target address directly, meaning that there is no performance penalty there; and even on extreme RISC machines that don't have that, adding the offset should be so cheap that the load completely shadows it.
Stack and heap memory are really the same thing. Just different areas. Their basic access speed is therefore the same. The main difference is that the stack will most likely already be in the cache no matter what, whereas heap memory might not be if it hasn't been accessed lately.
1. Variable. On most processors, instructions are translated into something called microcode, similar to how Java bytecode is translated into processor-specific instructions before it runs. How many actual instructions you get differs between processor manufacturers and models.
2. Same as above - it depends on processor internals most of us know little about.
1+2. What you should really be asking is how many clock cycles these operations take. On modern platforms the answer is one. It does not matter how many instructions they are; a modern processor has optimizations that make both run in one clock cycle. I will not go into detail here. In other words, as far as CPU load goes there is no difference at all.
Here you have the tricky part. While there is no difference in how many clock cycles the instruction itself takes, it needs to have data from memory before it can run - and that can take a HUGE number of clock cycles. Actually, someone showed a few years ago that even with a very optimized program, an x86 processor spends at least 50% of its time waiting for memory.
When you use stack memory you are effectively doing the same thing as creating an array of structs: the data is contiguous (instructions are not duplicated unless you have virtual functions), and if you do sequential access you will get optimal cache hits. When you use heap memory you will create an array of pointers, and each object will have its own memory. That memory will NOT be contiguous, so sequential access will produce a lot of cache misses. And cache misses are what really make your application slower; they should be avoided at all cost.
I do not know exactly what you are doing, but in many cases even using objects is much slower than plain arrays. An array of objects is laid out [object1][object2] etc. If you do something like the pseudocode "for each object o { o.setX(o.getX() + 1) }", you only touch one variable, and your sequential access will therefore jump over the other variables in each object and get more cache misses than if your X variables were packed in their own array. If you have code that uses all the variables in your objects, a structure of separate arrays will not be slower than an array of objects; it will just load the different arrays into different cache blocks.
While flat arrays are faster in C++, they are MUCH faster in other languages like Java, where you should NEVER store bulk data in objects, since Java objects use more memory and are always stored on the heap. This is the most common mistake that C++ programmers make in Java, who then complain that Java is slow. If they know how to write optimal C++ programs, however, they store data in arrays, which are as fast in Java as in C++.
What I usually do is write a class to store the data, containing the arrays. Even if you use the heap, it's just one object, which becomes as fast as using the stack. Then I wrap it with something like the sketch below: an item class holding only a position index, whose getters forward into the arrays. When I iterate through the data, the iterator does not actually return a new item instance for each element - it just increases the pos value and returns the same object. This means you get a nice OO API while you actually have only a few objects and nicely packed arrays. This is the fastest pattern in C++, and if you don't use it in Java you will know pain.
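A cleaned-up sketch of that pattern (all names are illustrative; the data class owns plain contiguous arrays and the item is a lightweight view into them):

#include <cstddef>
#include <vector>

// The bulk data lives in plain, contiguous arrays (structure-of-arrays).
struct ParticleData {
    std::vector<float> x, y, z;
};

// A lightweight "item" view: just an index into the arrays, reused while iterating.
class ParticleRef {
public:
    ParticleRef(ParticleData& d, std::size_t pos) : d_(d), pos_(pos) {}
    float getX() const { return d_.x[pos_]; }
    void setX(float v) { d_.x[pos_] = v; }
    void advance() { ++pos_; }   // the iterator bumps the index instead of
                                 // constructing a new object per element
private:
    ParticleData& d_;
    std::size_t pos_;
};

void bump_all_x(ParticleData& d) {
    ParticleRef p(d, 0);
    for (std::size_t i = 0; i < d.x.size(); ++i, p.advance())
        p.setX(p.getX() + 1.0f);   // sequential access over one packed array
}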
The fact that we get multiple function calls does not really matter. Modern processors have branch prediction, which removes the cost of the vast majority of those calls. Long before the code actually runs, the branch predictor will have figured out what the chains of calls do and replaced them with a single call in the generated microcode.
Also, even if all the calls did run, each would take far fewer clock cycles than the memory accesses they require, which, as I pointed out, makes memory layout the only issue that should really bother you.

Should I use manual alloc to allow move semantics?

I'm interested to learn when I should start considering using move semantics in favour of copying data, depending on the size of that data and the usage of the class. For example, for a Matrix4 class we have two options:
struct Matrix4 {
    float* data;

    Matrix4() { data = new float[16]; }
    Matrix4(Matrix4&& other) {
        *this = std::move(other);
    }
    Matrix4& operator=(Matrix4&& other)
    {
        // ... removed for brevity ...
    }
    ~Matrix4() { delete[] data; }

    // ... other operators and class methods ...
};
struct Matrix4 {
    float data[16]; // let the compiler do the magic

    Matrix4() {}
    Matrix4(const Matrix4& other) {
        std::copy(other.data, other.data + 16, data);
    }
    Matrix4& operator=(const Matrix4& other)
    {
        std::copy(other.data, other.data + 16, data);
        return *this;
    }

    // ... other operators and class methods ...
};
I believe there is some overhead in having to allocate and deallocate memory "by hand", and given the chances of actually hitting the move constructor when using this class, what is the preferred implementation for a class with such a small in-memory size? Is move really always preferred over copy?
In the first case, allocation and deallocation are expensive - because you are dynamically allocating memory from the heap, even if your matrix is constructed on the stack - and moves are cheap (just copying a pointer).
In the second case, allocation and deallocation are cheap, but moves are expensive - because they are actually copies.
So if you are writing an application and you just care about performance of that application, the answer to the question "Which one is better?" likely depends on how much you are creating/destroying matrices vs how much you are copying/moving them - and in any case, do your own measurements to support any conjectures.
By doing measurements you will also check whether your compiler is doing a lot of copy/move elisions in places where you expect moves to be going on - results may be against your expectations.
Also, cache locality may have an impact here: if you allocate storage for each matrix's data on the heap, then three matrices created on the stack that you want to process element-by-element will likely require quite a scattered memory access pattern, potentially resulting in more cache misses.
On the other hand, if you are using arrays for which memory is allocated on the stack, it is likely that the same cache line will be able to hold the data of all those matrices - thus increasing the cache hit rate. Not to mention the fact that in order to access elements on the heap you first need to read the value of the data pointer, which means accessing a different region of memory than the one holding the elements.
So once more, the moral of the story is: do your own measurements.
If you are writing a library on the other hand, and you cannot predict how many constructions/destructions vs moves/copies the client is going to perform, then you may offer two such matrix classes, and factor out the common behavior into a base class - possibly a base class template.
That will give the client flexibility and will give you a sufficiently high degree of reuse - no need to write the implementation of all common member functions twice.
This way, clients may choose the matrix class that best fits the creation/moving profile of the application in which they are using it.
UPDATE:
As DeadMG points out in the comments, one advantage of the array-based approach over the dynamic allocation approach is that the latter is doing manual resource management through raw pointers, new, and delete, which forces you to write user-defined destructor, copy constructor, move constructor, copy-assignment operator, and move-assignment operator.
You could avoid all of this if you were using std::vector, which would perform the memory management task for you and would save you from the burden of defining all those special member functions.
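A minimal sketch of that std::vector-based alternative (the class name is just illustrative); the compiler-generated special members then handle copy, move, and destruction correctly:

#include <vector>

struct Matrix4 {
    std::vector<float> data = std::vector<float>(16);
    // No user-defined destructor, copy/move constructors, or assignment
    // operators needed: std::vector already implements them correctly,
    // and moves remain cheap (pointer swaps inside the vector).
};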
This said, the mere fact of suggesting to use std::vector instead of doing manual memory management - as much as it is a good advice in terms of design and programming practice - does not answer the question, while I believe the original answer does.
Like everything else in programming, specially when performance is concerned, it's a complicated trade-off.
Here, you have two designs: to keep the data inside your class (method 1) or to allocate the data on the heap and keep a pointer to it in the class (method 2).
As far as I can tell, these are the trade-offs you are making:
Construction/Destruction Speed: Naively implemented, method 2 will be slower here, because it requires dynamic memory allocation and deallocation. However, you can help the situation using custom memory allocators, especially if the size of your data is predictable and/or fixed.
Size: In your 4x4 matrix example, method 2 requires storing an additional pointer, plus the memory allocator's per-block size overhead (typically anywhere from 4 to 32 bytes). This might or might not be a factor, but it certainly must be considered, especially if your class instances are small.
Move Speed: Method 2 has a very fast move operation, because it only requires setting two pointers. In method 1, you have no choice but to copy your data. However, while being able to rely on fast moves can make your code pretty, straightforward, readable and more efficient, compilers are quite good at copy elision, which means that you can write your pretty, straightforward and readable pass-by-value interfaces even if you implement method 1 and the compiler will not generate too many copies anyway. But you can't be sure of that, so relying on this compiler optimization, especially if your instances are larger, requires measurement and inspection of the generated code.
Member Access Speed: This is the most important differentiator for small classes, in my opinion. Each time you access an element in a matrix implemented using method 2 (or access a field in a class implemented that way, i.e., with external data) you access memory twice: once to read the address of the external block of memory, and once to actually read the data you want. In method 1, you just directly access the field or element you want. This means that in method 2, every access can potentially generate an additional cache miss, which can hurt your performance. This is especially important if your class instances are small (e.g. a 4x4 matrix) and you operate on many of them stored in arrays or vectors.
In fact, this is why you might want to actually copy bytes around when you are copying/moving an instance of your matrix, instead of just setting a pointer: to keep your data contiguous. This is why flat data structures (like arrays of values) are much preferred in high-performance code over pointer-spaghetti data structures (like arrays of pointers, linked lists, etc.). So, while moving is cooler and faster than copying in isolation, you sometimes want to copy your instances to make (or keep) a whole bunch of them contiguous and make iterating over and accessing them much, much more efficient.
Flexibility of Length/Size: Method 2 is obviously more flexible in this regard because you can decide how much data you need at runtime, be it 16 or 16777216 bytes.
All in all, here's the algorithm I suggest you use for picking one implementation:
If you need variable amount of data, pick method 2.
If you have very large amounts of data in each instance of your class (e.g. several kilobytes,) pick method 2.
If you need to copy instances of your class around a lot (and I mean a lot!) pick method 2 (but try to measure the performance improvement and inspect the generated code, specially in hot areas.)
In all other cases, prefer method 1.
In short, method 1 should be your default, until proven otherwise. And the way to prove anything regarding performance is measurement! So don't optimize anything unless you have measured and have proof that one method is better than another, and also (as mentioned in other answers,) you might want to implement both methods if you are writing a library and let your users choose the implementation.
I would probably use a stdlib container (such as std::vector or std::array) that already implements move semantics, and then I would simply have the vectors or arrays move.
For example, you could use std::array<std::array<float, 4>, 4> or std::vector<std::vector<float>> to represent your matrix type.
I don't think it will matter a lot for a 4x4 matrix, but it might for 10000x10000. So yes, a move constructor for a matrix type is definitely worth it, especially if you're planning to work with a lot of temporary matrices (which seems likely when you want to do calculations with them). It will also allow returning Matrix4 objects efficiently (whereas you'd otherwise have to resort to an out-parameter passed by reference to avoid copying).
Rather unrelated to the matter, but probably worth mentioning: in case you decide to use std::array, please make Matrix a class template (instead of embedding the size into the class name).
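A hedged sketch of that suggestion, a size-parameterized matrix built on std::array (names are illustrative):

#include <array>
#include <cstddef>

template <typename T, std::size_t Rows, std::size_t Cols>
struct Matrix {
    std::array<T, Rows * Cols> data{};   // one contiguous block, no heap

    T& operator()(std::size_t r, std::size_t c) { return data[r * Cols + c]; }
    const T& operator()(std::size_t r, std::size_t c) const { return data[r * Cols + c]; }
};

using Matrix4 = Matrix<float, 4, 4>;   // replaces the hard-coded Matrix4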

Matrix creation/destruction in C++: best practice?

Suppose I have C++ code with many small functions, in each of which I will typically need a matrix float M1(n,p), with n,p known at run-time, to contain the results of intermediate computations (no need to initialize M1, just to declare it, because each function will simply overwrite all rows of M1).
Part of the reason for this is that each function works on an original data matrix that it can't modify, so many operations (sorting, de-meaning, sphering) need to be done "elsewhere".
Is it better practice to create a temporary M1(n,p) within each function, or rather once and for all in the main() and pass it to each function as a sort of bucket that each function can use as scrap space?
n and p are often moderately large [10^2-10^4] for n and [5-100] for p.
(originally posted at the codereview stackexchange but moved here).
1. Heap allocations are indeed quite expensive.
2. Premature optimization is bad, but if your library is quite general and the matrices are huge, it may not be premature to seek an efficient design. After all, you don't want to modify your design after you have accumulated many dependencies on it.
3. There are various levels at which you can tackle this problem. You can, for example, avoid the heap allocation expense by tackling it at the memory allocator level (a per-thread memory pool, e.g.).
4. While heap allocation is expensive, you are creating one giant matrix only to do some rather expensive operations on it (typically linear complexity or worse). Relatively speaking, allocating a matrix on the free store may not be that expensive compared to what you are inevitably going to do with it afterwards, so it may actually be quite cheap in comparison to the overall logic of a function like sorting.
I recommend you write the code naturally, taking into account #3 as a future possibility. That is, don't take in references to matrix buffers for intermediary computations to accelerate the creation of temporaries. Make the temporaries and return them by value. Correctness and good, clear interfaces come first.
Mostly the goal here is to separate the creational policy of a matrix (via allocator or other means) which gives you that breathing room to optimize as an afterthought without changing too much existing code. If you can do it by modifying only the implementation details of the functions involved or, better yet, modifying only the implementation of your matrix class, then you're really well off because then you're free to optimize without changing the design, and any design which allows that is generally going to be complete from an efficiency standpoint.
WARNING: The following is only intended if you really want to squeeze the most out of every cycle. It is essential to understand #4 and also get yourself a good profiler. It's also worth noting that you'll probably do better by optimizing memory access patterns for these matrix algorithms than trying to optimize away the heap allocation.
If you need to optimize the memory allocation, consider optimizing it with something general like a per-thread memory pool. You could make your matrix take in an optional allocator, for instance, but I emphasize optional here and I'd also emphasize correctness first with a trivial allocator implementation.
In other words:
Is it better practice to declare M1(n,p) within each function, or rather once and for all in main() and pass it to each function as a sort of bucket that each function can use as scrap space?
Go ahead and create M1 as a temporary in each function. Try to avoid requiring the client to make some matrix that has no meaning to him/her only to compute intermediary results. That would be exposing an optimization detail which is something we should strive not to do when designing interfaces (hide all details that clients should not have to know about).
Instead, focus on more general concepts if you absolutely want that option to accelerate the creation of these temporaries, like an optional allocator. This fits in with practical designs like with std::set:
std::set<int, std::less<int>, MyFastAllocator<int>> s; // <-- okay
Even though most people just do:
std::set<int> s;
In your case, it might simply be:
M1 my_matrix(n, p, alloc);
It's a subtle difference, but an allocator is a much more general concept we can use than a cached matrix which otherwise has no meaning to the client except that it's some kind of cache that your functions require to help them compute results faster. Note that it doesn't have to be a general allocator. It could just be your preallocated matrix buffer passed in to a matrix constructor, but conceptually it might be good to separate it out merely for the fact that it is something a bit more opaque to clients.
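A minimal sketch of what such an optional-allocator interface might look like (the Matrix and PoolAllocator names are hypothetical, and real code would want a proper allocator concept):

#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical matrix whose buffer creation policy is an (optional) allocator.
template <typename Alloc = std::allocator<float>>
class Matrix {
public:
    Matrix(std::size_t n, std::size_t p, const Alloc& alloc = Alloc())
        : rows_(n), cols_(p), data_(n * p, 0.0f, alloc) {}

    float& operator()(std::size_t i, std::size_t j) { return data_[i * cols_ + j]; }

private:
    std::size_t rows_, cols_;
    std::vector<float, Alloc> data_;   // the allocator stays an implementation detail
};

// Ordinary clients stay simple:              Matrix<> m1(n, p);
// A performance-critical client can pass in  Matrix<PoolAllocator<float>> m2(n, p, pool);
// where PoolAllocator is some hypothetical per-thread pooling allocator.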
Additionally, constructing this temporary matrix object would also require care not to share it across threads. That's another reason you probably want to generalize the concept a bit if you do go the optimization route, as something a bit more general like a matrix allocator can take into account thread safety or at the very least emphasize more by design that a separate allocator should be created per thread, but a raw matrix object probably cannot.
The above is only useful if you really care about the quality of your interfaces first and foremost. If not, I'd recommend going with Matthieu's advice as it is much simpler than creating an allocator, but both of us emphasize making the accelerated version optional.
Do not use premature optimisation. Create something that works properly and well and optimise later if it is shown to be slow.
(By the way I don't think stackoverflow is the correct place for it either).
In reality if you want to speed up your application operating on large matrices, using concurrency would be your solution. And if you are using concurrency, you are likely to get into far more trouble if you have one big global matrix.
Essentially it means you can never have more than one calculation happening at a time, even if you have the memory for it.
The design of your matrix needs to be optimal. We would have to look at this design.
I would therefore generally say in your code, no, do NOT create one big global matrix because it sounds wrong for what you want to do with it.
First try to define the matrix inside the function. That is definitely the better design choice. But if you get performance losses you can't afford, I think the "pass a buffer by reference" approach is OK, as long as you keep in mind that the functions aren't thread-safe anymore. If at any point you use threads, each thread needs its own buffer.
There are advantages in terms of performance in requiring an externally supplied buffer, especially when you are required to chain the functions that make use of it.
However, from a user point of view, it can soon get annoying.
I have often found that it's simple enough in C++ to get the best of both worlds by simply offering both ways:
int compute(Matrix const& argument, Matrix& buffer);

inline int compute(Matrix const& argument) {
    Matrix buffer(argument.width, argument.height);
    return compute(argument, buffer);
}
This very simple wrapping means that the code is written once, and two slightly different interfaces are presented.
The more involved API (taking a buffer) is also slightly less safe, since the buffer must respect some size constraints with respect to the argument, so you might want to further insulate the fast API (for example behind a namespace) to encourage users to use the slower but safer interface first, and only try out the fast one when it's proven necessary.

Using realloc in C++

std::realloc is dangerous in C++ if the malloc'd memory contains non-POD types. It seems the only problem is that std::realloc won't call the types' destructors (or copy/move constructors) if it cannot grow the memory in situ.
A trivial work around would be a try_realloc function. Instead of malloc'ing new memory if it cannot be grown in situ, it would simply return false. In which case new memory could be allocated, the objects copied (or moved) to the new memory, and finally the old memory freed.
This seems supremely useful. std::vector could make great use of this, possibly avoiding all copies/reallocations.
Preemptive flame retardant: technically, that is the same big-O performance, but if vector growth is a bottleneck in your application, a 2x speed-up is nice even if the big-O remains unchanged.
BUT, I cannot find any C API that works like a try_realloc.
Am I missing something? Is try_realloc not as useful as I imagine? Is there some hidden bug that makes try_realloc unusable?
Better yet, is there some less documented API that performs like try_realloc?
NOTE: I'm obviously in library/platform-specific territory here. I'm not worried, as try_realloc is inherently an optimization.
Update:
Following Steve Jessop's comments on whether vector would be more efficient using realloc, I wrote up a proof of concept to test it. The realloc-vector simulates a vector's growth pattern but has the option to realloc instead. I ran the program up to a million elements in the vector.
For comparison a vector must allocate 19 times while growing to a million elements.
The results: if the realloc-vector is the only thing using the heap, the results are awesome - 3-4 allocations while growing to the size of a million bytes.
If the realloc-vector is used alongside a vector that grows at 66% of the speed of the realloc-vector, the results are less promising - 8-10 allocations during growth.
Finally, if the realloc-vector is used alongside a vector that grows at the same rate, the realloc-vector allocates 17-18 times, barely saving one allocation over the standard vector behaviour.
I don't doubt that a hacker could game allocation sizes to improve the savings, but I agree with Steve that the tremendous effort to write and maintain such an allocator isn't worth the gain.
vector generally grows in large increments. You can't do that repeatedly without relocating, unless you carefully arrange things so that there's a large extent of free addresses just above the internal buffer of the vector (which in effect requires assigning whole pages, because obviously you can't have other allocations later on the same page).
So I think that in order to get a really good optimization here, you need more than a "trivial workaround" that does a cheap reallocation if possible - you have to somehow do some preparation to make it possible, and that preparation costs you address space. If you only do it for certain vectors, ones that indicate they're going to become big, then it's fairly pointless, because they can indicate with reserve() that they're going to become big. You can only do it automatically for all vectors if you have a vast address space, so that you can "waste" a big chunk of it on every vector.
As I understand it, the reason that the Allocator concept has no reallocation function is to keep it simple. If std::allocator had a try_realloc function, then either every Allocator would have to have one (which in most cases couldn't be implemented, and would just have to return false always), or else every standard container would have to be specialized for std::allocator to take advantage of it. Neither option is a great Allocator interface, although I suppose it wouldn't be a huge effort for implementers of almost all Allocator classes just to add a do-nothing try_realloc function.
If vector is slow due to re-allocation, deque might be a good replacement.
You could implement something like the try_realloc you proposed, using mmap with MAP_ANONYMOUS and MAP_FIXED and mremap with MREMAP_FIXED.
Edit: just noticed that the man page for mremap even says:
mremap() uses the Linux page table scheme. mremap() changes the mapping between virtual addresses and memory pages. This can be used to implement a very efficient realloc(3).
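A hedged, Linux-only sketch of that idea (it bypasses malloc entirely, so it only makes sense for large, page-granular buffers; error handling is minimal):

// Linux-only sketch: grow a mapping in place if possible, otherwise report failure.
#include <cstddef>
#include <sys/mman.h>   // mmap, mremap (glibc exposes mremap under _GNU_SOURCE,
                        // which g++ defines by default)

void* block_alloc(std::size_t size) {
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

// try_realloc: succeeds only if the kernel can extend the mapping in situ.
// Without MREMAP_MAYMOVE, mremap fails instead of relocating the pages.
bool block_try_grow(void* p, std::size_t old_size, std::size_t new_size) {
    return mremap(p, old_size, new_size, 0) != MAP_FAILED;
}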
realloc in C is hardly more than a convenience function; it has very little benefit for performance/reducing copies. The main exception I can think of is code that allocates a big array then reduces the size once the size needed is known - but even this might require moving data on some malloc implementations (ones which segregate blocks strictly by size) so I consider this usage of realloc really bad practice.
As long as you don't constantly reallocate your array every time you add an element, but instead grow the array exponentially (e.g. by 25%, 50%, or 100%) whenever you run out of space, just manually allocating new memory, copying, and freeing the old will yield roughly the same performance as using realloc (and identical performance in the case of memory fragmentation). This is surely the approach that C++ STL implementations use, so I think your whole concern is unfounded.
Edit: The one (rare but not unheard-of) case where realloc is actually useful is for giant blocks on systems with virtual memory, where the C library interacts with the kernel to relocate whole pages to new addresses. The reason I say this is rare is because you need to be dealing with very big blocks (at least several hundred kB) before most implementations will even enter the realm of dealing with page-granularity allocation, and probably much larger (several MB maybe) before entering and exiting kernelspace to rearrange virtual memory is cheaper than simply doing the copy. Of course try_realloc would not be useful here, since the whole benefit comes from actually doing the move inexpensively.