C++: delete vs. free and performance - c++

Consider:
char *p = NULL;
free(p); // or
delete p;
What will happen if I use free and delete on p?
If a program takes a long time to execute, say 10 minutes, is there any way to reduce its running time to 5 minutes?

Some performance notes about new/delete and malloc/free:
malloc and free do not call the constructor and destructor, respectively. This means your classes won't get initialized or deinitialized automatically, which could be bad (e.g. uninitialized pointers)! This doesn't matter for POD data types like char and double, though, since they don't really have a ctor.
new and delete do call the constructor and destructor. This means your class instances are initialized and deinitialized automatically. Normally there's a performance hit compared to plain allocation, but that's for the better.
I suggest staying consistent with new/malloc usage unless you have a reason (e.g. realloc). This way, you have fewer dependencies, reducing your code size and load time (only by a smidgen, though). Also, you won't mess up by free'ing something allocated with new, or deleting something allocated with malloc. (This is undefined behaviour and will most likely cause a crash!)
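To make the constructor/destructor point concrete, here is a minimal sketch. The type Tracked and both function names are hypothetical, invented for this illustration only; the counter simply records how many objects are currently constructed.

```cpp
#include <cassert>
#include <cstdlib>
#include <new>

// Hypothetical type used only for this illustration.
struct Tracked {
    static int live;  // number of currently constructed instances
    int value;
    Tracked() : value(42) { ++live; }
    ~Tracked() { --live; }
};
int Tracked::live = 0;

// new runs the constructor; delete runs the destructor.
bool new_delete_runs_ctor_and_dtor() {
    Tracked* t = new Tracked;
    bool constructed = (Tracked::live == 1 && t->value == 42);
    delete t;  // matches new
    return constructed && Tracked::live == 0;
}

// malloc only hands back raw bytes: no constructor runs.
bool malloc_skips_ctor() {
    void* raw = std::malloc(sizeof(Tracked));
    assert(raw != nullptr);
    bool untouched = (Tracked::live == 0);  // nothing was constructed yet
    // To get a real object in malloc'ed storage it must be constructed
    // explicitly (placement new) and destroyed explicitly.
    Tracked* t = ::new (raw) Tracked;
    bool constructed = (Tracked::live == 1);
    t->~Tracked();
    std::free(raw);  // matches malloc
    return untouched && constructed && Tracked::live == 0;
}
```

Both functions pair each allocation with its matching deallocation, which is the consistency the answer recommends.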

Answer 1: Both free(p) and delete p work fine with a NULL pointer.
Answer 2: Impossible to answer without seeing the slow parts of the code. You should profile the code! If you are using Linux I suggest using Callgrind (part of Valgrind) to find out which parts of the execution take the most time.

Question one: nothing will happen.
From the current draft of ISO/IEC 14882 (or: C++):
20.8.15 C Library [c.malloc]
The contents [of <cstdlib>, that is: where free lives,] are the same as the Standard C library [(see intro.refs for that)] header <stdlib.h>, with the following changes: [nothing that affects this answer].
So, from ISO/IEC 9899:1999 (or: C):
7.20.3.2 The free function
If [the] ptr [parameter] is a null pointer, no action occurs.
From the C++ standard again, for information about delete this time:
3.7.4.2 Deallocation functions [basic.stc.dynamic.deallocation]
The value of the first argument supplied to a deallocation function may be a null pointer value; if so, and if the deallocation function is one supplied in the standard library, the call has no effect.
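The quoted rules can be exercised directly. This is a small sketch; the function name is hypothetical, and the pointers are deliberately never given a real allocation.

```cpp
#include <cassert>
#include <cstdlib>

// Passes null pointers to each deallocation path; all three calls are no-ops.
bool release_null_pointers() {
    char* p = nullptr;
    std::free(p);  // C: if ptr is a null pointer, no action occurs
    delete p;      // C++: deleting a null pointer has no effect
    char* q = nullptr;
    delete[] q;    // the array form behaves the same way
    assert(p == nullptr && q == nullptr);
    return true;
}
```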
See also:
What is the difference between new/delete and malloc/free?
What happens when you try to free() already freed memory in c?

Nothing will happen if you call free with a NULL parameter, or delete with a NULL operand. Both are defined to accept NULL and perform no action.
There are any number of things you can change in C++ code that affect how fast it runs. Often the most useful (in approximate order) are:
Use good algorithms. This is a big topic, but for example, I recently halved the running time of a bit of code by using a std::vector instead of a std::list, in a case where elements were being added and removed only at the end.
Avoid repeating long calculations.
Avoid creating and copying objects unnecessarily (but anything which happens less than 10 million times per minute won't make any significant difference, unless you're handling something really big, like a vector of 10 million items).
Compile with optimisation.
Mark commonly-used functions (again, anything called more than 100 million times in your 10 minute runtime), as inline.
Having said that, divideandconquer is absolutely right - you can't effectively speed your program up unless you know what it spends its time doing. Sometimes this can be guessed correctly when you know how the code works, other times it's very surprising. So it's best to profile. Even if you can't profile the code to see exactly where the time is spent, if you measure what effect your changes have you can often figure it out eventually.
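When a profiler is not available, measuring the effect of a change can be as simple as timing the code before and after. The sketch below assumes a hypothetical helper time_ms and a hypothetical workload push_pop_back that adds and removes elements only at the end, as in the vector-versus-list case mentioned above. It is not a substitute for a real profiler.

```cpp
#include <cassert>
#include <chrono>
#include <list>
#include <vector>

// Hypothetical helper: runs `work` once and returns elapsed milliseconds.
template <typename Work>
double time_ms(Work work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical workload: push n values at the end, then pop them all again.
// Returns the sum of the popped values so the work has an observable result.
template <typename Container>
long long push_pop_back(int n) {
    assert(n >= 0);
    Container c;
    long long sum = 0;
    for (int i = 0; i < n; ++i) c.push_back(i);
    while (!c.empty()) {
        sum += c.back();
        c.pop_back();
    }
    return sum;
}
```

A comparison might look like `time_ms([] { push_pop_back<std::vector<int>>(1000000); })` against the same call with `std::list<int>`. The numbers depend on the machine, compiler, and optimisation flags, so run it several times.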

On question 2:
Previous answers are excellent. But I just wanted to add something about premature optimization. Assuming a program of moderate complexity, the 90/10 rule usually applies, i.e. 90% of the execution time is spent in 10% of the code. "Optimized" code is often harder to read and to maintain. So, always solve the problems first, then see where the bottlenecks are (profiling is a good tool).

As others have pointed out deleting (or freeing) a NULL pointer will not do anything. However if you had allocated some memory then whether to use free() or delete depends upon the method you used to allocate them. For example, if you had used malloc() to allocate memory then you should free() and if you had used new to allocate then you should use delete. However, be careful not to mix the memory allocations. Use a single way to allocate and deallocate them.
For the second question, it will be very difficult to generalize without seeing the actual code. It should be taken on a case by case basis.

Good answers all.
On the performance issue, this provides a method that most can't imagine will work, but a few know it does, surprisingly well.
The 90/10 rule is true. In my experience, there usually are multiple trouble spots, and they usually are at mid-levels of the call stack. They often are caused by using over-general data structures, but you should never fix something unless you've proven that it actually is a problem. Performance problems are amazingly unpredictable.
Fixing any single performance problem may not give you the speedup you need, but each one you fix makes the remaining ones take a larger percentage of the remaining time, so they become easier to find. The speedups combine in a compounded fashion, so you may be surprised at the final result.
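The compounding effect is plain arithmetic. The sketch below uses made-up numbers: a 600-second run (the 10 minutes from the question) with two hypothetical problems costing 200 and 100 seconds. Both function names are invented for this illustration.

```cpp
#include <cassert>
#include <vector>

// Overall speedup after removing each listed cost (in seconds) from the run time.
double overall_speedup(double total_seconds, const std::vector<double>& seconds_removed) {
    double remaining = total_seconds;
    for (double saved : seconds_removed) remaining -= saved;
    assert(remaining > 0.0);
    return total_seconds / remaining;
}

// Fraction of the current run time that one problem accounts for.
double share_of_runtime(double problem_seconds, double current_total_seconds) {
    assert(current_total_seconds > 0.0);
    return problem_seconds / current_total_seconds;
}
```

With these numbers, the 100-second problem is about 17% of the original 600 seconds but 25% of the 400 seconds left after the first fix, so it stands out more. Removing both leaves 300 seconds, a 2x speedup overall.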
When you can no longer find significant problems that you can fix, you've done about as well as you can. Sometimes, at that point, a redesign (such as using code generation) can set off a further round of speedups.

Related

Why did C++11 make std::string::data() add a null terminating character?

Previously that was std::string::c_str()'s job, but as of C++11, data() also provides it. Why was c_str()'s null-terminating character added to std::string::data()? To me it seems like a waste of CPU cycles. In cases where the null terminator is not relevant at all and only data() is used, a C++03 implementation doesn't have to care about the terminator and doesn't have to write 0 to it every time the string is resized. A C++11 implementation, because of the data() null guarantee, has to waste cycles writing 0 every time the string is resized. Since this potentially makes code slower, I guess there was some reason to add that guarantee. What was it?
There are two points to discuss here:
Space for the null-terminator
In theory a C++03 implementation could have avoided allocating space for the terminator and/or may have needed to perform copies (e.g. unsharing).
However, all sane implementations allocated room for the null-terminator in order to support c_str() to begin with, because otherwise it would be virtually unusable if that was not a trivial call.
The null-terminator itself
It is true that some very old implementations (1999, 2001) wrote the \0 on every c_str() call.
However, major implementations changed (2004) or were already like that (2010) to avoid such a thing way before C++11 was released, so when the new standard came, for many users nothing changed.
Now, whether a C++03 implementation should have done it or not:
To me it seems like a waste of CPU cycles
Not really. If you are calling c_str() more than once, you are already wasting cycles by writing it several times. Not only that, you are messing with the cache hierarchy, which is important to consider in multithreaded systems. Recall that multi-core/SMT CPUs started to appear between 2001 and 2006, which explains the switch to modern, non-CoW implementations (even if there were multi-CPU systems a couple of decades before that).
The only situation where you would save anything is if you never called c_str(). However, note that when you are re-sizing the string, you are anyway re-writing everything. An additional byte is going to be hardly measurable.
In other words, by not writing the terminator on re-size, you are exposing yourself to worse performance/latency. By writing it once at the same time you have to perform a copy of the string, the performance behavior is way more predictable and you avoid performance pitfalls if you end up using c_str(), especially on multithreaded systems.
Advantages of the change:
When data also guarantees the null terminator, the programmer doesn't need to know obscure details of differences between c_str and data and consequently would avoid undefined behaviour from passing strings without guarantee of null termination into functions that require null termination. Such functions are ubiquitous in C interfaces, and C interfaces are used in C++ a lot.
The subscript operator was also changed to allow read access to str[str.size()]. Not allowing access to str.data() + str.size() would be inconsistent.
While not initialising the null terminator upon resize etc. may make that operation faster, it forces the initialisation in c_str which makes that function slower¹. The optimisation case that was removed was not universally the better choice. Given the change mentioned in point 2. that slowness would have affected the subscript operator as well, which would certainly not have been acceptable for performance. As such, the null terminator was going to be there anyway, and therefore there would not be a downside in guaranteeing that it is.
Curious detail: str.at(str.size()) still throws an exception.
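The guarantees listed above can be checked with a few lines. This is a sketch assuming a C++11 or later standard library; the two function names are hypothetical.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Since C++11, data() and c_str() return the same null-terminated buffer,
// and reading s[s.size()] yields the terminator.
bool data_is_null_terminated(const std::string& s) {
    return s.data() == s.c_str()
        && s.data()[s.size()] == '\0'
        && s[s.size()] == '\0';
}

// at() is bounds-checked, so the same index still throws.
bool at_end_throws(const std::string& s) {
    try {
        (void)s.at(s.size());
    } catch (const std::out_of_range&) {
        return true;
    }
    return false;
}
```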
P.S. There was another change, that is to guarantee that strings have contiguous storage (which is why data is provided in the first place). Prior to C++11, implementations could have used roped strings, and reallocate upon call to c_str. No major implementation had chosen to exploit this freedom (to my knowledge).
P.P.S Old versions of GCC's libstdc++ for example apparently did set the null terminator only in c_str until version 3.4. See the related commit for details.
¹ A factor to this is concurrency that was introduced to the language standard in C++11. Concurrent non-atomic modification is data-race undefined behaviour, which is why C++ compilers are allowed to optimize aggressively and keep things in registers. So a library implementation written in ordinary C++ would have UB for concurrent calls to .c_str()
In practice (see comments) having multiple threads writing the same thing wouldn't cause a correctness problem because asm for real CPUs doesn't have UB. And C++ UB rules mean that multiple threads actually modifying a std::string object (other than calling c_str()) without synchronization is something the compiler + library can assume doesn't happen.
But it would dirty cache and prevent other threads from reading it, so is still a poor choice, especially for strings that potentially have concurrent readers. Also it would stop .c_str() from basically optimizing away because of the store side-effect.
The premise of the question is problematic.
A string class has to do a lot of expensive things, like allocating dynamic memory, copying bytes from one buffer to another, freeing the underlying memory and so on.
What upsets you is one lousy mov assembly instruction? Believe me, this doesn't affect your performance even by 0.5%.
When writing a programming language runtime, you can't be obsessive about every small assembly instruction. You have to choose your optimization battles wisely, and optimizing an unnoticeable null termination is not one of them.
In this specific case, being compatible with C is way more important than the cost of null termination.
Actually, it's the other way around.
Before C++11, c_str() may in theory have cost "additional cycles" as well as a copy, so as to ensure the presence of a null terminator at the end of the buffer.
This was unfortunate, particularly as it can be fixed very simply, with effectively no additional runtime cost, by simply incorporating a null byte at the end of every buffer to begin with. Only one additional byte to allocate (and a teensie little write), with no runtime cost at point of use, in exchange for thread-safety and a boatload of sanity.
Once you've done that, c_str() is literally the same as data() by definition. So, the "change" to data() actually came for free. Nobody's adding an extra byte to the result of data(); it's already there.
Helping matters is the fact that most implementations already did this under C++03 anyway, to avoid the hypothetical runtime cost ascribed to c_str().
So, in short, this has almost certainly cost you literally nothing.

Vector re-declaration versus insertions in looping operations - C++

I have the option either to create and destroy a vector on every call to func() and push elements in each iteration, as shown in Example A, or to fix the initialization and only overwrite old values in each iteration, as shown in Example B.
Example A:
void func ()
{
    std::vector<double> my_vec(5, 0.0);
    for ( int i = 0; i < my_vec.size(); i++) {
        my_vec.push_back(i);
        // do something
    }
}
while (condition) {
    func();
}
Example B:
void func (std::vector<double>& my_vec)
{
    for ( int i = 0; i < my_vec.size(); i++) {
        my_vec[i] = i;
        // do something
    }
}
while (condition) {
    std::vector<double> my_vec(5, 0.0);
    func(my_vec);
}
Which of the two would be computationally less expensive? The size of the array won't be more than 10.
I still suspect that the question that was asked is not the question that was intended, but it occurred to me that the main point of my answer would likely not change. If the question gets updated, I can always edit this answer to match (or delete it, if it turns out to be inapplicable).
De-prioritize optimizations
There are various factors that should affect how you write your code. Among the desirable goals are space optimization, time optimization, data encapsulation, logic encapsulation, readability, robustness, and correct functionality. Ideally, all of these goals would be achievable in every piece of code, but that is not particularly realistic. Much more likely is a situation where one or more of these goals must be sacrificed in favor of the others. When that happens, optimizations should typically yield to everything else.
That is not to say that optimizations should be ignored. There are plenty of optimizations that rarely obstruct the higher-priority goals. These range from the small, such as passing by const reference instead of by value, to the large, such as choosing the logarithmic algorithm instead of the exponential one. However, the optimizations that do interfere with the other goals should be postponed until after your code is reasonably complete and functioning correctly. At that point, a profiler should be used to determine where the bottlenecks actually are. Those bottlenecks are the only places where other goals should yield to optimizations, and only if the profiler confirms that the optimizations achieved their goals.
For the question being asked, this means that the main concern should not be computational expense, but encapsulation. Why should the caller of func() need to allocate space for func() to work with? It should not, unless a profiler identified this as a performance bottleneck. And if a profiler did that, it would be much easier (and more reliable!) to ask the profiler if the change helps than to ask Stack Overflow.
Why?
I can think of two major reasons to de-prioritize optimizations. First, the "sniff test" is unreliable. While there might be a few people who can identify bottlenecks by looking at code, there are many, many more who merely think they can. Second, that's why we have optimizing compilers. It is not unheard of for someone to come up with this super-clever optimization trick only to discover that the compiler was already doing it. Keep your code clean and let the compiler handle the routine optimizations. Only step in when the task demonstrably exceeds the compiler's capabilities.
See also: premature-optimization
Choosing an optimization
OK, suppose the profiler did identify construction of this small, 10-element array as a bottleneck. The next step is to test an alternative, right? Almost. First you need an alternative, and I would consider a review of the theoretical benefits of various alternatives to be useful. Just keep in mind that this is theoretical and that the profiler gets the final say. So I'll go into the pros and cons of the alternatives from the question, as well as some other alternatives that might bear consideration. Let's start from the worst options, working our way to the better ones.
Example A
In Example A, a vector is created with 5 elements, then elements are pushed onto the vector until i meets or exceeds the vector's size. Seeing how i and the vector's size are both increased by one each iteration (and i starts smaller than the size), this loop will run until the vector grows large enough to crash the program. That means probably billions of iterations (despite the question's claim that the size will not exceed 10).
Easily the most computationally expensive option. Don't do this.
Example B
In example B, a vector is created for each iteration of the outer while loop, which is then accessed by reference from within func(). The performance cons here include passing a parameter to func() and having func() access the vector indirectly through a reference. There are no performance pros as this does everything the baseline (see below) would do, plus some extra steps.
Even though a compiler might be able to compensate for the cons, I see no reason to try this approach.
Baseline
The baseline I'm using is a fix to Example A's infinite loop. Specifically, replace "my_vec.push_back(i);" with Example B's "my_vec[i] = i;". This simple approach is along the lines of what I would expect for the initial assessment by the profiler. If you cannot beat simple, stick with it.
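The baseline described above might look like the following sketch. Returning the vector is an assumption added here so the result can be inspected; the question's func() returns void.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Baseline: Example A with push_back replaced by assignment, so the loop
// terminates after my_vec.size() iterations.
std::vector<double> func() {
    std::vector<double> my_vec(5, 0.0);
    for (std::size_t i = 0; i < my_vec.size(); ++i) {
        my_vec[i] = static_cast<double>(i);
        // do something
    }
    assert(my_vec.size() == 5);  // the size never changes inside the loop
    return my_vec;
}
```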
Example B*
The text of the question presents an inaccurate assessment of Example B. Interestingly, the assessment describes an approach that has the potential to improve on the baseline. To get code that matches the textual description, move Example B's "std::vector<double> my_vec(5, 0.0);" to the line immediately before the while statement. This has the effect of constructing the vector only once, rather than constructing it with each iteration.
The cons of this approach are the same as those of Example B as originally coded. However, we now pick up a gain in that the vector's constructor is called only once. If construction is more expensive than the indirection costs, the result should be a net improvement once the while loop iterates often enough. (Beware these conditions: that's a significant "if" and there is no a priori guess as to how many iterations is "enough".) It would be reasonable to try this and see what the profiler says.
Get some static
A variant on Example B* that helps preserve encapsulation is to use the baseline (the fixed Example A), but precede the declaration of the vector with the keyword static. This brings in the benefit of constructing the vector only once, but without the overhead associated with making the vector a parameter. In fact, the benefit could be greater than in Example B* since construction happens only once per program execution, rather than each time the while loop is started. The more times the while loop is started, the greater this benefit.
The main con here is that the vector will occupy memory throughout the program's execution. Unlike Example B*, it will not release its memory when the block containing the while loop ends. Using this approach in too many places would lead to memory bloat. So while it is reasonable to profile this approach, you might want to consider other options. (Of course if the profiler calls this out as the bottleneck, dwarfing all others, the cost is small enough to pay.)
Fix the size
My personal choice for what optimization to try here would be to start from the baseline and switch the vector to std::array<double, 10>. My main motivation is that the needed size won't be more than 10. Also relevant is that the construction of a double is trivial. Construction of the array should be on par with declaring 10 variables of type double, which I would expect to be negligible. So no need for fancy optimization tricks. Just let the compiler do its thing.
The expected possible benefit of this approach is that a vector allocates space on the heap for its storage, which has an overhead cost. The local array would not have this cost. However, this is only a possible benefit. A vector implementation might already take advantage of this performance consideration for small vectors. (Maybe it does not use the heap until the capacity needs to exceed some magic number, perhaps more than 10.) I would refer you back to earlier when I mentioned "super-clever" and "compiler was already doing it".
I'd run this through the profiler. If there's no benefit, there is likely no benefit from the other approaches. Give them a try, sure, since they're easy enough, but it would probably be a better use of your time to look at other aspects to optimize.
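A sketch of the fixed-size variant, assuming the maximum of 10 elements stated in the question. The name func_fixed and the returned sum are hypothetical additions so the result is observable.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Fixed-size variant of the baseline: the storage lives inside the object
// itself, so no heap allocation is involved.
double func_fixed() {
    std::array<double, 10> my_arr{};  // value-initialized: every element is 0.0
    for (std::size_t i = 0; i < my_arr.size(); ++i) {
        my_arr[i] = static_cast<double>(i);
        // do something
    }
    double sum = 0.0;
    for (double x : my_arr) sum += x;
    assert(my_arr.size() == 10);
    return sum;
}
```

Whether this actually beats the vector version is for the profiler to decide, as the answer says.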

Why can't the runtime environment decide to apply delete or delete[] instead of the programmer?

I've read that the delete[] operator is needed because the runtime environment does not keep information about if the allocated block is an array of objects that require destructor calls or not, but it does actually keep information about where in memory is the allocated block stored, and also, of course, the size of the block.
It would take just one more bit of meta data to remember if destructors need to be called on delete or not, so why not just do that?
I'm pretty sure there's a good explanation, I'm not questioning it, I just wish to know it.
I think the reason is that C++ doesn't force you into anything you don't want. It would add extra metadata and if someone didn't use it, that extra overhead would be forced upon them, in contrast to the design goals of the C++ language.
When you want the capability you described, C++ does provide a way. It's called std::vector and you should nearly always prefer it, another sort of container, or a smart pointer over raw new and delete.
C++ lets you be as efficient as possible, so if the runtime did have to track the number of elements in a block, that would be an extra 4 bytes used per block.
This could be useful to a lot of people, but it also prevents total efficiency for people who don't mind typing [].
It's similar to the difference between C++ and Java. Java can be much faster to program because you never have to worry about garbage collection, but C++, if programmed correctly, can be more efficient and use less memory because it doesn't have to store any of those variables and you can decide when to delete memory blocks.
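As a sketch of that advice: both std::vector and std::unique_ptr<T[]> remember how many destructors to run, so the delete-versus-delete[] choice never reaches the caller. The type Counted and the function names are hypothetical, and C++11 or later is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical type that counts destructor calls.
struct Counted {
    static int destroyed;
    ~Counted() { ++destroyed; }
};
int Counted::destroyed = 0;

// std::vector destroys every element when it goes out of scope.
int destroyed_by_vector(std::size_t n) {
    Counted::destroyed = 0;
    {
        std::vector<Counted> items(n);
        assert(items.size() == n);
    }
    return Counted::destroyed;
}

// std::unique_ptr<T[]> calls delete[] for you, never plain delete.
int destroyed_by_unique_ptr(std::size_t n) {
    Counted::destroyed = 0;
    {
        std::unique_ptr<Counted[]> items(new Counted[n]);
    }
    return Counted::destroyed;
}
```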
It basically comes down to the language design not wanting to put too many restrictions on implementors. Many C++ runtimes use malloc() for ::operator new () and free() (more or less) for ::operator delete (). Standard malloc/free don't provide the bookkeeping necessary for recording a number of elements and provide no way of determining the malloc'd size at free time. Adding another level of memory manipulation between new Foo and malloc for every single object is, from the C/C++ point of view, a pretty big jump in complexity/abstraction. Among other things, adding this overhead to every object would screw up some memory management approaches that are designed knowing what the size of objects are.
There are two things that need be cleared up here.
First: the assumption that malloc keeps the precise size you asked.
Not really. malloc only cares about providing a block that is large enough. Although for efficiency reasons it probably won't overallocate much, it will still probably give you a block of a "standard" size, for example a 2^n-byte block. Therefore the real size (as in, the number of objects actually allocated) is effectively unknown.
Second: the "extra bit" required
Indeed, the information required for a given object to know whether it is part of an array or not would simply be an extra bit. Logically.
As far as implementation is concerned though: where would you actually put that bit?
The memory allocated for the object itself should probably not be touched; the object is using it, after all. So?
On some platforms, this could be kept in the pointer itself (some platforms ignore a portion of the bits), but this is not portable.
So it would require extra storage, at least a byte, except that with alignment issues it could well amount to 8 bytes.
Demonstration: (not convincing as noted by sth, see below)
// A plain array of doubles:
+-------+-------+-------
| 0 | 1 | 2
+-------+-------+-------
// A tentative to stash our extra bit
+-------++-------++-------++
| 0 || 1 || 2 ||
+-------++-------++-------++
// A correction since we introduced alignment issues
// Note: double's alignment is most probably its own size
+-------+-------+-------+-------+-------+-------
| 0 | bit | 1 | bit | 2 | bit
+-------+-------+-------+-------+-------+-------
Humpf!
EDIT
Therefore, on most platforms (where addresses do matter), you will need to "extend" each pointer, and actually double its size (alignment issues).
Is it acceptable for all pointers to be twice as large only so that you can stash that extra bit? For most people I guess it would be. But C++ is not designed for most people, it is primarily designed for people who care about performance, whether speed or memory, and as such this is not acceptable.
END OF EDIT
So what is the correct answer ? The correct answer is that recovering information that the type system lost is costly. Unfortunately.

Using realloc in c++

std::realloc is dangerous in C++ if the malloc'd memory contains non-POD types. It seems the only problem is that std::realloc won't call the type destructors if it cannot grow the memory in situ.
A trivial work around would be a try_realloc function. Instead of malloc'ing new memory if it cannot be grown in situ, it would simply return false. In which case new memory could be allocated, the objects copied (or moved) to the new memory, and finally the old memory freed.
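The fallback path described above (allocate new memory, move the objects, free the old memory) could be sketched as follows. The helpers relocate, destroy_and_free, and relocate_demo are hypothetical names; the sketch assumes the move constructor does not throw and that new_capacity is greater than zero. It does not implement try_realloc itself, since no portable C API for that is known here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <string>
#include <utility>

// Moves `count` objects from `old_block` into a fresh malloc'ed block with
// room for `new_capacity` objects, then destroys and frees the old block.
template <typename T>
T* relocate(T* old_block, std::size_t count, std::size_t new_capacity) {
    assert(new_capacity >= count && new_capacity > 0);
    void* raw = std::malloc(new_capacity * sizeof(T));
    if (raw == nullptr) throw std::bad_alloc();
    T* fresh = static_cast<T*>(raw);
    for (std::size_t i = 0; i < count; ++i) {
        ::new (static_cast<void*>(fresh + i)) T(std::move(old_block[i]));
        old_block[i].~T();  // realloc would skip this step for non-POD types
    }
    std::free(old_block);
    return fresh;
}

// Destroys `count` objects and releases their malloc'ed storage.
template <typename T>
void destroy_and_free(T* block, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) block[i].~T();
    std::free(block);
}

// Usage example with a non-POD type.
std::string relocate_demo() {
    void* raw = std::malloc(2 * sizeof(std::string));
    if (raw == nullptr) throw std::bad_alloc();
    std::string* block = static_cast<std::string*>(raw);
    ::new (static_cast<void*>(block)) std::string("hello");
    ::new (static_cast<void*>(block + 1)) std::string("world");
    block = relocate(block, 2, 4);  // grow from 2 to 4 slots
    std::string joined = block[0] + " " + block[1];
    destroy_and_free(block, 2);
    return joined;
}
```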
This seems supremely useful. std::vector could make great use of this, possibly avoiding all copies/reallocations.
preemptive flame retardant: Technically, that is same Big-O performance, but if vector growth is a bottle neck in your application a x2 speed up is nice even if the Big-O remains unchanged.
BUT, I cannot find any c api that works like a try_realloc.
Am I missing something? Is try_realloc not as useful as I imagine? Is there some hidden bug that makes try_realloc unusable?
Better yet, Is there some less documented API that performs like try_realloc?
NOTE: I'm obviously, in library/platform specific code here. I'm not worried as try_realloc is inherently an optimization.
Update:
Following Steve Jessop's comments on whether vector would be more efficient using realloc, I wrote up a proof of concept to test. The realloc-vector simulates a vector's growth pattern but has the option to realloc instead. I ran the program up to a million elements in the vector.
For comparison, a vector must allocate 19 times while growing to a million elements.
If the realloc-vector is the only thing using the heap, the results are awesome: 3-4 allocations while growing to a size of a million bytes.
If the realloc-vector is used alongside a vector that grows at 66% of the speed of the realloc-vector, the results are less promising: 8-10 allocations during growth.
Finally, if the realloc-vector is used alongside a vector that grows at the same rate, the realloc-vector allocates 17-18 times, barely saving one allocation over the standard vector behavior.
I don't doubt that a hacker could game allocation sizes to improve the savings, but I agree with Steve that the tremendous effort to write and maintain such an allocator isn't worth the gain.
vector generally grows in large increments. You can't do that repeatedly without relocating, unless you carefully arrange things so that there's a large extent of free addresses just above the internal buffer of the vector (which in effect requires assigning whole pages, because obviously you can't have other allocations later on the same page).
So I think that in order to get a really good optimization here, you need more than a "trivial workaround" that does a cheap reallocation if possible - you have to somehow do some preparation to make it possible, and that preparation costs you address space. If you only do it for certain vectors, ones that indicate they're going to become big, then it's fairly pointless, because they can indicate with reserve() that they're going to become big. You can only do it automatically for all vectors if you have a vast address space, so that you can "waste" a big chunk of it on every vector.
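The effect of reserve() mentioned above can be observed by counting capacity changes during growth. This is a sketch with a hypothetical function name. The count without reserve() depends on the standard library's growth factor, so no particular number is claimed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how often the vector's capacity changes while n elements are pushed.
// Each capacity change corresponds to a new allocation plus a relocation.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<int> v;
    if (reserve_first) v.reserve(n);  // announce the final size up front
    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last_capacity) {
            ++reallocations;
            last_capacity = v.capacity();
        }
    }
    assert(v.size() == n);
    return reallocations;
}
```

With reserve_first set to true the loop never reallocates, which is why a vector that knows it will become big gains little from a cheaper reallocation primitive.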
As I understand it, the reason that the Allocator concept has no reallocation function is to keep it simple. If std::allocator had a try_realloc function, then either every Allocator would have to have one (which in most cases couldn't be implemented, and would just have to return false always), or else every standard container would have to be specialized for std::allocator to take advantage of it. Neither option is a great Allocator interface, although I suppose it wouldn't be a huge effort for implementers of almost all Allocator classes just to add a do-nothing try_realloc function.
If vector is slow due to re-allocation, deque might be a good replacement.
You could implement something like the try_realloc you proposed, using mmap with MAP_ANONYMOUS and MAP_FIXED and mremap with MREMAP_FIXED.
Edit: just noticed that the man page for mremap even says:
mremap() uses the Linux page table scheme. mremap() changes the mapping between virtual addresses and memory pages. This can be used to implement a very efficient realloc(3).
realloc in C is hardly more than a convenience function; it has very little benefit for performance/reducing copies. The main exception I can think of is code that allocates a big array then reduces the size once the size needed is known - but even this might require moving data on some malloc implementations (ones which segregate blocks strictly by size) so I consider this usage of realloc really bad practice.
As long as you don't constantly reallocate your array every time you add an element, but instead grow the array exponentially (e.g. by 25%, 50%, or 100%) whenever you run out of space, just manually allocating new memory, copying, and freeing the old will yield roughly the same (and identical, in case of memory fragmentation) performance to using realloc. This is surely the approach that C++ STL implementations use, so I think your whole concern is unfounded.
Edit: The one (rare but not unheard-of) case where realloc is actually useful is for giant blocks on systems with virtual memory, where the C library interacts with the kernel to relocate whole pages to new addresses. The reason I say this is rare is because you need to be dealing with very big blocks (at least several hundred kB) before most implementations will even enter the realm of dealing with page-granularity allocation, and probably much larger (several MB maybe) before entering and exiting kernelspace to rearrange virtual memory is cheaper than simply doing the copy. Of course try_realloc would not be useful here, since the whole benefit comes from actually doing the move inexpensively.

Why is 'unbounded_array' more efficient than 'vector'?

It says here that
The unbounded array is similar to a std::vector in that it can grow in size beyond any fixed bound. However unbounded_array is aimed at optimal performance. Therefore unbounded_array does not model a Sequence like std::vector does.
What does this mean?
As a Boost developer myself, I can tell you that it's perfectly fine to question the statements in the documentation ;-)
From reading those docs, and from reading the source code (see storage.hpp), I can say that it's somewhat correct given some assumptions about the implementation of std::vector at the time that code was written. That code dates to 2000 initially, and perhaps as late as 2002, which means that at the time many STD implementations did not do a good job of optimizing destruction and construction of objects in containers. The claim about the non-resizing is easily refuted by using an initially large capacity vector. The claim about speed, I think, comes entirely from the fact that the unbounded_array has special code for eliding dtors & ctors when the stored objects have trivial implementations of them. Hence it can avoid calling them when it has to rearrange things, or when it's copying elements. Compared to really recent STD implementations it's not going to be faster, as new STD implementations tend to take advantage of things like move semantics to do even more optimizations.
It appears to lack insert and erase methods. As these may be "slow," i.e. their performance depends on size() in the vector implementation, they were omitted to prevent the programmer from shooting himself in the foot.
insert and erase are required by the standard for a container to be called a Sequence, so unlike vector, unbounded_array is not a sequence.
No efficiency is gained by failing to be a sequence, per se.
However, it is more efficient in its memory allocation scheme, by avoiding a concept of vector::capacity and always having the allocated block exactly the size of the content. This makes the unbounded_array object smaller and makes the block on the heap exactly as big as it needs to be.
As I understood it from the linked documentation, it is all about allocation strategy. std::vector, afaik, postpones allocation until necessary and then might allocate some reasonable chunk of memory; unbounded_array seems to allocate more memory early and therefore might allocate less often. But this is only a guess from the statement in the documentation that it allocates more memory than might be needed and that the allocation is more expensive.