Converting a Vec<u32> to Vec<u8> in-place and with minimal overhead - casting

I'm trying to convert a Vec of u32s to a Vec of u8s, preferably in-place and without too much overhead.
My current solution relies on unsafe code to re-construct the Vec. Is there a better way to do this, and what are the risks associated with my solution?
use std::mem;
use std::vec::Vec;
fn main() {
let mut vec32 = vec![1u32, 2];
let vec8;
unsafe {
let length = vec32.len() * 4; // size of u32 = 4 * size of u8
let capacity = vec32.capacity() * 4; // ^
let mutptr = vec32.as_mut_ptr() as *mut u8;
mem::forget(vec32); // don't run the destructor for vec32
// construct new vec
vec8 = Vec::from_raw_parts(mutptr, length, capacity);
}
println!("{:?}", vec8)
}
Rust Playground link

Whenever writing an unsafe block, I strongly encourage people to include a comment on the block explaining why you think the code is actually safe. That type of information is useful for the people who read the code in the future.
Instead of adding comments about the "magic number" 4, just use mem::size_of::<u32>. I'd even go so far as to use size_of for u8 and perform the division for maximum clarity.
You can return the newly-created Vec from the unsafe block.
As mentioned in the comments, "dumping" a block of data like this makes the data format platform dependent; you will get different answers on little-endian and big-endian systems. This can lead to massive debugging headaches in the future. File formats either encode the platform endianness into the file (making the reader's job harder) or only write a specific endianness to the file (making the writer's job harder).
I'd probably move the whole unsafe block to a function and give it a name, just for organization purposes.
You don't need to import Vec, it's in the prelude.
use std::mem;
fn main() {
let mut vec32 = vec![1u32, 2];
// I copy-pasted this code from StackOverflow without reading the answer
// surrounding it that told me to write a comment explaining why this code
// is actually safe for my own use case.
let vec8 = unsafe {
let ratio = mem::size_of::<u32>() / mem::size_of::<u8>();
let length = vec32.len() * ratio;
let capacity = vec32.capacity() * ratio;
let ptr = vec32.as_mut_ptr() as *mut u8;
// Don't run the destructor for vec32
mem::forget(vec32);
// Construct new Vec
Vec::from_raw_parts(ptr, length, capacity)
};
println!("{:?}", vec8)
}
Playground
My biggest unknown worry about this code lies in the alignment of the memory associated with the Vec.
Rust's underlying allocator allocates and deallocates memory with a specific Layout. Layout contains such information as the size and alignment of the pointer.
I'd assume that this code needs the Layout to match between paired calls to alloc and dealloc. If that's the case, dropping the Vec<u8> constructed from a Vec<u32> might tell the allocator the wrong alignment since that information is based on the element type.
Without better knowledge, the "best" thing to do would be to leave the Vec<u32> as-is and simply get a &[u8] to it. The slice has no interaction with the allocator, avoiding this problem.
Even without interacting with the allocator, you need to be careful about alignment!
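For illustration, a minimal sketch of that borrowing approach (the Vec<u32> keeps ownership of the allocation; we only create a temporary byte view of it):
use std::{mem, slice};

fn main() {
    let vec32 = vec![1u32, 2];
    // Safety: u8 has an alignment of 1 and no invalid bit patterns, so any
    // initialized memory can be viewed as bytes. vec32 must stay alive and
    // unmodified while the view is in use (the raw-pointer round trip hides
    // this from the borrow checker).
    let slice8: &[u8] = unsafe {
        slice::from_raw_parts(
            vec32.as_ptr() as *const u8,
            vec32.len() * mem::size_of::<u32>(),
        )
    };
    println!("{:?}", slice8); // byte order is still platform-dependent
}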
See also:
How to slice a large Vec<i32> as &[u8]?
https://stackoverflow.com/a/48309116/155423

If the in-place conversion is not mandatory, something like this gives you control over the byte order and avoids the unsafe block:
extern crate byteorder;
use byteorder::{WriteBytesExt, BigEndian};
fn main() {
let vec32: Vec<u32> = vec![0xaabbccdd, 2];
let mut vec8: Vec<u8> = vec![];
for elem in vec32 {
vec8.write_u32::<BigEndian>(elem).unwrap();
}
println!("{:?}", vec8);
}
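On Rust 1.32 and later, the standard library's u32::to_be_bytes (and its to_le_bytes / to_ne_bytes siblings) gives the same byte-order control without an external crate; a minimal sketch:
fn main() {
    let vec32: Vec<u32> = vec![0xaabbccdd, 2];
    let mut vec8: Vec<u8> = Vec::with_capacity(vec32.len() * 4);
    for elem in &vec32 {
        // to_be_bytes returns [u8; 4] in big-endian order on every platform.
        vec8.extend_from_slice(&elem.to_be_bytes());
    }
    println!("{:?}", vec8);
}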

To do this kind of conversion soundly you need to go through the Vec's associated allocator and call shrink to convert the layout to the new alignment before calling from_raw_parts. This depends on the allocator being able to perform in-place reallocation.
If you don't need the resulting vector to be resizable then reinterpreting a &mut [u32] borrow of the vec to &mut [u8] would be a simpler option.
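A sketch of what that simpler option could look like; wrapping it in a function ties the output lifetime to the input borrow, so the usual borrow rules still apply:
use std::{mem, slice};

fn as_bytes_mut(v: &mut [u32]) -> &mut [u8] {
    // Safety: u8 has alignment 1 and every bit pattern is a valid u8, and
    // the returned borrow lives exactly as long as the input borrow.
    unsafe {
        slice::from_raw_parts_mut(
            v.as_mut_ptr() as *mut u8,
            v.len() * mem::size_of::<u32>(),
        )
    }
}

fn main() {
    let mut vec32 = vec![1u32, 2];
    as_bytes_mut(&mut vec32)[0] = 0xff;
    // Which u32 changes, and how, depends on the platform's endianness.
    println!("{:?}", vec32);
}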

This is how I solved the problem, using a bitshifting copy.
It works on my x64 machine, but I am unsure whether I have made unsound assumptions about little/big endianness.
Runtime performance would be better if this cast could be done in place in memory, without the need for a copy, but I have not figured out how to do that yet.
/// Cast Vec<u32> to Vec<u8> without modifying underlying byte data
/// ```
/// # use fractals::services::vectors::vec_u32_to_u8;
/// assert_eq!( vec_u32_to_u8(&vec![ 0x12345678 ]), vec![ 0x12u8, 0x34u8, 0x56u8, 0x78u8 ]);
/// ```
#[allow(clippy::identity_op)]
pub fn vec_u32_to_u8(data: &[u32]) -> Vec<u8> {
// TODO: https://stackoverflow.com/questions/72631065/how-to-convert-a-u32-array-to-a-u8-array-in-place
// TODO: https://stackoverflow.com/questions/29037033/how-to-slice-a-large-veci32-as-u8
let capacity = std::mem::size_of::<u32>() * data.len(); // 4 u8s per u32
let mut output = Vec::<u8>::with_capacity(capacity);
for &value in data {
output.push((value >> 24) as u8); // r
output.push((value >> 16) as u8); // g
output.push((value >> 8) as u8); // b
output.push((value >> 0) as u8); // a
}
output
}

Related

Writing a custom, highly-specialized, special-purpose standard-compliant C++ allocator

Brief Preface
I recognize that there are many nuances and requirements for a standard-compatible allocator. There are a number of questions here covering a range of topics associated with allocators. I realize that the requirements set out by the standard are critical to ensuring that the allocator functions correctly in all cases, doesn't leak memory, doesn't cause undefined-behaviour, etc. This is particularly true where the allocator is meant to be used (or at least, can be used) in a wide range of use cases, with a variety of underlying types and different standard containers, object sizes, etc.
In contrast, I have a very specific use case where I personally strictly control all of the conditions associated with its use, as I describe in detail below. Consequently, I believe that what I've done is perfectly acceptable given the highly-specific nature of what I'm trying to implement.
I'm hoping someone with far more experience and understanding than me can either confirm that the description below is acceptable or point out the problems (and, ideally, how to fix them too).
Overview / Specific Requirements
In a nutshell, I'm trying to write an allocator that is to be used within my own code and for a single, specific purpose:
I need "a few" std::vector (probably uint16_t), with a fixed (at runtime) number of elements. I'm benchmarking to determine the best tradeoff of performance/space for the exact integer type[1]
As noted, the number of elements is always the same, but it depends on some runtime configuration data passed to the application
The number of vectors is also either fixed or at least bounded. The exact number is handled by a library providing an implementation of parallel::for(execution::par_unseq, ...)
The vectors are constructed by me (i.e. so I know with certainty that they will always be constructed with N elements)
[1] The value of the vectors are used to conditionally copy a float from one of 2 vectors to a target: c[i] = rand_vec[i] < threshold ? a[i] : b[i] where a, b, c are contiguous arrays of float, rand_vec is the std::vector I'm trying to figure out here, and threshold is a single variable of type integer_tbd. The code compiles as SSE SIMD operations. I do not remember the details of this, but I believe that this requires additional shifting instructions if the ints are smaller than the floats.
On this basis, I've written a very simple allocator, with a single static boost::lockfree::queue as the free-list. Given that I will construct the vectors myself and they will go out of scope when I'm finished with them, I know with certainty that all calls to alloc::deallocate(T*, size_t) will always return vectors of the same size, so I believe that I can simply push them back onto the queue without worrying about a pointer to a differently-sized allocation being pushed onto the free-list.
As noted in the code below, I've added in runtime tests for both the allocate and deallocate functions for now, while I've been confirming for myself that these situations cannot and will not occur. Again, I believe it is unquestionably safe to delete these runtime tests. Although some advice would be appreciated here too -- considering the surrounding code, I think they should be handled adequately by the branch predictor so they don't have a significant runtime cost (although without instrumenting, hard to say for 100% certain).
In a nutshell - as far as I can tell, everything here is completely within my control, completely deterministic in behaviour, and, thus, completely safe. This is also suggested when running the code under typical conditions -- there are no segfaults, etc. I haven't tried running with sanitizers yet -- I was hoping to get some feedback and guidance before doing so.
I should point out that my code runs 2x faster compared to using std::allocator which is at least qualitatively to be expected.
CR_Vector_Allocator.hpp
class CR_Vector_Allocator {
using T = CR_Range_t; // probably uint16_t or uint32_t, set elsewhere.
private:
using free_list_type = boost::lockfree::queue<T*>;
static free_list_type free_list;
public:
T* allocate(size_t);
void deallocate(T* p, size_t) noexcept;
using value_type = T;
using pointer = T*;
using reference = T&;
template <class U> struct rebind { using other = CR_Vector_Allocator; };
};
CR_Vector_Allocator.cc
CR_Vector_Allocator::T* CR_Vector_Allocator::allocate(size_t n) {
if (n <= 1)
throw std::runtime_error("Unexpected number of elements to initialize: " +
std::to_string(n));
T* addr_;
if (free_list.pop(addr_)) return addr_;
addr_ = reinterpret_cast<T*>(std::malloc(n * sizeof(T)));
return addr_;
}
void CR_Vector_Allocator::deallocate(T* p, size_t n) noexcept {
if (n <= 1) // should never happen. but just in case, I don't want to leak
free(p);
else
free_list.push(p);
}
CR_Vector_Allocator::free_list_type CR_Vector_Allocator::free_list;
It is used in the following manner:
using CR_Vector_t = std::vector<uint16_t, CR_Vector_Allocator>;
CR_Vector_t Generate_CR_Vector(){
/* total_parameters is a member of the same class
as this member function and is defined elsewhere */
CR_Vector_t cr_vec (total_parameters);
std::uniform_int_distribution<uint16_t> dist_;
/* urng_ is a member variable of type std::mt19937_64 in the class */
std::generate(cr_vec.begin(), cr_vec.end(), [this, &dist_](){
return dist_(this->urng_);});
return cr_vec;
}
void Prepare_Next_Generation(...){
/*
...
*/
using hpx::parallel::execution::par_unseq;
hpx::parallel::for_loop_n(par_unseq, 0l, pop_size, [this](int64_t idx){
auto crossovers = Generate_CR_Vector();
auto new_parameters = Generate_New_Parameters(/* ... */, std::move(crossovers));
});
}
Any feedback, guidance or rebukes would be greatly appreciated.
Thank you!!

Dealing with a contiguous vector of fixed-size matrices for both storage layouts in Eigen

An external library gives me a raw pointer of doubles that I want to map to an Eigen type. The raw array is logically a big ordered collection of small dense fixed-size matrices, all of the same size. The main issue is that the small dense matrices may be in row-major or column-major ordering and I want to accommodate them both.
My current approach is as follows. Note that all the entries of a small fixed-size block (in the array of blocks) need to be contiguous in memory.
template<int bs, class Mattype>
void block_operation(double *const vals, const int numblocks)
{
Eigen::Map<Mattype> mappedvals(vals,
Mattype::IsRowMajor ? numblocks*bs : bs,
Mattype::IsRowMajor ? bs : numblocks*bs
);
for(int i = 0; i < numblocks; i++)
if(Mattype::IsRowMajor)
mappedvals.template block<bs,bs>(i*bs,0) = block_operation_rowmajor(mappedvals);
else
mappedvals.template block<bs,bs>(0,i*bs) = block_operation_colmajor(mappedvals);
}
The calling function first figures out the Mattype (out of 2 options) and then calls the above function with the correct template parameter.
Thus all my algorithms need to be written twice and my code is interspersed with these layout checks. Is there a way to do this in a layout-agnostic way? Keep in mind that this code needs to be as fast as possible.
Ideally, I would Map the data just once and use it for all the operations needed. However, the only solution I could come up with was invoking the Map constructor once for every small block, whenever I need to access the block.
template<int bs, StorageOptions layout>
inline Map<Matrix<double,bs,bs,layout>> extractBlock(double *const vals,
const int bindex)
{
return Map<Matrix<double,bs,bs,layout>>(vals+bindex*bs*bs);
}
Would this function be optimized away to nothing (by a modern compiler like GCC 7.3 or Intel 2017 under -std=c++14 -O3), or would I be paying a small penalty every time I invoke this function (once for each block, and there are a LOT of small blocks)? Is there a better way to do this?
Your extractBlock is fine; a simpler but somewhat uglier solution is to use a reinterpret_cast at the start of block_operation:
using BlockType = Matrix<double,bs,bs,layout|DontAlign>;
BlockType* blocks = reinterpret_cast<BlockType*>(vals);
for(int i = 0; i < numblocks; i++)
blocks[i] = ...; // operate on the i-th bs-by-bs block
This will work for fixed-size matrices only. Also note the DontAlign, which is important unless you can guarantee that vals is aligned on a 16-byte or even 32-byte boundary, depending on the presence of AVX and on bs... so just use DontAlign!

Eigen: Efficiently storing the output of a matrix evaluation in a raw pointer

I am using some legacy C code that passes around lots of raw pointers. To interface with the code, I have to pass a function of the form:
const int N = ...;
T * func(T * x) {
// TODO Put N elements in x
return x + N;
}
where this function should write the result into x, and then return x.
Internally, in this function, I am using Eigen extensively to perform some calculations. Then I write the result back to the raw pointer using the Map class. A simple example which mimics what I am doing is this:
const int N = 5;
T * func(T * x) {
// Do a lot of operations that result in some matrices like
Eigen::Matrix<T, N, 1 > A = ...
Eigen::Matrix<T, N, 1 > B = ...
Eigen::Map<Eigen::Matrix<T, N, 1 >> constraint(x);
constraint = A - B;
return x + N;
}
Obviously, there is much more complicated stuff going on internally, but that is the gist of it... Do some calculations with Eigen, then use the Map class to write the result back to the raw pointer.
Now the problem is that when I profile this code with Callgrind, and then view the results with KCachegrind, the lines
constraint = A - B;
are almost always the bottleneck. This is sort of understandable, because such a line is potentially doing three things:
Constructing the Map object
Performing the calculation
Writing the result to the pointer
So it is understandable that this line might have the longest runtime. But I am a little bit worried that perhaps I am somehow doing an extra copy in that line before the data gets written to the raw pointer.
So is there a better way of writing the result to the raw pointer? Or is that the idiom I should be using?
In the back of my mind, I am wondering if using the placement new syntax would buy me anything here.
Note: This code is mission critical and should run in realtime, so I really need to squeeze every ounce of speed out of it. For instance, getting this call from a runtime of 0.12 seconds to 0.1 seconds would be huge for us. But code legibility is also a huge concern since we are constantly tweaking the model used in the internal calculations.
These two lines of code:
Eigen::Map<Eigen::Matrix<T, N, 1 >> constraint(x);
constraint = A - B;
are essentially compiled by Eigen as:
for(int i=0; i<N; ++i)
x[i] = A[i] - B[i];
The reality is a bit more complicated because of explicit unrolling and explicit vectorization (both depend on T), but that's essentially it. So the construction of the Map object is essentially a no-op (it is optimized away by any compiler) and no, there is no extra copy going on here.
Actually, if your profiler is able to tell you that the bottleneck lies on this simple expression, then that very likely means that this piece of code has not been inlined, meaning that you did not enable compiler optimization flags (like -O3 with gcc/clang).

Cache Friendly Design of Applications

I've been writing an application in C++ and am now at the point where I have to go down into the code and make it cache-friendly.
After reading a presentation by Tony Albrecht, I realized that I could do it right the first time by simply applying the principles right from the design stage.
Another paper, titled What Every Programmer Should Know About Memory and written by Ulrich Drepper, makes strong points, basically telling developers like me to be aware of memory layout in order to write cache-friendly code.
But then, it feels counter intuitive because:
Thinking in terms of memory layout in general doesn't come natural.
Laying out code and data in terms of sets and lines does not come natural.
Thinking in terms of an object having properties and actions is natural.
A good example, one which I will be facing soon as I sit down to write a custom allocator: there are two structs which are going to be handled by the allocator, shown below.
Also note that once a thread worker releases an element, the same element has to be put back into play, and so on.
typedef struct
{
OVERLAPPED Overlapped;
WSABUF DataBuf;
CHAR Buffer[DATA_BUFSIZE];
byte *LPBuffer;
vector<byte> byteBuffer;
DWORD BytesSEND;
DWORD BytesRECV;
} PER_IO_OPERATION_DATA, *LPPER_IO_OPERATION_DATA;
typedef struct
{
SOCKET Socket;
} PER_HANDLE_DATA, *LPPER_HANDLE_DATA;
Note that WSABUF.buf and the vector can be a challenge in terms of how they will be laid out in memory: their buffer allocations are dynamic, so they don't fit into a fixed-size contiguous layout. I would imagine a separate allocator will have to be created for that case.
PER_HANDLE_DATA is straight forward and can easily be laid out in a contiguous fashion.
I would have to set up another struct for storing IsActive so that it will be laid out in one contiguous block, separate from PER_IO_OPERATION_DATA.
typedef struct {
bool isActive;
} IODATA_STAT, *LPIODATA_STAT;
Anyway, I'd just like to get some feedback on why do you have to be aware of the cache when starting out when it can be done after the application has been written?
Also, what's your say about reorganizing data with regards to dynamic/fixed buffer size and pointers?
Optimization
About premature optimization, I'd say it's premature if you can apply the optimization later in hindsight in response to a profiler without a series of cascading changes in your codebase. The more breathing room you have to swap out representations fairly locally and non-intrusively, the less you should worry the first time around about getting that representation optimal.
So the key thing you want to focus on getting right above all is interface design over implementation, especially if you're building large-scale software. A good, stable interface modeled at the appropriate level of abstraction is going to allow you to profile your code and optimize hotspots to smithereens without cascading breakages throughout your code: ideally just a few tweaks to a source file.
Productivity and maintainability are still the most valuable traits of a developer, and the vast majority of any codebase short of the lowest-level cores are going to hinge on those traits far more than your ability to achieve micro-efficient designs let alone the best possible algorithm for a task. The programming world is very saturated and competitive now and the people who can churn out maintainable applications quickly are generally the ones that win and survive to optimize another day.
And if you aren't using profilers and worried about anything more than broad algorithmic complexity, then you absolutely need a profiler first. Measure twice, optimize once. A profiler is what's going to have you making selective, discrete optimizations which is important not just to make you spend your time in a more valuable way the first time around, but to make sure you don't degrade your entire codebase into a maintenance nightmare.
Memory Layouts
But with that caveat aside:
1. Thinking in terms of memory layout in general doesn't come natural.
Here I would recommend a bit of C-like thinking. It will come more naturally as you face more hotspots over your career. In your example, the variable-length struct trick becomes quite effective.
struct PER_IO_OPERATION_DATA
{
...
byte byteBuffer[]; // size N
};
Simply acquire a PER_IO_OPERATION_DATA* by using malloc (or your own allocator) with the size of the structure plus the additional N bytes you need to make the byteBuffer member large enough. Since you're armed with C++, you can use this kind of low-level structure as an implementation detail behind a safe class conforming to RAII, applying the necessary bounds-checking assertions in debug builds, with exception safety, and so forth. In C++, try to do at least that: if you need unsafe, low-level bit and byte manipulation code anywhere, make it a very private implementation detail hidden from the public interface.
That's typically the first pass for memory locality: identify the runtime-sized aggregate members of an object using the heap and fuse them into one contiguous block with the object itself.
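As a concrete sketch of that fusing step (hedged: IoData and make_io_data are simplified stand-ins, not the real PER_IO_OPERATION_DATA; the flexible array member is a C99 feature that common C++ compilers accept as an extension):
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Simplified stand-in: fixed header fields followed by a runtime-sized
// trailing buffer that is contiguous with the object itself.
struct IoData
{
    std::size_t bytes_recv;
    std::size_t buffer_size;
    unsigned char byteBuffer[]; // flexible array member (extension in C++)
};

IoData* make_io_data(std::size_t n)
{
    // One malloc covers the header plus the N trailing bytes, so the
    // object and its buffer land in one contiguous block.
    IoData* p = static_cast<IoData*>(std::malloc(sizeof(IoData) + n));
    p->bytes_recv = 0;
    p->buffer_size = n;
    std::memset(p->byteBuffer, 0, n);
    return p;
}

int main()
{
    IoData* d = make_io_data(512);
    std::printf("buffer of %zu bytes\n", d->buffer_size);
    std::free(d);
}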
Another useful type of generic container that's missing in the standard when you're trying to optimize for locality (as well as eliminating new/delete/malloc/free hotspots) is something like std::vector with a statically known "common case" size. Basic example:
struct Usually32ElementsOrLess
{
char buf[32];
char* ptr;
int num_elements;
};
Initialize the structure to make ptr point to buf unless the number of elements exceeds the fixed size (32). In such rare cases, make ptr point to a heap-allocated dynamic array. Access the structure through ptr, not buf, and make sure to implement a proper copy constructor.
With C++, you can make this into a general-purpose STL-compliant container if you like with a template parameter to determine the fixed size, even variable-sized with push_backs if you introduce a member to keep track of current memory capacity in addition to size.
Having this kind of structure around, well-tested and especially in full-blown general-purpose STL form, will really help you utilize the stack more and get a bit more memory locality out of your more daily code without requiring anything more time-consuming or risky than using std::vector. It's suitable when most of the time, the data size has an upper bound in the common case scenarios with the heap being reserved for those rare case exceptional scenarios.
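For illustration, one way the initialization and copying described above could look (a bare sketch, not a full STL-compliant container; the assignment operator is omitted):
#include <cstring>

struct Usually32ElementsOrLess
{
    char buf[32];
    char* ptr;
    int num_elements;

    // Common case: the data fits in buf and costs no heap allocation.
    explicit Usually32ElementsOrLess(int n) : num_elements(n)
    {
        ptr = (n <= 32) ? buf : new char[n];
    }

    // Deep copy; the copy gets its own buf or heap block, never a
    // pointer into the source object.
    Usually32ElementsOrLess(const Usually32ElementsOrLess& other)
        : num_elements(other.num_elements)
    {
        ptr = (num_elements <= 32) ? buf : new char[num_elements];
        std::memcpy(ptr, other.ptr, num_elements);
    }

    ~Usually32ElementsOrLess()
    {
        if (ptr != buf)
            delete[] ptr;
    }
};

int main()
{
    Usually32ElementsOrLess small(8);   // stack only
    Usually32ElementsOrLess big(1000);  // rare case: heap fallback
    Usually32ElementsOrLess copy(big);  // deep copy via copy constructor
}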
2. Laying out code and data in terms of sets and lines does not come natural.
Indeed, this is very unnatural to think in terms of organizing aggregates and access patterns to align and fit to a cache line. I would suggest you save such thought for only the most critical of critical hotspots.
3. Thinking in terms of an object having properties and actions is natural.
This doesn't get in the way of these other two things. That's public interface design, and again your ideal public interface doesn't leak these low-level optimization details into the client using that interface (unless it's just a low-level data structure used as a building block for higher-level designs).
Coming back to interface design, if you want to leave more room for efficient optimizations of representation without breaking interface designs, a strided design will help a lot. Check out the OpenGL API and how it supports all kinds of various ways of passing representations of things around. There it doesn't assume, for example, that vertex positions are stored in a separate contiguous memory block from vertex normals. Because it uses strides in the design, the vertex normals can be interleaved with the vertex positions, or they may not be. It doesn't matter and doesn't require the interface to change, so it leaves room to experiment with memory layouts without breaking anything.
In C++, you can even create something like a StrideIterator<T>(ptr, stride_size) to make it easier to pass things around and return them in designs that could benefit from changes to the memory layout of what is being passed and returned.
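A hypothetical sketch of such an iterator (the name and interface come from the suggestion above, not from any existing library):
#include <cstddef>
#include <cstdio>

// Walks elements of type T that sit stride_bytes apart in memory, so
// interleaved and planar layouts look identical to the consuming code.
template <class T>
class StrideIterator
{
public:
    StrideIterator(T* ptr, std::ptrdiff_t stride_bytes)
        : ptr_(reinterpret_cast<char*>(ptr)), stride_(stride_bytes) {}

    T& operator*() const { return *reinterpret_cast<T*>(ptr_); }
    StrideIterator& operator++() { ptr_ += stride_; return *this; }
    bool operator!=(const StrideIterator& other) const { return ptr_ != other.ptr_; }

private:
    char* ptr_;
    std::ptrdiff_t stride_;
};

int main()
{
    // Two interleaved x,y,z vertices; visit only the x components.
    float verts[] = { 0.f, 1.f, 2.f, 3.f, 4.f, 5.f };
    StrideIterator<float> it(verts, 3 * sizeof(float));
    StrideIterator<float> end(verts + 6, 3 * sizeof(float));
    for (; it != end; ++it)
        std::printf("%f\n", *it);
}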
Update (Fixed Allocator)
Since you're interested in custom allocators, try this on for size:
#include <iostream>
#include <cassert>
#include <ctime>
#include <new> // for the placement new used in create()
using namespace std;
class Pool
{
public:
Pool(int element_size, int num_reserve)
{
if (sizeof(Chunk) > element_size)
element_size = sizeof(Chunk);
// This should use an aligned malloc.
mem = static_cast<char*>(malloc((num_reserve+1) * element_size));
char* ptr = static_cast<char*>(mem);
free_chunk = reinterpret_cast<Chunk*>(ptr);
free_chunk->next = 0;
Chunk* last_chunk = free_chunk;
for (int j=1; j < num_reserve+1; ++j)
{
ptr += element_size;
Chunk* chunk = reinterpret_cast<Chunk*>(ptr);
chunk->next = 0;
last_chunk->next = chunk;
last_chunk = chunk;
}
}
~Pool()
{
// This should use an aligned free.
free(mem);
}
void* allocate()
{
assert(free_chunk && free_chunk->next && "Reserve memory exhausted!");
Chunk* chunk = free_chunk;
free_chunk = free_chunk->next;
return chunk->mem;
}
void deallocate(void* mem)
{
Chunk* chunk = static_cast<Chunk*>(mem);
chunk->next = free_chunk;
free_chunk = chunk;
}
template <class T>
T* create(const T& other)
{
return new(allocate()) T(other);
}
template <class T>
void destroy(T* mem)
{
mem->~T();
deallocate(mem);
}
private:
union Chunk
{
Chunk* next;
// This should be max aligned.
char mem[1];
};
char* mem;
Chunk* free_chunk;
};
static double sys_time()
{
return static_cast<double>(clock()) / CLOCKS_PER_SEC;
}
int main()
{
enum {num = 20000000};
Pool alloc(sizeof(int), num);
// 'Touch' the array to reduce bias in the testing.
int** elements = new int*[num];
for (int j=0; j < num; ++j)
elements[j] = 0;
for (int k=0; k < 5; ++k)
{
// new/delete (malloc/free)
{
double start_time = sys_time();
for (int j=0; j < num; ++j)
elements[j] = new int(j);
for (int j=0; j < num; ++j)
delete elements[j];
cout << (sys_time() - start_time) << " seconds for new/delete" << endl;
}
// Branchless Fixed Alloc
{
double start_time = sys_time();
for (int j=0; j < num; ++j)
elements[j] = alloc.create(j);
for (int j=0; j < num; ++j)
alloc.destroy(elements[j]);
cout << (sys_time() - start_time) << " seconds for branchless alloc" << endl;
}
cout << endl;
}
delete[] elements;
}
Results on my machine:
1.711 seconds for new/delete
0.066 seconds for branchless alloc
1.681 seconds for new/delete
0.058 seconds for branchless alloc
1.668 seconds for new/delete
0.06 seconds for branchless alloc
1.68 seconds for new/delete
0.057 seconds for branchless alloc
1.663 seconds for new/delete
0.065 seconds for branchless alloc
It's a branchless pool allocator. Not safe but crazy fast. It requires you to reserve the maximum amount of memory in advance, so it's best used as a building block for an allocator which does branching and creates multiple of these reserved pools on the fly.

ACE/TAO Performance Issue

The ACE/TAO sequence length() function is taking too much time, since it allocates memory with the new operator every time the length is set. Does anybody know an alternative to the length() function for just setting the length in TAO?
Thanks,
From Will Otte from the ATCD mailing list:
I’m going to guess that you’ve got some code like this:
while (something) {
CORBA::ULong pos = seq.length ();
seq.length (pos+1);
seq[pos] = some_value;
}
and are observing that the performance is pretty bad compared to
std::vector<foo> vec;
while (something) {
size_t pos = vec.size ();
vec.resize (pos + 1);
vec[pos] = foo (bar); // or the much more succinct vec.push_back (foo (bar));
}
right?
The answer is likely that your STL implementation is helping you out and providing geometric growth when you use resize. The C++ standard doesn't have any requirement like that (for resize; push_back is guaranteed amortized constant time, which implies geometric growth), so you've likely been lucky and shouldn't depend on that behavior.
The TAO sequences don’t provide this for you, so if you repeatedly resize you’re going to see poor performance because every time you resize, you’re going to have to pay for an allocation of a new buffer and the time to copy all extant elements to the new underlying buffer.