Is the stability of std::remove and std::remove_if a design failure? - c++

Recently (from an SO comment) I learned that std::remove and std::remove_if are stable. Am I wrong to think this is a terrible design choice, since it prevents certain optimizations?
Imagine removing the first and fifth elements of a 1M-element std::vector. Because of stability, we can't implement remove with swap. Instead we must shift every remaining element. :(
If we weren't limited by stability we could (for random-access and bidirectional iterators) use two iterators, one from the front and one from the back, and swap to bring the to-be-removed items to the end. I'm sure smart people could do even better. My question is general, not about the specific optimization I'm describing.
EDIT: please note that C++ advertises the zero-overhead principle, and that there are separate std::sort and std::stable_sort algorithms.
EDIT2:
The optimization would be something like the following, for remove_if:
bad_iter scans from the beginning for elements for which the predicate returns true.
good_iter scans from the end for elements for which the predicate returns false.
When both have found what they are looking for, they swap their elements. The loop terminates when good_iter <= bad_iter.
If it helps, think of it as the partitioning step of quicksort, except that instead of comparing elements to a pivot we use the predicate above.
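The two-iterator idea above can be sketched roughly as follows. This is only an illustration: unstable_remove_if is a hypothetical name, not a standard function, and the sketch assumes bidirectional iterators and a move-assignable value type. Since the tail gets erased anyway, it moves a kept element over a removed one instead of swapping.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical helper (not part of the standard library).
// Keeps every element for which pred is false in [first, result) and
// returns result. The relative order of the kept elements is NOT preserved.
template <typename BidirIt, typename Pred>
BidirIt unstable_remove_if(BidirIt first, BidirIt last, Pred pred) {
    while (true) {
        // "bad_iter": advance from the front to the next element to remove.
        while (first != last && !pred(*first)) ++first;
        if (first == last) return first;
        // "good_iter": step back from the end to the next element to keep.
        do {
            --last;
            if (first == last) return first;
        } while (pred(*last));
        // Fill the hole at the front with the kept element from the back.
        *first = std::move(*last);
        ++first;
    }
}
```

Used like the regular erase-remove idiom: v.erase(unstable_remove_if(v.begin(), v.end(), pred), v.end()). Each removed element near the front costs at most one move, rather than a shift of everything behind it.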
EDIT3: I played around and tried to find worst case (worst case for remove_if - notice how rarely the predicate would be true) and I got this:
#include <algorithm>
#include <chrono>
#include <cstdlib>   // rand
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main()
{
    vector<string> vsp;
    int n;
    cin >> n;
    // Build n strings of the form "123456x", where x is a random lowercase letter.
    for (int i = 0; i < n; ++i)
    {
        string s = "123456";
        s.push_back('a' + (rand() % 26));
        vsp.push_back(s);
    }
    auto vsp2 = vsp;

    // Stable erase-remove: the predicate is true only for "123456a" (about 1 in 26).
    auto remove_start = std::chrono::high_resolution_clock::now();
    auto it = remove_if(begin(vsp), end(vsp), [](const string& s) { return s < "123456b"; });
    vsp.erase(it, vsp.end());
    cout << vsp.size() << endl;   // note: printing is inside the timed region
    auto remove_end = std::chrono::high_resolution_clock::now();
    cout << "erase-remove: " << chrono::duration_cast<std::chrono::milliseconds>(remove_end - remove_start).count() << " milliseconds\n";

    // Unstable alternative: partition the elements to keep to the front, erase the tail.
    auto partition_start = std::chrono::high_resolution_clock::now();
    auto it2 = partition(begin(vsp2), end(vsp2), [](const string& s) { return s >= "123456b"; });
    vsp2.erase(it2, vsp2.end());
    cout << vsp2.size() << endl;
    auto partition_end = std::chrono::high_resolution_clock::now();
    cout << "partition-remove: " << chrono::duration_cast<std::chrono::milliseconds>(partition_end - partition_start).count() << " milliseconds\n";
}
C:\STL\MinGW>g++ test_int.cpp -O2 && a.exe
12345678
11870995
erase-remove: 1426 milliseconds
11870995
partition-remove: 658 milliseconds
For other usages, partition is a bit faster, the same, or slower. Color me puzzled. :D

I assume you're asking about a hypothetical definition of stable_remove to be what remove currently is, and remove to be implemented however the implementer thinks is best to give the correct values in any order. With an expectation that implementers will be able to improve on just doing exactly the same as stable_remove.
In practice, the library can't easily do this optimization. It depends on the data, but you don't want to spend too long to work out how many elements will be removed before deciding on how to remove each one. For example you could do an extra pass to count them, but there are plenty of cases where that extra pass is inefficient. Just because an unstable remove is faster than stable for certain cases doesn't necessarily mean that an adaptive algorithm to choose between the two is a good bet.
I think the difference between remove and sort is that sorting is known to be a complicated problem with a lot of different solutions and trade-offs and tweaks. All "simple" sort algorithms are slow on average. Most standard algorithms are pretty simple, and remove is one of them but sort is not. I don't think it makes a lot of sense therefore to define stable_remove and remove as separate standard functions.
Edit: your edit with my tweak (similar to std::partition but no need to keep the values on the right) seems pretty reasonable to me. It requires a bidirectional iterator, but there is precedent in the standard for algorithms that behave differently on different iterator categories, such as std::distance. So it would be possible for the standard to define unstable_remove that only requires a forward iterator, but does your thing if it gets a bidi iterator. The standard probably wouldn't lay out the algorithm, but it could have a phrase like "if the iterator is bidirectional, does at most min(k, n-k) moves where k is the number of elements removed", which would in effect force it. But note that the standard doesn't currently say how many moves remove_if does, so I reckon that pinning this down simply wasn't a priority.
There is of course nothing stopping you from implementing your own unstable_remove.
If we accept that the standard didn't need to specify an unstable remove, the question then comes down to whether the function it does define should have been called stable_remove, anticipating a future remove that behaves differently for bidi iterators, and might behave differently for forward iterators if some clever heuristic for doing an unstable remove ever becomes well enough known to be worth a standard function. I'd say not: it is not a disaster if the names of standard functions aren't completely regular. It could have been pretty disruptive to remove the guarantee of stability from the STL's remove_if. Then the question becomes, "why didn't the STL call it stable_remove_if", to which I can only answer that in addition to all the points made in all the answers, the STL design process was a sight quicker than the standardization process.
stable_remove would also open a can of worms regarding other standard functions that could in theory have unstable versions. For a particularly silly example, should copy be called stable_copy, just in case some implementation exists on which it's demonstrably faster to reverse the order of elements while copying? Should copy be called copy_forward, so that the implementation can choose which of copy_backward and copy_forward is called by copy according to which is faster? Part of the committee's job is to draw a line somewhere.
I think realistically the current standard is sensible, and it would be sensible to separately define a stable_remove and a remove_with_some_other_constraints, but remove_in_some_unspecified_way just doesn't give the same opportunity for optimization that sort_in_some_unspecified_way does. Introsort was invented in 1997, just as C++ was being standardized, but I don't imagine the research effort around remove is quite what it was and is around sort. I may be wrong, optimizing remove might be the next big thing, and if so then the committee has missed a trick.

std::remove is specified to work with forward iterators.
The approach of working with a pair of iterators, one from the beginning and one from the end, would either tighten the iterator requirements, and thus reduce the utility of the function, or violate/worsen the asymptotic complexity guarantees.
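To make the forward-iterator point concrete, here is a small sketch (keep_odd is a made-up name for illustration) that runs std::remove_if over a std::forward_list, a container that only offers forward iterators and therefore cannot be walked from the back:

```cpp
#include <algorithm>
#include <cassert>
#include <forward_list>
#include <vector>

// Hypothetical helper: returns the values that survive std::remove_if
// on a singly linked list, in their original relative order.
std::vector<int> keep_odd(std::forward_list<int> values) {
    auto new_end = std::remove_if(values.begin(), values.end(),
                                  [](int x) { return x % 2 == 0; });
    return std::vector<int>(values.begin(), new_end);
}
```

A two-ended algorithm would need operator-- on the iterator, which std::forward_list does not provide.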

To answer my own question >3 years later :)
Yes it was a "fail".
There is a proposal D0041R0 that would add unstable_remove.
One could argue that just because there is a proposal to add std::unstable_remove that it does not mean that std::remove was a mistake, but I disagree. :)

Related

C++ Counting Map

Recently I was dealing with what I am sure is a very common problem, which essentially boils down into the following:
Given a long text, calculate the frequency of each word occurring in the text.
I was able to solve this problem using std::unordered_map. This, however, turned quite ugly: for every word in the text that had already been encountered, I had to do a find, an erase, and then a re-insert into the map with the value incremented.
I realise there are other ways of doing this, such as using a hashing function on top of a vanilla array/vector and incrementing the value there, but I was wondering if there is a more elegant way of solving this problem, like an STL component or function that would have an interface similar to Python's collections.Counter.
I know that, C++ being C++, I can't really expect such high-level concepts to always be implemented for me, but I was wondering if you guys knew about anything (or at least your Googling skills are superior to mine) that could make my code a little nicer.
I'm not quite sure why an std::unordered_map (or just std::map) would involve much complexity. I'd write the code something like this:
std::unordered_map<std::string, int> words;
std::string word;
// Assumes `input` is a std::istream; >> extracts whitespace-separated words.
while (input >> word)
    ++words[word];
There's no need for any kind of find/erase/reinsert.
In case it's not clear how/why this works: operator[] will create an entry for a value if none exists yet in the map. The associated value will be a value-initialized object of the specified type, which will be zero in the case of an int (or similar). We then increment that every time we encounter the word.
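For completeness, a self-contained sketch of the same idea. The function name count_words and the whitespace-only tokenization are assumptions made for the example; real text would likely also need punctuation and case handling.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <unordered_map>

// Counts whitespace-separated words; punctuation and case are left untouched.
std::unordered_map<std::string, int> count_words(const std::string& text) {
    std::unordered_map<std::string, int> counts;
    std::istringstream input(text);
    std::string word;
    while (input >> word) {
        // operator[] value-initializes the count to 0 on first sight,
        // so no find/erase/re-insert is needed.
        ++counts[word];
    }
    return counts;
}
```

One thing to keep in mind: reading a missing word with counts[word] inserts it with a count of 0, so for pure lookups find() or count() may be preferable.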
An alternative solution:
std::multiset<std::string> m;
for (auto w: words) m.insert(w);
m.count("some word");
The advantage is that you don't have to rely on the 'trick' with operator[], making the code more readable.
EDIT: As Kerrek pointed out in the comments, this solution is slower. multiset stores all the elements you insert, even if they are deemed equivalent (they might still differ in some aspect that the comparison does not check). This causes a significant overhead compared to unordered_map<std::string, int>, which only has to store each word once.
(As a side note, processing the complete works of William Shakespeare using the map solution takes about 0.33s on my machine, as opposed to 0.78s for the multiset solution.)

boost multi_index_container and slow operator++

This is a follow-up question to this MIC question. When adding items to the vector of reference wrappers I spend about 80% of the time inside operator++, whichever iteration approach I choose.
The query works as follows:
VersionView getVersionData(int subdeliveryGroupId, int retargetingId,
                           const std::wstring &flightName) const {
    VersionView versions;
    for (auto i = 0; i < 3; ++i) {
        for (auto j = 0; j < 3; ++j) {
            versions.insert(m_data.get<mvKey>().equal_range(
                boost::make_tuple(subdeliveryGroupId + i, retargetingId + j, flightName)));
        }
    }
    return versions;
}
I've tried the following ways to fill the vector of reference wrappers:
template <typename InputRange> void insert(const InputRange &rng) {
    // 1) base::insert(end(), rng.first, rng.second); // 12ms
    // 2) std::copy(rng.first, rng.second, std::back_inserter(*this)); // 6ms
    /* 3) size_t start = size(); // 12ms
    auto tmp = std::reference_wrapper<const VersionData>(VersionData(0,0,L""));
    resize(start + boost::size(rng), tmp);
    auto beg = rng.first;
    for (;beg != rng.second; ++beg, ++start)
    {
        this->operator[](start) = std::reference_wrapper<const VersionData>(*beg);
    }
    */
    std::copy(rng.first, rng.second, std::back_inserter(*this));
}
Whatever I do, I pay for operator++ or for the size method, which just increments the iterator, meaning I'm still stuck in ++. So the question is whether there is a way to iterate the result ranges faster. If there is no such way, is it worth going down into the implementation of equal_range and adding a new argument that holds a reference to the container of reference_wrapper, to be filled with the results instead of creating a range?
EDIT 1: sample code
http://coliru.stacked-crooked.com/a/8b82857d302e4a06/
Due to this bug it will not compile on Coliru
EDIT 2: Call tree, with time spent in operator ++
EDIT 3: Some concrete stuff. First of all, I didn't start this thread just because operator++ takes too much of the overall execution time and I don't like it "just because"; at this very moment it is the major bottleneck in our performance tests. Each request is usually processed in hundreds of microseconds; requests similar to this one (they are somewhat more complex) are processed in ~1000-1500 microseconds, and that is still acceptable. The original problem was that once the number of items in the data structure grows to hundreds of thousands, the performance deteriorates to something like 20 milliseconds. Now, after switching to MIC (which drastically improved the code's readability, maintainability and overall elegance), I can reach something like 13 milliseconds per request, of which 80%-90% is spent in operator++. Now the question is whether this could be improved somehow, or should I look for some tar and feathers for me? :)
The fact that 80% of getVersionData's execution time is spent in operator++ is not indicative of any performance problem per se; at most, it tells you that equal_range and std::reference_wrapper insertion are faster in comparison. Put another way, when you profile some piece of code you will typically find locations where the most time is spent, but whether this is a problem or not depends on the required overall performance.
@kreuzerkrieg, your sample code does not exercise any kind of insertion into a vector of std::reference_wrappers! Instead, you're projecting the result of equal_range into a boost::any_range, which is expected to be fairly slow at iteration: basically, increment ops resolve to virtual calls.
So, unless I'm seriously missing something here, the sample code performance or lack thereof does not have anything to do with whatever your problem is in real code (assuming VersionView, of which you don't show the code, is not using boost::any_range).
That said, if you can afford replacing your ordered indices with equivalent hashed indices, iteration will probably be faster, but this is an utter shot in the dark given you're not showing the real stuff.
I think that you're measuring the wrong things entirely. When I scale up from 3x3x11111 to 10x10x111111 (so 111x as many items in the index), it still runs in 290ms.
And populating the stuff takes orders of magnitude more time. Even deallocating the container appears to take more time.
What Doesn't Matter?
I've contributed a version with some trade offs, which mainly show that there's no sense in tweaking things: View On Coliru
there's a switch to avoid the any_range (it doesn't make sense using that if you care for performance)
there's a switch to tweak the flyweight:
#define USE_FLYWEIGHT 0 // 0: none 1: full 2: no tracking 3: no tracking no locking
again, it merely shows you could easily do without, and should consider doing so unless you need the memory optimization for the string (?). If so, consider using the OPTIMIZE_ATOMS approach:
the OPTIMIZE_ATOMS basically does fly weight for wstring there. Since all the strings are repeated here it will be mighty storage efficient (although the implementation is quick and dirty and should be improved). The idea is much better applied here: How to improve performance of boost interval_map lookups
Here's some rudimentary timings:
As you can see, basically nothing actually matters for query/iteration performance
Any Iterators: Do They Matter?
It might be the culprit on your compiler. On my compiler (gcc 4.8.2) it wasn't anything big, but see the disassembly of the accumulate loop without the any iterator:
As you can see from the sections I've highlighted, there doesn't seem to be much fat from the algorithm, the lambda, or the iterator traversal. Now with the any_iterator the situation is much less clear, and if your compiler optimizes less well, I can imagine it failing to inline elementary operations, making iteration slow. (Just guessing a little now)
Ok, so the solution I applied is as follows:
In addition to the ordered_non_unique index (the 'byKey') I've added a random_access index. When the data is loaded I rearrange the random-access index with m_data.get.begin(). Then, when the MIC is queried for the data, I just do boost::equal_range on the random-access index with a custom predicate which emulates the same logic that was applied in the ordering of the 'byKey' index. That's it; it gave me a fast 'size()' (O(1), as I understand) and fast traversal.
Now I'm ready for your rotten tomatoes :)
EDIT 1:
of course I've changed the any_range from bidirectional traversal tag to the random access one
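The reason this helps can be shown without Boost. The sketch below is only a standard-library analogy, not the actual MIC code: Record, Key and ByKey are invented for the example, since the real VersionData is not shown. On a sorted random-access sequence, std::equal_range with a custom comparator yields a pair of iterators whose distance is a plain subtraction, so the size of the result is O(1) and traversal is a pointer walk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical record type standing in for the unseen VersionData.
struct Record {
    int groupId;
    int retargetingId;
    int payload;
};

using Key = std::tuple<int, int>;

// Heterogeneous comparator so that std::equal_range can compare a Record
// against a Key in both directions, mirroring the ordering of the key index.
struct ByKey {
    static Key key(const Record& r) { return Key(r.groupId, r.retargetingId); }
    bool operator()(const Record& r, const Key& k) const { return key(r) < k; }
    bool operator()(const Key& k, const Record& r) const { return k < key(r); }
};

// Precondition: records is sorted by (groupId, retargetingId).
std::size_t count_matches(const std::vector<Record>& records, const Key& k) {
    auto range = std::equal_range(records.begin(), records.end(), k, ByKey{});
    // Random-access iterators: the distance is a subtraction, not a walk.
    return static_cast<std::size_t>(range.second - range.first);
}
```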

Sort a vector in which the n first elements have been already sorted?

Consider a std::vector v of N elements, and consider that the first n elements have already been sorted, with n < N, and where (N-n)/N is very small:
Is there a clever way using the STL algorithms to sort this vector more rapidly than with a complete std::sort(std::begin(v), std::end(v)) ?
EDIT: a clarification: the (N-n) unsorted elements should be inserted at the right position within the n first elements already sorted.
EDIT2: bonus question: and how to find n ? (which corresponds to the first unsorted element)
Sort the other range only, and then merge the two sorted ranges with std::inplace_merge.
void foo( std::vector<int> & tab, int n ) {
    std::sort( begin(tab)+n, end(tab));
    std::inplace_merge(begin(tab), begin(tab)+n, end(tab));
}
For EDIT2:
auto it = std::adjacent_find(begin(tab), end(tab), std::greater<int>() );
if (it!=end(tab)) {
    ++it; // first element that is out of order
    std::sort( it, end(tab));
    std::inplace_merge(begin(tab), it, end(tab));
}
The optimal solution would be to sort the tail portion independently and then perform in-place merge, as described here
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.22.5750
The algorithm is quite convoluted and is usually regarded as "not worth the effort".
Of course, with C++ you can use readily available std::inplace_merge. However, the name of that algorithm is highly misleading. Firstly, there's no guarantee that std::inplace_merge actually works in-place. And when it is actually in-place, there's no guarantee that it is not implemented as a full-blown sort. In practice it boils down to trying it and seeing whether it is good enough for your purposes.
But if you really want to make it in-place and formally more efficient than a full sort, then you will have to implement it manually. STL might help with a few utility algorithms, but it does not offer any solid solutions of "just a few calls to standard functions" kind.
Using insertion sort on the last N - n elements:
template <typename IT>
void mysort(IT begin, IT end) {
    for (IT it = std::is_sorted_until(begin, end); it != end; ++it) {
        IT insertPos = std::lower_bound(begin, it, *it);
        IT endRotate = it;
        std::rotate(insertPos, it, ++endRotate);
    }
}
The Timsort sorting algorithm is a hybrid algorithm developed by Pythonista Tim Peters. It makes optimal use of already-sorted subsegments anywhere inside the array, including in the beginning. Although you may find a faster algorithm if you know for sure that in particular the first n elements are already sorted, this algorithm should be useful for the overall class of problems involved. Wikipedia describes it as:
The algorithm finds subsets of the data that are already ordered, and uses that knowledge to sort the remainder more efficiently.
In Tim Peters' own words,
It has supernatural performance on many
kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
as few as N-1), yet as fast as Python's previous highly tuned samplesort
hybrid on random arrays.
Full details are described in this undated text document by Tim Peters. The examples are in Python, but Python should be quite readable even to people not familiar with its syntax.
Use std::partition_point (or is_sorted_until) to find n. Then if N-n is small do an insertion sort (linear search + std::rotate).
I assume that your question has two aims:
improve runtime (using a clever way)
with little effort (restricting to STL)
Considering these aims, I'd strongly recommend against this specific optimization, unless you are sure that the effort is worth the benefit.
As far as I remember, std::sort() is typically implemented as introsort, a quicksort-based hybrid, which on presorted input is almost as fast as just determining whether, or how much of, the input is sorted.
Instead of meddling with std::sort you can try changing the data structure to a sorted/prioritized queue.
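One possible reading of that suggestion, as a sketch: keep the vector sorted at all times by inserting each new element at its place, so no re-sort is ever needed. insert_sorted is a hypothetical helper, and whether this beats a batch sort-and-merge depends on how many elements arrive and how large the vector is, since every insert shifts the tail.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: inserts value into an already sorted vector
// while keeping it sorted. The search is O(log N); the insert itself is O(N).
template <typename T>
void insert_sorted(std::vector<T>& v, T value) {
    // upper_bound keeps equal elements in insertion order.
    auto pos = std::upper_bound(v.begin(), v.end(), value);
    v.insert(pos, std::move(value));
}
```

If only the smallest or largest element is ever needed, std::priority_queue, or std::multiset for full ordered traversal, would be the standard containers to look at.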

BOOST_FOREACH versus for loop

I would like to have your advice regarding the usage of BOOST_FOREACH.
I have read that it is not really recommended in terms of performance, being a very heavy header.
Moreover, it forces the use of "break" and "continue" statements, since you can't really have an exit condition driven by a boolean, and I've always been told that "break" and "continue" should be avoided when possible.
Of course, the advantage is that you are not dealing directly with iterators, which eases the task of iterating through a container.
What do you think about it?
Do you think that if used it should be adopted systematically to guarantee homogeneity in a project or its use is recommended only under certain circumstances?
I would say C++ range-based for loops supersede it. This is an equivalent of this BOOST_FOREACH example:
std::string hello( "Hello, world!" );
for (auto c : hello)
{
    std::cout << c;
}
I never found I needed to use it in C++03.
Note when using the range based loop over containers with expensive to copy elements, or in a generic context, it is best to use const& to those elements:
SomeContainerType<SomeType> v = ....;
for (const auto& elem : v)
{
    std::cout << elem << " ";
}
Similarly, if you need to modify the elements of the container, then use a non-const & (auto& elem : v).
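A minimal sketch of the modifying case; add_suffix is a made-up function for illustration. Because the loop variable is declared auto&, it binds to the element itself and the change is visible in the container afterwards. With plain auto the loop would modify a copy.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical example: appends a suffix to every string in place.
void add_suffix(std::vector<std::string>& names, const std::string& suffix) {
    for (auto& name : names) {  // non-const reference: edits the element itself
        name += suffix;
    }
}
```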
In programming, clarity is trump. I've always used boost foreach in C++03 and found it much more readable than the hand-written loop; the header size won't kill you. As @juanchopanza rightly noted, of course, this question is obsolete in C++11.
Your concerns with break and continue are unfounded and probably counterproductive. With the traditionally long for-loop headers of C++03, people tend to not read the loop header and to overlook any condition variables that hide in the loop header. Better make your intent explicit with break and continue.
If you have decided to use boost foreach, use it systematically. It is supposed to be used to replace the bread-and-butter loops, after all.
I just replaced a use of BOOST_FOREACH with a simple for loop and got a 50% speedup, so I would say it is definitely not always the best thing to use.
You will also not get a loop counter (e.g. "i") which sometimes you actually need. Personally I'm not a fan but YMMV if it suits your style better.
BTW - a "heavy header" won't affect performance of your program, only the compilation time.

Should one prefer STL algorithms over hand-rolled loops?

I seem to be seeing more 'for' loops over iterators in questions & answers here than I do for_each(), transform(), and the like. Scott Meyers suggests that stl algorithms are preferred, or at least he did in 2001. Of course, using them often means moving the loop body into a function or function object. Some may feel this is an unacceptable complication, while others may feel it better breaks down the problem.
So... should STL algorithms be preferred over hand-rolled loops?
It depends on:
Whether high-performance is required
The readability of the loop
Whether the algorithm is complex
If the loop isn't the bottleneck, and the algorithm is simple (like for_each), then for the current C++ standard, I'd prefer a hand-rolled loop for readability. (Locality of logic is key.)
However, now that C++0x/C++11 is supported by some major compilers, I'd say use STL algorithms because they now allow lambda expressions — and thus the locality of the logic.
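As a rough illustration of what lambdas buy you, here is for_each with the loop body written at the call site. Child and set_all_foo_counts are hypothetical names made up for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical element type, used only for this illustration.
struct Child {
    int fooCount = 0;
    void setFooCount(int n) { fooCount = n; }
};

void set_all_foo_counts(std::vector<Child>& children, int n) {
    // The logic stays local: no separate functor class to look up.
    std::for_each(children.begin(), children.end(),
                  [n](Child& child) { child.setFooCount(n); });
}
```

For a body this small a range-based for loop reads just as well; the lambda form pays off more with algorithms such as find_if or remove_if, where the algorithm name carries meaning.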
I’m going to go against the grain here and advocate that using STL algorithms with functors makes code much easier to understand and maintain, but you have to do it right. You have to pay more attention to readability and clarity. In particular, you have to get the naming right. But when you do, you can end up with cleaner, clearer code, and a paradigm shift into more powerful coding techniques.
Let’s take an example. Here we have a group of children, and we want to set their “Foo Count” to some value. The standard for-loop, iterator approach is:
for (vector<Child>::iterator iter = children.begin();
     iter != children.end();
     ++iter)
{
    iter->setFooCount(n);
}
Which, yeah, it’s pretty clear, and definitely not bad code. You can figure it out with just a little bit of looking at it. But look at what we can do with an appropriate functor:
for_each(children.begin(), children.end(), SetFooCount(n));
Wow, that says exactly what we need. You don’t have to figure it out; you immediately know that it’s setting the “Foo Count” of every child. (It would be even clearer if we didn’t need the .begin() / .end() nonsense, but you can’t have everything, and they didn’t consult me when making the STL.)
Granted, you do need to define this magical functor, SetFooCount, but its definition is pretty boilerplate:
class SetFooCount
{
public:
    SetFooCount(int n) : fooCount(n) {}
    void operator () (Child& child)
    {
        child.setFooCount(fooCount);
    }
private:
    int fooCount;
};
In total it’s more code, and you have to look at another place to find out exactly what SetFooCount is doing. But because we named it well, 99% of the time we don’t have to look at the code for SetFooCount. We assume it does what it says, and we only have to look at the for_each line.
What I really like is that using the algorithms leads to a paradigm shift. Instead of thinking of a list as a collection of objects, and doing things to every element of the list, you think of the list as a first class entity, and you operate directly on the list itself. The for-loop iterates through the list, calling a member function on each element to set the Foo Count. Instead, I am doing one command, which sets the Foo Count of every element in the list. It’s subtle, but when you look at the forest instead of the trees, you gain more power.
So with a little thought and careful naming, we can use the STL algorithms to make cleaner, clearer code, and start thinking on a less granular level.
std::for_each is the kind of code that made me curse the STL, years ago.
I cannot say if it's better, but I prefer to have the code of my loop right under the loop preamble. For me, it is a strong requirement. And the std::for_each construct won't allow me that (strangely enough, the foreach versions of Java or C# are cool, as far as I am concerned... So I guess it confirms that for me the locality of the loop body is very, very important).
So I'll use for_each only if there is already a readable/understandable algorithm usable with it. If not, no, I won't. But this is a matter of taste, I guess, as I should perhaps try harder to understand and learn to parse all this...
Note that the people at boost apparently felt somewhat the same way, for they wrote BOOST_FOREACH:
#include <string>
#include <iostream>
#include <boost/foreach.hpp>
int main()
{
    std::string hello( "Hello, world!" );
    BOOST_FOREACH( char ch, hello )
    {
        std::cout << ch;
    }
    return 0;
}
See : http://www.boost.org/doc/libs/1_35_0/doc/html/foreach.html
That's really the one thing that Scott Meyers got wrong.
If there is an actual algorithm that matches what you need to do, then of course use the algorithm.
But if all you need to do is loop through a collection and do something to each item, just do the normal loop instead of trying to separate code out into a different functor, that just ends up dicing code up into bits without any real gain.
There are some other options like boost::bind or boost::lambda, but those are really complex template metaprogramming things, they do not work very well with debugging and stepping through the code so they should generally be avoided.
As others have mentioned, this will all change when lambda expressions become a first class citizen.
The for loop is imperative, the algorithms are declarative. When you write std::max_element, it’s obvious what you need, when you use a loop to achieve the same, it’s not necessarily so.
Algorithms also can have a slight performance edge. For example, when traversing an std::deque, a specialized algorithm can avoid checking redundantly whether a given increment moves the pointer over a chunk boundary.
However, complicated functor expressions quickly render algorithm invocations unreadable. If an explicit loop is more readable, use it. If an algorithm call can be expressed without ten-storey bind expressions, by all means prefer it. Readability is more important than performance here, because this kind of optimization is what Knuth so famously attributes to Hoare; you’ll be able to use another construct without trouble once you realize it’s a bottleneck.
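To illustrate the imperative/declarative contrast, here are both versions side by side. The function names are invented for the example, and both assume a non-empty vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Imperative: the reader has to work out that this computes the maximum.
int largest_by_loop(const std::vector<int>& values) {
    assert(!values.empty());
    int best = values[0];
    for (std::size_t i = 1; i < values.size(); ++i) {
        if (values[i] > best) best = values[i];
    }
    return best;
}

// Declarative: the algorithm name states the intent.
int largest_by_algorithm(const std::vector<int>& values) {
    assert(!values.empty());
    return *std::max_element(values.begin(), values.end());
}
```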
It depends, if the algorithm doesn't take a functor, then always use the std algorithm version. It's both simpler for you to write and clearer.
For algorithms that take functors, generally no, until C++0x lambdas can be used. If the functor is small and the algorithm is complex (most aren't) then it may be better to still use the std algorithm.
I'm a big fan of the STL algorithms in principle, but in practice they are just way too cumbersome. By the time you define your functor/predicate classes, a two-line for loop can turn into 40+ lines of code that are suddenly 10x harder to figure out.
Thankfully, things are going to get a ton easier in C++0x with lambda functions, auto and the new for syntax. Check out this C++0x Overview on Wikipedia.
I wouldn't use a hard and fast rule for it. There are many factors to consider: how often do you perform that operation in your code, is it just a loop or an "actual" algorithm, does the algorithm depend on a lot of context that you would have to transmit to your function?
For example I wouldn't put something like
for (int i = 0; i < some_vector.size(); i++)
    if (some_vector[i] == NULL) some_other_vector[i]++;
into an algorithm because it would result in a lot more code percentage wise and I would have to deal with getting some_other_vector known to the algorithm somehow.
There are a lot of other examples where using STL algorithms makes a lot of sense, but you need to decide on a case by case basis.
I think the STL algorithm interface is sub-optimal and should be avoided because using the STL toolkit directly (for algorithms) might give a very small gain in performance, but will definitely cost readability, maintainability, and even a bit of writeability when you're learning how to use the tools.
How much more efficient is a standard for loop over a vector:
int weighted_sum = 0;
for (int i = 0; i < a_vector.size(); ++i) {
    weighted_sum += (i + 1) * a_vector[i]; // Just writing something a little nontrivial.
}
than using a for_each construction, or trying to fit this into a call to accumulate?
You could argue that the iteration process is less efficient, but a for_each also introduces a function call at each step (which might be mitigated by trying to inline the function, but remember that "inline" is only a suggestion to the compiler - it may ignore it).
In any case, the difference is small. In my experience, over 90% of the code you write is not performance critical, but is coder-time critical. By keeping your STL loop all literally inline, it is very readable. There is less indirection to trip over, for yourself or future maintainers. If it's in your style guide, then you're saving some learning time for your coders (admit it, learning to properly use the STL the first time involves a few gotcha moments). This last bit is what I mean by a cost in writeability.
Of course there are some special cases -- for example, you might actually want that for_each function separated to re-use in several other places. Or, it might be one of those few highly performance-critical sections. But these are special cases -- exceptions rather than the rule.
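For comparison, here is roughly what fitting the weighted sum into std::accumulate could look like. This sketch uses a C++11 lambda, which this answer did not have available, and the function names are invented. The index has to be carried as extra state in the lambda, which is arguably the kind of indirection being complained about.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// The plain loop from above, wrapped in a function.
int weighted_sum_loop(const std::vector<int>& a) {
    int sum = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += static_cast<int>(i + 1) * a[i];
    }
    return sum;
}

// The same computation squeezed into std::accumulate.
int weighted_sum_accumulate(const std::vector<int>& a) {
    int weight = 0;  // 1-based position, tracked by hand across calls
    return std::accumulate(a.begin(), a.end(), 0,
                           [&weight](int sum, int x) { return sum + (++weight) * x; });
}
```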
IMO, a lot of standard library algorithms like std::for_each should be avoided - mainly for the lack-of-lambda issues mentioned by others, but also because there's such a thing as inappropriate hiding of details.
Of course hiding details away in functions and classes is all part of abstraction, and in general a library abstraction is better than reinventing the wheel. But a key skill with abstraction is knowing when to do it - and when not to do it. Excessive abstraction can damage readability, maintainability etc. Good judgement comes with experience, not from inflexible rules - though you must learn the rules before you learn to break them, of course.
OTOH, it's worth considering the fact that a lot of programmers have been using C++ (and before that, C, Pascal etc) for a long time. Old habits die hard, and there is this thing called cognitive dissonance which often leads to excuses and rationalisations. Don't jump to conclusions, though - it's at least as likely that the standards guys are guilty of post-decisional dissonance.
I think a big factor is the developer's comfort level.
It's probably true that using transform or for_each is the right thing to do, but it's not any more efficient, and handwritten loops aren't inherently dangerous. If it takes a developer half an hour to write a simple loop, versus half a day to get the syntax for transform or for_each right and move the code into a function or function object, the simple loop is the better deal. And then other developers would need to know what was going on.
A new developer would probably be best served by learning to use transform and for_each rather than handmade loops, since he would be able to use them consistently without error. For the rest of us for whom writing loops is second nature, it's probably best to stick with what we know, and get more familiar with the algorithms in our spare time.
Put it this way -- if I told my boss I had spent the day converting handmade loops into for_each and transform calls, I doubt he'd be very pleased.