c++ vector performance non intuitive result?

c++ vector performance non intuitive result? - c++

I am fiddling with the performance wizard in VS2010, the one that tests instrumentation (function call counts and timing.)
After learning about vectors in the C++ STL, I just decided to see what info I can get about performance of filling a vector with 1 million integers:
#include <iostream>
#include <vector>
void generate_ints();
int main() {
generate_ints();
return 0;
}
void generate_ints() {
typedef std::vector<int> Generator;
typedef std::vector<int>::iterator iter;
typedef std::vector<int>::size_type size;
Generator generator;
for (size i = 0; i != 1000000; ++i) {
generator.push_back(i);
}
}
What I get is: 2402.37 milliseconds of elapsed time for the above. But I learnt that vectors have to resize themselves when they run out of capacity as they are contiguous in memory. So I thought I'd get better performance by making one addition to the above which was:
generate.reserve(1000000);
However this doubles the execution time of the program to around 5000 milliseconds. Here is a screenshot of function calls, left side without the above line of code and right side with. I really don't understand this result and it doesn't make sense to me given what I learnt about how defining a vectors capacity if you know you will fill it with a ton is a good thing. Specifying reserve basically doubled most of the function calls.
http://imagebin.org/179302

From the screenshot you posted, it looks like you're compiling without optimization, which invalidates any benchmarking you do.
You benchmark when you care about performance, and when you care about performance, you press the "go faster" button on the compiler, and enable optimizations.
Telling the compiler to go slow, and then worrying that it's slower than expected is pointless. I'm not sure why the code becomes slower when you insert a reserve call, but in debug builds, a lot of runtime checks are inserted to catch more errors, and it is quite possible that the reserve call causes more such checks to be performed, slowing down the code.
Enable optimizations and see what happens. :)

Did you perform all your benchmarks under Release configuration?
Also, try running outside of profiler, in case you have hit some profiler-induced artifact (add manual time measurements to your code - you can use clock() for that).
Also, did you by any chance make a typo and actually called resize instead of reserve?

Related

Vector re-declaration versus insertions in looping operations - C++

I have an option to either create and destroy a vector on every call to func() and push elements in each iteration, as shown in Example A OR fixed the initialization and only overwrite old values in each iteration, as shown in Example B.
Example A:
void func ()
{
std::vector<double> my_vec(5, 0.0);
for ( int i = 0; i < my_vec.size(); i++) {
my_vec.push_back(i);
// do something
}
}
while (condition) {
func();
}
Example B:
void func (std::vector<double>& my_vec)
{
for ( int i = 0; i < my_vec.size(); i++) {
my_vec[i] = i;
// do something
}
}
while (condition) {
std::vector<double> my_vec(5, 0.0);
func(myVec);
}
Which of the two would be computationally inexpensive. The size of the array won't be more than 10.

I still suspect that the question that was asked is not the question that was intended, but it occurred to me that the main point of my answer would likely not change. If the question gets updated, I can always edit this answer to match (or delete it, if it turns out to be inapplicable).
De-prioritize optimizations
There are various factors that should affect how you write your code. Among the desirable goals are space optimization, time optimization, data encapsulation, logic encapsulation, readability, robustness, and correct functionality. Ideally, all of these goals would be achievable in every piece of code, but that is not particularly realistic. Much more likely is a situation where one or more of these goals must be sacrificed in favor of the others. When that happens, optimizations should typically yield to everything else.
That is not to say that optimizations should be ignored. There are plenty of optimizations that rarely obstruct the higher-priority goals. These range from the small, such as passing by const reference instead of by value, to the large, such as choosing the logarithmic algorithm instead of the exponential one. However, the optimizations that do interfere with the other goals should be postponed until after your code is reasonably complete and functioning correctly. At that point, a profiler should be used to determine where the bottlenecks actually are. Those bottlenecks are the only places where other goals should yield to optimizations, and only if the profiler confirms that the optimizations achieved their goals.
For the question being asked, this means that the main concern should not be computational expense, but encapsulation. Why should the caller of func() need to allocate space for func() to work with? It should not, unless a profiler identified this as a performance bottleneck. And if a profiler did that, it would be much easier (and more reliable!) to ask the profiler if the change helps than to ask Stack Overflow.
Why?
I can think of two major reasons to de-prioritize optimizations. First, the "sniff test" is unreliable. While there might be a few people who can identify bottlenecks by looking at code, there are many, many more who merely think they can. Second, that's why we have optimizing compilers. It is not unheard of for someone to come up with this super-clever optimization trick only to discover that the compiler was already doing it. Keep your code clean and let the compiler handle the routine optimizations. Only step in when the task demonstrably exceeds the compiler's capabilities.
See also: premature-optimization
Choosing an optimization
OK, suppose the profiler did identify construction of this small, 10-element array as a bottleneck. The next step is to test an alternative, right? Almost. First you need an alternative, and I would consider a review of the theoretical benefits of various alternatives to be useful. Just keep in mind that this is theoretical and that the profiler gets the final say. So I'll go into the pros and cons of the alternatives from the question, as well as some other alternatives that might bear consideration. Let's start from the worst options, working our way to the better ones.
Example A
In Example A, a vector is created with 5 elements, then elements are pushed onto the vector until i meets or exceeds the vector's size. Seeing how i and the vector's size are both increased by one each iteration (and i starts smaller than the size), this loop will run until the vector grows large enough to crash the program. That means probably billions of iterations (despite the question's claim that the size will not exceed 10).
Easily the most computationally expensive option. Don't do this.
Example B
In example B, a vector is created for each iteration of the outer while loop, which is then accessed by reference from within func(). The performance cons here include passing a parameter to func() and having func() access the vector indirectly through a reference. There are no performance pros as this does everything the baseline (see below) would do, plus some extra steps.
Even though a compiler might be able to compensate for the cons, I see no reason to try this approach.
Baseline
The baseline I'm using is a fix to Example A's infinite loop. Specifically, replace "my_vec.push_back(i);" with Example B's "my_vec[i] = i;". This simple approach is along the lines of what I would expect for the initial assessment by the profiler. If you cannot beat simple, stick with it.
Example B*
The text of the question presents an inaccurate assessment of Example B. Interestingly, the assessment describes an approach that has the potential to improve on the baseline. To get code that matches the textual description, move Example B's "std::vector<double> my_vec(5, 0.0);" to the line immediately before the while statement. This has the effect of constructing the vector only once, rather than constructing it with each iteration.
The cons of this approach are the same as those of Example B as originally coded. However, we now pick up a gain in that the vector's constructor is called only once. If construction is more expensive than the indirection costs, the result should be a net improvement once the while loop iterates often enough. (Beware these conditions: that's a significant "if" and there is no a priori guess as to how many iterations is "enough".) It would be reasonable to try this and see what the profiler says.
Get some static
A variant on Example B* that helps preserve encapsulation is to use the baseline (the fixed Example A), but precede the declaration of the vector with the keyword static. This brings in the benefit of constructing the vector only once, but without the overhead associated with making the vector a parameter. In fact, the benefit could be greater than in Example B* since construction happens only once per program execution, rather than each time the while loop is started. The more times the while loop is started, the greater this benefit.
The main con here is that the vector will occupy memory throughout the program's execution. Unlike Example B*, it will not release its memory when the block containing the while loop ends. Using this approach in too many places would lead to memory bloat. So while it is reasonable to profile this approach, you might want to consider other options. (Of course if the profiler calls this out as the bottleneck, dwarfing all others, the cost is small enough to pay.)
Fix the size
My personal choice for what optimization to try here would be to start from the baseline and switch the vector to std::array<10,double>. My main motivation is that the needed size won't be more than 10. Also relevant is that the construction of a double is trivial. Construction of the array should be on par with declaring 10 variables of type double, which I would expect to be negligible. So no need for fancy optimization tricks. Just let the compiler do its thing.
The expected possible benefit of this approach is that a vector allocates space on the heap for its storage, which has an overhead cost. The local array would not have this cost. However, this is only a possible benefit. A vector implementation might already take advantage of this performance consideration for small vectors. (Maybe it does not use the heap until the capacity needs to exceed some magic number, perhaps more than 10.) I would refer you back to earlier when I mentioned "super-clever" and "compiler was already doing it".
I'd run this through the profiler. If there's no benefit, there is likely no benefit from the other approaches. Give them a try, sure, since they're easy enough, but it would probably be a better use of your time to look at other aspects to optimize.

How to find runtime efficiency of a C++ code

I'm trying to find the efficiency of a program which I have recently posted on stackoverflow.
How to efficiently delete elements from a vector given an another vector
To compare the efficiency of my code with other answers I'm using chrono object.
Is it a correct way to check the runtime efficiency?
If not then kindly suggest a way to do it with an example.
Code on Coliru
#include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>
#include <ctime>
using namespace std;
void remove_elements(vector<int>& vDestination, const vector<int>& vSource)
{
if(!vDestination.empty() && !vSource.empty())
{
for(auto i: vSource) {
vDestination.erase(std::remove(vDestination.begin(), vDestination.end(), i), vDestination.end());
}
}
}
int main() {
vector<int> v1={1,2,3};
vector<int> v2={4,5,6};
vector<int> v3={1,2,3,4,5,6,7,8,9};
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
remove_elements(v3,v1);
remove_elements(v3,v2);
std::chrono::steady_clock::time_point end= std::chrono::steady_clock::now();
std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() <<std::endl;
for(auto i:v3)
cout << i << endl;
return 0;
}
Output
Time difference = 1472
7
8
9

Is it a correct way to check the runtime efficiency?
Looks like not the best way to do it. I see the following flaws in your method:
Values are sorted. Branch prediction may expose ridiculous effects when testing sorted vs. unsorted values with the same algorithm. Possible fix: test on both sorted and unsorted and compare results.
Values are hard-coded. CPU cache is a tricky thing and it may introduce subtle differences between tests on hard-coded values and real-life ones. In a real world you are unlikely to perform these operations on hard-coded values, so you may either read them from a file or generate random ones.
Too few values. Your code's execution time is much smaller than the timer precision.
You run the code only once. If you fix all other issues and run the code twice, the second run may be much faster than the first one due to cache warm-up: subsequent runs tend to have less cache misses than the first one.
You run the code once on fixed-size data. It would be better to run otherwise correct tests at least four times to cover a Cartesian product of the following parameters:
sorted vs. unsorted data
v3 fits CPU cache vs. v3 size exceeds CPU cache. Also consider cases when (v1.length() + v3.length()) * sizeof(int) fits the cache or not, (v1.length() + v2.length() + v3.length()) * sizeof(int) fits the cache or not and so on for all combinations.

The biggest issues with your approach are:
1) The code you're testing is too short and predictable. You need to run it at least a few thousand times so that there is at least a few hundred milliseconds between measurements. And you need to make the data set larger and less predictable. In general, CPU caches really make accurate measurements based on synthetic input data a PITA.
2) The compiler is free to reorder your code. In general it's quite difficult to ensure the code you're timing will execute between calls to check the time (and nothing else, for that matter). On the one hand, you could dial down optimization, but on the other you want to measure optimized code.
One solution is to turn off whole-program optimization and put timing calls in another compilation unit.
Another possible solution is to use a memory fence around your test, e.g.
std::atomic_thread_fence(std::memory_order_seq_cst);
(requires #include <atomic> and a C++11-capable compiler).
In addition, you might want to supplement your measurements with profiler data to see how efficiently L1/2/3 caches are used, memory bottlenecks, instruction retire rate, etc. Unfortunately the best tool for Intel x86 for that is commercial (vtune), but on AMD x86 a similar tool is free (codeXL).

You might consider using a benchmarking library like Celero to do the measurements for you and deal with the tricky parts of performance measurements, while you remain focused on the code you're trying to optimize. More complex examples are available in the code I've linked in the answer to your previous question (How to efficiently delete elements from a vector given an another vector), but a simple use case would look like this:
BENCHMARK(VectorRemoval, OriginalQuestion, 100, 1000)
{
std::vector destination(10000);
std::generate(destination.begin(), destination.end(), std::rand);
std::sample(destination.begin(), destination.end(), std::back_inserter(source),
100, std::mt19937{std::random_device{}()})
for (auto i: source)
destination.erase(std::remove(destination.begin(), destination.end(), i),
destination.end());
}

optimizing `std::vector operator []` (vector access) when it becomes a bottleneck

gprof says that my high computing app spends 53% of its time inside std::vector <...> operator [] (unsigned long), 32% of which goes to one heavily used vector. Worse, I suspect that my parallel code failing to scale beyond 3-6 cores is due to a related memory bottleneck. While my app does spend a lot of time accessing and writing memory, it seems like I should be able (or at least try) to do better than 52%. Should I try using dynamic arrays instead (size remains constant in most cases)? Would that be likely to help with possible bottlenecks?
Actually, my preferred solution would be to solve the bottleneck and leave the vectors as is for convenience. Based on the above, are there any likely culprits or solutions (tcmalloc is out)?

Did you examine your memory access pattern itself? It might be inefficient - cache unfriendly.

Did you try to use raw pointer while array accessing?
// regular place
for (int i = 0; i < arr.size(); ++i)
wcout << arr[i];
// In bottleneck
int *pArr = &arr.front();
for (int i = 0; i < arr.size(); ++i)
wcout << pArr[i];

I suspect that gprof prevents functions to be inlined. Try to use another profiling method. std::vector operator [] cannot be bottleneck because it doesn't differ much from raw array access. SGI implementaion is shown below:
reference operator[](size_type __n) { return *(begin() + __n); }
iterator begin() { return _M_start; }

You cannot trust gprof for high-speed code profiling, you should instead use a passive profiling method like oprofile to get the real picture.
As an alternative you could profile by manual code alteration (e.g. calling a computation 10 times instead of one and checking how much the execution time increases). Note that this is however going to be influenced by cache issues so YMMV.

The vector class is heavily liked and provides a certain amount of convenience, at the expense of performance, which is fine when you don't particularly need performance.
If you really need performance, it won't hurt you too much to bypass the vector class and go directly to a simple old hand-made array, whether statically or dynamically allocated. Then 1) the time you currently spend indexing should essentially disappear, speeding up your app by that amount, and 2) you can move on to whatever the "next big thing" is that takes time in your app.
EDIT:
Most programs have a lot more room for speedup than you might suppose. I made a walk-through project to illustrate this. If I can summarize it really quickly, it goes like this:
Original time is 2.7 msec per "job" (the number of "jobs" can be varied to get enough run-time to analyze it).
First cut showed roughly 60% of time was spent in vector operations, including indexing, appending, and removing. I replaced with a similar vector class from MFC, and time decreased to 1.8 msec/job. (That's a 1.5x or 50% speedup.)
Even with that array class, roughly 40% of time was spent in the [] indexing operator. I wanted it to index directly, so I forced it to index directly, not through the operator function. That reduced time to 1.5 msec/job, a 1.2x speedup.
Now roughly 60% of the time is adding/removing items in arrays. An additional fraction was spent in "new" and "delete". I decided to chuck the arrays and do two things. One was to use do-it-yourself linked lists, and to pool used objects. The first reduced time to 1.3 msec (1.15x). The second reduced it to 0.44 msec (2.95x).
Of that time, I found that about 60% of the time was in code I had written to do indexing into the list (as if it were an array). I decided that could be done instead just by having a pointer directly into the list. Result: 0.14 msec (3.14x).
Now I found that nearly all the time was being spent in a line of diagnostic I/O I was printing to the console. I decided to get rid of that: 0.0037 msec (38x).
I could have kept going, but I stopped.
The overall time per job was reduced by a compounded factor of about 700x.
What I want you to take away is if you need performance bad enough to deviate from what might be considered the accepted ways of doing things, you don't have to stop after one "bottleneck".
Just because you got a big speedup doesn't mean there are no more.
In fact the next "bottleneck" might be bigger than the first, in terms of speedup factor.
So raise your expectations of speedup you can get, and go for broke.

C++ map performance - Linux (30 sec) vs Windows (30 mins) !

I need to process a list of files. The processing action should not be repeated for the same file. The code I am using for this is -
using namespace std;
vector<File*> gInputFileList; //Can contain duplicates, File has member sFilename
map<string, File*> gProcessedFileList; //Using map to avoid linear search costs
void processFile(File* pFile)
{
File* pProcessedFile = gProcessedFileList[pFile->sFilename];
if(pProcessedFile != NULL)
return; //Already processed
foo(pFile); //foo() is the action to do for each file
gProcessedFileList[pFile->sFilename] = pFile;
}
void main()
{
size_t n= gInputFileList.size(); //Using array syntax (iterator syntax also gives identical performance)
for(size_t i=0; i<n; i++){
processFile(gInputFileList[i]);
}
}
The code works correctly, but...
My problem is that when the input size is 1000, it takes 30 minutes - HALF AN HOUR - on Windows/Visual Studio 2008 Express. For the same input, it takes only 40 seconds to run on Linux/gcc!
What could be the problem? The action foo() takes only a very short time to execute, when used separately. Should I be using something like vector::reserve for the map?
EDIT, EXTRA INFORMATION
What foo() does is:
1. it opens the file
2. reads it into memory
3. closes the file
4. the contents of the file in memory is parsed
5. it builds a list of tokens; I'm using a vector for that.
Whenever I break the program (while running the program with the 1000+ files input set): the call-stack shows that the program is in the middle of a std::vector add.

In the Microsoft Visual Studio, there's a global lock when accessing the Standard C++ Library to protect from multi threading issue in Debug builds. This can cause big performance hits. For instance, our full test code runs on Linux/gcc in 50 minutes, whereas it needs 5 hours on Windows VC++2008. Note that this performance hit does not exist when compiling in Release mode, using the non-debug Visual C++ runtime.

I would approach it like any performance problem. This means: profiling. MSVC has a built-in profiler, by the way, so it may be a good chance to get familiar with it.

Break into the program using the debugger at a random time, and the chances are very high that the stack trace will tell you where it's spending the time.

I very very strongly doubt that your performance problem is coming from the STL containers.
Try to eliminate (comment out) the call to foo(pFile) or any other method which touches the filesystem. Although running foo(pFile) once may appear fast, running it on 1000 different files (especially on Windows filesystems, in my experience) could turn out to be much slower (e.g. because of filesystem cache behaviour.)
EDIT
Your initial post was claiming that BOTH debug and release builds were affected. Now you are withdrawing that claim.
Be aware that in DEBUG builds:
the STL implementation performs
extra checks and assertions
heap
operations (memory allocation etc.)
perform extra checks and assertions;
moreover, under debug builds the
low-fragmentation heap is
disabled (up to a 10x overall
slowdown in memory allocation)
no code optimizations are performed,
which may result in further STL
performance degradation (STL relying many a time heavily on inlining, loop unwinding etc.)
With 1000 iterations you are probably not affected by the above (not at the outer loop level at least) unless you use STL/the heap heavily INSIDE foo().

I would be astounded if the performance issues you are seeing have anything at all to do with the map class. Doing 1000 lookups and 1000 insertion should take a combined time on the order of microseconds. What is foo() doing?

Without knowing how the rest of the code fits in, I think the overall idea of caching processed files is a little flaky.
Try removing duplicates from your vector first, then process them all.

Try commenting each block or major operation to determine which part actually caused the difference in execution time in Linux and Windows. I also don't think it would be because of the STL map. The problem may be inside foo(). It may be in some file operation as it is the only thing I could think of that would be costly in this case.
You may insert clock() calls in between operations to get an idea of the execution time.

You say that when you break, you find yourself inside vector::add. You don't have a vector::add in the code you've shown us, so I suspect it's inside the foo function. Without seeing that code, it's going to be difficult to say what's up.
You might have inadvertently created a Shlemiel the Painter algorithm.

You can improve things somewhat if you ditch your map and partition your vector instead. This implies reordering the input files list. It also means you have to find a way of quickly determining if a file has been processed already, possibly by holding a flag in the File class. If it's ok to reorder the files list and if you can store that dirty flag in the File object then you can improve performance from O(n log m) to O(n), for n total files and m processed files.
#include <algorithm>
#include <functional>
// ...
vector<File*>::iterator end(partition(inputfiles.begin(), inputfiles.end(),
not1(mem_fun(&File::is_processed))));
for_each(inputfiles.begin(), end, processFile);
If you can't reorder the files list or if you can't change the File object then you can switch the map with a vector and shadow each file in the input files list with a flag in the second vector at the same index. This will cost you O(n) space but will give you O(1) check for dirty state.
vector<File*> processed(inputfiles.size(), 0);
for( vector<File*>::size_type i(0); i != inputfiles.size(); ++i ) {
if( processed[i] != 0 ) return; // O(1)
// ...
processed[i] = inputfiles[i]; // O(1)
}
But be careful: You're dealing with two distinct pointers pointing at the same address, and that's the case for each pair of pointers in the two containers. Make sure one and only one pointer owns the pointee.
I don't expect either of these to yield a solution for that performance hit, but nevertheless.

If you are doing most of your work in linux then I strongly strongly suggest you only ever compile to release mode in windows. That makes life much easier, especially considering all the windows inflexible library handling headaches.

Shall I optimize or let compiler to do that?

What is the preferred method of writing loops according to efficiency:
Way a)
/*here I'm hoping that compiler will optimize this
code and won't be calling size every time it iterates through this loop*/
for (unsigned i = firstString.size(); i < anotherString.size(), ++i)
{
//do something
}
or maybe should I do it this way:
Way b)
unsigned first = firstString.size();
unsigned second = anotherString.size();
and now I can write:
for (unsigned i = first; i < second, ++i)
{
//do something
}
the second way seems to me like worse option for two reasons: scope polluting and verbosity but it has the advantage of being sure that size() will be invoked once for each object.
Looking forward to your answers.

I usually write this code as:
/* i and size are local to the loop */
for (size_t i = firstString.size(), size = anotherString.size(); i < size; ++i) {
// do something
}
This way I do not pollute the parent scope and avoid calling anotherString.size() for each loop iteration.
It is especially useful with iterators:
for(some_generic_type<T>::forward_iterator it = container.begin(), end = container.end();
it != end; ++it) {
// do something with *it
}
Since C++ 11 the code can be shortened even more by writing a range-based for loop:
for(const auto& item : container) {
// do something with item
}
or
for(auto item : container) {
// do something with item
}

In general, let the compiler do it. Focus on the algorithmic complexity of what you're doing rather than micro-optimizations.
However, note that your two examples are not semantically identical - if the body of the loop changes the size of the second string, the two loops will not iterate the same amount of times. For that reason, the compiler might not be able to perform the specific optimization you're talking about.

I would first use the first version, simply because it looks cleaner and easier to type. Then you can profile it to see if anything needs to be more optimized.
But I highly doubt that the first version will cause a noticable performance drop. If the container implements size() like this:
inline size_t size() const
{
return _internal_data_member_representing_size;
}
then the compiler should be able to inline the function, eliding the function call. My compiler's implementation of the standard containers all do this.

How will a good compiler optimize your code? Not at all, as it can't be sure size() has any side-effects. If size() had any side effects your code relied on, they'd now be gone after a possible compiler optimization.
This kind of optimization really isn't safe from a compiler's perspective, you need to do it on your own. Doing on your own doesn't mean you need to introduce two additional local variables. Depending on your implementation of size, it might be an O(1) operation. If size is also declared inline, you'll also spare the function call, making the call to size() as good as a local member access.

Don't pre-optimize your code. If you have a performance problem, use a profiler to find it, otherwise you are wasting development time. Just write the simplest / cleanest code that you can.

This is one of those things that you should test yourself. Run the loops 10,000 or even 100,000 iterations and see what difference, if any, exists.
That should tell you everything you want to know.

My recommendation is to let inconsequential optimizations creep into your style. What I mean by this is that if you learn a more optimal way of doing something, and you cant see any disadvantages to it (as far as maintainability, readability, etc) then you might as well adopt it.
But don't become obsessed. Optimizations that sacrifice maintainability should be saved for very small sections of code that you have measured and KNOW will have a major impact on your application. When you do decide to optimize, remember that picking the right algorithm for the job is often far more important than tight code.

I'm hoping that compiler will optimize this...
You shouldn't. Anything involving
A call to an unknown function or
A call to a method that might be overridden
is hard for a C++ compiler to optimize. You might get lucky, but you can't count on it.
Nevertheless, because you find the first version simpler and easier to read and understand, you should write the code exactly the way it is shown in your simple example, with the calls to size() in the loop. You should consider the second version, where you have extra variables that pull the common call out of the loop, only if your application is too slow and if you have measurements showing that this loop is a bottleneck.

Here's how I look at it. Performance and style are both important, and you have to choose between the two.
You can try it out and see if there is a performance hit. If there is an unacceptable performance hit, then choose the second option, otherwise feel free to choose style.

You shouldn't optimize your code, unless you have a proof (obtained via profiler) that this part of code is bottleneck. Needless code optimization will only waste your time, it won't improve anything.
You can waste hours trying to improve one loop, only to get 0.001% performance increase.
If you're worried about performance - use profilers.

There's nothing really wrong with way (b) if you just want to write something that will probably be no worse than way (a), and possibly faster. It also makes it clearer that you know that the string's size will remain constant.
The compiler may or may not spot that size will remain constant; just in case, you might as well perform this optimization yourself. I'd certainly do this if I was suspicious that the code I was writing was going to be run a lot, even if I wasn't sure that it would be a big deal. It's very straightforward to do, it takes no more than 10 extra seconds thinking about it, it's very unlikely to slow things down, and, if nothing else, will almost certainly make the unoptimized build run a bit more quickly.
(Also the first variable in style (b) is unnecessary; the code for the init expression is run only once.)

How much percent of time is spent in for as opposed to // do something? (Don't guess - sample it.) If it is < 10% you probably have bigger issues elsewhere.
Everybody says "Compilers are so smart these days."
Well they're no smarter than the poor coders who write them.
You need to be smart too. Maybe the compiler can optimize it but why tempt it not to?

For the "std::size_t size()const" member function which not only is O(1) but is also declared "const" and so can be automatically pulled out of the loop by the compiler, it probably doesn't matter. That said, I wouldn't count on the compiler to remove it from the loop, and I think it is a good habit to get into to factor out the calls within the loop for cases where the function isn't constant or O(1). In addition, I think assigning the values to a variable leads to the code being more readable. I would not suggest, though, that you make any premature optimizations if it will result in the code being harder to read. Again, though, I think the following code is more readable, since there is less to read within the loop:
std::size_t firststrlen = firststr.size();
std::size_t secondstrlen = secondstr.size();
for ( std::size_t i = firststrlen; i < secondstrlen; i++ ){
// ...
}
Also, I should point out that you should use "std::size_t" instead of "unsigned", as the type of "std::size_t" can vary from one platform to another, and using "unsigned" can lead to trunctations and errors on platforms for which the type of "std::size_t" is "unsigned long" instead of "unsigned int".

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js