A code works slowly because of compiler optimizations - c++

I have 3 separate global functions and I want to test it's speed. I'm using this code:
// case 1
{
chrono::duration<double, milli> totalTime;
for (uint32_t i{ 0 }; i < REPEATS; ++i)
{
auto start = chrono::steady_clock::now();
func1(); // "normal" c++ code
auto end = chrono::steady_clock::now();
auto diff = end - start;
cout << chrono::duration <double, milli>(diff).count() << " ms" << endl;
}
}
// case 2
{
chrono::duration<double, milli> totalTime;
for (uint32_t i{ 0 }; i < REPEATS; ++i)
{
auto start = chrono::steady_clock::now();
func2(); // multithreaded c++ code
auto end = chrono::steady_clock::now();
auto diff = end - start;
cout << chrono::duration <double, milli>(diff).count() << " ms" << endl;
}
}
// case 3
{
chrono::duration<double, milli> totalTime;
for (uint32_t i{ 0 }; i < REPEATS; ++i)
{
auto start = chrono::steady_clock::now();
func3(); // SIMD c++ code
auto end = chrono::steady_clock::now();
auto diff = end - start;
cout << chrono::duration <double, milli>(diff).count() << " ms" << endl;
}
}
This func1(), func2(), func3() are global functions that doesn't change the state of the program (I don't have any global variables).
The outputed result depends on the running cases. If I run case 1 and case 2 I have 100ms and 10ms respectively. If I run case 1 and case 3 I have 100ms and 130ms. If I run cases 1, 2, 3 I have 130ms, 10ms, 120 ms. The first case became slower on 30% and the third one became faster! If I run cases separatelly I have 100ms, 10ms, 130ms. I tried to turn optimisation off - the code became (surprise, surprise) much slower but at least the results are the same not depending on cases order. So I came to a conclusion that compiler do something special. Is it true?
I'm using Win7 and VS 2013.

A few things can happen:
Your test get's pre-empted by the kernel. Not much you can do to prevent that. Run your tests multiple times to make sure the results are consistent.
The compiler can inline your functions and optimize on the inlined code.
There is interaction between the functions. The simplest thing I would guess is memory allocation (i.e. func1() is optimized to request a memory block sufficient for the entire program, or one of the function brings some memory blocks into cache).
So suggestions:
First run each function before the benchmark to get rid of some of
the memory artifacts.
Run the benchmark a few times and see how
the values fluctuate to eliminate some of the OS artifacts.
Shuffle the order of the functions on each run, or take it as a
parameter, to make sure you eliminate ordering artifacts.
Don't do cout in your loop because that interacts with the OS and can mess up your cache or even get your process preempted. Write the results to some vector and output everything at the end.
There can be other things affecting your performance (i.e. Disk, networking, other processes, memory and cpu load), so take the values with a grain of salt.

Compiler optimizations vary. There are numerous things that compiler can do - one of optimizations(at least in GNU GCC) is aggresive loop unrolling - this might create faster code, but you must be aware that this can cause cache misses, effectively slowing down your code. That is, if we take just the compiler optimizations into consideration.
Now you have three different cases that if run separately give different output. This might be affected by alignment issue - if your code is properly aligned it will be faster, and if it isn't, additional padding might slow it down - I've seen similar thing happening in C#, but I can't find this thread now.
And last thing that could happen is you run too few tests to be sure - 10k tests is decent set, and you can start comparing speed output. One-time tests can be affected by OS, so keep that in mind.
Oh, and because Microsoft is brilliant at writing compilers, there are bugs in certain versions. I don't think the world of Microsoft's C++ compiler - there are many hacks and workarounds, it's not as up-to-date as other popular compilers - but that's simply my opinion. So another option is that compiler is simply malfunctioning. Also, see this and this beautiful typedef.

Related

Using std::async slower than non-async method to populate a vector

I am experimenting with std::async to populate a vector. The idea behind it is to use multi-threading to save time. However, running some benchmark tests I find that my non-async method is faster!
#include <algorithm>
#include <vector>
#include <future>
std::vector<int> Generate(int i)
{
std::vector<int> v;
for (int j = i; j < i + 10; ++j)
{
v.push_back(j);
}
return v;
}
Async:
std::vector<std::future<std::vector<int>>> futures;
for (int i = 0; i < 200; i+=10)
{
futures.push_back(std::async(
[](int i) { return Generate(i); }, i));
}
std::vector<int> res;
for (auto &&f : futures)
{
auto vec = f.get();
res.insert(std::end(res), std::begin(vec), std::end(vec));
}
Non-async:
std::vector<int> res;
for (int i = 0; i < 200; i+=10)
{
auto vec = Generate(i);
res.insert(std::end(res), std::begin(vec), std::end(vec));
}
My benchmark test shows that the async method is 71 times slower than non-async. What am I doing wrong?
std::async has two modes of operation:
std::launch::async
std::launch::deferred
In this case, you've called std::async without specifying either one, which means it's allowed to choose either one. std::launch::deferred basically means do the work on the calling thread. So std::async returns a future, and with std::launch::deferred, the action you've requested won't be carried out until you call .get on that future. It can be kind of handy under a few circumstances, but it's probably not what you want here.
Even if you specify std::launch::async, you need to realize that this starts up a new thread of execution to carry out the action you've requested. It then has to create a future, and use some sort of signalling from the thread to the future to let you know when the computation you've requested is done.
All of that adds a fair amount of overhead--anywhere from microseconds to milliseconds or so, depending on the OS, CPU, etc.
So, for asynchronous execution to make sense, the "stuff" you do asynchronously typically needs to take tens of milliseconds at the very least (and hundreds of milliseconds might be a more sensible lower threshold). I wouldn't get too wrapped up in the exact cutoff, but it needs to be something that takes a while.
So, filling an array asynchronously probably only makes sense if the array is quite a lot larger than you're dealing with here.
For filling memory, you'll quickly run into another problem though: most CPUs are enough faster than main memory that if all you're doing is writing to memory, there's a pretty good chance that a single thread will already saturate the path to memory, so even at best doing the job asynchronously will only gain a little, and may still pretty easily cause a slow-down.
The ideal case for asynchronous operation would be something like one thread that's heavily memory bound, but another that (for example) reads a little bit of data, and does a lot of computation on that small amount of data. In this case, the computation thread will mostly operate on its data in the cache, so it won't get in the way of the memory thread doing its thing.
There are multiple factors that are causing the Multithreaded code to perform (much) slower than the Singlethreaded code.
Your array sizes are too small
Multithreading often has negligible-to-no effect on datasets that are particularly small. In both versions of your code, you're generating 2000 integers, and each Logical Thread (which, because std::async is often implemented in terms of thread pools, might not be the same as a Software Thread) is only generating 10 integers. The cost of spooling up a thread every 10 integers way offsets the benefit of generating those integers in parallel.
You might see a performance gain if each thread were instead responsible for, say, 10,000 integers each, but you'll probably instead have a different issue:
All your code is bottlenecked by an inherently serial process
Both versions of the code copy the generated integers into a host vector. It would be one thing if the act of generating those integers was itself a time consuming process, but in your case, it's likely just a matter of a small, fast bit of assembly generating each integer.
So the act of copying each integer into the final vector is probably not inherently faster than generating each integer, meaning a sizable chunk of the "work" being done is completely serial, defeating the whole purpose of multithreading your code.
Fixing the code
Compilers are very good at their jobs, so in trying to revise your code, I was only barely able to get multithreaded code that was faster than the serial code. Multiple executions had varying results, so my general assessment is that this kind of code is bad at being multithreaded.
But here's what I came up with:
#include <algorithm>
#include <vector>
#include <future>
#include<chrono>
#include<iostream>
#include<iomanip>
//#1: Constants
constexpr int BLOCK_SIZE = 500000;
constexpr int NUM_OF_BLOCKS = 20;
std::vector<int> Generate(int i) {
std::vector<int> v;
for (int j = i; j < i + BLOCK_SIZE; ++j) {
v.push_back(j);
}
return v;
}
void asynchronous_attempt() {
std::vector<std::future<void>> futures;
//#2: Preallocated Vector
std::vector<int> res(NUM_OF_BLOCKS * BLOCK_SIZE);
auto it = res.begin();
for (int i = 0; i < NUM_OF_BLOCKS * BLOCK_SIZE; i+=BLOCK_SIZE)
{
futures.push_back(std::async(
[it](int i) {
auto vec = Generate(i);
//#3 Copying done multithreaded
std::copy(vec.begin(), vec.end(), it + i);
}, i));
}
for (auto &&f : futures) {
f.get();
}
}
void serial_attempt() {
//#4 Changes here to show fair comparison
std::vector<int> res(NUM_OF_BLOCKS * BLOCK_SIZE);
auto it = res.begin();
for (int i = 0; i < NUM_OF_BLOCKS * BLOCK_SIZE; i+=BLOCK_SIZE) {
auto vec = Generate(i);
it = std::copy(vec.begin(), vec.end(), it);
}
}
int main() {
using clock = std::chrono::steady_clock;
std::cout << "Theoretical # of Threads: " << std::thread::hardware_concurrency() << std::endl;
auto begin = clock::now();
asynchronous_attempt();
auto end = clock::now();
std::cout << "Duration of Multithreaded Attempt: " << std::setw(10) << (end - begin).count() << "ns" << std::endl;
begin = clock::now();
serial_attempt();
end = clock::now();
std::cout << "Duration of Serial Attempt: " << std::setw(10) << (end - begin).count() << "ns" << std::endl;
}
This resulted in the following output:
Theoretical # of Threads: 2
Duration of Multithreaded Attempt: 361149213ns
Duration of Serial Attempt: 364785676ns
Given that this was on an online compiler (here) I'm willing to bet the multithreaded code might win out on a dedicated machine, but I think this at least demonstrates the improvement in performance that we're at least on par between the two methods.
Below are the changes I made, that are ID'd in the code:
We've dramatically increased the number of integers being generated, to force the threads to do actual meaningful work, instead of getting bogged down on OS-level housekeeping
The vector has its size pre-allocated. No more frequent resizing.
Now that the space has been preallocated, we can multithread the copying instead of doing it in serial later.
We have to change the serial code so it also preallocates + copies so that it's a fair comparison.
Now, we've ensured that all the code is indeed running in parallel, and while it's not amounting to a substantial improvement over the serial code, it's at least no longer exhibiting the degenerate performance losses we were seeing before.
First of all, you are not forcing the std::async to work asynchronously (you would need to specify std::launch::async policy to do so). Second of all, it'd be kind of an overkill to asynchronously create an std::vector of 10 ints. It's just not worth it. Remember - using more threads does not mean that you will see performance benefit! Creating a thread (or even using a threadpool) introduces some overhead, which, in this case, seems to dwarf the benefits of running tasks asynchronously.
Thanks #NathanOliver ;>

Why is the performance of a running program getting better over time?

Consider the following code:
#include <iostream>
#include <chrono>
using Time = std::chrono::high_resolution_clock;
using us = std::chrono::microseconds;
int main()
{
volatile int i, k;
const int n = 1000000;
for(k = 0; k < 200; ++k) {
auto begin = Time::now();
for (i = 0; i < n; ++i); // <--
auto end = Time::now();
auto dur = std::chrono::duration_cast<us>(end - begin).count();
std::cout << dur << std::endl;
}
return 0;
}
I am repeatedly measuring the execution time of the inner for loop.
The results are shown in the following plot (y: duration, x: repetition):
What is causing the decreasing of the loop execution time?
Environment: linux (kernel 4.2) # Intel i7-2600, compiled using: g++ -std=c++11 main.cpp -O0 -o main
Edit 1
The question is not about compiler optimization or performance benchmarks.
The question is, why the performance gets better over time.
I am trying to understand what is happening at run-time.
Edit 2
As proposed by Vaughn Cato, I have changed the CPU frequency scaling policy to "Performance". Now I am getting the following results:
It confirms Vaughn Cato's conjecture. Sorry for the silly question.
What you are probably seeing is CPU frequency scaling (throttling). The CPU goes into a low-frequency state to save power when it isn't being heavily used.
Just before running your program, the CPU clock speed is probably fairly low, since there is no big load. When you run your program, the busy loop increases the load, and the CPU clock speed goes up until you hit the maximum clock speed, decreasing your times.
If you run your program several times in a row, you'll probably see the times stay at a lower value after the first run.
In you original experiment, there are too many variables than can affect the measurements:
the use of your processor by other active processes (i.e. scheduling of your OS)
The question whether your loop is optimized away or not
The access and buffering to the console.
The initial mode of your CPU (see answer about throtling)
I must admit that I was very skeptical about your observations. I therefore wrote a small variant using a preallocated vector, to avoid I/O synchronisation effects:
volatile int i, k;
const int n = 1000000, kmax=200,n_avg=30;
std::vector<long> v(kmax,0);
for(k = 0; k < kmax; ++k) {
auto begin = Time::now();
for (i = 0; i < n; ++i); // <-- remain thanks to volatile
auto end = Time::now();
auto dur = std::chrono::duration_cast<us>(end - begin).count();
v[k]=dur;
}
I then ran it several times on ideone (which, given the scale of its use, we can assume that in average the processor whould be in a constantly sollicitated state). Indeed your observations seemed to be confirmed.
I guess that this could be related to branch prediction, which should improve through the repetitive patterns.
I however went on, updated the code slightly and added a loop to repeat the experiment several times. Then I started to get also runs where your observation was not confirmed (i.e. at the end, the time was higher). But it may also be that the many other processes running on the ideone also influence the branch prediction in a different manner.
So in the end, to conclude anything would require a more cautious experiment, on a machine running this benchmark (and only it) a couple of hours.

Rarely executed and almost empty if statement drastically reduces performance in C++

Editor's clarification: When this was originally posted, there were two issues:
Test performance drops by a factor of three if seemingly inconsequential statement added
Time taken to complete the test appears to vary randomly
The second issue has been solved: the randomness only occurs when running under the debugger.
The remainder of this question should be understood as being about the first bullet point above, and in the context of running in VC++ 2010 Express's Release Mode with optimizations "Maximize Speed" and "favor fast code".
There are still some Comments in the comment section talking about the second point but they can now be disregarded.
I have a simulation where if I add a simple if statement into the while loop that runs the actual simulation, the performance drops about a factor of three (and I run a lot of calculations in the while loop, n-body gravity for the solar system besides other things) even though the if statement is almost never executed:
if (time - cb_last_orbital_update > 5000000)
{
cb_last_orbital_update = time;
}
with time and cb_last_orbital_update being both of type double and defined in the beginning of the main function, where this if statement is too. Usually there are computations I want to run there too, but it makes no difference if I delete them. The if statement as it is above has the same effect on the performance.
The variable time is the simulation time, it increases in 0.001 steps in the beginning so it takes a really long time until the if statement is executed for the first time (I also included printing a message to see if it is being executed, but it is not, or at least only when it's supposed to). Regardless, the performance drops by a factor of 3 even in the first minutes of the simulation when it hasn't been executed once yet. If I comment out the line
cb_last_orbital_update = time;
then it runs faster again, so it's not the check for
time - cb_last_orbital_update > 5000000
either, it's definitely the simple act of writing current simulation time into this variable.
Also, if I write the current time into another variable instead of cb_last_orbital_update, the performance does not drop. So this might be an issue with assigning a new value to a variable that is used to check if the "if" should be executed? These are all shots in the dark though.
Disclaimer: I am pretty new to programming, and sorry for all that text.
I am using Visual C++ 2010 Express, deactivating the stdafx.h precompiled header function didn't make a difference either.
EDIT: Basic structure of the program. Note that nowhere besides at the end of the while loop (time += time_interval;) is time changed. Also, cb_last_orbital_update has only 3 occurrences: Declaration / initialization, plus the two times in the if statement that is causing the problem.
int main(void)
{
...
double time = 0;
double time_interval = 0.001;
double cb_last_orbital_update = 0;
F_Rocket_Preset(time, time_interval, ...);
while(conditions)
{
Rocket[active].Stage[Rocket[active].r_stage].F_Update_Stage_Performance(time, time_interval, ...);
Rocket[active].F_Calculate_Aerodynamic_Variables(time);
Rocket[active].F_Calculate_Gravitational_Forces(cb_mu, cb_pos_d, time);
Rocket[active].F_Update_Rotation(time, time_interval, ...);
Rocket[active].F_Update_Position_Velocity(time_interval, time, ...);
Rocket[active].F_Calculate_Orbital_Elements(cb_mu);
F_Update_Celestial_Bodies(time, time_interval, ...);
if (time - cb_last_orbital_update > 5000000.0)
{
cb_last_orbital_update = time;
}
Rocket[active].F_Check_Apoapsis(time, time_interval);
Rocket[active].F_Status_Check(time, ...);
Rocket[active].F_Update_Mass (time_interval, time);
Rocket[active].F_Staging_Check (time, time_interval);
time += time_interval;
if (time > 3.1536E8)
{
std::cout << "\n\nBreak main loop! Sim Time: " << time << std::endl;
break;
}
}
...
}
EDIT 2:
Here is the difference in the assembly code. On the left is the fast code with the line
cb_last_orbital_update = time;
outcommented, on the right the slow code with the line.
EDIT 4:
So, i found a workaround that seems to work just fine so far:
int cb_orbit_update_counter = 1; // before while loop
if(time - cb_orbit_update_counter * 5E6 > 0)
{
cb_orbit_update_counter++;
}
EDIT 5:
While that workaround does work, it only works in combination with using __declspec(noinline). I just removed those from the function declarations again to see if that changes anything, and it does.
EDIT 6: Sorry this is getting confusing. I tracked down the culprit for the lower performance when removing __declspec(noinline) to this function, that is being executed inside the if:
__declspec(noinline) std::string F_Get_Body_Name(int r_body)
{
switch (r_body)
{
case 0:
{
return ("the Sun");
}
case 1:
{
return ("Mercury");
}
case 2:
{
return ("Venus");
}
case 3:
{
return ("Earth");
}
case 4:
{
return ("Mars");
}
case 5:
{
return ("Jupiter");
}
case 6:
{
return ("Saturn");
}
case 7:
{
return ("Uranus");
}
case 8:
{
return ("Neptune");
}
case 9:
{
return ("Pluto");
}
case 10:
{
return ("Ceres");
}
case 11:
{
return ("the Moon");
}
default:
{
return ("unnamed body");
}
}
}
The if also now does more than just increase the counter:
if(time - cb_orbit_update_counter * 1E7 > 0)
{
F_Update_Orbital_Elements_Of_Celestial_Bodies(args);
std::cout << F_Get_Body_Name(3) << " SMA: " << cb_sma[3] << "\tPos Earth: " << cb_pos_d[3][0] << " / " << cb_pos_d[3][1] << " / " << cb_pos_d[3][2] <<
"\tAlt: " << sqrt(pow(cb_pos_d[3][0] - cb_pos_d[0][0],2) + pow(cb_pos_d[3][1] - cb_pos_d[0][1],2) + pow(cb_pos_d[3][2] - cb_pos_d[0][2],2)) << std::endl;
std::cout << "Time: " << time << "\tcb_o_h[3]: " << cb_o_h[3] << std::endl;
cb_orbit_update_counter++;
}
I remove __declspec(noinline) from the function F_Get_Body_Name alone, the code gets slower. Similarly, if i remove the execution of this function or add __declspec(noinline) again, the code runs faster. All other functions still have __declspec(noinline).
EDIT 7:
So i changed the switch function to
const std::string cb_names[] = {"the Sun","Mercury","Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune","Pluto","Ceres","the Moon","unnamed body"}; // global definition
const int cb_number = 12; // global definition
std::string F_Get_Body_Name(int r_body)
{
if (r_body >= 0 && r_body < cb_number)
{
return (cb_names[r_body]);
}
else
{
return (cb_names[cb_number]);
}
}
and also made another part of the code slimmer. The program now runs fast without any __declspec(noinline). As ElderBug suggested, an issue with the CPU instruction cache then / the code getting too big?
I'd put my money on Intel's branch predictor. http://en.wikipedia.org/wiki/Branch_predictor
The processor assumes (time - cb_last_orbital_update > 5000000) to be false most of the time and loads up the execution pipeline accordingly.
Once the condition (time - cb_last_orbital_update > 5000000) comes true. The misprediction delay is hitting you. You may loose 10 to 20 cycles.
if (time - cb_last_orbital_update > 5000000)
{
cb_last_orbital_update = time;
}
Something is happening that you don't expect.
One candidate is some uninitialised variables hanging around somewhere, which have different values depending on the exact code that you are running. For example, you might have uninitialised memory that is sometime a denormalised floating point number, and sometime it's not.
I think it should be clear that your code doesn't do what you expect it to do. So try debugging your code, compile with all warnings enabled, make sure you use the same compiler options (optimised vs. non-optimised can easily be a factor 10). Check that you get the same results.
Especially when you say "it runs faster again (this doesn't always work though, but i can't see a pattern). Also worked with changing 5000000 to 5E6 once. It only runs fast once though, recompiling causes the performance to drop again without changing anything. One time it ran slower only after recompiling twice." it looks quite likely that you are using different compiler options.
I will try another guess. This is hypothetical, and would be mostly due to the compiler.
My guess is that you use a lot of floating point calculations, and the introduction and use of double values in your main makes the compiler run out of XMM registers (the floating point SSE registers). This force the compiler to use memory instead of registers, and induce a lot of swapping between memory and registers, thus greatly reducing the performance. This would be happening mainly because of the computations functions inlining, because function calls are preserving registers.
The solution would be to add __declspec(noinline) to ALL your computation functions declarations.
I suggest using the Microsoft Profile Guided Optimizer -- if the compiler is making the wrong assumption for this particular branch it will help, and it will in all likelihood improve speed for the rest of the code as well.
Workaround, try 2:
The code is now looking like this:
int cb_orbit_update_counter = 1; // before while loop
if(time - cb_orbit_update_counter * 5E6 > 0)
{
cb_orbit_update_counter++;
}
So far it runs fast, plus the code is being executed when it should as far as i can tell. Again only a workaround, but if this proves to work all around then i'm satisfied.
After some more testing, seems good.
My guess is that this is because the variable cb_last_orbital_update is otherwise read-only, so when you assign to it inside the if, it destroys some optimizations that the compiler has for read-only variables (e.g. perhaps it's now stored in memory instead of a register).
Something to try (although this might still not work) is to make a third variable that is initialized via cb_last_orbital_update and time depending on whether the condition is true, and using that one instead. Presumably, the compiler would now treat that variable as a constant, but I'm not sure.

slow serial "for" with openmp enabled

I try to use openmp and find strange results.
Parallel "for" run faster with openmp as expected. But serial "for" run much faster when openmp disabled (without /openmp option. vs 2013).
Test code
const int n = 5000;
const int m = 2000000;
vector <double> a(n, 0);
double start = omp_get_wtime();
#pragma omp parallel for shared(a)
for (int i = 0; i < n; i++)
{
double StartVal = i;
for (int j = 0; j < m; ++j)
{
a[i] = (StartVal + log(exp(exp((double)i))));
}
}
cout << "omp Time: " << (omp_get_wtime() - start) << endl;
start = omp_get_wtime();
for (int i = 0; i < n; i++)
{
double StartVal = i;
for (int j = 0; j < m; ++j)
{
a[i] = (StartVal + log(exp(exp((double)i))));
}
}
cout << "serial Time: " << (omp_get_wtime() - start) << endl;
Output without /openmp option
0
omp Time: 6.4389
serial Time: 6.37592
Output with /openmp option
0
1
2
3
omp Time: 1.84636
serial Time: 16.353
Is it correct results? Or I'm doing something wrong?
I believe part of the answer lies hidden in the architecture of the computer you run on. I tried running the same code another machine (GCC 4.8 on GNU+Linux, quad Core2 CPU), and over many runs, found a slightly odd thing: while the time for both loops varied, and OpenMP with many threads always ran faster, the second loop never ran significantly faster than the first, even without OpenMP.
The next step was to try to eliminate a dependency between the loops, allocating a second vector for the second loop. It still ran no faster than the first. So I tried reversing them, running the OpenMP loop after the serial one; and while it still ran fast when multithreaded, it would now see delays when the first loop didn't. It's looking more like an operating system behaviour at this point; long-lived threads simply seem more likely to get interrupted. I had taken some measures to reduce interruptions (niceness -15, specific cpu set) but this is not a system dedicated to benchmarking.
None of my results were anywhere near as extreme as yours, however. My first guess as to what caused your large difference was that you reused the same array and ran the parallel loop first. This would distribute the array into caches on all cores, causing a slight dilemma of whether to migrade the thread to the data or the other way around; and OpenMP may have chosen any distribution, including iteration i to thread i%threads (as with schedule(static,1)), which probably would hurt multithreaded runtime, or one cacheline each which would hurt later single threaded reading if it fit in per-core caches. However, all of the array accesses are writes, so the processor shouldn't need to wait for them in the first place.
In summary, your results are certainly platform dependent and unexpected. I would suggest rerunning the test with swapped order, the two loops operating on different arrays, and placed in different compilation units, and of course to verify the written results. It is possible you've found a flaw in your compiler.

OpenMP and File I/O

I'm doing some time trials on my code, and logically it seems really easy to parallelize with OpenMP as each trial is independent of the others. As it stands, my code looks something like this:
for(int size = 30; size < 50; ++size) {
#pragma omp parallel for
for(int trial = 0; trial < 8; ++trial) {
time_t start, end;
//initializations
time(&start);
//perform computation
time(&end);
output << size << "\t" << difftime(end,start) << endl;
}
output << endl;
}
I have a sneaking suspicion that this is kind of a faux pas, however, as two threads may simultaneously write values to the output, thus screwing up the formatting. Is this a problem, and if so, will surrounding the output << size << ... code with a #pragma omp critical statement fix it?
Never mind whether your output will be screwed up (it likely will). Unless you're really careful to assign your OpenMP threads to different processors that don't share resources like memory bandwidth, your time trials aren't very meaningful either. Different runs will be interfering with each other.
The solution to the problem you're asking about is to write the result times into designated elements of an array, with one slot for each trial, and ouput the results after the fact.
As long as you don't mind the individual lines being out of order you'll be fine. OpenMP should make sure a whole line is printed at a time.
However, you will need to declare start and end as private in the pragma otherwise the threads will overwrite them and mess up your timings.