Strange performance behaviour for 64 bit modulo operation - c++

The last three of the method calls below take approximately twice as long as the first four.
The only difference is that their arguments no longer fit in an int. But why should this matter? The parameter is declared as long, so the calculation should be done in long arithmetic anyway. Does the modulo operation use a different algorithm for numbers above int.MaxValue?
I am using an AMD Athlon 64 3200+, Windows XP SP3 and VS2008.
Stopwatch sw = new Stopwatch();

TestLong(sw, int.MaxValue - 3l);
TestLong(sw, int.MaxValue - 2l);
TestLong(sw, int.MaxValue - 1l);
TestLong(sw, int.MaxValue);
TestLong(sw, int.MaxValue + 1l);
TestLong(sw, int.MaxValue + 2l);
TestLong(sw, int.MaxValue + 3l);
Console.ReadLine();

static void TestLong(Stopwatch sw, long num)
{
    long n = 0;
    sw.Reset();
    sw.Start();
    for (long i = 3; i < 20000000; i++)
    {
        n += num % i;
    }
    sw.Stop();
    Console.WriteLine(sw.Elapsed);
}
EDIT:
I have now tried the same thing in C, and the issue does not occur there: all modulo operations take the same time, in both release and debug mode, with and without optimizations turned on:
#include "stdafx.h"
#include "time.h"
#include "limits.h"
static void TestLong(long long num)
{
long long n = 0;
clock_t t = clock();
for (long long i = 3; i < 20000000LL*100; i++)
{
n += num % i;
}
printf("%d - %lld\n", clock()-t, n);
}
int main()
{
printf("%i %i %i %i\n\n", sizeof (int), sizeof(long), sizeof(long long), sizeof(void*));
TestLong(3);
TestLong(10);
TestLong(131);
TestLong(INT_MAX - 1L);
TestLong(UINT_MAX +1LL);
TestLong(INT_MAX + 1LL);
TestLong(LLONG_MAX-1LL);
getchar();
return 0;
}
EDIT2:
Thanks for the great suggestions. I found that neither .NET nor C (in debug as well as in release mode) uses a single CPU instruction to calculate the remainder; both call a helper function that does it.
In the C program I could get the name of that function, which is "_allrem". It also came with full source comments, so I found the information that this algorithm special-cases 32-bit divisors rather than dividends, which was the case in the .NET application.
I also found out that the performance of the C program really is affected only by the value of the divisor, not the dividend. Another test showed that the performance of the remainder function in the .NET program depends on both the dividend and the divisor.
BTW: Even simple additions of long long values are calculated by consecutive add and adc instructions. So even if my processor calls itself 64-bit, it really isn't :(
EDIT3:
I have now run the C app on a Windows 7 x64 edition, compiled with Visual Studio 2010. The funny thing is that the performance behaviour stays the same, although now (I checked the assembly source) true 64-bit instructions are used.

What a curious observation. Here's something you can do to investigate this further: add a "pause" at the beginning of the program, like a Console.ReadLine, but AFTER the first call to your method. Then build the program in "release" mode. Then start the program not in the debugger. Then, at the pause, attach the debugger. Debug through it and take a look at the code jitted for the method in question. It should be pretty easy to find the loop body.
It would be interesting to know how the generated loop body differs from that in your C program.
The reason for all those hoops to jump through is that the jitter changes what code it generates when jitting a "debug" assembly or when jitting a program that already has a debugger attached; in those cases it jits code that is easier to understand in a debugger. It would be more interesting to see what the jitter thinks is the "best" code generated for this case, so you have to attach the debugger late, after the jitter has run.

Have you tried performing the same operations in native code on your box?
I wouldn't be surprised if the native 64-bit remainder operation special-cased situations where both arguments are within the 32-bit range, basically delegating that to the 32-bit operation. (Or possibly it's the JIT that does that...) It does make a fair amount of sense to optimise that case, doesn't it?
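Purely for illustration, the kind of special-casing being hypothesised might look roughly like the sketch below: a 64-bit remainder helper that drops to a single 32-bit division when both operands happen to fit in 32 bits. This is a hypothetical sketch, not the actual code of the CLR helper or of _allrem; the function name and structure are invented.

#include <cstdint>

// Hypothetical sketch of a 64-bit remainder helper with a 32-bit fast path.
// Real runtime helpers are hand-written assembly and differ in many details.
long long remainder64(long long dividend, long long divisor)
{
    // If both values are small non-negative numbers, one 32-bit division suffices.
    if (dividend >= 0 && divisor > 0 &&
        dividend <= INT32_MAX && divisor <= INT32_MAX)
    {
        return static_cast<std::int32_t>(dividend) % static_cast<std::int32_t>(divisor);
    }

    // Otherwise fall back to the full (slower) 64-bit path.
    return dividend % divisor;   // stands in for the multi-instruction 64-bit algorithm
}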

Related

C/C++ elapsed process cycles, not including at breakpoints

See the following code, which is my attempt to print the time elapsed between loops.
#include <stdio.h>
#include <time.h>

int main()
{
    while (true)
    {
        static clock_t timer;
        clock_t t = clock();
        clock_t elapsed = t - timer;
        float elapsed_sec = (float)elapsed / (float)CLOCKS_PER_SEC;
        timer = t;
        printf("[dt:%d ms].\n", (int)(elapsed_sec * 1000));
    }
}
However, if I set a breakpoint and sit there for 10 seconds, when I continue execution the elapsed time includes those 10 seconds, and I don't want it to, for my intended usage.
I assume clock() is the wrong function, but what is the correct one?
Note that if there is no single standard C or C++ call for this -- well, how do you compute it? Is there a POSIX way?
I suspect that this is actually information only knowable with platform-specific calls. If that is the case, I'd like to at least know how to do it on Windows (MSVC).
Yes, trying to measure the CPU time of your process would be dependent on support from your operating system. Rather than look up what support is available from various operating systems, though, I would propose that your approach is flawed.
Debugging typically uses a debug build that has most optimizations turned off (to make it easier to do things like set breakpoints accurately). Timings of a non-optimized build lack practical value. Hence any timings of your program should usually be ignored when you are using breakpoints, even if the breakpoint is outside the timed section.
To combine using breakpoints with timings, I would do the debugging in two phases. In one phase, you set breakpoints and look at what is happening in the debug build. In the other phase, you use identical input (redirect a file into std::cin if it helps) and time the process in the release build. There may be some back-and-forth between the stages as you work out what is going on. That's fine; the point is not to have exactly two phases, but to keep breakpoints and timings separate.
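That said, since the question asks how to do this on Windows specifically: process CPU time, which does not advance while the process is stopped in the debugger, can be queried with GetProcessTimes. The following is an illustrative sketch, not part of the original answer:

#include <windows.h>
#include <stdio.h>

// Returns the CPU time (user + kernel) consumed by the current process, in milliseconds.
// CPU time does not accumulate while all threads are suspended at a breakpoint.
static double process_cpu_ms(void)
{
    FILETIME creationTime, exitTime, kernelTime, userTime;
    if (!GetProcessTimes(GetCurrentProcess(),
                         &creationTime, &exitTime, &kernelTime, &userTime))
        return 0.0;

    ULARGE_INTEGER k, u;
    k.LowPart = kernelTime.dwLowDateTime;  k.HighPart = kernelTime.dwHighDateTime;
    u.LowPart = userTime.dwLowDateTime;    u.HighPart = userTime.dwHighDateTime;

    // FILETIME counts 100-nanosecond intervals.
    return (double)(k.QuadPart + u.QuadPart) / 10000.0;
}

int main()
{
    double before = process_cpu_ms();
    /* ... code to be timed; time spent stopped at a breakpoint is not counted ... */
    double after = process_cpu_ms();
    printf("[dt:%d ms].\n", (int)(after - before));
    return 0;
}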
Although JaMit gives a better answer (here), it is possible, but it depends entirely on your compiler, and the amount of overhead this creates will probably slow down your program too much to get an accurate result. You can use whatever time-recording function you please, but either way you would have to:
Record the start of the loop
Record the start of the breakpoint
Programmatically cause a breakpoint
Record the end of the breakpoint
Record the end of the loop.
If you're looking for speed though, you really need to be testing in an optimized release mode without breakpoints and writing the output to the console or a file. Nonetheless, it is possible to do what you're trying to do, and here's a sample solution.
#include <chrono>
#include <intrin.h> // include for Visual Studio break
#include <iostream>

int main(void) {
    for (int c = 0; c < 100; c++) {
        // Start of loop
        auto start = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::high_resolution_clock::now().time_since_epoch()).count();

        /* Do stuff here */

        auto startBreak = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::high_resolution_clock::now().time_since_epoch()).count();

        // Visual Studio only
        __debugbreak();

        auto endBreak = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::high_resolution_clock::now().time_since_epoch()).count();

        /* Do more stuff here */

        auto end = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::high_resolution_clock::now().time_since_epoch()).count();

        /* Time for 1 pass of the loop, including all the records of getting the time,
           but excluding the time spent stopped at the breakpoint */
        std::cout << end - start - (endBreak - startBreak) << "\n";
    }
    return 0;
}
Given your objective, which is to print the time elapsed between loops, in the C language:
Note that clock() returns the number of clock ticks since the program started.
This code measures the elapsed time for each loop by showing the start/end time for each loop:
#include <stdio.h>
#include <time.h>

int main( void )
{
    clock_t loopStart = clock();
    clock_t loopEnd;

    for( int i = 0; i < 10000; i++ )
    {
        // something here to be timed
        loopEnd = clock();
        printf("[%lf : %lf sec].\n",
               (double)loopStart / CLOCKS_PER_SEC,
               (double)loopEnd / CLOCKS_PER_SEC );
        loopStart = loopEnd;
    }
    return 0;
}
Of course, if you want to display the actual number of clock ticks per loop, remove the division by CLOCKS_PER_SEC, calculate the difference, and display only the difference.
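For example, that per-loop tick-difference variant might look like this (a small sketch along the lines the answer describes):

#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t loopStart = clock();

    for (int i = 0; i < 10000; i++)
    {
        // something here to be timed
        clock_t loopEnd = clock();
        // print only the number of clock ticks this iteration took
        printf("[%ld ticks].\n", (long)(loopEnd - loopStart));
        loopStart = loopEnd;
    }
    return 0;
}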

Forwards vs Backwards array walking

Let me first preface this with the fact that I know these kind of micro-optimisations are rarely cost-effective. I'm curious about how stuff works though. For all cacheline numbers etc, I am thinking in terms of an x86-64 i5 Intel CPU. The numbers would obviously differ for different CPUs.
I've often been under the impression that walking an array forwards is faster than walking it backwards. This is, I believed, due to the fact that pulling in large amounts of data is done in a forward-facing manner - that is, if I read byte 128, then the cache line (assuming it is 64 bytes in length) will read in bytes 128-191 inclusive. Consequently, if the next byte I wanted to access was at 129, it would already be in the cache.
However, after reading a bit, I'm now under the impression that it actually wouldn't matter. Because cache-line alignment will pick the starting point at the closest 64-byte-aligned boundary, if I pick byte 127 to start with, I will load bytes 64-127 inclusive, and consequently will have the data in the cache for my backwards walk. I will suffer a cache miss when transitioning from 128 to 127, but that's a consequence of where I've picked the addresses for this example more than any real-world consideration.
I am aware that cache lines are read in 8-byte chunks, and as such the full cache line would have to be loaded before the first operation could begin if we were walking backwards, but I doubt it would make a hugely significant difference.
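(To make the alignment argument concrete, here is a tiny illustrative sketch; the 64-byte line size is an assumption, not something queried from the hardware.)

#include <cstdio>

int main()
{
    const unsigned lineSize = 64;                  // assumed cache-line size in bytes

    unsigned addr      = 127;                      // the byte we touch first
    unsigned lineStart = addr & ~(lineSize - 1);   // 64: nearest lower 64-byte boundary
    unsigned lineEnd   = lineStart + lineSize - 1; // 127: last byte of that line

    std::printf("byte %u lives in cache line [%u, %u]\n", addr, lineStart, lineEnd);
    return 0;
}

So touching byte 127 pulls in bytes 64-127, while touching byte 128 pulls in 128-191, which is the boundary crossed in the example above.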
Could somebody clear up whether I'm right here and my old belief is wrong? I've searched for a full day and still haven't been able to get a definitive answer on this.
tl;dr: Is the direction in which we walk an array really that important? Does it actually make a difference? Did it make a difference in the past? (Up to 15 years back or so.)
I have tested with the following basic code, and see the same results forwards and backwards:
#include <windows.h>
#include <iostream>
#include <cstring>

// Size of dataset (number of ints)
#define SIZE_OF_ARRAY 1024*1024*256

// Are we walking forwards or backwards?
#define FORWARDS 1

int main()
{
    // Timer setup
    LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
    LARGE_INTEGER Frequency;

    int* intArray = new int[SIZE_OF_ARRAY];

    // Memset - shouldn't affect the test because my cache is nowhere near this big!
    // (the byte count has to be scaled by sizeof(int))
    memset(intArray, 0, SIZE_OF_ARRAY * sizeof(int));

    // Arbitrary numbers for break points
    intArray[SIZE_OF_ARRAY - 1] = 55;
    intArray[0] = 15;

    int* backwardsPtr = &intArray[SIZE_OF_ARRAY - 1];

    QueryPerformanceFrequency(&Frequency);
    QueryPerformanceCounter(&StartingTime);

    // Actual code
    if (FORWARDS)
    {
        while (true)
        {
            if (*(intArray++) == 55)
                break;
        }
    }
    else
    {
        while (true)
        {
            if (*(backwardsPtr--) == 15)
                break;
        }
    }

    // Cleanup
    QueryPerformanceCounter(&EndingTime);
    ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
    ElapsedMicroseconds.QuadPart *= 1000000;
    ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;

    std::cout << ElapsedMicroseconds.QuadPart << std::endl;

    // So I can read the output
    char a;
    std::cin >> a;

    return 0;
}
I apologise for A) Windows code, and B) Hacky implementation. It's thrown together to test a hypothesis, but doesn't prove the reasoning.
Any information about how the walking direction could make a difference, not just with cache but also other aspects, would be greatly appreciated!
Just as your experimentation shows, there is no difference. Unlike the interface between the processor and the L1 cache, the memory system transacts in full cache lines, not bytes. As @user657267 pointed out, processor-specific prefetchers exist. These might favour forward over backward, but I heavily doubt it. All modern prefetchers detect direction rather than assuming it. Furthermore, they detect stride as well. They involve incredibly complex logic, and something as easy as direction isn't going to be their downfall.
Short answer: go in either direction you want and enjoy the same performance for both!
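If you want to repeat the measurement without the Windows-specific timer, a minimal portable sketch along the same lines (my illustration; the array size and simple summation loops are arbitrary choices) could look like this:

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

// Sums the array front-to-back or back-to-front and reports the elapsed time.
static long long walk(const std::vector<int>& data, bool forwards)
{
    auto t0 = std::chrono::steady_clock::now();

    long long sum = 0;
    if (forwards)
        for (std::size_t i = 0; i < data.size(); ++i) sum += data[i];
    else
        for (std::size_t i = data.size(); i-- > 0; )  sum += data[i];

    auto t1 = std::chrono::steady_clock::now();
    std::cout << (forwards ? "forwards:  " : "backwards: ")
              << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << " ms\n";
    return sum;   // returned so the loop isn't optimised away entirely
}

int main()
{
    std::vector<int> data(64 * 1024 * 1024, 1);   // 256 MB of ints, far larger than any cache
    volatile long long sink = 0;
    sink += walk(data, true);
    sink += walk(data, false);
    return 0;
}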

While loop behaving unexpectedly

I am not sure whether this problem is compiler-specific or not, but I'll ask anyway. I'm using CCS (Code Composer Studio), which is an IDE from Texas Instruments for programming the MSP430 microcontroller.
As usual, I'm writing the beginner program that blinks the LED attached to the last bit of the P1OUT register. Here's the code that DOESN'T work (I've omitted some of the other declarations, which are irrelevant):
while(1){
    int i;
    P1OUT ^= 0x01;
    i = 10000;
    while(i != 0){
        i--;
    }
}
Now, here's the loop that DOES work:
while(1){
    int i;
    P1OUT ^= 0x01;
    i = 0;
    while(i < 10000){
        i++;
    }
}
The two statements should be equivalent, but in the first instance, the LED stays on and doesn't blink, while in the second, it works as planned.
I'm thinking it has to do with some optimization done by the compiler, but I have no idea as to what specifically may be wrong.
The code is probably being optimised away as dead code. You don't want to spin like that anyway; it's terribly wasteful of CPU cycles. You want to simply call usleep, something like:
#include <unistd.h>

int microseconds = /* number of 1000ths of milliseconds to wait */;

while(1){
    P1OUT ^= 0x01;
    usleep(microseconds);
}
CCS can optimize code in ways you would never expect (also check the optimization levels in the project properties). The easiest way is to declare the variable with the volatile keyword and you are done.
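For example, the non-working loop from the question with the counter declared volatile, so the compiler is not allowed to drop the busy-wait (a sketch of the suggested fix, not verified on the actual hardware):

while(1){
    volatile int i;     // volatile: the compiler must perform every decrement
    P1OUT ^= 0x01;
    i = 10000;
    while(i != 0){
        i--;
    }
}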

Interesting processing time results

I've made a small application that averages the numbers between 1 and 1000000. It's not hard to see (using a very basic algebraic formula) that the average is 500000.5, but this was more of a project in learning C++ than anything else.
Anyway, I made clock variables that were designed to find the number of clock ticks required for the application to run. When I first ran it, it said that it took 3770000 clock ticks, but every time I've run it since then, it's taken "0.0" seconds...
I've attached my code at the bottom.
Either a) it has saved the variables from the first time I ran it and is just jumping straight to the answer...
or b) something is wrong with how I'm declaring the time variables.
Regardless... it doesn't make sense.
Any help would be appreciated.
FYI (I'm running this through a Linux computer, not sure if that matters)
#include <stdio.h>
#include <time.h>

double avg (int arr[], int beg, int end)
{
    int nums = end - beg + 1;
    double sum = 0.0;

    for(int i = beg; i <= end; i++)
    {
        sum += arr[i];
    }
    //for(int p = 0; p < nums*10000; p ++){}
    return sum/nums;
}

int main (int argc, char *argv[])
{
    int nums = 1000000; //atoi(argv[0]);
    int myarray[nums];
    double timediff;

    //printf("Arg is: %d\n",argv[0]);
    printf("Nums is: %d\n", nums);

    clock_t begin_time = clock();

    for(int i = 0; i < nums; i++)
    {
        myarray[i] = i+1;
    }

    double average = avg(myarray, 0, nums - 1);
    printf("%f\n", average);

    clock_t end_time = clock();
    timediff = (double) difftime(end_time, begin_time);
    printf("Time to Average: %f\n", timediff);

    return 0;
}
You are measuring the I/O operation (printf) too; that depends on external factors and might be affecting the run time. Also, clock() might not be as precise as needed to measure such a small task; look into higher-resolution functions such as clock_get_time(). Even then, other processes might affect the run time by generating page-fault interrupts, occupying the memory bus, and so on. So this kind of fluctuation is not abnormal at all.
On the machine I tested, Linux's clock call was only accurate to 1/100th of a second. If your code runs in less than 0.01 seconds, it will usually say zero seconds have passed. Also, I ran your program a total of 50 times in 0.13 seconds, so I find it suspicious that you claim it takes 2 seconds to run it once on your computer.
Your code incorrectly uses difftime, so it may display incorrect output even when clock says time did pass.
I'd guess that the first timing you got was with different code than that posted in this question, because I can't think of any way the code in this question could produce a time of 3770000.
Finally, benchmarking is hard, and your code has several benchmarking mistakes:
You're timing how long it takes to (1) fill an array, (2) calculate an average, (3) format the result string, and (4) make an OS call that prints said string in the right language/font/color/etc., which is especially slow.
You're attempting to time a task which takes less than a hundredth of a second, which is WAY too small for any accurate measurement.
Here is my take on your code, measuring that the average takes ~0.001968 seconds on this machine.
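The linked code isn't reproduced above, but a sketch in the same spirit - timing only the averaging step and repeating it enough times to rise above clock()'s resolution - might look like this (the repeat count is an arbitrary choice):

#include <cstdio>
#include <ctime>
#include <vector>

// Average the values with a plain loop (same idea as the question's avg()).
static double avg(const std::vector<int>& arr)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < arr.size(); ++i)
        sum += arr[i];
    return sum / arr.size();
}

int main()
{
    const int nums = 1000000;
    std::vector<int> myarray(nums);
    for (int i = 0; i < nums; i++)
        myarray[i] = i + 1;

    const int repeats = 100;        // repeat so the total is well above clock() resolution
    volatile double sink = 0.0;     // keep the result live so the loop isn't removed

    std::clock_t begin_time = std::clock();
    for (int r = 0; r < repeats; r++)
        sink += avg(myarray);
    std::clock_t end_time = std::clock();

    double seconds_per_avg =
        (double)(end_time - begin_time) / CLOCKS_PER_SEC / repeats;
    std::printf("Time to Average: %f seconds\n", seconds_per_avg);
    return 0;
}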

Executable runs faster on Wine than Windows -- why?

Solution: Apparently the culprit was the use of floor(), the performance of which turns out to be OS-dependent in glibc.
This is a followup question to an earlier one: Same program faster on Linux than Windows -- why?
I have a small C++ program, that, when compiled with nuwen gcc 4.6.1, runs much faster on Wine than Windows XP (on the same computer). The question: why does this happen?
The timings are ~15.8 and 25.9 seconds, for Wine and Windows respectively. Note that I'm talking about the same executable, not only the same C++ program.
The source code is at the end of the post. The compiled executable is here (if you trust me enough).
This particular program does nothing useful, it is just a minimal example boiled down from a larger program I have. Please see this other question for some more precise benchmarking of the original program (important!!) and the most common possibilities ruled out (such as other programs hogging the CPU on Windows, process startup penalty, difference in system calls such as memory allocation). Also note that while here I used rand() for simplicity, in the original I used my own RNG which I know does no heap-allocation.
The reason I opened a new question on the topic is that now I can post an actual simplified code example for reproducing the phenomenon.
The code:
#include <cstdlib>
#include <cmath>

int irand(int top) {
    return int(std::floor((std::rand() / (RAND_MAX + 1.0)) * top));
}

template<typename T>
class Vector {
    T *vec;
    const int sz;
public:
    Vector(int n) : sz(n) {
        vec = new T[sz];
    }
    ~Vector() {
        delete [] vec;
    }
    int size() const { return sz; }
    const T & operator [] (int i) const { return vec[i]; }
    T & operator [] (int i) { return vec[i]; }
};

int main() {
    const int tmax = 20000; // increase this to make it run longer
    const int m = 10000;

    Vector<int> vec(150);
    for (int i = 0; i < vec.size(); ++i)
        vec[i] = 0;

    // main loop
    for (int t = 0; t < tmax; ++t)
        for (int j = 0; j < m; ++j) {
            int s = irand(100) + 1;
            vec[s] += 1;
        }

    return 0;
}
UPDATE
It seems that if I replace irand() above with something deterministic such as
int irand(int top) {
    static int c = 0;
    return (c++) % top;
}
then the timing difference disappears. I'd like to note though that in my original program I used a different RNG, not the system rand(). I'm digging into the source of that now.
UPDATE 2
Now I have replaced the irand() function with an equivalent of what I had in the original program. It is a bit lengthy (the algorithm is from Numerical Recipes), but the point was to show that no system libraries are being called explicitly (except possibly through floor()). Yet the timing difference is still there!
Perhaps floor() could be to blame? Or does the compiler generate calls to something else?
class ran1 {
    static const int table_len = 32;
    static const int int_max = (1u << 31) - 1;
    int idum;
    int next;
    int *shuffle_table;

    void propagate() {
        const int int_quo = 1277731;
        int k = idum/int_quo;
        idum = 16807*(idum - k*int_quo) - 2836*k;
        if (idum < 0)
            idum += int_max;
    }

public:
    ran1() {
        shuffle_table = new int[table_len];
        seedrand(54321);
    }
    ~ran1() {
        delete [] shuffle_table;
    }

    void seedrand(int seed) {
        idum = seed;
        for (int i = table_len-1; i >= 0; i--) {
            propagate();
            shuffle_table[i] = idum;
        }
        next = idum;
    }

    double frand() {
        int i = next/(1 + (int_max-1)/table_len);
        next = shuffle_table[i];
        propagate();
        shuffle_table[i] = idum;
        return next/(int_max + 1.0);
    }
} rng;

int irand(int top) {
    return int(std::floor(rng.frand() * top));
}
edit: It turned out that the culprit was floor() and not rand() as I suspected - see
the update at the top of the OP's question.
The run time of your program is dominated by the calls to rand().
I therefore think that rand() is the culprit. I suspect that the underlying function is provided by the WINE/Windows runtime, and the two implementations have different performance characteristics.
The easiest way to test this hypothesis would be to simply call rand() in a loop, and time the same executable in both environments.
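A minimal sketch of such a test (my illustration, not code from the original answer; the iteration count is arbitrary):

#include <cstdio>
#include <cstdlib>
#include <ctime>

// Call rand() many times and report the elapsed time;
// running the same binary under Windows and under Wine isolates the cost of rand().
int main() {
    const long iterations = 200000000L;   // large enough to give a measurable duration
    unsigned long long sink = 0;          // consume the results so the loop isn't removed

    std::clock_t t0 = std::clock();
    for (long i = 0; i < iterations; ++i)
        sink += std::rand();
    std::clock_t t1 = std::clock();

    std::printf("%f seconds (checksum %llu)\n",
                (double)(t1 - t0) / CLOCKS_PER_SEC, sink);
    return 0;
}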
edit I've had a look at the WINE source code, and here is its implementation of rand():
/*********************************************************************
 *      rand (MSVCRT.@)
 */
int CDECL MSVCRT_rand(void)
{
    thread_data_t *data = msvcrt_get_thread_data();

    /* this is the algorithm used by MSVC, according to
     * http://en.wikipedia.org/wiki/List_of_pseudorandom_number_generators */
    data->random_seed = data->random_seed * 214013 + 2531011;
    return (data->random_seed >> 16) & MSVCRT_RAND_MAX;
}
I don't have access to Microsoft's source code to compare, but it wouldn't surprise me if the difference in performance was in the getting of thread-local data rather than in the RNG itself.
Wikipedia says:
Wine is a compatibility layer not an emulator. It duplicates functions of a Windows computer by providing alternative implementations of the DLLs that Windows programs call,[citation needed] and a process to substitute for the Windows NT kernel. This method of duplication differs from other methods that might also be considered emulation, where Windows programs run in a virtual machine.[2] Wine is predominantly written using black-box testing reverse-engineering, to avoid copyright issues.
This implies that the developers of Wine could replace an API call with anything at all, as long as the end result was the same as you would get with a native Windows call. And I suppose they weren't constrained by needing to make it compatible with the rest of Windows.
From what I can tell, the C standard libraries used WILL be different in the two different scenarios. This affects the rand() call as well as floor().
From the mingw site... MinGW compilers provide access to the functionality of the Microsoft C runtime and some language-specific runtimes. Running under XP, this will use the Microsoft libraries. Seems straightforward.
However, the model under wine is much more complex. According to this diagram, the operating system's libc comes into play. This could be the difference between the two.
While Wine is basically Windows, you're still comparing apples to oranges. As well, not only is it apples/oranges, the underlying vehicles hauling those apples and oranges around are completely different.
In short, your question could trivially be rephrased as "this code runs faster on Mac OSX than it does on Windows" and get the same answer.