I am not sure whether this problem is compiler-specific, but I'll ask anyway. I'm using CCS (Code Composer Studio), an IDE from Texas Instruments, to program the MSP430 microcontroller.
As usual, I'm writing the beginner program that blinks the LED controlled by the last bit of the P1OUT register. Here's the code that DOESN'T work (I've omitted the other declarations, which are irrelevant):
while (1) {
    int i;
    P1OUT ^= 0x01;
    i = 10000;
    while (i != 0) {
        i--;
    }
}
Now, here's the loop that DOES work:
while (1) {
    int i;
    P1OUT ^= 0x01;
    i = 0;
    while (i < 10000) {
        i++;
    }
}
The two loops should be equivalent, but in the first case the LED stays on and doesn't blink, while in the second it works as planned.
I'm thinking it has to do with some optimization done by the compiler, but I have no idea what specifically might be wrong.
The code is probably being optimised away as dead code. You don't want to spin like that anyway; it's terribly wasteful of CPU cycles. You want to simply call usleep, something like:
#include <unistd.h>

int microseconds = 500000;   /* number of microseconds (1000ths of a millisecond) to wait, e.g. half a second */

while (1) {
    P1OUT ^= 0x01;
    usleep(microseconds);
}
CCS can optimize code in ways you might never expect (also check the optimization levels in the project properties). The easiest fix is to declare the variable with the volatile keyword, and you are done.
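For example, a minimal sketch of the delay loop from the question with volatile added (10000 is the count from the question; tune it for your clock speed):

while (1) {
    volatile int i;          /* volatile: the compiler must keep every access to i */
    P1OUT ^= 0x01;           /* toggle the LED bit */
    i = 10000;
    while (i != 0) {
        i--;                 /* the busy-wait can no longer be optimized away */
    }
}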
New to Arduino, I have tried to make a for or a while loop act as a delay instead of the delay() function. I have tried a LOT of values, but the LED remains HIGH; it works if I use the delay() function. Note that I'm not actually going to use this code as a delay, I just tried it and now I can't understand what goes wrong.
The board is a Nano Every, I use the Arduino IDE, and Fcpu = 8 MHz.
const byte ledPin = 13;
byte ledState = HIGH;

void setup() {
  pinMode(ledPin, OUTPUT);
}

void loop() {
  unsigned long i = 0;
  // read the state of the switch into a local variable:
  //enaState = digitalRead(sw1);
  //dirState = digitalRead(sw2);
  while (i < 10000000)
  {
    i++;
  }
  //delay(1000);
  ledState ^= 1;
  digitalWrite(ledPin, ledState);
  i = 0;
}
Adding volatile, so the busy-wait loop is not removed by the optimizer, works:
volatile unsigned long i = 0;
The Arduino compiler runs with optimizations enabled by default, to reduce code size and improve speed. Since source-line debugging on the Arduino itself is not possible, at least with the Arduino IDE, this normally makes no visible difference during development. The code snippet above is a rare example where it does.
How to disable optimization, if one really wants to, is described in this question: VSCode disabling Arduino compilation optimizations for debugging (thanks @dmaxime).
So, I am new to online competitive programming, and I came across some code where I am using an if-else statement inside a for loop. I want to increase the speed of the loop, and after doing some research I came across the break and continue statements.
So my question is: does using continue really increase the speed of the loop or not?
Code:
int even_sum = 0;
for (int i = 0; i < 200; i++) {
    if (i % 4 == 0) {
        even_sum += i;
        continue;
    } else {
        //do other stuff when sum of multiple of 4 is not calculated
    }
}
In the specific code in the question, the meaning is identical with and without the continue: in either case, after execution leaves even_sum += i;, control flows to the closing } of the for statement. Any compiler of even modest quality should treat the two options identically.
The intended purpose of continue is not to speed up code by requesting a jump the compiler is going to make anyway, but to skip code that is undesired in the current loop iteration: it acts as if the remaining code had been enclosed in an else clause, but may be more visually appealing and less disruptive to human perception of the code.
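A minimal sketch of that equivalence (skip() and do_work() are placeholder names, not functions from the question):

for (int i = 0; i < n; i++) {
    if (skip(i))
        continue;            /* jump straight to the next iteration */
    do_work(i);
}

/* behaves exactly like: */

for (int i = 0; i < n; i++) {
    if (skip(i)) {
        /* nothing to do this iteration */
    } else {
        do_work(i);
    }
}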
It is conceivable a very rudimentary compiler, or even a decent compiler but with optimization disabled, might generate a jump instruction for the continue and also a jump instruction for the “then” clause of the if statement to jump over the else clause. The latter would never be executed and would have no direct effect on program execution time, but it would increase the size of the program and thus could have indirect effects. This possibility is of negligible concern in typical modern environments, where you are unlikely to encounter such a rudimentary compiler.
No, there's no speed advantage to using continue here. Both versions of your code are equivalent, and even without optimizations they produce the same machine code.
However, sometimes continue can make your code a lot more efficient, if you have structured your loop in a specific way, e.g.
This:
int even_sum = 0;
for (int i = 0; i < 200; i++) {
    if (i % 4 == 0) {
        even_sum += i;
        continue;
    }
    if (huge_computation_but_always_false_when_multiple_of_4(i)) {
        // do stuff
    }
}
is a lot more efficient, than:
int even_sum = 0;
for (int i = 0; i < 200; i++) {
    if (i % 4 == 0) {
        even_sum += i;
    }
    if (huge_computation_but_always_false_when_multiple_of_4(i)) {
        // do stuff
    }
}
because the former doesn't have to execute the huge_computation_but_always_false_when_multiple_of_4() function every time.
So even though both versions would always produce the same result (given that huge_computation_but_always_false_when_multiple_of_4() has no side effects), the first one, which uses continue, would be a lot faster.
I am using a loop to manipulate a number of variables using bit values as I go. I have noticed that I am using the expressions if (some_state & (1 << i)) and some_var &= ~(1 << i) dozens of times before i is incremented.
This routine must be executed on a regular interrupt, and my main clock is not all that fast, so I am wondering if this is the best I can do. In my head, this is a lot of operations.
Maybe I could store the shifted value and use it instead:
for (int i = 0; i < MAX; ++i) {
    uint32_t mask_value = (1u << i);   /* 1u avoids shifting a signed int into the sign bit */
    if (some_state & mask_value) {
        if (!--some_array[i].timer_value) {
            some_var |= mask_value;
            another_var &= ~mask_value;
        }
    }
    ...
}
Is there any clear answer here? Perhaps this is more of a situation where the compiler may optimize to some unknown degree, so nobody can know until I run the routines 10,000 times, time them, and compare?
UPDATE:
I'm running the GNU ARM Embedded Toolchain for my NXP LPC11 (ARM Cortex-M0) microcontroller.
It should be reasonable enough to review the assembly in the IDE (LPCXpresso, Eclipse-based) and to profile it.
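One variation of the idea floated above, sketched purely as an illustration (some_state, some_var, another_var, some_array, and MAX are the names from the snippet): instead of recomputing 1 << i on every iteration, keep a running mask and shift it once per pass.

uint32_t mask = 1u;                        /* start at bit 0 */
for (int i = 0; i < MAX; ++i, mask <<= 1) {
    if (some_state & mask) {
        if (!--some_array[i].timer_value) {
            some_var |= mask;              /* set the bit */
            another_var &= ~mask;          /* clear the bit */
        }
    }
}

Whether this actually saves anything depends on what the compiler already does with the original code; inspecting the generated assembly, as suggested above, is the only way to know for sure.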
Let me first preface this with the fact that I know these kinds of micro-optimisations are rarely cost-effective. I'm curious about how stuff works, though. For all cache-line numbers etc., I am thinking in terms of an x86-64 Intel i5 CPU. The numbers would obviously differ for different CPUs.
I've often been under the impression that walking an array forwards is faster than walking it backwards. This is, I believed, because pulling in large amounts of data is done in a forward-facing manner; that is, if I read the byte at offset 128, then the cache line (assuming 64 bytes in length) will read in bytes 128-191 inclusive. Consequently, if the next byte I wanted to access was at offset 129, it would already be in the cache.
However, after reading a bit, I'm now under the impression that it actually wouldn't matter? Because cache-line alignment picks the starting point at the closest 64-byte-aligned boundary, if I pick the byte at offset 127 to start with, I will load bytes 64-127 inclusive, and consequently will have the data in the cache for my backwards walk. I will suffer a cache miss when transitioning from offset 128 to 127, but that's a consequence of where I've picked the addresses for this example more than any real-world consideration.
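A quick sanity check of that alignment arithmetic: the base of the 64-byte line containing a byte is simply the offset with the low six bits cleared (a small illustrative program, not part of the benchmark below):

#include <stdio.h>

int main(void)
{
    /* Which 64-byte cache line does each byte offset fall into? */
    for (unsigned offset = 126; offset <= 129; offset++)
        printf("byte %3u -> line %3u..%3u\n",
               offset, offset & ~63u, (offset & ~63u) + 63);
    return 0;
}

This prints that bytes 126 and 127 live in the line 64..127 while bytes 128 and 129 live in 128..191, so stepping backwards from 128 to 127 does cross a line boundary.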
I am aware that the cachelines are read in as 8-byte chunks, and as such the full cacheline would have to be loaded before the first operation could begin if we were walking backwards, but I doubt it would make a hugely significant difference.
Could somebody clear up whether my new understanding is right and my old one is wrong? I've searched for a full day and still haven't been able to get a definitive answer on this.
tl;dr: Is the direction in which we walk an array really that important? Does it actually make a difference? Did it make a difference in the past (going back 15 years or so)?
I have tested with the following basic code, and see the same results forwards and backwards:
#include <windows.h>
#include <iostream>
#include <cstring>

// Size of dataset (number of ints)
#define SIZE_OF_ARRAY (1024*1024*256)

// Are we walking forwards or backwards?
#define FORWARDS 1

int main()
{
    // Timer setup
    LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
    LARGE_INTEGER Frequency;

    int* intArray = new int[SIZE_OF_ARRAY];

    // Memset - shouldn't affect the test because my cache isn't 256MB!
    memset(intArray, 0, SIZE_OF_ARRAY * sizeof(int));

    // Arbitrary numbers for break points
    intArray[SIZE_OF_ARRAY - 1] = 55;
    intArray[0] = 15;

    int* backwardsPtr = &intArray[SIZE_OF_ARRAY - 1];

    QueryPerformanceFrequency(&Frequency);
    QueryPerformanceCounter(&StartingTime);

    // Actual code
    if (FORWARDS)
    {
        while (true)
        {
            if (*(intArray++) == 55)
                break;
        }
    }
    else
    {
        while (true)
        {
            if (*(backwardsPtr--) == 15)
                break;
        }
    }

    // Cleanup
    QueryPerformanceCounter(&EndingTime);
    ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
    ElapsedMicroseconds.QuadPart *= 1000000;
    ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;

    std::cout << ElapsedMicroseconds.QuadPart << std::endl;

    // So I can read the output
    char a;
    std::cin >> a;

    return 0;
}
I apologise for (a) Windows code and (b) the hacky implementation. It's thrown together to test a hypothesis, but it doesn't prove the reasoning behind the result.
Any information about how the walking direction could make a difference, not just with cache but also other aspects, would be greatly appreciated!
Just as your experimentation shows, there is no difference. Unlike the interface between the processor and the L1 cache, the memory system transacts in full cache lines, not bytes. As @user657267 pointed out, processor-specific prefetchers exist. These might prefer forward over backward, but I heavily doubt it. All modern prefetchers detect direction rather than assuming it; furthermore, they detect stride as well. They involve incredibly complex logic, and something as simple as direction isn't going to be their downfall.
Short answer: go in either direction you want and enjoy the same performance for both!
The last three of these method calls take approximately double the time of the first four.
The only difference is that their arguments no longer fit in an int. But should this matter? The parameter is declared as long, so it should use long for the calculation anyway. Does the modulo operation use a different algorithm for numbers greater than int.MaxValue?
I am using an AMD Athlon 64 3200+, Windows XP SP3, and VS2008.
Stopwatch sw = new Stopwatch();

TestLong(sw, int.MaxValue - 3L);
TestLong(sw, int.MaxValue - 2L);
TestLong(sw, int.MaxValue - 1L);
TestLong(sw, int.MaxValue);
TestLong(sw, int.MaxValue + 1L);
TestLong(sw, int.MaxValue + 2L);
TestLong(sw, int.MaxValue + 3L);

Console.ReadLine();

static void TestLong(Stopwatch sw, long num)
{
    long n = 0;
    sw.Reset();
    sw.Start();
    for (long i = 3; i < 20000000; i++)
    {
        n += num % i;
    }
    sw.Stop();
    Console.WriteLine(sw.Elapsed);
}
EDIT:
I have now tried the same with C, and the issue does not occur there: all modulo operations take the same time, in release and in debug mode, with and without optimizations turned on:
#include "stdafx.h"
#include "time.h"
#include "limits.h"
static void TestLong(long long num)
{
long long n = 0;
clock_t t = clock();
for (long long i = 3; i < 20000000LL*100; i++)
{
n += num % i;
}
printf("%d - %lld\n", clock()-t, n);
}
int main()
{
printf("%i %i %i %i\n\n", sizeof (int), sizeof(long), sizeof(long long), sizeof(void*));
TestLong(3);
TestLong(10);
TestLong(131);
TestLong(INT_MAX - 1L);
TestLong(UINT_MAX +1LL);
TestLong(INT_MAX + 1LL);
TestLong(LLONG_MAX-1LL);
getchar();
return 0;
}
EDIT2:
Thanks for the great suggestions. I found that both .NET and C (in debug as well as in release mode) do not use a single CPU instruction to calculate the remainder; instead they call a function that does it.
In the C program I could get its name, which is "_allrem". The debugger also displayed full source comments for this file, so I found that this algorithm special-cases 32-bit divisors, rather than dividends as is the case in the .NET application.
I also found that the performance of the C program really is affected only by the value of the divisor, not the dividend. Another test showed that the performance of the remainder function in the .NET program depends on both the dividend and the divisor.
BTW: even simple additions of long long values are compiled to consecutive add and adc instructions. So even though my processor calls itself 64-bit, it really isn't :(
EDIT3:
I have now run the C app on Windows 7 x64, compiled with Visual Studio 2010. The funny thing is, the performance behavior stays the same, although now (I checked the assembly output) true 64-bit instructions are used.
What a curious observation. Here's something you can do to investigate further: add a "pause" at the beginning of the program, like a Console.ReadLine, but AFTER the first call to your method. Then build the program in "release" mode. Then start the program outside the debugger. Then, at the pause, attach the debugger, step through it, and take a look at the code jitted for the method in question. It should be pretty easy to find the loop body.
It would be interesting to know how the generated loop body differs from that in your C program.
The reason for all those hoops is that the jitter changes what code it generates when jitting a "debug" assembly or a program that already has a debugger attached; in those cases it jits code that is easier to understand in a debugger. It would be more interesting to see what the jitter thinks is the "best" code for this case, so you have to attach the debugger late, after the jitter has run.
Have you tried performing the same operations in native code on your box?
I wouldn't be surprised if the native 64-bit remainder operation special-cased situations where both arguments are within the 32-bit range, basically delegating that to the 32-bit operation. (Or possibly it's the JIT that does that...) It does make a fair amount of sense to optimise that case, doesn't it?
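To illustrate the kind of special-casing being suggested, here is a purely hypothetical sketch in C (rem64 is an invented name; this is not the actual runtime helper or _allrem):

#include <stdint.h>

/* Hypothetical 64-bit remainder helper: when both operands fit in 32 bits
   (and the divisor is positive, as in the benchmark), do a single 32-bit
   hardware division instead of the slower multi-step 64-bit path that a
   32-bit target would otherwise need. */
static int64_t rem64(int64_t num, int64_t den)
{
    if (num >= INT32_MIN && num <= INT32_MAX &&
        den > 0 && den <= INT32_MAX)
    {
        return (int32_t)num % (int32_t)den;   /* one 32-bit division */
    }
    return num % den;                          /* full 64-bit path */
}

A helper shaped like this would explain why the timings jump once the argument no longer fits in 32 bits: the cheap path is no longer taken, so every iteration goes down the slower 64-bit route.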