Why are sin/cos slower when optimizations are enabled?

After reading a question related to the performance of sin/cos (Why is std::sin() and std::cos() slower than sin() and cos()?), I ran some tests with its code and found something weird: if I call sin/cos with a float value, it is much slower than with double when compiled with optimization.
#include <cmath>
#include <cstdio>
const int N = 4000;
float cosine[N][N];
float sine[N][N];
int main() {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float ang = i*j*2*M_PI/N;
            cosine[i][j] = cos(ang);
            sine[i][j] = sin(ang);
        }
    }
}
With the above code I get:
With -O0: 2.402s
With -O1: 9.004s
With -O2: 9.013s
With -O3: 9.001s
Now if I change
float ang = i*j*2*M_PI/N;
to
double ang = i*j*2*M_PI/N;
I get:
With -O0: 2.362s
With -O1: 1.188s
With -O2: 1.197s
With -O3: 1.197s
How can the first test be that much faster without optimizations?
I'm using g++ (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2, 64 bits.
EDIT: Changed the title to better describe the problem.
EDIT: Added assembly code
Here's a possibility:
In C, cos is double precision and cosf is single precision. In C++, std::cos has overloads for both double and single.
You aren't calling std::cos. If <cmath> doesn't also overload ::cos (as far as I know, it is not required to), then you are just calling the C double precision function. If this is the case, then you're suffering the cost of converting between float, double, and back.
Now, some standard libraries implement cos(float x) as (float)cos((double)x), so even if you are calling the float function it might still be doing conversions behind the scenes.
This shouldn't account for a 9x performance difference, though.
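To make the overload point concrete, here is a minimal sketch. The static_asserts only check what the standard guarantees for std::cos; the two helper functions (cos_via_double and cos_via_float are hypothetical names, not from the question) contrast the "convert to double and back" path with the single-precision path. Whether an unqualified cos(float) picks the float overload depends on the standard library headers, so this sketch uses std::cos throughout.

```cpp
#include <cmath>
#include <type_traits>

// std::cos is overloaded: a float argument selects the float overload...
static_assert(std::is_same<decltype(std::cos(1.0f)), float>::value,
              "std::cos(float) returns float");
// ...and a double argument selects the double overload.
static_assert(std::is_same<decltype(std::cos(1.0)), double>::value,
              "std::cos(double) returns double");

// Hypothetical helper: float -> double, double-precision cosine, double -> float.
float cos_via_double(float x) {
    return static_cast<float>(std::cos(static_cast<double>(x)));
}

// Hypothetical helper: calls the float overload directly. A library may still
// widen to double internally, as described above.
float cos_via_float(float x) {
    return std::cos(x);
}
```

Timing these two helpers in the question's loop is one way to see which path a given build is actually taking.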

AFAIK it's because computers work at double precision natively. Using float requires conversions.


Function doesn't work when running normally, but does while debugging

I have the following code:
#include <iostream>
#include <cmath>
bool primes[21] = {0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0};
int find_last_power(int n, int p){
    return (int) std::pow(n, (double) 1/p);
}
long long solve(int n){
    long long solution = 1;
    for (int i=2; i<=n; i++){
        if (primes[i]){
            std::cout << "p" << i << " : " << std::pow(i, find_last_power(n, i)) << std::endl;
            solution *= static_cast<long long>(std::pow(i, find_last_power(n, i)));
        }
    }
    return solution;
}
int main(){
    std::cout << solve(20); return 0;
}
primes is an array of n+1 booleans whose value primes[i] is true if i is prime and false if i is composite.
find_last_power(n, p) returns the exponent (int) of the largest power of p that is less than or equal to n.
If you run the program it writes out:
p2 : 16
p3 : 9
p5 : 5
p7 : 7
p11 : 11
p13 : 13
p17 : 17
p19 : 19
214885440 // this is the return value of solve(20)
// it is supposed to be the product of the numbers on the right (16,9...)
But the returned number is not the expected output. The program, however, runs correctly in a debugger, which is why I find it very hard to identify the bug. The expected output should be 232792560.
Any help is appreciated.
It was compiled with the following commands (on 64-bit Intel i5 4690k, Windows 10):
g++ -S -o asm.s PE_5.cxx
g++ -c asm.s -o output.o
g++ output.o -o out.exe
g++ --version
// g++ (MinGW.org GCC-8.2.0-3) 8.2.0
You have an int overflow. Change the return type from
int solve(int n){
to
long long solve(int n){
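Independently of the return type, note that the printed 214885440 equals 232792560 × 12 / 13, which would be consistent with a single factor of 13 ending up as 12 when a floating-point result is truncated to an integer. A minimal sketch that keeps everything in integer arithmetic, and so does not depend on how std::pow rounds, could look like the following. The names largest_exponent, int_pow and product_of_prime_powers are hypothetical, and the helper follows the question's description (the exponent of the largest power of p that is <= n).

```cpp
// Largest e such that p^e <= n, using integer arithmetic only.
// Assumes n >= 1 and p >= 2.
int largest_exponent(int n, int p) {
    int e = 0;
    long long power = p;
    while (power <= n) {
        ++e;
        power *= p;
    }
    return e;
}

// base^exp by repeated multiplication; assumes exp >= 0 and no overflow.
long long int_pow(long long base, int exp) {
    long long result = 1;
    for (int i = 0; i < exp; ++i) {
        result *= base;
    }
    return result;
}

// Product of the highest prime powers that are <= n.
// Primality is checked by trial division instead of a lookup table.
long long product_of_prime_powers(int n) {
    long long product = 1;
    for (int i = 2; i <= n; ++i) {
        bool is_prime = true;
        for (int d = 2; d * d <= i; ++d) {
            if (i % d == 0) { is_prime = false; break; }
        }
        if (is_prime) {
            product *= int_pow(i, largest_exponent(n, i));
        }
    }
    return product;
}
```

With n = 20 this multiplies 16, 9, 5, 7, 11, 13, 17 and 19, the same factors the question prints.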

Implementation of conditional statement in x86 assembly

I would like to know how to implement these lines of code in x86 MASM assembly:
if (x >= 1 && x <= 100) {
    printsomething1();
} else if (x >= 101 && x <= 200) {
    printsomething2();
} else {
    printsomething3();
}
I'd break it into contiguous ranges, (assuming x is unsigned) like:
x is 0, do printsomething3()
x is 1 to 100, do printsomething1()
x is 101 to 200, do printsomething2()
x is 201 or higher, do printsomething3()
Then work from lowest to highest, like:
;eax = x;
cmp eax,0
je .printsomething3
cmp eax,100
jbe .printsomething1
cmp eax,200
jbe .printsomething2
jmp .printsomething3
If the only difference is the string they print (and not the code they use to print it) I'd go one step further:
mov esi,something3 ;esi = address of string if x is 0
cmp eax,0
je .print
mov esi,something1 ;esi = address of string if x is 1 to 100
cmp eax,100
jbe .print
mov esi,something2 ;esi = address of string if x is 101 to 200
cmp eax,200
jbe .print
mov esi,something3 ;esi = address of string if x is 201 or higher
jmp .print
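For comparison, the same lowest-to-highest chain can be sketched in C++. This assumes x is unsigned, as the answer does; classify is a hypothetical name, and it returns the number of the printsomethingN() that would be called rather than calling it. Each test only needs an upper bound because the lower ranges have already been handled, which is why the assembly gets away with one cmp per range.

```cpp
// Returns N for the printsomethingN() branch that x falls into.
// x is unsigned, so there is no "below zero" range to handle.
int classify(unsigned x) {
    if (x == 0)   return 3;  // cmp eax,0   / je  .printsomething3
    if (x <= 100) return 1;  // cmp eax,100 / jbe .printsomething1
    if (x <= 200) return 2;  // cmp eax,200 / jbe .printsomething2
    return 3;                // jmp .printsomething3
}
```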
If you have access to a decent C compiler, you can compile it into assembly language. For gcc use the -S flag:
gcc test.c -S
This creates the file test.s which contains the assembly language output which can be assembled and linked if needed.
For example, to make your code compile successfully, I rewrote it slightly to this:
#include <stdio.h>
#include <stdlib.h>
void printsomething (int y)
{
    printf ("something %d", y);
}
void func (int x)
{
    if (x >= 1 && x <= 100)
        printsomething (1);
    else if (x >= 101 && x <= 200)
        printsomething (2);
    else
        printsomething (3);
}
int main (int argc, char **argv)
{
    int x = 0;
    if (argc > 1)
        x = atoi (argv [1]);
    func (x);
    return 0;
}
It compiles into this assembler:
.file "s.c"
.section .rodata
.string "something %d"
.globl printsomething
.type printsomething, @function
printsomething:
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl %edi, -4(%rbp)
movl -4(%rbp), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
.cfi_def_cfa 7, 8
.size printsomething, .-printsomething
.globl func
.type func, @function
func:
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl %edi, -4(%rbp)
cmpl $0, -4(%rbp)
jle .L3
cmpl $100, -4(%rbp)
jg .L3
movl $1, %edi
call printsomething
jmp .L4
.L3:
cmpl $100, -4(%rbp)
jle .L5
cmpl $200, -4(%rbp)
jg .L5
movl $2, %edi
call printsomething
jmp .L4
.L5:
movl $3, %edi
call printsomething
.L4:
.cfi_def_cfa 7, 8
.size func, .-func
.globl main
.type main, @function
main:
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $32, %rsp
movl %edi, -20(%rbp)
movq %rsi, -32(%rbp)
movl $0, -4(%rbp)
cmpl $1, -20(%rbp)
jle .L7
movq -32(%rbp), %rax
addq $8, %rax
movq (%rax), %rax
movq %rax, %rdi
call atoi
movl %eax, -4(%rbp)
movl $0, %eax
.cfi_def_cfa 7, 8
.size main, .-main
.ident "GCC: (GNU) 7.3.1 20180712 (Red Hat 7.3.1-6)"
.section .note.GNU-stack,"",@progbits
Examine the func: part of it and you'll see how it sets up the comparisons: x >= 1 becomes cmpl $0 followed by jle, x <= 100 becomes cmpl $100 followed by jg, and so on for 101 and 200.

Slow std::string concatenation on Windows

I have a program that needs to concatenate lots of strings together (to be more precise, integers converted to strings). On my Ubuntu machine (running g++ 7.3.0) the code runs in 1.5 seconds. But the code needs to run on Windows as well (g++ 6.3.0 using MinGW), where it takes 15 seconds to complete. Furthermore, the Ubuntu setup runs on a much slower laptop with an i7-4712MQ CPU @ 2.30GHz, whereas the Windows machine has an i7-7700K CPU @ 4.20GHz.
The code to reproduce the times is shown below. I compile the code with g++ tester.cpp -O2 -o tester (or tester.exe for windows)
#include <iostream>
#include <chrono>
#include <string>
int main(int argc, char const *argv[]) {
    auto started = std::chrono::high_resolution_clock::now();
    std::string str = "";
    const int n = 10000000;
    str.reserve(2 * n);
    int a = 1;
    for (int i = 0; i < n; ++i) {
        str += std::to_string(a) + " ";
    }
    auto done = std::chrono::high_resolution_clock::now();
    double secs = (double) std::chrono::duration_cast<std::chrono::milliseconds>(done-started).count() / 1000;
    std::cout << "Done in " << secs << "\n";
    return 0;
}
Any idea where the large performance gap might come from?
A quick look at the disassembly shows that the Windows version uses movl (i.e. long word, 32-bit move), while the Linux version uses movq (quad word, 64-bit) and the SSE xmm registers.
My bet is that on Linux you compile for x86-64, while on Windows you target 32-bit x86.
x86-64 includes the SSE2 extension, while baseline x86 does not, so MinGW defaults to no-SSE mode.
If that's the case, building with 64 bit toolchain on Windows should result in comparable performance. Alternatively, you might enable SSE for 32 bit builds (-msse2 compiler flag, if I remember correctly).
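One quick way to check this hypothesis on both machines is to ask the compiler itself. The sketch below is only an assumption-labelled helper: pointer_bits and sse2_enabled are hypothetical names, while __SSE2__ is the macro GCC and Clang predefine when SSE2 code generation is enabled.

```cpp
#include <cstddef>

// Pointer width of the current build: 32 for a 32-bit target, 64 for x86-64.
constexpr std::size_t pointer_bits() {
    return sizeof(void*) * 8;
}

// True when the compiler was allowed to emit SSE2 instructions.
// GCC and Clang define __SSE2__ in that case (e.g. with -msse2 or on x86-64).
constexpr bool sse2_enabled() {
#if defined(__SSE2__)
    return true;
#else
    return false;
#endif
}
```

Printing both values from main with the same flags used for tester.cpp shows whether the two builds really differ in target width and SSE2 support.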
The mingw.org implementation just seems to be much more inefficient than linux, Visual Studio or mingw-w64.org.
>g++ --version
g++ (MinGW.org GCC-6.3.0-1) 6.3.0
Done in 24.808
>g++ --version
g++ (i686-posix-dwarf-rev2, Built by MinGW-W64 project) 6.3.0
Done in 0.679
Tested with MSYS2 MinGW64:
g++ --version
g++.exe (Rev2, Built by MSYS2 project) 7.3.0
g++.exe -Wall -O3 -mtune=native -fno-exceptions -fno-rtti -c main.cpp -o main.o
g++.exe -o test.exe main.o -s
Done in 0.547
Env: Windows 10 x64
CPU: Intel Core i5-6300U, 2.4GHz
In any case, MinGW uses msvcrt.dll (the one bundled with Windows, not the Universal CRT or the Visual Studio CRT) instead of GNU libc, so in my experience the speed gap may come from the C standard library.
P.S. with some changes (same compiler flags)
#include <iostream>
#include <chrono>
#include <string>
#ifdef _WIN32
#include <windows.h>
static std::size_t page_size() noexcept {
    SYSTEM_INFO si;
    ::GetSystemInfo(&si);
    return si.dwPageSize;
}
#else
#include <sys/types.h>
#include <unistd.h>
static std::size_t page_size() noexcept {
    return static_cast<std::size_t>( ::sysconf(_SC_PAGESIZE) );
}
#endif // _WIN32
int main(int argc, char const *argv[]) {
    auto started = std::chrono::high_resolution_clock::now();
    const std::size_t n = 10000000;
    // align size to page boundary
    const std::size_t al = page_size() - 1;
    const std::size_t buff_size = ( (n << 1) + al) & ~al;
    std::string str;
    // buff_size is otherwise unused; reserving it up front is assumed to be the intent
    str.reserve(buff_size);
    const std::string to_append( std::to_string(1) );
    for (std::size_t i = 0; i < n; ++i) {
        str.append( to_append );
        str.push_back(' ');
    }
    auto done = std::chrono::high_resolution_clock::now();
    double secs = (double) std::chrono::duration_cast<std::chrono::milliseconds>(done-started).count() / 1000;
    std::cout << "Done in " << secs << "\n";
    return 0;
}
Done in 0.046
(Just for the proportions) Windows Release target vs. Debug target in Visual Studio C++: by default, the Debug target is compiled without optimization, while the Release target is compiled with /O2 optimization, /Oi ("Enable Intrinsic Functions"), and /GL ("Whole Program Optimization"). Your code, on my workstation, Debug x64 vs. Release x64:
Debug: 70 sec.
Release: 0.27 sec.
You build with MinGW (which I am not familiar with). From a quick search, there is talk of Debug/Release modes, and MinGW seems to have equivalents of the /O2, /Oi ("Enable Intrinsic Functions"), and /Og ("Enable Global Optimization") flags.
Compile with these 3 flags (x64 target), & compare with the VS Release x64 benchmark. Anyway, this is MS default compile optimization for a Release target.
Test Environment:
HP 8100, Windows 10 Pro 64 bit, CPU i7 870, 16 GB DDR3 RAM, Visual Studio 2017, Targets: Debug x64 / Release x64
I tried your code on my Windows machine with MinGW 4.8.0 and got ~20 seconds. When I changed the string concatenation to std::stringstream I got 0.5 seconds:
std::stringstream ss;
for (int i = 0; i < n; ++i) {
    //str += std::to_string(a) + " ";
    ss << a << " ";
}
str = ss.str();

Deciphering the text and data segments from gcc assembly output

I am trying to examine the use of data and text segments in memory via a simple program, named source1.cpp:
int main()
{
    const char* b="Hello everyone!";
    int a=100;
    return 0;
}
Could anyone tell me how to figure out the text and data segments, or documentation that might help me in this?

Why is a segmentation fault caused by the order of class member variables?

I've created the following program:
class CLexer
{
public:
    CLexer( ) {
        iCursorPos = 0;
    }
    void putCharacter(char character)
    {
        if(character != ' ' && character != '\n') {
            m_strToken[iCursorPos] = character;
        }
        else {
            m_strToken[iCursorPos] = '\0';
            iCursorPos = 0;
        }
    }
    char m_strToken[1024];
    int iCursorPos = 0;
};
int main(int argc, char * argv[]) {
    CLexer lex;
    return 0;
}
After execution, the first call to the putCharacter method with the 'm' character as its parameter throws a segfault.
The attached gdb gives the following output:
Program received signal SIGSEGV, Segmentation fault.
0x00000000004018e5 in CLexer::putCharacter (this=0x7fffffffe370,
character=109 'm') at src/main.cpp:60
60 m_strToken[iCursorPos] = character;
I've managed to fix this error by moving the iCursorPos variable above m_strToken in the class declaration, but I don't think that is the proper way to fix this issue.
I'm using g++ (GCC) 6.1.1 20160501 on the latest, fully updated version of Arch Linux x86_64.
if(character != ' ' && character != '\n') {
m_strToken[iCursorPos] = character;
You don't check that iCursorPos < 1024 here. So you write past the end of the buffer, into iCursorPos itself.
The next access m_strToken[iCursorPos] = character; probably writes way past the end of the buffer, and you get a segfault (luckily).
Your "fix" still isn't correct, since you corrupt other parts of your object's memory regardless.