SSE instruction set not enabled - c++

I am running into this error: "SSE instruction set not enabled". How can I resolve it?
I have an Acer laptop with an i7, running Ubuntu 11.10.
Any help will be appreciated!
Also running:
sudo cat /proc/cpuinfo | grep flags
Gives:
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 x2apic popcnt xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
I was actually trying to install gazebo-1.0.0-RC2-x86_64 when I got this error:
/usr/lib/gcc/i686-linux-gnu/4.6.1/include/emmintrin.h:32:3: error: #error "SSE2 instruction set not enabled"
In file included from /home/bkhelifa/Downloads/software/gazebo-1.0.0-RC2-x86_64/deps/opende/src/quickstep.cpp:39:0:
/usr/lib/gcc/i686-linux-gnu/4.6.1/include/xmmintrin.h:32:3: error: #error "SSE instruction set not enabled"
/home/bkhelifa/Downloads/software/gazebo-1.0.0-RC2-x86_64/deps/opende/src/quickstep.cpp: In function ‘dReal dot6(dRealPtr, dRealPtr)’:
/home/bkhelifa/Downloads/software/gazebo-1.0.0-RC2-x86_64/deps/opende/src/quickstep.cpp:537:3: error: ‘__m128d’ was not declared in this scope
...
I already have these options in my CMake file:
if (SSE3_FOUND)
  set (CMAKE_C_FLAGS_ALL "${CMAKE_C_FLAGS_ALL} -msse3")
endif()
if (SSSE3_FOUND)
  set (CMAKE_C_FLAGS_ALL "${CMAKE_C_FLAGS_ALL} -mssse3")
endif()
if (SSE4_1_FOUND)
  set (CMAKE_C_FLAGS_ALL "${CMAKE_C_FLAGS_ALL} -msse4.1")
endif()
if (SSE4_2_FOUND)
  set (CMAKE_C_FLAGS_ALL "${CMAKE_C_FLAGS_ALL} -msse4.2")
endif()

The errors come from the compiler's intrinsics headers (emmintrin.h and xmmintrin.h), which check that SSE is enabled. It appears that your if statements aren't taking effect.
If you add -march=native, the compiler picks the CPU architecture and flags that match the processor you are compiling on. Alternatively, you can specify them explicitly with -march=corei7 -mavx -mpclmul, which is useful for distcc. Also, -mfpmath=sse or -mfpmath=387 tells the compiler to generate SSE or x87 instructions for floating-point math. Depending on your processor, either could be faster, but I think Intel processors are usually better at SSE.
If you want to check what gcc enables with the native flag, run gcc -march=native -Q --help=target -v.
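Another way to see what actually reached the compiler is to test the predefined macros that the intrinsics headers themselves check. The following is only a minimal sketch: the function name enabled_simd_macros is made up for illustration, and it relies on GCC and Clang defining __SSE__, __SSE2__ and so on when the matching -m flag (or a -march value that implies it) is in effect.

```cpp
#include <string>

// Hypothetical helper: lists the x86 SIMD feature macros that the compiler
// predefined for this translation unit. GCC and Clang define these only when
// the corresponding -msse*/-mavx flag (or a -march that implies it) is active.
std::string enabled_simd_macros() {
    std::string out;
#ifdef __SSE__
    out += "SSE ";
#endif
#ifdef __SSE2__
    out += "SSE2 ";
#endif
#ifdef __SSE3__
    out += "SSE3 ";
#endif
#ifdef __SSSE3__
    out += "SSSE3 ";
#endif
#ifdef __SSE4_1__
    out += "SSE4.1 ";
#endif
#ifdef __SSE4_2__
    out += "SSE4.2 ";
#endif
#ifdef __AVX__
    out += "AVX ";
#endif
    if (out.empty()) {
        out = "none";  // no SIMD macro defined, so the headers would hit #error
    }
    return out;
}
```

Printing the result from a small main(), once built with the project's real flags and once with -msse2 added by hand, should show whether the CMake options ever made it onto the compile line.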

I got the same error and I think I've found the proper solution!
The problem is that you are including emmintrin.h directly. I did the same. What is more, when I defined __SSE2__, __SSE__ and __MMX__ before including this file, I got the following message: warning: "__SSE2__" redefined [enabled by default]
So I investigated where __SSE2__ is defined and used, and found that emmintrin.h is included by x86intrin.h. So include x86intrin.h instead, and the proper *intrin.h files will be included automatically according to the -msse* flags!
It works nicely for me (g++ 4.7.2-5).
I hope this helps!
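As a rough illustration of that approach, here is a sketch that includes x86intrin.h only when SSE2 is enabled and falls back to scalar code otherwise. The function dot6_sketch is a hypothetical stand-in for a 6-element dot product like the dot6 named in the error output; the real ODE code is not shown in this thread and may look quite different.

```cpp
#include <cstddef>

#if defined(__SSE2__)
#include <x86intrin.h>  // pulls in emmintrin.h etc. according to the -m flags
#endif

// Hypothetical 6-element dot product (not the actual ODE implementation).
// Uses SSE2 double-precision intrinsics when the compiler has SSE2 enabled,
// and a plain scalar loop otherwise, so it compiles either way.
double dot6_sketch(const double* a, const double* b) {
#if defined(__SSE2__)
    __m128d acc = _mm_setzero_pd();
    for (std::size_t i = 0; i < 6; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);  // unaligned load of two doubles
        __m128d vb = _mm_loadu_pd(b + i);
        acc = _mm_add_pd(acc, _mm_mul_pd(va, vb));
    }
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    return lanes[0] + lanes[1];  // horizontal sum of the two lanes
#else
    double sum = 0.0;
    for (std::size_t i = 0; i < 6; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
#endif
}
```

The guard means a build without -msse2 silently takes the scalar path instead of failing with the #error, which may or may not be what you want for a performance-critical solver.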

I just built this on FreeBSD by adding this to the "Makefile" in /usr/ports/audio/soundtouch :
CC= gcc46
CXX= g++46
CPP= cpp46
CFLAGS+= -msse
I hope the rest of the "phonon-gstreamer" plugins compile with this.

Related

Difference between GCC arm options necessary when cross-compiling and when compiling directly on target?

I have created a C++ app and want to compile it for a Cubietruck board (ARM® Cortex™-A7 dual-core) running Armbian (Debian Jessie 8.0).
cat /proc/cpuinfo gives:
Processor : ARMv7 Processor rev 4 (v7l)
processor : 0
BogoMIPS : 956.30
processor : 1
BogoMIPS : 959.75
Features : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xc07
CPU revision : 4
Hardware : sun7i
Revision : 0000
Serial : 1651668805c0e142
Chipid : 16516688-80515475-52574848-05c0e142
and dpkg --print-architecture gives:
armhf
I have concluded that the ARM GCC options I need for cross-compilation are:
--with-abi=aapcs-linux (-mabi)
--with-cpu=cortex-a7 (-mcpu)
--with-tune=cortex-a7 (-mtune)
--with-mode=arm/thumb (-marm -mthumb)
--with-fpu=neon-vfpv4 (-mfpu)
--with-float=hard
If I want to build the same source directly on the board, is the option -march=native (if it is supported) sufficient, or do I need any of the above flags as well?
To find what flags -march=native activates, use gcc -march=native -Q --help=target.
This is the output on my board (Pine64, Cortex-A53 with 64-bit Linux):
debian@pine64:~$ gcc -march=native -Q --help=target
The following options are target specific:
-mabi=ABI lp64
-march=ARCH native
-mbig-endian [disabled]
-mbionic [disabled]
-mcmodel= small
-mcpu=CPU
-mfix-cortex-a53-835769 [enabled]
-mgeneral-regs-only [disabled]
-mglibc [enabled]
-mlittle-endian [enabled]
-mlra [enabled]
-momit-leaf-frame-pointer [enabled]
-mstrict-align [disabled]
-mtls-dialect= desc
-mtune=CPU
-muclibc [disabled]
....
[Omitted output]

How to compile Crypto++ cross platform on osx

My desktop application depends on the Crypto++ library. First I tried to install Crypto++ from Brew and link it with my application. The first error appeared when I tried to run the application on an older Mac (with an older CPU, which I suppose does not have AESNI instructions). It crashed with:
Crashed Thread: 56
Exception Type: EXC_BAD_INSTRUCTION (SIGILL)
Exception Codes: 0x0000000000000001, 0x0000000000000000
Exception Note: EXC_CORPSE_NOTIFY
Termination Signal: Illegal instruction: 4
Termination Reason: Namespace SIGNAL, Code 0x4
Terminating Process: exc handler [0]
After that I compiled Crypto++ on an older Mac. So far all was good, but recently I encountered the same error on an even older CPU.
Basically the question is: is there a way to compile Crypto++ so the deployed lib would be cross-platform?
... the question is: is there a way to compile crypto++ so the deployed lib would be cross platform ?
Yes, but only within the processor family.
The problem is likely the use of a newer instruction, but not AES. I suspect this for three reasons.
First, the makefile adds -march=native when building. This gets you all the CPU features for the machine you are building on.
Second, the newer instruction could be from SSE4.2, AVX or BMI because you compile on a newer Mac, while your older Mac can only handle, say, SSE4.1 in the case of a Core2 Duo.
Third, AES is guarded at runtime, so those particular machine instructions are not executed if the CPU lacks AESNI. However, other instructions the compiler may emit, like AVX or BMI, are not guarded.
Here's my OS X test environment:
MacBook, early 2010
  Intel Core2 Duo
  OS X 10.9
  SSE 4.1
MacBook Pro, late 2012
  Intel Core i7
  OS X 10.8
  SSE 4.1, SSE 4.2, AESNI, RDRAND, AVX
Based on the list above, if I compile on the MacBook Pro (SSE 4.1, SSE 4.2, AESNI, RDRAND, AVX) for the MacBook (SSE 4.1), then I need to limit the target machine to SSE 4.1. Otherwise, Clang is sure to emit instructions the older MacBook cannot handle.
To limit the target machine in Crypto++:
git clone https://github.com/weidai11/cryptopp.git
cd cryptopp
export CXXFLAGS="-DNDEBUG -g2 -O2 -DDISABLE_NATIVE_ARCH=1 -msse2 -msse3 -mssse3 -msse4.1"
make -j 4
-DDISABLE_NATIVE_ARCH is a relatively new addition. I don't believe it's in Crypto++ 5.6.5. You need Master for it, and it will be in the upcoming Crypto++ 6.0.
If you need to remove the makefile code that adds -march=native, it's not hard to find. Open GNUmakefile, and delete this block around line 200:
# BEGIN_NATIVE_ARCH
# Guard use of -march=native (or -m{32|64} on some platforms)
# Don't add anything if -march=XXX or -mtune=XXX is specified
ifeq ($(DISABLE_NATIVE_ARCH),0)
ifeq ($(findstring -march,$(CXXFLAGS)),)
ifeq ($(findstring -mtune,$(CXXFLAGS)),)
ifeq ($(GCC42_OR_LATER)$(IS_NETBSD),10)
CXXFLAGS += -march=native
else ifneq ($(CLANG_COMPILER)$(INTEL_COMPILER),00)
CXXFLAGS += -march=native
else
# GCC 3.3 and "unknown option -march="
# Ubuntu GCC 4.1 compiler crash with -march=native
# NetBSD GCC 4.8 compiler and "bad value (native) for -march= switch"
# Sun compiler is handled below
ifeq ($(SUN_COMPILER)$(IS_X64),01)
CXXFLAGS += -m64
else ifeq ($(SUN_COMPILER)$(IS_X86),01)
CXXFLAGS += -m32
endif # X86/X32/X64
endif
endif # -mtune
endif # -march
endif # DISABLE_NATIVE_ARCH
# END_NATIVE_ARCH
After that, you should be able to run your binary on both machines.
The GNUmakefile is kind of a monstrosity. There's a lot to it. We documented it at GNUmakefile on the Crypto++ wiki.
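You can also keep the makefile from adding -march=native by specifying -mtune. Note that -mtune only tunes code generation for the named processor; the instruction set stays at the compiler's default baseline. For example: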
$ export CXXFLAGS="-DNDEBUG -g2 -O2 -mtune=core2"
$ make -j 3
g++ -DNDEBUG -g2 -O2 -mtune=core2 -fPIC -pipe -c cryptlib.cpp
g++ -DNDEBUG -g2 -O2 -mtune=core2 -fPIC -pipe -c cpu.cpp
g++ -DNDEBUG -g2 -O2 -mtune=core2 -fPIC -pipe -c integer.cpp
...
First I tried to install Crypto++ from Brew and link with my application...
I don't use Brew, so I don't know how to set CXXFLAGS when using it. Hopefully one of the Homebrew folks will provide some information about it.
Maybe "Build and install Brew apps that are x86_64 instead of i386?" and "Using Homebrew with alternate GCC" will help.
It is also possible you are compiling on an x86_64 machine, and then trying to run it on an i386 machine. If that is the case, then it likely won't work.
You may be able to build a fat library with the following, and it may work on both machines. Notice the addition of -arch x86_64 -arch i386.
export CXXFLAGS="-DNDEBUG -g2 -O2 -DDISABLE_NATIVE_ARCH=1 -arch x86_64 -arch i386 -msse2 -msse3 -mssse3 -msse4.1"
make -j 4
You might also be interested in iOS (Command Line) on the Crypto++ wiki. It goes into some detail about fat binaries in the context of iOS. The same concepts apply to OS X.
If you encounter a compile error for -msse4.1 or -msse4.2, then you may need -msse4_1 or -msse4_2. Different compilers accept (or expect) slightly different syntax.
For comparison using Linux, below is the difference in CPU capabilities between a Core2 Duo and a 3rd gen Core i5. Notice the Core i5 has SSE4.2 and AVX, while the Core2 Duo does not. AVX makes a heck of a difference, and compilers aggressively use the instruction set.
On OS X, you want to run sysctl machdep.cpu.features. I showed the one for my old MacBook from early 2010.
Core i5:
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
...
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc
rep_good nopl xtopology nonstop_tsc cpuid pni pclmulqdq ssse3 cx16 sse4_1
sse4_2 x2apic popcnt aes xsave avx rdrand hypervisor lahf_lm
Core2 Duo:
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Duo CPU T6500 @ 2.10GHz
...
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64
monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm
Core2 Duo (MacBook):
$ sudo sysctl machdep.cpu.features
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE
MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 DTES64
MON DSCPL VMX SMX EST TM2 SSSE3 CX16 TPR PDCM SSE4.1

Illegal instruction - vcvtsi2sd

I am writing a program to compute Groebner bases using the library FGB. While it has a C interface, I am calling the library from C++ code compiled with g++ on Ubuntu.
Compiling with the option -g and using x/i $pc in gdb, the illegal instruction is as follows.
0x421c39 <FGb_xmalloc_spec+985>: vcvtsi2sd %rbx,%xmm0,%xmm0
As far as I can tell, my processor does not support this instruction, and I am trying to figure out why the program uses it. It looks to me like the instruction comes from the library code. However, the code I am compiling used to work on the desktop it is now failing on; one day it just started throwing the illegal instruction. I assumed I had screwed up some libraries or something, so I reinstalled Ubuntu 16.04, but I continue to get the illegal instruction. The exact same code does work on another desktop and a Chromebook, running Ubuntu 16.04 and 14.04 respectively.
Technical information:
g++: 5.4.0 20160609
gdb: 7.11.1
Ubuntu: 16.04/14.04 LTS
Processor: x86info output
Found 4 identical CPUs
Extended Family: 0 Extended Model: 1 Family: 6 Model: 23 Stepping: 10
Type: 0 (Original OEM)
CPU Model (x86info's best guess): Core 2 Duo
Processor name string (BIOS programmed): Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz
cpu flags
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm tpr_shadow vnmi flexpriority dtherm
Compile line
g++ -std=c++11 -g -I src -o bin/main.o -c src/main.cpp
g++ -std=c++11 -g -I src -o bin/Polynomial.o -c src/Polynomial.cpp
g++ -std=c++11 -g -I src -o bin/Util.o -c src/Util.cpp
g++ -std=c++11 -g -I src -o bin/Solve.o -c src/Solve.cpp
g++ -std=c++11 -g -o bin/StartUp bin/main.o bin/Util.o bin/Polynomial.o bin/Solve.o -Llib -lfgb -lfgbexp -lgb -lgbexp -lminpoly -lminpolyvgf -lgmp -lm -fopenmp
At this point, I am not sure what further things I can try to avoid this illegal instruction and welcome any and all suggestions.

compiling error while using sse4.2 function on intel machine

I am trying to use the intrinsic function _mm_crc32_u32 on my machine with an Intel Xeon(R) CPU E5-2650 v2.
I compile the project with the SSE4.2 flag enabled (inside the makefile):
CCFLAGS += -msse4.2
but I still get the error:
nmmintrin.h:31:3: error: #error "SSE4.2 instruction set not enabled"
Any ideas why this might still happen?
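One way to narrow this down is to guard the intrinsic with the same macro the header checks, so the build tells you whether -msse4.2 ever reached this file. My guess, and it is only a guess, is that the makefile uses a different variable for C++ compiles (for example CXXFLAGS), so the flag never appears on the compile line. The sketch below uses _mm_crc32_u8 rather than _mm_crc32_u32 to keep the loop simple; the function name crc32c is made up, and the fallback is a plain bitwise CRC-32C.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__SSE4_2__)
#include <nmmintrin.h>  // _mm_crc32_u8; only included when SSE4.2 is enabled
#endif

// Hypothetical helper: CRC-32C (Castagnoli) over a byte buffer. Uses the
// SSE4.2 crc32 instruction when the compiler has it enabled, otherwise a
// bitwise software loop with the reflected polynomial 0x82F63B78.
std::uint32_t crc32c(const unsigned char* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;  // standard initial value
#if defined(__SSE4_2__)
    for (std::size_t i = 0; i < len; ++i) {
        crc = _mm_crc32_u8(crc, data[i]);
    }
#else
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; XOR in the polynomial if the dropped bit was set.
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
#endif
    return crc ^ 0xFFFFFFFFu;  // standard final inversion
}
```

Both paths should give the same result, so comparing a build with and without -msse4.2 is also a quick sanity check that the flag is being picked up.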

C++ eigen3 linear algebra library, odd performance results

I've been using eigen3 linear algebra library in c++ for a while, and I've always tried to take advantage of the vectorization performance benefits. Today, I've decided to test how much vectorization really speeds my programs up. So, I've written the following test program:
--- eigentest.cpp ---
#include <eigen3/Eigen/Dense>
using namespace Eigen;
#include <iostream>
int main() {
  Matrix4d accumulator=Matrix4d::Zero();
  Matrix4d randMat = Matrix4d::Random();
  Matrix4d constMat = Matrix4d::Constant(2);
  for(int i=0; i<1000000; i++) {
    randMat+=constMat;
    accumulator+=randMat*randMat;
  }
  std::cout<<accumulator(0,0)<<"\n"; // To avoid optimizing everything away
  return 0;
}
Then I've run this program after compiling it with different compiler options: (The results aren't one-time, many runs give similar results)
$ g++ eigentest.cpp -o eigentest -DNDEBUG -std=c++0x -march=native
$ time ./eigentest
5.33334e+18
real 0m4.409s
user 0m4.404s
sys 0m0.000s
$ g++ eigentest.cpp -o eigentest -DNDEBUG -std=c++0x
$ time ./eigentest
5.33334e+18
real 0m4.085s
user 0m4.040s
sys 0m0.000s
$ g++ eigentest.cpp -o eigentest -DNDEBUG -std=c++0x -march=native -O3
$ time ./eigentest
5.33334e+18
real 0m0.147s
user 0m0.136s
sys 0m0.000s
$ g++ eigentest.cpp -o eigentest -DNDEBUG -std=c++0x -O3
$ time ./eigentest
5.33334e+18
real 0m0.025s
user 0m0.024s
sys 0m0.000s
And here's my relevant cpu information:
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 5600+
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow extd_apicid pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy 3dn
I know that there's no vectorization going on when I don't use the compiler option -march=native, because without it I never get a segmentation fault or a wrong result due to vectorization, as opposed to when I use it (with -DNDEBUG).
These results lead me to believe that, at least on my CPU, vectorization with eigen3 results in slower execution. Who should I blame? My CPU, eigen3 or gcc?
Edit: To remove any doubts, I've now tried to add the -DEIGEN_DONT_ALIGN compiler option in cases where I'm trying to measure the performance of the no-vectorization case, and the results are the same. Furthermore, when I add -DEIGEN_DONT_ALIGN along with -march=native the results become very close to the case without -march=native.
It seems that the compiler is smarter than you think and still optimizes a lot of stuff away.
On my platform, I get about 9ms without -march=native and about 39ms with -march=native. However, if I replace the line above the return by
std::cout<<accumulator<<"\n";
then the timings change to 78ms without -march=native and about 39ms with -march=native.
Thus, it seems that without vectorization, the compiler realizes that you only use the (0,0) element of the matrix and so it only computes that element. However, it can't do that optimization if vectorization is enabled.
If you output the whole matrix, thus forcing the compiler to compute all the entries, then vectorization speeds up the program by a factor of 2, as expected (though I'm surprised to see that it is exactly a factor of 2 in my timings).
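To show the same point without depending on Eigen, here is a plain C++ stand-in for the benchmark loop. It is only a sketch: Mat4, accumulate_squares and checksum are made-up names, and the matrix starts at zero rather than Matrix4d::Random() so the result is reproducible. The idea is that a benchmark should consume every entry, for example through a checksum, so the compiler cannot legally skip the other 15.

```cpp
#include <array>

using Mat4 = std::array<std::array<double, 4>, 4>;

// Hypothetical stand-in for the Eigen loop: each iteration adds 2 to every
// entry of m, then adds m*m to the accumulator.
Mat4 accumulate_squares(int iterations) {
    Mat4 m{};    // zero start (assumption; the question used a random matrix)
    Mat4 acc{};
    for (int it = 0; it < iterations; ++it) {
        for (auto& row : m) {
            for (double& x : row) {
                x += 2.0;  // corresponds to randMat += constMat
            }
        }
        for (int i = 0; i < 4; ++i) {
            for (int j = 0; j < 4; ++j) {
                double prod = 0.0;
                for (int k = 0; k < 4; ++k) {
                    prod += m[i][k] * m[k][j];
                }
                acc[i][j] += prod;  // corresponds to accumulator += randMat*randMat
            }
        }
    }
    return acc;
}

// Sums all 16 entries, so every one of them is observable in the output.
double checksum(const Mat4& a) {
    double sum = 0.0;
    for (const auto& row : a) {
        for (double x : row) {
            sum += x;
        }
    }
    return sum;
}
```

Printing checksum(accumulate_squares(n)) instead of a single element should make timing comparisons between flag sets fairer, though I have not measured this version.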